## Naive Bayes

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

$$ P(Class | Sample) = \frac{P(Sample | Class) \times P(Class)}{P(Sample)} $$

$$posterior = \frac{likelihood \times prior}{evidence}$$

$$P(c_i | x_j) \propto P(x_j|c_i) \times P(c_i)$$

For categorical features, we can use a multinoulli (categorical) distribution, where $\mu_{jc}$ is a histogram over the possible values of $x_j$ in class $c$. Assuming the $D$ features are conditionally independent given the class, the likelihood factorises as:

$$P(\textbf{x} | c) = \prod_{j=1}^{D} Cat (x_j | \mu_{jc})$$


**During Training**:
 - Compute the prior probability $p(c_i)$, i.e. the proportion of training samples that belong to each class.
 - For each feature:
  - If the feature is categorical, compute $p(x_j | c_i)$ for $j=1,2 \ldots D$ and $i=1,2 \ldots C$:
       - For each possible value of this feature, compute the fraction of the training samples of class $c_i$ in which that value appears.

**To Predict**:
 - Compute $p(c_i | \textbf{x})$, up to the constant evidence term, for each class:
  - Start from the prior of the class, $p(c_i)$.
  - For each feature $k$:
      - If categorical, multiply by the probability calculated during training, $p(x_k | c_i)$, where $x_k$ is the value of the input on feature $k$.
  - Return the class $c_i$ with the highest $p(c_i | \textbf{x})$.
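
The training and prediction steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the solution expected by the exercises below: the helper names `train_nb` and `predict_nb` and the four-row toy data set are hypothetical, and no smoothing is applied, so an unseen value contributes a probability of zero.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate priors p(c) and per-feature likelihoods p(x_j | c) by counting."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # counts[c][j][v] = number of class-c samples whose feature j equals v
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[c][j][v] += 1
    likelihoods = {
        c: {j: {v: k / class_counts[c] for v, k in vals.items()}
            for j, vals in feats.items()}
        for c, feats in counts.items()
    }
    return priors, likelihoods

def predict_nb(priors, likelihoods, row):
    """Return the class with the largest unnormalised posterior, plus all scores."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(row):
            # A value never seen with class c gets probability 0 (no smoothing).
            score *= likelihoods[c][j].get(v, 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Toy data (hypothetical, not taken from PlayTennis.csv): (Outlook, Wind)
rows = [("Sunny", "Weak"), ("Sunny", "Strong"), ("Rain", "Weak"), ("Overcast", "Weak")]
labels = ["No", "No", "Yes", "Yes"]

priors, likelihoods = train_nb(rows, labels)
label, scores = predict_nb(priors, likelihoods, ("Rain", "Weak"))
print(label, scores)
```

The scores are unnormalised because the evidence term $P(Sample)$ is the same for every class and does not change which class wins.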


### Q1: 5 Marks

In [2]:
# 5 marks. 2.5 Marks X 2

df = pd.read_csv("PlayTennis.csv")
msk = np.random.rand(len(df)) < 0.8
train = #write your code here
test = #write your code here
train.head()

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Play Tennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
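
One possible way to use the boolean mask `msk` is to index the frame with it for the training rows and with its negation for the test rows. The sketch below is an assumption about what the exercise intends. The small frame stands in for `PlayTennis.csv`, and a seeded generator is used instead of `np.random.rand` only so that the split is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for PlayTennis.csv so the snippet runs on its own.
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain",
                "Rain", "Overcast", "Sunny", "Sunny", "Rain"],
    "Play Tennis": ["No", "No", "Yes", "Yes", "Yes",
                    "No", "Yes", "No", "Yes", "Yes"],
})

rng = np.random.default_rng(0)       # seeded so the split is reproducible
msk = rng.random(len(df)) < 0.8      # True for roughly 80% of the rows

train = df[msk]                      # rows where the mask is True
test = df[~msk]                      # the remaining rows
print(len(train), len(test))
```

Because the mask is random, the split is only approximately 80/20, and on a data set this small the test set can contain very few rows.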


In [4]:
# 10 Marks.
# create the dict data structure
_class = 'Play Tennis'
di = {}
for class_label in train[_class].unique():
    # write your code to create the dictionary as shown in the next cell

In [5]:
di

{'No': {'Outlook': {'Sunny': 0, 'Overcast': 0, 'Rain': 0},
  'Temperature': {'Hot': 0, 'Cool': 0, 'Mild': 0},
  'Humidity': {'High': 0, 'Normal': 0},
  'Wind': {'Weak': 0, 'Strong': 0}},
 'Yes': {'Outlook': {'Sunny': 0, 'Overcast': 0, 'Rain': 0},
  'Temperature': {'Hot': 0, 'Cool': 0, 'Mild': 0},
  'Humidity': {'High': 0, 'Normal': 0},
  'Wind': {'Weak': 0, 'Strong': 0}}}

### Q2: 10 Marks

In [6]:
# 10 Marks. 5 marks X 2
# The expected output can be seen in the next cell.
classLabel = 'Yes'

for classLabel in ['Yes', 'No']:
    for feature in train.columns:
        if feature != _class:
            for item in train[feature].unique():
                numr = # write your code here
                denr = # write your code here
                di[classLabel][feature][item] = numr/denr

In [7]:
di

{'No': {'Outlook': {'Sunny': 0.6666666666666666,
   'Overcast': 0.0,
   'Rain': 0.3333333333333333},
  'Temperature': {'Hot': 0.6666666666666666,
   'Cool': 0.3333333333333333,
   'Mild': 0.0},
  'Humidity': {'High': 0.6666666666666666, 'Normal': 0.3333333333333333},
  'Wind': {'Weak': 0.3333333333333333, 'Strong': 0.6666666666666666}},
 'Yes': {'Outlook': {'Sunny': 0.2857142857142857,
   'Overcast': 0.5714285714285714,
   'Rain': 0.14285714285714285},
  'Temperature': {'Hot': 0.2857142857142857,
   'Cool': 0.42857142857142855,
   'Mild': 0.2857142857142857},
  'Humidity': {'High': 0.2857142857142857, 'Normal': 0.7142857142857143},
  'Wind': {'Weak': 0.5714285714285714, 'Strong': 0.42857142857142855}}}
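
Each entry in this dictionary is a relative frequency: the number of training rows of a class that take a given value, divided by the number of training rows of that class. The sketch below shows one way to compute it with pandas on a hypothetical four-row frame. The variable names `numr` and `denr` follow the cell above; the data does not come from the original file.

```python
import pandas as pd

# Hypothetical toy training frame; the real one comes from PlayTennis.csv.
train = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain"],
    "Wind": ["Weak", "Strong", "Weak", "Weak"],
    "Play Tennis": ["No", "No", "Yes", "Yes"],
})

_class = "Play Tennis"
features = [col for col in train.columns if col != _class]
di = {c: {f: {} for f in features} for c in train[_class].unique()}

for class_label in train[_class].unique():
    in_class = train[train[_class] == class_label]   # rows of this class only
    denr = len(in_class)                             # class size
    for feature in features:
        for item in train[feature].unique():
            # Rows of this class whose feature takes the value `item`.
            numr = len(in_class[in_class[feature] == item])
            di[class_label][feature][item] = numr / denr

print(di)
```

For every class and feature the estimated probabilities sum to 1, which is a quick sanity check on the counts.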

### Q3: 5 Marks

In [8]:
# Code to Mimic Testing
P_Yes = # write your code here
P_No = # write your code here
print(P_Yes, P_No)

0.7 0.3
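
The class priors are the fractions of training rows carrying each label. One way to obtain them, assuming the labels live in a pandas Series, is to take the mean of a boolean comparison. The ten labels below are hypothetical and were chosen only to give the same 7:3 split as the printed output.

```python
import pandas as pd

# Hypothetical label column with 7 "Yes" and 3 "No" rows.
labels = pd.Series(["Yes"] * 7 + ["No"] * 3, name="Play Tennis")

# The mean of a boolean Series is the fraction of True entries.
P_Yes = float((labels == "Yes").mean())
P_No = float((labels == "No").mean())
print(P_Yes, P_No)
```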


In [9]:
yes = []
no = []

for Label in ['Yes', 'No']:
    for index, row in test.iterrows():
        value = P_Yes if Label == 'Yes' else P_No
        for feature in test.columns:
            if feature != _class:
                value*=di[Label][feature][row[feature]]
        if Label == 'Yes':
            yes.append(value)
        else:
            no.append(value)

In [10]:
yes

[0.00466472303206997,
 0.00932944606413994,
 0.011661807580174925,
 0.0034985422740524776]

In [11]:
no

[0.0, 0.0, 0.0, 0.0]
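
To make the arithmetic behind these scores concrete, here is a worked example for a hypothetical sample (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong). It is not one of the test rows. It uses the conditional probabilities and priors printed above, written as fractions:

$$P(Yes) \prod_{j} p(x_j | Yes) = 0.7 \times \frac{2}{7} \times \frac{3}{7} \times \frac{2}{7} \times \frac{3}{7} = \frac{0.7 \times 36}{2401} \approx 0.0105$$

$$P(No) \prod_{j} p(x_j | No) = 0.3 \times \frac{2}{3} \times \frac{1}{3} \times \frac{2}{3} \times \frac{2}{3} = \frac{0.3 \times 8}{81} \approx 0.0296$$

For this sample the `No` score is larger, so the classifier would predict `No`. A single zero factor, such as $p(Overcast | No) = 0$ or $p(Mild | No) = 0$ in the table above, makes the whole product zero, which is presumably why every entry of `no` is `0.0` for the test rows here.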

In [12]:
print("Predicted Labels")
predicted_labels = []
for item in zip(yes, no):
    predicted_labels.append("Yes" if item[0]>item[1] else "No")
    #print("Yes") if item[0]>item[1] else print("No")
predicted_labels

Predicted Labels


['Yes', 'Yes', 'Yes', 'Yes']

In [13]:
print("True Labels")
list(test[_class])

True Labels


['Yes', 'No', 'Yes', 'No']

### Q4: 5 Marks

In [14]:
# Accuracy
# write your code to find out the accuracy between predicted and true labels

0.5
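
Accuracy is the fraction of predictions that match the true labels. A minimal sketch, using the two label lists printed in the cells above and plain Python rather than any particular library:

```python
predicted_labels = ["Yes", "Yes", "Yes", "Yes"]
true_labels = ["Yes", "No", "Yes", "No"]

# Count positions where prediction and truth agree, then normalise.
matches = sum(p == t for p, t in zip(predicted_labels, true_labels))
accuracy = matches / len(true_labels)
print(accuracy)  # 0.5
```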

### Let's compare our result with sklearn

In [15]:
## with sklearn
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2

In [16]:
le = preprocessing.LabelEncoder()
data_train_df = pd.DataFrame(train)
data_train_df_encoded = data_train_df.apply(le.fit_transform)

data_test_df = pd.DataFrame(test)
data_test_df_encoded = data_test_df.apply(le.fit_transform)

### Q5: 10 Marks

In [17]:
x_train = # write your code here to drop the "Play Tennis" column
y_train = # write your code here to select the "Play Tennis" column

x_test = data_test_df_encoded.drop(['Play Tennis'],axis=1)
y_test = data_test_df_encoded['Play Tennis']

### Q6: 5 Marks

In [18]:
model = # write your code here
nbtrain = model.fit(x_train, y_train)

y_pred = nbtrain.predict(x_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.5
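
The sklearn workflow above (encode, separate features from the target, fit, predict) can be sketched end to end on a small frame. The data below is hypothetical and the model is evaluated on its own training rows purely to keep the example short, so the printed accuracy says nothing about the original experiment.

```python
import pandas as pd
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the encoded PlayTennis data.
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Strong", "Strong"],
    "Play Tennis": ["No", "No", "Yes", "Yes", "No", "Yes"],
})

# Encode every column to integer codes, one fresh encoder per column.
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))

x_train = encoded.drop(["Play Tennis"], axis=1)   # feature columns only
y_train = encoded["Play Tennis"]                  # target column

model = GaussianNB()
model.fit(x_train, y_train)

y_pred = model.predict(x_train)
print("Accuracy:", metrics.accuracy_score(y_train, y_pred))
```

Two caveats are worth keeping in mind. `LabelEncoder` assigns codes from the sorted set of values it sees, so fitting it separately on the training and test frames yields matching codes only when both frames contain the same values in every column. Also, `GaussianNB` treats the integer codes as continuous measurements, so its result need not agree with the categorical model built by hand earlier.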
