# Bayes Theorem

The underlying principle behind the Naive Bayes algorithm is the Bayes Theorem.

Bayes' theorem states that:

P(A|B) = P(B|A) * P(A) / P(B)

If X = (x1, x2, ..., x(n)) is the vector of input variables and y is the output variable, we can rewrite the above equation as:

P(y|X) = P(X|y) * P(y) / P(X)
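
As a quick numeric illustration of the formula, the sketch below plugs made-up probabilities into Bayes' theorem. The function name `bayes_posterior` and all three input values are assumptions chosen only for illustration.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical values: P(B|A) = 0.8, P(A) = 0.3, P(B) = 0.5
posterior = bayes_posterior(0.8, 0.3, 0.5)
print(posterior)  # 0.8 * 0.3 / 0.5, i.e. about 0.48
```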

The "naive" part of the algorithm is the assumption that the predictors are conditionally independent of one another given the class.

That is, once the class (y) is known, the value of one predictor (x1) tells us nothing more about the values of the other predictors (x2, x3, ...).

We can therefore rewrite P(X|y) as:

P(X|y) = P(x1|y) * P(x2|y) * ... * P(x(n)|y)

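We can drop the denominator P(X), since it does not depend on y and is therefore the same for every class, and write the result up to a constant factor:

P(y|X) = (const) * P(X|y) * P(y)

or, using the independence assumption,

P(y|X) = (const) * P(x1|y) * P(x2|y) * ... * P(x(n)|y) * P(y)

where (const) = 1 / P(X).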

This is the basic idea of the Naive Bayes algorithm.
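
The scoring rule above can be sketched in a few lines: multiply the class prior by the per-feature likelihoods, then pick the class with the largest score. The helper name `naive_bayes_score` and the probabilities are hypothetical and do not come from any dataset.

```python
from math import prod

def naive_bayes_score(prior, likelihoods):
    """Unnormalized posterior: P(y) * P(x1|y) * ... * P(x(n)|y)."""
    return prior * prod(likelihoods)

# Hypothetical priors and per-feature likelihoods for two classes
scores = {
    "yes": naive_bayes_score(0.6, [0.5, 0.4]),
    "no": naive_bayes_score(0.4, [0.2, 0.9]),
}

# The prediction is the class with the largest score
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Because the constant 1 / P(X) is the same for every class, leaving it out does not change which class wins.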

A basic dataset that illustrates Naive Bayes well:

In [14]:
import pandas as pd
data = pd.read_csv('weather.csv')
data

Unnamed: 0,outlook,temperature,humidity,windy,play
0,overcast,hot,high,False,yes
1,overcast,cool,normal,True,yes
2,overcast,mild,high,True,yes
3,overcast,hot,normal,False,yes
4,rainy,mild,high,False,yes
5,rainy,cool,normal,False,yes
6,rainy,cool,normal,True,no
7,rainy,mild,normal,False,yes
8,rainy,mild,high,True,no
9,sunny,hot,high,False,no


In [15]:
# Create a frequency table of each feature value against the play outcome

outlook = data.groupby(['outlook', 'play']).size()
temp = data.groupby(['temperature', 'play']).size()
humidity = data.groupby(['humidity', 'play']).size()
windy = data.groupby(['windy', 'play']).size()
play = data.play.value_counts()

In [16]:
# sample output for frequency table

print(temp)

temperature  play
cool         no      1
             yes     3
hot          no      2
             yes     2
mild         no      2
             yes     4
dtype: int64
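
To make the link between the frequency table and the likelihoods P(temperature|play) concrete, here is a minimal standard-library sketch. The counts are copied from the table printed above; the variable names are assumptions made for this example.

```python
# Counts copied from the temperature frequency table printed above
temp_counts = {
    ("cool", "no"): 1, ("cool", "yes"): 3,
    ("hot", "no"): 2, ("hot", "yes"): 2,
    ("mild", "no"): 2, ("mild", "yes"): 4,
}

# Total number of rows per class (play = "yes" / "no")
class_totals = {}
for (_, label), count in temp_counts.items():
    class_totals[label] = class_totals.get(label, 0) + count

# P(temperature = value | play = label) = count(value, label) / count(label)
temp_likelihood = {
    (value, label): count / class_totals[label]
    for (value, label), count in temp_counts.items()
}

print(class_totals)
print(temp_likelihood[("mild", "yes")])  # 4 / 9
```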


# Making predictions

We will now use the Naive Bayes algorithm to find the probability of playing tennis given the weather conditions.

For example, to calculate the probability that you should play tennis under the following conditions:

outlook- sunny

temperature- mild

humidity- high

windy- False

We will calculate,

P(y="yes"|X=[sunny, mild, high, False]) = (const) * P(outlook="sunny"|y="yes") * P(temp="mild"|y="yes") * P(humidity="high"|y="yes") * P(windy="False"|y="yes") * P(y="yes")

and the corresponding expression for y="no". The prediction is the class with the larger of the two values, P(y="yes"|X) or P(y="no"|X).

This is implemented in the code below.

In [7]:
# Count the "yes" and "no" outcomes; the total is used later for the prior P(y)

total_yes = play["yes"]
total_no = play["no"]

total_play = total_yes + total_no

In [8]:
total_play

14

In [9]:
outlook['sunny']['yes'] # example

2

In [17]:
def find_prob(outlook_val, temp_val, humidity_val, windy_val, play_val):
    # Likelihood of each feature value given the class: count(value, class) / count(class)
    p_outlook_play = outlook[outlook_val][play_val]/play[play_val]
    p_temp_play = temp[temp_val][play_val]/play[play_val]
    p_humidity_play = humidity[humidity_val][play_val]/play[play_val]
    p_windy_play = windy[windy_val][play_val]/play[play_val]
    # Prior probability of the class
    p_play = play[play_val]/total_play

    # Unnormalized posterior: product of the likelihoods and the prior
    prob = p_outlook_play * p_temp_play * p_humidity_play * p_windy_play * p_play
    return prob
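
One limitation of count-based lookups like these is that a (feature value, class) pair that never occurs in the data is simply absent from a `groupby(...).size()` result, so the lookup typically fails rather than returning zero. A common remedy, not part of the code above, is additive (Laplace) smoothing. The sketch below shows the idea; the helper name `smoothed_likelihood`, its parameters, and the counts are all hypothetical.

```python
def smoothed_likelihood(pair_count, class_count, n_values, alpha=1.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    pair_count  -- rows with this feature value and this class
    class_count -- rows with this class
    n_values    -- number of distinct values the feature can take
    alpha       -- smoothing strength (alpha=0 gives the plain ratio)
    """
    return (pair_count + alpha) / (class_count + alpha * n_values)

# Hypothetical counts: the value never appears among 5 rows of a class,
# and the feature has 3 possible values
unseen = smoothed_likelihood(0, 5, 3)        # (0 + 1) / (5 + 3)
plain = smoothed_likelihood(2, 5, 3, 0.0)    # plain ratio 2 / 5
print(unseen, plain)
```

With smoothing, an unseen pair contributes a small nonzero likelihood, so a single missing combination no longer breaks the whole product or forces it to zero.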

Now we will make predictions depending on the output with the highest probability.

That is, if P(y="yes"|X) > P(y="no"|X), then the prediction would be to play tennis and vice-versa.

In [18]:
def pred_play(outlook_val, temp_val, humidity_val, windy_val):
    prob_yes = find_prob(outlook_val, temp_val, humidity_val, windy_val, "yes")
    prob_no = find_prob(outlook_val, temp_val, humidity_val, windy_val, "no")

    print("Probability that you should play Tennis: ", prob_yes)
    print("Probability that you should not play Tennis: ", prob_no)

    if prob_yes > prob_no:
        print("You should play Tennis today! :)")
  
    else:
        print("You should not play Tennis today! :(")

In [19]:

outlook_value = 'sunny' 
temp_value = 'mild' 
humidity_value = 'high' 
windy_value = False 

In [20]:
# Make and display the predictions

pred_play(outlook_value, temp_value, humidity_value, windy_value)

Probability that you should play Tennis:  0.014109347442680773
Probability that you should not play Tennis:  0.02742857142857143
You should not play Tennis today! :(
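
The two printed values are unnormalized scores, since the constant 1 / P(X) was dropped, so they do not sum to 1. If actual posterior probabilities are wanted, one option is to divide each score by their sum, as in this minimal sketch (the variable names are assumptions):

```python
# Unnormalized scores printed by pred_play above
score_yes = 0.014109347442680773
score_no = 0.02742857142857143

# Dividing by the sum plays the role of the dropped denominator P(X)
total = score_yes + score_no
p_yes = score_yes / total
p_no = score_no / total

print(round(p_yes, 3), round(p_no, 3))  # roughly 0.34 and 0.66
```

The predicted class is unchanged by this step; only the scale of the numbers differs.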


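# Gaussian Probability Distribution

The weather features above are categorical, so their likelihoods come from counts. For continuous features, Gaussian Naive Bayes instead models each P(x(i)|y) with a normal distribution whose mean and standard deviation are estimated from the training rows of class y.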

In [21]:
import pandas as pd
data = pd.read_csv('haberman.csv')  # Haberman dataset (columns: age, year, nodes, status)

In [22]:
from math import sqrt
from math import pi
from math import exp
 
# Build a dictionary that maps each class value to the list of its rows
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1] # class value will be last column
        if (class_value not in separated):
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

In [23]:
# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers)/float(len(numbers))
 
# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
    return sqrt(variance)
 
# Calculate the mean, stdev and count for each column in a dataset
def mean_std(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1]) # drop the summary of the last column, which holds the class label
    return summaries
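
As a sanity check, the hand-written helpers can be compared against Python's `statistics` module. Dividing by `len(numbers)-1` gives the sample standard deviation, which is what `statistics.stdev` computes. The sample values below are hypothetical.

```python
import statistics
from math import sqrt

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    # Sample variance: divide by n - 1
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
print(mean(values), statistics.mean(values))
print(stdev(values), statistics.stdev(values))
```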
 

In [26]:
# Split dataset by class then calculate statistics for each column
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = mean_std(rows)
    return summaries

In [25]:
# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent
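
The value returned here is a probability density rather than a probability, so it can exceed 1 when the standard deviation is small. A quick way to check the formula is to compare it with `statistics.NormalDist` from the standard library; the parameter values in this sketch are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Density of a standard normal at its mean: 1 / sqrt(2 * pi)
peak = calculate_probability(0.0, 0.0, 1.0)

# Hypothetical parameters, compared against the standard library
ours = calculate_probability(1.5, 1.0, 2.0)
reference = NormalDist(mu=1.0, sigma=2.0).pdf(1.5)
print(peak, ours, reference)
```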
 

In [27]:
# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row): 
    total_rows = sum([summaries[label][0][2] for label in summaries]) # total number of training rows across all classes
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities
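
With many features, a product of small densities can underflow to zero. A common variant, not used in the code above, sums logarithms instead of multiplying. The sketch below assumes the same `{class: [(mean, stdev, count), ...]}` summary layout; the function names and the numbers are hypothetical.

```python
from math import log, pi

def log_gaussian(x, mean, stdev):
    """Log of the Gaussian density, computed without calling exp()."""
    return -((x - mean) ** 2) / (2 * stdev ** 2) - log(stdev) - 0.5 * log(2 * pi)

def class_log_scores(summaries, row):
    """Log of (prior * product of densities) for each class."""
    total_rows = sum(summaries[label][0][2] for label in summaries)
    scores = {}
    for class_value, class_summaries in summaries.items():
        # Start from the log prior, then add one log density per feature
        scores[class_value] = log(class_summaries[0][2] / float(total_rows))
        for i, (mean, stdev, _) in enumerate(class_summaries):
            scores[class_value] += log_gaussian(row[i], mean, stdev)
    return scores

# Hypothetical summaries for two classes and two features
summaries = {
    0: [(1.0, 0.5, 10), (5.0, 1.0, 10)],
    1: [(3.0, 0.5, 30), (8.0, 1.0, 30)],
}
scores = class_log_scores(summaries, [1.2, 5.5])
best = max(scores, key=scores.get)
print(scores, best)
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest product.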

In [28]:
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

In [29]:
import random
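def train_test_split(data, split, train_data=None, test_data=None):
    # Use None defaults: mutable default lists would be shared across calls
    train_data = [] if train_data is None else train_data
    test_data = [] if test_data is None else test_data
    for x in range(len(data)):
        # Each row goes to the training set with probability `split`
        if random.random() < split:
            train_data.append(data.loc[x])
        else:
            test_data.append(data.loc[x])
    return train_data, test_data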

In [30]:
split = 0.7  # approximate fraction of rows assigned to the training set
train_data = []
test_data  = []
train_test_split(data, split, train_data, test_data)
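
Because each row is assigned independently at random, the split sizes vary from run to run and the results are not reproducible. One alternative, shown here only as a sketch, is to shuffle with a seeded generator and cut at a fixed position. The helper name `shuffled_split`, the seed, and the stand-in rows are assumptions.

```python
import random

def shuffled_split(rows, train_fraction, seed=0):
    """Shuffle a copy of rows with a seeded generator and cut it once."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for dataset rows
train_rows, test_rows = shuffled_split(rows, 0.7, seed=42)
print(len(train_rows), len(test_rows))
```

The same seed always produces the same split, which makes accuracy figures comparable between runs.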

In [31]:
summarize = summarize_by_class(train_data)
predicted = list()
for row in test_data:
    output = predict(summarize, row)
    predicted.append(output)

In [32]:
def accuracy(test_data, predicted):
    correct = 0
    # Compare the label (last column) of every test row with its prediction
    for i in range(len(test_data)):
        if test_data[i][-1] == predicted[i]:
            correct += 1
    return correct / float(len(test_data)) * 100.0
 

In [33]:
acc = accuracy(test_data, predicted)  # separate name, so the accuracy() function is not overwritten
print('Accuracy: ' + repr(acc) + '%')

Accuracy: 75.82417582417582%


In [34]:
predicted[0]

1

In [35]:
test_data[0]

age       30
year      62
nodes      3
status     1
Name: 1, dtype: int64

For this row, the predicted status (1) matches the actual status (1).

# Naive Bayes using scikit-learn (Gaussian Distribution)

In [36]:
from sklearn.preprocessing import MinMaxScaler
col = data.columns[0:3]  # feature columns: age, year, nodes
scaler = MinMaxScaler()
# Rescale each feature column to the [0, 1] range
data[col] = scaler.fit_transform(data[col])

In [37]:
data.head()

Unnamed: 0,age,year,nodes,status
0,0.0,0.545455,0.019231,1
1,0.0,0.363636,0.057692,1
2,0.0,0.636364,0.0,1
3,0.018868,0.090909,0.038462,1
4,0.018868,0.636364,0.076923,1


In [38]:
X = data[col]
Y = data['status']

In [39]:
X.head()

Unnamed: 0,age,year,nodes
0,0.0,0.545455,0.019231
1,0.0,0.363636,0.057692
2,0.0,0.636364,0.0
3,0.018868,0.090909,0.038462
4,0.018868,0.636364,0.076923


In [76]:
Y.head()

0    1
1    1
2    1
3    1
4    1
Name: status, dtype: int64

In [40]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=1/3,random_state=42)

In [41]:
print(len(X_train))
print(len(X_test))
print(len(y_train))
print(len(y_test))

204
102
204
102


In [42]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
clf = GaussianNB()
clf.fit(X_train,y_train)
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred, normalize=True) * float(100)
print(acc)

74.50980392156863


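With scikit-learn, the same algorithm takes only a few lines of code, and its accuracy here (about 74.5%) is close to that of the from-scratch implementation (about 75.8%), although the two figures come from different train/test splits.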