# Naive Bayes algorithm

Naive Bayes is a classification technique based on Bayes' theorem (from probability theory). It assumes that all the features used to predict the target value are independent of each other given the class. In simple terms, a Naive Bayes classifier assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature.


This assumption is naive for real-world data, where features usually do depend on each other in determining the target. This is how the algorithm gets the name Naive Bayes.


A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform surprisingly well when compared with more sophisticated classification methods.


## Bayes Theorem
The underlying principle behind the Naive Bayes algorithm is Bayes' theorem.

Bayes' theorem states that:

#### P(A|B) = P(B|A) * P(A)/ P(B)
If X denotes the input variables (the features) and y the output variable (the class), we can rewrite the above equation as:

#### P(y|X) = P(X|y) * P(y)/ P(X)
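As a quick numeric check of the formula, here is a minimal sketch that evaluates Bayes' theorem with exact fractions. The function name `bayes_posterior` and the three input probabilities are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence == 0:
        raise ValueError("P(B) must be non-zero")
    return likelihood * prior / evidence


# Hypothetical values, not taken from any dataset
p_b_given_a = Fraction(9, 10)  # P(B|A)
p_a = Fraction(2, 10)          # P(A)
p_b = Fraction(3, 10)          # P(B)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print(posterior)  # 3/5
```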
The "naive" part of the algorithm is the assumption that the features are conditionally independent given the class.

That is, for a given class (y), the value of one predictor (x1) is independent of the values of the other predictors (x2, x3, ...).

We can therefore rewrite P(X|y) as:

#### P(X|y) = P(x1|y) * P(x2|y) * ... * P(xn|y)

The denominator P(X) does not depend on y, so it is the same for every class. We can drop it and write a proportionality instead:

#### P(y|X) ∝ P(X|y) * P(y)
or, equivalently,

#### P(y|X) ∝ P(x1|y) * P(x2|y) * ... * P(xn|y) * P(y)

This is the basic idea of the Naive Bayes algorithm.
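The decision rule that follows from this proportionality can be sketched as below: compute the product of the prior and the per-feature likelihoods for each class, then pick the class with the largest product. The helper names (`naive_bayes_score`, `predict`), the class labels, and all the numbers are hypothetical and serve only to illustrate the rule.

```python
import math


def naive_bayes_score(prior, likelihoods):
    """Return P(y) * P(x1|y) * ... * P(xn|y), which is proportional to P(y|X)."""
    return prior * math.prod(likelihoods)


def predict(class_models):
    """Pick the class with the largest score.

    class_models maps a label to (prior, [per-feature likelihoods]).
    """
    scores = {
        label: naive_bayes_score(prior, likelihoods)
        for label, (prior, likelihoods) in class_models.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical priors and likelihoods for two classes and two features
class_models = {
    "a": (0.4, [0.8, 0.3]),
    "b": (0.6, [0.1, 0.5]),
}

label, scores = predict(class_models)
print(label, scores)
```

Because P(X) is dropped, the scores do not sum to one. Only their relative order matters for the prediction.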

## Import Pandas library to read the data

```python
# Use pandas to load the dataset
import pandas as pd
```

## Read the dataset:

Now we read the dataset into a variable named `data` (a pandas DataFrame), using the pandas library that we imported above as `pd`.

The dataset used here is the [Tennis Weather Dataset](https://www.kaggle.com/pranavpandey2511/naive-bayes-classifier-from-scratch) from Kaggle.

```python
# Load the data from the CSV file and show it
data = pd.read_csv('https://storage.googleapis.com/kagglesdsdata/datasets/58414/113400/tennis.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20201024%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20201024T041943Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=a5813f549b6f92e5c5c5f58074008ea830e0abf3aa450ea8507a9517d14862d6f1e621e7339f4c5b3871a7c2b7a97be822cef5e3543a8aa3e9a7d8e9f1925ae64ea7a803637fc820635662dfa2c189894ddd8003779b47b08b452682663595ddca8a1d0f8ba077a941bda7d7c29fe157dd700bf764fcea4a4e3258d707180c66988232638b73c8cbc4b573ac483e3e02958a42c16adb954285dc354a3312f761f19b16627132bbccfd06d4d5d55297c01af0d52b0c7e7c6cc02a5bb8dbc8593d789bab79ffa8c4a488c60ec86ad3089e7759b41e55c091111ba650a42abbde33639403fbe6721731804d0291823761b5ea51317f27fb42b9194a47f883ac4415')
data
```

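| Unnamed: 0 | outlook | temp | humidity | windy | play |
|---|---|---|---|---|---|
| 0 | sunny | hot | high | False | no |
| 1 | sunny | hot | high | True | no |
| 2 | overcast | hot | high | False | yes |
| 3 | rainy | mild | high | False | yes |
| 4 | rainy | cool | normal | False | yes |
| 5 | rainy | cool | normal | True | no |
| 6 | overcast | cool | normal | True | yes |
| 7 | sunny | mild | high | False | no |
| 8 | sunny | cool | normal | False | yes |
| 9 | rainy | mild | normal | False | yes |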
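The `Unnamed: 0` column in the output most likely comes from a row index that was saved into the CSV file. A minimal sketch of one way to avoid it is to pass `index_col=0` to `pd.read_csv`. The snippet assumes the file's first header cell is empty and uses an in-memory copy of the first three rows shown above, so it runs without the download link. A local copy of the file could be read the same way.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file: the first three rows displayed above,
# with an empty first header cell (an assumption about the file layout)
csv_text = """,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
"""

# index_col=0 tells pandas to use the first column as the row index
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(sample.columns.tolist())
```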


## Creating frequency table

We first create frequency tables. From these counts we can estimate each P(xi|y), and then use those values to compute P(y|X).

```python
# Create frequency tables from the data:
# the number of rows for each (feature value, play) pair

outlook = data.groupby(['outlook', 'play']).size()
temp = data.groupby(['temp', 'play']).size()
humidity = data.groupby(['humidity', 'play']).size()
windy = data.groupby(['windy', 'play']).size()
play = data.play.value_counts()
```

```python
# Display the frequency tables

print(temp)
print('------------------')
print(humidity)
print('------------------')
print(windy)
print('------------------')
print(outlook)
print('------------------')
print('play')
print(play)
```

```
temp  play
cool  no      1
      yes     3
hot   no      2
      yes     2
mild  no      2
      yes     4
dtype: int64
------------------
humidity  play
high      no      4
          yes     3
normal    no      1
          yes     6
dtype: int64
------------------
windy  play
False  no      2
       yes     6
True   no      3
       yes     3
dtype: int64
------------------
outlook   play
overcast  yes     4
rainy     no      2
          yes     3
sunny     no      3
          yes     2
dtype: int64
------------------
play
yes    9
no     5
Name: play, dtype: int64
```
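The counts above become conditional probabilities once each count is divided by the total for its class. The sketch below shows this for humidity only, with the counts copied from the printed tables into new variables (`humidity_counts`, `play_counts`, `cond_prob`) so that it runs on its own. These names are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Counts copied from the humidity and play tables printed above
humidity_counts = pd.Series(
    {
        ("high", "no"): 4,
        ("high", "yes"): 3,
        ("normal", "no"): 1,
        ("normal", "yes"): 6,
    }
)
play_counts = pd.Series({"yes": 9, "no": 5})

# Rows are humidity values, columns are play outcomes
count_table = humidity_counts.unstack()

# Divide each column by its class total to estimate P(humidity | play)
cond_prob = count_table.div(play_counts, axis="columns")
print(cond_prob)
```

Each column of `cond_prob` sums to one, since it is a distribution over humidity values for a fixed class.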


## Making predictions

We will now use the Naive Bayes algorithm to find the probability of playing tennis when the weather conditions are given.

For example, suppose we want the probability that you should play tennis under the following conditions:

- outlook: rainy
- temp: mild
- humidity: high
- windy: True

We calculate

P(y="yes"|X=[rainy, mild, high, True]) ∝ P(outlook="rainy"|y="yes") * P(temp="mild"|y="yes") * P(humidity="high"|y="yes") * P(windy=True|y="yes") * P(y="yes")

and the corresponding quantity for y="no". The prediction is the class with the larger of the two values.
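To make the arithmetic concrete, here is a small sketch that plugs the counts from the frequency tables into this product using exact fractions. The variable names `score_yes` and `score_no` are illustrative. Each factor is a count divided by the class total (9 for "yes", 5 for "no"), and the last factor is the class prior out of 14 rows.

```python
from fractions import Fraction

# P(rainy|yes) * P(mild|yes) * P(high|yes) * P(windy=True|yes) * P(yes)
score_yes = (
    Fraction(3, 9) * Fraction(4, 9) * Fraction(3, 9) * Fraction(3, 9) * Fraction(9, 14)
)

# P(rainy|no) * P(mild|no) * P(high|no) * P(windy=True|no) * P(no)
score_no = (
    Fraction(2, 5) * Fraction(2, 5) * Fraction(4, 5) * Fraction(3, 5) * Fraction(5, 14)
)

print(score_yes, float(score_yes))
print(score_no, float(score_no))
print("yes" if score_yes > score_no else "no")
```

The score for "no" is larger, so the prediction for these conditions is not to play.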

This is implemented in the code below.

```python
# Class counts and their total, used later to compute the prior P(y)

total_y = play["yes"]
total_n = play["no"]

play_total = total_y + total_n
```

```python
# Compute the Naive Bayes score P(x1|y) * ... * P(x4|y) * P(y) for one class.
# play_val selects the class: "yes" for playing, "no" for not playing.
# The result is proportional to P(y|X), not a normalized probability.

def find_prob(outlook_val, temp_val, humidity_val, windy_val, play_val):
    # .get(play_val, 0) gives a count of 0 for pairs that never occur in the
    # data, such as ('overcast', 'no'), instead of raising a KeyError
    p_outlook_play = outlook[outlook_val].get(play_val, 0) / play[play_val]
    p_temp_play = temp[temp_val].get(play_val, 0) / play[play_val]
    p_humidity_play = humidity[humidity_val].get(play_val, 0) / play[play_val]
    p_windy_play = windy[windy_val].get(play_val, 0) / play[play_val]
    p_play = play[play_val] / play_total

    prob = p_outlook_play * p_temp_play * p_humidity_play * p_windy_play * p_play
    return prob
```

```python
# Function to make predictions: compare the scores for "yes" and "no"

def pred_play(outlook_val, temp_val, humidity_val, windy_val):
    prob_yes = find_prob(outlook_val, temp_val, humidity_val, windy_val, "yes")
    prob_no = find_prob(outlook_val, temp_val, humidity_val, windy_val, "no")

    print("Probability that you should play Tennis: ", prob_yes)
    print("Probability that you should not play Tennis: ", prob_no)

    if prob_yes > prob_no:
        print("You should play Tennis today! :)")
    else:
        print("You should not play Tennis today! :(")
```

```python
# Weather conditions to make a prediction for

outlook_value = 'rainy'
temp_value = 'mild'
humidity_value = 'high'
windy_value = True
```

```python
# Make and display the predictions

pred_play(outlook_value, temp_value, humidity_value, windy_value)
```

```
Probability that you should play Tennis:  0.010582010582010581
Probability that you should not play Tennis:  0.027428571428571438
You should not play Tennis today! :(
```
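The two printed values are the products P(X|y) * P(y) with the denominator P(X) left out, so they do not sum to one. If actual posterior probabilities are wanted, one option is to divide each value by their sum. The helper `normalize_scores` below is a hypothetical addition, not part of the code above.

```python
def normalize_scores(score_yes, score_no):
    """Turn two unnormalized scores into probabilities that sum to one."""
    total = score_yes + score_no
    if total == 0:
        raise ValueError("both scores are zero, so the posterior is undefined")
    return score_yes / total, score_no / total


# The two scores printed above
p_yes, p_no = normalize_scores(0.010582010582010581, 0.027428571428571438)
print(round(p_yes, 4), round(p_no, 4))
```

Normalizing does not change which class has the larger value, so the prediction stays the same.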
