## Naive Bayes for Machine Learning
Dataset: pima-indian-diabetes dataset

### Naive Bayes (Regular - Discrete)
### Naive Bayes (Gaussian - Continuous)
### Naive Bayes (Multinomial - Frequency)

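The feature space is not always about the occurrence or non-occurrence of an event. Which variant applies depends on how the features are measured:

 - Spam classifier where each word (for example "$" or "lottery") is recorded as present (1) or absent (0): regular (discrete) naive Bayes.
 - Spam classifier where each word is recorded as a count (0, 1, 2, 3, 4, ...): multinomial naive Bayes.
 - Iris flower classifier using the length and width of the petals and sepals: Gaussian naive Bayes, which summarises each feature by its mean (mu) and standard deviation (sigma).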
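
To make the distinction concrete, here is a minimal sketch of how an observation could be encoded for each variant. The vocabulary, the message, and the petal lengths are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical vocabulary and message, invented for this illustration
vocabulary = ["$", "lottery", "meeting"]
message = ["$", "$", "lottery", "$"]

# Multinomial features: how many times each word occurs
counts = np.array([message.count(word) for word in vocabulary])

# Regular (discrete) features: only whether each word occurs at all
present = (counts > 0).astype(int)

# Gaussian features: a continuous measurement, summarised by mu and sigma
petal_lengths = np.array([1.4, 1.3, 1.5, 1.7])  # made-up values in cm
mu, sigma = petal_lengths.mean(), petal_lengths.std(ddof=1)

print(counts)   # [3 1 0]
print(present)  # [1 1 0]
```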

### Dataset: https://raw.githubusercontent.com/sumitraju/data-science/main/naive_bayes_for_machine_learning/data/pima-indians-diabetes.data.csv

#### The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

 -  Pregnancies: Number of times pregnant <br>
 -  Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test <br>
 -  BloodPressure: Diastolic blood pressure (mm Hg) <br>
 -  SkinThickness: Triceps skin fold thickness (mm) <br>
 -  Insulin: 2-Hour serum insulin (mu U/ml) <br>
 -  BMI: Body mass index (weight in kg/(height in m)^2) <br>
 -  DiabetesPedigreeFunction: Diabetes pedigree function <br>
 -  Age: Age (years) <br>
 -  Outcome: Class variable (0 or 1) <br>

In [1]:
# import dependencies
import numpy as np
import pandas as pd

# other dependencies for displaying images in the notebook
from IPython.display import Image, HTML
%matplotlib  inline

In [2]:
# column holds the names of the columns (the CSV file has no header row)
# our data is stored in the dataframe: data

column = ["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age","Outcome"]
data = pd.read_csv('https://raw.githubusercontent.com/sumitraju/data-science/main/naive_bayes_for_machine_learning/data/pima-indians-diabetes.data.csv',names=column)

In [3]:
data.head()

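|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |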


## Bayesian formula

![Naive Bayes Formula](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bayes.PNG)

*Where*,
 - P(c|x) is the posterior probability of class c given the predictors (features) x.
 - P(c) is the prior probability of the class.
 - P(x|c) is the likelihood, which is the probability of the predictors given the class.
 - P(x) is the marginal probability of the predictors (also called the evidence).

### In a Bayes classifier, we calculate the posterior of every class for each observation. Then we classify the observation as the class with the largest posterior value. We have two classes of outcome, so we will calculate two posteriors: one for Outcome 1 and one for Outcome 0.
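
In case the formula image does not render, the rule can be written out as follows. The symbols c and x follow the definitions above; the hat over c, marking the predicted class, is shorthand introduced here.

$$
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}, \qquad \hat{c} = \arg\max_{c \in \{0, 1\}} P(c \mid x)
$$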

# Gaussian Naive Bayes Classifier

### The Outcome column has two classes, Outcome 1 and Outcome 0
### The naive Bayes posterior for each class is
![Outcome1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome1.PNG)

![Outcome0](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome0.PNG)

## Now break down the equation above:

 - P(Outcome1) is the prior probability, that is, the probability that an observation is "1". It is just the number of people with Outcome 1 in the dataset divided by the total number of people in the dataset. <br><br>

 - p(Pregnancies∣Outcome1) * p(Glucose∣Outcome1) * p(BloodPressure∣Outcome1)...  is the likelihood. Notice that we have unpacked the person's data, so there is now one term for every feature in the dataset. The "gaussian" and "naive" come from two assumptions present in this likelihood: <br><br>  
 
 -  Naive: each term in the likelihood is calculated on its own, which assumes that the features are independent of each other within a class. That is, Pregnancies is treated as independent of Glucose, BMI, and so on. This is obviously not true, and is a "naive" assumption, hence the name "naive Bayes". The Gaussian assumption concerns the shape of each term and is described in the likelihood step. <br><br>
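
Written as a formula, the naive assumption turns the likelihood into a product with one factor per feature. The index i, running over the eight predictors of this dataset, is notation introduced here for convenience.

$$
P(\text{Outcome} = 1 \mid x) \;\propto\; P(\text{Outcome} = 1) \prod_{i=1}^{8} p(x_i \mid \text{Outcome} = 1)
$$

The same expression with Outcome = 0 gives the numerator for the other class.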


 ##### ------------------------------------------------------------------------------------------------------------------------------------------------------------------
 
## Following the formula, our work is divided into 5 steps

1. Calculate Priors
2. Calculate Likelihood
3. Calculate Marginal Probability
4. Apply Bayes Classifier To New Data Point
5. Understand what has just happened

### 1. Calculate Priors

#### Priors can be either constants or probability distributions. In our example, the prior of each outcome is simply the proportion of patients in the dataset with that outcome.

In [4]:
# Number of patients of outcome 1
n_outcome1 = data['Outcome'][data['Outcome'] == 1].count()
n_outcome1

268

In [5]:
# Number of patients of outcome 0
n_outcome0 = data['Outcome'][data['Outcome'] == 0].count()
n_outcome0

500

In [6]:
# Total people
total_ppl = data['Outcome'].count()
total_ppl

768

In [7]:
# Number of people of outcome1 divided by the total people
P_outcome1 = n_outcome1/total_ppl
P_outcome1

0.3489583333333333

In [8]:
# Number of people of outcome0 divided by the total people
P_outcome0 = n_outcome0/total_ppl
P_outcome0

0.6510416666666666
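
The counting steps above can also be done in one call. The sketch below uses a stand-in Series with the same class counts (268 and 500) so that it runs on its own; in the notebook you would call the same method on `data['Outcome']`.

```python
import pandas as pd

# Stand-in for data['Outcome']: 268 ones and 500 zeros, matching the counts above
outcome = pd.Series([1] * 268 + [0] * 500, name="Outcome")

# normalize=True returns relative frequencies instead of raw counts
priors = outcome.value_counts(normalize=True)

print(priors[1])  # prior of Outcome 1
print(priors[0])  # prior of Outcome 0
```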

### 2. Calculate Likelihood

 - We assume that the values of each feature within a class (e.g. the Pregnancies of Outcome1, the Glucose of Outcome1) are normally (Gaussian) distributed. <br><br>
 -  This means that p(Pregnancies∣Outcome1) is calculated by plugging the class mean and variance into the probability density function of the normal distribution. <br><br>
 - As per the formula for the probability density function, the likelihood of one feature is

![bay1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bay1.png)
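
For reference, this is the usual normal density written out for one feature, assuming that is what the image shows. The subscripts are notation introduced here: the mean and variance of feature i are computed within class c.

$$
p(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{i,c}^{2}}} \exp\!\left(-\frac{(x_i - \mu_{i,c})^{2}}{2\sigma_{i,c}^{2}}\right)
$$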

In [9]:
# First calculate the means of the data for each outcome

# Group the data by outcome and calculate the mean of each feature
data_means = data.groupby('Outcome').mean()

# View the values
data_means

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


In [10]:
# Second, calculate the variance of the data for each outcome

# Group the data by outcome and calculate the variance of each feature
# (pandas returns the sample variance, ddof=1, by default)
data_variance = data.groupby('Outcome').var()

# View the values
data_variance

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 9.103403 | 683.362325 | 326.274693 | 221.710525 | 9774.345427 | 59.13387 | 0.089452 | 136.134168 |
| 1 | 13.99687 | 1020.139457 | 461.897968 | 312.572195 | 19234.673319 | 52.750693 | 0.138648 | 120.302588 |


 #### So now you have the means and variances of the data.
 Remember that this formula is for one feature:

![bay1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bay1.png)

##### We have to evaluate it for every feature, and for that we need the mean and variance of every feature in each class.

In [11]:
# Means for outcome1 for all features (rows are selected through data_means' own index)
outcome1_Pregnancies_mean = data_means['Pregnancies'][data_means.index == 1].values[0]
outcome1_Glucose_mean = data_means['Glucose'][data_means.index == 1].values[0]
outcome1_BloodPressure_mean = data_means['BloodPressure'][data_means.index == 1].values[0]
outcome1_SkinThickness_mean = data_means['SkinThickness'][data_means.index == 1].values[0]
outcome1_Insulin_mean = data_means['Insulin'][data_means.index == 1].values[0]
outcome1_BMI_mean = data_means['BMI'][data_means.index == 1].values[0]
outcome1_DiabetesPedigreeFunction_mean = data_means['DiabetesPedigreeFunction'][data_means.index == 1].values[0]
outcome1_Age_mean = data_means['Age'][data_means.index == 1].values[0]


# Variance for outcome1 for all features
outcome1_Pregnancies_variance = data_variance['Pregnancies'][data_variance.index == 1].values[0]
outcome1_Glucose_variance= data_variance['Glucose'][data_variance.index == 1].values[0]
outcome1_BloodPressure_variance = data_variance['BloodPressure'][data_variance.index == 1].values[0]
outcome1_SkinThickness_variance = data_variance['SkinThickness'][data_variance.index == 1].values[0]
outcome1_Insulin_variance = data_variance['Insulin'][data_variance.index == 1].values[0]
outcome1_BMI_variance = data_variance['BMI'][data_variance.index == 1].values[0]
outcome1_DiabetesPedigreeFunction_variance = data_variance['DiabetesPedigreeFunction'][data_variance.index == 1].values[0]
outcome1_Age_variance = data_variance['Age'][data_variance.index == 1].values[0]

# Means for outcome0 for all features (rows are selected through data_means' own index)
outcome0_Pregnancies_mean = data_means['Pregnancies'][data_means.index == 0].values[0]
outcome0_Glucose_mean = data_means['Glucose'][data_means.index == 0].values[0]
outcome0_BloodPressure_mean = data_means['BloodPressure'][data_means.index == 0].values[0]
outcome0_SkinThickness_mean = data_means['SkinThickness'][data_means.index == 0].values[0]
outcome0_Insulin_mean = data_means['Insulin'][data_means.index == 0].values[0]
outcome0_BMI_mean = data_means['BMI'][data_means.index == 0].values[0]
outcome0_DiabetesPedigreeFunction_mean = data_means['DiabetesPedigreeFunction'][data_means.index == 0].values[0]
outcome0_Age_mean = data_means['Age'][data_means.index == 0].values[0]

# Variance for outcome0 for all features
outcome0_Pregnancies_variance = data_variance['Pregnancies'][data_variance.index == 0].values[0]
outcome0_Glucose_variance = data_variance['Glucose'][data_variance.index == 0].values[0]
outcome0_BloodPressure_variance = data_variance['BloodPressure'][data_variance.index == 0].values[0]
outcome0_SkinThickness_variance = data_variance['SkinThickness'][data_variance.index == 0].values[0]
outcome0_Insulin_variance = data_variance['Insulin'][data_variance.index == 0].values[0]
outcome0_BMI_variance = data_variance['BMI'][data_variance.index == 0].values[0]
outcome0_DiabetesPedigreeFunction_variance = data_variance['DiabetesPedigreeFunction'][data_variance.index == 0].values[0]
outcome0_Age_variance = data_variance['Age'][data_variance.index == 0].values[0]
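
The 32 separate lookups above can also be written with label-based indexing. The sketch below uses a two-column stand-in for `data_means` (values copied from the table above) so that it runs on its own; `means_by_class` is a hypothetical name, not something the rest of the notebook relies on.

```python
import pandas as pd

# Stand-in for data_means with two of the eight features, values from the table above
data_means = pd.DataFrame(
    {"Pregnancies": [3.298, 4.865672], "Glucose": [109.98, 141.257463]},
    index=pd.Index([0, 1], name="Outcome"),
)

# .loc[row_label, column_label] returns a single value
outcome1_Glucose_mean = data_means.loc[1, "Glucose"]

# .loc[row_label] returns every feature mean of that class as a Series
means_by_class = {label: data_means.loc[label] for label in data_means.index}

print(outcome1_Glucose_mean)             # 141.257463
print(means_by_class[0]["Pregnancies"])  # 3.298
```

The same pattern applies to `data_variance`.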

### 3. Marginal probability

##### The marginal probability is probably one of the most confusing parts of Bayesian approaches. In some examples it is entirely possible to calculate it.
##### However, in many real-world cases, it is either extremely difficult or impossible to find the value of the marginal probability (explaining why is beyond the scope of this tutorial).
##### This is not as much of a problem for our classifier as you might think. Why? Because we don't care what the true posterior value is, we only care which class has the highest posterior value.
##### And because the marginal probability is the same for all classes, we can:
1) ignore the denominator <br><br>
2) calculate only the posterior's numerator for each class  <br><br>
3) pick the class with the largest numerator. That is, we make a prediction solely on the relative values of the posterior's numerators.<br><br>
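
For a classifier with a small, finite set of classes such as this one, the argument can be sketched in one line. The marginal is the sum of the numerators over the classes, so it is a positive number that does not depend on c, and dividing by it cannot change which class comes out on top:

$$
P(x) = \sum_{c} P(x \mid c)\, P(c) \quad\Longrightarrow\quad \arg\max_{c} \frac{P(x \mid c)\, P(c)}{P(x)} = \arg\max_{c} P(x \mid c)\, P(c)
$$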

### 4. Apply Bayes Classifier To New Data Point

In [12]:
# Create an empty dataframe for the person whose outcome we want to predict
person = pd.DataFrame()

# Create some feature values for this single row
person['Pregnancies'] = [7]
person['Glucose'] = [130]
person['BloodPressure'] = [86]
person['SkinThickness'] = [34]
person['Insulin'] = [0]
person['BMI'] = [33.5]
person['DiabetesPedigreeFunction'] = [0.564]
person['Age'] = [50]
# View the data 
person

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 130 | 86 | 34 | 0 | 33.5 | 0.564 | 50 |


In [13]:
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):

    # Input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p
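
As a quick sanity check, the hand-written density can be compared with `statistics.NormalDist` from the Python standard library. This is only a sketch: the function is repeated so the block runs on its own, and the inputs are the Glucose value of the new person with the Outcome 1 mean and variance copied from the tables above.

```python
import numpy as np
from statistics import NormalDist

def p_x_given_y(x, mean_y, variance_y):
    # Same normal density as the function above
    return 1 / np.sqrt(2 * np.pi * variance_y) * np.exp(-(x - mean_y) ** 2 / (2 * variance_y))

# Glucose of 130 against the Outcome 1 mean and variance from the tables above
x, mean_y, variance_y = 130, 141.257463, 1020.139457

ours = p_x_given_y(x, mean_y, variance_y)

# NormalDist expects the standard deviation, so pass the square root of the variance
reference = NormalDist(mu=mean_y, sigma=variance_y ** 0.5).pdf(x)

print(np.isclose(ours, reference))  # True
```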

### For now we are ignoring the marginal probability P(x), i.e. the denominator of the formula

Formula again for reference:
![Bayes](https://raw.githubusercontent.com/dearbharat/NBBayes/main/bayes.PNG)

In [14]:


# Where,
#      P(c|x) is the posterior probability of class c given the predictors (features).
#      P(c) is the prior probability of the class.
#      P(x|c) is the likelihood, which is the probability of the predictors given the class.
#      P(x) is the marginal probability of the predictors.

![Outcome1](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome1.PNG)

In [15]:
# For now we calculate only the numerator of the posterior and predict based on the numerator alone

# Numerator of the posterior probability if the unclassified observation is an Outcome1
d_out1 = P_outcome1 * \
p_x_given_y(person['Pregnancies'][0], outcome1_Pregnancies_mean, outcome1_Pregnancies_variance) * \
p_x_given_y(person['Glucose'][0], outcome1_Glucose_mean, outcome1_Glucose_variance) * \
p_x_given_y(person['BloodPressure'][0], outcome1_BloodPressure_mean, outcome1_BloodPressure_variance) * \
p_x_given_y(person['SkinThickness'][0], outcome1_SkinThickness_mean, outcome1_SkinThickness_variance) * \
p_x_given_y(person['Insulin'][0], outcome1_Insulin_mean, outcome1_Insulin_variance) * \
p_x_given_y(person['BMI'][0], outcome1_BMI_mean, outcome1_BMI_variance) * \
p_x_given_y(person['DiabetesPedigreeFunction'][0], outcome1_DiabetesPedigreeFunction_mean, outcome1_DiabetesPedigreeFunction_variance) *\
p_x_given_y(person['Age'][0], outcome1_Age_mean, outcome1_Age_variance) 

In [16]:
d_out1

2.2311606712297367e-13

![Outcome0](https://raw.githubusercontent.com/dearbharat/NBBayes/main/outcome0.PNG)

In [17]:
# Numerator of the posterior probability if the unclassified observation is an Outcome0
d_out2 = P_outcome0 * \
p_x_given_y(person['Pregnancies'][0], outcome0_Pregnancies_mean, outcome0_Pregnancies_variance) * \
p_x_given_y(person['Glucose'][0], outcome0_Glucose_mean, outcome0_Glucose_variance) * \
p_x_given_y(person['BloodPressure'][0], outcome0_BloodPressure_mean, outcome0_BloodPressure_variance) * \
p_x_given_y(person['SkinThickness'][0], outcome0_SkinThickness_mean, outcome0_SkinThickness_variance) * \
p_x_given_y(person['Insulin'][0], outcome0_Insulin_mean, outcome0_Insulin_variance) * \
p_x_given_y(person['BMI'][0], outcome0_BMI_mean, outcome0_BMI_variance) * \
p_x_given_y(person['DiabetesPedigreeFunction'][0], outcome0_DiabetesPedigreeFunction_mean, outcome0_DiabetesPedigreeFunction_variance) *\
p_x_given_y(person['Age'][0], outcome0_Age_mean, outcome0_Age_variance) 

In [18]:
d_out2

1.7904471741735335e-13

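### Comparing the two values, the numerator for Outcome 1 (about 2.23e-13) is larger than the numerator for Outcome 0 (about 1.79e-13), so the classifier assigns the new data point to Outcome 1. The two values are close, so this is a prediction, not a certainty.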
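
Because there are only two classes, the marginal we skipped is the sum of the two numerators, so the actual posteriors are easy to recover. The sketch below copies the two numerators from the outputs above; the variable names are introduced here for illustration.

```python
# Posterior numerators copied from the two outputs above
numerator_outcome1 = 2.2311606712297367e-13
numerator_outcome0 = 1.7904471741735335e-13

# With two classes, the marginal P(x) is the sum of the two numerators
marginal = numerator_outcome1 + numerator_outcome0

posterior_outcome1 = numerator_outcome1 / marginal
posterior_outcome0 = numerator_outcome0 / marginal

print(round(posterior_outcome1, 3))  # 0.555
print(round(posterior_outcome0, 3))  # 0.445
```

Under the model's assumptions, the new person is therefore only slightly more likely to be Outcome 1 than Outcome 0.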

<br><br><br><br>
This is just a very simple way of understanding naive Bayes from scratch.
This technique was adapted from here: https://chrisalbon.com/machine_learning/naive_bayes/naive_bayes_classifier_from_scratch/
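
One practical note: multiplying eight small densities already gives values around 1e-13, and with many more features the product can underflow to zero. A common workaround is to add logarithms instead of multiplying. The sketch below is one way to do that, not part of the original technique; `log_posterior_numerator` is a hypothetical helper, and the means, variances, and priors are copied (rounded) from the tables above.

```python
import numpy as np

def log_posterior_numerator(x, prior, means, variances):
    """Return log P(c) + sum_i log p(x_i | c) under the Gaussian assumption."""
    x, means, variances = (np.asarray(a, dtype=float) for a in (x, means, variances))
    log_densities = -0.5 * np.log(2 * np.pi * variances) - (x - means) ** 2 / (2 * variances)
    return np.log(prior) + log_densities.sum()

# The new person, in the same feature order as the tables above
person = [7, 130, 86, 34, 0, 33.5, 0.564, 50]

means1 = [4.865672, 141.257463, 70.824627, 22.164179, 100.335821, 35.142537, 0.5505, 37.067164]
vars1 = [13.99687, 1020.139457, 461.897968, 312.572195, 19234.673319, 52.750693, 0.138648, 120.302588]
means0 = [3.298, 109.98, 68.184, 19.664, 68.792, 30.3042, 0.429734, 31.19]
vars0 = [9.103403, 683.362325, 326.274693, 221.710525, 9774.345427, 59.13387, 0.089452, 136.134168]

score1 = log_posterior_numerator(person, 0.3489583333333333, means1, vars1)
score0 = log_posterior_numerator(person, 0.6510416666666666, means0, vars0)

# The class with the larger log score is the predicted class
predicted = 1 if score1 > score0 else 0
print(score1, score0, predicted)
```

Taking `np.exp` of each score should give back approximately the two numerators computed earlier, up to the rounding of the table values.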


# -------------------------------------------------------------------------------------------------------------


# Naive Bayes using scikit-learn

### scikit-learn provides several types of naive Bayes, including these three

 - Multinomial. 
 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
 
 - Bernoulli. 
 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
 
 - and finally Gaussian.
 http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
 
 ## A quick reminder: what we implemented above is Gaussian naive Bayes


In [19]:
# first look at the data we have
data.head()

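|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |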


In [20]:

X = data.iloc[:,0:-1] # X is the features in our dataset
y = data.iloc[:,-1]   # y is the Labels in our dataset

In [21]:
# split the dataset into training and test sets using scikit-learn
# the model is trained on the training set, and the test set is then used to measure its accuracy

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) 

In [22]:
# now prepare our model: Gaussian naive Bayes

from sklearn.naive_bayes import GaussianNB

model = GaussianNB().fit(X_train, y_train)  # fit the model on the training data
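
A fitted `GaussianNB` stores the same quantities we computed by hand: the class priors and the per-class feature means. The sketch below shows where to find them, using small synthetic data so that it runs on its own; `X_demo`, `y_demo`, and `demo_model` are names made up for this illustration. The per-class variances are stored too (as `var_` in recent scikit-learn releases, `sigma_` in older ones). Note that scikit-learn uses the population variance plus a small smoothing term, so its variances will not match the pandas `.var()` values exactly.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-feature data, generated only for this illustration
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),  # 50 rows of class 0
    rng.normal(3.0, 1.0, size=(30, 2)),  # 30 rows of class 1
])
y_demo = np.array([0] * 50 + [1] * 30)

demo_model = GaussianNB().fit(X_demo, y_demo)

print(demo_model.classes_)      # class labels: [0 1]
print(demo_model.class_prior_)  # class frequencies: [0.625 0.375]
print(demo_model.theta_)        # per-class feature means, shape (2, 2)
```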

In [23]:
predicted_y = model.predict(X_test)  # predict the labels of the test dataset

In [24]:
from sklearn.metrics import accuracy_score

# compare the predicted values with y_test to measure how accurate the model is
# (the result gets its own name so that the accuracy_score function is not overwritten)
accuracy = accuracy_score(y_test, predicted_y)
print(accuracy)

0.7362204724409449


## wow!! 

### We got about 74% accuracy. This means the model predicted the correct outcome for about 74% of the patients in the test set.
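
To see what the accuracy number means, it helps to compute it by hand and compare it with a baseline that always predicts the most frequent class. The labels and predictions below are made up for illustration; they are not taken from the test set.

```python
import numpy as np

# Made-up labels and predictions, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Accuracy is the fraction of predictions that match the true labels
accuracy = (y_true == y_pred).mean()

# Baseline: always predict the most frequent class in y_true
majority_class = np.bincount(y_true).argmax()
baseline = (y_true == majority_class).mean()

print(accuracy)  # 0.75
print(baseline)  # 0.625
```

For context, about 65% of the patients in the full dataset are Outcome 0, so a model on this data should be judged against roughly that baseline and not against 50%.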

## Now I will test the model on the new data point. Remember, the from-scratch model above concluded that the new data point is of Outcome 1

In [25]:
# the data is stored in Dataframe person
predicted_y = model.predict(person)

In [26]:
print (predicted_y)

[1]
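
`predict` returns only the winning class. If you also want to know how confident the model is, `predict_proba` returns the normalised posterior of each class. The sketch below demonstrates it on synthetic one-feature data so that it runs on its own; `X_demo`, `y_demo`, `demo_model`, and `new_point` are names made up for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic one-feature data, generated only for this illustration
rng = np.random.default_rng(1)
X_demo = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)]).reshape(-1, 1)
y_demo = np.array([0] * 100 + [1] * 100)

demo_model = GaussianNB().fit(X_demo, y_demo)

# One row per observation, one column per class in demo_model.classes_
new_point = np.array([[3.5]])
probabilities = demo_model.predict_proba(new_point)

print(probabilities)                  # each row sums to 1
print(demo_model.predict(new_point))  # the class with the largest probability
```

In the notebook, the equivalent call would be `model.predict_proba(person)`.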
