## Naive Bayes

We'll hand-code a Naive Bayes classifier, train it on the Pima Indians diabetes dataset (https://www.kaggle.com/uciml/pima-indians-diabetes-database), and compare its performance* with sklearn's Naive Bayes classifier.

Please let me know if I got any of the logic/mathematical parts wrong.


---

\* By performance, we mean only the accuracy of predictions; we are not optimizing for memory or time.

The next two lines of code mount Google Drive in Google Colab; they are unnecessary if you run this locally or on another cloud provider such as GCP, AWS, or Paperspace.

In [2]:
from google.colab import files, drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


The datapath for the PIMA Indians diabetes dataset, downloaded from:
https://www.kaggle.com/uciml/pima-indians-diabetes-database

In [0]:
diabetes_datapath = "/content/gdrive/My Drive/ml_data/diabetes.csv"

In [0]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats

In [0]:
diabetes_df = pd.read_csv(diabetes_datapath)

In [6]:
diabetes_df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [7]:
diabetes_df.head()

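|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |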


In [8]:
diabetes_df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Since the data types of all columns are numerical, we will use Gaussian Naive Bayes, *i.e.* we assume that, within each class, each feature comes from a Gaussian/normal distribution.

(Strictly, the target value 'Outcome' should be categorical (yes/no or 1/0), but we relegate that to the output *"y"* variable.)
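
For reference, the textbook Gaussian Naive Bayes decision rule can be written as below. The symbols are introduced here for illustration and are not used elsewhere in the notebook: $x_j$ is the value of feature $j$, $\mu_{jc}$ and $\sigma_{jc}$ are the mean and standard deviation of feature $j$ within class $c$, and $P(c)$ is the class prior.

$$
\hat{y} = \arg\max_{c} \left[ \log P(c) + \sum_{j=1}^{d} \log \mathcal{N}\!\left(x_j \mid \mu_{jc}, \sigma_{jc}^2\right) \right],
\qquad
\mathcal{N}\!\left(x \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

Working with the sum of logs rather than the product of densities avoids numerical underflow when there are many features.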

In [0]:
num_samples, num_features = diabetes_df.shape

In [0]:
permuted_idxs = np.random.permutation(num_samples)
diabetes_df_new = diabetes_df.loc[permuted_idxs]

We use a 75:25 train/test split.

In [0]:
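train_set_sz = int(num_samples * 0.75)
train_df = diabetes_df_new[0:train_set_sz]
# NOTE: the +1 skips the row at position train_set_sz, so one sample ends up in
# neither set; diabetes_df_new[train_set_sz:] would keep it. The outputs shown
# further down were recorded with this slice.
test_df = diabetes_df_new[train_set_sz+1 :]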
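
The shuffle above is unseeded, so every run produces a different split and a slightly different accuracy. A minimal sketch of a reproducible split is shown below; the helper name `shuffle_split`, the `seed` parameter, and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, train_frac=0.75, seed=0):
    """Shuffle rows reproducibly, then split them by position (hypothetical helper)."""
    rng = np.random.default_rng(seed)          # seeded generator -> same split every run
    shuffled = df.iloc[rng.permutation(len(df))]
    n_train = int(len(df) * train_frac)
    # Positional slices that share the boundary n_train, so no row is dropped
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# Toy data, invented for illustration only
toy_df = pd.DataFrame({"x": range(8), "Outcome": [0, 1] * 4})
toy_train, toy_test = shuffle_split(toy_df)
print(len(toy_train), len(toy_test))  # 6 2
```

Using `.iloc` makes it explicit that the split is positional, independent of the shuffled index labels.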

In [0]:
# True labels of the test set (despite the name, these are not predictions)
test_df_preds = test_df['Outcome']
test_df = test_df.drop(columns='Outcome')

# drop=True discards the old (permuted) index instead of adding it as an "index" column
train_df = train_df.reset_index(drop=True)

In [0]:
#TODO: replace .loc[i] with df.iterrows() so no need to reset_index later

def separate_by_class(df, colname):
    """Separate the rows of the dataframe by class.

    Returns a dict whose keys are the class labels found in 'colname' and
    whose values are lists of the rows (data vectors) that carry that label.
    Assumes df has a default 0..n-1 index, because rows are read with df.loc[i].
    """
    separated = {}
    for i in range(df.shape[0]):
        vec = df.loc[i]
        if vec[colname] not in separated:
            separated[vec[colname]] = []
        separated[vec[colname]].append(vec)

    return separated
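
As an alternative to the row-by-row loop, the same grouping can be sketched with `DataFrame.groupby`. The function name `separate_by_class_groupby` is hypothetical, and the toy values are the first four Glucose/Outcome pairs from `diabetes_df.head()`.

```python
import pandas as pd

def separate_by_class_groupby(df, colname):
    """Map each class label to the sub-DataFrame of rows with that label."""
    return {label: group for label, group in df.groupby(colname)}

toy_df = pd.DataFrame({"Glucose": [148, 85, 183, 89], "Outcome": [1, 0, 1, 0]})
toy_groups = separate_by_class_groupby(toy_df, "Outcome")
for label, group in toy_groups.items():
    print(label, len(group))
```

Unlike `separate_by_class`, this version returns DataFrames rather than lists of rows, and it does not depend on the index running from 0 to n-1.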
    

In [0]:
#separated_data = separate_by_class(diabetes_df, 'Outcome')
#separated_data

In [0]:
#outcome_freq = dict()
#outcome_prob = dict()
def calc_pred_prob(df, colname):
    """Calculate the count/frequency of each class and its prior probability.

    Returns the counts and the probabilities as two dicts,
    both keyed by class label.
    """
    n_samples = df.shape[0]
    separated_data = separate_by_class(df, colname)
    outcome_freq = dict()
    outcome_prob = dict()
    for pred in separated_data.keys():
        outcome_freq[pred] = len(separated_data[pred])
        outcome_prob[pred] = outcome_freq[pred]/n_samples

    return outcome_freq, outcome_prob
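
The class counts and priors can also be read off with `Series.value_counts`. This is a minimal sketch; `class_priors` is a hypothetical name and the toy labels are invented for illustration.

```python
import pandas as pd

def class_priors(df, colname):
    """Return ({class: count}, {class: prior probability}) for the label column."""
    counts = df[colname].value_counts()
    priors = counts / counts.sum()
    return counts.to_dict(), priors.to_dict()

toy_df = pd.DataFrame({"Outcome": [1, 0, 0, 0]})
toy_freq, toy_prob = class_priors(toy_df, "Outcome")
print(toy_freq[0], toy_prob[1])  # 3 0.25
```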

In [0]:
outcome_freq, outcome_prob = calc_pred_prob(train_df, 'Outcome')

In [46]:
print(outcome_freq)
print(outcome_prob)

{1.0: 204, 0.0: 372}
{1.0: 0.3541666666666667, 0.0: 0.6458333333333334}


In [0]:
def calculate_freq_table(df, target_colname):
    """Compute the per-class (mean, std dev) of every column in the dataframe.

    Returns {class: {column: (mean, std_dev)}}. The target column itself is
    included, with a std dev of 0 within each class.
    """
    pred_classes = separate_by_class(df, target_colname).keys()

    data_dict = dict()
    for pred in pred_classes:
        # Rows that belong to the current class
        class_rows = df[df[target_colname] == pred]
        col_dict = dict()
        for col in df.columns:
            # Series.std() is the sample standard deviation (ddof=1)
            col_dict[col] = (class_rows[col].mean(), class_rows[col].std())

        data_dict[pred] = col_dict

    return data_dict
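
The per-class statistics can also be computed in one pass with `groupby`. The sketch below uses a hypothetical `class_feature_stats` helper and toy values taken from the first four rows of `diabetes_df.head()`; unlike `calculate_freq_table`, it leaves the target column out of the result.

```python
import pandas as pd

def class_feature_stats(df, target_colname):
    """Return {class: {feature: (mean, std)}} for every non-target column."""
    features = df.drop(columns=target_colname)
    grouped = features.groupby(df[target_colname])
    means = grouped.mean()
    stds = grouped.std()  # sample standard deviation (ddof=1), like Series.std()
    return {
        label: {col: (means.loc[label, col], stds.loc[label, col])
                for col in features.columns}
        for label in means.index
    }

toy_df = pd.DataFrame({"Glucose": [148.0, 85.0, 183.0, 89.0],
                       "Outcome": [1, 0, 1, 0]})
toy_stats = class_feature_stats(toy_df, "Outcome")
print(toy_stats[0]["Glucose"])  # mean 87.0, std sqrt(8)
```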

In [0]:
def make_pred(data_row, train_stats):
    """Predict the class of a one-row DataFrame.

    Returns the class with the highest summed log-likelihood, and the
    exponential of that sum (a product of densities, not a probability).
    """
    log_prob = dict()

    for (out_class, stats) in train_stats.items():
        # NOTE: the class prior P(class) is not included here; the full
        # Naive Bayes rule would add np.log(prior) to each class's score.
        log_prob[out_class] = 0.0
        for col in data_row.columns:
            # Gaussian density of this feature value under the class's (mean, std dev)
            density = scipy.stats.norm(*stats[col]).pdf(data_row[col])
            log_prob[out_class] += np.log(density)

    #To get the key with max value, use the first answer to this question:
    #https://stackoverflow.com/questions/268272/getting-key-with-maximum-value-in-dictionary
    out_class = max(log_prob.keys(), key=(lambda k: log_prob[k]))
    max_logprob = log_prob[out_class]

    return out_class, np.exp(max_logprob)
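
`make_pred` sums only the feature log-likelihoods; the Naive Bayes rule also adds the log of the class prior. A minimal sketch of that variant follows. The names `gaussian_logpdf` and `predict_with_prior`, the plain-dict row format, and all toy numbers are assumptions made for illustration, not the notebook's own code or data.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log of the normal density (same quantity as scipy.stats.norm.logpdf(x, mean, std))."""
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def predict_with_prior(row, stats, priors):
    """row: {feature: value}; stats: {class: {feature: (mean, std)}}; priors: {class: P(class)}."""
    scores = {}
    for label, feature_stats in stats.items():
        score = np.log(priors[label])              # log prior
        for col, value in row.items():             # only the features present in the row
            mean, std = feature_stats[col]
            score += gaussian_logpdf(value, mean, std)
        scores[label] = score
    best = max(scores, key=scores.get)             # class with the highest log posterior score
    return best, scores

# Toy numbers, invented for illustration only
toy_stats = {0: {"Glucose": (110.0, 25.0)}, 1: {"Glucose": (140.0, 30.0)}}
toy_priors = {0: 0.65, 1: 0.35}
high_label, _ = predict_with_prior({"Glucose": 150.0}, toy_stats, toy_priors)
low_label, _ = predict_with_prior({"Glucose": 100.0}, toy_stats, toy_priors)
print(high_label, low_label)  # 1 0
```

Computing the log density directly, instead of taking `np.log` of a density, avoids `log(0)` when a density underflows.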

In [0]:
def get_accuracy(test_df, test_df_preds, train_stats):
    """Return the percentage of test rows whose predicted class matches the true label."""
    val_set_sz = test_df.shape[0]
    correct_preds = 0
    for idx in test_df.index:
        pred, pred_prob = make_pred(test_df.loc[[idx]], train_stats)
        # Series.get_values() was removed in pandas 1.0; read the label directly
        if pred == test_df_preds.loc[idx]:
            correct_preds += 1

    return (correct_preds * 100)/val_set_sz

In [0]:
#Calculate the mean and std deviations of data vectors, which are separated by
#output classes, *i.e.* for each output class (1 or 0), we calculate mean and 
#std deviation for each feature

train_stats = calculate_freq_table(train_df, 'Outcome')

In [48]:
train_stats

{0.0: {'Age': (30.723118279569892, 11.502312105882046),
  'BMI': (30.112634408602155, 7.624070022962753),
  'BloodPressure': (67.71774193548387, 17.994860220305053),
  'DiabetesPedigreeFunction': (0.43916397849462374, 0.29513819229879623),
  'Glucose': (109.29569892473118, 25.99997268909407),
  'Insulin': (66.75537634408602, 89.92315885976693),
  'Outcome': (0.0, 0.0),
  'Pregnancies': (3.2096774193548385, 2.9604969538441575),
  'SkinThickness': (20.188172043010752, 14.95048683113169)},
 1.0: {'Age': (37.11764705882353, 10.861686736937802),
  'BMI': (35.29313725490195, 7.723762321230591),
  'BloodPressure': (71.79411764705883, 20.42255019238477),
  'DiabetesPedigreeFunction': (0.5711764705882357, 0.39422825804810024),
  'Glucose': (142.05392156862746, 32.95618877592207),
  'Insulin': (106.08823529411765, 135.0192544518468),
  'Outcome': (1.0, 0.0),
  'Pregnancies': (4.892156862745098, 3.5314403841910655),
  'SkinThickness': (23.848039215686274, 17.775320414575344)}}

In [49]:
#Get the accuracy of our model with the test dataset
get_accuracy(test_df, test_df_preds, train_stats)

73.82198952879581

Now that we have the test accuracy of our hand-coded Gaussian Naive Bayes classifier, we compare it with the Gaussian Naive Bayes that sklearn provides to see how the basic implementation measures up.

In [0]:
from sklearn.naive_bayes import GaussianNB

In [0]:
y_train = train_df['Outcome'].to_numpy()  # get_values() was removed in pandas 1.0
X_train = train_df.drop(columns='Outcome')

X_test = test_df
y_test = test_df_preds

In [51]:
gclf = GaussianNB()
gclf.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [52]:
X_train.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')

In [53]:
gclf.predict(X_test)

array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0])

In [54]:
gclf.score(X_test, y_test)

0.7329842931937173
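
Note that `score` returns a fraction (about 73.3%), while `get_accuracy` returns a percentage. Two known differences between the implementations may contribute to the small gap, although this notebook does not test either: sklearn's `GaussianNB` includes the class priors, and it estimates variances with the population formula (`ddof=0`) plus a small `var_smoothing` term, whereas pandas' `.std()` uses the sample formula (`ddof=1`). The sketch below only illustrates the `ddof` difference, using the first four Glucose values from `diabetes_df.head()`.

```python
import numpy as np
import pandas as pd

toy_glucose = pd.Series([148.0, 85.0, 183.0, 89.0])

sample_std = toy_glucose.std()                    # pandas default: ddof=1
population_std = np.std(toy_glucose.to_numpy())   # NumPy default: ddof=0

# The two estimates differ by a factor of sqrt(n / (n - 1))
n = len(toy_glucose)
print(sample_std / population_std, np.sqrt(n / (n - 1)))
```

With several hundred training rows per class the factor is close to 1, so this alone is unlikely to change many predictions.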