# CS 498AM1 Applied Machine Learning
## Problem 1 : Building a Naive Bayes Classifier
### Prepared by: Vardhan Dongre (vdongre2@illinois.edu)

#### Problem Description: 
Build a Naive Bayes classifier for the mathematics dataset by quantizing the dataset into 2 categories (G3 > 12 and G3 <= 12). The dataset has 30 attributes, choose suitable models for each attribute to find the class conditional probabilities.

__Dataset:__ https://archive.ics.uci.edu/ml/datasets/Student+Performance 


__Part a__ : For <font color = blue>"binary" attributes</font>, use a <font color = blue>binomial model</font>. For the attributes described as <font color = blue>“numeric”</font> use a <font color = blue>normal model</font>. For the attributes described as <font color = blue>“nominal”</font> use a <font color = blue>multinomial model</font>

__Part b__ : For <font color = blue>"binary" attributes</font>, use a <font color = blue>binomial model</font>. For the attributes described as <font color = blue>“numeric”</font> use a <font color = blue>multinomial model</font>. For the attributes described as <font color = blue>“nominal”</font> use a <font color = blue>multinomial model</font>

In [47]:
import pandas as pd
import numpy as np
from numpy import std
from numpy import mean
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn 
%matplotlib inline
from scipy.stats import norm
from sklearn.model_selection import train_test_split
import os

In [48]:
math_df = pd.read_csv("/Users/don/Desktop/AML/student-mat.csv", sep=";")
# Drop G1 and G2 and absences
math_df = math_df.drop(['G1', 'G2', 'absences'], axis = 1)

In [49]:
# Attributes
num_students = math_df.shape[0]
# Binary Attributes
binary_attributes = {'school':['GP','MS'],'sex':['F','M'], 'address':['U','R'], 'famsize':['LE3','GT3'], 'Pstatus':['T','A'], 'schoolsup':['yes','no'], 'famsup':['yes','no'], 'paid':['yes','no'], 'activities':['yes','no'], 'nursery':['yes','no'], 'higher':['yes','no'], 'internet':['yes','no'], 'romantic':['yes','no']}
# Nominal Attributes
nominal_attributes = {'Mjob':['teacher','health','services','at_home','other'], 'Fjob':['teacher','health','services','at_home','other'], 'reason':['home', 'reputation', 'course', 'other'], 'guardian':['mother', 'father', 'other']}
# Numeric Attributes
numeric_attributes = {'age','Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel','freetime', 'goout', 'Dalc', 'Walc', 'health'}

In [50]:
# NA values
#math_df.isnull().sum()
# Check for Unique Vals
#math_df.nunique(axis=0)
math_df.describe()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,G3
count,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0,395.0
mean,16.696203,2.749367,2.521519,1.448101,2.035443,0.334177,3.944304,3.235443,3.108861,1.481013,2.291139,3.55443,10.41519
std,1.276043,1.094735,1.088201,0.697505,0.83924,0.743651,0.896659,0.998862,1.113278,0.890741,1.287897,1.390303,4.581443
min,15.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,16.0,2.0,2.0,1.0,1.0,0.0,4.0,3.0,2.0,1.0,1.0,3.0,8.0
50%,17.0,3.0,2.0,1.0,2.0,0.0,4.0,3.0,3.0,1.0,2.0,4.0,11.0
75%,18.0,4.0,3.0,2.0,2.0,0.0,5.0,4.0,4.0,2.0,3.0,5.0,14.0
max,22.0,4.0,4.0,4.0,4.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,20.0


In [84]:
# Numeric Probabilities
def numeric_prob(data):
    numeric_prob = {}
    for i in numeric_attributes:
        numeric_prob[i] = data[i].value_counts(normalize=True).to_dict()
    return numeric_prob

# Nominal Probabilities
def nominal_prob(data):
    nominal_prob = {}
    for i in nominal_attributes:
        nominal_prob[i] = data[i].value_counts(normalize=True).to_dict()
    return nominal_prob

# Binary Probabilities
def binary_prob(data):
    bin_prob = {}
    for i in binary_attributes:
        bin_prob[i] = data[i].value_counts(normalize=True).to_dict()
    return bin_prob

# Gaussian Distribution
def normal_prob(data):
    #norm_prob = {}
    for i in numeric_attributes:
        mu = mean(data[i])
        sigma = std(data[i])
        dist = norm(mu, sigma)
        #norm_prob[i] = dist.pdf(data[i])
    return dist

From Bayes' Rule, for classification we need:
$$ \Big[\Pi_j\Big( p(x^{(j)}|y )\Big)p(y)\Big]$$

Since we need to find the class conditional probabilities to evaluate the above rule, the following function finds the class conditional probabilities for each attribute and then gives the following probabilities:

$$prob_0: \Pi_j\Big( p(x^{(j)}|y=0)\Big)$$
$$prob_1: \Pi_j\Big( p(x^{(j)}|y=1)\Big)$$

In [52]:
def ConditionalProbsModelB(data):
    col= X.columns
    prob_1 = np.zeros(len(data))
    prob_0 = np.zeros(len(data))
    for i in range(len(data)):
        pro_1=1
        pro_0=1
        for k in col:
            #print(i,k)
            if(k in binary_attributes):
                cate="binary"
            if(k in nominal_attributes):
                cate="nominal"
            if(k in numeric_attributes):
                cate="numeric"
            if(cate=="binary"):
                if(data.iloc[i][k] in tr_bin_prob_0[k]):
                    pr_0=tr_bin_prob_0[k][data.iloc[i][k]]
                else:
                    pr_0 = 0.00001
                if(data.iloc[i][k] in tr_bin_prob_1[k]):
                    pr_1=tr_bin_prob_1[k][data.iloc[i][k]]
                else:
                    pr_1 = 0.00001
            if(cate=="nominal"):
                if(data.iloc[i][k] in tr_nom_prob_0[k]):
                    pr_0=tr_nom_prob_0[k][data.iloc[i][k]]
                else:
                    pr_0 = 0.00001
                if(data.iloc[i][k] in tr_nom_prob_1[k]):
                    pr_1=tr_nom_prob_1[k][data.iloc[i][k]]
                else:
                    pr_1 = 0.00001
            if(cate=="numeric"):
                if(data.iloc[i][k] in tr_num_prob_0[k]):
                    pr_0=tr_num_prob_0[k][data.iloc[i][k]]
                else:
                    pr_0 = 0.00001
                if(data.iloc[i][k] in tr_num_prob_1[k]):
                    pr_1=tr_num_prob_1[k][data.iloc[i][k]]
                else:
                    pr_1 = 0.00001
        pro_1*=pr_1
        pro_0*=pr_0
        prob_1[i] = pro_1
        prob_0[i] = pro_0
    return prob_0, prob_1

In [92]:
def ConditionalProbsModelA(data):
    col= X.columns
    prob_1 = np.zeros(len(data))
    prob_0 = np.zeros(len(data))
    for i in range(len(data)):
        pro_1=1
        pro_0=1
        for k in col:
            #print(i,k)
            if(k in binary_attributes):
                cate="binary"
            if(k in nominal_attributes):
                cate="nominal"
            if(k in numeric_attributes):
                cate="numeric"
            if(cate=="binary"):
                if(data.iloc[i][k] in tr_bin_prob_0[k]):
                    pr_0=tr_bin_prob_0[k][data.iloc[i][k]]
                else:
                    pr_0 = 0.00001
                if(data.iloc[i][k] in tr_bin_prob_1[k]):
                    pr_1=tr_bin_prob_1[k][data.iloc[i][k]]
                else:
                    pr_1 = 0.00001
            if(cate=="nominal"):
                if(data.iloc[i][k] in tr_nom_prob_0[k]):
                    pr_0=tr_nom_prob_0[k][data.iloc[i][k]]
                else:
                    pr_0 = 0.00001
                if(data.iloc[i][k] in tr_nom_prob_1[k]):
                    pr_1=tr_nom_prob_1[k][data.iloc[i][k]]
                else:
                    pr_1 = 0.00001
            if(cate=="numeric"):
                #if(data.iloc[i][k] in tr_num_prob_0[k]):
                    pr_0=tr_norm_dist_0.pdf(data.iloc[i][k])
                #else:
                #    pr_0 = 0.00001
                #if(data.iloc[i][k] in tr_num_prob_1[k]):
                    pr_1=tr_norm_dist_1.pdf(data.iloc[i][k])
                #else:
                #    pr_1 = 0.00001
        pro_1*=pr_1
        pro_0*=pr_0
        prob_1[i] = pro_1
        prob_0[i] = pro_0
    return prob_0, prob_1

In [54]:
def accuracy(A,B):
    correct = 0
    incorrect = 0
    for i in range(len(A)):
        #print(A[i])
        if A[i] == B.iloc[i]:
            correct += 1
        else:
            incorrect += 1
    return correct/(correct+incorrect)

In [93]:
X = math_df.iloc[:,:29]
y = math_df['G3']
train_score = np.zeros(10)
test_score = np.zeros(10)

for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
    X_0 = y_train<=12
    X_1 = y_train>12
    # Quantize data in two classes G3 <= 12(0) and G3 > 12(1)
    Xy0 = X_train[X_0]
    Xy1 = X_train[X_1]

########################### (Part-b) #############################
# # Training 

#     # Probabilities
#     tr_priory0 = len(Xy0)/len(X_train)
#     tr_priory1 = len(Xy1)/len(X_train)
#     # Conditional Probabilities for training data (>12)
#     tr_num_prob_1 = numeric_prob(Xy1) # Numeric Attributes (Multinomial Model)
#     tr_nom_prob_1 = nominal_prob(Xy1) # Nominal Attributes (Multinomial Model)
#     tr_bin_prob_1 = binary_prob(Xy1) # Binary Attributes (Binomial Model)

#     # Conditional Probabilities for training data (<=12)
#     tr_num_prob_0 = numeric_prob(Xy0) # Numeric Attributes (Multinomial Model)
#     tr_nom_prob_0 = nominal_prob(Xy0) # Nominal Attributes (Multinomial Model)
#     tr_bin_prob_0 = binary_prob(Xy0) # Binary Attributes (Binomial Model)
    
#     tr_prob_0, tr_prob_1 = ConditionalProbsModelB(X_train)
#     tr_py0 = tr_priory0*tr_prob_0
#     tr_py1 = tr_priory1*tr_prob_1
#     guess_tr = tr_py1>tr_py0
#     tr_compare = y_train > 12
#     train_score[i] = accuracy(guess_tr, tr_compare)

# # Testing

#     te_prob_0, te_prob_1 = ConditionalProbsModelB(X_test)
#     te_py0 = tr_priory0*te_prob_0
#     te_py1 = tr_priory1*te_prob_1
#     guess_te = te_py1>te_py0
#     te_compare = y_test > 12
#     test_score[i] = accuracy(guess_te, te_compare)
    
########################### (Part-a) #############################
# Training

    # Probabilities
    tr_priory0 = len(Xy0)/len(X_train)
    tr_priory1 = len(Xy1)/len(X_train)
    # Conditional Probabilities for training data (>12)
    tr_norm_dist_1 = normal_prob(Xy1) # Numeric Attributes (Normal (Gaussian) Model)
    tr_nom_prob_1 = nominal_prob(Xy1) # Nominal Attributes (Multinomial Model)
    tr_bin_prob_1 = binary_prob(Xy1) # Binary Attributes (Binomial Model)

    # Conditional Probabilities for training data (<=12)
    tr_norm_dist_0 = normal_prob(Xy0) # Numeric Attributes (Normal (Gaussian) Model)
    tr_nom_prob_0 = nominal_prob(Xy0) # Nominal Attributes (Multinomial Model)
    tr_bin_prob_0 = binary_prob(Xy0) # Binary Attributes (Binomial Model)

    tr_prob_0, tr_prob_1 = ConditionalProbsModelA(X_train)
    tr_py0 = tr_priory0*tr_prob_0
    tr_py1 = tr_priory1*tr_prob_1
    guess_tr = tr_py1>tr_py0
    tr_compare = y_train > 12
    train_score[i] = accuracy(guess_tr, tr_compare)

# Testing
    
    te_prob_0, te_prob_1 = ConditionalProbsModelA(X_test)
    te_py0 = tr_priory0*te_prob_0
    te_py1 = tr_priory1*te_prob_1
    guess_te = te_py1>te_py0
    te_compare = y_test > 12
    test_score[i] = accuracy(guess_te, te_compare)


In [101]:
# Part - a Accuracies
train_score_a = train_score
test_score_a = test_score
# Mean and Standard Deviation of Accuracies
# Train Accuracies
tr_mean_a = mean(train_score)
tr_sd_a = std(train_score)
# Test Accuracies
te_mean_a = mean(test_score)
te_sd_a = std(test_score)

In [65]:
# # Part - b Accuracies
# # Mean and Standard Deviation of Accuracies
# # Train Accuracies
# train_score_b = train_score
# test_score_b = test_score
# tr_mean_b = mean(train_score)
# tr_sd_b = std(train_score)
# # Test Accuracies
# te_mean_b = mean(test_score)
# te_sd_b = std(test_score)

In [118]:
help_dict1 = {'Train_score_a':train_score_a, 'Train_score_b':train_score_b, 'Test_score_a':test_score_a, 'Test_score_b':test_score_b}
summary1 = pd.DataFrame(help_dict1)
help_dict2 ={'Train_mean_a':tr_mean_a, 'Train_mean_b':tr_mean_b, 'Test_mean_a':te_mean_a, 'Test_mean_b':te_mean_b}
summary2 = pd.DataFrame(help_dict2, index=[0])
help_dict3 ={'Train_sd_a':tr_sd_a, 'Train_sd_b':tr_sd_b, 'Test_sd_a':te_sd_a, 'Test_sd_b':te_sd_b}
summary3 = pd.DataFrame(help_dict3, index=[0])

# Results and Discussion

The training and testing scores for the two classifiers over 10 folds are as follows:

In [120]:
print(summary1)

   Train_score_a  Train_score_b  Test_score_a  Test_score_b
0       0.674627       0.671642      0.633333      0.666667
1       0.680597       0.659701      0.600000      0.733333
2       0.659701       0.656716      0.716667      0.750000
3       0.668657       0.671642      0.666667      0.650000
4       0.662687       0.671642      0.700000      0.666667
5       0.653731       0.689552      0.750000      0.550000
6       0.659701       0.674627      0.716667      0.650000
7       0.668657       0.674627      0.666667      0.650000
8       0.677612       0.698507      0.616667      0.500000
9       0.650746       0.680597      0.766667      0.600000


We see that the accuracies obtained in Part a and b have the following statistics:
### Mean

In [117]:
print(summary2)

   Train_mean_a  Train_mean_b  Test_mean_a  Test_mean_b
0      0.665672      0.674925     0.683333     0.641667


### Standard Deviation

In [119]:
print(summary3)

   Train_sd_a  Train_sd_b  Test_sd_a  Test_sd_b
0    0.009534    0.011824   0.053229   0.071976


### Which classifier is more accurate and why?

The difference between the two classfiers is the event models for the numeric features. The normal distribution is a suitable model for continuous type attributes while multinomial is suitable for feature with finite categories. From the above statistics we can see that the mean accuracy on test data in part a is higher than that of part b as well as the standard deviation for part a is smaller than that of part b, thus we can conclude from this data that classifier A performed better than Classifier B 