# Exercise 7 - Naive Bayes Classifier
## Gaussian Naive Bayes Classifier

### AIM
To write a Python program that implements a Gaussian Naive Bayes classifier.

### ALGORITHM:


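#### Function GaussianProb($x$, $\mu$, $\sigma$)

$$\text{GaussianProb}(x,\mu,\sigma)
= \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$

Here $\mu$ and $\sigma$ are the mean and standard deviation of one feature within one class. The value is a probability density, used as the likelihood of observing $x$ given that class.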
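
As a quick numerical check of the density above, the following minimal sketch evaluates it directly and compares it with `statistics.NormalDist.pdf` from the standard library. The sample values (120, 110, 25) are purely illustrative and are not taken from the dataset.

```python
import math
from statistics import NormalDist


def gaussian_prob(x, mu, sigma):
    """Gaussian density of x for mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


# At x == mu with sigma == 1 the density is 1 / sqrt(2 * pi).
peak = gaussian_prob(0.0, 0.0, 1.0)
print(round(peak, 4))  # 0.3989

# Illustrative values only: compare against the standard-library implementation.
print(math.isclose(gaussian_prob(120.0, 110.0, 25.0), NormalDist(110.0, 25.0).pdf(120.0)))  # True
```

Because this is a density, it can exceed 1 when $\sigma$ is small; only the comparison between classes matters for classification.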


```
Algorithm MakeModel(train_set)
    Input : train_set - a 2D list (table) containing the
                        response vector as the last column and
                        the feature matrix as the other columns
    Output : model - a Dict() with classes as keys and, as values,
                     a List() of (mean, stdev) Tuple()s, one for
                     each feature of the records in the train_set
                     belonging to that class
    
    classes = Set(record[-1] for record in train_set)
    class_separated_feature_matrices = Dict(
        class : List(
            record[:-1]
            for record in train_set
            if record[-1] == class
        )
        for class in classes
    )
    model = Dict(
        class : List(
            Tuple(Mean(feature_col),StDev(feature_col))
            for feature_col in columns(feature_matrix)
        )
        for (class,feature_matrix) in class_separated_feature_matrices
    )
    return model
End Algorithm

Algorithm classify(model,test_vector)
    Input : model - the Gaussian Naive Bayes model
            test_vector - the feature vector for the test case as Tuple()
    Output : class - the predicted class of the test_vector
   
    classProb = Dict(
        class : Product(
            List(
                GaussianProb(test_feature,mean,stdev)
                for test_feature,(mean,stdev) in zip(test_vector,model[class])
            )
        )
        for class in model
    )
    return argmax(classProb)
End Algorithm
```
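
The `classify` algorithm above compares only the products of the class-conditional densities. The usual naive Bayes decision rule also weights each class by its prior probability, and sums log densities rather than multiplying densities so that long products do not underflow. The sketch below shows that variant under those assumptions; it is not the method used by the source code in this exercise, and the names `make_model_with_priors`, `classify_log` and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def log_gaussian(x, mu, sigma):
    """Natural log of the Gaussian density."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)


def make_model_with_priors(train_set):
    """Per class: the prior (class frequency) and one (mean, stdev) per feature."""
    model = {}
    for label in {record[-1] for record in train_set}:
        rows = [record[:-1] for record in train_set if record[-1] == label]
        model[label] = {
            "prior": len(rows) / len(train_set),
            "params": [(mean(col), stdev(col)) for col in zip(*rows)],
        }
    return model


def classify_log(model, test_vector):
    """Return the class with the largest log prior + summed log likelihood."""
    scores = {
        label: math.log(entry["prior"]) + sum(
            log_gaussian(x, mu, sigma)
            for x, (mu, sigma) in zip(test_vector, entry["params"])
        )
        for label, entry in model.items()
    }
    return max(scores, key=scores.get)


# Made-up toy data: two features followed by the class label.
toy = [
    [1.0, 2.1, 0], [1.2, 1.9, 0], [0.8, 2.0, 0],
    [3.0, 4.2, 1], [3.1, 3.9, 1], [2.9, 4.0, 1],
]
toy_model = make_model_with_priors(toy)
print(classify_log(toy_model, (1.1, 2.0)))  # 0
print(classify_log(toy_model, (3.0, 4.1)))  # 1
```

When the classes are equally frequent, the prior term is the same for every class and both rules pick the same class.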

### DATASET USED:

**pima-indians-diabetes.csv** contains one row per test subject. The first eight columns hold the features collected from the subject, and the last column is the response vector, which gives the class as 0 (diabetes negative) or 1 (diabetes positive). The file has 769 rows (including the header row) and 9 columns.

The columns are as follows:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
   
The ninth (last) column is the response vector: 1 if the subject has diabetes, 0 if not.
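
The layout described above (eight feature columns followed by the class) can be illustrated with a small sketch that separates the features from the response vector and counts the classes. The inline CSV rows and the short header names are made up for illustration; they are not rows from the actual file.

```python
import csv
import io
from collections import Counter

# Made-up sample in the same 9-column layout (header + 3 rows), not real data.
SAMPLE_CSV = """\
preg,gluc,bp,skin,insulin,bmi,pedigree,age,class
1,90,70,20,80,25.0,0.30,25,0
5,160,85,35,150,35.5,0.60,50,1
2,100,65,22,90,27.1,0.25,30,0
"""


def split_features_and_labels(text, has_header=True):
    """Parse CSV text and split each row into its features and its class label."""
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader)  # skip the header row
    rows = [[float(value) for value in record] for record in reader]
    features = [row[:-1] for row in rows]
    labels = [int(row[-1]) for row in rows]
    return features, labels


features, labels = split_features_and_labels(SAMPLE_CSV)
print(len(features), len(features[0]))  # 3 8
print(Counter(labels))                  # Counter({0: 2, 1: 1})
```

Running the same function on the contents of the real file would give the number of subjects in each class, which is useful to know before interpreting an accuracy figure.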


### SOURCE CODE:

```python
import csv
import math
import statistics as st


def load_csv(filename, header=False):
    """Load a numeric CSV file as a list of rows of floats."""
    with open(filename, newline="") as csv_file:
        table = csv.reader(csv_file)
        if header:
            next(table)  # skip the header row
        return [
            [float(item) for item in record]
            for record in table
        ]


def gaussian_prob(x, mean, stdev):
    """Gaussian probability density of x for the given mean and stdev."""
    return math.exp(
        -(
            math.pow(x - mean, 2)
            / (2 * math.pow(stdev, 2))
        )
    ) / (math.sqrt(2 * math.pi) * stdev)


def columns(array):
    """Transpose a list of rows into an iterator over columns."""
    return zip(*array)


def argmax(dict_):
    """Return the key with the largest value."""
    return max(dict_, key=dict_.get)


class GNBClassifier:
    def __init__(self, training_data):
        self.model = GNBClassifier.make_model(training_data)

    @staticmethod
    def make_model(training_data):
        # The last item of each record is its class label.
        classes = {record[-1] for record in training_data}
        class_separated_feature_matrices = {
            class_: [
                record[:-1]
                for record in training_data
                if record[-1] == class_
            ]
            for class_ in classes
        }
        # For each class, one (mean, stdev) pair per feature column.
        class_separated_feature_mean_and_stdevs = {
            class_: [
                (st.mean(feature_col), st.stdev(feature_col))
                for feature_col in columns(feature_matrix)
            ]
            for class_, feature_matrix in class_separated_feature_matrices.items()
        }
        return class_separated_feature_mean_and_stdevs

    def classify(self, test_vector):
        # Product of the per-feature densities for each class
        # (class priors are not included).
        class_probs = {
            class_: math.prod([
                gaussian_prob(test_feature, mean, stdev)
                for test_feature, (mean, stdev) in zip(test_vector, self.model[class_])
            ])
            for class_ in self.model
        }
        return argmax(class_probs)

    def compute_accuracy(self, testing_data):
        result = [
            original_class == self.classify(test_vector)
            for *test_vector, original_class in testing_data
        ]
        return result.count(True) / len(result)


if __name__ == "__main__":
    dataset = load_csv("pima-indians-diabetes.csv", True)
    # Sequential 75/25 split: the rows are not shuffled.
    split_ratio = .75
    split_length = int(len(dataset) * split_ratio)
    train = dataset[:split_length]
    test = dataset[split_length:]
    print('The length of the training set', len(train))
    print('The length of the testing set', len(test))
    diabetes_predictor = GNBClassifier(train)
    print(
        "\nThe accuracy of predictions for this classifier is:",
        diabetes_predictor.compute_accuracy(test)
    )
    print("\nPredicting diabetes for few patients from the testing set :")
    for *test_vector, _ in test[:5]:
        print(
            "\nThe patient with features : ",
            test_vector,
            "is predicted for diabetes as:",
            "POSITIVE" if diabetes_predictor.classify(test_vector) else "NEGATIVE",
            sep="\n"
        )
```

```
The length of the training set 576
The length of the testing set 192

The accuracy of predictions for this classifier is: 0.7760416666666666

Predicting diabetes for few patients from the testing set :

The patient with features : 
[6.0, 108.0, 44.0, 20.0, 130.0, 24.0, 0.813, 35.0]
is predicted for diabetes as:
NEGATIVE

The patient with features : 
[2.0, 118.0, 80.0, 0.0, 0.0, 42.9, 0.693, 21.0]
is predicted for diabetes as:
NEGATIVE

The patient with features : 
[10.0, 133.0, 68.0, 0.0, 0.0, 27.0, 0.245, 36.0]
is predicted for diabetes as:
POSITIVE

The patient with features : 
[2.0, 197.0, 70.0, 99.0, 0.0, 34.7, 0.575, 62.0]
is predicted for diabetes as:
POSITIVE

The patient with features : 
[0.0, 151.0, 90.0, 46.0, 0.0, 42.1, 0.371, 21.0]
is predicted for diabetes as:
POSITIVE
```

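A single accuracy figure does not show which kind of mistake the classifier makes. As a hedged extension of `compute_accuracy`, the sketch below counts every (actual, predicted) pair. The helper name `confusion_counts` is hypothetical, and the stand-in `threshold_classifier` with its made-up rows only serves to make the example runnable without the dataset.

```python
from collections import Counter


def confusion_counts(classify, testing_data):
    """Count (actual, predicted) pairs; each row ends with its actual class."""
    return Counter(
        (actual, classify(features))
        for *features, actual in testing_data
    )


# Hypothetical stand-in for a trained classifier's classify method.
def threshold_classifier(features):
    return 1.0 if features[0] > 5 else 0.0


# Made-up rows: one feature followed by the actual class.
rows = [[2.0, 0.0], [7.0, 1.0], [9.0, 0.0], [1.0, 1.0], [8.0, 1.0]]
counts = confusion_counts(threshold_classifier, rows)

# true positives, true negatives, false positives, false negatives
print(counts[(1.0, 1.0)], counts[(0.0, 0.0)], counts[(0.0, 1.0)], counts[(1.0, 0.0)])  # 2 1 1 1
```

With the classifier from this exercise, the call would be `confusion_counts(diabetes_predictor.classify, test)`.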

### Alternative method using NumPy

```python
import numpy as np


def gaussian_prob(x, mean, stdev):
    # Reshape x from (n, d) or (d,) to (n, 1, d) so that it broadcasts
    # against the (k, d) arrays of per-class means and standard deviations.
    x = x.reshape(-1, 1, x.shape[-1])
    return np.exp(
        -(np.power(x - mean, 2) / (2 * np.power(stdev, 2)))
    ) / (np.sqrt(2 * np.pi) * stdev)


class GNBClassifier:
    def __init__(self, train_x, train_y):
        classes = np.unique(train_y)
        class_separated_train_x = [train_x[train_y == class_] for class_ in classes]
        # One row per class, one column per feature.
        self.means = np.array([d.mean(axis=0) for d in class_separated_train_x])
        # ndarray.std defaults to the population standard deviation (ddof=0),
        # whereas statistics.stdev in the first version divides by n - 1.
        self.stdevs = np.array([d.std(axis=0) for d in class_separated_train_x])

    def classify(self, test_x):
        # Multiply the densities over the features, then pick the best class.
        # argmax returns an index into the sorted class labels; the labels
        # here are 0 and 1, so the index equals the label.
        return np.argmax(
            np.prod(gaussian_prob(test_x, self.means, self.stdevs), axis=-1), axis=-1
        )

    def compute_accuracy(self, test_x, test_y):
        return (self.classify(test_x) == test_y).mean()


if __name__ == "__main__":
    dataset = np.loadtxt("pima-indians-diabetes.csv", delimiter=",", skiprows=1)
    # Sequential 75/25 split: the rows are not shuffled.
    split_ratio = .75
    split_length = int(len(dataset) * split_ratio)
    train = dataset[:split_length]
    train_x, train_y = train[:, :-1], train[:, -1]
    test = dataset[split_length:]
    test_x, test_y = test[:, :-1], test[:, -1]

    print('The length of the training set', len(train))
    print('The length of the testing set', len(test))
    diabetes_predictor = GNBClassifier(train_x, train_y)
    print(
        "\nThe accuracy of predictions for this classifier is:",
        diabetes_predictor.compute_accuracy(test_x, test_y)
    )

    print("\nPredicting diabetes for few patients from the testing set :")
    for test_vector in test_x[:5]:
        print(
            "\nThe patient with features : ", test_vector,
            "is predicted for diabetes as:",
            # classify returns a one-element array for a single vector.
            "POSITIVE" if diabetes_predictor.classify(test_vector)[0] else "NEGATIVE",
            sep="\n"
        )
```

```
The length of the training set 576
The length of the testing set 192

The accuracy of predictions for this classifier is: 0.7760416666666666

Predicting diabetes for few patients from the testing set :

The patient with features : 
[  6.    108.     44.     20.    130.     24.      0.813  35.   ]
is predicted for diabetes as:
NEGATIVE

The patient with features : 
[  2.    118.     80.      0.      0.     42.9     0.693  21.   ]
is predicted for diabetes as:
NEGATIVE

The patient with features : 
[ 10.    133.     68.      0.      0.     27.      0.245  36.   ]
is predicted for diabetes as:
POSITIVE

The patient with features : 
[  2.    197.     70.     99.      0.     34.7     0.575  62.   ]
is predicted for diabetes as:
POSITIVE

The patient with features : 
[  0.    151.     90.     46.      0.     42.1     0.371  21.   ]
is predicted for diabetes as:
POSITIVE
```

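The NumPy version relies on broadcasting: a batch of `n` vectors with `d` features is reshaped to `(n, 1, d)` and combined with per-class parameters of shape `(k, d)`, giving one value per (sample, class, feature). The sketch below traces those shapes on made-up numbers. As an assumption that differs from the code above, it sums log densities instead of multiplying densities, which selects the same class but is less prone to underflow.

```python
import numpy as np


def log_gaussian(x, mean, stdev):
    """Log density with x of shape (n, d) and mean/stdev of shape (k, d)."""
    x = x.reshape(-1, 1, x.shape[-1])  # (n, d) -> (n, 1, d)
    return -np.log(stdev * np.sqrt(2 * np.pi)) - (x - mean) ** 2 / (2 * stdev ** 2)


# Made-up parameters for k = 2 classes and d = 2 features.
means = np.array([[1.0, 2.0], [3.0, 4.0]])
stdevs = np.array([[0.5, 0.5], [0.5, 0.5]])
# Made-up batch of n = 3 feature vectors.
batch = np.array([[1.1, 2.0], [2.9, 4.2], [0.9, 1.8]])

per_feature = log_gaussian(batch, means, stdevs)
print(per_feature.shape)  # (3, 2, 2): sample, class, feature

scores = per_feature.sum(axis=-1)  # sum over features
print(scores.shape)       # (3, 2): sample, class

predictions = scores.argmax(axis=-1)  # best class index per sample
print(predictions)        # [0 1 0]
```

The result of `argmax` is an index into the class axis. It matches the class label only when the sorted labels are 0, 1, 2, and so on, as they are for this dataset.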

---