<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Supervised-Learning" data-toc-modified-id="Supervised-Learning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Supervised Learning</a></span><ul class="toc-item"><li><span><a href="#Derivation-of-Naive-Bayes-Posterior-Probability" data-toc-modified-id="Derivation-of-Naive-Bayes-Posterior-Probability-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Derivation of Naive Bayes Posterior Probability</a></span><ul class="toc-item"><li><span><a href="#Role-of-Priors-in-Naive-Bayes-Classifier" data-toc-modified-id="Role-of-Priors-in-Naive-Bayes-Classifier-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Role of Priors in Naive Bayes Classifier</a></span></li></ul></li><li><span><a href="#Implementation-of-Naive-Bayes-Classifier" data-toc-modified-id="Implementation-of-Naive-Bayes-Classifier-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Implementation of Naive Bayes Classifier</a></span><ul class="toc-item"><li><span><a href="#Fitting-the-model-using-two-features" data-toc-modified-id="Fitting-the-model-using-two-features-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Fitting the model using two features</a></span></li><li><span><a href="#Fitting-the-model-using-all-features" data-toc-modified-id="Fitting-the-model-using-all-features-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Fitting the model using all features</a></span></li></ul></li></ul></li></ul></div>

# Supervised Learning

## Derivation of Naive Bayes Posterior Probability

* The probabilities involved, for a class $Tt$ and features $x_1,x_2,\dots,x_n$, are:

    * **Posterior Probability:** $P(Tt|x_1,x_2,\dots,x_n)$
    * **Likelihood:** $P(x_1,x_2,\dots,x_n|Tt)$
    * **Prior Probability:** $P(Tt)$
---
* Using Bayes' rule for conditional probabilities, we can write,
$$P(Tt|x_1,x_2,\dots,x_n) = \frac{P(x_1,x_2,\dots,x_n|Tt)\times P(Tt)}{P(x_1,x_2,\dots,x_n)}$$
* If the features are additionally assumed to be independent of each other, the denominator factorizes as,
$$P(x_1,x_2,\dots,x_n) = \prod_{i=1}^{n} P(x_i)$$
* The likelihood term can be expanded with the chain rule and, assuming the features are conditionally independent given the class, simplified to give,
$$P(x_1,x_2,\dots,x_n|Tt) = \prod_{i=1}^{n} P(x_i|Tt) $$
* Combining all this we get,
$$P(Tt|x_1,x_2,\dots,x_n) = \frac{\prod_{i=1}^{n} P(x_i|Tt)\times P(Tt)}{\prod_{i=1}^{n} P(x_i)}$$
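
As a small numeric illustration of the combined formula, the sketch below multiplies per-feature likelihoods by the prior for two classes. The class names, likelihood values and priors are made up for illustration. Note one deliberate difference from the formula above: the denominator is obtained by summing the numerator over the classes (law of total probability) rather than as a product of marginals, so that the resulting posteriors sum to one.

```python
import math

# Hypothetical per-feature likelihoods P(x_i | class) for one sample with n = 3 features.
likelihoods = {
    "A": [0.5, 0.2, 0.4],
    "B": [0.1, 0.6, 0.5],
}
# Hypothetical prior probabilities P(class).
priors = {"A": 0.4, "B": 0.6}

# Numerator of Bayes' rule under the naive assumption: prod_i P(x_i | class) * P(class).
joint = {c: math.prod(likelihoods[c]) * priors[c] for c in priors}

# Normalizing over the classes gives posteriors that sum to one.
evidence = sum(joint.values())
posterior = {c: joint[c] / evidence for c in joint}

print(joint)
print(posterior)
```

Only the numerators are needed to pick the most probable class; the normalization matters when calibrated probabilities are wanted.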

### Role of Priors in Naive Bayes Classifier

* When building a classifier, we are interested in the posterior probability derived above: given a test sample, it is the probability that the sample belongs to class $Tt$.
* Classification is therefore done by taking the argmax of the posterior probability over the candidate classes.
* We can safely ignore the denominator in the decision rule, since it is the same for every class.
* Thus,
$$P(Tt|x_1,x_2,\dots,x_n) \propto \prod_{i=1}^{n} P(x_i|Tt)\times P(Tt)$$
* As the formula shows, the prior scales the posterior directly and therefore plays an important role in classification.
* When training a Naive Bayes classifier, the prior probability of each class is taken to be the relative frequency of training samples belonging to that class.
* Thus, if the training dataset has more samples of a particular class, the prior probability of that class is correspondingly higher.
* This becomes a concern when the number of training samples per class is heavily skewed, since the model can become biased towards the class with the high prior probability.
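
To make the effect of a skewed prior concrete, here is a minimal sketch with made-up numbers. The helper names `estimate_priors` and `classify`, the class labels and the likelihood values are all hypothetical. The same likelihoods lead to different decisions depending on how the training labels are distributed.

```python
from collections import Counter

def estimate_priors(labels):
    # Prior of each class = relative frequency of that class among the training labels.
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def classify(likelihood, priors):
    # Decision rule: argmax over classes of likelihood * prior (denominator ignored).
    return max(priors, key=lambda c: likelihood[c] * priors[c])

# Hypothetical likelihood products prod_i P(x_i | class) for a single test sample.
likelihood = {"A": 0.04, "B": 0.03}

balanced = estimate_priors(["A"] * 5 + ["B"] * 5)  # equal class counts
skewed = estimate_priors(["A"] * 2 + ["B"] * 8)    # class B dominates

pred_balanced = classify(likelihood, balanced)
pred_skewed = classify(likelihood, skewed)
print(pred_balanced, pred_skewed)
```

With balanced priors the sample goes to the class with the larger likelihood; with the skewed priors the larger prior of the majority class outweighs it.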

## Implementation of Naive Bayes Classifier

In [1]:
#Libraries to be used.
import pandas as pd
import numpy as np
import pprint

In [2]:
#Utilities to be used.
pp = pprint.PrettyPrinter(width=60, depth=1)


#A class that helps with processing the csv files.
class process_csv:
    def __init__(self, path):
        self.path = path
        self.features = []

    def read_csv_to_df(self):
        #Reads the csv file and returns a dataframe without its first column.
        self.features = list(pd.read_csv(self.path, nrows=1).columns)
        self.features.pop(0)  #Removes the first column since it is not a feature.
        df = pd.read_csv(self.path, usecols=self.features)
        return df

    def scale_features(self, df):
        #Min-max scales every column of the dataframe to the range [0, 1].
        df_scaled = df.apply(lambda seq: (seq.astype(float) - seq.min()) /
                             (seq.max() - seq.min()),
                             axis=0)
        return df_scaled

    def df_to_numpy(self, df, scaling):
        #Converts a dataframe into a numpy array, scaling it first if requested.
        if scaling:
            return self.scale_features(df).to_numpy()
        return df.to_numpy()

    def select_class(self, class_name, no_of_features_is_two, scaling):
        #Filters the samples whose diagnosis equals class_name and returns them as an array.
        #Note that any scaling is applied to this class subset only.
        df = self.read_csv_to_df()
        label_for_class = (df['diagnosis'] == class_name)
        if no_of_features_is_two:
            class_df = df.loc[label_for_class,
                              ["concave points_worst", "radius_mean"]]
        else:
            #All columns from index 3 of self.features onward are used.
            class_df = df.loc[label_for_class, self.features[3:]]
        class_arr = self.df_to_numpy(class_df, scaling)
        return class_arr

In [3]:
#A class that performs classification using naive bayes.
class Naive_Bayes_Classifier:
    def __init__(self, X_train_M, X_train_B):
        self.training_data = [X_train_M, X_train_B]
        self.Classes = ['M', 'B']
        self.params = dict((Class, {}) for Class in self.Classes)

    def fit(self):
        #Estimates the prior, mean and standard deviation of each class from the training data.
        #Unpacking the training data.
        X_m, X_b = self.training_data
        #Per-class sample counts, feature means and feature standard deviations.
        N = [X_m.shape[0], X_b.shape[0]]
        arr_mean = [
            np.mean(X_m, axis=0, dtype=np.float64),
            np.mean(X_b, axis=0, dtype=np.float64)
        ]
        arr_std = [
            np.std(X_m, axis=0, dtype=np.float64),
            np.std(X_b, axis=0, dtype=np.float64)
        ]
        for Class in self.Classes:
            class_index = self.Classes.index(Class)
            self.params[Class]['Prior'] = N[class_index] / sum(N)
            self.params[Class]['Mean'] = arr_mean[class_index]
            self.params[Class]['Std'] = arr_std[class_index]
        #return self.params

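    def log_likelihood(self, x, mu, sigma):
        #Gaussian log-likelihood summed over the features. The constant -0.5*log(2*pi)
        #per feature is omitted because it is the same for every class.
        return np.sum(-np.log(sigma) - (x - mu)**2 / (2 * sigma**2))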

    def predict(self, X_test):
        #Computes the log-posterior (up to a constant) of each class per sample
        #and assigns the class with the larger value.
        pred = []
        for i in range(X_test.shape[0]):
            sample = X_test[i, :]
            #The prior enters as log(prior), since the likelihood is in log space.
            logp_m = self.log_likelihood(
                sample, self.params['M']['Mean'],
                self.params['M']['Std']) + np.log(self.params['M']['Prior'])
            logp_b = self.log_likelihood(
                sample, self.params['B']['Mean'],
                self.params['B']['Std']) + np.log(self.params['B']['Prior'])
            if logp_m > logp_b:
                pred.append('M')
            else:
                pred.append('B')
        return pred

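    def pred_accuracy(self, test_data_dict):
        #Returns the percentage of correctly classified samples.
        #test_data_dict maps each true label to the array of its test samples.
        correct_pred = 0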
        N = 0
        for label, X_test in test_data_dict.items():
            N += X_test.shape[0]
            pred = self.predict(X_test)
            correct_pred += pred.count(label)
        pred_accuracy = (correct_pred / N) * 100
        return pred_accuracy
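
The `log_likelihood` method above leaves out the constant term of the Gaussian log-density. The following sketch, using only the standard library and made-up means, standard deviations and sample values, checks that dropping this term shifts every class score by the same amount, so the comparison between classes is unaffected. The function names here are hypothetical and not part of the classifier.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    # Full log-density of a univariate Gaussian, including the constant term.
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def log_likelihood_no_const(xs, mus, sigmas):
    # Sum over features with the constant term dropped.
    return sum(-math.log(s) - (x - m) ** 2 / (2 * s ** 2)
               for x, m, s in zip(xs, mus, sigmas))

sample = [0.15, 15.0]                    # hypothetical two-feature sample
class_m = ([0.18, 17.4], [0.045, 3.0])   # hypothetical (means, stds) for one class
class_b = ([0.07, 12.2], [0.036, 1.8])   # hypothetical (means, stds) for the other

full_m = sum(gaussian_log_pdf(x, m, s) for x, m, s in zip(sample, *class_m))
full_b = sum(gaussian_log_pdf(x, m, s) for x, m, s in zip(sample, *class_b))
short_m = log_likelihood_no_const(sample, *class_m)
short_b = log_likelihood_no_const(sample, *class_b)

# The dropped constant depends only on the number of features, not on the class.
const = -0.5 * len(sample) * math.log(2 * math.pi)
print(full_m - full_b, short_m - short_b)
```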

### Fitting the model using two features

In [4]:
path_tr_csv = r"Cancer_train.csv"
path_test_csv =  r"Cancer_test.csv"
# training_data = process_csv(path_tr_csv)
# class_M_tr_data_arr = training_data.select_class('M', no_of_features_is_two=True, scaling=False)
# class_B_tr_data_arr = training_data.select_class('B', no_of_features_is_two=True, scaling=False)
# model = Naive_Bayes_Classifier(class_M_tr_data_arr,
#                                         class_B_tr_data_arr)

In [5]:
def fit_model_get_acc(path_tr, path_test, num_of_features_is_2, scaling):
    #Fits the classifier on the training data and returns its accuracy (in percent) on the test data.
    training_data = process_csv(path_tr)
    class_M_tr_data_arr = training_data.select_class('M', num_of_features_is_2, scaling)
    class_B_tr_data_arr = training_data.select_class('B', num_of_features_is_2, scaling)
    test_data = process_csv(path_test)
    class_M_test_data_arr = test_data.select_class('M', num_of_features_is_2, scaling)
    class_B_test_data_arr = test_data.select_class('B', num_of_features_is_2, scaling)
    #Fitting the model on the selected features.
    classifier = Naive_Bayes_Classifier(class_M_tr_data_arr,
                                        class_B_tr_data_arr)
    classifier.fit()
    print(classifier.params)
    #Finding the accuracy of the prediction using labelled test data.
    test_data = {'M': class_M_test_data_arr, 'B': class_B_test_data_arr}
    pred_accuracy = classifier.pred_accuracy(test_data)
    return round(pred_accuracy, 2)

In [6]:
accuracy = fit_model_get_acc(path_tr_csv,
                             path_test_csv,
                             num_of_features_is_2=True,scaling=False)
print('The accuracy of the model fitted using two features(without scaling) is:', accuracy)

{'M': {'Prior': 0.3755020080321285, 'Mean': array([ 0.18146984, 17.36069519]), 'Std': array([0.04524136, 3.04235151])}, 'B': {'Prior': 0.6244979919678715, 'Mean': array([ 0.07428264, 12.15746302]), 'Std': array([0.0355992 , 1.75488937])}}
The accuracy of the model fitted using two features(without scaling) is: 92.86


In [7]:
accuracy = fit_model_get_acc(path_tr_csv,
                             path_test_csv,
                             num_of_features_is_2=True,scaling = True)
print('The accuracy of the model fitted using two features(with scaling) is:', accuracy)

{'M': {'Prior': 0.3755020080321285, 'Mean': array([0.58196191, 0.37358364]), 'Std': array([0.17267036, 0.17729321])}, 'B': {'Prior': 0.6244979919678715, 'Mean': array([0.42447223, 0.47625936]), 'Std': array([0.20342399, 0.16145822])}}
The accuracy of the model fitted using two features(with scaling) is: 72.86
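
One likely contributor to the lower accuracy with scaling is that `select_class` scales each class subset on its own, so every class is mapped onto $[0, 1]$ using its own minimum and maximum. The sketch below illustrates the effect on made-up one-dimensional data; the helper `min_max_scale` and the values are hypothetical, and this is an illustration under that assumption, not a result obtained by rerunning the notebook.

```python
def min_max_scale(values, lo=None, hi=None):
    # Min-max scaling; by default uses the minimum and maximum of `values` itself.
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]

def mean(values):
    return sum(values) / len(values)

# Made-up one-dimensional feature with clearly separated class means.
class_m = [15.0, 17.0, 19.0, 21.0]
class_b = [9.0, 11.0, 13.0, 15.0]

# Per-class scaling: each class is mapped onto [0, 1] independently.
m_separate = min_max_scale(class_m)
b_separate = min_max_scale(class_b)

# Shared scaling: both classes use the minimum and maximum of the pooled data.
lo, hi = min(class_m + class_b), max(class_m + class_b)
m_shared = min_max_scale(class_m, lo, hi)
b_shared = min_max_scale(class_b, lo, hi)

gap_separate = mean(m_separate) - mean(b_separate)
gap_shared = mean(m_shared) - mean(b_shared)
print(gap_separate, gap_shared)
```

In this toy case per-class scaling removes the gap between the class means entirely, while shared scaling preserves it. Scaling parameters computed once on the pooled training data and reused for the test data would avoid the problem.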


### Fitting the model using all features

In [8]:
accuracy = fit_model_get_acc(path_tr_csv,
                             path_test_csv,
                             num_of_features_is_2=False,scaling= False)
print('The accuracy of the model fitted using all features(without scaling) is:', accuracy)

{'M': {'Prior': 0.3755020080321285, 'Mean': array([2.16757754e+01, 1.14680909e+02, 9.63925668e+02, 1.03207861e-01,
       1.45522941e-01, 1.60242086e-01, 8.73688770e-02, 1.92603743e-01,
       6.28813904e-02, 6.00933155e-01, 1.20367540e+00, 4.25386631e+00,
       7.04881818e+01, 6.76953476e-03, 3.25637807e-02, 4.20966310e-02,
       1.50284866e-02, 2.03546096e-02, 4.08903209e-03, 2.09856150e+01,
       2.93179144e+01, 1.40272781e+02, 1.39690160e+03, 1.45265722e-01,
       3.76061230e-01, 4.51971016e-01, 1.81469840e-01, 3.22417112e-01,
       9.17448128e-02]), 'Std': array([3.57142728e+00, 2.07116443e+01, 3.38446779e+02, 1.26657444e-02,
       5.41539879e-02, 7.38036696e-02, 3.29195482e-02, 2.80912919e-02,
       7.59419383e-03, 3.25573409e-01, 4.82332341e-01, 2.43512361e+00,
       5.34282106e+01, 2.94496609e-03, 1.90931328e-02, 2.24378078e-02,
       5.62771642e-03, 1.03143788e-02, 2.11773069e-03, 4.00944647e+00,
       5.15325902e+00, 2.74803176e+01, 5.41468371e+02, 2.17902914e-02,
 

In [9]:
accuracy = fit_model_get_acc(path_tr_csv,
                             path_test_csv,
                             num_of_features_is_2=False,scaling=True)
print('The accuracy of the model fitted using all features(with scaling) is:', accuracy)

{'M': {'Prior': 0.3755020080321285, 'Mean': array([0.35727548, 0.36690317, 0.28180297, 0.41552135, 0.32035411,
       0.33827041, 0.37071633, 0.35683455, 0.27214386, 0.15196072,
       0.26250831, 0.14142528, 0.11043213, 0.14413571, 0.1901257 ,
       0.23410371, 0.27583515, 0.17550247, 0.25542688, 0.41453511,
       0.40763328, 0.42802778, 0.30397811, 0.42451051, 0.32259308,
       0.37345859, 0.58196191, 0.32705916, 0.24075044]), 'Std': array([0.13039165, 0.17762988, 0.15834508, 0.1784159 , 0.18414087,
       0.18321749, 0.18198656, 0.16218991, 0.15994511, 0.12151889,
       0.15045146, 0.11794651, 0.10443152, 0.10346647, 0.15036568,
       0.16897212, 0.15752439, 0.14513394, 0.18018639, 0.20404308,
       0.17109094, 0.21319098, 0.18518703, 0.16215427, 0.17058465,
       0.16141653, 0.17267036, 0.14600883, 0.14295367])}, 'B': {'Prior': 0.6244979919678715, 'Mean': array([0.3129276 , 0.48483351, 0.37707046, 0.35799012, 0.29395501,
       0.11096103, 0.29814255, 0.40354749, 0.25016077,