# ID2214 Assignment 2 Group no. [1]
### Project members: 
[Romain Rey, rrey@kth.se]
[Álvaro Orgaz Expósito, alvarooe@kth.se]
[Mastafa Foufa, foufa@kth.se]
[Ankita Pillay, pillay@kth.se]

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.


### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above,
and thereby 


## Load NumPy, pandas and time

In [4]:
import numpy as np
import pandas as pd
import time


## Reused functions from Assignment 1

In [5]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

def create_normalization(df_init, normalizationtype = 'minmax'):
    df = df_init.copy()
    normalization = {}
    for col in df.columns:
        # Only numeric feature columns are normalized; ID and CLASS are left untouched
        if(col not in ['ID', 'CLASS'] and df[col].dtype in ['float64', 'float32', 'int']):
            if(normalizationtype == 'minmax'):
                # Named col_min/col_max so the built-in min() and max() are not shadowed
                col_min = df[col].min()
                col_max = df[col].max()
                diff = col_max-col_min
                df[col] = (df[col]-col_min)/diff
                normalization[col] = ('minmax', col_min, col_max)
                
            elif(normalizationtype == 'zscore'):
                mean = df[col].mean()
                std = df[col].std()
                df[col] = (df[col]-mean)/std
                normalization[col] = ('zscore', mean, std)
              
    return(df, normalization)

def apply_normalization(df_init, normalization):
    df = df_init.copy()
    for col in df.columns:
        # Normalize exactly the columns that have an entry in the mapping
        if(col not in ['ID', 'CLASS'] and col in normalization):
            normalizationtype, arg1, arg2 = normalization[col]
            
            if(normalizationtype == 'minmax'):
                diff = arg2-arg1
                # clip limits the output range to [0,1], see hint 2
                df[col] = ((df[col]-arg1)/diff).clip(0, 1)
                
            elif(normalizationtype == 'zscore'):
                df[col] = (df[col]-arg1)/arg2
              
    return(df)

def create_imputation(df_init):
    df = df_init.copy()
    imputation = {}
    nrow = df.shape[0]
    for col in df.columns:
        type_col = df[col].dtype
        
        # Numeric feature column: impute with the mean (0 if every value is missing)
        if(col not in ['ID', 'CLASS'] and type_col in ['float64', 'float32', 'int64', 'int32']):
            na = np.sum(df[col].isna())
            if(na == nrow):
                mean = 0
            else:
                mean = df[col].mean()
            # Assign the result back; inplace=True on a column selection may not update df
            df[col] = df[col].fillna(mean)
            imputation[col] = mean
            
        # Non-numeric feature column (object or category): impute with the mode
        elif(col not in ["ID","CLASS"]):                        
            na = np.sum(df[col].isna())
            if(na == nrow):
                if(type_col == 'object'):
                    mode = ''
                elif(type_col == 'category'):
                    mode = df[col].cat.categories[0]
            else:
                mode = df[col].mode()[0]
            df[col] = df[col].fillna(mode)
            imputation[col] = mode
            
    return(df, imputation)

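def apply_imputation(df_init, imputation):
    df = df_init.copy()
    for col in df.columns:
        # Skip columns without a stored imputation value instead of raising a KeyError
        if(col not in ['ID','CLASS'] and col in imputation):
            # Assign the result back; inplace=True on a column selection may not update df
            df[col] = df[col].fillna(imputation[col])
    return(df)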

def create_bins(df_init, nobins = 10, bintype = 'equal-width'):
    df = df_init.copy()
    binning = {}
    for col in df.columns:
        type_col = df[col].dtype
        
        if(col not in ['ID', 'CLASS'] and type_col in ['float64', 'float32', 'int64', 'int32']):
            if(bintype == 'equal-width'):
                df[col], bins = pd.cut(df[col],nobins,retbins=True, labels = False)
            else:
                # duplicate = 'drop' in case there are duplicates in the edges
                df[col], bins = pd.qcut(df[col],nobins,retbins=True, labels = False, duplicates='drop')
            bins[0], bins[-1] = -np.inf, np.inf
            binning[col] = bins
            df[col] = df[col].astype('category')
            df[col] = df[col].cat.set_categories([str(i) for i in df[col].cat.categories], rename = True)
        else:
            df[col] = df[col].astype('category')

    return(df, binning)

def apply_bins(df_init, binning):
    df = df_init.copy()
    for col in df.columns:
        type_col = df[col].dtype
        
        if(col not in ['ID', 'CLASS'] and type_col in ['float64', 'float32', 'int64', 'int32']):
            bins = binning[col]
            df[col] = pd.cut(df[col],bins,labels=False)
            df[col] = df[col].astype('category')
            df[col] = df[col].cat.set_categories([str(i) for i in df[col].cat.categories], rename = True) 
        else:
            df[col] = df[col].astype('category')

    return(df)

def split(df_init, testfraction = 0.5):
    n = df_init.shape[0]
    n_test = int(n * testfraction)
    indexes = [i for i in range(n)]
    np.random.shuffle(indexes)
    test_idx = indexes[:n_test]
    training_idx = indexes[n_test:]
    return(df_init.iloc[training_idx], df_init.iloc[test_idx])

def accuracy(predictions, correctlabels):
    correct = 0
    n = predictions.shape[0]
    for i in range(n):
        d = predictions.iloc[i]
        pred  = d.idxmax()
        if(pred == correctlabels[i]):
            correct += 1
    return(correct/n)

def create_one_hot(df_init):
    df = df_init.copy()
    df2 = df.copy()
    one_hot = {}
    for col in df.columns:
        # One-hot encode only non-numeric (object or category) feature columns
        if(df[col].dtype.name in ['object', 'category'] and col not in ['ID','CLASS']): 
            df[col] = df[col].astype('category')
            one_hot[col] = df[col].cat.categories
            for i in one_hot[col]:
                name = col+'_'+i   
                new_col = df[col]==i
                new_col = new_col.astype('float')
                df2[name]=new_col 
            df2 = df2.drop(columns = col, axis = 1) 
    return(df2, one_hot)

def apply_one_hot(df_init, one_hot):
    df = df_init.copy()
    df2 = df.copy()
    for col in df.columns:
        if(col in one_hot.keys()):
            for i in one_hot[col]:
                name = col+'_'+i  # same separator as in create_one_hot
                new_col = df[col]==i
                new_col = pd.Series(new_col.astype('float'))
                df2[name] = new_col
            df2 = df2.drop(columns = col, axis = 1)
            
    return(df2)

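def folds(df, nofolds):
    # Shuffle the row positions so that the folds are random rather than contiguous blocks
    shuffling = np.random.permutation(df.shape[0])
    shuffled = df.iloc[shuffling]
    intervals = [int(i*df.shape[0]/nofolds) for i in range(nofolds+1)]
    listDf = [shuffled.iloc[intervals[i]:intervals[i+1],:] for i in range(nofolds)]
    return(listDf)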

def brier_score(df, correctlabels):
    n = df.shape[0]
    avg_error = 0
    for i in range(n):
        v = np.array(df.iloc[i,:])
        idx = np.where(df.columns==correctlabels[i])[0]
        correct = [1 if i == idx else 0 for i in range(df.shape[1])]
        error = np.sum((v-correct)**2)
        avg_error += error
    avg_error = avg_error / n
    return(avg_error)

def get_true_false_positive(prediction_vector, label_vector, label):
    pred = np.array(prediction_vector)
    v = [i == label for i in label_vector] 
    this_label, not_this_label = pred[v], pred[[not i for i in v]]
    return(this_label, not_this_label)

def get_list(prediction_vector, label_vector, label):
    this_label, not_this_label = get_true_false_positive(prediction_vector, label_vector, label)
    scores_false = [0]
    scores_false += sorted(prediction_vector)
    scores_false += [1]
    dict_scores = {}
    for i in scores_false:
        false_pos_i_r = np.sum(not_this_label>=i)/len(not_this_label)
        true_pos_i_r = np.sum(this_label>=i)/len(this_label)
        dict_scores[i] = [false_pos_i_r, true_pos_i_r]
    list_reversed = [i for i in reversed(list(dict_scores.values()))]
    return(list_reversed)

def get_area(list_values):
    n = len(list_values)
    area = 0
    for i in range(n-1):
        left, right = i, i+1
        tpr_left, tpr_right = list_values[left][1], list_values[right][1]
        fpr_left, fpr_right = list_values[left][0], list_values[right][0]
        if(fpr_right==fpr_left):
            # Zero-width segment contributes no area, so skip it
            continue
        height = (tpr_left+tpr_right)/2
        width = fpr_right-fpr_left
        area += height*width
    return(area)

def get_frequencies(correctlabels):
    x = pd.Series(correctlabels)
    frequencies = x.value_counts()/len(correctlabels)
    return(frequencies)
    
def auc(df, correctlabels):
    frequencies = get_frequencies(correctlabels)
    area = 0
    for col in df.columns:
        prediction_vector = df[col]
        l = get_list(prediction_vector, correctlabels, col)
        area_col = get_area(l)
        area += frequencies[col]*area_col
    return(area)

## 1. Define the class kNN

In [36]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# imputation, normalization, one_hot, labels, training_labels, training_data
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.normalization should be a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot should be a one-hot mapping (see Assignment 1; can be excluded if this function was not completed)
# self.training_labels should be a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels should be the categories of the previous series
# self.training_data should be the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
# normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns 
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# k: an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are estimated by the relative class frequencies in the set of class labels from the k nearest 
#              (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation, normalization and (possibly) one-hot
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies


In [6]:
class kNN:
    def __init__(self):
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        
    def fit(self, df_init, normalizationtype = 'minmax'):
        df = df_init.copy()
        df, self.imputation = create_imputation(df)
        df, self.normalization = create_normalization(df, normalizationtype)
        df, self.one_hot = create_one_hot(df)
        classColumn = df.loc[:,'CLASS'].astype('category')
        self.training_labels = classColumn
        self.labels = classColumn.cat.categories
        df = df.drop(labels = ['CLASS', 'ID'], axis = 1)
        self.training_data = df
        
    def getKNearestNeighbors(self, v1, k):
        df = self.training_data
        mat = df.values
        nrow = df.shape[0]
        matVect = np.tile(v1, nrow).reshape(nrow,len(v1))
        dist = np.sum((mat-matVect)**2, axis = 1)
        kNearestIndexes = dist.argsort()[:k]
        return(kNearestIndexes)
    
    def getProbClass(self, kNearestIndexes):
        k = kNearestIndexes.shape[0]
        labelsData, labels = self.training_labels, self.labels
        labelsProb = np.zeros(len(labels))
        for i in range(k):
            # kNearestIndexes holds row positions, so index the labels by position (.iloc)
            labelVect = np.array((labels == labelsData.iloc[kNearestIndexes[i]]), dtype='int64')
            labelsProb += labelVect
        labelsProb = labelsProb / k
        return(labelsProb)
    
    def predict(self, df_init, k = 5):
        df = df_init.copy()
        nrow = df.shape[0]
        # Drop "CLASS" and "ID" only if present, so unlabeled data can be predicted too
        df = df.drop(['CLASS', 'ID'], axis = 1, errors = 'ignore')
        df = apply_imputation(df, self.imputation)
        df = apply_normalization(df, self.normalization)
        df = apply_one_hot(df, self.one_hot)
        matrix = np.array([np.zeros(len(self.labels)) for i in range (nrow)])
        for i in range(nrow):
            v1 = df.iloc[i,:].values
            kNearestIndexes = self.getKNearestNeighbors(v1 = v1, k = k)
            labelsProb = self.getProbClass(kNearestIndexes = kNearestIndexes)
            matrix[i] = labelsProb
        probs = pd.DataFrame(matrix, columns=self.labels)
        return(probs)
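
The neighbor search and the class-frequency estimate described in Hints 2 and 3 can also be written with NumPy broadcasting instead of a Python loop over test rows. The following is only a minimal sketch under stated assumptions, not the class used above: the function name `knn_class_probabilities` and the toy arrays are hypothetical, and the inputs are assumed to be already imputed, normalized numeric arrays.

```python
import numpy as np

def knn_class_probabilities(train_X, train_y, test_X, labels, k=5):
    """Estimate class probabilities from the k nearest training rows.

    train_X: (n_train, d) array, train_y: (n_train,) array of labels,
    test_X: (n_test, d) array, labels: ordered list of class labels.
    """
    # Broadcasting yields all pairwise squared Euclidean distances: (n_test, n_train)
    diff = test_X[:, None, :] - train_X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Positions of the k closest training rows for every test row
    nearest = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    neighbor_labels = train_y[nearest]  # shape (n_test, k)
    # Relative frequency of each label among the k neighbors
    return np.stack([(neighbor_labels == lab).mean(axis=1) for lab in labels], axis=1)

# Hypothetical toy data: two tight clusters, one per class
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
train_y = np.array(["a", "a", "b", "b"])
test_X = np.array([[0.05, 0.0], [1.0, 0.9]])
probs = knn_class_probabilities(train_X, train_y, test_X, labels=["a", "b"], k=3)
print(probs)
```

Squared distances are enough here because the square root does not change the ordering of the neighbors. The full distance matrix needs memory proportional to n_test × n_train × d, so this form suits small data sets such as the one used in this notebook.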

In [8]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

results


Training time: 0.03 s.
Testing time (k=1): 0.07 s.
Testing time (k=3): 0.08 s.
Testing time (k=5): 0.08 s.
Testing time (k=7): 0.11 s.
Testing time (k=9): 0.11 s.


| k | Accuracy | Brier score | AUC |
|---|---|---|---|
| 1 | 0.747664 | 0.504673 | 0.7582 |
| 3 | 0.663551 | 0.488058 | 0.813829 |
| 5 | 0.579439 | 0.471028 | 0.833843 |
| 7 | 0.598131 | 0.471867 | 0.833481 |
| 9 | 0.616822 | 0.482981 | 0.827727 |


In [37]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

results


Training time: 0.02 s.
Testing time (k=1): 0.15 s.
Testing time (k=3): 0.14 s.
Testing time (k=5): 0.14 s.
Testing time (k=7): 0.14 s.
Testing time (k=9): 0.14 s.


| k | Accuracy | Brier score | AUC |
|---|---|---|---|
| 1 | 0.747664 | 0.504673 | 0.81035 |
| 3 | 0.663551 | 0.488058 | 0.815859 |
| 5 | 0.579439 | 0.471028 | 0.833843 |
| 7 | 0.598131 | 0.471867 | 0.833481 |
| 9 | 0.616822 | 0.482981 | 0.827727 |


In [9]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.2f}".format(brier_score(predictions,train_labels)))


Accuracy on training set (k=1): 1.00
AUC on training set (k=1): 1.00
Brier score on training set (k=1): 0.00


### Comment on assumptions, things that do not work properly, etc.


##### ACCURACY:
Accuracy decreases from about 75% for k=1 to 58% for k=5 and then increases to 62% for k=9. 
Explanation: k=1 gives the best accuracy here, which means that the label of the single closest neighbor is the most informative choice on this test set. It also suggests that the data contain few outliers or mislabeled points, since those would hurt k=1 the most. 
Decrease and increase: this is difficult to interpret. One possible explanation is the curse of dimensionality: in high dimension most points lie near the boundary of the hypercube and distances become similar, so beyond a certain k the k nearest neighbors are hardly closer than the remaining points and their vote becomes less informative. 
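
The distance-concentration argument can be illustrated with a small experiment. This is a generic sketch on uniformly random points, not on the glass data; the helper name `distance_contrast`, the number of points and the list of dimensions are assumptions chosen only for illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(dim, n_points=500, seed=0):
    """Return (farthest - nearest) / nearest distance from one query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))  # uniform in the unit hypercube
    query = rng.random(dim)
    dist = np.sqrt(np.sum((points - query) ** 2, axis=1))
    return (dist.max() - dist.min()) / dist.min()

# The contrast shrinks as the dimension grows: neighbors become less distinguishable
for dim in [2, 10, 100, 1000]:
    print(dim, round(distance_contrast(dim), 2))
```

A small contrast means that the nearest and the farthest points are almost equally far away, which is the situation in which a nearest-neighbor vote carries little information. Whether this effect is strong enough to explain the results above depends on the actual number of features, which this sketch does not examine.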

##### Brier Score: 
This measure is interesting because it evaluates the predicted class probabilities rather than only the predicted label. However, the Brier score is very similar for the different k values (between 0.47 and 0.50). This may reflect the instability of kNN in high dimension: it is easier to separate data linearly in high dimension, but the points may be so spread out in the space that finding meaningful nearest neighbors becomes increasingly difficult. 
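
One relation helps when reading the k=1 row: with k=1 every prediction is a one-hot vector, so a correct row contributes 0 and a wrong row contributes 2 to the Brier score, which gives Brier = 2 × (1 − accuracy). The reported values agree with this (2 × (1 − 0.747664) ≈ 0.5047). The sketch below checks the relation on made-up predictions; the helper `brier` is a hypothetical vectorized stand-in for `brier_score`, and class labels are assumed to be given as column positions.

```python
import numpy as np

def brier(pred, true_idx):
    """Mean squared distance between probability rows and the one-hot truth."""
    onehot = np.eye(pred.shape[1])[true_idx]
    return np.mean(np.sum((pred - onehot) ** 2, axis=1))

# Four hard (one-hot) predictions over three classes, three of them correct
pred = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
true_idx = np.array([0, 1, 2, 1])
acc = np.mean(pred.argmax(axis=1) == true_idx)
score = brier(pred, true_idx)
print(acc, score, 2 * (1 - acc))
```

For k > 1 the predictions are no longer one-hot, so the Brier score can improve even when accuracy drops, as it does between k=1 and k=5 in the table above.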

##### AUC: 
The AUC does not change much either (between 0.76 and 0.83 across the two runs above), which is expected given our arguments about the curse of dimensionality. The classifier still does a good job, with an AUC of around 80%.

## 2. Define the class NaiveBayes

In [38]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.class_priors should be a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
# to the relative frequencies of the labels
# self.feature_class_value_counts should be a mapping from the feature (column name) to the number of
# training instances with a specific combination of (non-missing, categorical) value for the feature and class label
# self.feature_class_counts should be a mapping from the feature (column name) to the number of
# training instances with a specific class label and some (non-missing, categorical) value for the feature
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
# predictions with estimated class probabilities for each row in df, where the class probabilities
# are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply discretization
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors


In [12]:
class NaiveBayes:
    
    def __init__(self):
        self.binning = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None
        self.labels = None
        self.tot = None
        
    def fit(self, df_init, nobins = 10, bintype = 'equal-width'):
        df = df_init.copy()
        df, self.binning = create_bins(df, nobins, bintype)
        classColumn = df.loc[:, 'CLASS'].astype('category')
        self.class_priors = dict(classColumn.value_counts(normalize = True))
        self.labels = classColumn.cat.categories
        self.tot = df.shape[0]
        dictCount, dictValueCount = {}, {}
        for col in df.columns:
            if(col not in ['CLASS', 'ID']):
                df2 = df.dropna(axis = 0, how='any', subset = ['CLASS', col])
                dfg = df2.groupby(['CLASS', col]).size()
                dictCount[col] = dict(df2.loc[:,'CLASS'].value_counts())
                dictValueCount[col] = dict(dfg)
        self.feature_class_counts = dictCount
        self.feature_class_value_counts = dictValueCount
        
    def predict(self, df_init):
        df = df_init.copy()
        df = apply_bins(df, self.binning)

        # Drop "CLASS" and "ID" only if present, so unlabeled data can be predicted too
        df = df.drop(labels = ['CLASS', 'ID'], axis = 1, errors = 'ignore')
        labels = self.labels
        nrow, ncol, nlabel = df.shape[0], df.shape[1], len(labels)
        matrix = np.zeros([nlabel, nrow, ncol])
        
        # We will create a matrix where a coefficient is the relative frequency 
        # given for a specific (classLabel, Row, and feature)
        for col_num in range(ncol):
            col = df.columns[col_num]
            
            for label_num in range(nlabel):
                label = labels[label_num]

                for row_num in range(nrow):
                    value = df.iloc[row_num, col_num]
                    if((label, value) in self.feature_class_value_counts[col].keys()):
                        features_value_count = self.feature_class_value_counts[col][(label, value)]
                        feature_count = self.feature_class_counts[col][label]
                        relat_frequency = features_value_count / feature_count
                    else:
                        relat_frequency = 0
                    
                    matrix[label_num, row_num, col_num] = relat_frequency
        
        # We multiply the relative frequencies of all features with each other for a specific (rowNumber, classLabel)
        mat_non_normalized = matrix.prod(axis = 2)
        
        # We create a matrix of the class prior value to then multiply one-by-one the terms of the two matrices
        classesVect = np.array([self.class_priors[labels[i]] for i in range(nlabel)])
        matClasses = np.tile(classesVect, nrow).reshape([nrow, nlabel]).T
        mat_non_normalized = (mat_non_normalized * matClasses)
        
        #We create the normalization matrix
        normalization = np.sum(mat_non_normalized, axis = 0)
        # When this sum is 0 we cannot divide by it, so we replace it with 1,
        # which leaves the (all-zero) row unchanged by the division.
        # We also keep a mask of the rows whose sum was 0, to later replace their probabilities
        # with the class priors
        normalizing_mat = np.tile(normalization, nlabel).reshape([nlabel, nrow])
        normalizing_mat_zero = normalizing_mat==0
        normalizing_mat += normalizing_mat_zero.astype('float')
        
        #Then we normalize the final matrix
        normalized_mat = mat_non_normalized / normalizing_mat
        # And we replace the probabilities where the sum were 0 by the class priors
        # We are doing that by adding the values (because if the sum was 0 then the value was 0 also so adding = replace)
        normalizing_adding_priors = normalizing_mat_zero.astype('float')*matClasses
        normalized_mat += normalizing_adding_priors

        result = pd.DataFrame(normalized_mat.T, columns = labels)

        return(result)
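
The normalization step with the fallback to the class priors (Hint 4) can be expressed more compactly with `np.where`. This is a minimal sketch under stated assumptions rather than the method used in the class above: the function name `normalize_with_prior_fallback` and the example numbers are hypothetical, and the scores are laid out as rows × classes (the transpose of `mat_non_normalized`).

```python
import numpy as np

def normalize_with_prior_fallback(scores, priors):
    """scores: (n_rows, n_classes) non-normalized products; priors: (n_classes,)."""
    totals = scores.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals == 0, 1.0, totals)  # avoid division by zero
    probs = scores / safe_totals
    # Rows whose scores are all zero fall back to the class priors
    return np.where(totals == 0, priors, probs)

priors = np.array([0.5, 0.3, 0.2])
scores = np.array([[0.02, 0.06, 0.02],   # regular row: divide by its sum
                   [0.0, 0.0, 0.0]])     # unseen value combination: use the priors
result = normalize_with_prior_fallback(scores, priors)
print(result)
```

Both versions produce rows that sum to 1; the `np.where` form only avoids building the tiled helper matrices.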

In [13]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

results


Training time (3, 'equal-width'): 0.15 s.
Testing time (3, 'equal-width'): 0.19 s.
Training time (3, 'equal-size'): 0.12 s.
Testing time (3, 'equal-size'): 0.42 s.
Training time (5, 'equal-width'): 0.11 s.
Testing time (5, 'equal-width'): 0.25 s.
Training time (5, 'equal-size'): 0.27 s.
Testing time (5, 'equal-size'): 0.46 s.
Training time (10, 'equal-width'): 0.18 s.
Testing time (10, 'equal-width'): 0.19 s.
Training time (10, 'equal-size'): 0.11 s.
Testing time (10, 'equal-size'): 0.18 s.


| nobins | bintype | Accuracy | Brier score | AUC |
|---|---|---|---|---|
| 3 | equal-width | 0.616822 | 0.622116 | 0.723041 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.769727 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.794519 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.810974 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.729883 |


In [39]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

results


Training time (3, 'equal-width'): 0.09 s.
Testing time (3, 'equal-width'): 0.07 s.
Training time (3, 'equal-size'): 0.06 s.
Testing time (3, 'equal-size'): 0.08 s.
Training time (5, 'equal-width'): 0.07 s.
Testing time (5, 'equal-width'): 0.08 s.
Training time (5, 'equal-size'): 0.08 s.
Testing time (5, 'equal-size'): 0.08 s.
Training time (10, 'equal-width'): 0.15 s.
Testing time (10, 'equal-width'): 0.21 s.
Training time (10, 'equal-size'): 0.14 s.
Testing time (10, 'equal-size'): 0.12 s.


| nobins | bintype | Accuracy | Brier score | AUC |
|---|---|---|---|---|
| 3 | equal-width | 0.616822 | 0.622116 | 0.724335 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.780163 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.771688 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.796675 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812887 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.751165 |


In [14]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.85
AUC on training set: 0.97
Brier score on training set: 0.23


### Comment on assumptions, things that do not work properly, etc.

##### Assumptions: 

Conditional independence of an observation's features given a class c. In other words, Naive Bayes approximates the likelihood by treating each feature dimension as conditionally independent given the class. This hypothesis is clearly debatable for real data, and when it does not hold it can lead to poorer results. 

Overall, our algorithm does a fair job, with around 60% accuracy (59% to 65%) and a fairly good AUC (0.72 to 0.81). 
The Naive Bayes approach tackles the dimensionality problem by assuming independence of the individual feature dimensions.
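
Written as a formula, the estimate computed in `predict` has the following form. The notation is an assumption introduced here for illustration: $c$ is a class label, $x_j$ the (binned) value of feature $j$, $d$ the number of features, $N_j(c, x_j)$ the entry of `feature_class_value_counts` for feature $j$, and $N_j(c)$ the entry of `feature_class_counts`.

$$
\hat{P}(c \mid x_1, \dots, x_d) = \frac{P(c) \prod_{j=1}^{d} \hat{P}(x_j \mid c)}{\sum_{c'} P(c') \prod_{j=1}^{d} \hat{P}(x_j \mid c')},
\qquad
\hat{P}(x_j \mid c) = \frac{N_j(c, x_j)}{N_j(c)}
$$

If the denominator is zero, which happens when every class has at least one feature value that was never observed with it, the class priors $P(c)$ are returned instead.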


##### Equal width binning: 
The higher the number of bins, the better the accuracy, the Brier score and the AUC. Accuracy, for example, goes from 62% for 3 bins to 65% for 10 bins. This is logical, as a finer discretization preserves more of the information in the original values. This can only help up to a certain point, however. Naive Bayes classifiers are not unstable on high-dimensional data sets, but more bins mean more (feature value, class) counts to store and look up, so storage and computing time can become an issue. In the second run above, training and testing times are indeed longer for 10 bins than for 3 or 5, although the timings are small and noisy and the first run does not show the same trend. 

##### Equal size binning:
Less interesting here. Accuracy drifts down slightly as the number of bins grows (from 61% to 59%), the AUC shows no clear trend, and the Brier score gets clearly worse for 10 bins (0.74). Equal-size binning assigns the same number of elements to each bin, which is somewhat counterintuitive here, as well-separated elements may end up in the same bin.
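
The difference between the two binning strategies is easiest to see on a small skewed column. The values below are made up for illustration and are not taken from the glass data; `pd.cut` and `pd.qcut` are the same pandas functions used in `create_bins`.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed feature: nine small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype="float64")

# Equal-width: bins span equal value ranges, so the outlier pushes
# all other values into the first bin
width_codes = pd.cut(values, 3, labels=False)
# Equal-size: bins hold roughly equal numbers of rows, so distant values
# such as 9 and 100 can share a bin
size_codes = pd.qcut(values, 3, labels=False, duplicates="drop")

print(width_codes.value_counts().sort_index().to_dict())
print(size_codes.value_counts().sort_index().to_dict())
```

Equal-width bins keep far-apart values separate but can leave most bins nearly empty, while equal-size bins are balanced but may merge well-separated values, which is the effect described above.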