# Assignment 2 Group no. 10
### Project members: 
Veronika Cucorova <cucorova@kth.se> 

Tim Roelofs <tjtro@kth.se> 

Léo Vuylsteker <leov@kth.se> 


### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appear in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.


### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above,
and thereby 


## Load NumPy, pandas and time

In [None]:
import numpy as np
import pandas as pd
import time

## Reused functions from Assignment 1

In [2]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

def create_normalization(df, normalizationtype="minmax"):
    #   we need to do the modifications on a deep copy of the dataframe 
    deep_copy_df = df.copy(deep=True)
    if normalizationtype == 'minmax':
        #   define the min and max function on a dataframe
        function1 = lambda dataframe: dataframe.min()
        function2 = lambda dataframe: dataframe.max()
    elif normalizationtype == 'zscore':
        #   define the mean and std function on a dataframe
        function1 = lambda dataframe: dataframe.mean()
        function2 = lambda dataframe: dataframe.std()
    else:
        #   if the keyword is not recognize, we make sure that the programme raises an error to indicate
        #   where the problem is
        raise ValueError('The normalizationtype variable has not been recognized. Please use either minmax '
                         'or zscore.')
    #   we remove the CLASS and ID columns, which have special meanings, from the working dataframe
    wrk_df = deep_copy_df[deep_copy_df.columns.difference(['CLASS', 'ID'])]
    #   we select the columns with contain numeric types
    wrk_df = wrk_df.select_dtypes(np.number)
    dict = {}
    for key in wrk_df.columns:
        #   we add the column name to the dictionary keys along with the corresponding normalization
        dict[key] = (normalizationtype, function1(wrk_df[key]), function2(wrk_df[key]))
        if normalizationtype == 'minmax':
            #   we apply the corresponding min-max normalization to the column
            deep_copy_df[key] = deep_copy_df[key].map(
                lambda x: (x-function1(wrk_df[key]))/(function2(wrk_df[key] - function1(wrk_df[key])))
            )
        else:
            #   we apply the corresponding z-normalization to the column
            deep_copy_df[key] = deep_copy_df[key].map(
                lambda x: (x-function1(wrk_df[key]))/function2(wrk_df[key])
            )
    return deep_copy_df, dict


def apply_normalization(df, normalization):
    #   we need to do the modifications on a deep copy of the dataframe 
    deep_copy_df = df.copy(deep=True)
    for key in normalization.keys():
        normalizationtype = normalization[key][0]
        if normalizationtype == 'minmax':
            min = normalization[key][1]
            max = normalization[key][2]
            #   we apply the corresponding min-max normalization to the column
            deep_copy_df[key] = deep_copy_df[key].map(
                lambda x: (x-min)/(max-min)
            )
        elif normalizationtype == 'zscore':
            #   we apply the corresponding z-normalization to the column
            mean = normalization[key][1]
            std = normalization[key][2]
            deep_copy_df[key] = deep_copy_df[key].map(
                lambda x: (x-mean)/std
            )
        else:
            #   if the keyword is not recognize, we make sure that the programme raises an error to indicate
            #   where the problem is
            raise ValueError('The normalizationtype variable has not been recognized. Please use either'
                             ' minmax or zscore.')
    return deep_copy_df


def create_imputation(df):
    wrk_df = df.copy()
    dict = {}
    columns = wrk_df.columns.difference(['CLASS', 'ID'])
    for key in columns:
        #treat column as numeric
        if (wrk_df[key].dtypes == np.float64 or wrk_df[key].dtypes == np.int64):
            #first thing that came to my mind on checking if all values are null
            if (wrk_df[key].isnull().sum() == len(wrk_df[key])):
                repl_val = 0
            else:
                repl_val = wrk_df[key].mean()          
        #treat value as categorical
        elif (wrk_df[key].dtypes == np.bool or wrk_df[key].dtypes.name == 'category'):
            if (wrk_df[key].isnull().sum() == len(wrk_df[key])):
                repl_val = wrk_df[key].categories[0]
            else: 
                repl_val = wrk_df[key].mode()[0]
        #treat value as a string
        elif (wrk_df[key].dtypes == np.object):
            if (wrk_df[key].isnull().sum() == len(wrk_df[key])):
                repl_val = ""
            else: 
                repl_val = wrk_df[key].mode()[0] #mode can return multiple things
        else: 
            raise ValueError('Unknown column type.')

        dict[key] = (repl_val)
        wrk_df[key].fillna(repl_val, inplace = True)
    return wrk_df, dict


def apply_imputation(df, repl_vals):
    wrk_df = df.copy()
    for key in repl_vals:
        wrk_df[key].fillna(repl_vals[key], inplace = True)
    return wrk_df


def create_bins(df, nobins=10, bintype="equal-width"):
    newdf = df.copy()
    binning = {}
    binfunctions = {'equal-width':pd.cut, 'equal-size':pd.qcut}
    for key in newdf.columns:
        if newdf[key].dtype == np.number and key not in ['CLASS', 'ID']:
            res, bins = binfunctions[bintype](newdf[key], nobins, labels=False, retbins=True, duplicates="drop")
            newdf[key] = res
            #hint 6 implementation. This part assumes that there are more than 2 bins
            bins[0] = -np.inf
            bins[-1] = np.inf 
            
            binning[key] = bins
            
    #Hint 4 implementation
    for key in newdf.columns:
        newdf[key] = newdf[key].astype('category')
    
    return newdf, binning


def apply_bins(df, binning):
    newdf = df.copy()
    for key in newdf.columns:
        if key in binning:
            newdf[key] = pd.cut(df[key],binning[key], labels=False)
    
    #Hint 3 implementation
    for key in newdf.columns:
        newdf[key] = newdf[key].astype('category')
    
    return newdf


def split(df, testfraction=0.5):
    #   generate a permutation of the row numbers
    permutation = np.random.permutation(df.index)
    #   compute the index separating the testing rows from the training rows in permutation
    sep_index = int(permutation.size*testfraction)
    #   return two slices of the initial data set: 
    #   the first one contains 1-testfraction fraction of the dataframe and correspond to the training set
    #   and the other one contains testfraction fraction of the dataframe and correspond to the testing set.
    return df.iloc[permutation[sep_index:], :], df.iloc[permutation[:sep_index], :]


def accuracy(preds, labels):
    max_prob = []
    max_prob = preds.idxmax(axis=1)
    #perform pairwise comparison of predicted with actual labels
    compared = np.equal(max_prob, labels)
    #return the number of true ones
    return(len(compared[compared == True])/len(compared))


def create_one_hot(df):
    newdf = pd.DataFrame()
    one_hot = {}
    for key in df.columns:
        if df[key].dtype in [np.object] and key not in ['CLASS', 'ID']:
            one_hot[key] = df[key].unique()
            newtab = pd.get_dummies(df[key], prefix=key)
            for i in newtab.columns: #converting to floats
                newtab[i] = newtab[i].astype(float)
            newdf = pd.concat([newdf, newtab], axis=1)
        else:
            newdf = pd.concat([newdf, df[key]], axis=1)
    
    return(newdf, one_hot)


def apply_one_hot(df, one_hot):
    newdf = pd.DataFrame()
    for key in df.columns:
        if key in one_hot:
            cols = pd.CategoricalDtype(categories=one_hot[key])
            newcol = df[key].astype(cols)
            newtab = pd.get_dummies(newcol, prefix=key)
            for i in newtab.columns: #converting to floats
                newtab[i] = newtab[i].astype(float)
                
            newdf = pd.concat([newdf, newtab], axis=1)
        else:
            newdf = pd.concat([newdf, df[key]], axis=1)
    return(newdf)


def folds(df, nofolds=10):
    #   generate a permutation of the row numbers
    permutation = np.random.permutation(df.index)
    #   split the permutation into equal-sized subsets of row numbers
    folds_indexes = np.array_split(permutation, nofolds)
    #   return slices of the data set according to the previous subsets of row numbers
    return [df.iloc[indexes, :] for indexes in folds_indexes]


def brier_score(preds, labels):
    #create boolean mask for each label and assign it as an array to the dict
    data = {l:np.array(labels) == l for l in list(preds.columns)}
    labels_df = pd.DataFrame(data)    
    wrk_df = (labels_df - preds)**2
    sum_ = wrk_df.sum().sum()
    n = wrk_df.shape[0]
    return(sum_/n)


def auc(df, correctlabels):
    correctlabels = list(correctlabels)
    finauc = 0
    freq = {}
    aucdict = {}
    for c in set(correctlabels):
        freq[c] = correctlabels.count(c)
        y_true = np.array([1 if x is c else 0 for x in correctlabels])
        scores = df[c]
        scoredct = {}
        poscount = scores[[bool(x) for x in y_true]].value_counts()
        negcount = scores[[not bool(x) for x in y_true]].value_counts()
        for score in sorted(set(scores), reverse=True):
            posscore = poscount[score] if score in poscount else 0
            negscore = negcount[score] if score in negcount else 0
            scoredct[score] = (posscore, negscore)

        tot_tp = sum([scoredct[key][0] for key in scoredct])
        tot_fp = sum([scoredct[key][1] for key in scoredct])
        auc = 0
        cov_tp = 0
        for i in scoredct:
            tp_i = scoredct[i][0]
            fp_i = scoredct[i][1]
            if fp_i == 0:
                cov_tp += tp_i
            elif tp_i == 0:
                auc += (cov_tp/tot_tp)*(fp_i/tot_fp)
            else:
                auc += (cov_tp/tot_tp)*(fp_i/tot_fp) + (tp_i/tot_tp)*(fp_i/tot_fp)/2
                cov_tp += tp_i
        
        aucdict[c] = auc       
    for key in aucdict:
        finauc += freq[key]/len(correctlabels)*aucdict[key]
    return finauc



## 1. Define the class kNN

In [3]:
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# imputation, normalization, one_hot, labels, training_labels, training_data
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.imputation should be an imputation mapping (see Assignment 1) from df
# self.normalization should be a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
# self.one_hot should be a one-hot mapping (see Assignment 1; can be excluded if this function was not completed)
# self.training_labels should be a pandas series corresponding to the "CLASS" column, set to be of type "category" 
# self.labels should be the categories of the previous series
# self.training_data should be the values (an ndarray) of the transformed dataframe, i.e., after employing imputation, 
# normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns 
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self: the object itself
# df: a dataframe
# k: an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
#              predictions with estimated class probabilities for each row in df, where the class probabilities
#              are estimated by the relative class frequencies in the set of class labels from the k nearest 
#              (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply imputation, normalization and (possibly) one-hot
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies


class kNN:
    
    def ___init___(self):
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
    
    def fit(self, df):
        impdf, imputation = create_imputation(df)
        normdf, normalization = create_normalization(impdf)
        onehotdf, one_hot = create_one_hot(normdf)
        trainingdf = onehotdf.drop(columns=['CLASS', 'ID'])
        traininglabels = df["CLASS"].astype('category')
        dflabels = traininglabels.unique()
        
        self.imputation = imputation
        self.normalization = normalization
        self.one_hot = one_hot
        self.training_labels = traininglabels
        self.labels = dflabels
        self.training_data = trainingdf
    
    def predict(self, df, k):
        wrk_df = df.copy()
        wrk_df.drop(columns = ['CLASS', 'ID'], inplace = True)
        wrk_df = apply_imputation(wrk_df, self.imputation)
        wrk_df = apply_normalization(wrk_df, self.normalization)
        wrk_df = apply_one_hot(wrk_df, self.one_hot)
        
        
        
        
        # input: ndarray x_test containing floa values
        # input: integer k corresponding to the number of neighbours
        # output: dataframe row with classes as columns and probs as data
    def get_nearest_neighbor_predictions(self, x_test,k):
        distances = pd.DataFrame(columns =  ['dist'])
        distances['dist'] = [np.linalng.norm(x_test, y) for y in self.training_data.rows]
        distances.sort(by = "dist")
        nearest = distances[:k]
        nearest['class'] = [self.training_labels[i] for i in list[nearest.index]]
        nearest.drop(columns = ["dist"], inplace = True)
        result = pd.DataFrame(columns = [self.labels])
        for c in self.labels:
            b_map = nearest[nearest['class'] == c]
            result[c] = len(b_map[b_map == True])/k
        return result

                             
        


In [4]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

results



Training time: 0.34 s.
Testing time (k=1): 0.02 s.


AttributeError: 'NoneType' object has no attribute 'idxmax'

In [5]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.2f}".format(brier_score(predictions,train_labels)))


AttributeError: 'NoneType' object has no attribute 'idxmax'

### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes

In [6]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self: the object itself
#
# Output from __init__:
# nothing
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# binning, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self: the object itself
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# nothing
#
# The result of applying this function should be:
#
# self.binning should be a discretization mapping (see Assignment 1) from df
# self.class_priors should be a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
# to the relative frequencies of the labels
# self.feature_class_value_counts should be a mapping from a feature (column name) to another mapping, which
# given a feature value and class label provides the number of training instances with this specific combination
# self.feature_class_counts should me a mapping from the feature (column name) and class label to the number of
# training instances with this specific class label and any (non-missing) value for the feature
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Hint 1: feature_class_value_counts can be a dictionary, which given a feature f returns a mapping obtained 
#         by pandas groupby and size (see lecture slides), which given a feature value v and class label c 
#         returns the number of instances, e.g., using get((c,v),0)
#
# Input to predict:
# self: the object itself
# df: a dataframe
# 
# Output from predict:
# predictions: a dataframe with class labels as column names and the rows corresponding to
# predictions with estimated class probabilities for each row in df, where the class probabilities
# are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply discretization
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors


class NaiveBayes:
    
    def ___init___(self):
        self.binning = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None
    
    def fit(self, df, nobins=10, bintype="equal-width"):
        bindf, binning = create_bins(df,nobins,bintype)
        classfreqs = df["CLASS"].value_counts(normalize=True)
        
        #feature_class_value_counts
        freqdict = {}
        for key in bindf.columns:
            if key not in ['CLASS', 'ID']:
                keydf = bindf.groupby([key, 'CLASS']).size()
                freqdict[key] = keydf
        #to get a count, for example:
        #IN: freqdict['RI'].get((1.51131, 7),0)
        #OUT: 1
        
        classfreq = bindf.groupby(['CLASS']).count()
        
        self.binning = binning
        self.class_priors = classfreqs.to_dict()
        self.feature_class_value_counts = freqdict
        self.feature_class_counts = classfreq

    def predict(self, df):
        #   TODO
        wrk_df = df.copy()
        wrk_df.drop(columns = ['CLASS', 'ID'], inplace = True)
        (Npts, Nfeatures) = wrk_df.shape
        classlist = list(self.class_priors.keys())
        Nclasses = len(classlist)
        
        wrk_df = apply_bins(wrk_df, self.binning)
        
        relative_frequencies = np.array([[[0 if self.feature_class_value_counts[column].get((wrk_df[column][i], k)) is None else self.feature_class_value_counts[column].get((wrk_df[column][i], k)) / 
                                              self.feature_class_counts[column][k] for k in classlist]
                                            for column in wrk_df.columns.values.tolist()] for i in range(Npts)])
        
        pb_i = np.array([[self.class_priors[classlist[c]] * np.prod(relative_frequencies[i, :, c]) for c in range(Nclasses)] for i in range(Npts)])
        pbsum = np.sum(pb_i, axis=1)
        probabilities = pd.DataFrame(list([self.class_priors.values() if pbsum[pts] == 0 else list(pb_i[pts]/pbsum[pts]) for pts in range(Npts)]), columns=classlist)
        
        return probabilities


In [7]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

results


Training time (3, 'equal-width'): 0.04 s.
Testing time (3, 'equal-width'): 2.03 s.
Training time (3, 'equal-size'): 0.03 s.
Testing time (3, 'equal-size'): 2.20 s.
Training time (5, 'equal-width'): 0.03 s.
Testing time (5, 'equal-width'): 2.02 s.
Training time (5, 'equal-size'): 0.03 s.
Testing time (5, 'equal-size'): 1.89 s.
Training time (10, 'equal-width'): 0.03 s.
Testing time (10, 'equal-width'): 1.80 s.
Training time (10, 'equal-size'): 0.03 s.
Testing time (10, 'equal-size'): 1.75 s.


Unnamed: 0,Unnamed: 1,Accuracy,Brier score,AUC
3,equal-width,0.616822,1.686746,0.715449
3,equal-size,0.607477,0.554782,0.780163
5,equal-width,0.64486,6.970497,0.725099
5,equal-size,0.598131,0.581556,0.796675
10,equal-width,0.598131,22.45812,0.695965
10,equal-size,0.579439,3.996173,0.735491


In [8]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.2f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.2f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.2f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.85
AUC on training set: 0.97
Brier score on training set: 0.23


### Comment on assumptions, things that do not work properly, etc.