# Assignment 1 Group no. 10
### Project members: 
Veronika Cucorova <cucorova@kth.se>
Tim Roelofs <tjtro@kth.se>
Léo Vuylsteker <leov@kth.se>

### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy and pandas may be used in the solution for this assignment.

### Instructions
All assignments starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of assignments starting with 
number 2 below, then the assignment will receive 2 points (in total).

It is highly recommended that you do not develop the code directly within the notebook
but that you copy the comments and test cases to your regular development environment
and only when everything works as expected, that you paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).


## Load NumPy and pandas

In [3]:
import numpy as np
import pandas as pd

## 1a. Create and apply normalization

In [4]:
#   Leo

# Insert the functions create_normalization and apply_normalization below (after the comments)
#
# Input to create_normalization:
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype: "minmax" (default) or "zscore"
#
# Output from create_normalization:
# df: a new dataframe, where each numeric value in a column has been replaced by a normalized value
# normalization: a mapping (dictionary) from each column name to a triple, consisting of
#                ("minmax",min_value,max_value) or ("zscore",mean,std)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider columns of type "float" or "int" only (and which are not labeled "CLASS" or "ID"),
#         the other columns should remain unchanged
# Hint 3: Take a close look at the lecture slides on data preparation


def create_normalization(df, normalizationtype="minmax"):
    #   we need to do the modifications on a deep copy of the dataframe 
    deep_copy_df = df.copy(deep=True)
    if normalizationtype == 'minmax':
        #   define the min and max function on a dataframe
        function1 = lambda dataframe: dataframe.min()
        function2 = lambda dataframe: dataframe.max()
    elif normalizationtype == 'zscore':
        #   define the mean and std function on a dataframe
        function1 = lambda dataframe: dataframe.mean()
        function2 = lambda dataframe: dataframe.std()
    else:
        #   if the keyword is not recognized, we raise an error that states which values are accepted
        raise ValueError('The normalizationtype variable has not been recognized. Please use either minmax '
                         'or zscore.')
    #   we remove the CLASS and ID columns, which have special meanings, from the working dataframe
    wrk_df = deep_copy_df[deep_copy_df.columns.difference(['CLASS', 'ID'])]
    #   we select the columns which contain numeric types
    wrk_df = wrk_df.select_dtypes(np.number)
    #   named "normalization" to avoid shadowing the built-in dict
    normalization = {}
    for key in wrk_df.columns:
        #   compute both statistics once per column: (min, max) or (mean, std)
        stat1, stat2 = function1(wrk_df[key]), function2(wrk_df[key])
        #   we add the column name to the dictionary keys along with the corresponding normalization
        normalization[key] = (normalizationtype, stat1, stat2)
        if normalizationtype == 'minmax':
            #   we apply the min-max normalization to the whole column at once
            deep_copy_df[key] = (deep_copy_df[key] - stat1) / (stat2 - stat1)
        else:
            #   we apply the z-normalization to the whole column at once
            deep_copy_df[key] = (deep_copy_df[key] - stat1) / stat2
    return deep_copy_df, normalization


# Input to apply_normalization:
# df: a dataframe
# normalization: a mapping (dictionary) from column names to triples (see above)
#
# Output from apply_normalization:
# df: a new dataframe, where each numerical value has been normalized according to the mapping
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: For minmax-normalization, you may consider to limit the output range to [0,1]


def apply_normalization(df, normalization):
    #   we need to do the modifications on a deep copy of the dataframe 
    deep_copy_df = df.copy(deep=True)
    for key in normalization.keys():
        normalizationtype = normalization[key][0]
        if normalizationtype == 'minmax':
            #   named min_value and max_value to avoid shadowing the built-ins min and max
            min_value = normalization[key][1]
            max_value = normalization[key][2]
            #   we apply the corresponding min-max normalization to the whole column at once
            deep_copy_df[key] = (deep_copy_df[key] - min_value) / (max_value - min_value)
        elif normalizationtype == 'zscore':
            #   we apply the corresponding z-normalization to the whole column at once
            mean = normalization[key][1]
            std = normalization[key][2]
            deep_copy_df[key] = (deep_copy_df[key] - mean) / std
        else:
            #   if the keyword is not recognized, we raise an error that states which values are accepted
            raise ValueError('The normalizationtype variable has not been recognized. Please use either'
                             ' minmax or zscore.')
    return deep_copy_df


In [5]:
# Test your code (leave this part unchanged)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

glass_train_norm, normalization = create_normalization(glass_train_df, normalizationtype="minmax")
print("normalization:\n")
for f in normalization:
    print("{}:{}".format(f,normalization[f]))

glass_test_norm = apply_normalization(glass_test_df,normalization)
print("\nglass_test_norm:\n")
print(glass_test_norm)


FileNotFoundError: [Errno 2] File b'glass_train.txt' does not exist: b'glass_train.txt'

### Comment on assumptions, things that do not work properly, etc.
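
Hint 2 for `apply_normalization` suggests limiting min-max output to [0,1]. The function above does not clip, so test values outside the training range map outside [0,1]. A minimal sketch of a clipped variant, assuming a hypothetical helper named `minmax_clip` (not part of the submitted solution):

```python
import pandas as pd


def minmax_clip(series, min_value, max_value):
    # Hypothetical helper: scale with the training min/max, then clip so that
    # unseen values outside [min_value, max_value] stay inside [0, 1].
    scaled = (series - min_value) / (max_value - min_value)
    return scaled.clip(lower=0.0, upper=1.0)


# Toy data: 0.0 and 12.0 lie outside the assumed training range [2, 10]
test_values = pd.Series([0.0, 5.0, 10.0, 12.0])
clipped = minmax_clip(test_values, min_value=2.0, max_value=10.0)
```

Whether clipping is desirable depends on the downstream learner, so it is left here as an option rather than a default.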


## 1b. Create and apply imputation

In [None]:
#   Veronika

# Insert the functions create_imputation and apply_imputation below (after the comments)
#
# Input to create_imputation:
# df: a dataframe (where the column names "CLASS" and "ID" have special meaning)
#
# Output from create_imputation:
# df: a new dataframe, where each missing numeric value in a column has been replaced by the mean of that column 
#     and each missing categoric value in a column has been replaced by the mode of that column
# imputation: a mapping (dictionary) from column name to value that has replaced missing values
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Handle columns of type "float" or "int" only (and which are not labeled "CLASS" or "ID") in one way
#         and columns of type "object" and "category" in other ways
# Hint 3: Consider using the pandas functions mean() and mode() respectively, as well as fillna
# Hint 4: In the rare case of all values in a column being missing, replace numeric values with 0,
#         object values with "" and category values with the first category (cat.categories[0])  


def create_imputation(df):
    wrk_df = df.copy()
    #   named "imputation" to avoid shadowing the built-in dict
    imputation = {}
    #   'ID' without a trailing space, otherwise the ID column is not excluded
    columns = wrk_df.columns.difference(['CLASS', 'ID'])
    for key in columns:
        col = wrk_df[key]
        all_missing = col.isnull().all()
        #   numeric column: impute the mean (0 if every value is missing)
        if col.dtype == np.float64 or col.dtype == np.int64:
            repl_val = 0 if all_missing else col.mean()
        #   category column: impute the mode (first category if every value is missing)
        elif col.dtype.name == 'category':
            repl_val = col.cat.categories[0] if all_missing else col.mode()[0]
        #   object (string) or bool column: impute the mode ("" if every value is missing)
        #   np.object and np.bool are no longer available in recent NumPy, so the built-ins are used
        elif col.dtype == object or col.dtype == bool:
            #   mode() can return several values, we keep the first one
            repl_val = "" if all_missing else col.mode()[0]
        else:
            raise ValueError('Unknown column type.')

        imputation[key] = repl_val
        #   assign the result back instead of calling fillna(inplace=True) on a column selection
        wrk_df[key] = col.fillna(repl_val)
    return wrk_df, imputation
    

#
# Input to apply_imputation:
# df: a dataframe
# imputation: a mapping (dictionary) from column name to value that should replace missing values
#
# Output from apply_imputation:
# df: a new dataframe, where each missing value has been replaced according to the mapping
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider using fillna

def apply_imputation(df, repl_vals):
    wrk_df = df.copy()
    for key in repl_vals:
        #   fillna returns a new Series, so the result must be assigned back
        wrk_df[key] = wrk_df[key].fillna(repl_vals[key])
    return wrk_df





In [None]:
# Test your code (leave this part unchanged)

anneal_train_df = pd.read_csv("anneal_train.txt")
anneal_test_df = pd.read_csv("anneal_test.txt")

anneal_train_imp, imputation = create_imputation(anneal_train_df)
anneal_test_imp = apply_imputation(anneal_test_df,imputation)

print("Imputation:\n")
for f in imputation:
    print("{}:{}".format(f,imputation[f]))

print("\nNo. of replaced missing values in training data:\n{}".format(anneal_train_imp.count()-anneal_train_df.count()))
print("\nNo. of replaced missing values in test data:\n{}".format(anneal_test_imp.count()-anneal_test_df.count()))



### Comment on assumptions, things that do not work properly, etc.
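
A small sketch of mean/mode imputation on a toy dataframe, to make the intended behaviour concrete. The column names `x` and `c` are made up for illustration and do not come from the anneal data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numeric value and one missing string value
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "c": ["a", None, "a"]})

# Mean for the numeric column, mode for the string column
imputation = {"x": toy["x"].mean(), "c": toy["c"].mode()[0]}

filled = toy.copy()
for col, value in imputation.items():
    # fillna returns a new Series; assigning it back stores the result
    filled[col] = filled[col].fillna(value)
```

The original `toy` frame still contains its missing values afterwards, which is the behaviour Hint 1 asks for.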

## 1c. Create and apply discretization

In [None]:
# Insert the functions create_bins and apply_bins below
#
# Input to create_bins:
# df: a dataframe
# nobins: no. of bins (default = 10)
# bintype: either "equal-width" (default) or "equal-size" 
#
# Output from create_bins:
# df: a new dataframe, where each numeric feature value has been replaced by a categoric (corresponding to some bin)
# binning: a mapping (dictionary) from column name to bins (threshold values for the bin)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Discretize columns of type "float" or "int" only (and which are not labeled "CLASS" or "ID")
# Hint 3: Consider using pd.cut and pd.qcut respectively, with labels=False, retbins=True and duplicates="drop"
#         (the last option will avoid errors when not enough bins can be created)
# Hint 4: Set all columns in the new dataframe to be of type "category"
# Hint 5: Set the categories of the discretized features to be [0,...,nobins-1]
# Hint 6: Change the first and the last element of each binning to -np.inf and np.inf respectively 

def create_bins(df, bintype="equal-width", nobins=10):
    newdf = df.copy()
    binning = {}
    binfunctions = {'equal-width':pd.cut, 'equal-size':pd.qcut}
    #   select numeric columns with select_dtypes; comparing a dtype with np.number
    #   is unreliable across NumPy versions and does not match integer columns
    for key in newdf.select_dtypes(include=np.number).columns:
        if key in ['CLASS', 'ID']:
            continue
        res, bins = binfunctions[bintype](newdf[key], nobins, labels=False, retbins=True, duplicates="drop")
        newdf[key] = res
        #   Hint 6 implementation: open the outer bins so that unseen values still fall in a bin
        bins[0] = -np.inf
        bins[-1] = np.inf

        binning[key] = bins
            
    #Hint 4 implementation
    for key in newdf.columns:
        newdf[key] = newdf[key].astype('category')
    
    return newdf, binning

# Input to apply_bins:
# df: a dataframe
# binning: a mapping (dictionary) from column name to bins (threshold values for the bin)
#
# Output from apply_bins:
# df: a new dataframe, where each numeric feature value has been replaced by a categoric (corresponding to some bin)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider using pd.cut 
# Hint 3: Set all columns in the new dataframe to be of type "category"
# Hint 4: Set the categories of the discretized features to be [0,...,nobins-1]
#

def apply_bins(df, binning):
    newdf = df.copy()
    for key in newdf.columns:
        if key in binning:
            newdf[key] = pd.cut(df[key],binning[key], labels=False)
    
    #Hint 3 implementation
    for key in newdf.columns:
        newdf[key] = newdf[key].astype('category')
    
    return newdf

In [None]:
# Test your code  (leave this part unchanged)

glass_train_df = pd.read_csv("glass_train.txt")

glass_test_df = pd.read_csv("glass_test.txt")

glass_train_disc, binning = create_bins(glass_train_df,nobins=10,bintype="equal-size")
print("binning:\n")
for f in binning:
    print("{}:{}".format(f,binning[f]))

glass_test_disc = apply_bins(glass_test_df,binning)
print("\nglass_test_disc:\n")
glass_test_disc


### Comment on assumptions, things that do not work properly, etc.
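
Hint 6 asks for the outer bin edges to be replaced by `-np.inf` and `np.inf`. A minimal sketch on made-up numbers shows why: with the closed edges returned by `pd.cut`, test values outside the training range get no bin at all.

```python
import numpy as np
import pandas as pd

train = pd.Series([1.0, 2.0, 3.0, 4.0])
# Two equal-width bins learned from the training values
_, bins = pd.cut(train, 2, labels=False, retbins=True)

# Copy the edges and open the first and last one
open_bins = bins.copy()
open_bins[0], open_bins[-1] = -np.inf, np.inf

test = pd.Series([0.0, 2.0, 9.0])
closed = pd.cut(test, bins, labels=False)       # 0.0 and 9.0 fall outside and become NaN
opened = pd.cut(test, open_bins, labels=False)  # every value is assigned to a bin
```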

## 1d. Divide a dataset into a training and a test set

In [None]:
#   Leo

# Insert the function split below
#
# Input to split:
# df: a dataframe
# testfraction: a float in the range (0,1) (default = 0.5)
#
# Output from split:
# trainingdf: a dataframe consisting of a random sample of (1-testfraction) of the rows in df
# testdf: a dataframe consisting of the rows in df that are not included in trainingdf
#
# Hint: You may use np.random.permutation(df.index) to get a permuted list of indexes where a 
#       prefix corresponds to the test instances, and the suffix to the training instances 


def split(df, testfraction=0.5):
    #   generate a permutation of the row labels
    permutation = np.random.permutation(df.index)
    #   compute the index separating the testing rows from the training rows in permutation
    sep_index = int(permutation.size*testfraction)
    #   return two slices of the initial data set:
    #   the first one contains a (1-testfraction) fraction of the dataframe and corresponds to the training set,
    #   the second one contains a testfraction fraction of the dataframe and corresponds to the testing set.
    #   the permutation holds index labels, so we select with .loc rather than .iloc
    return df.loc[permutation[sep_index:], :], df.loc[permutation[:sep_index], :]


In [None]:
# Test your code  (leave this part unchanged)

glass_df = pd.read_csv("glass.txt")

glass_train, glass_test = split(glass_df,testfraction=0.25)

print("Training IDs:\n{}".format(glass_train["ID"].values))

print("\nTest IDs:\n{}".format(glass_test["ID"].values))

print("\nOverlap: {}".format(set(glass_train["ID"]).intersection(set(glass_test["ID"]))))


### Comment on assumptions, things that do not work properly, etc.
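
`np.random.permutation(df.index)` yields index labels rather than row positions, so label-based selection with `.loc` is the safe choice whenever the index is not 0..n-1 (for example after an earlier split). A small sketch with a made-up index:

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels (100..103) differ from the positions (0..3)
df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

permuted_labels = np.random.permutation(df.index)
sep = int(len(permuted_labels) * 0.25)

# Prefix of the permutation is the test set, suffix is the training set
test_part = df.loc[permuted_labels[:sep]]
train_part = df.loc[permuted_labels[sep:]]
```

Position-based selection with `.iloc` would raise an `IndexError` on this frame, because 100..103 are not valid positions.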

## 1e. Calculate accuracy of a set of predictions

In [None]:
#   Veronika

# Insert the function accuracy below
#
# Input to accuracy:
# df: a dataframe with class labels as column names and each row corresponding to
#     a prediction with estimated probabilities for each class
# correctlabels: an array (or list) of the correct class label for each prediction
#                (the number of correct labels must equal the number of rows in df)
#
# Output from accuracy:
# accuracy: the fraction of cases for which the predicted class label coincides with the correct label
#
# Hint: In case the label receiving the highest probability is not unique, you may
#       resolve that by picking the first (as ordered by the column names) or 
#       by randomly selecting one of the labels with highest probability.



In [None]:
# Test your code  (leave this part unchanged)

predictions = pd.DataFrame({"A":[0.5,0.5,0.5,0.25,0.25],"B":[0.5,0.25,0.25,0.5,0.25],"C":[0.0,0.25,0.25,0.25,0.5]})
predictions


In [None]:
correctlabels = ["B","A","B","B","C"]

accuracy(predictions,correctlabels) # Note that depending on how ties are resolved the accuracy may be 0.6 or 0.8

### Comment on assumptions, things that do not work properly, etc.
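
The cell for 1e leaves `accuracy` undefined, so the test cell cannot run. The following is only a sketch of one possible implementation, not the group's submitted code. It assumes ties are resolved by picking the first label in column order, which is one of the two options the hint allows.

```python
import pandas as pd


def accuracy(df, correctlabels):
    # idxmax(axis=1) returns, for each row, the first column holding the
    # row maximum, so ties go to the first label in column order
    predicted = df.idxmax(axis=1)
    hits = sum(p == c for p, c in zip(predicted, correctlabels))
    return hits / len(correctlabels)


predictions = pd.DataFrame({"A": [0.5, 0.5, 0.5, 0.25, 0.25],
                            "B": [0.5, 0.25, 0.25, 0.5, 0.25],
                            "C": [0.0, 0.25, 0.25, 0.25, 0.5]})
correctlabels = ["B", "A", "B", "B", "C"]
result = accuracy(predictions, correctlabels)
```

With this tie rule the first row is predicted as "A" although the correct label is "B", which gives the lower of the two accuracies mentioned in the test cell.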

## 2a. Create and apply one-hot encoding

In [None]:
#Tim
#
# Insert the functions create_one_hot and apply_one_hot below
#
# Input to create_one_hot:
# df: a dataframe
#
# Output from create_one_hot:
# df: a new dataframe, where each categoric feature has been replaced by a set of binary features 
#    (as many new features as there are possible values)
# one_hot: a mapping (dictionary) from column name to a set of categories (possible values for the feature)
#
# Hint 1: First copy the input dataframe and modify the copy (the input dataframe should be kept unchanged)
# Hint 2: Consider columns of type "object" or "category" only (and which are not labeled "CLASS" or "ID")
# Hint 3: Consider creating new column names by merging the original column name and the categorical value
# Hint 4: Set all new columns to be of type "float"
# Hint 5: Do not forget to remove the original categoric feature

def create_one_hot(df):
    newdf = pd.DataFrame()
    one_hot = {}
    for key in df.columns:
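        #   np.object is no longer available in recent NumPy, so the built-in object is used
        if (df[key].dtype == object or df[key].dtype.name == 'category') and key not in ['CLASS', 'ID']: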
            one_hot[key] = df[key].unique()
            newtab = pd.get_dummies(df[key], prefix=key)
            for i in newtab.columns: #converting to floats
                newtab[i] = newtab[i].astype(float)
            newdf = pd.concat([newdf, newtab], axis=1)
        else:
            newdf = pd.concat([newdf, df[key]], axis=1)
    
    return(newdf, one_hot)

# Input to apply_one_hot:
# df: a dataframe
# one_hot: a mapping (dictionary) from column name to categories
#
# Output from apply_one_hot:
# df: a new dataframe, where each categoric feature has been replaced by a set of binary features
#
# Hint: See the above Hints

def apply_one_hot(df, one_hot):
    newdf = pd.DataFrame()
    for key in df.columns:
        if key in one_hot:
            cols = pd.CategoricalDtype(categories=one_hot[key])
            newcol = df[key].astype(cols)
            newtab = pd.get_dummies(newcol, prefix=key)
            for i in newtab.columns: #converting to floats
                newtab[i] = newtab[i].astype(float)
                
            newdf = pd.concat([newdf, newtab], axis=1)
        else:
            newdf = pd.concat([newdf, df[key]], axis=1)
    return(newdf)

In [None]:
# Test your code  (leave this part unchanged)

tictactoe = pd.read_csv("tic-tac-toe.txt")

train_df, test_df = split(tictactoe) # Using your above function

new_train, one_hot = create_one_hot(train_df)

new_test = apply_one_hot(test_df,one_hot)
new_test

### Comment on assumptions, things that do not work properly, etc.
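
`apply_one_hot` casts each column to a `pd.CategoricalDtype` built from the training categories before calling `pd.get_dummies`. A short sketch of why this matters, using a hypothetical column name `top_left`: the test data then gets one column per training category, even for values that never occur in the test set.

```python
import pandas as pd

train_col = pd.Series(["x", "o", "b"], name="top_left")
test_col = pd.Series(["x", "x"], name="top_left")  # "o" and "b" never occur here

# Fix the set of categories to the one observed in the training data
categories = train_col.unique()
fixed = test_col.astype(pd.CategoricalDtype(categories=categories))

# One column per training category, converted to float as in Hint 4
encoded = pd.get_dummies(fixed, prefix="top_left").astype(float)
```

Without the cast, `pd.get_dummies(test_col)` would produce only the `x` column, and the training and test frames would no longer have matching features.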

## 2b. Divide a dataset into a number of folds

In [None]:
#   Leo

# Insert the function folds below
#
# Input to folds:
# df: a dataframe
# nofolds: an integer greater than 1 (default = 10)
#
# Output from folds:
# folds: a list (of length = nofolds) of dataframes consisting of random non-overlapping, 
#        approximately equal-sized subsets of the rows in df
#
# Hint: You may use np.random.permutation(df.index) to get a permuted list of indexes from which a 
#       prefix corresponds to the test instances, and the suffix to the training instances 


def folds(df, nofolds=10):
    #   generate a permutation of the row labels
    permutation = np.random.permutation(df.index)
    #   cut the permutation into nofolds chunks of approximately equal size
    folds_indexes = np.array_split(permutation, nofolds)
    #   return one dataframe per chunk; each chunk already holds row labels,
    #   so we select with .loc and do not index the permutation a second time
    return [df.loc[indexes, :] for indexes in folds_indexes]


In [None]:
# Test your code  (leave this part unchanged)

glass_df = pd.read_csv("glass.txt")

glass_folds = folds(glass_df,nofolds=5)

fold_sizes = [len(f) for f in glass_folds]

print("Fold sizes:{}\nTotal no. instances: {}".format(fold_sizes,sum(fold_sizes)))

### Comment on assumptions, things that do not work properly, etc.
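
`np.array_split` is what makes the folds "approximately equal-sized": unlike `np.split`, it accepts a length that is not divisible by the number of parts and gives the extra elements to the first chunks. A tiny illustration with made-up sizes (7 rows, 3 folds):

```python
import numpy as np

# 7 elements cannot be divided evenly into 3 parts
parts = np.array_split(np.arange(7), 3)
sizes = [len(part) for part in parts]
```

Here `sizes` is `[3, 2, 2]`, so fold sizes differ by at most one.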

## 2c. Calculate Brier score of a set of predictions

In [None]:
#   Veronika

# Insert the function brier_score below
#
# Input to brier_score:
# df: a dataframe with class labels as column names and each row corresponding to
#     a prediction with estimated probabilities for each class
# correctlabels: an array (or list) of the correct class label for each prediction
#                (the number of correct labels must equal the number of rows in df)
#
# Output from brier_score:
# brier_score: the average square error of the predicted probabilities 
#
# Hint: Compare each predicted vector to a vector for each correct label, which is all zeros except 
#       for at the index of the correct class. The index can be found using np.where(df.columns==l)[0] 
#       where l is the correct label.



In [None]:
# Test your code  (leave this part unchanged)

predictions = pd.DataFrame({"A":[0.5,0.5,0.5,0.25,0.25],"B":[0.5,0.25,0.25,0.5,0.25],"C":[0.0,0.25,0.25,0.25,0.5]})

correctlabels = ["B","A","B","B","C"]

brier_score(predictions,correctlabels)

### Comment on assumptions, things that do not work properly, etc.
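
The cell for 2c leaves `brier_score` undefined, so the test cell cannot run. The following is only a sketch that follows the hint in the cell comments, not the group's submitted code: each predicted row is compared with a vector that is all zeros except for a one at the position of the correct class.

```python
import numpy as np
import pandas as pd


def brier_score(df, correctlabels):
    squared_error = 0.0
    for row, label in zip(df.values, correctlabels):
        # Target vector: zeros everywhere except at the correct class
        target = np.zeros(len(df.columns))
        target[np.where(df.columns == label)[0]] = 1.0
        squared_error += ((row - target) ** 2).sum()
    # Average over the predictions (rows)
    return squared_error / len(df)


predictions = pd.DataFrame({"A": [0.5, 0.5, 0.5, 0.25, 0.25],
                            "B": [0.5, 0.25, 0.25, 0.5, 0.25],
                            "C": [0.0, 0.25, 0.25, 0.25, 0.5]})
correctlabels = ["B", "A", "B", "B", "C"]
result = brier_score(predictions, correctlabels)
```

For the test data the per-row squared errors are 0.5, 0.375, 0.875, 0.375 and 0.375, so the sketch returns 0.5.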

## 2d. Calculate AUC of a set of predictions

In [6]:
#   Tim

# Insert the function auc below
#
# Input to auc:
# df: a dataframe with class labels as column names and each row corresponding to
#     a prediction with estimated probabilities for each class
# correctlabels: an array (or list) of the correct class label for each prediction
#                (the number of correct labels must equal the number of rows in df)
#
# Output from auc:
# auc: the weighted area under ROC curve
#
# Hint 1: Calculate the binary AUC first for each class label c, i.e., treating the
#         predicted probability of this class for each instance as a score; the positive
#         instances are the ones belonging to class c and the negative instances the rest
# Hint 2: When calculating the binary AUC, first find the scores of the positive instances and then
#         the scores of the negative instances
# Hint 3: You may use a dictionary with a mapping from each score to an array of two numbers; 
#         the number of positive instances with this score and the number of negative instances with this score
# Hint 4: Create a (reversely) sorted (on the scores) list of pairs from the dictionary and
#         iterate over this to additively calculate the AUC
# Hint 5: For each pair in the above list, there are three cases to consider; the no. of true positives
#         (tp_i) is zero, the number of false positives (fp_i) (negatives) is zero, and both are non-zero
# Hint 6: Calculate the weighted AUC by summing the individual AUCs weighted by the relative
#         frequency of each class (as estimated from the correct labels)

def auc(df, correctlabels):
    finauc = 0
    freq = {}
    aucdict = {}
    for c in set(correctlabels):
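        #   list(...) so that arrays are accepted as well as lists
        freq[c] = list(correctlabels).count(c)
        #   compare labels by value ("==") rather than by identity ("is")
        y_true = np.array([1 if x == c else 0 for x in correctlabels])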
        scores = df[c]
        scoredct = {}
        poscount = scores[[bool(x) for x in y_true]].value_counts()
        negcount = scores[[not bool(x) for x in y_true]].value_counts()
        for score in sorted(set(scores), reverse=True):
            posscore = poscount[score] if score in poscount else 0
            negscore = negcount[score] if score in negcount else 0
            scoredct[score] = (posscore, negscore)

        tot_tp = sum([scoredct[key][0] for key in scoredct])
        tot_fp = sum([scoredct[key][1] for key in scoredct])
        auc = 0
        cov_tp = 0
        for i in scoredct:
            tp_i = scoredct[i][0]
            fp_i = scoredct[i][1]
            if fp_i == 0:
                cov_tp += tp_i
            elif tp_i == 0:
                auc += (cov_tp/tot_tp)*(fp_i/tot_fp)
            else:
                auc += (cov_tp/tot_tp)*(fp_i/tot_fp) + (tp_i/tot_tp)*(fp_i/tot_fp)/2
                cov_tp += tp_i
        
        aucdict[c] = auc       
    for key in aucdict:
        finauc += freq[key]/len(correctlabels)*aucdict[key]
    return finauc

In [7]:
# Test your code  (leave this part unchanged)

predictions = pd.DataFrame({"A":[0.9,0.9,0.6,0.55],"B":[0.1,0.1,0.4,0.45]})

correctlabels = ["A","B","B","A"]

auc(predictions,correctlabels)



0.375

In [8]:
predictions = pd.DataFrame({"A":[0.5,0.5,0.5,0.25,0.25],"B":[0.5,0.25,0.25,0.5,0.25],"C":[0.0,0.25,0.25,0.25,0.5]})

correctlabels = ["B","A","B","B","C"]

auc(predictions,correctlabels)

0.8499999999999999

### Comment on assumptions, things that do not work properly, etc.
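
As a cross-check of the binary AUC, the same quantity can be computed by comparing every positive score with every negative score and counting a tie as half a win. The helper name `pairwise_auc` is hypothetical and the function is quadratic in the number of instances, so it is only meant as a sanity check on small inputs.

```python
def pairwise_auc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs in which the positive instance
    # is scored higher; ties count as half a win
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Class "A" in the first test: positives scored 0.9 and 0.55, negatives 0.9 and 0.6
auc_a = pairwise_auc([0.9, 0.55], [0.9, 0.6])
```

For class "A" this gives 1.5 wins out of 4 pairs, i.e. 0.375, which agrees with the output of the first test cell.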