# ID2214/FID3214 Assignment 2 Group no. 15
### Project members: 
Egill Friðriksson, egillf@kth.se
Gard Aasness, gardaa@kth.se
Iosif Koen, iosif@kth.se


### Declaration:
By submitting this solution, it is hereby declared that all individuals listed above have contributed to the solution, either with code that appears in the final solution below, or with code that has been evaluated and compared to the final solution, but for some reason has been excluded. It is also declared that all project members fully understand all parts of the final solution and can explain it upon request.

It is furthermore declared that the code below is a contribution by the project members only, and specifically that no part of the solution has been copied from any other source (except for lecture slides at the course ID2214/FID3214) and no part of the solution has been provided by someone not listed as project member above.

It is furthermore declared that it has been understood that no other library/package than the Python 3 standard library, NumPy, pandas and time may be used in the solution for this assignment.

### Instructions
All parts of the assignment starting with number 1 below are mandatory. Satisfactory solutions
will give 1 point (in total). If they in addition are good (all parts work more or less 
as they should), completed on time (submitted before the deadline in Canvas) and according
to the instructions, together with satisfactory solutions of all parts of the assignment starting 
with number 2 below, then the assignment will receive 2 points (in total).

Note that you do not have to develop the code directly within the notebook
but may instead copy the comments and test cases to a more convenient development environment
and when everything works as expected, you may paste your functions into this
notebook, do a final testing (all cells should succeed) and submit the whole notebook 
(a single file) in Canvas (do not forget to fill in your group number and names above).

## Load NumPy, pandas and time

In [122]:
import numpy as np
import pandas as pd
import time

In [123]:
from platform import python_version

print(f"Python version: {python_version()}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Python version: 3.10.11
NumPy version: 1.23.4
Pandas version: 2.0.3


## Reused functions from Assignment 1

In [124]:
# Copy and paste functions from Assignment 1 here that you need for this assignment

#####################################################################################################
########################################### COLUMN FILTER ###########################################
#####################################################################################################

def create_column_filter(df):

    # Step 1. Copy the input dataframe.
    copy_df = df.copy()

    # Step 2. Initialise an empty list to store the names of the columns that will be kept.
    column_filter = []

    # Step 3. Iterate through all the columns in the dataframe.
    for column in copy_df.columns:

        # Step 4. Always keep the CLASS and ID columns.
        if column == "CLASS" or column == "ID":
            column_filter.append(column)
        else:

            # Step 5. Drop missing values and get unique values.
            unique_values = copy_df[column].dropna().unique()
            
            # Step 6. Check if the number of unique (non-missing) values is more than one, keep the column
            if (len(unique_values) > 1):
                column_filter.append(column)
            else:
                # Step 7. Drop the column from the dataframe copy.
                copy_df.drop(columns=[column], inplace=True)
        
        
    # Return the modified dataframe and the list of remaining columns
    return copy_df, column_filter


def apply_column_filter(df, column_filter):

    # Step 1. Copy the initial dataframe.
    copy_df = df.copy()

    # Step 2. Iterate through all the columns of the copied dataframe.
    for column in copy_df.columns:

        # Step 3. Check whether the column name is in the list of filtered columns.
        if column not in column_filter:
            # Step 4. Drop the column if its name is not in the list of filtered columns.
            copy_df.drop(columns=[column], inplace=True)

    # Step 5. Return the modified dataframe with only the columns that are in the filtered columns list
    return copy_df

#####################################################################################################
############################################### BINS ################################################
#####################################################################################################

def create_bins(df: pd.DataFrame, nobins: int, bintype: str) -> tuple[pd.DataFrame, dict]:

    # Step 1: Copy the input dataframe to avoid changes to the initial dataframe
    copy_df = df.copy()
    # Step 2: Initialize a dictionary to store the binning information
    binning = {}

    # Step 3: Iterate through all columns of the dataframe
    for column in copy_df.columns:
        # Skip the "CLASS" and "ID" columns and only bin numeric columns ("float64", "int64")
        if column not in ["CLASS", "ID"] and copy_df[column].dtype in ["float64", "int64"]:
            # Apply equal-width binning
            if bintype == "equal-width":
                copy_df[column], bins =  pd.cut(copy_df[column], nobins, labels=False, retbins=True)
            # Apply the equal-size binning
            elif bintype == "equal-size":
                copy_df[column], bins = pd.qcut(copy_df[column], nobins, retbins=True, labels=False, duplicates="drop")

            # Step 4: Set the first and the last bin edges to -infinity and +infinity respectively
            bins[0], bins[-1] = -np.inf, np.inf
            # Step 5: Store the binning information in the dictionary
            binning[column] = bins
            # Step 6: Convert the columns into the categorical data type
            copy_df[column] = pd.Categorical(copy_df[column], categories=range(nobins))

    return copy_df, binning

def apply_bins(df: pd.DataFrame, binning: dict) -> pd.DataFrame:

    # Step 1: Copy the input dataframe to avoid changes to the initial dataframe
    copy_df = df.copy()

    # Step 2: Iterate through binning dictionary
    for column, bins in binning.items():
        # Apply binning to the columns that are in the binning dictionary
        if column in copy_df.columns:
            # Discretize the columns based on the bins
            copy_df[column] = pd.cut(copy_df[column], bins, labels=False)
            # Convert the column into the categorical data type 
            copy_df[column] = pd.Categorical(copy_df[column], categories=range(len(bins) - 1))

    return copy_df
def create_imputation(df):

    # Step 1: Copy the input dataframe
    copy_df = df.copy()

    # Step 2: Initialize a dictionary to store imputation values
    imputation = {}

    # Step 3: Iterate through all the columns in the dataframe
    for column in copy_df.columns:

        # Step 4: Skip the "CLASS" and "ID" columns
        if column not in ["CLASS", "ID"]:

            # Step 5: Handle numeric columns (float and int types)
            if np.issubdtype(copy_df[column].dtype, np.number): # = if copy_df[column].dtype in ["float64", "int64"]
                # Calculate the mean value of the column
                mean_value = copy_df[column].mean()
                # Check if all the values of the column are missing and replace mean with 0
                if pd.isnull(mean_value):
                    mean_value = 0
                # Replace the missing values with the mean value
                copy_df[column].fillna(mean_value, inplace=True)
                # Add the mean value to the imputation dictionary
                imputation[column] = mean_value

            # Step 6: Handle the object and the category columns
            elif copy_df[column].dtype in ["object", "category"]:
                # Calculate the mode of the column (the most frequent value in the column)
                mode_value = copy_df[column].mode().iloc[0] if not copy_df[column].mode().empty else ""
                # If all values are missing and column is category then replace the mode value with the first category.
                # If no categories it will raise an error.
                if copy_df[column].dtype.name == "category" and mode_value == "":               
                    mode_value = copy_df[column].cat.categories[0]
                # Replace missing values with the mode value
                copy_df[column].fillna(mode_value, inplace=True)
                # Add the mode value to the imputation dictionary
                imputation[column] = mode_value

    return copy_df, imputation

def apply_imputation(df, imputation):
    # Step 1: Copy the input dataframe
    copy_df = df.copy()
    # Step 2: Iterate through the imputation dictionary
    for column_name, value in imputation.items():
        # Fill the missing values of each column that exists in the dataframe with the stored value.
        if column_name in copy_df.columns:
            copy_df[column_name] = copy_df[column_name].fillna(value)

    return copy_df

def create_normalization(df, normalizationtype = "minmax"):

    # Step 1. Copy the input dataframe.
    normalized_df = df.copy()

    # Step 2. Initialize a dictionary so to store the normalization parameters.
    normalization_map = {}

    # Step 3. Iterate through all the columns.
    for column in normalized_df.columns:
        # Step 4. Check that the type of the column is numeric (float64 or int64) and that it is not CLASS or ID.
        if normalized_df[column].dtype in ["float64", "int64"] and column not in ["CLASS", "ID"]:
            # Step 5. Check if the normalization type is min-max.
            if normalizationtype == "minmax":
                # Step 6. Apply min-max normalization.
                minimum_value = normalized_df[column].min()
                maximum_value = normalized_df[column].max()
                # The range (max - min) must be parenthesised so that it is the divisor.
                normalized_df[column] = (normalized_df[column] - minimum_value) / (maximum_value - minimum_value)
                # Step 7. Store the normalization parameters.
                normalization_map[column] = (normalizationtype, minimum_value, maximum_value)
            # Step 8. Check if the normalization type is z-normalization / zscore.
            elif normalizationtype == "zscore":
                # Step 9. Apply the z-normalization.
                mean_value = normalized_df[column].mean()
                std_value = normalized_df[column].std()
                normalized_df[column] = (normalized_df[column] - mean_value) / std_value
                # Step 10. Store the normalization parameters.
                normalization_map[column] = (normalizationtype, mean_value, std_value)
            # Step 11. Handle unexpected normalization type
            else:
                raise TypeError("Unsupported normalization type!")
            
    return normalized_df, normalization_map

def apply_normalization(df, normalization):
    
    # Step 1. Copy the input dataframe.
    copy_df = df.copy()

    # Step 2. Iterate through the normalization dictionary.
    for column, values in normalization.items():

        # Step 3. Skip columns that do not exist in the dataframe.
        if column not in copy_df.columns:
            continue

        # Step 4. Unpack the normalization parameters.
        normalization_type, first_variable, second_variable = values

        # Step 5. Check if the normalization type is min-max
        if normalization_type == "minmax":

            # Step 6. Apply the min-max normalization
            copy_df[column] = (copy_df[column] - first_variable) / (second_variable - first_variable)

            # Step 7. Ensure that the values are in the range of [0, 1]
            # Find documentation here: https://numpy.org/doc/stable/reference/generated/numpy.clip.html
            copy_df[column] = np.clip(copy_df[column], 0, 1)
        # Step 8. Check if the normalization type is z-normalization / zscore.
        elif normalization_type == "zscore":
            # Step 9. Apply the z-normalization
            copy_df[column] = (copy_df[column] - first_variable) / second_variable
        # Step 10. Handle unexpected normalization type
        else:
            raise TypeError("Unsupported normalization type!")

    return copy_df

def create_one_hot(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:

    # Step 1: Copy the input dataframe to avoid changes to the initial dataframe
    copy_df = df.copy()
    # Step 2: Initialize a dictionary to store the one-hot information
    one_hot = {}

    # Step 3: Iterate through the columns of the dataframe and apply the one-hot encoding to categorical columns
    for column in copy_df.columns:
        if column not in ["CLASS", "ID"] and copy_df[column].dtype in ["category", "object"]:

            # Get unique categories for the columns
            unique_column_category = df[column].unique()
            # Store the unique categories into dictionary
            one_hot[column] = unique_column_category
            # Generate one-hot encoding
            for category in sorted(unique_column_category):
                # Create a new column for each category
                new_column_category_name = f"{column}_{category}"
                copy_df[new_column_category_name] = (df[column] == category).astype(float)

            # Remove the original column
            copy_df.drop(column, axis=1, inplace=True)

    return copy_df, one_hot

def apply_one_hot(df: pd.DataFrame, one_hot: dict) -> pd.DataFrame:

    # Step 1: Copy the input dataframe to avoid changes to the initial dataframe
    copy_df = df.copy()

    # Step 2: Iterate through the one-hot dictionary so to apply the one-hot encoding
    for column, categories in one_hot.items():
        if column in df.columns:
            for category in categories:
                new_column_category_name = f"{column}_{category}"
                # Create a new column with 0 as the default value
                copy_df[new_column_category_name] = 0
                # Set the value to 1 where the category matches
                copy_df.loc[df[column] == category, new_column_category_name] = 1
            # Remove the original column
            copy_df.drop(column, axis=1, inplace=True)

    return copy_df


def brier_score(df, correctlabels):
    total_brier_score = 0.0

    for i in range(len(df)):
        pred = df.iloc[i].values  
        corr_label = correctlabels[i] 
        
        # Find the index of the correct label in the columns of the dataframe
        corr_label_i = np.where(df.columns == corr_label)[0][0]
        corr_vector = np.zeros(len(pred))
        corr_vector[corr_label_i] = 1
        squared_error = np.sum((pred - corr_vector) ** 2)
        total_brier_score += squared_error
    
    avg_brier_score = total_brier_score / len(df)
    
    return avg_brier_score

def accuracy(df: pd.DataFrame, correctlabels: list):

    if len(df) != len(correctlabels):
        raise ValueError("The number of rows in the DataFrame must equal the number of correct labels")
    
    # Step 2: Find the label with the highest probability in each row
    predicted_labels = df.idxmax(axis=1)

    # Step 3: Compare with the correct labels
    matches = predicted_labels == correctlabels

    # Step 4: Calculate accuracy 
    accuracy = matches.sum() / len(correctlabels)

    return accuracy


def auc(df, correctlabels):
    # Helper function to calculate binary AUC for a single class
    def binary_auc(class_name):
        scores = df[class_name].values
        true_positives = [int(label == class_name) for label in correctlabels]
        false_positives = [int(label != class_name) for label in correctlabels]

        # Create triples of (score, tp, fp)
        triples = [(score, tp, fp) for score, tp, fp in zip(scores, true_positives, false_positives)]

        # Sort the triples based on scores in reverse order
        triples.sort(key=lambda x: x[0], reverse=True)

        # Calculate AUC using the lecture's algorithm
        auc = 0
        cov_tp = 0
        tot_tp = sum(true_positives)
        tot_fp = sum(false_positives)

        for _, tp, fp in triples:
            if fp == 0:
                cov_tp += tp
            elif tp == 0:
                auc += (cov_tp / tot_tp) * (fp / tot_fp)
            else:
                auc += (cov_tp / tot_tp) * (fp / tot_fp) + (tp / tot_tp) * (fp / tot_fp) / 2
                cov_tp += tp

        return auc

    # Calculate binary AUC for each class and store the results
    auc_results = {class_name: binary_auc(class_name) for class_name in df.columns}

    # Calculate weighted AUC
    class_counts = pd.Series(correctlabels).value_counts(normalize=True)
    
    weighted_auc = sum(auc_results[class_name] * class_counts.get(class_name, 0) for class_name in df.columns)

    return weighted_auc
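
As a quick sanity check of the accuracy and Brier score definitions used above, the following is a minimal vectorized sketch on a hand-made prediction table. The names `brier_score_sketch`, `accuracy_sketch`, `toy_predictions` and `toy_labels` are hypothetical and only serve the illustration; they are not part of the submitted solution.

```python
import numpy as np
import pandas as pd

def brier_score_sketch(predictions: pd.DataFrame, correct_labels) -> float:
    """Mean over rows of the squared distance between the predicted
    class distribution and the one-hot vector of the correct label."""
    labels = np.asarray(correct_labels, dtype=object)
    columns = predictions.columns.to_numpy(dtype=object)
    # One-hot matrix with the same column order as the predictions
    one_hot = (labels[:, None] == columns[None, :]).astype(float)
    squared_errors = (predictions.to_numpy(dtype=float) - one_hot) ** 2
    return float(squared_errors.sum(axis=1).mean())

def accuracy_sketch(predictions: pd.DataFrame, correct_labels) -> float:
    """Fraction of rows whose highest-probability label is the correct one."""
    predicted = predictions.idxmax(axis=1).to_numpy(dtype=object)
    return float((predicted == np.asarray(correct_labels, dtype=object)).mean())

# Toy data (assumed for illustration): three instances, two classes
toy_predictions = pd.DataFrame({"A": [0.9, 0.3, 0.5], "B": [0.1, 0.7, 0.5]})
toy_labels = ["A", "B", "B"]

print(brier_score_sketch(toy_predictions, toy_labels))
print(accuracy_sketch(toy_predictions, toy_labels))
```

The per-row squared errors are 0.02, 0.18 and 0.50, so the Brier score is 0.70 / 3. The third row is a tie, and `idxmax` picks the first column ("A"), so only two of the three rows count as correct.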


## 1. Define the class kNN

In [125]:
from scipy.spatial import distance
# Define the class kNN with three functions __init__, fit and predict (after the comments):
#
class kNN:
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, imputation, normalization, one_hot, labels, training_labels, training_data, training_time
#
    def __init__(self) -> None:
        self.column_filter = None
        self.imputation = None
        self.normalization = None
        self.one_hot = None
        self.labels = None
        self.training_labels = None
        self.training_data = None
        self.training_time = None

# Input to fit:
# self              - the object itself
# df                - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# normalizationtype - "minmax" (default) or "zscore"
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
    def fit(self, df: pd.DataFrame, normalizationtype: str = "minmax"):

        # self.column_filter   - a column filter (see Assignment 1) from df
        _, self.column_filter = create_column_filter(df=df)
        df_filtered = apply_column_filter(df=df, column_filter=self.column_filter)

        # self.imputation      - an imputation mapping (see Assignment 1) from df
        _, self.imputation = create_imputation(df=df_filtered)
        df_imputed = apply_imputation(df=df_filtered, imputation=self.imputation)

        # self.normalization   - a normalization mapping (see Assignment 1), using normalizationtype from the imputed df
        _, self.normalization = create_normalization(df=df_imputed, normalizationtype=normalizationtype)
        df_normalized = apply_normalization(df=df_imputed, normalization=self.normalization)

        if df_normalized.select_dtypes(include=['object', 'category']).shape[1] > 0:
            # self.one_hot         - a one-hot mapping (see Assignment 1)
            _, self.one_hot = create_one_hot(df=df_normalized)
            df_final = apply_one_hot(df=df_normalized, one_hot=self.one_hot)
        else:
            df_final = df_normalized

        # self.training_labels - a pandas series corresponding to the "CLASS" column, set to be of type "category"
        self.training_labels = df_final['CLASS'].astype('category')

        # self.labels          - a list of the categories (class labels) of the previous series
        self.labels = self.training_labels.cat.categories

        # self.training_data   - the values (an ndarray) of the transformed dataframe, i.e., after employing imputation,
        # normalization, and possibly one-hot encoding, and also after removing the "CLASS" and "ID" columns
        self.training_data = df_final.drop(columns=['CLASS', 'ID']).values


# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# k    - an integer >= 1 (default = 5)
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the relative class frequencies in the set of class labels from the k nearest 
#               (with respect to Euclidean distance) neighbors in training_data
#
# Hint 1: Drop any "CLASS" and "ID" columns first and then apply column filtering, imputation, normalization and one-hot
#
# Hint 2: Get the numerical values (as an ndarray) from the resulting dataframe and iterate over the rows 
#         calling some sub-function, e.g., get_nearest_neighbor_predictions(x_test,k), which for a test row
#         (numerical input feature values) finds the k nearest neighbors and calculate the class probabilities.
#
# Hint 3: This sub-function may first find the distances to all training instances, e.g., pairs consisting of
#         training instance index and distance, and then sort them according to distance, and then (using the indexes
#         of the k closest instances) find the corresponding labels and calculate the relative class frequencies
    def predict(self, df: pd.DataFrame, k: int = 5) -> pd.DataFrame:

        # Preprocess the test data
        # Hint 1: Apply the preprocessing steps (column filtering, imputation, normalization, one-hot encoding)
        df_processed = self._preprocess_data(df=df)

        # Convert processed DataFrame to ndarray for distance computation
        test_data = df_processed.values

        # Initialize an empty list to collect the per-row class probabilities
        predictions = []

        # Iterate over each test instance and get predictions
        # Hint 2 & 3: Use a sub-function for finding k nearest neighbors and calculating class probabilities
        for test_instance in test_data:
            class_probabilities = self._get_nearest_neighbor_predictions(test_instance=test_instance, k=k)
            predictions.append(class_probabilities)


        predictions_df = pd.concat(predictions, axis=1).T
        predictions_df.columns = self.labels
        predictions_df.index = df.index

        return predictions_df

    def _preprocess_data(self, df: pd.DataFrame) -> pd.DataFrame:
        """Preprocesses the data by applying column filter, imputation, normalization, and one-hot encoding"""
        df_filtered = apply_column_filter(df, self.column_filter)
        df_imputed = apply_imputation(df_filtered, self.imputation)
        df_normalized = apply_normalization(df_imputed, self.normalization)
        if self.one_hot:
            df_final = apply_one_hot(df_normalized, self.one_hot)
        else:
            df_final = df_normalized

        # Drop 'CLASS' and 'ID' columns if present
        return df_final.drop(columns=['CLASS', 'ID'], errors='ignore')

    def _get_nearest_neighbor_predictions(self, test_instance: np.ndarray, k: int) -> pd.Series:
        """Finds the k nearest neighbors and calculates class probabilities"""
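        # Calculate the Euclidean distances between test_instance and all training rows.
        # Only NumPy is used here, in line with the libraries permitted in the declaration above.
        differences = np.asarray(self.training_data, dtype=float) - np.asarray(test_instance, dtype=float)
        distances = np.sqrt((differences ** 2).sum(axis=1))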

        # Find indices of k smallest distances
        nearest_indices = np.argsort(distances)[:k]

        # Get labels of nearest neighbors
        nearest_labels = self.training_labels.iloc[nearest_indices]

        # Calculate and return class probabilities
        class_probabilities = nearest_labels.value_counts(normalize=True).reindex(self.labels, fill_value=0)
        return class_probabilities




In [126]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

knn_model = kNN()

t0 = time.perf_counter()
knn_model.fit(glass_train_df)
print("Training time: {0:.2f} s.".format(time.perf_counter()-t0))

test_labels = glass_test_df["CLASS"]

k_values = [1,3,5,7,9]
results = np.empty((len(k_values),3))

for i in range(len(k_values)):
    t0 = time.perf_counter()
    predictions = knn_model.predict(glass_test_df,k=k_values[i])
    print("Testing time (k={0}): {1:.2f} s.".format(k_values[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=k_values,columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time: 0.04 s.


Testing time (k=1): 0.30 s.
Testing time (k=3): 0.18 s.
Testing time (k=5): 0.18 s.
Testing time (k=7): 0.20 s.
Testing time (k=9): 0.22 s.



'results'

| k | Accuracy | Brier score | AUC |
|---|----------|-------------|-----|
| 1 | 0.747664 | 0.504673 | 0.821943 |
| 3 | 0.663551 | 0.488058 | 0.824895 |
| 5 | 0.579439 | 0.471028 | 0.837814 |
| 7 | 0.598131 | 0.471867 | 0.837775 |
| 9 | 0.616822 | 0.482981 | 0.832269 |


In [127]:
train_labels = glass_train_df["CLASS"]
predictions = knn_model.predict(glass_train_df,k=1)
print("Accuracy on training set (k=1): {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set (k=1): {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set (k=1): {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set (k=1): 1.0000
AUC on training set (k=1): 1.0000
Brier score on training set (k=1): 0.0000
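
The probability estimate behind these results, relative class frequencies among the k nearest training rows under Euclidean distance, can be sketched with NumPy alone. This is a minimal illustration under assumed toy data; `knn_class_probabilities`, `toy_training_data` and `toy_training_labels` are hypothetical names, not part of the class above.

```python
import numpy as np

def knn_class_probabilities(x_test, training_data, training_labels, labels, k=5):
    """Relative class frequencies among the k nearest training rows."""
    # Euclidean distance from the test row to every training row
    distances = np.sqrt(((training_data - x_test) ** 2).sum(axis=1))
    # Indices of the k smallest distances (stable sort keeps ties in row order)
    nearest = np.argsort(distances, kind="stable")[:k]
    nearest_labels = training_labels[nearest]
    return {label: float(np.mean(nearest_labels == label)) for label in labels}

# Toy training set (assumed for illustration): five rows, two features
toy_training_data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.2, 0.1]])
toy_training_labels = np.array(["a", "a", "b", "b", "a"])
x = np.array([0.0, 0.1])

probs_k3 = knn_class_probabilities(x, toy_training_data, toy_training_labels, ["a", "b"], k=3)
probs_k5 = knn_class_probabilities(x, toy_training_data, toy_training_labels, ["a", "b"], k=5)
print(probs_k3, probs_k5)
```

With k=3 the three nearest rows all carry label "a", so the estimate is 1.0 for "a" and 0.0 for "b". With k=5 every training row is a neighbour and the estimate equals the class proportions, 0.6 and 0.4. This also illustrates why k=1 on the training set gives a perfect score: each row is its own nearest neighbour at distance zero.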


### Comment on assumptions, things that do not work properly, etc.


## 2. Define the class NaiveBayes
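
The estimate asked for in this part, class prior times the product of per-feature relative frequencies, normalized over the classes and falling back to the priors when the sum is zero, can be sketched on a single categorical feature as follows. The toy dataframe and the names `toy`, `priors`, `value_counts`, `class_counts` and `naive_bayes_probabilities` are assumptions made for the illustration only.

```python
import pandas as pd

# Toy training data (assumed for illustration): one categorical feature
toy = pd.DataFrame({
    "CLASS": ["yes", "yes", "yes", "no", "no"],
    "colour": ["red", "red", "blue", "blue", "blue"],
})

# Class priors as relative frequencies of the labels
priors = toy["CLASS"].value_counts(normalize=True).to_dict()
# Counts per (class label, feature value) combination via groupby + size
value_counts = toy.groupby(["CLASS", "colour"]).size().to_dict()
# Counts per class label among rows with a non-missing feature value
class_counts = toy.dropna(subset=["colour"])["CLASS"].value_counts().to_dict()

def naive_bayes_probabilities(value):
    """Normalized class probabilities for one observed feature value."""
    scores = {}
    for label, prior in priors.items():
        relative_frequency = value_counts.get((label, value), 0) / class_counts[label]
        scores[label] = prior * relative_frequency
    total = sum(scores.values())
    if total == 0:
        # Unseen value for every class: fall back to the class priors
        return {label: float(prior) for label, prior in priors.items()}
    return {label: float(score / total) for label, score in scores.items()}

print(naive_bayes_probabilities("red"))
print(naive_bayes_probabilities("blue"))
print(naive_bayes_probabilities("green"))
```

For "blue" the non-normalized scores are 0.6 * 1/3 = 0.2 for "yes" and 0.4 * 2/2 = 0.4 for "no", which normalize to 1/3 and 2/3. The value "green" never occurs in training, so all scores are zero and the priors 0.6 and 0.4 are returned. With several features, the relative frequencies of each feature would be multiplied together before normalizing.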

In [128]:
# Define the class NaiveBayes with three functions __init__, fit and predict (after the comments):
#
# Input to __init__: 
# self - the object itself
#
# Output from __init__:
# <nothing>
# 
# This function does not return anything but just initializes the following attributes of the object (self) to None:
# column_filter, binning, labels, class_priors, feature_class_value_counts, feature_class_counts
#
# Input to fit:
# self    - the object itself
# df      - a dataframe (where the column names "CLASS" and "ID" have special meaning)
# nobins  - no. of bins (default = 10)
# bintype - either "equal-width" (default) or "equal-size" 
#
# Output from fit:
# <nothing>
#
# The result of applying this function should be:
#
# self.column_filter              - a column filter (see Assignment 1) from df
# self.binning                    - a discretization mapping (see Assignment 1) from df
# self.class_priors               - a mapping (dictionary) from the labels (categories) of the "CLASS" column of df,
#                                   to the relative frequencies of the labels
# self.labels                     - a list of the categories (class labels) of the "CLASS" column of df
# self.feature_class_value_counts - a mapping from the feature (column name) to the number of
#                                   training instances with a specific combination of (non-missing, categorical) 
#                                   value for the feature and class label
# self.feature_class_counts       - a mapping from the feature (column name) to the number of
#                                   training instances with a specific class label and some (non-missing, categorical) 
#                                   value for the feature
#
# Note that the function does not return anything but just assigns values to the attributes of the object.
#
# Input to predict:
# self - the object itself
# df   - a dataframe
# 
# Output from predict:
# predictions - a dataframe with class labels as column names and the rows corresponding to
#               predictions with estimated class probabilities for each row in df, where the class probabilities
#               are estimated by the naive approximation of Bayes rule (see lecture slides)
#
# Hint 1: First apply the column filter and discretization
#
# Hint 2: Iterating over either columns or rows, and for each possible class label, calculate the relative
#         frequency of the observed feature value given the class (using feature_class_value_counts and 
#         feature_class_counts) 
#
# Hint 3: Calculate the non-normalized estimated class probabilities by multiplying the class priors to the
#         product of the relative frequencies
#
# Hint 4: Normalize the probabilities by dividing by the sum of the non-normalized probabilities; in case
#         this sum is zero, then set the probabilities to the class priors
#
# Hint 5: To clarify the assignment text a little: self.feature_class_value_counts should be a mapping from 
#         a column name (a specific feature) to another mapping, which given a class label and a value for 
#         the feature, returns the number of training instances which have included this combination, 
#         i.e., the number of training instances with both the specific class label and this value on the feature.
#
# Hint 6: As an additional hint, you may take a look at the slides from the NumPy and pandas lecture, to see how you 
#         may use "groupby" in combination with "size" to get the counts for combinations of values from two columns.

class NaiveBayes:
    
    def __init__(self):
        self.column_filter = None
        self.binning = None
        self.labels = None
        self.class_priors = None
        self.feature_class_value_counts = None
        self.feature_class_counts = None



    def fit(self, df, nobins=10, bintype="equal-width"):
        # Make a copy of the input dataframe
        df_copy = df.copy()

        # Apply column filter and store the result in self.column_filter
        df_copy, self.column_filter = create_column_filter(df_copy)

        # Apply discretization and store the result in self.binning
        df_copy, self.binning = create_bins(df_copy, nobins, bintype)

        # Calculate class priors as relative frequencies
        self.class_priors = dict(df_copy['CLASS'].value_counts(normalize=True))

        # Extract unique class labels and store them in self.labels
        self.labels = df_copy['CLASS'].astype('category').cat.categories.tolist()

        # Init dictionaries to store counts for feature-class combinations
        dictionary_count = {}
        dictionary_values_count = {}

        # Hint 5 & 6: 
        # Iterate over columns (excluding CLASS and ID)
        for col in df_copy.columns:
            if col not in ['CLASS', 'ID']:
                # Count occurrences for each combination of class label and feature value
                dictionary_values_count[col] = df_copy.groupby(['CLASS', col]).size().to_dict()


                # Drop rows with missing values for the current feature and 'CLASS', then count occurrences for each class label
                df_copy_tmp = df_copy.dropna(axis=0, subset=['CLASS', col])
                dictionary_count[col] = df_copy_tmp.loc[:, 'CLASS'].value_counts().to_dict()

        # Store the calculated counts in self.feature_class_value_counts and self.feature_class_counts
        self.feature_class_value_counts = dictionary_values_count
        self.feature_class_counts = dictionary_count


    def predict(self, df):
        # Make a copy of the input dataframe
        df_copy = df.copy()

        # Apply column filter to the dataframe
        df_copy = apply_column_filter(df_copy, self.column_filter)

        # Apply discretization to the dataframe
        df_copy = apply_bins(df_copy, self.binning)

        # Drop the CLASS and ID columns
        df_copy = df_copy.drop(columns=['CLASS', 'ID'])

        # get the dimensions of the dataframe
        num_rows = df_copy.shape[0]
        num_col = df_copy.shape[1]
        num_labels = len(self.labels)

        # Initialize a 3-dimensional array (labels x rows x columns) to store the relative frequencies
        matrix = np.zeros([num_labels, num_rows, num_col])

        # Hint 2: 
        # iterate over columns
        for col in range(num_col):
            curr_col = df_copy.columns[col]

            # Iterate over class labels
            for label in range(num_labels):
                curr_label = self.labels[label]

                # Iterate over rows
                for row in range(num_rows):
                    curr_value = df_copy.iloc[row, col]

                    # Calculate the relative frequency of the observed feature value given the class
                    if (curr_label, curr_value) in self.feature_class_value_counts[curr_col]:
                        feature_value_count = self.feature_class_value_counts[curr_col][(curr_label, curr_value)]
                        feature_count = self.feature_class_counts[curr_col][curr_label]

                        rel_frequency = feature_value_count / feature_count
                    else:
                        rel_frequency = 0

                    matrix[label, row, col] = rel_frequency

        # Hint 3: 
        # calculate non-normalized probabilities
        non_normalized_matrix = matrix.prod(axis=2)


        # Create a class vector from class priors and tile it to match the dimensions
        class_vector = np.array([self.class_priors[self.labels[i]] for i in range(num_labels)])
        class_matrix = np.tile(class_vector, num_rows).reshape([num_rows, num_labels]).T

        # Multiply non-normalized probabilities by class priors
        non_normalized_matrix = non_normalized_matrix * class_matrix

        # Calculate normalization values
        normalization = np.sum(non_normalized_matrix, axis=0)
        normalizing_matrix = np.tile(normalization, num_labels).reshape([num_labels, num_rows])

        # Avoid division by zero: where all class scores are zero, set the divisor to 1
        # (the quotient stays 0 there and is replaced by the class priors below)
        normalizing_matrix_0 = normalizing_matrix == 0
        normalizing_matrix += normalizing_matrix_0.astype('float')

        # Hint 4: 
        # Normalize the probabilities
        result_matrix = non_normalized_matrix / normalizing_matrix

        # Add class priors to the result where normalization matrix was zero
        class_priors = normalizing_matrix_0.astype('float') * class_matrix
        result_matrix += class_priors

        # Create a DataFrame from the result matrix and return it
        result_df = pd.DataFrame(result_matrix.T, columns=self.labels)
        
        return result_df
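
The steps in `fit` and `predict` can be illustrated on a toy example. The following is only a minimal pure-Python sketch, not the class above: it assumes a single categorical feature, and the data in `rows` as well as the helper names `rel_freq` and `posterior` are hypothetical. It mirrors the same logic: counts per (class, value) pair and per class that skip missing values, relative frequencies multiplied by the class priors, and a fallback to the priors when every class gets a zero score.

```python
from collections import Counter

# Hypothetical toy training data: (class label, value of one categorical feature).
# None marks a missing feature value.
rows = [("a", "low"), ("a", "low"), ("a", "high"), ("b", "high"), ("b", None)]

# Class priors as relative frequencies over all training rows.
class_counts = Counter(label for label, _ in rows)
priors = {label: n / len(rows) for label, n in class_counts.items()}

# Counts per (class, value) pair and per class, skipping missing values.
value_counts = Counter((label, v) for label, v in rows if v is not None)
feature_counts = Counter(label for label, v in rows if v is not None)


def rel_freq(label, value):
    """Relative frequency of value given label; 0 for unseen combinations."""
    n = feature_counts[label]
    return value_counts[(label, value)] / n if n else 0.0


def posterior(value):
    """Normalized class probabilities; the priors are used if all scores are zero."""
    scores = {label: priors[label] * rel_freq(label, value) for label in priors}
    total = sum(scores.values())
    if total == 0:
        return dict(priors)
    return {label: score / total for label, score in scores.items()}


for value in ("low", "high", "medium"):
    print(value, posterior(value))
```

In this sketch `"low"` was only seen with class `a`, so all probability mass goes to `a`; `"high"` gives roughly 1/3 for `a` and 2/3 for `b`; the unseen value `"medium"` yields zero scores for both classes, so the priors (0.6 and 0.4) are returned.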



In [129]:
# Test your code (leave this part unchanged, except for if auc is undefined)

glass_train_df = pd.read_csv("glass_train.csv")
nb_model = NaiveBayes()
nb_model.fit(glass_train_df, 3, "equal-width")

glass_train_df = pd.read_csv("glass_train.csv")

glass_test_df = pd.read_csv("glass_test.csv")

nb_model = NaiveBayes()

test_labels = glass_test_df["CLASS"]

nobins_values = [3,5,10]
bintype_values = ["equal-width","equal-size"]
parameters = [(nobins,bintype) for nobins in nobins_values for bintype in bintype_values]

results = np.empty((len(parameters),3))

for i in range(len(parameters)):
    t0 = time.perf_counter()
    nb_model.fit(glass_train_df,nobins=parameters[i][0],bintype=parameters[i][1])
    print("Training time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    t0 = time.perf_counter()
    predictions = nb_model.predict(glass_test_df)
    print("Testing time {0}: {1:.2f} s.".format(parameters[i],time.perf_counter()-t0))
    results[i] = [accuracy(predictions,test_labels),brier_score(predictions,test_labels),
                  auc(predictions,test_labels)] # Assuming that you have defined auc - remove otherwise

results = pd.DataFrame(results,index=pd.MultiIndex.from_product([nobins_values,bintype_values]),
                       columns=["Accuracy","Brier score","AUC"])

print()
display("results",results)

Training time (3, 'equal-width'): 0.05 s.
Testing time (3, 'equal-width'): 0.22 s.
Training time (3, 'equal-size'): 0.04 s.
Testing time (3, 'equal-size'): 0.16 s.
Training time (5, 'equal-width'): 0.04 s.
Testing time (5, 'equal-width'): 0.18 s.
Training time (5, 'equal-size'): 0.06 s.
Testing time (5, 'equal-size'): 0.19 s.
Training time (10, 'equal-width'): 0.04 s.
Testing time (10, 'equal-width'): 0.21 s.
Training time (10, 'equal-size'): 0.04 s.
Testing time (10, 'equal-size'): 0.17 s.



'results'

| nobins | bintype | Accuracy | Brier score | AUC |
|---|---|---|---|---|
| 3 | equal-width | 0.616822 | 0.622116 | 0.729629 |
| 3 | equal-size | 0.607477 | 0.554782 | 0.789825 |
| 5 | equal-width | 0.64486 | 0.551101 | 0.76876 |
| 5 | equal-size | 0.598131 | 0.581556 | 0.799143 |
| 10 | equal-width | 0.654206 | 0.527569 | 0.812162 |
| 10 | equal-size | 0.588785 | 0.741668 | 0.754406 |


In [130]:
train_labels = glass_train_df["CLASS"]
nb_model.fit(glass_train_df)
predictions = nb_model.predict(glass_train_df)
print("Accuracy on training set: {0:.4f}".format(accuracy(predictions,train_labels)))
print("AUC on training set: {0:.4f}".format(auc(predictions,train_labels)))
print("Brier score on training set: {0:.4f}".format(brier_score(predictions,train_labels)))

Accuracy on training set: 0.8505
AUC on training set: 0.9687
Brier score on training set: 0.2263


### Comment on assumptions, things that do not work properly, etc.