# Assignment 2: Classification and Evaluation (20 marks)

Student Name: Wang Risheng

Student ID: 1053051

## General info

<b>Due date</b>: Monday, 1 September 2023 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -10% per day up to 5 days (both weekdays and weekends count)
<ul>
    <li>one day late, -2.0;</li>
    <li>two days late, -4.0;</li>
    <li>three days late, -6.0;</li>
    <li>four days late, -8.0;</li>
    <li>five days late, -10.0;</li>
</ul>

<b>Marks</b>:  This assignment will be marked out of 20, and make up 20% of your overall mark for this subject.

<b>Materials</b>: See [Using Jupyter Notebook and Python page] on Canvas (under Modules> Coding Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages `numpy`, `pandas`, `matplotlib` and `sklearn`. You can use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parenthesis after the instructions. 


<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on Canvas>Assignments>Assignmnet2; we recommend you check it regularly.

<b>Academic misconduct</b>: This assignment is an individual task, and so reuse of code or other instances of clear influence will be considered cheating. Please check the <a href="https://canvas.lms.unimelb.edu.au/courses/151131/modules#module_825112">CIS Academic Honesty training</a> for more information. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place. Content produced by an AI (including, but not limited to ChatGPT) is not your own work, and submitting such content will be treated as a case of academic misconduct, in line with the <a href="https://academicintegrity.unimelb.edu.au/plagiarism-and-collusion/artificial-intelligence-tools-and-technologies"> University's policy</a>.

**IMPORTANT**

Please carefully read and fill out the <b>Authorship Declaration</b> form at the bottom of the page. Failure to fill out this form results in the following deductions: 
<UL TYPE=”square”>
<LI>Missing Authorship Declaration at the bottom of the page, -2.0
<LI>Incomplete or unsigned Authorship Declaration at the bottom of the page, -1.0
</UL>


## Overview:
For this assignment, you will apply a number of classifiers to various datasets, and
explore various evaluation paradigms and analyze the impact of multiple parameters on the performance of the classifiers. You will then answer a number of conceptual
questions about the Naive Bayes classifier, K-nearest neighbors, and a number of baselines based on your observations. 
## Data Sets:
In this assignment, you will work with two datasets. These datasets are adapted from a UCI archive public dataset:

 - **Adult**: You predict whether an adult person earns less than 50K or 50K or more US dollar per year, based on various personal attributes like age or education level. More information can be found<a href="https://archive.ics.uci.edu/dataset/2/adult"> here </a>. 
 - **Student**: You predict a student’s final grade {A+, A, B, C, D, F} based on a number of personal and performance related attributes, such as school, parent’s education level, number of absences, etc. More information can be found<a href="https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success"> here </a>. 
 
More information about these datasets can be found in `readme.txt` file.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import copy
import math
from collections import defaultdict, Counter

In [2]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB,CategoricalNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score

In [3]:
import warnings

# ignore future warnings 
warnings.simplefilter(action='ignore', category=FutureWarning)

## Question 1. Reading and Pre-processing [1.5 marks] 

**A)** First, you will read in the data using the `fileName` parameter into a pandas DataFrame. You will also need to input the list of numerical feature names `num_feat` to the function to make your pre-processing easier.

**B)** Second, you replace missing values denoted by `?` using the following two strategies: 

   * <b>Continuous features</b>: For each feature find the <b>average feature value</b> in the dataset 
   * <b>Categorical features</b>: For each feature find the <b>most frequent value</b> in the dataset  


**C)** Third, you will use one-hot encoding to convert all nominal (and ordinal) attributes to numeric. You can achieve this by either using `get_dummies()` from the pandas library or `OneHotEncoder()` from the scikit-learn library. The resulting dataset includes all originally numeric features as well as the one-hot encoded features that are now numeric, call this data `num_dataset`.

**D)** Fourth, you will use **equal-width** binning ( 4 bins ) to convert numerical features into categorical. You can achieve this by using `cut()` from pandas library. The resulting dataset includes all originally categorical features as well as the discretized features that are now categorical, call this data `cat_dataset`.


In [4]:
# This function should read a csv file and return two pandas dataframes

def preprocess(fileName,num_feat):
    ## read the csv file
    data = pd.read_csv(fileName)
    categorical_headers = []

    ## replace missing values with the most frequent for categorical feature
    for feature in data.columns[1:-1]:
        if feature in num_feat:
            data[feature] = pd.to_numeric(data[feature], errors='coerce')
            mean = data[feature].mean()
            data[feature].fillna(mean, inplace=True)
    
    ## replace missing values with the average for numerical features
        else:
            categorical_headers.append(feature)
            mode_values = data[feature].mode()
            mode_value = mode_values[0] # We usually pick the first value in case of there are multiple values with same frequency
            data.loc[data[feature] == "?", feature] = mode_value                
        
    ## convert categorical features to numeric using one-hot encoding
    
    # https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
    # I mainly use the methods like fit_transform and get_features_names_out introduced in this website
    dataset = data.copy()
    encoder = OrdinalEncoder()
    label_data = (dataset.iloc[:, -1]).values
    label_data = label_data.reshape(-1,1)
    encoded_label = pd.DataFrame(encoder.fit_transform(label_data))
    encoded = pd.get_dummies(dataset[categorical_headers], drop_first=True)
    num_dataset = pd.concat([dataset[num_feat], encoded, encoded_label], axis = 1)
    
    ## convert numerical features to categorical using equal-width binning
    
    cat_dataset = data.copy() # Since we need to reuse the original data
    num_bins = 4
    for feature in num_feat:
        min_value = data[feature].min()
        max_value = data[feature].max()
        edges = np.linspace(min_value, max_value, num_bins + 1)
        cat_dataset[feature] = pd.cut(data[feature], edges, labels=False, include_lowest=True)
    cat_dataset.drop(columns = ['ID'])
    return cat_dataset, num_dataset



In [5]:
## list of numeric features for adult dataset
adult_num = ['Age','fnlwgt','Education-num','Capital-gain','Capital-loss','Hours-per-week']
## generate the categorical and numerical adult datasets
adult_cat_dataset,adult_num_dataset = preprocess("datasets/adult.csv",adult_num)
#print(adult_num_dataset.shape)
#print(adult_cat_dataset.shape)
adult_cat_dataset
## generate the categorical and numerical student datasets
student_cat_dataset,student_num_dataset = preprocess("datasets/student.csv",[])
#print(student_num_dataset.shape)
#print(student_cat_dataset.shape)
#student_num_dataset.columns.tolist()

#### Question 2 . Baseline methods and Discussion [4.5 marks]
**A)** For 10 rounds, use `train_test_split` to divide the processed `cat_dataset` into 80% train, 20% test . Set the `random_state` equal to the loop counter. For example in the loop
``` python 
for i in range(10):
```
make `random_state` equal to `i`. 
Use the splitted datasets to train and test the following models: **[1 mark]**

- Zero-R
- One-R
- Weighted Random 

Report the average accuracy over the 10 runs. 


In [6]:
## You can define your helper functions for One-R or other baselines in this block
## for One-R at training time, you can break the ties randomly
## for One-R at prediction time, if the test contains an unseen feature value, return the majority class
def one_R_model(X_train, X_test, y_train, y_test):
    feature_selected = None
    optimal_accuracy = 0
    final_dict = dict()

    for attribute in X_train.columns[1:-1]:
        predicts = []
        attribute_counter = Counter(X_train[attribute])
        class_counter = Counter()
        class_dict = dict()
        most_frequency = 0
        total = 0
        for value in attribute_counter:
            class_labels = y_train[X_train[attribute] == value]
            class_counter = Counter(class_labels)
            total += class_counter.total()
            most_common_one, frequency = class_counter.most_common(1)[0]
            most_frequency += frequency
            class_dict[value] = most_common_one
            accuracy = most_frequency/total
            
        if accuracy > optimal_accuracy:
            optimal_accuracy = accuracy
            feature_selected = attribute
            final_dict = class_dict
        
        success = 0
        total = 0
        counter_final = Counter(final_dict)
        
        for i in range(len(X_test[feature_selected])):
            value = X_test[feature_selected].iloc[i]
            
            predict_value = None
            if value not in final_dict:
                most_one, freq = counter_final.most_common(1)[0]
                predict_value = most_one
            else:
                predict_value = final_dict[value]
            predicts.append(predict_value)

            actual_label = X_test.iloc[i,-1]
            if str(predict_value) in str(actual_label):
                success += 1
            total += 1
        score = success / total
    

    return score, predicts


In [7]:
def baselines(cat_dataset):

    ZeroR_Acc_1 = []
    OneR_Acc_1 = []
    WRand_Acc_1 = []

    ## your code here
    # First, we split them into 2 parts which are test part and train part respectively
    runs = 10
    test_portion = 0.2
    X = cat_dataset
    y = cat_dataset.iloc[:,-1]
    predict_list = []
    for i in range(runs):
        counter = i
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = counter, test_size = test_portion
        , shuffle = True)
        
        # Zero-R model
        dummy_clf = DummyClassifier(strategy = "most_frequent")
        dummy_clf.fit(X_train, y_train)
        dummy_predict = dummy_clf.predict(X_test)
        dummy_score = accuracy_score(y_test, dummy_predict)
        ZeroR_Acc_1.append(dummy_score)
        

        # One-R model
        score, predicts = one_R_model(X_train, X_test, y_train, y_test)
        predict_list.append(predicts)
        OneR_Acc_1.append(score)

        # Weighted Random model      
        WR_clf = DummyClassifier(strategy = "stratified")
        WR_clf.fit(X_train, y_train)
        WR_predictions = WR_clf.predict(X_test)
        WR_acc = accuracy_score(y_test, WR_predictions)
        WRand_Acc_1.append(WR_acc)
        

    
    print("Accuracy of ZeroR:", np.mean(ZeroR_Acc_1).round(2))
    print("Accuracy of One-R:", np.mean(OneR_Acc_1).round(2))
    print("Accuracy of Weighted Random:", np.mean(WRand_Acc_1).round(2))
    
##Adult Dataset and Student Dataset results: 
print("Adult Dataset Baseline results:")
baselines(adult_cat_dataset)

print("Student Dataset Baseline results:")
baselines(student_cat_dataset)



Adult Dataset Baseline results:
Accuracy of ZeroR: 0.76
Accuracy of One-R: 0.77
Accuracy of Weighted Random: 0.64
Student Dataset Baseline results:
Accuracy of ZeroR: 0.3
Accuracy of One-R: 0.31
Accuracy of Weighted Random: 0.23


**B)** After comparing the performance of the different models on the classification task, please comment on any differences or lack of differences you observe between the baseline models and the datasets. **[1.5 marks]**</br>
*NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.*


In [8]:
X = student_cat_dataset
y = student_cat_dataset.iloc[:,-1]

x = adult_num_dataset
Y = adult_num_dataset.iloc[:, -1]
random_state_num = 6
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = random_state_num, test_size = 0.2 )
new_x_train, new_x_test, new_y_train, new_y_test = train_test_split(x, Y, random_state = random_state_num, test_size = 0.2)
print("Zero-R model--adult data")
dummy_clf_adult = DummyClassifier(strategy = "most_frequent")
dummy_clf_adult.fit(new_x_train, new_y_train)
dummy_predict_adult= dummy_clf_adult.predict(new_x_test)
print(classification_report(new_y_test,dummy_predict_adult, zero_division = 0))
print("Zero-R model--student data")
dummy_clf = DummyClassifier(strategy = "most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_predict = dummy_clf.predict(X_test)
dummy_score = accuracy_score(y_test, dummy_predict)
print(classification_report(y_test,dummy_predict, zero_division = 0))
print()
print("One-R model--adult data")
score_adult, predicts_adult = one_R_model(new_x_train, new_x_test, new_y_train, new_y_test)
print(classification_report(new_y_test, predicts_adult, zero_division = 0))
print("One-R model--student data")
score, predicts = one_R_model(X_train, X_test, y_train, y_test)
print(classification_report(y_test, predicts, zero_division = 0))
print()

print("Weighted Random model--adult data")
WR_clf_adult = DummyClassifier(strategy = "stratified")
WR_clf_adult.fit(new_x_train, new_y_train)
WR_predictions_adult = WR_clf_adult.predict(new_x_test)
print(classification_report(new_y_test,WR_predictions_adult, zero_division = 0))
print("Weighted Random model--student data")
WR_clf = DummyClassifier(strategy = "stratified")
WR_clf.fit(X_train, y_train)
WR_predictions = WR_clf.predict(X_test)
WR_acc = accuracy_score(y_test, WR_predictions)
print(classification_report(y_test,WR_predictions, zero_division = 0))
print()

Zero-R model--adult data
              precision    recall  f1-score   support

         0.0       0.77      1.00      0.87       386
         1.0       0.00      0.00      0.00       114

    accuracy                           0.77       500
   macro avg       0.39      0.50      0.44       500
weighted avg       0.60      0.77      0.67       500

Zero-R model--student data
              precision    recall  f1-score   support

           A       0.00      0.00      0.00        15
          A+       0.00      0.00      0.00         5
           B       0.00      0.00      0.00        29
           C       0.00      0.00      0.00        26
           D       0.29      1.00      0.45        38
           F       0.00      0.00      0.00        17

    accuracy                           0.29       130
   macro avg       0.05      0.17      0.08       130
weighted avg       0.09      0.29      0.13       130


One-R model--adult data
              precision    recall  f1-score   support

*Answer Here*  


Essentially, I use the categorical dataset generated from student csv file since there are multiple labels and is representative. 

Based on the definition of recall and precision. precision is computed by the portion of true positive in the sum of positive results while recall is defined as the portion of true positive in the sum of true positive and false negative, and they are in reverse relationship. 

In zero-R model, the precision value of each label is much smaller than recall value of each label. It means that the zero-R model cannot predict the labels correctly in most cases. For class D, the recall value is 1, which indicates that all the labels are identified as D in majority of instances but not others. Therefore, the f1 value, defined as 2 * precision * recall / (precision + recall), is highly affected and is 0 in most classes.

In one-R model, the performance of the model is still not well. Even though there is one more class has the non-zero precision value but 2 thirds of classes have 0 precision values. This means that predictions are correct in limit number of classes. Recall values reflect that the model mainly uses the instances, which are labelled as D, to do the prediction. Thereby, the values of f1 score are 0 in most classes.

In weighted random model, the range of precision values is roughly big, meaning that the correctness of labelling prediction varies across the classes. For instance, the model prefer to label the instance as C to label it as A+. Due to the relationship between precision and recall, recall values hava a large range. As a result, the values of f1 score also vary across the classes. the performance of this model is still not good.

As discussed above, these 3 models do not have strong performance because they mainly use small amounts of classes to predict classes in majority of instances
and use only limited aspects to predict the classes.


**C)** Update your code for One-R so that you can inspect the feature that is most often selected in the 10 rounds of training and testing for each dataset. Write the classification rule using the best feature and its values for each dataset. **[1 mark]**</br>  

In [9]:
OneR_Acc_1 = []
runs = 10
for i in range(runs):
        counter = i
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = counter, test_size = 0.2
        , shuffle = True)
        
        ## You can define your helper functions for One-R or other baselines in this block
        ## for One-R at training time, you can break the ties randomly
        ## for One-R at prediction time, if the test contains an unseen feature value, return the majority class
        feature_selected = None
        optimal_accuracy = 0
        final_dict = dict()
        overall = Counter()
        
        for attribute in X_train.columns[1:-1]:
                predicts = []
                attribute_counter = Counter(X_train[attribute])
                class_counter = Counter()
                class_dict = dict()
                most_frequency = 0
                total = 0
                for value in attribute_counter:
                        class_labels = y_train[X_train[attribute] == value]
                        class_counter = Counter(class_labels)
                        total += class_counter.total()
                        most_common_one, frequency = class_counter.most_common(1)[0]
                        most_frequency += frequency
                        class_dict[value] = most_common_one
                        accuracy = most_frequency/total
                
                if accuracy > optimal_accuracy:
                        optimal_accuracy = accuracy
                        feature_selected = attribute
                        final_dict = class_dict
                
                success = 0
                total = 0
                counter_final = Counter(final_dict)
                
                for i in range(len(X_test[feature_selected])):
                        value = X_test[feature_selected].iloc[i]
                
                        predict_value = None
                        if value not in final_dict:
                                most_one, freq = counter_final.most_common(1)[0]
                                predict_value = most_one
                        
                        else:
                                predict_value = final_dict[value]
                        predicts.append(predict_value)

                        actual_label = X_test.iloc[i][-1]
                        
                        if str(predict_value) in str(actual_label):
                                success += 1
                        total += 1
                score = success / total
                OneR_Acc_1.append(score)
        print(feature_selected)
print("Accuracy of One-R:", np.mean(OneR_Acc_1).round(2))

Medu
Fedu
Fedu
Medu
Medu
Medu
Fedu
Fedu
Fedu
Fedu
Accuracy of One-R: 0.3


*Answer Here*  
The feature selected is basically decided across the all possible values in each feature. In each feature, I will sum up the number of class labels which are labelled in most insatnces for each value in this feature and then divide it by the total number of instances to get the ratio. The feature with greatese ratio will be selected to do further process. I use the class labelled in most instances in a value in the feature selected to predict the test data set, for example, if majority of instances, which are labelled A, and the almost values in feature selected are 1, then when I come to the test data set, all the instances with value 1 in feature selected will be labelled A.  

**D)** For weighted random baseline applied to Adult dataset, what would the error rate converge to (Write a formula based on the prior probability of the dominant class, named `prior`, and the fraction of test samples belonging to the dominant class, `fraction`)? **[1 mark]**

*Answer Here*

Essentially, Weighted random models randomly assigns the label probablities using train data set and then use the probability to do prediction. Error rate is computed as wrong predictions divided by total number of predictions.


=> prior probability of dominant class in test dataset is N * prior
   actual fraction of test samples belonging to the dominant class is N * fraction, where N is the total number of perdicitions

=>error rate is the absolute value of difference between N * prior and N * fraction, and then divided by N ,also written as abs(N * prior - N * fraction)

## Question 3. Naive Bayes models [5 marks]

**A)** Divide the `num_dataset` and `cat_dataset` into 80% train and 20% test splits for 10 rounds, set the `random_state` equal to the loop counter. Then, train and test the following models:

- Gaussian Naive Bayes
- Bernoulli Naive Bayes
- Categorical Naive Bayes 

You must use the input data that you believe is best suited for each model. Finally, report the average accuracy of the NB models over the 10 runs. **[1 mark]**

**Note: You may need to change your input format to be able to use sklearn's CategoricalNB.** 

In [10]:
def NB_models(num_dataset,cat_dataset):

    GNB_Acc_1 = []
    BNB_Acc_1 = []
    CNB_Acc_1 = []

    ## your code here
    test_portion = 0.2
    runs = 10
    
    for i in range(runs):
        counter = i
        X_train, X_test, y_train, y_test = train_test_split(num_dataset.iloc[:,:-1], num_dataset.iloc[:, -1], train_size = 1 - test_portion, test_size = test_portion, random_state = counter)
        
        ## Gaussian Naive Bayes
        gaussian_clf = GaussianNB()
        gaussian_clf.fit(X_train, y_train)
        gaussian_predictions = gaussian_clf.predict(X_test)
        gaussian_acc = accuracy_score(y_test, gaussian_predictions)
        GNB_Acc_1.append(gaussian_acc)

        ## Bernoulli Naive Bayes
        bernoulli_clf = BernoulliNB()
        bernoulli_clf.fit(X_train, y_train)
        bernoulli_predictions = bernoulli_clf.predict(X_test)
        bernoulli_acc = accuracy_score(y_test, bernoulli_predictions)
        BNB_Acc_1.append(bernoulli_acc)

        ## Categorical Naive Bayes
        encoder = OrdinalEncoder()
        labels = cat_dataset.iloc[:,-1]
        labels_2d = labels.values.reshape(-1, 1)
        labels_encode = encoder.fit_transform(labels_2d)
        
        new_dataset = cat_dataset
        header = []
        for feature in new_dataset.columns[:-1]:
            header.append(feature)
        encoded = pd.get_dummies(new_dataset[header], drop_first=True)
        
        train, test, label_train, label_test= train_test_split(encoded, labels_encode, test_size = test_portion, random_state = counter)
        # https://github.com/scikit-learn/scikit-learn/issues/16028
        # In this link, I foound out the way to fix the error by typing different parameters in CNB
        categorical_clf = CategoricalNB(min_categories=7)
        new_train_label = label_train.ravel()
        new_test_label = label_test.ravel()
        categorical_clf.fit(train, new_train_label)
        categorical_predictions = categorical_clf.predict(test)
        categorical_score = accuracy_score(categorical_predictions,new_test_label)
        CNB_Acc_1.append(categorical_score)

    
    print("Accuracy of GNB:", np.mean(GNB_Acc_1).round(2))
    print("Accuracy of BNB:", np.mean(BNB_Acc_1).round(2))            
    print("Accuracy of CNB:", np.mean(CNB_Acc_1).round(2))
    
    

##Adult Dataset and Student Dataset results: 
print("Adult Dataset NB results:")
NB_models(adult_num_dataset,adult_cat_dataset)


print("Student Dataset NB results:")
NB_models(student_num_dataset,student_cat_dataset)





Adult Dataset NB results:
Accuracy of GNB: 0.8
Accuracy of BNB: 0.79
Accuracy of CNB: 0.76
Student Dataset NB results:
Accuracy of GNB: 0.17
Accuracy of BNB: 0.31
Accuracy of CNB: 0.3


**B)** How does the performance of the Naive Bayes classifiers compare against your baseline models for each dataset? **[1 mark]** Please comment on any differences you observe between the baseline models and the NB models in the context of the two datasets.</br> *NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.*

In [11]:
test_portion = 0.2
num_dataset = student_num_dataset
cat_dataset = student_cat_dataset
X_train, X_test, y_train, y_test = train_test_split(num_dataset.iloc[:,:-1], num_dataset.iloc[:, -1], train_size = 1 - test_portion, test_size = test_portion, random_state = 6)
new_x_train, new_x_test, new_y_train, new_y_test = train_test_split(adult_num_dataset.iloc[:,:-1], adult_num_dataset.iloc[:,-1], train_size = 1 - test_portion, test_size = test_portion, random_state = 6)       

## Gaussian Naive Bayes
gaussian_clf_adult = GaussianNB()
gaussian_clf_adult.fit(new_x_train, new_y_train)
gaussian_predictions_adult = gaussian_clf_adult.predict(new_x_test)
print("----------------------adult_GNB-------------------------")
print(classification_report(new_y_test, gaussian_predictions_adult, zero_division = 0))
gaussian_clf_student = GaussianNB()
gaussian_clf_student.fit(X_train, y_train)
gaussian_predictions_student = gaussian_clf_student.predict(X_test)
print("----------------------student_GNB-------------------------")
print(classification_report(y_test, gaussian_predictions_student, zero_division = 0))

## Bernoulli Naive Bayes
bernoulli_clf_adult = BernoulliNB()
bernoulli_clf_adult.fit(new_x_train, new_y_train)
bernoulli_predictions_adult = bernoulli_clf_adult.predict(new_x_test)
print("----------------------adult_BNB-------------------------")
print(classification_report(new_y_test, bernoulli_predictions_adult, zero_division = 0))
bernoulli_clf_student = BernoulliNB()
bernoulli_clf_student.fit(X_train, y_train)
bernoulli_predictions_student = bernoulli_clf_student.predict(X_test)
print("----------------------student_BNB-------------------------")
print(classification_report(y_test, bernoulli_predictions_student, zero_division = 0))

## Categorical Naive Bayes
encoder = OrdinalEncoder()
labels = cat_dataset.iloc[:,-1]
labels_2d = labels.values.reshape(-1, 1)
labels_encode = encoder.fit_transform(labels_2d) 
new_dataset = cat_dataset    
header = []
for feature in new_dataset.columns[:-1]:
    header.append(feature)
encoded = pd.get_dummies(new_dataset[header], drop_first=True)

encoder_2 = OrdinalEncoder()
labels_2 = adult_cat_dataset.iloc[:,-1]
labels_2d_2 = labels_2.values.reshape(-1, 1)
labels_encode_2 = encoder_2.fit_transform(labels_2d_2) 
new_dataset_2 = adult_cat_dataset    
header_2 = []
for feature in new_dataset_2.columns[:-1]:
    header_2.append(feature)
encoded_2 = pd.get_dummies(new_dataset_2[header_2], drop_first=True)
        
train, test, label_train, label_test= train_test_split(encoded, labels_encode, test_size = test_portion, random_state = 6)
new_train, new_test, new_label_train, new_label_test = train_test_split(encoded_2, labels_encode_2,test_size = test_portion, random_state = 6)
categorical_clf_2 = CategoricalNB(min_categories=7)
new_train_label_2 = new_label_train.ravel()
new_test_label_2 = new_label_test.ravel()
categorical_clf_2.fit(new_train, new_train_label_2)
categorical_predictions_2 = categorical_clf_2.predict(new_test)
print("----------------------adult_CNB-------------------------")
print(classification_report(new_label_test, categorical_predictions_2, zero_division = 0))
categorical_clf = CategoricalNB(min_categories=7)
new_train_label = label_train.ravel()
new_test_label = label_test.ravel()
categorical_clf.fit(train, new_train_label)
categorical_predictions = categorical_clf.predict(test)
print("----------------------student_CNB-------------------------")
print(classification_report(label_test, categorical_predictions, zero_division = 0))

----------------------adult_GNB-------------------------
              precision    recall  f1-score   support

         0.0       0.82      0.94      0.87       386
         1.0       0.58      0.29      0.39       114

    accuracy                           0.79       500
   macro avg       0.70      0.61      0.63       500
weighted avg       0.76      0.79      0.76       500

----------------------student_GNB-------------------------
              precision    recall  f1-score   support

         0.0       0.14      0.40      0.21        15
         1.0       0.02      0.20      0.04         5
         2.0       0.17      0.10      0.13        29
         3.0       0.00      0.00      0.00        26
         4.0       0.67      0.05      0.10        38
         5.0       0.43      0.35      0.39        17

    accuracy                           0.14       130
   macro avg       0.24      0.18      0.14       130
weighted avg       0.30      0.14      0.13       130

--------------

*Answer Here*  

Accuracy:
When we use adult data set, we can observe that Naive Bayes models can generate higher accuracy compared to the baseline models, meaning that NB classifiers can find the more complicated relationship between class and features in the file and it is able to find the dependencies between features.

Now turns to student data set. NB models have worse performance this time and they give much lower accuracy which is close to that given by baseline methods. Both types of models do not work well. Complicated structure of dataset might be the reason that they estimate the low accuracy.

Precision, recall and F1 score:
In term of adult dataset, NB models generally outperform the baseline models by compairing  precision, recall and f1-score. Higher precision values and higher recall values indicate that NB can be better at distinguish the class. Higher f1 score means that NB can achieve better balance between precision and recall based on the formula of the f1 score.

Now the student dataset. Baseline models outperform the NB models this time.
Baseline models can identify the class for each instance in a better performance compared to NB models. The complexity of the relationships between features might be the reason why NB models have poor performance. But overall, both types of models do not perform well.

In general, NB can have better performance than baseline methods based on either accuracy or (precision, recall, f1 score) evaluation when the dataset is not complicated. Both types of model may have poor performance if they use a complicated dataset.



**C)** The three Naive Bayes (NB) classifiers lead to different performances. Which of these NB classifiers performs best for each dataset, and why do you think it is the case? **[1 mark]** *NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.*




*Answer Here*

Adult dataset:

Bernoulli performs best. It can give high precision, high recall, high f1 scores and high accuracy.

Student dataset:

Bernoulli Naive Bayes is still the best one among them since it generates high precision, high recall, high f1 scores and high accuracy

Naive Bayes models rely on how assumptions are deeply aligned with dataset. Bernoulli classifiers can tackle the imbalance in dataset by adjusting class priors. But it can only handle the dataset which is fully numeric.




**D)** The Gaussian Naive Bayes classifier makes two fundamental assumptions: (1) about the distribution of $P(x_j|c_i)$ and (2) about the (conditional) dependency structure between features.
Explain both assumptions, and discuss whether these assumptions are always true for the
numeric attributes in the Adult dataset. If applicable, identify some cases where the assumptions are violated. **[2 marks]**


*Answer Here*  

In adult dataset, attributes like hours work per week and age could be distributed as Gaussian distribution. But attributes like capital loss may not be Gaussian distributed since they may be sparse and many of values are 0. Gaussian distribution indicates that values are symmetrically distribuited around mean value. Thus the values in capital loss may break the assumption.
In adult dataset, education background may be correlated to the occupations, which breaks the independency between features, in other words, the independencies between variables. In this case, the second assumption may be violated.

## Question 4. K-Nearest Neighbor [3 marks] 
**A)** Divide the `num_dataset` into 80% train and 20% test splits for 10 rounds, set the `random_state` equal to the loop counter. Then, train and test the following models:

- 6 K-Nearest Neighbor models with Euclidean distance using the following parameters:

    - with K values of 1,5, and 10
    
    - using inverse distance weighting and majority voting 

Finally, report the average accuracy of the KNN models over the 10 rounds. **[1 mark]**

In [12]:
def KNNs(num_dataset):
    KNN1_Acc_1_weighted = []
    KNN5_Acc_1_weighted = []
    KNN10_Acc_1_weighted = []
    KNN1_Acc_1_majority = []
    KNN5_Acc_1_majority = []
    KNN10_Acc_1_majority = []


    ## your code here
    # https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
    # I mainly use the materials shown in this website
    runs = 10
    test_portion = 0.2
    majority_predictions_list = []

    for i in range(runs):
        random_state = i
        X_train, X_test, y_train, y_test = train_test_split(num_dataset, num_dataset.iloc[:,-1]
        , test_size = test_portion, random_state = random_state)
        
        KNN_inverse_1 = KNeighborsClassifier(n_neighbors = 1, weights = 'distance')
        KNN_inverse_5 = KNeighborsClassifier(n_neighbors = 5, weights = 'distance')
        KNN_inverse_10 = KNeighborsClassifier(n_neighbors = 10, weights = 'distance')

        KNN_majority_1 = KNeighborsClassifier(n_neighbors = 1, weights = 'uniform')
        KNN_majority_5 = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform')
        KNN_majority_10 = KNeighborsClassifier(n_neighbors = 10, weights = 'uniform')

        # Fit the model and calculate corresponding accuracy based on predicitions
        #print(X_train.shape)
        #print(y_train.shape)
        #print(X_test.shape)
        #print(y_test.shape)
        KNN_inverse_1.fit(X_train.values, y_train)
        KNN_inverse_5.fit(X_train.values, y_train)
        KNN_inverse_10.fit(X_train.values, y_train)
        KNN_majority_1.fit(X_train.values, y_train)
        KNN_majority_5.fit(X_train.values, y_train)
        KNN_majority_10.fit(X_train.values, y_train)
        
        inverse_predictions_1 = KNN_inverse_1.predict(X_test.values)
        inverse_predictions_5 = KNN_inverse_5.predict(X_test.values)
        inverse_predictions_10 = KNN_inverse_10.predict(X_test.values)
        majority_predictions_1 = KNN_majority_1.predict(X_test.values)
        majority_predictions_5 = KNN_majority_5.predict(X_test.values)
        majority_predictions_10 = KNN_majority_10.predict(X_test.values)
        
        accuracy_inverse_1 = accuracy_score(y_test, inverse_predictions_1)
        accuracy_inverse_5 = accuracy_score(y_test, inverse_predictions_5)
        accuracy_inverse_10 = accuracy_score(y_test, inverse_predictions_10)
        accuracy_majority_1 = accuracy_score(y_test, majority_predictions_1)
        accuracy_majority_5 = accuracy_score(y_test, majority_predictions_5)
        accuracy_majority_10 = accuracy_score(y_test, majority_predictions_10)

        # Add the accuracies
            
        KNN1_Acc_1_weighted.append(accuracy_inverse_1)
        KNN1_Acc_1_majority.append(accuracy_majority_1)
            
        KNN5_Acc_1_weighted.append(accuracy_inverse_5)
        KNN5_Acc_1_majority.append(accuracy_majority_5)
            
        KNN10_Acc_1_weighted.append(accuracy_inverse_10)
        KNN10_Acc_1_majority.append(accuracy_majority_10)
        
        k_NUM = 3
        KNN_3_major = KNeighborsClassifier(n_neighbors = k_NUM)
        KNN_3_major.fit(X_train.values, y_train)
        KNN_3_predictions = KNN_3_major.predict(X_test.values)
        majority_predictions_list.append(KNN_3_predictions)

            
    print("Accuracy of weighted KNN(1):", np.mean(KNN1_Acc_1_weighted).round(2))
    print("Accuracy of weighted KNN(5):", np.mean(KNN5_Acc_1_weighted).round(2))
    print("Accuracy of weighted KNN(10):", np.mean(KNN10_Acc_1_weighted).round(2))
    print("Accuracy of KNN(1):", np.mean(KNN1_Acc_1_majority).round(2))
    print("Accuracy of KNN(5):", np.mean(KNN5_Acc_1_majority).round(2))
    print("Accuracy of KNN(10):", np.mean(KNN10_Acc_1_majority).round(2))
    
    #return majority_predictions_list
    
##Adult Dataset and Student Dataset results: 

print("Adult Dataset KNN results:")
KNNs(adult_num_dataset)

print("Student Dataset KNN results:")
KNNs(student_num_dataset)

  
    

Adult Dataset KNN results:
Accuracy of weighted KNN(1): 0.7
Accuracy of weighted KNN(5): 0.74
Accuracy of weighted KNN(10): 0.76
Accuracy of KNN(1): 0.7
Accuracy of KNN(5): 0.78
Accuracy of KNN(10): 0.79
Student Dataset KNN results:
Accuracy of weighted KNN(1): 0.62
Accuracy of weighted KNN(5): 0.72
Accuracy of weighted KNN(10): 0.75
Accuracy of KNN(1): 0.62
Accuracy of KNN(5): 0.72
Accuracy of KNN(10): 0.72


**B)** Compare the results of the weighted and majority KNN models (for each value of K) and explain any differences you observe for each dataset in terms of the voting strategy and the number of nearest neighbors. **[1 marks]**</br> *NOTE: You may need to compare other performance metrics of these models, such as precision and recall of each class label, to gain a better understanding of their performance. You can use the `classification_report` from `sklearn.metrics` for this matter and check the performance of the classifiers for one round.* 

In [13]:
num_dataset = adult_num_dataset
test_portion = 0.2
X_train, X_test, y_train, y_test = train_test_split(num_dataset, num_dataset.iloc[:,-1]
        , test_size = test_portion, random_state = 6)
        
KNN_inverse_1 = KNeighborsClassifier(n_neighbors = 1, weights = 'distance')
KNN_inverse_5 = KNeighborsClassifier(n_neighbors = 5, weights = 'distance')
KNN_inverse_10 = KNeighborsClassifier(n_neighbors = 10, weights = 'distance')

KNN_majority_1 = KNeighborsClassifier(n_neighbors = 1, weights = 'uniform')
KNN_majority_5 = KNeighborsClassifier(n_neighbors = 5, weights = 'uniform')
KNN_majority_10 = KNeighborsClassifier(n_neighbors = 10, weights = 'uniform')

# Fit the model and calculate corresponding precision based on predicitions

KNN_inverse_1.fit(X_train.values, y_train)
KNN_inverse_5.fit(X_train.values, y_train)
KNN_inverse_10.fit(X_train.values, y_train)
KNN_majority_1.fit(X_train.values, y_train)
KNN_majority_5.fit(X_train.values, y_train)
KNN_majority_10.fit(X_train.values, y_train)
        
inverse_predictions_1 = KNN_inverse_1.predict(X_test.values)
inverse_predictions_5 = KNN_inverse_5.predict(X_test.values)
inverse_predictions_10 = KNN_inverse_10.predict(X_test.values)
majority_predictions_1 = KNN_majority_1.predict(X_test.values)
majority_predictions_5 = KNN_majority_5.predict(X_test.values)
majority_predictions_10 = KNN_majority_10.predict(X_test.values)
print("----------------------KNN_1_inverse-------------------------")
print(classification_report(y_test, inverse_predictions_1, zero_division = 0))
print("----------------------KNN_5_inverse-------------------------")
print(classification_report(y_test, inverse_predictions_5, zero_division = 0))
print("---------------------KNN_10_inverse-------------------------")
print(classification_report(y_test, inverse_predictions_10, zero_division = 0))
print("---------------------KNN_1_majority-------------------------")
print(classification_report(y_test, majority_predictions_1, zero_division = 0))
print("---------------------KNN_5_majority-------------------------")
print(classification_report(y_test, majority_predictions_5, zero_division = 0))
print("---------------------KNN_10_majority-------------------------")
print(classification_report(y_test, majority_predictions_10, zero_division = 0))        


----------------------KNN_1_inverse-------------------------
              precision    recall  f1-score   support

         0.0       0.81      0.82      0.82       386
         1.0       0.36      0.34      0.35       114

    accuracy                           0.71       500
   macro avg       0.59      0.58      0.58       500
weighted avg       0.71      0.71      0.71       500

----------------------KNN_5_inverse-------------------------
              precision    recall  f1-score   support

         0.0       0.80      0.89      0.84       386
         1.0       0.41      0.26      0.32       114

    accuracy                           0.75       500
   macro avg       0.61      0.58      0.58       500
weighted avg       0.71      0.75      0.72       500

---------------------KNN_10_inverse-------------------------
              precision    recall  f1-score   support

         0.0       0.80      0.92      0.86       386
         1.0       0.47      0.24      0.32       114


*Answer Here*   


We use KNN to estimate the accuracy of prediction based on inverse distance and majority voting respectively. It can be found out that accuracy increases as the k value increases. In K nearest neighbours, k, which is a crucial parameter while using it, indicates the number of nearest number we need to select. When k is small, the model may accidentally capture the noise and outliers and the model may be overfit in consequence. Thereby, large k value can ensure the better decision-making. In case of class distribution is imbalanced, using KNN model with inverse distance weighting can assign more weights to closer neighbours since it uses the distance from the neighbours.

  
Next, in terms of precision, recall and F1 score. Basiclly, KNN model with inverse distance weighting can generate more precise values of precision, recall and F1 compared to the model with majority.



**C)** How would standardisation impact the performance of your KNN models and Gaussian Naive Bayes model for the Adult dataset? **[1 marks]**

*Answer Here*   


KNN model is sensitive to the scale of features since it mainly uses the distance to do decision-making. In case of scales of features are different and larger scale may result in the incorrect results since different scales enable the large scale determine the distance metric. However, using standardisation can reduce the negative impact since it can make all features are in similar scale.

Gaussian Naive Bayes essentially uses the distribution of features to do computations and give the corresponding results. Therefore, standardisation may not be neccessarily required while using GNB model. There are still some impacts even though it is not neccessarily required. In case of different scales, standardisation has the potential to make the model work better. Also, it could potentially prevent the overfitting.

Overall, standardisation is vital for KNN model but is not vital for Gaussian Naive Bayes.

## Question 5. Evaluation metrics [2 marks]

**A)** Update the code in questions 2, 3, and 4 to compute the following metrics for the models listed below:

- One-R 
- Gaussian NB 
- Categorical NB
- 3-Nearest Neighbor model with Euclidean distance and majority voting 

Report their performance using the following two metrics
- micro-averaged precision
- macro-averaged precision 
 
Conversely, you can also choose to implement the same 10 rounds of train and test split (80% train, 20% test) as described in the questions 2,3, and 4 in the code block below and report the average scores for the micro-precision and macro-precision.

**[0.5 marks]**

In [14]:
def compare_eval(num_dataset, cat_dataset):
    
    OneR_microP_1 = []
    GNB_microP_1 = []
    CNB_microP_1 = []
    KNN3_microP_1_majority = []

    OneR_macroP_1 = []
    GNB_macroP_1 = []
    CNB_macroP_1 = []
    KNN3_macroP_1_majority = []
    ## your code here
    runs = 10
    test_portion = 0.2
    scores = []
    for i in range(runs):
        random_state = i
        X_train, X_test, y_train, y_test = train_test_split(num_dataset.iloc[:,:-1], num_dataset.iloc[:,-1]
        , test_size = test_portion, random_state = random_state)
        
        # One-R
        score, predict = one_R_model(X_train, X_test, y_train, y_test)
        scores.append(score)
        OneR_microP = precision_score(y_test, predict, average = 'micro', zero_division = 1)
        OneR_macroP = precision_score(y_test, predict, average = 'macro', zero_division = 1)
        OneR_microP_1.append(OneR_microP)
        OneR_macroP_1.append(OneR_macroP)

        # Gaussian Naive Bayes
        gaussian_clf = GaussianNB()
        gaussian_clf.fit(X_train, y_train)
        gaussian_predictions = gaussian_clf.predict(X_test)
        GNB_microP = precision_score(y_test, gaussian_predictions, average = 'micro', zero_division = 1)
        GNB_macroP = precision_score(y_test, gaussian_predictions, average = 'macro', zero_division = 1)
        GNB_microP_1.append(GNB_microP)
        GNB_macroP_1.append(GNB_macroP)

        # Categorical Naive Bayes
        encoder = OrdinalEncoder()
        labels = cat_dataset.iloc[:,-1]
        labels_2d = labels.values.reshape(-1, 1)
        labels_encode = encoder.fit_transform(labels_2d)
        
        new_dataset = cat_dataset.drop(columns = ['ID'])
        header = []
        for feature in new_dataset.columns[:-1]:
            header.append(feature)
        encoded = pd.get_dummies(new_dataset[header], drop_first=True)
        train, test, label_train, label_test= train_test_split(encoded, labels_encode, test_size = test_portion, random_state = counter)
        categorical_clf = CategoricalNB(min_categories=7)
        new_train_label = label_train.ravel()
        categorical_clf.fit(train, new_train_label)
        categorical_predictions = categorical_clf.predict(test)
        CNB_microP = precision_score(label_test.ravel(), categorical_predictions, average = 'micro', zero_division = 1)
        CNB_macroP = precision_score(label_test.ravel(), categorical_predictions, average = 'macro', zero_division = 1)
        CNB_microP_1.append(CNB_microP)
        CNB_macroP_1.append(CNB_macroP)

        # K-nearest neighbours with k = 3
        k_NUM = 3
        KNN_3_major = KNeighborsClassifier(n_neighbors = k_NUM)
        KNN_3_major.fit(X_train.values, y_train)
        KNN_3_predictions = KNN_3_major.predict(X_test.values)
        KNN3_microP = precision_score(y_test, KNN_3_predictions, average = 'micro', zero_division = 1)
        KNN3_macroP = precision_score(y_test, KNN_3_predictions, average = 'macro', zero_division = 1)
        KNN3_microP_1_majority.append(KNN3_microP)
        KNN3_macroP_1_majority.append(KNN3_macroP)
    
    print("Micro-p of One-R:", np.mean(OneR_microP_1).round(2))
    print("Micro-p of GNB:", np.mean(GNB_microP_1).round(2))
    print("Micro-p of CNB:", np.mean(CNB_microP_1).round(2)) 
    print("Micro-p of KNN(3):", np.mean(KNN3_microP_1_majority).round(2))

    print("Macro-p of One-R:", np.mean(OneR_macroP_1).round(2))
    print("Macro-p of GNB:", np.mean(GNB_macroP_1).round(2))
    print("Macro-p of CNB:", np.mean(CNB_macroP_1).round(2)) 
    print("Macro-p of KNN(3):", np.mean(KNN3_macroP_1_majority).round(2))
    

##Adult Dataset and Student Dataset results: 

print("Adult Dataset Evaluation results:")
compare_eval(adult_num_dataset,adult_cat_dataset)

print("Student Dataset Evaluation results:")
compare_eval(student_num_dataset,student_cat_dataset)    
    

Adult Dataset Evaluation results:
Micro-p of One-R: 0.05
Micro-p of GNB: 0.8
Micro-p of CNB: 0.77
Micro-p of KNN(3): 0.76
Macro-p of One-R: 0.38
Macro-p of GNB: 0.74
Macro-p of CNB: 0.78
Macro-p of KNN(3): 0.64
Student Dataset Evaluation results:
Micro-p of One-R: 0.27
Micro-p of GNB: 0.17
Micro-p of CNB: 0.37
Micro-p of KNN(3): 0.21
Macro-p of One-R: 0.75
Macro-p of GNB: 0.35
Macro-p of CNB: 0.6
Macro-p of KNN(3): 0.24


**B)** Compare the average accuracy vs. macro-average and micro-average precision for the two datasets. Explain which evaluation measurement would be most appropriate for each dataset **[1.5 mark]**



*Answer Here*   
In adult data set, One-R and KNN(3) can acheive high accuracy and precision. Micro-average precision and macro - average precision may be treated as the most suitable measurements as dataset is relatively balanced and features are relatively dependent.   
In student data set, gaussian naive bayes gives the highest micro-precision value and macro-precision value among these 4 models. In student data set, data is not well balanced. And by the definitions of micro-average and macro-average, former one maily calculates for each instance and latter one calculates for each class. Hence,  micro-average precision may be most appropriate.
 

## Question 6. Ethics and implications in practice [4 marks]

The Categorical Naive Bayes classifier you developed in this assignment for the student dataset could for example be used to classify college applicants into admitted vs not-admitted depending on their predicted grade in the Student dataset.

**A)** Discuss ethical problems which might arise in this application and lead to unfair treatment of the applicants. Ground your discussion in the set of features provided in the student data set.**[1 marks]**



*Answer Here*  
In the student data set. There are some features which are irrelevant to the final label class.  
The first one is the sensitive information problem. The occupations of father and mother , which are sensitive information to student, even the whole family, are included as features. These features may cause the violation to the privacy In Categorical Naive Bayes model, these privacy features may lead to a wrong prediction.  
The second one is the bias and potential discrimination. Machine learning algorithm will not identify if the train data set is biased or not. Thereby, in a case of a data set with bias towards a certain group of people, Categorical Naive Bayes may use the wrong data to unfairly to label other persons in this group.  
The third problem should be addressed is access to education background. For example, final score of a student may not identified as low mark just because he/she graduated from a underprivileged high school or he/she did not participate many extracurricular activities.  
These features, which may cause ethical problems, should be removed from the original data set.

**B)** Remove all ethically problematic features from the data set (use your own judgment), and train your Naive Bayes classifiers on the resulting data set. How does the performance change in comparison to the full classifier ( consider accuracy and micro-average precision)?**[2 marks]**



In [15]:
## your code here for part B
def new_preprocess(fileName,num_feat, removed):
    data = pd.read_csv(fileName)
    categorical_headers = []

    ## remove columns that may cause ethical problem
    for a in removed:
        data = data.drop(columns = [a])
    
    ## replace missing values with the most frequent for categorical feature
    for feature in data.columns[1:-1]:
        if feature in num_feat:
            data[feature] = pd.to_numeric(data[feature], errors='coerce')
            mean = data[feature].mean()
            data[feature].fillna(mean, inplace=True)
    
    ## replace missing values with the average for numerical features
        else:
            categorical_headers.append(feature)
            mode_values = data[feature].mode()
            mode_value = mode_values[0] # We usually pick the first value in case of there are multiple values with same frequency
            data.loc[data[feature] == "?", feature] = mode_value

    ## convert categorical features to numeric using one-hot encoding
    dataset = data.copy()
    encoder = OrdinalEncoder()
    label_data = (dataset.iloc[:, -1]).values
    label_data = label_data.reshape(-1,1)
    encoded_label = pd.DataFrame(encoder.fit_transform(label_data))
    encoded = pd.get_dummies(dataset[categorical_headers], drop_first=True)
    num_dataset = pd.concat([dataset[num_feat], encoded, encoded_label], axis = 1)
    return num_dataset

removed_columns = ['Mjob', 'Fjob', 'sex']
new_student_num_set = new_preprocess("datasets/student.csv",[], removed_columns)

test_portion = 0.2
runs = 10
BNB_Acc_1 = [] 
micro_P = []
macro_P = []  
full_BNB_Acc_1 = [] 
full_micro_P = []
full_macro_P = []  
for i in range(runs):
    counter = i
    X_train, X_test, y_train, y_test = train_test_split(new_student_num_set.iloc[:,:-1], new_student_num_set.iloc[:, -1], train_size = 1 - test_portion, test_size = test_portion, random_state = counter)
    x_train, x_test, Y_train, Y_test = train_test_split(student_num_dataset.iloc[:,:-1], student_num_dataset.iloc[:, -1], train_size = 1 - test_portion, test_size = test_portion, random_state = counter)
    
    ## full Bernoulli Naive Bayes
    bernoulli_clf_full = BernoulliNB()
    bernoulli_clf_full.fit(x_train, Y_train)
    bernoulli_predictions_full = bernoulli_clf_full.predict(x_test)
    micro_full = precision_score(Y_test, bernoulli_predictions_full, average = 'micro', zero_division = 1)
    macro_full = precision_score(Y_test, bernoulli_predictions_full, average = 'macro', zero_division = 1)
    full_micro_P.append(micro_full)
    full_macro_P.append(macro_full)
    bernoulli_acc_full = accuracy_score(Y_test, bernoulli_predictions_full)
    full_BNB_Acc_1.append(bernoulli_acc_full)
    
    ## new Bernoulli Naive Bayes
    bernoulli_clf = BernoulliNB()
    bernoulli_clf.fit(X_train, y_train)
    bernoulli_predictions = bernoulli_clf.predict(X_test)
    micro = precision_score(y_test, bernoulli_predictions, average = 'micro', zero_division = 1)
    macro = precision_score(y_test, bernoulli_predictions, average = 'macro', zero_division = 1)
    micro_P.append(micro)
    macro_P.append(macro)
    bernoulli_acc = accuracy_score(y_test, bernoulli_predictions)
    BNB_Acc_1.append(bernoulli_acc)

print("Accuracy of full BNB:", np.mean(full_BNB_Acc_1).round(2))  
print("full micro precision:", np.mean(full_micro_P).round(2))
print("full macro precision:", np.mean(full_macro_P).round(2))
print()
print("Accuracy of new BNB:", np.mean(BNB_Acc_1).round(2))  
print("updated micro precision:", np.mean(micro_P).round(2))
print("updated macro precision:", np.mean(macro_P).round(2))


Accuracy of full BNB: 0.31
full micro precision: 0.31
full macro precision: 0.28

Accuracy of new BNB: 0.33
updated micro precision: 0.33
updated macro precision: 0.31


*Answer Here*

After deleting some columns, the accuracy increases but not substantially, it is slow growth. Value of micro-average precision grows slowly as well.

It means that number of features can affect the performance of BNN but not highly affect the performance.

**C)** The approach to fairness we have adopted is called “fairness through unawareness”, where we simply deleted any questionable features from our data. Is removing all problematic features as done in part (b) guarantee a fair classifier? Explain Why or Why not?**[1 marks]**

*Answer Here*

Even though some features, which may result in ethical problems, are deleted. The overall performance does not improve substantially. It means that this approach cannot guarantee the fairness and classifiers still may be able to produce unfair predictions. The algorithm we choose to do classification may learn the hidden bias and then produce the wrong results. Hidden correlations may not be removed and other indirect bias are still existing in the dataset. Sometimes, if the dataset is complex and then the result could be affected since some significant information, which could lead to fairer result, is removed.

# Authorship Declaration:

   (1) I certify that the program contained in this submission is completely
   my own individual work, except where explicitly noted by comments that
   provide details otherwise.  I understand that work that has been developed
   by another student, or by me in collaboration with other students,
   or by non-students as a result of request, solicitation, or payment,
   may not be submitted for assessment in this subject.  I understand that
   submitting for assessment work developed by or in collaboration with
   other students or non-students constitutes Academic Misconduct, and
   may be penalized by mark deductions, or by other penalties determined
   via the University of Melbourne Academic Honesty Policy, as described
   at https://academicintegrity.unimelb.edu.au.

   (2) I also certify that I have not provided a copy of this work in either
   softcopy or hardcopy or any other form to any other student, and nor will
   I do so until after the marks are released. I understand that providing
   my work to other students, regardless of my intention or any undertakings
   made to me by that other student, is also Academic Misconduct.

   (3) I further understand that providing a copy of the assignment
   specification to any form of code authoring or assignment tutoring
   service, or drawing the attention of others to such services and code
   that may have been made available via such a service, may be regarded
   as Student General Misconduct (interfering with the teaching activities
   of the University and/or inciting others to commit Academic Misconduct).
   I understand that an allegation of Student General Misconduct may arise
   regardless of whether or not I personally make use of such solutions
   or sought benefit from such actions.

   <b>Signed by</b>: [Wang Risheng]
   
   <b>Dated</b>: [30/08/2023]