### CS 6840 Intro Machine Learning - Lab Assignment 2

# <center>Building and Analyzing Classification Models</center>

### 1. Overview
The learning objective of this lab assignment is for students to understand different classification models: how to train logistic regression, k-nearest neighbors, support vector machine, and decision tree models and how their key parameters affect them, how to evaluate their classification performance, and how to compare the results across models.

#### Lecture notes. 
Detailed coverage of these topics can be found in the following:
<li>Logistic Regression</li>
<li>Evaluation Metrics for Classification</li>
<li>Cross Validation</li>
<li>k-Nearest Neighbors</li>
<li>Support Vector Machine</li>
<li>Decision Tree</li>

#### Code demonstrations.
<li>Code 2023-09-20-W-Logistic Regression.ipynb</li>
<li>Code 2023-09-25-M-Evaluation Metrics for Classification.ipynb</li>
<li>Code 2023-09-27-W-Cross Validation.ipynb</li>
<li>Code 2023-10-04-W-k-Nearest Neighbors.ipynb</li>
<li>Code 2023-10-11-W-Soft Margin Classification SVM Model.ipynb</li>
<li>Code 2023-10-16-M-Multi-class Classification and Kernel Trick of SVM.ipynb</li>
<li>Code 2023-10-23-M-Decision Tree.ipynb</li>

### 2. Submission
You need to submit a detailed lab report with code, running results, and answers to the questions. If you submit <font color='red'>a Jupyter notebook (“Firstname-Lastname-6840-Lab2.ipynb”)</font>, please fill in this file directly and place the code, running results, and answers in order for each question. If you submit <font color='red'>a PDF report (“Firstname-Lastname-6840-Lab2.pdf”) with a code file (“Firstname-Lastname-6840-Lab2.py”)</font>, please include screenshots of the code and running results, together with your answers, for each question in the report.  

### 3. Questions (50 points)

For this lab assignment, you will be using the `housing dataset` to complete the following tasks and answer the questions. The housing dataset is the California Housing Prices dataset based on data from the 1990 California census. You will use its features to build classification models that predict the `ocean proximity` of a house. First, please place `housing.csv` and your notebook/python file in the same directory, then load and preprocess the data.   

#### Load and preprocess the data

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Please place housing.csv and your notebook/python file in the same directory; otherwise, change DATA_PATH 
DATA_PATH = ""

def load_housing_data(housing_path=DATA_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#Add three useful features
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

#Divide the data frame into features and labels
housing_labels = housing["ocean_proximity"].copy() # use ocean_proximity as classification label
housing_features = housing.drop("ocean_proximity", axis=1) # use columns other than ocean_proximity as features

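#Fill in the missing feature values with each column's median
#(assigning the result back avoids the chained inplace fillna that newer pandas versions warn about)
median = housing_features["total_bedrooms"].median()
housing_features["total_bedrooms"] = housing_features["total_bedrooms"].fillna(median)
median = housing_features["bedrooms_per_room"].median()
housing_features["bedrooms_per_room"] = housing_features["bedrooms_per_room"].fillna(median)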

#Scale the features
std_scaler  = StandardScaler()
housing_features_scaled = std_scaler.fit_transform(housing_features)

#Final housing features X
X = housing_features_scaled

#Binary labels - 0: INLAND; 1: CLOSE TO OCEAN
y_binary = (housing_labels != 1).astype(np.float64)
#Multi-class labels - 0: <1H OCEAN; 1: INLAND; 2: NEAR OCEAN; 3: NEAR BAY
y_multi = housing_labels.astype(np.float64)

#Data splits for binary classification
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X, y_binary, test_size=0.20, random_state=42)

#Data splits for multi-class classification
X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.20, random_state=42)

<font color='red'><b>About the data used in this assignment: </b></font><br>
**All the binary classification models are trained on `X_train_bi`, `y_train_bi`, and evaluated on `X_test_bi`, `y_test_bi`.**<br>
**All the multi-class classification models are trained on `X_train_mu`, `y_train_mu`, and evaluated on `X_test_mu`, `y_test_mu`.**<br>
**k-fold cross validation is performed directly on `X` and `y_multi`.**


#### Question 1 (4 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a logistic regression binary classification model in function `answer_one( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set `solver="newton-cg"` and `random_state=42` in `LogisticRegression` to guarantee that the training-loss minimization converges** 

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def answer_one():
    #Create and train a binary_reg
    binary_reg = LogisticRegression(solver="newton-cg", random_state=42)
    binary_reg.fit(X_train_bi, y_train_bi)
    
    #Use binary_reg to make prediction y_pred_bi
    y_pred_bi = binary_reg.predict(X_test_bi)
    
    #Accuracy
    binary_reg_accuracy = accuracy_score(y_test_bi, y_pred_bi)
    
    #F1 score
    binary_reg_f1 = f1_score(y_test_bi, y_pred_bi)
    
    return binary_reg_accuracy, binary_reg_f1

#Run your function in the cell to return the results
accuracy_1, f1_1 = answer_one()
print("Accuracy Score:", accuracy_1)
print("F1 Score:", f1_1)

Accuracy Score: 0.9600290697674418
F1 Score: 0.9710068529256721


#### Answer 1:  
Accuracy is: 96.00290697674418% <br>
F1 score is: 97.10068529256721%
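
Accuracy and F1 weigh errors differently, which is why the two numbers above do not match. The following is a minimal sketch in plain Python of how both are computed from the prediction counts, assuming label 1 is the positive class (the default in `f1_score`). The function name `binary_accuracy_f1` and the toy labels are illustrative and are not part of the assignment.

```python
def binary_accuracy_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); set to 0 when the denominator is empty
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels (made up for illustration): 4 of 6 correct, TP=3, FP=1, FN=1
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
toy_accuracy, toy_f1 = binary_accuracy_f1(toy_true, toy_pred)
print("Accuracy:", toy_accuracy)
print("F1 Score:", toy_f1)
```

True negatives count toward accuracy but never enter the F1 formula, so the two metrics diverge whenever the classes are unevenly represented.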

#### Question 2 (4 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a softmax regression multi-class classification model in function `answer_two( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `multi_class="multinomial"`, `solver="newton-cg"` and `random_state=42` in `LogisticRegression` to guarantee the convergence of multi-class training**  

In [5]:
def answer_two():
    #Create and train a multi_reg
    multi_reg = LogisticRegression(multi_class="multinomial", solver="newton-cg", random_state=42)
    multi_reg.fit(X_train_mu, y_train_mu)

    #Use multi_reg to make prediction y_pred_mu
    y_pred_mu = multi_reg.predict(X_test_mu)
    
    #Accuracy
    multi_reg_accuracy = accuracy_score(y_test_mu, y_pred_mu)
    
    #Micro F1 score
    multi_reg_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
    
    #Macro F1 score
    multi_reg_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
    
    return multi_reg_accuracy, multi_reg_microf1, multi_reg_macrof1

#Run your function in the cell to return the results
accuracy_2, microf1_2, macrof1_2 = answer_two()
print("Accuracy Score: ", accuracy_2)
print("Micro F1 Score: ", microf1_2)
print("Macro F1 Score: ", macrof1_2)

Accuracy Score:  0.7974806201550387
Micro F1 Score:  0.7974806201550388
Macro F1 Score:  0.6847642281014538


#### Answer 2:  
Accuracy is: 79.74806201550387% <br>
Micro f1 score is: 79.74806201550388% <br>
Macro f1 score is: 68.47642281014538%
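
The gap between the micro and macro F1 scores above comes from how the per-class counts are combined. The sketch below, written in plain Python so the arithmetic is visible, pools the counts for micro F1 and averages the per-class F1 scores for macro F1. The helper names and the toy labels are assumptions made for illustration only.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), set to 0 when the denominator is empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multi-class predictions."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    counts = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts.append((tp, fp, fn))

    # Macro: average the per-class F1 scores, every class weighted equally
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1
    micro = f1_from_counts(*(sum(column) for column in zip(*counts)))
    return micro, macro


# Toy labels (made up for illustration): class 0 dominates, class 2 is never predicted
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 0, 1, 0, 0]
toy_micro, toy_macro = micro_macro_f1(toy_true, toy_pred)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print("Accuracy:", toy_accuracy)
print("Micro F1:", toy_micro)
print("Macro F1:", toy_macro)
```

In this toy case micro F1 equals accuracy, while macro F1 is pulled down by the rare classes that are predicted poorly, the same pattern as in the softmax results.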

#### Question 3 (6 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a k-nearest neighbors binary classification model in function `answer_three( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set the option `n_neighbors=` in `KNeighborsClassifier` using `1`, `3`, `5`, `7`, and `9` respectively to find an optimal value `k`**   

In [6]:
from sklearn.neighbors import KNeighborsClassifier

def answer_three():
    # Going to add a list of values to loop through to determine the optimal k value
    potential_k_values = [1, 3, 5, 7, 9]
    optimal_k = -1
    optimal_k_accuracy = 0
    optimal_k_f1 = 0
    
    # Loop through the potential k values training the model each time
    for k in potential_k_values:
        #Train a binary_knn
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train_bi, y_train_bi)
    
        #Use binary_knn to make prediction y_pred_bi
        y_pred_bi = knn.predict(X_test_bi)
        
        # Calculate potential k's accuracy and F1 score
        potential_k_accuracy = accuracy_score(y_test_bi, y_pred_bi)
        potential_k_f1 = f1_score(y_test_bi, y_pred_bi)
        print("For k = ", k, " accuracy is: ", potential_k_accuracy, " F1 Score is: ", potential_k_f1)
        
        # Update the final accuracy and F1 if the scores are higher
        if potential_k_accuracy > optimal_k_accuracy or (potential_k_accuracy == optimal_k_accuracy and potential_k_f1 > optimal_k_f1):
            optimal_k = k
            optimal_k_accuracy = potential_k_accuracy
            optimal_k_f1 = potential_k_f1
    
    # Best k Score:
    binary_k = optimal_k
    
    #Accuracy
    binary_knn_accuracy = optimal_k_accuracy 
    
    #F1 score
    binary_knn_f1 = optimal_k_f1
    
    return binary_k, binary_knn_accuracy, binary_knn_f1

#Run your function in the cell to return the results
k_3, accuracy_3, f1_3 = answer_three()
print("Optimal k value: ", k_3)
print("Optimal k accuracy: ", accuracy_3)
print("Optimal k F1 Score: ", f1_3)

For k =  1  accuracy is:  0.9256298449612403  F1 Score is:  0.945615589016829
For k =  3  accuracy is:  0.9367732558139535  F1 Score is:  0.9543307086614174
For k =  5  accuracy is:  0.935077519379845  F1 Score is:  0.9533101045296168
For k =  7  accuracy is:  0.9367732558139535  F1 Score is:  0.9546165884194053
For k =  9  accuracy is:  0.9358042635658915  F1 Score is:  0.953953084274544
Optimal k value:  7
Optimal k accuracy:  0.9367732558139535
Optimal k F1 Score:  0.9546165884194053


#### Answer 3:  
When k = 1, accuracy is: 92.56298449612403%<br>
When k = 3, accuracy is: 93.67732558139535%<br>
When k = 5, accuracy is: 93.5077519379845%<br>
When k = 7, accuracy is: 93.67732558139535%<br>
When k = 9, accuracy is: 93.58042635658915%<br>
Optimal k (`n_neighbors`) is: 7 (k = 3 and k = 7 tie on accuracy, and k = 7 has the higher F1 score), accuracy is: 93.67732558139535%, F1 score is: 95.46165884194053%<br>
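
The selection rule used in `answer_three( )` (highest accuracy, with F1 as the tie-breaker) can also be written as a single `max` over `(accuracy, F1)` tuples, because Python compares tuples element by element. This is only a sketch of an alternative; the dictionary name `scores_by_k` is made up, and its values are copied from the printed results above.

```python
# (accuracy, F1) for each k, copied from the printed results above
scores_by_k = {
    1: (0.9256298449612403, 0.945615589016829),
    3: (0.9367732558139535, 0.9543307086614174),
    5: (0.935077519379845, 0.9533101045296168),
    7: (0.9367732558139535, 0.9546165884194053),
    9: (0.9358042635658915, 0.953953084274544),
}

# Tuples compare by accuracy first and fall back to F1 only on an accuracy tie
best_k = max(scores_by_k, key=lambda k: scores_by_k[k])
best_accuracy, best_f1 = scores_by_k[best_k]
print("Optimal k value:", best_k)
print("Optimal k accuracy:", best_accuracy)
print("Optimal k F1 Score:", best_f1)
```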

#### Question 4 (7 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a k-nearest neighbors multi-class classification model in function `answer_four( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, macro F1 score, loading time, and prediction time.

**Set `n_neighbors=5` in `KNeighborsClassifier` and set the option `algorithm=` using `'brute'`, `'kd_tree'`, and `'ball_tree'` respectively to compare the time used by each**  

In [12]:
import time

def answer_four():
    algorithms = ['brute', 'kd_tree', 'ball_tree']
    results_a4 = []
    fastest_load_time = {}
    fastest_prediction_time = {}
    highest_accuracy = {}
    highest_micro_f1 = {}
    highest_macro_f1 = {}
    
    for algorithm in algorithms:
        #Add a time checkpoint here
        load_time_start = time.time()
        
        #Train a multi_knn
        multi_knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
        multi_knn.fit(X_train_mu, y_train_mu)
        
        #Add a time checkpoint here
        load_time_end = time.time()
        
        #Use multi_knn to make prediction y_pred_mu
        y_pred_mu = multi_knn.predict(X_test_mu)
        
        #Add a time checkpoint here
        prediction_time_end = time.time()
        
        #Accuracy:
        multi_knn_accuracy = accuracy_score(y_test_mu, y_pred_mu)
        
        #Micro F1 Score:
        multi_knn_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
        
        #Macro F1 Score:
        multi_knn_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
        
        #time used for data loading:
        multi_knn_loadtime = load_time_end - load_time_start
        
        #time used for prediction:
        multi_knn_predictiontime = prediction_time_end - load_time_end
        
        # Add results to list of results
        results_a4.append((algorithm, multi_knn_accuracy, multi_knn_microf1, multi_knn_macrof1, multi_knn_loadtime, multi_knn_predictiontime))

        # Update the fastest load time
        if not fastest_load_time or multi_knn_loadtime < next(iter(fastest_load_time.values())):
            fastest_load_time = {algorithm: multi_knn_loadtime}
        elif multi_knn_loadtime == next(iter(fastest_load_time.values())):
            fastest_load_time[algorithm] = multi_knn_loadtime

        # Update the fastest prediction time
        if not fastest_prediction_time or multi_knn_predictiontime < next(iter(fastest_prediction_time.values())):
            fastest_prediction_time = {algorithm: multi_knn_predictiontime}
        elif multi_knn_predictiontime == next(iter(fastest_prediction_time.values())):
            fastest_prediction_time[algorithm] = multi_knn_predictiontime

        # Update the highest accuracy
        if not highest_accuracy or multi_knn_accuracy > next(iter(highest_accuracy.values())):
            highest_accuracy = {algorithm: multi_knn_accuracy}
        elif multi_knn_accuracy == next(iter(highest_accuracy.values())):
            highest_accuracy[algorithm] = multi_knn_accuracy

        # Update the highest micro F1
        if not highest_micro_f1 or multi_knn_microf1 > next(iter(highest_micro_f1.values())):
            highest_micro_f1 = {algorithm: multi_knn_microf1}
        elif multi_knn_microf1 == next(iter(highest_micro_f1.values())):
            highest_micro_f1[algorithm] = multi_knn_microf1

        # Update the highest macro F1
        if not highest_macro_f1 or multi_knn_macrof1 > next(iter(highest_macro_f1.values())):
            highest_macro_f1 = {algorithm: multi_knn_macrof1}
        elif multi_knn_macrof1 == next(iter(highest_macro_f1.values())):
            highest_macro_f1[algorithm] = multi_knn_macrof1
    
    return results_a4, fastest_load_time, fastest_prediction_time, highest_accuracy, highest_micro_f1, highest_macro_f1


#Run your function in the cell to return the results
results_4, load_time_4, prediction_time_4, accuracy_4, microf1_4, macrof1_4 = answer_four()
for result in results_4:
    print(f"Algorithm: {result[0]}, Load Time: {result[4]}s, Prediction Time: {result[5]}s, Accuracy: {result[1]}, Micro F1: {result[2]}, Macro F1: {result[3]}")

print("Best Load Time:")
for key in load_time_4:
    print(key, " - ", load_time_4.get(key), "seconds")

print("Best Prediction Time:")
for key in prediction_time_4:
    print(key, " - ", prediction_time_4.get(key), "seconds")
    
print("Best Accuracy:")
for key in accuracy_4:
    print(key, " - ", accuracy_4.get(key))

print("Best Micro F1:")
for key in microf1_4:
    print(key, " - ", microf1_4.get(key))

print("Best Macro F1:")
for key in macrof1_4:
    print(key, " - ", macrof1_4.get(key))

Algorithm: brute, Load Time: 0.002195119857788086s, Prediction Time: 0.04861092567443848s, Accuracy: 0.811046511627907, Micro F1: 0.811046511627907, Macro F1: 0.7542631097247448
Algorithm: kd_tree, Load Time: 0.004649162292480469s, Prediction Time: 0.19477486610412598s, Accuracy: 0.811046511627907, Micro F1: 0.811046511627907, Macro F1: 0.7542631097247448
Algorithm: ball_tree, Load Time: 0.004762887954711914s, Prediction Time: 0.3692941665649414s, Accuracy: 0.811046511627907, Micro F1: 0.811046511627907, Macro F1: 0.7542631097247448
Best Load Time:
brute  -  0.002195119857788086 seconds
Best Prediction Time:
brute  -  0.04861092567443848 seconds
Best Accuracy:
brute  -  0.811046511627907
kd_tree  -  0.811046511627907
ball_tree  -  0.811046511627907
Best Micro F1:
brute  -  0.811046511627907
kd_tree  -  0.811046511627907
ball_tree  -  0.811046511627907
Best Macro F1:
brute  -  0.7542631097247448
kd_tree  -  0.7542631097247448
ball_tree  -  0.7542631097247448


#### Answer 4:  
<b>Brute force: </b> data loading time is: 0.002195119857788086s, prediction time is: 0.04861092567443848s, accuracy is: 81.1046511627907%, micro f1 score is: 81.1046511627907%, macro f1 score is: 75.42631097247448% <br>
<b>K-d tree: </b> data loading time is: 0.004649162292480469s, prediction time is: 0.19477486610412598s, accuracy is: 81.1046511627907%, micro f1 score is: 81.1046511627907%, macro f1 score is: 75.42631097247448% <br>
<b>Ball tree: </b> data loading time is: 0.004762887954711914s, prediction time is: 0.3692941665649414s, accuracy is: 81.1046511627907%, micro f1 score is: 81.1046511627907%, macro f1 score is: 75.42631097247448% <br>
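Summarize your observations about the time used by these searching algorithms: Brute force had both the fastest loading time and the fastest prediction time. Its prediction was roughly 4 times faster than the k-d tree and roughly 7.6 times faster than the ball tree. <br>
Observations about the classification performance: All three algorithms have identical accuracy, micro F1, and macro F1 scores. This is expected, because all three are exact nearest-neighbor searches that differ only in how they find the neighbors.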
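
Timings this short are sensitive to noise from a single run. One way to make the comparison steadier, sketched here under the assumption that repeating the call is acceptable, is to time it several times with `time.perf_counter` and report the median. The helper `median_runtime` is hypothetical and is not part of the assignment code.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Call func() `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload for illustration; in the notebook this could instead be
# something like: lambda: multi_knn.predict(X_test_mu)
elapsed = median_runtime(lambda: sum(i * i for i in range(100_000)), repeats=5)
print(f"Median time: {elapsed:.6f}s")
```

`time.perf_counter` is intended for measuring short durations, and the median is less affected by one unusually slow run than the mean.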

#### Question 5 (7 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a support vector machine binary classification model in function `answer_five( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set `random_state=42` in `SVC`, and set the kernel function `kernel=` using `'linear'`, `'rbf'`, and `'poly'` respectively to compare different performance** 

In [14]:
from sklearn.svm import SVC

def answer_five():
    # Create a list of models and a list to store the results
    kernels = ['linear', 'rbf', 'poly']
    results_a5 = []
    
    # Create a dictionary to store the best performing model
    best_performance = {'Kernel': '', 'Accuracy': 0, 'F1': 0}
    
    # Traverse the list of kernels
    for kernel in kernels:
        
        #Train a binary_svm
        binary_svm = SVC(kernel=kernel, random_state=42)
        binary_svm.fit(X_train_bi, y_train_bi)

        #Use binary_svm to make prediction y_pred_bi
        y_pred_bi = binary_svm.predict(X_test_bi)
    
        #Accuracy
        binary_svm_accuracy = accuracy_score(y_test_bi, y_pred_bi)
        
        #F1 score
        binary_svm_f1 = f1_score(y_test_bi, y_pred_bi)
        
        # Add result to list of results
        results_a5.append((kernel, binary_svm_accuracy, binary_svm_f1))
        
        # Update the best performing model (accuracy first, F1 as tie-breaker)
        if binary_svm_accuracy > best_performance['Accuracy'] or (binary_svm_accuracy == best_performance['Accuracy'] and binary_svm_f1 > best_performance['F1']):
            best_performance['Kernel'] = kernel
            best_performance['Accuracy'] = binary_svm_accuracy
            best_performance['F1'] = binary_svm_f1
    
    return results_a5, best_performance

#Run your function in the cell to return the results
results_5, best_performance_5 = answer_five()
for result in results_5:
    print(f"Kernel: {result[0]}, Accuracy: {result[1]}, F1 Score: {result[2]}\n")

print("Best Performing Model:\n")
print(f"Kernel: {best_performance_5['Kernel']}, Accuracy: {best_performance_5['Accuracy']}, F1 Score: {best_performance_5['F1']}")

Kernel: linear, Accuracy: 0.9617248062015504, F1 Score: 0.9721340388007055

Kernel: rbf, Accuracy: 0.9656007751937985, F1 Score: 0.9751574527641709

Kernel: poly, Accuracy: 0.9341085271317829, F1 Score: 0.9532967032967032

Best Performing Model:

Kernel: rbf, Accuracy: 0.9656007751937985, F1 Score: 0.9751574527641709


#### Answer 5:  
<b>Linear kernel: </b> accuracy is: 96.17248062015504%, and f1 score is: 97.21340388007055% <br> 
<b>RBF kernel: </b> accuracy is: 96.56007751937985%, and f1 score is: 97.51574527641709% <br> 
<b>Polynomial kernel: </b> accuracy is: 93.41085271317829%, and f1 score is: 95.32967032967032% <br>
Summarize your observations about the performance derived by these different kernels: The best performing kernel was the RBF kernel which had both the highest accuracy and F1 score.
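
The three kernels compared above differ only in how they measure similarity between two samples. The sketch below writes the standard formulas out in plain Python for a single pair of vectors. The function names are made up, `degree=3` and `coef0=0.0` match the `SVC` defaults, and `gamma=1.0` is an arbitrary illustrative value (the `SVC` default is `gamma='scale'`, which depends on the data).

```python
import math


def kernel_linear(x, z):
    """Linear kernel: the plain dot product <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def kernel_poly(x, z, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * kernel_linear(x, z) + coef0) ** degree


def kernel_rbf(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * ||x - z||^2), which is 1 for identical points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two toy feature vectors (made up for illustration)
x = [1.0, 2.0]
z = [3.0, 4.0]
print("linear:", kernel_linear(x, z))
print("poly:  ", kernel_poly(x, z))
print("rbf:   ", kernel_rbf(x, z))
```

The RBF value depends only on the distance between the two points and decays toward 0 as they move apart, whereas the linear and polynomial values grow with the magnitude of the features.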

#### Question 6 (6 points):
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a support vector machine multi-class classification model in function `answer_six( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `kernel='rbf'`, `random_state=42` in `SVC`, and set `decision_function_shape=` using `'ovr'` and `'ovo'` respectively to compare different performance and time cost**  

In [20]:
def answer_six():
    
    # Create a list of decision functions and an empty list to store results
    decision_functions = ['ovr', 'ovo']
    results_a6 = []
    
    
    # Traverse the decision_function list and train the SVM
    for decision_function in decision_functions:
        
        #Add a time checkpoint here
        time1 = time.time()
        
        #Train a multi_svm
        multi_svm = SVC(kernel='rbf', decision_function_shape=decision_function, random_state=42)
        multi_svm.fit(X_train_mu, y_train_mu)
    
        #Use multi_svm to make prediction y_pred_mu
        y_pred_mu = multi_svm.predict(X_test_mu)
        
        #Add a time checkpoint here
        time2 = time.time()
        
        #Accuracy
        multi_svm_accuracy = accuracy_score(y_test_mu, y_pred_mu)
        
        #Micro F1 score
        multi_svm_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
        
        #Macro F1 score
        multi_svm_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
        
        #time used
        multi_svm_time = time2 - time1
        
        results_a6.append((decision_function, multi_svm_time, multi_svm_accuracy, multi_svm_microf1, multi_svm_macrof1))
    
    return results_a6

#Run your function in the cell to return the results
results_6 = answer_six()
for result in results_6:
    print(f"Decision Function: {result[0]}, Training Time: {result[1]}s, Accuracy: {result[2]}, Micro F1: {result[3]}, Macro F1: {result[4]}")

Decision Function: ovr, Training Time: 2.510913133621216s, Accuracy: 0.8439922480620154, Micro F1: 0.8439922480620154, Macro F1: 0.7813377183100643
Decision Function: ovo, Training Time: 2.421372890472412s, Accuracy: 0.8439922480620154, Micro F1: 0.8439922480620154, Macro F1: 0.7813377183100643


#### Answer 6:  
<b>One-vs-one (ovo): </b> time used is: 2.421372890472412s, accuracy is: 84.39922480620154%, micro f1 score is: 84.39922480620154%, macro f1 score is: 78.13377183100643% <br>
<b>One-vs-rest (ovr): </b> time used is: 2.510913133621216s, accuracy is: 84.39922480620154%, micro f1 score is: 84.39922480620154%, macro f1 score is: 78.13377183100643% <br>
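Summarize your observations about the time used by these multi-class methods: One-vs-one was slightly faster than one-vs-rest (2.42 s versus 2.51 s for training plus prediction), but the gap of about 0.09 s comes from a single run and may be within timing noise. <br>
Observations about the classification performance: Both settings had identical accuracy, micro F1, and macro F1 scores. This is expected, since scikit-learn's `SVC` always trains one-vs-one classifiers internally and `decision_function_shape` only changes the shape of the decision function it returns. With equal scores, ovo was marginally the better option here because it finished in slightly less time.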
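
As a general illustration of the two multi-class strategies, the sketch below lists the binary problems each one would set up for the four class labels used in this assignment. It describes the strategies in the abstract, not what `SVC` does internally, and the function names are hypothetical.

```python
from itertools import combinations


def ovr_tasks(labels):
    """One-vs-rest: one binary problem per class, that class against all others."""
    return [(label, tuple(other for other in labels if other != label)) for label in labels]


def ovo_tasks(labels):
    """One-vs-one: one binary problem per unordered pair of classes."""
    return list(combinations(labels, 2))


# The four multi-class labels used in this assignment
class_labels = [0, 1, 2, 3]
rest_tasks = ovr_tasks(class_labels)
pair_tasks = ovo_tasks(class_labels)
print("ovr problems:", len(rest_tasks), rest_tasks)
print("ovo problems:", len(pair_tasks), pair_tasks)
```

For k classes, one-vs-rest needs k binary problems, each trained on all samples, while one-vs-one needs k(k-1)/2 problems, each trained only on the samples of two classes.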

#### Question 7 (3 points):
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Based on the results from Question 1 to Question 6: <br>
The model with best binary classification performance is: SVM with the RBF kernel (96.56% accuracy, 97.52% F1) <br>
The model with worst binary classification performance is: kNN (93.68% accuracy and 95.46% F1 at its best setting, k = 7) <br>
The model with best multi-class classification performance is: SVM with the RBF kernel, where ovo and ovr scored identically (84.40% accuracy, 78.13% macro F1) <br>
The model with worst multi-class classification performance is: Softmax Regression (79.75% accuracy, 68.48% macro F1) <br>
Summarize your personal thoughts on the model choices: Every binary model outperformed every multi-class model, but a binary model can only separate inland houses from houses close to the ocean, which limits the predictions it can make. Among the multi-class models, the SVM had the best scores but was also the slowest of the models that were timed (about 2.4 s to train and predict, versus under 0.4 s for kNN). That cost is negligible on a small dataset like this one, but it would likely become a major factor on much larger datasets. Overall, the tradeoff appears to be that finer-grained predictions tend to be less accurate than simple binary ones, and that raising their accuracy tends to cost more computation time.

#### Question 8 (6 points):
Please use `X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.40)` to split the data 5 separate times. Each time, use `X_train_mu` and `y_train_mu` to train a decision tree in function `answer_eight( )`, then use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `max_depth=4`, `random_state=42` and `criterion='gini'` in `DecisionTreeClassifier`**

In [24]:
from sklearn.tree import DecisionTreeClassifier

def answer_eight():
    
    # Create lists to store accuracy, micro f1, and macro f1 for each split
    accuracies = []
    micro_f1s = []
    macro_f1s = []
    
    # Loop for 5 runs
    for run in range(5):
        X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.40)
        
        multi_dt = DecisionTreeClassifier(max_depth=4, criterion='gini', random_state=42)
        multi_dt.fit(X_train_mu, y_train_mu)
      
        y_pred_mu = multi_dt.predict(X_test_mu)
        
        #Accuracy
        multi_dt_accuracy = accuracy_score(y_test_mu, y_pred_mu)
        accuracies.append(multi_dt_accuracy)
        
        #Micro F1 score
        multi_dt_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
        micro_f1s.append(multi_dt_microf1)
        
        #Macro F1 score
        multi_dt_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
        macro_f1s.append(multi_dt_macrof1)
        
        # Print info for current split
        print("Run #: ", run + 1)
        print("Accuracy: ", multi_dt_accuracy)
        print("Micro F1: ", multi_dt_microf1)
        print("Macro F1: ", multi_dt_macrof1)
        
    avg_accuracy = sum(accuracies) / len(accuracies)
    avg_microf1 = sum(micro_f1s) / len(micro_f1s)
    avg_macrof1 = sum(macro_f1s) / len(macro_f1s)
    
    return avg_accuracy,avg_microf1, avg_macrof1 

#Run your function in the cell to return the results
accuracy_8, microf1_8, macrof1_8 = answer_eight()
print("Average for All:")
print("Average Accuracy: ", accuracy_8)
print("Average Micro F1 Score: ", microf1_8)
print("Average Macro F1 Score: ", macrof1_8)

Run #:  1
Accuracy:  0.8313953488372093
Micro F1:  0.8313953488372093
Macro F1:  0.766801220060083
Run #:  2
Accuracy:  0.843265503875969
Micro F1:  0.843265503875969
Macro F1:  0.7833267119512235
Run #:  3
Accuracy:  0.8270348837209303
Micro F1:  0.8270348837209303
Macro F1:  0.7781206898058822
Run #:  4
Accuracy:  0.8349079457364341
Micro F1:  0.8349079457364341
Macro F1:  0.7704587777706714
Run #:  5
Accuracy:  0.8399951550387597
Micro F1:  0.8399951550387597
Macro F1:  0.7702341098959629
Average for All:
Average Accuracy:  0.8353197674418604
Average Micro F1 Score:  0.8353197674418604
Average Macro F1 Score:  0.7737883018967646


#### Answer 8:  
First run: <br>
Accuracy is: 83.13953488372093%, Micro f1 score is: 83.13953488372093%, Macro f1 score is: 76.6801220060083% <br><br>
Second run: <br>
Accuracy is: 84.3265503875969%, Micro f1 score is: 84.3265503875969%, Macro f1 score is: 78.33267119512235% <br><br>
Third run: <br>
Accuracy is: 82.70348837209303%, Micro f1 score is: 82.70348837209303%, Macro f1 score is: 77.81206898058822% <br><br>
Fourth run: <br>
Accuracy is: 83.49079457364341%, Micro f1 score is: 83.49079457364341%, Macro f1 score is: 77.04587777706714% <br><br>
Fifth run: <br>
Accuracy is: 83.99951550387597%, Micro f1 score is: 83.99951550387597%, Macro f1 score is: 77.02341098959629% <br><br>
Summarize your observations why these results vary and the disadvantages of hold-out evaluation: Accuracy and micro F1 are identical within each run because the two are mathematically equivalent for single-label multi-class classification. The value varies between runs, though not by much (82.7% to 84.3%, a range of roughly 1.6 percentage points), and the macro F1 score shows similar variability. Because `train_test_split` is called without a `random_state`, each run draws a different random split, so the model is trained and tested on different samples each time. This points to a key disadvantage of hold-out evaluation: the estimate depends on the particular split. Different splits produce different class distributions in the training and test sets, which affects the measured performance. The effect could be much worse for smaller datasets, or for splits that leave the class balance more uneven between the training and test sets.
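
The run-to-run variation can be summarized with a few lines of the standard library. This sketch uses the five accuracy values printed above; the variable names are made up for illustration.

```python
import statistics

# Accuracy of the five hold-out runs, copied from the printed results above
holdout_accuracies = [
    0.8313953488372093,
    0.843265503875969,
    0.8270348837209303,
    0.8349079457364341,
    0.8399951550387597,
]

mean_accuracy = statistics.mean(holdout_accuracies)
# Range: gap between the best and worst run, in accuracy units
accuracy_range = max(holdout_accuracies) - min(holdout_accuracies)
# Sample standard deviation across the five runs
accuracy_std = statistics.stdev(holdout_accuracies)

print("Mean accuracy:", mean_accuracy)
print("Range:", accuracy_range)
print("Standard deviation:", accuracy_std)
```

Reporting the spread alongside the mean makes it clear how much a single hold-out estimate could differ from another one drawn from the same data.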

#### Question 9 (7 points):
Please use `X` and `y_multi` to implement k-fold cross validation in function `answer_nine( )` to evaluate decision tree multi-class classification model, including the mean of accuracy, micro F1 score, and macro F1 score.

**Set `max_depth=4`, `random_state=42` and `criterion='gini'` in `DecisionTreeClassifier`**

**Set `cv=5` and `scoring=("accuracy", "f1_micro", "f1_macro")` in `cross_validate` to return the cross-validation evaluation results**

In [27]:
from sklearn.model_selection import cross_validate
from statistics import mean

def answer_nine():
    multi_dt = DecisionTreeClassifier(max_depth=4, criterion='gini', random_state=42)
    
    #Cross validation evaluation
    cv_results = cross_validate(multi_dt, X, y_multi, cv=5, scoring=("accuracy", "f1_micro", "f1_macro"))
    
    #Accuracy: use mean()
    multi_dt_accuracy = mean(cv_results['test_accuracy'])
    
    #Micro F1 score: use mean()
    multi_dt_microf1 = mean(cv_results['test_f1_micro'])
    
    #Macro F1 score: use mean()
    multi_dt_macrof1 = mean(cv_results['test_f1_macro'])
    
    return multi_dt_accuracy, multi_dt_microf1, multi_dt_macrof1

#Run your function in the cell to return the results
accuracy_9, microf1_9, macrof1_9 = answer_nine()
print("Mean Accuracy:", accuracy_9)
print("Mean Micro F1 Score:", microf1_9)
print("Mean Macro F1 Score:", macrof1_9)

Mean Accuracy: 0.6973837209302326
Mean Micro F1 Score: 0.6973837209302326
Mean Macro F1 Score: 0.6475799195256458


#### Answer 9:  
Accuracy using 5-fold cross validation is: 69.73837209302326% <br>
Micro f1 score using 5-fold cross validation is: 69.73837209302326% <br>
Macro f1 score using 5-fold cross validation is: 64.75799195256458% <br>
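Compared to the classification results in Question 8, what is your observation: the accuracy and F1 scores are much lower than in Question 8 (69.7% accuracy versus an average of 83.5%), and why that happens: this is likely because the hold-out splits in Question 8 gave an optimistic estimate of the model's performance. One likely contributor is that `train_test_split` shuffles the rows before splitting, whereas `cross_validate` with `cv=5` builds its folds without shuffling. If the rows of `housing.csv` are ordered (for example, by location), each test fold contains houses unlike those the model was trained on, which lowers the scores. <br>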
Summarize the advantages and disadvantages of cross validation: Overall, cross validation offers a more reliable and complete assessment of a model than a single hold-out split. Because every sample is used for testing exactly once, it makes better use of the data and reduces the dependence of the estimate on any one split. The main disadvantage is cost: k-fold cross validation trains k models, so evaluation time grows roughly k-fold, which becomes a problem on much larger datasets. The other challenge is that the k models are each trained on a different subset, so no single validated model results. Retraining on the entire dataset yields one model, but that final model is not itself validated.
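
How the folds are formed can be sketched in plain Python. This is a simplified, non-stratified k-fold written for illustration (for classifiers, `cross_validate` with an integer `cv` uses a stratified variant), and the function name `kfold_indices` is hypothetical.

```python
import random


def kfold_indices(n_samples, n_folds=5, shuffle=False, seed=None):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)

    # Spread the remainder so the first n_samples % n_folds folds get one extra sample
    base, extra = divmod(n_samples, n_folds)
    fold_sizes = [base + 1 if i < extra else base for i in range(n_folds)]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Without shuffling, each test fold is a contiguous block of rows
folds = list(kfold_indices(10, n_folds=5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: test={test} train={train}")

# With shuffling, the same rows are spread across the folds
shuffled_folds = list(kfold_indices(10, n_folds=5, shuffle=True, seed=42))
```

Each row appears in exactly one test fold. When the rows are ordered in some meaningful way, unshuffled folds test on a block the model never saw during training, which can lower the scores compared with a shuffled split.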