### CS 6840 Intro Machine Learning - Lab Assignment 2

# <center>Building and Analyzing Classification Models</center>

### 1. Overview
The learning objective of this lab assignment is to understand different classification models. You will train logistic regression, k-nearest neighbors, support vector machine, and decision tree models, examine the impact of their key parameters, evaluate their classification performance, and compare the results across models.

#### Lecture notes. 
Detailed coverage of these topics can be found in the following:
<li>Logistic Regression</li>
<li>Evaluation Metrics for Classification</li>
<li>Cross Validation</li>
<li>k-Nearest Neighbors</li>
<li>Support Vector Machine</li>
<li>Decision Tree</li>

#### Code demonstrations.
<li>Code 2023-09-20-W-Logistic Regression.ipynb</li>
<li>Code 2023-09-25-M-Evaluation Metrics for Classification.ipynb</li>
<li>Code 2023-09-27-W-Cross Validation.ipynb</li>
<li>Code 2023-10-04-W-k-Nearest Neighbors.ipynb</li>
<li>Code 2023-10-11-W-Soft Margin Classification SVM Model.ipynb</li>
<li>Code 2023-10-16-M-Multi-class Classification and Kernel Trick of SVM.ipynb</li>
<li>Code 2023-10-23-M-Decision Tree.ipynb</li>

### 2. Submission
You need to submit a detailed lab report with code, running results, and answers to the questions. If you submit <font color='red'>a Jupyter notebook (“Firstname-Lastname-6840-Lab2.ipynb”)</font>, please fill in this file directly and place the code, running results, and answers in order for each question. If you submit <font color='red'>a PDF report (“Firstname-Lastname-6840-Lab2.pdf”) with a code file (“Firstname-Lastname-6840-Lab2.py”)</font>, please include screenshots (code and running results) with answers for each question in the report.  

### 3. Questions (50 points)

For this lab assignment, you will be using the `housing dataset` to complete the following tasks and answer the questions. The housing dataset is the California Housing Prices dataset based on data from the 1990 California census. You will use the dataset's features to build classification models that predict the `ocean proximity` of a house. First, please place `housing.csv` and your notebook/python file in the same directory, then load and preprocess the data.   

#### Load and preprocess the data

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Please place housing.csv and your notebook/python file in the same directory; otherwise, change DATA_PATH 
DATA_PATH = ""

def load_housing_data(housing_path=DATA_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#Add three useful features
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

#Divide the data frame into features and labels
housing_labels = housing["ocean_proximity"].copy() # use ocean_proximity as classification label
housing_features = housing.drop("ocean_proximity", axis=1) # use columns other than ocean_proximity as features

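#Fill the missing feature values with the column median
#(assign the result back; chained inplace fillna triggers a warning in recent pandas versions)
median = housing_features["total_bedrooms"].median()
housing_features["total_bedrooms"] = housing_features["total_bedrooms"].fillna(median)
median = housing_features["bedrooms_per_room"].median()
housing_features["bedrooms_per_room"] = housing_features["bedrooms_per_room"].fillna(median)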

#Scale the features
std_scaler  = StandardScaler()
housing_features_scaled = std_scaler.fit_transform(housing_features)

#Final housing features X
X = housing_features_scaled

#Binary labels - 0: INLAND; 1: CLOSE TO OCEAN
y_binary = (housing_labels != 1).astype(np.float64)
#Multi-class labels - 0: <1H OCEAN; 1: INLAND; 2: NEAR OCEAN; 3: NEAR BAY
y_multi = housing_labels.astype(np.float64)

#Data splits for binary classification
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X, y_binary, test_size=0.20, random_state=42)

#Data splits for multi-class classification
X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.20, random_state=42)

<font color='red'><b>About the data used in this assignment: </b></font><br>
**All the binary classification models are trained on `X_train_bi`, `y_train_bi`, and evaluated on `X_test_bi`, `y_test_bi`.**<br>
**All the multi-class classification models are trained on `X_train_mu`, `y_train_mu`, and evaluated on `X_test_mu`, `y_test_mu`.**<br>
**k-fold cross validation is performed directly on `X` and `y_multi`.**
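
The label construction in the preprocessing cell relies on the `ocean_proximity` column of the provided `housing.csv` already holding integer codes. That is an assumption inferred from the label comments in the cell, since the public version of this dataset stores strings. A minimal sketch on a hypothetical five-row column (`toy_labels` is made up for illustration) shows how the binary and multi-class label vectors relate:

```python
import numpy as np
import pandas as pd

# Hypothetical integer-coded ocean_proximity values (not read from housing.csv)
# 0: <1H OCEAN; 1: INLAND; 2: NEAR OCEAN; 3: NEAR BAY
toy_labels = pd.Series([0, 1, 2, 3, 1])

# Binary labels: INLAND (code 1) becomes 0, every other code becomes 1
toy_binary = (toy_labels != 1).astype(np.float64)

# Multi-class labels: the codes themselves, cast to float
toy_multi = toy_labels.astype(np.float64)

print(toy_binary.tolist())  # [1.0, 0.0, 1.0, 1.0, 0.0]
print(toy_multi.tolist())   # [0.0, 1.0, 2.0, 3.0, 1.0]
```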


#### Question 1 (4 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a logistic regression binary classification model in function `answer_one( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set `solver="newton-cg"` and `random_state=42` in `LogisticRegression` to guarantee the convergence of training loss minimization** 
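
As a reminder of what the two metrics measure, the sketch below computes accuracy and F1 score by hand from confusion counts and compares them with `accuracy_score` and `f1_score`. The label arrays are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth and predictions (not from the housing data)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

manual_accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

# The scikit-learn helpers should agree with the hand computation
sk_accuracy = accuracy_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
print(manual_accuracy, sk_accuracy)
print(manual_f1, sk_f1)
```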

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def answer_one():
    #Train a binary_reg

    #Use binary_reg to make prediction y_pred_bi
    
    #Accuracy
    binary_reg_accuracy = 
    
    #F1 score
    binary_reg_f1 = 
    
    return binary_reg_accuracy, binary_reg_f1

#Run your function in the cell to return the results
accuracy_1, f1_1 = answer_one()

#### Answer 1:  
Accuracy is: ( ) <br>
F1 score is: ( )

#### Question 2 (4 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a softmax regression multi-class classification model in function `answer_two( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `multi_class="multinomial"`, `solver="newton-cg"` and `random_state=42` in `LogisticRegression` to guarantee the convergence of multi-class training**  
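
The difference between micro and macro F1 can be seen on a small example. In the sketch below (hypothetical three-class labels, not the housing data), macro F1 is the unweighted mean of the per-class F1 scores, while micro F1 pools all decisions and, for single-label multi-class problems, coincides with accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical three-class ground truth and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

# One F1 score per class, in class order 0, 1, 2
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Macro: average the per-class scores, each class counts equally
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Micro: pool true/false positives and negatives over all classes first
micro_f1 = f1_score(y_true, y_pred, average="micro")

accuracy = accuracy_score(y_true, y_pred)
print(per_class_f1, macro_f1, micro_f1, accuracy)
```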

In [None]:
def answer_two():
    #Train a multi_reg

    #Use multi_reg to make prediction y_pred_mu
    
    #Accuracy
    multi_reg_accuracy = 
    
    #Micro F1 score
    multi_reg_microf1 = 
    
    #Macro F1 score
    multi_reg_macrof1 = 
    
    return multi_reg_accuracy, multi_reg_microf1, multi_reg_macrof1

#Run your function in the cell to return the results
accuracy_2, microf1_2, macrof1_2 = answer_two()

#### Answer 2:  
Accuracy is: ( ) <br>
Micro f1 score is: ( ) <br>
Macro f1 score is: ( )

#### Question 3 (6 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a k-nearest neighbors binary classification model in function `answer_three( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set the option `n_neighbors=` in `KNeighborsClassifier` using `1`, `3`, `5`, `7`, and `9` respectively to find an optimal value `k`**   
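
One possible way to organize the comparison is a loop that records the test accuracy for each candidate `k`. The sketch below runs on synthetic data from `make_classification`; the names `X_toy`, `y_toy`, and `accuracy_by_k` are illustrative and not part of the assignment template.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the housing features
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)

# Record the test accuracy for every candidate number of neighbors
accuracy_by_k = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    accuracy_by_k[k] = accuracy_score(y_te, knn.predict(X_te))

# The k with the highest recorded accuracy (first one wins on ties)
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

Picking `k` by test accuracy follows the question as stated; outside this lab, a separate validation set or cross validation is the more common choice.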

In [None]:
from sklearn.neighbors import KNeighborsClassifier

def answer_three():
    #Train a binary_knn

    #Use binary_knn to make prediction y_pred_bi
    
    #Accuracy
    binary_knn_accuracy = 
    
    #F1 score
    binary_knn_f1 = 
    
    return binary_knn_accuracy, binary_knn_f1

#Run your function in the cell to return the results
accuracy_3, f1_3 = answer_three()

#### Answer 3:  
When k = 1, accuracy is: ( )<br>
When k = 3, accuracy is: ( )<br>
When k = 5, accuracy is: ( )<br>
When k = 7, accuracy is: ( )<br>
When k = 9, accuracy is: ( )<br>
Optimal k (`n_neighbors`) is: ( ), accuracy is: ( ), F1 score is: ( )<br>

#### Question 4 (7 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a k-nearest neighbors multi-class classification model in function `answer_four( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, macro F1 score, loading time, and prediction time.

**Set `n_neighbors=5` in `KNeighborsClassifier` and set the option `algorithm=` using `'brute'`, `'kd_tree'`, and `'ball_tree'` respectively to compare the time used**  
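
For k-nearest neighbors, `fit` does little more than store the data (`'brute'`) or build a search tree over it (`'kd_tree'`, `'ball_tree'`), which is why the time spent there is called loading time in this question. The sketch below times both phases on synthetic data; it uses `time.perf_counter` as an assumed alternative to `time.time`, and the measured values will vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class data standing in for the housing features
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, n_informative=4, n_classes=4, random_state=0
)
X_tr, y_tr = X_toy[:1600], y_toy[:1600]
X_te = X_toy[1600:]

timings = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    t0 = time.perf_counter()
    knn.fit(X_tr, y_tr)      # store the data or build the search tree
    t1 = time.perf_counter()
    knn.predict(X_te)        # neighbor search happens here
    t2 = time.perf_counter()
    timings[algorithm] = (t1 - t0, t2 - t1)  # (loading time, prediction time)

for algorithm, (load_time, predict_time) in timings.items():
    print(f"{algorithm}: load {load_time:.4f}s, predict {predict_time:.4f}s")
```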

In [None]:
import time

def answer_four():
    #Add a time checkpoint here
    time1 = time.time()
    
    #Train a multi_knn

    
    #Add a time checkpoint here
    time2 = time.time()
    
    #Use multi_knn to make prediction y_pred_mu
    
    
    #Add a time checkpoint here
    time3 = time.time()
    
    #Accuracy
    multi_knn_accuracy = 
    
    #Micro F1 score
    multi_knn_microf1 = 
    
    #Macro F1 score
    multi_knn_macrof1 = 
    
    #time used for data loading
    multi_knn_loadtime = time2 - time1
    
    #time used for prediction
    multi_knn_predictiontime = time3 - time2
    
    return multi_knn_accuracy, multi_knn_microf1, multi_knn_macrof1, multi_knn_loadtime, multi_knn_predictiontime

#Run your function in the cell to return the results
accuracy_4, microf1_4, macrof1_4, loadtime, predictiontime = answer_four()

#### Answer 4:  
<b>Brute force: </b> data loading time is: ( ), prediction time is: ( ), accuracy is: ( ), micro f1 score is: ( ), macro f1 score is: ( ) <br>
<b>K-d tree: </b> data loading time is: ( ), prediction time is: ( ), accuracy is: ( ), micro f1 score is: ( ), macro f1 score is: ( ) <br>
<b>Ball tree: </b> data loading time is: ( ), prediction time is: ( ), accuracy is: ( ), micro f1 score is: ( ), macro f1 score is: ( ) <br>
Summarize your observations about the time used by these search algorithms: ( ) and observations about the classification performance: ( )

#### Question 5 (7 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a support vector machine binary classification model in function `answer_five( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set `random_state=42` in `SVC`, and set the kernel function `kernel=` using `'linear'`, `'rbf'`, and `'poly'` respectively to compare their performance** 
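
To get a feel for how the kernel choice affects the decision boundary, the sketch below fits the three kernels on the synthetic `make_moons` dataset, whose classes are not linearly separable. The data and the name `scores_by_kernel` are illustrative assumptions; results on the housing data may look quite different.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half circles: a classic non-linear toy problem
X_toy, y_toy = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)

scores_by_kernel = {}
for kernel in ("linear", "rbf", "poly"):
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_tr, y_tr)
    y_hat = svm.predict(X_te)
    # Store (accuracy, F1 score) for this kernel
    scores_by_kernel[kernel] = (accuracy_score(y_te, y_hat), f1_score(y_te, y_hat))

for kernel, (acc, f1) in scores_by_kernel.items():
    print(f"{kernel}: accuracy {acc:.3f}, F1 {f1:.3f}")
```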

In [None]:
from sklearn.svm import SVC

def answer_five():
    #Train a binary_svm
    

    #Use binary_svm to make prediction y_pred_bi

    
    #Accuracy
    binary_svm_accuracy = 
    
    #F1 score
    binary_svm_f1 = 
    
    return binary_svm_accuracy, binary_svm_f1

#Run your function in the cell to return the results
accuracy_5, f1_5 = answer_five()

#### Answer 5:  
<b>Linear kernel: </b> accuracy is: ( ), and f1 score is: ( ) <br> 
<b>RBF kernel: </b> accuracy is: ( ), and f1 score is: ( ) <br> 
<b>Polynomial kernel: </b> accuracy is: ( ), and f1 score is: ( ) <br>
Summarize your observations about the performance achieved with these different kernels: ( )  

#### Question 6 (6 points):
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a support vector machine multi-class classification model in function `answer_six( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `kernel='rbf'`, `random_state=42` in `SVC`, and set `decision_function_shape=` using `'ovr'` and `'ovo'` respectively to compare their performance and time cost**  
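
When interpreting the comparison, it helps to know what the option changes. According to the scikit-learn documentation, `SVC` always trains one-vs-one classifiers internally, and `decision_function_shape` only controls the shape of the decision function output. The sketch below illustrates this on synthetic four-class data from `make_blobs` (an assumed stand-in for the housing data):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic data with four well-separated classes
X_toy, y_toy = make_blobs(n_samples=200, centers=4, random_state=42)

svm_ovr = SVC(kernel="rbf", random_state=42, decision_function_shape="ovr")
svm_ovo = SVC(kernel="rbf", random_state=42, decision_function_shape="ovo")
svm_ovr.fit(X_toy, y_toy)
svm_ovo.fit(X_toy, y_toy)

# 'ovr' gives one column per class; 'ovo' gives one column per pair of classes
shape_ovr = svm_ovr.decision_function(X_toy).shape
shape_ovo = svm_ovo.decision_function(X_toy).shape

# With the default break_ties=False, both settings predict the same labels
same_predictions = np.array_equal(svm_ovr.predict(X_toy), svm_ovo.predict(X_toy))
print(shape_ovr, shape_ovo, same_predictions)
```

With four classes there are 4 columns for `'ovr'` and 4 × 3 / 2 = 6 columns for `'ovo'`.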

In [None]:
def answer_six():
    #Add a time checkpoint here
    time1 = time.time()
    
    #Train a multi_svm
    

    #Use multi_svm to make prediction y_pred_mu

    
    #Add a time checkpoint here
    time2 = time.time()
    
    #Accuracy
    multi_svm_accuracy = 
    
    #Micro F1 score
    multi_svm_microf1 = 
    
    #Macro F1 score
    multi_svm_macrof1 =
    
    #time used
    multi_svm_time = time2 - time1
    
    return multi_svm_accuracy, multi_svm_microf1, multi_svm_macrof1, multi_svm_time

#Run your function in the cell to return the results
accuracy_6, microf1_6, macrof1_6, used_time = answer_six()

#### Answer 6:  
<b>One-vs-one (ovo): </b> time used is: ( ), accuracy is: ( ), micro f1 score is: ( ), macro f1 score is: ( ) <br>
<b>One-vs-rest (ovr): </b> time used is: ( ), accuracy is: ( ), micro f1 score is: ( ), macro f1 score is: ( ) <br>
Summarize your observations about the time used by these multi-class methods: ( ) and observations about the classification performance: ( )

#### Question 7 (3 points):
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Based on the results from Question 1 to Question 6: <br>
The model with best binary classification performance is: ( ) <br>
The model with worst binary classification performance is: ( ) <br>
The model with best multi-class classification performance is: ( ) <br>
The model with worst multi-class classification performance is: ( ) <br>
Summarize your personal thoughts on the model choices: ( ) 

#### Question 8 (6 points):
Please use `X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.40)` to split the data 5 separate times, and each time, use `X_train_mu` and `y_train_mu` to train a decision tree in function `answer_eight( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `max_depth=4`, `random_state=42` and `criterion='gini'` in `DecisionTreeClassifier`**
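
Because the `train_test_split` call above has no `random_state`, every run draws a different split even though the tree itself is seeded. The sketch below imitates five such runs on synthetic data, using explicit split seeds 0 to 4 so that the illustration is reproducible (an assumption made here; the assignment call deliberately leaves the split unseeded).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data standing in for X and y_multi
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=4, random_state=0
)

holdout_accuracies = []
for seed in range(5):
    # A different seed gives a different 60/40 hold-out split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.40, random_state=seed
    )
    tree = DecisionTreeClassifier(max_depth=4, random_state=42, criterion="gini")
    tree.fit(X_tr, y_tr)
    holdout_accuracies.append(accuracy_score(y_te, tree.predict(X_te)))

# Gap between the best and worst run
spread = max(holdout_accuracies) - min(holdout_accuracies)
print(holdout_accuracies, spread)
```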

In [None]:
from sklearn.tree import DecisionTreeClassifier

X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.40)

def answer_eight():
    multi_dt = 
  
    y_pred_mu = 
    
    #Accuracy
    multi_dt_accuracy = 
    
    #Micro F1 score
    multi_dt_microf1 = 
    
    #Macro F1 score
    multi_dt_macrof1 = 
    
    return multi_dt_accuracy, multi_dt_microf1, multi_dt_macrof1

#Run your function in the cell to return the results
accuracy_8, microf1_8, macrof1_8 = answer_eight()

#### Answer 8:  
First run: <br>
Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ) <br><br>
Second run: <br>
Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ) <br><br>
Third run: <br>
Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ) <br><br>
Fourth run: <br>
Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ) <br><br>
Fifth run: <br>
Accuracy is: ( ), Micro f1 score is: ( ), Macro f1 score is: ( ) <br><br>
Summarize your observations on why these results vary and the disadvantages of hold-out evaluation: ( )

#### Question 9 (7 points):
Please use `X` and `y_multi` to implement k-fold cross validation in function `answer_nine( )` to evaluate a decision tree multi-class classification model, including the mean of accuracy, micro F1 score, and macro F1 score.

**Set `max_depth=4`, `random_state=42` and `criterion='gini'` in `DecisionTreeClassifier`**

**Set `cv=5` and `scoring=("accuracy", "f1_micro", "f1_macro")` in `cross_validate` to return the cross-validation evaluation results**
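
With several scorers, `cross_validate` returns a dictionary holding one array per metric under keys of the form `test_<scorer name>`, plus `fit_time` and `score_time`. The sketch below shows that structure on synthetic data; `X_toy`, `y_toy`, and `cv_toy` are illustrative names, not part of the template.

```python
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data standing in for X and y_multi
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_classes=4, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=42, criterion="gini")
cv_toy = cross_validate(
    tree, X_toy, y_toy, cv=5, scoring=("accuracy", "f1_micro", "f1_macro")
)

# Each test_* entry holds one score per fold (five values for cv=5)
result_keys = sorted(cv_toy)
mean_accuracy = mean(cv_toy["test_accuracy"])
print(result_keys)
print(mean_accuracy)
```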

In [None]:
from sklearn.model_selection import cross_validate
from statistics import mean

def answer_nine():
    multi_dt = 
    
    #Cross validation evaluation
    cv_results = 
    
    #Accuracy: use mean()
    multi_dt_accuracy = 
    
    #Micro F1 score: use mean()
    multi_dt_microf1 = 
    
    #Macro F1 score: use mean()
    multi_dt_macrof1 = 
    
    return multi_dt_accuracy, multi_dt_microf1, multi_dt_macrof1

#Run your function in the cell to return the results
accuracy_9, microf1_9, macrof1_9 = answer_nine()

#### Answer 9:  
Accuracy using 5-fold cross validation is: ( ) <br>
Micro f1 score using 5-fold cross validation is: ( ) <br>
Macro f1 score using 5-fold cross validation is: ( ) <br>
Compared to the classification results in Question 8, what is your observation: ( ), and why does that happen: ( ) <br>
Summarize the advantages and disadvantages of cross validation: ( )