### CS 4840 Intro Machine Learning - Lab Assignment 2

# <center>Building and Analyzing Classification Models</center>

## <center><font color='red'>This is only for undergraduate students in CS 4840</font></center>

### 1. Overview
The learning objective of this lab assignment is for students to understand different classification models: how to apply logistic regression, k-nearest neighbors, decision trees, and ensemble learning (AdaBoost and random forest); how key parameters affect each model; how to evaluate classification performance; and how to compare results across models.

#### Lecture notes. 
Detailed coverage of these topics can be found in the following:
<li>Logistic Regression</li>
<li>Evaluation Metrics for Classification</li>
<li>k-Nearest Neighbors</li>
<li>Decision Tree</li>
<li>Ensemble Learning and Random Forest</li>

#### Code demonstrations.
<li>Code 2024-09-23-M-Training Logistic Regression using Scikit-Learn.ipynb</li>
<li>Code 2024-09-25-W-Evaluation Metrics for Classification-Scikit-Learn.ipynb</li>
<li>Code 2024-10-02-W-k-Nearest Neighbors.ipynb</li>
<li>Code 2024-10-09-W-Decision Tree.ipynb</li>
<li>Code 2024-10-21-M-Ensemble Learning and Random Forest.ipynb</li>

### 2. Submission
You need to submit a detailed lab report with code, running results, and answers to the questions. If you submit <font color='red'>a Jupyter notebook (“Firstname-Lastname-4840-Lab2.ipynb”)</font>, please fill in this file directly and place the code, running results, and answers in order for each question. If you submit <font color='red'>a PDF report (“Firstname-Lastname-4840-Lab2.pdf”) with a code file (“Firstname-Lastname-4840-Lab2.py”)</font>, please include screenshots of the code and running results, together with your answers, for each question in the report.  

### 3. Questions (50 points)

For this lab assignment, you will use the `housing` dataset to complete the following tasks and answer the questions. It is the California Housing Prices dataset, based on data from the 1990 California census. You will use its features to build classification models that predict the `ocean_proximity` of a house. First, place `housing.csv` and your notebook/Python file in the same directory, then load and preprocess the data.   

#### Load and preprocess the data

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Please place housing.csv and your notebook/python file in the same directory; otherwise, change DATA_PATH 
DATA_PATH = ""

def load_housing_data(housing_path=DATA_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#Add three useful features
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

#Divide the data frame into features and labels
housing_labels = housing["ocean_proximity"].copy() # use ocean_proximity as classification label
housing_features = housing.drop("ocean_proximity", axis=1) # use columns other than ocean_proximity as features

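#Fill in missing feature values with the column median
#(assigning the result back avoids the chained-assignment warning that inplace=True triggers in recent pandas)
median = housing_features["total_bedrooms"].median()
housing_features["total_bedrooms"] = housing_features["total_bedrooms"].fillna(median)
median = housing_features["bedrooms_per_room"].median()
housing_features["bedrooms_per_room"] = housing_features["bedrooms_per_room"].fillna(median)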

#Scale the features
std_scaler  = StandardScaler()
housing_features_scaled = std_scaler.fit_transform(housing_features)

#Final housing features X
X = housing_features_scaled

#Binary labels - 0: INLAND; 1: CLOSE TO OCEAN (INLAND is coded as 1 in ocean_proximity, see the multi-class coding below)
y_binary = (housing_labels != 1).astype(int)
#Multi-class labels - 0: <1H OCEAN; 1: INLAND; 2: NEAR OCEAN; 3: NEAR BAY
y_multi = housing_labels.astype(int)

#Data splits for binary classification
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X, y_binary, test_size=0.20, random_state=42)

#Data splits for multi-class classification
X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.20, random_state=42)

print(X_train_bi.shape)
print(X_test_bi.shape)

<font color='red'><b>About the data used in this assignment: </b></font><br>
**All the binary classification models are trained on `X_train_bi`, `y_train_bi`, and evaluated on `X_test_bi`, `y_test_bi`.**<br>
**All the multi-class classification models are trained on `X_train_mu`, `y_train_mu`, and evaluated on `X_test_mu`, `y_test_mu`.**<br>


#### Question 1 (4 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a logistic regression binary classification model in function `answer_one( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Choose your own `solver` and set `random_state=42` in `LogisticRegression`** 
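
As a reminder of what the two metrics measure, here is a minimal sketch that computes accuracy and F1 score by hand from a confusion matrix and checks them against `accuracy_score` and `f1_score`. The labels are made-up toy values, not taken from `housing.csv`, and the variable names are only for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels, made up for illustration (not from housing.csv)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values should match the scikit-learn helpers
print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_f1, f1_score(y_true, y_pred))
```

Note that `f1_score` with its default settings reports the F1 score of the positive class (label `1`).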

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def answer_one():
    binary_reg =

    
    y_pred_bi = 
    
    #Accuracy 
    binary_reg_accuracy = 
    
    #F1 score
    binary_reg_f1 = 
    
    return binary_reg_accuracy, binary_reg_f1

#Run your function in the cell to return the results
accuracy_1, f1_1 = answer_one()

#### Answer 1:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Accuracy is: ( ) <br>
F1 score is: ( )

#### Question 2 (4 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a softmax regression multi-class classification model in function `answer_two( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `solver="newton-cg"` so that the training loss minimization converges reliably, and set `random_state=42` in `LogisticRegression`** 
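
The difference between micro and macro averaging can be seen on a small example. The sketch below uses made-up three-class labels (not from `housing.csv`): micro F1 pools all decisions before computing the score, so for single-label multi-class problems it equals accuracy, while macro F1 is the unweighted mean of the per-class F1 scores.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy three-class labels, made up for illustration (not from housing.csv)
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1 score per class
micro_f1 = f1_score(y_true, y_pred, average="micro")    # pooled over all samples
macro_f1 = f1_score(y_true, y_pred, average="macro")    # mean of per-class scores

print("per-class F1:", per_class_f1)
print("micro F1:", micro_f1, "accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", macro_f1)
```

Because macro averaging gives every class the same weight, a poorly predicted minority class lowers macro F1 more than it lowers micro F1.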

In [None]:
def answer_two():
    multi_reg = 

    
    y_pred_mu = 
    
    #Accuracy
    multi_reg_accuracy = 
    
    #Micro F1 score
    multi_reg_microf1 = 
    
    #Macro F1 score
    multi_reg_macrof1 = 
    
    return multi_reg_accuracy, multi_reg_microf1, multi_reg_macrof1

#Run your function in the cell to return the results
accuracy_2, microf1_2, macrof1_2 = answer_two()

#### Answer 2:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Accuracy is: ( ) <br>
Micro F1 score is: ( ) <br>
Macro F1 score is: ( )

#### Question 3 (5 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a k-nearest neighbors binary classification model in function `answer_three( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set the option `n_neighbors=` in `KNeighborsClassifier` using `1`, `3`, `5`, `7`, and `9` respectively to find an optimal value `k`**   
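
One possible way to organize the comparison is to loop over the candidate values and keep the scores in a dictionary. The sketch below does this on synthetic data from `make_classification`, not on the housing data; the names `X_demo`, `k_demo`, and `scores` are hypothetical and only used here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data, used only to illustrate the loop (not housing.csv)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

scores = {}
for k_demo in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k_demo).fit(X_tr, y_tr)
    scores[k_demo] = accuracy_score(y_te, knn.predict(X_te))

# Pick the k with the highest held-out accuracy
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Odd values of `k` avoid tied votes in binary classification, which is one reason the candidates above are all odd.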

In [None]:
from sklearn.neighbors import KNeighborsClassifier

#Please use 1, 3, 5, 7, 9
k = 

def answer_three(k):
    binary_knn = 

    
    y_pred_bi = 
    
    #Accuracy
    binary_knn_accuracy = 
    
    #F1 score
    binary_knn_f1 = 
    
    return binary_knn_accuracy, binary_knn_f1

#Run your function in the cell to return the results
accuracy_3, f1_3 = answer_three(k)

#### Answer 3: 
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When k = 1</b>, accuracy is: ( )<br>
<b>When k = 3</b>, accuracy is: ( )<br>
<b>When k = 5</b>, accuracy is: ( )<br>
<b>When k = 7</b>, accuracy is: ( )<br>
<b>When k = 9</b>, accuracy is: ( )<br>
<b>Optimal k (`n_neighbors`) is</b>: ( ), accuracy is: ( ), F1 score is: ( )<br>

#### Question 4 (7 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a k-nearest neighbors multi-class classification model in function `answer_four( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, macro F1 score, loading time, and prediction time.

**Set `n_neighbors=` in `KNeighborsClassifier` using the optimal `k` from Question 3, and set the option `algorithm=` using `'brute'`, `'kd_tree'`, and `'ball_tree'` respectively to compare the time each one takes**  
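
The timing pattern can be sketched as follows, on synthetic data rather than the housing data. This sketch uses `time.perf_counter()`, a standard-library clock intended for measuring short intervals; the template's `time.time()` checkpoints follow the same idea. The names `X_demo`, `alg_demo`, and `timings` are hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data, used only to illustrate the timing pattern
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=5, n_classes=3, random_state=0
)

timings = {}
for alg_demo in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=alg_demo)

    t0 = time.perf_counter()
    knn.fit(X_demo, y_demo)       # stores the data or builds the search tree
    t1 = time.perf_counter()
    knn.predict(X_demo[:200])     # runs the neighbor search
    t2 = time.perf_counter()

    timings[alg_demo] = (t1 - t0, t2 - t1)   # (fit time, prediction time)

for name, (fit_time, predict_time) in timings.items():
    print(f"{name}: fit {fit_time:.4f}s, predict {predict_time:.4f}s")
```

Measured times vary from run to run and from machine to machine, so it can help to repeat each measurement a few times before drawing conclusions.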

In [None]:
import time

#Please use the optimal k in Question 3
k = 

#Please use 'brute', 'kd_tree', and 'ball_tree', respectively  
alg = 

def answer_four(k, alg):
    #Add a time checkpoint here
    time1 = time.time()
    
    multi_knn = 

    
    #Add a time checkpoint here
    time2 = time.time()
    
    y_pred_mu = 
    
    #Add a time checkpoint here
    time3 = time.time()
    
    #Accuracy
    multi_knn_accuracy = 
    
    #Micro F1 score
    multi_knn_microf1 = 
    
    #Macro F1 score
    multi_knn_macrof1 = 
    
    #time used for data loading
    multi_knn_loadtime = time2 - time1
    
    #time used for prediction
    multi_knn_predictiontime = time3 - time2
    
    return multi_knn_accuracy, multi_knn_microf1, multi_knn_macrof1, multi_knn_loadtime, multi_knn_predictiontime

#Run your function in the cell to return the results
accuracy_4, microf1_4, macrof1_4, loadtime, predictiontime = answer_four(k, alg)

#### Answer 4:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>Brute force: </b> data loading time is: ( ), prediction time is: ( ), accuracy is: ( ), micro F1 score is: ( ), macro F1 score is: ( ) <br>
<b>K-d tree: </b> data loading time is: ( ), prediction time is: ( ), accuracy is: ( ), micro F1 score is: ( ), macro F1 score is: ( ) <br>
<b>Ball tree: </b> data loading time is: ( ), prediction time is: ( ), accuracy is: ( ), micro F1 score is: ( ), macro F1 score is: ( ) <br>
Summarize your observations about the time used by these search algorithms: ( ) and your observations about the classification performance: ( )  

#### Question 5 (7 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a decision tree binary classification model in function `answer_five( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set `random_state=42` and `criterion='gini'` in `DecisionTreeClassifier`, and set `max_depth=` using `2`, `5`, and `10` respectively to compare the resulting performance** 
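
When comparing depths, it can be informative to look at training accuracy next to test accuracy: a large gap between the two is a common sign of overfitting. The sketch below illustrates the idea on synthetic data (not the housing data); the names `X_demo`, `depth`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only to illustrate the comparison
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = {}
for depth in (2, 5, 10):
    tree = DecisionTreeClassifier(max_depth=depth, criterion="gini", random_state=42)
    tree.fit(X_tr, y_tr)
    # score() returns the mean accuracy on the given data
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train {train_acc:.3f}, test {test_acc:.3f}")
```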

In [None]:
from sklearn.tree import DecisionTreeClassifier

#Please use 2, 5, 10
d = 

def answer_five(d):
    binary_dt = 


    y_pred_bi = 
    
    #Accuracy
    binary_dt_accuracy = 
    
    #F1 score
    binary_dt_f1 = 
    
    return binary_dt_accuracy, binary_dt_f1

#Run your function in the cell to return the results
accuracy_5, f1_5 = answer_five(d)

#### Answer 5:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When d = 2: </b> accuracy is: ( ), and F1 score is: ( ) <br> 
<b>When d = 5: </b> accuracy is: ( ), and F1 score is: ( ) <br> 
<b>When d = 10: </b> accuracy is: ( ), and F1 score is: ( ) <br>
Summarize your observations about the performance obtained with these different `max_depth` values: ( )  

#### Question 6 (7 points):
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a decision tree multi-class classification model in function `answer_six( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `max_depth=5` and `random_state=42` in `DecisionTreeClassifier` and set `criterion=` using `'gini'` and `'entropy'` respectively to compare the resulting performance**  
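
The two criteria measure node impurity in different ways: Gini impurity is 1 minus the sum of squared class proportions, and entropy is the negative sum of each proportion times its logarithm. The sketch below computes both for a few class distributions. The helper functions `gini` and `entropy` are hypothetical, written only for this illustration, and the base-2 logarithm is a convention chosen for this sketch.

```python
import numpy as np

def gini(p):
    """Gini impurity of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy (in bits) of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]   # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

for proportions in ([1.0, 0.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]):
    print(proportions, "gini:", gini(proportions), "entropy:", entropy(proportions))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often lead to similar trees.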

In [None]:
#Please use 'gini' and 'entropy' respectively
c = 'entropy'

def answer_six(c):
    multi_dt = 


    y_pred_mu = 
    
    #Accuracy
    multi_dt_accuracy = 
    
    #Micro F1 score
    multi_dt_microf1 = 
    
    #Macro F1 score
    multi_dt_macrof1 = 
    
    return multi_dt_accuracy, multi_dt_microf1, multi_dt_macrof1

#Run your function in the cell to return the results
accuracy_6, microf1_6, macrof1_6 = answer_six(c)

#### Answer 6:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When c = 'gini': </b> accuracy is: ( ), micro F1 score is: ( ), macro F1 score is: ( ) <br> 
<b>When c = 'entropy': </b> accuracy is: ( ), micro F1 score is: ( ), macro F1 score is: ( ) <br> 
Summarize your observations about the performance obtained with these different `criterion` values: ( )  

#### Question 7 (7 points):
Please use features `X_train_bi` and binary labels `y_train_bi` to train a binary classification model using AdaBoost ensemble learning in function `answer_seven( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set the base model to `DecisionTreeClassifier(max_depth=1)`, and set `n_estimators=100` and `random_state=42`, in `AdaBoostClassifier`** 
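
A note on passing the base model: in recent scikit-learn releases the keyword for it is `estimator`, while older releases used `base_estimator`. The sketch below passes the base model as the first positional argument, which should work with either name. It runs on synthetic data with illustrative settings (not the settings required above), and the names `X_demo`, `stump`, and `ada_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only to illustrate the API
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A depth-1 tree (a "decision stump") is the weak learner being boosted
stump = DecisionTreeClassifier(max_depth=1)

# The base model is passed positionally to avoid depending on the keyword name
ada_demo = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
ada_demo.fit(X_tr, y_tr)

# A single stump trained alone, for comparison with the boosted ensemble
stump_alone = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)

print("single stump accuracy:", stump_alone.score(X_te, y_te))
print("AdaBoost accuracy:", ada_demo.score(X_te, y_te))
```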

In [None]:
from sklearn.ensemble import AdaBoostClassifier

def answer_seven():
    binary_ada = 

    
    y_pred_bi = 
    
    #Accuracy
    binary_ada_accuracy = 
    
    #F1 score
    binary_ada_f1 = 
    
    return binary_ada_accuracy, binary_ada_f1

#Run your function in the cell to return the results
accuracy_7, f1_7 = answer_seven()

#### Answer 7:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Accuracy is: ( ) <br>
F1 score is: ( ) <br>
Compared to the classification results in Question 5, which builds decision trees with `max_depth=2, 5, 10`, summarize your observations about the performance obtained by AdaBoost with `DecisionTreeClassifier(max_depth=1)`: ( )

#### Question 8 (7 points):
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a random forest multi-class classification model in function `answer_eight( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `n_estimators=10` and `random_state=42` in `RandomForestClassifier`**  

In [None]:
from sklearn.ensemble import RandomForestClassifier

def answer_eight():
    multi_rf = 


    y_pred_mu = 
    
    #Accuracy
    multi_rf_accuracy = 
    
    #Micro F1 score
    multi_rf_microf1 = 
    
    #Macro F1 score
    multi_rf_macrof1 = 
    
    return multi_rf_accuracy, multi_rf_microf1, multi_rf_macrof1

#Run your function in the cell to return the results
accuracy_8, microf1_8, macrof1_8 = answer_eight()

#### Answer 8:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When n_estimators=10: </b> accuracy is: ( ), micro F1 score is: ( ), macro F1 score is: ( ) <br> 
Compared to the classification results in Question 6, which builds a single decision tree, summarize your observations about the performance obtained by random forest: ( )  

#### Question 9 (2 points):
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Based on the results from Question 1 to Question 8 (considering different model parameters): <br>
The model with the best binary classification performance is: ( ) <br>
The model with the worst binary classification performance is: ( ) <br>
The model with the best multi-class classification performance is: ( ) <br>
The model with the worst multi-class classification performance is: ( )