### CS 4840 Intro Machine Learning - Lab Assignment 2

# <center>Building and Analyzing Classification Models</center>

## <center><font color='red'>This is only for undergraduate students in CS 4840</font></center>

### 1. Overview
The learning objective of this lab assignment is for students to understand different classification models: how to apply logistic regression, k-nearest neighbors, decision trees, and ensemble methods such as random forests; how key parameters affect these models; how to evaluate their classification performance; and how to compare results across models.

#### Lecture notes. 
Detailed coverage of these topics can be found in the following:
<li>Logistic Regression</li>
<li>Evaluation Metrics for Classification</li>
<li>k-Nearest Neighbors</li>
<li>Decision Tree</li>
<li>Ensemble Learning and Random Forest</li>

#### Code demonstrations.
<li>Code 2024-09-23-M-Training Logistic Regression using Scikit-Learn.ipynb</li>
<li>Code 2024-09-25-W-Evaluation Metrics for Classification-Scikit-Learn.ipynb</li>
<li>Code 2024-10-02-W-k-Nearest Neighbors.ipynb</li>
<li>Code 2024-10-09-W-Decision Tree.ipynb</li>
<li>Code 2024-10-21-M-Ensemble Learning and Random Forest.ipynb</li>

### 2. Submission
You need to submit a detailed lab report with code, running results, and answers to the questions. If you submit <font color='red'>a Jupyter notebook (“Firstname-Lastname-4840-Lab2.ipynb”)</font>, please fill in this file directly and place the code, running results, and answers in order for each question. If you submit <font color='red'>a PDF report (“Firstname-Lastname-4840-Lab2.pdf”) with a code file (“Firstname-Lastname-4840-Lab2.py”)</font>, please include screenshots (code and running results) with answers for each question in the report.  

### 3. Questions (50 points)

For this lab assignment, you will be using the `housing` dataset to complete the following tasks and answer the questions. The housing dataset is the California Housing Prices dataset based on data from the 1990 California census. You will use its features to build classification models that predict the `ocean_proximity` of a house. First, please place `housing.csv` and your notebook/Python file in the same directory, then load and preprocess the data.   

#### Load and preprocess the data

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#Please place housing.csv and your notebook/python file in the same directory; otherwise, change DATA_PATH 
DATA_PATH = ""

def load_housing_data(housing_path=DATA_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

#Add three useful features
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

#Divide the data frame into features and labels
housing_labels = housing["ocean_proximity"].copy() # use ocean_proximity as classification label
housing_features = housing.drop("ocean_proximity", axis=1) # use columns other than ocean_proximity as features

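#Fill the missing feature values with the column median
#(assign the result back; chained inplace fillna will stop working in pandas 3.0)
median = housing_features["total_bedrooms"].median()
housing_features["total_bedrooms"] = housing_features["total_bedrooms"].fillna(median)
median = housing_features["bedrooms_per_room"].median()
housing_features["bedrooms_per_room"] = housing_features["bedrooms_per_room"].fillna(median)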

#Scale the features
std_scaler  = StandardScaler()
housing_features_scaled = std_scaler.fit_transform(housing_features)

#Final housing features X
X = housing_features_scaled

#Binary labels - 0: INLAND; 1: CLOSE TO OCEAN
y_binary = (housing_labels != 1).astype(int)
#Multi-class labels - 0: <1H OCEAN; 1: INLAND; 2: NEAR OCEAN; 3: NEAR BAY
y_multi = housing_labels.astype(int)

#Data splits for binary classification
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X, y_binary, test_size=0.20, random_state=42)

#Data splits for multi-class classification
X_train_mu, X_test_mu, y_train_mu, y_test_mu = train_test_split(X, y_multi, test_size=0.20, random_state=42)

print(X_train_bi.shape)
print(X_test_bi.shape)

(16512, 12)
(4128, 12)




<font color='red'><b>About the data used in this assignment: </b></font><br>
**All the binary classification models are trained on `X_train_bi`, `y_train_bi`, and evaluated on `X_test_bi`, `y_test_bi`.**<br>
**All the multi-class classification models are trained on `X_train_mu`, `y_train_mu`, and evaluated on `X_test_mu`, `y_test_mu`.**<br>
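
One detail of the preprocessing cell is worth noting: `StandardScaler` is fitted on the full feature matrix before `train_test_split`, so the test rows contribute to the scaling statistics. The sketch below shows one common alternative, splitting first and letting a pipeline fit the scaler on the training rows only. It uses synthetic data as a stand-in for `housing.csv`; the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (hypothetical data, not housing.csv)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(500, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 10.0).astype(int)

# Split first, so the test rows never influence the scaler statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses that mean/std on X_te
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
model.fit(X_tr, y_tr)

# The fitted scaler's mean matches the training rows, not the full dataset
scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_tr.mean(axis=0)))
print(model.score(X_te, y_te))
```

For a dataset of this size the effect on the reported scores is probably small, but the pipeline form keeps the test set strictly unseen.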


#### Question 1 (4 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a logistic regression binary classification model in function `answer_one( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Choose your own `solver` and set `random_state=42` in `LogisticRegression`** 

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def answer_one():
    # Initialize the logistic regression model with a solver and random state
    binary_reg = LogisticRegression(solver='liblinear', random_state=42)
    
    # Train the model using the training data
    binary_reg.fit(X_train_bi, y_train_bi)
    
    # Predict the labels for the test data
    y_pred_bi = binary_reg.predict(X_test_bi)
    
    # Calculate accuracy
    binary_reg_accuracy = accuracy_score(y_test_bi, y_pred_bi)
    
    # Calculate F1 score
    binary_reg_f1 = f1_score(y_test_bi, y_pred_bi)
    
    return binary_reg_accuracy, binary_reg_f1

# Run your function in the cell to return the results
accuracy_1, f1_1 = answer_one()

print(accuracy_1)
print(f1_1)

0.9600290697674418
0.9710068529256721


#### Answer 1:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Accuracy is: (0.96) <br>
F1 score is: (0.97)
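
As a reminder of what the F1 score measures, the following small sketch computes precision, recall, and F1 by hand from a confusion matrix and checks the result against `f1_score`. The label arrays are made up for illustration and are unrelated to the housing data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy binary labels (hypothetical values, for illustration only)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual)
print(f1_score(y_true, y_pred))  # same value, computed by scikit-learn
```

By default `f1_score` treats label `1` as the positive class, which in this lab is the "close to ocean" class.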

#### Question 2 (4 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a softmax regression multi-class classification model in function `answer_two( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `solver="newton-cg"` to guarantee the convergence of train loss minimization and set `random_state=42` in `LogisticRegression`** 

In [4]:
def answer_two():
    # Initialize the logistic regression model with solver 'newton-cg' and random state
    multi_reg = LogisticRegression(solver='newton-cg', random_state=42)
    
    # Train the model using the training data
    multi_reg.fit(X_train_mu, y_train_mu)
    
    # Predict the labels for the test data
    y_pred_mu = multi_reg.predict(X_test_mu)
    
    # Calculate accuracy
    multi_reg_accuracy = accuracy_score(y_test_mu, y_pred_mu)
    
    # Calculate micro F1 score
    multi_reg_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
    
    # Calculate macro F1 score
    multi_reg_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
    
    return multi_reg_accuracy, multi_reg_microf1, multi_reg_macrof1

#Run your function in the cell to return the results
accuracy_2, microf1_2, macrof1_2 = answer_two()
print(accuracy_2)
print(microf1_2)
print(macrof1_2)

0.7974806201550387
0.7974806201550387
0.6847642281014538


#### Answer 2:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Accuracy is: (0.7975) <br>
Micro f1 score is: (0.7975) <br>
Macro f1 score is: (0.6848)
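
The micro F1 score above is identical to the accuracy, and the macro F1 score is lower. The sketch below reproduces that pattern on made-up labels: in single-label multi-class problems micro F1 pools all decisions and reduces to accuracy, while macro F1 averages the per-class scores equally, so weak results on small classes pull it down. The label arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: classes 0 and 1 are common, classes 2 and 3 are rare
y_true = np.array([0] * 8 + [1] * 8 + [2] * 2 + [3] * 2)
y_pred = y_true.copy()
y_pred[16] = 1   # one class-2 sample is mistaken for class 1
y_pred[18] = 0   # one class-3 sample is mistaken for class 0

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
per_class = f1_score(y_true, y_pred, average=None)

print(acc, micro)   # equal for single-label multi-class data
print(per_class)    # the rare classes have the lowest F1
print(macro)        # unweighted mean of the per-class scores
```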

#### Question 3 (5 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a k-nearest neighbors binary classification model in function `answer_three( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set the option `n_neighbors=` in `KNeighborsClassifier` using `1`, `3`, `5`, `7`, and `9` respectively to find an optimal value `k`**   

In [6]:
from sklearn.neighbors import KNeighborsClassifier

#Please use 1, 3, 5, 7, 9
k_values = [1, 3, 5, 7, 9]

def answer_three(k):
    # Initialize the k-nearest neighbors model with k neighbors
    binary_knn = KNeighborsClassifier(n_neighbors=k)
    
    # Train the model using the training data
    binary_knn.fit(X_train_bi, y_train_bi)
    
    # Predict the labels for the test data
    y_pred_bi = binary_knn.predict(X_test_bi)
    
    # Calculate accuracy
    binary_knn_accuracy = accuracy_score(y_test_bi, y_pred_bi)
    
    # Calculate F1 score
    binary_knn_f1 = f1_score(y_test_bi, y_pred_bi)
    
    return binary_knn_accuracy, binary_knn_f1

# Run your function in the cell to return the results for each k
results = {}
for k in k_values:
    accuracy, f1 = answer_three(k)
    results[k] = (accuracy, f1)

results

{1: (0.9256298449612403, np.float64(0.9456155890168291)),
 3: (0.9367732558139535, np.float64(0.9543307086614173)),
 5: (0.935077519379845, np.float64(0.9533101045296167)),
 7: (0.9367732558139535, np.float64(0.9546165884194053)),
 9: (0.9358042635658915, np.float64(0.9539530842745438))}

#### Answer 3: 
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When k = 1</b>, accuracy is: (0.9256)<br>
<b>When k = 3</b>, accuracy is: (0.9368)<br>
<b>When k = 5</b>, accuracy is: (0.9351)<br>
<b>When k = 7</b>, accuracy is: (0.9368)<br>
<b>When k = 9</b>, accuracy is: (0.9358)<br>
<b>Optimal k (`n_neighbors`) is</b>: (7), accuracy is: (0.9368), F1 score is: (0.9546)<br>
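
The assignment asks for `k` to be chosen by test-set performance. A common alternative, sketched below under the assumption that synthetic data is an acceptable stand-in, is to pick `k` by cross-validation on the training split and touch the test split only once at the end. The names `X_demo`, `y_demo`, `cv_scores`, and `best_k` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data (hypothetical stand-in for X_train_bi / y_train_bi)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

k_values = [1, 3, 5, 7, 9]
cv_scores = {}
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training split only
    cv_scores[k] = cross_val_score(knn, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# Choose k by cross-validation, then evaluate once on the held-out test split
best_k = max(cv_scores, key=cv_scores.get)
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)

print(cv_scores)
print(best_k, test_accuracy)
```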

#### Question 4 (7 points):  
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a k-nearest neighbors multi-class classification model in function `answer_four( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, macro F1 score, loading time, and prediction time.

**Set `n_neighbors=` in `KNeighborsClassifier` using the optimal `k` in Question 3 and set the option `algorithm=` using `'brute'`, `'kd_tree'`, and `'ball_tree'` respectively to compare the different time used**  

In [7]:
import time

#Please use the optimal k in Question 3
k = 7

#Please use 'brute', 'kd_tree', and 'ball_tree', respectively  
alg = ['brute', 'kd_tree', 'ball_tree']

def answer_four(k, alg):
    # Add a time checkpoint here
    time1 = time.time()
    
    # Initialize the k-nearest neighbors model with k neighbors and specified algorithm
    multi_knn = KNeighborsClassifier(n_neighbors=k, algorithm=alg)
    
    # Train the model using the training data
    multi_knn.fit(X_train_mu, y_train_mu)
    
    # Add a time checkpoint here
    time2 = time.time()
    
    # Predict the labels for the test data
    y_pred_mu = multi_knn.predict(X_test_mu)
    
    # Add a time checkpoint here
    time3 = time.time()
    
    # Calculate accuracy
    multi_knn_accuracy = accuracy_score(y_test_mu, y_pred_mu)
    
    # Calculate micro F1 score
    multi_knn_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
    
    # Calculate macro F1 score
    multi_knn_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
    
    # Time used for data loading
    multi_knn_loadtime = time2 - time1
    
    # Time used for prediction
    multi_knn_predictiontime = time3 - time2
    
    return multi_knn_accuracy, multi_knn_microf1, multi_knn_macrof1, multi_knn_loadtime, multi_knn_predictiontime

# Run your function in the cell to return the results for each algorithm
results = {}
for a in alg:
    accuracy, microf1, macrof1, loadtime, predictiontime = answer_four(k, a)
    results[a] = (accuracy, microf1, macrof1, loadtime, predictiontime)

results

{'brute': (0.8078972868217055,
  np.float64(0.8078972868217055),
  np.float64(0.7445180350580481),
  0.001486063003540039,
  0.1413097381591797),
 'kd_tree': (0.8078972868217055,
  np.float64(0.8078972868217055),
  np.float64(0.7445180350580481),
  0.015799999237060547,
  0.2689788341522217),
 'ball_tree': (0.8078972868217055,
  np.float64(0.8078972868217055),
  np.float64(0.7445180350580481),
  0.008485555648803711,
  0.4559965133666992)}

#### Answer 4:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>Brute force: </b> data loading time is: (0.0015 s), prediction time is: (0.1413 s), accuracy is: (0.8079), micro f1 score is: (0.8079), macro f1 score is: (0.7445) <br>
<b>K-d tree: </b> data loading time is: (0.0158 s), prediction time is: (0.2690 s), accuracy is: (0.8079), micro f1 score is: (0.8079), macro f1 score is: (0.7445) <br>
<b>Ball tree: </b> data loading time is: (0.0085 s), prediction time is: (0.4560 s), accuracy is: (0.8079), micro f1 score is: (0.8079), macro f1 score is: (0.7445) <br>
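Summarize your observations about the time used by these searching algorithms: (In this run, brute force was the fastest both to fit and to predict. K-d tree and ball tree spend extra time building an index during fitting, and with 12 features and about 16,500 training samples that index did not pay off at prediction time. I suspect this could change with a larger dataset or fewer features.) and observations about the classification performance: (All three produce exactly the same accuracy and F1 scores, which is expected because they are different ways of running the same exact nearest-neighbor search.)  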
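
The timing comparison can be repeated on other data with a small loop. The sketch below uses `time.perf_counter`, which is generally better suited to short intervals than `time.time`, and also checks how often the three search algorithms agree on the predicted label. The data is synthetic, and the absolute timings depend on the machine, so no particular ordering should be expected.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data (hypothetical stand-in for the housing features)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=12, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

predictions = {}
timings = {}
for alg in ["brute", "kd_tree", "ball_tree"]:
    knn = KNeighborsClassifier(n_neighbors=7, algorithm=alg)

    start = time.perf_counter()
    knn.fit(X_tr, y_tr)                  # builds the search index, if any
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    predictions[alg] = knn.predict(X_te)  # runs the neighbor search
    predict_time = time.perf_counter() - start

    timings[alg] = (fit_time, predict_time)

# Fraction of test samples on which two algorithms predict the same label
agreement = np.mean(predictions["brute"] == predictions["kd_tree"])
print(timings)
print(agreement)
```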

#### Question 5 (7 points):  
Please use features `X_train_bi` and binary labels `y_train_bi` to train a decision tree binary classification model in function `answer_five( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set `random_state=42` and `criterion='gini'` in `DecisionTreeClassifier`, and set `max_depth=` using `2`, `5`, and `10` respectively to compare different performance** 

In [8]:
from sklearn.tree import DecisionTreeClassifier

#Please use 2, 5, 10
depth = [2, 5, 10]

def answer_five(d):
    # Initialize the decision tree model with specified max depth, criterion 'gini', and random state
    binary_dt = DecisionTreeClassifier(max_depth=d, criterion='gini', random_state=42)
    
    # Train the model using the training data
    binary_dt.fit(X_train_bi, y_train_bi)
    
    # Predict the labels for the test data
    y_pred_bi = binary_dt.predict(X_test_bi)
    
    # Calculate accuracy
    binary_dt_accuracy = accuracy_score(y_test_bi, y_pred_bi)
    
    # Calculate F1 score
    binary_dt_f1 = f1_score(y_test_bi, y_pred_bi)
    
    return binary_dt_accuracy, binary_dt_f1

# Run your function in the cell to return the results for each max depth
results = {}
for d in depth:
    accuracy, f1 = answer_five(d)
    results[d] = (accuracy, f1)

results

{2: (0.8604651162790697, np.float64(0.9007238883143743)),
 5: (0.9367732558139535, np.float64(0.9545850008700192)),
 10: (0.9762596899224806, np.float64(0.9826179496275275))}

#### Answer 5:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When d = 2: </b> accuracy is: (0.8605), and f1 score is: (0.9007) <br> 
<b>When d = 5: </b> accuracy is: (0.9368), and f1 score is: (0.9546) <br> 
<b>When d = 10: </b> accuracy is: (0.9763), and f1 score is: (0.9826) <br>
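Summarize your observations about the performance derived by these different `max_depth`: (On this data, accuracy and F1 score both rise as `max_depth` increases from 2 to 10, so the shallow trees appear to underfit. Deeper trees cost more to train and can eventually overfit, although no drop in test performance is visible up to depth 10.)  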
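
One way to check whether a deeper tree is starting to overfit is to compare training and test accuracy side by side. The sketch below does this on synthetic data with some label noise (`flip_y`); the data and the extra `None` depth (an unrestricted tree) are assumptions added for illustration and are not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with 10% label noise (hypothetical stand-in)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=6, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

scores = {}
for d in [2, 5, 10, None]:   # None lets the tree grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=d, criterion="gini", random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    scores[d] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for d, (train_acc, test_acc) in scores.items():
    print(d, round(train_acc, 4), round(test_acc, 4))
```

A widening gap between the two columns as depth grows is the usual sign of overfitting.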

#### Question 6 (7 points):
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a decision tree multi-class classification model in function `answer_six( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `max_depth=5` and `random_state=42` in `DecisionTreeClassifier` and set `criterion=` using `'gini'` and `'entropy'` respectively to compare different performance**  

In [9]:
# Please use 'gini' and 'entropy' respectively
criteria = ['gini', 'entropy']

def answer_six(c):
    # Initialize the decision tree model with specified criterion, max depth 5, and random state
    multi_dt = DecisionTreeClassifier(criterion=c, max_depth=5, random_state=42)
    
    # Train the model using the training data
    multi_dt.fit(X_train_mu, y_train_mu)
    
    # Predict the labels for the test data
    y_pred_mu = multi_dt.predict(X_test_mu)
    
    # Calculate accuracy
    multi_dt_accuracy = accuracy_score(y_test_mu, y_pred_mu)
    
    # Calculate micro F1 score
    multi_dt_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
    
    # Calculate macro F1 score
    multi_dt_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
    
    return multi_dt_accuracy, multi_dt_microf1, multi_dt_macrof1

# Run your function in the cell to return the results for each criterion
results = {}
for c in criteria:
    accuracy, microf1, macrof1 = answer_six(c)
    results[c] = (accuracy, microf1, macrof1)

results

{'gini': (0.8783914728682171,
  np.float64(0.8783914728682171),
  np.float64(0.8446278025347744)),
 'entropy': (0.8800872093023255,
  np.float64(0.8800872093023255),
  np.float64(0.864962365347806))}

#### Answer 6:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When c = 'gini': </b> accuracy is: (0.8784), micro f1 score is: (0.8784), macro f1 score is: (0.8446) <br> 
<b>When c = 'entropy': </b> accuracy is: (0.8801), micro f1 score is: (0.8801), macro f1 score is: (0.8650) <br> 
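Summarize your observations about the performance derived by these different `criterion`: (The entropy criterion gives slightly higher accuracy and a noticeably higher macro F1 score (0.8650 vs 0.8446), which suggests its splits handled the less frequent classes somewhat better on this data. The two criteria usually produce similar trees, so this difference may not generalize to other datasets or depths.)  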
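
For reference, the two criteria score the impurity of a node from its class proportions $p_i$: Gini impurity is $1 - \sum_i p_i^2$ and entropy is $-\sum_i p_i \log_2 p_i$. The helper functions below are a minimal sketch written for illustration (they are not scikit-learn functions) and evaluate both measures on a few example class counts.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]   # drop empty classes, treating 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

# A balanced node, a mostly pure node, and a pure node
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, gini(counts), entropy(counts))
```

Both measures are zero for a pure node and largest for a balanced one; they differ in how strongly they penalize intermediate mixtures, which is why the resulting trees are often similar.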

#### Question 7 (7 points):
Please use features `X_train_bi` and binary labels `y_train_bi` to train a binary classification model using AdaBoost ensemble learning in function `answer_seven( )`. After the model is trained, use `X_test_bi` and `y_test_bi` to evaluate the performance, including accuracy and F1 score.

**Set the base model as `DecisionTreeClassifier(max_depth=1)`, `n_estimators=100` and `random_state=42` in `AdaBoostClassifier`** 

In [None]:
from sklearn.ensemble import AdaBoostClassifier

def answer_seven():
    # Initialize the AdaBoost model with DecisionTreeClassifier as the base estimator
    binary_ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=100,
        algorithm='SAMME',
        random_state=42
    )
    
    # Train the model using the training data
    binary_ada.fit(X_train_bi, y_train_bi)
    
    # Predict the labels for the test data
    y_pred_bi = binary_ada.predict(X_test_bi)
    
    # Calculate accuracy
    binary_ada_accuracy = accuracy_score(y_test_bi, y_pred_bi)
    
    # Calculate F1 score
    binary_ada_f1 = f1_score(y_test_bi, y_pred_bi)
    
    return binary_ada_accuracy, binary_ada_f1

# Run your function in the cell to return the results
accuracy_7, f1_7 = answer_seven()
print(accuracy_7)
print(f1_7)

0.970687984496124
0.9786256845080374


#### Answer 7:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Accuracy is: (0.9707) <br>
F1 score is: (0.9786) <br>
Compared to the classification results in Question 5 that builds decision trees with `max_depth=2, 5, 10`, summarize your observations about the performance derived by AdaBoost with `DecisionTreeClassifier(max_depth=1)`: (AdaBoost with 100 depth-1 stumps reaches accuracy 0.9707 and F1 score 0.9786, which is close to the depth-10 tree (0.9763, 0.9826) and better than the depth-2 and depth-5 trees. It also ran noticeably slower, which is consistent with training 100 weak learners one after another.)
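
To see how boosting builds on weak learners, the sketch below compares a single depth-1 stump with an AdaBoost ensemble and records the test accuracy after each boosting round using `staged_score`. It runs on synthetic data and relies on AdaBoost's default base estimator, which is a depth-1 decision tree; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (hypothetical stand-in for the housing features)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=6, flip_y=0.05, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# A single depth-1 stump as the baseline weak learner
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_tr, y_tr)
stump_accuracy = stump.score(X_te, y_te)

# AdaBoost with its default base estimator (a depth-1 decision tree)
ada = AdaBoostClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round
staged_accuracy = list(ada.staged_score(X_te, y_te))

print(stump_accuracy)
print(staged_accuracy[0], staged_accuracy[-1])
```

Plotting `staged_accuracy` against the round number would show how quickly the ensemble improves and where extra estimators stop helping.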

#### Question 8 (7 points):
Please use features `X_train_mu` and multi-class labels `y_train_mu` to train a random forest multi-class classification model in function `answer_eight( )`. After the model is trained, use `X_test_mu` and `y_test_mu` to evaluate the performance, including accuracy, micro F1 score, and macro F1 score.

**Set `n_estimators=10` and `random_state=42` in `RandomForestClassifier`**  

In [16]:
from sklearn.ensemble import RandomForestClassifier

def answer_eight():
    # Initialize the random forest model with 10 estimators and random state
    multi_rf = RandomForestClassifier(n_estimators=10, random_state=42)
    
    # Train the model using the training data
    multi_rf.fit(X_train_mu, y_train_mu)
    
    # Predict the labels for the test data
    y_pred_mu = multi_rf.predict(X_test_mu)
    
    # Calculate accuracy
    multi_rf_accuracy = accuracy_score(y_test_mu, y_pred_mu)
    
    # Calculate micro F1 score
    multi_rf_microf1 = f1_score(y_test_mu, y_pred_mu, average='micro')
    
    # Calculate macro F1 score
    multi_rf_macrof1 = f1_score(y_test_mu, y_pred_mu, average='macro')
    
    return multi_rf_accuracy, multi_rf_microf1, multi_rf_macrof1

# Run your function in the cell to return the results
accuracy_8, microf1_8, macrof1_8 = answer_eight()
print(accuracy_8)
print(microf1_8)
print(macrof1_8)

0.9457364341085271
0.9457364341085271
0.9356159265365518


#### Answer 8:  
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
<b>When n_estimators=10: </b> accuracy is: (0.9457), micro f1 score is: (0.9457), macro f1 score is: (0.9356) <br> 
Compared to the classification results in Question 6 that builds a single decision tree, summarize your observations about the performance derived by random forest: (The random forest is clearly better: accuracy 0.9457 and macro F1 0.9356, compared with at most 0.8801 and 0.8650 for the single trees in Question 6. Averaging several trees trained on bootstrap samples reduces variance, and the forest's trees are not limited to `max_depth=5` as the single trees were, so part of the gain may come from the extra depth.)  
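
The single trees in Question 6 use `max_depth=5`, while the random forest in Question 8 grows unrestricted trees, so the two results differ in both depth and ensembling. The sketch below separates the two effects on synthetic data by adding a forest with the same depth cap. The data and the model labels in the dictionary are assumptions for illustration, and the numbers will differ from the housing results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data (hypothetical stand-in for the housing features)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=12, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

models = {
    "single tree, max_depth=5": DecisionTreeClassifier(max_depth=5, random_state=42),
    "forest, max_depth=5": RandomForestClassifier(
        n_estimators=10, max_depth=5, random_state=42
    ),
    "forest, unrestricted depth": RandomForestClassifier(
        n_estimators=10, random_state=42
    ),
}

macro_f1 = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Macro F1 on the held-out split for each configuration
    macro_f1[name] = f1_score(y_te, model.predict(X_te), average="macro")
    print(name, round(macro_f1[name], 4))
```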

#### Question 9 (2 points):
<font color='red'><b>Double click here to answer the questions in this cell: </b></font><br>
Based on the results from Question 1 to Question 8 (considering different model parameters): <br>
The model with best binary classification performance is: (Decision Tree with max depth of 10) <br>
The model with worst binary classification performance is: (Decision Tree with max depth of 2) <br>
The model with best multi-class classification performance is: (Random Forest) <br>
The model with worst multi-class classification performance is: (Softmax Regression)