# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: SOUMINI MOHANDAS

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, accuracy_score, f1_score

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [36]:
# Import the occupancy dataset from yellowbrick library
from yellowbrick.datasets import load_occupancy

In [37]:
# Load the occupancy dataset into a feature matrix X and a target vector y
X, y = load_occupancy()

In [38]:
# Print the size and type of the feature matrix X 
# Shape indicates (n_samples, n_features)
print("The shape of feature matrix X: ", X.shape) 
print("The type of feature matrix X: ", type(X))

The shape of feature matrix X:  (20560, 5)
The type of feature matrix X:  <class 'pandas.core.frame.DataFrame'>


In [39]:
# Print the size and type of the target vector y 
# Shape indicates length (n_samples)
print("The shape of target vector y: ", y.shape) 
print("The type of target vector y: ", type(y))

The shape of target vector y:  (20560,)
The type of target vector y:  <class 'pandas.core.series.Series'>


In [40]:
# Shows the various columns present in the feature matrix X
# (i.e., 5 features = 5 columns)
X.head()

Unnamed: 0,temperature,relative humidity,light,CO2,humidity
0,23.18,27.272,426.0,721.25,0.004793
1,23.15,27.2675,429.5,714.0,0.004783
2,23.15,27.245,426.0,713.5,0.004779
3,23.15,27.2,426.0,708.25,0.004772
4,23.1,27.2,426.0,704.5,0.004757


In [41]:
# Shows the 1D target array having only 1 column (i.e., occupancy)
y.head(10)

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: occupancy, dtype: int64

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

Answer 01: The dataset used for this assignment is the occupancy dataset that ships with the yellowbrick library. 
https://www.scikit-yb.org/en/latest/api/datasets/occupancy.html
The yellowbrick function load_occupancy() was used to load the dataset into the feature matrix X and the target vector y, after which the size and type of X and y were inspected.

Answer 02: 
I chose this dataset for the following reasons:
a) The dataset is designed for binary classification, which makes it well-suited for tasks where you want to predict one of two discrete outcomes, such as "occupied" or "not occupied."
b) With 20,560 instances, the dataset provides a reasonably large sample size.
c) While the dataset provides real-valued attributes, there's still room for feature engineering if needed. One can create additional features or apply transformations to improve model performance.
d) The dataset contains only 5 features, which leaves little to choose from during feature selection and makes that step challenging. I wanted to see how the results turn out in such a scenario. 

Answer 03: 
Discovering a suitable dataset felt like a challenging dilemma. I would come across intriguing ones, only to realize midway through the modeling process that they yielded unsatisfactory results. Conversely, when I encountered datasets that performed well, they lacked the same level of interest. Eventually, I settled on the current dataset because I found it to be both complex and demanding from various angles. 
Some of the factors that I found challenging with this dataset are: 
a) Class imbalance is a common challenge in binary classification datasets. If one class significantly outnumbers the other (e.g., many more instances of "not occupied" than "occupied"), model performance can suffer, and this dataset is imbalanced in that way. 
b) While the dataset has 5 features, not all of them may be equally informative. Feature selection or feature importance analysis may be necessary to identify the most relevant attributes for prediction.
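
The class imbalance mentioned in point (a) can be checked directly before any modeling. The following is a minimal sketch using a small hypothetical target vector (`y_demo` is an illustrative stand-in, not the occupancy data); the same calls should work on the real `y`.

```python
import pandas as pd

# Hypothetical stand-in for the target: 8 "not occupied" (0) and 2 "occupied" (1)
y_demo = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="occupancy")

# Absolute number of samples per class
class_counts = y_demo.value_counts()

# Share of each class (the shares sum to 1.0)
class_share = y_demo.value_counts(normalize=True)

# How many majority samples exist for each minority sample
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share)
print("Imbalance ratio:", imbalance_ratio)
```

A ratio close to 1 indicates balanced classes; the larger the ratio, the more a metric such as plain accuracy can be dominated by the majority class.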

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [42]:
# Clean data (if needed)
# Check if the X dataframe contains any missing or NaN values
# sum() indicates the total count of NaN values present in each column 
X.isnull().sum()

temperature          0
relative humidity    0
light                0
CO2                  0
humidity             0
dtype: int64

In [43]:
# Check if the target vector y has any missing values 
# present in its 1 column (i.e., occupancy)
y.isnull().sum()

0

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [45]:
# The classes are imbalanced, so the training data is resampled before the models are fitted.
# There are two ways to resample: oversample the minority class or undersample the majority class.
# Here the minority class is oversampled.
# Only the training split is resampled, so the test set keeps its original class distribution.

from sklearn.utils import resample

# Combine the feature and target data for the training set
Xy_train = pd.concat([X_train, y_train], axis=1)

# Separate the majority and minority classes
majority_class = Xy_train[Xy_train['occupancy'] == 0]
minority_class = Xy_train[Xy_train['occupancy'] == 1]

# Oversample the minority class (e.g., duplicate minority samples)
minority_class_oversampled = resample(minority_class,
                                      replace=True,  # Sample with replacement
                                      n_samples=len(majority_class),  # Match the majority class size
                                      random_state=42)  # Set a random seed for reproducibility

# Combine the oversampled minority class with the majority class
Xy_train_resampled = pd.concat([majority_class, minority_class_oversampled])

# Separate the features and target from the resampled data
X_train_resampled = Xy_train_resampled.drop('occupancy', axis=1)
y_train_resampled = Xy_train_resampled['occupancy']

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

Answer 01: 
As seen at the beginning of Step 2, there are no missing or NaN values in either the feature matrix X or the target vector y, so no imputation is needed. If missing values had existed, and assuming we filled them with zeros rather than dropping the affected rows or columns, we would have used the following commands:
a) X.fillna(0) # For feature matrix X
b) y.fillna(0) # For target vector y 
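
Filling with zeros can place a value far outside the observed range of a feature (a temperature of 0, for example), so a column statistic is a common alternative. The sketch below swaps in median imputation with scikit-learn's `SimpleImputer`; it is an illustration on a hypothetical frame (`X_demo`), not a step the notebook performs, since the occupancy data has no gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature frame with gaps (the real occupancy data has none)
X_demo = pd.DataFrame({
    "temperature": [23.1, np.nan, 23.3, 23.2],
    "CO2": [700.0, 710.0, np.nan, 2000.0],
})

# Replace each NaN with the median of its own column
imputer = SimpleImputer(strategy="median")
X_filled = pd.DataFrame(imputer.fit_transform(X_demo), columns=X_demo.columns)

print(X_filled)
```

In a train/test workflow the imputer would normally be fitted on the training split only (for example as the first step of a `Pipeline`) so that test-set statistics do not leak into training.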

Answer 02: 
a) The data type for the occupancy dataset features is described as "real" and "positive." This typically implies that the features are numerical and non-negative.
b) Preprocessing methods commonly applied to such data include feature scaling (I use StandardScaler in the pipeline), handling missing values (there are none in this dataset), and handling imbalanced classes. Imbalance can be addressed by resampling the training data, either by oversampling the minority class or by undersampling the majority class, and then training the model on the resampled data. In this case, I opted to oversample the minority class, as seen above. 
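
Resampling is not the only way to handle imbalance. As a hedged alternative (not what this notebook does), many scikit-learn classifiers accept `class_weight='balanced'`, which reweights the loss instead of duplicating rows. The data below is synthetic and the names `X_demo`, `y_demo`, and `weighted_pipeline` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced data: 90 samples of class 0 and 10 of class 1
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(90, 2)),
    rng.normal(3.0, 1.0, size=(10, 2)),
])
y_demo = np.array([0] * 90 + [1] * 10)

# 'balanced' assigns each class the weight n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print("Class weights:", weights)

# Same scaler + classifier layout as the notebook's pipelines, with class weighting
weighted_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(class_weight="balanced")),
])
weighted_pipeline.fit(X_demo, y_demo)
print("Training accuracy:", weighted_pipeline.score(X_demo, y_demo))
```

Because no rows are duplicated, this approach avoids copies of the same minority sample ending up in both the training and validation folds during cross-validation.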

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [46]:
# Random Forest Pipeline
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])

# Logistic Regression Pipeline
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# SVM Pipeline
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

# Define parameter grids
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [5, 10, 15]
}

param_grid_lr = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l1'],  # Use 'l1' penalty
    'classifier__solver': ['liblinear', 'saga']  # Choose an appropriate solver
}

param_grid_svm = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

# Define the scoring metrics you want to use
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score)
}

# Create GridSearchCV instances for each algorithm with multiple scoring metrics
grid_search_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=5, scoring=scoring, refit='f1_score')
grid_search_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=5, scoring=scoring, refit='f1_score')
grid_search_svm = GridSearchCV(svm_pipeline, param_grid_svm, cv=5, scoring=scoring, refit='f1_score')

# Fit the models
grid_search_rf.fit(X_train_resampled, y_train_resampled)
grid_search_lr.fit(X_train_resampled, y_train_resampled)
grid_search_svm.fit(X_train_resampled, y_train_resampled)

# Get the best parameters based on F1
best_params_rf = grid_search_rf.best_params_
best_params_lr = grid_search_lr.best_params_
best_params_svm = grid_search_svm.best_params_

# Access the results for both scoring metrics
results_rf = grid_search_rf.cv_results_
results_lr = grid_search_lr.cv_results_
results_svm = grid_search_svm.cv_results_

# Print the results for accuracy and F1
print("Random Forest Results:")
print("Accuracy scores:", results_rf['mean_test_accuracy'])
print("F1 scores:", results_rf['mean_test_f1_score'])
print("\nBest Parameters for Random Forest based on F1:", best_params_rf)

print("\nLogistic Regression Results:")
print("Accuracy scores:", results_lr['mean_test_accuracy'])
print("F1 scores:", results_lr['mean_test_f1_score'])
print("\nBest Parameters for Logistic Regression based on F1:", best_params_lr)

print("\nSVM Results:")
print("Accuracy scores:", results_svm['mean_test_accuracy'])
print("F1 scores:", results_svm['mean_test_f1_score'])
print("\nBest Parameters for SVM based on F1:", best_params_svm)



Random Forest Results:
Accuracy scores: [0.99247696 0.99263535 0.99267495 0.99469428 0.99477346 0.99469427
 0.99572373 0.99580292 0.99576332]
F1 scores: [0.9925118  0.99267036 0.99271064 0.99472133 0.99479896 0.99472025
 0.99573899 0.9958179  0.99577823]

Best Parameters for Random Forest based on F1: {'classifier__max_depth': 15, 'classifier__n_estimators': 200}

Logistic Regression Results:
Accuracy scores: [0.99120997 0.99124957 0.99128917 0.99128917 0.99136834 0.99136834]
F1 scores: [0.99126703 0.99130514 0.99134486 0.99134486 0.99142264 0.99142271]

Best Parameters for Logistic Regression based on F1: {'classifier__C': 10, 'classifier__penalty': 'l1', 'classifier__solver': 'saga'}

SVM Results:
Accuracy scores: [0.99140796 0.99132874 0.99124955 0.99223943 0.99113077 0.99279375]
F1 scores: [0.99146175 0.99138566 0.99130203 0.99228818 0.99118619 0.99283522]

Best Parameters for SVM based on F1: {'classifier__C': 10, 'classifier__kernel': 'rbf'}


After the grid search, the best model parameters for each model are: 
a) Best Parameters for Random Forest based on F1: classifier__max_depth = 15, classifier__n_estimators = 200
b)Best Parameters for Logistic Regression based on F1: classifier__C = 10, classifier__penalty = l1, classifier__solver = saga
c) Best Parameters for SVM based on F1: classifier__C = 10, classifier__kernel = rbf
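
The score arrays printed above contain one cross-validated mean per parameter combination. To report a single number for the winning combination, `best_index_` can be used to index into `cv_results_`. The sketch below assumes a small synthetic dataset and a single hypothetical grid search named `search`; the same pattern should apply to `grid_search_rf`, `grid_search_lr`, and `grid_search_svm`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the occupancy dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
scoring = {
    "accuracy": make_scorer(accuracy_score),
    "f1_score": make_scorer(f1_score),
}
search = GridSearchCV(pipeline, {"classifier__C": [0.1, 1, 10]}, cv=5,
                      scoring=scoring, refit="f1_score")
search.fit(X_demo, y_demo)

# best_index_ is the row of cv_results_ selected by the refit metric (F1 here)
best = search.best_index_
best_cv_accuracy = search.cv_results_["mean_test_accuracy"][best]
best_cv_f1 = search.cv_results_["mean_test_f1_score"][best]

print("Best parameters:", search.best_params_)
print("CV accuracy at best parameters:", best_cv_accuracy)
print("CV F1 at best parameters:", best_cv_f1)
```

Scalars extracted this way fit into a comparison table as one value per model rather than a whole array per cell.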

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [47]:
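# Reuse 'results_rf', 'results_lr', and 'results_svm' from Step 3
# mean_test_accuracy holds one cross-validated mean per parameter combination
# Note: because the searches use refit='f1_score', grid_search.score() evaluates
# the best estimator with the F1 scorer, so the test values below are F1 scores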
mean_test_accuracy_rf = results_rf['mean_test_accuracy']
mean_test_accuracy_lr = results_lr['mean_test_accuracy']
mean_test_accuracy_svm = results_svm['mean_test_accuracy']

test_accuracy_rf = grid_search_rf.score(X_test, y_test)
test_accuracy_lr = grid_search_lr.score(X_test, y_test)
test_accuracy_svm = grid_search_svm.score(X_test, y_test)

# Collect the scores in a dataframe for side-by-side comparison

data = {
    'Algorithm': ['Random Forest', 'Logistic Regression', 'SVM'],
    'Mean Test Accuracy': [mean_test_accuracy_rf, mean_test_accuracy_lr, mean_test_accuracy_svm],
    'Test Accuracy': [test_accuracy_rf, test_accuracy_lr, test_accuracy_svm]
}

pd.DataFrame(data)

Unnamed: 0,Algorithm,Mean Test Accuracy,Test Accuracy
0,Random Forest,"[0.9924769645733864, 0.9926353490517075, 0.992...",0.983467
1,Logistic Regression,"[0.9912099749616461, 0.9912495710812264, 0.991...",0.974251
2,SVM,"[0.9914079555595476, 0.9913287398072519, 0.991...",0.97738


In [50]:
# Reuse 'results_rf', 'results_lr', and 'results_svm' from Step 3
# Extract the cross-validated mean F1 scores and the F1 scores on the test set
mean_test_f1_score_rf = results_rf['mean_test_f1_score']
mean_test_f1_score_lr = results_lr['mean_test_f1_score']
mean_test_f1_score_svm = results_svm['mean_test_f1_score']

test_f1_score_rf = grid_search_rf.score(X_test, y_test)
test_f1_score_lr = grid_search_lr.score(X_test, y_test)
test_f1_score_svm = grid_search_svm.score(X_test, y_test)

# Collect the scores in a dataframe for side-by-side comparison

data = {
    'Algorithm': ['Random Forest', 'Logistic Regression', 'SVM'],
    'Mean F1 Test Accuracy': [mean_test_f1_score_rf, mean_test_f1_score_lr, mean_test_f1_score_svm],
    'Test F1': [test_f1_score_rf, test_f1_score_lr, test_f1_score_svm]
}

df_ = pd.DataFrame(data)
df_

Unnamed: 0,Algorithm,Mean F1 Test Accuracy,Test F1
0,Random Forest,"[0.9925118047660547, 0.9926703592215838, 0.992...",0.983467
1,Logistic Regression,"[0.9912670272790471, 0.9913051372994174, 0.991...",0.974251
2,SVM,"[0.9914617508956717, 0.991385659071498, 0.9913...",0.97738
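
The two tables above show identical test values because `GridSearchCV.score()` uses the metric named in `refit` when several scorers are configured. To obtain accuracy and F1 separately, the predictions can be scored explicitly. This is a minimal sketch on synthetic data; `search`, `X_te`, and `y_te` are illustrative names standing in for the notebook's grid searches and test split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced stand-in data (assumption: not the occupancy dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

scoring = {
    "accuracy": make_scorer(accuracy_score),
    "f1_score": make_scorer(f1_score),
}
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1, 10]}, cv=5,
                      scoring=scoring, refit="f1_score")
search.fit(X_tr, y_tr)

# Score the refit best estimator's predictions with each metric explicitly
y_pred = search.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)
test_f1 = f1_score(y_te, y_pred)

# score() follows the refit metric, so it matches the explicit F1 value
score_from_search = search.score(X_te, y_te)

print("Test accuracy:", test_accuracy)
print("Test F1:", test_f1)
print("search.score():", score_from_search)
```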


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

Answer 01: 
The problem we are addressing with this dataset is binary classification (occupancy), so classification models are appropriate for this dataset. 

Answer 02: 
I selected Random Forest, Logistic Regression, and SVM. These three models are well-established, versatile algorithms suitable for binary classification tasks.
Random Forest was chosen for its ability to capture complex relationships. 
Logistic Regression was chosen for its interpretability and efficiency. 
SVM was chosen for its effectiveness in high-dimensional spaces and flexibility in modeling decision boundaries.

Answer 03: 
Random Forest worked best, with a test score of 0.983, followed by SVM (0.977) and Logistic Regression (0.974). All three models perform very well, and the differences between them are small. 
Random Forest coming out on top is consistent with the reason it was chosen: it can capture complex relationships between the features and the target. 
The occupancy dataset is a binary classification problem where the goal is to predict whether a room is occupied based on environmental factors such as temperature, humidity, light, and CO2 levels. The dataset likely contains distinct patterns linking these features to occupancy, which the models capture effectively. 
Several steps likely contributed to the high scores: the data is clean and free of missing values, the features are standardized with StandardScaler, the training classes are balanced through oversampling, and the hyperparameters are tuned with a grid search. 
Together, these results suggest that the models have learned the underlying patterns in the data and can accurately predict room occupancy from environmental factors. 


### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

Answer 01:
I chose the F1 score. Since this dataset is imbalanced, metrics like F1 score, precision, recall, and the confusion matrix give a more comprehensive view of model performance than accuracy alone. 
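
To make the relationship between these metrics concrete, here is a small worked example on hypothetical labels and predictions (`y_true` and `y_pred` are made up for illustration and are not results from this notebook).

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions (1 = occupied, 0 = not occupied)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```

Here 3 of the 4 predicted positives are correct and 3 of the 4 actual positives are found, so precision, recall, and F1 all equal 0.75, while accuracy would be 0.8 because the 5 true negatives also count.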

Answer 02:
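The test scores reported in the accuracy and F1 tables are identical for each model: 0.983 for Random Forest, 0.977 for SVM, and 0.974 for Logistic Regression. These are slightly lower than the cross-validated scores from Step 3 (roughly 0.991 to 0.996). Since the gap between the cross-validation and test scores is small, the models appear to generalize well. 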

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

This assignment was open-ended, allowing us to select both the dataset and various linear and non-linear models. I found this task to be simultaneously challenging, intriguing, and occasionally stressful. I discovered that meticulous planning couldn't always prevent encountering unsatisfactory results during the machine learning modeling process, leading to multiple dataset changes in search of one that was both interesting and demanding. Some guidance in narrowing down our choices for dataset selection, as well as linear and non-linear model selection, would have been beneficial. Overall, I believe the experience was valuable, despite the somewhat arduous path to achieving success. Despite its challenges, it piqued my interest.