# Assignment 4: Pipelines and Hyperparameter Tuning (52 total marks)
### Due: March 19 at 11:59pm

### Name: 

The purpose of this assignment is to practice following the grid-search workflow: 
- Split data into training and test set
- Use the training portion to find the best model using grid search and cross-validation
- Retrain the best model
- Evaluate the retrained model on the test set
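
As a rough illustration of the four steps above, here is a minimal sketch on synthetic data. The dataset, the variable names (`X_demo`, `search`, and so on), and the small `C` grid are assumptions chosen for the example, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not one of the assignment datasets)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# 1. Split data into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# 2. Search hyperparameters with cross-validation on the training portion only
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5
)

# 3. With refit=True (the default), fit() also retrains the best model on all of X_tr
search.fit(X_tr, y_tr)

# 4. Evaluate the retrained model on the held-out test set
test_score = search.score(X_te, y_te)
print(search.best_params_, round(test_score, 2))
```

Because `refit=True` is the default, the "retrain" step happens inside `fit()`, and `search.score` uses the retrained estimator.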

In [57]:
import numpy as np
import pandas as pd

## Part 1: Classification (21 marks)

### 1.1: Load data (2 marks)
For this task, we will be using the yellowbrick mushroom dataset. This dataset uses physical characteristics of mushrooms to predict whether or not the mushroom is poisonous.

More information on the dataset can be found here:
https://www.scikit-yb.org/en/latest/api/datasets/mushroom.html

#### Prepare the feature matrix and target vector

Using the yellowbrick `load_mushroom()` function, load the mushroom data set into feature matrix `X` and target vector `y`

Print the shape of `X` and `y`

In [58]:
# TODO: Load the dataset
from yellowbrick.datasets.loaders import load_mushroom

X, y = load_mushroom()

# TODO: Print the shape of X and y
print(f"X is of type {type(X)} and size {X.shape}")
print(f"y is of type {type(y)} and size {y.shape}")

X is of type <class 'pandas.core.frame.DataFrame'> and size (8123, 3)
y is of type <class 'pandas.core.series.Series'> and size (8123,)


In [59]:
X.isnull().sum()

shape      0
surface    0
color      0
dtype: int64

In [60]:
X.head()

| | shape | surface | color |
|---|---|---|---|
| 0 | convex | smooth | yellow |
| 1 | bell | smooth | white |
| 2 | convex | scaly | white |
| 3 | convex | smooth | gray |
| 4 | convex | scaly | yellow |


In [61]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   shape    8123 non-null   object
 1   surface  8123 non-null   object
 2   color    8123 non-null   object
dtypes: object(3)
memory usage: 190.5+ KB


### 1.2: Pre-processing (3 marks)
In this dataset, all the features are categorical, so they need to be encoded. We will use `OneHotEncoder(sparse_output=False)` for this case

In [62]:
# TODO: Create OneHotEncoder object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
ohe
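
To make the encoding step concrete, the sketch below applies one-hot encoding to a small hypothetical frame that reuses the mushroom column names. The toy rows and the names `toy`, `toy_ohe`, and `encoded` are assumptions for illustration; the default sparse output is converted with `toarray()` so the example does not depend on the `sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same column names as the mushroom features
toy = pd.DataFrame({
    "shape": ["convex", "bell", "convex"],
    "surface": ["smooth", "smooth", "scaly"],
    "color": ["yellow", "white", "white"],
})

toy_ohe = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense for display
encoded = toy_ohe.fit_transform(toy).toarray()

# One output column per (feature, category) pair: 2 + 2 + 2 = 6 columns here
print(toy_ohe.categories_)
print(encoded.shape)
```

Each row contains exactly one 1 per original feature, so every row of `encoded` sums to the number of input columns.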

The next step is to build a pipeline to combine the encoding with the selected machine learning method. To initialize the pipeline, we will use `LogisticRegression(max_iter=1000)` as a placeholder

In [63]:
# TODO: Build the pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('preprocessing',ohe), ('classifier', LogisticRegression(max_iter=1000))])
pipe

The next step is to split the data into training and testing sets. Use `test_size=0.1, stratify=y, random_state=42`

In [64]:
# TODO: Split data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)
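
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical labels with a 70/30 class balance (the names `labels`, `features`, `l_tr`, and `l_te` are assumptions) and shows that both splits keep that proportion.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 "edible" and 30 "poisonous"
labels = pd.Series(["edible"] * 70 + ["poisonous"] * 30)
features = pd.DataFrame({"idx": range(100)})

f_tr, f_te, l_tr, l_te = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=42
)

# Both splits keep the 70/30 class balance of the full data
print(l_tr.value_counts(normalize=True))
print(l_te.value_counts(normalize=True))
```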

### 1.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LogisticRegression(max_iter=1000)`, `KNeighborsClassifier()` and `SVC()`. Build your parameter grid based on what you think are reasonable values to test

In [69]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# TODO: Build a parameter grid
param_grid = [
                {
                'classifier': [LogisticRegression(max_iter=1000)], 
                'preprocessing': [ohe],
                'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
                },
                {
                'classifier': [KNeighborsClassifier()],
                'preprocessing': [ohe],
                'classifier__n_neighbors': [5, 10, 25, 50, 100, 500],                
                },
                {
                'classifier': [SVC()],
                'preprocessing': [ohe],
                'classifier__C': [0.01, 0.1, 1, 10, 100],
                'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],  
                'classifier__kernel': ['poly', 'rbf']
                }
]
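
Before running a search, it can help to count how many candidates a grid implies. The sketch below does this with `ParameterGrid` on a hypothetical three-model grid that uses string placeholders instead of estimator objects; the names and sizes are illustrative assumptions, not the exact grid above.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: strings stand in for the estimator objects
demo_grid = [
    {"classifier": ["logreg"], "classifier__C": [0.001, 0.01, 0.1, 1, 10, 100]},
    {"classifier": ["knn"], "classifier__n_neighbors": [5, 10, 25, 50, 100, 500]},
    {
        "classifier": ["svc"],
        "classifier__C": [0.01, 0.1, 1, 10, 100],
        "classifier__gamma": [0.001, 0.01, 0.1, 1, 10, 100],
        "classifier__kernel": ["poly", "rbf"],
    },
]

# Each dict contributes the product of its list lengths: 6 + 6 + 5*6*2
n_candidates = len(ParameterGrid(demo_grid))

# With 5-fold cross-validation, every candidate is fitted five times
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

Most of the candidates come from the sub-grid that combines three parameters, which is why adding one more value to any of its lists grows the search quickly.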

In [70]:
from sklearn.model_selection import GridSearchCV

In [71]:
grid = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
grid.fit(X_train, y_train)

### 1.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set accuracy

In [72]:
# TODO: Print the results from the grid search
print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid.cv_results_['mean_train_score'][grid.best_index_]))
print("Best cross-validation validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))

Best params:
{'classifier': SVC(), 'classifier__C': 0.01, 'classifier__gamma': 10, 'classifier__kernel': 'poly', 'preprocessing': OneHotEncoder(sparse_output=False)}

Best cross-validation train score: 0.72
Best cross-validation validation score: 0.71
Test-set score: 0.69
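
Beyond the single best result, the full `cv_results_` dictionary can be loaded into a DataFrame to compare all candidates. The sketch below does this for a small hypothetical search on synthetic data; the dataset and the names `search`, `results`, and `summary` are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}, cv=5, return_train_score=True
)
search.fit(X_demo, y_demo)

# One row per candidate; sort so the best-ranked candidate comes first
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_neighbors", "mean_train_score", "mean_test_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

Comparing `mean_train_score` with `mean_test_score` row by row makes it easier to see which candidates overfit and which underfit.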


### Questions (6 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas.

*ANSWER HERE*
1. The support vector classifier (`SVC`) performed best, with the following parameters: `C = 0.01`, `gamma = 10`, and a polynomial kernel.
2. The model was not a good fit. The training, validation, and test scores (0.72, 0.71, and 0.69) are all low, which indicates underfitting. Because the three scores are close together, the amount of training data is not the limiting factor, which means the model is not sophisticated enough.
3. One option is to try a different model family, such as gradient boosting or a neural network. Another is to widen the `SVC` search: only the polynomial and radial basis function (RBF) kernels were considered, but others, such as the sigmoid kernel, are available and could improve model performance.
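
One way to put a "low" score in context is to compare it with a majority-class baseline. The sketch below is only an illustration on synthetic data with a roughly 60/40 class balance; the dataset and the names `baseline` and `majority_fraction` are assumptions, and the mushroom data would need to be substituted to draw any conclusion about the scores above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a roughly 60/40 class balance (an assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, weights=[0.6, 0.4], random_state=0
)

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)

# Its accuracy is approximately the share of the majority class
majority_fraction = np.bincount(y_demo).max() / len(y_demo)
print(round(baseline_scores.mean(), 3), round(majority_fraction, 3))
```

A tuned model that scores only slightly above this baseline is learning little from the features.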

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

The code I wrote was based on examples from the textbook *Introduction to Machine Learning with Python* and the class example notebooks. I also referred to the scikit-learn documentation for information on pipelines, `GridSearchCV`, and the classifiers. I completed the questions in order and did not run into major challenges, although this assignment was noticeably more difficult than the previous two. I did not use any generative AI. Reading the relevant textbook sections before attempting the assignment helped me complete it.

## Part 2: Regression (26 marks)

For this task, we will be using the auto-mpg dataset. The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Auto%2BMPG

### 2.1: Load data (3 marks)

#### Prepare the feature matrix and target vector

Using the code below, load the dataset and separate it into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the shape of `X` and `y`

**Note that you will need to download the file from D2L or from the UCI website and store it in the same folder as the code for this to work**

In [73]:
# Code to read in the dataset - DO NOT CHANGE
data = pd.read_csv('auto-mpg.data', 
               header=None, 
              names=["mpg",
                    "cylinders",
                    "displacement",
                    "horsepower",
                    "weight",
                    "acceleration",
                    "model_year",
                    "origin",
                    "car_name"],
               na_values='?',
               sep=r'\s+')

In [74]:
data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |


In [75]:
# TODO: Separate dataset into feature matrix and target vector
X, y = data[["cylinders",
              "displacement",
              "horsepower",
              "weight",
              "acceleration",
              "model_year",
              "origin",
              "car_name"]], data["mpg"]

# TODO: Print shape of X and y
print(f"X is of type {type(X)} and size {X.shape}")
print(f"y is of type {type(y)} and size {y.shape}")

X is of type <class 'pandas.core.frame.DataFrame'> and size (398, 8)
y is of type <class 'pandas.core.series.Series'> and size (398,)


Do we have any missing values in this case?

In [76]:
# TODO: Check if there are any missing value
X.isnull().sum()

cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64

In [77]:
# Drop null
data_ = data[data['horsepower'].notna()]
X, y = data_[["cylinders",
              "displacement",
              "horsepower",
              "weight",
              "acceleration",
              "model_year",
              "origin",
              "car_name"]], data_["mpg"]
# TODO: Print shape of X and y
print(f"X is of type {type(X)} and size {X.shape}")
print(f"y is of type {type(y)} and size {y.shape}")

X is of type <class 'pandas.core.frame.DataFrame'> and size (392, 8)
y is of type <class 'pandas.core.series.Series'> and size (392,)


In [78]:
X.isnull().sum()

cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64

### 2.2: Pre-processing (5 marks)
In this dataset, we have a mixture of categorical and numerical data. This means that we will need to use a `ColumnTransformer()`

If you try to use a ColumnTransformer on the data with all the existing features, you will get an error. This is because there are too many unique feature values in the `car_name` column to capture all possible values in the training set. For this assignment, we will remove the `car_name` column to avoid this problem

In [79]:
# TODO: Remove car_name column
# drop() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
X = X.drop(columns=['car_name'])


For this case, we will use:
- `OneHotEncoder(sparse_output=False)` for any categorical columns
- `StandardScaler()` for any numerical columns
- Minimal information imputation for any missing values

In [80]:
# TODO: Create ColumnTransformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(sparse_output=False), ['cylinders', 'model_year', 'origin']),
     ('scaling', StandardScaler(), ['displacement', 'horsepower', 'weight', 'acceleration'])
     ]
)
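
The transformer above relies on the rows with missing `horsepower` having been dropped earlier. If imputation were used instead, one possible arrangement is to nest a `SimpleImputer` in the numerical branch. The sketch below uses hypothetical toy rows and median imputation; both are assumptions, and the assignment does not specify the imputation strategy.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows using a few auto-mpg column names; one horsepower value is missing
toy = pd.DataFrame({
    "origin": [1, 2, 3, 1],
    "horsepower": [130.0, np.nan, 150.0, 140.0],
    "weight": [3504.0, 3693.0, 3436.0, 3433.0],
})

# Numerical branch: fill missing values with the column median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# sparse_threshold=0 asks ColumnTransformer to always return a dense array
ct_demo = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), ["origin"]),
        ("numeric", numeric, ["horsepower", "weight"]),
    ],
    sparse_threshold=0,
)

out = ct_demo.fit_transform(toy)
print(out.shape)
```

Keeping the imputer inside the pipeline means the fill value is learned from the training folds only during cross-validation.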

The next step is to build a pipeline to combine the ColumnTransformer with the selected machine learning method. To initialize the pipeline, we will use `LinearRegression()` as a placeholder

In [81]:
# TODO: Build the pipeline
from sklearn.linear_model import LinearRegression
pipe = Pipeline(steps=[('preprocessor', ct),
                       ('regressor', LinearRegression())])

The next step is to split the data into training and testing sets. Use `test_size=0.1, random_state=0`

In [82]:
# TODO: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

### 2.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LinearRegression()`, `KNeighborsRegressor()` and `RandomForestRegressor(random_state=0)`. Build your parameter grid based on what you think are reasonable values to test

In [83]:
# TODO: Build a parameter grid
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    { 'regressor': [LinearRegression()],
      'preprocessor': [ct],},
    { 'regressor': [KNeighborsRegressor()],
     'preprocessor': [ct],
     'regressor__n_neighbors': [2, 5, 10, 25, 50,100]},
     { 'regressor': [RandomForestRegressor(random_state=0)],
      'preprocessor': [ct],
      'regressor__n_estimators': [5, 10, 50, 100],
      'regressor__max_depth': [1,3,5,7],
      'regressor__max_features': [1,2,4,7]
     }
    ]

In [84]:
# TODO: Implement Grid Search
grid = GridSearchCV(pipe, param_grid, cv=5, return_train_score=True)
grid.fit(X_train, y_train)

### 2.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set score (R² for regressors, rather than accuracy)

In [85]:
# TODO: Print the results from the grid search
print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid.cv_results_['mean_train_score'][grid.best_index_]))
print("Best cross-validation validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))

Best params:
{'preprocessor': ColumnTransformer(transformers=[('onehot', OneHotEncoder(sparse_output=False),
                                 ['cylinders', 'model_year', 'origin']),
                                ('scaling', StandardScaler(),
                                 ['displacement', 'horsepower', 'weight',
                                  'acceleration'])]), 'regressor': LinearRegression()}

Best cross-validation train score: 0.88
Best cross-validation validation score: 0.84
Test-set score: 0.88
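
For regressors, `score` reports the coefficient of determination (R²) rather than accuracy. The sketch below checks this on synthetic data by comparing `score` with `r2_score`; the dataset and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the auto-mpg dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)

# The regressor's default score and an explicit R^2 computation agree
score_default = model.score(X_te, y_te)
score_r2 = r2_score(y_te, model.predict(X_te))
print(round(score_default, 3), round(score_r2, 3))
```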


### Questions (8 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different than the two ideas given for the previous part).
1. Comparing the two parts, which one took longer to run the grid search? Why do you think it took longer?

*ANSWER HERE*
1. Linear regression produced the best results. No hyperparameters were tuned for this model in the grid.
2. The model was an adequate fit. The training (0.88), validation (0.84), and test (0.88) scores are comparable, which indicates that the model is not overfitting. The scores are relatively high (close to 1), indicating that the model is at most slightly underfitting, with some room for improvement.
3. A regularized linear model, such as ridge or lasso regression, could be used. These have a tunable regularization strength, which would let us adjust model complexity and possibly improve performance. Polynomial regression could also be considered: higher-order interactions between features can be engineered, or a support vector regressor with a polynomial kernel can be employed. Either of these would increase model complexity, which may be necessary for further improvements in performance.
4. The first grid search took much longer (~10 minutes) than the second (< 1 second). The support vector classifiers in the first part probably account for most of that time: support vector methods are computationally expensive and do not scale well with the number of samples. The first dataset also has far more rows, (8123, 3) compared to (392, 7) for the second. The categorical features in the first dataset also span a larger set of values, so one-hot encoding makes the difference in size even more significant.
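
The regularization and polynomial-feature ideas in answer 3 could be explored with a grid like the one sketched below. It runs on synthetic data, and the pipeline step names, degrees, and `alpha` values are assumptions chosen for illustration rather than recommended settings for auto-mpg.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the auto-mpg dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Scale, expand to polynomial features, then fit a ridge regression
poly_ridge = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Degree controls how flexible the features are; alpha controls how strongly
# the coefficients are shrunk (larger alpha means a simpler model)
grid_demo = {"poly__degree": [1, 2], "ridge__alpha": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(poly_ridge, grid_demo, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 2))
```

Searching over degree and `alpha` together lets the cross-validation score decide whether extra flexibility or extra shrinkage helps.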

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*<br>
The code I wrote was based on examples from the textbook *Introduction to Machine Learning with Python* and the class example notebooks. I also referred to the scikit-learn documentation for information on pipelines, `GridSearchCV`, and the regressors. I completed the questions in order and did not run into major challenges, although this assignment was noticeably more difficult than the previous two. I did not use any generative AI. Reading the relevant textbook sections before attempting the assignment helped me complete it.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

We discussed in lectures that SVM methods are computationally more demanding than some of the other methods we have covered. This showed up in this assignment: the grid search in the first part ran significantly longer (~10 minutes) than the one in the second part (under a second). In the second part, it was interesting to see that the simplest model, linear regression, had the best performance (test-set R² score of 0.88). It even outperformed the non-linear random forest ensemble method.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE* <br>
Overall, the assignment was good. I found it more challenging and a bit more obscure than the previous assignments, mostly because setting up `GridSearchCV` with multiple models is more involved than our previous work. I appreciated the opportunity to choose the search space for the grid search, as it allowed for independent problem solving.