# Assignment 4: Pipelines and Hyperparameter Tuning (52 total marks)
### Due: March 19 at 11:59pm

### Name: 

The purpose of this assignment is to practice following the grid-search workflow: 
- Split data into training and test set
- Use the training portion to find the best model using grid search and cross-validation
- Retrain the best model
- Evaluate the retrained model on the test set
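
The four steps above can be sketched end to end as follows. This is a minimal illustration, not the assignment solution: the synthetic data from `make_classification`, the `StandardScaler` step, the `C` values, and all `demo_*` names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the assignment uses real datasets).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# 1. Split data into training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# 2. Search candidate settings with 5-fold cross-validation on the training portion only.
demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_grid = GridSearchCV(demo_pipe, {"classifier__C": [0.01, 0.1, 1, 10]}, cv=5)

# 3. With refit=True (the default), fit() retrains the best candidate on all of X_tr.
demo_grid.fit(X_tr, y_tr)

# 4. Evaluate the retrained model on the held-out test set.
demo_test_score = demo_grid.score(X_te, y_te)
print(demo_grid.best_params_, demo_test_score)
```

Because the test set is only touched in the last step, the reported test score is not influenced by the hyperparameter search.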

In [1]:
import numpy as np
import pandas as pd

## Part 1: Classification (21 marks)

### 1.1: Load data (2 marks)
For this task, we will be using the yellowbrick mushroom dataset. This dataset uses physical characteristics of mushrooms to predict whether or not the mushroom is poisonous.

More information on the dataset can be found here:
https://www.scikit-yb.org/en/latest/api/datasets/mushroom.html

#### Prepare the feature matrix and target vector

Using the yellowbrick `load_mushroom()` function, load the mushroom data set into feature matrix `X` and target vector `y`

Print the shape of `X` and `y`

In [2]:
# TODO: Load the dataset
from yellowbrick.datasets.loaders import load_mushroom
X, y = load_mushroom()

# TODO: Print the shape of X and y
print(f"X is of type {type(X)} and size {X.shape}")
print(f"y is of type {type(y)} and size {y.shape}")

X is of type <class 'pandas.core.frame.DataFrame'> and size (8123, 3)
y is of type <class 'pandas.core.series.Series'> and size (8123,)


In [3]:
X.isnull().sum()

shape      0
surface    0
color      0
dtype: int64

In [4]:
X.head()

|   | shape  | surface | color  |
|---|--------|---------|--------|
| 0 | convex | smooth  | yellow |
| 1 | bell   | smooth  | white  |
| 2 | convex | scaly   | white  |
| 3 | convex | smooth  | gray   |
| 4 | convex | scaly   | yellow |


In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   shape    8123 non-null   object
 1   surface  8123 non-null   object
 2   color    8123 non-null   object
dtypes: object(3)
memory usage: 190.5+ KB


### 1.2: Pre-processing (3 marks)
In this dataset, all the features are categorical, so they need to be encoded. We will use `OneHotEncoder(sparse_output=False)` for this case

In [6]:
# TODO: Create OneHotEncoder object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
ohe

The next step is to build a pipeline to combine the encoding with the selected machine learning method. To initialize the pipeline, we will use `LogisticRegression(max_iter=1000)` as a placeholder

In [7]:
# TODO: Build the pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([('preprocessing',ohe), ('classifier', LogisticRegression(max_iter=1000))])
pipe

The next step is to split the data into training and testing sets. Use `test_size=0.1, stratify=y, random_state=42`

In [8]:
# TODO: Split data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)

### 1.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LogisticRegression(max_iter=1000)`, `KNeighborsClassifier()` and `SVC()`. Build your parameter grid based on what you think are reasonable values to test

In [12]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# TODO: Build a parameter grid
# Each dict swaps in one classifier and lists the hyperparameters to try for it.
# The features are strings, so they must always be encoded: `None` is not a
# usable option for the 'preprocessing' step here.
param_grid = [
    {
        'classifier': [LogisticRegression(max_iter=1000)],
        'preprocessing': [ohe],
        'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    },
    {
        'classifier': [KNeighborsClassifier()],
        'preprocessing': [ohe],
        'classifier__n_neighbors': [5, 10, 50, 100, 500, 1000],
    },
    {
        'classifier': [SVC()],
        'preprocessing': [ohe],
        # The original grid listed this key twice; a dict keeps only one copy.
        'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
    },
]

In [13]:
from sklearn.model_selection import GridSearchCV

In [14]:
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7b59ae8f0790>
Traceback (most recent call last):
  File "/home/saurav/anaconda3/envs/enel682-lab/lib/python3.10/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/saurav/anaconda3/envs/enel682-lab/lib/python3.10/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/saurav/anaconda3/envs/enel682-lab/lib/python3.10/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/saurav/anaconda3/envs/enel682-lab/lib/python3.10/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'

### 1.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set accuracy

In [15]:
# TODO: Print the results from the grid search
print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))

Best params:
{'classifier': SVC(gamma=1), 'classifier__gamma': 1, 'preprocessing': OneHotEncoder(sparse_output=False)}

Best cross-validation score: 0.71
Test-set score: 0.69
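
The list above also asks for the best cross-validation train score. `GridSearchCV` only records train scores when it is created with `return_train_score=True`, after which they appear in `cv_results_`. The sketch below shows one way to read both scores for the best candidate; the synthetic data, the `KNeighborsClassifier` grid, and the `cv_*`/`best_*` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the mushroom dataset).
X_cv, y_cv = make_classification(n_samples=200, n_features=6, random_state=0)

cv_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 5, 15]},
    cv=5,
    return_train_score=True,  # needed for the mean_train_score entry
)
cv_grid.fit(X_cv, y_cv)

# best_index_ points at the row of cv_results_ for the winning candidate.
best_idx = cv_grid.best_index_
best_cv_train = cv_grid.cv_results_["mean_train_score"][best_idx]
best_cv_test = cv_grid.cv_results_["mean_test_score"][best_idx]

print("Best cross-validation train score: {:.2f}".format(best_cv_train))
print("Best cross-validation test score: {:.2f}".format(best_cv_test))
```

The cross-validation test score read this way is the same value that `best_score_` reports.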


### Questions (6 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas.

*ANSWER HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 2: Regression (26 marks)

For this task, we will be using the auto-mpg dataset. The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Auto%2BMPG

### 2.1: Load data (3 marks)

#### Prepare the feature matrix and target vector

Using the code below, load the dataset and separate it into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the shape of `X` and `y`

**Note that you will need to download the file from D2L or from the UCI website and store it in the same folder as the code for this to work**

In [None]:
# Code to read in the dataset - DO NOT CHANGE
data = pd.read_csv('auto-mpg.data', 
               header=None, 
              names=["mpg",
                    "cylinders",
                    "displacement",
                    "horsepower",
                    "weight",
                    "acceleration",
                    "model_year",
                    "origin",
                    "car_name"],
               na_values='?',
               sep=r'\s+')

In [None]:
# TODO: Separate dataset into feature matrix and target vector

# TODO: Print shape of X and y


Do we have any missing values in this case?

In [None]:
# TODO: Check if there are any missing values


### 2.2: Pre-processing (5 marks)
In this dataset, we have a mixture of categorical and numerical data. This means that we will need to use a `ColumnTransformer()`

If you try to use a ColumnTransformer on the data with all the existing features, you will get an error. This is because the `car_name` column has so many unique values that the training set cannot contain all of them, so the encoder encounters categories it has never seen when it transforms held-out data. For this assignment, we will remove the `car_name` column to avoid this problem.

In [None]:
# TODO: Remove car_name column


For this case, we will use:
- `OneHotEncoder(sparse_output=False)` for any categorical columns
- `StandardScaler()` for any numerical columns
- Minimal information imputation for any missing values

In [None]:
# TODO: Create ColumnTransformer


The next step is to build a pipeline to combine the ColumnTransformer with the selected machine learning method. To initialize the pipeline, we will use `LinearRegression()` as a placeholder

In [None]:
# TODO: Build the pipeline


The next step is to split the data into training and testing sets. Use `test_size=0.1, random_state=0`

In [None]:
# TODO: Split data into training and testing sets


### 2.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LinearRegression()`, `KNeighborsRegressor()` and `RandomForestRegressor(random_state=0)`. Build your parameter grid based on what you think are reasonable values to test
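
The same "swap the final step" pattern used in Part 1 carries over to regressors. The sketch below uses synthetic data from `make_regression`, a plain `StandardScaler` in place of a `ColumnTransformer`, and arbitrary hyperparameter values; all of these, along with the `reg_*` names, are assumptions for illustration rather than recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the auto-mpg dataset).
X_reg, y_reg = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

reg_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),  # placeholder, replaced by the grid
])

# One dict per model; each lists only the hyperparameters that model accepts.
reg_param_grid = [
    {"regressor": [LinearRegression()]},
    {"regressor": [KNeighborsRegressor()],
     "regressor__n_neighbors": [3, 5, 11]},
    {"regressor": [RandomForestRegressor(random_state=0)],
     "regressor__n_estimators": [25, 50],
     "regressor__max_depth": [3, None]},
]

reg_grid = GridSearchCV(reg_pipe, reg_param_grid, cv=5)
reg_grid.fit(X_reg, y_reg)

n_candidates = len(reg_grid.cv_results_["params"])
print(n_candidates, reg_grid.best_params_)
```

With no `scoring` argument, `GridSearchCV` uses each estimator's own `score` method, which for these regressors is the R² coefficient rather than accuracy.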

In [None]:
# TODO: Build a parameter grid


In [None]:
# TODO: Implement Grid Search


### 2.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set score

In [None]:
# TODO: Print the results from the grid search


### Questions (8 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different from the two ideas given for the previous part).
1. Comparing the two parts, which one took longer to run the grid search? Why do you think it took longer?

*ANSWER HERE*

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*