# Assignment 4: Pipelines and Hyperparameter Tuning (52 total marks)
### Due: March 19 at 11:59pm

### Name: 

The purpose of this assignment is to practice following the grid-search workflow: 
- Split data into training and test set
- Use the training portion to find the best model using grid search and cross-validation
- Retrain the best model
- Evaluate the retrained model on the test set

In [3]:
import numpy as np
import pandas as pd

## Part 1: Classification (21 marks)

### 1.1: Load data (2 marks)
For this task, we will be using the yellowbrick mushroom dataset. This dataset uses physical characteristics of mushrooms to predict whether or not the mushroom is poisonous.

More information on the dataset can be found here:
https://www.scikit-yb.org/en/latest/api/datasets/mushroom.html

#### Prepare the feature matrix and target vector

Using the yellowbrick `load_mushroom()` function, load the mushroom data set into feature matrix `X` and target vector `y`

Print the shape of `X` and `y`

In [4]:
# TODO: Load the dataset
from yellowbrick import datasets

X, y = datasets.load_mushroom()
# TODO: Print the shape of X and y
print(X.shape)
print(y.shape)
print(X.head())

print(X.isna().sum())


(8123, 3)
(8123,)
    shape surface   color
0  convex  smooth  yellow
1    bell  smooth   white
2  convex   scaly   white
3  convex  smooth    gray
4  convex   scaly  yellow
shape      0
surface    0
color      0
dtype: int64


### 1.2: Pre-processing (3 marks)
In this dataset, all the features are categorical, so they need to be encoded. We will use `OneHotEncoder(sparse_output=False)` for this case

In [5]:
# TODO: Create OneHotEncoder object
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)

enc.fit(X)
print(X.head())

enc.get_feature_names_out()

X = pd.DataFrame(enc.transform(X), columns=enc.get_feature_names_out())

print(X.head())

    shape surface   color
0  convex  smooth  yellow
1    bell  smooth   white
2  convex   scaly   white
3  convex  smooth    gray
4  convex   scaly  yellow
   shape_bell  shape_conical  shape_convex  shape_flat  shape_knobbed  \
0         0.0            0.0           1.0         0.0            0.0   
1         1.0            0.0           0.0         0.0            0.0   
2         0.0            0.0           1.0         0.0            0.0   
3         0.0            0.0           1.0         0.0            0.0   
4         0.0            0.0           1.0         0.0            0.0   

   shape_sunken  surface_fibrous  surface_grooves  surface_scaly  \
0           0.0              0.0              0.0            0.0   
1           0.0              0.0              0.0            0.0   
2           0.0              0.0              0.0            1.0   
3           0.0              0.0              0.0            0.0   
4           0.0              0.0              0.0            1.0 

The next step is to build a pipeline to combine the encoding with the selected machine learning method. To initialize the pipeline, we will use `LogisticRegression(max_iter=1000)` as a placeholder

In [6]:
# TODO: Build the pipeline
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
pipe = Pipeline([("lr", LogisticRegression(max_iter=1000))])

The next step is to split the data into training and testing sets. Use `test_size=0.1, stratify=y, random_state=42`

In [7]:
# TODO: Split data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

print(X_train.head())
print(X_train.isna().sum())

      shape_bell  shape_conical  shape_convex  shape_flat  shape_knobbed  \
6853         0.0            0.0           0.0         1.0            0.0   
3025         0.0            0.0           1.0         0.0            0.0   
6707         0.0            0.0           0.0         1.0            0.0   
4267         0.0            0.0           1.0         0.0            0.0   
4141         0.0            0.0           0.0         1.0            0.0   

      shape_sunken  surface_fibrous  surface_grooves  surface_scaly  \
6853           0.0              0.0              0.0            1.0   
3025           0.0              1.0              0.0            0.0   
6707           0.0              0.0              0.0            0.0   
4267           0.0              0.0              0.0            1.0   
4141           0.0              0.0              0.0            1.0   

      surface_smooth  color_brown  color_buff  color_cinnamon  color_gray  \
6853             0.0          1.0      

### 1.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LogisticRegression(max_iter=1000)`, `KNeighborsClassifier()` and `SVC()`. Build your parameter grid based on what you think are reasonable values to test

In [8]:
# TODO: Build a parameter grid
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [9]:
# TODO: Implement grid search
param_grid_svc = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
}

# Parameter grid for KNeighborsClassifier
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
}

# Parameter grid for LogisticRegression
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100]
}

grid_search_svc = GridSearchCV(SVC(), param_grid=param_grid_svc, cv=5, return_train_score=True)
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid_knn, cv=5, return_train_score=True)
grid_search_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=param_grid_lr, cv=5, return_train_score=True)

grid_search_svc.fit(X_train, y_train)
grid_search_knn.fit(X_train, y_train)
grid_search_lr.fit(X_train, y_train)

Traceback (most recent call last):
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 813, in _score
    scores = scorer(estimator, X_test, y_test)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\metrics\_scorer.py", line 527, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\base.py", line 705, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
                             ^^^^^^^^^^^^^^^
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\neighbors\_classification.py", line 246, in predict
    if self._fit_method == "brute" and ArgKminClassMode.is_usable_for(
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\metrics\_pairwise_distances_reduction\_dispatcher

### 1.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set accuracy

In [10]:
# TODO: Print the results from the grid search

# SVC 
print("--------------------SVC------------------------")
print("Best parameters:", grid_search_svc.best_params_)

print("Best cross-validation score:", grid_search_svc.best_score_)

best_estimator = grid_search_svc.best_estimator_
cv_results = grid_search_svc.cv_results_
best_index = grid_search_svc.best_index_
mean_train_score = cv_results['mean_train_score'][best_index] if 'mean_train_score' in cv_results else "Not available"
print("Mean cross-validation train score for the best estimator:", mean_train_score)

from sklearn.metrics import accuracy_score
y_pred = best_estimator.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy:", test_accuracy)

print("-----------------------------------------------\n")


# LR 
print("--------------------LogisticRegression---------")
print("Best parameters:", grid_search_lr.best_params_)

print("Best cross-validation score:", grid_search_lr.best_score_)

best_estimator = grid_search_lr.best_estimator_
cv_results = grid_search_lr.cv_results_
best_index = grid_search_lr.best_index_
mean_train_score = cv_results['mean_train_score'][best_index] if 'mean_train_score' in cv_results else "Not available"
print("Mean cross-validation train score for the best estimator:", mean_train_score)
y_pred = best_estimator.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy:", test_accuracy)

print("-----------------------------------------------\n")



# KNN 
print("--------------------KNN------------------------")
print("Best parameters:", grid_search_knn.best_params_)

print("Best cross-validation score:", grid_search_knn.best_score_)

best_estimator = grid_search_knn.best_estimator_
cv_results = grid_search_lr.cv_results_
best_index = grid_search_lr.best_index_
mean_train_score = cv_results['mean_train_score'][best_index] if 'mean_train_score' in cv_results else "Not available"
print("Mean cross-validation train score for the best estimator:", mean_train_score)

y_pred = best_estimator.predict(X_test.to_numpy())
test_accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy:", test_accuracy)

--------------------SVC------------------------
Best parameters: {'C': 10, 'gamma': 1}
Best cross-validation score: 0.7114911080711355
Mean cross-validation train score for the best estimator: 0.7160396716826265
Test set accuracy: 0.7121771217712177
-----------------------------------------------

--------------------LogisticRegression---------
Best parameters: {'C': 100}
Best cross-validation score: 0.6649794801641586
Mean cross-validation train score for the best estimator: 0.6665868673050616
Test set accuracy: 0.6765067650676507
-----------------------------------------------

--------------------KNN------------------------
Best parameters: {'n_neighbors': 3}
Best cross-validation score: nan
Mean cross-validation train score for the best estimator: 0.6665868673050616
Test set accuracy: 0.6654366543665436




### Questions (6 marks)

1. Which model and what parameters produced the best results?
2. Was this model a good fit? Why or why not?
3. Is there anything else we could do to try to improve model performance? Provide two ideas.

1. Which model and what parameters produced the best results?

The Support Vector Classifier (SVC) with C=10 and gamma=1 achieved the highest performance across the three evaluated models. It yielded a remarkable cross-validation score of 0.7115 and an impressive test set accuracy of 0.7122, showcasing its superior predictive power.

2. Was this model a good fit? Why or why not?

The closely matched best cross-validation score and test set accuracy (0.7115 and 0.7122) suggest that the model generalizes effectively to unseen data. Additionally, the mean cross-validation train score (0.7160) being very close to the cross-validation score indicates a lack of significant overfitting or underfitting issues

3. Is there anything else we could do to try to improve model performance? Provide two ideas.

One approach could be to broaden the range of hyperparameters. For instance, we could extend the hyperparameters to C = [0.01, 0.1, 1, 10, 100, 1000] and gamma = [0.0001, 0.001, 0.01, 0.1, 1, 10].

Alternatively we could take another approach that involves narrowing down the search based on the initial findings. For example, if we found that C=10 and gamma=1 were among the best parameters, we can refine our search around these values. We could use a grid such as {'C': [5, 10, 15], 'gamma': [0.5, 1, 1.5]} to fine-tune the model around the previously identified optimal values.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1- I used the lecture materials and labs and assignment 3.

2- I hollistically tried to solve most problems on my own, if that did not work I referred to the slides and lab documents for stub code to aid my process. Using that and some gen AI to give me syntax, I managed to solve the problems. 

3- My use of gen AI was mostly limited to syntax problems (ie asking GPT what is the code to import certain libarries or how to print a certain statement)

4- Initially, I struggled with constructing a pipeline and generating ideas to enhance the performance of the SVC models. However, after reviewing the lecture materials, I gained clarity on both aspects and was able to overcome these challenges.

# Part 2: Regression (26 marks)

For this task, we will be using the auto-mpg dataset. The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Auto%2BMPG

### 2.1: Load data (3 marks)

#### Prepare the feature matrix and target vector

Using the code below, load the dataset and separate it into feature matrix `X` and target vector `y`. Which column represents the target vector?

Print the shape of `X` and `y`

**Note that you will need to download the file from D2L or from the UCI website and store it in the same folder as the code for this to work**

In [11]:
# Code to read in the dataset - DO NOT CHANGE
data = pd.read_csv('auto-mpg.data', 
               header=None, 
              names=["mpg",
                    "cylinders",
                    "displacement",
                    "horsepower",
                    "weight",
                    "acceleration",
                    "model_year",
                    "origin",
                    "car_name"],
               na_values='?',
               sep=r'\s+')

In [12]:
# TODO: Separate dataset into feature matrix and target vector

# TODO: Print shape of X and y
X = data.drop(columns=['mpg'])
y = data['mpg']

print("Feature matrix X shape:", X.shape)
print("Feature matrix X shape:", X.head(10))

print("Target vector y shape:", y.shape)

Feature matrix X shape: (398, 8)
Feature matrix X shape:    cylinders  displacement  horsepower  weight  acceleration  model_year  \
0          8         307.0       130.0  3504.0          12.0          70   
1          8         350.0       165.0  3693.0          11.5          70   
2          8         318.0       150.0  3436.0          11.0          70   
3          8         304.0       150.0  3433.0          12.0          70   
4          8         302.0       140.0  3449.0          10.5          70   
5          8         429.0       198.0  4341.0          10.0          70   
6          8         454.0       220.0  4354.0           9.0          70   
7          8         440.0       215.0  4312.0           8.5          70   
8          8         455.0       225.0  4425.0          10.0          70   
9          8         390.0       190.0  3850.0           8.5          70   

   origin                   car_name  
0       1  chevrolet chevelle malibu  
1       1          buick sky

Do we have any missing values in this case?

In [13]:
# TODO: Check if there are any missing values
print(X_train.isna().sum())

shape_bell         0
shape_conical      0
shape_convex       0
shape_flat         0
shape_knobbed      0
shape_sunken       0
surface_fibrous    0
surface_grooves    0
surface_scaly      0
surface_smooth     0
color_brown        0
color_buff         0
color_cinnamon     0
color_gray         0
color_green        0
color_pink         0
color_purple       0
color_red          0
color_white        0
color_yellow       0
dtype: int64


### 2.2: Pre-processing (5 marks)
In this dataset, we have a mixture of categorical and numerical data. This means that we will need to use a `ColumnTransformer()`

If you try to use a ColumnTransformer on the data with all the existing features, you will get an error. This is because there are too many unique feature values in the `car_name` column to capture all possible values in the training set. For this assignment, we will remove the `car_name` column to avoid this problem

In [14]:
# TODO: Remove car_name column
X = data.drop(columns=['car_name'])

For this case, we will use:
- `OneHotEncoder(sparse_output=False)` for any categorical columns
- `StandardScaler()` for any numerical columns
- Minimal information imputation for any missing values

In [15]:
# TODO: Create ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# the preprocessing pipelines for numerical and categorical columns
numerical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scaling', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(sparse_output=False))
])

# combine pipelines with ColumnTransformer
preprocessor = ColumnTransformer(
    [("num", numerical_pipeline, ['displacement', 'horsepower', 'weight', 'acceleration']),
     ("cat", categorical_pipeline, ['cylinders', 'model_year', 'origin'])]
)

The next step is to build a pipeline to combine the ColumnTransformer with the selected machine learning method. To initialize the pipeline, we will use `LinearRegression()` as a placeholder

In [16]:
# TODO: Build the pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

The next step is to split the data into training and testing sets. Use `test_size=0.1, random_state=0`

In [17]:
# TODO: Split data into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

### 2.3: Grid Search (4 marks)

For the grid search, we would like to test three different models: `LinearRegression()`, `KNeighborsRegressor()` and `RandomForestRegressor(random_state=0)`. Build your parameter grid based on what you think are reasonable values to test

In [19]:
# TODO: Build a parameter grid
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

model_params = [
    {
        'preprocessor': [preprocessor], 
        'model': [LogisticRegression(max_iter=1000)],
        'model__C': [0.01, 0.1, 1, 10, 100]
    },
    {
        'preprocessor': [preprocessor],
        'model': [KNeighborsRegressor()],
        'model__n_neighbors': [3, 5, 7, 10],
        'model__weights': ['uniform', 'distance']
    },
    {
        'preprocessor': [preprocessor],
        'model': [RandomForestRegressor(random_state=0)],
        'model__n_estimators': [100, 200, 300],
        'model__max_depth': [None, 10, 20, 30],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4]
    }]

In [20]:
# TODO: Implement Grid Search
# Initialize GridSearchCV

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(max_iter=1000)) 
])

grid_search = GridSearchCV(pipeline, param_grid=model_params, cv=5, scoring='neg_mean_squared_error', verbose=1)


grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 121 candidates, totalling 605 fits


25 fits failed out of a total of 605.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\taimo\anaconda3\Lib\site-packages\sklearn\base.py", line

### 2.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set accuracy

In [21]:
# TODO: Print the results from the grid search
from sklearn.metrics import mean_squared_error, r2_score

# Best parameters
print("Best parameters:", grid_search.best_params_)

# Best cross-validation score
print("Best cross-validation score:", -grid_search.best_score_)

y_pred = grid_search.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)

print("Test set MSE:", test_mse)
print("Test set R^2:", test_r2)

Best parameters: {'model': KNeighborsRegressor(weights='distance'), 'model__n_neighbors': 5, 'model__weights': 'distance', 'preprocessor': ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('impute',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaling',
                                                  StandardScaler())]),
                                 ['displacement', 'horsepower', 'weight',
                                  'acceleration']),
                                ('cat',
                                 Pipeline(steps=[('impute',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('onehot',
                                                  OneHotEncoder(sparse_output=False))]),
                                 ['cylinders', 'model_year', 'origin'])])}
Be

### Questions (8 marks)

1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different than the two ideas given for the previous part).
1. Comparing the two parts, which one took longer to run the grid search? Why do you think it took longer?

1. Which model and what parameters produced the best results?

The KNeighborsRegressor model yielded the optimal results, utilizing the parameters n_neighbors=5 and weights='distance'. A ColumnTransformer was employed as the preprocessor, applying median imputation and scaling to numerical features (displacement, horsepower, weight, acceleration), and employing most frequent imputation followed by one-hot encoding for categorical features (cylinders, model_year, origin).
2. Was this model a good fit? Why or why not?

The model demonstrates a good fit, with a low Mean Squared Error (MSE) on the test set at 14.9903. Furthermore, the R^2 score stands at 0.7831, signifying that approximately 78.31% of the variance in the target variable is accounted for by the model. This relatively high R^2 score indicates a strong fit of the model to the data.

3. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different than the two ideas given for the previous part).
1- Create More Features and Capture Interactions: Introducing new features that represent interactions between existing ones can unveil valuable insights that might be overlooked by straightforward or distance-based models like KNN.

2- Reduce Dimensionality: Dimensionality reduction can enhance performance by eliminating noise and addressing the 'curse of dimensionality,' which can have a significant impact on models like KNN. This process helps streamline the dataset, making it more manageable and potentially improving model accuracy.


4. Comparing the two parts, which one took longer to run the grid search? Why do you think it took longer?

Part 1 took longer, since a pipeline was not employed.

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1- I used the lecture materials and labs and assignment 3.

2- I hollistically tried to solve most problems on my own, if that did not work I referred to the slides and lab documents for stub code to aid my process. Using that and some gen AI to give me syntax, I managed to solve the problems. 

3- My use of gen AI was mostly limited to syntax problems (ie asking GPT what is the code to import certain libarries or how to print a certain statement)

4- Initially, I struggled with constructing a pipeline and generating ideas to help improve the performance of the KNN models. However, after reviewing the lecture materials, I gained clarity on both aspects and was able to overcome these challenges.

## Part 3: Observations/Interpretation (3 marks)

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.

SVC
The optimal performance of the SVC with C=10 and gamma=1 indicates that a non-linear decision boundary is needed for this specific problem. These hyperparameters provide the flexibility required to capture the complexities of the data.

R^2 score
The R^2 score of 0.7831 for KNN suggests that it can capture a substantial amount of the data's variance. However, this score also hints at potential limitations stemming from dimensionality or noise in the dataset.

## Part 4: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challangeing, motivating
while working on this assignment.



While working on this assignment, I gained valuable insights into using pipelines and parameter grid search. Exploring ways to enhance the performance of KNN and SVC models was a significant learning experience. This assignment not only deepened my understanding of these concepts but also provided a great opportunity for self-learning and exploration.