## Grid Search Hyperparameter Optimization

This case study is all about using grid searches to identify the optimal parameters for a machine learning algorithm. To complete this case study, you'll use the Pima Indians Diabetes dataset from Kaggle and a k-nearest neighbors (KNN) classifier. Follow along with the preprocessing steps of this case study.

#### Load the necessary packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Set a random seed to make this exercise and its solutions reproducible (NB: this is for teaching purposes only and not something you would do in real life)
random_seed_number = 42
np.random.seed(random_seed_number)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#### Load the diabetes data

In [None]:
diabetes_data = pd.read_csv('data/diabetes.csv')
diabetes_data.head()

**<font color='teal'> Start by reviewing the data info.</font>**

In [None]:
#Display info about df
diabetes_data.info()

**<font color='teal'> Apply the describe function to the data.</font>**

In [None]:
# Display descriptive stats about df
diabetes_data.describe()

**<font color='teal'> Currently, the missing values in the dataset are represented as zeros. Replace the zero values in the following columns ['Glucose','BloodPressure','SkinThickness','Insulin','BMI'] with NaN.</font>**

In [None]:
# List of columns to replace zero values with NaN
columns_to_replace = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace zero values with NaN in the specified columns
diabetes_data[columns_to_replace] = diabetes_data[columns_to_replace].replace(0, np.nan)
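
A quick way to check the replacement is to count the NaNs per column afterwards. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset) so that it runs on its own; with the real data you would call `diabetes_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for diabetes_data
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 0, 8, 1],  # a zero is a legitimate value in this column
})

# Replace zeros with NaN only in the columns where zero means "missing"
cols = ["Glucose", "Insulin"]
toy[cols] = toy[cols].replace(0, np.nan)

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Columns left out of the list, such as `Pregnancies` here, keep their zeros.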

**<font color='teal'> Plot histograms of each column. </font>**

In [None]:
# Plotting histograms for each column in the df
diabetes_data.hist(figsize=(8, 6), bins=20)
plt.tight_layout()
plt.show()

#### Replace the NaNs with mean and median values.

In [None]:
# Replacing NaNs with mean and median values as specified
diabetes_data.fillna({
    'Glucose': diabetes_data['Glucose'].mean(),
    'BloodPressure': diabetes_data['BloodPressure'].mean(),
    'SkinThickness': diabetes_data['SkinThickness'].median(),
    'Insulin': diabetes_data['Insulin'].median(),
    'BMI': diabetes_data['BMI'].median()
}, inplace=True)

**<font color='teal'> Plot histograms of each column after replacing nan. </font>**

In [None]:
# Plotting histograms for each column in the df
diabetes_data.hist(figsize=(8, 6), bins=20)
plt.tight_layout()
plt.show()

#### Plot the correlation matrix heatmap

In [None]:
# Plot correlation matrix heatmap for df
plt.figure(figsize=(8,6))
print('Correlation between various features')
p=sns.heatmap(diabetes_data.corr(), annot=True, cmap ='Blues')

**<font color='teal'> Define the `y` variable as the `Outcome` column.</font>**

In [None]:
# Define x and y variables for model
X = diabetes_data.drop(columns=['Outcome'])
y = diabetes_data['Outcome']

**<font color='teal'> Create a 70/30 train and test split. </font>**

In [None]:
# Create a 70/30 train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

**<font color='teal'> Using Sklearn, standardize the magnitude of the features by scaling the values. </font>**

Note: Don't forget to fit() your scaler on X_train and then use that fitted scaler to transform() X_test. This is to avoid data leakage while you standardize your data.

In [None]:
# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on X_train and transform X_train
X_train_scaled = scaler.fit_transform(X_train)

# Use the fitted scaler to transform X_test
X_test_scaled = scaler.transform(X_test)
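
To see what "fit on train, transform test" means in practice, the sketch below uses randomly generated data (the `*_demo` names and the distribution parameters are assumptions made for illustration). The scaler learns its mean and standard deviation from the training rows only, so the scaled training set is centered at zero while the scaled test set is only approximately centered.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features drawn from the same distribution
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 2))

# Learn the scaling statistics from the training rows only
scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)

# Reuse those same statistics on the test rows
test_scaled = scaler_demo.transform(X_test_demo)

print("Train means after scaling:", train_scaled.mean(axis=0))
print("Test means after scaling: ", test_scaled.mean(axis=0))
```

Calling `fit_transform` on the test set instead would let test-set statistics influence the preprocessing, which is the leakage the note above warns about.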

#### Using neighbor values from 1 to 9, apply the KNeighborsClassifier to classify the data.

In [None]:
test_scores = []
train_scores = []

for i in range(1,10):

    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_scaled, y_train)
    
    train_scores.append(knn.score(X_train_scaled,y_train))
    test_scores.append(knn.score(X_test_scaled, y_test))

**<font color='teal'> Print the train and test scores for each iteration.</font>**

In [None]:
# Print the train and test scores 
for i in range(1,10):
    print(f"Neighbors: {i}, Train Score: {train_scores[i-1]}, Test Score: {test_scores[i-1]}")

**<font color='teal'> Identify the number of neighbors that resulted in the max score in the training dataset. </font>**

In [None]:
# Identify the number of neighbors that resulted in the max score in the training dataset
max_train_score = max(train_scores)
optimal_train_neighbors = train_scores.index(max_train_score) + 1
print(f"Max Train Score: {max_train_score} with {optimal_train_neighbors} neighbors")

**<font color='teal'> Identify the number of neighbors that resulted in the max score in the testing dataset. </font>**

In [None]:
# Identify the number of neighbors that resulted in the max score in the testing dataset
max_test_score = max(test_scores)
optimal_test_neighbors = test_scores.index(max_test_score) + 1
print(f"Max Test Score: {max_test_score} with {optimal_test_neighbors} neighbors")

Plot the train and test model performance by number of neighbors.

In [None]:
plt.figure(figsize=(8,5))
p = sns.lineplot(x=range(1,10),y=train_scores,marker='*',label='Train Score')
p = sns.lineplot(x=range(1,10),y=test_scores,marker='o',label='Test Score')

**<font color='teal'> Fit and score the best number of neighbors based on the plot. </font>**

In [None]:
# Fit the model with the number of neighbors chosen from the plot above
# (8 here; change this value if your plot suggests a different one)
best_k = 8
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train_scaled, y_train)

# Score the model
train_score = knn.score(X_train_scaled, y_train)
test_score = knn.score(X_test_scaled, y_test)

print(f"Chosen number of neighbors: {best_k}")
print(f"Train Score: {train_score}")
print(f"Test Score: {test_score}")

In [None]:
y_pred = knn.predict(X_test_scaled)
pl = confusion_matrix(y_test,y_pred)

**<font color='teal'> Plot the confusion matrix for the model fit above. </font>**

In [None]:
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(pl, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted Negative', 'Predicted Positive'], yticklabels=['Actual Negative', 'Actual Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

**<font color='teal'> Print the classification report </font>**

In [None]:
# Print the classification report
print("Classification Report:")
report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose()
display(report_df)

#### In the K nearest neighbors algorithm, K is one of the most important parameters affecting model performance. The model performance isn't horrible, but what if we didn't consider a wide enough range of neighbor values for the KNN? An alternative to fitting a loop of models is to use a grid search to identify the best value. Grid search is commonly used to tune the adjustable parameters of many types of machine learning algorithm. First, you define the grid (the range of values to test for the parameter being optimized), and then you compare model performance across the different values in the grid.

#### Run the code in the next cell to see how to implement the grid search method for identifying the best value of the n_neighbors parameter. Notice that param_grid holds the range of values to test and that five-fold cross-validation is used to score each candidate value of n_neighbors.

In [None]:
param_grid = {'n_neighbors':np.arange(1,50)}
knn = KNeighborsClassifier()
knn_cv= GridSearchCV(knn,param_grid,cv=5)
knn_cv.fit(X,y)
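
Conceptually, the grid search above amounts to cross-validating one model per grid value and keeping the value with the best mean score. The sketch below spells that out by hand with `cross_val_score`; the synthetic data from `make_classification` and the short grid are assumptions chosen so the example runs quickly on its own, and it is not meant to reproduce `GridSearchCV` internals exactly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The grid: candidate values for n_neighbors
grid = [1, 3, 5, 7, 9]

# Score each candidate with five-fold cross-validation
mean_scores = {}
for k in grid:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5
    )
    mean_scores[k] = fold_scores.mean()

# Keep the candidate with the highest mean accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k: {best_k}, mean CV accuracy: {mean_scores[best_k]:.3f}")
```

`GridSearchCV` wraps this loop, stores every score in `cv_results_`, and by default refits the best model on all of the data passed to `fit`.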

#### Print the best score and best parameter for n_neighbors.

In [None]:
print("Best Score: " + str(knn_cv.best_score_))
print("Best Parameters: " + str(knn_cv.best_params_))

The ideal value of n_neighbors for this model is the one reported in the best parameters printed above. The original case study text gives 14, but the run here appears to give 31, so rely on the value printed for your own run.
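
When two candidate values look plausible, it can help to check how close their cross-validated scores are. A fitted `GridSearchCV` object keeps every candidate's results in its `cv_results_` attribute. The sketch below shows one way to rank them, using synthetic data and a smaller grid (both assumptions for illustration); on the notebook's object you would pass `knn_cv.cv_results_` instead.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": np.arange(1, 21)}, cv=5
)
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.head())
```

If the top few rows differ by less than their `std_test_score`, the choice between them is not strongly supported by the data.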

In [None]:
# GridSearch using StandardScaler and knn

# Define the pipeline with scaling and KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

In [None]:
# Define the parameter grid with the correct key
param_grid = {'knn__n_neighbors': np.arange(1, 50)}

# Initialize GridSearchCV
knn_cv_standard = GridSearchCV(pipeline, param_grid, cv=5)

# Fit the model
knn_cv_standard.fit(X, y)

In [None]:
# Print the best score and best parameters
print("Best Score: " + str(knn_cv_standard.best_score_))
print("Best Parameters: " + str(knn_cv_standard.best_params_))
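
Because the scaler sits inside the pipeline, it is refit on the training portion of each cross-validation fold, so the validation fold never influences the scaling. The search above is fit on all of `X` and `y`, which leaves no untouched data for a final check. One possible variation, sketched here on synthetic data (the data and the names `X_tr`, `X_te`, `pipe`, and `search` are assumptions for illustration), is to run the search on a training split and score the refit best model on a held-out test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Scaling and KNN in one estimator; the grid key is "<step name>__<parameter>"
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": np.arange(1, 21)}, cv=5)

# Tune on the training split only
search.fit(X_tr, y_tr)

# Score the refit best pipeline on data the search never saw
test_accuracy = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

The cross-validated `best_score_` and the held-out accuracy will usually differ a little; the held-out number is the less optimistic estimate.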

**<font color='teal'> Now, following the KNN example, apply this grid search method to find the optimal number of estimators in a Random Forest model.
</font>**

In [None]:
# Define the pipeline with scaling and Random Forest
pipeline = Pipeline([
    ('scaler', StandardScaler()), 
    ('rf', RandomForestClassifier())
])


In [None]:
# Define the parameter grid for the number of estimators
param_grid = {'rf__n_estimators': np.arange(10, 201, 10)}

# Initialize GridSearchCV
rf_cv = GridSearchCV(pipeline, param_grid, cv=5)

# Fit the model
rf_cv.fit(X, y)

In [None]:
# Print the best score and best parameters
print("Best Score: " + str(rf_cv.best_score_))
print("Best Parameters: " + str(rf_cv.best_params_))
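
The same approach extends to more than one parameter: every key in the grid adds a dimension, and each combination is cross-validated. The sketch below tunes `n_estimators` together with `max_depth` on synthetic data. The data, the deliberately tiny grid, and the fixed `random_state` are assumptions chosen to keep the example fast and repeatable, not recommendations for the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Two parameters with two values each gives 2 x 2 = 4 candidate combinations
param_grid_demo = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid_demo, cv=5
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print("Best parameters:", search.best_params_)
```

The number of fits grows multiplicatively with each added parameter (here 4 candidates times 5 folds), so wide grids over many parameters can become slow.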