# Exercise - Hyperparameter Tuning with Grid Search

In this exercise you will train a base model and then try to find better combinations of hyperparameter values using the grid search technique.

In [20]:
# DO NOT MODIFY - imports
import pandas as pd
import numpy as np

## 1. Setup

Execute the cells below to read prepared data on the [Invesco QQQ Trust, Series 1 (NASDAQ: QQQ)](https://finance.yahoo.com/quote/QQQ/) ETF from 1999 to 2017. We have already engineered some technical indicators as features and cleaned the data. The DataFrame also includes the raw level of the VIX (Volatility Index).

In [None]:
# DO NOT MODIFY - load data and display basic statistics
df = pd.read_csv("Data.csv")
df.describe()

We'd like to predict the direction of 5-day future returns, stored in the `fut_ret_5d_is_pos` column. Run the cell below to split the data and prepare for model training.

In [22]:
# DO NOT MODIFY - Define features and target and split data
from sklearn.model_selection import train_test_split

X = df.drop(columns=["fut_ret_5d_is_pos", "Date"])
y = df["fut_ret_5d_is_pos"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
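
With `shuffle=False`, `train_test_split` takes the first 80% of rows as the training set and the last 20% as the test set. Assuming the rows in `Data.csv` are sorted by date, this keeps the test period strictly after the training period. A minimal sketch on a made-up ten-row frame (the `toy` DataFrame and its column `t` are hypothetical and not part of the exercise data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for time-ordered rows
toy = pd.DataFrame({"t": range(10)})

# Same split settings as the exercise cell above
toy_train, toy_test = train_test_split(toy, test_size=0.2, shuffle=False)

train_idx = toy_train.index.tolist()
test_idx = toy_test.index.tolist()
print(train_idx)  # the first 8 rows, in their original order
print(test_idx)   # the last 2 rows
```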

## 2. Training a base model

Instantiate a `RandomForestClassifier` and train it using its default hyperparameter values. As this is a tree-based model, you do not need to scale the features.

In [None]:
# DO NOT MODIFY - imports
from sklearn.ensemble import RandomForestClassifier

# FILL IN - Instantiate a RandomForestClassifier and fit it to the training data
# Use random_state=52 for reproducibility
# Set n_jobs=-1 to enable parallel processing using all available CPU cores
clf = ...

We will use precision (the share of predicted positives that are actually positive) as our performance metric, because we would like to avoid false positives as much as possible.  
Below, we have provided a function that plots 5-fold cross-validated precision scores. Study it, then call it on the training set to plot the learning curves. You should observe that the model overfits the training set.
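
To make the metric concrete, here is a small sketch that computes precision by hand as TP / (TP + FP) and compares it with scikit-learn's `precision_score`. The label and prediction arrays are made up for illustration and have nothing to do with the QQQ data:

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted up, actually up
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted up, actually not
manual_precision = tp / (tp + fp)

sk_precision = precision_score(y_true, y_pred)
print(manual_precision, sk_precision)
```

Note that false negatives (index 3 above) do not affect precision at all, which is why it suits a setting where acting on a wrong "up" call is the costly mistake.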

In [None]:
# DO NOT MODIFY - imports
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# DO NOT MODIFY - plotter function
def plot_learning_curve(model, X, y, cv=5, n_jobs=-1):
    train_sizes, train_scores, test_scores = learning_curve(
        model,
        X,
        y,
        cv=cv,
        n_jobs=n_jobs,
        scoring="precision",
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.plot(train_sizes, train_scores_mean, label="CV training score")
    plt.plot(train_sizes, test_scores_mean, label="CV test score")
    plt.title("Learning curve for Random Forest Classifier")
    plt.xlabel("Training examples")
    plt.ylabel("Precision")
    plt.legend()
    plt.show()


# FILL IN - Plot the learning curve for the RandomForestClassifier using the training data


What was the average cross-validated precision score on the training set?

In [None]:
# DO NOT MODIFY - imports
from sklearn.model_selection import cross_val_score

# FILL IN - Get the 5-fold cross-validated precision of the classifier on the training data
# Use n_jobs=-1 for parallel processing


How does this score compare to the precision on the test set? **HINT:** Use the fitted classifier's `predict()` method to get an array of predictions for the test set.

In [None]:
# DO NOT MODIFY - imports
from sklearn.metrics import precision_score

# FILL IN - Get the precision of the classifier on the test data


## 3. Grid search

Recall that you can use the `get_params()` method of the classifier to see a list of its hyperparameters and other settings.

In [None]:
clf.get_params()

Below, we have picked 3 different values for each of 4 major hyperparameters of `RandomForestClassifier`. Using scikit-learn's `GridSearchCV` class, perform a 5-fold cross-validated grid search using the provided search grid.

In [None]:
# DO NOT MODIFY - imports
from sklearn.model_selection import GridSearchCV

# DO NOT MODIFY - the `hyperspace` of hyperparameters to search
search_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 7, 15],
    'min_samples_leaf': [1, 2, 4]
}

# FILL IN - Instantiate a GridSearchCV object with the fitted RandomForestClassifier, the search grid, 5-fold cross-validation, and precision scoring.
# Fit it to the training data
# Don't forget to set n_jobs=-1 for parallel processing. This may take a minute or two even with parallel processing.
search = ...
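
To see why the search can take a minute or two, it helps to count the model fits involved. The sketch below uses scikit-learn's `ParameterGrid` to enumerate the same grid; the variable names `n_candidates`, `n_folds`, and `n_fits` are ours, chosen for illustration:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the exercise cell
search_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
    "min_samples_split": [2, 7, 15],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates = len(ParameterGrid(search_grid))  # 3 * 3 * 3 * 3 combinations
n_folds = 5
n_fits = n_candidates * n_folds  # excludes the final refit on the full training set
print(n_candidates, n_fits)  # 81 405
```

Adding a fourth value to each hyperparameter would raise the count to 256 combinations, so grid size grows quickly with every value you add.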


Store the best parameters, best score, and best estimator (model). (These are attributes of `search`.) Feel free to print out the best CV precision score. How does it compare to the base model? Which combination of values yielded this result?

In [29]:
# FILL IN - Get the best parameters, best score, and best estimator from the GridSearchCV object
best_params = ...
best_score = ...
best_estimator = ...

In [None]:
best_score

In [None]:
best_params

Run the cell below to see the top 5 results in detail.

In [None]:
search_results = pd.DataFrame(search.cv_results_)
search_results.sort_values('rank_test_score').head()
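
The `cv_results_` table has many columns, so it can help to keep only a few. The sketch below shows one way to read it, using synthetic data, a `DecisionTreeClassifier`, and a tiny grid so that it runs in seconds. All of these (`X_demo`, `y_demo`, `demo_grid`, `demo_search`) are hypothetical stand-ins and not the exercise's model or data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a tiny grid, both hypothetical, for a quick demonstration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    demo_grid,
    cv=5,
    scoring="precision",
)
demo_search.fit(X_demo, y_demo)

# One row per hyperparameter combination; rank 1 is the best mean CV score
demo_results = pd.DataFrame(demo_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
top = demo_results.sort_values("rank_test_score")[cols]
print(top.head())
```

Comparing `std_test_score` across the top rows gives a rough sense of whether the best combination is clearly ahead or within fold-to-fold noise of the runners-up.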

Reuse the `plot_learning_curve()` function we provided earlier to plot the learning curve for the best estimator, again using the training data. How does it compare to the learning curve of the original classifier?

In [None]:
# FILL IN - Plot the learning curve for the best estimator using the training data


Finally, evaluate this estimator on the test set to get its test precision. How does it compare to the base model's test precision?

In [None]:
# FILL IN - Get the precision of the best model on test data
