<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<h2>Hyperparameter Tuning</h2>
<h4>Roan G. W. Salgueiro</h4>

<br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>
In this project I will tune the hyperparameters of two of the following model types:

* Regression Tree
* Random Forest
* Gradient Boosted Machine (GBM)

Hyperparameters are settings that are fixed before training and control how the model behaves while it learns. Tuning hyperparameters means selecting the combination of values that leads to the highest performance of the model on the given dataset.

To tune the hyperparameters, I'll need to experiment with different values for each hyperparameter and evaluate the performance of the model for each combination of hyperparameters. I'll use techniques such as cross-validation to evaluate the performance of the model on different subsets of the data.
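
As a minimal sketch of the cross-validation idea described above, the cell below scores one candidate model on five folds. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, and `candidate` are assumptions for illustration only; they stand in for the housing data loaded later in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the housing data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=219)

# one candidate combination of hyperparameters
candidate = GradientBoostingRegressor(n_estimators=50,
                                      learning_rate=0.1,
                                      random_state=219)

# each of the 5 folds is held out once while the model trains on the other 4
fold_scores = cross_val_score(candidate, X_demo, y_demo, cv=5, scoring="r2")

print("R-square per fold:", fold_scores.round(3))
print("Mean R-square    :", fold_scores.mean().round(3))
```

The mean of the fold scores is the figure a tuning search compares across hyperparameter combinations.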

Once I have selected the best hyperparameters for each model, I'll need to evaluate the performance of the models on a holdout dataset. This dataset is separate from the dataset used for tuning the hyperparameters and is used to estimate the generalization performance of the models.

Overall, the goal of this project is to demonstrate my ability to apply machine learning techniques to real-world problems by tuning the hyperparameters of these models to achieve the best possible performance.

<h3>Step 1: Imports</h3>
Import the following packages and make sure to start replacing my comments with your own. Also, please tell me about which models you are tuning in the second cell.

In [1]:
# importing related libraries 
import numpy             as np
import pandas            as pd  
import matplotlib.pyplot as plt
import seaborn           as sns


# importing model types
import sklearn.linear_model                            # importing linear models 
from sklearn.tree     import DecisionTreeRegressor     # importing regression tree models 
from sklearn.ensemble import RandomForestRegressor     # importing random forest models 
from sklearn.ensemble import GradientBoostingRegressor # importing GBM model 


# importing ML tools
from sklearn.model_selection import train_test_split   # importing train test split package
from sklearn.model_selection import RandomizedSearchCV # hyperparameter tuning

# loading data to housing variable 
housing = pd.read_excel('./__datasets/housing_feature_rich.xlsx')

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


########################################
# x-variable sets
########################################

x_variables = ['Garage_Cars', 'Overall_Qual', 'Total_Bsmt_SF',
               'NridgHt', 'Kitchen_AbvGr', 'has_Second_Flr',
               'Mas_Vnr_Area', 'has_Garage', 'Porch_Area',
               'NWAmes', 'OldTown', 'Overall_Cond',
               'Edwards', 'Somerst', 'Fireplaces',
               'Second_Flr_SF', 'First_Flr_SF', 'has_Mas_Vnr',
               'CulDSac', 'Total_Bath', 'Crawfor', 'Garage_Area',
               'has_Porch']


full_x = ['Overall_Qual', 'Overall_Cond', 'Mas_Vnr_Area', 'Total_Bsmt_SF',
          'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area', 'Full_Bath',
          'Half_Bath', 'Kitchen_AbvGr', 'TotRms_AbvGr', 'Fireplaces',
          'Garage_Cars', 'Garage_Area', 'Porch_Area', 'log_Lot_Area',
          'has_Second_Flr', 'has_Garage', 'has_Mas_Vnr', 'has_Porch',
          'Total_Bath', 'CulDSac', 'BrkSide', 'CollgCr', 'Crawfor',
          'Edwards', 'Gilbert', 'Mitchel', 'NWAmes', 'NridgHt', 'OldTown',
          'Sawyer', 'SawyerW', 'Somerst', 'Other_NH']


reduced_x = ['Overall_Qual', 'Gr_Liv_Area', 'Full_Bath',
             'Kitchen_AbvGr', 'TotRms_AbvGr', 'Fireplaces',
             'Garage_Cars', 'Garage_Area', 'Porch_Area', 
             'log_Lot_Area', 'has_Second_Flr', 'has_Garage',
             'has_Mas_Vnr', 'has_Porch', 'Total_Bath', 'CulDSac']

# checking results
housing.head(n=5)

Unnamed: 0,Order,Lot_Area,Overall_Qual,Overall_Cond,Mas_Vnr_Area,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Gr_Liv_Area,Full_Bath,Half_Bath,Kitchen_AbvGr,TotRms_AbvGr,Fireplaces,Garage_Cars,Garage_Area,Porch_Area,Pool_Area,Sale_Price,log_Sale_Price,log_Lot_Area,log_Mas_Vnr_Area,m_Mas_Vnr_Area,m_Total_Bsmt_SF,m_Garage_Cars,m_Garage_Area,m_log_Mas_Vnr_Area,has_Second_Flr,has_Garage,has_Mas_Vnr,has_Porch,log_Overall_Qual,Total_Bath,Grvl,Pave,Corner,CulDSac,FR2,FR3,Inside,Blmngtn,Blueste,BrDale,BrkSide,ClearCr,CollgCr,Crawfor,Edwards,Gilbert,Greens,GrnHill,IDOTRR,Landmrk,MeadowV,Mitchel,NAmes,NPkVill,NWAmes,NoRidge,NridgHt,OldTown,SWISU,Sawyer,SawyerW,Somerst,StoneBr,Timber,Veenker
0,1,31770,6,5,112,1080,1656,0,1656,1,0,1,7,2,2,528,272,0,215000,12.278393,10.366278,4.718508,0,0,0,0,0,0,1,1,1,1.791759,1.0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
1,2,11622,5,6,0,882,896,0,896,1,0,1,5,0,1,730,260,0,105000,11.561716,9.360655,-6.907755,0,0,0,0,0,0,1,0,1,1.609438,1.0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
2,3,14267,6,6,108,1329,1329,0,1329,1,1,1,6,0,1,312,429,0,172000,12.05525,9.565704,4.68214,0,0,0,0,0,0,1,1,1,1.791759,1.5,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
3,4,11160,7,5,0,2110,2110,0,2110,2,1,1,8,2,2,522,0,0,244000,12.404924,9.320091,-6.907755,0,0,0,0,0,0,1,0,0,1.94591,2.5,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,5,13830,5,5,0,928,928,701,1629,2,1,1,6,1,2,482,246,0,189900,12.154253,9.534595,-6.907755,0,0,0,0,0,1,1,0,1,1.609438,2.5,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


<br>

Which models are you tuning?

I am tuning the "Regression Tree" (DecisionTreeRegressor) and "Gradient Boosted Machine" (GradientBoostingRegressor) models.

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<h3>Step 2: Train-Test Split</h3>
Set up train-test split in the cell below. Set your <em>test_size</em> to 0.25 and your <em>random_state</em> to 219.

In [2]:
# preparing x-variables (dropping the y variables from the dataset)
x_data = housing.drop(['Sale_Price','log_Sale_Price'],axis = 1)

# preparing y-variable 
y_data = housing.loc[ : , 'Sale_Price']

# train-test split (no stratification, since the target is continuous)
x_train, x_test, y_train, y_test = train_test_split(x_data, 
                                                    y_data,
                                                    test_size    = 0.25,
                                                    random_state = 219)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<h3>Step 3: Check Available Hyperparameters</h3>
Call help(&nbsp;) on your selected model types to see which hyperparameters are available for tuning. Do NOT tune random_state, n_jobs or anything related to the intercept of a model. Use as many cells as needed.

In [3]:
# calling help() on each selected model (uncomment to display the documentation)
# help(DecisionTreeRegressor)

In [4]:
# help(GradientBoostingRegressor)
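
A more compact alternative to reading the full `help()` output is to list each estimator's hyperparameters and their default values with `get_params()`. This is a small sketch; the loop and the variable name `defaults` are my own and not part of the assignment template.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# get_params() returns a dictionary of hyperparameter names and current values;
# on a freshly instantiated model these are the defaults
for model in (DecisionTreeRegressor(), GradientBoostingRegressor()):
    defaults = model.get_params()
    print(type(model).__name__)
    for name, value in sorted(defaults.items()):
        print(f"  {name:<28} = {value!r}")
```

The listing includes random_state, which the instructions above say not to tune.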

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<h3>Step 4: Model Development - Default Hyperparameters</h3>
Run each of your selected models using three (3) respective default hyperparameters that you feel would be good candidates for optimization, printing their R-Square values for the training and testing sets. Use as many cells as needed and make sure to label your default hyperparameters.

In [5]:
# Model 1 - DecisionTree with default hyperparameters 

# Instantiating a DecisionTree model with default values 
dt_1 = DecisionTreeRegressor(criterion    = 'squared_error', # Setting the criterion as default value 'squared_error'
                             splitter     = 'best',          # Setting the splitter as default value 'best'
                             max_depth    = None,            # Setting the max_depth as default value None
                             random_state = 219)             # Setting the random_state 

# Fitting the training data into the model 
dt_1.fit(x_train, y_train)
# Predicting based on the testing set 
dt_1.predict(x_test)
# Scoring the result for training data and testing data 
dt_1_train_score = dt_1.score(x_train, y_train)
dt_1_test_score  = dt_1.score(x_test, y_test)

# Printing the R-squared values for the training and testing score 
print(f"""
{'*' * 80}
DecisionTree model with default hyperparameters \n
Train R-Square value: {dt_1_train_score}
Test R-Square value: {dt_1_test_score}
{'*' * 80}
""")



********************************************************************************
DecisionTree model with default hyperparameters 

Train R-Square value: 1.0
Test R-Square value: 0.7473218270728141
********************************************************************************



In [6]:
# Model 2 - Gradient Boosted Machine with default hyperparameters 

# Instantiating a Gradient Boosted Machine model with default values 
GBM_1 = GradientBoostingRegressor(loss          = 'squared_error', # Setting the loss as default value 'squared_error'
                                  learning_rate = 0.1,             # Setting the learning_rate as default value 0.1
                                  n_estimators  = 100,             # Setting the n_estimators as default value 100
                                  random_state  = 219)             # Setting the random_state

# Fitting the training data into the model 
GBM_1.fit(x_train, y_train)
# Predicting based on the testing set 
GBM_1.predict(x_test)
# Scoring the result for training data and testing data 
GBM_1_training_score = GBM_1.score(x_train, y_train)
GBM_1_testing_score  = GBM_1.score(x_test, y_test)

# Printing the R-squared values for the training and testing score 
print(f"""
{'*' * 80}
Gradient Boosting Machine model with default hyperparameters \n
Train R-Square value: {GBM_1_training_score}
Test R-Square value:  {GBM_1_testing_score}
{'*' * 80}
""")


********************************************************************************
Gradient Boosting Machine model with default hyperparameters 

Train R-Square value: 0.9481756836545663
Test R-Square value:  0.8912790360095459
********************************************************************************



<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<h3>Step 5: Hyperparameter Tuning Part I</h3><br>
Develop ranges for each of the hyperparameters you would like to tune. Write code to calculate how many models you are building (based on the combinations of hyperparameters you are tuning).
<br><br>
<strong>Requirement:</strong> You must tune at least three hyperparameters per model.<br>
<strong>Recommendation:</strong> Keep the number of models (i.e., iterations) you are building to:

* Less than 2,000 for Random Forest and GBM  (not including cross-validation)
* Less than 10,000 for all other model types (not including cross-validation)

Insert as many code cells as you need for the number of model types you are tuning.

In [7]:
# Hyperparameter Tuning for Model 1 - Decision Tree

# Declaring a hyperparameter space
splitter_range  = ['best','random']         # setting two types of splitter      
depth_range     = range(1,60,2)             # setting max depth range 
leaf_range      = range(2,50,2)             # setting min samples split range

In [8]:
# The number of model combinations for Model 1 - Decision Tree
print(f"""
The number of model combinations for Decision Tree: 
{len(depth_range)*len(splitter_range)*len(leaf_range)}
""")


The number of model combinations for Decision Tree: 
1440



In [9]:
# Hyperparameter Tuning for Model 2 - Gradient Boosted Machine 

# Declaring a hyperparameter space 
learn_rate = np.linspace(0.1,1,10)        # 10 evenly spaced learning rates from 0.1 to 1.0 

n_estimators = range(1,50,5)              # setting n_estimator range

min_sample_split = range(2,20,2)          # setting min sample split 


In [10]:
# The number of model combinations for Model 2 - GBM
print(f"""
The number of model combinations for GBM : 
{len(learn_rate)*len(n_estimators)*len(min_sample_split)}
""")


The number of model combinations for GBM : 
900
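
As a cross-check on the manual multiplication, `ParameterGrid` from scikit-learn can enumerate a search space and report its size with `len()`. The sketch below repeats the same ranges under the hypothetical names `tree_space` and `gbm_space` so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# same ranges as the cells above, restated so this cell is self-contained
tree_space = {"splitter"         : ["best", "random"],
              "max_depth"        : list(range(1, 60, 2)),
              "min_samples_split": list(range(2, 50, 2))}

gbm_space  = {"learning_rate"    : list(np.linspace(0.1, 1, 10)),
              "n_estimators"     : list(range(1, 50, 5)),
              "min_samples_split": list(range(2, 20, 2))}

# len() of a ParameterGrid is the number of distinct combinations
n_tree = len(ParameterGrid(tree_space))
n_gbm  = len(ParameterGrid(gbm_space))

print(f"Decision Tree combinations: {n_tree}")
print(f"GBM combinations          : {n_gbm}")
```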



<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<h3>Step 6: Hyperparameter Tuning Part II</h3><br>
Create a hyperparameter grid for each of the models you are tuning. Note that this is the dictionary step we conducted in class. Use as many cells as needed for this task.

In [11]:
# Hyperparameter grid for Model 1 - Decision Tree
param_grid_1 = {'splitter'         : splitter_range,
                'max_depth'        : depth_range,
                'min_samples_split': leaf_range}

In [12]:
# Hyperparameter grid for Model 2 - GBM
param_grid_2 = {'learning_rate'    : learn_rate,
                'n_estimators'     : n_estimators,
                'min_samples_split': min_sample_split}

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<h3>Step 7: Hyperparameter Tuning Part III</h3><br>
a) Complete the remaining steps for hyperparameter tuning. While your code runs, observe the processing time it takes to tune. Use as many cells as needed for this task.

In [13]:
# Instantiating the model object with DecisionTreeRegressor
tuned_tree = DecisionTreeRegressor(random_state = 219)

# RandomizedSearchCV object
tuned_tree_cv = RandomizedSearchCV(estimator             = tuned_tree,
                                   param_distributions   = param_grid_1,
                                   cv                    = 5,
                                   n_iter                = 1000,
                                   random_state          = 219)

# Fitting to the full dataset (cross-validation creates its own validation folds)
tuned_tree_cv.fit(x_data, y_data)


# printing the optimal parameters and best score (mean cross-validated R-square)
print("Tuned Parameters:", tuned_tree_cv.best_params_)
print("Tuned Training R-square:", tuned_tree_cv.best_score_.round(4))

Tuned Parameters: {'splitter': 'random', 'min_samples_split': 20, 'max_depth': 11}
Tuned Training R-square: 0.815


In [14]:
# Instantiating the model object with GradientBoostingRegressor 
tuned_GBM = GradientBoostingRegressor(random_state = 219)

# RandomizedSearchCV object
tuned_GBM_cv = RandomizedSearchCV(estimator             = tuned_GBM,
                                  param_distributions   = param_grid_2,
                                  cv                    = 5,
                                  n_iter                = 50,
                                  random_state          = 219)

# Fitting to the full dataset (cross-validation creates its own validation folds)
tuned_GBM_cv.fit(x_data, y_data)


# printing the optimal parameters and best score (mean cross-validated R-square)
print("Tuned Parameters:", tuned_GBM_cv.best_params_)
print("Tuned Training R-square:", tuned_GBM_cv.best_score_.round(4))

Tuned Parameters: {'n_estimators': 36, 'min_samples_split': 6, 'learning_rate': 0.2}
Tuned Training R-square: 0.8836
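
The two searches above are fitted on `x_data` and `y_data`, so the test split from Step 2 has already been seen during tuning. A possible alternative, sketched here on synthetic data, is to run the search on the training split only and score the refitted best model on the untouched test split. The data, the small `demo_space` grid, and the variable names are assumptions for illustration, not results from the housing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# synthetic stand-in for the housing data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=219)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size    = 0.25,
                                          random_state = 219)

# deliberately small hypothetical search space to keep the run short
demo_space = {"learning_rate"    : [0.05, 0.1, 0.2],
              "n_estimators"     : [20, 40],
              "min_samples_split": [2, 6, 10]}

search = RandomizedSearchCV(estimator           = GradientBoostingRegressor(random_state = 219),
                            param_distributions = demo_space,
                            cv                  = 3,
                            n_iter              = 5,
                            random_state        = 219)

# tuning only sees the training split
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated R-square of the best candidate;
# score() uses the best model refitted on the whole training split
print("Best parameters  :", search.best_params_)
print("Mean CV R-square :", round(search.best_score_, 4))
print("Hold-out R-square:", round(search.score(X_te, y_te), 4))
```

With this layout the hold-out R-square is an estimate on data that played no part in choosing the hyperparameters.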


<br>
b) Explore the hyperparameter tuning results for one of your tuned models. Look for hyperparameter combinations that tied for first place. Test each of these models using train-test split. If no models tied for first place, test your top three models.

In [15]:
# Explore the hyperparameter tuning results of Gradient Boosting Machine 

# cross validation results of Gradient Boosting Machine
tuned_GBM_cv.cv_results_

{'mean_fit_time': array([0.15069723, 0.05830998, 0.10091496, 0.2546123 , 0.46881542,
        0.52552605, 0.10773902, 0.15252156, 0.48887677, 0.14369102,
        0.1881968 , 0.13732963, 0.01167507, 0.26431465, 0.1366313 ,
        0.25632477, 0.2199214 , 0.17286224, 0.20985751, 0.0117135 ,
        0.13029952, 0.32587047, 0.28848257, 0.21019077, 0.13145065,
        0.08816695, 0.13939633, 0.33560443, 0.25250216, 0.33161664,
        0.29489231, 0.18280497, 0.26371622, 0.14305611, 0.28741961,
        0.28917518, 0.0548152 , 0.06031928, 0.41755013, 0.10124521,
        0.31708341, 0.01182418, 0.19343662, 0.31915331, 0.22773199,
        0.05414805, 0.32730522, 0.18503714, 0.0121419 , 0.26741681]),
 'std_fit_time': array([0.11560738, 0.001254  , 0.00603545, 0.01363939, 0.09392682,
        0.17707103, 0.00375738, 0.00452022, 0.16747994, 0.00512376,
        0.00157944, 0.00431057, 0.00057448, 0.00539291, 0.00558715,
        0.00420009, 0.01146357, 0.00747444, 0.00251041, 0.00049074,
        0.003

In [16]:
# Defining a function to analyze tuning result 
def tuning_results(cv_results, n=1):
    """
This function will display the top "n" models from hyperparameter tuning,
based on "rank_test_score".

PARAMETERS
----------
cv_results = results dictionary from the attribute ".cv_results_"
n          = number of models to display
    """
    param_lst = []

    for result in cv_results["params"]:
        result = str(result).replace(":", "=")
        param_lst.append(result[1:-1])


    results_df = pd.DataFrame(data = {
        "Model_Rank" : cv_results["rank_test_score"],
        "Mean_Test_Score" : cv_results["mean_test_score"],
        "SD_Test_Score" : cv_results["std_test_score"],
        "Parameters" : param_lst
    })


    results_df = results_df.sort_values(by = "Model_Rank", axis = 0)
    return results_df.head(n = n)

In [17]:
# run tuning_results() on the hyperparameter tuning results 
# returning top 3 models 
tuning_results( cv_results = tuned_GBM_cv.cv_results_, n = 3 )

Unnamed: 0,Model_Rank,Mean_Test_Score,SD_Test_Score,Parameters
30,1,0.883638,0.024119,"'n_estimators'= 36, 'min_samples_split'= 6, 'l..."
4,2,0.882705,0.020625,"'n_estimators'= 46, 'min_samples_split'= 14, '..."
38,3,0.880688,0.018583,"'n_estimators'= 46, 'min_samples_split'= 4, 'l..."
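
Since no models tied for first place here, the remaining task is to test the top three with a train-test split. The sketch below shows one way this could be done: the hypothetical helper `top_candidates()` returns every parameter set tied at rank 1, or the top `n` if there is no tie, and each one is then refitted and scored. It runs on synthetic data with a small assumed search space, so the printed scores are not housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def top_candidates(cv_results, n=3):
    """Return parameter dicts tied for rank 1, or the top n if there is no tie."""
    ranks  = np.asarray(cv_results["rank_test_score"])
    order  = np.argsort(ranks, kind="stable")      # best rank first
    tied   = [i for i in order if ranks[i] == 1]   # tied models share rank 1
    chosen = tied if len(tied) > 1 else order[:n]
    return [cv_results["params"][i] for i in chosen]


# synthetic stand-in for the housing data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=219)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=219)

search = RandomizedSearchCV(
    estimator           = GradientBoostingRegressor(random_state=219),
    param_distributions = {"learning_rate"    : [0.05, 0.1, 0.2],
                           "n_estimators"     : [20, 40],
                           "min_samples_split": [2, 6, 10]},
    cv=3, n_iter=6, random_state=219)
search.fit(X_tr, y_tr)

# refit each leading candidate and compare train and test R-square
for params in top_candidates(search.cv_results_, n=3):
    model = GradientBoostingRegressor(random_state=219, **params)
    model.fit(X_tr, y_tr)
    print(params,
          "| train:", round(model.score(X_tr, y_tr), 4),
          "| test:",  round(model.score(X_te, y_te), 4))
```

In this notebook the same helper could be called on `tuned_GBM_cv.cv_results_`, with `x_train`, `x_test`, `y_train`, and `y_test` from Step 2 in place of the synthetic splits.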
