# Parameter Optimization

Author: Antonio Miranda

In [None]:
# Load all required packages
import pandas
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import tree
from skopt import BayesSearchCV
from datetime import datetime

First, load the dataset and separate the predictors from the response variable, both for the training data (the first 1460 rows) and for the competition data (the remaining rows).

In [2]:
# Load dataset
data = pandas.read_csv("kaggleCompetition.csv")
data = data.values

# Dataset decomposition
X = data[0:1460,:-1]
y = data[0:1460,-1]
X_comp = data[1460:,:-1]
y_comp = data[1460:,-1]

##### 3.1. KNN. Grid-search with 3-fold cross-validation for K hyper-parameter tuning. Model evaluation with 4-fold cross-validation.

For reproducibility purposes we set a seed.

As a design choice, we run the Grid Search over odd numbers of neighbors from 1 to 49 (increments of 2). This keeps the search space small and selects an odd number of neighbors, which avoids ties. According to some documentation, the rule of thumb is to use a number of neighbors near the square root of the sample size (here below 40, since the training set has 1460 rows). We therefore explore the neighbor space around that number, although in our experience much smaller values are usually selected.
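As a quick check of the rule of thumb, the short sketch below computes the square root of the training sample size and lists the grid used in the search. The variable names are illustrative only and are not used elsewhere in the notebook.

```python
import math

n_samples = 1460  # rows used for training in this notebook
rule_of_thumb_k = math.sqrt(n_samples)  # roughly 38.2

# Same grid as in the search: odd values 1, 3, ..., 49
k_grid = list(range(1, 50, 2))

print(round(rule_of_thumb_k, 1))
print(len(k_grid), k_grid[0], k_grid[-1])
```

The grid therefore covers the rule-of-thumb value with some margin on both sides.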

Since the sample size is not too large and we only have to optimize one parameter, we use one core. On Windows systems, using more cores would require wrapping the code inside "if __name__ == '__main__':".
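A minimal sketch of that pattern is shown below, assuming the script is run as a standalone file. The helper `run_search` and the synthetic data are hypothetical stand-ins, not part of the notebook; passing `n_jobs=-1` would ask scikit-learn to use all available cores.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor


def run_search(X, y, n_jobs=1):
    """Grid search over a few odd k values; n_jobs controls the number of cores."""
    search = GridSearchCV(
        KNeighborsRegressor(),
        {'n_neighbors': list(range(1, 10, 2))},
        scoring='neg_mean_squared_error',
        cv=KFold(n_splits=3, shuffle=True, random_state=0),
        n_jobs=n_jobs,
    )
    return search.fit(X, y)


if __name__ == '__main__':
    # Synthetic stand-in data (hypothetical), not the Kaggle dataset
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(60, 3))
    y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=60)
    print(run_search(X_demo, y_demo, n_jobs=1).best_params_)
```

Keeping the search inside a function and calling it from the guarded block means worker processes can import the module without re-running the search.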

In [3]:
# -------------------------------- Grid Search ----------------------------------

# Hyper-parameter tuning --> Gridsearch
np.random.seed(1)
cv_grid = KFold(n_splits=3, shuffle=True, random_state=0)
param_grid_knn = {'n_neighbors': list(range(1,50,2))}
method_knn = KNeighborsRegressor()
method_tune_knn = GridSearchCV(
        method_knn, 
        param_grid_knn, 
        scoring='neg_mean_squared_error',
        cv = cv_grid, 
        n_jobs =1, 
        verbose = 0
)


# Model evaluation
t1 = datetime.now()
cv_evaluation = KFold(n_splits=4, shuffle=True, random_state=0)
scores_knn = -cross_val_score(
        method_tune_knn, 
        X, 
        y, 
        scoring='neg_mean_squared_error', 
        cv = cv_evaluation,
        verbose=0
)
tf = datetime.now() - t1
print(tf)
print(scores_knn)
print(scores_knn.mean(), "+-", scores_knn.std())

0:00:24.963467
[0.04392385 0.05844956 0.06067776 0.04528427]
0.05208385962541403 +- 0.007536535431943325


We obtain a mean squared error of 0.0521, although in one fold of the cross-validation it was as low as 0.0439. We will try to decrease it in the next steps.

It took about 25 s on our computer to perform the Grid Search with cross-validation for hyper-parameter tuning and model evaluation. We could reduce this time by increasing the number of cores, since Grid Search is easy to parallelize.
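To see where the time goes, the sketch below counts how many KNN models are fitted by the nested procedure. It assumes the default `refit=True` of `GridSearchCV`, which adds one final fit per outer fold; the variable names are illustrative.

```python
n_candidates = len(range(1, 50, 2))  # 25 values of n_neighbors
inner_folds = 3                      # folds used for tuning
outer_folds = 4                      # folds used for evaluation

# Each outer fold runs a full grid search plus one refit on its training part
fits_per_outer_fold = n_candidates * inner_folds + 1
total_fits = fits_per_outer_fold * outer_folds

print(fits_per_outer_fold, total_fits)  # 76 304
```

Every one of these fits is independent of the others within a grid search, which is why the work spreads well across cores.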

In [4]:
knn_model = method_tune_knn.fit(X, y)
print('\nThe best parameter for this model is: ')
print(knn_model.best_params_)


The best parameter for this model is: 
{'n_neighbors': 5}


We have fitted the model and the optimum number of neighbors is 5.

##### 3.2. Decision tree. Random-search with 3-fold cross-validation to tune the parameters max_depth, min_samples_split, and criterion. Model evaluation also with 4-fold cross-validation.

To use the same folds, we reuse the KFold objects set up in the previous chunk.

In addition, we also set the seed for reproducibility purposes.

As a design choice, both max_depth and min_samples_split are searched over the integers from 2 to 99. Bounding the range keeps the hyper-parameter tuning fast.

The budget is set to 40. Again, it is a reasonable balance: high enough to find good parameters with a fairly thorough Random Search, but not so high that the process takes forever.
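To put the budget in perspective, the sketch below compares it with the size of the full grid defined in the next cell. The names are illustrative only.

```python
n_max_depth = len(range(2, 100))          # 98 values
n_min_samples_split = len(range(2, 100))  # 98 values
n_criterion = 2                           # two splitting criteria

grid_size = n_max_depth * n_min_samples_split * n_criterion
budget = 40
fraction = budget / grid_size

print(grid_size)              # 19208
print(f"{fraction:.2%}")      # share of the grid actually evaluated
```

Random Search therefore evaluates only a small fraction of the combinations an exhaustive Grid Search would have to visit.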

In [5]:
# -------------------------------- Random Search ----------------------------------

# Hyper-parameter tuning --> Random Search
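# Note: from scikit-learn 1.0 the 'mse' criterion is named 'squared_error'
# ('mse' was removed in 1.2); use that name on recent versions.
param_grid_tree = {'max_depth': list(range(2,100,1)), 
                   'min_samples_split': list(range(2,100,1)),
                   'criterion': ('mse', 'friedman_mse')
}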
budget = 40

np.random.seed(1)
#cv_grid = KFold(n_splits=3, shuffle=True, random_state=0)
method_tree = tree.DecisionTreeRegressor()
#method_tree = tree.DecisionTreeRegressor(criterion="entropy")
method_tune_tree = RandomizedSearchCV(
        method_tree, 
        param_grid_tree, 
        scoring='neg_mean_squared_error', 
        cv=cv_grid, 
        n_jobs=1, 
        verbose=0, 
        n_iter=budget
)

# Model evaluation
#cv_evaluation = KFold(n_splits=4, shuffle=True, random_state=0)
t1 = datetime.now()
scores_tree = -cross_val_score(
        method_tune_tree, 
        X, 
        y, 
        scoring='neg_mean_squared_error', 
        cv = cv_evaluation
)
tf = datetime.now() - t1
print(tf)
print(scores_tree)
print(scores_tree.mean(), "+-", scores_tree.std())

0:00:07.267564
[0.03325202 0.03926548 0.03841255 0.03095004]
0.03547002037942877 +- 0.003478997154999783


The mean squared error has decreased to 0.0355 (smaller than with KNN). It seems like a Decision Tree is more suitable for this problem than KNN.

In addition, the hyper-parameter optimization took less time, although we cannot compare it properly with step 3.1 because it uses a different algorithm and different hyper-parameters.

In [6]:
np.random.seed(1)
tree_model = method_tune_tree.fit(X, y)
print('\nThe best parameter configuration for this model is: ')
print(tree_model.best_params_)


The best parameter configuration for this model is: 
{'min_samples_split': 22, 'max_depth': 28, 'criterion': 'friedman_mse'}


After fitting a final model, we show the selected parameters: __friedman_mse__ as criterion, __22__ as the minimum number of samples required to split an internal node and __28__ as the maximum depth of the tree.

##### 3.3. Bayes Search. Decision Tree

We will use the package scikit-optimize. The conditions are the same as in 3.2 (same budget, same parameter grid and same splits for the inner and outer loop).

Also, we use a defined seed for reproducibility.

In [7]:
# -------------------------------- Bayes Search ----------------------------------

# Hyper-parameter tuning --> Bayes Search
np.random.seed(1)
param_grid_bayes = {'max_depth': (2,100), 
                   'min_samples_split': (2,100),
                   'criterion': ('mse', 'friedman_mse')                
}
method_tune_bayes = BayesSearchCV(
        method_tree, 
        param_grid_bayes, 
        scoring='neg_mean_squared_error', 
        cv=cv_grid, 
        n_jobs=1, 
        verbose=0, 
        n_iter=budget
)

# Model evaluation
t1 = datetime.now()
scores_bayes = -cross_val_score(
        method_tune_bayes, 
        X, 
        y, 
        scoring='neg_mean_squared_error', 
        cv = cv_evaluation
)
tf = datetime.now() - t1
print(tf)
print(scores_bayes)
print(scores_bayes.mean(), "+-", scores_bayes.std())



0:02:39.778933
[0.03220015 0.03857478 0.03826416 0.03083077]
0.034967465168435 +- 0.0034875241068768707


In this case, we obtained an MSE of 0.0350, slightly smaller than in the Random Search scenario. However, the performance difference is not significant, since the standard deviation across folds (about 0.0035) is larger than the difference between the means (about 0.0005).
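The comparison can be reproduced from the fold scores printed above. The sketch below copies those printed values into arrays (rather than reusing the notebook variables) and compares the difference between the means with the spread across folds.

```python
import numpy as np

# Fold scores copied from the printed outputs of the two searches
scores_tree_printed = np.array([0.03325202, 0.03926548, 0.03841255, 0.03095004])
scores_bayes_printed = np.array([0.03220015, 0.03857478, 0.03826416, 0.03083077])

mean_diff = scores_tree_printed.mean() - scores_bayes_printed.mean()
fold_diff = scores_tree_printed - scores_bayes_printed

print("difference of means:", mean_diff)
print("std across folds (Bayes):", scores_bayes_printed.std())
print("per-fold differences:", fold_diff)
```

With only four folds this is an informal check rather than a statistical test.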

Bayesian Optimization took much longer than Random Search (about 2 min 40 s vs. 7 s on our computer). Moreover, increasing the number of cores would not make Bayesian Optimization much faster, because it is a sequential algorithm: each new candidate is chosen using the results of the previous ones. The only hope is that the scikit-learn Decision Tree implementation allows parallelization every time a tree is fitted. In contrast, algorithms like Grid Search are easy to parallelize because their candidates are evaluated independently.

The warnings that we see mean that the objective function has already been evaluated for the hyper-parameter combination being tested at a given iteration.

If we look at the verbose output, we observe that the iterations with the smallest error were in the middle of the run. After 30 iterations the algorithm was no longer able to improve the performance. This means we could have selected a smaller budget and saved time and power.
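One way to check this without reading the verbose log is to look at the per-iteration mean scores. The sketch below assumes those scores are available in iteration order (for example from the `cv_results_['mean_test_score']` attribute of a fitted search object); the helper name and the demo values are hypothetical and serve only as an illustration.

```python
import numpy as np


def first_best_iteration(mean_test_score):
    """Return the 0-based iteration at which the best score was first reached."""
    scores = np.asarray(mean_test_score, dtype=float)
    return int(np.argmax(scores))  # argmax returns the first occurrence


# Hypothetical neg-MSE values per iteration, for illustration only
demo_scores = [-0.050, -0.041, -0.036, -0.038, -0.036, -0.040]

# Best score seen so far after each iteration
running_best = np.maximum.accumulate(demo_scores)

print(first_best_iteration(demo_scores))  # 2
print(running_best)
```

If the running best stops changing well before the budget is exhausted, a smaller budget would likely have been enough.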

##### 3.4. Determine what is your best method 3.1, 3.2, and 3.3. Use your best method to compute predictions for the competition.

The cross-validation scores have already been printed, but let's show them again.

In [9]:
# -------------------------------- Final model ----------------------------------
np.random.seed(0)
print('\nDecision Tree with Bayes Search performance: ')
print(scores_bayes.mean(), "+-", scores_bayes.std())

print('\nDecision Tree with Random Search performance: ')
print(scores_tree.mean(), "+-", scores_tree.std())

print('\nKNN with Grid Search performance: ')
print(scores_knn.mean(), "+-", scores_knn.std())

# Final model fit: Decision Tree with Bayes Search
final_model = method_tune_bayes.fit(X, y)
print('\nThe best parameters for this model according to the Bayes Search are: ')
print(final_model.best_params_)

# Kaggle predictions (need to obtain actual sale price computing the exponential 
# of the predicted variable)
y_result = final_model.predict(X_comp)

SalePrice = np.exp(y_result)
output = pandas.DataFrame(SalePrice, columns = ['SalePrice'])
output['Id'] = list(range(1461, 2920))
output.to_csv('submission.csv', sep = ',', index = False, columns = ['Id', 'SalePrice'])


Decision Tree with Bayes Search performance: 
0.034967465168435 +- 0.0034875241068768707

Decision Tree with Random Search performance: 
0.03547002037942877 +- 0.003478997154999783

KNN with Grid Search performance: 
0.05208385962541403 +- 0.007536535431943325





The best parameters for this model according to the Bayes Search are: 
{'criterion': 'friedman_mse', 'max_depth': 68, 'min_samples_split': 63}


We see that the best model was the Decision Tree. The best hyper-parameter configuration was chosen with Bayesian Optimization: __friedman_mse__ as criterion, __68__ as maximum depth and __63__ as the minimum number of samples to split.

We have to transform the predictions (exponentiate them) before submitting them to the Kaggle competition, because our response variable was transformed in the pre-processing.
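The sketch below illustrates the back-transformation, assuming the pre-processing applied a natural logarithm to the sale price (the use of `np.exp` in the cell above suggests this, but the pre-processing itself is not shown here). The prices are hypothetical.

```python
import numpy as np

# Hypothetical sale prices, for illustration only
prices = np.array([100000.0, 250000.0])

y_log = np.log(prices)     # assumed pre-processing of the response
recovered = np.exp(y_log)  # back-transformation applied to predictions

print(np.allclose(recovered, prices))  # True
```

If the pre-processing had used a different transform, such as `np.log1p`, the matching inverse (`np.expm1`) would be needed instead.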