# Fine-Tuning and Hyperparameter Optimization

__________________________
**Tags:** *Machine Learning*, *Hyperparameter Optimization*, *PCA*, *Regression*

**Models:** *Ridge Regression*, *SGD*

**Python:** *Scikit-Learn*, *Pandas*
__________________________

__________________________
**Scenario:** Given 79 numerical and categorical features, we would like to predict the *Sale Price* of a residential property in Ames, Iowa.


**Dataset:** The Housing Dataset is provided by Kaggle; see DanB. Housing Prices Competition for Kaggle Learn Users. https://kaggle.com/competitions/home-data-for-ml-course, 2018. Kaggle.
__________________________

__________________________
## Contents

1. **Data Preprocessing**
<p> </p>
2. **Dimensionality Reduction**
<p> </p>
3. **Model Building and Parameter Fine-Tuning**
<p> </p>
4. **Outputting a Test Set Prediction**
__________________________

<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>

### Packages used throughout this notebook

In [1]:
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

### Reproducibility

In [2]:
rng = np.random.RandomState(0)

### Importing Dataset

In [3]:
df_train = pd.read_csv('train.csv') #https://www.kaggle.com/competitions/home-data-for-ml-course/overview
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:,-1]

df_train.shape

(1460, 81)

In [4]:
X_train.dtypes.value_counts()

object     43
int64      34
float64     3
Name: count, dtype: int64

<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>

# 1. Data Preprocessing

In this section, we will create two pipelines and subsequently combine them into a single transformer.

### Pipeline for numerical features

Let us create a list of all numerical features in the data frame. 

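Note that we collect column indices rather than column names: scikit-learn transformers return NumPy arrays without column names by default, so integer positions remain valid wherever the transformer sits in a pipeline.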

In [5]:
num_feats = X_train.select_dtypes(include=np.number).columns #all column names with numerical attributes

#returns a list of all column indices with numerical attributes
num_idx = []
for colname in num_feats:
    num_idx.append(X_train.columns.get_loc(colname))

num_idx.remove(0) #removes "Id" column index
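The index lookup above can also be written without an explicit loop. The following is a minimal sketch on a small stand-in frame; the frame `toy` and its values are made up for illustration and are not part of the Kaggle data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train with an "Id" column and mixed dtypes
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "YearBuilt": [2003, 1976, 2001],
})

# Numerical column names, with "Id" dropped by name instead of by position
num_cols = toy.select_dtypes(include=np.number).columns.drop("Id")

# Index.get_indexer maps column names to their integer positions in one call
toy_num_idx = toy.columns.get_indexer(num_cols).tolist()
print(toy_num_idx)
```

Dropping `"Id"` by name avoids relying on the column sitting at position 0.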

Creating the respective pipeline:

In [6]:
num_pipe = make_pipeline(SimpleImputer(strategy="median"),
                         StandardScaler())
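To see what this pipeline does to a single column, here is a small sketch on made-up values (the array `toy_num` is an assumption for illustration only): the missing entry is replaced by the column median, and the result is then standardized.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One numerical column with a missing value (illustrative data)
toy_num = np.array([[1.0], [2.0], [np.nan], [9.0]])

demo_num_pipe = make_pipeline(SimpleImputer(strategy="median"),
                              StandardScaler())
scaled = demo_num_pipe.fit_transform(toy_num)

# The median of the observed values (1, 2, 9) is 2, so the imputed row
# ends up with the same scaled value as the row that originally held 2.0
print(scaled.ravel())
print(scaled.mean(), scaled.std())
```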

### Pipeline for categorical features

Let us create a list of all categorical features in the data frame.

In [7]:
cat_feats = X_train.select_dtypes(include=object).columns #all column names with categorical attributes

#returns a list of all column indices with categorical attributes
cat_idx = []
for colname in cat_feats:
    cat_idx.append(X_train.columns.get_loc(colname))

Creating the pipeline: we simply impute missing values with the most frequent value of each column and then one-hot encode all categorical features.

In [8]:
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(sparse_output=False, handle_unknown='ignore'))

### Creating a Transformer

Creating the final transformer. Scikit-Learn's built-in *drop* option discards the columns assigned to it:

In [9]:
clean_pipe = ColumnTransformer([
                  ("num", num_pipe, num_idx),
                  ("cat", cat_pipe, cat_idx),
                  ("drop", "drop", [0]), #"Id" column is redundant
                               ])

<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>

# 2. Dimensionality Reduction

We will use a Principal Component Analysis (PCA) transformer that keeps just enough principal components to explain 95 % of the variance in the preprocessed features.

In [10]:
dim_pipe = make_pipeline(PCA(n_components=0.95))
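When `n_components` is a float between 0 and 1, PCA chooses the number of components itself. The sketch below illustrates this on synthetic data (the data generation is purely an assumption for demonstration) and inspects the fitted attributes `n_components_` and `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

demo_rng = np.random.RandomState(0)

# 10 features built as noisy linear mixtures of 2 latent factors
latent = demo_rng.normal(size=(200, 2))
mixing = demo_rng.normal(size=(2, 10))
X_demo = latent @ mixing + 0.05 * demo_rng.normal(size=(200, 10))

pca = PCA(n_components=0.95).fit(X_demo)

# Number of components kept and the share of variance they explain
print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())
```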

<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>

# 3. Model Building and Parameter Fine-Tuning

We are building a linear regression model trained with Stochastic Gradient Descent. The L2 penalty (ridge-style regularization) keeps the weights small.

In [11]:
sgdreg = SGDRegressor(max_iter=1000, tol=1e-3, penalty="l2", alpha=0.1, eta0=0.005,
                      random_state=rng)
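For orientation, the objective minimized by this estimator can be sketched as follows. This reflects our reading of Scikit-Learn's SGD documentation for the default squared-error loss with `penalty="l2"`; the symbols $w$ (weights), $b$ (intercept), and $n$ (number of training samples) are notation introduced here, and $\alpha$ corresponds to the `alpha` parameter.

$$
E(w, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( y_i - (w^\top x_i + b) \right)^2 + \frac{\alpha}{2} \lVert w \rVert_2^2
$$

A larger $\alpha$ penalizes large weights more strongly, while `eta0` only sets the initial step size of the gradient updates and does not appear in the objective.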

In [12]:
full_pipes = make_pipeline(clean_pipe,
                           dim_pipe,
                           sgdreg)

We would like to find the best hyperparameters for the model. Therefore, we will use *GridSearchCV* and display the results in a table.

In [13]:
param_grid = {'sgdregressor__alpha': [0.01, 0.03, 0.1, 0.3],
              'sgdregressor__eta0': [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]}

grid_search = GridSearchCV(full_pipes, param_grid, cv=3,
                                 scoring="neg_root_mean_squared_error")
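The prefix `sgdregressor__` in the grid keys comes from `make_pipeline`, which names each step after its lowercased class name. The following sketch uses a simplified stand-in pipeline (an assumption, not the full pipeline above) to show the step names, and counts the candidates in the grid with `ParameterGrid`.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in: a scaler, a nested PCA pipeline, and the regressor
demo_pipe = make_pipeline(StandardScaler(),
                          make_pipeline(PCA(n_components=0.95)),
                          SGDRegressor())

# Step names are derived from the class names
step_names = list(demo_pipe.named_steps)
print(step_names)

# Nested parameters are addressed as <step>__<parameter>
all_params = demo_pipe.get_params()
print("sgdregressor__alpha" in all_params)

# Same grid as in the notebook: 4 alphas x 6 learning rates
demo_grid = {"sgdregressor__alpha": [0.01, 0.03, 0.1, 0.3],
             "sgdregressor__eta0": [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]}
n_candidates = len(ParameterGrid(demo_grid))
print(n_candidates, "candidates,", n_candidates * 3, "fits with cv=3")
```

With `cv=3`, the search therefore trains 72 models, plus one final refit on the full training set when `refit=True` (the default).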

In [14]:
grid_search.fit(X_train, y_train)

cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_sgdregressor__alpha | param_sgdregressor__eta0 | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.306298 | 0.019672 | 0.020748 | 0.000715 | 0.03 | 0.001 | `{'sgdregressor__alpha': 0.03, 'sgdregressor__e...` | -28015.30294 | -34776.996662 | -38930.69826 | -33907.665954 | 4498.389761 | 1 |
| 12 | 0.297204 | 0.011736 | 0.022638 | 0.000845 | 0.1 | 0.001 | `{'sgdregressor__alpha': 0.1, 'sgdregressor__et...` | -28048.863895 | -34668.984742 | -39093.44959 | -33937.099409 | 4538.53577 | 2 |
| 13 | 0.284992 | 0.025059 | 0.024024 | 0.002513 | 0.1 | 0.003 | `{'sgdregressor__alpha': 0.1, 'sgdregressor__et...` | -28158.117121 | -34677.651085 | -39079.226812 | -33971.665006 | 4486.384697 | 3 |
| 7 | 0.292291 | 0.022276 | 0.025925 | 0.001403 | 0.03 | 0.003 | `{'sgdregressor__alpha': 0.03, 'sgdregressor__e...` | -28300.32935 | -34785.51082 | -38860.158946 | -33981.999706 | 4348.311769 | 4 |
| 0 | 0.325919 | 0.027657 | 0.023545 | 0.001146 | 0.01 | 0.001 | `{'sgdregressor__alpha': 0.01, 'sgdregressor__e...` | -28157.191585 | -34972.648134 | -38918.318155 | -34016.052625 | 4444.979752 | 5 |


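The best results are achieved with *alpha* = 0.03 and *eta0* = 0.001, which yield a mean RMSE of about 33,908 across the three cross-validation folds. Note that the top five configurations differ by only about 110 in mean RMSE, far less than the fold-to-fold standard deviation of roughly 4,500, so the ranking among them is not clear-cut.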
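Instead of reading the winner off the table, the fitted search object exposes it directly. The sketch below runs a tiny grid search on synthetic data (the data and the reduced grid are assumptions for illustration) and reads `best_params_`, `best_score_`, and `best_estimator_`. Because the scorer is the negated RMSE, the sign is flipped to recover the RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=10.0, random_state=0)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), SGDRegressor(random_state=0)),
    {"sgdregressor__alpha": [0.01, 0.1],
     "sgdregressor__eta0": [0.001, 0.01]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# Scores are negated RMSE values, so flip the sign
best_rmse = -demo_search.best_score_
print(demo_search.best_params_, best_rmse)

# With refit=True (the default), best_estimator_ is already refit
# on the full training data using the best parameters
preds = demo_search.best_estimator_.predict(X_demo[:3])
print(preds)
```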

<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>
<p> <br> </p>

# 4. Outputting a Test Set Prediction

The following lines create a prediction on the test set:

In [15]:
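# Note: this fits the pipeline with its initial hyperparameters (alpha=0.1, eta0=0.005),
# not the grid-search winners; grid_search.best_estimator_ holds the tuned, refit model.
full_pipes.fit(X_train, y_train)

X_test = pd.read_csv('test.csv')
df_pred = X_test[["Id"]].copy()  # submission frame starts with the "Id" column
df_pred["SalePrice"] = full_pipes.predict(X_test)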

Uncomment the following line to create a CSV file:

In [16]:
#df_pred.to_csv('out.csv', index=False)
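To check the layout of the submission file without touching the disk, the frame can be written to an in-memory buffer first. This is a minimal sketch with made-up values; the `Id` and `SalePrice` numbers below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical predictions in the same two-column layout as df_pred
demo_pred = pd.DataFrame({"Id": [1461, 1462],
                          "SalePrice": [120000.0, 155000.5]})

# index=False keeps the row index out of the file
buffer = io.StringIO()
demo_pred.to_csv(buffer, index=False)

csv_text = buffer.getvalue()
header = csv_text.splitlines()[0]
print(header)
```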

On Kaggle, the RMSE on the test set is about 18,775.