# Fine-Tuning a Ridge Regression Model

__________________________
**Tags:** *Machine Learning*, *Hyperparameter Optimization*, *PCA*, *Regression*

**Models**: *Ridge Regression*, *SGD*

**Python:** *Scikit-Learn*, *Pandas*
__________________________

__________________________
**Scenario:** Given 79 numerical and categorical features, we would like to predict the *Sale Price* of a property in Ames, Iowa.


**Dataset:** The Housing Dataset is provided by Kaggle; see DanB. Housing Prices Competition for Kaggle Learn Users. https://kaggle.com/competitions/home-data-for-ml-course, 2018. Kaggle.
__________________________

__________________________
## Contents

1. **Data Preprocessing**
2. **Dimensionality Reduction**
3. **Model Building and Parameter Fine-Tuning**
4. **Outputting a Test Set Prediction**
__________________________


### Packages used throughout this notebook

In [1]:
import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

### Reproducibility

In [2]:
rng = np.random.RandomState(0)

### Importing Dataset

In [3]:
df_train = pd.read_csv('train.csv') #https://www.kaggle.com/competitions/home-data-for-ml-course/overview
X_train = df_train.iloc[:,:-1]
y_train = df_train.iloc[:,-1]

df_train.shape

(1460, 81)

In [4]:
X_train.dtypes.value_counts()

object     43
int64      34
float64     3
Name: count, dtype: int64


# 1. Data Preprocessing

In this section, we will create two pipelines and subsequently combine them into a single transformer.

### Pipeline for numerical features

Let us create a list of all numerical features in the data frame. 

Note that we collect column indices rather than column names, since Scikit-Learn transformers return NumPy arrays without column names by default.

In [5]:
# names of all columns with numerical attributes
num_feats = X_train.select_dtypes(include=np.number).columns

# builds a list of the positional indices of all numerical columns
num_idx = [X_train.columns.get_loc(colname) for colname in num_feats]

num_idx.remove(0)  # removes the index of the "Id" column
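The name-to-index lookup can be tried on a small frame. The following is a minimal sketch on a made-up four-column frame (the column values are illustrative and not taken from the dataset); it uses pandas' `Index.get_indexer` as one possible alternative to calling `get_loc` per column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (values are illustrative only)
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "YearBuilt": [2003, 1976, 2001],
})

# Numerical column names, without the "Id" column
num_cols = toy.select_dtypes(include=np.number).columns.drop("Id")

# Positional indices of those columns within the frame
num_positions = toy.columns.get_indexer(num_cols).tolist()
print(num_positions)
```

Dropping `"Id"` by name avoids relying on it being the first column.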

Creating the respective pipeline:

In [6]:
num_pipe = make_pipeline(SimpleImputer(strategy="median"),
                         StandardScaler())

### Pipeline for categorical features

Let us create a list of all categorical features in the data frame.

In [7]:
# names of all columns with categorical attributes
cat_feats = X_train.select_dtypes(include=object).columns

# builds a list of the positional indices of all categorical columns
cat_idx = [X_train.columns.get_loc(colname) for colname in cat_feats]

Creating the pipeline: we simply impute missing values with the most frequent value of each column and one-hot encode all categorical features.

In [8]:
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(sparse_output=False, handle_unknown='ignore'))

### Creating a Transformer

Creating the final transformer, using Scikit-Learn's built-in *drop* option, which discards the specified columns:

In [9]:
clean_pipe = ColumnTransformer([
    ("num", num_pipe, num_idx),
    ("cat", cat_pipe, cat_idx),
    ("drop", "drop", [0]),  # "Id" column is redundant
])


# 2. Dimensionality Reduction

We will use a Principal Component Analysis (PCA) transformer that keeps just enough components to explain at least 95 % of the variance in the preprocessed features.

In [10]:
dim_pipe = make_pipeline(PCA(n_components=0.95))
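To see what a fractional `n_components` does, consider the following sketch on synthetic data. The data is an assumption made for illustration: ten observed features generated from three latent factors plus a little noise. After fitting, `n_components_` reports how many components were kept and `explained_variance_ratio_` how much variance they retain.

```python
import numpy as np
from sklearn.decomposition import PCA

rng_demo = np.random.RandomState(0)

# Synthetic data: 10 features driven by 3 latent factors plus small noise
latent = rng_demo.normal(size=(200, 3))
mixing = rng_demo.normal(size=(3, 10))
X_demo = latent @ mixing + 0.01 * rng_demo.normal(size=(200, 10))

# A float in (0, 1) asks PCA to keep as many components as needed
# to reach that share of explained variance
pca_demo = PCA(n_components=0.95).fit(X_demo)

kept = pca_demo.n_components_
retained = pca_demo.explained_variance_ratio_.sum()
print(kept, retained)
```

On the housing data, the same attributes of the fitted PCA step could be inspected to learn how many of the one-hot encoded and scaled features survive the reduction.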


# 3. Model Building and Parameter Fine-Tuning

We are building a Ridge Regression model, i.e. a linear regression with an L2 penalty that keeps the weights small, and train it with Stochastic Gradient Descent.

In [11]:
sgdreg = SGDRegressor(max_iter=1000, tol=1e-3, penalty="l2", alpha=0.1, eta0=0.005,
                      random_state=rng)
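The shrinking effect of the L2 penalty can be made visible with a small experiment. The sketch below uses synthetic data (the true weights and the two `alpha` values are arbitrary choices for illustration) and compares the norm of the learned weight vector for a weak and a strong penalty.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng_demo = np.random.RandomState(0)

# Synthetic regression problem with known weights (illustrative only)
X_demo = StandardScaler().fit_transform(rng_demo.normal(size=(300, 5)))
true_w = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_w + 0.1 * rng_demo.normal(size=300)

# Fit the same model with a weak and a strong L2 penalty
norms = {}
for alpha in (1e-4, 1.0):
    model = SGDRegressor(max_iter=1000, tol=1e-3, penalty="l2",
                         alpha=alpha, eta0=0.005, random_state=0)
    model.fit(X_demo, y_demo)
    norms[alpha] = float(np.linalg.norm(model.coef_))

# The stronger penalty should yield a smaller weight norm
print(norms)
```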

In [12]:
full_pipes = make_pipeline(clean_pipe,
                           dim_pipe,
                           sgdreg)

In [13]:
full_pipes.fit(X_train, y_train)

We would like to find the best hyperparameters for the SGD model. Therefore, we will use *GridSearchCV* and display the results in a table.

In [19]:
param_grid = {'sgdregressor__alpha': [0.01, 0.03, 0.05, 0.1, 0.3, 0.5],
              'sgdregressor__eta0': [0.001, 0.005, 0.01, 0.05, 0.1]}

grid_search = GridSearchCV(full_pipes, param_grid, cv=3,
                                 scoring="neg_root_mean_squared_error")

grid_search.fit(X_train, y_train)

In [20]:
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_sgdregressor__alpha | param_sgdregressor__eta0 | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 0.239153 | 0.004595 | 0.018873 | 0.000223 | 0.1 | 0.005 | `{'sgdregressor__alpha': 0.1, 'sgdregressor__et...` | -27538.818344 | -34883.567451 | -38984.091144 | -33802.15898 | 4734.67004 | 1 |
| 10 | 0.293994 | 0.024306 | 0.020538 | 0.002237 | 0.05 | 0.001 | `{'sgdregressor__alpha': 0.05, 'sgdregressor__e...` | -27954.580691 | -34684.0587 | -38899.826782 | -33846.155391 | 4507.487435 | 2 |
| 11 | 0.243551 | 0.001992 | 0.018951 | 0.000276 | 0.05 | 0.005 | `{'sgdregressor__alpha': 0.05, 'sgdregressor__e...` | -27669.676261 | -34897.379706 | -38973.717036 | -33846.924334 | 4674.250516 | 3 |
| 15 | 0.260402 | 0.004842 | 0.018975 | 0.000566 | 0.1 | 0.001 | `{'sgdregressor__alpha': 0.1, 'sgdregressor__et...` | -27972.269336 | -34643.265172 | -38989.293399 | -33868.275969 | 4530.94258 | 4 |
| 5 | 0.271967 | 0.006184 | 0.01883 | 0.000103 | 0.03 | 0.001 | `{'sgdregressor__alpha': 0.03, 'sgdregressor__e...` | -28028.847104 | -34751.207929 | -38903.091175 | -33894.382069 | 4480.543787 | 5 |


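The best results are achieved with *alpha* = 0.1 and *eta0* = 0.005, which yield a mean RMSE of about 33,802 across the three cross-validation folds. The scores in the table are negative because the *neg_root_mean_squared_error* scoring returns the negated RMSE, so that higher is better. Note that the top five configurations differ by less than 100 in mean RMSE, far less than the standard deviation across folds of roughly 4,500, so the ranking among them is not decisive.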
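Instead of reading the winner off the table, the fitted search object can be queried directly. The following is a minimal sketch on synthetic data with a reduced grid (both are assumptions for illustration, not the notebook's actual setup); it uses the `best_params_`, `best_score_` and `best_estimator_` attributes of `GridSearchCV`.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng_demo = np.random.RandomState(0)

# Synthetic regression data standing in for the housing data
X_demo = rng_demo.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng_demo.normal(size=150)

demo_pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
demo_grid = {"sgdregressor__alpha": [0.01, 0.1],
             "sgdregressor__eta0": [0.005, 0.01]}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
demo_search.fit(X_demo, y_demo)

best_params = demo_search.best_params_   # winning parameter combination
best_rmse = -demo_search.best_score_     # flip the sign to obtain the RMSE

# With the default refit=True, best_estimator_ is already refitted
# on all of the data passed to fit() and can predict directly
demo_pred = demo_search.best_estimator_.predict(X_demo[:5])
print(best_params, best_rmse)
```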


# 4. Outputting a Test Set Prediction

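The following lines refit the pipeline on the full training set and create a prediction on the test set. The pipeline's initial hyperparameters (*alpha* = 0.1, *eta0* = 0.005) coincide with the best combination found by the grid search, so no parameters need to be changed: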

In [16]:
full_pipes.fit(X_train, y_train)

X_test = pd.read_csv('test.csv')
df_pred = X_test[["Id"]].copy()  # keeps only the "Id" column
df_pred["SalePrice"] = full_pipes.predict(X_test)

Uncomment the following line to create a CSV file:

In [17]:
#df_pred.to_csv('out.csv', index=False)

On Kaggle, the RMSE on the test set is about 17,888.