# Linear Regression Model Tuning

In [1]:
import numpy as np
import pandas as pd

## Dataset

For the dataset, I've chosen the [UCI Wine Quality dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality): it has eleven numerical features and a numerical target (`quality`), which makes it a reasonable fit for a regression exercise.

In [2]:
from sklearn.model_selection import train_test_split

# Seed NumPy's global RNG for reproducible runs;
# train_test_split draws from it when random_state is not set
np.random.seed(42)

df = pd.read_csv("../data/winequalityN.csv")
df = df.dropna()

train, test = train_test_split(df, test_size=0.2)

In [3]:
train

,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
459,white,6.1,0.43,0.35,9.1,0.059,83.0,249.0,0.99710,3.37,0.50,8.500000,5
1460,white,8.5,0.17,0.74,3.6,0.050,29.0,128.0,0.99280,3.28,0.40,12.400000,6
4403,white,5.2,0.22,0.46,6.2,0.066,41.0,187.0,0.99362,3.19,0.42,9.733333,5
3828,white,6.3,0.40,0.24,5.1,0.036,43.0,131.0,0.99186,3.24,0.44,11.300000,6
4317,white,6.7,0.34,0.26,1.9,0.038,58.0,138.0,0.98930,3.00,0.47,12.200000,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3798,white,7.9,0.21,0.39,2.0,0.057,21.0,138.0,0.99176,3.05,0.52,10.900000,5
5219,red,9.3,0.61,0.26,3.4,0.090,25.0,87.0,0.99975,3.24,0.62,9.700000,5
5254,red,11.5,0.41,0.52,3.0,0.080,29.0,55.0,1.00010,3.26,0.88,11.000000,5
5418,red,9.8,0.25,0.49,2.7,0.088,15.0,33.0,0.99820,3.42,0.90,10.000000,6


In [4]:
# x is all numerical features, y is quality.
# Note: Index.difference returns the column names in sorted (alphabetical) order.
feature_cols = df.columns.difference(["type", "quality"])

train_x, train_y = train[feature_cols], train["quality"]
test_x, test_y = test[feature_cols], test["quality"]

## Basic Linear Regression

In [10]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_x, train_y)

LinearRegression()

In [12]:
from sklearn.metrics import mean_squared_error, r2_score


def print_metrics(model, x, y):
    """Print the MSE and R2 score of `model` on features `x` and targets `y`."""
    pred = model.predict(x)

    print(f"MSE {mean_squared_error(y, pred)}")
    print(f"R2 Score {r2_score(y, pred)}")

In [14]:
print_metrics(model, test_x, test_y)

MSE 0.5181044887725762
R2 Score 0.34613659277743936


In [50]:
test_pred = model.predict(test_x)

In [56]:
df['quality'].describe()

count    6463.000000
mean        5.818505
std         0.873286
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         9.000000
Name: quality, dtype: float64

In [54]:
pd.DataFrame(test_pred).describe()

,0
count,1293.0
mean,5.843063
std,0.482378
min,4.237051
25%,5.490974
50%,5.841138
75%,6.203619
max,7.112223


In [57]:
model.coef_

array([ 2.69339111e-01, -5.99532485e-01, -1.53457236e-01, -4.56550079e+01,
        5.97136749e-02,  6.05018246e-03,  4.01915362e-01,  4.00340414e-02,
        7.53978919e-01, -2.29118707e-03, -1.35851918e+00])
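
The raw array is hard to read without feature names. Because `Index.difference` returns labels in sorted order, the coefficients should follow the alphabetical column order of `train_x`. A minimal sketch of labelling them is shown below; it uses a small synthetic frame (the column subset, data, and `demo_*` names are made up for illustration) in place of the wine data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for train_x / train_y (illustrative data only)
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(
    {
        "alcohol": rng.normal(10.5, 1.2, 200),
        "density": rng.normal(0.995, 0.003, 200),
        "pH": rng.normal(3.2, 0.16, 200),
    }
)
demo_y = 0.3 * demo_x["alcohol"] - 40.0 * demo_x["density"] + rng.normal(0, 0.1, 200)

demo_model = LinearRegression().fit(demo_x, demo_y)

# Pair each coefficient with the column it belongs to
coef_by_feature = pd.Series(demo_model.coef_, index=demo_x.columns)

# Show the largest coefficients (by absolute value) first
print(coef_by_feature.sort_values(key=np.abs, ascending=False))
```

In this notebook the equivalent would be `pd.Series(model.coef_, index=train_x.columns)`. A large raw coefficient does not by itself mean a feature matters most: a feature with a very narrow range, such as `density`, needs a large coefficient to have any effect on the prediction.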

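These are not great metrics for a plain Linear Regression model: an MSE of 0.518 corresponds to an RMSE of about 0.72 quality points, and the model explains only about 35% of the variance in `quality`. The predictions are also compressed toward the mean (standard deviation 0.48, versus 0.87 for the actual scores in the full dataset). Let's try to improve the performance.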
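
To put the two metrics in context, $R^2$ compares the model's MSE with that of a baseline that always predicts the mean of the evaluation targets: $R^2 = 1 - \mathrm{MSE}_{\text{model}} / \mathrm{MSE}_{\text{mean}}$. A small sketch of that relationship in plain NumPy, using made-up targets and predictions rather than the wine data:

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([5.0, 6.0, 6.0, 7.0, 5.0, 4.0, 6.0, 8.0])
y_pred = np.array([5.4, 5.8, 6.1, 6.3, 5.5, 5.0, 5.9, 6.9])

mse_model = np.mean((y_true - y_pred) ** 2)
# Baseline: always predict the mean of the targets
mse_baseline = np.mean((y_true - y_true.mean()) ** 2)

r2 = 1.0 - mse_model / mse_baseline
rmse = np.sqrt(mse_model)

print(f"model MSE    {mse_model:.4f}")
print(f"baseline MSE {mse_baseline:.4f}")
print(f"R2           {r2:.4f}")
print(f"RMSE         {rmse:.4f}")
```

Applied to the numbers above, an MSE of 0.518 with an $R^2$ of 0.346 implies a mean-only baseline MSE of roughly $0.518 / (1 - 0.346) \approx 0.79$ on the test set.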

In [58]:
df.columns

Index(['type', 'fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality'],
      dtype='object')

In [59]:
df['pH'].describe()

count    6463.000000
mean        3.218332
std         0.160650
min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000
Name: pH, dtype: float64

## Data normalization

In [15]:
df.describe()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0,6463.0
mean,7.217755,0.339589,0.318758,5.443958,0.056056,30.516865,115.694492,0.994698,3.218332,0.53115,10.492825,5.818505
std,1.297913,0.164639,0.145252,4.756852,0.035076,17.758815,56.526736,0.003001,0.16065,0.148913,1.193128,0.873286
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99233,3.11,0.43,9.5,5.0
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.997,3.32,0.6,11.3,6.0
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0


The input features span very different ranges: `density` varies by only a few hundredths, while `total sulfur dioxide` ranges from 6 to 440. Can scaling them make the model work better?

In [30]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In [31]:
pipe.fit(train_x, train_y)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])

In [32]:
print_metrics(pipe, test_x, test_y)

MSE 0.5181044887725772
R2 Score 0.346136592777438


Adding a scaler left the metrics unchanged up to floating-point noise, so unregularized Linear Regression isn't sensitive to feature scaling.
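
This insensitivity can be checked directly. Standardization is an invertible affine transform of each column, and ordinary least squares with an intercept can absorb it into the coefficients, so the fitted predictions should not change. A minimal NumPy sketch on synthetic data (it uses `np.linalg.lstsq` as a stand-in for `LinearRegression`; the data and the `ols_predict` helper are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only)
scales = np.array([1.0, 50.0, 0.003])
offsets = np.array([7.0, 115.0, 0.995])
X = rng.normal(size=(200, 3)) * scales + offsets
y = X @ np.array([0.5, -0.01, 30.0]) + rng.normal(scale=0.1, size=200)


def ols_predict(features, targets):
    """Fit least squares with an intercept and return in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return design @ coef


# Standardize each column: zero mean, unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_predict(X, y)
pred_scaled = ols_predict(X_scaled, y)

max_diff = np.max(np.abs(pred_raw - pred_scaled))
print(f"largest prediction difference: {max_diff:.2e}")
```

Scaling does matter for penalized models, where the penalty acts on coefficient magnitudes and therefore depends on the units of each feature.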

## *l1* and *l2* regularization

Let's start by training a simple *l1* (Lasso) and *l2* (Ridge) model on the same unscaled data as the linear model above.

For now, let's go with $\alpha=1$, scikit-learn's default regularization strength for both models.

In [39]:
from sklearn.linear_model import Ridge, Lasso

In [38]:
lasso_model = Lasso(alpha=1)
lasso_model.fit(train_x, train_y)

print("Lasso metrics:")
print_metrics(lasso_model, test_x, test_y)

Lasso metrics:
MSE 0.7895467709055487
R2 Score 0.0035682898464813873


In [40]:
ridge_model = Ridge(alpha=1)
ridge_model.fit(train_x, train_y)

print("Ridge metrics: ")
print_metrics(ridge_model, test_x, test_y)

Ridge metrics: 
MSE 0.5212445630274457
R2 Score 0.34217372487007014


So the Lasso-regularized model performs much worse than plain linear regression, while Ridge performs about the same. Why is that?
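
One thing worth checking is the penalty strength. With a large enough `alpha`, Lasso drives most or all coefficients to exactly zero, leaving little more than the intercept, which gives an $R^2$ near zero on held-out data. A hedged sketch of an `alpha` sweep on standardized synthetic data (the dataset from `make_regression`, the alpha grid, and the variable names are assumptions for illustration, not the wine data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the wine data
X, y = make_regression(
    n_samples=500, n_features=11, n_informative=5, noise=10.0, random_state=0
)
# Rescale the target to unit variance so alpha=1 is a strong penalty
y = (y - y.mean()) / y.std()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for alpha in [0.001, 0.01, 0.1, 1.0]:
    sweep_pipe = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    sweep_pipe.fit(X_train, y_train)
    # Count the coefficients that the penalty did not shrink to zero
    n_nonzero = int(np.sum(sweep_pipe.named_steps["lasso"].coef_ != 0))
    test_r2 = sweep_pipe.score(X_test, y_test)
    results[alpha] = (n_nonzero, test_r2)
    print(f"alpha={alpha:<6} nonzero coefs={n_nonzero:2d} R2={test_r2:.3f}")
```

Note that `lasso_model` above was fit on unscaled features, so the same `alpha` penalizes each feature differently depending on its units; putting a `StandardScaler` in front, as in this sketch, removes that effect.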

Both penalties are meant to prevent overfitting, so perhaps the simple linear regression model doesn't overfit in the first place? Let's compare how it performs on the training and test sets.

In [48]:
print("Linear Regression Test Metrics")
print_metrics(pipe, test_x, test_y)

Linear Regression Test Metrics
MSE 0.5181044887725772
R2 Score 0.346136592777438


In [49]:
print("Linear Regression Train Metrics")
print_metrics(pipe, train_x, train_y)

Linear Regression Train Metrics
MSE 0.5445772958914633
R2 Score 0.27872131256933186


So the model scores no better on the training data than on the test data (it is even slightly worse), which suggests it is underfitting rather than overfitting.
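
Because the test score is slightly better than the training score, part of the gap may simply be noise from this particular split. One way to check is k-fold cross-validation, sketched below on synthetic data (the dataset, fold count, and variable names are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic noisy regression problem standing in for the wine data
X, y = make_regression(n_samples=1000, n_features=11, noise=150.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(), X, y, cv=cv, scoring="r2", return_train_score=True
)

train_r2 = scores["train_score"]
test_r2 = scores["test_score"]
print(f"train R2: mean {train_r2.mean():.3f}, std {train_r2.std():.3f}")
print(f"test  R2: mean {test_r2.mean():.3f}, std {test_r2.std():.3f}")
```

If the cross-validated train and test scores are both low and close together, that points to underfitting rather than an unlucky split.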

This likely means that the relationship between these features and `quality` is not well described by a linear function, and that Linear Regression isn't an appropriate model for this dataset.