# Exploring OLS, Lasso and Random Forest in a regression task

In [39]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Lasso, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

We are working with the UCI Wine Quality dataset (white wines).

In [2]:
url = 'https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/data/winequality-white.csv'
data = pd.read_csv(url, sep=';')
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


Separate the target feature, split the data in a 7:3 proportion (30% forms the holdout set; use random_state=17), and standardize the features with StandardScaler, fitting the scaler on the training part only.



In [4]:
y = data['quality']
X = data.drop('quality', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
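
To see what the scaling step does, here is a minimal sketch on synthetic data (the `*_demo` names and the random arrays are illustrative assumptions, not part of the wine dataset). It checks that `StandardScaler` centers and scales using statistics learned from the training part only, which is why the test set goes through `transform` rather than `fit_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a train/test split (assumption: not the wine data).
rng = np.random.default_rng(17)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std on train
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses the train statistics

# Manual equivalent: subtract the train mean, divide by the train std (ddof=0).
manual_test_scaled = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(np.allclose(X_train_demo_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_demo_scaled.std(axis=0), 1.0))
print(np.allclose(X_test_demo_scaled, manual_test_scaled))
```

The scaled test set will generally not have exactly zero mean and unit variance, since it is transformed with the training statistics.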

## Linear Regression
Train a simple linear regression model (Ordinary Least Squares).

In [5]:
linreg = LinearRegression(n_jobs=-1)
linreg.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

What are the mean squared errors of the model's predictions on the training and holdout sets?



In [6]:
print("Mean squared error (train): %.3f" % mean_squared_error(y_train, linreg.predict(X_train_scaled)))
print("Mean squared error (test): %.3f" % mean_squared_error(y_test, linreg.predict(X_test_scaled)))

Mean squared error (train): 0.558
Mean squared error (test): 0.584
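
As a reminder of what `mean_squared_error` reports, the sketch below computes the average squared residual by hand on a few made-up values (the `*_demo` arrays are hypothetical, not taken from the wine data) and compares it with the scikit-learn function.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted quality scores, for illustration only.
y_true_demo = np.array([6, 5, 7, 6])
y_pred_demo = np.array([5.5, 5.0, 6.0, 6.5])

# MSE is the mean of the squared residuals.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)
```

Because MSE is in squared target units, its square root is often easier to read: the holdout MSE of 0.584 above corresponds to an error of roughly 0.76 quality points.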


Sort the features by their influence on the target (wine quality). Because the features are standardized, coefficient magnitudes are directly comparable; note that both large positive and large negative coefficients indicate a strong influence. A pandas.DataFrame is handy here.



In [16]:
linreg_coef = pd.DataFrame({'coef': linreg.coef_, 'abs coef': abs(linreg.coef_)}, index=X.columns)
linreg_coef

Unnamed: 0,coef,abs coef
fixed acidity,0.097822,0.097822
volatile acidity,-0.19226,0.19226
citric acid,-0.000183,0.000183
residual sugar,0.538164,0.538164
chlorides,0.008127,0.008127
free sulfur dioxide,0.04218,0.04218
total sulfur dioxide,0.014304,0.014304
density,-0.66572,0.66572
pH,0.150036,0.150036
sulphates,0.062053,0.062053
alcohol,0.129533,0.129533


In [17]:
linreg_coef.sort_values(by='abs coef', ascending=False)

Unnamed: 0,coef,abs coef
density,-0.66572,0.66572
residual sugar,0.538164,0.538164
volatile acidity,-0.19226,0.19226
pH,0.150036,0.150036
alcohol,0.129533,0.129533
fixed acidity,0.097822,0.097822
sulphates,0.062053,0.062053
free sulfur dioxide,0.04218,0.04218
total sulfur dioxide,0.014304,0.014304
chlorides,0.008127,0.008127
citric acid,-0.000183,0.000183


Density has the largest absolute coefficient, so according to the linear model it is the most influential feature for wine quality.

## Lasso Regression
Train a LASSO model with $\alpha = 0.01$ (weak regularization) and scaled data. Again, set random_state=17.



In [19]:
lasso1 = Lasso(alpha=0.01, random_state=17)
lasso1.fit(X_train_scaled, y_train)

Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=17,
      selection='cyclic', tol=0.0001, warm_start=False)

Which feature is the least informative in predicting wine quality, according to this LASSO model?



In [22]:
lasso1_coef = pd.DataFrame({'coef': lasso1.coef_, 'abs coef': abs(lasso1.coef_)}, index=X.columns)
lasso1_coef.sort_values(by='abs coef', ascending=False)

Unnamed: 0,coef,abs coef
alcohol,0.322425,0.322425
residual sugar,0.256363,0.256363
density,-0.235492,0.235492
volatile acidity,-0.188479,0.188479
pH,0.067277,0.067277
free sulfur dioxide,0.043088,0.043088
sulphates,0.029722,0.029722
chlorides,-0.002747,0.002747
fixed acidity,-0.0,0.0
citric acid,-0.0,0.0


According to this Lasso model, fixed acidity, citric acid, and total sulfur dioxide are the least informative features.
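
The zero coefficients are a consequence of the L1 penalty, which can remove features from the model entirely. A minimal sketch of this effect, assuming synthetic data from `make_regression` and an arbitrary set of `alpha` values (neither comes from the notebook), counts the nonzero coefficients as regularization grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which actually drive the target.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=17)
X_demo = StandardScaler().fit_transform(X_demo)

# Count surviving (nonzero) coefficients for increasingly strong penalties.
nonzero_counts = {}
for alpha in [0.01, 1.0, 1000.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a very large `alpha` every coefficient is driven to zero and the model predicts only the intercept, while a small `alpha` behaves much like ordinary least squares.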

Train LassoCV with random_state=17 to choose the best value of $\alpha$ in 5-fold cross-validation.



In [25]:
alphas = np.logspace(-6, 2, 200)
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=17, n_jobs=-1)
lasso_cv.fit(X_train_scaled, y_train)

LassoCV(alphas=array([1.00000000e-06, 1.09698580e-06, 1.20337784e-06, 1.32008840e-06,
       1.44811823e-06, 1.58856513e-06, 1.74263339e-06, 1.91164408e-06,
       2.09704640e-06, 2.30043012e-06, 2.52353917e-06, 2.76828663e-06,
       3.03677112e-06, 3.33129479e-06, 3.65438307e-06, 4.00880633e-06,
       4.39760361e-06, 4.82410870e-06, 5.29197874e-06, 5.80522552e-06,
       6.36824994e-06, 6.98587975e-0...
       3.61234270e+01, 3.96268864e+01, 4.34701316e+01, 4.76861170e+01,
       5.23109931e+01, 5.73844165e+01, 6.29498899e+01, 6.90551352e+01,
       7.57525026e+01, 8.30994195e+01, 9.11588830e+01, 1.00000000e+02]),
        copy_X=True, cv=5, eps=0.001, fit_intercept=True, max_iter=1000,
        n_alphas=100, n_jobs=-1, normalize=False, positive=False,
        precompute='auto', random_state=17, selection='cyclic', tol=0.0001,
        verbose=False)

In [26]:
lasso_cv.alpha_

0.0002833096101839324
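
The selected value is very small, which suggests the tuned model will stay close to plain OLS. To illustrate how `LassoCV` arrives at `alpha_`, the sketch below (on synthetic data with a hypothetical, shorter alpha grid) averages the per-fold errors stored in `mse_path_` and picks the alpha with the lowest mean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem (assumption: stands in for the wine data).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=10.0, random_state=17)

alphas_demo = np.logspace(-3, 1, 20)
lasso_cv_demo = LassoCV(alphas=alphas_demo, cv=5, random_state=17)
lasso_cv_demo.fit(X_demo, y_demo)

# mse_path_ has one row per alpha (in the order of alphas_) and one column per fold.
mean_cv_mse = lasso_cv_demo.mse_path_.mean(axis=1)
best_idx = int(np.argmin(mean_cv_mse))

print(lasso_cv_demo.alphas_[best_idx], lasso_cv_demo.alpha_)
```

Plotting `mean_cv_mse` against `lasso_cv_demo.alphas_` on a log scale is a common way to check that the optimum does not sit on the edge of the grid.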


Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?

In [27]:
lasso_cv_coeff = pd.DataFrame({'coef': lasso_cv.coef_, 'abs coef': abs(lasso_cv.coef_)}, index=X.columns)
lasso_cv_coeff.sort_values(by='abs coef', ascending=False)

Unnamed: 0,coef,abs coef
density,-0.648161,0.648161
residual sugar,0.526883,0.526883
volatile acidity,-0.192049,0.192049
pH,0.146549,0.146549
alcohol,0.137115,0.137115
fixed acidity,0.093295,0.093295
sulphates,0.060939,0.060939
free sulfur dioxide,0.042698,0.042698
total sulfur dioxide,0.012969,0.012969
chlorides,0.006933,0.006933


According to the tuned Lasso model, citric acid is the least informative feature.

What are the mean squared errors of the tuned LASSO predictions on the training and holdout sets?

In [31]:
print("Mean squared error (train): %.3f" % mean_squared_error(y_train, lasso_cv.predict(X_train_scaled)))
print("Mean squared error (test): %.3f" % mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)))

Mean squared error (train): 0.558
Mean squared error (test): 0.583


## Random Forest

Train a Random Forest with out-of-the-box parameters, setting only random_state to be 17.

In [40]:
forest = RandomForestRegressor(random_state=17, n_jobs=-1)
forest.fit(X_train_scaled, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=17, verbose=0, warm_start=False)

What are the mean squared errors of the RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values), and on the holdout set?



In [41]:
print("Mean squared error (train): %.3f" % mean_squared_error(y_train, forest.predict(X_train_scaled)))
print("Mean squared error (cv) %.3f" % np.mean(np.abs(cross_val_score(forest, X_train_scaled, y_train, scoring='neg_mean_squared_error'))))
print("Mean squared error (holdout) %.3f" % mean_squared_error(y_test, forest.predict(X_test_scaled)))

Mean squared error (train): 0.053
Mean squared error (cv) 0.414
Mean squared error (holdout) 0.372
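
The training MSE is far lower than the cross-validation and holdout values, so the training score alone overstates how well the forest generalizes. On the scoring itself: scikit-learn scorers follow a "higher is better" convention, so `'neg_mean_squared_error'` returns negated MSE values. A minimal sketch on synthetic data (the `*_demo` names and the use of `LinearRegression` are assumptions made to keep it fast) shows the sign flip:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, for illustration only.
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0,
                                 random_state=17)

# Each fold score is the negative of that fold's MSE.
neg_mse_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                                 scoring='neg_mean_squared_error', cv=5)

# Negate the mean to recover an ordinary (non-negative) MSE.
cv_mse = -neg_mse_scores.mean()

print(neg_mse_scores)
print(cv_mse)
```

Negating the mean gives the same number as the `np.mean(np.abs(...))` form used above, since every fold score is non-positive.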


Tune the max_features and max_depth hyperparameters with GridSearchCV and again check mean cross-validation MSE and MSE on holdout set.

In [44]:
forest_params = {'max_depth': range(10, 25),
                 'max_features': range(6, 12)}

tuned_forest = GridSearchCV(RandomForestRegressor(random_state=17, n_jobs=-1), forest_params, scoring='neg_mean_squared_error', n_jobs=-1, cv=5)
tuned_forest.fit(X_train_scaled, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=-1,
                                             oob_score=False, random_state=17,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-

In [45]:
tuned_forest.best_params_, tuned_forest.best_score_

({'max_depth': 21, 'max_features': 6}, -0.39773288191505934)
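
Here `best_score_` is negative because the search maximizes negated MSE; the corresponding cross-validation MSE is about 0.398. The sketch below, on synthetic data with a deliberately tiny hypothetical grid, shows how `best_score_` relates to the per-candidate means stored in `cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem and grid so the search runs quickly (assumptions).
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=10.0,
                                 random_state=17)
params_demo = {'max_depth': [2, 4], 'max_features': [1, 2, 3]}

search_demo = GridSearchCV(RandomForestRegressor(n_estimators=10, random_state=17),
                           params_demo, scoring='neg_mean_squared_error', cv=3)
search_demo.fit(X_demo, y_demo)

# One mean score per parameter combination; the best one is the maximum.
mean_scores = search_demo.cv_results_['mean_test_score']
best_cv_mse = -search_demo.best_score_

print(search_demo.best_params_, best_cv_mse)
```

Inspecting all of `cv_results_` (for example via `pd.DataFrame(search_demo.cv_results_)`) helps to see whether neighbouring parameter values score almost as well as the winner.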

What are the mean squared errors of the tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left at their default values) and on the holdout set?

In [48]:
print("Mean squared error (cv): %.3f" % np.mean(np.abs(cross_val_score(tuned_forest.best_estimator_, X_train_scaled, y_train, 
                                                                       scoring='neg_mean_squared_error'))))
print("Mean squared error (test): %.3f" % mean_squared_error(y_test, tuned_forest.predict(X_test_scaled)))

Mean squared error (cv): 0.398
Mean squared error (test): 0.366


Output RF's feature importance. Again, it's nice to present it as a DataFrame.
What is the most important feature, according to the Random Forest model?

In [53]:
rf_fi = pd.DataFrame(tuned_forest.best_estimator_.feature_importances_, index=X.columns, columns=['coef'])
rf_fi.sort_values(by='coef', ascending=False)

Unnamed: 0,coef
alcohol,0.206056
volatile acidity,0.117578
free sulfur dioxide,0.111556
density,0.088549
pH,0.073659
total sulfur dioxide,0.07364
chlorides,0.073366
residual sugar,0.072072
citric acid,0.062601
fixed acidity,0.061813


According to the tuned Random Forest regressor, alcohol is the most important feature.
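
Unlike linear coefficients, the values in `feature_importances_` are non-negative shares that sum to one, so they rank features but carry no sign. A minimal sketch, assuming synthetic data in which only the first feature drives the target (all `*_demo` names are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on feature 0 only, plus a little noise.
rng = np.random.default_rng(17)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

forest_demo = RandomForestRegressor(n_estimators=50, random_state=17)
forest_demo.fit(X_demo, y_demo)

importances_demo = forest_demo.feature_importances_

print(importances_demo)
print(importances_demo.sum())
```

Since these importances are impurity-based and the linear coefficients are not, the two rankings need not agree, which is consistent with density leading in the linear models and alcohol leading in the forest.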