# Linear Regression with Regularisation

Task:

1-Return to Question 15 at the end of Chapter 3 of the textbook ISLR (James et al, 2014). Complete part (b) of this question:
Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0 :βj =0?

2-Now repeat this using each of the following regularisation approaches:
• ridge regression (l2)
• lasso (l1)
• elastic net (l1 and l2).

3-Now do all of this on a publicly available dataset with one output variable and at least 20 predictors (input variables). Explain your choice of dataset.

** Where appropriate, use k-fold cross-validation (splitting into training and validation sets k times) to estimate the model quality.

### Answer Structure

1. Data Exploration

    1.1 Check missing values
    1.2 Correlation analysis
    1.3 Identify outliers
2. Data preparation

    2.1 Remove outliers
    2.2 Scale the dataset
3. Model fitting

    3.1 OLS
    3.2 Ridge
    3.3 Lasso
    3.4 ElasticNet
    
4. Try the whole process on a new dataset (repeat 1-3 with the energy dataset)



### 1. Data Exploration

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.linear_model import ElasticNet
from scipy import stats
from sklearn.model_selection import KFold, GridSearchCV, RepeatedKFold, RandomizedSearchCV, learning_curve, cross_val_score
from sklearn.preprocessing import StandardScaler

In [None]:
# Load dataset
boston = pd.read_csv("../input/boston/Boston.csv")
boston.describe()

#### 1.1 Check missing values

In [None]:
# 1.1 Check missing values
boston.isnull().sum()

#### 1.2 Check linearity (correlation analysis)

In [None]:
# 1.2 Correlation analysis
sns.set(rc={'figure.figsize':(15,12)})
boston_cor_matrix = boston.corr().round(2)
sns.heatmap(data=boston_cor_matrix, annot=True,cmap="vlag")


In [None]:
scatter_matrix = pd.plotting.scatter_matrix(boston, figsize=(25,25))

#### 1.3 Check outliers

Here we use the Z-score to identify outliers. 
Computing the Z-score centres and rescales each variable (subtract the mean, divide by the standard deviation); we then look for data points that lie too far from zero. 
A data point is identified as an outlier if the absolute value of its Z-score is greater than 3.

Reference: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
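
As a minimal sketch of the rule described above, the following uses a small synthetic column (made-up data, not the Boston dataset) with one deliberately extreme value, and flags the entries whose absolute Z-score exceeds 3. The names `values`, `z_scores` and `outlier_idx` are illustrative only.

```python
import numpy as np
from scipy import stats

# Synthetic column: 50 values near 10 plus one extreme value (illustrative data only)
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10.0, scale=1.0, size=50), 30.0)

# Z-score: subtract the mean and divide by the standard deviation
z_scores = np.abs(stats.zscore(values))

# Positions of the points that lie more than 3 standard deviations from the mean
outlier_idx = np.where(z_scores > 3)[0]
print(outlier_idx)
```

Only the appended extreme value should be flagged here; the threshold of 3 is the same one used in this notebook.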


In [None]:
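# Absolute Z-score of every entry; flag entries more than 3 standard deviations from the column mean
z = np.abs(stats.zscore(boston))
x = np.where(z>3)

# Count the flagged entries in the column at index 3 (expected to be "chas")
i=np.where(x[1]==3)
len(i[0])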


In [None]:
# boston[boston["chas"]>0].shape
# there are 35 records with chas=1

However, the Z-score method is not suitable for Boolean variables: it treats all records in the minority class as outliers. For example, chas is a Boolean variable in the Boston dataset. With the Z-score method, every record with chas = 1 is flagged as an outlier, because only about 7% (35/505) of the records have chas equal to 1. Therefore, we exclude “chas” from the outlier processing.
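
One way to express this exclusion directly is to leave the Boolean column out of the Z-score computation. The sketch below is an assumption-laden illustration on synthetic data: `outliers_del_excluding` is a hypothetical helper, not the function used in this notebook, and the column names `x` and `flag` are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

def outliers_del_excluding(dataset, threshold, exclude):
    """Hypothetical helper: drop rows whose |Z-score| reaches `threshold`
    in any column that is not listed in `exclude`."""
    numeric = dataset.drop(columns=exclude).to_numpy(dtype=float)
    z = np.abs(stats.zscore(numeric, axis=0))
    keep = (z < threshold).all(axis=1)
    return dataset[keep]

rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x": np.append(rng.normal(size=99), 25.0),  # last row is an extreme value
    "flag": [1] * 5 + [0] * 95,                 # rare Boolean indicator, similar to chas
})

# Naive approach: the Z-score of the Boolean column removes every flag == 1 row
naive_z = np.abs(stats.zscore(demo.to_numpy(dtype=float), axis=0))
naive = demo[(naive_z < 3).all(axis=1)]

# Excluding the Boolean column keeps those rows and still drops the extreme row
cleaned = outliers_del_excluding(demo, 3, exclude=["flag"])
print(demo.shape, naive.shape, cleaned.shape)
```

Unlike re-adding the minority rows afterwards, this variant still removes a row with flag = 1 if it is extreme in another column.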

### 2. Data Preparation

#### 2.1 Remove outliers

In [None]:
# Let's build a function for it
def outliers_del(dataset, threshold):
    z = np.abs(stats.zscore(dataset))
    dataset_no_ol = dataset[(z<threshold).all(axis=1)]
    return dataset_no_ol

In [None]:
boston_no_ol = outliers_del(boston, 3)
# add back the 35 records that have a "chas" value of 1 
boston_no_ol = pd.concat([boston_no_ol, boston[boston["chas"]>0]], ignore_index=True)
boston_no_ol.shape

In [None]:
boston.shape

This removes 56 records (outliers) from the Boston dataset.

#### 2.2 Scale the dataset
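As mentioned in the ISLR book, it is best to apply regularisation after standardising the predictors, because the penalty depends on the scale of the coefficients. For OLS, standardising the predictors does not change the fitted values or R^2.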
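
The effect of scaling can be checked on a small example. The sketch below uses two synthetic predictors on very different scales (made-up data, with an arbitrary `alpha=100.0` chosen only to make the effect visible) and compares the training R^2 of OLS and ridge regression with and without standardisation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic predictors on very different scales (illustrative data only)
X = np.column_stack([rng.normal(size=200), 1000.0 * rng.normal(size=200)])
y = 2.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_std = StandardScaler().fit_transform(X)

# OLS: R^2 is the same with and without standardisation
r2_raw = LinearRegression().fit(X, y).score(X, y)
r2_std = LinearRegression().fit(X_std, y).score(X_std, y)

# Ridge: the penalty acts on coefficient size, so the fit changes with the scale
ridge_raw = Ridge(alpha=100.0).fit(X, y).score(X, y)
ridge_std = Ridge(alpha=100.0).fit(X_std, y).score(X_std, y)

print(round(r2_raw, 6), round(r2_std, 6))
print(round(ridge_raw, 6), round(ridge_std, 6))
```

On unscaled data the large-scale predictor is barely penalised, which is why the two ridge scores differ while the two OLS scores agree.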

In [None]:
# Standardise the whole dataset (predictors and target) so that MSE is comparable across datasets
def get_data(dataset, target):
    sc = StandardScaler()
    dataset_scaled = sc.fit_transform(dataset)
    dataset_scaled = pd.DataFrame(dataset_scaled, columns=dataset.columns)
    # Split the scaled data into predictors and response
    X_scaled = dataset_scaled.drop(columns=[target])
    y_scaled = dataset_scaled[target]
    return X_scaled, y_scaled

### 3. Model fitting

#### Build the stats result function

These functions can be used with each model, passing in parameters (if any).
Reference:  https://www.kaggle.com/suugaku/islr-lab-2-python
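
For reference, the textbook OLS quantities behind the coefficient table can be written as follows, where $X$ is the $n \times (p+1)$ design matrix with a leading column of ones, $n$ is the number of samples and $p$ the number of predictors. These are the standard formulas under the usual OLS assumptions; they are not exact for penalised fits such as ridge or lasso.

$$
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - p - 1}, \qquad
\mathrm{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}}, \qquad
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}
$$

Under $H_0: \beta_j = 0$, $t_j$ follows a t-distribution with $n - p - 1$ degrees of freedom, so the two-sided p-value is $2\,P(T_{n-p-1} > |t_j|)$.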

In [None]:
def stats_result(model_input, X_input, y_input):

    model = model_input
    X = X_input
    y = y_input
    model.fit(X, y)
    
    # Store the coefficients (regression intercept and coefficients) and predictions
    coefficients = np.append(model.intercept_, model.coef_)
    predictions = model.predict(X)
    
    # Create matrix with shape (num_samples, num_features + 1)
    # Where the first column is all ones and then there is one column for the values of each feature or predictor
    X_mat = np.append(np.ones((X.shape[0], 1)), X, axis = 1)
    
    # Compute residual sum of squares
    RSS = np.sum((y - predictions)**2)
    
    # Compute total sum of squares
    TSS = np.sum((np.mean(y) - y)**2)
    
    # Mean squared error 
    MSE = RSS / X_mat.shape[0]
    
    # Estimate the residual variance: RSS / (num_samples - num_features - 1)
    obs_var = RSS/(X_mat.shape[0] - X_mat.shape[1]) 
    # Variances of the parameter estimates are on the diagonal of the variance-covariance matrix of the parameter estimates
    # Note: this is the OLS formula; for penalised fits (Ridge, Lasso, ElasticNet) the resulting
    # standard errors and p-values are only indicative
    var_beta = obs_var*(np.linalg.inv(np.matmul(X_mat.T, X_mat)).diagonal())
    # Standard error is square root of variance
    se_beta = np.sqrt(var_beta)
    
    # t-statistic for beta_i is beta_i/se_i where se_i is the standard error for beta_i
    t_stats_beta = coefficients/se_beta
    
    # Compute two-sided p-values using a t-distribution with (num_samples - num_features - 1) degrees of freedom
    beta_p_values = [2 * stats.t.sf(np.abs(t_i), X_mat.shape[0] - X_mat.shape[1])
                    for t_i in t_stats_beta]
    
    # Construct dataframe for the overall model statistics:
    
    # MSE, R^2
    model_scores = pd.Series({"MSE": MSE, "R-squared": model.score(X, y)})
    
    # Construct dataframe for parameter statistics:
    # coefficients, standard errors, t-statistic, p-values for t-statistics
    xlabels = X.columns.insert(0, "Intercept")
    coef_stats = pd.DataFrame({"Coefficient": coefficients, "Standard Error": se_beta,
                                "t-value": t_stats_beta, "Prob(>|t|)": beta_p_values}, index=xlabels)
    return {"model": model, "coef_stats": coef_stats, "scores": model_scores}

In [None]:
def print_stats_result(stats_result):
    print(stats_result["model"])
    print("{:=^60}".format("Score"))
    print(np.round(stats_result["scores"], 4))
    print("{:=^60}".format("Coefficients Statistics"))
    print(np.round(stats_result["coef_stats"], 4))
    print("{:=^60}".format("Predictors Where We Can Reject H0 (P < 0.05)"))
    coef_stats = stats_result["coef_stats"]
    print(np.round(coef_stats[coef_stats["Prob(>|t|)"] < 0.05], 4))

#### 3.1 OLS

In [None]:
# OLS linear regression model
X_boston_org, y_boston_org = get_data(boston, "crim")
lr = LinearRegression()
lr_result = stats_result(lr, X_boston_org, y_boston_org)
print_stats_result(lr_result)

In [None]:
X_boston, y_boston = get_data(boston_no_ol, "crim")
lr_result_no_ol = stats_result(lr, X_boston, y_boston)
print_stats_result(lr_result_no_ol)

From the two results above, we can see that both R^2 and MSE improve after removing the outliers. 

Therefore, in the following steps, we will use the outlier-removed and scaled dataset.
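
The R^2 and MSE reported by `stats_result` are computed on the same data the model was fitted on. A minimal sketch of how k-fold cross-validation could give a held-out estimate for OLS is shown below. It uses synthetic data from `make_regression` as a stand-in for `X_boston` and `y_boston`; the names `X_demo` and `y_demo` and the fold settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_boston / y_boston (illustrative data only)
X_demo, y_demo = make_regression(n_samples=300, n_features=13, noise=20.0, random_state=0)

# 5 folds: each fold is used once for validation and four times for training
cv = KFold(n_splits=5, shuffle=True, random_state=1)
r2_folds = cross_val_score(LinearRegression(), X_demo, y_demo, scoring="r2", cv=cv)
# scikit-learn reports negative MSE, so flip the sign to get MSE
mse_folds = -cross_val_score(LinearRegression(), X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=cv)

print("mean R^2: %.3f" % r2_folds.mean())
print("mean MSE: %.3f" % mse_folds.mean())
```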

#### Build a parameter-tuning function for the regularised models

In [None]:
def param_tuning(model, X, y, params, n):
    '''use grid search and K-Fold cross validation to find the best parameters for the regularisation models
    where n is the number of folds'''
    cv = RepeatedKFold(n_splits=n, n_repeats=3, random_state=1)
#     cv = KFold(n_splits=n, shuffle=True)
    gs_r2 = GridSearchCV(model,
                      params,
                      scoring="r2",
                      cv=cv,
                      n_jobs=-1,
                      return_train_score=True)
    
    gs_mse = GridSearchCV(model,
                      params,
                      scoring="neg_mean_squared_error",
                      cv=cv,
                      n_jobs=-1,
                      return_train_score=True)
    results_r2 = gs_r2.fit(X, y)
    results_mse = gs_mse.fit(X, y)
    return results_r2, results_mse
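
As a possible alternative to running two separate grid searches, `GridSearchCV` accepts a dictionary of scorers and records all of them in one pass. The sketch below is a hypothetical variant (`param_tuning_multi` is not part of this notebook) run on synthetic data; the metric labels `"r2"` and `"neg_mse"` and the alpha grid are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

def param_tuning_multi(model, X, y, params, n):
    """Hypothetical variant: one grid search that records both R^2 and negative MSE."""
    cv = RepeatedKFold(n_splits=n, n_repeats=3, random_state=1)
    gs = GridSearchCV(model, params,
                      scoring={"r2": "r2", "neg_mse": "neg_mean_squared_error"},
                      refit="neg_mse",  # best_params_ is chosen by this metric
                      cv=cv, return_train_score=True)
    return gs.fit(X, y)

# Synthetic data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
grid = np.arange(0.5, 5.0, 0.5)
results = param_tuning_multi(Ridge(), X_demo, y_demo, {"alpha": grid}, n=5)

print(results.best_params_)
print(results.cv_results_["mean_test_r2"][:3])
print(results.cv_results_["mean_test_neg_mse"][:3])
```

With several scorers, the entries in `cv_results_` are suffixed with the metric label (for example `mean_test_r2`), so a plotting helper that reads `mean_test_score` would need to be adapted.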


In [None]:
def result_plot(alphas, results):
    print("%s: %.3f" % (results.scoring, results.best_score_))
    print("best config: %s" % results.best_params_)
    train_scores_mean = results.cv_results_["mean_train_score"]
    test_scores_mean = results.cv_results_["mean_test_score"]
    plt.figure(figsize=(8, 6))
    plt.title("LR with %s" %(results.estimator.__class__.__name__))
    plt.xlabel('$\\alpha$ (alpha)')
    plt.ylabel(results.scoring)
    # plot train scores
    plt.plot(alphas, train_scores_mean, label='Mean Train score', color="r", linewidth=2.0)
    plt.plot(alphas, test_scores_mean, label='Mean Test score', color="b", linewidth=2.0)
    plt.legend(loc='best')


#### 3.2 Ridge Regression

In [None]:
# tuning ridge model
ridge = Ridge()
alphas = np.arange(0, 20, 0.1)
params = {'alpha': alphas}
result_r2, result_mse = param_tuning(ridge, X_boston, y_boston, params, n=5)
result_plot(alphas, result_r2)
result_plot(alphas, result_mse)

In [None]:
# according to the tuning result for the MSE score, the best parameter is alpha = 3.5
ridge_tunned = Ridge(alpha=3.5)
ridge_result = stats_result(ridge_tunned, X_boston, y_boston)
print_stats_result(ridge_result)

#### 3.3 Lasso

In [None]:
# tuning lasso
lasso = Lasso()
alphas = np.arange(0, 2, 0.01)
params = {'alpha': alphas}
result_r2, result_mse = param_tuning(lasso, X_boston, y_boston, params, n=5)
result_plot(alphas, result_r2)
result_plot(alphas, result_mse)

In [None]:
# according to the tuning result for the MSE score, the best parameter is alpha = 0.01
lasso = Lasso(alpha=0.01)
lasso_result = stats_result(lasso, X_boston, y_boston)
print_stats_result(lasso_result)

#### 3.4 ElasticNet

In [None]:
# tuning elastic net 
elasticnet = ElasticNet()
alphas = np.arange(0, 2, 0.1)
l1_ratio = np.arange(0, 1, 0.1)
params = {'alpha': alphas, 'l1_ratio':l1_ratio}
result_r2, result_mse = param_tuning(elasticnet, X_boston, y_boston, params, n=5)


In [None]:
print("%s: %.3f" % (result_r2.scoring, result_r2.best_score_))
print("best config: %s" % result_r2.best_params_)
print("%s: %.3f" % (result_mse.scoring, result_mse.best_score_))
print("best config: %s" % result_mse.best_params_)

In [None]:
# according to the tuning result for the r^2 score, the best parameters are alpha = 0 and l1_ratio = 0
# (with alpha = 0 the penalty vanishes, so this is equivalent to ordinary least squares)
elasticnet = ElasticNet(alpha=0, l1_ratio=0)
elasticnet_result = stats_result(elasticnet, X_boston, y_boston)
print_stats_result(elasticnet_result)

### 4. Try the whole process on a new dataset

In [None]:
energy = pd.read_csv("../input/appliances-energy-prediction/KAG_energydata_complete.csv")
energy = energy.drop(columns=["date", "rv1", "rv2"])
energy.describe()

In [None]:
# Check missing value
energy.isnull().sum()

In [None]:
# Correlation analysis
sns.set(rc={'figure.figsize':(15,12)})
energy_cor_matrix = energy.corr().round(2)
sns.heatmap(data=energy_cor_matrix, annot=True,cmap="vlag")

The correlations between "Appliances" and the other variables appear to be weak.
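
With many predictors, the heatmap can be hard to read, so it may help to list the correlations with the target directly. The sketch below uses a small synthetic frame in place of `energy`; apart from "Appliances", the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic frame standing in for `energy` (column names other than "Appliances" are made up)
demo = pd.DataFrame({"strong": rng.normal(size=n), "weak": rng.normal(size=n)})
demo["Appliances"] = 3.0 * demo["strong"] + 0.1 * demo["weak"] + rng.normal(size=n)

# Absolute correlation of every predictor with the target, largest first
target_corr = (demo.corr()["Appliances"]
               .drop("Appliances")
               .abs()
               .sort_values(ascending=False))
print(target_corr.round(2))
```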

In [None]:
# scatter_matrix = pd.plotting.scatter_matrix(energy, figsize=(25,25))

In [None]:
# remove outliers
energy_no_ol = outliers_del(energy, 3)
energy_no_ol.shape

In [None]:
energy.shape

We remove 2391 rows of outliers.

In [None]:
# Scale the dataset
X_energy, y_energy = get_data(energy_no_ol, "Appliances")
X_energy_org, y_energy_org = get_data(energy, "Appliances")

In [None]:
# OLS with outliers
energy_lr_result = stats_result(lr, X_energy_org, y_energy_org)
print_stats_result(energy_lr_result)

In [None]:
# OLS (without outlier)
energy_lr_result = stats_result(lr, X_energy, y_energy)
print_stats_result(energy_lr_result)

In [None]:
# tuning ridge model
alphas = np.arange(0, 20, 0.1)
params = {'alpha': alphas}
result_r2, result_mse = param_tuning(ridge, X_energy, y_energy, params, n=5)
result_plot(alphas, result_r2)
result_plot(alphas, result_mse)

In [None]:
# according to the tuning result for the MSE score, the best parameter is alpha = 8.5
ridge_tunned_energy = Ridge(alpha=8.7)
ridge_result = stats_result(ridge_tunned_energy, X_energy, y_energy)
print_stats_result(ridge_result)

In [None]:
# tuning lasso model
alphas = np.arange(0, 2, 0.1)
params = {'alpha': alphas}
result_r2, result_mse = param_tuning(lasso, X_energy, y_energy, params, n=5)
result_plot(alphas, result_r2)
result_plot(alphas, result_mse)

In [None]:
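# according to the tuning result for the r^2 score, the best parameter is alpha = 0
# (alpha = 0 is equivalent to ordinary least squares; scikit-learn advises against it for Lasso for numerical reasons)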
lasso_tunned_energy = Lasso(alpha=0)
lasso_result_energy = stats_result(lasso_tunned_energy, X_energy, y_energy)
print_stats_result(lasso_result_energy)

In [None]:
# tuning elastic net 
alphas = np.arange(0, 2, 0.1)
l1_ratio = np.arange(0, 1, 0.1)
params = {'alpha': alphas, 'l1_ratio':l1_ratio}
result_r2, result_mse = param_tuning(elasticnet, X_energy, y_energy, params, n=5)

In [None]:
print("%s: %.3f" % (result_r2.scoring, result_r2.best_score_))
print("best config: %s" % result_r2.best_params_)
print("%s: %.3f" % (result_mse.scoring, result_mse.best_score_))
print("best config: %s" % result_mse.best_params_)

In [None]:
elasticnet_tunned_energy = ElasticNet(alpha=0, l1_ratio=0)
elasticnet_result_energy = stats_result(elasticnet_tunned_energy, X_energy, y_energy)
print_stats_result(elasticnet_result_energy)

Overall, the models with regularisation perform worse than OLS.
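
One way to check such a comparison on held-out data is to score all four models with the same cross-validation splits. The sketch below is only an illustration on synthetic data: the penalty strengths are arbitrary placeholders rather than the tuned values from this notebook, and the scaler is placed inside a pipeline so that it is refitted on each training fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 25 predictors, only 5 of them informative (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=25, n_informative=5,
                                 noise=15.0, random_state=0)

# Penalty strengths are arbitrary placeholders, not tuned values from the notebook
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.5),
    "ElasticNet": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_mse = {}
for name, model in models.items():
    # Scaling inside the pipeline is refitted on each training fold
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=cv)
    cv_mse[name] = -scores.mean()

for name, mse in cv_mse.items():
    print("%-10s held-out MSE: %.2f" % (name, mse))
```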