## Linear, Ridge and Lasso Regression

This notebook uses Tesla (TSLA) stock price data from 2010 to 2020 to compare linear, ridge and lasso regression and see which model predicts the target, `Volume`, most accurately.

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
# Only the estimators, metrics and helpers used in this notebook are imported
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

## Loading the Dataset

In [2]:
df = pd.read_csv(r"C:\Users\19736\Desktop\Surabhi\Surabhi Files\Stats and ML\Machine Learning\TSLA.csv")

In [3]:
df.head()

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.0 | 25.0 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 2010-06-30 | 25.790001 | 30.42 | 23.299999 | 23.83 | 23.83 | 17187100 |
| 2 | 2010-07-01 | 25.0 | 25.92 | 20.27 | 21.959999 | 21.959999 | 8218800 |
| 3 | 2010-07-02 | 23.0 | 23.1 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 2010-07-06 | 20.0 | 20.0 | 15.83 | 16.110001 | 16.110001 | 6866900 |


## Dropping the unnecessary columns

In [4]:
df.drop(['Date'],axis='columns',inplace=True)
df.head()

|   | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 0 | 19.0 | 25.0 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 25.790001 | 30.42 | 23.299999 | 23.83 | 23.83 | 17187100 |
| 2 | 25.0 | 25.92 | 20.27 | 21.959999 | 21.959999 | 8218800 |
| 3 | 23.0 | 23.1 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 20.0 | 20.0 | 15.83 | 16.110001 | 16.110001 | 6866900 |


In [5]:
target_column = ['Volume']
# Keep the original column order; a set difference returns columns in arbitrary order
predictors = [col for col in df.columns if col not in target_column]

## Splitting the data into train and test sets

In [6]:
X = df[predictors].values
y = df[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape)
print(X_test.shape)

(1691, 5)
(725, 5)
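
The printed shapes can be checked by hand. The sketch below assumes that a fractional `test_size` is turned into a row count by rounding up, with the remaining rows going to the training set; the helper name `split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the remainder is used
    for training, which reproduces the shapes printed above.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Total number of rows implied by the two printed shapes
n_rows = 1691 + 725
print(split_sizes(n_rows, 0.30))
```

With 2416 rows and `test_size=0.30`, this gives 1691 training rows and 725 test rows, matching the output above.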


## Linear Regression

The simplest form of regression is linear regression, which assumes that the predictors have a linear relationship with the target variable. The linear regression equation can be expressed in the following form:

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n + b$$
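
To make the equation concrete, here is a minimal sketch that evaluates it for a single observation. The function name `predict_linear` and all numbers are made up for illustration; they are not coefficients fitted on the Tesla data.

```python
def predict_linear(coefs, intercept, x):
    """Evaluate y = a1*x1 + ... + an*xn + b for one observation."""
    if len(coefs) != len(x):
        raise ValueError("need exactly one coefficient per predictor")
    return sum(a * xi for a, xi in zip(coefs, x)) + intercept


# Made-up coefficients and inputs, for illustration only
coefs = [2.0, -1.0, 0.5]
intercept = 10.0
x = [3.0, 4.0, 8.0]

# 2*3 - 1*4 + 0.5*8 + 10 = 16
print(predict_linear(coefs, intercept, x))
```

Fitting a linear regression amounts to choosing the coefficients and the intercept so that these predictions are as close as possible to the observed targets.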

To fit the linear regression model, the first line of code below instantiates the estimator and the second line fits it on the training set.

In [7]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

The first line of code below predicts on the training set. The second and third lines print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test set in the fourth to sixth lines.

In [8]:
pred_train_lr= lr.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_lr))) #RMSE on training set
print(r2_score(y_train, pred_train_lr)) #R-squared on training set

pred_test_lr= lr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_lr))) #RMSE on test set
print(r2_score(y_test, pred_test_lr)) #R-squared on test set

3399901.9578878703
0.5570951909627421
3023094.1421532887
0.5844712752950807


For the above model, the training-set RMSE is about 3,399,902 and the test-set RMSE is about 3,023,094, both in the units of `Volume`. The training-set R-squared is about 55.7% and the test-set value is about 58.4%.
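
For reference, both metrics can be computed by hand. The sketch below is a dependency-free illustration of the standard definitions, not the scikit-learn implementation; the function names and the toy values are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(rmse(y_true, y_pred))       # sqrt(2 / 4), about 0.707
print(r_squared(y_true, y_pred))  # 1 - 2 / 20, about 0.9
```

RMSE is expressed in the units of the target, so lower is better, while R-squared is unitless and higher is better.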

## Ridge Regression

Perhaps the most common form of regularization is known as ridge regression or L2 regularization. This proceeds by penalizing the sum of squares (the squared 2-norm) of the model coefficients; in this case, the penalty on the model fit would be

$$P = \alpha \sum_{n=1}^{N} \theta_n^2$$

where α is a free parameter that controls the strength of the penalty.
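
As a small illustration of the penalty term on its own, the sketch below computes it for a made-up coefficient vector. The function name `ridge_penalty` is hypothetical, and the coefficients are not taken from the fitted model.

```python
def ridge_penalty(coefs, alpha):
    """L2 penalty: alpha times the sum of squared coefficients."""
    return alpha * sum(c ** 2 for c in coefs)


# Made-up coefficients, for illustration only
coefs = [3.0, -4.0, 0.0]

# 0.05 * (9 + 16 + 0) = 1.25
print(round(ridge_penalty(coefs, alpha=0.05), 4))
```

Because the coefficients are squared, large coefficients dominate the penalty, and setting α to 0 removes the penalty entirely.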

In scikit-learn, a ridge regression model is constructed by using the Ridge class. The first line of code below instantiates the Ridge Regression model with an alpha value of 0.05. The second line fits the model to the training data. The third line of code predicts, while the fourth and fifth lines print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test set in the sixth to eighth lines of code.

In [9]:
rr = Ridge(alpha=0.05)
rr.fit(X_train, y_train) 
pred_train_rr= rr.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_rr))) #RMSE on training set
print(r2_score(y_train, pred_train_rr)) #R-squared on training set

pred_test_rr= rr.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_rr))) #RMSE on test set
print(r2_score(y_test, pred_test_rr)) #R-squared on test set

3399901.957902207
0.5570951909590067
3023093.9541161833
0.584471326987031


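For the above model, the training-set RMSE is about 3,399,902 and the test-set RMSE is about 3,023,094. The training-set R-squared is about 55.7% and the test-set value is about 58.4%. These figures are practically identical to the linear regression results, which suggests that a penalty of alpha = 0.05 barely changes the fit on this data.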

## Lasso Regression

Another very common type of regularization is known as lasso or L1 regularization, and involves penalizing the sum of absolute values (the 1-norm) of the regression coefficients:

$$P = \alpha \sum_{n=1}^{N} |\theta_n|$$
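
The same kind of sketch applies to the lasso penalty, with absolute values in place of squares. The function name `lasso_penalty` and the coefficients are again made up for illustration.

```python
def lasso_penalty(coefs, alpha):
    """L1 penalty: alpha times the sum of absolute coefficient values."""
    return alpha * sum(abs(c) for c in coefs)


# Made-up coefficients, for illustration only
coefs = [3.0, -4.0, 0.0]

# 0.05 * (3 + 4 + 0) = 0.35
print(round(lasso_penalty(coefs, alpha=0.05), 4))
```

Here every coefficient contributes in proportion to its magnitude rather than its square.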

In scikit-learn, a lasso regression model is constructed by using the Lasso class. The first line of code below instantiates the Lasso Regression model with an alpha value of 0.05. The second line fits the model to the training data.

The third line of code predicts, while the fourth and fifth lines print the evaluation metrics, RMSE and R-squared, on the training set. The same steps are repeated on the test set in the sixth to eighth lines of code.

In [10]:
model_lasso = Lasso(alpha=0.05)
model_lasso.fit(X_train, y_train) 
pred_train_lasso= model_lasso.predict(X_train)
print(np.sqrt(mean_squared_error(y_train,pred_train_lasso))) #RMSE on training set
print(r2_score(y_train, pred_train_lasso)) #R-squared on training set

pred_test_lasso= model_lasso.predict(X_test)
print(np.sqrt(mean_squared_error(y_test,pred_test_lasso))) #RMSE on test set
print(r2_score(y_test, pred_test_lasso)) #R-squared on test set

3730466.6761576566
0.46678305985431434
3349838.4047389287
0.4897941637500035




For the above model, the training-set RMSE is about 3,730,467 and the test-set RMSE is about 3,349,838. The training-set R-squared is about 46.7% and the test-set value is about 49.0%.

## Conclusion

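From the above models, Lasso Regression performs worst on both metrics: it has the lowest R-squared (about 49.0% on the test set) and the highest RMSE (about 3.35 million on the test set). Linear Regression and Ridge Regression are practically indistinguishable, with a test-set R-squared of about 58.4% and a test-set RMSE of about 3.02 million; Ridge Regression is ahead on the test set only by a negligible margin.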
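
As a cross-check, the test-set metrics printed by the three cells can be ranked programmatically. The values below are copied from the outputs above; the dictionary layout and variable names are only an illustrative choice.

```python
# Test-set metrics copied from the cell outputs above
results = {
    "Linear": {"rmse": 3023094.1421532887, "r2": 0.5844712752950807},
    "Ridge": {"rmse": 3023093.9541161833, "r2": 0.584471326987031},
    "Lasso": {"rmse": 3349838.4047389287, "r2": 0.4897941637500035},
}

# Lower RMSE is better; higher R-squared is better
by_rmse = sorted(results, key=lambda name: results[name]["rmse"])
by_r2 = sorted(results, key=lambda name: results[name]["r2"], reverse=True)

print("Best to worst by RMSE:", by_rmse)
print("Best to worst by R-squared:", by_r2)
```

Both rankings put Ridge first, Linear second and Lasso last, although the gap between Ridge and Linear is far too small to be meaningful.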