## Regularization 

<ul>
<li>An overfit model scores well on its training data but has low accuracy on unseen data.</li>
<li>Overfitting occurs when the model tries too hard to capture the noise (samples that don't represent the true pattern)
    in the training dataset.</li>
<li>The more flexible a model is, the more prone it is to overfitting.</li>
<li>Regularization shrinks the coefficients (parameters or slopes) towards zero to discourage a more complex or flexible model, which reduces the risk of overfitting.</li>
    <li>Ridge (L2 penalty) and Lasso (L1 penalty) are two options.</li>
    <li>Regularization strength is set by the parameter <b>alpha</b>; a larger alpha means stronger shrinkage.</li>
</ul>
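For reference, the Ridge and Lasso options listed above correspond to the following objectives, written the way scikit-learn's documentation states them for `Ridge` and `Lasso`. Here $w$ is the coefficient vector, $n$ is the number of samples, and $\alpha$ is the `alpha` parameter; the intercept term is left out to keep the formulas short.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The squared (L2) penalty shrinks coefficients smoothly and rarely makes them exactly zero, while the absolute-value (L1) penalty can set some coefficients exactly to zero. Because the two loss terms are scaled differently, the same numeric `alpha` does not mean the same regularization strength in both models.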

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
cars = pd.read_csv("final_cars.csv")

In [3]:
cars.columns

Index(['make', 'fuel-type', 'num-of-doors', 'body-style', 'drive-wheels',
       'curb-weight', 'engine-size', 'highway-mpg', 'wheel-base', 'price'],
      dtype='object')

In [4]:
## create X and Y
y = cars['price']
X = cars.drop(columns=['price','make','fuel-type', 'num-of-doors', 'body-style', 'drive-wheels'])

In [29]:
X

     curb-weight  engine-size  highway-mpg  wheel-base
0           2548          130           27        88.6
1           2548          130           27        88.6
2           2823          152           26        94.5
3           2337          109           30        99.8
4           2824          136           22        99.4
...          ...          ...          ...         ...
196         2952          141           28       109.1
197         3049          141           25       109.1
198         3012          173           23       109.1
199         3217          145           27       109.1


## LinearRegression

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

In [6]:
ss = StandardScaler()

In [7]:
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state = 0)

In [8]:
X_train_scaled = ss.fit_transform(X_train)

In [9]:
X_test_scaled = ss.transform(X_test)
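The fit-on-train, transform-on-test pattern used above can also be bundled into a scikit-learn `Pipeline`, which makes it harder to accidentally fit the scaler on test rows. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the car features (an assumption, since `final_cars.csv` is not available here), and the names `X_demo`, `y_demo`, `pipe`, and `manual` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric car features (not the real data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Pipeline: the scaler is fitted on the training rows only, inside fit().
pipe = make_pipeline(StandardScaler(), Ridge(alpha=3.0))
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Manual equivalent of the same steps, as done cell by cell above.
scaler = StandardScaler()
manual = Ridge(alpha=3.0).fit(scaler.fit_transform(X_tr), y_tr)
manual_pred = manual.predict(scaler.transform(X_te))

# Both routes should give the same predictions.
print(np.allclose(pipe_pred, manual_pred))
```

The pipeline version is mostly a convenience here, but it matters more once cross-validation is involved, because the scaler is then refitted for each pipeline fit.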

In [10]:
lr_model = LinearRegression()
lr_model.fit(X_train_scaled,y_train)

In [11]:
# Display coefficient for each column
for t in zip(X.columns, lr_model.coef_):
  print(f"{t[0]:25s} {t[1]:10.2f}")

In [12]:
lr_model.intercept_

12598.65

In [13]:
y_pred = lr_model.predict(X_test_scaled)

In [14]:
mse = mean_squared_error(y_test,y_pred)
print("MSE  : ", mse)
print("RMSE : ", np.sqrt(mse))

MSE  :  24273971.748518646
RMSE :  4926.862261979591


## Ridge Regression

In [15]:
from sklearn.linear_model import Ridge

In [16]:
ridge = Ridge(alpha=3.0)
ridge.fit(X_train_scaled,y_train)

In [17]:
ridge.intercept_

12598.65

In [18]:
# Display coefficient for each column
for t in zip(X.columns, ridge.coef_):
  print(f"{t[0]:25s} {t[1]:10.2f}")

curb-weight                  2474.86
engine-size                  3053.85
highway-mpg                  -518.74
wheel-base                    939.40


In [19]:
y_pred = ridge.predict(X_test_scaled)

In [20]:
mse = mean_squared_error(y_test,y_pred)
print("MSE  : ", mse)
print("RMSE : ", np.sqrt(mse))

MSE  :  24601862.70717874
RMSE :  4960.026482507804
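The Ridge run above uses a single value, `alpha=3.0`. To see how `alpha` controls shrinkage, one option is to refit the model over several values and watch the size of the coefficient vector. This is a minimal sketch on synthetic data (an assumption; it does not use the car dataset), and the alpha values and the names `X_demo_scaled` and `coef_norms` are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features (not the real car data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

alphas = [0.01, 1.0, 10.0, 100.0, 1000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo_scaled, y_demo)
    # The L2 norm of the coefficients summarizes how strongly they are shrunk.
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:8.2f}  coefficient norm={coef_norms[-1]:10.2f}")
```

The norm shrinks as `alpha` grows. Whether a larger `alpha` also lowers the test error depends on the data; in the run above, Ridge with `alpha=3.0` gave a slightly higher RMSE than plain linear regression.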


## LassoCV

In [30]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score

In [31]:
X_scaled = ss.fit_transform(X)

In [38]:
lm = LassoCV(cv=5,alphas=(3,4,5,6,7, 10))
lm.fit(X_scaled,y)

In [39]:
lm.alpha_

10
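The selected value, 10, is the largest candidate in the grid passed to `LassoCV`, which can mean the best value lies outside the grid. A common follow-up is to search a wider, log-spaced grid and check whether the choice still lands on an edge. The sketch below shows that check on synthetic data (an assumption; results on the car data may differ), with a hypothetical grid `alpha_grid`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features (not the real car data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Wider, log-spaced candidate grid from 0.01 to 1000 (illustrative choice).
alpha_grid = np.logspace(-2, 3, 30)
lasso_cv = LassoCV(cv=5, alphas=alpha_grid).fit(X_demo_scaled, y_demo)

# If the chosen alpha sits on either end of the grid, the grid may be too narrow.
on_edge = bool(
    np.isclose(lasso_cv.alpha_, alpha_grid.min())
    or np.isclose(lasso_cv.alpha_, alpha_grid.max())
)
print("chosen alpha:", lasso_cv.alpha_)
print("on grid edge:", on_edge)
```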

In [40]:
# Display coefficient for each column
for t in zip(X.columns, lm.coef_):
  print(f"{t[0]:25s} {t[1]:10.2f}")

curb-weight                  2118.81
engine-size                  4606.38
highway-mpg                  -809.12
wheel-base                    -87.07


In [41]:
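# Evaluate on the first 50 rows. Caution: these rows were also used to fit
# the model above, so the metrics below are in-sample (optimistic) and are not
# comparable to the held-out test results for LinearRegression and Ridge.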
y_test  = y[:50]
X_test_scaled = X_scaled[:50]
y_pred = lm.predict(X_test_scaled)

In [42]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test,y_pred)
print("MSE  : ",mse)
print("RMSE : ", np.sqrt(mse))

MSE  :  14079926.792988647
RMSE :  3752.3228529790244


In [43]:
r2score = r2_score(y_test,y_pred)
print(f"R2 Score: {r2score:0.2f}")

R2 Score: 0.84
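The LassoCV model above is fitted on all rows and then scored on the first 50 of those same rows, so its RMSE and R2 cannot be compared directly with the held-out results for LinearRegression and Ridge. One way to get a comparable number is to split first, fit `LassoCV` on the training part only, and score on the untouched test part. This is a sketch on synthetic data (an assumption; it does not reproduce the car results), reusing the same candidate alphas as the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features (not the real car data).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Scaling and alpha selection both see the training rows only.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, alphas=[3, 4, 5, 6, 7, 10]))
model.fit(X_tr, y_tr)

# Score on rows the model has never seen.
y_hat = model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_hat))
r2 = r2_score(y_te, y_hat)
print(f"RMSE : {rmse:0.2f}")
print(f"R2 Score: {r2:0.2f}")
```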
