## Regularization 

<ul>
<li>An overfitted model scores well on the training data but has low accuracy on unseen data.</li>
<li>Overfitting occurs when the model tries too hard to capture the noise (samples that don't represent the true pattern)
    in your training dataset.</li>
<li>The more flexible a model is, the more prone it is to overfitting.</li>
<li>Regularization shrinks the coefficients (parameters or slopes) towards zero to discourage a more complex or flexible model, reducing the risk of overfitting.</li>
    <li>Ridge and Lasso are two options.</li>
    <li>Regularization strength is set by the parameter <b>alpha</b>; larger values shrink the coefficients more.</li>
</ul>    
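To make the shrinkage concrete, here is a minimal sketch using only NumPy and synthetic data. The data, the helper name `ridge_coefficients`, and the alpha values are illustrative assumptions, not part of the cars dataset used below. It solves the ridge problem in closed form, `w = (X^T X + alpha * I)^-1 X^T y`, and prints how the coefficient norm falls as `alpha` grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, centered data: 100 samples, 3 features (illustrative only)
X_demo = rng.normal(size=(100, 3))
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([5.0, -3.0, 1.0]) + rng.normal(scale=0.5, size=100)
y_demo -= y_demo.mean()


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution for centered data (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


alphas_demo = [0.0, 10.0, 100.0, 1000.0]
norms_demo = []
for alpha in alphas_demo:
    w = ridge_coefficients(X_demo, y_demo, alpha)
    norms_demo.append(np.linalg.norm(w))
    print(f"alpha={alpha:7.1f}  coefficients={np.round(w, 2)}  norm={norms_demo[-1]:.2f}")
```

With `alpha=0` this is ordinary least squares; larger values pull every coefficient toward zero, though ridge does not generally make any of them exactly zero.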

In [48]:
# Import data-handling and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [49]:
cars = pd.read_csv("final_cars.csv")

In [50]:
cars.columns

Index(['make', 'fuel-type', 'num-of-doors', 'body-style', 'drive-wheels',
       'length', 'width', 'curb-weight', 'engine-size', 'highway-mpg',
       'price'],
      dtype='object')

In [51]:
# Create the target y and the feature matrix X (numeric columns only)
y = cars['price']
X = cars.drop(columns=['price','make','fuel-type', 'num-of-doors', 'body-style', 'drive-wheels'])

### LinearRegression

In [52]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [53]:
X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state = 0)

In [54]:
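# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions.
lr_model = LinearRegression(normalize=True)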
lr_model.fit(X_train,y_train)

LinearRegression(normalize=True)

In [55]:
# Display coefficient for each column
for name, coef in zip(X.columns, lr_model.coef_):
    print(f"{name:25s} {coef:10.2f}")

length                         10.48
width                         528.78
curb-weight                     4.88
engine-size                    72.33
highway-mpg                   -46.98


In [62]:
y_pred = lr_model.predict(X_test)

In [63]:
mse = mean_squared_error(y_test,y_pred)
print("MSE  : ", mse)
print("RMSE : ", np.sqrt(mse))

MSE  :  22211616.794805497
RMSE :  4712.920198221639


### Ridge Regression

In [57]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


In [58]:
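# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions.
ridge = Ridge(normalize=True, alpha=2.0)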
ridge.fit(X_train,y_train)

Ridge(alpha=2.0, normalize=True)
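The `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2. On newer versions, a common replacement is to scale the features explicitly in a pipeline. The sketch below uses synthetic stand-in data (`X_demo` and `y_demo` are assumptions, not the cars data). Because `normalize=True` divided each centered column by its L2 norm while `StandardScaler` divides by the standard deviation, the same `alpha` does not give the same fit, so `alpha` would need re-tuning.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only)
X_demo = rng.normal(loc=[170.0, 2500.0, 120.0], scale=[12.0, 500.0, 40.0], size=(150, 3))
y_demo = (
    50.0 * X_demo[:, 0]
    + 5.0 * X_demo[:, 1]
    + 80.0 * X_demo[:, 2]
    + rng.normal(scale=1000.0, size=150)
)

# Standardize each feature, then fit ridge on the scaled values
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=2.0))
ridge_pipe.fit(X_demo, y_demo)

# Coefficients are expressed per standard deviation of each feature
scaled_coefs = ridge_pipe.named_steps["ridge"].coef_
print(scaled_coefs)
```

Scaling inside the pipeline also means the scaler is fitted only on the data passed to `fit`, which keeps test rows from influencing the scaling.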

In [59]:
# Display coefficient for each column
for name, coef in zip(X.columns, ridge.coef_):
    print(f"{name:25s} {coef:10.2f}")

length                         68.77
width                         445.98
curb-weight                     2.06
engine-size                    28.91
highway-mpg                  -109.07


In [60]:
y_pred = ridge.predict(X_test)

In [61]:
mse = mean_squared_error(y_test,y_pred)
print("MSE  : ", mse)
print("RMSE : ", np.sqrt(mse))

MSE  :  38913820.519167095
RMSE :  6238.094301881552
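Here ridge with `alpha=2.0` gives a higher test RMSE (about 6238) than plain linear regression (about 4713), which may indicate that this penalty is too strong for the data. One way to pick `alpha` is cross-validation, for example with `RidgeCV`. This is a minimal sketch on synthetic data; the candidate alphas and the data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data: 5 features, two of which have no effect on the target
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Try several penalty strengths and keep the one with the best 5-fold CV score
candidate_alphas = (0.01, 0.1, 1.0, 10.0, 100.0)
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5)
ridge_cv.fit(X_tr, y_tr)

rmse_demo = np.sqrt(mean_squared_error(y_te, ridge_cv.predict(X_te)))
print("chosen alpha:", ridge_cv.alpha_)
print("test RMSE   :", rmse_demo)
```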


### LassoCV

In [71]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score

In [75]:
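# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions.
lm = LassoCV(normalize=True, cv=5, alphas=(50, 45, 40, 35, 25, 10))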
lm.fit(X,y)

LassoCV(alphas=(50, 45, 40, 35, 25, 10), cv=5, normalize=True)

In [76]:
lm.alpha_

25

In [77]:
# Display coefficient for each column
for name, coef in zip(X.columns, lm.coef_):
    print(f"{name:25s} {coef:10.2f}")

length                          0.00
width                         401.63
curb-weight                     2.35
engine-size                   107.86
highway-mpg                   -96.49


In [78]:
# Coefficients that Lasso shrank exactly to zero
lm.coef_[np.abs(lm.coef_) == 0]

array([0.])
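The zero coefficient for `length` illustrates a property of the lasso penalty: unlike ridge, it can set coefficients exactly to zero, effectively dropping features. Below is a small sketch on synthetic data where only the first two of five features matter. The data and `alpha=0.5` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only features 0 and 1 influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("lasso:", np.round(lasso_demo.coef_, 3))
print("ridge:", np.round(ridge_demo.coef_, 3))
print("features kept by lasso:", np.flatnonzero(lasso_demo.coef_))
```

Ridge leaves small non-zero weights on the irrelevant features, while lasso removes them, which is why lasso is sometimes used as a simple form of feature selection.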

In [80]:
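# Evaluate on the first 50 rows.
# Caution: lm was fitted on all of X and y, so these rows are not unseen data
# and the scores below are likely optimistic.
y_test = y[:50]
X_test = X[:50]
y_pred = lm.predict(X_test)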

In [81]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test,y_pred)
print("MSE  : ",mse)
print("RMSE : ", np.sqrt(mse))

MSE  :  13878867.909706945
RMSE :  3725.4352644633277


In [82]:
r2score = r2_score(y_test,y_pred)
print(f"R2 Score: {r2score:0.2f}")

R2 Score: 0.85
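Since `lm` was fitted on all rows, the first 50 rows are not unseen data, and the scores above are likely optimistic. Here is a sketch of the same workflow with a genuine hold-out split, using synthetic data in place of `final_cars.csv` (an assumption so the example runs standalone) and a `StandardScaler` pipeline instead of `normalize=True`:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 30% before any fitting, including the choice of alpha
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# LassoCV picks alpha by 5-fold cross-validation on the training rows only
lasso_cv = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso_cv.fit(X_tr, y_tr)

y_hat = lasso_cv.predict(X_te)
rmse_holdout = np.sqrt(mean_squared_error(y_te, y_hat))
r2_holdout = r2_score(y_te, y_hat)
print(f"RMSE: {rmse_holdout:.3f}  R2: {r2_holdout:.3f}")
```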
