# <font color=blue>Assignments for "Overfitting and Regularization"</font>

In this assignment, you'll continue working with the house prices data. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle.
- Reimplement your model from the previous lesson.
- Try OLS, Lasso, Ridge and ElasticNet regressions using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?
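
As a rough illustration of what k-fold cross-validation does when it chooses a hyperparameter, the sketch below splits the data into folds, fits on all but one fold, scores on the held-out fold, and averages. It is not part of the assignment solution: the synthetic data from `make_regression`, the helper `kfold_r2`, and the three candidate alphas are assumptions made only for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-in for the house prices features (assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

def kfold_r2(alpha, X, y, n_splits=5):
    """Return the R-squared on each held-out fold for Ridge(alpha)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=alpha)
        model.fit(X[train_idx], y[train_idx])                 # fit on k-1 folds
        scores.append(model.score(X[val_idx], y[val_idx]))   # score on the held-out fold
    return np.array(scores)

# Hypothetical candidate values; the alpha with the highest mean fold score wins
candidate_alphas = [0.01, 1.0, 100.0]
mean_scores = {a: kfold_r2(a, X_demo, y_demo).mean() for a in candidate_alphas}
best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best alpha:", best_alpha)
```

`GridSearchCV` with `cv=10`, as used in the cells below, automates the same loop over every combination in the parameter grid.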

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.api.types import is_numeric_dtype
from sklearn import metrics
import math
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

df=pd.read_csv("E:/user/Notebooks/data/house-prices/train.csv", low_memory=False)

In [29]:
df=df.drop(['PoolQC','MiscFeature','Fence','Alley'], axis=1)

In [30]:
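def fix_missing(df, col, name):
    # Fill missing values in a numeric column with that column's median
    if is_numeric_dtype(col):
        df[name] = col.fillna(col.median())

# Apply the median fill to every column; non-numeric columns are left unchanged
for n, c in df.items():
    fix_missing(df, c, n)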

In [39]:
df['OverallQual*GrLivArea']=df.OverallQual*df.GrLivArea
df['TotalBsmtSF*1stFlrSF']=df.TotalBsmtSF*df['1stFlrSF']
X = df[["YearBuilt","TotalBsmtSF",
                 "1stFlrSF",'OverallQual','GarageCars','OverallQual*GrLivArea','TotalBsmtSF*1stFlrSF']]
Y = df['SalePrice']

X_train, X_test, y_train, y_test =  train_test_split(X, Y, test_size=0.20, random_state=111)

In [40]:
from sklearn.linear_model import LinearRegression
lrm = LinearRegression()
lrm.fit(X_train, y_train)

LinearRegression()

In [41]:
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

In [42]:
from statsmodels.tools.eval_measures import mse, rmse
print("R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.8045016358776023
-----Test set statistics-----
R-squared of the model in test set is: 0.8333583070958053
Mean absolute error of the prediction is: 22138.054813140472
Mean squared error of the prediction is: 1215813607.6534686
Root mean squared error of the prediction is: 34868.51886234155
Mean absolute percentage error of the prediction is: 12.280134542952961


## Ridge

In [43]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Candidate alphas: powers of ten from 1e-10 to 1e9
parameters = {"alpha": [10 ** x for x in range(-10, 10)]}

ridge = Ridge()
grid_cv = GridSearchCV(estimator=ridge,
                       param_grid = parameters,
                       cv = 10
                      )

grid_cv.fit(X_train, y_train)

print("Best Parameters : ", grid_cv.best_params_)
print("Best Score      : ", grid_cv.best_score_)

Best Parameters :  {'alpha': 100}
Best Score      :  0.7495065348028482


In [45]:
# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.
ridgeregr = Ridge(alpha=10**2) 
ridgeregr.fit(X_train, y_train)

# Predict on the training and test sets
y_preds_train = ridgeregr.predict(X_train)
y_preds_test = ridgeregr.predict(X_test)

print("R-squared of the model in training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.8042170201917365
-----Test set statistics-----
R-squared of the model in test set is: 0.8335302842523099
Mean absolute error of the prediction is: 22108.925284376674
Mean squared error of the prediction is: 1214558866.6373405
Root mean squared error of the prediction is: 34850.52175559701
Mean absolute percentage error of the prediction is: 12.251481254458119
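
The comment in the Ridge cell says that shrinkage grows as alpha increases. A minimal sketch of that effect follows. It uses synthetic data from `make_regression` rather than the house prices features, and the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data (assumption for this sketch, not the house prices data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm of the coefficient vector summarizes how much it has been shrunk
    coef_norms.append(np.linalg.norm(model.coef_))
    print("alpha={:>8}: coefficient norm = {:.3f}".format(alpha, coef_norms[-1]))
```

The printed norm decreases as alpha grows, which is the shrinkage the ridge penalty is designed to produce.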


## Lasso

In [49]:
from sklearn.linear_model import Lasso

lasso = Lasso()
grid_cv = GridSearchCV(estimator=lasso,
                       param_grid = parameters,
                       cv = 10
                      )

grid_cv.fit(X_train, y_train)

print("Best Parameters : ", grid_cv.best_params_)
print("Best Score      : ", grid_cv.best_score_)

Best Parameters :  {'alpha': 10}
Best Score      :  0.7490709290540807


In [50]:
lassoregr = Lasso(alpha=10**1) 
lassoregr.fit(X_train, y_train)

# Predict on the training and test sets
y_preds_train = lassoregr.predict(X_train)
y_preds_test = lassoregr.predict(X_test)

print("R-squared of the model in training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.8045015523273694
-----Test set statistics-----
R-squared of the model in test set is: 0.8333707462421861
Mean absolute error of the prediction is: 22136.450217636193
Mean squared error of the prediction is: 1215722851.9537764
Root mean squared error of the prediction is: 34867.217439218985
Mean absolute percentage error of the prediction is: 12.278884852144555


## ElasticNet

In [54]:
from sklearn.linear_model import ElasticNet

# alpha spans 1e-10 to 1e9; l1_ratio runs from 0.0 (pure L2 penalty) to 0.9 in steps of 0.1
parameters = {"alpha": [10 ** x for x in range(-10, 10)],
              "l1_ratio": np.arange(0.0, 1.0, 0.1)
              }


el = ElasticNet()
grid_cv = GridSearchCV(estimator=el,
                       param_grid = parameters,
                       cv = 10
                      )

grid_cv.fit(X_train, y_train)

print("Best Parameters : ", grid_cv.best_params_)
print("Best Score      : ", grid_cv.best_score_)

Best Parameters :  {'alpha': 0.1, 'l1_ratio': 0.1}
Best Score      :  0.7495060404967996


In [55]:


elasticregr = ElasticNet(alpha=10**-1, l1_ratio=0.1) 
elasticregr.fit(X_train, y_train)

# Predict on the training and test sets
y_preds_train = elasticregr.predict(X_train)
y_preds_test = elasticregr.predict(X_test)

print("R-squared of the model in training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in training set is: 0.8041936195398951
-----Test set statistics-----
R-squared of the model in test set is: 0.8335297800806335
Mean absolute error of the prediction is: 22109.293998673034
Mean squared error of the prediction is: 1214562545.0612333
Root mean squared error of the prediction is: 34850.57452985866
Mean absolute percentage error of the prediction is: 12.251148849200465


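Ridge achieves the highest test R-squared (0.83353) and the lowest test MAE, MSE, and RMSE. ElasticNet has a marginally lower MAPE (12.2511 vs. 12.2515 for Ridge), and linear regression (OLS) has the smallest gap between training and test R-squared. Ridge also has the highest cross-validated score (0.74951). The differences between the four models are very small, but on these results Ridge is the best model.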
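
The comparison can be checked by putting the test-set metrics in one table. The sketch below copies the values printed in the cells above (rounded) into a DataFrame and reports which model is best on each metric. The names `results` and `best` are introduced here only for illustration.

```python
import pandas as pd

# Test-set metrics copied (rounded) from the outputs printed above
results = pd.DataFrame(
    {
        "test_r2": [0.8333583, 0.8335303, 0.8333707, 0.8335298],
        "mae": [22138.05, 22108.93, 22136.45, 22109.29],
        "rmse": [34868.52, 34850.52, 34867.22, 34850.57],
        "mape": [12.2801, 12.2515, 12.2789, 12.2511],
    },
    index=["OLS", "Ridge", "Lasso", "ElasticNet"],
)

# Higher is better for R-squared; lower is better for the error metrics
best = {
    "test_r2": results["test_r2"].idxmax(),
    "mae": results["mae"].idxmin(),
    "rmse": results["rmse"].idxmin(),
    "mape": results["mape"].idxmin(),
}
print(results)
print(best)
```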