## Day 28 Lecture 2 Assignment

In this assignment, we will learn about overfitting and regularization. We will use the King County housing dataset loaded below and fit regularized regression models to it.

In [1]:
%matplotlib inline

import math
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from scipy.stats import bartlett, levene, jarque_bera, normaltest
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [2]:
king_county = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/kc_house_data.csv')

In [3]:
king_county.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Perform the same transformations as in the previous assignment to meet model assumptions:
1. Remove all columns except: price, bedrooms, bathrooms, sqft_living, floors, waterfront
1. Remove outliers
1. Split the data into train and test subsets. 20% of the data should be in the test subset

In [4]:
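# Keep only the columns needed for the model; .copy() avoids SettingWithCopyWarning
kc = king_county[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'floors', 'waterfront']].copy()
kc["has_waterfront"] = pd.get_dummies(kc.waterfront, drop_first=True)

# Remove the two extreme records: the 33-bedroom listing and the 7.7M sale
kc = kc[kc.bedrooms != 33]
kc = kc[kc.price != 7700000.0]
# dropna() returns a new frame, so the result has to be assigned back
kc = kc.dropna()

kc = kc.drop('waterfront', axis=1)
kc.info()

X = kc.drop('price', axis=1)
Y = kc.price

# Hold out 20% of the rows for testing (pass random_state for a reproducible split)
X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2)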

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21611 entries, 0 to 21612
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   price           21611 non-null  float64
 1   bedrooms        21611 non-null  int64  
 2   bathrooms       21611 non-null  float64
 3   sqft_living     21611 non-null  int64  
 4   floors          21611 non-null  float64
 5   has_waterfront  21611 non-null  uint8  
dtypes: float64(3), int64(2), uint8(1)
memory usage: 1.0 MB


In [5]:
pd.Series([variance_inflation_factor(X_train.values, i)
        for i in range(X_train.shape[1])],
        index=X_train.columns)

bedrooms          13.523773
bathrooms         23.852623
sqft_living       16.146105
floors             9.545831
has_waterfront     1.027525
dtype: float64
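A note on reading these values: `variance_inflation_factor` regresses each column on the other columns exactly as given, so when the matrix has no constant column the auxiliary regressions have no intercept and the VIFs tend to come out larger than the usual centered ones. The sketch below computes centered VIFs by hand on synthetic data. The helper name `centered_vif` and the demo columns are hypothetical, and this is an illustration, not the method used in this assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def centered_vif(X):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on the remaining columns with an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic demo: b is strongly related to a, c is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(loc=3.0, size=1000)
b = 0.9 * a + rng.normal(scale=0.5, size=1000)
c = rng.normal(loc=2.0, size=1000)

demo_vif = centered_vif(np.column_stack([a, b, c]))
print(demo_vif)
```

Applied to the housing features, the same idea would be to add a constant column (for example with `sm.add_constant`) before calling `variance_inflation_factor` and to ignore the VIF reported for the constant itself.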

In [6]:
kc_dummy = pd.get_dummies(kc, columns=['floors'], drop_first=True)

In [7]:
X = kc_dummy.drop('price', axis=1)
Y = kc_dummy.price

X_train, x_test, Y_train, y_test = train_test_split(X, Y, test_size=0.2)

Apply a ridge regression model with lambda=50 (the `alpha` parameter in scikit-learn) to the data and evaluate it by comparing R^2 on the train and test sets.

In [8]:
ridge = Ridge(alpha=50)
ridge.fit(X_train, Y_train)

Ridge(alpha=50)

In [9]:
print(f'\nRidge Train R^2: {ridge.score(X_train, Y_train):.2f}')
print(f'Ridge Test R^2: {ridge.score(x_test, y_test):.2f}')


Ridge Train R^2: 0.56
Ridge Test R^2: 0.53


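>*With an R^2 of 0.56 on the training set and 0.53 on the test set, the gap between the two is small, so the ridge model does not appear to be overfitting. Both values are modest, which points to underfitting: these features explain only a little over half of the variance in price.*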
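One way to make the train/test comparison concrete is to fit ridge models at several penalty strengths and look at both scores side by side. The sketch below uses synthetic data, and the helper `r2_gap` and the `*_demo` names are hypothetical, so the numbers say nothing about the housing data. A large positive gap would suggest overfitting, while low scores on both sets would suggest underfitting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption: stands in for the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=2.0, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

def r2_gap(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (train R^2, test R^2, train minus test)."""
    model.fit(X_tr, y_tr)
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    return train_r2, test_r2, train_r2 - test_r2

gap_results = {}
for alpha in (0.1, 50, 5000):
    gap_results[alpha] = r2_gap(Ridge(alpha=alpha), Xd_train, yd_train, Xd_test, yd_test)
    tr, te, gap = gap_results[alpha]
    print(f'alpha={alpha}: train R^2={tr:.3f}, test R^2={te:.3f}, gap={gap:.3f}')
```

Training R^2 can only stay the same or fall as `alpha` grows, because a larger penalty pulls the coefficients away from the least-squares fit. The test score is the one that shows whether the extra shrinkage helps.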

Perform a grid search over the following values of alpha: 0.001, 0.01, 0.1, 1, 10, 100, 1000 to find the best ridge regression model. Experiment with different scoring metrics in the grid search (R^2 is the default for regressors, but you can use root mean squared error or many others). The available metrics are listed at:
https://scikit-learn.org/stable/modules/model_evaluation.html

In [10]:
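grid = {'alpha':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# With no scoring argument, the search ranks models by R^2.
# Note that the target here is log(price), not the raw price used in In [8].
ridge_cv = GridSearchCV(Ridge(), grid, cv=5)
ridge_cv.fit(X_train, np.log(Y_train))

print(f'selected alpha: {ridge_cv.best_estimator_.alpha}')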

selected alpha: 0.1
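The selected alpha alone does not show how close the other candidates were. A minimal sketch, on synthetic data with hypothetical `demo_*` names, of reading the per-alpha cross-validation scores from the `cv_results_` attribute of a fitted `GridSearchCV`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for X_train and log(Y_train))
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(scale=1.0, size=300)

demo_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
demo_search = GridSearchCV(Ridge(), demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# One row per candidate alpha, best-ranked first
cv_table = (
    pd.DataFrame(demo_search.cv_results_)[
        ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']
    ]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(cv_table)
```

If the mean scores of several alphas differ by less than their standard deviations across folds, the choice among them is not well determined by the data.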


In [11]:
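# scikit-learn negates error metrics so that higher is always better.
# best_score_ is therefore minus the mean absolute error, in log(price) units.
ridge_cv = GridSearchCV(Ridge(), grid, scoring='neg_mean_absolute_error', cv=5)
ridge_cv.fit(X_train, np.log(Y_train))

print(f'selected mean_abs: {ridge_cv.best_score_}')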

selected mean_abs: -0.2975205625567954


In [12]:
# best_score_ is minus the mean squared error, in squared log(price) units.
ridge_cv = GridSearchCV(Ridge(), grid, scoring='neg_mean_squared_error', cv=5)
ridge_cv.fit(X_train, np.log(Y_train))

print(f'selected MSE: {ridge_cv.best_score_}')

selected MSE: -0.1350560233976334
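The scores above are negated errors on the log(price) scale, which makes them hard to read in dollar terms. The sketch below, again on synthetic data with hypothetical `demo_*` names, runs the same alpha grid with the `neg_root_mean_squared_error` scorer, flips the sign to recover the RMSE, and then exponentiates test-set predictions to report an error in the original units. It is one possible way to finish the evaluation, not a prescribed one.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic positive target whose log is roughly linear in the features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 4))
log_y = 12.0 + X_demo @ np.array([0.4, 0.2, 0.0, -0.1]) + rng.normal(scale=0.3, size=400)
y_demo = np.exp(log_y)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

demo_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
demo_search = GridSearchCV(Ridge(), demo_grid, scoring='neg_root_mean_squared_error', cv=5)
demo_search.fit(Xd_train, np.log(yd_train))

# Flip the sign to get a conventional (positive) RMSE on the log scale
cv_rmse_log = -demo_search.best_score_
print(f"best alpha: {demo_search.best_params_['alpha']}")
print(f'CV RMSE (log units): {cv_rmse_log:.3f}')

# predict() uses the refit best model; exp() maps back to original units
demo_pred = np.exp(demo_search.predict(Xd_test))
test_mae = mean_absolute_error(yd_test, demo_pred)
print(f'test MAE (original units): {test_mae:,.0f}')
```

Exponentiating a log-scale prediction gives something closer to a median than a mean of the original target, so errors measured this way are not directly comparable to those of a model fit on raw price.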
