# Assignment

In this assignment, we'll continue working with the house prices data. We will complete the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Reimplement your model from the previous checkpoint.
* Try OLS, Lasso, Ridge, and ElasticNet regression using the same model specification. This time, you need to do **k-fold cross-validation** to choose the best hyperparameter values for your models. Which model is the best? Why?

## House price model 

In this assignment, we will revisit the house price model and focus on the problem of __overfitting__ and __underfitting__. Overfitting refers to a model that fits the training data too closely, noise included, and therefore fails to generalize to new data. An underfitted model, on the other hand, performs poorly on both the training data and new data. A good model should __generalize__ well and keep the __generalization gap__, the difference between the errors on the test set and the training set, small. As a general rule, a model that is too complex will tend to overfit, while a model that is not complex enough will underfit.
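To make the generalization gap concrete, here is a minimal sketch on synthetic data (not the house prices data). It assumes a noisy sine curve as the target and fits polynomials of increasing degree; the degrees, sample size, and the name `results` are illustrative choices. The linear fit should underfit (high error on both sets), and the high-degree fit will usually show the largest gap, though the exact numbers depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(0)
x = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.3, size=60)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0)

# degree -> (training MSE, test MSE, generalization gap)
results = {}
for degree in (1, 4, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    results[degree] = (train_mse, test_mse, test_mse - train_mse)
    print("degree {:2d}: train MSE {:.3f}, test MSE {:.3f}, gap {:.3f}".format(
        degree, train_mse, test_mse, test_mse - train_mse))
```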

We will also try different regression models that incorporate __regularization__, the process of modifying a learning algorithm to lower the generalization gap, usually by accepting a slightly worse fit on the training data.

When choosing the best hyperparameter values for our models, we will use __k-fold cross-validation__. We will split the training data into 5 folds, fitting on four folds and validating on the fifth in turn, to estimate the skill of the model on unseen data.
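As a rough sketch of what 5-fold cross-validation does when choosing a hyperparameter, the example below scores three candidate `alpha` values for a ridge model on synthetic data. The data, the candidate values, and the names `X_demo`, `y_demo`, and `mean_scores` are assumptions for illustration only. The `LassoCV`, `RidgeCV`, and `ElasticNetCV` estimators used later run this kind of loop internally; passing `cv=5` to them uses five consecutive, unshuffled folds, whereas this sketch shuffles explicitly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (illustrative, not the house prices data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0)

# Five folds: each observation is used for validation exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(val_idx) for _, val_idx in kfold.split(X_demo)]

# Average the validation R-squared over the folds for each candidate alpha
mean_scores = {}
for alpha in (0.01, 1.0, 100.0):
    scores = cross_val_score(
        Ridge(alpha=alpha), X_demo, y_demo, cv=kfold, scoring="r2")
    mean_scores[alpha] = scores.mean()

# The alpha with the highest average validation score is selected
best_alpha = max(mean_scores, key=mean_scores.get)
print("fold sizes:", fold_sizes)
print("best alpha:", best_alpha)
```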

### Iteration 1 OLS

We'll start by reloading our houseprice model and analyzing the performance of Ordinary Least Squares. 

In [1]:
# Libraries 

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sqlalchemy import create_engine
import warnings

# Import data

warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

houseprices_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, 
# as we're only doing a single query
engine.dispose()

In [2]:
# Create dummy variables 
houseprices_df = pd.concat([houseprices_df,pd.get_dummies(houseprices_df.mszoning, prefix="mszoning", drop_first=True)], axis=1)
houseprices_df = pd.concat([houseprices_df,pd.get_dummies(houseprices_df.housestyle, prefix="style", drop_first=True)], axis=1)

dummy_column_names = list(pd.get_dummies(houseprices_df.mszoning, prefix="mszoning", drop_first=True).columns)
dummy_column_names = dummy_column_names + list(pd.get_dummies(houseprices_df.housestyle, prefix="style", drop_first=True).columns)

In [3]:
# Create new variables 
houseprices_df['totalsf'] = houseprices_df['totalbsmtsf'] + houseprices_df['firstflrsf'] + houseprices_df['secondflrsf']

houseprices_df['int_over_sf'] = houseprices_df['totalsf'] * houseprices_df['overallqual']

In [4]:
# Y is the target variable
Y = np.log1p(houseprices_df['saleprice'])
# X is the feature set
X = houseprices_df[['overallqual','grlivarea','garagearea','totalbsmtsf','firstflrsf', 'int_over_sf'] + dummy_column_names]

In [5]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 465)

print("The number of observations in training set is {}".format(X_train.shape[0]))
print("The number of observations in test set is {}".format(X_test.shape[0]))

alphas = [np.power(10.0,p) for p in np.arange(-10,40,1)]

The number of observations in training set is 1168
The number of observations in test set is 292


In [6]:
# We fit an OLS model using sklearn
lrm = LinearRegression()
lrm.fit(X_train, y_train)


# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print("R-squared of the model in the training set is: {}".format(lrm.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model in the test set is: {}".format(lrm.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model in the training set is: 0.8292006059392802
-----Test set statistics-----
R-squared of the model in the test set is: 0.820784417090392
Mean absolute error of the prediction is: 0.12885874941196004
Mean squared error of the prediction is: 0.02988341865036109
Root mean squared error of the prediction is: 0.1728682117983555
Mean absolute percentage error of the prediction is: 1.0765115436158368


As we see, the R-squared of the model is 0.83 in the training set and 0.82 in the test set. The difference between these values is very small, so the model generalizes about as well as it fits the training data and shows no sign of overfitting.

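We also printed out some prediction statistics on the test set to compare with the following models. Note that these are computed on the log-transformed sale price, so they are in log units rather than dollars.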

### Iteration 2 Lasso 

Least Absolute Shrinkage and Selection Operator (Lasso) regression works to prevent overfitting by penalizing the sum of the absolute values of the coefficients. This forces small parameter estimates to be exactly zero, effectively dropping those features from the model.
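The feature-dropping behaviour can be seen in a small sketch on synthetic data where only 3 of 10 features are informative. The data, the fixed `alpha=5.0`, and the names `X_demo`, `y_demo`, `n_zero_ols`, and `n_zero_lasso` are assumptions for illustration; the notebook itself picks `alpha` by cross-validation instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, only 3 of which actually drive the target
X_demo, y_demo, true_coef = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, coef=True, random_state=0)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

# OLS assigns a small non-zero weight to every feature, whereas the
# L1 penalty sets the weights of weak features to exactly zero
n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("exact zeros, OLS:  ", n_zero_ols)
print("exact zeros, Lasso:", n_zero_lasso)
```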

In [7]:
from sklearn.linear_model import LassoCV

lassoregr = LassoCV(alphas=alphas, cv=5)
lassoregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lassoregr.predict(X_train)
y_preds_test = lassoregr.predict(X_test)

print("R-squared of the model on the training set is: {}".format(lassoregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(lassoregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))


R-squared of the model on the training set is: 0.8289842776415911
-----Test set statistics-----
R-squared of the model on the test set is: 0.8184967365411038
Mean absolute error of the prediction is: 0.1292773499289438
Mean squared error of the prediction is: 0.030264879427837916
Root mean squared error of the prediction is: 0.17396804139794733
Mean absolute percentage error of the prediction is: 1.0804753132610663


As we see, the R-squared of the model is 0.83 in the training set and 0.82 in the test set, so the gap between the two is about the same as for the OLS model. However, the test R-squared is slightly lower and all of the error metrics (MAE, MSE, RMSE, and MAPE) are slightly higher than for OLS, indicating slightly larger prediction errors for this model.

### Iteration 3 Ridge

Ridge regression adds a penalty for large coefficients to the OLS cost function. As the complexity of a model increases and features correlate with one another more and more, the model incorporates too much of the variance in the training set. Removing features from the model can be seen as setting their coefficients to zero. Instead of forcing them to be exactly zero, let's penalize them if they are too far from zero, thus forcing them to be small in a continuous way. This way, we decrease model complexity while keeping all variables in the model.
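A minimal sketch of this continuous shrinkage, on synthetic data with illustrative `alpha` values (the names `X_demo`, `y_demo`, `coef_norms`, and `n_exact_zeros` are assumptions, not part of the assignment): as `alpha` grows, the overall size of the coefficient vector shrinks, but no coefficient is set exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data in which all 8 features are informative
X_demo, y_demo = make_regression(
    n_samples=100, n_features=8, noise=10.0, random_state=1)

coef_norms = []      # Euclidean norm of the coefficient vector per alpha
n_exact_zeros = []   # number of coefficients equal to exactly zero
for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(ridge.coef_)))
    n_exact_zeros.append(int(np.sum(ridge.coef_ == 0)))
    print("alpha {:7.1f}: coefficient norm {:.2f}".format(
        alpha, coef_norms[-1]))
```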

In [8]:
from sklearn.linear_model import RidgeCV

# Fitting a ridge regression model. Alpha is the regularization
# parameter (usually called lambda). As alpha gets larger, parameter
# shrinkage grows more pronounced.
ridgeregr = RidgeCV(alphas=alphas, cv=5) 
ridgeregr.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridgeregr.predict(X_train)
y_preds_test = ridgeregr.predict(X_test)

print("R-squared of the model on the training set is: {}".format(ridgeregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(ridgeregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))

R-squared of the model on the training set is: 0.8286748212734817
-----Test set statistics-----
R-squared of the model on the test set is: 0.8167057471604998
Mean absolute error of the prediction is: 0.12963708370240168
Mean squared error of the prediction is: 0.030563519114130914
Root mean squared error of the prediction is: 0.17482425207656663
Mean absolute percentage error of the prediction is: 1.0837971354032898


These results are very similar to the Lasso model. The R-squared goes from 0.83 in the training set to 0.82 in the test set, and the error metrics are again slightly higher than for the OLS model.

### Iteration 4 Elastic Net 

Elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.
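In scikit-learn, the balance between the two penalties is set by the `l1_ratio` parameter, which defaults to 0.5 in both `ElasticNet` and `ElasticNetCV`. The sketch below, on synthetic data with illustrative names (`X_demo`, `y_demo`, `same_as_lasso`), checks that `l1_ratio=1.0` reproduces the lasso fit, while an even mix gives different coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (illustrative, not the house prices data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# l1_ratio=1.0 keeps only the L1 penalty, so this is the lasso again
enet_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)

# l1_ratio=0.5 mixes the L1 and L2 penalties evenly
enet_mix = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)

same_as_lasso = np.allclose(lasso.coef_, enet_l1.coef_)
print("l1_ratio=1.0 matches Lasso:", same_as_lasso)
print("Lasso coefficients:   ", np.round(lasso.coef_, 2))
print("Mixed-penalty coeffs: ", np.round(enet_mix.coef_, 2))
```

`ElasticNetCV` also accepts a list of `l1_ratio` values, in which case the mix is tuned by cross-validation along with `alpha`.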

In [9]:
from sklearn.linear_model import ElasticNetCV

elasticregr = ElasticNetCV(alphas=alphas, cv=5)

elasticregr.fit(X_train, y_train)
# We are making predictions here
y_preds_train = elasticregr.predict(X_train)
y_preds_test = elasticregr.predict(X_test)

print("R-squared of the model on the training set is: {}".format(elasticregr.score(X_train, y_train)))
print("-----Test set statistics-----")
print("R-squared of the model on the test set is: {}".format(elasticregr.score(X_test, y_test)))
print("Mean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_preds_test)))
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_preds_test)))
print("Mean absolute percentage error of the prediction is: {}".format(np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100))


R-squared of the model on the training set is: 0.8291190450186339
-----Test set statistics-----
R-squared of the model on the test set is: 0.8193730067494465
Mean absolute error of the prediction is: 0.1291282364967868
Mean squared error of the prediction is: 0.03011876518340893
Root mean squared error of the prediction is: 0.17354758766231507
Mean absolute percentage error of the prediction is: 1.0790605176978927


The elastic net model also yields results very similar to the ridge and lasso regression models. Its R-squared is marginally higher and its error metrics are marginally lower than those of both ridge and lasso, but it still trails the OLS model by a small margin.

### Which model is the best?
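I would continue with the OLS regression. Its test-set R-squared is the highest of the four models, meaning that it explains slightly more of the variation in the target variable, and its error metrics are slightly lower than the others, indicating a better fit. The differences are small, though: all four test R-squared values lie within about 0.004 of each other, so regularization makes little difference for this model specification.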

## Concept Summary 

In linear models that contain many features, or features that are correlated with one another, the standard OLS parameter estimates can have high variance, making the model unreliable. To counter this, we can use regularization, which decreases the variance at the cost of introducing some bias.

Three popular regularization techniques that aim at decreasing the size of the coefficients include:
 * Ridge Regression: penalizes the sum of squared coefficients (L2 penalty)
 * Lasso Regression: penalizes the sum of absolute values of the coefficients (L1 penalty)
 * Elastic Net: combination of Ridge and Lasso
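
The variance reduction described above can be illustrated with a small simulation, sketched here under stated assumptions: two nearly identical synthetic features, 200 repeated samples, and a fixed ridge `alpha=1.0`. The names `ols_coefs`, `ridge_coefs`, `ols_var`, and `ridge_var` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
ols_coefs, ridge_coefs = [], []

# Refit both models on many fresh samples and record the first coefficient
for _ in range(200):
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.05, size=50)  # nearly a copy of x1
    X_sim = np.column_stack([x1, x2])
    y_sim = x1 + x2 + rng.normal(size=50)      # true coefficients are 1 and 1
    ols_coefs.append(LinearRegression().fit(X_sim, y_sim).coef_[0])
    ridge_coefs.append(Ridge(alpha=1.0).fit(X_sim, y_sim).coef_[0])

# Spread of the estimates across samples: a large spread means high variance
ols_var = float(np.var(ols_coefs))
ridge_var = float(np.var(ridge_coefs))
print("variance of OLS estimate:  ", round(ols_var, 3))
print("variance of ridge estimate:", round(ridge_var, 3))
```

With features this strongly correlated, the OLS estimate swings widely from sample to sample, while the ridge estimate stays much more stable.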