<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Ames Housing Data and Kaggle Challenge

# Part 3: Modeling 

## Table of Contents:
- [Datasets Used](#Datasets-Used)
- [Import Libraries](#Import-Libraries)
- [Functions](#Functions)
- [Data Import](#Data-Import)
- [Fitting and Modeling](#Fitting-and-Modeling)


## Datasets Used

The following preprocessed datasets in the [`datasets`](../datasets/) folder are used in this notebook:

* [`train_preprocessed.csv`](../datasets/train_preprocessed.csv): (2006 - 2010) Pre-processed Ames Housing dataset
* [`test_preprocessed.csv`](../datasets/test_preprocessed.csv): (2006 - 2010) Pre-processed Ames Housing dataset, excluding target variable

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
import matplotlib.lines as mlines
import matplotlib.transforms as mtransforms
from scipy import stats

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, Ridge, Lasso
from sklearn.dummy import DummyRegressor

from sklearn.metrics import mean_squared_error, r2_score

## Functions

In [188]:
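def metrics(model, X_train, y_train, pred, X_test, y_test):
    """Print RMSE, R^2 and 10-fold CV scores; assumes the target is on the log scale."""
    # Back-transform targets and predictions from the log scale to original units
    y_actual = np.exp(y_test)
    y_pred = np.exp(pred)
    # np.sqrt(MSE) avoids the `squared` argument, which newer scikit-learn releases no longer accept
    rmse = np.sqrt(mean_squared_error(y_actual, y_pred))
    print(f'RMSE Score (original units): {rmse} \n')
    print(f'Training Score (R^2): {model.score(X_train, y_train)} \n')
    # Note: cross-validation refits the model on np.exp(y_train), i.e. on the original price scale
    cv_rmse = abs(cross_val_score(model, X_train, np.exp(y_train), cv = 10, scoring = "neg_root_mean_squared_error").mean())
    print(f'CrossVal Score: {cv_rmse} \n')
    print(f'Testing Score (R^2): {model.score(X_test, y_test)} \n')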
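
The `CrossVal Score` line passes `np.exp(y_train)` to `cross_val_score`, so each fold refits the model on prices in original units rather than on the log target used for the other scores. If the intent is a cross-validated RMSE in dollars for a model that is still fitted on log price, one option (not used in this notebook) is scikit-learn's `TransformedTargetRegressor`. The sketch below uses synthetic data; `X_demo`, `price`, and `log_model` are hypothetical names introduced only for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 4))
# Synthetic prices whose logarithm is linear in the features (illustrative only)
true_coefs = np.array([0.3, 0.2, -0.1, 0.05])
price = np.exp(12.0 + X_demo @ true_coefs + rng.normal(scale=0.1, size=400))

# Fits LinearRegression on log(price) and returns predictions in price units
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)

# Scoring happens on the original price scale because the wrapper inverts the transform
scores = cross_val_score(
    log_model, X_demo, price, cv=10, scoring='neg_root_mean_squared_error'
)
cv_rmse = -scores.mean()
print(f'CV RMSE in price units: {cv_rmse:.2f}')
```

With this wrapper the model inside each fold sees the log target, while the reported error stays in the same units as the `RMSE Score` line.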

## Data Import

In [3]:
train = pd.read_csv('../datasets/train_preprocessed.csv')
test = pd.read_csv('../datasets/test_preprocessed.csv')

## Fitting and Modeling
___

### Baseline Model
A baseline model provides a reference point against which subsequent models are compared. `DummyRegressor` with `strategy = 'mean'` predicts the mean of the training target for every house; on the held-out split this yields an RMSE of $78,278.84.

In [78]:
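# DummyRegressor ignores the features, so leaving `saleprice` inside X does not affect the baseline
X = train
y = train['saleprice']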
print(X.shape)
print(y.shape)

(2048, 142)
(2048,)


In [79]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [80]:
dr = DummyRegressor(strategy = 'mean')
dr.fit(X_train, y_train)
y_pred = dr.predict(X_test)
metrics(dr, X_train, y_train, y_pred, X_test, y_test)

RMSE Score (original units): 78278.84170187355 

Training Score (R^2): 0.0 

CrossVal Score: 79625.55077030548 

Testing Score (R^2): -0.0018513201208738561 



---

### Modeling: Feature Selection by SelectKBest

In [173]:
X = train[['lot_frontage', 'overall_qual', 'year_remod/add', 'mas_vnr_area',
       'bsmt_qual', 'bsmt_exposure', 'bsmtfin_type_1', 'heating_qc',
       'kitchen_qual', 'totrms_abvgrd', 'fireplace_qu', 'garage_yr_blt',
       'garage_finish', 'garage_area', 'garage_qual', 'paved_drive',
       'wood_deck_sf', 'open_porch_sf', 'house_age', 'house_area',
       'location', 'total_bath', 'ms_subclass_30', 'ms_subclass_60',
       'ms_zoning_RM', 'mas_vnr_type_None', 'foundation_CBlock',
       'foundation_PConc', 'central_air_Y', 'garage_type_Attchd',
       'garage_type_Detchd', 'sale_type_New']]

y = train['saleprice']
print(X.shape)
print(y.shape)

(2048, 32)
(2048,)
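
The cell that produced this 32-feature list is not shown here. As a rough sketch of how `SelectKBest` with `f_regression` (both imported above) can produce such a list, the example below ranks features by their univariate F-statistic against the target and keeps the top `k`. The data is synthetic and the column names are hypothetical; `k = 32` would match the list above, but that setting is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
# Synthetic stand-in for the preprocessed training frame (hypothetical columns)
demo = pd.DataFrame(rng.normal(size=(300, 6)),
                    columns=[f'feat_{i}' for i in range(6)])
# Only feat_0 and feat_3 drive the target
demo_y = 2.0 * demo['feat_0'] - 1.5 * demo['feat_3'] + rng.normal(scale=0.5, size=300)

# Score each feature on its own and keep the k highest-scoring ones
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(demo, demo_y)

# get_support() returns a boolean mask over the input columns
selected = demo.columns[selector.get_support()]
print(list(selected))
```

Because `f_regression` scores each feature independently, it does not account for redundancy between correlated features, which is one reason to follow it with regularized models such as Ridge and Lasso.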


In [174]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [175]:
# summarize shape
print('Train:', X_train.shape, y_train.shape)

Train: (1638, 32) (1638,)


#### Linear Regression Model

In [176]:
lr = LinearRegression()

In [177]:
lr.fit(X_train, y_train)

LinearRegression()

In [178]:
print(lr.coef_)
print(lr.intercept_)

[ 0.02373821  0.07148841  0.02876963  0.00915423 -0.00483965  0.01103523
  0.01091978  0.02247645  0.03876273  0.01431619  0.01591823 -0.00431867
  0.00273932  0.01034734  0.02451134  0.04966949  0.01196336  0.00448034
 -0.01104239  0.12248629  0.02972992  0.02810115 -0.09558238  0.01673806
 -0.03188987  0.01785988 -0.00311232  0.01100226  0.10478342 -0.01002807
 -0.00135393  0.04934919]
10.883172160396985
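
Because the target appears to be the log of the sale price (the `metrics` function applies `np.exp` to it), each coefficient `b` can be read as a multiplicative effect: a one-unit increase in the feature multiplies the predicted price by `exp(b)`, all else equal. What "one unit" means depends on how each feature was scaled during preprocessing, which is not shown in this notebook. The sketch below pairs a few coefficients with features by position in the list above; `pct_change` is a hypothetical helper.

```python
import numpy as np

def pct_change(b):
    """Percent change in predicted price per one-unit increase, for a log-scale target."""
    return (np.exp(b) - 1) * 100

# Coefficients copied from the output above, matched to features by position
coefs = {
    'overall_qual': 0.07148841,
    'house_area': 0.12248629,
    'ms_subclass_30': -0.09558238,
}

for name, b in coefs.items():
    print(f'{name}: {pct_change(b):+.2f}% per one-unit increase')
```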


In [179]:
y_pred = lr.predict(X_test)

In [146]:
metrics(lr, X_train, y_train, y_pred, X_test, y_test)

RMSE Score (original units): 22910.904572787018 

Training Score (R^2): 0.8879574382495934 

CrossVal Score: 28916.84565604644 

Testing Score (R^2): 0.8829845589375604 



#### Ridge Regression

In [184]:
ridgecv = RidgeCV(alphas = np.linspace(0, 100, 50), cv = 10)
ridgecv.fit(X_train, y_train)
ridgecv.alpha_

16.3265306122449

In [185]:
ridge = Ridge(alpha = ridgecv.alpha_)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
metrics(ridge, X_train, y_train, ridge_pred, X_test, y_test)

RMSE Score (original units): 22267.225089476804 

Training Score (R^2): 0.8934371165112741 

CrossVal Score: 28634.8795313333 

Testing Score (R^2): 0.8957363454762204 



#### Lasso Regression
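Lasso differs from Ridge regression in its penalty term: it sums the absolute values of the coefficients (mⱼ) instead of summing their squares. As a result, Lasso can shrink some coefficients exactly to zero, whereas Ridge only shrinks them toward zero.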
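
The practical effect of the two penalties can be seen on a small synthetic problem in which only two of ten features carry signal. This is an illustrative sketch, not part of the notebook's pipeline; the `alpha` values and the `_demo` names are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
# Only the first two features carry signal; the other eight are pure noise
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)   # squared (L2) penalty
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)    # absolute-value (L1) penalty

# Count coefficients that are exactly zero under each penalty
print('Ridge coefficients equal to zero:', int(np.sum(ridge_demo.coef_ == 0)))
print('Lasso coefficients equal to zero:', int(np.sum(lasso_demo.coef_ == 0)))
```

Ridge keeps every feature with a small nonzero weight, while Lasso drops the noise features entirely, which is why Lasso is sometimes used as a feature-selection step in its own right.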

In [186]:
lassocv = LassoCV(alphas = np.logspace(-4, -1, 50), cv = 10)
lassocv.fit(X_train, y_train)
lassocv.alpha_

0.00035564803062231287

In [187]:
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
metrics(lasso, X_train, y_train, lasso_pred, X_test, y_test)

RMSE Score (original units): 22279.308150865643 

Training Score (R^2): 0.8935400991538824 

CrossVal Score: 28666.70851187904 

Testing Score (R^2): 0.8950594842420962 

