# Project 2 - Ames Housing Data and Kaggle Challenge
Author: _Ritchie Kwan_

---



## Table of Contents
1. [EDA and Data Cleaning](01-EDA-and-Cleaning.ipynb)
2. [Preprocessing and Feature Engineering](02-Preprocessing-and-Feature-Engineering.ipynb)
3. [Modeling Benchmarks](03-Model-Benchmarks.ipynb)  
4. [Model Tuning](04-Model-Tuning.ipynb)
5. [Production Model and Insights](#Production-Model-and-Insights) 
 

### Import Libraries

In [65]:
import numpy as np
import pandas as pd

from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV


### Load Data

In [55]:
df = pd.read_csv('../datasets/train_processed.csv')
df_train = pd.read_csv('../datasets/train_split_processed.csv')
df_test = pd.read_csv('../datasets/test_split_processed.csv')
df_kaggle = pd.read_csv('../datasets/kaggle_processed.csv')

### Define Predictors and Target

In [56]:
X = df[[col for col in df.columns if col != 'saleprice']]
y = df['saleprice']

# use the same predictor columns as X so the model sees matching features
X_kaggle = df_kaggle[[col for col in df.columns if col != 'saleprice']]
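
The Kaggle frame has to present the same predictor columns, in the same order, as the frame the model is fit on. A minimal sketch of that alignment, using small made-up frames (`train`, `kaggle`, and the column names are hypothetical stand-ins, not the processed Ames data):

```python
import pandas as pd

# Hypothetical toy frames standing in for the processed training and Kaggle data.
train = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'saleprice': [100, 200]})
kaggle = pd.DataFrame({'b': [5, 6], 'a': [7, 8], 'extra': [0, 0]})

# Predictors are every training column except the target.
feature_cols = [col for col in train.columns if col != 'saleprice']

# Fail loudly if the Kaggle frame lacks a column the model was trained on.
missing = [col for col in feature_cols if col not in kaggle.columns]
if missing:
    raise ValueError(f'missing columns: {missing}')

# Select the training features in training order; extra columns are dropped.
X_kaggle_aligned = kaggle[feature_cols]
```

Selecting by the training column list, rather than by whatever columns the Kaggle file happens to have, keeps column order stable and surfaces a mismatch before prediction time.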


## Production Model and Insights

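`Lasso` is the production model. Fit on the entire training set, it has the highest mean 10-fold CV R² of the regression models compared (0.915 versus 0.911 for `Ridge`), even though `Ridge` scores higher on the training data itself (0.945 versus 0.934).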

In [60]:
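def compare_r2(X, y, model_type = 'ridge', r_alphas = np.logspace(0, 5, 200)):
    # model_type is case insensitive
    model_type = model_type.lower()

    if model_type == 'ridge' :
        # r_alphas is the grid of ridge penalties to search
        model = RidgeCV(alphas = r_alphas, scoring = 'r2')
    elif model_type == 'lasso' : 
        # LassoCV builds its own alpha grid
        model = LassoCV()
    else :
        raise ValueError("model_type must be 'ridge' or 'lasso'")
    
    # fit on all of the data; the fitted model is returned for reuse
    model = model.fit(X, y)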

    # make predictions
    y_pred = model.predict(X)

    # training R2 (scored on the same data the model was fit on)
    score = r2_score(y, y_pred)

    # K-Folds Cross Validation
    kf = KFold(n_splits = 10, 
               shuffle = True, 
               random_state = 42)

    # cross-validation scores
    cv_score = cross_val_score(model, X, y, cv = kf).mean()

    # build output
    output = {'Score' : score,
              'CV Score' : cv_score,
              'Model' : model,
              'alpha' : model.alpha_,
              'coef' : model.coef_
             }
    return pd.DataFrame({'Score' : output})

In [61]:
ridge_comp = compare_r2(X, y, 'ridge')

In [62]:
ridge_comp

| | Score |
|---|---|
| CV Score | 0.91107 |
| Model | RidgeCV(alphas=array([1.00000e+00, 1.05956e+00... |
| Score | 0.944811 |
| alpha | 182.518 |
| coef | [7356.801127139642, 11928.149139617235, 3628.0... |


In [63]:
lasso_comp = compare_r2(X, y, 'lasso')

In [64]:
lasso_comp

| | Score |
|---|---|
| CV Score | 0.91548 |
| Model | LassoCV(alphas=None, copy_X=True, cv=None, eps... |
| Score | 0.934227 |
| alpha | 689.551 |
| coef | [7529.538871504198, 15937.031238247524, 2541.2... |
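
The two outputs above are compared on mean cross-validated R² rather than on training R². As a rough sketch of that selection step, the snippet below scores both model types on the same folds and picks the larger mean. It runs on synthetic data from `make_regression`, so `X_demo`, `y_demo`, `candidates`, and the smaller alpha grid are illustrative assumptions and the numbers it produces are unrelated to the Ames results.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the Ames features).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=42)

# Use the same shuffled folds for every candidate so the scores are comparable.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    'ridge': RidgeCV(alphas=np.logspace(0, 3, 20)),
    'lasso': LassoCV(cv=5),
}

# Mean out-of-fold R2 per model, collected into one Series.
cv_scores = pd.Series({
    name: cross_val_score(model, X_demo, y_demo, cv=kf).mean()
    for name, model in candidates.items()
})

# The candidate with the highest mean CV score is the one to carry forward.
best_name = cv_scores.idxmax()
```

Because `cross_val_score` clones and refits the estimator on each training fold, the alpha search inside `RidgeCV` and `LassoCV` is repeated per fold, so the held-out fold never influences the chosen penalty.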


In [48]:
final_model = lasso_comp.loc['Model', 'Score']

### Top Predictive Features

In [66]:
feature_coefs = pd.DataFrame({'feature' : X.columns, 'coef' : final_model.coef_})
feature_coefs = feature_coefs.sort_values(by = 'coef', ascending = False)
feature_coefs.head(10)

| | feature | coef |
|---|---|---|
| 3 | gr_liv_area | 20501.6325 |
| 1 | overall_qual | 15937.031238 |
| 0 | neighborhood | 7529.538872 |
| 11 | year_remod/add | 7370.995147 |
| 17 | bsmtfin_sf_1 | 7084.476286 |
| 6 | total_bsmt_sf | 6878.351216 |
| 21 | lot_area | 6330.059709 |
| 10 | year_built | 4749.090813 |
| 16 | fireplaces | 4149.396609 |
| 5 | kitchen_qual | 3613.744841 |


In [67]:
feature_coefs.tail(10)

| | feature | coef |
|---|---|---|
| 284 | garage_yr_blt lot_frontage | -257.622209 |
| 219 | year_built^2 | -382.76967 |
| 104 | gr_liv_area totrms_abvgrd | -444.853821 |
| 270 | totrms_abvgrd garage_yr_blt | -451.757963 |
| 314 | lot_frontage^2 | -570.528379 |
| 225 | year_built fireplaces | -863.972753 |
| 298 | bsmtfin_sf_1 wood_deck_sf | -932.152513 |
| 291 | fireplaces wood_deck_sf | -951.634775 |
| 323 | half_bath^2 | -1083.436652 |
| 299 | bsmtfin_sf_1 lot_frontage | -1088.085246 |
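
Sorting by signed coefficient puts the largest positive effects at the head and the largest negative effects at the tail, with any coefficients that Lasso shrank to exactly zero in between. A possible complementary view, sketched here on synthetic data, ranks features by absolute coefficient and counts how many were dropped. The names `X_demo`, `names`, and `abs_coef` are hypothetical, and comparing magnitudes across features is only meaningful if the features share a common scale, which is assumed here and not shown in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (an assumption; not the Ames features).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)
names = [f'x{i}' for i in range(X_demo.shape[1])]

lasso = LassoCV(cv=5).fit(X_demo, y_demo)

coefs = pd.DataFrame({'feature': names, 'coef': lasso.coef_})
coefs['abs_coef'] = coefs['coef'].abs()

# Features Lasso kept, strongest effect first regardless of sign.
kept = coefs[coefs['coef'] != 0].sort_values('abs_coef', ascending=False)

# Features whose coefficient was shrunk to exactly zero.
n_dropped = int((coefs['coef'] == 0).sum())
```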


### Predictions

In [51]:
y_kaggle_final_pred = final_model.predict(X_kaggle)

### Write predictions to CSV file

In [52]:
final_predictions = pd.DataFrame([], columns = ['Id', 'SalePrice'])

final_predictions['Id'] = df_kaggle['id']
final_predictions['SalePrice'] = y_kaggle_final_pred


In [53]:
final_predictions.to_csv('../datasets/predictions_final.csv', index = False)
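
Before uploading, it can help to check the frame that is about to be written. The helper below is a hypothetical sketch, not part of the original notebook: `check_submission` and the `demo` frame are made up for illustration, and the rules it enforces (the two columns written above, one row per id, finite positive prices) are assumptions about what a valid submission looks like.

```python
import numpy as np
import pandas as pd

def check_submission(frame, expected_rows):
    """Raise ValueError if frame does not look like a valid submission."""
    if list(frame.columns) != ['Id', 'SalePrice']:
        raise ValueError(f'unexpected columns: {list(frame.columns)}')
    if len(frame) != expected_rows:
        raise ValueError(f'expected {expected_rows} rows, got {len(frame)}')
    if frame['Id'].duplicated().any():
        raise ValueError('duplicate Id values')
    if not np.isfinite(frame['SalePrice']).all():
        raise ValueError('missing or infinite SalePrice values')
    if (frame['SalePrice'] <= 0).any():
        raise ValueError('non-positive SalePrice values')

# Made-up ids and predictions, standing in for df_kaggle['id'] and the model output.
demo = pd.DataFrame({'Id': [1, 2, 3], 'SalePrice': [150000.0, 210500.0, 98000.0]})
check_submission(demo, expected_rows=3)
```

In the notebook this would be called as `check_submission(final_predictions, len(df_kaggle))` just before `to_csv`.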