# Project 2 - Ames Housing Data and Kaggle Challenge
Author: _Ritchie Kwan_

---



## Table of Contents
1. [EDA and Data Cleaning](01-EDA-and-Cleaning.ipynb)
2. [Preprocessing and Feature Engineering](02-Preprocessing-and-Feature-Engineering.ipynb)
3. [Modeling Benchmarks](#Modeling-Benchmarks)
4. [Model Tuning](04-Model-Tuning.ipynb)  
5. [Production Model and Insights](05-Production-Model-and-Insights.ipynb)  
 

### Import Libraries

In [10]:
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

### Load Data

In [11]:
df = pd.read_csv('../datasets/train_processed.csv')
df_train = pd.read_csv('../datasets/train_split_processed.csv')
df_test = pd.read_csv('../datasets/test_split_processed.csv')
df_kaggle = pd.read_csv('../datasets/kaggle_processed.csv')

### Define Predictors and Target

In [12]:
X = df[[col for col in df.columns if col != 'saleprice']]
y = df['saleprice']

X_train = df_train[[col for col in df_train.columns if col != 'saleprice']]
y_train = df_train['saleprice']

X_test = df_test[[col for col in df_test.columns if col != 'saleprice']]
y_test = df_test['saleprice']

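# Reuse the test split's column list so X_kaggle has the same features, in the
# same order, as the splits; 'id' is an identifier, not a predictor
X_kaggle = df_kaggle[[col for col in df_test.columns if col != 'saleprice' and col != 'id']]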
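
The Kaggle predictors are selected by column name so that they line up with the features the model is fitted on. A minimal sketch of that alignment step, using two made-up frames (`train_demo`, `kaggle_demo`, and their column names and values are hypothetical, not the Ames data):

```python
import pandas as pd

# Toy stand-ins for a processed training split and the Kaggle set (made-up values).
train_demo = pd.DataFrame({
    'lot_area': [8450, 9600],
    'overall_qual': [7, 6],
    'saleprice': [208500, 181500],
})
kaggle_demo = pd.DataFrame({
    'id': [1461, 1462],
    'overall_qual': [5, 6],
    'lot_area': [11622, 14267],
})

# Feature list taken from the training frame, target excluded.
feature_cols = [col for col in train_demo.columns if col != 'saleprice']

# Any training feature absent from the Kaggle frame would break lm.predict later.
missing = [col for col in feature_cols if col not in kaggle_demo.columns]

# Selecting by the training feature list also fixes the column order.
X_kaggle_demo = kaggle_demo[feature_cols]

print(missing)
print(list(X_kaggle_demo.columns))
```

Selecting by an explicit feature list drops `id` and reorders the remaining columns in one step, which is why no separate reordering is needed.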

## Modeling Benchmarks

Ordinary least squares via `LinearRegression`, with no regularization, serves as the benchmark predictive model.

### K-Folds

In [14]:
kf = KFold(n_splits = 10, shuffle = True, random_state = 42)
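
As a rough illustration of what this splitter does, the sketch below applies the same `KFold` settings to a small made-up array. The names `X_demo` and `kf_demo` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 25 rows standing in for a training set (not the Ames data).
X_demo = np.arange(50).reshape(25, 2)

# Same settings as the notebook's splitter.
kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []   # number of validation rows in each fold
seen = []         # every row index that landed in some validation fold
for train_idx, val_idx in kf_demo.split(X_demo):
    fold_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())

print(fold_sizes)
print(sorted(seen) == list(range(len(X_demo))))
```

Each row is held out exactly once across the ten folds, and fixing `random_state` makes the shuffled split reproducible between runs.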

### Linear Model

In [15]:
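# Benchmark helper: fit, score, and cross-validate a linear model

def linear_reg(X_train, X_test, y_train, y_test):
    """Fit a LinearRegression benchmark, print its R^2 scores, and return the fitted model."""

    lm = LinearRegression()
    lm.fit(X_train, y_train)
    y_train_pred = lm.predict(X_train)
    y_test_pred = lm.predict(X_test)

    # R^2 of the model fitted on the training split
    train_score = r2_score(y_train, y_train_pred)
    test_score = r2_score(y_test, y_test_pred)

    # Mean R^2 across the folds of the global `kf`; cross_val_score refits a clone
    # of `lm` on each fold, so these scores do not depend on the fit above
    cv_train_score = cross_val_score(lm, X_train, y_train, cv = kf).mean()
    cv_test_score = cross_val_score(lm, X_test, y_test, cv = kf).mean()

    print('Train:  \t{}\nTest:\t\t{}\nTrainCV:\t{}\nTestCV:\t\t{}'
          .format(train_score, test_score, cv_train_score, cv_test_score))

    return lm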

#### Linear Model Predictions

In [16]:
lm = linear_reg(X_train, X_test, y_train, y_test)

y_kaggle_lm_pred = lm.predict(X_kaggle)

Train:  	0.9531733852423574
Test:		0.8719576833399215
TrainCV:	0.8660803946391541
TestCV:		0.7001903671607572
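
The `TrainCV` and `TestCV` figures are means of per-fold R² scores. Each fold refits a fresh copy of the model, so they are not scores of the single model fitted on the training split. A minimal sketch of that computation on synthetic data, checked against `cross_val_score` (the `*_demo` names and the generated data are assumptions for illustration only):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the Ames data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

kf_demo = KFold(n_splits=10, shuffle=True, random_state=42)
model = LinearRegression()

# Manual cross-validation: refit on nine folds, score R^2 on the held-out fold.
fold_scores = []
for train_idx, val_idx in kf_demo.split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    fold_scores.append(r2_score(y_demo[val_idx], fold_pred))

manual_mean = float(np.mean(fold_scores))

# Library equivalent: for a regressor the default score is R^2.
library_mean = float(cross_val_score(model, X_demo, y_demo, cv=kf_demo).mean())

print(manual_mean, library_mean)
```

Because every fold trains on only nine tenths of the rows it is given, cross-validated scores on a small split can sit below the score of a model fitted on a larger one.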


### Write predictions to CSV file

In [17]:
# Two-column submission frame: Kaggle row id and predicted sale price
predictions = pd.DataFrame({'Id': df_kaggle['id'], 'SalePrice': y_kaggle_lm_pred})

In [18]:
predictions.to_csv('../datasets/predictions_linear.csv', index = False)