# Project Part II: Predicting Housing Prices - Build Your Own Model (50 pts)

 

### Grading Scheme

Your grade for the project will be based on your training RMSE and test RMSE. The thresholds are as follows:

Points | 20 | 17 | 15 | 10
--- | --- | --- | --- | ---
Training RMSE | Less than 70k | [70k, 120k) | [120k, 200k) | More than 230k

Points | 20 | 17 | 15 | 10
--- | --- | --- | --- | ---
Test RMSE | Less than 75k | [70k, 130k) | [130k, 230k) | More than 230k

The top 20% of submissions with the lowest test error will receive an additional 10 points.


In [1]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()
    
    
# Note: we filtered the data in cook_county_contest_data,
# so please use this dataset instead of the old one.

### Note

This notebook is specifically designed to guide you through the process of exporting your model's predictions on the test dataset for submission so you can see how your model performs.

Most of what you did in Project Part I should be transferable here.

## Step 1. Set up all the helper functions for your `process_data_fm` function.

**Copy-paste all of the helper functions your `process_data_fm` function needs into the following cell**. You **do not** have to fill out all of the functions in the cell below; only fill out those that are actually useful to your feature engineering pipeline.
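Before a Description-parsing helper goes into the pipeline, it can be checked on a toy data frame, as in the minimal sketch below. The two description strings are made up for illustration and only mimic the phrasing that the helpers in this notebook search for; `extract_count` is a hypothetical name, not part of the project skeleton.

```python
import pandas as pd

def extract_count(descriptions, pattern):
    """Return the first captured number per row as a float, or 0 when the pattern is absent."""
    # str.extract returns NaN for rows where the pattern does not match
    captured = descriptions.str.extract(pattern, expand=False)
    return pd.to_numeric(captured).fillna(0)

# Hypothetical rows: the second one contains no room counts at all
toy = pd.DataFrame({'Description': [
    'It has a total of 6 rooms, 3 of which are bedrooms, and 1.5 of which are bathrooms.',
    'No room information is available.',
]})

toy['Bedrooms'] = extract_count(toy['Description'], r'(\d+(?:\.\d+)?) of which are bedrooms')
toy['Bathrooms'] = extract_count(toy['Description'], r'(\d+(?:\.\d+)?) of which are bathrooms')
print(toy[['Bedrooms', 'Bathrooms']])
```

Rows without a match come back as 0 rather than raising an error, which matters if any description in the real data deviates from the expected phrasing.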

In [150]:
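def add_total_bedrooms(data):
    """
    Input:
      data (data frame): a data frame containing at least the Description column.
    Output:
      a copy of data with a numeric Bedrooms column (0 where no count is found).
    """
    with_rooms = data.copy()
    # str.extract yields NaN when the pattern does not match, so fillna(0) takes effect;
    # indexing re.findall(...)[0] would raise IndexError on such a row instead.
    bedrooms = data['Description'].str.extract(r'(\d+(?:\.\d+)?) of which are bedrooms', expand=False)
    with_rooms['Bedrooms'] = pd.to_numeric(bedrooms).fillna(0)
    return with_rooms

def add_total_bathrooms(data):
    """
    Input:
      data (data frame): a data frame containing at least the Description column.
    Output:
      a copy of data with a numeric Bathrooms column (0 where no count is found).
    """
    with_rooms = data.copy()
    bathrooms = data['Description'].str.extract(r'(\d+(?:\.\d+)?) of which are bathrooms', expand=False)
    # Convert the captured text to numbers so the column is not left as strings
    with_rooms['Bathrooms'] = pd.to_numeric(bathrooms).fillna(0)
    return with_rooms

def add_total_rooms(data):
    """
    Input:
      data (data frame): a data frame containing at least the Description column.
    Output:
      a copy of data with a numeric Totalrooms column (0 where no count is found).
    """
    with_rooms = data.copy()
    total_rooms = data['Description'].str.extract(r'It has a total of (\d+(?:\.\d+)?) rooms', expand=False)
    with_rooms['Totalrooms'] = pd.to_numeric(total_rooms).fillna(0)
    return with_rooms
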
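def process_ohe_vals(data, *columns):
    """
    One-hot-encodes each given column and appends the indicator columns to a copy of data.
    """
    data_copy = data.copy()
    for column in columns:
        X = data_copy[column].values.reshape(-1, 1)
        enc = OneHotEncoder().fit(X)
        # get_feature_names_out replaces get_feature_names, which newer scikit-learn versions removed
        names = [ele + column for ele in enc.get_feature_names_out()]
        # Reuse the original index so the merge lines rows up even if data is not indexed 0..n-1
        encoded = pd.DataFrame(enc.transform(X).toarray(), columns=names, index=data_copy.index)
        data_copy = pd.merge(data_copy, encoded, left_index=True, right_index=True)
    return data_copy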

def ohe_roof_material(data):
    """
    One-hot-encodes roof material. New columns are named M1, M2, ...,
    one per roof material category in the encoder's sorted order.
    """
    with_ohe = data.copy()
    with_ohe['Roof Material'] = with_ohe['Roof Material'].replace(
        [1, 2, 3, 4, 5, 6],
        ['Shingle/Asphalt', 'Tar&Gravel', 'Slate', 'Shake', 'Tile', 'Other'])
    X = with_ohe['Roof Material'].values.reshape(-1, 1)
    enc = OneHotEncoder().fit(X)
    # Name one column per category actually found, so fewer than six categories does not raise
    names = ['M' + str(i + 1) for i in range(len(enc.categories_[0]))]
    encoded = pd.DataFrame(enc.transform(X).toarray(), columns=names, index=with_ohe.index)
    return pd.merge(with_ohe, encoded, left_index=True, right_index=True)

def add_log_vals(data):
    with_logs = data.copy()
    with_logs['Log Land Square Feet'] = np.log(data['Land Square Feet'])
    with_logs['Log Log Land Square Feet'] = np.log(np.log(data['Land Square Feet']))
    with_logs['Log Building Square Feet'] = np.log(data['Building Square Feet'])
    with_logs['Log Log Building Square Feet'] = np.log(np.log(data['Building Square Feet']))
    return with_logs

def process_data_gm(data, pipeline_functions, prediction_col):
    """Process the data for a guided model."""
    for function, arguments, keyword_arguments in pipeline_functions:
        # Forward whichever positional and keyword arguments were supplied;
        # None (or an empty container) means "no arguments of that kind".
        data = data.pipe(function, *(arguments or ()), **(keyword_arguments or {}))
    X = data.drop(columns=[prediction_col]).to_numpy()
    y = data.loc[:, prediction_col].to_numpy()
    return X, y

def select_columns(data, *columns):
    """Select only columns passed as arguments."""
    # .loc treats a tuple as a single (MultiIndex) key, so convert it to a list first
    return data.loc[:, list(columns)]

## Step 2. Set up your `process_data_fm` function

**Write your implementation of `process_data_fm` in the following cell.**

Here are a few additional requirements **to check; change your `process_data_fm` function as needed so that it satisfies them**:
- Unlike Part I, we no longer expect your `process_data_fm` function to return both the design matrix `X` and the observed target vector `y`; it should now **return only `X`**.
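Because `process_data_fm` runs separately on the training file and the test file, any one-hot encoding inside it needs to produce the same columns in the same order both times. One way to sketch this, assuming the full set of category codes is known in advance, is to fix the categories with `CategoricalDtype` before calling `pd.get_dummies`. The codes 1 to 3, the toy frames, and the helper name `ohe_fixed` are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

def ohe_fixed(data, column, categories):
    """One-hot-encode `column` with a fixed category list so the output columns never vary."""
    fixed = data[column].astype(CategoricalDtype(categories=categories))
    # get_dummies emits one column per declared category, observed or not
    dummies = pd.get_dummies(fixed, prefix=column, dtype=float)
    return data.drop(columns=[column]).join(dummies)

# Hypothetical codes: the toy "test" frame never contains code 3
toy_train = pd.DataFrame({'Roof Material': [1, 2, 3], 'Bedrooms': [3, 2, 4]})
toy_test = pd.DataFrame({'Roof Material': [2, 1], 'Bedrooms': [1, 5]})

train_X = ohe_fixed(toy_train, 'Roof Material', [1, 2, 3])
test_X = ohe_fixed(toy_test, 'Roof Material', [1, 2, 3])
print(list(train_X.columns) == list(test_X.columns))
```

Fitting a fresh encoder on each file instead can yield different column sets whenever a category appears in only one of the two files.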


In [160]:
# Please include all of your feature engineering process inside this function.
# Do not modify the parameters of the function below. 
# Note that data will no longer have the column Sale Price in it directly, so plan your feature engineering process around that.
def process_data_fm(data):
    # Replace the following line with your own feature engineering pipeline
    X = data
    X = add_total_bedrooms(X)
    X = add_total_bathrooms(X)
    X = add_log_vals(X)
    X = process_ohe_vals(X,'Wall Material', 'Roof Material', 'Basement',
       'Basement Finish', 'Central Heating', 'Central Air', 'Repair Condition')
    X.drop(['PIN', 'Property Class', 'Neighborhood Code', 'Land Square Feet',
       'Town Code', 'Apartments', 'Wall Material', 'Roof Material', 'Basement',
       'Basement Finish', 'Central Heating', 'Other Heating', 'Central Air', 'Attic Type', 'Attic Finish', 'Design Plan',
       'Cathedral Ceiling', 'Construction Quality', 'Site Desirability',
       'Garage 1 Size', 'Garage 1 Material', 'Garage 1 Attachment',
       'Garage 1 Area', 'Garage 2 Size', 'Garage 2 Material',
       'Garage 2 Attachment', 'Garage 2 Area', 'Porch', 'Other Improvements',
       'Building Square Feet', 'Repair Condition', 'Multi Code',
       'Deed No.', 'Census Tract',
       'Multi Property Indicator', 'Modeling Group', 'Floodplain', 'Road Proximity', 'Sale Year',
       'Sale Quarter', 'Sale Half-Year', 'Sale Quarter of Year',
       'Sale Month of Year', 'Sale Half of Year', 'Most Recent Sale',
       'Age Decade', 'Pure Market Filter',
       'Neigborhood Code (mapping)', 'Town and Neighborhood', 'Description', 'Age', 'Number of Commercial Units', 'Estimate (Land)',
       'Estimate (Building)',
       'Lot Size'], axis=1, inplace=True)
    return X


In [161]:
from sklearn.model_selection import train_test_split

def rmse(predicted, actual):
    """RMSE on the original price scale; both inputs are log sale prices."""
    return np.sqrt(np.mean((np.exp(actual) - np.exp(predicted))**2))

train_data = pd.read_csv('cook_county_contest_train.csv')
y_train = np.log(train_data['Sale Price'])
train_data = train_data.drop(columns=['Sale Price'])
X_train = process_data_fm(train_data)

# Hold out part of the training data to estimate the test error locally
x_selftrain, x_selftest, y_selftrain, y_selftest = train_test_split(X_train, y_train, random_state=42)

model = lm.LinearRegression(fit_intercept=True)
### You can use other models
model.fit(x_selftrain, y_selftrain)
y_selffitted = model.predict(x_selftrain)
training_error = rmse(y_selffitted, y_selftrain)
y_selfpredicted = model.predict(x_selftest)
testing_error = rmse(y_selfpredicted, y_selftest)
training_error, testing_error

(117789.40042244806, 119429.92993688404)

In [162]:
model = lm.LinearRegression(fit_intercept=True)
###You can use other models
model.fit(X_train, y_train)
y_fitted = model.predict(X_train)
def rmse(predicted, actual):
    return np.sqrt(np.mean((np.exp(actual) - np.exp(predicted))**2))
training_error = rmse(y_fitted, y_train)
training_error

118608.7370377128

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **You can use any regression model; the following is just an example.** If your `process_data_fm` satisfies all the specified requirements, the cell should run without error.
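As an illustration of swapping in a different regressor, the sketch below fits scikit-learn's `Ridge` on synthetic data through the same `fit`/`predict` interface that `LinearRegression` exposes. The data, the `alpha` value, and the variable names are arbitrary choices made for the example, not recommendations for this dataset.

```python
import numpy as np
from sklearn import linear_model as lm

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                     # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 12.0    # synthetic log-scale target, no noise

# Ridge adds an L2 penalty on the coefficients; alpha controls its strength
ridge = lm.Ridge(alpha=1.0, fit_intercept=True)
ridge.fit(X_demo, y_demo)
demo_predictions = ridge.predict(X_demo)
print(demo_predictions.shape)
```

Any estimator with this interface can replace `lm.LinearRegression` in the cell below without changing the rest of the notebook.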

**As usual**, your model will predict the log-transformed sale price, and our grading will transform your predictions back to the normal values.
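To make the back-transformation concrete, here is a small worked example with made-up numbers. The helper mirrors the `rmse` function used in this notebook: it takes log-scale predictions and targets, exponentiates both, and reports the error in dollars. The prices below are invented for illustration.

```python
import numpy as np

def rmse_dollars(predicted_log, actual_log):
    """RMSE on the original price scale, given log-scale predictions and targets."""
    return np.sqrt(np.mean((np.exp(actual_log) - np.exp(predicted_log)) ** 2))

actual_log = np.log(np.array([200_000.0, 300_000.0]))
predicted_log = np.log(np.array([210_000.0, 280_000.0]))

# The errors are 10,000 and 20,000 dollars, so RMSE = sqrt((1e8 + 4e8) / 2)
demo_rmse = rmse_dollars(predicted_log, actual_log)
print(demo_rmse)
```

Because the error is measured after exponentiation, a small miss on the log scale for an expensive house contributes far more to the RMSE than the same miss for a cheap one.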

In [163]:
train_data = pd.read_csv('cook_county_contest_train.csv')
y_train = np.log(train_data['Sale Price'])
train_data = train_data.drop(columns=['Sale Price'])
X_train = process_data_fm(train_data)
model = lm.LinearRegression(fit_intercept=True)
###You can use other models
model.fit(X_train, y_train);

In [164]:
model.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'normalize': 'deprecated',
 'positive': False}

In [175]:
def rmse(predicted, actual):
    return np.sqrt(np.mean((np.exp(actual) - np.exp(predicted))**2))
training_error = rmse(y_fitted, y_train)
training_error

118608.7370377128

## Step 4. Make Predictions on the Test Dataset

Run the following cell to estimate the sale price on the test dataset and export your model's predictions as a csv file called `predictions.csv`.

In [165]:
test_data = pd.read_csv('cook_county_contest_test.csv')
X_test = process_data_fm(test_data)
y_test_predicted = model.predict(X_test)
### If your model predicts the log sale price, convert the predictions back to the regular scale
### Check that y_test_predicted has the same range as the (log) sale prices in your training data
predictions = pd.DataFrame({'Sale Price': np.exp(y_test_predicted)})
predictions.to_csv('predictions.csv')
print('Your predictions have been exported as predictions.csv. Please download the file and submit it to Canvas. ')

Your predictions have been exported as predictions.csv. Please download the file and submit it to Canvas. 
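Before uploading, it can help to confirm that the exported values look like prices rather than log prices. The check below is only a sketch: `check_predictions` is a hypothetical helper, the `min_plausible` threshold is an arbitrary assumption, and the frames are toy stand-ins for the real `predictions` data frame.

```python
import numpy as np
import pandas as pd

def check_predictions(predictions, n_expected, min_plausible=1_000.0):
    """Return a list of problems found in a predictions frame (empty if none)."""
    problems = []
    if len(predictions) != n_expected:
        problems.append('row count does not match the test set')
    values = predictions['Sale Price'].to_numpy()
    if not np.all(np.isfinite(values)):
        problems.append('non-finite predictions present')
    elif values.max() < min_plausible:
        # Values this small usually mean the log scale was never undone
        problems.append('predictions look log-scaled')
    return problems

toy_log = pd.DataFrame({'Sale Price': [12.1, 12.6, 11.9]})
toy_dollars = pd.DataFrame({'Sale Price': np.exp(toy_log['Sale Price'])})
print(check_predictions(toy_log, 3))
print(check_predictions(toy_dollars, 3))
```

With the real data, the call would look like `check_predictions(predictions, len(test_data))`.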
