# Project Part II: Predicting Housing Prices - Build Your Own Model

 

### Grading Scheme

Your grade for the project will be based on your training RMSE and test RMSE. The thresholds are as follows:

Points | 9 | 7 | 5 | 3
--- | --- | --- | --- | ---
Training RMSE | Less than 60k | [60k, 120k) | [120k, 200k) | More than 230k

Points | 9 | 7 | 5 | 3
--- | --- | --- | --- | ---
Test RMSE | Less than 65k | [65k, 130k) | [130k, 230k) | More than 230k

The top 20% of the submissions with the least testing errors will receive the additional two points


In [1]:
# Some Imports You Might Need
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import zipfile
import os

from sklearn.preprocessing import OneHotEncoder
from sklearn import linear_model as lm

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12

# Extract Dataset
with zipfile.ZipFile('cook_county_contest_data.zip') as item:
    item.extractall()
    
    
### Note: we filtered the data in cook_county_contest_data, 
####so please use this dataset instead of the old one.

### Note

This notebook is specifically designed to guide you through the process of exporting your model's predictions on the test dataset for submission so you can see how your model performs.

Most of what you have done in project part I should be transferrable here. 

## Step 1. Set up all the helper functions for your `process_data_fm` function.

**Copy-paste all of the helper functions your `process_data_fm` need here in the following cell**. You **do not** have to fill out all of the functions in the cell below -- only fill out those that are actually useful to your feature engineering pipeline.

In [2]:
def add_total_bedrooms(data):
    """
    Input:
      data (data frame): a data frame containing at least the Description column.
    """
    with_rooms = data.copy()
    with_rooms['Bedrooms'] = with_rooms['Description'].str.extract(r'(\d+) of which').fillna(0).astype(int)
    return with_rooms

def ohe_roof_material(data):
    """
    One-hot-encodes roof material.  New columns are of the form 0x_QUALITY.
    """
    one_hot_encoder = OneHotEncoder()
    data[one_hot_encoder.get_feature_names()] = one_hot_encoder.fit_transform(data[['Roof Material']]).todense()
    return data

def remove_outliers(data, variable, lower=-np.inf, upper=np.inf):
    """
    Input:
      data (data frame): the table to be filtered
      variable (string): the column with numerical outliers
      lower (numeric): observations with values lower than this will be removed
      upper (numeric): observations with values higher than this will be removed
    
    Output:
      a winsorized data frame with outliers removed
      
    Note: This function should not change mutate the contents of data.
    """  
    return data[(data[variable] <= upper) & (data[variable] >= lower)]

def add_bathrooms(data):
    with_rooms = data.copy()
    with_rooms['Bathrooms'] = with_rooms['Description'].str.extract(r'(d+) of which are bathrooms').fillna(0).astype(int)
    return with_rooms

def add_total_rooms(data):
    
    with_rooms = data.copy()
    with_rooms['Rooms'] = with_rooms['Description'].str.extract(r'has a total of (\d+) rooms').fillna(0).astype(int)
    return with_rooms
def substitute_roof_material(data):
    """
    Input:
      data (data frame): a data frame containing a 'Roof Material' column.  Its values
                         should be limited to those found in the codebook
    Output:
      data frame identical to the input except with a refactored 'Roof Material' column
    """
    data = data.replace({'Roof Material': {1: 'Shingle/Asphalt',2: 'Tar&Gravel',3: 'Slate',4: 'Shake',5: 'Tile',6: 'Other' }})
    return data


# This makes the train-test split in this section reproducible across different runs 
# of the notebook. You do not need this line to run train_test_split in general

# DO NOT CHANGE THIS LINE
np.random.seed(1337)
# DO NOT CHANGE THIS LINE

def train_test_split(data):
    data_len = data.shape[0]
    shuffled_indices = np.random.permutation(data_len)
    train_indices = shuffled_indices[:int(0.8*data_len)]
    test_indices = shuffled_indices[int(0.8*data_len):]
    return data.iloc[train_indices], data.iloc[test_indices]








def process_data_gm(data, pipeline_functions, prediction_col):
    """Process the data for a guided model."""
    
    for function, arguments, keyword_arguments in pipeline_functions:
        if keyword_arguments and (not arguments):
            data = data.pipe(function, **keyword_arguments)
        elif (not keyword_arguments) and (arguments):
            data = data.pipe(function, *arguments)
        else:
            data = data.pipe(function)
    X = data.drop(columns=[prediction_col]).to_numpy()
    y = data.loc[:, prediction_col].to_numpy()
    return X, y

def select_columns(data, *columns):
    """Select only columns passed as arguments."""
    return data.loc[:, columns]

## Step 2. Setup your `process_data_fm` function

**Create your implementation of `process_data_fm` from into the following cell.**

Here are a few additional things **you should check and change to make sure your `process_data_fm` function satisfies**:
- Unlike part 1, we will not be expecting your `process_data_fm` function to return both the design matrix `X` and the observed target vector `y`; your function should now **only return X**.


In [3]:
train_data = pd.read_csv('cook_county_contest_data/cook_county_contest_train.csv')
train_data.head()
test_data = pd.read_csv('cook_county_contest_data/cook_county_contest_test.csv')

In [4]:
# Define any additional helper functions you need here

np.random.seed(1338)
# DO NOT CHANGE THIS LINE

# Process the data using the pipeline for the second model
train_m3, test_m3 = train_test_split(train_data)
def log_transformations(data, col):
    data['Log ' + col] = np.log(data[col])
    return data
train_m3 = remove_outliers(train_m3, 'Sale Price', lower = 45000, upper = 300000)
train_m3 = substitute_roof_material(train_m3)
train_m3 = ohe_roof_material(train_m3)
one_hot_roof_cols = train_m3.filter(regex='^0x').columns.tolist()


m3_pipelines = [
    (remove_outliers, ["Sale Price", 499], None),
    (log_transformations, ["Sale Price"], None),
    (log_transformations, ["Building Square Feet"], None),
    (add_total_bedrooms, None, None),
    (add_bathrooms, None, None),
    (add_total_rooms, None, None),
    (select_columns, ['Log Sale Price', 'Bedrooms', 'Log Building Square Feet','Bathrooms','Rooms','Age', 'Lot Size','Road Proximity','Repair Condition'], None)
]


X_train_m3, y_train_m3 = process_data_gm(train_m3, m3_pipelines, 'Log Sale Price')
X_test_m3, y_test_m3 = process_data_gm(test_m3, m3_pipelines, 'Log Sale Price')

# Please include all of your feature engineering process inside this function.
# Do not modify the parameters of this function.
def process_data_fm(data):
    
    # Return predictors and response variables separately
    X = data.drop(['Log Sale Price'], axis = 1)
    y = data.loc[:, 'Log Sale Price']
    
    return X, y

## Step 3. Train your model

Run the following cell to import the new set of training data to fit your model on. **You can use any regression model, the following is just an example** If your `process_data_fm` satisfies all the specified requirements, the cell should run without any error.

**As usual**, your model will predict the log-transformed sale price, and our grading will transform your predictions back to the normal vlaues.

In [5]:
from sklearn import linear_model as lm

linear_model_m3 = lm.LinearRegression(fit_intercept=True)

In [6]:
# Fit the 1st model
# Compute the fitted and predicted values of Sale Price for 1st model

y_fitted_m3 = linear_model_m3.fit(X_train_m3, y_train_m3).predict(X_train_m3)
y_predicted_m3 = linear_model_m3.fit(X_test_m3, y_test_m3).predict(X_test_m3)

## Step 4. Make Predictions on the Test Dataset

Run the following cell to estimate the sale price on the test dataset and export your model's predictions as a csv file called `predictions.csv`.

In [7]:
def rmse(predicted, actual):
  """
  Calculates RMSE from actual and predicted values
  Input:
    predicted (1D array): vector of predicted/fitted values
    actual (1D array): vector of actual values
  Output:
    a float, the root-mean square error
  """
  return np.sqrt(np.mean((actual - predicted)**2))

In [8]:
# Training and test errors for the 1st model
training_error_m3 = rmse(y_fitted_m3, y_train_m3)


# Training and test errors for the 1st model (in its original values before the log transform)
training_error_m3_delog = rmse(np.exp(y_fitted_m3), np.exp(y_train_m3)) 

print("Model\nTraining RMSE: {}\n".format(training_error_m3))
print("Model (no log transform)\nTraining RMSE: {}\n".format(training_error_m3_delog))


Model
Training RMSE: 0.47350419752441103

Model (no log transform)
Training RMSE: 69752.60033437963



In [9]:
#test_data = pd.read_csv('cook_county_contest_test.csv')
#X_test = process_data_fm(test_data)
#y_test_predicted = model.predict(X_test)
###If you took log in the prediction, please convert it back to regular scale
###Check y_test_predicts has the same range of your sale price in training
predictions = pd.DataFrame({'Sale Price': y_test_m3})
predictions = np.exp(predictions)
predictions.to_csv('predictions.csv')
print('Your predictions have been exported as predictions.csv. Please download the file and submit it to Canvas. ')

Your predictions have been exported as predictions.csv. Please download the file and submit it to Canvas. 
