# Price Prediction
- __The Goal__: A scraper collects data on New York City apartments from Craigslist every hour. I would like to use that data to build a model that predicts the price of an apartment. Similar to a Zestimate on Zillow, the prediction can serve as a reference for whether an apartment is a good "deal".


- __What This Notebook Is__: We've already conducted an exploratory analysis of the data. Now we will use what we learned to build a predictive model. The ultimate goal is a functioning web app, so many of the functions used during this exploratory model building will need to be exported in order to handle incoming data. This notebook builds the model, which we can export via pickle, and defines many of the functions for encoding incoming data.

# Import Our Libraries

In [71]:
import pandas as pd
import numpy as np
import warnings

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Notebook Settings

In [72]:
pd.options.display.max_columns = None

# Read in our data
Read in the cleaned data. Columns that the model won't take into account (things like ID, address, etc.) are removed during preprocessing. These features are discussed more in the exploratory analysis.


In [73]:
df = pd.read_csv('../../notebooks/housing_cleaned.csv')

# Preprocessing
Most of our data was already preprocessed in the Data Cleansing notebook, but there are still a few things we need to take care of for our model.

## Fix postal code variable
The data comes in as a float, so we need to fix that. We will also fill missing values with a placeholder postal code of 99999. This column will later be one-hot encoded, so the exact placeholder value is irrelevant as long as it does not collide with a real postal code.

In [74]:
df['postalCodeChopped'] = df['postalCodeChopped'].fillna(99999).astype(int).astype(str)
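To see why the cast goes through `int` before `str`, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy postal codes stored as floats, with one missing value (illustrative only)
codes = pd.Series([10001.0, 11211.0, np.nan])

# Casting straight to str keeps the trailing ".0" and turns NaN into "nan"
naive = codes.astype(str)

# Filling first, then casting float -> int -> str, gives clean category labels
cleaned = codes.fillna(99999).astype(int).astype(str)

print(naive.tolist())
print(cleaned.tolist())
```

The `fillna` call has to come first, because a float column containing NaN cannot be cast to `int`.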

## Fill in missing values for "Where"
This is a neighborhood field that the creator of the ad fills in. If they leave it blank, we fill it with "Not Specified", which will be encoded shortly.

In [75]:
# Assign the result back instead of using inplace=True on a column selection
df['where'] = df['where'].fillna("Not Specified")

## Remove duplicates
Many ads are reposted if the apartment does not find a buyer after a certain number of days. We don't want these duplicates to skew our model, so we will remove them. A simple way to do this is to drop ads with the same name, since the user often just reposts the ad unchanged.

In [76]:
df = df.drop_duplicates(subset=['name'])
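By default, `drop_duplicates` keeps the first row for each name. The sketch below uses made-up ads (not rows from the dataset) to show the effect, along with one limitation of matching on the exact name: a repost whose title was edited slightly is not caught.

```python
import pandas as pd

# Three hypothetical ads: an exact repost and a repost with an edited title
ads = pd.DataFrame({
    "name": ["Sunny 2BR in Astoria", "Sunny 2BR in Astoria", "Sunny 2BR in Astoria!!"],
    "price": [2400, 2350, 2350],
})

# Keeps the first occurrence of each distinct name
deduped = ads.drop_duplicates(subset=["name"])
print(deduped)
```

If the most recent price is the one that matters, passing `keep="last"` on a frame sorted by posting date would be one alternative, assuming such a date column is available.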

## Remove columns not needed for the model
This is discussed more in the Data Exploration notebook, but we will not use all of our columns in the model, so we remove the unneeded columns here.

In [77]:
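# Keep only the model features and the target (price).
# .copy() avoids a SettingWithCopyWarning when columns are assigned later.
df = df[['bedrooms', 'bikeScore', 'transitScore',
         'walkScore', 'distanceToNearestIntersection', 'has_image',
         'has_map', 'neighborhood', 'advertises_no_fee', 'is_repost',
         'sideOfStreetEncoded', 'price']].copy()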

## Encode neighborhood

In [78]:
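# Normalize neighborhood names (spaces -> underscores, lowercase) so they make clean column names
df['neighborhood'] = df['neighborhood'].str.replace(' ', '_').str.lower()
# One-hot encode the neighborhood and append the dummy columns
df = pd.concat([df, pd.get_dummies(df['neighborhood'], prefix='neighborhood')], axis=1)
# Drop the original text column now that it is encoded
df = df.drop('neighborhood', axis=1)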
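Since the same encoding has to be applied to incoming data in the web app, it may help to wrap it in a function. The sketch below is one possible shape, not the notebook's actual helper: the name `encode_neighborhood` and the `expected_columns` parameter are hypothetical. A single incoming listing only produces one dummy column, so the function reindexes the result against the columns seen during training.

```python
import pandas as pd

def encode_neighborhood(frame, expected_columns=None):
    """One-hot encode `neighborhood`; optionally align to the training columns."""
    out = frame.copy()
    out["neighborhood"] = (
        out["neighborhood"].str.replace(" ", "_", regex=False).str.lower()
    )
    dummies = pd.get_dummies(out["neighborhood"], prefix="neighborhood")
    out = pd.concat([out.drop(columns="neighborhood"), dummies], axis=1)
    if expected_columns is not None:
        # Add dummy columns missing from this batch as zeros and drop unseen ones
        out = out.reindex(columns=expected_columns, fill_value=0)
    return out

# Toy training frame (illustrative values only)
train = pd.DataFrame({
    "bedrooms": [1.0, 2.0],
    "neighborhood": ["Upper East Side", "Sunset Park"],
})
train_encoded = encode_neighborhood(train)

# A single incoming listing is aligned to the training layout
incoming = pd.DataFrame({"bedrooms": [3.0], "neighborhood": ["Sunset Park"]})
incoming_encoded = encode_neighborhood(incoming, expected_columns=train_encoded.columns)
print(incoming_encoded)
```

In practice `expected_columns` would be the feature columns only, without `price`, since incoming listings have no target value.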

In [79]:
df.head()

Unnamed: 0,bedrooms,bikeScore,transitScore,walkScore,distanceToNearestIntersection,has_image,has_map,advertises_no_fee,is_repost,sideOfStreetEncoded,price,neighborhood_borough_park,neighborhood_bronx_park_and_fordham,neighborhood_bushwick_and_williamsburg,neighborhood_canarsie_and_flatlands,neighborhood_central_bronx,neighborhood_central_brooklyn,neighborhood_central_harlem,neighborhood_central_queens,neighborhood_chelsea_and_clinton,neighborhood_east_harlem,neighborhood_east_new_york_and_new_lots,neighborhood_flatbush,neighborhood_gramercy_park_and_murray_hill,neighborhood_greenpoint,neighborhood_greenwich_village_and_soho,neighborhood_high_bridge_and_morrisania,neighborhood_hunts_point_and_mott_haven,neighborhood_inwood_and_washington_heights,neighborhood_jamaica,neighborhood_kearney,neighborhood_kingsbridge_and_riverdale,neighborhood_lower_east_side,neighborhood_lower_manhattan,neighborhood_mid-island,neighborhood_no_neighhood_found,neighborhood_north_queens,neighborhood_northeast_bronx,neighborhood_northeast_queens,neighborhood_northwest_brooklyn,neighborhood_northwest_queens,neighborhood_port_richmond,neighborhood_queens,neighborhood_rockaways,neighborhood_south_shore,neighborhood_southeast_bronx,neighborhood_southeast_queens,neighborhood_southern_brooklyn,neighborhood_southwest_brooklyn,neighborhood_southwest_queens,neighborhood_stamford,neighborhood_sunset_park,neighborhood_upper_east_side,neighborhood_upper_west_side,neighborhood_west_central_queens,neighborhood_west_queens
0,3.0,64.0,100.0,92.0,0.0,1,1,1,0,1.0,2700,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,1.0,88.0,100.0,98.0,203.483553,1,1,0,1,0.0,2600,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,3.0,79.0,97.0,94.0,0.013114,1,1,0,0,1.0,2875,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,3.0,79.0,97.0,94.0,0.013114,1,1,1,0,1.0,2800,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,1.0,81.0,100.0,93.0,61.301497,1,1,1,1,0.0,3500,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [80]:
df.dtypes

bedrooms                                      float64
bikeScore                                     float64
transitScore                                  float64
walkScore                                     float64
distanceToNearestIntersection                 float64
has_image                                       int64
has_map                                         int64
advertises_no_fee                               int64
is_repost                                       int64
sideOfStreetEncoded                           float64
price                                           int64
neighborhood_borough_park                       uint8
neighborhood_bronx_park_and_fordham             uint8
neighborhood_bushwick_and_williamsburg          uint8
neighborhood_canarsie_and_flatlands             uint8
neighborhood_central_bronx                      uint8
neighborhood_central_brooklyn                   uint8
neighborhood_central_harlem                     uint8
neighborhood_central_queens 

# Model Time
Now that we have a fully numeric feature matrix, we can build our model. We will try several different models with the goal of minimizing mean squared error (MSE).
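As a rough sketch of what comparing models on MSE can look like, the example below scores two candidates with 5-fold cross-validation. The two models are assumptions chosen for illustration, and the data is synthetic, standing in for the feature matrix and the `price` column. It is not the notebook's actual model selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the features and price target (not real listings)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
weights = np.array([300.0, -150.0, 80.0, 0.0, 40.0])
y = X @ weights + 2500 + rng.normal(scale=100, size=200)

# Hypothetical candidate models
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mse_by_model = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MSE is exposed as its negative
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    mse_by_model[name] = -scores.mean()

# The model with the lowest cross-validated MSE
best = min(mse_by_model, key=mse_by_model.get)
print(mse_by_model, best)
```

Cross-validation gives a less optimistic estimate than scoring on the training data, which matters when comparing a flexible model against a simple one.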