# Airbnb Pricing Prediction: Milestone 4 & 5
**James Gearheart**<br>
**Danny Zhuang**<br>
**Bob Saludo**<br>
**Ryan Wallace**<br><br>
**Harvard University**<br>
**Fall 2016**<br>
**TF: Christine Hwang**<br>
**Due Date: ** Saturday, November 28th, 2016

## Summary of Work and Insights


### Milestone 4

For our baseline model, we were to fit a linear regression model using the relevant features to predict price. However, the main hurdle in this task was massaging/manipulating each of the raw features so that we could fit our model while achieving interpretability and computational efficiency. 

To incorporate the days of the week and holidays, we used our work on the “Average Difference from Listing’s Own Mean Price” from Milestone 3 in visualizing how prices changed throughout the year. Ultimately, we found that the real increase in per-night rental costs came on Friday, Saturday, and around the New Years’ holidays. Additionally, we found that non-holiday dates in January and February showed the lowest prices, which we deem as “slump” dates. Thus, we made categorical variables to denote the day of the week (weekend or no weekend), holiday (3 days around New Years’), and slump dates (January and February dates that aren’t around New Years’). 

Because the categorical variables of neighborhood and zipcode have over 200 distinct values each, one-hot encoding would produce to far too many variables for a linear regression model – leading to long computational time and a small chance of over-fitting.  To solve this, we create four categorical variables for each of these features that separate the neighborhoods and zipcodes by price into quartiles. Each quartile is its own categorical variables (e.g. most expensive 25% of neighborhoods, least expensive 25% of zipcodes). Thus, we move away from trying to account for individual neighborhoods such as “Tribeca” separately and instead choose to analyze the most expensive neighborhoods together.  While we lose some degree of granularity, we believe that what is gained in computational efficiency and streamlined interpretability is well worth it. 

### Milestone 5

Although many of the data transformations were created and visualized in the Milestone 3 assignment, one of the key items that could net beneficial results is to create a clustering algorithm for zip codes and neighborhoods.  The current methodology calculates the median price for each of the zip codes and neighborhoods and sorts them by their median price.  This list of zip codes and their median prices are then grouped by quartiles.  Although this method provided a net lift in the predictive model’s accuracy, we believe that an algorithmic approach based on k-means clustering would be a better approach to determine related groups.   

The current baseline model to predict daily price for the AirBnB dataset is a linear regression model which currently has an R^2 of 37%.  We believe that the predictive accuracy could be improved by exploring additional predictive models designed for a continuous dependant variable.  We will build and analyze Ridge Regression and Lasso Regression models to compare to our baseline standard linear regression model.  These models are designed to predict continuous data that has a large number of predictive variables.  We believe that one of these models will provide us with greater accuracy than the current baseline model.

Since our baseline and proposed methods result in a continuous number price prediction (say, $109.01 per night), we ultimately hope to generate an interval around our continuous predictions in order to provide a range of pricing options, as this seems more reasonable and useful for AirBnB hosts. We plan to do this by analyzing the prediction or confidence intervals for the results of our regressions to create ranges around our predictions in which we hope a certain fixed percentage of the true prices fall.

---

In [1]:
# import necessary libraries
import csv
import datetime
import operator
import random
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as LinReg
%matplotlib inline

In [2]:
# read the three datasets
listings_df = pd.read_csv('datasets/listings_final.csv', delimiter=',', index_col=0)
reviews_df = pd.read_csv('datasets/reviews_final.csv', delimiter=',', index_col=0)
calendar_df = pd.read_csv('datasets/calendar_final.csv', index_col=0)

# log transform prices in calendar
calendar_df['price_log'] = np.log(calendar_df['price'])

# create calendar with listings data added
calendar_expanded_df = calendar_df.merge(listings_df, on='listing_id', how='left', suffixes=['_calendar', '_listings'])

  interactivity=interactivity, compiler=compiler, result=result)


In [3]:
# strip errant NaN's
cols = ['bathrooms', 'bedrooms', 'beds', 'accommodates', 'longitude', 'neighbourhood', 
        'zipcode', 'date', 'price_log_calendar']
calendar_expanded_df.replace([np.inf, -np.inf], np.nan)
calendar_expanded_df = calendar_expanded_df.dropna(subset=cols)

In [4]:
# method to convert date to day of week
def get_day(date):
    return datetime.datetime.strptime(date, '%Y-%m-%d').strftime('%A')

In [5]:
# create indicators for time variables
# weekend
dates = np.array(calendar_expanded_df['date'])
days = [get_day(date) for date in dates]
weekend = [1 if day == 'Friday' or day == 'Saturday' else 0 for day in days]
calendar_expanded_df['weekend'] = pd.Series(np.array(weekend), index=calendar_expanded_df.index)

# major holidys (around New Years)
holiday_dates = ['2015-01-01', '2015-01-02', '2015-01-03']
holiday = [1 if date in holiday_dates else 0 for date in dates]
calendar_expanded_df['holiday'] = pd.Series(np.array(holiday), index=calendar_expanded_df.index)

# not January (excluding Holidays) or February
slump_dates = []
for d in range(4, 10):
    slump_dates.append('2015-01-0' + str(d))
for d in range(10, 32):
    slump_dates.append('2015-01-' + str(d))
for d in range(1, 10):
    slump_dates.append('2015-02-0' + str(d))
for d in range(10, 29):
    slump_dates.append('2015-01-' + str(d))
slump = [1 if date in slump_dates else 0 for date in dates]
calendar_expanded_df['slump'] = pd.Series(np.array(slump), index=calendar_expanded_df.index)

In [6]:
# find means by zipcode and group into zipcodes
neighborhoods = calendar_expanded_df['neighbourhood'].unique()
zipcodes = calendar_expanded_df['zipcode'].unique()

neighborhood_prices = []
for neighborhood in neighborhoods:
    neighborhood_prices.append((neighborhood, np.mean(np.array(listings_df[listings_df['neighbourhood'] == neighborhood]['price']))))

zipcode_prices = []
for zipcode in zipcodes:
    zipcode_prices.append((zipcode, np.mean(np.array(listings_df[listings_df['zipcode'] == zipcode]['price']))))
    
# group zipcodes and neighborhoods into quartiles by average
neighborhood_prices.sort(key=operator.itemgetter(1), reverse=True)
zipcode_prices.sort(key=operator.itemgetter(1), reverse=True)

# find size of quartiles
neighborhood_quartile_size = int(np.round(len(neighborhood_prices)*0.25))
zipcode_quartile_size = int(np.round(len(zipcode_prices)*0.25))

# break up neighboorhood and zipcodes by quartile
neighborhood_1 = neighborhood_prices[:neighborhood_quartile_size]
neighborhood_2 = neighborhood_prices[neighborhood_quartile_size:2*neighborhood_quartile_size]
neighborhood_3 = neighborhood_prices[2*neighborhood_quartile_size:3*neighborhood_quartile_size]
neighborhood_4 = neighborhood_prices[3*neighborhood_quartile_size:]

zipcode_1 = zipcode_prices[:zipcode_quartile_size]
zipcode_2 = zipcode_prices[zipcode_quartile_size:2*zipcode_quartile_size]
zipcode_3 = zipcode_prices[2*zipcode_quartile_size:3*zipcode_quartile_size]
zipcode_4 = zipcode_prices[3*zipcode_quartile_size:]

# create new indicators for each quartile
neighborhoods = np.array(calendar_expanded_df['neighbourhood'])
zipcodes = np.array(calendar_expanded_df['zipcode'])

neighborhood_q1 = []
neighborhood_q2 = []
neighborhood_q3 = []
neighborhood_q4 = []

zipcode_q1 = []
zipcode_q2 = []
zipcode_q3 = []
zipcode_q4 = []

for neighborhood in neighborhoods:
    if neighborhood in neighborhood_1:
        neighborhood_q1.append(1)
    else:
        neighborhood_q1.append(0)
    
    if neighborhood in neighborhood_2:
        neighborhood_q2.append(1)
    else:
        neighborhood_q2.append(0)
    
    if neighborhood in neighborhood_3:
        neighborhood_q3.append(1)
    else:
        neighborhood_q3.append(0)
        
    if neighborhood in neighborhood_4:
        neighborhood_q4.append(1)
    else:
        neighborhood_q4.append(0)
        
for zipcode in zipcodes:
    if zipcode in zipcode_1:
        zipcode_q1.append(1)
    else:
        zipcode_q1.append(0)
    
    if zipcode in zipcode_2:
        zipcode_q2.append(1)
    else:
        zipcode_q2.append(0)
    
    if zipcode in zipcode_3:
        zipcode_q3.append(1)
    else:
        zipcode_q3.append(0)
        
    if zipcode in zipcode_4:
        zipcode_q4.append(1)
    else:
        zipcode_q4.append(0)

In [7]:
# extract relevant feature listing
relevant_vars = ['bathrooms', 'bedrooms', 'beds', 'accommodates', 'longitude', 
                 'weekend', 'holiday', 'slump']
X_df = calendar_expanded_df[relevant_vars].copy()
y_df = calendar_expanded_df[['price_log_calendar']].copy()

# numpy for sklearn
X = X_df.as_matrix()
y = y_df.as_matrix()

In [8]:
# convert zipcode, neighborhood lists to np arrays
zipcode_q1 = np.resize(np.array(zipcode_q1), (len(zipcode_q1), 1))
zipcode_q2 = np.resize(np.array(zipcode_q2), (len(zipcode_q2), 1))
zipcode_q3 = np.resize(np.array(zipcode_q3), (len(zipcode_q3), 1))
zipcode_q4 = np.resize(np.array(zipcode_q4), (len(zipcode_q4), 1))

neighborhood_q1 = np.resize(np.array(neighborhood_q1), (len(neighborhood_q1), 1))
neighborhood_q2 = np.resize(np.array(neighborhood_q2), (len(neighborhood_q2), 1))
neighborhood_q3 = np.resize(np.array(neighborhood_q3), (len(neighborhood_q3), 1))
neighborhood_q4 = np.resize(np.array(neighborhood_q4), (len(neighborhood_q4), 1))

# add categorical vars to X
Xy = np.concatenate((X, zipcode_q1, zipcode_q2, zipcode_q3, zipcode_q4, 
                neighborhood_q1, neighborhood_q2, neighborhood_q3, 
                neighborhood_q4, y), axis=1)

# format for sklearn
Xy.astype(np.float32, copy=False)

array([[ 1.        ,  1.        ,  2.        , ...,  0.        ,
         0.        ,  6.39692974],
       [ 1.        ,  1.        ,  2.        , ...,  0.        ,
         0.        ,  6.39692974],
       [ 1.        ,  1.        ,  2.        , ...,  0.        ,
         0.        ,  6.39692974],
       ..., 
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,
         0.        ,  4.60517025],
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,
         0.        ,  4.60517025],
       [ 1.        ,  1.        ,  1.        , ...,  0.        ,
         0.        ,  4.60517025]], dtype=float32)

In [9]:
# Split into training and testing
# use 75% for training, the rest for testing
num_train = int(np.round(Xy.shape[0]*0.75))

# shuffle for random selection
random.shuffle(Xy)

# pull out sets
X_train = Xy[:num_train,:(-1)]
X_test = Xy[num_train:,:(-1)]
y_train = Xy[:num_train, -1]
y_test = Xy[num_train:, -1]

X_train = np.nan_to_num(X_train)
X_test = np.nan_to_num(X_test)
y_train = np.nan_to_num(y_train)
y_test = np.nan_to_num(y_test)

In [10]:
# fit simple linear regression
linear_model = LinReg()
linear_model.fit(X_train, y_train)
print 'R^2 in test: ', linear_model.score(X_test, y_test)

ValueError: array must not contain infs or NaNs