# Data Modeling Seasonality

Now that we've explored what the seasonality dataset looks like, the goal is to adjust the average prediction from the initial regression to account for the seasonality of pricing. In practice, this means scaling the average prediction up or down based on the day we are projecting the price for. The same approach could be applied to months of the year and to holidays. This should reduce the residuals: seasonal pricing makes the residuals correlated with the time of year, which violates the OLS assumption of independent errors. We use averages as a simple way to explore seasonality. With several years of data, more advanced approaches such as an ARIMA or SARIMA model could be used instead.

In [2]:
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
from sklearn import linear_model
import sklearn.metrics as metrics
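# sklearn.grid_search and sklearn.cross_validation were removed in scikit-learn 0.20;
# both helpers now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.model_selection import train_test_split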
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as Lin_Reg
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pylab
import scipy.stats as stats
%matplotlib inline
import datetime as dt
from datetime import datetime

### Process Overview:
The general idea behind this analysis is as follows. First, we aggregate prices by weekday for each listing. Next, we divide each listing's weekday prices by its Monday price, which gives a per-listing multiplier for each day. Then, for each day, we average the multipliers across all listings to get a single multiplier per day. Lastly, we apply these multipliers to a subset of the listings and compare the resulting predictions to the observed prices.
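The first steps of this process can be sketched on a toy table. This is a minimal illustration, not the notebook's actual data: the three listings, their prices, and the names `toy`, `per_listing`, and `toy_multiplier` are all made up for the example.

```python
import pandas as pd

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical average nightly price per weekday for three made-up listings.
toy = pd.DataFrame(
    {
        "Mon": [100.0, 200.0, 50.0],
        "Tue": [100.0, 198.0, 50.0],
        "Wed": [100.0, 198.0, 50.0],
        "Thu": [100.0, 200.0, 50.0],
        "Fri": [110.0, 220.0, 50.0],
        "Sat": [110.0, 220.0, 50.0],
        "Sun": [100.0, 200.0, 50.0],
        "listing_id": [1, 2, 3],
    }
)

# Step 1: divide each listing's weekday prices by its own Monday price.
per_listing = toy[DAYS].div(toy["Mon"], axis=0)

# Step 2: average the ratios across listings to get one multiplier per weekday.
toy_multiplier = per_listing.mean().to_dict()
print(toy_multiplier)
```

Normalizing per listing before averaging keeps expensive listings from dominating the result, since every listing contributes a ratio rather than a dollar amount.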

In [3]:
#Importing Datafile
results_nona = pd.read_csv('../datasets/seasonality_tomodel.csv')
results_multiplier = pd.read_csv('../datasets/seasonality_tomodel.csv')
b=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
for i in b[1:7]:
    results_multiplier[i] = results_multiplier[i]/results_multiplier['Mon']
results_multiplier['Mon']= 1
results_multiplier.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
0,1,1.0,1.0,1.0,1.0,1.0,1.0,3604481.0
1,1,1.0,1.0,1.0,1.0,1.0,1.0,2949128.0
2,1,0.991826,0.991826,0.999846,1.138965,1.138965,1.0,4325397.0
3,1,1.0,1.0,1.0,1.0,1.0,1.0,4325398.0
4,1,0.991494,0.994952,1.004395,1.015027,1.011263,0.99895,3426149.0


We see that the dataframe now contains a multiplier for each day of the week for each listing. Next, we average across all listings to get a single average multiplier for each day.

In [4]:
# Average each weekday's multiplier across all listings (computed once, not per key)
multiplier = results_multiplier[b].mean().to_dict()
multiplier

{'Fri': 1.0306309142782333,
 'Mon': 1.0,
 'Sat': 1.0306744793102525,
 'Sun': 1.0009269127770359,
 'Thu': 1.0003671343632679,
 'Tue': 0.99980071473169119,
 'Wed': 0.9998361380714712}

The results are in line with what we saw earlier in our seasonality-exploration file. Tuesday and Wednesday prices dip very slightly below Monday's (about 99.98% of the Monday price), while Friday and Saturday prices are noticeably higher (about 103%). These are the multipliers we will use to apply seasonality to the average predictions from our previous models. Here, we apply them to RidgeCV, the best one that we found.
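As a rough sketch of how these multipliers could be applied to a model prediction, the snippet below scales a predicted price by the multiplier for the weekday of the stay. The function name `seasonal_price`, the example date, and the $150 baseline are hypothetical. It also assumes the model predicts log price and that its prediction can be treated as the Monday baseline, which the averages above only approximately justify.

```python
import math
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Average weekday multipliers reported above, rounded to four decimals.
MULTIPLIER = {
    "Mon": 1.0000, "Tue": 0.9998, "Wed": 0.9998, "Thu": 1.0004,
    "Fri": 1.0306, "Sat": 1.0307, "Sun": 1.0009,
}


def seasonal_price(predicted_log_price, stay_date, multiplier=MULTIPLIER):
    """Scale a log-price prediction by the multiplier for stay_date's weekday."""
    base_price = math.exp(predicted_log_price)  # undo the log transform
    day = DAYS[stay_date.weekday()]             # date.weekday(): Monday == 0
    return base_price * multiplier[day]


# Hypothetical prediction: a log price equal to log(150), i.e. a $150 baseline.
friday = date(2016, 7, 1)  # 2016-07-01 fell on a Friday
print(seasonal_price(math.log(150.0), friday))
```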

## Predicting Prices Using Our Seasonality Averages

Now we test how well the averages we arrived at perform. We refit the RidgeCV regression, one of the best of the models we ran, and apply the seasonality multipliers to it.

In [5]:
#We import the data and rerun the RidgeCV Regression
data = pd.read_csv('../datasets/listings_clean.csv')
data.head()
# split into x and y (note that we do not include id and host_id as predictors)
x = data.iloc[:, 2:-2]
y = data.iloc[:, -2]
y_log = data.iloc[:, -1]

In [6]:
data.head(5)

Unnamed: 0,id,host_id,accommodates,bathrooms,bedrooms,beds,minimum_nights,availability_30,number_of_reviews,host_listing_count,...,50-59,60-69,70-79,80-84,85-89,90-94,95-100,No Reviews,price,price_log
0,1069266,5867023,-0.520266,-0.331542,-0.407402,-0.493039,0.173906,0.390393,2.716107,-0.355961,...,0,0,0,0,1,0,0,0,160.0,5.075174
1,2061725,4601412,-0.520266,-0.331542,-0.407402,0.381672,0.173906,-0.965897,1.295605,0.933455,...,0,0,0,0,0,0,1,0,58.0,4.060443
2,44974,198425,-0.520266,-0.331542,-0.407402,-0.493039,2.889531,-1.205242,0.822104,-0.355961,...,0,0,0,0,0,0,1,0,185.0,5.220356
3,4701675,22590025,-0.520266,-0.331542,-0.407402,0.381672,-0.601986,1.108429,-0.493176,-0.355961,...,0,0,0,0,0,0,1,0,195.0,5.273
4,68914,343302,1.690892,-0.331542,1.266328,1.256383,-0.21404,-0.407424,0.295992,0.073844,...,0,0,0,0,0,0,1,0,165.0,5.105945


In [7]:
reg_params = 10.**np.linspace(-10, 5, 10)
RidgeCV_model = RidgeCV(alphas=reg_params, fit_intercept=True, cv=5)

In [8]:
RidgeCV_model.fit(x,y_log)

RidgeCV(alphas=array([  1.00000e-10,   4.64159e-09,   2.15443e-07,   1.00000e-05,
         4.64159e-04,   2.15443e-02,   1.00000e+00,   4.64159e+01,
         2.15443e+03,   1.00000e+05]),
    cv=5, fit_intercept=True, gcv_mode=None, normalize=False, scoring=None,
    store_cv_values=False)

In [9]:
sample = results_nona.sample(frac=0.4,axis=0)
len(sample)

1092

In [10]:
# Some of the ids in the sample cannot be found in the listings data, so at the end we also filter the sample dataframe so that both contain the same listings
sample_variables=data.loc[data['id'].isin(sample['listing_id'])]
sample_variables.head(5)
sample_variables.shape
sample = sample.loc[sample['listing_id'].isin(sample_variables['id'])]

In [19]:
sample['Mon']

435     550.000000
8        70.000000
463     599.000000
1912    200.000000
2552    150.000000
71      144.000000
2223     90.901961
2062    170.000000
1890    125.000000
273     207.000000
1066    200.000000
2369    159.000000
634      90.000000
2346    238.000000
161      60.576923
2063    325.723404
620     225.000000
756     150.000000
22      104.285714
508     300.000000
2406    143.000000
1294     95.000000
2609    110.000000
2019    225.000000
2157    525.000000
505      95.000000
1929     41.935484
470      70.000000
58      120.000000
1256    185.862069
           ...    
554     151.442308
1925    425.555556
1024     89.313725
2515     50.000000
1170     58.000000
1290     80.882353
2000     50.000000
787     210.000000
10      150.000000
1530    189.300000
1176     70.000000
1818    100.000000
334     299.000000
1219    350.000000
1332    109.000000
209      95.000000
652     130.000000
259      95.000000
753      86.707317
1586    278.644444
2058    125.000000
2693    389.

In [12]:
X_sample = sample_variables.iloc[:, 2:-2]
#note y_test is the un-exponentiated price

In [13]:
new_predictions = sample.copy()
# Zero out the seven weekday columns by label; .loc with an integer slice fails on string column labels
new_predictions[b] = 0

In [20]:
new_predictions['Mon']=sample['Mon']
for i in b[1:]:
    new_predictions[i]=new_predictions['Mon']*multiplier[i]

In [27]:
new_predictions.iloc[:,1:-1]

Unnamed: 0,Tue,Wed,Thu,Fri,Sat,Sun
435,549.890393,549.909876,550.201924,566.847003,566.870964,550.509802
8,69.986050,69.988530,70.025699,72.144164,72.147214,70.064884
463,598.880628,598.901847,599.219913,617.347918,617.374013,599.555221
1912,199.960143,199.967228,200.073427,206.126183,206.134896,200.185383
2552,149.970107,149.975421,150.055070,154.594637,154.601172,150.139037
71,143.971303,143.976404,144.052867,148.410852,148.417125,144.133475
2223,90.883845,90.887065,90.935334,93.686371,93.690331,90.986219
2062,169.966122,169.972143,170.062413,175.207255,175.214661,170.157575
1890,124.975089,124.979517,125.045892,128.828864,128.834310,125.115864
273,206.958748,206.966081,207.075997,213.340599,213.349617,207.191871


In [28]:
#sample = results_multiplier.sample(frac=0.2,axis=0)
model_err = metrics.median_absolute_error(sample.iloc[:,1:-1].values.flatten(), new_predictions.iloc[:,1:-1].values.flatten())

In [29]:
model_err

0.12209482832382434

## Using ARIMA Time Series Models for Future Forecasting

ARIMA and SARIMA models are designed specifically for time series data. With several years of pricing history, they could model the seasonal nature of the data directly instead of relying on averaged multipliers.
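For reference, a seasonal ARIMA model is conventionally written SARIMA$(p,d,q)(P,D,Q)_s$. The notation below is the standard textbook form rather than anything fitted in this notebook. For daily prices with a day-of-week pattern, the seasonal period would be $s = 7$.

$$
\Phi_P(B^s)\,\phi_p(B)\,(1-B)^d\,(1-B^s)^D\,y_t = \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t
$$

Here $y_t$ is the price at time $t$, $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi_p$ and $\theta_q$ are the non-seasonal autoregressive and moving-average polynomials, $\Phi_P$ and $\Theta_Q$ are their seasonal counterparts, and $\varepsilon_t$ is white noise.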