# 3B. Data Modeling: Seasonality
<hr>

Now that we've explored what the seasonality dataset looks like, the goal is to be able to make modifications to our average prediction from the initial regression to account for seasonality of pricing. This basically means that we will make +/- modifications to our average prediction based on the the day that we are projecting the data for. We can also do similar things for months of the year as well as holidays. This will hopefully reduce the residuals because the seasonality of the pricing data would cause some correlation among the residuals (based on time of the year) violating a lot of the OLS assumptions. We use averages as a way to explore seasonality. More advanced seasonality measurements could be used if we had more data over several years (where we could build an ARIMA or SARIMA model).

In [37]:
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors
from sklearn import linear_model
import sklearn.metrics as metrics
from sklearn.grid_search import GridSearchCV
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression as Lin_Reg
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import pylab
import scipy.stats as stats
%matplotlib inline
import datetime as dt
from datetime import datetime

### Process Overview:
The general idea behind this analysis is as follows: we aggregate prices by weekday for each listing. Then, we normalize each listing's price by the monday price to find an average multiplier for each listing for each day. Then, for each day we average across all listings to get a final average multiplier for each day. Lastly, we compare these predictions to a subset of the listings.

In [21]:
#Importing Datafile
results_nona = pd.read_csv('../datasets/seasonality_tomodel.csv')
results_multiplier = pd.read_csv('../datasets/seasonality_tomodel.csv')
b=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
for i in b[1:7]:
    results_multiplier[i] = results_multiplier[i]/results_multiplier['Mon']
results_multiplier['Mon']= 1
b=['Mon','Tue','Wed','Thu','Fri','Sat','Sun']
for i in b[1:7]:
    results_multiplier[i] = results_multiplier[i]/results_multiplier['Mon']
results_multiplier['Mon']= 1
results_multiplier.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
0,1,1.0,1.0,1.0,1.0,1.0,1.0,3604481.0
1,1,1.0,1.0,1.0,1.0,1.0,1.0,2949128.0
2,1,0.991826,0.991826,0.999846,1.138965,1.138965,1.0,4325397.0
3,1,1.0,1.0,1.0,1.0,1.0,1.0,4325398.0
4,1,0.991494,0.994952,1.004395,1.015027,1.011263,0.99895,3426149.0


We see that the dataframe now contains a multiplier for each day of the week for each listing. Now we take an average for each day(averaging across all listings) to see an average multiplier value for each day

In [108]:
multiplier = dict.fromkeys(b)
for index,i in enumerate(multiplier):
    multiplier[i]=results_multiplier.mean()[i]
multiplier

{'Fri': 1.0306309142782333,
 'Mon': 1.0,
 'Sat': 1.0306744793102525,
 'Sun': 1.0009269127770359,
 'Thu': 1.0003671343632679,
 'Tue': 0.99980071473169119,
 'Wed': 0.9998361380714712}

The results are very much in line with what we saw earlier in our seasonality-exploration file. Monday and Tuesday see a slight dip in their prices(99.9%) while Friday and Saturday see a sizable increase in prices (103%). These are thus the numbers we will be using to apply seasonality to the averages from our previous models. Here, we apply it to RidgeCV the best one that we found.

## Predicting Prices Using Our Seasonality Averages

Now, it is important to test the performance of the averages we arrived at. Here we seek to utilize the RidgeCV regression-- one of the best ones from the models we ran-- and apply seasonality training to it.

In [33]:
#We import the data and rerun the RidgeCV Regression
data = pd.read_csv('../datasets/listings_clean.csv')
data.head()
# split into x and y (note that we do not include id and host_id as predictors)
x = data.iloc[:, 2:-2]
y = data.iloc[:, -2]
y_log = data.iloc[:, -1]

In [73]:
reg_params = 10.**np.linspace(-10, 5, 10)
RidgeCV_model = RidgeCV(alphas=reg_params, fit_intercept=True, cv=5)
RidgeCV_model.fit(x,y_log)
sample = results_nona.sample(frac=0.4,axis=0)
# some of the id's in the sample can't be found. So at the end we readjust the sample dataframe too so they have the same entries
sample_variables=data.loc[data['id'].isin(sample['listing_id'])]
sample_variables.head(5)
sample_variables.shape
sample = sample.loc[sample['listing_id'].isin(sample_variables['id'])]
X_sample = sample_variables.iloc[:, 2:-2]
new_predictions = sample.copy()
new_predictions.loc[:,0:7]=0
new_predictions['Mon']=np.exp(RidgeCV_model.predict(X_sample))
for i in b[1:]:
    new_predictions[i]=np.exp(RidgeCV_model.predict(X_sample))*multiplier[i]

In [177]:
new_predictions.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
875,38.752293,38.74457,38.745943,38.76652,39.939311,39.941,38.788213,4526313.0
787,117.820157,117.796677,117.80085,117.863412,121.429096,121.434229,117.929366,2097795.0
2093,93.586445,93.567795,93.57111,93.620804,96.453083,96.457161,93.673192,3409586.0
2721,95.8294,95.810303,95.813698,95.864583,98.764742,98.768917,95.918226,3027838.0
2007,246.199927,246.150863,246.159585,246.290316,253.741256,253.751982,246.428133,2893512.0


In [178]:
sample.head(5)

Unnamed: 0,Mon,Tue,Wed,Thu,Fri,Sat,Sun,listing_id
875,100.0,100.0,100.0,100.0,100.0,100.0,100.0,4526313.0
787,210.0,210.0,210.0,210.0,210.0,210.0,210.0,2097795.0
2093,75.0,75.0,75.0,75.0,95.0,95.0,75.0,3409586.0
2721,130.0,130.0,130.0,130.0,130.0,130.0,130.0,3027838.0
2007,80.0,80.0,80.0,80.0,80.0,80.0,80.0,2893512.0


We see already from the head that the output of our seasonality data may not yield the best results. The top data frame is our projections versus the lower which is the actual prices.

In [179]:
metrics.median_absolute_error(sample.iloc[:,:-1].values.flatten(), new_predictions.iloc[:,:-1].values.flatten())

72.1070818480257

Overall, our results did suffer from greater error than our original RidgeCV regression, as we experience a median absolute error of $\$72.10$. We came to our predictions by using our RidgeCV to predict an average price for our sample of the total listings( we used $40\%$ of the total dataset to help it run faster). Then, we multiplied that average by the appropriate day multiplier to predict the price by day of the week. Lastly, we looked at the errors this generated versus the actual prices by day of the week that we know. This is for three reasons-- many of the listings still do not incorporate day-variations in their prices, large rather than small price difference between days of the week, and inaccuracies within the original RidgeCV predictions.
To address the first inaccuracy, we see that most of the listings don't vary prices over time. This can be seen by the extremely clean averages that are the same each day for every listing. This is strongly suggestive that the lister just uses one set price no matter what time of the year it is. If there were some variation, then the averages would be a lot more messy. This directly leads to the second point. Because so many people do not use dynamic, day-based pricing, the overall effect we see and apply is very minimal (with a max of $103\%$ on Friday and Saturday as we said earlier). However, when we look at the strategies of people who do employ dynamic pricing, the price changes are never $3\%$. For example, we see ID 3409586 increases its average listing price from \$75 to \$95 on Fridays and Saturdays. Thus, even for listings that do use dynamic pricing, the multipliers we applied are less than the shifts in prices actually used. Lastly, a lot of the error here is in predicting the average price from our RidgeCV regression. In the future, by better predicting the average price, it will also dramatically increase the accuracy of our seasonality analysis. However, all in all this was a great step forward.

## Using ARIMA time series models for future forecasting

As inspired by [this Duke webpage about using ARIMA models for time series forecasting](https://people.duke.edu/~rnau/seasarim.htm), we feel like this could be a model with exploring with our AirBnB data.

We can take the cross section of a price on each day of the year and then map that time series.


<img src="../img/average price time series.png" width="400">

We see that this time series is not stationary, thus we take a first order difference and see that the time series is a lot more stationary.

<img src="../img/1 difference.png" width="400">

This looks a lot better so then we can look at the ACF and PACF graphs to estimate the parameters of the ARIMA model. The ACF supports our analysis from the averages in that there is high autocorrelation in the multiples of lag=7.

<img src="../img/acf.png" width="400">
<img src="../img/PACF.png" width="400">

Lastly, we use the forecast package in R to have it choose the best model based on lowest AIC. We ultimately come up with a model of ARIMA(4,0,3) as highlighted by this output from R
<img src="../img/arima model.png" width="400">


This was just a beginning exploration, however with more comprehensive data, this type of modeling could prove to be very powerful for understanding the seasonal nature of the pricing data. With only one year at our disposal though, it is hard to project forward prices and understand annual pricing trends. However, this is very promising!

Seasonality is perhaps the most promising area of the entire project because it shows that many Airbnb listers are not taking advantage of dynamic pricing by the day of the week, something that is important to establish optimal pricing. Already, there are some promising results-- people should price Friday and Saturday the highest and Tuesday and Wednesday the lowest. With further refinements to our model, such as looking at various seasonal time series models, hopefully people can look at the trends and be able to price their AirBnB listings more appropriately with more concrete percentage change suggestions.
