# Predicting Restaurant Demand: A Walk-Through for Beginners and the Curious

*This kernel draws on work by DSEverything, JdPaletto, the1owl1, and hklee. It aims to explain some of the methods they used to solve this problem, with a few tweaks of my own, of course :D*

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import ensemble, neighbors, linear_model, metrics, preprocessing
from datetime import datetime
import glob, re

# Our Methodology
#### I'm very excited to share this work with you, as it combines two very different approaches to predicting future customer demand in a way that is both educational and performs quite well!
#### 1. The machine learning approach: build one big dataframe, feed it to a model (an ensemble of models in our case), and let it predict future demand
#### 2. The old-school and, dare I say, elegant statistics approach: time-weighted averaging of past visits to predict future demand
#### Think of it as starting with a high-level approach, then getting down in the weeds of the math and statistics to understand what machine learning approaches are built on. Should be interesting!
#### If you have any ideas on how to improve this notebook, let me know in the comments.

# Approach 1 (ML Model Ensembling)

## Cleaning, Feature Engineering & Merging (...But Mostly Cleaning)

We'll begin by collecting all our datasets in one dictionary.

In [None]:
data = {
    'ar': pd.read_csv('../input/air_reserve.csv'),
    'as': pd.read_csv('../input/air_store_info.csv'),
    'tra': pd.read_csv('../input/air_visit_data.csv'),
    'hol': pd.read_csv('../input/date_info.csv').rename(columns={'calendar_date': 'visit_date'}),
    'hr': pd.read_csv('../input/hpg_reserve.csv'),
    'hs': pd.read_csv('../input/hpg_store_info.csv'),
    'tes': pd.read_csv('../input/sample_submission.csv'),
    'id': pd.read_csv('../input/store_id_relation.csv')
}

Then we'll go through each dataset, looking for where we can merge them together, sampling them to understand what they contain, and engineering new features that might expose patterns. In the end we should have one well-prepared dataset to feed our machine learning models.

In [None]:
data['id'].sample(3)

Our id set contains both restaurant ids (air and hpg), making it the perfect candidate for linking hpg data to air data. Let's start merging!
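
To make the join concrete, here is a minimal sketch on made-up data (the ids and visitor counts below are invented for illustration, not taken from the competition files). It shows that an inner merge keeps only the hpg reservations whose store also appears in the id relation table.

```python
import pandas as pd

# Toy stand-ins for hpg_reserve.csv and store_id_relation.csv (all values made up)
hpg_reserve = pd.DataFrame({
    'hpg_store_id': ['hpg_1', 'hpg_2', 'hpg_3'],
    'reserve_visitors': [2, 4, 3],
})
id_relation = pd.DataFrame({
    'air_store_id': ['air_a', 'air_b'],
    'hpg_store_id': ['hpg_1', 'hpg_3'],
})

# An inner join drops hpg_2 because it has no matching air id
linked = hpg_reserve.merge(id_relation, how='inner', on='hpg_store_id')
print(linked)
```

Any hpg store without an air counterpart disappears here, which is what we want since the submission only covers air restaurants.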

In [None]:
data['hr'] = data['hr'].merge(data['id'], how='inner', on=['hpg_store_id'])
data['hr'].sample(5)

Next we need to convert the datetime columns to pandas datetimes so we can extract the most valuable piece of data in this competition: the date each customer came to eat! (We'll be doing this a lot, just so you know.) We'll also engineer a potentially useful feature of our own: how many days in advance each customer reserved their table. Could be interesting!
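
As a side note, the "days in advance" feature can also be computed without a row-by-row `apply`. The sketch below uses two invented reservations and is only an illustration of the idea, assuming we take the difference before dropping the time component.

```python
import pandas as pd

# Two made-up reservations
reservations = pd.DataFrame({
    'visit_datetime': ['2017-01-05 19:00:00', '2017-01-07 12:00:00'],
    'reserve_datetime': ['2017-01-03 23:30:00', '2017-01-07 09:00:00'],
})
visit = pd.to_datetime(reservations['visit_datetime'])
reserve = pd.to_datetime(reservations['reserve_datetime'])

# normalize() resets the time to midnight, so the difference counts calendar days
reservations['reserve_datetime_diff'] = (
    visit.dt.normalize() - reserve.dt.normalize()).dt.days
print(reservations)
```

On a table with millions of reservations, a vectorized difference like this tends to be much faster than `apply` with a lambda.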

In [None]:
# Notice visit_datetime is initially stored as an object (string) column. We will fix that
data['hr']['visit_datetime'].dtype

In [None]:
for df in ['ar', 'hr']:
    # Parse both timestamps, then keep only the calendar date
    data[df]['visit_datetime'] = pd.to_datetime(data[df]['visit_datetime'])
    data[df]['visit_datetime'] = data[df]['visit_datetime'].dt.date
    data[df]['reserve_datetime'] = pd.to_datetime(data[df]['reserve_datetime'])
    data[df]['reserve_datetime'] = data[df]['reserve_datetime'].dt.date
    # Days between making the reservation and the visit
    data[df]['reserve_datetime_diff'] = data[df].apply(
        lambda r: (r['visit_datetime'] - r['reserve_datetime']).days, axis=1)
    # Sum lead days and reserved visitors per store and visit date
    data[df] = data[df].groupby(['air_store_id', 'visit_datetime'], as_index=False)[['reserve_datetime_diff',
                                                                                     'reserve_visitors']].sum().rename(columns={'visit_datetime':'visit_date'})
data['hr'].sample(5)

In [None]:
data['ar'].sample(4)

'hr' and 'ar' look good, next on the list is 'as'. Let's take a peek.

In [None]:
data['as'].sample(5)

### **Why Not Weather?**
I see a lot of people getting fancy and plotting the location of every restaurant on a map of Japan. That's pretty cool to witness but not very useful for our problem, so I won't be going in that direction.

I've also seen a lot of kernels incorporating the weather down to the area of every restaurant, looking to see if the forecast might encourage or discourage going out to eat. Unfortunately, from what I've seen, no one has been able to get significant results from this method. That seems plausible: in the reservation data we've looked at so far, customers book a table in advance and so are probably serious about showing up.

The uselessness of weather in our forecasting should also be viewed from a business angle. We can only know the weather a few days in advance, and not with great accuracy. That's not a stable basis for forecasting future staffing or food supply.

### **Back to Cleaning**
Instead we'll draw more stable features from the 'as' set to base our model on. The type of cuisine served and the area the restaurant is located in are classic indicators for forecasting future demand. We'll have to encode those columns as numbers so our machine can read them.
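
If label encoding is new to you, here is a small sketch with made-up genre and area names. Keeping one encoder per column (the `encoders` dict is my own naming, not part of the original kernels) is an optional extra that lets you translate the codes back to names later.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up store info
stores = pd.DataFrame({
    'air_genre_name': ['Sushi', 'Ramen', 'Sushi'],
    'air_area_name': ['Area B', 'Area A', 'Area B'],
})

# One encoder per column, so each mapping can be inverted later
encoders = {}
for column in ['air_genre_name', 'air_area_name']:
    encoders[column] = preprocessing.LabelEncoder()
    stores[column] = encoders[column].fit_transform(stores[column])

# Codes are assigned in sorted order of the labels: Ramen -> 0, Sushi -> 1
original_genres = encoders['air_genre_name'].inverse_transform(stores['air_genre_name'])
print(stores)
print(original_genres)
```

Keep in mind the integer codes carry no real ordering; tree models usually cope with that, while distance-based models like KNN will treat code 3 as "closer" to 4 than to 10.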

In [None]:
lbl = preprocessing.LabelEncoder()
data['as']['air_genre_name'] = lbl.fit_transform(data['as']['air_genre_name'])
data['as']['air_area_name'] = lbl.fit_transform(data['as']['air_area_name'])
data['as'].sample(3)

Now to the 'tra' dataset

In [None]:
data['tra'].sample(4)

This time we'll use the pandas datetime accessor a little differently. Instead of having it reduce a date & time stamp to just the date, we'll have it extract the day of the week, month, year AND date from a date stamp. Very useful library.
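
Here is a tiny sketch of what the `.dt` accessor returns, using two dates I picked for illustration (a Monday and a Sunday).

```python
import pandas as pd

visits = pd.DataFrame({'visit_date': ['2017-01-02', '2017-04-23']})
visits['visit_date'] = pd.to_datetime(visits['visit_date'])

# dayofweek counts from Monday=0 to Sunday=6
visits['day_of_week'] = visits['visit_date'].dt.dayofweek
visits['month'] = visits['visit_date'].dt.month
visits['year'] = visits['visit_date'].dt.year
print(visits)
```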

In [None]:
data['tra']['visit_date'] = pd.to_datetime(data['tra']['visit_date'])
data['tra']['day_of_week'] = data['tra']['visit_date'].dt.dayofweek
data['tra']['month'] = data['tra']['visit_date'].dt.month
data['tra']['year'] = data['tra']['visit_date'].dt.year
data['tra']['visit_date'] = data['tra']['visit_date'].dt.date


In [None]:
data['tra'].sample(4)

Looks good. For day_of_week, Monday=0 if you're wondering. Moving on.

In [None]:
data['hol'].sample(5)

Our date info looks pretty good. Let's just convert its date column so it can merge with our other tables' visit_date later on. We'll also drop the day_of_week column, as we already built one in 'tra' and what matters most about this table is the holiday flag data.

In [None]:
data['hol'] = data['hol'].drop(columns=['day_of_week'])
data['hol']['visit_date'] = pd.to_datetime(data['hol']['visit_date'])
data['hol']['visit_date'] = data['hol']['visit_date'].dt.date
data['hol'].sample(4)

All right, let's see what's in the next table

In [None]:
data['hs'].sample(4)

I wish I had some use for this data, but the same kind of information (genre, area, location) is already in the 'as' table. The id column is unique but does not provide any additional info for solving the problem. Adding it in only adds redundancy and increases the risk of overfitting our model. Looks like we're going to have to walk away from this one.

And our last dataset to clean up is 'tes'

In [None]:
data['tes'].sample(3)

It might not look like there's much for us to work with but if you look closer there's a date stamp in the id we can pull out and extract more data from. Feature engineering to the rescue!
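
The id is just a store id and a date glued together with an underscore. As a sketch (the id below is a made-up example in the same shape), splitting once from the right separates the two pieces in a single step:

```python
import pandas as pd

# Made-up id in the same format as sample_submission.csv
sample = pd.DataFrame({'id': ['air_0123456789abcdef_2017-04-23']})

# Split on the last underscore only: everything before it is the store id
parts = sample['id'].str.rsplit('_', n=1, expand=True)
sample['air_store_id'] = parts[0]
sample['visit_date'] = pd.to_datetime(parts[1])
print(sample)
```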

In [None]:
data['tes']['visit_date'] = data['tes']['id'].map(lambda x: str(x).split('_')[2])
data['tes']['air_store_id'] = data['tes']['id'].map(lambda x: '_'.join(x.split('_')[:2]))
data['tes']['visit_date'] = pd.to_datetime(data['tes']['visit_date'])
data['tes']['day_of_week'] = data['tes']['visit_date'].dt.dayofweek
data['tes']['month'] = data['tes']['visit_date'].dt.month
data['tes']['year'] = data['tes']['visit_date'].dt.year
data['tes']['visit_date'] = data['tes']['visit_date'].dt.date
data['tes'].sample(4)

With the datasets now all clean and related to one another, we can merge them into a nice train and test set, with our 'tra' and 'tes' files acting as the base we'll merge all the other datasets onto.

In [None]:
# Store each unique restaurant in an array
unique_stores = data['tes']['air_store_id'].unique()
# Break each restaurant into 7 rows to track each day of the week
stores = pd.concat([pd.DataFrame({'air_store_id': unique_stores, 
                                  'day_of_week': [i]*len(unique_stores)}) for i in range(7)], 
                   axis=0, ignore_index=True)
# Make a temporary variable to store new 'tra' features on visitors
# Then merge that new feature into our stores dataframe 
tmp = data['tra'].groupby(['air_store_id', 'day_of_week'], 
                          as_index=False)['visitors'].min().rename(columns={'visitors':'min_visitors'})
stores = stores.merge(tmp, how='left', on=['air_store_id','day_of_week'])
# Continue this process for the mean, median, max, and count of visit records
tmp = data['tra'].groupby(['air_store_id', 'day_of_week'],
                         as_index=False)['visitors'].mean().rename(columns={'visitors':'mean_visitors'})
stores = stores.merge(tmp, how='left', on=['air_store_id', 'day_of_week'])
tmp = data['tra'].groupby(['air_store_id', 'day_of_week'], 
                          as_index=False)['visitors'].median().rename(columns={'visitors':'median_visitors'})
stores = stores.merge(tmp, how='left', on=['air_store_id', 'day_of_week'])
tmp = data['tra'].groupby(['air_store_id', 'day_of_week'], 
                          as_index=False)['visitors'].max().rename(columns={'visitors':'max_visitors'})
stores = stores.merge(tmp, how='left', on=['air_store_id', 'day_of_week'])
tmp = data['tra'].groupby(['air_store_id', 'day_of_week'], 
                          as_index=False)['visitors'].count().rename(columns={'visitors':'visitor count'})
stores = stores.merge(tmp, how='left', on=['air_store_id', 'day_of_week'])
# Now we'll merge 'as' in
stores = stores.merge(data['as'], how='left', on=['air_store_id'])

In [None]:
# Check everything looks good and is machine readable
stores.head(4)
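
If the repeated groupby-and-merge steps feel long-winded, the same statistics can be computed in one pass with pandas named aggregation. This is only a sketch on made-up visit counts, and `count_observations` is a column name I chose for the example.

```python
import pandas as pd

# Made-up visit history for two stores
visits = pd.DataFrame({
    'air_store_id': ['air_a'] * 4 + ['air_b'] * 2,
    'day_of_week': [0, 0, 0, 1, 0, 0],
    'visitors': [10, 20, 60, 5, 7, 9],
})

# One row per (store, weekday) with all five statistics at once
stats = visits.groupby(['air_store_id', 'day_of_week'], as_index=False).agg(
    min_visitors=('visitors', 'min'),
    mean_visitors=('visitors', 'mean'),
    median_visitors=('visitors', 'median'),
    max_visitors=('visitors', 'max'),
    count_observations=('visitors', 'count'),
)
print(stats)
```

The result could then be merged onto the store and weekday grid with a single left merge instead of five.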

Great. Now we'll build the actual train and test dataframes by merging the holiday flags and our new stores table onto 'tra' and 'tes', and then add in the remaining tables.

In [None]:
train = pd.merge(data['tra'], data['hol'], how='left', on=['visit_date'])
# Merge onto train (not data['tra']) so the holiday flag from the previous step is kept
train = pd.merge(train, stores, how='left', on=['air_store_id', 'day_of_week'])

In [None]:
test = pd.merge(data['tes'], data['hol'], how='left', on=['visit_date'])
# Merge onto test (not data['tes']) so the holiday flag from the previous step is kept
test = pd.merge(test, stores, how='left', on=['air_store_id', 'day_of_week'])
test.sample(3)

In [None]:
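# Join on store and date only; joining on every column would add nothing from 'hr'.
# The suffixes keep the air and hpg reservation columns apart.
data['ar'] = data['ar'].merge(data['hr'], how='outer', on=['air_store_id', 'visit_date'],
                              suffixes=('_air', '_hpg'))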
data['ar'].sample(4)

In [None]:
train = train.merge(data['ar'], how='left', on=['air_store_id', 'visit_date'])
test = test.merge(data['ar'], how='left', on=['air_store_id', 'visit_date'])

In [None]:
train.sample(5)

With our tables all merged and cleaned, the last bit of prep work is to convert null values to -1. Our models can't handle NaNs, and -1 sits outside the normal range of our features, so it works as a clear "missing" marker (keep in mind that a distance-based model like KNN will still treat it as an ordinary number). We'll also separate out the columns we want our machine to make predictions from. Notice we're mostly dropping columns we needed for joining disparate tables but that do not add value in a regression analysis.
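
Before filling the gaps, it can be worth checking that the left merges didn't silently duplicate rows. The sketch below uses made-up data and the `validate` argument of `DataFrame.merge`, which raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Made-up visits and reservations
visits = pd.DataFrame({
    'air_store_id': ['air_a', 'air_a', 'air_b'],
    'visit_date': ['2017-01-01', '2017-01-02', '2017-01-01'],
    'visitors': [12, 30, 8],
})
reservations = pd.DataFrame({
    'air_store_id': ['air_a'],
    'visit_date': ['2017-01-02'],
    'reserve_visitors': [6],
})

# many_to_one: each (store, date) may appear at most once in reservations
merged = visits.merge(reservations, how='left',
                      on=['air_store_id', 'visit_date'],
                      validate='many_to_one')

# Rows without a reservation get NaN, which we then replace with the -1 marker
filled = merged.fillna(-1)
print(filled)
```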

In [None]:
col = [c for c in train if c not in ['id', 'air_store_id', 'visit_date', 'visitors']]
train = train.fillna(-1)
test = test.fillna(-1)
train.sample(4)

## Modeling & Ensembling

Now to the cool stuff. I won't say the fun stuff since cleaning and merging can be fun as well in a piecing the puzzle together sort of way. 

We're going to make our first demand forecast by running our train dataframe through an Extremely Randomized Trees regressor and a KNN regressor, then taking the average (ensemble) of the two predictions to use as our initial submission file.

We chose these two models because both are robust and they complement each other well: they make different kinds of errors, so averaging them is our attempt at balancing the bias-variance tradeoff. Ensembling models is the name of the game these days, and if I was going to pick just two for a problem like ours (numerical time series forecasting) I'd go with these two. Of course, feel free to try other combinations of models. If another combination works better, please share in the comments.

In [None]:
def RMSLE(y, pred):
    # Expects y and pred already on the log1p scale, where plain RMSE equals RMSLE
    return metrics.mean_squared_error(y, pred)**0.5
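
A quick sanity check of that metric, on made-up numbers: computing RMSLE directly from raw counts gives the same value as plain RMSE on log1p-transformed counts, which is why we train on `np.log1p(visitors)`.

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted visitor counts
actual = np.array([3, 10, 0, 25])
predicted = np.array([4.0, 8.0, 1.0, 30.0])

# RMSLE written out from its definition
rmsle_direct = np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# RMSE applied to values that were log1p-transformed first
rmse_on_logs = metrics.mean_squared_error(np.log1p(actual), np.log1p(predicted)) ** 0.5

print(rmsle_direct, rmse_on_logs)
```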

In [None]:
etc = ensemble.ExtraTreesRegressor(n_estimators=225, max_depth=5, n_jobs=-1, random_state=3)
knn = neighbors.KNeighborsRegressor(n_jobs=-1, n_neighbors=4)
etc.fit(train[col], np.log1p(train['visitors'].values))
knn.fit(train[col], np.log1p(train['visitors'].values))

In [None]:
test['visitors'] = (etc.predict(test[col]) / 2) + (knn.predict(test[col]) / 2)
test['visitors'] = np.expm1(test['visitors']).clip(lower=0.)
sub1 = test[['id', 'visitors']].copy()
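
We never scored the ensemble locally above, so here is a hedged sketch of how one might. It uses synthetic data rather than the competition tables, assumes the rows are in time order, and holds out the last 20% as a validation set. The smaller `n_estimators` is only to keep the example quick.

```python
import numpy as np
from sklearn import ensemble, neighbors

# Synthetic features and a positive, count-like target (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
visitors = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200))
y = np.log1p(visitors)

# Assume rows are time-ordered: train on the first 80%, validate on the rest
X_train, X_valid = X[:160], X[160:]
y_train, y_valid = y[:160], y[160:]

trees = ensemble.ExtraTreesRegressor(n_estimators=50, max_depth=5, random_state=3)
knn = neighbors.KNeighborsRegressor(n_neighbors=4)
trees.fit(X_train, y_train)
knn.fit(X_train, y_train)

pred_trees = trees.predict(X_valid)
pred_knn = knn.predict(X_valid)
pred_blend = (pred_trees + pred_knn) / 2  # simple average of the two models

def rmse(truth, pred):
    return float(np.sqrt(np.mean((truth - pred) ** 2)))

scores = {
    'trees': rmse(y_valid, pred_trees),
    'knn': rmse(y_valid, pred_knn),
    'blend': rmse(y_valid, pred_blend),
}
print(scores)
```

A nice property of a plain average is that its RMSE can never be worse than the worse of the two models, and it is often better than both when their errors differ.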

And with that we're ready for the second approach. Hope you're ready for some handmade weights, log transformations, and Pythagorean means!

# Approach 2 (Good Ole' Statistics)

Now begins our second approach, which does away with fancy ML trees and neighbors and replaces them with a simple yet effective method: take the average visitor count for each day of the week and nudge that average so it reflects recent visits more than older ones.

If you're into deep learning, think of it like the weights a network applies to its inputs and alters as it learns more. The only difference is that we're deciding what the weight values are going to be upfront instead of using backpropagation to continually refine them.

As you'll see, the results of both approaches are not that different.

To start, we'll need a clean set of tables again.

In [None]:
# Find every .csv file in the input directory and read each one into a dict keyed by file name
dfs = {re.search(r'/([^/\.]*)\.csv', fn).group(1):
      pd.read_csv(fn) for fn in glob.glob('../input/*.csv')}

# Expose each dataframe as a notebook-level variable (e.g. date_info) for quick access
for k, v in dfs.items(): globals()[k] = v

## More Cleaning and Feature Engineering

Next we'll make our weights. We want our weights to discount older dates more than newer ones and we want to do so using a multiplicative function so that the relationship between weights is consistent (this will matter when we convert our visitor data to log scale).

In terms of what exponent to choose, hklee already tested out a few and found 5 to be best. I left their test results commented out if you're interested.
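
To get a feel for what the exponent does, here is a small sketch over an imaginary 10-day calendar (`n_days` is my own toy value). Each day's position runs from 0.1 for the oldest day to 1.0 for the newest, and raising it to a power squeezes the old days toward zero.

```python
import numpy as np

n_days = 10  # toy calendar length
position = (np.arange(n_days) + 1) / n_days  # 0.1 (oldest) ... 1.0 (newest)

weights_linear = position
weights_power5 = position ** 5

for day, (w1, w5) in enumerate(zip(weights_linear, weights_power5)):
    print(f"day {day}: linear={w1:.2f}  power5={w5:.5f}")
```

With an exponent of 5, a day halfway through the calendar carries only about 3% of the weight of the most recent day, so the average is dominated by recent visits.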

In [None]:
#date_info['weight'] = ((date_info.index + 1) / len(date_info))       # LB 0.509
#date_info['weight'] = ((date_info.index + 1) / len(date_info)) ** 2  # LB 0.503
#date_info['weight'] = ((date_info.index + 1) / len(date_info)) ** 3  # LB 0.500
#date_info['weight'] = ((date_info.index + 1) / len(date_info)) ** 4  # LB 0.498
date_info['weight'] = ((date_info.index + 1) / len(date_info)) ** 5 # LB 0.497

With weight values decided, we'll now merge them in by date, with the newest dates getting the largest weights and the oldest dates the smallest. We'll also convert our visitor counts to a log scale to reduce skew before applying the weights.

In [None]:
visit_data = air_visit_data.merge(date_info, left_on = 'visit_date', right_on = 'calendar_date', how = 'left')
visit_data.drop('calendar_date', axis=1, inplace = True)
visit_data['visitors'] = np.log1p(visit_data['visitors'])  # pd.np was removed from newer pandas
visit_data.head()

All right, we've got our weights and visitors under the same roof. We're ready to apply the weights and compute a weighted visitor count for each group. The formula we'll be using is called the weighted arithmetic mean, and it's worth remembering: multiply each value by its weight, add those up, and divide by the sum of the weights.
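
Here is the formula worked through on two made-up log visitor counts, an older one with a small weight and a recent one with a large weight. NumPy's `np.average` gives the same answer as writing it out by hand.

```python
import numpy as np

# Made-up values: the older visit gets weight 0.25, the recent one 0.75
log_visitors = np.array([2.0, 3.0])
weights = np.array([0.25, 0.75])

by_hand = (weights * log_visitors).sum() / weights.sum()
with_numpy = np.average(log_visitors, weights=weights)
plain_mean = log_visitors.mean()

print(by_hand, with_numpy, plain_mean)
```

The weighted mean lands at 2.75, closer to the recent value than the plain mean of 2.5.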

In [None]:
wmean = lambda x:((x.weight * x.visitors).sum() / x.weight.sum())
visitors = visit_data.groupby(['air_store_id', 'day_of_week', 'holiday_flg']).apply(wmean).rename('visitors').reset_index()
visitors.head()

Looks great. Now let's apply those visitor values where it counts: the sample_submission table.

In [None]:
sample_submission['air_store_id'] = sample_submission.id.map(lambda x: '_'.join(x.split('_')[:-1]))
sample_submission['calendar_date'] = sample_submission.id.map(lambda x: x.split('_')[2])
sample_submission.drop('visitors', axis=1, inplace=True)
sample_submission = sample_submission.merge(date_info, on='calendar_date', how='left')
sample_submission = sample_submission.merge(visitors, on=[
    'air_store_id', 'day_of_week', 'holiday_flg'], how='left')

In [None]:
sample_submission.head()

Now most of our store dates and days have visitor count predictions set, but a quick count of null values will show our merging strategy still left some holes in the data.

Our first move will be to fill holiday dates that are missing data with the weighted mean of that store's normal day of the week. So a Monday holiday will get replaced with a normal Monday. If you look on your own, I think you'll find this isn't a big deal, as holiday visitor counts are not all that different from normal visitor counts.

'visitors_y' is specified because both tables in this merge contain holiday_flg and visitors columns, so pandas adds the suffixes _x (left table) and _y (right table) to tell them apart. Here we are saying we want the second visitor count, which is the value from our visitors dataframe.

In [None]:
sample_submission.visitors.isnull().sum()

In [None]:
missings = sample_submission.visitors.isnull()
sample_submission.loc[missings, 'visitors'] = sample_submission[missings].merge(
    visitors[visitors.holiday_flg == 0], on=['air_store_id', 'day_of_week'], 
    how='left')['visitors_y'].values

In [None]:
sample_submission.isnull().sum()

Notice we still have a lot of values where the merge criteria we outlined above didn't line up. That's all right: we will fill these remaining null values with a less precise but still pretty good approximation based on the overall average visitors for the store.
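
The whole fallback chain can be sketched on made-up data like this. The helper column names (`fallback_weekday`, `fallback_store`) are my own, and the chained `fillna` is just another way of expressing the same idea: exact match first, then the store's normal weekday, then the store's overall mean.

```python
import pandas as pd

# Made-up weighted means per (store, weekday, holiday flag)
visitors = pd.DataFrame({
    'air_store_id': ['air_a', 'air_a', 'air_b', 'air_b'],
    'day_of_week': ['Monday', 'Tuesday', 'Friday', 'Saturday'],
    'holiday_flg': [0, 0, 0, 0],
    'visitors': [2.0, 3.0, 1.0, 2.0],
})
# Rows we need predictions for
to_predict = pd.DataFrame({
    'air_store_id': ['air_a', 'air_a', 'air_b'],
    'day_of_week': ['Monday', 'Monday', 'Sunday'],
    'holiday_flg': [0, 1, 0],
})

# Step 1: exact match on store, weekday, and holiday flag
pred = to_predict.merge(visitors, how='left',
                        on=['air_store_id', 'day_of_week', 'holiday_flg'])

# Step 2: same store and weekday on a normal (non-holiday) day
normal = (visitors[visitors.holiday_flg == 0]
          .drop(columns='holiday_flg')
          .rename(columns={'visitors': 'fallback_weekday'}))
pred = pred.merge(normal, how='left', on=['air_store_id', 'day_of_week'])

# Step 3: the store's average over all of its groups
store_mean = (visitors.groupby('air_store_id')['visitors'].mean()
              .rename('fallback_store').reset_index())
pred = pred.merge(store_mean, how='left', on='air_store_id')

pred['visitors'] = (pred['visitors']
                    .fillna(pred['fallback_weekday'])
                    .fillna(pred['fallback_store']))
print(pred[['air_store_id', 'day_of_week', 'holiday_flg', 'visitors']])
```

The holiday Monday borrows the normal Monday value, and the Sunday with no history at all falls back to the store mean.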

In [None]:
missings = sample_submission.visitors.isnull()
sample_submission.loc[missings, 'visitors'] = sample_submission[missings].merge(
    visitors[['air_store_id', 'visitors']].groupby('air_store_id').mean().reset_index(), 
    on='air_store_id', how='left')['visitors_y'].values

In [None]:
# Double check we filled all the missing values
sample_submission.isnull().sum()

## Final Model For Submission

With every air_store_id day given a prediction it's time to store the results in variable 'sub2' which we'll then merge with 'sub1' from our machine learning ensemble.

In [None]:
sample_submission['visitors'] = np.expm1(sample_submission['visitors'])  # undo the log1p transform
sub2 = sample_submission[['id', 'visitors']].copy()
sub_merge = pd.merge(sub1, sub2, on='id', how='inner')

Now it is time to make our final merge! Here I chose the geometric mean as my average of choice, as it is less vulnerable to being skewed by the larger of the two values and is best suited to averaging values of a multiplicative nature, which our weighted means model definitely is. I'll also leave the harmonic mean function below in case anyone wants to try that method instead.
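
To see how the three Pythagorean means differ, here is a sketch on two made-up pairs of predictions. When the two models agree, all three means agree; when they disagree, the geometric and harmonic means sit closer to the smaller value.

```python
import numpy as np

# Made-up predictions from the two approaches
pred_ml = np.array([10.0, 20.0])
pred_stats = np.array([40.0, 20.0])

arithmetic = (pred_ml + pred_stats) / 2
geometric = np.sqrt(pred_ml * pred_stats)
harmonic = 2 / (1 / pred_ml + 1 / pred_stats)

print(arithmetic, geometric, harmonic)
```

The geometric mean is also the same thing as averaging on a log scale and transforming back, which fits nicely with a metric that scores errors in log space.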

And with that, I hope you learned a thing or two about merging dataframes, data transformations, or building a predictive model by hand with some classic statistics. If you see any spots in this notebook that could use improvement or further clarification please share in the comments! Happy modeling!

In [None]:
## Geometric Mean  
sub_merge['visitors'] = (sub_merge['visitors_x'] * sub_merge['visitors_y']) ** (1/2)
sub_merge[['id', 'visitors']].to_csv('sub_geo_mean.csv', index = False)
sub_merge[['id', 'visitors']].head()

In [None]:
## Harmonic Mean 
sub_merge['visitors'] = 2/(1/sub_merge['visitors_x'] + 1/sub_merge['visitors_y'])
sub_merge[['id', 'visitors']].to_csv('sub_hrm_mean.csv', index = False)
sub_merge[['id', 'visitors']].head()