# Restaurant Revenue Prediction

> **Goal**: Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

> **Evaluation Metric**: RMSE (Root Mean Squared Error)
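
As a quick illustration of the metric, here is a minimal sketch of RMSE computed with NumPy. The `actual` and `predicted` values are made up for the example and are not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (assumption), not competition data
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a single large miss weighs more heavily than several small ones.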

## Data Source

* Kaggle: https://www.kaggle.com/c/restaurant-revenue-prediction/overview

## Data Fields

1. **Id**: Restaurant id. 
2. **Open Date**: Opening date of the restaurant.
3. **City**: City that the restaurant is in. Note that the names contain Unicode characters.
4. **City Group**: Type of the city. Big cities, or Other. 
5. **Type**: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
6. **P1, P2 - P37**: There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators.
7. **Revenue**: The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don't mean real dollar values. 


### About the company

* TFI (TAB Food Investments) has over 1,200 quick service restaurants across the globe.
* They employ over 20,000 people in Europe and Asia.
* They make significant investments in developing new restaurant sites.
* When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.
* Their goal is to increase effectiveness in their investments.

> To predict the revenue, you have the opening dates, cities, city groups, restaurant types, and the obfuscated P variables.

## References
1. https://towardsdatascience.com/restaurant-revenue-prediction-467f0990403e
2. https://towardsdatascience.com/random-forest-hyperparameters-and-how-to-fine-tune-them-17aee785ee0d

In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_log_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

## Data Preparation 

In [None]:
# Load training data
data_train = pd.read_csv("../input/restaurant-revenue-prediction/train.csv.zip")

In [None]:
data_train.head()

In [None]:
data_train.info()

In [None]:
# Check for null values
data_train.isna().sum()

In [None]:
# Check all the cities and provinces we are dealing with
data_train.City.unique()

In [None]:
# The number of provinces and cities in our data
len(data_train.City.unique())

In [None]:
# Types of cities we are dealing with
data_train['City Group'].unique()

In [None]:
bigCities = len(data_train[data_train['City Group'] == "Big Cities"])
otherCount = len(data_train[data_train['City Group'] == "Other"])
dic_1 = {"Big Cities": bigCities, "Other": otherCount}

fig, ax = plt.subplots(figsize=(5, 5))
ax.bar(dic_1.keys(), 
       dic_1.values(), 
       width=0.8, 
       color=['skyblue', 'orange'])
ax.set(xlabel= "City Group", 
       ylabel='Count',
       title='Training Examples of the City Groups');
data_train['City Group'].value_counts()

In [None]:
data_train['Type'].value_counts()

In [None]:
fc = len(data_train[data_train['Type'] == "FC"])
il = len(data_train[data_train['Type'] == "IL"])
# Named dt_count so it does not shadow the datetime module imported as dt
dt_count = len(data_train[data_train['Type'] == "DT"])
dic_2 = {'Food Court': fc, "Inline": il, "Drive Thru": dt_count}

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(dic_2.keys(), 
       dic_2.values(), 
       width=0.8, 
       color=['darkorange', 'bisque', 'moccasin'])
ax.set(xlabel='Type of Restaurant', 
       ylabel='Count',
       title='Training Examples of the Types of Restaurants');

In [None]:
data_train['Open Date'].dtype

In [None]:
# Convert the Open Date column to the datetime data type
data_train['Open Date'] = pd.to_datetime(data_train['Open Date'])

In [None]:
data_train['Open Date'].dtype

In [None]:
# Sort the rows by opening date in ascending order
data_train.sort_values(by=['Open Date'], inplace=True, ascending=True, ignore_index=True)

In [None]:
data_train = data_train.drop('Id', axis=1)

In [None]:
data_train.head()

In [None]:
# Add separate columns for the day, year, and month of the Open Date
data_train['Sale Day'] = data_train['Open Date'].dt.day
data_train['Sale Year'] = data_train['Open Date'].dt.year
data_train['Sale Month'] = data_train['Open Date'].dt.month

In [None]:
data_train.head()

In [None]:
data_train['Sale Year'].value_counts()

In [None]:
# Store categorical variable names in a list
ctg_vars = []

for col in data_train:
    if len(data_train[col].unique()) <= 30:
        ctg_vars.append(col)

In [None]:
# Remove the P variables from the list of categorical variables.
# A list comprehension avoids removing items from a list while iterating over it.
p_cols = {"P" + str(i) for i in range(1, 38)}
ctg_vars = [col for col in ctg_vars if col not in p_cols]

In [None]:
print(ctg_vars)

In [None]:
len(ctg_vars)

## Exploratory Data Analysis

In [None]:
#Plot histograms for all the P columns and the revenue column
hist_cols = list(data_train.columns[4:42])
data_train[hist_cols].hist(figsize= (12,60), layout=(19,2), bins=15);

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(data_train['revenue'], kde=True);

> We can observe that the revenue target variable is right-skewed
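
The skew can also be quantified. The sketch below uses synthetic log-normal values as a stand-in for the revenue column (an assumption, since the real data is not reproduced here) and compares the skewness before and after `np.log1p`, the transform this notebook applies to the target before modelling.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for the revenue column (assumption)
rng = np.random.default_rng(0)
synthetic_revenue = pd.Series(rng.lognormal(mean=15, sigma=0.5, size=1000))

# Positive skewness indicates a long right tail
skew_raw = synthetic_revenue.skew()
skew_log = np.log1p(synthetic_revenue).skew()

print(f"Skewness before log1p: {skew_raw:.2f}")
print(f"Skewness after log1p:  {skew_log:.2f}")
```

On the real data, the same two calls could be applied to `data_train['revenue']`.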

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(data_train['Open Date'], data_train['revenue'])
ax.set(ylabel="Revenue / 10^7",
       xlabel='Year',
       title='Annual Restaurant Revenue');

In [None]:
# Median Revenue of big cities and other cities
ax_wp_1 = sns.boxplot(x='revenue', y='City Group', data=data_train)
ax_wp_1.set(title='Whisker plot');

bc_median = data_train[data_train['City Group'] == 'Big Cities']['revenue'].median()
oc_median = data_train[data_train['City Group'] == 'Other']['revenue'].median()
print("Median Revenue of Big cities:", bc_median)
print("Median Revenue of Other cities:", oc_median)

In [None]:
data_train['revenue'].max()

In [None]:
# Median revenue for the types of restaurants
rt_median = data_train.groupby('Type')['revenue'].aggregate(np.median)
print("Median Revenue of the types of restaurants per annum: \n", rt_median[1:])

In [None]:
data_train[data_train['Type'] == 'FC']['revenue'].cumsum().plot()
data_train[data_train['Type'] == 'IL']['revenue'].cumsum().plot()
plt.ylabel('Cumulative Sum of Revenue')
plt.xlabel('Number of examples')
plt.legend(['Food Court', 'Inline'])
plt.title('Cumulative Revenue Graph');

In [None]:
# Type of restaurant with the most revenue
data_train[data_train['revenue'] == data_train['revenue'].max()]['Type']

In [None]:
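# Set the font scale before plotting so that it applies to this heatmap
sns.set(font_scale=1.4)
plt.figure(figsize=(45,25))
# Correlate numeric columns only; text columns such as City cannot be correlated
sns.heatmap(data_train.select_dtypes(include='number').corr(), annot=True);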

## Imputing Null P values

In [None]:
# P variables will be treated as continuous rather than categorical variables.
# Zeros in the P columns are treated as missing values (missing_values=0).
imp_train = IterativeImputer(max_iter=30, random_state=0, missing_values=0, sample_posterior = True, min_value=1)
p_vals = ["P" + str(i) for i in range(1, 38)]
data_train[p_vals] = np.round(imp_train.fit_transform(data_train[p_vals]))
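
To make the imputation step easier to follow, here is a small sketch on a made-up matrix in which `0` marks a missing entry, using the same `IterativeImputer` settings as above. The toy values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix (made up); 0 marks a missing entry, as in the P columns
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 9.0],
    [0.0, 10.0, 11.0],
    [6.0, 12.0, 13.0],
])

imputer = IterativeImputer(max_iter=30, random_state=0, missing_values=0,
                           sample_posterior=True, min_value=1)
toy_imputed = np.round(imputer.fit_transform(toy))

# Entries that were observed keep their values; only the zeros are replaced
observed = toy != 0
print(toy_imputed)
```

Setting `min_value=1` keeps the imputed values from falling back to `0`, which would otherwise be indistinguishable from a missing entry.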

### Save changes made to the data in another file

In [None]:
data_temp = data_train.copy()

In [None]:
data_temp.drop('Open Date', axis=1, inplace=True)
data_temp.drop('City', axis=1, inplace=True)

In [None]:
data_temp['revenue'] = np.log1p(data_temp['revenue'])
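
Because the target is stored as `log1p(revenue)`, any score computed on it is on the log scale. The sketch below, using made-up revenues and prediction errors, shows how `np.expm1` undoes the transform so that RMSE can also be reported in the original units.

```python
import numpy as np

# Made-up revenues and log-scale predictions (illustrative only)
revenue = np.array([2_000_000.0, 4_500_000.0, 8_000_000.0])
log_target = np.log1p(revenue)
log_preds = log_target + np.array([0.10, -0.05, 0.20])

# RMSE on the log scale, which is what a model fit on log_target reports
rmse_log = np.sqrt(np.mean((log_target - log_preds) ** 2))

# Undo the transform with expm1 to score in the original units
rmse_original = np.sqrt(np.mean((np.expm1(log_target) - np.expm1(log_preds)) ** 2))

print("RMSE on the log scale:", rmse_log)
print("RMSE on the original scale:", rmse_original)
```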

In [None]:
data_temp.to_csv('train_data_modified.csv', index=False)

## Modelling

> **Regression models**:
* Random Forest
* CatBoost

In [None]:
# Load the temp data
data = pd.read_csv('train_data_modified.csv')

In [None]:
data = pd.get_dummies(data, columns=ctg_vars)

The following one-hot encoded columns exist in the test set but are missing from the training data. This is a problem because the input features of the test set must match the input features the model was trained on:
- Sale Year_1995
- Sale Year_2001
- Sale Year_2003
- Sale Day_19
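
As an alternative to adding each column by hand, the two encoded frames can be aligned with `DataFrame.reindex`. This is a sketch on toy frames with hypothetical values; only the column names mirror the notebook.

```python
import pandas as pd

# Toy frames with hypothetical values; only the column names mirror the notebook
train_toy = pd.DataFrame({"Sale Year": [1999, 2005], "P1": [1, 2]})
test_toy = pd.DataFrame({"Sale Year": [1995, 2005], "P1": [3, 4]})

train_enc = pd.get_dummies(train_toy, columns=["Sale Year"])
test_enc = pd.get_dummies(test_toy, columns=["Sale Year"])

# Union of the columns, in one fixed order, applied to both frames
all_cols = sorted(set(train_enc.columns) | set(test_enc.columns))
train_aligned = train_enc.reindex(columns=all_cols, fill_value=0)
test_aligned = test_enc.reindex(columns=all_cols, fill_value=0)

print(list(train_aligned.columns))
```

Reindexing both frames against the same list also fixes the column order, which matters because the model expects the features in the order it saw during fitting.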

In [None]:
# Add the columns that only occur in the test set, filled with zeros,
# so that the training features match the test features
for col in ['Sale Year_1995', 'Sale Year_2001', 'Sale Year_2003', 'Sale Day_19']:
    data[col] = np.zeros(len(data), dtype='uint8')

In [None]:
X = data.drop('revenue', axis=1)
y = data['revenue']

In [None]:
# Split data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1, random_state=0)

In [None]:
# Create Random Forest Regressor model
model = RandomForestRegressor(n_estimators=1000 ,random_state=0)
model.fit(X_train, y_train)

In [None]:
# Evaluation Function

def rmse(y_test, y_preds):
    return np.sqrt(mean_squared_error(y_test, y_preds))

def show_scores(model):
    train_preds = model.predict(X_train)
    val_preds = model.predict(X_valid)
    scores = {"Training MAE": mean_absolute_error(y_train, train_preds),
              "Validating MAE": mean_absolute_error(y_valid, val_preds),
              "Training RMSE": rmse(y_train, train_preds),
              "Validating RMSE": rmse(y_valid, val_preds),
              "Training R^2": model.score(X_train, y_train),
              "Validating R^2": model.score(X_valid, y_valid)}
    return scores

In [None]:
show_scores(model)

In [None]:
# Model 2 - CatBoost
from catboost import CatBoostRegressor
model_2 = CatBoostRegressor(verbose=False)
model_2.fit(X_train, y_train);

In [None]:
cat_pred = model_2.predict(X_valid)

In [None]:
show_scores(model_2)

## Hyperparameter Tuning

### Random Forest

In [None]:
# Number of trees
trees = np.arange(100, 1000, 100)

for i in trees:
    print("Number of Trees: {}".format(i))
    # 'absolute_error' is the current name of the criterion formerly called 'mae'
    rf_test_model = RandomForestRegressor(n_estimators=i, random_state=0, criterion='absolute_error')
    rf_test_model.fit(X_train, y_train)
    train_preds = rf_test_model.predict(X_train)
    val_preds = rf_test_model.predict(X_valid)
    print('RMSE for training set: {}'.format(rmse(y_train, train_preds)))
    print('RMSE for validation set: {} \n'.format(rmse(y_valid, val_preds)))

In [None]:
# Parameter dictionary for GridSearch
rf_grid = {'n_estimators': [200, 600, 800],
           'criterion': ['squared_error', 'absolute_error'],  # formerly 'mse' and 'mae'
           'max_features': [0.33, 0.5, 1.0, 'sqrt'],  # 1.0 (all features) replaces the removed 'auto'
           }

In [None]:
rf_gs = GridSearchCV(estimator = RandomForestRegressor(),
                     param_grid = rf_grid,
                     cv = 5,
                     verbose = True)

rf_gs.fit(X_train, y_train)

In [None]:
rf_gs.score(X_valid, y_valid)

In [None]:
rf_gs.score(X_train, y_train)

In [None]:
rf_gs.best_params_

In [None]:
rf_gs.best_params_['n_estimators']
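
Note that `GridSearchCV` ranks parameter combinations with the estimator's default scorer (R^2 for regressors) unless `scoring` is set, so the scores above are R^2 values rather than RMSE. The sketch below runs a much smaller grid on synthetic data (an assumption standing in for `X_train` and `y_train`) and selects by RMSE instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

# Deliberately small, hypothetical grid so the example runs quickly
demo_grid = {"n_estimators": [20, 50], "max_features": [0.5, 1.0]}
demo_gs = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=demo_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",  # select by RMSE instead of the default R^2
)
demo_gs.fit(X_demo, y_demo)

print(demo_gs.best_params_)
# best_score_ is the negated RMSE, so flip the sign for reporting
print("Cross-validated RMSE:", -demo_gs.best_score_)
```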

In [None]:
rf_test_model = RandomForestRegressor(n_estimators=rf_gs.best_params_['n_estimators'], random_state=0, 
                                      criterion=rf_gs.best_params_['criterion'], max_features = rf_gs.best_params_['max_features'])
rf_test_model.fit(X_train, y_train)
train_preds = rf_test_model.predict(X_train)
val_preds = rf_test_model.predict(X_valid)
print('RMSE for training set: {}'.format(rmse(y_train, train_preds)))
print('RMSE for validation set: {}'.format(rmse(y_valid, val_preds)))

## Test set

In [None]:
data_test = pd.read_csv("../input/restaurant-revenue-prediction/test.csv.zip")

In [None]:
data_test.head()

In [None]:
data_test.isna().sum()

In [None]:
len(data_test.City.unique())

In [None]:
data_test['Type'].unique()

The MB (Mobile) restaurant type is not represented in the training set, so MB restaurants in the test set will be relabelled as DT (Drive Thru).

In [None]:
data_test['Open Date'] = pd.to_datetime(data_test['Open Date'])
data_test.sort_values(by=['Open Date'], inplace=True, ascending=True, ignore_index=True)

In [None]:
data_test['Open Date'].dtype

In [None]:
data_test['Sale Day'] = data_test['Open Date'].dt.day
data_test['Sale Year'] = data_test['Open Date'].dt.year
data_test['Sale Month'] = data_test['Open Date'].dt.month

In [None]:
data_test.drop('Open Date', axis=1, inplace=True)
data_test.drop('City', axis=1, inplace=True)

In [None]:
ctg_vars_test = []

for col in data_test:
    if len(data_test[col].unique()) <= 31:
        ctg_vars_test.append(col)

In [None]:
# Remove the P variables.
# A list comprehension avoids removing items from a list while iterating over it.
p_cols = {"P" + str(i) for i in range(1, 38)}
ctg_vars_test = [col for col in ctg_vars_test if col not in p_cols]

In [None]:
print(ctg_vars_test)

### Save changes made to the test data in another file

In [None]:
data_temp_test = data_test.copy()

In [None]:
data_temp_test.loc[data_temp_test['Type'] == 'MB', 'Type'] = 'DT'

In [None]:
imp_test = IterativeImputer(max_iter=30, random_state=0, missing_values=0, sample_posterior = True, min_value=1)
p_vals_test = ["P" + str(i) for i in range(1, 38)]
data_temp_test[p_vals_test] = np.round(imp_test.fit_transform(data_temp_test[p_vals_test]))

In [None]:
data_temp_test = pd.get_dummies(data_temp_test, columns=ctg_vars_test)

In [None]:
data_temp_test.to_csv('test_data_modified.csv', index=False)

### Making Predictions on the test set

In [None]:
test_data = pd.read_csv('test_data_modified.csv')

In [None]:
submission = pd.DataFrame(columns=["Id", "Prediction"])
submission["Id"] = test_data['Id']

# Put the test features in the same column order as the training features (X);
# any column absent from the test set is filled with 0
test_features = test_data.drop('Id', axis=1).reindex(columns=X.columns, fill_value=0)

# Random Forest Model predictions (expm1 undoes the log1p applied to the target)
rf_pred_sub = rf_test_model.predict(test_features)
submission['Prediction'] = np.expm1(rf_pred_sub)
submission.to_csv('submission_random_forest.csv', index=False)

# CatBoost Model predictions
cb_pred_sub = model_2.predict(test_features)
submission['Prediction'] = np.expm1(cb_pred_sub)
submission.to_csv('submission_cat_boost.csv', index=False)