# Restaurant Revenue Prediction
## Predict annual restaurant sales based on objective measurements

### Data description:

- TFI has provided a dataset with 137 restaurants in the training set, and a test set of 100000 restaurants. The data columns include the open date, location, city type, and three categories of obfuscated data: Demographic data, Real estate data, and Commercial data. The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. 

### File description:

- train.csv - the training set. Use this dataset for training your model. 
- test.csv - the test set. To deter manual "guess" predictions, Kaggle has supplemented the test set with additional "ignored" data. These are not counted in the scoring.
- sampleSubmission.csv - a sample submission file in the correct format

### Data fields:
- Id : Restaurant id. 
- Open Date : opening date for a restaurant
- City : City that the restaurant is in. Note that the names contain Unicode characters. 
- City Group: Type of the city. Big cities, or Other. 
- Type: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile
- P1, P2 - P37: There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators.
- Revenue: The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don't mean real dollar values. 

### Evaluations:

#### Root Mean Squared Error (RMSE)
- Submissions are scored on the root mean squared error. 
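
For reference, RMSE is the square root of the mean squared difference between predicted and actual revenue. A minimal sketch with NumPy; the function name `rmse` and the sample values are illustrative only and are not taken from the competition data:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values, not taken from the competition data
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Because the errors are squared before averaging, a few large misses dominate the score, which is why the revenue outliers get attention later on.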

Import the tools for data preprocessing and EDA.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import date, datetime
from sklearn.preprocessing import MinMaxScaler

Read the train and test set

In [None]:
train_df = pd.read_csv('../input/restaurant-revenue-prediction/train.csv.zip',parse_dates=['Open Date'])
test_df = pd.read_csv('../input/restaurant-revenue-prediction/test.csv.zip',parse_dates=['Open Date'])

### Basic information of the dataset.

In [None]:
train_df.shape, test_df.shape

The test set is far larger than the training set: 100,000 rows versus 137.

In [None]:
train_df.describe()

In [None]:
test_df.describe()

In [None]:
train_df.isnull().sum()

In [None]:
test_df.isnull().sum()

Neither the training set nor the test set contains missing values.

# Exploratory data analysis (EDA)

In [None]:
def print_cols():
    print(train_df.columns)

In [None]:
plt.figure(figsize=(20,20))
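# Correlate numeric columns only; string columns such as City raise an error in recent pandas
sns.heatmap(train_df.select_dtypes('number').corr(), annot=True, cbar=False, cmap='Reds')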

In [None]:
print_cols()

### Filter out the outliers

In [None]:
plt.figure(figsize=(20,4))
sns.boxplot(x='revenue',data=train_df)

In [None]:
from scipy.stats import iqr

upper_limit = train_df.revenue.quantile(0.75) + (1.5* iqr(train_df.revenue))
lower_limit = train_df.revenue.quantile(0.25)- (1.5* iqr(train_df.revenue))

condition = (train_df.revenue > upper_limit) | (train_df.revenue<lower_limit)
train_df[condition]

- The IQR rule flags eight rows as outliers. The training set is too small to drop that many instances.
- Instead, raise the threshold and drop only rows with revenue of 10,000,000 or more.
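
The trade-off above can be checked by counting how many rows a given cutoff would remove. A minimal sketch on a made-up series; the helper name `iqr_fences` and the values are assumptions, not the competition data:

```python
import pandas as pd

def iqr_fences(s, k=1.5):
    """Return the (lower, upper) Tukey fences of a numeric Series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

# Toy revenues with one extreme value
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])
lower, upper = iqr_fences(toy)
flagged = toy[(toy < lower) | (toy > upper)]
print(len(flagged), "row(s) outside the fences")
```

Raising `k` widens the fences, which is another way to keep more rows than the default 1.5 rule allows.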

In [None]:
rev_filter = (train_df.revenue < 10000000)
train_df = train_df[rev_filter]

train_df.shape

### EDA : restaurant ID

- The Id is unique in both the training set and the test set, so it carries little predictive information and we drop it.

In [None]:
len(train_df.Id.unique()) == train_df.shape[0]

In [None]:
len(test_df.Id.unique()) == test_df.shape[0]

In [None]:
# Drop the ID
train_df.drop('Id',axis=1,inplace=True)
test_df.drop('Id', axis=1, inplace=True)

In [None]:
#check the shape
train_df.shape, test_df.shape

### EDA : Open date

In [None]:
train_df['Open Date'].value_counts()

Extract the opening year from the Open Date column.

In [None]:
train_df['open_year'] = train_df['Open Date'].dt.year
# Do it to the test data 
test_df['open_year'] = test_df['Open Date'].dt.year

In [None]:
# Visualize restaurant counts and average revenue by opening year

fig = plt.figure(figsize=(12,10))
ax1=fig.add_subplot(2,1,1)
ax2=fig.add_subplot(2,1,2)

sns.countplot(x='open_year', data=train_df, ax=ax1)
ax1.set_title('Number of restaurants opened per year, 1996-2014')
ax1.set_ylabel('Number of restaurants')

# Select revenue before averaging so non-numeric columns are not aggregated
train_df.groupby('open_year')['revenue'].mean().plot.bar(ax=ax2, width=0.7)
ax2.set_title('Average revenue by opening year')
ax2.set_ylabel('Average revenue')

plt.show()

In [None]:
plt.figure(figsize=(10,7))
sns.boxplot(x='open_year',y='revenue', data=train_df)

### EDA: City & City Group
- The city names contain Unicode characters.

In [None]:
train_df['City Group'].value_counts()

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,4))
train_df.groupby('City Group')['revenue'].mean().plot.bar(ax=ax1)
sns.boxplot(x='City Group',y='revenue', data=train_df,ax=ax2)

Plot the average revenue of different cities.


In [None]:
train_df.groupby('City')['revenue'].mean().plot.barh(figsize=(20,20))
plt.yticks(fontsize=17)

In [None]:
len(train_df.City.unique()) ,len(test_df.City.unique())

The test set contains more distinct cities than the training set, so it includes cities the model never sees during training. That makes City a weak feature, and city-level characteristics are already reflected in the P columns, so I drop the 'City' column.
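
The mismatch can be made concrete by listing the test-set cities that never occur in training. A minimal sketch on toy lists; the variable names and values are illustrative stand-ins for the two City columns:

```python
import pandas as pd

# Toy stand-ins for the City columns of the two sets
train_cities = pd.Series(["Istanbul", "Ankara", "Izmir", "Ankara"])
test_cities = pd.Series(["Istanbul", "Bursa", "Ankara", "Konya", "Bursa"])

# Cities the model would never see during training
unseen = sorted(set(test_cities) - set(train_cities))
share_unseen = test_cities.isin(unseen).mean()

print(unseen)
print(f"{share_unseen:.0%} of test rows come from unseen cities")
```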

In [None]:
train_df.drop('City',axis=1,inplace=True)
test_df.drop('City',axis=1,inplace=True)

### EDA: Type

In [None]:
print_cols()

In [None]:
train_df.Type.value_counts()

Only one restaurant has type 'DT', so reassign it to 'IL'.

In [None]:
# Select the row by its value instead of a hard-coded index label
train_df.loc[train_df.Type == 'DT', 'Type'] = 'IL'

#### There is no MB type in training set

In [None]:
test_df.Type.value_counts()

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,4))
train_df.groupby('Type')['revenue'].mean().plot.bar(ax=ax1)
sns.boxplot(x='Type', y='revenue', data=train_df, ax=ax2)
plt.show()

### EDA: p1 , p2 - p37

- There are three categories of obfuscated data:
    1. Demographic data, including population in a given area and age and gender distribution.
    2. Real estate data, such as car park availability, front facade, and m2 of the location.
    3. Commercial data, mainly the existence of points of interest such as schools and banks.

In [None]:
train_df.P1.value_counts()

In [None]:
#plt.figure(figsize=(20,25))
#stop = 37
#for i in range(1,stop+1):
    #col_name = 'P' + str(i)
    #plt.subplot(8,5,i)
    #train_df.groupby([col_name]).median()['revenue'].plot.bar(width=0.2)
    #sns.boxplot(col_name, 'revenue',data=train_df, width=0.3)
#plt.show()

The P columns have different ranges.

# Data formatting

In [None]:
train_df.head()

In [None]:
test_df.head()

### Combine training set and test set into complete set for formatting.

In [None]:
comp_df = pd.concat([train_df, test_df])
comp_df.reset_index(drop=True, inplace=True)
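
Because the test set has no revenue column, `pd.concat` fills revenue with NaN for the test rows, and that is what makes it possible to separate the two sets again later. A minimal sketch on toy frames with made-up values:

```python
import pandas as pd

toy_train = pd.DataFrame({"P1": [1, 2], "revenue": [100.0, 200.0]})
toy_test = pd.DataFrame({"P1": [3, 4, 5]})  # no revenue column

combined = pd.concat([toy_train, toy_test]).reset_index(drop=True)

# Rows that came from the test frame have NaN revenue
is_test = combined["revenue"].isnull()
print(combined)
print(is_test.sum(), "test rows recovered")
```

This only works as long as no training row has a missing revenue, which the null check earlier confirmed.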

### Label encoding: City Group, Type

In [None]:
comp_df.Type.value_counts(),comp_df['City Group'].value_counts()

In [None]:
comp_df.Type = comp_df.Type.map({'MB':0,'DT':1, 'IL':2,'FC':3})
comp_df['City Group'] = comp_df['City Group'].map({'Big Cities':1, 'Other':0})

In [None]:
comp_df.head(3)

### Normalize the p-columns

In [None]:
p_name = ['P'+str(i) for i in range(1,38)]
comp_df[p_name] = MinMaxScaler().fit_transform(comp_df[p_name])

In [None]:
# Apply PCA to the P columns and plot cumulative explained variance
from sklearn.decomposition import PCA
pca = PCA().fit(comp_df[p_name])
plt.figure(figsize=(7,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of Components')
plt.ylabel('Explained variance')
plt.yticks(np.arange(0.1,1.1,0.05))
plt.xticks(np.arange(0,41,2))
plt.grid(True)

Set n_components to 29.
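
One way to read a component count off a cumulative-variance curve is to pick the smallest number of components that reaches a target share of variance. The sketch below assumes a hypothetical target and a made-up variance profile; the notebook does not state which criterion led to 29, and the helper name `components_for` is illustrative.

```python
import numpy as np

def components_for(explained_variance_ratio, target=0.95):
    """Smallest k whose cumulative explained variance reaches `target`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Small tolerance guards against floating-point round-off in the sum
    return int(np.searchsorted(cumulative, target - 1e-12) + 1)

# Made-up ratios standing in for pca.explained_variance_ratio_
ratios = np.array([0.50, 0.25, 0.125, 0.0625, 0.0625])
print(components_for(ratios, target=0.9))
```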

In [None]:
pca_list = ['pca'+str(i) for i in range(1,30,1)]
comp_df[pca_list] = PCA(n_components=29).fit_transform(comp_df[p_name])
comp_df.drop(p_name,axis=1,inplace=True)

In [None]:
comp_df

### Normalize the date
- Create new column 'launch days'
- Normalize open_year, launch days

In [None]:
# Days elapsed between the opening date and today
comp_df['launch_days'] = (pd.Timestamp.now() - comp_df['Open Date']).dt.days

In [None]:
comp_df.drop('Open Date',axis=1,inplace=True)

In [None]:
comp_df['launch_days'] = MinMaxScaler().fit_transform(comp_df[['launch_days']])
comp_df['open_year'] = MinMaxScaler().fit_transform(comp_df[['open_year']])

# Split to training set and test set

In [None]:
# Test rows are the ones without a revenue value
test_df = comp_df[comp_df['revenue'].isnull()].drop('revenue', axis=1)
train_df = comp_df[comp_df['revenue'].notnull()].copy()

In [None]:
train_df.shape, test_df.shape

In [None]:
x_train = train_df.drop('revenue',axis=1)
y_train = train_df['revenue']

# Start training model

- LGBMRegressor()

In [None]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV, cross_validate, RepeatedKFold

### Train without tuning

In [None]:
cv = RepeatedKFold(n_splits=10, n_repeats=3)
scores = cross_validate(LGBMRegressor(), x_train,y_train, scoring=['r2','neg_root_mean_squared_error'],cv=cv)

In [None]:
r2 = scores['test_r2']
# scikit-learn reports the negated RMSE, so flip the sign back
rmse = -scores['test_neg_root_mean_squared_error']
print(np.mean(r2), np.mean(rmse))

# Hyperparameters tuning

### LGBM regressor

In [None]:
# Grid search over LGBMRegressor hyperparameters (left commented out)
#cv = RepeatedKFold(n_splits=10, n_repeats=3)
#params = {
    #'n_estimators':[20,50,100,200],
    #'max_depth':[3,5,7],
    #'learning_rate':[0.0001,0.001,0.01,0.1,1],
    #'boosting_type':['gbdt','dart','goss'],
    #'subsample':[0.3,0.5,0.7,1]
#}

#lgbm_grid = GridSearchCV(LGBMRegressor(random_state=42),params, cv=cv, verbose=1, n_jobs=-1,scoring='neg_root_mean_squared_error')
#lgbm_grid.fit(x_train,y_train)

# Prediction on test data
- And output the file

In [None]:
final_model = LGBMRegressor(boosting_type='dart',max_depth=3,n_estimators=20,random_state=42, subsample=0.3).fit(x_train,y_train)

In [None]:
# Re-read the raw test file to recover the Id column dropped earlier
test_file = pd.read_csv('../input/restaurant-revenue-prediction/test.csv.zip')
answer = pd.DataFrame(final_model.predict(test_df), columns=['Prediction'])
answer['Id'] = test_file['Id'].values
answer.set_index('Id', inplace=True)

In [None]:
answer.to_csv('result.csv')