<p style="font-size:35px;text-align:center"> <b>Tab Food Investments Revenue Prediction Problem</b> </p>

## Problem Statement

New restaurant sites take large investments of time and capital to get up and running. When the wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred. 

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees. Using demographic, real estate, and commercial data, this competition challenges you to predict the annual restaurant sales of 100,000 regional locations.

## Data

__Id__ : Restaurant id. <br> 
__Open Date__ : opening date for a restaurant <br>
__City__ : City that the restaurant is in. Note that the names contain Unicode characters. <br>
__City Group__: Type of the city. Big cities, or Other. <br>
__Type__: Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile  <br>
__P1, P2 - P37__: There are three categories of these obfuscated data. __Demographic data__ are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. __Real estate data__ mainly relate to the m2 of the location, front facade of the location, car park availability. __Commercial data__ mainly include the existence of points of interest including schools, banks, other QSR operators. <br>
__Revenue__: The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don't mean real dollar values. <br>

## ML Problem Formulation

<p style="font-size:17px;"> <b>Objectives</b> </p>

1) To predict the Revenue of each restaurant based on the given features <br>
2) Challenge ahead : To make sense out of the anonymized data

<p style="font-size:17px;"> <b>Metrics</b> </p>

1) Root Mean Squared Error
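
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch in NumPy follows; the helper name `rmse` and the toy numbers are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the TFI data
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
score = rmse(y_true, y_pred)
print(score)
```

Because the errors are squared before averaging, a few large misses dominate the score, which matters for a revenue target with a long right tail.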

In [None]:
%matplotlib inline

# Data wrapper libraries
import pandas as pd
import numpy as np
from collections import Counter

#Data Visualization Libraries
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.markers import MarkerStyle
import seaborn as sns

#Date time Libraries
import time
import datetime

In [None]:
TFI_data_train = pd.read_csv("C:/Users/IBM_ADMIN/Desktop/appliedai/TFI_Restaurant/train.csv")
TFI_data_test = pd.read_csv("C:/Users/IBM_ADMIN/Desktop/appliedai/TFI_Restaurant/test.csv")

In [None]:
print("size of train data",TFI_data_train.shape)
print("size of test data",TFI_data_test.shape)

In [None]:
TFI_data_train.info()

##### Uni-Variate Analysis: Non-Obfuscated Features

There are three categorical variables: "City", "City Group", and "Type". Let's take a look.

In [None]:
TFI_data_train.columns

In [None]:
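# Rename "City Group" to "Citygroup" so the column can be accessed as an attribute.
# (A bare DataFrame.drop() returns a new frame and leaves the original unchanged.)
TFI_data_train=TFI_data_train.rename(columns={"City Group":"Citygroup"})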
TFI_data_train=TFI_data_train[['Id', 'Open Date', 'City', 'Citygroup', 'Type', 'P1', 'P2', 'P3', 'P4',
       'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11', 'P12', 'P13', 'P14', 'P15',
       'P16', 'P17', 'P18', 'P19', 'P20', 'P21', 'P22', 'P23', 'P24', 'P25',
       'P26', 'P27', 'P28', 'P29', 'P30', 'P31', 'P32', 'P33', 'P34', 'P35',
       'P36', 'P37', 'revenue']]

In [None]:
# Same rename for the test set
TFI_data_test=TFI_data_test.rename(columns={"City Group":"Citygroup"})
TFI_data_test=TFI_data_test[['Id', 'Open Date', 'City', 'Citygroup', 'Type', 'P1', 'P2', 'P3', 'P4',
       'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11', 'P12', 'P13', 'P14', 'P15',
       'P16', 'P17', 'P18', 'P19', 'P20', 'P21', 'P22', 'P23', 'P24', 'P25',
       'P26', 'P27', 'P28', 'P29', 'P30', 'P31', 'P32', 'P33', 'P34', 'P35',
       'P36', 'P37']]

In [None]:
plt.figure(figsize=(18,5))
sns.set_style("whitegrid")
(TFI_data_train.City.value_counts()/len(TFI_data_train)).plot(title="Distribution of City in Train",kind='bar',color='green')
plt.show()
plt.figure(figsize=(18,5))
(TFI_data_test.City.value_counts()/len(TFI_data_test)).plot(title="Distribution of City in Test",kind='bar',color='red')
plt.show()

<p style="font-size:15px"> Some cities in train are absent from test and vice versa. Let's take a look. </p>

In [None]:
cnotintest=[]
cnotintrain=[]
a=TFI_data_train.City.unique()
b=TFI_data_test.City.unique()
for i in a:
    if not i in b:
        cnotintest.append(i)

for i in b:
    if not i in a:
        cnotintrain.append(i)

In [None]:
print("Cities in Test but not in Train are",len(cnotintrain))
print(cnotintrain)
print("Cities in Train but not in Test are",len(cnotintest))
print(cnotintest)

In [None]:
TFI_data_train["Citygroup"].where(TFI_data_train["City"].isin(cnotintest)).unique()

In [None]:
TFI_data_test["Citygroup"].where(TFI_data_test["City"].isin(cnotintrain)).unique()

In [None]:
TFI_data_train["Type"].where(TFI_data_train["City"].isin(cnotintest)).unique()

In [None]:
TFI_data_test["Type"].where(TFI_data_test["City"].isin(cnotintrain)).unique()

In [None]:
a=TFI_data_test.where(TFI_data_test["City"].isin(cnotintrain))

In [None]:
len(a[(a["Type"]=='MB') | (a["Type"]=='DT')])

* The cities present in Test but not in Train, and vice versa, are counted above.
* Grouping these cities by "City Group" shows that, in both datasets, they all belong to "Other".
* Grouping them by "Type" shows that the train-only cities have types 'IL' and 'FC', whereas the test-only cities have types 'FC', 'IL', 'MB', and 'DT'.
* "MB" and "DT" restaurants in these cities constitute only 373 datapoints out of 100000, so they can be ignored.
* These cities can be renamed "UNK" in both datasets.
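
The renaming step can also be sketched with set operations. The helper `mask_unshared` and the toy frames below are hypothetical and only illustrate the idea on made-up city names.

```python
import pandas as pd

def mask_unshared(train, test, col, placeholder="UNK"):
    """Replace values of `col` that occur in only one frame with `placeholder`."""
    shared = set(train[col]) & set(test[col])
    train = train.copy()
    test = test.copy()
    train.loc[~train[col].isin(shared), col] = placeholder
    test.loc[~test[col].isin(shared), col] = placeholder
    return train, test

# Toy data: "A" appears only in train, "D" and "E" only in test
toy_train = pd.DataFrame({"City": ["A", "B", "C"]})
toy_test = pd.DataFrame({"City": ["B", "C", "D", "E"]})
toy_train, toy_test = mask_unshared(toy_train, toy_test, "City")
print(toy_train["City"].tolist())
print(toy_test["City"].tolist())
```

Working on copies keeps the original frames untouched, which makes the step safe to re-run in a notebook.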

In [None]:
TFI_data_test.loc[TFI_data_test.City.isin(cnotintrain), 'City'] = 'UNK'

In [None]:
TFI_data_test.City.value_counts()

In [None]:
TFI_data_train.loc[TFI_data_train.City.isin(cnotintest), 'City'] = 'UNK'
TFI_data_train.City.value_counts()

In [None]:
plt.figure(figsize=(12,5))
sns.set_style("whitegrid")
(TFI_data_train.Citygroup.value_counts()/len(TFI_data_train)).plot(title="Distribution of City Group in Train",kind='bar',color='green')
plt.show()
plt.figure(figsize=(12,5))
(TFI_data_test.Citygroup.value_counts()/len(TFI_data_test)).plot(title="Distribution of City Group in Test",kind='bar',color='red')
plt.show()

The distributions of City Group are nearly identical in both datasets.

In [None]:
plt.figure(figsize=(12,5))
sns.set_style("whitegrid")
(TFI_data_train.Type.value_counts()/len(TFI_data_train)).plot(title="Distribution of restaurant type in Train",kind='bar',color='green')
plt.show()
plt.figure(figsize=(12,5))
(TFI_data_test.Type.value_counts()/len(TFI_data_test)).plot(title="Distribution of restaurant type in Test",kind='bar',color='red')
plt.show()

* Restaurant types "DT" and "MB" are extremely rare. Hence, they can be ignored.
* The percentage shares of "FC" and "IL" are approximately the same in both datasets.
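
To compare category shares numerically rather than by eye, the two `value_counts` results can be placed side by side. This is a small sketch on toy data; the helper name `compare_shares` is an assumption made for illustration.

```python
import pandas as pd

def compare_shares(train_col, test_col):
    """Return the share of each category in train and test, side by side."""
    shares = pd.concat(
        [train_col.value_counts(normalize=True),
         test_col.value_counts(normalize=True)],
        axis=1,
        keys=["train", "test"],
    )
    # Categories missing from one dataset get a share of 0
    return shares.fillna(0.0)

# Toy restaurant types, not the real columns
toy_train = pd.Series(["FC", "FC", "IL", "IL"])
toy_test = pd.Series(["FC", "IL", "IL", "MB"])
shares = compare_shares(toy_train, toy_test)
print(shares)
```

On the real data this would be called as `compare_shares(TFI_data_train.Type, TFI_data_test.Type)`.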

In [None]:
TFI_data_test["Open Date"]=pd.to_datetime(TFI_data_test["Open Date"])
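# Use a pandas Timestamp: subtracting a datetime column from datetime.date is not supported in recent pandas
TFI_data_test["DayssinceInception"]=(pd.Timestamp.today().normalize()-TFI_data_test["Open Date"]).dt.days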
del TFI_data_test["Open Date"]

In [None]:
TFI_data_train["Open Date"]=pd.to_datetime(TFI_data_train["Open Date"])
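# Use a pandas Timestamp: subtracting a datetime column from datetime.date is not supported in recent pandas
TFI_data_train["DayssinceInception"]=(pd.Timestamp.today().normalize()-TFI_data_train["Open Date"]).dt.days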
del TFI_data_train["Open Date"]

In [None]:
TFI_data_train.head(3)

In [None]:
plt.figure(figsize=(12,5))
sns.set_style("whitegrid")
plt.scatter(x=TFI_data_train.DayssinceInception,y=TFI_data_train.revenue,c='r')
plt.show()

plt.figure(figsize=(12,5))
sns.set_style("whitegrid")
plt.scatter(x=np.log(TFI_data_train.DayssinceInception),y=TFI_data_train.revenue,c='g')
plt.show()

* The "DayssinceInception" feature shows a roughly linear, increasing relationship with the revenue.
* Applying the log to "DayssinceInception" not only preserves the increasing relationship with the revenue but also compresses the scale of the values.
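
The effect of the log transform can be checked on a few hypothetical ages. Because the logarithm is strictly increasing, it keeps the ordering of the values (though not necessarily linearity) while shrinking their spread. The numbers below are made up.

```python
import numpy as np

days = np.array([200.0, 2000.0, 8000.0])  # hypothetical restaurant ages in days
log_days = np.log(days)

# Ordering is unchanged because log is strictly increasing
same_order = bool(np.all(np.argsort(days) == np.argsort(log_days)))

# The spread shrinks from thousands of days to a few log units
spread_raw = days.max() - days.min()
spread_log = log_days.max() - log_days.min()
print(same_order, spread_raw, round(spread_log, 3))
```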

In [None]:
# Pass figsize to subplots directly; a separate plt.figure() call would leave an empty figure behind
f, (ax1, ax2) = plt.subplots(1,2,figsize=(10,6))
sns.boxplot(y=TFI_data_train.DayssinceInception,ax=ax1,color='r')
ax1.set_title("DayssinceInception-Train")
sns.boxplot(y=TFI_data_test.DayssinceInception,ax=ax2,color='g')
ax2.set_title("DayssinceInception-Test")
f.tight_layout()

In [None]:
np.log(TFI_data_test.DayssinceInception).describe()

In [None]:
np.log(TFI_data_train.DayssinceInception).describe()

The summary statistics are nearly the same for both Test and Train.

In [None]:
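# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(np.log(TFI_data_test.DayssinceInception),kde=True,stat="density",label='Test')
sns.histplot(np.log(TFI_data_train.DayssinceInception),kde=True,stat="density",label='Train')
plt.legend()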

The distributions are closely similar.

In [None]:
TFI_data_train["DayssinceInception"]=np.log(TFI_data_train.DayssinceInception)
TFI_data_test["DayssinceInception"]=np.log(TFI_data_test.DayssinceInception)

In [None]:
a=(TFI_data_train==0).astype(int).sum(axis=0)
a

In [None]:
b=(TFI_data_test==0).astype(int).sum(axis=0)
b

In [None]:
df1 = pd.DataFrame(data=a.index, columns=['cols'])
df2 = pd.DataFrame(data=a.values/len(TFI_data_train), columns=['cnt_trn'])
df_trn = pd.merge(df1, df2, left_index=True, right_index=True)

In [None]:
df11 = pd.DataFrame(data=b.index, columns=['cols'])
df21 = pd.DataFrame(data=b.values/len(TFI_data_test), columns=['cnt_tst'])
df_tst = pd.merge(df11, df21, left_index=True, right_index=True)

In [None]:
df_zeros = pd.merge(df_trn, df_tst, left_index=True, right_index=True)

In [None]:
df_zeros.drop("cols_y",axis=1)

The P features are compared by the share of zeros and non-zeros in each feature. Interestingly, Test and Train display the same ratio for every feature except __P3__ and __P29__.
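
The same comparison can be written more compactly, since the mean of a boolean mask is the fraction of `True` entries. This is a sketch on toy frames; the helper `zero_share` and the `abs_diff` column are illustrative additions, not part of the original analysis.

```python
import pandas as pd

def zero_share(train, test):
    """Fraction of zero entries per shared column, for train and test side by side."""
    cols = train.columns.intersection(test.columns)
    out = pd.DataFrame({
        "train": (train[cols] == 0).mean(),
        "test": (test[cols] == 0).mean(),
    })
    # Large values flag features whose zero share differs between the datasets
    out["abs_diff"] = (out["train"] - out["test"]).abs()
    return out

# Toy frames, not the real P features
toy_train = pd.DataFrame({"P1": [0, 1, 2, 0], "P2": [0, 0, 0, 5]})
toy_test = pd.DataFrame({"P1": [0, 3, 0, 0], "P2": [0, 0, 0, 1]})
zeros = zero_share(toy_train, toy_test)
print(zeros)
```

Sorting the result by `abs_diff` would bring features such as P3 and P29 to the top if their zero shares really diverge.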

A sample CDF plot of one feature further reinforces the uniformity observed above.

In [None]:
c = np.cumsum(TFI_data_train.P36.values/len(TFI_data_train))
sns.set_style("whitegrid")
plt.plot(c,label='Cumulative distribution of P36 in train')
plt.grid()
plt.legend()
plt.show()

c = np.cumsum(TFI_data_test.P36.values/len(TFI_data_test))
sns.set_style("whitegrid")
plt.plot(c,label='Cumulative distribution of P36 in test')
plt.grid()
plt.legend()
plt.show()

The 50% mark of both Test and Train occurred at 1.0.
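
Note that the plots above accumulate the raw values in row order. A conventional empirical CDF instead sorts the values and plots the fraction of observations at or below each one. A minimal sketch, with the helper name `ecdf` and the toy input as assumptions:

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of observations <= each of them."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Toy values, not the real P36 column
x, y = ecdf([3, 0, 1, 0])
print(x, y)
```

With matplotlib, `plt.step(x, y, where="post")` would draw the resulting curve, and the median can be read off where `y` first reaches 0.5.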

In [None]:
# Remove the only outlier (Id 16); copy so that later column assignments do not act on a view
TFI_data_train=TFI_data_train[TFI_data_train.Id!=16].copy()

In [None]:
# Revenue is approximately lognormally distributed (see the plot below), so model its log
TFI_data_train["revenue"]=np.log(TFI_data_train.revenue)

In [None]:
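f, (ax1, ax2) = plt.subplots(2,figsize=(10,6))
# revenue was log-transformed in the previous cell, so undo the log for the raw-scale panel
sns.histplot(np.exp(TFI_data_train["revenue"]),kde=True,stat="density",ax=ax1)
ax1.set_title("revenue")
sns.histplot(TFI_data_train["revenue"],kde=True,stat="density",ax=ax2)
ax2.set_title("log of revenue")
f.tight_layout()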

In [None]:
TFI_data_train.revenue[0]

In [None]:
import math
math.e**TFI_data_train.revenue[0]

In [None]:
TFI_data_train.columns

In [None]:
TFI_data_train_fin = TFI_data_train[['Citygroup', 'Type','DayssinceInception','P1', 'P2','P4', 'P5', 'P6',
       'P7', 'P8', 'P9', 'P10', 'P11', 'P12', 'P13', 'P14', 'P15', 'P16',
       'P17', 'P18', 'P19', 'P20', 'P21', 'P22', 'P23', 'P24', 'P25', 'P26',
       'P27', 'P28', 'P30', 'P31', 'P32', 'P33', 'P34', 'P35', 'P36',
       'P37', 'revenue']]

In [None]:
sns.barplot(y=math.e ** TFI_data_train["revenue"],x=TFI_data_train_fin["Citygroup"])

Since the revenue of Big Cities is higher than that of Other, assign 1 to Big Cities and 0 to Other.

In [None]:
# Encode Citygroup numerically in both datasets: Big Cities -> 1, Other -> 0
citygroup_map={"Big Cities":1,"Other":0}
TFI_data_train_fin=TFI_data_train_fin.assign(Citygroup=TFI_data_train_fin["Citygroup"].map(citygroup_map))
TFI_data_test["Citygroup"]=TFI_data_test["Citygroup"].map(citygroup_map)

In [None]:
TFI_data_train_fin.head(2)

In [None]:
sns.barplot(y=math.e ** TFI_data_train["revenue"],x=TFI_data_train_fin["Type"])

In [None]:
sns.countplot(TFI_data_train.Type)

In [None]:
sns.countplot(TFI_data_test.Type)

* Create dummy variables for "Type" and delete "Type_MB" and "Type_DT" in both Train and Test, as they are extremely rare and would only add noise.
* Remove City from both datasets, as the obfuscated features already contain the geographical data.
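
One way to guarantee that Train and Test end up with identical dummy columns is to reindex both against an explicit list of levels to keep. This is a sketch on toy frames; the `keep` list and the frame names are assumptions made for illustration.

```python
import pandas as pd

# Toy data: train lacks the "MB" level that test contains
toy_train = pd.DataFrame({"Type": ["FC", "IL", "DT"]})
toy_test = pd.DataFrame({"Type": ["FC", "IL", "MB", "DT"]})

keep = ["Type_FC", "Type_IL"]  # rare levels (DT, MB) are left out

# reindex drops unlisted dummy columns and would add any missing ones as zeros
train_d = pd.get_dummies(toy_train, columns=["Type"], dtype=int).reindex(columns=keep, fill_value=0)
test_d = pd.get_dummies(toy_test, columns=["Type"], dtype=int).reindex(columns=keep, fill_value=0)
print(list(train_d.columns) == list(test_d.columns))
```

This avoids dropping columns by name, which fails when a level is absent from one of the datasets.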

In [None]:
# dtype=int keeps the dummies numeric; recent pandas defaults to bool columns
TFI_data_train_fin = pd.get_dummies(TFI_data_train_fin,columns=['Type'],dtype=int)

In [None]:
TFI_data_train_fin.head(3)

In [None]:
# dtype=int keeps the dummies numeric; recent pandas defaults to bool columns
TFI_data_test = pd.get_dummies(TFI_data_test,columns=['Type'],dtype=int)

In [None]:
TFI_data_test.head(2)

In [None]:
# Drop City and the rare Type dummies in a single step; dropping "Type_MB" a second time would raise a KeyError
TFI_data_test1=TFI_data_test.drop(["City","Type_DT","Type_MB"],axis=1)

In [None]:
TFI_data_train_fin=TFI_data_train_fin.drop(["Type_DT"],axis=1)

In [None]:
TFI_data_train_fin.columns

In [None]:
TFI_data_test1.columns

In [None]:
TFI_data_train_fin = TFI_data_train_fin[['Citygroup', 'DayssinceInception','Type_FC', 'Type_IL','P1', 'P2', 'P4', 'P5', 'P6', 'P7',
       'P8', 'P9', 'P10', 'P11', 'P12', 'P13', 'P14', 'P15', 'P16', 'P17',
       'P18', 'P19', 'P20', 'P21', 'P22', 'P23', 'P24', 'P25', 'P26', 'P27',
       'P28', 'P30', 'P31', 'P32', 'P33', 'P34', 'P35', 'P36', 'P37']]

In [None]:
train_rev = TFI_data_train.revenue
print(len(train_rev))
print(len(TFI_data_train_fin))

In [None]:
TFI_data_test1=TFI_data_test1[['Citygroup', 'DayssinceInception','Type_FC', 'Type_IL','P1', 'P2', 'P4', 'P5', 'P6', 'P7',
       'P8', 'P9', 'P10', 'P11', 'P12', 'P13', 'P14', 'P15', 'P16', 'P17',
       'P18', 'P19', 'P20', 'P21', 'P22', 'P23', 'P24', 'P25', 'P26', 'P27',
       'P28', 'P30', 'P31', 'P32', 'P33', 'P34', 'P35', 'P36', 'P37']]

In [None]:
y=train_rev.values
x=TFI_data_train_fin.values

##### Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y,test_size=0.2,random_state=4)

##### Linear Model

In [None]:
#with all the features
import statsmodels.api as sm

# Note the difference in argument order
model = sm.OLS(y_train, x_train).fit()
y_trn_pred = math.e ** model.predict(x_train) 
y_test_pred = math.e ** model.predict(x_test) # make the predictions by the model

# Print out the statistics
model.summary()

In [None]:
from sklearn.metrics import mean_squared_error
print("Root mean squared error achieved from Linear Model:",np.sqrt(mean_squared_error(math.e **y_test, y_test_pred)))

##### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
cls = RandomForestRegressor(n_estimators=1250)
cls.fit(x_train, y_train)
y_pred_trn_rf = cls.predict(x_train)
y_pred_test_rf = math.e ** cls.predict(x_test)

In [None]:
cls.score(x_train, y_train)

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(cls,x_train, y_train, cv=5)
scores

In [None]:
print("Root mean squared error achieved from RF:",np.sqrt(mean_squared_error(math.e **y_test, y_pred_test_rf)))

##### Ridge Regressor

In [None]:
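from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Ridge model: search over the regularization strength alpha.
# The `normalize` option was removed from Ridge in scikit-learn 1.2, so it is not part of the grid here.
model_grid = [{'alpha': np.logspace(0,10)}]
ridge_clf = Ridge()

# Use a grid search with 10-fold CV on the train set to find the best regularization parameter to use.
grid = GridSearchCV(ridge_clf, model_grid, cv=10, scoring='neg_mean_squared_error')
grid.fit(x_train,y_train)
# Predict on the held-out split and map back from log revenue
y_pred_ridge = math.e ** grid.predict(x_test)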

In [None]:
print("Root mean squared error achieved from Ridge:",np.sqrt(mean_squared_error(math.e **y_test, y_pred_ridge)))

In [None]:
print("Root mean squared error achieved from Linear Model:",np.sqrt(mean_squared_error(math.e **y_test, y_test_pred)))
print("Root mean squared error achieved from RF:",np.sqrt(mean_squared_error(math.e **y_test, y_pred_test_rf)))
print("Root mean squared error achieved from Ridge:",np.sqrt(mean_squared_error(math.e **y_test, y_pred_ridge)))

<p style="font-size:17px;"> Random Forest proved to be the best of the three models, so we make the final submission with it. </p>

In [None]:
x_tst = TFI_data_test1.values

In [None]:
type(x_train)

In [None]:
final_pred = math.e ** cls.predict(x_tst)

In [None]:
submission = pd.DataFrame({
        "Id": TFI_data_test["Id"],
        "Prediction": final_pred
    })
submission.to_csv('randomres.csv',header=True, index=False)

<p style="font-size:17px;"> The RMSE of my Kaggle submission is 1757763.58777, which differs slightly from the RMSE of 1651321.84111 that Random Forest achieved on the hold-out split. </p>

The difference is approximately 100000, which with 100000 test observations amounts to roughly 1 point per test observation added to the overall RMSE :)
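
The arithmetic behind this remark can be spelled out as follows. The variable names are illustrative, and the per-row figure is only a back-of-the-envelope reading of the gap, not a statistical decomposition of the RMSE.

```python
kaggle_rmse = 1757763.58777   # leaderboard RMSE quoted above
holdout_rmse = 1651321.84111  # Random Forest RMSE on the local hold-out split
n_test = 100_000              # test-set size from the problem statement

gap = kaggle_rmse - holdout_rmse
relative_gap = gap / holdout_rmse   # gap as a fraction of the hold-out RMSE
gap_per_row = gap / n_test          # gap spread evenly over the test rows
print(round(gap, 5), round(relative_gap, 4), round(gap_per_row, 3))
```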