# Introduction #
Restaurants are constantly looking for ways to cut costs while continuing to serve food of the same quality. They can do this by finding efficiencies in labour, production or raw materials. In this project, we will look at a dataset to determine whether a model can more accurately predict when people will order a certain item. If done effectively, this will allow the restaurant to make more accurate purchases, which helps save costs by reducing waste and by taking advantage of sales when making the purchases required to meet demand.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

In [None]:
ful = pd.read_csv('/kaggle/input/food-demand/foodDemand_train/fulfilment_center_info.csv')
meal = pd.read_csv('/kaggle/input/food-demand/foodDemand_train/meal_info.csv')
df = pd.read_csv('/kaggle/input/food-demand/foodDemand_train/train.csv')


In [None]:
ful.head()

In [None]:
meal.head()

In [None]:
df.describe()

In [None]:
ful.region_code.nunique()

I merged the columns from all of the datasets to see whether the additional information can help the model predict more accurately.

In [None]:
df = df.merge(meal, on='meal_id')

In [None]:
df = df.merge(ful, on='center_id')
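
A merge like the ones above can silently duplicate or drop rows if a key is repeated or missing in the lookup table. The sketch below shows one way to guard against that on toy data. The frames `orders` and `meals` and their values are made up for illustration, and the left join with `validate` is an assumption, not what the cells above do (they use the default inner join).

```python
import pandas as pd

# Toy stand-ins for train.csv and meal_info.csv (all values are made up)
orders = pd.DataFrame({
    "id": [1, 2, 3],
    "meal_id": [10, 11, 10],
    "num_orders": [40, 13, 27],
})
meals = pd.DataFrame({
    "meal_id": [10, 11],
    "category": ["Beverages", "Pizza"],
})

# validate="many_to_one" raises a MergeError if meal_id is duplicated in `meals`
merged = orders.merge(meals, on="meal_id", how="left", validate="many_to_one")

# A left join keeps every order row, so the row count must not change
assert len(merged) == len(orders)
print(merged)
```

Checking `merged.isna().sum()` afterwards, as the notebook does, then shows whether any order had no matching lookup row.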

In [None]:
df.isna().sum()

## EDA ##
First, I will look at a breakdown of the number of unique ids (each id can cover multiple orders) by category, coloured by cuisine. We can see that Beverages are clearly sold the most. Beverages are generally cheaper than full meals, so we would expect to see higher sales of them. The rest of the meals are relatively evenly spread.

In [None]:
# displot is a figure-level function: it creates its own figure, sized by height and aspect
g = sns.displot(data=df, x='category', hue='cuisine', height=6, aspect=2, multiple='stack')
g.set_xticklabels(rotation=30)
plt.show()

Next, I wanted to show the average base price of each meal category. Seafood, Fish and Pizza have the highest prices among the categories.

In [None]:
df.groupby(['category'])['base_price'].mean()

In the graph below, we're looking at the total base price by week and cuisine. It's interesting to see such a large discrepancy between the Continental cuisine and the rest of the cuisines when it comes to the total base_price sold. It would seem that the higher-priced meals (pizza and seafood) fall under the Continental cuisine. In addition, the total base price drops at around the same time for all of the cuisines, although the percentage drops differ between them.

In [None]:
plt.figure(figsize=(15,8))
for typ in list(df['cuisine'].unique()):
    # Sum of base_price over all rows for this cuisine, per week
    weekwise = df[df['cuisine'] == typ].groupby('week').base_price.sum()
    weekwise.plot(label=typ)
plt.ylabel('Total Base Price Per Cuisine')
plt.legend()
plt.show()

In [None]:
top_cont = df[df['cuisine'] == 'Continental'].groupby('category').num_orders.sum()
top_cont

Next, I wanted to see what the difference is between the base and checkout price. This could show whether the total prices are mainly driven by discounts. Thai food seemed to have the most consistent price between base and checkout, while Continental food had the largest difference.

In [None]:
fig, subs = plt.subplots(4, figsize=(15,15))
color = ['green', 'red', 'blue', 'orange']
for typ, sub, col in zip(list(df['cuisine'].unique()), subs, color):
    by_week = df[df['cuisine'] == typ].groupby('week')
    # Weekly total base price minus weekly total checkout price
    weekwise = by_week.base_price.sum() - by_week.checkout_price.sum()
    sub.plot(weekwise, color=col)
    sub.set_title(typ)
    sub.set_xlabel('Week')
    sub.set_ylabel('Difference in Base and Checkout Price')

fig.suptitle('Price Difference Per Week', fontsize=20)
# Reserve the top 5% of the figure so the subplots do not overlap the suptitle
fig.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()
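
Because the absolute difference between summed prices grows with the number of rows in a cuisine, a relative discount may be easier to compare across cuisines. The following is a minimal sketch on made-up numbers; the `toy` frame and the `discount_pct` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Made-up rows with the same column names as the merged dataset
toy = pd.DataFrame({
    "cuisine": ["Thai", "Thai", "Continental", "Continental"],
    "week": [1, 1, 1, 1],
    "base_price": [100.0, 200.0, 400.0, 600.0],
    "checkout_price": [95.0, 190.0, 300.0, 500.0],
})

# Weekly totals per cuisine
totals = toy.groupby(["cuisine", "week"])[["base_price", "checkout_price"]].sum()

# Discount as a percentage of the total base price
totals["discount_pct"] = (
    100 * (totals["base_price"] - totals["checkout_price"]) / totals["base_price"]
)
print(totals)
```

In this toy data the Thai rows are discounted by 5% and the Continental rows by 20%, regardless of how many rows each cuisine contributes.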

After looking at the price, I wanted to see which categories were leading in the overall number of orders. For the graph, I thought it would be reasonable to look at the weekly orders of the six categories with the most orders. The lower-cost items led in number of orders, with Pizza being the lone higher-priced item to make it into the top 6.

In [None]:
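# Total orders per category, largest first, so the top 6 can be read off directly
top6 = df.groupby('category').num_orders.sum().sort_values(ascending=False)
top6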

In [None]:
fig, (ax1, ax2, ax3,ax4, ax5,ax6) = plt.subplots(6, figsize = (15,15))
top6 = ['Beverages', 'Rice Bowl', 'Sandwich', 'Salad', 'Pizza', 'Other Snacks']
color = ['purple', 'red', 'blue', 'orange','purple', 'purple']
subs = [ax1,ax2,ax3,ax4,ax5,ax6]
for typ, sub, col in zip(top6, subs, color):
    weekwise = df[df['category'] == typ].groupby('week').num_orders.sum() 
    sub.plot(weekwise, color = col)
    sub.set_title(typ)
    sub.set_xlabel('Week')
    sub.set_ylabel('Number of Orders by Category')
   
fig.suptitle('Orders Per Week: Top 6', fontsize = 20)
fig.subplots_adjust(top=0.90)
plt.xlabel('Week')
fig.tight_layout()
plt.show()

The next graphs are violin plots showing the density of the checkout price by whether an item had an emailed promotion and by whether it was featured on the homepage. In both cases, when the item was promoted or featured (1), checkout prices are mostly concentrated around 300 and 500.

In [None]:
plt.clf()
sns.violinplot(data=df, x='emailer_for_promotion', y= 'checkout_price')
plt.title('Promotion Compared to Checkout Price')

plt.show()
sns.violinplot(data=df, x='homepage_featured', y= 'checkout_price')
plt.title('Featured on The Homepage to Checkout Price')
plt.show()

In [None]:
# Work on a copy so the new columns are not also added to df
norm_df = df.copy()
norm_df

To make the base and checkout prices closer to normally distributed, I took the log of each column.

In [None]:
norm_df['log_check'] = np.log(df['checkout_price'])
norm_df['log_base'] = np.log(df['base_price'])
norm_df

In [None]:
sns.displot(data=norm_df, x='log_check', kind='kde')
plt.show()
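
The effect of the log transform can be checked numerically by comparing skewness before and after. The sketch below uses a small made-up, right-skewed price series (not values from the dataset) and assumes all prices are positive, which `np.log` requires.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices: most values are low, one is much higher
prices = pd.Series([100.0, 120.0, 150.0, 200.0, 800.0])

# Natural log compresses the long right tail
log_prices = np.log(prices)

# Skewness closer to 0 means a more symmetric distribution
print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())

# The transform is reversible with np.exp, so predictions can be mapped back
restored = np.exp(log_prices)
```

A skewness closer to zero after the transform suggests the logged column is more symmetric, though it does not by itself show that the column is normal.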

## Data Preparation ##
In this section, I will drop the columns that are no longer required and normalize the remaining numeric columns.

In [None]:
norm_df = norm_df.drop(['checkout_price', 'base_price', 'center_id', 'meal_id', 'id'], axis=1)

In [None]:
# Move num_orders to the end of the column list, keeping the other columns in order
dtyp = [col for col in norm_df.columns if col != 'num_orders']
dtyp.append('num_orders')
dtyp

In [None]:
norm_df = norm_df[dtyp]

In [None]:
from sklearn import preprocessing

# Scale each numeric column independently to the [0, 1] range
scale_cols = ['op_area', 'log_check', 'log_base']
min_max_scaler = preprocessing.MinMaxScaler()
norm_df[scale_cols] = min_max_scaler.fit_transform(norm_df[scale_cols])
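
Min-max scaling maps each value to `(x - min) / (max - min)`. The cell above fits the scaler on the full dataset; a common alternative, sketched here on made-up values, is to fit it on the training rows only so the test rows do not influence the scaling. The array `op_area` below is a hypothetical stand-in, not data from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data standing in for a column such as op_area
op_area = np.array([[2.0], [3.0], [4.0], [5.0], [7.0], [3.5]])

# Hold out 2 rows for testing
area_train, area_test = train_test_split(op_area, test_size=2, random_state=0)

# Learn min and max from the training rows only, then apply to both splits
scaler = MinMaxScaler().fit(area_train)
train_scaled = scaler.transform(area_train)
test_scaled = scaler.transform(area_test)

# The same result by hand: (x - min) / (max - min)
manual = (area_train - area_train.min()) / (area_train.max() - area_train.min())
assert np.allclose(train_scaled, manual)
```

With this approach, scaled test values can fall slightly outside [0, 1] when the test rows contain values beyond the training range.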

In [None]:
# One-hot encode the first eight columns in a single call; each new column is prefixed with its source column name
norm_df = pd.get_dummies(norm_df, columns=list(norm_df.columns[:8]))
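
As a small illustration of what `pd.get_dummies` does, the sketch below encodes one categorical column in a made-up frame. The encoded column is removed and its indicator columns are appended after the remaining columns.

```python
import pandas as pd

# Made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    "cuisine": ["Thai", "Indian", "Thai"],
    "op_area": [2.0, 3.5, 4.0],
})

# Replace `cuisine` with one indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["cuisine"], prefix=["cuisine"])
print(encoded.columns.tolist())
# ['op_area', 'cuisine_Indian', 'cuisine_Thai']
```

Because the encoded columns move to the end, selecting columns by position afterwards is fragile; selecting by name is safer.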

In [None]:
# Select the target by name rather than by position, which shifts after one-hot encoding
target = norm_df['num_orders']

xinfo = norm_df.drop(['num_orders'],axis=1)
xinfo

In [None]:
# Use train_test_split to split the data into training and testing datasets.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(xinfo, target, test_size=.2, random_state=10)

## Models ##
I chose to run a few linear models, a decision tree and a support vector regressor. The linear regression, lasso, SGD regression and support vector regression models all resulted in an R-squared value lower than 50%. The decision tree provides an R-squared value of over 70%.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn import svm
from sklearn.linear_model import SGDRegressor

In [None]:
linreg = LinearRegression()
linreg.fit(x_train, y_train)
pre_linear = linreg.predict(x_test)
r_sq = linreg.score(x_test, y_test)
print('Coefficient of determination:', r_sq)
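
The `score` method of a scikit-learn regressor returns the coefficient of determination, R-squared, defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers and compares it with `r2_score`. It also reports the mean absolute error, which is an assumption about what might be useful here since it is expressed in units of orders.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted order counts
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Residual sum of squares: squared errors of the model
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: squared errors of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
mae = mean_absolute_error(y_true, y_pred)

print("R-squared (manual): ", r2_manual)
print("R-squared (sklearn):", r2_score(y_true, y_pred))
print("Mean absolute error:", mae)
```

R-squared measures the share of variance explained rather than the share of correct predictions, so an error metric in order units can be easier to relate to purchasing decisions.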

In [None]:
# Loop over tree depths to find the best tradeoff between depth and R-squared on the test set
score_list = []

for i in range(2,20):
    decreg = DecisionTreeRegressor(max_depth = i)
    decreg.fit(x_train, y_train)
    pre_tree = decreg.predict(x_test)
    r_sq = decreg.score(x_test, y_test)
    score_list.append(r_sq)

In [None]:
fig = plt.figure()
plt.plot(list(range(2,20)), score_list)
plt.title("Best Depth For The Tree")
plt.xticks(list(range(2,20)))
plt.ylabel("R-Squared Score")
plt.xlabel("Depth of Tree")
plt.grid()
plt.show()
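
Choosing the depth by reading the test-set curve can make the final score look better than it would on unseen data. One alternative, sketched below on synthetic data from `make_regression` (not the food-demand dataset), is to pick the depth by cross-validation on the training data. The depth range and the 5-fold setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for x_train / y_train
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

depths = list(range(2, 11))
cv_scores = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    # Mean R-squared across 5 folds of the training data
    cv_scores.append(cross_val_score(tree, X, y, cv=5, scoring="r2").mean())

# Depth with the highest mean cross-validated R-squared
best_depth = depths[int(np.argmax(cv_scores))]
print("Best depth:", best_depth)
```

The test set is then used only once, to score the model refitted at `best_depth`.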

In [None]:
decreg = DecisionTreeRegressor(max_depth = 12)
decreg.fit(x_train, y_train)
r_sq = decreg.score(x_test, y_test)
print('Coefficient of determination:', r_sq)

In [None]:
new_alp = []

for i in np.arange(0.5,5,.5):
    lasreg = Lasso(alpha=i)
    lasreg.fit(x_train, y_train)
    pre_linear = lasreg.predict(x_test)
    r_sq = lasreg.score(x_test, y_test)
    new_alp.append(r_sq)

In [None]:
fig = plt.figure()
plt.plot(np.arange(0.5,5,.5), new_alp)
plt.title("Best Alpha For Lasso")
plt.xticks(np.arange(0.5,5,.5))
plt.ylabel("R-Squared Score")
plt.xlabel("Alpha")
plt.grid()
plt.show()

In [None]:
loss_func = []

# 'squared_error' is the loss name in scikit-learn 1.0 and later (formerly 'squared_loss')
for loss in ['squared_error', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive']:
    SGDreg = SGDRegressor(loss=loss)
    SGDreg.fit(x_train, y_train)
    # Predict with the SGD model that was just fitted
    pre_SGD = SGDreg.predict(x_test)
    loss_func.append(pre_SGD)
    r_sq = SGDreg.score(x_test, y_test)
    print('Coefficient of determination:', r_sq)

In [None]:
for i in range(0,5):
    svreg = svm.LinearSVR(epsilon=i)
    svreg.fit(x_train, y_train)
    pre_svr = svreg.predict(x_test)
    r_sq = svreg.score(x_test, y_test)
    print('Coefficient of determination:', r_sq)

## Conclusion ##
Although the decision tree regressor provides a much higher R-squared value (used here as the measure of accuracy) than the other models, it would probably still be unacceptable for a restaurant to use for determining the number of orders. With an R-squared value slightly above 70%, it may produce somewhat more accurate predictions of future orders, but it may also give the user too much confidence in the model. This could result in underinvestment in the food required, leaving too little to meet demand. When customers start losing trust in a restaurant, it can result in a lot of lost sales in the future.
