# Restaurant Business Analysis

> **Situation:**
> 
> The restaurant business has been evolving rapidly in the United States. How far a restaurant business can grow may depend on the category it chooses. Restaurant categories include franchise type, food type, service type, and restaurant location. To determine the most economical category, three datasets are used to identify which categories show the most profitable sales.
> 
> 
> 
> **Problem:**

> The first dataset (future50) is organized by franchising and non-franchising categories.
> The second dataset (independence100) is organized by restaurant location.
> The third dataset (top250) is organized by food type and restaurant service type.
> 
> Defining the most profitable category would require combining many categories. However, we can approach the question from each of the three datasets.
> 
> 
> **Solution:**
> The solution for each dataset is presented at the end of that dataset's analysis.
> 
> 
> 
> **Summary:**
> 
> Both franchising and non-franchising models are used by several restaurants (Future50), while many restaurants were established and grew mainly in big cities (Independence 100). Therefore, restaurant categories based on food and service type may be the most influential factors in defining restaurant type (top250).

> # The analysis is shown below

In [None]:
#import basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
sns.set_theme(style="whitegrid")
from statsmodels.formula.api import ols

In [None]:
#import the files

top50 = pd.read_csv('../input/restaurant-business-rankings-2020/Future50.csv')
others100 = pd.read_csv('../input/restaurant-business-rankings-2020/Independence100.csv')
top250 = pd.read_csv('../input/restaurant-business-rankings-2020/Top250.csv')

# FUTURE 50 ANALYSIS

In [None]:
top50.columns
top50.dtypes

In [None]:
top50.Franchising.unique()

In [None]:
top50["YOY_Sales"] = top50["YOY_Sales"].apply(lambda x: x.replace('%', ''))
top50["YOY_Units"] = top50["YOY_Units"].apply(lambda x: x.replace('%', ''))

#top50["YOY_Sales"] = top50["YOY_Sales"].astype('float')/100
#top50["YOY_Units"] = top50["YOY_Units"].astype('float')/100
top50["YOY_Sales"] = (pd.to_numeric(top50["YOY_Sales"]))/100
top50["YOY_Units"] = (pd.to_numeric(top50["YOY_Units"]))/100

top50
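
The percent columns can also be converted without a per-row lambda. The following is a minimal sketch, assuming the values are strings such as `'130.5%'`; the small frame `demo` is made up for illustration and is not taken from Future50.csv.

```python
import pandas as pd

# Hypothetical sample values, not taken from Future50.csv
demo = pd.DataFrame({"YOY_Sales": ["130.5%", "39.1%", "-2.0%"]})

# Strip the trailing percent sign, then convert the result to a fraction
demo["YOY_Sales"] = pd.to_numeric(demo["YOY_Sales"].str.rstrip("%")) / 100

print(demo["YOY_Sales"].tolist())
```

The vectorized `.str` accessor does the same job as the `apply` call above and leaves missing values as `NaN` instead of raising an error.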

In [None]:
top50["Franchising"] = top50["Franchising"].apply(lambda x: x.replace('Yes', '1'))
top50["Franchising"] = top50["Franchising"].apply(lambda x: x.replace('No', '0'))

top50["Franchising"] = (pd.to_numeric(top50["Franchising"]))

YES = 1
NO = 0

In [None]:
sns.set(font_scale=1.8)
plt.figure(figsize=(20,10))
df_50=top50[['Sales','YOY_Sales','Units','YOY_Units','Unit_Volume','Franchising']]
sns.heatmap(df_50.corr(),annot=True,cmap="viridis")

In [None]:
sns.pairplot(df_50,height=2, hue = "Franchising")
# YES = 1 and NO = 0

In [None]:
sns.set(font_scale=1)
fig = sns.PairGrid(df_50, hue = "Franchising")

fig.map_lower(sns.kdeplot, cmap='YlOrBr')
fig.map_upper(sns.scatterplot, color = 'green')
fig.map_diag(sns.histplot, bins=10, color = 'red')
fig.add_legend()

# YES = 1 and NO = 0

There are two restaurant categories here: franchising (1 = orange) and non-franchising (0 = blue). YOY Units and YOY Sales are well correlated, with outliers found in fewer than 10 restaurants. Franchising is an important factor in the food and beverage industry. Although the first-ranked restaurant is an exception, franchising overall has a positive correlation with YOY Sales. Sales and YOY Sales do not show any clear correlation; the relationship looks erratic. The same is true of Units and YOY Units, which are not clearly correlated either. Units and Unit_Volume, on the other hand, show a strong relationship that follows a power law. Let's check it out!
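
Because `Franchising` is coded as 0/1, its Pearson correlation with a numeric column is positive exactly when the franchising group has the higher mean. The sketch below illustrates this on made-up numbers; the frame `toy` and its values are assumptions for illustration, not results from the Future50 data.

```python
import pandas as pd

# Hypothetical toy data: 1 = franchising, 0 = non-franchising
toy = pd.DataFrame({
    "Franchising": [1, 1, 1, 0, 0, 0],
    "YOY_Sales":   [0.40, 0.35, 0.30, 0.20, 0.25, 0.15],
})

# Pearson correlation between the 0/1 flag and the numeric column
r = toy["Franchising"].corr(toy["YOY_Sales"])

# Group means show the direction of the effect directly
group_means = toy.groupby("Franchising")["YOY_Sales"].mean()

print(round(r, 3))
print(group_means)
```

Comparing the two group means is often easier to read than a single heat-map cell, since it shows the size of the difference in the original units.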

In [None]:
#plot data UNITS VS UNITS_VOLUME

plt.figure(figsize=[15,10])
g = sns.scatterplot(data=top50, x = "Units"  , y = "Unit_Volume",  alpha = 0.7, hue="Franchising")
#g. despine(left=True)
#g.set_axis_labels("YOY Units in percents", " YOY Sales in percents")
data, labels = plt.xticks()
plt.setp(labels, rotation=90)
plt.title("Units vs Unit_Volume")


In [None]:
#the graph shown can be modeled using a power function
#plot both the x and y axes on a logarithmic scale

log_units = np.log(top50['Units'])
log_unitvolume = np.log(top50['Unit_Volume'])

plt.figure(figsize=[15,10])
g = sns.scatterplot(x = log_units, y = log_unitvolume)
g.set_title('Log Unit Volume vs Log Units')
g.set_ylabel('log(Unit_Volume)')
g.set_xlabel('log(Units)')

In [None]:
#Let us use ordinary least squares :
# first, let us transform the data frame into its log value
# get only the unit and the unit volume
dflog = np.log(top50[['Units', 'Unit_Volume']])
#dflog
# regress log(Unit_Volume) on log(Units), matching the equation used in the next cell
model = ols('Unit_Volume ~ Units', data = dflog).fit()
model.summary()

In [None]:
#The intercept and the slope are significant
#Let us now convert the fit into a model on the original scale

# log(Unit_Volume) = 11.2105 - 1.0976 * log(Units)
# e^(log(Unit_Volume)) = e^(11.2105 - 1.0976 * log(Units))
# Unit_Volume = e^(11.2105) * Units^(-1.0976)
# Unit_Volume ≈ 73,902.36 * Units^(-1.0976)

#Get the predicted data using the equation above
UnitVolume_hat = np.exp(11.2105) * top50['Units']**(-1.0976)

#plot the data :
#change plotting parameters
matplotlib.rcParams.update({'font.size' : 18, 'font.family' : 'serif'})

fig, me = plt.subplots(figsize = (10 , 6))

# x axis: number of units, y axis: unit volume (actual in red, predicted in green)
me.scatter(top50['Units'], top50['Unit_Volume'], s = 50, color = 'red', label = 'Actual Unit_Volume')
me.plot(top50['Units'], UnitVolume_hat, 'g*', markersize = 10, label = 'UnitVolume_hat')
me.legend(loc = 1)
me.grid(True)
me.set_title('Actual vs Predicted Unit Volume')
me.set_xlabel('Units')
me.set_ylabel('Unit Volume')

The strong relationship between Unit Volume and Units follows a power law. This relationship can be used to predict the Unit Volume of a restaurant from the number of Units it operates.
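
The log-log fitting idea can be sketched in isolation. The example below assumes noise-free synthetic data generated from a known power law (the constants 5000 and -0.5 are invented, not estimates from Future50) and uses `numpy.polyfit` in place of the `ols` call, to show how the slope and intercept in log space map back to the exponent and the scale factor.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 5000 * x**(-0.5)
units = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
unit_volume = 5000.0 * units ** (-0.5)

# A power law y = a * x**b is a straight line in log-log space:
# log(y) = log(a) + b * log(x)
b, log_a = np.polyfit(np.log(units), np.log(unit_volume), deg=1)
a = np.exp(log_a)

# Back-transform to predict on the original scale
predicted = a * units ** b

print(round(a, 2), round(b, 4))
```

On real data the points scatter around the line, so the recovered `a` and `b` are estimates rather than exact values.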

# INDEPENDENCE 100 ANALYSIS

In [None]:
others100.columns

In [None]:
others100.head()

In [None]:
others100.City.unique()

> # Let's see what happens when we group these restaurants by **their location**

In [None]:
others_max_100_count = others100.groupby('City').count()
others_max_100_count = others_max_100_count.sort_values(by=['Sales'], ascending=False)
others_max_100_count.reset_index(inplace=True)
others_max_100_count = others_max_100_count.drop(columns=['Rank','Restaurant','State'])
others_max_100_count.head(15)


New York is the city with the largest number of restaurants in the list.

In [None]:
others_max_100_state = others100.groupby('State').max()
others_max_100_state = others_max_100_state.sort_values(by=['Sales'], ascending=False)
others_max_100_state.reset_index(inplace=True)
others_max_100_state = others_max_100_state.drop(columns=['Rank','Restaurant'])
#others_max_100_state.head(10)
others_max_100_state

In [None]:
others_max_100_Sales = others100.groupby('City').max()
others_max_100_Sales = others_max_100_Sales.sort_values(by=['Sales'], ascending=False)
others_max_100_Sales.reset_index(inplace=True)
others_max_100_Sales = others_max_100_Sales.drop(columns=['Rank','Restaurant'])
others_max_100_Sales.head(10)

> New York is the city with the highest sales.

In [None]:
others_max_100_Meals = others100.groupby('City').max()
others_max_100_Meals = others_max_100_Meals.sort_values(by=['Meals Served'], ascending=False)
others_max_100_Meals.reset_index(inplace=True)
others_max_100_Meals = others_max_100_Meals.drop(columns=['Rank','Restaurant'])
others_max_100_Meals.head(10)

Frankenmuth is the city with the highest number of meals served, while New York is in second place.

In [None]:
others_max_100_Check = others100.groupby('City').max()
others_max_100_Check = others_max_100_Check.sort_values(by=['Average Check'], ascending=False)
others_max_100_Check.reset_index(inplace=True)
others_max_100_Check = others_max_100_Check.drop(columns=['Rank','Restaurant'])
others_max_100_Check.head(10)

New York is the city with the highest average check.
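
The per-city counts and maxima computed in the cells above can also be gathered in a single pass with named aggregation. This is a sketch on invented rows; only the column names `City`, `Sales`, and `Meals Served` come from the dataset, and the output names `restaurants`, `max_sales`, and `max_meals` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the column names mirror Independence100.csv
toy = pd.DataFrame({
    "City": ["New York", "New York", "Frankenmuth", "Chicago"],
    "Sales": [39.0, 24.0, 17.0, 22.0],
    "Meals Served": [470000, 280000, 900000, 250000],
})

# One groupby call that returns the count and the maxima side by side
per_city = (
    toy.groupby("City")
       .agg(restaurants=("Sales", "size"),
            max_sales=("Sales", "max"),
            max_meals=("Meals Served", "max"))
       .sort_values("max_sales", ascending=False)
)

print(per_city)
```

Keeping all statistics in one table makes it easier to see that a city can lead on one measure without leading on the others.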

In [None]:
others_100 = others100.drop(columns = ['Restaurant','City','State'])
others_100

In [None]:
df_100 = others_100[['Sales','Average Check','Meals Served']]

In [None]:
sns.set(font_scale=1)
plt.figure(figsize=(10,7))
sns.heatmap(df_100.corr(),annot=True,cmap="rocket_r")

In [None]:
sns.set(font_scale=1)
sns.pairplot(df_100,height=3)

> If we correlate the columns across all cities, the correlations are poor, as shown in the heat map and pair plot. Let's try to separate them by city.

In [None]:
x = 'New York'
z = 'Frankenmuth'
citix = others100[others100['City'] == x]
citix

In [None]:
y = len(others100[others100['City'] == x])
s = len(others100[others100['City'] == z])
print('The number of restaurants in', x, 'is', y)
print('The number of restaurants in', z, 'is', s)

In [None]:
citix = citix.drop(columns = 'Rank')
sns.set(font_scale=1)
sns.pairplot(citix,height=3)

> Overall, the largest number of restaurants in the Independence 100 dataset is located in New York City, which also leads on Sales and Average Check. Frankenmuth leads on Meals Served because its facilities include an inn. If hospitality venues such as the inn in Frankenmuth are not counted, New York is the location where food and beverage businesses thrive the most. Do not forget the high maintenance costs, though :)

# TOP 250 RESTAURANT ANALYSIS

In [None]:
top250.columns

In [None]:
top250.head()

In [None]:
top250.Headquarters.unique()

In [None]:
headquarter = top250.Headquarters.count()
total_restaurant = top250.Restaurant.count()
print('Number of restaurants with a listed headquarters:', headquarter)
print('Total number of restaurants:', total_restaurant)

There are 250 restaurants in total, but only 52 of them have a headquarters listed. This column cannot be analyzed meaningfully, so we drop it.
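
The decision to drop a sparsely filled column can be made explicit by measuring the share of missing values per column. The sketch below uses an invented frame and an assumed 50% cutoff; neither comes from the original analysis.

```python
import pandas as pd

# Hypothetical frame with a mostly empty column
toy = pd.DataFrame({
    "Restaurant": ["A", "B", "C", "D"],
    "Headquarters": ["Texas", None, None, None],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

# Keep only columns that are at least half populated (the cutoff is an assumption)
kept = toy.loc[:, missing_share <= 0.5]

print(missing_share)
print(list(kept.columns))
```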

In [None]:
top_250 = top250.drop(columns=['Rank','Restaurant','Content','Headquarters'])
top_250


In [None]:
top_250['YOY_Sales'] = top_250['YOY_Sales'].apply(lambda x: x.replace('%', ''))
top_250['YOY_Units'] = top_250['YOY_Units'].apply(lambda x: x.replace('%', ''))

#top250["YOY_Sales"] = top250["YOY_Sales"].astype('float')/100
#top250["YOY_Units"] = top250["YOY_Units"].astype('float')/100
top_250["YOY_Sales"] = (pd.to_numeric(top_250["YOY_Sales"]))/100
top_250["YOY_Units"] = (pd.to_numeric(top_250["YOY_Units"]))/100

top_250

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Service-type words to strip so that only the food type remains
rmv = ['Quick', 'Service', '&', 'Fast', 'Casual', 'Dining']

# Build the stopword set once instead of on every iteration
all_stopwords = set(stopwords.words('english'))
all_stopwords.update(rmv)

corpus = []
for segment in top250["Segment_Category"]:
    # keep only the words that are not stopwords or service-type words
    words = [word for word in segment.split() if word not in all_stopwords]
    corpus.append(' '.join(words))

corpus
    

In [None]:
top_250.insert(0,"Category",corpus)
#del top_250["Sub Category"]
top_250

In [None]:
top_250["Category"] = top_250["Category"].apply(lambda x: x.replace('Family Family Style', 'Family'))
top_250["Category"] = top_250["Category"].apply(lambda x: x.replace('Family Style', 'Family'))
top_250
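
An alternative to the word-by-word filter is a vectorized regular expression. This sketch assumes labels of the form `"<service type> & <food type>"`; the sample strings are invented in that style. It removes the whole service-type prefix, so a label such as "Fine Dining & Steak" would become "Steak" rather than "Fine Steak" as in the stopword approach.

```python
import pandas as pd

# Hypothetical labels in the style of the Segment_Category column
segments = pd.Series([
    "Quick Service & Burger",
    "Casual Dining & Varied Menu",
    "Fast Casual & Pizza",
    "Family Dining & Family Style",
])

# Drop the service-type prefix (everything up to and including "& ")
category = segments.str.replace(r"^.*&\s*", "", regex=True)

# Collapse "Family Style" to "Family", as in the cell above
category = category.str.replace("Family Style", "Family", regex=False)

print(category.tolist())
```

Labels without an ampersand are left unchanged by the first replacement.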

In [None]:
df = top_250.groupby('Category')['Sales'].count()
df = df.sort_values(ascending=False)
df.head(5)

In [None]:
df.tail(5)

In [None]:
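# numeric_only=True skips text columns such as Segment_Category, which newer pandas versions refuse to average
means = top_250.groupby('Category').mean(numeric_only=True).reset_index(drop=False)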
counts = top_250.groupby('Category').count().reset_index(drop=False)

In [None]:
fig_dims = (20, 7)
fig, ax = plt.subplots(figsize=fig_dims)

sns.set_theme(style="whitegrid")
plt.xticks(rotation=90)
# 'counts' holds the number of rows per category, so each bar is a number of restaurants
sns.barplot(x = "Category", y = "Sales" , data=counts, palette = "YlOrBr")
ax.set_ylabel("Number of restaurants")


The restaurant category with the most entries is Varied Menu. This may be because many people are unsure which dish they want to eat. In addition, different people have different tastes, so they choose whatever suits them.

The categories with the fewest entries are Healthy, Fine Steak, and Ethnic. Possible reasons are:

1. Not everyone likes plain salad or diet food, which can taste bland.
2. Fine steak is too fancy for most people, and it is not suitable for some, such as vegans.
3. Ethnic food often has either a strong or a plain flavor that may not suit everyone's taste.

In [None]:
import plotly.express as px


#shows each category's share of the average Units in the top 250 ranking

datas = "Units"
fig = px.pie(means, values=datas, names='Category',title= datas +' Averaging')
fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

The restaurant categories with the highest average Units are Coffee Cafe, Sandwich, and Burger.

In [None]:
import plotly.express as px


#shows each category's share of the average Sales in the top 250 ranking

datas = "Sales"
fig = px.pie(means, values=datas, names='Category',title= datas +' Averaging')
fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

The restaurant categories with the highest average Sales are Coffee Cafe, Burger, and Chicken.

In [None]:
import plotly.express as px


#shows each category's share of the average YOY_Units in the top 250 ranking
datas = "YOY_Units"
fig = px.pie(means, values=datas, names='Category',title= datas +' Averaging')
fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

In contrast, when we analyze year-over-year (YOY) Sales and Units, the leading category is Healthy food. Despite its taste, people still turn to healthy-food restaurants to look after their health when they have no time to cook vegetables or slice fruit.


The restaurant categories with the highest average YOY_Units are Healthy, Beverages, and Ethnic.

In [None]:
import plotly.express as px


#shows each category's share of the average YOY_Sales in the top 250 ranking

datas = "YOY_Sales"
fig = px.pie(means, values=datas , names='Category',title= datas +' Averaging')
fig.update_traces(textposition='inside', textinfo='percent+label')

fig.show()

The restaurant categories with the highest average YOY_Sales are Healthy, Beverages, and Ethnic.
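
Rankings like the ones read off the pie charts can also be produced as a sorted table, which is useful for YOY columns because a negative average has no meaningful slice in a pie chart. The sketch below uses invented values; only the column names `Category` and `YOY_Sales` follow the notebook.

```python
import pandas as pd

# Hypothetical values; one category has a negative average growth
toy = pd.DataFrame({
    "Category": ["Healthy", "Healthy", "Burger", "Burger", "Pizza"],
    "YOY_Sales": [0.20, 0.10, 0.03, 0.05, -0.02],
})

# Average YOY_Sales per category, keeping the two largest
top = toy.groupby("Category")["YOY_Sales"].mean().nlargest(2)

print(top)
```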