# Introduction
The data was obtained through web scraping, i.e. downloading data programmatically based on the website's source code. In this case, the "rvest" package for the R programming language was used together with the "SelectorGadget" browser add-on to make it easier to work with the website.

The data was downloaded from www.restaurantbusinessonline.com on January 30, 2021 from three pages describing three rankings: Top 250, Top 100 Independents and Future 50. This produced three tables in which each row describes one restaurant through several variables.

# Purpose
The purpose of this notebook is to explore the plotly library and tell a visual story of the data.

The goal is to be as visual as possible, so we are going to keep dataframes and calculations to a minimum. They are used mostly for cleaning the data and creating features.

## Table of Contents
1. [Data Loading and Data Cleaning](#1.-Data-Loading-and-Data-Cleaning)
2. [Future 50](#2.-Future-50)
3. [Independence 100](#3.-Independence-100)
4. [top 250](#4.-top-250)

# Features
## future 50
"Future ranking" of 50 restaurants from 2020
- **Rank**: Position in ranking
- **Restaurant**: Name of restaurant
- **Location**: Location of origin of the restaurant
- **Sales**: 2019 systemwide sales (in millions)
- **YOY_Sales**: Year on year sales increase in %
- **Units**: Number of premises
- **YOY_Units**: Year on year premises increase in %
- **Unit_Volume**: 2019 average unit volume (in thousands)
- **Franchising**: Is the restaurant a franchise? (Y/N)

## independence 100
- **Rank**: Position in ranking
- **Restaurant**: Name of restaurant
- **Sales**: Annual sales
- **Average Check**: Average client expenses per visit (sales / number of visits)
- **City**: City of origin of the restaurant
- **State**: State of origin of the restaurant
- **Meals Served**: Number of meals served in 202

## top 250
- **Rank**: Position in ranking
- **Restaurant**: Name of restaurant
- **Content Description**: only for certain restaurants
- **Sales**: 2019 sales (in millions)
- **YOY_Sales**: Year on year sales increase in %
- **Units**: Number of premises in US
- **YOY_Units**: Year on year premises increase in %
- **Headquarters**: Place of the restaurant's headquarters
- **Segment_Category**: Menu type and / or industry segment

In [None]:
# data manipulation
import pandas as pd
import numpy as np

# data viz
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# 1. Data Loading and Data Cleaning

In [None]:
f50 = pd.read_csv('../input/restaurant-business-rankings-2020/Future50.csv')
i100 = pd.read_csv('../input/restaurant-business-rankings-2020/Independence100.csv')
t250 = pd.read_csv('../input/restaurant-business-rankings-2020/Top250.csv')

Let's first check whether there are any missing values in any of the three datasets

In [None]:
fig, ax = plt.subplots(1,3, figsize=(23,5))

sns.heatmap(f50.isnull(), ax=ax[0])
sns.heatmap(i100.isnull(), ax=ax[1])
sns.heatmap(t250.isnull(), ax=ax[2])

plt.show()

t250 has a lot of missing values in the 'Content' and 'Headquarters' columns. Let's look at them and decide whether to drop them or not

In [None]:
display(t250.head(20))

- 'Headquarters' would be valuable if most restaurants had a value, since we could do some data wrangling with it, but so few do that we are going to drop it, together with 'Content'.

In [None]:
# drop the two sparsely populated columns in one call
t250.drop(columns=['Headquarters', 'Content'], inplace=True)
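
The decision to drop a column can also be backed by a number instead of reading it off the heatmap. The following is a minimal sketch on a made-up toy frame (the rows and the 50% cut-off are assumptions for illustration, not values from the dataset): it computes the share of missing values per column and drops the columns above the cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for t250; all values are made up.
toy = pd.DataFrame({
    "Restaurant": ["A", "B", "C", "D"],
    "Sales": [100, 80, 60, 40],
    "Content": [np.nan, np.nan, np.nan, "note"],
    "Headquarters": [np.nan, "City X", np.nan, np.nan],
})

# Share of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

# Assumed cut-off: drop columns with more than half of their values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
cleaned = toy.drop(columns=to_drop)

print(missing_share)
print("dropped:", to_drop)
```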

Let's now check our three datasets and see what we are going to face

In [None]:
display(f50.head(3))
display(i100.head(3))
display(t250.head(3))

Let's now clean each dataframe: adjusting data types, removing symbols and creating new columns

## 1.1 future 50
- For this dataframe, we are going to remove the '%' symbol from the 'YOY_Sales' and 'YOY_Units' values
- Then we are going to create 'City' and 'State' columns by splitting the 'Location' column on the comma

In [None]:
# city and state
f50['City'] = f50['Location'].str.split(',', expand=True)[0].str.strip()
f50['State'] = f50['Location'].str.split(',', expand=True)[1].str.strip()
f50.drop(['Location'], axis=1, inplace=True)

# % symbol
f50['YOY_Sales'] = f50['YOY_Sales'].str.replace('%', '').astype(float)
f50['YOY_Units'] = f50['YOY_Units'].str.replace('%', '').astype(float)
f50.rename(columns={'YOY_Sales':'YOY_Sales (%)', 'YOY_Units':'YOY_Units (%)'}, inplace=True)

f50.tail(5)
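
The same cleaning steps can be tried in isolation on a couple of rows. This is only a sketch on hypothetical data (the restaurant names, locations and percentages are invented): it splits 'Location' once instead of twice and removes '%' as a literal character rather than as a regular expression.

```python
import pandas as pd

# Hypothetical rows; names and values are made up for illustration.
toy = pd.DataFrame({
    "Restaurant": ["Cafe A", "Grill B"],
    "Location": ["Springfield, Ore.", "Portland , Maine"],
    "YOY_Sales": ["12.5%", "7%"],
})

# Split on the first comma only, once, then strip stray whitespace.
parts = toy["Location"].str.split(",", n=1, expand=True)
toy["City"] = parts[0].str.strip()
toy["State"] = parts[1].str.strip()

# Remove the literal '%' sign before casting to float.
toy["YOY_Sales (%)"] = (
    toy["YOY_Sales"].str.replace("%", "", regex=False).astype(float)
)
toy = toy.drop(columns=["Location", "YOY_Sales"])

print(toy)
```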

## 1.2 independence 100
- The independence 100 list doesn't offer much to expand the dataframe with, so let's give it a quick look and continue

In [None]:
i100.head()

## 1.3 top 250
- This list has some special characters ('%') that have to be removed before the numbers can be analyzed.

In [None]:
t250['YOY_Sales'] = t250['YOY_Sales'].str.replace('%','').astype(float)
t250['YOY_Units'] = t250['YOY_Units'].str.replace('%','').astype(float)

In [None]:
t250.head()

We are ready to inspect each list one by one

# 2. Future 50

Let's start by describing the data

In [None]:
# Numbers

fig = make_subplots(rows=6, cols=2)

fig.update_layout({'title': {'text':
    'Plots of restaurant sales',
    'x': 0.5, 'y': 0.96}})

fig.add_trace(go.Histogram(x=f50['Sales'], nbinsx=10, name='Sales', marker_color='rgb(0, 71, 119)', opacity=0.5), row=1, col=1)
fig.add_trace(go.Box(x=f50['Sales'], name='Sales', marker_color='rgb(0, 71, 119)'), row=2, col=1)

fig.add_trace(go.Histogram(x=f50['YOY_Sales (%)'], nbinsx=10, name='YOY_Sales (%)', marker_color='rgb(163, 0, 0)', opacity=0.5), row=1, col=2)
fig.add_trace(go.Box(x=f50['YOY_Sales (%)'], name='YOY_Sales (%)', marker_color='rgb(163, 0, 0)'), row=2, col=2)

fig.add_trace(go.Histogram(x=f50['Units'], nbinsx=10, name='Units', marker_color='rgb(255, 119, 0)', opacity=0.5), row=3, col=1)
fig.add_trace(go.Box(x=f50['Units'], name='Units', marker_color='rgb(255, 119, 0)'), row=4, col=1)

fig.add_trace(go.Histogram(x=f50['YOY_Units (%)'], nbinsx=10, name='YOY_Units (%)', marker_color='rgb(239, 210, 141)', opacity=0.5), row=3, col=2)
fig.add_trace(go.Box(x=f50['YOY_Units (%)'], name='YOY_Units (%)', marker_color='rgb(239, 210, 141)'), row=4, col=2)

fig.add_trace(go.Histogram(x=f50['Unit_Volume'], nbinsx=10, name='Unit_Volume', marker_color='rgb(0, 175, 181)', opacity=0.5), row=5, col=1)
fig.add_trace(go.Box(x=f50['Unit_Volume'], name='Unit_Volume', marker_color='rgb(0, 175, 181)'), row=6, col=1)

fig.update_layout(
    autosize=False,
    width=1200,
    height=1000,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100))

fig.show()


- Most of the features look roughly bell-shaped, with few outliers.
- It is important to notice that, in this case, outliers are our friends since they carry important information: the 'YOY_Sales (%)' and 'YOY_Units (%)' outliers are the restaurants with the highest rank.
- 'YOY_Sales (%)' and 'YOY_Units (%)' are key features in the list, since the rank is based on these values: the higher the value, the better the rank.

Let's see how the features correlate so we can build a nice scatter plot

In [None]:
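# correlate numeric columns only (text columns make corr() fail in recent pandas)
cr = f50.select_dtypes('number').corr()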

fig = go.Figure(go.Heatmap(
        x=cr.columns,
        y=cr.columns,
        z=cr.values.tolist(),
        colorscale='RdBu', zmin=-1, zmax=1))
fig.show()

Luckily, 'YOY_Sales (%)' and 'YOY_Units (%)' correlate well, so let's dive into the plot

In [None]:
fig = px.scatter(data_frame=f50, x='YOY_Units (%)', y='YOY_Sales (%)',
                 hover_data=['Restaurant','City', 'State'],
                color='Franchising',
                size='Units',
                color_discrete_sequence=['rgb(163, 0, 0)','rgb(0, 71, 119)'])

rank1 = {
    'x': 116.7, 'y': 130.5,
    'showarrow': True,'arrowhead': 3,
    'text': "Rank 1",
    'font' : {'size': 15, 'color': 'black'}}

rank50 = {
    'x': 7.7, 'y': 14.4,
    'showarrow': True,'arrowhead': 3,
    'text': "Rank 50",
    'font' : {'size': 15, 'color': 'black'}}

fig.update_layout({'annotations': [rank1, rank50]})

fig.show()

- We see that higher 'YOY_Sales (%)' and 'YOY_Units (%)' values go with a better rank: in the plot, rank 1 and rank 50 sit in opposite corners
- The size of the bubbles represents the number of premises per restaurant. It is interesting to see that rank 1 has few premises in comparison with the other restaurants. This follows from the nature of the list: rank 1 is there because it is growing fast, not because it is already big
- Rank 1 is not franchising and is still growing really fast. If we hover over it, we can see that the restaurant is Evergreens. It would be interesting to analyze this restaurant's growth strategy and value proposition, taking into account that most of the restaurants in this list are franchising, which suggests a relationship between growth and this business model
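
The annotation coordinates in the scatter plot are typed in by hand. As a hedged alternative, they could be looked up from the dataframe by rank. The helper `rank_annotation` below is a hypothetical name introduced for this sketch, and the toy values are made up.

```python
import pandas as pd

def rank_annotation(df, rank, x_col, y_col):
    """Build a Plotly-style annotation dict for the row holding `rank`."""
    row = df.loc[df["Rank"] == rank].iloc[0]
    return {
        "x": float(row[x_col]), "y": float(row[y_col]),
        "showarrow": True, "arrowhead": 3,
        "text": f"Rank {rank}",
        "font": {"size": 15, "color": "black"},
    }

# Hypothetical toy frame standing in for f50; values are invented.
toy = pd.DataFrame({
    "Rank": [1, 2, 3],
    "YOY_Units (%)": [90.0, 40.0, 10.0],
    "YOY_Sales (%)": [120.0, 55.0, 15.0],
})

# Annotate the best and the worst rank present in the data.
annotations = [
    rank_annotation(toy, r, "YOY_Units (%)", "YOY_Sales (%)")
    for r in (toy["Rank"].min(), toy["Rank"].max())
]
print(annotations)
```

The resulting list can be passed to `fig.update_layout(annotations=annotations)`, so the arrows follow the data if the file changes.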

So where are these restaurants?

In [None]:
fig = px.treemap(data_frame=f50, path=['State'], values='Units', color_continuous_scale='RdBu', color='Rank')
fig.show()

- California, New York and North Carolina account for the most premises (the treemap is sized by 'Units'). Not a surprise since these states are innovation hubs.
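
Because the treemap is sized by 'Units', a large tile means many premises, not necessarily many listed restaurants. A small sketch on invented data shows how both numbers could be tabulated per state to check what the tiles say.

```python
import pandas as pd

# Hypothetical toy frame; states and unit counts are made up.
toy = pd.DataFrame({
    "State": ["Calif.", "N.Y.", "Calif.", "N.C.", "N.Y.", "Calif."],
    "Units": [10, 20, 15, 5, 8, 12],
})

# Number of listed restaurants and total premises per state.
summary = (
    toy.groupby("State")
       .agg(Restaurants=("Units", "size"), Total_Units=("Units", "sum"))
       .sort_values("Total_Units", ascending=False)
)
print(summary)
```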

# 3. Independence 100

Let's begin by describing the distributions of the numeric features

In [None]:
# Numbers

# only four rows of subplots are used below
fig = make_subplots(rows=4, cols=2)

fig.update_layout({'title': {'text':
    'Numbers distribution',
    'x': 0.5, 'y': 0.96}})

fig.add_trace(go.Histogram(x=i100['Sales'], nbinsx=10, name='Sales', marker_color='rgb(0, 71, 119)', opacity=0.5), row=1, col=1)
fig.add_trace(go.Box(x=i100['Sales'], name='Sales', marker_color='rgb(0, 71, 119)'), row=2, col=1)

fig.add_trace(go.Histogram(x=i100['Average Check'], nbinsx=10, name='Average Check', marker_color='rgb(163, 0, 0)', opacity=0.5), row=1, col=2)
fig.add_trace(go.Box(x=i100['Average Check'], name='Average Check', marker_color='rgb(163, 0, 0)'), row=2, col=2)

fig.add_trace(go.Histogram(x=i100['Meals Served'], nbinsx=10, name='Meals Served', marker_color='rgb(255, 119, 0)', opacity=0.5), row=3, col=1)
fig.add_trace(go.Box(x=i100['Meals Served'], name='Meals Served', marker_color='rgb(255, 119, 0)'), row=4, col=1)

fig.update_layout(
    autosize=False,
    width=1200,
    height=800)

fig.show()

- All of the distributions look roughly bell-shaped, with few outliers
- 'Meals Served' is the one that has the most outliers with 7
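
Outlier counts like the one above can be reproduced approximately with the 1.5 × IQR rule that box plots conventionally use for their whiskers. This is a sketch on made-up numbers. Plotly's quartile computation can differ slightly from pandas' default interpolation, so the count may not match the plot exactly.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical sample; the numbers are invented.
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 50], name="Meals Served")

outliers = iqr_outliers(toy)
print(len(outliers), outliers.tolist())
```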

In [None]:
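# correlate numeric columns only (text columns make corr() fail in recent pandas)
cr = i100.select_dtypes('number').corr()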

fig = go.Figure(go.Heatmap(
        x=cr.columns,
        y=cr.columns,
        z=cr.values.tolist(),
        colorscale='RdBu', zmin=-1, zmax=1))
fig.show()

- We are going to make use of the almost perfectly correlated 'Sales'-'Rank' relationship to build our scatterplot
- It's pretty obvious that the most relevant decision factor in this list is sales: the higher the sales, the better the rank

In [None]:
fig = px.scatter(data_frame=i100, x='Rank', y='Sales', size='Average Check',opacity=0.5, hover_data=['Restaurant','City', 'State'],
                color_discrete_sequence=['rgb(163, 0, 0)','rgb(0, 71, 119)'])

fig.show()

- We can see that, indeed, sales play a decisive role in the list
- In rank 1, we find Carmine's, located in New York
- Rank 100 belongs to Virgil's Real Barbecue, located in Las Vegas
- The size of the bubbles represents the average check. We don't see any decreasing or increasing trend, which means that being number one doesn't imply being expensive, or the other way around.
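
The "no trend" reading of the bubble sizes can be checked with a rank correlation. The sketch below uses invented numbers and computes Spearman's coefficient as the Pearson correlation of the ranked columns, which avoids any extra dependency. A value near 0 would support the absence of a monotonic trend.

```python
import pandas as pd

# Hypothetical toy frame standing in for i100; values are made up.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5, 6],
    "Average Check": [40, 85, 30, 95, 60, 45],
})

# Spearman's rho is the Pearson correlation of the two ranked columns.
rho = toy["Rank"].rank().corr(toy["Average Check"].rank())
print(round(rho, 3))
```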

In [None]:
fig = px.treemap(data_frame=i100, path=['State'], values='Sales', color_continuous_scale='RdBu', color='Rank')
fig.show()

- We see a clear dominance of New York and Illinois in total sales per state and rank

# 4. top 250

In [None]:
fig = make_subplots(rows=4, cols=2)

fig.update_layout({'title': {'text':
    'Plots of restaurant sales',
    'x': 0.5, 'y': 0.96}})

fig.add_trace(go.Histogram(x=t250['Sales'], nbinsx=10, name='Sales', marker_color='rgb(0, 71, 119)', opacity=0.5), row=1, col=1)
fig.add_trace(go.Box(x=t250['Sales'], name='Sales', marker_color='rgb(0, 71, 119)'), row=2, col=1)

fig.add_trace(go.Histogram(x=t250['YOY_Sales'], nbinsx=10, name='YOY_Sales', marker_color='rgb(163, 0, 0)', opacity=0.5), row=1, col=2)
fig.add_trace(go.Box(x=t250['YOY_Sales'], name='YOY_Sales', marker_color='rgb(163, 0, 0)'), row=2, col=2)

fig.add_trace(go.Histogram(x=t250['Units'], nbinsx=10, name='Units', marker_color='rgb(255, 119, 0)', opacity=0.5), row=3, col=1)
fig.add_trace(go.Box(x=t250['Units'], name='Units', marker_color='rgb(255, 119, 0)'), row=4, col=1)

fig.add_trace(go.Histogram(x=t250['YOY_Units'], nbinsx=10, name='YOY_Units', marker_color='rgb(239, 210, 141)', opacity=0.5), row=3, col=2)
fig.add_trace(go.Box(x=t250['YOY_Units'], name='YOY_Units', marker_color='rgb(239, 210, 141)'), row=4, col=2)

fig.update_layout(
    autosize=False,
    width=1200,
    height=600)

fig.show()

- We can see a lot of outliers in every distribution of this dataframe. This means that the restaurants in this list are separated by large gaps: a few outstanding restaurants sit far away from the rest

In [None]:
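# correlate numeric columns only (text columns make corr() fail in recent pandas)
cr = t250.select_dtypes('number').corr()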

fig = go.Figure(go.Heatmap(
        x=cr.columns,
        y=cr.columns,
        z=cr.values.tolist(),
        colorscale='RdBu', zmin=-1, zmax=1))
fig.show()

- YOY_Sales and YOY_Units are barely correlated with Rank, which means that what matters in this list is the current standing of the restaurants, not their growth
- Let's build the scatter plot with Units against Sales: the strongest relationship
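
The strongest relationship can also be read from the correlation matrix programmatically rather than from the heatmap colors. A minimal sketch, using an invented toy frame: keep the upper triangle of the matrix, flatten it into pairs, and order the pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for t250; values are made up.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Sales": [500, 300, 200, 150, 100],
    "Units": [900, 650, 380, 310, 190],
    "YOY_Sales": [2.0, -1.0, 5.0, 0.5, 3.0],
})

cr = toy.select_dtypes("number").corr()

# Keep each pair once: upper triangle, diagonal excluded.
mask = np.triu(np.ones(cr.shape, dtype=bool), k=1)
pairs = cr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```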

In [None]:
fig = px.scatter(data_frame=t250, x='Units', y='Sales', hover_data=['Rank','Restaurant', 'Segment_Category'],
                color_discrete_sequence=['rgb(163, 0, 0)','rgb(0, 71, 119)'])

rank1 = {
    'x': 13846, 'y': 40412,
    'showarrow': True,'arrowhead': 3,
    'text': "Rank 1",
    'font' : {'size': 15, 'color': 'black'}}

rank250 = {
    'x': 40, 'y': 126,
    'showarrow': True,'arrowhead': 3,
    'text': "Rank 250",
    'font' : {'size': 15, 'color': 'black'}}

fig.update_layout({'annotations': [rank1, rank250]})


fig.show()

- From the scatter plot we can see that 'Sales', again, is the absolute factor determining the rank of the restaurants
- In rank 1 we can find McDonald's
- Rank 250 belongs to Jollibee

In [None]:
fig = px.treemap(data_frame=t250, path=['Segment_Category'], values='Sales', color_continuous_scale='RdBu', color='Units')
fig.show()

- Finally, we can see that the segment that sells the most is 'Quick Service & Burger', where McDonald's belongs, without having the most units
- 'Quick Service & Coffee Cafe' holds the second position in sales with nearly the highest number of units; only 'Quick Service & Sandwich' has more
- 'Quick Service & Chicken' takes third place without having a lot of units in comparison with other segments
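
The comparison of sales against units per segment can be made explicit with a small aggregation. The sketch below uses invented segment names and numbers. It sums 'Sales' and 'Units' per segment and derives sales per unit, a column that does not exist in the dataset and is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for t250; values are made up.
toy = pd.DataFrame({
    "Segment_Category": ["Burger", "Burger", "Coffee", "Sandwich", "Chicken"],
    "Sales": [400, 100, 300, 200, 250],
    "Units": [100, 50, 200, 400, 50],
})

# Total sales and units per segment, plus sales generated per unit.
by_segment = toy.groupby("Segment_Category")[["Sales", "Units"]].sum()
by_segment["Sales_per_Unit"] = by_segment["Sales"] / by_segment["Units"]
by_segment = by_segment.sort_values("Sales", ascending=False)
print(by_segment)
```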