# Project Description

We have a new project at hand: **we are opening a small robot-run cafe in Los Angeles**. The project is promising but expensive, so we will need to attract investors. They are interested in the current market conditions and in whether we will be able to maintain our success once the novelty of robot waiters wears off.

So the goal of this project is to demonstrate its viability by analyzing data on the restaurant business in Los Angeles.

# Table of contents  

1. [Data Examination and Preprocessing](#1)  
2. [Data Analysis](#2)  
    2.1. [Establishment Types](#21)  
    2.2. [Chain Establishments](#22)  
    2.3. [Characteristics of Chain Establishments](#23)  
    2.4. [Average Seats](#24)  
    2.5. [Location Study](#25)  
   

## 1. Data Examination and Preprocessing <a name="1"></a>

In [24]:
!pip install plotly==4.5.0
!pip install -U seaborn

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import requests

from io import BytesIO
from scipy.stats import ttest_ind
from IPython.display import display, display_html
from pandas.plotting import register_matplotlib_converters
from plotly import graph_objects as go 


import sys
import warnings

if not sys.warnoptions:
       warnings.simplefilter("ignore")



In [25]:
# Opening the data files, taking into consideration the separator
data_url = 'https://drive.google.com/file/d/1a5PJJvlSJtx6yIdvUTxZ31V86xfgrEBu/view?usp=sharing'
data_url = 'https://drive.google.com/uc?id=' + data_url.split('/')[-2]
rest_data = pd.read_csv(data_url)


# Print information about the dataset to examine it
print(' ')
print('------------------------------------------ Informations About the Restaurants Dataset ------------------------------------------')
print(' ')
display (rest_data.info())
print(' ')

# Print a few lines of the datasets to examine data
print('---------------------------------------------------- Sample of the dataset ---------------------------------------------------')
display(rest_data.head(10))

 
------------------------------------------ Informations About the Restaurants Dataset ------------------------------------------
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           9651 non-null   int64 
 1   object_name  9651 non-null   object
 2   address      9651 non-null   object
 3   chain        9648 non-null   object
 4   object_type  9651 non-null   object
 5   number       9651 non-null   int64 
dtypes: int64(2), object(4)
memory usage: 452.5+ KB


None

 
---------------------------------------------------- Sample of the dataset ---------------------------------------------------


| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 |
| 5 | 11791 | THE SPOT GRILL | 10004 NATIONAL BLVD | False | Restaurant | 14 |
| 6 | 11792 | CPK | 100 WORLD WAY # 126 | False | Restaurant | 100 |
| 7 | 11793 | PHO LALA | 3500 W 6TH ST STE 226 | False | Restaurant | 7 |
| 8 | 11794 | ABC DONUTS | 3027 N SAN FERNANDO RD UNIT 103 | True | Fast Food | 1 |
| 9 | 11795 | UPSTAIRS | 3707 N CAHUENGA BLVD | False | Restaurant | 35 |


<b>Observations :</b>  

From a first look at the dataset information and its first rows, we can state the following :  

- The dataset, which we named **'rest_data'**, contains 9651 rows of data with information on restaurants in Los Angeles. The data is separated into the following 6 columns :
    - **id** : a unique identifier of each entry;
    - **object_name** :  the establishment name;
    - **address** : address of the establishment;
    - **chain** : is the establishment part of a chain? (True / False);
    - **object_type** : establishment type (Restaurant, Cafe, etc.);
    - **number** : number of seats in the establishment.  
    
    
- We notice 3 rows missing data in the 'chain' column.

In [26]:
# Rows with missing data
rest_data[rest_data['chain'].isna()]

| | id | object_name | address | chain | object_type | number |
|---|---|---|---|---|---|---|
| 7408 | 19194 | TAQUERIA LOS 3 CARNALES | 5000 E WHITTIER BLVD | NaN | Restaurant | 14 |
| 7523 | 19309 | JAMMIN JIMMY'S PIZZA | 1641 FIRESTONE BLVD | NaN | Pizza | 1 |
| 8648 | 20434 | THE LEXINGTON THEATER | 129 E 3RD ST | NaN | Restaurant | 35 |


**Let's make sure that there are no mistakes in our dataset.**  

We start by filling in the missing data. After some research on the internet, we found that none of the 3 restaurants is part of a chain, so we fill in the missing values with 'False'. Then we check for duplicates.
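
As a minimal sketch of this fill-and-check step, the snippet below works on a toy frame whose rows are made up for illustration. The looser duplicate check on normalized name and address is an assumption that goes beyond what the notebook does; it is shown only because exact-row comparison can miss entries that differ in case or stray spaces.

```python
import pandas as pd

# Toy data standing in for rest_data (values are illustrative only)
toy = pd.DataFrame({
    "object_name": ["CAFE A", "CAFE A ", "PIZZA B"],
    "address": ["1 MAIN ST", "1 main st", "2 OAK AVE"],
    "chain": [True, None, None],
})

# Fill missing chain flags with False and make the column a real boolean
toy["chain"] = toy["chain"].fillna(False).astype(bool)

# Exact-row duplicates miss rows that differ only in case or stray spaces,
# so also compare a normalized copy of name + address
normalized = toy[["object_name", "address"]].apply(
    lambda col: col.str.strip().str.upper()
)
exact_dupes = toy.duplicated().sum()
loose_dupes = normalized.duplicated().sum()
print(exact_dupes, loose_dupes)
```

If the looser check reports more duplicates than the exact one, those rows are worth inspecting by hand before dropping anything.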

In [27]:
# Filling in the missing data
rest_data['chain'] = rest_data['chain'].fillna(False)

# We process duplicated rows
print('---------------------------------------------')
print('')
print('number of duplicated rows in "rest_data" :',rest_data.duplicated().sum())
print('')
print('---------------------------------------------')

# Making sure that there are no duplicates in the categorical columns
print('')
print(rest_data['object_type'].value_counts(sort = True))
print(rest_data['chain'].value_counts(sort = True))
print('')
print('---------------------------------------------')
print(rest_data.info())

---------------------------------------------

number of duplicated rows in "rest_data" : 0

---------------------------------------------

Restaurant    7255
Fast Food     1066
Cafe           435
Pizza          320
Bar            292
Bakery         283
Name: object_type, dtype: int64
False    5975
True     3676
Name: chain, dtype: int64

---------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9651 entries, 0 to 9650
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           9651 non-null   int64 
 1   object_name  9651 non-null   object
 2   address      9651 non-null   object
 3   chain        9651 non-null   bool  
 4   object_type  9651 non-null   object
 5   number       9651 non-null   int64 
dtypes: bool(1), int64(2), object(3)
memory usage: 386.5+ KB
None


<b>Observations :</b>  

The data has been preprocessed: it no longer has any missing values and contains no duplicates. It is ready for analysis. 

## 2. Data Analysis :<a name="2"></a>


In this part, we will analyze various aspects of the data, from proportions to correlations.

###  2.1. Establishment Types : <a name="21"></a>

In [28]:
# Creating a dataset
estb_type = rest_data.groupby('object_type', as_index = False).agg({'object_name' : 'count'}).sort_values(
    by = 'object_name', ascending = False)
estb_type.columns = ['object_type' , 'count']

# Display
display(estb_type)

# Plot a pie chart

| | object_type | count |
|---|---|---|
| 5 | Restaurant | 7255 |
| 3 | Fast Food | 1066 |
| 2 | Cafe | 435 |
| 4 | Pizza | 320 |
| 1 | Bar | 292 |
| 0 | Bakery | 283 |


In [29]:
# Preparing the data
data = {
    "values": estb_type['count'],
    "labels": estb_type['object_type'],
    "hoverinfo":"label+value",
    "marker": {'colors': ['rgb(178,24,43)', 'rgb(214,96,77)', 'rgb(244,165,130)', 'rgb(253,219,199)', 'rgb(209,229,240)', 'rgb(146,197,222)', 'rgb(67,147,195)', 'rgb(33,102,172)', 'rgb(5,48,97)']},
    "textinfo":"percent",
    "type": "pie",
    "pull" :[0.1, 0, 0, 0, 0, 0]
}

# Preparing the layout
layout = go.Layout(
   title={
        'text': "Establishments by Types in LA",
        'y':0.9,
        'x':0.47,
        'xanchor': 'center',
        'yanchor': 'top'},
    template = "seaborn")

# Plotting
fig = go.Figure(data = data, layout = layout)

fig.show()

<b>Observations :</b>    

From the pie chart above, we notice that **'Restaurant'** is the dominant type of establishment, with **75%** of the market and 7255 restaurants in total in the LA area. The next types are far less prominent: **'Fast Food'** at **11%** with 1066 establishments and **'Cafe'** at **4.51%** with 435 establishments.  
    
The least popular type is **'Bakery'** at only **2.93%** with 283 establishments in total.
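
The percentages quoted above can be recomputed directly from the counts in the table. The sketch below hard-codes those counts so that it runs on its own; with the real data, the same result should come from `rest_data['object_type'].value_counts(normalize = True)`.

```python
import pandas as pd

# Counts copied from the estb_type table above
counts = pd.Series(
    {"Restaurant": 7255, "Fast Food": 1066, "Cafe": 435,
     "Pizza": 320, "Bar": 292, "Bakery": 283},
    name="count",
)

# Share of each type in percent, rounded to two decimals
share = (counts / counts.sum() * 100).round(2)
print(share)
```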

###  2.2. Chain Establishments : <a name="22"></a>

In this part we will analyze the establishments by type and try to answer the following questions :  
* How many of the establishments in LA are part of a chain?
* What types of establishments are typically chains?

In [30]:
# Creating a dataset of establishments by chain 
df1 = rest_data.groupby('chain', as_index = False).agg({'object_name' : 'count'})
df1.columns = ['chain' , 'count']

# Preparing the data
data = {
    "values": df1['count'],
    "labels": df1['chain'],
    "hoverinfo":"label+value",
    "marker": {'colors': ['rgb(178,24,43)','rgb(67,147,195)']},
    "textinfo":"percent+label",
    "type": "pie",
    "pull" :[0.1, 0]
}

# Preparing the layout
layout = go.Layout(
   title={
        'text': "Establishments by Chain in LA",
        'y':0.9,
        'x':0.47,
        'xanchor': 'center',
        'yanchor': 'top'},
    template = "seaborn")

# Plotting
fig = go.Figure(data = data, layout = layout)

fig.show()

<b>Observations :</b>    

From the pie chart, we notice that the majority of the establishments are not part of a chain, specifically **61.9%** (5975 establishments). The remaining **38.1%** (3676 establishments) are part of a chain.

In [31]:
# Creating a dataset
chains = rest_data.groupby(['object_type', 'chain'], as_index = False).agg({'object_name' : 'count'})
chains.columns = ['object_type' , 'chain','count']

# Sorting data by type and count
chains = chains.sort_values(by =['object_type','count'], ascending = False)

# Changing data types to not have problems in plotting
chains['chain'] = chains['chain'].astype('str')
chains['object_type'] = chains['object_type'].astype('str')
chains['count'] = chains['count'].astype('int')

# Display
chains

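| | object_type | chain | count |
|---|---|---|---|
| 9 | Restaurant | False | 4963 |
| 10 | Restaurant | True | 2292 |
| 7 | Pizza | False | 167 |
| 8 | Pizza | True | 153 |
| 6 | Fast Food | True | 605 |
| 5 | Fast Food | False | 461 |
| 4 | Cafe | True | 266 |
| 3 | Cafe | False | 169 |
| 1 | Bar | False | 215 |
| 2 | Bar | True | 77 |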


In [32]:
# Plotting an interactive sunburst chart
fig =px.sunburst(
    data_frame = chains,
    path = ['object_type', 'chain'],
    values = 'count',
    color = 'count',
    color_continuous_scale='RdBu',
    title={
        'text': "Establishments by Chain and Type",
        'y':0.9,
        'x':0.47,
        'xanchor': 'center',
        'yanchor': 'top'},
    height = 700,
)
fig.update_traces(textinfo="label+percent parent")
fig.show()

<b>Observations and conclusions :</b>    

The sunburst chart shows the proportions of establishments by type and, for each type, how many establishments are part of a chain. By observing the figure, we notice the following :

- Bakeries are 100% part of a chain.
- Next in line are cafes, of which 61% are part of a chain.
- Fast food establishments follow, with 57% part of a chain.
- For 'Pizza' establishments it's almost a tie : 52% are independent and 48% are part of a chain.
- Bars are mostly not part of a chain (74%).
- Only 32% of restaurants are part of a chain. 
    
So we conclude that **bakeries are typically part of a chain, as the study showed that 100% of the bakeries in the region belong to one. On the other hand, the establishment types that are most likely not to be part of a chain are bars (74%) and restaurants (68%).**
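
The chain share of each type can be recomputed from the counts in the `chains` table. This is only a sketch: the counts are copied by hand from the displayed rows (which do not include bakeries), and the `kind` column is a helper introduced here for readability.

```python
import pandas as pd

# Counts copied from the chains table above (Bakery is not shown there)
chains = pd.DataFrame({
    "object_type": ["Restaurant", "Restaurant", "Pizza", "Pizza", "Fast Food",
                    "Fast Food", "Cafe", "Cafe", "Bar", "Bar"],
    "chain": [False, True, False, True, True, False, True, False, False, True],
    "count": [4963, 2292, 167, 153, 605, 461, 266, 169, 215, 77],
})

# Helper column with readable labels instead of booleans
chains["kind"] = chains["chain"].map({True: "chain", False: "independent"})

# One row per type, one column per kind
wide = chains.pivot(index="object_type", columns="kind", values="count")

# Percentage of each type that belongs to a chain
chain_share = (wide["chain"] / (wide["chain"] + wide["independent"]) * 100).round(1)
print(chain_share.sort_values(ascending=False))
```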

###  2.3. Characteristics of Chain Establishments : <a name="23"></a>

We will now focus on chain establishments and look closer at their characteristics.

In [33]:
# Creating a slice containing only chain establishments
chain_estb = rest_data.query('chain == True')

chain_estb = chain_estb.groupby('object_name').agg({'chain' : 'count', 'number' : 'sum'})


chain_estb = chain_estb.sort_values(by = 'chain' , ascending = False).reset_index()

chain_estb.columns = ['object_name', 'chain_count', 'total_seats']

# Adding a ratio column
chain_estb['average_seating'] = (chain_estb['total_seats'] / chain_estb['chain_count']).round(decimals = 2)

# Display
display(chain_estb)

# Printing descriptions
print('----------------------- Informations About Chain Count -----------------------')
print('')
print(chain_estb['chain_count'].describe())
print('')
print('----------------------- Informations About Total Seats -----------------------')
print('')
print(chain_estb['total_seats'].describe())
print('')
print('--------------------- Informations About Average Seating ---------------------')
print('')
print(chain_estb['average_seating'].describe())


| | object_name | chain_count | total_seats | average_seating |
|---|---|---|---|---|
| 0 | THE COFFEE BEAN & TEA LEAF | 47 | 1256 | 26.72 |
| 1 | SUBWAY | 31 | 509 | 16.42 |
| 2 | DOMINO'S PIZZA | 15 | 185 | 12.33 |
| 3 | WABA GRILL | 14 | 600 | 42.86 |
| 4 | KENTUCKY FRIED CHICKEN | 14 | 467 | 33.36 |
| ... | ... | ... | ... | ... |
| 2728 | JAMBA JUICE #425 | 1 | 29 | 29.00 |
| 2729 | JAMBA JUICE #644 | 1 | 12 | 12.00 |
| 2730 | JAMBA JUICE #661 | 1 | 23 | 23.00 |
| 2731 | JAMBA JUICE #919 | 1 | 12 | 12.00 |


----------------------- Informations About Chain Count -----------------------

count    2733.000000
mean        1.345042
std         1.489055
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max        47.000000
Name: chain_count, dtype: float64

----------------------- Informations About Total Seats -----------------------

count    2733.000000
mean       53.390413
std        73.328710
min         1.000000
25%        16.000000
50%        31.000000
75%        58.000000
max      1259.000000
Name: total_seats, dtype: float64

--------------------- Informations About Average Seating ---------------------

count    2733.000000
mean       41.437578
std        44.313241
min         1.000000
25%        14.000000
50%        25.500000
75%        45.000000
max       229.000000
Name: average_seating, dtype: float64


In [34]:
# Plotting the scatter plot
fig = px.scatter(chain_estb, 
                 y = "chain_count",
                 x = "total_seats",
                 title = {'text': "Establishments Number of Chains - Number of Seats",
                          'y':0.9,
                          'x':0.47,
                          'xanchor': 'center',
                          'yanchor': 'top'},
                 color = 'average_seating', 
                 color_continuous_scale = px.colors.sequential.RdBu,
                 template = 'seaborn'
                )

fig.show()

<b>Observations and conclusions :</b>    

The scatter plot above shows the relation between the number of locations of each chain and its total number of seats. Hovering over a dot also shows the chain's average number of seats per location.
    
From the scatter plot, we notice that most chains have 3 locations or fewer, with a concentration of dots at 50-70 total seats. The descriptive statistics confirm the first point: at least 75% of the chains have only one location in the LA area. The seat counts, however, are lower than the plot alone suggests: the median chain has 31 seats in total (mean 53, 75th percentile 58), and the median average seating is about 25 seats per location (mean 41).
    
- **So we conclude that chains are mostly characterized by a small number of locations (typically one) and a moderate number of seats (a median of 31 in total, with a quarter of the chains above 58).**
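
One caveat is visible in the table above: names such as JAMBA JUICE #425 carry a store number, so grouping by `object_name` counts each branch as its own chain. The sketch below shows one possible way to merge such branches before counting. The `base_name` helper and its pattern are assumptions for illustration, not part of the original analysis.

```python
import re
from collections import Counter

def base_name(name: str) -> str:
    """Hypothetical helper: drop a trailing store number such as '#425'."""
    return re.sub(r"\s*#\s*\d+\s*$", "", name).strip()

# Names taken from the chain_estb table above
names = ["JAMBA JUICE #425", "JAMBA JUICE #644", "JAMBA JUICE #661",
         "JAMBA JUICE #919", "SUBWAY", "SUBWAY"]

raw_counts = Counter(names)                              # each branch counted separately
merged_counts = Counter(base_name(n) for n in names)     # branches merged by base name
print(merged_counts)
```

Applied to the full dataset, a normalization like this would likely lower the share of single-location chains, so the "typically one location" finding should be read with that in mind.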

###  2.4. Average Seats : <a name="24"></a>

In this part we will study the average number of seats for each type of establishment. 

In [35]:
print('-------------------------------------------------------------------------------------')
print('')
print("The general average number of seats is : {:.0f} seats.".format(rest_data['number'].mean()))
print("The general median number of seats is : {:.0f} seats.".format(rest_data['number'].median()))
print('')
# Printing the average and median number of seats for each establishment type
type_labels = {'Cafe' : 'cafes', 'Restaurant' : 'restaurants', 'Fast Food' : 'fast food',
               'Bakery' : 'bakeries', 'Bar' : 'bars', 'Pizza' : 'pizza ventures'}
print('-------------------------------------------------------------------------------------')
print('')
for object_type, label in type_labels.items():
    seats = rest_data[rest_data['object_type'] == object_type]['number']
    print("The average number of seats for {} is : {:.0f} seats.".format(label, seats.mean()))
    print("The median number of seats for {} is : {:.0f} seats.".format(label, seats.median()))
    print('')
print('-------------------------------------------------------------------------------------')

# Plotting
fig = px.box(rest_data,
             color = 'object_type',
             y = "number",
            color_discrete_sequence = ['rgb(178,24,43)', 'rgb(214,96,77)', 'rgb(244,165,130)','rgb(146,197,222)',
                                       'rgb(67,147,195)', 'rgb(33,102,172)', 'rgb(5,48,97)'],
            title = {'text': "Average Seats Number Per Restaurant Type",
                          'y':0.9,
                          'x':0.47,
                          'xanchor': 'center',
                          'yanchor': 'top'},
 )

fig.show()



-------------------------------------------------------------------------------------

The general average number of seats is : 44 seats.
The general median number of seats is : 27 seats.

-------------------------------------------------------------------------------------

The average number of seats for cafes is : 25 seats.
The median number of seats for cafes is : 21 seats.

The average number of seats for restaurants is : 48 seats.
The median number of seats for restaurants is : 29 seats.

The average number of seats for fast food is : 32 seats.
The median number of seats for fast food is : 21 seats.

The average number of seats for bakeries is : 22 seats.
The median number of seats for bakeries is : 18 seats.

The average number of seats for bars is : 45 seats.
The median number of seats for bars is : 28 seats.

The average number of seats for pizza ventures is : 28 seats.
The median number of seats for pizza ventures is : 18 seats.

----------------------------------------------

<b>Observations and conclusions :</b>    

The study of average seats per type of restaurant shows that the averages vary depending on the type of establishment.
    
The types of establishments with the highest average number of seats are **restaurants with an average of 48 seats** and **bars with an average of 45 seats**.
    
On the other hand, the establishments with the lowest number of seats are **bakeries with an average of 22 seats** and **cafes with an average of 25 seats**.
    
**The medians, which are less sensitive to outliers, keep restaurants (29) and bars (28) at the top, although the differences are smaller. At the bottom, bakeries and pizza places share the lowest median (18 seats), just below cafes and fast food (21 seats).**
    
The box plot further illustrates these differences. Restaurants and bars generally have the largest numbers of seats, while bakeries and cafes, despite some outliers, tend to have fewer.
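
To illustrate why the mean and the median can diverge, here is a small sketch on made-up seat counts (not taken from the dataset). A single large venue pulls the mean up while the median barely moves. The same `groupby(...).agg([...])` pattern could be applied to `rest_data` to get all per-type figures in one table.

```python
import pandas as pd

# Toy seat counts (illustrative only): one large bar skews the mean
toy = pd.DataFrame({
    "object_type": ["Bar", "Bar", "Bar", "Cafe", "Cafe", "Cafe"],
    "number": [20, 28, 200, 18, 21, 24],
})

# Mean and median per type, sorted by the more robust statistic
summary = (
    toy.groupby("object_type")["number"]
       .agg(["mean", "median"])
       .sort_values("median", ascending=False)
)
print(summary)
```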

###  2.5. Location Study : <a name="25"></a>

We will now look for the optimal locations to open a new establishment by identifying the most popular streets for restaurants, but also the least popular ones.

The first step is to retrieve the street name from the address column.

In [36]:
# Creating a new column with the street without the number at the start
rest_data['st']= rest_data['address'].str.strip('0123456789.-#')

# Turning everything to uppercase
rest_data['st']= rest_data['st'].str.upper()

# Removing words with fewer than 3 characters
rest_data['st'] = rest_data['st'].str.findall(r'\w{3,}').str.join(' ')

# A List of words to remove from the address
words = ['\\bTHE\\b', '\\bWAY\\b', '\\bSTE\\b', 'BLDG', 'DRIVE', 'STEK', 'BLVD', '\\bAVE\\b', 'unit', 'UNIT', 'APT',
         'AVENUE', 'NPQ', 'BSMT', 'LVL', 'PLZ', '\\bLOB\\b', '\\bFLR\\b', 'LOBBY', 'LBBY','A233B', 'B300F', 'B300A',
        'RDFL', 'RDF', 'SPC','BLV', 'FLOOR', 'T4K', 'QSR','LOWR', 'CIR', 'T7J', 'TSB', 'T5B', 'R1A', 'ROL', 'DLC',
        'PLAZA', 'SECIO', 'VLG', 'VILLAGE', 'MALL', 'ASTRONAUT', 'ELLISON','\\bSAN\\b', 'PKWY', '\\bSHL\\b', '\\bLOS\\b',
         '\\bLOS\\b', '\\bBOX\\b', 'C2B', 'LEVEL', '\\bLANE\\b', 'STREET', 'P2Z', '\\bALLEY\\b', 'MEZZ','\\bSUIT\\b',
         '\\bROOM\\b', 'MEDICAL','AUGY', 'CAFE']
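for word in words:
     # regex = True is needed for the '\\b' word boundaries (it is no longer the default in recent pandas)
     rest_data['st'] = rest_data['st'].str.replace(word, '', regex = True)
        
# Intended to remove words containing a number. Note : the pattern is written as a JavaScript
# regex literal ('/.../g'), so in Python it matches nothing and names such as 6TH or 3RD are kept
rest_data['st']= rest_data['st'].str.replace('/\b[^\s]*\d[^\s]*\b/g', '')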

# Removing white space at the end of the street name
rest_data['st'] = rest_data['st'].str.rstrip()

# Separating the last word from the rest of the street
new = rest_data['st'].str.rsplit(' ', n = 1, expand = True)

# Appending the results to the dataset
rest_data['prefix'] = new[0]
rest_data['suffix'] = new[1]

# Removing numbers from the suffix and then removing it completely if it's shorter than 3 characters
rest_data['suffix'] = rest_data['suffix'].str.strip('0123456789.-# ')
rest_data['suffix'] = rest_data['suffix'].str.findall(r'\w{3,}').str.join(' ')

# Reattaching
rest_data['st_stripped'] = rest_data['prefix'] + ' ' + rest_data['suffix']

# To fill in the error where the suffix is a NaN
rest_data['st_stripped'] = rest_data['st_stripped'].fillna(rest_data['prefix'])

# Dropping useless columns
rest_data = rest_data.drop(columns = ['st', 'prefix', 'suffix'])

# Removing white space from the start and the end
rest_data['st_stripped'] = rest_data['st_stripped'].str.strip()

# Removing extra white space in the middle
rest_data['st_stripped'] = rest_data['st_stripped'].str.replace('  ', ' ')
rest_data['st_stripped'] = rest_data['st_stripped'].str.replace('  ', ' ')

# Rename 
rest_data = rest_data.rename(columns = {"st_stripped":"street"})

rest_data.head(10)


| | id | object_name | address | chain | object_type | number | street |
|---|---|---|---|---|---|---|---|
| 0 | 11786 | HABITAT COFFEE SHOP | 3708 N EAGLE ROCK BLVD | False | Cafe | 26 | EAGLE ROCK |
| 1 | 11787 | REILLY'S | 100 WORLD WAY # 120 | False | Restaurant | 9 | WORLD |
| 2 | 11788 | STREET CHURROS | 6801 HOLLYWOOD BLVD # 253 | False | Fast Food | 20 | HOLLYWOOD |
| 3 | 11789 | TRINITI ECHO PARK | 1814 W SUNSET BLVD | False | Restaurant | 22 | SUNSET |
| 4 | 11790 | POLLEN | 2100 ECHO PARK AVE | False | Restaurant | 20 | ECHO PARK |
| 5 | 11791 | THE SPOT GRILL | 10004 NATIONAL BLVD | False | Restaurant | 14 | NATIONAL |
| 6 | 11792 | CPK | 100 WORLD WAY # 126 | False | Restaurant | 100 | WORLD |
| 7 | 11793 | PHO LALA | 3500 W 6TH ST STE 226 | False | Restaurant | 7 | 6TH |
| 8 | 11794 | ABC DONUTS | 3027 N SAN FERNANDO RD UNIT 103 | True | Fast Food | 1 | FERNANDO |
| 9 | 11795 | UPSTAIRS | 3707 N CAHUENGA BLVD | False | Restaurant | 35 | CAHUENGA |
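
The cleaning steps above can also be expressed at the token level. The sketch below is a simplified variant, not the notebook's exact procedure: the `street_name` helper is hypothetical and `NOISE` is a shortened subset of the word list used in the cell. On the sample addresses it gives the same street names as the table above.

```python
import re

# Tokens treated as noise; a shortened, illustrative subset of the list above
NOISE = {"BLVD", "AVE", "WAY", "STE", "UNIT", "SAN", "THE"}

def street_name(address: str) -> str:
    """Hypothetical helper: reduce an address to its street name."""
    # Keep alphanumeric tokens of 3+ characters, uppercased
    tokens = re.findall(r"\w{3,}", address.upper())
    # Drop pure numbers (house and suite numbers) and noise words
    kept = [t for t in tokens if not t.isdigit() and t not in NOISE]
    return " ".join(kept)

for address in ["3708 N EAGLE ROCK BLVD", "100 WORLD WAY # 120",
                "3500 W 6TH ST STE 226", "3027 N SAN FERNANDO RD UNIT 103"]:
    print(address, "->", street_name(address))
```

Matching whole tokens rather than substrings avoids accidentally cutting a noise word out of the middle of a longer street name, at the cost of needing every variant listed explicitly.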


**Let's verify the street names we have obtained to see if any mistakes stand out.**

In [37]:
# Create an array we call streets
streets = rest_data['street'].unique()


print('number of unique streets :',len(streets))

number of unique streets : 450


We notice that some street names indeed contain mistakes or inconsistencies. For example :
- ZOO misspelled as Z00,
- ALFRED misspelled as ALFERD,
- SANTA MONICA abbreviated as STA MON,
- HEWITT misspelled as NEWITT,
- PASEO also written as PASEO RANCHO CASTILLA,
- ALAMEDA misspelled as ALAMESA,
- OLYMPIC misspelled as OLYMPC,
- VICENTE misspelled as VINCENTE,
- ROSCOMARE misspelled as ROSCOMORE,
- HILLHURST misspelled as HILHURST,
- WESTERN misspelled as WERSTERN,
- ALVARADO misspelled as ALAVARADO,
- MILLENNIUM misspelled as MILLENIUM,
- MCCARTHY also spelled as MC CARTHY,
- MCCLINTOCK also spelled as MC CLINTOCK,
- SAWTELLE misspelled as SAWTTLE,
- MULHOLLAND misspelled as MULHOOLAND.

Let's correct the instances we have noticed to minimize errors in the calculations that follow.


In [38]:
rest_data['street'] = rest_data['street'].replace({
    'Z00' : 'ZOO',
    'ALFERD' : 'ALFRED',
    'STA MON' : 'SANTA MONICA',
    'NEWITT' : 'HEWITT',
    'PASEO RANCHO CASTILLA' : 'PASEO',
    'ALAMESA' : 'ALAMEDA',
    'OLYMPC' : 'OLYMPIC',
    'VINCENTE' : 'VICENTE',
    'ROSCOMORE' : 'ROSCOMARE',
    'HILHURST' : 'HILLHURST',
    'WERSTERN' : 'WESTERN',
    'ALAVARADO' : 'ALVARADO',
    'MILLENIUM' : 'MILLENNIUM',
    'MC CARTHY' : 'MCCARTHY',
    'MC CLINTOCK' : 'MCCLINTOCK',
    'CARTHY' : 'MCCARTHY',
    'SAWTLLE' : 'SAWTELLE',
    'MULHOOLAND' : 'MULHOLLAND',
    
})

len(rest_data['street'].unique())

436
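
The misspellings above were spotted by eye. As a hedged sketch of how candidates could be surfaced automatically, the snippet below uses `difflib.get_close_matches` from the standard library on a handful of names from the list. The cutoff of 0.8 is an assumption chosen for this small sample; every suggested pair would still need a manual check.

```python
from difflib import get_close_matches

# A few street names from the list above, mixing correct and misspelled forms
streets = ["ALFRED", "ALFERD", "OLYMPIC", "OLYMPC", "WESTERN", "WERSTERN", "SUNSET"]

# For each name, collect the other names that are nearly identical to it
candidates = {}
for name in streets:
    others = [s for s in streets if s != name]
    close = get_close_matches(name, others, n=3, cutoff=0.8)
    if close:
        candidates[name] = close

print(candidates)
```

Pairs that differ in more than a couple of characters, such as abbreviations like STA MON, would not be caught by this kind of similarity score.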

**Observations :**  

**We have now created a new column with the street name.** Next, we will calculate how many establishments each street has.

In [39]:
# Creating a dataset with street counts
street_count = rest_data.groupby('street', as_index = False).agg({'object_name' : 'count', 'number' : 'sum'})
street_count.columns = ['street' , 'restaurant_count', 'seats']

# Sorting the values
street_count = street_count.sort_values(by = 'restaurant_count', ascending = False)

# Popular streets
popular = street_count.head(10)

# Display
display(popular)


# Unpopular streets
unpop = street_count.query('restaurant_count == 1')
# print
print('----------------------------------------------------------------------')
print('')
print('There are {:.0f} streets with only one restaurant.'.format(len(unpop)))
print('')
print('----------------------------------------------------------------------')

# Plot bar chart
fig = px.bar(popular, x = 'street', y = 'restaurant_count',
             text = 'restaurant_count',
             hover_data = ['restaurant_count', 'seats'], color = 'restaurant_count', height = 400,
             color_continuous_scale = px.colors.diverging.RdBu,
             title = {'text': '10 Most Popular Locations in Los Angeles',
                          'y':0.9,
                          'x':0.47,
                          'xanchor': 'center',
                          'yanchor': 'top'})
fig.show()

| | street | restaurant_count | seats |
|---|---|---|---|
| 374 | SUNSET | 405 | 19401 |
| 424 | WILSHIRE | 398 | 20957 |
| 327 | PICO | 371 | 15042 |
| 414 | WESTERN | 370 | 15281 |
| 165 | FIGUEROA | 334 | 15047 |
| 309 | OLYMPIC | 310 | 15280 |
| 397 | VERMONT | 288 | 13067 |
| 348 | SANTA MONICA | 277 | 9949 |
| 17 | 3RD | 263 | 10660 |
| 213 | HOLLYWOOD | 254 | 14438 |


----------------------------------------------------------------------

There are 158 streets with only one restaurant.

----------------------------------------------------------------------


<b>Observations and conclusions :</b> 
    
**The most popular street in LA as a location for restaurants is Sunset Boulevard, with a total of 405 restaurants, followed by Wilshire Boulevard with 398 restaurants and Pico Boulevard with 371 restaurants.**  
    

**We notice that there are 158 streets with only one restaurant. This could be accurate, but part of this number may come from misspelled street names that were counted as separate streets.** 

<b>And this is where I went a bit far..</b> 
    
Let's plot a density map, showing the streets with the most restaurants.

To plot this map, I need latitude and longitude data for each street, which the restaurant dataset does not provide. Therefore, I sourced the data from an open [dataset](https://www.kaggle.com/cityofLA/los-angeles-parking-citations?select=parking-citations.csv) of parking citations in LA. I will process this dataset in the following code blocks to make it ready to be merged with the restaurant dataset.

In [40]:
# Import the new dataset
spreadsheet_id = '1Kq8iMBVcIaAcRxommfmpsO33k1m4ppptBgriYMT41vs'
file_name = 'https://docs.google.com/spreadsheets/d/{}/export?format=csv'.format(spreadsheet_id)
r = requests.get(file_name)
la = pd.read_csv(BytesIO(r.content))

la.head(10)

| | Street | Longitude | Latitude |
|---|---|---|---|
| 0 | S RENO ST | -118.281268 | 34.071436 |
| 1 | S VENDOME ST | -118.282194 | 34.07184 |
| 2 | W 2ND ST | -118.281093 | 34.069341 |
| 3 | W 2ND ST E | -118.155411 | 33.764106 |
| 4 | 129TH ST W | -118.290245 | 33.915393 |
| 5 | N DILLON ST | -118.283109 | 34.07225 |
| 6 | N EDINBURGH AVE | -118.363747 | 34.083778 |
| 7 | N HAYWORTH AVE | -118.362629 | 34.076099 |
| 8 | N NEW HAMPSHIRE AVE | -118.293313 | 34.103609 |
| 9 | N OCCIDENTAL BLVD | -118.278521 | 34.072127 |
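
Once the street names in both tables are cleaned the same way, the coordinates can be attached to the street counts. The sketch below shows one possible way to do this, assuming a left merge on the `street` column. The restaurant counts are taken from the table of popular streets, while the coordinates are made up for illustration.

```python
import pandas as pd

# Restaurant counts from the popular-streets table; coordinates below are made up
street_count = pd.DataFrame({
    "street": ["SUNSET", "WILSHIRE", "PICO"],
    "restaurant_count": [405, 398, 371],
})
coords = pd.DataFrame({
    "street": ["SUNSET", "SUNSET", "WILSHIRE"],
    "Latitude": [34.09, 34.11, 34.06],
    "Longitude": [-118.30, -118.40, -118.30],
})

# One coordinate pair per street: the mean of all points on that street
centroids = coords.groupby("street", as_index=False)[["Latitude", "Longitude"]].mean()

# A left merge keeps every street; the indicator flags streets without coordinates
merged = street_count.merge(centroids, on="street", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "street"].tolist()
print(unmatched)
```

Listing the unmatched streets is a quick way to see which names still differ between the two datasets after cleaning. Note that averaging coordinates places a long street at a single point, which is a rough approximation.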


In [41]:
# Creating a new column with the street name, stripped of leading and trailing numbers
la['st']= la['Street'].str.strip('0123456789.-#')

# Turning everything to uppercase
la['st']= la['st'].str.upper()

# Removing words with fewer than 3 characters
la['st'] = la['st'].str.findall(r'\w{3,}').str.join(' ')

# A List of words to remove from the address
words = ['\\bTHE\\b', '\\bWAY\\b', '\\bSTE\\b', 'BLDG', 'DRIVE', 'STEK', 'BLVD', '\\bAVE\\b', 'UNIT', 'APT',
         'AVENUE', 'NPQ', 'BSMT', 'LVL', 'PLZ', '\\bLOB\\b', '\\bFLR\\b', 'LOBBY', 'LBBY','A233B', 'B300F', 'B300A',
        'RDFL', 'RDF', 'SPC','BLV', 'FLOOR', 'T4K', 'QSR','LOWR', 'CIR', 'T7J', 'TSB', 'T5B', 'R1A', 'ROL', 'DLC',
        'PLAZA', 'SECIO', 'VLG', 'VILLAGE', 'MALL', 'ASTRONAUT', 'ELLISON','\\bSAN\\b', 'PKWY', '\\bSHL\\b', '\\bLOS\\b', 
        '\\bBOX\\b', 'C2B', 'LEVEL', '\\bLANE\\b', 'STREET', 'P2Z', '\\bALLEY\\b', 'MEZZ','\\bSUIT\\b', '\\bROOM\\b', 'MEDICAL',
        'AUGY']
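for word in words:
     # regex = True is needed for the '\\b' word boundaries (it is no longer the default in recent pandas)
     la['st'] = la['st'].str.replace(word, '', regex = True)
        
# Intended to remove words containing a number. Note : the pattern is written as a JavaScript
# regex literal ('/.../g'), so in Python it matches nothing and words with digits are kept
la['st']= la['st'].str.replace('/\b[^\s]*\d[^\s]*\b/g', '')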

# Removing white space at the end of the street name
la['st'] = la['st'].str.rstrip()

# Separating the last word from the rest of the street
new = la['st'].str.rsplit(' ', n = 1, expand = True)

# Appending the results to the dataset
la['prefix'] = new[0]
la['suffix'] = new[1]

# Removing numbers from the suffix and then removing it completely if it's shorter than 3 characters
la['suffix'] = la['suffix'].str.strip('0123456789.-# ')
la['suffix'] = la['suffix'].str.findall(r'\w{3,}').str.join(' ')

# Reattaching
la['st_stripped'] = la['prefix'] + ' ' + la['suffix']

# To fill in the error where the suffix is a NaN
la['st_stripped'] = la['st_stripped'].fillna(la['prefix'])

# Dropping useless columns
la = la.drop(columns = ['st', 'prefix', 'suffix'])

# Removing white space from the start and the end
la['st_stripped'] = la['st_stripped'].str.rstrip()
la['st_stripped'] = la['st_stripped'].str.lstrip()

# Removing white space in the middle
la['st_stripped'] = la['st_stripped'].str.replace('  ', ' ')
la['st_stripped'] = la['st_stripped'].str.replace('  ', ' ')

# Rename 
la = la.rename(columns = {"st_stripped":"street"})

# Separating the first word from the rest of the street
new_ = la['street'].str.split(' ', n = 1, expand = True)

# Appending the results to the dataset
la['prefix'] = new_[0]
la['suffix'] = new_[1]

# Creating slices of -st, -nd, -rd and -th streets
ST = la[la['prefix'].str.contains('ST')]
ND = la[la['prefix'].str.contains('ND')]
TH = la[la['prefix'].str.contains('TH')]
RD = la[la['prefix'].str.contains('RD')]

# Removing the digits from the first word (e.g. 2ND becomes ND)
la['prefix'] = la['prefix'].str.replace(r'\d+', '', regex = True)

# Reattaching
la['st_stripped'] = la['prefix'] + ' ' + la['suffix']

# To fill in the error where the suffix is a NaN
la['st_stripped'] = la['st_stripped'].fillna(la['prefix'])

# Removing white space from the start and the end
la['st_stripped'] = la['st_stripped'].str.strip()

# Collapsing repeated spaces in the middle into a single space
la['st_stripped'] = la['st_stripped'].str.replace(r' {2,}', ' ', regex=True)

# Dropping useless columns
la = la.drop(columns = ['street', 'prefix', 'suffix'])

# Rename 
la = la.rename(columns = {"st_stripped":"street"})

# Dropping rows whose street is only 'ND' or 'TH' (ordinal suffixes left over after removing the digits)
la = la.drop(index=la[la['street'] == 'ND'].index)
la = la.drop(index=la[la['street'] == 'TH'].index)

# Dropping rows with NaN values
la = la.dropna()

# Grouping
la = la.groupby('street' , as_index = False).agg({'Latitude' : 'mean', 'Longitude' : 'mean'})

# Keeping only the relevant columns in ST, ND, RD and TH
ST = ST[['street', 'Latitude', 'Longitude']]
ND = ND[['street', 'Latitude', 'Longitude']]
TH = TH[['street', 'Latitude', 'Longitude']]
RD = RD[['street', 'Latitude', 'Longitude']]


# The ST, ND, RD and TH slices also caught every street whose first word merely contains those letters,
# such as HAYWOR-TH- or RAYMO-ND-. Numbered streets usually have up to three digits plus the suffix (xxxND, xxxTH),
# so we keep only names of 5 characters or fewer to be safe
ST = ST[ST['street'].str.len() <= 5]
ND = ND[ND['street'].str.len() <= 5]
TH = TH[TH['street'].str.len() <= 5]
RD = RD[RD['street'].str.len() <= 5]

# Grouping
ST = ST.groupby('street' , as_index = False).agg({'Latitude' : 'mean', 'Longitude' : 'mean'})
ND = ND.groupby('street' , as_index = False).agg({'Latitude' : 'mean', 'Longitude' : 'mean'})
TH = TH.groupby('street' , as_index = False).agg({'Latitude' : 'mean', 'Longitude' : 'mean'})
RD = RD.groupby('street' , as_index = False).agg({'Latitude' : 'mean', 'Longitude' : 'mean'})

# Concatenating the datasets
LA = pd.concat([la, ST, ND, RD, TH])
LA = LA.drop_duplicates()

# Removing duplicates
LA = LA.groupby('street', as_index = False).agg({'Latitude' : 'mean', 'Longitude' : 'mean'})
LA = LA.drop(index=LA[LA['street'] == ''].index)

# Merging the data of restaurants and coordinates together
street_count = street_count.merge(LA, how = 'left', on = 'street')

# Adding additional streets which have more than 1 restaurant
additional = pd.DataFrame({
    'street'  : ['FIRESTONE', 'JAPANSE', 'TROUSDALE', 'STATE UNIVERSITY', 'OLVERA' , 'MEDNIK' , 'FORD', 'GIN LING',
                 'GETTY CENTER', 'KERN', 'BRANNICK', 'CHARLES YOUNG', 'EXPOSITION PARK', 'MEI LING', 'TELEGRAPH', 'MCDONELL',
                'MOUNTAIN GATE', 'CIENGA' , 'LMU', 'SUNOL', 'JAPANESE'],
    'lat': [33.88343, 34.049515, 34.01987, 34.067744, 34.0527, 34.043578, 34.0527, 34.0527, 34.0527, 34.02021, 34.0527,
            34.066702, 34.015976, 34.065323, 34.013242, 34.0527, 34.108348, 34.052696, 33.969677, 37.60803, 34.049955],
    'lo' : [-118.026427, -118.238695, -118.286469, -118.166754, -118.2437, -118.161913, -118.2437, -118.2437, -118.2437,
            -118.165968, -118.2437, -118.445397, -118.286287, -118.237879, -118.163976, -118.2437, -118.487648, -118.376458,
            -118.416139, -121.874084, -118.240055]})

# Merging again, replacing NaN with 0, and adding the latitude and longitude columns to each other
street_count = street_count.merge(additional, how = 'left', left_on = ['street'], right_on = ['street'])
street_count = street_count.fillna(value = 0)

street_count['Latitude'] = street_count['Latitude'] + street_count['lat']
street_count['Longitude'] = street_count['Longitude'] + street_count['lo']

# Dropping useless columns
street_count = street_count.drop(columns = ['lat', 'lo'])

# Display
display(street_count.head(15))


Unnamed: 0,street,restaurant_count,seats,Latitude,Longitude
0,SUNSET,405,19401,34.078114,-118.330796
1,WILSHIRE,398,20957,34.058408,-118.309995
2,PICO,371,15042,34.044414,-118.345266
3,WESTERN,370,15281,34.019047,-118.307747
4,FIGUEROA,334,15047,33.981518,-118.266744
5,OLYMPIC,310,15280,34.043184,-118.289723
6,VERMONT,288,13067,34.027117,-118.291553
7,SANTA MONICA,277,9949,34.074602,-118.347323
8,3RD,263,10660,34.018878,-118.274267
9,HOLLYWOOD,254,14438,34.102585,-118.321968
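
The merge-then-add step above replaces every NaN with 0 and sums the two coordinate sources. That only gives the right result when a street appears in one source but not both, and it leaves streets matched by neither at (0, 0). As a minimal sketch of an alternative, the snippet below takes the fallback value only where the primary one is missing. The toy frame and the `coalesce_coordinates` helper are hypothetical and not part of the original notebook; only the column names (`Latitude`, `Longitude`, `lat`, `lo`) are taken from it.

```python
import numpy as np
import pandas as pd


def coalesce_coordinates(df):
    """Fill missing Latitude/Longitude from the fallback lat/lo columns."""
    out = df.copy()
    out['Latitude'] = out['Latitude'].fillna(out['lat'])
    out['Longitude'] = out['Longitude'].fillna(out['lo'])
    return out.drop(columns=['lat', 'lo'])


# Toy data (made up for illustration): one street matched by the first merge,
# one matched only by the manual table, and one matched by neither.
toy = pd.DataFrame({
    'street': ['SUNSET', 'OLVERA', 'UNKNOWN'],
    'Latitude': [34.078114, np.nan, np.nan],
    'Longitude': [-118.330796, np.nan, np.nan],
    'lat': [np.nan, 34.0527, np.nan],
    'lo': [np.nan, -118.2437, np.nan],
})

coords = coalesce_coordinates(toy)
print(coords)
```

With this approach, unmatched streets keep NaN coordinates instead of 0, so they can be dropped explicitly (for example with `dropna`) before plotting.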


In [42]:
# Plotting the map

fig = px.scatter_mapbox(street_count, 
                        lat = "Latitude", 
                        lon = "Longitude", 
                        color = 'restaurant_count', 
                        size = 'restaurant_count', 
                        hover_name = 'street',
                        color_continuous_scale = px.colors.diverging.RdBu,
                        opacity = .7,
                        size_max = 30,
                        center = go.layout.mapbox.Center(lat = 34.05, lon = -118.35),
                        zoom = 10.5,
                        height = 400,
                        mapbox_style = "carto-positron")
fig.update_layout(margin = {"r":0,"t":0,"l":0,"b":0},
                 title = 'Density Map of Los Angeles Restaurants',
                 autosize = True,
                  hovermode = 'closest')

fig.show()

<b>Observations and conclusions :</b> 
    
The map above shows which areas of the city have the most restaurants and which are less dense. We notice a large concentration of restaurants in the central region of the city, where Sunset Boulevard, Wilshire Boulevard, Pico Boulevard and Western Avenue are located. There is also a dense presence in Downtown LA and in the south of the city.
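
One rough way to put a number on the concentration described above is a restaurant-weighted centroid: the average of the street coordinates, weighted by how many restaurants each street has. The sketch below uses only the ten rows printed in the table earlier, so it is an illustration, not a result for the full dataset. The `weighted_centroid` helper is a hypothetical name.

```python
# (street, restaurant_count, latitude, longitude), copied from the ten rows displayed above
rows = [
    ("SUNSET", 405, 34.078114, -118.330796),
    ("WILSHIRE", 398, 34.058408, -118.309995),
    ("PICO", 371, 34.044414, -118.345266),
    ("WESTERN", 370, 34.019047, -118.307747),
    ("FIGUEROA", 334, 33.981518, -118.266744),
    ("OLYMPIC", 310, 34.043184, -118.289723),
    ("VERMONT", 288, 34.027117, -118.291553),
    ("SANTA MONICA", 277, 34.074602, -118.347323),
    ("3RD", 263, 34.018878, -118.274267),
    ("HOLLYWOOD", 254, 34.102585, -118.321968),
]


def weighted_centroid(records):
    """Return the (latitude, longitude) average weighted by restaurant count."""
    total = sum(count for _, count, _, _ in records)
    centroid_lat = sum(count * y for _, count, y, _ in records) / total
    centroid_lon = sum(count * x for _, count, _, x in records) / total
    return centroid_lat, centroid_lon


centroid_lat, centroid_lon = weighted_centroid(rows)
print(f"Restaurant-weighted centroid: {centroid_lat:.4f}, {centroid_lon:.4f}")
```

A centroid is a coarse summary: it says where the weight of the restaurants lies on average, not how tightly they cluster around that point.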

###  2.6. Seats in Popular Streets  : <a name="26"></a>

In this part, we focus on the most popular streets (the first 85 rows of `street_count` in the code below) to analyze the distribution of the number of seats and its correlation with other factors.

In [43]:
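# Slice of the most popular streets (the first 85 rows of street_count)
# .copy() avoids modifying a view of rest_data when the chain column is converted below
Popular = rest_data[rest_data['street'].isin(street_count['street'].head(85))].copy()
Popular['chain'] = Popular['chain'].astype('str')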
# Grouping
Popular = Popular.groupby('object_name').agg({'address' : 'count', 'number' : 'sum', 'chain' : 'first'})

# Sorting values
Popular = Popular.sort_values(by = 'address' , ascending = False).reset_index()

# Renaming
Popular.columns = ['object_name', 'chain_count', 'total_seats', 'chain']

# Adding a ratio column
Popular['average_seating'] = (Popular['total_seats'] / Popular['chain_count']).round(decimals = 2)

# Display
display(Popular.head(10))


Unnamed: 0,object_name,chain_count,total_seats,chain,average_seating
0,THE COFFEE BEAN & TEA LEAF,44,1166,True,26.50
1,SUBWAY,27,434,True,16.07
2,WABA GRILL,14,600,True,42.86
3,MCDONALD'S,13,1259,True,96.85
4,KENTUCKY FRIED CHICKEN,13,439,True,33.77
...,...,...,...,...,...
7517,"GUATEMALTECA BAKERY, INC.",1,84,True,84.00
7518,GUATEMALTECA BAKERY,1,45,True,45.00
7519,GUANAQUITA BAKERY & PUPUSERI,1,3,True,3.00
7520,GUADALUPE'S PLACE,1,4,False,4.00
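
The grouping above counts addresses, sums seats, and then renames the resulting columns by position. As a sketch of an equivalent approach, pandas named aggregation (available since pandas 0.25) assigns the output names directly inside `agg`, so the result does not depend on column order. The `rest_sample` frame below is made up for illustration; only the column names (`object_name`, `address`, `number`, `chain`) come from the notebook.

```python
import pandas as pd

# Toy stand-in for the slice of rest_data on popular streets (values are invented)
rest_sample = pd.DataFrame({
    'object_name': ['CAFE A', 'CAFE A', 'DINER B', 'CAFE A'],
    'address': ['1 SUNSET BLVD', '2 PICO BLVD', '3 WILSHIRE BLVD', '4 WESTERN AVE'],
    'number': [20, 30, 45, 10],
    'chain': [True, True, False, True],
})

popular_sample = (
    rest_sample
    .groupby('object_name', as_index=False)
    .agg(chain_count=('address', 'count'),   # number of locations per name
         total_seats=('number', 'sum'),      # seats summed over all locations
         chain=('chain', 'first'))           # chain flag of the first location
    .sort_values('chain_count', ascending=False)
    .reset_index(drop=True)
)

# Average number of seats per location
popular_sample['average_seating'] = (
    popular_sample['total_seats'] / popular_sample['chain_count']
).round(2)

print(popular_sample)
```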


In [44]:
# Plotting the scatter plot
fig = px.scatter(Popular, 
                 x = "total_seats",
                 y = "chain_count",
                 color = 'chain',
                 size = 'average_seating',
                 opacity = .3,
                 title = {'text': "Number of Locations vs. Number of Seats for Establishments on Popular Streets",
                          'y':0.9,
                          'x':0.47,
                          'xanchor': 'center',
                          'yanchor': 'top'},
                color_discrete_sequence = ['rgb(214,96,77)', 'rgb(67,147,195)'],
                hover_data = ['object_name' , 'total_seats', 'average_seating', 'chain_count'])
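# Note: marker_size=10 overrides the size='average_seating' mapping above, so all markers are drawn at the same size
fig.update_traces(mode='markers', marker_size=10)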
fig.show()

<b>Observations :</b>  

For the most popular streets, we notice a tendency similar to that of the full dataset, where most restaurants are concentrated at the bottom left side of the graph, meaning one location and a small number of seats.
    
Clicking on a chain type in the legend shows each type separately. For non-chain restaurants, the number of seats ranges from very low values up to 200, with the highest concentration under 50 seats. Chain restaurants show the same tendency: most have a small number of seats (fewer than 200) and 3 locations or fewer.
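
The visual reading above can be checked numerically by computing, for each chain type, the share of establishments below a seat threshold. This is a minimal sketch on invented values, assuming a frame shaped like `Popular` with `chain` stored as a string and an `average_seating` column. The 50-seat threshold is the one mentioned in the text.

```python
import pandas as pd

# Toy stand-in for Popular (values are invented for illustration)
sample = pd.DataFrame({
    'object_name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'chain': ['True', 'True', 'True', 'False', 'False', 'False'],
    'average_seating': [26.5, 96.85, 16.07, 4.0, 120.0, 45.0],
})

# Flag establishments below the threshold, then average the flag per group:
# the mean of a boolean column is the share of True values.
sample['under_50'] = sample['average_seating'] < 50
share_under_50 = sample.groupby('chain')['under_50'].mean()

print(share_under_50)
```

Running the same two lines on the real `Popular` frame would give the actual shares for chain and non-chain establishments.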

## 3. Presentation :<a name="3"></a>


presentation : <https://1drv.ms/p/s!At4zyKb-Zd44iWti6PWLXulDHdhV?e=xKsCam>

## 4. Conclusion :<a name="4"></a>


<div class="alert alert-block alert-info">
<b>
As a final conclusion: since this is a new, innovative restaurant that won't be part of a chain (not just yet!), it would be best to keep the number of seats around 50.
    
In terms of location, our advice is to choose one of the most popular streets mentioned above. Although these streets have a dense presence of establishments, only 5% of them are cafes, and their high foot and car traffic would make the cafe very visible to a large number of potential clients.
</b>  </div>
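
The share of cafes quoted in the conclusion can be recomputed with a short filter-and-mean. The sketch below runs on invented rows and assumes the establishment type is stored in a column named `object_type` with the value `'Cafe'`; both the column name and the `top_streets` shortlist are assumptions for illustration.

```python
import pandas as pd

# Toy data (invented); 'object_type' is an assumed column name
sample = pd.DataFrame({
    'street': ['SUNSET', 'SUNSET', 'PICO', 'PICO', 'WILSHIRE', 'MAIN'],
    'object_type': ['Cafe', 'Restaurant', 'Restaurant', 'Fast Food', 'Restaurant', 'Cafe'],
})
top_streets = ['SUNSET', 'PICO', 'WILSHIRE']  # hypothetical shortlist of popular streets

# Keep only establishments on the shortlisted streets
on_top = sample[sample['street'].isin(top_streets)]

# The mean of a boolean comparison is the share of matching rows
cafe_share = (on_top['object_type'] == 'Cafe').mean()
print(f"Share of cafes on the shortlisted streets: {cafe_share:.1%}")
```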