# Predicting the Success of a Restaurant
- Bengaluru is one of India's major cities, renowned for its vibrant culture and diverse cuisine.
- To better understand the culinary landscape of Bengaluru, an exploratory data analysis was conducted on the restaurants listed on Zomato, a popular restaurant search and discovery platform. The data was analyzed to uncover key trends in the restaurant industry, such as the most popular cuisines and pricing, as well as the most popular restaurants and their ratings. Additionally, the analysis explored geographic trends in the restaurant industry, including the most popular neighbourhoods and restaurant types.
- This notebook walks through a complete data analysis of the Zomato Bengaluru restaurants dataset. The goal is to support decision makers who are evaluating information about Bengaluru restaurants, and to apply a predictive point of view that helps people choose the best restaurant.
- The same predictive approach is intended for predicting the success of a new restaurant in Bengaluru.
-----------------------

## Libraries


In [1]:
# Standard libraries
import pandas as pd
import numpy as np
from warnings import filterwarnings
filterwarnings('ignore')
pd.set_option('display.max_columns', 500)
from collections import Counter
from PIL import Image


# Viz libs
import plotly.express as px

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from matplotlib.gridspec import GridSpec
# mpl_toolkits.axes_grid was superseded by mpl_toolkits.axes_grid1
from mpl_toolkits.axes_grid1.inset_locator import InsetPosition
import folium
from folium.plugins import HeatMap, FastMarkerCluster
from wordcloud import WordCloud

# Geolocation libs
from geopy.geocoders import Nominatim


## Reading and Exploring the Data



In [96]:
df = pd.read_csv(r'C:\Users\hp\Desktop\my projects\e-commerce data\zomato.csv', encoding='latin1')

In [70]:
lists = df.columns
data = pd.DataFrame({'column_name': lists,
                     'Description':['contains url of restaurant in the zomato website',
                                    'contains the address of the restaurant in Bengaluru',
                                     'contains the name of restaurant',
                                     'whether online ordering is available in the restaurant or not',
                                     'table book option available or not',
                                     'the overall rating of the restaurant out of 5',
                                     'contains total number of rating for the restaurant as of the above mentioned date',
                                     'contains the phone number of the restaurant',
                                     'contains the neighborhood in which the restaurant is located',
                                     'restaurant type',
                                     'dishes people liked in the restaurant',
                                     'food styles',
                                     'contains the approximate cost for meal for two people',
                                     'containing reviews for the restaurant, each tuple',
                                     'list of menus available in the restaurant',
                                     'type of meal',
                                     'contains the neighborhood in which the restaurant is listed']})
                                    
data.set_index('column_name',inplace=True)

In [71]:
def Data_Overview (df):
    print(f"- Dataset Shape: {df.shape}")
    
    print("--"*20)
    
    print(f'- Duplicated rows: {df.duplicated().sum()}')
    
    print("--"*20)
    
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values 
    summary['Missing %'] = round(df.isnull().mean()*100,2).values    
    summary['Uniques'] = df.nunique().values
    return summary.style.background_gradient(cmap ='RdBu_r')


Data_Overview(df)

- Dataset Shape: (51717, 17)
----------------------------------------
- Duplicated rows: 0
----------------------------------------


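|  | Name | dtypes | Missing | Missing % | Uniques |
|---|---|---|---|---|---|
| 0 | url | object | 0 | 0.0 | 51717 |
| 1 | address | object | 0 | 0.0 | 11495 |
| 2 | name | object | 0 | 0.0 | 8792 |
| 3 | online_order | object | 0 | 0.0 | 2 |
| 4 | book_table | object | 0 | 0.0 | 2 |
| 5 | rate | object | 7775 | 15.03 | 64 |
| 6 | votes | int64 | 0 | 0.0 | 2328 |
| 7 | phone | object | 1208 | 2.34 | 14926 |
| 8 | location | object | 21 | 0.04 | 93 |
| 9 | rest_type | object | 227 | 0.44 | 93 |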


***Some columns need further treatment:***

- approx_cost(for two people):
    - Change the data type from object to float
- rate:
    - Remove the "/5" suffix and change the data type from object to float
- Drop columns that are not useful for the analysis
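
The two conversions listed above can be sketched on a few made-up values. This is a minimal illustration only: the `sample` frame, the `parse_rate` helper, and the `rate_clean` / `cost_clean` column names are hypothetical, and the example values are invented to cover the cases the cleaning has to handle (a "/5" suffix, placeholder strings, missing values, thousands separators).

```python
import numpy as np
import pandas as pd

def parse_rate(value):
    """Return the numeric part of an 'x.y/5' rating, or NaN for anything else."""
    text = str(value).strip()
    if '/' not in text:
        # placeholder strings and missing values carry no rating
        return np.nan
    return pd.to_numeric(text.split('/')[0].strip(), errors='coerce')

# Hypothetical sample rows, not taken from the dataset
sample = pd.DataFrame({
    'rate': ['4.1/5', '3.9 /5', 'NEW', '-', None],
    'approx_cost(for two people)': ['800', '1,200', '300', None, '2,500'],
})

sample['rate_clean'] = sample['rate'].apply(parse_rate)

# Remove thousands separators, then convert; unparseable values become NaN
sample['cost_clean'] = pd.to_numeric(
    sample['approx_cost(for two people)'].str.replace(',', '', regex=False),
    errors='coerce',
)

print(sample[['rate_clean', 'cost_clean']])
```

Using `errors='coerce'` turns anything unparseable into NaN, so such rows can be dropped later in one step.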

------
## Data Cleaning

In [97]:
df= df.rename(columns={'listed_in(city)':'city','approx_cost(for two people)':'cost','listed_in(type)':'type'})
# drop columns
df.drop(['url','address','phone','menu_item','reviews_list','dish_liked'],axis=1,inplace=True)
# drop duplicated
df.drop_duplicates(inplace=True)

# reset index
df.reset_index(drop=True, inplace=True)

In [98]:
# data cleaning 
def fix_rate(i):
    if '/' in i:
        return float(i[0:3])
    else:
        return np.nan
    
df['rate'] = df['rate'].astype(str).apply(fix_rate)

# make a target column   
def target(x):
    if x>=3.75:
        return 1
    else:
        return 0
    
df['target']=df['rate'].apply(target)


# drop rows with a missing rate or cost before analysis
df.dropna(subset=['rate','cost'], inplace=True)
#reset index
df.reset_index(drop=True,inplace=True)

# remove thousands separators and convert cost to float
df['cost']= df['cost'].str.replace(',','').astype(float)

df.isna().mean()* 100
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
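
The target rule used in this cell (a rating of 3.75 or higher counts as a success) can also be written as a vectorized comparison. The sketch below uses a small made-up `rates` series, not the dataset, and checks that both forms agree, including for missing ratings, which both map to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including a boundary value and a missing value
rates = pd.Series([4.2, 3.75, 3.7, np.nan], name='rate')

# Vectorized form: comparisons with NaN evaluate to False, so NaN maps to 0
target_vectorized = (rates >= 3.75).astype(int)

# Row-wise form, equivalent to the target() function in this cell
target_apply = rates.apply(lambda x: 1 if x >= 3.75 else 0)

print(target_vectorized.tolist())
```

Because missing ratings silently become 0 under either form, rows without a rating need to be dropped before the target is used, as this cell does.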

In [99]:
# count items in cuisines and rest_type
df['count_cuisines']=df['cuisines'].astype(str).apply(lambda x: len(x.split(',')))
df['count_rest_type']=df['rest_type'].astype(str).apply(lambda x: len(x.split(',')))
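# 'dish_liked' is dropped in the cleaning step above, so only count it if it is still present
if 'dish_liked' in df.columns:
    df['count_dish_liked']= df['dish_liked'].astype(str).apply(lambda x:len(x.split(',')))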

---------------------------------------
## Conducting market research before diving into data analysis

### Understanding the Market Dynamics:
- Understand the current trends, preferences, and behaviors of restaurant-goers in Bengaluru. This includes knowing which cuisines are popular, the average spending capacity, and the dining preferences of different demographics.

### Identifying Key Success Factors:
- These factors might include location, pricing, quality of food, service quality, ambiance, marketing strategies, and customer reviews.

### Consumer Preferences and Trends:
- Know whether there is a growing trend towards specific cuisines; this can be a critical insight for predicting the success of new restaurants.

## Current trends in Bengaluru, based on market research:
- A mix of global and local flavors
- Healthy and organic eating
- Fine dining and experiential dining
- Online reservations and interactive menus

-------

##  Descriptive Analysis

In [75]:
# Summary Statistics
df[['rate', 'cost', 'votes']].describe()

|  | rate | cost | votes |
|---|---|---|---|
| count | 41392.0 | 41392.0 | 41392.0 |
| mean | 3.70037 | 603.265269 | 351.813901 |
| std | 0.440687 | 464.326729 | 882.928116 |
| min | 1.8 | 40.0 | 0.0 |
| 25% | 3.4 | 300.0 | 21.0 |
| 50% | 3.7 | 500.0 | 73.0 |
| 75% | 4.0 | 700.0 | 277.0 |
| max | 4.9 | 6000.0 | 16832.0 |


In [76]:
# Frequency distribution for cuisines, location, and rest_type
df[['cuisines', 'location', 'rest_type']].describe(include='all')

|  | cuisines | location | rest_type |
|---|---|---|---|
| count | 41384 | 41392 | 41245 |
| unique | 2376 | 92 | 87 |
| top | North Indian | BTM | Quick Bites |
| freq | 2117 | 3900 | 13871 |


----
- ***The average spending capacity***

In [77]:
# know avg cost
import plotly.express as px
import plotly.graph_objects as go

# Assuming df is your dataframe and it has a 'cost' column
fig = px.histogram(df, x='cost')

# Calculate the average cost
average_cost = df['cost'].mean()

# Add a vertical line for the average cost
fig.add_shape(
    go.layout.Shape(
        type='line',
        x0=average_cost,
        x1=average_cost,
        y0=0,
        y1=1,
        yref='paper',
        line=dict(color='Red', dash='dash')
    )
)

# Add annotation for the average cost
fig.add_annotation(
    x=average_cost,
    y=1,
    yref='paper',
    text=f"Average Cost: {average_cost:.2f}",
    showarrow=False,
    font=dict(color='Red')
)

fig.show()
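
The summary statistics earlier in the notebook report a mean cost of about 603 against a median of 500 and a maximum of 6000, which points to a right-skewed distribution. In that situation the median can be a more representative "typical" cost than the mean. The sketch below shows the effect on a small made-up series; the values are hypothetical and chosen only to include one expensive outlier.

```python
import pandas as pd

# Hypothetical costs for two people, with one very expensive restaurant
cost = pd.Series([200.0, 300.0, 400.0, 500.0, 600.0, 6000.0])

mean_cost = cost.mean()      # pulled upwards by the outlier
median_cost = cost.median()  # middle of the sorted values

print(f"mean={mean_cost:.2f}, median={median_cost:.2f}")
```

Drawing both values on the histogram would make the skew visible at a glance.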


----
##  Trend Analysis
- Cuisine Popularity: Analyze the cuisines column to identify the most popular cuisines.
- Restaurant Types: Examine the rest_type column to see which types of restaurants are most common.
- Location Preferences: Explore the location column to determine which areas in Bengaluru have the most restaurants.


- ***What types of cuisines are popular***

In [100]:
# Analyze the cuisines Popularity
from sklearn.preprocessing import MultiLabelBinarizer
mlb=MultiLabelBinarizer()
df['cuisines_temp']=df['cuisines'].astype(str).apply(lambda r:r.replace(', ',',').split(','))
df_cuisines=pd.DataFrame(mlb.fit_transform(df['cuisines_temp']),columns=mlb.classes_)
df.drop('cuisines_temp',axis=1,inplace=True)
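
To make the encoding step concrete, the sketch below builds the same kind of one-column-per-cuisine indicator table on three made-up rows. It uses pandas' `Series.str.get_dummies` as a stand-in for `MultiLabelBinarizer`; this is an alternative shown for illustration, not the method used in the notebook, and the `cuisines` values are invented.

```python
import pandas as pd

# Hypothetical comma-separated cuisine strings
cuisines = pd.Series(['North Indian, Chinese', 'Chinese', 'Cafe, Desserts'])

# Normalize ', ' to ',' and expand into one indicator column per cuisine
flags = cuisines.str.replace(', ', ',', regex=False).str.get_dummies(sep=',')

# Column sums give the number of restaurants serving each cuisine
counts = flags.sum().sort_values(ascending=False)

print(flags)
print(counts)
```

Each row has a 1 in every cuisine column it lists, so summing a column counts the restaurants that serve that cuisine.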

In [101]:
# Top 10 popular cuisines
df_cus=df_cuisines.sum().sort_values(ascending=False).head(10).reset_index()
px.histogram(data_frame=df_cus,x='index', y=0,title='Top 10 Popular Cuisines')

In [102]:
#Rating and Cost Analysis by cuisines
df_cuisines = pd.concat([df_cuisines, df[['rate','cost','location','votes']]],axis=1)

In [201]:
# show rating analysis by cuisines
# drop helper columns on a copy so df_cuisines keeps 'rate', 'votes' and 'location' for later cells
px.imshow(df_cuisines.drop(columns=['votes','location']).groupby('rate').mean().sort_values(ascending=False , by='rate'))

- ***The most popular cuisines, and the ones most often rated above 3.75, are North Indian and Chinese***
----

In [82]:
# show cost analysis by cuisines
px.imshow(df_cuisines.drop(columns=['rate','votes','location']).groupby('cost').mean().sort_values(ascending=False , by='cost'))

- The heatmap shows how cuisines are distributed across cost levels. Cuisines such as ***['Asian', 'Chinese', 'Continental', 'European', 'Fast Food', 'French','Grill', 'Italian', 'Japanese', 'Kashmiri', 'Kerala', 'Konkan','Mangalorean', 'Mediterranean', 'Mughlai', 'North Indian', 'Seafood','South Indian', 'Steak', 'Thai', 'Vietnamese']*** appear at the higher cost levels, indicated by the yellow spots, suggesting premium offerings. In contrast, cuisines like "Kebabs" and "Street Food" generally appear at more affordable cost levels. Each cell is the share of restaurants at that cost level that serve the cuisine, so brighter colors indicate a larger share.
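
A complementary way to compare cuisines by price is to compute one average cost per cuisine directly from the indicator columns. The sketch below is a hypothetical illustration on made-up data: `flags` stands in for the one-hot cuisine columns and `cost` for the cost column.

```python
import pandas as pd

# Hypothetical indicator columns: 1 if the restaurant serves the cuisine
flags = pd.DataFrame({
    'Chinese':     [1, 0, 1, 0],
    'Street Food': [0, 1, 0, 1],
    'Italian':     [1, 0, 0, 0],
})
# Hypothetical cost for two people, aligned row by row with flags
cost = pd.Series([1200.0, 200.0, 800.0, 300.0])

# Total cost of restaurants serving each cuisine, divided by how many serve it
avg_cost = (flags.mul(cost, axis=0).sum() / flags.sum()).sort_values(ascending=False)

print(avg_cost)
```

This yields a single ranked number per cuisine, which is easier to read than a heatmap when only the ordering matters.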


-------
***Which types of restaurants are most common***

In [83]:
#Which types of restaurants are most common
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df['rest_type_temp'] = df['rest_type'].astype(str).apply(lambda x: x.replace(', ',',').split(','))
df_rest_type= pd.DataFrame(mlb.fit_transform(df['rest_type_temp']),columns=mlb.classes_)
df.drop('rest_type_temp',axis=1,inplace= True)
                       
    
df_rest_type_ = pd.concat([df_rest_type, df[['rate','cost']]],axis=1)


In [84]:
df_rest=df_rest_type.sum().sort_values(ascending=False).head(10).reset_index()
px.histogram(data_frame=df_rest,x='index',y=0,title= 'Top 10 types of restaurants')

***Which restaurant types are the most expensive on average?***


In [86]:
df_rest_type_.drop('rate',inplace=True,axis=1)
px.imshow(df_rest_type_.groupby('cost').mean().sort_values(ascending=False , by='cost'))

***- The most expensive restaurant types are fine dining and lounge***

-------

***Which areas in Bengaluru have the most restaurants?***


In [105]:
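# name the columns explicitly so the plot does not depend on the pandas version
df_location=df['location'].value_counts().head(10).rename_axis('location').reset_index(name='count')
px.histogram(data_frame=df_location,x='location',y='count',title ='Top 10 areas in Bengaluru with the most restaurants')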

In [187]:
# Top-rated locations: successful restaurants (target == 1) with more than 100 votes, ranked by average votes
top_location=df[(df['target']==1) & (df['votes']>100)]
top_location.groupby('location')['votes'].mean().sort_values(ascending=False).reset_index()['location'].head()

0            Church Street
1    Koramangala 5th Block
2            Sarjapur Road
3             Lavelle Road
4    Koramangala 4th Block
Name: location, dtype: object

In [190]:
px.scatter(top_location,x='votes',y='rate')

***Filter the dataframe to restaurants in the specified location with a rating above 3.75, a cost below 1000, and more than 100 votes, to find the cuisines served by the top restaurants in that location***

In [129]:
import pandas as pd

def get_top_cuisines(location, df_cuisines):
    # Filter the dataframe based on location, rating, cost, and votes
    top_loc = df_cuisines[(df_cuisines['location'] == location) & 
                          (df_cuisines['rate'] > 3.75) & 
                          (df_cuisines['cost'] < 1000) &
                          (df_cuisines['votes'] > 100)]
    
    # Keep only the one-hot cuisine columns
    cuisine_flags = top_loc.drop(columns=['rate', 'cost', 'location', 'votes'])
    
    # A cuisine is served in this subset if its column contains at least one 1
    filtered_columns = cuisine_flags.columns[(cuisine_flags == 1).any(axis=0)]
    
    return filtered_columns

# Example usage:
location = input('Enter location: ')
top_cuisines_columns = get_top_cuisines(location, df_cuisines)
print(top_cuisines_columns)

Enter location: BTM
Index(['American', 'Andhra', 'Arabian', 'Asian', 'BBQ', 'Bakery', 'Bengali',
       'Beverages', 'Bihari', 'Biryani', 'Burger', 'Cafe', 'Charcoal Chicken',
       'Chinese', 'Continental', 'Desserts', 'Fast Food', 'Healthy Food',
       'Hyderabadi', 'Ice Cream', 'Italian', 'Juices', 'Kebab', 'Kerala',
       'Lebanese', 'Mexican', 'Middle Eastern', 'Mithai', 'Momos', 'Mughlai',
       'North Indian', 'Oriya', 'Pizza', 'Rajasthani', 'Rolls', 'Sandwich',
       'Seafood', 'South Indian', 'Steak', 'Street Food', 'Tea', 'Thai'],
      dtype='object')
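
The output above lists every cuisine served by at least one qualifying restaurant, which can be a long list. A possible refinement is to rank cuisines by how many qualifying restaurants serve them. The sketch below is a hypothetical variant on made-up data: `rank_cuisines`, its threshold parameters, and the `toy` frame are illustrative names, with defaults mirroring the thresholds described above.

```python
import pandas as pd

def rank_cuisines(location, frame, min_rate=3.75, max_cost=1000, min_votes=100):
    """Count qualifying restaurants per cuisine in one location, most common first."""
    mask = ((frame['location'] == location)
            & (frame['rate'] > min_rate)
            & (frame['cost'] < max_cost)
            & (frame['votes'] > min_votes))
    # Keep only the one-hot cuisine columns of the qualifying rows
    flags = frame.loc[mask].drop(columns=['location', 'rate', 'cost', 'votes'])
    counts = flags.sum().sort_values(ascending=False)
    # Drop cuisines that no qualifying restaurant serves
    return counts[counts > 0]

# Hypothetical frame shaped like df_cuisines: indicator columns plus metadata
toy = pd.DataFrame({
    'Chinese':  [1, 1, 0, 1],
    'Pizza':    [0, 1, 0, 0],
    'Thai':     [0, 0, 1, 0],
    'location': ['BTM', 'BTM', 'BTM', 'HSR'],
    'rate':     [4.0, 4.2, 3.5, 4.5],
    'cost':     [500.0, 700.0, 400.0, 600.0],
    'votes':    [150, 300, 500, 900],
})

ranked = rank_cuisines('BTM', toy)
print(ranked)
```

Taking `ranked.head(5)` would then give a short list of the most common cuisines among well-rated, affordable restaurants in the area.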


In [143]:
# Average Number of Restaurant Types per Location
average_count_rest_type = df.groupby('location')['rest_type'].nunique().mean()
print(f"Average Number of Restaurant Types per Location: {average_count_rest_type:.2f}")

Average Number of Restaurant Types per Location: 14.00


------

## Success Factors
- Rating Analysis: Investigate how rate varies across different cuisines, rest_type, and location. Use visualization tools to create bar charts or heatmaps.
- Cost vs. Rating: Analyze the relationship between cost and rate to understand if higher-priced restaurants receive better ratings.
- Online Order and Booking: Assess the impact of online_order and book_table options on restaurant ratings and reviews.

In [118]:
# Rating Analysis by rest type
df_rest_type =pd.concat([df_rest_type,df['rate']],axis=1)

In [119]:
df_rest_type=df_rest_type[df_rest_type['rate']>=3.7]

px.imshow(df_rest_type.groupby('rate').sum().sort_values(ascending=False , by='rate'))

***The top-rated restaurant types are Quick Bites and Casual Dining***

----

In [146]:
import plotly.express as px
import pandas as pd

# Group by location and calculate mean for rate and votes
grouped_df = df.groupby('location')[['rate', 'votes']].mean().reset_index().sort_values(ascending=False, by='votes').head(10)

# Create the bar plot
fig = px.bar(grouped_df, x='location', y='votes', color='rate',
             title='Top 10 locations by average votes, colored by average rating', labels={'location':'Location', 'votes':'Votes'},
             category_orders={"location": grouped_df['location']})

# Update layout for better visualization
fig.update_layout(xaxis_tickangle=-90, width=1000, height=600)

fig.show()


In [218]:
grouped_df.head()

|  | location | rate | votes |
|---|---|---|---|
| 12 | Church Street | 3.992125 | 1089.705128 |
| 50 | Lavelle Road | 4.141545 | 1050.402923 |
| 44 | Koramangala 5th Block | 4.006925 | 964.641115 |
| 43 | Koramangala 4th Block | 3.918668 | 814.692033 |
| 80 | St. Marks Road | 4.017201 | 775.798834 |


In [176]:
grouped_df['location'].tolist()

['Church Street',
 'Lavelle Road',
 'Koramangala 5th Block',
 'Koramangala 4th Block',
 'St. Marks Road',
 'Koramangala 3rd Block',
 'Indiranagar',
 'Cunningham Road',
 'MG Road',
 'Residency Road']

---------------------

In [144]:
#Cost vs. Rating
px.scatter(data_frame=df, x='cost',y='rate')

***The scatter plot suggests that higher-priced restaurants tend to receive better ratings***
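
One way to put a number on the cost and rating relationship seen in the scatter plot is a rank correlation, which measures whether ratings tend to rise as cost rises without assuming a straight-line relationship. The sketch below computes Spearman's rank correlation as the Pearson correlation of the ranks, on a small made-up `toy` frame; the values are hypothetical and the result says nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical cost and rating pairs
toy = pd.DataFrame({
    'cost': [200.0, 400.0, 800.0, 1500.0, 3000.0],
    'rate': [3.2, 3.6, 3.5, 4.1, 4.4],
})

# Spearman's rho is the Pearson correlation computed on the ranks
rho = toy['cost'].rank().corr(toy['rate'].rank())

print(f"Spearman rank correlation: {rho:.2f}")
```

A value close to 1 would support the reading that pricier restaurants tend to be rated higher, while a value near 0 would not.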


-----

In [145]:
from plotly.subplots import make_subplots

# Create the scatter plot for 'book_table'
fig1 = px.scatter(df, x='rate', y='cost', color='book_table',
                  title='Cost vs Rate by Book Table', color_continuous_scale='reds')

# Create the scatter plot for 'online_order'
fig2 = px.scatter(df, x='rate', y='cost', color='online_order',
                  title='Cost vs Rate by Online Order', color_continuous_scale='reds')

# Combine both plots into a subplot
fig = make_subplots(rows=1, cols=2, subplot_titles=('Cost vs Rate by Book Table', 'Cost vs Rate by Online Order'))

# Add traces
for trace in fig1['data']:
    fig.add_trace(trace, row=1, col=1)

for trace in fig2['data']:
    fig.add_trace(trace, row=1, col=2)

# Update layout
fig.update_layout(height=500, width=1000, title_text="Online Order and Table Booking by Rating")

fig.show()


***- Cost vs. rate by table booking: higher-priced restaurants tend to offer a table booking option***

***- There is no clear relationship between cost and rate by online order***
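
The visual comparison can be backed by a simple grouped summary: average cost and rating for restaurants with and without table booking. The sketch below uses a made-up `toy` frame with the same column names as the dataset; the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical restaurants with and without table booking
toy = pd.DataFrame({
    'book_table': ['Yes', 'Yes', 'No', 'No', 'No'],
    'cost':       [1500.0, 2000.0, 300.0, 500.0, 400.0],
    'rate':       [4.2, 4.4, 3.6, 3.8, 3.7],
})

# Mean cost and rating per booking option
summary = toy.groupby('book_table')[['cost', 'rate']].mean()

print(summary)
```

Running the same grouping with `online_order` in place of `book_table` gives the matching summary for online ordering.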


---------------------

## Market Trends Integration
- Healthy and Organic Eating: Filter the dataset for keywords related to healthy and organic foods in the cuisines and dish_liked columns.
- Experiential Dining: Identify restaurants offering unique experiences by analyzing the rest_type and menu_item columns.

In [162]:
# Top 5 healthy and organic restaurants by average votes
healthy_keywords = ['vegan', 'organic', 'healthy']
df['is_healthy'] = df['cuisines'].astype(str).apply(lambda x: any(k in x.lower() for k in healthy_keywords))
healthy_restaurants = df[df['is_healthy']]
healthy_restaurants.groupby(['name','location'])[['rate','votes']].mean().sort_values(ascending=False,by='votes').reset_index().head(5)

|  | name | location | rate | votes |
|---|---|---|---|---|
| 0 | Green Theory | Residency Road | 4.204762 | 2849.571429 |
| 1 | Fava | Lavelle Road | 4.4 | 2047.611111 |
| 2 | Leon Grill | Jeevan Bhima Nagar | 4.2 | 1327.0 |
| 3 | Go Native | Jayanagar | 4.3 | 1266.454545 |
| 4 | The Yogisthaan Cafe | Indiranagar | 4.2 | 1208.0 |
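
The keyword filter used for healthy restaurants can also be expressed with pandas string methods. The sketch below is an illustrative alternative on made-up cuisine strings: it joins the keywords into one pattern and lets `str.contains` handle case and missing values. It assumes the keywords contain no regular-expression special characters, which holds for the three used here.

```python
import pandas as pd

healthy_keywords = ['vegan', 'organic', 'healthy']

# Hypothetical cuisine strings, including a missing value
cuisines = pd.Series(['Healthy Food, Salad', 'North Indian', None, 'Cafe, Vegan'])

# 'vegan|organic|healthy' matches a row if any keyword appears in it
pattern = '|'.join(healthy_keywords)
is_healthy = cuisines.str.contains(pattern, case=False, na=False)

print(is_healthy.tolist())
```

Passing `na=False` treats missing cuisines as non-matching, so no separate conversion to strings is needed.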


In [156]:
px.scatter(healthy_restaurants, x= 'rate', y='cost')

***Among healthy restaurants, the scatter plot suggests that higher-priced restaurants tend to receive better ratings***

----

In [170]:
experiential_keywords = ['rooftop', 'alfresco', 'fine dining']
df['is_experiential'] = df['rest_type'].astype(str).apply(lambda x: any(k in x.lower() for k in experiential_keywords))
experiential_restaurants = df[df['is_experiential']]
experiential_restaurants.groupby(['name','location'])[['rate','votes','cost']].mean().sort_values(ascending=False,by='votes').reset_index().head(9)

|  | name | location | rate | votes | cost |
|---|---|---|---|---|---|
| 0 | Yauatcha | MG Road | 4.6 | 2353.666667 | 2800.0 |
| 1 | Shiro | Lavelle Road | 4.4 | 2236.785714 | 3000.0 |
| 2 | JW Kitchen - JW Marriott Bengaluru | Lavelle Road | 4.4 | 2128.8 | 2200.0 |
| 3 | Caperberry | Lavelle Road | 4.6 | 1433.4 | 2200.0 |
| 4 | Tandoor | MG Road | 3.9 | 1367.416667 | 2100.0 |
| 5 | Rim Naam - The Oberoi | MG Road | 4.6 | 983.833333 | 3000.0 |
| 6 | Sly Granny - The Community House | Indiranagar | 4.44 | 963.8 | 2500.0 |
| 7 | Feast - Sheraton Grand Bangalore Hotel at Brig... | Malleshwaram | 4.3 | 958.25 | 2000.0 |
| 8 | Kaz... | Lavelle Road | 4.4 | 890.2 | 3000.0 |


In [171]:
px.scatter(experiential_restaurants, x= 'rate', y='cost')

***- All top experiential restaurants have high ratings, and there are few that go below 3.7. This indicates that the experiential dining options in the dataset are consistently well-regarded.***

In [172]:
experiential_restaurants['book_table'].value_counts()

Yes    325
No      76
Name: book_table, dtype: int64

***- A majority of experiential dining venues offer advance table reservations.***

-------
## Data Preparation for Machine Learning


In [223]:
# Remove the unnecessary Columns
df.drop(["name" , "cuisines" ,"rest_type" ] , axis = 1 , inplace = True)

In [231]:
df_location_counts=df['location'].value_counts(normalize=True)*100

Desired_Index = df_location_counts[df_location_counts.values > 0.5].index

def Reduce_Location(r):
    if r in Desired_Index:
        return r
    else:
        return "other"
    
df["location"] =df["location"].apply(Reduce_Location)
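
The same grouping of rare locations into "other" can be written without a row-wise function. The sketch below is an illustrative alternative on made-up data: the `locations` values are hypothetical, and the 15% cutoff is chosen only so the tiny example has one rare category (the notebook itself uses 0.5%).

```python
import pandas as pd

# Hypothetical location column: two frequent areas and one rare one
locations = pd.Series(['BTM'] * 6 + ['HSR'] * 3 + ['Rare Place'])

# Share of each location in percent
share = locations.value_counts(normalize=True) * 100

# Locations above the cutoff keep their name, the rest become 'other'
frequent = share[share > 15].index
reduced = locations.where(locations.isin(frequent), 'other')

print(reduced.value_counts())
```

Reducing the number of distinct locations this way keeps later one-hot encoding from producing many near-empty columns.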

In [237]:
df.to_csv("Zomato_app_analysis.csv" , index=False)