## Introduction

### Welcome!
Hello everyone, this is my second full end-to-end Kaggle kernel. In it, we will learn about visualization using the Zomato Bangalore restaurants dataset. My goal is to learn and contribute to the data science community.

I have referred to some of the best kernels on this dataset, to name a few:
* <a href="https://www.kaggle.com/shahules/zomato-complete-eda-and-lstm-model"> Zomato Complete EDA and LSTM model</a>
* <a href="https://www.kaggle.com/akshayjhamb2/zomato-restaurants-eda-and-dashboard-to-search">
Zomato Restaurants and their dishes EDA</a>

Also, you can look at my previous work on a basic framework for any data science competition, along with the use of automated machine learning for a regression problem:
 <a href="https://www.kaggle.com/akshay1296/house-price-prediction-with-eda-visualization-tpot"> House Price prediction with EDA+Visualization+TPOT</a>

## Understanding the problem
The basic idea of analyzing the Zomato dataset is to get a fair idea of the factors affecting the establishment of different types of restaurants at different places in Bengaluru, as well as the aggregate rating of each restaurant. Bengaluru has more than 12,000 restaurants, serving dishes from all over the world. With new restaurants opening every day, the industry hasn't been saturated yet and demand is increasing day by day. In spite of the increasing demand, it has become difficult for new restaurants to compete with established ones, as most of them serve the same food. Bengaluru is an IT capital of India, and most people here depend mainly on restaurant food as they don't have time to cook for themselves. With such overwhelming demand for restaurants, it has become important to study the demography of a location by looking at factors such as:

* Location of the restaurant
* Approximate price of food
* Whether the restaurant is theme-based or not
* Which locality of the city has the most restaurants serving a given cuisine
* The needs of people who are striving to get the best cuisine in the neighborhood
* Whether a particular neighborhood is famous for its own kind of food

**Let's get started**

### Importing required libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import squarify
import matplotlib
import re
from wordcloud import WordCloud, STOPWORDS

### Load the data

In [None]:
zomato_df = pd.read_csv('/kaggle/input/zomato-bangalore-restaurants/zomato.csv')

In [None]:
zomato_df.head()

## Performing EDA to know more about data

In [None]:
zomato_df.info()

Based on our understanding of the data, we can treat address as the unique key that identifies distinct restaurants.

In [None]:
g = zomato_df.groupby('address')
# Keep only addresses that occur more than once, then inspect one example restaurant
dupes = g.filter(lambda x: len(x) > 1)
g1 = dupes[dupes['name']=='Jalsa']
g1[g1.address=='942, 21st Main Road, 2nd Stage, Banashankari, Bangalore']

1) From the above example we can see that the same restaurant appears with different URLs, listed_in(type) and listed_in(city) values<br/>
2) We also observe different vote counts, i.e. 775, 783 and 804, for the above example. My guess is that the row with the most votes is the latest, since the vote count only increases over time, but we don't have the date and time at which the data was scraped to confirm this. I won't be removing the rows with earlier vote counts, as their reviews differ and we can use them later<br/>
3) Based on these observations, I will be removing the url, listed_in(type) and listed_in(city) columns<br/>
4) Also, listed_in(type) is a more generalized form of rest_type, so we won't be losing information by removing it
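
The effect of dropping the listing-specific columns can be checked on a small example. The sketch below uses a made-up three-row frame (the values are illustrative, not taken from the dataset): rows that differ only in the dropped columns collapse into one, while rows with different vote counts survive.

```python
import pandas as pd

# Toy data: the same restaurant scraped under several listings (illustrative values)
toy = pd.DataFrame({
    'url': ['u1', 'u2', 'u3'],
    'address': ['942, 21st Main Road'] * 3,
    'name': ['Jalsa'] * 3,
    'votes': [775, 775, 804],
    'listed_in(type)': ['Buffet', 'Delivery', 'Dine-out'],
})

# Drop the listing-specific columns, then remove rows that are now identical
deduped = toy.drop(['url', 'listed_in(type)'], axis=1).drop_duplicates()
print(deduped)
```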

In [None]:
zomato_df1 = zomato_df.drop(['url', 'listed_in(type)', 'listed_in(city)'], axis=1).reset_index().drop(['index'], axis=1).drop_duplicates().copy()

## Now let's start analyzing the data with help of visualizations

#### Top restaurant chains in Bangalore

In [None]:
plt.figure(figsize=(10, 7))
sns.set_style('white')
restaurants = zomato_df1.groupby(['address','name'])
chains= restaurants.name.nunique().index.to_frame()['name'].value_counts()[:15]
ax = sns.barplot(x= chains, y = chains.index, palette='Blues_d')
sns.despine()
plt.title('Top 15 restaurant chains in Bangalore')
plt.xlabel('Number of outlets')
plt.ylabel('Name of restaurants')
for p in ax.patches:
    width = p.get_width()
    ax.text(width+0.007, p.get_y() + p.get_height() / 2. + 0.2, format(width), 
            ha="left", color='black')
plt.show()

CCD has the most outlets in Bangalore, followed by Domino's Pizza
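
The chain count above treats every distinct (address, name) pair as one outlet. A minimal sketch of the same idea on made-up data, using `drop_duplicates` instead of the groupby index (the restaurant names and addresses here are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'address': ['A St', 'A St', 'B St', 'C St', 'C St'],
    'name': ['Cafe X', 'Cafe X', 'Cafe X', 'Pizza Y', 'Pizza Y'],
})

# One row per distinct (address, name) pair, i.e. one row per outlet
outlets = toy.drop_duplicates(subset=['address', 'name'])

# Number of outlets per restaurant name
chain_counts = outlets['name'].value_counts()
print(chain_counts)
```

Repeated rows for the same outlet are counted once, so a restaurant with many scraped listings at one address does not look like a chain.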

#### Top cuisines in Bangalore

In [None]:
#Preprocessing cuisines
cuisines_p = zomato_df1.groupby(['address','cuisines']).cuisines.nunique().index.to_frame()
tmp = pd.DataFrame()
tmp = cuisines_p.cuisines.str.strip().str.split(',', expand=True)

In [None]:
# Combine every split column (not just a hard-coded number of them) and count each cuisine
cuisines=pd.concat([tmp[col].str.strip() for col in tmp.columns]).value_counts()

In [None]:
plt.figure(figsize=(10, 7))
sns.set_style('white')
cuisine= cuisines[:15]
ax = sns.barplot(x= cuisine, y = cuisine.index, palette='Blues_d')
sns.despine()
plt.title('Top 15 cuisines served in Bangalore')
plt.xlabel('Number of restaurants')
plt.ylabel('Name of cuisines')
total = len(cuisines_p)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y),
        ha="left", color='black')
plt.show()

In [None]:
cuisines[-15:]

In [None]:
cuisines[cuisines.index=='Healthy Food']

1) Around 40% of restaurants serve North Indian cuisine, followed by Chinese and South Indian<br/>
2) Mongolian, Russian and Australian (other country-specific) cuisines are the rarest to find in Bangalore, which may reflect limited demand for them<br/>
3) There are 200+ restaurants in Bangalore serving healthy food
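
The percentages above are shares of restaurants, and a restaurant serving several cuisines counts toward each of them, so the shares can add up to more than 100%. A small sketch of this on made-up data, using `explode` as an alternative to concatenating the split columns:

```python
import pandas as pd

toy = pd.DataFrame({
    'address': ['A St', 'B St', 'C St', 'D St'],
    'cuisines': ['North Indian, Chinese', 'Chinese',
                 'North Indian, South Indian, Chinese', 'Cafe'],
})

# Split on commas, give every cuisine its own row, and trim whitespace
exploded = toy['cuisines'].str.split(',').explode().str.strip()

# Share of restaurants serving each cuisine (one restaurant can count toward several)
share = exploded.value_counts() / len(toy) * 100
print(share.round(1))
```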

#### Top restaurant types in Bangalore

In [None]:
#Preprocessing Restaurant Types
rest_t = zomato_df1.groupby(['address','rest_type']).rest_type.nunique().index.to_frame()
tmp_r = pd.DataFrame()
tmp_r = rest_t.rest_type.str.strip().str.split(',', expand=True)
tmp_r.shape

In [None]:
rest_types=pd.DataFrame()
rest_types=pd.concat([tmp_r.iloc[:,0].str.strip(), tmp_r.iloc[:,1].str.strip()]).value_counts()
rest_types

In [None]:
plt.figure(figsize=(10, 7))
sns.set_style('white')
restaurant_types = rest_types[:15]
ax = sns.barplot(x= restaurant_types, y = restaurant_types.index, palette='Blues_d')
sns.despine()
plt.title('Top 15 restaurant types in Bangalore')
plt.xlabel('Number of restaurants')
plt.ylabel('Restaurant types')
total = len(rest_t)
for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y),
        ha="left", color='black')
plt.show()

1) Quick Bites, Casual Dining and Delivery are the most common restaurant types in Bangalore<br/>
2) With Bangalore being an IT hub, it is no surprise that Delivery comes 3rd among restaurant types<br/>
3) Bhojanalya is the rarest restaurant type to find in Bangalore

#### Top locations for foodies in Bangalore

In [None]:
loc_t = zomato_df1.groupby(['address','location']).location.nunique().index.to_frame()
# Count restaurants per location, then pick every location whose name contains 'Koramangala'
loc_counts = loc_t['location'].value_counts()
koramangala = loc_counts[loc_counts.index.str.contains('Koramangala')]
print(koramangala, "\n", "Total number of restaurants in Koramangala", koramangala.sum())

In [None]:
plt.figure(figsize=(10, 7))
sns.set_style('white')
locations= loc_t['location'].value_counts()[:15]
ax = sns.barplot(x= locations, y = locations.index, palette='Blues_d')
sns.despine()
plt.title('Top 15 locations for foodies in Bangalore')
plt.xlabel('Number of restaurants')
plt.ylabel('Name of Location')
for p in ax.patches:
    width = p.get_width()
    ax.text(width+0.007, p.get_y() + p.get_height() / 2. + 0.2, format(width), 
            ha="left", color='black')
plt.show()

1) Whitefield, BTM, Electronic City, Marathahalli and HSR have the most restaurants<br/>
2) Whitefield leads among individual locations with more than 700 restaurants<br/>
3) Koramangala (combining all its blocks, as they all fall under the Koramangala area) has 868 restaurants, more than any other area
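
Combining blocks into their parent area, as done for Koramangala in point 3, can be applied to all locations at once. The sketch below assumes block names follow the pattern "Area 5th Block"; the counts are made up for illustration and are not the dataset's values.

```python
import pandas as pd

# Illustrative counts per location (in the notebook these would come from value_counts)
loc_counts = pd.Series({
    'Whitefield': 700,
    'Koramangala 5th Block': 300,
    'Koramangala 7th Block': 200,
    'BTM': 650,
})

# Strip a trailing "<n>th Block" so every block maps to its parent area
area = loc_counts.index.to_series().str.replace(
    r'\s+\d+(?:st|nd|rd|th) Block$', '', regex=True)

# Sum the counts of all locations that share the same area name
area_counts = loc_counts.groupby(area).sum().sort_values(ascending=False)
print(area_counts)
```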

#### Common cuisines location-wise

In [None]:
df_1=zomato_df1.groupby(['location','cuisines']).agg('count')
data=df_1.sort_values(['address'],ascending=False).groupby(['location'],
                as_index=False).apply(lambda x : x.sort_values(by="address",ascending=False).head(3))['address'].reset_index().rename(columns={'address':'count'})

In [None]:
data.tail(10)

#### Top dishes served in Bangalore

In [None]:
#Preprocessing Dish liked
dish_t = zomato_df1.groupby(['address','dish_liked']).dish_liked.nunique().index.to_frame()
tmp_d = pd.DataFrame()
tmp_d = dish_t.dish_liked.str.strip().str.split(',', expand=True)
tmp_d.shape

In [None]:
dish_liked=pd.DataFrame()
dish_liked=pd.concat([tmp_d.iloc[:,0].str.strip(), tmp_d.iloc[:,1].str.strip()]).value_counts()
dish_liked

In [None]:
plt.figure(figsize=(10, 7))
sns.set_style('white')
dishes = dish_liked[:15]
ax = sns.barplot(x= dishes, y = dishes.index, palette='Blues_d')
sns.despine()
plt.title('Top 15 commonly served dishes in Bangalore')
plt.xlabel('Number of restaurants')
plt.ylabel('Name of dishes')
total = len(dish_t)

for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y),
        ha="left", color='black')
plt.show()


1) In general there are more Quick Bites restaurants in Bangalore, which is why Burgers and Pasta show up among the most served dishes<br/>
2) Biryani is the most popular Indian dish served in Bangalore

**Let's see the trend by restaurant type**

In [None]:
def pre_dish_liked(restaurant_type):
    # Preprocess dish_liked: one row per (address, dish_liked, rest_type) combination
    dish_rest_type = zomato_df1.groupby(['address','dish_liked','rest_type']).dish_liked.nunique().index.to_frame()
    tmp_d = dish_rest_type[dish_rest_type['rest_type']==restaurant_type].dish_liked.str.strip().str.split(',', expand=True)
    # Count the first two listed dishes of every restaurant and keep the ten most common
    dish_liked = pd.concat([tmp_d.iloc[:,0].str.strip(), tmp_d.iloc[:,1].str.strip()]).value_counts()
    top_dishes = dish_liked[:10]
    df = pd.DataFrame({'nb_people': top_dishes, 'group': top_dishes.index})

    # Use matplotlib to scale the counts between their min and max, then map this scale to colours
    norm = matplotlib.colors.Normalize(vmin=min(top_dishes), vmax=max(top_dishes))
    colors = [matplotlib.cm.Blues(norm(value)) for value in top_dishes]

    squarify.plot(sizes=df['nb_people'], label=df['group'], alpha=.8, color=colors)
    plt.title("Top 10 dishes served in "+restaurant_type, fontsize=15, fontweight="bold")
    plt.axis('off')
    plt.show()

In [None]:
pre_dish_liked('Quick Bites')
pre_dish_liked('Casual Dining')
pre_dish_liked('Cafe')
pre_dish_liked('Delivery')
pre_dish_liked('Dessert Parlor')

Burgers are the only dish that ranks among the most common across the Quick Bites, Cafe and Delivery restaurant types

#### Wordcloud of dishes liked in top restaurants for different restaurant types

In [None]:
df_1=zomato_df1.groupby(['rest_type','name']).agg('count')
datas=df_1.sort_values(['address'],ascending=False).groupby(['rest_type'],
                as_index=False).apply(lambda x : x.sort_values(by="address",ascending=False).head(5))['address'].reset_index().rename(columns={'address':'count'})
datas

In [None]:
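import ast

all_ratings = []

for name,ratings in zip(zomato_df['name'],zomato_df['reviews_list']):
    # Parse the stringified list of (score, review) tuples without executing it as code
    ratings = ast.literal_eval(ratings)
    for score, doc in ratings:
        if score:
            # Remove only the leading "Rated"/"RATED" marker; str.strip would also
            # remove matching letters from the end of the text
            score = re.sub(r'^Rated', '', score).strip()
            doc = re.sub(r'^RATED', '', doc).strip()
            score = float(score)
            all_ratings.append([name,score, doc])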

In [None]:
rating_df=pd.DataFrame(all_ratings,columns=['name','rating','review'])
# Raw string so that \s is passed to the regex engine instead of being treated as a string escape
rating_df['review']=rating_df['review'].apply(lambda x : re.sub(r'[^a-zA-Z0-9\s]'," ",x))

In [None]:
rating_df.head()
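
The review parsing step can be tried on a single cell in isolation. This is a minimal sketch, assuming each `reviews_list` cell is a string holding a Python list of (score, text) tuples; the sample string is made up. It parses with `ast.literal_eval` and removes the marker words with an anchored regex, because `str.strip('RATED')` removes any of the characters R, A, T, E, D from both ends and would also trim a review that ends in, say, "BREAD".

```python
import ast
import re

# Assumed shape of one reviews_list cell (illustrative content)
raw = "[('Rated 4.0', 'RATED\\n  Great food, DELICIOUS BREAD'), (None, 'RATED\\n  no score given')]"

parsed = []
for score, text in ast.literal_eval(raw):
    if score:
        # Remove only the leading marker word, leaving the rest of the text untouched
        score = float(re.sub(r'^Rated', '', score).strip())
        text = re.sub(r'^RATED', '', text).strip()
        parsed.append((score, text))

print(parsed)
```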

In [None]:
def produce_wordcloud(r):
    print(r + ' as a restaurant type')
    plt.figure(figsize=(20, 30))

    # One word cloud per top restaurant of this type (datas holds the top 5 per type)
    for j, n in enumerate(datas[datas['rest_type'] == r].name):
        plt.subplot(2, 5, j+1)
        # Exclude the restaurant's own name from its word cloud
        stopword_t = set(STOPWORDS)
        stopword_t.update(n.split(" "))
        corpus = rating_df[n == rating_df.name]['review'].values.tolist()
        corpus = ' '.join(x for x in corpus)

        wordcloud = WordCloud(stopwords=stopword_t, max_font_size=None, background_color='black', collocations=False,
                              width=1500, height=1500).generate(corpus)
        plt.imshow(wordcloud)
        plt.title(n)
        plt.axis("off")

In [None]:
produce_wordcloud('Quick Bites')

In [None]:
produce_wordcloud('Casual Dining')

In [None]:
produce_wordcloud('Cafe')

In [None]:
produce_wordcloud('Delivery')

In [None]:
produce_wordcloud('Dessert Parlor')

In [None]:
produce_wordcloud('Casual Dining, Bar')

In [None]:
produce_wordcloud('Bakery')

In [None]:
produce_wordcloud('Beverage Shop')

#### % of restaurants accepting online order

In [None]:
online_order_t = zomato_df1.groupby(['address','online_order'])
online_orders=online_order_t['online_order'].nunique().index.to_frame()

In [None]:
rest_online_orders = online_orders.online_order.value_counts()
cmap = plt.get_cmap("tab20")
inner_colors = cmap(np.array([0, 1]))
plt.pie(rest_online_orders, labels=rest_online_orders.index, autopct='%1.1f%%', shadow=True, colors=inner_colors)
plt.axis('equal')
plt.show()

48.4% of restaurants still do not accept online orders, which means Zomato needs to put in effort to capture the remaining market
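
The shares shown in the pie chart can also be read off as numbers. A small sketch on made-up values, using `value_counts(normalize=True)`:

```python
import pandas as pd

toy = pd.Series(['Yes', 'No', 'Yes', 'Yes', 'No'], name='online_order')

# normalize=True returns fractions instead of raw counts
pct = toy.value_counts(normalize=True) * 100
print(pct.round(1))
```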

#### % of restaurants allowing table booking

In [None]:
booking_t = zomato_df1.groupby(['address','book_table'])
table_booking=booking_t['book_table'].nunique().index.to_frame()

In [None]:
online_booking = table_booking.book_table.value_counts()
cmap = plt.get_cmap("tab20")
inner_colors = cmap(np.array([0, 1]))
plt.pie(online_booking, labels=online_booking.index, autopct='%1.1f%%', shadow=True, colors=inner_colors)
plt.axis('equal')
plt.show()

1) 92.4% of restaurants do not provide a table booking facility<br/>
2) Table booking is mostly offered by 3-star and above restaurants; we'll dig deeper into this by comparing it with the cost

#### Distribution of cost for two people

In [None]:
cost_t = zomato_df1.groupby(['address','approx_cost(for two people)'])
cost=cost_t['approx_cost(for two people)'].nunique().index.to_frame()

In [None]:
cost['approx_cost(for two people)'] = cost['approx_cost(for two people)'].str.replace(',', '').astype(float)

In [None]:
plt.figure(figsize=(6,6))
cost_dist=cost['approx_cost(for two people)'].dropna()
sns.distplot(cost_dist,bins=20,kde_kws={"color": "k", "lw": 3, "label": "KDE"})
plt.show();

Very few restaurants cost more than 1000 for two people. Overall, the cost distribution is right-skewed.
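
Right skew can be checked numerically as well as visually. The sketch below uses made-up cost strings with the comma thousands separator seen in this column; positive skewness and a mean above the median both point to a long right tail.

```python
import pandas as pd

# Illustrative costs, stored as strings with a comma thousands separator
raw_cost = pd.Series(['300', '400', '500', '600', '800', '1,200', '3,000'])

cost = raw_cost.str.replace(',', '', regex=False).astype(float)

# Positive skewness means a long right tail
print('skewness:', round(cost.skew(), 2))
print('mean:', round(cost.mean(), 1), 'median:', cost.median())
```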

#### Exploring the relationship between cost and restaurants by availability of table booking facility

In [None]:
cost_book_t = zomato_df1.groupby(['address','book_table','approx_cost(for two people)'])
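# Select several columns with a list (double brackets); tuple-style selection fails on recent pandas
cost_booking=cost_book_t[['book_table','approx_cost(for two people)']].nunique().index.to_frame()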

In [None]:
cost_booking['approx_cost(for two people)'] = cost_booking['approx_cost(for two people)'].str.replace(',', '').astype(float)

In [None]:
sns.boxplot(x='book_table',y='approx_cost(for two people)',data=cost_booking)

plt.show()

In [None]:
cost_booking.columns

In [None]:
cost_booking[cost_booking.book_table=="Yes"]['approx_cost(for two people)'].describe()

In [None]:
cost_booking[cost_booking.book_table=="No"]['approx_cost(for two people)'].describe()

As we can see, restaurants providing a booking facility are on average about three times costlier than those without it
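
The "three times costlier" comparison comes from setting the two `describe()` outputs side by side. A compact sketch of the same comparison on made-up numbers (the column name `cost` is a shorthand used only in this example):

```python
import pandas as pd

toy = pd.DataFrame({
    'book_table': ['Yes', 'Yes', 'No', 'No', 'No'],
    'cost': [1500.0, 1200.0, 400.0, 500.0, 450.0],
})

# Mean and median cost per group, then the ratio of the two means
summary = toy.groupby('book_table')['cost'].agg(['mean', 'median'])
ratio = summary.loc['Yes', 'mean'] / summary.loc['No', 'mean']
print(summary)
print('mean cost ratio (Yes / No):', round(ratio, 2))
```

Since the cost distribution is skewed, it may be worth comparing the medians too, as they are less affected by a few very expensive restaurants.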

#### Checking if there is any difference between ratings of restaurants accepting and not accepting online orders

In [None]:
online_vote_t = zomato_df1.groupby(['address','online_order','rate'])
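# Select several columns with a list (double brackets); tuple-style selection fails on recent pandas
online_votes=online_vote_t[['online_order','rate']].nunique().index.to_frame()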

In [None]:
# Mann-Whitney U test
from scipy.stats import mannwhitneyu

# The rating distributions are not normal, so we use the non-parametric Mann-Whitney U test.
# Ratings look like '4.1/5'; shorter placeholder strings are skipped by the length check.
data1 = online_votes[online_votes['online_order']=='Yes']['rate'].dropna().apply(lambda x : float(x.split('/')[0]) if (len(x)>3)  else np.nan ).dropna()
data2 = online_votes[online_votes['online_order']=='No']['rate'].dropna().apply(lambda x : float(x.split('/')[0]) if (len(x)>3)  else np.nan ).dropna()
# compare samples (two-sided: H0 is that both groups come from the same distribution)
stat, p = mannwhitneyu(data1, data2, alternative='two-sided')
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Same distribution (fail to reject H0)')
else:
    print('Different distribution (reject H0)')

In [None]:
plt.figure(figsize=(6,5))
sns.distplot(data1,bins=20,kde_kws={"color": "g", "lw": 3, "label": "Accepting online orders"})
sns.distplot(data2,bins=20,kde_kws={"color": "k", "lw": 3, "label": "Not accepting online orders"})
plt.show()

Hence, it is statistically evident that the ratings of the two groups come from different distributions. One possible explanation is that restaurants accepting online orders collect more ratings from customers, since the Zomato application prompts for a rating after each order.
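
The Mann-Whitney U test says whether the two distributions differ, not by how much. One way to express the size of the difference is the probability of superiority, which is the U statistic of the first sample divided by the number of pairs. The sketch below computes it directly on made-up ratings; the function name is hypothetical and the pairwise loop is only suitable for small samples.

```python
def prob_superiority(a, b):
    """Share of (x, y) pairs with x > y, counting ties as half a pair."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))

# Illustrative ratings, not taken from the dataset
online = [3.9, 4.1, 3.8, 4.0]
offline = [3.5, 3.9, 3.6]

ps = prob_superiority(online, offline)
print(round(ps, 3))
```

A value near 0.5 means neither group tends to be rated higher, even if a large sample makes the p-value very small.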

#### Distribution of rating

In [None]:
rating_t = zomato_df1.groupby(['address','rate'])
plt.figure(figsize=(6,5))
rating=rating_t.rate.nunique().index.to_frame()['rate'].dropna().apply(lambda x : float(x.split('/')[0]) if (len(x)>3)  else np.nan ).dropna()
sns.distplot(rating,bins=20,kde_kws={"color": "k", "lw": 3, "label": "KDE"})
plt.show()

In [None]:
rating.describe()

The majority of restaurants are rated above 3.5, and the distribution is negatively (left) skewed
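
The `len(x) > 3` check used when parsing `rate` relies on placeholder values being short strings. A sketch of an alternative that does not depend on string length, assuming values look like `'4.1/5'`, `'NEW'` or `'-'` (the sample values are made up):

```python
import pandas as pd

raw_rate = pd.Series(['4.1/5', '3.8 /5', 'NEW', '-', None])

# Take the part before '/', then let to_numeric turn non-numeric leftovers into NaN
numeric = pd.to_numeric(raw_rate.str.split('/').str[0].str.strip(), errors='coerce')
rating = numeric.dropna()
print(rating.tolist())
```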

#### Effect of rating

In [None]:
cost_dist_t = zomato_df1.groupby(['address','rate','approx_cost(for two people)','book_table','online_order'])
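# Select several columns with a list (double brackets); tuple-style selection fails on recent pandas
cost_dist=cost_dist_t[['rate','approx_cost(for two people)','book_table','online_order']].nunique().index.to_frame()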

In [None]:
cost_dist['approx_cost(for two people)'] = cost_dist['approx_cost(for two people)'].str.replace(',', '').astype(float)
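# Placeholder ratings become NaN; a dropna() before assigning back would have no effect, as the result is realigned to the frame
cost_dist['rate']=cost_dist['rate'].apply(lambda x: float(x.split('/')[0]) if len(x)>3 else np.nan )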

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))
sns.scatterplot(x="rate",y='approx_cost(for two people)',hue='book_table',data=cost_dist, ax=axis[0])
sns.scatterplot(x="rate",y='approx_cost(for two people)',hue='online_order',data=cost_dist, ax=axis[1])
plt.show()

1) There is no clear trend between cost and the rating a restaurant gets<br/>
2) Generally, restaurants with a table booking facility tend to get higher ratings<br/>
3) Also, as explained before, restaurants with a table booking facility tend to be costlier<br/>
4) There is not much of a relationship between cost and accepting online orders, but the costliest restaurants mostly don't accept online orders

#### *Bonus* - Custom restaurant search in any location based on restaurant type, cost, rating and number of votes

In [None]:
custom_restaurant=zomato_df1[['address','rate','approx_cost(for two people)','location','name','rest_type','votes']].dropna().drop_duplicates()

In [None]:
custom_restaurant['rate']=custom_restaurant['rate'].apply(lambda x: float(x.split('/')[0]) if len(x)>3 else 0)
custom_restaurant['approx_cost(for two people)']=custom_restaurant['approx_cost(for two people)'].apply(lambda x: int(x.replace(',','')))

In [None]:
def search_restaurant(location='',rest='',rate=4,no_of_votes=200,min_cost=0, max_cost=500):
    cost = custom_restaurant['approx_cost(for two people)']
    # Filters that always apply: cost range, minimum rating and minimum number of votes
    mask = ((cost>=min_cost) & (cost<=max_cost)
            & (custom_restaurant['rate']>rate) & (custom_restaurant['votes']>=no_of_votes))
    # Location and restaurant type are applied only when both are given
    if location!='' and rest!='':
        mask &= (custom_restaurant['location']==location) & (custom_restaurant['rest_type']==rest)
    search_rest = custom_restaurant[mask]
    pd.options.display.max_colwidth = 500
    print(search_rest.loc[:,['name']].reset_index().drop('index', axis=1))

In [None]:
search_restaurant('Whitefield',"Casual Dining",4,400,0,1000)

You can use the custom restaurant finder to find a desired restaurant based on your preference
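
A possible variation of the finder is sketched below on made-up data. The helper name `find_restaurants` and the shortened column name `cost` are hypothetical. It applies the location and restaurant-type filters independently, and returns the matching rows instead of printing them so the result can be reused.

```python
import pandas as pd

# Illustrative restaurants, not taken from the dataset
toy = pd.DataFrame({
    'name': ['Cafe X', 'Diner Y', 'Bistro Z'],
    'location': ['Whitefield', 'Whitefield', 'BTM'],
    'rest_type': ['Casual Dining', 'Casual Dining', 'Cafe'],
    'rate': [4.3, 3.9, 4.5],
    'votes': [520, 800, 300],
    'cost': [900, 700, 400],
})

def find_restaurants(df, location=None, rest=None, rate=4,
                     no_of_votes=200, min_cost=0, max_cost=500):
    """Hypothetical helper: apply each optional filter only if it was supplied."""
    mask = (df['cost'].between(min_cost, max_cost)
            & (df['rate'] > rate) & (df['votes'] >= no_of_votes))
    if location is not None:
        mask &= df['location'] == location
    if rest is not None:
        mask &= df['rest_type'] == rest
    return df.loc[mask, ['name']].reset_index(drop=True)

result = find_restaurants(toy, 'Whitefield', 'Casual Dining', 4, 400, 0, 1000)
print(result)
```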

This was a simple EDA of restaurants in Bangalore. I hope you enjoyed the kernel.<br/> 
**Please upvote if you learned something, and feel free to provide your feedback below!**