# **Introduction to Zomato**

Zomato is an Indian multinational restaurant aggregator and food delivery company founded by Pankaj Chaddah and Deepinder Goyal in 2008. Zomato provides information, menus and user-reviews of restaurants as well as food delivery options from partner restaurants in select cities. As of 2019, the service is available in 24 countries and in more than 10,000 cities.

![](https://th.bing.com/th/id/Rac9d820c33ac6d356859d2e2dc655bbd?rik=U8rQJ2ik6NpUpg&riu=http%3a%2f%2ftechstory.in%2fwp-content%2fuploads%2f2017%2f09%2fzomato-valuation-1.jpg&ehk=JQM%2bE8fPR1Ovar%2fmUw9ehlkoF6RCKCvCJ%2fUIBN7oxU0%3d&risl=&pid=ImgRaw)

# Introduction to the Notebook

In this notebook we explore the activity of Zomato restaurants in Hyderabad, the capital and largest city of the Indian state of Telangana and the de jure capital of Andhra Pradesh.
We are going to work with two datasets:
* **Restaurant names and Metadata** - This can help in clustering the restaurants into segments. The data also has valuable information about cuisines and cost, which can be used in a cost vs. benefit analysis
* **Restaurant reviews** - This data can be used for sentiment analysis. The reviewer metadata can also be used to identify the critics in the industry.

# **Importing libraries and reading data files**

**Import libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

**Reading datasets**

In [None]:
name_filepath = '../input/zomato-restaurants-hyderabad/Restaurant names and Metadata.csv'
reviews_filepath = '../input/zomato-restaurants-hyderabad/Restaurant reviews.csv'

name_metadata_df = pd.read_csv(name_filepath)
reviews_df = pd.read_csv(reviews_filepath)

**Create a copy of the datasets on which I will work**

In [None]:
restaurants = name_metadata_df.copy()
reviews = reviews_df.copy()

------------------------------------------------------------------------------------------------------

# **Restaurants dataset preprocessing**

**How big is the restaurants dataset?**

In [None]:
restaurants.shape

**Let's have a look at the first 5 rows of the dataset**

In [None]:
restaurants.head()

**Convert the 'Cost' column, deleting the comma and changing the data type into 'int64'**

In [None]:
restaurants['Cost'] = restaurants['Cost'].str.replace(",","").astype('int64')
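
The one-liner above assumes that every value in 'Cost' becomes a clean digit string once the comma is removed. A slightly more defensive sketch, shown on made-up sample values rather than on the real file, uses `pd.to_numeric` with `errors='coerce'` so that anything unparseable becomes `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical sample values; the real column comes from the CSV file
sample = pd.DataFrame({'Cost': ['800', '1,200', '1,500', 'n/a']})

# Remove the thousands separator as a literal character (not a regex)
cleaned = sample['Cost'].str.replace(',', '', regex=False)

# Convert to numbers; entries that cannot be parsed become NaN
sample['Cost'] = pd.to_numeric(cleaned, errors='coerce')

print(sample)
```

Because of the `NaN`, the resulting column is 'float64' rather than 'int64'; if no value fails to parse, `.astype('int64')` can still be applied afterwards.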

**We converted the 'Cost' column, but what are the data types of the other features?**

In [None]:
restaurants.info()

**Finally, how many missing values are there in this dataset?**

In [None]:
restaurants.isnull().sum()

--------------------------------------------------------------------------------------------------

# **Reviews dataset preprocessing**

**How big is the reviews dataset?**

In [None]:
reviews.shape

**Let's have a look at the first 5 rows of the dataset**

In [None]:
reviews.head()

In [None]:
reviews.isnull().sum()

**As we can see, there are few missing values compared to the size of the dataset, so I decided to drop them all, since little is lost**

In [None]:
reviews.dropna(inplace = True)
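
To put a number on "little is lost", one option is to compare the share of missing values per column with the number of rows removed. The sketch below uses a small invented table, not the real reviews file:

```python
import pandas as pd

# Invented toy data standing in for the reviews dataset
toy = pd.DataFrame({
    'Reviewer': ['A', 'B', None, 'D'],
    'Review': ['good', None, 'ok', 'great'],
    'Rating': ['5', '4', '3', '4'],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Rows removed when every row with at least one missing value is dropped
toy_clean = toy.dropna()
rows_lost = len(toy) - len(toy_clean)

print(missing_share)
print('Rows lost:', rows_lost, 'out of', len(toy))
```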

In [None]:
reviews.info()

**After looking at the feature types, I want to change a few of them. First of all, I want to change the dtype of the 'Rating' column**

In [None]:
reviews['Rating'].value_counts()

**As we can see, there is a 'Like' value that cannot be converted to a number, so the only way to convert this feature to 'float64' is to drop those rows**

In [None]:
reviews = reviews[reviews['Rating'] != 'Like'].copy()  # copy so that later column assignments do not act on a view
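
Filtering on the literal 'Like' works when that is the only non-numeric value. As an alternative sketch on invented values, `pd.to_numeric` with `errors='coerce'` turns every non-numeric rating into `NaN`, which can then be dropped in one step:

```python
import pandas as pd

# Invented ratings, including one non-numeric entry
toy = pd.DataFrame({'Rating': ['5', '4.5', 'Like', '3']})

# Non-numeric strings become NaN instead of raising an error
toy['Rating'] = pd.to_numeric(toy['Rating'], errors='coerce')

# Drop the rows whose rating could not be converted
toy = toy.dropna(subset=['Rating'])

print(toy['Rating'].tolist())
print(toy['Rating'].dtype)
```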

**Now, as said before, I change the dtype of several other features. I also add two new features, 'Year' and 'Hour', extracted from the 'Time' column**

In [None]:
reviews['Rating'] = reviews['Rating'].astype('float64')
# Split 'Metadata' into its review part and its follower part
metadata_parts = reviews['Metadata'].str.split(',', n=1, expand=True)
reviews['Reviews'] = pd.to_numeric(metadata_parts[0].str.split(' ').str[0])
reviews['Followers'] = pd.to_numeric(metadata_parts[1].str.split(' ').str[1])
reviews['Time'] = pd.to_datetime(reviews['Time'])
reviews['Year'] = reviews['Time'].dt.year
reviews['Hour'] = reviews['Time'].dt.hour
reviews = reviews.drop(['Metadata'], axis = 1)
reviews.dtypes
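
The split-based parsing above depends on the exact position of spaces and commas. A regular expression is a possible alternative; the sketch below assumes the metadata strings look like '1 Review , 2 Followers' (the sample strings are invented) and uses `Series.str.extract`:

```python
import pandas as pd

# Invented examples of the assumed 'Metadata' format
meta = pd.Series([
    '1 Review , 2 Followers',
    '3 Reviews , 2 Followers',
    '1 Review',                 # a reviewer without followers
])

# Capture the number that precedes each keyword
parsed = pd.DataFrame({
    'Reviews': meta.str.extract(r'(\d+)\s+Review', expand=False),
    'Followers': meta.str.extract(r'(\d+)\s+Follower', expand=False),
}).apply(pd.to_numeric)

print(parsed)
```

When the follower part is absent, the value is `NaN`, so the 'Followers' column ends up as a floating-point column.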

**And finally we can have another look at the dataset with our changes**

In [None]:
reviews.head()

---------------------------------------------------------------------------------------------------------

# Analysis of various themes of the datasets 

# Cuisines variety analysis

**First of all, let's see the 10 most common cuisines in our dataset**

In [None]:
cuisine_list = restaurants.Cuisines.str.split(', ')  # split each entry into a list of cuisine names
cuis_list = {}  # empty dictionary: cuisine name -> number of restaurants
for names in cuisine_list:  # for each restaurant's list of cuisines
    for name in names:  # for each cuisine in that list
        if name in cuis_list:  # if the cuisine has already been seen
            cuis_list[name] += 1  # increase its count
        else:
            cuis_list[name] = 1  # otherwise start counting it
cuis_df = pd.DataFrame(list(cuis_list.values()), index=list(cuis_list.keys()), columns=['Counts of Restaurants'])  # build the cuisine dataframe (columns must be a list, not a set)
cuis_df.sort_values(by='Counts of Restaurants', ascending=False, inplace=True)  # sort in descending order
top_10_cuis = cuis_df[0:10]  # pick the 10 most common cuisines
print('The Top 10 Cuisines are:\n', top_10_cuis)
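
The nested loop above can also be written with pandas alone. As a sketch on invented rows, `explode` turns each list of cuisines into one row per cuisine and `value_counts` does the counting and the descending sort:

```python
import pandas as pd

# Invented rows in the same comma-separated format as 'Cuisines'
toy = pd.DataFrame({'Cuisines': [
    'Chinese, North Indian',
    'North Indian, Biryani',
    'Chinese, Biryani, North Indian',
]})

# One row per (restaurant, cuisine) pair, then count each cuisine
counts = toy['Cuisines'].str.split(', ').explode().value_counts()

print(counts.head(2))
```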

**A list is a good summary, but a bar chart is perhaps a more intuitive way to represent a lot of data**

In [None]:
plt.figure(figsize=(15,10))
plt.plot(cuis_df.index,cuis_df['Counts of Restaurants'],color='red')
plt.bar(cuis_df.index,cuis_df['Counts of Restaurants'],color= sns.color_palette("crest",len (cuis_df.index)))
plt.xlabel('Cuisines',size=15)
plt.xticks(rotation=90)
plt.ylabel('Cuisine available at Number of Restaurants',size=15)
plt.title('Most popular cuisines at Restaurants in Hyderabad',size=30, color = 'green')

**Secondly, we can draw a WordCloud of the most frequent words in the 'Cuisines' feature**

In [None]:
from wordcloud import WordCloud, STOPWORDS
words_list = cuis_list.keys()
strr = ' '
for i in words_list:
    strr=strr+i+' '
    
wordcloud = WordCloud(width = 1400, height = 1400, 
                background_color ='ivory',  
                min_font_size = 12).generate(strr) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 

# Costs of the restaurants

**Now we can plot the restaurants by their cost**

In [None]:
restaurants_cost=restaurants.groupby('Name').apply(lambda x:np.average(x['Cost'])).reset_index(name='Cost')
restaurants_cost.sort_values(by='Cost',ascending=False,inplace=True)
avg=np.average(restaurants_cost['Cost'])
plt.figure(figsize=(25,10))
plt.bar(restaurants_cost['Name'],restaurants_cost['Cost'], color = sns.color_palette("viridis", len(restaurants_cost['Name'])))
for i in restaurants_cost['Name']:
    plt.scatter(i,avg,color='red')
plt.xlabel('Restaurants',size=15)
plt.xticks(rotation=90)
plt.ylabel('Average Cost',size=15)
plt.title('Overall Cost Summary of Restaurants in Hyderabad',size=30)
plt.legend(['Average Cost at Restaurant'])

**But which are the most expensive restaurants?**

In [None]:
best_5_rest = restaurants_cost[:5].sort_values(by='Cost', ascending=True)  # sort a new frame instead of modifying a slice in place
plt.figure(figsize=(20,5))
plt.bar(best_5_rest['Name'],best_5_rest['Cost'], color = sns.color_palette("hls", 8))
plt.title('The 5 most expensive Restaurants in Hyderabad',size=28)
plt.xlabel('Restaurants',size=15)
plt.ylabel('Average Cost',size=15)

**And which are the cheapest restaurants?**

In [None]:
worst_5_rest = restaurants_cost[-5:]
plt.figure(figsize=(20,5))
plt.bar(worst_5_rest['Name'],worst_5_rest['Cost'], color = sns.color_palette("hls", 8))
plt.title('The 5 cheapest Restaurants in Hyderabad',size=28)
plt.xlabel('Restaurants',size=15)
plt.ylabel('Average Cost',size=15)
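
The two rankings above rely on slicing a sorted table. An equivalent sketch, on invented names and costs, computes the average with `groupby(...).mean()` and then picks both ends directly with `nlargest` and `nsmallest`:

```python
import pandas as pd

# Invented restaurants; 'A' appears twice to show the averaging
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'A'],
    'Cost': [500, 1500, 300, 2500, 700],
})

# Average cost per restaurant name
avg_cost = toy.groupby('Name')['Cost'].mean()

# The two most expensive and the two cheapest restaurants
most_expensive = avg_cost.nlargest(2)
cheapest = avg_cost.nsmallest(2)

print(most_expensive)
print(cheapest)
```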

**Finally, we can draw a WordCloud of the most frequent words in the 'Name' feature**

In [None]:
Rests = restaurants.Name.unique()
rest_string = ' '
for i in Rests:
   rest_string = rest_string+i+' '
    
wordcloud = WordCloud(width = 1400, height = 1400, 
                background_color ='lavenderblush',  
                min_font_size = 12).generate(rest_string) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 

# Top 15 Reviewers

**First of all, we extract the 15 profiles that have written the most reviews**

In [None]:
reviewer_list = reviews.groupby('Reviewer').apply(lambda x: x['Reviewer'].count()).reset_index(name='Review Count')
reviewer_list = reviewer_list.sort_values(by = 'Review Count',ascending=False)
top_reviewers = reviewer_list[:15]

**Secondly, we plot a graph to represent them**

In [None]:
plt.figure(figsize=(13,5))
plt.bar(top_reviewers['Reviewer'], top_reviewers['Review Count'], color = sns.color_palette("hls", 8))
plt.xticks(rotation=75)
plt.title('Top 15 reviewers',size=28)
plt.xlabel('Name (or Nickname)',size=15)
plt.ylabel('N° of reviews',size=15)

**Then we can calculate the average rating given by each of them**

In [None]:
review_ratings=reviews.groupby('Reviewer').apply(lambda x:np.average(x['Rating'])).reset_index(name='Average Ratings')
review_ratings=pd.merge(top_reviewers,review_ratings,how='inner',left_on='Reviewer',right_on='Reviewer')
top_reviewers_ratings=review_ratings[:15]

In [None]:
top_reviewers_ratings
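
The review count and the average rating can also be obtained in a single pass, without a merge. The following sketch uses named aggregation on invented reviewers and ratings; the output column names `review_count` and `average_rating` are chosen here for illustration only:

```python
import pandas as pd

# Invented reviewers and ratings
toy = pd.DataFrame({
    'Reviewer': ['Ann', 'Bob', 'Ann', 'Cid', 'Ann', 'Bob'],
    'Rating': [5.0, 3.0, 4.0, 2.0, 3.0, 4.0],
})

# Count and mean per reviewer in one groupby, most active reviewers first
summary = (
    toy.groupby('Reviewer')
       .agg(review_count=('Rating', 'size'), average_rating=('Rating', 'mean'))
       .sort_values('review_count', ascending=False)
)

print(summary.head(2))
```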

**And also plot a line chart to view them in a simple way**

In [None]:
review_ratings_plot = top_reviewers_ratings.groupby('Review Count').apply(lambda x:np.average(x['Average Ratings'])).reset_index(name='Average')
plt.figure(figsize=(8,8))
plt.plot(review_ratings_plot['Review Count'],review_ratings_plot['Average'])
plt.scatter(review_ratings_plot['Review Count'],review_ratings_plot['Average'],color='red')
plt.xlabel('N° of Reviews by a User',size=15)
plt.ylabel('Average Ratings per review submitted',size=15)
plt.title('Average Ratings per Review Submitted Distribution',size=20)

# Big comparison: Followers vs Reviews vs Pictures

**How have the followers and the reviews grown over these 3 years of Zomato data?**

In [None]:
review_follow_1=reviews.groupby('Year').apply(lambda x:np.sum(x['Reviews'])).reset_index(name='Total Reviews')
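# Note: this is the maximum follower count seen in each year, not a sum, although the column is named 'Total Followers'
review_follow_2=reviews.groupby('Year').apply(lambda x:np.max(x['Followers'])).reset_index(name='Total Followers')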
pictures =reviews.groupby('Year').apply(lambda x:np.sum(x['Pictures'])).reset_index(name='Total Pictures')
review_follow=pd.merge(review_follow_1,review_follow_2, how='inner',left_on='Year',right_on='Year')
review_follow=pd.merge(review_follow, pictures, how='inner',left_on='Year',right_on='Year')
review_follow
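
The three group-bys and two merges above can be collapsed into one call by passing a dictionary to `agg`. This is a sketch on invented rows that keeps the same choice of functions (sum for reviews and pictures, max for followers):

```python
import pandas as pd

# Invented rows with the same columns used in the yearly summary
toy = pd.DataFrame({
    'Time': pd.to_datetime(['2017-05-01 20:15', '2018-03-10 13:00',
                            '2018-07-04 22:30', '2019-01-20 09:45']),
    'Reviews': [3, 10, 1, 7],
    'Followers': [2, 50, 0, 12],
    'Pictures': [0, 4, 1, 2],
})
toy['Year'] = toy['Time'].dt.year

# One aggregation function per column, all computed in a single groupby
yearly = toy.groupby('Year').agg({'Reviews': 'sum', 'Followers': 'max', 'Pictures': 'sum'})

print(yearly)
```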

In [None]:
plt.figure(figsize=(12,6))
plt.plot(review_follow['Year'],review_follow['Total Followers'], color = 'blue')
plt.plot(review_follow['Year'],review_follow['Total Reviews'], color = "green")
plt.plot(review_follow['Year'],review_follow['Total Pictures'], color = "orange")
plt.xlabel('Year',size=15)
plt.xticks(rotation=75)
plt.grid()
plt.ylabel('Reviews/Followers/Pictures',size=15)
plt.title('Reviews,Followers and Pictures count with Time',size=20)
plt.legend(['Total Followers','Total Reviews', 'Total Pictures'])

**As we can see, there is a big difference between them: there are many more reviews than followers. Pictures are even fewer, which is a pity, because they could add value by enticing customers more immediately than written reviews alone.**

# Reviews per hour

**At what time of day do people write the most reviews?**

In [None]:
reviews_for_hour = reviews.groupby('Hour').apply(lambda x: x['Hour'].count()).reset_index(name='Reviews per hour')
reviews_for_hour

In [None]:
plt.figure(figsize=(15,5))
plt.bar(reviews_for_hour['Hour'], reviews_for_hour['Reviews per hour'], color = sns.color_palette("hls", 8))
plt.title('Reviews per hour',size=28)
plt.grid()
plt.xlabel('Hours',size=15)
plt.ylabel('N° of reviews',size=15)

**As we can see, the number of reviews increases from the afternoon until midnight and then decreases in the morning. I think this is normal, because during the morning most people are at work or at school**
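
To check an observation like this with numbers instead of by eye, the hours can be grouped into parts of the day. The sketch below uses invented hours, and the four bin boundaries are an arbitrary choice made for illustration:

```python
import pandas as pd

# Invented review hours (0-23), standing in for the 'Hour' column
hours = pd.Series([0, 9, 13, 14, 19, 20, 21, 23])

# Left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24)
part_of_day = pd.cut(
    hours,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['night', 'morning', 'afternoon', 'evening'],
)

# Number of reviews in each part of the day, in chronological order
counts = part_of_day.value_counts().sort_index()

print(counts)
```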

# Most common words in reviews

**What are the most common words in the reviews?**

**First of all, we import 'spacy', one of the most popular libraries for NLP (Natural Language Processing)**

In [None]:
import spacy

**Then we draw a WordCloud of the most common words, after extracting them with tokenization**

In [None]:
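nlp = spacy.load('en_core_web_sm')  # small English model (assumed to be installed)
reviews_feature = reviews['Review']
words = []
# Tokenize every review and keep tokens that are neither stop words nor punctuation
for doc in nlp.pipe(reviews_feature, disable=['parser', 'ner']):
    words.extend(token.text for token in doc
                 if not token.is_stop and not token.is_punct)
rest_string = ' '.join(words)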
    
wordcloud = WordCloud(width = 1400, height = 1400, 
                background_color ='lavenderblush',  
                min_font_size = 12).generate(rest_string) 
  
# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 
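
A word cloud shows frequencies only through font size. To read the actual counts, a plain frequency table helps. The sketch below is deliberately simplified: it uses a regular expression and a tiny hand-made stop-word set instead of spaCy, and two invented reviews:

```python
import re
from collections import Counter

# Invented reviews standing in for the 'Review' column
sample_reviews = [
    'The biryani was great and the service was great',
    'Great ambience, but the biryani was cold',
]

# Tiny hand-made stop-word list (an assumption; spaCy's list is much larger)
stop_words = {'the', 'was', 'and', 'but', 'a', 'is'}

counter = Counter()
for text in sample_reviews:
    # Lowercase, then keep runs of letters and apostrophes as tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    counter.update(t for t in tokens if t not in stop_words)

most_common = counter.most_common(2)
print(most_common)
```

A dictionary of counts like `counter` can also be passed to `WordCloud.generate_from_frequencies` to draw the cloud from exact frequencies.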

# Costs vs Ratings

**Will the most expensive restaurants be the highest rated? Will high prices please customers with high quality food?**

In [None]:
df_merged= reviews.merge(restaurants, how='inner', left_on='Restaurant', right_on='Name')
df_merged.head(1)

In [None]:
sns.boxplot(x='Rating', y='Cost', data=df_merged)
plt.show()

**As we can see, there is not much data on expensive restaurants, probably because many people prefer to spend less. We can also see that the Rating 5.0 box is the most elongated vertically, so we can say that people do not eat in expensive restaurants often, but the few times they do go, they tend to leave satisfied**
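
The reading of the box plot can be cross-checked with a small table of cost statistics per rating. This is a sketch on invented rating and cost pairs, not on the merged dataset:

```python
import pandas as pd

# Invented (rating, cost) pairs standing in for the merged table
toy = pd.DataFrame({
    'Rating': [5.0, 5.0, 4.0, 4.0, 3.0, 3.0],
    'Cost': [2500, 700, 800, 1200, 500, 600],
})

# Number of reviews, median cost and highest cost for each rating
cost_by_rating = toy.groupby('Rating')['Cost'].agg(['count', 'median', 'max'])

print(cost_by_rating)
```

The 'count' column shows how many reviews stand behind each box, which matters when a single expensive restaurant stretches it.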

----------------------------------------------------

# **Thank you so much for looking at this notebook. I hope you enjoyed it, and if so, please consider giving it an upvote. If you find any errors or have suggestions for improving the notebook, please let me know in the comments. Thank you very much again, and good Kaggling!**