<table align="center" width=100%>
    <tr>
        <td width="15%">
            <img src="edaicon.png">
        </td>
        <td>
            <div align="center">
                <font color="#21618C" size=24px>
                    <b>Exploratory Data Analysis
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

## Problem Statement

This exploratory data analysis of Zomato data helps food lovers find the best restaurants and the best value-for-money restaurants in their locality. It also helps them find restaurants in their locality that serve the cuisines they want.

## Data Definition

**res_id**: The code given to a restaurant (Categorical) 

**name**: Name of the restaurant (Categorical)

**establishment**: Represents the type of establishment (Categorical)

**url**: The website of the restaurant (Categorical)

**address**: The address of the restaurant (Categorical)

**city**: City in which the restaurant is located (Categorical)

**city_id**: The code given to a city (Categorical)

**locality**: Locality of the restaurant (Categorical)

**latitude**: Latitude of the restaurant (Numerical)

**longitude**: Longitude of the restaurant (Numerical)

**zipcode**: Zipcode of the city in which the restaurant is located (Categorical)

**country_id**: Code of the country in which the restaurant is located (Categorical)

**locality_verbose**: Locality along with the city in which the restaurant is located (Categorical)

**cuisines**: The cuisines a restaurant serves (Categorical)

**timings**: The working hours of a restaurant (Categorical)

**average_cost_for_two**: The average amount expected for 2 people (Numerical)

**price_range**: The categories for average cost (Categories - 1,2,3,4) (Categorical)

**currency**: The currency in which a customer pays (Categorical)

**highlights**: The facilities of the restaurant (Categorical)

**aggregate_rating**: The overall rating a restaurant has got (Numerical) 

**rating_text**: Categorized ratings (Categorical)

**votes**: Number of votes received by the restaurant from customers (Numerical)

**photo_count**: The number of photos of a restaurant (Numerical)

**opentable_support**: Restaurant reservation from Opentable (Categorical)

**delivery**: Whether the restaurant delivers orders or not (Categorical)

**takeaway**: Whether the restaurant allows takeaway orders or not (Categorical)

## Table of Contents

1. **[Import Libraries](#import_lib)** 
2. **[Set Options](#set_options)** 
3. **[Read Data](#Read_Data)** 
4. **[Understand  and Prepare the Data](#Understand_Data)**
5. **[Understand the variables](#Understanding_variables)**
6. **[Check for Missing Values](#missing)**
7. **[Study Correlation](#correlation)**
8. **[Detect Outliers](#outliers)**
9. **[Create a new variable 'region'](#region)**
10. **[Some more analysis](#more)** 


<a id='import_lib'></a>
## 1. Import Libraries

<table align ="left">
    <tr>
        <td width="8%">
            <img src="todo.png">
        </td>
        <td>
            <div align="left" style="font-size:120%">
                <font color="#21618C">
                    <b> Import the required libraries and functions
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [0]:
import pandas as pd
import numpy as np 

Inference- 
The libraries needed at this stage, pandas and numpy, are imported. geopandas, matplotlib and seaborn are imported later, in the cells where they are first used.

<a id='set_options'></a>
## 2. Set Options

<table align="left">
    <tr>
        <td width="8%">
            <img src="todo.png">
        </td>
        <td>
            <div align="left" style="font-size:120%">
                <font color="#21618C">
                    <b>Make necessary changes to :<br><br>
Set the working directory              
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

In [0]:
!type pip
!type python

Inference- 
The working directory was already set, so no changes were needed. 
I checked the location of pip and python on the system.

<a id='Read_Data'></a>
## 3. Read Data

In [None]:
data=pd.read_csv("ZomatoRestaurantsIndia.csv")
data

Inference- The dataset is read from the CSV file into a pandas DataFrame named `data`

<a id='Understand_Data'></a>
## 4. Understand  and Prepare the Data

Well-prepared data benefits analysis because it limits the errors and inaccuracies that can occur during analysis. Processed data is also more accessible to users.<br> <br>
Data understanding is the process of getting familiar with the data: identifying data types, discovering first insights, and detecting interesting subsets that suggest hypotheses about hidden information. Data preparation, in contrast, is the process of cleaning and transforming raw data before analysis. It is an important step before processing and often involves reformatting and correcting the data.<br> <br>
Data preparation is often a lengthy process, but it is an essential prerequisite for putting data in context, extracting insights, and eliminating bias that results from poor data quality.

<table align="left">
    <tr>
        <td width="8%">
            <img src="todo.png">
        </td>
        <td>
            <div align="left" style="font-size:120%">
                <font color="#21618C">
                    <b> Analyze and prepare data:<br>
                        1. Check dimensions of the dataframe <br>
                        2. View the head of the data<br>
                        3. Note the redundant variables and drop them <br>
                        4. Check the data types. Refer to data definition to ensure your data types are correct. If data types are not as per business context, change the data types as per requirement <br>
                        5. Check for duplicates<br>
                        Note: It is an art to explore data and one will need more and more practice to gain expertise in this area
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

### -------------------------***Provide the inferences from the output of every code cell executed.***----------------------------

**1. Check dimensions of the dataframe in terms of rows and columns**

In [None]:
data.shape

Inference drawn: 
    * The number of rows is 211944
    * The number of columns is 26

**2. View the head of the data**

In [None]:
data.head(5)
# data.locality_verbose

Inference -

* There are 26 columns in the dataset, each describing some aspect of a restaurant listed on Zomato

* The data covers only restaurants in India

**3. Note the redundant variables and drop them**

In [0]:
data.columns

# Redundant columns to drop:
# country_id
# locality_verbose
# currency
# opentable_support

data=data.drop(['country_id','locality_verbose','currency','opentable_support'],axis=1)


The redundant variables are: 

1) country_id - the data covers only India, so country_id is the same in every row

2) locality_verbose - it repeats information already available in the locality and city columns

3) currency - since the data covers only India, the currency is constant

4) opentable_support - all values are 0, so the column can be dropped

In [None]:
data.shape

Inference- 

Checking the shape again to confirm that the columns were dropped.
We can see that the number of columns reduced from 26 to 22.

**4. Check the data types. Refer to data definition to ensure your data types are correct. If data types are not as per business context, change the data types as per requirement**


In [None]:
data
data.dtypes

Inference- 

Checking the data type of each column in the dataset

#### Change the incorrect data type

In [None]:
#TO CATEGORICAL FORMAT

data=data.astype('category')

#TO NUMERICAL FORMAT

# average_cost_for_two
# aggregate_rating
# photo_count
# votes

data.average_cost_for_two=data.average_cost_for_two.astype('int64')
data.aggregate_rating=data.aggregate_rating.astype('float64')
data.photo_count=data.photo_count.astype('int64')
data.votes=data.votes.astype('int64')



data.dtypes

Inference- 

* The data types were changed according to the business requirements into two types: categorical and numerical

**5. Check for Duplicates**

In [None]:
#this is done based on all columns in the data
data_duplicated=data[data.duplicated()]
data_duplicated

Inference-
    
    `duplicated()` without arguments compares all columns, so the 151533 rows returned are exact copies of a row that appears earlier in the dataset

In [None]:
#let us do for city column 

data_city_dup=data[data['city'].duplicated()]
data_city_dup

Inference-
 
We can observe that there are only 98 unique cities in the dataset but 211944 rows, so the remaining rows (211944 - 98 = 211846) are flagged as duplicates on the city column
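
The two duplicate checks above behave differently, and a small sketch may make the difference concrete. The data below is a hypothetical toy table, not taken from the Zomato file; `duplicated()` and `drop_duplicates()` are standard pandas methods.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Zomato file
toy = pd.DataFrame({
    "name": ["Cafe A", "Cafe A", "Cafe B", "Cafe C"],
    "city": ["Pune", "Pune", "Pune", "Agra"],
})

# Rows that repeat an earlier row in every column
full_dups = toy.duplicated()

# Rows whose city alone has been seen before
city_dups = toy.duplicated(subset=["city"])

# Keep only the first occurrence of each fully identical row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(full_dups.sum(), city_dups.sum(), len(deduped))
```

A repeated city is expected (many restaurants share a city), whereas a fully repeated row usually signals a record that was scraped more than once.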

<a id = 'Understanding_variables'> </a>
## 5. Understand the variables

**1. Variable 'name'**

In [None]:
data.name

Inference- 
    
    This column holds the names of all the restaurants listed on Zomato in India 

**2. Variable 'establishment'**

In [None]:
data.establishment

Inference- 
    
    This shows the type of establishment, for example quick bites, casual dining, cafe and so on

**3. Variable 'city'**

In [None]:
data.city

Inference-
        
    Understanding the names of the cities in the dataset
        

**Let us find the count of restaurants in each city**

In [None]:
data.city.value_counts()


Inference-

      Counting the number of restaurants in each city in India with the help of the value_counts() function of pandas

**4. Variable 'locality'**

In [None]:
data.locality

Inference-

        Understanding the locality where the restaurant is situated

**5. Variable 'latitude'**

From the variable 'latitude', we know the latitudinal location of the restaurant

The latitudinal extent of India is 8°4′N to 37°6′N. 

We must check whether we have any points beyond this extent.

- We need to replace all such values with NaNs.

- Check whether the values were replaced by NaNs.

- We see that all such values are replaced by NaNs.
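
The extents quoted in this section are given in degrees and arc minutes, while the latitude and longitude columns hold decimal degrees. A minimal sketch of the conversion follows; `dm_to_decimal` is a hypothetical helper name introduced here for illustration, and one degree is 60 minutes.

```python
def dm_to_decimal(degrees, minutes):
    """Convert an angle given in degrees and arc minutes to decimal degrees."""
    return degrees + minutes / 60.0

# Latitudinal extent 8°4'N to 37°6'N, longitudinal extent 68°7'E to 97°25'E
lat_min = dm_to_decimal(8, 4)
lat_max = dm_to_decimal(37, 6)
lon_min = dm_to_decimal(68, 7)
lon_max = dm_to_decimal(97, 25)

print(round(lat_min, 3), round(lat_max, 3), round(lon_min, 3), round(lon_max, 3))
```

Note that 8°4′ is roughly 8.067 in decimal degrees, not 8.4, so the minutes cannot be pasted after the decimal point.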

In [None]:
data.latitude

Inference-

        reading the latitudes

In [None]:
data.latitude.isna().sum()

In [0]:
data.latitude=data.latitude.astype('Float64')
# 8°4'N and 37°6'N expressed in decimal degrees (1 degree = 60 minutes)
data.loc[(data.latitude<(8+4/60))| (data.latitude >(37+6/60)),['latitude']]=None

Inference-

        If a latitude is not within the given bounds, the point does not lie in India, so we replace such values with nulls since they are not useful to us

In [None]:
data.latitude.isna().sum()

**6. Variable 'longitude'**

From the variable 'longitude', we know the longitudinal location of the restaurant

The Longitudinal extent of India is from 68°7'E to 97°25'E

We must check whether we have any points beyond this extent.

- We need to replace all such values with NaNs.

- Check whether the values were replaced by NaNs.

- From the variables 'latitude' and 'longitude', plot the location of restaurants.

In [None]:
data.longitude

Inference-

        reading the longitudes

In [None]:
data.longitude.isna().sum()

In [0]:
data.longitude=data.longitude.astype('Float64')
# 68°7'E and 97°25'E expressed in decimal degrees (1 degree = 60 minutes)
data.loc[(data.longitude<(68+7/60))| (data.longitude >(97+25/60)),['longitude']]=None

Inference-

        If a longitude is not within the given bounds, the point does not lie in India, so we replace such values with nulls since they are not useful to us

In [None]:
data.longitude.isna().sum()

In [None]:
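# GitHub no longer serves the git:// protocol, so install a released version from PyPI.
# The plotting cell below uses geopandas.datasets, which was removed in geopandas 1.0.
!pip install "geopandas<1.0"

In [None]:
import matplotlib.pyplot as plt
import geopandas

# Point geometries from the cleaned coordinates (x = longitude, y = latitude)
gdf = geopandas.GeoDataFrame(
    data, geometry=geopandas.points_from_xy(data.longitude, data.latitude))

world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
# figsize is set on the base map; it is ignored when an existing ax is passed to plot()
ax = world[world.name == 'India'].plot(
    color='black', edgecolor='white', figsize=(50,50))
gdf.plot(ax=ax, color='yellow')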


Inference- 

        *Plotting the various coordinates of the restaurants across India 
        *The yellow dots indicate the various locations where the restaurants are situated

**7. Variable 'cuisines'**

In [None]:
data.cuisines

Inference- 

        *Cuisines are the various types of foods offered by the various restaurants

- To find the unique cuisines, we write a small user-defined function.



In [0]:
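def cuisines(x):
    # Collect the unique cuisines from a Series of comma-separated strings
    result_cuisines=set()
    # Missing entries are dropped so that NaN is not stored as a cuisine called 'nan'
    for i in x.dropna().astype(str):
        for j in i.split(", "):
            result_cuisines.add(j)

    return (result_cuisines)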



Inference- 

        Function for finding out unique cuisines

- find out the frequency of each cuisine

In [None]:
result_cuisines=cuisines(data.cuisines)

print(result_cuisines)

Inference- 

        these are the various unique cuisines offered by the restaurants pan India 

In [None]:
store_cuisine_inter=[]

df_cuisine=pd.DataFrame(result_cuisines,columns=['Cuisines'])
for i in result_cuisines:
    store_cuisine_inter.append(data['cuisines'].str.count(i).sum())


df_cuisine['frequency']=store_cuisine_inter
df_cuisine

Inference- 

        * counting how often each cuisine appears across India 
        * the frequency reflects the number of restaurants that offer the cuisine; the table is not sorted yet
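
One caveat about the loop above: `str.count` treats its argument as a regular expression and counts substring matches, so a name such as 'Indian' is also counted inside 'North Indian' and 'South Indian'. The sketch below, on a hypothetical sample in the same comma-separated format, contrasts that with an exact count obtained by splitting each entry into tokens.

```python
import pandas as pd

# Hypothetical sample in the same comma-separated format as the 'cuisines' column
toy = pd.Series(["North Indian, Chinese", "Indian", "South Indian", None])

# Substring counting, as in the loop above: 'Indian' also matches inside
# 'North Indian' and 'South Indian'
substring_count = int(toy.str.count("Indian").sum())

# Exact counting: split into tokens, one row per token, then tally
exact_counts = toy.dropna().str.split(", ").explode().value_counts()

print(substring_count)         # every occurrence of the substring
print(exact_counts["Indian"])  # only the cuisine named exactly 'Indian'
```

The same split, explode and `value_counts` chain could be applied to the 'highlights' column, where facility names can also contain one another.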

**8. Variable 'average_cost_for_two'**

In [None]:
data.average_cost_for_two

Inference- 

        Getting an idea of how much it would cost for two people to eat at that restaurant

**9. Variable 'price_range'**

In [None]:
data.price_range

Inference- 

        understanding the price range to get a better idea of how costly the restaurant is

- visualize an exploded pie chart.

In [None]:
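counts=data.price_range.value_counts()
# explode offsets every wedge slightly from the centre of the pie
counts.plot.pie(explode=[0.05]*len(counts))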

Inference- 

        understanding how the restaurants are distributed across price ranges by looking at the number of restaurants in each range

**10. Variable 'highlights'**

In [None]:
data.highlights

Inference- 

        Understanding the various important aspects of the restaurant



- write a small function to know the number of times a facility has appeared in the 'Highlights'.

In [0]:
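def highlights(x):
    # Collect the unique facilities from a Series of comma-separated strings
    result_highlights=set()
    # Missing entries are dropped so that NaN is not stored as a facility called 'nan'
    for i in x.dropna().astype(str):
        for j in i.split(", "):
            result_highlights.add(j)

    return (result_highlights)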

Inference- 

        Function for getting the unique facilities listed in the highlights across India

In [None]:
ans_highlights=highlights(data.highlights)
store_inter=[]

df_highlights=pd.DataFrame(ans_highlights,columns=['Facility'])
for i in ans_highlights:
    store_inter.append(data['highlights'].str.count(i).sum())


df_highlights

In [None]:
# print(store_inter)
df_highlights['frequency']=store_inter
df_highlights

Inference- 

        Finding the frequency of the various facilities offered by the restaurants over India

- Now we find out which facility occurs the most number of times in the data.

In [None]:
save_df=df_highlights.sort_values(by=['frequency'],ascending=False).head(1)
save_df['Facility'].head(1)

Inference- 

        we understand that cash is the most frequent facility offered by the restaurants

**11. Variable 'aggregate_rating'** 



In [None]:
data.aggregate_rating

Inference- 

        Understanding the rating of the restaurants on a scale of 1-5 

**12. Variable 'rating_text'**



In [None]:
data.rating_text

Inference- 

        understanding the rating of the restaurant in descriptive format

Creating a New feature for better understanding of ratings


In [None]:
data.rating_text.unique()

# Half-open ranges so that boundary values such as 1.0 fall in exactly one bin
data.loc[((data.aggregate_rating>=0) & (data.aggregate_rating<1)),'rating_text_new']='poor'
data.loc[((data.aggregate_rating>=1) & (data.aggregate_rating<2)),'rating_text_new']='average'
data.loc[((data.aggregate_rating>=2) & (data.aggregate_rating<3)),'rating_text_new']='good'
data.loc[((data.aggregate_rating>=3) & (data.aggregate_rating<4)),'rating_text_new']='very good'
data.loc[((data.aggregate_rating>=4) & (data.aggregate_rating<=5)),'rating_text_new']='excellent'
data['rating_text_new'].unique()

Inference- 
        
        Creating a new column that splits the 0 to 5 rating range into five parts and assigns an ordinal label to each part
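
The same binning can be sketched with `pd.cut`, which states the bin edges once instead of repeating a condition per label. This is an alternative under stated assumptions, not the approach used in the cell above: the ratings are hypothetical, and the last edge is set to infinity so that a rating of exactly 5 still lands in 'excellent'.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including boundary values
ratings = pd.Series([0.0, 0.9, 1.0, 2.5, 3.9, 4.0, 5.0])

labels = ["poor", "average", "good", "very good", "excellent"]
# right=False makes each bin closed on the left: [0, 1), [1, 2), ... [4, inf)
binned = pd.cut(ratings, bins=[0, 1, 2, 3, 4, np.inf], labels=labels, right=False)

print(binned.tolist())
```

The result is an ordered categorical, so the labels keep their rank when sorted or compared.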

**13. Variable 'votes'**

In [None]:
data.votes

Inference- 
    
    This shows the votes that are given to the restaurants by the customers visiting it

**14. Variable 'photo_count'**

In [None]:
data.photo_count

Inference- 
    
    This shows the number of photos uploaded by the customers visiting the restaurant

**15. Variable 'delivery'**

In [None]:
data.delivery

Inference- 
    
    This shows whether the restaurant delivers (1) or does not deliver (-1)

<a id ='missing'></a>
## 6. Check for missing values

In [None]:
data.isna().sum()

Inference- 
    
    This shows the number of null values in each column of the dataset
    
    * the zipcode column has 163187 null values, which is about 77% of the dataset 
    
    * missing values in highlights and establishment can be imputed with the mode value 
    
    * latitude and longitude have null values where the coordinates fell outside the range for India
    
    * some of the addresses are also missing
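
Mode imputation, as suggested for highlights and establishment, could look like the sketch below. The column contents are hypothetical; `mode()` returns a Series because ties are possible, so the first value is taken.

```python
import pandas as pd

# Hypothetical column with gaps, standing in for 'establishment'
toy = pd.DataFrame({"establishment": ["Cafe", None, "Cafe", "Bar", None]})

# mode() ignores missing values and may return several values on a tie
most_common = toy["establishment"].mode()[0]
toy["establishment"] = toy["establishment"].fillna(most_common)

print(most_common, toy["establishment"].isna().sum())
```

Filling with the mode keeps the column complete but inflates the share of the most common category, which matters when frequencies are compared later.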

**6. Study summary statistics**

Let us check the summary statistics for numerical variables.

In [None]:
data.describe()

Inference- 
    
    This shows the count, mean and standard deviation of each numerical column, along with its extreme values (minimum and maximum), the lower and upper quartiles, and the median

<a id = 'correlation'> </a>
## 7. Study correlation

In [None]:
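import seaborn as sns
# Categorical columns cannot be correlated, so restrict to numeric columns (pandas 1.5+)
corr=data.corr(numeric_only=True)
sns.heatmap(corr,annot=True)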


Inference- 
    
    This shows the correlation between the numerical columns in the dataset, 
    for example -
    
    * photo counts and votes are highly correlated
    
    * votes and latitude are negatively correlated
    
    * votes and average cost for two people have a low correlation


<a id='outliers'></a>
## 8. Detect outliers

In [None]:
data.plot.box()

Inference- 
    
    it helps us understand how the data is distributed

In [None]:

data.res_id=data.res_id.astype('int64')
data['res_id'].plot.box()

Inference- 
    
    This shows res_id does not have outliers

In [None]:
data['photo_count'].plot.box()

Inference- 
    
    This shows photo count has a lot of outliers

In [None]:
data['votes'].plot.box()

Inference- 
    
    This shows votes have a lot of outliers

In [None]:
data['aggregate_rating'].plot.box()

Inference- 
    
    This shows aggregate rating has a few outliers; the rest of the data lies within the whiskers of the box plot
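
The points drawn beyond the whiskers of a box plot follow the usual 1.5 × IQR rule, and the same rule can be applied directly to list the outlying values. A minimal sketch on hypothetical vote counts:

```python
import pandas as pd

# Hypothetical vote counts with one extreme value
votes = pd.Series([10, 12, 15, 18, 20, 22, 25, 400])

q1, q3 = votes.quantile(0.25), votes.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are the ones a box plot draws as individual points
outliers = votes[(votes < lower) | (votes > upper)]
print(outliers.tolist())
```

For heavily right-skewed columns such as votes and photo_count, many genuine popular restaurants will be flagged this way, so the flag is a prompt for inspection rather than a reason to delete rows.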

<a id='region'> </a>
## 9. Create a new variable 'region'


In [0]:
east_zone=['Arunachal Pradesh', 'Assam', 'Manipur', 'Meghalaya', 'Mizoram', 'Nagaland', 'Sikkim' , 'Tripura','Bihar', 'Orissa', 'Jharkhand', 'West Bengal']

west_zone=['Rajasthan' , 'Gujarat', 'Goa', 'Maharashtra', 'Daman and Diu',
'Dadra and Nagar Haveli','Madhya Pradesh','Chhattisgarh']

north_zone=['Jammu and Kashmir', 'Himachal Pradesh','Chandigarh','Delhi', 'Punjab', 'Uttarakhand' , 'Uttar Pradesh','Haryana']

south_zone=['Andhra Pradesh', 'Karnataka', 'Kerala','Tamil Nadu','Telangana','Puducherry']


# data.locality

Inference- 
    
    Understanding the various states in each zone

Create a variable 'region' with four categories: 'north', 'east', 'south' and 'west'. To do so, use the 'city' column and group all cities belonging to the same region. 

In [0]:
dict_region={'Agra':"north", 'Ahmedabad':'west', 'Ajmer':'west', 'Alappuzha':'south', 'Allahabad':'north', 'Amravati':"west",
       'Amritsar':'north', 'Aurangabad':'west', 'Bangalore':'south', 'Bhopal':'west', 'Bhubaneshwar':'east',
       'Chandigarh':'north', 'Chennai':'south', 'Coimbatore':'south', 'Cuttack':'east', 'Darjeeling':'east',
       'Dehradun':'north', 'Dharamshala':'north', 'Faridabad':"north", 'Gandhinagar':"west", 'Gangtok':'east',
       'Ghaziabad':"north", 'Goa':"west", 'Gorakhpur':"north", 'Greater Noida':"north", 'Guntur':"south",
       'Gurgaon':"north", 'Guwahati':"east", 'Gwalior':"west", 'Haridwar':"north", 'Howrah':"east",
       'Hyderabad':"south", 'Indore':"west", 'Jabalpur':"west", 'Jaipur':"west", 'Jalandhar':"north", 'Jammu':"north",
       'Jamnagar':"west", 'Jamshedpur':"east", 'Jhansi':"north", 'Jodhpur':"west", 'Junagadh':"west",
       'Kanpur':"north", 'Kharagpur':"east", 'Kochi':"south", 'Kolhapur':"west", 'Kolkata':"east", 'Kota':"west",
       'Lucknow':"north", 'Ludhiana':"north", 'Madurai':"south", 'Manali':"north", 'Mangalore':"south", 'Manipal':"south",
       'Meerut':"north", 'Mohali':"north", 'Mumbai':"west", 'Mussoorie':"north", 'Mysore':"south", 'Nagpur':"west",
       'Nainital':"north", 'Nashik':"west", 'Navi Mumbai':"west", 'Nayagaon':"north", 'Neemrana':"west",
       'New Delhi':"north", 'Noida':"north", 'Ooty':"south", 'Palakkad':"south", 'Panchkula':"north", 'Patiala':"north",
       'Patna':"east", 'Puducherry':"south", 'Pune':"west", 'Pushkar':"west", 'Raipur':"west", 'Rajkot':"west",
       'Ranchi':"east", 'Rishikesh':"north", 'Salem':"south", 'Secunderabad':"south", 'Shimla':"north",
       'Siliguri':"east", 'Srinagar':"north", 'Surat':"west", 'Thane':"west", 'Thrissur':"south", 'Tirupati':"south",
       'Trichy':'south', 'Trivandrum':'south', 'Udaipur':'west', 'Udupi':"south", 'Vadodara':"west", 'Varanasi':"north",
       'Vellore':"south", 'Vijayawada':"south", 'Vizag':"south", 'Zirakpur':"north"}

Inference- 
    
    creating a mapping of city -> region 
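
A dictionary like this can be applied to a column in one step with `Series.map`. The sketch below uses a hypothetical three-city subset of the mapping and one deliberately unknown city to show what happens to names that are missing from the dictionary.

```python
import pandas as pd

# Hypothetical subset of the mapping and a few sample rows
city_to_region = {"Agra": "north", "Mumbai": "west", "Chennai": "south"}
toy = pd.DataFrame({"city": ["Mumbai", "Agra", "Chennai", "Atlantis"]})

# map() looks each city up in the dictionary; unknown cities become NaN
toy["region"] = toy["city"].map(city_to_region)

# Listing the unmapped cities makes gaps in the dictionary easy to spot
unmapped = toy.loc[toy["region"].isna(), "city"].tolist()
print(toy["region"].tolist(), unmapped)
```

Unlike a plain dictionary lookup in a loop, which raises a `KeyError` on the first unknown city, `map` leaves a NaN that can be inspected afterwards.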

In [0]:
data.drop(data[data.city=='north'].index,inplace=True)
save_region=[]
for i in data.city:
    save_region.append(dict_region[i])

In [None]:
# print(len(save_region))
data['region']=save_region
data


Inference- 
    
    This creates a new column named 'region' depicting the region in which the city is situated

In [None]:
city_data=data.groupby('region')['city'].unique()
city_data

Inference- 
    
    This groups the cities lying in the same region together

<a id='more'> </a>
## 10. Some more Analysis

<b>Let us explore the data some more now that it has been cleaned and prepared. <br>
We now conduct analysis to compare the regions.</b>

### 1. To find which cities have expensive restaurants 

-  plot the cities which have the costliest restaurants. 

In [None]:
data[['name','city','average_cost_for_two','price_range']].head(100)

Inference- 
    
    This shows the name of the restaurant, the city, the average cost and the price range for the first 100 rows

In [None]:
save_costly=data[['city','average_cost_for_two']].groupby(['city']).mean()

values_for_graph=save_costly.sort_values(by=['average_cost_for_two'],ascending=False).head(5)

values_for_graph['city']=values_for_graph.index

values_for_graph

Inference-

        Obtaining the top 5 cities with the highest average cost for two people

In [None]:
sns.barplot(x='city',y='average_cost_for_two',data=values_for_graph)

Inference-

        plotting a graph for better understanding

### 2.  Comparing regions

### 2a. Highlights available in restaurants for different regions

For our analysis, we define the regions as northern, eastern, western and southern.

We first need to select the unique facilities available in each region and sort according to their frequencies.
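
The per-region steps described here (select the facilities, count them, sort by frequency) can also be sketched for all regions at once with a group-by. The rows below are hypothetical and only mimic the shape of the 'region' and 'highlights' columns; the cells that follow handle one region at a time instead.

```python
import pandas as pd

# Hypothetical rows with the same shape as the 'region' and 'highlights' columns
toy = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "highlights": ["Cash, Delivery", "Cash", "Cash, Wifi", "Wifi"],
})

# One row per (restaurant, facility), then tally facilities within each region
exploded = toy.assign(facility=toy["highlights"].str.split(", ")).explode("facility")
freq = (
    exploded.groupby("region")["facility"]
    .value_counts()
    .rename("frequency")
    .reset_index()
)

# Most frequent facility per region
top = (
    freq.sort_values(["region", "frequency"], ascending=[True, False])
    .groupby("region")
    .head(1)
)
print(top)
```

Replacing `head(1)` with `head(10)` would give the top 10 facilities per region in a single table.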

**Highlights of the northern region**

In [None]:
data_high_north=(data[['highlights']][data.region=='north'])
ans_north=(highlights(data_high_north.highlights))
store_inter_north=[]

df_highlights_north=pd.DataFrame(ans_north,columns=['Facility'])

for i in ans_north:
    store_inter_north.append(data_high_north['highlights'].str.count(i).sum())

df_highlights_north['frequency']=store_inter_north
df_highlights_north.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency facility used in North region 

**Highlights of the eastern region**

In [None]:
data_high_east=(data[['highlights']][data.region=='east'])
ans_east=(highlights(data_high_east.highlights))
store_inter_east=[]

df_highlights_east=pd.DataFrame(ans_east,columns=['Facility'])
for i in ans_east:
    store_inter_east.append(data_high_east['highlights'].str.count(i).sum())

df_highlights_east['frequency']=store_inter_east
df_highlights_east.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency facility used in east region 

**Highlights of the southern region**

In [None]:
data_high_south=(data[['highlights']][data.region=='south'])
ans_south=(highlights(data_high_south.highlights))
store_inter_south=[]

df_highlights_south=pd.DataFrame(ans_south,columns=['Facility'])
for i in ans_south:
    store_inter_south.append(data_high_south['highlights'].str.count(i).sum())

df_highlights_south['frequency']=store_inter_south
df_highlights_south.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency facility used in south region 

**Highlights of the western region**

In [None]:
data_high_west=(data[['highlights']][data.region=='west'])
ans_west=(highlights(data_high_west.highlights))
store_inter_west=[]

df_highlights_west=pd.DataFrame(ans_west,columns=['Facility'])
for i in ans_west:
    store_inter_west.append(data_high_west['highlights'].str.count(i).sum())

df_highlights_west['frequency']=store_inter_west
df_highlights_west.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency facility used in west region 

#### Plot the barplot for different regions

We shall now plot the graphs for top 10 highlights.

In [None]:
#for NORTH

import seaborn as sns
save_north=df_highlights_north.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='Facility',y='frequency',data=save_north)

In [None]:
#for EAST

import seaborn as sns
save_east=df_highlights_east.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='Facility',y='frequency',data=save_east)

In [None]:
#for SOUTH
import seaborn as sns
save_south=df_highlights_south.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='Facility',y='frequency',data=save_south)

In [None]:
#for WEST
import seaborn as sns
save_west=df_highlights_west.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='Facility',y='frequency',data=save_west)

Inference-

        Plotting the graphs for the respective regions namely - NORTH,SOUTH,EAST,WEST for top 10 facilities used in these regions


### 2b. Cuisines available in restaurants for different regions

**Cuisines in the northern region**

In [None]:
data_cui_north=(data[['cuisines']][data.region=='north'])
ans_north=(cuisines(data_cui_north.cuisines))
store_inter_north=[]

df_cui_north=pd.DataFrame(ans_north,columns=['cuisines'])

for i in ans_north:
    store_inter_north.append(data_cui_north['cuisines'].str.count(i).sum())

df_cui_north['frequency']=store_inter_north
df_cui_north.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency of cuisine in north region 

**Cuisines in the eastern region**

In [None]:
data_cui_east=(data[['cuisines']][data.region=='east'])
ans_east=(cuisines(data_cui_east.cuisines))
store_inter_east=[]

df_cui_east=pd.DataFrame(ans_east,columns=['cuisines'])
for i in ans_east:
    store_inter_east.append(data_cui_east['cuisines'].str.count(i).sum())

df_cui_east['frequency']=store_inter_east
df_cui_east.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency of cuisine in east region 

**Cuisines in the southern region**

In [None]:
data_cui_south=(data[['cuisines']][data.region=='south'])
ans_south=(cuisines(data_cui_south.cuisines))
store_inter_south=[]

df_cui_south=pd.DataFrame(ans_south,columns=['cuisines'])
for i in ans_south:
    store_inter_south.append(data_cui_south['cuisines'].str.count(i).sum())

df_cui_south['frequency']=store_inter_south
df_cui_south.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency of cuisine in south region 

**Cuisines in the western region** 

In [None]:
data_cui_west=(data[['cuisines']][data.region=='west'])
ans_west=(cuisines(data_cui_west.cuisines))
store_inter_west=[]

df_cui_west=pd.DataFrame(ans_west,columns=['cuisines'])
for i in ans_west:
    store_inter_west.append(data_cui_west['cuisines'].str.count(i).sum())

df_cui_west['frequency']=store_inter_west
df_cui_west.sort_values(by=['frequency'],ascending=False)

Inference-

        Getting the highest frequency of cuisine in west region 

- Plot the barplot for top 10 cuisines served in the four regions

In [None]:
#for NORTH

import seaborn as sns
save_north=df_cui_north.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='cuisines',y='frequency',data=save_north)

In [None]:
#for EAST

import seaborn as sns
save_east=df_cui_east.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='cuisines',y='frequency',data=save_east)

In [None]:
#for SOUTH

import seaborn as sns
save_south=df_cui_south.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='cuisines',y='frequency',data=save_south)

In [None]:
#for WEST

import seaborn as sns
save_west=df_cui_west.sort_values(by=['frequency'],ascending=False).head(10)
sns.barplot(x='cuisines',y='frequency',data=save_west)


Inference-

        Plotting the graphs for the respective regions, namely NORTH, SOUTH, EAST and WEST, for the top 10 cuisines served in these regions

###  3. The Northern Region

**Now we shall consider only the northern region**

**1. The top 10 cuisines served in Restaurants** 

In [None]:
df_cui_north.sort_values(by=['frequency'],ascending=False).head(10)

Inference-

        Getting the highest frequency of cuisine in north region 

**2. Do restaurants with more photo counts and votes have better rating?**

In [None]:
data_north=data[data.region=='north']
data_north.head(5)

data_north[['aggregate_rating','rating_text','photo_count','votes']].sort_values(by=['photo_count','votes'],ascending=False).head(50)

Inference- 
            
            By observing the table above, one can see that when the photo counts and votes are high, the rating text is either "Very good" or "Excellent" and the rating is above 4.2, which is a high rating.

    Hence, among the 50 restaurants with the most photos and votes, higher counts go together with better ratings. This table alone does not show whether the pattern holds for the rest of the northern region.
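
One way to check the relationship across all rows rather than only the top of a sorted table is a rank correlation. The sketch below uses hypothetical values chosen only to illustrate the call; on the real data the same call could be made on `data_north[['aggregate_rating','photo_count','votes']]`.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the call
toy = pd.DataFrame({
    "aggregate_rating": [2.5, 3.0, 3.6, 4.1, 4.6],
    "photo_count": [3, 10, 40, 150, 900],
    "votes": [10, 50, 200, 1000, 5000],
})

# Spearman correlation works on ranks, so the long right tail of the counts
# does not dominate the result
rank_corr = toy.corr(method="spearman")
print(rank_corr.loc["votes", "aggregate_rating"])
```

A coefficient near 1 would support the inference above, while a value near 0 would suggest the pattern is limited to the most photographed restaurants.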

- Plot boxplots for the above table

In [None]:
data_for_boxplot=data_north[['aggregate_rating','photo_count','votes']]
ax = sns.boxplot(data=data_for_boxplot, orient="h", palette="Set2")

In [None]:
sns.boxplot(x='photo_count',y='aggregate_rating',data=data_for_boxplot)

In [None]:
sns.boxplot(x='votes',y='aggregate_rating',data=data_for_boxplot)

Inference- 
        
    plotting various boxplots to understand the relationship between the photo count and votes and their individual influence on the aggregate rating of the restaurant

### 4. The Mumbai city

Consider the city of Mumbai to get better insights into its restaurants.

In [None]:
df_mum=data[data.city=='Mumbai']
df_mum.head()

Inference-

        Getting data only related to mumbai

**1. Expensive restaurants in Mumbai**

-  Define the costliest restaurants as those whose average cost for two people exceeds Rs. 5000.
-  Plot the costliest restaurants based on their average cost for two.




In [None]:
df_mum['name'][df_mum.average_cost_for_two>5000].head()

In [None]:
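# Keep only restaurants above Rs. 5000, as defined above, before taking the top 5
save_ans=df_mum.loc[df_mum.average_cost_for_two>5000,['name','average_cost_for_two']].sort_values(by=['average_cost_for_two'],ascending=False).drop_duplicates().head(5)
save_ans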

Inference-

        getting the costliest restaurants whose average cost for two people exceeds Rs. 5000.

In [None]:
# sns.barplot(x='name',y='average_cost_for_two',data=save_ans)

ans_plot=save_ans.plot.bar(x='name',y='average_cost_for_two')
ans_plot

Inference-

        plotting the values on graph for better understanding

**2. To find the top 20 cuisines of Mumbai**

- select unique cuisines available at restaurants in Mumbai


- sort cuisines based on frequency


In [None]:
mumbai_cuisines=cuisines(df_mum['cuisines'])
print(mumbai_cuisines)


Inference-

        getting the unique cuisine in Mumbai region

In [None]:
store_cuisine_inter_mum=[]

df_cuisine_mumbai=pd.DataFrame(mumbai_cuisines,columns=['Cuisines_Of_Mumbai'])
for i in mumbai_cuisines:
    store_cuisine_inter_mum.append(df_mum['cuisines'].str.count(i).sum())


df_cuisine_mumbai['frequency']=store_cuisine_inter_mum
df_cuisine_mumbai.sort_values(by=['frequency'],ascending=False).head(20)

Inference-

        getting the frequency of the top 20 cuisines in the mumbai region

**3. To find the popular localities in Mumbai**

In [None]:
df_mum_local=df_mum[['locality']]
df_mum_local['locality'].value_counts().head(5)

Inference-

        getting the top 5 popular localities in Mumbai, based on the number of restaurants in each

**4. Check for relationship between 'aggregate_rating' and 'average_cost_for_two'**

In [None]:
import seaborn as sns
new_corr=df_mum[['aggregate_rating','average_cost_for_two']]
corr_new=new_corr.corr()
# Set the figure size before plotting so that it applies to this heatmap
sns.set(rc={'figure.figsize':(11,8)})
sns.heatmap(corr_new,annot=True)

Inference-

        getting the correlation between aggregate_rating and average_cost_for_two

**5. Multiple box plot for photo_counts based on establishment type.**



In [None]:
sns.boxplot(x='photo_count',y='establishment',data=data)
sns.set(rc={'figure.figsize':(30,1)})

Inference-

        plotting box plots of the photo counts for each establishment type

**6. Check for payment methods offered in restaurants**

- select unique facilities available at restaurants in western region
- sort facilities based on frequency


In [0]:
#for WEST
data_high_west=(data[['highlights']][data.region=='west'])
ans_west=(highlights(data_high_west.highlights))
store_inter_west=[]

df_highlights_west=pd.DataFrame(ans_west,columns=['Facility'])
for i in ans_west:
    store_inter_west.append(data_high_west['highlights'].str.count(i).sum())

df_highlights_west['frequency']=store_inter_west
df_highlights_west.sort_values(by=['frequency'],ascending=False)


import seaborn as sns
save_west=df_highlights_west.sort_values(by=['frequency'],ascending=False).head(10)

In [None]:
# Set the figure size before plotting so that it applies to this bar plot
sns.set(rc={'figure.figsize':(30,10)})
sns.barplot(x='Facility',y='frequency',data=save_west)

Inference-

        * getting the unique facilities in the western region 
        * sorting them by the frequency with which they appear


THANK YOU
-DARSHAN GANDHI