<a href="https://colab.research.google.com/github/yash5891/Python-Programming-Assignments/blob/main/plotly_tutorial_on_zomato.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***About Dataset:***
Zomato is an Indian multinational restaurant aggregator and food delivery company founded by Deepinder Goyal and Pankaj Chaddah in 2008. Zomato provides information, menus and user-reviews of restaurants as well as food delivery options from partner restaurants in select cities.

This dataset contains information about restaurants in Bangalore that work with Zomato. The data was scraped from Zomato in two phases.

Going through the structure of the website showed that each neighborhood lists 6-7 categories of restaurants, viz. Buffet, Cafes, Delivery, Desserts, Dine-out, Drinks & nightlife, and Pubs and bars. Here we try to find the best restaurants for customers depending on their needs.

# ***Exploratory Analysis:***
To begin this exploratory analysis, we first import the libraries used for processing and plotting the data. Depending on the data, not all plots will be made.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# ***Data cleaning:-***
* Deleting redundant columns
* Renaming the columns
* Dropping duplicates
* Cleaning the individual columns
* Remove the NaN values from the dataset
* Check for some more transformations

### **Reading CSV:**

In [None]:
df=pd.read_csv('/kaggle/input/zomato-eda/zomato.csv')
df.head(5)

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.info()

### **Dropping the unnecessary columns:-**
* Unnecessary columns are those that are not required for this analysis, so we can drop them.
* e.g. `url`, `address`, `phone`, `menu_item`, `dish_liked`, `reviews_list` and `listed_in(city)`

In [None]:
df.drop(['url','address','phone','menu_item','dish_liked','reviews_list','listed_in(city)'],axis=1,inplace=True)
df.head(5)

In [None]:
df.info()

### **Checking the null values by columns:-**
This gives the number of null values in each column.

In [None]:
df.isnull().sum()

In [None]:
df.isnull().sum().sum() #Total null/NaN values
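
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to view both; the `toy` frame and its values are made up for illustration and only stand in for the Zomato data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Zomato frame; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rate": ["4.1/5", np.nan, "3.8/5", np.nan],
    "cuisines": ["Cafe", "North Indian", np.nan, "Chinese"],
})

null_counts = toy.isnull().sum()         # NaN count per column
null_share = toy.isnull().mean() * 100   # NaN percentage per column

# Put both views side by side, worst column first.
summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary.sort_values("percent", ascending=False))
```

A column with a very high percentage is a candidate for dropping outright rather than dropping its rows.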

### **Renaming columns:-**
Changing the names of columns for better understanding

In [None]:
# 'listed_in(city)' was dropped above, so it needs no new name
df.rename(columns={'name':'restaurants','book_table':'booking','rate':'rating','approx_cost(for two people)':'cost','listed_in(type)':'types'},inplace=True)
df.head(5)

### **Dropping NaN values:-**
Dropping rows with NaN values leaves only complete records, which keeps the later cleaning steps and plots simple.

In [None]:
len(df)

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().values.any()

In [None]:
df.info()
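
`dropna()` with no arguments removes a row if any column is missing, which can discard more data than intended. As a rough sketch on made-up data, the `subset` parameter limits the check to the columns that matter; the `toy` frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "restaurants": ["A", "B", "C", "D"],
    "rating": ["4.1/5", np.nan, "3.8/5", "4.0/5"],
    "cuisines": ["Cafe", "Pizza", np.nan, "Chinese"],
})

all_cols = toy.dropna()                    # drops a row if ANY column is NaN
key_cols = toy.dropna(subset=["rating"])   # drops only rows missing a rating

rows_lost_all = len(toy) - len(all_cols)
rows_lost_key = len(toy) - len(key_cols)
print(rows_lost_all, rows_lost_key)
```

Comparing `len(df)` before and after the drop, as the cells above do, shows how many rows each choice costs.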

### **Finding out the duplicate rows:-**
* `duplicated()` flags rows that repeat an earlier row exactly.
* Removing them keeps each listing from being counted more than once.

In [None]:
#Number of duplicate rows (count().sum() would count cells, not rows)
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
#Shows the count of rows and columns after removing the duplicates
df.shape

In [None]:
#Number of duplicate rows left after removing them (expected 0)
df.duplicated().sum()
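
There are two ways to count duplicates, and they measure different things. `duplicated().sum()` counts rows, while `df[df.duplicated()].count().sum()` adds up the non-null cells in those rows. A small sketch on invented data shows the gap:

```python
import pandas as pd

# Invented rows: the second row repeats the first exactly.
toy = pd.DataFrame({
    "restaurants": ["A", "A", "B", "B", "C"],
    "location": ["BTM", "BTM", "HSR", "HSR", "BTM"],
    "votes": [10, 10, 5, 7, 3],
})

dup_mask = toy.duplicated()           # True for each repeat of an earlier row
n_duplicates = int(dup_mask.sum())    # rows that drop_duplicates() would remove
cell_count = int(toy[dup_mask].count().sum())  # non-null cells in those rows

deduped = toy.drop_duplicates()
print(n_duplicates, cell_count, len(deduped))
```

Both numbers fall to zero after `drop_duplicates()`, so either works as a check, but only the first is a row count.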

### **Cleaning individual columns**:-

***column:-ratings***


In [None]:
df['rating'].unique()

In [None]:
df['rating'].replace("NEW|-",'0',regex=True).replace('/5','',regex=True).unique()

In [None]:
df['rating']=df['rating'].replace("NEW|-",'0',regex=True).replace('/5','',regex=True).astype("float")
df['rating'].head()

In [None]:
#Counting the NaN values in rating column
df['rating'].isnull().sum()

In [None]:
df.info()
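
Mapping `NEW` and `-` to `0` treats unrated restaurants as rated zero, which lowers any average computed later. The sketch below, on invented strings in the same format, compares that choice with an alternative that keeps unrated entries as NaN through `pd.to_numeric(..., errors="coerce")`. The alternative is a suggestion, not what the notebook does.

```python
import pandas as pd

# Invented raw values in the same format as the 'rating' column.
raw = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", "4.5/5"])

# Remove the "/5" suffix and any stray whitespace.
stripped = raw.str.replace("/5", "", regex=False).str.strip()

# Notebook's choice: unrated entries become 0.
as_zero = pd.to_numeric(stripped.replace({"NEW": "0", "-": "0"}))

# Alternative (an assumption, not the notebook's method): unrated entries become NaN.
as_nan = pd.to_numeric(stripped, errors="coerce")

print(as_zero.mean(), as_nan.mean())
```

With NaN, `mean()` skips the unrated rows, so the average reflects only restaurants that actually have a rating.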

***column:-location***

In [None]:
df['location'].unique()

In [None]:
df['location'].isnull().sum()

In [None]:
a=df['location'].value_counts(ascending=False)
a

***column:-cost***

In [None]:
df['cost'].unique()

In [None]:
df['cost']=df['cost'].replace(",",'',regex=True).astype(int)

In [None]:
df.cost.unique()

In [None]:
df['cost'].isnull().sum()

In [None]:
df.info()

***column:-booking***

In [None]:
df['booking'].unique()

In [None]:
df['booking'].isnull().sum()

***column:-online order***

In [None]:
df['online_order'].unique()

In [None]:
df['online_order'].isnull().sum()

***column:-rest type***

In [None]:
df['rest_type'].isnull().any()

In [None]:
df['rest_type'].unique()

In [None]:
b=df['rest_type'].value_counts(ascending=False)
b

***Column:-listed in type***

In [None]:
df['types'].unique()

In [None]:
c=df['types'].value_counts()
c

In [None]:
df['types'].isnull().sum()

***column:-cuisines***

In [None]:
d=df['cuisines'].value_counts()
d

In [None]:
df['cuisines'].unique()

***column:-restaurants***

In [None]:
df.groupby('restaurants').count().head()

In [None]:
df['restaurants']=df['restaurants'].str.replace('[Ãx][^A-Za-z]+','',regex=True)

In [None]:
df.groupby('restaurants').count().head()

### **Checking NaN values after cleaning individual columns:**

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
df.to_csv('./clean_zomato.csv')

# ***Data visualization:-***
* Restaurants delivering Online or not
* Table booking vs rating
* Best Location
* Relation between Location and Rating
* Cost of Restaurant
* No. of restaurants in a Location
* Restaurant type
* Most famous restaurant chains in Bengaluru

**Plotly** is an open-source Python library for interactive data visualization. It supports many plot types, such as line charts, scatter plots, histograms and box plots. Some reasons to choose Plotly over other visualization libraries:

- Hover tooltips show the exact values behind each point, which makes it easier to spot outliers or anomalies among a large number of data points.
- The default styling is visually attractive and suits a wide range of audiences.
- Graphs are highly customizable, which helps make a plot more meaningful and understandable for others.

In [None]:
#Restaurant type
type1=df['rest_type'].value_counts().head(5)
type1.to_frame()

In [None]:
fig=px.bar(x=type1.index,
           y=type1,
           color=type1.index # Each bar is colored differently
           ,title='Restaurant type',labels={'x':'Restaurant','y':'Count of Restaurant'} # changes x and y label
          )
fig.show()

* This graph shows the most common types of restaurants in Bangalore; we have plotted the top 5 restaurant types.
* The top 3 restaurant types are:
1. Quick bites
2. Casual dining
3. Cafe

In [None]:
delivery_or_not = df["online_order"].value_counts()
delivery_or_not

In [None]:
fig=px.pie(delivery_or_not,
           title='Delivery or not',
           values=delivery_or_not, # size of each slice
           names=delivery_or_not.index # slice labels, shown in the legend
          )
fig.show()

* Here we have plotted a pie chart showing whether restaurants accept online orders.
* From the chart we conclude that:
1. 64% of restaurants deliver online.
2. 36% of restaurants currently do not provide online delivery.
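
The percentages can also be computed directly rather than read off the chart. A small sketch on an invented series uses `value_counts(normalize=True)`; the values below are not the notebook's data.

```python
import pandas as pd

# Invented answers; the real column is df["online_order"].
online = pd.Series(["Yes", "Yes", "No", "Yes", "No"], name="online_order")

# normalize=True returns fractions instead of counts.
share = online.value_counts(normalize=True).mul(100).round(1)
print(share)
```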

In [None]:
fig = px.violin(df,x='online_order', y="rating",
                color='online_order',
                box=True, # draw box plot inside the violin
                points='outliers', # can be 'outliers', or False
               )
fig.show()

* Restaurants that accept online orders reach the highest ratings, while the lowest ratings belong to restaurants that do not.
* The average rating of restaurants that accept online orders is higher.
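
Reading averages from a violin plot is approximate, so it can help to compute them. The sketch below groups made-up ratings by `online_order` and reports a few summary statistics. The numbers are illustrative only.

```python
import pandas as pd

# Made-up ratings; only the column names match the notebook.
toy = pd.DataFrame({
    "online_order": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "rating": [4.0, 3.5, 4.5, 3.0, 4.8, 2.5],
})

# One row per group, one column per statistic.
stats = toy.groupby("online_order")["rating"].agg(["mean", "median", "min", "max"])
print(stats)
```

In this toy data the group with the single highest rating has the lower mean, which is why the maximum and the average are worth checking separately.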

In [None]:
fig = px.box(df, x='booking',
             y='rating',
             color='booking'
            )
fig.show()

* Restaurants that accept table bookings have the highest ratings, while restaurants that do not have the lowest.
* Restaurants aiming for higher ratings could consider taking bookings, though the plot shows an association rather than a cause.

In [None]:
#Best location(votes)
loc_vote=df.groupby('location')['votes'].sum()
loc_vote=loc_vote.nlargest(5).sort_values(ascending=True)
loc_vote

In [None]:
fig=px.bar(y=loc_vote.index,
           x=loc_vote,
           color=loc_vote.index # Each bar is colored differently
           ,title='Best location by votes',labels={'x':'Votes','y':'Location'} # changes x and y label
          )
fig.show()

* Here we have plotted a bar plot of restaurant locations in Bangalore ranked by total votes.
* The graph shows which of the top 5 locations received the most votes and which the fewest.
* The top 3 locations are:
1. Koramangala 5th Block
2. Indiranagar
3. Koramangala 4th Block

In [None]:
#Best location(ratings) & Location vs Rating
loc_rate=df.groupby('location')['rating'].sum()
loc_rate=loc_rate.nlargest(5).sort_values()
loc_rate.head()

In [None]:
fig=px.bar(y=loc_rate.index,
           x=loc_rate,
           color=loc_rate.index # Each bar is colored differently
           ,title='Best location by rating',labels={'x':'Rating','y':'Location'} # changes x and y label
          )
fig.show()

* Here we have plotted a bar plot of restaurant locations in Bangalore ranked by the sum of their ratings.
* Because the ratings are summed, locations with many restaurants rank higher.
* The top 3 locations are:
1. BTM
2. Koramangala 5th Block
3. Indiranagar
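
Summing ratings per location rewards locations that simply have more restaurants. As a sketch of the difference, the example below ranks two locations by sum and by mean. The ratings are invented, and ranking by mean is offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Invented ratings: one busy location with average scores,
# one small location with high scores.
toy = pd.DataFrame({
    "location": ["BTM"] * 4 + ["Lavelle Road"] * 2,
    "rating": [3.5, 3.6, 3.4, 3.5, 4.4, 4.6],
})

grouped = toy.groupby("location")["rating"]
by_sum = grouped.sum().sort_values(ascending=False)    # favors many restaurants
by_mean = grouped.mean().sort_values(ascending=False)  # favors high ratings

print(by_sum.index[0], by_mean.index[0])
```

The two rankings disagree here, so the choice of aggregate decides what "best location" means.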

In [None]:
fig=px.histogram(df['cost'],nbins=100)
fig.show()

* The graph shows that the most common cost falls in the 400 - 499 range.
* Most restaurant costs lie between 0 and 1000.
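
The busiest cost range can be confirmed numerically by binning the costs with `pd.cut`. The sketch below uses invented costs and bins of width 100, chosen here as an assumption to match ranges like 400 - 499.

```python
import pandas as pd

# Invented costs; the real values live in df["cost"].
cost = pd.Series([150, 300, 400, 450, 499, 600, 800, 1200])

# Left-closed bins of width 100: [0, 100), [100, 200), ...
edges = list(range(0, 1501, 100))
binned = pd.cut(cost, bins=edges, right=False)

counts = binned.value_counts().sort_index()  # restaurants per bin, in bin order
top_bin = counts.idxmax()                    # the most populated bin
print(top_bin, counts.max())
```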

In [None]:
#Most famous restaurant chains in Bengaluru(votes)
vote=df.groupby('restaurants')[['votes','rating']].agg('mean').sort_values(by=['rating','votes'],ascending=False).head(10)
vote

In [None]:
fig=px.bar(y=vote['votes'],
           x=vote.index,
           color=vote.index # Each bar is colored differently
           ,title='Top restaurants by votes',labels={'x':'Restaurant','y':'Votes'} # changes x and y label
          )
fig.show()

In [None]:
fig=px.bar(y=vote['rating'],
           x=vote.index,
           color=vote.index # Each bar is colored differently
           ,title='Top restaurants by rating',labels={'x':'Restaurant','y':'Rating'} # changes x and y label
          )
fig.show()

* These graphs show the most famous restaurants in Bangalore with respect to both votes and ratings.
* According to votes, Byg Brewski Brewing Company has the highest count.



In [None]:
fig=px.scatter(df,x='cost',y='rating',color='online_order')
fig.show()

* This graph plots rating against cost for each restaurant.
* It suggests that most costly restaurants do not provide online orders.
* High-cost restaurants rarely have low ratings.
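
One way to check the online-order observation is to split restaurants into cost bands and compare the share that accept online orders in each. The sketch below does this with `pd.crosstab` on invented data. The 1000 threshold is an assumption chosen for illustration.

```python
import pandas as pd

# Invented data; the column names match the notebook.
toy = pd.DataFrame({
    "cost": [200, 300, 400, 500, 1500, 2000, 2500, 3000],
    "online_order": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
})

# Two cost bands split at 1000 (an arbitrary threshold for this sketch).
toy["cost_band"] = pd.cut(toy["cost"], bins=[0, 1000, 5000],
                          labels=["<=1000", ">1000"])

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(toy["cost_band"], toy["online_order"], normalize="index")
print(shares)
```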


# ***Thank you....***

To explore Plotly further, visit this link:
https://plotly.com/python/plotly-express/