#### The goal of this exploratory data analysis (EDA) is to understand customer preferences and restaurant performance in Bangalore using the Zomato dataset. Specifically, this analysis aims to:

#### Explore the Relationship Between Ratings and Orders:
>By examining how restaurant ratings relate to whether a restaurant accepts online orders, we aim to uncover how customer satisfaction (via ratings) is linked to the popularity and demand for restaurants in Bangalore.

#### Explore the Preferences of People in Bangalore:
>We identify the types of food most preferred by people in different regions of the city.

#### Perform Geospatial Analysis: 
>This analysis will focus on identifying the geographical distribution of different cuisines across Bangalore. We will investigate which areas have the highest concentration of specific cuisines.

#### The findings will help provide insights into:

>How customer ratings influence restaurant success. <br>
>Popular food trends in Bangalore.<br>
>The areas in Bangalore where specific types of cuisine are most in demand, potentially helping restaurants make data-driven decisions about location and menu offerings.


### Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3
import folium
from nltk.tokenize import RegexpTokenizer
from folium.plugins import HeatMap
from nltk.corpus import stopwords
from nltk import FreqDist, bigrams , trigrams
from geopy.geocoders import Nominatim

#### Reading data from a sqlite file

In [None]:
db =sqlite3.connect(r"zomato_rawdata.sqlite")

In [None]:
pd.read_sql_query("SELECT * FROM Users", db).head()

Creating a pandas DataFrame using a SQL query against the database

In [None]:
df = pd.read_sql_query("SELECT * FROM Users" , db)

In [None]:
df.shape

In [None]:
df.columns

#### Checking for missing values

In [None]:
df.isnull().sum()

In [None]:
# Calculate the percentage of missing values in each column
null_percentage = (df.isnull().sum() / len(df)) * 100
# Format the result as percentages with 2 decimal places
null_percentage_formatted = null_percentage.map('{:.2f}%'.format)

print(null_percentage_formatted)


#### Since approximately 50% of the data would be lost by removing the missing values in the dish_liked column, we will retain this column for now.

#### Next, let's examine the rate column, which has 15% missing values. Given that this is a key feature, it's important to address this carefully.

In [None]:
df['rate'].unique()

#### I identified that this column contains 'NEW' and '-' values. These should be replaced with zero or np.nan.

#### I also noticed some entries like '3.8/5' instead of just '3.8'. We'll need to clean and standardize these values.
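
A minimal sketch of this clean-up on a toy Series. The sample values below are made up for illustration and are not taken from the dataset; `pd.to_numeric` with `errors='coerce'` is one possible way to do the conversion, not necessarily the one used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the kinds of values seen in the 'rate' column
raw = pd.Series(['4.1/5', '3.8 /5', 'NEW', '-', None, '2.9/5'])

# Treat placeholder strings as missing rather than as a rating of zero
cleaned = raw.replace(['NEW', '-'], np.nan)

# Keep the part before '/', trim whitespace, then convert to float
# (anything that still cannot be parsed becomes NaN)
cleaned = pd.to_numeric(cleaned.str.split('/').str[0].str.strip(), errors='coerce')

print(cleaned.tolist())
```

Using NaN instead of zero keeps unrated restaurants out of averages and out of the rating axis in later plots.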


In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['rate'] = df['rate'].replace(['NEW', '-'], np.nan)

In [None]:
df['rate']=df['rate'].apply(lambda x: float(x.split('/')[0]) if type(x) == str else x)

In [None]:
df.rate.dtype

In [None]:
df.rate

In [None]:
df['rate'].unique()

 We aim to explore how many restaurants with ratings such as 0, 1, 1.2, 1.4, 1.6, and so on accept or do not accept online orders.

>To address this, we'll create frequency tables to capture the distribution of ratings across restaurants that accept online orders and those that don't.



In [None]:
x = pd.crosstab(df['rate'] , df['online_order'])
x

In [None]:
x.plot(kind='bar', stacked=True, color=['#2D2D2D','#cb202d'], figsize=(10, 6))

# Add title and labels
plt.title('Ratings vs Nature of Orders', fontsize=16)
plt.xlabel('Ratings', fontsize=12)
plt.ylabel('No. of Restaurants', fontsize=12)


plt.xticks(rotation=45, ha='right')

# Add a grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add a legend
plt.legend(title='Online Order?', loc='upper right', fontsize=10)

# Show plot
plt.tight_layout()
plt.show()


Raw counts make it hard to compare ratings that have very different numbers of restaurants, so we normalize the values in the x DataFrame across rows. To achieve this, we divide each row by its row total using x.div() with axis=0, then multiply by 100 to get percentages.

div() is a built-in pandas DataFrame method for element-wise floating-point division.
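
The row normalization can be sketched on a toy table as follows. The data here is hypothetical; the `normalize='index'` argument of `pd.crosstab` is shown only as an equivalent shortcut, not as the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per restaurant
toy = pd.DataFrame({
    'rate': [3.5, 3.5, 3.5, 4.2, 4.2],
    'online_order': ['Yes', 'No', 'Yes', 'Yes', 'Yes'],
})

# Frequency table: rows are ratings, columns are Yes/No
counts = pd.crosstab(toy['rate'], toy['online_order'])

# Divide each row by its own total, then scale to percentages
pct_div = counts.div(counts.sum(axis=1), axis=0) * 100

# Equivalent shortcut: let crosstab normalize across each row
pct_norm = pd.crosstab(toy['rate'], toy['online_order'], normalize='index') * 100

print(pct_div.round(1))
```

After normalization every row sums to 100, so the stacked bars all have the same height and only the Yes/No split differs between ratings.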

In [None]:
normalize_df = x.div(x.sum(axis=1).astype(float) , axis=0)*100

In [None]:
normalize_df.plot(kind='bar', stacked=True, color=['#2D2D2D','#cb202d'], figsize=(10, 6))

# Add title and labels
plt.title('Ratings vs Nature of Orders (Normalized)', fontsize=16)
plt.xlabel('Ratings', fontsize=12)
plt.ylabel('Percentage of Restaurants', fontsize=12)


plt.xticks(rotation=45, ha='right')

# Add a grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add a legend
plt.legend(title='Online Order?', loc='upper right', fontsize=10)

# Show plot
plt.tight_layout()
plt.show()


### Conclusion:

>For well-rated restaurants (ratings above 4), the normalized chart suggests that, at most rating levels, restaurants that accept online orders make up a larger share than those that do not.



### Data Cleaning to perform Text Analysis

We are going to look at the different kinds of restaurants in the data, but first we need to remove rows where the restaurant type is missing.

In [None]:
rest_data = df.dropna(subset=['rest_type']).reset_index(drop=True)


In [None]:
rest_data['rest_type'].value_counts()

Let's pick 'Quick Bites' restaurants for a closer inspection:

In [None]:
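# .copy() gives an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
quick_bites_df = rest_data[rest_data['rest_type'].str.contains('Quick Bites')].copy()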

In [None]:
quick_bites_df.shape

In [None]:
quick_bites_df.reviews_list

In [None]:
quick_bites_df.columns

### Text Data Pre-processing Steps:
>Convert Text to Lowercase: Transform all text data to lowercase for uniformity.<br>
>Tokenization: Break down the text into individual tokens (words).<br>
>Remove Stopwords: Eliminate common stopwords (e.g., "and", "the") from the data to focus on meaningful words.<br>
>Store Data in a List: Store the processed data in a list to compute word frequency.<br>
>Plot Word Frequencies: Perform Unigram, Bigram, and Trigram analysis to visualize word frequencies and patterns.
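
The steps above can be sketched end to end with only the standard library. The reviews and the stopword set below are hypothetical, and `preprocess` and `ngrams` are illustrative helper names; the notebook itself uses NLTK's `RegexpTokenizer`, `stopwords`, and `FreqDist` for the same steps.

```python
import re
from collections import Counter

# Hypothetical reviews and a tiny stopword set, for illustration only
reviews = [
    "The food was good and the service was good",
    "Good food, pocket friendly!",
]
stop_words = {"the", "was", "and"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Return consecutive n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Flatten all cleaned tokens into one list, as the notebook does
all_tokens = [tok for review in reviews for tok in preprocess(review)]

unigram_counts = Counter(all_tokens)
bigram_counts = Counter(ngrams(all_tokens, 2))
trigram_counts = Counter(ngrams(all_tokens, 3))

print(unigram_counts.most_common(3))
```

One design caveat: because all reviews are flattened into a single token list, a few n-grams will span the boundary between two reviews. With a large corpus these are rare compared with genuine phrases, but they can be avoided by building n-grams per review before counting.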


In [None]:
#Transforming all text data to lowercase 
quick_bites_df['reviews_list'] = quick_bites_df['reviews_list'].apply(lambda x:x.lower())

In [None]:
## Create a regular-expression tokenizer that keeps only alphabetic tokens, i.e. drops digits and special characters

tokenizer = RegexpTokenizer("[a-zA-Z]+")

In [None]:
tokenizer

In [None]:
# Tokenize the lowercased Quick Bites reviews (not the full df)
reviews_tokens = quick_bites_df['reviews_list'].apply(tokenizer.tokenize)

In [None]:
reviews_tokens

In [None]:
#importing stopwords of English
stop = stopwords.words('english')

In [None]:
# Adding custom words to stopwords 
stop.extend(['rated' , "n" , "nan" , "x" , "RATED" , "Rated"])

In [None]:
## remove stopwords from "reviews_tokens" Series ..
reviews_tokens_clean = reviews_tokens.apply(lambda x : [token for token in x if token not in stop])

In [None]:
reviews_tokens_clean

In [None]:
#converting the reviews into a 2d list
rev = reviews_tokens_clean.tolist()

In [None]:
#extracting words from each reviews to count them 
total_reviews=[]

for review in rev:
    for word in review:
        total_reviews.append(word)

In [None]:
total_reviews

Unigram analysis

In [None]:
fd = FreqDist()

In [None]:
for word in total_reviews:
    fd[word] = fd[word] + 1

In [None]:
# Examining the top 20 most frequent words
fd.most_common(20)

In [None]:
fd.plot(20)

#### Observations
>The 20 most frequent words in customer reviews include "place," "food," "good," "chicken," "taste," "service," and "ambience"
<br>
>However, it's not entirely clear whether the food is actually good based solely on these words. Similarly, we need to examine the context of mentions regarding "chicken."
<br>
To derive more meaningful insights, we should consider performing a Bi-gram analysis.<br>

Bi-gram analysis

In [None]:
bi_grams = bigrams(total_reviews)

In [None]:
fd_bigrams = FreqDist()

for bigram in bi_grams:
    fd_bigrams[bigram] = fd_bigrams[bigram] + 1

In [None]:
fd_bigrams.most_common(20)

In [None]:
fd_bigrams.plot(20)

In [None]:
fd_bigrams.most_common(100)

### Observations
We have gained some new insights! The food items and preferences highlighted in the top 50 bigrams include:

- Fried Rice
- North Indian
- Indian food
- Non-Veg
- Chicken Biryani
- Main Course

Factors contributing to the restaurant experience are:
- Good Food
- Good Service
- Pocket-Friendly
- Good Ambience
- Friendly staff behaviour
- Home Delivery

A key insight here is that the expense factor, which was overlooked in the individual word frequency counts, has been captured through the bigram frequency counts.


Tri-gram Analysis

In [None]:
tri_grams = trigrams(total_reviews)
fd_trigrams = FreqDist()

for trigram in tri_grams:
    fd_trigrams[trigram] = fd_trigrams[trigram] + 1

In [None]:
fd_trigrams.plot(20)

In [None]:
fd_trigrams.most_common(100)

The specific food preferences highlighted include:
- North Indian food
- Paneer Butter Masala
- White Sauce Pasta
- Vanilla Ice Cream
- Various chicken items

This indicates that Bangalore is home to many chicken lovers.

### Extract geographical coordinates from the data 

In [None]:
#!pip install geocoder
#!pip install geopy

In [None]:
df['location']

In [None]:
df['location'].unique()

In [None]:
#No. of unique locations
len(df['location'].unique())

In [None]:
# Append the city, state and country so the geocoder can resolve each area unambiguously
df['location'] = df['location'] + " , Bangalore  , Karnataka , India "

In [None]:
df['location'].unique()

In [None]:
df_copy = df.copy()

In [None]:
df_copy = df_copy.dropna(subset=['location'])

In [None]:
locations = pd.DataFrame(df_copy['location'].unique())

In [None]:
locations

In [None]:
#Naming the column
locations.columns = ['Area']

In [None]:
from geopy.geocoders import Nominatim

In [None]:
### assign timeout=None in order to get rid of timeout error..
geolocator = Nominatim(user_agent="app" , timeout=None)

In [None]:
#to assign co-ordinates into a list
lat=[]
lon=[]

for location in locations['Area']:
    location = geolocator.geocode(location)
    if location is None:
        lat.append(np.nan)
        lon.append(np.nan)
    else:
        lat.append(location.latitude)
        lon.append(location.longitude)
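
Public geocoding services are usually rate-limited, so a long loop like the one above may benefit from a pause between requests and from tolerating failed lookups. The following is a minimal sketch under those assumptions: `geocode_areas` is a hypothetical helper, the one-second default delay is an assumed value, and `fake_geocode` is a stand-in so the example runs offline. In the notebook, `geolocator.geocode` could be passed in its place.

```python
import math
import time
from types import SimpleNamespace

def geocode_areas(areas, geocode, delay_seconds=1.0):
    """Geocode each distinct area once, pausing between calls.

    `geocode` is any callable that returns an object with .latitude and
    .longitude, or None when nothing is found. Failed lookups map to NaN.
    """
    coords = {}
    for area in areas:
        if area in coords:
            continue  # already looked up, skip the duplicate
        try:
            result = geocode(area)
        except Exception:
            result = None  # sketch only: treat any error as a missing result
        if result is None:
            coords[area] = (math.nan, math.nan)
        else:
            coords[area] = (result.latitude, result.longitude)
        time.sleep(delay_seconds)
    return coords

# Stand-in geocoder with placeholder coordinates (not real lookups)
def fake_geocode(query):
    known = {"Area A": SimpleNamespace(latitude=12.9, longitude=77.6)}
    return known.get(query)

coords = geocode_areas(["Area A", "Area B", "Area A"], fake_geocode, delay_seconds=0)
print(coords)
```

Returning a dictionary keyed by area name makes it easy to re-run only the areas that came back as NaN.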
    

In [None]:
# adding the latitude and longitude column in the locations dataset
locations['latitude'] = lat
locations['longitude'] = lon

In [None]:
locations

In [None]:
### Let's check whether any coordinates are missing
locations.isnull().sum()

In [None]:
locations[locations['latitude'].isna()]

In [None]:
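# Fill in the missing coordinates with values found via a Google search
# .loc avoids chained assignment, which may silently fail to update the DataFrame
locations.loc[79, 'latitude'] = 13.0163
locations.loc[79, 'longitude'] = 77.6785
locations.loc[85, 'latitude'] = 13.0068
locations.loc[85, 'longitude'] = 77.5813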

In [None]:
df['cuisines'].isnull().sum()

In [None]:
df = df.dropna(subset=['cuisines'])

In [None]:
df.cuisines.unique()

In [None]:
# Which areas have the largest number of Momos restaurants?
# (Because I love them)

momos = df[df['cuisines'].str.contains('Momo')]

In [None]:
momos.shape

In [None]:
# Name the columns explicitly so the result is ['Area', 'count'] regardless of pandas version
momos_rest_count = momos['location'].value_counts().rename_axis('Area').reset_index(name='count')

In [None]:
momos_rest_count

In [None]:
heatmap_df = momos_rest_count.merge(locations , on='Area' , how='left')

In [None]:
heatmap_df

Adding Basemap

In [None]:
#!pip install folium
import folium
basemap = folium.Map()

In [None]:
basemap

In [None]:
from folium.plugins import HeatMap

In [None]:
HeatMap(heatmap_df[['latitude', 'longitude' , "count"]]).add_to(basemap)

In [None]:
basemap

#### Conclusions:
- It is evident that restaurants are primarily concentrated in the central Bangalore area.
- The density of restaurants decreases as we move away from the center.
- This information can be valuable for potential restaurant entrepreneurs in identifying favorable locations for their ventures.
- It’s important to note that heatmaps are most effective when we have latitude and longitude data or when indicating the significance or count of specific locations.
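
To make the last point concrete, here is a small sketch of the input a heatmap layer needs: rows of latitude, longitude, and an optional weight. The areas and coordinates below are hypothetical placeholders, and dropping areas without coordinates is an assumed way of handling failed geocoding.

```python
import numpy as np
import pandas as pd

# Hypothetical areas and coordinates, for illustration only
restaurants = pd.DataFrame({'location': ['Area A', 'Area A', 'Area B', 'Area C']})
locations = pd.DataFrame({
    'Area': ['Area A', 'Area B', 'Area C'],
    'latitude': [12.90, 12.95, np.nan],   # Area C failed to geocode
    'longitude': [77.60, 77.65, np.nan],
})

# Count restaurants per area with explicit column names
counts = restaurants['location'].value_counts().rename_axis('Area').reset_index(name='count')

# Attach coordinates, then drop areas that have none
heatmap_df = counts.merge(locations, on='Area', how='left')
heatmap_df = heatmap_df.dropna(subset=['latitude', 'longitude'])

# Rows of [latitude, longitude, weight]
heat_data = heatmap_df[['latitude', 'longitude', 'count']].values.tolist()
print(heat_data)
```

The count acts as the weight, so areas with more matching restaurants appear hotter on the map.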

#### Automating heatmap generation for different types of cuisines

In [None]:
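def get_heatmap(cuisine):
    """Return a folium map with a heatmap of restaurants serving `cuisine`."""
    cuisine_df = df[df['cuisines'].str.contains(cuisine)]

    # Count restaurants per area; explicit names give ['Area', 'count'] on any pandas version
    cuisine_rest_count = cuisine_df['location'].value_counts().rename_axis('Area').reset_index(name='count')
    heatmap_df = cuisine_rest_count.merge(locations, on='Area', how='left')
    print(heatmap_df.head(4))

    # HeatMap rejects NaN values, so drop areas that could not be geocoded
    heatmap_df = heatmap_df.dropna(subset=['latitude', 'longitude'])

    basemap = folium.Map()
    HeatMap(heatmap_df[['latitude', 'longitude', 'count']]).add_to(basemap)
    return basemap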



In [None]:
#Let's find Bengali restaurants as I'm a Bengali and I'd love to find a place which serves bengali food
get_heatmap('Bengali')

In [None]:
#for South Indian cuisines
get_heatmap('South Indian')

In [None]:
#for North Indian cuisines
get_heatmap('North Indian')

### Conclusion

- In this exploratory data analysis of the Zomato dataset, we uncovered valuable insights into restaurant dynamics in Bangalore. The rating analysis suggests an association between restaurant ratings and online ordering: among restaurants rated above 4, those that accept online orders generally make up the larger share. This reflects whether a restaurant accepts online orders, not the number of orders it receives.

- Furthermore, through unigram, bigram, and trigram analyses, we identified prevalent food preferences among customers, with items such as Fried Rice, Paneer Butter Masala, Vanilla Ice Cream and Chicken Biryani frequently mentioned. This suggests that certain dishes and cuisines are particularly popular in the city, reflecting local tastes and dining habits.

- Geospatial analysis highlighted the concentration of restaurants in central Bangalore, with a notable decrease in density as one moves outward. This information can be instrumental for potential restaurant entrepreneurs seeking to establish their businesses in high-traffic areas.

- Overall, this EDA has provided a comprehensive understanding of customer preferences, restaurant performance, and geographical trends, laying a solid foundation for further research and business strategy development in the food and beverage sector.