# **Project Name**    - Zomato Restaurant Analysis Project



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual
##### **Team Member 1 -** Vivek Lamba

# **Project Summary -**

The data science problem focuses on approaching a business challenge analytically while applying critical thinking. Among various machine learning approaches, Unsupervised Learning is widely used, particularly for clustering, association mining, and dimensionality reduction.

In this project, the primary objective is to perform restaurant clustering and then conduct sentiment analysis to understand customer opinions from their reviews.

The goal of this project is to analyze Zomato restaurant data across various Indian cities, group restaurants into meaningful clusters, and further analyze customer reviews to determine whether the sentiment toward each restaurant is positive or negative.

Before performing any analysis, it is essential to understand the dataset thoroughly. Therefore, initial exploratory steps such as .head() and .info() were used to observe structure, column types, and basic characteristics of the data.

Once familiarized with the dataset, data wrangling was performed to clean and prepare the data. This included converting the cost field into numerical format, handling invalid rating values, and correcting inconsistencies.

After preprocessing, the next step was Exploratory Data Analysis (EDA) to uncover deeper insights. Various visualizations and comparisons were generated to understand restaurant characteristics and customer behavior patterns.

Certain assumptions and questions regarding the data were tested through Hypothesis Testing, where p-values and significance levels were used to accept or reject hypotheses.
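
As an illustration of how a p-value leads to an accept/reject decision, here is a minimal sketch using a permutation test on synthetic ratings. The group means, sample sizes, and the choice of a permutation test are assumptions made for this example; they are not the tests or the data used in the project.

```python
import numpy as np

# Synthetic ratings (assumed values, not taken from the Zomato data)
rng = np.random.default_rng(42)
premium = rng.normal(loc=4.0, scale=0.3, size=40)
budget = rng.normal(loc=3.5, scale=0.3, size=40)

# Observed difference in mean rating between the two groups
observed = premium.mean() - budget.mean()

# Shuffle the pooled ratings to build the null distribution of the difference
pooled = np.concatenate([premium, budget])
n_perm = 5000
null_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    null_diffs[i] = shuffled[:40].mean() - shuffled[40:].mean()

# One-sided p-value: share of shuffles at least as extreme as the observed value
p_value = (np.sum(null_diffs >= observed) + 1) / (n_perm + 1)

alpha = 0.05  # significance level
reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```

The null hypothesis here is that premium and budget restaurants have the same mean rating; it is rejected only when the p-value falls below the chosen significance level.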

Following this, feature engineering was carried out to prepare data for modeling. This involved handling missing values, addressing outliers, scaling features, extracting relevant variables, and transforming data where needed.

Clustering was first performed on the restaurant dataset. Before clustering, Principal Component Analysis (PCA) was applied for dimensionality reduction. The following clustering algorithms were implemented:

*   K-Means Clustering
*   Agglomerative Hierarchical Clustering

Based on results and Silhouette scores, it was observed that the dataset could be effectively grouped into six clusters.
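
The selection of a cluster count by Silhouette score can be sketched as follows. This is a minimal example on synthetic data with three well-separated groups, assuming scikit-learn is available; the feature matrix, the range of `k`, and the two PCA components are illustrative choices, not the project's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: three well-separated groups in 5 dimensions
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 5, [10.0] * 5, [-10.0] * 5])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 5)) for c in centers])

# Scale the features, then reduce dimensionality with PCA
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Fit K-Means for several k and record the Silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, labels)

# The k with the highest Silhouette score is the preferred cluster count
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real restaurant data the scores are usually much closer together, so the Silhouette result is best read alongside domain judgment.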

Next, sentiment analysis was carried out on the review dataset. Comprehensive text preprocessing was applied, including punctuation removal, stopword removal, lemmatization, emoji handling, lowercase conversion, and tokenization, followed by TF-IDF vectorization.

Multiple models were evaluated:

*   Logistic Regression
*   Decision Tree
*   Random Forest
*   XGBoost
*   K-Nearest Neighbors

Logistic Regression delivered the best performance based on AUC-ROC score. Hyperparameter tuning further confirmed Logistic Regression as the most suitable model for final sentiment prediction.
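
A minimal sketch of the TF-IDF plus Logistic Regression approach is shown below, assuming scikit-learn is available. The eight toy reviews and their labels are invented for illustration, and the pipeline uses default settings rather than the tuned hyperparameters of the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews (invented for illustration): 1 = positive, 0 = negative
train_texts = [
    "good food and tasty biryani",
    "amazing service and good ambience",
    "tasty starters, amazing dessert",
    "good place, tasty food",
    "bad food and late delivery",
    "cold food and rude staff",
    "late delivery, bad packaging",
    "rude service and cold starters",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each review into a weighted term vector;
# Logistic Regression then models the probability of a positive review
model = make_pipeline(TfidfVectorizer(lowercase=True), LogisticRegression())
model.fit(train_texts, train_labels)

# Probability of the positive class for two unseen reviews
proba_positive = model.predict_proba(["good tasty biryani"])[0, 1]
proba_negative = model.predict_proba(["cold food, late delivery"])[0, 1]
print(round(proba_positive, 3), round(proba_negative, 3))
```

The predicted probabilities are what an AUC-ROC evaluation would rank, which is why the metric suits this kind of classifier.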

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


The rapid growth of online food delivery platforms such as Zomato has resulted in massive volumes of restaurant data and customer reviews. This data contains valuable insights regarding customer satisfaction, restaurant performance, pricing perception, and service quality. However, in its raw form, this information is unstructured and difficult to interpret for meaningful business decisions.

The objective of this project is to analyze Zomato restaurant data to understand how restaurant pricing influences customer ratings, identify meaningful restaurant clusters based on review patterns, and perform sentiment analysis on customer reviews to evaluate their opinions. Specifically, the project aims to:

*   Analyze the relationship between restaurant cost and customer ratings and determine whether premium restaurants receive significantly higher ratings than budget restaurants.
*   Cluster restaurants based on review text similarity to understand behavioral grouping and similarity in customer perception.
*   Perform sentiment analysis on user reviews to classify customer feedback into positive, negative, or neutral sentiment categories.
*   Provide business insights that help restaurants improve service quality, pricing strategies, and customer experience.

Through statistical analysis, machine learning techniques, clustering methods, and natural language processing, this project seeks to transform raw Zomato data into actionable intelligence that benefits customers, restaurant owners, and platform stakeholders.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import datetime as dt
from datetime import datetime

from wordcloud import WordCloud

import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Loading Zomato Restaurant names and Metadata Dataset
hotels=pd.read_csv('/content/drive/MyDrive/Zomato Project/Zomato Restaurant names and Metadata.csv')

#Loading Zomato Restaurant reviews Dataset
reviews=pd.read_csv('/content/drive/MyDrive/Zomato Project/Zomato Restaurant reviews.csv')



### Dataset First View

In [None]:
# Dataset First Look
hotels.head()

In [None]:
reviews.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotels.shape

In [None]:
reviews.shape

Hotels contains 105 records and 6 features while reviews dataset contains 10000 records and 7 features.

### Dataset Information

In [None]:
# Dataset Info
hotels.info()

Cost should be an integer, but its values contain commas (,), so pandas reads the column as object. Timings describes the opening and closing hours of each restaurant as free text, so it is also of object datatype.
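
The effect of the comma on the datatype can be sketched on a few made-up cost strings. The values below are assumptions for illustration; only the comma-separated format comes from the dataset description.

```python
import pandas as pd

# Cost values stored as text with a thousands separator (illustrative values)
cost_raw = pd.Series(["800", "1,300", "1,200", "2,800"], name="Cost")
print(cost_raw.dtype)  # object, because the values are strings

# Remove the separator, then convert the cleaned strings to integers
cost_clean = pd.to_numeric(cost_raw.str.replace(",", "", regex=False)).astype("int64")
print(cost_clean.dtype)
```

Passing `regex=False` makes the replacement a plain string substitution, which avoids any surprise if the separator were a regex metacharacter.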

In [None]:
reviews.info()

In both datasets, every column is of 'object' type except 'Pictures'.

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hotels.duplicated().sum()

In [None]:
#review Dataset Duplicate Value Count
reviews.duplicated().sum()

In [None]:
# Check which duplicated rows are present in the dataset
reviews[reviews.duplicated()]

All the duplicated rows consist of null values, so they can be dropped.
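
Before dropping duplicates it can be worth confirming that they really are empty rows, because `keep=False` removes every copy of a duplicated row, including the first. A minimal sketch on a made-up frame (column names follow the reviews dataset, values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame: two empty duplicates for restaurant A, one genuine repeated review for B
toy = pd.DataFrame({
    "Restaurant": ["A", "A", "A", "B", "B"],
    "Reviewer": ["x", np.nan, np.nan, "y", "y"],
    "Review": ["good", np.nan, np.nan, "nice", "nice"],
})

# All rows that have at least one exact copy elsewhere in the frame
dupes = toy[toy.duplicated(keep=False)]

# True only if every duplicated row is empty apart from the restaurant name
all_null = dupes.drop(columns="Restaurant").isnull().all().all()

# keep="first" retains one copy of each duplicate; keep=False removes all copies
kept_first = toy.drop_duplicates(keep="first")
dropped_all = toy.drop_duplicates(keep=False)
print(all_null, len(kept_first), len(dropped_all))
```

In this toy frame the check returns `False`, and `keep=False` would also discard the genuine review for restaurant B, which is why the check is useful before choosing the `keep` option.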

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotels.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(hotels.isnull(), cbar=False)

The Collections column has 54 missing values and the Timings column has one.

In [None]:
# Missing Values/Null Values Count in review
reviews.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(reviews.isnull(), cbar=False)

In reviews dataset, most of the columns have missing values.
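
A heatmap shows where values are missing, while a small summary table shows how many. A minimal sketch on a made-up frame (the column names mirror the metadata dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy metadata frame with some missing entries
toy = pd.DataFrame({
    "Restaurant": ["A", "B", "C", "D"],
    "Collections": ["Trending", np.nan, np.nan, "New"],
    "Timings": ["11am-11pm", "12pm-10pm", np.nan, "24 hours"],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)
print(missing)
```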

### What did you know about your dataset?

There are two datasets given:



1.   Restaurant Names and metadata:

*   There are 105 records and 6 features in metadata.
*   There are missing or null values in Collections and Timings.
*   There are no duplicated values.
*   Cost should be an integer, but it contains commas (,), hence its datatype is object here.
*   Timings represents the hours from when the restaurant opens until it closes.


2.   Reviews dataset:

*   There are 10000 records given with 7 features.
*   Except for the restaurant name and the number of pictures posted, every column has null values.
*   There are some duplicated rows for restaurants, which can be dropped.
*   Rating should be numeric, but it contains the value 'Like', hence it is of object datatype.





## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotels.columns

In [None]:
# Dataset Columns
reviews.columns

In [None]:
# Dataset Describe
hotels.describe(include='all').T




*   None of the columns appear to be categorical.
*   The majority of the columns have unique values.



In [None]:
# Dataset Describe
reviews.describe(include='all').T



*   From the description of the dataset, we can infer that there are 100 unique restaurants for which customers have given reviews.
*   Some reviewers have reviewed more than one restaurant.





### Variables Description

**Zomato Restaurant**
* Name : Name of Restaurants

* Links : URL Links of Restaurants

* Cost : Per person estimated Cost of dining

* Collections : Tagging of Restaurants w.r.t. Zomato categories

* Cuisines : Cuisines served by Restaurants

* Timings : Restaurant Timings

**Zomato Restaurant Reviews**
* Restaurant : Name of the Restaurant

* Reviewer : Name of the Reviewer

* Review : Review Text

* Rating : Rating Provided by Reviewer

* Metadata : Reviewer Metadata - No. of Reviews and followers

* Time: Date and Time of Review

* Pictures : No. of pictures posted with review

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
hotels.nunique()

In [None]:
# Check Unique Values for each variable.
reviews.nunique()

In [None]:
reviews['Rating'].unique()

Ratings are on a 0-5 scale but are given in steps of 0.5, which gives about ten distinct values. The column also contains the string 'Like' and null values. The null values can be replaced with the median, and 'Like' can be replaced with 3.5.
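
The cleaning rule described above can be sketched on a handful of made-up rating strings. The sample values are assumptions; the mapping of 'Like' to 3.5 and the median fill follow the text.

```python
import numpy as np
import pandas as pd

# Ratings as they might appear in the raw column (illustrative values)
rating_raw = pd.Series(["5", "4.5", "Like", np.nan, "3", "1"], name="Rating")

# Map the text label to a number, then convert the whole column to float
rating = pd.to_numeric(rating_raw.replace("Like", "3.5"), errors="coerce")

# Fill the remaining nulls with the median of the observed ratings
median_rating = rating.median()
rating = rating.fillna(median_rating)
print(rating.tolist())
```

Using `errors="coerce"` turns any other unexpected text into NaN, so it is filled by the same median step instead of raising an error.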

## 3. ***Data Wrangling***

### Data Wrangling Code

Hotel Dataset

In [None]:
# Renaming the hotel dataset column name
hotels.rename(columns={'Name':'Restaurant'},inplace=True)

In [None]:
# checking values for cost
hotels['Cost'].unique()

In [None]:
# Removing ',' from Cost
hotels['Cost']=hotels['Cost'].str.replace(",","").astype("int64")

In [None]:
# function for number of cuisine in a hotel
def no_of_cuisine(cuisine):
  Cuisine_list=list(str(cuisine).split(','))
  return len(Cuisine_list)

# Create a new column with no of cuisine in hotel dataframe
hotels['No_of_cuisine']=hotels['Cuisines'].apply(no_of_cuisine)

Review Dataset

In [None]:
# Write your code to make your dataset analysis ready.
# Dropping the duplicate values in reviews df
reviews.drop_duplicates(keep=False,inplace=True)

In [None]:
# Replace Rating 'Like' with rating 3.5 and convert the column to float
reviews['Rating']=reviews['Rating'].str.replace("Like",'3.5').astype('float')

In [None]:
#splitting the metadata into Reviews and Followers
reviews[['No_of_reviews','Followers']] = reviews['Metadata'].str.split(',', expand=True)
reviews['No_of_reviews'] = pd.to_numeric(reviews['No_of_reviews'].str.split(' ').str[0])
reviews['Followers'] = pd.to_numeric(reviews['Followers'].str.split(' ').str[1])
reviews.head()

In [None]:
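# Filling the null values of Followers with 0
# (assignment instead of a chained inplace fillna, which newer pandas versions warn about)
reviews['Followers'] = reviews['Followers'].fillna(0)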

In [None]:
# Converting Time to date time and extracting Hour and year
reviews['Time']=pd.to_datetime(reviews['Time'])
reviews['Year']=pd.DatetimeIndex(reviews['Time']).year
reviews['Hour'] = pd.DatetimeIndex(reviews['Time']).hour

In [None]:
reviews.info()

In [None]:
# Create a new column for average rating in hotel dataset
# as_index must be the boolean False; the string 'False' is truthy and is treated as True
Average_rating = reviews.groupby('Restaurant', as_index=False)['Rating'].mean()
Average_rating.rename(columns={'Rating':'Average_rating'},inplace = True)
Average_rating.head()


In [None]:
# Let's merge the average rating with hotel dataset
hotels = hotels.merge(Average_rating,on = 'Restaurant')

In [None]:
# Let's merge the two dataset
df = hotels.merge(reviews, left_on = 'Restaurant',right_on='Restaurant')
df.shape

### What all manipulations have you done and insights you found?

For the Hotel dataset:

*   Renamed the column 'Name' to 'Restaurant'.
*   Removed commas (,) from Cost and changed its datatype to integer.
*   Added a No_of_cuisine column that counts the cuisines each restaurant serves.
*   Merged the average rating in hotel dataset.

For the Review dataset:

*   Dropped the duplicate rows.
*   Changed the Rating value 'Like' to a numeric value and converted the column to float.
*   Extracted No_of_reviews and Followers from the Metadata column and filled the null values of Followers with 0.
*   Changed the time datatype to datetime and extracted Year and Hour from it.
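
The Metadata extraction step can also be written with regular expressions, which tolerate entries that have no follower count. This is a minimal sketch; the text format `"<n> Review(s) , <m> Follower(s)"` is an assumption about how the Metadata column looks, and the sample values are invented.

```python
import pandas as pd

# Reviewer metadata strings in the assumed format (illustrative values)
meta = pd.Series(
    ["1 Review , 2 Followers", "3 Reviews , 2 Followers", "31 Reviews", None],
    name="Metadata",
)

# Capture the number that precedes "Review"/"Reviews"
n_reviews = pd.to_numeric(meta.str.extract(r"(\d+)\s+Review", expand=False))

# Capture the number that precedes "Follower"/"Followers"; treat missing as 0
followers = pd.to_numeric(meta.str.extract(r"(\d+)\s+Follower", expand=False)).fillna(0)

print(n_reviews.tolist(), followers.tolist())
```

Compared with splitting on commas and spaces, this does not depend on the exact spacing around the comma.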

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# boxplot to see the variation in price
sns.boxplot( y="Cost", data=hotels)

##### 1. Why did you pick the specific chart?


A box plot summarizes the spread of restaurant costs and makes outliers easy to spot.


##### 2. What is/are the insight(s) found from the chart?

The average cost per person ranges from below 500 to more than 2500, but only a few restaurants charge more than 2000.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The analysis shows that restaurant pricing varies widely: most restaurants fall in the budget to mid-range bracket (below 500 to 1500), while only a small segment lies above 2000 per person.

Since only a few restaurants operate successfully above 2000, blindly increasing prices without improving value may reduce customers, attract negative reviews, and harm business performance.
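
The points a box plot draws as outliers can also be computed directly with the 1.5 × IQR rule. A minimal sketch on made-up cost values (the numbers are assumptions, not the dataset):

```python
import pandas as pd

# Illustrative per-person costs
cost = pd.Series([150, 200, 400, 500, 500, 600, 800, 1000, 1200, 2800], name="Cost")

# Quartiles and interquartile range
q1, q3 = cost.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above the upper fence are the points plotted beyond the upper whisker
upper_fence = q3 + 1.5 * iqr
outliers = cost[cost > upper_fence]
print(upper_fence, outliers.tolist())
```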

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#Find out the costliest and cheapest restaurants
costly_res=hotels[['Restaurant','Cost']].groupby('Restaurant',as_index=False).mean().sort_values(by='Cost',ascending=False).head(5).reset_index(drop=True)
cheapest_res = hotels[['Restaurant','Cost']].groupby('Restaurant',as_index=False).mean().sort_values(by='Cost',ascending=True).head(5).reset_index(drop=True)
print(costly_res.head())
print('\n',cheapest_res.head())

In [None]:
#visualisation of most expensive and cheapest restaurant
fig,axes=plt.subplots(nrows=1,ncols=2,constrained_layout=True,figsize=(14,7))

#costliest restaurant
a=sns.barplot(x = 'Restaurant',y = 'Cost',data = costly_res,ax = axes[0],palette = 'plasma')
a.set_xticklabels(labels=costly_res['Restaurant'].to_list(),rotation=90)

#cheapest restaurant
b=sns.barplot(x = 'Restaurant',y = 'Cost',data = cheapest_res,ax = axes[1],palette = 'plasma')
b.set_xticklabels(labels=cheapest_res['Restaurant'].to_list(),rotation=90)
plt.show()

##### 1. Why did you pick the specific chart?

To visualize which restaurants available on Zomato are the most expensive and which are the cheapest.

##### 2. What is/are the insight(s) found from the chart?

Expensive Restaurants : "Collage - Hyatt Hyderabad Gachibowli" is the most expensive restaurant at rupees 2800, followed by "Feast - Sheraton Hyderabad Hotel" at rupees 2500.

Cheap Restaurants : "Mohammedia Shawarma" and "Amul" are the cheapest restaurants, with a cost of rupees 150, followed by "Hunger Maggi Point", "Asian Meal Box", "Momos Delight" and others at rupees 200.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can create a positive business impact because they help clearly identify premium restaurants for high-end customers and budget restaurants for price-sensitive users.

This allows better pricing strategies, targeted marketing, and improved customer segmentation.

However, highlighting very expensive restaurants may discourage budget users, and extremely cheap restaurants may struggle with service quality and profit margins if demand rises.

Therefore, while insights are beneficial, businesses must balance pricing, quality, and customer expectations to avoid negative growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Histograms of Cost, Rating and Year with mean and median markers
plt.figure(figsize=(18, 8))
for i, col in enumerate(['Cost', 'Rating', 'Year']):
    plt.subplot(2, 2, i + 1)
    feature = df[col]
    sns.histplot(feature, kde=True, color='#055E85')
    plt.axvline(feature.mean(), color='#ff033e', linestyle='dashed', linewidth=3, label='mean')  # red
    plt.axvline(feature.median(), color='#A020F0', linestyle='dashed', linewidth=3, label='median')  # purple
    plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
    plt.title(col.title())
plt.tight_layout()

##### 1. Why did you pick the specific chart?

Histplot is helpful in understanding the distribution of the feature.

##### 2. What is/are the insight(s) found from the chart?

* All three distributions are skewed.
* The most common cost is around 500.
* More reviews were posted in 2018 than in the other years.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Price always plays an important role in any business, along with rating, which shows how customers engage with the product.

However, these distributions plotted on their own do not reveal a direct business impact.
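
Skewness read off a histogram can be backed by a number. A minimal sketch, using made-up cost values, that compares the mean with the median and computes the sample skewness with pandas:

```python
import pandas as pd

# Illustrative per-person costs with one very expensive restaurant
cost = pd.Series([150, 200, 400, 500, 500, 600, 800, 1000, 1200, 2800])

# A positive value indicates a longer right tail
skewness = cost.skew()

# In a right-skewed distribution the mean is pulled above the median
mean_above_median = cost.mean() > cost.median()
print(round(skewness, 2), cost.mean(), cost.median())
```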

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#CREATING WORDCLOUD FOR EXPENSIVE RESTAURANT
from wordcloud import WordCloud
plt.figure(figsize=(20,10))
text = " ".join(name for name in hotels.sort_values('Cost',ascending=False).Restaurant[:30])

# Creating word_cloud with text as argument in .generate() method
word_cloud = WordCloud(width = 2000, height = 2000,collocations = False,
                       colormap='rainbow',background_color = 'black').generate(text)

# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear');
plt.axis("off")


In [None]:
#CREATING WORDCLOUD FOR CHEAPEST RESTAURANT
plt.figure(figsize=(15,8))
text = " ".join(name for name in hotels.sort_values('Cost',ascending=True).Restaurant[:30])

# Creating word_cloud with text as argument in .generate() method
wordcloud = WordCloud(background_color="white").generate(text)
# Display the generated Word Cloud
plt.imshow(wordcloud, interpolation='bilinear');
plt.axis("off")

##### 1. Why did you pick the specific chart?

I used a word cloud because it shows all the text and highlights the most frequent words.

##### 2. What is/are the insight(s) found from the chart?

In the chart above, HYDERABAD, HOTEL, and BAR appear frequently in the names of expensive restaurants, while SHAWARMA, DHABA, and RESTAURANTS appear frequently for cheap restaurants. This suggests that hotels and bars in Hyderabad tend to be expensive, while dhabas and restaurants tend to be cheaper.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can positively impact business by helping restaurants position themselves correctly—premium Hotels and Bars can justify higher pricing through enhanced service, while budget Dhabas and Restaurants can focus on affordability and high-volume strategies. These insights also support better market segmentation, targeted marketing, and improved customer recommendations.

However, negative growth is possible if premium outlets overprice without delivering quality, or if budget restaurants remain stuck in a low-price perception without improving service standards, which may lead to customer dissatisfaction and lost business.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#average rating and total number of review given to the restaurants

avg_hotel_rating = reviews.groupby('Restaurant').agg({'Rating':'mean',
        'Reviewer': 'count'}).reset_index().rename(columns = {'Reviewer': 'Total_Review'})
avg_hotel_rating

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,constrained_layout=True,figsize=(14,7))

# Histogram of the average rating per restaurant
a=sns.histplot(data=avg_hotel_rating['Rating'],bins=20,kde=True,ax=axes[0])

# Pie chart of how many restaurants share each total review count
b=avg_hotel_rating['Total_Review'].value_counts().plot(kind='pie', shadow=False, autopct='%1.02f%%',
                                                       explode = (0.001, 0.5, 0.5),pctdistance=1.1,labeldistance=1.20,
                                                       colors=['green','red','purple'],ax=axes[1])
plt.show()

##### 1. Why did you pick the specific chart?

I used a histogram to see the distribution of the average rating and a pie chart to see the distribution of review counts.

##### 2. What is/are the insight(s) found from the chart?

The average ratings of the restaurants are approximately normally distributed.

Every restaurant has 100 reviews except two, which have 85 and 77 reviews respectively.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can positively impact business decisions since ratings are normally distributed and backed by sufficient reviews, making them reliable for strategic planning. However, the two restaurants with significantly fewer reviews may lead to misleading conclusions and potential negative business impact if treated equally without considering review volume.

#### Chart - 6

In [None]:
#Most popular cuisines
# Split the comma-separated cuisine strings into one row per cuisine and
# strip stray whitespace so that ' North Indian' and 'North Indian' match
cuisine_series = hotels['Cuisines'].str.split(',').explode().str.strip()
cuisine_list = cuisine_series.tolist()

# converting it to dataframe
cuisine_df = cuisine_series.to_frame(name='Cuisines').reset_index(drop=True)

# count how many restaurants serve each cuisine
cuisine_ = cuisine_df['Cuisines'].value_counts().rename_axis('Cuisines').reset_index(name='count')
cuisine_

In [None]:
sns.barplot(x='count', y='Cuisines', data=cuisine_.sort_values(ascending=False, by='count')[:10],palette='crest')
plt.title('10 Most Famous Cuisine')
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are well suited to visualizing categorical features.

##### 2. What is/are the insight(s) found from the chart?

From the result of the graph it is clearly visible that North Indian is the most served cuisine in restaurants which is followed by Chinese then Continental.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps businesses understand that North Indian, Chinese, and Continental cuisines have the highest demand, enabling better menu planning, investment decisions, and targeted marketing strategies.

However, the dominance of North Indian cuisine indicates high competition, which makes it difficult for new or smaller restaurants to stand out and grow.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
#CREATING WORDCLOUD FOR CUISINES

plt.figure(figsize=(12,10))
df_word_cloud = cuisine_df['Cuisines']
text = " ".join(word for word in df_word_cloud)

# Generate a word cloud image
wordcloud = WordCloud(background_color="white").generate(text)
# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

##### 1. Why did you pick the specific chart?

It shows all the text and highlights the most frequent words.

##### 2. What is/are the insight(s) found from the chart?

In the chart above, North Indian is the most frequent cuisine, followed by Chinese and Continental.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The word cloud highlights dominant cuisines like North Indian, Chinese, Italian, Fast Food, and Continental, helping businesses understand popular customer preferences and plan menus, marketing, and offerings accordingly.

However, over-reliance on these popular cuisines may lead to saturation and intense competition, while under-represented cuisines risk being ignored, potentially missing market opportunities.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Creating word cloud for reviews
plt.figure(figsize=(15,10))
text = " ".join(name for name in reviews.sort_values('Review',ascending=False).Review[:30])

# Creating word_cloud with text as argument in .generate() method
word_cloud = WordCloud(width = 1400, height = 1400,collocations = False, background_color = 'black').generate(text)

# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')

plt.axis("off")

##### 1. Why did you pick the specific chart?

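To see which words reviewers use most often.

##### 2. What is/are the insight(s) found from the chart?

The word "good" appears most often, which suggests that customers mostly liked the food. "Food", "place", and "restaurant" are the next most frequent words. Note that the cloud is built from only 30 reviews (the first 30 after sorting the review text in descending order), so it reflects a small sample rather than the full review set.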

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The word cloud shows frequent positive terms like good, tasty, amazing, nice, perfect, service, indicating strong customer satisfaction. Businesses can use this information to keep improving their food quality and service, and they can also highlight these positive experiences in their marketing to attract more customers.

Some small negative words like less, wrong, leaking, spicy, and delayed show that a few customers are unhappy with service or food quality at times. If these issues are not fixed, customers may lose trust, give bad reviews, and may not want to come back.

#### Chart - 9

In [None]:
plt.figure(figsize = (15,8))
sns.countplot(
    x='Collections',
    data=hotels,
    order=hotels.Collections.value_counts().head(10).index,
    palette="viridis"
)
plt.title('Count of collections')
plt.xticks(rotation = 90)
plt.show()

##### 1. Why did you pick the specific chart?

To check the count of each collection.

##### 2. What is/are the insight(s) found from the chart?

"Food Hygiene Rated Restaurants in Hyderabad" has the highest count (4), followed by "Hyderabad Hottest", "New on Gold", and others.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart helps businesses understand which types of restaurant collections, like Food Hygiene Rated, Veggie Friendly, and Trending This Week, are more popular. Restaurants can use this information to improve their positioning, join popular categories, and create better marketing plans to attract more customers.

Some collections have very few restaurants, which shows low participation or awareness. If restaurants ignore these less-represented categories, they might miss good opportunities and lose potential customers who are interested in those specific themes.

#### Chart - 10

In [None]:
#numerical columns for hotel dataset
num_cols_hotel = ['Cost', 'No_of_cuisine']

In [None]:
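#Distribution plot for hotel dataset
# sns.distplot is deprecated; histplot with a KDE overlay gives the same view
plt.figure(figsize=(10,7))
for n, col in enumerate(num_cols_hotel, start=1):
    plt.subplot(1,2,n)
    sns.histplot(hotels[col], kde=True, stat='density')
    plt.title(col)
plt.tight_layout()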

##### 1. Why did you pick the specific chart?

To see the distribution of numerical columns.




##### 2. What is/are the insight(s) found from the chart?

Most restaurants fall in the mid-price range, while only a few are very expensive. Also, most restaurants serve 2–4 cuisines, indicating they maintain variety without overcomplicating menus.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights help restaurants set competitive pricing and design balanced menus, while avoiding very high costs or too many cuisines that may negatively impact customer satisfaction and growth.

#### Chart - 11

In [None]:
#numerical columns for review dataset
num_cols_review = ['Rating', 'Pictures', 'No_of_reviews', 'Followers']
#Distribution plot
# sns.distplot is deprecated; histplot with a KDE overlay gives the same view
plt.figure(figsize=(15,10))
for n, col in enumerate(num_cols_review, start=1):
    plt.subplot(2,2,n)
    sns.histplot(reviews[col], kde=True, stat='density')
    plt.title(col)
plt.tight_layout()

##### 1. Why did you pick the specific chart?

To see the distribution of numerical columns.

##### 2. What is/are the insight(s) found from the chart?

Most ratings are on the higher side, showing generally good customer satisfaction. However, pictures, reviews, and followers are highly right-skewed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The skewness indicates unequal popularity: restaurants with low engagement may struggle to attract customers if they do not improve marketing and customer experience.

#### Chart - 12

In [None]:
#Get the count of reviews posted in each hour of the day
hr_count = reviews['Hour'].value_counts().sort_index().rename_axis('Hour').reset_index(name='Count')
hr_count


In [None]:
#visualizing through bar plot
plt.figure(figsize=(15,10))
a=sns.barplot(x='Hour',y='Count',data=hr_count,palette='terrain_r',order=hr_count.sort_values('Count',ascending=False)['Hour'])
a.set_xticks(range(len(hr_count)))

##### 1. Why did you pick the specific chart?

To see the number of reviews posted in each hour of the day.

##### 2. What is/are the insight(s) found from the chart?

Review frequency is highest at night, from hour 21 to 23 (9:00 pm to 11:00 pm), and in the afternoon it peaks at hour 14 (2:00 pm), possibly because people prefer to order food during these hours.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This chart helps businesses know the busiest hours, mainly in the evening and night, so they can plan staff, kitchen work, and offers better to earn more.

Very slow hours can waste money if restaurants stay fully open without enough customers, which can hurt profit.
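
The hourly counts behind this chart can be sketched as follows. The timestamps are invented for illustration; reindexing over all 24 hours is an optional extra so that hours with no reviews appear with a count of zero.

```python
import pandas as pd

# Illustrative review timestamps
times = pd.to_datetime(pd.Series([
    "2019-05-25 15:54:00",
    "2019-05-25 21:37:00",
    "2019-05-24 22:11:00",
    "2019-05-24 21:05:00",
    "2018-07-01 14:20:00",
]))

# Count reviews per hour and make sure every hour from 0 to 23 is present
hour_counts = (
    times.dt.hour.value_counts()
    .reindex(range(24), fill_value=0)
    .rename_axis("Hour")
    .reset_index(name="Count")
)

# Hour with the highest number of reviews
peak_hour = hour_counts.loc[hour_counts["Count"].idxmax(), "Hour"]
print(peak_hour)
```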

#### Chart - 13

In [None]:
# Top 10 review rows by No_of_reviews (the reviewer's review count parsed from Metadata)
res_review=reviews[['Restaurant', 'No_of_reviews']].sort_values(by = 'No_of_reviews', ascending = False).head(10).reset_index(drop=True)
res_review

In [None]:
# No of reviews for each restaurant
plt.figure(figsize = (15,15))
ax = sns.barplot(x = 'No_of_reviews',y = 'Restaurant',data = res_review ,palette = 'viridis')


##### 1. Why did you pick the specific chart?

To check which restaurants have received the most reviews from customers.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that a few restaurants like Labonel, Pista House, and Collage – Hyatt Hyderabad Gachibowli receive the highest number of reviews, indicating high customer engagement and popularity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps identify top-performing restaurants. Restaurants with very few reviews may face low visibility and slower growth if this is not addressed.

#### Chart - 14

In [None]:
# Top 10 average rating restaurants
sns.barplot(x='Average_rating', y='Restaurant', data=hotels.sort_values(ascending=False, by='Average_rating')[:10],palette ='coolwarm' )
plt.title('Top 10 Restaurants by Average Rating')

plt.show()

##### 1. Why did you pick the specific chart?

To see the top 10 restaurants having highest average rating.

##### 2. What is/are the insight(s) found from the chart?

AB's - Absolute Barbecues is the top average rated restaurant followed by B-Dubs and 3B's - Buddies, Bar and Barbeque.

#### Chart - 15 - Correlation Heatmap

In [None]:
# checking heatmap/correlation matrix to see how the columns are correlated with each other
f, ax = plt.subplots(figsize=(20,10))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='crest', linewidths=1)
plt.show()

##### 1. Why did you pick the specific chart?

To see the correlation among numerical features.

##### 2. What is/are the insight(s) found from the chart?

* Number of reviews and followers have a correlation of 0.47, which can be considered moderate.

* Similarly, cost and number of cuisines have a moderate correlation of 0.4, and cost and average rating have a correlation of 0.42.

* There is low correlation between:

  * Pictures and Followers
  * Pictures and No of reviews
  * Cost and year

Other features have very low correlation.

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)

##### 1. Why did you pick the specific chart?

A pair plot is chosen because it helps visualize the relationships between multiple numerical variables at once. It shows both individual feature distributions and how each pair of features is correlated, making it easy to identify trends, patterns, clusters, and outliers in a single chart.

##### 2. What is/are the insight(s) found from the chart?

* Most variables are right-skewed, such as followers, number of reviews, and pictures, showing that only a few restaurants get very high engagement.
* Cost and rating do not show a very strong linear relationship, indicating higher cost does not always guarantee higher ratings.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis H0: There is no relationship between the cost of restaurant and the rating it receives.

Alternative hypothesis H1: There is a positive relationship between the cost of a restaurant and the rating it receives.

Test : Simple Linear Regression Analysis

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.formula.api as smf

# fit the linear model
model = smf.ols(formula='Rating ~ Cost', data= df).fit()

# Check p-value of coefficient
p_value = model.pvalues['Cost']   # label-based lookup; positional [1] on a labelled Series is deprecated in pandas
print("p_value value is " + str(p_value))
if p_value <= 0.05:
  print('Reject null hypothesis')
else:
  print('Fail to reject null hypothesis')

##### Which statistical test have you done to obtain P-Value?

I have used a linear regression test to check the relationship between the cost of a restaurant and its rating.

##### Why did you choose the specific statistical test?

I chose this test because it is a common and straightforward method for testing the relationship between two continuous variables. This would involve fitting a linear model with the rating as the dependent variable and the cost as the independent variable. The p-value of the coefficient for the cost variable can then be used to determine if there is a statistically significant relationship between the two variables.
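
As a rough illustration of how the slope's p-value drives the decision, the sketch below fits a simple linear regression on synthetic data with `scipy.stats.linregress`. The arrays `cost` and `rating` are made-up stand-ins, not the project data. Because H1 is directional (a positive relationship), the two-sided p-value is halved when the estimated slope is positive; this one-sided reading is an assumption and is not part of the original analysis.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in data (hypothetical, not the Zomato dataset)
rng = np.random.default_rng(0)
cost = rng.uniform(200, 2000, size=200)
rating = 3.0 + 0.0005 * cost + rng.normal(0, 0.3, size=200)

# linregress reports a two-sided p-value for the null hypothesis slope == 0
result = linregress(cost, rating)

# Convert to a one-sided p-value for H1: slope > 0
if result.slope > 0:
    one_sided_p = result.pvalue / 2
else:
    one_sided_p = 1 - result.pvalue / 2

print("slope:", result.slope)
print("one-sided p-value:", one_sided_p)
if one_sided_p <= 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
```

The same conversion could be applied to the `statsmodels` p-value used in the cell above, since that value is also two-sided.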

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis H0 : There is no relation between number of cuisines and cost.

Alternative Hypothesis H1 : Restaurants which serve higher number of cuisines are more costly.

Test: chi-square contingency test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# defining the contingency table of counts (rows: number of cuisines, columns: cost)
ct = pd.crosstab(hotels['No_of_cuisine'], hotels['Cost'])
stat, p_value, dof, expected = chi2_contingency(ct)

# interpret p_value-value
print("p_value value is " + str(p_value))
if p_value <= 0.05:
    print('Reject null hypothesis')
else:
    print('Failed to reject null hypothesis')

##### Which statistical test have you done to obtain P-Value?

I have used the chi-square contingency test to check whether cost and number of cuisines are related.

##### Why did you choose the specific statistical test?

I chose this test because it is suitable for testing the relationship between two categorical variables. It involves creating a contingency table of the number of cuisines against the cost of the restaurant to check whether the two are related.
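
The chi-square test works on a table of category counts, so a minimal sketch of the idea is shown here on toy data. The column names `n_cuisines` and `cost_band` and their labels are hypothetical; with real numeric columns, the values would typically be grouped into bands first, which is an assumption about preprocessing rather than something done in this notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical data (hypothetical labels and counts)
toy = pd.DataFrame({
    "n_cuisines": ["few"] * 30 + ["many"] * 30,
    "cost_band": ["low"] * 25 + ["high"] * 5 + ["low"] * 8 + ["high"] * 22,
})

# Contingency table: one row per n_cuisines level, one column per cost band
table = pd.crosstab(toy["n_cuisines"], toy["cost_band"])
print(table)

# Test of independence between the two categorical variables
chi2, p_value, dof, expected = chi2_contingency(table)
print("p-value:", p_value)
if p_value <= 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
```

Each cell of `table` is a count of rows, which is what `chi2_contingency` expects as input.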

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null hypothesis H0: The variety of cuisines offered by a restaurant has no effect on its rating.

Alternative hypothesis H1: The variety of cuisines offered by a restaurant has a positive effect on its rating.

Test : Chi-Squared Test

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# create a contingency table
ct = pd.crosstab(df['Cuisines'], df['Rating'])

# perform chi-squared test
chi2, p_value, dof, expected = chi2_contingency(ct)

# Check p-value
print("p_value value is " + str(p_value))
if p_value <= 0.05:
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

##### Which statistical test have you done to obtain P-Value?

I have used chi-squared test for independence to test the relationship between the variety of cuisines offered by a restaurant and its rating.

##### Why did you choose the specific statistical test?

I chose this test because it is suitable for testing the relationship between two categorical variables. It involves creating a contingency table with the cuisines offered as the rows and the rating of the restaurant as the columns, where each cell counts the matching records.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Hotel Dataset
null_count = hotels.isnull().sum()
null_percent = hotels.isnull().sum() / hotels.shape[0] * 100

null_hotel_df = pd.DataFrame({
    'Missing_Count': null_count,
    'Missing_Percentage (%)': null_percent
})

print(null_hotel_df)

In [None]:
# checking for one missing value in Timings
hotels[hotels['Timings'].isnull()]

In [None]:
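# Imputing timings missing value with mode of that column
# (assigning back avoids the chained in-place fillna, which newer pandas versions warn about)
hotels['Timings'] = hotels['Timings'].fillna(hotels['Timings'].mode()[0])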

In [None]:
#dropping collection column since has more than 50% of null values
hotels.drop('Collections', axis = 1, inplace = True)

#### What all missing value imputation techniques have you used and why did you use those techniques?

For the hotel dataset (metadata):
* We dropped the Collections column because more than 50% of its values are null.
* There was 1 null value in Timings, which we filled with the mode because most hotels have similar opening and closing times.

In [None]:
# Handling Missing Values & Missing Value Imputation
# review Dataset
null_count = reviews.isnull().sum()
null_percent = reviews.isnull().sum() / reviews.shape[0] * 100

null_review_df = pd.DataFrame({
    'Missing_Count': null_count,
    'Missing_Percentage (%)': null_percent
})

print(null_review_df)

In [None]:
#filling null values in review_df  review column
reviews = reviews.fillna({"Review": "No Review"})

The missing values in the Review column were filled with 'No Review', since this column holds the review text given by the customer to the restaurant.

### 2. Handling Outliers

In [None]:
#function to plot for outlier detection
def outlier_plots(df, features):
  '''Plot a histogram, a sorted-value scatter and a box plot for each feature'''
  for feature in features:
    plt.figure(figsize = (20,10))
    plt.subplot(1,3,1)
    sns.histplot(df[feature], kde=True)   # distplot is deprecated in recent seaborn
    plt.subplot(1,3,2)
    plt.scatter(range(df.shape[0]), np.sort(df[feature].values))
    plt.subplot(1,3,3)
    sns.boxplot(x=df[feature])

In [None]:
# Getting outliers for review dataset
outlier_plots(reviews,['Followers','Pictures','No_of_reviews'])

In [None]:
# getting outliers for hotel dataset
outlier_plots(hotels,['Cost','No_of_cuisine'])

In [None]:
from sklearn.ensemble import IsolationForest
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01)
isolation_forest.fit(hotels['Cost'].values.reshape(-1, 1))

xx = np.linspace(hotels['Cost'].min(), hotels['Cost'].max(), len(hotels)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='g',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Cost')
plt.show();

In [None]:
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01)
isolation_forest.fit(reviews['No_of_reviews'].values.reshape(-1, 1))

xx = np.linspace(reviews['No_of_reviews'].min(), reviews['No_of_reviews'].max(), len(hotels)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='g',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('No_of_reviews')
plt.show();

In [None]:
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01)
isolation_forest.fit(reviews['Followers'].values.reshape(-1, 1))

xx = np.linspace(reviews['Followers'].min(), reviews['Followers'].max(), len(hotels)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
where=outlier==-1, color='g',
alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Followers')
plt.show();

In [None]:
# For skewed features defining upper and lower boundary

def outlier_treatment_skew(df,feature):
  #inter quartile range
  IQR= df[feature].quantile(0.75) - df[feature].quantile(0.25)
  lower_bound = df[feature].quantile(0.25) - 1.5*IQR
  upper_bound = df[feature].quantile(0.75) + 1.5*IQR
  return upper_bound,lower_bound

In [None]:
def replace_outliers(df,features):
  '''Cap values outside the IQR bounds at the nearest bound (in place)'''
  # compute the bounds once, then cap both tails
  upper_bound, lower_bound = outlier_treatment_skew(df=df,feature=features)
  df[features] = df[features].clip(lower=lower_bound, upper=upper_bound)

In [None]:
# Replace the outlier value with its upper bound and lower bound
replace_outliers(hotels,'Cost')
replace_outliers(reviews,'No_of_reviews')
replace_outliers(reviews,'Followers')

##### What all outlier treatment techniques have you used and why did you use those techniques?

The Cost, No_of_reviews and Followers features show positively skewed distributions, and Isolation Forest confirmed that they contain outliers. Instead of removing the outliers, I capped them using the IQR method: values above the upper limit (Q3 + 1.5 × IQR) are replaced with the upper limit, and values below the lower limit (Q1 - 1.5 × IQR) are replaced with the lower limit.
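
To make the capping rule concrete, here is a small worked example on a made-up series (the numbers are illustrative, not taken from the dataset). It uses `Series.clip`, which should give the same result as capping each tail separately.

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles and inter-quartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# IQR fences
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Cap everything outside the fences at the nearest fence
capped = values.clip(lower=lower_bound, upper=upper_bound)

print("bounds:", lower_bound, upper_bound)
print(capped.tolist())
```

Only the value 100 lies outside the fences here, so it is the only one that changes; the row count stays the same, which is the main reason to cap rather than drop.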

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
# create the new dataframe for clustering
# And have encoding on cuisines
cluster_df = hotels.drop([ 'Timings'],axis=1)
# Encode your categorical columns
cluster_df['Cuisines'] = cluster_df['Cuisines'].str.split(',')

#using explode converting list to unique individual items
cluster_df = cluster_df.explode('Cuisines')

#removing extra trailing space from Cuisines after exploded
cluster_df['Cuisines'] = cluster_df['Cuisines'].apply(lambda x: x.strip())

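#using get dummies to get dummies for Cuisines (dtype=int keeps the dummy columns numeric)
cluster_df = pd.get_dummies(cluster_df, columns=["Cuisines"], prefix=["Cuisines"], dtype=int)

# collapse back to one row per restaurant: sum only the cuisine dummies and keep the
# first value of every other column, so Cost etc. are not multiplied by the number
# of cuisines; sort=False keeps the restaurants in their original order
cuisine_cols = [c for c in cluster_df.columns if c.startswith("Cuisines_")]
agg_map = {c: "first" for c in cluster_df.columns if c not in cuisine_cols + ["Restaurant"]}
agg_map.update({c: "sum" for c in cuisine_cols})
cluster_df = cluster_df.groupby("Restaurant", sort=False).agg(agg_map).reset_index()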

#### What all categorical encoding techniques have you used & why did you use those techniques?

To encode the categorical feature 'Cuisines', I first split the cuisines into a list and then created dummy variables for each cuisine.
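
The split-then-dummies idea can be sketched on a two-row toy frame as follows. The restaurant names, costs and cuisines are invented for illustration, and collapsing with `max` per original row is one possible choice, not necessarily the one used in the notebook.

```python
import pandas as pd

# Toy data (hypothetical restaurants)
toy = pd.DataFrame({
    "Restaurant": ["A", "B"],
    "Cost": [500, 800],
    "Cuisines": ["Chinese, Thai", "Thai"],
})

# One row per (restaurant, cuisine), stripped of stray spaces,
# then one indicator column per cuisine
dummies = (
    toy["Cuisines"]
    .str.split(",")
    .explode()
    .str.strip()
    .pipe(pd.get_dummies, prefix="Cuisines", dtype=int)
    .groupby(level=0)   # original row index identifies the restaurant
    .max()
)

# Attach the indicators to the untouched numeric columns
encoded = toy.drop(columns="Cuisines").join(dummies)
print(encoded)
```

Encoding only the cuisine column and joining it back leaves `Cost` exactly as it was, which avoids accidentally aggregating numeric columns across the exploded rows.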

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# creating dataframe for sentiment analysis (copy avoids SettingWithCopyWarning on later assignments)
sentiment_df = reviews[['Review', 'Rating']].copy()

In [None]:
!pip install contractions

In [None]:
import contractions
# applying function for expanding contractions
sentiment_df['Review']=sentiment_df['Review'].apply(lambda x:contractions.fix(str(x)))

#### 2. Lower Casing

In [None]:
# Lower Casing
sentiment_df['Review'] = sentiment_df['Review'].str.lower()
sentiment_df.head()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string
def remove_punctuation(text):
  '''This function is for removing punctuation'''
   # replacing the punctuations with no space, hence punctuation marks will be removed
  translator = text.translate(str.maketrans('', '', string.punctuation))
  # return the text stripped of punctuation marks
  return (translator)

In [None]:
#remove punctuation using function created
sentiment_df['Review'] = sentiment_df['Review'].apply(remove_punctuation)
sentiment_df.head()

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs & Remove words and digits contain digits
#function to remove digits
def remove_digit(text):
  '''Function to remove digit from text'''
  char_str = '' .join((z for z in text if not z.isdigit()))
  return char_str

In [None]:
#remove digit using function created
sentiment_df['Review'] = sentiment_df['Review'].apply(remove_digit)
sentiment_df.head()

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
sw = stopwords.words('english')

In [None]:
def remove_stopwords(text):
    '''a function for removing the stopword'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(text)

sentiment_df['Review'] = sentiment_df['Review'].apply(remove_stopwords)

In [None]:
sentiment_df.head()

In [None]:
# Remove White spaces
sentiment_df['Review'] =sentiment_df['Review'].apply(lambda x: " ".join(x.split()))
sentiment_df.head()

#### 6. Rephrase Text

In [None]:
# Rephrase Text

Not required in my analysis.

#### 7. Tokenization

In [None]:
# Tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
sentiment_df['Review'] = sentiment_df['Review'].apply(nltk.word_tokenize)
sentiment_df.sample(10)

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
#applying Lemmatization
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
  '''function for lemmatization'''
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
  return lemmatized_tokens

# Lemmatize the 'Review' column
sentiment_df['Review'] = sentiment_df['Review'].apply(lemmatize_tokens)

##### Which text normalization technique have you used and why?

I have used Lemmatization text normalization technique.

Lemmatization converts words into their meaningful base or dictionary form while preserving the context of the word. Unlike stemming, which may cut words incorrectly, lemmatization produces linguistically correct words. This helps reduce vocabulary size, improves model understanding, and leads to better performance in sentiment analysis by maintaining semantic meaning.


#### 9. Part of speech tagging

In [None]:
# POS Tagging
# Skipped: without POS tags, WordNetLemmatizer treats every token as a noun (its default),
# which is considered sufficient here for reducing words to their base form.

#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import  TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=lambda x: x, lowercase=False)
X = sentiment_df['Review']
X= vectorizer.fit_transform(X)

##### Which text vectorization technique have you used and why?

Here I have used Tf-idf Vectorization technique.

TF-IDF (term frequency-inverse document frequency) is a technique that assigns a weight to each word in a document. It is calculated as the product of the term frequency (tf) and the inverse document frequency (idf).

The term frequency (tf) is the number of times a word appears in a document, while the inverse document frequency (idf) is a measure of how rare a word is across all documents in a collection. The intuition behind tf-idf is that words that appear frequently in a document but not in many documents across the collection are more informative and thus should be given more weight.

The mathematical formula for tf-idf is as follows:

tf-idf(t, d, D) = tf(t, d) * idf(t, D)

where t is a term (word), d is a document, D is a collection of documents, tf(t, d) is the term frequency of t in d, and idf(t, D) is the inverse document frequency of t in D.
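
The formula above can be traced by hand on a tiny corpus. The sketch below uses the plain textbook definition, with raw counts for tf and log(N / df) for idf; the toy documents are invented. Note that scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized reviews (hypothetical)
docs = [
    ["good", "food", "good"],
    ["bad", "food"],
    ["food", "service"],
]

def tf(term, doc):
    """Raw count of the term in one document."""
    return Counter(doc)[term]

def idf(term, corpus):
    """log(N / df); assumes the term occurs in at least one document."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("good", docs[0], docs))   # frequent in doc 0, rare in the corpus
print(tf_idf("food", docs[0], docs))   # appears in every document
```

A word that appears in every document, like "food" here, gets a weight of zero under this definition, which matches the intuition that it carries no information for telling documents apart.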

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
hotels.columns

In [None]:
hotels.drop('Links',axis=1,inplace=True)

In [None]:
reviews.columns

In [None]:
# for sentiment analysis, creating dependant variable based on rating
#We will create 2 categories based on the rating by creating a python function
def sentiment(rating):
  if rating >=3.5:
    return 1
    # positive sentiment
  else:
    return 0
    # negative sentiment

In [None]:
# applying to sentiment dataset
sentiment_df['Sentiment'] = sentiment_df['Rating'].apply(sentiment)
sentiment_df.head()

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
print('for sentiment analysis : ',sentiment_df.columns)
# For clustering analysis
print('\nFor clustering analysis :', cluster_df.columns)

##### What all feature selection methods have you used  and why?

No separate feature selection method was applied. PCA is used later for dimensionality reduction; it combines the existing features into a smaller set of components rather than selecting a subset of them.

##### Which all features you found important and why?

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
#check if data needs to be transformed
def skewed_feature(df,features):
  symmetric_f = []
  skewed_f = []
  for i in features:
      if (df[i].skew() <= -1) | (df[i].skew() >= 1) :
        skewed_f.append(i)
      else:
        symmetric_f.append(i)
  return symmetric_f, skewed_f

In [None]:
#finding symmetric and skew symmetric features IN CLUSTER DF
features = ['Cost', 'No_of_cuisine', 'Average_rating']
s,sk=skewed_feature(cluster_df,features)
print('Symmetric features :',s)
print('Skew symmetric features :',sk)

In [None]:
#finding symmetric and skew symmetric features in Sentiment DF
features=['Rating', 'Sentiment']
s,sk=skewed_feature(sentiment_df,features)
print('Symmetric features :',s)
print('Skew symmetric features :',sk)

In [None]:
# Transform Your data
#Cost is skewed. Hence we have applied log transformation on Cost.
cluster_df['Cost'] = np.log1p(cluster_df['Cost'])

In [None]:
# visualization of log transformation of cost
sns.histplot(cluster_df['Cost'], kde=True, color = '#055E85')   # distplot is deprecated in recent seaborn
plt.axvline(cluster_df['Cost'].mean(), color='#ff033e', linestyle='dashed', linewidth=3,label= 'mean');  #red
plt.axvline(cluster_df['Cost'].median(), color='#A020F0', linestyle='dashed', linewidth=3,label='median'); #purple
plt.legend(bbox_to_anchor = (1.0, 1), loc = 'best')
plt.title('Cost');
plt.tight_layout();

### 6. Data Scaling

In [None]:
cluster_df.head()

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
#Min max scaler for only numeric columns

In [None]:
scaled_df = cluster_df.copy()
scaled_df[["Cost","No_of_cuisine","Average_rating"]] = scaler.fit_transform(cluster_df[["Cost","No_of_cuisine","Average_rating"]])
scaled_df.set_index("Restaurant", inplace= True)

In [None]:
# Applying minmax transformation to numeric data
numeric_cols = list(cluster_df.describe().columns)
scaled_df = pd.DataFrame(scaler.fit_transform(cluster_df[numeric_cols]))
scaled_df.columns = numeric_cols

In [None]:
scaled_df.head()

##### Which method have you used to scale you data and why?

I have used MinMaxScaler to scale the data. Feature scaling prevents the models from being biased toward features with larger ranges. It is needed here because the dummy variables created from cuisines contain only the values 0 and 1, while the other variables have different ranges.
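
For reference, min-max scaling maps each value to (x - min) / (max - min), which is what `MinMaxScaler` does per column with its default range of 0 to 1. A tiny worked example on made-up cost values:

```python
import numpy as np

# Made-up cost values (illustrative only)
cost = np.array([200.0, 500.0, 800.0, 1400.0])

# Min-max scaling: smallest value maps to 0, largest to 1
scaled = (cost - cost.min()) / (cost.max() - cost.min())

print(scaled)
```

After scaling, the cost values sit on the same 0 to 1 range as the cuisine indicator columns.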

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes, it is important to use dimensionality reduction techniques as dataset has 40 or more features. This is because, as the number of features increases, the computational cost of clustering algorithms also increases. Dimensionality reduction techniques such as PCA can help reduce the number of features while maintaining the important information in the data, making it easier to cluster and interpret the results.

In [None]:
# Dimensionality Reduction (If needed)
#applying pca

features = scaled_df.columns

# create an instance of PCA
from sklearn.decomposition import PCA
pca = PCA()

# fit PCA on features
pca.fit(scaled_df)

In [None]:
#explained variance v/s no. of components
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker ='o', color = 'orange')
plt.xlabel('Number of components',size = 15, color = 'red')
plt.ylabel('cumulative explained variance',size = 14, color = 'blue')
plt.title('Variance v/s No. of Components',size = 20, color = 'green')
plt.xlim([0, 20])
plt.show()

In [None]:
#using n_components as 3
pca = PCA(n_components=3)

# fit PCA on features
pca.fit(scaled_df)

# explained variance ratio of each principal component
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
# variance explained by three components
print('Cumulative variance explained by 3 principal components: {:.2%}'.format(
                                        np.sum(pca.explained_variance_ratio_)))

# transform data to principal component space
df_pca = pca.transform(scaled_df)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

I have used PCA (Principal Component Analysis) for dimensionality reduction. It is a widely used technique because it identifies the directions in the data that account for the most variation. PCA simplifies high-dimensional data while retaining trends and patterns by transforming the data into fewer dimensions, which act as summaries of the features.
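
To show what "directions that account for the most variation" means in practice, the sketch below computes principal components with a plain SVD on synthetic data and picks the number of components from the cumulative explained variance. The data, the 95% threshold and all variable names are assumptions for illustration; the notebook itself fixes the number of components after inspecting the plot.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): 100 samples, 5 features driven by 2 latent factors
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# PCA = SVD of the mean-centered data; rows of `components` are the principal axes
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of total variance carried by each component, and its running total
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_variance_ratio)

# Smallest number of components reaching 95% of the variance (threshold is an assumption)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto the retained components
projected = centered @ components[:n_components].T

print(np.round(cumulative, 3))
print(n_components, projected.shape)
```

Because the five features are generated from two underlying factors, almost all of the variance is captured by at most two components, which is the situation in which PCA loses very little information.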



### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
# for sentiment analysis using sentiment_df dataframe
#X = X_tfidf
y = sentiment_df['Sentiment']

In [None]:
y.value_counts()

In [None]:
#spliting test train
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.30)

# describes info about train and test set
print("X_train ", X_train.shape)
print("y_train ", y_train.shape)
print("X_test ", X_test.shape)
print("y_test ", y_test.shape)

In [None]:
X_train = X_train.toarray()
X_test = X_test.toarray()

##### What data splitting ratio have you used and why?

I have used a 70:30 split, which is one of the most commonly used split ratios. Since there were only 9961 records, I kept the larger share for the training set.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the dataset is imbalanced, but only slightly, so handling it is not necessary and we can proceed with the same dataset.

In [None]:
# check if dataset is imbalanced or not
# reindex fixes the slice order (1 = positive, 0 = negative) so the labels always match
sentiment_df['Sentiment'].value_counts().reindex([1, 0]).plot(kind='pie',
                               autopct="%1.1f%%",
                               labels=['Positive Sentiment','Negative Sentiment'],
                               colors=['green','red'],
                               explode=[0.01,0.02])

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

No balancing technique was applied, since the imbalance is only slight.

## ***7. ML Model Implementation***

### ML Model - 1

**K Means clustering**

K-Means clustering is an unsupervised learning algorithm. It takes an unlabeled dataset as input, divides it into k clusters, and alternates between assigning each point to its nearest centroid and recomputing the centroids until the assignments stop changing.
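
The assign-then-update loop can be written out in a few lines of NumPy. This is a minimal sketch of the basic procedure (Lloyd's algorithm) with random initial centroids, not the scikit-learn implementation used in the cells below, which starts from k-means++ initialization. The function name `kmeans` and the two synthetic blobs are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with random initial centroids (not k-means++)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break   # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```

On data this clearly separated the loop settles within a few iterations; on real data the result depends on the starting centroids, which is why a fixed `random_state` is passed to `KMeans` in the notebook.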

In [None]:
df_pca_copy = df_pca.copy()

In [None]:
# ML Model - 1 Implementation
#importing kmeans
from sklearn.cluster import KMeans
# Fit the Algorithm
wcss_list= []  #Initializing the list for the values of WCSS
wcss_dict = {}
#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(df_pca)
    wcss_list.append(kmeans.inertia_)
    wcss_dict[i] = kmeans.inertia_

wcss_dict

In [None]:
# plot for sum of squared distance for each number of cluster
plt.plot(range(1,11),wcss_list, linewidth=2, color="green", marker ="o")
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters(k)')
plt.ylabel('WCSS')
plt.show()

In [None]:
# silhouette score to find optimal number of clusters
from sklearn.metrics import silhouette_score
from sklearn.metrics import silhouette_samples

silhouette_avg =[]
# Calculate average silhouette score for each number of clusters (2 to 10)

for k in range(2,11):
  km = KMeans(n_clusters=k, random_state=3)
  km.fit(df_pca)
  silhouette_avg.append(silhouette_score(df_pca, km.labels_))

# plot the results
plt.plot(range(2,11), silhouette_avg)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.grid(True)

In [None]:
from matplotlib.colors import ListedColormap
import matplotlib.cm as cm
#visualizing Silhouette Score for individual clusters and the clusters made
range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1
    ax1.set_xlim([-1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(df_pca) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(df_pca)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(df_pca, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(df_pca, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower =  y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(df_pca[:, 0], df_pca[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')


In [None]:
#fitting on 6 clusters
kmeans = KMeans(n_clusters=6, init='k-means++', random_state= 10)
y_predict= kmeans.fit_predict(df_pca)

In [None]:
#visualizing the clusters
plt.figure(figsize=(15,10))
plt.scatter(df_pca[y_predict == 0, 0], df_pca[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
plt.scatter(df_pca[y_predict == 1, 0], df_pca[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
plt.scatter(df_pca[y_predict== 2, 0], df_pca[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
plt.scatter(df_pca[y_predict == 3, 0], df_pca[y_predict == 3, 1], s = 100, c = 'orange', label = 'Cluster 4') #for fourth cluster
plt.scatter(df_pca[y_predict == 4, 0], df_pca[y_predict == 4, 1], s = 100, c = 'purple', label = 'Cluster 5') #for first cluster
plt.scatter(df_pca[y_predict == 5, 0], df_pca[y_predict == 5, 1], s = 100, c = 'magenta', label = 'Cluster 6') #for second cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 150, c = 'black', label = 'Centroid')
plt.title('Clusters of Restaurants')
plt.legend()
plt.show()

In [None]:
# Assigning clusters to our data
new_df_cluster = cluster_df.copy()
cluster_df['clusters'] = y_predict
# checking how it is working
cluster_df.head()

In [None]:
# count of each of 6 clusters
cluster_df['clusters'].value_counts()

In [None]:
#creating new df for checking cuisines in each cluster
new_cluster_df = hotels.copy()
new_cluster_df['clusters'] = y_predict
new_cluster_df['Cuisines'] = new_cluster_df['Cuisines'].str.split(',')
new_cluster_df = new_cluster_df.explode('Cuisines')

#removing surrounding whitespace from cuisines after exploding
new_cluster_df['Cuisines'] = new_cluster_df['Cuisines'].str.strip()
new_cluster_df.head(10)

In [None]:
new_cluster_df.shape

In [None]:
#printing cuisine list for each cluster
for cluster in new_cluster_df['clusters'].unique().tolist():
  print('Cuisine List for Cluster :', cluster,'\n')
  print(new_cluster_df[new_cluster_df["clusters"]== cluster]['Cuisines'].unique(),'\n')
  print('\n')

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

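The first model is K-Means, fitted here with six clusters (`n_clusters=6`). Two methods guide the choice of the number of clusters:

**Elbow method**

This method relies on the WCSS (Within-Cluster Sum of Squares), the sum of squared distances between each point and the centroid of its cluster. WCSS is computed for a range of cluster counts and plotted; the "elbow", where adding more clusters stops reducing WCSS sharply, suggests a suitable number of clusters.

**Silhouette method**

The silhouette coefficient measures how similar a data point is to its own cluster (cohesion) compared with the nearest other cluster (separation). It ranges from -1 to 1, and higher values indicate better-separated clusters.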
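To make the two measures concrete, here is a minimal, dependency-free sketch that computes WCSS and the mean silhouette score by hand. The four points and the two labelings (`good`, `bad`) are made-up toy data, not the project's `df_pca`; in the notebook itself these values would normally come from `KMeans.inertia_` and `silhouette_score`.

```python
from math import dist  # Euclidean distance between two points

def wcss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = tuple(sum(c) / len(members) for c in zip(*members))
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total

def mean_silhouette(points, labels):
    """Average of s = (b - a) / max(a, b); needs at least two clusters."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        same = [dist(p, q) for j, (q, lab_q) in enumerate(zip(points, labels))
                if lab_q == lab and j != i]
        if not same:  # singleton cluster: score is taken as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q, lab_q in zip(points, labels) if lab_q == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (assumption): two tight, well-separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = [0, 0, 1, 1]  # labels that match the visible groups
bad = [0, 1, 0, 1]   # labels that mix the groups

print(round(wcss(points, good), 3), round(wcss(points, bad), 3))  # 1.0 50.0
print(mean_silhouette(points, good) > mean_silhouette(points, bad))  # True
```

The sensible labeling gives a much lower WCSS and a higher silhouette score, which is the behaviour both methods rely on when comparing candidate cluster counts.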

### ML Model - 2

**Hierarchical clustering**

In [None]:
#importing module for hierarchical clustering and visualizing dendrograms
import scipy.cluster.hierarchy as sch
plt.figure(figsize=(10,5))
dendrogram = sch.dendrogram(sch.linkage(df_pca, method = 'ward'),orientation='top',
            distance_sort='descending',
            show_leaf_counts=True)

plt.title('Dendrogram')
plt.xlabel('Restaurants')
plt.ylabel('Euclidean Distances')

plt.show()

In [None]:
#Checking the Silhouette score for 2 to 8 clusters
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4, 5, 6, 7, 8]

for n_clusters in range_n_clusters:
    hc = AgglomerativeClustering(
        n_clusters=n_clusters,
        linkage='ward'
    )
    y_hc = hc.fit_predict(df_pca)
    score = silhouette_score(df_pca, y_hc)
    print(f"For n_clusters = {n_clusters}, silhouette score is {score}")


In [None]:
from sklearn.cluster import AgglomerativeClustering

# define the model
model = AgglomerativeClustering(n_clusters=6,
        linkage='ward')

#fit and predict on model
y_predict = model.fit_predict(df_pca)

In [None]:
# visualize the clusters
plt.figure(figsize=(15,10))
plt.scatter(df_pca[y_predict == 0,0], df_pca[y_predict == 0,1], s=100, c='cyan')
plt.scatter(df_pca[y_predict == 1,0], df_pca[y_predict == 1,1], s=100, c='red')
plt.scatter(df_pca[y_predict == 2,0], df_pca[y_predict == 2,1], s=100, c='blue')
plt.scatter(df_pca[y_predict == 3,0], df_pca[y_predict == 3,1], s=100, c='green')
plt.scatter(df_pca[y_predict == 4,0], df_pca[y_predict == 4,1], s=100, c='orange')
plt.scatter(df_pca[y_predict == 5,0], df_pca[y_predict == 5,1], s=100, c='magenta')
plt.title('Clusters of Restaurants (Hierarchical)')
plt.show()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In this project, Agglomerative Hierarchical Clustering was used as the second clustering model. It is an unsupervised learning algorithm that groups data points by similarity without using labeled data. The model starts by treating each data point as its own cluster and then repeatedly merges the two closest clusters until the desired number of clusters remains. Here it was run with Ward linkage on the PCA-reduced data; silhouette scores were computed for 2 to 8 clusters, and six clusters were used for the final model.
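The merge procedure described above can be sketched in a few lines of plain Python. This is a naive illustration with the Ward merge cost, not how `AgglomerativeClustering` or `scipy` implement it internally; the function names `ward_cost` and `agglomerate` and the five toy points are hypothetical.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    centroid_a = [sum(c) / len(a) for c in zip(*a)]
    centroid_b = [sum(c) / len(b) for c in zip(*b)]
    sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def agglomerate(points, n_clusters):
    """Naive bottom-up clustering: merge the cheapest pair until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        # find the pair of clusters whose merge has the lowest Ward cost
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Toy data (assumption): two close pairs and one isolated point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(points, 3)
print(result)
```

Each pass scans all pairs, so this sketch is only suitable for small inputs; library implementations reach the same kind of result far more efficiently.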

#### 2. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Hierarchical Clustering helps businesses group data based on similarities and differences, making it easier to understand customer segments and their unique characteristics. This allows companies to tailor pricing, products, services, and marketing strategies for each segment. Well-defined segments enable more personalized targeting, leading to improved customer engagement and better business performance in the market.


### ML Model - 3

**Sentiment Analysis**

In [None]:
#Importing all the required libraries for sentiment analysis
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report, f1_score,
                             precision_score, recall_score, roc_auc_score, roc_curve)

In [None]:
# List of models
models = [["LogisticRegression", LogisticRegression(fit_intercept = True, class_weight='balanced')], ["DecisionTree", DecisionTreeClassifier()],
          ["RandomForest",RandomForestClassifier()],["XGBoost", XGBClassifier()],
          ["KNN", KNeighborsClassifier()]]

In [None]:
#function for fitting the model and calculating scores

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve
import pandas as pd

def model_build(models, X_train, X_test, y_train, y_test):

    score_matrix = pd.DataFrame()
    roc_sc = {}

    for model_name, model in models:
        current_result = {}

        # Fit model
        model.fit(X_train, y_train)

        # Predictions
        y_pred_test = model.predict(X_test)
        y_pred_train = model.predict(X_train)
        ypredProb = model.predict_proba(X_test)

        # Metrics
        current_result["Model"] = model_name
        current_result["Train Accuracy"] = accuracy_score(y_train, y_pred_train)
        current_result["Test Accuracy"] = accuracy_score(y_test, y_pred_test)
        current_result["Test Precision"] = precision_score(y_test, y_pred_test)
        current_result["Test Recall"] = recall_score(y_test, y_pred_test)
        current_result["Test F1"] = f1_score(y_test, y_pred_test)
        current_result["Test ROC_AUC Score"] = roc_auc_score(y_test, ypredProb[:,1])

        # Convert to DataFrame
        current_result = pd.DataFrame([current_result])

        # Correct concat
        score_matrix = pd.concat([score_matrix, current_result], ignore_index=True)

        # ROC Curve values
        fpr, tpr, threshold = roc_curve(y_test, ypredProb[:,1])
        roc_sc[model_name] = (fpr, tpr)

    # Random ROC baseline
    random_probs = [0 for _ in range(len(y_test))]
    p_fpr, p_tpr, _ = roc_curve(y_test, random_probs)
    roc_sc["TPR = FPR"] = (p_fpr, p_tpr)

    return score_matrix, roc_sc

In [None]:
# Obtaining results
model_results, Curve = model_build(models,X_train,X_test,y_train,y_test)

In [None]:
model_results

In [None]:
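# ROC curves: false positive rate (x) against true positive rate (y) for each model
plt.figure(figsize = (10,7))
for model, (fpr, tpr) in Curve.items():
    plt.plot(fpr, tpr, label = model)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()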

Based on the ROC curves, Logistic Regression is the best-performing model and was selected for final deployment.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# hyperparameter tuning
from sklearn.model_selection import GridSearchCV

In [None]:
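# Each dict pairs a penalty only with solvers that support it.
# 'elasticnet' needs solver='saga' plus l1_ratio, and an unpenalized model
# ignores C, so both are left out of this grid.
param_grid = [
    {'penalty' : ['l2'],
    'C' : [100, 10, 1.0, 0.1, 0.01],
    'solver' : ['lbfgs','newton-cg','liblinear']},
    {'penalty' : ['l1'],
    'C' : [100, 10, 1.0, 0.1, 0.01],
    'solver' : ['liblinear']},
]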



In [None]:
grid_lr = GridSearchCV( LogisticRegression(fit_intercept = True, class_weight='balanced'), param_grid = param_grid, cv = 3, verbose=True, n_jobs=-1)
best_clf = grid_lr.fit(X_train,y_train)

# Get the results
print(grid_lr.best_score_)
print(grid_lr.best_estimator_)
print(grid_lr.best_params_)

In [None]:
final_model = LogisticRegression(random_state=42, solver='lbfgs', penalty= 'l2', C = 10 )
final_model.fit(X_train, y_train)

In [None]:
# prediction report
y_pred = final_model.predict(X_test)
print(classification_report(y_test,y_pred,digits=4))

In [None]:
# Confusion Matrix
conf_mat = confusion_matrix(y_test, y_pred)
plt.rcParams['figure.figsize'] = (5, 5)
sns.heatmap(conf_mat, annot = True, fmt = 'd', linewidths=.5, cmap="YlGnBu")  # fmt='d' shows counts as integers
plt.title('Confusion matrix')
plt.show()

##### Which hyperparameter optimization technique have you used and why?

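I used Grid Search with cross-validation (`GridSearchCV`, `cv = 3`). It evaluates every combination in the parameter grid (penalty, `C`, and solver) and keeps the one with the best mean cross-validated score, which is practical here because the grid is small. The best value found was `'C': 10`.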

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

After hyperparameter tuning of Logistic Regression, accuracy and recall improved slightly, while precision and F1 score decreased:

Accuracy Before: 85.48% || Accuracy After: 86.00%

Precision Before: 90.97% || Precision After: 85.97%

Recall Before: 85.96% || Recall After: 86.02%

F1 Score Before: 88.40% || F1 Score After: 85.99%

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

For sentiment analysis, evaluation metrics used were precision, recall, F1-score, and accuracy.

* Precision measures the proportion of true positive predictions among all positive predictions. It is a good metric to use when the cost of false positives is high.

* Recall (also known as sensitivity or true positive rate) measures the proportion of true positive predictions among all actual positive instances. It is a good metric to use when the cost of false negatives is high.

* F1-score is the harmonic mean of precision and recall, and is a good overall measure of a classifier's performance.

* Accuracy is the proportion of correctly classified instances among all instances.

The specific evaluation metric to use will depend on the specific use case and the relative costs of false positives and false negatives. For a positive business impact, F1-score can be considered as it balances the precision and recall to give an overall performance measure.
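As a worked illustration of the four definitions above, the sketch below derives each metric from the confusion-matrix counts. The function name `classification_metrics` and the ten labels are made up for this example; in the notebook the same values come from `precision_score`, `recall_score`, `f1_score`, and `accuracy_score`.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Toy labels (assumption): 5 positive and 5 negative reviews
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case there are 4 true positives, 2 false positives, and 1 false negative, so recall is higher than precision and F1 falls between the two.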

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose the Logistic Regression model for final prediction because it achieved the highest ROC-AUC score among the models compared.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.
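A minimal sketch of the save, load, and sanity-check round trip is shown here using `pickle`. It fits a small stand-in `LogisticRegression` on made-up one-feature data rather than the notebook's `final_model`, and the file name `sentiment_model.pkl` is hypothetical. If the features come from a fitted text vectorizer, that object would presumably need to be saved alongside the model as well.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Toy stand-in for the training data (assumption, not the project data)
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Save the fitted model to disk (hypothetical file name, temporary directory)
path = os.path.join(tempfile.mkdtemp(), "sentiment_model.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back and predict on data the model has not seen
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

X_unseen = [[-1.0], [4.0]]
print(loaded_model.predict(X_unseen))

# Sanity check: the reloaded model should agree with the original
same = list(loaded_model.predict(X_unseen)) == list(toy_model.predict(X_unseen))
print("Predictions match:", same)
```

Pickle files should only be loaded from trusted sources, and the scikit-learn version used for loading should match the one used for saving.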


In [None]:
# Load the File and predict unseen data.

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, Zomato restaurant data was analyzed to understand restaurant performance and customer sentiment using data science techniques. Exploratory analysis showed that most restaurants fall in the mid-price range, popular cuisines like North Indian and Chinese dominate the market, and only a few restaurants receive most of the customer engagement.

Hierarchical clustering helped group restaurants into meaningful segments, making it easier to understand similarities between them. Silhouette analysis confirmed that six clusters provided the best separation. Sentiment analysis was performed using multiple machine learning models, and Logistic Regression was selected as the final model because it showed the best overall performance and highest ROC-AUC score.

Overall, this project demonstrates how restaurant data and customer reviews can be used to gain valuable business insights. These insights can help restaurants improve service quality, pricing strategies, and customer satisfaction, while platforms like Zomato can enhance recommendations and user experience.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***