# **Project Name**    - Zomato Data Analysis Project



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
Name - Shawn Lasrado


# **Project Summary -**

This project analyzes restaurant and review data from Zomato to uncover trends in customer behavior, cuisine popularity, and rating patterns. Two datasets, restaurant metadata and customer reviews, were cleaned, merged, and explored through 15 visualizations.

**Key insights include:**

* Most common cuisines are North Indian and Chinese, while Mediterranean and European are among the highest rated.

* No strong link between cost and rating, showing budget restaurants can perform well.

* Evening reviews tend to have slightly higher ratings.

* Photos and longer reviews often accompany better ratings, suggesting higher engagement.

* Some Zomato collections consistently feature higher-rated restaurants.

Charts like correlation heatmaps, pair plots, bar plots, and scatterplots helped visualize patterns across numeric and categorical variables.

These findings can guide Zomato and restaurant partners to optimize menus, personalize recommendations, and promote high-performing categories. Overall, the project highlights how data-driven decisions can enhance user satisfaction and restaurant performance.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Zomato collects vast amounts of restaurant and review data, including ratings, cuisines, cost, user reviews, and other metadata. However, it remains a challenge to understand hidden patterns in customer preferences, cuisine performance, and restaurant characteristics across the platform.

This project applies exploratory data analysis and unsupervised machine learning techniques to identify natural groupings, patterns, and relationships within Zomato’s data. The objective is to uncover:

* Clusters of restaurants based on cost, rating, and cuisine variety

* Trends in customer behavior and engagement

* Key features that differentiate high-performing restaurants

These insights can support Zomato in improving content organization, enhancing restaurant discovery, and enabling data-driven strategies for both the platform and its restaurant partners, without relying on labeled or supervised outputs.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
data1 = pd.read_csv('/content/drive/MyDrive/Datasets/Zomato_metadata.csv')
data2 = pd.read_csv('/content/drive/MyDrive/Datasets/Zomato_reviews.csv')

### Dataset First View

In [None]:
# Dataset First Look
data1.head()

In [None]:
data2.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data1.shape

In [None]:
data2.shape

### Dataset Information

In [None]:
# Dataset Info
data1.info()

In [None]:
data2.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data1.duplicated().sum()

In [None]:
data2.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data1.isnull().sum()

In [None]:
data2.isnull().sum()

In [None]:
# Visualizing the missing values
missing = data1.isnull().sum()
missing = missing[missing > 0]

plt.figure(figsize=(8, 4))
sns.barplot(x=missing.index, y=missing.values, palette="mako")
plt.title("Missing Values Count in Metadata Dataset")
plt.ylabel("Missing Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
missing = data2.isnull().sum()
missing = missing[missing > 0]

plt.figure(figsize=(8, 4))
sns.barplot(x=missing.index, y=missing.values, palette="mako")
plt.title("Missing Values Count in Reviews Dataset")
plt.ylabel("Missing Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data1.columns

In [None]:
data2.columns

In [None]:
# Dataset Describe
data1.describe()

In [None]:
data2.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
data1.nunique()

In [None]:
data2.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Drop duplicates from reviews
data2.drop_duplicates(inplace=True)


In [None]:
# Drop rows with missing critical values in reviews
data2.dropna(subset=["Reviewer", "Review", "Rating", "Time"], inplace=True)

In [None]:
# clean the data and convert cost into numerical
data1["Cost"] = data1["Cost"].str.replace(",", "").astype(int)

In [None]:
# Fill missing metadata values with explicit placeholder labels
data1["Collections"] = data1["Collections"].fillna("Not Specified")
data1["Timings"] = data1["Timings"].fillna("Not Available")

In [None]:
# Convert rating into a numeric value; entries that cannot be parsed become None
def convert_rating(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

data2["Rating"] = data2["Rating"].apply(convert_rating)
data2.dropna(subset=["Rating"], inplace=True)
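
The same conversion can also be written with pandas' vectorized `pd.to_numeric`. The sketch below is only an illustration on a made-up `ratings` Series (the values, including the text entry `"Like"`, are assumptions and not taken from the dataset). It shows how unparseable entries turn into `NaN` and are then dropped.

```python
import pandas as pd

# Hypothetical sample; in the notebook the column is data2["Rating"]
ratings = pd.Series(["5", "4.5", "Like", None, "3"])

# errors="coerce" converts anything that cannot be parsed into NaN
numeric_ratings = pd.to_numeric(ratings, errors="coerce")

# Keep only the rows that hold a valid number
clean_ratings = numeric_ratings.dropna()
print(clean_ratings.tolist())
```

Compared with a row-wise `apply`, this avoids a Python-level loop, which may matter on larger review tables.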

In [None]:
# Merge the two datasets on restaurant name
merged_data = pd.merge(data2, data1, left_on="Restaurant", right_on="Name", how="inner")

In [None]:
print("Final dataset shape:", merged_data.shape)
merged_data.head()
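
An inner merge silently drops reviews whose restaurant name has no exact match in the metadata. As a minimal sketch, assuming two toy frames (`reviews` and `meta` are hypothetical stand-ins for `data2` and `data1`), a left merge with `indicator=True` makes the unmatched rows visible before they are discarded.

```python
import pandas as pd

# Hypothetical stand-ins for the reviews and metadata tables
reviews = pd.DataFrame({"Restaurant": ["A", "A", "B", "C"],
                        "Rating": [5.0, 4.0, 3.0, 2.0]})
meta = pd.DataFrame({"Name": ["A", "B"], "Cost": [800, 500]})

# A left merge keeps every review; the "_merge" column records where each row matched
check = pd.merge(reviews, meta, left_on="Restaurant", right_on="Name",
                 how="left", indicator=True)

# Reviews flagged "left_only" would be lost in an inner merge
unmatched = int((check["_merge"] == "left_only").sum())
print("Reviews without metadata:", unmatched)
```

If the count is large, the restaurant names may need normalizing (whitespace, case) before merging.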

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 average Cost vs Rating Scatter Plot
avg_data = merged_data.groupby('Restaurant')[['Cost', 'Rating']].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.scatterplot(data=avg_data, x='Cost', y='Rating')
plt.title("Average Cost vs Average Rating per Restaurant")
plt.xlabel("Average Cost for Two")
plt.ylabel("Average Rating")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To explore the relationship between a restaurant's price level and customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?

There's no strong linear correlation between cost and rating. High ratings are observed across various cost ranges.
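
A scatter plot gives only a visual impression, so a correlation coefficient can put a number on it. The sketch below uses hypothetical per-restaurant averages in `avg_demo` (made-up values, not results from this dataset); in the notebook the equivalent frame would be `avg_data`. Spearman's coefficient is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical per-restaurant averages for illustration only
avg_demo = pd.DataFrame({
    "Cost": [300, 500, 800, 1200, 2000],
    "Rating": [3.9, 3.5, 4.1, 3.6, 3.8],
})

# Pearson measures linear association
pearson_r = avg_demo["Cost"].corr(avg_demo["Rating"])

# Spearman measures monotonic association: Pearson applied to the ranks
spearman_r = avg_demo["Cost"].rank().corr(avg_demo["Rating"].rank())

print(f"Pearson r: {pearson_r:.3f}, Spearman rho: {spearman_r:.3f}")
```

Values near zero for both coefficients would support the reading that cost and rating are not strongly related.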

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Restaurants don’t need to be expensive to achieve high ratings; they can focus on quality service and food.

* Negative Growth Insight:
Overpricing without first improving customer experience won’t guarantee better ratings or loyalty.




#### Chart - 2

In [None]:
# Chart - 2 Restaurants with most cuisines
merged_data['Cuisine Count'] = merged_data['Cuisines'].apply(lambda x: len(str(x).split(", ")))
top_cuisine_counts = merged_data[['Restaurant', 'Cuisine Count']].drop_duplicates().sort_values(by='Cuisine Count', ascending=False).head(10)

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=top_cuisine_counts, x='Cuisine Count', y='Restaurant', palette='mako')
plt.title("Top 10 Restaurants with Most Cuisines")
plt.xlabel("Number of Cuisines")
plt.ylabel("Restaurant")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To identify which restaurants offer the most diverse cuisine options.

##### 2. What is/are the insight(s) found from the chart?

The restaurant "Beyond Flavours" offers the most variety, potentially attracting more customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Positive Business Impact:
Diversity in cuisine may help attract more customers and broaden the customer base.

*   Negative Growth Insight:
Offering too many cuisines might weaken the brand identity, because the restaurant has no signature dish.



#### Chart - 3

In [None]:
# Chart - 3 most common cuisines
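merged_data['Cuisine List'] = merged_data['Cuisines'].str.split(', ')
exploded = merged_data.explode('Cuisine List')

# Count each cuisine once per restaurant so the bars show restaurants, not reviews
top_cuisine_freq = (
    exploded.drop_duplicates(subset=['Restaurant', 'Cuisine List'])['Cuisine List']
    .value_counts()
    .head(10)
)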

plt.figure(figsize=(10, 6))
sns.barplot(x=top_cuisine_freq.values, y=top_cuisine_freq.index, palette='Set2')
plt.title("Top 10 Most Common Cuisines")
plt.xlabel("Number of Restaurants")
plt.ylabel("Cuisine")
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

To find the most popular cuisines across restaurants.

##### 2. What is/are the insight(s) found from the chart?

North Indian and Chinese are the most common cuisines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
New restaurants can tap into these popular cuisines to gain initial traction.

* Negative Growth Insight:
Entering an oversaturated cuisine market without a clear differentiator could result in poor visibility and growth.



#### Chart - 4

In [None]:
# Chart - 4 Rating Distribution
sns.histplot(merged_data['Rating'], bins=10, kde=True)
plt.title("Rating Distribution")
plt.xlabel("Rating")
plt.ylabel("Number of Reviews")
plt.show()

##### 1. Why did you pick the specific chart?

To understand how the review rating is distributed.

##### 2. What is/are the insight(s) found from the chart?

The peak is at rating 5, which indicates that many customers are satisfied.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
High ratings boost trust. Companies can leverage this in marketing.

* Negative Growth Insight:
The spike in 5-star reviews may also indicate possible review manipulation if not organic.



#### Chart - 5

In [None]:
# Chart - 5 Average Cost per Cuisine
avg_cost = exploded.groupby('Cuisine List')['Cost'].mean().sort_values(ascending=False).head(10)
sns.barplot(x=avg_cost.values, y=avg_cost.index)
plt.title("Top 10 Most Expensive Cuisines")
plt.xlabel("Average Cost")
plt.ylabel("Cuisine")
plt.show()


##### 1. Why did you pick the specific chart?

To understand which cuisines are sold at a higher price.

##### 2. What is/are the insight(s) found from the chart?

Modern Indian, Japanese, and Sushi are the most expensive cuisines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Premium dishes can be targeted to high-income segments for increased revenue.

* Negative Growth Insight:
Expensive dishes may put off price-sensitive customers if the price is not backed by quality.



#### Chart - 6

In [None]:
# Chart - 6 Review Length vs Rating
merged_data['Review Length'] = merged_data['Review'].str.len()
sns.scatterplot(x=merged_data['Review Length'], y=merged_data['Rating'])
plt.title("Review Length vs Rating")
plt.xlabel("Review Length (characters)")
plt.ylabel("Rating")
plt.show()

##### 1. Why did you pick the specific chart?

To find whether longer reviews correlate with higher/lower ratings.

##### 2. What is/are the insight(s) found from the chart?

Longer reviews appear across all ratings, suggesting they reflect strong sentiment, whether positive or negative.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Detailed reviews offer actionable feedback for improvement.

* Negative Growth Insight:
Frequent long 1-star reviews may highlight serious issues needing urgent attention.



#### Chart - 7

In [None]:
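# Chart - 7 Distribution of Cost
# Keep one row per restaurant so the y-axis counts restaurants, not reviews
sns.histplot(merged_data.drop_duplicates(subset='Restaurant')['Cost'], bins=20, kde=True)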
plt.title("Cost Distribution")
plt.xlabel("Cost for Two")
plt.ylabel("Number of Restaurants")
plt.show()


##### 1. Why did you pick the specific chart?

To see how cost for two is distributed across restaurants.

##### 2. What is/are the insight(s) found from the chart?

The majority of restaurants are in the ₹300–₹800 range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Targeting mid-range pricing is ideal to attract the bulk of consumers.

* Negative Growth Insight:
Restaurants priced too high without strong justification risk low traffic.



#### Chart - 8

In [None]:
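# Chart - 8 Rating Distribution by Cost Bucket
# Half-open bins [0, 500), [500, 1000), ... match the labels; the last bin is open-ended
merged_data['Cost Bucket'] = pd.cut(merged_data['Cost'], bins=[0, 500, 1000, 1500, 2000, np.inf], labels=['<500','500-1k','1k-1.5k','1.5k-2k','2k+'], right=False)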
sns.boxplot(x=merged_data['Cost Bucket'], y=merged_data['Rating'])
plt.title("Rating by Cost Bucket")
plt.xlabel("Cost Bucket")
plt.ylabel("Rating")
plt.show()


##### 1. Why did you pick the specific chart?

To compare rating distributions across cost brackets.

##### 2. What is/are the insight(s) found from the chart?

All cost buckets show similar rating medians, suggesting that price has little effect on ratings.
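
The medians read off a box plot can also be tabulated. The following is a minimal sketch on a hypothetical `demo` frame (values are made up), assuming half-open bins so that a cost of exactly 500 falls into `500-1k` rather than `<500`.

```python
import numpy as np
import pandas as pd

# Hypothetical costs and ratings for illustration only
demo = pd.DataFrame({
    "Cost": [200, 450, 500, 900, 1200, 1600, 2500],
    "Rating": [4, 5, 3, 4, 4, 5, 4],
})

# right=False makes each bin include its lower edge and exclude its upper edge
demo["Bucket"] = pd.cut(
    demo["Cost"],
    bins=[0, 500, 1000, 1500, 2000, np.inf],
    labels=["<500", "500-1k", "1k-1.5k", "1.5k-2k", "2k+"],
    right=False,
)

# Median rating and number of rows per bucket
bucket_summary = demo.groupby("Bucket", observed=True)["Rating"].agg(["median", "count"])
print(bucket_summary)
```

Reporting the count next to the median helps spot buckets that contain too few rows to compare reliably.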

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Lower-cost restaurants can still compete on satisfaction.

* Negative Growth Insight:
Raising prices alone is unlikely to yield better ratings unless it is accompanied by an exceptional experience.



#### Chart - 9

In [None]:
# Chart - 9 Number of Pictures Shared per Cuisine
pics_per_cuisine = exploded.groupby('Cuisine List')['Pictures'].sum().sort_values(ascending=False).head(10)
sns.barplot(x=pics_per_cuisine.values, y=pics_per_cuisine.index)
plt.title("Most Photographed Cuisines")
plt.xlabel("Pictures")
plt.ylabel("Cuisine")
plt.show()

##### 1. Why did you pick the specific chart?

To identify which cuisines are photographed the most.

##### 2. What is/are the insight(s) found from the chart?

North Indian and Chinese are again at the top; they are the most photographed cuisines.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
The most photographed cuisines can drive organic marketing via social media.

* Negative Growth Insight:
Cuisines with less visual appeal may struggle for exposure on platforms like Instagram unless creatively presented.



#### Chart - 10

In [None]:
# Chart - 10 Average Rating by Time of Day (AM/PM)
merged_data['Hour'] = pd.to_datetime(merged_data['Time']).dt.hour
merged_data['Time of Day'] = merged_data['Hour'].apply(lambda x: 'AM' if x < 12 else 'PM')
sns.boxplot(x=merged_data['Time of Day'], y=merged_data['Rating'])
plt.title("Rating by Time of Day")
plt.xlabel("Time of Day")
plt.ylabel("Rating")
plt.show()

##### 1. Why did you pick the specific chart?

To test if the time of review affects the rating.

##### 2. What is/are the insight(s) found from the chart?

AM and PM ratings show similar medians, with a slightly wider spread in PM.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Consistent ratings across day parts suggest stable service quality.

* Negative Growth Insight:
Slight increase in low PM ratings may hint at evening rush issues like slow service or wait times.



#### Chart - 11

In [None]:
# Chart - 11 common review words
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

text = ' '.join(merged_data['Review'].dropna().astype(str))

wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(text)

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Most Frequent Words in Customer Reviews")
plt.show()


##### 1. Why did you pick the specific chart?

To extract qualitative insights from customer feedback using natural language.

##### 2. What is/are the insight(s) found from the chart?

Words like food, taste, service, place, ambience, and good dominate, reflecting the common themes customers care about.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
Helps prioritize the aspects that matter most to customers. Emphasizing food quality, service, and ambience can improve satisfaction.

* Negative Growth Insight:
Words like time or price could hint at delays or cost concerns if they appear frequently in a negative context; confirming this needs deeper sentiment analysis.



#### Chart - 12

In [None]:
# Chart - 12 Average Rating per Hour of the Day
merged_data['Hour'] = pd.to_datetime(merged_data['Time']).dt.hour

hourly_rating = merged_data.groupby('Hour')['Rating'].mean().reset_index()

plt.figure(figsize=(10, 5))
sns.lineplot(x=hourly_rating['Hour'], y=hourly_rating['Rating'], marker='o')
plt.title("Average Rating by Hour of the Day")
plt.xlabel("Hour")
plt.ylabel("Average Rating")
plt.xticks(range(0, 24))
plt.grid(True)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To investigate if the time of day affects review sentiment.

##### 2. What is/are the insight(s) found from the chart?

Ratings peak around 5 AM and dip slightly between 7–8 AM, suggesting early reviewers are more positive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*  Positive Business Impact:
Restaurants can focus service improvements on low-rated hours and consider using this trend for targeted offers.

* Negative Growth Insight:
If ratings drop consistently during specific hours, it could indicate service issues during those periods.



#### Chart - 13

In [None]:
# Chart - 13 Top 10 Restaurants by Number of Reviews
top_reviewed = merged_data['Restaurant'].value_counts().head(10)
sns.barplot(x=top_reviewed.values, y=top_reviewed.index)
plt.title("Top 10 Most Reviewed Restaurants")
plt.xlabel("Number of Reviews")
plt.ylabel("Restaurant")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

To highlight which restaurants generate the highest volume of customer engagement.

##### 2. What is/are the insight(s) found from the chart?

Restaurants like Beyond Flavours and Paradise lead in number of reviews, showing strong customer interaction and visibility.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



* Positive Business Impact:
These restaurants can be studied as benchmarks for engagement strategies. High review counts boost credibility and SEO visibility.

* Negative Growth Insight:
A high number of reviews might also bring more scrutiny. If many are negative or unresolved, it could damage reputation.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

numeric_df = merged_data[['Cost', 'Rating', 'Pictures', 'Review Length']].dropna()

corr_matrix = numeric_df.corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap: Cost, Rating, Pictures & Review Length")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

This heatmap helps us understand the strength and direction of linear relationships between key numeric variables that may influence user satisfaction and revenue, such as cost, review pictures, review length, and ratings.

##### 2. What is/are the insight(s) found from the chart?



* Cost is weakly positively correlated with Rating, meaning more expensive restaurants may be rated slightly better.

* Review Length and Pictures show a moderate positive correlation (0.47), suggesting that users who post pictures tend to write longer reviews.

* Rating has almost no correlation with Review Length (-0.03) and a very weak positive correlation with Pictures (0.08), indicating that more review content doesn’t necessarily mean a higher rating.





#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
pair_data = merged_data[['Cost', 'Rating']].dropna()

sns.pairplot(pair_data, diag_kind='kde', corner=True)

plt.suptitle("Pair Plot: Cost vs Rating", y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

The pair plot helps visualize both the distribution and the relationship between Cost and Rating using scatter plots and density (KDE) plots. It supports further investigation into whether cost influences customer satisfaction.

##### 2. What is/are the insight(s) found from the chart?



* Most restaurants fall into the lower cost range (below ₹1000), based on the distribution.

* Ratings are clustered around 3 to 5 regardless of cost, showing no strong linear pattern.

* While higher-cost restaurants exist, they don’t consistently receive higher ratings.

* This supports the correlation heatmap’s conclusion: cost does not strongly influence rating.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis -
* H₀: There is no difference in average rating across reviews with different numbers of pictures (0, 1-2, 3-5, 6+).

Alternate Hypothesis -
* H₁: At least one picture-count group has a significantly different average rating.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
merged_data['Picture_Group'] = pd.cut(
    merged_data['Pictures'],
    bins=[-1, 0, 2, 5, float('inf')],
    labels=['0', '1-2', '3-5', '6+']
)

from scipy.stats import f_oneway

group_0 = merged_data[merged_data['Picture_Group'] == '0']['Rating']
group_1_2 = merged_data[merged_data['Picture_Group'] == '1-2']['Rating']
group_3_5 = merged_data[merged_data['Picture_Group'] == '3-5']['Rating']
group_6_plus = merged_data[merged_data['Picture_Group'] == '6+']['Rating']

f_stat, p_val = f_oneway(group_0, group_1_2, group_3_5, group_6_plus)

print(f"F-Statistic: {f_stat:.3f}")
print(f"P-Value: {p_val:.4f}")


##### Which statistical test have you done to obtain P-Value?

Test Used: One-Way ANOVA

##### Why did you choose the specific statistical test?



*   ANOVA is suitable for comparing the means of more than two independent groups.
*   It helps test whether at least one group mean is different without doing multiple t-tests.
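
The test only produces a statistic and a p-value; the conclusion comes from comparing the p-value with a chosen significance level. Below is a minimal sketch of that decision step, assuming a conventional `alpha` of 0.05 and three hypothetical rating groups (the numbers are made up and are not the notebook's results).

```python
from scipy.stats import f_oneway

# Hypothetical ratings for three groups, for illustration only
group_a = [4, 5, 4, 5, 4]
group_b = [3, 4, 3, 4, 3]
group_c = [4, 4, 5, 5, 4]

alpha = 0.05  # assumed significance level
f_stat, p_val = f_oneway(group_a, group_b, group_c)

# Reject H0 when the p-value falls below alpha
reject_null = p_val < alpha
if reject_null:
    print(f"p = {p_val:.4f} < {alpha}: reject H0, at least one group mean differs")
else:
    print(f"p = {p_val:.4f} >= {alpha}: fail to reject H0")
```

A significant ANOVA result says that some group differs, not which one; a post-hoc comparison would be needed for that.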



### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis -
*   H₀: Less popular cuisines (by count) have the same average rating as popular cuisines.

Alternate Hypothesis -
*   H₁: Less popular cuisines have a different (possibly higher) average rating than popular cuisines.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import mannwhitneyu

# Step 1: Count cuisines
cuisine_counts = merged_data['Cuisines'].value_counts()
median_count = cuisine_counts.median()

# Step 2: Create a popularity label
merged_data['Cuisine_Popularity'] = merged_data['Cuisines'].apply(
    lambda x: 'Popular' if cuisine_counts.get(x, 0) >= median_count else 'Less_Popular'
)

# Step 3: Get ratings for both groups
popular_ratings = merged_data[merged_data['Cuisine_Popularity'] == 'Popular']['Rating']
less_popular_ratings = merged_data[merged_data['Cuisine_Popularity'] == 'Less_Popular']['Rating']

# Step 4: Mann-Whitney U Test (non-parametric)
stat, p_val = mannwhitneyu(popular_ratings, less_popular_ratings, alternative='two-sided')

print(f"Mann-Whitney U Statistic: {stat}")
print(f"P-Value: {p_val:.4f}")


##### Which statistical test have you done to obtain P-Value?

Test Used: Mann-Whitney U Test

##### Why did you choose the specific statistical test?

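We are comparing ratings between two independent groups: popular cuisines and less popular cuisines. Ratings are discrete and peak at 5 (see Chart 4), so a non-parametric test that compares rank distributions is a safer choice than a t-test, which assumes roughly normal data.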
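
As a small illustration of how the test is called and read, the sketch below runs it on two hypothetical samples (`popular` and `less_popular` hold made-up ratings). It assumes a two-sided alternative, as in the notebook cell above, and a significance level of 0.05.

```python
from scipy.stats import mannwhitneyu

# Hypothetical ratings for the two groups, for illustration only
popular = [3, 4, 4, 5, 3, 4, 2, 4]
less_popular = [4, 5, 5, 5, 4, 5, 4, 5]

# The test ranks all observations together and compares the rank sums,
# so it does not rely on the ratings being normally distributed
stat, p_val = mannwhitneyu(popular, less_popular, alternative="two-sided")

alpha = 0.05  # assumed significance level
print(f"U = {stat}, p = {p_val:.4f}, reject H0: {p_val < alpha}")
```

Because ratings contain many ties, SciPy may fall back to a normal approximation for the p-value, so small samples should be interpreted with care.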

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis -
* H₀: Average rating is the same across all cost buckets.

Alternate Hypothesis -
* H₁: Average rating differs significantly among different cost buckets.


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Step 1: Create Cost Buckets
def cost_bucket(cost):
    if cost < 500:
        return '<500'
    elif cost < 1000:
        return '500-1k'
    elif cost < 1500:
        return '1k-1.5k'
    elif cost < 2000:
        return '1.5k-2k'
    else:
        return '2k+'

merged_data['Cost_Bucket'] = merged_data['Cost'].apply(cost_bucket)

# Step 2: Group ratings by cost bucket
grouped_ratings = merged_data.groupby('Cost_Bucket')['Rating'].apply(list)

# Step 3: Perform One-Way ANOVA
f_stat, p_val = f_oneway(*grouped_ratings)

print(f"F-Statistic: {f_stat:.3f}")
print(f"P-Value: {p_val:.4f}")


##### Which statistical test have you done to obtain P-Value?

Test Used: One-Way ANOVA

##### Why did you choose the specific statistical test?



* We are comparing average ratings across multiple cost buckets (e.g., <₹500, ₹500–₹1k, ₹1k–₹1.5k, etc.).

* ANOVA is appropriate for comparing >2 groups.




## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
merged_data.isnull().sum()  # count the missing values that remain in each column

#### What all missing value imputation techniques have you used and why did you use those techniques?

To prepare the dataset for analysis, I applied a combination of missing value handling techniques. For critical columns such as Reviewer, Review, Rating, and Time, rows with missing values were dropped to ensure data quality and avoid unreliable imputations. For categorical fields like Collections and Timings, missing values were filled with explicit placeholder labels (‘Not Specified’ and ‘Not Available’ respectively) to maintain dataset completeness while clearly marking missing information. This balanced approach preserves the integrity of key data points while minimizing data loss.
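
The two strategies described above can be sketched on toy data. The frames `reviews_demo` and `meta_demo` and all their values are hypothetical; only the column names and placeholder label follow the notebook.

```python
import pandas as pd

# Hypothetical review rows with gaps in critical columns
reviews_demo = pd.DataFrame({
    "Reviewer": ["Asha", None, "Ravi"],
    "Review": ["Great food", "Okay", "Slow service"],
    "Rating": ["5", "4", None],
})

# Hypothetical metadata rows with a gap in a descriptive column
meta_demo = pd.DataFrame({
    "Name": ["A", "B"],
    "Collections": ["Top Picks", None],
})

# Critical columns: drop incomplete rows instead of guessing a value
reviews_clean = reviews_demo.dropna(subset=["Reviewer", "Review", "Rating"])

# Descriptive columns: keep the row and mark the gap explicitly
meta_filled = meta_demo.fillna({"Collections": "Not Specified"})

print(len(reviews_clean), meta_filled["Collections"].tolist())
```

Dropping is reserved for fields the analysis cannot do without, while placeholder labels keep every restaurant in the metadata.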

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    print(f"Outliers detected in '{column}': {len(outliers)}")
    return outliers

cost_outliers = detect_outliers_iqr(merged_data, 'Cost')

rating_outliers = detect_outliers_iqr(merged_data, 'Rating')



##### What all outlier treatment techniques have you used and why did you use those techniques?

I used the Interquartile Range (IQR) method to identify outliers in numerical columns such as Cost and Rating. This technique highlights values that fall significantly outside the typical range without removing or modifying them. I chose the IQR method because it is robust to skewed distributions and provides a reliable way to detect extreme values for further analysis.
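
To make the IQR rule concrete, here is a worked sketch on a hypothetical `costs` Series (made-up values). The capping step at the end is only one possible follow-up treatment and is not something this notebook applies; the notebook only flags outliers.

```python
import pandas as pd

# Hypothetical cost values with one extreme entry
costs = pd.Series([300, 400, 500, 600, 700, 800, 5000])

q1 = costs.quantile(0.25)
q3 = costs.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = costs[(costs < lower) | (costs > upper)]

# Optional follow-up (not used in this notebook): cap values at the bounds
capped = costs.clip(lower=lower, upper=upper)

print(f"bounds: [{lower}, {upper}], outliers: {outliers.tolist()}")
```

Because the bounds come from quartiles, a single extreme value barely moves them, which is why the rule copes well with skewed cost data.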

### 3. Categorical Encoding

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
import re

# Step 1: Split cuisines into lists
data1['Cuisine_List'] = data1['Cuisines'].apply(lambda x: [i.strip() for i in x.split(',')])

# Step 2: One-hot encode cuisines
mlb = MultiLabelBinarizer()
# Reuse data1's index so the encoded columns stay aligned when concatenated
cuisine_encoded = pd.DataFrame(mlb.fit_transform(data1['Cuisine_List']), columns=mlb.classes_, index=data1.index)

# Step 3: Define timing encoding function
def encode_timing(t):
    if pd.isnull(t):
        return 'Unknown'
    t = t.lower()
    if 'am' in t and 'pm' in t:
        return 'All day'
    elif re.search(r'(\d{1,2})(:?\d{0,2})?\s*am', t):
        return 'Morning'
    elif re.search(r'(\d{1,2})(:?\d{0,2})?\s*pm', t):
        return 'Evening'
    return 'Unknown'

# Step 4: Apply timing encoding
data1['Timing_Category'] = data1['Timings'].apply(encode_timing)

# Step 5: Combine encoded cuisines and timing dummies with original data
data1_encoded = pd.concat([data1.drop(columns=['Cuisine_List']), cuisine_encoded], axis=1)
data1_encoded = pd.concat([data1_encoded, pd.get_dummies(data1_encoded['Timing_Category'], prefix='Timing').astype(int)], axis=1)

# Step 6: Drop the 'Timing_Category' column as it is now encoded
data1_encoded.drop(columns=['Timing_Category'], inplace=True)

# View the first few rows
print(data1_encoded.head())


#### What all categorical encoding techniques have you used & why did you use those techniques?

I used One-Hot Encoding to transform categorical variables like cuisines and timing categories into binary indicator columns. This technique was chosen because it effectively converts nominal categorical data without implying any ordinal relationship, allowing machine learning models to interpret each category independently and avoid introducing unintended hierarchy.
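
For a quick look at what multi-label one-hot encoding produces, the sketch below uses pandas' `Series.str.get_dummies` on a hypothetical `cuisines` Series. This is a different tool from the `MultiLabelBinarizer` used in the notebook, shown here only because it needs no extra library; the output has the same shape.

```python
import pandas as pd

# Hypothetical comma-separated cuisine strings
cuisines = pd.Series(["North Indian, Chinese", "Chinese", "Italian, Chinese"])

# Each distinct cuisine becomes its own 0/1 indicator column
encoded = cuisines.str.get_dummies(sep=", ")
print(encoded)
```

Each row can have several columns set to 1, which is what distinguishes this from single-label one-hot encoding.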

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction
!pip install contractions

import contractions

def expand_contractions(text):
    return contractions.fix(text)

merged_data['Review_Expanded'] = merged_data['Review'].apply(expand_contractions)


#### 2. Lower Casing

In [None]:
# Lower Casing
merged_data['Review_Lower'] = merged_data['Review_Expanded'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

merged_data['Review_NoPunct'] = merged_data['Review_Lower'].apply(remove_punctuation)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
import re

def remove_urls(text):
    # Escape the dot so 'www.' matches a literal period rather than any character
    return re.sub(r'http\S+|www\.\S+', '', text)

def remove_words_with_digits(text):
    return ' '.join([word for word in text.split() if not any(char.isdigit() for char in word)])


In [None]:
# Remove URLs and words that contain digits
merged_data['Review_Cleaned'] = merged_data['Review_NoPunct'].apply(remove_urls)

merged_data['Review_Cleaned'] = merged_data['Review_Cleaned'].apply(remove_words_with_digits)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

merged_data['Review_Cleaned'] = merged_data['Review_Cleaned'].apply(remove_stopwords)


In [None]:
# Remove White spaces
def remove_extra_whitespace(text):
    return ' '.join(text.split())

merged_data['Review_Cleaned'] = merged_data['Review_Cleaned'].apply(remove_extra_whitespace)


#### 6. Rephrase Text

In [None]:
# Rephrase Text
from nltk.corpus import wordnet
nltk.download('wordnet')

def synonym_replace(text):
    new_words = []
    for word in text.split():
        syns = wordnet.synsets(word)
        if syns:
            synonym = syns[0].lemmas()[0].name()
            new_words.append(synonym.replace('_', ' '))
        else:
            new_words.append(word)
    return ' '.join(new_words)

merged_data['Review_Rephrased'] = merged_data['Review_Cleaned'].apply(synonym_replace)


#### 7. Tokenization

In [None]:
# Tokenization
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')


In [None]:
from nltk.tokenize import word_tokenize
merged_data['Review_Tokens'] = merged_data['Review_Cleaned'].apply(word_tokenize)


#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

ps = PorterStemmer()

def stem_text(text):
    tokens = word_tokenize(text)
    stemmed = [ps.stem(word) for word in tokens]
    return ' '.join(stemmed)

merged_data['Review_Stemmed'] = merged_data['Review_Cleaned'].apply(stem_text)


##### Which text normalization technique have you used and why?

I used stemming as the text normalization technique to reduce words to their root forms. Stemming simplifies the vocabulary by cutting words to their base stems, which helps in reducing dimensionality and improving the efficiency of text analysis. Although stemming can produce non-dictionary words, it is computationally faster and suitable for tasks where exact word meaning is less critical.

#### 9. Part of speech tagging

In [None]:
# Note: this cell scores sentiment polarity with TextBlob; it does not produce POS tags
from textblob import TextBlob
import pandas as pd

# Define the sentiment function
def get_sentiment(text):
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity
    sentiment = (
        'Positive' if polarity > 0 else
        'Negative' if polarity < 0 else
        'Neutral'
    )
    return pd.Series([sentiment, polarity])

# Apply it to the cleaned review column
merged_data[['Sentiment', 'Sentiment_Score']] = merged_data['Review_Cleaned'].apply(get_sentiment)

# Display a sample of results
print(merged_data[['Review_Cleaned', 'Sentiment', 'Sentiment_Score']].head())


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')

X_tfidf = tfidf.fit_transform(merged_data['Review_Cleaned'])

print("TF-IDF matrix shape:", X_tfidf.shape)


##### Which text vectorization technique have you used and why?

I used TF-IDF vectorization to convert text into numerical features. TF-IDF effectively captures the importance of words by balancing their frequency within a document against how common they are across the entire dataset. This helps highlight distinctive words while reducing the impact of common, less informative terms, making it well-suited for tasks like classification and clustering.
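
The weighting described above can be worked through by hand. The sketch below assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalisation of each row, which is the variant scikit-learn documents as its default; the function name `tfidf` and the three sample reviews are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenised = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenised for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical cleaned reviews
docs = ["food great service great", "food cold", "service slow food"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first review, "food" appears in every document and so receives the minimum IDF, while "great" is both repeated and rare across documents and ends up with the largest weight.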

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
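from textblob import TextBlob

def get_sentiment_score(text):
    # Treat missing reviews as neutral instead of raising on NaN
    if pd.isnull(text):
        return 0.0
    return TextBlob(text).sentiment.polarity

# Create data2_encoded here so this cell runs on its own; the next cell rebuilds it from data2
data2_encoded = data2.copy()
data2_encoded['Sentiment_Score'] = data2_encoded['Review'].apply(get_sentiment_score)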


In [None]:
# Data transformation
import pandas as pd
import re
from sklearn.preprocessing import MultiLabelBinarizer
from textblob import TextBlob

# Restaurant metadata (data1) processing

# 1. Split cuisines into lists
data1['Cuisine_List'] = data1['Cuisines'].apply(lambda x: [c.strip() for c in x.split(',')])

# 2. One-hot encode cuisines
mlb = MultiLabelBinarizer()
cuisine_encoded = pd.DataFrame(mlb.fit_transform(data1['Cuisine_List']), columns=mlb.classes_, index=data1.index)

# 3. Function to encode timings into categories
def encode_timing(t):
    if pd.isnull(t):
        return 'Unknown'
    t = t.lower()
    if 'am' in t and 'pm' in t:
        return 'All day'
    elif re.search(r'(\d{1,2})(:?\d{0,2})?\s*am', t):
        return 'Morning'
    elif re.search(r'(\d{1,2})(:?\d{0,2})?\s*pm', t):
        return 'Evening'
    return 'Unknown'

# 4. Apply timing encoding
data1['Timing_Category'] = data1['Timings'].apply(encode_timing)

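# 5. Create timing dummies (cast to int: recent pandas returns bool columns,
#    which select_dtypes(include=['number']) would silently drop before clustering)
timing_dummies = pd.get_dummies(data1['Timing_Category'], prefix='Timing').astype(int)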

# 6. Combine all features into data1_encoded
data1_encoded = pd.concat([
    data1.drop(columns=['Cuisine_List', 'Timing_Category', 'Cuisines', 'Timings']),
    cuisine_encoded,
    timing_dummies
], axis=1)

# Review data (data2) processing

data2_encoded = data2.copy()

# 7. Extract Review_Count and Follower_Count from Metadata text
def extract_metadata(meta):
    review_count = 0
    follower_count = 0
    if isinstance(meta, str):
        parts = meta.split(',')
        for p in parts:
            p_lower = p.lower()
            if 'review' in p_lower:
                digits = ''.join(filter(str.isdigit, p))
                review_count = int(digits) if digits else 0
            elif 'follower' in p_lower:
                digits = ''.join(filter(str.isdigit, p))
                follower_count = int(digits) if digits else 0
    return pd.Series([review_count, follower_count])

data2_encoded[['Review_Count', 'Follower_Count']] = data2_encoded['Metadata'].apply(extract_metadata)

# 8. Add Sentiment Score to data2_encoded
def get_sentiment_score(text):
    if pd.isnull(text):
        return 0.0
    return TextBlob(text).sentiment.polarity

data2_encoded['Sentiment_Score'] = data2_encoded['Review'].apply(get_sentiment_score)

agg_reviews = data2_encoded.groupby('Restaurant').agg({
    'Rating': 'mean',
    'Pictures': 'sum',
    'Review_Count': 'mean',
    'Follower_Count': 'mean',
    'Sentiment_Score': 'mean'
}).reset_index()

# 9. Rename 'Restaurant' to 'Name' for merging
agg_reviews.rename(columns={'Restaurant': 'Name'}, inplace=True)

cluster_ready_df = pd.merge(data1_encoded, agg_reviews, on='Name', how='inner')

# Check final dataframe
print(cluster_ready_df.head())
print(cluster_ready_df.columns)


## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
from sklearn.cluster import KMeans

# Assuming 'cluster_ready_df' is your prepared dataframe (all numeric features)

# Step 1: Select features (exclude non-numeric columns if any)
X = cluster_ready_df.select_dtypes(include=['number'])

# Step 2: Initialize KMeans (choose number of clusters)
kmeans = KMeans(n_clusters=2, random_state=42)

# Step 3: Fit the model and predict cluster labels
cluster_labels = kmeans.fit_predict(X)

# Step 4: Add cluster labels to dataframe
cluster_ready_df['KMeans_Cluster'] = cluster_labels

# Step 5: See cluster counts
print(cluster_ready_df['KMeans_Cluster'].value_counts())

# Optional: Inspect cluster centers
print(kmeans.cluster_centers_)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is the numeric feature matrix from the previous cell (not yet scaled at this point)
range_n_clusters = range(2, 11)
silhouette_scores = []

for n_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    score = silhouette_score(X, cluster_labels)
    silhouette_scores.append(score)

# Plotting
plt.figure(figsize=(8,5))
plt.plot(range_n_clusters, silhouette_scores, marker='o')
plt.title('Silhouette Score for different numbers of clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.xticks(range_n_clusters)
plt.grid(True)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
print(cluster_ready_df.shape)  # Should be (100, ?)
print(X.shape)                 # Same number of rows; X_scaled is only created in the next cell


In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Fit the Algorithm
# Predict on the model
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Step 1: Select numeric features from cluster_ready_df
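# Drop the label column added by the earlier KMeans run so it is not used as a feature
X = cluster_ready_df.select_dtypes(include=['number']).drop(columns=['KMeans_Cluster'], errors='ignore')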

# Step 2: Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42)
cluster_ready_df['KMeans_Cluster'] = kmeans.fit_predict(X_scaled)

# Step 4: Reduce to 2D using PCA for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 5: Plot the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_ready_df['KMeans_Cluster'], cmap='rainbow', s=40, alpha=0.7)
plt.title("KMeans Clustering Results (PCA Projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid(True)
plt.show()


##### Which hyperparameter optimization technique have you used and why?

I used a manual tuning approach with KMeans clustering: I set the number of clusters (n_clusters=2) based on prior domain knowledge or exploratory analysis rather than an automated search such as GridSearchCV or RandomizedSearchCV. I also standardized the features with StandardScaler before clustering so that no single feature dominates the distance calculation.

This approach is appropriate because KMeans clustering is unsupervised, and common hyperparameter tuning techniques like GridSearchCV are not directly applicable. Instead, domain expertise combined with visualizations helps in selecting an optimal number of clusters.
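
A small worked example may help show why scaling matters for a distance-based method such as KMeans. The sketch below z-scores two features by hand using the population standard deviation (the same convention StandardScaler follows); the three restaurants and their values are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Z-score a column using the population standard deviation (ddof=0)."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical restaurants A, B, C (illustrative values only)
cost = [300, 1500, 320]
rating = [4.5, 4.4, 2.0]

raw = list(zip(cost, rating))
scaled = list(zip(standardize(cost), standardize(rating)))

# Ratio of distance A-B to distance A-C, before and after scaling
raw_ratio = math.dist(raw[0], raw[1]) / math.dist(raw[0], raw[2])
scaled_ratio = math.dist(scaled[0], scaled[1]) / math.dist(scaled[0], scaled[2])
print(f"raw    d(A,B)/d(A,C) = {raw_ratio:.1f}")
print(f"scaled d(A,B)/d(A,C) = {scaled_ratio:.2f}")
```

On the raw values, A looks far closer to C than to B purely because cost is measured in hundreds, even though A and C have very different ratings. After scaling, the two distances are of similar size, so rating contributes as much as cost.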

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Scaling features and using KMeans improved cluster quality, reflected in a higher silhouette score. The updated silhouette score and PCA plot show clearer, well-separated clusters, enabling better customer segmentation and business insights.

### ML Model - 2

In [None]:
# ML Model - 2 Implementation

# Fit the Algorithm

# Predict on the model

from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Initialize Agglomerative Clustering (use 'metric' instead of 'affinity')
agg_clust = AgglomerativeClustering(n_clusters=2, metric='euclidean', linkage='ward')

# Fit and predict clusters
cluster_ready_df['Agglomerative_Cluster'] = agg_clust.fit_predict(X_scaled)

# Display cluster counts
print(cluster_ready_df['Agglomerative_Cluster'].value_counts())

# Plot dendrogram for visualization
linked = linkage(X_scaled, method='ward')

plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=False)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.metrics import silhouette_score
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

silhouette_scores = []
cluster_range = range(2, 10)  # Test cluster counts from 2 to 9

for n_clusters in cluster_range:
    model = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='ward')
    cluster_labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(score)

plt.figure(figsize=(8, 5))
plt.plot(cluster_range, silhouette_scores, marker='o')
plt.title('Silhouette Score for Different Numbers of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.grid(True)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# Try different K values and evaluate silhouette score
for k in range(2, 11):
    model = AgglomerativeClustering(n_clusters=k)
    labels = model.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"K={k}, Silhouette Score={score:.3f}")

##### Which hyperparameter optimization technique have you used and why?

We used manual tuning with Silhouette Score evaluation, by trying different values of k (number of clusters) in the range of 2 to 10.

Why this technique?

* For unsupervised learning, GridSearchCV and RandomizedSearchCV are not directly applicable because there is no ground truth to score against.
* Instead, we use an internal evaluation metric, the Silhouette Score, to assess the quality of clustering.
* This method allows us to empirically choose the best k (number of clusters) that gives the highest silhouette score, indicating well-separated and dense clusters.







##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes, improvement was observed.

* The optimal number of clusters is K = 3, with the highest Silhouette Score of 0.51.

* This indicates a meaningful improvement in clustering performance compared to default or arbitrary choices of k.


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

* Silhouette Score: Measures how well a restaurant fits within its cluster vs. other clusters.

* High Score Meaning: Indicates strong, well separated clusters with similar restaurant traits.

* Customer Segmentation: Helps identify distinct groups based on restaurant characteristics and customer reviews.

* Menu Optimization: Enables targeting clusters with popular food preferences.

* Operational Planning: Clusters based on timing support shift and resource optimization.

* Marketing Strategy: Facilitates location-based and preference-based promotions tailored to each cluster.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit the DBSCAN model and assign cluster labels
dbscan = DBSCAN(eps=1.5, min_samples=5)
cluster_ready_df['DBSCAN_Cluster'] = dbscan.fit_predict(X_scaled)

# Show cluster sizes (label -1 marks points DBSCAN treats as noise)
print(cluster_ready_df['DBSCAN_Cluster'].value_counts())

# evaluate the clustering
n_clusters = len(set(dbscan.labels_)) - (1 if -1 in dbscan.labels_ else 0)
if n_clusters > 1:
    score = silhouette_score(X_scaled, dbscan.labels_)
    print(f"Silhouette Score: {score:.4f}")
else:
    print("Silhouette Score not available (only one cluster found).")

# visualize using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_ready_df['DBSCAN_Cluster'], cmap='rainbow', s=40, alpha=0.7)
plt.title("DBSCAN Clustering Results (PCA Projection)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.grid(True)
plt.show()


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import numpy as np

eps_values = np.arange(0.5, 3.0, 0.2)
silhouette_scores = []

for eps in eps_values:
    db = DBSCAN(eps=eps, min_samples=5)
    labels = db.fit_predict(X_scaled)

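    # Score only when at least two real clusters exist (the noise label -1 is not counted)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    if n_found > 1:
        score = silhouette_score(X_scaled, labels)
        silhouette_scores.append(score)
    else:
        # NaN leaves a gap in the plot instead of a misleading score of -1
        silhouette_scores.append(np.nan)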

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(eps_values, silhouette_scores, marker='o', linestyle='--', color='teal')
plt.title("Silhouette Score vs Epsilon for DBSCAN")
plt.xlabel("Epsilon (eps)")
plt.ylabel("Silhouette Score")
plt.grid(True)
plt.show()


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Try different eps values and check silhouette score
for eps in [1.0, 1.5, 2.0, 2.5, 3.0]:
    model = DBSCAN(eps=eps, min_samples=5)
    labels = model.fit_predict(X_scaled)

    # Count number of clusters
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

    # Only evaluate if at least 2 clusters exist
    if n_clusters > 1:
        score = silhouette_score(X_scaled, labels)
        print(f"eps={eps}, Silhouette Score={score:.3f}, Clusters={n_clusters}")
    else:
        print(f"eps={eps}, Not enough clusters to evaluate (Clusters={n_clusters})")


##### Which hyperparameter optimization technique have you used and why?

We used manual grid search to tune the eps parameter in the DBSCAN model, as DBSCAN doesn’t require a predefined number of clusters and isn’t compatible with supervised optimization methods like GridSearchCV. We tested a range of eps values (from 1.0 to 3.0) while keeping min_samples constant at 5, and evaluated each using the silhouette score to determine clustering quality. This approach is ideal for DBSCAN due to its unsupervised nature and the limited number of hyperparameters.
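
One common heuristic for narrowing the eps range, not used in the analysis above, is to look at each point's distance to its k-th nearest neighbour and pick eps near the point where the sorted distances jump. The sketch below is a minimal, dependency-free version; the function name `k_distances`, the choice `k=2`, and the sample points are all assumptions for illustration.

```python
import math

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)

# Two tight hypothetical groups plus one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]

kd = k_distances(points, k=2)
print([round(d, 2) for d in kd])
```

Here the first eight distances are small and equal, and the last one is much larger, so an eps slightly above the small values would keep both groups dense while leaving the isolated point as noise. If the sorted distances rise smoothly with no visible jump, that would be consistent with DBSCAN struggling to find a useful eps.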

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Despite tuning the eps hyperparameter manually, DBSCAN failed to generate meaningful clusters in this case. Either all data points were treated as noise or combined into a single cluster. Therefore, no improvement was observed, and silhouette score couldn’t be evaluated. This indicates DBSCAN may not be suitable for this dataset's structure, possibly due to its high dimensionality or overlapping density regions.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

The key evaluation metric used across all three models was the Silhouette Score because it shows how well defined and distinct the clusters are. A higher score means clearer groupings, which helps the business target customer segments effectively. This metric ensures the clustering results are meaningful and actionable for better marketing and operations.
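
For readers who want to see what the metric measures, the sketch below computes the mean silhouette coefficient by hand. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the lowest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The function name `silhouette` and the sample points are hypothetical, and this is an illustration rather than the scikit-learn implementation.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # Distance to itself is 0, so divide by the number of *other* members
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette(points, [0, 1, 0, 1])   # pairs each point with a distant one
print(f"good grouping: {good:.2f}, bad grouping: {bad:.2f}")
```

A score close to 1 means points sit much nearer to their own cluster than to any other, while a negative score means points would, on average, fit better in a different cluster.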

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

I chose KMeans as the final clustering model because it provided the most balanced and interpretable clusters with a good Silhouette Score. Unlike DBSCAN, which struggled to find meaningful clusters on this dataset, and Agglomerative Clustering, which had similar but slightly less clear separation, KMeans offered consistent and stable clusters that are easier to use for business segmentation and actionable insights.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

I used KMeans clustering, which groups restaurants based on similar features by minimizing distances to cluster centers. Since it’s unsupervised, I approximated feature importance using tools like SHAP on a supervised model trained to predict clusters. Key features influencing clusters included cuisine types, timing, ratings, review counts, and sentiment scores. This helps identify what drives differences between restaurant groups for better business decisions.
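
As a simpler stand-in for the SHAP-based surrogate approach described above (and not a replacement for it), one can rank features by how far apart the cluster means are on standardized data. The sketch below is dependency-free; the function name `centroid_gaps`, the feature values, and the labels are hypothetical.

```python
from statistics import mean

def centroid_gaps(rows, labels, feature_names):
    """Rank features by the spread (max - min) of their per-cluster means."""
    gaps = {}
    for j, name in enumerate(feature_names):
        means = [
            mean(r[j] for r, lab in zip(rows, labels) if lab == c)
            for c in sorted(set(labels))
        ]
        gaps[name] = max(means) - min(means)
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical standardized features and cluster labels
features = ["Cost", "Rating", "Sentiment_Score"]
rows = [
    [-1.0, 0.2, 0.1],
    [-0.8, -0.1, 0.3],
    [1.1, 0.1, 0.2],
    [0.7, -0.2, 0.4],
]
labels = [0, 0, 1, 1]

ranking = centroid_gaps(rows, labels, features)
for name, gap in ranking:
    print(f"{name:16s} {gap:.2f}")
```

A large gap suggests the feature separates the clusters, while a gap near zero suggests it contributes little. Unlike SHAP on a surrogate classifier, this ignores feature interactions, so it is best read as a quick sanity check.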

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

In this project, we leveraged unsupervised machine learning models to perform customer and restaurant segmentation using a rich dataset containing diverse features such as cuisine types, operating timings, customer reviews, ratings, and sentiment scores. Through careful data preprocessing, including categorical encoding of cuisines and timings, and feature engineering from metadata, we prepared a comprehensive dataset suitable for clustering analysis.

We implemented three clustering algorithms (KMeans, Agglomerative Clustering, and DBSCAN) to identify natural groupings within the data without prior labeling. Among these, KMeans provided the most interpretable and actionable clusters, as validated by silhouette scores and visualized through PCA plots. Agglomerative Clustering gave complementary insights, while DBSCAN's density-based approach was limited by the data distribution and parameter sensitivity.

The clusters derived from KMeans revealed meaningful patterns in restaurant characteristics, including cuisine preferences, pricing, and customer sentiment. This segmentation enables targeted business strategies such as personalized marketing campaigns, menu optimization tailored to cluster preferences, and improved operational efficiency through better staffing and scheduling aligned with customer behavior.

From a business perspective, these clusters allow for refined customer engagement by understanding which groups prefer specific cuisines or dining times, and how sentiment varies across clusters. The insights can drive revenue growth, enhance customer satisfaction, and optimize resource allocation.

Overall, the project demonstrated the value of unsupervised learning for market segmentation in the food service industry. Future work could incorporate additional data sources, refine hyperparameter tuning, and apply advanced explainability techniques to further improve model interpretability and business impact.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***