<a href="https://colab.research.google.com/github/sandeeps02/Zomato-Restaurant-Clustering-And-Sentiment-Analysis/blob/main/Zomato_Restaurant_Clustering_And_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Zomato Restaurant Clustering and Sentiment Analysis



##### **Project Type**    - Unsupervised
##### **Contribution**    - Individual


# **Project Summary -**

The Zomato Restaurant Clustering and Sentiment Analysis project focuses on analyzing Zomato's restaurant data from various cities in India. The goal is to cluster restaurants into different segments based on their attributes and analyze customer sentiments through reviews. By using visualizations, we aim to provide valuable insights for customers to find the best restaurants in their area and offer the company recommendations for improvements. The analysis will include cost vs. benefit assessments using cuisine and pricing data, with the ultimate aim of helping both customers and the company make informed decisions in the restaurant industry.

# **GitHub Link -**

https://github.com/sandeeps02/Zomato-Restaurant-Clustering-And-Sentiment-Analysis

# **Problem Statement**


Sentiment Analysis: Perform sentiment analysis on user reviews to understand customer sentiments towards various restaurants. The analysis should classify reviews as positive, negative, or neutral, providing insights into customer satisfaction and identifying areas for improvement.

Restaurant Clustering: Apply clustering algorithms to group restaurants into distinct segments based on their attributes. The clustering should reveal patterns and similarities among restaurants, allowing for better recommendations to customers and strategic decisions for the company.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following questions are answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Display all the columns in the dataframe
pd.set_option("display.max_columns", None)

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset1
review= pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Zomato dataset/Zomato Restaurant reviews.csv")
# Load Dataset2
restrodata=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Zomato dataset/Zomato Restaurant names and Metadata.csv")

### Dataset First View

In [None]:
# Dataset1 & 2 First Look
review.head()

In [None]:
restrodata.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
review.shape,restrodata.shape

### Dataset Information

In [None]:
# Dataset Info
review.info()

In [None]:
restrodata.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
review.duplicated().sum()

In [None]:
review.loc[review.duplicated()]

In [None]:
restrodata.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
review.isnull().sum()

In [None]:
restrodata.isnull().sum()

In [None]:
# Visualizing the missing values
review_missing_percent=review.isnull().sum()/ len(review)*100
restrodata_null_percent= restrodata.isnull().sum()/ len(restrodata)* 100

In [None]:
sns.barplot(x=review_missing_percent.index, y=review_missing_percent)

In [None]:
sns.barplot(x=restrodata_null_percent.index, y=restrodata_null_percent)

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
review.columns

In [None]:
restrodata.columns

In [None]:
# Dataset Describe
review.describe(include="all").T

In [None]:
restrodata.describe().T

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# For Restaurant data
for col in restrodata:
    if restrodata[col].dtype==object:
        print(restrodata[col].value_counts())

In [None]:
for col in review:
    if review[col].dtype==object:
        print(review[col].value_counts())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# For Restaurant data: strip thousands separators from Cost and convert to int

restrodata["Cost"]=restrodata["Cost"].str.replace("," , "").astype(int)

In [None]:
restrodata.Cost.unique()

In [None]:
# For Review
## Split the Metadata column into separate review-count and follower-count columns
review["Review_count"] = review["Metadata"].str.split(",").str[0]
review["Review_count"] = review["Review_count"].str.split(" ").str[0]
review["Followers"] = review["Metadata"].str.split(",").str[1]
review["Followers"] = review["Followers"].str.split(" ").str[1]

In [None]:
# The Rating column contains a non-numeric "Like" value; replace it with the median of the numeric ratings
review['Rating'] = review['Rating'].replace('Like', review[review['Rating'] != 'Like']['Rating'].astype(float).median())
review['Rating'] = pd.to_numeric(review['Rating'], errors='coerce')
review["Review_count"]=pd.to_numeric(review["Review_count"],errors="coerce")

In [None]:
# Extracting year, month, and hour from the Time column
# For Review data
review['Time'] = pd.to_datetime(review['Time'])
review['Review_Year'] = review['Time'].dt.year
review['Review_Month'] = review['Time'].dt.month
review['Review_Hour'] = review['Time'].dt.hour
review.head(2)

In [None]:
restrodata.head(2)

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Restaurant Data

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10,6))
top10_restro=restrodata.groupby("Name")["Cost"].max().nlargest(10)
sns.barplot(y=top10_restro.index, x=top10_restro.values)
plt.ylabel('Restaurant Name')
plt.xlabel('Cost (Per Person)')
plt.title('Top 10 Restaurants based on Cost')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I used a bar plot because it makes it easy to compare per-person cost across the most expensive restaurants.

##### 2. What is/are the insight(s) found from the chart?

After analyzing the graph, it becomes evident that Hyatt Hyderabad Gachibowli, Sheraton Hyderabad Hotel, and 10 Downing Street are the top three costliest restaurants based on their per-person cost.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10, 6))
top5_economy_restro = restrodata.groupby("Name")["Cost"].min().nsmallest(5)
sns.barplot(x=top5_economy_restro.values, y=top5_economy_restro.index)
plt.xlabel('Cost (Per Person)')
plt.ylabel('Restaurant Name')
plt.title('Top 5 Economy Restaurant')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Using the bar plot visualization, I have investigated and identified the most budget-friendly restaurants, providing valuable insights into the economical culinary options available for diners.

##### 2. What is/are the insight(s) found from the chart?

Based on the data represented in the graph, it is evident that Amul, Mohammedia Shawarma, and Asian Meal Box are the most affordable restaurants, offering budget-friendly dining options to customers.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Split the cuisines and store them in a list
cuisine_list = restrodata['Cuisines'].str.split(', ').explode()

cuisine_data = cuisine_list.value_counts().reset_index()
cuisine_data.columns = ['Cuisine', 'Number of Restaurants']

# Select the top 10 cuisines based on occurrence
top10cuisine = cuisine_data.nlargest(10, 'Number of Restaurants')

plt.figure(figsize=(10, 6))
sns.barplot(x='Number of Restaurants', y='Cuisine', data=top10cuisine)
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.title('Top 10 Cuisines by Occurrence')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

By utilizing the barplot visualization technique, I am able to present a clear and insightful view of the most sought-after cuisines, highlighting the culinary preferences that are in high demand among consumers.

##### 2. What is/are the insight(s) found from the chart?

Based on the graph, North Indian, Chinese, and Continental are the cuisines offered by the most restaurants. The chart measures availability: it suggests, but does not directly show, customer demand.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
from wordcloud import WordCloud

# Build a frequency table so each cuisine is sized by how many restaurants offer it
cuisine_freq = dict(zip(cuisine_data['Cuisine'], cuisine_data['Number of Restaurants']))

# Create the word cloud from the frequencies (multi-word names such as "North Indian" stay intact)
word_cloud = WordCloud(width=800, height=400, background_color='white', colormap='viridis').generate_from_frequencies(cuisine_freq)

# Display the generated Word Cloud
plt.figure(figsize=(10, 5))
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.title("Word Cloud for Cuisines")
plt.show()

##### 1. Why did you pick the specific chart?

Through the utilization of a word cloud, I was able to visually identify the most prevalent cuisines, showcasing the dominant culinary choices that are abundantly available across various restaurants.

##### 2. What is/are the insight(s) found from the chart?

The word cloud visualization prominently displays North Indian, Chinese, and Continental cuisines as the most prevalent and frequently offered options among various restaurants. Their larger appearance in the word cloud indicates their higher representation in the data.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Replace missing 'Collections' values with empty strings so they can be joined as text
restrodata['Collections'] = restrodata['Collections'].fillna('').astype(str)

# Storing all collections in the form of text
text = " ".join(name for name in restrodata.Collections)

# Creating the word cloud with text as an argument in .generate() method
word_cloud = WordCloud(width=800, height=400, background_color='white', colormap='viridis').generate(text)

# Display the generated Word Cloud
plt.figure(figsize=(10, 6))
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

##### 1. Why did you pick the specific chart?

By employing the word cloud visualization, I have effectively captured and represented the most frequently used tags, providing a visually striking depiction of the prevalent themes and topics that are widely utilized in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Based on the word cloud, "Hyderabad," "food hygiene," and "rated restaurants" are the most commonly used tags among the restaurants. Their larger appearance in the word cloud reflects how often they occur in the dataset.

### Restaurant Reviews

#### Chart - 6

In [None]:
# Chart - 6 visualization code
top10_rated = review.groupby("Restaurant")["Rating"].max().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top10_rated.values, y=top10_rated.index, palette='viridis')
plt.xlabel('Rating')
plt.ylabel('Restaurant')
plt.title('Top 10 Restaurants based on Maximum Ratings')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

By employing the barplot visualization, I have successfully identified and displayed the top-rated restaurants, offering a clear and concise view of the dining establishments that have received the highest accolades and positive feedback from customers.

##### 2. What is/are the insight(s) found from the chart?

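Ranked by maximum rating, 10 Downing Street, 13 Dhaba, and Barbeque Nation appear at the top. Because the chart uses each restaurant's single highest rating, any restaurant with at least one top-score review reaches the maximum, so this ranking does not reflect the typical customer experience.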
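
A ranking based on the mean rating, restricted to restaurants with enough reviews, is usually more robust than one based on the maximum. The sketch below illustrates the idea on made-up data; the `MIN_REVIEWS` threshold and the toy values are assumptions for illustration, not figures from the Zomato dataset.

```python
import pandas as pd

# Toy review data (made-up values, not taken from the Zomato dataset)
reviews = pd.DataFrame({
    "Restaurant": ["A", "A", "A", "B", "B", "C"],
    "Rating": [5.0, 1.0, 3.0, 4.0, 4.5, 5.0],
})

MIN_REVIEWS = 2  # hypothetical threshold: ignore restaurants with fewer reviews

# Per-restaurant mean, max, and number of ratings
stats = reviews.groupby("Restaurant")["Rating"].agg(["mean", "max", "count"])

# Keep restaurants with enough reviews and rank them by mean rating
ranked = stats[stats["count"] >= MIN_REVIEWS].sort_values("mean", ascending=False)
print(ranked)
```

Here restaurant A reaches the maximum rating of 5 but ranks below B on the mean, and C is excluded because a single review is too little evidence.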

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Compute each restaurant's minimum rating
min_ratings = review.groupby("Restaurant")["Rating"].min()

# Select the 5 restaurants with the lowest minimum ratings
top5_least_rated = min_ratings.nsmallest(5)

plt.figure(figsize=(10, 6))
sns.barplot(x=top5_least_rated.values, y=top5_least_rated.index, palette='viridis')
plt.xlabel('Rating')
plt.ylabel('Restaurant')
plt.title('Top 5 Restaurants based on the Lowest Ratings')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Using the bar plot visualization, I have investigated the least rated restaurants.

##### 2. What is/are the insight(s) found from the chart?

Based on the bar plot, 10 Downing Street, Prism Club Kitchen, and Pourhouse7 appear among the restaurants with the lowest minimum rating, meaning each received at least one very low score. Since the chart uses only each restaurant's single lowest rating (10 Downing Street also appears among the top maximum ratings in Chart 6), it flags restaurants with some dissatisfied customers rather than consistently poor ones. These low ratings still point to potential areas for improvement.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(8, 6))
sns.histplot(data=review, x='Rating', bins=10, kde=True)
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Histogram of Ratings')
plt.show()

##### 1. Why did you pick the specific chart?


By utilizing the histogram visualization, I am able to identify the most frequently assigned rating scores, offering valuable insights into the preferred or common rating categories given by customers to various restaurants.

##### 2. What is/are the insight(s) found from the chart?

As per the histogram, 5 is the most common rating in the dataset. This suggests that a large share of diners had highly satisfactory experiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9 - Correlation Heatmap

In [None]:
f, ax = plt.subplots(figsize = (15, 5))
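# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(review.select_dtypes(include="number").corr(), ax=ax, annot=True, cmap='icefire', linewidths=1)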
plt.show()

Answer Here

## ***5. Feature Engineering & Data Pre-processing***

In [None]:
review=review.drop_duplicates()

### 1. Handling Missing Values

In [None]:
# Rename the "Name" column to "Restaurant" so both datasets share a merge key
restrodata=restrodata.rename(columns={"Name":"Restaurant"})

In [None]:
# Merging review and restrodata
merge_data=restrodata.merge(review, on="Restaurant")

In [None]:
#Checking Null Values
merge_data.isnull().sum()

In [None]:
# Checking Rows and column
merge_data.shape

In [None]:
# Checking Null values in Timings
merge_data.loc[merge_data["Timings"].isnull()]

In [None]:
# Fill missing Timings with the most frequent value (mode)
merge_data["Timings"] = merge_data["Timings"].fillna(merge_data["Timings"].mode()[0])

In [None]:
# Dropping Nan in Columns
merge_data=merge_data.dropna(subset=["Review", "Review_count"])

In [None]:
# Fill any remaining null reviews with a placeholder (a no-op after the dropna above)
merge_data= merge_data.fillna({"Review": "No Review"})

In [None]:
# Checking Null
merge_data.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Checking for Outliers
merge_data.describe()

In [None]:
merge_data['Followers'] = pd.to_numeric(merge_data['Followers'], errors='coerce')


In [None]:
# Handling Outliers & Outlier treatments
# Defining a function that returns the IQR bounds and the percentage of outliers
def calculate_outlier(df, column):
    Q3 = df[column].quantile(0.75)
    Q1 = df[column].quantile(0.25)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] > upper) | (df[column] < lower)]
    percent_outliers = round((outliers.shape[0] / df.shape[0]) * 100, 2)
    return lower, upper, percent_outliers

In [None]:
lower_cost, upper_cost, percentage_cost_outliers=calculate_outlier(merge_data, "Cost")
print("lower band",(lower_cost))
print("upper band",(upper_cost))
print("outlier percent",(percentage_cost_outliers))

In [None]:
merge_data.loc[merge_data["Cost"]> upper_cost, "Cost" ]=2250

In [None]:
lower_count, upper_count, followers_percentage_outliers=calculate_outlier(merge_data, "Followers")
print("lower band",(lower_count))
print("upper band",(upper_count))
print("outlier percent",(followers_percentage_outliers))

In [None]:
merge_data.loc[merge_data["Followers"]> upper_count, "Followers" ]=227

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
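
The outlier treatment above caps values that exceed the upper IQR bound. A minimal sketch of the same idea follows, using the computed bounds directly instead of typed-in constants. The helper name `cap_outliers_iqr` and the toy cost values are assumptions for illustration.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy cost values (made up); 5000 is an obvious outlier
cost = pd.Series([400.0, 500.0, 600.0, 700.0, 800.0, 5000.0])
capped = cap_outliers_iqr(cost)
print(capped.tolist())
```

`clip` returns a new Series, so the original data stays available for comparison.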

### 3. Textual Data Preprocessing


### For Review

#### 1. Expand Contraction

In [None]:
merge_data.Review.head(8)

In [None]:
pip install contractions

In [None]:
# Expand Contraction
import contractions

def expand_contractions(text):
    # Using the contractions library to expand contractions in the text
    expanded_text = contractions.fix(text)
    return expanded_text

In [None]:
merge_data["Review"] = merge_data["Review"].apply(expand_contractions)

#### 2. Lower Casing

In [None]:
# Lower Casing
def to_lower(text):
    lower_text=text.lower()
    return lower_text

In [None]:
merge_data["Review"] = merge_data["Review"].apply(to_lower)

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import re
import unicodedata

def remove_punc(text):
    # Decompose characters (NFKD) and drop anything that is not ASCII, which removes accents
    normalized_text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('utf-8')

    # Remove punctuation characters from the text, except for alphabets and numbers
    punc_text = re.sub('[^a-zA-Z0-9]', ' ', normalized_text)

    return punc_text

In [None]:
merge_data["Review"] = merge_data["Review"].apply(remove_punc)
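
To make the effect of this step concrete, the sketch below applies the same NFKD-plus-ASCII approach to a made-up string. The function name `strip_accents_and_punct` is hypothetical. Note that characters with no ASCII decomposition, such as emoji or non-Latin scripts, are dropped entirely.

```python
import re
import unicodedata

def strip_accents_and_punct(text: str) -> str:
    # NFKD splits accented letters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors ignored drops the combining marks
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every remaining non-alphanumeric character with a space
    return re.sub(r"[^a-zA-Z0-9]", " ", ascii_text)

cleaned = strip_accents_and_punct("Café crème, 5/5!")
print(cleaned.split())
```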

#### 4. Removing URLs & Removing words and digits contain digits.

In [None]:
# Remove URLs (note: this function does not remove words containing digits)
def remove_urls(text):
    # Convert the input to a string if it's not already
    text = str(text)

    # Remove tokens starting with "http" or "www". Because punctuation was stripped
    # in the previous step, URLs are already split into fragments at this point.
    url_pattern = r'http\S+|www\S+'
    no_urls_text = re.sub(url_pattern, '', text)
    return no_urls_text


In [None]:
merge_data["Review"] = merge_data["Review"].apply(remove_urls)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Build the stopword set once instead of on every call
stopwords_list = set(stopwords.words('english'))

def remove_stopwords(text):
    # Split the text on whitespace (this also collapses repeated spaces)
    words = text.split()

    # Keep only the words that are not stopwords
    words = [word for word in words if word.lower() not in stopwords_list]

    # Join the remaining words back into a single string
    cleaned_text = ' '.join(words)

    return cleaned_text

In [None]:
merge_data["Review"] = merge_data["Review"].apply(remove_stopwords)

#### 6. Tokenization

In [None]:
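import nltk
nltk.download('punkt')
# Recent NLTK releases load the tokenizer from 'punkt_tab'; downloading both covers either case
nltk.download('punkt_tab')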

In [None]:
# Tokenization
def word_token(text):
    tokens=nltk.word_tokenize(text)
    return tokens

In [None]:
merge_data["Review"] = merge_data["Review"].apply(word_token)

In [None]:
merge_data.Review.head(5)

#### 7. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import PorterStemmer

def stem_words(text):
    stemmer=PorterStemmer()
    stemmed_words=[stemmer.stem(words) for words in text]
    return stemmed_words

In [None]:
merge_data["Review"] = merge_data["Review"].apply(stem_words)

## Sentiment Analysis

In [None]:
#Sentiment lexicon
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

In [None]:
# Initialize VADER SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Function to get sentiment score for each review
def get_sentiment_score(review):
    review = ' '.join(review) # Convert list of words back to a sentence
    return sia.polarity_scores(review)['compound']

# Apply the sentiment analysis function to the 'Review' column
merge_data['Vader_Sentiment'] = merge_data["Review"].apply(get_sentiment_score)

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
# Initialize SentiWordNet
from nltk.corpus import sentiwordnet as swn

nltk.download('sentiwordnet')

def get_sentiwordnet_sentiment(review):
    sentiment_score = 0
    for word in review:
        synsets = list(swn.senti_synsets(word))
        if synsets:
            sentiment_score += synsets[0].pos_score() - synsets[0].neg_score()
    return sentiment_score

merge_data["SentiWordNet_Sentiment"] = merge_data["Review"].apply(get_sentiwordnet_sentiment)

In [None]:
pip install afinn

In [None]:
# Initialize AFINN Sentiment
from afinn import Afinn

afinn = Afinn()

def get_afinn_sentiment(review):
    return afinn.score(' '.join(review))

merge_data["AFINN_Sentiment"] = merge_data["Review"].apply(get_afinn_sentiment)

In [None]:
# Initialize Bing_Liu_Sentiment Lexicon
from nltk.corpus import opinion_lexicon

nltk.download('opinion_lexicon')

# Build the word sets once instead of on every call
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def get_bing_liu_sentiment(review):
    # Score = number of positive words minus number of negative words
    sentiment_score = sum(1 for word in review if word in positive_words) - sum(1 for word in review if word in negative_words)
    return sentiment_score

merge_data["Bing_Liu_Sentiment"] = merge_data["Review"].apply(get_bing_liu_sentiment)

In [None]:
# Visualizing All Sentiment Lexicon Methods

plt.figure(figsize=(10, 6))
plt.scatter(range(len(merge_data)), merge_data['Vader_Sentiment'], color='blue', label='VADER')
plt.scatter(range(len(merge_data)), merge_data['SentiWordNet_Sentiment'], color='green', label='SentiWordNet')
plt.scatter(range(len(merge_data)), merge_data['AFINN_Sentiment'], color='orange', label='AFINN-111')
plt.scatter(range(len(merge_data)), merge_data['Bing_Liu_Sentiment'], color='red', label="Bing Liu's Opinion Lexicon")

plt.axhline(y=0, color='gray', linestyle='--')  # Add a horizontal line at sentiment score = 0 (neutral)

plt.xlabel('Review Index')
plt.ylabel('Sentiment Score')
plt.title('Sentiment Scores by Different Lexicons')
plt.legend()
plt.show()

I used a scatter plot to compare the scores produced by the four sentiment lexicons side by side.

From the scatter plot, I judged the AFINN-111 and Bing Liu lexicons to capture sentiment in these reviews most clearly. Note that the lexicons are on different scales (VADER's compound score is bounded in [-1, 1], while the other three are unbounded sums), so this is a visual impression rather than a measured accuracy.
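
One way to put a number on lexicon quality is to check how often the sign of each score agrees with the star rating. The sketch below assumes that ratings at or above a cut-off count as positive. The `RATING_THRESHOLD` value and the toy scores are made up for illustration and are not results from this dataset.

```python
import pandas as pd

# Toy ratings and lexicon scores (made-up values)
toy = pd.DataFrame({
    "Rating": [5.0, 4.0, 1.0, 2.0, 5.0],
    "Vader_Sentiment": [0.8, 0.2, -0.6, 0.1, 0.9],
    "AFINN_Sentiment": [6.0, 1.0, -4.0, -2.0, 3.0],
})

RATING_THRESHOLD = 3.5  # hypothetical cut-off between negative and positive reviews

# Treat the star rating as a proxy label for the review's sentiment
is_positive_rating = toy["Rating"] >= RATING_THRESHOLD

# Share of reviews where "score > 0" matches the rating-based label
agreement = {
    col: float(((toy[col] > 0) == is_positive_rating).mean())
    for col in ["Vader_Sentiment", "AFINN_Sentiment"]
}
print(agreement)
```

Applied to the real score columns, the same comparison would give a per-lexicon agreement rate that can be reported alongside the scatter plot.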

In [None]:
# Creating a new DataFrame for Sentiment Analysis (copied to avoid SettingWithCopyWarning)
sentimental_df=merge_data[["Restaurant", "Review", "Vader_Sentiment", "SentiWordNet_Sentiment", "AFINN_Sentiment", "Bing_Liu_Sentiment"]].copy()

In [None]:
# Define a function to map sentiment scores to labels
def sentiment_label(score):
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        return "neutral"

In [None]:
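# Create the new column "Overall_Sentiment" from the mean of the four lexicon scores
score_cols = ["Vader_Sentiment", "SentiWordNet_Sentiment", "AFINN_Sentiment", "Bing_Liu_Sentiment"]
sentimental_df["Overall_Sentiment"] = sentimental_df[score_cols].mean(axis=1).apply(sentiment_label)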
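
Because the lexicons use different scales, a plain mean can be dominated by whichever lexicon produces the largest magnitudes. A possible alternative, sketched here on made-up scores, is a majority vote over the sign of each score. This is an illustrative option, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy lexicon scores (made-up values); row 1 has one large AFINN score
scores = pd.DataFrame({
    "Vader_Sentiment": [0.9, -0.2, 0.0],
    "SentiWordNet_Sentiment": [0.5, -0.1, 0.0],
    "AFINN_Sentiment": [-1.0, 12.0, 0.0],
    "Bing_Liu_Sentiment": [2.0, -1.0, 0.0],
})

# Sign of the plain mean: sensitive to the scale of each lexicon
mean_label = np.sign(scores.mean(axis=1))

# Majority vote: each lexicon contributes -1, 0, or +1 regardless of its scale
vote_label = np.sign(np.sign(scores).sum(axis=1))

print(pd.DataFrame({"mean_label": mean_label, "vote_label": vote_label}))
```

In the second row, three of four lexicons are negative, yet the mean is positive because of the single large AFINN value; the vote labels that row negative.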

In [None]:
sentiment_counts = sentimental_df["Overall_Sentiment"].value_counts()

# Create a pie chart to visualize the sentiment distribution
plt.figure(figsize=(15, 5))
colors = ['#4F6272', '#B7C3F3', '#DD7596']
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=90, colors=colors)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("Overall Sentiment Distribution")
plt.show()

## Clustering

### Textual Data Preprocessing

In [None]:
restrodata.isnull().sum()

In [None]:
# Creating new dataset for clustering (copied to avoid SettingWithCopyWarning on later assignments)
cluster_df=restrodata[["Cost", "Cuisines"]].copy()

In [None]:
cluster_data=cluster_df.copy()
cluster_df.head()

In [None]:
#### 1. Expand Contraction
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(expand_contractions)

In [None]:
#### 2. Lower Casing
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(to_lower)

In [None]:
### Remove spaces inside multi-word cuisine names so each cuisine becomes a single token (e.g. "north indian" -> "northindian")

def remove_spaces_between_names(text):
    # Split the text by commas
    names = text.split(',')

    # Remove spaces between individual names
    cleaned_names = [name.strip().replace(' ', '') for name in names]

    # Join the cleaned names with commas
    cleaned_text = ', '.join(cleaned_names)

    return cleaned_text

In [None]:
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(remove_spaces_between_names)

In [None]:
#### 3. Removing Punctuations
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(remove_punc)

In [None]:
#### 4. Removing URLs & Removing words and digits contain digits.
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(remove_urls)

In [None]:
#### 5. Removing Stopwords
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(remove_stopwords)

In [None]:
# Tokenization
cluster_df["Cuisines"] = cluster_df["Cuisines"].apply(word_token)

In [None]:
cluster_df.Cuisines.head()

In [None]:
cluster_df['Cuisines'] = cluster_df['Cuisines'].apply(lambda cuisines: ' '.join(cuisines))

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
cluster_dff = vectorizer.fit_transform(cluster_df["Cuisines"])
cluster_dff

In [None]:
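# Give the cuisine counts the same index as cluster_df so the concat aligns row by row
cuisine_counts = pd.DataFrame(cluster_dff.toarray(), columns=vectorizer.get_feature_names_out(), index=cluster_df.index)
features = pd.concat([cluster_df['Cost'], cuisine_counts], axis=1)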
features.head()

### 7. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = features[["Cost"]].values
features["Cost"] = scaler.fit_transform(X)
features.head()

## ***6. ML Model Implementation***

### ML Model - 1

### K-Means Algorithm


In [None]:
# ML Model - 1 K-means Implementation
from sklearn.cluster import KMeans

# Number of clusters you want to create
n_clusters = 3

# Create an instance of the KMeans clustering algorithm
# (n_init is set explicitly because its default changed across scikit-learn versions)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)


# Fit the Algorithm
kmeans.fit(features)

# Get the cluster labels for each data point
features["Kmean_ClusterLabel"] = kmeans.labels_


In [None]:
features.head()

### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1
## K-Means Algorithm
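### Check the optimum value of Cluster Using Elbow Method

from sklearn.cluster import KMeans

# Exclude previously assigned cluster labels so they do not leak into the fit
X_elbow = features.drop(columns=["Kmean_ClusterLabel", "Agg_Cluster_Label"], errors="ignore")

inertias = []
for k in range(1, 11):
    # Use a separate variable so the model fitted above is not overwritten
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_elbow)
    inertias.append(km.inertia_)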

# Plot the Elbow curve
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

Here, the elbow point is at K = 3, so the value chosen for `n_clusters` is appropriate.
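
Reading an elbow by eye can be subjective, so the silhouette score is often used as a second check. The sketch below runs it on synthetic blobs rather than the restaurant features; the data, the range of K, and the variable names are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (stand-in for the real features)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# Silhouette score for each candidate K (it is undefined for K = 1)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best K by silhouette:", best_k)
```

A K that is supported by both the elbow curve and the highest silhouette score is a more defensible choice than one read from either plot alone.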

In [None]:
# Visualizing the Cluster using PCA and t-sne

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

Features1 = features.drop(columns=["Kmean_ClusterLabel"])
cluster_labels = features["Kmean_ClusterLabel"]

#Reduce the dimensionality using PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(Features1)

# Reduce the dimensionality using t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(Features1)

# Create subplots to visualize PCA and t-SNE results side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Plot PCA
axes[0].scatter(pca_result[:, 0], pca_result[:, 1], c=cluster_labels, cmap="rainbow")
axes[0].set_title("PCA Visualization")
axes[0].set_xlabel("Principal Component 1")
axes[0].set_ylabel("Principal Component 2")

# Plot t-SNE
axes[1].scatter(tsne_result[:, 0], tsne_result[:, 1], c=cluster_labels, cmap="rainbow")
axes[1].set_title("t-SNE Visualization")
axes[1].set_xlabel("t-SNE Component 1")
axes[1].set_ylabel("t-SNE Component 2")

plt.tight_layout()
plt.show()

### ML Model - 2

### Agglomerative Clustering

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

In [None]:
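# Determining cosine similarity on the feature columns only (cluster labels excluded)
cosine_sim = cosine_similarity(features.drop(columns=["Kmean_ClusterLabel", "Agg_Cluster_Label"], errors="ignore"))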

In [None]:
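# Implementing the Agglomerative Algorithm
n_clusters = 3
# 'metric' replaces the 'affinity' parameter, which was deprecated in scikit-learn 1.2 and later removed
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters, metric='precomputed', linkage='complete')
# Cosine distance = 1 - cosine similarity
agg_clustering.fit(1 - cosine_sim)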

In [None]:
# Creating Label in dataset
features["Agg_Cluster_Label"] = agg_clustering.labels_
features.head()
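
The agglomerative step clusters on cosine distance, defined as one minus cosine similarity. The NumPy sketch below computes that matrix by hand on three made-up vectors. The function name is hypothetical, and it assumes no row is all zeros, since such a row would cause a division by zero.

```python
import numpy as np

def cosine_distance_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    # Scale every row to unit length (assumes no all-zero rows)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms
    # The dot product of unit vectors is their cosine similarity
    similarity = unit @ unit.T
    return 1.0 - similarity

# Rows 0 and 1 point in the same direction; row 2 is orthogonal to both
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
D = cosine_distance_matrix(X)
print(D)
```

Rows 0 and 1 have distance 0 despite different magnitudes, which is why cosine distance groups restaurants by their mix of cuisines rather than by the size of the feature values.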

In [None]:
# Visualizing Cluster for Agglomerative Clustering Algorithm

Features2 = features.drop(columns=["Agg_Cluster_Label", "Kmean_ClusterLabel"])
cluster_labels = features["Agg_Cluster_Label"]

# Reduce the dimensionality using PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(Features2)

# Reduce the dimensionality using t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(Features2)

# Visualizing PCA and t-SNE results side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Plot PCA
axes[0].scatter(pca_result[:, 0], pca_result[:, 1], c=cluster_labels, cmap="rainbow")
axes[0].set_title("PCA Visualization")
axes[0].set_xlabel("Principal Component 1")
axes[0].set_ylabel("Principal Component 2")

# Plot t-SNE
axes[1].scatter(tsne_result[:, 0], tsne_result[:, 1], c=cluster_labels, cmap="rainbow")
axes[1].set_title("t-SNE Visualization")
axes[1].set_xlabel("t-SNE Component 1")
axes[1].set_ylabel("t-SNE Component 2")

plt.tight_layout()
plt.show()

In [None]:
# Clusters

# Group the data by the cluster labels
cluster_groups = features.groupby('Kmean_ClusterLabel')

# Iterate through each cluster and analyze the characteristics
for cluster_label, cluster_data in cluster_groups:
    print(f"Cluster {cluster_label}:")

    # Drop the unwanted columns before analyzing the cuisines
    cluster_data = cluster_data.drop(['Kmean_ClusterLabel', 'Agg_Cluster_Label'], axis=1)

    # Calculate the most frequent cuisines in the cluster
    most_frequent_cuisines = cluster_data.drop('Cost', axis=1).sum().nlargest(5)
    print("Most frequent cuisines:")
    print(most_frequent_cuisines)

    # Calculate the cost range in the cluster (Cost was MinMax-scaled earlier, so values lie in [0, 1])
    cost_range = (cluster_data['Cost'].min(), cluster_data['Cost'].max())
    print(f"Cost range: {cost_range[0]} - {cost_range[1]}")

    print("-------------------------------------")

In [None]:
# Visualizing all three clusters with word clouds

# Group the data by the cluster labels
cluster_groups = features.groupby('Kmean_ClusterLabel')

# Iterate through each cluster and create a word cloud for the most frequent cuisines
for cluster_label, cluster_data in cluster_groups:
    # Drop the unwanted columns before analyzing the cuisines
    cluster_data = cluster_data.drop(['Kmean_ClusterLabel', 'Agg_Cluster_Label'], axis=1)

    # Calculate the most frequent cuisines in the cluster
    most_frequent_cuisines = cluster_data.drop('Cost', axis=1).sum().nlargest(10)

    # Convert the most frequent cuisines into a dictionary format (word: frequency)
    cuisines_dict = most_frequent_cuisines.to_dict()

    # Create a word cloud for the cluster
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(cuisines_dict)

    plt.figure(figsize=(8, 4))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(f"Cluster {cluster_label} - Most Frequent Cuisines")
    plt.axis('off')
    plt.show()

# **Conclusion**

This project successfully accomplished the objectives of clustering restaurants based on their features and conducting sentiment analysis on user reviews. Through clustering, we gained valuable insights into the grouping of restaurants, helping both users and businesses make informed decisions. The sentiment analysis allowed us to understand the sentiments expressed by users in their reviews, providing businesses with valuable feedback to enhance their services and improve the overall user experience.

The utilization of various data preprocessing techniques, such as text vectorization and feature normalization, played a crucial role in preparing the data for clustering and sentiment analysis. We employed popular machine learning algorithms, including K-Means and Agglomerative Clustering, to create meaningful clusters of restaurants based on their similarities.

For future enhancements, more advanced clustering algorithms and sentiment analysis techniques could be explored to further refine the results. Additionally, incorporating additional features such as images and menus of the restaurants might provide more comprehensive insights.

Overall, this project demonstrates the potential of leveraging data analytics to gain valuable insights into the restaurant industry, aiding both users in making informed choices and businesses in enhancing their services to meet customer expectations.