# 🍽️ Zomato Cuisine Clustering using VADER & K-Means

This notebook performs unsupervised clustering of Zomato restaurant reviews using:
- **TF-IDF** vectorization of review text
- **VADER sentiment analysis** for polarity scoring
- **Ratings** as numeric input
- **K-Means clustering** to uncover patterns in customer preferences

### 🔍 Goal:
To group reviews into meaningful clusters that can be used for **restaurant/cuisine recommendation systems**.

### 🛠 Tools Used:
- `pandas`, `scikit-learn`, `scipy`, `nltk` (VADER)
- `TfidfVectorizer` for text vectorization
- `KMeans` for clustering

### ✅ Features Extracted:
- TF-IDF of `review` text  
- Sentiment scores (`compound`, `pos`, `neu`, `neg`) from VADER  
- Min-max normalized `rating` score  

### 🎯 Output:
- Cluster-labeled reviews
- A function to recommend similar reviews based on new input

---


### Preprocess Text and Rating

In [2]:
import pandas as pd

df = pd.read_csv('zomato_reviews.csv')
df.dropna(subset=['review', 'rating'], inplace=True)


### Merge TF-IDF + Ratings + Sentiment for Clustering

In [20]:
!pip install vaderSentiment





In [27]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

def vader_sentiment_scores(text):
    scores = analyzer.polarity_scores(text)
    return pd.Series([scores['compound'], scores['pos'], scores['neu'], scores['neg']])

df[['compound', 'pos', 'neu', 'neg']] = df['review'].apply(vader_sentiment_scores)


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500)
tfidf_features = tfidf.fit_transform(df['review'])
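
To make the vectorization step concrete, here is a minimal sketch on a toy corpus. The names `toy_reviews`, `toy_tfidf`, `toy_matrix`, and `new_vec` are hypothetical and exist only for illustration. The settings mirror the cell above (`max_features=500`, defaults otherwise).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews, used only for illustration
toy_reviews = [
    "great biryani and quick service",
    "biryani was cold and service was slow",
    "lovely ambience and great dessert",
]

toy_tfidf = TfidfVectorizer(max_features=500)
toy_matrix = toy_tfidf.fit_transform(toy_reviews)

# One row per review, one column per vocabulary term (capped at max_features)
print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))

# A new review is projected onto the vocabulary learned above;
# words never seen during fitting are simply ignored
new_vec = toy_tfidf.transform(["great service, terrible parking"])
print(new_vec.shape)
```

Because `transform` reuses the fitted vocabulary, a new review always has the same number of columns as the training matrix. That is what allows it to be passed to a model fitted on those features.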


### Combine TF-IDF, Rating, and Sentiment Features

In [29]:
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import hstack, csr_matrix

# Select numeric features for normalization
numeric_data = df[['rating', 'compound', 'pos', 'neu', 'neg']]
scaler = MinMaxScaler()
normalized_numeric = scaler.fit_transform(numeric_data)

# Combine all features
combined_features = hstack([tfidf_features, csr_matrix(normalized_numeric)])
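
The `recommend` function further down relies on a fitted `kmeans` model and a `cluster` column in `df`. Neither is defined in the cells shown here. The sketch below shows roughly what such a clustering step might look like, run on hypothetical toy data so that it stands alone. The `toy_*` names, the random inputs, and `n_clusters=3` are all assumptions for illustration, not values from the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for the notebook's features (illustration only):
# a small sparse "TF-IDF" block and a numeric block of [rating, compound]
rng = np.random.default_rng(0)
toy_text = csr_matrix(rng.random((20, 6)))
toy_numeric = np.column_stack([
    rng.integers(1, 6, size=20),   # ratings on a 1-5 scale
    rng.uniform(-1, 1, size=20),   # VADER-style compound scores in [-1, 1]
])

# Rescale each numeric column to [0, 1], then append it to the sparse text block
toy_scaled = MinMaxScaler().fit_transform(toy_numeric)
toy_combined = hstack([toy_text, csr_matrix(toy_scaled)]).tocsr()

# n_clusters=3 is an arbitrary illustrative choice, not a value from the notebook
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy_labels = toy_kmeans.fit_predict(toy_combined)

print(toy_combined.shape)  # 6 text columns + 2 numeric columns
print(toy_labels)
```

In the notebook itself, the analogous step would presumably fit a `KMeans` instance named `kmeans` on `combined_features` and store its labels in `df['cluster']`, since those are the names `recommend` expects. The number of clusters would still need to be chosen for the real data.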


In [31]:
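def recommend(review_text, rating, n=5):
    """Return up to `n` reviews sampled from the cluster predicted for the input."""
    # Vectorize review with the already-fitted TF-IDF vocabulary
    tfidf_vec = tfidf.transform([review_text])

    # Get VADER sentiment scores
    scores = analyzer.polarity_scores(review_text)

    # Normalize numeric inputs (same column order used to fit the scaler)
    numeric_input = scaler.transform([[rating, scores['compound'], scores['pos'], scores['neu'], scores['neg']]])

    # Combine text and numeric features
    input_vec = hstack([tfidf_vec, csr_matrix(numeric_input)])

    # Predict cluster
    cluster_id = kmeans.predict(input_vec)[0]

    # Randomly sample up to n reviews from the same cluster
    # (capped so that clusters with fewer than n reviews do not raise an error)
    same_cluster = df[df['cluster'] == cluster_id]
    return same_cluster.sample(min(n, len(same_cluster)))[['review', 'rating']]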
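
Sampling at random from the predicted cluster returns reviews that share a cluster with the input, but they are not necessarily the closest matches. One possible refinement, which is not part of the notebook as written, is to rank the reviews in the cluster by cosine similarity to the input vector. The sketch below uses a hypothetical helper `top_similar` and made-up toy vectors to show the idea.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

def top_similar(query_vec, candidate_matrix, k=5):
    """Return row positions of the k candidates closest to query_vec by cosine similarity."""
    sims = cosine_similarity(query_vec, candidate_matrix).ravel()
    k = min(k, sims.shape[0])
    # argsort is ascending, so reverse it and keep the first k positions
    return np.argsort(sims)[::-1][:k]

# Hypothetical feature rows for four reviews that share one cluster
toy_cluster = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]))
toy_query = csr_matrix(np.array([[1.0, 0.05, 0.0]]))

print(top_similar(toy_query, toy_cluster, k=2))
```

In the notebook's setting, the candidate matrix would be the rows of the combined feature matrix belonging to the predicted cluster. The returned positions would then index into that subset of `df`.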


In [32]:
recommend("The biryani was flavorful and the service was excellent!", 4.5)




| index | review | rating |
|---|---|---|
| 210 | I don’t usually write reviews but I was compel... | 5 |
| 4711 | Quality food, Best taste, attractive price, fa... | 5 |
| 4089 | I highly like all food ☺️ tasty or delicious..... | 5 |
| 3616 | I ordered a falooda but a got a soup.<br/>i.d ... | 5 |
| 2956 | if it was packed in square container the paple... | 5 |
