# Restaurant Cuisine Classification Project

## Objective:
The goal of this project is to develop a machine learning model that predicts a restaurant's cuisines from its other attributes. The dataset contains features such as restaurant location, average cost, and ratings. The target variable, "Cuisines", is multi-label: a single restaurant can belong to several cuisines at once.

## Steps:
1. **Preprocess the data**: Handle missing values, encode categorical variables, and process the multi-label "Cuisines".
2. **Split the data**: Divide the dataset into training and testing sets (this notebook uses 10-fold cross-validation for that purpose).
3. **Train a model**: Choose a classification algorithm (Random Forest, Logistic Regression, etc.) and train it on the data.
4. **Evaluate the model**: Use classification metrics like accuracy, precision, recall, and F1-score to evaluate the model's performance.
5. **Analyze performance**: Identify challenges or biases in predicting certain cuisines.

## Approach:
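Since "Cuisines" is a multi-label target, the models are trained with multi-label wrappers from scikit-learn: a base classifier such as Logistic Regression, Random Forest, or XGBoost is wrapped in `OneVsRestClassifier` or `MultiOutputClassifier`, each of which fits one binary classifier per cuisine label.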
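As a minimal sketch of what such a wrapper does, the toy example below fits `OneVsRestClassifier` on a small binary indicator matrix. The arrays `X_toy` and `Y_toy` are made up for illustration and are not taken from the restaurant dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (hypothetical): 6 samples, 2 features, 3 possible labels.
X_toy = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.9, 1.0],
    [1.0, 0.9], [0.0, 1.0], [1.0, 0.0],
])
# Each row is a binary indicator vector; a row may contain several 1s.
Y_toy = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
])

# The wrapper fits one independent binary classifier per label column.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_toy, Y_toy)

print(len(ovr.estimators_))      # number of fitted per-label estimators
print(ovr.predict(X_toy).shape)  # indicator matrix with one column per label
```

Because every label gets its own classifier, correlations between labels (for example cuisines that often appear together) are not modeled directly.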

## Import Required Packages

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

## Load Dataset

In [44]:
data = pd.read_csv("../data/processed/data.csv")
data.head()

Unnamed: 0,Restaurant Name,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,Average Cost for two,Currency,Has Table booking,Has Online delivery,Is delivering now,Price range,Aggregate rating,Votes,Country Name
0,Le Petit Souffle,73,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",1100,0,1,0,0,3,4.8,314,5
1,Izakaya Kikufuji,73,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,1200,0,1,0,0,3,4.5,591,5
2,Heat - Edsa Shangri-La,75,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",4000,0,1,0,0,4,4.4,270,5
3,Ooma,75,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",1500,0,0,0,0,4,4.9,365,5
4,Sambo Kojin,75,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",1500,0,1,0,0,4,4.8,229,5


## Extra Preprocessing

Drop irrelevant features

In [45]:
data = data.drop(columns=['Restaurant Name', 'Address', 'Locality', 'Locality Verbose'])
data.head()

Unnamed: 0,City,Longitude,Latitude,Cuisines,Average Cost for two,Currency,Has Table booking,Has Online delivery,Is delivering now,Price range,Aggregate rating,Votes,Country Name
0,73,121.027535,14.565443,"French, Japanese, Desserts",1100,0,1,0,0,3,4.8,314,5
1,73,121.014101,14.553708,Japanese,1200,0,1,0,0,3,4.5,591,5
2,75,121.056831,14.581404,"Seafood, Asian, Filipino, Indian",4000,0,1,0,0,4,4.4,270,5
3,75,121.056475,14.585318,"Japanese, Sushi",1500,0,0,0,0,4,4.9,365,5
4,75,121.057508,14.58445,"Japanese, Korean",1500,0,1,0,0,4,4.8,229,5
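The `Unnamed: 0` column in the output above looks like a row index that was written to the CSV and read back as a regular column; this is an assumption based on its name and values. If so, it carries no information about the restaurant and would still be passed to the models as a feature. A small sketch of how such a column arises and how `index_col=0` avoids it, using a made-up two-row frame:

```python
import io
import pandas as pd

# Hypothetical frame standing in for the processed data.
df = pd.DataFrame({"City": [73, 75], "Votes": [314, 270]})

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back without index_col turns the index into a feature column.
buffer.seek(0)
with_index_column = pd.read_csv(buffer)

# Reading it back with index_col=0 restores it as the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(with_index_column.columns))
print(list(clean.columns))
```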


Multi-label encoding

In [46]:
encoder = MultiLabelBinarizer()
y = encoder.fit_transform(data['Cuisines'])
y.shape

(9542, 52)
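Note that `MultiLabelBinarizer` expects each sample to be an iterable of labels. When it receives a raw string such as `"French, Japanese, Desserts"`, it iterates over the string character by character, so the resulting columns are letters, spaces, and commas rather than cuisine names. The 52 columns reported above would be consistent with that, although this is an inference from the shape alone. A minimal sketch of splitting the strings first, using a made-up three-row sample in the same format:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample mirroring the format of the "Cuisines" column.
cuisines = pd.Series(["French, Japanese, Desserts", "Japanese", "Japanese, Sushi"])

# Raw strings: every character becomes its own "label".
char_encoder = MultiLabelBinarizer()
char_encoder.fit(cuisines)

# Split on commas and strip whitespace so each row is a list of cuisine names.
cuisine_lists = cuisines.str.split(",").apply(
    lambda names: [name.strip() for name in names]
)
name_encoder = MultiLabelBinarizer()
y_names = name_encoder.fit_transform(cuisine_lists)

print(list(char_encoder.classes_))  # single characters
print(list(name_encoder.classes_))  # cuisine names
print(y_names.shape)
```

On the full dataset this would change the number of label columns, so the cross-validation scores reported later would need to be recomputed.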

In [47]:
X = data.drop(columns=['Cuisines'])
X.shape

(9542, 12)

Data scaling

In [48]:
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X.shape

(9542, 12)
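Fitting the scaler on the full dataset before cross-validation means the minimum and maximum used for scaling are computed from rows that later serve as validation data. One common alternative, sketched here on synthetic stand-in data (`X_demo` and `Y_demo` are hypothetical), is to place the scaler inside a pipeline so it is refit on the training portion of each fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical): 60 samples, 4 features, 3 labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
Y_demo = np.column_stack([
    (X_demo[:, 0] > 0).astype(int),
    (X_demo[:, 1] > 0).astype(int),
    (X_demo[:, 2] + X_demo[:, 3] > 0).astype(int),
])

# The scaler is fit only on each fold's training rows, then applied to its validation rows.
pipe = make_pipeline(MinMaxScaler(), OneVsRestClassifier(LogisticRegression()))
scores = cross_val_score(pipe, X_demo, Y_demo, cv=5, scoring="f1_micro")

print(scores.shape)
```

With min-max scaling the effect on the scores is often small, but the pipeline form keeps the evaluation free of information from the held-out fold.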

## Training Models

Logistic regression

In [49]:
# Initialize the Logistic Regression model
lr = LogisticRegression()

# Wrap it with OneVsRestClassifier for multi-label classification
lr_model = OneVsRestClassifier(lr)

# Perform 10-fold cross-validation, scored with micro-averaged F1
cv_scores = cross_val_score(lr_model, X, y, cv=10, scoring='f1_micro')

# Print the cross-validation scores
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean():.4f}")

Cross-validation scores: [0.62912391 0.7334589  0.73204207 0.72778758 0.7415092  0.73495879
 0.74316439 0.72187447 0.72543826 0.69240648]
Mean score: 0.7182


XGBoost

In [50]:
# Initialize the XGBoost model; each per-label problem is binary, so evaluate with binary log loss
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Use MultiOutputClassifier to handle multi-label classification
xgb_multi_model = MultiOutputClassifier(xgb_model)

# Perform 10-fold cross-validation, scored with micro-averaged F1
cv_scores = cross_val_score(xgb_multi_model, X, y, cv=10, scoring='f1_micro')

# Print the cross-validation scores
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean():.4f}")

Cross-validation scores: [0.59108328 0.71942682 0.72036297 0.71744285 0.73751698 0.72276622
 0.7412605  0.70491075 0.72108959 0.6740632 ]
Mean score: 0.7050


Random Forest

In [51]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Use MultiOutputClassifier to handle multi-label classification
rf_multi_model = MultiOutputClassifier(rf_model)

# Perform 10-fold cross-validation, scored with micro-averaged F1
cv_scores = cross_val_score(rf_multi_model, X, y, cv=10, scoring='f1_micro')

# Print the cross-validation scores
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean():.4f}")

Cross-validation scores: [0.61459462 0.71529392 0.7309059  0.71962856 0.7390057  0.7202714
 0.74175054 0.70602356 0.71402676 0.68501476]
Mean score: 0.7087
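The micro-averaged F1 used above pools all labels together, so it can hide weak performance on rare cuisines. For the "Analyze performance" step, one possible approach is to collect out-of-fold predictions and compute F1 per label. The sketch below uses synthetic data with one deliberately rare label; `X_demo`, `Y_demo`, and `label_names` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data (hypothetical): 200 samples, 5 features, 3 labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
label_names = ["common_a", "common_b", "rare_c"]
Y_demo = np.column_stack([
    (X_demo[:, 0] > 0).astype(int),
    (X_demo[:, 1] > 0).astype(int),
    (X_demo[:, 2] > 1.5).astype(int),  # deliberately rare label
])

model = MultiOutputClassifier(RandomForestClassifier(n_estimators=50, random_state=42))

# Out-of-fold predictions: each row is predicted by a model that did not train on it.
Y_pred = cross_val_predict(model, X_demo, Y_demo, cv=5)

# average=None returns one F1 score per label column.
per_label_f1 = f1_score(Y_demo, Y_pred, average=None, zero_division=0)

# List labels from weakest to strongest.
for name, score in sorted(zip(label_names, per_label_f1), key=lambda pair: pair[1]):
    print(f"{name}: {score:.3f}")
```

Applied to the real data, the label names would come from `encoder.classes_`, and labels with low F1 would point to the cuisines the model struggles with.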
