# Predict Restaurant Ratings
In this notebook, we **predict restaurant ratings** from restaurant attributes.
* We use a dataset of restaurants with their attributes and ratings, split into `training`, `validation` & `testing` sets. The rated restaurants provide the training and validation data (through cross-validation folds), and the unrated restaurants form the testing set.
* We use a simple `linear regression` model as a `baseline` to predict the ratings.
* We also try more complex models, namely `Random Forest`, `XGBoost` & `LightGBM` regressors, to see if they perform better.
* We use `cross-validation` to evaluate the models with metrics such as `MSE` & `R-Squared`.
* We use `Grid-Search` to find the best parameters for each model.
* Finally, we predict the ratings of the unrated restaurants in the dataset using the best model.

## Import Packages

In [125]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import warnings
warnings.filterwarnings("ignore")

## Load Processed Dataset

In [126]:
data = pd.read_csv('../data/processed/data.csv')
data.head()

| | Restaurant Name | City | Address | Locality | Locality Verbose | Longitude | Latitude | Cuisines | Average Cost for two | Currency | Has Table booking | Has Online delivery | Is delivering now | Price range | Aggregate rating | Votes | Country Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Le Petit Souffle | 73 | Third Floor, Century City Mall, Kalayaan Avenu... | Century City Mall, Poblacion, Makati City | Century City Mall, Poblacion, Makati City, Mak... | 121.027535 | 14.565443 | French, Japanese, Desserts | 1100 | 0 | 1 | 0 | 0 | 3 | 4.8 | 314 | 5 |
| 1 | Izakaya Kikufuji | 73 | Little Tokyo, 2277 Chino Roces Avenue, Legaspi... | Little Tokyo, Legaspi Village, Makati City | Little Tokyo, Legaspi Village, Makati City, Ma... | 121.014101 | 14.553708 | Japanese | 1200 | 0 | 1 | 0 | 0 | 3 | 4.5 | 591 | 5 |
| 2 | Heat - Edsa Shangri-La | 75 | Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal... | Edsa Shangri-La, Ortigas, Mandaluyong City | Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... | 121.056831 | 14.581404 | Seafood, Asian, Filipino, Indian | 4000 | 0 | 1 | 0 | 0 | 4 | 4.4 | 270 | 5 |
| 3 | Ooma | 75 | Third Floor, Mega Fashion Hall, SM Megamall, O... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.056475 | 14.585318 | Japanese, Sushi | 1500 | 0 | 0 | 0 | 0 | 4 | 4.9 | 365 | 5 |
| 4 | Sambo Kojin | 75 | Third Floor, Mega Atrium, SM Megamall, Ortigas... | SM Megamall, Ortigas, Mandaluyong City | SM Megamall, Ortigas, Mandaluyong City, Mandal... | 121.057508 | 14.58445 | Japanese, Korean | 1500 | 0 | 1 | 0 | 0 | 4 | 4.8 | 229 | 5 |


## Extra Data Preprocessing

Drop unnecessary categorical variables

In [127]:
data = data.drop(columns=['Restaurant Name', 'Address', 'Locality', 'Locality Verbose', 'Cuisines'])
data.head()

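| | City | Longitude | Latitude | Average Cost for two | Currency | Has Table booking | Has Online delivery | Is delivering now | Price range | Aggregate rating | Votes | Country Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73 | 121.027535 | 14.565443 | 1100 | 0 | 1 | 0 | 0 | 3 | 4.8 | 314 | 5 |
| 1 | 73 | 121.014101 | 14.553708 | 1200 | 0 | 1 | 0 | 0 | 3 | 4.5 | 591 | 5 |
| 2 | 75 | 121.056831 | 14.581404 | 4000 | 0 | 1 | 0 | 0 | 4 | 4.4 | 270 | 5 |
| 3 | 75 | 121.056475 | 14.585318 | 1500 | 0 | 0 | 0 | 0 | 4 | 4.9 | 365 | 5 |
| 4 | 75 | 121.057508 | 14.58445 | 1500 | 0 | 1 | 0 | 0 | 4 | 4.8 | 229 | 5 |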


Use the unrated restaurants (those with an `Aggregate rating` of 0) as the testing data and the rated restaurants as the training & validation data

In [128]:
testing_data = data[data['Aggregate rating'] == 0]
testing_data.shape

(2148, 12)

In [129]:
training_data = data[data['Aggregate rating'] != 0]
training_data.shape

(7394, 12)
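
The split above treats an `Aggregate rating` of 0 as "not rated yet". As a minimal sketch of the same boolean-mask split, here it is on a toy frame; the values and the `toy_*` names are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data; the values are illustrative, not taken from the real dataset.
toy = pd.DataFrame({
    "Votes": [314, 0, 591, 2],
    "Aggregate rating": [4.8, 0.0, 4.5, 0.0],
})

# A rating of 0 is treated as "unrated", mirroring the cells above.
unrated_mask = toy["Aggregate rating"] == 0

toy_unrated = toy[unrated_mask]   # rows whose rating we want to predict
toy_rated = toy[~unrated_mask]    # rows available for training and validation

print(toy_rated.shape, toy_unrated.shape)
```

Using a single mask and its negation guarantees that every row ends up in exactly one of the two subsets.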

Split the data into features & targets

In [130]:
X_train, y_train = training_data.drop(columns=['Aggregate rating']), training_data['Aggregate rating']
X_test_orig, y_test = testing_data.drop(columns=['Aggregate rating']), testing_data['Aggregate rating']

Use MinMaxScaler to scale the data

In [131]:
X_scaler = MinMaxScaler()
X_scaled = X_scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_scaled, columns=X_train.columns, index=X_train.index)
X_scaled = X_scaler.transform(X_test_orig)
X_test = pd.DataFrame(X_scaled, columns=X_test_orig.columns, index=X_test_orig.index)
print(X_train.shape, X_test.shape)

(7394, 11) (2148, 11)
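
`MinMaxScaler` rescales each feature with the minimum and maximum seen during `fit`, that is `(x - min) / (max - min)`. Because the scaler above is fitted on the training rows only, a test value outside the training range will fall outside `[0, 1]`. A small sketch with made-up numbers (the arrays below are illustrative assumptions, not notebook data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative single-feature data: training range is [100, 300].
train = np.array([[100.0], [200.0], [300.0]])
test = np.array([[150.0], [400.0]])   # 400 lies outside the training range

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=100, max=300
test_scaled = scaler.transform(test)         # reuses the training min/max

# The same transformation written out by hand.
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())   # values 0, 0.5 and 1
print(test_scaled.ravel())    # values 0.25 and 1.5
```

Fitting on the training data and only transforming the test data keeps information about the test rows out of the preprocessing step.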


In [132]:
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
y_train = pd.DataFrame(y_scaled, columns=['Aggregate rating'], index=y_train.index)
y_scaled = y_scaler.transform(y_test.values.reshape(-1, 1))
y_test = pd.DataFrame(y_scaled, columns=['Aggregate rating'], index=y_test.index)
print(y_train.shape, y_test.shape)

(7394, 1) (2148, 1)


## Baseline Model

In [133]:
lr = LinearRegression()
# 10-fold cross-validation; scikit-learn reports negative MSE, so we negate it below
scores = cross_val_score(lr, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
print("Average score:", -scores.mean())

Average score: 0.2650382914969433
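
The baseline cell reports only the cross-validated MSE. If `R-Squared` is wanted as well, one possible approach is scikit-learn's `cross_validate`, which accepts several scorers at once. The sketch below runs on synthetic data (`X_demo`, `y_demo` are assumptions for illustration); in the notebook one would pass `X_train` and `y_train` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression data, used here only to keep the sketch self-contained.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Evaluate two metrics in a single 10-fold cross-validation run.
cv_results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=10,
    scoring={"mse": "neg_mean_squared_error", "r2": "r2"},
)

mean_mse = -cv_results["test_mse"].mean()   # negate: the scorer returns -MSE
mean_r2 = cv_results["test_r2"].mean()
print(f"MSE: {mean_mse:.4f}, R^2: {mean_r2:.4f}")
```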


## Complex Models

Random Forest Regressor

In [134]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', None]
}

rf = RandomForestRegressor(random_state=42)

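# Initialize GridSearchCV (10-fold CV, scored by negative MSE)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=10, n_jobs=-1, scoring='neg_mean_squared_error')

# Fit the grid search; ravel() passes the target as the 1-D array scikit-learn expects
grid_search.fit(X_train, y_train.values.ravel())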

# Results
print("Best parameters:", grid_search.best_params_)
print("Best MSE:", (-grid_search.best_score_))

KeyboardInterrupt: 
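
The `KeyboardInterrupt` output shows that this search was stopped before it finished. Counting the model fits up front gives a rough idea of the cost of an exhaustive search. The sketch below uses scikit-learn's `ParameterGrid` on a copy of the grid above (`rf_grid` is just a local name for this example).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the Random Forest cell above.
rf_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', None],
}

n_candidates = len(ParameterGrid(rf_grid))   # 2 * 3 * 2 * 2 * 3 combinations
n_folds = 10
total_fits = n_candidates * n_folds          # one fit per candidate per fold

print(n_candidates, total_fits)   # 72 720
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training data. If that cost is too high, a smaller grid, fewer folds, or `RandomizedSearchCV` are common ways to reduce it.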

XGBoost Regressor

In [None]:
xgb = XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid_xgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

grid_xgb = GridSearchCV(xgb, param_grid_xgb, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
grid_xgb.fit(X_train, y_train.values.ravel())  # pass the target as a 1-D array

print("Best XGBoost Params:", grid_xgb.best_params_)
print("Best XGBoost Score:", -grid_xgb.best_score_)

LightGBM Regressor

In [None]:
lgb = LGBMRegressor(random_state=42)

param_grid_lgb = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [-1, 5, 10],
    'num_leaves': [31, 50],
    'subsample': [0.8, 1.0]
}

grid_lgb = GridSearchCV(lgb, param_grid_lgb, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)
grid_lgb.fit(X_train, y_train.values.ravel())  # pass the target as a 1-D array

print("Best LightGBM Params:", grid_lgb.best_params_)
print("Best LightGBM Score:", -grid_lgb.best_score_)

The Random Forest Regressor is the best model, so we will use it to make predictions on the test set.
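
One way to make this comparison explicit is to collect the best cross-validated MSE of each fitted search and pick the lowest. The helper `best_by_mse` below is hypothetical (it is not part of the notebook), and the demo uses synthetic data with two small searches so that it runs quickly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor


def best_by_mse(searches):
    """Return (name, mse) for the fitted search with the lowest CV MSE.

    Assumes every search was scored with 'neg_mean_squared_error'.
    """
    mse_by_name = {name: -s.best_score_ for name, s in searches.items()}
    best_name = min(mse_by_name, key=mse_by_name.get)
    return best_name, mse_by_name[best_name]


# Synthetic, nearly linear data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=150)

searches = {
    "linear": GridSearchCV(LinearRegression(), {"fit_intercept": [True, False]},
                           cv=5, scoring="neg_mean_squared_error"),
    "tree": GridSearchCV(DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4]},
                         cv=5, scoring="neg_mean_squared_error"),
}
for search in searches.values():
    search.fit(X_demo, y_demo)

best_name, best_mse = best_by_mse(searches)
print(best_name, round(best_mse, 4))
```

In this notebook the call would look like `best_by_mse({"rf": grid_search, "xgb": grid_xgb, "lgb": grid_lgb})`, once all three searches have finished fitting.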

## Rating Predictions

Training on the whole dataset

In [None]:
best_rf = RandomForestRegressor(**grid_search.best_params_, random_state=42)
best_rf.fit(X_train, y_train.values.ravel())  # pass the target as a 1-D array

Predicting ratings for unrated restaurants

In [None]:
y_pred = best_rf.predict(X_test)

Inverse-transform the predictions back to the original rating scale

In [None]:
y_pred = y_scaler.inverse_transform(y_pred.reshape(-1, 1))
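
`predict` returns a 1-D array, while `inverse_transform` expects a 2-D column, which is why the reshape is needed. A minimal round-trip sketch with made-up ratings (the numbers and names below are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative ratings used to fit the target scaler: min=1.8, max=4.9.
ratings = np.array([1.8, 3.0, 4.9]).reshape(-1, 1)
scaler = MinMaxScaler().fit(ratings)

# Predictions in the scaled space, 1-D like the output of predict().
scaled_pred = np.array([0.0, 0.5, 1.0])

# Reshape to a column, map back to the rating scale, then flatten again.
restored = scaler.inverse_transform(scaled_pred.reshape(-1, 1)).ravel()
print(restored)   # approximately 1.8, 3.35 and 4.9
```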

Adding predictions to the testing data

In [None]:
X_test_orig['Aggregate rating'] = y_pred.ravel()  # flatten the (n, 1) array to 1-D
X_test_orig.shape

In [None]:
X_test_orig.head()

Save predictions

In [None]:
X_test_orig.to_csv('../outputs/rating predictions/predictions.csv')
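
`to_csv` does not create missing directories, so the save fails if the output folder does not exist yet. A hedged sketch of creating the folder first is shown below; it writes to a temporary directory and uses a toy `preds` frame, both of which are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy predictions; in the notebook this would be X_test_orig.
preds = pd.DataFrame({"Votes": [0, 2], "Aggregate rating": [3.1, 2.9]})

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for '../outputs/rating predictions'.
    out_dir = Path(tmp) / "outputs" / "rating predictions"
    out_dir.mkdir(parents=True, exist_ok=True)   # create missing folders

    out_path = out_dir / "predictions.csv"
    preds.to_csv(out_path)                       # the index is written as the first column

    # Read the file back, restoring the first column as the index.
    reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded.shape)
```

Keeping the index in the file preserves the original row labels, which makes it possible to match each prediction back to its restaurant.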