# Name : Remilekun Ajayi

## Prediction of Restaurant Ratings Using Regression-Based Approach to Build a Predictive Model

### Problem Statement
- Restaurants meet a basic need, hunger, which makes them an important part of everyday life.
- However, the business is highly competitive, and dissatisfied customers can put a restaurant out of business.
- Understanding the crucial factors that influence restaurant ratings, and building a model around these factors, can help improve customer satisfaction.
- The aim of my project is to analyse restaurant data using the aggregate rating as my target variable, explore the key attributes involved in this evaluation and, as a data scientist, build a predictive model of restaurant ratings in order to enhance customer satisfaction and improve sales.

### Objectives of the Project
- Perform an Exploratory Data Analysis to understand ratings and ratings distribution, check for skewness and generally to understand my dataset.
- Do some feature analysis, check for correlation and check for the best regression model to use in my prediction.
- Identify essential patterns in customer preferences, types of cuisines, price range and the features involved in service provision.
- Check the impact of geolocation on my restaurant ratings.
- Build a proper regression model to predict the restaurant ratings based on my exploratory data analysis.
- Communicate my findings and insights.

### Data Understanding
- The dataset contains several groups of features:
- Restaurant Details: Name, Location and Cuisine Type.
- Customer Engagement: Votes, reviews and Aggregate rating.
- Pricing and Availability: Average Cost for two, table booking option and online delivery.
- Geolocation: Longitude and Latitude coordinates.
- The target variable is the Aggregate rating.

I will look at each column independently and at how it influences the overall outcome.

### Week One - Data Exploration
- Explore Dataset Dimensions. Check for missing values. Perform datatype conversions as needed.
- Analyze 'Aggregate Rating' distribution. Address any class imbalances.
- Calculate the statistics for numerical columns. Explore categorical variables. Identify top 5 cuisines and cities.

### Importing My Libraries

- I'll be importing my libraries: numpy and pandas to help me understand my dataset, and matplotlib and seaborn for visualisation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
URL = 'https://raw.githubusercontent.com/Oyeniran20/axia_class_cohort_7/refs/heads/main/Dataset%20.csv'

In [None]:
df = pd.read_csv(URL)

In [None]:
df.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


### Explaining My Dataset

 - There are 9551 rows and 21 columns.
 - The data was collected in different countries, so costs are recorded in different currencies.
 - The dataset contains the Longitude and Latitude of each location, the cuisines, Votes and ratings. The aggregate rating is the target variable. It is numerical data stored as a float (for example 4.8), not an integer.
 - My dataset contains Yes/No columns (for example Has Table booking and Has Online delivery). These are binary categorical features and will need to be encoded before they can be used in a regression model.
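
One way to make the Yes/No columns usable in a regression model is to map them to 1 and 0. The following is a minimal sketch on a toy frame; the `sample` data and the `yes_no_columns` list are illustrative assumptions, not the notebook's actual preprocessing.

```python
import pandas as pd

# Toy frame that mimics two of the Yes/No columns in the dataset (illustrative values).
sample = pd.DataFrame({
    "Has Table booking": ["Yes", "No", "No"],
    "Has Online delivery": ["No", "No", "Yes"],
})

# Hypothetical list of the binary columns to encode.
yes_no_columns = ["Has Table booking", "Has Online delivery"]

encoded = sample.copy()
for column in yes_no_columns:
    # Any value other than "Yes" or "No" would become NaN, which makes bad values easy to spot.
    encoded[column] = encoded[column].map({"Yes": 1, "No": 0})

print(encoded)
```

Mapping with an explicit dictionary, rather than a generic label encoder, keeps the meaning of 1 and 0 obvious when reading model coefficients later.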

### Showing the Data Info and analyzing each column.

 - Here I use the df.info() function to check the non-null count and data type of each column. It shows three float columns, 5 integer columns and 13 object columns.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9551 entries, 0 to 9550
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Restaurant ID         9551 non-null   int64  
 1   Restaurant Name       9551 non-null   object 
 2   Country Code          9551 non-null   int64  
 3   City                  9551 non-null   object 
 4   Address               9551 non-null   object 
 5   Locality              9551 non-null   object 
 6   Locality Verbose      9551 non-null   object 
 7   Longitude             9551 non-null   float64
 8   Latitude              9551 non-null   float64
 9   Cuisines              9542 non-null   object 
 10  Average Cost for two  9551 non-null   int64  
 11  Currency              9551 non-null   object 
 12  Has Table booking     9551 non-null   object 
 13  Has Online delivery   9551 non-null   object 
 14  Is delivering now     9551 non-null   object 
 15  Switch to order menu 

- I checked the value counts for my target variable, the Aggregate rating. There are 2148 zero ratings. This is significant because a zero rating is different from a null value: customers can eat at a restaurant without rating it, which may account for the zeros.

In [None]:
df['Aggregate rating'].value_counts()

Unnamed: 0_level_0,count
Aggregate rating,Unnamed: 1_level_1
0.0,2148
3.2,522
3.1,519
3.4,498
3.3,483
3.5,480
3.0,468
3.6,458
3.7,427
3.8,400
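
To check whether zero ratings really behave like "not rated", one option is to compare the vote counts of zero-rated and rated restaurants. This is a minimal sketch on made-up values; the `sample` frame is an assumption for illustration, and the same lines could be pointed at `df` instead.

```python
import pandas as pd

# Illustrative values only; column names match the dataset.
sample = pd.DataFrame({
    "Aggregate rating": [0.0, 3.2, 0.0, 4.5, 3.1],
    "Votes": [0, 120, 2, 591, 45],
})

# Boolean mask of restaurants whose rating is exactly zero.
is_unrated = sample["Aggregate rating"] == 0

# Share of rows with a zero rating.
zero_share = is_unrated.mean()

# Median votes for zero-rated versus rated restaurants.
unrated_votes = sample.loc[is_unrated, "Votes"].median()
rated_votes = sample.loc[~is_unrated, "Votes"].median()

print(f"zero share: {zero_share:.2f}")
print(f"median votes (zero rating): {unrated_votes}, (rated): {rated_votes}")
```

If zero-rated restaurants have very few votes, that would support treating them as unrated rather than as genuinely poor.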


In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Restaurant ID,9551.0,9051128.0,8791521.0,53.0,301962.5,6004089.0,18352290.0,18500650.0
Country Code,9551.0,18.36562,56.75055,1.0,1.0,1.0,1.0,216.0
Longitude,9551.0,64.12657,41.46706,-157.948486,77.081343,77.19196,77.28201,174.8321
Latitude,9551.0,25.85438,11.00794,-41.330428,28.478713,28.57047,28.64276,55.97698
Average Cost for two,9551.0,1199.211,16121.18,0.0,250.0,400.0,700.0,800000.0
Price range,9551.0,1.804837,0.9056088,1.0,1.0,2.0,2.0,4.0
Aggregate rating,9551.0,2.66637,1.516378,0.0,2.5,3.2,3.7,4.9
Votes,9551.0,156.9097,430.1691,0.0,5.0,31.0,131.0,10934.0


In [None]:
df['City'].head()

Unnamed: 0,City
0,Makati City
1,Makati City
2,Mandaluyong City
3,Mandaluyong City
4,Mandaluyong City


#### Longitude and Latitude

The min value for the longitude is about -157.9 and the maximum is about 174.8. This is reasonable: the valid range for longitude is -180 to 180, so there are no impossible values, and the wide spread reflects the varied locations in the dataset.

The valid range for latitude is -90 to 90. The min latitude in the dataset is about -41.3 and the maximum is about 55.98, so the values are within the valid range.
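
The range check described above can also be done per row instead of by reading the summary table. This is a minimal sketch; the `sample` coordinates are illustrative, and the last row is deliberately invalid to show what a failure looks like.

```python
import pandas as pd

# Illustrative coordinates; the last row has an impossible longitude on purpose.
sample = pd.DataFrame({
    "Longitude": [121.027535, 77.19196, -157.948486, 190.0],
    "Latitude": [14.565443, 28.57047, 21.3, 10.0],
})

# Longitude must lie in [-180, 180] and latitude in [-90, 90].
valid_range = (
    sample["Longitude"].between(-180, 180)
    & sample["Latitude"].between(-90, 90)
)

# Rows that fail the check.
out_of_range = sample[~valid_range]
print(f"{len(out_of_range)} row(s) outside the valid coordinate range")
```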


#### Checking for missing values



In [None]:
df.isna().sum().sort_values(ascending = False)

Unnamed: 0,0
Cuisines,9
Restaurant ID,0
Currency,0
Rating text,0
Rating color,0
Aggregate rating,0
Price range,0
Switch to order menu,0
Is delivering now,0
Has Online delivery,0


**Observations:**
- Only the Cuisines column has missing values (9 rows); every other column is complete.
- According to my value counts, there are 2148 zero values in my target variable, the aggregate rating.
- Many of the categorical columns do not seem important. I may drop them along the way, as they might cause problems for my model.
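
The `isna()` output above reports 9 missing values in Cuisines. Two common ways to handle them are sketched below on a toy frame; the `sample` data and the `"Unknown"` label are assumptions for illustration, and which option suits the model is a judgment call.

```python
import pandas as pd

# Toy frame with one missing cuisine (illustrative values).
sample = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C"],
    "Cuisines": ["Japanese", None, "North Indian, Chinese"],
})

# Count missing values per column.
missing_per_column = sample.isna().sum()

# Option 1: keep the rows and label the gap explicitly.
filled = sample.assign(Cuisines=sample["Cuisines"].fillna("Unknown"))

# Option 2: drop the affected rows.
dropped = sample.dropna(subset=["Cuisines"])

print(missing_per_column)
print(len(filled), len(dropped))
```

With only 9 of 9551 rows affected, either option changes the dataset very little.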

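Working on other columns: Country Code is an identifier, not a quantity, so I convert it to a string to keep it out of the numerical summary.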

In [None]:
df['Country Code'] = df['Country Code'].astype(str)

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Restaurant ID,9551.0,9051128.0,8791521.0,53.0,301962.5,6004089.0,18352290.0,18500650.0
Longitude,9551.0,64.12657,41.46706,-157.948486,77.081343,77.19196,77.28201,174.8321
Latitude,9551.0,25.85438,11.00794,-41.330428,28.478713,28.57047,28.64276,55.97698
Average Cost for two,9551.0,1199.211,16121.18,0.0,250.0,400.0,700.0,800000.0
Price range,9551.0,1.804837,0.9056088,1.0,1.0,2.0,2.0,4.0
Aggregate rating,9551.0,2.66637,1.516378,0.0,2.5,3.2,3.7,4.9
Votes,9551.0,156.9097,430.1691,0.0,5.0,31.0,131.0,10934.0


The Average cost for two is weird.
- This is weird because the max value is 800,000 while the 75% value is only 700. The gap between those two values is too large.
- The 50% is 400, the 25% is 250 and the min value is 0. This shows that, because the restaurants are in different locations, the costs are recorded in currencies whose values vary widely.
- This is true as Indian Rupees is present, and India is a country in Asia. Botswana Pula is also present, and Botswana is a country in Africa. Emirati Diram (AED) is also present; it is the currency of the United Arab Emirates.
- This shows that the dataset is really diverse, in the sense that the countries and currencies are very far apart.
- To solve this problem, I will convert all the currencies to the US Dollar ($) so that costs are on a uniform scale and the outliers caused by mixing currencies are reduced.

In [None]:
if 'Currency' in df.columns:
    print(df['Currency'].unique())

['Botswana Pula(P)' 'Brazilian Real(R$)' 'Dollar($)' 'Emirati Diram(AED)'
 'Indian Rupees(Rs.)' 'Indonesian Rupiah(IDR)' 'NewZealand($)'
 'Pounds(��)' 'Qatari Rial(QR)' 'Rand(R)' 'Sri Lankan Rupee(LKR)'
 'Turkish Lira(TL)']


In [None]:
# Define approximate exchange rates (as of recent data)
currency_rates = {
    'Botswana Pula(P)': 0.073,    # 1 Pula ≈ 0.073 USD
    'Brazilian Real(R$)': 0.20,   # 1 BRL ≈ 0.20 USD
    'Dollar($)': 1.0,             # USD remains the same
    'Emirati Diram(AED)': 0.27,   # 1 AED ≈ 0.27 USD
    'Indian Rupees(Rs.)': 0.012,  # 1 INR ≈ 0.012 USD
    'Indonesian Rupiah(IDR)': 0.000065,  # 1 IDR ≈ 0.000065 USD
    'NewZealand($)': 0.61,        # 1 NZD ≈ 0.61 USD
    'Pounds(£)': 1.30,            # 1 GBP ≈ 1.30 USD
    'Qatari Rial(QR)': 0.27,      # 1 QAR ≈ 0.27 USD
    'Rand(R)': 0.053,             # 1 ZAR ≈ 0.053 USD
    'Sri Lankan Rupee(LKR)': 0.0031,  # 1 LKR ≈ 0.0031 USD
    'Turkish Lira(TL)': 0.032     # 1 TRY ≈ 0.032 USD
}

In [None]:
df['Average Cost for two(USD)'] = df['Average Cost for two'] * df['Currency'].map(currency_rates)

In [None]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Restaurant ID,9551.0,9051128.0,8791521.0,53.0,301962.5,6004089.0,18352290.0,18500650.0
Longitude,9551.0,64.12657,41.46706,-157.948486,77.081343,77.19196,77.28201,174.8321
Latitude,9551.0,25.85438,11.00794,-41.330428,28.478713,28.57047,28.64276,55.97698
Average Cost for two,9551.0,1199.211,16121.18,0.0,250.0,400.0,700.0,800000.0
Price range,9551.0,1.804837,0.9056088,1.0,1.0,2.0,2.0,4.0
Aggregate rating,9551.0,2.66637,1.516378,0.0,2.5,3.2,3.7,4.9
Votes,9551.0,156.9097,430.1691,0.0,5.0,31.0,131.0,10934.0
Average Cost for two(USD),9471.0,9.678978,15.74017,0.0,3.6,6.0,9.6,500.0
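
The count for Average Cost for two(USD) above is 9471 rather than 9551, so 80 rows did not receive a rate. One plausible cause, which is an assumption here, is that the pound symbol is garbled in the data and therefore does not match the `'Pounds(£)'` key. The following is a minimal sketch of a more tolerant lookup, keyed on the text before the opening parenthesis. The toy frame and the name `rates_by_name` are illustrative; the rates are copied from the dictionary above.

```python
import pandas as pd

# Toy frame; the second currency label stands in for a garbled pound symbol.
sample = pd.DataFrame({
    "Currency": ["Dollar($)", "Pounds(\ufffd\ufffd)", "Indian Rupees(Rs.)"],
    "Average Cost for two": [25, 40, 500],
})

# Hypothetical lookup keyed by the currency name only, without the symbol in parentheses.
rates_by_name = {"Dollar": 1.0, "Pounds": 1.30, "Indian Rupees": 0.012}

# Keep the text before the first "(" and trim whitespace.
currency_name = sample["Currency"].str.split("(", n=1).str[0].str.strip()

sample["Average Cost for two(USD)"] = (
    sample["Average Cost for two"] * currency_name.map(rates_by_name)
)

# Any currency missing from the lookup shows up as NaN here.
unmapped = sample["Average Cost for two(USD)"].isna().sum()
print(sample)
print(f"unmapped rows: {unmapped}")
```

Checking `unmapped` straight after the conversion makes silent gaps like the 80 missing rows visible immediately.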


- Since I have unified my currencies, I'll be dropping the average cost for two. drop() returns a new DataFrame, so I assign the result back to df.

In [None]:
df = df.drop(columns=['Average Cost for two'])
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Restaurant ID,9551.0,9051128.0,8791521.0,53.0,301962.5,6004089.0,18352290.0,18500650.0
Longitude,9551.0,64.12657,41.46706,-157.948486,77.081343,77.19196,77.28201,174.8321
Latitude,9551.0,25.85438,11.00794,-41.330428,28.478713,28.57047,28.64276,55.97698
Price range,9551.0,1.804837,0.9056088,1.0,1.0,2.0,2.0,4.0
Aggregate rating,9551.0,2.66637,1.516378,0.0,2.5,3.2,3.7,4.9
Votes,9551.0,156.9097,430.1691,0.0,5.0,31.0,131.0,10934.0
Average Cost for two(USD),9471.0,9.678978,15.74017,0.0,3.6,6.0,9.6,500.0
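
Even after conversion, the maximum in the table above (500) sits far above the 75% value (9.6), so some extreme costs remain. One common way to flag them is the interquartile range (IQR) rule. It is sketched here as an assumption, since the notebook does not say which outlier method it uses, and the `cost_usd` values are illustrative.

```python
import pandas as pd

# Illustrative USD costs with one extreme value.
cost_usd = pd.Series(
    [3.6, 6.0, 9.6, 4.8, 7.2, 500.0],
    name="Average Cost for two(USD)",
)

# Quartiles and interquartile range.
q1 = cost_usd.quantile(0.25)
q3 = cost_usd.quantile(0.75)
iqr = q3 - q1

# Conventional upper fence: values above q3 + 1.5 * IQR are flagged.
upper_fence = q3 + 1.5 * iqr
outliers = cost_usd[cost_usd > upper_fence]

print(f"upper fence: {upper_fence:.2f}")
print(outliers)
```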


### Checking for the top 5 cuisines and cities

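I'm checking first for the top 5 cuisines. Each entry in the Cuisines column is the full comma-separated list for one restaurant, so combinations are counted as separate values. The output below shows that North Indian; North Indian, Chinese; Chinese; Fast Food; and North Indian, Mughlai are the top 5 entries, with Chinese and Fast Food tied at 354.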

In [None]:
top_five_cuisines = df['Cuisines'].value_counts(ascending = False).head(5)

In [None]:
top_five_cuisines

Unnamed: 0_level_0,count
Cuisines,Unnamed: 1_level_1
North Indian,936
"North Indian, Chinese",511
Chinese,354
Fast Food,354
"North Indian, Mughlai",334


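Because the counts above treat each comma-separated combination as its own value, a restaurant listed as "North Indian, Chinese" is not counted under "North Indian" or "Chinese". The following is a minimal sketch of counting individual cuisines instead; the `sample` series is illustrative, and the same chain could be applied to `df['Cuisines']`.

```python
import pandas as pd

# Illustrative entries in the same comma-separated format as the Cuisines column.
sample = pd.Series(
    [
        "North Indian",
        "North Indian, Chinese",
        "Chinese",
        "Fast Food",
        "North Indian, Mughlai",
        None,
    ],
    name="Cuisines",
)

# Drop missing entries, split on commas, give each cuisine its own row, trim spaces.
individual = sample.dropna().str.split(",").explode().str.strip()

# Count how many restaurants list each individual cuisine.
counts = individual.value_counts()
print(counts)
```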
I'm checking again for the top five cities using the value counts method.

The top five cities are New Delhi with 5473 restaurants, Gurgaon with 1118, Noida with 1080, Faridabad with 251 and Ghaziabad with 25.

In [None]:
top_five_cities = df['City'].value_counts(ascending = False).head(5)

In [None]:
top_five_cities

Unnamed: 0_level_0,count
City,Unnamed: 1_level_1
New Delhi,5473
Gurgaon,1118
Noida,1080
Faridabad,251
Ghaziabad,25


I'm going to be visualising my top five cities and top five cuisines using seaborn and matplotlib.

In [None]:
top_five_cities.index

Index(['New Delhi', 'Gurgaon', 'Noida', 'Faridabad', 'Ghaziabad'], dtype='object', name='City')

In [None]:
# Checking for the top five cities
plt.figure(figsize=(8, 5))
sns.barplot(x=top_five_cities.index, y=top_five_cities.values, palette='viridis')
# xlabel, ylabel and title are functions and must be called, not assigned to
plt.xlabel('Top Five Cities')
plt.ylabel('Number of Restaurants')
plt.title('Top Five Cities With the Highest Number of Restaurants')
plt.show()

In [None]:
# Checking for the top five cuisines
plt.figure(figsize=(8, 5))
# color is dropped because palette already sets the bar colours
sns.barplot(x=top_five_cuisines.index, y=top_five_cuisines.values, palette='viridis')
plt.xlabel('Top Five Cuisines')
plt.ylabel('Frequency of Cuisines')
plt.title('Top Five Cuisines')
plt.show()