### Task 1: Predictive Modelling

We'll build regression models to predict a restaurant's aggregate rating from the available features. We'll split the dataset into training and testing sets, compare linear regression, decision tree, and random forest models, and evaluate each one with mean squared error (MSE) and R².

#### Step-by-step approach:

1. **Preprocess the data**: Encode the categorical variables as integers and standardize the features.
2. **Split the data**: Split the dataset into training and testing sets.
3. **Build and evaluate models**: Train and evaluate different regression models.

In [1]:
# Importing the basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
# Disable all warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px
import plotly.graph_objs as go


In [3]:
# Load the dataset
df = pd.read_csv('./data/data.csv')


In [4]:

# Preprocess the data
# Label-encode the categorical columns as integers. Each fit_transform call refits
# the shared encoder, so `le` only remembers the classes of the last column fitted.
# From here on, df['Cuisines'] and df['City'] hold integer codes, not names.
le = LabelEncoder()
df['Cuisines'] = le.fit_transform(df['Cuisines'])
df['City'] = le.fit_transform(df['City'])
df['Country Code'] = le.fit_transform(df['Country Code'])
df['Rating color'] = le.fit_transform(df['Rating color'])
# Map the Yes/No flags to 1/0
df['Has Table booking'] = df['Has Table booking'].apply(lambda x: 1 if x == 'Yes' else 0)
df['Has Online delivery'] = df['Has Online delivery'].apply(lambda x: 1 if x == 'Yes' else 0)
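
Because the cell above reuses a single `LabelEncoder`, only the mapping for the last column fitted is retained, so the integer codes for `Cuisines` and `City` cannot be turned back into names afterwards. The following is a minimal sketch, on a toy frame with made-up values, of one way to keep an encoder per column. The `toy` frame, the `encoders` dictionary, and the `_code` column suffix are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    'Cuisines': ['Italian', 'Chinese', 'Italian', 'Indian'],
    'City': ['Pune', 'Delhi', 'Delhi', 'Pune'],
})

# One encoder per column, stored by name so each mapping can be reversed later
encoders = {}
for col in ['Cuisines', 'City']:
    encoders[col] = LabelEncoder()
    # Write the codes to a new column so the original labels stay available
    toy[col + '_code'] = encoders[col].fit_transform(toy[col])

# Recover the original labels from the integer codes
decoded = encoders['Cuisines'].inverse_transform(toy['Cuisines_code'])
print(list(decoded))
```

Keeping the original text column alongside the code column also lets later group-by summaries report readable cuisine names.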


In [5]:

# Select features and target variable
features = ['Country Code', 'City', 'Cuisines', 'Price range', 'Has Table booking', 'Has Online delivery']
X = df[features]
y = df['Aggregate rating']


In [6]:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    

In [7]:

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
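
Fitting the scaler on the training split only, as above, keeps test-set statistics out of the model. The same idea can be packaged in a scikit-learn pipeline, which refits the scaler inside every cross-validation fold. This is a sketch on synthetic data, not the restaurant dataset; `X_demo`, `y_demo`, and the coefficients are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the real features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The pipeline scales and then fits; during cross-validation the scaler
# is fitted on each training fold only, never on the held-out fold
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='r2')
print(scores.mean())
```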


In [8]:

# Train and evaluate different regression models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42)
}


In [9]:

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'MSE': mse, 'R2': r2}
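
After fitting, a random forest exposes `feature_importances_`, which can hint at which inputs drive the predictions. The sketch below uses synthetic data with one informative and one uninformative column; the column names `signal` and `noise` are hypothetical and do not come from the restaurant dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, labelled by column and sorted high to low
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```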


In [10]:

# Display results
results_df = pd.DataFrame(results).T
print(results_df)


                        MSE        R2
Linear Regression  1.645230  0.277174
Decision Tree      1.759292  0.227062
Random Forest      1.555506  0.316594
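
MSE is in squared rating units, so its square root (RMSE) is easier to read against the rating scale. A short sketch that converts the MSE values printed above; the `reported_mse` dictionary simply copies those numbers.

```python
import math

# MSE values copied from the results table above
reported_mse = {
    'Linear Regression': 1.645230,
    'Decision Tree': 1.759292,
    'Random Forest': 1.555506,
}

# RMSE is the square root of MSE and is expressed in rating points
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}
for name, value in rmse.items():
    print(f'{name}: RMSE = {value:.3f}')
```

All three models come out at roughly 1.25 to 1.33 rating points of typical error, with the random forest lowest.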


### Task 2: Customer Preference Analysis

We'll analyze the relationship between the type of cuisine and the restaurant's rating. We'll identify the most popular cuisines based on the number of votes and determine if there are any specific cuisines that tend to receive higher ratings.


In [11]:
# Analyze the relationship between the type of cuisine and the restaurant's rating
# Note: 'Cuisines' now holds the integer codes assigned by LabelEncoder in In [4]
cuisine_ratings = df.groupby('Cuisines')['Aggregate rating'].mean().sort_values(ascending=False)
print('Average rating by cuisine:\n', cuisine_ratings)


Average rating by cuisine:
 Cuisines
1062    4.9
41      4.9
13      4.9
169     4.9
1034    4.9
       ... 
1       0.0
75      0.0
1790    0.0
2       0.0
1808    0.0
Name: Aggregate rating, Length: 1826, dtype: float64


In [12]:
# Identify the most popular cuisines based on the number of votes
cuisine_votes = df.groupby('Cuisines')['Votes'].sum().sort_values(ascending=False)
print('Most popular cuisines based on votes:\n', cuisine_votes)


Most popular cuisines based on votes:
 Cuisines
1514    53747
1306    46241
1329    42012
331     30657
497     21925
        ...  
1398        0
1711        0
234         0
1299        0
1811        0
Name: Votes, Length: 1826, dtype: int64


In [13]:
# Determine if there are any specific cuisines that tend to receive higher ratings
top_cuisines = cuisine_ratings.head(10)
print('Top 10 cuisines with highest ratings:\n', top_cuisines)


Top 10 cuisines with highest ratings:
 Cuisines
1062    4.9
41      4.9
13      4.9
169     4.9
1034    4.9
33      4.9
949     4.9
1214    4.9
37      4.9
1286    4.9
Name: Aggregate rating, dtype: float64
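
A mean of 4.9 may come from a cuisine label that appears on only one or two restaurants, which makes the ranking above hard to trust on its own. One possible refinement, sketched here on a toy frame with made-up values, is to report the number of restaurants and total votes next to each mean and to rank only the groups above a minimum size. The threshold `min_restaurants` and the cuisine labels `A`, `B`, `C` are hypothetical.

```python
import pandas as pd

# Toy data: cuisine 'A' has a single, highly rated restaurant
toy = pd.DataFrame({
    'Cuisines': ['A', 'B', 'B', 'B', 'C', 'C'],
    'Aggregate rating': [4.9, 4.0, 4.2, 3.8, 3.0, 3.4],
    'Votes': [3, 120, 80, 100, 40, 60],
})

# Mean rating, restaurant count, and total votes per cuisine
stats = toy.groupby('Cuisines').agg(
    mean_rating=('Aggregate rating', 'mean'),
    n_restaurants=('Aggregate rating', 'size'),
    total_votes=('Votes', 'sum'),
)

# Rank only cuisines with enough restaurants (hypothetical threshold)
min_restaurants = 2
reliable = stats[stats['n_restaurants'] >= min_restaurants]
reliable = reliable.sort_values('mean_rating', ascending=False)
print(reliable)
```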


### Task 3: Data Visualization

We'll create advanced visualizations using Plotly to represent the distribution of ratings and compare the average ratings of different cuisines or cities. We'll visualize the relationship between various features and the target variable.

In [14]:
import dash
from dash import dcc, html
from dash.dependencies import Input, Output

# Create Dash app
app = dash.Dash(__name__)

app.layout = html.Div([
    html.H1("Restaurant Ratings Analysis"),
    
    dcc.Graph(
        id='histogram',
        figure=px.histogram(df, x='Aggregate rating', nbins=20, title='Distribution of Aggregate Rating')
    ),
    
    dcc.Graph(
        id='bar-plot',
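        # Aggregate first: passing raw rows to px.bar stacks them, which shows the sum of ratings, not the average
        figure=px.bar(df.groupby('Cuisines', as_index=False)['Aggregate rating'].mean(), x='Cuisines', y='Aggregate rating', title='Average Rating by Cuisine')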
    ),
    
    dcc.Graph(
        id='violin-plot',
        figure=px.violin(df, y='Aggregate rating', x='City', box=True, points='all', title='Aggregate Rating by City')
    ),
    
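    dcc.Graph(
        id='hexbin-plot',
        # Rectangular-bin density heatmap (not a true hexbin); histfunc='avg' shows the mean rating per bin instead of the default sum
        figure=px.density_heatmap(df, x='Longitude', y='Latitude', z='Aggregate rating', histfunc='avg', nbinsx=30, nbinsy=30, title='Average Rating by Location (Density Heatmap)')
    ),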
    
    dcc.Graph(
        id='bubble-plot',
        figure=px.scatter(df, x='Votes', y='Aggregate rating', size='Price range', color='City', hover_name='Restaurant Name', title='Bubble Plot of Ratings vs Votes')
    )
])


In [15]:

# Run the app
# Newer Dash releases use app.run(); older ones use app.run_server()
if __name__ == '__main__':
    app.run(debug=True)


### Summary

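1. **Predictive Modelling**: Built and evaluated linear regression, decision tree, and random forest models to predict restaurant ratings. The random forest performed best (MSE 1.56, R² 0.32), though none of the models explains more than a third of the variance.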
2. **Customer Preference Analysis**: Analyzed the relationship between cuisine types and ratings, identified popular cuisines, and determined cuisines with higher ratings.
3. **Data Visualization**: Built a Dash app with Plotly charts showing the rating distribution, ratings by cuisine and by city, ratings by location, and ratings versus votes.

You can run the provided code to complete each task and gain insights from your dataset.