# Restaurant Revenue Model

### Dataset
The dataset is a comma-separated (CSV) file with 8 columns:

* Number_of_Customers - The count of customers visiting the restaurant.
* Menu_Price - Average menu prices at the restaurant.
* Marketing_Spend - Amount spent on marketing.
* Cuisine_Type - The type of cuisine offered (Italian, Mexican, Japanese, American).
* Average_Customer_Spending - Average spending per customer.
* Promotions - Binary indicator (0 or 1) denoting whether promotions were conducted.
* Reviews - Number of reviews received by the restaurant.
* Monthly_Revenue - Simulated monthly revenue, the target variable for prediction.
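
The column descriptions above can be turned into a quick sanity check before any modelling. The following is a minimal sketch, not part of the original notebook: `validate_schema`, `EXPECTED_COLUMNS`, `ALLOWED_CUISINES`, and the two sample rows are hypothetical and use made-up values.

```python
import pandas as pd

# Column names come from the list above; everything else here is illustrative.
EXPECTED_COLUMNS = [
    "Number_of_Customers", "Menu_Price", "Marketing_Spend", "Cuisine_Type",
    "Average_Customer_Spending", "Promotions", "Reviews", "Monthly_Revenue",
]
ALLOWED_CUISINES = {"Italian", "Mexican", "Japanese", "American"}


def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # the remaining checks need all columns
    if not set(df["Promotions"].unique()) <= {0, 1}:
        problems.append("Promotions must be 0 or 1")
    if not set(df["Cuisine_Type"].unique()) <= ALLOWED_CUISINES:
        problems.append("unexpected Cuisine_Type value")
    return problems


# Two made-up rows, not records from the real dataset.
sample_df = pd.DataFrame({
    "Number_of_Customers": [61, 24],
    "Menu_Price": [43.5, 40.0],
    "Marketing_Spend": [12.7, 4.6],
    "Cuisine_Type": ["Japanese", "Italian"],
    "Average_Customer_Spending": [36.2, 17.9],
    "Promotions": [0, 1],
    "Reviews": [45, 36],
    "Monthly_Revenue": [350.9, 221.3],
})

print(validate_schema(sample_df))
```

An empty list means every expected column is present and the binary and categorical columns only hold the documented values.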


In [10]:
# Create a directory for the datasets (-p avoids an error if it already exists)
!mkdir -p datasets


In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [12]:
# Defining the data type for the loading file
dtypes = {
    "Number_of_Customers": "int16",
    "Menu_Price": "float32",
    "Marketing_Spend": "float32",
    "Cuisine_Type": "str",
    "Average_Customer_Spending": "float32",
    "Promotions":"int16",
    "Reviews": "int16",
    "Monthly_Revenue": "float32",
}

In [13]:
# Reading the original file from GitHub
restaurant_df = pd.read_csv( "https://raw.githubusercontent.com/ritikisb83/CTAssignmentgroup5/refs/heads/main/Datasets/Restaurant%20Revenue%20Model.csv" , dtype = dtypes)
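
To illustrate what the `dtype` mapping changes, here is a small sketch that parses the same CSV text with and without it. The two CSV rows are made-up values in the dataset's layout, and `default_df` and `compact_df` are names introduced only for this example.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as the dataset (illustrative values only).
csv_text = (
    "Number_of_Customers,Menu_Price,Marketing_Spend,Cuisine_Type,"
    "Average_Customer_Spending,Promotions,Reviews,Monthly_Revenue\n"
    "61,43.5,12.7,Japanese,36.2,0,45,350.9\n"
    "24,40.0,4.6,Italian,17.9,1,36,221.3\n"
)

dtypes = {
    "Number_of_Customers": "int16",
    "Menu_Price": "float32",
    "Marketing_Spend": "float32",
    "Cuisine_Type": "str",
    "Average_Customer_Spending": "float32",
    "Promotions": "int16",
    "Reviews": "int16",
    "Monthly_Revenue": "float32",
}

# Without dtype, pandas infers 64-bit numeric types.
default_df = pd.read_csv(io.StringIO(csv_text))
# With dtype, the narrower types are applied while parsing.
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

for column in ["Number_of_Customers", "Menu_Price"]:
    print(column, default_df[column].dtype, "->", compact_df[column].dtype)
```

Narrower types reduce memory use, at the cost of range and precision: `int16` only holds values up to 32767, so this mapping assumes counts such as `Reviews` stay below that limit.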

In [14]:
# Saving the original file in parquet format
restaurant_df.to_parquet("./datasets/restaurant_df.parquet")

In [15]:
# reading the parquet file from github
url = "https://raw.githubusercontent.com/ritikisb83/CTAssignmentgroup5/refs/heads/main/Datasets/restaurant_df.parquet"
restaurant_df = pd.read_parquet(url, engine="pyarrow")

In [16]:
# Row and column counts of restaurant_df

restaurant_df.shape

(1000, 8)

# Data Profiling

In [17]:
# Installing ydata_profiling
!pip install ydata_profiling



In [18]:
# Importing ProfileReport from ydata_profiling
from ydata_profiling import ProfileReport

## Creating Data Profile

In [19]:
#Generating the Pandas profiling report
profile = ProfileReport(restaurant_df, title="Pandas Profiling Report")


In [20]:
profile.to_notebook_iframe()


In [21]:
## Exporting the report to a file
profile.to_file("Restaurant_revenue_profiling.html")


# Train, test, and production split (60-20-20)

In [22]:
from sklearn.model_selection import train_test_split

# split into train and temporary sets
train_data, temp_data = train_test_split(restaurant_df, test_size=0.4,  random_state=30)

# split into test and production sets
test_data, prod_data = train_test_split(temp_data, test_size=0.5,  random_state=30)

In [23]:
# Checking the shape of the train, test and production data
train_data.shape, test_data.shape, prod_data.shape

((600, 8), (200, 8), (200, 8))
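
The 60-20-20 proportions follow from the two chained calls: the first holds out 40% of the rows, and the second splits that 40% in half. A minimal sketch on a stand-in frame (`demo_df` and its `row_id` column are hypothetical) shows the arithmetic and confirms that the three parts do not overlap.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame with 1000 rows, the same row count as restaurant_df.
demo_df = pd.DataFrame({"row_id": np.arange(1000)})

# First split: 60% train, 40% held out.
train_part, temp_part = train_test_split(demo_df, test_size=0.4, random_state=30)
# Second split: half of the held-out 40% each, i.e. 20% + 20% of the total.
test_part, prod_part = train_test_split(temp_part, test_size=0.5, random_state=30)

shares = {
    "train": len(train_part) / len(demo_df),
    "test": len(test_part) / len(demo_df),
    "prod": len(prod_part) / len(demo_df),
}
print(shares)

# The three index sets should be pairwise disjoint.
no_overlap = (
    train_part.index.intersection(test_part.index).empty
    and train_part.index.intersection(prod_part.index).empty
    and test_part.index.intersection(prod_part.index).empty
)
print("No overlap between splits:", no_overlap)
```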

### Creating Parquet files for train, test, and production data

In [24]:
# Saving the train, test, and production splits in parquet format

train_data.to_parquet("./datasets/restaurant_train_df.parquet")
test_data.to_parquet("./datasets/restaurant_test_df.parquet")
prod_data.to_parquet("./datasets/restaurant_prod_df.parquet")


In [25]:
# Reading the file from github and saving them into the dataframe

train_url = "https://raw.githubusercontent.com/ritikisb83/CTAssignmentgroup5/refs/heads/main/Datasets/restaurant_train_df.parquet"
train_df = pd.read_parquet(train_url, engine="pyarrow")
test_url = "https://raw.githubusercontent.com/ritikisb83/CTAssignmentgroup5/refs/heads/main/Datasets/restaurant_test_df.parquet"
test_df = pd.read_parquet(test_url, engine="pyarrow")
prod_url = "https://raw.githubusercontent.com/ritikisb83/CTAssignmentgroup5/refs/heads/main/Datasets/restaurant_prod_df.parquet"
prod_df = pd.read_parquet(prod_url, engine="pyarrow")

# ML Pipeline with Scikit-Learn

### Need for Data Transformation

1. Categorical encoding for categorical columns
    - One-hot encoding (OHE)
2. Data scaling
    - Standard scaling

##### Feature Set Selection & ML pipeline

In [26]:
# creating features set
features_set = ['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Cuisine_Type',
       'Average_Customer_Spending', 'Promotions', 'Reviews',
       'Monthly_Revenue']

In [27]:
# checking the columns name and null info

restaurant_df[features_set].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Number_of_Customers        1000 non-null   int16  
 1   Menu_Price                 1000 non-null   float32
 2   Marketing_Spend            1000 non-null   float32
 3   Cuisine_Type               1000 non-null   object 
 4   Average_Customer_Spending  1000 non-null   float32
 5   Promotions                 1000 non-null   int16  
 6   Reviews                    1000 non-null   int16  
 7   Monthly_Revenue            1000 non-null   float32
dtypes: float32(4), int16(3), object(1)
memory usage: 29.4+ KB


* There are no null values in the dataset.
* One column (Cuisine_Type) is of object type and the rest are numeric, so we apply one-hot encoding to the categorical column and scaling to the numeric columns.



In [28]:
# Creating X features
x_columns = ['Number_of_Customers', 'Menu_Price', 'Marketing_Spend', 'Cuisine_Type',
       'Average_Customer_Spending', 'Promotions', 'Reviews']

In [29]:
# creating categorical features

cat_vars = ['Cuisine_Type']

In [30]:
# Creating numerical features (a list comprehension keeps the column order deterministic)
num_vars = [col for col in x_columns if col not in cat_vars]

In [31]:
# Splitting the train and test datasets into X and y
x_train = train_df[x_columns]
y_train = train_df['Monthly_Revenue']
x_test = test_df[x_columns]
y_test = test_df['Monthly_Revenue']

In [32]:
# Importing the libraries needed for the ML pipeline

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import wandb
import os

In [33]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
scaler = StandardScaler()

## Pipeline for applying scaling
num_transformer = Pipeline(steps=[('scaler', scaler)])

## Pipeline for applying OHE
cat_transformer = Pipeline(steps=[('ohe', ohe_encoder)])

## The complete pipeline for applying the required transformations to the respective columns
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num_vars),
                                               ('cat', cat_transformer, cat_vars)])

In [39]:
preprocessor
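
The encoder above is created with `handle_unknown='ignore'`. A small sketch of what that setting does when production data contains a cuisine that was absent at fit time; "Thai" is a hypothetical unseen value, and the frames here are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categories seen during fitting (the four cuisines listed in the dataset description).
train_cuisines = pd.DataFrame(
    {"Cuisine_Type": ["Italian", "Mexican", "Japanese", "American"]}
)
# "Thai" is a hypothetical category that the encoder has never seen.
new_cuisines = pd.DataFrame({"Cuisine_Type": ["Mexican", "Thai"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cuisines)

encoded = encoder.transform(new_cuisines).toarray()
print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # the unseen category becomes a row of zeros
```

With the default `handle_unknown='error'`, the same `transform` call would raise instead, so `'ignore'` trades a hard failure for a silent all-zero encoding.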

In [6]:
# Setting the W&B API key (replace the placeholder with your own key)
#os.environ["WANDB_API_KEY"] = <"Use your key">

# Model Experiments

In [35]:
# Importing the modules needed for the model experiments

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_validate

### Baseline Model: Linear Regression with Standard Scaling

In [36]:
# Define the Linear Regression model
linear_reg = LinearRegression()

# Create a pipeline that includes preprocessing and the Linear Regression model

linear_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing (e.g., scaling, encoding)
    ('linear_model', linear_reg)     # Use Linear Regression as the model
])

# Train the pipeline on the training data

linear_model.fit(x_train, y_train)

# Initialize Weights & Biases (W&B) to track the experiment
# Create a new project named 'mlops_restautant_revenue' with specific tags for the experiment
wandb.init(project='mlops_restautant_revenue', config=None, tags=['Linear Model', 'baseline', 'OHE Encoding'])

# Name the run for easier identification in W&B
wandb.run.name = "LinearModel"

# Evaluate the model on the test set
# Calculate the Root Mean Squared Error (RMSE) on the test data
rmse = np.sqrt(mean_squared_error(y_test, linear_model.predict(x_test)))

# Calculate the R-squared (R²) score on the test data
r2 = linear_model.score(x_test, y_test)

# Log performance metrics to W&B
wandb.log({
    "rmse": rmse,  # Log RMSE as a key metric
    "r2": r2       # Log R² as a key metric
})

# Create a W&B artifact object for the model.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here; the final model is logged at the end of the notebook.
wandb.Artifact("LinearModel",  # Name of the artifact
               type='model',   # Specify that this artifact is a model
               description=None)

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()  # Mark the end of the W&B run


wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: ritikranjan (ritikranjan-indian-school-of-business). Use `wandb login --relogin` to force relogin




Run summary:
r2    0.68213
rmse  57.74321
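
For reference, the two logged metrics can be reproduced by hand. The sketch below uses made-up targets and predictions (not values from the restaurant data) and checks the manual formulas against scikit-learn's `mean_squared_error` and `r2_score`; for a regressor, `Pipeline.score` returns the same R² as `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([250.0, 300.0, 180.0, 420.0, 310.0])
y_pred = np.array([260.0, 280.0, 200.0, 400.0, 330.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual (same units as the target).
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print("RMSE:", rmse_manual, rmse_sklearn)
print("R2:", r2_manual, r2_sklearn)
```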


### Baseline Model: Linear Regression with MinMax Scaling

In [37]:
Min_Max_scaler = MinMaxScaler()

## Pipeline for applying scaling
num_transformer_scale = Pipeline(steps=[('MinMax', Min_Max_scaler)])


## The complete pipeline for applying the required transformations to the respective columns
preprocessor_scale = ColumnTransformer(transformers=[('num', num_transformer_scale, num_vars),
                                               ('cat', cat_transformer, cat_vars)])

In [38]:
# Define the Linear Regression model
linear_reg = LinearRegression()

# Create a pipeline that includes preprocessing and the Linear Regression model

linear_model_Scaled = Pipeline(steps=[
    ('preprocessor', preprocessor_scale),  # Apply preprocessing (e.g., scaling, encoding)
    ('linear_model', linear_reg)     # Use Linear Regression as the model
])

# Train the pipeline on the training data

linear_model_Scaled.fit(x_train, y_train)

# Initialize Weights & Biases (W&B) to track the experiment

wandb.init(project='mlops_restautant_revenue', config=None, tags=['Linear Model', 'MinMaxScaling', 'OHE Encoding'])

# Name the run for easier identification in W&B
wandb.run.name = "LinearModel_MinMaxScaling"

# Evaluate the model on the test set
# Calculate the Root Mean Squared Error (RMSE) on the test data
rmse = np.sqrt(mean_squared_error(y_test, linear_model_Scaled.predict(x_test)))

# Calculate the R-squared (R²) score on the test data
r2 = linear_model_Scaled.score(x_test, y_test)

# Log performance metrics to W&B
wandb.log({
    "rmse": rmse,  # Log RMSE as a key metric
    "r2": r2       # Log R² as a key metric
})

# Create a W&B artifact object for the model.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here.
wandb.Artifact("LinearModel_MinMaxScaling",  # Name of the artifact
               type='model',   # Specify that this artifact is a model
               description=None)

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()  # Mark the end of the W&B run

Run summary:
r2    0.68405
rmse  57.56839


### Decision tree with depth of 5

In [None]:
# Define the hyperparameters for the Decision Tree model

params = {"max_depth": 5}

# Initialize the Decision Tree Regressor with specified parameters
dtree = DecisionTreeRegressor(**params)

# Create a pipeline that includes preprocessing and the Decision Tree model

dtree_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing to the data
    ('dt_model', dtree)              # Use Decision Tree as the model
])

# Train the pipeline on the training data
dtree_model.fit(x_train, y_train)

# Initialize Weights & Biases (W&B) for experiment tracking

wandb.init(project='mlops_restautant_revenue', config=params,
           tags=['Decision Tree', 'OHE Encoding'])  # Tags help organize experiments

# Name the run for easier identification in W&B
wandb.run.name = "DecisionTree_depth5"  # This run tracks the Decision Tree with max_depth=5

# Evaluate the model on the test data
# Calculate Root Mean Squared Error (RMSE) as a measure of prediction error
rmse = np.sqrt(mean_squared_error(y_test, dtree_model.predict(x_test)))

# Calculate R-squared (R²) score to measure the proportion of variance explained by the model
r2 = dtree_model.score(x_test, y_test)

# Log evaluation metrics to W&B
# Metrics like RMSE and R² are logged for comparison across experiments
wandb.log({
    "rmse": rmse,  # Log the RMSE value
    "r2": r2       # Log the R² score
})

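# Create a W&B artifact object for the model, with the hyperparameters as metadata.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here.
wandb.Artifact("DecisionTree",  # Name of the artifact
               type='model',    # Specify the artifact type as 'model'
               metadata=params) # description expects a string; metadata accepts a dict

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()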


Run summary:
r2    0.56432
rmse  67.60233


### Decision tree with depth of 10

In [None]:
# Define the hyperparameters for the Decision Tree model

params = {"max_depth": 10}

# Initialize the Decision Tree Regressor with specified parameters
dtree = DecisionTreeRegressor(**params)

# Create a pipeline that includes preprocessing and the Decision Tree model

dtree_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing to the data
    ('dt_model', dtree)              # Use Decision Tree as the model
])

# Train the pipeline on the training data
dtree_model.fit(x_train, y_train)

# Initialize Weights & Biases (W&B) for experiment tracking

wandb.init(project='mlops_restautant_revenue', config=params,
           tags=['Decision Tree', 'OHE Encoding'])  # Tags help organize experiments

# Name the run for easier identification in W&B
wandb.run.name = "DecisionTree_depth10"  # This run tracks the Decision Tree with max_depth=10

# Evaluate the model on the test data
# Calculate Root Mean Squared Error (RMSE) as a measure of prediction error
rmse = np.sqrt(mean_squared_error(y_test, dtree_model.predict(x_test)))

# Calculate R-squared (R²) score to measure the proportion of variance explained by the model
r2 = dtree_model.score(x_test, y_test)

# Log evaluation metrics to W&B
# Metrics like RMSE and R² are logged for comparison across experiments
wandb.log({
    "rmse": rmse,  # Log the RMSE value
    "r2": r2       # Log the R² score
})

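# Create a W&B artifact object for the model, with the hyperparameters as metadata.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here.
wandb.Artifact("DecisionTree",  # Name of the artifact
               type='model',    # Specify the artifact type as 'model'
               metadata=params) # description expects a string; metadata accepts a dict

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()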


Run summary:
r2    0.37837
rmse  80.74998


### Decision tree with depth of 15

In [None]:
# Define the hyperparameters for the Decision Tree model

params = {"max_depth": 15}

# Initialize the Decision Tree Regressor with specified parameters
dtree = DecisionTreeRegressor(**params)

# Create a pipeline that includes preprocessing and the Decision Tree model

dtree_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing to the data
    ('dt_model', dtree)              # Use Decision Tree as the model
])

# Train the pipeline on the training data
dtree_model.fit(x_train, y_train)

# Initialize Weights & Biases (W&B) for experiment tracking

wandb.init(project='mlops_restautant_revenue', config=params,
           tags=['Decision Tree', 'OHE Encoding'])  # Tags help organize experiments

# Name the run for easier identification in W&B
wandb.run.name = "DecisionTree_depth15"  # This run tracks the Decision Tree with max_depth=15

# Evaluate the model on the test data
# Calculate Root Mean Squared Error (RMSE) as a measure of prediction error
rmse = np.sqrt(mean_squared_error(y_test, dtree_model.predict(x_test)))

# Calculate R-squared (R²) score to measure the proportion of variance explained by the model
r2 = dtree_model.score(x_test, y_test)

# Log evaluation metrics to W&B
# Metrics like RMSE and R² are logged for comparison across experiments
wandb.log({
    "rmse": rmse,  # Log the RMSE value
    "r2": r2       # Log the R² score
})

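# Create a W&B artifact object for the model, with the hyperparameters as metadata.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here.
wandb.Artifact("DecisionTree",  # Name of the artifact
               type='model',    # Specify the artifact type as 'model'
               metadata=params) # description expects a string; metadata accepts a dict

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()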


Run summary:
r2    0.34231
rmse  83.05905


### Decision tree with depth of 25

In [None]:
# Define the hyperparameters for the Decision Tree model

params = {"max_depth": 25}

# Initialize the Decision Tree Regressor with specified parameters
dtree = DecisionTreeRegressor(**params)

# Create a pipeline that includes preprocessing and the Decision Tree model

dtree_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Apply preprocessing to the data
    ('dt_model', dtree)              # Use Decision Tree as the model
])

# Train the pipeline on the training data
dtree_model.fit(x_train, y_train)

# Initialize Weights & Biases (W&B) for experiment tracking

wandb.init(project='mlops_restautant_revenue', config=params,
           tags=['Decision Tree', 'OHE Encoding'])  # Tags help organize experiments

# Name the run for easier identification in W&B
wandb.run.name = "DecisionTree_depth25"  # This run tracks the Decision Tree with max_depth=25

# Evaluate the model on the test data
# Calculate Root Mean Squared Error (RMSE) as a measure of prediction error
rmse = np.sqrt(mean_squared_error(y_test, dtree_model.predict(x_test)))

# Calculate R-squared (R²) score to measure the proportion of variance explained by the model
r2 = dtree_model.score(x_test, y_test)

# Log evaluation metrics to W&B
# Metrics like RMSE and R² are logged for comparison across experiments
wandb.log({
    "rmse": rmse,  # Log the RMSE value
    "r2": r2       # Log the R² score
})

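# Create a W&B artifact object for the model, with the hyperparameters as metadata.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here.
wandb.Artifact("DecisionTree",  # Name of the artifact
               type='model',    # Specify the artifact type as 'model'
               metadata=params) # description expects a string; metadata accepts a dict

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()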

Run summary:
r2    0.33978
rmse  83.21893
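
Across the four decision tree runs, the test R² falls from 0.56432 at depth 5 to 0.33978 at depth 25, a pattern that is consistent with overfitting. The runs above do not log training scores, so the following sketch uses synthetic data (not the restaurant dataset) to show how comparing train and test R² per depth can expose that gap. All names and the data-generating formula are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a linear signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 5.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in [2, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, tree.predict(X_tr))
    test_r2 = r2_score(y_te, tree.predict(X_te))
    results[depth] = (train_r2, test_r2)
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

A training score that keeps rising while the test score stalls or drops suggests the deeper tree is memorising noise. Logging both scores to W&B for each depth would make the same comparison possible on the real data.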


### Checking the K-fold cross validation for the Linear Model

* Of all the experiments, the linear models give the highest R² and the lowest RMSE on the test set.
* Next, we run K-fold cross-validation for the linear model with standard scaling.

In [None]:
linear_reg = LinearRegression()

linear_model_cv = Pipeline(steps=[('preprocessor', preprocessor),
                               ('linear_model', linear_reg)])

# Perform 5-fold cross-validation
cv5_results = cross_validate(
    linear_model_cv,
    x_train, y_train,
    scoring=['neg_mean_squared_error', 'r2'],
    cv=5,
    return_train_score=True
)

# Compute the mean R² and the RMSE (square root of the mean MSE) over the 5 folds
cv5_mean_rmse = np.sqrt(-np.mean(cv5_results['test_neg_mean_squared_error']))
cv5_mean_r2 = np.mean(cv5_results['test_r2'])

wandb.init(project='mlops_restautant_revenue', config=None, tags=['Linear Model', 'cross-validation', 'OHE Encoding'])
wandb.run.name = "Cross_Validation_5"

wandb.log( {
            "cv5_mean_rmse": cv5_mean_rmse,
            "cv5_mean_r2": cv5_mean_r2,
            } )

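# Create a W&B artifact object for this run.
# Note: no files are added and it is not passed to wandb.log_artifact(),
# so nothing is uploaded here.
wandb.Artifact("Cross_Validation_5",
               type = 'model',
               description = None)

# Finalize the W&B run
wandb.save()  # Called without a file pattern, so no specific files are uploaded
wandb.finish()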

Run summary:
cv5_mean_r2    0.69037
cv5_mean_rmse  58.47569


* The 5-fold cross-validation means (R² ≈ 0.690, RMSE ≈ 58.48) are close to the test-set results for the same model (R² ≈ 0.682, RMSE ≈ 57.74), indicating that the model generalizes consistently.
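
The cell above logs only the fold means. To judge how much the folds differ from one another, the per-fold scores and their spread can be inspected directly. This is a minimal sketch on synthetic data (not the restaurant dataset); `fold_rmse` and `fold_r2` are names introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic linear data with noise (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([5.0, -3.0, 2.0, 0.5]) + rng.normal(0, 2.0, size=300)

cv_results = cross_validate(
    LinearRegression(),
    X, y,
    scoring=["neg_mean_squared_error", "r2"],
    cv=5,
    return_train_score=True,
)

# Convert the negated MSE of each fold into an RMSE per fold.
fold_rmse = np.sqrt(-cv_results["test_neg_mean_squared_error"])
fold_r2 = cv_results["test_r2"]

print("RMSE per fold:", np.round(fold_rmse, 3), "std:", round(float(fold_rmse.std()), 3))
print("R2 per fold:  ", np.round(fold_r2, 3), "std:", round(float(fold_r2.std()), 3))
print("Mean train R2:", round(float(cv_results["train_r2"].mean()), 3))
```

A small standard deviation across folds, together with train and test scores that sit close to each other, is the evidence that supports a claim of stable generalization.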

### Storing the model into a file

In [None]:
from joblib import dump

MODEL_DIR = "./restaurantmodela"

# exist_ok=True avoids an error if the directory already exists
os.makedirs(MODEL_DIR, exist_ok=True)
dump(linear_model, os.path.join(MODEL_DIR, 'restaurantrevenue.pkl'))

['./restaurantmodela/restaurantrevenue.pkl']
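
A saved pipeline is only useful if it can be loaded back and gives the same predictions. The sketch below checks that round trip with `joblib.load` on a small stand-in pipeline trained on synthetic data; `demo_pipeline`, the file name `demo_model.pkl`, and the temporary directory are assumptions for the example and do not touch the notebook's own model file.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 4.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=50)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("linear_model", LinearRegression()),
])
demo_pipeline.fit(X, y)

# Write the fitted pipeline to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    dump(demo_pipeline, model_path)
    restored_pipeline = load(model_path)

same_predictions = np.allclose(demo_pipeline.predict(X), restored_pipeline.predict(X))
print("Predictions match after reload:", same_predictions)
```

Because the whole pipeline is serialized, the restored object applies the fitted preprocessing and the model in one `predict` call. Loading generally requires a compatible scikit-learn version in the target environment.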

### Logging the model artifact in the tracking tool (Weights & Biases)

In [None]:
# Initialize a new W&B run to track the experiment
wandb.init(project='mlops_restautant_revenue',
           config=None,
           tags = ['Final Model'])
wandb.run.name = "FinalModel"

In [None]:
# Create a new artifact in W&B to store the model
model_artifact = wandb.Artifact("Linear_Model_restaurantrevenue",
                                type = 'model',
                                description = 'Linear model for restaurant revenue prediction')
# Add the model directory to the artifact
model_artifact.add_dir(MODEL_DIR)

# Log the artifact to Weights & Biases
wandb.log_artifact(model_artifact)
wandb.save()
wandb.finish()

wandb: Adding directory to artifact (./restaurantmodela)... Done. 0.0s
