# Coffee Shop Revenue Prediction with Random Forest Regressor

In this notebook, we'll build a Random Forest Regressor that predicts a coffee shop's daily revenue (`Daily_Revenue`) from operational features such as customer count, average order value, and marketing spend. We'll follow these steps:

1. Data Loading and EDA
2. Feature Engineering
3. Model Training with Hyperparameter Tuning
4. Model Evaluation
5. Saving the Model

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Data Loading and EDA

In [2]:
df = pd.read_csv('coffee_shop_revenue.csv')

In [3]:
df.head()

| | Number_of_Customers_Per_Day | Average_Order_Value | Operating_Hours_Per_Day | Number_of_Employees | Marketing_Spend_Per_Day | Location_Foot_Traffic | Daily_Revenue |
|---|---|---|---|---|---|---|---|
| 0 | 152 | 6.74 | 14 | 4 | 106.62 | 97 | 1547.81 |
| 1 | 485 | 4.5 | 12 | 8 | 57.83 | 744 | 2084.68 |
| 2 | 398 | 9.09 | 6 | 6 | 91.76 | 636 | 3118.39 |
| 3 | 320 | 8.48 | 17 | 4 | 462.63 | 770 | 2912.2 |
| 4 | 156 | 7.44 | 17 | 2 | 412.52 | 232 | 1663.42 |


In [4]:
df.tail()

| | Number_of_Customers_Per_Day | Average_Order_Value | Operating_Hours_Per_Day | Number_of_Employees | Marketing_Spend_Per_Day | Location_Foot_Traffic | Daily_Revenue |
|---|---|---|---|---|---|---|---|
| 1995 | 372 | 6.41 | 11 | 4 | 466.11 | 913 | 2816.85 |
| 1996 | 105 | 3.01 | 11 | 7 | 12.62 | 235 | 337.97 |
| 1997 | 89 | 5.28 | 16 | 9 | 376.64 | 310 | 951.34 |
| 1998 | 403 | 9.41 | 7 | 12 | 452.49 | 577 | 4266.21 |
| 1999 | 89 | 6.88 | 13 | 14 | 78.46 | 322 | 914.24 |


In [5]:
df.isnull().sum()

| Column | Missing values |
|---|---|
| Number_of_Customers_Per_Day | 0 |
| Average_Order_Value | 0 |
| Operating_Hours_Per_Day | 0 |
| Number_of_Employees | 0 |
| Marketing_Spend_Per_Day | 0 |
| Location_Foot_Traffic | 0 |
| Daily_Revenue | 0 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Number_of_Customers_Per_Day  2000 non-null   int64  
 1   Average_Order_Value          2000 non-null   float64
 2   Operating_Hours_Per_Day      2000 non-null   int64  
 3   Number_of_Employees          2000 non-null   int64  
 4   Marketing_Spend_Per_Day      2000 non-null   float64
 5   Location_Foot_Traffic        2000 non-null   int64  
 6   Daily_Revenue                2000 non-null   float64
dtypes: float64(3), int64(4)
memory usage: 109.5 KB


In [8]:
df.describe()

| | Number_of_Customers_Per_Day | Average_Order_Value | Operating_Hours_Per_Day | Number_of_Employees | Marketing_Spend_Per_Day | Location_Foot_Traffic | Daily_Revenue |
|---|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 274.296 | 6.261215 | 11.667 | 7.947 | 252.61416 | 534.8935 | 1917.32594 |
| std | 129.441933 | 2.175832 | 3.438608 | 3.742218 | 141.136004 | 271.662295 | 976.202746 |
| min | 50.0 | 2.5 | 6.0 | 2.0 | 10.12 | 50.0 | -58.95 |
| 25% | 164.0 | 4.41 | 9.0 | 5.0 | 130.125 | 302.0 | 1140.085 |
| 50% | 275.0 | 6.3 | 12.0 | 8.0 | 250.995 | 540.0 | 1770.775 |
| 75% | 386.0 | 8.12 | 15.0 | 11.0 | 375.3525 | 767.0 | 2530.455 |
| max | 499.0 | 10.0 | 17.0 | 14.0 | 499.74 | 999.0 | 5114.6 |


In [9]:
df.shape

(2000, 7)
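
The `describe()` summary reports a minimum `Daily_Revenue` of -58.95, which is hard to read as real revenue. A minimal sketch of how such rows could be counted and inspected before modelling is shown here. The small `sample` frame is hypothetical, made-up data standing in for `df`, and whether to drop, clip, or keep such rows is a judgment call this notebook does not make.

```python
import pandas as pd

# Hypothetical stand-in for df: illustrative values only, not rows from the dataset
sample = pd.DataFrame({
    "Number_of_Customers_Per_Day": [152, 60, 398],
    "Average_Order_Value": [6.74, 2.60, 9.09],
    "Daily_Revenue": [1547.81, -58.95, 3118.39],
})

# Boolean mask of rows whose target value is negative
negative_mask = sample["Daily_Revenue"] < 0
n_negative = int(negative_mask.sum())
print(f"Rows with negative revenue: {n_negative}")

# Inspect the affected rows before deciding how to handle them
print(sample[negative_mask])
```

On the real data, the same mask would be built from `df["Daily_Revenue"]`.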

## 2. Feature Engineering

In [14]:
# Separate the six input features (X) from the prediction target (y)
X = df.drop('Daily_Revenue', axis=1)
y = df['Daily_Revenue']

In [17]:
X.shape

(2000, 6)

In [18]:
y.shape

(2000,)

## 3. Model Training with Hyperparameter Tuning

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
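
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a small sketch on synthetic arrays with the same shapes as `X` and `y`. The `X_demo` and `y_demo` arrays are random placeholders (an assumption for illustration), not the coffee shop data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shapes as the notebook's X (2000, 6) and y (2000,)
rng = np.random.default_rng(0)
X_demo = rng.random((2000, 6))
y_demo = rng.random(2000)

# 20% of the rows are held out for testing; the rest are used for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (1600, 6) (400, 6)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```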

In [21]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

In [23]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators' : [100, 200],
    'max_depth' : [ 10, 20, None],
    'min_samples_split' : [2, 5],
}
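
The grid above contains 2 × 3 × 2 = 12 hyperparameter combinations, and with the 3-fold cross-validation used in this notebook (`cv=3`) each one is fitted three times. A short sketch using scikit-learn's `ParameterGrid` confirms the count; the names `candidates`, `n_candidates`, and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
}

# Expand the grid into the explicit list of parameter combinations
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)

# Each candidate is trained once per cross-validation fold (cv=3 in this notebook)
n_fits = n_candidates * 3
print(n_candidates, n_fits)  # 12 36
```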

In [24]:
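# scoring is left at its default, so candidates are ranked by the
# estimator's own score method (R² for regressors)
ge_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=3,         # 3-fold cross-validation
    n_jobs=-1,    # use all available CPU cores
    verbose=1
)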


In [25]:
ge_search.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
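
After fitting, a `GridSearchCV` object exposes the winning combination through `best_params_`, its mean cross-validated score through `best_score_`, and the full per-candidate results through `cv_results_`. The sketch below shows how these could be inspected. It runs on synthetic data with a deliberately small grid so that it finishes quickly; `demo_search`, `X_demo`, and `y_demo` are hypothetical names, and the values it prints say nothing about the coffee shop model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration, not the coffee shop dataset)
rng = np.random.default_rng(42)
X_demo = rng.random((200, 3))
y_demo = 100 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1, 200)

# A small grid so the sketch runs in a few seconds
demo_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [10, 20], 'max_depth': [3, None]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Best hyperparameter combination and its mean cross-validated R²
print(demo_search.best_params_)
print(demo_search.best_score_)

# Per-candidate results, best first
results = pd.DataFrame(demo_search.cv_results_)
print(
    results[['params', 'mean_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
```

In this notebook the same attributes would be read from `ge_search`.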


In [26]:
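# predict() uses the best estimator, which GridSearchCV refits on the
# full training set by default (refit=True)
y_pred = ge_search.predict(X_test)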

## 4. Model Evaluation

In [28]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.9481536579516033
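
R² summarises the share of variance explained but does not say how large the errors are in revenue units. As a complement, mean absolute error (MAE) and root mean squared error (RMSE) can be computed with `sklearn.metrics`. The sketch below uses tiny hand-made arrays (`y_true_demo`, `y_pred_demo`, both hypothetical) so the arithmetic is easy to check; in the notebook, `y_test` and `y_pred` would take their place.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hand-made values for illustration: every prediction is off by exactly 10
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 310.0])

mae = mean_absolute_error(y_true_demo, y_pred_demo)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))  # sqrt of mean squared error
r2 = r2_score(y_true_demo, y_pred_demo)

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")
# MAE: 10.00, RMSE: 10.00, R2: 0.985
```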

## 5. Save the Trained Model

In [29]:
import pickle as pk

In [31]:
# Pickle the entire fitted GridSearchCV object, not only the best estimator
with open('model.pkl', 'wb') as fs:
    pk.dump(ge_search, fs)
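
A pickled model can later be restored with `pickle.load` and used for prediction without retraining. The sketch below shows the round trip on a small stand-in model trained on synthetic data and written to a temporary directory; `demo_model`, `X_demo`, and the temporary path are assumptions for illustration. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model on synthetic data (not the notebook's tuned model)
rng = np.random.default_rng(0)
X_demo = rng.random((100, 6))
y_demo = X_demo.sum(axis=1)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')

    # Save the fitted model
    with open(path, 'wb') as fs:
        pickle.dump(demo_model, fs)

    # Load it back, as a separate script or service would
    with open(path, 'rb') as fs:
        loaded_model = pickle.load(fs)

# The restored model reproduces the original model's predictions
same = np.allclose(demo_model.predict(X_demo[:5]), loaded_model.predict(X_demo[:5]))
print(same)  # True
```

Loading generally works most reliably with the same scikit-learn version that was used for saving.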