<a href="https://colab.research.google.com/github/sesekheigbe/superkart-retail-forecasting/blob/main/SuperKart_Model_Deployment_Notebook_GIT_Esekheigbe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem Statement**

## Business Context

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

## Objective

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

## Data Description

The data contains the different attributes of the various products and stores. The detailed data dictionary is given below.

- **Product_Id** - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
- **Product_Weight** - weight of each product
- **Product_Sugar_Content** - sugar content of each product like low sugar, regular and no sugar
- **Product_Allocated_Area** - ratio of the allocated display area of each product to the total display area of all the products in a store
- **Product_Type** - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
- **Product_MRP** - maximum retail price of each product
- **Store_Id** - unique identifier of each store
- **Store_Establishment_Year** - year in which the store was established
- **Store_Size** - size of the store depending on sq. feet like high, medium and low
- **Store_Location_City_Type** - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
- **Store_Type** - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
- **Product_Store_Sales_Total** - total revenue generated by the sale of that particular product in that particular store


# **Installing and Importing the necessary libraries**

In [None]:
#Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.3 huggingface_hub==0.30.1 streamlit==1.45.0 -q

**Note:**

- After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.

- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# For splitting the dataset
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Ensemble and tree-based regressors
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

# Libraries to get different metric scores
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)

# To create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline

# To tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder

# To serialize the model
import joblib

# os related functionalities
import os

# API request
import requests

# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi

# **Loading the dataset**

In [None]:
# Mount drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Read data
kart = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/SuperKart/SuperKart.csv")

In [None]:
# copying data to another variable to avoid compromising original data
data = kart.copy()

# **Data Overview**

View the first and last 5 rows of the dataset.

In [None]:
# First five rows of dataset
data.head()

In [None]:
# Last five rows of dataset
data.tail()

Shape of the dataset

In [None]:
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")

Check the data types of the columns for the dataset

In [None]:
data.info()

Statistical summary of the data

In [None]:
data.describe(include="all").T  # include="all" also summarizes the categorical columns, not just the numeric ones

Checking for duplicate values

In [None]:
data.duplicated().sum()

Checking for missing values

In [None]:
data.isnull().sum()
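
The checks above (data types, duplicates, missing values) can also be gathered into one small report. The following is a minimal sketch on a hypothetical toy frame standing in for `data`; the helper name `quality_report` is an assumption made for illustration and is not part of the notebook.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isnull().sum(),
            "n_unique": df.nunique(),  # missing values are not counted as a level
        }
    )


# Hypothetical toy frame standing in for `data`
toy = pd.DataFrame(
    {
        "Product_Sugar_Content": ["Low Sugar", "reg", "Regular", None],
        "Product_MRP": [120.5, 99.0, 99.0, 150.25],
    }
)

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())  # fully identical rows only

print(report)
print("duplicate rows:", n_duplicates)
```

On the real dataset, calling `quality_report(data)` would give the same three columns for every attribute in the data dictionary.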

# **Exploratory Data Analysis (EDA)**

## Univariate Analysis

In [None]:
# function to plot a boxplot and a histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )  # histogram; falls back to seaborn's automatic binning when bins is None
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

Product_Weight

In [None]:
histogram_boxplot(data, "Product_Weight")

Product_Allocated_Area

In [None]:
histogram_boxplot(data, "Product_Allocated_Area")

Product_MRP

In [None]:
histogram_boxplot(data, "Product_MRP")

Product_Store_Sales_Total

In [None]:
histogram_boxplot(data, "Product_Store_Sales_Total")

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

Product_Sugar_Content

In [None]:
labeled_barplot(data, "Product_Sugar_Content", perc=True)

Product_Type

In [None]:
labeled_barplot(data, "Product_Type", perc=True)

Store_Id

In [None]:
labeled_barplot(data, "Store_Id", perc=True)

Store_Size

In [None]:
labeled_barplot(data, "Store_Size", perc=True)

Store_Location_City_Type

In [None]:
labeled_barplot(data, "Store_Location_City_Type", perc=True)

Store_Type

In [None]:
labeled_barplot(data, "Store_Type", perc=True)

## Bivariate Analysis

Correlation Matrix

In [None]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(10, 5))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()

Product weight and product MRP correlate well with revenue.
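
To read the heatmap as a ranking, the correlations with the target can be pulled out and sorted by absolute value. The sketch below uses a hypothetical toy frame, and the helper name `correlation_with_target` is an assumption for illustration; the numbers it prints are not results from the SuperKart data.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy values, chosen only to illustrate the ranking
toy = pd.DataFrame(
    {
        "Product_MRP": [100.0, 120.0, 140.0, 160.0, 180.0],
        "Product_Allocated_Area": [0.05, 0.02, 0.04, 0.01, 0.03],
        "Product_Store_Sales_Total": [2000.0, 2500.0, 2900.0, 3300.0, 3900.0],
    }
)

ranked = correlation_with_target(toy, "Product_Store_Sales_Total")
print(ranked)
```

In the notebook, `correlation_with_target(data, "Product_Store_Sales_Total")` would list the same values shown in the target row of the heatmap.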

Check how the numeric features relate to our target variable

In [None]:
# plot a scatterplot of Product_Weight and Product_Store_Sales_Total
plt.figure(figsize=[8, 6])
sns.scatterplot(x=data.Product_Weight, y=data.Product_Store_Sales_Total)
plt.show()

In [None]:
# plot a scatterplot of Product_MRP and Product_Store_Sales_Total
plt.figure(figsize=[8, 6])
sns.scatterplot(x=data.Product_MRP, y=data.Product_Store_Sales_Total)
plt.show()

In [None]:
# plot a scatterplot of Product_Allocated_Area and Product_Store_Sales_Total
plt.figure(figsize=[8, 6])
sns.scatterplot(x=data.Product_Allocated_Area, y=data.Product_Store_Sales_Total)
plt.show()

Product type generating most revenue for the Company

In [None]:
df_revenue1 = data.groupby(["Product_Type"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
a = sns.barplot(x=df_revenue1.Product_Type, y=df_revenue1.Product_Store_Sales_Total)
a.set_xlabel("Product_Types")
a.set_ylabel("Revenue")
plt.show()

In [None]:
# Perform a groupby on Product_Sugar_Content and select Product_Store_Sales_Total
df_revenue2 = data.groupby(["Product_Sugar_Content"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
b = sns.barplot(x=df_revenue2.Product_Sugar_Content, y=df_revenue2.Product_Store_Sales_Total)
b.set_xlabel("Product_Sugar_Content")
b.set_ylabel("Revenue")
plt.show()

Find Store types and location generating more revenue for company

In [None]:
df_store_revenue = data.groupby(["Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
r = sns.barplot(x=df_store_revenue.Store_Id, y=df_store_revenue.Product_Store_Sales_Total)
r.set_xlabel("Stores")
r.set_ylabel("Revenue")
plt.show()

In [None]:
# Perform a groupby on Store_Size and select Product_Store_Sales_Total
df_revenue3 = data.groupby(["Store_Size"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
c = sns.barplot(x=df_revenue3.Store_Size, y=df_revenue3.Product_Store_Sales_Total)
c.set_xlabel("Store_Size")
c.set_ylabel("Revenue")
plt.show()

In [None]:
# Perform a groupby on Store_Location_City and select Product_Store_Sales_Total
df_revenue4 = data.groupby(["Store_Location_City_Type"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
d = sns.barplot(x=df_revenue4.Store_Location_City_Type, y=df_revenue4.Product_Store_Sales_Total)
d.set_xlabel("Store_Location_City_Type")
d.set_ylabel("Revenue")
plt.show()

In [None]:
# Perform a groupby on Store_type and select Product_Store_Sales_Total
df_revenue5 = data.groupby(["Store_Type"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
e = sns.barplot(x=df_revenue5.Store_Type, y=df_revenue5.Product_Store_Sales_Total)
e.set_xlabel("Store_Type")
e.set_ylabel("Revenue")
plt.show()

Checking distribution of target variables

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Store_Id", y="Product_Store_Sales_Total", hue = "Store_Id")
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_Store_Sales_Total")
plt.xlabel("Stores")
plt.ylabel("Product_Store_Sales_Total (of each product)")
plt.show()

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot(data=data, x="Store_Size", y="Product_Store_Sales_Total", hue = "Store_Size")
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Size Vs Product_Store_Sales_Total")
plt.xlabel("Store_Size")
plt.ylabel("Product_Store_Sales_Total (of each product)")
plt.show()

Relationships between other columns

In [None]:
# Plot the boxplot with x as Product_Type , y as Product_Weight and hue as Product_Type
plt.figure(figsize=[14, 8])
sns.boxplot(data = data, x = "Product_Type", y = "Product_Weight", hue = "Product_Type")
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_Weight")
plt.xlabel("Types of Products")
plt.ylabel("Product_Weight")
plt.show()

Check relationship between weight of product and its sugar content

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot(data = data, x = "Product_Sugar_Content", y = "Product_Weight", hue = "Product_Sugar_Content")
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Sugar_Content Vs Product_Weight")
plt.xlabel("Product_Sugar_Content")
plt.ylabel("Product_Weight")
plt.show()

Analyzing sugar content of different product types

In [None]:
plt.figure(figsize=(14, 8))
sns.heatmap(
    pd.crosstab(data["Product_Sugar_Content"], data["Product_Type"]),
    annot=True,
    fmt="g",
    cmap="viridis",
)
plt.ylabel("Product_Sugar_Content")
plt.xlabel("Product_Type")
plt.show()

- Low Sugar is dominant across most product types. Fruits and vegetables have the highest number of low sugar items (864), followed by Snack foods (804).

- This may indicate a health-conscious product portfolio or consumer preference toward low sugar.
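
Raw counts favor product types that simply have more rows, so a column-normalized crosstab can help confirm whether low sugar really dominates within each type. This is a minimal sketch on hypothetical toy rows, not on the SuperKart data.

```python
import pandas as pd

# Hypothetical toy rows standing in for `data`
toy = pd.DataFrame(
    {
        "Product_Sugar_Content": [
            "Low Sugar", "Low Sugar", "Regular",
            "Low Sugar", "Regular", "No Sugar",
        ],
        "Product_Type": [
            "Dairy", "Dairy", "Dairy",
            "Snack Foods", "Snack Foods", "Household",
        ],
    }
)

# normalize="columns" turns each product type's counts into shares that sum to 1
share = pd.crosstab(
    toy["Product_Sugar_Content"], toy["Product_Type"], normalize="columns"
)
print(share.round(2))
```

Passing the same `share` table to `sns.heatmap` with `fmt=".2f"` would give a proportion view of the heatmap above.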

Number of items of each product type sold in each of the stores

In [None]:
# Perform a crosstab operation between Store_Id and Product_Type
plt.figure(figsize=(14, 8))
sns.heatmap(
    pd.crosstab(data["Store_Id"], data["Product_Type"]),
    annot=True,
    fmt="g",
    cmap="viridis",
)
plt.ylabel("Stores")
plt.xlabel("Product_Type")
plt.show()

- This heatmap shows store OUT004 dominating in all product categories. My inference is that this store is either located in a high-demand region with a much larger footprint or serves as a central warehouse.
- Stores OUT001, OUT002 and OUT003 have a more balanced spread of product types.

Analyse trend: Different product types have different prices

In [None]:
plt.figure(figsize=(14, 8))
sns.boxplot(data=data, x="Product_Type", y="Product_MRP", hue="Product_Type")
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_MRP")
plt.xlabel("Product_Type")
plt.ylabel("Product_MRP (of each product)")
plt.show()

Find out how the Product_MRP varies with the different stores

In [None]:
plt.figure(figsize=(14, 8))
sns.boxplot(data=data, x="Store_Id", y="Product_MRP", hue="Store_Id")
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_MRP")
plt.xlabel("Stores")
plt.ylabel("Product_MRP (of each product)")
plt.show()

Delve deeper and carry out a detailed analysis of each of the stores

In [None]:
# OUT001
data.loc[data["Store_Id"] == "OUT001"].describe(include="all").T

Observations
- OUT001 is a Supermarket Type 1 store located in a Tier 2 city, with a store size of high. It was established in 1987.
- OUT001 has sold products whose MRP ranges from 71 to 227.
- Snack Foods have been sold the highest number of times in OUT001.
- The revenue generated from each product at OUT001 ranges from 2300 to 5000.
- Low sugar content products are sold the most.

In [None]:
data.loc[data["Store_Id"] == "OUT001", "Product_Store_Sales_Total"].sum()

Store_Id OUT001 generated a total revenue of 6,223,133 from the sale of goods

In [None]:
df_OUT001 = (
    data.loc[data["Store_Id"] == "OUT001"]
    .groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"]
    .sum()
)
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT001")
sns.barplot(x=df_OUT001.Product_Type, y=df_OUT001.Product_Store_Sales_Total)
plt.show()

- OUT001 has generated the highest revenue from the sale of fruits and vegetables and snack foods. Both the categories have contributed around 800,000 each.

In [None]:
# OUT002
data.loc[data["Store_Id"] == "OUT002"].describe(include="all").T

Observations
- OUT002 is a Food Mart located in a Tier 3 city, with a store size of small. It was established in 1998.
- OUT002 has sold products whose MRP ranges from 31 to 225.
- Fruits and vegetables are the most sold products in OUT002.
- The revenue generated from each product at OUT002 ranges from 33 to 2300.
- Low sugar content products are sold the most.

In [None]:
data.loc[data["Store_Id"] == "OUT002", "Product_Store_Sales_Total"].sum()

Store_Id OUT002 generated a total revenue of 2,030,910 from the sale of goods

In [None]:
df_OUT002 = (
    data.loc[data["Store_Id"] == "OUT002"]
    .groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"]
    .sum()
)
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT002")
sns.barplot(x=df_OUT002.Product_Type, y=df_OUT002.Product_Store_Sales_Total)
plt.show()

- OUT002 has generated the highest revenue from the sale of fruits and vegetables (~ 300,000) followed by snack foods (~ 250,000).

In [None]:
# OUT003
data.loc[data["Store_Id"] == "OUT003"].describe(include="all").T


Observations
- OUT003 is a Departmental Store located in a Tier 1 city, with a store size of medium. It was established in 1999.
- OUT003 has sold products whose MRP ranges from 86 to 266.
- Snack Foods are the most sold products in OUT003.
- The revenue generated from each product at OUT003 ranges from 3070 to 8000.
- Low sugar content products are sold the most.

In [None]:
data.loc[data["Store_Id"] == "OUT003", "Product_Store_Sales_Total"].sum()

Store_Id OUT003 generated a total revenue of 6,673,457 from the sale of goods

In [None]:
df_OUT003 = (
    data.loc[data["Store_Id"] == "OUT003"]
    .groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"]
    .sum()
)
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT003")
sns.barplot(x=df_OUT003.Product_Type, y=df_OUT003.Product_Store_Sales_Total)
plt.show()

OUT003 has generated the highest revenue from the sale of snack foods followed by fruits and vegetables, both contributing over 800,000 each.

In [None]:
# OUT004
data.loc[data["Store_Id"] == "OUT004"].describe(include="all").T

- OUT004 is a Supermarket Type 2 store located in a Tier 2 city, with a store size of medium. It was established in 2009.
- OUT004 has sold products whose MRP ranges from 83 to 198.
- Fruits and vegetables have been sold the highest number of times in OUT004.
- The revenue generated from each product at OUT004 ranges from 1561 to 5463.
- Low sugar content products are mostly sold.

In [None]:
data.loc[data["Store_Id"] == "OUT004", "Product_Store_Sales_Total"].sum()

Store_Id OUT004 generated a total revenue of 15,427,583 from the sale of goods

In [None]:
df_OUT004 = (
    data.loc[data["Store_Id"] == "OUT004"]
    .groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"]
    .sum()
)
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT004")
sns.barplot(x=df_OUT004.Product_Type, y=df_OUT004.Product_Store_Sales_Total)
plt.show()

- OUT004 has generated the highest revenue from the sale of fruits and vegetables (~ 2,500,000) followed by snack foods (~ 2,000,000).

Revenue generated by the stores from each of the product types

In [None]:
df1 = data.groupby(["Product_Type", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df1

- This is in line with earlier findings, with OUT004 generating more revenue due to sales of more products (~53%).
- OUT002 generated the lowest revenue due to being a small store in a Tier 3 city.
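
For a quick sense of scale, each store's share of overall revenue can be computed from the per-store totals reported in the observations above. This is a minimal sketch that hard-codes those four totals; in the notebook they would come from a `groupby` on `Store_Id`. Note that this is a share of revenue, which need not equal a share of product records.

```python
import pandas as pd

# Per-store revenue totals as reported in the observations above
store_totals = pd.Series(
    {
        "OUT001": 6_223_133,
        "OUT002": 2_030_910,
        "OUT003": 6_673_457,
        "OUT004": 15_427_583,
    },
    name="revenue",
)

# Share of overall revenue contributed by each store, in percent
share_pct = (100 * store_totals / store_totals.sum()).round(1)
print(share_pct.sort_values(ascending=False))
```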

Revenue generated by the stores from products having different levels of sugar content.




In [None]:
df2 = data.groupby(["Product_Sugar_Content", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df2

This shows the same trend as before: low sugar content products generated more revenue.

# **Data Preprocessing**

Replacing the values in the Product_Sugar_Content column

In [None]:
# Replacing "reg" with "Regular": both labels clearly refer to the same category, the spelling was just shortened.
# Assigning the result back avoids the chained in-place replace, which newer pandas versions warn about.
data["Product_Sugar_Content"] = data["Product_Sugar_Content"].replace({"reg": "Regular"})

In [None]:
# Count the quantity of products in each sugar category
data.Product_Sugar_Content.value_counts()

Exploring Pattern in Product ID

In [None]:
## extracting the first two characters from the Product_Id column and storing it in another column
data["Product_Id_char"] = data["Product_Id"].str[:2]
data.head()

In [None]:
data["Product_Id_char"].unique()

In [None]:
data.loc[data["Product_Id_char"] == "FD"]

In [None]:
data.loc[data["Product_Id_char"] == "DR"]

In [None]:
data.loc[data["Product_Id_char"] == "NC"]

Store's Age

The store's age is important and needs to be incorporated into the model because:
- An older store may be seen as more trustworthy than newer ones.
- Without proper attention, an older store may lack proper infrastructure, which impacts revenue generation.

In [None]:
# Outlet Age
data["Store_Age_Years"] = 2025 - data.Store_Establishment_Year

Group Product Types (total of 16) into 2 broad categories - Perishables and Non- Perishables.

In [None]:
perishables = [
    "Dairy",
    "Meat",
    "Fruits and Vegetables",
    "Breakfast",
    "Breads",
    "Seafood",
]

In [None]:
def change(x):
    if x in perishables:
        return "Perishables"
    else:
        return "Non Perishables"

In [None]:
data['Product_Type_Category'] = data['Product_Type'].apply(change)

In [None]:
data.head()

In [None]:
df_revenue1 = data.groupby(["Product_Type_Category"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
a = sns.barplot(x=df_revenue1.Product_Type_Category, y=df_revenue1.Product_Store_Sales_Total)
a.set_xlabel("Product_Type_category")
a.set_ylabel("Revenue")
plt.show()

This shows that non-perishables generated more revenue than perishables. However, there are only 6 perishable product types against 10 non-perishable ones, which likely accounts for part of the difference.
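
Because the two categories differ in size, it can help to look at the mean revenue per record alongside the total. The following is a minimal sketch on hypothetical toy values (the numbers are invented for illustration and are not SuperKart results).

```python
import pandas as pd

# Hypothetical toy rows: fewer perishable records, each with higher revenue
toy = pd.DataFrame(
    {
        "Product_Type_Category": [
            "Perishables", "Perishables",
            "Non Perishables", "Non Perishables",
            "Non Perishables", "Non Perishables",
        ],
        "Product_Store_Sales_Total": [3000.0, 3400.0, 2000.0, 2200.0, 1800.0, 2400.0],
    }
)

# Total, per-record mean, and record count for each category
summary = toy.groupby("Product_Type_Category")["Product_Store_Sales_Total"].agg(
    ["sum", "mean", "count"]
)
print(summary)
```

In this toy case the larger category wins on total revenue while the smaller one wins on mean revenue per record, which is the kind of pattern the total-only bar chart cannot show.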

Outlier Check

In [None]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove("Store_Establishment_Year")
numeric_columns.remove("Store_Age_Years")


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

- The boxplots show that outliers exist. We already observed that some product types contribute disproportionately to generated revenue (notably fruits and vegetables, and snack foods).
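
The boxplot whiskers use the 1.5 × IQR rule (`whis=1.5`), so the number of points drawn as outliers can also be counted directly. This is a minimal sketch; the helper name `iqr_outlier_count` and the toy series are assumptions for illustration.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, whis: float = 1.5) -> int:
    """Count values lying more than `whis` * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((series < lower) | (series > upper)).sum())


# Hypothetical toy values with one obvious extreme point
toy = pd.Series([10, 12, 11, 13, 12, 11, 50])
n_outliers = iqr_outlier_count(toy)
print(n_outliers)
```

Applied per column, for example `data[numeric_columns].apply(iqr_outlier_count)`, this would give one count for each boxplot above.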

Data Preparation for modeling

In [None]:
data.head()

Remove columns that are not required

In [None]:
data.drop(["Product_Id", "Product_Type", "Store_Establishment_Year"], axis=1, inplace=True)

In [None]:
data.shape

In [None]:
data.head()

In [None]:
# Separating features and the target column
X = data.drop("Product_Store_Sales_Total", axis=1)
y = data["Product_Store_Sales_Total"]

In [None]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, shuffle=True
)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Data Preprocessing Pipeline

In [None]:
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features

In [None]:
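# Create a preprocessing pipeline: one-hot encode the categorical features and
# pass the numeric features through unchanged (the default remainder="drop"
# would silently discard them before they reach the model)
preprocessor = make_column_transformer(
    (Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
    remainder="passthrough",
)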
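
The `remainder` argument of `make_column_transformer` controls what happens to columns that are not listed in any transformer. The sketch below, on a hypothetical two-column toy frame, compares the default (`"drop"`) with `"passthrough"` by counting the output columns.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one categorical and one numeric feature
toy = pd.DataFrame(
    {
        "Store_Size": ["High", "Small", "Medium", "Small"],
        "Product_MRP": [120.0, 95.5, 150.0, 80.0],
    }
)
categorical = ["Store_Size"]

# Default remainder="drop": unlisted columns are removed from the output
drop_rest = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical),
)

# remainder="passthrough": unlisted columns are appended unchanged
keep_rest = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical),
    remainder="passthrough",
)

n_cols_drop = drop_rest.fit_transform(toy).shape[1]
n_cols_keep = keep_rest.fit_transform(toy).shape[1]
print(n_cols_drop, n_cols_keep)
```

With three store sizes, the first transformer yields only the three one-hot columns, while the second also keeps `Product_MRP`.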

# **Model Building**

As per the project requirement, only two models are to be built.

## Define functions for Model Evaluation

- We'll fit different models on the train data and observe their performance.
- We'll try to improve that performance by tuning some hyperparameters available for that algorithm.
- We'll use `GridSearchCV` for hyperparameter tuning, with the R-squared score as the metric to optimize.
- R-squared (the coefficient of determination) is used to evaluate the performance of a regression model. It is the proportion of the variation in the dependent (target) variable that is predictable from the independent variables.
- Let's start by creating a function to get model scores, so that we don't have to repeat the same code.

In [None]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
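
As a sanity check on the two R-squared metrics defined above, both can be worked out by hand on a tiny example. This is a minimal sketch with made-up values; `n` and `k` stand for the number of rows and predictors, as in `adj_r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Adjusted R-squared penalizes the number of predictors k for a sample of size n
n, k = len(y_true), 1  # k = 1 is an assumed predictor count for this toy case
adj_r2_manual = 1 - (1 - r2_manual) * (n - 1) / (n - k - 1)

print(r2_manual, r2_score(y_true, y_pred), adj_r2_manual)
```

The manual value matches `r2_score`, and the adjusted value is slightly lower, as expected whenever `k` is at least 1 and R-squared is below 1.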

The ML models to be built can be any two out of the following:
1. Decision Tree
2. Bagging
3. Random Forest
4. AdaBoost
5. Gradient Boosting
6. XGBoost

I chose the Random Forest and XGBoost models because both are ensemble methods that tend to handle noisy data well and are less prone to overfitting than a single decision tree.



Random Forest Model

In [None]:
# Define random forest regressor

rf_estimator = RandomForestRegressor(random_state=1)
rf_pipeline = make_pipeline(preprocessor,rf_estimator)
rf_pipeline.fit(X_train, y_train)

Checking model performance on training set

In [None]:
rf_estimator_model_train_perf = model_performance_regression(rf_pipeline, X_train, y_train)
rf_estimator_model_train_perf

Checking model performance on test set

In [None]:
rf_estimator_model_test_perf = model_performance_regression(rf_pipeline, X_test, y_test)
rf_estimator_model_test_perf

XGBoost Regressor

In [None]:
# Define xgboost regressor

xgb_estimator = XGBRegressor(random_state=1)
xgb_pipeline = make_pipeline(preprocessor,xgb_estimator)
xgb_pipeline.fit(X_train, y_train)

Checking model performance on training set

In [None]:
xgb_estimator_model_train_perf = model_performance_regression(xgb_pipeline, X_train, y_train)
xgb_estimator_model_train_perf

Checking model performance on test set

In [None]:
xgb_estimator_model_test_perf = model_performance_regression(xgb_pipeline, X_test, y_test)
xgb_estimator_model_test_perf

# **Model Performance Improvement - Hyperparameter Tuning**

Hyperparameter Tuning - Random Forest

In [None]:
# random forest regressor


# Choose the type of regressor.
rf_tuned = RandomForestRegressor(random_state=1)
rf_tuned_pipeline = make_pipeline(preprocessor,rf_tuned)

# Grid of parameters to choose from
parameters = {
    "randomforestregressor__max_depth": [5, 10, 15, 20, None],
    "randomforestregressor__max_features": ["sqrt", "log2", None],
    "randomforestregressor__n_estimators": [100, 200, 300],
}

# Run the grid search with r2 as scoring. The scorer is passed by name: the raw
# r2_score function has the signature (y_true, y_pred), not the
# (estimator, X, y) signature that GridSearchCV expects from a callable.
grid_obj = GridSearchCV(rf_tuned_pipeline, parameters, scoring="r2", cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the model to the best combination of parameters
rf_tuned_pipeline = grid_obj.best_estimator_

# Fit the best model to the data
rf_tuned_pipeline.fit(X_train, y_train)
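
Two details of the grid search are easy to miss: `make_pipeline` names each step after its lowercased class name (hence the `randomforestregressor__` prefix), and with the default `refit=True` the `best_estimator_` is already refit on the full training data. The sketch below shows both on small synthetic data; the data, the scaler step, and the reduced grid are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (not the SuperKart data)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=60)

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=1))
step_names = list(pipe.named_steps)  # lowercased class names

grid = {
    "randomforestregressor__max_depth": [2, None],
    "randomforestregressor__n_estimators": [10, 20],
}

# scoring is given by name; refit=True (the default) refits the best setting
search = GridSearchCV(pipe, grid, scoring="r2", cv=3)
search.fit(X, y)

print(step_names)
print(search.best_params_)
print(round(search.best_score_, 3))

# best_estimator_ is already fitted, so it can predict straight away
preds = search.best_estimator_.predict(X[:5])
```

`search.cv_results_` holds the mean cross-validated score for every combination, which is useful when several settings perform almost equally well.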

Checking model performance on training set

In [None]:
rf_tuned_model_train_perf = model_performance_regression(rf_tuned_pipeline, X_train, y_train)
rf_tuned_model_train_perf

Checking model performance on test set

In [None]:
rf_tuned_model_test_perf = model_performance_regression(rf_tuned_pipeline, X_test, y_test)
rf_tuned_model_test_perf

Hyperparameter Tuning - XGBoost Regressor

In [None]:
# Create the XGBoost Regressor pipeline
xgb_tuned = XGBRegressor(random_state=1, objective='reg:squarederror')
xgb_tuned_pipeline = make_pipeline(preprocessor, xgb_tuned)

# Grid of parameters to choose from
parameters = {
    "xgbregressor__n_estimators": [100, 200, 300],
    "xgbregressor__subsample": [0.6, 0.8, 1.0],
    "xgbregressor__gamma": [0, 1, 5],
    "xgbregressor__colsample_bytree": [0.6, 0.8, 1.0],
    "xgbregressor__colsample_bylevel": [0.6, 0.8, 1.0]
}

# Run the grid search with r2 as scoring
grid_obj = GridSearchCV(xgb_tuned_pipeline, parameters, scoring='r2', cv=3, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the model to the best combination of parameters
xgb_tuned_pipeline = grid_obj.best_estimator_

# Fit the best model to the data
xgb_tuned_pipeline.fit(X_train, y_train)

Checking the model performance on training set

In [None]:
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned_pipeline, X_train, y_train)
xgb_tuned_model_train_perf

Checking the model performance on test set

In [None]:
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned_pipeline, X_test, y_test)
xgb_tuned_model_test_perf

# **Model Performance Comparison, Final Model Selection, and Serialization**

In [None]:
 # Training performance comparison

models_train_comp_df = pd.concat(
    [
        rf_estimator_model_train_perf.T,  # Random forest (baseline) train metrics
        rf_tuned_model_train_perf.T,  # Random forest (tuned) train metrics
        xgb_estimator_model_train_perf.T,  # XGBoost (baseline) train metrics
        xgb_tuned_model_train_perf.T,  # XGBoost (tuned) train metrics
    ],
    axis=1,
)

models_train_comp_df.columns = ["rf_estimator", "rf_tuned", "xgb_estimator", "xgb_tuned"] #Complete the code to define the names for the models

print("Training performance comparison:")
models_train_comp_df

In [None]:
 # Test performance comparison

models_test_comp_df = pd.concat(
    [
        rf_estimator_model_test_perf.T,  # Random forest (baseline) test metrics
        rf_tuned_model_test_perf.T,  # Random forest (tuned) test metrics
        xgb_estimator_model_test_perf.T,  # XGBoost (baseline) test metrics
        xgb_tuned_model_test_perf.T,  # XGBoost (tuned) test metrics
    ],
    axis=1,
)

models_test_comp_df.columns = ["rf_estimator", "rf_tuned", "xgb_estimator", "xgb_tuned"] #Complete the code to define the names for the models

print("Test performance comparison:")
models_test_comp_df

In [None]:
# Create a folder for storing the files needed for web app deployment
os.makedirs("backend_files", exist_ok=True)

In [None]:
# Define the file path to save (serialize) the trained model along with the data preprocessing steps
saved_model_path = "backend_files/revenue_forecast_model.joblib" #Complete the code to define the name of the model

In [None]:
# Save the best trained model pipeline using joblib
joblib.dump(rf_tuned_pipeline, saved_model_path) #Complete the code to pass the variable name of the best model

print(f"Model saved successfully at {saved_model_path}")

- All the models are very close in performance metrics, but I chose the rf_tuned model because it has a slightly better result and is easier to deploy than XGBoost.
- The XGBoost model typically demands more setup and tuning than random forest, but in this case it brings no additional benefit.
- The tuned models did not show any significant improvement over the baseline models, which suggests the baselines were already well optimized. I would still go with the tuned model, however, as it may handle future data better.
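
The selection reasoning above (take the best test score, but prefer the model that is easier to deploy when scores are nearly tied) can be sketched as a small rule. This is only an illustration: the function name `pick_model`, the tolerance, the preference order, and the scores are all assumptions, and the scores are placeholders rather than the notebook's results.

```python
def pick_model(test_r2, simpler_first, tolerance=0.005):
    """Pick a model by test R-squared, preferring simpler models on near-ties."""
    best_score = max(test_r2.values())
    # Keep every model whose score is within `tolerance` of the best one.
    contenders = [
        name for name, score in test_r2.items() if best_score - score <= tolerance
    ]
    # Among near-ties, take the model that appears earliest in the preference list.
    return min(contenders, key=simpler_first.index)


# Placeholder scores for illustration only; they are NOT the notebook's results.
placeholder_r2 = {
    "rf_estimator": 0.70,
    "rf_tuned": 0.71,
    "xgb_estimator": 0.70,
    "xgb_tuned": 0.712,
}

# Assumed preference: random forest before XGBoost, tuned before baseline.
preference = ["rf_tuned", "rf_estimator", "xgb_tuned", "xgb_estimator"]

chosen = pick_model(placeholder_r2, preference)
print(chosen)
```

With a tolerance of zero the rule reduces to picking the single highest score, so the tolerance is where the "easier to deploy" judgment enters.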

In [None]:
# Load the saved model pipeline from the file
saved_model = joblib.load("backend_files/revenue_forecast_model.joblib") #Complete the code to define the name of the saved model

# Confirm the model is loaded
print("Model loaded successfully.")

In [None]:
saved_model

In [None]:
# Make predictions on the test set using the deserialized model
saved_model.predict(X_test)

# **Deployment - Backend**

## Flask Web Framework


In [None]:
# Checking the features saved in model
import joblib
model = joblib.load("backend_files/revenue_forecast_model.joblib")
print(model.feature_names_in_)


I was initially getting a "Store_Id not found" error in my backend traceback information, so I included this line of code to confirm that all the features were saved prior to deployment.
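
The same check can be turned around to validate an incoming payload before it reaches the model. The sketch below is a hypothetical helper (`find_missing_features` is not part of the notebook); the feature names are copied from the `sample` dict in the backend app, and in the real app the list could instead come from `model.feature_names_in_`.

```python
def find_missing_features(payload, expected_features):
    """Return the expected feature names that the payload does not contain."""
    return [name for name in expected_features if name not in payload]


# Feature names copied from the `sample` dict used in the backend app.
EXPECTED_FEATURES = [
    "Product_Weight",
    "Product_Sugar_Content",
    "Product_Allocated_Area",
    "Product_MRP",
    "Store_Id",
    "Store_Size",
    "Store_Location_City_Type",
    "Store_Type",
    "Product_Id_char",
    "Store_Age_Years",
    "Product_Type_Category",
]

# Example payload that deliberately leaves out Store_Id.
example_payload = {name: None for name in EXPECTED_FEATURES if name != "Store_Id"}

missing = find_missing_features(example_payload, EXPECTED_FEATURES)
print(missing)  # ['Store_Id']
```

An endpoint could return a 400 response listing `missing` instead of failing later inside `predict`.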

In [None]:
%%writefile backend_files/app.py
# Import necessary libraries
import numpy as np
import joblib  # For loading the serialized model
import pandas as pd  # For data manipulation
from flask import Flask, request, jsonify  # For creating the Flask API
import traceback  # For full exception tracebacks

# Initialize Flask app
superkart_api = Flask("SuperKart Sales Api")

# Load the trained sales revenue forecast model
model = joblib.load("backend_files/revenue_forecast_model.joblib")
print("Model loaded. Features expected by the model:", model.feature_names_in_)

# Home route
def home():
    return "Welcome to SuperKart Sales Forecasting Api"

@superkart_api.get('/')
def home_route():
    return home()



# Prediction endpoint
@superkart_api.post('/v1/predict')
def predict_sales():
    # 1) Parse JSON payload
    data = request.get_json()
    print("Raw JSON payload:", data)

    # 2) Build the sample dict (keys must match training features)
    sample = {
        'Product_Weight': data['Product_Weight'],
        'Product_Sugar_Content': data['Product_Sugar_Content'],
        'Product_Allocated_Area': data['Product_Allocated_Area'],
        'Product_MRP': data['Product_MRP'],
        'Store_Id': data['Store_Id'],
        'Store_Size': data['Store_Size'],
        'Store_Location_City_Type': data['Store_Location_City_Type'],
        'Store_Type': data['Store_Type'],
        'Product_Id_char': data['Product_Id_char'],
        'Store_Age_Years': data['Store_Age_Years'],
        'Product_Type_Category': data['Product_Type_Category'],
    }
    print("Sample dict keys:", list(sample.keys()))

    # 3) Convert to DataFrame
    input_data = pd.DataFrame([sample])
    print("DataFrame columns before padding:", input_data.columns.tolist())

    # 4) Determine expected columns from model
    try:
        expected = list(model.feature_names_in_)
        print("Model expects columns:", expected)
    except Exception:
        print("Model feature_names_in_ not available; using input columns as fallback")
        expected = input_data.columns.tolist()

    # 5) Pad missing columns with None
    for col in expected:
        if col not in input_data.columns:
            input_data[col] = None

    # 6) Reorder columns
    input_data = input_data[expected]
    print("DataFrame columns after padding:", input_data.columns.tolist())
    print("Input DataFrame preview:\n", input_data)

    # 7) Predict and handle errors
    try:
        print("About to predict. Input columns:", input_data.columns.tolist())
        prediction = model.predict(input_data).tolist()[0]
        print("Prediction result:", prediction)
        return jsonify({'Sales': prediction})
    except Exception as e:
        print("Exception during prediction:\n", traceback.format_exc())
        return jsonify({'error': str(e)}), 500

# Run the Flask app in debug mode
if __name__ == '__main__':
    superkart_api.run(debug=True)


## Dependencies File

In [None]:
%%writefile backend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
seaborn==0.13.2
joblib==1.4.2
xgboost==2.1.4
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.32.3
uvicorn[standard]
streamlit==1.43.2

## Dockerfile

In [None]:
%%writefile backend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy all files from the current directory to the container's working directory
COPY . .

# Install dependencies from the requirements file without using cache to reduce image size
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:superkart_api`: Runs the Flask app (assuming `app.py` contains the Flask instance named `superkart_api`)
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:superkart_api"]


## Setting up a Hugging Face Docker Space for the Backend

In [None]:
# Import the login function from the huggingface_hub library
from huggingface_hub import login

# Login to your Hugging Face account using your access token
# Replace "YOUR_HUGGINGFACE_TOKEN" with your actual token
#login(token="YOUR_HUGGINGFACE_TOKEN")  # You can get your token from https://huggingface.co/settings/tokens
login(token="hf_token") #Complete the code to define the access token

# Import the create_repo function from the huggingface_hub library
from huggingface_hub import create_repo

In [None]:
# Try to create the repository for the Hugging Face Space
try:
    create_repo("superkart-sales-forecast",  # Define the name of the repository
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "docker"
        private=False  # Set to True if you want the space to be private
    )
except Exception as e:
    # Handle potential errors during repository creation
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")
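
The login cells in this section pass the access token as a string literal. One possible alternative, sketched here under the assumption that the token has been stored in an environment variable (the variable name `HF_TOKEN` and the helper `get_hf_token` are choices made for this example), is to read it at runtime so it never appears in the notebook.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face access token from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of attempting a login with an empty token.
        raise RuntimeError(f"Set the {env_var} environment variable before logging in.")
    return token


# In the notebook the helper would be used as:
#   login(token=get_hf_token())
```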

## Uploading Files to Hugging Face Space (Docker Space)

In [None]:
# for hugging face space authentication to upload files
from huggingface_hub import HfApi, upload_folder

access_key = "hf_token"  #Complete the code to define the access token
repo_id = "sesekheigbe/Superkart-Sales-Forecast"  #Complete the code to define the repo id.

# Login to Hugging Face platform with the access token
login(token=access_key)

# Initialize the API
api = HfApi()

# Upload the Flask backend files stored in the folder called backend_files
api.upload_folder(
    folder_path="backend_files",
    path_in_repo="",
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)

# **Deployment - Frontend**

## Points to note before executing the below cells
- Create a Streamlit space on Hugging Face by following the instructions provided on the content page titled **`Creating Spaces and Adding Secrets in Hugging Face`** from Week 1

## Streamlit for Interactive UI

In [None]:
# Create a folder for storing the files needed for frontend UI deployment
os.makedirs("frontend_files", exist_ok=True)

In [None]:
%%writefile frontend_files/app.py

import streamlit as st
import requests

# Title of the app
st.title("SuperKart Sales Forecasting Tool")

# Input fields for product and store data
Product_Weight = st.number_input("Product Weight (in kg)", min_value=0.0, value=12.66)

Product_Sugar_Content = st.selectbox(
    "Product Sugar Content",
    ["Low Sugar", "Regular", "No Sugar"]
)

Product_Allocated_Area = st.number_input(
    "Product Allocated Area (sq ft)", min_value=0.0, value=0.07
)

Product_MRP = st.number_input(
    "Product MRP (₹)", min_value=0.0, value=150.0
)

Store_Id = st.selectbox(
    "Store Id", ["OUT001", "OUT002", "OUT003", "OUT004"]
)

Store_Size = st.selectbox(
    "Store Size",
    ["Small", "Medium", "High"]
)

Store_Location_City_Type = st.selectbox(
    "Store Location City Type",
    ["Tier 1", "Tier 2", "Tier 3"]
)

Store_Type = st.selectbox(
    "Store Type",
    ["Department Store", "Food Mart", "Supermarket Type 1", "Supermarket Type 2"]
)

Product_Id_char = st.text_input(
    "Product ID Prefix (e.g., FDW, DRN, NC)",
    value="FDW"
)

Store_Age_Years = st.number_input(
    "Store Age (in years)", min_value=0, value=23
)

Product_Type_Category = st.selectbox(
    "Product Type Category",
     ["Perishables", "Non Perishables"]
)


# Prepare the data dictionary
product_data = {
    "Product_Weight": Product_Weight,
    "Product_Sugar_Content": Product_Sugar_Content,
    "Product_Allocated_Area": Product_Allocated_Area,
    "Product_MRP": Product_MRP,
    "Store_Id": Store_Id,
    "Store_Size": Store_Size,
    "Store_Location_City_Type": Store_Location_City_Type,
    "Store_Type": Store_Type,
    "Product_Id_char": Product_Id_char,
    "Store_Age_Years": Store_Age_Years,
    "Product_Type_Category": Product_Type_Category
}

# DEBUG: Show the request payload before submitting
st.subheader("🧾 Request JSON Preview")
st.json(product_data)

# Button to trigger prediction
if st.button("Predict", type='primary'):
    # Replace <user_name> and <space_name> with your actual Hugging Face values
    response = requests.post(
        "https://sesekheigbe-superkart-sales-forecast.hf.space/v1/predict",
        json=product_data
    )


    if response.status_code == 200:
        result = response.json()
        predicted_sales = result["Sales"]
        st.success(f"Predicted Product Store Sales Total: ₹{predicted_sales:.2f}")
    else:
        st.error("Error in API request")
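
The backend returns a JSON body of the form `{'error': ...}` with status 500 when prediction fails, but the frontend above shows only a generic message. A minimal sketch of surfacing that detail follows; `describe_response` is a hypothetical helper, not part of the app, and it works on the status code and the already-parsed JSON body.

```python
def describe_response(status_code, body):
    """Turn a backend reply into an (ok, message) pair for display."""
    if status_code == 200 and isinstance(body, dict) and "Sales" in body:
        return True, f"Predicted Product Store Sales Total: ₹{body['Sales']:.2f}"
    # Fall back to a generic detail when the body is not a dict with an 'error' key.
    if isinstance(body, dict):
        detail = body.get("error", "no details returned")
    else:
        detail = "no details returned"
    return False, f"API request failed (status {status_code}): {detail}"


print(describe_response(200, {"Sales": 1234.5}))
print(describe_response(500, {"error": "boom"}))
```

In the Streamlit app, the returned message could be passed to `st.success` or `st.error` depending on the first element of the pair.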


## Dependencies File

In [None]:
%%writefile frontend_files/requirements.txt
requests==2.32.3
streamlit==1.45.0

## Dockerfile

In [None]:
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Create Streamlit config (port 8501, matching the CMD below)
RUN mkdir -p /app/.streamlit && \
    echo "\
[server]\n\
port = 8501\n\
enableCORS = false\n\
enableXsrfProtection = false\n\
headless = true\n\
\n\
[browser]\n\
gatherUsageStats = false\n\
" > /app/.streamlit/config.toml


# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# Define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

# NOTE: Disable XSRF protection for easier external access in order to make batch predictions

## Uploading Files to Hugging Face Space (Streamlit Space)

In [None]:
access_key = "hf_token"  #Complete the code to define the access token
repo_id = "sesekheigbe/superkartforecast-ui"  #Complete the code to define the repo id

# Login to Hugging Face platform with the access token
login(token=access_key)

# Initialize the API
api = HfApi()

# Upload Streamlit app files stored in the folder called frontend_files
api.upload_folder(
    folder_path="frontend_files",
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)

Backend link:   https://huggingface.co/spaces/sesekheigbe/superkart-sales-forecast?logs=container

Frontend link: https://huggingface.co/spaces/sesekheigbe/superkartforecast-ui

# **Actionable Insights and Business Recommendations**

Actionable Insights:

1. Non-perishable items (e.g. Frozen Foods, Snacks, Canned Foods) collectively generate significantly more revenue than perishables. However, by individual product type, fruits and vegetables (perishable) and snacks (non-perishable) are the biggest contributors to revenue.

2. Larger store sizes (especially 'Medium'), store type (Supermarket Type 2), and city type (Tier 2) seem to have positive effects on revenue. These also happen to fit the description of OUT004, which has 3 to 4 times the product count of each of the other individual stores. This can skew the data in favor of these features. OUT004 looks to me like a distribution or wholesale center.

3. Product MRP and weight influence sales numbers. Products with higher MRP and larger weights correlate positively with revenue, highlighting the importance of premium pricing, which is also influenced by the product weight.

4. Low Sugar is dominant across most product types, contributing to more revenue. Fruits and vegetables have the highest number of low sugar items (864), followed by Snack foods (804). This may indicate a health-conscious product portfolio or consumer preference toward low sugar.

5. The RF and XGBoost models gave similar performance (R-squared of 0.668, adjusted R-squared of 0.667, and MAPE of 0.187). However, RF was chosen because XGBoost typically demands more setup and tuning than random forest, in this case with no additional benefit.


Business Recommendations

1. Prioritize expansion of high-demand perishables like fresh fruits and vegetables, and of non-perishable products like Snacks, Frozen Foods, and Canned Goods. Strengthen marketing, promotions, and shelf space allocation for such high-performing items.

2. Store OUT004, from the data provided, outpaced all other stores in product sales. This store should be studied, and its best practices for increasing revenue should be applied in running the other stores.

3. Optimize Store Layout and Size.
Invest in expanding or redesigning smaller stores toward medium/large formats to maximize revenue potential.

4. Stores like OUT003, which is already rightly sized and has high MAR, can have their revenue optimized by redistributing the product selection to shelve more of the high-performing products. This can help enhance overall revenue.

5. Implement an inventory control strategy that ensures fast-moving products are restocked in a timely manner. Take advantage of health-conscious communities by ensuring low sugar items continue to dominate the shelves.

6. Continue to gather data, comparing the model's forecast with the actual revenue. This will help in fine-tuning the model, if required, for better forecast reliability.
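
The comparison of forecasts with actual revenue in the last recommendation could be automated with a simple check like the one below. It is a sketch under several assumptions: the reported MAPE of 0.187 is treated as a fraction and used as the baseline, the `slack` margin and the function names are invented for this example, and the sales figures are made up.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error as a fraction (0.187 means 18.7%)."""
    # Skip zero actuals, for which a percentage error is undefined.
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("need at least one non-zero actual value")
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def needs_review(actuals, forecasts, baseline_mape=0.187, slack=0.05):
    """Flag a batch whose error exceeds the baseline by more than `slack`."""
    return mape(actuals, forecasts) > baseline_mape + slack


# Made-up figures for illustration only.
actual_sales = [100.0, 200.0, 400.0]
forecast_sales = [110.0, 180.0, 400.0]

print(mape(actual_sales, forecast_sales))
print(needs_review(actual_sales, forecast_sales))
```

Running this on each new period of actuals would show whether the deployed model is drifting away from the error level measured at training time.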