# Module 01: Demand Forecasting and Inventory Optimization

This notebook contains the exploratory data analysis (EDA) for the first module of the **"Intelligent System for Supply Chain Management"** project. 

The main objective is to optimize inventory and purchasing management, with a target of **reducing overstocking by 20%** within 6 months.

- Target Variable for Inventory Optimization: **Stock_Quantity**
- Target Variable for Demand Forecasting: **Sales_Volume**

## DATA ACQUISITION

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import plotly.express as px
import plotly.io as pio

from sklearn.preprocessing import OrdinalEncoder

from smart_supply_chain_ai.data_processing import get_data

import warnings
warnings.filterwarnings('ignore')

# Set up display options and plotting template
pd.set_option('display.max_columns', None)
pio.templates.default = "plotly_white"
px.defaults.width = 800
px.defaults.height = 600

### Load Raw Data

In [2]:
# Define data paths
raw_data_path = os.path.join('../data', 'raw')

In [3]:
# Download Data from Kaggle
# link for data on web - [USER] / [DATASET_NAME]
module_one = "salahuddinahmedshuvo/grocery-inventory-and-sales-dataset"
# Download Data and Unzip 
get_data.download_kaggle_dataset(module_one, raw_data_path);

Starting the download of dataset 'salahuddinahmedshuvo/grocery-inventory-and-sales-dataset' from Kaggle...
Dataset URL: https://www.kaggle.com/datasets/salahuddinahmedshuvo/grocery-inventory-and-sales-dataset
Download, unzipping, and cleanup complete! The dataset was saved to: ../data/raw


In [4]:
# Load the raw dataset
df_raw = pd.read_csv(os.path.join(raw_data_path, 'Grocery_Inventory_and_Sales_Dataset.csv'))

## DATA CLEANING & PREPROCESSING

In [5]:
# Data dictionary: column names and their descriptions
column_inventory = {
    'Product_ID': 'Unique identifier for each product.',
    'Product_Name': 'Name of the product.',
    'Category': 'The product category (e.g., Grains & Pulses, Beverages, Fruits & Vegetables).',
    'Supplier_ID': 'Unique identifier for the product supplier.',
    'Supplier_Name': 'Name of the supplier.',
    'Stock_Quantity': 'The current stock level of the product in the warehouse.',
    'Reorder_Level': 'The stock level at which new stock should be ordered.',
    'Reorder_Quantity': 'The quantity of product to order when the stock reaches the reorder level.',
    'Unit_Price': 'Price per unit of the product.',
    'Date_Received': 'The date the product was received into the warehouse.',
    'Last_Order_Date': 'The last date the product was ordered.',
    'Expiration_Date': 'The expiration date of the product, if applicable.',
    'Warehouse_Location': 'The warehouse address where the product is stored.',
    'Sales_Volume': 'The total number of units sold.',
    'Inventory_Turnover_Rate': 'The rate at which the product sells and is replenished.',
    'Status': 'Current status of the product (e.g., Active, Discontinued, Backordered).',
    'Stock_Value': 'The total monetary value of the current stock (Stock_Quantity * Unit_Price).',
    'Days_For_Expiration': 'The number of days between Date_Received and Expiration_Date. Negative values indicate the product expired before it was received.',
    'Expiration_Status': 'Categorical status based on the expiration date (e.g., Expired, Nearing, Safe).',
    'Purchase_Order': 'The total monetary value of the new order (Reorder_Quantity * Unit_Price), used to analyze discrepancies.',
    'Stock_Coverage_Days': 'The number of days the current inventory can last, based on the annual sales rate.',
    'Delivery_Lag': 'The number of days between the last order date and the date the product was received.'
}

In [6]:
# Make a copy for manipulation
df = df_raw.copy()

In [7]:
### Standardize Column Names
# Rename 'Catagory' to 'Category' for consistency
df.rename(columns={"Catagory": "Category"}, inplace=True)

In [8]:
# Inspect the rows with a missing 'Category' value
df[df['Category'].isna()]

Unnamed: 0,Product_ID,Product_Name,Category,Supplier_ID,Supplier_Name,Stock_Quantity,Reorder_Level,Reorder_Quantity,Unit_Price,Date_Received,Last_Order_Date,Expiration_Date,Warehouse_Location,Sales_Volume,Inventory_Turnover_Rate,Status
685,10-378-9729,Cabbage,,83-941-9620,Rooxo,69,21,68,$66.55,12/23/2024,11/26/2024,9/21/2024,2 Butterfield Pass,36,35,Discontinued


In [9]:
# List unique categories
df.Category.unique()

array(['Grains & Pulses', 'Beverages', 'Fruits & Vegetables',
       'Oils & Fats', 'Dairy', 'Bakery', 'Seafood', nan], dtype=object)

### Handle Missing Values

In [10]:
# The analysis shows one missing value in the 'Category' column
# It corresponds to 'Cabbage', which belongs to 'Fruits & Vegetables'
# Fill only that column so missing values elsewhere are not overwritten
df['Category'] = df['Category'].fillna('Fruits & Vegetables')
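
A hard-coded fill works for a single known gap. As a minimal sketch of a more general approach, the missing category could be inferred from other rows that share the same product name. The toy frame below is illustrative only, and the sketch assumes each product name maps to one dominant category, which has not been verified on the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Product_Name": ["Cabbage", "Cabbage", "Milk", "Cabbage"],
    "Category": ["Fruits & Vegetables", None, "Dairy", "Fruits & Vegetables"],
})

# Most frequent known category for each product name
category_by_product = (
    toy.dropna(subset=["Category"])
       .groupby("Product_Name")["Category"]
       .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing 'Category' cells; every other column is left untouched
toy["Category"] = toy["Category"].fillna(toy["Product_Name"].map(category_by_product))
print(toy)
```

Rows whose product name never appears with a known category would stay missing and still need a manual decision.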

### Convert Data Types

In [11]:
# Check general information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Product_ID               990 non-null    object
 1   Product_Name             990 non-null    object
 2   Category                 990 non-null    object
 3   Supplier_ID              990 non-null    object
 4   Supplier_Name            990 non-null    object
 5   Stock_Quantity           990 non-null    int64 
 6   Reorder_Level            990 non-null    int64 
 7   Reorder_Quantity         990 non-null    int64 
 8   Unit_Price               990 non-null    object
 9   Date_Received            990 non-null    object
 10  Last_Order_Date          990 non-null    object
 11  Expiration_Date          990 non-null    object
 12  Warehouse_Location       990 non-null    object
 13  Sales_Volume             990 non-null    int64 
 14  Inventory_Turnover_Rate  990 non-null    i

In [12]:
# Convert date columns to datetime objects
date_columns = ['Date_Received', 'Last_Order_Date', 'Expiration_Date']
df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')
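
Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can help to count how many values were lost in the conversion. The sketch below uses a toy frame and assumes the month/day/year layout seen in the sample row above (for example `12/23/2024`). The explicit `format` argument is an addition for illustration, not part of the original cell.

```python
import pandas as pd

# Toy input: one valid date per column plus one malformed entry (illustrative only)
raw = pd.DataFrame({
    "Date_Received": ["12/23/2024", "not a date"],
    "Expiration_Date": ["9/21/2024", "1/5/2025"],
})

# Parse with an explicit month/day/year format; invalid strings become NaT
parsed = raw.apply(pd.to_datetime, format="%m/%d/%Y", errors="coerce")

# Number of values per column that could not be parsed
n_failed = parsed.isna().sum()
print(n_failed)
```

A non-zero count for any column would be worth inspecting before computing date differences.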

In [13]:
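# Convert 'Unit_Price' to float by removing the literal '$' sign
# regex=False makes the literal match explicit regardless of the pandas version
df['Unit_Price'] = df['Unit_Price'].str.replace('$', '', regex=False).astype('float')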
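
If the price column ever contains malformed entries, `astype('float')` raises an error. A more tolerant variant, sketched here on toy values, uses `pd.to_numeric` with `errors='coerce'` so that bad entries become `NaN` and can be reviewed afterwards. The `"n/a"` entry is a hypothetical example and does not come from the dataset.

```python
import pandas as pd

# Toy price strings; the last entry is a hypothetical malformed value
prices = pd.Series(["$66.55", "$4.50", "n/a"])

# Strip the literal dollar sign, then convert; unparseable values become NaN
cleaned = prices.str.replace("$", "", regex=False)
unit_price = pd.to_numeric(cleaned, errors="coerce")
print(unit_price)
```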

In [14]:
# Convert categorical columns to the 'category' type for memory efficiency
cat_columns = ['Category', 'Status']
df[cat_columns] = df[cat_columns].astype('category')

In [15]:
# Statistics for Numeric columns
df.describe(exclude=['datetime', 'object', 'category'])

Unnamed: 0,Stock_Quantity,Reorder_Level,Reorder_Quantity,Unit_Price,Sales_Volume,Inventory_Turnover_Rate
count,990.0,990.0,990.0,990.0,990.0,990.0
mean,55.609091,51.215152,51.913131,5.924192,58.925253,50.150505
std,26.300775,29.095241,29.521059,6.49128,23.002318,28.798954
min,10.0,1.0,1.0,0.2,20.0,1.0
25%,33.0,25.25,25.0,2.5,39.0,25.0
50%,56.0,53.0,54.0,4.225,58.0,50.0
75%,79.0,77.0,77.0,7.0,78.0,74.75
max,100.0,100.0,100.0,98.43,100.0,100.0


In [16]:
# Statistics for Categorical columns
df.describe(include=['category'])

Unnamed: 0,Category,Status
count,990,990
unique,7,3
top,Fruits & Vegetables,Discontinued
freq,332,333


In [17]:
# Statistics for String columns
df.describe(include=['object'])

Unnamed: 0,Product_ID,Product_Name,Supplier_ID,Supplier_Name,Warehouse_Location
count,990,990,990,990,990
unique,990,121,990,350,990
top,29-205-1132,Bread Flour,38-037-1699,Katz,48 Del Sol Trail
freq,1,19,1,12,1


In [18]:
# Range of Dates
# Reuse 'date_columns' so the f-strings contain no nested quotes (nested same-type quotes require Python 3.12+)
print(f'\tDate Min value\n\n{df[date_columns].min()}')
print(30 * '-')
print('')
print(f'\tDate Max value\n\n{df[date_columns].max()}')

	Date Min value

Date_Received     2024-02-25
Last_Order_Date   2024-02-25
Expiration_Date   2024-02-25
dtype: datetime64[ns]
------------------------------

	Date Max value

Date_Received     2025-02-24
Last_Order_Date   2025-02-24
Expiration_Date   2025-02-24
dtype: datetime64[ns]


In [19]:
# Check for duplicate 'Product_ID' values (0 means every ID is unique)
df['Product_ID'].duplicated().sum()

np.int64(0)

In [20]:
# Check for duplicate 'Supplier_ID' values (0 means every ID is unique)
df['Supplier_ID'].duplicated().sum()

np.int64(0)

## FEATURE ENGINEERING

### Create New Business Metrics

In [21]:
# 1. Calculate the 'Stock_Value'
df['Stock_Value'] = df['Stock_Quantity'] * df['Unit_Price']

In [22]:
# 2. Calculate the 'Days_For_Expiration'
df['Days_For_Expiration'] = (df['Expiration_Date'] - df['Date_Received']).dt.days.astype('Int64')

In [23]:
# 3. Create 'Expiration_Status' (Expired, Nearing, Safe)
df['Expiration_Status'] = np.where(df['Days_For_Expiration'] < 0, 'Expired', 
                                         np.where(df['Days_For_Expiration'] < 30, 'Nearing', 'Safe'))
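
The nested `np.where` calls encode two thresholds (0 and 30 days). As an alternative sketch, `pd.cut` expresses the same bins in one call and leaves missing values as missing instead of assigning them a label. The toy series below is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy day counts, including the boundary values and one missing entry
days = pd.Series([-5, 0, 29, 30, np.nan])

# Left-closed bins: [-inf, 0) -> Expired, [0, 30) -> Nearing, [30, inf) -> Safe
status = pd.cut(
    days,
    bins=[-np.inf, 0, 30, np.inf],
    right=False,
    labels=["Expired", "Nearing", "Safe"],
)
print(status)
```

With `right=False`, a value of exactly 0 falls into "Nearing" and exactly 30 into "Safe", which matches the strict `<` comparisons used in the cell above.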


In [24]:
# 4. Calculate 'Stock_Coverage' in days (based on 1-year sales window)
# The 'Inventory_Turnover_Rate' is a standard metric used to calculate this value
df['Stock_Coverage_Days'] = (365 / df['Inventory_Turnover_Rate']).apply(np.floor).astype('int')
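
To make the formula concrete, here is a small worked sketch on toy turnover rates. It also guards against a turnover of 0, which would otherwise cause a division by zero. The describe output above shows a minimum of 1 in this dataset, so the guard is a defensive assumption rather than a fix for an observed problem.

```python
import numpy as np
import pandas as pd

# Toy turnover rates; the 0 is a hypothetical edge case
turnover = pd.Series([50, 1, 100, 0])

# Coverage in days = floor(365 / turnover); a turnover of 0 yields a missing value
coverage_days = np.floor(365 / turnover.replace(0, np.nan)).astype("Int64")
print(coverage_days)
```

Since 365 / 46 is about 7.9 and 365 / 45 is about 8.1, a product falls below an 8-day coverage threshold exactly when its integer turnover rate is 46 or higher.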

In [25]:
# 5. Create 'Purchase_Order' from 'Reorder_Quantity' to check for discrepancies
df['Purchase_Order'] = df['Reorder_Quantity'] * df['Unit_Price']

In [26]:
# 6. Calculate Supplier Delivery Lag
# Assume 'Date_Received' refers to the most recent reception after the 'Last_Order_Date'
df['Delivery_Lag'] = (df['Date_Received'] - df['Last_Order_Date']).dt.days
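
The assumption in the comment above can be checked directly: a negative lag means the product was received before its last order date, so that order cannot be the one that produced the delivery. The sketch below, on toy data with hypothetical supplier names, counts such rows and summarizes the remaining lags per supplier.

```python
import pandas as pd

# Toy orders with hypothetical suppliers "A" and "B"
orders = pd.DataFrame({
    "Supplier_Name": ["A", "A", "B"],
    "Last_Order_Date": pd.to_datetime(["2024-11-26", "2024-05-01", "2024-08-10"]),
    "Date_Received": pd.to_datetime(["2024-12-23", "2024-04-20", "2024-08-15"]),
})
orders["Delivery_Lag"] = (orders["Date_Received"] - orders["Last_Order_Date"]).dt.days

# Rows where the product was received before the last order was placed
negative_lag = orders["Delivery_Lag"] < 0
print(f"Rows with negative lag: {negative_lag.sum()}")

# Median lag per supplier, using only rows with a non-negative lag
lag_by_supplier = (
    orders.loc[~negative_lag]
          .groupby("Supplier_Name")["Delivery_Lag"]
          .median()
)
print(lag_by_supplier)
```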

## EXPLORATORY DATA ANALYSIS (EDA)

In [27]:
# Descriptive Statistics
df.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
Stock_Quantity,990.0,55.609091,10.0,33.0,56.0,79.0,100.0,26.300775
Reorder_Level,990.0,51.215152,1.0,25.25,53.0,77.0,100.0,29.095241
Reorder_Quantity,990.0,51.913131,1.0,25.0,54.0,77.0,100.0,29.521059
Unit_Price,990.0,5.924192,0.2,2.5,4.225,7.0,98.43,6.49128
Date_Received,990.0,2024-08-23 02:18:10.909090816,2024-02-25 00:00:00,2024-05-27 00:00:00,2024-08-19 00:00:00,2024-11-23 00:00:00,2025-02-24 00:00:00,
Last_Order_Date,990.0,2024-08-25 19:20:43.636363520,2024-02-25 00:00:00,2024-05-29 00:00:00,2024-08-20 12:00:00,2024-11-29 00:00:00,2025-02-24 00:00:00,
Expiration_Date,990.0,2024-08-23 06:45:49.090909184,2024-02-25 00:00:00,2024-05-23 00:00:00,2024-08-23 12:00:00,2024-11-23 00:00:00,2025-02-24 00:00:00,
Sales_Volume,990.0,58.925253,20.0,39.0,58.0,78.0,100.0,23.002318
Inventory_Turnover_Rate,990.0,50.150505,1.0,25.0,50.0,74.75,100.0,28.798954
Stock_Value,990.0,336.014859,2.0,108.0,209.0,413.75,5512.08,435.422617


### A. Analysis of Inventory and Product Status

In [28]:
# 1. Distribution of products by status (Active, Discontinued, Backordered)
df.Status.value_counts()
# Visual: Bar plot of product status

Status
Discontinued    333
Active          332
Backordered     325
Name: count, dtype: int64

In [29]:
fig = px.bar(df, 'Status', title='Product Status')
fig.show()

In [30]:
# 2. Identify and quantify inactive and expired stock
# Double-quoted f-strings avoid nested same-type quotes, which require Python 3.12+
print(f"Total Products Expired: {df[df['Expiration_Status'] == 'Expired'].shape[0]}") # Count expired products
print(f"Total Products Discontinued: {df[df['Status'] == 'Discontinued'].shape[0]}") # Count discontinued products

Total Products Expired: 496
Total Products Discontinued: 333


In [31]:
# 3. Financial value of expired stock
df[df['Expiration_Status'] == 'Expired']['Stock_Value'].sum()

np.float64(172843.43)
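
The absolute figure is easier to interpret as a share of total stock value. As a minimal sketch on toy numbers, one grouped aggregation gives the product count, total value, and value share for each expiration status.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Expiration_Status": ["Expired", "Safe", "Expired", "Nearing"],
    "Stock_Value": [100.0, 50.0, 25.0, 25.0],
})

# Count and total value per status, then each status's share of the overall value
summary = toy.groupby("Expiration_Status")["Stock_Value"].agg(products="count", value="sum")
summary["value_share"] = summary["value"] / summary["value"].sum()
print(summary)
```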

### B. Analysis of Stock Value by Category

In [32]:
# 1. Top categories by total stock value
df.groupby('Category')['Stock_Value'].sum().sort_values(ascending=False)

Category
Fruits & Vegetables    90625.11
Beverages              62942.25
Seafood                62515.90
Dairy                  50601.95
Grains & Pulses        31969.20
Oils & Fats            17211.50
Bakery                 16788.80
Name: Stock_Value, dtype: float64
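
To see how concentrated the stock value is, the totals printed above can be turned into cumulative shares. The sketch below copies the seven category totals from the output of the previous cell. Nothing else is assumed.

```python
import pandas as pd

# Category totals copied from the output above, already sorted in descending order
stock_value = pd.Series({
    "Fruits & Vegetables": 90625.11,
    "Beverages": 62942.25,
    "Seafood": 62515.90,
    "Dairy": 50601.95,
    "Grains & Pulses": 31969.20,
    "Oils & Fats": 17211.50,
    "Bakery": 16788.80,
})

# Share of total stock value per category and the running total of those shares
share = stock_value / stock_value.sum()
cumulative_share = share.cumsum()
print(cumulative_share.round(3))
```

The three largest categories (Fruits & Vegetables, Beverages, Seafood) together account for roughly 65% of the total stock value.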

In [33]:
# Visual: Bar plot of stock value by category
fig = px.histogram(df, x='Category', y='Stock_Value', title='Stock Value by Category')
fig.show()

In [34]:
# 2. Analysis of suppliers by stock value
# Top 5 suppliers by total stock value
df_supplier_stock = df.groupby(by='Supplier_Name', as_index=False)['Stock_Value'].sum().sort_values(by='Stock_Value', ascending=False).head(5)
df_supplier_stock

Unnamed: 0,Supplier_Name,Stock_Value
332,Youfeed,6429.9
298,Vinder,6285.08
228,Rooxo,5076.95
166,Meeveo,4208.54
91,Feedfire,3552.5


In [35]:
# Visual: Scatter plot of stock quantity vs. stock value by supplier
fig = px.scatter(df, x='Stock_Quantity', y='Stock_Value', color='Supplier_Name',
                 title='Stock Quantity vs. Stock Value by Supplier')
fig.show()

### C. Analysis of Stock Coverage and Risk

In [36]:
# 1. Identify products with low stock coverage (less than 8 days)
df[df['Stock_Coverage_Days'] < 8].shape[0]

542

In [37]:
# Visual: Histogram of 'Stock_Coverage_Days' for products with less than 8 days of coverage
fig = px.histogram(df.query('Stock_Coverage_Days < 8'), x='Stock_Coverage_Days', title='Distribution of Stock Coverage Below 8 Days')
fig.show()

In [38]:
# 2. Identify potential discrepancies
# Compare 'Stock_Quantity' with 'Reorder_Level' for active products
# Filter once and count per expiration status (also avoids nested quotes in f-strings, which require Python 3.12+)
below_reorder = df.query('Status == "Active" and Stock_Quantity < Reorder_Level')
status_counts = below_reorder['Expiration_Status'].value_counts()
print('Active Products with Stock Quantities Inconsistent with Reorder Levels')
print(f'Total: {len(below_reorder)}')
print(f'Products Expired: {status_counts.get("Expired", 0)}')
print(f'Products Nearing Expiration: {status_counts.get("Nearing", 0)}')
print(f'Products Safe: {status_counts.get("Safe", 0)}')


Active Products with Stock Quantities Inconsistent with Reorder Levels
Total: 136
Products Expired: 66
Products Nearing Expiration: 10
Products Safe: 60
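
The same breakdown can be extended to every product status in one table. As a sketch on toy data, `pd.crosstab` with `margins=True` tabulates below-reorder products by status and expiration status and adds row and column totals.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    "Status": ["Active", "Active", "Active", "Discontinued"],
    "Stock_Quantity": [10, 80, 5, 3],
    "Reorder_Level": [20, 50, 30, 40],
    "Expiration_Status": ["Expired", "Safe", "Safe", "Expired"],
})

# Keep products whose stock is below the reorder level, whatever their status
below = toy[toy["Stock_Quantity"] < toy["Reorder_Level"]]

# Counts by status and expiration status, with an "All" row and column for totals
breakdown = pd.crosstab(below["Status"], below["Expiration_Status"], margins=True)
print(breakdown)
```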


In [39]:
# Visual: Scatter plot Stock Quantity vs. Reorder_Level with Expired Status
fig = px.scatter(df.query('Status == "Active" and Stock_Quantity < Reorder_Level'), x='Stock_Quantity', y='Reorder_Level',
           title='Active Products: Stock and Reorder Level Discrepancies', color='Expiration_Status', labels={'Expiration_Status': 'Status'}, )

fig.show()

## INSIGHTS & NEXT STEPS

### Summary of Findings

- Of the 990 products, 496 are flagged as expired and 333 are discontinued. Expired stock is worth $172,843.43, roughly 52% of the total stock value, representing significant potential for capital recovery and waste reduction.
- 136 active products hold less stock than their reorder level (66 expired, 10 nearing expiration, 60 safe), indicating a potential mismatch between inventory and purchasing policies.
- The top categories by stock value are Fruits & Vegetables, Beverages, and Seafood.
- 542 of the 990 products have very low stock coverage (less than 8 days), presenting a high risk of stockouts.
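- Supplier delivery lag (Delivery_Lag) was computed but still needs to be analyzed. The mean Date_Received (2024-08-23) falls before the mean Last_Order_Date (2024-08-25), so the average lag is negative and the dates should be validated before drawing conclusions about delivery delays and their impact on inventory planning.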

### Next Steps


Based on this EDA, the next steps for the project include:
1. Developing a demand forecasting model to predict future sales volume.
2. Building an inventory optimization model that considers demand forecasts, supplier lead times, and product expiration dates.
3. Simulating different purchasing scenarios to find the optimal balance between cost, service level, and capital utilization.

In [40]:
# Define data paths
processed_data_path = os.path.join('../data', 'processed')

In [41]:
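# Save the processed data (create the output directory if it does not exist)
os.makedirs(processed_data_path, exist_ok=True)
df.to_pickle(os.path.join(processed_data_path, 'grocery.pkl'))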