# Module 01: Demand Forecasting and Inventory Optimization

This notebook contains the exploratory data analysis (EDA) for the first module of the **"Intelligent System for Supply Chain Management"** project. 

The main objective is to optimize inventory and purchasing management, with a target of **reducing overstocking by 20%** within 6 months.

- Target Variable for Inventory Optimization: **Stock_Quantity**
- Target Variable for Demand Forecasting: **Sales_Volume**

## DATA ACQUISITION

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import plotly.express as px
import plotly.io as pio

from sklearn.preprocessing import OrdinalEncoder

from smart_supply_chain_ai.data_processing import get_data

import warnings
warnings.filterwarnings('ignore')

# Set up display options and plotting template
pd.set_option('display.max_columns', None)
pio.templates.default = "plotly_white"
px.defaults.width = 800
px.defaults.height = 600

### Load Raw Data

In [None]:
# Define data paths
raw_data_path = os.path.join('../data', 'raw')

In [None]:
# Download the dataset from Kaggle
# Dataset slug format: [USER]/[DATASET_NAME]
module_one = "salahuddinahmedshuvo/grocery-inventory-and-sales-dataset"
# Download and unzip into the raw data folder
get_data.download_kaggle_dataset(module_one, raw_data_path);

In [None]:
# Load the raw dataset
df_raw = pd.read_csv(os.path.join(raw_data_path, 'Grocery_Inventory_and_Sales_Dataset.csv'))

## DATA CLEANING & PREPROCESSING

In [None]:
# Data dictionary: column names and their descriptions
column_inventory = {
    'Product_ID': 'Unique identifier for each product.',
    'Product_Name': 'Name of the product.',
    'Category': 'The product category (e.g., Grains & Pulses, Beverages, Fruits & Vegetables).',
    'Supplier_ID': 'Unique identifier for the product supplier.',
    'Supplier_Name': 'Name of the supplier.',
    'Stock_Quantity': 'The current stock level of the product in the warehouse.',
    'Reorder_Level': 'The stock level at which new stock should be ordered.',
    'Reorder_Quantity': 'The quantity of product to order when the stock reaches the reorder level.',
    'Unit_Price': 'Price per unit of the product.',
    'Date_Received': 'The date the product was received into the warehouse.',
    'Last_Order_Date': 'The last date the product was ordered.',
    'Expiration_Date': 'The expiration date of the product, if applicable.',
    'Warehouse_Location': 'The warehouse address where the product is stored.',
    'Sales_Volume': 'The total number of units sold.',
    'Inventory_Turnover_Rate': 'The rate at which the product sells and is replenished.',
    'Status': 'Current status of the product (e.g., Active, Discontinued, Backordered).',
    'Stock_Value': 'The total monetary value of the current stock (Stock_Quantity * Unit_Price).',
    'Days_For_Expiration': 'The number of days between Date_Received and Expiration_Date. Negative values indicate the product was already expired when received.',
    'Expiration_Status': 'Categorical status based on the expiration date (e.g., Expired, Nearing, Safe).',
    'Purchase_Order': 'The total monetary value of the new order (Reorder_Quantity * Unit_Price), used to analyze discrepancies.',
    'Stock_Coverage_Days': 'The number of days the current inventory can last, based on the annual sales rate.',
    'Delivery_Lag': 'The number of days between the last order date and the date the product was received.'
}

In [None]:
# Make a copy for manipulation
df = df_raw.copy()

### Standardize Column Names

In [None]:
# Rename the misspelled 'Catagory' column to 'Category'
df.rename(columns={"Catagory": "Category"}, inplace=True)

In [None]:
# Inspect rows where 'Category' is missing
df[df['Category'].isna()]

In [None]:
# List unique categories
df.Category.unique()

### Handle Missing Values

In [None]:
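# The analysis shows one missing value in the 'Category' column
# It corresponds to 'Cabbage', which is categorized as 'Fruits & Vegetables'
# Fill only 'Category' so missing values in other columns are left untouched
df['Category'] = df['Category'].fillna('Fruits & Vegetables')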
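
Why the fill should be scoped to one column: a frame-wide `fillna` writes the same label into every missing cell, whatever the column. The sketch below uses a toy frame with hypothetical values (the `Supplier_Name` gap is an assumption for illustration, not something observed in the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two different columns
toy = pd.DataFrame({
    "Product_Name": ["Cabbage", "Rice"],
    "Category": [np.nan, "Grains & Pulses"],
    "Supplier_Name": ["Acme", np.nan],
})

# Frame-wide fill: every missing cell receives the category label
filled_all = toy.fillna("Fruits & Vegetables")

# Column-scoped fill: only 'Category' is modified
filled_col = toy.copy()
filled_col["Category"] = filled_col["Category"].fillna("Fruits & Vegetables")

print(filled_all["Supplier_Name"].tolist())
print(filled_col["Supplier_Name"].isna().tolist())
```

In the frame-wide version the supplier name of the second row becomes a category label, which is the kind of silent corruption the column-scoped fill avoids.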

### Convert Data Types

In [None]:
# Check general information about the dataset
df.info()

In [None]:
# Convert date columns to datetime objects
date_columns = ['Date_Received', 'Last_Order_Date', 'Expiration_Date']
df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')
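
With `errors='coerce'`, any value that cannot be parsed becomes `NaT` instead of raising, so it may be worth counting how many values were lost in the conversion. A minimal sketch on a toy series follows; the date format and values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one unparseable string, one missing
raw = pd.Series(["1/15/2024", "not a date", None])

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the input but failed to parse
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(parsed)
print(f"Unparseable values: {n_failed}")
```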

In [None]:
# Convert 'Unit_Price' to float by removing the '$' sign
# regex=False treats '$' as a literal character, not the regex end-of-string anchor
df['Unit_Price'] = df['Unit_Price'].str.replace('$', '', regex=False).astype('float')
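
A small sketch of the price cleanup on toy values. The thousands separator is an assumption added for illustration; whether the dataset contains one is not confirmed here.

```python
import pandas as pd

# Hypothetical price strings
prices = pd.Series(["$4.50", "$20.00", "$1,250.75"])

# Strip the currency sign and any thousands separator as literal characters
clean = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float)
)
print(clean.tolist())
```

Passing `regex=False` matters on older pandas versions, where `str.replace` interpreted the pattern as a regular expression by default and `'$'` matched the end of the string rather than the dollar sign.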

In [None]:
# Convert categorical columns to the 'category' type for memory efficiency
cat_columns = ['Category', 'Status']
df[cat_columns] = df[cat_columns].astype('category')
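
The memory claim can be checked directly with `memory_usage(deep=True)`. The sketch below uses a toy series of repeated status labels; the sizes it prints depend on the pandas version and are not measurements of this dataset.

```python
import pandas as pd

# Toy column with a few labels repeated many times
status = pd.Series(["Active", "Discontinued", "Backordered"] * 1000)
as_category = status.astype("category")

# deep=True includes the memory held by the string values themselves
string_bytes = status.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)
```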

In [None]:
# Statistics for Numeric columns
df.describe(exclude=['datetime', 'object', 'category']).T

In [None]:
# Statistics for Categorical columns
df.describe(include=['category'])

In [None]:
# Statistics for String columns
df.describe(include=['object'])

In [None]:
df.head()

In [None]:
# Earliest date in each date column
df[['Date_Received', 'Last_Order_Date', 'Expiration_Date']].min()

In [None]:
# Latest date in each date column
df[['Date_Received', 'Last_Order_Date', 'Expiration_Date']].max()

In [None]:
# Count duplicated 'Product_ID' values
df['Product_ID'].duplicated().sum()

In [None]:
# Count duplicated 'Supplier_ID' values
df['Supplier_ID'].duplicated().sum()

## FEATURE ENGINEERING

### Create New Business Metrics

In [None]:
# 1. Calculate the 'Stock_Value'
df['Stock_Value'] = df['Stock_Quantity'] * df['Unit_Price']

In [None]:
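# 2. Calculate 'Days_For_Expiration' as the days between receipt and expiration
# Negative values mean the product was already expired when it was received
df['Days_For_Expiration'] = (df['Expiration_Date'] - df['Date_Received']).dt.days.astype('Int64')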

In [None]:
# 3. Create 'Expiration_Status' (Expired, Nearing, Safe)
df['Expiration_Status'] = np.where(
    df['Days_For_Expiration'] < 0, 'Expired',
    np.where(df['Days_For_Expiration'] < 30, 'Nearing', 'Safe')
)
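
The same three-way bucketing can be written with `np.select`, which also makes it easy to handle rows where the day count is missing (for example, when a date failed to parse). This is a sketch on toy values; the `'Unknown'` label is a hypothetical addition and not part of the notebook's categories.

```python
import numpy as np
import pandas as pd

# Toy day counts, including the boundaries and one missing value
days = pd.Series([-5, 0, 29, 30, 400, None], dtype="Int64")

# Conditions are checked in order; the first match wins
conditions = [
    days.isna().to_numpy(),
    (days < 0).fillna(False).to_numpy(dtype=bool),
    (days < 30).fillna(False).to_numpy(dtype=bool),
]
labels = ["Unknown", "Expired", "Nearing"]  # 'Unknown' is a hypothetical label

status = np.select(conditions, labels, default="Safe")
print(status.tolist())
```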

In [None]:
# 4. Calculate 'Stock_Coverage_Days' (based on a 1-year sales window)
# Coverage in days = 365 / 'Inventory_Turnover_Rate', rounded down
df['Stock_Coverage_Days'] = (365 / df['Inventory_Turnover_Rate']).apply(np.floor).astype('int')
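
A worked example of the coverage formula on toy turnover rates. It also shows one way to guard against a turnover rate of zero, which would otherwise produce an infinite value that cannot be cast to an integer; whether the dataset contains zero rates is not confirmed here, so treat the guard as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical annual turnover rates, including a zero
turnover = pd.Series([12.0, 52.0, 0.0, 73.0])

# Replace 0 with NaN so the division yields a missing value instead of inf,
# then use the nullable 'Int64' dtype to keep that missing value
coverage = np.floor(365 / turnover.replace(0, np.nan)).astype("Int64")
print(coverage.tolist())
```

A rate of 12 turns per year gives 365 / 12 ≈ 30.4, so 30 days of coverage, while 52 turns per year gives about 7 days.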

In [None]:
# 5. Create 'Purchase_Order' from 'Reorder_Quantity' to check for discrepancies
df['Purchase_Order'] = df['Reorder_Quantity'] * df['Unit_Price']

In [None]:
# 6. Calculate Supplier Delivery Lag
# Assume 'Date_Received' refers to the most recent reception after the 'Last_Order_Date'
df['Delivery_Lag'] = (df['Date_Received'] - df['Last_Order_Date']).dt.days

# Delivery lag must not be negative: replace negative (and missing) values with 0
df['Delivery_Lag'] = df['Delivery_Lag'].where(df['Delivery_Lag'] > 0, 0)
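
One subtlety of flooring with `where`: a missing lag fails the `> 0` test and is therefore also set to 0. If missing lags should stay missing, `clip(lower=0)` is an alternative. The sketch below compares both on toy values.

```python
import pandas as pd

# Hypothetical lags in days: positive, negative, zero, and missing
lag = pd.Series([5.0, -3.0, 0.0, None])

# where: keeps values that pass the test, replaces everything else (NaN included)
floored_where = lag.where(lag > 0, 0)

# clip: raises negatives to 0 but leaves NaN as NaN
floored_clip = lag.clip(lower=0)

print(floored_where.tolist())
print(floored_clip.tolist())
```

Which behavior is preferable depends on whether a missing order or receipt date should count as "no delay" or as "unknown" in later supplier analysis.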

## EXPLORATORY DATA ANALYSIS (EDA)

In [None]:
# Descriptive Statistics
df.describe().T

### A. Analysis of Inventory and Product Status

In [None]:
# 1. Distribution of products by status (Active, Discontinued, Backordered)
df.Status.value_counts()

In [None]:
# Visual: Bar plot of product status
fig = px.bar(df, x='Status', title='Product Status')
fig.show()

In [None]:
# 2. Identify and quantify inactive and expired stock
print(f"Total Products Expired: {df[df['Expiration_Status'] == 'Expired'].shape[0]}") # Count expired products
print(f"Total Products Discontinued: {df[df['Status'] == 'Discontinued'].shape[0]}") # Count discontinued products

In [None]:
# 3. Financial value of expired stock
df[df['Expiration_Status'] == 'Expired']['Stock_Value'].sum()

### B. Analysis of Stock Value by Category

In [None]:
# 1. Top categories by total stock value
df.groupby('Category')['Stock_Value'].sum().sort_values(ascending=False)

In [None]:
# Visual: Total stock value by category (px.histogram sums 'y' within each category by default)
fig = px.histogram(df, x='Category', y='Stock_Value', title='Stock Value by Category')
fig.show()

In [None]:
# 2. Analysis of suppliers by stock value
# Top 5 suppliers by total stock value
df_supplier_stock = df.groupby(by='Supplier_Name', as_index=False)['Stock_Value'].sum().sort_values(by='Stock_Value', ascending=False).head(5)
df_supplier_stock

In [None]:
# Visual: Scatter plot of stock quantity vs. stock value by supplier
fig = px.scatter(df, x='Stock_Quantity', y='Stock_Value', color='Supplier_Name',
                 title='Stock Quantity vs. Stock Value by Supplier')
fig.show()

### C. Analysis of Stock Coverage and Risk

In [None]:
# 1. Identify products with low stock coverage (less than 8 days)
df[df['Stock_Coverage_Days'] < 8].shape[0]

In [None]:
# Visual: Histogram of 'Stock_Coverage_Days' for products with less than 8 days of coverage
fig = px.histogram(df.query('Stock_Coverage_Days < 8'), x='Stock_Coverage_Days', title='Distribution of Stock Coverage (Under 8 Days)')
fig.show()

In [None]:
# 2. Identify potential discrepancies
# Active products whose 'Stock_Quantity' has fallen below 'Reorder_Level'
below_reorder = df[(df['Status'] == 'Active') & (df['Stock_Quantity'] < df['Reorder_Level'])]
print('Active Products with Stock Quantities Below Reorder Levels')
print(f"Total: {below_reorder.shape[0]}")
print(f"Products Expired: {(below_reorder['Expiration_Status'] == 'Expired').sum()}")
print(f"Products Nearing: {(below_reorder['Expiration_Status'] == 'Nearing').sum()}")
print(f"Products Safe: {(below_reorder['Expiration_Status'] == 'Safe').sum()}")
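
The per-status breakdown can also be produced in one step by building the filter once and calling `value_counts` on the result. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy inventory rows (hypothetical values)
toy = pd.DataFrame({
    "Status": ["Active", "Active", "Discontinued", "Active"],
    "Stock_Quantity": [5, 50, 2, 8],
    "Reorder_Level": [10, 20, 10, 10],
    "Expiration_Status": ["Safe", "Safe", "Expired", "Expired"],
})

# Active products whose stock is below the reorder level
is_below = (toy["Status"] == "Active") & (toy["Stock_Quantity"] < toy["Reorder_Level"])

# One count per expiration status among the flagged rows
counts = toy.loc[is_below, "Expiration_Status"].value_counts()
print(f"Total: {int(is_below.sum())}")
print(counts)
```

Defining the mask once keeps the total and the per-status counts consistent, since they are all derived from the same filter.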


In [None]:
# Visual: Scatter plot of Stock_Quantity vs. Reorder_Level, colored by expiration status
fig = px.scatter(df.query('Status == "Active" and Stock_Quantity < Reorder_Level'),
                 x='Stock_Quantity', y='Reorder_Level', color='Expiration_Status',
                 title='Active Products: Stock and Reorder Level Discrepancies',
                 labels={'Expiration_Status': 'Status'})

fig.show()

## INSIGHTS & NEXT STEPS

### Summary of Findings

- A large number of products are inactive or expired, representing significant potential for capital recovery and waste reduction.
- There are notable discrepancies between current stock and reorder levels, indicating a potential mismatch between inventory and purchasing policies.
- Top stock categories by value are Fruits & Vegetables, Seafood, and Dairy.
- A number of products have a very low stock coverage (less than 8 days), presenting a high risk of stockouts.
- Supplier analysis reveals significant delivery delays, which directly impact inventory planning.
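
The delivery-delay finding can be quantified by aggregating `Delivery_Lag` per supplier. The sketch below shows one possible summary on a toy frame; the supplier names and lag values are hypothetical, and the choice of mean, max, and count as summary statistics is an assumption.

```python
import pandas as pd

# Toy delivery records (hypothetical suppliers and lags in days)
toy = pd.DataFrame({
    "Supplier_Name": ["A", "A", "B", "B", "C"],
    "Delivery_Lag": [2, 4, 30, 50, 0],
})

# Average and worst-case lag per supplier, slowest suppliers first
lag_by_supplier = (
    toy.groupby("Supplier_Name")["Delivery_Lag"]
       .agg(["mean", "max", "count"])
       .sort_values("mean", ascending=False)
)
print(lag_by_supplier)
```

Including the count alongside the mean helps avoid over-reading a supplier with only one or two deliveries.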

### Next Steps


Based on this EDA, the next steps for the project include:
1. Developing a demand forecasting model to predict future sales volume.
2. Building an inventory optimization model that considers demand forecasts, supplier lead times, and product expiration dates.
3. Simulating different purchasing scenarios to find the optimal balance between cost, service level, and capital utilization.

In [None]:
# Define data paths
processed_data_path = os.path.join('../data', 'processed')

In [None]:
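# Save the processed data (create the output folder if it does not exist)
os.makedirs(processed_data_path, exist_ok=True)
df.to_pickle(os.path.join(processed_data_path, 'grocery.pkl'))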