# Module 01: Exploratory Data Analysis for Demand & Inventory

This notebook performs exploratory data analysis (EDA) for Module 01 of the **"Intelligent System for Supply Chain Management"** project.  

The primary goal is to optimize inventory and purchasing management, with a target of **reducing overstocking by 20%** within six months.

---

## Data Acquisition
### Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import json
import plotly.express as px
import plotly.io as pio

from plotly.subplots import make_subplots

from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

from smart_supply_chain_ai.data_processing import get_data

import warnings
warnings.filterwarnings('ignore')

# Set up display options and plotting template
pd.set_option('display.max_columns', None)
pio.templates.default = "plotly_white"
px.defaults.width = 800
px.defaults.height = 600

### Load Dataset

In [2]:
# Define data paths
raw_data_path = os.path.join('../data', 'raw')

In [3]:
# Download Data from Kaggle
# link for data on web - [USER] / [DATASET_NAME]
module_one = "salahuddinahmedshuvo/grocery-inventory-and-sales-dataset"
# Download Data and Unzip 
get_data.download_kaggle_dataset(module_one, raw_data_path);

Starting the download of dataset 'salahuddinahmedshuvo/grocery-inventory-and-sales-dataset' from Kaggle...
Dataset URL: https://www.kaggle.com/datasets/salahuddinahmedshuvo/grocery-inventory-and-sales-dataset
Download, unzipping, and cleanup complete! The dataset was saved to: ../data/raw


In [4]:
# Load the raw dataset
df_raw = pd.read_csv(raw_data_path + '/Grocery_Inventory_and_Sales_Dataset.csv')

In [5]:
# Standardize Column Headers
# Rename 'Catagory' to 'Category' for consistency
df_raw.rename(columns={"Catagory": "Category"}, inplace=True)

## Data Cleaning and Preprocessing

In [6]:
# Description of dataset columns
column_inventory = {
    'Product_ID': 'Unique identifier for each product.',
    'Product_Name': 'Name of the product.',
    'Category': 'The product category (e.g., Grains & Pulses, Beverages, Fruits & Vegetables).',
    'Supplier_ID': 'Unique identifier for the product supplier.',
    'Supplier_Name': 'Name of the supplier.',
    'Stock_Quantity': 'The current stock level of the product in the warehouse.',
    'Reorder_Level': 'The stock level at which new stock should be ordered.',
    'Reorder_Quantity': 'The quantity of product to order when the stock reaches the reorder level.',
    'Unit_Price': 'Price per unit of the product.',
    'Date_Received': 'The date the product was received into the warehouse.',
    'Last_Order_Date': 'The last date the product was ordered.',
    'Expiration_Date': 'The expiration date of the product, if applicable.',
    'Warehouse_Location': 'The warehouse address where the product is stored.',
    'Sales_Volume': 'The total number of units sold.',
    'Inventory_Turnover_Rate': 'The rate at which the product sells and is replenished.',
    'Status': 'Current status of the product (e.g., Active, Discontinued, Backordered).',
    'Stock_Value': 'The total monetary value of the current stock (Stock_Quantity * Unit_Price).',
    'Days_For_Expiration': 'The number of days between Date_Received and Expiration_Date. Negative values are set to zero, which marks the product as expired.',
    'Expiration_Status': 'Categorical status based on the expiration date (e.g., Expired, Nearing, Safe).',
    'Purchase_Order': 'The total monetary value of the new order (Reorder_Quantity * Unit_Price), used to analyze discrepancies.',
    'DOI_Inventory_Turnover': 'Days of Inventory. The number of days the current inventory can last, based on the annual sales rate.',
    'Delivery_Lag': 'The number of days between the last order date and the date the product was received.',
    'Delivery_Lag_Issue': 'Flag records with inconsistent delivery lag (negative values).',
}

In [7]:
# Create a copy for cleaning and preprocessing
df = df_raw.copy()

In [8]:
df[:10]

Unnamed: 0,Product_ID,Product_Name,Category,Supplier_ID,Supplier_Name,Stock_Quantity,Reorder_Level,Reorder_Quantity,Unit_Price,Date_Received,Last_Order_Date,Expiration_Date,Warehouse_Location,Sales_Volume,Inventory_Turnover_Rate,Status
0,29-205-1132,Sushi Rice,Grains & Pulses,38-037-1699,Jaxnation,22,72,70,$4.50,8/16/2024,6/29/2024,9/19/2024,48 Del Sol Trail,32,19,Discontinued
1,40-681-9981,Arabica Coffee,Beverages,54-470-2479,Feedmix,45,77,2,$20.00,11/1/2024,5/29/2024,5/8/2024,36 3rd Place,85,1,Discontinued
2,06-955-3428,Black Rice,Grains & Pulses,54-031-2945,Vinder,30,38,83,$6.00,8/3/2024,6/10/2024,9/22/2024,3296 Walton Court,31,34,Backordered
3,71-594-6552,Long Grain Rice,Grains & Pulses,63-492-7603,Brightbean,12,59,62,$1.50,12/8/2024,2/19/2025,4/17/2024,3 Westerfield Crossing,95,99,Active
4,57-437-1828,Plum,Fruits & Vegetables,54-226-4308,Topicstorm,37,30,74,$4.00,7/3/2024,10/11/2024,10/5/2024,15068 Scoville Court,62,25,Backordered
5,21-120-6238,All-Purpose Flour,Grains & Pulses,86-292-4587,Dabjam,55,33,14,$1.75,12/3/2024,5/26/2024,9/5/2024,050 Mcbride Avenue,34,62,Discontinued
6,71-516-1996,Corn Oil,Oils & Fats,04-391-7610,Tagfeed,96,52,16,$2.50,3/18/2024,5/7/2024,6/20/2024,12 Truax Court,67,13,Active
7,39-629-5554,Egg (Goose),Dairy,67-679-4930,Muxo,44,90,17,$2.50,2/3/2025,4/9/2024,2/5/2025,267 International Plaza,21,91,Discontinued
8,66-268-8345,Greek Yogurt,Dairy,32-182-1895,Thoughtstorm,91,84,11,$3.00,12/4/2024,6/2/2024,1/8/2025,550 Clemons Plaza,56,90,Active
9,46-452-9419,Egg (Duck),Dairy,67-137-4215,Wordify,43,10,15,$1.00,11/18/2024,11/14/2024,7/8/2024,55782 Welch Hill,27,69,Active


In [9]:
# Verify for missing values in the 'Category' column
df[df['Category'].isna()]

Unnamed: 0,Product_ID,Product_Name,Category,Supplier_ID,Supplier_Name,Stock_Quantity,Reorder_Level,Reorder_Quantity,Unit_Price,Date_Received,Last_Order_Date,Expiration_Date,Warehouse_Location,Sales_Volume,Inventory_Turnover_Rate,Status
685,10-378-9729,Cabbage,,83-941-9620,Rooxo,69,21,68,$66.55,12/23/2024,11/26/2024,9/21/2024,2 Butterfield Pass,36,35,Discontinued


In [10]:
# List unique categories
df.Category.unique()

array(['Grains & Pulses', 'Beverages', 'Fruits & Vegetables',
       'Oils & Fats', 'Dairy', 'Bakery', 'Seafood', nan], dtype=object)

### Handle Missing Data

In [11]:
# The 'Category' column has one missing value, corresponding to 'Cabbage'.  
# Based on domain knowledge, we'll fill this with 'Fruits & Vegetables'
df['Category'] = df['Category'].fillna('Fruits & Vegetables')
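
One thing to watch with `fillna`: called on the whole frame, it fills every missing cell in every column, not only `Category`. The sketch below contrasts a frame-wide fill with a column-scoped fill on a hypothetical toy frame (`toy`, `leaky`, and `scoped` are illustrative names, and the missing supplier name is invented for the example).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Category and one missing Supplier_Name
toy = pd.DataFrame({
    'Product_Name': ['Cabbage', 'Plum'],
    'Category': [np.nan, 'Fruits & Vegetables'],
    'Supplier_Name': ['Rooxo', np.nan],
})

# Frame-wide fill: the category label also lands in Supplier_Name
leaky = toy.fillna('Fruits & Vegetables')

# Column-scoped fill: only Category is touched
scoped = toy.copy()
scoped['Category'] = scoped['Category'].fillna('Fruits & Vegetables')

print(leaky.loc[1, 'Supplier_Name'])         # Fruits & Vegetables
print(scoped['Supplier_Name'].isna().sum())  # 1
```

In this dataset only `Category` has a missing value, so both forms give the same result here; the column-scoped form simply stays safe if other gaps appear later.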

### Convert Data Types for Analysis

In [12]:
# Check the data types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Product_ID               990 non-null    object
 1   Product_Name             990 non-null    object
 2   Category                 990 non-null    object
 3   Supplier_ID              990 non-null    object
 4   Supplier_Name            990 non-null    object
 5   Stock_Quantity           990 non-null    int64 
 6   Reorder_Level            990 non-null    int64 
 7   Reorder_Quantity         990 non-null    int64 
 8   Unit_Price               990 non-null    object
 9   Date_Received            990 non-null    object
 10  Last_Order_Date          990 non-null    object
 11  Expiration_Date          990 non-null    object
 12  Warehouse_Location       990 non-null    object
 13  Sales_Volume             990 non-null    int64 
 14  Inventory_Turnover_Rate  990 non-null    i

In [13]:
# Convert date columns to datetime objects
date_columns = ['Date_Received', 'Last_Order_Date', 'Expiration_Date']
df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')
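
The preview above shows dates written as month/day/year without zero padding. As a hedged alternative to format inference, the sketch below parses a few of those sample values with an explicit format; the `'not a date'` entry is an invented value added only to show what `errors='coerce'` does.

```python
import pandas as pd

# Sample values copied from the preview, plus one invented invalid entry
sample = pd.Series(['8/16/2024', '11/1/2024', '2/3/2025', 'not a date'])

# An explicit format removes day/month ambiguity;
# errors='coerce' turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(sample, format='%m/%d/%Y', errors='coerce')

print(parsed.iloc[0].date())   # 2024-08-16
print(parsed.isna().sum())     # 1
```

Counting `NaT` values after the conversion is a quick way to confirm that `errors='coerce'` did not silently discard valid dates.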

In [14]:
# Clean and convert 'Unit_Price' to a numeric format
df['Unit_Price'] = df['Unit_Price'].str.replace('$', '', regex=False).astype('float')
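
A slightly more defensive variant of the price cleaning is sketched below. It assumes prices might also contain thousands separators or non-numeric placeholders; the values `'$1,250.00'` and `'n/a'` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# First two values match the preview; the last two are hypothetical edge cases
prices = pd.Series(['$4.50', '$20.00', '$1,250.00', 'n/a'])

# regex=False treats '$' as a literal character, not the end-of-string anchor
cleaned = (prices.str.replace('$', '', regex=False)
                 .str.replace(',', '', regex=False))

# errors='coerce' yields NaN for anything that is not a number
unit_price = pd.to_numeric(cleaned, errors='coerce')

print(unit_price.tolist())  # [4.5, 20.0, 1250.0, nan]
```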

In [15]:
# Convert categorical columns to the 'category' type for memory efficiency
cat_columns = ['Category', 'Status']
df[cat_columns] = df[cat_columns].astype('category')

In [16]:
# Statistics for Numeric columns
df.describe(exclude=['datetime', 'object', 'category']).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Stock_Quantity,990.0,55.609091,26.300775,10.0,33.0,56.0,79.0,100.0
Reorder_Level,990.0,51.215152,29.095241,1.0,25.25,53.0,77.0,100.0
Reorder_Quantity,990.0,51.913131,29.521059,1.0,25.0,54.0,77.0,100.0
Unit_Price,990.0,5.924192,6.49128,0.2,2.5,4.225,7.0,98.43
Sales_Volume,990.0,58.925253,23.002318,20.0,39.0,58.0,78.0,100.0
Inventory_Turnover_Rate,990.0,50.150505,28.798954,1.0,25.0,50.0,74.75,100.0


In [17]:
# Statistics for Categorical columns
df.describe(include=['category']).T

Unnamed: 0,count,unique,top,freq
Category,990,7,Fruits & Vegetables,332
Status,990,3,Discontinued,333


In [18]:
# Statistics for String columns
df.describe(include=['object']).T

Unnamed: 0,count,unique,top,freq
Product_ID,990,990,29-205-1132,1
Product_Name,990,121,Bread Flour,19
Supplier_ID,990,990,38-037-1699,1
Supplier_Name,990,350,Katz,12
Warehouse_Location,990,990,48 Del Sol Trail,1


In [19]:
# Display the minimum date for each column
df[['Date_Received', 'Last_Order_Date', 'Expiration_Date']].min()

Date_Received     2024-02-25
Last_Order_Date   2024-02-25
Expiration_Date   2024-02-25
dtype: datetime64[ns]

In [20]:
# Display the maximum date for each column
df[['Date_Received', 'Last_Order_Date', 'Expiration_Date']].max()

Date_Received     2025-02-24
Last_Order_Date   2025-02-24
Expiration_Date   2025-02-24
dtype: datetime64[ns]

In [21]:
# Check for duplicate product IDs
df['Product_ID'].duplicated().sum()

0

In [22]:
# Check for duplicate supplier IDs
df['Supplier_ID'].duplicated().sum()

0

# Feature Engineering: Create New Metrics

In [23]:
# Calculate the supplier delivery lag (number of days between order placement and receipt)
df['Delivery_Lag'] = (df['Date_Received'] - df['Last_Order_Date']).dt.days

### **Analysis of "Delivery Lag" Data Inconsistency and Proposed Action**

We identified a significant inconsistency in the derived `Delivery_Lag` values. The lag has a mean of -2.71 days and a standard deviation of about 150 days, ranges from -343 to 352 days, and is negative for 51.92% of the records. A negative lag means a product was received before it was ordered, and a lag of almost a year is implausible for grocery items, so these values cannot reflect real operations.

In an ideal scenario, the standard procedure would be to formally consult the responsible business area to elucidate and correct these discrepancies. However, as this is a publicly available dataset, this course of action is not viable.

Given this constraint, and to keep the project moving, we propose the following action: **the generation of synthetic dates from which the "Delivery Lag" variable is recomputed**. The series will be modeled on realistic market parameters, with central tendency and dispersion consistent with practice, ensuring the necessary coherence for the prototyping phase.

It is important to emphasize that this solution is provisional. The prototype will be validated and undergo fine-tuning at a later stage when it is possible to integrate a set of real, audited data.
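
The column dictionary describes a `Delivery_Lag_Issue` flag for records with a negative lag, although the notebook does not create it. A minimal sketch of how such a flag could be derived is shown below, using four date pairs taken from the data preview (rows 0, 1, 3, and 4); the frame name `toy` is illustrative.

```python
import pandas as pd

# Date pairs from preview rows 0, 1, 3 and 4
toy = pd.DataFrame({
    'Date_Received':   pd.to_datetime(['2024-08-16', '2024-11-01', '2024-12-08', '2024-07-03']),
    'Last_Order_Date': pd.to_datetime(['2024-06-29', '2024-05-29', '2025-02-19', '2024-10-11']),
})

# Lag in whole days between order and receipt
toy['Delivery_Lag'] = (toy['Date_Received'] - toy['Last_Order_Date']).dt.days

# Flag records where the product appears to arrive before it was ordered
toy['Delivery_Lag_Issue'] = toy['Delivery_Lag'] < 0

print(toy['Delivery_Lag'].tolist())       # [48, 156, -73, -100]
print(toy['Delivery_Lag_Issue'].mean())   # 0.5
```

Keeping the flag alongside the raw lag makes it possible to report how much of the data was affected even after the dates are replaced.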

In [24]:
# Calculates the percentage of records in the dataset with a negative Delivery_Lag 
negative_count = df.query('Delivery_Lag < 0').shape[0]
total_count = df.shape[0]
percentage = (negative_count / total_count) * 100

print(f'Percentage of DataFrame with negative Delivery Lag: {percentage:.2f}%')

Percentage of DataFrame with negative Delivery Lag: 51.92%


In [25]:
# Upper outlier fence for Delivery_Lag using the 1.5 * IQR rule
IQR = df['Delivery_Lag'].quantile(.75) - df['Delivery_Lag'].quantile(.25)
outlier_sup = df['Delivery_Lag'].quantile(.75) + (1.5 * IQR)
outlier_sup

441.5
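
The same 1.5 * IQR rule also yields a lower fence. The sketch below wraps both in a hypothetical helper, `iqr_fences` (not part of the notebook), and applies it to a toy series built so that its quartiles equal the -111 and 110 reported for `Delivery_Lag` in this notebook.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) outlier fences using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series whose quartiles are -111 and 110
toy_lag = pd.Series([-111, -111, 110, 110])

lower, upper = iqr_fences(toy_lag)
print(lower, upper)  # -442.5 441.5
```

With fences at -442.5 and 441.5, the observed range of -343 to 352 days lies entirely inside them, so the IQR rule flags nothing. The problem with this variable is the sign and overall spread of the lag, not a handful of extreme points.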

In [26]:
# Select only rows where Delivery_Lag is negative (less than 0)
df_lag_negative = df[['Delivery_Lag', 'Status']].query('Delivery_Lag < 0')
df_lag_negative

Unnamed: 0,Delivery_Lag,Status
3,-73,Active
4,-100,Backordered
6,-50,Active
11,-53,Backordered
12,-58,Discontinued
...,...,...
983,-55,Active
985,-113,Active
986,-1,Active
987,-21,Active


In [27]:
# Generate summary statistics for the inconsistent Delivery_Lag data
df_lag_negative['Delivery_Lag'].describe()

count    514.000000
mean    -121.373541
std       86.067183
min     -343.000000
25%     -182.750000
50%     -105.500000
75%      -49.250000
max       -1.000000
Name: Delivery_Lag, dtype: float64

In [28]:
# Perform initial exploratory analysis on Delivery_Lag to understand its distribution
df['Delivery_Lag'].describe()

count    990.000000
mean      -2.710101
std      150.485792
min     -343.000000
25%     -111.000000
50%       -6.000000
75%      110.000000
max      352.000000
Name: Delivery_Lag, dtype: float64

## Create Synthetic Dates

Characteristics of the synthetic dates:

- **Category-specific:** only the categories present in the dataset are used. Each product keeps its original category, so the category distribution matches the observed frequencies.
- **Realistic parameters per category:**
  - Grains & Pulses: lead time 10–30 days, shelf life 6–12 months
  - Beverages: lead time 7–21 days, shelf life 6–18 months
  - Fruits & Vegetables: lead time 2–7 days, shelf life 5–21 days
  - Oils & Fats: lead time 14–45 days, shelf life 12–24 months
  - Dairy: lead time 3–10 days, shelf life 7–28 days
  - Bakery: lead time 1–3 days, shelf life 3–14 days
  - Seafood: lead time 1–5 days, shelf life 2–7 days
- **Seasonal patterns:**
  - Fruits & Vegetables have a reduced shelf life in summer
  - Dairy has a shorter lead time in winter
- **Temporal distribution:** about 80% of deliveries fall on weekdays
- **Controlled outliers:** about 3% of records represent unusual situations

The synthetic dates preserve the category-specific characteristics of the original dataset while keeping the temporal relationships consistent (order before receipt, receipt before expiration) for supply chain analysis.
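
The per-category parameters can be applied in a vectorized, reproducible way. The sketch below is one possible approach rather than the notebook's implementation: it uses a seeded `numpy` generator (the seed value is arbitrary), draws whole days instead of fractional days, and works on a hypothetical four-row frame covering three of the categories.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility

# Lead-time and shelf-life bounds (in days) for three of the categories
params = pd.DataFrame.from_dict({
    'Dairy':   {'lead_min': 3, 'lead_max': 10, 'shelf_min': 7, 'shelf_max': 28},
    'Bakery':  {'lead_min': 1, 'lead_max': 3,  'shelf_min': 3, 'shelf_max': 14},
    'Seafood': {'lead_min': 1, 'lead_max': 5,  'shelf_min': 2, 'shelf_max': 7},
}, orient='index')

# Hypothetical rows; the received dates are illustrative
toy = pd.DataFrame({
    'Category': ['Dairy', 'Bakery', 'Seafood', 'Dairy'],
    'Date_Received': pd.to_datetime(['2024-03-04', '2024-03-05', '2024-03-06', '2024-03-07']),
})

# Look up each row's bounds by its category
bounds = params.loc[toy['Category']].reset_index(drop=True)

# Draw one integer per row, inclusive of both bounds
lead_days = rng.integers(bounds['lead_min'].to_numpy(),
                         bounds['lead_max'].to_numpy(), endpoint=True)
shelf_days = rng.integers(bounds['shelf_min'].to_numpy(),
                          bounds['shelf_max'].to_numpy(), endpoint=True)

toy['Last_Order_Date'] = toy['Date_Received'] - pd.to_timedelta(lead_days, unit='D')
toy['Expiration_Date'] = toy['Date_Received'] + pd.to_timedelta(shelf_days, unit='D')

print(toy)
```

Because every lead time and shelf life is at least one day, the ordering of order date, receipt date, and expiration date holds by construction.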

In [29]:
# Create a subset of the dataframe focusing on date columns and delivery lag for inconsistency analysis
df_date_inconsistent = df[['Date_Received', 'Last_Order_Date', 'Expiration_Date', 'Delivery_Lag']]

In [30]:
# Number of rows in the dataset (i.e., total number of records)
n_rows = df.shape[0]  

# Start date for the time range used in the analysis or simulation
start_date = '2024-02-25'  

# End date for the time range used in the analysis or simulation
end_date = '2025-02-24'    

In [31]:
# Convert start date to Unix timestamp (in seconds)
start_ts = pd.Timestamp(start_date).value // 10**9

# Convert end date to Unix timestamp (in seconds)
end_ts = pd.Timestamp(end_date).value // 10**9


In [32]:
# Extract unique product categories from the dataset
categories = df['Category'].unique().tolist()

# Approximate category weights (not used below, because each row keeps its original category)
category_probs = [0.35, 0.10, 0.25, 0.05, 0.15, 0.05, 0.05]

# Set realistic lead time and shelf life parameters for each category
category_params = {
    'Grains & Pulses': {'lead_min': 10, 'lead_max': 30, 'shelf_min': 180, 'shelf_max': 365},
    'Beverages': {'lead_min': 7, 'lead_max': 21, 'shelf_min': 180, 'shelf_max': 540},
    'Fruits & Vegetables': {'lead_min': 2, 'lead_max': 7, 'shelf_min': 5, 'shelf_max': 21},
    'Oils & Fats': {'lead_min': 14, 'lead_max': 45, 'shelf_min': 365, 'shelf_max': 730},
    'Dairy': {'lead_min': 3, 'lead_max': 10, 'shelf_min': 7, 'shelf_max': 28},
    'Bakery': {'lead_min': 1, 'lead_max': 3, 'shelf_min': 3, 'shelf_max': 14},
    'Seafood': {'lead_min': 1, 'lead_max': 5, 'shelf_min': 2, 'shelf_max': 7}
}


In [33]:
# Assign product categories from the dataset to a new variable
product_categories = df['Category']

In [34]:
# Generate received dates (with more realistic distribution – more deliveries on weekdays)
date_received_ts = np.zeros(n_rows, dtype=np.int64)

for i in range(n_rows):
    # 80% chance of drawing one of the first five day offsets (intended as weekdays)
    if np.random.random() < 0.8:
        # Normal distribution centered on offset 2, clamped to 0-4.
        # Offsets count from start_date (2024-02-25, a Sunday), so 0-4 land on Sun-Thu
        # unless start_ts is first moved to a Monday.
        day_offset = int(np.random.normal(2, 1.5))
        day_offset = max(0, min(4, day_offset))  # Clamp between 0 and 4
    else:
        # Remaining offsets 5 and 6 (Fri and Sat when counting from a Sunday)
        day_offset = np.random.choice([5, 6])
    
    # Select a random week in the year
    week_offset = np.random.randint(0, 52) * 7
    base_date_ts = start_ts + (week_offset + day_offset) * 86400
    
    # Add hour variation (deliveries usually in the morning)
    hour = int(np.random.normal(10, 2))  # Mean 10am, standard deviation 2h
    hour = max(6, min(18, hour))  # Clamp between 6am and 6pm
    
    date_received_ts[i] = base_date_ts + hour * 3600
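
Day offsets only map to Monday through Friday if they are counted from a Monday, and 2024-02-25 falls on a Sunday. The sketch below shows one way to align the sampling, under stated assumptions: it uses a seeded generator, gives each weekday an equal share of the 80% (instead of the normal distribution centered on Wednesday used above), and ignores the hour of day.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
n = 1000

start = pd.Timestamp('2024-02-25')

# Move forward to the first Monday on or after the start date (Monday has dayofweek 0)
first_monday = start + pd.Timedelta(days=(7 - start.dayofweek) % 7)

# Random week of the year, then a day of the week:
# Mon-Fri share 80% of the probability, Sat and Sun share the remaining 20%
weeks = rng.integers(0, 52, size=n)
day_probs = [0.16] * 5 + [0.10] * 2
days = rng.choice(7, size=n, p=day_probs)

received = first_monday + pd.to_timedelta(weeks * 7 + days, unit='D')

weekday_share = (received.dayofweek < 5).mean()
print(first_monday.date())       # 2024-02-26
print(round(weekday_share, 2))   # close to 0.8
```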


In [35]:
# Initialize array to store last order timestamps for each product
last_order_ts = np.zeros(n_rows, dtype=np.int64)

# Initialize array to store expiration timestamps for each product
expiration_ts = np.zeros(n_rows, dtype=np.int64)


In [36]:
# Loop through each product and retrieve its category
for i, category in enumerate(product_categories):
    params = category_params[category]
    
    # Calculate lead time in seconds based on category parameters
    lead_time_days = np.random.uniform(params['lead_min'], params['lead_max'])
    lead_time_seconds = int(lead_time_days * 86400)
    
    # Calculate shelf life in seconds based on category parameters
    shelf_life_days = np.random.uniform(params['shelf_min'], params['shelf_max'])
    shelf_life_seconds = int(shelf_life_days * 86400)
    
    # Compute last order and expiration timestamps
    last_order_ts[i] = date_received_ts[i] - lead_time_seconds
    expiration_ts[i] = date_received_ts[i] + shelf_life_seconds


In [37]:
# Convert received timestamps to datetime format
date_received = pd.to_datetime(date_received_ts, unit='s')

# Convert last order timestamps to datetime format
last_order = pd.to_datetime(last_order_ts, unit='s')

# Convert expiration timestamps to datetime format
expiration = pd.to_datetime(expiration_ts, unit='s')


In [38]:
# Create synthetic DataFrame with category and date information
df_synthetic = pd.DataFrame({
    'Category': product_categories,
    'Date_Received': date_received,
    'Last_Order_Date': last_order,
    'Expiration_Date': expiration
})


In [39]:
# Adjust for seasonal patterns

# Fruits and vegetables have shorter shelf life during summer (due to heat)
summer_mask = (df_synthetic['Date_Received'].dt.month.isin([6, 7, 8])) & (df_synthetic['Category'] == 'Fruits & Vegetables')
df_synthetic.loc[summer_mask, 'Expiration_Date'] -= pd.to_timedelta(np.random.randint(2, 5), unit='d')

# Dairy products have shorter lead time during winter (lower spoilage risk)
winter_mask = (df_synthetic['Date_Received'].dt.month.isin([12, 1, 2])) & (df_synthetic['Category'] == 'Dairy')
df_synthetic.loc[winter_mask, 'Last_Order_Date'] += pd.to_timedelta(np.random.randint(1, 3), unit='d')


In [40]:
# Add some outliers (3% of the data) – unusual situations
outlier_mask = np.random.random(n_rows) < 0.03

# Apply early order dates for outlier records
df_synthetic.loc[outlier_mask, 'Last_Order_Date'] -= pd.to_timedelta(np.random.randint(15, 30), unit='d')

# Apply reduced shelf life for perishable outlier products
df_synthetic.loc[outlier_mask & (df_synthetic['Category'].isin(['Fruits & Vegetables', 'Seafood'])), 
       'Expiration_Date'] -= pd.to_timedelta(np.random.randint(3, 7), unit='d')

# Ensure Last_Order_Date is always earlier than Date_Received
date_inconsistency = df_synthetic['Last_Order_Date'] > df_synthetic['Date_Received']
df_synthetic.loc[date_inconsistency, 'Last_Order_Date'] = df_synthetic.loc[date_inconsistency, 'Date_Received'] - pd.to_timedelta(
    np.random.randint(1, 5), unit='d')

# Ensure Expiration_Date is always later than Date_Received
exp_inconsistency = df_synthetic['Expiration_Date'] <= df_synthetic['Date_Received']
df_synthetic.loc[exp_inconsistency, 'Expiration_Date'] = df_synthetic.loc[exp_inconsistency, 'Date_Received'] + pd.to_timedelta(
    np.random.randint(1, 10), unit='d')
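
After corrections like these, it can help to confirm that no violations remain. The sketch below uses a hypothetical helper, `count_date_violations`, that applies the same two conditions as the masks above to a toy frame containing one violation of each kind.

```python
import pandas as pd

def count_date_violations(frame: pd.DataFrame) -> dict:
    """Count rows that break the expected order of order, receipt and expiration dates."""
    return {
        'order_after_receipt': int((frame['Last_Order_Date'] > frame['Date_Received']).sum()),
        'expiry_not_after_receipt': int((frame['Expiration_Date'] <= frame['Date_Received']).sum()),
    }

# Toy frame: row 1 is ordered after receipt, row 2 expires on the day of receipt
toy = pd.DataFrame({
    'Last_Order_Date': pd.to_datetime(['2024-03-01', '2024-03-10', '2024-03-02']),
    'Date_Received':   pd.to_datetime(['2024-03-04', '2024-03-05', '2024-03-06']),
    'Expiration_Date': pd.to_datetime(['2024-03-20', '2024-03-25', '2024-03-06']),
})

violations = count_date_violations(toy)
print(violations)  # {'order_after_receipt': 1, 'expiry_not_after_receipt': 1}
```

On the corrected synthetic frame, both counts would be expected to be zero.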


In [41]:
# Format received date for display (YYYY-MM-DD)
df_synthetic['Date_Received'] = df_synthetic['Date_Received'].dt.strftime('%Y-%m-%d')

df_synthetic['Last_Order_Date'] = df_synthetic['Last_Order_Date'].dt.strftime('%Y-%m-%d')

df_synthetic['Expiration_Date'] = df_synthetic['Expiration_Date'].dt.strftime('%Y-%m-%d')


In [42]:
# Identify columns present in df but not in df_raw
columns_drop = list(set(df.columns.tolist()) - set(df_raw.columns.tolist()))

# Drop the extra columns from df to align with df_raw structure
df.drop(columns=columns_drop, inplace=True)


In [43]:
# Overwrite matching columns in df with values from the synthetic dataset
df[df_synthetic.columns] = df_synthetic


In [44]:
# Convert 'Date_Received' column to datetime format, assuming year comes first
df['Date_Received'] = pd.to_datetime(df['Date_Received'], yearfirst=True)

# Convert 'Last_Order_Date' column to datetime format, assuming year comes first
df['Last_Order_Date'] = pd.to_datetime(df['Last_Order_Date'], yearfirst=True)

# Convert 'Expiration_Date' column to datetime format, assuming year comes first
df['Expiration_Date'] = pd.to_datetime(df['Expiration_Date'], yearfirst=True)


In [45]:
# Show data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Product_ID               990 non-null    object        
 1   Product_Name             990 non-null    object        
 2   Category                 990 non-null    category      
 3   Supplier_ID              990 non-null    object        
 4   Supplier_Name            990 non-null    object        
 5   Stock_Quantity           990 non-null    int64         
 6   Reorder_Level            990 non-null    int64         
 7   Reorder_Quantity         990 non-null    int64         
 8   Unit_Price               990 non-null    float64       
 9   Date_Received            990 non-null    datetime64[ns]
 10  Last_Order_Date          990 non-null    datetime64[ns]
 11  Expiration_Date          990 non-null    datetime64[ns]
 12  Warehouse_Location       990 non-nul

## Create New Features

In [46]:
# Calculate the supplier delivery lag (number of days between order placement and receipt)
df['Delivery_Lag'] = (df['Date_Received'] - df['Last_Order_Date']).dt.days

In [47]:
# Perform initial exploratory analysis on Delivery_Lag to understand its distribution
df['Delivery_Lag'].describe()

count    990.000000
mean      10.396970
std       10.138724
min        1.000000
25%        3.000000
50%        6.000000
75%       14.750000
max       59.000000
Name: Delivery_Lag, dtype: float64

In [48]:
# Calculate the 'Stock_Value'
df['Stock_Value'] = df['Stock_Quantity'] * df['Unit_Price']

In [49]:
# Calculate the 'Days_For_Expiration'
df['Days_For_Expiration'] = (df['Expiration_Date'] - df['Date_Received']).dt.days.astype('Int64')

In [50]:
# Replace negative values with zero
df['Days_For_Expiration'] = df['Days_For_Expiration'].clip(lower=0)

In [51]:
# Create 'Expiration_Status' (Expired, Nearing, Safe)
df['Expiration_Status'] = np.where(df['Days_For_Expiration'] == 0, 'Expired', 
                                         np.where(df['Days_For_Expiration'] < 30, 'Nearing', 'Safe'))
df['Expiration_Status'] = df['Expiration_Status'].astype('category')
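
The nested `np.where` can also be expressed with `pd.cut`, which keeps the thresholds in one place. This is a sketch of an equivalent binning for non-negative day counts, not the notebook's code; the sample values are chosen to sit on each boundary.

```python
import numpy as np
import pandas as pd

# Boundary cases: 0 days, just above 0, just below 30, exactly 30, and a long shelf life
days = pd.Series([0, 1, 29, 30, 365])

# Right-closed bins: (-inf, 0] -> Expired, (0, 29] -> Nearing, (29, inf) -> Safe
status = pd.cut(
    days,
    bins=[-np.inf, 0, 29, np.inf],
    labels=['Expired', 'Nearing', 'Safe'],
)

print(status.astype(str).tolist())
# ['Expired', 'Nearing', 'Nearing', 'Safe', 'Safe']
```

`pd.cut` returns an ordered categorical, so the result needs no separate conversion to the `category` type.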

In [52]:
# Calculate stock coverage in days (Days of Inventory, DOI), assuming a 1-year window
# DOI = floor(365 / Inventory_Turnover_Rate)
df['DOI_Inventory_Turnover'] = (365 / df['Inventory_Turnover_Rate']).apply(np.floor).astype('int')
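
A short worked example of the formula, using a handful of illustrative turnover rates within the dataset's range of 1 to 100:

```python
import numpy as np
import pandas as pd

# Illustrative turnover rates (the dataset ranges from 1 to 100)
turnover = pd.Series([1, 19, 46, 50, 100], name='Inventory_Turnover_Rate')

# Days of Inventory: days in a year divided by the turnover rate, rounded down
doi = np.floor(365 / turnover).astype(int)

print(doi.tolist())  # [365, 19, 7, 7, 3]
```

Two consequences follow from the arithmetic. A product falls under 8 days of coverage as soon as its turnover rate reaches 46, and a turnover rate of 1 produces a coverage of 365 days, which pulls the mean well above the median. The formula would divide by zero for a turnover rate of 0; the minimum in this dataset is 1.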

In [53]:
# Create 'Purchase_Order' from 'Reorder_Quantity' to check for discrepancies
df['Purchase_Order'] = df['Reorder_Quantity'] * df['Unit_Price']

# Exploratory Data Analysis (EDA)

In [54]:
# Descriptive Statistics
df.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
Stock_Quantity,990.0,55.609091,10.0,33.0,56.0,79.0,100.0,26.300775
Reorder_Level,990.0,51.215152,1.0,25.25,53.0,77.0,100.0,29.095241
Reorder_Quantity,990.0,51.913131,1.0,25.0,54.0,77.0,100.0,29.521059
Unit_Price,990.0,5.924192,0.2,2.5,4.225,7.0,98.43,6.49128
Date_Received,990.0,2024-08-21 04:29:05.454545408,2024-02-25 00:00:00,2024-05-21 00:00:00,2024-08-20 00:00:00,2024-11-19 18:00:00,2025-02-22 00:00:00,
Last_Order_Date,990.0,2024-08-10 18:57:27.272727296,2024-01-12 00:00:00,2024-05-11 00:00:00,2024-08-09 00:00:00,2024-11-11 00:00:00,2025-02-19 00:00:00,
Expiration_Date,990.0,2024-12-22 14:16:43.636363520,2024-02-29 00:00:00,2024-07-03 06:00:00,2024-11-22 00:00:00,2025-03-22 12:00:00,2026-12-26 00:00:00,
Sales_Volume,990.0,58.925253,20.0,39.0,58.0,78.0,100.0,23.002318
Inventory_Turnover_Rate,990.0,50.150505,1.0,25.0,50.0,74.75,100.0,28.798954
Delivery_Lag,990.0,10.39697,1.0,3.0,6.0,14.75,59.0,10.138724


In [55]:
# Create a histogram to visualize the distribution of valid Delivery_Lag values
fig = px.histogram(
    df['Delivery_Lag'],
    title='Distribution of Valid Delivery Lag Values (in Days)'
)
fig.update_layout(
    xaxis_title='Delivery Lag (Days)',
    yaxis_title='Frequency',
    bargap=0.1,
    showlegend=False
)

# Display the histogram
fig.show()


### Inventory and Product Status Analysis

In [56]:
# Distribution of products by status (Active, Discontinued, Backordered)
df.Status.value_counts()
# Visual: Bar plot of product status

Status
Discontinued    333
Active          332
Backordered     325
Name: count, dtype: int64

In [57]:
fig = px.bar(df, 'Status', title='Product Status')
fig.show()

In [58]:
# Identify and quantify inactive and expired stock
print(f"Total Products Expired: {df[df['Expiration_Status'] == 'Expired'].shape[0]}") # Count expired products
print(f"Total Products Discontinued: {df[df['Status'] == 'Discontinued'].shape[0]}") # Count discontinued products

Total Products Expired: 0
Total Products Discontinued: 333


In [59]:
# Financial value of expired stock
total = df[df['Expiration_Status'] == 'Expired']['Stock_Value'].sum()
print(f'Total Value of expired stock products {total:,}')

Total Value of expired stock products 0.0


### Stock Value Analysis by Category and Supplier

In [60]:
# Top categories by total stock value
df.groupby('Category')['Stock_Value'].sum().sort_values(ascending=False)

Category
Fruits & Vegetables    90625.11
Beverages              62942.25
Seafood                62515.90
Dairy                  50601.95
Grains & Pulses        31969.20
Oils & Fats            17211.50
Bakery                 16788.80
Name: Stock_Value, dtype: float64

In [61]:
# Visual: Bar plot of stock value by category
fig = px.histogram(df, x='Category', y='Stock_Value', title='Stock Value by Category')
fig.show()

In [62]:
# Analysis of supplier performance (delivery lag and stock value)
# Suppliers with high stock value
df_supplier_stock = df.groupby(by=['Supplier_Name', 'Stock_Quantity'], as_index=False)['Stock_Value'].sum().sort_values(by='Stock_Value', ascending=False)

In [63]:
# Visual: Scatter plot of stock quantity vs. stock value by supplier
fig = px.scatter(df_supplier_stock, x='Stock_Quantity', y='Stock_Value', color='Supplier_Name',
                 title='Stock Quantity vs. Stock Value by Supplier')
fig.update_layout(height=600, width=900)
fig.show()

### Stock Coverage and Risk Analysis

In [64]:
# Identify products with low stock coverage (less than 8 days)
df[df['DOI_Inventory_Turnover'] < 8].shape[0]

542

In [65]:
# Visual: Histogram of 'DOI_Inventory_Turnover' to show the distribution
fig = px.histogram(df.query('DOI_Inventory_Turnover < 8'), 'DOI_Inventory_Turnover', title='Distribution Stock Coverage in days')
fig.update_layout(bargap=0.1)
fig.show()

In [66]:
# Identify potential discrepancies
# Compare 'Stock_Quantity' with 'Reorder_Level' for active products
below_reorder = df[(df['Status'] == 'Active') & (df['Stock_Quantity'] < df['Reorder_Level'])]

print('Active Products with Stock Quantities Inconsistent with Reorder Levels')
print(f"Total: {below_reorder.shape[0]}")
print(f"Products Expired: {(below_reorder['Expiration_Status'] == 'Expired').sum()}")
print(f"Products Nearing: {(below_reorder['Expiration_Status'] == 'Nearing').sum()}")
print(f"Products Safe: {(below_reorder['Expiration_Status'] == 'Safe').sum()}")


Active Products with Stock Quantities Inconsistent with Reorder Levels
Total: 136
Products Expired: 0
Products Nearing: 91
Products Safe: 45
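
The same breakdown can be obtained in one step with `value_counts` on the filtered rows. The sketch below uses a hypothetical four-row frame; the quantities are taken from the data preview, while the statuses and expiration labels are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: quantities from the preview, statuses and labels invented
toy = pd.DataFrame({
    'Status': ['Active', 'Active', 'Active', 'Discontinued'],
    'Stock_Quantity': [12, 43, 20, 22],
    'Reorder_Level': [59, 10, 84, 72],
    'Expiration_Status': ['Nearing', 'Safe', 'Safe', 'Nearing'],
})

# Active products whose stock is below the reorder level
below_reorder = (toy['Status'] == 'Active') & (toy['Stock_Quantity'] < toy['Reorder_Level'])

# Break the matching products down by expiration status
counts = toy.loc[below_reorder, 'Expiration_Status'].value_counts()

print(int(below_reorder.sum()))  # 2
print(counts.to_dict())          # {'Nearing': 1, 'Safe': 1}
```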


In [67]:
# Visual: Scatter plot of Stock_Quantity vs. Reorder_Level, colored by expiration status
fig = px.scatter(df.query('Status == "Active" and Stock_Quantity < Reorder_Level'), x='Stock_Quantity', y='Reorder_Level',
           title='Active Products: Stock and Reorder Level Discrepancies', color='Expiration_Status', labels={'Expiration_Status': 'Status'}, )

fig.show()


---

## **1. Overview of Exploratory Data Analysis (EDA)**

The exploratory data analysis was conducted to understand the main characteristics and trends of the inventory, sales, and logistics data. The provided charts and descriptive table offer insights into stock value by category, product distribution by status, discrepancies between stock and reorder levels, and the distribution of delivery lags and inventory coverage.

---

## **2. Inventory and Product Analysis**

### **Stock Value by Category**
The "Stock Value by Category" bar chart shows how total stock value is distributed across product categories. **Fruits & Vegetables** holds the highest stock value, exceeding $90,000. It is also the most frequent category (332 of 990 products), so part of this total reflects the number of items rather than a higher investment per item; high demand or a high unit cost may also contribute.
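
To separate the effect of product count from the effect of price and quantity, the total can be reported next to the number of products and the mean value per product. The sketch below does this for six rows taken from the data preview; the frame name `toy` and the aggregate column names are illustrative.

```python
import pandas as pd

# Six rows from the preview (Sushi Rice, Black Rice, Plum, Corn Oil, Greek Yogurt, Egg (Duck))
toy = pd.DataFrame({
    'Category': ['Grains & Pulses', 'Grains & Pulses', 'Fruits & Vegetables',
                 'Oils & Fats', 'Dairy', 'Dairy'],
    'Stock_Quantity': [22, 30, 37, 96, 91, 43],
    'Unit_Price': [4.50, 6.00, 4.00, 2.50, 3.00, 1.00],
})

summary = (
    toy.assign(Stock_Value=toy['Stock_Quantity'] * toy['Unit_Price'])
       .groupby('Category')['Stock_Value']
       .agg(total='sum', products='count', mean_per_product='mean')
       .sort_values('total', ascending=False)
)

print(summary)
```

In this small sample, Dairy has the largest total (316.0) but Oils & Fats has the largest mean per product (240.0), which illustrates why the two views can rank categories differently.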



---

### **Product Status**
The "Product Status" chart shows that products are split almost evenly across **Discontinued** (333), **Active** (332), and **Backordered** (325). About two thirds of the catalog is therefore either out of circulation or temporarily unavailable, which warrants further investigation to understand the operational impact.

---

### **Stock and Reorder Level Discrepancies**
The "Active Products: Stock and Reorder Level Discrepancies" scatter plot shows stock quantity against reorder level for the 136 active products whose stock is below their reorder level. Color encodes `Expiration_Status`, not stock sufficiency:

* Blue points ("Safe") are products with 30 or more days between receipt and expiration (45 products).
* Orange points ("Nearing") are products with fewer than 30 days between receipt and expiration (91 products).

Every point in this chart is already below its reorder level. The fact that such products appear across the full range of `Stock_Quantity` values suggests that the reorder level may be inconsistent or inadequate for some products, requiring a review of the reordering policy.

---

### **Distribution of Days of Inventory (DOI)**
The "Distribution Stock Coverage in days" chart covers only the 542 products with fewer than 8 days of coverage. Within this group, most products have a stock coverage of **3 to 5 days**, with a significant peak at **4 days**. This indicates lean inventory for more than half of the catalog, which could be a strategy to reduce storage costs. The average `DOI_Inventory_Turnover` from the table is 18.49 days, but this value is likely skewed by products with very high coverage, which the chart excludes.

---

## **3. Logistics and Purchasing Analysis**

### **Distribution of Delivery Lag**
The "Distribution of Valid Delivery Lag Values (in Days)" chart shows that most delivery times are concentrated between **1 and 5 days**. The main peak occurs at **2 days**, with a frequency of over 200. There is a second, smaller peak around 15 days. The descriptive table shows an average `Delivery_Lag` of 10.40 days and a median of only 6 days, which again suggests that the mean is pulled up by a small number of deliveries with very long lead times. Because the dates are synthetic, this shape reflects the per-category lead-time parameters rather than observed supplier behavior.

---

### **Relationship Between Stock, Value, and Suppliers**
The "Stock Quantity vs. Stock Value by Supplier" scatter plot visualizes the relationship between stock quantity and stock value by supplier.

* Most points are concentrated at low `Stock_Value` and `Stock_Quantity` values, indicating that the bulk of the inventory consists of low-value items and/or small quantities.
* There are some notable **outliers**, such as an item with a `Stock_Value` exceeding $5,000 and a `Stock_Quantity` around 95, and another with a `Stock_Value` around $4,500 and a `Stock_Quantity` around 70. These high-value points deserve attention to understand their origin and impact on inventory.

---

## **4. Descriptive Table (Statistical Summary)**

The `describe()` table complements the visualizations with key statistical metrics.

* **`Stock_Quantity`**: The average stock quantity is 55.6, with a considerable standard deviation (26.3). Values range from 10 to 100.
* **`Reorder_Level`**: The average reorder level is 51.22, very close to the average stock.
* **`Unit_Price`**: The average unit price is $5.92, but with a high standard deviation ($6.49) and a maximum value of $98.43, indicating the presence of expensive items that skew the average.
* **`Stock_Value`**: The average stock value is $336, but the median is only $209, which reinforces the observation from the scatter plot about the existence of **high-value items that influence the mean**.
* **`Delivery_Lag`**: The median of 6 days is a more representative measure of the typical delivery time than the average of 10.40, as the distribution is skewed, as seen in the chart.

---

## **Conclusion**

The data reveals an operational landscape with several key characteristics:

* **Inventory Focus:** The company has a high concentration of value in `Fruits & Vegetables`, which may be a strategic area for monitoring.
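* **Efficient Logistics:** In the synthetic date series, most deliveries are fast, with lead times of 1 to 5 days. The outliers with long lead times still need to be investigated, and the overall pattern should be revalidated once real, audited dates are available.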
* **Optimization Opportunity:** Discrepancies between stock and reorder levels for some products and the presence of high-value items suggest the need for a more detailed analysis to optimize inventory and reordering policies.
* **Discontinued Products:** The significant number of discontinued and backordered products requires an analysis to understand the reason and financial impact of this situation.

In [68]:
# Define data paths
processed_data_path = os.path.join('../data', 'processed')

utils_data_path = os.path.join('../docs', 'column_descriptions.json')

In [69]:
# Sort DataFrame by Date_Received in ascending order
df = df.sort_values(by='Date_Received').reset_index(drop=True)

In [70]:
# Save Data
df.to_pickle(processed_data_path + '/grocery.pkl')

# Save the column-description dictionary as a JSON file
with open(utils_data_path, 'w') as f:
    json.dump(column_inventory, f, indent=4)
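
Both writes assume that the target directories already exist. The sketch below shows one way to guard against a missing directory and to verify the JSON round trip; it writes to a temporary directory so that it can run anywhere, and the one-entry dictionary stands in for the full `column_inventory`.

```python
import json
import os
import tempfile

# Stand-in for the full column dictionary
column_inventory = {'Product_ID': 'Unique identifier for each product.'}

with tempfile.TemporaryDirectory() as tmp:
    docs_dir = os.path.join(tmp, 'docs')

    # Create the directory if it does not exist yet
    os.makedirs(docs_dir, exist_ok=True)

    json_path = os.path.join(docs_dir, 'column_descriptions.json')
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(column_inventory, f, indent=4)

    # Read the file back to confirm nothing was lost
    with open(json_path, encoding='utf-8') as f:
        restored = json.load(f)

print(restored == column_inventory)  # True
```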