Module 01: Exploratory Data Analysis for Demand & Inventory

This notebook performs exploratory data analysis (EDA) for Module 01 of the **"Intelligent System for Supply Chain Management"** project.  

The primary goal is to optimize inventory and purchasing management, with a target of **reducing overstocking by 20%** within six months.

---

## Data Generation
### Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import json
import plotly.express as px
import plotly.io as pio
import ast

from plotly.subplots import make_subplots

from smart_supply_chain_ai.utils import create_data_functions

import warnings
warnings.filterwarnings('ignore')

# Set up display options and plotting template
pd.set_option('display.max_columns', None)
pio.templates.default = "plotly_white"
px.defaults.width = 800
px.defaults.height = 600

### Paths

In [2]:
# Define data paths
raw_data_path = os.path.join('../data', 'raw')

json_path = os.path.join('..', 'src', 'smart_supply_chain_ai', 'utils')

## Create Synthetic Dates

The synthetic data has the following features:

- Specific categories: only the categories present in the source data are used.
- Realistic distribution: category frequencies follow those observed in the provided data.

Realistic parameters per category:

🥦 **Produce**
- **Lead Time:** 1–3 days (locally sourced), 5–10 days (imported)
- **Shelf Life:** 3–10 days (most fresh items), up to 2 weeks for hardy vegetables like carrots or potatoes

🌾 **Grains and Flours**
- **Lead Time:** 3–7 days (domestic), 10–15 days (imported specialty grains)
- **Shelf Life:** 6 months to 1 year (dry, sealed), up to 2 years for rice and flour stored properly

🧀 **Dairy and Cold Cuts**
- **Lead Time:** 2–5 days (regional suppliers), 7–10 days (specialty cheeses)
- **Shelf Life:**
  - Milk & cream: 7–14 days refrigerated
  - Yogurt & soft cheeses: 2–3 weeks
  - Hard cheeses: 1–3 months
  - Cold cuts: 1–2 weeks sealed

☕ **Beverages**
- **Lead Time:** 2–7 days (coffee/tea distributors)
- **Shelf Life:**
  - Tea: 1–2 years (dry)
  - Coffee beans: 6–12 months (sealed), 1–2 weeks after grinding
  - Brewed drinks: 1–3 days refrigerated

🥚 **Eggs and Poultry**
- **Lead Time:** 1–3 days (local farms), 5–7 days (wholesale)
- **Shelf Life:**
  - Eggs: 3–5 weeks refrigerated
  - Fresh poultry: 1–2 days raw, 3–4 days cooked

🐟 **Meats and Fish**
- **Lead Time:** 1–5 days (fresh), 7–10 days (frozen or imported)
- **Shelf Life:**
  - Fresh fish: 1–2 days
  - Frozen fish: 3–6 months
  - Cured fish (e.g., sardines): up to 1 year

🛢️ **Oils and Fats**
- **Lead Time:** 3–7 days (bulk suppliers)
- **Shelf Life:**
  - Vegetable oils: 6–12 months
  - Butter: 1 month refrigerated, 6 months frozen
  - Coconut oil: up to 2 years

🍬 **Sugars and Sweets**
- **Lead Time:** 2–5 days
- **Shelf Life:**
  - Sugars: indefinite if dry and sealed
  - Dried fruits (e.g., plum): 6–12 months

🍪 **Miscellaneous and Biscuits**
- **Lead Time:** 2–6 days
- **Shelf Life:**
  - Biscuits: 3–6 months sealed


Seasonal patterns:

- Fruits/vegetables with reduced shelf life in summer

- Dairy with shorter lead time in winter

Realistic temporal distribution:

- 80% of deliveries on weekdays

Controlled outliers: only 3% of records represent unusual situations

These synthetic data preserve the specific characteristics of the categories in the original dataset, with realistic temporal relationships for supply chain analysis.
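
The 80% weekday share and the 3% outlier rate listed above can be sketched in a vectorized way. This is only an illustration, not the notebook's own generator: the seed, the variable names, and the choice to spread the probability mass evenly within weekdays and within weekends are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, used only for reproducibility

# Same calendar span as the notebook's date range
days = pd.date_range("2023-03-01", "2025-03-31", freq="D")
is_weekday = days.dayofweek < 5  # Monday=0 ... Friday=4

n_rows = 1500
# Put 80% of the probability mass on weekdays and 20% on weekend days
weights = np.where(is_weekday, 0.8 / is_weekday.sum(), 0.2 / (~is_weekday).sum())
sampled = pd.DatetimeIndex(rng.choice(days, size=n_rows, p=weights))

# Share of sampled dates that fall on a weekday (close to 0.8 by construction)
weekday_share = (sampled.dayofweek < 5).mean()

# Flag roughly 3% of the rows as outliers
outlier_mask = rng.random(n_rows) < 0.03
```

Drawing with explicit weights makes the target split easy to audit: `weekday_share` and `outlier_mask.mean()` can be compared directly with the stated 80% and 3%.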

In [3]:
# List of JSON filenames (without extension) to be loaded
arch_json = ['products','products_categories', 'suppliers']

# Dictionary to store the loaded JSON content
store_catalog = {}

# Loop through each filename, build the full path, and load the JSON data
for name in arch_json:
    file_path = os.path.join(json_path, f"{name}.json")  # Construct full file path
    with open(file_path, "r", encoding="utf-8") as f:     # Open the JSON file
        store_catalog[name] = json.load(f)                        # Load and store the data under its name

# Product catalog information

In [4]:
# Create a DataFrame of products with product names as a column
products = pd.DataFrame.from_dict(store_catalog['products']).T.reset_index().rename(columns={'index': 'product'})
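
To make the reshaping in this cell easier to follow, here is a small sketch on a toy catalog. The dictionary layout (`{product name: {attribute: value}}`) is an assumption about the shape of `products.json`, and `orient="index"` is shown only as a possible alternative that avoids the transpose.

```python
import pandas as pd

# Hypothetical catalog in the shape the cell above appears to expect
catalog = {
    "Strawberries": {"category": "Fresh Foods", "shelf_life_days": 5},
    "Spinach": {"category": "Fresh Foods", "shelf_life_days": 5},
}

# Same chain as in the notebook: transpose so that products become rows
via_transpose = (
    pd.DataFrame.from_dict(catalog).T
    .reset_index()
    .rename(columns={"index": "product"})
)

# Alternative: orient="index" builds the rows directly and infers a dtype per column
via_orient = (
    pd.DataFrame.from_dict(catalog, orient="index")
    .reset_index()
    .rename(columns={"index": "product"})
)

print(via_transpose["shelf_life_days"].dtype)  # object, because .T mixes strings and ints
print(via_orient["shelf_life_days"].dtype)     # int64
```

Both routes yield the same values; the difference is that the transposed frame keeps every column as `object`, which may matter later for numeric operations on `min_stock` or `max_stock`.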


In [5]:
# Add a generated product ID for each product
products['product_id'] = create_data_functions.create_IDs(products.shape[0], suffix='P')
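
The implementation of `create_data_functions.create_IDs` is not shown in this notebook. Judging only from the IDs that appear in the outputs (for example `1984964|P`), a stand-in could look like the sketch below. The function name `make_ids`, the `seed` parameter, and the seven-digit range are assumptions.

```python
import numpy as np

def make_ids(n, suffix, seed=None):
    """Hypothetical stand-in for create_IDs: n unique IDs such as '1984964|P'."""
    rng = np.random.default_rng(seed)
    # Draw n distinct integers in [1_000_000, 2_000_000) without replacement
    numbers = rng.choice(1_000_000, size=n, replace=False) + 1_000_000
    return [f"{num}|{suffix}" for num in numbers]

ids = make_ids(5, suffix="P", seed=0)
```

Sampling without replacement guarantees uniqueness within one call, which is what a key column needs.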

# Supplier catalog and distribution details

In [6]:
# Create a DataFrame of suppliers with supplier names as a column
suppliers = pd.DataFrame.from_dict(store_catalog['suppliers']).T.reset_index().rename(columns={'index': 'supplier'})

In [7]:
# Insert supplier IDs as the second column
suppliers.insert(1, 'supplier_id', create_data_functions.create_IDs(suppliers.shape[0], suffix='S'))

In [8]:
# Remove 'category' and 'subcategories' columns from the suppliers DataFrame
suppliers.drop(columns=['category', 'subcategories'], inplace=True)


In [9]:
# Split each supplier's product list into separate rows and reset the index
suppliers = suppliers.explode('products').reset_index(drop=True)
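
As a quick illustration of what `explode` does here, the toy frame below (supplier and product names borrowed from the outputs, pairing invented for the example) turns each list element into its own row.

```python
import pandas as pd

toy_suppliers = pd.DataFrame({
    "supplier": ["BioSupply", "GreenFields Co."],
    "products": [["Lime", "Kale"], ["Lettuce"]],  # hypothetical product lists
})

# One row per (supplier, product) pair; the supplier value is repeated
exploded = toy_suppliers.explode("products").reset_index(drop=True)
```

After this step the `products` column holds scalars, so it can be used as a merge key.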


# Merge product and supplier tables to consolidate supply chain information

In [10]:
# Merge product and supplier data on matching product names, then drop duplicate 'products' column from suppliers
supply_df = pd.merge(products, suppliers, left_on='product', right_on='products').drop(columns='products')
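
The default inner merge silently drops products that have no supplier. One way to check for that, sketched on toy data with invented pairings, is a left merge with `indicator=True`:

```python
import pandas as pd

toy_products = pd.DataFrame({"product": ["Lime", "Kale", "Cod"]})
toy_suppliers = pd.DataFrame({
    "supplier": ["BioSupply", "Metropolis Distributor"],
    "products": ["Lime", "Kale"],
})

# Keep every product and record where each row came from in the '_merge' column
check = pd.merge(toy_products, toy_suppliers, left_on="product", right_on="products",
                 how="left", indicator=True)

# Products that found no supplier
unmatched = check.loc[check["_merge"] == "left_only", "product"].tolist()
print(unmatched)  # ['Cod']
```

Running the same check on `products` and `suppliers` before building `supply_df` would show whether any catalog entries are lost in the merge.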


# Create Date dataframe

In [11]:
"date_received"
"lpo (latest_purchase_order)"
"is_weekend"
"is_holiday"
"economic_index"
"weather_impact_score"
"promotion_active"

'promotion_active'

In [12]:
# import holidays

In [13]:
# Define the start date of the time range
start_date = pd.to_datetime('2023-03-01')

# Define the end date
end_date = pd.to_datetime('2025-03-31')

# Create a daily date range from start_date to end_date
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Specify the number of rows in the dataset (i.e., total number of records to generate)
n_rows = 1500

In [14]:
# Randomly sample dates from the date range
random_dates = np.random.choice(date_range, size=n_rows, replace=True)

# Create a DataFrame with the sampled dates
date_df = pd.DataFrame({
    'LPO': random_dates
})

In [15]:
date_df['day_classification'] = date_df['LPO'].apply(create_data_functions.day_classification)
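
`create_data_functions.day_classification` is a project helper whose code is not included here. Based only on the labels visible in the output (`Weekdays`, `Saturday`, `Sunday`, `Holiday`), a minimal stand-in might look like this; the function name `classify_day` and the one-entry holiday set are assumptions.

```python
import pandas as pd

# Assumption: a tiny hand-written holiday set; the real helper may use a full calendar
HOLIDAYS = {pd.Timestamp("2023-05-01")}

def classify_day(date):
    """Return 'Holiday', 'Saturday', 'Sunday' or 'Weekdays' for a date."""
    date = pd.Timestamp(date).normalize()
    if date in HOLIDAYS:
        return "Holiday"
    if date.weekday() == 5:
        return "Saturday"
    if date.weekday() == 6:
        return "Sunday"
    return "Weekdays"

labels = [classify_day(d) for d in ["2023-05-01", "2023-03-18", "2023-07-02", "2025-03-06"]]
print(labels)  # ['Holiday', 'Saturday', 'Sunday', 'Weekdays']
```

Note that there is no single `Weekend` label in this scheme, so downstream code has to test for `Saturday` and `Sunday` explicitly.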

In [16]:
date_df[['temperature_c', 'precipitation_mm', 'condition_weather']] = date_df['LPO'].apply(lambda x: pd.Series(create_data_functions.simulate_weather(x)))

In [17]:
date_df

Unnamed: 0,LPO,day_classification,temperature_c,precipitation_mm,condition_weather
0,2025-03-06,Weekdays,25,0,Sun and Heat
1,2023-05-01,Holiday,26,0,Pleasant
2,2024-02-29,Weekdays,30,5,Rain and Heat
3,2025-03-14,Weekdays,35,10,Rain and Heat
4,2024-06-11,Weekdays,27,0,Pleasant
...,...,...,...,...,...
1495,2025-01-21,Weekdays,33,0,Sun and Heat
1496,2023-07-02,Sunday,20,1,Cold and Rainy
1497,2023-08-09,Weekdays,18,0,Cold
1498,2025-02-27,Weekdays,30,0,Sun and Heat


# Creating the DataFrame randomly

In [18]:
supply_df.columns

Index(['product', 'product_id', 'category', 'sub_category', 'shelf_life_days',
       'min_stock', 'max_stock', 'seasonality', 'storage_recommendation',
       'unit_of_measurement', 'barcode_ean', 'reorder_point', 'supplier',
       'supplier_id', 'distance_km'],
      dtype='object')

In [19]:
# 3. Function to populate the dates with randomly sampled supply records
def simular_stock_diario(date_df, supply_df, max_produtos_por_dia=6):
    """
    Simulate daily stock, taking seasonal and weather factors into account.
    """
    registros_stock = []
    
    for _, date_row in date_df.iterrows():
        data = date_row['LPO']
        temperatura = date_row['temperature_c']
        precipitacao = date_row['precipitation_mm']
        condicao_clima = date_row['condition_weather']
        dia_semana = date_row['day_classification']
        
        # condition_weather holds labels such as 'Rain and Heat' or 'Cold and Rainy',
        # so match on a substring; an exact test for 'Rainy'/'Stormy' never fires
        chuva = any(k in str(condicao_clima) for k in ('Rain', 'Storm'))
        
        # Number of products for this day (affected by the weather)
        if chuva:
            n_produtos = np.random.randint(1, max_produtos_por_dia - 1)
        else:
            n_produtos = np.random.randint(2, max_produtos_por_dia + 1)
        
        # Sample products at random. The seed is the day of the month, so dates that
        # share a day number and a sample size draw the same products.
        produtos_amostrados = supply_df.sample(n=n_produtos, replace=True, 
                                             random_state=int(data.day))
        
        for _, produto in produtos_amostrados.iterrows():
            # Base stock drawn between min_stock and max_stock
            stock_base = np.random.randint(produto['min_stock'], produto['max_stock'] + 1)
            
            # Adjust stock for temperature (heat-sensitive products).
            # Assumption: 'Refrigerated' items stand in for heat-sensitive products;
            # the categories 'Eletrônico' and 'Mobile' do not exist in this catalog.
            if produto['storage_recommendation'] == 'Refrigerated':
                if temperatura > 30:
                    stock_base = max(produto['min_stock'], int(stock_base * 0.8))
            
            # Adjust stock for precipitation (products that need dry storage).
            # The catalog uses English text such as 'Cool, dry place ...', not 'Seca'.
            if precipitacao > 5 and 'dry' in str(produto['storage_recommendation']).lower():
                stock_base = max(produto['min_stock'], int(stock_base * 0.9))
            
            # Daily sales (affected by the day of the week and the weather).
            # day_classification uses 'Saturday'/'Sunday' rather than 'Weekend'.
            fator_vendas = 0.7 if dia_semana in ('Saturday', 'Sunday') else 0.5
            # max(1, ...) keeps the upper bound of randint valid for very small stocks
            vendas_base = np.random.randint(0, max(1, int(stock_base * fator_vendas)))
            
            if chuva:
                vendas_dia = max(0, int(vendas_base * 0.6))
            else:
                vendas_dia = vendas_base
            
            # Compute closing stock
            stock_final = stock_base - vendas_dia
            
            # Stock status
            if stock_final <= produto['reorder_point']:
                status_stock = 'REORDER'
            elif stock_final <= produto['min_stock'] * 1.2:
                status_stock = 'LOW'
            else:
                status_stock = 'OK'
            
            registro = {
                'LPO': data,
                'day_classification': dia_semana,
                'temperature_c': temperatura,
                'precipitation_mm': precipitacao,
                'condition_weather': condicao_clima,
                'product': produto['product'],
                'product_id': produto['product_id'],
                'category': produto['category'],
                'sub_category': produto['sub_category'],
                'stock_inicial': stock_base,
                'vendas_dia': vendas_dia,
                'stock_final': stock_final,
                'min_stock': produto['min_stock'],
                'max_stock': produto['max_stock'],
                'reorder_point': produto['reorder_point'],
                'status_stock': status_stock,
                'supplier': produto['supplier'],
                'distance_km': produto['distance_km']
            }
            
            registros_stock.append(registro)
    
    return pd.DataFrame(registros_stock)
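
The stock status rule inside the function can be isolated and checked by hand. The sketch below restates the same thresholds in a small helper (the name `classify_stock` is hypothetical) and applies it to four rows taken from the simulation output shown further down.

```python
def classify_stock(stock_final, reorder_point, min_stock):
    """Same thresholds as in simular_stock_diario: REORDER, then LOW, then OK."""
    if stock_final <= reorder_point:
        return "REORDER"
    if stock_final <= min_stock * 1.2:
        return "LOW"
    return "OK"

# (stock_final, reorder_point, min_stock) for Lime, Cod, Parmesan Cheese, Lime
rows = [(18, 25, 25), (12, 6, 10), (35, 20, 25), (26, 25, 25)]
statuses = [classify_stock(*r) for r in rows]
print(statuses)  # ['REORDER', 'LOW', 'OK', 'LOW']
```

Because the reorder test comes first, a product whose `reorder_point` is at or above `min_stock * 1.2` can never be labelled `LOW`.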



In [22]:
date_df.head(2)

Unnamed: 0,LPO,day_classification,temperature_c,precipitation_mm,condition_weather
0,2025-03-06,Weekdays,25,0,Sun and Heat
1,2023-05-01,Holiday,26,0,Pleasant


In [20]:
simular_stock_diario(date_df, supply_df)

Unnamed: 0,LPO,day_classification,temperature_c,precipitation_mm,condition_weather,product,product_id,category,sub_category,stock_inicial,vendas_dia,stock_final,min_stock,max_stock,reorder_point,status_stock,supplier,distance_km
0,2025-03-06,Weekdays,25,0,Sun and Heat,Peanut Oil,1984964|P,Oils & Condiments,Oils,70,4,66,50,100,50,OK,Santa Fe Distributor,1427
1,2025-03-06,Weekdays,25,0,Sun and Heat,Lime,1509912|P,Fresh Foods,Fruits,27,9,18,25,50,25,REORDER,BioSupply,421
2,2025-03-06,Weekdays,25,0,Sun and Heat,Parmesan Cheese,1258992|P,Dairy,Cheeses,49,14,35,25,50,20,OK,QualityFood Suppliers,210
3,2025-03-06,Weekdays,25,0,Sun and Heat,Cod,1195211|P,Meats & Fish,Fish,20,8,12,10,25,6,LOW,SupplyMaster Foods,1586
4,2025-03-06,Weekdays,25,0,Sun and Heat,Lime,1509912|P,Fresh Foods,Fruits,37,11,26,25,50,25,LOW,BioSupply,421
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5848,2023-03-18,Saturday,29,5,Rain and Heat,Kale,1477780|P,Fresh Foods,Leafy Greens,13,1,12,10,25,8,LOW,Metropolis Distributor,135
5849,2023-03-18,Saturday,29,5,Rain and Heat,Lettuce,1416652|P,Fresh Foods,Leafy Greens,25,1,24,10,25,9,OK,GreenFields Co.,127
5850,2023-03-18,Saturday,29,5,Rain and Heat,Egg (Quail),1053483|P,Fresh Foods,Eggs,26,2,24,25,50,20,LOW,MegaSupply Commerce,472
5851,2023-03-18,Saturday,29,5,Rain and Heat,Anchovies,1790934|P,Meats & Fish,Fish,10,1,9,10,25,5,LOW,QualityMax Supplier,1173


In [21]:
supply_df

Unnamed: 0,product,product_id,category,sub_category,shelf_life_days,min_stock,max_stock,seasonality,storage_recommendation,unit_of_measurement,barcode_ean,reorder_point,supplier,supplier_id,distance_km
0,Strawberries,1955116|P,Fresh Foods,Fruits,5,10,25,"[July, August, September, October, November]",Refrigerated,unit,8712345000018,10,FreshHarvest Ltd.,1311632|S,84
1,Strawberries,1955116|P,Fresh Foods,Fruits,5,10,25,"[July, August, September, October, November]",Refrigerated,unit,8712345000018,10,PrimeProduce,1280020|S,238
2,Strawberries,1955116|P,Fresh Foods,Fruits,5,10,25,"[July, August, September, October, November]",Refrigerated,unit,8712345000018,10,AgroPrime Foods,1476541|S,101
3,Spinach,1445175|P,Fresh Foods,Leafy Greens,5,10,25,[],Refrigerated,bunch,8712345000025,8,GreenFields Co.,1893979|S,127
4,Spinach,1445175|P,Fresh Foods,Leafy Greens,5,10,25,[],Refrigerated,bunch,8712345000025,8,UrbanFarmers,1769925|S,95
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,Coconut Sugar,1024883|P,Oils & Condiments,Condiments,9999,100,250,[],"Cool, dry place in an airtight container",kg,8712345001114,80,GlobalFoods,1182354|S,1450
179,Coconut Sugar,1024883|P,Oils & Condiments,Condiments,9999,100,250,[],"Cool, dry place in an airtight container",kg,8712345001114,80,North Brazil Distributor,1409654|S,1943
180,Oatmeal Biscuit,1852569|P,Breads & Biscuits,Biscuits,240,100,250,[],"Cool, dry place in an airtight container",box,8712345001121,100,Sunrise Traders,1101728|S,1890
181,Butter Biscuit,1221436|P,Breads & Biscuits,Biscuits,240,100,250,[],"Cool, dry place in an airtight container",box,8712345001138,100,Plain Distributor,1898182|S,1254


In [None]:
supply_df

In [None]:
supply_explode = supply_df.explode('seasonality')

In [None]:
current_month = date_df['LPO'].dt.month_name()


In [None]:
date_df

In [None]:
supply_df['seasonality'][0][1]
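
The cells above explore the `seasonality` lists and the month of each date. One possible way to combine them is an in-season flag, sketched here on toy data. Treating an empty list as "never in season" is an assumption; it could equally mean "available all year".

```python
import pandas as pd

toy_supply = pd.DataFrame({
    "product": ["Strawberries", "Spinach"],
    "seasonality": [["July", "August", "September", "October", "November"], []],
})
toy_dates = pd.DataFrame({"LPO": pd.to_datetime(["2024-08-15", "2024-02-10"])})
toy_dates["month"] = toy_dates["LPO"].dt.month_name()

# Pair every date with every product, then test month membership row by row
pairs = toy_dates.merge(toy_supply, how="cross")
pairs["in_season"] = [m in s for m, s in zip(pairs["month"], pairs["seasonality"])]
```

Only the August date paired with Strawberries is flagged; the other three combinations are `False`.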

In [None]:
# Generate received dates (with a more realistic distribution: more deliveries on weekdays)
date_received_ts = np.zeros(n_rows, dtype=np.int64)

# Epoch seconds (UTC) of the first Monday on or after start_date, so that a
# day_offset of 0 really lands on a Monday
first_monday = start_date + pd.Timedelta(days=(7 - start_date.weekday()) % 7)
start_ts = int(first_monday.timestamp())

for i in range(n_rows):
    # 80% chance of being a weekday (Monday to Friday)
    if np.random.random() < 0.8:
        # Weekday: normal distribution centered around Wednesday
        day_offset = int(np.random.normal(2, 1.5))  # 0=Mon, 1=Tue, 2=Wed, 3=Thu, 4=Fri
        day_offset = max(0, min(4, day_offset))  # Clamp between 0 and 4
    else:
        # Weekend: Saturday or Sunday
        day_offset = np.random.choice([5, 6])
    
    # Select a random week within the first 52 weeks of the range
    week_offset = np.random.randint(0, 52) * 7
    base_date_ts = start_ts + (week_offset + day_offset) * 86400
    
    # Add hour variation (deliveries usually in the morning)
    hour = int(np.random.normal(10, 2))  # Mean 10am, standard deviation 2h
    hour = max(6, min(18, hour))  # Clamp between 6am and 6pm
    
    date_received_ts[i] = base_date_ts + hour * 3600


In [None]:
# Initialize array to store last order timestamps for each product
last_order_ts = np.zeros(n_rows, dtype=np.int64)

# Initialize array to store expiration timestamps for each product
expiration_ts = np.zeros(n_rows, dtype=np.int64)


In [None]:
products.columns

In [None]:
# Convert received timestamps to datetime format
date_received = pd.to_datetime(date_received_ts, unit='s')

# Convert last order timestamps to datetime format
last_order = pd.to_datetime(last_order_ts, unit='s')

# Convert expiration timestamps to datetime format
expiration = pd.to_datetime(expiration_ts, unit='s')


In [None]:
# Adjust for seasonal patterns
# NOTE: df_synthetic is assumed to already hold the Date_Received, Last_Order_Date,
# Expiration_Date and Category columns; it is not built in the cells above.

# Fruits and vegetables have shorter shelf life during summer (due to heat)
summer_mask = (df_synthetic['Date_Received'].dt.month.isin([6, 7, 8])) & (df_synthetic['Category'] == 'Fruits & Vegetables')
# Draw one reduction per affected row instead of a single value shared by all rows
df_synthetic.loc[summer_mask, 'Expiration_Date'] -= pd.to_timedelta(
    np.random.randint(2, 5, size=summer_mask.sum()), unit='d')

# Dairy products have shorter lead time during winter (lower spoilage risk)
winter_mask = (df_synthetic['Date_Received'].dt.month.isin([12, 1, 2])) & (df_synthetic['Category'] == 'Dairy')
df_synthetic.loc[winter_mask, 'Last_Order_Date'] += pd.to_timedelta(
    np.random.randint(1, 3, size=winter_mask.sum()), unit='d')


In [None]:
# Add some outliers (3% of the data): unusual situations
outlier_mask = np.random.random(n_rows) < 0.03

# Apply early order dates for outlier records (one random shift per row)
df_synthetic.loc[outlier_mask, 'Last_Order_Date'] -= pd.to_timedelta(
    np.random.randint(15, 30, size=outlier_mask.sum()), unit='d')

# Apply reduced shelf life for perishable outlier products
perishable_outliers = outlier_mask & df_synthetic['Category'].isin(['Fruits & Vegetables', 'Seafood'])
df_synthetic.loc[perishable_outliers, 'Expiration_Date'] -= pd.to_timedelta(
    np.random.randint(3, 7, size=perishable_outliers.sum()), unit='d')

# Ensure Last_Order_Date is never later than Date_Received
date_inconsistency = df_synthetic['Last_Order_Date'] > df_synthetic['Date_Received']
df_synthetic.loc[date_inconsistency, 'Last_Order_Date'] = df_synthetic.loc[date_inconsistency, 'Date_Received'] - pd.to_timedelta(
    np.random.randint(1, 5, size=date_inconsistency.sum()), unit='d')

# Ensure Expiration_Date is always later than Date_Received
exp_inconsistency = df_synthetic['Expiration_Date'] <= df_synthetic['Date_Received']
df_synthetic.loc[exp_inconsistency, 'Expiration_Date'] = df_synthetic.loc[exp_inconsistency, 'Date_Received'] + pd.to_timedelta(
    np.random.randint(1, 10, size=exp_inconsistency.sum()), unit='d')


# Create the Complete Supply Chain Table

In [None]:
df = supply_df.copy()
df.head()

In [None]:
# df.insert(7,'stock_quantity', create_data_functions.create_stock_distribution_vectorized(stock_min=df['min_stock'], 
#                                                                               stock_max=df['max_stock'], 
#                                                                               seed=265, 
#                                                                               prob_stock = [0.12, 0.28, 0.60], 
#                                                                               prob_extreme = [0.68, 0.27, 0.05] ))

In [None]:
df.min_stock.min()

In [None]:
df.max_stock.max()

In [None]:
df[df.stock_quantity < 10].shape

In [None]:
df[df.stock_quantity < 10].shape[0]/ df.shape[0]

In [None]:
df[df.stock_quantity > 250].shape

In [None]:
df[df.stock_quantity > 250].shape[0] / df.shape[0]

In [None]:
# Function to calculate the suggested selling price
def calculate_selling_price(product):
    """
        Calculate the suggested selling price for a product based on its supply cost and category-specific rates.

        Parameters:
        ----------
        product : object
            An object representing a product, expected to have the attributes:
            - supply_price (float): The purchase cost of the product.
            - category (str): The product's category, used to look up rates.

        Returns:
        -------
        float
            The suggested selling price, calculated using logistics, loss, and markup rates
            specific to the product's category.

        Calculation:
        -----------
        - Actual Unit Cost = supply_price * (1 + logistics_rate) / (1 - loss_rate)
        - Suggested Price = Actual Unit Cost * markup

        Reference Tables:
        ----------------
        - logistics_table: % increase due to logistics per category.
        - loss_table: % expected loss per category.
        - markup_table: multiplier to determine final selling price per category.
    """
    
    # Retrieve purchase cost and category-specific rates from reference tables
    purchase_cost = product.supply_price
    category = product.category
    logistics_rate = logistics_table[category]
    loss_rate = loss_table[category]
    markup = markup_table[category]

    # Calculate the Actual Unit Cost
    actual_unit_cost = purchase_cost * (1 + logistics_rate) / (1 - loss_rate)

    # Calculate the Suggested Selling Price
    suggested_price = actual_unit_cost * markup

    return suggested_price



# Reference tables (implemented as Python dictionaries)
logistics_table = {
    'Produce': 0.07,
    'Meats and Fish': 0.06,
    'Dairy and Cold Cuts': 0.06,
    'Grains and Flours': 0.01,
    'Beverages': 0.01,
    'Oils and Fats': 0.01,
    'Eggs and Poultry': 0.05,
    'Sugars and Sweets': 0.01,
    'Miscellaneous and Biscuits': 0.01
}

loss_table = {
    'Produce': 0.0610,
    'Meats and Fish': 0.0375,
    'Grains and Flours': 0.0153,
    'Beverages': 0.0153,
    'Oils and Fats': 0.0153,
    'Dairy and Cold Cuts': 0.01,  # Assumed value for example
    'Eggs and Poultry': 0.01,     # Assumed value for example
    'Sugars and Sweets': 0.0153,
    'Miscellaneous and Biscuits': 0.0153
}

markup_table = {
    'Produce': 2.50,
    'Meats and Fish': 1.43,
    'Dairy and Cold Cuts': 1.39,
    'Grains and Flours': 1.25,
    'Beverages': 1.25,
    'Oils and Fats': 1.25,
    'Eggs and Poultry': 1.33,
    'Sugars and Sweets': 1.54,
    'Miscellaneous and Biscuits': 1.43
}
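
As a worked example of the pricing formula, the sketch below applies the `Produce` rates from the tables above to a hypothetical purchase cost of 10.0 (the cost value is invented for illustration).

```python
# Rates for 'Produce', copied from the reference tables above
logistics_rate = 0.07
loss_rate = 0.0610
markup = 2.50

supply_price = 10.0  # hypothetical purchase cost

# Actual Unit Cost = supply_price * (1 + logistics_rate) / (1 - loss_rate)
actual_unit_cost = supply_price * (1 + logistics_rate) / (1 - loss_rate)

# Suggested Price = Actual Unit Cost * markup
suggested_price = actual_unit_cost * markup

print(round(actual_unit_cost, 2), round(suggested_price, 2))  # 11.4 28.49
```

Dividing by `1 - loss_rate` spreads the cost of the units expected to be lost over the units that are actually sold, which is why the unit cost rises above the logistics-adjusted 10.70.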


In [None]:
# Calculate and round the suggested sell price for each row using the calculate_selling_price function
base_df['sell_price'] = round(base_df.apply(calculate_selling_price, axis=1))

In [None]:
# Reorder point: add a reorder_point field (usually 20-30% of maximum stock)

In [None]:
len(products)

In [None]:
def calculate_reorder_quantity(product):
    # Based on weekly demand, lead time and shelf life
    base_quantity = product['weekly_demand_avg'] * (product['lead_time_days'] / 7 + 1)
    
    # Adjust for perishability
    if product['shelf_life_days'] < 14:
        # Highly perishable products: smaller, more frequent orders
        reorder_qty = base_quantity * 0.7
    elif product['shelf_life_days'] < 30:
        # Moderately perishable products
        reorder_qty = base_quantity * 1.0
    else:
        # Non-perishable products: larger orders
        reorder_qty = base_quantity * 1.5
    
    return round(reorder_qty / 5) * 5  # Round to a multiple of 5
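
To see the reorder rule in numbers, the sketch below restates it with plain arguments and evaluates two hypothetical products. The helper name `reorder_quantity` and the demand and lead time values are assumptions; `weekly_demand_avg` and `lead_time_days` are not columns of `supply_df` as shown above.

```python
def reorder_quantity(weekly_demand_avg, lead_time_days, shelf_life_days):
    """Same rule as calculate_reorder_quantity, taking plain numbers."""
    base = weekly_demand_avg * (lead_time_days / 7 + 1)
    if shelf_life_days < 14:
        factor = 0.7   # highly perishable
    elif shelf_life_days < 30:
        factor = 1.0   # moderately perishable
    else:
        factor = 1.5   # non-perishable
    return round(base * factor / 5) * 5

# 70 units/week, 7-day lead time: base = 70 * (1 + 1) = 140
print(reorder_quantity(70, 7, 5))    # 100  (140 * 0.7 = 98, rounded to a multiple of 5)
print(reorder_quantity(70, 7, 240))  # 210  (140 * 1.5)
```

With identical demand and lead time, the perishable item is ordered in less than half the quantity of the shelf-stable one, which is the behaviour the comments in the function describe.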

In [None]:
# Show data information
df.info()

In [None]:
# # Define data paths
# processed_data_path = os.path.join('../data', 'processed')

# utils_data_path = os.path.join('../docs/column_descriptions.json')

In [None]:
# Sort DataFrame by Date_Received in ascending order
# df = df.sort_values(by='Date_Received').reset_index(drop=True)

In [None]:
# # Save Data
# df.to_pickle(processed_data_path + '/grocery.pkl')

# # save Dictionary JSON archive
# with open(utils_data_path, 'w') as f:
#     json.dump(column_inventory, f, indent=4)