## Create Synthetic Data for Grocery Supply Chain

Features of the adjusted data:
Specific categories: only the categories present in the source data are used

Realistic distribution: category frequencies follow those observed in the provided data

Realistic parameters per category:

🥦 **Produce**
- **Lead Time:** 1–3 days (locally sourced), 5–10 days (imported)
- **Shelf Life:** 3–10 days (most fresh items), up to 2 weeks for hardy vegetables like carrots or potatoes

🌾 **Grains and Flours**
- **Lead Time:** 3–7 days (domestic), 10–15 days (imported specialty grains)
- **Shelf Life:** 6 months to 1 year (dry, sealed), up to 2 years for rice and flour stored properly

🧀 **Dairy and Cold Cuts**
- **Lead Time:** 2–5 days (regional suppliers), 7–10 days (specialty cheeses)
- **Shelf Life:**
  - Milk & cream: 7–14 days refrigerated
  - Yogurt & soft cheeses: 2–3 weeks
  - Hard cheeses: 1–3 months
  - Cold cuts: 1–2 weeks sealed

☕ **Beverages**
- **Lead Time:** 2–7 days (coffee/tea distributors)
- **Shelf Life:**
  - Tea: 1–2 years (dry)
  - Coffee beans: 6–12 months (sealed), 1–2 weeks after grinding
  - Brewed drinks: 1–3 days refrigerated

🥚 **Eggs and Poultry**
- **Lead Time:** 1–3 days (local farms), 5–7 days (wholesale)
- **Shelf Life:**
  - Eggs: 3–5 weeks refrigerated
  - Fresh poultry: 1–2 days raw, 3–4 days cooked

🐟 **Meats and Fish**
- **Lead Time:** 1–5 days (fresh), 7–10 days (frozen or imported)
- **Shelf Life:**
  - Fresh fish: 1–2 days
  - Frozen fish: 3–6 months
  - Cured fish (e.g., sardines): up to 1 year

🛢️ **Oils and Fats**
- **Lead Time:** 3–7 days (bulk suppliers)
- **Shelf Life:**
  - Vegetable oils: 6–12 months
  - Butter: 1 month refrigerated, 6 months frozen
  - Coconut oil: up to 2 years

🍬 **Sugars and Sweets**
- **Lead Time:** 2–5 days
- **Shelf Life:**
  - Sugars: indefinite if dry and sealed
  - Dried fruits (e.g., plum): 6–12 months

🍪 **Miscellaneous and Biscuits**
- **Lead Time:** 2–6 days
- **Shelf Life:**
  - Biscuits: 3–6 months sealed


Seasonal patterns:

- Fruits/vegetables with reduced shelf life in summer

- Dairy with shorter lead time in winter

Realistic temporal distribution:

- 80% of deliveries on weekdays

Controlled outliers: only 3% of the records represent unusual situations

These synthetic data preserve the specific characteristics of the categories in your original dataset, with realistic temporal relationships for supply chain analysis.
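
The temporal rules listed above (about 80% of deliveries on weekdays, roughly 3% of records flagged as unusual) could be sketched as follows. This is only an illustration under stated assumptions, not the generator used in this notebook: the function name `sample_delivery_dates` and all of its parameters are hypothetical.

```python
import numpy as np
import pandas as pd

def sample_delivery_dates(start, end, n, weekday_share=0.8, outlier_share=0.03, seed=0):
    """Draw n delivery dates with a target weekday share and flag a few rows as outliers.

    Hypothetical helper: names and defaults are assumptions, not part of the project.
    """
    rng = np.random.default_rng(seed)
    days = pd.date_range(start, end, freq="D")
    is_weekday = np.asarray(days.dayofweek) < 5

    # Give weekdays `weekday_share` of the total probability mass, weekends the rest
    weights = np.where(
        is_weekday,
        weekday_share / is_weekday.sum(),
        (1 - weekday_share) / (~is_weekday).sum(),
    )
    idx = rng.choice(len(days), size=n, p=weights)

    # Each row is independently flagged as an outlier with probability `outlier_share`
    is_outlier = rng.random(n) < outlier_share
    return pd.DataFrame({"received_date": days[idx], "is_outlier": is_outlier})

demo_dates = sample_delivery_dates("2024-01-01", "2024-12-31", n=1000)
```

Because the draws are random, the realized shares only approximate the targets; larger samples get closer.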

## Data Generation
### Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import json

from smart_supply_chain_ai.utils import create_data_functions, combine_df_functions, weather_conditions

import warnings
warnings.filterwarnings('ignore')

### Paths

In [2]:
# Define data paths
raw_data_path = os.path.join('../data', 'raw/')

external_data_path = os.path.join('../data', 'external/')

json_path = os.path.join('../src', 'smart_supply_chain_ai', 'utils/')

In [3]:
# List of JSON filenames (without extension) to be loaded
arch_json = ['products','products_categories', 'suppliers']

# Dictionary to store the loaded JSON content
store_catalog = {}

# Loop through each filename, build the full path, and load the JSON data
for name in arch_json:
    file_path = os.path.join(json_path, f"{name}.json")  # Construct full file path
    with open(file_path, "r", encoding="utf-8") as f:    # Open the JSON file
        store_catalog[name] = json.load(f)               # Load and store the data under its name

In [4]:
store_catalog['products']

{'Strawberries': {'product_id': 'P-001',
  'category': 'Fresh Foods',
  'sub_category': 'Fruits',
  'shelf_life_days': 4,
  'maximum_days_on_sale': 2,
  'min_stock': 10,
  'max_stock': 25,
  'seasonality': ['July', 'August', 'September', 'October', 'November'],
  'storage_recommendation': 'Refrigerated',
  'unit_of_measurement': 'lb',
  'barcode_ean': '8712345000018',
  'reorder_point': 10},
 'Spinach': {'product_id': 'P-002',
  'category': 'Fresh Foods',
  'sub_category': 'Leafy Greens',
  'shelf_life_days': 7,
  'maximum_days_on_sale': 4,
  'min_stock': 10,
  'max_stock': 25,
  'seasonality': [],
  'storage_recommendation': 'Refrigerated',
  'unit_of_measurement': 'bunch',
  'barcode_ean': '8712345000025',
  'reorder_point': 8},
 'Mushrooms': {'product_id': 'P-003',
  'category': 'Fresh Foods',
  'sub_category': 'Vegetables',
  'shelf_life_days': 4,
  'maximum_days_on_sale': 2,
  'min_stock': 10,
  'max_stock': 25,
  'seasonality': [],
  'storage_recommendation': 'Refrigerated, in a 

### Product catalog information

In [5]:
# Create a DataFrame of products with product names as a column
products = pd.DataFrame.from_dict(store_catalog['products']).T.reset_index().rename(columns={'index': 'product'})


In [6]:
# Generate randomized inventory thresholds (min_stock, max_stock, reorder_point) for each product
# based on existing baseline values, and update the DataFrame with these new columns
products = create_data_functions.create_min_max_stock(
    df=products,
    base_min=products['min_stock'],
    base_max=products['max_stock'],
    base_reorder=products['reorder_point']
)
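
The body of `create_min_max_stock` is not shown in this notebook. As a rough idea of what such a step might do, the sketch below perturbs the baseline thresholds by a small random amount while keeping `min_stock <= reorder_point <= max_stock`. The function name `jitter_stock_thresholds`, the `spread` parameter, and the ordering constraint are all assumptions.

```python
import numpy as np
import pandas as pd

def jitter_stock_thresholds(df, spread=2, seed=0):
    """Return a copy of df with randomly perturbed stock thresholds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n = len(out)
    base_min = out["min_stock"].to_numpy(dtype=int)
    base_max = out["max_stock"].to_numpy(dtype=int)
    base_reorder = out["reorder_point"].to_numpy(dtype=int)

    # Shift each baseline by a random integer in [-spread, spread]
    new_min = np.maximum(base_min + rng.integers(-spread, spread + 1, n), 0)
    new_max = np.maximum(base_max + rng.integers(-spread, spread + 1, n), new_min + 1)
    # Keep the reorder point between the new minimum and maximum
    new_reorder = np.clip(base_reorder + rng.integers(-spread, spread + 1, n), new_min, new_max)

    out["min_stock"], out["max_stock"], out["reorder_point"] = new_min, new_max, new_reorder
    return out

demo_products = pd.DataFrame({
    "product": ["Strawberries", "Spinach"],
    "min_stock": [10, 10],
    "max_stock": [25, 25],
    "reorder_point": [10, 8],
})
jittered = jitter_stock_thresholds(demo_products)
```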


In [7]:
# Replace the catalog product IDs with newly generated IDs
products['product_id'] = create_data_functions.create_IDs(products.shape[0], suffix='P')
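
The outputs further down show IDs such as `1812351|P`, that is, a seven-digit number, a pipe, and the suffix. The implementation of `create_IDs` is not shown here; a minimal sketch that produces IDs of the same shape, assuming uniqueness is required, might look like this (`make_ids` and its `seed` parameter are hypothetical):

```python
import numpy as np

def make_ids(n, suffix, seed=0):
    """Generate n unique IDs of the form '<7-digit number>|<suffix>' (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sample without replacement so that no two IDs collide
    numbers = rng.choice(9_000_000, size=n, replace=False) + 1_000_000
    return [f"{number}|{suffix}" for number in numbers]

demo_ids = make_ids(5, suffix="P")
```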

### Supplier catalog and distribution details

In [8]:
# Create a DataFrame of suppliers with supplier names as a column
suppliers = pd.DataFrame.from_dict(store_catalog['suppliers']).T.reset_index().rename(columns={'index': 'supplier'})

In [9]:
# Insert supplier IDs as the second column
suppliers.insert(1, 'supplier_id', create_data_functions.create_IDs(suppliers.shape[0], suffix='S'))

In [10]:
# Remove 'category' and 'subcategories' columns from the suppliers DataFrame
suppliers.drop(columns=['category', 'subcategories'], inplace=True)


In [11]:
# Split each supplier's product list into separate rows and reset the index
suppliers = suppliers.explode('products').reset_index(drop=True)


In [12]:
# Merge product and supplier data on matching product names, then drop duplicate 'products' column from suppliers
products_suppliers_df = pd.merge(products, suppliers, left_on='product', right_on='products').drop(columns='products')
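
`pd.merge` performs an inner join by default, so any product without a matching supplier entry is dropped silently. One way to check for that, shown here on toy data rather than the real catalog, is a left join with `indicator=True`:

```python
import pandas as pd

# Toy frames for illustration only
products_demo = pd.DataFrame({"product": ["Strawberries", "Spinach", "Mushrooms"]})
suppliers_demo = pd.DataFrame({
    "supplier": ["FreshHarvest Ltd.", "GreenFields Co."],
    "products": ["Strawberries", "Spinach"],
})

# A left join keeps every product; '_merge' records where each row came from
check = pd.merge(
    products_demo, suppliers_demo,
    left_on="product", right_on="products",
    how="left", indicator=True,
)
unmatched = check.loc[check["_merge"] == "left_only", "product"].tolist()
print(unmatched)  # ['Mushrooms']
```

An empty `unmatched` list means the inner join loses no products.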

In [13]:
# Initialize a random number generator with a fixed seed for reproducibility.
rng = np.random.default_rng(seed=43)

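# Assign a random rating between 1 and 4 to every product-supplier row (the upper bound 5 is exclusive).
# Note: ratings are drawn per row, so a supplier listed for several products can receive different ratings.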
products_suppliers_df['supplier_rating'] = rng.integers(1, 5, size=products_suppliers_df.shape[0])

# Randomly select 15 unique suppliers to be considered "top suppliers".
suppliers_top = rng.choice(products_suppliers_df['supplier'].unique(), 15, replace=False)

# Update ratings: if the supplier is in the top list, set rating to 5; otherwise keep the original rating.
products_suppliers_df['supplier_rating'] = np.where(products_suppliers_df['supplier'].isin(suppliers_top), 5, products_suppliers_df['supplier_rating'])
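
Since the ratings above are drawn per row, a non-top supplier that appears on several rows may end up with more than one rating. If one rating per supplier is preferred, a possible variant is to draw ratings for the unique supplier names and map them back. The sketch below is an alternative under that assumption, not the notebook's method; `assign_supplier_ratings` and `n_top` are hypothetical names.

```python
import numpy as np
import pandas as pd

def assign_supplier_ratings(df, n_top=2, seed=43):
    """Return one rating per supplier, mapped onto every row of df (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    names = np.asarray(df["supplier"].unique(), dtype=object)

    # One random rating in 1..4 per unique supplier
    ratings = pd.Series(rng.integers(1, 5, size=len(names)), index=names)

    # Promote a random subset of suppliers to the top rating
    top = rng.choice(names, size=min(n_top, len(names)), replace=False)
    ratings.loc[list(top)] = 5

    return df["supplier"].map(ratings)

demo_suppliers = pd.DataFrame({"supplier": ["A", "A", "B", "C", "C", "D"]})
demo_suppliers["supplier_rating"] = assign_supplier_ratings(demo_suppliers)
```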


## Meteorological Data for Supply Chain Management

In [14]:
# Set the path to the weather CSV file
# archive_csv = external_data_path + 'dados_83967_D_2015-01-01_2025-09-18.csv'
archive_csv = external_data_path + 'dados_B807_D_2022-12-07_2025-09-22.csv'

# Read the CSV file into a DataFrame
weather_df = pd.read_csv(archive_csv, sep=";", decimal=",", skiprows=9, engine="python")

# Show the first rows of the DataFrame
weather_df.head()

Unnamed: 0,Data Medicao,"PRECIPITACAO TOTAL, DIARIO (AUT)(mm)","TEMPERATURA MAXIMA, DIARIA (AUT)(°C)","TEMPERATURA MINIMA, DIARIA (AUT)(°C)","VENTO, VELOCIDADE MEDIA DIARIA (AUT)(m/s)",Unnamed: 5
0,2022-12-07,,,,,
1,2022-12-08,,,,,
2,2022-12-09,,32.3,17.4,2.4,
3,2022-12-10,0.0,30.1,18.3,2.2,
4,2022-12-11,0.0,31.8,22.7,2.8,
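
To see what the `read_csv` arguments do, here is a small self-contained example on an in-memory string. The metadata lines and shortened column names are invented for the demo (the real file has 9 metadata lines, hence `skiprows=9`). A trailing `;` on each line is what produces the empty `Unnamed: N` column.

```python
import io
import pandas as pd

raw = (
    "Station: DEMO\n"       # invented metadata line
    "Frequency: daily\n"    # invented metadata line
    "Data Medicao;PRECIPITACAO;TEMP_MAX;\n"
    "2022-12-09;;32,3;\n"
    "2022-12-10;0,0;30,1;\n"
)

# sep=";" splits fields, decimal="," parses 32,3 as 32.3, skiprows skips the metadata
demo_weather = pd.read_csv(io.StringIO(raw), sep=";", decimal=",", skiprows=2)

# The trailing separator yields an all-NaN column that can be dropped afterwards
demo_clean = demo_weather.dropna(axis=1, how="all")
```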


In [15]:
# Remove columns that contain only missing values
weather_df.dropna(axis=1, how='all', inplace=True)


In [16]:
# Rename columns to clear and descriptive English names
weather_df.columns = [
    "measurement_date",
    "daily_total_precipitation_mm",
    "daily_maximum_temperature_c",
    "daily_minimum_temperature_c",
    "daily_average_wind_speed_mps"
]


In [17]:
# Set 'measurement_date' as index and remove rows with all missing values
weather_df = weather_df.set_index('measurement_date').dropna(how='all')

# Show the first rows of the DataFrame
weather_df.head()


measurement_date,daily_total_precipitation_mm,daily_maximum_temperature_c,daily_minimum_temperature_c,daily_average_wind_speed_mps
2022-12-09,,32.3,17.4,2.4
2022-12-10,0.0,30.1,18.3,2.2
2022-12-11,0.0,31.8,22.7,2.8
2022-12-12,0.0,37.3,19.9,2.9
2022-12-13,6.8,25.3,17.6,3.7


In [18]:
# Count the missing (NaN) values in each column of weather_df.
weather_df.isna().sum()

daily_total_precipitation_mm    5
daily_maximum_temperature_c     2
daily_minimum_temperature_c     0
daily_average_wind_speed_mps    6
dtype: int64

In [19]:
# Fill missing values in 'daily_total_precipitation_mm' using backward fill (next valid value).
# Assigning the result back avoids the deprecated fillna(method=...) form and chained in-place updates.
weather_df['daily_total_precipitation_mm'] = weather_df['daily_total_precipitation_mm'].bfill()

# Fill missing values in 'daily_maximum_temperature_c' using backward fill,
# then forward fill any values that remain (e.g., gaps at the end of the series).
weather_df['daily_maximum_temperature_c'] = weather_df['daily_maximum_temperature_c'].bfill().ffill()

# Fill missing values in 'daily_average_wind_speed_mps' using forward fill (previous valid value).
weather_df['daily_average_wind_speed_mps'] = weather_df['daily_average_wind_speed_mps'].ffill()


In [20]:
# Reset the index so that 'measurement_date' becomes a regular column again, with a default integer index.
weather_df = weather_df.reset_index()

# Convert 'measurement_date' column to datetime format
weather_df['measurement_date'] = pd.to_datetime(weather_df['measurement_date'])

In [21]:
# Display summary information about the DataFrame
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 992 entries, 0 to 991
Data columns (total 5 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   measurement_date              992 non-null    datetime64[ns]
 1   daily_total_precipitation_mm  992 non-null    float64       
 2   daily_maximum_temperature_c   992 non-null    float64       
 3   daily_minimum_temperature_c   992 non-null    float64       
 4   daily_average_wind_speed_mps  992 non-null    float64       
dtypes: datetime64[ns](1), float64(4)
memory usage: 38.9 KB


In [22]:
# Drop any rows that still contain missing values (a safeguard; none remain after the fills above)
weather_df.dropna(inplace=True)

In [23]:
weather_df['measurement_date'].min()

Timestamp('2022-12-09 00:00:00')

In [24]:
# Apply the weather classification function to the cleaned DataFrame to generate severity and category labels
weather_analyser = weather_conditions.WeatherConditions(weather_df)
weather_severity_df = weather_analyser.classify_weather()
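
The internals of `WeatherConditions.classify_weather` are not shown in this notebook. A binning step of this kind is often written with `pd.cut`; the sketch below does that for precipitation only. The thresholds (5 mm and 10 mm) and the function name are assumptions chosen for illustration, not the values used by the project.

```python
import numpy as np
import pandas as pd

def classify_precipitation(mm):
    """Bin daily precipitation in mm into labels; thresholds are illustrative assumptions."""
    bins = [-np.inf, 0.0, 5.0, 10.0, np.inf]
    labels = ["No precipitation", "Light Rain", "Moderate Rain", "Heavy Rain"]
    # Intervals are right-closed: (-inf, 0], (0, 5], (5, 10], (10, inf)
    return pd.cut(mm, bins=bins, labels=labels)

demo_rain = pd.Series([0.0, 1.6, 8.4, 13.4])
demo_labels = classify_precipitation(demo_rain)
```

The result is a categorical Series, which keeps the label order available for later sorting or comparison.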

In [25]:
# Show 10 random sample rows of the DataFrame
weather_severity_df.sample(10)

Unnamed: 0,measurement_date,daily_total_precipitation_mm,daily_maximum_temperature_c,daily_minimum_temperature_c,daily_average_wind_speed_mps,daily_average_temperature_c,temperature_classification,precipitation_classification,wind_classification,weather_severity
28,2023-01-06,1.6,29.6,18.7,4.1,24.15,Warm,Light Rain,Gentle to Fresh Breeze,Moderate
780,2025-01-27,0.0,31.5,19.2,3.0,25.35,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
846,2025-04-03,8.4,26.5,20.6,1.7,23.55,Mild to Temperate,Moderate Rain,Gentle to Fresh Breeze,Moderate
692,2024-10-31,0.0,30.4,18.2,3.8,24.3,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
290,2023-09-25,13.4,18.0,15.6,3.4,16.8,Cool,Heavy Rain,Gentle to Fresh Breeze,Severe
217,2023-07-14,20.8,14.1,9.7,2.5,11.9,Cool,Heavy Rain,Gentle to Fresh Breeze,Severe
679,2024-10-18,0.2,22.0,12.8,1.6,17.4,Mild to Temperate,Light Rain,Gentle to Fresh Breeze,Moderate
974,2025-09-05,22.2,13.5,11.5,3.2,12.5,Cool,Heavy Rain,Gentle to Fresh Breeze,Severe
436,2024-02-18,0.0,30.6,20.3,3.0,25.45,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
971,2025-09-02,0.0,26.8,18.3,1.0,22.55,Mild to Temperate,No precipitation,Calm / Light Breeze,Normal


In [26]:
# Generate and transpose summary statistics for the numeric and datetime columns in the classified weather DataFrame
weather_severity_df.describe().T

Unnamed: 0,count,mean,min,25%,50%,75%,max,std
measurement_date,992.0,2024-04-20 02:35:19.354838528,2022-12-09 00:00:00,2023-08-13 18:00:00,2024-04-17 12:00:00,2024-12-21 06:00:00,2025-09-22 00:00:00,
daily_total_precipitation_mm,992.0,4.645565,0.0,0.0,0.0,1.65,132.0,12.602568
daily_maximum_temperature_c,992.0,25.034073,10.4,21.175,25.5,29.125,39.2,5.654455
daily_minimum_temperature_c,992.0,15.84496,1.7,12.3,16.6,19.6,25.2,4.928274
daily_average_wind_speed_mps,992.0,2.366331,0.7,1.6,2.3,2.9,7.7,0.96995
daily_average_temperature_c,992.0,20.439516,6.65,16.9375,20.85,24.25,30.9,4.911554


## Realistic supply chain modeling based on weather and product data

In [27]:
# Create a copy of the climate data DataFrame to work with weather-specific analysis
df_weather_conditions = weather_severity_df.copy()

# Create a copy of the products_suppliers data DataFrame to work with product-related operations
df_products = products_suppliers_df.copy()

In [28]:
# Display the first row of the weather DataFrame to preview its structure
df_weather_conditions.head(1)


Unnamed: 0,measurement_date,daily_total_precipitation_mm,daily_maximum_temperature_c,daily_minimum_temperature_c,daily_average_wind_speed_mps,daily_average_temperature_c,temperature_classification,precipitation_classification,wind_classification,weather_severity
0,2022-12-09,0.0,32.3,17.4,2.4,24.85,Warm,No precipitation,Gentle to Fresh Breeze,Moderate


In [29]:
# Preview the first row of the products DataFrame to check column names and initial data
df_products.head(1)

Unnamed: 0,product,product_id,category,sub_category,shelf_life_days,maximum_days_on_sale,min_stock,max_stock,seasonality,storage_recommendation,unit_of_measurement,barcode_ean,reorder_point,supplier,supplier_id,distance_km,supplier_rating
0,Strawberries,1812351|P,Fresh Foods,Fruits,4,2,8,25,"[July, August, September, October, November]",Refrigerated,lb,8712345000018,10,FreshHarvest Ltd.,1926670|S,84,5


In [30]:
# Rename the column 'measurement_date' to 'received_date' so each weather day is treated as a delivery date.
df_weather_conditions.rename(columns={'measurement_date': 'received_date'}, inplace=True)


In [31]:
# Filters df_weather_conditions to retain only relevant weather-related columns for analysis or merging
df_weather_conditions = df_weather_conditions[['received_date', 'temperature_classification', 
                                               'precipitation_classification', 'wind_classification', 
                                               'weather_severity']]

In [32]:
# Display the first few rows of the weather DataFrame
df_weather_conditions.head()

Unnamed: 0,received_date,temperature_classification,precipitation_classification,wind_classification,weather_severity
0,2022-12-09,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
1,2022-12-10,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
2,2022-12-11,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
3,2022-12-12,Warm,No precipitation,Gentle to Fresh Breeze,Moderate
4,2022-12-13,Mild to Temperate,Moderate Rain,Gentle to Fresh Breeze,Moderate


In [33]:
# Display the first few rows of the products DataFrame
df_products.head()

Unnamed: 0,product,product_id,category,sub_category,shelf_life_days,maximum_days_on_sale,min_stock,max_stock,seasonality,storage_recommendation,unit_of_measurement,barcode_ean,reorder_point,supplier,supplier_id,distance_km,supplier_rating
0,Strawberries,1812351|P,Fresh Foods,Fruits,4,2,8,25,"[July, August, September, October, November]",Refrigerated,lb,8712345000018,10,FreshHarvest Ltd.,1926670|S,84,5
1,Strawberries,1812351|P,Fresh Foods,Fruits,4,2,8,25,"[July, August, September, October, November]",Refrigerated,lb,8712345000018,10,PrimeProduce,1219899|S,238,3
2,Strawberries,1812351|P,Fresh Foods,Fruits,4,2,8,25,"[July, August, September, October, November]",Refrigerated,lb,8712345000018,10,AgroPrime Foods,1656636|S,101,5
3,Spinach,1001437|P,Fresh Foods,Leafy Greens,7,4,11,25,[],Refrigerated,bunch,8712345000025,12,GreenFields Co.,1912796|S,127,1
4,Spinach,1001437|P,Fresh Foods,Leafy Greens,7,4,11,25,[],Refrigerated,bunch,8712345000025,12,UrbanFarmers,1603078|S,95,3


In [34]:
# Determine the number of samples based on the length of the weather DataFrame
n_samples = len(df_weather_conditions)

# Randomly select 'n_samples' rows from the df_products DataFrame
# Sampling is done with replacement (same row can be chosen more than once)
# The index is reset to avoid keeping the original row indices
random_samples1 = df_products.sample(n=n_samples, replace=True).reset_index(drop=True)
random_samples2 = df_products.sample(n=n_samples, replace=True).reset_index(drop=True)
random_samples3 = df_products.sample(n=n_samples, replace=True).reset_index(drop=True)
random_samples4 = df_products.sample(n=n_samples, replace=True).reset_index(drop=True)
random_samples5 = df_products.sample(n=n_samples, replace=True).reset_index(drop=True)


In [35]:
# Merge the weather DataFrame with each random sample using their index values
# This aligns rows from both DataFrames based on their position (index)
df_merged1 = df_weather_conditions.merge(random_samples1, left_index=True, right_index=True)
df_merged2 = df_weather_conditions.merge(random_samples2, left_index=True, right_index=True)
df_merged3 = df_weather_conditions.merge(random_samples3, left_index=True, right_index=True)
df_merged4 = df_weather_conditions.merge(random_samples4, left_index=True, right_index=True)
df_merged5 = df_weather_conditions.merge(random_samples5, left_index=True, right_index=True)


In [36]:
# Combine all merged DataFrames into one by stacking them row-wise and resetting the index
df_merged_full = pd.concat([df_merged1, df_merged2, df_merged3, df_merged4, df_merged5], axis=0, ignore_index=True)
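
The sample, merge, and concatenate steps above repeat the same operation five times without a seed, so each run of the notebook yields different rows. A compact, reproducible variant could look like the sketch below. The function name and the `n_rounds` and `seed` parameters are hypothetical, and the toy frames stand in for the real data.

```python
import pandas as pd

def build_weather_product_rows(weather, products, n_rounds=5, seed=42):
    """Pair every weather row with a randomly sampled product row, n_rounds times."""
    weather = weather.reset_index(drop=True)
    parts = []
    for i in range(n_rounds):
        # A different but fixed seed per round keeps the result reproducible
        sampled = products.sample(
            n=len(weather), replace=True, random_state=seed + i
        ).reset_index(drop=True)
        # Positional alignment: row k of weather is paired with row k of the sample
        parts.append(pd.concat([weather, sampled], axis=1))
    return pd.concat(parts, axis=0, ignore_index=True)

weather_demo = pd.DataFrame({"received_date": pd.date_range("2022-12-09", periods=4)})
products_demo = pd.DataFrame({"product": ["Milk", "Banana", "Peas"]})
rows_demo = build_weather_product_rows(weather_demo, products_demo)
```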


In [37]:
# Display DataFrame information
df_merged_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4960 entries, 0 to 4959
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   received_date                 4960 non-null   datetime64[ns]
 1   temperature_classification    4960 non-null   object        
 2   precipitation_classification  4960 non-null   object        
 3   wind_classification           4960 non-null   object        
 4   weather_severity              4960 non-null   object        
 5   product                       4960 non-null   object        
 6   product_id                    4960 non-null   object        
 7   category                      4960 non-null   object        
 8   sub_category                  4960 non-null   object        
 9   shelf_life_days               4960 non-null   object        
 10  maximum_days_on_sale          4960 non-null   object        
 11  min_stock                     

In [38]:
df_merged_full['supplier_rating'].nunique()

5

In [39]:
# Randomly selects 3,000 rows from df_merged_full and resets the index to avoid retaining the original indices
df_raw = df_merged_full.sample(3000).reset_index(drop=True)

In [40]:
df_raw

Unnamed: 0,received_date,temperature_classification,precipitation_classification,wind_classification,weather_severity,product,product_id,category,sub_category,shelf_life_days,...,max_stock,seasonality,storage_recommendation,unit_of_measurement,barcode_ean,reorder_point,supplier,supplier_id,distance_km,supplier_rating
0,2023-04-12,Warm,No precipitation,Gentle to Fresh Breeze,Moderate,Milk,1246179|P,Dairy,Milk,7,...,25,[],Refrigerated,carton,8712345000483,14,SupplyTotal Logistics,1252625|S,1237,5
1,2025-02-26,Warm,Light Rain,Gentle to Fresh Breeze,Moderate,Banana,1532114|P,Fresh Foods,Fruits,3,...,24,[],"Room temperature, away from other fruits",lb,8712345000100,14,AgroExpress Supplies,1113380|S,276,5
2,2023-07-16,Cool,No precipitation,Calm / Light Breeze,Normal,Banana,1532114|P,Fresh Foods,Fruits,3,...,24,[],"Room temperature, away from other fruits",lb,8712345000100,14,BioSupply,1082312|S,421,2
3,2023-09-07,Mild to Temperate,Light Rain,Gentle to Fresh Breeze,Moderate,Milk,1246179|P,Dairy,Milk,7,...,25,[],Refrigerated,carton,8712345000483,14,SupplyQuality Foods,1235356|S,1210,1
4,2025-07-14,Cool,No precipitation,Gentle to Fresh Breeze,Moderate,Peas,1457605|P,Fresh Foods,Vegetables,4,...,25,"[September, October, November, December, Janua...",Refrigerated,pack,8712345000087,12,AgroPrime Foods,1656636|S,101,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,2025-03-26,Mild to Temperate,No precipitation,Gentle to Fresh Breeze,Moderate,Ricotta Cheese,1967642|P,Dairy,Cheeses,14,...,24,[],Refrigerated,tub,8712345000223,10,SupplyWorld Logistics,1530086|S,621,1
2996,2025-05-04,Mild to Temperate,No precipitation,Calm / Light Breeze,Normal,Bell Pepper,1026354|P,Fresh Foods,Vegetables,14,...,25,[],Refrigerated,unit,8712345000612,12,FreshHarvest Ltd.,1926670|S,84,5
2997,2023-09-17,Mild to Temperate,No precipitation,Calm / Light Breeze,Normal,Mushrooms,1747170|P,Fresh Foods,Vegetables,4,...,23,[],"Refrigerated, in a paper bag",pack,8712345000032,11,AgroNova,1258259|S,105,5
2998,2024-10-28,Mild to Temperate,No precipitation,Gentle to Fresh Breeze,Moderate,Cherry,1322487|P,Fresh Foods,Fruits,5,...,25,"[November, December, January]",Refrigerated,pack,8712345000179,11,AgroExpress Foods,1779611|S,142,5


In [41]:
# Defines the desired column order for organizing the dataset, prioritizing product, inventory, and weather-related attributes
reorder_columns = ['received_date', 'seasonality', 'product', 'product_id', 'category', 'sub_category', 
                   'shelf_life_days', 'maximum_days_on_sale', 'min_stock', 'max_stock', 'reorder_point', 
                   'unit_of_measurement', 'barcode_ean', 'supplier_rating', 'supplier', 'supplier_id', 'distance_km', 
                   'storage_recommendation', 'temperature_classification', 'precipitation_classification', 'wind_classification', 'weather_severity']

# Reorders columns according to reorder_columns
df_raw = df_raw[reorder_columns]

In [42]:
# Converts selected inventory-related columns to integer type for numerical operations and consistency
cols_int = ['shelf_life_days', 'min_stock', 'max_stock', 'reorder_point']
df_raw[cols_int] = df_raw[cols_int].astype(int)

# Converts selected categorical columns to 'category' dtype to reduce memory usage and mark them as categorical for later modeling
cols_cat = ['category', 'sub_category', 'unit_of_measurement', 'supplier_rating', 
            'temperature_classification', 'precipitation_classification', 
            'wind_classification', 'weather_severity']
df_raw[cols_cat] = df_raw[cols_cat].astype('category')


In [43]:
# Display a summary of the columns and dtypes
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   received_date                 3000 non-null   datetime64[ns]
 1   seasonality                   3000 non-null   object        
 2   product                       3000 non-null   object        
 3   product_id                    3000 non-null   object        
 4   category                      3000 non-null   category      
 5   sub_category                  3000 non-null   category      
 6   shelf_life_days               3000 non-null   int64         
 7   maximum_days_on_sale          3000 non-null   object        
 8   min_stock                     3000 non-null   int64         
 9   max_stock                     3000 non-null   int64         
 10  reorder_point                 3000 non-null   int64         
 11  unit_of_measurement           

## Generate data about holidays, weekdays, and seasons of the year

In [44]:
# Creates a full copy of df_raw and assigns it to df_date for independent manipulation
df_date = df_raw.copy()

In [45]:
# Classify each date in the 'received_date' column with the 'day_classification' function
# from 'create_data_functions', assigning the result to a new column called 'day_classification'.
df_date['day_classification'] = create_data_functions.day_classification(dates=df_date['received_date'], country='BR')

# Create a boolean column indicating whether the day is classified as a holiday.
df_date['is_holiday'] = df_date['day_classification'] == 'Holiday'

# Create a boolean column indicating whether the day falls on a weekend (Saturday or Sunday).
df_date['is_weekend'] = df_date['received_date'].dt.dayofweek > 4
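
`create_data_functions.day_classification` is a project helper whose code is not shown. The sketch below illustrates one way such a classifier could work when the holiday dates are supplied explicitly. The function name `classify_days`, the `'Weekend'` label, and the rule that a holiday takes precedence over a weekend are assumptions; only `'Weekdays'` and `'Holiday'` appear in this notebook.

```python
import numpy as np
import pandas as pd

def classify_days(dates, holiday_dates):
    """Label each date as 'Holiday', 'Weekend', or 'Weekdays' (hypothetical helper)."""
    dates = pd.to_datetime(pd.Series(dates)).reset_index(drop=True)
    is_holiday = dates.dt.normalize().isin(pd.to_datetime(list(holiday_dates)))
    is_weekend = dates.dt.dayofweek > 4

    # np.select applies the first matching condition, so holidays win over weekends
    labels = np.select(
        [is_holiday.to_numpy(), is_weekend.to_numpy()],
        ["Holiday", "Weekend"],
        default="Weekdays",
    )
    return pd.Series(labels, index=dates.index, name="day_classification")

# 2024-09-07 (Brazil's Independence Day) fell on a Saturday
demo_days = classify_days(["2024-09-06", "2024-09-07", "2024-09-08"], ["2024-09-07"])
```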

In [46]:
def check_seasonality(row):
    """
    Checks whether the received month of a product aligns with its seasonal availability.
    """
    received_month = row['month_name']
    seasonality_list = row['seasonality']
    
    return received_month in seasonality_list


In [47]:
# Extracts the full month name from 'received_date' to support seasonality checks
df_date['month_name'] = df_date['received_date'].dt.month_name()

# Inserts a new column 'in_season' at position 2, indicating whether the product's received month aligns with its seasonal availability
df_date.insert(2, 'in_season', df_date.apply(check_seasonality, axis=1))

# Removes the temporary 'month_name' column and the 'seasonality' list column, which are no longer needed after classification
df_date.drop(columns=['month_name', 'seasonality'], inplace=True)
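
The same check can be written without a row-wise `apply`, which is usually slower on larger frames. The toy example below uses a list comprehension. It also shows that a product with an empty seasonality list is always marked as not in season, which is how year-round items such as Milk end up with `in_season = False`.

```python
import pandas as pd

# Toy data for illustration only
demo_season = pd.DataFrame({
    "received_date": pd.to_datetime(["2024-08-15", "2024-02-10", "2024-02-10"]),
    "seasonality": [["July", "August"], ["July", "August"], []],
})

months = demo_season["received_date"].dt.month_name()
# Membership test per row: is the received month in that row's seasonality list?
demo_season["in_season"] = [
    month in season for month, season in zip(months, demo_season["seasonality"])
]
print(demo_season["in_season"].tolist())  # [True, False, False]
```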

## Generate data for stock quantities and sales volumes

In [48]:
# Creates a full copy
df_stock = df_date.copy()

In [49]:
# Adds a 'sales_demand' column to df_stock by classifying each date using Brazilian holiday calendar and demand heuristics
df_stock['sales_demand'] = create_data_functions.classify_grocery_demand(
    dates=df_stock['received_date'],
    country='BR'
)

In [50]:
df_stock['sales_demand'].unique()

array(['Normal', 'High', 'Very High'], dtype=object)

In [51]:
df_stock.head(1)

Unnamed: 0,received_date,in_season,product,product_id,category,sub_category,shelf_life_days,maximum_days_on_sale,min_stock,max_stock,...,distance_km,storage_recommendation,temperature_classification,precipitation_classification,wind_classification,weather_severity,day_classification,is_holiday,is_weekend,sales_demand
0,2023-04-12,False,Milk,1246179|P,Dairy,Milk,7,4,11,25,...,1237,Refrigerated,Warm,No precipitation,Gentle to Fresh Breeze,Moderate,Weekdays,False,False,Normal


In [52]:
# Inserts a new column 'stock_quantity' at position 8 with simulated stock quantities generated from min and max stock values
df_stock.insert(8, 'stock_quantity', create_data_functions.create_stock_distribution_vectorized(df_stock['min_stock'], df_stock['max_stock']))
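
How `create_stock_distribution_vectorized` draws its values is not shown here. The simplest vectorized approach, given only as an assumption about what such a helper might do, is a uniform integer draw between each row's minimum and maximum. The name `draw_stock_quantity` is hypothetical, and the real helper may use a different distribution or allow values outside the range.

```python
import numpy as np
import pandas as pd

def draw_stock_quantity(min_stock, max_stock, seed=0):
    """Draw one integer per row, uniformly between min_stock and max_stock inclusive."""
    rng = np.random.default_rng(seed)
    low = np.asarray(min_stock, dtype=int)
    high = np.asarray(max_stock, dtype=int)
    # Generator.integers broadcasts array bounds; endpoint=True makes `high` inclusive
    return rng.integers(low, high, endpoint=True)

demo_min = pd.Series([8, 11, 10])
demo_max = pd.Series([25, 25, 24])
demo_qty = draw_stock_quantity(demo_min, demo_max)
```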

In [53]:
# Insert a new column 'sales_volume' at position 9 with initial value 0
df_stock.insert(9, 'sales_volume', 0)

# Replace the 'sales_volume' column with simulated values using a custom function
df_stock['sales_volume'] = create_data_functions.simulate_sales_volume(df_stock, random_state=42)

In [54]:
# Insert a new column 'lpo' at position 1 with initial value 0
df_stock.insert(1, 'lpo', 0)

In [55]:
# Generate simulated purchase order dates based on product attributes and logistics
df_stock['lpo'] = create_data_functions.simulate_purchase_order_columns(df_stock)
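
The logic inside `simulate_purchase_order_columns` is not shown. One plausible way to derive an order date, presented purely as an assumption, is to subtract a lead time from `received_date`, with the lead time growing with supplier distance. The function name and the `km_per_day` and `handling_days` parameters below are hypothetical.

```python
import numpy as np
import pandas as pd

def simulate_order_dates(df, km_per_day=500, handling_days=1):
    """Estimate an order date as received_date minus a distance-based lead time."""
    # Transit time in whole days, rounded up, plus a fixed handling allowance
    transit_days = np.ceil(df["distance_km"].astype(float) / km_per_day).astype(int)
    lead_days = transit_days + handling_days
    return df["received_date"] - pd.to_timedelta(lead_days, unit="D")

demo_orders = pd.DataFrame({
    "received_date": pd.to_datetime(["2023-04-12", "2025-02-26"]),
    "distance_km": [1237, 84],
})
demo_orders["order_date"] = simulate_order_dates(demo_orders)
```

With these assumed parameters, a supplier 1,237 km away gets a 4-day lead time (3 days of transit plus 1 day of handling), and one 84 km away gets 2 days.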

In [56]:
# Save the updated DataFrame to CSV, excluding the index column
df_stock.to_csv(raw_data_path + 'synthetic_data_grocery_stock.csv', index=False)