# Forecasting Pipeline - Synthetic Data Example

This notebook presents a complete example of a **forecasting pipeline** using synthetic data.  
It covers the entire workflow, from data preparation and preprocessing to model training, prediction generation, and performance evaluation.


1.  **Libraries and Synthetic Data Generation:**

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt

# K-Means clustering, used later in the pipeline for demand segmentation
from sklearn.cluster import KMeans

# To suppress unnecessary warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# ---------------------------------------------------------
# Generate Synthetic Data
# ---------------------------------------------------------

# Create a range of monthly dates from January 2021 to December 2023
dates = pd.date_range(start='2021-01-01', end='2023-12-01', freq='MS')

# Define the total number of products to simulate
n_products = 10000

# Split products into two categories:
# - Regular demand (70% of products)
# - Intermittent demand (30% of products)
regular_products = [f'prod_reg_{i}' for i in range(int(n_products * 0.7))]
intermittent_products = [f'prod_int_{i}' for i in range(int(n_products * 0.3))]

# Initialize an empty DataFrame that will contain all synthetic records
df_synthetic = pd.DataFrame()

# ---------------------------------------------------------
# Generate Regular Demand Products
# ---------------------------------------------------------
# For regular products, demand follows a normal distribution
# with mean=100 and std=20. Values below 1 are clipped to 1.
for prod in regular_products:
    demand = np.random.normal(loc=100, scale=20, size=len(dates)).round(2).clip(min=1)
    df_temp = pd.DataFrame({
        'Date': dates,
        'Product_ID': prod,
        'Quantity_Sold': demand
    })
    # Concatenate with the main dataset
    df_synthetic = pd.concat([df_synthetic, df_temp])

# ---------------------------------------------------------
# Generate Intermittent Demand Products
# ---------------------------------------------------------
# For intermittent products, most months have zero demand.
# A random number of months (between 3 and 9, since np.random.randint
# excludes the upper bound) will contain sales, with quantities
# randomly chosen between 10 and 499 units.
for prod in intermittent_products:
    demand = np.zeros(len(dates))  # initialize with zero demand
    sale_indices = np.random.choice(len(dates), size=np.random.randint(3, 10), replace=False)
    demand[sale_indices] = np.random.randint(low=10, high=500, size=len(sale_indices))
    df_temp = pd.DataFrame({
        'Date': dates,
        'Product_ID': prod,
        'Quantity_Sold': demand
    })
    df_synthetic = pd.concat([df_synthetic, df_temp])

# ---------------------------------------------------------
# Preview of the generated synthetic dataset
# ---------------------------------------------------------
print("Synthetic data generated:")
print(df_synthetic.head())


Synthetic data generated:
        Date  Product_ID  Quantity_Sold
0 2021-01-01  prod_reg_0         105.75
1 2021-02-01  prod_reg_0          96.15
2 2021-03-01  prod_reg_0          98.60
3 2021-04-01  prod_reg_0          98.79
4 2021-05-01  prod_reg_0         110.82
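
The generation loop above calls `pd.concat` once per product and draws from an unseeded generator, so it slows down as the dataset grows and produces different numbers on every run. A minimal sketch of an alternative, assuming a seeded `np.random.default_rng` and a much smaller product count for illustration, builds all regular-demand rows in one step. The helper name `make_regular_demand` and its parameters are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_regular_demand(n_products, dates, seed=0):
    """Build a long-format table of regular demand without a per-product concat."""
    rng = np.random.default_rng(seed)  # seeded so reruns give the same data
    n_periods = len(dates)
    # One row per product, one column per month; values below 1 are clipped to 1
    demand = rng.normal(loc=100, scale=20, size=(n_products, n_periods))
    demand = demand.round(2).clip(min=1)
    product_ids = [f"prod_reg_{i}" for i in range(n_products)]
    return pd.DataFrame({
        # Dates repeat once per product; each product ID repeats once per month
        "Date": np.tile(dates.to_numpy(), n_products),
        "Product_ID": np.repeat(product_ids, n_periods),
        "Quantity_Sold": demand.ravel(),
    })

dates = pd.date_range(start="2021-01-01", end="2023-12-01", freq="MS")
df_demo = make_regular_demand(n_products=5, dates=dates, seed=0)
print(df_demo.shape)  # (180, 3): 5 products x 36 months
```

The same pattern would extend to the intermittent products, although choosing a different number of sale months per product still needs a loop or a masking step.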


2.  **Data Preparation and Characterization:**

In [2]:
# Filter out zero-demand records to focus only on periods with sales
sales_only_df = df_synthetic[df_synthetic['Quantity_Sold'] > 0].copy()

# ---------------------------------------------------------
# Aggregate features at the product level
# ---------------------------------------------------------
# - Total_Volume: total units sold across all periods
# - Num_Periods_Sold: number of periods in which the product had sales
features_df = sales_only_df.groupby('Product_ID').agg(
    Total_Volume=('Quantity_Sold', 'sum'),
    Num_Periods_Sold=('Date', 'count')
).reset_index()

# ---------------------------------------------------------
# Function to calculate intermittency metrics (ADI and CV²)
# ---------------------------------------------------------
def calculate_metrics(group):
    """
    Calculate intermittency metrics for a product's sales history.

    ADI (Average Demand Interval): average time gap between sales events.
    CV² (Squared Coefficient of Variation): measures demand variability.
    """
    if len(group) < 2:
        # With fewer than 2 sales events, neither metric can be estimated properly:
        # - ADI falls back to 1 for a single sale (zero time span + 1),
        #   or to len(dates) + 1 for an empty group
        # - CV² defaults to 1.0 (an assumed high-variability placeholder)
        adi = (group['Date'].max() - group['Date'].min()).days / 30.44 + 1 if len(group) > 0 else len(dates) + 1
        cv2 = 1.0
    else:
        # Sort sales chronologically
        group = group.sort_values('Date')

        # Compute time gaps between sales in months
        demand_intervals = group['Date'].diff().dt.days / 30.44
        adi = demand_intervals.mean()

        # Compute squared coefficient of variation of demand
        mean_demand = group['Quantity_Sold'].mean()
        std_dev_demand = group['Quantity_Sold'].std()
        cv2 = (std_dev_demand / mean_demand)**2 if mean_demand > 0 and std_dev_demand > 0 else 0.0
    
    return pd.Series({'ADI': adi, 'CV2': cv2})

# ---------------------------------------------------------
# Apply intermittency metrics to each product
# ---------------------------------------------------------
intermittency_df = sales_only_df.groupby('Product_ID').apply(calculate_metrics).reset_index()

# ---------------------------------------------------------
# Merge all product-level features into a single dataset
# ---------------------------------------------------------
features_df = pd.merge(features_df, intermittency_df, on='Product_ID', how='left')

# Safeguard: fill any missing values with default assumptions
# (maximum ADI and CV² = 1.0). Products with no sales at all never reach
# this point, because features_df is built from sales_only_df.
features_df.fillna({
    'Total_Volume': 0,
    'Num_Periods_Sold': 0,
    'ADI': len(dates) + 1,
    'CV2': 1.0
}, inplace=True)

# ---------------------------------------------------------
# Preview calculated features
# ---------------------------------------------------------
print("\nFeatures calculated for clustering:")
print(features_df.head())



Features calculated for clustering:
      Product_ID  Total_Volume  Num_Periods_Sold       ADI       CV2
0     prod_int_0        1189.0                 4  4.664915  0.190690
1     prod_int_1        1306.0                 5  3.244087  0.280302
2    prod_int_10        1591.0                 5  3.244087  0.269070
3   prod_int_100        1534.0                 7  5.502628  0.105215
4  prod_int_1000        1839.0                 8  4.003191  0.345730
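
To make the two metrics concrete, the sketch below recomputes them by hand for a single made-up product with three sales. It assumes the same definitions as `calculate_metrics`: gaps measured in days divided by 30.44, and the sample standard deviation for CV². The `toy` data is illustrative only.

```python
import pandas as pd

# Three sales: January, April and October 2021
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-04-01", "2021-10-01"]),
    "Quantity_Sold": [10.0, 20.0, 30.0],
})

# Gaps are 90 and 183 days; the first diff is NaN and is skipped by mean()
intervals = toy["Date"].diff().dt.days / 30.44
adi = intervals.mean()

# Mean demand is 20 and the sample standard deviation is 10
cv2 = (toy["Quantity_Sold"].std() / toy["Quantity_Sold"].mean()) ** 2

print(round(adi, 2))  # 4.48
print(round(cv2, 2))  # 0.25
```

An ADI near 1 means the product sells almost every month, while larger values indicate longer gaps between sales.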


3.  **Clustering for Demand Segmentation:**

In [3]:
# Prepare the feature set for clustering
# ---------------------------------------------------------
# We use ADI (Average Demand Interval) and CV² (Squared Coefficient of Variation)
# as inputs to cluster products based on their demand patterns.
X = features_df[['ADI', 'CV2']]

# ---------------------------------------------------------
# Train the K-Means clustering model
# ---------------------------------------------------------
# We set the number of clusters = 2 to separate "Regular" vs "Intermittent" products.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
features_df['Segment_ID'] = kmeans.fit_predict(X)

# ---------------------------------------------------------
# Assign descriptive labels to each cluster
# ---------------------------------------------------------
# We compare average number of periods sold across clusters.
# - The cluster with higher frequency of sales is labeled "Regular"
# - The other cluster is labeled "Intermittent"
segments = features_df.groupby('Segment_ID')[['ADI', 'Num_Periods_Sold']].mean()
if segments.iloc[0]['Num_Periods_Sold'] > segments.iloc[1]['Num_Periods_Sold']:
    mapping = {segments.index[0]: 'Regular', segments.index[1]: 'Intermittent'}
else:
    mapping = {segments.index[0]: 'Intermittent', segments.index[1]: 'Regular'}

# Apply the mapping
features_df['Segment'] = features_df['Segment_ID'].map(mapping)

# ---------------------------------------------------------
# Display segmentation results
# ---------------------------------------------------------
print("\nSegmentation results:")
print(features_df['Segment'].value_counts())

# ---------------------------------------------------------
# Profile the segments (average characteristics)
# ---------------------------------------------------------
segments_profile = features_df.groupby('Segment')[['Total_Volume', 'Num_Periods_Sold', 'ADI', 'CV2']].mean().round(2)
print("\nAverage profile of each segment:")
print(segments_profile)

# ---------------------------------------------------------
# Merge product segmentation results back into the main dataset
# ---------------------------------------------------------
df_final = pd.merge(df_synthetic, features_df[['Product_ID', 'Segment']], on='Product_ID', how='left')



Segmentation results:
Segment
Regular         7437
Intermittent    2563
Name: count, dtype: int64

Average profile of each segment:
              Total_Volume  Num_Periods_Sold   ADI   CV2
Segment                                                 
Intermittent       1459.87              5.74  6.23  0.37
Regular            3498.72             34.31  1.13  0.06
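
K-Means relies on Euclidean distance, so a feature with a wider numeric range (here ADI, compared with CV²) tends to carry more weight in the clustering. The following is a minimal sketch, assuming one wants both features on a comparable scale before clustering. The `StandardScaler` step and the toy values are assumptions for illustration, not part of the original pipeline, and the labeling uses mean ADI rather than `Num_Periods_Sold`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up products: three sell almost monthly, three sell rarely
toy_features = pd.DataFrame({
    "Product_ID": ["a", "b", "c", "d", "e", "f"],
    "ADI": [1.0, 1.0, 1.1, 5.0, 6.0, 7.0],
    "CV2": [0.04, 0.05, 0.06, 0.30, 0.40, 0.35],
})

# Standardize so ADI and CV2 contribute on the same scale
X_scaled = StandardScaler().fit_transform(toy_features[["ADI", "CV2"]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
toy_features["Segment_ID"] = kmeans.fit_predict(X_scaled)

# The cluster with the lower average ADI sells more often, so call it "Regular"
mean_adi = toy_features.groupby("Segment_ID")["ADI"].mean()
regular_id = mean_adi.idxmin()
toy_features["Segment"] = np.where(
    toy_features["Segment_ID"] == regular_id, "Regular", "Intermittent"
)

print(toy_features[["Product_ID", "Segment"]])
```

Using `idxmin` on the cluster means avoids the explicit if/else when there are only two clusters. Whether scaling improves the segmentation on a given dataset has to be checked against the resulting segment profile.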


4.  **Forecasting Models for Each Segment:**

In [4]:
# Split the dataset by demand segment
df_regular = df_final[df_final['Segment'] == 'Regular']
df_intermittent = df_final[df_final['Segment'] == 'Intermittent']

# ---------------------------------------------------------
# Forecast for Regular Segment (Simulation of Exponential Smoothing)
# ---------------------------------------------------------
def forecast_regular(group):
    """
    Simple forecasting function for regular-demand products.
    For demonstration purposes, we use the historical average as a naive
    stand-in for Exponential Smoothing. Unlike exponential smoothing, the
    average weights every period equally.
    """
    forecast_value = group['Quantity_Sold'].mean()
    return pd.Series({'Regular_Forecast': forecast_value})

# Apply the forecasting function for each regular product
regular_forecasts = df_regular.groupby('Product_ID').apply(forecast_regular).reset_index()

print("\n--- 4. Forecast for Regular Products (ESM Simulation) ---\n")
print(regular_forecasts.head())

# ---------------------------------------------------------
# Forecast for Intermittent Segment (Croston's Method)
# ---------------------------------------------------------
def forecast_croston(group):
    """
    Implementation of Croston's method for intermittent demand.
    Steps:
    - Filter non-zero sales.
    - Calculate intervals between demand occurrences.
    - Apply exponential smoothing separately to demand size and intervals.
    - Final forecast = smoothed demand / smoothed interval.
    """
    # Ensure 'Date' is the index for interval calculations
    demand_series = group.set_index('Date')['Quantity_Sold']
    demand = demand_series[demand_series > 0]  # keep only non-zero demand
    
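    # Handle edge cases with insufficient demand history: with fewer than two
    # sales no interval can be estimated, so fall back to the average demand
    # per period over the whole history (0 when there are no sales)
    if len(demand) < 2:
        return demand_series.mean() if len(demand_series) > 0 else 0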
    
    # Calculate intervals (in months) between demand occurrences
    intervals = demand.index.to_series().diff().dt.days / 30.44
    
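    # Initialize the smoothing parameter and the starting levels.
    # intervals.iloc[0] is NaN (the first sale has no predecessor),
    # so the interval level starts from the first real gap, intervals.iloc[1].
    alpha = 0.2
    demand_smooth = [demand.iloc[0]]
    interval_smooth = [intervals.iloc[1]]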
    
    # Recursive exponential smoothing for demand and intervals
    for i in range(1, len(demand)):
        demand_smooth.append(alpha * demand.iloc[i] + (1 - alpha) * demand_smooth[-1])
        interval_smooth.append(alpha * intervals.iloc[i] + (1 - alpha) * interval_smooth[-1])
        
    # Final smoothed values
    final_demand = demand_smooth[-1]
    final_interval = interval_smooth[-1]
    
    # Croston forecast
    return final_demand / final_interval if final_interval > 0 else 0

# Apply Croston's method to intermittent products
intermittent_forecasts = df_intermittent.groupby('Product_ID').apply(forecast_croston).reset_index(name='Intermittent_Forecast')

print("\n--- 5. Forecast for Intermittent Products (Croston) ---\n")
print(intermittent_forecasts.head())



--- 4. Forecast for Regular Products (ESM Simulation) ---

      Product_ID  Regular_Forecast
0     prod_int_1         36.277778
1    prod_int_10         44.194444
2  prod_int_1014         60.972222
3  prod_int_1024         57.694444
4  prod_int_1035         37.361111

--- 5. Forecast for Intermittent Products (Croston) ---

      Product_ID  Intermittent_Forecast
0     prod_int_0              49.827391
1   prod_int_100              50.165191
2  prod_int_1000              52.692604
3  prod_int_1001              58.904294
4  prod_int_1002              29.529824
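
The regular-segment forecast above uses the historical mean in place of exponential smoothing. For comparison, here is a minimal sketch of simple exponential smoothing itself, assuming the level is initialized with the first observation and that `alpha` is fixed rather than tuned. The function name `simple_exp_smoothing` and the toy data are hypothetical.

```python
import pandas as pd

def simple_exp_smoothing(values, alpha=0.2):
    """One-step-ahead forecast: level = alpha * y + (1 - alpha) * previous level."""
    level = values[0]  # assumption: start the level at the first observation
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Two made-up products with four months of history each
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01"] * 2),
    "Product_ID": ["p1"] * 4 + ["p2"] * 4,
    "Quantity_Sold": [100.0, 110.0, 90.0, 120.0, 10.0, 10.0, 10.0, 10.0],
})

ses_forecasts = (
    toy.sort_values("Date")  # smoothing depends on chronological order
       .groupby("Product_ID")["Quantity_Sold"]
       .apply(lambda s: simple_exp_smoothing(s.to_numpy(), alpha=0.5))
       .reset_index(name="SES_Forecast")
)
print(ses_forecasts)
# p1 -> 108.75 (the plain mean would be 105.0), p2 -> 10.0
```

Because recent observations receive more weight, the forecast for `p1` sits closer to the latest value of 120 than the plain average does. The same recursion is available in pandas as `Series.ewm(alpha=..., adjust=False).mean()`, whose last value matches this function.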
