# **Phase 1: Study of Classical Statistical Models for Demand Forecasting**

In this initial project phase, we focus on **evaluating forecasting models based on classical statistical methods**, aiming to establish a baseline for comparison with more complex models (e.g., machine learning, deep learning) to be developed later.

## Main Objective:
Optimize inventory and purchasing management, targeting a **20% reduction in overstocking within 6 months**.

## Target Variables:
- **Demand**: `Sales_Volume`
- **Inventory**: `Stock_Quantity`

## Implemented Statistical Models:
- `HoltWinters`
- `SeasonalNaive`
- `DynamicOptimizedTheta`
- `HistoricAverage` (fallback model)
- `AutoARIMA`
- `AutoETS`
- `AutoCES`
- `AutoTheta`

## Evaluation Metrics:
- **RMSE** (Root Mean Squared Error)
- **MAE** (Mean Absolute Error)
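
As a quick illustration of how the two metrics differ, the sketch below computes both on a small made-up set of values (the numbers are illustrative, not project data). The notebook itself uses `mae` and `rmse` from `utilsforecast.losses`; the helper names `mean_absolute_error` and `root_mean_squared_error` are hypothetical stand-ins that only show the arithmetic.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only: three small misses and one large miss.
actual = [10, 12, 11, 50]
predicted = [11, 11, 12, 10]

mae_value = mean_absolute_error(actual, predicted)
rmse_value = root_mean_squared_error(actual, predicted)
print(f"MAE={mae_value:.2f}  RMSE={rmse_value:.2f}")
```

The single large miss dominates RMSE because errors are squared before averaging, while MAE weights every error linearly.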

## Approach:
- Utilization of the `StatsForecast` library for model training and evaluation.
- Logarithmic transformation of target variables to stabilize variance.
- Forecasts generated for a **28-day horizon**.
- Cross-validation and performance comparison among models.
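
The log transformation listed above can be sketched in plain Python as a round trip. This is a minimal illustration using `math.log1p` and `math.expm1` (the notebook applies the NumPy equivalents to whole columns); the stock levels are illustrative values.

```python
import math

# Illustrative stock levels, including a zero: log(0) is undefined, log1p(0) is 0.
stock_levels = [0.0, 77.0, 408.0, 536.0]

# Forward transform applied before modeling.
log_values = [math.log1p(v) for v in stock_levels]

# Inverse transform applied to forecasts to return to the original scale.
restored = [math.expm1(v) for v in log_values]

for original, logged, back in zip(stock_levels, log_values, restored):
    print(f"{original:>6.1f} -> {logged:.6f} -> {back:.1f}")
```

Using `log1p` rather than a plain logarithm keeps zero-valued days valid, which matters for series that contain zeros.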

This phase helps identify which statistical methods are most suitable for the given time series, serving as a reference for future iterations with more advanced techniques.

## Import Libraries

In [1]:
# Standard Libraries 
import pandas as pd
import numpy as np
import os
import plotly.express as px
import joblib
import json

# Specialized Libraries
from statsforecast import StatsForecast
from statsforecast.models import (
    HoltWinters, HistoricAverage, DynamicOptimizedTheta as DOT,
    SeasonalNaive, AutoARIMA, AutoETS, AutoCES, AutoTheta
)
from utilsforecast.losses import mae, rmse

# Import personalized classes 
from smart_supply_chain_ai.utils.preprocess_gapFreq import TimeSeriesIntegrityTransformer
from smart_supply_chain_ai.utils.metrics import evaluate_cv, get_best_model_forecast

# Suppress noisy warnings and show all DataFrame columns
import warnings
warnings.filterwarnings("ignore", category=UserWarning, message="pkg_resources is deprecated")
warnings.filterwarnings("ignore", category=FutureWarning)
pd.set_option('display.max_columns', None)



## Load Data

In [2]:
# Define data paths
data_path = os.path.join('../data', 'processed')
docs_path = os.path.join('../docs/')
path_models = os.path.join('../models/')

saved_models = os.listdir(path_models)

# Load column descriptions from JSON file into a dictionary for reference or documentation
with open(docs_path + 'column_descriptions.json') as f:
    column_descriptions = json.load(f)

# Load data_path
read_data = pd.read_parquet(data_path + '/grocery.parquet')

# Prepare data for time series analysis

In [3]:
# Transform sales data into a clean time series
timeTransformer = TimeSeriesIntegrityTransformer(
    date_col='received_date',
    id_col='product_id',
    target_col='sales_volume'
)
df_transformed = timeTransformer.fit_transform(read_data)


# Split the data into two parts:
# - Data on/after May 10, 2025, for comparison/validation
# - Data before May 10, 2025, for training and analysis
compare_data = df_transformed[df_transformed['received_date'] >= '2025-05-10'].copy()
predict_data = df_transformed[df_transformed['received_date'] < '2025-05-10'].copy()

In [4]:
# Log transform
predict_data['stock_log'] = np.log1p(predict_data['stock_quantity'])
predict_data['sales_log'] = np.log1p(predict_data['sales_volume'])

compare_data['stock_log'] = np.log1p(compare_data['stock_quantity'])
compare_data['sales_log'] = np.log1p(compare_data['sales_volume'])

# Models

In [5]:
# Create a list of models and instantiation parameters
models = [
    HoltWinters(season_length=7),
    SeasonalNaive(season_length=7),
    DOT(season_length=7)
]

In [6]:
# Instantiate StatsForecast class as sf
sf = StatsForecast( 
    models=models,
    freq='D', 
    fallback_model = HistoricAverage(),
    n_jobs=-1,
)

# Stock Quantity

In [7]:
# Rename columns to fit the StatsForecast requirements
df_stock = predict_data.rename(columns={'received_date': 'ds', 'product_id': 'unique_id', 'stock_log': 'y'})

# Simplify the DataFrame to include only necessary columns
df_stock_simple = df_stock[['unique_id', 'ds', 'y']].copy()

In [8]:
df_stock_simple.head()

Unnamed: 0,unique_id,ds,y
0,1010497|P,2022-12-09,6.285998
1,1010497|P,2022-12-10,5.762051
2,1010497|P,2022-12-11,5.680173
3,1010497|P,2022-12-12,6.006353
4,1010497|P,2022-12-13,5.908083


## Train multiple models for many series

In [9]:
# Check if the multiple models StatsForecast for stock is already saved
if 'stock_multiple_models.joblib' in saved_models:
    # Load the saved multiple models StatsForecast for stock
    forecasts_df = joblib.load(path_models + 'stock_multiple_models.joblib')
else:
    # Generate forecasts for the next 28 days with 90% prediction intervals
    forecasts_df = sf.forecast(df=df_stock_simple, h=28, level=[90])

    # Save the forecasts DataFrame so the joblib.load branch above can reuse it
    joblib.dump(forecasts_df, path_models + 'stock_multiple_models.joblib')


In [10]:
# Inverse transformation of log1p to original scale using expm1
forecasts_df[forecasts_df.select_dtypes(include=[np.number]).columns.to_list()] = np.expm1(forecasts_df.select_dtypes(include=[np.number]))

In [11]:
# Display the first 14 rows of the forecasts DataFrame
forecasts_df.head(14)

Unnamed: 0,unique_id,ds,HoltWinters,HoltWinters-lo-90,HoltWinters-hi-90,SeasonalNaive,SeasonalNaive-lo-90,SeasonalNaive-hi-90,DynamicOptimizedTheta,DynamicOptimizedTheta-lo-90,DynamicOptimizedTheta-hi-90
0,1010497|P,2025-05-10,163.960419,45.681397,581.928995,408.0,69.034335,2387.556984,166.805124,52.926207,701.492488
1,1010497|P,2025-05-11,170.368535,45.382571,632.151072,424.0,71.774064,2480.996866,169.569645,43.663322,480.967031
2,1010497|P,2025-05-12,177.401044,45.249146,687.162605,424.0,71.774064,2480.996866,175.457347,33.382855,628.169269
3,1010497|P,2025-05-13,154.593574,37.68527,624.803056,320.0,53.965823,1873.637633,159.330374,38.931488,514.808124
4,1010497|P,2025-05-14,158.211711,37.009506,665.895501,320.0,53.965823,1873.637633,161.007921,44.160192,699.993512
5,1010497|P,2025-05-15,175.428669,39.487647,767.804261,77.0,12.356181,454.519425,178.747485,45.412844,784.906864
6,1010497|P,2025-05-16,185.948355,40.280691,845.635226,77.0,12.356181,454.519425,188.270271,37.648897,791.037116
7,1010497|P,2025-05-17,162.925339,33.861542,769.806897,408.0,32.717288,4960.282711,166.805124,34.869728,962.862683
8,1010497|P,2025-05-18,169.293245,33.909823,829.705701,424.0,34.036302,5154.36712,169.569645,34.414605,836.725017
9,1010497|P,2025-05-19,176.281628,34.060069,895.426507,424.0,34.036302,5154.36712,175.457347,37.475104,760.744823


In [12]:
# Revert the log1p transformation to recover the original values
revert_log = np.expm1(df_stock_simple['y'])
df_stock_simple.loc[: , 'y_origin'] = revert_log

In [13]:
# Filter the DataFrame for dates after June 1, 2024
plot_df = df_stock_simple.loc[df_stock_simple['ds'] > '2024-06-01'].copy()
plot_df.drop(columns=['y'], inplace=True)
plot_df.rename(columns={'y_origin': 'y'}, inplace=True)

In [14]:
# Plot forecasts for the specified date range
sf.plot(plot_df, forecasts_df, engine='plotly')

## Evaluate the model’s performance

In [15]:
df_stock_simple.head()

Unnamed: 0,unique_id,ds,y,y_origin
0,1010497|P,2022-12-09,6.285998,536.0
1,1010497|P,2022-12-10,5.762051,317.0
2,1010497|P,2022-12-11,5.680173,292.0
3,1010497|P,2022-12-12,6.006353,405.0
4,1010497|P,2022-12-13,5.908083,367.0


In [16]:
if 'stock_cv.joblib' in saved_models:
    # Load the saved cross-validation results for stock
    stock_cv_df = joblib.load(path_models + 'stock_cv.joblib')
else:
    # Perform cross-validation to evaluate model performance
    stock_cv_df = sf.cross_validation(
        df=df_stock_simple.drop(columns=['y_origin']),
        h=28,
        step_size=28,
        n_windows=2
    )

    # Save the cross-validation results for stock
    joblib.dump(stock_cv_df, path_models + 'stock_cv.joblib')

In [17]:
# Display the first rows of the cross-validation DataFrame
stock_cv_df.head()

Unnamed: 0,unique_id,ds,cutoff,y,HoltWinters,SeasonalNaive,DynamicOptimizedTheta
0,1010497|P,2025-03-15,2025-03-14,6.025866,5.990256,6.357842,5.998172
1,1010497|P,2025-03-16,2025-03-14,6.025866,6.042397,6.013715,6.051562
2,1010497|P,2025-03-17,2025-03-14,6.150603,6.093292,6.013715,6.09022
3,1010497|P,2025-03-18,2025-03-14,6.035481,5.956725,6.169611,5.971669
4,1010497|P,2025-03-19,2025-03-14,5.998937,5.965525,6.056784,5.960212
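
The cutoff shown in the output follows from the cross-validation settings (`h=28`, `step_size=28`, `n_windows=2`). The sketch below reconstructs the window dates under two assumptions: the last training day is 2025-05-09 (the day before the split date), and the windows are counted back from the end of the training data. The helper `cv_windows` is hypothetical and is not part of `StatsForecast`.

```python
from datetime import date, timedelta

def cv_windows(last_date, h, step_size, n_windows):
    """Return (cutoff, first_test_day, last_test_day) per window, oldest first."""
    windows = []
    for i in range(n_windows):
        # Distance from the end of the data to this window's cutoff.
        offset = h + (n_windows - 1 - i) * step_size
        cutoff = last_date - timedelta(days=offset)
        first_test_day = cutoff + timedelta(days=1)
        last_test_day = cutoff + timedelta(days=h)
        windows.append((cutoff, first_test_day, last_test_day))
    return windows

# Assumption: every series ends on 2025-05-09.
windows = cv_windows(date(2025, 5, 9), h=28, step_size=28, n_windows=2)
for cutoff, start, end in windows:
    print(f"cutoff={cutoff}  test={start}..{end}")
```

With `step_size` equal to `h`, the two test windows do not overlap, so each of the last 56 training days is scored exactly once.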


In [18]:
# Evaluate model performance using Mean Absolute Error (MAE)
evaluation_mae_stock = evaluate_cv(stock_cv_df, mae)
evaluation_mae_stock.head()

Unnamed: 0,unique_id,HoltWinters,SeasonalNaive,DynamicOptimizedTheta,best_model
0,1010497|P,0.424553,0.466206,0.425137,HoltWinters
1,1013121|P,0.358669,0.938277,0.355351,DynamicOptimizedTheta
2,1019979|P,0.654597,0.790079,0.636688,DynamicOptimizedTheta
3,1021354|P,0.315177,1.271394,0.290029,DynamicOptimizedTheta
4,1025598|P,1.796513,1.065059,1.777348,SeasonalNaive


In [19]:
# Evaluate model performance using Root Mean Square Error (RMSE)
evaluation_rmse_stock = evaluate_cv(stock_cv_df, rmse)
evaluation_rmse_stock.head()

Unnamed: 0,unique_id,HoltWinters,SeasonalNaive,DynamicOptimizedTheta,best_model
0,1010497|P,0.832424,0.842323,0.836305,HoltWinters
1,1013121|P,0.454789,1.532629,0.44461,DynamicOptimizedTheta
2,1019979|P,0.788981,0.949767,0.773359,DynamicOptimizedTheta
3,1021354|P,0.667287,1.767825,0.67173,HoltWinters
4,1025598|P,2.280886,1.695456,2.24844,SeasonalNaive


In [20]:
# Display the best model counts based on MAE evaluation
evaluation_mae_stock['best_model'].value_counts().to_frame().reset_index()

Unnamed: 0,best_model,count
0,DynamicOptimizedTheta,71
1,SeasonalNaive,58
2,HoltWinters,41


In [21]:
# Display the best model counts based on RMSE evaluation
evaluation_rmse_stock['best_model'].value_counts().to_frame().reset_index()

Unnamed: 0,best_model,count
0,DynamicOptimizedTheta,64
1,HoltWinters,55
2,SeasonalNaive,51


In [22]:
seasonal_ids = evaluation_rmse_stock.query('best_model == "HoltWinters"')['unique_id']
sf.plot(plot_df, forecasts_df, unique_ids=seasonal_ids, models=["HoltWinters","DynamicOptimizedTheta"], engine='plotly')

## Select the Best Model

In [23]:
# Prepare data for plotting
prod_forecasts_df = get_best_model_forecast(forecasts_df, evaluation_mae_stock)
prod_forecasts_df.head()

Unnamed: 0,unique_id,ds,best_model,best_model-lo-90,best_model-hi-90
0,1010497|P,2025-05-10,163.960419,45.681397,581.928995
1,1010497|P,2025-05-11,170.368535,45.382571,632.151072
2,1010497|P,2025-05-12,177.401044,45.249146,687.162605
3,1010497|P,2025-05-13,154.593574,37.68527,624.803056
4,1010497|P,2025-05-14,158.211711,37.009506,665.895501
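
The helpers `evaluate_cv` and `get_best_model_forecast` are project-specific and their implementation is not shown here. As a rough sketch of the idea only (an assumption, not the actual code), per-series selection amounts to picking the model with the lowest error. The values below are copied from the first rows of the MAE table above; `pick_best_model` is a hypothetical name.

```python
# MAE per series and model, taken from the MAE evaluation output.
mae_by_series = {
    "1010497|P": {"HoltWinters": 0.424553, "SeasonalNaive": 0.466206, "DynamicOptimizedTheta": 0.425137},
    "1013121|P": {"HoltWinters": 0.358669, "SeasonalNaive": 0.938277, "DynamicOptimizedTheta": 0.355351},
    "1025598|P": {"HoltWinters": 1.796513, "SeasonalNaive": 1.065059, "DynamicOptimizedTheta": 1.777348},
}

def pick_best_model(scores):
    """Return the name of the model with the lowest error."""
    return min(scores, key=scores.get)

best_model = {uid: pick_best_model(scores) for uid, scores in mae_by_series.items()}
for uid, name in best_model.items():
    print(uid, name)
```

The forecast columns of the winning model would then be copied into the generic `best_model`, `best_model-lo-90`, and `best_model-hi-90` columns seen in the output.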


In [24]:
# Plot the best-model forecast for each series with its 90% interval
sf.plot(plot_df, prod_forecasts_df, level=[90], engine='plotly')

## Technical Summary of Time Series Model Analysis

**1. Time Series Profile (Exploratory Analysis)**

The charts reveal a time series with challenging characteristics for conventional modeling:

* **Intermittent Volatility:** There is a clear alternation between periods of stability (plateaus) and sharp drops, suggesting a batch consumption behavior or scheduled replenishment.
* **Noise and Outliers:** Isolated demand spikes deviate from the historical average, which can impact sensitive metrics such as RMSE.
* **Trend and Seasonality:** No evident long-term growth or decline trend (series is mostly stationary in mean), but there is repetitive micro-seasonality in short windows.

**2. Model Evaluation (Figures 1 and 2)**

The comparison between **Holt-Winters**, **Seasonal Naive**, and **Dynamic Optimized Theta** shows distinct behaviors:

| Model | Observed Behavior |
| --- | --- |
| **Seasonal Naive** | Tends to rigidly replicate the last seasonal pattern. Useful as a baseline, but fails to capture variance changes. |
| **Holt-Winters** | Smooths the series to capture trend and seasonality. In the images, it presents a more conservative (stable) projection, ideal when no immediate spikes are expected. |
| **Dynamic Optimized Theta** | Demonstrated greater flexibility in adjusting to the series curvature, often showing superior fit in series with dynamic level changes. |

The **RMSE** ranking (Figure 2) rewards models that avoid large misses: because errors are squared before averaging, a model that misses a stock peak drastically is penalized more heavily than it would be under MAE.

**3. Best Model Analysis (Figure 3)**

The final chart shows the selected model with **prediction intervals (90% level):**

* **Historical Fit:** The model (pink line) follows the baseline (blue), staying faithful to the historical mean without being “lost” in excessive noise.
* **Projection (Forecast):** The future forecast maintains stock stability, but the **prediction interval (pink shaded area)** expands significantly as the time horizon increases.
* **Risk of Shortage or Excess:** The expansion of the shaded area in Figure 3 suggests growing uncertainty. For inventory management, this implies that while the *expectation* is stability, the model warns of the statistical possibility of stock ending well above the expected level (upper bound) or falling toward stockout levels (lower bound).
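
The widening of the shaded area has a simple form for the seasonal naive method. Under the textbook formula, the forecast standard deviation at horizon `h` is `sigma * sqrt(k)`, where `k` is the number of seasonal cycles covered so far. The sketch below assumes that formula, a normal approximation, and an illustrative `sigma` on the log1p scale; other models use different variance formulas, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def seasonal_naive_half_width(h, season_length, sigma, level=90):
    """Half-width of the interval at horizon h under the textbook seasonal naive formula."""
    # Two-sided interval: a 90% level uses the 95th percentile of the normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 200)
    # Number of seasonal cycles covered, counting the current one.
    k = (h - 1) // season_length + 1
    return z * sigma * math.sqrt(k)

# Assumption: sigma is the residual standard deviation on the log1p scale.
sigma = 1.0
for h in (1, 7, 8, 15, 28):
    width = seasonal_naive_half_width(h, season_length=7, sigma=sigma)
    print(f"h={h:>2}  half-width={width:.3f}")
```

With `season_length=7`, the interval stays constant within each week and steps up at days 8, 15, and 22, which is the kind of jump visible in the `SeasonalNaive` bounds of the forecast table after the first seven days.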

### Technical Conclusion

The selected model is robust for predicting the **average stock level**, but the high volatility of the original data requires close attention to the prediction intervals. The best model per series was chosen by cross-validated MAE (`evaluation_mae_stock`), computed on the log1p scale. Because MAE weights all errors linearly, this choice is less influenced by isolated erratic variations than an RMSE-based choice, which penalizes large deviations more heavily.

---

## Automatic Models for Stock Quantity

In [25]:
# Define parameters
season_length=7
horizon=28

In [26]:
# Automatic Models
auto_models = [
    AutoARIMA(season_length=season_length),
    AutoETS(season_length=season_length),
    AutoCES(season_length=season_length),
    AutoTheta(season_length=season_length)
    ]

In [27]:
# Instantiate StatsForecast class as auto_sf
auto_sf = StatsForecast(models=auto_models, freq='D', n_jobs=-1)

In [28]:
df_stock_simple.head()

Unnamed: 0,unique_id,ds,y,y_origin
0,1010497|P,2022-12-09,6.285998,536.0
1,1010497|P,2022-12-10,5.762051,317.0
2,1010497|P,2022-12-11,5.680173,292.0
3,1010497|P,2022-12-12,6.006353,405.0
4,1010497|P,2022-12-13,5.908083,367.0


In [29]:
# Check if the automatic StatsForecast model for stock is already saved
if 'stock_automatic_statsforecast_model.joblib' in saved_models:
    # Load the saved automatic StatsForecast model for stock
    auto_sf_stock = joblib.load(path_models + 'stock_automatic_statsforecast_model.joblib')
else:
    # Fit the automatic models on the log-transformed stock series
    auto_sf_stock = auto_sf.fit(df=df_stock_simple.drop(columns=['y_origin']))

    # Save the fitted StatsForecast object
    joblib.dump(auto_sf_stock, path_models + 'stock_automatic_statsforecast_model.joblib')

In [30]:
if 'stock_pred.joblib' in saved_models:
    # Load the saved stock predictions
    stock_pred_auto = joblib.load(path_models + 'stock_pred.joblib')
else:
    # Predict using automatic models
    stock_pred_auto = auto_sf_stock.predict(h=horizon)
    # Save the stock predictions
    joblib.dump(stock_pred_auto, path_models + 'stock_pred.joblib')

In [31]:
stock_pred_auto.head()

Unnamed: 0,unique_id,ds,AutoARIMA,AutoETS,CES,AutoTheta
0,1010497|P,2025-05-10,5.225988,5.175079,5.29663,5.124397
1,1010497|P,2025-05-11,5.500495,5.175079,5.561518,5.140686
2,1010497|P,2025-05-12,5.542457,5.175079,5.557883,5.174575
3,1010497|P,2025-05-13,5.576133,5.175079,5.475121,5.078649
4,1010497|P,2025-05-14,5.60316,5.175079,5.436939,5.089004


In [32]:
stock_pred_auto[ stock_pred_auto.select_dtypes(include=[np.number]).columns.to_list() ] = np.expm1( stock_pred_auto.select_dtypes(include=[np.number]) )
stock_pred_auto.head()

Unnamed: 0,unique_id,ds,AutoARIMA,AutoETS,CES,AutoTheta
0,1010497|P,2025-05-10,185.044823,175.81063,198.6629,167.072738
1,1010497|P,2025-05-11,243.813163,175.81063,259.217674,169.832893
2,1010497|P,2025-05-12,254.304397,175.81063,258.273352,175.72144
3,1010497|P,2025-05-13,263.048535,175.81063,237.679393,159.556929
4,1010497|P,2025-05-14,270.28237,175.81063,228.737945,161.228251


In [33]:
# StatsForecast plot
auto_sf.plot(plot_df, stock_pred_auto, engine='plotly')

#### Summary: Advanced Automated Forecasting Models

**1. Historical Data Profile**

* **Volatility and Noise:** All series show high volatility, with sharp drops often reaching zero or very low levels. This suggests data with intermittency or seasonality strongly influenced by specific events (such as production stoppages or sales cycles).
* **Stationarity:** Most series seem to fluctuate around a constant mean (mean-stationary), but with significant variance spikes.

**2. Forecast Model Performance**

Forecasts begin after the gray vertical line (May 2025). A clear divergence between models can be observed:

| Model | Observed Behavior |
| --- | --- |
| **AutoARIMA / AutoETS** | Tend to follow the most recent trend or the historical average. In several charts, they converge to a stable line, capturing less of the short-term oscillation. |
| **CES (Complex Exponential Smoothing)** | In some cases, CES projects a straight line well below the other models, which may indicate excessive sensitivity to recent drops in the data. |
| **AutoTheta** | Often shows slight oscillation or conservatively follows the central trend. |


**Conclusion**

The automatic models are struggling to capture the **uncertainty** of these series.  
Because the historical data is very erratic (with many spikes and drops), a nearly flat point forecast ends up being the statistically “safest” bet, but it may be of limited use for business decisions that depend on peaks.

---

# Sales Volume

In [34]:
predict_data.head(3)

Unnamed: 0,product_id,received_date,lpo,in_season,product,category,sub_category,shelf_life_days,maximum_days_on_sale,unit_of_measurement,supplier_rating,supplier,supplier_id,distance_km,moq,storage_recommendation,temperature_classification,precipitation_classification,wind_classification,weather_severity,day_classification,is_holiday,is_weekend,sales_demand,sales_volume,lead_time,min_stock,max_stock,stock_quantity,delivery_lag,expiration_status,inventory_turnover_rate,doi_inventory_turnover,stock_log,sales_log
0,1010497|P,2022-12-09,0,missing,missing,missing,missing,1095.0,90.0,missing,2.0,missing,missing,25.0,300.0,missing,missing,missing,missing,missing,missing,missing,missing,missing,0.0,4.0,285.0,380.0,536.0,4.0,missing,58.659824,17.0,6.285998,0.0
1,1010497|P,2022-12-10,2022-12-02 00:00:00,False,Canned Tomatoes,Pantry,Canned Goods,1095.0,90.0,unit,2.0,Wholesale Warehouse,1141069|S,25.0,300.0,Room Temperature,Warm,No precipitation,Gentle to Fresh Breeze,Moderate,Saturday,False,True,High,131.0,4.0,285.0,380.0,317.0,8.0,Safe,58.659824,17.0,5.762051,4.882802
2,1010497|P,2022-12-11,2022-12-01 00:00:00,False,Canned Tomatoes,Pantry,Canned Goods,1095.0,90.0,unit,2.0,Wholesale Warehouse,1141069|S,25.0,300.0,Room Temperature,Warm,No precipitation,Gentle to Fresh Breeze,Moderate,Sunday,False,True,High,274.0,4.0,285.0,380.0,292.0,10.0,Safe,58.659824,17.0,5.680173,5.616771


In [35]:
df_sales = predict_data.rename(columns={'received_date': 'ds', 'product_id': 'unique_id', 'sales_log': 'y'}).copy()

df_sales_simple = df_sales[['unique_id', 'ds', 'y', 'sales_volume']].copy()
df_sales_simple.head()

Unnamed: 0,unique_id,ds,y,sales_volume
0,1010497|P,2022-12-09,0.0,0.0
1,1010497|P,2022-12-10,4.882802,131.0
2,1010497|P,2022-12-11,5.616771,274.0
3,1010497|P,2022-12-12,4.204693,66.0
4,1010497|P,2022-12-13,4.682131,107.0


In [36]:
if 'sales_multiple_models.joblib' in saved_models:
    # Load the saved multiple models StatsForecast for sales
    sales_forecasts_df = joblib.load(path_models + 'sales_multiple_models.joblib')
else:
    # Generate forecasts for the next 28 days with 90% prediction intervals
    sales_forecasts_df = sf.forecast(df=df_sales_simple.drop(columns=['sales_volume']), h=28, level=[90])

    # Save the forecasts DataFrame so the joblib.load branch above can reuse it
    joblib.dump(sales_forecasts_df, path_models + 'sales_multiple_models.joblib')

In [37]:
# Inverse transformation of log1p to original scale using expm1
sales_forecasts_df[sales_forecasts_df.select_dtypes(include=[np.number]).columns.to_list()] = np.expm1(sales_forecasts_df.select_dtypes(include=[np.number]))

In [38]:
# Display the first 14 rows of the forecasts DataFrame
sales_forecasts_df.head(14)

Unnamed: 0,unique_id,ds,HoltWinters,HoltWinters-lo-90,HoltWinters-hi-90,SeasonalNaive,SeasonalNaive-lo-90,SeasonalNaive-hi-90,DynamicOptimizedTheta,DynamicOptimizedTheta-lo-90,DynamicOptimizedTheta-hi-90
0,1010497|P,2025-05-10,34.964482,-0.296663,1838.011461,271.0,0.103544,67041.19756,13.206489,-0.641372,1470.887389
1,1010497|P,2025-05-11,24.489036,-0.501526,1302.35908,370.0,0.505201,91442.585643,13.206223,-0.794909,460.650797
2,1010497|P,2025-05-12,16.689511,-0.654057,903.537566,0.0,-0.995943,245.478668,13.205948,-0.91109,688.925232
3,1010497|P,2025-05-13,13.214346,-0.722018,725.838325,73.0,-0.699771,18238.421395,13.205662,-0.789739,456.447691
4,1010497|P,2025-05-14,16.872773,-0.650473,912.909336,0.0,-0.995943,245.478668,13.205367,-0.589267,1072.995892
5,1010497|P,2025-05-15,15.500752,-0.677305,842.752836,110.0,-0.549657,27358.132093,13.205062,-0.7043,797.314879
6,1010497|P,2025-05-16,18.12343,-0.626015,976.862294,0.0,-0.995943,245.478668,13.204747,-0.810237,983.959607
7,1010497|P,2025-05-17,34.899242,-0.297942,1834.682915,271.0,-0.887259,656228.396645,13.204423,-0.780466,1033.514573
8,1010497|P,2025-05-18,24.446043,-0.502369,1300.167573,370.0,-0.846225,895076.59616,13.204089,-0.815796,988.21497
9,1010497|P,2025-05-19,16.661768,-0.654601,902.124916,0.0,-0.999586,2411.608076,13.203746,-0.729291,581.622244


In [39]:
# Filter the DataFrame for dates after June 1, 2024
plot_df_sales = df_sales_simple.loc[df_sales_simple['ds'] > '2024-06-01'].copy()
plot_df_sales.drop(columns=['y'], inplace=True)
plot_df_sales.rename(columns={'sales_volume': 'y'}, inplace=True)

In [40]:
# Plot forecasts for the specified date range
sf.plot(plot_df_sales, sales_forecasts_df, engine='plotly')

## Evaluate the model’s performance

In [41]:
df_sales_simple.head()

Unnamed: 0,unique_id,ds,y,sales_volume
0,1010497|P,2022-12-09,0.0,0.0
1,1010497|P,2022-12-10,4.882802,131.0
2,1010497|P,2022-12-11,5.616771,274.0
3,1010497|P,2022-12-12,4.204693,66.0
4,1010497|P,2022-12-13,4.682131,107.0


In [42]:
if 'sales_cv.joblib' in saved_models:
    # Load the saved cross-validation DataFrame for sales
    sales_cv_df = joblib.load(path_models + 'sales_cv.joblib')
else:
    # Perform cross-validation to evaluate model performance
    sales_cv_df = sf.cross_validation(
        df=df_sales_simple.drop(columns=['sales_volume']),
        h=28,
        step_size=28,
        n_windows=2
    )
    # Save the cross-validation DataFrame for sales
    joblib.dump(sales_cv_df, path_models + 'sales_cv.joblib')

In [43]:
# Display the first rows of the cross-validation DataFrame
sales_cv_df.head()

Unnamed: 0,unique_id,ds,cutoff,y,HoltWinters,SeasonalNaive,DynamicOptimizedTheta
0,1010497|P,2025-03-15,2025-03-14,0.0,3.689912,5.533389,3.074399
1,1010497|P,2025-03-16,2025-03-14,0.0,3.311911,4.644391,3.074396
2,1010497|P,2025-03-17,2025-03-14,4.691348,2.867888,0.0,3.074393
3,1010497|P,2025-03-18,2025-03-14,5.624018,2.649525,5.170484,3.07439
4,1010497|P,2025-03-19,2025-03-14,4.290459,2.791525,3.401197,3.074388


In [44]:
# Evaluate model performance using Mean Absolute Error (MAE)
evaluation_mae_sales = evaluate_cv(sales_cv_df, mae)
evaluation_mae_sales.head()

Unnamed: 0,unique_id,HoltWinters,SeasonalNaive,DynamicOptimizedTheta,best_model
0,1010497|P,2.456038,2.753987,2.383593,DynamicOptimizedTheta
1,1013121|P,2.701211,3.03451,2.804681,HoltWinters
2,1019979|P,3.136028,3.696831,3.116535,DynamicOptimizedTheta
3,1021354|P,1.876006,1.832042,1.883688,SeasonalNaive
4,1025598|P,3.652752,4.240872,3.63815,DynamicOptimizedTheta


In [45]:
# Evaluate model performance using Root Mean Square Error (RMSE)
evaluation_rmse_sales = evaluate_cv(sales_cv_df, rmse)
evaluation_rmse_sales.head()

Unnamed: 0,unique_id,HoltWinters,SeasonalNaive,DynamicOptimizedTheta,best_model
0,1010497|P,2.555649,3.537163,2.517539,DynamicOptimizedTheta
1,1013121|P,2.801041,4.156522,2.855581,HoltWinters
2,1019979|P,3.24082,4.900633,3.247759,HoltWinters
3,1021354|P,1.979218,2.631018,1.992842,HoltWinters
4,1025598|P,3.781163,5.562703,3.896219,HoltWinters


In [46]:
# Display the best model counts based on MAE evaluation
evaluation_mae_sales['best_model'].value_counts().to_frame().reset_index()

Unnamed: 0,best_model,count
0,SeasonalNaive,63
1,HoltWinters,57
2,DynamicOptimizedTheta,50


In [47]:
# Display the best model counts based on RMSE evaluation
evaluation_rmse_sales['best_model'].value_counts().to_frame().reset_index()

Unnamed: 0,best_model,count
0,HoltWinters,120
1,DynamicOptimizedTheta,50


In [48]:
seasonal_ids = evaluation_rmse_sales.query('best_model == "DynamicOptimizedTheta"')['unique_id']
sf.plot(plot_df_sales, sales_forecasts_df, unique_ids=seasonal_ids, models=["HoltWinters","DynamicOptimizedTheta"], engine='plotly')

## Select the Best Model

In [49]:
# Prepare data for plotting
prod_sales_fc_df = get_best_model_forecast(sales_forecasts_df, evaluation_mae_sales)
prod_sales_fc_df.head()

Unnamed: 0,unique_id,ds,best_model,best_model-lo-90,best_model-hi-90
0,1010497|P,2025-05-10,13.206489,-0.641372,1470.887389
1,1010497|P,2025-05-11,13.206223,-0.794909,460.650797
2,1010497|P,2025-05-12,13.205948,-0.91109,688.925232
3,1010497|P,2025-05-13,13.205662,-0.789739,456.447691
4,1010497|P,2025-05-14,13.205367,-0.589267,1072.995892


In [50]:
# Plot the best-model forecast for each series with its 90% interval
sf.plot(plot_df_sales, prod_sales_fc_df, level=[90], engine='plotly')


## Detailed Graphical Analysis of Sales Volume


**1. Analysis of the First Image**

This image compares different statistical models applied to the 8 time series:

* **Data Behavior (y):** Sales are marked by sudden spikes followed by sharp drops. IDs such as `1786407|P` and `1598886|P` show more frequent sales, while `1045624|P` has longer windows of inactivity.
* **Seasonality (Seasonal Naive):** The red model attempts to replicate the pattern of the last seasonal period. In such noisy series, it ends up generating a “sawtooth” forecast that may not be realistic if there is no clear seasonality.
* **Smoothing (DynamicOptimizedTheta):** This model seems to try to find an average trend, resulting in more stable forecasts, though potentially conservative in the face of demand spikes.

**2. Analysis of the Second Image**

This image shows what was selected as the “best model,” including a 90% prediction interval (shaded pink area).

* **Explosion of Uncertainty (Prediction Intervals):**
  * The most critical point is the **Y-axis scale**. Note that for ID `1316510|P`, the scale reaches **2B (billions)**, and for ID `1802113|P`, it reaches **600k**.
  * This indicates that the model is generating extremely wide prediction intervals. In practice, a forecast that says you may sell between 0 and 2 billion units has no operational value for inventory.

* **Distortion from Outliers:** The presence of very high historical spikes (such as in ID `1862558|P`) is causing the model to “protect itself” by excessively expanding the uncertainty area in future projections.
* **Model Fit:** The “best_model” (continuous pink line) seems to follow a moving average or central trend, but it is visually “flattened” at the bottom of the chart due to the magnitude of the prediction intervals.
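
Part of this explosion can be reproduced with simple arithmetic. The intervals are built on the log1p scale and then passed through `expm1`, so a symmetric band in log space becomes strongly asymmetric on the original scale. The sketch below uses illustrative numbers (a point forecast of 271 units and a log-scale half-width of 5.5) and a hypothetical helper, `back_transform_interval`, which also clips the lower bound at zero. The clipping is an optional post-processing idea, not something the notebook currently does.

```python
import math

def back_transform_interval(point_log, half_width_log):
    """Map a symmetric log1p-scale interval to the original scale, clipping at zero."""
    # Sales cannot be negative, so the lower bound is floored at zero.
    lower = max(0.0, math.expm1(point_log - half_width_log))
    point = math.expm1(point_log)
    upper = math.expm1(point_log + half_width_log)
    return lower, point, upper

# Illustrative numbers: point forecast of 271 units, log-scale half-width of 5.5.
lower, point, upper = back_transform_interval(math.log1p(271), 5.5)
print(f"lower={lower:.2f}  point={point:.0f}  upper={upper:,.0f}")

# A zero point forecast would give a negative lower bound without the clipping.
zero_case = back_transform_interval(0.0, 5.5)
print(zero_case)
```

A half-width that looks moderate in log space pushes the upper bound to hundreds of times the point forecast, which is why the shaded area dominates the Y-axis.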

**3. General Observations and Diagnosis**

1. **Intermittent Demand:** The data contains many zeros. Traditional time series models (such as Holt-Winters or Seasonal Naive) usually perform poorly here. It would be advisable to test models designed for intermittent demand, such as **Croston's method**, or count-based *machine learning* models.
2. **Scale Problem:** Some charts have scales in the billions (2B) while the historical data seems to be in the thousands. This suggests a possible processing error or the presence of an extreme *outlier* that the model is trying to compensate for. A likely contributor is the back-transformation itself: the intervals are computed on the log1p scale, so `expm1` turns a wide log-scale upper bound into a very large value on the original scale (for example, `SeasonalNaive-hi-90` for `1010497|P` exceeds 895,000 in the forecast table).
3. **Reliability:** The current forecast is risky for procurement management. The 90% prediction interval is too wide, which could lead to overstock if the company plans against the upper bound of the interval.
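
To make the Croston suggestion concrete, here is a minimal plain-Python sketch of the idea: smooth the nonzero demand sizes and the gaps between them separately, then forecast their ratio. The function name, the default `alpha`, and the initialization from the first nonzero demand are illustrative assumptions; `statsforecast.models` provides Croston-type models (such as `CrostonClassic` and `CrostonSBA`) that would be the natural choice inside this notebook.

```python
def croston_forecast(demand, alpha=0.1, sba=False):
    """Return a flat per-period demand rate using Croston's method."""
    size = None      # smoothed size of nonzero demands
    interval = None  # smoothed number of periods between nonzero demands
    periods_since_demand = 1

    for value in demand:
        if value > 0:
            if size is None:
                # Initialize from the first nonzero demand.
                size, interval = float(value), float(periods_since_demand)
            else:
                # Update both estimates only when a demand occurs.
                size += alpha * (value - size)
                interval += alpha * (periods_since_demand - interval)
            periods_since_demand = 1
        else:
            periods_since_demand += 1

    if size is None:
        return 0.0  # no demand observed at all
    rate = size / interval
    # The Syntetos-Boylan Approximation scales the rate by (1 - alpha / 2).
    return rate * (1 - alpha / 2) if sba else rate

# Illustrative series: 4 units every third day.
history = [0, 0, 4, 0, 0, 4, 0, 0, 4]
print(croston_forecast(history))
print(croston_forecast(history, sba=True))
```

The output is a demand rate per period rather than a prediction of when the next spike occurs, which suits replenishment planning better than spike timing.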

---

## Automatic Models for Sales Volume

In [51]:
df_sales_simple.head()

Unnamed: 0,unique_id,ds,y,sales_volume
0,1010497|P,2022-12-09,0.0,0.0
1,1010497|P,2022-12-10,4.882802,131.0
2,1010497|P,2022-12-11,5.616771,274.0
3,1010497|P,2022-12-12,4.204693,66.0
4,1010497|P,2022-12-13,4.682131,107.0


In [52]:
# Check if the automatic StatsForecast model for sales is already saved
if 'sales_automatic_statsforecast_model.joblib' in saved_models:
    # Load the saved automatic StatsForecast model for sales
    auto_sf_sales = joblib.load(path_models + 'sales_automatic_statsforecast_model.joblib')
else:
    # Fit the automatic models on the log-transformed sales series
    auto_sf_sales = auto_sf.fit(df=df_sales_simple.drop(columns=['sales_volume']))

    # Save the fitted StatsForecast object
    joblib.dump(auto_sf_sales, path_models + 'sales_automatic_statsforecast_model.joblib')

In [53]:
# Check if the sales predictions are already saved
if 'sales_predictions.joblib' in saved_models:
    # Load the saved sales predictions
    sales_pred_auto = joblib.load(path_models + 'sales_predictions.joblib')
else:
    # Predict using automatic models
    sales_pred_auto = auto_sf_sales.predict(h=horizon)

    # Save the sales predictions
    joblib.dump(sales_pred_auto, path_models + 'sales_predictions.joblib')

In [54]:
sales_pred_auto.head()

Unnamed: 0,unique_id,ds,AutoARIMA,AutoETS,CES,AutoTheta
0,1010497|P,2025-05-10,3.002632,3.002975,3.3025,2.653428
1,1010497|P,2025-05-11,3.002632,3.002975,3.17657,2.653389
2,1010497|P,2025-05-12,3.002632,3.002975,2.841586,2.653351
3,1010497|P,2025-05-13,3.002632,3.002975,2.538214,2.653312
4,1010497|P,2025-05-14,3.002632,3.002975,2.791247,2.653274


In [55]:
sales_pred_auto[ sales_pred_auto.select_dtypes(include=[np.number]).columns.to_list() ] = np.expm1( sales_pred_auto.select_dtypes(include=[np.number]) )
sales_pred_auto.head()

Unnamed: 0,unique_id,ds,AutoARIMA,AutoETS,CES,AutoTheta
0,1010497|P,2025-05-10,19.138482,19.145371,26.180502,13.202636
1,1010497|P,2025-05-11,19.138482,19.145371,22.964407,13.202091
2,1010497|P,2025-05-12,19.138482,19.145371,16.142934,13.201546
3,1010497|P,2025-05-13,19.138482,19.145371,11.657047,13.201001
4,1010497|P,2025-05-14,19.138482,19.145371,15.301341,13.200456


In [56]:
# StatsForecast plot
sf.plot(plot_df_sales, sales_pred_auto, engine='plotly')


## Analysis of Automatic Model Forecasts for Sales Volume

**1. Data Profile: Intermittent Series**

The most striking feature across all charts is **intermittency**.

* **Sparsity:** There are long periods with values at zero or very close to zero, interrupted by sudden spikes of activity.
* **Volatility:** The spikes do not appear to follow an obvious cycle (such as clear weekly or monthly seasonality), suggesting sporadic, order-driven demand or random events.
* **Variable Scale:** Note that the Y-axis scales vary drastically between IDs. While `unique_id=1788407|P` reaches peaks of 60, `unique_id=1310758|P` exceeds 4000.

**2. Forecast Model Analysis**

On the right side of each chart (after May 2025), we see projections from the models: **AutoARIMA, AutoETS, CES, and AutoTheta**.

* **Tendency to “Flatline”:** In most cases, forecasts result in an almost horizontal, low line. This happens because, in very noisy and intermittent series, classical statistical models tend to predict the **mean** or expected value, since they cannot detect a deterministic pattern for the spikes.
* **Difficulty Capturing Peaks:** None of the models seem to project future spikes with the same magnitude as historical ones. This indicates that the models are treating past spikes as “noise” or *outliers* rather than seasonal components.

**3. Specific Observations by Chart**

* **High-Magnitude IDs (e.g., `1310758|P` and `1862553|P`):** These show very high peaks (above 4000 and 6000). Forecasting here is particularly difficult, as the squared error would be enormous if the model attempted to predict a spike and missed the exact timing.
* **Low-Magnitude IDs (e.g., `1788407|P` and `1898838|P`):** The frequency of events seems slightly higher, but the scale is small. In these cases, the AutoTheta model tends to perform slightly better at capturing the mild trend.
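
The timing argument above can be checked with a small made-up example: a flat forecast at the series mean is compared with a forecast that predicts the spike at the right size but one day early. The series and the helper `rmse_score` are illustrative, not project data or code.

```python
import math

def rmse_score(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative week with a single spike of 100 units on day 4.
actual = [0, 0, 0, 100, 0, 0, 0]

# Flat forecast at the mean of the series.
flat = [sum(actual) / len(actual)] * len(actual)

# Correct spike size, predicted one day too early.
mistimed = [0, 0, 100, 0, 0, 0, 0]

rmse_flat = rmse_score(actual, flat)
rmse_mistimed = rmse_score(actual, mistimed)
print(f"flat={rmse_flat:.2f}  mistimed={rmse_mistimed:.2f}")
```

The mistimed spike is penalized twice (once for the false alarm, once for the miss), so the flat line scores better under RMSE even though it carries less operational information.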

### Conclusions and Recommendations

This is a classic dataset of **Intermittent Demand**. To improve these results, I would suggest:

1. **Specialized Models:** Test models such as **Croston** or **SBA (Syntetos-Boylan Approximation)**, which are specifically designed for data with long runs of zeros.
2. **Temporal Aggregation:** Try analyzing the data at a coarser granularity (e.g., aggregating daily data into weekly totals) to reduce noise and uncover hidden seasonality.
3. **Exogenous Variables:** If these spikes are caused by promotions, weather events, or holidays, adding this information as external variables would help the models “understand” why the spikes occur.
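
As a minimal sketch of the temporal aggregation idea, the snippet below sums daily values into weekly totals keyed by the Monday that starts each week. The data is illustrative and the helper `to_weekly` is hypothetical; inside the notebook the same result would more naturally come from a pandas group-by on `unique_id` and a weekly period.

```python
from collections import defaultdict
from datetime import date, timedelta

def to_weekly(daily):
    """Sum (day, value) pairs into weeks keyed by the Monday that starts each week."""
    weekly = defaultdict(float)
    for day, value in daily:
        # weekday() is 0 for Monday, so this steps back to the start of the week.
        week_start = day - timedelta(days=day.weekday())
        weekly[week_start] += value
    return dict(sorted(weekly.items()))

# Illustrative intermittent series: 14 days starting on Monday 2025-03-03.
start = date(2025, 3, 3)
values = [0, 0, 131, 0, 0, 0, 274, 0, 66, 0, 0, 0, 107, 0]
daily = [(start + timedelta(days=i), v) for i, v in enumerate(values)]

weekly = to_weekly(daily)
for week_start, total in weekly.items():
    print(week_start, total)
```

Ten of the fourteen daily values are zero, while neither weekly total is, which is the noise reduction the recommendation is aiming for. After aggregating, the seasonal period passed to the models would need to be revisited, since a daily `season_length=7` no longer applies.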

---