This notebook contains the exploratory data analysis (EDA) for the first module of the **"Intelligent System for Supply Chain Management"** project. 

The main objective is to optimize inventory and purchasing management, with a target of **reducing overstocking by 20%** within 6 months.

- Target Variable for Inventory Optimization: **Stock_Quantity**
- Target Variable for Demand Forecasting: **Sales_Volume**

# Import Libraries

In [1]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.io as pio
import pandas as pd
import numpy as np
import os
from statsmodels.tsa.stattools import adfuller, grangercausalitytests, acf, pacf

# Configure display and graphs
pd.set_option('display.max_columns', None)
pio.templates.default = "plotly_white"

import warnings
warnings.filterwarnings('ignore')

# Load Data

In [2]:
# Define data paths
data_path = os.path.join('..', 'data', 'processed')

# Load Pickle file
df = pd.read_pickle(os.path.join(data_path, 'grocery.pkl'))

In [3]:
# Create date Index
data = df.set_index('Date_Received')

# Sort the new index chronologically
data.sort_index(inplace=True)

# Initial Setup and Analysis of Main Series

In [4]:
# Chart of main target variables
fig = make_subplots(rows=2, cols=1, subplot_titles=['Sales Volume Over Time', 'Stock Quantity Over Time'])

fig.add_trace(go.Scatter(x=data.index, y=data['Sales_Volume'], 
                         name='Sales Volume', line=dict(color='blue')), row=1, col=1)
fig.add_trace(go.Scatter(x=data.index, y=data['Stock_Quantity'], 
                         name='Stock Quantity', line=dict(color='green')), row=2, col=1)

fig.update_layout(height=600, title_text="Main Target Variables", showlegend=True)
fig.show()

# Stationarity Analysis for All Numerical Variables

In [5]:
# Select only numerical variables for analysis
numeric_columns = data.select_dtypes(include=[np.number]).columns.tolist()

# Function for ADF test
def create_stationarity_table(data, columns):
    results = []
    for column in columns:
        try:
            result = adfuller(data[column].dropna())
            stationary = result[1] <= 0.05
            results.append({
                'Variable': column,
                'ADF Statistic': f"{result[0]:.3f}",
                'p-value': f"{result[1]:.4f}",
                'Stationary': 'Yes' if stationary else 'No'
            })
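        except Exception as exc:
            # Report columns the test cannot handle (e.g. constant series) instead of hiding them
            print(f"Skipping {column}: {exc}")
            continue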
    
    return pd.DataFrame(results)

# Create interactive table
stationarity_df = create_stationarity_table(data, numeric_columns)
fig = go.Figure(data=[go.Table(
    header=dict(values=list(stationarity_df.columns),
                fill_color='lightblue',
                align='left'),
    cells=dict(values=[stationarity_df[col] for col in stationarity_df.columns],
               fill_color='lavender',
               align='left'))
])
fig.update_layout(title='Stationarity Test for Numerical Variables')
fig.show()

In [6]:
# Apply first-order differencing to Delivery_Lag and re-run the ADF test
data['Delivery_Lag_diff'] = data['Delivery_Lag'].diff()
result = adfuller(data['Delivery_Lag_diff'].dropna())
print(f"Delivery_Lag after differentiation: ADF={result[0]:.3f}, p-value={result[1]:.4f}")

Delivery_Lag after differentiation: ADF=-10.268, p-value=0.0000


In [7]:
# Apply first-order differencing to Days_For_Expiration and re-run the ADF test
data['Days_For_Expiration_diff'] = data['Days_For_Expiration'].diff()
result = adfuller(data['Days_For_Expiration_diff'].dropna())
print(f"Days_For_Expiration after differentiation: ADF={result[0]:.3f}, p-value={result[1]:.4f}")

Days_For_Expiration after differentiation: ADF=-12.214, p-value=0.0000


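**Obs.**:   
The original **Delivery_Lag** and **Days_For_Expiration** columns are not stationary and should be removed before training. Their first differences (**Delivery_Lag_diff** and **Days_For_Expiration_diff**) pass the ADF test (p-value < 0.05) and can be used instead.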
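
The effect of differencing can be illustrated on synthetic data. The sketch below is not part of the project pipeline: it builds a hypothetical random walk with NumPy and uses the lag-1 sample autocorrelation as a rough, informal stand-in for the ADF test applied above. The helper name `lag1_autocorr` and the simulated series are assumptions made for illustration only.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.dot(centered[:-1], centered[1:]) / np.dot(centered, centered))

# Hypothetical data: a random walk, which is non-stationary by construction
rng = np.random.default_rng(42)
random_walk = np.cumsum(rng.normal(size=2000))

# First-order differencing, the NumPy counterpart of Series.diff().dropna()
differenced = np.diff(random_walk)

print(f"lag-1 autocorrelation (level):       {lag1_autocorr(random_walk):.3f}")
print(f"lag-1 autocorrelation (differenced): {lag1_autocorr(differenced):.3f}")
```

A value close to 1 in levels and close to 0 after differencing is the pattern one would expect from a series that needs exactly one round of differencing. It is only a heuristic and does not replace a formal unit-root test.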

# Correlation Matrix Focusing on Relevant Variables

In [8]:
# Select most relevant variables for correlation analysis
correlation_variables = [
    'Sales_Volume', 'Stock_Quantity', 'Unit_Price', 'Inventory_Turnover_Rate',
    'Stock_Value', 'Days_For_Expiration', 'Stock_Coverage_Days', 'Delivery_Lag'
]

# Filter only variables that exist in the DataFrame
existing_variables = [v for v in correlation_variables if v in data.columns]
correlation = data[existing_variables].corr()

fig = go.Figure(data=go.Heatmap(
    z=correlation.values,
    x=correlation.columns,
    y=correlation.index,
    colorscale='RdBu',
    zmid=0,
    text=correlation.values.round(2),
    texttemplate="%{text}",
    textfont={"size": 10}
))

fig.update_layout(
    title='Correlation Matrix Between Key Variables',
    xaxis_title='Variables',
    yaxis_title='Variables',
    width=700,
    height=600
)

fig.show()

# Relationship Between Sales Volume and Stock Quantity

In [9]:
# Interactive scatter plot with additional information
fig = px.scatter(data, x='Stock_Quantity', y='Sales_Volume',
                 color='Category',  # color by product category
                 size='Unit_Price',  # size by unit price
                 hover_data=['Product_Name', 'Inventory_Turnover_Rate', 'Stock_Value'],
                 title='Relationship Between Stock Quantity and Sales Volume',
                 labels={'Stock_Quantity': 'Stock Quantity', 
                         'Sales_Volume': 'Sales Volume'})

fig.show()

# Analysis by Product Category

In [10]:
# Time analysis by category
fig = px.line(data, x=data.index, y='Sales_Volume', color='Category',
              title='Sales Volume by Category Over Time',
              labels={'Sales_Volume': 'Sales Volume', 'Date_Received': 'Date'})

fig.show()

# Stock quantity by category
fig = px.box(data, x='Category', y='Stock_Quantity',
             title='Stock Quantity Distribution by Category',
             labels={'Stock_Quantity': 'Stock Quantity', 'Category': 'Category'})
fig.show()

# Seasonality and Decomposition Analysis

In [11]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Decomposition for Sales Volume (missing values are dropped first)
sales_series = data['Sales_Volume'].dropna()
sales_decomposition = seasonal_decompose(sales_series, period=30)  # adjust period according to your data

fig = make_subplots(rows=4, cols=1, 
                    subplot_titles=['Original Series - Sales Volume', 'Trend', 'Seasonality', 'Residuals'])

# Plot against the cleaned series' own index so x and y always have the same length
fig.add_trace(go.Scatter(x=sales_series.index, y=sales_series, name='Original'), row=1, col=1)
fig.add_trace(go.Scatter(x=sales_series.index, y=sales_decomposition.trend, name='Trend'), row=2, col=1)
fig.add_trace(go.Scatter(x=sales_series.index, y=sales_decomposition.seasonal, name='Seasonality'), row=3, col=1)
fig.add_trace(go.Scatter(x=sales_series.index, y=sales_decomposition.resid, name='Residuals'), row=4, col=1)

fig.update_layout(height=800, title_text="Sales Volume Series Decomposition")
fig.show()

# Autocorrelation Analysis

In [12]:
# Function for interactive ACF/PACF charts
def plot_acf_pacf(series, title, max_lag=40):
    clean = series.dropna()
    acf_values = acf(clean, nlags=max_lag)
    pacf_values = pacf(clean, nlags=max_lag)
    # Approximate 95% bound under the white-noise hypothesis, based on the cleaned sample size
    conf_bound = 1.96 / np.sqrt(len(clean))
    
    fig = make_subplots(rows=1, cols=2, subplot_titles=[f'ACF - {title}', f'PACF - {title}'])
    
    # ACF
    fig.add_trace(go.Bar(x=list(range(max_lag+1)), y=acf_values, name='ACF'), row=1, col=1)
    fig.add_hline(y=conf_bound, line_dash="dash", row=1, col=1)
    fig.add_hline(y=-conf_bound, line_dash="dash", row=1, col=1)
    
    # PACF
    fig.add_trace(go.Bar(x=list(range(max_lag+1)), y=pacf_values, name='PACF'), row=1, col=2)
    fig.add_hline(y=conf_bound, line_dash="dash", row=1, col=2)
    fig.add_hline(y=-conf_bound, line_dash="dash", row=1, col=2)
    
    fig.update_layout(height=400, showlegend=False, title_text=f"Autocorrelation Analysis - {title}")
    fig.show()

# Plot for Sales Volume
plot_acf_pacf(data['Sales_Volume'], 'Sales Volume')

# Plot for Stock Quantity
plot_acf_pacf(data['Stock_Quantity'], 'Stock Quantity')

# Granger Causality Analysis

In [13]:
# Test causality relationships between relevant variables
variables_to_test = ['Unit_Price', 'Inventory_Turnover_Rate', 'Stock_Value', 'Delivery_Lag_diff', 'Days_For_Expiration_diff']
granger_results = []

for variable in variables_to_test:
    if variable in data.columns:
        for target in ['Sales_Volume', 'Stock_Quantity']:
            try:
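                # Drop NaN values; column order matters because the test checks
                # whether the second column (variable) Granger-causes the first (target)
                temp_data = data[[target, variable]].dropna()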
                if len(temp_data) > 50:  # minimum observations
                    result = grangercausalitytests(temp_data, maxlag=3, verbose=False)
                    p_values = [result[i+1][0]['ssr_chi2test'][1] for i in range(3)]
                    
                    granger_results.append({
                        'Explanatory Variable': variable,
                        'Target Variable': target,
                        'Lag 1 p-value': f"{p_values[0]:.4f}",
                        'Lag 2 p-value': f"{p_values[1]:.4f}", 
                        'Lag 3 p-value': f"{p_values[2]:.4f}",
                        'Causality': 'Yes' if any(p < 0.05 for p in p_values) else 'No'
                    })
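            except Exception as exc:
                # Report pairs the test cannot handle instead of silently skipping them
                print(f"Granger test failed for {variable} -> {target}: {exc}")
                continue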

if granger_results:
    granger_df = pd.DataFrame(granger_results)
    fig = go.Figure(data=[go.Table(
        header=dict(values=list(granger_df.columns),
                    fill_color='lightgreen',
                    align='left'),
        cells=dict(values=[granger_df[col] for col in granger_df.columns],
                   fill_color='lavender',
                   align='left'))
    ])
    fig.update_layout(title='Granger Causality Test - Relationships Between Variables')
    fig.show()
else:
    print("Could not perform Granger test with available data.")

**Obs.**:  
The Granger causality analysis indicates that **Days_For_Expiration** (tested in its differenced form, **Days_For_Expiration_diff**) is the only variable with statistically significant predictive power for **Sales_Volume**. 

Although other tested variables such as **Unit_Price** and **Stock_Value** have a theoretical relationship with sales, the statistical test did not confirm their direct predictive power. Therefore, we prioritized the inclusion of **Days_For_Expiration** in the sales forecasting model.
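
To make the logic of the test concrete, here is a minimal NumPy sketch of the F-statistic behind a Granger causality test: it compares a regression of the target on its own lags (restricted) with one that also includes lags of the candidate variable (unrestricted). It follows the same idea as the `ssr_ftest` entry returned by `grangercausalitytests`, whereas the notebook reads `ssr_chi2test`. The function names and the synthetic series are hypothetical, and no p-value is computed here.

```python
import numpy as np

def lagged_matrix(series, n_lags):
    """Columns hold series[t-1], ..., series[t-n_lags] for t = n_lags .. len(series)-1."""
    n = len(series)
    return np.column_stack([series[n_lags - k : n - k] for k in range(1, n_lags + 1)])

def ssr(design, response):
    """Sum of squared residuals of an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    residuals = response - design @ coef
    return float(residuals @ residuals)

def granger_f_stat(target, candidate, n_lags=2):
    """F statistic for 'candidate Granger-causes target' at the given lag order."""
    target = np.asarray(target, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    y = target[n_lags:]
    const = np.ones((len(y), 1))
    restricted = np.hstack([const, lagged_matrix(target, n_lags)])
    unrestricted = np.hstack([restricted, lagged_matrix(candidate, n_lags)])
    ssr_r, ssr_u = ssr(restricted, y), ssr(unrestricted, y)
    dof = len(y) - unrestricted.shape[1]
    return ((ssr_r - ssr_u) / n_lags) / (ssr_u / dof)

# Synthetic example: 'driven' depends on the previous value of 'driver'
rng = np.random.default_rng(0)
n = 500
driver = rng.normal(size=n)
noise = rng.normal(size=n)
driven = np.empty(n)
driven[0] = noise[0]
driven[1:] = 0.8 * driver[:-1] + noise[1:]
unrelated = rng.normal(size=n)

f_related = granger_f_stat(driven, driver)
f_unrelated = granger_f_stat(driven, unrelated)
print(f"F (driver -> driven):    {f_related:.1f}")
print(f"F (unrelated -> driven): {f_unrelated:.1f}")
```

A large F-statistic means the candidate's lags reduce the residual error well beyond what the target's own history achieves. A "No" in the table above corresponds to the second case, where adding the lags barely changes the fit.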

# Stock Status and Expiration Analysis

In [14]:
# Stock Status Analysis
if 'Status' in data.columns:
    fig = px.pie(data, names='Status', title='Stock Status Distribution')
    fig.show()

# Expiration Status Analysis
if 'Expiration_Status' in data.columns:
    fig = px.pie(data, names='Expiration_Status', title='Expiration Status Distribution')
    fig.show()

# Days to expiration vs Sales Volume
if 'Days_For_Expiration' in data.columns:
    fig = px.scatter(data, x='Days_For_Expiration', y='Sales_Volume',
                     color='Category', hover_data=['Product_Name'],
                     title='Relationship Between Days to Expiration and Sales Volume')
    fig.show()

# Complete Interactive Dashboard

In [15]:
# Create an interactive dashboard with main analyses
fig = make_subplots(
    rows=3, cols=2,
    specs=[[{"type": "scatter"}, {"type": "scatter"}],
           [{"type": "heatmap"}, {"type": "scatter"}],
           [{"type": "bar"}, {"type": "pie"}]],
    subplot_titles=("Sales Volume Over Time", "Stock Quantity Over Time",
                    "Correlation Matrix", "Stock vs Sales Relationship",
                    "ACF Sales Volume", "Distribution by Category")
)

# Sales Volume over time
fig.add_trace(go.Scatter(x=data.index, y=data['Sales_Volume'], name='Sales Volume'), row=1, col=1)

# Stock Quantity over time
fig.add_trace(go.Scatter(x=data.index, y=data['Stock_Quantity'], name='Stock Quantity'), row=1, col=2)

# Correlation matrix
corr_vars = ['Sales_Volume', 'Stock_Quantity', 'Unit_Price', 'Inventory_Turnover_Rate']
corr_vars = [v for v in corr_vars if v in data.columns]
corr = data[corr_vars].corr()
fig.add_trace(go.Heatmap(z=corr.values, x=corr.columns, y=corr.index, 
                         colorscale='RdBu', zmid=0, showscale=True), row=2, col=1)

# Relationship between Stock and Sales
fig.add_trace(go.Scatter(x=data['Stock_Quantity'], y=data['Sales_Volume'],
                         mode='markers', name='Stock vs Sales',
                         marker=dict(size=8, opacity=0.6)), row=2, col=2)

# ACF Sales Volume
acf_values = acf(data['Sales_Volume'].dropna(), nlags=20)
fig.add_trace(go.Bar(x=list(range(21)), y=acf_values, name='ACF'), row=3, col=1)

# Distribution by category (if available)
if 'Category' in data.columns:
    category_count = data['Category'].value_counts()
    fig.add_trace(go.Pie(labels=category_count.index, 
                         values=category_count.values, name='Categories'), row=3, col=2)

fig.update_layout(height=1000, title_text="Inventory and Sales Analysis Dashboard")
fig.show()

# Turnover and Stock Value Analysis

In [16]:
# Inventory Turnover Rate Analysis
if 'Inventory_Turnover_Rate' in data.columns:
    fig = make_subplots(rows=2, cols=1, subplot_titles=['Inventory Turnover Rate', 'Relationship with Sales Volume'])
    
    fig.add_trace(go.Scatter(x=data.index, y=data['Inventory_Turnover_Rate'], 
                             name='Turnover Rate'), row=1, col=1)
    
    fig.add_trace(go.Scatter(x=data['Inventory_Turnover_Rate'], y=data['Sales_Volume'],
                             mode='markers', name='Turnover vs Sales',
                             marker=dict(color=data['Unit_Price'] if 'Unit_Price' in data.columns else 'blue',
                                         size=8, opacity=0.6, showscale=True)),
                 row=2, col=1)
    
    fig.update_layout(height=600, title_text="Inventory Turnover Analysis")
    fig.show()

# Stock Value Analysis
if 'Stock_Value' in data.columns:
    fig = px.scatter(data, x='Stock_Value', y='Sales_Volume',
                     color='Category' if 'Category' in data.columns else None,
                     size='Stock_Quantity',
                     hover_data=['Product_Name', 'Unit_Price'],
                     title='Relationship Between Stock Value and Sales Volume')
    fig.show()

# Analysis of Sales Trend, Seasonality, and Autocorrelation

## Final Report: Time Series Analysis of Sales Volume

### Executive Summary

The analysis of the Sales Volume time series indicates that the series is largely unpredictable from its own history. The data fluctuates around a stable, non-trending mean, and the ACF and PACF plots show no meaningful autocorrelation or strong seasonality. This suggests the series behaves like a **white noise** process, which makes it poorly suited to traditional time series forecasting models.

---

#### 1. Trend Analysis

The data shows no significant long-term **trend**. Sales volume fluctuates around a relatively **consistent mean**, with no clear long-term growth or decline. This observation is consistent across the time series plot and the statistical tests. Any minor variations or waves observed are temporary and do not constitute a sustained trend.

---

#### 2. Seasonality and Periodicity

While a preliminary review might suggest a weekly pattern, a deeper **statistical analysis** does not support this conclusion.

* The **Autocorrelation Function (ACF)** plot shows no significant peaks repeating at regular lags. Such repeating peaks (for example at lags 7, 14, and 21 for a weekly cycle in daily data) would be the primary indicator of seasonality.

The absence of strong, recurring peaks indicates that the series does not exhibit strong **seasonality**.
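
A periodogram offers a complementary check for periodicity. The following is a minimal sketch on synthetic data, not the project's own computation: it uses NumPy's FFT to find the strongest frequency and the share of total power it carries. The function `dominant_period` and both simulated series are assumptions for illustration.

```python
import numpy as np

def dominant_period(values):
    """Return the period (in samples) of the strongest periodogram peak and its share of total power."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(centered), d=1.0)
    # Skip the zero frequency, which only reflects the (removed) mean
    peak = 1 + int(np.argmax(power[1:]))
    return 1.0 / freqs[peak], float(power[peak] / power[1:].sum())

rng = np.random.default_rng(7)
t = np.arange(350)
# Hypothetical series with a 7-sample cycle, and pure white noise for comparison
weekly = 3.0 * np.sin(2 * np.pi * t / 7) + rng.normal(size=t.size)
white_noise = rng.normal(size=t.size)

weekly_period, weekly_share = dominant_period(weekly)
noise_period, noise_share = dominant_period(white_noise)
print(f"weekly series: period={weekly_period:.1f}, share of power={weekly_share:.2f}")
print(f"white noise:   period={noise_period:.1f}, share of power={noise_share:.2f}")
```

In a seasonal series, one frequency concentrates a large share of the power. In white noise the power is spread across frequencies, so the "dominant" period carries only a small share and should not be read as a real cycle.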

---

#### 3. Autocorrelation and Predictability

The core finding of this analysis is that the series has **no significant autocorrelation**.

* The **ACF** chart confirms this, as all bars fall within the confidence interval (except for lag 0), indicating that sales at a given time are not significantly correlated with sales from previous periods.
* The **Partial Autocorrelation Function (PACF)** chart reinforces this finding. The absence of significant bars suggests that there is no direct correlation or "memory" in the series.

This lack of autocorrelation means that the sales volume for a given period is not dependent on its own past values, making it fundamentally difficult to forecast using historical data.
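
The "all bars inside the band" criterion can be expressed as a simple count. The sketch below, written with NumPy only and run on synthetic data, lists the lags whose sample autocorrelation exceeds the approximate 95% bound of 1.96/sqrt(n). The helper names and the simulated AR(1) series are hypothetical. Roughly 5% of lags can fall outside the band by chance even for true white noise.

```python
import numpy as np

def sample_acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    denom = np.dot(centered, centered)
    return np.array([
        np.dot(centered[: len(centered) - k], centered[k:]) / denom
        for k in range(max_lag + 1)
    ])

def lags_outside_band(values, max_lag=40):
    """Lags (excluding lag 0) whose autocorrelation exceeds the approximate 95% white-noise bound."""
    bound = 1.96 / np.sqrt(len(values))
    acf_values = sample_acf(values, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(acf_values[k]) > bound]

rng = np.random.default_rng(1)
n = 2000
white_noise = rng.normal(size=n)

# Hypothetical AR(1) series with memory, for contrast
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for i in range(1, n):
    ar1[i] = 0.7 * ar1[i - 1] + shocks[i]

noise_lags = lags_outside_band(white_noise)
ar1_lags = lags_outside_band(ar1)
print(f"white noise: {len(noise_lags)} of 40 lags outside the band")
print(f"AR(1):       {len(ar1_lags)} of 40 lags outside the band")
```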

---

### Conclusion and Recommendations

Based on the statistical analysis of the **ACF**, **PACF**, and **Periodogram** plots, the Sales Volume series is considered a **random** process. Its behavior is dominated by unpredictable fluctuations rather than by discernible trends or repeating patterns.

Therefore, building a sophisticated forecasting model (such as ARIMA or Prophet) would not be effective, as there are no patterns for the model to learn. The most statistically sound "forecast" for this series would simply be its **historical average**, as this is the best estimate for a random process.
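
The recommendation can be checked on simulated data. The sketch below assumes a hypothetical white-noise "sales" series (it does not use the project's data) and compares two baselines on a hold-out period: the training mean, and a naive forecast that repeats the previous observation.

```python
import numpy as np

rng = np.random.default_rng(123)
# Hypothetical white-noise sales fluctuating around a constant level
series = 100 + rng.normal(scale=10, size=1000)
train, test = series[:700], series[700:]

# Forecast 1: the historical (training) mean for every future period
mean_forecast = np.full(test.shape, train.mean())
# Forecast 2: naive one-step-ahead forecast, i.e. the previously observed value
naive_forecast = series[699:-1]

def mse(actual, predicted):
    """Mean squared error between two equally sized arrays."""
    return float(np.mean((actual - predicted) ** 2))

mse_mean = mse(test, mean_forecast)
mse_naive = mse(test, naive_forecast)
print(f"MSE, historical mean: {mse_mean:.1f}")
print(f"MSE, naive last value: {mse_naive:.1f}")
```

For a white-noise process the naive forecast adds the noise of the previous observation to the error, so its expected squared error is about twice that of the mean forecast. The same comparison on the real Sales_Volume series would be a reasonable sanity check before ruling out more complex models.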