---

This notebook contains the feature engineering for the first module of the **"Intelligent System for Supply Chain Management"** project.

The EDA showed that sales volume is volatile and hard to predict, which calls for feature engineering. The goal is to derive new features that improve forecasting and inventory optimization, addressing the critical problem of expired products.

---

Import Libraries

In [26]:
import pandas as pd
import numpy as np
import os
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

# Configure display and graphs
pd.set_option('display.max_columns', None)
pio.templates.default = "plotly_white"

import warnings
warnings.filterwarnings('ignore')

Load Data

In [27]:
# Define data paths
data_path = os.path.join('../data', 'processed')

# Load the pickled dataset
read_data = pd.read_pickle(os.path.join(data_path, 'grocery.pkl'))

# Sort DataFrame by Date_Received in ascending order
df = read_data.sort_values(by='Date_Received').reset_index(drop=True)

In [28]:
# Create a DataFrame with the non-numeric, non-datetime columns (string information on suppliers and products)
df_supp_prod_str = df.select_dtypes(exclude=[np.number, 'datetime'])

In [29]:
# Encode object and category columns as integer codes
df['Product_ID_Code'] = df['Product_ID'].astype('category').cat.codes
df['Supplier_ID_Code'] = df['Supplier_ID'].astype('category').cat.codes
df['Category_Code'] = df['Category'].cat.codes
df['Status_Code'] = df['Status'].cat.codes
df['Expiration_Status_Code'] = df['Expiration_Status'].cat.codes
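
The integer codes produced by `.cat.codes` are only meaningful together with the category order, so it can help to keep a lookup for decoding them later. A minimal sketch on a toy column (the values and the `code_to_label` name are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy data standing in for a column such as Product_ID (values are illustrative)
toy = pd.DataFrame({'Product_ID': ['P-03', 'P-01', 'P-02', 'P-01']})

# Convert once and reuse the categorical so codes and labels stay in sync
product_cat = toy['Product_ID'].astype('category')
toy['Product_ID_Code'] = product_cat.cat.codes

# Lookup table: code -> original label (categories are sorted by default)
code_to_label = dict(enumerate(product_cat.cat.categories))

# Decode the codes back to the original labels
toy['Decoded'] = toy['Product_ID_Code'].map(code_to_label)
```

Keeping such a mapping makes it possible to report forecasts by product or supplier name after the string columns have been dropped.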

In [30]:
# Categorical, string and datetime columns to drop
columns_drop = df_supp_prod_str.columns.tolist() + ['Last_Order_Date', 'Expiration_Date', 'Stock_Value', 'Purchase_Order']

# Drop Columns and make a dataframe copy
df = df.drop(columns=columns_drop).copy()

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Stock_Quantity           990 non-null    int64         
 1   Reorder_Level            990 non-null    int64         
 2   Reorder_Quantity         990 non-null    int64         
 3   Unit_Price               990 non-null    float64       
 4   Date_Received            990 non-null    datetime64[ns]
 5   Sales_Volume             990 non-null    int64         
 6   Inventory_Turnover_Rate  990 non-null    int64         
 7   Days_For_Expiration      990 non-null    Int64         
 8   Stock_Coverage_Days      990 non-null    int64         
 9   Delivery_Lag             990 non-null    int64         
 10  Product_ID_Code          990 non-null    int16         
 11  Supplier_ID_Code         990 non-null    int16         
 12  Category_Code            990 non-nul

In [32]:
# Aggregate to one row per daily Date_Received and Product_ID_Code
df = df.groupby(['Date_Received', 'Product_ID_Code']).agg(
    Stock_Quantity=('Stock_Quantity', 'sum'),
    Sales_Volume=('Sales_Volume', 'sum'),
    Reorder_Level=('Reorder_Level', 'sum'),
    Reorder_Quantity=('Reorder_Quantity', 'sum'),
    Unit_Price=('Unit_Price', 'mean'),
    Inventory_Turnover_Rate=('Inventory_Turnover_Rate', 'sum'),
    Days_For_Expiration=('Days_For_Expiration', 'sum'),
    Stock_Coverage_Days=('Stock_Coverage_Days', 'sum'),
    Delivery_Lag=('Delivery_Lag', 'sum'),
    # Codes are identifiers, so keep the first value instead of summing them
    Supplier_ID_Code=('Supplier_ID_Code', 'first'),
    Category_Code=('Category_Code', 'first'),
    Status_Code=('Status_Code', 'first'),
    Expiration_Status_Code=('Expiration_Status_Code', 'first'),
)


In [33]:
# Extract calendar features from the Date_Received index level
dates = df.index.get_level_values('Date_Received')
df['Year'] = dates.year
df['Month'] = dates.month
df['Day'] = dates.day
df['DayOfYear'] = dates.dayofyear
df['Weekday'] = dates.weekday
df['QuarterOfYear'] = dates.quarter
df['WeekOfYear'] = dates.isocalendar().week.values

In [34]:
df[['Reorder_Level', 'Stock_Quantity']].head(3)

Date_Received,Product_ID_Code,Reorder_Level,Stock_Quantity
2024-02-25,845,89,22
2024-02-26,217,96,70
2024-02-26,735,90,73


In [35]:
# Create Variable Stock Quantity / Sales Volume
df['Stock_Sales'] = df['Stock_Quantity'] / df['Sales_Volume']

In [36]:
# Create Variable Reorder Level / Sales Volume
df['Reorder_Sales'] = df['Reorder_Level'] / df['Sales_Volume']

In [37]:
# Create Variable Sales Volume / Inventory Turnover Rate
df['Sales_Inv_Turnover'] = df['Sales_Volume'] / df['Inventory_Turnover_Rate']

In [38]:
# Create Variable Stock Quantity / Inventory Turnover Rate
df['Stock_Inv_Turnover'] = df['Stock_Quantity'] / df['Inventory_Turnover_Rate']

In [39]:
# Create Variable (Reorder Level - Reorder Quantity) / Inventory Turnover Rate
df['Reorder_Level_Quantity_Turnover'] = (df['Reorder_Level'] - df['Reorder_Quantity']) / df['Inventory_Turnover_Rate']
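
Each ratio above divides by `Sales_Volume` or `Inventory_Turnover_Rate`. If a denominator is zero, pandas returns `inf` (or `NaN` for 0/0), which most models cannot consume. One possible guard, sketched on toy values, is a small helper that converts infinite results to `NaN`; the name `safe_ratio` is hypothetical and not part of this notebook:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise ratio with +/-inf replaced by NaN (hypothetical helper)."""
    ratio = numerator.astype('float64') / denominator.astype('float64')
    return ratio.replace([np.inf, -np.inf], np.nan)

# Toy rows; the last one has zero sales to trigger a division by zero
toy = pd.DataFrame({'Stock_Quantity': [22, 70, 10], 'Sales_Volume': [30, 34, 0]})
toy['Stock_Sales'] = safe_ratio(toy['Stock_Quantity'], toy['Sales_Volume'])
```

Whether the resulting `NaN` values should then be imputed or the rows dropped depends on the downstream model.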

In [40]:
df

Date_Received,Product_ID_Code,Stock_Quantity,Sales_Volume,Reorder_Level,Reorder_Quantity,Unit_Price,Inventory_Turnover_Rate,Days_For_Expiration,Stock_Coverage_Days,Delivery_Lag,Supplier_ID_Code,Category_Code,Status_Code,Expiration_Status_Code,Year,Month,Day,DayOfYear,Weekday,QuarterOfYear,WeekOfYear,Stock_Sales,Reorder_Sales,Sales_Inv_Turnover,Stock_Inv_Turnover,Reorder_Level_Quantity_Turnover
2024-02-25,845,22,30,89,25,21.00,62,146,5,0,495,1,2,2,2024,2,25,56,6,1,8,0.733333,2.966667,0.483871,0.354839,1.032258
2024-02-26,217,70,34,96,85,2.50,3,244,121,0,616,5,0,2,2024,2,26,57,0,1,9,2.058824,2.823529,11.333333,23.333333,3.666667
2024-02-26,735,73,75,90,77,4.00,19,9,19,0,505,3,1,1,2024,2,26,57,0,1,9,0.973333,1.200000,3.947368,3.842105,0.684211
2024-02-26,768,41,59,29,76,2.40,4,218,91,0,114,3,2,2,2024,2,26,57,0,1,9,0.694915,0.491525,14.750000,10.250000,-11.750000
2024-02-27,391,94,20,13,15,7.00,24,221,15,0,837,2,1,2,2024,2,27,58,1,1,9,4.700000,0.650000,0.833333,3.916667,-0.083333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-02-23,432,75,79,61,19,2.00,69,0,5,25,180,5,1,0,2025,2,23,54,6,1,8,0.949367,0.772152,1.144928,1.086957,0.608696
2025-02-23,452,60,51,69,64,2.75,6,0,60,351,476,4,0,0,2025,2,23,54,6,1,8,1.176471,1.352941,8.500000,10.000000,0.833333
2025-02-24,514,88,96,40,29,15.00,87,0,4,245,395,6,2,0,2025,2,24,55,0,1,9,0.916667,0.416667,1.103448,1.011494,0.126437
2025-02-24,628,53,41,98,23,5.00,84,0,4,235,934,3,1,0,2025,2,24,55,0,1,9,1.292683,2.390244,0.488095,0.630952,0.892857


In [41]:
# Compute the correlation matrix
correlation = df.corr()

# Keep only correlations of at least 0.80, excluding values equal to 1 (the diagonal)
corr_80 = correlation[(correlation >= 0.80) & (correlation != 1)].fillna(0)


In [42]:
# Plot Matrix Correlation
fig_corr = px.imshow(correlation, title='Correlation Feature Engineering', contrast_rescaling=False, text_auto=False)

fig_corr.update_layout(height=900, width=900)
fig_corr.show()

**Columns removed because their correlation is above 0.80**

- Stock_Coverage_Days
- DayOfYear
- QuarterOfYear
- WeekOfYear
- Sales_Inv_Turnover
- Month

In [43]:
# Columns to remove
drop_correlated = ['Stock_Coverage_Days', 'DayOfYear', 'QuarterOfYear', 'WeekOfYear', 'Sales_Inv_Turnover', 'Month']

# Drop them from both axes so the correlation matrix stays square
correlation = correlation.drop(index=drop_correlated, columns=drop_correlated)

In [47]:
# Plot Matrix Correlation
fig_corr2 = px.imshow(correlation, title='Correlation Feature Engineering', contrast_rescaling=False, text_auto=False)

fig_corr2.update_layout(height=900, width=1200)
fig_corr2.show()


In [48]:
# Remove the correlated columns from the main DataFrame
df.drop(columns=drop_correlated, inplace=True)
df

Date_Received,Product_ID_Code,Stock_Quantity,Sales_Volume,Reorder_Level,Reorder_Quantity,Unit_Price,Inventory_Turnover_Rate,Days_For_Expiration,Delivery_Lag,Supplier_ID_Code,Category_Code,Status_Code,Expiration_Status_Code,Year,Day,Weekday,Stock_Sales,Reorder_Sales,Stock_Inv_Turnover,Reorder_Level_Quantity_Turnover
2024-02-25,845,22,30,89,25,21.00,62,146,0,495,1,2,2,2024,25,6,0.733333,2.966667,0.354839,1.032258
2024-02-26,217,70,34,96,85,2.50,3,244,0,616,5,0,2,2024,26,0,2.058824,2.823529,23.333333,3.666667
2024-02-26,735,73,75,90,77,4.00,19,9,0,505,3,1,1,2024,26,0,0.973333,1.200000,3.842105,0.684211
2024-02-26,768,41,59,29,76,2.40,4,218,0,114,3,2,2,2024,26,0,0.694915,0.491525,10.250000,-11.750000
2024-02-27,391,94,20,13,15,7.00,24,221,0,837,2,1,2,2024,27,1,4.700000,0.650000,3.916667,-0.083333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2025-02-23,432,75,79,61,19,2.00,69,0,25,180,5,1,0,2025,23,6,0.949367,0.772152,1.086957,0.608696
2025-02-23,452,60,51,69,64,2.75,6,0,351,476,4,0,0,2025,23,6,1.176471,1.352941,10.000000,0.833333
2025-02-24,514,88,96,40,29,15.00,87,0,245,395,6,2,0,2025,24,0,0.916667,0.416667,1.011494,0.126437
2025-02-24,628,53,41,98,23,5.00,84,0,235,934,3,1,0,2025,24,0,1.292683,2.390244,0.630952,0.892857


# Stopped here

In [None]:
# Aggregate target columns by date, summing the numeric values for each day
data = read_data.groupby('Last_Order_Date').sum(numeric_only=True).reset_index()

# Set the date column as the DataFrame index for time series operations
df_target = data.set_index('Last_Order_Date')

# Resample to daily frequency so all dates are represented; days without records sum to 0
df_target = df_target.resample('D').sum().fillna(0)
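
To make the effect of the daily resampling concrete, here is a small sketch on toy data (the dates and values are illustrative assumptions). Two orders share a date and two calendar days have no orders at all:

```python
import pandas as pd

toy = pd.DataFrame({
    'Last_Order_Date': pd.to_datetime(['2024-02-25', '2024-02-25', '2024-02-28']),
    'Sales_Volume': [30, 34, 75],
})

# Sum per order date, then resample to a continuous daily index;
# days without any rows receive a sum of 0
daily = (
    toy.groupby('Last_Order_Date')[['Sales_Volume']].sum()
       .resample('D').sum()
)
```

The result covers every day from the first to the last order date, which is the regular spacing most time series models expect.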

In [None]:
date_range = pd.date_range(
    start=df_ts['Date_Received'].min(),
    end=df_ts['Date_Received'].max(),
    freq='D',
)
len(date_range)

In [None]:
df_ts = df_ts.groupby(['Date_Received', 'Product_Name', 'Supplier_Name']).agg(
    Stock_Quantity=('Stock_Quantity', 'sum'),
    Sales_Volume=('Sales_Volume', 'sum'),
    Reorder_Level=('Reorder_Level', 'sum'),
    Reorder_Quantity=('Reorder_Quantity', 'mean'),
    Unit_Price=('Unit_Price', 'mean'),
    Inventory_Turnover_Rate=('Inventory_Turnover_Rate', 'sum'),
    Days_For_Expiration=('Days_For_Expiration', 'sum'),
    Stock_Coverage_Days=('Stock_Coverage_Days', 'sum'),
    Delivery_Lag=('Delivery_Lag', 'sum'),
    Status_Code=('Status_Code', 'sum'),
    Expiration_Status_Code=('Expiration_Status_Code', 'sum'),
    Category_Code=('Category_Code', 'sum'),
    Product_Name_Code=('Product_Name_Code', 'sum'),
    Supplier_Name_Code=('Supplier_Name_Code', 'sum')

)

## Option A: Aggregation by date

In [None]:
_cols = df_ts.select_dtypes(np.number).columns.to_list()
categorical_cols = ['Status_Code', 'Expiration_Status_Code', 'Category_Code', 'Product_Name_Code', 'Supplier_Name_Code']

numeric_cols = list(set(_cols) - set(categorical_cols))

In [None]:
# Aggregate by date (sum/mean across all products)
# daily_aggregated = 
df_ts.groupby('Date_Received')[categorical_cols].sum()

# Aggregate by category and date
# category_daily = df_ts.groupby(['Date_Received', 'Product_Cat'])[numeric_cols].mean()


## Option B: Series for Specific Products

In [None]:
df_ts

In [None]:
# Select a specific product for analysis
# specific_product = df_ts.xs(('Arabica Coffee', 'Chatterpoint'), level=['Product_Name', 'Supplier_Name'])

# Or for an entire category
# coffee_products = 
df_ts.xs('Spinach', level='Product_Name')

In [None]:
df_ts.reset_index()

## Option C: Wide Format

In [None]:
# Pivot so that each product becomes a column
# df_ts_wide = df_ts[['Sales_Volume', 'Stock_Quantity']].unstack(['Product_Name','Supplier_Name'], fill_value=0).reindex(index=date_range)
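
The idea behind the wide format can be sketched on toy data as follows. The product names, dates and values are illustrative, and only one index level is unstacked to keep the example short:

```python
import pandas as pd

# Long format: one row per (date, product)
idx = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp('2024-02-25'), 'Spinach'),
        (pd.Timestamp('2024-02-27'), 'Spinach'),
        (pd.Timestamp('2024-02-27'), 'Arabica Coffee'),
    ],
    names=['Date_Received', 'Product_Name'],
)
toy = pd.DataFrame({'Sales_Volume': [30, 34, 75]}, index=idx)

# Continuous daily calendar covering the observed period
date_range = pd.date_range('2024-02-25', '2024-02-27', freq='D')

# Wide format: one column per product, one row per day, gaps filled with 0
wide = (
    toy['Sales_Volume']
    .unstack('Product_Name', fill_value=0)
    .reindex(date_range, fill_value=0)
)
```

Filling gaps with 0 treats a missing record as "no sales", which is an assumption worth checking against how the data was collected.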

# Temporal Feature Engineering

In [None]:
# Add temporal features
df_complete = df_complete.reset_index()
df_complete['day_of_week'] = df_complete['Date_Received'].dt.dayofweek
df_complete['month'] = df_complete['Date_Received'].dt.month
df_complete['quarter'] = df_complete['Date_Received'].dt.quarter
df_complete['year'] = df_complete['Date_Received'].dt.year

# Restore the MultiIndex
df_complete = df_complete.set_index(['Date_Received', 'Product_Cat', 'Product_Supplier'])

In [None]:
# Transform Categorical Columns
df_ts['Status_Code'] = df_ts['Status'].cat.codes
df_ts['Expiration_Status_Code'] = df_ts['Expiration_Status'].cat.codes
df_ts['Category_Code'] = df_ts['Category'].cat.codes
df_ts['Product_Name_Code'] = df_ts['Product_Name'].astype('category').cat.codes
df_ts['Supplier_Name_Code'] = df_ts['Supplier_Name'].astype('category').cat.codes

df_ts = df_ts.drop(columns=['Status', 'Expiration_Status', 'Category'])

In [None]:
corr_ = df_stat_exog.select_dtypes(np.number).corr()
corr_ = corr_[(corr_ < -0.2) | (corr_ > 0.2)]
corr_

In [None]:
fig = px.imshow(corr_, text_auto=True, aspect='auto')
fig.show()

Analysis of Exogenous Variables in the Time Series

1. Correlation with the target

Numerical exogenous variables show low correlation with the target and among themselves.

Categorical variables, after one-hot encoding, display weak internal correlations and very low correlation with the target.

2. Expected impact on forecasting

Given the low level of association, these variables are unlikely to provide immediate performance improvements.

Including them at this stage could increase the risk of overfitting, especially considering the limited data available.

3. Strategic approach

Exogenous variables will be retained but not actively used in the current modeling process.

This approach ensures we preserve potential value, as longer historical series may reveal patterns not visible today.

4. Next steps

Reassess the contribution of exogenous variables as more data becomes available.

Evaluate selection techniques tailored to time series (e.g., Granger causality, feature importance from tree-based models, multivariate approaches such as VAR/VARMAX).

Compare model performance with and without exogenous variables to support future decisions.
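
As a lightweight first check before formal tests such as Granger causality, one could inspect how an exogenous variable correlates with the target at several lags. This is a sketch on synthetic data, not the project's method; `lagged_correlations`, `exog` and `target` are hypothetical names:

```python
import numpy as np
import pandas as pd

def lagged_correlations(exog: pd.Series, target: pd.Series, max_lag: int = 5) -> pd.Series:
    """Pearson correlation between target[t] and exog[t - lag] for lag = 0..max_lag."""
    return pd.Series(
        {lag: target.corr(exog.shift(lag)) for lag in range(max_lag + 1)},
        name='correlation',
    )

rng = np.random.default_rng(42)
exog = pd.Series(rng.normal(size=300))
# Synthetic target that follows the exogenous series with a 2-step delay plus noise
target = exog.shift(2) + rng.normal(scale=0.1, size=300)

lag_corr = lagged_correlations(exog, target)
best_lag = lag_corr.idxmax()
```

A variable that looks uncorrelated at lag 0 can still be informative at a delay, so a scan like this is a cheap complement to the contemporaneous correlation matrix.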

In [None]:
df_exo = df[variables_exog].sort_values(by='Date_Received').reset_index(drop=True)

In [None]:
df_exo