<h1 style='color:Green'>Feature Engineering – Time-Based Signals</h1>
<h3>📌 Objectives</h3>
<pre>
    Transform cleaned raw data into high-signal features that capture:
    - Temporal patterns
    - Promotion effects
    - Holiday impact
    - Store & product behavior
    - Lagged demand dynamics
</pre>

<h2 style='color:purple'>Import Essentials</h2>

In [1]:
# import Libraries
import numpy as np 
import pandas as pd  

from pathlib import Path

<h2 style='color:purple'>Load Cleaned Data</h2>

In [2]:
# project path
PROJECT_ROOT = Path.cwd().parent
PROCESSED_DATA = PROJECT_ROOT / 'data' / "processed_data"

In [3]:
# Load data 
train = pd.read_parquet(PROCESSED_DATA / 'train_cleaned.parquet')
test = pd.read_parquet(PROCESSED_DATA / "test_cleaned.parquet")

train['date'] = pd.to_datetime(train['date'])
test['date'] = pd.to_datetime(test['date'])

train = train.sort_values(['store_nbr', 'family', 'date'])
test = test.sort_values(['store_nbr', 'family', 'date'])

In [4]:
train.head(3)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,store_type,cluster,dcoilwtico,holiday_type,locale,locale_name,description,is_holiday,is_workday,earthquake,is_payday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,93.14,Holiday,National,Ecuador,Primer dia del ano,1.0,0.0,0,0
1782,1782,2013-01-02,1,AUTOMOTIVE,2.0,0,Quito,Pichincha,D,13,93.14,,,,,0.0,0.0,0,0
3564,3564,2013-01-03,1,AUTOMOTIVE,3.0,0,Quito,Pichincha,D,13,92.97,,,,,0.0,0.0,0,0


In [6]:
train

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,city,state,store_type,cluster,dcoilwtico,holiday_type,locale,locale_name,description,is_holiday,is_workday,earthquake,is_payday
0,0,2013-01-01,1,AUTOMOTIVE,0.0,0,Quito,Pichincha,D,13,93.140000,Holiday,National,Ecuador,Primer dia del ano,1.0,0.0,0,0
1782,1782,2013-01-02,1,AUTOMOTIVE,2.0,0,Quito,Pichincha,D,13,93.140000,,,,,0.0,0.0,0,0
3564,3564,2013-01-03,1,AUTOMOTIVE,3.0,0,Quito,Pichincha,D,13,92.970000,,,,,0.0,0.0,0,0
5346,5346,2013-01-04,1,AUTOMOTIVE,3.0,0,Quito,Pichincha,D,13,93.120000,,,,,0.0,0.0,0,0
7128,7128,2013-01-05,1,AUTOMOTIVE,5.0,0,Quito,Pichincha,D,13,93.146667,Work Day,National,Ecuador,Recupero puente Navidad,0.0,1.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3047087,2993627,2017-08-11,54,SEAFOOD,0.0,0,El Carmen,Manabi,C,3,48.810000,Transfer,National,Ecuador,Traslado Primer Grito de Independencia,0.0,0.0,1,0
3048869,2995409,2017-08-12,54,SEAFOOD,1.0,1,El Carmen,Manabi,C,3,48.403333,,,,,0.0,0.0,1,0
3050651,2997191,2017-08-13,54,SEAFOOD,2.0,0,El Carmen,Manabi,C,3,47.996667,,,,,0.0,0.0,1,0
3052433,2998973,2017-08-14,54,SEAFOOD,0.0,0,El Carmen,Manabi,C,3,47.590000,,,,,0.0,0.0,1,0


<h2 style='color:Green'>Time Based Feature Engineering</h2>
<h3>📌 Objectives</h3>
<pre>
   Capture calendar-driven demand patterns:
    - Weekly cycles
    - Monthly seasonality
    - Weekend behavior
    - Salary payment effects (domain knowledge)
</pre>

<h3 style='color:purple'>Core Time-Based Features (Day, Week, Month, Year)</h3>
<pre>
    These are foundational features.
    - Sales differ by weekday
    - Monthly seasonality is strong in retail
    - Long-term trends captured by year
</pre>

In [7]:
# Add Features: Day, Week, Month, Year
def add_time_features(df):
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day

    df['day_of_week'] = df['date'].dt.dayofweek # monday=0, sunday=6
    df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)
    return df 

train = add_time_features(train)
test = add_time_features(test)
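A minimal sketch of the same calendar features on a few toy dates (the `demo` frame and the extra `iso_year` column are illustrative assumptions, not part of the dataset). It shows one property worth knowing before relying on `week_of_year`: ISO weeks follow the ISO calendar year, so an early-January date can belong to week 52 or 53 of the previous year.

```python
import pandas as pd

# Toy dates (hypothetical), including one just after a year boundary.
demo = pd.DataFrame(
    {"date": pd.to_datetime(["2013-01-01", "2013-01-05", "2016-01-03"])}
)

demo["year"] = demo["date"].dt.year
demo["month"] = demo["date"].dt.month
demo["day_of_week"] = demo["date"].dt.dayofweek  # Monday=0, Sunday=6

iso = demo["date"].dt.isocalendar()
demo["week_of_year"] = iso["week"].astype(int)
demo["iso_year"] = iso["year"].astype(int)  # illustrative extra column

# 2016-01-03 is a Sunday that falls in ISO week 53 of ISO year 2015.
print(demo)
```

If the week number is used as a seasonal feature, pairing it with the ISO year (or using day-of-year instead) is one way to avoid treating that Sunday as week 53 of 2016.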

<h3 style='color:purple'>Weekend Flag</h3>
<pre>
    Business Logic:
    - Saturday & Sunday → higher footfall (Saturday & Sunday: 1, weekday: 0)
    - Different buying behavior
</pre>

In [8]:
# Weekend Flag 
def add_weekend_flag(df): 
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    return df 

train = add_weekend_flag(train)
test = add_weekend_flag(test)

<h3 style='color:purple'>Payday Feature (CRITICAL Domain Feature)</h3>
<pre>
  Ecuador Salary Rule:
    Public sector salaries are paid on:
    - The 15th of the month
    - The last day of the month
    These paydays create demand spikes.
    Flag values: 1 = salary day, 0 = normal day
</pre>

In [9]:
# Month_end Flag 
def add_month_end_flag(df): 
    df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
    return df 

train = add_month_end_flag(train)
test = add_month_end_flag(test)

In [10]:
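# Payday flag
def payday_flag(df):
    # Overwrites the is_payday column that came with the cleaned data.
    # Each comparison is parenthesised because `|` binds more tightly than `==`.
    df['is_payday'] = (
        (df['day'] == 15) | (df['is_month_end'] == 1)
    ).astype(int)
    return df 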

train = payday_flag(train)
test = payday_flag(test)
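A small sketch of the payday rule on toy dates (the `demo` frame is hypothetical). It checks the two cases the rule names, the 15th and the last day of the month, including a month that ends on the 28th.

```python
import pandas as pd

# Toy dates (hypothetical): an ordinary day, the 15th, and two month ends.
demo = pd.DataFrame(
    {"date": pd.to_datetime(["2013-01-14", "2013-01-15", "2013-01-31", "2013-02-28"])}
)
demo["day"] = demo["date"].dt.day
demo["is_month_end"] = demo["date"].dt.is_month_end.astype(int)

# Parenthesise both comparisons: in Python `|` has higher precedence than `==`.
demo["is_payday"] = (
    (demo["day"] == 15) | (demo["is_month_end"] == 1)
).astype(int)

print(demo)
```

Using `is_month_end` rather than a fixed day number keeps February and 30-day months correct without special cases.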

<h3 style='color:purple'>Quick Sanity Check</h3>

In [11]:
# quick sanity check 
train[
    ["date", "day_of_week", "is_weekend", "is_month_end", "is_payday"]
].head(10)

Unnamed: 0,date,day_of_week,is_weekend,is_month_end,is_payday
0,2013-01-01,1,0,0,0
1782,2013-01-02,2,0,0,0
3564,2013-01-03,3,0,0,0
5346,2013-01-04,4,0,0,0
7128,2013-01-05,5,1,0,0
8910,2013-01-06,6,1,0,0
10692,2013-01-07,0,0,0,0
12474,2013-01-08,1,0,0,0
14256,2013-01-09,2,0,0,0
16038,2013-01-10,3,0,0,0


<h2 style='color:Green'>Lag Features (Sales, Promotion, Oil)</h2>
<h3>📌 Objectives</h3>
<pre>
  Capture temporal dependency:
    “Today’s sales depend heavily on what happened yesterday, last week, and last month.”
    - Sales lags → core signal
    - Promotion lags → delayed promo effect
    - Oil price lags → macro impact
</pre>

<h3 style='color:purple'>Sales Lag Features</h3>
<pre>
- Captures store-family specific behavior
- Works extremely well with tree models

 Lag   Meaning
   1   Yesterday’s demand
   7   Same day last week
  14   Bi-weekly pattern
  28   Monthly pattern
</pre>

In [12]:
# Sales Lag Features
SALES_LAGS = [1, 7, 14, 28]

for lag in SALES_LAGS: 
    train[f"sales_lag_{lag}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['sales'].shift(lag)
    )
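To illustrate why the lag is taken inside each store-family group, here is a minimal sketch on a toy frame (the `demo` data and the `naive_lag_1` column are assumptions made for the comparison). A plain `shift` on the sorted frame carries the last value of one series into the first row of the next; the grouped `shift` does not.

```python
import pandas as pd

# Two toy series (hypothetical): store 1 and store 2, same family.
demo = pd.DataFrame(
    {
        "store_nbr": [1, 1, 1, 2, 2, 2],
        "family": ["A"] * 6,
        "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"] * 2),
        "sales": [10.0, 11.0, 12.0, 50.0, 51.0, 52.0],
    }
).sort_values(["store_nbr", "family", "date"])

# Grouped lag: the first row of every series has no previous value.
demo["sales_lag_1"] = (
    demo.groupby(["store_nbr", "family"], observed=True)["sales"].shift(1)
)

# Ungrouped lag, shown only for contrast: store 2's first row receives
# store 1's last value (12.0).
demo["naive_lag_1"] = demo["sales"].shift(1)

print(demo)
```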

<h3 style='color:purple'>Promotion Lag Features </h3>
<pre>
Promotions don’t just affect today; they have carry-over effects.

  Business logic:
    - Customers stock up
    - Awareness spreads over days
</pre>

In [13]:
# Promotion Lag Features 
PROMO_LAGS = [1, 7]

for lag in PROMO_LAGS:
    train[f"promo_lag_{lag}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['onpromotion'].shift(lag)
    )

<h3 style='color:purple'>OIL PRICE LAG FEATURES (MACRO SIGNAL) </h3>
<pre>
 Oil prices affect:
  - Inflation
  - Purchasing power
  - Transportation cost
These effects are not instant, so lags matter.
</pre>

In [14]:
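# Oil price lags
OIL_LAGS = [7, 14, 28]

# Shift within each store-family series (rows are sorted by store, family, date)
# so a lag never pulls oil prices across a series boundary.
for lag in OIL_LAGS: 
    train[f"oil_lag_{lag}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['dcoilwtico'].shift(lag)
    )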
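Because the oil price is a single daily series repeated on every store-family row, another way to lag it is at the date level and then map the result back onto the rows. The sketch below assumes a toy frame (`demo`) with one price per date and uses a lag of 1 purely for brevity.

```python
import pandas as pd

# Toy panel (hypothetical): two stores observed on the same three dates.
demo = pd.DataFrame(
    {
        "store_nbr": [1, 1, 1, 2, 2, 2],
        "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"] * 2),
        "dcoilwtico": [93.14, 93.14, 92.97] * 2,
    }
)

# Collapse to one price per calendar date.
oil_by_date = (
    demo.drop_duplicates("date").set_index("date")["dcoilwtico"].sort_index()
)

# Lag the date-level series, then look the lagged value up for every row.
demo["oil_lag_1"] = demo["date"].map(oil_by_date.shift(1))

print(demo)
```

Shifting by rows treats consecutive rows as consecutive days, so this sketch assumes the date-level series has no missing calendar dates.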

<h3 style='color:purple'>What About TEST DATA? </h3>
<pre>
 Do not create sales lags for test yet.
  Why?
  - Test sales are unknown
  - Lags must be generated using last known train data
 📌 This will be handled during:
  👉 model inference / recursive prediction

✅ For now: lags only on train
</pre>
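As a rough sketch of what "generated using last known train data" can look like, the toy example below stacks the tail of train on top of test and shifts within each series. The frames `train_demo` and `test_demo`, the three-day test window, and the `is_test` marker are all assumptions for illustration; the notebook's real forecast horizon is not used here.

```python
import numpy as np
import pandas as pd

# Toy history (hypothetical): five known days, then three days to predict.
train_demo = pd.DataFrame(
    {
        "store_nbr": 1,
        "family": "A",
        "date": pd.date_range("2017-08-10", periods=5, freq="D"),
        "sales": [3.0, 4.0, 5.0, 6.0, 7.0],
    }
)
test_demo = pd.DataFrame(
    {
        "store_nbr": 1,
        "family": "A",
        "date": pd.date_range("2017-08-15", periods=3, freq="D"),
        "sales": np.nan,  # unknown at prediction time
    }
)

# Stack both frames so a shift can reach back from test rows into train rows.
full = pd.concat(
    [train_demo.assign(is_test=0), test_demo.assign(is_test=1)],
    ignore_index=True,
).sort_values(["store_nbr", "family", "date"])

grouped = full.groupby(["store_nbr", "family"], observed=True)["sales"]
full["sales_lag_1"] = grouped.shift(1)
full["sales_lag_3"] = grouped.shift(3)

test_lags = full.loc[full["is_test"] == 1, ["date", "sales_lag_1", "sales_lag_3"]]
print(test_lags)
```

A lag at least as long as the test window is fully known for every test row, while shorter lags are only known for the first test days and would have to be filled step by step from the model's own predictions.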

<h3 style='color:purple'>Quick Sanity Check</h3>

In [15]:
# Sanity Check
train[
    [
        "date", "sales",
        "sales_lag_1", "sales_lag_7",
        "promo_lag_1", "oil_lag_7"
    ]
].head(15)

Unnamed: 0,date,sales,sales_lag_1,sales_lag_7,promo_lag_1,oil_lag_7
0,2013-01-01,0.0,,,,
1782,2013-01-02,2.0,0.0,,0.0,
3564,2013-01-03,3.0,2.0,,0.0,
5346,2013-01-04,3.0,3.0,,0.0,
7128,2013-01-05,5.0,3.0,,0.0,
8910,2013-01-06,2.0,5.0,,0.0,
10692,2013-01-07,0.0,2.0,,0.0,
12474,2013-01-08,2.0,0.0,0.0,0.0,93.14
14256,2013-01-09,2.0,2.0,2.0,0.0,93.14
16038,2013-01-10,2.0,2.0,3.0,0.0,92.97


<h2 style='color:Green'>Rolling Statistics Features</h2>
<h3>📌 Objectives</h3>
<pre>
 Rolling features help models answer:
  - “Is demand increasing or decreasing?”
  - “How volatile is this product’s sales?”
  - “Is promotion pressure building up?”
 These features are effective for tree models and deep learning models.<br>
 Rolling windows must:
  - Use past values only
  - Always apply .shift(1) before rolling, so the current day's value never enters its own window
</pre>
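The sketch below, on a toy frame (`demo`, with a window of 2 chosen only to keep it short), compares two ways of writing "shift, then roll" for grouped data: chaining `.rolling()` after a grouped `.shift(1)`, and running both steps inside each group with `transform`. With the default `min_periods` the two agree, because the NaN that the grouped shift puts at the start of each series blanks every window that straddles two series. Setting `min_periods=1` removes that protection for the chained form.

```python
import numpy as np
import pandas as pd

# Two toy series (hypothetical) with very different sales levels.
demo = pd.DataFrame(
    {
        "store_nbr": [1] * 4 + [2] * 4,
        "family": ["A"] * 8,
        "sales": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
    }
)
keys = ["store_nbr", "family"]

# Grouped shift returns a plain Series with a NaN at the start of each group.
shifted = demo.groupby(keys, observed=True)["sales"].shift(1)

# Default min_periods (= window size).
chained = shifted.rolling(2).mean()
grouped = demo.groupby(keys, observed=True)["sales"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)
same_by_default = np.allclose(chained, grouped, equal_nan=True)

# min_periods=1: the chained window at store 2's first row now contains
# store 1's last shifted value, while the per-group version stays NaN.
chained_mp1 = shifted.rolling(2, min_periods=1).mean()
grouped_mp1 = demo.groupby(keys, observed=True)["sales"].transform(
    lambda s: s.shift(1).rolling(2, min_periods=1).mean()
)

print(same_by_default)
print(pd.DataFrame({"chained_mp1": chained_mp1, "grouped_mp1": grouped_mp1}))
```

The per-group `transform` is slower on many groups but does not depend on the `min_periods` setting to keep series separate.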

<h3 style='color:purple'>Rolling Mean (Trend) </h3>
<pre>
Interpretation:
 - High mean → strong demand
 - Low mean → weak demand
</pre>

In [16]:
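# Rolling Mean
ROLL_WINDOWS = [7, 14, 28]
for window in ROLL_WINDOWS: 
    # The grouped shift(1) leaves a NaN at the start of each store-family series.
    # With the default min_periods (= window), any window containing that NaN is NaN,
    # so no window mixes values from two series.
    train[f"sales_roll_mean_{window}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['sales']
        .shift(1).rolling(window).mean()
    )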

<h3 style='color:purple'>Rolling Standard Deviation (Volatility) </h3>
<pre>
Interpretation:
 - High std → unstable demand
 - Low std → stable demand
</pre>

In [17]:
# Rolling Standard Deviation 
ROLL_WINDOWS = [7, 14, 28]
for window in ROLL_WINDOWS: 
    train[f"sales_roll_std_{window}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['sales']
        .shift(1).rolling(window).std()
    )

<h3 style='color:purple'>PROMOTION ROLLING INTENSITY </h3>
<pre>
 Promotions work cumulatively over time.
 Business meaning:
  “How intense have promotions been recently?”<br> 
 📌 Value range (promo_freq_*):
   0.0 → no promotions
   1.0 → promotion every day in window
 promo_roll_sum_* is the total onpromotion value over the window and is not bounded by 1.
</pre>

In [18]:
# Rolling Promotion Count
for window in ROLL_WINDOWS: 
    train[f"promo_roll_sum_{window}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['onpromotion']
        .shift(1).rolling(window).sum()
    )

In [19]:
# Rolling Promotion Frequency 
train['promo_flag'] = (train['onpromotion'] > 0).astype(int)

for window in ROLL_WINDOWS:
    train[f"promo_freq_{window}"] = (
        train.groupby(['store_nbr', 'family'], observed=True)['promo_flag']
        .shift(1).rolling(window).mean()
    )
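A toy comparison of the two promotion features (the `demo` frame and the window of 3 are assumptions; `onpromotion` is treated here as a daily count, which is what the `> 0` flag above suggests). The rolling sum grows with how much was promoted, while the rolling frequency only records on how many past days anything was promoted.

```python
import pandas as pd

# One toy series (hypothetical); onpromotion is assumed to be a daily count.
demo = pd.DataFrame(
    {
        "store_nbr": [1] * 5,
        "family": ["A"] * 5,
        "onpromotion": [0, 3, 0, 5, 2],
    }
)
demo["promo_flag"] = (demo["onpromotion"] > 0).astype(int)
keys = ["store_nbr", "family"]

# Total promoted count over the previous 3 days (unbounded).
demo["promo_roll_sum_3"] = (
    demo.groupby(keys, observed=True)["onpromotion"].shift(1).rolling(3).sum()
)

# Share of the previous 3 days with any promotion (between 0 and 1).
demo["promo_freq_3"] = (
    demo.groupby(keys, observed=True)["promo_flag"].shift(1).rolling(3).mean()
)

print(demo)
```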

<h3 style='color:purple'>Quick Sanity Check</h3>

In [20]:
# Sanity Check
train[
    [
        "date", "sales",
        "sales_roll_mean_7",
        "sales_roll_std_7",
        "promo_roll_sum_7",
        "promo_freq_7"
    ]
].head(20)

Unnamed: 0,date,sales,sales_roll_mean_7,sales_roll_std_7,promo_roll_sum_7,promo_freq_7
0,2013-01-01,0.0,,,,
1782,2013-01-02,2.0,,,,
3564,2013-01-03,3.0,,,,
5346,2013-01-04,3.0,,,,
7128,2013-01-05,5.0,,,,
8910,2013-01-06,2.0,,,,
10692,2013-01-07,0.0,,,,
12474,2013-01-08,2.0,2.142857,1.772811,0.0,0.0
14256,2013-01-09,2.0,2.428571,1.511858,0.0,0.0
16038,2013-01-10,2.0,2.428571,1.511858,0.0,0.0


<h2 style='color:Green'>Holiday Feature Engineering</h2>
<h3>📌 Objectives</h3>
<pre>
   Encode holiday effects correctly, capturing:
    - National vs regional vs local impact
    - Extended holidays (bridges)
    - Compensatory workdays
    - Non-holiday days explicitly
  Retail demand is highly sensitive to these patterns.
</pre>

<h3 style='color:purple'>Holiday Scope Encoding </h3>
<pre>
<b> National / Regional / Local</b>
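 Why:
   - National holidays affect all stores
   - Regional holidays affect a state
   - Local holidays affect a city<br>
  📌 Interpretation:
   - At most one of these is 1 for a given row
   - All zero → normal day
   - The flags mark the scope of any calendar event, including Work Day rows, so read them together with is_holiday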
</pre>

In [21]:
# National Holiday Flag
train['is_national_holiday'] = (train['locale'] == 'National').astype(int)
test['is_national_holiday'] = (test['locale'] == 'National').astype(int)

In [22]:
# Regional holiday Flag
train['is_regional_holiday'] = (train['locale'] == 'Regional').astype(int)
test['is_regional_holiday'] = (test['locale'] == 'Regional').astype(int)

In [23]:
# Local Holiday Flag
train['is_local_holiday'] = (train['locale'] == 'Local').astype(int)
test['is_local_holiday'] = (test['locale'] == 'Local').astype(int)
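The following sketch builds the three scope flags on a toy frame (`demo` is hypothetical) that includes a national "Work Day" row. It also derives a stricter flag, `is_national_day_off`, which is an illustrative assumption rather than a column the notebook creates: it is 1 only when the row is both national in scope and an actual holiday.

```python
import pandas as pd

# Toy calendar rows (hypothetical): a national holiday, a national
# compensatory work day, a normal day, and a local holiday.
demo = pd.DataFrame(
    {
        "holiday_type": ["Holiday", "Work Day", None, "Holiday"],
        "locale": ["National", "National", None, "Local"],
        "is_holiday": [1, 0, 0, 1],
    }
)

# Scope flags, built the same way for each locale value.
for scope in ["National", "Regional", "Local"]:
    demo[f"is_{scope.lower()}_holiday"] = (demo["locale"] == scope).astype(int)

# Stricter variant (assumption): national scope AND an actual day off.
demo["is_national_day_off"] = (
    demo["is_national_holiday"] & demo["is_holiday"]
).astype(int)

print(demo)
```

The second row shows why the scope flag alone is not a holiday indicator: it is national in scope but `is_holiday` is 0.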

<h3 style='color:purple'> Bridge Holiday Flag</h3>
<pre>
   What is a Bridge?
    - Extra days added to extend holidays (long weekends).
   Why is it important?
    - Demand often behaves like a holiday
    - Sometimes the effect is even stronger than on the actual holiday
</pre>

In [24]:
# Bridge Holiday Flag
train['is_bridge'] = (train['holiday_type'] == 'Bridge').astype(int)
test['is_bridge'] = (test['holiday_type'] == 'Bridge').astype(int)

<h3 style='color:purple'>Compensatory Workday Flag </h3>
<pre>
   What is a Work Day?
    - A normally non-working day (e.g., Saturday)
    - People work to compensate for a bridge<br>
   Demand behavior:
    - Often lower supermarket sales
</pre>

In [25]:
# Compensatory Workday Flag
train["is_comp_workday"] = (train["holiday_type"] == "Work Day").astype(int)
test["is_comp_workday"] = (test["holiday_type"] == "Work Day").astype(int)

<h3 style='color:purple'>Holiday Proximity Features</h3>
<pre>Demand often spikes before holidays.</pre>

In [26]:
# Day Before Holiday
train["is_pre_holiday"] = (
    train.groupby(["store_nbr", "family"], observed=True)["is_holiday"]
    .shift(-1)
    .fillna(0)
    .astype(int)
)

test["is_pre_holiday"] = (
    test.groupby(["store_nbr", "family"], observed=True)["is_holiday"]
    .shift(-1)
    .fillna(0)
    .astype(int)
)


In [27]:
# Day After Holiday
train["is_post_holiday"] = (
    train.groupby(["store_nbr", "family"], observed=True)["is_holiday"]
    .shift(1)
    .fillna(0)
    .astype(int)
)

test["is_post_holiday"] = (
    test.groupby(["store_nbr", "family"], observed=True)["is_holiday"]
    .shift(1)
    .fillna(0)
    .astype(int)
)
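A quick illustration of the two proximity flags on a toy series (the `demo` frame is hypothetical). Shifting `is_holiday` by -1 looks one day ahead, which is acceptable here because the holiday calendar is known in advance; shifting by +1 looks one day back.

```python
import pandas as pd

# One toy series (hypothetical) with a single holiday on 2014-01-01.
demo = pd.DataFrame(
    {
        "store_nbr": [1] * 5,
        "family": ["A"] * 5,
        "date": pd.date_range("2013-12-30", periods=5, freq="D"),
        "is_holiday": [0.0, 0.0, 1.0, 0.0, 0.0],
    }
)
keys = ["store_nbr", "family"]

# Tomorrow is a holiday -> today is a pre-holiday day.
demo["is_pre_holiday"] = (
    demo.groupby(keys, observed=True)["is_holiday"].shift(-1).fillna(0).astype(int)
)

# Yesterday was a holiday -> today is a post-holiday day.
demo["is_post_holiday"] = (
    demo.groupby(keys, observed=True)["is_holiday"].shift(1).fillna(0).astype(int)
)

print(demo)
```

The first and last rows of each series have no neighbour on one side, so `fillna(0)` treats the unknown neighbour as a non-holiday.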


<h3 style='color:purple'>Quick Sanity Check</h3>

In [28]:
# Sanity Check  
train[
    [
        "date", "is_holiday",
        "is_national_holiday",
        "is_bridge",
        "is_comp_workday",
    ]
].sample(15)

Unnamed: 0,date,is_holiday,is_national_holiday,is_bridge,is_comp_workday
447369,2013-09-05,0.0,0,0,0
2466413,2016-09-25,0.0,0,0,0
2417915,2016-08-28,0.0,0,0,0
2149124,2016-04-09,0.0,0,0,0
220063,2013-05-04,0.0,0,0,0
1973480,2016-01-01,1.0,1,0,0
1188446,2014-10-21,0.0,0,0,0
700054,2014-01-24,0.0,0,0,0
20736,2013-01-12,0.0,1,0,1
1491329,2015-04-08,0.0,0,0,0


<h2 style='color:Green'>Store & Product Encoding: Target Encoding & Frequency Encoding</h2>
<h3>📌 Objectives</h3>
<pre>
  Convert high-cardinality categorical variables into numerical signals that ML models can learn from, without losing meaning.<br>
   We will encode:
    - store_nbr
    - family
    - store metadata like city, state, store_type
</pre>

<h3 style='color:purple'> FREQUENCY ENCODING</h3>
<pre>
   What is Frequency Encoding?
   Replace a category with how often it appears in data.<br>
   Why it works:
    - Popular products/stores behave differently
    - No target leakage (the target is never used)
    - Very stable<br>
  📌 Interpretation:
   - Higher value → more common product family</pre>

In [29]:
# Frequency Encoding - Product Family 
family_freq = train['family'].value_counts(normalize=True)

train['family_freq'] = train['family'].map(family_freq)
test['family_freq'] = test['family'].map(family_freq).fillna(0)

In [30]:
# Frequency Encoding - Store
store_freq = train['store_nbr'].value_counts(normalize=True)

train['store_freq'] = train['store_nbr'].map(store_freq)
test['store_freq'] = test['store_nbr'].map(store_freq).fillna(0)

In [31]:
# Frequency Encoding - City / State
city_freq = train['city'].value_counts(normalize=True)
state_freq = train['state'].value_counts(normalize=True) 

train['city_freq'] = train['city'].map(city_freq)
train['state_freq'] = train['state'].map(state_freq)

test['city_freq'] = test['city'].map(city_freq).fillna(0)
test['state_freq'] = test['state'].map(state_freq).fillna(0)
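A minimal frequency-encoding sketch on toy frames (`train_demo` and `test_demo` are hypothetical). Frequencies are learned from train only, and a category that never appears in train falls back to 0 in test.

```python
import pandas as pd

# Toy data (hypothetical); SEAFOOD does not occur in train_demo.
train_demo = pd.DataFrame({"family": ["EGGS", "EGGS", "MEATS", "DAIRY"]})
test_demo = pd.DataFrame({"family": ["EGGS", "SEAFOOD"]})

# Share of train rows per category.
freq = train_demo["family"].value_counts(normalize=True)

train_demo["family_freq"] = train_demo["family"].map(freq)
test_demo["family_freq"] = test_demo["family"].map(freq).fillna(0)

print(train_demo)
print(test_demo)
```

When every category has the same number of rows, the encoded column is constant and carries no signal. The sanity check for this notebook shows `family_freq` at 0.030303 and `store_freq` at 0.018519 on every sampled row, which is what a balanced panel would produce, so it may be worth confirming with `nunique()` before keeping those two columns.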

<h3 style='color:purple'>TARGET ENCODING </h3>
<pre>What is Target Encoding?
  - Replace category with average sales behavior
  - This is extremely powerful, and dangerous if done wrong: a mean computed from a row's own target leaks that target into the feature.<br>
 We encode using log-sales for stability.</pre>

In [32]:
# Target variable preparation 
train['sales_log'] = np.log1p(train['sales'])

In [33]:
# Target Encoding - Product Family
family_target_mean = (
    train.groupby("family", observed=True)["sales_log"]
    .mean()
)

global_mean = train["sales_log"].mean()

train["family_te"] = train["family"].map(family_target_mean)
test["family_te"] = test["family"].map(family_target_mean).fillna(global_mean)


In [34]:
# Target Encoding - Store
store_target_mean = (
    train.groupby("store_nbr", observed=True)["sales_log"]
    .mean()
)

train["store_te"] = train["store_nbr"].map(store_target_mean)
test["store_te"] = test["store_nbr"].map(store_target_mean).fillna(global_mean)
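One way to reduce the leakage risk on train rows is to encode each row with the category mean of earlier dates only. The sketch below is an assumption-laden illustration, not the notebook's method: the toy `demo` frame, the `family_te_past` column, and the choice to leave the first date as NaN are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical): one family, two stores, three dates.
demo = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2013-01-01"] * 2 + ["2013-01-02"] * 2 + ["2013-01-03"] * 2
        ),
        "store_nbr": [1, 2] * 3,
        "family": ["EGGS"] * 6,
        "sales": [0.0, 0.0, 1.0, 3.0, 7.0, 7.0],
    }
)
demo["sales_log"] = np.log1p(demo["sales"])

# Per-family, per-date sum and row count of the log target.
daily = (
    demo.groupby(["family", "date"], observed=True)["sales_log"]
    .agg(["sum", "count"])
    .sort_index()
)

# Running totals per family, moved back one date so each date only
# sees strictly earlier dates (the first date has no history -> NaN).
cum = daily.groupby(level="family").cumsum()
past = cum.groupby(level="family").shift(1)
daily["family_te_past"] = past["sum"] / past["count"]

# Attach the past-only encoding to every row.
demo = demo.merge(
    daily[["family_te_past"]].reset_index(), on=["family", "date"], how="left"
)

# Full-sample mean, shown for contrast: it includes each row's own target.
demo["family_te_full"] = demo.groupby("family", observed=True)[
    "sales_log"
].transform("mean")

print(demo[["date", "store_nbr", "sales_log", "family_te_past", "family_te_full"]])
```

Test rows can still use the full-train mean, since no test target enters it; the past-only variant matters for the rows the model is trained on.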


<h3 style='color:purple'>Quick Sanity Check</h3>

In [35]:
# Sanity Check 
train[
    [
        "store_nbr", "family",
        "family_freq", "store_freq", "city_freq", "state_freq",
        "family_te", "store_te"
    ]
].sample(10)


Unnamed: 0,store_nbr,family,family_freq,store_freq,city_freq,state_freq,family_te,store_te
1725141,14,AUTOMOTIVE,0.030303,0.018519,0.018519,0.018519,1.612195,2.62369
2030551,32,POULTRY,0.030303,0.018519,0.148148,0.203704,5.064399,2.201618
2671769,24,MAGAZINES,0.030303,0.018519,0.148148,0.203704,0.740832,3.325509
1872199,4,EGGS,0.030303,0.018519,0.333333,0.351852,4.534353,3.289774
389160,28,MEATS,0.030303,0.018519,0.148148,0.203704,5.113098,3.047216
50911,37,PERSONAL CARE,0.030303,0.018519,0.055556,0.055556,5.043507,3.388362
1788498,40,PRODUCE,0.030303,0.018519,0.037037,0.037037,4.495347,2.882559
920353,32,HOME AND KITCHEN II,0.030303,0.018519,0.148148,0.203704,1.886819,2.201618
471060,26,HOME CARE,0.030303,0.018519,0.148148,0.203704,3.279253,2.510817
1858606,9,GROCERY II,0.030303,0.018519,0.333333,0.351852,2.407541,3.409338


<h3 style='color:purple'>Save Final Features</h3>

In [36]:
# Path is already imported at the top of the notebook.
FEATURE_DATA = PROJECT_ROOT / "data" / "features"
FEATURE_DATA.mkdir(parents=True, exist_ok=True)

train.to_parquet(FEATURE_DATA / "train_features.parquet")
test.to_parquet(FEATURE_DATA / "test_features.parquet")
