# PREPARATION OF PRODUCTION SCRIPT

## APPROACH FOR RETRAINING AND EXECUTION SCRIPTS

### Goal

In this notebook, we prepare the groundwork for two scripts:

- **Retraining script**: models lose predictive power over time (data drifts, the market changes). There is no fixed schedule: an insurer may retrain every few years, a digital advertising platform every few minutes. A common trigger is a 5–10% drop in predictive performance. This script keeps the model up to date.
- **Execution (scoring) script**: runs in production (batch, API, app...). This is the one **actually used** to make new predictions. 

These two scripts are the last two notebooks in this project.

### Why create pipelines (retraining, execution)

We will create pipelines to:

- ensure that all data transformations and the model are applied in exactly the same order during both training and prediction, which prevents data leakage and schema mismatches.
- encapsulate feature engineering, encoding, scaling, and the model in a single reusable artifact, which makes deployment simple, reliable, and reproducible.
- help automate retraining and keep production scoring consistent and less error-prone.
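
As a minimal sketch of the first point, the toy pipeline below shows that the scaler's statistics are learned once in `fit` and reused unchanged at prediction time. The data is hypothetical, and `LogisticRegression` is only a stand-in for the project's XGBoost model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy training data (hypothetical values, for illustration only)
X_train = np.array([[0.0], [2.0], [8.0], [10.0]])
y_train = np.array([0, 0, 1, 1])

# Stand-in pipeline: scaling followed by a simple classifier
toy_pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression()),
])
toy_pipe.fit(X_train, y_train)

# The scaler stores the range seen during training ...
learned_min = toy_pipe.named_steps["scale"].data_min_[0]
learned_max = toy_pipe.named_steps["scale"].data_max_[0]

# ... and predict() reuses that range on new data instead of refitting
X_new = np.array([[1.0], [20.0]])
predictions = toy_pipe.predict(X_new)
max_after_predict = toy_pipe.named_steps["scale"].data_max_[0]

print(learned_min, learned_max, max_after_predict, predictions)
```

Even though the new data contains a value (20.0) above the training maximum, the stored range does not move: only `fit` updates it.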

### What's outside and inside the pipeline

- If the transformation must run again during prediction: put it inside the pipeline  
- If it is a one-time structural data fix: keep it outside the pipeline  

**Outside the pipeline (Pandas – one-time structural fixes)**  
- Correct column names  
- Convert data types (`date` to datetime, `holiday_promo` to category)  
- Remove duplicates and fully empty rows (not needed in this project) 
- Create the target variable (`stockout_14d`). *The target is always created outside the pipeline: in production scoring it does not exist yet, so the pipeline must run without it. It is only generated during training, when historical outcomes are known.*
- X/y split

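These steps are not fitted on data, so they stay outside the pipeline. The pipeline still expects its input in this cleaned schema (renamed columns, `date` as datetime), so new data that arrives in the raw format needs the same renaming and type conversion before scoring.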
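
One way to keep these structural fixes in a single place is to wrap them in a small function that any script receiving raw-format data can call. The sketch below is an assumption, not part of the notebook: `prepare_raw` and `RENAME_MAP` are hypothetical names, and only three columns are mapped to keep it short.

```python
import pandas as pd

# Hypothetical subset of the raw-to-clean column mapping
RENAME_MAP = {
    "Date": "date",
    "Holiday/Promotion": "holiday_promo",
    "Inventory Level": "inventory_level",
}


def prepare_raw(df_raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the structural fixes that live outside the pipeline."""
    df = df_raw.rename(columns=RENAME_MAP)  # returns a new frame
    df["date"] = pd.to_datetime(df["date"])
    df["holiday_promo"] = df["holiday_promo"].astype("category")
    return df


# Toy raw extract (hypothetical values)
raw = pd.DataFrame({
    "Date": ["2024-01-05", "2024-01-06"],
    "Holiday/Promotion": [0, 1],
    "Inventory Level": [120, 80],
})
clean = prepare_raw(raw)
print(clean.dtypes)
```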

**Inside the pipeline (Sklearn – transformations needed for prediction)**  
- Feature engineering: each step is encapsulated as a transformer and inserted into the pipeline, so it is applied automatically during both training and prediction.
    - Creating new date variables (month, day_of_week, is_weekend...)  
    - Encodings (One-Hot Encoding, Target Encoding)  
    - Scaling (MinMax)  
- Restrict to selected final model features  
- XGBoost model  

These steps must also run on future data when predicting

### Datasets used in retraining and execution scripts

**Training code uses:**  
- Historical dataset with features + known target  
- Target created outside pipeline  
- Train/validation split  
- Fit full pipeline and model  

**Execution (production) code uses:**  
- New incoming operational dataset  
- NO Target column (unknown future)  
- Apply saved pipeline to transform data  
- Model predicts `stockout_14d`
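
Because the execution data has no target column, a small guard before scoring can catch schema problems early. The following is a sketch under assumptions: `check_scoring_input` and the short `expected` column list are hypothetical and not part of the project code.

```python
import pandas as pd

TARGET = "stockout_14d"


def check_scoring_input(df_new: pd.DataFrame, expected_columns: list) -> pd.DataFrame:
    """Return df_new restricted to the expected feature columns, without the target."""
    features = [c for c in expected_columns if c != TARGET]
    missing = [c for c in features if c not in df_new.columns]
    if missing:
        raise ValueError(f"Missing columns for scoring: {missing}")
    # Selecting by name drops the target and any unexpected extra columns
    return df_new[features]


# Hypothetical incoming batch that still carries a stray target column
expected = ["date", "store_id", "inventory_level"]
new_batch = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05"]),
    "store_id": ["S001"],
    "inventory_level": [120],
    "stockout_14d": [0],
})
scoring_frame = check_scoring_input(new_batch, expected)
print(list(scoring_frame.columns))
```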

## IMPORT LIBRARIES

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import FunctionTransformer

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from xgboost import XGBClassifier

import cloudpickle
import os

# Disable warnings for cleaner logs
import warnings
warnings.filterwarnings("ignore")

# Display formatting
pd.options.display.float_format = '{:.2f}'.format

# Autocomplete
%config IPCompleter.greedy=True

## IMPORT DATA

### Import the raw dataset

In [2]:
project_path = '/Users/rober/retail-stockout-risk-scoring/'
file_name_data = 'retail_store_inventory.csv'
# os.path.join avoids the doubled slash that string concatenation produces here
path = os.path.join(project_path, '02_Data', '01_Raw', file_name_data)
df = pd.read_csv(path)

## OUTSIDE THE PIPELINE (pandas)

### Correct column names

In [3]:
df = df.rename(columns={
    "Date": "date",
    "Store ID": "store_id",
    "Product ID": "product_id",
    "Category": "category",
    "Region": "region",
    "Inventory Level": "inventory_level",
    "Units Sold": "units_sold",
    "Units Ordered": "units_ordered",
    "Demand Forecast": "demand_forecast",
    "Price": "price",
    "Discount": "discount",
    "Weather Condition": "weather",
    "Holiday/Promotion": "holiday_promo",
    "Competitor Pricing": "competitor_pricing",
    "Seasonality": "seasonality",
})

### Convert data types

In [4]:
df['holiday_promo'] = df['holiday_promo'].astype('category')

df['date'] = pd.to_datetime(df['date'])

### Create the target variable: stockout_14d

In [5]:
df['stockout_14d'] = (df['inventory_level'] <= df['demand_forecast'] * 14).astype(int)
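
To make the rule concrete, the sketch below applies the same comparison to three hypothetical rows: a row is flagged when the inventory level does not exceed 14 times the demand forecast. The values and the helper column `cover_needed` are illustrative only.

```python
import pandas as pd

# Three hypothetical rows, for illustration only
toy = pd.DataFrame({
    "inventory_level": [100, 1400, 2000],
    "demand_forecast": [10.0, 100.0, 100.0],
})

# Stock needed to cover 14 times the forecast: 140, 1400, 1400
toy["cover_needed"] = toy["demand_forecast"] * 14

# Same rule as the notebook: flag rows whose inventory is at or below that level
toy["stockout_14d"] = (toy["inventory_level"] <= toy["cover_needed"]).astype(int)

# Share of positive rows, useful to check class balance before training
positive_rate = toy["stockout_14d"].mean()
print(toy)
print(positive_rate)
```

Note that the comparison is `<=`, so a row whose inventory exactly equals the required cover (the second row) is still flagged.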

### X/y split

In [6]:
y = df['stockout_14d']
X = df.drop(columns=['stockout_14d'])

## INSIDE THE PIPELINE (+ CREATE THE PIPELINE)

Each step is encapsulated as a transformer and inserted into the pipeline.

- Creating new date variables (month, day_of_week, is_weekend...): `DateFeatureExtractor`
- Encodings (One-Hot Encoding, Target Encoding): `OneHotEncoder`, `TargetEncoder`
- Scaling (MinMax): `MinMaxScaler`

### New date variables

In [7]:
class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    
    def __init__(self, date_col='date'):
        self.date_col = date_col

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()

        X['day_of_week']  = X[self.date_col].dt.dayofweek
        X['week_of_year'] = X[self.date_col].dt.isocalendar().week.astype(int)
        X['month']        = X[self.date_col].dt.month
        X['year']         = X[self.date_col].dt.year
        X['day_of_month'] = X[self.date_col].dt.day
        X['is_weekend']   = (X['day_of_week'] >= 5).astype(int)
        
        X = X.drop(columns=[self.date_col])

        return X
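
As a quick check of the date logic, the sketch below applies the same pandas accessors to three hypothetical dates (a Friday, a Saturday and a Sunday). It rebuilds a few of the features directly rather than calling the transformer, so that it runs on its own.

```python
import pandas as pd

# Friday, Saturday and Sunday of the first ISO week of 2024
dates = pd.DataFrame({"date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"])})

features = pd.DataFrame({
    "day_of_week": dates["date"].dt.dayofweek,  # Monday=0 ... Sunday=6
    "week_of_year": dates["date"].dt.isocalendar().week.astype(int),
    "month": dates["date"].dt.month,
})

# Saturday (5) and Sunday (6) count as weekend
features["is_weekend"] = (features["day_of_week"] >= 5).astype(int)
print(features)
```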

### Transformations: OHE, TE, Min-Max

In [8]:
# -- Transformations (OHE, TE, MMS) --

# Encodings: OHE, TE

# OHE
var_ohe = ['store_id', 'category', 'region', 'weather', 'seasonality', 'holiday_promo']
ohe = OneHotEncoder(sparse_output = False, handle_unknown='ignore')

# TE
var_te = ['product_id']
te = TargetEncoder(min_samples_leaf=100, return_df = False)

# Rescaling: MMS
var_mms = ['inventory_level','units_sold','units_ordered','demand_forecast','price','discount','competitor_pricing']
mms = MinMaxScaler()

# --  Create the transformers --
preprocessor = ColumnTransformer(
    transformers=[
        ('ohe', ohe, var_ohe),
        ('te', te, var_te),
        ('mms', mms, var_mms)
    ],
    remainder='passthrough'
)

### Instantiate the model

In [9]:
model = XGBClassifier(
    n_estimators=1000,
    max_depth=10,
    learning_rate=0.05,
    n_jobs=-1,
    random_state=42
)

## CREATE THE PIPELINES: RETRAINING AND EXECUTION

In [11]:
models_path = project_path + '/04_Models/'
pipe_retraining_path = os.path.join(models_path, "pipe_retraining.pkl")
pipe_execution_path = os.path.join(models_path, "pipe_execution.pkl")

# Pipeline with date features + OHE + TE + MMS + Model
pipeline = Pipeline(steps=[
    ('date_features', DateFeatureExtractor()),
    ('preprocess', preprocessor),
    ('model', model)
])

# Save UNTRAINED pipeline (RETRAINING)
with open(pipe_retraining_path, 'wb') as f:
    cloudpickle.dump(pipeline, f)
print("✔ Untrained pipeline saved:", pipe_retraining_path)

# Train
pipeline.fit(X, y)

# Save TRAINED pipeline (EXECUTION)
with open(pipe_execution_path, 'wb') as f:
    cloudpickle.dump(pipeline, f)
print("✔ Trained pipeline saved:", pipe_execution_path)

✔ Untrained pipeline saved: /Users/rober/retail-stockout-risk-scoring//04_Models/pipe_retraining.pkl
✔ Trained pipeline saved: /Users/rober/retail-stockout-risk-scoring//04_Models/pipe_execution.pkl


## Recap

We have saved:

- **pipe_retraining**: the untrained pipeline, in case we want to retrain it in the future.
- **pipe_execution**: the trained pipeline (already fitted), which we will later use to make predictions.
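
A quick way to gain confidence in a saved, fitted pipeline is to check that it produces the same scores after a save/load round trip. The sketch below is an illustration under assumptions: it uses the standard-library `pickle` in memory as a stand-in for `cloudpickle` and a file on disk, and a toy scaler plus `LogisticRegression` instead of the project's pipeline.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values)
X_train = np.array([[0.0], [2.0], [8.0], [10.0]])
y_train = np.array([0, 0, 1, 1])

fitted = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", LogisticRegression()),
]).fit(X_train, y_train)

# Serialize and restore in memory (the notebook writes to a .pkl file instead)
restored = pickle.loads(pickle.dumps(fitted))

# Score the same new rows with both objects and compare
X_new = np.array([[1.0], [9.0]])
scores_before = fitted.predict_proba(X_new)[:, 1]
scores_after = restored.predict_proba(X_new)[:, 1]
print(np.allclose(scores_before, scores_after))
```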

From this point on, we will generate two scripts:

- **Retraining**: 

    - Models gradually lose predictive power over time (data drifts, the market evolves...)
    - There is no fixed rule for how often to retrain (e.g., insurance or energy companies may retrain every 5 years, while in digital advertising it can be every few seconds). 
    - Typically, retraining is triggered when predictive performance drops by 5–10%. 
    - We’ll keep this code ready, although the one we’ll actually use is the execution script. 
    

- **Execution**: An engineer will deploy this script in a production environment (to run in batch mode, or via API, or as part of an app...).
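
The retraining trigger described above can be sketched as a simple relative-drop check. This is a hypothetical helper, not part of the project code: `needs_retraining`, the 5% default threshold, and the example scores are assumptions chosen for illustration.

```python
def needs_retraining(baseline_score: float, current_score: float,
                     max_relative_drop: float = 0.05) -> bool:
    """Return True when the metric fell by more than max_relative_drop relative to the baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    relative_drop = (baseline_score - current_score) / baseline_score
    return relative_drop > max_relative_drop


# Hypothetical monitoring values for a metric where higher is better
small_drop = needs_retraining(0.90, 0.88)  # drop of about 2.2%
large_drop = needs_retraining(0.90, 0.80)  # drop of about 11.1%
print(small_drop, large_drop)
```

The check assumes a metric where higher is better (for example AUC) and a baseline measured on data comparable to the current monitoring window.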