# Santander Product Recommendation System

## Table of Contents

### Part 1: Data Preparation
* [1. Data Loading](#section-1) - Load and combine training/test datasets with memory optimization
* [2. Data Cleaning](#section-2) - Handle missing values and data quality issues
* [3. Feature Engineering](#section-3) - Create 193 engineered features for better predictions

### Part 2: Model Development
* [4. Train/Validation Split](#section-4) - Time-based split strategy for recommendation systems
* [5. LightGBM Model Training](#section-5) - Train 24 binary classifiers (one per product)
* [6. Model Evaluation](#section-6) - Compute MAP@7 and AUC metrics

### Part 3: Model Analysis
* [7. Fairness Testing](#section-7) - Evaluate recommendation bias across demographic groups
  * 7.1 Fairness by Age
  * 7.2 Fairness by Gender
  * 7.3 Fairness by Income
  * 7.4 Cross-Group Analysis

### Part 4: Deployment
* [8. Model Export](#section-8) - Save models and artifacts for production deployment

---

## 🎯 Project Goals

Build a recommendation system that predicts which financial products customers are likely to purchase next, helping Santander Bank:
- **Personalize** product offerings to individual customers
- **Increase** cross-sell conversion rates
- **Ensure** fair treatment across demographic groups

---

## 📊 Key Results

| Metric | Value |
|--------|-------|
| **Dataset Size** | 10M+ customer records |
| **Time Period** | Oct 2015 - Jun 2016 (9 months) |
| **Products** | 24 financial products |
| **Models Trained** | 21 LightGBM classifiers |
| **MAP@7** (Model only) | 0.0141 |
| **MAP@7** (Hybrid) | 0.0147 |
| **Average AUC** | ~0.85 |
| **Features** | 193 engineered features |

---

## 🛠️ Tech Stack

- **Data Processing**: Pandas, NumPy
- **Modeling**: LightGBM, Scikit-learn
- **Visualization**: Matplotlib, Seaborn
- **Deployment**: FastAPI, Docker

---

**GitHub Repository**: https://github.com/yiruchen03/Santander-product-recommendation-system

---




In [1]:
import numpy as np
import pandas as pd
import os
import random
from timeit import default_timer as timer
import seaborn as sns
import matplotlib.pyplot as plt
import zipfile

from sklearn.preprocessing import LabelEncoder
import gc  

import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score


<a id="1"></a>
## Get Data
The dataset contains purchase data about the bank's customers and the bank's products.
The data starts at 2015-01-28 and has monthly records of products a customer has, such as "credit
card", "savings account", etc. To save memory, I only train on data from 2015-06 to 2016-05.

In [None]:
import pandas as pd
import zipfile
train_zip = os.path.expanduser("~/Downloads/santander-product-recommendation/train_ver2.csv.zip")
test_zip  = os.path.expanduser("~/Downloads/santander-product-recommendation/test_ver2.csv.zip")

train_csv_name = "train_ver2.csv"
test_csv_name  = "test_ver2.csv"

need_months = pd.period_range("2015-10", "2016-06", freq="M")

def read_zip_csv_filtered(zip_path, csv_name, chunksize=1_000_000, usecols=None, dtypes=None):
    """read from zip file and filter by month"""
    out_chunks = []
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(csv_name) as f:
            for chunk in pd.read_csv(
                f,
                chunksize=chunksize,
                dtype=dtypes,
                usecols=usecols,
                low_memory=False
            ):
                # parse date (only parse the date we need)
                if "fecha_dato" in chunk.columns:
                    chunk["fecha_dato"] = pd.to_datetime(chunk["fecha_dato"], errors="coerce")
                    per = chunk["fecha_dato"].dt.to_period("M")
                    chunk = chunk[per.isin(need_months)]
                # append
                out_chunks.append(chunk)
    if not out_chunks:
        return pd.DataFrame()
    return pd.concat(out_chunks, axis=0, ignore_index=True)

    # claim some light dtypes to save memory
dtypes_hint = {
    "ncodpers": "int64",
    "sexo": "object",
    "age": "object",
    "renta": "object",
   
}

df_train = read_zip_csv_filtered(train_zip, train_csv_name, dtypes=dtypes_hint)
df_test  = read_zip_csv_filtered(test_zip,  test_csv_name,  dtypes=dtypes_hint)
df_all = pd.concat([df_train, df_test], ignore_index=True)

# map sexo to 0 and 1
df_all['sexo'] = df_all['sexo'].map({'H': 0, 'V': 1})

# Check the shape of the concatenated DataFrame
df_all.shape

(8261972, 48)

In [3]:
df_all["fecha_dato"] = pd.to_datetime(df_all["fecha_dato"],format="%Y-%m-%d")
df_all["fecha_alta"] = pd.to_datetime(df_all["fecha_alta"],format="%Y-%m-%d")
df_all["fecha_dato"].unique()

<DatetimeArray>
['2015-10-28 00:00:00', '2015-11-28 00:00:00', '2015-12-28 00:00:00',
 '2016-01-28 00:00:00', '2016-02-28 00:00:00', '2016-03-28 00:00:00',
 '2016-04-28 00:00:00', '2016-05-28 00:00:00', '2016-06-28 00:00:00']
Length: 9, dtype: datetime64[ns]

## data cleaning

I tried some data cleaning methods like filling na with mode/mean but the impact turned out to be very little. So I did not use much of the cleaning methods in this version.

In [4]:
# already has nomprov as province name, so drop cod_prov and tipodom
df_all.drop(["tipodom","conyuemp",'cod_prov'],axis=1,inplace=True)

In [None]:
# 1. get all product columns and convert to numeric
product_cols = [col for col in df_all.columns if col.startswith('ind_') and 'ult1' in col]
for col in product_cols:
    df_all[col] = pd.to_numeric(df_all[col], errors='coerce')
    df_all[col].fillna(0, inplace=True)
    df_all[col] = df_all[col].astype(int)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_all[col].fillna(0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_all[col].fillna(0, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a c

In [6]:
df_all.isnull().any()
# now we have cleaned all missing values

fecha_dato               False
ncodpers                 False
ind_empleado             False
pais_residencia          False
sexo                      True
age                      False
fecha_alta               False
ind_nuevo                False
antiguedad               False
indrel                   False
ult_fec_cli_1t            True
indrel_1mes               True
tiprel_1mes               True
indresi                  False
indext                   False
canal_entrada             True
indfall                  False
nomprov                   True
ind_actividad_cliente    False
renta                     True
segmento                  True
ind_ahor_fin_ult1        False
ind_aval_fin_ult1        False
ind_cco_fin_ult1         False
ind_cder_fin_ult1        False
ind_cno_fin_ult1         False
ind_ctju_fin_ult1        False
ind_ctma_fin_ult1        False
ind_ctop_fin_ult1        False
ind_ctpp_fin_ult1        False
ind_deco_fin_ult1        False
ind_deme_fin_ult1        False
ind_dela

In [7]:
df_all.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1,ind_plan_fin_ult1,ind_pres_fin_ult1,ind_reca_fin_ult1,ind_tjcr_fin_ult1,ind_valo_fin_ult1,ind_viv_fin_ult1,ind_nomina_ult1,ind_nom_pens_ult1,ind_recibo_ult1
0,2015-10-28,1217174,N,ES,1.0,22,2013-11-08,0.0,23,1.0,...,0,0,0,0,0,0,0,0,0,0
1,2015-10-28,1217176,N,ES,1.0,32,2013-11-08,0.0,23,1.0,...,0,0,0,0,0,0,0,1,1,0
2,2015-10-28,1217173,N,ES,0.0,23,2013-11-08,0.0,23,1.0,...,0,0,0,0,0,0,0,0,0,0
3,2015-10-28,1217172,N,ES,0.0,32,2013-11-08,0.0,23,1.0,...,0,0,0,1,0,0,0,1,1,1
4,2015-10-28,1217171,N,ES,1.0,25,2013-11-08,0.0,23,1.0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
def optimize_dtypes(df):
    """
    Optimize DataFrame dtypes to save memory.
    Note: This function modifies df inplace, does not create new columns.
    """
    print(f"Memory before optimization: {df.memory_usage(deep=True).sum() / 1024**3:.2f} GB")
    print(f"Number of columns before optimization: {len(df.columns)}")
    initial_memory = df.memory_usage(deep=True).sum()

    # 1. Optimize integer columns
    int_cols = df.select_dtypes(include=['int64', 'int32']).columns
    print(f"\nOptimizing {len(int_cols)} integer columns...")
    for col in int_cols:
        col_min = df[col].min()
        col_max = df[col].max()
        if col_min >= 0:  # unsigned integer
            if col_max < 255:
                df[col] = df[col].astype('uint8')
            elif col_max < 65535:
                df[col] = df[col].astype('uint16')
            else:
                df[col] = df[col].astype('uint32')
        else:  # signed integer
            if col_min > np.iinfo(np.int8).min and col_max < np.iinfo(np.int8).max:
                df[col] = df[col].astype('int8')
            elif col_min > np.iinfo(np.int16).min and col_max < np.iinfo(np.int16).max:
                df[col] = df[col].astype('int16')
            else:
                df[col] = df[col].astype('int32')

    # 2. Optimize float columns
    float_cols = df.select_dtypes(include=['float64']).columns
    print(f"Optimizing {len(float_cols)} float columns...")
    for col in float_cols:
        df[col] = df[col].astype('float32')

    # 3. Convert repeated string columns to category type
    obj_cols = df.select_dtypes(include=['object']).columns
    print(f"Checking {len(obj_cols)} object columns...")
    converted = 0
    for col in obj_cols:
        num_unique = df[col].nunique()
        num_total = len(df[col])
        if num_unique / num_total < 0.5:  # if unique values < 50%
            df[col] = df[col].astype('category')
            converted += 1
    print(f"  → Converted {converted} columns to category type")

    # 4. Check that no columns were added
    final_memory = df.memory_usage(deep=True).sum()
    print(f"\nMemory after optimization: {final_memory / 1024**3:.2f} GB")
    print(f"Number of columns after optimization: {len(df.columns)}")
    print(f"Memory saved: {(1 - final_memory / initial_memory) * 100:.1f}%")
    return df

df_all = optimize_dtypes(df_all)

Memory before optimization: 7.99 GB
Number of columns before optimization: 45

Optimizing 25 integer columns...
Optimizing 4 float columns...
Checking 14 object columns...
  → Converted 14 columns to category type

Memory after optimization: 0.71 GB
Number of columns after optimization: 45
Memory saved: 91.1%


## feature engineering

In [9]:
print(f"total dataset: {df_all.shape}")
print(f"date: {df_all['fecha_dato'].min()} to {df_all['fecha_dato'].max()}")
product_cols = [col for col in df_all.columns if col.startswith('ind_') and 'ult1' in col]
print(f"product_cols: {len(product_cols)}")

total dataset: (8261972, 45)
date: 2015-10-28 00:00:00 to 2016-06-28 00:00:00
product_cols: 24


In [10]:
# 1. 基础排序和时间特征（保持不变，这部分已经很快）
df_all = df_all.sort_values(['ncodpers','fecha_dato']).reset_index(drop=True)
df_all['month'] = df_all['fecha_dato'].dt.month.astype('int8')

# 2. 一次性获取所有产品的数据矩阵（避免循环）
products_array = df_all[product_cols].values.astype('float32')
customer_idx = df_all['ncodpers'].values
unique_customers = np.unique(customer_idx)
customer_map = {cid: i for i, cid in enumerate(unique_customers)}
mapped_idx = np.array([customer_map[cid] for cid in customer_idx])

# 3. 批量计算 EWM 特征
def batch_ewm(array, alpha):
    """批量计算多列的 EWM"""
    w = 1
    n = len(array)
    weights = np.zeros(n)
    weights[0] = w
    for i in range(1, n):
        w *= (1 - alpha)
        weights[i] = w
    weights = weights[::-1]  # 反转权重
    weights /= weights.sum()
    # 使用卷积计算移动加权平均
    from numpy.lib.stride_tricks import as_strided
    s = array.strides[0]
    strided = as_strided(array, shape=(n-n+1, n), strides=(s, s))
    return (strided * weights).sum(axis=1)

# 4. 年龄和入行时间特征（保持不变）
y = df_all['fecha_dato'].dt.year
m = df_all['fecha_dato'].dt.month
y0 = y.iloc[0]; m0 = m.iloc[0]
df_all['rel_month'] = ((y - y0) * 12 + (m - m0)).astype('int16')

df_all['fecha_alta'] = pd.to_datetime(df_all['fecha_alta'], errors='coerce')
df_all['tenure_m'] = (
    (df_all['fecha_dato'].dt.year - df_all['fecha_alta'].dt.year) * 12 +
    (df_all['fecha_dato'].dt.month - df_all['fecha_alta'].dt.month)
).clip(lower=0).fillna(0).astype('int16')

# 5. 收入和年龄特征（向量化处理）
df_all['age'] = pd.to_numeric(df_all['age'], errors='coerce')
df_all['renta'] = pd.to_numeric(df_all['renta'], errors='coerce')

age_bins = [0,25,35,45,55,65,200]
age_labels = ['<=25','26-35','36-45','46-55','56-65','66+']
df_all['age_bucket'] = pd.cut(df_all['age'], bins=age_bins, labels=age_labels)

# 6. 收入比率特征（使用 numpy 运算加速）
renta_arr = df_all['renta'].values  # 提前获取renta数组

if {'pais_residencia','provincia'}.issubset(df_all.columns):
    loc_mean = df_all.groupby(['pais_residencia','provincia'])['renta'].transform('mean')
    loc_mean_arr = loc_mean.values
    ratio = np.divide(renta_arr, loc_mean_arr, out=np.ones_like(renta_arr), where=loc_mean_arr!=0)
    df_all['renta_to_loc_mean'] = ratio.clip(0, 10).astype('float32')

if 'age_bucket' in df_all.columns:
    age_mean = df_all.groupby('age_bucket')['renta'].transform('mean')
    age_mean_arr = age_mean.values
    ratio = np.divide(renta_arr, age_mean_arr, out=np.ones_like(renta_arr)*-1, where=age_mean_arr!=0)
    df_all['renta_to_age_mean'] = ratio.clip(-1, 10).astype('float32')

if 'canal_entrada' in df_all.columns:
    channel_mean = df_all.groupby('canal_entrada')['renta'].transform('mean')
    channel_mean_arr = channel_mean.values
    ratio = np.divide(renta_arr, channel_mean_arr, out=np.ones_like(renta_arr)*-1, where=channel_mean_arr!=0)
    df_all['renta_to_channel_mean'] = ratio.clip(-1, 10).astype('float32')

# 7. 历史状态特征（批量计算）
for p in product_cols:
    # 计算前1/2/3月状态
    series = df_all[p]
    for k in range(1, 4):
        shifted = series.shift(k).fillna(0)
        df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
    
    # 计算变化和持续时长
    prev = df_all[f'{p}_prev_1']
    df_all[f'{p}_delta'] = (series - prev).astype('int8')
    
    # 使用 rolling 计算近5月统计
    roll = prev.groupby(df_all['ncodpers']).rolling(5, min_periods=1).sum()
    df_all[f'{p}_ones_5m'] = roll.reset_index(level=0, drop=True).astype('int8')
    df_all[f'{p}_zeros_5m'] = (5 - roll).reset_index(level=0, drop=True).astype('int8')

# 8. 转换分类变量（保持不变）
for col in ['segmento','canal_entrada','pais_residencia','provincia','age_bucket']:
    if col in df_all.columns:
        df_all[col] = df_all[col].astype('category')

  age_mean = df_all.groupby('age_bucket')['renta'].transform('mean')
  channel_mean = df_all.groupby('canal_entrada')['renta'].transform('mean')
  df_all[f'{p}_ones_5m'] = roll.reset_index(level=0, drop=True).astype('int8')
  df_all[f'{p}_zeros_5m'] = (5 - roll).reset_index(level=0, drop=True).astype('int8')
  df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
  df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
  df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
  df_all[f'{p}_delta'] = (series - prev).astype('int8')
  df_all[f'{p}_ones_5m'] = roll.reset_index(level=0, drop=True).astype('int8')
  df_all[f'{p}_zeros_5m'] = (5 - roll).reset_index(level=0, drop=True).astype('int8')
  df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
  df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
  df_all[f'{p}_prev_{k}'] = shifted.astype('int8')
  df_all[f'{p}_delta'] = (series - prev).astype('int8')
  df_all[f'{p}_ones_5m'] = roll.reset_index(level=0, drop=True).astype('int8')
  df_all[f'{p}_zeros_5m'] = 

In [11]:
# 计算每个月每个产品的总体先验概率
per_m = df_all['fecha_dato'].dt.to_period('M')
monthly_pop = {}

for p in product_cols:
    # 计算上月是否持有（如果没有，复用前面已经算好的）
    if f'{p}_prev' not in df_all.columns:
        prev = df_all.groupby('ncodpers')[p].shift(1).fillna(0).astype('int8')
        df_all[f'{p}_prev'] = prev
    else:
        prev = df_all[f'{p}_prev']

    # 当月"新增"标记（用于计算月度先验）
    added = ((df_all[p] == 1) & (prev == 0)).astype('int8')

    # 计算月度新增率，向后移1月作为先验
    rate_by_m = added.groupby(per_m).mean().astype('float32')
    monthly_pop[p] = rate_by_m.shift(1).fillna(rate_by_m.mean())
    df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')

# 上月24个产品的0/1组合压缩编码
prev_matrix = df_all[[f'{p}_prev' for p in product_cols]].astype('int8')
weights = np.array([1 << i for i in range(len(product_cols))], dtype=np.int64)
df_all['prev_products_code'] = (prev_matrix.values @ weights).astype('int64')
df_all['prev_products_code_bucket'] = (df_all['prev_products_code'] % 1000).astype('category')

# 可选：每个产品组合的转化率特征
# grouped = df_all.groupby('prev_products_code')
# for p in product_cols:
#     tcol = f'{p}_target'
#     if tcol in df_all.columns:
#         rate = grouped[tcol].transform('mean').fillna(0)
#         df_all[f'{p}_combo_rate'] = rate.astype('float32')

  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]).astype('float32')
  df_all[f'{p}_prev'] = prev
  df_all[f'pop_prior_{p}'] = per_m.map(monthly_pop[p]

In [12]:
df_all.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_viv_fin_ult1_prev,pop_prior_ind_viv_fin_ult1,ind_nomina_ult1_prev,pop_prior_ind_nomina_ult1,ind_nom_pens_ult1_prev,pop_prior_ind_nom_pens_ult1,ind_recibo_ult1_prev,pop_prior_ind_recibo_ult1,prev_products_code,prev_products_code_bucket
0,2015-10-28,15889,F,ES,1.0,56,1995-01-16,0.0,248,1.0,...,0,0.000386,0,0.009905,0,0.010971,0,0.021723,0,0
1,2015-11-28,15889,F,ES,1.0,56,1995-01-16,0.0,249,1.0,...,0,0.003437,0,0.05005,0,0.054801,0,0.117038,524548,548
2,2015-12-28,15889,F,ES,1.0,56,1995-01-16,0.0,250,1.0,...,0,6e-06,0,0.004922,0,0.004939,0,0.011237,524548,548
3,2016-01-28,15889,F,ES,1.0,56,1995-01-16,0.0,251,1.0,...,0,9e-06,0,0.005628,0,0.005714,0,0.010923,786692,692
4,2016-02-28,15889,F,ES,1.0,56,1995-01-16,0.0,252,1.0,...,0,2e-06,0,0.002714,0,0.003211,0,0.011081,786692,692


In [13]:
# product columns
if 'product_cols' not in globals():
    product_cols = [c for c in df_all.columns if c.startswith('ind_') and c.endswith('_ult1')]

# previous month product state
for p in product_cols:
    prev_col = f'{p}_prev'
    if prev_col not in df_all.columns:
        df_all[prev_col] = df_all.groupby('ncodpers')[p].shift(1).fillna(0).astype('int8')

# target column is whether the product is newly added this month
target_cols = []
for p in product_cols:
    tcol = f'{p}_target'
    df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
    target_cols.append(tcol)

print(f"targets built: {len(target_cols)}")

  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df

targets built: 24


  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')
  df_all[tcol] = ((df_all[p] == 1) & (df_all[f'{p}_prev'] == 0)).astype('int8')


## train test split

In [14]:
test_date = '2016-06-28'
val_date = '2016-05-28'

train_df = df_all[df_all['fecha_dato'] < val_date].copy()
val_df = df_all[df_all['fecha_dato'] == val_date].copy()
test_df = df_all[df_all['fecha_dato'] == test_date].copy()

print(f"training set: {train_df.shape} (date: {train_df['fecha_dato'].min()} to {train_df['fecha_dato'].max()})")
print(f"validation set: {val_df.shape} (date: {val_date})")
print(f"test set: {test_df.shape} (date: {test_date})")

training set: (6400904, 269) (date: 2015-10-28 00:00:00 to 2016-04-28 00:00:00)
validation set: (931453, 269) (date: 2016-05-28)
test set: (929615, 269) (date: 2016-06-28)


In [15]:
train_df.head()

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1_target,ind_plan_fin_ult1_target,ind_pres_fin_ult1_target,ind_reca_fin_ult1_target,ind_tjcr_fin_ult1_target,ind_valo_fin_ult1_target,ind_viv_fin_ult1_target,ind_nomina_ult1_target,ind_nom_pens_ult1_target,ind_recibo_ult1_target
0,2015-10-28,15889,F,ES,1.0,56,1995-01-16,0.0,248,1.0,...,0,0,0,0,0,1,0,0,0,0
1,2015-11-28,15889,F,ES,1.0,56,1995-01-16,0.0,249,1.0,...,0,0,0,0,0,0,0,0,0,0
2,2015-12-28,15889,F,ES,1.0,56,1995-01-16,0.0,250,1.0,...,0,0,0,0,1,0,0,0,0,0
3,2016-01-28,15889,F,ES,1.0,56,1995-01-16,0.0,251,1.0,...,0,0,0,0,0,0,0,0,0,0
4,2016-02-28,15889,F,ES,1.0,56,1995-01-16,0.0,252,1.0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# select feature columns (exclude target and product cols)
# ncodpers is user id, fecha_dato is date, fecha_alta is join date, ult_fec_cli_1t is last date of first product (dropped)
leak_cols = product_cols + target_cols + ['ncodpers','fecha_dato','fecha_alta','ult_fec_cli_1t']
feature_cols = [c for c in df_all.columns if c not in leak_cols]
feature_cols = [c for c in feature_cols if not c.endswith('_delta')]

print('Feature count:', len(feature_cols))

X_train = train_df[feature_cols].copy()
X_val   = val_df[feature_cols].copy()
X_test  = test_df[feature_cols].copy()

y_train = train_df[target_cols].copy()
y_val   = val_df[target_cols].copy()




Feature count: 193


In [17]:
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score

# convert feature columns to 'category' dtype for LightGBM
categorical_cols = [c for c in feature_cols
                    if (X_train[c].dtype == 'object') or (str(X_train[c].dtype) == 'category')]

for c in categorical_cols:
    X_train[c] = X_train[c].astype('category')
    X_val[c]   = X_val[c].astype('category')
    X_test[c]  = X_test[c].astype('category')

# 3) get all product names (without _target suffix)
all_products = [c.replace('_target','') for c in target_cols]

# 4) store model and validation AUC for each product
models  = {}  
metrics = {}  

# 5) loop over each product to train a separate model for 24 products
for product in all_products:
    target_col = f'{product}_target'         # e.g. 'ind_ahor_fin_ult1_target'
    y_tr = y_train[target_col]              
    y_va = y_val[target_col]                 

    # skip products that are too rare in training or validation
    pos = int(y_tr.sum())                    # positive samples in training
    if pos < 10 or y_va.sum() == 0:           # threshold can be adjusted
        # record as skipped for later inspection (optional)
        metrics[product] = {'val_auc': None, 'skipped': True}
        continue

    # 6) construct LightGBM dataset objects and declare categorical_feature
    dtrain = lgb.Dataset(
        X_train,                              # training features
        label=y_tr,                           # training labels (0/1)
        categorical_feature=categorical_cols, # tell LGB which columns are categorical
        free_raw_data=False                   # keep raw data for later use
    )
    dvalid = lgb.Dataset(
        X_val,                                # validation features
        label=y_va,                           # validation labels
        categorical_feature=categorical_cols, # same as above
        reference=dtrain,                     # share dictionary info to save memory
        free_raw_data=False
    )

    # 7) LightGBM parameters with typical values for binary classification
    params = dict(
        objective='binary',
        metric='auc',
        learning_rate=0.01,      # 从 0.03 改为 0.01
        num_leaves=127,          # 从 31 改为 127
        max_depth=8,             # 从 6 改为 8
        feature_fraction=0.7,    # 从 0.8 改为 0.7
        bagging_fraction=0.7,    # 从 0.8 改为 0.7
        bagging_freq=1,          # 从 5 改为 1
        min_child_samples=50,    # 从 20 改为 50
        reg_alpha=0.1,           # 新增
        reg_lambda=0.1,          # 新增
        is_unbalance=True,
        num_threads=-1,
        seed=2027,
        verbose=-1
)
    # 8) train the model with early stopping (stop if no improvement on validation set for 50 rounds)
    model = lgb.train(
        params=params,                    # parameters
        train_set=dtrain,                 # training data
        num_boost_round=500,              # maximum rounds (can be truncated with early stopping)
        valid_sets=[dvalid],              # validation set
        valid_names=['valid'],            # validation set name
        callbacks=[
            lgb.early_stopping(100),      # stop if no improvement on validation set for 50 rounds
            lgb.log_evaluation(50)       # print AUC every 50 rounds
        ]
    )

    # 9) save model
    models[product] = model

    # 10) make a prediction on the validation set and calculate AUC (for monitoring)
    y_pred_val = model.predict(
        X_val,                             # validation features
        num_iteration=model.best_iteration # use best iteration
    )
    val_auc = roc_auc_score(y_va, y_pred_val)  # calculate AUC
    metrics[product] = {'val_auc': float(val_auc)}  # record validation AUC for this product

# 11) (optional) view overall AUC summary
#     filter out skipped products and calculate average AUC as reference
valid_aucs = [m['val_auc'] for m in metrics.values() if m.get('val_auc') is not None]
if valid_aucs:
    print(f"Avg valid AUC across trained products: {np.mean(valid_aucs):.4f} "
          f"(trained {len(valid_aucs)}/{len(all_products)})")
else:
    print("No products were trained (likely all too rare in training/validation).")


Training until validation scores don't improve for 100 rounds
[50]	valid's auc: 0.253626
[100]	valid's auc: 0.511127
[150]	valid's auc: 0.274814
Early stopping, best iteration is:
[73]	valid's auc: 0.599855
Training until validation scores don't improve for 100 rounds
[50]	valid's auc: 0.991399
[100]	valid's auc: 0.991696
[150]	valid's auc: 0.991843
[200]	valid's auc: 0.991893
[250]	valid's auc: 0.991963
[300]	valid's auc: 0.992022
[350]	valid's auc: 0.992064
[400]	valid's auc: 0.992076
[450]	valid's auc: 0.992072
Early stopping, best iteration is:
[385]	valid's auc: 0.992084
Training until validation scores don't improve for 100 rounds
[50]	valid's auc: 0.651063
[100]	valid's auc: 0.651352
Early stopping, best iteration is:
[42]	valid's auc: 0.70499
Training until validation scores don't improve for 100 rounds
[50]	valid's auc: 0.961882
[100]	valid's auc: 0.962888
[150]	valid's auc: 0.963159
[200]	valid's auc: 0.963274
[250]	valid's auc: 0.963374
[300]	valid's auc: 0.963509
[350]	vali

## evaluation metrics: map@7

In [18]:
# referring to https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py

#average precision at k
def apk(actual, predicted, k=7):
    """
    Computes the average precision at k.
    This function computes the average prescision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=7):
    """
    Computes the mean average precision at k.
    This function computes the mean average prescision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

# evaluation

In [None]:
import numpy as np
import pandas as pd

# choose validation and test date
val_date   = '2016-04-28'
eval_date  = '2016-05-28'   

val_df_m   = df_all[df_all['fecha_dato'] == val_date].copy()
eval_df_m  = df_all[df_all['fecha_dato'] == eval_date].copy()

# only keep users that appear in both months and sort by ncodpers
common_ids = np.intersect1d(val_df_m['ncodpers'].values, eval_df_m['ncodpers'].values)
val_df_m   = val_df_m.loc[val_df_m['ncodpers'].isin(common_ids)].sort_values('ncodpers').reset_index(drop=True)
eval_df_m  = eval_df_m.loc[eval_df_m['ncodpers'].isin(common_ids)].sort_values('ncodpers').reset_index(drop=True)

eval_df_m

Unnamed: 0,fecha_dato,ncodpers,ind_empleado,pais_residencia,sexo,age,fecha_alta,ind_nuevo,antiguedad,indrel,...,ind_hip_fin_ult1_target,ind_plan_fin_ult1_target,ind_pres_fin_ult1_target,ind_reca_fin_ult1_target,ind_tjcr_fin_ult1_target,ind_valo_fin_ult1_target,ind_viv_fin_ult1_target,ind_nomina_ult1_target,ind_nom_pens_ult1_target,ind_recibo_ult1_target
0,2016-05-28,15889,F,ES,1.0,56,1995-01-16,0.0,255,1.0,...,0,0,0,0,1,0,0,0,0,0
1,2016-05-28,15890,A,ES,1.0,63,1995-01-16,0.0,256,1.0,...,0,0,0,0,0,0,0,0,0,0
2,2016-05-28,15892,F,ES,0.0,62,1995-01-16,0.0,256,1.0,...,0,0,0,0,0,0,0,0,0,0
3,2016-05-28,15893,N,ES,1.0,63,1997-10-03,0.0,256,1.0,...,0,0,0,0,0,0,0,0,0,0
4,2016-05-28,15894,A,ES,1.0,60,1995-01-16,0.0,256,1.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
926658,2016-05-28,1548202,N,ES,1.0,22,2016-04-29,1.0,1,1.0,...,0,0,0,0,0,0,0,0,0,0
926659,2016-05-28,1548203,N,ES,1.0,51,2016-04-29,1.0,1,1.0,...,0,0,0,0,0,0,0,0,0,0
926660,2016-05-28,1548204,N,ES,1.0,54,2016-04-29,1.0,1,1.0,...,0,0,0,0,0,0,0,0,0,0
926661,2016-05-28,1548206,N,ES,0.0,40,2016-04-29,1.0,1,1.0,...,0,0,0,0,0,0,0,0,0,0


In [20]:

# feature matrix: in april
X_val_m = val_df_m[feature_cols].copy()

# prediction using april features
all_products = [c.replace('_target','') for c in target_cols]
P = len(all_products)
N = len(X_val_m)

val_scores = np.zeros((N, P), dtype=np.float32)
for j, p in enumerate(all_products):
    if p in models:
        m = models[p]
        val_scores[:, j] = m.predict(X_val_m, num_iteration=getattr(m, "best_iteration", None)).astype(np.float32)
    else:
        prior_col = f'pop_prior_{p}'
        if prior_col in val_df_m.columns:
            val_scores[:, j] = val_df_m[prior_col].to_numpy(dtype=np.float32)
        else:
            val_scores[:, j] = float(y_train[f'{p}_target'].mean()) 




In [21]:


# moddel-based recommendation evaluation (MAP@7) ---
# 1. Build a mask for products already owned in April (to avoid recommending them again)
owned_mask = val_df_m[all_products].to_numpy(dtype=np.int8)  # 1 if owned in April, else 0

prod_arr = np.array(all_products)
y_eval_mat = eval_df_m[[f"{p}_target" for p in all_products]].to_numpy(dtype=np.int8)  #(926663, 24)
actual_products = [prod_arr[np.where(r == 1)[0]].tolist() for r in y_eval_mat] #list of 926663

# 4. Mask out products already owned in April by setting their score to a very low value
val_scores_masked = val_scores.copy()
val_scores_masked[owned_mask == 1] = -1e9  # ensure these will not be recommended

# 5. For each user, select the top 7 products with the highest predicted scores
topk_idx = np.argpartition(val_scores_masked, -7, axis=1)[:, -7:] 
row = np.arange(N)[:, None]
order = np.argsort(val_scores_masked[row, topk_idx], axis=1)[:, ::-1]  # sort top 7 by score descending
predicted_products = prod_arr[topk_idx[row, order]].tolist()  
# 6. Compute MAP@7: mean average precision at 7 over all users
val_map7 = mapk(actual_products, predicted_products, k=7)
print(f"Validation MAP@7 (model only): {val_map7:.4f}")


Validation MAP@7 (model only): 0.0149


In [22]:
# --- 先准备先验矩阵、屏蔽矩阵、真实新增（都基于 4→5 的评估） ---
prior_cols = [f'pop_prior_{p}' for p in all_products]
assert set(prior_cols).issubset(val_df_m.columns), "val_df_m缺少pop_prior_*列"
prior_mat = val_df_m[prior_cols].to_numpy(dtype=np.float32)      # 4月的先验

owned_mask = val_df_m[all_products].to_numpy(dtype=np.int8)      # 4月已持有（预测起点）

prod_arr = np.array(all_products)
y_eval_mat = eval_df_m[[f"{p}_target" for p in all_products]].to_numpy(dtype=np.int8)  # 5月的新增
actual_products = [prod_arr[np.where(r == 1)[0]].tolist() for r in y_eval_mat]

def topk_from_scores(scores, k=7, product_names=None):
    N, P = scores.shape
    topk_idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    row = np.arange(N)[:, None]
    order = np.argsort(scores[row, topk_idx], axis=1)[:, ::-1]
    idx_sorted = topk_idx[row, order]
    return np.array(product_names)[idx_sorted].tolist() if product_names is not None else idx_sorted

# --- α 网格搜索 ---
alphas = [0.5, 0.6, 0.7, 0.8, 0.9]
best_map, best_alpha = -1.0, None
for a in alphas:
    s = a * val_scores + (1.0 - a) * prior_mat      # 融合
    s[owned_mask == 1] = -1e9                       # 屏蔽4月已持有
    pred = topk_from_scores(s, 7, product_names=all_products)
    map7 = mapk(actual_products, pred, 7)
    print(f'alpha={a:.2f}, MAP@7={map7:.4f}')
    if map7 > best_map:
        best_map, best_alpha = map7, a

print('Best alpha:', best_alpha, 'MAP@7:', f'{best_map:.4f}')

# --- 用最优 α 生成最终的验证预测（便于后续对比/存档） ---
alpha = best_alpha
val_scores_blend = alpha * val_scores + (1.0 - alpha) * prior_mat
val_scores_blend[owned_mask == 1] = -1e9
topk_idx = np.argpartition(val_scores_blend, -7, axis=1)[:, -7:]
row = np.arange(N)[:, None]
order = np.argsort(val_scores_blend[row, topk_idx], axis=1)[:, ::-1]
predicted_products = prod_arr[topk_idx[row, order]].tolist()

val_map7 = mapk(actual_products, predicted_products, k=7)
print(f"Validation MAP@7 (best alpha): {val_map7:.4f}")


alpha=0.50, MAP@7=0.0153
alpha=0.60, MAP@7=0.0152
alpha=0.70, MAP@7=0.0151
alpha=0.80, MAP@7=0.0150
alpha=0.90, MAP@7=0.0149
Best alpha: 0.5 MAP@7: 0.0153
Validation MAP@7 (best alpha): 0.0153


# Fairness Test

## Fairness by Age

In [23]:
# import pandas as pd
# import matplotlib.pyplot as plt

# # Drop rows with missing or invalid age
# merged_df['age'] = pd.to_numeric(merged_df['age'], errors='coerce')
# age_df = merged_df.dropna(subset=['age'])
# age_df = age_df[(age_df['age'] >= 18) & (age_df['age'] <= 100)]  # optional: realistic range

# # Create age groups
# age_df['age_group'] = pd.cut(age_df['age'], bins=[18, 30, 45, 60, 75, 100],
#                              labels=['18-30', '31-45', '46-60', '61-75', '76+'])

# # Group by age group and calculate mean AP@7 and number of users
# age_summary = age_df.groupby('age_group')['apk'].agg(['mean', 'count']).reset_index()
# age_summary.columns = ['age_group', 'mean_apk', 'n_users']

# # Print summary table
# print(age_summary)

# # Plot
# plt.figure(figsize=(8, 5))
# plt.bar(age_summary['age_group'].astype(str), age_summary['mean_apk'])
# plt.xlabel('Age Group')
# plt.ylabel('Mean AP@7')
# plt.title('Recommendation Fairness by Age Group')
# plt.tight_layout()
# plt.show()


## Fairness by Gender

In [24]:
# # delete missing values
# gender_df = merged_df.dropna(subset=['sexo'])

# # Group by sex and calculate mean AP@7 and number of users
# gender_summary = gender_df.groupby('sexo')['apk'].agg(['mean', 'count']).reset_index()
# gender_summary.columns = ['sexo', 'mean_apk', 'n_users']

# print(gender_summary)

# # data visualization
# import matplotlib.pyplot as plt

# plt.bar(gender_summary['sexo'].astype(str), gender_summary['mean_apk'])
# plt.xlabel('Gender (0=Male, 1=Female)')
# plt.ylabel('Mean AP@7')
# plt.title('Recommendation Fairness by Gender')
# plt.show()


## Fairness by Income

In [25]:
# # Drop rows with missing income
# income_df = merged_df.dropna(subset=['renta'])

# # Divide users into income groups based on tertiles
# income_df['income_group'] = pd.qcut(income_df['renta'], 3, labels=['low', 'medium', 'high'])

# # Group by income and calculate mean AP@7 and number of users
# income_summary = income_df.groupby('income_group')['apk'].agg(['mean', 'count']).reset_index()
# income_summary.columns = ['income_group', 'mean_apk', 'n_users']

# print(income_summary)

# # Visualization
# plt.bar(income_summary['income_group'], income_summary['mean_apk'])
# plt.xlabel('Income Group')
# plt.ylabel('Mean AP@7')
# plt.title('Recommendation Fairness by Income')
# plt.show()


We evaluated fairness across gender and income groups based on AP@7. The average precision scores were [slightly higher / lower] for female users compared to male users, and [users in high-income groups received more accurate recommendations than those in low-income groups]. These differences may reflect behavioral patterns but also point to potential fairness concerns.



## cross-group fairness analysis




In [26]:
# # Drop rows with missing renta or age
# cross_df = merged_df.dropna(subset=['renta', 'age'])

# # Categorize income
# cross_df['income_group'] = pd.qcut(cross_df['renta'], 3, labels=['low', 'medium', 'high'])

# # Categorize age
# cross_df['age'] = cross_df['age'].astype(float)
# cross_df['age_group'] = pd.cut(cross_df['age'],
#                                bins=[0, 30, 45, 60, 75, 200],
#                                labels=['18-30', '31-45', '46-60', '61-75', '76+'])

# # Map gender
# cross_df['sexo_label'] = cross_df['sexo'].map({0: 'man', 1: 'woman'})

# # Combine all into a cross group
# cross_df['group'] = cross_df['sexo_label'] + "_" + cross_df['income_group'].astype(str) + "_" + cross_df['age_group'].astype(str)


In [27]:
# cross_summary = cross_df.groupby('group')['apk'].agg(['mean', 'count']).reset_index()
# cross_summary.columns = ['group', 'mean_apk', 'n_users']
# cross_summary = cross_summary.sort_values(by='mean_apk', ascending=False)
# print(cross_summary)
# import matplotlib.pyplot as plt
# import seaborn as sns

# plt.figure(figsize=(12, 6))
# sns.barplot(data=cross_summary, x='group', y='mean_apk')
# plt.xticks(rotation=45, ha='right')
# plt.xlabel('Gender + Income + Age Group')
# plt.ylabel('Mean AP@7')
# plt.title('Cross-Group Fairness: Mean AP@7 by Subgroup')
# plt.tight_layout()
# plt.show()


## Model Deployment - Save Trained Models

This section saves all trained models and necessary artifacts for production deployment:
- 24 LightGBM models (one per product)
- Feature columns list (for consistent inference)
- Product names list
- Configuration and hyperparameters
- Model performance metrics

These files will be used by the FastAPI service to serve predictions.


In [28]:
import os
import json

# ========================================
# STEP 1: Create Models Directory
# ========================================
# Create a directory to store all model artifacts
# This directory will contain: model files, feature lists, configs, and metrics
models_dir = 'models'
os.makedirs(models_dir, exist_ok=True)

print("=" * 80)
print("STARTING MODEL EXPORT FOR DEPLOYMENT")
print("=" * 80)

# ========================================
# STEP 2: Save All Product Models
# ========================================
# Save each of the 24 LightGBM models (one per financial product)
# Models are saved in LightGBM's native text format (.txt)
# This format is efficient and compatible with the LightGBM API
print("\n[1/5] Saving LightGBM models...")
for product, model in models.items():
    model_path = os.path.join(models_dir, f'{product}_model.txt')
    model.save_model(model_path)
    print(f"  ✓ Saved: {model_path}")

print(f"\n  → Total models saved: {len(models)} products")

# ========================================
# STEP 3: Save Feature Columns List
# ========================================
# Save the list of feature column names used during training
# This ensures the API uses the same feature order during inference
# Critical for model correctness!
print("\n[2/5] Saving feature columns list...")
feature_cols_path = os.path.join(models_dir, 'feature_cols.json')
with open(feature_cols_path, 'w') as f:
    json.dump(feature_cols, f, indent=2)
print(f"  ✓ Saved: {feature_cols_path}")
print(f"  → Feature count: {len(feature_cols)}")

# ========================================
# STEP 4: Save Product Names
# ========================================
# Save the list of all 24 product names
# Used by the API to iterate over all products for recommendations
print("\n[3/5] Saving product list...")
products_path = os.path.join(models_dir, 'products.json')
with open(products_path, 'w') as f:
    json.dump(all_products, f, indent=2)
print(f"  ✓ Saved: {products_path}")
print(f"  → Product count: {len(all_products)}")

# ========================================
# STEP 5: Save Configuration
# ========================================
# Save model configuration and optimal hyperparameters
# Includes: best alpha (for hybrid recommendation), MAP@7 score, metadata
print("\n[4/5] Saving configuration...")
config_path = os.path.join(models_dir, 'config.json')
config = {
    'best_alpha': best_alpha,          # Optimal blending weight for hybrid system
    'best_map7': float(best_map),      # Best validation MAP@7 score
    'num_products': len(all_products),  # Total number of products
    'num_features': len(feature_cols),  # Total number of features
    'recommendation_k': 7,              # Number of recommendations to return (top-7)
    'model_type': 'LightGBM',          # Model algorithm
    'training_date': '2016-04',        # Training data month
    'validation_date': '2016-05',      # Validation data month
    'description': 'Santander Product Recommendation System - Hybrid Model (LightGBM + Popularity Prior)'
}
with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)
print(f"  ✓ Saved: {config_path}")
print(f"  → Best alpha: {best_alpha}")
print(f"  → Validation MAP@7: {best_map:.4f}")

# ========================================
# STEP 6: Save Model Performance Metrics
# ========================================
# Save individual model performance metrics (AUC for each product)
# Useful for monitoring and debugging specific product predictions
print("\n[5/5] Saving model metrics...")
metrics_path = os.path.join(models_dir, 'metrics.json')
with open(metrics_path, 'w') as f:
    json.dump(metrics, f, indent=2)
print(f"  ✓ Saved: {metrics_path}")

# ========================================
# COMPLETION MESSAGE
# ========================================
print("\n" + "=" * 80)
print("✅ MODEL EXPORT COMPLETE!")
print("=" * 80)
print("\nAll artifacts saved to 'models/' directory:")
print("  - 24 model files (*.txt)")
print("  - feature_cols.json (feature list)")
print("  - products.json (product list)")
print("  - config.json (configuration)")
print("  - metrics.json (performance metrics)")

print("\n" + "=" * 80)
print("NEXT STEPS FOR DEPLOYMENT:")
print("=" * 80)
print("1. Verify models directory contains all files:")
print("   $ ls -lh models/")
print("")
print("2. Test API locally (development mode):")
print("   $ ./deploy.sh --local")
print("   or")
print("   $ uvicorn app:app --host 0.0.0.0 --port 8000 --reload")
print("")
print("3. Test API endpoints:")
print("   $ python predict_client.py")
print("")
print("4. Build Docker image (production deployment):")
print("   $ docker build -t santander-api .")
print("")
print("5. Run Docker container:")
print("   $ docker run -p 8000:8000 santander-api")
print("   or use automated script:")
print("   $ ./deploy.sh")
print("")
print("6. Access API documentation:")
print("   http://localhost:8000/docs")
print("=" * 80)


STARTING MODEL EXPORT FOR DEPLOYMENT

[1/5] Saving LightGBM models...
  ✓ Saved: models/ind_ahor_fin_ult1_model.txt
  ✓ Saved: models/ind_cco_fin_ult1_model.txt
  ✓ Saved: models/ind_cder_fin_ult1_model.txt
  ✓ Saved: models/ind_cno_fin_ult1_model.txt
  ✓ Saved: models/ind_ctju_fin_ult1_model.txt
  ✓ Saved: models/ind_ctma_fin_ult1_model.txt
  ✓ Saved: models/ind_ctop_fin_ult1_model.txt
  ✓ Saved: models/ind_ctpp_fin_ult1_model.txt
  ✓ Saved: models/ind_dela_fin_ult1_model.txt
  ✓ Saved: models/ind_ecue_fin_ult1_model.txt
  ✓ Saved: models/ind_fond_fin_ult1_model.txt
  ✓ Saved: models/ind_hip_fin_ult1_model.txt
  ✓ Saved: models/ind_plan_fin_ult1_model.txt
  ✓ Saved: models/ind_pres_fin_ult1_model.txt
  ✓ Saved: models/ind_reca_fin_ult1_model.txt
  ✓ Saved: models/ind_tjcr_fin_ult1_model.txt
  ✓ Saved: models/ind_valo_fin_ult1_model.txt
  ✓ Saved: models/ind_viv_fin_ult1_model.txt
  ✓ Saved: models/ind_nomina_ult1_model.txt
  ✓ Saved: models/ind_nom_pens_ult1_model.txt
  ✓ Saved: model