# Session-Based Recommendation with Transformers4Rec

This notebook implements a session-based recommender system using NVIDIA's [Transformers4Rec](https://github.com/NVIDIA-Merlin/Transformers4Rec) library.

We will:
1.  **Setup**: Install necessary libraries.
2.  **Preprocess**: Use NVTabular to create session sequences from our rental data.
3.  **Model**: Define a Transformer-based model (GPT-2 architecture).
4.  **Train**: Train the model to predict the next item in a session.
5.  **Evaluate**: Check performance metrics.

In [1]:
import os
import glob
import pandas as pd
import numpy as np
import torch

import nvtabular as nvt
from nvtabular.ops import (
    Categorify,
    FillMissing,
    Groupby,
    LogOp,
    Normalize,
    TagAsItemID,
    TagAsUserID,
)
from merlin.schema.tags import Tags

import transformers4rec.torch as tr

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

  warn(f"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}")
  warn(f"Triton dtype mappings did not load successfully due to an error: {exc.msg}")


Using device: cpu


## 2. Preprocessing with NVTabular

We need to transform our raw interaction data into a format suitable for sequential models.
This involves:
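1.  Loading the enriched interaction data (hits joined to visits, prepared in the feature engineering notebook).
2.  Encoding categorical features as integer IDs and normalizing continuous features.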
3.  Using NVTabular to group interactions by session (`visit_id`) and create sequences.
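
To make step 3 concrete, here is a minimal pure-Python sketch of what grouping by session produces. The toy rows and the `build_sessions` helper are hypothetical; the notebook itself uses NVTabular's `Groupby` op for this.

```python
from collections import defaultdict


def build_sessions(rows):
    """Group (visit_id, date_time, item_id) rows into time-ordered item lists."""
    sessions = defaultdict(list)
    # Sort by timestamp first so each session's list is in interaction order.
    # ISO-formatted timestamps sort correctly as plain strings.
    for visit_id, date_time, item_id in sorted(rows, key=lambda r: r[1]):
        sessions[visit_id].append(item_id)
    return dict(sessions)


# Hypothetical toy interactions: (visit_id, date_time, item_id)
rows = [
    ("v1", "2022-02-14 11:05:00", 30),
    ("v2", "2022-04-22 10:59:27", 20),
    ("v1", "2022-02-14 11:03:29", 10),
]
sessions = build_sessions(rows)
print(sessions)  # {'v1': [10, 30], 'v2': [20]}
```

Each session becomes one row whose features are lists, which is the shape a sequential model expects.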

In [90]:
# 1. Load Enriched Data
# This file is produced by the feature engineering notebook.
# This file contains: visit_id, item_id, context features, item metadata, and counter features.

print("Loading enriched interactions...")
interactions = pd.read_parquet('data/enriched_interactions.parquet')

print(f"Loaded {len(interactions)} interactions.")
print("Columns:", interactions.columns.tolist())
interactions.head()

Loading enriched interactions...
Loaded 331689 interactions.
Columns: ['visit_id', 'item_id', 'date_time', 'traffic_source', 'region_city', 'brand', 'main_category', 'price_bucket', 'hour', 'day_of_week', 'is_weekend', 'device_category', 'mobile_phone', 'item_popularity', 'category_popularity', 'conversion_rate', 'user_session_rank', 'days_since_last_session', 'is_new_user']


Unnamed: 0,visit_id,item_id,date_time,traffic_source,region_city,brand,main_category,price_bucket,hour,day_of_week,is_weekend,device_category,mobile_phone,item_popularity,category_popularity,conversion_rate,user_session_rank,days_since_last_session,is_new_user
0,7155825714110136555,495491082,2023-12-02 01:35:08,direct,Ivanteevka,Weina,Ходунки,2,1,5,1,1,,261,2302,0.0,1,-1.0,1
1,2091271487556157575,463480491,2022-04-22 10:59:27,social,Moscow,4moms,Электрокачели,6,10,4,0,2,apple,7113,21831,0.000141,1,-1.0,1
2,573835779580362831,495257463,2022-02-14 11:03:29,organic,Moscow,Chicсo,Автокресла,2,11,0,0,2,xiaomi,1133,63506,0.0,1,-1.0,1
3,573835779580362831,495520383,2022-02-14 11:03:29,organic,Moscow,Vtech,Музыкальные инструменты,0,11,0,0,2,xiaomi,158,2779,0.0,1,-1.0,1
4,573835779580362831,495257463,2022-02-14 11:03:29,organic,Moscow,Chicсo,Автокресла,2,11,0,0,2,xiaomi,1133,63506,0.0,1,-1.0,1


In [91]:
# 2. Define NVTabular Workflow
# We will create a workflow to:
# - Categorify categorical features
# - Normalize continuous features
# - Group by visit_id to create sequences

# Define Feature Columns
# Categorical Features
item_id = ['item_id'] >> Categorify(dtype="int64") >> TagAsItemID()
traffic_source = ['traffic_source'] >> Categorify(dtype="int64")
region_city = ['region_city'] >> Categorify(dtype="int64")
brand = ['brand'] >> Categorify(dtype="int64")
main_category = ['main_category'] >> Categorify(dtype="int64")
price_bucket = ['price_bucket'] >> Categorify(dtype="int64")
hour = ['hour'] >> Categorify(dtype="int64")
day_of_week = ['day_of_week'] >> Categorify(dtype="int64")
is_weekend = ['is_weekend'] >> Categorify(dtype="int64")

# Device Features
device_category = ['device_category'] >> Categorify(dtype="int64")
mobile_phone = ['mobile_phone'] >> Categorify(dtype="int64")

# User History (Categorical)
is_new_user = ['is_new_user'] >> Categorify(dtype="int64")

# Continuous Features (Counters)
# We LogOp then Normalize to handle skewed distributions typical of popularity
item_popularity = ['item_popularity'] >> LogOp() >> Normalize()
category_popularity = ['category_popularity'] >> LogOp() >> Normalize()

# Conversion Rate (already 0-1 range, just normalize)
conversion_rate = ['conversion_rate'] >> Normalize()

# User History (Continuous)
user_session_rank = ['user_session_rank'] >> LogOp() >> Normalize()
days_since_last_session = ['days_since_last_session'] >> FillMissing() >> Normalize()

session_id = ['visit_id'] >> Categorify(dtype="int64") >> TagAsUserID()
time_col = ['date_time']

# Grouping to create sequences
# We group by 'visit_id' and aggregate other columns into lists
groupby_features = (
    session_id + item_id + traffic_source + region_city + 
    brand + main_category + price_bucket + 
    hour + day_of_week + is_weekend +
    device_category + mobile_phone +  # Device
    is_new_user +  # User History
    item_popularity + category_popularity +
    conversion_rate +  # Item Rate
    user_session_rank + days_since_last_session +  # User History
    time_col
) >> Groupby(
    groupby_cols=['visit_id'],
    sort_cols=['date_time'],
    aggs={
        'item_id': 'list',
        'traffic_source': 'list',
        'region_city': 'list',
        'brand': 'list',
        'main_category': 'list',
        'price_bucket': 'list',
        'hour': 'list',
        'day_of_week': 'list',
        'is_weekend': 'list',
        'device_category': 'list',  # Device
        'mobile_phone': 'list',  # Device
        'is_new_user': 'list',  # User History
        'item_popularity': 'list',
        'category_popularity': 'list',
        'conversion_rate': 'list',  # Item Rate
        'user_session_rank': 'list',  # User History
        'days_since_last_session': 'list',  # User History
        'date_time': 'first'
    },
    name_sep="-"
)

workflow = nvt.Workflow(groupby_features)

# Create a dataset from the pandas dataframe
interactions = interactions.reset_index(drop=True)
dataset = nvt.Dataset(interactions)

# Fit and Transform
print("Fitting and transforming with NVTabular...")
workflow.fit(dataset)
workflow.transform(dataset).to_parquet("data/processed_sessions")

print("NVTabular processing complete.")



Fitting and transforming with NVTabular...




NVTabular processing complete.


## 3. Dataset Creation

We load the processed Parquet files into a Merlin Dataset, which T4Rec uses.
We also define the schema, which tells the model which features are categorical, which is the item ID, etc.

In [92]:
# Load the processed data and split it by time into train and validation sets.

import os
import glob

processed_path = "data/processed_sessions"
schema = workflow.output_schema

# Check the schema
print("Schema:", schema)

# Time-based split: for session-based recommendation the validation sessions should
# come after the training sessions, so we sort by session start time and hold out
# the last 20%. The data is small enough to do this in pandas.
# We glob for .parquet files explicitly to avoid reading metadata files (like schema.pbtxt).
parquet_files = sorted(glob.glob(os.path.join(processed_path, "*.parquet")))
df = pd.read_parquet(parquet_files)
df = df.sort_values('date_time-first')

split_index = int(len(df) * 0.8)
train_df = df.iloc[:split_index]
valid_df = df.iloc[split_index:]

train_df.to_parquet("data/train.parquet")
valid_df.to_parquet("data/valid.parquet")

print(f"Train sessions: {len(train_df)}")
print(f"Valid sessions: {len(valid_df)}")

Schema: [{'name': 'visit_id', 'tags': {<Tags.CATEGORICAL: 'categorical'>, <Tags.USER: 'user'>, <Tags.ID: 'id'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0, 'max_size': 0, 'cat_path': './/categories/unique.visit_id.parquet', 'domain': {'min': 0, 'max': 119419, 'name': 'visit_id'}, 'embedding_sizes': {'cardinality': 119420, 'dimension': 512}}, 'dtype': DType(name='uint64', element_type=<ElementType.UInt: 'uint'>, element_size=64, element_unit=None, signed=None, shape=Shape(dims=(Dimension(min=0, max=None),))), 'is_list': False, 'is_ragged': False}, {'name': 'item_id-list', 'tags': {<Tags.CATEGORICAL: 'categorical'>, <Tags.ITEM: 'item'>, <Tags.ID: 'id'>}, 'properties': {'num_buckets': None, 'freq_threshold': 0, 'max_size': 0, 'cat_path': './/categories/unique.item_id.parquet', 'domain': {'min': 0, 'max': 594, 'name': 'item_id'}, 'embedding_sizes': {'cardinality': 595, 'dimension': 57}, 'value_count': {'min': 0, 'max': None}}, 'dtype': DType(name='int64', element_type=<Elemen



Train sessions: 95533
Valid sessions: 23884
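
The chronological 80/20 split used above can be sketched on toy data as follows. The `chronological_split` helper and the toy sessions are hypothetical; the point is only that every validation session starts no earlier than every training session.

```python
def chronological_split(sessions, train_fraction=0.8):
    """Split sessions into (train, valid) by session start time."""
    ordered = sorted(sessions, key=lambda s: s["start"])
    split_index = int(len(ordered) * train_fraction)
    return ordered[:split_index], ordered[split_index:]


# Hypothetical sessions with ISO-formatted start dates
toy_sessions = [
    {"visit_id": i, "start": f"2023-01-{day:02d}"}
    for i, day in enumerate([5, 1, 4, 2, 3])
]
train, valid = chronological_split(toy_sessions)
print([s["visit_id"] for s in train], [s["visit_id"] for s in valid])
```

With the 119,417 sessions in this notebook, `int(119417 * 0.8)` gives the 95,533 training sessions reported above.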


## 4. Model Configuration

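We define the Transformer model.
We use `SequentialBlock` to combine the input embeddings with the Transformer body, then attach a prediction head with `tr.Head`:
1.  **Embeddings**: For items and side information (city, traffic source, and so on).
2.  **Transformer Body**: GPT-2, configured via `tr.GPT2Config`.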
3.  **Prediction Head**: To predict the next item.
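
As a rough illustration of what "predict the next item" means under causal masking: at each position the model sees only the earlier items and is scored against the item that follows. The `next_item_pairs` helper is hypothetical, and this sketch ignores how Transformers4Rec implements masking internally.

```python
def next_item_pairs(session):
    """Return (context, target) pairs for causal next-item prediction."""
    return [(session[:i], session[i]) for i in range(1, len(session))]


pairs = next_item_pairs([10, 30, 20])
for context, target in pairs:
    print(context, "->", target)
```

A session with a single interaction yields no pairs, since there is no later item to predict.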

In [93]:
# ============================================
# HYPERPARAMETERS - Change values here!
# ============================================
hparams = {
    # Model Architecture
    "d_model": 64,
    "n_head": 4,
    "n_layer": 1,
    "max_seq_length": 128,
    # Training
    "learning_rate": 0.0001,
    "batch_size": 512,
    "max_steps": -1,
    "dropout": 0.5,
    "summary_dropout": 0.5,
    # Data (feature counts; keep in sync with `selected_features` below)
    "num_features": 17,
    "num_categorical": 12,
    "num_continuous": 5,
    # Regularization
    "weight_decay": 0.01,
}

print("=" * 50)
print("HYPERPARAMETERS")
print("=" * 50)
for k, v in hparams.items():
    print(f"  {k}: {v}")

HYPERPARAMETERS
  d_model: 64
  n_head: 4
  n_layer: 1
  max_seq_length: 128
  learning_rate: 0.0001
  batch_size: 512
  max_steps: -1
  dropout: 0.5
  summary_dropout: 0.5
  num_features: 17
  num_categorical: 12
  num_continuous: 5
  weight_decay: 0.01


In [94]:
# Define the Schema for the model
from merlin.schema import Schema, Tags
import merlin.io

# Load schema from processed data
train_schema = merlin.io.Dataset("data/processed_sessions", engine="parquet").schema

# Select features: 17 of the 19 columns are model inputs.
# We drop visit_id (session key) and date_time (used only for ordering); item_id is kept.
selected_features = [
    'item_id-list', 
    # Session Context (2)
    'traffic_source-list', 'region_city-list',
    # Item Metadata (3)
    'brand-list', 'main_category-list', 'price_bucket-list',
    # Temporal (3)
    'hour-list', 'day_of_week-list', 'is_weekend-list',
    # Device (2)
    'device_category-list', 'mobile_phone-list',
    # Continuous Counters (2)
    'item_popularity-list', 'category_popularity-list',
    # Item Rate (1)
    'conversion_rate-list',
    # User History (3)
    'user_session_rank-list', 'days_since_last_session-list', 'is_new_user-list'
]

input_schema = train_schema.select_by_name(selected_features)

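# WORKAROUND: the saved schema has value_count.max unset (None) for list columns,
# so we set it to the maximum sequence length the model expects.
new_cols = []
for col in input_schema:
    props = dict(col.properties)
    # Copy the nested dict so the original schema is not mutated.
    value_count = dict(props.get('value_count', {}))
    value_count['max'] = hparams["max_seq_length"]
    props['value_count'] = value_count
    new_cols.append(col.with_properties(props))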

input_schema = Schema(new_cols)
print("Input Schema (Fixed):", input_schema)
print(f"\nTotal input features: {len(selected_features)}")

# Define the Input Block (uses hparams)
input_module = tr.TabularSequenceFeatures.from_schema(
    input_schema,
    max_sequence_length=hparams["max_seq_length"],
    aggregation="concat",
    d_output=hparams["d_model"],
    masking="causal",
)

# Define the Transformer Body - GPT-2 (uses hparams)
transformer_config = tr.GPT2Config.build(
    d_model=hparams["d_model"],
    n_head=hparams["n_head"],
    n_layer=hparams["n_layer"],
    total_seq_length=1024,
    dropout=hparams["dropout"],
    summary_dropout=hparams["summary_dropout"],
)

# Define body
body = tr.SequentialBlock(
    input_module,
    tr.TransformerBlock(transformer_config, masking=input_module.masking)
)

# Define ranking metrics
from transformers4rec.torch.ranking_metric import NDCGAt, RecallAt
metrics = [
    RecallAt(top_ks=[6, 10], labels_onehot=True),
    NDCGAt(top_ks=[6, 10], labels_onehot=True)
]

# Define the Head
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True, metrics=metrics),
    inputs=input_module,
)

# Get the end-to-end Model
model = tr.Model(head)

print(f"Model Switched: GPT-2 Architecture | Inputs: {len(selected_features)} features")
print(model)



Input Schema (Fixed): [{'name': 'item_id-list', 'tags': {<Tags.CATEGORICAL: 'categorical'>, <Tags.ITEM: 'item'>, <Tags.ID: 'id'>}, 'properties': {'freq_threshold': 0.0, 'num_buckets': None, 'cat_path': './/categories/unique.item_id.parquet', 'max_size': 0.0, 'embedding_sizes': {'dimension': 57.0, 'cardinality': 595.0}, 'domain': {'min': 0, 'max': 594, 'name': 'item_id'}, 'value_count': {'min': 0, 'max': 128}}, 'dtype': DType(name='int64', element_type=<ElementType.Int: 'int'>, element_size=64, element_unit=None, signed=True, shape=Shape(dims=(Dimension(min=0, max=None), Dimension(min=0, max=128)))), 'is_list': True, 'is_ragged': True}, {'name': 'traffic_source-list', 'tags': {<Tags.CATEGORICAL: 'categorical'>}, 'properties': {'freq_threshold': 0.0, 'num_buckets': None, 'cat_path': './/categories/unique.traffic_source.parquet', 'max_size': 0.0, 'embedding_sizes': {'dimension': 16.0, 'cardinality': 14.0}, 'domain': {'min': 0, 'max': 13, 'name': 'traffic_source'}, 'value_count': {'min': 0

## 5. Training

We use the `Trainer` class (based on HuggingFace Trainer) to train the model.

In [95]:
from transformers4rec.torch.trainer import Trainer
from transformers4rec.torch.utils.data_utils import MerlinDataLoader
from torch.utils.tensorboard import SummaryWriter

# Training Arguments with TensorBoard (uses hparams from above)
training_args = tr.T4RecTrainingArguments(
    output_dir="./t4r_output",
    max_steps=hparams["max_steps"],
    num_train_epochs=10,
    learning_rate=hparams["learning_rate"],
    per_device_train_batch_size=hparams["batch_size"],
    per_device_eval_batch_size=hparams["batch_size"],
    logging_steps=100,
    eval_steps=100,
    save_steps=100,
    evaluation_strategy="steps",
    # TensorBoard
    report_to=["tensorboard"],
    logging_dir="./tb_logs",
    # Best model tracking workaround: use eval_loss for checkpointing
    load_best_model_at_end=True,
    metric_for_best_model="eval_/loss",
    greater_is_better=False,
    # Other
    dataloader_drop_last=False,
    compute_metrics_each_n_steps=1,
    use_mps_device=False,
    no_cuda=True,
    weight_decay=hparams["weight_decay"],
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    schema=train_schema,
    train_dataset_or_path="data/train.parquet",
    eval_dataset_or_path="data/valid.parquet",
)

# Log hyperparameters to TensorBoard
tb_writer = SummaryWriter(log_dir="./tb_logs")
tb_writer.add_hparams(hparams, {"placeholder": 0})
tb_writer.close()

# Train
print("Starting training...")
print("TensorBoard: Run 'tensorboard --logdir ./tb_logs' to monitor")
trainer.train()

Starting training...
TensorBoard: Run 'tensorboard --logdir ./tb_logs' to monitor




{'loss': 6.1454, 'learning_rate': 9.46524064171123e-05, 'epoch': 0.53}
{'eval_/loss': 5.9507317543029785, 'eval_runtime': 2.705, 'eval_samples_per_second': 8896.173, 'eval_steps_per_second': 17.375, 'epoch': 0.53}
{'loss': 5.655, 'learning_rate': 8.930481283422461e-05, 'epoch': 1.07}
{'eval_/loss': 5.671504974365234, 'eval_runtime': 2.6341, 'eval_samples_per_second': 9135.572, 'eval_steps_per_second': 17.843, 'epoch': 1.07}
{'loss': 5.3488, 'learning_rate': 8.39572192513369e-05, 'epoch': 1.6}
{'eval_/loss': 5.475963115692139, 'eval_runtime': 2.7012, 'eval_samples_per_second': 8908.488, 'eval_steps_per_second': 17.399, 'epoch': 1.6}
{'loss': 5.1086, 'learning_rate': 7.86096256684492e-05, 'epoch': 2.14}
{'eval_/loss': 5.314854621887207, 'eval_runtime': 2.7689, 'eval_samples_per_second': 8690.85, 'eval_steps_per_second': 16.974, 'epoch': 2.14}
{'loss': 4.9481, 'learning_rate': 7.326203208556151e-05, 'epoch': 2.67}
{'eval_/loss': 5.192490577697754, 'eval_runtime': 2.6821, 'eval_samples_per_second': 8971.964, 'eval_steps_per_second': 17.523, 'epoch': 2.67}
{'loss': 4.8125, 'learning_rate': 6.79144385026738e-05, 'epoch': 3.21}
{'eval_/loss': 5.090116024017334, 'eval_runtime': 2.7169, 'eval_samples_per_second': 8857.287, 'eval_steps_per_second': 17.299, 'epoch': 3.21}
{'loss': 4.6954, 'learning_rate': 6.25668449197861e-05, 'epoch': 3.74}
{'eval_/loss': 5.0207648277282715, 'eval_runtime': 2.7308, 'eval_samples_per_second': 8811.957, 'eval_steps_per_second': 17.211, 'epoch': 3.74}
{'loss': 4.6114, 'learning_rate': 5.721925133689839e-05, 'epoch': 4.28}
{'eval_/loss': 4.965872764587402, 'eval_runtime': 2.7459, 'eval_samples_per_second': 8763.598, 'eval_steps_per_second': 17.116, 'epoch': 4.28}
{'loss': 4.5323, 'learning_rate': 5.1871657754010694e-05, 'epoch': 4.81}
{'eval_/loss': 4.907406806945801, 'eval_runtime': 2.6877, 'eval_samples_per_second': 8953.256, 'eval_steps_per_second': 17.487, 'epoch': 4.81}
{'loss': 4.4747, 'learning_rate': 4.6524064171123e-05, 'epoch': 5.35}
{'eval_/loss': 4.858163833618164, 'eval_runtime': 2.8206, 'eval_samples_per_second': 8531.449, 'eval_steps_per_second': 16.663, 'epoch': 5.35}
{'loss': 4.4265, 'learning_rate': 4.11764705882353e-05, 'epoch': 5.88}
{'eval_/loss': 4.835385799407959, 'eval_runtime': 2.7655, 'eval_samples_per_second': 8701.651, 'eval_steps_per_second': 16.995, 'epoch': 5.88}
{'loss': 4.3826, 'learning_rate': 3.582887700534759e-05, 'epoch': 6.42}
{'eval_/loss': 4.808504104614258, 'eval_runtime': 2.8031, 'eval_samples_per_second': 8584.872, 'eval_steps_per_second': 16.767, 'epoch': 6.42}
{'loss': 4.3497, 'learning_rate': 3.0481283422459894e-05, 'epoch': 6.95}
{'eval_/loss': 4.768310070037842, 'eval_runtime': 2.779, 'eval_samples_per_second': 8659.136, 'eval_steps_per_second': 16.912, 'epoch': 6.95}
{'loss': 4.3044, 'learning_rate': 2.5133689839572196e-05, 'epoch': 7.49}
{'eval_/loss': 4.738343715667725, 'eval_runtime': 2.6854, 'eval_samples_per_second': 8960.932, 'eval_steps_per_second': 17.502, 'epoch': 7.49}
{'loss': 4.3018, 'learning_rate': 1.9786096256684494e-05, 'epoch': 8.02}
{'eval_/loss': 4.733992099761963, 'eval_runtime': 2.8636, 'eval_samples_per_second': 8403.335, 'eval_steps_per_second': 16.413, 'epoch': 8.02}
{'loss': 4.2748, 'learning_rate': 1.4438502673796791e-05, 'epoch': 8.56}
{'eval_/loss': 4.714770317077637, 'eval_runtime': 2.9005, 'eval_samples_per_second': 8296.526, 'eval_steps_per_second': 16.204, 'epoch': 8.56}
{'loss': 4.2651, 'learning_rate': 9.090909090909091e-06, 'epoch': 9.09}
{'eval_/loss': 4.711856842041016, 'eval_runtime': 2.7196, 'eval_samples_per_second': 8848.373, 'eval_steps_per_second': 17.282, 'epoch': 9.09}
{'loss': 4.2473, 'learning_rate': 3.7433155080213903e-06, 'epoch': 9.63}
{'eval_/loss': 4.702125549316406, 'eval_runtime': 2.9268, 'eval_samples_per_second': 8221.901, 'eval_steps_per_second': 16.058, 'epoch': 9.63}
{'train_runtime': 319.8618, 'train_samples_per_second': 2993.293, 'train_steps_per_second': 5.846, 'train_loss': 4.698595359108665, 'epoch': 10.0}


TrainOutput(global_step=1870, training_loss=4.698595359108665, metrics={'train_runtime': 319.8618, 'train_samples_per_second': 2993.293, 'train_steps_per_second': 5.846, 'total_flos': 0.0, 'train_loss': 4.698595359108665})
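
The step counts and learning rates in the log above can be reproduced with a little arithmetic. This sketch assumes a linear decay to zero with no warmup, which is an assumption about the scheduler that the logged values are consistent with; `linear_lr` is a hypothetical helper. Partial batches count as a step because `dataloader_drop_last=False`.

```python
import math

train_sessions = 95533   # from the split output
valid_sessions = 23884
batch_size = 512
epochs = 10
base_lr = 1e-4

steps_per_epoch = math.ceil(train_sessions / batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(valid_sessions / batch_size)


def linear_lr(step):
    """Learning rate under linear decay to zero with no warmup (assumed schedule)."""
    return base_lr * (total_steps - step) / total_steps


print(steps_per_epoch, total_steps, eval_batches)  # 187 1870 47
print(linear_lr(100))                              # roughly 9.465e-05, as logged at step 100
print(round(100 / steps_per_epoch, 2))             # 0.53, the epoch value at step 100
```

With `logging_steps=100` and `eval_steps=100`, this gives 18 log and evaluation points over the 1870 steps, which matches the log.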

## 6. Evaluation

We evaluate the model using ranking metrics.
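
For reference, here is a simplified single-target version of the two metric families used here. These helpers are hypothetical illustrations; the `RecallAt` and `NDCGAt` classes in Transformers4Rec operate on batched tensors and may differ in detail.

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the target is among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG with one relevant item: 1 / log2(rank + 1) if it is in the top k, else 0."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based rank of the true next item
    return 1.0 / math.log2(rank + 1)


ranked = [7, 3, 9, 1]  # hypothetical model ranking, best first
print(recall_at_k(ranked, target=9, k=2))  # 0.0
print(recall_at_k(ranked, target=9, k=3))  # 1.0
print(ndcg_at_k(ranked, target=9, k=3))    # 0.5
```

Recall@k only asks whether the true next item appears in the top k, while NDCG@k also rewards placing it closer to the top. The reported values are averages of such per-session scores.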

In [80]:
# Evaluate
# Enable metric computation (it defaults to None in Trainer)
trainer.compute_metrics = True

# Note: The trainer already has the eval_dataset_or_path from initialization
eval_metrics = trainer.evaluate()

print("\nAll Evaluation Metrics:")
for key, value in eval_metrics.items():
    print(f"{key}: {value}")

print("\nFiltered Results:")
for key, value in eval_metrics.items():
    if "recall" in key or "ndcg" in key:
        print(f"{key}: {value:.4f}")





All Evaluation Metrics:
eval_/next-item/recall_at_6: 0.3113028407096863
eval_/next-item/recall_at_10: 0.3574920892715454
eval_/next-item/ndcg_at_6: 0.23176944255828857
eval_/next-item/ndcg_at_10: 0.24618665874004364
eval_/loss: 6.106310844421387
eval_runtime: 3.5025
eval_samples_per_second: 8039.977
eval_steps_per_second: 15.703

Filtered Results:
eval_/next-item/recall_at_6: 0.3113
eval_/next-item/recall_at_10: 0.3575
eval_/next-item/ndcg_at_6: 0.2318
eval_/next-item/ndcg_at_10: 0.2462


In [96]:
# 1. Load the enriched test data
print("Loading enriched test data...")
raw_test_dataset = nvt.Dataset("data/enriched_interactions_test.parquet")

# 2. Transform using the TRAINING workflow
# CRITICAL: We call 'workflow.transform' only, never 'workflow.fit', on test data.
# This reuses the category mappings learned on the training data, so each value
# (for example, a given brand) gets the same integer ID as in training.
print("Mapping Test Data to Integer IDs...")
processed_test = workflow.transform(raw_test_dataset)

# 3. Save to Processed Parquet
processed_test.to_parquet("data/processed_test")
print("Test Data Aligned and Saved to 'data/processed_test'")

# 4. The processed test set is loaded and scored in the next cell.

Loading enriched test data...
Mapping Test Data to Integer IDs...




Test Data Aligned and Saved to 'data/processed_test'
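
To see why the test data must be transformed with the already-fitted workflow, consider this toy encoder. `ToyCategorify` is a hypothetical stand-in and not NVTabular's API; in particular, reserving only index 0 for unseen values is a simplification, and the reserved indices in NVTabular depend on the version.

```python
class ToyCategorify:
    """Hypothetical stand-in for a fitted categorical encoder."""

    def __init__(self):
        self.mapping = {}

    def fit(self, values):
        # Reserve 0 for unseen values; known categories start at 1.
        for value in sorted(set(values)):
            self.mapping[value] = len(self.mapping) + 1
        return self

    def transform(self, values):
        # Look up the fitted mapping; anything unseen falls back to 0.
        return [self.mapping.get(value, 0) for value in values]


encoder = ToyCategorify().fit(["Vtech", "4moms", "Weina"])
train_ids = encoder.transform(["Weina", "4moms"])
test_ids = encoder.transform(["Weina", "UnseenBrand"])
print(train_ids, test_ids)  # [3, 1] [3, 0]

# Refitting on the test data would silently change the mapping.
refit_ids = ToyCategorify().fit(["Weina", "UnseenBrand"]).transform(["Weina"])
print(refit_ids)  # [2]
```

The same brand gets ID 3 under the training mapping but ID 2 after refitting, so the model's learned embeddings would be looked up under the wrong index.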




## 7. Generate Predictions

In [97]:
import torch
import numpy as np
import pandas as pd
import merlin.io
import os
import glob

# 1. Load the Processed Test Data
print("Loading aligned test data for inference...")
test_path = "data/processed_test"
test_dataset = merlin.io.Dataset(test_path, engine="parquet")

# 2. Generate Predictions
print("Running model inference (this may take a minute)...")
# Assuming 'trainer' is already defined in previous cells
output = trainer.predict(test_dataset) 

# === FIX 1: Handle Tuple Output & Dimensions ===
# Predictions may come back as a tuple (common in HF/T4Rec); take the first element.
preds = output.predictions[0] if isinstance(output.predictions, tuple) else output.predictions
preds = np.asarray(preds)

# Handle shapes: [Batch, Seq, Vocab] -> [Batch, Vocab]
# We take the prediction corresponding to the LAST item in the session
if preds.ndim == 3:
    preds = preds[:, -1, :]

print(f"Prediction shape: {preds.shape}")

# 3. Get up to 100 candidates per session (a buffer for filtering out invalid IDs)
K_CANDIDATES = 100
if np.issubdtype(preds.dtype, np.integer):
    # Assumption: an integer array means the Trainer already applied top-k
    # (the `predict_top_k` training argument) and returned item IDs ranked by score.
    # Calling topk on these again would rank by ID value rather than by score.
    top_indices = preds[:, :K_CANDIDATES]
else:
    # Float array: raw scores over the item vocabulary, so rank them here.
    k = min(K_CANDIDATES, preds.shape[1])
    scores = torch.from_numpy(preds).float()
    top_scores, top_indices = torch.topk(scores, k=k, dim=1)
    top_indices = top_indices.numpy()

# 4. Load the "Decoder Ring" (Integer -> String ID)
print("Loading Item ID mapping...")
categories_path = "categories/unique.item_id.parquet"

if os.path.exists(categories_path):
    categories_df = pd.read_parquet(categories_path)
    # Create array where index = integer ID, value = string Product ID
    item_map = categories_df['item_id'].values
    print(f"Loaded {len(item_map)} item categories.")
else:
    raise FileNotFoundError(f"Could not find categories file at {categories_path}")

# 5. Map Integers to Product IDs and Filter
print("Mapping IDs and formatting submission...")

def is_valid_id(s_id):
    # Filter out Padding (None/0), Unknown (<NA>), or empty strings
    return s_id and str(s_id) not in ['0', '1', '<NA>', 'None', 'nan']

final_predictions = []

for row_idx, indices in enumerate(top_indices):
    valid_items = []
    
    # We now have 100 candidates to look through, so we are much more likely
    # to find 6 valid ones before running out.
    for token_id in indices:
        # Safety bound check
        if token_id >= len(item_map):
            continue
            
        real_id = item_map[token_id]
        
        if is_valid_id(real_id):
            s_real_id = str(real_id)
            if s_real_id not in valid_items:
                valid_items.append(s_real_id)
        
        # Stop once we have 6 valid items
        if len(valid_items) == 6:
            break
    
    # FILLER STRATEGY: If < 6 items found even after checking top 100
    while len(valid_items) < 6:
        if len(valid_items) > 0:
            valid_items.append(valid_items[0]) # Duplicate best item
        else:
            valid_items.append("0") # Emergency fallback
            
    final_predictions.append(" ".join(valid_items))

# 6. Create Submission DataFrame
print("Reading visit IDs from test file...")

# === FIX 2: Use Glob to avoid schema.pbtxt crash ===
# Find only actual .parquet files; sort them so the row order is deterministic
parquet_files = sorted(glob.glob(os.path.join(test_path, "*.parquet")))

# Read the specific files to get the visit_ids
test_id_df = pd.read_parquet(parquet_files, columns=['visit_id'])

# Safety Check for length mismatch
min_len = min(len(test_id_df), len(final_predictions))
if len(test_id_df) != len(final_predictions):
    print(f"WARNING: Length mismatch! IDs: {len(test_id_df)}, Preds: {len(final_predictions)}")
    print("Truncating to match the shorter length.")
    test_id_df = test_id_df.iloc[:min_len]
    final_predictions = final_predictions[:min_len]

# Create the dataframe
submission = pd.DataFrame({
    'visit_id': test_id_df['visit_id'],
    'product_ids': final_predictions
})

# 7. Save Intermediate Result
submission.to_csv("submission.csv", index=False)
print("\nSUCCESS! Saved intermediate 'submission.csv'")
print(submission.head())

Loading aligned test data for inference...
Running model inference (this may take a minute)...





Prediction shape: (2062, 100)
Loading Item ID mapping...
Loaded 592 item categories.
Mapping IDs and formatting submission...
Reading visit IDs from test file...

SUCCESS! Saved intermediate 'submission.csv'
              visit_id                                        product_ids
0  3705189088312688779  495255340 495399681 463480423 495370340 495401...
1  3705579516103753793  495402312 495400605 495402322 495254783 495254...
2  3706260855703732415  495403143 495403163 495400474 495256989 463480...
3  3706380385759789156  463480699 495401895 495277293 495370340 495493...
4  3706985600653983919  495403163 463480699 463480417 495255645 495370...


In [98]:
import pandas as pd
import numpy as np

# ==========================================
# 1. LOAD DATA & MAPPER
# ==========================================
print("Loading Product Catalog for ID Mapping...")
products_df = pd.read_csv('data/new_site_products.csv')

# === FIX 3: Map Slugs to IDs ===
# Ensure IDs are strings
products_df['id'] = products_df['id'].astype(str)
# Create a dictionary to map 'slug' -> 'numerical_id'
slug_to_id_map = dict(zip(products_df['slug'], products_df['id']))

# Fallback for sessions with no usable prediction ("cold start").
# Note: this takes the first 6 IDs in the catalog as a simple baseline;
# it is not ranked by popularity.
top_6_fallback = " ".join(products_df['id'].head(6).tolist())
print(f"Fallback prediction for empty sessions: {top_6_fallback}")

# ==========================================
# 2. LOAD YOUR CURRENT PREDICTIONS
# ==========================================
current_submission = pd.read_csv("submission.csv")
print(f"Current Predictions: {len(current_submission)} rows")

# ==========================================
# 3. CONVERT SLUGS TO NUMERICAL IDs
# ==========================================
def convert_slugs_to_ids(slug_string):
    slugs = slug_string.split(" ")
    ids = []
    for slug in slugs:
        # Map slug to ID
        if slug in slug_to_id_map:
            ids.append(slug_to_id_map[slug])
        else:
            # If mapping fails, check if it's already a number
            if str(slug).isdigit():
                 ids.append(str(slug))
            
    # If we have < 6 items after mapping, top up with fallback IDs not already present.
    # A single pass suffices and cannot loop forever if the fallback list is short.
    for backup_id in top_6_fallback.split(" "):
        if len(ids) >= 6:
            break
        if backup_id not in ids:
            ids.append(backup_id)
            
    return " ".join(ids[:6])

print("Converting Slugs to Numerical IDs...")
current_submission['product_ids'] = current_submission['product_ids'].apply(convert_slugs_to_ids)

# ==========================================
# 4. FIX MISSING ROWS (Align with Kaggle Test List)
# ==========================================
print("Aligning with full Test Set...")

# Load the RAW test visits to get the complete list of required visit_ids
full_test_visits = pd.read_csv('data/metrika_visits_test.csv')
required_ids = full_test_visits[['visit_id']].drop_duplicates()

print(f"Required Rows: {len(required_ids)}")

# Merge: Left join keeps all required IDs, adds NaN where we have no prediction
final_df = required_ids.merge(current_submission, on='visit_id', how='left')

# Fill NaN (sessions that were dropped during preprocessing) with Top 6
missing_count = final_df['product_ids'].isna().sum()
print(f"Sessions with no prediction (filled with popular items): {missing_count}")

final_df['product_ids'] = final_df['product_ids'].fillna(top_6_fallback)

# ==========================================
# 5. FINAL SAVE
# ==========================================
# Ensure output format is exact
final_df['visit_id'] = final_df['visit_id'].astype(str)

print(f"Final Submission Rows: {len(final_df)}")
final_df.to_csv("submission_final.csv", index=False)
print("\nSUCCESS! Saved 'submission_final.csv'")
print(final_df.head())

Loading Product Catalog for ID Mapping...
Fallback prediction for empty sessions: 463480210 463480211 463480212 463480213 463480214 463480215
Current Predictions: 2062 rows
Converting Slugs to Numerical IDs...
Aligning with full Test Set...
Required Rows: 3891
Sessions with no prediction (filled with popular items): 1829
Final Submission Rows: 3891

SUCCESS! Saved 'submission_final.csv'
              visit_id                                        product_ids
0  3705073560174199024  463480210 463480211 463480212 463480213 463480...
1  3705189088312688779  495255340 495399681 463480423 495370340 495401...
2  3705549051029618879  463480210 463480211 463480212 463480213 463480...
3  3705579516103753793  495402312 495400605 495402322 495254783 495254...
4  3705717843210797336  463480210 463480211 463480212 463480213 463480...
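
The fallback above uses the first six catalog IDs, although the printed message describes them as popular items. A popularity-based fallback could be sketched as follows. The `top_k_popular` helper and the toy interaction list are hypothetical; in this notebook the counts would presumably come from the `item_id` column of the training interactions.

```python
from collections import Counter


def top_k_popular(item_ids, k=6):
    """Return the k most frequent item IDs as strings, most popular first."""
    counts = Counter(item_ids)
    return [str(item) for item, _ in counts.most_common(k)]


# Hypothetical interaction log of item IDs
toy_item_ids = [5, 3, 5, 7, 3, 5, 9]
fallback = " ".join(top_k_popular(toy_item_ids, k=2))
print(fallback)  # 5 3
```

Since 1829 of the 3891 required sessions fall back to this list, the choice of fallback items is likely to have a visible effect on the final score.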
