# Training the Sequential Transformer (Retrieval & Ranking)

This notebook builds a sequential recommender system on the MovieLens-20M dataset. It covers the full training pipeline:

1.  **Data Processing:** Converting raw MovieLens data into user sequence arrays.
2.  **Retriever Training:** Training a SASRec-style Transformer with RoPE, SwiGLU, and InfoNCE loss.
3.  **Ranking:** Mining hard negatives and training a secondary MLP Ranker.

### Protocol
*   **Data Split:** Leave-One-Out (LOO).
*   **Positive Samples:** All interactions treated as implicit feedback.
*   **Metrics:** HR@10 and NDCG@10, computed by ranking the full item catalog (no negative sampling) and excluding items the user has already interacted with.
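
As a rough illustration of this protocol, the sketch below computes HR@10 and NDCG@10 for a single user by ranking the held-out item against every item the user has not seen. The helper names `rank_of_target` and `hr_ndcg_at_k` are hypothetical and not part of nanoRecSys; the package's own evaluator may differ in details such as tie handling.

```python
import math


def rank_of_target(scores, target, seen):
    """Return the 0-based rank of `target` among all items not in `seen`.

    `scores[i]` is the model score for item i; higher is better.
    """
    target_score = scores[target]
    # Count unseen items scoring strictly higher than the target.
    # Ties are resolved in the target's favor (an assumption of this sketch).
    return sum(
        1
        for item, s in enumerate(scores)
        if item != target and item not in seen and s > target_score
    )


def hr_ndcg_at_k(rank, k=10):
    """HR@k and NDCG@k for one held-out item at 0-based `rank`."""
    if rank >= k:
        return 0.0, 0.0
    # With a single relevant item the ideal DCG is 1,
    # so NDCG reduces to 1 / log2(rank + 2).
    return 1.0, 1.0 / math.log2(rank + 2)


# Toy example: 6 items; the user already interacted with items 0 and 1.
scores = [0.9, 0.8, 0.3, 0.7, 0.5, 0.1]
rank = rank_of_target(scores, target=4, seen={0, 1})
hr, ndcg = hr_ndcg_at_k(rank, k=10)
print(rank, hr, round(ndcg, 4))
```

Averaging `hr` and `ndcg` over all test users gives the dataset-level metrics.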

### Setup (Colab)
If you are running in Google Colab, run the following cell to install the package and its dependencies.
If you are running locally, make sure the package is installed via `pip install -e .`

In [None]:
!git clone https://github.com/zheliu17/nanoRecSys.git
%pip install -q -e ./nanoRecSys

import psutil  # noqa: F401

# psutil is imported only to read its version; force-reinstalling that same
# version is a trick to make Colab prompt for a runtime restart.
%pip install --force-reinstall psutil=={psutil.__version__}
print("Installation complete. Please restart runtime...")

### 1. Data Preparation
We use the MovieLens-20M dataset and prepare it as follows:
- Sort each user's interactions chronologically.
- Split the data: the last item of each user sequence is reserved for the `test` set, and the second-to-last item for the `val` set.
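
The split described above can be sketched in a few lines of plain Python. This is only an illustration, not the code behind `create_user_time_split`: the function name `leave_one_out_split` and the handling of users with fewer than three interactions are assumptions.

```python
from collections import defaultdict


def leave_one_out_split(interactions):
    """Split (user, item, timestamp) tuples into per-user train/val/test."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, val, test = {}, {}, {}
    for user, events in by_user.items():
        events.sort()  # chronological order (ties broken by item id)
        items = [item for _, item in events]
        if len(items) < 3:
            # Assumption: users with too few events stay in train only.
            train[user] = items
            continue
        train[user] = items[:-2]  # everything before the last two items
        val[user] = items[-2]     # second-to-last item
        test[user] = items[-1]    # last item
    return train, val, test


interactions = [
    (1, 10, 100), (1, 11, 300), (1, 12, 200), (1, 13, 400),
    (2, 20, 50), (2, 21, 60),
]
train, val, test = leave_one_out_split(interactions)
print(train, val, test)
```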


In [None]:
import wandb
import nanoRecSys.data.build_dataset
import nanoRecSys.data.splits

wandb.login()

nanoRecSys.data.build_dataset.process_data()
nanoRecSys.data.splits.create_user_time_split()
nanoRecSys.data.build_dataset.prebuild_sequential_files()

### 2. Training the Retriever (Sequential Transformer)

We train a **Transformer-based tower** using **InfoNCE loss**.

Training for 300 epochs takes ~10 hours on an A100 GPU; 100 epochs is sufficient for good performance.

> **Note:** You can also download the pre-trained model from [Hugging Face](https://huggingface.co/zheliu97/nanoRecSys) and place it in the `artifacts` directory to skip training.
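
For readers unfamiliar with InfoNCE, here is a minimal pure-Python sketch of the loss with in-batch negatives. It is not the nanoRecSys implementation: the use of in-batch negatives, the dot-product similarity, and the `temperature` value are assumptions made for illustration. In PyTorch the same quantity is usually computed as a cross-entropy over the user-item logits matrix.

```python
import math


def info_nce_loss(user_vecs, item_vecs, temperature=0.1):
    """Mean InfoNCE loss where item_vecs[i] is the positive for user_vecs[i].

    Every other item in the batch acts as a negative for that user.
    """
    losses = []
    for i, u in enumerate(user_vecs):
        # Dot-product similarity to each item in the batch, scaled by temperature.
        logits = [
            sum(a * b for a, b in zip(u, v)) / temperature for v in item_vecs
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive item.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


users = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(users, [[1.0, 0.0], [0.0, 1.0]])  # positives match
swapped = info_nce_loss(users, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched
print(aligned, swapped)
```

The loss is close to zero when each user embedding points at its own positive item and grows when it points at another item in the batch.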


In [None]:
import nanoRecSys.train


class Args:
    mode = "retriever"
    user_tower_type = "transformer"

    epochs = 300
    batch_size = 128
    lr = 1e-3
    num_workers = 4
    warmup_steps = 2000
    ckpt_path = "last"


nanoRecSys.train.main(Args)

### 3. Generate Embeddings for Indexing

Once the model is trained, we generate static embeddings for all items. For users, we pre-compute embeddings for the static validation and test sets.
Make sure `item_tower.pth` and `user_tower.pth` are in the `artifacts` directory (`nanoRecSys.config.settings.artifacts_dir`).
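
To show how such pre-computed embeddings are typically used, the sketch below does brute-force top-k retrieval by dot product while skipping items the user has already seen. The function `retrieve_top_k` is hypothetical; the package may well use a dedicated index rather than an exhaustive scan.

```python
import heapq


def retrieve_top_k(user_vec, item_vecs, seen, k=10):
    """Return the ids of the k highest-scoring unseen items for one user."""
    scored = (
        (sum(u * v for u, v in zip(user_vec, vec)), item_id)
        for item_id, vec in enumerate(item_vecs)
        if item_id not in seen
    )
    # nlargest sorts by score first, then by item id on ties.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


item_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
top = retrieve_top_k([1.0, 0.2], item_vecs, seen={0}, k=2)
print(top)
```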


In [None]:
import nanoRecSys.indexing.build_embeddings

nanoRecSys.indexing.build_embeddings.build_item_embeddings(batch_size=128)
nanoRecSys.indexing.build_embeddings.build_user_embeddings(batch_size=128)

### 4. Retriever Evaluation

In [None]:
from nanoRecSys.eval.offline_eval import OfflineEvaluator

evaluator = OfflineEvaluator(1024)
results = evaluator.eval_retrieval()

df = evaluator.formatted_results(results)
df

### 5. Negative Mining & Ranker Training

The retriever is good at narrowing a large catalog down to a set of relevant candidates; a heavier Ranker can then re-order those candidates for better precision.

This step takes about 10 minutes in total.
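
The idea behind hard-negative mining can be sketched as follows. The semantics assumed here (drop the first `skip_top` ranks because they may be false negatives, keep ranks up to `top_k`, remove known positives, then subsample interactions by `sampling_ratio`) are a guess at what these parameters mean, not a description of `run_pipeline`. Both helper functions are hypothetical.

```python
import random


def mine_hard_negatives(ranked_items, positives, top_k=100, skip_top=10):
    """Pick hard negatives for one interaction from a retriever's ranking."""
    # Keep ranks skip_top .. top_k - 1 (assumed meaning of the two parameters).
    window = ranked_items[skip_top:top_k]
    # Items the user actually interacted with are not negatives.
    return [item for item in window if item not in positives]


def subsample(interactions, sampling_ratio=0.2, seed=0):
    """Keep a random fraction of interactions to limit disk usage."""
    rng = random.Random(seed)
    n_keep = int(len(interactions) * sampling_ratio)
    return rng.sample(interactions, n_keep)


ranked = list(range(200))  # toy ranking: item 0 is the top-scored item
negatives = mine_hard_negatives(ranked, positives={12, 50, 150})
kept = subsample(list(range(1000)), sampling_ratio=0.2)
print(len(negatives), len(kept))
```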

In [None]:
import nanoRecSys.training.mine_negatives_sasrec
import nanoRecSys.train

# Generate hard negatives per interaction. The full set takes 10GB+ of disk space,
# so here we keep only a 0.2 fraction of them (sampling_ratio=0.2).
nanoRecSys.training.mine_negatives_sasrec.run_pipeline(
    batch_size=128, top_k=100, skip_top=10, sampling_ratio=0.2
)


class Args:
    mode = "ranker"
    user_tower_type = "transformer"
    epochs = 5
    batch_size = 2048
    id_dropout = 0.5
    random_neg_ratio = 0.01

    lr = 1e-3
    item_lr = 0
    num_workers = 2
    warmup_steps = 500
    check_val_every_n_epoch = 1


nanoRecSys.train.main(Args)

### 6. Ranker Evaluation

In [None]:
from nanoRecSys.eval.offline_eval import OfflineEvaluator

evaluator = OfflineEvaluator(1024)
# Popularity results first, as a reference point for the ranker metrics below.
results = evaluator.eval_popularity()

df = evaluator.formatted_results(results)
df

In [None]:
results = evaluator.eval_ranker()
# results = evaluator.eval_ranker_new_items()

df = evaluator.formatted_results(results)
df