## 🧠 04e · Recommendation System Pattern with Ray Train 
In this notebook you build a **scalable matrix factorization recommendation system** using the **MovieLens 100K** dataset, fully distributed on an Anyscale cluster with **Ray Train V2** and **Ray Data**.

### What you’ll learn & take away  
* How to use **Ray Data** to load, encode, and shard tabular datasets across many workers  
* How to **stream training data** directly into PyTorch using `iter_torch_batches()`  
* How to build a **custom training loop with validation and checkpointing** using `ray.train.report()`  
* How to use **Ray Train V2's fault-tolerant trainer** to resume training from the latest checkpoint with no extra logic  
* How to separate **training, evaluation, and inference** while keeping all code modular and distributed-ready  
* How to run real-world recommendation workloads with **no changes to your model code**, thanks to Ray’s orchestration  


### 🔢 What problem are you solving? (Matrix factorization for recommendations)

You’re building a **collaborative filtering recommendation system** that predicts how much a user will like an item  
based on **historical interaction data**, in this case user ratings from the MovieLens 100K dataset.

Use **matrix factorization**, a classic yet scalable approach where you embed each user and item in a latent space.  
The model learns to represent users and items as vectors and predicts ratings by computing their dot product.

---

### Input: User–Item–Rating triples

Each row in the dataset represents a user’s explicit rating of a movie:

$$
(u, {i}, r) \in \{\text{users}\} \times \{\text{items}\} \times \{1, 2, 3, 4, 5\}
$$

Encode and normalize the raw user and item IDs into contiguous integer indices (`user_idx`, `item_idx`)  
so that each index maps directly to a row of an embedding table during lookup and training.
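
As a library-free illustration of the encoding step, the sketch below maps raw IDs to contiguous zero-based indices. The helper name `encode_ids`, the toy triples, and the choice to number IDs in sorted order are assumptions made for this example; the notebook itself operates on a pandas DataFrame.

```python
def encode_ids(raw_ids):
    """Map arbitrary raw IDs to contiguous 0-based indices (sorted-ID order)."""
    mapping = {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)))}
    return [mapping[raw] for raw in raw_ids], mapping


# Toy (user_id, item_id, rating) triples, invented for illustration only.
triples = [(10, 500, 4), (42, 500, 5), (10, 731, 2), (7, 999, 3)]

user_idx, user_map = encode_ids([u for u, _, _ in triples])
item_idx, item_map = encode_ids([i for _, i, _ in triples])

# The mapping sizes give the number of rows each embedding table needs.
num_users, num_items = len(user_map), len(item_map)
print(user_idx, item_idx, num_users, num_items)
```

Keeping the mappings around matters at inference time, because predicted item indices must be translated back to the original movie IDs.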

---

### Model: Embedding-based matrix factorization

Learn an embedding vector for each user and each item:

$$
U_{u} \in \mathbb{R}^d, \quad V_{i} \in \mathbb{R}^d
$$

The predicted rating is the dot product of these vectors:

$$
\hat{r}_{u,{i}} = U_{u}^\top V_{i}
$$

The embedding dimension $d$ controls model capacity.
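
A minimal PyTorch sketch of this model might look like the following. The class name `MatrixFactorization`, the attribute names, and the embedding size of 16 are assumptions for illustration, not necessarily what the notebook uses; 943 users and 1,682 movies are the MovieLens 100K catalog sizes.

```python
import torch
from torch import nn


class MatrixFactorization(nn.Module):
    """Dot-product matrix factorization: r_hat[u, i] = U[u] . V[i]."""

    def __init__(self, num_users: int, num_items: int, embedding_dim: int = 64):
        super().__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

    def forward(self, user_idx: torch.Tensor, item_idx: torch.Tensor) -> torch.Tensor:
        u = self.user_embedding(user_idx)  # shape: (batch, d)
        v = self.item_embedding(item_idx)  # shape: (batch, d)
        return (u * v).sum(dim=-1)         # row-wise dot product, shape: (batch,)


model = MatrixFactorization(num_users=943, num_items=1682, embedding_dim=16)

# A toy batch of three (user_idx, item_idx, rating) triples.
users = torch.tensor([0, 1, 2])
items = torch.tensor([10, 20, 30])
ratings = torch.tensor([4.0, 3.0, 5.0])

predictions = model(users, items)
loss = nn.functional.mse_loss(predictions, ratings)
```

A larger `embedding_dim` lets the model capture more structure but also adds parameters that can overfit a dataset as small as MovieLens 100K.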

---

### Training objective

Minimize **Mean Squared Error (MSE)** between predicted and actual ratings:

$$
\mathcal{L} = \mathbb{E}_{(u, {i}, r)}\ \big(\hat{r}_{u,{i}} - r\big)^2
$$

This encourages the model to assign higher scores to user–item pairs that historically received high ratings.
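
For intuition, the gradients of the squared error for a single triple follow directly from the dot-product model. This is a worked derivation only: in practice PyTorch autograd computes these values, and the learning rate $\eta$ is an illustrative symbol that doesn't appear elsewhere in this notebook.

$$
\ell = \big(\hat{r}_{u,i} - r\big)^2, \qquad
\frac{\partial \ell}{\partial U_{u}} = 2\big(\hat{r}_{u,i} - r\big)\, V_{i}, \qquad
\frac{\partial \ell}{\partial V_{i}} = 2\big(\hat{r}_{u,i} - r\big)\, U_{u}
$$

A plain gradient step $U_{u} \leftarrow U_{u} - 2\eta\,\big(\hat{r}_{u,i} - r\big)\, V_{i}$ therefore pulls the user vector toward the item vector when the prediction is too low and pushes it away when the prediction is too high.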

---

### Inference: ranking items per user

Once trained, you can recommend items by computing predicted scores for a target user  
against **all items in the catalog**, where the rows of the matrix $V$ are the item embeddings $V_{i}$:

$$
\hat{r}_{u, *} = U_{u}^\top V^\top
$$

Sort these scores in descending order and return the top-N items as personalized recommendations.
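
The ranking step can be sketched without any framework. The function name `recommend_top_n` and the toy two-dimensional embeddings are hypothetical; a real implementation would read the vectors from the trained model's embedding tables.

```python
def recommend_top_n(user_vec, item_matrix, n):
    """Score every item against one user vector and return the n best (index, score) pairs."""
    scores = [sum(u * v for u, v in zip(user_vec, item_vec)) for item_vec in item_matrix]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


# Toy embeddings with d = 2, invented for illustration.
user_vec = [1.0, 2.0]
item_matrix = [
    [0.5, 1.5],  # item 0
    [2.0, 0.0],  # item 1
    [1.0, 1.0],  # item 2
    [0.0, 0.5],  # item 3
]

top = recommend_top_n(user_vec, item_matrix, n=2)
print(top)  # [(0, 3.5), (2, 3.0)]
```

In production you would typically also filter out items the user has already rated before taking the top N.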

---

### 🧭 How you’ll migrate this recommendation system workload to a distributed setup using Ray on Anyscale

This tutorial walks through how to **migrate a local matrix factorization pipeline for recommendation into a distributed, fault-tolerant training loop using Ray Train and Ray Data on Anyscale**.

Here’s how you approach the transition:

1. **Pandas DataFrame → Sharded Ray Dataset**  
   Load MovieLens 100K as a pandas DataFrame, encode the IDs, and use `ray.data.from_pandas_refs()` to create a **multi-block Ray Dataset**. Each block is a training shard that Ray can distribute across workers.

2. **Manual Batching → Streaming Torch Data Loaders**  
   Instead of manually writing PyTorch `Dataset` logic, use `iter_torch_batches()` from **Ray Data** to stream batches directly into each worker. Ray handles all the parallelism and sharding behind the scenes.

3. **Single-node PyTorch → Multi-GPU Distributed Training**  
   Write a minimal `train_loop_per_worker` that runs on each Ray worker. Using `TorchTrainer` and `prepare_model()`, scale this loop across 8 GPU workers automatically, with each worker processing its own data shard.

4. **Ad-hoc Logging → Structured Epoch Logging and Checkpoints**  
   Each epoch logs `train_loss` and `val_loss` to a shared JSON file and reports a checkpoint with `ray.train.report(checkpoint=...)`. This enables **automatic recovery and metric tracking** without any additional code.

5. **Resume and Scaling → Declarative Configuration**  
   Configure fault tolerance, checkpointing, and scaling using `ScalingConfig`, `CheckpointConfig`, and `FailureConfig`. This lets Ray + Anyscale handle retries, recovery, and GPU orchestration.

6. **Post-training Inference → Lightweight Python Function**  
   After training, load the latest checkpoint and generate top-N recommendations for any user with a simple forward pass. No retraining, no re-initialization, just pure PyTorch inference.
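
The steps above can be condensed into a rough skeleton. Treat it as a sketch under stated assumptions rather than the notebook's actual code: the helper names (`to_sharded_dataset`, `build_trainer`), the `rating` column name, the Adam optimizer, the hyperparameters, the run name, and the storage path are all hypothetical, and validation and JSON logging are omitted for brevity.

```python
import math
import os
import tempfile

import pandas as pd
import ray
import ray.data
import ray.train
import ray.train.torch
import torch
from ray.train import Checkpoint, CheckpointConfig, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def to_sharded_dataset(df: pd.DataFrame, num_blocks: int = 8) -> ray.data.Dataset:
    """Step 1: split an encoded ratings DataFrame into blocks and wrap them as a Ray Dataset."""
    step = max(1, math.ceil(len(df) / num_blocks))
    refs = [ray.put(df.iloc[start:start + step]) for start in range(0, len(df), step)]
    return ray.data.from_pandas_refs(refs)


class MatrixFactorization(torch.nn.Module):
    def __init__(self, num_users: int, num_items: int, dim: int):
        super().__init__()
        self.user_embedding = torch.nn.Embedding(num_users, dim)
        self.item_embedding = torch.nn.Embedding(num_items, dim)

    def forward(self, user_idx, item_idx):
        return (self.user_embedding(user_idx) * self.item_embedding(item_idx)).sum(dim=-1)


def train_loop_per_worker(config: dict) -> None:
    """Steps 2-4: runs once on every worker, each reading its own dataset shard."""
    model = ray.train.torch.prepare_model(
        MatrixFactorization(config["num_users"], config["num_items"], config["dim"])
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])  # optimizer is an assumption
    train_shard = ray.train.get_dataset_shard("train")

    for epoch in range(config["epochs"]):
        total_loss, num_batches = 0.0, 0
        for batch in train_shard.iter_torch_batches(batch_size=config["batch_size"]):
            preds = model(batch["user_idx"], batch["item_idx"])
            loss = torch.nn.functional.mse_loss(preds, batch["rating"].float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            num_batches += 1

        metrics = {"epoch": epoch, "train_loss": total_loss / max(num_batches, 1)}
        if ray.train.get_context().get_world_rank() == 0:
            # Only rank 0 writes the checkpoint; unwrap DDP so keys have no "module." prefix.
            with tempfile.TemporaryDirectory() as tmp_dir:
                state = getattr(model, "module", model).state_dict()
                torch.save(state, os.path.join(tmp_dir, "model.pt"))
                ray.train.report(metrics, checkpoint=Checkpoint.from_directory(tmp_dir))
        else:
            ray.train.report(metrics)


def build_trainer(train_ds: ray.data.Dataset) -> TorchTrainer:
    """Step 5: scaling, checkpoint retention, and retries are declared, not hand-coded."""
    return TorchTrainer(
        train_loop_per_worker,
        train_loop_config={
            "num_users": 943, "num_items": 1682,  # MovieLens 100K catalog sizes
            "dim": 64, "lr": 1e-3, "epochs": 5, "batch_size": 512,  # illustrative values
        },
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
        run_config=RunConfig(
            name="mf_movielens",                  # hypothetical run name
            storage_path="/mnt/cluster_storage",  # hypothetical shared storage path
            checkpoint_config=CheckpointConfig(num_to_keep=2),
            failure_config=FailureConfig(max_failures=3),
        ),
    )

# Usage (requires a running Ray cluster with GPUs):
#   result = build_trainer(to_sharded_dataset(ratings_df)).fit()
```

Nothing executes at import time here; calling `fit()` on the returned trainer is what launches the distributed run.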

With just a few changes to your core code, you can scale a traditional recommendation pipeline across a Ray cluster with **distributed data loading, checkpointing, fault tolerance, and parallel training**, all fully managed by Anyscale.