### 01 · Imports  
Start by importing all the libraries you need for the rest of the notebook. These include standard utilities like `os`, `json`, and `pandas`, as well as deep learning libraries like PyTorch and visualization tools like `matplotlib`.

Also import everything needed for **distributed training and data processing with Ray**:
- `ray` and `ray.data` provide the high-level distributed data API.
- `ray.train` gives you `ScalingConfig`, `RunConfig`, checkpointing, and metrics reporting.
- `ray.train.torch` gives you `TorchTrainer` and `prepare_model`, which wraps your PyTorch model for multi-worker training with Distributed Data Parallel (DDP).

One extra helper, `tqdm`, rounds out the list and provides progress bars.

This notebook assumes Ray is already running (for example, on Anyscale), so you don’t call `ray.init()` manually.

In [None]:
# 01. Imports

# Standard libraries
import os
import uuid
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import zipfile
import shutil

# PyTorch
import torch
from torch import nn
import torch.nn.functional as F

# Ray
import ray
import ray.data
from ray.train import (
    ScalingConfig,
    RunConfig,
    CheckpointConfig,
    FailureConfig,
    Checkpoint,
    get_checkpoint,
    get_context,
    get_dataset_shard,
    report,
)
from ray.train.torch import TorchTrainer, prepare_model

# Other
from tqdm import tqdm

### 02 · Load MovieLens 100K Dataset  
Download and extract the [MovieLens 100K](https://grouplens.org/datasets/movielens/100k/) dataset and persist a cleaned version to cluster storage under `/mnt/cluster_storage/rec_sys_tutorial/raw/ratings.csv`.

The MovieLens 100K dataset contains 100,000 ratings across 943 users and 1,682 movies. It’s small enough to train quickly, but realistic enough to demonstrate scaling and checkpointing with Ray Train.

If the archive is already downloaded and extracted, the code skips those two steps to save time. The output is a CSV with four columns: `user_id`, `item_id`, `rating`, and `timestamp`.

In [None]:
# 02. Load MovieLens 100K Dataset and store in /mnt/cluster_storage/

# Define clean working paths
DATA_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
LOCAL_ZIP = "/mnt/cluster_storage/rec_sys_tutorial/ml-100k.zip"
EXTRACT_DIR = "/mnt/cluster_storage/rec_sys_tutorial/ml-100k"
OUTPUT_CSV = "/mnt/cluster_storage/rec_sys_tutorial/raw/ratings.csv"

# Ensure target directories exist
os.makedirs("/mnt/cluster_storage/rec_sys_tutorial/raw", exist_ok=True)

# Download only if not already done
if not os.path.exists(LOCAL_ZIP):
    !wget -q $DATA_URL -O $LOCAL_ZIP

# Extract cleanly
if not os.path.exists(EXTRACT_DIR):
    with zipfile.ZipFile(LOCAL_ZIP, 'r') as zip_ref:
        zip_ref.extractall("/mnt/cluster_storage/rec_sys_tutorial")

# Load raw file
raw_path = os.path.join(EXTRACT_DIR, "u.data")
df = pd.read_csv(raw_path, sep="\t", names=["user_id", "item_id", "rating", "timestamp"])

# Save cleaned version
df.to_csv(OUTPUT_CSV, index=False)

print(f"✅ Loaded {len(df):,} ratings → {OUTPUT_CSV}")
df.head()
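
The `!wget` line is IPython shell syntax, so it only works inside a notebook. If you move this step into a plain Python script, the same download-if-missing pattern might look like the sketch below. `fetch_once` and its `downloader` parameter are hypothetical names introduced for illustration; the default downloader is the standard library's `urllib.request.urlretrieve`, and the demonstration uses a stand-in downloader so it runs without network access.

```python
import os
import tempfile
import urllib.request


def fetch_once(url, dest, downloader=urllib.request.urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest):
        return False
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    # Write to a temporary name first so an interrupted transfer
    # never leaves a partial file at dest
    tmp = dest + ".part"
    downloader(url, tmp)
    os.replace(tmp, dest)
    return True


# Offline demonstration with a stand-in downloader (no network access)
def fake_downloader(url, path):
    with open(path, "wb") as f:
        f.write(b"placeholder bytes")


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "raw", "ml-100k.zip")
    first = fetch_once("http://example.invalid/ml-100k.zip", target, fake_downloader)
    second = fetch_once("http://example.invalid/ml-100k.zip", target, fake_downloader)
    print(first, second)  # True False
```

In this notebook, a call such as `fetch_once(DATA_URL, LOCAL_ZIP)` would take the place of the `!wget` line.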

### 03 · Preprocess IDs and Create Ray Dataset  
Begin preprocessing by encoding `user_id` and `item_id` into the contiguous integer indices that embedding layers require. The encoded columns, `user_idx` and `item_idx`, are what your model uses during training.

After encoding, drop the original IDs and split the dataset into 64 chunks. Put each chunk into Ray’s object store with `ray.put(...)`, which serializes it and returns an object reference. This allows Ray Data to construct a distributed dataset in a later step (section 05) without creating a bottleneck on a single worker or process.

In [None]:
# 03. Preprocess IDs and create Ray Dataset in parallel

# Load CSV
df = pd.read_csv("/mnt/cluster_storage/rec_sys_tutorial/raw/ratings.csv")

# Encode user_id and item_id
user2idx = {uid: j for j, uid in enumerate(sorted(df["user_id"].unique()))}
item2idx = {iid: j for j, iid in enumerate(sorted(df["item_id"].unique()))}

df["user_idx"] = df["user_id"].map(user2idx)
df["item_idx"] = df["item_id"].map(item2idx)
df = df[["user_idx", "item_idx", "rating", "timestamp"]]

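# Split into multiple chunks for parallel ingestion
NUM_SPLITS = 64  # adjust based on cluster size
# Split row positions rather than the DataFrame itself; calling
# np.array_split directly on a DataFrame can trigger a pandas deprecation warning
row_chunks = np.array_split(np.arange(len(df)), NUM_SPLITS)
dfs = [df.iloc[rows] for rows in row_chunks]
object_refs = [ray.put(split) for split in dfs]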
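
To make the two ideas in this step concrete, here is a small pure-Python sketch of contiguous ID encoding and near-equal chunking. `build_index` and `chunk_sizes` are hypothetical helpers, and the raw IDs are made up for illustration. The chunk-size rule mirrors the one `numpy.array_split` uses, where the first `n_rows % n_chunks` chunks get one extra row.

```python
def build_index(raw_ids):
    """Map each distinct raw ID to a contiguous index 0..n-1 (sorted order)."""
    return {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)))}


def chunk_sizes(n_rows, n_chunks):
    """Sizes of near-equal chunks: the first n_rows % n_chunks get one extra row."""
    base, extra = divmod(n_rows, n_chunks)
    return [base + 1] * extra + [base] * (n_chunks - extra)


# Made-up raw IDs with gaps, to show that the encoded indices have none
raw_user_ids = [196, 22, 196, 305, 22, 1001]
user2idx = build_index(raw_user_ids)
idx2user = {idx: raw for raw, idx in user2idx.items()}  # inverse, for decoding

print(user2idx)                              # {22: 0, 196: 1, 305: 2, 1001: 3}
print([user2idx[u] for u in raw_user_ids])   # [1, 0, 1, 2, 0, 3]

sizes = chunk_sizes(100_000, 64)
print(sizes[0], sizes[-1], sum(sizes))       # 1563 1562 100000
```

Keeping an inverse mapping such as `idx2user` around is one way to translate model outputs back to the original MovieLens IDs later.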

### 04 · Visualize Dataset: Ratings, Users, and Items  
Before training, visualize the distribution of ratings, user activity, and item popularity. These quick checks help you verify that the dataset parses correctly and reveal useful patterns:

- The first plot shows the overall rating distribution (1–5 stars). As expected, the distribution skews toward the higher ratings.
- The second plot shows how many ratings each user has submitted. There’s a long tail: a few power users, but many light users.
- The third plot shows how often users rated each item. Again, you see a long-tail distribution common in recommendation settings.

These histograms give you a sense of sparsity and coverage, both of which influence model performance.

In [None]:
# 04. Visualize Dataset: Ratings, User & Item Activity

# Plot rating distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
df["rating"].hist(bins=[0.5,1.5,2.5,3.5,4.5,5.5], edgecolor='black')
plt.title("Rating Distribution")
plt.xlabel("Rating"); plt.ylabel("Frequency")

# Plot number of ratings per user
plt.subplot(1, 3, 2)
df["user_idx"].value_counts().hist(bins=30, edgecolor='black')
plt.title("Ratings per User")
plt.xlabel("# Ratings"); plt.ylabel("Users")

# Plot number of ratings per item
plt.subplot(1, 3, 3)
df["item_idx"].value_counts().hist(bins=30, edgecolor='black')
plt.title("Ratings per Item")
plt.xlabel("# Ratings"); plt.ylabel("Items")

plt.tight_layout()
plt.show()
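
Sparsity can also be put into numbers. The sketch below uses only the dataset sizes quoted earlier (100,000 ratings, 943 users, 1,682 movies) to compute how much of the user-item matrix is filled in. In the notebook itself you could derive the same counts from `df`, for example with `len(df)` and `df["user_idx"].nunique()`.

```python
n_ratings = 100_000
n_users = 943
n_items = 1_682

possible_pairs = n_users * n_items
density = n_ratings / possible_pairs  # fraction of user-item pairs with a rating
sparsity = 1.0 - density

print(f"Possible user-item pairs: {possible_pairs:,}")      # 1,586,126
print(f"Density:  {density:.2%}")                           # 6.30%
print(f"Sparsity: {sparsity:.2%}")                          # 93.70%
print(f"Mean ratings per user: {n_ratings / n_users:.1f}")  # 106.0
print(f"Mean ratings per item: {n_ratings / n_items:.1f}")  # 59.5
```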

### 05 · Create Ray Dataset from Encoded Chunks  
Now, convert your list of encoded pandas chunks into a Ray Dataset using `from_pandas_refs(...)`. This method ensures that each chunk becomes its own block, enabling parallel data processing across the cluster.

The result is a distributed Ray Dataset with one block per chunk, which is ideal for streaming batches during training. Confirm the number of blocks and show a few rows to verify the format.

In [None]:
# 05. Create Ray Dataset from refs (uses multiple blocks/workers)

ratings_ds = ray.data.from_pandas_refs(object_refs)
print("✅ Ray Dataset created with", ratings_ds.num_blocks(), "blocks")
ratings_ds.show(3)

### 06 · Train/Validation Split using Ray Data  
Next, split the dataset into training and validation sets. First, shuffle the entire Ray Dataset to ensure randomization, then split by row index, using 80% for training and 20% for validation.

This approach is simple and scalable: Ray handles the shuffling and slicing in parallel across blocks. Also, set a fixed seed to ensure the split is reproducible. After you split it, each dataset remains a fully distributed Ray Dataset, ready to stream into workers.

In [None]:
# 06. Train/Val Split using Ray Data

# Parameters
TRAIN_FRAC = 0.8
SEED = 42  # for reproducibility

# Shuffle + split by index
total_rows = ratings_ds.count()
train_size = int(total_rows * TRAIN_FRAC)

ratings_ds = ratings_ds.random_shuffle(seed=SEED)
train_ds, val_ds = ratings_ds.split_at_indices([train_size])

print("✅ Train/Val Split:")
print(f"  Train → {train_ds.count():,} rows")
print(f"  Val   → {val_ds.count():,} rows")
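
The shuffle-then-cut logic is easy to check on a single machine. The sketch below is a local stand-in written with the standard library, not Ray's implementation: `shuffle_and_split` is a hypothetical helper that shuffles with a fixed seed and cuts at one index, which is the same idea as calling `random_shuffle(seed=...)` followed by `split_at_indices([...])`.

```python
import random


def shuffle_and_split(rows, train_frac, seed):
    """Shuffle a copy of rows with a fixed seed, then cut at one index."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100_000))  # stand-in for row identifiers
train_a, val_a = shuffle_and_split(rows, 0.8, seed=42)
train_b, val_b = shuffle_and_split(rows, 0.8, seed=42)

print(len(train_a), len(val_a))        # 80000 20000
print(train_a == train_b)              # True: same seed, same split
print(set(train_a).isdisjoint(val_a))  # True: no row lands in both sets
```

Because the split is a plain random cut, some users or items in the validation set may have few or no ratings in the training set.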