## 1 · Imports  
This cell loads the libraries used throughout the tutorial: PyData tools for data processing, PyTorch for model building and training, and Ray Train for distributed orchestration. `TorchTrainer` launches and manages the distributed training job, while `prepare_model` and `prepare_data_loader` wrap a vanilla PyTorch model and `DataLoader` so that the same training code runs across multiple workers and nodes.

In [None]:
# 01. Imports
import os, io, math, uuid, shutil, random
import requests, sys
from pathlib import Path
from datetime import datetime, timedelta
from datasets import load_dataset   

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim

import ray
import ray.train as train
from ray.train import (
    ScalingConfig, RunConfig, FailureConfig,
    CheckpointConfig, Checkpoint, get_checkpoint, get_context
)
from ray.train.torch import prepare_model, prepare_data_loader, TorchTrainer

### 2 · Load NYC-Taxi passenger counts (30-min)  
Download and cache a lightweight NYC taxi demand dataset from GitHub. Store the file under the shared `/mnt/cluster_storage` directory so that all Ray workers can read it without duplication. Parse the timestamps and use them as the DataFrame index, which makes the data time-series ready.

In [None]:
# 02. Load NYC-Taxi passenger counts (30-min) from GitHub raw – no auth, ~1 MB

DATA_DIR = "/mnt/cluster_storage/nyc_taxi_ts"
os.makedirs(DATA_DIR, exist_ok=True)

url = "https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv"
csv_path = os.path.join(DATA_DIR, "nyc_taxi.csv")

if not os.path.exists(csv_path):
    print("Downloading nyc_taxi.csv …")
    df = pd.read_csv(url)
    df.to_csv(csv_path, index=False)
else:
    print("File already present.")
    df = pd.read_csv(csv_path)

# Parse timestamp + tidy
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp").rename(columns={"value": "passengers"})

print("Rows:", len(df), "| Time span:", df.index.min(), "→", df.index.max())
df.head()

### 3 · Resample to 30-min intervals, then normalise  
Resample the dataset to 30-minute intervals (a no-op if it is already on that grid), then z-score the `passengers` column to get a standardised signal. Standardising improves training stability, keeps gradients on a sensible scale, and stops the model from having to learn absolute magnitudes early on. You reverse the normalisation after inference.

In [None]:
# 03. Resample to 30-min intervals, then normalise
# NOTE: the variable keeps the name `hourly`, but it holds half-hourly data.
hourly = df.resample("30min").mean()

mean, std = hourly["passengers"].mean(), hourly["passengers"].std()
hourly["norm"] = (hourly["passengers"] - mean) / std

print(f"Half-Hourly rows: {len(hourly)}  |  mean={mean:.1f}, std={std:.1f}")
hourly.head()
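
The round trip between raw and normalised values can be sketched as follows. This is a minimal, dependency-free illustration rather than part of the notebook: the helper names and the sample passenger counts are hypothetical, and the sample standard deviation is used because that is what pandas' `Series.std()` computes by default.

```python
import statistics

def fit_zscore(values):
    """Return (mean, std) using the sample standard deviation (ddof=1)."""
    return statistics.fmean(values), statistics.stdev(values)

def normalise(values, mean, std):
    """Map raw values to z-scores."""
    return [(v - mean) / std for v in values]

def denormalise(values, mean, std):
    """Map z-scores (for example, model predictions) back to raw units."""
    return [v * std + mean for v in values]

# Hypothetical half-hourly passenger counts, for illustration only.
passengers = [10844.0, 8127.0, 6210.0, 4656.0, 3820.0]

mean, std = fit_zscore(passengers)
norm = normalise(passengers, mean, std)
restored = denormalise(norm, mean, std)
```

Keep `mean` and `std` around after this step: the same two numbers are needed to convert model outputs back to passenger counts.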

### 4 · Quick visual sanity-check  
Before moving to training, it’s good practice to visualise the raw data. Plot the first two weeks of half-hourly taxi demand. This helps confirm that the series exhibits strong seasonality and contains no unexpected gaps or noise.

In [None]:
# 04. Quick visual sanity-check: first two weeks
plt.figure(figsize=(10, 4))
hourly["passengers"].iloc[:48*14].plot()   # 48 half-hour steps per day × 14 days
plt.title("NYC-Taxi passengers - first 2 weeks of the series")
plt.ylabel("passengers per 30 min")
plt.grid(True)
plt.tight_layout()
plt.show()
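
A plot can hide short gaps, so a programmatic check is a useful complement. The sketch below is a hypothetical helper, not part of the notebook: it flags any pair of consecutive timestamps that are not exactly 30 minutes apart. The timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, step=timedelta(minutes=30)):
    """Return (previous, current) pairs whose spacing differs from `step`."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a != step]

# Six hypothetical half-hourly readings, with one removed to simulate a gap.
start = datetime(2014, 7, 1)
stamps = [start + i * timedelta(minutes=30) for i in range(6)]
del stamps[3]  # drop the 01:30 reading

gaps = find_gaps(stamps)
print(gaps)  # one gap, between 01:00 and 02:00
```

With the real data, the same check could be run on `list(df.index)` before resampling.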

### 5 · Sliding-window dataset → Parquet  
Convert the time series into a supervised learning format using sliding windows. Each sample consists of a fixed-length input sequence (168 half-hour steps, i.e. 3.5 days of past data) and a prediction target (the next 48 steps, i.e. 24 hours). You write these to columnar Parquet files on shared storage so that every worker can read them efficiently during distributed training.

In [None]:
# 05. Build sliding-window dataset and write to Parquet
# ----------------------------------------------------
INPUT_WINDOW = 24 * 7   # 168 half-hour steps = 3.5 days (half a week) of history
HORIZON      = 48       # predict next 24 h
STRIDE       = 12       # slide 6 hours at a time

values = hourly["norm"].to_numpy(dtype="float32")  # already normalised

# ---- Time-aware split to avoid leakage between train and val ----
cut = int(0.9 * len(values))  # split by time index on the original series
train_records, val_records = [], []

for s in range(0, len(values) - INPUT_WINDOW - HORIZON + 1, STRIDE):
    past   = values[s : s + INPUT_WINDOW]
    future = values[s + INPUT_WINDOW : s + INPUT_WINDOW + HORIZON]
    end    = s + INPUT_WINDOW + HORIZON  # exclusive end: one past the last index this window uses

    rec = {
        "series_id": 0,
        "past":  past.tolist(),
        "future": future.tolist(),
    }

    if end <= cut:         # entire window ends before the cut → train
        train_records.append(rec)
    elif s >= cut:         # window starts after the cut → val
        val_records.append(rec)
    # else: window crosses the cut → drop to prevent leakage

print(f"Windows → train: {len(train_records)}, val: {len(val_records)}")

# Write to Parquet
DATA_DIR     = "/mnt/cluster_storage/nyc_taxi_ts"
PARQUET_DIR  = os.path.join(DATA_DIR, "parquet")
os.makedirs(PARQUET_DIR, exist_ok=True)

schema = pa.schema([
    ("series_id", pa.int32()),
    ("past",  pa.list_(pa.float32())),
    ("future", pa.list_(pa.float32()))
])

def write_parquet(records, fname):
    pq.write_table(pa.Table.from_pylist(records, schema=schema), fname, version="2.6")

write_parquet(train_records, os.path.join(PARQUET_DIR, "train.parquet"))
write_parquet(val_records,   os.path.join(PARQUET_DIR, "val.parquet"))
print("Parquet shards written →", PARQUET_DIR)
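
To see how many windows the time-aware split keeps, the same rule can be applied to window start indices alone. This is a sketch under stated assumptions: the function name is hypothetical and the series length of 3000 is an illustrative value, not the length of the taxi dataset.

```python
def split_window_starts(n, window, horizon, stride, train_frac=0.9):
    """Classify each window start as train, val, or dropped (crosses the cut)."""
    cut = int(train_frac * n)
    train, val, dropped = [], [], []
    for s in range(0, n - window - horizon + 1, stride):
        end = s + window + horizon      # exclusive end of the window
        if end <= cut:                  # window lies entirely before the cut
            train.append(s)
        elif s >= cut:                  # window lies entirely after the cut
            val.append(s)
        else:                           # window straddles the cut
            dropped.append(s)
    return train, val, dropped

train, val, dropped = split_window_starts(n=3000, window=168, horizon=48, stride=12)
print(len(train), len(val), len(dropped))  # 208 8 17
```

Because a validation window must fit entirely inside the last 10% of the series, the validation set is much smaller than 10% of the windows, and short series can end up with no validation windows at all.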


### 6 · PyTorch Dataset over Parquet  
Define a lightweight PyTorch `Dataset` class that reads each window from the Parquet shard. This makes the model training logic agnostic to how you store the data. Your DataLoader receives standard PyTorch tensors.

In [None]:
# 06. PyTorch Dataset that reads the Parquet shards

class TaxiWindowDataset(Dataset):
    def __init__(self, parquet_path):
        self.table  = pq.read_table(parquet_path)
        self.past   = self.table.column("past").to_pylist()
        self.future = self.table.column("future").to_pylist()

    def __len__(self):
        return len(self.past)

    def __getitem__(self, idx):
        past   = torch.tensor(self.past[idx],   dtype=torch.float32).unsqueeze(-1)   # (T, 1)
        future = torch.tensor(self.future[idx], dtype=torch.float32)                 # (H,)
        return past, future

### 7 · Inspect one random batch  
Always verify shapes before training. This cell uses a basic `DataLoader` to fetch one random batch and prints the dimensions of the input and target tensors. With `batch_size=4` and the window settings above, expect `Past` to be `(4, 168, 1)` and `Future` to be `(4, 48)`, which confirms that the encoder and decoder receive tensors of the correct shape.

In [None]:
# 07. Inspect one random batch
loader = DataLoader(TaxiWindowDataset(os.path.join(PARQUET_DIR, "train.parquet")),
                    batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
print("Past:", xb.shape, "Future:", yb.shape)

### 8 · Ray-prepared DataLoader  
Ray Train provides a helper that wraps a standard `DataLoader` for distributed training. `prepare_data_loader` adds a distributed sampler so that each worker process loads only its own shard of the data, and it moves each batch to that worker's device.

In [None]:
# 08. Helper to build Ray-prepared DataLoader
from ray.train.torch import prepare_data_loader

def build_dataloader(parquet_path, batch_size, shuffle=True):
    ds = TaxiWindowDataset(parquet_path)
    loader = DataLoader(
        ds, batch_size=batch_size, shuffle=shuffle, num_workers=2, drop_last=False,
    )
    return prepare_data_loader(loader)
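
The idea of sharding can be illustrated without a cluster. The sketch below is a simplified, unshuffled imitation of the round-robin scheme used by PyTorch's `DistributedSampler`. It is not Ray's or PyTorch's actual code, and `shard_indices` is a hypothetical name.

```python
import math

def shard_indices(num_samples, world_size, rank):
    """Return the sample indices one worker would read (no shuffling)."""
    # Pad so every worker gets the same number of samples.
    total = math.ceil(num_samples / world_size) * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]   # repeat the first few samples
    # Each worker takes every `world_size`-th index, starting at its rank.
    return indices[rank:total:world_size]

for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 0]
# 3 [3, 7, 1]
```

Note that padding repeats a few samples so that all workers run the same number of steps. This is why per-worker sample counts can add up to slightly more than the dataset size.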