## 15 · Run inference and visualise prediction  
Grab the most recent week of taxi data, run it through your trained model, and plot the predicted future demand against the actual values. This gives you a visual check of model quality and allows you to verify whether the model has learned temporal patterns like daily or weekly cycles.

Because the dataset is small, the model in this tutorial learns only the mean of the data (a constant solution). Improving on this result requires more training data.

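One way to check whether a forecast has collapsed to the mean is to compare its spread with the spread of the ground truth, and its error with that of a mean-only baseline. The sketch below is illustrative only: the `describe_collapse` helper, the 0.1 threshold, and the synthetic series are assumptions, not part of the tutorial's code. It uses plain Python so that it runs without extra dependencies; with real data you would pass `pred` and `future_true` instead.

```python
import math
import statistics


def describe_collapse(pred, truth, train_mean, ratio_threshold=0.1):
    """Compare a forecast with a constant mean-only baseline.

    ratio_threshold is an arbitrary illustrative cut-off, not a tuned value.
    """
    # Spread of the forecast relative to the spread of the ground truth
    spread_ratio = statistics.pstdev(pred) / statistics.pstdev(truth)
    # Mean absolute error of the model and of "always predict the mean"
    mae_model = statistics.fmean(abs(p - t) for p, t in zip(pred, truth))
    mae_baseline = statistics.fmean(abs(train_mean - t) for t in truth)
    return {
        "spread_ratio": spread_ratio,
        "mae_model": mae_model,
        "mae_baseline": mae_baseline,
        "looks_constant": spread_ratio < ratio_threshold,
    }


# Synthetic 24-step daily cycle standing in for real demand
truth = [100 + 50 * math.sin(2 * math.pi * h / 24) for h in range(24)]
flat_pred = [100.0] * 24  # what a mean-only model would output

report = describe_collapse(flat_pred, truth, train_mean=100.0)
print(report)
```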
In [None]:
# 15. Run inference on the latest window & plot

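# Use the final HORIZON steps as ground truth and the INPUT_WINDOW steps
# immediately before them as model input, so the two windows don't overlap
past_norm = hourly["norm"].iloc[-(INPUT_WINDOW + HORIZON):-HORIZON].to_numpy()
future_true = hourly["passengers"].iloc[-HORIZON:].to_numpy()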

with best_ckpt.as_directory() as p:
    pred_norm = ray.get(forecast_from_checkpoint.remote(p, past_norm))

# de-normalise
mean, std = hourly["passengers"].mean(), hourly["passengers"].std()
pred = pred_norm * std + mean
past = past_norm * std + mean

# Plot
import matplotlib.pyplot as plt

STEP_SIZE_HOURS = 0.5  # each step covers 30 minutes of data
# Express both axes in hours so history and forecast share one time scale
t_past = np.arange(-INPUT_WINDOW, 0) * STEP_SIZE_HOURS
t_future = np.arange(0, HORIZON) * STEP_SIZE_HOURS

plt.figure(figsize=(10, 4))
plt.plot(t_past, past, label="History")
plt.plot(t_future, future_true, "--", label="Ground Truth")
plt.plot(t_future, pred, "-.", label="Forecast")
plt.axvline(0, color="black")
plt.xlabel("Hours relative to forecast start")
plt.ylabel("# trips")
plt.title("NYC-Taxi 24 h Forecast")
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()

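As a numeric complement to the plot, you could score the forecast with MAE and RMSE and compare it against a seasonal-naive baseline that simply repeats the previous day. A model that has learned the daily cycle would be expected to at least match that baseline. The sketch below is an illustration under stated assumptions: the helper names, the 48-step season (one day of 30-minute data), and the synthetic series are not part of the tutorial, and `lagged` is only a stand-in for a real forecast such as `pred`.

```python
import math


def mae(pred, truth):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def rmse(pred, truth):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def seasonal_naive(history, horizon, season_length):
    """Forecast by repeating the last full season of the history."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]


# Synthetic half-hourly series with a 48-step daily cycle (illustrative only)
SEASON = 48
series = [200 + 80 * math.sin(2 * math.pi * t / SEASON) for t in range(SEASON * 8)]
history, truth = series[:-SEASON], series[-SEASON:]

baseline = seasonal_naive(history, horizon=SEASON, season_length=SEASON)
lagged = series[-SEASON - 2:-2]  # stand-in forecast that trails the truth by two steps

print(f"seasonal-naive  MAE={mae(baseline, truth):.2f}  RMSE={rmse(baseline, truth):.2f}")
print(f"stand-in model  MAE={mae(lagged, truth):.2f}  RMSE={rmse(lagged, truth):.2f}")
```

If the model's error is no better than the seasonal-naive baseline, the plot's visual agreement is probably not telling you much.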
### 16 · Cleanup: remove all training artefacts
Finally, tidy up by deleting temporary checkpoint folders, the metrics CSV, and any intermediate result directories. Clearing out old artefacts frees disk space and leaves your workspace clean for whatever comes next.

In [None]:
# 16. Cleanup – optionally remove all artefacts to free space
import os
import shutil

# Delete the working directory and everything under it
if os.path.exists(DATA_DIR):
    shutil.rmtree(DATA_DIR)
    print(f"Deleted {DATA_DIR}")

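If you would rather keep the dataset and remove only selected artefacts, a selective variant could look like the sketch below. The `remove_paths` helper and the file names (`checkpoints`, `metrics.csv`, `dataset.parquet`) are hypothetical, and the example runs in a throwaway temporary directory rather than the tutorial's `DATA_DIR`.

```python
import shutil
import tempfile
from pathlib import Path


def remove_paths(paths):
    """Delete each file or directory in `paths`; return the ones actually removed."""
    removed = []
    for path in map(Path, paths):
        if path.is_dir():
            shutil.rmtree(path)
        elif path.exists():
            path.unlink()
        else:
            continue  # nothing to delete
        removed.append(path)
    return removed


# Build a throwaway layout that mimics a training run (names are made up)
root = Path(tempfile.mkdtemp())
(root / "checkpoints" / "epoch_0").mkdir(parents=True)
(root / "metrics.csv").write_text("epoch,val_loss\n0,1.0\n")
(root / "dataset.parquet").write_text("placeholder")

removed = remove_paths([root / "checkpoints", root / "metrics.csv", root / "missing"])
print("Removed:", [p.name for p in removed])
print("Kept:", sorted(p.name for p in root.iterdir()))

shutil.rmtree(root)  # finally drop the temporary directory itself
```

Returning the list of removed paths makes it easy to log exactly what was deleted, and missing paths are skipped rather than raising.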
### 🎉 Wrapping Up & Next Steps

Nice work. You've built a robust, distributed forecasting workflow using **Ray Train on Anyscale** that:

* Trains a Transformer model across **multiple GPUs** using **Ray Train with Distributed Data Parallel (DDP)**, abstracting away low-level orchestration.
* Recovers automatically from failures with **built-in checkpointing and resume**, even across re-launches or node churn.
* Logs and reports per-epoch metrics using **Ray Train’s reporting APIs**, enabling real-time monitoring and seamless plotting.
* Performs inference using **Ray remote tasks**, allowing you to scale forecasting across GPUs or nodes without changing model code.

---

### 🚀 Where can you take this next?

1. **Hyperparameter Sweeps**  
   * Wrap the `TorchTrainer` with **Ray Tune** to search over `d_model`, `nhead`, learning rate, and window sizes.  

2. **Probabilistic Forecasting**  
   * Output percentiles or fit a distribution head (for example, Gaussian) to capture prediction uncertainty.  

3. **Multivariate & Exogenous Features**  
   * Add weather, holidays, or ride-sharing surge multipliers as extra input channels.  

4. **Early-Stopping & LR Scheduling**  
   * Monitor validation loss and reduce the learning rate on plateau, or stop when the per-epoch improvement falls below 1 %.  

5. **Model Compression**  
   * Distil the large Transformer into a lightweight LSTM or Tiny-Transformer for edge deployment.  

6. **Streaming / Online Learning**  
   * Use **Ray Serve** to deploy the model and update weights periodically with the latest data.  

7. **Interpretability**  
   * Visualise attention maps to see which time lags the model focuses on, which can help build stakeholder trust.  

8. **End-to-End MLOps**  
   * Schedule nightly retraining with **Ray Jobs**, log artifacts to MLflow or Weights & Biases, and automate model promotion.  

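The early-stopping and LR-scheduling idea from item 4 can be sketched in plain Python. Everything here is illustrative: the `PlateauMonitor` class, the patience values, the halving factor, and the validation-loss sequence are assumptions. In a real PyTorch loop you would more likely rely on `torch.optim.lr_scheduler.ReduceLROnPlateau` for the learning-rate part.

```python
class PlateauMonitor:
    """Track validation loss; cut the LR on a plateau and stop if it persists."""

    def __init__(self, lr, min_rel_improvement=0.01, lr_patience=2,
                 stop_patience=4, factor=0.5):
        self.lr = lr
        self.min_rel_improvement = min_rel_improvement  # 1 % threshold from the text
        self.lr_patience = lr_patience      # stale epochs between LR reductions
        self.stop_patience = stop_patience  # stale epochs before stopping
        self.factor = factor
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        # Only a relative improvement above the threshold counts as progress
        if val_loss < self.best * (1 - self.min_rel_improvement):
            self.best = val_loss
            self.stale_epochs = 0
            return False
        self.stale_epochs += 1
        if self.stale_epochs % self.lr_patience == 0:
            self.lr *= self.factor
        return self.stale_epochs >= self.stop_patience


# Made-up validation losses that improve, then flatten out
val_losses = [1.00, 0.80, 0.70, 0.699, 0.698, 0.697, 0.6965, 0.696]
monitor = PlateauMonitor(lr=1e-3)

stopped_at = None
for epoch, loss in enumerate(val_losses):
    if monitor.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, final lr={monitor.lr}")
```

Note the design choice that `best` only updates on an improvement larger than the threshold, so a long run of tiny gains still counts as a plateau.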