## 🏔️ 04b · Tabular Workload Pattern with Ray Train  
In this tutorial you take the classic **Cover type forest-cover dataset** (about 580,000 rows, 54 tabular features) and scale an **XGBoost** model across an Anyscale cluster using **Ray Train V2**.

### What you’ll learn & take away

- Ingest tabular data at scale using **Ray Data** and persist it to Parquet for reproducibility  
- Launch a fault-tolerant, checkpoint-enabled **XGBoost training loop** on multiple CPUs using **Ray Train**  
- Resume training from checkpoints across job restarts and hardware failures  
- Evaluate model accuracy, visualize feature importance, and scale batch inference using **Ray remote tasks**  
- Understand how to port classic gradient boosting workflows into a **fully distributed, multi-node training setup on Anyscale**

### 🔢 What problem are you solving? (Forest Cover Classification with XGBoost)

You're predicting which **type of forest vegetation** (for example, Lodgepole Pine, Spruce/Fir, Aspen) is present at a given land location, using only numeric and binary cartographic features such as elevation, slope, soil type, and proximity to roads or hydrology.

---

### What's XGBoost?

**XGBoost** (Extreme Gradient Boosting) is a fast, scalable machine learning algorithm based on **gradient-boosted decision trees**. It builds a sequence of shallow decision trees, where each new tree tries to correct the errors of the previous ensemble by minimizing a differentiable loss (like log-loss).

In this tutorial, you minimize the **multi-class softmax log-loss**, learning a function:

$$
f_\theta: \mathbb{R}^{54} \rightarrow \{0, 1, \dots, 6\}
$$

that maps a 54-dimensional tabular input (raw geo-spatial features) to a forest cover type. Each boosting round fits a new tree on the gradient of the loss, gradually improving accuracy over hundreds of rounds.

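To make the objective concrete, here is a minimal pure-Python sketch of the softmax log-loss for a single sample and its gradient with respect to the raw class scores. The function names and the example scores are illustrative assumptions, not XGBoost internals; XGBoost evaluates its objective in optimized native code, and each boosting round fits trees to gradients of this kind.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(scores, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(scores)[label])

def log_loss_gradient(scores, label):
    """Gradient of the log-loss w.r.t. each raw score: p_k - 1[k == label]."""
    probs = softmax(scores)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]

# Hypothetical raw scores for the 7 cover types; the true class is 2.
scores = [0.1, 0.4, 2.0, -1.0, 0.0, 0.3, -0.5]
print(f"loss = {log_loss(scores, 2):.4f}")
print("gradient =", [round(g, 3) for g in log_loss_gradient(scores, 2)])
```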
---

### 🧭 How you’ll migrate this tabular workload to a distributed setup using Ray on Anyscale

This tutorial walks through the end-to-end process of **migrating a local XGBoost training pipeline to a distributed Ray cluster running on Anyscale**.

Here’s how you make that transition:

1. **Local → Remote Data**  
   Store the raw data as Parquet in a shared cloud directory and load it using **Ray Data**, which streams and shards the dataset across workers automatically.

2. **Single-process → Multi-worker Training**  
   Define a custom `train_func`, then let **Ray Train** spin up 16 distributed training workers (1 per CPU) and run `xgb.train` in parallel, each with its own data shard.

3. **Manual Checkpointing → Automated Fault Tolerance**  
   With `RayTrainReportCallback` and `CheckpointConfig`, Ray saves checkpoints every 10 boosting rounds and can resume mid-training if any worker crashes or a job is re-launched.

4. **Manual Loops → Cluster-scale Abstractions**  
   Skip the boilerplate of manually slicing datasets, coordinating workers, or building launch scripts. Instead, declare intent (with `ScalingConfig`, `RunConfig`, and `FailureConfig`) and let **Ray + Anyscale** manage the execution.

5. **Offline Inference → Remote Tasks**  
   Batch inference can launch as **Ray remote tasks** on CPU workers, which is useful for validation, drift detection, or live scoring inside a service.

This pattern turns a traditional single-node workflow into a scalable, resilient training pipeline with minimal code changes, and it works seamlessly on any cluster you provision through Anyscale.

### 1 · Imports  
Before you touch any data, import every tool you need.  
Alongside the standard scientific-Python stack, bring in **XGBoost** for gradient-boosted decision trees and **Ray** for distributed data loading and training. Ray Train’s helper classes (`RunConfig`, `ScalingConfig`, `CheckpointConfig`, `FailureConfig`) give you fault-tolerant CPU training with almost no extra code.


In [None]:
# 00. Runtime setup — install same deps and set env vars
import os, sys, subprocess

# Non-secret env var 
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

# Install Python dependencies 
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "--no-cache-dir",
    "matplotlib==3.10.6",
    "scikit-learn==1.7.2",
    "pyarrow==14.0.2",    
    "xgboost==3.0.5",
    "seaborn==0.13.2",
])

In [None]:
# 01. Imports
import os, shutil, json, uuid, tempfile, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_covtype
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import xgboost as xgb

import ray
import ray.data as rd
from ray.train import RunConfig, ScalingConfig, CheckpointConfig, FailureConfig, get_dataset_shard, get_checkpoint, get_context
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

### 2 · Load the University of California, Irvine (UCI) Cover type dataset  
The Cover type dataset contains ~580 000 forest-cover observations with 54 tabular features and a 7-class label. Fetch it from `sklearn.datasets`, rename the target column to `label` (Ray’s default), and shift the classes from **1-7** to **0-6** so they're zero-indexed as XGBoost expects. A quick `value_counts` sanity-check confirms the mapping worked.


In [None]:
# 02. Load the UCI Cover type dataset (~580k rows, 54 features)
data = fetch_covtype(as_frame=True)
df = data.frame
df.rename(columns={"Cover_Type": "label"}, inplace=True)   # Ray expects "label"
df["label"] = df["label"] - 1          # 1-7  →  0-6
assert df["label"].between(0, 6).all()
print(df.shape, df.label.value_counts(normalize=True).head())

### 3 · Visualise class balance  
Highly imbalanced targets can bias tree-based models, so plot the raw label counts. The Cover type distribution is noticeably skewed, with two classes accounting for most rows, and the bar chart lets you judge whether re-sampling or class-weighting is necessary. For now, you train on the raw distribution without re-weighting.

In [None]:
# 03. Visualize class distribution
df.label.value_counts().plot(kind="bar", figsize=(6,3), title="Cover Type distribution")
plt.ylabel("Frequency"); plt.show()

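If the bar chart suggests that class-weighting is worth trying, one common scheme weights each class by `n_samples / (n_classes * class_count)`. The sketch below is a pure-Python illustration on toy labels; the helper name `balanced_class_weights` is hypothetical, and the tutorial itself trains without weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Toy labels (hypothetical, not the real Cover type frequencies).
labels = [0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(labels)
sample_weight = [weights[y] for y in labels]  # one weight per row
print(weights)
```

Per-row weights like `sample_weight` could be passed through the `weight` argument of `xgb.DMatrix`, so rare classes contribute more to the loss.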
### 4 · Persist the dataset to Parquet  
Storing the DataFrame once on the cluster’s shared file system keeps later steps fast and reproducible. Parquet is columnar, compressed, and lazily readable by Ray Data, which means you can stream partitions to workers without loading everything into RAM.


In [None]:
# 04. Write to /mnt/cluster_storage/covtype/
PARQUET_DIR = "/mnt/cluster_storage/covtype/parquet"
os.makedirs(PARQUET_DIR, exist_ok=True)
file_path = os.path.join(PARQUET_DIR, "covtype.parquet")
df.to_parquet(file_path)
print(f"Wrote Parquet -> {file_path}")

### 5 · Read the data as a Ray Dataset  
`ray.data.read_parquet` gives you a **lazy, columnar dataset**, and chaining `.random_shuffle()` randomizes the row order when the plan executes. From this point on, every split, batch, or transformation executes in parallel across the cluster, so you avoid a single-node bottleneck.

In [None]:
# 05. Load dataset into a Ray Dataset (lazy, columnar)
ds_full = rd.read_parquet(file_path).random_shuffle()      
print(ds_full)

### 6 · Create train / validation splits  
Perform an 80 / 20 split directly on the Ray Dataset. The split triggers execution of the pending read and shuffle, and each subset remains a Ray Dataset object, so both can later stream to the training workers in parallel.

In [None]:
# 06. Split to train / validation Ray Datasets
train_ds, val_ds = ds_full.split_proportionately([0.8])
print(f"Train rows: {train_ds.count()},  Val rows: {val_ds.count()}")

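The arithmetic behind a proportional split can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Ray's implementation: the helper name `split_boundaries` is hypothetical, and Ray may round or balance rows differently.

```python
def split_boundaries(n_rows, proportions):
    """Return (start, end) row ranges: one per proportion plus the remainder."""
    if not proportions or any(p <= 0 for p in proportions) or sum(proportions) >= 1:
        raise ValueError("proportions must be positive and sum to less than 1")
    bounds, start = [], 0
    for p in proportions:
        end = start + int(n_rows * p)
        bounds.append((start, end))
        start = end
    bounds.append((start, n_rows))  # whatever is left forms the final split
    return bounds

# A hypothetical 1,000-row dataset split 80 / 20.
print(split_boundaries(1000, [0.8]))  # [(0, 800), (800, 1000)]
```

Passing a single proportion such as `[0.8]` therefore yields two splits: the requested 80% and the remaining 20%.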
### 7 · Inspect a mini-batch  
Taking a tiny pandas batch helps verify that feature columns and labels have the expected shapes and types. You also build `feature_columns`, a list you reuse when building XGBoost’s `DMatrix`.


In [None]:
# 07. Look into one batch to confirm feature dimensionality
batch = train_ds.take_batch(batch_size=5, batch_format="pandas")
print(batch.head())
feature_columns = [c for c in batch.columns if c != "label"]

### 8 · Custom per-worker training loop  
Ray Train launches one copy of `train_func` on every worker (16 workers with one CPU each in this tutorial).  
Inside the loop you:  
1. Pull the local shard of both the training and validation Ray datasets.  
2. Convert each pandas shard into an XGBoost `DMatrix` (an efficient compressed sparse row (CSR) format).  
3. Resume from an existing checkpoint if Ray passed one in with `get_checkpoint()`.  
4. Call `xgb.train`, handing it a `RayTrainReportCallback` so that **every boosting round automatically reports metrics**.  

In [None]:
# 08. Custom Ray Train loop for XGBoost (CPU)

def train_func(config):
    """Per-worker training loop executed by Ray Train."""

    # --------------------------------------------------------
    # 1. Pull this worker’s data shard from Ray Datasets
    # --------------------------------------------------------
    label_col   = config["label_column"]
    train_df    = get_dataset_shard("train").materialize().to_pandas()
    eval_df     = get_dataset_shard("evaluation").materialize().to_pandas()
    feature_cols = [c for c in train_df.columns if c != label_col]

    # Convert pandas → DMatrix (fast CSR format used by XGBoost)
    dtrain = xgb.DMatrix(train_df[feature_cols], label=train_df[label_col])
    deval  = xgb.DMatrix(eval_df[feature_cols],  label=eval_df[label_col])

    # --------------------------------------------------------
    # 2. Train booster — RayTrainReportCallback handles:
    #       • per-round ray.train.report(...)
    #       • checkpoint upload to Ray storage
    # --------------------------------------------------------

    # Optional resume from checkpoint (Ray sets this automatically if resuming)
    ckpt = get_checkpoint()
    if ckpt:
        with ckpt.as_directory() as d:
            model_path = os.path.join(d, RayTrainReportCallback.CHECKPOINT_NAME)
            booster = xgb.Booster()
            booster.load_model(model_path)
            print(f"[Rank {get_context().get_world_rank()}] Resumed from checkpoint")
    else:
        booster = None
    
    evals_result = {}  # <- XGBoost fills this with per-iteration metrics

    xgb.train(
        params          = config["params"],
        dtrain          = dtrain,
        evals           = [(dtrain, "train"), (deval, "validation")],
        num_boost_round = config["num_boost_round"],
        xgb_model       = booster,       # continues from the checkpoint if not None
        evals_result    = evals_result,  # captures metrics per round
        callbacks       = [RayTrainReportCallback()],  # reports metrics and checkpoints to Ray Train
    )
    # --------------------------------------------------------
    # 3. Rank-0 writes metrics JSON to the shared path
    # --------------------------------------------------------

    if get_context().get_world_rank() == 0:
        out_json_path = config["out_json_path"]

        # Optionally add a quick “best” summary for convenience
        v_hist = evals_result.get("validation", {}).get("mlogloss", [])
        best_idx = int(np.argmin(v_hist)) if len(v_hist) else None
        payload = {
            "evals_result": evals_result,
            "best": {
                "iteration": (best_idx + 1) if best_idx is not None else None,
                "validation-mlogloss": (float(v_hist[best_idx]) if best_idx is not None else None),
            },
        }

        os.makedirs(os.path.dirname(out_json_path), exist_ok=True)
        with open(out_json_path, "w") as f:
            json.dump(payload, f)
        print(f"[Rank 0] Wrote metrics JSON → {out_json_path}")

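Note that `train_func` passes the restored booster to `xgb.train` with the same `num_boost_round`, so every resume adds another full set of rounds. If you would rather cap the total number of rounds, one option is to subtract the rounds already completed. The helper below is a hypothetical sketch of that arithmetic and is not part of the tutorial's code.

```python
def remaining_rounds(target_total, completed):
    """Rounds still needed to reach target_total after `completed` rounds."""
    if target_total < 0 or completed < 0:
        raise ValueError("round counts must be non-negative")
    return max(0, target_total - completed)

# Fresh start: nothing trained yet, so all 50 rounds remain.
print(remaining_rounds(50, 0))   # 50
# Resumed from a checkpoint written after round 30.
print(remaining_rounds(50, 30))  # 20
# Checkpoint already past the target: train no further.
print(remaining_rounds(50, 60))  # 0
```

Inside `train_func`, `completed` could come from `booster.num_boosted_rounds()` when a checkpoint is restored. Capping rounds this way would change what a second `trainer.fit()` call does, so treat it as an alternative design rather than a drop-in fix.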
### 9 · Configure XGBoost and build the Trainer  
Here you define the model hyperparameters (objective, number of classes, CPU tree method, and so on) and wrap `train_func` inside an `XGBoostTrainer`.  
* `ScalingConfig(num_workers=16, use_gpu=False)` launches 16 CPU-only workers, one CPU each by default.  
* `CheckpointConfig(checkpoint_frequency=10, num_to_keep=1)` saves a checkpoint every 10 boosting rounds and keeps only the most recent one.  
* `FailureConfig(max_failures=1)` tells Ray to retry training once if a worker crashes.  

Because you pass the Ray Datasets directly, Ray takes care of sharding them evenly across workers.

In [None]:
# 09. XGBoost config + Trainer (uses train_func above)
xgb_params = {
    "objective": "multi:softprob",
    "num_class": 7,
    "eval_metric": "mlogloss",
    "tree_method": "hist",  # CPU histogram algorithm 
    "eta": 0.3,
    "max_depth": 8,
}

trainer = XGBoostTrainer(
    train_func,                
    scaling_config   = ScalingConfig(num_workers=16, use_gpu=False),
    datasets         = {"train": train_ds, "evaluation": val_ds},
    train_loop_config={
        "label_column": "label",
        "params": xgb_params,
        "num_boost_round": 50,  # Increase or decrease to adjust training iterations
        "out_json_path": "/mnt/cluster_storage/covtype/results/covtype_xgb_cpu/metrics.json",
    },
    run_config       = RunConfig(
        name="covtype_xgb_cpu",
        storage_path="/mnt/cluster_storage/covtype/results",
        checkpoint_config=CheckpointConfig(checkpoint_frequency=10, num_to_keep=1),
        failure_config=FailureConfig(max_failures=1),  # retry once after a failure
    ),
)

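To see what a checkpoint frequency and a retention limit imply together, the following pure-Python sketch simulates which boosting rounds still have a checkpoint when training ends. It is a simplified model that assumes checkpoints are retained by recency; the helper name `retained_checkpoints` is hypothetical and not part of Ray.

```python
def retained_checkpoints(num_rounds, frequency, num_to_keep):
    """Simulate which boosting rounds still have a checkpoint at the end."""
    if frequency <= 0 or num_to_keep <= 0:
        raise ValueError("frequency and num_to_keep must be positive")
    kept = []
    for round_idx in range(1, num_rounds + 1):
        if round_idx % frequency == 0:
            kept.append(round_idx)
            kept = kept[-num_to_keep:]  # drop the oldest beyond the limit
    return kept

print(retained_checkpoints(50, 10, 1))  # [50]
print(retained_checkpoints(50, 10, 3))  # [30, 40, 50]
```

A larger `num_to_keep` costs more storage but leaves earlier rounds available if the latest checkpoint turns out to be unusable.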
### 10 · Start distributed training  
`trainer.fit()` blocks until all boosting rounds finish (or until Ray exhausts retries). The result object contains the last reported metrics and the latest checkpoint. Keep a handle to that checkpoint for evaluation and inference; the next step plots the validation log-loss.

In [None]:
# 10. Fit the trainer (reports eval metrics every boosting round)
result = trainer.fit()
best_ckpt = result.checkpoint            # saved automatically by Trainer 

### 11 · Plot log-loss over boosting rounds  
During training you captured the full per-round evaluation history using XGBoost’s built-in `evals_result` and saved it to JSON. Reloading that JSON now gives you both training and validation log-loss values for each boosting round. Plotting these lists against their round index shows how the model converges: training loss typically decreases steadily, while validation loss follows with a small gap.

In [None]:
# 11. Plot evaluation history from saved JSON

with open("/mnt/cluster_storage/covtype/results/covtype_xgb_cpu/metrics.json") as f:
    payload = json.load(f)

hist = payload["evals_result"]
train = hist["train"]["mlogloss"]
val   = hist["validation"]["mlogloss"]

xs = np.arange(1, len(val) + 1)
plt.figure(figsize=(7,4))
plt.plot(xs, train, label="Train")
plt.plot(xs, val,   label="Val")
plt.xlabel("Boosting round"); plt.ylabel("Log-loss"); plt.title("XGBoost log-loss")
plt.grid(True); plt.legend(); plt.tight_layout(); plt.show()

best = payload["best"]["validation-mlogloss"]
print("Best validation log-loss:", best)

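One way to read such a curve programmatically is to locate the round where validation loss stops improving. The sketch below applies a simple patience rule to a list of losses. The helper name `early_stop_round` and the loss values are hypothetical, and XGBoost's own early stopping may differ in detail.

```python
def early_stop_round(val_losses, patience):
    """Return (stop_round, best_round), both 1-based, for a loss history."""
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            return round_idx, best_round  # no improvement for `patience` rounds
    return len(val_losses), best_round

# Hypothetical validation log-loss values, not results from this tutorial.
history = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62]
print(early_stop_round(history, patience=3))  # (7, 4)
```

Applied to the `val` list loaded from the metrics JSON, a large gap between the stop round and the last round would suggest that later rounds added little.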
### 12 · Evaluate the trained model  
Pull the XGBoost `Booster` back from the checkpoint, run predictions on the entire validation set, and compute overall accuracy. Converting the Ray Dataset to pandas keeps the example short; in production you could stream batches instead of materialising the whole frame.


In [None]:
# 12. Retrieve Booster object from Ray Checkpoint
booster = RayTrainReportCallback.get_model(best_ckpt)

# Convert Ray Dataset → pandas for quick local scoring
val_pd = val_ds.to_pandas()
dmatrix = xgb.DMatrix(val_pd[feature_columns])
pred_prob = booster.predict(dmatrix)
pred_labels = np.argmax(pred_prob, axis=1)

acc = accuracy_score(val_pd.label, pred_labels)
print(f"Validation accuracy: {acc:.3f}")

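For validation sets too large to materialize, accuracy can be accumulated batch by batch instead. The following sketch shows only the accumulation logic in plain Python; the helper name `streaming_accuracy` and the toy batches are assumptions for illustration.

```python
def streaming_accuracy(batches):
    """Accumulate accuracy over (true_labels, predicted_labels) batches."""
    correct = total = 0
    for y_true, y_pred in batches:
        if len(y_true) != len(y_pred):
            raise ValueError("label and prediction batches differ in length")
        correct += sum(int(t == p) for t, p in zip(y_true, y_pred))
        total += len(y_true)
    return correct / total if total else float("nan")

# Two hypothetical batches standing in for streamed validation data.
batches = [([0, 1, 2, 1], [0, 1, 1, 1]), ([3, 3], [3, 0])]
print(streaming_accuracy(batches))  # 4 correct out of 6
```

In practice the batches could come from `val_ds.iter_batches(batch_format="pandas")`, with predictions computed per batch by the booster.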
### 13 · Confusion matrix visualisation  
Raw counts and row-normalised ratios highlight which cover types the model confuses most often. Diagonal dominance indicates good performance; off-diagonal hot spots may suggest a need for more data or feature engineering for those specific classes.

In [None]:
# 13. Confusion matrix

cm = confusion_matrix(val_pd.label, pred_labels)  # rows = true class, columns = predicted class

sns.heatmap(cm, annot=True, fmt="d", cmap="viridis")
plt.title("Confusion Matrix with Counts")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()

cm_norm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt=".2f", cmap="viridis")
plt.title("Normalized Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()

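Row normalization divides each row by its sum, which would produce NaN if a class were absent from the validation split. The sketch below shows a guarded version on plain lists and reads per-class recall off the diagonal. The helper names and the 3-class counts are hypothetical.

```python
def row_normalize(matrix):
    """Divide each row by its sum; rows that sum to zero stay all-zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

def per_class_recall(matrix):
    """Recall for class i is the diagonal entry of the row-normalized matrix."""
    return [row[i] for i, row in enumerate(row_normalize(matrix))]

# Hypothetical 3-class counts (rows = true class, columns = predicted).
cm_counts = [[8, 2, 0], [1, 9, 0], [0, 0, 0]]
print(per_class_recall(cm_counts))  # [0.8, 0.9, 0.0]
```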
### 14 · CPU batch inference with Ray remote tasks  
To demonstrate scalable inference, send a 1024-row pandas batch to a **single CPU worker**.  The remote function loads the model once per task, converts the batch to `DMatrix`, and returns class indices. Measure accuracy on the fly to confirm that out-of-process inference matches earlier results.


In [None]:
# 14. Example: Run batch inference using Ray remote task on a CPU worker

# This remote function is scheduled on a CPU-enabled Ray worker.
# It loads a trained XGBoost model from a Ray checkpoint and runs predictions on a pandas DataFrame.
@ray.remote(num_cpus=1)
def predict_batch(ckpt, batch_pd):
    # Load the trained XGBoost Booster model from the checkpoint.
    model = RayTrainReportCallback.get_model(ckpt)

    # Convert the input batch (pandas DataFrame) to DMatrix, required by XGBoost for inference.
    dmatrix = xgb.DMatrix(batch_pd[feature_columns])

    # Predict class probabilities for each row in the batch.
    preds = model.predict(dmatrix)

    # Select the class with highest predicted probability for each row.
    return np.argmax(preds, axis=1)

# Take a random sample of 1024 rows from the validation set to use as input.
sample_batch = val_pd.sample(1024, random_state=0)

# Submit the batch inference task to a Ray worker and block until it finishes.
preds = ray.get(predict_batch.remote(best_ckpt, sample_batch))

# Compute and print classification accuracy by comparing predictions to true labels.
print("Sample batch accuracy:", accuracy_score(sample_batch.label, preds))

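Scaling this pattern beyond one batch mostly means cutting the data into chunks and submitting one task per chunk. The chunking step can be sketched in plain Python; the helper name `chunk_ranges` is hypothetical.

```python
def chunk_ranges(n_rows, chunk_size):
    """Yield (start, stop) index pairs covering n_rows in chunk_size pieces."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

print(list(chunk_ranges(2500, 1024)))
# [(0, 1024), (1024, 2048), (2048, 2500)]
```

Each range could slice `val_pd.iloc[start:stop]` and be submitted with `predict_batch.remote(best_ckpt, chunk)`. Calling `ray.get` on the resulting list of futures returns the predictions in submission order.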
### 15 · Feature-importance diagnostics  
XGBoost’s built-in `get_score(importance_type="gain")` ranks each feature by its average gain across all splits. Visualising the top 15 helps connect model behaviour back to domain knowledge. For example, elevation and soil type often dominate forest-cover prediction.

In [None]:
# 15. Gain‑based feature importance
importances = booster.get_score(importance_type="gain")
keys, gains = zip(*sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:15])

plt.barh(range(len(gains)), gains)
plt.yticks(range(len(gains)), keys)
plt.gca().invert_yaxis()
plt.title("Top-15 Feature Importances (gain)"); plt.xlabel("Average gain"); plt.show()

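`Booster.get_score` omits features that were never used in a split, so a ranking built from it alone can silently hide unused columns. The sketch below fills those in with zero before ranking. The helper name `top_k_importances` and the gain values are hypothetical.

```python
def top_k_importances(scores, all_features, k):
    """Rank features by gain, treating features missing from `scores` as 0."""
    full = {name: scores.get(name, 0.0) for name in all_features}
    ranked = sorted(full.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

# Hypothetical gain values; "Slope" never appears in a split.
gain_scores = {"Elevation": 120.5, "Aspect": 3.2}
features = ["Elevation", "Aspect", "Slope"]
print(top_k_importances(gain_scores, features, k=3))
# [('Elevation', 120.5), ('Aspect', 3.2), ('Slope', 0.0)]
```

In the notebook, `scores` would be the `importances` dictionary and `all_features` the `feature_columns` list.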
### 16 · Continue training from the latest checkpoint  
Because `train_func` always checks for `get_checkpoint()`, re-invoking `trainer.fit()` automatically resumes boosting from where you left off. Call `fit()` a second time and keep the new checkpoint for the inference check that follows.

In [None]:
# 16. Run 50 more training iterations from the last saved checkpoint
result = trainer.fit()
best_ckpt = result.checkpoint            # saved automatically by Trainer 

### 17 · Verify post-training inference 
Rerun the same batch-inference helper to check whether the extra boosting rounds improved accuracy.

In [None]:
# 17. Rerun example batch inference from before to verify improved accuracy:

# Take a random sample of 1024 rows from the validation set to use as input.
sample_batch = val_pd.sample(1024, random_state=0)

# Submit the batch inference task to a Ray worker and block until it finishes.
preds = ray.get(predict_batch.remote(best_ckpt, sample_batch))

# Compute and print classification accuracy by comparing predictions to true labels.
print("Sample batch accuracy:", accuracy_score(sample_batch.label, preds))

### 18 · Cleanup  
Finally, tidy up by deleting the Parquet copy, the checkpoint folders, the metrics JSON, and any intermediate result directories. Clearing out old artefacts frees disk space and leaves your workspace clean for whatever comes next.

In [None]:
# 18. Optional clean‑up to free space
ARTIFACT_DIR = "/mnt/cluster_storage/covtype"
if os.path.exists(ARTIFACT_DIR):
    shutil.rmtree(ARTIFACT_DIR)
    print(f"Deleted {ARTIFACT_DIR}")

## 🎉 Wrapping Up & Next Steps

Awesome work making it to the end. You’ve built a fast and fault-tolerant XGBoost training loop that runs on real data, scales across CPUs, recovers from worker failures, and supports batch inference, all inside a single notebook.

You should now feel confident:

* Using **Ray Data** to ingest, shuffle, and shard large tabular datasets across a cluster  
* Defining custom `train_func`s that run on **Ray Train** workers and resume seamlessly from checkpoints  
* Tracking per-round metrics and saving checkpoints with **RayTrainReportCallback**  
* Leveraging **Ray’s distributed execution model** to evaluate and monitor models without manual orchestration  
* Launching remote CPU-powered inference tasks using **Ray remote functions** for scalable batch scoring


---

### 🚀 Where can you take this next?

Below are a few directions you might explore to adapt or extend the pattern:

1. **Early Stopping & Best Iteration Tracking**  
   * Add `early_stopping_rounds=10` to `xgb.train` and log the best round.  
   * Track performance delta across resumed runs.

2. **Hyperparameter Sweeps**  
   * Wrap the trainer with **Ray Tune** and search over `eta`, `max_depth`, or `subsample`.  
   * Use Tune’s built-in checkpoint pruning and log callbacks.

3. **Feature Engineering at Scale**  
   * Create new features using Ray Data’s `Dataset.map_batches`, such as terrain interactions or log-scaled distances.  
   * Materialize multiple Parquet shards and benchmark load time.

4. **Model Interpretability**  
   * Use XGBoost’s built-in `Booster.get_score` for feature attributions.  
   * Rank features by importance and validate with domain knowledge.

5. **Serving the Model**  
   * Package the Booster as a Ray task or **Ray Serve** endpoint.  
   * Deploy an API that takes a feature vector and returns the predicted cover type.

6. **Real-Time Logging**  
   * Integrate with MLflow or Weights & Biases to store logs, plots, and checkpoints.  
   * Use tags and metadata to track experiments over time.

7. **Alternative Objectives**  
   * Try a binary objective (for example, presence vs. absence of a species) or regression target (for example, canopy height).  
   * Fine-tune loss functions for specific ecological tasks.

8. **End-to-End MLOps**  
   * Schedule retraining with Ray Jobs or Anyscale Jobs.  
   * Upload new data snapshots and trigger daily training runs with automatic checkpoint cleanup.
