### 15. Continue training from the latest checkpoint
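Because `train_func` always checks `get_checkpoint()` before it starts, calling `trainer.fit()` again resumes boosting from the latest saved checkpoint instead of starting over. Call `fit()` a second time, print the latest reported metrics (including the validation log-loss), and keep the resulting checkpoint for the next step.

In [None]:
# 15. Run 50 more training iterations from the last saved checkpoint
result = trainer.fit()
print(result.metrics)                    # Latest reported metrics, including validation log-loss
best_ckpt = result.checkpoint            # Latest checkpoint reported by the run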
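
The resume behavior relies on a simple pattern inside `train_func`: look for a checkpoint first, and start from scratch only if none exists. The toy sketch below illustrates that control flow with plain Python and a JSON file. `run_rounds` and the `state.json` layout are hypothetical stand-ins for illustration, not Ray Train or XGBoost APIs.

```python
import json
import os
import tempfile


def run_rounds(ckpt_dir, num_rounds):
    """Toy stand-in for train_func: resume from a saved round counter if present."""
    state_path = os.path.join(ckpt_dir, "state.json")

    # Analogous to checking get_checkpoint(): load prior state when it exists.
    if os.path.exists(state_path):
        with open(state_path) as f:
            start_round = json.load(f)["rounds_done"]
    else:
        start_round = 0

    # "Train" for num_rounds additional rounds beyond the restored state.
    rounds_done = start_round + num_rounds

    # Analogous to reporting a checkpoint at the end of the run.
    with open(state_path, "w") as f:
        json.dump({"rounds_done": rounds_done}, f)
    return start_round, rounds_done


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = run_rounds(ckpt_dir, 50)   # fresh start: no saved state yet
    second = run_rounds(ckpt_dir, 50)  # resumes from the first run's state
    print(first, second)
```

The second call starts where the first one stopped, which is the same effect a repeated `trainer.fit()` has on the boosting rounds.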

### 16. Verify post-training inference  

Rerun the Ray Data inference pipeline with the latest checkpoint to check whether the
additional boosting rounds improved validation accuracy.
This step reuses the same actor-based `XGBPredictor` setup as before, so the evaluation stays consistent and scalable.

In [None]:
# 16. Rerun Ray Data inference to verify improved accuracy after continued training

# Reuse the existing Ray Data inference setup with the latest checkpoint
pred_ds = val_ds.map_batches(
    XGBPredictor,
    fn_constructor_args=(best_ckpt, feature_columns),
    batch_format="pandas",
    compute=ActorPoolStrategy(),
    num_cpus=1,
)

# Aggregate accuracy across all batches
stats_ds = pred_ds.map_batches(
    lambda df: pd.DataFrame({
        "correct": [int((df["pred"] == df["label"]).sum())],
        "n": [int(len(df))]
    }),
    batch_format="pandas",
)

correct = int(stats_ds.sum("correct"))
n = int(stats_ds.sum("n"))
print(f"Validation accuracy after continued training: {correct / n:.3f}")
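
The aggregation step sums per-batch counts instead of averaging per-batch accuracies, because batches can differ in size. The pure-Python sketch below shows the difference. The counts are made up for illustration and are not results from this notebook.

```python
# Hypothetical per-batch results as (correct, n); batch sizes differ on purpose.
batches = [(90, 100), (5, 10)]

# Summing counts first weights every row equally, as the pipeline above does.
total_correct = sum(c for c, _ in batches)
total_n = sum(n for _, n in batches)
pooled_accuracy = total_correct / total_n

# Averaging per-batch accuracies over-weights the small batch.
mean_of_batch_accuracies = sum(c / n for c, n in batches) / len(batches)

print(f"pooled: {pooled_accuracy:.3f}, mean of batches: {mean_of_batch_accuracies:.3f}")
```

The two numbers agree only when every batch has the same number of rows, which Ray Data doesn't guarantee.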

### 17. Clean up  
Finally, tidy up by deleting the artifact directory that holds the temporary checkpoints, the metrics CSV, and any intermediate results. Clearing out old artifacts frees disk space and leaves your workspace clean for whatever comes next. The deletion is permanent, so skip this step if you still need the checkpoints.

In [None]:
# 17. Optional cleanup to free space
import os
import shutil

ARTIFACT_DIR = "/mnt/cluster_storage/covtype"
if os.path.exists(ARTIFACT_DIR):
    shutil.rmtree(ARTIFACT_DIR)          # Permanently removes the directory and its contents
    print(f"Deleted {ARTIFACT_DIR}")
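
To see how much space a cleanup reclaims, you can total the file sizes under a directory before deleting it. The sketch below uses only the standard library and runs against a throwaway temporary directory rather than the real artifact path. `dir_size_bytes` is a hypothetical helper, not part of Ray.

```python
import os
import shutil
import tempfile


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Build a throwaway directory that stands in for an artifact folder.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "metrics.csv"), "w") as f:
    f.write("round,logloss\n")

# Measure first, then delete.
reclaimed = dir_size_bytes(demo_dir)
shutil.rmtree(demo_dir)
print(f"Deleted {demo_dir}, reclaimed {reclaimed} bytes")
```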

### Wrap up and next steps

You built a fast and fault-tolerant XGBoost training loop that runs on real data, scales across CPUs, recovers from worker failures, and supports batch inference, all inside a single notebook.

You should now feel confident:

* Using **Ray Data** to ingest, shuffle, and shard large tabular datasets across a cluster  
* Defining custom `train_func`s that run on **Ray Train** workers and resume seamlessly from checkpoints  
* Tracking per-round metrics and saving checkpoints with **RayTrainReportCallback**  
* Leveraging **Ray’s distributed execution model** to evaluate and monitor models without manual orchestration  
* Launching CPU-based inference actors with **Ray Data** for scalable batch scoring


---

### Where can you take this next?

Below are a few directions you might explore to adapt or extend the pattern:

1. **Early stopping and best iteration tracking**  
   * Add `early_stopping_rounds=10` to `xgb.train` and log the best round.  
   * Track performance delta across resumed runs.

2. **Hyperparameter sweeps**  
   * Wrap the trainer with **Ray Tune** and search over `eta`, `max_depth`, or `subsample`.  
   * Use Tune’s built-in checkpoint pruning and log callbacks.

3. **Feature engineering at scale**  
   * Create new features using `Dataset.map_batches` from Ray Data, such as terrain interactions or log-scaled distances.  
   * Materialize multiple Parquet shards and benchmark load time.

4. **Model interpretability**  
   * Use XGBoost’s built-in `Booster.get_score` to compute feature importance scores (for example, with `importance_type="gain"`).  
   * Rank features by importance and validate with domain knowledge.

5. **Serving the model**  
   * Package the Booster as a Ray task or **Ray Serve** endpoint.  
   * Deploy an API that takes a feature vector and returns the predicted cover type.

6. **Real-time logging**  
   * Integrate with MLflow or Weights & Biases to store logs, plots, and checkpoints.  
   * Use tags and metadata to track experiments over time.

7. **Alternative objectives**  
   * Try a binary objective (for example, presence versus absence of a species) or regression target (for example, canopy height).  
   * Fine-tune loss functions for specific ecological tasks.

8. **End-to-end MLOps**  
   * Schedule retraining with Ray Jobs or Anyscale Jobs.  
   * Upload new data snapshots and trigger daily training runs with automatic checkpoint cleanup.
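
To make the first idea in the list concrete, the sketch below mimics an early-stopping rule in plain Python: stop once the validation loss hasn't improved for a fixed number of rounds, and remember the best round. It illustrates the idea only and is not XGBoost's implementation. `best_round_with_patience` and the loss values are hypothetical.

```python
def best_round_with_patience(losses, patience):
    """Return (best_round, stopped_round) for a sequence of validation losses."""
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(losses):
        if loss < best_loss:
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, rnd
    # Ran out of rounds without triggering the stopping rule.
    return best_round, len(losses) - 1


# Made-up validation losses: improvement stalls after round 2.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63]
best, stopped = best_round_with_patience(losses, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

With real training you would pass `early_stopping_rounds` to `xgb.train` and let the library apply its own rule. The sketch only shows why the best round and the stopping round can differ.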
