### 1. Imports  
Before you touch any data, install the pinned dependencies and import every tool you need.  
Alongside the standard scientific-Python stack, bring in **XGBoost** for gradient-boosted decision trees and **Ray** for distributed data loading and training. Ray Train’s helper classes (`RunConfig`, `ScalingConfig`, `CheckpointConfig`, `FailureConfig`) give you fault-tolerant CPU training with almost no extra code.


In [None]:
# 00. Runtime setup 
import os, sys, subprocess

# Opt in to the Ray Train V2 API (non-secret env var; set it before importing ray.train)
os.environ["RAY_TRAIN_V2_ENABLED"] = "1"

# Install Python dependencies 
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "--no-cache-dir",
    "matplotlib==3.10.6",
    "scikit-learn==1.7.2",
    "pyarrow==14.0.2",    
    "xgboost==3.0.5",
    "seaborn==0.13.2",
])

In [None]:
# 01. Imports
import os, shutil, json, uuid, tempfile, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_covtype
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import xgboost as xgb
import pyarrow as pa

import ray
import ray.data as rd
from ray.data import ActorPoolStrategy
from ray.train import (
    RunConfig,
    ScalingConfig,
    CheckpointConfig,
    FailureConfig,
    get_dataset_shard,
    get_checkpoint,
    get_context,
)
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

### 2. Load the University of California, Irvine (UCI) Cover type dataset  
The Cover type dataset contains ~580,000 forest-cover observations with 54 tabular features and a 7-class label. Fetch it with `sklearn.datasets.fetch_covtype`, rename the target column to `label` (the name the rest of this tutorial uses), and shift the classes from **1-7** to **0-6** so they're zero-indexed as XGBoost expects. An assertion and a quick `value_counts` sanity check confirm that the mapping worked.


In [None]:
# 02. Load the UCI Cover type dataset (~580k rows, 54 features)
data = fetch_covtype(as_frame=True)
df = data.frame
df.rename(columns={"Cover_Type": "label"}, inplace=True)   # tutorial-wide label column name
df["label"] = df["label"] - 1          # 1-7  →  0-6
assert df["label"].between(0, 6).all()
print(df.shape, df.label.value_counts(normalize=True).head())
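
The relabeling step can be checked in isolation on a toy frame. This is a minimal sketch, assuming a hypothetical five-row frame `toy` in place of the real dataset; only the column name `Cover_Type` and the shift by one come from the cell above.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumption, not real data).
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804, 2785, 2595],
    "Cover_Type": [5, 5, 2, 2, 7],
})

# Same two steps as the cell above: rename the target, then shift 1-7 to 0-6.
toy = toy.rename(columns={"Cover_Type": "label"})
toy["label"] = toy["label"] - 1

# Every label must now fall in 0..6, the range a 7-class XGBoost objective expects.
assert toy["label"].between(0, 6).all()
counts = toy["label"].value_counts().sort_index()
print(counts.to_dict())
```

Using `rename` without `inplace=True` returns a new frame, which keeps the toy example free of side effects; the notebook cell mutates `df` in place instead.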

### 3. Visualize class balance  
Highly imbalanced targets can bias tree-based models, so plot the raw label counts. The Cover type distribution is visibly skewed; the bar chart lets you judge whether resampling or class weighting is necessary. For this tutorial, skip both and rely on XGBoost’s default behavior.

In [None]:
# 03. Visualize class distribution
df.label.value_counts().plot(kind="bar", figsize=(6,3), title="Cover Type distribution")
plt.ylabel("Frequency"); plt.show()
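
If the chart suggests that re-weighting is worth trying, one common option is inverse-frequency sample weights. The sketch below is not part of this tutorial's pipeline: it uses a hypothetical label array and the "balanced" heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`.

```python
import numpy as np

# Hypothetical, deliberately skewed labels (assumption: not the real Cover type counts).
labels = np.array([0] * 6 + [1] * 3 + [2] * 1)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
classes, counts = np.unique(labels, return_counts=True)
class_weight = len(labels) / (len(classes) * counts)

# Map each row to the weight of its class; rare classes get larger weights.
# The classes here are 0..k-1 with none missing, so a label doubles as an index.
sample_weight = class_weight[labels]

print(dict(zip(classes.tolist(), class_weight.round(3).tolist())))
```

Weights like these could be passed to XGBoost through the `weight` argument of `xgb.DMatrix`. Whether they help depends on the metric you care about, so treat them as an experiment and not as a default.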

### 4. Write train / validation Parquet files  

Rather than splitting a large dataset in memory later, you persist **train** and **validation** splits up front.  
Each split is written to the cluster’s shared volume (`/mnt/cluster_storage`) so that all Ray workers can access it directly.  
This approach keeps the workflow reproducible and avoids rematerializing the dataset during distributed training.  

You perform a **stratified 80 / 20 split** to preserve class balance across splits, then write each subset to its own Parquet file.  
Parquet is columnar and compressed, making it ideal for Ray Data ingestion and parallel reads.  

In [None]:
# 04. Write separate train/val Parquet files to /mnt/cluster_storage/covtype/parquet/

PARQUET_DIR = "/mnt/cluster_storage/covtype/parquet"
os.makedirs(PARQUET_DIR, exist_ok=True)

TRAIN_PARQUET = os.path.join(PARQUET_DIR, "train.parquet")
VAL_PARQUET   = os.path.join(PARQUET_DIR, "val.parquet")

# Stratified 80/20 split: stratify preserves class balance, random_state makes it reproducible
train_df, val_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)

train_df.to_parquet(TRAIN_PARQUET, index=False)
val_df.to_parquet(VAL_PARQUET, index=False)

print(f"Wrote Train → {TRAIN_PARQUET} ({len(train_df):,} rows)")
print(f"Wrote Val   → {VAL_PARQUET}   ({len(val_df):,} rows)")
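
To see what stratification buys you, compare class shares across the two splits. This is a minimal sketch on a hypothetical 1,000-row frame with an 80/15/5 class mix (an assumption chosen for easy arithmetic, not the real dataset); the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical skewed toy data: 80% class 0, 15% class 1, 5% class 2.
toy = pd.DataFrame({
    "x": np.arange(1000),
    "label": np.repeat([0, 1, 2], [800, 150, 50]),
})

train_part, val_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Compare class shares in each split; stratification keeps them nearly identical.
shares = pd.DataFrame({
    "train": train_part["label"].value_counts(normalize=True),
    "val": val_part["label"].value_counts(normalize=True),
}).sort_index()
print(shares)
```

Running the same comparison on `train_df` and `val_df` is a cheap check before writing the Parquet files.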

### 5. Load the train and validation splits as Ray Datasets  

Now that the data is stored in Parquet, you load each split directly with `ray.data.read_parquet`.  
Each call returns a **lazy, columnar Ray Dataset** that supports distributed reads and transformations across the cluster.  

Calling `.random_shuffle()` on the training split randomizes row order, so training workers don't consume the data in the order it was written.  
Leaving the validation split unshuffled preserves its deterministic order for evaluation.  

From this point forward, all data access is **parallel and streaming**, eliminating single-node I/O bottlenecks.


In [None]:
# 05. Load the two splits as Ray Datasets (lazy, columnar)
train_ds = rd.read_parquet(TRAIN_PARQUET).random_shuffle()
val_ds   = rd.read_parquet(VAL_PARQUET)

print(train_ds)
print(val_ds)

### 6. Inspect dataset sizes (optional)

After loading the Parquet files, confirm that both splits were read correctly by counting their rows.  
This step runs a distributed count across the cluster and verifies that the  
**train / validation partitioning** matches the expected 80 / 20 ratio before moving on to distributed training.  
Because the training split is shuffled, counting it executes the read and the shuffle, so skip this step on very large datasets.  


In [None]:
# 06. Count rows in each split. This executes the pipeline, including the shuffle, so skip it at scale.
print(f"Train rows: {train_ds.count():,},  Val rows: {val_ds.count():,}")

### 7. Inspect a mini-batch  
Taking a tiny pandas batch helps verify that feature columns and labels have the expected shapes and types. You also build `feature_columns`, a list you reuse when building XGBoost’s `DMatrix`.


In [None]:
# 07. Look into one batch to confirm feature dimensionality
batch = train_ds.take_batch(batch_size=5, batch_format="pandas")
print(batch.head())
feature_columns = [c for c in batch.columns if c != "label"]