# Distributed XGBoost Training Job
Use distributed XGBoost training across multiple SageMaker-managed instances to train on 35M+ rows from 2023 and evaluate on 10M rows from the first quarter of 2024.

- Uses XGBoost's native categorical feature support (experimental, introduced in version 1.5), so categorical variables do not need to be converted to one-hot encoded factors.

### Data at scale bottlenecks:
- Tree memory: RAM requirements can grow exponentially with tree depth, because a tree of depth d has up to 2^d leaves. Limit the depth to around 10, or restrict the number of leaves to 1024 or fewer.
- Categorical cardinality: restrict the number of distinct categories in each categorical variable. Split finding over categories gets more expensive as cardinality grows, and with thousands of categories XGBoost will choke.
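
Both limits can be sketched in a few lines. This is an illustration only: `max_leaves_for_depth` and `cap_categories` are hypothetical helper names, and the `"other"` bucket and the `top_k` threshold are assumptions rather than part of this pipeline.

```python
from collections import Counter


def max_leaves_for_depth(depth: int) -> int:
    """Upper bound on the number of leaves of a binary tree of this depth."""
    return 2 ** depth


def cap_categories(values, top_k, other="other"):
    """Keep the top_k most frequent categories and map the rest to `other`."""
    keep = {cat for cat, _ in Counter(values).most_common(top_k)}
    return [v if v in keep else other for v in values]


# Depth 10 allows up to 1024 leaves; depth 15 allows up to 32768.
print(max_leaves_for_depth(10), max_leaves_for_depth(15))

# Hypothetical category values: rare ones are folded into one bucket.
zones = ["A", "A", "A", "B", "B", "C", "D"]
print(cap_categories(zones, top_k=2))
```

Capping would belong in the data processing step, so that every shard sees the same category vocabulary.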

### XGBoost Estimator: Points to Note
- When instance_count is greater than 1, distributed training turns on automatically.
- It uses the Rabit protocol to share gradients across the shards on different instances and compute global gradients.
- Global model metrics are calculated automatically only for the implicit evaluation inside XGBoost. Explicit evaluation code that you write yourself is not aggregated across instances.
- Similarly, data processing in the training script runs locally on each shard and is not synchronized globally. For example, one-hot encoding produces a different number of columns on shards that contain a different number of distinct categories.
- So keep data processing and explicit evaluation out of the training script; keep the script minimal and focused on training.
- Keep a consistent schema across all the shards; use a separate data processing step for that.
- Stateless transformations are fine in the training script, as long as they keep the schema consistent across all shards.
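
The schema problem is easy to reproduce on two toy shards. The sketch below is not taken from the project scripts: the `vendorid` values, the `VENDOR_CATEGORIES` vocabulary, and the `encode_shard` helper are all hypothetical. It shows that local one-hot encoding yields different columns per shard, while a vocabulary fixed ahead of time yields the same columns everywhere.

```python
import pandas as pd

# Hypothetical vocabulary, fixed once in the data processing step.
VENDOR_CATEGORIES = ["1", "2", "6"]


def encode_shard(shard: pd.DataFrame, categories=None) -> pd.DataFrame:
    """One-hot encode `vendorid`; passing `categories` fixes the schema."""
    col = shard["vendorid"].astype(str)
    if categories is not None:
        # Values outside `categories` become NaN and get an all-zero row.
        col = col.astype(pd.CategoricalDtype(categories=categories))
    return pd.get_dummies(col, prefix="vendorid")


shard_a = pd.DataFrame({"vendorid": ["1", "2", "1"]})
shard_b = pd.DataFrame({"vendorid": ["2", "6"]})

# Local encoding: each shard only sees its own categories.
print(list(encode_shard(shard_a).columns))
print(list(encode_shard(shard_b).columns))

# Fixed vocabulary: both shards produce identical columns.
print(list(encode_shard(shard_a, VENDOR_CATEGORIES).columns))
print(list(encode_shard(shard_b, VENDOR_CATEGORIES).columns))
```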
  

In [None]:
import sagemaker
print(sagemaker.__version__)
# If < 2.27.0, upgrade:
#!pip install --upgrade sagemaker

In [None]:
import sagemaker 
import boto3 
import pandas as pd 
from datetime import datetime

In [None]:
REGION = sagemaker.session.Session().boto_region_name
print("REGION: ", REGION) 

boto3_session = boto3.Session(region_name=REGION)

sagemaker_boto3_client = boto3_session.client("sagemaker")
s3_boto3_client = boto3_session.client("s3")
sagemaker_session = sagemaker.session.Session(boto_session=boto3_session, sagemaker_client=sagemaker_boto3_client)

BUCKET = sagemaker_session.default_bucket()
PREFIX = "NYC_Taxi_Prediction"

ROLE=sagemaker.get_execution_role()
print("ROLE: ", ROLE)
print("BUCKET: ", BUCKET) 
print("PREFIX: ", PREFIX) 

s3_dir_uri = f"s3://{BUCKET}/{PREFIX}"
print(s3_dir_uri)

In [None]:
# Training Job Output Dir
s3_estimator_out_dir_uri = f'{s3_dir_uri}/02_training_jobs/'
print(s3_estimator_out_dir_uri)

# Training Job input data dir
s3_dataprocess_out_dir_uri = 's3://sagemaker-us-east-1-205930620783/NYC_Taxi_Prediction/01_dataprocessing_jobs/NYC-Taxi-Prediction-2025-08-20-14-24-53'
train_dir = f'{s3_dataprocess_out_dir_uri}/train/'
test_dir = f'{s3_dataprocess_out_dir_uri}/test/'
print(train_dir)

## Training Step
#### First test 'xgboost_model_script.py' on local data

In [None]:
import os
print(os.getcwd())

# Download the small data from s3
#sagemaker_session.download_data(path='data/', bucket="sagemaker-us-east-1-205930620783", key_prefix="NYC_Taxi_Prediction/01_dataprocessing_jobs/NYC-Taxi-Prediction-2025-08-14-12-25-26")


In [None]:
import time

start_time = time.time()

!python scripts/xgboost_model_script.py \
--train-dir "./data/train/" \
--test-dir "./data/test/" \
--model-out-dir "./data/model/" \
--model-data-out-dir "./data/model" \
--target-var "trip_duration_mins" \
--features "features|passenger_count|trip_distance|vendorid|ratecodeid|day_of_week|day_of_month|month_of_year|hour_of_day|week_of_year" \
--num-boost-round 10 \
--max-depth 5 \
--eta 0.1 \
--objective "reg:squarederror" 


end_time = time.time()
elapsed = end_time - start_time
print(f"Execution time: {elapsed:.2f} seconds")
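
The script itself is not shown in this notebook. As a rough sketch of how an entry point could parse the flags passed in the local run, assuming the pipe-separated feature list is split inside the script (the `parse_args` helper and its defaults are hypothetical, not the actual contents of xgboost_model_script.py):

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags used in the local test run."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-dir", type=str, required=True)
    parser.add_argument("--test-dir", type=str, required=True)
    parser.add_argument("--model-out-dir", type=str, required=True)
    parser.add_argument("--model-data-out-dir", type=str, required=True)
    parser.add_argument("--target-var", type=str, required=True)
    # Pipe-separated column names are split into a list of features.
    parser.add_argument("--features", type=lambda s: s.split("|"), required=True)
    # Defaults below are assumptions for illustration.
    parser.add_argument("--num-boost-round", type=int, default=10)
    parser.add_argument("--max-depth", type=int, default=6)
    parser.add_argument("--eta", type=float, default=0.3)
    parser.add_argument("--objective", type=str, default="reg:squarederror")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example independent of sys.argv.
args = parse_args([
    "--train-dir", "./data/train/",
    "--test-dir", "./data/test/",
    "--model-out-dir", "./data/model/",
    "--model-data-out-dir", "./data/model",
    "--target-var", "trip_duration_mins",
    "--features", "passenger_count|trip_distance|hour_of_day",
    "--max-depth", "5",
])
print(args.features, args.max_depth, args.eta)
```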

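### How multi-node training works in SageMaker XGBoost
- Training Phase
  - Data sharding
    - With distribution="ShardedByS3Key", SageMaker splits the S3 objects under the input prefix across the nodes, so each node downloads only its share of the files.
    - Example: with 4 instances and 400,000 rows spread evenly over several files, each worker gets ~100,000 rows. A single file is not split, so store the data as at least as many files as there are instances.
    - Sharding is handled by the SageMaker channel input configuration, so you don't have to code for it.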

  - Rabit initialization
    - SageMaker sets environment variables for distributed training:
      - DMLC_ROLE=worker
      - DMLC_NUM_WORKER=<instance_count>
      - DMLC_TRACKER_URI=<master_node_ip>
      - DMLC_TRACKER_PORT=9091
    - XGBoost detects these and starts Rabit, which sets up synchronous allreduce communication between nodes.


  - Local gradient computation
    - Each worker trains on its local shard, computing gradient and hessian statistics for its subset of the data.
    - Gradient aggregation: Rabit performs an allreduce operation to sum gradients and hessians across all workers.
    - Each worker gets the global sum, so they all make identical split decisions. This is why every node builds the same model.


- Evaluation Phase
  - Local evaluation
    - For each dataset you pass in evals (train, test), each worker computes the evaluation metric (RMSE, logloss, etc.) on its shard only.
  - Global aggregation
    - Rabit allreduces the partial sums (e.g., sum of squared errors) and counts across workers.
    - It then computes the global metric that reflects all shards combined. This is why the train and eval metrics in eval_results are identical on every node: they are global metrics.
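
The gradient aggregation step of the training phase can be imitated without any cluster. The sketch below is a simplification: real XGBoost aggregates per-feature gradient histograms rather than a single pair of sums, and `allreduce_sum` is a hypothetical stand-in for the Rabit call. It uses the squared-error objective, whose gradient is (prediction - label) and whose hessian is 1 per row.

```python
def local_grad_hess(y_true, y_pred):
    """Gradient and hessian sums for squared error on one shard."""
    grad = sum(p - y for y, p in zip(y_true, y_pred))
    hess = float(len(y_true))  # the second derivative is 1 for every row
    return grad, hess


def allreduce_sum(per_worker):
    """Simulated allreduce: every worker receives the element-wise sum."""
    total = tuple(sum(vals) for vals in zip(*per_worker))
    return [total for _ in per_worker]


# Two hypothetical shards of (labels, current predictions).
shards = [
    ([10.0, 12.0, 8.0], [9.0, 9.0, 9.0]),
    ([20.0, 4.0], [9.0, 9.0]),
]
local_stats = [local_grad_hess(y, p) for y, p in shards]
global_stats = allreduce_sum(local_stats)

print(local_stats)   # differs per worker
print(global_stats)  # identical on every worker
```

Because every worker ends up holding the same global sums, each one can evaluate the same candidate splits and arrive at the same tree.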


#### This means:
The RMSE you see in multi-node training is computed over the whole dataset, the same value a single machine would report for the same model and data. There is no “per-node” metric; everything is aggregated.
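
A small worked example of that aggregation, assuming each worker contributes its sum of squared errors and its row count (the helper names and the toy data are hypothetical). It also shows why averaging per-shard RMSE values in explicit evaluation code would give a different, misleading number.

```python
import math


def local_partial(y_true, y_pred):
    """Per-shard sum of squared errors and row count."""
    sse = sum((p - y) ** 2 for y, p in zip(y_true, y_pred))
    return sse, len(y_true)


def global_rmse(partials):
    """Combine per-shard (sse, count) pairs into a single RMSE."""
    total_sse = sum(sse for sse, _ in partials)
    total_n = sum(n for _, n in partials)
    return math.sqrt(total_sse / total_n)


shards = [
    ([10.0, 12.0, 8.0], [9.0, 9.0, 9.0]),
    ([20.0, 4.0], [9.0, 9.0]),
]
partials = [local_partial(y, p) for y, p in shards]
rmse_distributed = global_rmse(partials)

# The same metric computed on the concatenated data, as on one machine.
all_y = [y for ys, _ in shards for y in ys]
all_p = [p for _, ps in shards for p in ps]
rmse_single = global_rmse([local_partial(all_y, all_p)])

# Averaging per-shard RMSEs is NOT equivalent to the global RMSE.
rmse_naive = sum(math.sqrt(sse / n) for sse, n in partials) / len(partials)

print(rmse_distributed, rmse_single, rmse_naive)
```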


#### Parameters to enable multi-node training in SageMaker
There is no special hyperparameter you pass to XGBoost to turn on multi-node training. It is activated automatically if:
- instance_count > 1 in your Estimator.
- You're using an AWS built-in XGBoost container with framework_version >= 0.90-2 (recommended: 1.5-1 or newer).
- Your input data is in S3 and provided via the SageMaker fit() call (so it can be sharded automatically).
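
Since ShardedByS3Key distributes whole S3 objects, it can help to check the file count against instance_count before launching a job. The sketch below is an assumption-heavy illustration: the round-robin split in `assign_keys` only mimics the idea that each instance receives roughly 1/n of the objects, it is not SageMaker's actual assignment logic, and the key names are hypothetical.

```python
def assign_keys(keys, instance_count):
    """Illustrative round-robin split of S3 object keys across instances."""
    if len(keys) < instance_count:
        raise ValueError(
            f"{len(keys)} object(s) cannot feed {instance_count} instances; "
            "write the data as more, smaller files."
        )
    ordered = sorted(keys)
    return [ordered[i::instance_count] for i in range(instance_count)]


# Hypothetical output of a data processing job: 10 part files.
keys = [f"train/part-{i:05d}" for i in range(10)]
assignment = assign_keys(keys, instance_count=4)
print([len(part) for part in assignment])
```

With files of very different sizes the per-instance row counts become unbalanced, so roughly equal file sizes are the safer choice.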

#### How it works in multi-node
XGBoost uses all CPUs available on the instance by default, so two levels of parallelism combine:
- Multi-node parallelism (Rabit): distributes data shards across multiple instances.
- Multi-threaded execution within each node: uses all CPU cores on that node to process its shard faster.
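
Inside the entry-point script, both levels can be inspected through the environment variables that SageMaker script mode exposes (SM_HOSTS, SM_CURRENT_HOST, SM_NUM_CPUS). The sketch below is a hedged example: the `describe_topology` helper and the fallback values used outside SageMaker are assumptions, not part of the training script used here.

```python
import json
import os


def describe_topology(env=None):
    """Summarize the cluster layout from SageMaker environment variables."""
    env = os.environ if env is None else env
    hosts = json.loads(env.get("SM_HOSTS", '["algo-1"]'))
    current = env.get("SM_CURRENT_HOST", "algo-1")
    num_cpus = int(env.get("SM_NUM_CPUS", os.cpu_count() or 1))
    return {
        "num_hosts": len(hosts),                      # multi-node parallelism
        "current_host": current,
        "is_first_host": current == sorted(hosts)[0],
        "nthread": num_cpus,                          # threads on this node
    }


# Hypothetical environment of the second node in a two-node job.
fake_env = {
    "SM_HOSTS": '["algo-1", "algo-2"]',
    "SM_CURRENT_HOST": "algo-2",
    "SM_NUM_CPUS": "8",
}
print(describe_topology(fake_env))
```

A flag like `is_first_host` is one common way to make sure only one node writes the final model artifact.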

In [None]:
from sagemaker.xgboost.estimator import XGBoost 
from sagemaker.inputs import TrainingInput 

xgb_hyperparams = {
    'target-var': 'trip_duration_mins', 
    #'features': "features|passenger_count|trip_distance|vendorid|ratecodeid|day_of_week|day_of_month|month_of_year|hour_of_day|week_of_year", 
    #'features': "pick_drop_loc|passenger_count|trip_distance|vendorid|day_of_week|day_of_month|month_of_year|hour_of_day|week_of_year", 
    'features': "trip_distance", # Is the most important feature here, and almost all performance coming from this factor
    'max-depth':15, 
    'max-leaves':1024, 
    #'eta':0.3, 
    'objective':'reg:squarederror',
    'num-boost-round': 100
}
xgb_estimator = XGBoost(
    framework_version="1.5-1",
    entry_point="scripts/v2-xgboost_model_script.py",
    output_path=s3_estimator_out_dir_uri,       # For model file and metrics file. 
    code_location=s3_estimator_out_dir_uri, # If not provided it will be put in the default bucket
    hyperparameters=xgb_hyperparams,
    role=ROLE,
    instance_count=16,
    instance_type="ml.r5.2xlarge",#"ml.m5.2xlarge",#"ml.r5.xlarge",#"ml.m5.xlarge",
    max_run=3600#,
    #base_job_name=f'NYC_Taxi_Prediction-Training'
)

xgb_estimator.fit(
    inputs={
        # Copies the S3 data to SM_CHANNEL_TRAIN, i.e. /opt/ml/input/data/train. The instance needs enough disk to accommodate large files.
        # The script could also read directly from S3 given the S3 URI, but S3 reads are slower than local reads.
        'train': TrainingInput(train_dir, distribution="ShardedByS3Key"), 
        'test': TrainingInput(test_dir, distribution="ShardedByS3Key")
})
