[Train] XGBoost continue train (resume_from_checkpoint) and get_model failed #41608

Closed
daviddwlee84 opened this issue Dec 5, 2023 · 17 comments · Fixed by #42111
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks train Ray Train Related Issue

Comments


daviddwlee84 commented Dec 5, 2023

What happened + What you expected to happen

When I finish XGBoost training using XGBoostTrainer, I want to continue training from the best checkpoint, but:

  1. Passing resume_from_checkpoint fails to load the checkpoint.
  2. XGBoostTrainer.get_model can't load the model from the checkpoint either.

The first error happens when creating a new trainer with resume_from_checkpoint and is quite similar to #16375:

2023-12-05 10:52:43,353	WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:52:53,378	INFO tune.py:1047 -- Total run time: 11.02 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:52:53,393	WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_5c19d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_5c19d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_5c19d_00000_0_2023-12-05_10-52-42')

The error message becomes the following one (like the second issue) when I remove the early-stopping config stop=ExperimentPlateauStopper('train-error', mode='min') in RunConfig:

xgboost.core.XGBoostError: [11:25:04] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f768b5dc24e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f768b6086f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f768b590731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f768b5909f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe0959829dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe095982067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe09599b1e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fe09599bc95]
  [bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x55b5d791b13f]

The second issue might be related to #41374: either Ray saves the XGBoost model in the legacy binary format, or it cannot load a non-default model filename from the checkpoint. The workaround there does not seem to work.

There are warning logs like this:

(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000015)
(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:41] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.

And calling XGBoostTrainer.get_model(checkpoint) raises this error:

XGBoostError: [11:16:54] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000017/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f49ef86824e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f49ef8946f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f49ef81c731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f49ef81c9f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f4b8c2bd9dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f4b8c2bd067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f4b8c2d61e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f4b8c2d6c95]
  [bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x5581bcf8513f]

Versions / Dependencies

Python 3.8.13

Packages

ray                               2.8.1
xgboost-ray                       0.1.19
xgboost                           2.0.2

OS

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.6 LTS
Release:        18.04
Codename:       bionic

Reproduction script

The reproduction script is based on the official tutorial "Get Started with XGBoost and LightGBM" (Ray 2.8.0).

Load the data and run the first training:

import ray
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig, FailureConfig
from ray.tune.stopper import ExperimentPlateauStopper

ray.init()

dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

run_config = RunConfig(
    name="XGBoost_ResumeExperiment",
    storage_path="/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    # Remove this will get different error message later
    stop=ExperimentPlateauStopper('train-error', mode='min'),
)

scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)

trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)

result = trainer.fit()

During fitting, we get warnings like this:

(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:42] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000018)

Get the Best Checkpoint and Resume

checkpoint = result.get_best_checkpoint('valid-logloss', 'min')

trainer_continue = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    resume_from_checkpoint=checkpoint
)

result_continue = trainer_continue.fit()

This fails with the following error when early stopping is enabled:

2023-12-05 10:25:41,638	WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:25:50,900	INFO tune.py:1047 -- Total run time: 9.96 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:25:50,911	WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_95a7d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_95a7d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_95a7d_00000_0_2023-12-05_10-25-40')

And with this error when early stopping is disabled:

xgboost.core.XGBoostError: [11:25:25] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f2f6f5dc24e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f2f6f6086f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f2f6f590731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f2f6f5909f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f9976bab9dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f9976bab067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f9976bc41e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f9976bc4c95]
  [bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x556d2854d13f]

This is the same error as the one from:

model = XGBoostTrainer.get_model(checkpoint)
XGBoostError: [11:36:40] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f105a97824e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f105a9a46f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f105a92c731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f105a92c9f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f11f73d09dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f11f73d0067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f11f73e91e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f11f73e9c95]
  [bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x55ea790af13f]

Issue Severity

High: It blocks me from completing my task.

@daviddwlee84 daviddwlee84 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 5, 2023
@daviddwlee84 (Author)

From ray.train.xgboost.xgboost_trainer (Ray 2.8.0):

@staticmethod
def get_model(checkpoint: Checkpoint) -> xgboost.Booster:
    """Retrieve the XGBoost model stored in this checkpoint."""
    with checkpoint.as_directory() as checkpoint_path:
        booster = xgboost.Booster()
        booster.load_model(
            os.path.join(checkpoint_path, XGBoostCheckpoint.MODEL_FILENAME)
        )
        return booster

def _train(self, **kwargs):
    return xgboost_ray.train(**kwargs)

def _load_checkpoint(self, checkpoint: Checkpoint) -> xgboost.Booster:
    return self.__class__.get_model(checkpoint)

def _save_model(self, model: xgboost.Booster, path: str):
    model.save_model(os.path.join(path, XGBoostCheckpoint.MODEL_FILENAME))

The filename is "model.json"

MODEL_FILENAME = "model.json"

But somehow the dumped checkpoints are still in the legacy binary format:

(screenshot: the checkpoint directories contain a binary "model" file instead of "model.json")

@daviddwlee84 (Author)

Quick workaround for now:

import os
import xgboost
from ray.train import Checkpoint

# Replace all XGBoostTrainer with MyXGBoostTrainer
class MyXGBoostTrainer(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgboost.Booster:
        """Retrieve the XGBoost model stored in this checkpoint."""
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgboost.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model')
            )
            return booster

    def _save_model(self, model: xgboost.Booster, path: str) -> None:
        model.save_model(os.path.join(path, 'model'))
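A quick usage sketch for the workaround above; `result` is assumed to be the Result returned by trainer.fit() in the reproduction script:

checkpoint = result.get_best_checkpoint('valid-logloss', 'min')
booster = MyXGBoostTrainer.get_model(checkpoint)
print(booster.num_boosted_rounds())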


daviddwlee84 commented Dec 5, 2023

(An additional question, not closely related to the fail-to-load issue.)
Is it possible to continue in the same checkpoint directory, keeping the checkpoint index, when using resume_from_checkpoint?

For example, if I load from Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_96983_00000_0_2023-12-05_13-53-18/checkpoint_000017)

I expect that if I create another XGBoostTrainer with resume_from_checkpoint pointing to this checkpoint and set num_boost_round=20, it will train another 20 rounds and produce a checkpoint like checkpoint_000037.
Instead, it creates another folder under /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/, I have to increase num_boost_round to something like 40 (> 20) to continue training, and the checkpoint index starts over from 0.

This is inconvenient when I want to monitor the run in TensorBoard, since I expected it to be the "same experiment".

I think this experience is a bit odd, since if I wanted to restore from a failure I would use ray.train.xgboost.XGBoostTrainer.restore and do something like:

# NOTE: the datasets are the same here. But when using `resume_from_checkpoint` I would use incremental data.
trainer_restore = MyXGBoostTrainer.restore(os.path.dirname(os.path.dirname(checkpoint.path)), datasets={"train": train_dataset, "valid": valid_dataset})
result_restore = trainer_restore.fit()
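For reference, this is the behavior I have in mind: with plain xgboost (no Ray), passing an existing booster via xgb_model simply adds more rounds on top of it. A minimal self-contained sketch with toy data (not the Ray datasets above):

import numpy as np
import xgboost as xgb

# Toy data purely for illustration
X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]}
booster = xgb.train(params, dtrain, num_boost_round=20)

# Continuing from the existing booster adds 20 *more* rounds
continued = xgb.train(params, dtrain, num_boost_round=20, xgb_model=booster)
print(continued.num_boosted_rounds())  # 40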

@anyscalesam anyscalesam added the train Ray Train Related Issue label Dec 5, 2023
@woshiyyya woshiyyya self-assigned this Dec 11, 2023
@woshiyyya woshiyyya removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Dec 11, 2023
@woshiyyya woshiyyya assigned justinvyu and unassigned woshiyyya Dec 11, 2023
@daviddwlee84 (Author)

I found that printing the path inside XGBoostTrainer._save_model shows a dynamically generated temporary directory path like /tmp/tmppbsxfulk. (This code path seems to apply to Ray 2.8+.)

def _checkpoint_at_end(self, model, evals_result: dict) -> None:
    # We need to call session.report to save checkpoints, so we report
    # the last received metrics (possibly again).
    result_dict = flatten_dict(evals_result, delimiter="-")
    for k in list(result_dict):
        result_dict[k] = result_dict[k][-1]

    if getattr(self._tune_callback_checkpoint_cls, "_report_callbacks_cls", None):
        # Deprecate: Remove in Ray 2.8
        with tune.checkpoint_dir(step=self._model_iteration(model)) as cp_dir:
            self._save_model(model, path=os.path.join(cp_dir, MODEL_KEY))
        tune.report(**result_dict)
    else:
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            self._save_model(model, path=checkpoint_dir)
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(result_dict, checkpoint=checkpoint)

I haven't found where Ray actually dumps the model itself.

Maybe it is controlled by the StorageContext; I'm not sure:

train.report(result_dict, checkpoint=checkpoint)

self.storage.persist_artifacts(force=True)
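In the meantime, one way to check what actually got persisted is to list the contents of a reported checkpoint after training. A minimal sketch, assuming `result` is the Result returned by trainer.fit() above:

import os

# Inspect the files inside the persisted checkpoint
with result.checkpoint.as_directory() as checkpoint_path:
    print(checkpoint_path)
    print(os.listdir(checkpoint_path))  # in my runs this shows the legacy 'model' file, not 'model.json'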

@daviddwlee84 (Author)

import ray
from ray.train import SyncConfig, RunConfig, CheckpointConfig, FailureConfig, ScalingConfig, Checkpoint
from ray.train.xgboost import XGBoostTrainer
import xgboost as xgb
import os

class MyXGBoostTrainer(XGBoostTrainer):
    """
    Workaround
    https://github.com/ray-project/ray/issues/41608
    https://github.com/dmlc/xgboost/issues/3089
    """
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        """Retrieve the XGBoost model stored in this checkpoint."""
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model')
            )
            if booster.attr('feature_names') is not None:
                booster.feature_names = booster.attr(
                    'feature_names').split('|')
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        """
        BUG: somehow we are not saving to the correct place we want
        https://github.com/ray-project/ray/issues/41608
        The path points to a temporary directory like /tmp/tmppbsxfulk
        """
        if hasattr(model, 'feature_names'):
            model.set_attr(feature_names='|'.join(model.feature_names))
        model.save_model(os.path.join(path, 'model'))

dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
sync_config = SyncConfig(sync_artifacts=True)
run_config = RunConfig(
    name=f"XGBoost_Test_Checkpoint_Save_Load",
    storage_path="/NAS/ShareFolder/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    sync_config=sync_config,
)
scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)
trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
checkpoint = result.get_best_checkpoint('valid-logloss', 'min')
# Failed using XGBoostTrainer
booster = XGBoostTrainer.get_model(checkpoint)
# This will work
booster = MyXGBoostTrainer.get_model(checkpoint)
# But feature_names is None
print(booster.feature_names)

If I modify the trainer to store the entire booster as a pickle, I still cannot find model.pickle in the checkpoint directory.

import pickle

class XGBoostTrainerWithPickle(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        with checkpoint.as_directory() as checkpoint_path:
            with open(os.path.join(checkpoint_path, 'model.pickle'), 'rb') as fp:
                booster = pickle.load(fp)
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        with open(os.path.join(path, 'model.pickle'), 'wb') as fp:
            pickle.dump(model, fp)


daviddwlee84 commented Dec 13, 2023

Confirmed: it looks more like a ray.train.report issue than an XGBoostTrainer issue.

class MyXGBoostTrainer(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model.json')
            )
            if booster.attr('feature_names') is not None:
                booster.feature_names = booster.attr(
                    'feature_names').split('|')
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        if hasattr(model, 'feature_names'):
            model.set_attr(feature_names='|'.join(model.feature_names))
        model.save_model(os.path.join(path, 'model.json'))

        # Can find ['model.json'] in the temp dir
        print(os.listdir(path))
        print(ckpt := Checkpoint.from_directory(path))
        # Successfully load XGBoost booster and print feature names
        print(MyXGBoostTrainer.get_model(ckpt).feature_names)

@justinvyu justinvyu added the P1 Issue that should be fixed within a few weeks label Dec 13, 2023
@justinvyu (Contributor)

@daviddwlee84 Thanks for posting all of your investigation, I'll take a closer look today.

@justinvyu (Contributor)

@daviddwlee84 I am not able to reproduce this with all the same package versions as you -- the biggest thing I can think of is xgboost_ray being out of date, and saving to the model file instead of the updated model.json. Could you double-check your xgboost_ray version and upgrade it to latest?


daviddwlee84 commented Dec 15, 2023

Sure, I will take a look today to see if it is an xgboost_ray-related issue. (The version is already the latest.)

# https://github.com/ray-project/xgboost_ray/releases/tag/v0.1.19
$ pip list | grep xgboost-ray
xgboost-ray                       0.1.19

But the weird thing is that no matter how I modify XGBoostTrainer (the get_model() and especially the _save_model() method), the content at the destination stays unchanged. (So modifying get_model() to load whatever it gives me is the only workaround for now.)

I tried:

  1. No change, which should produce model.json
  2. Change the model filename from model.json to model
  3. Save the entire booster as a pickle
  4. Save additional information (feature_names) in the booster attributes

Conclusion

  1. The model is successfully stored in the temp folder.
  2. The destination set in RunConfig always contains only the legacy model file, without any of the changes I made in _save_model().

I haven't found how ray.train.report saves the Checkpoint to the destination. I expected either a copy operation (which should preserve the contents of the temp dir) or a re-invocation of _save_model() (which would force the checkpoint back to the legacy model filename).
Maybe it is _TrainSession.persist_artifacts's job, but I haven't figured out how it works.

@justinvyu Did your checkpoint content change when you modified XGBoostTrainer._save_model()?


daviddwlee84 commented Dec 15, 2023

Maybe another clue is where this warning message is generated:

WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.

I can do an experiment in get_model():

class MyXGBoostTrainer(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model.json')
            )

            # This will dump the user warning
            # WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
            booster.save_model('model')

            # This works fine
            booster.save_model('model.json')

This shows that as long as I only call booster.save_model() with a *.json suffix in _save_model(), this warning should not appear unless something else overrides my code.

Somehow Ray does not use what I set in _save_model(). I think xgboost_ray is only involved in the worker training part, because the TensorBoard-like checkpoint directory structure is generated by the Ray Tune API; xgboost_ray only returns the booster, and we have to save it manually.
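For reference, the worker-side part that xgboost_ray handles looks roughly like this on its own: it just hands back a Booster, and saving it is up to the caller. A rough standalone sketch (the RayDMatrix/RayParams usage here is my assumption of the standalone API, not code taken from the trainer):

import numpy as np
import pandas as pd
from xgboost_ray import RayDMatrix, RayParams, train

# Toy data purely for illustration
df = pd.DataFrame(np.random.rand(100, 5), columns=[f"f{i}" for i in range(5)])
df["target"] = (df["f0"] > 0.5).astype(int)

dtrain = RayDMatrix(df, label="target")
booster = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    dtrain,
    num_boost_round=20,
    ray_params=RayParams(num_actors=3),
)
booster.save_model("model.json")  # persisting the model is left to the caller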


justinvyu commented Dec 15, 2023

@daviddwlee84

I haven't found how ray.train.report saves the Checkpoint to the destination.

This is where the checkpoint gets persisted -- it does get copied from the temp dir to the location on persistent storage (NFS/S3). It happens during the ray.train.report call:

# Persist the reported checkpoint files to storage.
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)

Did your checkpoint content change when you modified XGBoostTrainer._save_model()?

I tried this:

class MyXGBoostTrainer(XGBoostTrainer):
    def _save_model(self, model, path: str):
        model.save_model(os.path.join(path, "model.ubj"))

This works fine for me:

$ ls /home/ray/ray_results/XGBoost_ResumeExperiment/MyXGBoostTrainer_da263_00000_0_2023-12-15_13-35-25/checkpoint_000020
model.ubj

Q: What's your cluster setup? Are you running on multiple nodes, and is the xgboost/xgboost_ray/ray version the same on every node?

@daviddwlee84 (Author)

Q: What's your cluster setup? Are you running on multiple nodes, and is the xgboost/xgboost_ray/ray version the same on every node?

I have three machines. I set up the workspace (/mnt/NAS/ShareFolder/MyRepo) in a NAS directory that is accessible to all three machines and mounted under the same directory structure.
In the workspace, I created a Python 3.8.13 virtual environment (/mnt/NAS/ShareFolder/MyRepo/MyVenv) with ray==2.8.1, xgboost-ray==0.1.19, and xgboost==2.0.2 installed.

And I start the cluster like this:

  1. Start the head node on one machine:

# launch_ray_head_node.sh
RAY_record_ref_creation_sites=1 RAY_PROMETHEUS_HOST=http://192.168.222.235:9000 RAY_GRAFANA_HOST=http://192.168.222.235:3000 RAY_scheduler_spread_threshold=0.0 /mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/ray start --head --node-ip-address 192.168.222.235 --port 6379 --dashboard-host 0.0.0.0 --dashboard-port 8265 --object-store-memory 450000000000

  2. Start the worker nodes on the other two machines:

# launch_ray_worker_node.sh
RAY_record_ref_creation_sites=1 RAY_scheduler_spread_threshold=0.0 /mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/ray start --address 192.168.222.235:6379 --object-store-memory 450000000000

  3. Start the training:

/mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/python -m trainer.ray_training

In this script, I have a RunConfig like this, which directs the checkpoints to the NAS share folder:

run_config = RunConfig(
    name="ExperimentName",
    storage_path="/mnt/NAS/ShareFolder/MyRepo/Results",
    ...
)

If the Ray version were inconsistent, an error would be raised when the cluster starts, but I am not sure whether it warns about other packages.

@daviddwlee84 (Author)

I tried to print the package versions on each node like this:

import ray
import logging

ray.init()

@ray.remote(scheduling_strategy='SPREAD')
class Actor:
    def __init__(self):
        logging.basicConfig(level=logging.INFO)

    def log(self):
        logger = logging.getLogger(__name__)
        import xgboost
        import xgboost_ray
        logger.info({
            'xgboost': xgboost.__version__,
            'xgboost_ray': xgboost_ray.__version__,
            'ray': ray.__version__,
        })


for _ in range(3):
    actor = Actor.remote()
    ray.get(actor.log.remote())

And got the following logs:

/mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/python /mnt/NAS/ShareFolder/MyRepo/ray_environment_check.py 
2023-12-18 10:19:37,829 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 192.168.222.235:6379...
2023-12-18 10:19:37,858 INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at http://192.168.222.235:8265 
(Actor pid=38713, ip=192.168.222.236) INFO:__main__:{'xgboost': '2.0.2', 'xgboost_ray': '0.1.19', 'ray': '2.8.1'}
(Actor pid=35015, ip=192.168.222.237) INFO:__main__:{'xgboost': '2.0.2', 'xgboost_ray': '0.1.19', 'ray': '2.8.1'}
(Actor pid=38897) INFO:__main__:{'xgboost': '2.0.2', 'xgboost_ray': '0.1.19', 'ray': '2.8.1'}

Not sure if this confirms they are using the same packages.

@daviddwlee84 (Author)

I found that only the checkpoint from the latest iteration goes through _save_model().

(screenshot: only the final checkpoint directory contains the file written by _save_model())

The checkpoints from the other iterations are not written by _save_model().

def _checkpoint_at_end(self, model, evals_result: dict) -> None:
    # We need to call session.report to save checkpoints, so we report
    # the last received metrics (possibly again).
    result_dict = flatten_dict(evals_result, delimiter="-")
    for k in list(result_dict):
        result_dict[k] = result_dict[k][-1]

    if getattr(self._tune_callback_checkpoint_cls, "_report_callbacks_cls", None):
        # Deprecate: Remove in Ray 2.8
        with tune.checkpoint_dir(step=self._model_iteration(model)) as cp_dir:
            self._save_model(model, path=os.path.join(cp_dir, MODEL_KEY))
        tune.report(**result_dict)
    else:
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            self._save_model(model, path=checkpoint_dir)
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(result_dict, checkpoint=checkpoint)

It seems that if GBDTTrainer._checkpoint_at_end() really only covers the very end, then that part works as expected.


daviddwlee84 commented Dec 18, 2023

Okay, I found the exact issue! The "non-end" checkpoints are written by _tune_callback_checkpoint_cls:

if not any(
    isinstance(cb, self._tune_callback_checkpoint_cls)
    for cb in config["callbacks"]
):
    # Only add our own callback if it hasn't been added before
    checkpoint_frequency = (
        self.run_config.checkpoint_config.checkpoint_frequency
    )
    callback = self._tune_callback_checkpoint_cls(
        filename=MODEL_KEY, frequency=checkpoint_frequency
    )
    config["callbacks"] += [callback]

_tune_callback_checkpoint_cls: type = TuneReportCheckpointCallback

https://github.com/ray-project/xgboost_ray/blob/9081780c5826194b780fdad4dbe6872470527cab/xgboost_ray/tune.py#L69-L76

def after_iteration(self, model: Booster, epoch: int, evals_log: Dict):
    if self._frequency > 0 and self._checkpoint_callback_cls:
        self._checkpoint_callback_cls.after_iteration(self, model, epoch, evals_log)
    if self._report_callbacks_cls:
        # Deprecate: Raise error in Ray 2.8
        if log_once("xgboost_ray_legacy"):
            warnings.warn(
                "You are using an outdated version of XGBoost-Ray that won't be "
                "compatible with future releases of Ray. Please update XGBoost-Ray "
                "with `pip install -U xgboost_ray`."
            )
        self._report_callbacks_cls.after_iteration(self, model, epoch, evals_log)
        return

    with self._get_checkpoint(
        model=model, epoch=epoch, filename=self._filename, frequency=self._frequency
    ) as checkpoint:
        report_dict = self._get_report_dict(evals_log)
        train.report(report_dict, checkpoint=checkpoint)

And the filename "model" is set here:

MODEL_KEY = "model"


Now I am building a workaround for this. If it is successful, I will post the solution here.

But it is tricky and non-intuitive that not all checkpoints are saved by the trainer's _save_model() method.

@justinvyu I think this issue can be reproduced by setting a non-zero checkpoint_frequency in CheckpointConfig:

checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
    ),


daviddwlee84 commented Dec 18, 2023

Using a self-defined checkpoint callback:

from typing import Optional
import ray
from ray.train import SyncConfig, RunConfig, CheckpointConfig, FailureConfig, ScalingConfig, Checkpoint
from ray.train.xgboost import XGBoostTrainer
from ray.tune.integration.xgboost import TuneReportCheckpointCallback
from contextlib import contextmanager
import tempfile
import xgboost as xgb
import os


class MyXGBoostCheckpointCallback(TuneReportCheckpointCallback):
    @contextmanager
    def _get_checkpoint(
        self, model: xgb.Booster, epoch: int, filename: str, frequency: int
    ) -> Optional[Checkpoint]:
        if not frequency or epoch % frequency > 0 or (not epoch and frequency > 1):
            # Skip 0th checkpoint if frequency > 1
            yield None
            return

        with tempfile.TemporaryDirectory() as checkpoint_dir:
            if hasattr(model, 'feature_names'):
                model.set_attr(feature_names='|'.join(model.feature_names))
            model.save_model(os.path.join(checkpoint_dir, filename))
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            yield checkpoint


class MyXGBoostTrainer(XGBoostTrainer):
    # HERE: this override is a must-have, even though we also pass the callback via the
    # trainer's callbacks argument. In GBDTTrainer's training loop, the callback type is
    # checked; if it doesn't match, another callback is created and you end up writing
    # duplicate checkpoints without noticing.
    _tune_callback_checkpoint_cls = MyXGBoostCheckpointCallback

    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        """Retrieve the XGBoost model stored in this checkpoint."""
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model.json')
            )
            if booster.attr('feature_names') is not None:
                booster.feature_names = booster.attr(
                    'feature_names').split('|')
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        if hasattr(model, 'feature_names'):
            model.set_attr(feature_names='|'.join(model.feature_names))
        model.save_model(os.path.join(path, 'model.json'))


dataset = ray.data.read_csv(
    "s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
sync_config = SyncConfig(sync_artifacts=True)
run_config = RunConfig(
    name=f"XGBoost_Test_Checkpoint_Save_Load",
    storage_path="/NAS/ShareFolder/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    sync_config=sync_config,
)
scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)
trainer = MyXGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=30,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    # HERE
    callbacks=[MyXGBoostCheckpointCallback(
        filename="model.json", frequency=1)],
)
result = trainer.fit()
print(checkpoint := result.get_best_checkpoint('valid-logloss', 'min'))
booster = MyXGBoostTrainer.get_model(checkpoint)
print(booster.num_boosted_rounds())
print(booster.feature_names)
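With every checkpoint now written as model.json, resuming from the best checkpoint should in principle go through the same path as the original repro. A sketch reusing the objects defined above (not a verified fix):

trainer_continue = MyXGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=40,  # needs to exceed the rounds already trained, per the earlier observation
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    callbacks=[MyXGBoostCheckpointCallback(filename="model.json", frequency=1)],
    resume_from_checkpoint=checkpoint,
)
result_continue = trainer_continue.fit()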

@justinvyu (Contributor)

Thank you for the investigation! The checkpoint_at_end and checkpoint_frequency do indeed go through different codepaths, and I was able to reproduce with checkpoint_frequency=1. I'll put up a fix PR to clean this up!
