[Train] XGBoost continue train (resume_from_checkpoint) and get_model failed #41608

Closed
daviddwlee84 opened this issue Dec 5, 2023 · 17 comments · Fixed by #42111
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks train Ray Train Related Issue

Comments


daviddwlee84 commented Dec 5, 2023

What happened + What you expected to happen

When I finish XGBoost training using XGBoostTrainer, I want to continue training from the best checkpoint, but:

  1. Passing resume_from_checkpoint fails to load the checkpoint.
  2. XGBoostTrainer.get_model can't load the model from the checkpoint either.

The first error happens when creating a new trainer with resume_from_checkpoint and is quite similar to #16375:

2023-12-05 10:52:43,353	WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:52:53,378	INFO tune.py:1047 -- Total run time: 11.02 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:52:53,393	WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_5c19d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_5c19d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_5c19d_00000_0_2023-12-05_10-52-42')

The error message becomes the following one (like the second issue) when I remove the early-stopping config stop=ExperimentPlateauStopper('train-error', mode='min') in RunConfig:

xgboost.core.XGBoostError: [11:25:04] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f768b5dc24e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f768b6086f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f768b590731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f768b5909f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe0959829dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe095982067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fe09599b1e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fe09599bc95]
  [bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x55b5d791b13f]

The second issue might be related to #41374: either Ray saves the XGBoost model in the legacy binary format, or it cannot load a non-default model filename from the checkpoint. The workaround there does not seem to work.

There are warning logs like this:

(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000015)
(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:41] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.

And calling XGBoostTrainer.get_model(checkpoint) raises this error:

XGBoostError: [11:16:54] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000017/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f49ef86824e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f49ef8946f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f49ef81c731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f49ef81c9f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f4b8c2bd9dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f4b8c2bd067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f4b8c2d61e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f4b8c2d6c95]
  [bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x5581bcf8513f]

Versions / Dependencies

Python 3.8.13

Packages

ray                               2.8.1
xgboost-ray                       0.1.19
xgboost                           2.0.2

OS

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.6 LTS
Release:        18.04
Codename:       bionic

Reproduction script

The reproduction script is based on the official tutorial "Get Started with XGBoost and LightGBM" (Ray 2.8.0).

Load the data and run the first training:

import ray
from ray.train.xgboost import XGBoostTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig, FailureConfig
from ray.tune.stopper import ExperimentPlateauStopper

ray.init()

dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

run_config = RunConfig(
    name="XGBoost_ResumeExperiment",
    storage_path="/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    # Remove this will get different error message later
    stop=ExperimentPlateauStopper('train-error', mode='min'),
)

scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)

trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)

result = trainer.fit()

During fitting, we get warnings like this:

(XGBoostTrainer pid=96776, ip=192.168.222.237) [11:16:42] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
(XGBoostTrainer pid=96776, ip=192.168.222.237) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_a3ce0_00000_0_2023-12-05_11-16-11/checkpoint_000018)

Get the Best Checkpoint and Resume

checkpoint = result.get_best_checkpoint('valid-logloss', 'min')

trainer_continue = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    resume_from_checkpoint=checkpoint
)

result_continue = trainer_continue.fit()

This fails with the following error when early stopping is enabled:

2023-12-05 10:25:41,638	WARNING experiment_state.py:327 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
2023-12-05 10:25:50,900	INFO tune.py:1047 -- Total run time: 9.96 seconds (0.14 seconds for the tuning loop).
2023-12-05 10:25:50,911	WARNING experiment_analysis.py:185 -- Failed to fetch metrics for 1 trial(s):
- MyXGBoostTrainer_95a7d_00000: FileNotFoundError('Could not fetch metrics for MyXGBoostTrainer_95a7d_00000: both result.json and progress.csv were not found at /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/MyXGBoostTrainer_95a7d_00000_0_2023-12-05_10-25-40')

And with this error when early stopping is disabled:

xgboost.core.XGBoostError: [11:25:25] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f2f6f5dc24e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f2f6f6086f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f2f6f590731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f2f6f5909f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f9976bab9dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f9976bab067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f9976bc41e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f9976bc4c95]
  [bt] (8) ray::_Inner.train(_PyObject_MakeTpCall+0x3bf) [0x556d2854d13f]

This is the same error as the one from:

model = XGBoostTrainer.get_model(checkpoint)
XGBoostError: [11:36:40] /workspace/src/common/io.cc:147: Opening /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_c84d8_00000_0_2023-12-05_11-24-22/checkpoint_000019/model.json failed: No such file or directory
Stack trace:
  [bt] (0) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1ba24e) [0x7f105a97824e]
  [bt] (1) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x1e66f3) [0x7f105a9a46f3]
  [bt] (2) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x16e731) [0x7f105a92c731]
  [bt] (3) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGBoosterLoadModel+0xb9) [0x7f105a92c9f9]
  [bt] (4) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f11f73d09dd]
  [bt] (5) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7f11f73d0067]
  [bt] (6) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7f11f73e91e9]
  [bt] (7) /mnt/NAS/sda/ShareFolder/anaconda3/envs/research/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7f11f73e9c95]
  [bt] (8) /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/daweilee_research/bin/python(_PyObject_MakeTpCall+0x3bf) [0x55ea790af13f]

Issue Severity

High: It blocks me from completing my task.

@daviddwlee84 daviddwlee84 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 5, 2023
@daviddwlee84 (Author)

From ray.train.xgboost.xgboost_trainer (Ray 2.8.0):

@staticmethod
def get_model(checkpoint: Checkpoint) -> xgboost.Booster:
    """Retrieve the XGBoost model stored in this checkpoint."""
    with checkpoint.as_directory() as checkpoint_path:
        booster = xgboost.Booster()
        booster.load_model(
            os.path.join(checkpoint_path, XGBoostCheckpoint.MODEL_FILENAME)
        )
        return booster

def _train(self, **kwargs):
    return xgboost_ray.train(**kwargs)

def _load_checkpoint(self, checkpoint: Checkpoint) -> xgboost.Booster:
    return self.__class__.get_model(checkpoint)

def _save_model(self, model: xgboost.Booster, path: str):
    model.save_model(os.path.join(path, XGBoostCheckpoint.MODEL_FILENAME))

The filename is "model.json"

MODEL_FILENAME = "model.json"

But somehow the dumped checkpoints are still in the legacy binary format:

(screenshot: the checkpoint directories contain a binary "model" file instead of "model.json")

@daviddwlee84 (Author)

Quick workaround for now:

import os
import xgboost
from ray.train import Checkpoint

# Replace all XGBoostTrainer with MyXGBoostTrainer
class MyXGBoostTrainer(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgboost.Booster:
        """Retrieve the XGBoost model stored in this checkpoint."""
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgboost.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model')
            )
            return booster

    def _save_model(self, model: xgboost.Booster, path: str) -> None:
        model.save_model(os.path.join(path, 'model'))
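A quick usage sketch for the workaround above; `result` is assumed to be the Result returned by trainer.fit() in the reproduction script:

checkpoint = result.get_best_checkpoint('valid-logloss', 'min')
booster = MyXGBoostTrainer.get_model(checkpoint)
print(booster.num_boosted_rounds())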


daviddwlee84 commented Dec 5, 2023

(An additional question, not closely related to the fail-to-load issue.)
Is it possible to continue in the same checkpoint directory, keeping the checkpoint index, when using resume_from_checkpoint?

For example, if I load from Checkpoint(filesystem=local, path=/mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/XGBoostTrainer_96983_00000_0_2023-12-05_13-53-18/checkpoint_000017)

I expect that if I create another XGBoostTrainer with resume_from_checkpoint pointing to this checkpoint and set num_boost_round=20, it will train another 20 rounds and produce a checkpoint like checkpoint_000037.
Instead, it creates another folder under /mnt/NAS/sda/ShareFolder/lidawei/ExperimentNotebook/ray_debug/XGBoost_ResumeExperiment/, I have to increase num_boost_round to something like 40 (> 20) to continue training, and the checkpoint index starts over from 0.

This is inconvenient when I want to monitor the run in TensorBoard, since I expected it to be the "same experiment".

I think this experience is a bit odd, since if I wanted to restore from a failure I would use ray.train.xgboost.XGBoostTrainer.restore and do something like:

# NOTE: the datasets are the same here. But when using `resume_from_checkpoint` I would use incremental data.
trainer_restore = MyXGBoostTrainer.restore(os.path.dirname(os.path.dirname(checkpoint.path)), datasets={"train": train_dataset, "valid": valid_dataset})
result_restore = trainer_restore.fit()
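For reference, this is the behavior I have in mind: with plain xgboost (no Ray), passing an existing booster via xgb_model simply adds more rounds on top of it. A minimal self-contained sketch with toy data (not the Ray datasets above):

import numpy as np
import xgboost as xgb

# Toy data purely for illustration
X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]}
booster = xgb.train(params, dtrain, num_boost_round=20)

# Continuing from the existing booster adds 20 *more* rounds
continued = xgb.train(params, dtrain, num_boost_round=20, xgb_model=booster)
print(continued.num_boosted_rounds())  # 40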

@anyscalesam anyscalesam added the train Ray Train Related Issue label Dec 5, 2023
@woshiyyya woshiyyya self-assigned this Dec 11, 2023
@woshiyyya woshiyyya removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Dec 11, 2023
@woshiyyya woshiyyya assigned justinvyu and unassigned woshiyyya Dec 11, 2023
@daviddwlee84 (Author)

I found that printing the path inside XGBoostTrainer._save_model shows a dynamically generated temporary directory path like /tmp/tmppbsxfulk. (This code path seems to apply to Ray 2.8+.)

def _checkpoint_at_end(self, model, evals_result: dict) -> None:
    # We need to call session.report to save checkpoints, so we report
    # the last received metrics (possibly again).
    result_dict = flatten_dict(evals_result, delimiter="-")
    for k in list(result_dict):
        result_dict[k] = result_dict[k][-1]

    if getattr(self._tune_callback_checkpoint_cls, "_report_callbacks_cls", None):
        # Deprecate: Remove in Ray 2.8
        with tune.checkpoint_dir(step=self._model_iteration(model)) as cp_dir:
            self._save_model(model, path=os.path.join(cp_dir, MODEL_KEY))
        tune.report(**result_dict)
    else:
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            self._save_model(model, path=checkpoint_dir)
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(result_dict, checkpoint=checkpoint)

I haven't found where Ray actually dumps the model itself.

Maybe it is controlled by the StorageContext; I'm not sure:

train.report(result_dict, checkpoint=checkpoint)

self.storage.persist_artifacts(force=True)
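In the meantime, one way to check what actually got persisted is to list the contents of a reported checkpoint after training. A minimal sketch, assuming `result` is the Result returned by trainer.fit() above:

import os

# Inspect the files inside the persisted checkpoint
with result.checkpoint.as_directory() as checkpoint_path:
    print(checkpoint_path)
    print(os.listdir(checkpoint_path))  # in my runs this shows the legacy 'model' file, not 'model.json'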

@daviddwlee84 (Author)

import ray
from ray.train import SyncConfig, RunConfig, CheckpointConfig, FailureConfig, ScalingConfig, Checkpoint
from ray.train.xgboost import XGBoostTrainer
import xgboost as xgb
import os

class MyXGBoostTrainer(XGBoostTrainer):
    """
    Workaround
    https://github.com/ray-project/ray/issues/41608
    https://github.com/dmlc/xgboost/issues/3089
    """
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        """Retrieve the XGBoost model stored in this checkpoint."""
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model')
            )
            if booster.attr('feature_names') is not None:
                booster.feature_names = booster.attr(
                    'feature_names').split('|')
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        """
        BUG: somehow we are not saving to the correct place we want
        https://github.com/ray-project/ray/issues/41608
        The path points to a temporary directory like /tmp/tmppbsxfulk
        """
        if hasattr(model, 'feature_names'):
            model.set_attr(feature_names='|'.join(model.feature_names))
        model.save_model(os.path.join(path, 'model'))

dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
sync_config = SyncConfig(sync_artifacts=True)
run_config = RunConfig(
    name=f"XGBoost_Test_Checkpoint_Save_Load",
    storage_path="/NAS/ShareFolder/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    sync_config=sync_config,
)
scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)
trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=20,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
)
result = trainer.fit()
checkpoint = result.get_best_checkpoint('valid-logloss', 'min')
# Failed using XGBoostTrainer
booster = XGBoostTrainer.get_model(checkpoint)
# This will work
booster = MyXGBoostTrainer.get_model(checkpoint)
# But feature_names is None
print(booster.feature_names)

If I modify the trainer to store the entire booster as a pickle, I still cannot find model.pickle in the checkpoint directory.

import pickle

class XGBoostTrainerWithPickle(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        with checkpoint.as_directory() as checkpoint_path:
            with open(os.path.join(checkpoint_path, 'model.pickle'), 'rb') as fp:
                booster = pickle.load(fp)
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        with open(os.path.join(path, 'model.pickle'), 'wb') as fp:
            pickle.dump(model, fp)


daviddwlee84 commented Dec 13, 2023

Confirmed: it looks more like a ray.train.report issue than an XGBoostTrainer issue.

class MyXGBoostTrainer(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model.json')
            )
            if booster.attr('feature_names') is not None:
                booster.feature_names = booster.attr(
                    'feature_names').split('|')
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        if hasattr(model, 'feature_names'):
            model.set_attr(feature_names='|'.join(model.feature_names))
        model.save_model(os.path.join(path, 'model.json'))

        # Can find ['model.json'] in the temp dir
        print(os.listdir(path))
        print(ckpt := Checkpoint.from_directory(path))
        # Successfully load XGBoost booster and print feature names
        print(MyXGBoostTrainer.get_model(ckpt).feature_names)

@justinvyu justinvyu added the P1 Issue that should be fixed within a few weeks label Dec 13, 2023
@justinvyu (Contributor)

@daviddwlee84 Thanks for posting all of your investigation, I'll take a closer look today.

@justinvyu (Contributor)

@daviddwlee84 I am not able to reproduce this with all the same package versions as you -- the biggest thing I can think of is xgboost_ray being out of date, and saving to the model file instead of the updated model.json. Could you double-check your xgboost_ray version and upgrade it to latest?


daviddwlee84 commented Dec 15, 2023

Sure, I will take a look today to see if it is an xgboost_ray-related issue. (The version is already the latest.)

# https://github.com/ray-project/xgboost_ray/releases/tag/v0.1.19
$ pip list | grep xgboost-ray
xgboost-ray                       0.1.19

But the weird thing is that no matter how I modify XGBoostTrainer (the get_model() and especially the _save_model() method), the content at the destination stays unchanged. (So modifying get_model() to load whatever it gives me is the only workaround for now.)

I tried:

  1. No change, which should produce model.json
  2. Change the model filename from model.json to model
  3. Save the entire booster as a pickle
  4. Save additional information (feature_names) in the booster attributes

Conclusion

  1. The model is successfully stored in the temp folder.
  2. The destination set in RunConfig always contains only the legacy model file, without any of the changes I made in _save_model().

I haven't found how ray.train.report saves the Checkpoint to the destination. I expected either a copy operation (which should preserve the contents of the temp dir) or a re-invocation of _save_model() (which would force the checkpoint back to the legacy model filename).
Maybe it is _TrainSession.persist_artifacts's job, but I haven't figured out how it works.

@justinvyu Did your checkpoint content change when you modified XGBoostTrainer._save_model()?


daviddwlee84 commented Dec 15, 2023

Maybe another clue is where this warning message is generated:

WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.

I can do an experiment in get_model():

class MyXGBoostTrainer(XGBoostTrainer):
    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model.json')
            )

            # This will dump the user warning
            # WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
            booster.save_model('model')

            # This works fine
            booster.save_model('model.json')

This shows that as long as I only call booster.save_model() with a *.json suffix in _save_model(), this warning should not appear unless something else overrides my code.

Somehow Ray does not use what I set in _save_model(). I think xgboost_ray is only involved in the worker training part, because the TensorBoard-like checkpoint directory structure is generated by the Ray Tune API; xgboost_ray only returns the booster, and we have to save it manually.
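For reference, the worker-side part that xgboost_ray handles looks roughly like this on its own: it just hands back a Booster, and saving it is up to the caller. A rough standalone sketch (the RayDMatrix/RayParams usage here is my assumption of the standalone API, not code taken from the trainer):

import numpy as np
import pandas as pd
from xgboost_ray import RayDMatrix, RayParams, train

# Toy data purely for illustration
df = pd.DataFrame(np.random.rand(100, 5), columns=[f"f{i}" for i in range(5)])
df["target"] = (df["f0"] > 0.5).astype(int)

dtrain = RayDMatrix(df, label="target")
booster = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    dtrain,
    num_boost_round=20,
    ray_params=RayParams(num_actors=3),
)
booster.save_model("model.json")  # persisting the model is left to the caller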


justinvyu commented Dec 15, 2023

@daviddwlee84

I haven't found how ray.train.report saves the Checkpoint to the destination.

This is where the checkpoint gets persisted -- it does get copied from the temp dir to the location on persistent storage (NFS/S3). It happens during the ray.train.report call:

# Persist the reported checkpoint files to storage.
persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)

Did your checkpoint content change when you modified XGBoostTrainer._save_model()?

I tried this:

class MyXGBoostTrainer(XGBoostTrainer):
    def _save_model(self, model, path: str):
        model.save_model(os.path.join(path, "model.ubj"))

This works fine for me:

$ ls /home/ray/ray_results/XGBoost_ResumeExperiment/MyXGBoostTrainer_da263_00000_0_2023-12-15_13-35-25/checkpoint_000020
model.ubj

Q: What's your cluster setup? Are you running on multiple nodes, and is the xgboost/xgboost_ray/ray version the same on every node?

@daviddwlee84 (Author)

Q: What's your cluster setup? Are you running on multiple nodes, and is the xgboost/xgboost_ray/ray version the same on every node?

I have three machines. I set up the workspace (/mnt/NAS/ShareFolder/MyRepo) in a NAS directory that is accessible to all three machines and mounted under the same directory structure.
In the workspace, I created a Python 3.8.13 virtual environment (/mnt/NAS/ShareFolder/MyRepo/MyVenv) with ray==2.8.1, xgboost-ray==0.1.19, and xgboost==2.0.2 installed.

And I start the cluster like this:

  1. Start the head node on one machine:

# launch_ray_head_node.sh
RAY_record_ref_creation_sites=1 RAY_PROMETHEUS_HOST=http://192.168.222.235:9000 RAY_GRAFANA_HOST=http://192.168.222.235:3000 RAY_scheduler_spread_threshold=0.0 /mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/ray start --head --node-ip-address 192.168.222.235 --port 6379 --dashboard-host 0.0.0.0 --dashboard-port 8265 --object-store-memory 450000000000

  2. Start the worker nodes on the other two machines:

# launch_ray_worker_node.sh
RAY_record_ref_creation_sites=1 RAY_scheduler_spread_threshold=0.0 /mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/ray start --address 192.168.222.235:6379 --object-store-memory 450000000000

  3. Start the training:

/mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/python -m trainer.ray_training

In this script, I have a RunConfig like this, which directs the checkpoints to the NAS share folder:

run_config = RunConfig(
    name="ExperimentName",
    storage_path="/mnt/NAS/ShareFolder/MyRepo/Results",
    ...
)

If the Ray version were inconsistent, an error would be raised when the cluster starts, but I am not sure whether it warns about other packages.

@daviddwlee84 (Author)

I tried to print the package versions on each node like this:

import ray
import logging

ray.init()

@ray.remote(scheduling_strategy='SPREAD')
class Actor:
    def __init__(self):
        logging.basicConfig(level=logging.INFO)

    def log(self):
        logger = logging.getLogger(__name__)
        import xgboost
        import xgboost_ray
        logger.info({
            'xgboost': xgboost.__version__,
            'xgboost_ray': xgboost_ray.__version__,
            'ray': ray.__version__,
        })


for _ in range(3):
    actor = Actor.remote()
    ray.get(actor.log.remote())

And got the following logs:

/mnt/NAS/ShareFolder/MyRepo/MyVenv/bin/python /mnt/NAS/ShareFolder/MyRepo/ray_environment_check.py 
2023-12-18 10:19:37,829 INFO worker.py:1489 -- Connecting to existing Ray cluster at address: 192.168.222.235:6379...
2023-12-18 10:19:37,858 INFO worker.py:1664 -- Connected to Ray cluster. View the dashboard at http://192.168.222.235:8265 
(Actor pid=38713, ip=192.168.222.236) INFO:__main__:{'xgboost': '2.0.2', 'xgboost_ray': '0.1.19', 'ray': '2.8.1'}
(Actor pid=35015, ip=192.168.222.237) INFO:__main__:{'xgboost': '2.0.2', 'xgboost_ray': '0.1.19', 'ray': '2.8.1'}
(Actor pid=38897) INFO:__main__:{'xgboost': '2.0.2', 'xgboost_ray': '0.1.19', 'ray': '2.8.1'}

Not sure if this confirms they are using the same packages.

@daviddwlee84 (Author)

I found that only the checkpoint from the latest iteration goes through _save_model().

(screenshot: only the final checkpoint directory contains the file written by _save_model())

The checkpoints from the other iterations are not written by _save_model().

def _checkpoint_at_end(self, model, evals_result: dict) -> None:
    # We need to call session.report to save checkpoints, so we report
    # the last received metrics (possibly again).
    result_dict = flatten_dict(evals_result, delimiter="-")
    for k in list(result_dict):
        result_dict[k] = result_dict[k][-1]

    if getattr(self._tune_callback_checkpoint_cls, "_report_callbacks_cls", None):
        # Deprecate: Remove in Ray 2.8
        with tune.checkpoint_dir(step=self._model_iteration(model)) as cp_dir:
            self._save_model(model, path=os.path.join(cp_dir, MODEL_KEY))
        tune.report(**result_dict)
    else:
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            self._save_model(model, path=checkpoint_dir)
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            train.report(result_dict, checkpoint=checkpoint)

It seems that if GBDTTrainer._checkpoint_at_end() really only covers the very end, then that part works as expected.


daviddwlee84 commented Dec 18, 2023

Okay, I found the exact issue! The "non-end" checkpoints are written by _tune_callback_checkpoint_cls:

if not any(
    isinstance(cb, self._tune_callback_checkpoint_cls)
    for cb in config["callbacks"]
):
    # Only add our own callback if it hasn't been added before
    checkpoint_frequency = (
        self.run_config.checkpoint_config.checkpoint_frequency
    )
    callback = self._tune_callback_checkpoint_cls(
        filename=MODEL_KEY, frequency=checkpoint_frequency
    )
    config["callbacks"] += [callback]

_tune_callback_checkpoint_cls: type = TuneReportCheckpointCallback

https://github.com/ray-project/xgboost_ray/blob/9081780c5826194b780fdad4dbe6872470527cab/xgboost_ray/tune.py#L69-L76

def after_iteration(self, model: Booster, epoch: int, evals_log: Dict):
    if self._frequency > 0 and self._checkpoint_callback_cls:
        self._checkpoint_callback_cls.after_iteration(self, model, epoch, evals_log)
    if self._report_callbacks_cls:
        # Deprecate: Raise error in Ray 2.8
        if log_once("xgboost_ray_legacy"):
            warnings.warn(
                "You are using an outdated version of XGBoost-Ray that won't be "
                "compatible with future releases of Ray. Please update XGBoost-Ray "
                "with `pip install -U xgboost_ray`."
            )
        self._report_callbacks_cls.after_iteration(self, model, epoch, evals_log)
        return

    with self._get_checkpoint(
        model=model, epoch=epoch, filename=self._filename, frequency=self._frequency
    ) as checkpoint:
        report_dict = self._get_report_dict(evals_log)
        train.report(report_dict, checkpoint=checkpoint)

And the filename "model" is set here:

MODEL_KEY = "model"


Now I am building a workaround for this. If it is successful, I will post the solution here.

But it is tricky and non-intuitive that not all checkpoints are saved by the trainer's _save_model() method.

@justinvyu I think this issue can be reproduced by setting a non-zero checkpoint_frequency in CheckpointConfig:

checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
    ),


daviddwlee84 commented Dec 18, 2023

Using a self-defined checkpoint callback:

from typing import Optional
import ray
from ray.train import SyncConfig, RunConfig, CheckpointConfig, FailureConfig, ScalingConfig, Checkpoint
from ray.train.xgboost import XGBoostTrainer
from ray.tune.integration.xgboost import TuneReportCheckpointCallback
from contextlib import contextmanager
import tempfile
import xgboost as xgb
import os


class MyXGBoostCheckpointCallback(TuneReportCheckpointCallback):
    @contextmanager
    def _get_checkpoint(
        self, model: xgb.Booster, epoch: int, filename: str, frequency: int
    ) -> Optional[Checkpoint]:
        if not frequency or epoch % frequency > 0 or (not epoch and frequency > 1):
            # Skip 0th checkpoint if frequency > 1
            yield None
            return

        with tempfile.TemporaryDirectory() as checkpoint_dir:
            if hasattr(model, 'feature_names'):
                model.set_attr(feature_names='|'.join(model.feature_names))
            model.save_model(os.path.join(checkpoint_dir, filename))
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            yield checkpoint


class MyXGBoostTrainer(XGBoostTrainer):
    # HERE: this override is a must-have, even though we also pass the callback via the
    # trainer's callbacks argument. In GBDTTrainer's training loop, the callback type is
    # checked; if it doesn't match, another callback is created and you end up writing
    # duplicate checkpoints without noticing.
    _tune_callback_checkpoint_cls = MyXGBoostCheckpointCallback

    @staticmethod
    def get_model(checkpoint: Checkpoint) -> xgb.Booster:
        """Retrieve the XGBoost model stored in this checkpoint."""
        with checkpoint.as_directory() as checkpoint_path:
            booster = xgb.Booster()
            booster.load_model(
                os.path.join(checkpoint_path, 'model.json')
            )
            if booster.attr('feature_names') is not None:
                booster.feature_names = booster.attr(
                    'feature_names').split('|')
            return booster

    def _save_model(self, model: xgb.Booster, path: str) -> None:
        if hasattr(model, 'feature_names'):
            model.set_attr(feature_names='|'.join(model.feature_names))
        model.save_model(os.path.join(path, 'model.json'))


dataset = ray.data.read_csv(
    "s3://anonymous@air-example-data/breast_cancer.csv")
train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)
sync_config = SyncConfig(sync_artifacts=True)
run_config = RunConfig(
    name=f"XGBoost_Test_Checkpoint_Save_Load",
    storage_path="/NAS/ShareFolder/ray_debug",
    checkpoint_config=CheckpointConfig(
        checkpoint_frequency=1,
        num_to_keep=10,
        checkpoint_at_end=True,
        checkpoint_score_attribute='train-error',
        checkpoint_score_order='min',
    ),
    failure_config=FailureConfig(max_failures=2),
    sync_config=sync_config,
)
scaling_config = ScalingConfig(
    num_workers=3,
    placement_strategy="SPREAD",
    use_gpu=False,
)
trainer = MyXGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=30,
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    # HERE
    callbacks=[MyXGBoostCheckpointCallback(
        filename="model.json", frequency=1)],
)
result = trainer.fit()
print(checkpoint := result.get_best_checkpoint('valid-logloss', 'min'))
booster = MyXGBoostTrainer.get_model(checkpoint)
print(booster.num_boosted_rounds())
print(booster.feature_names)
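With every checkpoint now written as model.json, resuming from the best checkpoint should in principle go through the same path as the original repro. A sketch reusing the objects defined above (not a verified fix):

trainer_continue = MyXGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="target",
    num_boost_round=40,  # needs to exceed the rounds already trained, per the earlier observation
    params={
        "objective": "binary:logistic",
        "eval_metric": ["logloss", "error"],
    },
    datasets={"train": train_dataset, "valid": valid_dataset},
    callbacks=[MyXGBoostCheckpointCallback(filename="model.json", frequency=1)],
    resume_from_checkpoint=checkpoint,
)
result_continue = trainer_continue.fit()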

@justinvyu (Contributor)

Thank you for the investigation! The checkpoint_at_end and checkpoint_frequency do indeed go through different codepaths, and I was able to reproduce with checkpoint_frequency=1. I'll put up a fix PR to clean this up!
