
Avoid full-dataset spill to disk when building DaskDMatrix for XGBoost on local cluster #11452

Closed
@theabrahamaudu

Description

Environment:

  • OS: Windows 11
  • RAM: 16 GB
  • Python packages: dask, xgboost (Dask API)
  • Data format: 98 GB HDF5 file, ~13 M rows × 926 columns

Background

  1. I preprocess my 98 GB HDF5 file in a chunked fashion using Dask alone, and it stays within RAM limits.
  2. Once I switch to a distributed.Client() (local cluster) and call DaskXGBRegressor.fit(), XGBoost internally builds a DaskDMatrix by touching every partition up front.
  3. As each worker exceeds its memory threshold, Dask spills those partitions to a local temp directory, eventually duplicating the entire 98 GB of source data on disk (the worker memory thresholds behind this behavior are sketched below).
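
For context, the spill in step 3 is driven by Dask's standard worker memory thresholds. Below is a minimal sketch of how they are configured; the keys are the documented distributed.worker.memory.* settings, and the fractions shown are only illustrative, not a recommendation:

    import dask

    # Fractions of each worker's memory limit at which Dask reacts.
    dask.config.set({
        "distributed.worker.memory.target": 0.60,     # start spilling managed data to disk
        "distributed.worker.memory.spill": 0.70,      # spill based on measured process memory
        "distributed.worker.memory.pause": 0.80,      # pause new task execution on the worker
        "distributed.worker.memory.terminate": 0.95,  # restart the worker
    })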

Expected behavior

  • On-demand streaming of partitions from the HDF5 source into memory, without duplicating the full dataset on disk.
  • Dask workers should only spill the minimum required blocks, not every partition, when training large models locally.

Actual behavior

  • Dask spills every partition to disk while constructing the DaskDMatrix, resulting in ~98 GB of interim files in ~/.dask/temp/ (or the OS temp dir); see the sketch after this list for pinning that location explicitly.
  • This doubles disk usage and negates the benefits of lazy, chunked loading.
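
To make those interim files easy to find and watch grow, the spill location can be pinned explicitly. A minimal sketch, assuming the standard temporary-directory config key; the path is only a placeholder:

    import dask
    from dask.distributed import Client

    # Placeholder path; point Dask's worker spill space at a known directory.
    dask.config.set({"temporary-directory": "D:/dask-spill"})
    client = Client(n_workers=2, threads_per_worker=1, memory_limit="6GB")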

Steps to reproduce

  1. Create a memory-limited local Dask cluster on the 16 GB machine:

    from dask.distributed import Client
    # Two workers capped at 6 GB each (12 GB total on a 16 GB machine)
    client = Client(n_workers=2, threads_per_worker=1, memory_limit="6GB")
  2. Load HDF5 lazily with small chunks:

    import dask.dataframe as dd
    # ~10,000 rows per partition -> roughly 1,300 partitions for ~13 M rows
    ddf = dd.read_hdf("data.h5", key="series_train_preprocessed", chunksize=10_000)
  3. Train a DaskXGBRegressor:

    from xgboost.dask import DaskXGBRegressor
    model = DaskXGBRegressor(tree_method="hist", n_estimators=200)
    model.client = client
    # ddf_target: the target column as a Dask Series, prepared the same way (not shown above)
    model.fit(ddf, ddf_target)
  4. Observe ~98 GB of spill files in the temp directory before any boosting round (a quick way to measure this is sketched after these steps).
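
One way to quantify the spill before boosting starts is to sum the file sizes under the spill directory. A minimal sketch; the path is a placeholder and should point at wherever the workers actually write their temp files:

    from pathlib import Path

    # Placeholder path; use the directory the workers report as their local storage.
    spill_dir = Path.home() / ".dask" / "temp"
    total_bytes = sum(f.stat().st_size for f in spill_dir.rglob("*") if f.is_file())
    print(f"Spilled to disk: {total_bytes / 1e9:.1f} GB")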


Request

  1. Introduce a “streaming” toggle (or similar mechanism) in distributed.Client or DaskXGBRegressor that allows truly on-demand loading directly from the source file into memory for each partition, without pre-spilling all data (a hypothetical sketch of such an API follows this list).
  2. Or, provide configuration options to limit the volume of pre-spill during DaskDMatrix construction on a local cluster, so only the actively processed partitions are written to disk.
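
To make request 1 concrete, something along these lines is what I have in mind. This is purely a hypothetical API sketch: the streaming parameter does not exist in xgboost.dask today, and client, ddf and ddf_target are the objects from the reproduction steps above:

    from xgboost.dask import DaskXGBRegressor

    # Hypothetical parameter: streaming=True does not exist in xgboost.dask today.
    # The idea: DaskDMatrix construction pulls partitions from the HDF5 source on
    # demand instead of materializing (and spilling) every partition up front.
    model = DaskXGBRegressor(tree_method="hist", n_estimators=200, streaming=True)
    model.client = client        # client, ddf, ddf_target as in the reproduction steps
    model.fit(ddf, ddf_target)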

Thanks for any guidance or suggestions to avoid this full-dataset duplication.
