
Avoid full-dataset spill to disk when building DaskDMatrix for XGBoost on local cluster #11452

Closed
@theabrahamaudu

Description

Environment:

  • OS: Windows 11
  • RAM: 16 GB
  • Python packages: dask, xgboost (Dask API)
  • Data format: 98 GB HDF5 file, ~13 M rows × 926 columns

Background

  1. I preprocess my 98 GB HDF5 file in a chunked fashion using Dask alone, and it stays within RAM limits.
  2. Once I switch to a distributed.Client() (local cluster) and call DaskXGBRegressor.fit(), XGBoost internally builds a DaskDMatrix by touching every partition up front.
  3. As each worker exceeds its memory threshold, Dask spills those partitions to a local temp directory, eventually duplicating the entire 98 GB of source data on disk (the worker memory thresholds behind this behavior are sketched below).
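
For context, the spill in step 3 is driven by Dask's standard worker memory thresholds. Below is a minimal sketch of how they are configured; the keys are the documented distributed.worker.memory.* settings, and the fractions shown are only illustrative, not a recommendation:

    import dask

    # Fractions of each worker's memory limit at which Dask reacts.
    dask.config.set({
        "distributed.worker.memory.target": 0.60,     # start spilling managed data to disk
        "distributed.worker.memory.spill": 0.70,      # spill based on measured process memory
        "distributed.worker.memory.pause": 0.80,      # pause new task execution on the worker
        "distributed.worker.memory.terminate": 0.95,  # restart the worker
    })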

Expected behavior

  • On-demand streaming of partitions from the HDF5 source into memory, without duplicating the full dataset on disk.
  • Dask workers should only spill the minimum required blocks, not every partition, when training large models locally.

Actual behavior

  • Dask spills every partition to disk while constructing the DaskDMatrix, resulting in ~98 GB of interim files in ~/.dask/temp/ (or the OS temp dir); see the sketch after this list for pinning that location explicitly.
  • This doubles disk usage and negates the benefits of lazy, chunked loading.
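
To make those interim files easy to find and watch grow, the spill location can be pinned explicitly. A minimal sketch, assuming the standard temporary-directory config key; the path is only a placeholder:

    import dask
    from dask.distributed import Client

    # Placeholder path; point Dask's worker spill space at a known directory.
    dask.config.set({"temporary-directory": "D:/dask-spill"})
    client = Client(n_workers=2, threads_per_worker=1, memory_limit="6GB")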

Steps to reproduce

  1. Create a memory-limited local Dask cluster on the 16 GB machine:

    from dask.distributed import Client
    # Two workers capped at 6 GB each (12 GB total on a 16 GB machine)
    client = Client(n_workers=2, threads_per_worker=1, memory_limit="6GB")
  2. Load HDF5 lazily with small chunks:

    import dask.dataframe as dd
    # ~10,000 rows per partition -> roughly 1,300 partitions for ~13 M rows
    ddf = dd.read_hdf("data.h5", key="series_train_preprocessed", chunksize=10_000)
  3. Train a DaskXGBRegressor:

    from xgboost.dask import DaskXGBRegressor
    model = DaskXGBRegressor(tree_method="hist", n_estimators=200)
    model.client = client
    # ddf_target: the target column as a Dask Series, prepared the same way (not shown above)
    model.fit(ddf, ddf_target)
  4. Observe ~98 GB of spill files in the temp directory before any boosting round (a quick way to measure this is sketched after these steps).
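
One way to quantify the spill before boosting starts is to sum the file sizes under the spill directory. A minimal sketch; the path is a placeholder and should point at wherever the workers actually write their temp files:

    from pathlib import Path

    # Placeholder path; use the directory the workers report as their local storage.
    spill_dir = Path.home() / ".dask" / "temp"
    total_bytes = sum(f.stat().st_size for f in spill_dir.rglob("*") if f.is_file())
    print(f"Spilled to disk: {total_bytes / 1e9:.1f} GB")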


Request

  1. Introduce a “streaming” toggle (or similar mechanism) in distributed.Client or DaskXGBRegressor that allows truly on-demand loading directly from the source file into memory for each partition, without pre-spilling all data (a hypothetical sketch of such an API follows this list).
  2. Or, provide configuration options to limit the volume of pre-spill during DaskDMatrix construction on a local cluster, so only the actively processed partitions are written to disk.
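
To make request 1 concrete, something along these lines is what I have in mind. This is purely a hypothetical API sketch: the streaming parameter does not exist in xgboost.dask today, and client, ddf and ddf_target are the objects from the reproduction steps above:

    from xgboost.dask import DaskXGBRegressor

    # Hypothetical parameter: streaming=True does not exist in xgboost.dask today.
    # The idea: DaskDMatrix construction pulls partitions from the HDF5 source on
    # demand instead of materializing (and spilling) every partition up front.
    model = DaskXGBRegressor(tree_method="hist", n_estimators=200, streaming=True)
    model.client = client        # client, ddf, ddf_target as in the reproduction steps
    model.fit(ddf, ddf_target)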

Thanks for any guidance or suggestions to avoid this full-dataset duplication.
