Description
Environment:
- OS: Windows 11
- RAM: 16 GB
- Python packages: `dask`, `xgboost` (Dask API)
- Data format: 98 GB HDF5 file, ~13 M rows × 926 columns
Background
- I preprocess my 98 GB HDF5 file in chunked fashion using Dask alone, and it works within RAM limits.
- Once I switch to a `distributed.Client()` (local cluster) and call `DaskXGBRegressor.fit()`, XGBoost internally builds a `DaskDMatrix` by touching every partition up front (see the sketch after this list).
- As each worker exceeds its memory threshold, Dask spills those partitions to a local temp directory, eventually duplicating the entire 98 GB of source data on disk.
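
For context, a minimal sketch of the step described above: constructing the `DaskDMatrix` directly, which is roughly the first thing `DaskXGBRegressor.fit()` does. It assumes the `client`, `ddf`, and `ddf_target` objects from the reproduction steps below.

```python
from xgboost.dask import DaskDMatrix

# Roughly the first thing DaskXGBRegressor.fit() does: wrap the Dask collections
# in a DaskDMatrix. This references every partition up front, which is the point
# at which the workers start materializing (and spilling) the data.
dtrain = DaskDMatrix(client, ddf, ddf_target)
```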
Expected behavior
- On-demand streaming of partitions from the HDF5 source into memory, without duplicating the full dataset on disk (a sketch of what this could look like follows after this list).
- Dask workers should only spill the minimum required blocks, not every partition, when training large models locally.
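
For illustration only, this is roughly what the on-demand pattern above could look like outside of Dask, using XGBoost's external-memory `DataIter` interface together with `pandas.read_hdf` and `start`/`stop` bounds. The class name `HDF5Iter`, the batch size, the row count, and the assumption that the label lives in a column named `"target"` in the same table are illustrative, not taken from my setup.

```python
import pandas as pd
import xgboost as xgb


class HDF5Iter(xgb.DataIter):
    """Stream fixed-size row batches from an HDF5 table without loading it all."""

    def __init__(self, path, key, target, n_rows, batch_rows=100_000):
        self._path = path
        self._key = key
        self._target = target          # name of the label column (assumed)
        self._n_rows = n_rows          # total rows in the table
        self._batch_rows = batch_rows
        self._offset = 0
        super().__init__(cache_prefix="xgb-cache")  # on-disk cache used by XGBoost

    def next(self, input_data):
        if self._offset >= self._n_rows:
            return 0  # no more batches
        stop = min(self._offset + self._batch_rows, self._n_rows)
        chunk = pd.read_hdf(self._path, key=self._key, start=self._offset, stop=stop)
        y = chunk.pop(self._target)
        input_data(data=chunk, label=y)
        self._offset = stop
        return 1

    def reset(self):
        self._offset = 0


# Batches are pulled on demand; only one chunk is resident at a time.
it = HDF5Iter("data.h5", key="series_train_preprocessed", target="target", n_rows=13_000_000)
dtrain = xgb.DMatrix(it)
booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=200)
```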
Actual behavior
- Dask spills every partition to disk while constructing the `DaskDMatrix`, resulting in ~98 GB of interim files in `~/.dask/temp/` (or the OS temp dir); the configuration sketch after this list shows where these spill thresholds and paths are controlled.
- This doubles disk usage and negates the benefits of lazy, chunked loading.
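
As a stop-gap, the spill location and thresholds can at least be adjusted through Dask's worker-memory settings. A minimal sketch, using the documented default fractions and an illustrative spill path; this only controls where and when spilling happens, it does not avoid the duplication itself.

```python
import dask
from dask.distributed import Client, LocalCluster

# Set these before creating the cluster. The fractions shown are the documented
# defaults; the spill path is an illustrative placeholder.
dask.config.set({
    "temporary-directory": "D:/dask-spill",        # where workers write spill files
    "distributed.worker.memory.target": 0.60,      # start moving data to disk above 60% of memory_limit
    "distributed.worker.memory.spill": 0.70,       # spill based on process memory above 70%
    "distributed.worker.memory.pause": 0.80,       # pause new tasks above 80%
})

cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="6GB")
client = Client(cluster)
```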
Steps to reproduce
- Create a memory-limited local Dask cluster (2 workers × 6 GB) on the 16 GB machine:

  ```python
  from dask.distributed import Client

  # Two single-threaded workers, each capped at 6 GB (12 GB total on a 16 GB machine).
  client = Client(n_workers=2, threads_per_worker=1, memory_limit="6GB")
  ```
- Load the HDF5 file lazily with small chunks:

  ```python
  import dask.dataframe as dd

  # Lazy, chunked read: nothing is loaded into memory at this point.
  ddf = dd.read_hdf("data.h5", key="series_train_preprocessed", chunksize=10_000)
  ```
- Train a DaskXGBRegressor:

  ```python
  from xgboost.dask import DaskXGBRegressor

  model = DaskXGBRegressor(tree_method="hist", n_estimators=200)
  model.client = client
  # ddf_target is the target series, prepared separately (not shown in this report).
  model.fit(ddf, ddf_target)
  ```
- Observe ~98 GB of spill files in the temp directory before any boosting round starts (a helper for measuring this follows after this list).
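
To quantify the spill from the last step, a small helper along these lines can be pointed at the workers' temp directory (the exact path is system-dependent; check the worker logs or dashboard):

```python
import os


def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / 1e9


# Adjust to wherever your workers actually spill (this path is an assumption).
spill_dir = os.path.expanduser("~/.dask/temp")
print(f"spill files: {dir_size_gb(spill_dir):.1f} GB")
```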
Request
- Introduce a “streaming” toggle (or similar mechanism) in `distributed.Client` or `DaskXGBRegressor` that allows truly on-demand loading directly from the source file into memory for each partition, without pre-spilling all data (a purely hypothetical sketch of such an API follows after this list).
- Or, provide configuration options to limit the volume of pre-spill during `DaskDMatrix` construction on a local cluster, so that only the actively processed partitions are written to disk.
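
Purely as an illustration of the first request, the toggle might look something like this; neither keyword argument exists in the current API:

```python
# Hypothetical API shape only -- these keyword arguments do not exist today.
model = DaskXGBRegressor(
    tree_method="hist",
    n_estimators=200,
    streaming=True,             # hypothetical: pull partitions from the source on demand
    max_prespill_bytes="8GB",   # hypothetical: cap interim spill during DaskDMatrix construction
)
```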
Thanks for any guidance or suggestions to avoid this full-dataset duplication.