Commit acb64f7

[EM] Push for higher device memory utilization. (dmlc#10895)
- Allow some pages to be in device memory for the validation dataset.
- Allow bounding the stream length for quantile sketching.
- Rename the parameter `extmem_concat_pages` to `extmem_single_page` to avoid confusion with the cache optimization.
1 parent 8af7062 commit acb64f7


53 files changed: +655 −329 lines

doc/parameter.rst (+1 −1)

@@ -245,7 +245,7 @@ Parameters for Non-Exact Tree Methods
   trees. After 3.0, this parameter affects GPU algorithms as well.
 
 
-* ``extmem_concat_pages``, [default = ``false``]
+* ``extmem_single_page``, [default = ``false``]
 
   This parameter is only used for the ``hist`` tree method with ``device=cuda`` and
   ``subsample != 1.0``. Before 3.0, pages were always concatenated.

doc/tutorials/external_memory.rst (+18 −10)

@@ -25,7 +25,7 @@ The external memory support has undergone multiple development iterations. Like
 :py:class:`~xgboost.QuantileDMatrix` with :py:class:`~xgboost.DataIter`, XGBoost loads
 data batch-by-batch using a custom iterator supplied by the user. However, unlike the
 :py:class:`~xgboost.QuantileDMatrix`, external memory does not concatenate the batches
-(unless specified by the ``extmem_concat_pages``) . Instead, it caches all batches in the
+(unless specified by the ``extmem_single_page``) . Instead, it caches all batches in the
 external memory and fetch them on-demand. Go to the end of the document to see a
 comparison between :py:class:`~xgboost.QuantileDMatrix` and the external memory version of
 :py:class:`~xgboost.ExtMemQuantileDMatrix`.
@@ -120,8 +120,11 @@ the ``hist`` tree method is employed. For a GPU device, the main memory is the d
 memory, whereas the external memory can be either a disk or the CPU memory. XGBoost stages
 the cache on CPU memory by default. Users can change the backing storage to disk by
 specifying the ``on_host`` parameter in the :py:class:`~xgboost.DataIter`. However, using
-the disk is not recommended. It's likely to make the GPU slower than the CPU. The option is
-here for experimental purposes only.
+the disk is not recommended as it's likely to make the GPU slower than the CPU. The option
+is here for experimental purposes only. In addition,
+:py:class:`~xgboost.ExtMemQuantileDMatrix` parameters ``max_num_device_pages``,
+``min_cache_page_bytes``, and ``max_quantile_batches`` can help control the data placement
+and memory usage.
 
 Inputs to the :py:class:`~xgboost.ExtMemQuantileDMatrix` (through the iterator) must be on
 the GPU. This is a current limitation we aim to address in the future.
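The hunk above documents ``max_num_device_pages`` and ``min_cache_page_bytes`` as knobs over data placement without showing the placement logic. As a hedged, pure-Python sketch of what such a policy could look like (the function name and the greedy first-fit policy are assumptions for illustration, not XGBoost's actual implementation):

```python
def place_pages(page_sizes_bytes, max_num_device_pages, min_cache_page_bytes):
    """Hypothetical placement policy: keep at most max_num_device_pages
    pages resident in device memory, treating a page as a device-cache
    candidate only if it is at least min_cache_page_bytes large; every
    other page stays in the host cache."""
    device, host = [], []
    for idx, nbytes in enumerate(page_sizes_bytes):
        if len(device) < max_num_device_pages and nbytes >= min_cache_page_bytes:
            device.append(idx)
        else:
            host.append(idx)
    return device, host

# Five 1 MiB pages with a budget of two device-resident pages:
device, host = place_pages([1 << 20] * 5,
                           max_num_device_pages=2,
                           min_cache_page_bytes=1 << 10)
```

With this toy budget, pages 0 and 1 land in device memory and the remaining three stay on the host, which is the kind of split the parameters above let users bound.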
@@ -157,12 +160,17 @@ the GPU. This is a current limitation we aim to address in the future.
     evals=[(Xy_train, "Train"), (Xy_valid, "Valid")]
 )
 
-It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ for
-all memory allocation when training with external memory. XGBoost relies on the memory
-pool to reduce the overhead for data fetching. In addition, the open source `NVIDIA Linux
-driver
+It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ with
+an asynchronous memory resource for all memory allocation when training with external
+memory. XGBoost relies on the asynchronous memory pool to reduce the overhead of data
+fetching. In addition, the open source `NVIDIA Linux driver
 <https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/>`__
-is required for ``Heterogeneous memory management (HMM)`` support.
+is required for ``Heterogeneous memory management (HMM)`` support. Usually, users need not
+to change :py:class:`~xgboost.ExtMemQuantileDMatrix` parameters ``max_num_device_pages``
+and ``min_cache_page_bytes``, they are automatically configured based on the device and
+don't change model accuracy. However, the ``max_quantile_batches`` can be useful if
+:py:class:`~xgboost.ExtMemQuantileDMatrix` is running out of device memory during
+construction, see :py:class:`~xgboost.QuantileDMatrix` for more info.
 
 In addition to the batch-based data fetching, the GPU version supports concatenating
 batches into a single blob for the training data to improve performance. For GPUs
@@ -181,7 +189,7 @@ concatenation can be enabled by:
 
     param = {
         "device": "cuda",
-        "extmem_concat_pages": true,
+        "extmem_single_page": true,
         'subsample': 0.2,
         'sampling_method': 'gradient_based',
     }
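The rst snippet in the hunk above is the commit's own text. For reference, as plain Python the same configuration is an ordinary dict with a boolean literal; the ``subsample`` and ``sampling_method`` values here simply follow the tutorial's gradient-based sampling example:

```python
# Training parameters after the rename in this commit: the page-concatenation
# switch is now called "extmem_single_page" (previously "extmem_concat_pages").
param = {
    "device": "cuda",
    "extmem_single_page": True,
    "subsample": 0.2,
    "sampling_method": "gradient_based",
}

# The old parameter name is gone after the rename.
assert "extmem_concat_pages" not in param
```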
@@ -200,7 +208,7 @@ interconnect between the CPU and the GPU. With the host memory serving as the da
 XGBoost can retrieve data with significantly lower overhead. When the input data is dense,
 there's minimal to no performance loss for training, except for the initial construction
 of the :py:class:`~xgboost.ExtMemQuantileDMatrix`. The initial construction iterates
-through the input data twice, as a result, the most significantly overhead compared to
+through the input data twice, as a result, the most significant overhead compared to
 in-core training is one additional data read when the data is dense. Please note that
 there are multiple variants of the platform and they come with different C2C
 bandwidths. During initial development of the feature, we used the LPDDR5 480G version,

include/xgboost/c_api.h (+44 −27)

@@ -308,35 +308,40 @@ XGB_DLL int XGDMatrixCreateFromCudaArrayInterface(char const *data, char const *
  * used by JVM packages. It uses `XGBoostBatchCSR` to accept batches for CSR formated
  * input, and concatenate them into 1 final big CSR. The related functions are:
  *
- * - \ref XGBCallbackSetData
- * - \ref XGBCallbackDataIterNext
- * - \ref XGDMatrixCreateFromDataIter
+ * - @ref XGBCallbackSetData
+ * - @ref XGBCallbackDataIterNext
+ * - @ref XGDMatrixCreateFromDataIter
  *
- * Another set is used by external data iterator. It accept foreign data iterators as
+ * Another set is used by external data iterator. It accepts foreign data iterators as
  * callbacks. There are 2 different senarios where users might want to pass in callbacks
- * instead of raw data. First it's the Quantile DMatrix used by hist and GPU Hist. For
- * this case, the data is first compressed by quantile sketching then merged. This is
- * particular useful for distributed setting as it eliminates 2 copies of data. 1 by a
- * `concat` from external library to make the data into a blob for normal DMatrix
- * initialization, another by the internal CSR copy of DMatrix. The second use case is
- * external memory support where users can pass a custom data iterator into XGBoost for
- * loading data in batches. There are short notes on each of the use cases in respected
- * DMatrix factory function.
+ * instead of raw data. First it's the Quantile DMatrix used by the hist and GPU-based
+ * hist tree method. For this case, the data is first compressed by quantile sketching
+ * then merged. This is particular useful for distributed setting as it eliminates 2
+ * copies of data. First one by a `concat` from external library to make the data into a
+ * blob for normal DMatrix initialization, another one by the internal CSR copy of
+ * DMatrix.
+ *
+ * The second use case is external memory support where users can pass a custom data
+ * iterator into XGBoost for loading data in batches. For both cases, the iterator is only
+ * used during the construction of the DMatrix and can be safely freed after construction
+ * finishes. There are short notes on each of the use cases in respected DMatrix factory
+ * function.
  *
  * Related functions are:
  *
  * # Factory functions
- * - \ref XGDMatrixCreateFromCallback for external memory
- * - \ref XGQuantileDMatrixCreateFromCallback for quantile DMatrix
+ * - @ref XGDMatrixCreateFromCallback for external memory
+ * - @ref XGQuantileDMatrixCreateFromCallback for quantile DMatrix
+ * - @ref XGExtMemQuantileDMatrixCreateFromCallback for External memory Quantile DMatrix
  *
  * # Proxy that callers can use to pass data to XGBoost
- * - \ref XGProxyDMatrixCreate
- * - \ref XGDMatrixCallbackNext
- * - \ref DataIterResetCallback
- * - \ref XGProxyDMatrixSetDataCudaArrayInterface
- * - \ref XGProxyDMatrixSetDataCudaColumnar
- * - \ref XGProxyDMatrixSetDataDense
- * - \ref XGProxyDMatrixSetDataCSR
+ * - @ref XGProxyDMatrixCreate
+ * - @ref XGDMatrixCallbackNext
+ * - @ref DataIterResetCallback
+ * - @ref XGProxyDMatrixSetDataCudaArrayInterface
+ * - @ref XGProxyDMatrixSetDataCudaColumnar
+ * - @ref XGProxyDMatrixSetDataDense
+ * - @ref XGProxyDMatrixSetDataCSR
  * - ... (data setters)
  *
  * @{
@@ -346,7 +351,7 @@ XGB_DLL int XGDMatrixCreateFromCudaArrayInterface(char const *data, char const *
 
 /*! \brief handle to a external data iterator */
 typedef void *DataIterHandle;  // NOLINT(*)
-/*! \brief handle to a internal data holder. */
+/** @brief handle to an internal data holder. */
 typedef void *DataHolderHandle;  // NOLINT(*)
 
 
@@ -473,7 +478,7 @@ XGB_DLL int XGDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHandle proxy
  */
 
 /**
- * @brief Create a Quantile DMatrix with data iterator.
+ * @brief Create a Quantile DMatrix with a data iterator.
  *
  * Short note for how to use the second set of callback for (GPU)Hist tree method:
  *
@@ -494,7 +499,13 @@ XGB_DLL int XGDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHandle proxy
  * - missing: Which value to represent missing value
  * - nthread (optional): Number of threads used for initializing DMatrix.
  * - max_bin (optional): Maximum number of bins for building histogram. Must be consistent with
-     the corresponding booster training parameter.
+ *   the corresponding booster training parameter.
+ * - max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming
+ *   batches with multiple growing substreams. This parameter sets the maximum number
+ *   of batches before XGBoost can cut the sub-stream and create a new one. This can
+ *   help bound the memory usage. By default, XGBoost grows new sub-streams
+ *   exponentially until batches are exhausted. Only used for the training dataset and
+ *   the default is None (unbounded).
  * @param out The created Quantile DMatrix.
  *
  * @return 0 when success, -1 when failure happens
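The ``max_quantile_blocks`` documentation above describes sub-streams that grow exponentially unless cut at a bound. That policy can be modeled in a few lines of pure Python; this is an illustrative sketch of the documented behavior, not XGBoost's actual sketching code, and the function name is made up:

```python
def split_substreams(n_batches, max_quantile_blocks=None):
    """Model of the documented policy: sub-stream capacity grows
    exponentially (1, 2, 4, ...), but a sub-stream is cut early once it
    holds max_quantile_blocks batches. None means unbounded growth."""
    substreams, current, capacity = [], [], 1
    for batch in range(n_batches):
        current.append(batch)
        limit = capacity if max_quantile_blocks is None else min(capacity, max_quantile_blocks)
        if len(current) >= limit:
            substreams.append(current)  # cut the sub-stream here
            current, capacity = [], capacity * 2
    if current:
        substreams.append(current)
    return substreams

# Unbounded: 7 batches split into sub-streams of sizes 1, 2, 4.
# Bounded at 2: no sub-stream ever holds more than 2 batches, which is how
# the parameter bounds peak memory during quantile sketching.
```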
@@ -509,7 +520,7 @@ XGB_DLL int XGQuantileDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHand
  *
  * @since 3.0.0
  *
- * @note This is still under development, not ready for test yet.
+ * @note This is experimental and subject to change.
  *
  * @param iter A handle to external data iterator.
  * @param proxy A DMatrix proxy handle created by @ref XGProxyDMatrixCreate.
@@ -521,12 +532,18 @@ XGB_DLL int XGQuantileDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHand
  * - cache_prefix: The path of cache file, caller must initialize all the directories in this path.
  * - nthread (optional): Number of threads used for initializing DMatrix.
  * - max_bin (optional): Maximum number of bins for building histogram. Must be consistent with
-     the corresponding booster training parameter.
+ *   the corresponding booster training parameter.
  * - on_host (optional): Whether the data should be placed on host memory. Used by GPU inputs.
  * - min_cache_page_bytes (optional): The minimum number of bytes for each internal GPU
  *   page. Set to 0 to disable page concatenation. Automatic configuration if the
  *   parameter is not provided or set to None.
- * @param out The created Quantile DMatrix.
+ * - max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming
+ *   batches with multiple growing substreams. This parameter sets the maximum number
+ *   of batches before XGBoost can cut the sub-stream and create a new one. This can
+ *   help bound the memory usage. By default, XGBoost grows new sub-streams
+ *   exponentially until batches are exhausted. Only used for the training dataset and
+ *   the default is None (unbounded).
+ * @param out The created Quantile DMatrix.
  *
  * @return 0 when success, -1 when failure happens
  */

include/xgboost/data.h (+16 −2)

@@ -517,6 +517,7 @@ class BatchSet {
 
 struct XGBAPIThreadLocalEntry;
 
+// Configuration for external memoroy DMatrix.
 struct ExtMemConfig {
   // Cache prefix, not used if the cache is in the host memory. (on_host is true)
   std::string cache;
@@ -527,8 +528,20 @@ struct ExtMemConfig {
   std::int64_t min_cache_page_bytes{0};
   // Missing value.
   float missing{std::numeric_limits<float>::quiet_NaN()};
+  // Maximum number of pages cached in device.
+  std::int64_t max_num_device_pages{0};
   // The number of CPU threads.
   std::int32_t n_threads{0};
+
+  ExtMemConfig() = default;
+  ExtMemConfig(std::string cache, bool on_host, std::int64_t min_cache, float missing,
+               std::int64_t max_num_d, std::int32_t n_threads)
+      : cache{std::move(cache)},
+        on_host{on_host},
+        min_cache_page_bytes{min_cache},
+        missing{missing},
+        max_num_device_pages{max_num_d},
+        n_threads{n_threads} {}
 };
 
 /**
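The ``ExtMemConfig`` hunk above adds the ``max_num_device_pages`` field and an explicit constructor. For illustration only, the resulting field layout can be mirrored as a Python dataclass; this is not an XGBoost class, and the ``on_host`` default is an assumption since the diff does not show that field's initializer:

```python
import math
from dataclasses import dataclass

@dataclass
class ExtMemConfigSketch:
    """Illustrative Python mirror of the C++ ExtMemConfig struct after this
    commit. Names and defaults follow the diff where visible; the on_host
    default is assumed."""
    cache: str = ""                   # cache prefix; unused when the cache is in host memory
    on_host: bool = True              # assumed default
    min_cache_page_bytes: int = 0     # minimum bytes for each internal GPU page
    missing: float = math.nan         # value that represents missing entries
    max_num_device_pages: int = 0     # new in this commit: pages cached in device memory
    n_threads: int = 0                # number of CPU threads

cfg = ExtMemConfigSketch(cache="/tmp/xgb-cache", max_num_device_pages=1)
```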
@@ -637,7 +650,7 @@ class DMatrix {
             typename XGDMatrixCallbackNext>
   static DMatrix* Create(DataIterHandle iter, DMatrixHandle proxy, std::shared_ptr<DMatrix> ref,
                          DataIterResetCallback* reset, XGDMatrixCallbackNext* next, float missing,
-                         std::int32_t nthread, bst_bin_t max_bin);
+                         std::int32_t nthread, bst_bin_t max_bin, std::int64_t max_quantile_blocks);
 
   /**
    * @brief Create an external memory DMatrix with callbacks.
@@ -671,7 +684,8 @@ class DMatrix {
             typename XGDMatrixCallbackNext>
   static DMatrix* Create(DataIterHandle iter, DMatrixHandle proxy, std::shared_ptr<DMatrix> ref,
                          DataIterResetCallback* reset, XGDMatrixCallbackNext* next,
-                         bst_bin_t max_bin, ExtMemConfig const& config);
+                         bst_bin_t max_bin, std::int64_t max_quantile_blocks,
+                         ExtMemConfig const& config);
 
   virtual DMatrix *Slice(common::Span<int32_t const> ridxs) = 0;