Commit acb64f7

[EM] Push for higher device memory utilization. (dmlc#10895)
- Allow some pages to be in device memory for the validation dataset.
- Allow bounding the stream length for quantile sketching.
- Rename the parameter `extmem_concat_pages` to `extmem_single_page` to avoid confusion with the cache optimization.
1 parent 8af7062 commit acb64f7


53 files changed: +655 −329 lines

doc/parameter.rst (+1 −1)

@@ -245,7 +245,7 @@ Parameters for Non-Exact Tree Methods
   trees. After 3.0, this parameter affects GPU algorithms as well.
 
 
-* ``extmem_concat_pages``, [default = ``false``]
+* ``extmem_single_page``, [default = ``false``]
 
   This parameter is only used for the ``hist`` tree method with ``device=cuda`` and
   ``subsample != 1.0``. Before 3.0, pages were always concatenated.

doc/tutorials/external_memory.rst (+18 −10)

@@ -25,7 +25,7 @@ The external memory support has undergone multiple development iterations. Like
 :py:class:`~xgboost.QuantileDMatrix` with :py:class:`~xgboost.DataIter`, XGBoost loads
 data batch-by-batch using a custom iterator supplied by the user. However, unlike the
 :py:class:`~xgboost.QuantileDMatrix`, external memory does not concatenate the batches
-(unless specified by the ``extmem_concat_pages``) . Instead, it caches all batches in the
+(unless specified by the ``extmem_single_page``) . Instead, it caches all batches in the
 external memory and fetch them on-demand. Go to the end of the document to see a
 comparison between :py:class:`~xgboost.QuantileDMatrix` and the external memory version of
 :py:class:`~xgboost.ExtMemQuantileDMatrix`.
@@ -120,8 +120,11 @@ the ``hist`` tree method is employed. For a GPU device, the main memory is the d
 memory, whereas the external memory can be either a disk or the CPU memory. XGBoost stages
 the cache on CPU memory by default. Users can change the backing storage to disk by
 specifying the ``on_host`` parameter in the :py:class:`~xgboost.DataIter`. However, using
-the disk is not recommended. It's likely to make the GPU slower than the CPU. The option is
-here for experimental purposes only.
+the disk is not recommended as it's likely to make the GPU slower than the CPU. The option
+is here for experimental purposes only. In addition,
+:py:class:`~xgboost.ExtMemQuantileDMatrix` parameters ``max_num_device_pages``,
+``min_cache_page_bytes``, and ``max_quantile_batches`` can help control the data placement
+and memory usage.
 
 Inputs to the :py:class:`~xgboost.ExtMemQuantileDMatrix` (through the iterator) must be on
 the GPU. This is a current limitation we aim to address in the future.
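The hunk above documents ``max_num_device_pages`` and ``min_cache_page_bytes`` as knobs over data placement without showing the placement logic. As a hedged, pure-Python sketch of what such a policy could look like (the function name and the greedy first-fit policy are assumptions for illustration, not XGBoost's actual implementation):

```python
def place_pages(page_sizes_bytes, max_num_device_pages, min_cache_page_bytes):
    """Hypothetical placement policy: keep at most max_num_device_pages
    pages resident in device memory, treating a page as a device-cache
    candidate only if it is at least min_cache_page_bytes large; every
    other page stays in the host cache."""
    device, host = [], []
    for idx, nbytes in enumerate(page_sizes_bytes):
        if len(device) < max_num_device_pages and nbytes >= min_cache_page_bytes:
            device.append(idx)
        else:
            host.append(idx)
    return device, host

# Five 1 MiB pages with a budget of two device-resident pages:
device, host = place_pages([1 << 20] * 5,
                           max_num_device_pages=2,
                           min_cache_page_bytes=1 << 10)
```

With this toy budget, pages 0 and 1 land in device memory and the remaining three stay on the host, which is the kind of split the parameters above let users bound.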
@@ -157,12 +160,17 @@ the GPU. This is a current limitation we aim to address in the future.
     evals=[(Xy_train, "Train"), (Xy_valid, "Valid")]
 )
 
-It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ for
-all memory allocation when training with external memory. XGBoost relies on the memory
-pool to reduce the overhead for data fetching. In addition, the open source `NVIDIA Linux
-driver
+It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ with
+an asynchronous memory resource for all memory allocation when training with external
+memory. XGBoost relies on the asynchronous memory pool to reduce the overhead of data
+fetching. In addition, the open source `NVIDIA Linux driver
 <https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/>`__
-is required for ``Heterogeneous memory management (HMM)`` support.
+is required for ``Heterogeneous memory management (HMM)`` support. Usually, users need not
+to change :py:class:`~xgboost.ExtMemQuantileDMatrix` parameters ``max_num_device_pages``
+and ``min_cache_page_bytes``, they are automatically configured based on the device and
+don't change model accuracy. However, the ``max_quantile_batches`` can be useful if
+:py:class:`~xgboost.ExtMemQuantileDMatrix` is running out of device memory during
+construction, see :py:class:`~xgboost.QuantileDMatrix` for more info.
 
 In addition to the batch-based data fetching, the GPU version supports concatenating
 batches into a single blob for the training data to improve performance. For GPUs
@@ -181,7 +189,7 @@ concatenation can be enabled by:
 
     param = {
         "device": "cuda",
-        "extmem_concat_pages": true,
+        "extmem_single_page": true,
         'subsample': 0.2,
         'sampling_method': 'gradient_based',
     }
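The rst snippet in the hunk above is the commit's own text. For reference, as plain Python the same configuration is an ordinary dict with a boolean literal; the ``subsample`` and ``sampling_method`` values here simply follow the tutorial's gradient-based sampling example:

```python
# Training parameters after the rename in this commit: the page-concatenation
# switch is now called "extmem_single_page" (previously "extmem_concat_pages").
param = {
    "device": "cuda",
    "extmem_single_page": True,
    "subsample": 0.2,
    "sampling_method": "gradient_based",
}

# The old parameter name is gone after the rename.
assert "extmem_concat_pages" not in param
```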
@@ -200,7 +208,7 @@ interconnect between the CPU and the GPU. With the host memory serving as the da
 XGBoost can retrieve data with significantly lower overhead. When the input data is dense,
 there's minimal to no performance loss for training, except for the initial construction
 of the :py:class:`~xgboost.ExtMemQuantileDMatrix`. The initial construction iterates
-through the input data twice, as a result, the most significantly overhead compared to
+through the input data twice, as a result, the most significant overhead compared to
 in-core training is one additional data read when the data is dense. Please note that
 there are multiple variants of the platform and they come with different C2C
 bandwidths. During initial development of the feature, we used the LPDDR5 480G version,

include/xgboost/c_api.h (+44 −27)

@@ -308,35 +308,40 @@ XGB_DLL int XGDMatrixCreateFromCudaArrayInterface(char const *data, char const *
  * used by JVM packages. It uses `XGBoostBatchCSR` to accept batches for CSR formated
  * input, and concatenate them into 1 final big CSR. The related functions are:
  *
- * - \ref XGBCallbackSetData
- * - \ref XGBCallbackDataIterNext
- * - \ref XGDMatrixCreateFromDataIter
+ * - @ref XGBCallbackSetData
+ * - @ref XGBCallbackDataIterNext
+ * - @ref XGDMatrixCreateFromDataIter
  *
- * Another set is used by external data iterator. It accept foreign data iterators as
+ * Another set is used by external data iterator. It accepts foreign data iterators as
  * callbacks. There are 2 different senarios where users might want to pass in callbacks
- * instead of raw data. First it's the Quantile DMatrix used by hist and GPU Hist. For
- * this case, the data is first compressed by quantile sketching then merged. This is
- * particular useful for distributed setting as it eliminates 2 copies of data. 1 by a
- * `concat` from external library to make the data into a blob for normal DMatrix
- * initialization, another by the internal CSR copy of DMatrix. The second use case is
- * external memory support where users can pass a custom data iterator into XGBoost for
- * loading data in batches. There are short notes on each of the use cases in respected
- * DMatrix factory function.
+ * instead of raw data. First it's the Quantile DMatrix used by the hist and GPU-based
+ * hist tree method. For this case, the data is first compressed by quantile sketching
+ * then merged. This is particular useful for distributed setting as it eliminates 2
+ * copies of data. First one by a `concat` from external library to make the data into a
+ * blob for normal DMatrix initialization, another one by the internal CSR copy of
+ * DMatrix.
+ *
+ * The second use case is external memory support where users can pass a custom data
+ * iterator into XGBoost for loading data in batches. For both cases, the iterator is only
+ * used during the construction of the DMatrix and can be safely freed after construction
+ * finishes. There are short notes on each of the use cases in respected DMatrix factory
+ * function.
  *
  * Related functions are:
  *
  * # Factory functions
- * - \ref XGDMatrixCreateFromCallback for external memory
- * - \ref XGQuantileDMatrixCreateFromCallback for quantile DMatrix
+ * - @ref XGDMatrixCreateFromCallback for external memory
+ * - @ref XGQuantileDMatrixCreateFromCallback for quantile DMatrix
+ * - @ref XGExtMemQuantileDMatrixCreateFromCallback for External memory Quantile DMatrix
  *
  * # Proxy that callers can use to pass data to XGBoost
- * - \ref XGProxyDMatrixCreate
- * - \ref XGDMatrixCallbackNext
- * - \ref DataIterResetCallback
- * - \ref XGProxyDMatrixSetDataCudaArrayInterface
- * - \ref XGProxyDMatrixSetDataCudaColumnar
- * - \ref XGProxyDMatrixSetDataDense
- * - \ref XGProxyDMatrixSetDataCSR
+ * - @ref XGProxyDMatrixCreate
+ * - @ref XGDMatrixCallbackNext
+ * - @ref DataIterResetCallback
+ * - @ref XGProxyDMatrixSetDataCudaArrayInterface
+ * - @ref XGProxyDMatrixSetDataCudaColumnar
+ * - @ref XGProxyDMatrixSetDataDense
+ * - @ref XGProxyDMatrixSetDataCSR
  * - ... (data setters)
  *
  * @{
@@ -346,7 +351,7 @@ XGB_DLL int XGDMatrixCreateFromCudaArrayInterface(char const *data, char const *
 
 /*! \brief handle to a external data iterator */
 typedef void *DataIterHandle;  // NOLINT(*)
-/*! \brief handle to a internal data holder. */
+/** @brief handle to an internal data holder. */
 typedef void *DataHolderHandle;  // NOLINT(*)
 
 
@@ -473,7 +478,7 @@ XGB_DLL int XGDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHandle proxy
  */
 
 /**
- * @brief Create a Quantile DMatrix with data iterator.
+ * @brief Create a Quantile DMatrix with a data iterator.
  *
  * Short note for how to use the second set of callback for (GPU)Hist tree method:
  *
@@ -494,7 +499,13 @@ XGB_DLL int XGDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHandle proxy
  * - missing: Which value to represent missing value
  * - nthread (optional): Number of threads used for initializing DMatrix.
  * - max_bin (optional): Maximum number of bins for building histogram. Must be consistent with
-     the corresponding booster training parameter.
+ *   the corresponding booster training parameter.
+ * - max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming
+ *   batches with multiple growing substreams. This parameter sets the maximum number
+ *   of batches before XGBoost can cut the sub-stream and create a new one. This can
+ *   help bound the memory usage. By default, XGBoost grows new sub-streams
+ *   exponentially until batches are exhausted. Only used for the training dataset and
+ *   the default is None (unbounded).
  * @param out The created Quantile DMatrix.
  *
  * @return 0 when success, -1 when failure happens
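The ``max_quantile_blocks`` documentation above describes sub-streams that grow exponentially unless cut at a bound. That policy can be modeled in a few lines of pure Python; this is an illustrative sketch of the documented behavior, not XGBoost's actual sketching code, and the function name is made up:

```python
def split_substreams(n_batches, max_quantile_blocks=None):
    """Model of the documented policy: sub-stream capacity grows
    exponentially (1, 2, 4, ...), but a sub-stream is cut early once it
    holds max_quantile_blocks batches. None means unbounded growth."""
    substreams, current, capacity = [], [], 1
    for batch in range(n_batches):
        current.append(batch)
        limit = capacity if max_quantile_blocks is None else min(capacity, max_quantile_blocks)
        if len(current) >= limit:
            substreams.append(current)  # cut the sub-stream here
            current, capacity = [], capacity * 2
    if current:
        substreams.append(current)
    return substreams

# Unbounded: 7 batches split into sub-streams of sizes 1, 2, 4.
# Bounded at 2: no sub-stream ever holds more than 2 batches, which is how
# the parameter bounds peak memory during quantile sketching.
```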
@@ -509,7 +520,7 @@ XGB_DLL int XGQuantileDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHand
  *
  * @since 3.0.0
  *
- * @note This is still under development, not ready for test yet.
+ * @note This is experimental and subject to change.
  *
  * @param iter A handle to external data iterator.
  * @param proxy A DMatrix proxy handle created by @ref XGProxyDMatrixCreate.
@@ -521,12 +532,18 @@ XGB_DLL int XGQuantileDMatrixCreateFromCallback(DataIterHandle iter, DMatrixHand
  * - cache_prefix: The path of cache file, caller must initialize all the directories in this path.
  * - nthread (optional): Number of threads used for initializing DMatrix.
  * - max_bin (optional): Maximum number of bins for building histogram. Must be consistent with
-     the corresponding booster training parameter.
+ *   the corresponding booster training parameter.
  * - on_host (optional): Whether the data should be placed on host memory. Used by GPU inputs.
  * - min_cache_page_bytes (optional): The minimum number of bytes for each internal GPU
  *   page. Set to 0 to disable page concatenation. Automatic configuration if the
  *   parameter is not provided or set to None.
- * @param out The created Quantile DMatrix.
+ * - max_quantile_blocks (optional): For GPU-based inputs, XGBoost handles incoming
+ *   batches with multiple growing substreams. This parameter sets the maximum number
+ *   of batches before XGBoost can cut the sub-stream and create a new one. This can
+ *   help bound the memory usage. By default, XGBoost grows new sub-streams
+ *   exponentially until batches are exhausted. Only used for the training dataset and
+ *   the default is None (unbounded).
+ * @param out The created Quantile DMatrix.
  *
  * @return 0 when success, -1 when failure happens
  */

include/xgboost/data.h (+16 −2)

@@ -517,6 +517,7 @@ class BatchSet {
 
 struct XGBAPIThreadLocalEntry;
 
+// Configuration for external memoroy DMatrix.
 struct ExtMemConfig {
   // Cache prefix, not used if the cache is in the host memory. (on_host is true)
   std::string cache;
@@ -527,8 +528,20 @@ struct ExtMemConfig {
   std::int64_t min_cache_page_bytes{0};
   // Missing value.
   float missing{std::numeric_limits<float>::quiet_NaN()};
+  // Maximum number of pages cached in device.
+  std::int64_t max_num_device_pages{0};
   // The number of CPU threads.
   std::int32_t n_threads{0};
+
+  ExtMemConfig() = default;
+  ExtMemConfig(std::string cache, bool on_host, std::int64_t min_cache, float missing,
+               std::int64_t max_num_d, std::int32_t n_threads)
+      : cache{std::move(cache)},
+        on_host{on_host},
+        min_cache_page_bytes{min_cache},
+        missing{missing},
+        max_num_device_pages{max_num_d},
+        n_threads{n_threads} {}
 };
 
 /**
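The ``ExtMemConfig`` hunk above adds the ``max_num_device_pages`` field and an explicit constructor. For illustration only, the resulting field layout can be mirrored as a Python dataclass; this is not an XGBoost class, and the ``on_host`` default is an assumption since the diff does not show that field's initializer:

```python
import math
from dataclasses import dataclass

@dataclass
class ExtMemConfigSketch:
    """Illustrative Python mirror of the C++ ExtMemConfig struct after this
    commit. Names and defaults follow the diff where visible; the on_host
    default is assumed."""
    cache: str = ""                   # cache prefix; unused when the cache is in host memory
    on_host: bool = True              # assumed default
    min_cache_page_bytes: int = 0     # minimum bytes for each internal GPU page
    missing: float = math.nan         # value that represents missing entries
    max_num_device_pages: int = 0     # new in this commit: pages cached in device memory
    n_threads: int = 0                # number of CPU threads

cfg = ExtMemConfigSketch(cache="/tmp/xgb-cache", max_num_device_pages=1)
```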
@@ -637,7 +650,7 @@ class DMatrix {
             typename XGDMatrixCallbackNext>
   static DMatrix* Create(DataIterHandle iter, DMatrixHandle proxy, std::shared_ptr<DMatrix> ref,
                          DataIterResetCallback* reset, XGDMatrixCallbackNext* next, float missing,
-                         std::int32_t nthread, bst_bin_t max_bin);
+                         std::int32_t nthread, bst_bin_t max_bin, std::int64_t max_quantile_blocks);
 
   /**
    * @brief Create an external memory DMatrix with callbacks.
@@ -671,7 +684,8 @@ class DMatrix {
             typename XGDMatrixCallbackNext>
   static DMatrix* Create(DataIterHandle iter, DMatrixHandle proxy, std::shared_ptr<DMatrix> ref,
                          DataIterResetCallback* reset, XGDMatrixCallbackNext* next,
-                         bst_bin_t max_bin, ExtMemConfig const& config);
+                         bst_bin_t max_bin, std::int64_t max_quantile_blocks,
+                         ExtMemConfig const& config);
 
   virtual DMatrix *Slice(common::Span<int32_t const> ridxs) = 0;