
Memory mapping causes disk space usage bloat in big models with many search parameters #19608

Open
asansal-quantico opened this issue Mar 3, 2021 · 10 comments

asansal-quantico commented Mar 3, 2021

Hello,

We use GridSearchCV in a project where our training data is quite large (approx. 20,000 rows x 100 columns/features). We search a large space of hyperparameters, and with 5-10 cross-validation splits the total number of fitted models sometimes reaches 1000.

When running with joblib's default parallelization, the search generates a large amount of memory-mapped data and uses disk space on the scale of hundreds of gigabytes.

Solution:
Once we disable memory mapping, the search runs successfully without touching the disk. The fix is to call joblib's Parallel(..., max_nbytes=None) inside GridSearchCV; see the code excerpt below. joblib's default is '1M' (1 megabyte); passing None disables memory mapping entirely.

The suggested improvement is to add a kwarg to GridSearchCV that gets passed through to Parallel's max_nbytes.

```python
# From sklearn/model_selection/_search.py (excerpt, truncated in the original post):
parallel = Parallel(n_jobs=self.n_jobs,
```
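
For illustration, here is a minimal sketch of the workaround outside of scikit-learn (the array shape and the square_sum helper are made up for this example):

```python
import numpy as np
from joblib import Parallel, delayed


def square_sum(x):
    return float((x ** 2).sum())


X = np.random.rand(20_000, 100)  # roughly the data size reported above

# Default (max_nbytes='1M'): X exceeds the threshold, so joblib dumps it to a
# memory-mapped file in a temp folder and shares that file with the workers.
Parallel(n_jobs=2)(delayed(square_sum)(X) for _ in range(4))

# Workaround: max_nbytes=None disables memmapping, so X is pickled to each
# worker process instead and nothing is written to disk.
Parallel(n_jobs=2, max_nbytes=None)(delayed(square_sum)(X) for _ in range(4))
```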

cmarmo (Member) commented Mar 5, 2021

Hi @asansal-quantico, thanks for reaching out. Would you mind producing some benchmark plots or logs to make the issue more quantitative? This would help attract the core developers' attention. Thanks for your collaboration.

INF800 commented Nov 17, 2021

Hi, this issue looks inactive. Can I take a shot at reproducing it and sending a PR if required?

cmarmo (Member) commented Nov 18, 2021

Hi @INF800, feel free.
Quantitative benchmarks are needed here before any PR.
Thanks for your collaboration!

INF800 commented Nov 19, 2021

I am trying this out with a simple Lasso and xgboost's XGBRegressor, using the configuration stated in the comment above (see the sketch below). Any pointers on which model to use? @cmarmo @asansal-quantico
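
For reference, this is roughly what I am running (a sketch; the data and the alpha grid are synthetic, chosen to match the sizes in the original report):

```python
# Sketch of the reproduction attempt with Lasso: synthetic data of roughly the
# reported size (~20,000 rows x 100 features) and enough candidates that
# candidates x CV splits approaches 1000 fits.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X = np.random.rand(20_000, 100)
y = np.random.rand(20_000)

param_grid = {"alpha": np.logspace(-4, 1, 100)}  # 100 candidates x 10 folds = 1000 fits
search = GridSearchCV(Lasso(max_iter=2000), param_grid, cv=10, n_jobs=-1)
search.fit(X, y)
```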

INF800 commented Nov 19, 2021

So far, Lasso seems to be working fine.

thomasjpfan (Member) commented Nov 26, 2021

@ogrisel @lesteve Is there a way to adjust Parallel in GridSearchCV to turn off memmapping?

trendelkampschroer commented Aug 25, 2022

I can confirm this problem. The max_nbytes argument, or rather a more descriptively named version of it, e.g. memmap_threshold_bytes, could be added to GridSearchCV to alleviate it. I am happy to open a PR if this is desired. My feature request is based on the following observations:

Disk space usage can increase steadily until the script running the grid search terminates or completes; disk usage at or above 200 GB is possible for non-trivial examples.

For the example below, approx. 8 GB of data is written to disk by the time the grid search completes, and increasing the grid size increases disk usage further. In this example the "disk leak" only occurs for the data with the MultiIndex; for a standard Index with int or str entries, disk usage quickly levels off at approx. 200 MB, and the same holds for numpy arrays.

Setting max_nbytes=None by patching the Parallel object deep in sklearn.model_selection._search prevents the disk spill, but severely increases the runtime of the script; detailed numbers follow in the next comment. Setting max_nbytes='4M' results in disk usage quickly levelling off at around the same number as in the standard Index case.

You can find a minimal example below.

```python
import time
from typing import Tuple
from unittest.mock import patch

import numpy as np
import pandas as pd
from joblib import Parallel
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def generate_multi_index(n_dates: int, n_entities: int) -> pd.Index:
    dates = pd.date_range("2001-01-01", freq="B", periods=n_dates)
    entities = [f"id_{i}" for i in range(1, n_entities + 1)]
    index = pd.MultiIndex.from_product([dates, entities], names=["date", "entity"])
    return index


def generate_int_index(n_dates: int, n_entities: int) -> pd.Index:
    return pd.Index(range(n_dates * n_entities))


def generate_str_index(n_dates: int, n_entities: int) -> pd.Index:
    return pd.Index([f"id_{i}" for i in range(n_dates * n_entities)])


def generate_data(index: pd.Index, n_features: int) -> Tuple[pd.DataFrame, pd.Series]:
    rng = np.random.default_rng()
    columns = [f"feature_{i}" for i in range(n_features)]
    n_samples = len(index)
    X = pd.DataFrame(data=rng.normal(size=(n_samples, n_features)), index=index, columns=columns)
    y = pd.Series(data=rng.normal(size=n_samples), index=index)
    return X, y

class DummyRegressor(RegressorMixin, BaseEstimator):

    def __init__(self, p1=1, p2=2):
        self.p1 = p1
        self.p2 = p2

    def fit(self, X, y, sample_weight=None):
        # Simulate a small training cost; the data itself is ignored.
        time.sleep(0.05)
        return self

    def predict(self, X):
        n_samples = X.shape[0]
        return np.zeros((n_samples,))


def make_pipeline() -> Pipeline:
    return Pipeline(steps=[("regressor", DummyRegressor())])


class _ParallelPatch(Parallel):
    """A `joblib.Parallel` subclass that forces a chosen memmapping threshold
    (`max_nbytes`) onto the instances created inside scikit-learn.

    See `joblib.Parallel` for documentation.
    """

    def __init__(self, *args, max_nbytes="1M", **kwargs):
        # `max_nbytes` is captured as a named parameter above, so it can never
        # appear in `kwargs`; forward it explicitly to the parent class.
        kwargs["max_nbytes"] = max_nbytes
        super().__init__(*args, **kwargs)

if __name__ == "__main__":
    n_dates = 2500
    n_entities = 400
    n_features = 10
    index = generate_multi_index(n_dates, n_entities)
    X_train, y_train = generate_data(index, n_features)
    pipeline = make_pipeline()

    n_splits = 5 * 4
    cv = n_splits
    param_grid = {"regressor__p1": range(10), "regressor__p2": range(10)}
    # Patch the Parallel class looked up inside sklearn.model_selection._search
    # so that every instance the grid search creates uses the chosen max_nbytes.
    with patch("sklearn.model_selection._search.Parallel") as mock:
        mock.side_effect = lambda *args, **kwargs: _ParallelPatch(*args, max_nbytes="1M", **kwargs)
        grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, n_jobs=-1)
        grid_search.fit(X_train, y_train)
```
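
Changing max_nbytes in the side_effect lambda (e.g. to '4M', '10M', or None) switches between the configurations compared in the follow-up comment below.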

trendelkampschroer commented Aug 25, 2022

These are the numbers observed with the example above (max_nbytes='1M' is the default for joblib.Parallel). Times are rounded to whole seconds and disk usage to tens or hundreds of MB.

We can see that the "disk leak" is almost entirely prevented by setting max_nbytes='4M', with only a small runtime penalty, while completely disabling memory mapping (max_nbytes=None) severely increases the runtime of the grid search.

| # Grid points | max_nbytes | Time (sec.) | Disk (MB) |
|--------------:|-----------:|------------:|----------:|
|            20 |         1M |          23 |      1700 |
|            20 |         4M |          25 |       200 |
|            20 |        10M |          38 |        70 |
|            20 |       None |         130 |         0 |
|           100 |         1M |         100 |      7500 |
|           100 |         4M |         110 |       200 |
|           200 |         1M |         190 |     15000 |
|           200 |         4M |         220 |       200 |

Cf. joblib/joblib#1316 (comment) for a follow-up on this in the joblib repository.

thomasjpfan (Member) commented

@trendelkampschroer Thank you for sharing your results. It looks like the defaults for Parallel are not sufficient in some cases, and it would be good to allow users to adjust Parallel's max_nbytes from GridSearchCV.

I agree with the solution proposed in #19608 (comment) that we should add a parallel_kwargs to pass directly into Parallel; a sketch of what that could look like is below.
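
Roughly (hypothetical: parallel_kwargs is the proposed name, not an existing GridSearchCV parameter):

```python
# Hypothetical sketch of the proposal; `parallel_kwargs` does not exist in
# scikit-learn today. It would be forwarded verbatim to joblib.Parallel.
search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    parallel_kwargs={"max_nbytes": "4M"},
)
```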

Logistically, we would need some consensus from the maintainers to move forward, because we stopped adding configuration options for Parallel and have even removed some: #18030
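
In the meantime, joblib itself offers a global escape hatch that avoids patching scikit-learn internals. A sketch, assuming joblib >= 1.3, where parallel_config accepts max_nbytes (older joblib versions do not have this function):

```python
# Sketch of a workaround via joblib's configuration context. Assumes
# joblib >= 1.3, where parallel_config accepts max_nbytes; the setting is
# picked up by Parallel instances created inside the block, including the
# ones GridSearchCV creates internally.
from joblib import parallel_config

# grid_search is the GridSearchCV instance from the example above.
with parallel_config(max_nbytes=None):  # or a higher threshold, e.g. "4M"
    grid_search.fit(X_train, y_train)
```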

trendelkampschroer commented

@thomasjpfan thanks a lot for your quick reply. Skimming the discussion in #18030, I get the impression that exposing Parallel kwargs is decided on a case-by-case basis. Given the "trouble" that lacking access to Parallel's max_nbytes argument can cause when running a grid search, I'd say this is a good case for exposing it.

Happy to open a short PR and move the ensuing discussion there...
