
Memory mapping causes disk space usage bloat in big models with many search parameters #19608

Open
asansal-quantico opened this issue Mar 3, 2021 · 10 comments

asansal-quantico commented Mar 3, 2021

Hello,

We use GridSearchCV in a project where our training data is quite large (approx. 20,000 rows x 100 columns/features). We search a large space of hyperparameters, and with 5-10 cross-validation splits the total number of fitted models sometimes reaches 1000.

When running with joblib's default parallelization, the search generates a large amount of memory-mapped data and uses disk space on the scale of hundreds of gigabytes.

Solution:
Once we disable memory mapping, the search runs successfully without touching the disk. The fix is to call joblib's Parallel(..., max_nbytes=None) inside GridSearchCV; see the code excerpt below. joblib's default is '1M' (1 megabyte); passing None disables memory mapping entirely.

The suggested improvement is to add a kwarg to GridSearchCV that gets passed through to Parallel's max_nbytes.

```python
# From sklearn/model_selection/_search.py (excerpt, truncated in the original post):
parallel = Parallel(n_jobs=self.n_jobs,
```
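
For illustration, here is a minimal sketch of the workaround outside of scikit-learn (the array shape and the square_sum helper are made up for this example):

```python
import numpy as np
from joblib import Parallel, delayed


def square_sum(x):
    return float((x ** 2).sum())


X = np.random.rand(20_000, 100)  # roughly the data size reported above

# Default (max_nbytes='1M'): X exceeds the threshold, so joblib dumps it to a
# memory-mapped file in a temp folder and shares that file with the workers.
Parallel(n_jobs=2)(delayed(square_sum)(X) for _ in range(4))

# Workaround: max_nbytes=None disables memmapping, so X is pickled to each
# worker process instead and nothing is written to disk.
Parallel(n_jobs=2, max_nbytes=None)(delayed(square_sum)(X) for _ in range(4))
```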

cmarmo (Member) commented Mar 5, 2021

Hi @asansal-quantico, thanks for reaching out. Would you mind producing some benchmark plots or logs to make the issue more quantitative? This would help attract the core developers' attention. Thanks for your collaboration.

INF800 commented Nov 17, 2021

Hi, this issue looks inactive. Can I take a shot at reproducing it and sending a PR if required?

cmarmo (Member) commented Nov 18, 2021

Hi @INF800, feel free.
Quantitative benchmarks are needed here before any PR.
Thanks for your collaboration!

INF800 commented Nov 19, 2021

I am trying this out with a simple Lasso and xgboost's XGBRegressor, using the configuration stated in the comment above (see the sketch below). Any pointers on which model to use? @cmarmo @asansal-quantico
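
For reference, this is roughly what I am running (a sketch; the data and the alpha grid are synthetic, chosen to match the sizes in the original report):

```python
# Sketch of the reproduction attempt with Lasso: synthetic data of roughly the
# reported size (~20,000 rows x 100 features) and enough candidates that
# candidates x CV splits approaches 1000 fits.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X = np.random.rand(20_000, 100)
y = np.random.rand(20_000)

param_grid = {"alpha": np.logspace(-4, 1, 100)}  # 100 candidates x 10 folds = 1000 fits
search = GridSearchCV(Lasso(max_iter=2000), param_grid, cv=10, n_jobs=-1)
search.fit(X, y)
```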

INF800 commented Nov 19, 2021

So far, Lasso seems to be working fine.

thomasjpfan (Member) commented Nov 26, 2021

@ogrisel @lesteve Is there a way to adjust Parallel in GridSearchCV to turn off memmapping?

trendelkampschroer commented Aug 25, 2022

I can confirm this problem. The max_nbytes argument, or rather a more descriptively named version of it, e.g. memmap_threshold_bytes, could be added to GridSearchCV to alleviate it. I am happy to open a PR if this is desired. My feature request is based on the following observations:

Disk space usage can increase steadily until the script running the grid search terminates or completes; disk usage at or above 200 GB is possible for non-trivial examples.

For the example below, approx. 8 GB of data is written to disk by the time the grid search completes, and increasing the grid size increases disk usage further. In this example the "disk leak" only occurs for the data with the MultiIndex; for a standard Index with int or str entries, disk usage quickly levels off at approx. 200 MB, and the same holds for numpy arrays.

Setting max_nbytes=None by patching the Parallel object deep in sklearn.model_selection._search prevents the disk spill, but severely increases the runtime of the script; detailed numbers follow in the next comment. Setting max_nbytes='4M' results in disk usage quickly levelling off at around the same number as in the standard Index case.

You can find a minimal example below.

```python
import time
from typing import Tuple
from unittest.mock import patch

import numpy as np
import pandas as pd
from joblib import Parallel
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def generate_multi_index(n_dates: int, n_entities: int) -> pd.Index:
    dates = pd.date_range("2001-01-01", freq="B", periods=n_dates)
    entities = [f"id_{i}" for i in range(1, n_entities + 1)]
    index = pd.MultiIndex.from_product([dates, entities], names=["date", "entity"])
    return index


def generate_int_index(n_dates: int, n_entities: int) -> pd.Index:
    return pd.Index(range(n_dates * n_entities))


def generate_str_index(n_dates: int, n_entities: int) -> pd.Index:
    return pd.Index([f"id_{i}" for i in range(n_dates * n_entities)])


def generate_data(index: pd.Index, n_features: int) -> Tuple[pd.DataFrame, pd.Series]:
    rng = np.random.default_rng()
    columns = [f"feature_{i}" for i in range(n_features)]
    n_samples = len(index)
    X = pd.DataFrame(data=rng.normal(size=(n_samples, n_features)), index=index, columns=columns)
    y = pd.Series(data=rng.normal(size=n_samples), index=index)
    return X, y

class DummyRegressor(RegressorMixin, BaseEstimator):

    def __init__(self, p1=1, p2=2):
        self.p1 = p1
        self.p2 = p2

    def fit(self, X, y, sample_weight=None):
        # Simulate a small training cost; the data itself is ignored.
        time.sleep(0.05)
        return self

    def predict(self, X):
        n_samples = X.shape[0]
        return np.zeros((n_samples,))


def make_pipeline() -> Pipeline:
    return Pipeline(steps=[("regressor", DummyRegressor())])


class _ParallelPatch(Parallel):
    """A `joblib.Parallel` subclass that forces a chosen memmapping threshold
    (`max_nbytes`) onto the instances created inside scikit-learn.

    See `joblib.Parallel` for documentation.
    """

    def __init__(self, *args, max_nbytes="1M", **kwargs):
        # `max_nbytes` is captured as a named parameter above, so it can never
        # appear in `kwargs`; forward it explicitly to the parent class.
        kwargs["max_nbytes"] = max_nbytes
        super().__init__(*args, **kwargs)

if __name__ == "__main__":
    n_dates = 2500
    n_entities = 400
    n_features = 10
    index = generate_multi_index(n_dates, n_entities)
    X_train, y_train = generate_data(index, n_features)
    pipeline = make_pipeline()

    n_splits = 5 * 4
    cv = n_splits
    param_grid = {"regressor__p1": range(10), "regressor__p2": range(10)}
    # Patch the Parallel class looked up inside sklearn.model_selection._search
    # so that every instance the grid search creates uses the chosen max_nbytes.
    with patch("sklearn.model_selection._search.Parallel") as mock:
        mock.side_effect = lambda *args, **kwargs: _ParallelPatch(*args, max_nbytes="1M", **kwargs)
        grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, n_jobs=-1)
        grid_search.fit(X_train, y_train)
```
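
Changing max_nbytes in the side_effect lambda (e.g. to '4M', '10M', or None) switches between the configurations compared in the follow-up comment below.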

trendelkampschroer commented Aug 25, 2022

These are the numbers observed with the example above (max_nbytes='1M' is the default for joblib.Parallel). Times are rounded to whole seconds and disk usage to tens or hundreds of MB.

We can see that the "disk leak" is almost entirely prevented by setting max_nbytes='4M', with only a small runtime penalty, while completely disabling memory mapping (max_nbytes=None) severely increases the runtime of the grid search.

| # Grid points | max_nbytes | Time (sec.) | Disk (MB) |
|--------------:|-----------:|------------:|----------:|
|            20 |         1M |          23 |      1700 |
|            20 |         4M |          25 |       200 |
|            20 |        10M |          38 |        70 |
|            20 |       None |         130 |         0 |
|           100 |         1M |         100 |      7500 |
|           100 |         4M |         110 |       200 |
|           200 |         1M |         190 |     15000 |
|           200 |         4M |         220 |       200 |

Cf. joblib/joblib#1316 (comment) for a follow-up on this in the joblib repository.

thomasjpfan (Member) commented

@trendelkampschroer Thank you for sharing your results. It looks like the defaults for Parallel are not sufficient in some cases, and it would be good to allow users to adjust Parallel's max_nbytes from GridSearchCV.

I agree with the solution proposed in #19608 (comment) that we should add a parallel_kwargs to pass directly into Parallel; a sketch of what that could look like is below.
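
Roughly (hypothetical: parallel_kwargs is the proposed name, not an existing GridSearchCV parameter):

```python
# Hypothetical sketch of the proposal; `parallel_kwargs` does not exist in
# scikit-learn today. It would be forwarded verbatim to joblib.Parallel.
search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    n_jobs=-1,
    parallel_kwargs={"max_nbytes": "4M"},
)
```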

Logistically, we would need some consensus from the maintainers to move forward, because we stopped adding configuration options for Parallel and have even removed some: #18030
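
In the meantime, joblib itself offers a global escape hatch that avoids patching scikit-learn internals. A sketch, assuming joblib >= 1.3, where parallel_config accepts max_nbytes (older joblib versions do not have this function):

```python
# Sketch of a workaround via joblib's configuration context. Assumes
# joblib >= 1.3, where parallel_config accepts max_nbytes; the setting is
# picked up by Parallel instances created inside the block, including the
# ones GridSearchCV creates internally.
from joblib import parallel_config

# grid_search is the GridSearchCV instance from the example above.
with parallel_config(max_nbytes=None):  # or a higher threshold, e.g. "4M"
    grid_search.fit(X_train, y_train)
```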

trendelkampschroer commented

@thomasjpfan thanks a lot for your quick reply. Skimming the discussion in #18030, I get the impression that exposing Parallel kwargs is decided on a case-by-case basis. Given the "trouble" that lacking access to Parallel's max_nbytes argument can cause when running a grid search, I'd say this is a good case for exposing it.

Happy to open a short PR and move the ensuing discussion there...
