Memory mapping causes disk space usage bloat in big models with many search parameters #19608
Hi @asansal-quantico, thanks for reaching out. Do you mind producing some benchmark plots or logs to make the issue more quantitative? This will help in attracting core-dev attention. Thanks for your collaboration.
Hi, this issue looks inactive. Can I take a shot at reproducing the issue and sending a PR if required?
Hi @INF800, feel free.
I am trying this out with a simple Lasso and xgboost's XGBRegressor, with the config as stated in the comment above. Any pointers on which model to use? @cmarmo @asansal-quantico
So far Lasso seems to be working fine.
I can confirm the existence of this problem. Disk space usage may increase steadily until termination/completion of a script running a GridSearch. Disk usage of 200 GB and above is possible for non-trivial examples. For the example below, approx. 8 GB of data is written to disk until the grid search completes; increasing the grid size will further increase disk usage. In the example below this "disk leak" only occurs for the data with the `MultiIndex`. Setting `max_nbytes=None` prevents it.

You can find a minimal example below.

```python
import time
from typing import Tuple
from unittest.mock import patch

import numpy as np
import pandas as pd
from joblib import Parallel
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def generate_multi_index(n_dates: int, n_entities: int) -> pd.Index:
    dates = pd.date_range("2001-01-01", freq="B", periods=n_dates)
    entities = [f"id_{i}" for i in range(1, n_entities + 1)]
    index = pd.MultiIndex.from_product([dates, entities], names=["date", "entity"])
    return index


def generate_int_index(n_dates: int, n_entities: int) -> pd.Index:
    return pd.Index(range(n_dates * n_entities))


def generate_str_index(n_dates: int, n_entities: int) -> pd.Index:
    return pd.Index([f"id_{i}" for i in range(n_dates * n_entities)])


def generate_data(index: pd.Index, n_features: int) -> Tuple[pd.DataFrame, pd.Series]:
    rng = np.random.default_rng()
    columns = [f"feature_{i}" for i in range(n_features)]
    n_samples = len(index)
    X = pd.DataFrame(data=rng.normal(size=(n_samples, n_features)), index=index, columns=columns)
    y = pd.Series(data=rng.normal(size=n_samples), index=index)
    return X, y


class DummyRegressor(RegressorMixin, BaseEstimator):
    def __init__(self, p1=1, p2=2, **kwargs):
        self.p1 = p1
        self.p2 = p2

    def fit(self, X, y, sample_weight=None):
        time.sleep(0.05)
        return self

    def predict(self, X):
        n_samples = X.shape[0]
        return np.zeros((n_samples,))


def make_pipeline() -> Pipeline:
    return Pipeline(steps=[("regressor", DummyRegressor())])


class _ParallelPatch(Parallel):
    """A patch for the `joblib.Parallel` class allowing to configure memory mapping
    of data shared with worker processes.

    See `joblib.Parallel` for documentation.
    """

    def __init__(self, *args, max_nbytes="1M", **kwargs):
        if "max_nbytes" not in kwargs:
            kwargs["max_nbytes"] = max_nbytes
        super().__init__(*args, **kwargs)


if __name__ == "__main__":
    n_dates = 2500
    n_entities = 400
    n_features = 10

    index = generate_multi_index(n_dates, n_entities)
    X_train, y_train = generate_data(index, n_features)
    pipeline = make_pipeline()

    n_splits = 5 * 4
    cv = n_splits
    param_grid = {"regressor__p1": range(10), "regressor__p2": range(10)}

    with patch("sklearn.model_selection._search.Parallel") as mock:
        mock.side_effect = lambda *args, **kwargs: _ParallelPatch(*args, max_nbytes="1M", **kwargs)
        grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid, cv=cv, n_jobs=-1)
        grid_search.fit(X_train, y_train)
```
These are the numbers observed with the example above. We can see that the "disk leak" can be completely prevented by setting `max_nbytes=None`.
cf. joblib/joblib#1316 (comment) for a follow-up on this in the `joblib` issue tracker.
@trendelkampschroer Thank you for sharing your results. It looks like the defaults for `max_nbytes` in `joblib.Parallel` can lead to substantial disk usage for large searches. Given that `joblib.Parallel` already exposes this option, I agree with the solution in #19608 (comment) that we should add a way to pass `max_nbytes` through from `GridSearchCV` to `Parallel`. Logistically, we would need some consensus from maintainers to move forward, because we stopped adding configuration options for `joblib` (see the discussion in #18030).
@thomasjpfan thanks a lot for your quick reply. Quickly reading the discussion in #18030, I get the impression that exposing `joblib` options directly is a contentious topic. Happy to issue a short PR and move the ensuing discussion there...
Hello,
We use GridSearchCV in a project where our training data is quite large (approx. 20,000 rows x 100 columns/features). We search a large space of hyperparameters, and with 5-10 cross-validation splits the total number of model fits sometimes reaches up to 1000.
When running with joblib's default parallelization, the tool generates a lot of memory-mapped data and uses disk space on the scale of hundreds of gigabytes.
Solution:
Once we disable memory mapping, it runs successfully without using the disk. The solution is to call joblib's Parallel(..., max_nbytes=None) within GridSearchCV; see the link to the code below. joblib's default is '1M' (1 megabyte); passing None disables memory mapping.
The suggested improvement is to add a kwarg to GridSearchCV that gets passed through to Parallel's max_nbytes.
scikit-learn/sklearn/model_selection/_search.py, line 767 at 28ee486