Reduce memory footprint when using stochastic optimizers with shuffle activated #14075
Conversation
@meyer89 The following patch would make the tests pass:
diff --git a/sklearn/neural_network/multilayer_perceptron.py b/sklearn/neural_network/multilayer_perceptron.py
index 0d3cb2e4f..349117ae4 100644
--- a/sklearn/neural_network/multilayer_perceptron.py
+++ b/sklearn/neural_network/multilayer_perceptron.py
@@ -21,6 +21,7 @@ from ..preprocessing import LabelBinarizer
from ..utils import gen_batches, check_random_state
from ..utils import shuffle
from ..utils import check_array, check_X_y, column_or_1d
+from ..utils import safe_indexing
from ..exceptions import ConvergenceWarning
from ..utils.extmath import safe_sparse_dot
from ..utils.validation import check_is_fitted
@@ -511,9 +512,10 @@ class BaseMultilayerPerceptron(BaseEstimator, metaclass=ABCMeta):
idx = shuffle(idx, random_state=self._random_state)
accumulated_loss = 0.0
for batch_slice in gen_batches(n_samples, batch_size):
- activations[0] = X[idx[batch_slice]]
+ activations[0] = safe_indexing(X, idx[batch_slice])
batch_loss, coef_grads, intercept_grads = self._backprop(
- X[idx[batch_slice]], y[idx[batch_slice]],
+ safe_indexing(X, idx[batch_slice]),
+ y[idx[batch_slice]],
activations, deltas,
coef_grads, intercept_grads)
accumulated_loss += batch_loss * (batch_slice.stop -
@@ -661,7 +663,7 @@ class BaseMultilayerPerceptron(BaseEstimator, metaclass=ABCMeta):
y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
The decision function of the samples for each class in the model.
"""
- X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
+ X = check_array(X, accept_sparse=['csr', 'csc'])
# Make sure self.hidden_layer_sizes is a list
hidden_layer_sizes = self.hidden_layer_sizes
@@ -917,7 +919,7 @@ class MLPClassifier(BaseMultilayerPerceptron, ClassifierMixin):
n_iter_no_change=n_iter_no_change)
def _validate_input(self, X, y, incremental):
- X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+ X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'],
multi_output=True)
if y.ndim == 2 and y.shape[1] == 1:
y = column_or_1d(y, warn=True)
@@ -1317,7 +1319,7 @@ class MLPRegressor(BaseMultilayerPerceptron, RegressorMixin):
return y_pred
def _validate_input(self, X, y, incremental):
- X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+ X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'],
multi_output=True, y_numeric=True)
if y.ndim == 2 and y.shape[1] == 1:
y = column_or_1d(y, warn=True)
I applied the suggested patch and everything is still working. There is no regression in speed or memory usage when using numpy.ndarray as training data storage. In my opinion it should be okay to drop COO support, maybe with an automatic conversion for some time.
I agree that it should be okay to convert COO to CSR.
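For illustration, a minimal sketch of that conversion (not the actual scikit-learn code; shapes and parameters are made up):

```python
# Hedged sketch: convert COO input to CSR before fitting, since CSR supports
# the per-batch row indexing used by the stochastic solvers while COO does not.
import numpy as np
from scipy import sparse
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X_coo = sparse.random(200, 20, density=0.1, format="coo", random_state=rng)
y = rng.randint(0, 2, size=200)

X_csr = X_coo.tocsr()  # cheap conversion; no COO support needed inside fit()
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=10).fit(X_csr, y)
```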
Minor comment, otherwise LGTM. Thanks!
Checked with benchmarks/bench_sgd_regression.py -- doesn't look like this has much impact on runtime.
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.
> Checked with benchmarks/bench_sgd_regression.py
As in you switched these tests to MLP?
This PR LGTM and is more consistent with how SGDRegressor does it.
> As in you switched these tests to MLP?
Oops, wrong benchmark. Somehow I got confused and tested the SGD regressor instead :/
Running python benchmarks/bench_mnist.py --classifiers=MultilayerPerceptron after setting max_iter=4 for MLPClassifier.
When data is a memory map (default in that benchmark):
Before:
Classifier               train-time  test-time  error-rate
------------------------------------------------------------
MultilayerPerceptron        3.00s      0.14s      0.0244
After:
Classifier               train-time  test-time  error-rate
------------------------------------------------------------
MultilayerPerceptron      180.02s      0.14s      0.0244
So it really makes things worse in this case. We need to trigger a copy of the data if it's a mmap, I think.
When data is not a mmap:
Before:
Classifier               train-time  test-time  error-rate
------------------------------------------------------------
MultilayerPerceptron        2.73s      0.14s      0.0244
After:
Classifier               train-time  test-time  error-rate
------------------------------------------------------------
MultilayerPerceptron        2.66s      0.14s      0.0244
So this makes it marginally faster.
I presume the higher time comes from copying each batch off the disk. It's really going to depend on where that disk is. Are your benchmarks memmapped on a local solid state drive? The benchmark should probably use … Let me try with my local SSD.
It was on a server with an HDD. I think previously the data was loaded from disk once and copied when shuffling. So here, we should just make a copy of it when it's a mmap, to avoid decreasing performance there.
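A rough sketch of what such a copy could look like (hypothetical helper name, not the actual scikit-learn implementation):

```python
# Hypothetical helper: materialize a memory-mapped array once at the start of
# fit(), so per-batch fancy indexing does not hit the disk on every epoch.
import numpy as np

def copy_if_memmap(X):
    """Return an in-memory copy of X if it is a numpy memmap, else X unchanged."""
    if isinstance(X, np.memmap):
        return np.array(X)  # one upfront copy instead of one small copy per batch
    return X
```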
Assuming it's small enough to copy the entire thing into memory??
I agree that for the benchmark this is much, much slower :(
@rth If we trigger a copy then we go back to the previous memory consumption, don't we?
You are doing an excellent job checking the code for any side effects here. It generates good confidence in using this library :-) I did some more tests and found the reason for the slowdown inside safe_indexing.
memmap handling
The input data is converted to np.ndarray in check_X_y, so it is not clear to me how np.memmap could have caused the slowdown. Maybe it was something else, like sample size? I did not check which call inside check_X_y causes the conversion to numpy.ndarray. I think this behavior should at least be controllable, because sometimes there might be a good reason to keep the data as a memmap to save memory.
benchmark results
I also use "benchmarks/bench_mnist.py --classifier=MultilayerPerceptron" with max_iter set to 4 on a slow machine.
safe_indexing
So the reason is the X.take command in safe_indexing. In the comment in safe_indexing it is written that take should be preferred. However, I could not find any supporting benchmarks on the web.
As this slicing function is called very often, it would be best to only support X matrices which support indexing using X[idx[batch_slice]]. All the needed data conversion, for example support for pandas.DataFrame, could be handled once in the check_X_y function (I would just use X.values as a pointer to the underlying numpy.ndarray for any pandas.DataFrame). I do not know if the usage of take() inside safe_indexing makes sense in other scenarios, but this comparison clearly contradicts the performance benefit claimed in the comment.
General impact
In general I think that the performance impact of this change should be small. Numpy indexing with randomly ordered integer indices is slightly slower than indexing with a slice, so it is a good idea to add a fast path for non-shuffled data. However, shuffling a simple index vector instead of big data matrices saves some time, so in total the changed algorithm is slightly faster.
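A rough micro-benchmark of that comparison could look like the following sketch (illustrative sizes; actual timings depend on the machine):

```python
# Illustrative comparison of fancy indexing versus np.take on a mini-batch of
# shuffled row indices; not a rigorous benchmark.
import numpy as np
from timeit import timeit

rng = np.random.RandomState(0)
X = rng.rand(60000, 784)                  # roughly MNIST-sized dense matrix
idx = rng.permutation(X.shape[0])[:200]   # one mini-batch of shuffled indices

t_fancy = timeit(lambda: X[idx], number=1000)
t_take = timeit(lambda: X.take(idx, axis=0), number=1000)
print(f"fancy indexing: {t_fancy:.3f}s   take: {t_take:.3f}s")
```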
CIs are failing.
I fixed the indexing error with a modified safe_indexing function (always use fancy indexing instead of numpy take). Not all platforms succeed though, and I do not know why. I checked the version with memory-profiler and everything works as it should. Additionally, the benchmark "python benchmarks\bench_mnist.py --classifier=MultilayerPerceptron" with default parameters looks fine.
Before:
After:
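For context, the modified behaviour could be sketched roughly as below (a hypothetical simplification, not the actual sklearn.utils.safe_indexing code):

```python
# Hypothetical illustration of "always use fancy indexing instead of np.take".
import numpy as np
from scipy import sparse

def index_rows(X, indices):
    """Select rows of X by position for DataFrames, sparse matrices and arrays."""
    if hasattr(X, "iloc"):         # pandas DataFrame / Series
        return X.iloc[indices]
    if sparse.issparse(X):         # CSR/CSC support row indexing, COO does not
        return X[indices]
    return np.asarray(X)[indices]  # plain fancy indexing, no np.take
```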
Please open a separate issue to move safe_indexing to use fancy indexing rather than take, then. Thanks.
# Conflicts: # sklearn/utils/__init__.py
Sorry, I did not have the time during the last months to look at this. Due to the update of safe_indexing, the take() operator was removed outside this MR. The memory consumption is still reduced on 0.23.dev0, as you can see in the following figure. CI jobs are failing, but the master branch is failing too.
@rth If I understand the process correctly, you need to review the MR again. CI jobs are passing right now and the difference is relatively small, so I hope you can verify that there are no unwanted side effects from this MR. The previous discussion regarding take() is obsolete now, as it is not used any longer.
Is the test-time degradation that can be observed in #14075 (comment) meaningful or just random fluctuation? Just reading the patch, the prediction should be unchanged, unless the test data is COO? Maybe it's a matter of warming up memmapped test data in the benchmark script?
The runtime differences in the benchmark are just normal fluctuations.
Without MR: 81.39s, 86.16s, 79.10s
Therefore, I do not see any significant performance impact. If there is any, this MR makes the training process marginally faster. The training result should remain identical, which is supported by the observation that the error rate of the benchmark script is identical.
Just additional nitpicks to be more explicit. After that LGTM.
@rth Since you have a pending review, could you have a quick look?
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Good to be merged. Thanks @meyer89
During training I noticed that the training data is duplicated in memory. This does not need to be the case, as the order does not matter when shuffle is activated; therefore, only one complete copy of the data should be kept in memory.
However, when the original copy of the data is kept outside of the fit() call, it is not garbage collected by Python. One possible improvement would be that just a small indexing vector is shuffled.
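One way the duplication could be observed is with the memory_profiler package mentioned earlier in this thread (a hedged sketch; array sizes and the helper name are illustrative, not taken from this PR):

```python
# Illustrative measurement of peak memory during fit(); a duplicate of X shows
# up as a peak well above the size of the training data.
import numpy as np
from memory_profiler import memory_usage
from sklearn.neural_network import MLPRegressor

X = np.random.rand(200_000, 50)   # ~80 MB of training data
y = np.random.rand(200_000)

def fit_once():
    MLPRegressor(hidden_layer_sizes=(10,), max_iter=2, shuffle=True).fit(X, y)

peak = max(memory_usage((fit_once, (), {})))
print(f"peak memory during fit: {peak:.0f} MiB")
```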
Reference Issues/PRs
n/a
What does this implement/fix? Explain your changes.
Instead of shuffling the complete input matrices X and y, only a vector with indices is shuffled. This way the input matrix is never duplicated.
Any other comments?
Instead of shuffle, np.random.permutation() may work the same and look a bit cleaner.
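A minimal sketch of the index-shuffling idea, using np.random.permutation as suggested (simplified; not the actual multilayer_perceptron.py code):

```python
# Simplified illustration: permute a small index vector each epoch and slice
# mini-batches through it, so X and y themselves are never copied or reordered.
import numpy as np
from sklearn.utils import gen_batches

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)
y = rng.randint(0, 2, size=1000)
n_samples, batch_size = X.shape[0], 200

for epoch in range(5):
    idx = rng.permutation(n_samples)      # shuffle indices, not the data
    for batch_slice in gen_batches(n_samples, batch_size):
        X_batch = X[idx[batch_slice]]     # small per-batch copy only
        y_batch = y[idx[batch_slice]]
        # ... run the forward/backward pass on (X_batch, y_batch) ...
```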