
Reduce memory footprint when using stochastic optimizers with shuffle activated #14075

Merged · 19 commits · Jan 13, 2020

Conversation

@meyer89 (Contributor) commented Jun 12, 2019

During training I noticed that the training data is duplicated in memory. This is unnecessary, as the order does not matter when shuffle is activated; therefore, only one complete copy of the data should be kept in memory.

However, when the original copy of the data is kept alive outside of the fit() call, the duplicate is not garbage-collected by Python. One possible improvement is to shuffle just a small indexing vector instead.

Reference Issues/PRs

n/a

What does this implement/fix? Explain your changes.

Instead of shuffling the complete input matrices X and y, only a vector of indices is shuffled. This way, the input matrix is never duplicated.

Any other comments?

Instead of shuffle, np.random.permutation() may work the same and look a bit cleaner.
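A minimal sketch of the approach (illustrative names, not the exact PR code): shuffle only a small index array, e.g. via np.random.permutation, and slice each minibatch out of X and y, which themselves stay untouched.

import numpy as np
from sklearn.utils import gen_batches

rng = np.random.RandomState(0)
X = rng.rand(10_000, 50)                 # training data, never copied as a whole
y = rng.randint(0, 2, size=10_000)
n_samples, batch_size = X.shape[0], 200

for epoch in range(5):
    idx = rng.permutation(n_samples)     # shuffle only the small index vector
    for batch_slice in gen_batches(n_samples, batch_size):
        X_batch = X[idx[batch_slice]]    # fancy indexing copies just one minibatch
        y_batch = y[idx[batch_slice]]
        # ... run one gradient step on (X_batch, y_batch)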

@meyer89 (Contributor, Author) commented Jun 12, 2019

I hope this is the right way to do it. As further proof of effectiveness, here is a screenshot of the output of mprof (memory-profiler). You can clearly see that the input data is duplicated in the old version (black), but not in the updated version (blue).

[screenshot: mprof memory profile; old version (black) vs. updated version (blue)]
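For reference, a rough way to reproduce such a measurement with the memory-profiler package (the data sizes here are made up for illustration):

import numpy as np
from memory_profiler import memory_usage          # pip install memory-profiler
from sklearn.neural_network import MLPClassifier

X = np.random.rand(100_000, 50)
y = np.random.randint(0, 2, size=100_000)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=5, shuffle=True)

# sample the process memory while fit() runs; the peak shows whether X is duplicated
mem = memory_usage((clf.fit, (X, y)), interval=0.1)
print("peak memory: %.1f MiB" % max(mem))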

@glemaitre glemaitre self-requested a review June 13, 2019 12:31
@glemaitre (Member)

@ogrisel @jnothman would it be interesting to introduce such a feature at the cost of converting COO to CSR?

@meyer89 Do you see any performance (speed) difference between your implementation and the previous one?

@glemaitre (Member)

@meyer89 The following patch would make the tests pass:

diff --git a/sklearn/neural_network/multilayer_perceptron.py b/sklearn/neural_network/multilayer_perceptron.py
index 0d3cb2e4f..349117ae4 100644
--- a/sklearn/neural_network/multilayer_perceptron.py
+++ b/sklearn/neural_network/multilayer_perceptron.py
@@ -21,6 +21,7 @@ from ..preprocessing import LabelBinarizer
 from ..utils import gen_batches, check_random_state
 from ..utils import shuffle
 from ..utils import check_array, check_X_y, column_or_1d
+from ..utils import safe_indexing
 from ..exceptions import ConvergenceWarning
 from ..utils.extmath import safe_sparse_dot
 from ..utils.validation import check_is_fitted
@@ -511,9 +512,10 @@ class BaseMultilayerPerceptron(BaseEstimator, metaclass=ABCMeta):
                     idx = shuffle(idx, random_state=self._random_state)
                 accumulated_loss = 0.0
                 for batch_slice in gen_batches(n_samples, batch_size):
-                    activations[0] = X[idx[batch_slice]]
+                    activations[0] = safe_indexing(X, idx[batch_slice])
                     batch_loss, coef_grads, intercept_grads = self._backprop(
-                        X[idx[batch_slice]], y[idx[batch_slice]],
+                        safe_indexing(X, idx[batch_slice]),
+                        y[idx[batch_slice]],
                         activations, deltas,
                         coef_grads, intercept_grads)
                     accumulated_loss += batch_loss * (batch_slice.stop -
@@ -661,7 +663,7 @@ class BaseMultilayerPerceptron(BaseEstimator, metaclass=ABCMeta):
         y_pred : array-like, shape (n_samples,) or (n_samples, n_outputs)
             The decision function of the samples for each class in the model.
         """
-        X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
+        X = check_array(X, accept_sparse=['csr', 'csc'])
 
         # Make sure self.hidden_layer_sizes is a list
         hidden_layer_sizes = self.hidden_layer_sizes
@@ -917,7 +919,7 @@ class MLPClassifier(BaseMultilayerPerceptron, ClassifierMixin):
             n_iter_no_change=n_iter_no_change)
 
     def _validate_input(self, X, y, incremental):
-        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'],
                          multi_output=True)
         if y.ndim == 2 and y.shape[1] == 1:
             y = column_or_1d(y, warn=True)
@@ -1317,7 +1319,7 @@ class MLPRegressor(BaseMultilayerPerceptron, RegressorMixin):
         return y_pred
 
     def _validate_input(self, X, y, incremental):
-        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
+        X, y = check_X_y(X, y, accept_sparse=['csr', 'csc'],
                          multi_output=True, y_numeric=True)
         if y.ndim == 2 and y.shape[1] == 1:
             y = column_or_1d(y, warn=True)

@meyer89 (Contributor, Author) commented Jun 17, 2019

I applied the suggested patch and everything is still working. No regression in speed or memory usage when using numpy.ndarray as the training data storage.

In my opinion it should be okay to drop COO support, perhaps with an automatic conversion for some time (see the sketch after the quoted list below).
Intended usage from https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html:

  • COO is a fast format for constructing sparse matrices
  • Once a matrix has been constructed, convert to CSR or CSC format for fast arithmetic and matrix vector operations
  • By default when converting to CSR or CSC format, duplicate (i,j) entries will be summed together. This facilitates efficient construction of finite element matrices and the like. (see example)
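A rough illustration of why the one-time COO to CSR conversion is cheap while row indexing on COO is not possible at all (a sketch, not code from the PR):

import numpy as np
from scipy import sparse

X_coo = sparse.random(10_000, 50, density=0.05, format="coo", random_state=0)
idx = np.random.permutation(10_000)[:200]   # one shuffled minibatch of row indices

# X_coo[idx] fails: COO does not support (row) indexing at all
X_csr = X_coo.tocsr()                       # one-time conversion, O(nnz)
batch = X_csr[idx]                          # fancy row indexing works on CSR
print(batch.shape)                          # (200, 50)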

@jnothman (Member) commented Jun 17, 2019 via email

@rth (Member) left a comment

Minor comment, otherwise LGTM. Thanks!

Checked with benchmarks/bench_sgd_regression.py -- doesn't look like this has much impact on runtime.

Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself with :user:.

(Review thread on sklearn/neural_network/multilayer_perceptron.py; outdated, resolved)
@jnothman (Member) left a comment

> Checked with benchmarks/bench_sgd_regression.py

As in you switched these tests to MLP?

This PR LGTM and is more consistent with how SGDRegressor does it.

@rth (Member) left a comment

> As in you switched these tests to MLP?

Oops, wrong benchmark. Somehow I got confused and tested the SGD regressor instead :/

Running,

python benchmarks/bench_mnist.py --classifiers=MultilayerPerceptron

after setting max_iter=4 for MLPClassifier.

When data is a memory map (default in that benchmark)
Before,

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron          3.00s       0.14s       0.0244

after,

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron        180.02s       0.14s       0.0244

So it really makes things worse in this case. We need to trigger a copy of the data if it's a mmap, I think.

When data is not a mmap

before

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron          2.73s       0.14s       0.0244

after,

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron          2.66s       0.14s       0.0244

So this makes it marginally faster.

@jnothman (Member)

> So it really makes things worse in this case. We need to trigger a copy of the data if it's a mmap, I think.

I presume the higher time comes from copying each batch off the disk. It's really going to depend on where that disk is. Were your benchmarks memmapped on a local solid-state drive? The benchmark should probably use /tmp rather than ~/ to be more realistic for people's local setups, since ~ is often going to be LDAP/NFS, while memmaps are more likely to be on local storage.

Let me try with my local SSD.

@rth (Member) commented Jun 20, 2019

It was on a server with an HDD. I think previously the data was loaded from disk once and copied when shuffling. So here, we should just make a copy of it when it's a mmap, to avoid decreasing performance there.
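A minimal sketch of the kind of guard being discussed here (a hypothetical helper, not the code that was merged): materialize memmapped input in RAM before the minibatch loop, accepting the higher memory use in that case.

import numpy as np

def _ensure_in_memory(X):
    # hypothetical helper: load a np.memmap into RAM so that per-batch
    # fancy indexing does not go back to disk for every minibatch
    if isinstance(X, np.memmap):
        return np.asarray(X).copy()   # drop the memmap subclass and copy into memory
    return X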

@jnothman (Member)

> So here, we should just make a copy of it when it's a mmap, to avoid decreasing performance there.

Assuming it's small enough to copy the entire thing into memory??

@jnothman (Member)

I agree that for the benchmark this is much, much slower :(

@glemaitre (Member)

@rth If we trigger a copy then we go back to the previous memory consumption, don't we?

@meyer89 (Contributor, Author) commented Jun 20, 2019

You are doing an excellent job checking the code for any side effects here. It generates good confidence in using this library :-) I did some more tests and found the reason for the slowdown inside safe_indexing.

memmap handling

The input data is converted to np.ndarray in check_X_y, so it is not clear to me how np.memmap could have caused the slowdown. Maybe it was something else like sample size?

I did not check which call inside check_X_y causes the conversion to numpy.ndarray. I think this behavior should at least be controllable, because sometimes there might be a good reason to keep the data as a memmap to save memory.
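One way to check whether a memmap actually survives validation (a small diagnostic sketch; the file name and shapes are made up):

import numpy as np
from sklearn.utils import check_X_y

# create a small memmap-backed array just for the check
X_mmap = np.memmap("X_train.dat", dtype=np.float64, mode="w+", shape=(1000, 20))
y = np.zeros(1000)

X_checked, _ = check_X_y(X_mmap, y)
print(type(X_mmap).__name__, "->", type(X_checked).__name__)
# whether the result is still a memmap (or only an ndarray view of it) decides
# if the per-batch indexing keeps hitting the disk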

benchmark results

I also used "benchmarks/bench_mnist.py --classifier=MultilayerPerceptron" with max_iter set to 4 on a slow machine.

X, y are shuffled, idx is constant:

  • X[batch_slice]: 12.13s
  • X[idx[batch_slice]]: 13.85s

X, y are constant, idx is shuffled:

  • X[idx[batch_slice]]: 11.85s
  • safe_indexing(X, idx[batch_slice]): 228.13s
  • X.take(idx[batch_slice], axis=0): 263.93s

safe_indexing

So the reason is the X.take call inside safe_indexing. The comment in safe_indexing says that take should be preferred; however, I could not find any supporting benchmarks on the web.

> This is often substantially faster than X[indices]

As this slicing function is called very often, it would be best to only support X matrices that support indexing with X[idx[batch_slice]]. All the needed data conversion, for example support for pandas.DataFrame, could be handled once in the check_X_y function (I would just use X.values to get a pointer to the underlying numpy.ndarray for any pandas.DataFrame).

I do not know whether the usage of take() inside safe_indexing makes sense in other scenarios, but this comparison clearly contradicts the performance benefit claimed in the comment.
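A small micro-benchmark along the lines of these measurements (illustrative sizes; the absolute numbers will vary a lot between machines and array layouts):

import timeit
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(70_000, 784)                    # roughly MNIST-sized
idx = rng.permutation(70_000)[:200]          # one shuffled minibatch of indices

t_fancy = timeit.timeit(lambda: X[idx], number=1000)
t_take = timeit.timeit(lambda: X.take(idx, axis=0), number=1000)
print("X[idx]:              %.3fs" % t_fancy)
print("X.take(idx, axis=0): %.3fs" % t_take)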

General impact

In general I think the performance impact of this change should be small. Numpy indexing with randomly ordered integer indices is slightly slower than indexing with a slice, so it is a good idea to add a fast path for non-shuffled data. However, shuffling a simple vector instead of big data matrices saves some time, so in total the changed algorithm is slightly faster.
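For the fast path mentioned above, the difference is essentially a view versus a copy (a small sketch):

import numpy as np

X = np.random.rand(10_000, 100)
idx = np.random.permutation(10_000)

view = X[100:200]          # plain slice: a view, nothing copied
batch = X[idx[100:200]]    # fancy indexing: copies just this minibatch

print(view.base is X)      # True  -> no copy, cheapest possible batch
print(batch.base is X)     # False -> small per-batch copy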

@glemaitre (Member)

CIs are failing.

@meyer89 (Contributor, Author) commented Jun 22, 2019

I fixed the indexing error with a modified safe_indexing function (always use fancy indexing instead of numpy take). Not all platforms succeed though, and I do not know why.

I checked the version with memory-profiler and everything works as it should.

Additionally, the benchmark "python benchmarks\bench_mnist.py --classifier=MultilayerPerceptron" with default parameters looks fine.

Before

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron         77.16s       0.22s       0.0263

After

Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MultilayerPerceptron         70.61s       0.41s       0.0263

@jnothman (Member) commented Jun 26, 2019 via email

@meyer89 (Contributor, Author) commented Dec 17, 2019

Sorry, I did not have the time during the last months to look at this. Due to an update to safe_indexing, the take() call has been removed outside of this MR.

The memory consumption is still reduced on 0.23dev0, as you can see in the following figure.

[screenshot: mprof memory profile on 0.23dev0, with and without this change]

CI jobs are failing, but the master branch is failing too.

@meyer89 meyer89 requested a review from rth December 19, 2019 08:40
@meyer89 (Contributor, Author) commented Jan 2, 2020

@rth If I understand the process correctly, you need to review the MR again.

CI jobs are passing right now and the difference is relatively small, so I hope you can verify that there are no unwanted side effects from this MR.

The previous discussion regarding take() is obsolete now, as it is no longer used.

@ogrisel (Member) commented Jan 2, 2020

Is the test time degradation that can be observed in #14075 (comment) meaningful or just random fluctuation?

Just reading the patch, the prediction should be unchanged, unless the test data is COO?

Maybe it's a matter of warming up the memmapped test data in the benchmark script?

@meyer89 (Contributor, Author) commented Jan 2, 2020

The runtime differences in the benchmark are just normal fluctuations.
I just started the script "benchmarks\bench_mnist.py --classifier=MultilayerPerceptron" again three times using 0.23dev0.

Without MR: 81.39s, 86.16s, 79.10s
With MR: 79.15s, 76.60s, 88.39s

Therefore, I do not see any significant performance impact. If there is any, this MR makes the training process marginally faster.

The training result should remain identical, which is supported by the observation that the error rate of the benchmark script is identical.

@jnothman jnothman requested a review from glemaitre January 6, 2020 20:03
@glemaitre (Member) left a comment

Just additional nitpicks to be more explicit. After that, LGTM.
@rth Since you have a pending review, could you have a quick look?

(Four review threads on sklearn/neural_network/_multilayer_perceptron.py; outdated, resolved)
@glemaitre glemaitre merged commit d2febf2 into scikit-learn:master Jan 13, 2020
@glemaitre (Member)

Good to be merged. Thanks @meyer89

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020