Randomized PCA.transform uses a lot of RAM #11102
Comments
Yes, we can probably remove the peak usage at the end of the fit.
PR welcome.
Current situation

After looking into this for a while, I believe that the extra memory footprint is generated by `total_var = np.var(X, ddof=1, axis=0)`.
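A rough way to see this directly (a minimal sketch with a scaled-down `X`; it assumes a NumPy recent enough, 1.13+, to report its buffer allocations to `tracemalloc`):

```python
import numpy as np
import tracemalloc

X = np.random.random((2000, 1638))  # ~26 MB, scaled down from (20000, 16384)

tracemalloc.start()
np.var(X, ddof=1, axis=0)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# np.var centers and squares X into temporaries, so the peak extra
# allocation is on the order of X itself.
print(f"X: {X.nbytes / 1e6:.0f} MB; np.var peak temporaries: {peak / 1e6:.0f} MB")
```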
Potential solution

So I wonder if we could potentially replace `np.var(X, ddof=1, axis=0)` with an in-place computation of the total variance.

Edit: current semantics are that when the user sets `copy=False`, `X` inside `fit` is the user's own array, so the example below only takes the in-place path when `self.copy` is true.

Example

```python
import numpy as np

def in_place_var(X):
    # Total variance of X (ddof=1) without allocating any X-sized
    # temporaries; destroys X in the process.
    N = X.shape[0] - 1
    X -= X.mean(axis=0)          # center in place
    np.square(X, out=X)          # squared deviations, in place
    np.sum(X, axis=0, out=X[0])  # per-feature sums of squares, into row 0
    X = X[0]
    X /= N                       # per-feature variances
    out = np.sum(X)              # total variance
    del X
    return out
```

And when replacing sklearn/decomposition/_pca.py, lines 620 to 630 (at f9d0467)
with:

```python
# Get variance explained by singular values
self.explained_variance_ = (S ** 2) / (n_samples - 1)
# Use in-place calculation if self.copy
total_var = in_place_var(X) if self.copy else np.var(X, ddof=1, axis=0).sum()
self.explained_variance_ratio_ = self.explained_variance_ / total_var
self.singular_values_ = S.copy()  # Store the singular values.

if self.n_components_ < min(n_features, n_samples):
    self.noise_variance_ = total_var - self.explained_variance_.sum()
    self.noise_variance_ /= min(n_features, n_samples) - n_components
else:
    self.noise_variance_ = 0.0
```

the memory profile over time improves, dropping the peak at the end of the fit (the before/after `memory_profiler` plots were attached to the original comment).

Disclaimer

Obviously I have put no thought into error handling or checking for correctness/precision; this is strictly meant to be a demonstration, since there are surely better ways of handling it. In my manual tests over several random initializations of `X` …

Thoughts? @thomasjpfan
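A quick way to sanity-check the `in_place_var` sketch above against NumPy (minimal, not exhaustive; it works on a copy because the function destroys its input):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5000, 256))

reference = np.var(X, ddof=1, axis=0).sum()
in_place = in_place_var(X.copy())  # copy: in_place_var overwrites its argument

# Accumulation order differs, so compare up to floating-point tolerance.
assert np.isclose(reference, in_place)
```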
What is the shape of `X` you used here?
Sure, but give me numbers? Is it an effect you only see with, for example, n_features >> n_samples and n_features > 10_000?
This was something I only observed for very large sample sizes and feature counts. Specifically, I was testing around (20000, 16384) as originally suggested, but observed this even for sample sizes and feature counts an order of magnitude smaller. I haven't spent too much time checking out the memory profile of many different shape configurations, though I do note that the effect doesn't seem that apparent or strong in smaller cases.
Can you share a full code snippet to replicate easily, e.g. using random data?
```python
import numpy as np
from sklearn.decomposition import PCA

samples = np.random.random((20000, 16384))
pca = PCA(copy=False, n_components=128, svd_solver='randomized', iterated_power=4)
pca.fit_transform(samples)
```
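To get a single peak number instead of a plot, something like this should work (a sketch using `memory_profiler.memory_usage`, with the shape scaled down so it runs on modest hardware):

```python
import numpy as np
from memory_profiler import memory_usage
from sklearn.decomposition import PCA

def run():
    samples = np.random.random((2000, 4096))  # scale up to (20000, 16384) to reproduce
    pca = PCA(copy=False, n_components=128, svd_solver='randomized', iterated_power=4)
    pca.fit_transform(samples)

# memory_usage samples the process RSS (in MiB) while run() executes.
peak = max(memory_usage((run, (), {}), interval=0.1))
print(f"peak RSS: {peak:.0f} MiB")
```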
Description

Randomized `sklearn.decomposition.PCA` uses about `2 * n_samples * n_features` memory (RAM), including the specified samples, while fbpca (https://github.com/facebook/fbpca) uses two times less. Is this expected behaviour?

(I understand that the sklearn version computes more things, like `explained_variance_`.)
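For scale, rough numbers for the `(20000, 16384)` float64 case used in the snippet above (arithmetic only, not a measurement):

```python
n_samples, n_features = 20_000, 16_384
x_gb = n_samples * n_features * 8 / 1e9  # float64 = 8 bytes per element
print(f"X alone: {x_gb:.1f} GB; a 2*X peak: {2 * x_gb:.1f} GB")
# X alone: 2.6 GB; a 2*X peak: 5.2 GB
```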
Steps/Code to Reproduce

sklearn version:

fbpca version:

Expected Results
Randomized `sklearn.decomposition.PCA` uses about `n_samples * n_features + n_samples * n_components + <variance matrices etc.>` memory (RAM).

Actual Results
Randomized `sklearn.decomposition.PCA` uses about `2 * n_samples * n_features` memory (RAM). We see peaks at the `transform` step.

![pca_memory_test](https://user-images.githubusercontent.com/648976/40145721-a0fa827e-596b-11e8-91c4-4b363a18cbd8.jpg)

(generated with `memory_profiler` and `gnuplot`)

Versions
```
Darwin-17.4.0-x86_64-i386-64bit
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.14.3
SciPy 1.1.0
Scikit-Learn 0.19.1
```
(tested on different Linux machines as well)
P.S.
We are trying to perform PCA on large matrices (2M x 16k, ~110 GB). IncrementalPCA is very slow for us. Randomized PCA is very fast, but we are trying to reduce memory consumption so we can use cheaper instances.
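One possible low-memory route (a sketch only; the file name, float32 dtype, and the assumption that the data on disk is already mean-centered are illustrative, not verified): call `randomized_svd` directly on a memory-mapped array, so the whole matrix never has to sit in RAM at once.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Memory-map the (pre-centered) data instead of loading all ~110 GB.
X = np.memmap('data_centered.f32', dtype=np.float32, mode='r',
              shape=(2_000_000, 16_384))

U, S, Vt = randomized_svd(X, n_components=128, n_iter=4, random_state=0)
X_transformed = U * S  # matches PCA's transform output for centered data, up to signs
```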
Thank you.