Unexpected MemoryError from IncrementalPCA used with memmap #5173

Closed
rsnape opened this issue Aug 27, 2015 · 11 comments
Labels
Bug, Easy (Well-defined and straightforward way to resolve), help wanted

Comments

@rsnape

rsnape commented Aug 27, 2015

Prompted by a question asked in Stack Overflow chat (http://chat.stackoverflow.com/transcript/message/25344627#25344627), I investigated why a user would encounter a MemoryError when using the following minimal code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

I found that the MemoryError was raised by this call to check_array (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/incremental_pca.py#L167):

X = check_array(X, dtype=np.float)

The problem is the dtype=np.float argument added to the call to check_array. It means the copy=False default of check_array is effectively ignored: the dtype changes in the underlying call to np.array(), and the numpy docs state that a change of dtype forces a copy irrespective of the copy argument.
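The copying behaviour can be seen directly with numpy, which check_array essentially delegates to. A minimal sketch (the file name and small shape here are illustrative, not the reporter's actual data, which was a (140000, 3504) float16 memmap):

import numpy as np

mm = np.memmap('demo.mmap', dtype=np.float16, mode='w+', shape=(1000, 100))

same = np.asarray(mm, dtype=np.float16)   # dtype unchanged: returns a view, no data copy
print(np.shares_memory(same, mm))         # True

cast = np.asarray(mm, dtype=np.float64)   # dtype changed: forces a full copy into RAM
print(np.shares_memory(cast, mm))         # False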

I made a simple gist demonstrating the issue here: https://gist.github.com/rsnape/bd1f30db4b789a5f7665

I propose that the check_array call above could be changed to

X = check_array(X)

but I am not expert enough in this codebase to know whether that might have negative consequences further down the chain of execution. If that is a good solution, I'm happy to submit a PR.

@kastnerkyle
Member

git blame points the finger right back at me.

@lesteve did you have any issues with this for nilearn?

I am probably OK with loosening check_array to only convert dtypes that are not in {float16, float32, float64}, though I am not sure what the numerical implications are for 16-bit - it might not have enough precision to get numerically identical results to a true PCA.

@rsnape
Author

rsnape commented Aug 28, 2015

Hmmm, there may be deeper issues... I was going to submit a PR based on patching like this (following your comment)

check_type = np.float
if X.dtype in (np.float16, np.float32, np.float64):
    check_type = X.dtype.type
X = check_array(X, dtype=check_type)

This allows memmap objects of the three specific float dtypes to be checked without making a copy, while preserving the same functionality for other dtypes. I've verified this works and that the check does not make a copy if the dtype is unchanged.

However, when I tried this out I then hit a MemoryError further down the road - where IncrementalPCA calls linalg.svd. I looked at that code (in core scipy) and the line where it fails (https://github.com/scipy/scipy/blob/master/scipy/linalg/decomp_svd.py#L103) is a call through to the LAPACK function gesdd, which I assume is a wrapped native routine. So I don't think there's an easy fix here.

Important note

This appears to 'all just work'™ in a 64-bit environment (I guess because the array can be copied into memory).

I think the person who originally raised the issue is probably using float16 so that the very large matrix can be created in a 32-bit environment. numpy prevents creation of a 140000 x 3504 memmap with np.float / np.float64 in a 32-bit environment, failing with OverflowError: cannot fit 'long' into an index-sized integer. With np.float32, I get an error from the OS via Python saying there is not enough storage available to create the memmap.
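For context, a rough back-of-the-envelope size calculation for that array (assuming the shape from the report above):

n_elements = 140000 * 3504      # 490,560,000 elements
print(n_elements * 2 / 1e9)     # float16: ~0.98 GB
print(n_elements * 4 / 1e9)     # float32: ~1.96 GB
print(n_elements * 8 / 1e9)     # float64: ~3.92 GB, well beyond a 32-bit process's 2-4 GB address space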

I wonder if the answer may be to document these limitations. Interested in your thoughts.

@kastnerkyle
Member

Try reducing the minibatch size by setting the parameter batch_size to something like 1000. It sounds like the default batch size is still too big in this case, but maybe I am reading the error wrong.

If the check_array call is an issue on the full array, we could easily do it on a "per minibatch" basis.

@rsnape
Author

rsnape commented Aug 28, 2015

Ah - excellent suggestion. With the patch to the call to check_array and then setting explicit batch_size and n_components, the small example above works. Without the suggested patch to respect the dtype in the original data, the call to check_array still fails (even with batch_size specified).

But once we're past that, it seems the batch_size setting ensures the svd works OK, and n_components should be set to a specific value (as suggested by the helpful error message you put in there when I ran it without n_components 👍).
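Concretely, with the check_array patch applied, the call that went through looked roughly like this (the n_components and batch_size values here are illustrative, not the original poster's settings):

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(n_components=50, batch_size=1000, copy=False)
X_train = clf.fit_transform(ut)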

I'm in a bit of a strange situation, as this issue was raised on Stack Overflow and, while I could trace the source of it, I haven't got a known big dataset to test with, so I'm testing this on an empty memmap of the right size. I could try to get the original dataset or a representative set from the person who raised it on SO - I have no idea how many components they expect to project onto.

I guess it is your call whether to move to doing the check_array on a per-minibatch basis, or leave it as it is, potentially with the patch to respect dtype in (np.float16, np.float32, np.float64).

@kastnerkyle
Member

It is easier to just check at the start - if someone needs it per minibatch we can move it then, since it should be equivalent. But (IMO) IncrementalPCA of bool/int values is a poor idea anyway, and upfront checks are much more straightforward. Inside the minibatch we might have to do a lot of testing to be sure that calls to partial_fit with different dtypes give the same results...

Kyle

@lesteve
Member

lesteve commented Aug 28, 2015

@lesteve did you have any issues with this for nilearn?

Sorry to be late on this one, but I don't remember running into this kind of issue.

@amueller
Member

amueller commented Sep 9, 2015

check_array now supports lists of dtypes: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L288
So the change is just X = check_array(X, dtype=[np.float64, np.float32, np.float16])
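With a list of dtypes, check_array keeps the input's dtype when it is already in the list and only converts otherwise, so a float16 memmap should pass through without a copy. A minimal sketch of that behaviour (file name and shape are illustrative):

import numpy as np
from sklearn.utils import check_array

mm = np.memmap('demo.mmap', dtype=np.float16, mode='w+', shape=(100, 10))
checked = check_array(mm, dtype=[np.float64, np.float32, np.float16])
print(checked.dtype)                  # float16, unchanged
print(np.shares_memory(checked, mm))  # True: no copy was made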

@jnothman
Member

So this is an easy fix, and can't really be unit tested?

@jnothman added the Easy (Well-defined and straightforward way to resolve) label on Jun 14, 2017
@GaelVaroquaux
Member

GaelVaroquaux commented Jun 14, 2017 via email

@cmarmo cmarmo removed this from the 0.19 milestone Sep 29, 2020
@antonky

antonky commented Feb 23, 2021

Hi, this issue can be closed. I wasn't able to reproduce it, and after some digging I see that the fix was implemented here: https://github.com/scikit-learn/scikit-learn/pull/5104/files#diff-5e29b5d9b2eeb884f0a1b69aa52e8e7234880af4c71910fae6a1bc1b5604f13bR199
The current code is different, but the fix is still in place (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/decomposition/_incremental_pca.py#L200): _validate_data calls check_array with the correct dtypes.

@thomasjpfan
Member

Thanks for the update @nakamin! We also currently have a common test to check that all our transformers work with memmapped data.
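For illustration only (this is not the actual common test in scikit-learn), such a regression test could look roughly like this, assuming pytest's tmp_path fixture:

import numpy as np
from sklearn.decomposition import IncrementalPCA

def test_incremental_pca_on_memmap(tmp_path):
    # Fit a transformer on memmapped data and check it runs end to end.
    path = tmp_path / "data.mmap"
    X = np.memmap(path, dtype=np.float32, mode="w+", shape=(200, 20))
    X[:] = np.random.RandomState(0).rand(200, 20)
    ipca = IncrementalPCA(n_components=5, batch_size=50)
    Xt = ipca.fit_transform(X)
    assert Xt.shape == (200, 5)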
