Unexpected MemoryError from IncrementalPCA used with memmap #5173

Closed
rsnape opened this issue Aug 27, 2015 · 11 comments
Labels
Bug, Easy (Well-defined and straightforward way to resolve), help wanted

Comments

@rsnape

rsnape commented Aug 27, 2015

Prompted by a question asked in Stack Overflow chat (http://chat.stackoverflow.com/transcript/message/25344627#25344627), I investigated why a user would encounter a MemoryError when using the following minimal code:

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)

I found that the MemoryError was raised by this call to check_array (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/incremental_pca.py#L167):

X = check_array(X, dtype=np.float)

The problem is the dtype=np.float argument added to the call to check_array. It means the copy=False default of check_array is effectively ignored: the dtype changes in the underlying call to np.array(), and the numpy docs state that a change of dtype forces a copy irrespective of the copy argument.
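The copying behaviour can be seen directly with numpy, which check_array essentially delegates to. A minimal sketch (the file name and small shape here are illustrative, not the reporter's actual data, which was a (140000, 3504) float16 memmap):

import numpy as np

mm = np.memmap('demo.mmap', dtype=np.float16, mode='w+', shape=(1000, 100))

same = np.asarray(mm, dtype=np.float16)   # dtype unchanged: returns a view, no data copy
print(np.shares_memory(same, mm))         # True

cast = np.asarray(mm, dtype=np.float64)   # dtype changed: forces a full copy into RAM
print(np.shares_memory(cast, mm))         # False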

I made a simple gist demonstrating the issue here: https://gist.github.com/rsnape/bd1f30db4b789a5f7665

I propose that the check_array call above could be changed to

X = check_array(X)

but I am not expert enough in this codebase to know whether that might have negative consequences further down the chain of execution. If that is a good solution, I'm happy to submit a PR.

@kastnerkyle
Member

git blame points the finger right back at me.

@lesteve did you have any issues with this for nilearn?

I am probably OK with loosening check_array to only convert dtypes that are not in {float16, float32, float64}, though I am not sure what the numerical implications are for 16-bit - it might not have enough precision to get numerically identical results to a true PCA.

@rsnape
Author

rsnape commented Aug 28, 2015

Hmmm, there may be deeper issues... I was going to submit a PR based on patching like this (following your comment)

check_type = np.float
if X.dtype in (np.float16, np.float32, np.float64):
    check_type = X.dtype.type
X = check_array(X, dtype=check_type)

This allows memmap objects of the three specific float dtypes to be checked without making a copy, while preserving the same functionality for other dtypes. I've verified this works and that the check does not make a copy if the dtype is unchanged.

However, when I tried this out I then hit a MemoryError further down the road - where IncrementalPCA calls linalg.svd. I looked at that code (in core scipy) and the line where it fails (https://github.com/scipy/scipy/blob/master/scipy/linalg/decomp_svd.py#L103) is a call through to the LAPACK function gesdd, which I assume is a wrapped native routine. So I don't think there's an easy fix here.

Important note

This appears to 'all just work'™ in a 64-bit environment (I guess because the array can be copied into memory).

I think the person who originally raised the issue is probably using float16 so that the very large matrix can be created in a 32-bit environment. numpy prevents creation of a 140000 x 3504 memmap with np.float / np.float64 in a 32-bit environment, failing with OverflowError: cannot fit 'long' into an index-sized integer. With np.float32, I get an error from the OS via Python saying there is not enough storage available to create the memmap.
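For context, a rough back-of-the-envelope size calculation for that array (assuming the shape from the report above):

n_elements = 140000 * 3504      # 490,560,000 elements
print(n_elements * 2 / 1e9)     # float16: ~0.98 GB
print(n_elements * 4 / 1e9)     # float32: ~1.96 GB
print(n_elements * 8 / 1e9)     # float64: ~3.92 GB, well beyond a 32-bit process's 2-4 GB address space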

I wonder if the answer may be to document these limitations. Interested in your thoughts.

@kastnerkyle
Member

Try reducing the minibatch size by setting the parameter batch_size to something like 1000. It sounds like the default batch size is still too big in this case, but maybe I am reading the error wrong.

If the check_array call is an issue on the full array, we could easily do it on a "per minibatch" basis.

@rsnape
Author

rsnape commented Aug 28, 2015

Ah - excellent suggestion. With the patch to the call to check_array and then setting explicit batch_size and n_components, the small example above works. Without the suggested patch to respect the dtype in the original data, the call to check_array still fails (even with batch_size specified).

But once we're past that, it seems the batch_size setting ensures the svd works OK, and n_components should be set to a specific value (as suggested by the helpful error message you put in there when I ran it without n_components 👍).
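Concretely, with the check_array patch applied, the call that went through looked roughly like this (the n_components and batch_size values here are illustrative, not the original poster's settings):

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(n_components=50, batch_size=1000, copy=False)
X_train = clf.fit_transform(ut)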

I'm in a bit of a strange situation, as this issue was raised on Stack Overflow and, while I could trace the source of it, I haven't got a known big dataset to test with, so I'm testing this on an empty memmap of the right size. I could try to get the original dataset or a representative set from the person who raised it on SO - I have no idea how many components they expect to project onto.

I guess it is your call whether to move to doing the check_array on a per-minibatch basis, or leave it as it is, potentially with the patch to respect dtype in (np.float16, np.float32, np.float64).

@kastnerkyle
Member

It is easier to just check at the start - if someone needs it per minibatch we can move it then, since it should be equivalent. But (IMO) IncrementalPCA of bool/int values is a poor idea anyway, and upfront checks are much more straightforward. Inside the minibatch we might have to do a lot of testing to be sure that calls to partial_fit with different dtypes give the same results...

Kyle

@lesteve
Member

lesteve commented Aug 28, 2015

@lesteve did you have any issues with this for nilearn?

Sorry to be late on this one, but I don't remember running into this kind of issue.

@amueller
Member

amueller commented Sep 9, 2015

check_array now supports lists of dtypes: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L288
So the change is just X = check_array(X, dtype=[np.float64, np.float32, np.float16])
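With a list of dtypes, check_array keeps the input's dtype when it is already in the list and only converts otherwise, so a float16 memmap should pass through without a copy. A minimal sketch of that behaviour (file name and shape are illustrative):

import numpy as np
from sklearn.utils import check_array

mm = np.memmap('demo.mmap', dtype=np.float16, mode='w+', shape=(100, 10))
checked = check_array(mm, dtype=[np.float64, np.float32, np.float16])
print(checked.dtype)                  # float16, unchanged
print(np.shares_memory(checked, mm))  # True: no copy was made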

@jnothman
Member

So this is an easy fix, and can't really be unit tested?

@jnothman added the Easy (Well-defined and straightforward way to resolve) label on Jun 14, 2017
@GaelVaroquaux
Member

GaelVaroquaux commented Jun 14, 2017 via email

@cmarmo cmarmo removed this from the 0.19 milestone Sep 29, 2020
@antonky

antonky commented Feb 23, 2021

Hi, this issue can be closed. I wasn't able to reproduce it, and after some digging I see that the fix was implemented here: https://github.com/scikit-learn/scikit-learn/pull/5104/files#diff-5e29b5d9b2eeb884f0a1b69aa52e8e7234880af4c71910fae6a1bc1b5604f13bR199
The current code is different, but the fix is still in place (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/decomposition/_incremental_pca.py#L200): _validate_data calls check_array with the correct dtypes.

@thomasjpfan
Member

Thanks for the update @nakamin! We also currently have a common test to check that all our transformers work with memmapped data.
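For illustration only (this is not the actual common test in scikit-learn), such a regression test could look roughly like this, assuming pytest's tmp_path fixture:

import numpy as np
from sklearn.decomposition import IncrementalPCA

def test_incremental_pca_on_memmap(tmp_path):
    # Fit a transformer on memmapped data and check it runs end to end.
    path = tmp_path / "data.mmap"
    X = np.memmap(path, dtype=np.float32, mode="w+", shape=(200, 20))
    X[:] = np.random.RandomState(0).rand(200, 20)
    ipca = IncrementalPCA(n_components=5, batch_size=50)
    Xt = ipca.fit_transform(X)
    assert Xt.shape == (200, 5)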
