
iloc breaks on read-only dataframe #10043

Closed
amueller opened this issue May 1, 2015 · 28 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves

amueller commented May 1, 2015

This is picking up #9928 again. I don't know whether the behavior is expected, but it seems a bit odd to me. Maybe I'm doing something wrong; I'm not that familiar with the pandas internals.

We call df.iloc[indices] and that breaks on a read-only dataframe. I feel that it shouldn't, though, as it is not writing.

Minimal reproducing example:

import pandas as pd
import numpy as np
array = np.eye(10)
array.setflags(write=False)

X = pd.DataFrame(array)
X.iloc[[1, 2, 3]]

ValueError: buffer source array is read-only

Is there a way to slice the rows of the dataframe in another way that doesn't need a writeable array?
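One workaround (a sketch, not pandas API advice) is to fancy-index the underlying NumPy array first and then wrap the result. Fancy indexing always allocates a new array, so it works on a read-only source and copies only the selected rows:

```python
import numpy as np
import pandas as pd

array = np.eye(10)
array.setflags(write=False)

# Fancy indexing on the ndarray always allocates a new array, so it
# works on a read-only source and copies only the selected rows.
rows = array[[1, 2, 3]]
X_subset = pd.DataFrame(rows)
```

The trade-off is that any non-default index or column labels of an existing frame would have to be re-applied manually.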


amueller commented May 1, 2015

Maybe to clarify: I would be happy if the sliced DF is a copy, I just don't want to copy the original data [I think].


jreback commented May 1, 2015

by definition what you are doing is ALWAYS going to be a copy. Frames are stored in columns.

In [10]: df = DataFrame(np.arange(20).reshape(5,4))

In [11]: df
Out[11]: 
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

In [12]: result = df.iloc[[1,2,3]]

In [13]: result
Out[13]: 
    0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

In [14]: result.values.base
Out[14]: 
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [15]: df._data
Out[15]: 
BlockManager
Items: Int64Index([0, 1, 2, 3], dtype='int64')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(0, 4, 1), 4 x 5, dtype: int64


In [17]: df._data.blocks[0].values
Out[17]: 
array([[ 0,  4,  8, 12, 16],
       [ 1,  5,  9, 13, 17],
       [ 2,  6, 10, 14, 18],
       [ 3,  7, 11, 15, 19]])

You can just access the block values as shown in In [17] if you want the actual numpy array. Keep in mind this is ONLY valid for a single-dtyped frame.
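A small illustration of the single-dtype caveat: with mixed dtypes, the 2-D values array must be upcast to a common dtype, so it is necessarily a fresh copy rather than a view of any single block.

```python
import numpy as np
import pandas as pd

single = pd.DataFrame(np.arange(20).reshape(5, 4))            # one int64 block
mixed = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})  # int64 + float64

# With mixed dtypes the 2-D values array must be upcast to a common
# dtype, so it is necessarily a fresh copy rather than a block view.
print(single.values.dtype)  # int64
print(mixed.values.dtype)   # float64 (the int column is upcast)
```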

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label May 1, 2015
@TomAugspurger

Would it be OK for scikit-learn to rely on ._data / blocks in library code?

This also isn't great since I've passed DataFrames into scikit-learn that are mixes of int / float. I'm guessing that's a pretty common thing.


amueller commented May 1, 2015

Getting a copy of the slice is fine; I'd just like to avoid copying the whole initial data.
I would also like to retain the dataframe structure.

Can you elaborate what it means that this is only valid for single-dtyped frames?

I would like to have something that works on mixed type and single dtype frames and produces a dataframe.

To elaborate on the use-case:
We have a function that does some slicing, cross_val_score that takes a model. I want the slicing to work and not destroy the dataframe, in case the user provided a custom model that wants a dataframe.


jreback commented May 1, 2015

then use .iloc as you have indicated. I am not sure why you'd have a read-only array in the first place. I suppose that could be handled (it actually borks because it's taking with a memoryview).

Slicing doesn't touch the original data at all; it will make a copy. (It will hand you a view if possible, but this is cross-sectional slicing, so that is numpy-dependent, and ONLY if you have a single dtype.) If you have multiple dtypes, you will ALWAYS get a copy.

What is the actual error you are getting?


shoyer commented May 1, 2015

External libraries should not be using the private pandas block system _data. I think that suggestion is a non-starter.

I agree that it is weird that read-only arrays don't work with .iloc. The ultimate problem appears to be that Cython memory views can't be created on read-only arrays:
https://mail.python.org/pipermail/cython-devel/2013-February/003384.html

To fix this on our end, we should probably wrap our calls to the internal take_nd Cython routines with something that ensures the source arrays are writable. Given that we aren't actually writing to them, I think that should be safe.
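A minimal sketch of that idea, assuming the source array owns its memory (the writable helper is hypothetical, not pandas API). Note it would not help for arrays that merely view someone else's read-only buffer, such as a memmap opened with mode='r':

```python
import contextlib
import numpy as np

@contextlib.contextmanager
def writable(arr):
    # Hypothetical helper, not pandas API: temporarily flip the
    # writeable flag, restoring it on exit. NumPy only allows this
    # when the array owns its memory (or its base is writeable), so
    # it would NOT work for e.g. a memmap opened with mode='r'.
    was_writeable = arr.flags.writeable
    try:
        arr.setflags(write=True)
        yield arr
    finally:
        arr.setflags(write=was_writeable)

array = np.eye(3)
array.setflags(write=False)
with writable(array) as a:
    assert a.flags.writeable      # safe only if nothing actually writes
assert not array.flags.writeable  # flag restored afterwards
```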


amueller commented May 1, 2015

The traceback is in #9928. I agree with @shoyer that it seems to be caused by what is described in the Cython thread. The solution sounds OK to me, but as I said, I don't know much about the internals of pandas.

The non-writeable array comes from joblib. I am not entirely sure about the background and will investigate whether we can get rid of it. I was not aware of the Cython behavior when I opened the issue and hoped I was actually doing something wrong.


shoyer commented May 1, 2015

Ultimately, I think this issue should be solved on the Cython end of things -- there are plenty of legitimate uses for read-only memory views. In fact, @richardhansen reports in this stackoverflow answer that he has a rough patch for this.

In the meantime, does anyone know how to create a "writeable" array from a read-only array while (a) not making any copies of the underlying data and (b) not modifying the source array? This does not seem to work:

array = np.arange(10.0)
array.setflags(write=False)

array2 = np.asarray(array)
array2.setflags(write=True)
print(array.flags.writeable)
True
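The reason the flag appears to "transfer" is that np.asarray returns its argument unchanged when it is already an ndarray, so array2 is the very same object and the setflags call mutates the source:

```python
import numpy as np

array = np.arange(10.0)
array.setflags(write=False)

array2 = np.asarray(array)
assert array2 is array        # asarray returned the input, no new array

array2.setflags(write=True)
assert array.flags.writeable  # so the "source" flag changed as well
```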


amueller commented May 1, 2015

Ok, I think it is fair of you to punt this to Cython. I have to work around it in sklearn for the moment anyhow (even if you fixed it, we'd want to support the current stable pandas, etc.).
I think asarray doesn't do anything if you pass it an array, but:

array = np.arange(10.0)
array.setflags(write=False)

array2 = np.array(array)
array2.setflags(write=True)
print(array.flags.writeable)

False


amueller commented May 1, 2015

Feel free to close.


shoyer commented May 1, 2015

@amueller The problem with calling np.array is that it copies the data... which is generally not ideal (especially for a routine like .iloc, which is supposed to be fast). (I just updated my comment from 8 minutes ago to clarify what I'm looking for.)


amueller commented May 1, 2015

brainfart. With copy=False it also gives "True".

@jreback jreback added this to the Next Major Release milestone May 4, 2015

ogrisel commented May 5, 2015

Here is a more realistic use case for the same underlying problem. Assume we have some data stored in a file on disk by some data-generating process. For instance, a process that saves 100 random numbers:

>>> import numpy as np
>>> np.save('/tmp/data.npy', np.random.randn(100))

In a more realistic setting this file could be hundreds of GB and would not necessarily fit in memory.

Now assume that a second program wants to use the pandas API to manipulate this data, but also wants to leverage numpy's memory-mapping feature to avoid copying memory between processes running on the same host, and to load from the drive only the memory pages actually used by the program. Wrapping the memory-mapped data works fine:

>>> import pandas as pd
>>> import numpy as np
>>> data = np.load('/tmp/data.npy', mmap_mode='r')
>>> df = pd.DataFrame(data, columns=['column'])
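The mapped array reports writeable=False, which is exactly the flag the Cython memoryview constructor rejects. A quick check (using a temporary file rather than /tmp):

```python
import os
import tempfile
import numpy as np

# Reproduce the setup with a temporary file instead of /tmp.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.npy")
    np.save(path, np.random.randn(100))
    data = np.load(path, mmap_mode="r")
    is_writeable = data.flags.writeable  # False for a read-only map
    del data  # release the map before the directory is removed
print(is_writeable)
```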

Slicing the data frame by rows works fine:

>>> df.iloc[:3]
     column
0  0.419577
1 -1.050912
2 -0.562929

However, fancy indexing on the DataFrame breaks:

>>> df.iloc[[1, 2, 3]]
Traceback (most recent call last):
  File "<ipython-input-55-7550557fa06e>", line 1, in <module>
    df.iloc[[1, 2, 3]]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 1217, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 1508, in _getitem_axis
    return self._get_loc(key, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/indexing.py", line 92, in _get_loc
    return self.obj._ixs(key, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/frame.py", line 1714, in _ixs
    result = self.take(i, axis=axis)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/generic.py", line 1351, in take
    convert=True, verify=True)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3269, in take
    axis=axis, allow_dups=True)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3156, in reindex_indexer
    for blk in self.blocks]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 3156, in <listcomp>
    for blk in self.blocks]
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/internals.py", line 851, in take_nd
    allow_fill=True, fill_value=fill_value)
  File "/Users/ogrisel/venvs/py34/lib/python3.4/site-packages/pandas/core/common.py", line 823, in take_nd
    func(arr, indexer, out, fill_value)
  File "pandas/src/generated.pyx", line 3472, in pandas.algos.take_2d_axis0_float64_float64 (pandas/algos.c:89674)
  File "stringsource", line 614, in View.MemoryView.memoryview_cwrapper (pandas/algos.c:175798)
  File "stringsource", line 321, in View.MemoryView.memoryview.__cinit__ (pandas/algos.c:172387)
ValueError: buffer source array is read-only

However, I don't see any compelling reason why pandas would require write access to the underlying memory-mapped buffer.

Edit: in my original comment I created the data frame with pd.DataFrame({'column': data}), which apparently prevents the problem from occurring.
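A sketch of why the dict form sidesteps the error, assuming (as pandas does by default) that dict-based construction copies the input into new, writeable blocks:

```python
import numpy as np
import pandas as pd

data = np.arange(5.0)
data.setflags(write=False)

# Dict-based construction copies the input into new blocks, so the
# frame no longer shares (read-only) memory with the source array.
df = pd.DataFrame({'column': data})
df.iloc[0, 0] = 99.0   # writing to the frame is fine
assert data[0] == 0.0  # the source array is untouched
```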


amueller commented May 5, 2015

@ogrisel as mentioned above, as soon as you get a buffer into Cython you'll run into this issue, so pandas can't really do much.


ogrisel commented May 5, 2015

Indeed, I had not properly read the end of the discussion. An alternative would be to use the ndarray Cython type instead of a Cython memoryview for those functions.

I did a quick patch on the generated.pyx file to check:

diff --git a/pandas/src/generated.pyx b/pandas/src/generated.pyx
index cab3a84..cb9f586 100644
--- a/pandas/src/generated.pyx
+++ b/pandas/src/generated.pyx
@@ -3469,7 +3469,7 @@ def take_2d_axis0_float32_float64(float32_t[:, :] values,

 @cython.wraparound(False)
 @cython.boundscheck(False)
-def take_2d_axis0_float64_float64(float64_t[:, :] values,
+def take_2d_axis0_float64_float64(ndarray[float64_t, ndim=2] values,
                                     ndarray[int64_t] indexer,
                                     float64_t[:, :] out,
                                     fill_value=np.nan):
@@ -9360,5 +9360,3 @@ def inner_join_indexer_int64(ndarray[int64_t] left,
                 j += 1

     return result, lindexer, rindexer

and it seems to make iloc work properly on dataframes backed by read-only arrays. Is there any performance or other benefit to using memoryviews for the values argument of the take_nd_axis* functions?


jreback commented May 5, 2015

@ogrisel this was originally done for a non-trivial perf benefit, which requires the memory to be correctly aligned (that is handled at a higher level). So we would have to see how perf differs.


amueller commented May 5, 2015

you have difficulty and effort tags. nice.


jreback commented May 5, 2015

stolen from astropy


amueller commented May 5, 2015

Sorry for the OT: @jreback are they working well for you? We mostly use a very vague "easy" tag.


ogrisel commented May 5, 2015

Maybe a "temporary" workaround would be to have take_2d_axis0_float64_float64 and the like call an inline variant that uses the memoryview only when values.flags.writeable is True, and fall back to an inline variant with the ndarray type otherwise. That would introduce a small per-call overhead, but it should not depend on the size of the arrays. The main con would be the added code complexity, but since this code is already generated from a template, that might not be such a big deal. WDYT?
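A pure-Python sketch of that dispatch (take_2d_axis0 and _take_memoryview_path are hypothetical stand-ins; the real kernels are in Cython): route writeable arrays through the fast path, and otherwise fall back to plain ndarray indexing, which never needs write access.

```python
import numpy as np

def _take_memoryview_path(values, indexer):
    # Stand-in for the fast Cython memoryview kernel; plain NumPy
    # fancy indexing here keeps the sketch runnable.
    return values[indexer]

def take_2d_axis0(values, indexer):
    # Hypothetical dispatcher (not the real pandas function): only
    # route writeable arrays through the memoryview kernel; fancy
    # indexing on the ndarray itself never needs write access.
    if values.flags.writeable:
        return _take_memoryview_path(values, indexer)
    return values[indexer]

arr = np.arange(20.0).reshape(5, 4)
arr.setflags(write=False)
out = take_2d_axis0(arr, np.array([1, 2, 3]))
```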


jreback commented May 6, 2015

@ogrisel yes that would work, and shouldn't be too complex.

@amueller I think it's too soon to tell whether the labels are working; we used to have a 'good as first PR' label, which is now our 'Difficulty Novice'. These were useful at the PyCon hackathon. I think they will also be useful in the long run, but it will take some time to classify things.


ogrisel commented May 6, 2015

@ogrisel yes that would work, and shouldn't be too complex.

Do you want to do it yourself, or would you rather I give it a try?


jreback commented May 6, 2015

@ogrisel be my guest!


ogrisel commented May 6, 2015

I submitted a first draft in #10070. Feedback appreciated.

@jreback jreback modified the milestones: 0.16.1, Next Major Release May 6, 2015

rhansen commented May 7, 2015

In the meantime, does anyone know how to create a "writeable" array from a read-only array while (a) not making any copies of the underlying data and (b) not modifying the source array?

I'm not familiar with pandas, but maybe this gross hack will do what you want:

from cpython.buffer cimport PyBUF_WRITABLE, \
    PyObject_CheckBuffer, PyObject_GetBuffer

cdef class ForceWritableBufferWrapper:
    cdef object buf
    def __cinit__(self, object buf):
        if not PyObject_CheckBuffer(buf):
            raise TypeError("argument must follow the buffer protocol")
        self.buf = buf
    def __getbuffer__(self, Py_buffer *view, int flags):
        PyObject_GetBuffer(self.buf, view, flags & ~PyBUF_WRITABLE)
        view.readonly = 0
    def __releasebuffer__(self, Py_buffer *view):
        pass

This Cython extension type provides a writable buffer interface for any buffer object, even if the underlying buffer is read-only. Of course, this must only be used if you are absolutely certain nobody will write to the buffer; otherwise Bad Things(tm) will happen.

You can probably do some __setattr__ and __getattribute__ magic to make this wrapper act more like the underlying object if you so desire.


jreback commented May 7, 2015

@richardhansen interesting idea. Do you know whether this works with np.memmap (that is the original use case)? Would this be thread-safe? (I suppose we could just add a with gil to force it anyhow.)


jreback commented May 8, 2015

Closed by #10070.

@jreback jreback closed this as completed May 8, 2015

rhansen commented May 8, 2015

Do you know if this works with np.memmap (that is the original usecase).

I haven't tested that specific class, but it should work with any object that implements the buffer protocol.

Would this be thread-safe?

Yes. Just like with most other Cython code, you shouldn't need with gil unless you're doing something special (e.g., in a callback invoked by a plain C function that was called in a with nogil block).
