
Segfault on creation of DataFrame with np.empty_like #13717

Closed
flatberg opened this issue Jul 20, 2016 · 18 comments
Labels: Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@flatberg commented Jul 20, 2016

This code segfaults:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 3), index='A B C'.split())
pd.DataFrame(np.empty_like(df.index))
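For reference, building the same frame from an explicitly None-filled array avoids the uninitialized object storage entirely (a sketch, not part of the original report; `np.full` stores a real, refcounted reference in every slot):

```python
import numpy as np
import pandas as pd

# Fill explicitly with None instead of relying on np.empty_like's
# uninitialized object storage; every slot holds a real PyObject*.
idx = pd.Index('A B C'.split())
arr = np.full(len(idx), None, dtype=object)
df = pd.DataFrame(arr)  # safe to create and print
```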

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-91-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback (Contributor) commented Jul 20, 2016

In [1]: pd.DataFrame(np.empty_like(np.array([1,2])))
Out[1]:
     0
0  109
1    0

In [2]: pd.DataFrame(np.empty_like(np.array(['a', 'b'])))
Out[2]:
  0
0
1

# segfault
In [3]: pd.DataFrame(np.empty_like(Index(['a', 'b'])))
Out[3]:

not really sure what np.empty_like actually does under the hood.

cc @charris
cc @gfyoung

@jreback jreback added the Bug label Jul 20, 2016
@jreback jreback added this to the Next Major Release milestone Jul 20, 2016
@jreback jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label Jul 20, 2016
@gfyoung (Member) commented Jul 20, 2016

I suspect it is a case of memory corruption on the numpy end when creating a None for the object dtype, as evidenced below:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.empty_like(pd.Index(['a', 'b']))
>>> arr
array([None, None], dtype=object)
>>> arr2 = np.array([None, None], dtype=object)
>>> arr == arr2
True
>>> arr.data.tobytes()    # segfault
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> arr2.data.tobytes()   # no segfault
b'0\x91\xad\x8b\xd8\x7f\x00\x000\x91\xad\x8b\xd8\x7f\x00\x00'

You can see here that the data is not corrupted when you have integers:

>>> import numpy as np
>>> import pandas as pd
>>> arr = np.empty_like(pd.Index([1, 2]))
>>> arr
array([140568033367712, 140568033367744], dtype=int64)
>>> arr2 = np.array([140568033367712, 140568033367744], dtype=np.int64)
>>> arr == arr2
True
>>> arr.data.tobytes()    # no segfault
b'\xa0\x16\xb2\x8b\xd8\x7f\x00\x00\xc0\x16\xb2\x8b\xd8\x7f\x00\x00'
>>> arr2.data.tobytes()   # no segfault
b'\xa0\x16\xb2\x8b\xd8\x7f\x00\x00\xc0\x16\xb2\x8b\xd8\x7f\x00\x00'

Also, you don't need an Index to trigger the segfault:

>>> import numpy as np
>>> import pandas as pd
>>> pd.DataFrame(np.empty_like([None]))

@bashtage (Contributor)

You can do more weird tricks with this:

np.empty_like(np.array([None, None], dtype=object)).data.tobytes()
Out[41]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

np.empty_like(np.array([None, None], dtype=object)).astype(np.int64).data.tobytes()
Out[42]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

np.empty_like(np.array([None, None], dtype=object)).astype(np.int64).astype(np.object).data.tobytes()
Out[43]: b'\xd0\x01\x9ac\x00\x00\x00\x00\xd0\x01\x9ac\x00\x00\x00\x00'

This looks like a NumPy bug.

I suspect that an object array initialized using None should, if correct, contain the value of id(None). Zero-filled "Nones" don't make any sense, and seem to reflect incorrect handling on NumPy's part.

>>> np.array([None]).data.tobytes()[::-1].hex()
'0000000063979420'

>>> hex(id(None))
'0x63979420'

@jreback (Contributor) commented Jul 20, 2016

agree with @bashtage here - some memory aliasing going on

want to open a numpy side issue?

@gfyoung (Member) commented Jul 21, 2016

@jreback : I opened a PR now in numpy that should fix the issue. However, even if it doesn't get merged, technically, this issue could be closed since the bug is on the numpy side.

@jreback (Contributor) commented Jul 21, 2016

Running this in the debugger actually works. I think there is a reference issue with the Nones; an element-wise copy would fix it, I suppose (from the pandas perspective). But yes, this is a numpy bug.
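The element-wise copy suggested here could look something like the following sketch (`materialize` is a hypothetical helper, not pandas code; it assumes Python-level item access maps NULL object fields to real None objects, per the numpy convention discussed below):

```python
import numpy as np

def materialize(arr):
    # Copy an object array element by element through Python-level access.
    # Reading arr.flat yields real Python objects (NULL fields come back as
    # None), and assignment stores a real, refcounted PyObject* in each slot.
    out = np.empty(arr.shape, dtype=object)
    for i, x in enumerate(arr.flat):
        out.flat[i] = x
    return out
```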

@jreback jreback closed this as completed Jul 21, 2016
@jreback jreback modified the milestones: No action, Next Major Release Jul 21, 2016
@bashtage (Contributor)

It is probably, technically, a pandas bug. NumPy returns an empty array filled with NULLs, and pandas doesn't seem to know what to do when an object array contains NULLs. That said, it would be better if NumPy addressed this by never returning object arrays containing NULLs, and instead stored the address of None.

@gfyoung (Member) commented Jul 21, 2016

@bashtage : True, but the annoying part of this issue on the pandas end is that it's hard to step through and catch where exactly the segfault is occurring. I tried throwing exceptions in the middle of __init__ for DataFrame and NDFrame without any success.

@njsmith commented Jul 21, 2016

This should probably be treated as a pandas bug, and definitely not as a bug in empty_like specifically.

Numpy's semantics for interpreting the bytes in object arrays are, consistently:

  • Zero field values represent None, but without holding a refcount.
  • All other field values represent a concrete PyObject*, and hold a refcount.

If pandas wants to implement its own logic for reading objects out of object arrays, then it kinda has to implement the same semantics that numpy uses.

Potentially numpy could get rid of the special case for null pointers. I'm not actually sure whether this would be a good idea or not, since there is a bit of an optimization there (the special case means we can allocate a large array of None objects, as in np.empty or np.zeros, in O(1) time instead of the O(n) it would take to initialize all those fields and issue all those calls to Py_INCREF; but OTOH maybe the win here is trivial and not worth the complexity, I dunno).

Getting rid of this special case is more involved than just patching empty_like, though: this logic is implemented consistently in all numpy code dealing with object arrays, and potentially depended on by third party libraries.
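The convention described above can be illustrated from Python with ctypes (a sketch that pokes at CPython/numpy internals and should be treated as illustrative only; `read_object_fields` is a made-up helper, not a numpy API):

```python
import ctypes
import numpy as np

def read_object_fields(arr):
    # Read the pointer-sized fields of a contiguous object array directly
    # and decode them the way numpy does: a zero field means None (with no
    # refcount held), any other value is a live PyObject*.
    ptrs = (ctypes.c_size_t * arr.size).from_address(arr.ctypes.data)
    return [None if p == 0
            else ctypes.cast(ctypes.c_void_p(p), ctypes.py_object).value
            for p in ptrs]

# Arrays built from Python lists hold real pointers, so they round-trip:
vals = read_object_fields(np.array([42, 'ab'], dtype=object))
```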

@gfyoung (Member) commented Jul 21, 2016

@njsmith: So you have semantically aliased the zero field value to Py_None if I understand properly?

@njsmith commented Jul 22, 2016

Well not me personally :-). But yes, that's my understanding.

@gfyoung (Member) commented Jul 23, 2016

Okay, given the push-back I'm seeing from the numpy maintainers, I dug a bit deeper on the pandas end: the issue is in the printing of the DataFrame and the formatting behind the scenes. The segfault can be triggered in a much simpler fashion:

>>> import pandas.lib as lib
>>> import numpy as np
>>> lib.isnullobj(np.empty_like([None]))

This drew my attention when I realised the segfault wasn't occurring on the mere creation of the DataFrame, but rather when I called str(...). pdb indicated a segfault being signalled in lib.c, the Cythonized version of lib.pyx.

So it does seem that pandas does not handle these zero-field None values, but I didn't realise they would surface during printing! Definitely on the right track, so I will continue digging...
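In spirit, the eventual fix makes the null check go through Python-level item access instead of a typed Cython buffer. A pure-Python sketch of such a check (`isnull_obj` here is illustrative, not the actual pandas function):

```python
import numpy as np

def isnull_obj(arr):
    # Iterating at the Python level converts NULL object fields into real
    # None objects before we ever touch them, so no raw pointer is
    # dereferenced. The x != x test additionally catches float NaN.
    return np.array([x is None or x != x for x in arr.ravel()], dtype=bool)
```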

@gfyoung (Member) commented Jul 23, 2016

Okay, so the issue is a memory access violation occurring right here (if you follow my example). This definitely looks like a numpy issue to me, as we do nothing to the object array beforehand.

I should also point out that if I define the following function in numpy (e.g. mtrand.pyx):

def getfirstelt(ndarray[object] arr):
    return arr[0]

and then do this:

>>> import numpy as np
>>> from numpy.random import getfirstelt
>>> getfirstelt(np.empty_like([None]))

this also segfaults.

@njsmith : what are your thoughts about this?

@njsmith commented Jul 23, 2016

@gfyoung: I think that your code there is triggering cython to generate some C code that directly pokes around inside the ndarray's memory. Possibly cython is generating buggy C code?

@gfyoung (Member) commented Jul 23, 2016

@njsmith : Perhaps, though FWIW, this will work (i.e. not segfault):

def getfirstelt(ndarray arr):
    return arr[0]

It could be a Cython bug, and the reason it hasn't surfaced in numpy is that __getitem__, if I'm not mistaken, is already implemented in C on your side, so there is no Cython layer to compile for that.

@njsmith commented Jul 23, 2016

So look at the C code cython generates for those two snippets and see what the difference is :-)

@gfyoung (Member) commented Jul 23, 2016

@njsmith : maybe when I have A LOT of time on my hands :). FYI, I think we can also close the related numpy issue since we figured out what to do here.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jul 23, 2016
Weird segfault arises when you call lib.isnullobj
(or any of its equivalents like lib.isnullobj2d) with
an array that uses 0-field values to mean None. Changed
input to be a Python object (i.e. no typing), and the
segfault went away.

Closes pandas-devgh-13717.

[ci skip]
jreback pushed a commit that referenced this issue Jul 24, 2016
Weird segfault arises when you call `lib.isnullobj` with an array that
uses 0-field values to mean `None`.  Changed input to be a `Python`
object (i.e. no typing), and the segfault went away.  Discovered when
there were segfaults in printing a `DataFrame` containing such an
array.

Closes #13717.

Author: gfyoung <gfyoung17@gmail.com>

Closes #13764 from gfyoung/isnullobj-segfault and squashes the following commits:

0338b5d [gfyoung] BUG: Fix segfault in lib.isnullobj
@jreback jreback modified the milestones: 0.19.0, No action Jul 24, 2016
@jreback (Contributor) commented Jul 24, 2016

closed by #13764
