BUG: Segfault in DataFrame.pivot with complex #12666

ycopin · 2016-03-18T17:56:48Z

With the attached CSV file test_csv.txt, the following code segfaults once the input dataframe has more than 16 entries:

import pandas as PD

df = PD.read_csv('test_csv.txt',
                 index_col=0,
                 converters={'xyinrad': complex,
                             'xyoutm': complex})

# Pivot works fine with 16 entries
df[:-1].info()
pivot = df[:-1].pivot('wave', 'xyinrad', 'xyoutm')
print("Still alive".center(50, '!'))

# Pivot segfaults with 17 entries
df.info()
print("Dave, my mind is going.".center(50, '.'))
pivot = df.pivot('wave', 'xyinrad', 'xyoutm')

Result

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 3 columns):
wave       16 non-null float64
xyinrad    16 non-null complex128
xyoutm     16 non-null complex128
dtypes: complex128(2), float64(1)
memory usage: 768.0 bytes
!!!!!!!!!!!!!!!!!!!Still alive!!!!!!!!!!!!!!!!!!!!
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 0 to 16
Data columns (total 3 columns):
wave       17 non-null float64
xyinrad    17 non-null complex128
xyoutm     17 non-null complex128
dtypes: complex128(2), float64(1)
memory usage: 816.0 bytes
.............Dave, my mind is going...............
Segmentation fault (core dumped)

Expected Output

No segfault!
Return the pivot table

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-38-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.4
pip: 8.1.1
setuptools: 20.3
Cython: 0.21.1
numpy: 1.10.4
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: None
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
boto: None


jreback · 2016-03-18T18:03:11Z

Can you show a reproducible example that is copy-pastable, e.g. either include the data in-line or create a sample frame? You are pivoting on complex numbers; do you mean to do that?

ycopin · 2016-03-18T18:12:01Z

Sorry, here is a copy-pastable example:

from io import StringIO
import pandas as PD

csv = StringIO(u"""
,wave,xyinrad,xyoutm
0,1.55e-06,(0.00523598775598+0.0200712863979j),(-0.06421774-0.06848953j)
1,1.55e-06,(0.00523598775598+0.0209439510239j),(-0.06422449-0.07932723j)
2,1.55e-06,(0.00610865238198+0.00872664625997j),(-0.07546985+0.07215781j)
3,1.55e-06,(0.00610865238198+0.00959931088597j),(-0.07540574+0.06128717j)
4,1.55e-06,(0.00610865238198+0.010471975512j),(-0.07534734+0.05043315j)
5,1.55e-06,(0.00610865238198+0.011344640138j),(-0.07529464+0.03959332j)
6,1.55e-06,(0.00610865238198+0.012217304764j),(-0.07524759+0.02876524j)
7,1.55e-06,(0.00610865238198+0.01308996939j),(-0.07520614+0.01794649j)
8,1.55e-06,(0.00610865238198+0.013962634016j),(-0.07517025+0.007134677j)
9,1.55e-06,(0.00610865238198+0.014835298642j),(-0.07513988-0.003672601j)
10,1.55e-06,(0.00610865238198+0.0157079632679j),(-0.07511502-0.01447775j)
11,1.55e-06,(0.00610865238198+0.0165806278939j),(-0.07509567-0.02528313j)
12,1.55e-06,(0.00610865238198+0.0174532925199j),(-0.0750818-0.03609117j)
13,1.55e-06,(0.00610865238198+0.0183259571459j),(-0.07507343-0.04690428j)
14,1.55e-06,(0.00610865238198+0.0191986217719j),(-0.07507053-0.05772485j)
15,1.55e-06,(0.00610865238198+0.0200712863979j),(-0.07507306-0.06855524j)
16,1.55e-06,(0.00610865238198+0.0209439510239j),(-0.07508092-0.07939773j)
""")

df = PD.read_csv(csv,
                 index_col=0,
                 converters={'xyinrad': complex,
                             'xyoutm': complex})

# Pivot works fine with 16 entries
df[:-1].info()
pivot = df[:-1].pivot('wave', 'xyinrad', 'xyoutm')
print("Still alive".center(50, '!'))

# Pivot segfaults with 17 entries
df.info()
print("Dave, my mind is going.".center(50, '.'))
pivot = df.pivot('wave', 'xyinrad', 'xyoutm')

Yes, I do want to pivot on complex numbers; they are coordinates. Using float-like values as index and columns may not sound very clever, but this was working fine until now.

jreback · 2016-03-18T19:28:13Z

hmm, seems odd that that core dumps.

ycopin · 2016-03-18T21:07:07Z

For what it's worth, a valgrind run points to the sorting function:

==17290== Invalid read of size 8
==17290==    at 0x6AF4328: OBJECT_compare (arraytypes.c.src:2679)
==17290==    by 0x6BC9C15: npy_quicksort (quicksort.c.src:401)
==17290==    by 0x6B4CE34: _new_sortlike (item_selection.c:871)
==17290==    by 0x6B8FAB0: array_sort (methods.c:1159)

jreback · 2016-03-18T21:10:20Z

hmm, easiest is simply to step thru this (until it borks).
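
One way to narrow down a crash like this without losing the interactive session is to run the reproducer in a child interpreter with Python's faulthandler enabled, so that a Python-level traceback is written to stderr when the process receives SIGSEGV. The following is only a sketch under stated assumptions: it targets Python 3.7 or later (the thread itself used Python 2.7), and `run_isolated` and `killed_by_signal` are hypothetical helper names, not part of pandas or NumPy.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run `snippet` in a fresh interpreter with faulthandler enabled.

    A crash in the child cannot take down the calling process, and
    faulthandler dumps the Python traceback to the child's stderr.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def killed_by_signal(result):
    """Return True if the child was terminated by a signal (POSIX only).

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    return result.returncode < 0


# Harmless demonstration: the child exits normally.
demo = run_isolated("print('still alive')")
print(demo.returncode, demo.stdout.strip())
```

To bisect the failing size, the reproducer could be passed as the snippet for increasing values of n, checking `killed_by_signal` and reading `stderr` after each run.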

ycopin · 2016-03-19T07:57:34Z

The segfault originates in uniques.argsort() of pandas.core.algorithms.factorize(df['xyinrad'], sort=True) (called from MultiIndex.from_arrays([df['wave'], df['xyinrad']]) and pandas.core.categorical.Categorical(df['xyinrad'], ordered=True)).

The issue surely comes from the fact that the complex array is converted to 'object' dtype, for lack of a complex case in pandas.core.algorithms._get_data_algo() (and the associated _hashtables).

Could a complex128 case be added alongside int64 and float64? That looks like major work, though...

ycopin · 2016-03-19T10:28:35Z

Here is the shortest segfault case:

from pandas.core.algorithms import factorize
n = 17
codes, categories = factorize([ complex(i) for i in range(n) ], sort=True)

This does not crash with n = 16, nor with sort=False.

This raises two naive questions:

1. In MultiIndex.from_arrays, why is Categorical.from_array(arr, ordered=True) always called with option ordered=True while arr might not be sortable (e.g. of type 'object')?
2. In Categorical initialization, why is factorize(values, sort=True) always called with option sort=True while values might not be sortable (e.g. of type 'object')?
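
The difference between sort=False and sort=True can be illustrated without pandas or NumPy. The following is a minimal pure-Python sketch of hash-table factorization, not pandas' implementation; `factorize_sketch` and its `sort_key` parameter are hypothetical names introduced here. It shows that assigning codes only needs hashing, while sorting the uniques needs an ordering, which Python's complex type does not define. The lexicographic (real, imag) key is an assumption chosen for illustration; it mirrors the order NumPy uses for native complex dtypes.

```python
def factorize_sketch(values, sort=False, sort_key=None):
    """Return (codes, uniques) for a sequence of hashable values.

    Hypothetical illustration only; pandas.core.algorithms.factorize
    is implemented differently.
    """
    table = {}  # value -> code, in order of first appearance
    codes = [table.setdefault(v, len(table)) for v in values]
    uniques = list(table)
    if not sort:
        # No comparison between values is ever performed on this path.
        return codes, uniques

    # Sorting compares values (or their keys), so unorderable values
    # such as complex numbers need an explicit sort_key.
    key = sort_key if sort_key is not None else (lambda v: v)
    order = sorted(range(len(uniques)), key=lambda i: key(uniques[i]))
    remap = {old: new for new, old in enumerate(order)}
    return [remap[c] for c in codes], [uniques[i] for i in order]


def lexicographic(z):
    """Assumed ordering for complex values: real part, then imaginary."""
    return (z.real, z.imag)


values = [complex(2, 1), complex(0, 3), complex(2, 1), complex(0, -1)]
unsorted_codes, unsorted_uniques = factorize_sketch(values)
sorted_codes, sorted_uniques = factorize_sketch(
    values, sort=True, sort_key=lexicographic
)
```

Calling `factorize_sketch(values, sort=True)` without a key raises TypeError in pure Python, which is the well-defined failure one would hope for instead of a crash.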

jreback · 2016-03-19T18:16:59Z

So this is a numpy bug, pls open an issue there.

In [3]: x = np.array([ complex(i) for i in range(16) ], dtype=object)

In [4]: x
Out[4]: 
array([0j, (1+0j), (2+0j), (3+0j), (4+0j), (5+0j), (6+0j), (7+0j), (8+0j),
       (9+0j), (10+0j), (11+0j), (12+0j), (13+0j), (14+0j), (15+0j)], dtype=object)

# this is what I would expect
In [5]: x.argsort()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-3733f927a1b7> in <module>()
----> 1 x.argsort()

TypeError: no ordering relation is defined for complex numbers


# here's your coredump
In [7]: x = np.array([ complex(i) for i in range(17) ], dtype=object)

In [8]: x.argsort()
/Users/jreback/miniconda/bin/python.app: line 3: 59868 Segmentation fault: 11  (core dumped) /Users/jreback/miniconda/python.app/Contents/MacOS/python "$@"

We fall back to object dtype when we don't have a dedicated hashtable for a dtype (we have them for e.g. floats and ints, but not for complex). So to work around this in pandas we would have to do something like:

In [1]: x = np.array([ complex(i) for i in range(17) ], dtype=object)

In [2]: pandas.lib.infer_dtype(x)
Out[2]: 'mixed'

This should report 'complex' (but that is not supported at the moment). Then we could work around it.
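
The workaround idea can be sketched in plain Python: first detect that an object sequence holds only complex values, then choose a sort path with a well-defined ordering. This is a hedged illustration, not how `pandas.lib.infer_dtype` works; `infer_kind_sketch` and `argsort_sketch` are hypothetical names, and the lexicographic (real, imag) ordering is an assumption.

```python
def infer_kind_sketch(values):
    """Classify a sequence as 'empty', 'complex', or 'mixed'.

    Hypothetical stand-in for a dtype inference step that would
    recognise all-complex object data.
    """
    values = list(values)
    if not values:
        return "empty"
    if all(isinstance(v, complex) for v in values):
        return "complex"
    return "mixed"


def argsort_sketch(values):
    """Return indices that sort `values`, special-casing complex data."""
    values = list(values)
    if infer_kind_sketch(values) == "complex":
        # Assumed ordering: compare real parts, then imaginary parts.
        def key(i):
            return (values[i].real, values[i].imag)
    else:
        # Everything else relies on the values' own ordering.
        def key(i):
            return values[i]
    return sorted(range(len(values)), key=key)


x = [complex(i) for i in range(17)]
kind = infer_kind_sketch(x)
order = argsort_sketch(x[::-1])
```

With inference in place, the complex case never reaches a generic object comparison, so the sort neither raises nor depends on the behaviour of the object-array sort.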

ycopin · 2016-03-20T11:35:50Z

There is no more segfault, and things work fine, when using numpy master (now 1.12.0.dev0+41ec9fc) as well as numpy 1.11.rc2. The next release of numpy (1.11) should therefore fix the issue.

Please feel free to close the issue, and thanks for your support.

jreback · 2016-03-20T14:45:55Z

hmm, can you put in a test for this? We'll obviously need to skip it for numpy < 1.11, but we do test against numpy master.
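
A version-gated regression test along these lines could look like the sketch below. It is not the test that was eventually merged; the class name, the `version_tuple` helper, and the choice to accept either a result or a TypeError are all assumptions made for illustration. The test is skipped when NumPy is missing or older than 1.11.

```python
import re
import unittest

try:
    import numpy as np
except ImportError:  # keep the sketch importable without NumPy
    np = None


def version_tuple(version):
    """Extract (major, minor) from a version string such as '1.11.0rc2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


@unittest.skipIf(
    np is None or version_tuple(np.__version__) < (1, 11),
    "object-array argsort may segfault on numpy < 1.11",
)
class TestObjectArgsort(unittest.TestCase):
    def test_complex_object_argsort_does_not_crash(self):
        x = np.array([complex(i) for i in range(17)], dtype=object)
        try:
            x.argsort()
        except TypeError:
            # Complex numbers have no ordering; raising is acceptable.
            # The point is only that the interpreter survives.
            pass
```

The suite can be run with `python -m unittest` from the file that contains it; a skip is reported rather than a failure on older NumPy versions.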

ycopin · 2016-03-20T21:55:30Z

I'll do my best (I've never done that before). Should I write a test for DataFrame.pivot (pandas/pandas/tests/frame/test_reshape.py), or for the deeper Categorical.from_array or factorize (in pandas/pandas/tests/test_algos.py)?

jreback · 2016-03-20T21:57:24Z

I think test_algos is the most appropriate.

jreback added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Complex (Complex Numbers) labels Mar 18, 2016

jreback added this to the 0.18.1 milestone Mar 18, 2016

jreback added Difficulty Intermediate labels Mar 18, 2016

jreback changed the title ~~Segfault in DataFrame.pivot~~ BUG: Segfault in DataFrame.pivot with complex Mar 19, 2016

ycopin mentioned this issue Mar 19, 2016

Segfault in argsort of object array numpy/numpy#7435

Closed

jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016

jorisvandenbossche added the Testing (pandas testing functions or related to the test suite) label Aug 21, 2016

mralgos mentioned this issue Aug 28, 2016

Test for segfault in factorize (gh12666) #14112

Merged


jreback closed this as completed in #14112 Aug 31, 2016
