BUG: Segfault in DataFrame.pivot with complex #12666

Closed
ycopin opened this issue Mar 18, 2016 · 12 comments · Fixed by #14112
Labels
Bug · Complex (Complex Numbers) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Testing (pandas testing functions or related to the test suite)

@ycopin

ycopin commented Mar 18, 2016

With the attached CSV file test_csv.txt, the following code segfaults if the input dataframe has more than 16 entries:

import pandas as PD

df = PD.read_csv('test_csv.txt',
                 index_col=0,
                 converters={'xyinrad': complex,
                             'xyoutm': complex})

# Pivot works fine with 16 entries
df[:-1].info()
pivot = df[:-1].pivot('wave', 'xyinrad', 'xyoutm')
print("Still alive".center(50, '!'))

# Pivot segfaults with 17 entries
df.info()
print("Dave, my mind is going.".center(50, '.'))
pivot = df.pivot('wave', 'xyinrad', 'xyoutm')

Result

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 3 columns):
wave       16 non-null float64
xyinrad    16 non-null complex128
xyoutm     16 non-null complex128
dtypes: complex128(2), float64(1)
memory usage: 768.0 bytes
!!!!!!!!!!!!!!!!!!!Still alive!!!!!!!!!!!!!!!!!!!!
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 0 to 16
Data columns (total 3 columns):
wave       17 non-null float64
xyinrad    17 non-null complex128
xyoutm     17 non-null complex128
dtypes: complex128(2), float64(1)
memory usage: 816.0 bytes
.............Dave, my mind is going...............
Segmentation fault (core dumped)

Expected Output

  • No segfault!
  • Return the pivot table

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-38-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.4
pip: 8.1.1
setuptools: 20.3
Cython: 0.21.1
numpy: 1.10.4
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: None
dateutil: 2.5.0
pytz: 2016.1
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.4.5 (dt dec mx pq3 ext)
jinja2: 2.8
boto: None
@jreback
Contributor

jreback commented Mar 18, 2016

Can you show a reproducible, copy-pastable example, e.g. either include the data inline or create a sample frame? You are pivoting on complex numbers; do you mean to do that?

@ycopin
Author

ycopin commented Mar 18, 2016

Sorry, here is a copy-pastable example:

from io import StringIO
import pandas as PD

csv = StringIO(u"""
,wave,xyinrad,xyoutm
0,1.55e-06,(0.00523598775598+0.0200712863979j),(-0.06421774-0.06848953j)
1,1.55e-06,(0.00523598775598+0.0209439510239j),(-0.06422449-0.07932723j)
2,1.55e-06,(0.00610865238198+0.00872664625997j),(-0.07546985+0.07215781j)
3,1.55e-06,(0.00610865238198+0.00959931088597j),(-0.07540574+0.06128717j)
4,1.55e-06,(0.00610865238198+0.010471975512j),(-0.07534734+0.05043315j)
5,1.55e-06,(0.00610865238198+0.011344640138j),(-0.07529464+0.03959332j)
6,1.55e-06,(0.00610865238198+0.012217304764j),(-0.07524759+0.02876524j)
7,1.55e-06,(0.00610865238198+0.01308996939j),(-0.07520614+0.01794649j)
8,1.55e-06,(0.00610865238198+0.013962634016j),(-0.07517025+0.007134677j)
9,1.55e-06,(0.00610865238198+0.014835298642j),(-0.07513988-0.003672601j)
10,1.55e-06,(0.00610865238198+0.0157079632679j),(-0.07511502-0.01447775j)
11,1.55e-06,(0.00610865238198+0.0165806278939j),(-0.07509567-0.02528313j)
12,1.55e-06,(0.00610865238198+0.0174532925199j),(-0.0750818-0.03609117j)
13,1.55e-06,(0.00610865238198+0.0183259571459j),(-0.07507343-0.04690428j)
14,1.55e-06,(0.00610865238198+0.0191986217719j),(-0.07507053-0.05772485j)
15,1.55e-06,(0.00610865238198+0.0200712863979j),(-0.07507306-0.06855524j)
16,1.55e-06,(0.00610865238198+0.0209439510239j),(-0.07508092-0.07939773j)
""")

df = PD.read_csv(csv,
                 index_col=0,
                 converters={'xyinrad': complex,
                             'xyoutm': complex})

# Pivot works fine with 16 entries
df[:-1].info()
pivot = df[:-1].pivot('wave', 'xyinrad', 'xyoutm')
print("Still alive".center(50, '!'))

# Pivot segfaults with 17 entries
df.info()
print("Dave, my mind is going.".center(50, '.'))
pivot = df.pivot('wave', 'xyinrad', 'xyoutm')

Yes, I do want to pivot on complex numbers, which are coordinates. Using float-like values as index or columns may not sound very clever, but this was working fine until now.

@jreback
Contributor

jreback commented Mar 18, 2016

hmm, seems odd that that core dumps.

@jreback jreback added the Bug, Reshaping, and Complex labels Mar 18, 2016
@jreback jreback added this to the 0.18.1 milestone Mar 18, 2016
@ycopin
Author

ycopin commented Mar 18, 2016

For what it's worth, a valgrind run points to the sorting function:

==17290== Invalid read of size 8
==17290==    at 0x6AF4328: OBJECT_compare (arraytypes.c.src:2679)
==17290==    by 0x6BC9C15: npy_quicksort (quicksort.c.src:401)
==17290==    by 0x6B4CE34: _new_sortlike (item_selection.c:871)
==17290==    by 0x6B8FAB0: array_sort (methods.c:1159)

@jreback
Contributor

jreback commented Mar 18, 2016

hmm, easiest is simply to step thru this (until it borks).

@ycopin
Author

ycopin commented Mar 19, 2016

The segfault originates in uniques.argsort() of pandas.core.algorithms.factorize(df['xyinrad'], sort=True) (called from MultiIndex.from_arrays([df['wave'], df['xyinrad']]) and pandas.core.categorical.Categorical(df['xyinrad'], ordered=True)).

The issue surely comes from the complex array being converted to 'object', for lack of a complex case in pandas.core.algorithms._get_data_algo() (and the associated _hashtables mapping).

Could a complex128 case be added alongside int64 and float64? But this looks like major work...

@ycopin
Author

ycopin commented Mar 19, 2016

Here is the shortest segfault case:

from pandas.core.algorithms import factorize
n = 17
codes, categories = factorize([ complex(i) for i in range(n) ], sort=True)

This does not crash with n = 16, nor with sort=False.

This raises 2 naive questions:

  • in MultiIndex.from_arrays, why is Categorical.from_array(arr, ordered=True) always called with option ordered=True while arr might not be sortable (e.g. of type 'object')?
  • in Categorical initialization, why is factorize(values, sort=True) always called with option sort=True while values might not be sortable (e.g. of type 'object')?
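
To illustrate the distinction behind these two questions: assigning integer codes only needs hashing and equality, whereas sort=True additionally needs an ordering, which Python does not define for complex numbers. A minimal pure-Python sketch (the helper factorize_unsorted is hypothetical and is not the pandas implementation):

```python
def factorize_unsorted(values):
    """Assign integer codes in order of first appearance (hypothetical helper)."""
    positions = {}  # value -> code; relies only on hashing and equality
    codes = []
    uniques = []
    for value in values:
        if value not in positions:
            positions[value] = len(uniques)
            uniques.append(value)
        codes.append(positions[value])
    return codes, uniques


values = [complex(2), complex(0), complex(2), complex(1)]
codes, uniques = factorize_unsorted(values)
print(codes)    # [0, 1, 0, 2]
print(uniques)  # [(2+0j), 0j, (1+0j)]

# Sorting the uniques is the step that needs an ordering; plain Python refuses.
try:
    sorted(uniques)
except TypeError as exc:
    print("cannot sort:", exc)
```

This mirrors the observation above that sort=False does not crash: only the sorting step touches the comparison that is undefined for complex values.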

@jreback
Contributor

jreback commented Mar 19, 2016

So this is a numpy bug, pls open an issue there.

In [3]: x = np.array([ complex(i) for i in range(16) ], dtype=object)

In [4]: x
Out[4]: 
array([0j, (1+0j), (2+0j), (3+0j), (4+0j), (5+0j), (6+0j), (7+0j), (8+0j),
       (9+0j), (10+0j), (11+0j), (12+0j), (13+0j), (14+0j), (15+0j)], dtype=object)

# this is what I would expect
In [5]: x.argsort()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-3733f927a1b7> in <module>()
----> 1 x.argsort()

TypeError: no ordering relation is defined for complex numbers


# here's your coredump
In [7]: x = np.array([ complex(i) for i in range(17) ], dtype=object)

In [8]: x.argsort()
/Users/jreback/miniconda/bin/python.app: line 3: 59868 Segmentation fault: 11  (core dumped) /Users/jreback/miniconda/python.app/Contents/MacOS/python "$@"

We fall back to object dtype for types that have no dedicated hashtable (hashtables are only defined for e.g. floats and ints). So to work around this in pandas we would have to do something like:

In [1]: x = np.array([ complex(i) for i in range(17) ], dtype=object)

In [2]: pandas.lib.infer_dtype(x)
Out[2]: 'mixed'

This should report complex (but is not supported ATM). Then we could work around.
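
As a rough illustration of that workaround idea: once an object array is recognised as purely complex, it can be ordered by an explicit key. The sketch below is plain Python with hypothetical helpers (all_complex, argsort_complex), and it assumes the lexicographic order (real part first, then imaginary part) that NumPy uses for native complex arrays. It is not what pandas does internally.

```python
def all_complex(values):
    """Return True if every element is a Python complex (hypothetical check)."""
    return len(values) > 0 and all(isinstance(v, complex) for v in values)


def argsort_complex(values):
    """Indices that order values by (real, imag); assumed lexicographic order."""
    return sorted(range(len(values)),
                  key=lambda i: (values[i].real, values[i].imag))


data = [complex(1, 2), complex(0, 5), complex(1, -1)]
if all_complex(data):
    order = argsort_complex(data)
    print(order)                     # [1, 2, 0]
    print([data[i] for i in order])  # [5j, (1-1j), (1+2j)]
```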

@jreback jreback changed the title from "Segfault in DataFrame.pivot" to "BUG: Segfault in DataFrame.pivot with complex" Mar 19, 2016
@ycopin
Author

ycopin commented Mar 20, 2016

No more segfault and things are working fine when using master numpy (now 1.12.0.dev0+41ec9fc), as well as numpy 1.11.rc2. Next release of numpy (1.11) should therefore fix the issue.

Please feel free to close the issue, and thanks for your support.

@jreback
Contributor

jreback commented Mar 20, 2016

Hmm, can you put in a test for this? We'll obviously need to skip it for numpy < 1.11, but we do test against numpy master.

@ycopin
Author

ycopin commented Mar 20, 2016

I'll try to do my best (never did that). Should I write a test for DataFrame.pivot (pandas/pandas/tests/frame/test_reshape.py) or for deeper Categorical.from_array or factorize (in pandas/pandas/tests/test_algos.py)?

@jreback
Contributor

jreback commented Mar 20, 2016

I think the test_algos is most appropriate
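
A regression test along these lines might be gated on the NumPy version. The following is only a sketch using the standard library: parse_version is a hypothetical helper, the version string is hard-coded rather than read from numpy.__version__, and the test body is a pure-Python placeholder for the real factorize(..., sort=True) call.

```python
import re
import unittest


def parse_version(text):
    """Leading numeric components, e.g. '1.11.rc2' -> (1, 11) (hypothetical helper)."""
    numeric = re.match(r"\d+(?:\.\d+)*", text).group()
    return tuple(int(part) for part in numeric.split("."))


# Assumption: a real test would read this from numpy.__version__.
NUMPY_VERSION = "1.10.4"


class TestComplexSorting(unittest.TestCase):
    @unittest.skipIf(parse_version(NUMPY_VERSION) < (1, 11),
                     "sorting complex values held as object segfaults before numpy 1.11")
    def test_factorize_complex(self):
        # Placeholder for the actual call, e.g. factorize(values, sort=True)
        # on 17 complex values, which must not crash the interpreter.
        values = [complex(i) for i in range(17)]
        self.assertEqual(len(set(values)), 17)
```

With the hard-coded version above the test is reported as skipped; with a version string of 1.11 or later it would run.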

@jreback jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016
@jorisvandenbossche jorisvandenbossche added the Testing label Aug 21, 2016