pandas error kills IPython kernel #9205

Closed
michaelaye opened this issue Jan 6, 2015 · 20 comments
Labels: Bug, IO CSV (read_csv, to_csv)

@michaelaye
Contributor

Doing this simple thing can kill an IPython 2 notebook kernel (version: GH master):

import pandas as pd

url = 'http://nssdc.gsfc.nasa.gov/planetary/factsheet/index.html'
pd.read_table(url)

Note that I know this will fail; I just don't expect it to kill a notebook kernel.

Version:
pandas: 128ce85
IPython: 13facaf0206240a7301e045666143d68305d0119

@jreback
Contributor

jreback commented Jan 7, 2015

well, this probably segfaults. It has weird line breaks and such. It doesn't even look like a valid file to me.

@michaelaye
Contributor Author

Well. read_html(url) works just fine on it?

@michaelaye
Contributor Author

And I just tried pandas 0.14.1, and no such crash happens there.

@jreback
Contributor

jreback commented Jan 9, 2015

well, things seg fault if you feed them garbage input. Not sure that can be prevented in all cases. If you'd like to debug, feel free.

jreback added the IO CSV (read_csv, to_csv) label on Jan 9, 2015

@michaelaye
Contributor Author

What changed since 0.14.1, though? There was no segfault there.

@jreback
Contributor

jreback commented Jan 9, 2015

blank line and comment parsing
see the whatsnew

@michaelaye
Contributor Author

The question is: can the file really be garbage when pd.read_html() can read it without a problem?

@jreback
Contributor

jreback commented Jan 9, 2015

html and CSV are completely different

@kay1793

kay1793 commented Jan 9, 2015

> well, things seg fault if you feed them garbage input. Not sure that can be prevented in all cases.

Strongly disagree. Code shouldn't segfault, and if it does, then it's a bug that should get fixed.
The easiest way is to revert the bad commits and ask for a fix from the PR submitter.

@cpcloud
Member

cpcloud commented Jan 9, 2015

going from working to segfault and leaving it that way seems like a bad idea.

@michaelaye
if you pass in lineterminator='\r', this "works", but read_table is for parsing CSV, not HTML <table> elements.

In 0.14.1, were you getting a reasonable DataFrame out? All I get with the suggestion above is this:

In [19]: t = pd.read_table('http://nssdc.gsfc.nasa.gov/planetary/factsheet/index.html',
lineterminator='\r')

In [20]: t
Out[20]:
                                                <html>
0                                               <head>
1                  <title>Planetary Fact Sheet</title>
2                                              </head>
3                                <body bgcolor=FFFFFF>
4    <p>                                           ...
5                                                 <hr>
6               <H1>Planetary Fact Sheet - Metric</H1>
7                                                 <hr>
8                                                  <p>
9         <table border=2 cellspacing=1 cellpadding=4>
10                                                <tr>
11                   <td align=left><b>&nbsp;</b></td>
12     <td align=center bgcolor=F5F5F5><b>&nbsp;<a ...
13     <td align=center><b>&nbsp;<a href="venusfact...
14     <td align=center bgcolor=F5F5F5><b>&nbsp;<a ...
..                                                 ...
309  - Explanations of the values and headings in t...
310  <a href="/planetary/education/schoolyard_ss/">...
311  - Demonstration scale model of the solar syste...
312                                               </b>
313                                               <hr>
314  <img vspace=5 align=left alt="[NASA Logo]" src...
315                                          <address>
316  Author/Curator:<br>\nDr. David R. Williams, <a...
317                                         </address>
318                                    <br clear=left>
319                                               <hr>
320  <h6>NASA Official: Ed Grayzeck, edwin.j.grayze...
321              Last Updated: 25 April 2014, DRW</h6>
322                                            </body>
323                                            </html>

[324 rows x 1 columns]
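
The effect of the line terminator can be illustrated without pandas. The sketch below assumes, as the `lineterminator='\r'` workaround suggests, that the page separates its lines with bare carriage returns. The `sample` string is made up for illustration and is not the actual page content.

```python
# Hypothetical sample: HTML-like text whose lines end in a bare "\r" (no "\n").
sample = "<html>\r<head>\r<title>Planetary Fact Sheet</title>\r</head>\r"

# Splitting on "\n" only: the whole text is seen as one single line.
lf_lines = sample.split("\n")

# Treating "\r" as a line boundary (str.splitlines does this) recovers the rows.
cr_lines = sample.splitlines()

print(len(lf_lines))  # 1
print(cr_lines)       # ['<html>', '<head>', '<title>Planetary Fact Sheet</title>', '</head>']
```

A tokenizer that only looks for "\n" would therefore treat input like this as one very long line, which is consistent with the one-column, one-row-per-line result only appearing once `lineterminator='\r'` is passed.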

@cpcloud
Member

cpcloud commented Jan 9, 2015

@michaelaye as you suggested, read_html works and is the proper way to get this table into a useful DataFrame, rather than one that is simply the text split by lineterminator.

@cpcloud
Member

cpcloud commented Jan 9, 2015

In [14]: t = pd.read_html('http://nssdc.gsfc.nasa.gov/planetary/factsheet/index.html', header=0, index_col=0)[0].iloc[:-1]

In [15]: t
Out[15]:
                              MERCURY    VENUS  EARTH    MOON   MARS  \
Mass (1024kg)                   0.330     4.87   5.97   0.073  0.642
Diameter (km)                    4879    12104  12756    3475   6792
Density (kg/m3)                  5427     5243   5514    3340   3933
Gravity (m/s2)                    3.7      8.9    9.8     1.6    3.7
Escape Velocity (km/s)            4.3     10.4   11.2     2.4    5.0
Rotation Period (hours)        1407.6  -5832.5   23.9   655.7   24.6
Length of Day (hours)          4222.6   2802.0   24.0   708.7   24.7
Distance from Sun (106 km)       57.9    108.2  149.6  0.384*  227.9
Perihelion (106 km)              46.0    107.5  147.1  0.363*  206.6
Aphelion (106 km)                69.8    108.9  152.1  0.406*  249.2
Orbital Period (days)            88.0    224.7  365.2    27.3  687.0
Orbital Velocity (km/s)          47.4     35.0   29.8     1.0   24.1
Orbital Inclination (degrees)     7.0      3.4    0.0     5.1    1.9
Orbital Eccentricity            0.205    0.007  0.017   0.055  0.094
Axial Tilt (degrees)             0.01    177.4   23.4     6.7   25.2
Mean Temperature (C)              167      464     15     -20    -65
Surface Pressure (bars)             0       92      1       0   0.01
Number of Moons                     0        0      1       0      2
Ring System?                       No       No     No      No     No
Global Magnetic Field?            Yes       No    Yes      No     No

                                JUPITER    SATURN    URANUS   NEPTUNE    PLUTO
Mass (1024kg)                      1898       568      86.8       102   0.0131
Diameter (km)                    142984    120536     51118     49528     2390
Density (kg/m3)                    1326       687      1271      1638     1830
Gravity (m/s2)                     23.1       9.0       8.7      11.0      0.6
Escape Velocity (km/s)             59.5      35.5      21.3      23.5      1.1
Rotation Period (hours)             9.9      10.7     -17.2      16.1   -153.3
Length of Day (hours)               9.9      10.7      17.2      16.1    153.3
Distance from Sun (106 km)        778.6    1433.5    2872.5    4495.1   5870.0
Perihelion (106 km)               740.5    1352.6    2741.3    4444.5   4435.0
Aphelion (106 km)                 816.6    1514.5    3003.6    4545.7   7304.3
Orbital Period (days)              4331     10747     30589     59800    90588
Orbital Velocity (km/s)            13.1       9.7       6.8       5.4      4.7
Orbital Inclination (degrees)       1.3       2.5       0.8       1.8     17.2
Orbital Eccentricity              0.049     0.057     0.046     0.011    0.244
Axial Tilt (degrees)                3.1      26.7      97.8      28.3    122.5
Mean Temperature (C)               -110      -140      -195      -200     -225
Surface Pressure (bars)        Unknown*  Unknown*  Unknown*  Unknown*        0
Number of Moons                      67        62        27        14        5
Ring System?                        Yes       Yes       Yes       Yes       No
Global Magnetic Field?              Yes       Yes       Yes       Yes  Unknown

@jreback
Contributor

jreback commented Jan 9, 2015

Can one of you guys take a look? It's possible that these PRs changed this.

cc @mdmueller #7470
cc @selasley #8752

@michaelaye
Contributor Author

I'm well aware that a CSV table is not the same as an HTML table, and I never asked for read_table to work on this file, as pointed out in my issue. All I'm worried about is that a pandas segfault caused by user error (calling the wrong function on the wrong data) is extremely disruptive and did not happen in 0.14.1. Can a segfault even be caught with a try/except? (I can't try it out, as I'm on holiday.)

@cpcloud
Member

cpcloud commented Jan 9, 2015

segfaults can't be caught in Python 2, but some signals that segfaults may generate can in Python 3

https://docs.python.org/3/library/faulthandler.html
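
A minimal sketch of what faulthandler offers, assuming Python 3.5+: it does not make a crash catchable with try/except, it only dumps the Python traceback when a fatal signal arrives. To keep the calling process alive, the example runs the crashing code in a separate interpreter. That isolation step is an addition for illustration, not something proposed in this thread. `CHILD_CODE` and `run_isolated` are hypothetical names, and `ctypes.string_at(0)` is a deliberate crash used as a stand-in for a buggy C parser, not pandas code.

```python
import subprocess
import sys

# Child program: enable faulthandler, then deliberately read from a NULL
# pointer through ctypes to provoke a crash (stand-in for a buggy C extension).
CHILD_CODE = """
import ctypes
import faulthandler
faulthandler.enable()   # dump the Python traceback on fatal signals
ctypes.string_at(0)     # read from address 0 -> crash
"""


def run_isolated(code):
    """Run `code` in a separate interpreter so a crash cannot kill the caller."""
    return subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


result = run_isolated(CHILD_CODE)
crashed = result.returncode != 0

print("child crashed:", crashed)
print(result.stderr)    # traceback written by faulthandler or the interpreter
```

The parent only sees a non-zero return code and whatever the child wrote to stderr, so a notebook kernel calling `run_isolated` would survive the crash.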

@selasley
Contributor

The first commit that has the segfault is 31c2558. It looks like the problem is in the call to tokenize_nrows in _tokenize_rows

File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas-0.14.1_473_g31c2558-py3.4-macosx-10.6-intel.egg/pandas/io/parsers.py", line 1150, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 754, in pandas.parser.TextReader.read (pandas/parser.c:7383)
File "pandas/parser.pyx", line 776, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7623)
File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._read_rows (pandas/parser.c:8245)
File "pandas/parser.pyx", line 816, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8119)
File "pandas/parser.pyx", line 1728, in pandas.parser.raise_parser_error (pandas/parser.c:20349)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 538976288 fields in line 8410, saw 538976289

538976288 is 20202020 hex, which makes me think that spaces are being misinterpreted somewhere, but perhaps only in the error message. The parser seems to think there are 524287 rows in the file, maybe because of the CR line endings. I'll dig some more this weekend.
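
The arithmetic behind that observation can be checked quickly. This is only a sketch of the number decoding; it says nothing about where in the parser the value comes from. The variable names are chosen for illustration.

```python
import struct

fields = 538976288            # "expected fields" value from the error message

print(hex(fields))            # 0x20202020

# Reinterpret the 32-bit integer as its four raw bytes.
raw = struct.pack("<I", fields)
print(raw)                    # b'    ' (four ASCII spaces, 0x20 each)

rows = 524287                 # row count the parser reported
print(rows == 2**19 - 1)      # True
```

Since all four bytes are 0x20, the result is the same for either byte order.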

@kay1793

kay1793 commented Jan 10, 2015

> The first commit that has the segfault is 31c2558.

@selasley, nice. I wish it hadn't been squashed down to a single commit, though. That would have made it easier to pin down which change broke things.

> segfaults can't be caught in python 2, but some signals that segfaults may generate

Wouldn't it be better to just fix the Cython/C code that's segfaulting?

@cpcloud
Member

cpcloud commented Jan 10, 2015

> wouldn't it be better to just fix the cython/c code that's segfaulting?

oh absolutely! @michaelaye wondered if you can "catch" signals generated by segfaults and I just wanted to point out that you can in Python 3 but not in Python 2 (that I know of). I wasn't suggesting that we should catch them.

@selasley
Contributor

Pull request #9360 fixes the buffer overflows that caused the interpreter to crash with this input file and a few others.

@jreback
Contributor

jreback commented Feb 5, 2015

closed by #9360

jreback closed this as completed on Feb 5, 2015

5 participants