read_csv treats zeroes as nan if column contains any nan #2599

sanand0 · 2012-12-26T04:47:38Z

If data.csv contains the following (column B has a zero in the first row, and is empty in the second)

A,B
0,0
0,

... pandas 0.10.0 incorrectly reads it as:

In [7]: pd.read_csv('data.csv')
Out[9]:
   A   B
0  0 NaN
1  0 NaN

... whereas pandas 0.9.0 reads it right:

In [5]: pd.read_csv('data.csv')
Out[6]:
   A   B
0  0   0
1  0 NaN

wesm · 2012-12-27T00:52:51Z

That's upsetting. Not represented in the test suite obviously. Marked as a bug

wesm · 2012-12-28T14:11:41Z

I'm not able to reproduce on pandas 0.10:

In [3]: open('/home/wesm/tmp/foo5.csv').read()
Out[3]: 'A,B\n0,0\n0,'

In [4]: read_csv('/home/wesm/tmp/foo5.csv')
Out[4]: 
   A   B
0  0   0
1  0 NaN

sanand0 · 2012-12-28T15:04:15Z

I'm using a Windows 7 machine with NumPy 1.6.1 and Pandas 0.10. A NumPy version issue? Perhaps?

In [9]: open('d:/temp/killme.csv').read()
Out[9]: 'A,B\n0,0\n0,\n'

In [10]: pd.read_csv('d:/temp/killme.csv')
Out[10]:
   A   B
0  0 NaN
1  0 NaN

In [11]: pd.__version__
Out[11]: '0.10.0'

In [12]: numpy.__version__
Out[12]: '1.6.1'

In [16]: platform.uname()
Out[16]:
('Windows',
 'Obsidian',
 '7',
 '6.1.7601',
 'AMD64',
 'Intel64 Family 6 Model 42 Stepping 7, GenuineIntel')

wesm · 2012-12-28T23:04:22Z

That's super strange. I'll add a unit test and investigate on my windows 7 box

dieterv77 · 2013-01-03T13:55:44Z

I was able to reproduce this with pandas master on 64bit Windows and 32bit Linux. I also found that this only happens if the values in the B column are actually integers. If I change the one zero to a float, then things work fine on both platforms.

dieterv77 · 2013-01-03T18:05:06Z

One additional comment: things work fine when using the python parsing engine instead of the C one.
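For anyone who wants to check this locally, here is a minimal sketch of the comparison. It assumes a pandas version whose read_csv accepts the engine argument ('c' or 'python'). CSV_TEXT and read_with are illustrative names; the text is the same as the data.csv in the original report.

```python
import io

import pandas as pd

# Same contents as the data.csv in the original report.
CSV_TEXT = "A,B\n0,0\n0,\n"


def read_with(engine):
    """Parse the sample CSV with the given parser engine ('c' or 'python')."""
    return pd.read_csv(io.StringIO(CSV_TEXT), engine=engine)


c_result = read_with("c")
py_result = read_with("python")

# On an affected build, the C engine shows NaN in column B for both rows.
print(c_result)
print(py_result)
print("engines agree:", c_result.equals(py_result))
```

If the two frames differ, the build is probably affected by this bug.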

wesm · 2013-01-03T18:23:01Z

dieter, if you're up to help me debug this, the relevant code is the _try_int64 function in pandas/src/parser.pyx. You should see what the values of result and na_count are at the end of that function

wesm · 2013-01-03T18:25:41Z

if not I'll have to try to reproduce it on my windows VM at home this weekend

dieterv77 · 2013-01-04T00:35:59Z

Thanks for the tip, Wes. The problem (on 32bit linux anyway) is that for some reason INT64_MIN == np.int64(0). So any values of 0 also get masked as NaNs during _maybe_upcast.

The result array that _try_int64 returns contains two zeros, and na_count is 1.
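The masking described above can be sketched in plain NumPy. This is not the code in parser.pyx. upcast_with_sentinel is a hypothetical stand-in that assumes missing integers are stored as a sentinel value and replaced with NaN when the column is upcast to float.

```python
import numpy as np


def upcast_with_sentinel(values, na_sentinel):
    """Convert an int64 array to float64, turning every sentinel entry into NaN."""
    out = values.astype(np.float64)
    out[values == na_sentinel] = np.nan
    return out


INT64_MIN = -2**63

# Column B from the report: a real 0, then a missing value stored as the sentinel.
good = np.array([0, INT64_MIN], dtype=np.int64)
print(upcast_with_sentinel(good, INT64_MIN))  # the real zero survives

# If the sentinel is effectively 0, the missing slot and the real zero
# are indistinguishable, so both end up as NaN.
bad = np.array([0, 0], dtype=np.int64)
print(upcast_with_sentinel(bad, 0))
```

The second call matches the observation here: two zeros in the result array, na_count of 1, and a column that comes out entirely NaN.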

dieterv77 · 2013-01-04T01:17:37Z

I got a bit further with this. Consider the following cython file:

cdef extern from "stdint.h":
    enum: INT64_MIN

cpdef f():
    cdef long long x = INT64_MIN
    print x
    print INT64_MIN

If you run this function on 32bit linux, you will get:

-9223372036854775808
0

If you look in the C file generated by cython, you can see that the "print INT64_MIN" line results in an attempt to cast INT64_MIN using PyInt_FromLong instead of PyInt_FromLongLong. I think the issue in parser.pyx is effectively the same.
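One plausible reading of the 0, sketched in plain Python: a C long is 32 bits wide on 32bit Linux and on 64bit Windows, and the low 32 bits of INT64_MIN are all zero. The helper truncate_to_signed is hypothetical. It assumes the usual two's-complement truncation, which C leaves implementation-defined.

```python
INT64_MIN = -2**63


def truncate_to_signed(value, bits):
    """Keep the low `bits` bits of value and reinterpret them as a signed integer."""
    mask = (1 << bits) - 1
    low = value & mask
    sign_bit = 1 << (bits - 1)
    return low - (1 << bits) if low & sign_bit else low


print(truncate_to_signed(INT64_MIN, 64))  # -9223372036854775808
print(truncate_to_signed(INT64_MIN, 32))  # 0
```

A 64-bit long keeps the value intact, which would explain why 64-bit Linux does not show the problem.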

dieterv77 · 2013-01-04T01:24:48Z

Would it make sense to use np.iinfo and np.finfo to figure out the dtype dependent na_values instead?
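A small sketch of what that lookup could look like, assuming only NumPy's np.iinfo and np.finfo. It is not necessarily how the eventual fix does it.

```python
import numpy as np

# Extreme values looked up per dtype at run time, without relying on C macros.
int64_na = np.iinfo(np.int64).min
print(int64_na)
print(int64_na == -2**63)

# The floating-point counterpart for float64.
print(np.finfo(np.float64).min)
```

Because the value comes from NumPy at run time, it does not pass through a C long conversion.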

dieterv77 · 2013-01-04T03:50:09Z

I added a pull request for this: #2635

wesm · 2013-01-05T21:22:20Z

Merged dieter's fix and this appears to be working now (I was able to reproduce the failure on windows 64-bit and it's fixed now)

wesm added a commit that referenced this issue Jan 5, 2013

TST: unit test to reproduce #2599 on some platforms

02b57a8

wesm closed this as completed Jan 5, 2013
