read_csv treats zeroes as nan if column contains any nan #2599

Closed
sanand0 opened this issue Dec 26, 2012 · 13 comments
Labels
Bug IO Data IO issues that don't fit into a more specific label

@sanand0

sanand0 commented Dec 26, 2012

If data.csv contains the following (column B has a zero in the first row and is empty in the second):

A,B
0,0
0,

... pandas 0.10.0 incorrectly reads it as:

In [7]: pd.read_csv('data.csv')
Out[9]:
   A   B
0  0 NaN
1  0 NaN

... whereas pandas 0.9.0 reads it right:

In [5]: pd.read_csv('data.csv')
Out[6]:
   A   B
0  0   0
1  0 NaN

@wesm
Member

wesm commented Dec 27, 2012

That's upsetting. Obviously not covered by the test suite. Marked as a bug.

@wesm
Member

wesm commented Dec 28, 2012

I'm not able to reproduce on pandas 0.10:

In [3]: open('/home/wesm/tmp/foo5.csv').read()
Out[3]: 'A,B\n0,0\n0,'

In [4]: read_csv('/home/wesm/tmp/foo5.csv')
Out[4]: 
   A   B
0  0   0
1  0 NaN

@sanand0
Author

sanand0 commented Dec 28, 2012

I'm using a Windows 7 machine with NumPy 1.6.1 and pandas 0.10. Perhaps a NumPy version issue?

In [9]: open('d:/temp/killme.csv').read()
Out[9]: 'A,B\n0,0\n0,\n'

In [10]: pd.read_csv('d:/temp/killme.csv')
Out[10]:
   A   B
0  0 NaN
1  0 NaN

In [11]: pd.__version__
Out[11]: '0.10.0'

In [12]: numpy.__version__
Out[12]: '1.6.1'

In [16]: platform.uname()
Out[16]:
('Windows',
 'Obsidian',
 '7',
 '6.1.7601',
 'AMD64',
 'Intel64 Family 6 Model 42 Stepping 7, GenuineIntel')

@wesm
Member

wesm commented Dec 28, 2012

That's super strange. I'll add a unit test and investigate on my Windows 7 box.

@dieterv77
Contributor

I was able to reproduce this with pandas master on 64-bit Windows and 32-bit Linux. I also found that this only happens if the values in the B column are actually integers. If I change the one zero to a float, then things work fine on both 64-bit Windows and 32-bit Linux.

@dieterv77
Contributor

One additional comment: things work fine when using the Python parsing engine instead of the C one.
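A quick way to compare the two engines on the reported input is sketched below. It assumes a pandas version where `read_csv` accepts the `engine` keyword with the values `"c"` and `"python"`; the helper name `read_with` is made up for illustration. On an affected build, the C engine would show NaN in the first row of column B, while the Python engine would show 0.

```python
import io

import pandas as pd

# Same content as the data.csv from the original report.
CSV_TEXT = "A,B\n0,0\n0,\n"


def read_with(engine):
    """Parse the sample CSV with the requested parser engine (hypothetical helper)."""
    return pd.read_csv(io.StringIO(CSV_TEXT), engine=engine)


frames = {engine: read_with(engine) for engine in ("c", "python")}

for engine, frame in frames.items():
    print("engine =", engine)
    print(frame)
    # A correct parse keeps the zero in row 0 and marks only row 1 as missing.
    print("zero preserved:", frame["B"].iloc[0] == 0)
    print("empty field is NaN:", pd.isna(frame["B"].iloc[1]))
```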

@wesm
Member

wesm commented Jan 3, 2013

Dieter, if you're up for helping me debug this, the relevant code is the _try_int64 function in pandas/src/parser.pyx. Check what the values of result and na_count are at the end of that function.

@wesm
Member

wesm commented Jan 3, 2013

If not, I'll have to try to reproduce it on my Windows VM at home this weekend.

@dieterv77
Contributor

Thanks for the tip, Wes. The problem (on 32-bit Linux anyway) is that for some reason INT64_MIN == np.int64(0). So any values of 0 also get masked as NaNs during the _maybe_upcast.

The result array that _try_int64 returns contains two zeros, and na_count is 1.
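The effect can be sketched in plain Python. This is not the code in parser.pyx; it only assumes, as described above, that missing integers are written as a sentinel value and later masked by comparing against that sentinel. The names `parse_int_column` and `upcast_with_mask` are hypothetical.

```python
import math

INT64_MIN = -2**63  # the intended sentinel for a missing integer


def parse_int_column(fields, na_sentinel):
    """Parse text fields to ints, writing na_sentinel for empty fields."""
    result, na_count = [], 0
    for field in fields:
        if field == "":
            result.append(na_sentinel)
            na_count += 1
        else:
            result.append(int(field))
    return result, na_count


def upcast_with_mask(values, na_sentinel):
    """Convert to floats, replacing every value equal to na_sentinel with NaN."""
    return [math.nan if v == na_sentinel else float(v) for v in values]


column_b = ["0", ""]  # column B from the report: a zero, then an empty field

# Intended behaviour: the sentinel is INT64_MIN, so only the empty field is masked.
good, good_na = parse_int_column(column_b, INT64_MIN)
print(good, good_na, upcast_with_mask(good, INT64_MIN))

# Reported behaviour: the sentinel collapses to 0, so the real zero is masked too.
bad, bad_na = parse_int_column(column_b, 0)
print(bad, bad_na, upcast_with_mask(bad, 0))
```

With the collapsed sentinel, the parsed column holds two zeros with an NA count of 1, which matches the values observed above.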

@dieterv77
Contributor

I got a bit further with this. Consider the following cython file:

cdef extern from "stdint.h":
    enum: INT64_MIN

cpdef f():
    cdef long long x = INT64_MIN  # stored in a 64-bit C variable first
    print x
    print INT64_MIN  # the enum constant is converted to a Python object directly

If you run this function on 32-bit Linux, you will get:
-9223372036854775808
0
If you look at the C file generated by Cython, you can see that the "print INT64_MIN" line results in INT64_MIN being converted with PyInt_FromLong instead of PyInt_FromLongLong. I think the issue in parser.pyx is effectively the same.
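The arithmetic behind that 0 can be checked without Cython. The sketch below assumes that a C long is 32 bits wide on the affected platforms (true of 32-bit Linux and 64-bit Windows) and that the conversion keeps only the low 32 bits; `truncate_to_signed` is a hypothetical helper written for this illustration.

```python
INT64_MIN = -2**63


def truncate_to_signed(value, bits):
    """Keep the low `bits` bits of value and reinterpret them as a signed integer."""
    low = value & ((1 << bits) - 1)
    # Values with the top bit set are negative in two's complement.
    if low >= 1 << (bits - 1):
        low -= 1 << bits
    return low


# Passed through a 64-bit long long, the value survives intact.
print(truncate_to_signed(INT64_MIN, 64))

# Passed through a 32-bit long, only the low 32 bits remain, and they are all zero.
print(truncate_to_signed(INT64_MIN, 32))
```

Only bit 63 is set in the bit pattern of INT64_MIN, so nothing of it is left after truncation to 32 bits.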

@dieterv77
Contributor

Would it make sense to use np.iinfo and np.finfo to figure out the dtype-dependent na_values instead?
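For reference, this is roughly what looking up the limits through NumPy would look like. It is only a sketch of the idea in the question, not the contents of the pull request; `integer_na_sentinel` is a hypothetical name, and using the dtype minimum as the sentinel is an assumption based on the INT64_MIN usage discussed above.

```python
import numpy as np


def integer_na_sentinel(dtype):
    """Return the smallest value of an integer dtype as a Python int (hypothetical helper)."""
    return int(np.iinfo(dtype).min)


# The limit comes from NumPy at runtime instead of a C enum constant,
# so it does not depend on how Cython converts INT64_MIN to a Python object.
print(integer_na_sentinel(np.int64))
print(integer_na_sentinel(np.int32))

# np.finfo exposes the corresponding limits for floating-point dtypes.
print(np.finfo(np.float64).min, np.finfo(np.float64).max)
```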

@dieterv77
Contributor

I added a pull request for this: #2635

@wesm
Member

wesm commented Jan 5, 2013

Merged Dieter's fix, and this appears to be working now. (I was able to reproduce the failure on 64-bit Windows, and it's fixed now.)

@wesm wesm closed this as completed Jan 5, 2013