
Odd behavior from read_csv with na_values set to non-string values #3611

Closed
rhstanton opened this issue May 15, 2013 · 34 comments · Fixed by #3615 or #3841
Labels
Bug · IO Data
Comments

@rhstanton
Contributor

read_csv behaves oddly when na_values is set to non-string values. Sometimes
it correctly replaces the assigned number with NaN, and sometimes it doesn't. Here are some examples. Note in particular the different behavior of the last two statements:

Create a test file:

df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})
df.to_csv('test2.csv', sep=' ', index=False)

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999])


     A       B
0  NaN     1.2
1    2  -999.0
2    3     4.5

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999.0])


      A    B
0  -999  1.2
1     2  NaN
2     3  4.5

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999.0,-999])


      A    B
0  -999  1.2
1     2  NaN
2     3  4.5

print read_csv('test2.csv', sep= ' ', header=0, na_values=[-999,-999.0])


     A       B
0  NaN     1.2
1    2  -999.0
2    3     4.5

@jreback
Contributor

jreback commented May 16, 2013

@rhstanton give a try with master, this should be fixed (via #3615)

@rhstanton
Contributor Author

Thanks! The behavior in the last two cases is now consistent.

However, using 999 does not register as NA when the input value is the float 999.0, and vice versa. I'm not really sure what should happen here, or even whether this should work at all instead of flagging an error (should it insist on string values?). If it is supposed to work with numeric rather than string values (which I'd be quite happy with), wouldn't it then make sense for (say) 999 to match both integer and float inputs equal to 999? After all, if I set i = 999, then both

i == 999

and

i == 999.0

return True.


@jreback
Contributor

jreback commented May 16, 2013

999 and 999.0 are different strings, which is how the matching is done

you can specify different values per column as well (pass a dict)
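A minimal sketch of the per-column form, reusing the columns from the example above (the sentinel values are illustrative):

import pandas as pd

# a dict maps column name -> sentinels recognized only in that column
df = pd.read_csv('test2.csv', sep=' ', header=0,
                 na_values={'A': [-999], 'B': [-999.0]})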

@rhstanton
Contributor Author

‘999’ and ‘999.0’ are different strings, but in these examples we’re passing numbers, not strings, and

999 == 999.0

returns True. If it’s going to work at all passing numbers, I think my expectation would be for the results to be based on a numerical comparison, not a string comparison.

Thanks for the dict tip. That could be useful.


@jreback
Contributor

jreback commented May 16, 2013

keep in mind that this is an exact string match; these don't have dtypes at this point. that said, you have a valid point: you are proposing adding the float versions of ints (and vice versa) if specified in na_values. pls open an enhancement issue for this

note you can do this:

read in with no NA match for 999/999.0, then

df[df == 999] = np.nan

will work for ints and floats (you can of course restrict it to particular columns, as sketched below)
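A column-restricted version of that workaround (sketch; the column name 'A' is from the earlier example):

import numpy as np

# replace the sentinel only where it appears in column A
df.loc[df['A'] == 999, 'A'] = np.nan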

@fonnesbeck

This is a problem for me as well, despite having a very recent build of Pandas. In my dataset, 88 is used as a missing value. Hence, I used the following:

dataset = data_file.parse("Sheet1", index_col=0, na_values=[88])

however, this value was not picked up as being NaN:

dataset.diarrhea.value_counts()

0     282
1     197
88      8
dtype: int64

I have tried using a float and an integer, but neither worked. Also, specifying the columns for replacement with a dict does not work.

Here is the dataset that I am using.

Also, the solution you suggest above does not work if even a single column is of a different type:

dataset[dataset==88] = np.nan

ValueError: Cannot do boolean setting on mixed-type frame

So, it appears that the only solution is to manually check which columns contain the value, and replace them in a loop.
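A sketch of that manual fallback, replacing the sentinel column by column and skipping non-numeric columns to avoid the mixed-type error:

import numpy as np

for col in dataset.columns:
    if dataset[col].dtype.kind in 'if':   # numeric (int/float) columns only
        dataset[col] = dataset[col].replace(88, np.nan)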

@wesm
Member

wesm commented Jun 5, 2013

If you pass na_values=['88'] does it work?

@fonnesbeck

Unfortunately not.

@jreback
Contributor

jreback commented Jun 5, 2013

This was a bug (introduced when I fixed this originally)

also, slight API change: na_values will now take a value like 88 (as an int) and match '88', 88, and 88.0 (string/numeric dtypes), whatever the resulting dtype of the column

@fonnesbeck it seems natural to read this dataset with index_col=None

@fonnesbeck pls try this out if you can (via the PR)

In [5]: pd.read_excel('test.xlsx',"Sheet1", index_col=0, na_values=['88'])
Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Index: 490 entries, 1.0 to 1.0
Data columns (total 36 columns):
race         490  non-null values
ethn         484  non-null values
stcol        484  non-null values
rota         393  non-null values
blood        487  non-null values
saliva       484  non-null values
diarrhea     479  non-null values
qod          160  non-null values
abdpain      333  non-null values
vomit        483  non-null values
dos          478  non-null values
travel       16  non-null values
sickcon      231  non-null values
noc          486  non-null values
type         490  non-null values
abx          28  non-null values
dtp          337  non-null values
fever        472  non-null values
feverval     112  non-null values
pathogens    160  non-null values
futuse       488  non-null values
ecoli        486  non-null values
amc          490  non-null values
am           490  non-null values
cz           490  non-null values
fep          490  non-null values
caz          490  non-null values
cro          490  non-null values
cxm          490  non-null values
cip          490  non-null values
gm           490  non-null values
mem          490  non-null values
tzp          490  non-null values
stx          490  non-null values
case         490  non-null values
month        490  non-null values
dtypes: float64(36)

In [6]: pd.read_excel('test.xlsx',"Sheet1", index_col=0, na_values=['88.0']).head()
Out[6]: 
     race  ethn  stcol  rota  blood  saliva  diarrhea  qod  abdpain  vomit  dos  travel  sickcon  noc  type  abx  dtp  fever  feverval  pathogens  futuse  ecoli  amc  am  cz  fep  caz  cro  cxm  cip  gm  mem  tzp  stx  case  month
sex                                                                                                                                                                                                                                   
1       1     1      1     0      0       1         1  NaN        0      0    4     NaN      NaN    1     0  NaN    8      0       NaN        NaN       1      1    1   2   1    1    1    1    1    1   1    1    1    1     1     12
0       2     2      1     0      0       1         0  NaN      NaN      1    3     NaN        1    0     0  NaN  NaN      1       NaN        NaN       0      0    0   0   0    0    0    0    0    0   0    0    0    0     1     12
0       2     2      0   NaN      0       2         0  NaN        1      1    4     NaN      NaN    0     0  NaN  NaN      0       NaN        NaN       1      0    0   0   0    0    0    0    0    0   0    0    0    0     1      0
1       3     2      1     0      0       1         1    0      NaN      1    4     NaN        1    0     0  NaN  NaN      0       NaN        NaN       1      0    0   0   0    0    0    0    0    0   0    0    0    0     1     12
1       1     1      1     0      0       1         0  NaN        1      1    2     NaN        0    0     0  NaN    3      1       NaN        NaN       0      0    0   0   0    0    0    0    0    0   0    0    0    0     1     12

@fonnesbeck

Sorry, the original file included a patient ID in column 0, which I had to remove before sharing. I will try the fix.

@fonnesbeck

It is still failing for me, unfortunately:

In[11]: pd.read_excel("Ecoli1.xlsx", "Sheet1", na_values=[88]).travel.value_counts()

Out[11]:
88    471
0      14
1       2
dtype: int64

@jreback
Contributor

jreback commented Jun 5, 2013

did you try this PR? (it's not in master)

@jreback
Contributor

jreback commented Jun 5, 2013

I just merged it...try master now

@fonnesbeck

It sure helps when I build the right code. Sorry. Works for me now, yes.

@jreback
Contributor

jreback commented Jun 5, 2013

great!
thanks for catching these bugs!
reclosing

@jreback jreback closed this as completed Jun 5, 2013
@rhstanton
Contributor Author

I like the sound of this edit, but either it doesn't quite work or I'm misinterpreting what it's supposed to do. Here's an example, where I want -9988 to denote missing data, and where the data file, test.csv, contains both '-9988' and '-9988.0':


Field1,Field2,Field3
-9988,-9988.0,5.6


The command

read_csv('data/test.csv', header=0, sep=',', na_values=[-9988])

yields the output

   Field1  Field2  Field3
0     NaN   -9988     5.6

@jreback
Contributor

jreback commented Jun 10, 2013

Are you on latest master?

In [9]: pd.__version__
Out[9]: '0.11.1.dev-2b7efc3'

In [4]: path = 'test.csv'

In [5]: df = DataFrame({'A' : [-999, 2, 3], 'B' : [1.2, -999, 4.5]})

In [10]: df
Out[10]: 
     A      B
0 -999    1.2
1    2 -999.0
2    3    4.5

In [6]: df.to_csv(path, sep=' ', index=False)

In [7]: read_csv(path, sep= ' ', header=0, na_values=['-999.0','-999'])
Out[7]: 
    A    B
0 NaN  1.2
1   2  NaN
2   3  4.5

In [8]: read_csv(path, sep= ' ', header=0, na_values=[-999,-999.0])
Out[8]: 
    A    B
0 NaN  1.2
1   2  NaN
2   3  4.5

@rhstanton
Contributor Author

I downloaded the latest master this morning, so it should be up to date. I may just be misunderstanding how this is supposed to work. I only passed the integer version of the missing-data indicator (999 in your example), assuming this would also match the float value (999.0) in my csv file, but it doesn't seem to. How is it supposed to work?


@jreback
Contributor

jreback commented Jun 10, 2013

hmmm....I think this is not working like it should....

@jreback
Contributor

jreback commented Jun 10, 2013

@rhstanton if you would try this PR it would be appreciated

I believe I have fixed the issue, which basically was this: passing 999 (as an int) should have generated na_values of '999' and '999.0' (as strings), as well as 999 as an int and 999.0 as a float

and the same values should be generated whichever of those forms you pass

I think it's fixed now: passing any of [999, 999.0, '999', '999.0'] will give the same end result (which I think is right)
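To make the described expansion concrete, a hypothetical helper (not the actual pandas internals) that derives the string forms a sentinel should match:

def expand_na_value(v):
    # Return the string forms a sentinel should match, mirroring the
    # behavior described above (illustrative only, not pandas code).
    forms = {str(v)}
    try:
        f = float(v)
    except (TypeError, ValueError):
        return forms              # non-numeric: exact string match only
    forms.add(repr(f))            # e.g. '999.0'
    if f.is_integer():
        forms.add(str(int(f)))    # e.g. '999'
    return forms

# expand_na_value(999), expand_na_value(999.0), expand_na_value('999')
# and expand_na_value('999.0') all yield {'999', '999.0'}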

@rhstanton
Contributor Author

I'd be happy to do so, but how do you use a PR? I've only pulled the master repository in the past.


@jreback
Contributor

jreback commented Jun 10, 2013

very similar to how you use master:

git clone -b GH3611_2 https://github.com/jreback/pandas.git GH3611_2
cd GH3611_2
python setup.py build_ext --inplace

then test out from that dir (you can rename the final GH3611_2 if you want)

@rhstanton
Contributor Author

Thanks. It seems to work fine if the number in the csv file has only a single zero after the decimal point, but (as long as I really am using the right version now...) it fails if you add an extra zero (i.e., make it 999.00 in the csv file). In this case, even passing 999.00 fails. The only way to get this one to work is to explicitly include '999.00'.
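For the record, the explicit workaround is just to list that form too (sketch):

df = pd.read_csv('test.csv', header=0, na_values=[999, '999.00'])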

@jreback
Contributor

jreback commented Jun 10, 2013

yep....not exactly sure what to do about that; you would have to specify it exactly in that case; however a float will normally be written as '999.0', so we cover that case (or you can use the integer form, 999)

@jreback
Contributor

jreback commented Jun 10, 2013

the basic issue is that we are doing string matching because the column hasn't been converted yet (it's still a string); e.g. you could have, say, the string NaN or whatever embedded, so we can't convert it yet. the other way to deal with this is to NOT specify na_values, then do something like df[df == 999.0] = np.nan, which compares after the float conversion
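A minimal sketch of that alternative, assuming every column is numeric (otherwise see the mixed-type caveat above):

import numpy as np
import pandas as pd

df = pd.read_csv('test.csv', header=0)   # no na_values given
df[df == 999.0] = np.nan                 # numeric comparison after dtypes are set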

@jreback
Contributor

jreback commented Jun 10, 2013

@wesm any thoughts here?

@rhstanton
Contributor Author

Yup. Would it be a pain to do the matching after conversion?


@rhstanton
Contributor Author

What about using regular expressions in the string matching (maybe this is already done...)? That would allow me to specify (say) 999 followed by an arbitrary number of zeros.
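na_values itself doesn't take regular expressions, but the idea can be sketched post-read (a hypothetical approach using a later pandas API: read everything as strings, regex-replace the sentinel, then convert):

import numpy as np
import pandas as pd

raw = pd.read_csv('test.csv', header=0, dtype=str)
raw = raw.replace(to_replace=r'^-?999(\.0*)?$', value=np.nan, regex=True)
df = raw.apply(pd.to_numeric)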


@jreback
Contributor

jreback commented Jun 11, 2013

these are hash-set matches done via C code
I think I know a way

@jreback
Contributor

jreback commented Jun 11, 2013

@rhstanton alright....update that branch

git pull...build again and try

have to test for floats in a slightly different way, but now passing any of 999, 999.0, '999.0', '999' should match strings/floats/ints that are 999, which I think is what makes sense

@rhstanton
Contributor Author

That all now seems to work as expected.

Just one thought, though. Shouldn't we treat string and numeric values differently? If I pass the number 999, it makes absolute sense for it to match anything that Python would regard (after parsing) as being equal to that number, including 999, 999.0, 999.00, etc. However, if I pass a string, doesn't it make more sense for the match to be exact? So if I pass (string) '999', this should match against '999', but not against '999.0', etc. After all, if I set na_values = 'ABC', it would be very counterintuitive for this to match against 'ABC.000'

From: jreback <notifications@github.commailto:notifications@github.com>
Reply-To: pydata/pandas <reply@reply.github.commailto:reply@reply.github.com>
Date: Tuesday, June 11, 2013 6:41 AM
To: pydata/pandas <pandas@noreply.github.commailto:pandas@noreply.github.com>
Cc: Richard Stanton <stanton@haas.berkeley.edumailto:stanton@haas.berkeley.edu>
Subject: Re: [pandas] Odd behavior from read_csv with na_values set to non-string values (#3611)

@rhstantonhttps://github.com/rhstanton alrgiht....update that branch

git pull...build again and try

have to test for floats in a slightly different way, but
now passing any of 999,999.0,'999.0','999' should match string/float/ints that are 999, which
I think is what 'makes' sense


Reply to this email directly or view it on GitHubhttps://github.com//issues/3611#issuecomment-19262144.

@jreback
Contributor

jreback commented Jun 11, 2013

this will only happen for floats, so 'ABC.000' would not be a matching value

if you can construct a failing case that would be great, but I think passing [999, 999.0, '999', '999.0'] should be treated identically

@rhstanton
Contributor Author

I guessed that 'ABC.000' would not match! I really just meant to use it to point out that when people match strings, they usually really want to match the string. On the other hand, especially given the history of using strings for na_values, it is probably true that when people pass '999.0' as a string here, they're thinking of it as a number, so I'm quite happy to leave this where it is. Thanks!

@jreback
Contributor

jreback commented Jun 11, 2013

ok...great...merging this in
