Skip to content
This repository

More consistent na_values handling in read_csv #1657

Closed
BrenBarn opened this Issue July 21, 2012 · 9 comments

4 participants

BrenBarn Chang She kayvonr jreback
BrenBarn

The current handling of the na_values argument to read_csv is strangely different depending on what kind of value you pass to na_values. If you pass None, the default NA values are used. If you pass a dict mapping column names to values, then those values will be used for those columns, totally overriding the default NA values, while for columns not in the dict, the default values will be used. If you pass some other kind of iterable, it uses the union of the passed values and the default values as the NA values.

This behavior is confusing because sometimes the passed values override the defaults, but other times they just add to the defaults. It's also contrary to the documentation at http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files, which says: "If you pass an empty list or an empty list for a particular column, no values (including empty strings) will be considered NA." But passing an empty list doesn't result in no values being considered NA. In fact, passing an empty list does nothing, since the empty list is unioned with the default NA values, so the default NA values are just used anyway.

Currently there is no easy way to pass a list of NA values which overrides the default for all columns. You can pass a dict, but then you have to specify the defaults per column. If you pass a list, you're not overriding the defaults, you're adding to them. This makes for confusing behavior when reading CSV files with string data in which strings like "na" and "nan" are valid data and should be read as their literal string values.

There should be a way to pass an all-column set of NA values that overrides the defaults. One possibility would be to have two arguments, something like all_na_values and more_na_values, to specify overriding and additional values, respectively. Another possibility would be to expose the default (currently the module-level _NA_VALUES in parsers.py), and allow users to add to it it they want to add more NA values (e.g., read_csv(na_values=set(['newNA']) | pandas.default_nas).

Chang She
Collaborator

I agree that it's too confusing/inconsistent as it stands. I added custom NA handling for a previous issue but I should have revisited the whole NA specifying API for read_csv. The fix will be part of 0.8.2 coming out in a few weeks.

Thanks for the bug report!

Chang She
Collaborator

I added a keep_default_na keyword

Chang She changhiskhan closed this August 19, 2012
kayvonr

Hey there - I believe this issue is still not completely fixed. Reading in a csv with countries and using both keep_default_na=False and na_values='' still results in "Namibia", country code "NA", being read in as a float NaN

In [56]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values='')

In [57]: data.ix[160:165]
Out[57]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [61]: data.ix[163, 'geo_country']
Out[61]: nan

In [62]: type(data.ix[163, 'geo_country'])
Out[62]: float
jreback
Collaborator

@kayvonr what version of pandas?

jreback
Collaborator

http://pandas.pydata.org/pandas-docs/dev/io.html#na-values

na_values has to be a list, so try with: na_values=['']

kayvonr

pandas 0.12.0

I originally tried it with a list argument first and that presents two different problems:

In [7]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=[' '])

In [8]: data.ix[160:165]
Out[8]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

In [9]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=[''])

In [10]: data.ix[160:165]
Out[10]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163         NaN        Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

An argument with any non-empty string causes the "NA" to be read in as an empty string.

In [13]: data.ix[163, 'geo_country']
Out[13]: ''

Double checked it wasn't just the space with

In [11]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values=['foo'])

In [12]: data.ix[160:165]
Out[12]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF

and an empty string results in the "NA" string being again turned into a NaN

kayvonr

Issue also present with a dictionary argument to na_values

In [14]: data = pd.read_csv('iso_country_continent.csv', keep_default_na=False,  na_values={"geo_country" : "foo"})

In [15]: data.ix[163, 'geo_country']
Out[15]: ''

In [16]: data.ix[160:165]
Out[16]:
     Unnamed: 0 geo_country        country continent
160         160          MX         Mexico       N_A
161         161          MY       Malaysia        AS
162         162          MZ     Mozambique        AF
163         163                    Namibia        AF
164         164          NC  New Caledonia        OC
165         165          NE          Niger        AF
jreback
Collaborator

this was fixed by #4374, it in 0.13

kayvonr

ah ok, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.