read_csv() force numeric dtype on column and set unparseable entry as missing (NaN) #2570

dragoljub opened this issue on Dec 19, 2012

@dragoljub

How can I force a dtype on a column and ensure that any unparseable entries are filled in as NaN? This is important when there are unpredictable data-entry errors in CSVs or database streams that cannot be mapped to missing values a priori.

E.g.: below I want column 'a' to be parsed as np.float, but the erroneous 'Dog' entry causes an exception. Is there a way to tell read_csv() to force parsing column 'a' as np.float and fill all unparseable entries with NaN?

import numpy as np
import pandas as pd
import StringIO  # Python 2; on Python 3 this would be io.StringIO

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})  # raises on 'Dog'
df.dtypes

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-cd8b6f868aec> in <module>()
      1 data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
----> 2 df = pd.read_csv(StringIO.StringIO(data), dtype={'a': np.float})
      3 df.dtypes

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
    389                     buffer_lines=buffer_lines)
    390 
--> 391         return _read(filepath_or_buffer, kwds)
    392 
    393     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    205         return parser
    206 
--> 207     return parser.read()
    208 
    209 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    622             #     self._engine.set_error_bad_lines(False)
    623 
--> 624         ret = self._engine.read(nrows)
    625 
    626         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    943 
    944         try:
--> 945             data = self._reader.read(nrows)
    946         except StopIteration:
    947             if nrows is None:

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5785)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6002)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6870)()

C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7919)()

AttributeError: 'NoneType' object has no attribute 'dtype'
@wesm

This would be a nice enhancement sometime. I don't know when I or someone else will get to it, though.

@jreback

Same as #2779?

@dragoljub

Yes this is the same enhancement request.

@jreback

ok... why don't you close the other one, then?

@dragoljub

Comments from closed issue #2779:

I have data for which I know what the data type should be for each column, e.g. float. Occasionally these columns will also contain spurious non-float string values due to erroneous data processing/pivoting upstream. I would like read_csv to help me ignore this (unpredictable) spurious data by parsing it as np.nan (missing values). This rules out the read_csv(na_values=...) parameter, because I can't predict the bad values beforehand.

I wanted to use the converter functions that read_csv supports to return an np.float when the value can be parsed, but this method is prohibitively slow: 4-10x slower in some examples I have tried.

I would love read_csv's quick parser to have a "force_dtype=True" option where, no matter what string is in the column, the specified dtype is always returned, with np.nan indicating an impossible parse and therefore a missing value.

import numpy as np
import pandas as pd
import StringIO

data = 'a,b,c\n1.1,2.2,3.3\n2.2,garb,4.4\n3.3,4.4,5.5'
data = data+data[5:]*3000

def converter(num):
    try:
        return np.float(num)
    except ValueError:  # unparseable entry -> missing value
        return np.nan

%time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
df.dtypes

%time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b':converter})
df2.dtypes

%time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
df3.dtypes

In [7]: %time df = pd.read_csv(StringIO.StringIO((data+data[5:]*30)))
CPU times: user 0.12 s, sys: 0.00 s, total: 0.12 s
Wall time: 0.12 s  <-- quick native CSV parser

In [8]: df.dtypes  <-- column 'b' comes back as mixed float and string, with no way to convert to a single type
Out[8]:
a    float64
b     object
c    float64

In [9]: %time df2 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), converters={'b': converter})
CPU times: user 0.42 s, sys: 0.00 s, total: 0.42 s
Wall time: 0.42 s  <-- the converter function slows parsing down by ~4x (and that is with a converter on only one column)

In [10]: df2.dtypes  <-- this time only np.float types are returned, with np.nan in place of any spurious values in column 'b'
Out[10]:
a    float64
b    float64
c    float64

In [11]: %time df3 = pd.read_csv(StringIO.StringIO((data+data[5:]*30)), dtype=np.float)
AttributeError: 'NoneType' object has no attribute 'dtype'  <-- trying to force np.float on the columns raises an exception on the unparseable entries in column 'b' :(
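[Editor's note: a workaround sketch that keeps the fast C parser, based on the convert_objects call shown later in this thread rather than any official read_csv option: parse everything with the default engine, then coerce the offending column in one vectorized pass afterwards.]

import pandas as pd
import StringIO

data = 'a,b,c\n1.1,2.2,3.3\n2.2,garb,4.4\n3.3,4.4,5.5'

# Fast native parse first; column 'b' comes back as object dtype.
df = pd.read_csv(StringIO.StringIO(data))

# Coerce afterwards: unparseable entries such as 'garb' become NaN.
# (Later pandas versions spell this pd.to_numeric(df['b'], errors='coerce').)
df['b'] = df['b'].convert_objects(convert_numeric='force')

print df.dtypes  # 'b' is now float64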
@Zelazny7

Just wanted to add my 2 cents that the feature requested by @dragoljub is what SAS does by default when creating a data set. I have a similar desire for this functionality. In SAS, if I specify a column as numeric on import and a character field is parsed, it will return a NULL value.

@wesm

I'll have a look at this (unless someone beats me to it), because it's a fairly straightforward addition and makes sense when data types are explicitly specified.

@jreback

@dragoljub quite straightforward after reading in; I guess this is a request to push this down into read_csv (de facto when you specify a dtype):

In [5]: df = read_csv(StringIO(data))

In [6]: df
Out[6]: 
     a  b    c
0  1.1  2  3.0
1  Dog  5  6.0
2  7.7  8  9.5

In [8]: df['a'].convert_objects(convert_numeric='force')
Out[8]: 
0    1.1
1    NaN
2    7.7
dtype: float64
@jreback modified the milestones (Someday, 0.14.0) on Feb 18, 2014
@jreback

@dragoljub maybe best to add this as a cookbook entry / or in docs (see my example at the end)

@dragoljub

This would be a good lesson to have in the cookbook. BTW, does convert_objects() work on DataFrames too?

I guess some explicit pandas dtype post-processing is not so bad, given how fast the CSV parser reads all the data. ✌️

@jreback

yep...convert_objects works on all NDFrames

want to do a quick PR for the cookbook?
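[Editor's note: a quick illustration, not from the thread. Applied to a whole DataFrame, convert_objects only touches object-dtype columns, so already-numeric columns pass through unchanged.]

import pandas as pd
import StringIO

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = pd.read_csv(StringIO.StringIO(data))

# Coerces every object column in one call: 'Dog' in column 'a' becomes NaN,
# while the already-numeric 'b' and 'c' pass through unchanged.
df = df.convert_objects(convert_numeric='force')
print df.dtypes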

@lminer

If dtype is specified ahead of time, it would be nice if conversion were forced for date types as well. I'd be willing to give this a shot...

@jreback

@lminer that would be excellent!

ideally implement it for both the C and Python parsers. lmk if you need help!
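[Editor's note: until something like that lands, a post-parse pass can already coerce dates the same way. A sketch; pd.to_datetime's coerce keyword was the spelling in pandas of this era, later replaced by errors='coerce'.]

import pandas as pd
import StringIO

data = 'ts,val\n2014-01-01,1\nnot-a-date,2\n2014-02-01,3'
df = pd.read_csv(StringIO.StringIO(data))

# Unparseable dates become NaT, the datetime analogue of NaN.
df['ts'] = pd.to_datetime(df['ts'], coerce=True)
print df.dtypes  # ts: datetime64[ns], val: int64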

@adrivsh

Apparently there is already code that does this:

df.convert_objects(convert_numeric=True)

So it is only a matter of calling convert_objects when reading, right?
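[Editor's note: roughly, yes. A minimal wrapper sketch; the name read_csv_coerce and the coerce_cols parameter are hypothetical, not a pandas API.]

import pandas as pd
import StringIO

def read_csv_coerce(buf, coerce_cols=None, **kwds):
    # Parse with the fast engine first, then force the listed columns
    # to numeric, turning unparseable entries into NaN.
    df = pd.read_csv(buf, **kwds)
    for col in (coerce_cols or []):
        df[col] = df[col].convert_objects(convert_numeric='force')
    return df

data = 'a,b,c\n1.1,2,3\nDog,5,6\n7.7,8,9.5'
df = read_csv_coerce(StringIO.StringIO(data), coerce_cols=['a'])
print df.dtypes  # 'a' is float64; the 'Dog' entry became NaN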
