ENH: sql support for NaN/NaT conversions #2754

Closed
jreback opened this Issue Jan 25, 2013 · 13 comments

@jreback

UPDATE from @jorisvandenbossche:

Overview of the current status of writing NaN values (see also tests added in #7100):

  • For MySQL using pymysql/MySQLdb nothing works: you get the message Unknown column 'nan' in 'field list' (see also e.g. http://stackoverflow.com/questions/23353732/python-pandas-write-to-sql-with-nan-values)
  • Numeric columns:
    • working for sqlite and postgres
    • Only full NaN columns stay None in sqlite
  • Object columns (e.g. strings)
    • for postgresql: NaN is converted to the string u'NaN', which is not really what we want
    • for sqlite it is returned as None
  • NaT:
    • postgresql: gives error on inserting "0001-255-255T00:00:00"
    • sqlite3: writing works, but reading it with query returns '-001--1--1 -1:-1:-1.-00001'
  • MSSQL: not working with message "The supplied value is not a valid instance of data type float", see #8088 for more details

not sure exactly what SQL expects (None values?) rather than np.nan (or NaT); see the sketch at the end of this comment

https://groups.google.com/forum/?fromgroups#!topic/pydata/lxhnFtuzvWQ

also provide pandas datetime64[ns] instead of datetime/date types (#3532)
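For reference, a minimal sketch of the None substitution being discussed, assuming a DB-API driver that maps Python None to SQL NULL (the frame and the INSERT statement are just illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': ['x', None, 'z'],
                   'c': pd.to_datetime(['2013-01-01', None, '2013-01-03'])})

# replace NaN/NaT with None so the driver writes SQL NULL
rows = [tuple(None if pd.isnull(v) else v for v in row)
        for row in df.itertuples(index=False)]

# rows[1] == (None, None, None); these tuples can be handed to
# cursor.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)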

@danielballan

Yes, I believe it expects Nones. I will post some code this weekend.

@danielballan

I may be in over my head here, actually, because performance is crucial. Would it be helpful to consider #2717?

@jreback

this is actually tricky, you might need to do some Cython to get this fast...

e.g. in pytables I had to pass a numpy rec-array because that is how it likes it

how are you passing the data: as an ndarray of dtype=object, a rec-array, or a list?

@jreback

try looking at this:

AppendableTable.write_data in pandas/io/pytables.py (write a test table and break there)

it uses this to write what is basically a numpy rec-array:

lib.pyx/create_hdf_rows_2d

pretty sure you could do something similar
(essentially you are creating a tuple of the data, sucking it out of the frame, and then replacing, say, missing data with whatever you need)

this is a bit more complicated, because I use a mask to avoid writing completely empty rows (which is much more common when, say, writing a panel)
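
A rough pure-Python sketch of that idea (not the actual Cython; the function name is just illustrative): pull the values out of the frame, build one tuple per row, substitute a replacement for missing entries, and skip rows that are entirely missing using the mask:

import pandas as pd

def frame_to_rows(df, missing_rep=None):
    values = df.values.astype(object)   # all column data as one object array
    null_mask = pd.isnull(values)       # element-wise missing-value mask

    rows = []
    for i in range(len(values)):
        if null_mask[i].all():          # skip completely empty rows
            continue
        rows.append(tuple(missing_rep if null_mask[i, j] else values[i, j]
                          for j in range(values.shape[1])))
    return rows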

@jreback

see #3047 for a strategy of how you should deal with conversion to datetime64[ns] to make sure that you are returning NaT for missing values

(this might actually be reasonably efficient too)

@jreback

you can also do this if you have the array directly

wesm commented 10 minutes ago
I'd recommend using pandas.to_datetime(df['date'])
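
As a small illustration of that recommendation, to_datetime returns NaT for missing entries and yields a proper datetime64[ns] series (expected output shown as comments):

import pandas as pd

s = pd.Series(['2013-01-01', None, '2013-01-03'])
pd.to_datetime(s)
# 0   2013-01-01
# 1          NaT
# 2   2013-01-03
# dtype: datetime64[ns]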

@jreback jreback modified the milestone: 0.15.0, 0.14.0 Mar 25, 2014
@jorisvandenbossche jorisvandenbossche modified the milestone: 0.14.0, 0.15.0 Apr 29, 2014
@jorisvandenbossche

Actually I think we should try to fix this before 0.14.
Basic NaN handling is really essential.

@jorisvandenbossche jorisvandenbossche changed the title from ENH: sql to provided NaN/NaT conversions to ENH: sql support for NaN/NaT conversions May 13, 2014
@jorisvandenbossche

Overview of the current status (see also tests added in #7100):

  • For MySQL using pymysql nothing works: you get the message Unknown column 'nan' in 'field list' (see also e.g. http://stackoverflow.com/questions/23353732/python-pandas-write-to-sql-with-nan-values)
  • Numeric columns:
    • working for sqlite and postgres
    • Only full NaN columns stay None in sqlite
  • Object columns (e.g. strings)
    • for postgresql: NaN is converted to the string u'NaN', which is not really what we want
    • for sqlite it is returned as None
  • NaT:
    • postgresql: gives error on inserting "0001-255-255T00:00:00"
    • sqlite3: writing works, but reading it with query returns '-001--1--1 -1:-1:-1.-00001'
  • MSSQL: not working with message "The supplied value is not a valid instance of data type float", see #8088 for more details
@jreback

in HDF I do a translation of NaN to a string that can be user-specified (as NaN can't be stored directly)

I think that is a reasonable solution in case it can't be natively stored (or maybe always?)

http://pandas-docs.github.io/pandas-docs-travis/io.html#string-columns (see nan_rep)
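
For comparison, a small sketch of what the nan_rep handling referenced above looks like on the HDF side (the file name and the representation string are just illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'s': ['foo', np.nan, 'bar']})

with pd.HDFStore('store.h5') as store:
    # nan_rep controls the string used to represent NaN inside the table
    store.append('df', df, nan_rep='_missing_')
    result = store.select('df')   # the missing value comes back as NaN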

@jorisvandenbossche

But in theory it should be possible to store it natively in SQL, I think, as there is the NULL value.
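
Indeed, plain DB-API drivers already round-trip Python None as SQL NULL; a quick sqlite3 illustration:

import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (x REAL)')
con.execute('INSERT INTO t VALUES (?)', (None,))    # None is stored as NULL
print(con.execute('SELECT x FROM t').fetchone())    # -> (None,)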

@jreback

There are 2 approaches you can use here:

This is what I do in sqlalchemy (I am using a custom date-time type, not Timestamp, but the same idea);
it is easy enough to convert Timestamps to datetimes directly and to convert NaT to None:

import sqlalchemy as sql


class DateTime(sql.types.TypeDecorator):
    """convert to/from Date type"""

    impl = sql.types.DateTime

    def process_bind_param(self, value, dialect):
        # as_naive_datetime_for_db is a helper on the author's own
        # Timestamp-like objects; anything else passes through unchanged
        f = getattr(value, 'as_naive_datetime_for_db', None)
        if f is not None:
            return f()
        return value

    def process_result_value(self, value, dialect):
        # NULL comes back as None; dates.Date is a date wrapper from
        # the author's own codebase
        if value is None:
            return None
        return dates.Date(value)
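
For context, a decorated type like this is then used in place of the plain DateTime when declaring columns (a sketch reusing the sqlalchemy-as-sql import from the snippet above; the table and column names are just illustrative):

metadata = sql.MetaData()
events = sql.Table('events', metadata,
                   sql.Column('id', sql.Integer, primary_key=True),
                   sql.Column('when', DateTime))   # the TypeDecorator above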

alternatively, you can vectorize the conversion

In [23]: df = DataFrame(dict(A = Timestamp('20130101')),index=range(3))

In [26]: df.iloc[1] = np.nan

In [27]: df
Out[27]: 
           A
0 2013-01-01
1        NaT
2 2013-01-01

[3 rows x 1 columns]

In [35]: s = df['A'].astype(object).values

In [36]: s
Out[36]: 
array([datetime.datetime(2013, 1, 1, 0, 0), nan,
       datetime.datetime(2013, 1, 1, 0, 0)], dtype=object)

In [38]: s[isnull(s)] = None

In [39]: s
Out[39]: 
array([datetime.datetime(2013, 1, 1, 0, 0), None,
       datetime.datetime(2013, 1, 1, 0, 0)], dtype=object)

@jreback

@jorisvandenbossche ok...move to 0.14.1?

@jorisvandenbossche jorisvandenbossche modified the milestone: 0.14.1, 0.14.0 May 15, 2014
@jorisvandenbossche jorisvandenbossche modified the milestone: 0.15.0, 0.14.1 Jul 1, 2014
@jorisvandenbossche jorisvandenbossche modified the milestone: 0.15.1, 0.15.0 Aug 21, 2014