pd.read_sql_query() does not convert NULLs to NaN #14319

Closed

Gerenuk opened this issue Sep 29, 2016 · 12 comments

Gerenuk commented Sep 29, 2016

A small, complete example of the issue

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute("create table test (a float)")
for _ in range(5):
    conn.execute("insert into test values (NULL)")

df = pd.read_sql_query("select * from test", engine, coerce_float=True)
print(df.a)

Expected Output

In pandas 0.18.1 this results in a column of dtype object holding None values, whereas I need float("nan"). The coerce_float=True option makes no difference. This matters most when reading a float column chunk-wise, since there may be long sequences of NULLs (an entire chunk can be all NULL).
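For reference, a sketch of the output of print(df.a) as observed (per the description above) versus as expected:

# Observed (pandas 0.18.1):
0    None
1    None
2    None
3    None
4    None
Name: a, dtype: object

# Expected:
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: a, dtype: float64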

(also http://stackoverflow.com/questions/30652457/adjust-pandas-read-sql-query-null-value-treatment/)

TomAugspurger (Contributor) commented

Seems to be the same as #14314 (though your example is better, thanks; I'll copy it over there).

The coerce_float option is for things like Decimal to float conversion, not null to NaN.
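A hedged illustration of what coerce_float is for (this assumes a database whose driver returns NUMERIC values as decimal.Decimal, e.g. PostgreSQL; the table items, column price, and pg_engine are hypothetical):

df = pd.read_sql_query("select price from items", pg_engine, coerce_float=False)
type(df.price[0])  # decimal.Decimal
df = pd.read_sql_query("select price from items", pg_engine, coerce_float=True)
df.price.dtype     # float64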

Does your actual use case use select * from test? If so, you can do pd.read_sql_table('test', engine, chunksize=chunksize), which will get the type information from the database. read_sql_query only sees the result set, so it has no such information to use.
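A minimal sketch of that suggestion, reusing the sqlite example above (the chunksize of 2 is arbitrary):

for chunk in pd.read_sql_table('test', engine, chunksize=2):
    # The dtype comes from the table schema, so even an all-NULL chunk
    # should come back as float64 with NaN rather than object with None.
    print(chunk.a.dtype)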

TomAugspurger (Contributor) commented

Let's actually reopen this: is it worth adding a coerce_null parameter to read_sql_query to handle cases like this?

TomAugspurger reopened this Sep 29, 2016
TomAugspurger added the API Design and IO SQL labels Sep 29, 2016
Gerenuk (Author) commented Sep 29, 2016

My actual example was slightly more complex. I used an SQL query and had a way to do chunk-wise reading, which made this issue more likely.

jorisvandenbossche (Member) commented

I am not sure it is worth adding such a parameter. The issue is also that this should not necessarily result in a float column, since in SQL NULLs can occur in a column of any type.

To solve the dtype issues with multiple chunks, we could also think about a dtype kwarg where you could specify dtypes for certain columns explicitly (similar to the dtype argument in e.g. read_csv), although that of course means manually specifying the exact columns. A sketch of the idea follows below.
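A hedged sketch of the proposal next to today's closest equivalent (the dtype argument shown first does not exist in read_sql_query; it is the idea being floated):

# Proposed (hypothetical) API:
# df = pd.read_sql_query("select * from test", engine, dtype={'a': 'float64'})

# Workaround available today: cast explicitly after reading.
df = pd.read_sql_query("select * from test", engine).astype({'a': 'float64'})
print(df.a.dtype)  # float64, with the NULLs now NaN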

The problem with queries is that we do not have any information about the database table layout; we just get back result sets. But maybe we could also have a new kwarg to specify a table name from which to get the type information. Or a helper function to get this information and automatically construct a dict that could be passed to a possible dtype argument. Just brainstorming here.

jorisvandenbossche removed the Duplicate Report label Sep 29, 2016
TomAugspurger (Contributor) commented

Agreed about the coerce_null parameter not making sense.

A helper method for getting dtypes from the table might be useful. I've written similar methods in the past that get the column info from a sqlalchemy MetaData object and transform them to a dictionary of {column_name: pandas type}. That could be passed to a dtype keyword for read_sql_query.
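A hedged sketch of such a helper (the name sql_dtypes and the type mapping are illustrative, and the mapping is far from exhaustive):

import pandas as pd
from sqlalchemy import MetaData, Table
from sqlalchemy.types import Float, Integer, Numeric, String

def sql_dtypes(table_name, engine):
    # Reflect the table's schema from the database.
    table = Table(table_name, MetaData(), autoload_with=engine)
    mapping = {}
    for col in table.columns:
        if isinstance(col.type, (Float, Numeric)):
            mapping[col.name] = 'float64'
        elif isinstance(col.type, Integer):
            mapping[col.name] = 'int64'  # NULLs would need a nullable type
        elif isinstance(col.type, String):
            mapping[col.name] = 'object'
    return mapping

# Until a dtype keyword exists, apply the mapping after reading:
df = pd.read_sql_query("select * from test", engine).astype(sql_dtypes('test', engine))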

threemonks commented

Is there any workaround for this while the issue is being addressed? I tried the following, but neither works:

df.replace(None, np.nan, inplace=True)
df.fillna(value=np.nan, inplace=True)

TomAugspurger (Contributor) commented Aug 12, 2018

Does to_numeric work? (with errors='coerce')
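A minimal sketch of that suggestion, with a Series standing in for a column as read_sql_query returns it:

import pandas as pd

s = pd.Series([None, None, 1.5], dtype=object)
print(pd.to_numeric(s, errors='coerce'))
# 0    NaN
# 1    NaN
# 2    1.5
# dtype: float64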

puttkraidej commented

Same issue here: pandas forced Decimal to float dtype when using read_sql_query() :(

annette987 commented

Was this ever resolved? I am having the same issue. See also: https://stackoverflow.com/questions/53315035/pandas-read-sql-query-returning-none-for-all-values-in-some-columns

I am using pandas read_sql_query to read data from a MySQL database table into a pandas dataframe. Some columns in this table have all NULL values. For those columns the pandas dataframe contains None in every row. For all other columns the dataframe contains NaN where there was a NULL value. Can anyone explain why None is returned for the all-NULL columns? And how do I make sure I get all NaNs, hopefully without doing manual conversions? I should add that two of the columns causing this problem are of type float and the third is of type double.

Here is an example. The columns pef and fer contain all NULLs in the database.

from sqlalchemy import create_engine
import pandas as pd

querystr = "SELECT * FROM dbname.mytable"
engine = create_engine('mysql+pymysql://username:password@localhost/' + "dbname")
df = pd.read_sql_query(querystr, engine)
df.head()

    sys     dias    pef     fer
0   NaN     NaN     None    None
1   159.0   92.666  None    None
2   NaN     NaN     None    None
3   NaN     NaN     None    None
4   102.0   63.333  None    None

In the MySQL database these columns are defined as:

Columns: 
    sys float 
    dias float 
    pef float 
    fer float

I would expect the columns pef and fer to contain NaN in each row, not None.

TomAugspurger (Contributor) commented

Still open.

Is your actual query SELECT * FROM ...? If so, you could use pd.read_sql_table.

Pandas can't use the dtypes from the table for arbitrary SQL expressions passed to read_sql_query, since the expression can change the column names and types (see the sketch below). But pandas does use the table's types for read_sql_table.
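A hedged illustration, reusing the sqlite table from the first example in this thread: the query renames the column and derives a new one, so no table schema describes the result set:

df = pd.read_sql_query("select a as b, a is not null as has_a from test", engine)
# The result set's columns (b, has_a) and their types are determined by
# the expression, not by the schema of table test.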

annette987 commented

My actual query is more complicated than that and involves multiple tables. So I can't just use pd.read_sql_table.
What I am doing at the moment is just converting None to NaN in the dataframe:
df.replace([None], np.nan, inplace=True)
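A hedged note on that workaround: depending on the pandas version, the replaced columns may still end up with object dtype (NaN stored in an object column), so an explicit cast may also be needed, e.g.:

df['pef'] = df['pef'].astype('float64')
df['fer'] = df['fer'].astype('float64')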

TomAugspurger (Contributor) commented

Closing in favor of #10285
