
CLN: Use to_dataframe to download query results. #247

Merged: 19 commits into googleapis:master on Feb 23, 2019

Conversation

tswast (Collaborator) commented on Jan 25, 2019

This allows us to remove logic for parsing the schema and align with
google-cloud-bigquery.

Towards #149

Note, I ran the profiler before and after. For some reason this version took about 20 seconds longer (331 seconds before, 352 after). I can't figure out why that would be, since in both cases we're just looping over all the rows and constructing a dataframe. Maybe it was just a fluke of my work WiFi network?

In either case, this change will help us with #242 once I release the next version of google-cloud-bigquery. It'll also help us with #133 whenever the mystery faster API launches.

tswast added the "do not merge" label (indicates a pull request not ready for merge, due to either quality or timing) on Jan 25, 2019.

tswast (Collaborator, Author) commented on Jan 25, 2019

Looks like some of our system tests call run_query directly. I'll have to do some clean-up.

tswast (Collaborator, Author) commented on Feb 4, 2019

There are a couple of errors where the dtype that google-cloud-bigquery returns is different from the one pandas-gbq returns.

The one that most concerns me is the following:

    >       tm.assert_frame_equal(df, DataFrame({"null_timestamp": [NaT]}))
    E       AssertionError: Attributes are different
    E
    E       Attribute "dtype" are different
    E       [left]:  object
    E       [right]: datetime64[ns]

google-cloud-bigquery returns an object dtype for nullable timestamp columns. I think the solution here is to retain pandas-gbq's schema defaults and use the new ability to select a dtype in the google-cloud-bigquery to_dataframe method.
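
A minimal sketch of that approach, assuming a google-cloud-bigquery version whose to_dataframe accepts a dtypes mapping; the dtype table and helper name here are illustrative, not pandas-gbq's actual code:

    import numpy as np
    from google.cloud import bigquery

    # Illustrative per-type defaults; anything omitted falls back to
    # to_dataframe()'s own inference.
    _DTYPE_MAP = {
        "FLOAT": np.dtype(float),
        "TIMESTAMP": "datetime64[ns]",
        "DATETIME": "datetime64[ns]",
    }

    def _schema_to_dtypes(schema):
        """Build a column-name -> dtype mapping from the query schema."""
        return {
            field.name: _DTYPE_MAP[field.field_type]
            for field in schema
            if field.field_type in _DTYPE_MAP
        }

    client = bigquery.Client()
    rows = client.query("SELECT 1.5 AS x").result()
    df = rows.to_dataframe(dtypes=_schema_to_dtypes(rows.schema))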

tswast (Collaborator, Author) commented on Feb 9, 2019

I've updated the tests to match the new dtypes. The main difference is that we now use datetime64[ns, UTC] for TIMESTAMP columns.

tswast (Collaborator, Author) commented on Feb 9, 2019

I don't know what's going on with CircleCI right now. It can't find versioneer.py for some reason.

Review thread on the following change:

    empty_columns = {
        "title": pandas.Series([], dtype=object),
        "id": pandas.Series([], dtype=object),
        "is_bot": pandas.Series([], dtype=object),

Contributor: But isn't this a worse result than before, in that an empty column reverts to object rather than its original type?

tswast (Author): Yes, it is worse. We'd have to add a second type-conversion step for any non-null object columns. I believe this case only happens on empty dataframes, so I guess it wouldn't be the worst to check for that.
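
A hedged sketch of that empty-dataframe check; the helper name and the schema_dtypes mapping are hypothetical, not the code this PR actually landed:

    # Hypothetical: cast only empty result frames back to the dtypes
    # implied by the BigQuery schema, since an empty object column
    # carries no values that could fail the conversion.
    def _finalize_dtypes(df, schema_dtypes):
        if len(df.index) == 0:
            return df.astype(schema_dtypes)
        return df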

Contributor: I am definitely keen on doing this well once, rather than us and google-cloud-bigquery repeating logic.

But here, it seems like the pandas-gbq logic is better: read the schema, and then use that schema in the dataframe. No?

tswast (Author): Done in 0a6e8d9.

Contributor: Great!

Should we do this on every dataframe, so there's no need for a special case for empty ones?

tswast (Author) replied on Feb 12, 2019:

> Should we do this on every dataframe, so there's no need for a special case for empty ones?

We could if we tell astype() to ignore errors, but I think it's unnecessary. The dataframe constructor seems to pick the right type when bools and ints are passed in: object if it contains nulls, bool/int64 otherwise.

I'm very tempted to use the new nullable integer type by default at #242, but probably want to wait for pandas to make that change. For the cases where folks want a very particular dtype, I propose in #242 that we add a dtypes argument, much as we did in google-cloud-bigquery.
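
A quick illustration of the constructor inference described above (plain pandas, nothing pandas-gbq specific):

    import pandas

    # Without nulls, the constructor infers native dtypes.
    print(pandas.Series([True, False]).dtype)  # bool
    print(pandas.Series([1, 2]).dtype)         # int64

    # A null in a boolean column falls back to object.
    print(pandas.Series([True, None]).dtype)   # object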

Contributor: OK great, thanks @tswast.

tswast removed the "do not merge" label on Feb 12, 2019.

tswast (Collaborator, Author) commented on Feb 22, 2019

I think the conda failure uncovered a real known issue, pandas-dev/pandas#12513. The workaround is to use the previous behavior of timezone-naive datetimes for BigQuery TIMESTAMP columns, even though that's not exactly correct.
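
A minimal sketch of that workaround, assuming a downloaded frame df whose hypothetical ts column arrived tz-aware:

    # Drop the UTC zone so the column is plain datetime64[ns],
    # matching pandas-gbq's previous, timezone-naive behavior.
    df["ts"] = df["ts"].dt.tz_localize(None)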

tswast (Collaborator, Author) commented on Feb 22, 2019

@max-sixty Tests are finally green. Mind doing another review before I squash and merge?

max-sixty (Contributor) commented on Feb 22, 2019

Interesting. I haven't hit that before, but I don't use conda, so I guess that explains it. I think the workaround is fine in the meantime; we'll hear about it if anyone strongly needs the tz.

max-sixty self-requested a review on February 22, 2019.

max-sixty (Contributor) left a review comment:

Thanks a lot!

Review thread on the following change:

    # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
    # #missing-data-casting-rules-and-indexing
    dtype_map = {
        "FLOAT": np.dtype(float),
        # Even though TIMESTAMPs are timezone-aware in BigQuery, pandas doesn't
        # support datetime64[ns, UTC] as dtype in DataFrame constructors. See:
        # https://github.com/pandas-dev/pandas/issues/12513

Contributor: This is not generally an issue, as this is a single dtype for all columns; you are much better off constructing a properly dtyped Series in order to use the tz-aware dtypes.

Furthermore, you need to be really clear about whether this is a localized timestamp or not; see the SQL conversion code in pandas, which handles all of this correctly.
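
A short sketch of that suggestion, building the Series tz-aware from the start instead of passing a tz-aware dtype through a constructor (values illustrative):

    import pandas

    # Parse as UTC up front; the Series is born tz-aware and nulls
    # become NaT rather than forcing an object column.
    values = ["2019-02-22 12:00:00", None]
    s = pandas.to_datetime(pandas.Series(values), utc=True)
    print(s.dtype)  # datetime64[ns, UTC]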

tswast (Author): Yeah, I wasn't having an issue except on the conda build with pandas pre-wheels. From pandas-dev/pandas#12513 (comment) you can see in the stacktrace that this is actually constructing Series and then combining them as a DataFrame. It's the Series constructor that's failing on dtype='datetime64[ns, UTC]'.

    /opt/conda/envs/test-environment/lib/python3.6/site-packages/google/cloud/bigquery/table.py:1325: in _to_dataframe_dtypes
        columns[column] = pandas.Series(columns[column], dtype=dtypes[column])
    /opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:248: in __init__
        raise_cast_failure=True)
    /opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:2967: in _sanitize_array
        subarr = _try_cast(data, False)

I had failures on the DataFrame constructor too in the unit tests. If it had worked to construct a series, you're right that I'd prefer to have these be tz-aware.
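
For reference, a minimal reproduction of that failure mode (it only raises on pandas versions affected by pandas-dev/pandas#12513; current releases accept it):

    import pandas

    # On affected versions, handing a tz-aware dtype string straight to
    # the Series constructor raised during casting.
    s = pandas.Series(["2019-02-22 12:00:00"], dtype="datetime64[ns, UTC]")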

Contributor: You must have an old version; this was worked out a while back.

tswast (Author): OK. I'll double-check the Conda CI config.

I kind of prefer to leave this as-is (not tz-aware), since that's the existing behavior: https://github.com/pydata/pandas-gbq/blob/f729a44a48744acc2898350fbfbded791d900967/pandas_gbq/gbq.py#L647. I should open a separate issue about changing that for TIMESTAMP columns.

tswast merged commit ebcbfbe into googleapis:master on Feb 23, 2019.
tswast deleted the issue149-to_dataframe branch on February 23, 2019 at 00:38.