
CLN: Use to_dataframe to download query results. #247

Merged: 19 commits into googleapis:master on Feb 23, 2019

Conversation

tswast (Collaborator) commented on Jan 25, 2019

This allows us to remove logic for parsing the schema and align with
google-cloud-bigquery.

Towards #149

Note, I ran the profiler before and after. For some reason this version took about 20 seconds longer (331 seconds before, 352 after). I can't figure out why that would be, since in both cases we're just looping over all the rows and constructing a dataframe. Maybe it was just a fluke of my work WiFi network?

In either case, this change will help us with #242 once I release the next version of google-cloud-bigquery. It'll also help us with #133 whenever the mystery faster API launches.

tswast added the "do not merge" label (indicates a pull request not ready for merge, due to either quality or timing) on Jan 25, 2019.

tswast (Collaborator, Author) commented on Jan 25, 2019

Looks like some of our system tests call run_query directly. I'll have to do some clean-up.

tswast (Collaborator, Author) commented on Feb 4, 2019

There are a couple of errors where the dtype that google-cloud-bigquery returns is different from the one pandas-gbq returns.

The one that most concerns me is the following:

    >       tm.assert_frame_equal(df, DataFrame({"null_timestamp": [NaT]}))
    E       AssertionError: Attributes are different
    E
    E       Attribute "dtype" are different
    E       [left]:  object
    E       [right]: datetime64[ns]

google-cloud-bigquery returns an object dtype for nullable timestamp columns. I think the solution here is to retain pandas-gbq's schema defaults and use the new ability to select a dtype in the google-cloud-bigquery to_dataframe method.
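
A minimal sketch of that approach, assuming a google-cloud-bigquery version whose to_dataframe accepts a dtypes mapping; the dtype table and helper name here are illustrative, not pandas-gbq's actual code:

    import numpy as np
    from google.cloud import bigquery

    # Illustrative per-type defaults; anything omitted falls back to
    # to_dataframe()'s own inference.
    _DTYPE_MAP = {
        "FLOAT": np.dtype(float),
        "TIMESTAMP": "datetime64[ns]",
        "DATETIME": "datetime64[ns]",
    }

    def _schema_to_dtypes(schema):
        """Build a column-name -> dtype mapping from the query schema."""
        return {
            field.name: _DTYPE_MAP[field.field_type]
            for field in schema
            if field.field_type in _DTYPE_MAP
        }

    client = bigquery.Client()
    rows = client.query("SELECT 1.5 AS x").result()
    df = rows.to_dataframe(dtypes=_schema_to_dtypes(rows.schema))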

tswast (Collaborator, Author) commented on Feb 9, 2019

I've updated the tests to match the new dtypes. The main difference is that we now use datetime64[ns, UTC] for TIMESTAMP columns.

tswast (Collaborator, Author) commented on Feb 9, 2019

I don't know what's going on with CircleCI right now. It can't find versioneer.py for some reason.

Review thread on the following change:

    empty_columns = {
        "title": pandas.Series([], dtype=object),
        "id": pandas.Series([], dtype=object),
        "is_bot": pandas.Series([], dtype=object),

Contributor: But isn't this a worse result than before, in that an empty column reverts to object rather than its original type?

tswast (Author): Yes, it is worse. We'd have to add a second type-conversion step for any non-null object columns. I believe this case only happens on empty dataframes, so I guess it wouldn't be the worst to check for that.
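
A hedged sketch of that empty-dataframe check; the helper name and the schema_dtypes mapping are hypothetical, not the code this PR actually landed:

    # Hypothetical: cast only empty result frames back to the dtypes
    # implied by the BigQuery schema, since an empty object column
    # carries no values that could fail the conversion.
    def _finalize_dtypes(df, schema_dtypes):
        if len(df.index) == 0:
            return df.astype(schema_dtypes)
        return df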

Contributor: I am definitely keen on doing this well once, rather than us and google-cloud-bigquery repeating logic.

But here, it seems like the pandas-gbq logic is better: read the schema, and then use that schema in the dataframe. No?

tswast (Author): Done in 0a6e8d9.

Contributor: Great!

Should we do this on every dataframe, so there's no need for a special case for empty ones?

tswast (Author) replied on Feb 12, 2019:

> Should we do this on every dataframe, so there's no need for a special case for empty ones?

We could if we tell astype() to ignore errors, but I think it's unnecessary. The dataframe constructor seems to pick the right type when bools and ints are passed in: object if it contains nulls, bool/int64 otherwise.

I'm very tempted to use the new nullable integer type by default at #242, but probably want to wait for pandas to make that change. For the cases where folks want a very particular dtype, I propose in #242 that we add a dtypes argument, much as we did in google-cloud-bigquery.
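
A quick illustration of the constructor inference described above (plain pandas, nothing pandas-gbq specific):

    import pandas

    # Without nulls, the constructor infers native dtypes.
    print(pandas.Series([True, False]).dtype)  # bool
    print(pandas.Series([1, 2]).dtype)         # int64

    # A null in a boolean column falls back to object.
    print(pandas.Series([True, None]).dtype)   # object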

Contributor: OK great, thanks @tswast.

tswast removed the "do not merge" label on Feb 12, 2019.

tswast (Collaborator, Author) commented on Feb 22, 2019

I think the conda failure uncovered a real known issue, pandas-dev/pandas#12513. The workaround is to use the previous behavior of timezone-naive datetimes for BigQuery TIMESTAMP columns, even though that's not exactly correct.
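
A minimal sketch of that workaround, assuming a downloaded frame df whose hypothetical ts column arrived tz-aware:

    # Drop the UTC zone so the column is plain datetime64[ns],
    # matching pandas-gbq's previous, timezone-naive behavior.
    df["ts"] = df["ts"].dt.tz_localize(None)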

tswast (Collaborator, Author) commented on Feb 22, 2019

@max-sixty Tests are finally green. Mind doing another review before I squash and merge?

max-sixty (Contributor) commented on Feb 22, 2019

Interesting. I haven't hit that before, but I don't use conda, so I guess that explains it. I think the workaround is fine in the meantime; we'll hear about it if anyone strongly needs the tz.

max-sixty self-requested a review on February 22, 2019.

max-sixty (Contributor) left a review comment:

Thanks a lot!

Review thread on the following change:

    # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
    # #missing-data-casting-rules-and-indexing
    dtype_map = {
        "FLOAT": np.dtype(float),
        # Even though TIMESTAMPs are timezone-aware in BigQuery, pandas doesn't
        # support datetime64[ns, UTC] as dtype in DataFrame constructors. See:
        # https://github.com/pandas-dev/pandas/issues/12513

Contributor: This is not generally an issue, as this is a single dtype for all columns; you are much better off constructing a properly dtyped Series in order to use the tz-aware dtypes.

Furthermore, you need to be really clear about whether this is a localized timestamp or not; see the SQL conversion code in pandas, which handles all of this correctly.
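
A short sketch of that suggestion, building the Series tz-aware from the start instead of passing a tz-aware dtype through a constructor (values illustrative):

    import pandas

    # Parse as UTC up front; the Series is born tz-aware and nulls
    # become NaT rather than forcing an object column.
    values = ["2019-02-22 12:00:00", None]
    s = pandas.to_datetime(pandas.Series(values), utc=True)
    print(s.dtype)  # datetime64[ns, UTC]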

tswast (Author): Yeah, I wasn't having an issue except on the conda build with pandas pre-wheels. From pandas-dev/pandas#12513 (comment) you can see in the stacktrace that this is actually constructing Series and then combining them as a DataFrame. It's the Series constructor that's failing on dtype='datetime64[ns, UTC]'.

    /opt/conda/envs/test-environment/lib/python3.6/site-packages/google/cloud/bigquery/table.py:1325: in _to_dataframe_dtypes
        columns[column] = pandas.Series(columns[column], dtype=dtypes[column])
    /opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:248: in __init__
        raise_cast_failure=True)
    /opt/conda/envs/test-environment/lib/python3.6/site-packages/pandas/core/series.py:2967: in _sanitize_array
        subarr = _try_cast(data, False)

I had failures on the DataFrame constructor too in the unit tests. If it had worked to construct a series, you're right that I'd prefer to have these be tz-aware.
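
For reference, a minimal reproduction of that failure mode (it only raises on pandas versions affected by pandas-dev/pandas#12513; current releases accept it):

    import pandas

    # On affected versions, handing a tz-aware dtype string straight to
    # the Series constructor raised during casting.
    s = pandas.Series(["2019-02-22 12:00:00"], dtype="datetime64[ns, UTC]")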

Contributor: You must have an old version; this was worked out a while back.

tswast (Author): OK. I'll double-check the Conda CI config.

I kind of prefer to leave this as-is (not tz-aware), since that's the existing behavior: https://github.com/pydata/pandas-gbq/blob/f729a44a48744acc2898350fbfbded791d900967/pandas_gbq/gbq.py#L647. I should open a separate issue about changing that for TIMESTAMP columns.

tswast merged commit ebcbfbe into googleapis:master on Feb 23, 2019.
tswast deleted the issue149-to_dataframe branch on February 23, 2019 at 00:38.