
pandas.io.gbq Version 2 #6937

Merged
merged 2 commits into from
Jun 30, 2014

Conversation

jacobschaer
Contributor

closes #5840 (as new interface makes it obsolete)
closes #6096

@jreback : We still have some documentation to work on, but we would like your initial thoughts on what we have so far. The key change for this version is the removal of bq.py as a dependency (except as a setup method for a test case). Instead, we rely entirely on the BigQuery Python API. We also simplified to_gbq() significantly. Though it cost a few features, the code is much more manageable and less error prone (thanks Google!). Test cases are much more granular and run significantly faster. To use the test cases fully, a BigQuery project_id is still required, though some of the unit tests can run offline.

@jreback
Contributor

jreback commented Apr 23, 2014

can u rebase on master?

also flip github switch for Travis

@jacobschaer
Contributor Author

I rebased to master, and Travis was turned on. I'm still working to get Travis passing - mostly just a matter of getting the right dependencies in place and being sure integration tests are skipped.

@jreback jreback added this to the 0.14.0 milestone Apr 23, 2014
@@ -4,8 +4,10 @@ python-dateutil==1.5
pytz==2013b
http://www.crummy.com/software/BeautifulSoup/bs4/download/4.2/beautifulsoup4-4.2.0.tar.gz
html5lib==1.0b2
bigquery==2.0.17
Contributor

don't you still need the bigquery package so that bq is installed? (or is that in the google-api-python-client?) What a horrible package name, Google!

Contributor

also if requirements are changing, pls update install.rst as well

Contributor Author

bigquery is only required for the to_gbq() test suite, which can't be run in CI anyway due to the lack of a valid project id. Will update install.rst soon.

@jreback
Contributor

jreback commented Apr 23, 2014

Is this going to close #5840?

@jacobschaer
Contributor Author

@jreback : I believe #5840 had already stopped reproducing due to a backend change in BigQuery. In any case, it's no longer an issue with our code, since we track the page tokens we've seen so far and ensure we don't get a duplicate.
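The duplicate-page guard described above can be sketched as follows. This is a hedged illustration, not the actual pandas.io.gbq code: `fetch_all_pages` and its `get_page` callback are hypothetical names standing in for the real paged-fetch loop against the BigQuery API.

```python
# Illustrative sketch of tracking seen page tokens so a duplicate page from
# the service is detected instead of silently duplicating rows.
def fetch_all_pages(get_page):
    """get_page(token) -> (rows, next_token); next_token is None on the last page."""
    rows, seen_tokens, token = [], set(), None
    while True:
        page_rows, next_token = get_page(token)
        rows.extend(page_rows)
        if next_token is None:
            break
        if next_token in seen_tokens:
            # The service handed back a page we already consumed: bail out
            # rather than append the same rows twice.
            raise RuntimeError("duplicate page token: %r" % next_token)
        seen_tokens.add(next_token)
        token = next_token
    return rows

# Simulated paged responses for illustration:
pages = {None: ([1, 2], "t1"), "t1": ([3, 4], "t2"), "t2": ([5], None)}
print(fetch_all_pages(pages.__getitem__))  # -> [1, 2, 3, 4, 5]
```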

self.assertEqual(len(df.drop_duplicates()), 200005)

@unittest.skipIf(missing_bq(), "Cannot run to_gbq tests without bq command line client")
@unittest.skipIf(PROJECT_ID is None, "Cannot run integration tests without a project id")
Contributor Author

These @unittest.skipIf decorators seem to cause backward-compatibility issues... Perhaps I should switch all of these over to whatever the nose equivalent is?

Contributor

pandas doesn't implement unittest.skipIf properly; make these regular tests and just raise nose.SkipTest("skipping message") when you skip. Skipped tests are indicated by Travis at the very end (so you can see that you skipped them).
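The skip pattern jreback suggests can be sketched like this. It is a hedged illustration: the test name and PROJECT_ID placeholder are hypothetical, and the sketch falls back to unittest.SkipTest when nose is not installed so it stays runnable.

```python
# Instead of @unittest.skipIf decorators, raise SkipTest inside the test
# body; nose reports these as skips at the end of the Travis run.
try:
    from nose import SkipTest
except ImportError:
    from unittest import SkipTest  # fallback so this sketch runs without nose

PROJECT_ID = None  # would normally come from the test configuration

def test_to_gbq_roundtrip():
    if PROJECT_ID is None:
        raise SkipTest("Cannot run integration tests without a project id")
    # ... real integration test body would go here ...
```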

@jreback
Contributor

jreback commented Apr 28, 2014

@jacobschaer are you doing the small doc change to close #6096 as well?

@jreback
Contributor

jreback commented Apr 28, 2014

@jacobschaer would like to get this in soon... how's it coming?

@jreback
Contributor

jreback commented May 1, 2014

@jacobschaer how's this coming?

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 1, 2014
@jreback
Contributor

jreback commented Jun 3, 2014

@jacobschaer pls rebase on master

@jacobschaer
Contributor Author

OK, all rebased and mostly squashed. Were there any documentation things we're missing?

into BigQuery and pull it into a DataFrame.
As an example, suppose you want to load all data from an existing BigQuery
table: `test_dataset.test_table` into a DataFrame using the :func:`~pandas.io.read_gbq`
function.

.. code-block:: python

from pandas.io import gbq
Member

I think read_gbq is already available as a top-level function. So also use it here as such?

@jreback
Contributor

jreback commented Jun 5, 2014

@jacobschaer

the 3.4 and numpy-dev builds are failing because they are not installing httplib2, which is now a de facto requirement of gbq. Those tests should be skipping (because of the import failure).

those builds need to pass (they are experimental just because the build matrix comes back faster)

@jacobschaer
Contributor Author

@jreback Seems 3.4 and numpy-dev builds now pass.

- ``io.gbq.read_gbq`` and ``io.gbq.to_gbq`` were refactored to remove the
dependency on the Google ``bq.py`` command line client. This submodule
now uses ``httplib2`` and the Google ``apiclient`` and ``oauth2client`` API client
libraries which should be more stable and, therefore, reliable than
Contributor
reference the issue number here (this PR number is ok)

Contributor

put this in the API changes section as well

Member

or in an 'experimental' (sub)section? (as it was and still is tagged as experimental?)

Contributor

yes...put that in the experimental section (down a little on the page in v0.14.1.txt)


Moved this to the experimental section and added issue number with this PR number.

@jreback
Contributor

jreback commented Jun 19, 2014

@jacobschaer pls squash down to a small number of commits (and remove merge branches), and rebase on master

@jreback
Contributor

jreback commented Jun 26, 2014

@jacobschaer can you rebase and squash?

…ng except unit testing. Minor API changes were also introduced.
@jacobschaer
Contributor Author

All rebased and squashed.

jreback added a commit that referenced this pull request Jun 30, 2014
@jreback jreback merged commit 5a07b7b into pandas-dev:master Jun 30, 2014
@jreback
Contributor

jreback commented Jun 30, 2014

@jacobschaer thanks for this!

pls review the docs when they are built in case a followup is needed. http://pandas-docs.github.io/pandas-docs-travis/

@jreback
Contributor

jreback commented Jul 8, 2014

@jacobschaer If you have a chance, can you break the big paragraph into, say, bullet points:

Finally, you can append data to a BigQuery table from a pandas DataFrame using the to_gbq() function. This function uses the Google streaming API, which requires that your destination table exists in BigQuery. Given that the BigQuery table already exists, your DataFrame should match the destination table in column order, structure, and data types. DataFrame indexes are not supported. By default, rows are streamed to BigQuery in chunks of 10,000 rows, but you can pass other chunk values via the chunksize argument. You can also see the progress of your post via the verbose flag, which defaults to True. The HTTP response code from Google BigQuery can be successful (200) even if the append failed. For this reason, if there is a failure to append to the table, the complete error response from BigQuery is returned, which can be quite long given that it provides a status for each row. You may want to start with smaller chunks to test that the size and types of your DataFrame match your destination table to make debugging simpler.

http://pandas-docs.github.io/pandas-docs-travis/io.html#io-bigquery
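The default chunking behaviour described in that paragraph can be sketched in a few lines. This is an illustrative stand-in only; `iter_chunks` is a hypothetical helper, and the real to_gbq streams each chunk to the BigQuery streaming API rather than just slicing a list.

```python
# Minimal sketch of splitting rows into chunks of at most `chunksize`
# (10,000 by default, matching the documented default for to_gbq).
def iter_chunks(rows, chunksize=10000):
    """Yield consecutive slices of `rows`, each at most `chunksize` long."""
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]

# 25,000 rows with the default chunk size yields chunks of 10000/10000/5000:
sizes = [len(chunk) for chunk in iter_chunks(list(range(25000)))]
print(sizes)  # -> [10000, 10000, 5000]
```

Starting with a smaller chunksize, as the paragraph suggests, simply means more, shorter slices, which makes a per-row error response from BigQuery easier to read.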

Development

Successfully merging this pull request may close these issues.

DOC: track status for stale comment in gbq docs
pandas.io.gbq.read_gbq() returns incorrect results
5 participants