Support UTF-8 in Google BigQuery results #5940

Closed
andrewryno opened this issue Jan 14, 2014 · 20 comments
Labels
Enhancement Unicode Unicode strings

@andrewryno

Given that the entire Google BigQuery API returns UTF-8, it would make sense to handle UTF-8 output from BigQuery in the gbq.read_gbq IO module.

I'd love to do a pull request, but I'm not sure of the preferred way of handling this. I'd assume that this line should be changed to field_value = field_value.decode('utf-8'). I made that change and the tests passed, but figuring out how to properly test UTF-8 encoding is giving me some trouble (my test addition keeps failing).

@jacobschaer
Contributor

That's a good point - I think we shelved that at the time as low priority, but you're correct that it should be fairly simple. Can you post what your test was that's failing? Also, was this for Python 2.7, since I thought str() would work with UTF8 automagically in Python 3.

Note, unless I'm mistaken, we'll still have issues with UTF8 Headers per line 236:
col_names.append(field['name'].encode('ascii', 'ignore'))
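For context, a small sketch of what `encode('ascii', 'ignore')` does to a non-ASCII column name: characters outside ASCII are silently dropped, so two distinct headers can collapse to the same name. The field names below are made up for illustration and are not taken from a real BigQuery schema.

```python
# Hypothetical field names; not from an actual BigQuery schema.
fields = [{'name': u'caf\xe9'}, {'name': u'caf\xe8'}, {'name': u'total'}]

col_names = []
for field in fields:
    # 'ignore' drops every character that has no ASCII encoding.
    col_names.append(field['name'].encode('ascii', 'ignore'))

# The first two names are now identical because the accented
# characters were removed rather than preserved.
print(col_names)
```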

@andrewryno
Author

I changed the test to this:

def test_type_conversion(self):
    # All BigQuery Types should be cast into appropriate numpy types
    sample_input = [('1.095292800E9', 'TIMESTAMP'),
                    ('false', 'BOOLEAN'),
                    ('2', 'INTEGER'),
                    ('3.14159', 'FLOAT'),
                    ('Hello World', 'STRING'),
                    ('éü', 'STRING')]
    actual_output = [gbq._parse_entry(result[0], result[1]) for result in sample_input]
    sample_output = [np.datetime64('2004-09-16T00:00:00.000000Z'),
                     np.bool(False),
                     np.int('2'),
                     np.float('3.14159'),
                     u'Hello World',
                     u'éü']
    self.assertEqual(actual_output, sample_output, 'A format conversion failed')

Keep in mind I'm new to the whole ascii/unicode problem in Python so the sample_input addition with éü should probably be done differently.

After removing the optional message, this is the error I'm getting:

======================================================================
FAIL: test_type_conversion (__main__.TestGbq)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/andrewryno/Sites/venv/lib/python2.7/site-packages/pandas/io/tests/test_gbq.py", line 199, in test_type_conversion
    self.assertEqual(actual_output, sample_output)
AssertionError: Lists differ: [numpy.datetime64('2004-09-15T... != [numpy.datetime64('2004-09-15T...

First differing element 5:
\xe9\xfc
\xc3\xa9\xc3\xbc

  [numpy.datetime64('2004-09-15T17:00:00.000000-0700'),
   False,
-  2.0,
?   --

+  2,
   3.14159,
   u'Hello World',
-  u'\xe9\xfc']
+  u'\xc3\xa9\xc3\xbc']

Decode isn't working properly, which I attribute to my test.
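One plausible reading of the diff above, offered as a sketch rather than a diagnosis: `u'\xc3\xa9\xc3\xbc'` is what you get when the UTF-8 bytes of `éü` are interpreted one byte per character (as Latin-1 would), while `u'\xe9\xfc'` is the same text decoded as UTF-8. The snippet reproduces both values from the same bytes.

```python
# UTF-8 encoding of the two characters e-acute and u-umlaut.
raw = b'\xc3\xa9\xc3\xbc'

decoded_utf8 = raw.decode('utf-8')      # two code points: U+00E9, U+00FC
decoded_latin1 = raw.decode('latin-1')  # four code points, one per byte

print(repr(decoded_utf8), len(decoded_utf8))
print(repr(decoded_latin1), len(decoded_latin1))
```

If that reading is right, the mismatch comes from how the literal in the test file was interpreted, not from the conversion under test.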

@andrewryno
Author

Just noticed I made a mistake in the initial post. Should be encode, not decode.

I also just edited this manually on our install and it is working fine. I would just need to figure out how to write a proper test for it.

@jacobschaer
Contributor

Something like this?
jacobschaer@a50fce0

I'll have to see what it returns from BigQuery as... part of the problem is you're not supposed to have Unicode literals in your source code: http://www.python.org/dev/peps/pep-0263/ . We might have to be careful here, since I'm not sure what I picked was normalized: http://en.wikipedia.org/wiki/Unicode_equivalence
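To illustrate the normalization concern: the same visible character can be stored as different code point sequences, and a plain equality check treats them as different. This is a minimal sketch using the standard library's `unicodedata` module; which form BigQuery actually returns is not established here.

```python
import unicodedata

composed = u'\u00e9'     # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = u'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT

# The two strings render the same but compare unequal as-is.
print(composed == decomposed)

# Normalizing both sides to one form makes the comparison reliable.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', composed)
print(nfc == composed, nfd == decomposed)
```

A test that compares strings from an external service could normalize both sides to the same form before asserting equality.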

@andrewryno
Author

Yeah that works. Didn't know about the Unicode literals PEP, so that's good to know.

Edit: actually, that may not work. str() is what is throwing the error. It needs to be something like: field_value.encode('utf-8'). Though I'm not sure about compatibility with Python 3.
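For reference, a sketch of why `str()` fails here on Python 2: calling `str()` on a unicode value implicitly encodes it with the default ASCII codec, which cannot represent `é` or `ü`. The snippet makes that implicit step explicit so that it behaves the same on Python 2 and 3; `field_value` is just an example value.

```python
field_value = u'\xe9\xfc'  # example value with two non-ASCII characters

try:
    # What Python 2's str() does implicitly to a unicode object.
    field_value.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit UTF-8 encode succeeds and yields a byte string.
encoded = field_value.encode('utf-8')
print(ascii_ok, repr(encoded))
```

On Python 3, `str` is already a text type, so `str(field_value)` returns the value unchanged, and the explicit encode would produce `bytes` instead of text.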

@jacobschaer
Contributor

I didn't know about that PEP either - I just found it when I was messing around with your test. str() is definitely a source of issues. I'll try that.

@jreback
Contributor

jreback commented Jan 22, 2014

To embed a unicode literal, use the 'u' function:

from pandas.compat import u

see pandas/tests/test_format.py for some examples
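As a sketch of the underlying idea (this is not the `pandas.compat.u` implementation itself): writing non-ASCII test data as escape sequences keeps the source file pure ASCII, so the expected value does not depend on the file's declared encoding.

```python
# Both characters written as escapes; the source line contains only ASCII.
expected = u'\u00e9\u00fc'

# Inspect the code points to confirm what the literal holds.
code_points = [ord(ch) for ch in expected]
print([hex(cp) for cp in code_points])
```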

@jacobschaer
Contributor

That's a handy method :-) Looks like this works fine... it looks like Google returns a normalized version, so "\xc3\xa9\xc3\xbc" becomes "\xe9\xfc".

Something like this?
jacobschaer@1eb852a

You should be able to clone mine and install it with the 'python setup.py develop' to test it if you have a dataset in mind.

@jacobschaer
Contributor

Let me know if you can confirm this works:
https://github.com/jacobschaer/pandas/compare/GBQ_Unicode_Support

If so, we'll wrap this into a pull request.

EDIT: Moved to branch

@andrewryno
Author

I haven't had a chance to try that specific branch yet, I'll try to get to it this week.

I've been running field_value = field_value.encode('UTF-8') in production for a couple weeks now as a temporary fix and haven't encountered any problems.

@jreback
Contributor

jreback commented Feb 19, 2014

@andrewryno
Author

@jacobschaer I've been running your code for the past few weeks and it's been running smoothly. Only remaining UTF-8 issues are within my own code.

@jreback
Contributor

jreback commented Mar 9, 2014

@jacobschaer

It appears that the bigquery installation fails under setuptools >= 3.0 (which I think is now more or less public).
It works fine on 2.2. This has to do with how the dependencies are set up.

you can see here: https://travis-ci.org/pydata/pandas/jobs/20387092
(open the install tab)

worked around in pandas...maybe could report upstream....

@azbones

azbones commented Mar 11, 2014

@jreback @jacobschaer I'll talk to their developer relations people and see if they can help us resolve this. It seems like it should be an easy fix for them. I'll let you know what I hear back.

@jacobschaer
Contributor

@jreback RE apputils dependency:

It looks like Travis is somehow getting an older version of apputils (v0.3). I tried locally and pip ended up fetching v0.4 from Google's repository instead of from PyPI:

Downloading/unpacking google-apputils (from bigquery)
  Could not fetch URL http://google-apputils-python.googlecode.com/files/google-apputils-0.1.tar.gz (from https://pypi.python.org/simple/google-apputils/): HTTP Error 404: Not Found
  Will skip URL http://google-apputils-python.googlecode.com/files/google-apputils-0.1.tar.gz when looking for download links for google-apputils (from bigquery)
  Could not fetch URL http://google-apputils-python.googlecode.com/files/google-apputils-0.2.tar.gz (from https://pypi.python.org/simple/google-apputils/): HTTP Error 404: Not Found
  Will skip URL http://google-apputils-python.googlecode.com/files/google-apputils-0.2.tar.gz when looking for download links for google-apputils (from bigquery)
  Using version 0.4.0 (newest of versions: 0.4.0, 0.3.0, 0.2, 0.2, 0.1)
  You are installing a potentially insecure and unverifiable file. Future versions of pip will default to disallowing insecure files.
  Downloading from URL http://google-apputils-python.googlecode.com/files/google-apputils-0.4.0.tar.gz (from http://code.google.com/p/google-apputils-python/)
  Running setup.py egg_info for package google-apputils
    running egg_info
    creating pip-egg-info/google_apputils.egg-info
    writing requirements to pip-egg-info/google_apputils.egg-info/requires.txt
    writing pip-egg-info/google_apputils.egg-info/PKG-INFO
    writing namespace_packages to pip-egg-info/google_apputils.egg-info/namespace_packages.txt
    writing top-level names to pip-egg-info/google_apputils.egg-info/top_level.txt
    writing dependency_links to pip-egg-info/google_apputils.egg-info/dependency_links.txt
    writing entry points to pip-egg-info/google_apputils.egg-info/entry_points.txt
    writing manifest file 'pip-egg-info/google_apputils.egg-info/SOURCES.txt'
    warning: manifest_maker: standard file '-c' not found

    reading manifest file 'pip-egg-info/google_apputils.egg-info/SOURCES.txt'
    writing manifest file 'pip-egg-info/google_apputils.egg-info/SOURCES.txt'

@jreback
Contributor

jreback commented Mar 14, 2014

closed by #6596

@jreback closed this as completed Mar 14, 2014

@azbones

azbones commented Mar 14, 2014

@jreback @jacobschaer Assuming the new version of setuptools is not downloading the 0.4.0 version from the insecure Google Code site, the fix would be for Google to update https://pypi.python.org/packages/source/g/google-apputils/ to include the 0.4.0 version, as the main PyPI page indicates: https://pypi.python.org/pypi/google-apputils/0.4.0

I sent them an e-mail to see if they can do this.

@jreback
Contributor

jreback commented Mar 14, 2014

thanks!

@jreback
Contributor

jreback commented Mar 14, 2014

I resolved the issue on Travis by forcing setuptools 2.2
which we have used for quite some time

so no big deal

thanks for looking into this

@azbones

azbones commented Mar 14, 2014

Ah, that explains why we were seeing setuptools v 2.2 in the CI install script on your branch...I was wondering about that. Thanks. Hopefully they will update pypi too as any local installs with 3.x would have the same problem.

4 participants