Support UTF-8 in Google BigQuery results #5940
Comments
That's a good point - I think we shelved that at the time as low priority, but you're correct that it should be fairly simple. Can you post what your test was that's failing? Also, was this for Python 2.7, since I thought str() would work with UTF8 automagically in Python 3. Note, unless I'm mistaken, we'll still have issues with UTF8 Headers per line 236: |
I changed the test to this:
def test_type_conversion(self):
    # All BigQuery Types should be cast into appropriate numpy types
    sample_input = [('1.095292800E9', 'TIMESTAMP'),
                    ('false', 'BOOLEAN'),
                    ('2', 'INTEGER'),
                    ('3.14159', 'FLOAT'),
                    ('Hello World', 'STRING'),
                    ('éü', 'STRING')]
    actual_output = [gbq._parse_entry(result[0], result[1]) for result in sample_input]
    sample_output = [np.datetime64('2004-09-16T00:00:00.000000Z'),
                     np.bool(False),
                     np.int('2'),
                     np.float('3.14159'),
                     u'Hello World',
                     u'éü']
    self.assertEqual(actual_output, sample_output, 'A format conversion failed')
Keep in mind I'm new to the whole ascii/unicode problem in Python so the sample_input addition with Removing the optional message this is the error I'm getting:
Decode isn't working properly, which I attribute to my test. |
Just noticed I made a mistake in the initial post. Should be I also just edited this manually on our install and it is working fine. I would just need to figure out how to write a proper test for it. |
Something like this? I'll have to see what it returns from BigQuery as... part of the problem is you're not supposed to have Unicode literals in your source code: http://www.python.org/dev/peps/pep-0263/ . We might have to be careful here, since I'm not sure what I picked was normalized: http://en.wikipedia.org/wiki/Unicode_equivalence |
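For readers unfamiliar with the Unicode equivalence concern raised above, here is a minimal Python 3 sketch. It is not taken from the thread: the helper name `same_text` is hypothetical, and the choice of NFC as the comparison form is an assumption for illustration. It shows that "é" can be stored either as one code point or as "e" plus a combining accent, and that the two only compare equal after normalization.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The raw strings differ even though they render identically.
print(composed == decomposed)           # False
# After normalization they compare equal.
print(same_text(composed, decomposed))  # True
```

A test that compares expected and actual strings directly can therefore fail if the two sides happen to use different forms, which is one reason to write the expected value with explicit escapes.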
Yeah that works. Didn't know about the Unicode literals PEP, so that's good to know. Edit: actually, that may not work. |
I didn't know about PEP either - I just found it when I was messing around with your test. str() is definitely a source of issues. I'll try that. |
To embed a unicode literal, use the 'u' function: from pandas.compat import u. See pandas/tests/test_format.py for some examples. |
That's a handy method :-) Looks like this works fine... it looks like Google returns a normalized version, so "\xc3\xa9\xc3\xbc" becomes "\xe9\xfc". Something like this? You should be able to clone mine and install it with the 'python setup.py develop' to test it if you have a dataset in mind. |
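The byte sequences quoted above can be checked directly. The following Python 3 sketch is an illustration rather than code from the thread, and the variable names are made up. It suggests that going from "\xc3\xa9\xc3\xbc" to "\xe9\xfc" is what plain UTF-8 decoding produces: four bytes become two code points.

```python
# UTF-8 encoded bytes for "éü", as quoted in the comment above.
raw = b"\xc3\xa9\xc3\xbc"

# Decoding turns each two-byte sequence into a single code point.
text = raw.decode("utf-8")

print(len(raw), len(text))          # 4 2
print(text == "\xe9\xfc")           # True: U+00E9 followed by U+00FC
print(text.encode("utf-8") == raw)  # True: the round trip is lossless
```

Writing the expected value with escapes such as "\xe9\xfc" also keeps non-ASCII characters out of the test source, which sidesteps the source-encoding question from PEP 263.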
Let me know if you can confirm this works: If so, we'll wrap this into a pull request. EDIT: Moved to branch |
I haven't had a chance to try that specific branch yet, I'll try to get to it this week. I've been running |
not sure if you guys follow pandas on SO: http://stackoverflow.com/questions/21886742/convert-pandas-dtypes-to-bigquery-type-representation |
@jacobschaer I've been running your code for the past few weeks and it's been running smoothly. Only remaining UTF-8 issues are within my own code. |
it appears that the bigquery installation fails under setuptools >= 3.0 (which I think is now public ish). you can see here: https://travis-ci.org/pydata/pandas/jobs/20387092 worked around in pandas...maybe could report upstream.... |
@jreback @jacobschaer I'll talk to their developer relations people and see if they can help us resolve this- seems like it should be an easy fix for them...I'll let you know what I read back. |
@jreback RE apputils dependency: It looks like Travis is somehow getting an older version of apputils (v0.3). I tried locally and pip ended up fetching v0.4 from Google's repository instead of from PyPI... |
closed by #6596 |
@jreback @jacobschaer Assuming the new version of setuptools is not downloading the 0.4.0 version from the unsecure Google code site, the fix would be for Google to update https://pypi.python.org/packages/source/g/google-apputils/ to include the 0.4.0 version like the main pypi page indicates https://pypi.python.org/pypi/google-apputils/0.4.0 I sent them an e-mail to see if they can do this. |
thanks! |
I resolved the issue on Travis by forcing setuptools 2.2, so no big deal. Thanks for looking into this. |
Ah, that explains why we were seeing setuptools v 2.2 in the CI install script on your branch...I was wondering about that. Thanks. Hopefully they will update pypi too as any local installs with 3.x would have the same problem. |
Given that the entire Google BigQuery API returns UTF-8, it would make sense to handle UTF-8 output from BigQuery in the gbq.read_gbq IO module. I'd love to do a pull request but I'm not sure of the preferred way of handling this. I'd assume that this line should be changed to field_value = field_value.decode('utf-8'). I made that change and tests passed, but figuring out how to properly test UTF-8 encoding is giving me some trouble (it keeps making my test addition fail).
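As a rough sketch of the change proposed above, the snippet below decodes a field only when it arrives as bytes. It is written for Python 3 and is not the pandas implementation: the function name `decode_field` and its `encoding` parameter are hypothetical, and the assumption that a field may arrive either as bytes or as already-decoded text is mine.

```python
def decode_field(field_value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding only if given bytes."""
    if isinstance(field_value, bytes):
        return field_value.decode(encoding)
    # Already text (or some other type): pass it through unchanged.
    return field_value


# Bytes are decoded; text passes through untouched.
print(decode_field(b"\xc3\xa9\xc3\xbc") == "\u00e9\u00fc")  # True
print(decode_field("Hello World"))                          # Hello World
```

The type guard is a design choice for the sketch: calling .decode unconditionally on a value that is already text raises an AttributeError in Python 3, and in Python 2 it can trigger an implicit ASCII encode that fails on non-ASCII characters.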