Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UnicodeEncodeError on file encoding detection #274

Merged
merged 1 commit into from Jan 4, 2016

Conversation

Projects
None yet
6 participants
@imankulov
Copy link
Contributor

commented Oct 13, 2015

If the first line of a python file is not a valid latin-1 string,
parse_encoding dies with "UnicodeDecodeError". These strings nonetheless can be
valid in some scenarios (for example, Mako extractor uses
babel.messages.extract.extract_python), and it makes more sense to ignore this
exception and return None.

@codecov-io

This comment has been minimized.

Copy link

commented Oct 13, 2015

Current coverage is 84.21%

Merging #274 into master will not affect coverage as of a3950c0

@@            master    #274   diff @@
======================================
  Files           22      22       
  Stmts         3770    3770       
  Branches         0       0       
  Methods          0       0       
======================================
  Hit           3175    3175       
  Partial          0       0       
  Missed         595     595       

Review entire Coverage Diff as of a3950c0

Powered by Codecov. Updated on successful CI builds.

@etanol

This comment has been minimized.

Copy link
Contributor

commented Oct 15, 2015

Do you have a working example where this happens? Because any byte sequence can be decoded as Latin1. It doesn't mean that the decoded result should make sense, but decoding Latin1 to unicode usually never fails.

@imankulov

This comment has been minimized.

Copy link
Contributor Author

commented Oct 15, 2015

Oh, to be frank, I misunderstood the source of the problem initially. Of course, when you have a byte sequence (Python 2.x string) and you decode it, you cannot end up with UnicodeEncodeError.

The exception is thrown later on, when we try to parse the resulting object with parser.suite. The thing is, in Python 2.x parser.suite expects a byte sequence (py2.x str), and in Python 3.x it expects a unicode object (py3.x str).

Python 2.x example

$ python
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f3150d18090>
>>> parser.suite(u'print("я")')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u044f' in position 7: ordinal not in range(128)

Python 3.x example

$ python3
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f1f222e3f90>
>>> parser.suite('print("я")'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: suite() argument 1 must be str, not bytes

As it turned out, to support both 2.x and 3.x properly we have to switch between py3.x str and py2.x str, depending on the interpreter version. Current solution is to "die quietly" if first string contains non-ascii symbols. As I feel it, it is "good enough" to cover almost everything. Non-ascii first string theoretically can contain encoding information, like # coding: utf-8 😈, but do we really have to properly process this case?

imankulov added a commit to imankulov/mako that referenced this pull request Oct 23, 2015

Ensure babel i18n extactor works properly with non-ascii input
If mako templates contain something like "_('Köln')", babel extractor converts
it to pure ASCII so that resulting .po file would contain "K\xf6ln". Not all
translation tools and translations are ready for such kind of escape sequences.

Babel allows message ids to be non-ascii, the plugin just has to return Unicode
objects instead of ASCII strings (and that's exactly how Babel built-in Python
and JavaScript extractors work).

This fix ensures mako extractor doesn't excape non-ascii symbols, works well
both for Unicode and non-unicode input (there is a test for cp1251 encoding),
and also provides a workaround for babel charset detector python-babel/babel#274.
@akx

This comment has been minimized.

Copy link
Member

commented Jan 3, 2016

@imankulov Can you rebase this, please?

Fix UnicodeEncodeError on file encoding detection
If the first line of a python file is not a valid latin-1 string,
parse_encoding dies with "UnicodeDecodeError". These strings nonetheless can be
valid in some scenarios (for example, Mako extractor uses
babel.messages.extract.extract_python), and it makes more sense to ignore this
exception and return None.

@imankulov imankulov force-pushed the imankulov:fix-encoding-detection branch from 258cb37 to f9b04a5 Jan 4, 2016

@gitmate-bot gitmate-bot added the size/S label Jan 4, 2016

@imankulov

This comment has been minimized.

Copy link
Contributor Author

commented Jan 4, 2016

@akx Done.

@akx

This comment has been minimized.

Copy link

commented on f9b04a5 Jan 4, 2016

ack

@akx

This comment has been minimized.

Copy link
Member

commented Jan 4, 2016

Thank you, this looks good!

akx added a commit that referenced this pull request Jan 4, 2016

Merge pull request #274 from imankulov/fix-encoding-detection
Fix UnicodeEncodeError on file encoding detection

@akx akx merged commit 07aa84f into python-babel:master Jan 4, 2016

5 of 6 checks passed

review/gitmate/manual This commit needs review.
Details
Scrutinizer 3 updated code elements
Details
codecov/project 84.21% remains the same compared to 4f1e0c5
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
review/gitmate/commit All is well! :)
Details

@sils sils removed the (3) pending review label Jan 4, 2016

@imankulov

This comment has been minimized.

Copy link
Contributor Author

commented Jan 4, 2016

@akx Awesome! Thanks a lot for merging it.

@akx akx added this to the Babel 2.3 milestone Jan 4, 2016

@pyup-bot pyup-bot referenced this pull request Jan 6, 2017

Merged

Update babel to 2.3.4 #13

@pyup-bot pyup-bot referenced this pull request Jan 31, 2017

Open

Update babel to 2.3.4 #28

@pyup-bot pyup-bot referenced this pull request Apr 11, 2017

Open

Initial Update #3

@pyup-bot pyup-bot referenced this pull request May 12, 2017

Closed

Initial Update #43

@pyup-bot pyup-bot referenced this pull request Jul 4, 2017

Merged

Initial Update #2

@pyup-bot pyup-bot referenced this pull request Oct 17, 2017

Closed

Initial Update #373

@pyup-bot pyup-bot referenced this pull request Nov 3, 2017

Closed

Update babel to 2.5.1 #424

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.