Fix UnicodeEncodeError on file encoding detection #274
Do you have a working example where this happens? Because any byte sequence can be decoded as Latin1. It doesn't mean that the decoded result should make sense, but decoding Latin1 to |
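The Latin-1 point above is easy to check in isolation. This is only a small illustration, not code from the PR; the variable names are made up for the example.

```python
# Latin-1 maps each byte value 0..255 to the code point with the same number,
# so decoding can never fail, whatever the input bytes are.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))

# Encoding is a different story: ASCII has no representation for "я",
# so this direction raises UnicodeEncodeError.
try:
    "я".encode("ascii")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
```

So a `UnicodeEncodeError` has to come from an encode step somewhere after the decode, not from the Latin-1 decode itself.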
Oh, to be frank, I misunderstood the source of the problem initially. Of course, when you have a byte sequence (Python 2.x string) and you decode it, you cannot end up with UnicodeEncodeError. The exception is thrown later on, when we try to parse the resulting object with
Python 2.x example
$ python
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f3150d18090>
>>> parser.suite(u'print("я")')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u044f' in position 7: ordinal not in range(128)
Python 3.x example
$ python3
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f1f222e3f90>
>>> parser.suite('print("я")'.encode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: suite() argument 1 must be str, not bytes
As it turned out, to support both 2.x and 3.x properly, we have to switch between py3.x str and py2.x str depending on the interpreter version. The current solution is to "die quietly" if the first line contains non-ASCII symbols. I feel this is "good enough" to cover almost everything. A non-ASCII first line can theoretically contain encoding information, like |
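One way to read the "switch str types, die quietly on non-ASCII" idea is sketched below. This is an assumption about the intent, not the code from the PR, and the helper names `first_line_is_ascii` and `native_first_line` are hypothetical.

```python
import sys

PY2 = sys.version_info[0] == 2


def first_line_is_ascii(line1):
    """Return True if the raw first line (bytes) contains only ASCII."""
    try:
        line1.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


def native_first_line(line1):
    """Return line1 as the interpreter's native `str`, or None.

    None is the "die quietly" case: the line has non-ASCII bytes, so we
    do not try to hand it to the parser at all.
    """
    if not first_line_is_ascii(line1):
        return None
    if PY2:
        # On 2.x the native str is already a byte string.
        return line1
    # On 3.x the native str is text, so decode the (pure ASCII) bytes.
    return line1.decode("ascii")
```

With this shape the parser only ever sees the string type it accepts on the running interpreter, at the cost of skipping files whose first line is not pure ASCII.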
If Mako templates contain something like "_('Köln')", the Babel extractor converts it to pure ASCII, so the resulting .po file contains "K\xf6ln". Not all translation tools and translations are ready for such escape sequences. Babel allows message ids to be non-ASCII; the plugin just has to return Unicode objects instead of ASCII strings (and that's exactly how Babel's built-in Python and JavaScript extractors work). This fix ensures the Mako extractor doesn't escape non-ASCII symbols, works well for both Unicode and non-Unicode input (there is a test for cp1251 encoding), and also provides a workaround for the Babel charset detector issue python-babel/babel#274.
@imankulov Can you rebase this, please?
If the first line of a python file is not a valid latin-1 string, parse_encoding dies with "UnicodeDecodeError". These strings nonetheless can be valid in some scenarios (for example, Mako extractor uses babel.messages.extract.extract_python), and it makes more sense to ignore this exception and return None.
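The "ignore the exception and return None" behaviour could look roughly like the sketch below. It is not Babel's `parse_encoding`: the name `sniff_encoding` and its list-of-lines argument are hypothetical, and `ast.parse` stands in for `parser.suite` because the `parser` module was removed in Python 3.10.

```python
import ast
import re

# PEP 263 magic comment, matched against raw bytes.
MAGIC_COMMENT = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def sniff_encoding(lines):
    """Return the declared source encoding, or None if none is found.

    `lines` is a list of raw byte lines; only the first two are inspected.
    """
    line1 = lines[0] if lines else b""
    match = MAGIC_COMMENT.match(line1)
    if match is None and len(lines) > 1:
        try:
            # Latin-1 accepts any byte, so the decode itself cannot fail.
            ast.parse(line1.decode("latin-1"))
        except (SyntaxError, ValueError):
            # Line 1 is not a complete statement (line 2 may be its
            # continuation) or is not Python at all. UnicodeEncodeError
            # is a ValueError subclass, so it is swallowed here too.
            return None
        match = MAGIC_COMMENT.match(lines[1])
    return match.group(1).decode("latin-1") if match else None
```

For example, `sniff_encoding([b"#!/usr/bin/env python\n", b"# coding=utf-8\n"])` gives `"utf-8"`, while a first line that cannot be parsed gives `None` instead of propagating an exception to the caller.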
258cb37 to f9b04a5
@akx Done.
Thank you, this looks good!
@akx Awesome! Thanks a lot for merging it.