wmlxgettext: fix UTF-8 issue: #1785 #1793
utils/pywmlx/state/machine.py
Outdated
except UnicodeDecodeError as e:
    errpos = int(e.start) # error position on file object with UTF-8 error
    errbval = hex(e.object[errpos]) # value of byte wich causes UTF-8 error
    # well... when exception occurred, the _current_lineno valie
Typo: "value"
I feel like it should be possible to get the line number without reprocessing the whole file... why exactly is
Actually, IMHO Nobun/AncientLich's solution seems to be the correct one. Once the UnicodeDecodeError exception is raised, you don't get a file object but an exception object instead, whose .object member holds the bytes that were being decoded.
Because it's a global variable, which is used by several functions (if I remember correctly).
@Elvish hunter: I confirm: _current_lineno is a global variable of the state machine in pywmlx/state/machine.py. Even though _current_lineno should be updated line by line in the try block, when the exception occurs it is handled before all other errors. In fact, if you add a WML error on purpose BEFORE line 272 (for example, at line 3), you would expect the first error reported to be the WML error and not the UTF-8 error. But it is not: the exception is reported as the first error, and the try block is simply considered failed.
I suspect that, when the exception happens, the failed try block is somehow treated as "not executed code", which is why _current_lineno is not changed as you would normally expect. This is why I did the workaround, exactly as described by Elvish Hunter.
Thanks for correcting the typo, CelticMinstrel.
@Elvish-Hunter: We can add something which says 'use your text editor, not a hex editor'... I will think about it. Would this change be nice?
Any suggestion is welcome: I will wait a bit before making another commit with the error message modifications.
I guess it's something like the entire file is read when the line iterator is invoked, and the unicode decode error happens there, before the loop body even has a chance to execute. Oh well.
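For readers following the line-number discussion, here is a minimal sketch of one way to locate the offending line independently of the text-mode iterator. It is not the code from this pull request: the function name `find_invalid_utf8` is hypothetical. The idea is to read the file in binary mode, so that no decoding happens behind the scenes, and to decode each line separately.

```python
def find_invalid_utf8(path):
    """Return (line_number, byte_offset_in_line, hex_value) for the first
    invalid UTF-8 byte in the file, or None if the file decodes cleanly.

    Hypothetical helper for illustration; not part of wmlxgettext.
    """
    # Binary mode: iteration splits on b'\n' and performs no decoding,
    # so a bad byte cannot raise before we know which line we are on.
    with open(path, 'rb') as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as e:
                # e.start is the offset of the bad byte within this line
                return lineno, e.start, hex(raw[e.start])
    return None
```

Because each line is decoded on its own, the line counter is always known at the moment the exception is raised. The cost is a second pass over the file, which only happens on the error path.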
Error message reviewed as follows:
I only now noticed that I didn't mention the "don't use a BOM when re-saving the file" point raised by Elvish Hunter. Should it be added?
Now the -o parameter is mandatory. For basic usage, in fact, it is better to avoid output redirection, which may lead to text encoding problems. This way UMC developers will learn to write pot files using the -o parameter instead of using output redirection as they did in the past with the Perl wmlxgettext. However, writing the output to stdout is still possible if a person really wants to do it on purpose, by setting the -o parameter to "-" as in the example shown here:
Note: I forgot this in the previous commits, but I have updated the wmlxgettext version number to 2017.06.25.py3
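As an illustration of the behaviour described above (a mandatory -o, with "-" meaning stdout), here is a rough argparse sketch. It is not the actual wmlxgettext code: the program name, the `outfile` destination, the help text, and the `write_pot` helper are all assumptions made for the example.

```python
import argparse
import sys


def build_parser():
    # Hypothetical parser; only the -o option is modelled here.
    parser = argparse.ArgumentParser(prog="wmlxgettext-sketch")
    parser.add_argument(
        "-o", dest="outfile", required=True,
        help='Destination file. Use "-o -" to write to STDOUT on purpose.')
    return parser


def write_pot(text, outfile):
    """Write text to outfile, or to stdout when outfile is "-"."""
    if outfile == "-":
        sys.stdout.write(text)
    else:
        # Opening the file ourselves lets us pin the encoding, which
        # shell redirection of stdout does not guarantee.
        with open(outfile, "w", encoding="utf-8") as f:
            f.write(text)
```

With `required=True`, argparse rejects a command line that omits -o and exits with a usage error, so stdout output only happens when the user explicitly asks for it.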
utils/wmlxgettext
Outdated
help= ('Destination file. In some special situation you could want to '
       'write the output to STDOUT write an actual file. In that case '
       'you can use "-o -" to write the pot file to STDOUT on purpose '
       '(wich it is something you normally would avoid, becouse it can '
Typo: "which". Also, I assume you mean "to STDOUT instead of an actual file" rather than "write" there.
The text is also slightly awkward-sounding to a native speaker. Some possible fixes:
- Change "you could want" to "you might want" or "you may want".
- Add a comma after "In that case".
- "which is something you would normally avoid" - basically swapping two words and removing the "it"
Thanks for the corrections:
Note: I don't know how to display part of the published code, so I used a code quotation
Some corrections:
utils/wmlxgettext
Outdated
'to write the output to STDOUT instead of writing '
'an actual file. In that case, you can use "-o -" to write '
'the pot file to STDOUT on purpose '
'(wich is something you would normally avoid, becouse it can '
You have a double "you" on the first line and misspelled "which" in the second-to-last line. I'd also add "to" on the end of the second line, but that might be more debatable.
Sorry for the late reply. I changed the -o help text a bit. Please tell me if the new text is better and if there are other language errors to fix.
Does this look okay to merge, @Elvish-Hunter?
It looks good to me.
This code should fix the issue reported at
#1785
I tested my code with an older file from my own campaign, which contained an invalid UTF-8 byte:
https://github.com/AncientLich/wmlxgettext-unoff/blob/master/wmlxgettext/test/reports/2017.06.15/05_The_Underground_Path.cfg
The invalid byte is located in the comment at line 272
That is an Italian comment (the meaning is not important). What is important is that the 'è' character wasn't valid UTF-8, because I wrote that file with the wrong text encoding while I was a Windows user (now I use Ubuntu).
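To make the encoding mismatch concrete, here is a small sketch. It assumes the file was saved as Windows-1252 (cp1252); the original encoding is not stated in this thread, so treat that as a guess. The sample bytes are invented and are not taken from the campaign file.

```python
# Assumption: the legacy file was saved as cp1252, where 'è' is one byte.
legacy = 'è'.encode('cp1252')   # single byte 0xE8
proper = 'è'.encode('utf-8')    # two bytes 0xC3 0xA8


def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Invented sample line: a WML-style comment containing the legacy byte.
sample = b'# questo ' + legacy + b' un commento\n'
```

In UTF-8, the byte 0xE8 announces a three-byte sequence, so when it is followed by a plain ASCII space the decoder rejects it. That is the kind of failure the new error message reports.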
Now the code handles the UTF-8 decode error and shows a proper error message, which is, in this case:
You will notice 2 commits. In the first one I forgot to delete some debugging code (it creates a temporary file which I used to understand how to display all the information I need from the exception class). This file should never be created during actual usage of wmlxgettext.