Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wmlxgettext: fix UTF-8 issue: #1785 #1793

Merged
merged 6 commits into from
Aug 1, 2017
Merged

wmlxgettext: fix UTF-8 issue: #1785 #1793

merged 6 commits into from
Aug 1, 2017

Conversation

AncientLich
Copy link
Contributor

@AncientLich AncientLich commented Jun 15, 2017

This code should fix the issue reported at
#1785

I tested my code with an older file of my own campaign wich contained an invalid UTF-8 byte

https://github.com/AncientLich/wmlxgettext-unoff/blob/master/wmlxgettext/test/reports/2017.06.15/05_The_Underground_Path.cfg

The invalid byte is located on the comment at line 272

# da modificare - devo aggiungerci la condizione "se il turno 23 non è stato raggiunto"

That is an italian comment (the meaning is not important). What it is import is that the 'è' character wasn't UTF-8 compatible, becouse I wrote that file with a wrong text encoding while I was a windows user (now I use Ubuntu).

Now the code manage the UTF-8 decode error, and will show a proper error message, wich is, in this case:

error: DrakeStory/scenarios/05_The_Underground_Path.cfg:272: UTF-8 Format error.
Can't decode byte 0xe8 (invalid continuation byte).
You must edit the file, replacing that byte with a valid UTF-8 character.

Text preceding the invalid byte (line 272):
# da modificare - devo aggiungerci la condizione "se il turno 23 non

You can notice 2 commits. On the first one I forgot to delete some debugging code (create a temporary file wich I used to understand how to display all the informations I need from the exception class). This file should never be created on an actual usage of wmlxgettext.

@CelticMinstrel CelticMinstrel changed the title wmlxgettext: fix UTF-8 issue: https://github.com/wesnoth/wesnoth/issu… wmlxgettext: fix UTF-8 issue: #1793 Jun 15, 2017
@CelticMinstrel CelticMinstrel changed the title wmlxgettext: fix UTF-8 issue: #1793 wmlxgettext: fix UTF-8 issue: #1783 Jun 15, 2017
@CelticMinstrel CelticMinstrel changed the title wmlxgettext: fix UTF-8 issue: #1783 wmlxgettext: fix UTF-8 issue: #1785 Jun 15, 2017
except UnicodeDecodeError as e:
errpos = int(e.start) # error position on file object with UTF-8 error
errbval = hex(e.object[errpos]) # value of byte wich causes UTF-8 error
# well... when exception occurred, the _current_lineno valie
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "value"

@CelticMinstrel
Copy link
Member

I feel like it should be possible to get the line number without reprocessing the whole file... why exactly is _current_lineno set when the exception occurs?

@Elvish-Hunter
Copy link
Contributor

Elvish-Hunter commented Jun 15, 2017

I feel like it should be possible to get the line number without reprocessing the whole file...

Actually, IMHO Nobun/AncientLich's solutions seems the correct one. Once the UnicodeDecodeError exception is raised, you don't get a file object, but an exception object instead, which contains the decoded text inside its .object member.

why exactly is _current_lineno set when the exception occurs?

Because it's a global variable, which is used by several functions (if I remember correctly).
@AncientLich: I think that the new error message should be changed to something simpler. I wouldn't be surprised if, after reading it, some user attempted to use a hex editor just to fix the file. Maybe you should suggest to open the file with a text editor and save it again with UTF-8 encoding without BOM.

@AncientLich
Copy link
Contributor Author

AncientLich commented Jun 15, 2017

@Elvish hunter: I confirm: _current_lineno is global variable of state machine pywmlx/state/machine.py.
@CelticMinstrel: on the test I did, without the workaround, _current_lineno is 0, even if the error is located on line 272 of text file.

Even if _current_lineno should be updated line by line in the try block, when the exception occurs, the exception is managed before all other errors. Infact, if you add a WML error on purpose BEFORE line 272 (example: line 3) you could expect that the first error reported should the wml error and not the UTF-8 error. BUT NOT. The exception is considered the first error, and the try block is considered simply failed.

I suspect that, when happening the exception, the failed try block is somehow managed as "not executed code", so why the _current_lineno is not changed as you would normally expect.

This is why I did the workaround, exactly as described by Elvish Hunter. Thank for correcting the typo, CelticMinstrel


@Elvish-Hunter: We can add something wich says 'use your text editor, not Hex editor... I will think about it. This change could be nice?

error: DrakeStory/scenarios/05_The_Underground_Path.cfg:272: UTF-8 Format error.
Can't decode byte 0xe8 (invalid continuation byte).
You must edit the file (WITH A TEXT EDITOR), replacing that byte with a valid UTF-8 character.
Before editing the file, check your text editor settings: 'text encoding' must be setted to 'utf-8'.

Text preceding the invalid byte (source file, line 272):
# da modificare - devo aggiungerci la condizione "se il turno 23 non

Any suggestion is well accepted: I will wait a bit before making another commit with error message modifications

@CelticMinstrel
Copy link
Member

I guess it's something like the entire file is read when the line iterator is invoked, and the unicode decode error happens there, before the loop body even has a chance to execute. Oh well.

@AncientLich
Copy link
Contributor Author

AncientLich commented Jun 17, 2017

Error message reviewed as follows:

error: DrakeStory/scenarios/05_The_Underground_Path.cfg:272: UTF-8 Format error.
Can't decode byte 0xe8 (invalid continuation byte).
Probably your file is not encoded with UTF-8 encoding: you should open the file with an advanced text editor, and re-save it with UTF-8 encoding.
To avoid this problem in the future, you might want to set the default encoding of your editor to UTF-8.

Text preceding the invalid byte (source file, line 272):
# da modificare - devo aggiungerci la condizione "se il turno 23 non

I noticed only now I didn't mentioned the thing of "not use BOM when re-save file" marked by Elvish Hunter. Should it be added?

@AncientLich
Copy link
Contributor Author

AncientLich commented Jun 25, 2017

Now the -o parameter is mandatory. On a basic usage, infact, it is better to avoid output redirection, wich may lead to text encoding problems. This way UMC developers would learn to use write pot files using the -o parameter instead of using output redirection like they used in past with the perl wmlxgettext. However, writing output to stdout is still possible if a person really want to do it on purpose, setting the -o parameter to "-" like the example showed here:

./wmlxgettext -o - -domain=... --directory=. --recursive | application_receiving_stdout_from_wmlxgettext

Note: I forgot in the previous commits, but I updated the wmlxgettext version number to 2017.06.25.py3

help= ('Destination file. In some special situation you could want to '
'write the output to STDOUT write an actual file. In that case '
'you can use "-o -" to write the pot file to STDOUT on purpose '
'(wich it is something you normally would avoid, becouse it can '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "which". Also, I assume you mean "to STDOUT instead of an actual file" rather than "write" there.

The text is also slightly awkward-sounding to a native speaker. Some possible fixes:

  • Change "you could want" to "you might want" or "you may want".
  • Add a comma after "In that case".
  • "which is something you would normally avoid" - basically swapping two words and removing the "it"

@AncientLich
Copy link
Contributor Author

AncientLich commented Jun 27, 2017

Thank for corrections:

help= ('Destination file. In some special situation you you might want ' 'to write the output to STDOUT instead of writing ' 'an actual file. In that case, you can use "-o -" to write ' 'the pot file to STDOUT on purpose ' '(wich is something you would normally avoid, becouse it can ' 'lead to encoding problems). [**REQUIRED ARGUMENT**]')

Note: I Don't know how to display part of the code published, so I used the code quotation

@Elvish-Hunter
Copy link
Contributor

Some corrections:

  1. "In that case" -> "In this case"
  2. "wich" -> "which"
  3. "becouse" -> "because"

'to write the output to STDOUT instead of writing '
'an actual file. In that case, you can use "-o -" to write '
'the pot file to STDOUT on purpose '
'(wich is something you would normally avoid, becouse it can '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have a double "you" on the first line and misspelled "which" in the second-to-last line. I'd also add "to" on the end of the second line, but that might be more debatable.

@AncientLich
Copy link
Contributor Author

AncientLich commented Jul 1, 2017

Sorry if I replied late. I changed the -o help text a bit. Please, tell me if the new text is better and if there are other language errors to fix.

@CelticMinstrel
Copy link
Member

Does this look okay to merge, @Elvish-Hunter ?

@CelticMinstrel CelticMinstrel added this to the 1.13.9 milestone Jul 26, 2017
@Elvish-Hunter
Copy link
Contributor

It looks good to me.

@CelticMinstrel CelticMinstrel merged commit 3f287b9 into wesnoth:master Aug 1, 2017
@AncientLich AncientLich deleted the wmlxgettext branch August 16, 2017 06:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants