PO file with utf-8 BOM signature at beginning of file will not be recognized as translatable #1640

Closed
transl8bzimport opened this issue Aug 21, 2010 · 3 comments

@transl8bzimport

commented Aug 21, 2010

Version: trunk

Originally posted by langtechie:

(Tested on Firefox 3.6.8)

When a PO file with a UTF-8 BOM signature at the beginning of the file is uploaded, Pootle says there are no words to translate, even though the file has actual strings to translate.

Remove the UTF-8 BOM and re-upload, and Pootle will recognize the words to translate.

Expected: utf-8 files with BOM signature should be supported in Pootle as it is a common practice.

@transl8bzimport

Author

commented Aug 21, 2010

Originally posted by langtechie:

Created attachment 684

This is the sample PO file with utf-8 BOM signature

@friedelwolff

Member

commented Aug 21, 2010

Thank you for mentioning this idea. To my best knowledge, none of the official gettext tools support BOMs, and no conforming PO editor should create a file with a BOM. Therefore I don't really consider this common practice. So this seems to be about handling broken files.

If we do handle the BOM, how do we handle a mismatch between the BOM and the encoding specified in the file header? Also, if the BOM indicates UTF-16 (which is not a valid PO encoding as far as I know), what should we do?
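One way to make these two questions concrete is a small check that classifies the leading bytes and compares them with the `charset=` value declared in the PO header. This is only a sketch: the function names and the result labels (`no-bom`, `ok`, `mismatch`, `unsupported-bom`) are hypothetical, nothing in Pootle or the Translate Toolkit is claimed to work this way, and what to do with each outcome remains the open policy question raised above.

```python
import codecs
import re

# UTF-32 BOMs must be tested before UTF-16, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the charset in a header line such as
# "Content-Type: text/plain; charset=UTF-8\n"
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)")


def bom_encoding(data: bytes):
    """Return the encoding family implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def header_charset(data: bytes):
    """Return the first charset declared in the raw bytes, or None."""
    match = CHARSET_RE.search(data)
    return match.group(1).decode("ascii") if match else None


def normalize(name: str) -> str:
    """Map aliases such as 'UTF8' to Python's canonical codec name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def check_bom(data: bytes) -> str:
    """Classify a PO file's BOM against its declared charset (hypothetical labels)."""
    bom = bom_encoding(data)
    if bom is None:
        return "no-bom"
    if bom != "utf-8":
        return "unsupported-bom"  # e.g. UTF-16, not a valid PO encoding
    declared = header_charset(data)
    if declared is not None and normalize(declared) != "utf-8":
        return "mismatch"
    return "ok"
```

A caller could, for example, strip the BOM only for `ok` and reject or warn for the other two BOM cases; that choice is an assumption of this sketch, not something the thread settles.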

We plan to rely entirely on the parser in the gettext package in future, so it might become even harder to do anything different from the gettext package at that stage. So I'm not sure if this is necessarily a good idea. We'll have to think about it a bit more.

@dwaynebailey

Member

commented Aug 21, 2010

(In reply to BZ-IMPORT::comment #0)

Expected: utf-8 files with BOM signature should be supported in Pootle as it is
a common practice.

[dwayne@db storage]$ msgcat thirdwheel_django_sample.po
thirdwheel_django_sample.po:1:2: syntax error
msgcat: found 1 fatal error

The BOM causes Gettext tools to fail. I think it is a common Windows practice, but quoting Wikipedia: "While Unicode standard allows BOM in UTF-8, it does not require or recommend it. Byte order has no meaning in UTF-8."

(In reply to BZ-IMPORT::comment #2)

Thank you for mentioning this idea. To my best knowledge, none of the official
gettext tools support BOMs, and no conforming PO editor should create a file
with a BOM. Therefore I don't really consider this common practice. So this
seems to be about handling broken files.

Yes, it's a broken file. I tested with poedit: it opens the file without complaining but removes the BOM on saving.

If we do handle the BOM, how do we handle a mismatch between the BOM and the
encoding specified in the file header? Also, if the BOM indicates UTF-16
(which is not a valid PO encoding as far as I know), what should we do?

Yes, UTF-16 is invalid (checked with msgconv). I would say that simply removing the byte sequence 0xEF, 0xBB, 0xBF if it appears in a PO file would be a good workaround for broken PO files.
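The workaround described here can be sketched in a few lines of Python. This is an illustration, not the code that was eventually committed: the helper name `strip_utf8_bom` is hypothetical, and the sketch assumes the sequence is only removed from the very start of the input, since that is where a BOM sits.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical input: a minimal PO header preceded by a BOM.
sample = codecs.BOM_UTF8 + b'msgid ""\nmsgstr ""\n'
cleaned = strip_utf8_bom(sample)
```

When the input is decoded to text rather than handled as bytes, Python's built-in `utf-8-sig` codec has the same effect: it drops a leading BOM if present and otherwise behaves like `utf-8`.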

We plan to rely entirely on the parser in the gettext package in future, so it
might become even harder to do anything different from the gettext package at
that stage. So I'm not sure if this is necessarily a good idea. We'll have to
think about it a bit more.

I think we should approach it like poedit. Although it's broken, there is a very real possibility that someone unwittingly edits a PO file on a platform that adds BOMs.

No amount of hand waving will detract from a bad user experience.

@nijel nijel self-assigned this Jan 27, 2019

nijel added a commit to nijel/translate that referenced this issue Jan 27, 2019

Gettext: String UTF-8 BOM from input
Such file is not conforming with how Gettext tools behave, but
apparently some editors create such files.

Fixes translate#1640

nijel added a commit to nijel/translate that referenced this issue Jan 27, 2019

Gettext: Strip UTF-8 BOM from input
Such file is not conforming with how Gettext tools behave, but
apparently some editors create such files.

Fixes translate#1640

@nijel nijel closed this in #3868 Jan 27, 2019

nijel added a commit that referenced this issue Jan 27, 2019

Gettext: Strip UTF-8 BOM from input
Such file is not conforming with how Gettext tools behave, but
apparently some editors create such files.

Fixes #1640