po2sub gets encoding wrong and fails #3827

Open
afranke opened this issue Sep 11, 2018 · 4 comments
Comments

afranke commented Sep 11, 2018

I installed translate-toolkit 2.3.0 via pip, but I see the same problem with 2.0.0b5 and with the Fedora 28 package as well.

po2sub -t gnome330.srt po/fr.po fr.srt fails with

po2sub: WARNING: Error processing: input po/fr.po, output fr.srt, template gnome330.srt: 'latin-1' codec can't encode character '\u2019' in position 636: ordinal not in range(256)

For some reason it thinks it’s latin-1 when it should be utf-8.

I’m attaching the files so you can try it yourself.
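
The error message itself can be reproduced in isolation, without translate-toolkit. The snippet below is only a minimal sketch of the symptom, assuming the text is being encoded as latin-1 somewhere in the pipeline; the sample string is illustrative and not taken from the attached files.

```python
# U+2019 (right single quotation mark) has no latin-1 representation,
# so encoding it as latin-1 raises the same kind of error po2sub reports.
text = "l\u2019optimisation"

try:
    text.encode("latin-1")
    message = ""
except UnicodeEncodeError as exc:
    message = str(exc)

print(message)

# The same text encodes without trouble as UTF-8 (U+2019 takes three bytes).
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```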

Toub commented Sep 25, 2018

Edit: my bad, this is an old bug #3601

I also have the same problem with v3.2.0:

 xliff2po: WARNING: Error processing: input src/assets/i18n/messages.en.xlf, output None, template None: 'ascii' codec can't encode character '\xe9' in position 70: ordinal not in range(128)

My file is UTF-8 encoded, not ascii.

Everything is fine with v2.2.5, so this is a regression.

@Toub Toub mentioned this issue Sep 25, 2018
Contributor

queengooborg commented Dec 18, 2018

Looking into this more, it seems that @afranke's and @Toub's bugs are separate.

@afranke, this seems to be an issue with aeidon.encodings.detect, or rather with its dependency chardet, rather than with translate-toolkit itself: chardet assumes the subtitle file is in "ISO-8859-1" (aka latin-1). chardet has had an issue open since 2017 that describes detection problems when the input contains only one non-ASCII character. Adding more non-ASCII characters to the subtitle file seemed to fix the issue.
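
One common way to sidestep a statistical detector for this case is to try a strict UTF-8 decode first and only fall back when it fails. This is a sketch of that idea, not what aeidon or translate-toolkit actually do; the function name `guess_encoding` and the `fallback` parameter are hypothetical.

```python
def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Return "utf-8" if data is valid UTF-8, otherwise the fallback name.

    Hypothetical helper: a strict decode either succeeds or raises, so it
    does not depend on how many non-ASCII characters the input contains.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


# A single non-ASCII character is enough for a definite answer.
print(guess_encoding("l\u2019optimisation".encode("utf-8")))
# Bytes that are not valid UTF-8 get the fallback.
print(guess_encoding("re\u00e7u".encode("latin-1")))
```

Pure-ASCII input is also valid UTF-8, so it is reported as "utf-8" here; the fallback is only a label and is never verified.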

@Toub's problem is an old bug that affects more converters than just xliff2po; web2py2po had the same issue.

Contributor

rffontenelle commented Sep 18, 2020

I've been debugging this (not for 2 years, btw) and I have 2 notes on this issue:

1- While chardet fails to detect single non-ASCII characters as UTF-8, the same doesn't happen with a whole sentence. I don't know why at the moment. Here is an example with a string from the OP's fr.po file:

>>> import chardet
>>> chardet.detect('Ce cycle, GNOME Shell a reçu une attention particulière sur l’optimisation des performances.'.encode('utf-8'))
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}

2- An ugly workaround for this issue is to force the desired encoding instead of relying on translate-toolkit to detect it. For example, take the following patch:

--- translate/storage/subtitles.py-orig	2020-09-18 09:54:47.875337951 -0300
+++ translate/storage/subtitles.py	2020-09-18 09:55:55.799517988 -0300
@@ -105,7 +105,7 @@
 
     def _parse(self):
         try:
-            self.encoding = detect(self.filename)
+            self.encoding = "utf-8"
             self._format = determine(self.filename, self.encoding)
             self._subtitlefile = new(self._format, self.filename, self.encoding)
             for subtitle in self._subtitlefile.read():

and run it with a command like:

$ patch -p3 venv/lib/python3.8/site-packages/translate/storage/subtitles.py < force-utf-8.patch

With the patch applied, po2sub's conversion works, but one might want to check that the po file is UTF-8 beforehand, e.g. using file fr.po.
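
As an alternative to file, the charset that a PO file declares in its header can be read directly. This is a rough sketch, assuming the file carries the usual gettext "Content-Type: text/plain; charset=..." header line; the helper name `declared_po_charset` and the sample bytes are made up for illustration.

```python
import re
from typing import Optional

# Matches the charset value in a header such as
# "Content-Type: text/plain; charset=UTF-8\n"
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)")


def declared_po_charset(po_bytes: bytes) -> Optional[str]:
    """Return the lower-cased charset named in the PO header, or None."""
    match = CHARSET_RE.search(po_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Illustrative header, not copied from the OP's fr.po.
sample = b'msgid ""\nmsgstr ""\n"Content-Type: text/plain; charset=UTF-8\\n"\n'
print(declared_po_charset(sample))
```

For a real file, the bytes could be read with `open("fr.po", "rb").read()`. The declared charset is only a claim made by the header, so a strict `decode("utf-8")` on the same bytes is still a useful second check.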

Member

nijel commented Dec 22, 2020

Yes, chardet is not always reliable at detecting UTF-8; see, for example, chardet/chardet#148.
