po2sub gets encoding wrong and fails #3827

afranke · 2018-09-11T18:00:35Z

I installed translate-toolkit 2.3.0 via pip, but I see the same problem with 2.0.0b5 and with the Fedora 28 package as well.

po2sub -t gnome330.srt po/fr.po fr.srt fails with

po2sub: WARNING: Error processing: input po/fr.po, output fr.srt, template gnome330.srt: 'latin-1' codec can't encode character '\u2019' in position 636: ordinal not in range(256)

For some reason it thinks it’s latin-1 when it should be utf-8.
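The error itself can be reproduced without po2sub. The following is only a minimal sketch, assuming the text is being encoded with latin-1 somewhere in the pipeline; the sample string is a shortened fragment chosen for illustration, and the variable names are made up.

```python
# U+2019 (right single quotation mark) has no latin-1 representation,
# so encoding text that contains it with latin-1 raises an error.
text = "l\u2019optimisation"

try:
    text.encode("latin-1")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)

# The message has the same shape as the po2sub warning.
print(error)

# UTF-8 can represent the same character without trouble.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```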

I’m attaching the files so you can try it yourself.

Toub · 2018-09-25T06:29:02Z

Edit: my bad, this is an old bug #3601

I also have the same problem with v3.2.0:

 xliff2po: WARNING: Error processing: input src/assets/i18n/messages.en.xlf, output None, template None: 'ascii' codec can't encode character '\xe9' in position 70: ordinal not in range(128)

My file is UTF-8 encoded, not ascii.

Everything is fine with v2.2.5, so this is a regression.

queengooborg · 2018-12-18T08:29:19Z

Looking into this more, it seems that @afranke's and @Toub's bugs are separate.

@afranke, this seems to be an issue with aeidon.encodings.detect, or rather with its dependency chardet, which assumes the subtitle file is in "ISO-8859-1" (aka latin-1), rather than with translate-toolkit itself. It looks like chardet has had an open issue since 2017 describing detection problems when a file contains only one non-ASCII character. Adding more non-ASCII characters to the subtitle file seemed to fix the issue.

For @Toub's problem, it's an old bug that's affecting more converters than just xliff2po. web2py2po had the same issue.

rffontenelle · 2020-09-18T20:19:18Z

I've been debugging this (not for 2 years, btw) and I have 2 notes on this issue:

1- While chardet fails to detect lone characters like … and ’ as UTF-8, the same doesn't happen with a whole sentence. I don't know why at the moment. Here is an example with a string containing ’ from the OP's fr.po file:

>>> import chardet
>>> chardet.detect('Ce cycle, GNOME Shell a reçu une attention particulière sur l’optimisation des performances.'.encode('utf-8'))
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}

2- An ugly workaround for this issue is to force the desired encoding instead of relying on translate-toolkit to detect it. For example, take the following patch:

--- translate/storage/subtitles.py-orig	2020-09-18 09:54:47.875337951 -0300
+++ translate/storage/subtitles.py	2020-09-18 09:55:55.799517988 -0300
@@ -105,7 +105,7 @@
 
     def _parse(self):
         try:
-            self.encoding = detect(self.filename)
+            self.encoding = "utf-8"
             self._format = determine(self.filename, self.encoding)
             self._subtitlefile = new(self._format, self.filename, self.encoding)
             for subtitle in self._subtitlefile.read():

and run it with a command like:

$ patch -p3 venv/lib/python3.8/site-packages/translate/storage/subtitles.py < force-utf-8.patch

With the patch applied, po2sub's conversion works, but one might want to check that the po file really is UTF-8 beforehand, e.g. using file fr.po.
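The same check can be done from Python with a strict decode. This is a rough sketch using only the standard library; is_valid_utf8 and file_is_utf8 are hypothetical helper names, not part of translate-toolkit.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def file_is_utf8(path: str) -> bool:
    """Read a file in binary mode and test its content as UTF-8."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())
```

Calling file_is_utf8("fr.po") before forcing "utf-8" in the patch gives a cheap sanity check. Note that pure ASCII content also passes, since ASCII is a subset of UTF-8.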

nijel · 2020-12-22T14:34:39Z

Yes, chardet is not always reliable in detecting UTF-8; for example, see chardet/chardet#148.
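One way to sidestep unreliable detection is to try strict UTF-8 first and only consult a detector when that fails. The sketch below illustrates the idea and is not what translate-toolkit or aeidon actually do; choose_encoding is a hypothetical name, and detect stands for any callable that returns an encoding name (for instance a wrapper around chardet).

```python
from typing import Callable


def choose_encoding(data: bytes, detect: Callable[[bytes], str]) -> str:
    """Prefer UTF-8 when the bytes are valid UTF-8, else ask the detector."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the supplied detector.
        return detect(data)
    return "utf-8"
```

The trade-off is that text in another encoding which happens to be valid UTF-8 is reported as UTF-8. In practice that is rare for text containing non-ASCII bytes.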

Toub mentioned this issue Sep 25, 2018

2.3.1 release ? #3829

Closed

queengooborg mentioned this issue Dec 18, 2018

Wrong UTF-8 detection chardet/chardet#134

Open
