New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Charset confusion in prerelease #264

Closed
mgedmin opened this Issue Feb 21, 2018 · 8 comments

Comments

Projects
None yet
3 participants
@mgedmin
Copy link
Contributor

mgedmin commented Feb 21, 2018

I was cutting the 2.17.0 release of irclog2html, and saw this:

$ prerelease
...
Changelog entries for version 2.17.0:

2.17.0 (unreleased)
-------------------

- Support `ii <https://tools.suckless.org/ii/>`_ logs which use Unix
  timestamps (https://github.com/mgedmin/irclog2html/pull/21).
  Pull request by Cédric Krier.
Enter version [2.17.0]: 

Note how the changelog text was misidentified as Latin-1 and displayed wrong. The file itself is UTF-8:

$ head -n 10 CHANGES.rst 
Changelog
=========

2.17.0 (2018-02-21)
-------------------

- Support `ii <https://tools.suckless.org/ii/>`_ logs which use Unix
  timestamps (https://github.com/mgedmin/irclog2html/pull/21).
  Pull request by Cédric Krier.

Luckily the text is only displayed wrong, not actually written wrong into the file.

My locale settings are correct (locale charmap prints UTF-8). I'm using Python 2. prerelease --version prints prerelease: error: unrecognized arguments: --version, but anyway I just did a pip install -U zest.releaser so it's the latest.

@mauritsvanrees

This comment has been minimized.

Copy link
Member

mauritsvanrees commented Mar 10, 2018

To detect the encoding, on Python 2 we use chardet and on Python 3 we use tokenize. See https://github.com/zestsoftware/zest.releaser/blob/6.13.5/zest/releaser/utils.py#L93-L108

In debug mode, I see this with Python 2:

$ prerelease -v
...
DEBUG: Checking CHANGES.rst
DEBUG: EUC-TW Taiwan prober hit error at byte 213
DEBUG: utf-8  confidence = 0.505
DEBUG: SHIFT_JIS Japanese confidence = 0.01
DEBUG: EUC-JP Japanese confidence = 0.01
DEBUG: GB2312 Chinese confidence = 0.01
DEBUG: EUC-KR Korean confidence = 0.01
DEBUG: CP949 Korean confidence = 0.01
DEBUG: Big5 Chinese confidence = 0.01
DEBUG: EUC-TW not active
DEBUG: windows-1251 Russian confidence = 0.01
DEBUG: KOI8-R Russian confidence = 0.01
DEBUG: ISO-8859-5 Russian confidence = 0.01
DEBUG: MacCyrillic Russian confidence = 0.01
DEBUG: IBM866 Russian confidence = 0.01
DEBUG: IBM855 Russian confidence = 0.01
DEBUG: ISO-8859-7 Greek confidence = 0.01
DEBUG: windows-1253 Greek confidence = 0.01
DEBUG: ISO-8859-5 Bulgairan confidence = 0.01
DEBUG: windows-1251 Bulgarian confidence = 0.01
DEBUG: TIS-620 Thai confidence = 0.0
DEBUG: ISO-8859-9 Turkish confidence = 0.512703226996
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: windows-1255 Hebrew confidence = 0.0
DEBUG: Detected encoding of CHANGES.rst with chardet: ISO-8859-1

Okay, that looks a bit weird. I see utf-8 at 0.505 and ISO-8859-9 Turkish a tiny bit higher at 0.512703226996. But I don't see the reported ISO-8859-1. I don't know where that one comes from. But at least utf-8 is not the best scoring encoding here, so I can imagine that this fails.

chardet also comes with a command line utility. Let's see:

$ chardetect CHANGES.rst 
CHANGES.rst: ISO-8859-1 with confidence 0.73

So chardet really wrongly detects it. Sounds like this bug: chardet/chardet#138

When I try prerelease -v with Python 3, I get this:

DEBUG: Detected encoding of CHANGES.rst with tokenize: utf-8
@mauritsvanrees

This comment has been minimized.

Copy link
Member

mauritsvanrees commented Mar 10, 2018

I wonder if it makes sense to reorder our detection code and first try looking for encoding hints before using chardet or tokenize.

mauritsvanrees added a commit that referenced this issue Mar 23, 2018

When determining encoding, first look for coding hints in the file it…
…self.

Only when that fails, we try `tokenize` or `chardet`.
Fixes #264
@mauritsvanrees

This comment has been minimized.

Copy link
Member

mauritsvanrees commented Mar 23, 2018

I wonder if it makes sense to reorder our detection code and first try looking for encoding hints before using chardet or tokenize.

I made PR #272 that does this. If you add a marker for the encoding at the top of CHANGES.rst it will work:

# -*- coding:utf-8

Wait, that gives a ReStructuredText error: 'Inline emphasis start-string without end-string.' And the line would appear literally in the rendered html if it would work... Try this:

.. # coding=utf-8

I think that is the best we can do, given that chardet gives us the wrong encoding.
Does this seem workable?

mauritsvanrees added a commit that referenced this issue Mar 24, 2018

When determining encoding, first look for coding hints in the file it…
…self.

Only when that fails, we try `tokenize` or `chardet`.
Fixes #264
@mgedmin

This comment has been minimized.

Copy link
Contributor Author

mgedmin commented Mar 24, 2018

Personally I'd prefer a [zest.releaser] config option in setup.cfg for overriding the magic charset detection bits. changelog_encoding = UTF-8 or something like that.

@reinout

This comment has been minimized.

Copy link
Collaborator

reinout commented Mar 25, 2018

That would be a useful safety valve!

@mauritsvanrees

This comment has been minimized.

Copy link
Member

mauritsvanrees commented Mar 26, 2018

That could help.
It would not be only for the changelog though, but also for setup.py or whatever other file we read or write.
So in our read_text_file function, the best order would probably be:

  1. Look for encoding hints in the file itself (# -*- coding:utf-8).
  2. Look for an encoding setting in setup.cfg/.pypirc.
  3. Try tokenize/chardet.

I'll see what I can do.

@mauritsvanrees

This comment has been minimized.

Copy link
Member

mauritsvanrees commented Mar 26, 2018

Done in PR #275. When I put this in setup.cfg (or ~/.pypirc) it works:

[zest.releaser]
encoding = utf-8
@mauritsvanrees

This comment has been minimized.

Copy link
Member

mauritsvanrees commented Mar 26, 2018

Released with several other fixes in 6.14.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment