Strange regression with some unicode characters (e.g. with the Russian Н) #65

chubin · 2017-02-25T13:19:57Z

pyte 0.6 has a strange regression with some Unicode characters, particularly with the Russian "Н" character:

That works:

$ cat  regression.py
# vim: encoding=utf-8
import sys
import pyte

text = "Русский текст"
screen = pyte.screens.Screen(20, 1)
stream = pyte.streams.ByteStream()
stream.attach(screen)
stream.feed(text)

for line in screen.buffer:
    for x in line:
        sys.stdout.write(x.data)
    sys.stdout.write("\n")

$ python regression.py
Русский текст

That does not work:

$ cat  regression.py
# vim: encoding=utf-8
import sys
import pyte

text = "Нерусский текст"
screen = pyte.screens.Screen(20, 1)
stream = pyte.streams.ByteStream()
stream.attach(screen)
stream.feed(text)

for line in screen.buffer:
    for x in line:
        sys.stdout.write(x.data)
    sys.stdout.write("\n")

$ python regression.py

$

As you can see, the output is empty in the second example (where the printed text contains "Н").

Everything works fine with the 0.5.x version of the module.

Another problematic character: the Greek letter Ν

Some other broken characters:

ț \u021b
ȝ \u021d
ɛ \u025b
ɝ \u025d
ʛ \u029b
ʝ \u029d
̛ \u031b
̝ \u031d
͛ \u035b
͝ \u035d
Λ \u039b
Ν \u039d
Л \u041b
Н \u041d
ћ \u045b
ѝ \u045d
қ \u049b
ҝ \u049d
ԛ \u051b
ԝ \u051d
՛ \u055b
՝ \u055d
֛ \u059b
֝ \u059d

1b, 1d, 5b, 5d, 9b, 9d seem to be the root of the problem

superbobry · 2017-02-25T22:12:04Z

Thanks for reporting! This could be related to #62. Will investigate further.

chubin · 2017-02-25T22:40:53Z

I have found a new group of evil characters.
Unfortunately, this group seems to have nothing in common with the former group:

҃ \u0483
҄ \u0484
҅ \u0485
҆ \u0486
҇ \u0487

superbobry · 2017-02-26T19:35:37Z

The issue indeed has the same cause as #62. All of the characters you've listed contain some control bytes when UTF-8 encoded, e.g.

>>> "Н".encode("utf-8")
b'\xd0\x9d'  # \x9d is OSC
>>> "қ".encode("utf-8")
b'\xd2\x9b'  # \x9b is CSI
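The same check can be scripted for any character. The sketch below is only an illustration: `c1_bytes_in_utf8` is a hypothetical helper, not part of pyte, and it looks only for the two bytes named above. It also suggests why code points ending in 1b, 5b, 9b (or 1d, 5d, 9d) are affected: in a two-byte UTF-8 sequence the second byte is `0x80 | (code_point & 0x3F)`, so all of those low bytes map to `0x9b` (or `0x9d`).

```python
# Hypothetical helper for illustration; not part of pyte.
# Only the two C1 control bytes mentioned in this thread are checked.
C1_CONTROLS = {0x9B: "CSI", 0x9D: "OSC"}


def c1_bytes_in_utf8(char):
    """Return the names of C1 control bytes found in the UTF-8 encoding of char."""
    return [C1_CONTROLS[b] for b in char.encode("utf-8") if b in C1_CONTROLS]


for ch in "НқЛР":
    # Print the character, its UTF-8 bytes, and any control bytes found.
    print(repr(ch), ch.encode("utf-8"), c1_bytes_in_utf8(ch))
```

The last character in the loop, "Р" (U+0420), is included as a contrast case whose encoding contains neither byte.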

chubin · 2017-02-26T19:43:22Z

Of course they do. I listed some of them with their codes, and they indeed contain 9d and 9b, as you can see. On the other hand, in the last block I listed another group of characters, and those contain neither 9d nor 9b. That seems to be a different problem.

superbobry · 2017-02-26T19:47:34Z

The new "unprintable" group seems to be related to the way we do Unicode normalization as all of them (I think) are combining characters.
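A quick way to check the "combining characters" guess is the standard `unicodedata` module. This is a minimal sketch that only inspects the five code points from the second group; it says nothing about how pyte itself normalizes input.

```python
import unicodedata

# The second group reported above: U+0483 .. U+0487.
second_group = "\u0483\u0484\u0485\u0486\u0487"

for ch in second_group:
    # combining() returns a non-zero canonical combining class for combining marks;
    # category() returns "Mn" for non-spacing marks.
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )
```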

chubin · 2017-02-26T19:55:16Z

What do you think, is there any chance that the bug will be fixed in the next few weeks? Or should I downgrade pyte and use 0.5.2? Can I help somehow?

superbobry · 2017-02-26T20:02:17Z

The bug is a consequence of delegating input decoding to Screen (see febdad7). I am currently thinking about how to best approach this, can't guarantee the fix would arrive shortly.

If you have any ideas, feel free to share them here.

chubin · 2017-02-26T20:05:15Z

I can try to find some other broken characters if that would help.

superbobry · 2017-02-26T20:54:00Z

Don't worry, the ones you have already come up with are enough.

chubin · 2017-03-05T19:07:36Z

Any news about the issue? The problem is that many Japanese/Chinese characters are also corrupted. There are some simple workarounds for Cyrillic/Greek, but things get worse with the oriental languages. So the issue is a real blocker for pyte 0.6 usage in a multilingual environment.

superbobry · 2017-03-05T23:07:49Z

I am still thinking about how to implement this without making the code too much of a nightmare. I have a prototype in a local branch but it is not finished yet. Most likely I won't have much time to work on this further until next weekend, so if you have any ideas feel free to post them here or submit a PR.

> So the issue is a real blocker for pyte 0.6 usage in a multilingual environment

Yes, I understand it is critical, but 0.6.0 has not been released, so I'd suggest using the latest stable version if you're after correctness.

Delegating decoding to ``Screen`` was a bad idea, because control sequences can naturally occur in pure-text data, e.g. the UTF-8 encoded "Н" letter contains an OSC byte. Closes #62 and #65
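The idea of decoding before interpreting control characters can be sketched with the standard library alone. This is not pyte's actual implementation; `decode_chunks` is a hypothetical helper that decodes incoming bytes incrementally, so a multi-byte character split across two chunks is reassembled before anything scans the text for control characters.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks incrementally (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        # Incomplete multi-byte sequences are buffered inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is left once the input ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


raw = "Нерусский текст".encode("utf-8")
# Split in the middle of the first character (b"\xd0" + b"\x9d...").
chunks = [raw[:1], raw[1:]]
decoded = "".join(decode_chunks(chunks))

# The raw bytes contain 0x9d, but the decoded text has no U+009D character.
print(b"\x9d" in raw, "\x9d" in decoded)
```

After decoding, the OSC byte only exists as part of "Н", so a parser working on the decoded text has no reason to treat it as a control sequence.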

chubin · 2017-03-25T16:10:54Z

I confirm the problem is fixed now! @superbobry you are a genius! Thank you very much!

superbobry · 2017-03-25T16:41:10Z

Haha, thanks! Glad it works for you :)

superbobry added a commit that referenced this issue Feb 26, 2017

Added a failing test for issue #65

d374226

superbobry mentioned this issue Mar 11, 2017

Moved decoding back to pyte.streams.Stream #68

Merged

superbobry closed this as completed Mar 11, 2017

superbobry mentioned this issue Mar 11, 2017

Zero-width characters #69

Open
