Strange regression with some unicode characters (e.g. with the Russian Н) #65

Closed
chubin opened this issue Feb 25, 2017 · 13 comments

@chubin

chubin commented Feb 25, 2017

pyte 0.6 has a strange regression with some Unicode characters, particularly with the Russian "Н" character:

This works:

$ cat  regression.py
# vim: encoding=utf-8
import sys
import pyte

text = "Русский текст"
screen = pyte.screens.Screen(20, 1)
stream = pyte.streams.ByteStream()
stream.attach(screen)
stream.feed(text)

for line in screen.buffer:
    for x in line:
        sys.stdout.write(x.data)
    sys.stdout.write("\n")

$ python regression.py
Русский текст

This does not work:

$ cat  regression.py
# vim: encoding=utf-8
import sys
import pyte

text = "Нерусский текст"
screen = pyte.screens.Screen(20, 1)
stream = pyte.streams.ByteStream()
stream.attach(screen)
stream.feed(text)

for line in screen.buffer:
    for x in line:
        sys.stdout.write(x.data)
    sys.stdout.write("\n")

$ python regression.py

$

As you can see, the output is empty in the second example (where the printed text contains "Н").

Everything works fine with the 0.5.x version of the module.

Another problematic character: the Greek letter Ν

Some other broken characters:

ț \u021b
ȝ \u021d
ɛ \u025b
ɝ \u025d
ʛ \u029b
ʝ \u029d
̛ \u031b
̝ \u031d
͛ \u035b
͝ \u035d
Λ \u039b
Ν \u039d
Л \u041b
Н \u041d
ћ \u045b
ѝ \u045d
қ \u049b
ҝ \u049d
ԛ \u051b
ԝ \u051d
՛ \u055b
՝ \u055d
֛ \u059b
֝ \u059d

Code points ending in 1b, 1d, 5b, 5d, 9b, or 9d seem to be the root of the problem.

@superbobry
Collaborator

superbobry commented Feb 25, 2017

Thanks for reporting! This could be related to #62. Will investigate further.

@chubin
Author

chubin commented Feb 25, 2017

I have found a new group of evil characters.
Unfortunately, this group seems to have nothing in common with the former group:

҃ \u0483
҄ \u0484
҅ \u0485
҆ \u0486
҇ \u0487

@superbobry
Collaborator

The issue indeed has the same cause as #62. All of the characters you've listed contain some control bytes when UTF-8 encoded, e.g.

>>> "Н".encode("utf-8")
b'\xd0\x9d'  # \x9d is OSC
>>> "қ".encode("utf-8")
b'\xd2\x9b'  # \x9b is CSI
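
The byte-level explanation above can be checked with a minimal sketch. It assumes only what this comment states, namely that the bytes 0x9b (CSI) and 0x9d (OSC) are the ones a byte-oriented parser reacts to; the helper name `c1_bytes` is hypothetical and is not part of pyte.

```python
# C1 control codes named in the comment above (assumption: only these two matter here).
C1_CONTROLS = {0x9B: "CSI", 0x9D: "OSC"}


def c1_bytes(text):
    """Map each character of `text` to the C1 control names found in its UTF-8 bytes."""
    found = {}
    for ch in text:
        names = [C1_CONTROLS[b] for b in ch.encode("utf-8") if b in C1_CONTROLS]
        if names:
            found[ch] = names
    return found


print(c1_bytes("Русский текст"))    # the working example from the report
print(c1_bytes("Нерусский текст"))  # the failing example from the report
```

This also accounts for the 1b/1d/5b/5d/9b/9d pattern noticed in the report: for a two-byte UTF-8 sequence the second byte is `0x80 | (code_point & 0x3F)`, so any code point whose low six bits are 0x1b or 0x1d ends in the byte 0x9b or 0x9d.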

@chubin
Author

chubin commented Feb 26, 2017

Of course they do. I listed some of them with their codes, and they indeed contain 9d and 9b, as you can see. On the other hand, the last block lists another group of characters that contain neither 9d nor 9b. That seems to be a different problem.

@superbobry
Collaborator

The new "unprintable" group seems to be related to the way we do Unicode normalization as all of them (I think) are combining characters.
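
One way to check the "combining characters" hypothesis for the second group is to ask Python's standard `unicodedata` module. This is only a sketch for inspecting the characters; it says nothing about how pyte normalizes input, and the helper name `describe` is hypothetical.

```python
import unicodedata

# The second group from the report: U+0483 through U+0487.
SECOND_GROUP = [chr(cp) for cp in range(0x0483, 0x0488)]


def describe(ch):
    """Return (code point, general category, combining class, UTF-8 bytes)."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.category(ch),   # "Mn" means a nonspacing (combining) mark
        unicodedata.combining(ch),  # non-zero for combining characters
        ch.encode("utf-8"),
    )


for ch in SECOND_GROUP:
    print(describe(ch))
```

A non-zero combining class together with the "Mn" category would be consistent with the explanation above, and the printed bytes show that none of these characters contains 0x9b or 0x9d.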

@chubin
Author

chubin commented Feb 26, 2017

Do you think there is any chance the bug will be fixed in the next few weeks? Or should I downgrade pyte and use 0.5.2? Can I help somehow?

@superbobry
Collaborator

The bug is a consequence of delegating input decoding to Screen (see febdad7). I am currently thinking about how best to approach this, but I can't guarantee the fix will arrive shortly.

If you have any ideas, feel free to share them here.

@chubin
Author

chubin commented Feb 26, 2017

I can try to find some other broken characters if that would help.

superbobry added a commit that referenced this issue Feb 26, 2017
@superbobry
Collaborator

Don't worry, the ones you've already come up with are enough.

@chubin
Author

chubin commented Mar 5, 2017

Any news about the issue? The problem is that many Japanese/Chinese characters are also corrupted. There are some simple workarounds for Cyrillic/Greek, but things get worse with the East Asian languages. So the issue is a real blocker for pyte 0.6 usage in a multilingual environment

@superbobry
Collaborator

I am still thinking about how to implement this without making the code too much of a nightmare. I have a prototype in a local branch, but it is not finished yet. Most likely I won't have much time to work on this further until next weekend, so if you have any ideas, feel free to post them here or submit a PR.

"So the issue is a real blocker for pyte 0.6 usage in a multilingual environment"

Yes, I understand it is critical, but 0.6.0 has not been released, so I'd suggest using the latest stable version if you're after correctness.

superbobry added a commit that referenced this issue Mar 9, 2017
Delegating decoding to ``Screen`` was a bad idea, because control
sequences can naturally occur in pure-text data, e.g. the UTF-8 encoded
"Н" letter contains an OSC byte.

Closes #62 and #65
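
The reasoning in the commit message can be illustrated with a minimal sketch that contrasts scanning raw bytes for control codes with decoding first and scanning the decoded text. This is not pyte's actual implementation; both helper names are hypothetical, and only the standard `codecs` incremental decoder is used.

```python
import codecs

C1_BYTES = (0x9B, 0x9D)      # CSI and OSC as raw bytes
C1_CHARS = ("\x9b", "\x9d")  # CSI and OSC as decoded characters


def control_positions_in_bytes(data):
    """Naive approach: treat every 0x9b/0x9d byte as a control code."""
    return [i for i, b in enumerate(data) if b in C1_BYTES]


def control_positions_in_text(chunks):
    """Decode UTF-8 incrementally first, then look for control characters."""
    decoder = codecs.getincrementaldecoder("utf-8")("replace")
    # The incremental decoder buffers a multi-byte sequence split across chunks.
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    return text, [i for i, ch in enumerate(text) if ch in C1_CHARS]


data = "Нерусский текст".encode("utf-8")
print(control_positions_in_bytes(data))  # the second byte of "Н" is flagged

# Feed the data split in the middle of "Н" to mimic arbitrary chunking.
text, hits = control_positions_in_text([data[:1], data[1:]])
print(text, hits)                        # text is intact, nothing is flagged
```

Decoding before parsing means a byte such as 0x9d inside a multi-byte character is never seen in isolation, which is the failure mode described in this issue.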
superbobry added a commit that referenced this issue Mar 10, 2017
superbobry added a commit that referenced this issue Mar 11, 2017
@chubin
Author

chubin commented Mar 25, 2017

I confirm the problem is fixed now! @superbobry you are a genius! Thank you very much!

@superbobry
Collaborator

Haha, thanks! Glad it works for you :)
