Towards supporting unicode #242

x0ret · 2019-05-24T19:42:29Z

Support unicode docstring
Support unicode strings

I think there are some issues: using a Python 2.7 environment to decompile a Python 3.7 .pyc yields a \n\n docstring, which I think is an xdis-related issue.

This is WIP.
fixes #241.

rocky · 2019-05-24T20:41:44Z

Looks good so far. I note that .decode() (and probably unicode) start at around Python 2.4 or so.

x0ret · 2019-05-25T01:14:07Z

The only case we don't support yet is when, e.g., uncompyle6 runs under Python 3.7.3 and you are trying to decompile Python 2.7 bytecode. In this case we don't know whether a string is unicode or not. For this we could use a module like chardet to recognize the encoding:

>>> import chardet
>>> chardet.detect("تست".encode())
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
>>> chardet.detect("Test".encode())
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Do you prefer adding this kind of dependency to uncompyle6?

EDIT:

For xdis, this patch fixes the case of using Python 2.7 to decompile Python 3.7, e.g.:

--- ~/.pyenv/versions/2.7.16/lib/python2.7/site-packages/xdis/unmarshal.py Sat May 25 05:08:37 2019
+++ ~/.pyenv/versions/2.7.16/lib/python2.7/site-packages/xdis/unmarshal.py Sat May 25 05:37:41 2019
@@ -57,7 +57,7 @@
         # found it and this code via
         # https://www.peterbe.com/plog/unicode-to-ascii where it is a
         # dead link. That can potentially do better job in converting accents.
-        return unicodedata.normalize('NFKD', u).encode('ascii', 'ignore')
+        return unicodedata.normalize('NFKD', u)
     else:
         return str(u)

The ignore option in encode results in:

('unicodestring', u'\n    \u062a\u0633\u062a\n    ')
('unicodestring', '\n    \n    ')
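The effect shown above can be reproduced outside xdis with a minimal sketch. The helper names `to_ascii_lossy` and `normalize_only` are hypothetical and only mirror the two sides of the diff; they are not xdis functions.

```python
import unicodedata


def to_ascii_lossy(u):
    # Mirrors the removed line: any character with no ASCII form after
    # NFKD decomposition is silently dropped by the 'ignore' handler.
    return unicodedata.normalize("NFKD", u).encode("ascii", "ignore")


def normalize_only(u):
    # Mirrors the added line: normalize, but keep the text as unicode.
    return unicodedata.normalize("NFKD", u)


doc = u"\n    \u062a\u0633\u062a\n    "
print(repr(to_ascii_lossy(doc)))   # only the whitespace survives
print(repr(normalize_only(doc)))   # the Arabic letters are preserved
```

Accented Latin text fares better under the lossy version, because NFKD splits a letter such as é into a base letter plus a combining mark and only the mark is dropped. Scripts without an ASCII base letter lose everything, which matches the empty docstring reported above.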

rocky · 2019-05-25T01:56:31Z

For xdis this patch fixes the case for using python 2.7 for decompiling python 3.7 for e.g.:

Yeah, this was expedient and flaky. Please put in PR for fixing xdis. You should also have an invite to that project.

Do you prefer adding this kind of dependency to uncompyle6?

Sure! (I assume you do too, since you suggested it?) chardet goes back to 2.1 and looks pretty cool and well written. However, there might also be an option to indicate an encoding so the program doesn't have to guess. And while on the topic of options processing, the 2.7 option-processing branch would be greatly cleaned up if we started using click. For decompile3 there's no excuse not to use it (other than not having gotten around to it).

x0ret · 2019-05-25T07:57:59Z

However there might also be an option to indicate an encoding so the program doesn't have to guess.

Nice, this option is a better alternative. I'll commit the option.

Added click to my TODOs.

x0ret · 2019-05-25T21:13:42Z

@rocky, I made the changes for the encoding option, but there's a case I'd like your comment on. Suppose you set the encoding in the uncompyle6 option to utf-8, your Python version is 3.7.3, and you are trying to decompile a Python 2.7 module. As we decided, in this case we treat all strings, including ASCII ones, as unicode and prepend u to them. I'm thinking about the side effects of these cases in Python 2. What do you think?

rocky · 2019-05-25T22:29:49Z

@x0ret You have my admiration for noticing such things and thinking about them. I don't know if I can be of help other than to be a sounding board for ideas. I have thought about this for the last 10 minutes or so.

When you say:

I'm thinking about side effect of these cases in Python 2,

do you mean that normally uncompyle6 would try to turn this into ASCII, but here it might change behavior and turn it into unicode instead?

The TODO suggests we should consider that the string was in unicode to start with so in that case it probably shouldn't be turned into ASCII. Right?

If I have this right, you had suggested using chardet, which can detect if there are non-ASCII characters in there (and thus the string may need to be in unicode). Right? Would chardet work or help?

When I feel like I know something, I'll say so. But when I don't think I know or understand, I won't hesitate to admit it. So I leave to you how you want to ultimately proceed.

The way I see this is, and the way I have been working, has been to make a stab in the right direction even if it is flaky or incomplete or I don't understand the problem fully. Almost always that is better than doing nothing. And if there is a problem, unless this is a massive and difficult change to revert (which you are in a better position to know than me), having moved hopefully forward (or at least in a particular direction) we are in a better position to assess what the right or better thing to do is.

As you may have seen, I am not even afraid to make the wrong decision and own up to it. Hence I'll leave those "FIXME" or "TODO" comments.

Sorry I can't be of more help.

x0ret · 2019-05-27T13:05:58Z

@rocky, thanks for your comment.

do you mean that normally uncompyle6 would try to turn this into ASCII, but here it might change behavior and turn it into unicode instead?

Yes, I was worried about breaking the generated source in Python 2. However, after another look, I am convinced that when someone uses explicit unicode chars in the source, a coding: utf-8 declaration is required regardless of whether u is used or not. So your suggestion works perfectly.

In this case we do not need chardet.
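To make the reasoning concrete, here is a rough sketch of what emitting Python 2 compatible source from a Python 3 process could look like under this approach. `emit_py2_source` is a hypothetical helper written for illustration, not the code in this PR.

```python
def emit_py2_source(assignments, encoding="utf-8"):
    """Render (name, text) pairs as source with a coding declaration.

    Assumption: every string is treated as unicode and gets a u prefix,
    including pure-ASCII ones, as discussed in this thread.
    """
    lines = ["# -*- coding: %s -*-" % encoding]
    for name, text in assignments:
        # On Python 3, repr() keeps printable non-ASCII characters as-is,
        # so the coding declaration is what makes the file valid Python 2.
        lines.append("%s = u%s" % (name, repr(text)))
    return "\n".join(lines) + "\n"


src = emit_py2_source([("a", "Test"), ("b", "\u062a\u0633\u062a")])
print(src)
```

The u prefix on the ASCII string is redundant but harmless in Python 2, and Python 3.3+ accepts it too.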

There was only one issue, in the code where I used try-catch, and with the xdis fix I can say it is ready.

Besides, based on your suggestion I added another option, encoding, to let the user explicitly specify the source encoding. (Because of getopt, I couldn't make the value optional with a default of utf-8.)
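For readers unfamiliar with the getopt limitation, a minimal sketch follows. The long option name `--encoding` and the function `parse_args` are assumptions for illustration; the actual spelling and wiring in uncompyle6 may differ.

```python
import getopt


def parse_args(argv):
    # "encoding=" declares a long option that REQUIRES a value; with this
    # spec there is no way to accept a bare --encoding meaning "utf-8".
    opts, files = getopt.getopt(argv, "", ["encoding="])
    encoding = None
    for name, value in opts:
        if name == "--encoding":
            encoding = value
    return encoding, files


print(parse_args(["--encoding", "utf-8", "mod.pyc"]))
```

Passing `--encoding` with no value raises `getopt.GetoptError`, so the user always has to spell out the encoding.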

Please review the commits and let me know if you prefer changes.

Also, no changes are required for Unicode strings like print('تست'), since the current implementation with this PR and the xdis fix works perfectly.

As you may have seen, I am not even afraid to make the wrong decision and own up to it. Hence I'll leave those "FIXME" or "TODO" comments.

This is so valuable. I have learned a lot since working on this project. Thanks.

Sorry I can't be of more help.

Anyway, your words encouraged me to check again.

rocky

Looks good to me - thanks!

(I will be committing the change to xdis soon, since I gather there are no objections to that.)


x0ret requested a review from rocky May 24, 2019 19:44

x0ret force-pushed the master branch from e64728e to 3c24919 Compare May 25, 2019 00:59

x0ret added 2 commits May 27, 2019 17:00

towards supporting unicode: docstring

a5cdb50

add support for generated source encoding

9db59f1

x0ret force-pushed the master branch from 3c24919 to 9db59f1 Compare May 27, 2019 12:49

rocky approved these changes May 27, 2019


rocky merged commit e364499 into rocky:master May 27, 2019
