Skip to content

Conversation

@x0ret
Copy link
Collaborator

@x0ret x0ret commented May 24, 2019

  • Support unicode docstring

  • Support unicode strings

There are some issue i think, using python2.7 env to decompile python3.7 pyc results in \n\n docstring, which i think this is xdis related issue.

This is WIP.
fixes #241.

@x0ret x0ret requested a review from rocky May 24, 2019 19:44
@rocky
Copy link
Owner

rocky commented May 24, 2019

Looks good so far. I note that .decode() (and probably unicode) go start at around Python 2.4 or so.

@x0ret
Copy link
Collaborator Author

x0ret commented May 25, 2019

the only case we don't support yet is for e.g. uncompyle6 python version is 3.7.3 and you are trying to decompile 2.7. In this case we don't know whether string is unicode or not. For this case we can use a module like chardet for recognizing encoding:

>>> import chardet
>>> chardet.detect("تست".encode())
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
>>> chardet.detect("Test".encode())
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Do you prefer adding this kind of dependency to uncompyle6?

EDIT:

For xdis this patch fixes the case for using python 2.7 for decompiling python 3.7 for e.g.:

--- ~/.pyenv/versions/2.7.16/lib/python2.7/site-packages/xdis/unmarshal.py Sat May 25 05:08:37 2019
+++ ~/.pyenv/versions/2.7.16/lib/python2.7/site-packages/xdis/unmarshal.py Sat May 25 05:37:41 2019
@@ -57,7 +57,7 @@
         # found it and this code via
         # https://www.peterbe.com/plog/unicode-to-ascii where it is a
         # dead link. That can potentially do better job in converting accents.
-        return unicodedata.normalize('NFKD', u).encode('ascii', 'ignore')
+        return unicodedata.normalize('NFKD', u)
     else:
         return str(u)

ignore option in encode results in :

('unicodestring', u'\n    \u062a\u0633\u062a\n    ')
('unicodestring', '\n    \n    ')

@rocky
Copy link
Owner

rocky commented May 25, 2019

For xdis this patch fixes the case for using python 2.7 for decompiling python 3.7 for e.g.:

Yeah, this was expedient and flaky. Please put in PR for fixing xdis. You should also have an invite to that project.

Do you prefer adding this kind of dependency to uncompyle6?

Sure! (I assume you do to since you suggested it?) chardet goes back to 2.1 and looks pretty cool and well written. However there might also be an option to indicate an encoding so the program doesn't have to guess. And while on the topic of options processing, the 2.7 option-processing branch would be greatly cleaned up if we started using click. For decompile3 there's no excuse not to use (other than not having gotten around to it).

@x0ret
Copy link
Collaborator Author

x0ret commented May 25, 2019

However there might also be an option to indicate an encoding so the program doesn't have to guess.

Nice, this options is a better alternative. I'll commit for the option.

Added click to my TODOs.

@x0ret
Copy link
Collaborator Author

x0ret commented May 25, 2019

@rocky, i did changes for encoding option but there's an case which i should ask you for comment. Suppose you specify encoding in the uncompyle6 option to utf-8. Your Python version is 3.7.3 and you are trying to decompile a Python 2.7 module. As we decided, for this case, we treat all strings including ascii to unicode and prepend u to them. I'm thinking about side effect of these cases in Python 2, What do you think?

@rocky
Copy link
Owner

rocky commented May 25, 2019

@x0ret You have my admiration for noticing such things and thinking about them. I don't know if I can be of help other than to be a sounding board for ideas. I have thought about this for the last 10 minutes or so.

When you say:

I'm thinking about side effect of these cases in Python 2,

do you mean that normally uncompyle6 would try to turn this into ASCII, but here it might change behavior and turn it into unicode instead?

The TODO suggests we should consider that the string was in unicode to start with so in that case it probably shouldn't be turned into ASCII. Right?

If I have this right, you had suggested using chardet which can detect if there are non-ascii characters in there, and thus may need to be in unicode?) Right? Would chardet work or help?

When I feel like I know something, I'll say so. But when I don't think I know or understand, I won't hesitate to admit it. So I leave to you how you want to ultimately proceed.

The way I see this is, and the way I have been working, has been to make a stab in the right direction even if it is flaky or incomplete or I don't understand the problem fully. Almost always that is better than doing nothing. And if there is a problem, unless this is a massive and difficult change to revert (which you are in a better position to know than me), having moved hopefully forward (or at least in a particular direction) we are in a better position to assess what the right or better thing to do is.

As you may have seen, I am not even afraid to make the wrong decision and own up to it. Hence I'll leave those "FIXME" or "TODO" comments.

Sorry I can't be of more help.

@x0ret
Copy link
Collaborator Author

x0ret commented May 27, 2019

@rocky, thanks for your comment.

do you mean that normally uncompyle6 would try to turn this into ASCII, but here it might change behavior and turn it into unicode instead?

Yes, I was worried about breaking generated source in Python2, however after another shot, i am convinced that when someone uses explicit unicode chars in source, using coding: utf-8 is a forced and no matter if u is used or not. So your suggestion works perfectly.

In this case we do not need chardet.

There was only an issue in code which i used try-catch and with xdis fix i can say it is ready.

Besides based on your suggestion i added another option encoding to give the user the option to explicitly specify source encoding. (due to using getopt i couldn't implement having val optional and force to utf-8).

Please review the commits and let me know if you prefer changes.

Also No changes required for Unicode strings like print('تست') since current implementation with this PR and xdis works perfectly.

As you may have seen, I am not even afraid to make the wrong decision and own up to it. Hence I'll leave those "FIXME" or "TODO" comments.

This is so valuable, I learned alot since working on this project. Thanks.

Sorry I can't be of more help.

Anyway your words encouraged me to recheck again.

Copy link
Owner

@rocky rocky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - thanks!

(I will be committing the change to xdis soon since I gather there are no objections with that.

@rocky rocky merged commit e364499 into rocky:master May 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

About unicode docstring

2 participants