Tkinter clipboard_get() decodes characters incorrectly #58982

Closed
takluyver mannequin opened this issue May 10, 2012 · 36 comments
Labels
topic-tkinter type-bug An unexpected behavior, bug, or error

Comments

@takluyver
Mannequin

takluyver mannequin commented May 10, 2012

BPO 14777
Nosy @loewis, @terryjreedy, @ned-deily, @asvetlov, @takluyver, @serhiy-storchaka
Files
  • x11-clipboard-utf8.patch: clipboard_get and selection_get default to UTF8_STRING on X11
  • x11-clipboard-try-utf8.patch: 2nd revision of patch
  • x11-clipboard-try-utf8-3.patch: 3rd revision of patch
  • x11-clipboard-try-utf8-4.patch
  • x11-clipboard-try-utf8-4_27.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-05-16.01:17:32.991>
    created_at = <Date 2012-05-10.22:47:56.907>
    labels = ['type-bug', 'expert-tkinter']
    title = 'Tkinter clipboard_get() decodes characters incorrectly'
    updated_at = <Date 2012-05-16.01:17:32.989>
    user = 'https://github.com/takluyver'

    bugs.python.org fields:

    activity = <Date 2012-05-16.01:17:32.989>
    actor = 'ned.deily'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-05-16.01:17:32.991>
    closer = 'ned.deily'
    components = ['Tkinter']
    creation = <Date 2012-05-10.22:47:56.907>
    creator = 'takluyver'
    dependencies = []
    files = ['25552', '25566', '25567', '25571', '25572']
    hgrepos = []
    issue_num = 14777
    keywords = ['patch']
    message_count = 36.0
    messages = ['160378', '160379', '160419', '160438', '160440', '160441', '160444', '160450', '160451', '160452', '160456', '160486', '160545', '160548', '160551', '160552', '160555', '160556', '160557', '160559', '160560', '160561', '160562', '160563', '160569', '160571', '160573', '160576', '160580', '160588', '160714', '160716', '160718', '160722', '160789', '160790']
    nosy_count = 7.0
    nosy_names = ['loewis', 'terry.reedy', 'ned.deily', 'asvetlov', 'python-dev', 'takluyver', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14777'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 10, 2012

    With the text 'abc€' copied to the clipboard, on Linux, where UTF-8 is the default encoding:

    Python 3.2.3 (default, Apr 12 2012, 21:55:50) 
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tkinter
    >>> root = tkinter.Tk()
    >>> root.clipboard_get()
    'abcâ\x82¬'
    >>> 'abc€'.encode('utf-8').decode('latin-1')
    'abcâ\x82¬'

    I see the same behaviour in 2.7.3 as well (it returns a unicode string u'abc\xe2\x82\xac').

    If the clipboard is only accessible at a bytes level, I think clipboard_get should return a bytes object. But I can reliably copy and paste non-ascii characters between programs, so it looks like it's possible to return unicode.
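The garbled result above is what you get when UTF-8 bytes are decoded as Latin-1. As a minimal sketch (the helper name `undo_latin1_mojibake` is made up for illustration, and it only applies if the string really was mis-decoded this way), the damage can be reversed in Python 3 by going back through the bytes:

```python
def undo_latin1_mojibake(text):
    """Recover text whose UTF-8 bytes were wrongly decoded as Latin-1.

    Hypothetical helper, not part of tkinter. If the string cannot be
    round-tripped this way, it is returned unchanged.
    """
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the reported value without needing a clipboard.
garbled = 'abc€'.encode('utf-8').decode('latin-1')
repaired = undo_latin1_mojibake(garbled)
```

This is a workaround on the caller's side, not a fix: it guesses at what the source application sent.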

    @takluyver takluyver mannequin added topic-tkinter type-bug An unexpected behavior, bug, or error labels May 10, 2012
    @serhiy-storchaka
    Member

    Still worse. I get 'abc?'. Linux, Python 3.1, 3.2, and 3.3, UTF-8 locale.

    @terryjreedy
    Member

    3.3, Win 7, Idle
    >>> root.clipboard_get()
    'abc€'
    after cut from here

    @serhiy-storchaka
    Member

    This issue can be reproduced by pure Tcl/Tk:

    $ wish
    % clipboard get
    abc?
    % clipboard get -type STRING
    abc?
    % clipboard get -type UTF8_STRING
    abc€

    Use root.clipboard_get(type='UTF8_STRING') in Python.

    I don't know whether it should just be documented (UTF8_STRING is not even mentioned in the clipboard_get docstring), or whether we need to change the default behavior.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 11, 2012

    On this computer, I see this from Tcl:

    $ wish
    % clipboard get
    abc\u20ac

    But here Python's following suit:

    >>> root.clipboard_get()
    'abc\\u20ac'

    Which is odd, because as far as I know, my two computers run the same OS (Ubuntu 12.04) in the same configuration. I briefly thought the presence of xsel might be affecting it, but uninstalling it doesn't seem to make any difference.

    @ned-deily
    Member

    As is often the case with Tcl/Tk issues, there are platform differences. On OS X, with the two native Tcl/Tk implementations (Aqua Cocoa and Aqua Carbon), the examples appear to work as is *and* type "UTF8_STRING" does not exist. The less commonly used X11 Tcl/Tk on OS X does support and require "UTF8_STRING" for the example given. So any doc change needs to be carefully worded.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 11, 2012

    OK, after a quick bit of reading, I see why I'm confused: the clipboard actually works by requesting the text from the source program, so where you copy it from makes a difference. In my case, copying from firefox gives 'abc\\u20ac', and copying from Geany gives u'abc\xe2\x82\xac'.

    However, I still think there's something that can be improved in Python. As Serhiy points out, specifying type='UTF8_STRING' makes it work properly from both programs. The Tcl documentation recommends this as the best option for "modern X11 systems"[1].

    From what Ned says, we can't make UTF8_STRING the default everywhere, but is there a way to detect if we're inside X11, and use UTF8_STRING by default there?

    [1] http://www.tcl.tk/man/tcl/TkCmd/clipboard.htm
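To illustrate why the plain STRING result is hard to repair generically, here is a small sketch of undoing the escaped form seen when copying from Firefox. The helper name `unescape_unicode` is hypothetical, and the approach assumes the text is pure ASCII containing literal `\uXXXX` sequences; it would also interpret any other backslash escapes in the text, so it is not safe as a general default.

```python
def unescape_unicode(text):
    """Turn literal \\uXXXX sequences in an ASCII string into real characters.

    Hypothetical helper for illustration only. Raises UnicodeEncodeError
    if the input is not pure ASCII.
    """
    return text.encode('ascii').decode('unicode_escape')


# The value reported when the clipboard owner was Firefox.
from_firefox = 'abc\\u20ac'
decoded = unescape_unicode(from_firefox)
```

The Geany case needs a different repair (re-encode as Latin-1, decode as UTF-8), which is why requesting UTF8_STRING from the source application is the more reliable route.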

    @terryjreedy
    Member

    There are definitely platform differences. As I noted, the original example works fine on Windows. However

    >>> root.clipboard_get(type='STRING')
    'abc€'
    >>> root.clipboard_get(type='UTF8_STRING')
    Traceback (most recent call last):
      File "<pyshell#21>", line 1, in <module>
        root.clipboard_get(type='UTF8_STRING')
      File "C:\Programs\Python33\lib\tkinter\__init__.py", line 549, in clipboard_get
        return self.tk.call(('clipboard', 'get') + self._options(kw))
    _tkinter.TclError: CLIPBOARD selection doesn't exist or form "UTF8_STRING" not defined

    Of course, on Windows I suspect that the unicode string is not copied to the clipboard as utf8 bytes, so if clipboard contents are tagged, there would not be such a thing. Perhaps clipboards work differently on different OSes.

    >>> help(root.clipboard_get)
    ...
        The type keyword specifies the form in which the data is
        to be returned and should be an atom name such as STRING
        or FILE_NAME.  Type defaults to STRING.

    (Actually, FILE_NAME gives the same exception as UTF8_STRING.)

    @ned-deily
    Member

    Most likely the best way to determine the windowing system is to use the "tk windowingsystem" command (http://www.tcl.tk/man/tcl8.5/TkCmd/tk.htm#M10), so something like this:

        root = tkinter.Tk()
        root.call(('tk', 'windowingsystem'))

    As documented, the call returns 'x11' for X11-based systems, 'win32' for Windows, and 'aqua' for the native OS X implementations.
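Combined with the earlier suggestion to prefer UTF8_STRING under X11, the choice of selection type could be sketched as a pure function of the reported windowing system. The function name `default_selection_type` is an assumption made for this example; it is not a tkinter API.

```python
def default_selection_type(windowing_system):
    """Pick a selection type from the value reported by `tk windowingsystem`.

    Hypothetical helper: 'x11' gets UTF8_STRING; 'win32', 'aqua' and anything
    else keep Tk's documented default of STRING.
    """
    return 'UTF8_STRING' if windowing_system == 'x11' else 'STRING'


choices = {ws: default_selection_type(ws) for ws in ('x11', 'win32', 'aqua')}
```

With a real display, the argument would come from `root.tk.call('tk', 'windowingsystem')`, and the result could be passed as `root.clipboard_get(type=...)`.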

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 11, 2012

    Thanks, Ned.

    Does it seem like a good idea to test the windowing system like that, and default to UTF8_STRING if it's x11? So far, I've not found any case on X where STRING works but UTF8_STRING doesn't. If it seems reasonable, I'm happy to have a go at making a patch.

    @ned-deily
    Member

    A patch would be great. I don't have a strong opinion about the issue one way or another. I suppose it would simplify things for Python 3 users if the clipboard results were returned properly in the default case when no 'type' argument is passed to clipboard_get(). For Python 2, changing things seems a little more questionable but, as long as it was already returning a unicode object in that case, it sounds like a bug fix rather than a feature. Martin, Andrew: any opinions on this?

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 12, 2012

    Here's a patch that makes UTF8_STRING the default type for clipboard_get and selection_get when running in X11.

    @asvetlov
    Contributor

    The patch looks good to me and works fine.
    I think it can be applied to 2.7 as well.
    There is only one problem: I don't know how to write a test for it without using external tools like xclip or ctypes bindings for the X shared library.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 13, 2012

    Indeed, and there don't seem to be any other tests for the clipboard functionality.

    @asvetlov
    Contributor

    You are right: there are no tests for this, as is the case for most of tkinter.
    Why not write one, if possible?

    @loewis
    Mannequin

    loewis mannequin commented May 13, 2012

    I'm skeptical about the patch. In both 2.7 and 3.x, clipboard_get returns a Unicode string, yet it fails to decode it properly. So I think this is the bug that ought to be fixed (using the proper encoding).

    Defaulting to UTF8_STRING is a new feature, IMO, and shouldn't be done for 2.7 (or 3.2).

    @ned-deily
    Member

    Martin, is there a way for _tkinter to know whether the result returned from Tcl/Tk is an encoded string or not in this case?

    With regard to the patch, it would be better to cache the results of the first-time call to get the windowingsystem value so that we don't have to make two calls down into Tcl for each clipboard_get.

    @serhiy-storchaka
    Member

    On Fri, 2012-05-11 at 21:25 +0000, Thomas Kluyver wrote:

    So far, I've not found any case on X where STRING works but UTF8_STRING doesn't.

    Perhaps there will be problems with the old (very old) closed source
    software.

    A few years ago (in Debian Sarge) even xsel did not work with the
    non-ascii strings.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 13, 2012

    But the encoding used seemingly depends on the source application - Geany (GTK 2, I think) seemingly sends UTF8 text anyway, whereas Firefox escapes the unicode character. So I don't think we can correctly decode the STRING value in all cases.

    The Tk documentation describes UTF8_STRING as being the "most useful" type on modern X11.

    @serhiy-storchaka
    Member

    But the encoding used seemingly depends on the source application - Geany (GTK 2, I think) seemingly sends UTF8 text anyway, whereas Firefox escapes the unicode character. So I don't think we can correctly decode the STRING value in all cases.

    Agree. Opera sends 'abc?' literally.

    @loewis
    Mannequin

    loewis mannequin commented May 13, 2012

    Martin, is there a way for _tkinter to know whether the result
    returned from Tcl/Tk is an encoded string or not in this case?

    Off-hand, I don't know. I suppose there is a way to do this correctly,
    but one might have to dig through many layers of software to find out
    what that way is.

    With regard to the patch, it would be better to cache the results of
    the first-time call to get the windowingsystem value so that we don't
    have to make two calls down into Tcl for each clipboard_get.

    That also.

    @loewis
    Mannequin

    loewis mannequin commented May 13, 2012

    But the encoding used seemingly depends on the source application -
    Geany (GTK 2, I think) seemingly sends UTF8 text anyway, whereas
    Firefox escapes the unicode character. So I don't think we can
    correctly decode the STRING value in all cases.

    Ah, ok. IIUC, support for UTF8_STRING would also be in the realm of
    the source application, right? If so, I think we should use something
    more involved where we try UTF8_STRING first, and fall back to STRING
    if the application doesn't support that.

    This I could also accept for 2.7, since it "shouldn't" have a potential
    for breakage.
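The try-then-fall-back idea can be sketched from the caller's side as follows. This is only an illustration of the proposal, not the patch itself: the function name `clipboard_text` is hypothetical, and the `TclError` stand-in exists solely so the sketch can run where Tk is not installed.

```python
try:
    from tkinter import TclError
except ImportError:
    class TclError(Exception):
        """Stand-in so the sketch can be run without Tk installed."""


def clipboard_text(widget):
    """Ask for UTF8_STRING first; fall back to Tk's default type (STRING).

    `widget` is any tkinter widget (or an object with a compatible
    clipboard_get method). Hypothetical helper, not a tkinter API.
    """
    try:
        return widget.clipboard_get(type='UTF8_STRING')
    except TclError:
        # The owner of the selection does not offer UTF8_STRING.
        return widget.clipboard_get()
```

If the clipboard is empty, the second call raises `TclError` as well, so callers still need to handle that case.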

    @ned-deily
    Member

    +1 to Martin's proposal

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 13, 2012

    OK, I'll produce an updated patch.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 13, 2012

    As requested, the second version of the patch (x11-clipboard-try-utf8):

    • Caches the windowing system per object. The tk call to find the windowing system is made the first time clipboard_get or selection_get is called without specifying type=.
    • If using UTF8_STRING throws an error, it falls back to the default call with no type specified (i.e. STRING).
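The per-object caching described in the first bullet might look roughly like this. The class and attribute names are invented for the example and are not taken from the patch; `tk` stands for the interpreter object that tkinter widgets expose as `widget.tk`.

```python
class WindowingSystemCache:
    """Hypothetical helper: asks Tk for the windowing system once per object."""

    def __init__(self, tk):
        self.tk = tk                    # object with a Tcl-style call() method
        self._windowingsystem = None    # filled in lazily on first access

    @property
    def windowingsystem(self):
        if self._windowingsystem is None:
            # Only the first access goes down into Tcl.
            self._windowingsystem = self.tk.call('tk', 'windowingsystem')
        return self._windowingsystem
```

With a real display this could be used as `WindowingSystemCache(root.tk).windowingsystem`; each instance keeps its own cached value.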

    @ned-deily
    Member

    Not to bikeshed here but I think it would be better to cache the windowingsystem value at the module level since I assume an application could be calling clipboard_get on different tkinter objects and I don't think there is any possibility that the windowingsystem value could vary within one interpreter invocation.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 13, 2012

    I'm happy to put the cache at the module level, but I'll give other people a chance to express their views before I dive into the code again.

    I imagine most applications would only call clipboard_get() on one item, so it wouldn't matter. However, my own interest in this is from IPython, where we create a Tk object just to call clipboard_get() once, so a module level cache would be quicker, albeit only a tiny bit.

    @serhiy-storchaka
    Member

    Not to bikeshed here but I think it would be better to cache the windowingsystem value at the module level since I assume an application could be calling clipboard_get on different tkinter objects and I don't think there is any possibility that the windowingsystem value could vary within one interpreter invocation.

    Why is Misc.tk not a module-level variable?

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 13, 2012

    The 3rd revision of the patch has the cache at the module level. It's a bit awkward, because there's no module level function to call to retrieve it (as far as I know), so it's exposed by objects which can call Tk.

    Also, serhiy pointed out a mistake in the 2nd revision, which is fixed ('selection' instead of 'clipboard').

    @ned-deily
    Member

    Serhiy, I don't know why Misc.Tk is not module level but it isn't so caching global attributes there isn't effective. However, upon further consideration, I take back my original suggestion of caching at the module level primarily because I can think of future scenarios where it might be possible that there are different windowing systems supported in the same Python instance. I now think the best solution is to cache at the Tk root object level; that appears to be a simple change to Thomas's 2nd revision. Sorry about that! Here is a fourth version (one for 3.x and one for 2.7) based on the second which includes the fix from the 3rd.

    I started to write a simple test for the clipboard functions but then realized that there doesn't seem to be a practical way to effectively test in a machine-independent way without destroying the contents of the Tk clipboard and hence the user's desktop clipboard, not a friendly thing to do. For example, the clipboard might contain a data type not supported by the platform's Tk, like pict data on OS X. So I'm not including the test here but it did verify that the attribute was being properly cached across multiple tkinter objects.

    Thanks to Thomas for the patch and to Serhiy for reviewing. By the way, Thomas, for your patch to be included, you should submit a PSF contributor agreement as described here: http://www.python.org/psf/contrib/. Once that is in place and if the patch looks good to everyone, I'll apply it.
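Caching at the Tk root object level, so that every widget under the same root shares one lookup, could be sketched like this. The function name and the `_windowingsystem_cached` attribute are assumptions made for illustration, and the sketch relies on `_root()`, a private helper of tkinter widgets, so treat it as an outline rather than the applied change.

```python
def windowing_system(widget):
    """Return the windowing system, caching the answer on the widget's root.

    Hypothetical helper. `widget` needs a `tk` attribute with a call()
    method and a `_root()` method returning the root object.
    """
    root = widget._root()
    try:
        return root._windowingsystem_cached
    except AttributeError:
        # First lookup for this root: ask Tcl once and remember the answer.
        root._windowingsystem_cached = widget.tk.call('tk', 'windowingsystem')
        return root._windowingsystem_cached
```

Unlike a module-level variable, this keeps separate roots independent, which leaves room for the scenario of different windowing systems in one Python instance.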

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 15, 2012

    I've submitted the contributor agreement, though I've not yet heard anything back about it.

    @takluyver
    Mannequin Author

    takluyver mannequin commented May 15, 2012

    ...And mere minutes after I said I hadn't heard anything, I've got the confirmation email. :-)

    @serhiy-storchaka
    Member

    ...And mere minutes after I said I hadn't heard anything, I've got the confirmation email. :-)

    Congratulations!

    @asvetlov
    Contributor

    I'm OK with the last patch version.

    @python-dev
    Mannequin

    python-dev mannequin commented May 16, 2012

    New changeset f70fa654f70e by Ned Deily in branch '2.7':
    Issue bpo-14777: In an X11 windowing environment, tkinter may return
    http://hg.python.org/cpython/rev/f70fa654f70e

    New changeset 41382250e5e1 by Ned Deily in branch '3.2':
    Issue bpo-14777: In an X11 windowing environment, tkinter may return
    http://hg.python.org/cpython/rev/41382250e5e1

    New changeset 97601cbf169f by Ned Deily in branch 'default':
    Issue bpo-14777: merge
    http://hg.python.org/cpython/rev/97601cbf169f

    @ned-deily
    Member

    Applied for release in 2.7.4, 3.2.4 and 3.3.0. Thanks all!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022