How does weechat want zero width spaces to work? #1669

Closed
gnachman opened this issue Jul 6, 2021 · 11 comments

@gnachman

gnachman commented Jul 6, 2021

Question

WeeChat seems to use zero-width spaces differently than other apps.

I modified iTerm2 to advance the cursor on a zero-width space in response to https://gitlab.com/gnachman/iterm2/-/issues/5397. But now in https://gitlab.com/gnachman/iterm2/-/issues/9786 we see that the opposite is expected.

How does WeeChat actually want zero-width space to work? Should the cursor move? TBH I found it surprising that you'd want the cursor to move for this non-spacing character, so if it's up in the air I'd prefer to change it to not expect movement.

Is there more to it than this?

@gnachman gnachman added the question label Jul 6, 2021
@trygveaa
Member

trygveaa commented Jul 6, 2021

WeeChat uses glibc to determine how wide characters are, and thus how much the cursor should be moved. On my system (Arch Linux with glibc version 2.33), glibc returns a width of 0 for ZWS, so then WeeChat doesn't expect the cursor to be moved. You can check this in WeeChat by running: /eval -n ${lengthscr:${\u200b}}.

I don't know if glibc has reported a different width for ZWS in earlier versions, but it would surprise me.

In the first issue you link to you said that Konsole and xterm treat ZWS as a regular space. I don't know if this was different four years ago, but at least on my machine now it's not. In Konsole it's invisible, and in xterm it's rendered as a dotted border around the preceding character. The cursor is not moved in either.

@gnachman
Author

gnachman commented Aug 4, 2022

I see that lengthscr boils down to wcswidth.

On macOS, wcswidth returns -1 for ZWS. That is certainly a strange choice, but macOS is infamous for its terrible wcwidth support.

It looks like WeeChat interprets a width of -1 as 1:

        length = wcswidth (ptr_wstring, num_char);
        /*
         * if the char is non-printable, wcswidth returns -1
         * (for example the length of the snowman without snow (U+26C4) == -1)
         * => in this case, consider the length is 1, to prevent any display bug
         */
        if (length < 0)
            length = 1;

I don't know if this is the right choice. The snowman example in particular is out of date: since Unicode 9, characters with default emoji presentation, such as U+26C4, are width 2.

FWIW, in iTerm2 I have to keep a list of characters with the DI (Default Ignorable Code Point) property to avoid moving the cursor for them.
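
A minimal sketch of what such a list check could look like. This is not iTerm2's code: the function name `is_default_ignorable_subset` is hypothetical, and the table is only a small, illustrative subset of the Default_Ignorable_Code_Point property, not the full set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative subset of Default_Ignorable_Code_Point ranges.
 * NOT the complete Unicode property and NOT iTerm2's actual list.
 */
struct cp_range {
    uint32_t first;
    uint32_t last;
};

static const struct cp_range di_subset[] = {
    { 0x200B, 0x200F },  /* zero width space .. right-to-left mark */
    { 0x2060, 0x206F },  /* word joiner and other invisible format chars */
    { 0xFE00, 0xFE0F },  /* variation selectors 1-16 */
    { 0xFEFF, 0xFEFF },  /* zero width no-break space (BOM) */
};

/* Return true if cp falls inside one of the ranges listed above. */
bool is_default_ignorable_subset(uint32_t cp)
{
    assert(cp <= 0x10FFFF);  /* larger values are not Unicode code points */
    for (size_t i = 0; i < sizeof di_subset / sizeof di_subset[0]; i++) {
        if (cp >= di_subset[i].first && cp <= di_subset[i].last)
            return true;
    }
    return false;
}
```

A terminal could consult a check like this before asking the C library for a width, and leave the cursor where it is whenever the check returns true.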

@trygveaa
Member

trygveaa commented Sep 6, 2022

> On macOS, wcswidth returns -1 for ZWS. That is certainly a strange choice, but macOS is infamous for its terrible wcwidth support.

Hm, yes, that's strange. On my machine with glibc 2.36 it returns 0, tested with this code:

#define _XOPEN_SOURCE
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

int main(int argc, char **argv) {
  wchar_t wc = L'\u200b';
  wchar_t *ws = L"\u200b";

  setlocale(LC_ALL, "en_US.UTF-8");

  printf("wcwidth: %d\n", wcwidth(wc));
  printf("wcswidth: %d\n", wcswidth(ws, 1));
}

I do see that the man page says this though:

> If a nonprintable wide character occurs among these characters, -1 is returned.

Which seems to contradict what I get. Given this line, the behavior on macOS and that the comment in the code you pasted says that U+26C4 returns -1 (while I get 2 for this character now), can there have been a change of behavior in wcswidth at some point, which was not updated in the man page?

I see the commit message for the code you pasted links to https://savannah.nongnu.org/bugs/?40115, but it doesn't contain that much info. It seems strange to me that it sets the length to 1 instead of 0 if wcswidth returns -1. Do you remember why @flashcode?

@trygveaa
Member

trygveaa commented Sep 6, 2022

By the way, in https://gitlab.com/gnachman/iterm2/-/commit/04036736f13742668037fb89fc269c9aad88f252 you say that xterm and Konsole advance the cursor on ZWS, but when I run echo -e 'a\u200bb' it prints ab with no gap. Has this also changed? That would make sense if xterm and Konsole use wcswidth from glibc and its behavior has changed. This is with xterm 372 and Konsole 22.08.0.

@gnachman
Author

gnachman commented Sep 9, 2022

> Which seems to contradict what I get. Given this line, the behavior on macOS and that the comment in the code you pasted says that U+26C4 returns -1 (while I get 2 for this character now), can there have been a change of behavior in wcswidth at some point, which was not updated in the man page?

U+26C4 will give -1 on macOS because macOS's wcwidth is horrible and probably hasn't been updated since Emoji was invented. A modern OS will give you 2, of course. So it's not that wc(s)width changed on macOS, it's that the rest of the world changed and macOS didn't :)

I suspect that xterm and Konsole's behavior depends on wcwidth or something similar and will be platform-specific.

tmux encountered similar difficulties and switched to using utf8proc on macOS. I think that's the right call—wc(s)width should be considered harmful on that platform, unfortunately. See here for their wcwidth wrapper: https://github.com/tmux/tmux/blob/master/compat/utf8proc.c#L24
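
For readers who want the general shape of a wcwidth wrapper without pulling in a library, here is a hedged sketch. It is not tmux's code (tmux delegates to utf8proc in the file linked above); the name `wcwidth_with_overrides` and the two table entries are assumptions chosen from the characters discussed in this thread.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* expose wcwidth() in <wchar.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* One override: code points first..last are forced to the given width. */
struct width_override {
    uint32_t first;
    uint32_t last;
    int width;
};

/* Illustrative entries only; a real table would be generated from Unicode data. */
static const struct width_override overrides[] = {
    { 0x200B, 0x200B, 0 },  /* zero width space: do not advance the cursor */
    { 0x26C4, 0x26C4, 2 },  /* snowman without snow: wide emoji */
};

/*
 * Hypothetical wrapper: consult the override table first, then fall back
 * to the C library's wcwidth() for everything else.
 */
int wcwidth_with_overrides(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    for (size_t i = 0; i < sizeof overrides / sizeof overrides[0]; i++) {
        if (cp >= overrides[i].first && cp <= overrides[i].last)
            return overrides[i].width;
    }
    return wcwidth((wchar_t)cp);
}
```

The point of the design is that the answers for the overridden characters no longer depend on the platform's locale data, while everything else still goes through the system function.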

@flashcode
Member

flashcode commented Nov 26, 2022

> I see the commit message for the code you pasted links to https://savannah.nongnu.org/bugs/?40115, but it doesn't contain that much info. It seems strange to me that it sets the length to 1 instead of 0 if wcswidth returns -1. Do you remember why @flashcode?

@trygveaa: yes, some comments about this code:

  • when wcswidth returns -1, the char should not be displayed at all, rather than being displayed as a space (the current behavior is a bug)
  • the comment saying wcswidth of U+26C4 is -1 is wrong: for me it's 2
  • forcing the length to 1 when it is < 0 is wrong: wcswidth returns -1 if any char of the string is nonprintable, so utf8_strlen_screen("ab\x01") returns 1 instead of 3
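
The third point can be illustrated with a per-character sum. This is only a sketch and not WeeChat's implementation: `screen_width_sketch` and its `fallback` parameter are hypothetical names, and the width to use for a nonprintable char is left to the caller.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* expose wcwidth() and wcswidth() in <wchar.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: sum wcwidth() over at most n wide chars of s.
 * A char for which wcwidth() returns -1 contributes `fallback` columns
 * instead of turning the whole result into -1, as wcswidth() does.
 */
int screen_width_sketch(const wchar_t *s, size_t n, int fallback)
{
    int total = 0;

    assert(s != NULL);
    for (size_t i = 0; i < n && s[i] != L'\0'; i++) {
        int w = wcwidth(s[i]);
        total += (w < 0) ? fallback : w;
    }
    return total;
}
```

With a fallback of 1, `screen_width_sketch(L"ab\x01", 3, 1)` yields 3, whereas `wcswidth(L"ab\x01", 3)` yields -1 for the whole string, which the quoted code then turns into 1.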

I found many other problems with unicode chars, as well as other issues opened (for example for soft-hyphens), so I'm currently writing a specification to rework this and propose a new behavior, I'll post the link here once it's ready to be shared.

@flashcode
Member

@gnachman, @trygveaa: I wrote the specification: https://specs.weechat.org/specs/2022-003-fix-unicode-display.html

Please tell me what you think about the proposed changes before I implement them.

I can make them available on a testing branch before merging into master.

@flashcode flashcode added the waiting info label Dec 3, 2022
@flashcode flashcode self-assigned this Dec 3, 2022
@flashcode flashcode added this to the 3.8 milestone Dec 3, 2022
@flashcode
Member

I pushed the branch unicode-fixes for tests: https://github.com/weechat/weechat/tree/unicode-fixes

Please ping me if you find differences with the specification or display bugs (chat and bars).

@trygveaa
Member

> • when wcswidth returns -1, the char should not be displayed at all, rather than being displayed as a space (the current behavior is a bug)
> • the comment saying wcswidth of U+26C4 is -1 is wrong: for me it's 2
> • forcing the length to 1 when it is < 0 is wrong: wcswidth returns -1 if any char of the string is nonprintable, so utf8_strlen_screen("ab\x01") returns 1 instead of 3

@gnachman said wcwidth on macOS returns -1 for U+26C4. So does this mean that this character will be stripped away on macOS now? If so, that's not good. If wcwidth on macOS works so poorly, I think you should consider using something else (this also ties into what I wrote on IRC with wcwidth being wrong for some emojis, which might cause issues depending on the terminal emulator).

@flashcode
Member

Yes, if wcwidth returns -1 for U+26C4, it will not be displayed.
This is not really a regression but rather a bug in wcwidth on macOS; in that case WeeChat cannot know how to display this char.

@trygveaa
Member

> Yes, if wcwidth returns -1 for U+26C4, it will not be displayed.
> This is not really a regression but rather a bug in wcwidth on macOS; in that case WeeChat cannot know how to display this char.

So it seems the behavior is changed from displaying it with the incorrect width (leading to render issues in some cases), to not displaying it at all. This could be considered a regression, even though the bug lies in wcwidth on macOS.

Either way, it seems the fix would be to not rely on wcwidth/wcswidth from the system libc (glibc on Linux, Apple's libc on macOS).

@flashcode flashcode added the bug label and removed the question and waiting info labels Dec 10, 2022