Buffer: unicode-width and emojis #75

Open
markus-bauer opened this issue Feb 21, 2023 · 25 comments
Labels
bug Something isn't working

Comments

@markus-bauer

I'm not a user of helix or tui. But I found this helix issue and thought I'd mention it here:
helix-editor/helix#6012

Helix uses a fork of a subset of tui. Notably the core of Buffer is basically the same (AFAIK).

I'm pretty sure that this bug is shared by everything using the tui-style Buffer.
AFAIK, the reason is that unicode-width doesn't (or can't?) report the column width for emojis. See readme and issues:
https://github.com/unicode-rs/unicode-width

I won't have anything to contribute here, or any personal interest, really.
Perhaps you should get in contact with the helix team about this, as this affects both.

@markus-bauer markus-bauer added the bug Something isn't working label Feb 21, 2023
@parasyte

parasyte commented Mar 1, 2023

unicode-width is not the right tool for computing column widths, since it depends on the font. tui won't have access to information about the font, since that is the terminal emulator's job.

To complicate matters further, not every terminal emulator handles emojis the same way. For instance, Windows Terminal is decent for the most part, but renders the "woman scientist" emoji from the unicode-width readme using 5 columns. This is of course a bug in Windows Terminal. It should only be 2 columns.

unicode-segmentation does the right thing when breaking strings into grapheme clusters. You just need a way to determine whether your terminal emulator will draw the grapheme cluster as 1 column or 2 (or 5, lol).

@mindoodoo
Member

Has this resulted in a bug or particular issue during your use?

@LeoRiether

LeoRiether commented Apr 25, 2023

Has this resulted in a bug or particular issue during your use?

It has for me. Because unicode-width returns 1, but kitty (the terminal) displays some emojis with width 2, the layout can be broken in some places. For example, if we render a block with a heart emoji on the title, the top-right corner is rendered at the wrong spot.

heart-titled block

Edit: Wezterm seems to render my example correctly...

@pascalkuthe

pascalkuthe commented Aug 29, 2023

As the author of that issue, I dug into that a bunch. I didn't have the bandwidth to finish my fix in helix. There are a bunch of issues here.

Firstly, unicode-width does not respect text representation and always uses a fixed unicode version. The problem is that while unicode 9 support (where character width last changed) can be assumed these days, unicode 15 support (support for text representation) differs between terminals. Even many terminals that support unicode 14 purposefully don't support text representation (alacritty, for example). I have written a custom crate to efficiently compute unicode width that optionally respects text representation. The problem is that there is no way to know what the current emulator expects. If I ever upstream this into helix, I would have to detect the emulator by name and offer a setting for the user to override said detection (not a good solution).

The second (much bigger) issue in tui-rs (and helix by extension) is the use of unicode segmentation. No terminal emulator except wezterm (and wezterm is backwards compatible) supports unicode segmentation (although almost all perform unicode digraph normalization... with the exception of alacritty... sigh).

The right way to segment text into terminal cells is to associate a single cell with a single unicode-width unit (really what Unicode-width reports is the number of cells). A single cell always contains at most a single non-zero width character (which is currently not the case in tui-rs since it contains a grapheme which can contain multiple non-zero width characters).
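To make the cell model described above concrete, here is a minimal sketch. It assumes a tiny hard-coded width table (`char_width`) standing in for a real one such as the unicode-width crate; `char_width` and `split_into_cells` are hypothetical names for illustration, not code from helix or tui-rs.

```rust
/// Toy stand-in for a real width table (assumption: only covers the
/// code points used in this sketch).
fn char_width(c: char) -> usize {
    match c {
        // Zero-width joiner, variation selectors, combining diacritics.
        '\u{200D}' | '\u{FE00}'..='\u{FE0F}' | '\u{0300}'..='\u{036F}' => 0,
        // A rough slice of the wide emoji blocks.
        '\u{1F300}'..='\u{1FAFF}' => 2,
        // Everything else is treated as narrow.
        _ => 1,
    }
}

/// Splits `s` into (symbol, width) pairs. Each symbol starts with exactly one
/// non-zero-width character and absorbs the zero-width characters after it.
fn split_into_cells(s: &str) -> Vec<(String, usize)> {
    let mut cells: Vec<(String, usize)> = Vec::new();
    for c in s.chars() {
        let w = char_width(c);
        if w == 0 {
            // Attach to the previous symbol; a leading zero-width char is dropped.
            if let Some((symbol, _)) = cells.last_mut() {
                symbol.push(c);
            }
        } else {
            cells.push((c.to_string(), w));
        }
    }
    cells
}
```

With this model the couple emoji 👩‍❤️‍👨 becomes three symbols (woman, heart, man) rather than one grapheme, so the internal cell grid matches an emulator that does not segment.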

I do want to get around to fixing that in our own fork of tui-rs eventually, but unicode-segmentation is sadly used much more heavily there, so it's gonna be harder to rip out.

@joshka
Member

joshka commented Aug 29, 2023

Thanks for this info - can you link to your fork?

I noticed a while back that Wez commented about their approach to this at unicode-rs/unicode-width#4 (comment):

I don't really know much about this space, but here's my attempt at dealing with this in a terminal emulator.

/// Returns the number of cells visually occupied by a sequence
/// of graphemes
pub fn unicode_column_width(s: &str) -> usize {
    use unicode_segmentation::UnicodeSegmentation;
    s.graphemes(true).map(grapheme_column_width).sum()
}

/// Returns the number of cells visually occupied by a grapheme.
/// The input string must be a single grapheme.
pub fn grapheme_column_width(s: &str) -> usize {
    // Due to this issue:
    // https://github.com/unicode-rs/unicode-width/issues/4
    // we cannot simply use the unicode-width crate to compute
    // the desired value.
    // Let's check for emoji-ness for ourselves first
    use unicode_width::UnicodeWidthStr;
    use xi_unicode::EmojiExt;
    for c in s.chars() {
        if c.is_emoji_modifier_base() || c.is_emoji_modifier() {
            // treat modifier sequences as double wide
            return 2;
        }
    }
    UnicodeWidthStr::width(s)
}

@parasyte

parasyte commented Aug 29, 2023

Yes, you actually need segmentation to present emoji widths properly. Mentioned by both the quoted comment and my previous comment in this thread.

I do not believe that "ripping out unicode-segmentation" is a good idea. In fact, it is a step backward. Terminal cells are represented by a grapheme cluster, not whatever unicode-width is doing.

@pascalkuthe

pascalkuthe commented Aug 29, 2023

I actually implemented more or less the thing you proposed above into helix and it completely breaks every single terminal emulator but wezterm. If you pack multiple non-zero characters in a single cell that would take up a single cell in the emulator you end up with a situation where the internal cell coordinates don't line up with those used in the emulator and the diffing mechanism breaks down (like ghost characters and highlights everywhere once such a character is on screen).

If you only care about wezterm and consider it a "bug" in the emulator that it doesn't support Unicode segmentation, then you can certainly go that route. Most emulators don't actually consider this a bug (try opening an issue with alacritty about this). I also kind of agree with this stance, since languages that operate on multicharacter graphemes don't really make sense in a monospaced terminal anyway (unicode-rs/unicode-width#27 (comment)). This is not my project so I cannot say what the goals here are, but for helix only supporting wezterm would be absolutely unacceptable.

You could probably find some workaround where you do unicode segmentation and then discard the second non-zero width character but that would incur a bunch of overhead and I don't really see the point. Emojis being displayed as individual codepoints is how every tui not written in rust works so at least for helix I am perfectly happy if we match that behavior (for example try looking at some multi codepoint emojis in vim/nvim).

I stopped digging here at some point, but what I observed was that grapheme width emoji representation like ✔️ (in emulators that support it) causes visual artifacts in helix. This would be mostly fixed by using my crate for unicode width. You can find my prototype in https://github.com/pascalkuthe/grapheme-width-rs but it's not really tested in real world usage yet and I haven't published it yet (it's a full rewrite, not a fork, the Unicode table works slightly differently to save some extra space/get better performance).

However, other tui applications (like vim) don't have these issues. I am pretty sure my investigation back then showed that not doing grapheme segmentation is what lets vim at least avoid significant visual artifacts. Even if it's not, the fact that emojis show up as partially black boxes is definitely caused by segmentation. Considering that grapheme segmentation has quite a bit of additional overhead (while widths need to be computed either way), my conclusion was to rip that out of the helix rendering backend when I get around to that.

@parasyte

parasyte commented Aug 29, 2023

As a Windows Terminal user, I can tell you that the people working on that emulator concur that it is a "bug" that it doesn't support Unicode segmentation (or at least ZWJ emojis -- I can't speak for all uses of Unicode segmentation, which includes a lot of languages that require it).

For instance, microsoft/terminal#8000 includes a test case for a character (single codepoint) which is intended to be rendered with 12 cells. UAX#11 (and by extension unicode-width) only gives you answers for "narrow and wide", so it is in fact not applicable. This is the reason that the term "M:N" appears in the linked issues; M characters need to be mapped to N cells. And it implies not only Unicode segmentation, but also an extra layer that answers the question of how many cells this grapheme cluster occupies.

There are other gems in their issue tracker:

Supporting broken emulators because that's what they offer is a poor choice. You might instead opt to intentionally not support grapheme clusters that are not already widely supported in emulators, for some heuristic of support. Just return an error or panic when you encounter grapheme clusters that are not known to be compatible with today's state of terminal emulators. This would be the conservative approach.

But if you want to do the right thing, you need to skate to where the puck is going. At least Windows Terminal is keen to fix this bug, as they have signaled their intent as far as I can tell. I'm surprised to hear that others are less interested.

@pascalkuthe

pascalkuthe commented Aug 29, 2023

I don't follow Windows Terminal, but on the unix side of things no terminal really supports this (wezterm somewhat, but its character width is basically implementation defined if you do segment). There is not even an agreed-upon width for characters, and in general very little agreement about text width: https://gitlab.freedesktop.org/terminal-wg/specifications/-/merge_requests/8. Really the only agreement seems to be: once you start taking unicode seriously, the width of a grapheme depends on the font. A tui application has no idea about the font, so it's hard to define a common notion of width.

So every terminal that does support that is doing its own thing right now with absolutely no standards. So you are going to aim at a fully moving target (this is btw also the reason why unicode-width doesn't support emoji presentation selectors; their stance is similar: beyond the single codepoint width definition in unicode there is no notion of canonical unicode width). What I consider important is that there are no visual artifacts in the rest of the UI. If some emoji doesn't look as nice, that is secondary compared to having visual artifacts all over the screen.

Since every emulator is reasonably backwards compatible with applications that don't unicode segment (and likely will stay that way for a looong time), using the traditional approach is much more stable. You could put segmentation behind a feature flag or something. This is not my crate so I don't care either way, but for helix I plan to remove unicode segmentation from the lower level rendering code.

@parasyte

It seems to me that rendering the TUI when it contains "unknown" grapheme clusters by discarding non-zero-width characters (or even just using the Unicode fallback character) is the least disruptive, even if you disagree with the overhead and don't see the point. Just because some editors render ZWJ emojis with multiple codepoints doesn't mean it's a pleasant UX. See https://lord.io/text-editing-hates-you-too/#emoji-modifiers for some particularly egregious examples of this problem in action!

@joshka
Member

joshka commented Aug 29, 2023

@pascalkuthe wrote:

This would be mostly fixed by using my crate for unicode width. You can find my prototype in https://github.com/pascalkuthe/grapheme-width-rs but it's not really tested in real world usage yet and I haven't published it yet (it's a full rewrite, not a fork, the Unicode table works slightly differently to save some extra space/get better performance).

If we were to switch from unicode-width to this in ratatui, it sounds like there would be a need to select between Unicode9 or Unicode14 mode in end users' apps in order to render correctly for kitty and Windows Terminal. Have you got any ideas on config / application patterns that would make that work well?

Edit: looking at the following suggests that this might still not be quite what we need to render correctly. Why is 5/6 reasonable here as opposed to 2?

https://github.com/pascalkuthe/grapheme-width-rs/blob/aef038e78e41f3ec33da04d985aeac1476e2fe02/src/test.rs#L51-L58

#[test]
fn emoji_representation() {
    // its annoying but we don't grapheme segment so each emoji must be calcultade indivudlaly
    assert_eq!(str_width("👩‍❤️‍👨", Unicode9), 5);
    assert_eq!(str_width("👩‍❤️‍👨", Unicode14), 6);
    assert_eq!(str_width("✔️", Unicode9), 1);
    assert_eq!(str_width("✔️", Unicode14), 2);
}

@pascalkuthe

pascalkuthe commented Aug 29, 2023

@pascalkuthe wrote:

This would be mostly fixed by using my crate for unicode width. You can find my prototype in pascalkuthe/grapheme-width-rs but it's not really tested in real world usage yet and I haven't published it yet (it's a full rewrite, not a fork, the Unicode table works slightly differently to save some extra space/get better performance).

If we were to switch from unicode-width to this in ratatui, it sounds like there would be a need to select between Unicode9 or Unicode14 mode in end users' apps in order to render correctly for kitty and Windows Terminal. Have you got any ideas on config / application patterns that would make that work well?

Edit: looking at the following suggests that this might still not be quite what we need to render correctly. Why is 5/6 reasonable here as opposed to 2?

pascalkuthe/grapheme-width-rs@aef038e/src/test.rs#L51-L58

#[test]
fn emoji_representation() {
    // its annoying but we don't grapheme segment so each emoji must be calcultade indivudlaly
    assert_eq!(str_width("👩‍❤️‍👨", Unicode9), 5);
    assert_eq!(str_width("👩‍❤️‍👨", Unicode14), 6);
    assert_eq!(str_width("✔️", Unicode9), 1);
    assert_eq!(str_width("✔️", Unicode14), 2);
}

As I said in the discussion above, unicode grapheme width is not really standardized (for example, whether this combined emoji has width 1, 2, or 3 is not defined by any standard; really, the width depends on the font). A width of 5/6 really is the correct thing from the perspective of every terminal emulator (for backward compatibility reasons). The reason emojis show up as partially black boxes right now is that you actually overlap them with empty cells (created with Cell::reset). This needs to be fixed in the rendering code, but that is not at all easy. The rendering/diffing code actually (correctly) works just like most emulators would, by using terminal cells, so just putting the entire symbol into the first cell doesn't really work. If you use the segmentation I described above (one non-zero-width char per cell) you get correct results.
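The 5/6 figures in the quoted test can be reproduced by summing per-codepoint widths. The following is a minimal sketch only. It assumes a tiny hard-coded table in place of a real Unicode width table, and it assumes that "Unicode14 mode" simply promotes a narrow character directly followed by U+FE0F to two cells. `Mode`, `base_width`, and `str_width_sketch` are hypothetical names, not the API of grapheme-width-rs.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Unicode9,
    Unicode14,
}

/// Toy per-codepoint width (assumption: covers only this example's characters).
fn base_width(c: char) -> usize {
    match c {
        '\u{200D}' => 0,                 // zero-width joiner
        '\u{1F300}'..='\u{1FAFF}' => 2,  // rough slice of wide emoji
        _ => 1,                          // e.g. U+2764 heart, U+2714 check mark
    }
}

/// Sums widths codepoint by codepoint, without grapheme segmentation.
fn str_width_sketch(s: &str, mode: Mode) -> usize {
    let mut total = 0;
    let mut prev_width = 0; // width of the immediately preceding char
    for c in s.chars() {
        if c == '\u{FE0F}' {
            // Emoji presentation selector: zero width by itself, but in
            // Unicode14 mode it widens a narrow predecessor to two cells.
            if mode == Mode::Unicode14 && prev_width == 1 {
                total += 1;
                prev_width = 2;
            }
            continue;
        }
        let w = base_width(c);
        total += w;
        prev_width = w;
    }
    total
}
```

For 👩‍❤️‍👨 this adds 2 (woman) + 0 (joiner) + 1 (heart) + 0 (joiner) + 2 (man) = 5, and the selector after the heart contributes one more cell in Unicode14 mode, giving 6.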

A short example, this code https://github.com/helix-editor/helix/blob/40d7e6c9c85d4f1ce2345f6e9d59fc091243124d/helix-tui/src/buffer.rs#L412 (you have almost the same code in this repo) needs to be turned into the following:

        let mut chars = string.chars().peekable();

        while let Some(c) = chars.next() {
            let width = c.width().unwrap_or(0);
            if width == 0 {
                continue;
            }
            // `x_offset + width > max_offset` could be integer overflow on 32-bit machines if we
            // change dimensions to usize or u32 and someone resizes the terminal to 1x2^32.
            if width > max_x_offset.saturating_sub(x_offset) {
                break;
            }
            let mut s = c.to_string();
            while let Some(&c) = chars.peek() {
                if c.width().unwrap_or(0) != 0 {
                    break;
                }
                s.push(c);
                chars.next();
            }

            self.content[index].set_symbol(&s);
            self.content[index].set_style(style);
            // Reset following cells if multi-width (they would be hidden by the grapheme),
            for i in index + 1..index + width {
                self.content[i].reset();
            }
            index += width;
            x_offset += width;
        }

helix before that change (with the char above):

image

and after that change:

image

This really relates to what I mentioned above. Unicode segmentation in a tui doesn't really make sense when most emulators don't support it (and, for example, kitty and wezterm don't agree on the width of some emoji). If you really do want to support that use case then you would need to cut off every grapheme at the second non-zero-width char (but in that case my crate would also always yield a width <= 2). Like I said, unicode-width does the correct thing from the perspective of the unicode standard and almost all terminal emulators. My crate just (optionally) supports respecting the grapheme representation character (which is also part of the unicode standard; some terminals support it and some don't).

To specify a bit further, if you look at kitty with that patch you actually get a correct emoji that is just too wide:

image

You can't actually do better here. If you decrease the width to two this code will be buggy and lead to visual artifacts even on kitty despite the fact that it supports displaying these emojis. The segmentation is simply handled at the font level (but has no effect on cell width).

Just to demonstrate this further if you actually paste that emoji into your shell and print it with echo it will also still have a width of 6 on kitty:

image

wezterm sadly actually changes cell width. I think that likely needs to be considered a bug though (it is the only emulator in existence that behaves that way, and it will break a ton of existing programs; for example, kakoune is broken by this).

Detection is a bit of a pain; you can do it using TERM_PROGRAM (hardcoding a couple of emulators). There is also an escape sequence to configure the unicode version iirc, but many emulators don't support it. Ultimately the application will need to expose some kind of setting to the user to deal with edge cases.
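A minimal sketch of that detection order: user setting first, then a TERM_PROGRAM lookup, then a Unicode 9 default. Everything here is hypothetical (`WidthMode`, `pick_width_mode`, and the lookup table are illustrative names). The table is supplied by the caller because the thread does not establish which emulator names map to which mode.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum WidthMode {
    Unicode9,
    Unicode14,
}

/// Picks a width mode. Priority: explicit user setting, then a match of the
/// TERM_PROGRAM value against a caller-supplied table, then Unicode 9.
fn pick_width_mode(
    user_override: Option<WidthMode>,
    term_program: Option<&str>,
    known_emulators: &[(&str, WidthMode)],
) -> WidthMode {
    if let Some(mode) = user_override {
        return mode;
    }
    term_program
        .and_then(|name| {
            known_emulators
                .iter()
                .find(|(known, _)| known.eq_ignore_ascii_case(name))
                .map(|(_, mode)| *mode)
        })
        // Unicode 9 widths are the baseline that can be assumed these days.
        .unwrap_or(WidthMode::Unicode9)
}
```

In an application the second argument would come from something like `std::env::var("TERM_PROGRAM").ok()`, passed via `as_deref()`. Taking it as a parameter keeps the function testable without touching the environment.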

@joshka
Member

joshka commented Aug 29, 2023

Iterm seems to get this right:
image

Wezterm seems to also:
image

@pascalkuthe

pascalkuthe commented Aug 29, 2023

I guess I don't consider this "right". I would consider this a bug in those emulators since it breaks existing applications (like kakoune). Maybe it's possible to work around this in the rendering code somehow but I don't think so. Seems like a huge pain.

I actually searched through the wezterm repo but there doesn't seem to be any past discussion about this. I might look into filing an issue asking to move to the kitty behavior, since it still looks better than alacritty, I guess, while not actually breaking anything.

There is no standardised notion of grapheme width. The unicode standard says "you need to look at the font to determine grapheme width, we only define width for codepoints". So if I implement an emulator that uses width 3 here it would be just as correct. You could agree that emojis always have width 2 but what about other scripts? Some graphemes can get really wide or even take up multiple lines. What size do those have? Once there is an agreed-upon standard for what width a grapheme actually has in a terminal, it makes sense to support it, but until then it's essentially implementation defined.

Like I said, an alternative could be to discard the rest of a grapheme starting at the second non-zero-width char. That's kind of hacky (different emojis would show up as the same thing), but it is probably the most portable approach.

You can experiment with what you think is the right approach for your crate regarding segmentation. I would advise against breaking all the traditional emulators (alacritty, zellij, termux, vim's builtin terminal, GNOME Terminal, whatever the KDE terminal is called, xterm, ...) and instead would hope to see wezterm and iTerm become more compatible (considering wez's past stance on backwards compatibility he may be open to that; wezterm used to default to unicode14, now it defaults to unicode9 for backwards compatibility). Either way, the crate I linked provides an optionally unicode14-compatible char width calculation that is exactly identical to what wezterm/termwiz does if you unicode segment first, calculate the width of the grapheme, and then cap the result to 2.
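A minimal sketch of that truncation, assuming the input is already a single grapheme cluster (segmentation itself is out of scope here) and using a toy width table. `char_width` and `truncate_cluster` are hypothetical names.

```rust
/// Toy stand-in for a real width table (assumption: only covers the
/// code points used in this sketch).
fn char_width(c: char) -> usize {
    match c {
        '\u{200D}' | '\u{FE00}'..='\u{FE0F}' => 0, // joiner, variation selectors
        '\u{1F300}'..='\u{1FAFF}' => 2,            // rough slice of wide emoji
        _ => 1,
    }
}

/// Keeps the first non-zero-width char of a cluster plus the zero-width chars
/// directly after it, and drops everything from the second visible char on.
fn truncate_cluster(cluster: &str) -> String {
    let mut out = String::new();
    let mut seen_visible = false;
    for c in cluster.chars() {
        if char_width(c) > 0 {
            if seen_visible {
                break;
            }
            seen_visible = true;
        }
        out.push(c);
    }
    // A joiner with nothing left to join is dropped as well.
    while out.ends_with('\u{200D}') {
        out.pop();
    }
    out
}
```

Both 👩‍❤️‍👨 and 👩‍🔬 come out as a plain 👩, which is the "different emojis would show up as the same thing" cost mentioned above. A variation-selector sequence such as ✔️ passes through unchanged.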

@joshka
Member

joshka commented Aug 30, 2023

I guess I don't consider this "right". I would consider this a bug in those emulators since it breaks existing applications (like kakoune). Maybe it's possible to work around this in the rendering code somehow but I don't think so. Seems like a huge pain.

I don't think I understand. I think you're saying almost everything is wrong, so it's better to be consistently wrong than to do the right thing?

@parasyte

parasyte commented Aug 30, 2023

That's my takeaway, unfortunately. I went through some of the kakoune issues, and found e.g. mawww/kakoune#1447 where the maintainers indicate they aspire to add segmentation support.

If the goal is to "make it work right now at all costs" then claiming 👩‍❤️‍👨 occupies 5 or 6 cells is not only logically incorrect, but also doesn't align with the intentions of developers of the applications cited as motivation for doing this in the first place.

Like I said, an alternative could be to discard the rest of a grapheme starting at the second non-zero-width char. That's kind of hacky (different emojis would show up as the same thing), but it is probably the most portable approach.

IMHO, this is your best option and fits neatly with the "at all costs" predicate. It is also a compromise that I conceded in a comment above. Your rendering will not be perfect, but it will be both backward and forward compatible. The latter is something I believe you are dismissing unfairly.

@pascalkuthe

pascalkuthe commented Aug 30, 2023

I guess I don't consider this "right". I would consider this a bug in those emulators since it breaks existing applications (like kakoune). Maybe it's possible to work around this in the rendering code somehow but I don't think so. Seems like a huge pain.

I don't think I understand. I think you're saying almost everything is wrong, so it's better to be consistently wrong than to do the right thing?

I am saying the approach currently used by most tui apps and emulators is based on the unicode standard (which is what unicode-width and my crate implement). Therefore everyone can agree on it (mostly, except for mismatches in unicode version).

Kitty does the right thing by adding support for improved rendering while keeping cell width unchanged and hence not breaking this standardized notion of width.

The approach used by wezterm and iTerm is not based on any standardized notion of width. They simply chose to limit the width of a unicode grapheme (calculated the same way other emulators do) to two. That limit is arbitrarily chosen by the authors of those terminals (I guess because it looks good for emojis, but for other graphemes this width is not correct at all). Nowhere in the unicode standard will you find any definition that would suggest this width.

This change to the standard definition of width in a terminal breaks existing applications (since tui and emulators must agree on it) and so I consider that behavior wrong.

I consider it more important that terminal and emulator agree on a single notion of width than what that notion of width actually is. If they don't agree, every single character in that line will be prone to rendering artifacts (wrong char in the wrong place, ghost colors, ...), which is a much bigger problem than a bit of empty space behind an emoji.

@pascalkuthe

pascalkuthe commented Aug 30, 2023

That's my takeaway, unfortunately. I went through some of the kakoune issues, and found e.g. mawww/kakoune#1447 where the maintainers indicate they aspire to add segmentation support.

If the goal is to "make it work right now at all costs" then claiming 👩‍❤️‍👨 occupies 5 or 6 cells is not only logically incorrect, but also doesn't align with the intentions of developers of the applications cited as motivation for doing this in the first place.

Like I said, an alternative could be to discard the rest of a grapheme starting at the second non-zero-width char. That's kind of hacky (different emojis would show up as the same thing), but it is probably the most portable approach.

IMHO, this is your best option and fits neatly with the "at all costs" predicate. It is also a compromise that I conceded in a comment above. Your rendering will not be perfect, but it will be both backward and forward compatible. The latter is something I believe you are dismissing unfairly.

As a maintainer of a text editor (which is quite similar to kakoune), I think you misunderstand that issue. They are talking about keeping the grapheme as the smallest unit of editing. This is absolutely the right thing to do and what helix does too, but it has nothing to do with the way the rendering works.

I pinged wez about this and he was interested in discussing it on GH, so perhaps there will be some movement on the emulator side.

@parasyte

parasyte commented Aug 30, 2023

Let me be clear, I don't misunderstand the issue. I have been saying since the beginning of the thread that there is a missing piece of the puzzle, and you seem to agree with as much:

There is no standardised notion of grapheme width. The unicode standard says "you need to look at the font to determine grapheme width, we only define width for codepoints".

Grapheme clusters are only half of the solution, but they are a necessary half. Grapheme clusters are not only the smallest unit for editing but also the smallest unit which can populate a terminal cell. This is the "M" in the "M:N" terminology. The missing half is, for any given grapheme cluster, declaring how many cells it occupies. This is the "N". While this piece is not currently standardized (and I don't know what the current state of affairs looks like) it isn't always going to be the case.

In the meantime, handling the least common denominator is, as described multiple times already, either dropping additional non-zero-width characters from the grapheme cluster or replacing the entire cluster with the fallback character. Regardless, ripping out segmentation is absolutely the wrong thing to do.

I consider it more important that terminal and emulator agree on a single notion of width than what that notion of width actually is.

The fact that none of them currently agree means that the only way to get portable output across all disagreeing implementations is to just not output the grapheme clusters in the first place. Which is what "removing additional non-zero-width characters" is all about.

@joshka
Member

joshka commented Sep 16, 2023

It does sound like @parasyte and @pascalkuthe both have some expertise beyond mine in this area. There's a lot of depth in this issue about what we should do to fix these issues, but I'm not sure that I can really discern an obvious next step to changing unicode rendering in a way that satisfies:

  • Application users who want things to look pretty
  • Application writers who want their users to not be surprised when things don't render "right"
  • Terminal emulator writers who want a consistent target to aim at
  • Adherents to strict uncompromising standards
    (Noting that some people fall into more than one of these categories)

Do you have some concrete suggestions that you can agree on what would be the way to fix this?
Perhaps we could simplify some of the narrative with some simple visual examples in the form of a table: e.g.:

| String | Character | Display as | Justification |
| --- | --- | --- | --- |
| `"\u{1234}\u{5678}"` | 👩‍❤️‍👨 | 2 characters wide | Rule 1. It's an emoji |

Rule 1.... If a ....

@kdheepak
Collaborator

For more context for future readers, from this article https://hsivonen.fi/string-length/:

This is because the base emoji is wide (2), the combining skin tone modifier is also wide (2), the male sign is counted as narrow (1), and the zero-width joiner and the variation selector are treated as control characters that don’t count towards width. Obviously, this is not the answer that we want. The answer we want is 2. Ideas that come to mind immediately, such as only counting the width of the first character in an extended grapheme cluster or taking the width of the widest character in an extended grapheme cluster, don’t work, because flag emoji consist of two regional indicator symbol letter characters both of which have East Asian Width of Neutral (i.e. they are counted as narrow but are not marked as narrow, because they are considered to exist outside the domain of East Asian typography). I’m not aware of any official Unicode definition that would reliably return 2 as the width of every kind of emoji. 😭

If you really must estimate display size without running text layout with a font, whether the extended grapheme cluster count or the East Asian Width of the string works better depends on context.
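The arithmetic in the quoted passage can be checked with a small sketch. It assumes a toy width table that follows the quote (regional indicators counted as narrow, joiner and variation selector as zero), and it assumes the emoji under discussion is the article's man facepalming with a skin tone modifier. `toy_width` and the three helper names are hypothetical.

```rust
/// Toy per-codepoint width following the quoted description.
fn toy_width(c: char) -> usize {
    match c {
        '\u{200D}' | '\u{FE0F}' => 0,       // joiner and variation selector
        '\u{1F1E6}'..='\u{1F1FF}' => 1,     // regional indicators: narrow
        '\u{1F300}'..='\u{1FAFF}' => 2,     // rough slice of wide emoji
        _ => 1,                             // e.g. U+2642 male sign
    }
}

/// Sum of the widths of every code point in the cluster.
fn summed_width(cluster: &str) -> usize {
    cluster.chars().map(toy_width).sum()
}

/// Heuristic 1: width of the first code point only.
fn first_char_width(cluster: &str) -> usize {
    cluster.chars().next().map(toy_width).unwrap_or(0)
}

/// Heuristic 2: width of the widest code point.
fn widest_char_width(cluster: &str) -> usize {
    cluster.chars().map(toy_width).max().unwrap_or(0)
}
```

For the facepalm sequence (U+1F926 U+1F3FC U+200D U+2642 U+FE0F) the sum is 2 + 2 + 0 + 1 + 0 = 5, while both heuristics give the desired 2. For a flag such as U+1F1EB U+1F1EE both heuristics give 1, which is why neither works as a general rule.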

It seems like @pascalkuthe is suggesting keeping the smallest unit of editing (e.g. what happens when you press the right arrow key in the terminal) a grapheme, i.e. using unicode-segmentation but using unicode-width for display? I have to think about that more.

I think the only thing that is clear to me is that we have to use something like this for emojis:

    // Due to this issue:
    // https://github.com/unicode-rs/unicode-width/issues/4
    // we cannot simply use the unicode-width crate to compute
    // the desired value.
    // Let's check for emoji-ness for ourselves first
    use xi_unicode::EmojiExt;
    for c in s.chars() {
        if c.is_emoji_modifier_base() || c.is_emoji_modifier() {
            // treat modifier sequences as double wide
            return 2;
        }
    }

Do we know for a fact that all emojis will be displayed as 2 wide in a terminal? Will that work in all terminals?

@joshka joshka changed the title unicode-width and emojis Buffer: unicode-width and emojis Sep 28, 2023
@joshka
Member

joshka commented Oct 17, 2023

@joshka
Member

joshka commented Oct 29, 2023

I just noticed a new crate (released 5 days ago): https://crates.io/crates/unicode-display-width by @jameslanska which might be a good "someone else's problem" type solution to the issue at hand.

@kdheepak
Collaborator

I thought this was relevant and interesting:

https://www.jeffquast.com/post/ucs-detect-test-results/

@Benjamin-L

I just noticed a new crate (released 5 days ago): https://crates.io/crates/unicode-display-width by @jameslanska which might be a good "someone else's problem" type solution to the issue at hand.

From the README:

Legacy text rendering engines do not support all modern Unicode features, so the rendered width of some text may bear little resemblance to the notional result returned by Unicode Display Width. This includes vim, emacs, most terminal emulators, and most shells.

From https://github.com/jameslanska/unicode-display-width/blob/0300977b7fe58a8b35644a2a5b117f75d41015b6/docs/editor_choice.md

Most shells including bash and zsh and most terminal emulators do not support the zero width joiner.

If your application needs the width as rendered by any of these systems, DO NOT use this crate. This project is more suited towards editors that wrap a web engine such as Chromium with Electron (e.g. VS Code).


As far as I know, the only approach to this problem that's portable across a wide range of terminal emulators is to use the ESC[6n escape to query the cursor position after printing. This is super cursed, a pain to implement, and has performance impact. A variant of this trick that might be simpler is using the report-position escape at application startup to probe the behavior of the current terminal emulator for various strings and then use that information to do formatting later without having to get feedback from the terminal each time.
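A minimal sketch of the parsing half of that probe, under the assumption that the application has already put the terminal in raw mode, written the query, and read the reply as a string (the I/O is omitted). `parse_cursor_report` and `measured_width` are hypothetical names. The `ESC[6n` request and the `ESC[row;colR` reply format are the standard cursor position report.

```rust
/// Request that asks the terminal to report the cursor position.
const CURSOR_POSITION_QUERY: &str = "\x1b[6n";

/// Parses a reply of the form ESC [ row ; col R into 1-based (row, col).
fn parse_cursor_report(reply: &str) -> Option<(u16, u16)> {
    let body = reply.strip_prefix("\x1b[")?.strip_suffix('R')?;
    let (row, col) = body.split_once(';')?;
    Some((row.parse::<u16>().ok()?, col.parse::<u16>().ok()?))
}

/// Number of columns the terminal advanced while printing a probe string,
/// given the reported column before and after. Returns None if the cursor
/// moved backwards, e.g. because the line wrapped.
fn measured_width(col_before: u16, col_after: u16) -> Option<u16> {
    col_after.checked_sub(col_before)
}
```

The startup variant would print each probe string at a known column, send `CURSOR_POSITION_QUERY`, parse the reply, and cache the measured width per string so that later layout does not need a round trip to the terminal.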
