Incorrect marker width calculation of Unicode symbols can cause markers to not be displayed #287

kevinstadler · 2021-03-15T17:12:03Z

Describe the bug

Emoji marker display works out of the box, and normally emojis are displayed correctly, with most taking up the width of 2 ASCII characters each (I'm on GNU bash, version 5.0.16(1)-release (x86_64-apple-darwin17.7.0)).

However, when the marker column of a report contains emoji markers but no ASCII-markers (or column header label) wider than 1 character, then the required column width is incorrectly calculated to be 1 character wide, which does not leave enough space for the emoji markers to be displayed.

(I've only tested this on a small number of emojis but presumably the bug also applies to all other non-width-1 Unicode characters/symbols.)

To Reproduce

Open any report using the following marker config

[marker]
enabled = True
header_label=
require_color = False
# just override all (multi-character) defaults with a single emoji to be able to reproduce the bug on most reports
active.label =💡
blocked.label =💡
blocking.label =💡
completed.label =💡
deleted.label =💡
due.label =💡
#due.today.label =💡
keyword.label =💡
overdue.label =💡
project.label =💡
project.none.label =💡
recurring.label =💡
scheduled.label =💡
tag.label =💡
tag.none.label =💡

Expected behavior

Emoji markers should be displayed by allocating a wide enough marker column based on a correct calculation of the marker width (which might have to take into account properties of the console/font set).

Suggested fix

The mis-calculation happens in line 38 here, which assumes that every String character is exactly one display character wide:

vit/vit/formatter/markers.py

Lines 34 to 41 in 8fdebc3

    
           def add_label(self, color, label, width, text_markup): 
        
               if self.color_required(color) or not label: 
        
                   return width, text_markup 
        
               width += len(label) 
        
               text_markup += [(color, label)] 
        
               return width, text_markup

Python-Curses doesn't seem to provide a way to perform a correct width calculation of Unicode symbols, so solving this might require recourse to an external library dependency (in particular https://github.com/jquast/wcwidth)

The text was updated successfully, but these errors were encountered:

thehunmonkgroup · 2021-03-15T18:06:03Z

Well this would have never occurred to me :)

Can you try replacing

vit/vit/formatter/markers.py

Line 38 in 8fdebc3

width += len(label)

with this:

width += len(label.encode("utf-8"))

My quick tests indicate that might over-report the length of emojis for display purposes, but it does seem to reliably report a length of 1 for the common ASCII characters, and I'm guessing it would at least fix the problem of the column being too narrow.

If that works, we can look at a little more logic to triangulate the best display width for emojis -- I really don't want to add a dependency to VIT for something that's probably a minor and rarely used feature.

kevinstadler · 2021-03-16T07:28:37Z

Well this would have never occurred to me :)

I'm not surprised, it's great that emojis are even supported without any extra hassle! I'm only just getting started with vit/task but I already feel like it's gonna change my life hehehe.

So len(c.encode("utf-8")) seems to just give the byte-length of the characters which often results in overestimates:

>>> cs = ['a', '(', 'ä', 'ā', '中', '💡']
>>> [len(c) for c in cs]
[1, 1, 1, 1, 1, 1]
>>> [len(c.encode("utf-8")) for c in cs]
[1, 1, 2, 2, 3, 4]

It looks like there is also a built-in east_asian_width(char) function in unicodedata which, despite its name, correctly reports emojis as wide as well:

>>> from unicodedata import east_asian_width

>>> [east_asian_width(c) for c in cs]
['Na', 'Na', 'N', 'A', 'W', 'W']

>>> [east_asian_width(c) == 'W' for c in cs]
[False, False, False, False, True, True]

So based on this information one could just add one extra width for every wide character in the label like so:

width += len(label) + [east_asian_width(c) for c in label].count('W')

If you think the unicodedata import is not worth the overhead (I agree this is a rare use case), one could just add a hack that assumes that any character that comes after the lowest extra-wide unicode character value (0x1100) is wide. This would still falsely trigger on many zero-width diacritic markers and other characters but results in less of an overestimate than using the utf-8 encoding length:

>>> [ord(c) >= 0x1100 for c in cs]
[False, False, False, False, True, True]

So a minimal, import-free hack solution could be:

width += len(label) + len([c for c in label if ord(c) >= 0x1100])

thehunmonkgroup · 2021-03-16T15:34:33Z

I wasn't so worried about importing unicodedata, it was more the overhead of all those lookups to the unicode database -- couldn't really find anything in a brief search about how performant those lookups are.

I decided to create a simple test, where I emulated the all report with only an ID column, and that generated plenty of markers. Using the approach with unicodedata didn't seem to add any appreciable overhead to that rather expensive build, so I went ahead and committed the most robust solution to this issue.

If you can find any info on the performance of those lookups, or want to construct a more robust test, I'd love to hear what you find!

kevinstadler · 2021-03-17T06:01:41Z

I doubt there's a huge overhead for character type lookup, wcwidth for example has an ordered list of ~1000 unicode block number ranges (each given by a pair of 4 byte numbers) and does a binary search over them, so that's essentially 9 int32 comparisons to determine whether a character is wide or not. I can imagine that the Python built-in does nothing more than a single direct array index lookup from the character table that's stored on disk (which would presumably happen anyway during the actual rendering process).

Already found a minor issue with some symbols (such as 🅱️) whose 'East Asian width' is considered to be 'ambiguous' (i.e. they are single or double width depending on the typeface). The built-in method returns 'A' instead of 'W' for them, which messes up the column alignment, but to be honest they seem to be pretty rare so unless somebody else runs up against this problem I think the current solution is great as it works for almost all cases 👍

thehunmonkgroup · 2021-03-17T15:29:27Z

@kevinstadler thanks again for all your work on this! BTW, if you're loving TaskWarrior/VIT, you might want to check out https://github.com/thehunmonkgroup/onenote as well -- I wrote that out of necessity, and can't live w/o it now!

tbabej · 2021-03-18T00:55:06Z

Related: In TW itself, we recently had to update the unicode width character determination to support newest emojis :) GothenburgBitFactory/taskwarrior#2333

kevinstadler · 2021-03-18T05:23:09Z

@thehunmonkgroup cool, I'll have a look! Is there a user group / chat for vit somewhere? I'm having trouble getting vit to use the taskwarrior report coloring, and the parameters in the [color] section of the config don't seem to be doing anything (pretty sure it's just me though and not a bug)

thehunmonkgroup closed this as completed in 1d59db8 Mar 16, 2021

thehunmonkgroup mentioned this issue Jul 14, 2023

When using emojis as markers the report becomes misaligned #342

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect marker width calculation of Unicode symbols can cause markers to not be displayed #287

Incorrect marker width calculation of Unicode symbols can cause markers to not be displayed #287

kevinstadler commented Mar 15, 2021

thehunmonkgroup commented Mar 15, 2021

kevinstadler commented Mar 16, 2021 •

edited

Loading

thehunmonkgroup commented Mar 16, 2021

kevinstadler commented Mar 17, 2021

thehunmonkgroup commented Mar 17, 2021

tbabej commented Mar 18, 2021

kevinstadler commented Mar 18, 2021

Incorrect marker width calculation of Unicode symbols can cause markers to not be displayed #287

Incorrect marker width calculation of Unicode symbols can cause markers to not be displayed #287

Comments

kevinstadler commented Mar 15, 2021

thehunmonkgroup commented Mar 15, 2021

kevinstadler commented Mar 16, 2021 • edited Loading

thehunmonkgroup commented Mar 16, 2021

kevinstadler commented Mar 17, 2021

thehunmonkgroup commented Mar 17, 2021

tbabej commented Mar 18, 2021

kevinstadler commented Mar 18, 2021

kevinstadler commented Mar 16, 2021 •

edited

Loading