Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect marker width calculation of Unicode symbols can cause markers to not be displayed #287

Closed
kevinstadler opened this issue Mar 15, 2021 · 7 comments

Comments

@kevinstadler
Copy link

Describe the bug

Emoji marker display works out of the box, and normally emojis are displayed correctly, with most taking up the width of 2 ASCII characters each (I'm on GNU bash, version 5.0.16(1)-release (x86_64-apple-darwin17.7.0)).

However, when the marker column of a report contains emoji markers but no ASCII-markers (or column header label) wider than 1 character, then the required column width is incorrectly calculated to be 1 character wide, which does not leave enough space for the emoji markers to be displayed.

(I've only tested this on a small number of emojis but presumably the bug also applies to all other non-width-1 Unicode characters/symbols.)

To Reproduce

Open any report using the following marker config

[marker]
enabled = True
header_label=
require_color = False
# just override all (multi-character) defaults with a single emoji to be able to reproduce the bug on most reports
active.label =💡
blocked.label =💡
blocking.label =💡
completed.label =💡
deleted.label =💡
due.label =💡
#due.today.label =💡
keyword.label =💡
overdue.label =💡
project.label =💡
project.none.label =💡
recurring.label =💡
scheduled.label =💡
tag.label =💡
tag.none.label =💡

Expected behavior

Emoji markers should be displayed by allocating a wide enough marker column based on a correct calculation of the marker width (which might have to take into account properties of the console/font set).

Suggested fix

The mis-calculation happens in line 38 here, which assumes that every String character is exactly one display character wide:

def add_label(self, color, label, width, text_markup):
if self.color_required(color) or not label:
return width, text_markup
width += len(label)
text_markup += [(color, label)]
return width, text_markup

Python-Curses doesn't seem to provide a way to perform a correct width calculation of Unicode symbols, so solving this might require recourse to an external library dependency (in particular https://github.com/jquast/wcwidth)

@thehunmonkgroup
Copy link
Member

Well this would have never occurred to me :)

Can you try replacing

width += len(label)
with this:

width += len(label.encode("utf-8"))

My quick tests indicate that might over-report the length of emojis for display purposes, but it does seem to reliably report a length of 1 for the common ASCII characters, and I'm guessing it would at least fix the problem of the column being too narrow.

If that works, we can look at a little more logic to triangulate the best display width for emojis -- I really don't want to add a dependency to VIT for something that's probably a minor and rarely used feature.

@kevinstadler
Copy link
Author

kevinstadler commented Mar 16, 2021

Well this would have never occurred to me :)

I'm not surprised, it's great that emojis are even supported without any extra hassle! I'm only just getting started with vit/task but I already feel like it's gonna change my life hehehe.

So len(c.encode("utf-8")) seems to just give the byte-length of the characters which often results in overestimates:

>>> cs = ['a', '(', 'ä', 'ā', '中', '💡']
>>> [len(c) for c in cs]
[1, 1, 1, 1, 1, 1]
>>> [len(c.encode("utf-8")) for c in cs]
[1, 1, 2, 2, 3, 4]

It looks like there is also a built-in east_asian_width(char) function in unicodedata which, despite its name, correctly reports emojis as wide as well:

>>> from unicodedata import east_asian_width

>>> [east_asian_width(c) for c in cs]
['Na', 'Na', 'N', 'A', 'W', 'W']

>>> [east_asian_width(c) == 'W' for c in cs]
[False, False, False, False, True, True]

So based on this information one could just add one extra width for every wide character in the label like so:

width += len(label) + [east_asian_width(c) for c in label].count('W')

If you think the unicodedata import is not worth the overhead (I agree this is a rare use case), one could just add a hack that assumes that any character that comes after the lowest extra-wide unicode character value (0x1100) is wide. This would still falsely trigger on many zero-width diacritic markers and other characters but results in less of an overestimate than using the utf-8 encoding length:

>>> [ord(c) >= 0x1100 for c in cs]
[False, False, False, False, True, True]

So a minimal, import-free hack solution could be:

width += len(label) + len([c for c in label if ord(c) >= 0x1100])

@thehunmonkgroup
Copy link
Member

I wasn't so worried about importing unicodedata, it was more the overhead of all those lookups to the unicode database -- couldn't really find anything in a brief search about how performant those lookups are.

I decided to create a simple test, where I emulated the all report with only an ID column, and that generated plenty of markers. Using the approach with unicodedata didn't seem to add any appreciable overhead to that rather expensive build, so I went ahead and committed the most robust solution to this issue.

If you can find any info on the performance of those lookups, or want to construct a more robust test, I'd love to hear what you find!

@kevinstadler
Copy link
Author

I doubt there's a huge overhead for character type lookup, wcwidth for example has an ordered list of ~1000 unicode block number ranges (each given by a pair of 4 byte numbers) and does a binary search over them, so that's essentially 9 int32 comparisons to determine whether a character is wide or not. I can imagine that the Python built-in does nothing more than a single direct array index lookup from the character table that's stored on disk (which would presumably happen anyway during the actual rendering process).

Already found a minor issue with some symbols (such as 🅱️) whose 'East Asian width' is considered to be 'ambiguous' (i.e. they are single or double width depending on the typeface). The built-in method returns 'A' instead of 'W' for them, which messes up the column alignment, but to be honest they seem to be pretty rare so unless somebody else runs up against this problem I think the current solution is great as it works for almost all cases 👍

@thehunmonkgroup
Copy link
Member

@kevinstadler thanks again for all your work on this! BTW, if you're loving TaskWarrior/VIT, you might want to check out https://github.com/thehunmonkgroup/onenote as well -- I wrote that out of necessity, and can't live w/o it now!

@tbabej
Copy link
Collaborator

tbabej commented Mar 18, 2021

Related: In TW itself, we recently had to update the unicode width character determination to support newest emojis :) GothenburgBitFactory/taskwarrior#2333

@kevinstadler
Copy link
Author

@thehunmonkgroup cool, I'll have a look! Is there a user group / chat for vit somewhere? I'm having trouble getting vit to use the taskwarrior report coloring, and the parameters in the [color] section of the config don't seem to be doing anything (pretty sure it's just me though and not a bug)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants