New unicode-graphemes addon. #4519

PerBothner · 2023-05-17T01:50:20Z

Handle grapheme cluster lookup as well as wcwidth.

This is based on Unicode 15, and could replace the unicode11 addon.

See the discussion in issue #4513 - but not that this pull request does not change the BufferLine implementation. (Except an extra width parameter for addCodepointToCell.

Note this combines the lookup for wcwidth and grapheme-break-property in a single trie lookup, so I'm hoping there won't be a noticable performace cost. Please let me know what you think.

Here is a screenshot of DomTerm with the xterm.js emulator. This is in a Tauri/Wry window (WebKit-based). Firefox also works. Chrome-based browsers (including Electron) mostly work - except they don't seem to handle Korean Hangul properly. (This might be fixable by changing fonts.)

Handle grapheme cluster lookup as well as wcwidth. This is based on Unicode 15, and could replace the unicode11 addon.

PerBothner · 2023-05-17T04:48:02Z

There are some yarn test failues. I will look at them tomorrow.

PerBothner · 2023-05-18T01:47:04Z

I fixed a bunch of tests.

There are 6 failures like the following. In spite of Googling, I haven't yet figued out to fix these.

/home/bothner/Software/xterm.js-clusters/addons/xterm-addon-unicode-graphemes/src/UnicodeGraphemesAddon.ts
  0:0  error  Parsing error: "parserOptions.project" has been set for @typescript-eslint/parser.
The file does not match your project config: addons/xterm-addon-unicode-graphemes/src/UnicodeGraphemesAddon.ts.
The file must be included in at least one of the projects provided

Except for these, yarn tests is working.

Still 2 errors in yarn test-api - plus I plan to add some tests to addons/xterm-addon-unicode-graphemes/test/UnicodeGraphemesAddon.api.ts once I have that working.

Tyriar · 2023-05-18T13:22:23Z

0:0 error Parsing error: "parserOptions.project" has been set for @typescript-eslint/parser.

You need to set the tsconfigs in .eslintrc.json:

xterm.js/.eslintrc.json

Lines 8 to 33 in b7d73f6

    
           "parserOptions": { 
        
             "project": [ 
        
               "demo/tsconfig.json", 
        
               "src/browser/tsconfig.json", 
        
               "src/common/tsconfig.json", 
        
               "src/headless/tsconfig.json", 
        
               "test/api/tsconfig.json", 
        
               "test/benchmark/tsconfig.json", 
        
               "addons/xterm-addon-attach/src/tsconfig.json", 
        
               "addons/xterm-addon-attach/test/tsconfig.json", 
        
               "addons/xterm-addon-canvas/src/tsconfig.json", 
        
               "addons/xterm-addon-fit/src/tsconfig.json", 
        
               "addons/xterm-addon-fit/test/tsconfig.json", 
        
               "addons/xterm-addon-ligatures/src/tsconfig.json", 
        
               "addons/xterm-addon-search/src/tsconfig.json", 
        
               "addons/xterm-addon-search/test/tsconfig.json", 
        
               "addons/xterm-addon-serialize/src/tsconfig.json", 
        
               "addons/xterm-addon-serialize/test/tsconfig.json", 
        
               "addons/xterm-addon-serialize/benchmark/tsconfig.json", 
        
               "addons/xterm-addon-unicode11/src/tsconfig.json", 
        
               "addons/xterm-addon-unicode11/test/tsconfig.json", 
        
               "addons/xterm-addon-web-links/src/tsconfig.json", 
        
               "addons/xterm-addon-web-links/test/tsconfig.json", 
        
               "addons/xterm-addon-webgl/src/tsconfig.json", 
        
               "addons/xterm-addon-webgl/test/tsconfig.json" 
        
             ],

PerBothner · 2023-05-18T15:57:47Z

You need to set the tsconfigs in .eslintrc.json

Thanks. yarn test now works.

I added the following to "ignorePatterns", because they come from a separate unicode-properties project, which is in turn based on external projects:

    "addons/xterm-addon-unicode-graphemes/src/tiny-inflate.ts",
    "addons/xterm-addon-unicode-graphemes/src/unicode-trie.ts",
    "addons/xterm-addon-unicode-graphemes/src/UnicodeProperties.ts",

PerBothner · 2023-05-18T16:08:55Z

Next problem: yarn test-api emits this:

1) UnicodeGraphemesAddon
       wcwidth V15 emoji test:
     page.evaluate: ReferenceError: UnicodeGraphemesAddon is not defined
    at eval (eval at evaluate (:178:30), <anonymous>:1:22)
    at eval (<anonymous>)
    at UtilityScript.evaluate (<anonymous>:178:30)
    at UtilityScript.<anonymous> (<anonymous>:1:44)
      at /home/bothner/Software/xterm.js-clusters/addons/xterm-addon-unicode-graphemes/test/UnicodeGraphemesAddon.api.ts:34:16
      at /home/bothner/Software/xterm.js-clusters/addons/xterm-addon-unicode-graphemes/out-test/UnicodeGraphemesAddon.api.js:8:71
      at __awaiter (addons/xterm-addon-unicode-graphemes/out-test/UnicodeGraphemesAddon.api.js:4:12)
      at Context.<anonymous> (addons/xterm-addon-unicode-graphemes/test/UnicodeGraphemesAddon.api.ts:33:43)

I might have to figure this out by myself - but if someone has an insight it would be appreciated.

(I'm also seeing some registerLinkProvider failures, but they also fail under an unmodified master clone.)

Tyriar · 2023-05-18T16:16:40Z

page.evaluate: ReferenceError: UnicodeGraphemesAddon is not defined

The playwright tests use the demo, this is about exposing the addon on client.ts:

xterm.js/demo/client.ts

Lines 204 to 214 in 913cb25

    
           if (document.location.pathname === '/test') { 
        
             window.Terminal = Terminal; 
        
             window.AttachAddon = AttachAddon; 
        
             window.FitAddon = FitAddon; 
        
             window.SearchAddon = SearchAddon; 
        
             window.SerializeAddon = SerializeAddon; 
        
             window.Unicode11Addon = Unicode11Addon; 
        
             window.LigaturesAddon = LigaturesAddon; 
        
             window.WebLinksAddon = WebLinksAddon; 
        
             window.WebglAddon = WebglAddon; 
        
           } else {

PerBothner · 2023-05-18T18:23:49Z

At this point I think this is ready for merge - or at least feedback.
It works with the demo and yarn test-api, and I implemented some tests in UnicodeGraphemesAddon.api.ts.

I'd be curious to run a performance comparison. What is the recommended way to do that?

Tyriar · 2023-05-18T19:52:10Z

Good to see it passing 👍. It's a pretty beefy PR so it might take me a couple of weeks to get to.

@jerch would know the best way to benchmark

PerBothner · 2023-05-18T20:26:26Z

No hurry. It does mean I may want to put the BufferLine changes on hold since there is some overlap, mainly in the InputHandler logic. Though maybe not - the overlap is likely to be limited to a small area.

jerch · 2023-05-19T08:19:04Z

@PerBothner For perf test you can use/alter https://github.com/xtermjs/xterm.js/blob/master/test/benchmark/Terminal.benchmark.ts, and then run it with

$> yarn benchmark out-test/benchmark/Terminal.benchmark.js

You can also alter the command to run above in before --> spawn (used that to test different stress pattern in the past).

jerch · 2023-05-19T08:34:22Z

@PerBothner
This a quite impressive PR, thanks for looking into that nasty unicode stuff. It def. will need some time to go over it.

Imho the idea to have this first as an addon is the right choice, which gives us time to level out raw edges (if there are any, no clue yet). I was really skeptical if this could be put into an addon with a reasonable afford, so thanks for doing the hard work. ❤️
Still in the long run I'd love to see this as a core addition, once we are done polishing things.

jerch · 2023-05-19T10:00:31Z

@PerBothner The perf test wont run by just altering the script, it is in fact quite involved to get it working (this is mainly because of the repo structure, where addons are not seen by the base path lookups...). Still got it running, if you want I commit the needed changes into your branch.

Overall the grapheme segementation performs really well 🚀 :

current master

 Context "out-test/benchmark/Terminal.benchmark.js"
    Context "Terminal: ls -lR /usr/lib"
       Context "write/string/async"
          Case "#1" : 5 runs - average throughput: 26.67 MB/s
       Context "write/Utf8/async"
          Case "#1" : 5 runs - average throughput: 29.92 MB/s

with graphemes enabled

 Context "out-benchmark/benchmark/SerializeAddon.benchmark.js"
    Context "Terminal: ls -lR /usr/lib"
       Context "write/string/async"
          Case "#1" : 5 runs - average throughput: 16.46 MB/s
       Context "write/Utf8/async"
          Case "#1" : 5 runs - average throughput: 21.77 MB/s

so overall perf degrades by ~30% for UTF8 and ~40% for JS string input. Thats def. not as bad as I feared, so I wont block this just for perf reasons (yes I would have done that, if the perf penalty was >2 times). There are prolly still some optimizations possible, and I hope we can get these numbers below 30%.

So lets not get sidetracked too much by perf ideas here, it is imho already good enough to follow your impl route and polish interfaces and test cases first.

On a sidenote (@Tyriar) - Seems we already lost throughput over time for some reason. When I did the initial buffer rework the perf test was at ~38MB/s for UTF8 and ~32MB/s for JS strings, as far as I remember.

PerBothner · 2023-05-19T12:25:11Z

It seems plausible that the table lookup in unicode-trie could be sped up using WASM. I don't know enough about WASM to know if the JavaScript/WASM transition adds significant overheads; if so, one could perhaps batch the property lookups with a single call WASM call for each print call.

Another possibly-worthwhile simple optimization might be to bypass the trie-lookup for ASCII (0020-007E) or below 0300 (with a few exceptions, to complicate things).

jerch · 2023-05-19T13:39:48Z

@PerBothner Yeah wasm speedup will most likely be limited by the granularity we can call it with, and what additional data moving is needed per step.
From my experience with wasm in the image addon - I gain there most of the speedup by calling rarely into wasm with bigger batches, and by avoiding data copies where possible (almost all my wasm impls there use borrow mechs at the output stage). With this I typically get a 2 to 4 times speedup compared to a highly optimized JS implementation - sixel decoder is ~3.5 times faster, base64 decoder ~2.5 times (all non SIMD code). With data copies on both wasm IO ends you can roughly subtract 1 from these numbers, thus wasm looks no so good anymore, if alot data juggling is involved.

So while this def. worth a shot, I would not put a priority on that yet, before we settled the inner works and have a reliable test base (wasm is quite cumbersome, when it comes to testing).

PerBothner · 2023-05-19T14:11:03Z

The data for width+category fits in 8 bits. I also note that UnicodeV6 and UnicodeV11 use a 65536-element Uint8Array for the BMP. There is no reason UnicodeGraphemeProvider can't do the same, making a charProperties basically a simple lookup O(1) in the Uint8Array.

Beyond the BMP we could use unicode-trie or perhaps a binary search in a Uint32Array (23 bits character range start + 8 bits value).

But I agree: Lets nail down the API before worrying about further optimization.

jerch · 2023-05-20T10:34:23Z

To move things forward I suggest that we should check, if and how we can activate the addon and grapheme handling during browser API tests. This would give us at least a basic regression testing against top level API. This should be possible, if we rework the API tests to boostrap its terminal instance from a common module, where we already injected the grapheme addon.

It will still miss several core tests on the unit testing level done on nodeJS side, but I am unsure, if we can get it hooked in there with a reasonable effort. This is a bit of a downside of the addon abstraction - they get mixed in late, so cannot get injected easily to existing tests on unit level (still might be possible for the Terminal.test.ts module).

Why bothering with already existing tests? For mainly 3 reasons:

spot regressions - if we want this to be the normal operation mode somewhat in the future, we basically have to make sure, that it does not break things at other ends
avoid writing tests all over again - we have a quite sophisticated test battery, but due to the addon nature, they currently run w'o grapheme handling being active (they currently just run as before)
get faster to working tests - I hope that by injecting existing tests with grapheme handling we get much faster a reliable test battery, thus much faster back to API shaping and even perf considerations (e.g. wasm testing gets much easier once we have regressions covered by tests)

jerch · 2023-05-20T12:04:35Z

About the API tests - for tests using openTerminal grapheme handling can easily be injected:

diff --git a/test/api/TestUtils.ts b/test/api/TestUtils.ts
index 3fd0648e..904c0d12 100644
--- a/test/api/TestUtils.ts
+++ b/test/api/TestUtils.ts
@@ -46,6 +46,11 @@ export async function timeout(ms: number): Promise<void> {
 export async function openTerminal(page: playwright.Page, options: ITerminalOptions & ITerminalInitOnlyOptions = {}): Promise<void> {
   await page.evaluate(`window.term = new Terminal(${JSON.stringify({ allowProposedApi: true, ...options })})`);
   await page.evaluate(`window.term.open(document.querySelector('#terminal-container'))`);
+  await page.evaluate(`
+      window.unicode = new UnicodeGraphemesAddon();
+      window.term.loadAddon(window.unicode);
+      window.term.unicode.activeVersion = '15-graphemes';
+    `);
   await page.waitForSelector('.xterm-rows');
 }

This is just quickly hacked in to see, how it could be done. For better integration testing it prolly will need a conditional here, whether to run API tests with or w'o graphemes (could be done in the CI file as env var).

Also have not checked yet, if all API tests call TestUtils.openTerminal before running the tests, still only one test fails here:

  1) Unicode11Addon
       wcwidth V11 emoji test:

      AssertionError: expected [ '6', '11', '15-graphemes' ] to deeply equal [ '6', '11' ]
      + expected - actual

       [
         "6"
         "11"
      -  "15-graphemes"
       ]

(It tests unicode provider registration, but does not know about the new provider yet, thus it should have failed, hurray 😸)

jerch · 2023-05-20T12:15:40Z

@PerBothner Should I do a PR against your branch, so you can add the perf test and the API test changes?

PerBothner · 2023-05-20T12:24:12Z

No opinion - whatever you think is reasonable.

jerch · 2023-05-20T13:30:40Z

Created PR: PerBothner#2

PerBothner · 2023-09-07T19:21:36Z

How much longer do we wait for @DHowett? (I also sent him an email, which he hasn't responded to. Perhaps @Tyriar could make another attempt to contact him?

I updated the branch from upstream. GitHub reports some weird testsuite-failures, including lint running out of memory. (yarn lint works fine locally.)

Tyriar · 2023-09-11T12:45:27Z

You're right, I can have that discussion internally and unblock merging this if we just mark it as experimental in the readme.

Tyriar · 2023-09-11T12:45:45Z

I'll have a look at the CI failures, OOM is quite surprising.

Tyriar · 2023-09-11T13:07:31Z

@jerch I'll do a once over of the core files, any concerns on your end to get this merged given we have a warning in the readme now?

jerch · 2023-09-11T13:10:43Z

@Tyriar If there are no surprising changes in the newer commits (did only a quick peek) I think it is good to go as it is.

Edit: It would be really nice to get some feedback from @DHowett, esp. if he is in contact with the consortium or working closer with unicode authorities.

Tyriar · 2023-09-11T13:22:25Z

@jerch I'll bring it up in our next monthly sync

Tyriar · 2023-09-11T13:24:06Z

@PerBothner fixed the CI issues:

I guess eslint just takes a lot of memory and we hit the threshold, that was fixed by increasing node's memory limit a003034
The API/playwright test failures were just because we weren't uploading the build artifacts of the new addon 1b5756c

Sorry about all the conflicts, we have a pretty sweet CI setup now though 😅

See xtermjs#4795

@jerch

Has approval to merge from @jerch

Tyriar · 2023-09-12T20:56:22Z

Merging! We'll set up publishing #4797

PerBothner · 2023-09-12T21:06:44Z

Thanks! Next I will "rebase" the BufferLine re-write to use the grapheme-cluster API.

Tyriar · 2023-09-12T21:15:56Z

@PerBothner 👌, I think I mentioned it before but we'll want to chat about the high level ideas in the proposed buffer line rewrite before investing too much time to make sure we're all on the same page.

PerBothner · 2023-09-12T22:32:47Z

@Tyriar "we'll want to chat about the high level ideas in the proposed buffer line rewrite"

I agree. There was some discussion in issue #4513, but mixed in with the clustering issue,
I suggest closing #4513, and either creating a new Issue or a "work-in-progress" Pull Request. The former probably makes more sense, as the current state of my branch needs quite a bit of work to be usable.

This issue is pretty important to me. I am seriously considering switching DomTerm engine to xterms.js for various reasons (mainly performance and community). However, DomTerm has a very ambitious vision and already has a lot of features that xterm.js doesn't have. (Hence a lot of DomTerm features would have to be ported or re-implemented.) The existing BufferLine representation is simple but quite limited, and not a good match for some of the things I want to do, including things that DomTerm currently supports.

Tyriar · 2023-09-12T22:38:57Z

@PerBothner closed, yes please create the issue and we can chat there. It's a bit tough to filter through all this info quickly so it would be good to separate the discussions.

The main goals on my end of the buffer system is fast access and tight packing of the data. The major things I'd like that come to mind for the buffer are deferred clean up/reflow of the arrays (#4232) and leveraging less, larger array buffers to store the data (I think this might be something you were after too). We also have the decoration system which adds a lot of extensibility capabilities which are slower but very flexible.

PerBothner · 2023-09-12T22:55:24Z

FWIW DomTerm already does deferred re-flow. (Partly because re-flow is fairly expensive in the general case, which may include non-trivial pretty-printing and variable-width fonts.) I will cover re-flow in my proposal.

I'll write up the design, goals, and features in a new Issue, hopefully later today or possibly tomorrow.

PerBothner · 2023-09-13T21:40:16Z

See new issue #4800 about a BufferLine re-design.

New unicode-graphemes addon.

2f7fe11

Handle grapheme cluster lookup as well as wcwidth. This is based on Unicode 15, and could replace the unicode11 addon.

Cleanup - fix various tests.

67e9689

xterm-addon-unicode-graphemes - fix 'yarn test'

41760df

PerBothner added 2 commits May 18, 2023 11:11

xterm-addon-unicode-graphemes: add api tests and get demo working

8e730cd

Merge branch 'master' into clusters

9c34bb4

jerch added 5 commits May 20, 2023 15:05

patch for working perf test and API test injection

466501c

fix benchmark

284cde8

remove spurious .only

68abf10

revert wrong added file

ebfa201

revert wrongly editied files

3003185

Merge branch 'master' into clusters

427c472

jerch mentioned this pull request Sep 8, 2023

xterm.js lies about Unicode 11, really uses Unicode 12 #4753

Open

Merge branch 'master' into clusters

bddff60

Tyriar added 3 commits September 11, 2023 06:02

Increase node memory for lint in CI

a003034

Add unicode-grapheme addon files to build artifacts

1b5756c

Add experimental note in readme

17c4d09

Tyriar added this to the 5.4.0 milestone Sep 11, 2023

Tyriar added 5 commits September 11, 2023 06:28

Fix warnings in client.ts

9e536ca

Update to es2021

24b932d

Set repository to sub-directory

9ab0cd6

See xtermjs#4795

Merge branch 'master' into clusters

8183d1d

Final polish

32c6165

Tyriar merged commit bdb25c2 into xtermjs:master Sep 12, 2023
18 checks passed

Tyriar mentioned this pull request Sep 12, 2023

character joiner api unsuitable for graphemes #4513

Closed

PerBothner mentioned this pull request Nov 5, 2023

Support emojis with skin tone #3397

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New unicode-graphemes addon. #4519

New unicode-graphemes addon. #4519

PerBothner commented May 17, 2023

PerBothner commented May 17, 2023

PerBothner commented May 18, 2023

Tyriar commented May 18, 2023

PerBothner commented May 18, 2023 •

edited

PerBothner commented May 18, 2023

Tyriar commented May 18, 2023

PerBothner commented May 18, 2023

Tyriar commented May 18, 2023

PerBothner commented May 18, 2023

jerch commented May 19, 2023 •

edited

jerch commented May 19, 2023

jerch commented May 19, 2023 •

edited

PerBothner commented May 19, 2023

jerch commented May 19, 2023

PerBothner commented May 19, 2023

jerch commented May 20, 2023 •

edited

jerch commented May 20, 2023 •

edited

jerch commented May 20, 2023

PerBothner commented May 20, 2023

jerch commented May 20, 2023

PerBothner commented Sep 7, 2023

Tyriar commented Sep 11, 2023

Tyriar commented Sep 11, 2023

Tyriar commented Sep 11, 2023

jerch commented Sep 11, 2023 •

edited

Tyriar commented Sep 11, 2023

Tyriar commented Sep 11, 2023

Tyriar commented Sep 12, 2023

PerBothner commented Sep 12, 2023

Tyriar commented Sep 12, 2023

PerBothner commented Sep 12, 2023

Tyriar commented Sep 12, 2023

PerBothner commented Sep 12, 2023

PerBothner commented Sep 13, 2023

New unicode-graphemes addon. #4519

New unicode-graphemes addon. #4519

Conversation

PerBothner commented May 17, 2023

PerBothner commented May 17, 2023

PerBothner commented May 18, 2023

Tyriar commented May 18, 2023

PerBothner commented May 18, 2023 • edited

PerBothner commented May 18, 2023

Tyriar commented May 18, 2023

PerBothner commented May 18, 2023

Tyriar commented May 18, 2023

PerBothner commented May 18, 2023

jerch commented May 19, 2023 • edited

jerch commented May 19, 2023

jerch commented May 19, 2023 • edited

PerBothner commented May 19, 2023

jerch commented May 19, 2023

PerBothner commented May 19, 2023

jerch commented May 20, 2023 • edited

jerch commented May 20, 2023 • edited

jerch commented May 20, 2023

PerBothner commented May 20, 2023

jerch commented May 20, 2023

PerBothner commented Sep 7, 2023

Tyriar commented Sep 11, 2023

Tyriar commented Sep 11, 2023

Tyriar commented Sep 11, 2023

jerch commented Sep 11, 2023 • edited

Tyriar commented Sep 11, 2023

Tyriar commented Sep 11, 2023

Tyriar commented Sep 12, 2023

PerBothner commented Sep 12, 2023

Tyriar commented Sep 12, 2023

PerBothner commented Sep 12, 2023

Tyriar commented Sep 12, 2023

PerBothner commented Sep 12, 2023

PerBothner commented Sep 13, 2023

PerBothner commented May 18, 2023 •

edited

jerch commented May 19, 2023 •

edited

jerch commented May 19, 2023 •

edited

jerch commented May 20, 2023 •

edited

jerch commented May 20, 2023 •

edited

jerch commented Sep 11, 2023 •

edited