Standardising run segmentation #37

tiroj · 2020-12-30T20:33:41Z

Script itemisation and run segmentation is the first step of OpenType Layout text processing, and like much else in OTL it lacks an implementation specification. Over the years, I have noted inconsistency in outcomes in different environments regarding handling of script=common characters such as punctuation at run boundaries, indicating that different algorithms are used by different implementers or that some algorithms may be broken (broken in terms of their developer’s intentions, since they can’t be said to be broken in terms of a non-existent specification). Lack of consistency in run segmentation sets up some OTL GSUB and GPOS for failure, since the font maker is unable to predict input to lookups, which are not applied across run boundaries.
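To make the boundary problem concrete, here is a minimal sketch of two plausible policies for resolving script=Common characters. It is not any implementer's actual algorithm. `script_of` is a hypothetical stand-in for a real Unicode Script property lookup and only knows basic Latin and Greek letters; the policy names `preceding` and `following` are likewise made up for illustration.

```python
COMMON = "Zyyy"  # ISO 15924 code for Common

def script_of(ch):
    """Toy Script lookup (assumption): Latin and Greek letters, else Common."""
    cp = ord(ch)
    if ch.isalpha() and cp < 0x0250:
        return "Latn"
    if ch.isalpha() and 0x0370 <= cp <= 0x03FF:
        return "Grek"
    return COMMON

def resolve(scripts, forward):
    """Give each Common item the last real script seen in the scan direction."""
    out = list(scripts)
    idx = range(len(out))
    last = None
    for i in (idx if forward else reversed(idx)):
        if out[i] == COMMON:
            if last is not None:
                out[i] = last
        else:
            last = out[i]
    return out

def itemise(text, inherit="preceding"):
    """Split text into (script, substring) runs under the chosen policy."""
    first_pass_forward = inherit == "preceding"
    scripts = [script_of(ch) for ch in text]
    scripts = resolve(scripts, forward=first_pass_forward)
    # Second pass in the other direction fills Common items left at an edge.
    scripts = resolve(scripts, forward=not first_pass_forward)
    runs = []
    for ch, sc in zip(text, scripts):
        if runs and runs[-1][0] == sc:
            runs[-1] = (sc, runs[-1][1] + ch)
        else:
            runs.append((sc, ch))
    return runs

text = "ab (αβ)"
print(itemise(text, "preceding"))  # [('Latn', 'ab ('), ('Grek', 'αβ)')]
print(itemise(text, "following"))  # [('Latn', 'ab'), ('Grek', ' (αβ)')]
```

Both policies are defensible, yet the opening parenthesis and the space land in different runs, so a lookup that needs to see the parenthesis next to a Greek letter would only work under one of them.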

It seems to me that standardising run segmentation may—as well as being practically useful—provide a useful test case for this group. It is a bite-size chunk of OTL processing, necessary but not overwhelming in scope. It would require applying the kinds of decisions that we have discussed in organisational meetings, re. determining the appropriate venue for the work, determining how it should be published, determining what input from current implementers is needed and available, negotiating possible breaking changes for some implementers, and development of both a written specification and a test suite.

NorbertLindenberg · 2021-01-12T20:39:44Z

Some prior work:

Martin Hosken’s 2018 review of existing algorithms and issues:
https://github.com/OpenType/opentype-layout/blob/master/docs/script_segmentation.md

Specific issue with Common script bases:
OpenType/opentype-layout#13

PeterConstable · 2021-01-12T21:05:30Z

I definitely think this would be worthwhile. My inclination would be for this to happen in a Unicode context, since the algorithm would need to be driven by Unicode character properties: existing properties, or perhaps new properties if needed.

NeilSureshPatel · 2021-01-12T21:06:17Z

Useful overview:
https://raphlinus.github.io/text/2020/10/26/text-layout.html

NorbertLindenberg · 2021-02-03T05:15:09Z

@PeterConstable and others have added more thoughts on this topic in a separate issue #44.

NorbertLindenberg · 2021-02-03T06:01:10Z

I think we need to include other segmentation that happens between a text rendering API and cluster- or run-level shaping in this discussion, along with script and run segmentation. As Raph’s article (thank you @NeilSureshPatel for the reference!) points out, “a text layout engine breaks the input into finer and finer grains”, from paragraph to cluster. Any incorrect breaks introduced along the path can adversely affect shaping or even the text the user gets to see.

One goal for this project therefore should be to identify the clusters and runs that must be kept intact for shaping to work correctly, and to ensure that these clusters and runs are indeed kept intact in all segmentation algorithms involved in text rendering.
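As a rough illustration of the second half of that goal, the sketch below checks whether a set of proposed segment boundaries would cut a base character off from its marks. It assumes a deliberately simplified cluster rule (a base followed by any characters of General Category M*), which is far weaker than real cluster definitions; the function name `unsafe_boundaries` is hypothetical.

```python
import unicodedata

def unsafe_boundaries(text, boundaries):
    """Return the boundaries that fall immediately before a combining mark.

    Assumption: a cluster is a base plus following M* characters. Real
    cluster rules (conjuncts, joiners, etc.) need more than this.
    """
    return [
        b for b in boundaries
        if 0 < b < len(text) and unicodedata.category(text[b]).startswith("M")
    ]

# "a ", dotted circle, Javanese vowel signs taling + tarung, "."
text = "a \u25cc\ua9ba\ua9b4."
print(unsafe_boundaries(text, [2, 3, 5]))  # [3]
```

A check of this kind could be run against the output of each segmentation stage (line breaking, script itemisation, font fallback) to find the stage that introduces a bad break.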

One example of the kind of error we’re currently seeing is this (abbreviated) HTML document:

<style type="text/css">
@font-face {
	font-family: "NotoJavanese";
	src: url("NotoSansJavanese-Regular.ttf");
}
body {
	font-family: "Helvetica";
}
:lang(jv) {
	font-family: "NotoJavanese", serif;
}
</style>
<p>Javanese vowel o: <span lang="jv">◌ꦺꦴ</span>. Javanese conjunct ha: <span lang="jv">◌꧀ꦲ</span>.</p>
<p>Javanese vowel o: ◌ꦺꦴ. Javanese conjunct ha: ◌꧀ꦲ.</p>

Browsers render this in different ways.

[Images of the rendering in Safari, Firefox, Chrome, and Legacy Edge]

The rendering of the first paragraph in Safari, Firefox, and Legacy Edge is correct. The rendering of that paragraph in Chrome is broken, as are all renderings of the second paragraph (with Safari getting it half right).

The broken renderings often add dotted circles, which indicates that somewhere on the way to the shaping engine the original dotted circle in the text got separated from the marks that are attached to it, so that the shaping engine adds another one, or possibly two in the case of the two-part vowel. Some of the additional dotted circles clearly come from a font other than Noto Sans Javanese, indicating that the separation likely occurred during font fallback handling. In cases where the additional dotted circle comes from the Javanese font, or where there’s no additional dotted circle, other segmentation algorithms are more likely at fault.
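The font fallback hypothesis can be sketched as follows. This is a toy model, not how any browser is known to work: `LatinFont` and `JavaneseFont` are hypothetical fonts with made-up coverage (both are assumed to contain U+25CC), and clusters are again approximated as a base plus following M* characters.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"

# Hypothetical fallback stack: (name, coverage predicate), in priority order.
FONT_STACK = [
    ("LatinFont", lambda ch: ord(ch) < 0x0250 or ch == DOTTED_CIRCLE),
    ("JavaneseFont", lambda ch: 0xA980 <= ord(ch) <= 0xA9DF or ch == DOTTED_CIRCLE),
]

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def pick_font(unit):
    """First font in the stack that covers every character of the unit."""
    for name, covers in FONT_STACK:
        if all(covers(ch) for ch in unit):
            return name
    return None

def font_runs(units):
    """Assign a font to each unit and merge adjacent units with the same font."""
    runs = []
    for unit in units:
        font = pick_font(unit)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + unit)
        else:
            runs.append((font, unit))
    return runs

text = "\u25cc\ua9ba\ua9b4"  # dotted circle + vowel signs taling, tarung
print(font_runs(list(text)))      # per character: the circle goes to LatinFont
print(font_runs(clusters(text)))  # per cluster: one JavaneseFont run
```

With per-character fallback the dotted circle is matched by the first font and ends up in its own run, leaving the marks without a base in the Javanese run, which is the situation in which a shaping engine would insert a dotted circle of its own. Choosing the font per cluster keeps the sequence together.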

tiroj · 2021-02-03T18:24:07Z

Another unusual segmentation case to be investigated: biscript orthographies, e.g. the use of Greek characters within Latin orthographies for First Nations languages in British Columbia.

lianghai mentioned this issue Jan 12, 2021: Meeting 7 schedule and agenda #39 (Closed)

lianghai added documentation layout shaping labels Jan 14, 2021

NorbertLindenberg mentioned this issue Feb 1, 2021: Itemization #44 (Open)

NorbertLindenberg mentioned this issue Feb 6, 2021: Meeting 7 minutes #46 (Merged)

brawer mentioned this issue May 20, 2023: Index by grapheme maplibre/maplibre-gl-js#2458 (Closed)
