This repository has been archived by the owner on Mar 7, 2023. It is now read-only.

Standardising run segmentation #37

Open
tiroj opened this issue Dec 30, 2020 · 6 comments
Labels
documentation Improvements or additions to documentation layout shaping

Comments

@tiroj

tiroj commented Dec 30, 2020

Script itemisation and run segmentation is the first step of OpenType Layout text processing, and like much else in OTL it lacks an implementation specification. Over the years, I have noted inconsistent outcomes in different environments in the handling of script=common characters, such as punctuation, at run boundaries. This indicates that different implementers use different algorithms, or that some algorithms may be broken (broken in terms of their developers’ intentions, since they can’t be said to be broken in terms of a non-existent specification). Lack of consistency in run segmentation sets up some OTL GSUB and GPOS lookups for failure: lookups are not applied across run boundaries, so the font maker is unable to predict their input.
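To make the inconsistency concrete, here is a minimal sketch of two plausible ways to assign script=common characters to runs. It is not any implementer's actual algorithm: the `SCRIPT_RANGES` table, `script_of`, `segment`, and the `common_joins` parameter are all hypothetical names invented for illustration, and a real implementation would use the Unicode Script and Script_Extensions data rather than a few hard-coded ranges.

```python
# Toy script table: (first code point, last code point, script).
# Hypothetical and far from complete; illustration only.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0xA980, 0xA9DF, "Javanese"),
]


def script_of(ch):
    """Return the toy script of a character, defaulting to Common."""
    cp = ord(ch)
    for start, end, script in SCRIPT_RANGES:
        if start <= cp <= end:
            return script
    return "Common"  # spaces, punctuation, digits, ... in this toy table


def segment(text, common_joins="previous"):
    """Split text into (script, substring) runs.

    common_joins="previous": Common characters extend the run before them.
    common_joins="next": Common characters are held back and join the
    run that follows them.
    """
    runs = []     # list of [script, text]
    pending = ""  # Common characters not yet assigned to a run
    for ch in text:
        script = script_of(ch)
        if script == "Common":
            if common_joins == "previous" and runs:
                runs[-1][1] += ch
            else:
                pending += ch
            continue
        if runs and runs[-1][0] == script:
            runs[-1][1] += pending + ch
        else:
            runs.append([script, pending + ch])
        pending = ""
    if pending:  # trailing (or only) Common characters
        if runs:
            runs[-1][1] += pending
        else:
            runs.append(["Common", pending])
    return [(script, run_text) for script, run_text in runs]


sample = "ab (γδ) cd"
print(segment(sample, "previous"))
print(segment(sample, "next"))
```

The two strategies put the parentheses in different runs. Under the first, the opening parenthesis lands in the Latin run and is cut off from the Greek letter it precedes; under the second, the closing parenthesis is cut off from the Greek letter it follows. In both cases a lookup that expects to see the bracket next to the Greek letter gets input the font maker did not anticipate.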

It seems to me that standardising run segmentation, as well as being practically useful, may provide a useful test case for this group. It is a bite-size chunk of OTL processing, necessary but not overwhelming in scope. It would require applying the kinds of decisions that we have discussed in organisational meetings: determining the appropriate venue for the work, how it should be published, and what input from current implementers is needed and available; negotiating possible breaking changes for some implementers; and developing both a written specification and a test suite.

@NorbertLindenberg

Some prior work:

Martin Hosken’s 2018 review of existing algorithms and issues:
https://github.com/OpenType/opentype-layout/blob/master/docs/script_segmentation.md

Specific issue with Common script bases:
OpenType/opentype-layout#13

@PeterConstable

I definitely think this would be worthwhile. My inclination would be for this to happen in a Unicode context, since the algorithm would need to be driven by Unicode character properties: existing properties, or perhaps new properties if needed.

@NeilSureshPatel

NeilSureshPatel commented Jan 12, 2021

@lianghai lianghai added documentation Improvements or additions to documentation layout shaping labels Jan 14, 2021
@NorbertLindenberg

@PeterConstable and others have added more thoughts on this topic in a separate issue #44.

@NorbertLindenberg

I think we need to include other segmentation that happens between a text rendering API and cluster- or run-level shaping in this discussion, along with script and run segmentation. As Raph’s article (thank you @NeilSureshPatel for the reference!) points out, “a text layout engine breaks the input into finer and finer grains”, from paragraph to cluster. Any incorrect breaks introduced along the path can adversely affect shaping or even the text the user gets to see.

One goal for this project therefore should be to identify the clusters and runs that must be kept intact for shaping to work correctly, and to ensure that these clusters and runs are indeed kept intact in all segmentation algorithms involved in text rendering.

One example of the kind of errors we’re currently seeing is this (abbreviated) HTML document:

<style type="text/css">
@font-face {
	font-family: "NotoJavanese";
	src: url("NotoSansJavanese-Regular.ttf");
}
body {
	font-family: "Helvetica";
}
:lang(jv) {
	font-family: "NotoJavanese", serif;
}
</style>
<p>Javanese vowel o: <span lang="jv">◌ꦺꦴ</span>. Javanese conjunct ha: <span lang="jv">◌꧀ꦲ</span>.</p>
<p>Javanese vowel o: ◌ꦺꦴ. Javanese conjunct ha: ◌꧀ꦲ.</p>

Browsers render this in different ways.

Safari: [screenshot]
Firefox: [screenshot]
Chrome: [screenshot]
Legacy Edge: [screenshot]

The rendering of the first paragraph in Safari, Firefox, and Legacy Edge is correct. The rendering of that paragraph in Chrome is broken, as are all renderings of the second paragraph (with Safari getting it half right).

The broken renderings often add dotted circles, which indicates that somewhere on the way to the shaping engine the original dotted circle in the text got separated from the marks that are attached to it, so that the shaping engine adds another one, or possibly two in the case of the two-part vowel. Some of the additional dotted circles clearly come from a font other than Noto Sans Javanese, indicating that the separation likely occurred during font fallback handling. In cases where the additional dotted circle comes from the Javanese font, or where there’s no additional dotted circle, other segmentation algorithms are more likely at fault.
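The font fallback failure mode described above can be sketched as follows. This is an illustration under stated assumptions, not a description of what any browser does: the `FONTS` coverage table and the names `clusters`, `pick_font`, `font_runs`, and `orphaned_marks` are hypothetical. The clustering is deliberately simplistic (a base plus the combining marks that follow it), so it covers the two-part vowel but not the conjunct, which needs script-specific cluster rules.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

# Hypothetical font coverage, in fallback order: the base font has the
# dotted circle but no Javanese; the Javanese font has both.
FONTS = [
    ("BaseFont", lambda ch: ord(ch) < 0x3000),
    ("JavaneseFont",
     lambda ch: 0xA980 <= ord(ch) <= 0xA9DF or ch == DOTTED_CIRCLE),
]


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and is_mark(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def pick_font(unit):
    """Return the first font that covers every character of the unit."""
    for name, covers in FONTS:
        if all(covers(ch) for ch in unit):
            return name
    return None  # no font covers the unit


def font_runs(units):
    """Merge consecutive units that resolve to the same font into runs."""
    runs = []
    for unit in units:
        font = pick_font(unit)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + unit)
        else:
            runs.append((font, unit))
    return runs


def orphaned_marks(runs):
    """Count runs that begin with a mark, i.e. a mark cut off from its base."""
    return sum(1 for _, run_text in runs if is_mark(run_text[0]))


text = "o: \u25CC\uA9BA\uA9B4."  # dotted circle + two-part Javanese vowel o
per_character = font_runs(list(text))
per_cluster = font_runs(clusters(text))
print(per_character, orphaned_marks(per_character))
print(per_cluster, orphaned_marks(per_cluster))
```

Choosing a font per character leaves the dotted circle in the base font's run and starts the Javanese run with a mark, which is the situation in which a shaping engine would insert a dotted circle of its own. Choosing a font per cluster keeps the base and its marks in one run.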

@tiroj
Author

tiroj commented Feb 3, 2021

Another unusual segmentation case to be investigated: biscript orthographies, e.g. the use of Greek characters within Latin orthographies for First Nations languages in British Columbia.
