11 changes: 9 additions & 2 deletions README.md
@@ -85,11 +85,14 @@ In this repo 3 options are currently provided:
```sh
bazel run -c opt util:closure_glyph_keyed_segmenter_util -- \
--input_font=$(pwd)/myfont.ttf \
--number_of_segments=20 \
--codepoints_file=$(pwd)/all_cps.txt \
--config=path/to/config.textpb \
--include_initial_codepoints_in_config \
--output_segmentation_plan > glyph_keyed.txtpb
```

The closure glyph segmenter is configured via an input configuration file using the
[segmenter_config.proto](util/segmenter_config.proto) schema; see the comments there for more details.

Note: this utility is under active development and still very experimental. See
[the status section](docs/experimental/closure_glyph_segmentation.md#status) for more details.

@@ -111,6 +114,10 @@ cat glyph_keyed.txtpb table_keyed.txtpb > segmentation_plan.txtpb

Additional tools for generating encoder configs are planned to be added in the future.

For concrete examples of how to generate IFT fonts, see the [IFT Demo](https://github.com/garretrieger/ift-demo).
In particular the [Makefile](https://github.com/garretrieger/ift-demo/blob/main/Makefile) and the
[segmenter configs](https://github.com/garretrieger/ift-demo/tree/main/config) may be helpful.

### Generating an IFT Encoding

Once a segmentation plan has been created it can be combined with the target font to produce an incremental font and collection
162 changes: 50 additions & 112 deletions docs/experimental/closure_glyph_segmentation.md
@@ -1,7 +1,8 @@
# Converting Unicode Code Point to Glyph Keyed Segmentations during IFT Encoding using Subsetter Glyph Closure

Author: Garret Rieger
Date: Jan 27, 2025
Date: Jan 27, 2025
Updated: Oct 27, 2025

## Introduction

@@ -46,46 +47,51 @@ procedures work in practice.

## Status

Development of a robust glyph segmentation process that produces performant, low overhead
segmentations is an area of active development in the IFT encoder. The current prototype
implementation in [closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) can produce
segmentations that satisfy the closure requirement, but does not yet necessarily produce ones that
are performant.

The approach laid out in this document is just one possible approach to solving the problem. This
document aims primarily to describe how the prototype implementation in
[closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) functions, and is not intended to
present the final (or only) solution to the problem. There are several unsolved problems and
remaining areas for development in this particular approach:

* Input segmentation generation: selecting good quality input code point segmentations is critically
important to achieving high quality glyph segmentations. A high quality code point segmentation
will need to balance keeping interacting code points together with also keeping code points that
are commonly used together. This document and the implementation make no attempt to solve this
problem yet.

* Patch merging: is the process of combining patches from a found segmentation together in order to
reduce overall overhead (e.g. if there was a patch containing only one glyph the overhead cost of
the patch format and network transfer would dominate, therefore it may make sense to merge into
another similar patch). A very basic patch merging process has been included in the implementation
but there is lots of room for improvement. Notably, it does not yet handle conditional patch
merging. Additionally, a more advanced heuristic is needed for selecting which patches to merge.
The current prototype implementation in
[closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) can produce segmentations
that satisfy the closure requirement and are performant (via merging). The approach laid
out in this document is just one possible approach to solving the problem. This document aims
primarily to describe how the prototype implementation in
[closure_glyph_segmenter.cc](../ift/encoder/closure_glyph_segmenter.cc) functions, and is not
intended to present the final (or only) solution to the problem. There are several unsolved problems
and remaining areas for development in this particular approach:

* Much of the ongoing work is on the "merger" which is a sub-problem of producing segmentations.
That's discussed in the separate
[closure_glyph_segmentation_merging.md](closure_glyph_segmentation_merging.md) document.
See the implementation status and areas for further development sections for more specifics.

* Running the segmenter currently requires manual configuration to get good results. Configuration
is needed to select appropriate frequency data and settings for parameters controlling merger
behaviour. The goal is to get to the point where good results can be produced with zero
configuration.

* Support for merging segmentations involving multiple overlapping scripts is not yet implemented
(for example creating a segmentation that supports Chinese and Japanese simultaneously).

* [Multi segment analysis](#multi-segment-dependencies): the current implementation only does single
segment analysis which in some cases leaves sizable fallback glyph sets. How to implement multi
segment analysis is an open question and more development is needed.

* Input segmentation generation: the glyph segmentation process starts with an existing
codepoint/feature based segmentation. Good results can be achieved by starting with one input
segment per codepoint/feature and letting merging join segments as needed. However, there is still
value in starting with a good quality input segmentation that places commonly used codepoints
together. This can significantly reduce the amount of work the merger needs to do. Therefore it
may be useful to develop functionality that creates a first pass input segmentation based on
codepoint frequency data.

* Incorporating dependency information: whatever produces the input code point segments will likely
have discovered dependency information related to those code points. That information can be
reused in this process to narrow selections during patch merging and multi segment
analysis. Future work will look at adding dependency information as an optional input to this
procedure.

One of the main downsides to this approach is its reliance on subsetting closure functions, which
are computationally costly. Complex fonts can require hundreds of closure operations and as a
result can be slow to process. So another area of open research is whether a non-closure-based approach
could be developed that is computationally cheaper (for example by producing a segmentation by
working directly with the substitution and dependencies encoded in a font).
* One of the main downsides to this approach is its reliance on subsetting closure functions, which
are computationally costly. Complex fonts can require hundreds of closure operations and as a
result can be slow to process. So another area of open research is whether a non-closure-based approach
could be developed that is computationally cheaper (for example by producing a segmentation by
working directly with the substitution and dependencies encoded in a font).

## Goals

@@ -100,12 +106,15 @@ The segmentation procedure described in this document aims to achieve the follow
values. The input unicode code point segmentations are used to form the conditions.

* Optimize for minimal data transfer by avoiding duplicating glyphs across patches where possible.

* Support optimization of a generated segmentation via merging to reduce network overhead.

* The chosen glyph segmentation and activation conditions must satisfy the closure requirement:

The set of glyphs contained in patches loaded for a font subset definition (a set of Unicode
code points and a set of layout feature tags) through the patch map tables must be a superset of
those in the glyph closure of the font subset definition.
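
The closure requirement can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the toy "font" has two code points and one ligature, activation conditions are modelled as plain predicates over the subset's code points, and layout feature tags are left out for brevity. It is not how the encoder itself represents patches or conditions.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Codepoints = FrozenSet[int]
# Hypothetical patch model: (activation condition, glyphs carried by the patch).
Patch = Tuple[Callable[[Codepoints], bool], Set[str]]


def loaded_glyphs(patches: List[Patch], subset: Codepoints) -> Set[str]:
    """Union of the glyphs in every patch whose condition matches the subset."""
    glyphs: Set[str] = set()
    for condition, patch_glyphs in patches:
        if condition(subset):
            glyphs |= patch_glyphs
    return glyphs


def satisfies_closure_requirement(
    patches: List[Patch],
    closure: Callable[[Codepoints], Set[str]],
    subset: Codepoints,
) -> bool:
    """True if the loaded glyphs are a superset of the subset's glyph closure."""
    return closure(subset) <= loaded_glyphs(patches, subset)


def toy_closure(subset: Codepoints) -> Set[str]:
    """Toy stand-in for a subsetter closure: 'f' + 'i' also pulls in 'f_i'."""
    cmap = {0x66: "f", 0x69: "i"}
    glyphs = {cmap[cp] for cp in subset if cp in cmap}
    if {"f", "i"} <= glyphs:
        glyphs.add("f_i")
    return glyphs


s1 = frozenset({0x66})  # input segment containing 'f'
s2 = frozenset({0x69})  # input segment containing 'i'
patches: List[Patch] = [
    (lambda cps: bool(cps & s1), {"f"}),                       # exclusive to s1
    (lambda cps: bool(cps & s2), {"i"}),                       # exclusive to s2
    (lambda cps: bool(cps & s1) and bool(cps & s2), {"f_i"}),  # s1 AND s2
]
```

In this toy setup, dropping the third (conditional) patch would violate the requirement for any subset containing both code points, since the ligature glyph would be in the closure but never loaded.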


## Subsetter Glyph Closures

Expand Down Expand Up @@ -227,90 +236,19 @@ data in the initial font. In these cases leaving them in the fallback patch may

## Merging

In some cases the patch set produced above may result in some patches that contain a small number of
glyphs. Because there is a per patch overhead cost (from network and patch format overhead) it may
be desirable to merge some patches together in order to meet some minimum size target. Patches can
be merged so long as it's done in a way that preserves the glyph closure requirement.

There are two types of patches that can be merged: exclusive and conditional. The procedure for
merging is dependent on the type. The next two sections provide some guidelines for merging the two
types together.

### Merging Input Segments

This section outlines a procedure to find and merge input code point segments in order to increase
the sizes of one or more exclusive patches. It searches for other input segments that interact with
the one that needs to be enlarged. Keeping interacting code points together in a single segment is
preferable since it reduces the number of conditional patches needed, thereby reducing overall overhead.
When starting with an input segmentation that is fine grained (for example using one segment per
code point) the resulting glyph segmentation may involve a large number of patches. This results in
excessive network overhead when loading the patches. Performance can be increased by selectively merging
patches together to reduce overhead. This is a complex problem as it needs to be done in a way that
avoids excessive transfers of glyph data that isn't needed.

Starting with:

* A set of patches and activation conditions produced by the "Segmenting Glyphs Based on Closure
Analysis" procedure.
* A desired minimum and maximum patch size in bytes.

Then, for each segment $s_i$ in $s_1$ through $s_n$ ordered by expected frequency of use (high to low):

1. If the associated exclusive patch does not exist or it is larger than the minimum size skip this segment.

2. Locate one or more segments to merge:

a. Sort all glyph patches by their conditions. First on the number of segments in the condition ascending and
then by the segment frequency (high to low).

b. Return the set of segments in the condition of the first patch from that list where $s_i$
appears somewhere in the patch’s condition.

c. If no such patch was found then instead select another exclusive patch which has the closest
frequency to $s_i$ and return the associated segment.
The segmenter currently implements a cost based merging algorithm which selects merges that minimize
an overall cost function. This process is documented in detail in
[closure_glyph_segmentation_merging.md](closure_glyph_segmentation_merging.md).

3. Generate a new $s'_i$ which is the union of $s_i$ and all returned segments from step 2. Add it
to the input segment list.

4. Remove the $s_i$ and all segments from step 2 from all per glyph conditions and the input segment list.

5. Re-run the "Segment Closure Analysis" closure test on $s'_i$ and update per glyph conditions as needed.

6. Based on the updated per glyph conditions, recompute the overall patch and condition sets
following "Segmenting Glyphs Based on Closure Analysis". If the new patch for $s'_i$ is larger
than the maximum allowed size, undo the changes from steps 3-6 and go back to step 2 to continue
searching for more candidate segments. Ignore the previously selected group.

7. Re-run this process to find and fix the next patch which is too small. If none remain then the
process is finished.
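
As a rough illustration, the candidate search in step 2 might look like the Python sketch below. It rests on several assumptions that are not stated in the text: each patch condition is flattened to the set of segment ids it mentions, "segment frequency" of a condition is taken to be the highest frequency among its segments, and the exclusive patch of $s_i$ itself is skipped in step 2b (otherwise nothing would be merged). The function and parameter names are hypothetical and do not come from `closure_glyph_segmenter.cc`.

```python
from typing import Dict, FrozenSet, Iterable, Set


def find_merge_candidates(
    s_i: int,
    conditions: Iterable[FrozenSet[int]],
    exclusive_segments: Iterable[int],
    frequency: Dict[int, float],
) -> Set[int]:
    """Return the segments to merge with s_i (step 2 of the procedure)."""
    # 2a. Order conditions by segment count (ascending), then by frequency
    #     (high to low). Using the max segment frequency is an assumption.
    ordered = sorted(
        conditions,
        key=lambda cond: (len(cond), -max(frequency[s] for s in cond)),
    )

    # 2b. Take the first multi-segment condition that mentions s_i.
    for cond in ordered:
        if s_i in cond and len(cond) > 1:
            return set(cond) - {s_i}

    # 2c. Otherwise fall back to the exclusive segment closest in frequency.
    others = [s for s in exclusive_segments if s != s_i]
    if not others:
        return set()
    closest = min(others, key=lambda s: abs(frequency[s] - frequency[s_i]))
    return {closest}


# Toy data: four segments, two of which (2 and 3) interact.
frequency = {1: 0.9, 2: 0.5, 3: 0.4, 4: 0.1}
conditions = [
    frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4}),
    frozenset({2, 3}), frozenset({1, 2, 3}),
]
exclusive = [1, 2, 3, 4]
```

With this toy data, segment 2 would be merged with segment 3 because of the shared conditional patch, while segment 4 (which interacts with nothing) would fall back to the exclusive segment with the nearest frequency.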


### Merging Conditional Patches

At a high level two conditional patches can be merged together by creating a new patch containing
the union of the glyphs in the two and assigning a new condition which is a superset of the two
original conditions. Merging patches in this way avoids duplicating glyphs, but results in more
relaxed overall activation conditions meaning some of the glyphs will be loaded when not strictly
needed.

Merged conditions for conditional patches can be created by adding a disjunction between the two
overall conditions. For example if the two conditions were ($s_1 \wedge s_2$), ($s_2 \wedge s_3$)
then a new condition $(s_1 \wedge s_2) \vee (s_2 \wedge s_3)$ could be used for a combined
patch. This condition will activate the combined patch when either of the original conditions would
have matched. When selecting patches to merge, patches that have a smaller symmetric difference
between the segments in their conditions should be prioritized as that will minimize the widening of
the activation condition.
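
The disjunction-based merge and the symmetric difference heuristic described above can be sketched in Python as follows. This is only an illustration under stated assumptions: conditions are modelled in disjunctive normal form as a set of conjunctions (each a `frozenset` of segment ids), and the patch names, glyph names, and helper functions are all hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Clause = FrozenSet[int]         # conjunction of segment ids, e.g. s1 AND s2
Condition = FrozenSet[Clause]   # disjunction of conjunctions
Patch = Tuple[Condition, Set[str]]


def segments_of(cond: Condition) -> FrozenSet[int]:
    """All segment ids mentioned anywhere in the condition."""
    return frozenset().union(*cond)


def matches(cond: Condition, active: Set[int]) -> bool:
    """A condition matches if any one of its conjunctions is fully active."""
    return any(clause <= active for clause in cond)


def merge_patches(a: Patch, b: Patch) -> Patch:
    """Union the glyphs and join the two conditions with a disjunction."""
    return (a[0] | b[0], a[1] | b[1])


def best_pair(patches: Dict[str, Patch]) -> Tuple[str, str]:
    """Pick the pair whose condition segments differ the least."""
    return min(
        combinations(sorted(patches), 2),
        key=lambda pair: len(
            segments_of(patches[pair[0]][0]) ^ segments_of(patches[pair[1]][0])
        ),
    )


patches: Dict[str, Patch] = {
    "a": (frozenset({frozenset({1, 2})}), {"g1"}),  # s1 AND s2
    "b": (frozenset({frozenset({2, 3})}), {"g2"}),  # s2 AND s3
    "c": (frozenset({frozenset({4, 5})}), {"g3"}),  # s4 AND s5
}
first, second = best_pair(patches)
merged_condition, merged_glyphs = merge_patches(patches[first], patches[second])
```

Here the pair `a`/`b` is preferred because their conditions differ only in segments 1 and 3, whereas either of them paired with `c` would differ in four segments and so widen the activation condition more.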

There are also two other options for merging conditional patches, though these are generally less
preferable than the merging procedure described above:

1. Move the patch's glyphs into the initial font or merge with the fallback patch. This removes the
patch at the cost of always loading the glyph's data. This may be useful when there is a very
small patch with a wide activation condition. In this case it may not be desirable to merge with
other larger conditional patches due to excessive widening of their activation conditions.

2. Duplicate the patch's glyphs into the segments that make up the patch's condition. This
eliminates the patch at the cost of duplicating glyph data. It may be useful in cases of small
patches with narrow activation conditions.

More research is needed in this area, and ultimately it's likely that a selection heuristic which
takes into account segment frequency to assess the impact of condition widening will need to be
developed.
The best segmentation results so far have been obtained by starting with one input segment per
code point in the font and then letting the merger figure out how to best place them together in a way
that minimizes overall cost.

## Multi Segment Dependencies
