
Clean up the text handling pipeline #179

@simoncozens

Description


OK, this is going to be a bit of a stream-of-consciousness issue. I am trying to understand a major structural issue with SILE and understand how to fix it, and I'm struggling. Any help would be appreciated.

@mhosken recently pointed me to a situation where the shaping engine needs to see the whole paragraph, not just individual tokens. This is something that has come up before (#138, #173), so it's something we definitely need to look into. I have a proof of concept package which does this (and it makes me happy that this can be implemented as a package) but it's a bit of a mess. But the current situation is also a mess, and we need to decide which mess is less messy.

Definitions

First, a few definitions:

  • Shaping, we all know, involves passing text to Harfbuzz and getting back position/width/cluster information.
  • Bidi reordering, hyphenation, and line breaking: again, we all know these.
  • Node creation involves bundling up style/font/size/etc. information along with the text.
  • Tokenizing involves splitting the text into smaller units.
  • What I'll call language handling involves language-specific processing of the input stream. Example: Japanese kinsokushori rules and inter-character-class spacing rules.

Also the types of horizontal node that SILE deals with, since this isn't documented anywhere and it's confusing:

  • Penalty and glue nodes represent line break opportunities and white space respectively. This is the same as TeX.
  • hbox nodes can define their own output routines and hence can actually be whatever you like, but their default state is to contain font information and a series of glyph IDs.
  • nnodes were initially based on XeTeX's native nodes, but have diverged quite a bit from that. They represent collections of hboxes that we want to keep together for logical processing. Critically, an nnode also carries around the text which produced it. For instance, hyphenation is implemented by splitting an nnode into its hyphenation parts (N<after> becomes N<af>D<->N<ter>) but these split nodes know their parent. This means that if the word ends up not being hyphenated by the Knuth-Plass algorithm, we can ask the node to output its parent instead. This allows us to output N<after> in the unhyphenated case, instead of having to output N<af>N<ter>; the idea is to improve the PDF output.
  • unshaped nodes are just text plus font options.
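The parent-fallback trick for hyphenated nnodes can be sketched in a few lines. This is a toy Python model for illustration only; `Nnode`, `hyphenate`, `output`, and the `parent` field are stand-in names, not SILE's actual Lua classes:

```python
class Nnode:
    """Toy model of a shaped node that remembers the text it came from."""
    def __init__(self, text, parent=None):
        self.text = text
        self.parent = parent  # set on fragments produced by hyphenation

def hyphenate(nnode, break_at):
    """Split N<after> into N<af> D<-> N<ter>; both fragments keep a parent link."""
    left = Nnode(nnode.text[:break_at], parent=nnode)
    right = Nnode(nnode.text[break_at:], parent=nnode)
    discretionary = "-"
    return [left, discretionary, right]

def output(nodes):
    """If a word's fragments survive unbroken, emit the parent text instead."""
    out, i = [], 0
    while i < len(nodes):
        n = nodes[i]
        if (isinstance(n, Nnode) and n.parent is not None
                and i + 2 < len(nodes)
                and isinstance(nodes[i + 2], Nnode)
                and nodes[i + 2].parent is n.parent):
            out.append(n.parent.text)  # unhyphenated: output N<after> whole
            i += 3
        else:
            out.append(n.text if isinstance(n, Nnode) else n)
            i += 1
    return out

word = Nnode("after")
parts = hyphenate(word, 2)   # N<af> D<-> N<ter>
print(output(parts))         # -> ['after']
```

The point of the sketch is the fallback: because the fragments know their parent, the line builder can emit the original run-together word when the Knuth-Plass pass declines the break.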

Current situation

The current system takes some text, and then performs tokenizing and language handling. Each language support routine can implement its own tokenizer - a tokenizer is an iterator which emits either bits of text, or ready-formed nodes (again, think of Japanese kinsokushori, which is implemented by splitting the text into individual characters and then interweaving penalties and glue nodes with the text output).

Any text tokens are wrapped into unshaped nodes. Now we have a stream of nodes. For English, they might look like U<after>G<2.6pt plus 0.73pt>U<several>G<2.6pt plus 0.73pt>U<days>. For Japanese, they would look like U<私>G<0pt plus 2.5pt>U<は>G<0pt plus 2.5pt minus 5pt>U<「>U<SILE>U<」>G<0pt>U<で>G<0pt plus 2.5pt>U<す>.
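For the default (English-like) case, the tokenize-and-wrap step amounts to something like this sketch (Python for illustration; `tokenize` and `to_nodes` are made-up names, and the glue width is just the example value above):

```python
import re

def tokenize(text):
    """Default tokenizer: emit text tokens and interword-space separators."""
    for piece in re.split(r"(\s+)", text):
        if not piece:
            continue
        if piece.isspace():
            yield ("separator", piece)
        else:
            yield ("string", piece)

def to_nodes(text):
    """Wrap text tokens as unshaped nodes, separators as glue."""
    nodes = []
    for kind, piece in tokenize(text):
        if kind == "string":
            nodes.append("U<%s>" % piece)
        else:
            nodes.append("G<2.6pt plus 0.73pt>")  # interword glue (illustrative width)
    return "".join(nodes)

print(to_nodes("after several days"))
# -> U<after>G<2.6pt plus 0.73pt>U<several>G<2.6pt plus 0.73pt>U<days>
```

A language-specific tokenizer (Japanese, say) would yield ready-formed penalty and glue nodes from the same iterator interface instead of only strings and separators.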

They are then broken into lines - to do this, unshaped nodes are shaped into nnodes so we know their widths. After line breaking we do bidi reordering. (Quoting UAX#14: "In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm." - see discussion in #137 and #173, where @behdad cryptically said "Bidi is not a one-off operation. It happens in stages." I don't know what the stages are.)

Bidi reordering currently works like this: for each line of the paragraph, the nodelist is transformed into a new nodelist, where individual characters (UTF8 codepoints) from the text of each nnode are split into their own unshaped nodes. So in the English case we now go down to U<a>U<f>U<t>U<e>U<r>G<2.6pt plus 0.73pt>... These nodes are fed to @khaledhosny's Lua bidi implementation, and reordered, and then we try to reconstitute them into nnodes again afterwards: U<after>G<2.6pt plus 0.73pt> or U<retfa>G<2.6pt plus 0.73pt> as the case may be. The reconstituting process is suspect. And then we shape the unshaped nodes again, and now everything is ready to be sent to the page builder.
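The explode-and-reconstitute dance might be sketched like this (Python for illustration; a simple reversal stands in for the real bidi reordering, and `explode`/`reconstitute` are illustrative names, not the actual code):

```python
def explode(text):
    """Split an nnode's text into one unshaped node per codepoint."""
    return [("U", ch) for ch in text]

def reconstitute(nodes):
    """Merge adjacent unshaped nodes back into one; glue would stay as-is."""
    out = []
    for kind, content in nodes:
        if kind == "U" and out and out[-1][0] == "U":
            out[-1] = ("U", out[-1][1] + content)
        else:
            out.append((kind, content))
    return out

chars = explode("after")            # U<a> U<f> U<t> U<e> U<r>
reordered = list(reversed(chars))   # stand-in for the real bidi pass
print(reconstitute(reordered))      # -> [('U', 'retfa')]
```

The suspect part the text mentions is exactly this merge: once characters have been shuffled, deciding which runs belong back together in a single nnode is fragile.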

So the pattern goes: tokenizing / language handling, forming nodes, shaping nodes, line breaking, bidi, shaping again.

Proof of concept situation

The harfbuzz-only-shaper works by bundling each incoming bit of text into one big unshaped node. So now you get U<after several days> and U<私は「SILE」です>. No tokenizing or language handling is done, but we do bidi processing on the unshaped nodes at this stage, in the same way as described above.

Then the unshaped nodes are shaped. We get back a stream of glyphs and their measurements from Harfbuzz, which we assemble into nnodes. If at any point in the glyph stream we see a glyph named space (how brittle is that?!), we output the nnode we've been assembling so far, and then turn the space glyph into a glue node.
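That assembly loop looks something like the following sketch (Python with mock glyph dicts, not the actual shaper code; `assemble` is an illustrative name):

```python
def assemble(glyphs):
    """Walk a shaped glyph stream; a glyph named 'space' flushes the
    current nnode and becomes interword glue (as brittle as it sounds)."""
    nodes, current = [], []
    for g in glyphs:
        if g["name"] == "space":
            if current:
                nodes.append(("nnode", current))
                current = []
            nodes.append(("glue", g["advance"]))
        else:
            current.append(g["name"])
    if current:
        nodes.append(("nnode", current))
    return nodes

shaped = [{"name": "a", "advance": 5}, {"name": "b", "advance": 6},
          {"name": "space", "advance": 3}, {"name": "c", "advance": 5}]
print(assemble(shaped))
# -> [('nnode', ['a', 'b']), ('glue', 3), ('nnode', ['c'])]
```

Note that this keys off the glyph *name*, which is font-dependent; a more robust check would look at the cluster's underlying codepoint.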

Nnodes need to know their text content (for hyphenation, etc.), so we use the cluster indices (from Harfbuzz) of the first and last glyphs in the current nnode to substring the paragraph content. This feels like it isn't quite right, because I'm sure some languages produce shaped glyphs in a different order to the Unicode stream. (Thai?)
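The cluster-to-text mapping might look like this sketch (Python for illustration; `nnode_text` is a made-up name; taking min/max of the cluster values rather than literally the first and last glyph is slightly safer when glyph order differs from Unicode order, but the `+ 1` end guess only holds for single-codepoint final clusters - real code would need the next cluster boundary):

```python
def nnode_text(paragraph, glyphs):
    """Recover an nnode's text from HarfBuzz cluster indices by
    substringing the paragraph between the run's cluster extremes."""
    clusters = [g["cluster"] for g in glyphs]
    start = min(clusters)
    # Assumes the final cluster covers exactly one codepoint; a real
    # implementation must find where the next cluster begins.
    end = max(clusters) + 1
    return paragraph[start:end]

para = "after several days"
run = [{"cluster": 6}, {"cluster": 7}, {"cluster": 8}, {"cluster": 9},
       {"cluster": 10}, {"cluster": 11}, {"cluster": 12}]
print(nnode_text(para, run))   # -> 'several'
```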

Then line breaking, and straight to the page builder.

Problems with the proof of concept

  • No tokenizing or language handling. So Japanese fails: it doesn't have spaces, so no glue nodes are generated, which means there are no potential break points.
  • Each "incoming bit of text" is sent into an unshaped node. This is not the same as shaping the whole paragraph. In fact we are shaping runs. Hello \em{there} world would create three unshaped nodes U<Hello >U<there>U< world>, which would each get sent to Harfbuzz independently. I think that is fine - because U<there> has a different font, and so needs to be shaped differently - but then bidi reordering happens on each of them independently, not as a whole, which is clearly wrong.
  • Bidi handling happens before line breaking, which means we have a "my is full of eels hovercraft" issue (see second example of Bidirectionality misfeature? #137).
  • X offsets. When we output the glyphs, we adjust the output cursor by the x_offset value that Harfbuzz reported for each glyph. This works nicely when we shape a token at a time. However, for some reason we end up with huge x_offsets for glyphs at the end of the line. (I assumed that x_offsets were relative to the current advance position, but I am seeing x_offsets like -120pt, which makes no sense to me.)

Where we want to be

This is what I think we want to be doing. But I am not sure.

  • Turn each run of text into an unshaped node. No bidi here.
  • When it's time to set the paragraph, shape each run. (We still haven't done bidi yet so we are sending U<abc>U<[aleph][beth][gimel]> to the shaper in logical, Unicode storage order. Is that right? Is mixed English/Arabic going to shape correctly? Does each run need to know its text direction?)
  • Shaping will return a bunch of glyphs and positioning information.
  • Before we break into lines, we need to turn the glyphs/positions into nnodes. But we let language-specific handlers interact with that process. The default handler assembles characters into nnodes and spaces into glue, just like the harfbuzz-only-shaper. The Japanese handler will insert glues and penalties between glyphs. South Asian handlers will do clever algorithmic stuff on the text of the nnodes to determine line breaking possibilities.
  • Now we break into lines.
  • Now it's bidi time. The bidi process will have to be quite different from the current practice. We have a bunch of nnodes containing text, and we have glyphs within the nnodes which know their offset into the text. Then we do what @khaledhosny suggested in Use ICU for bidi #173:
      • Convert the node list to a string, using 0xFFFC for anything that is not a character.
      • Use ICU to get the embedding levels of the string and assign them back to each node.
      • Use the embedding levels to reorder the nodes, which I think can be done after line breaking.

[screenshot from the original issue, 2015-09-29]

  • Do we now need to reshape the reordered nodes at this stage? We've reordered the text, so I would think so, but maybe if we've just reordered runs that won't affect the shaping? I don't know.

And then we're done. I still don't know how to handle the x_offsets issue, but if this is correct, it seems like it should work.
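For the reordering step itself, once ICU has handed back the embedding levels, the node shuffle is just UAX #9 rule L2: from the highest level down to the lowest odd level, reverse every maximal run of nodes at that level or higher. A sketch (Python for illustration; `reorder_by_levels` is an illustrative name, not an ICU API, and the Hebrew letters are spelled out as ASCII placeholders):

```python
def reorder_by_levels(nodes, levels):
    """UAX #9 rule L2: for each level from the highest down to the lowest
    odd level, reverse every contiguous run of nodes at >= that level."""
    nodes = list(nodes)
    highest = max(levels)
    lowest_odd = min((l for l in levels if l % 2 == 1), default=None)
    if lowest_odd is None:
        return nodes  # all-LTR line: nothing to reorder
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(nodes):
            if levels[i] >= level:
                j = i
                while j < len(nodes) and levels[j] >= level:
                    j += 1
                nodes[i:j] = reversed(nodes[i:j])
                i = j
            else:
                i += 1
    return nodes

# An English run (level 0) followed by a Hebrew run (level 1), in logical order:
print(reorder_by_levels(["abc", " ", "aleph", "beth", "gimel"],
                        [0, 0, 1, 1, 1]))
# -> ['abc', ' ', 'gimel', 'beth', 'aleph']
```

Because the reversal only permutes nodes, not glyphs, this is consistent with doing it per-line after breaking, as suggested above.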

Worries

  • Is hyphenation still going to work?
  • Will I get the bidi reordering wrong?
  • Will we correctly extract text from cluster information returned by Harfbuzz?
  • Will language-specific handlers be able to do all the things they need given that they are now receiving a string of glyphs instead of text?
  • How do I correctly compute the x_offset for marks from the previous character, not from the start of the line?

I will start mucking with all this in the new-pipeline branch. Thanks for reading to the end and if anyone cc'd could make any comments ("Yes, that'll work", "No, you've misunderstood") or pick up things I've missed that would help me a lot!
