Font subsetting for embedded truetype fonts #17

gunnsth · 2017-09-07T20:17:47Z

Current state

Currently, font files are embedded in their entirety. This can be wasteful, as often only a small portion of the glyphs is used, and font files can be large, especially Unicode fonts with large numbers of glyphs.

There are two use cases:

1. Creating reports containing large fonts (CJK fonts in particular can be very big).
2. (Optimizing already created PDF files that contain embedded fonts.)

Those two cases may require slightly different approaches to be done efficiently, so it is probably best to keep them separate. Here we will focus on the first use case (creating PDFs).

Proposed changes

This requires:

1. Identifying fonts for subsetting. It is probably best if the user marks the font for subsetting, since subsetting may not be desired in all cases.
2. Identifying which glyphs to keep. Perhaps the encoder could track all glyphs/runes that are referenced (Encode func).
3. Creating subsetted fonts and labelling them as such (PostScript naming convention). We will use https://github.com/unidoc/unitype to do the subsetting.

This would probably be best done at serialization time, if the font has been marked for subsetting along with which glyphs to keep. Example use case:

fnt, _ := NewCompositePdfFontFromTTFFile("largefnt.ttf")
fnt.Subset(true) // Marks the font for subsetting on write.
// Then use fnt as usual.
// Each call to the font encoder's Encode records the glyphs it uses,
// so that only those glyphs are kept when the font is written.
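
The glyph tracking described above could look roughly like the following. This is only a minimal sketch: `usageEncoder`, `newUsageEncoder` and `UsedRunes` are hypothetical names rather than existing unidoc API, and a plain map stands in for the font's real cmap lookup. It also assumes that recording starts only once `Subset(true)` has been called, which is just one of the possible designs.

```go
package main

import (
	"fmt"
	"sort"
)

// usageEncoder is a hypothetical stand-in for a font encoder that
// remembers which runes it has been asked to encode.
type usageEncoder struct {
	runeToGID map[rune]uint16   // stand-in for the font's cmap lookup
	used      map[rune]struct{} // runes seen while tracking is enabled
	track     bool
}

func newUsageEncoder(cmap map[rune]uint16) *usageEncoder {
	return &usageEncoder{runeToGID: cmap, used: make(map[rune]struct{})}
}

// Subset marks the font for subsetting; Encode records runes from then on.
func (e *usageEncoder) Subset(enable bool) { e.track = enable }

// Encode maps text to big-endian 2-byte glyph IDs and, when tracking is
// enabled, records every rune that the font actually has a glyph for.
func (e *usageEncoder) Encode(text string) []byte {
	out := make([]byte, 0, 2*len(text))
	for _, r := range text {
		gid, ok := e.runeToGID[r] // gid stays 0 (.notdef) when missing
		if ok && e.track {
			e.used[r] = struct{}{}
		}
		out = append(out, byte(gid>>8), byte(gid))
	}
	return out
}

// UsedRunes returns the recorded runes in ascending order.
func (e *usageEncoder) UsedRunes() []rune {
	runes := make([]rune, 0, len(e.used))
	for r := range e.used {
		runes = append(runes, r)
	}
	sort.Slice(runes, func(i, j int) bool { return runes[i] < runes[j] })
	return runes
}

func main() {
	enc := newUsageEncoder(map[rune]uint16{'a': 1, 'b': 2, 'c': 3, '日': 4})
	enc.Subset(true)
	enc.Encode("abba")
	enc.Encode("日a")
	fmt.Println(string(enc.UsedRunes())) // the runes to keep in the subset
}
```

At write time, the recorded set would be handed to the subsetter so that only those glyphs are kept.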

Expected results

Significantly smaller generated PDF files using TTF fonts.

References

Section 9.6.4 Font Subsets (PDF32000_2008):

PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor
that describe a font subset are slightly different from those of ordinary fonts. These differences 
allow a conforming reader to recognize font subsets and to merge documents containing different
subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)

For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the 
font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall
consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the
same PDF file shall have different tags.

EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font
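
The naming rule quoted above (six uppercase letters, a plus sign, then the font name, with distinct tags within one file) could be implemented along these lines. This is a sketch under assumptions: `tagAllocator`, `newTagAllocator` and `subsetFontName` are hypothetical helpers, and choosing the letters with a seeded random generator is just one option, since the spec says the choice of letters is arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
)

// tagAllocator hands out subset tags that are unique within one PDF file.
// Hypothetical helper, not part of any existing API.
type tagAllocator struct {
	rng   *rand.Rand
	taken map[string]bool
}

func newTagAllocator(seed int64) *tagAllocator {
	return &tagAllocator{
		rng:   rand.New(rand.NewSource(seed)),
		taken: make(map[string]bool),
	}
}

// Next returns a tag of exactly six uppercase letters that has not been
// returned before by this allocator.
func (a *tagAllocator) Next() string {
	for {
		b := make([]byte, 6)
		for i := range b {
			b[i] = byte('A' + a.rng.Intn(26))
		}
		tag := string(b)
		if !a.taken[tag] {
			a.taken[tag] = true
			return tag
		}
	}
}

// subsetFontName builds the value used for both BaseFont and FontName.
func subsetFontName(tag, baseName string) string {
	return tag + "+" + baseName
}

func main() {
	tags := newTagAllocator(1)
	fmt.Println(subsetFontName(tags.Next(), "Poetica"))
	fmt.Println(subsetFontName(tags.Next(), "Poetica"))
}
```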

And in section 9.9 (Embedded Font Programs) it states:

A TrueType font program may be used as part of either a font or a CIDFont. Although the basic
font file format is the same in both cases, there are different requirements for what information
shall be present in the font program. These TrueType tables shall always be present if present in 
the original TrueType font program:
    “head”, “hhea”, “loca”, “maxp”, “cvt”, “prep”, “glyf”, “hmtx”, and “fpgm”. 
If used with a simple font dictionary, the
font program shall additionally contain a cmap table defining one or more encodings, 
as discussed in 9.6.6.4, "Encodings for TrueType Fonts". If used with a CIDFont dictionary,
the cmap table is not needed and shall not be present, since the mapping from character codes 
to glyph descriptions is provided separately.
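
As a rough illustration of the table rule quoted above, a subsetter might decide which tables to carry over as follows. `tablesToKeep` is a hypothetical helper, not unitype API, and the sketch only selects table tags; it does not rewrite the table contents. Note that TrueType table tags are four bytes long, so the "cvt" table is stored under the tag `"cvt "` with a trailing space.

```go
package main

import "fmt"

// requiredTables lists the tables that 9.9 says shall be kept whenever
// they exist in the original font program.
var requiredTables = []string{
	"head", "hhea", "loca", "maxp", "cvt ", "prep", "glyf", "hmtx", "fpgm",
}

// tablesToKeep returns the tags to carry over into the embedded font.
// present holds the tags found in the original font; simpleFont is true
// for a simple font dictionary and false for a CIDFont dictionary.
func tablesToKeep(present map[string]bool, simpleFont bool) []string {
	var keep []string
	for _, tag := range requiredTables {
		if present[tag] {
			keep = append(keep, tag)
		}
	}
	// A simple font additionally needs its cmap; a CIDFont must not have one.
	if simpleFont && present["cmap"] {
		keep = append(keep, "cmap")
	}
	return keep
}

func main() {
	present := map[string]bool{
		"head": true, "hhea": true, "loca": true, "maxp": true,
		"glyf": true, "hmtx": true, "cmap": true, "name": true, "post": true,
	}
	fmt.Println("CIDFont:", tablesToKeep(present, false))
	fmt.Println("simple: ", tablesToKeep(present, true))
}
```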

Section 9.6.6.4 (Encodings for TrueType Fonts) additionally describes how TrueType cmaps and the font dictionary's Encoding entry are used to map between character codes and glyph descriptions.


dennwc · 2018-12-31T11:17:07Z

Related: https://github.com/unidoc/unidoc/issues/268

gunnsth · 2020-04-25T12:09:24Z

@adrg Can you review this proposal?

adrg · 2020-04-27T16:44:47Z

I think the proposal looks good. Doing the subsetting at serialization time is a good solution in my opinion. Does the font encoder always record used glyphs and ignore the recorded set at serialization time if subsetting is not enabled, or does it start recording once fnt.Subset(true) is called?
For composite fonts, the CMap should also be reduced to the mappings which are actually used, as they tend to get quite big, especially for CJK encodings.
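
Reducing the CMap to the used mappings amounts to a filter over the full table. A minimal sketch, assuming the mapping is held as a code-to-rune map and the used character codes are known at write time (`reduceCMap` is a hypothetical name, and real CMaps are usually written as ranges rather than single entries):

```go
package main

import "fmt"

// reduceCMap returns a copy of full that contains only the character
// codes marked as used. Codes missing from full are skipped.
func reduceCMap(full map[uint16]rune, used map[uint16]bool) map[uint16]rune {
	reduced := make(map[uint16]rune, len(used))
	for code, isUsed := range used {
		if !isUsed {
			continue
		}
		if r, ok := full[code]; ok {
			reduced[code] = r
		}
	}
	return reduced
}

func main() {
	full := map[uint16]rune{1: 'a', 2: 'b', 3: '日', 4: '本'}
	used := map[uint16]bool{3: true, 4: true}
	reduced := reduceCMap(full, used)
	fmt.Println(len(full), "mappings reduced to", len(reduced))
}
```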

Will there also be a global writer/creator option to enable subsetting for all fonts? Could be useful for optimizing PDFs for disk space.

gunnsth assigned dennwc Dec 31, 2018

gunnsth transferred this issue from unidoc/unidoc May 23, 2019

gunnsth mentioned this issue Apr 18, 2020

Improve table rendering speed #318

Closed

gunnsth assigned adrg and unassigned dennwc Apr 25, 2020

This was referenced May 2, 2020

Support for subsetting fonts #335

Merged

[BUG] Large file size with created report with Japanese text #321

Closed

gunnsth closed this as completed May 5, 2020
