Font subsetting for embedded truetype fonts #17

Closed
gunnsth opened this issue Sep 7, 2017 · 3 comments

gunnsth commented Sep 7, 2017

Current state

Currently, font files are embedded in their entirety. This can be wasteful, as often only a small portion of the glyphs is used, and font files can be large, especially Unicode fonts with large numbers of glyphs.

There are two use cases:

  1. Creating reports containing large fonts (typically CJK fonts can be very big).
  2. (Optimizing already created PDF files that contain embedded fonts.)

These two cases may require slightly different approaches to be handled efficiently, so it is probably best to keep them separate. Here we focus on the first use case (creating PDFs).

Proposed changes

This requires:

  1. Identifying fonts for subsetting. It is probably best if the user marks the font for subsetting, since subsetting may not be desired in all cases.
  2. Identifying which glyphs to keep. Perhaps the encoder could track all glyphs/runes that are referenced (Encode func).
  3. Creating subsetted fonts and labelling them as such (PostScript naming convention). We will
    use https://github.com/unidoc/unitype to do the subsetting.
    This is probably best done at serialization time, provided the font has been marked for subsetting and the glyphs to keep have been recorded. Example use case:
fnt, err := NewCompositePdfFontFromTTFFile("largefnt.ttf")
if err != nil {
	return err
}
fnt.Subset(true) // Proposed API: marks the font for subsetting on write.
// Then use fnt as normal.
// Each call to the font's encoder Encode records the glyph as used.
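
Step 2 (tracking referenced runes in the encoder) could look roughly like the sketch below. It is illustrative only: `usageTracker`, its fields, and the `func(rune) (uint16, bool)` encoder signature are hypothetical names chosen for this example, not the library's actual API. Recording happens only after `Subset(true)` has been called, which is one of several possible designs.

```go
package main

import (
	"fmt"
	"sort"
)

// usageTracker is a hypothetical helper (not an existing API). It wraps an
// encoding function and records every rune that is successfully encoded.
type usageTracker struct {
	enabled bool
	used    map[rune]struct{}
	encode  func(r rune) (uint16, bool) // assumed signature: rune -> character code
}

func newUsageTracker(encode func(rune) (uint16, bool)) *usageTracker {
	return &usageTracker{used: make(map[rune]struct{}), encode: encode}
}

// Subset mirrors the proposed fnt.Subset(true) call.
func (t *usageTracker) Subset(enable bool) { t.enabled = enable }

// Encode converts text to 2-byte character codes and, when subsetting is
// enabled, records each rune the font can encode.
func (t *usageTracker) Encode(s string) []byte {
	out := make([]byte, 0, 2*len(s))
	for _, r := range s {
		code, ok := t.encode(r)
		if !ok {
			continue // skip runes the font cannot encode
		}
		if t.enabled {
			t.used[r] = struct{}{}
		}
		out = append(out, byte(code>>8), byte(code))
	}
	return out
}

// UsedRunes returns the recorded runes in ascending order.
func (t *usageTracker) UsedRunes() []rune {
	runes := make([]rune, 0, len(t.used))
	for r := range t.used {
		runes = append(runes, r)
	}
	sort.Slice(runes, func(i, j int) bool { return runes[i] < runes[j] })
	return runes
}

func main() {
	// Toy encoder: identity mapping for the Basic Multilingual Plane.
	tr := newUsageTracker(func(r rune) (uint16, bool) { return uint16(r), r <= 0xFFFF })
	tr.Subset(true)
	tr.Encode("hello")
	fmt.Println(string(tr.UsedRunes()))
}
```

At serialization time, the recorded set would be handed to the subsetter to decide which glyphs to keep.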

Expected results

Significantly smaller generated PDF files using TTF fonts.

References

Section 9.6.4 Font Subsets (PDF32000_2008):

PDF documents may include subsets of Type 1 and TrueType fonts. The font and font descriptor
that describe a font subset are slightly different from those of ordinary fonts. These differences 
allow a conforming reader to recognize font subsets and to merge documents containing different
subsets of the same font. (For more information on font descriptors, see 9.8, "Font Descriptors".)

For a font subset, the PostScript name of the font—the value of the font’s BaseFont entry and the 
font descriptor’s FontName entry— shall begin with a tag followed by a plus sign (+). The tag shall
consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the
same PDF file shall have different tags.

EXAMPLE EOODIA+Poetica is the name of a subset of Poetica®, a Type 1 font
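
The naming rule quoted above can be sketched as follows. This is a minimal illustration, assuming a per-document allocator that remembers which tags it has already issued; `tagAllocator` and `SubsetName` are hypothetical names, and random letters are just one way to pick the tag, since the spec leaves the choice of letters arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
)

// tagAllocator is a hypothetical helper that issues subset tags which are
// unique within one PDF file.
type tagAllocator struct {
	rng  *rand.Rand
	seen map[string]bool
}

func newTagAllocator(seed int64) *tagAllocator {
	return &tagAllocator{rng: rand.New(rand.NewSource(seed)), seen: make(map[string]bool)}
}

// randomTag returns exactly six uppercase ASCII letters.
func (a *tagAllocator) randomTag() string {
	b := make([]byte, 6)
	for i := range b {
		b[i] = byte('A' + a.rng.Intn(26))
	}
	return string(b)
}

// SubsetName returns "<TAG>+<baseName>", retrying until the tag has not been
// used before in this document.
func (a *tagAllocator) SubsetName(baseName string) string {
	for {
		tag := a.randomTag()
		if !a.seen[tag] {
			a.seen[tag] = true
			return tag + "+" + baseName
		}
	}
}

func main() {
	alloc := newTagAllocator(1)
	fmt.Println(alloc.SubsetName("Poetica"))
	fmt.Println(alloc.SubsetName("Poetica"))
}
```

The same name would be written to both the font's BaseFont entry and the font descriptor's FontName entry.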

And in section 9.9 (Embedded Font Programs) it states:

A TrueType font program may be used as part of either a font or a CIDFont. Although the basic
font file format is the same in both cases, there are different requirements for what information
shall be present in the font program. These TrueType tables shall always be present if present in 
the original TrueType font program:
    “head”, “hhea”, “loca”, “maxp”, “cvt”, “prep”, “glyf”, “hmtx”, and “fpgm”. 
If used with a simple font dictionary, the
font program shall additionally contain a cmap table defining one or more encodings, 
as discussed in 9.6.6.4, "Encodings for TrueType Fonts". If used with a CIDFont dictionary,
the cmap table is not needed and shall not be present, since the mapping from character codes 
to glyph descriptions is provided separately.
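
As a rough illustration of the table rule quoted above, the sketch below selects which tables from the original font would be carried into the embedded program. The function name `tablesToKeep` is hypothetical, and the sketch covers only the tables named in the quoted passage; a real subsetter may need to handle others. Note that TrueType table tags are four bytes, so the tag for the control value table is "cvt " with a trailing space.

```go
package main

import "fmt"

// requiredTables lists the tables the quoted passage says shall be present
// if they exist in the original font program.
var requiredTables = []string{"head", "hhea", "loca", "maxp", "cvt ", "prep", "glyf", "hmtx", "fpgm"}

// tablesToKeep returns the tags to write, given the tags present in the
// original font and whether the font is used with a CIDFont dictionary.
func tablesToKeep(present []string, cidFont bool) []string {
	have := make(map[string]bool, len(present))
	for _, tag := range present {
		have[tag] = true
	}
	var keep []string
	for _, tag := range requiredTables {
		if have[tag] {
			keep = append(keep, tag)
		}
	}
	// A simple font needs a cmap; with a CIDFont it shall not be present.
	if !cidFont && have["cmap"] {
		keep = append(keep, "cmap")
	}
	return keep
}

func main() {
	present := []string{"head", "glyf", "cmap", "name", "cvt "}
	fmt.Printf("%q\n", tablesToKeep(present, true))
	fmt.Printf("%q\n", tablesToKeep(present, false))
}
```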

Section 9.6.6.4 (Encodings for TrueType Fonts) additionally describes how TrueType cmaps and the font dictionary's Encoding entry are used to map between character codes and glyph descriptions.

dennwc commented Dec 31, 2018

@gunnsth gunnsth transferred this issue from unidoc/unidoc May 23, 2019
@gunnsth gunnsth assigned adrg and unassigned dennwc Apr 25, 2020

gunnsth commented Apr 25, 2020

@adrg Can you review this proposal?

adrg commented Apr 27, 2020

I think the proposal looks good. Doing the subsetting at serialization time is a good solution in my opinion. Does the font encoder always record used glyphs and simply ignore them at serialization time if subsetting is not enabled, or does it start recording only once fnt.Subset(true) is called?
For composite fonts, the CMap should also be reduced to the mappings which are actually used, as they tend to get quite big, especially for CJK encodings.
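
A minimal sketch of that reduction is shown below. It assumes the CMap in question is a code-to-Unicode mapping (such as a ToUnicode CMap) held in memory as a Go map; the comment does not say which CMap is meant, and `reduceMappings` and `bfcharLines` are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode/utf16"
)

// reduceMappings keeps only the entries whose character code was used.
func reduceMappings(full map[uint16]rune, used map[uint16]bool) map[uint16]rune {
	out := make(map[uint16]rune)
	for code, r := range full {
		if used[code] {
			out[code] = r
		}
	}
	return out
}

// bfcharLines renders the mappings as bfchar-style entries sorted by code,
// with the destination written as UTF-16BE hex.
func bfcharLines(m map[uint16]rune) []string {
	codes := make([]int, 0, len(m))
	for c := range m {
		codes = append(codes, int(c))
	}
	sort.Ints(codes)
	lines := make([]string, 0, len(codes))
	for _, c := range codes {
		var dst strings.Builder
		for _, u := range utf16.Encode([]rune{m[uint16(c)]}) {
			fmt.Fprintf(&dst, "%04X", u)
		}
		lines = append(lines, fmt.Sprintf("<%04X> <%s>", c, dst.String()))
	}
	return lines
}

func main() {
	full := map[uint16]rune{1: 'A', 2: 'B', 3: '中'}
	used := map[uint16]bool{1: true, 3: true}
	for _, line := range bfcharLines(reduceMappings(full, used)) {
		fmt.Println(line)
	}
}
```

The set of used codes would come from whatever the encoder recorded while the document was being built.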

Will there also be a global writer/creator option to enable subsetting for all fonts? Could be useful for optimizing PDFs for disk space.
