Produce standalone fonts when subsetting #27

Open
6 tasks
wezm opened this issue Jun 3, 2020 · 12 comments

Comments

@wezm
Contributor

wezm commented Jun 3, 2020

2022 Update:

cmap generation when subsetting, which the original text of this issue focused on, landed in Allsorts 0.9. However, this does not get us all the way to generating standalone fonts. This issue will act as a tracking issue for the things that are required to achieve that.


Original text:

The subsetting feature is currently tailored for the needs of subsetting fonts for embedding in PDFs, since that was our primary use case when developing Allsorts. The issue is that we don't include a cmap table in the subset font, which makes it invalid for use outside PDF. When a subset font is embedded in a PDF the cmap info is contained in the PDF directly, so we don't need to include it in the font.

In order to support more general subsetting it would be convenient to have an entry point that takes a list of chars and produces a font with glyphs for just those chars. This would be an incremental improvement on what we have so far and would still have some limitations: with chars as input there wouldn't be a way to include ligature glyphs. Doing so would require subsetting the GPOS and GSUB tables as well, which is a problem for another day.

The subsetting code lives in subset.rs. The new function signature could be along these lines:

/// Subset this font so that it only contains the glyphs for the supplied `chars`.
pub fn subset_chars(
    provider: &impl FontTableProvider,
    chars: &[char],
) -> Result<Vec<u8>, ReadWriteError>

The implementation would need to map chars to glyph ids using a technique similar to this. The subset font would need to include a new cmap table (probably using the Unicode platformID). There's a bunch of formats to choose from to encode the data. An initial implementation might just choose one of the simpler ones at the cost of a larger resulting font. A more sophisticated implementation could examine the data to determine the best option.
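As a rough illustration of the two steps described above (collecting char-to-glyph mappings, then examining the data to pick a cmap subtable format), here is a minimal sketch. The names `build_mappings`, `choose_cmap_format`, and `CmapFormatChoice` are hypothetical and are not part of Allsorts; the `lookup` closure stands in for whatever the font's existing cmap lookup would be. It assumes only two candidate formats: format 4, which can only encode code points up to U+FFFF, and format 12, which covers all of Unicode.

```rust
use std::collections::BTreeMap;

/// Hypothetical type for illustration only; not Allsorts' actual API.
#[derive(Debug, PartialEq, Eq)]
pub enum CmapFormatChoice {
    /// Format 4: 16-bit segment mapping, Basic Multilingual Plane only.
    Format4,
    /// Format 12: 32-bit segmented coverage, all of Unicode.
    Format12,
}

/// Collect sorted, de-duplicated (code point, glyph id) pairs.
/// `lookup` is an assumed stand-in for the source font's cmap lookup.
/// Chars with no glyph (None, or glyph 0 which is .notdef) are skipped.
/// Note: a real subsetter would also renumber glyph ids; this sketch does not.
pub fn build_mappings(
    chars: &[char],
    lookup: impl Fn(char) -> Option<u16>,
) -> Vec<(u32, u16)> {
    let mut map = BTreeMap::new();
    for &ch in chars {
        if let Some(gid) = lookup(ch) {
            if gid != 0 {
                map.insert(ch as u32, gid);
            }
        }
    }
    map.into_iter().collect()
}

/// Pick the narrowest format that can represent every requested char.
pub fn choose_cmap_format(chars: &[char]) -> CmapFormatChoice {
    if chars.iter().all(|&c| (c as u32) <= 0xFFFF) {
        CmapFormatChoice::Format4
    } else {
        CmapFormatChoice::Format12
    }
}
```

A more sophisticated version could also compare the encoded size of each candidate format before choosing, as suggested above.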

@wezm wezm added the subsetting label Jun 3, 2020
@ebraminio

The issue is that we don't include a cmap table in the subset font, which makes it invalid for use outside PDF.

Some PDF readers also won't work without a valid cmap (see https://crbug.com/1071958). I guess it is needed for their text selection to work properly.

@yisibl

yisibl commented Aug 21, 2020

Looking forward to this feature.

@wezm
Contributor Author

wezm commented Mar 29, 2022

I've just released 0.9, which implements building of a proper cmap table for subset fonts.

@wezm wezm closed this as completed Mar 29, 2022
@yisibl

yisibl commented Mar 29, 2022

@wezm Can you upgrade the dependency version in allsorts-tools?

Looks like it can be solved: yeslogic/allsorts-tools#16

@wezm
Contributor Author

wezm commented Mar 29, 2022

Yes, I'm working on that next. I have a draft PR open for it: yeslogic/allsorts-tools#18

@wezm
Contributor Author

wezm commented Mar 30, 2022

Reopening as we strip the OS/2 table, which is required in OpenType fonts.

@wezm wezm reopened this Mar 30, 2022
@yisibl

yisibl commented Mar 30, 2022

@wezm I tried to submit a PR to fix it, PTAL. #58

@yisibl

yisibl commented Jan 3, 2023

Happy New Year! Any progress here?

@wezm
Contributor Author

wezm commented Jan 3, 2023

No, sorry, it's a pretty big piece of work that has not been scheduled yet.

@dnlmlr

dnlmlr commented Mar 2, 2023

Hey! I am also trying to use subsetting for embedded fonts in PDF documents. Since I want to avoid getting too deep into the low-level PDF structure, I am just using the genpdf -> printpdf -> lopdf stack. The plan was to embed the full subsetted font into the PDF files without touching the PDF internal mappings (/Differences).

I got it to work on all tested PDF readers and printers with the current implementation of subset even though the OS/2 table is missing, but only if Unicode Encoding Records are used (mappings with CharExistence::BasicMultilingualPlane, CharExistence::AstralPlane). If CharExistence::MacRoman or CharExistence::DivinePlane is used, it doesn't work.

Would it be a sensible thing to allow forcing the default mode to be Unicode or are there any problems with this?

One workaround that I think I'll be using for now is to manually add a '€' character to the glyph_ids subset so that it can't be encoded with MacRoman, but this is not the nicest solution and will be a problem if a font doesn't actually have '€'.

@wezm
Contributor Author

wezm commented Mar 6, 2023

Would it be a sensible thing to allow forcing the default mode to be Unicode or are there any problems with this?

I don't think that would make sense as a default as it would unnecessarily inflate the font. There is already an internal CmapStrategy enum used to drive some of the cmap generation behaviour. A new variant could be added to that and then some way to select that strategy could be added.

@dnlmlr

dnlmlr commented Mar 6, 2023

Yeah, I agree that it shouldn't be the default, since this is kind of an edge case. What I meant was a way to externally change the encoding mode, for example as a parameter to the subset function. Basically any mechanism that would make it possible to optionally prevent encoding with MacRoman.
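To make the idea concrete, here is a minimal sketch of what such an opt-in parameter might look like. Everything here is hypothetical: `EncodingPreference`, `ChosenEncoding`, and `choose_encoding` are illustrative names and do not reflect the internal CmapStrategy enum or any existing Allsorts API. The sketch assumes the subsetter has already determined whether the requested chars fit in MacRoman, and keeps the size-optimised behaviour as the default.

```rust
/// Hypothetical caller-facing option; not part of Allsorts.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum EncodingPreference {
    /// Keep the current behaviour: use MacRoman when everything fits.
    #[default]
    Smallest,
    /// Always emit a Unicode encoding record, even if MacRoman would fit.
    ForceUnicode,
}

/// Hypothetical result type describing which cmap encoding gets written.
#[derive(Debug, PartialEq, Eq)]
pub enum ChosenEncoding {
    MacRoman,
    Unicode,
}

/// Decide the encoding from the caller's preference and from whether
/// the subset's chars can all be represented in MacRoman.
pub fn choose_encoding(
    fits_mac_roman: bool,
    preference: EncodingPreference,
) -> ChosenEncoding {
    match (preference, fits_mac_roman) {
        (EncodingPreference::Smallest, true) => ChosenEncoding::MacRoman,
        _ => ChosenEncoding::Unicode,
    }
}
```

With something along these lines, callers who need Unicode encoding records could pass `EncodingPreference::ForceUnicode` instead of adding an extra character to the subset, while everyone else keeps the smaller output.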
