Longer term idea: reversible glyph file naming scheme #164

Open
justvanrossum opened this issue Jan 27, 2021 · 24 comments

Comments

@justvanrossum
Contributor

justvanrossum commented Jan 27, 2021

The main problems a glyph file naming scheme needs to solve:

  1. has to work for case-sensitive glyph names on a non-case-sensitive file system
  2. has to allow characters that are not allowed on (some) file systems

UFO does not specify a maximum glyph name length, but in practice we're tied to .fea, which does.

If we were to set a maximum length for glyph names, then it is possible to create a completely reversible glyph file naming scheme that does not need a contents.plist-like mechanism at all. If we keep insisting on not imposing a maximum length, a hybrid solution may be possible (but I'm not necessarily in favor of that).


Proposed solution to 1:

Append a disambiguation code to the glyph name that encodes a bitfield, using one bit per character in the glyph name, corresponding to the case of the character: 0 = lowercase, 1 = uppercase.

If we allow trailing zeros to be omitted, this scheme is very efficient for mostly lowercase glyph names (esp. if any uppercase characters appear early in the glyph name).

If this code is encoded in a base32-like encoding, we need one ASCII character per 5 bits of data. We can choose the encoding to use [a-z0-5]. Perhaps use # as a separator character.

If the glyph name is entirely lowercase, the disambiguation code can be omitted.

Example: the file name for Aring would become Aring#b.glif (b would encode the bits 10000). The file name for aring would be simply aring.glif.

This scheme guarantees that if two glyph names only differ in case, their corresponding file names will be unique, even on a non-case sensitive file system, while still containing the full glyph name.
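
As an illustration only, the case bitfield described above could be sketched as follows. The bit order (first character in the lowest bit, so that trailing zeros drop off) and the exact symbol order of the [a-z0-5] alphabet are assumptions chosen to reproduce the Aring#b.glif example; the function names are hypothetical, and %XX escaping is left out here.

```python
import string

# Assumed alphabet: 32 symbols drawn from [a-z0-5], with "a" standing for 0.
ALPHABET = string.ascii_lowercase + "012345"
SEPARATOR = "#"  # the separator character is still under discussion


def case_code(glyph_name: str) -> str:
    """Return the base32-like code marking the uppercase positions."""
    # Bit i is set when character i is uppercase; character 0 is the lowest bit.
    bits = 0
    for i, char in enumerate(glyph_name):
        if char.isupper():
            bits |= 1 << i
    # Emit 5 bits per symbol, lowest bits first, so trailing zeros are omitted.
    symbols = []
    while bits:
        symbols.append(ALPHABET[bits & 0b11111])
        bits >>= 5
    return "".join(symbols)


def glyph_name_to_file_name(glyph_name: str) -> str:
    """Append the case code (if any) and the .glif extension."""
    code = case_code(glyph_name)
    suffix = SEPARATOR + code if code else ""
    return glyph_name + suffix + ".glif"


print(glyph_name_to_file_name("Aring"))  # Aring#b.glif
print(glyph_name_to_file_name("aring"))  # aring.glif
```

Because the code differs whenever the case pattern differs, the two file names stay distinct even after lowercasing.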

Proposed solution to 2:

Use url-style %XX escaping.


With an assumed maximum file name length of 255 (which is what ufoLib currently assumes), we can still use fairly long glyph names with this scheme (longer than .fea's 64-character maximum).

To get a glyph name from a file name:

  • chop off the .glif file extension
  • chop off the part starting with #, if there is one
  • unescape any %XX sequences
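
The three decoding steps above can be sketched in a few lines. This is a minimal sketch, assuming # as the separator and assuming that any # inside the glyph name was itself %-escaped when the file name was written; `file_name_to_glyph_name` is a hypothetical name, not an existing ufoLib function.

```python
from urllib.parse import unquote


def file_name_to_glyph_name(file_name: str, separator: str = "#") -> str:
    """Recover the glyph name from a file name without consulting contents.plist."""
    extension = ".glif"
    if not file_name.endswith(extension):
        raise ValueError(f"not a glif file name: {file_name!r}")
    # Step 1: chop off the .glif extension.
    stem = file_name[: -len(extension)]
    # Step 2: chop off the part starting with the separator, if there is one.
    stem, _, _case_code = stem.partition(separator)
    # Step 3: unescape any %XX sequences.
    return unquote(stem)


print(file_name_to_glyph_name("Aring#b.glif"))    # Aring
print(file_name_to_glyph_name("%2Enotdef.glif"))  # .notdef
```

Note that the case code is simply discarded on reading: the file name already carries the glyph name with its original case, and the code exists only to keep file names distinct on case-insensitive file systems.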

Relates to #122

@alerque

alerque commented Jan 28, 2021

Perhaps use # as a separator character.

Solution 1 sounds better to me, although I would suggest a different separator. Among the handful of possible characters, # is second only to pipes, quotes, spaces, and slashes (okay, sixth-ish) on the list of problematic characters for file systems, build systems, URL encoding, etc. Oh yeah, and anything that's a glob operator should probably be avoided. Why not just . or _? Given the rest of the scheme, those wouldn't conflict with anything, would they?

@justvanrossum
Contributor Author

justvanrossum commented Jan 28, 2021

Whichever separator we'd choose: it will either have to be escaped if it occurs in the glyph name, or the separator will have to be a non-optional part of the file name, so we can say "take the last one". I prefer it to be optional, though, so all-lowercase glyph names won't need a disambiguation code at all.

Since . and _ are common and useful in glyph names, it would be a shame to have to escape either of those. Same for -.

I didn't realize # is a common unixy meta character. Is there anything else "free" in the ASCII range?

Solution 1 sounds better to me

There are two problems that need to be solved, and I proposed one solution for each :)

@typemytype
Contributor

typemytype commented Jan 29, 2021

The separator and the case sensitivity are not the biggest problem; those are relatively easy to fix (and already fixed in the current implementation by adding an _ after every capitalised letter, this is a human readable solution)

The issue is all those illegalCharacters and reservedFileNames and .notdef. From the moment you start escaping those it's not possible to have a reversible glyph naming scheme...

see https://unifiedfontobject.org/versions/ufo3/conventions/#example-implementation

Adding those illegalCharacters and reservedFileNames and .notdef should not end up in the glyph.name spec; that would be way too many restrictions. I like how the spec defines glyph.name now.

A possible solution is to have a marker that the glyph.name spec does not allow in a glyph name, and to use it to fence those illegalCharacters and reservedFileNames and .notdef. This fencing marker must only be used in file names, to escape those special cases.

for example: `

  • .notdef --> `.notdef`
  • con --> `con`
  • con.alt --> `con`.alt

bonus: this only adds max 2 extra characters to the filename compared to the glyph name.

@justvanrossum
Contributor Author

justvanrossum commented Jan 30, 2021

(and already fixed in the current implementation by adding an _ after every capitalised letter, this is a human readable solution)

Except that is not reversible!

The issue is all those illegalCharacters and reservedFileNames and .notdef. From the moment you start escaping those it's not possible to have a reversible glyph naming scheme...

I didn't think the reserved file names issue through yet, but escaping the illegal characters with URL-style %XX escapes as I proposed does make it reversible. https://en.wikipedia.org/wiki/Percent-encoding

Leading periods should also be percent-escaped:

  • .notdef becomes %2Enotdef

Reserved filenames:

  • adding a disambiguation code makes it a non-reserved filename
  • if the reserved file name is all lowercase, add a dummy code anyway: con could become something like con#0.glif

A potential alternate separator character is ~. Is ~ problematic for unixy or Windowsy reasons? It's not a reserved character in URLs.

@alerque

alerque commented Jan 30, 2021

This fencing marker must only be used in file names, to escape those special cases.
for example: ``

A backtick would make a nightmare for a fencing marker. Don't go there.

In fact I don't think the fencing scheme works at all, and as Just noted following capitals with an underscore isn't reversible unless you also have some way to escape underscores where they naturally occur.

A potential alternate separator character is ~. Is ~ problematic for unixy or Windowsy reasons?

Yes, ~ in paths on Unix is a shortcut for $HOME. Hash marks are less troublesome on *nix than tilde, but more troublesome for URLs. In fact given the fact that you want to use percent for something else, hash is looking better and better. It's troublesome but in ways that are easier to work around than most of the alternatives.

What about =, *, or ^? The former has meaning in URLs, but only in some positions. The latter are both glob/regex modifiers, but those are easier to deal with than other characters with magic meanings.

@justvanrossum
Contributor Author

I'd like to avoid * for glob-reasons. It seems the separator candidates are:

  1. #
  2. =
  3. ^

Visually, I like # best, = is fine, and ^ I could live with but don't find pretty.

@alerque

alerque commented Jan 30, 2021

Fair enough on scratching out *. Visually and semantically I like ^ but, it has similar (if not quite as ubiquitous) issues with globbing.

Semantically, # has a meaning that is less appropriate here than =: con#2.glif suggests iteration two of con which isn't what we're after, con=2.glif at least suggests the 2 is somehow being used to interpret con.

Also, while = will sometimes be percent-encoded in URLs, it only has meaning in query strings, not path segments. Meanwhile # breaks parsing. Exhibit:

Okay, so GitHub Markdown parsing isn't going to make the exhibits easy. Here are the URLs:

https://github.com/alerque/temp/blob/main/foo=2.glif
https://github.com/alerque/temp/blob/main/foo%3D2.glif

https://github.com/alerque/temp/blob/main/foo#2.glif
https://github.com/alerque/temp/blob/main/foo%232.glif
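
The difference between the two pairs of URLs can be reproduced with Python's standard `urllib.parse` module. This is a small sketch; `path_and_fragment` is a hypothetical helper that only shows how a URL parser splits each variant.

```python
from urllib.parse import quote, urlsplit

BASE = "https://github.com/alerque/temp/blob/main/"


def path_and_fragment(file_name: str, escape: bool = False):
    """Return (path, fragment) as a URL parser sees them for BASE + file_name."""
    if escape:
        # quote() percent-encodes both "#" and "=".
        file_name = quote(file_name)
    parts = urlsplit(BASE + file_name)
    return parts.path, parts.fragment


for name in ("foo=2.glif", "foo#2.glif"):
    print(name, path_and_fragment(name), path_and_fragment(name, escape=True))
```

With an unescaped #, everything after it is parsed as the fragment and the path ends at foo, whereas the unescaped = stays inside the path segment.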

@justvanrossum
Contributor Author

justvanrossum commented Jan 30, 2021

Here's a quick test implementation: https://gist.github.com/justvanrossum/c1055da1041f8976a31a93ea838cc05e

(Setting it up with ^ for now, I like ^ better than I thought I would.)

These are my test cases so far:

_testCases = [
    ("Aring",       "Aring^1.glif"),
    ("aring",       "aring.glif"),
    ("ABCDEGF",     "ABCDEGF^V3.glif"),
    ("f_i",         "f_i.glif"),
    ("F_I",         "F_I^5.glif"),
    (".notdef",     "%2Enotdef.glif"),
    (".null",       "%2Enull.glif"),
    ("CON",         "CON^7.glif"),
    ("con",         "con^0.glif"),
    ("aux",         "aux^0.glif"),
    ("con.alt",     "con.alt.glif"),
    ("A:",          "A%3A^1.glif"),
    ("A^321",       "A%5E321^1.glif"),
    ("a\\",         "a%5C.glif"),
    ("a\t",         "a%09.glif"),
  # ("a ",          "a%20.glif"),  # escape space?
    ("a ",          "a .glif"),    # or not?
    ("a\"",         "a%22.glif"),
    ("aaaaaaaaaA",  "aaaaaaaaaA^0G.glif"),
    ("AAAAAAAAAA",  "AAAAAAAAAA^VV.glif"),
]
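
For readers who want to experiment without opening the gist, here is an independent sketch that reproduces the test cases listed above. It is not the gist's code. The digit alphabet (0-9 then A-V, least significant digit first) is inferred from codes such as V3, 0G and VV; the reserved-name set is a small illustrative subset; and using `urllib.parse.quote` for escaping is an assumption that may escape more characters than the gist does. Spaces are left unescaped here, matching the uncommented test case.

```python
from urllib.parse import quote

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # inferred from the codes above
SEPARATOR = "^"
RESERVED = {"con", "prn", "aux", "nul"}  # illustrative subset only


def escape(glyph_name: str) -> str:
    """Percent-escape unsafe characters, including a leading period."""
    escaped = quote(glyph_name, safe=" ")  # keep spaces literal (open question)
    if escaped.startswith("."):
        escaped = "%2E" + escaped[1:]
    return escaped


def case_code(glyph_name: str) -> str:
    """Base32 code of the uppercase mask, computed on the unescaped name."""
    bits = sum(1 << i for i, char in enumerate(glyph_name) if char.isupper())
    code = ""
    while bits:
        code += DIGITS[bits & 31]
        bits >>= 5
    return code


def glyph_name_to_file_name(glyph_name: str) -> str:
    code = case_code(glyph_name)
    if not code and glyph_name.lower() in RESERVED:
        code = "0"  # dummy code so the file name is no longer reserved
    suffix = SEPARATOR + code if code else ""
    return escape(glyph_name) + suffix + ".glif"


print(glyph_name_to_file_name("ABCDEGF"))  # ABCDEGF^V3.glif
print(glyph_name_to_file_name(".notdef"))  # %2Enotdef.glif
print(glyph_name_to_file_name("con"))      # con^0.glif
```

Like the test cases, this sketch only treats a name as reserved when the whole glyph name matches, so con.alt gets no dummy code.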

@justvanrossum
Contributor Author

I wrote:

Except that is not reversible!

Except that I was wrong! The "add underscore after uppercase letter" part of the current scheme is reversible, as the underscore itself is doubled. Eg. F_F becomes F___F_. (The current scheme is not reversible if the glyph name contains reserved characters.)
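
To make the reversibility argument concrete, here is a minimal sketch of just the underscore rule as described in this comment: an underscore is appended to every uppercase letter and a literal underscore is doubled. It is not the ufoLib implementation, it ignores illegal characters and reserved names, and the function names are hypothetical.

```python
def encode_case(glyph_name: str) -> str:
    """Append "_" to uppercase letters and double literal underscores."""
    out = []
    for char in glyph_name:
        if char == "_":
            out.append("__")
        elif char != char.lower():
            out.append(char + "_")
        else:
            out.append(char)
    return "".join(out)


def decode_case(encoded: str) -> str:
    """Invert encode_case by reading one token at a time, left to right."""
    out = []
    i = 0
    while i < len(encoded):
        char = encoded[i]
        if char == "_" or char != char.lower():
            # Underscores and uppercase letters always occupy two characters.
            if encoded[i + 1 : i + 2] != "_":
                raise ValueError(f"malformed name at position {i}: {encoded!r}")
            i += 2
        else:
            i += 1
        out.append(char)
    return "".join(out)


print(encode_case("F_F"))               # F___F_
print(decode_case(encode_case("F_F")))  # F_F
```

Decoding relies on the file system preserving case, which case-insensitive file systems such as NTFS and APFS normally do.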

So, while my base32 scheme would allow longer glyph names than the current one, and keeps the original glyph name intact as part of the file name (as long as no reserved chars are used), it is otherwise debatable whether my proposal is even an improvement.

It's possible we keep the _ aspect of the current scheme, and only work on making the rest of the scheme reversible.

(To repeat: it can't be fully reversible if we don't specify a length limitation on the glyph name.)

@verdy-p

verdy-p commented Nov 29, 2021

And why not use ! as the separator?

(no globbing issues for filenames used in shell scripts as with * and ?, no restrictions in common filesystems or shells as with :/\.()[]{}, no problems in URLs as with ?=#%+_), and it is easier to type on most keyboards (whereas international layouts don't all have a ^, or treat it as a dead key, requiring an additional space to be typed to get it).

Also the Base32 bitmap for uppercase mapping is not very friendly.

I tend to think that we would do better to just tag individual characters (and avoid UTF-8 hexadecimal escapes as well, which cause more problems for embedding in URLs), for example:

  • a to z (only for lowercase ASCII letters), 0 to 9, -, ,, ; and $ are left as is (see remark below).
  • A to Z just become a! to z! (or A! to Z!), with a trailing (rather than leading) ! (only for uppercase ASCII letters)
  • !21 to !7e (or !7E), for escaping other ASCII characters if needed (possibly all other punctuations, including those below?)
    Which ASCII punctuations below that would need to be escaped may be discussed. We have only 26 basic Latin letters (all taken below), the case of $ may be discussed, but it can still be represented as !24. Same remark for , and ;.
  • !00 to !1F or !7F, for escaping ASCII C0 controls (if needed)
  • the space character becomes !- (for SPACE), to avoid trimming and compression
  • ! becomes !!, for simply escaping it
  • !u0080 to !ud7ff and !ue000 to !uffff (or !U0080 to !UD7FF and !UE000 to !UFFFF), for escaping non-ASCII Unicode characters in the BMP (rather than using sequences of '%nn' based on UTF-8 bytes, we directly map the Unicode codepoint value in lowercase or uppercase hexadecimal). Unicode surrogates are excluded as they are not characters (valid pairs of surrogates used in UTF-16 to encode non-BMP characters have to be encoded as a single code point, see below).
    Leading zeroes after !u or !U may be dropped only if there's no unescaped ASCII digit or letter a to f following the escaped character.
  • !x010000 to !x10ffff (or !X010000 to !X10FFFF), for escaping Unicode characters outside of the BMP (rather than using sequences of '%nn' based on UTF-8 bytes, or encoding valid pairs of UTF-16 surrogates separately, we directly map the Unicode codepoint value in lowercase or uppercase hexadecimal).
    A leading zero after !x or !X may be dropped only if there's no unescaped ASCII digit or letter a to f following the escaped character.
  • _ becomes !n or !N (for "uNderscore", may be needed in some URL schemes)
  • . becomes !d or !D (for "Dot", needed for some filesystems at some positions)
  • / becomes !f or !F (for "Forward slash")
  • \ becomes !b or !B (for "Backslash")
  • : becomes !c or !C (for "Colon")
  • # becomes !h or !H (for "Hash sign")
  • + becomes !p or !P (for "Plus sign")
  • % becomes !r or !R (for "peRcent" or "Ratio")
  • * becomes !a or !A (for "Asterisk")
  • ? becomes !q or !Q (for "Question mark")
  • < becomes !l or !L (for "Lower than")
  • = becomes !e or !E (for "Equal sign")
  • > becomes !g or !G (for "Greater than")
  • ~ becomes !t or !T (for "Tilde")
  • | becomes !v or !V (for "Vertical bar")
  • ' becomes !i or !I (for "sIngle quotation mark")
  • " becomes !k or !K (for "double quotation marK")
  • ( becomes !o or !O (for "Open parenthesis")
  • ) becomes !s or !S (for "cloSe parenthesis")
  • ^ becomes !j or !J
  • [ becomes !m or !M
  • ] becomes !w or !W
  • { becomes !y or !Y
  • } becomes !z or !Z

I would call such escaping mechanism a "filename-safe" encoding scheme. It could be generic and not limited to mapping glyph names to filenames, and designed to be safe for filesystems with non significant lettercases.

Other schemes are still possible, including the Punycode transform (as used in IDNA for domain names, but without the IDN restrictions for authorized characters and without its lettercase unification, but it is a complex scheme in its trailing part after the -- separator even if the leading part contains only ASCII letters or digits and only single hyphens between them).

@madig
Contributor

madig commented Jan 24, 2022

Another way of avoiding reserved names is to, uhm, err, prepend e.g. _ to every file name unconditionally and then just drop reserved name handling (and also drop the leading character on decoding). This also drops the need to replace leading periods with underscores. Reserved names are, from my testing on Win 10 and XP, a problem only if name.split(".", 1)[0] in reserved_names, so "con" and "con.txt" are forbidden, but "acon.glif", "_con" and "hello.con.txt" are fine.
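
The check quoted above, and the effect of the unconditional prefix, can be sketched as follows. The set of reserved names is the commonly documented list of Windows device names and is an assumption here (the comment does not list them); `is_reserved`, `add_prefix` and `strip_prefix` are hypothetical helpers.

```python
# Commonly documented Windows device names (assumed; compared case-insensitively).
RESERVED_NAMES = frozenset(
    ["con", "prn", "aux", "nul"]
    + [f"com{i}" for i in range(1, 10)]
    + [f"lpt{i}" for i in range(1, 10)]
)


def is_reserved(file_name: str) -> bool:
    """True if the part before the first period is a reserved device name."""
    return file_name.split(".", 1)[0].lower() in RESERVED_NAMES


def add_prefix(file_name: str) -> str:
    """Prepend "_" unconditionally, so no file name starts with a reserved name."""
    return "_" + file_name


def strip_prefix(file_name: str) -> str:
    """Drop the unconditional "_" again when reading."""
    if not file_name.startswith("_"):
        raise ValueError(f"missing prefix: {file_name!r}")
    return file_name[1:]


for name in ("con", "con.txt", "acon.glif", "_con", "hello.con.txt"):
    print(name, is_reserved(name))
print(is_reserved(add_prefix("con.glif")))  # False
```

Since no reserved name starts with an underscore, the prefixed names never need any reserved-name handling.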

@verdy-p

verdy-p commented Feb 4, 2022

A leading _ for "reserved names" (like con, aux) will still be ambiguous: it only allows using _ elsewhere after any letter just to distinguish lettercases. But then how do you encode other restricted characters? This would require using another non-letter character (from ASCII only?) before the _, and this leaves very few options: you can't use the 52 ASCII letters, can't use the 34 controls and space, so your initial character before the _ can only be one of the remaining 42 (=128-52-34) characters: ten digits, the rest being punctuations with many of them restricted as well in filenames (at least ., /, \); as well quotation marks or common characters used in shells (notably, *, #, ?) would cause problems for encapsulation of these filenames (e.g. in URIs); if you remove these 16 characters from having special interpretation with the following _, it remains 24 possible combinations (including __ for escaping the _ character itself, which is very common in usernames for example).

Another possible solution would be to use "trigraphs" as documented for C/C++ preprocessors, except that you would want to use something other than the ?? prefix, because ? is also reserved in many filesystems and shells (and in URLs).

@madig
Contributor

madig commented Feb 4, 2022

A leading "_" for "reserved names" (like "con", "aux") will still be ambiguous: it only allows using "_" elsewhere after any letter just to distinguish lettercases.

I don't follow. If you prepend every file name with an underscore (reserved or not) and remove it on reading the file, you can use it anywhere else in the name like before.

@verdy-p

verdy-p commented Feb 5, 2022

The unconditional "_" prefix would then apply to every filename, just to solve the problem of possible reserved filenames (which is variable depending on OS/environments/filesystem/versions), but it still does not solve the problem of reserved characters and case insensitivity (e.g. for naming files for glyphs, whose case is significant and that may include restricted characters like "*", "#", "/" or "?").

As well, if there are filename length restrictions (on FAT without LFN support), we need something else: an extra metadata file containing the mappings between filenames and glyph names (even if we try to circumvent some restrictions using an archive format like ZIP, which holds filenames with fewer restrictions, there would be problems extracting and archiving the archive on a restricted filesystem). This is in fact general for any development project naming its source files: usually, programming environments enforce some restrictions for safe source filenames: restricting "/", "\", "*", "?", "#", compressing whitespace, and reserving "." for filename extensions rather than basenames.

I don't see any definitive solution, except by using a mapping metadata file (just like what fonts already do internally in their tables): this mapping could be optional as long as its values can be inferred from a limited set of production rules; however, these rules could be overridden at any time in an explicit mapping file, allowing more flexibility for the naming scheme.

Some default production rules are those already defined in Postscript for naming glyphs; but if there are none to reliably name a glyph, one could use hexadecimal codepoints followed by a custom extension for contextual or linguistic variants (separated by "_" maybe? with a fixed number of digits, or without leading zeroes?).

Each font project can then choose how to manage its own namespace using the mapping metadata file in the project. But if we are building an OTF/TTF font, there should be a Postscript glyph names table which can be part of the generated OTF/TTF font. As well, it is possible to automatically generate a suitable mapping file using the basic Postscript glyph name rules (those recognized as well in PDF readers), then a list of hex codepoints and an extension for contextual glyph variants (default extensions can be just created by numeral increments). If this initial mapping is not the best for font designers, they can rename as they want by updating the name mapping file (which should be a one-to-one bijection, if we want for example PDF readers to be able to infer the Unicode encoding from a list of glyph ids, without needing to use OCR techniques: this reverse conversion from a list of glyphs to encoded text is not very easy, given the existence of Unicode Bidi reordering, OpenType reorderings, for example with prepended vowels, and other GSUB/GPOS rules that could be used for complex ligatures).

@madig
Contributor

madig commented Feb 7, 2022

The unconditional "_" prefix would then apply to every filename, just to solve the problem of possible reserved filenames (which is variable depending on OS/environments/filesystem/versions), but it still does not solve the problem of reserved characters and case insensitivity (e.g. for naming files for glyphs, whose case is significant and that may include restricted characters like "*", "#", "/" or "?").

Oh yes. I mean, prepend an underscore unconditionally and then the reserved name concern goes away (I'm only aware of Windows imposing something here) and you can focus on how to handle special characters and case-insensitivity.

Maybe relying on a flaky string storage for fully reversible names is a fool's errand and we do need some kind of mapping file, I don't know.

But if we are building an OTF/TTF font, there should be a Postscript glyph names table

Already exists in the form of https://unifiedfontobject.org/versions/ufo3/lib.plist/#publicpostscriptnames.

@anthrotype
Member

anthrotype commented Nov 30, 2022

I like the scheme that Just proposed in #164 (comment)

Unfortunately the caret ^ is an escape character in Windows command prompt (similar to a posix backslash), see https://ss64.com/nt/syntax-esc.html -- so we should probably use something else. Maybe just a period . could do the trick. We could say the penultimate . optionally separates the disambiguating part of the path name (the last required . separates the .glif extension suffix of course).

E.g. "Aring.1.glif" or "CON.7.glif" (instead of "Aring^1.glif" and "CON^7.glif")

I think imposing a maximum glyph name length of 255 is reasonable if it allows us to devise a reversible filenaming scheme.

I doubt one would ever fit a whole Lorem Ipsum paragraph inside a glyph name, e.g. these are 255 characters, and I'd argue that they ought to be enough:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras venenatis quam vel turpis fermentum iaculis. Donec posuere dictum nulla et euismod. Sed elit nulla, sodales id ipsum ac, pretium tempor orci. Nunc accumsan libero et mattis varius. Vivamus nec.

Also, I think that space character should also be escaped, to avoid having to quote glif path names if used in command lines.

@justvanrossum
Contributor Author

I would like the suffix to be optional, and using . makes that ambiguous I think. See:

("aring",       "aring.glif"),

@anthrotype
Member

well, sorry, forget period ., it can't be optionally used to separate the case-disambiguation part because it may well be part of glyph names themselves

@anthrotype
Member

whatever the separator, it has to be one that can't occur in glyph names themselves, or it must always be there at the end of the filename

@justvanrossum
Contributor Author

whatever the separator, it has to be one that can't occur in glyph names themselves, or it must always be there at the end of the filename

Whatever the separator, if it occurs in the glyph name itself, it will have to be %-escaped.

@anthrotype
Member

you're right. By the way, the percent symbol % proposed for URL-style encoding is also a special character on the Windows command line, but similarly to the caret one can escape it by doubling it (%% or ^^) or wrapping it in quotes, so I don't think we should block on this. It's gonna be hard or even impossible to make POSIX shells, the Windows command prompt, regular expression syntax, and what have you all happy.
So what Just proposed is ok: caret as separator for case-disambiguating suffix and percent-encoding for reserved characters.

@anthrotype
Member

ok so to recap. To make it a reversible naming scheme that doesn't require contents.plist we have two options, both of which can use url-style %-encoding to escape illegal characters:

  1. uses a disambiguation suffix separated by ^ containing a base32 encoded mask where each bit corresponds to a capital letter in the filename preceding the ^, optionally omitted for all-lowercase filenames: e.g. "Aring" => "Aring^1.glif", or "aring" => "aring.glif".
  2. appends an "_" to all capital letters as well as the underscore itself (which gets doubled up): "F_F" => "F___F_.glif"

Option 2) is basically the same as we currently use, with the difference that illegal characters are %-encoded, instead of replaced with "_", so they can be reversed. The underscore notation keeps the readability, and is already familiar to font devs, at the cost perhaps of a longer filename.

If we were to set a maximum length for glyph names, then it is possible to create a completely reversible glyph file naming scheme that does not need a contents.plist-like mechanism at all. If we keep insisting on not imposing a maximum length, a hybrid solution may be possible (but I'm not necessarily in favor of that).

this is the bit I don't quite understand yet. Why is the maximum length on glyph names a requirement for a fully reversible, contents.plist-less naming mechanism?
Is it because filenames in practice do impose maximum length limitations, thus glyph names length cannot be unbounded?
And this max length limits inevitably lead to truncating data making the scheme no longer reversible?

I think that imposing a max glyph name length could be reasonable and won't ever be hit in practice, so I'm ok with that.

@justvanrossum
Contributor Author

Is it because filenames in practice do impose maximum length limitations, thus glyph names length cannot be unbounded?
And this max length limits inevitably lead to truncating data making the scheme no longer reversible?

Yes and yes.

I think that imposing a max glyph name length could be reasonable and won't ever be hit in practice, so I'm ok with that.

I agree. It would just be nice to figure out and document what the actual limit will be. With either of the two schemes it will depend on the number of capital letters in the glyph name...
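
As a rough way to explore that limit, the file name length under each of the two schemes can be computed directly. This sketch assumes the 255-character file name limit mentioned earlier in the thread, a one-character separator, and no %XX escapes or reserved-name dummy codes (both of which would lengthen the name further); all function names are hypothetical.

```python
import math

MAX_FILE_NAME = 255
EXTENSION = ".glif"


def length_with_case_suffix(glyph_name: str) -> int:
    """Scheme 1: name + separator + one base32 digit per 5 bits of case mask."""
    upper = [i for i, char in enumerate(glyph_name) if char.isupper()]
    if not upper:
        return len(glyph_name) + len(EXTENSION)
    # The mask is only as long as the position of the last uppercase letter.
    code_length = math.ceil((upper[-1] + 1) / 5)
    return len(glyph_name) + 1 + code_length + len(EXTENSION)


def length_with_underscores(glyph_name: str) -> int:
    """Scheme 2: one extra "_" per uppercase letter and per literal underscore."""
    extra = sum(1 for char in glyph_name if char.isupper() or char == "_")
    return len(glyph_name) + extra + len(EXTENSION)


def longest_all_uppercase(length_fn) -> int:
    """Longest all-uppercase glyph name that still fits in MAX_FILE_NAME."""
    n = 0
    while length_fn("A" * (n + 1)) <= MAX_FILE_NAME:
        n += 1
    return n


print(longest_all_uppercase(length_with_case_suffix))  # 207
print(longest_all_uppercase(length_with_underscores))  # 125
```

Under these assumptions an all-lowercase name may use 250 characters in either scheme, while the all-uppercase worst case allows 207 characters with the base32 suffix and 125 with the underscore notation.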

@verdy-p

verdy-p commented Dec 9, 2022

If my last scheme was too complex, it can also be reduced (but note that above, lowercase and uppercase letters were considered equivalent, due to case-insensitive filesystems).

Just retain !## for hex escapes of 8-bit codepoints (limited to ISO 8859-1), !u#### for hex escapes of 16-bit codepoints (limited to the Basic Multilingual Plane), and !x###### for 24-bit codepoints (for other planes), with a fixed number of hex digits. All special punctuation can be handled. The character ! has no specific use in filesystems, most common shells, or URLs, so it is a safe escaping character, just like _, which can be used as an unconditional prefix for filenames (solving the problem with the special "con" and "aux", though in practice I doubt we'll ever have glyphs named this way) and can still be used for prefixing uppercase letters A-Z, leaving lowercase letters a-z and all basic digits 0-9 unescaped. (I'm not a fan of differentiating lettercase by a binary-coded suffix.)


A similar (maybe even simpler) escaping could reduce to !##! for any hex escape, with a variable number (up to 6) of hex digits in the codepoint, the first one being non-zero. Note that the second ! may be dropped if it is not followed by a hex digit 0-9 or a-f (or A-F: remember that case is not significant), or if this is a codepoint in the 17th plane (i.e. the 2nd private-use plane, needing 6 hex digits for U+100000 to U+10FFFD).
Note also that there should be no glyph using U+0000 (but eventually it could map to !0! or just !!). The character ! itself needs to be escaped as !21!, eventually reduced to !21 if not followed by another hex digit or escape. This gives (here also the target filenames are given in lowercase only):

_testCases = [
    ("a",           "_a.glif"),
    ("aring",       "_aring.glif"),
    ("ae",          "_ae.glif"),
    ("a-e",         "_a-e.glif"),
    ("A",           "__a.glif"),
    ("Aring",       "__aring.glif"),
    ("AE",          "__a_e.glif"),
    ("A-E",         "__a-_e.glif"),
    ("_",           "__.glif"),
    ("__",          "____.glif"),
    ("f_",          "_f_.glif"),
    ("f_i",         "_f__i.glif"),
    ("f_I",         "_f___i.glif"),
    ("F_i",         "__f__i.glif"),
    ("F_I",         "__f___i.glif"),
    ("aaaaaaaaaa",  "_aaaaaaaaaa.glif"),
    ("aaaaaaaaaA",  "_aaaaaaaaa_a.glif"),
    ("ABCDEGF",     "__a_b_c_d_e_g_f.glif"),
    ("AAAAAAAAAA",  "__a_a_a_a_a_a_a_a_a_a.glif"),
    ("AaaaaAaaaa",  "__aaaaa_aaaaa.glif"),
    (".notdef",     "_.notdef.glif"),
    (".null",       "_.null.glif"),
    ("CON",         "__c_o_n.glif"),
    ("con",         "_con.glif"),
    ("aux",         "_aux.glif"),
    ("con.alt",     "_con.alt.glif"),
    ("*",           "_!2a.glif"),
    ("?",           "_!3f.glif"),
    ("?!",          "_!3f!!21.glif"),
    ("???",         "_!3f!!3f!!3f.glif"),
    ("who?",        "_who!3f.glif"),
    ("oh!",         "_oh!21.glif"),
    ("oh!!",        "_oh!21!!21.glif"),
    ("A:",          "__a!3a.glif"),
    ("A:a",         "__a!3a!a.glif"),
    ("A^321",       "__a!5e!321.glif"),
    ("a\\",         "_a!5c.glif"),
    ("a\\-",        "_a!5c-.glif"),
    ("a\\a",        "_a!5c!a.glif"),
    ("a\\A",        "_a!5c_a.glif"),
    ("a\t",         "_a!9.glif"),
    ("a \t",        "_a!20!!9.glif"),
    ("a ",          "_a!20.glif"),
    ("a  ",         "_a!20!!20.glif"),
    ("a a",         "_a!20!a.glif"),
    ("a\"",         "_a!22.glif"),
]

Literal underscores for escaping uppercase letters must be doubled if they occur before a literal lowercase letter, or before a literal uppercase letter or underscore needing their own underscore escape.

Note that the behavior of underscores for handling lettercase only applies to ASCII letters; filesystems may or may not treat case differences for other letters (depending on versions of the UCD they are using for case mappings) so uppercase non-ASCII letters should be escaped in hex. But in fact non-ASCII characters should probably all be hex-escaped (for having too fuzzy support in filesystems, possibly also changing and enforcing a Unicode normalization form): glyph names themselves should preferably be limited to ASCII, but if not, these extra characters have to be hex-escaped.

Finally, this whole thread is only about finding a solution to restrictions/limitations of filesystems (forbidding or not distinguishing some filenames or creating some additional aliases). We should not care about limitations/restrictions added by shells. So what is relevant is just what is found in common filesystems, the most restrictive being those used by Windows (case insensitive names, the handling of leading/trailing whitespaces or dots, the behavior of wildcards, a few legacy reserved names, and path separators, plus the Windows-specific behavior of tilde "~" related to the generation of "short filenames" for compatibility with legacy programs inherited from DOS for FAT filesystems when they still did not have LFN support, this behavior being still used on Windows filesystems having LFN support; on Linux/Unix, we are just concerned by wildcards, a single path separator "/" and special names "." and "..", which are also restricted on Windows filesystems).

Additional characters that are restricted on Windows are "|", "<", ">", as well as double quotation marks (they are not restricted on Linux/Unix, just used specially by its common command-line shells, providing escaping mechanisms when needed, including for wildcards "?" and "*", for character classes with "[...]", and for sets of alternate names with "{..., ...}").

Other things like the "=", "%", "$", "&", "{...}", "[...]" and "(...)" characters, specifically used by the syntactic parsers of command line shells, should not concern us: there's a wide set of shells, each one having its own tricks and its own escaping mechanism when needed, but not adding restrictions on the filesystems on which they are used.

The most problematic case is if we want to use legacy filesystems that don't have LFN support (basically old FAT filesystems without the extension supported since Windows 95): these are extremely unlikely to be ever used here for developing/supporting "unified font objects". The only way to support these would be to use a "mapping file" containing the list of short filenames mapped for what was intended to be long filenames (it is possible to maintain such a mapping file here, but in reality there are alternatives, such as storing these files in a ZIP container: extracting files from the ZIP could also create and maintain such a mapping file automatically: the short filenames would be generated on the fly using a scheme similar to what is used on Windows for "8.3" names, using a few letters in prefix, plus a basic numeric counter, before a shortened extension; those files with short names could remain just in a temporary working folder, along with the temporary mapping file, and be discarded once we are done and they are rearchived).

This does not mean that the project development here on GitHub must generate ZIP archives for its collection of glyph files: it's up to the client to manage their local archives when talking with GitHub; reading files from ZIP archives can be extremely fast, as they don't even need to use extreme compression levels (they are just there as an easy workaround possible for any client that could not store individual files directly "as is" on its local filesystems). This is exactly like what already happens every day within all web browsers for managing their cache: browsers maintain their own mapping file to index the long names referenced on external sites in their URLs or web APIs. Web sites do not ever have to know or manage these client-side indexes themselves. And today, clients can avoid that local "cost" of managing ZIP archives and indexes by just using a more capable filesystem for the storage (today NTFS, or modern FAT32 with LFN, or exFAT, ReFS... Let's forget old ISO-FS on CDROMs/DVDs without support of the "Joliet" extension, old FAT on MSDOS and Windows before Windows 95, or antique filesystems like CP/M that did not even have the concept of distinct directories in their root).

However, for deployment of apps, using ZIP archives or libraries can still be interesting if these files are to be mostly used as read-only resources: they take less space and (un)install themselves much faster with less overhead on local filesystems and frequently improve the overall performance of the app using them: that is what most modern apps are doing today for their packages (including for their "theme packs", "resource bundles"...), with the additional possibility of embedding other metadata along with their embedded mapping file (e.g. digital signatures, versioning info, security descriptors, permissions, intended usage, etc.). But our archiving format should ultimately be the complete font file (in some OpenType/XML/SVG/PostScript/webfont container format) that this project intends to produce, so that fonts become instantly installable or referenceable in applications needing them. Our individual ".glif" files are intermediate development files only to be used by very few users/developers/designers, and will almost never be used "as is" by final applications. We just want individual ".glif" files to manage the development, design, and interchange in a more granular way than just hosting plain ".ttf" files on GitHub or other source repositories (because those offer no facility such as diffs, history of changes, development comments, patches, conditional testing of changes, reusability, restructuring of font contents...)


6 participants