What things are called in Who's On First
Who's On First uses a variation of RFC 5646 for identifying names. The W3C's Language tags in HTML and XML page describes RFC 5646 like this:
RFC 5646 caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter - however, where you need additional power it will be available to you. In fact, for most people, RFC 5646 should actually make life simpler in a number of ways – for one thing, there is only one place you need to look now for valid subtags.
Although it provides some additional options for identifying common language variations, RFC 5646 includes all of the tags that were previously valid. If you have been using RFC 1766, RFC 3066, or RFC 4646 you do not need to make any changes to your tags.
The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.
language-extlang-script-region-variant-extension-privateuse
Sometimes RFC 5646 is referred to as BCP (Best Current Practice) 47.
- We follow the same structure outlined in RFC 5646 but use "_" (underbar) characters instead of "-" (dash) characters for delimiting individual properties of a language identifier.
- We use three-letter language codes (e.g. eng rather than the two-letter en) to identify the primary language.
- The use of either the "script" (e.g. kor_latn) or "region" (e.g. eng_ca) subtags is allowed, although neither is required.
- While not explicitly forbidden, neither the "extlang" nor the "variant" subtags are commonly used, and Who's On First tools for parsing name labels may not support them.
- We use private extensions, specifically a -x-[NAME_TYPE] label. We also use private extensions that are longer than the 8-character maximum allowed by RFC 5646. A mapping between WOF names and shortened, RFC 5646 compliant extensions does not exist as of this writing, but will be provided.
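The rules above can be sketched as a small parser. This is only an illustration and not how any Who's On First tool does it: the NameLabel type, the LABEL_RE pattern and the parse_label function are hypothetical names, and the pattern assumes a three-letter language code, an optional four-letter script, an optional two-letter region and an optional x_[NAME_TYPE] private extension, separated by underbars.

```python
import re
from typing import NamedTuple, Optional


class NameLabel(NamedTuple):
    language: str             # three-letter code, e.g. "eng"
    script: Optional[str]     # four-letter script subtag, e.g. "latn"
    region: Optional[str]     # two-letter region subtag, e.g. "ca"
    name_type: Optional[str]  # private-use NAME_TYPE, e.g. "preferred"


# Hypothetical pattern: language, then optional script, region and
# "x_<NAME_TYPE>" parts, all separated by underbars.
LABEL_RE = re.compile(
    r"^(?P<language>[a-z]{3})"
    r"(?:_(?P<script>[a-z]{4}))?"
    r"(?:_(?P<region>[a-z]{2}))?"
    r"(?:_x_(?P<name_type>[a-z0-9]+))?$"
)


def parse_label(label: str) -> NameLabel:
    """Split a name label into its parts, or raise ValueError."""
    m = LABEL_RE.match(label.lower())
    if m is None:
        raise ValueError("unsupported name label: %r" % label)
    return NameLabel(**m.groupdict())


print(parse_label("kor_latn"))
print(parse_label("fre_ca_x_variant"))
```

The sketch deliberately rejects "extlang" and "variant" subtags, as well as numeric region codes, since the text says tools may not support the first two and does not mention the last.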
To convert a Who's On First name label to an RFC 5646 language tag:
- Replace the "_" separators with "-"
- You probably want to replace 3-letter language codes with 2-letter language codes
eng_x_preferred → en-x-preferred
fre_ca_x_variant → fr-ca-x-variant
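A minimal sketch of those two steps follows. The to_rfc5646 function and the THREE_TO_TWO table are hypothetical, and the table holds only the two codes used in the examples; a real conversion would need the complete ISO 639 mapping.

```python
# Illustrative subset only: a real conversion would need the full
# ISO 639 table of three-letter to two-letter codes.
THREE_TO_TWO = {"eng": "en", "fre": "fr"}


def to_rfc5646(label: str, shorten: bool = True) -> str:
    """Turn a WOF-style label into a dash-delimited language tag."""
    parts = label.split("_")
    if shorten:
        # Leave the language code unchanged when no short form is known.
        parts[0] = THREE_TO_TWO.get(parts[0], parts[0])
    return "-".join(parts)


print(to_rfc5646("eng_x_preferred"))   # en-x-preferred
print(to_rfc5646("fre_ca_x_variant"))  # fr-ca-x-variant
```

The sketch does not shorten private extensions that exceed the 8-character limit, so its output is not guaranteed to be a valid RFC 5646 tag.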
The mapzen.whosonfirst.names Python library provides functions for converting between Who's On First, Geoplanet and RFC 5646 subtags.
Note: When converting to subtags the library does not convert three-letter language codes to two-letter language codes.
For example:
import mapzen.whosonfirst.names

lbl = mapzen.whosonfirst.names.labels()

# Geoplanet-style labels: a language code plus a one-letter name type
names = ("fin_p", "eng_s", "unk_v")

for n in names:
    print(n)
    # Round trip: Geoplanet -> WOF -> subtags -> WOF -> Geoplanet
    n2 = lbl.convert(n, 'geoplanet', 'wof')
    print(n2)
    n3 = lbl.convert(n2, 'wof', 'subtags')
    print(n3)
    n4 = lbl.convert(n3, 'subtags', 'wof')
    print(n4)
    n5 = lbl.convert(n4, 'wof', 'geoplanet')
    print(n5)
Would yield:
fin_p
fin_x_preferred
fin-x-preferred
fin_x_preferred
fin_p
eng_s
eng_x_colloquial
eng-x-colloquial
eng_x_colloquial
eng_s
unk_v
und_x_variant
und-x-variant
und_x_variant
unk_v
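The output above implies a simple mapping between Geoplanet name types and Who's On First private extensions. The sketch below reproduces that round trip without the library. It is not the library's implementation: the function names are hypothetical, and the tables contain only the entries visible in the sample output (p, s, v and the unk language code).

```python
# Name types and language codes inferred from the sample output only.
TYPE_NAMES = {"p": "preferred", "s": "colloquial", "v": "variant"}
LANGUAGES = {"unk": "und"}  # Geoplanet "unknown" becomes "und"

# Reverse lookups for the opposite direction.
TYPE_CODES = {name: code for code, name in TYPE_NAMES.items()}
GEOPLANET_LANGUAGES = {wof: gp for gp, wof in LANGUAGES.items()}


def geoplanet_to_wof(label: str) -> str:
    """Convert e.g. "fin_p" to "fin_x_preferred"."""
    lang, code = label.lower().split("_")
    lang = LANGUAGES.get(lang, lang)
    return "%s_x_%s" % (lang, TYPE_NAMES[code])


def wof_to_geoplanet(label: str) -> str:
    """Convert e.g. "fin_x_preferred" to "fin_p" (no script or region)."""
    lang, _, name_type = label.split("_")
    lang = GEOPLANET_LANGUAGES.get(lang, lang)
    return "%s_%s" % (lang, TYPE_CODES[name_type])


for n in ("fin_p", "eng_s", "unk_v"):
    wof = geoplanet_to_wof(n)
    print(n, wof, wof_to_geoplanet(wof))
```

Labels that carry a script or region subtag are out of scope here and would raise an error.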
In the beginning:
- We had names according to QuattroShapes
- We re-indexed all the names, aliases and translations from WOE (7.10) and have concordances between WOE and the Gazetteer for many of them. Those we don't have concordances for will simply be imported into the Gazetteer as new records, complete with their names and aliases.
- We had concordances for many places in Geonames, which also has many of its own aliases and translations, sometimes exceeding those of WOE
WOE defines two properties for a name:
- an ISO 639-3 language code
- a name "type", which is a canned list as defined by the WOE folks:
The Name_Type field is a single letter code that describes the alias
as follows:
* P is a preferred English name
* Q is a preferred name (in other languages)
* V is a well-known (but unofficial) variant for the place
  (e.g. "New York City" for New York)
* S is either a synonym or a colloquial name for the place
  (e.g. "Big Apple" for New York), or a version of the name which
  is stripped of accent characters.
* A is an abbreviation or code for the place (e.g. "NYC" for New
  York)
WOE also distinguishes between a name and an alias, so in their world you end up with something like:
Name: Montréal
Language: FRE
Alias (ENG_P): Montreal
Alias (KOR_Q): 몬트리올
WOE does not, however, account for the fact that some countries have multiple languages.
With all that in mind, we decided that:
- We should support multiple languages for a place and label placement
- We should just use p for a preferred name, regardless of language
- We should use a name namespace for names because it is explicit (likewise for fullname)
- wof:name is English by default and is a string (rather than a list of strings) that can be used for labels
For example:
{
    "wof:name": "Montreal",
    "wof:lang": [ "eng", "fre" ],
    "name:eng_p": "Montreal",
    "name:eng_a": "YMQ",
    "name:fre_p": "Montréal",
    "name:kor_p": "몬트리올"
}
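To show how such a record might be consumed, here is a minimal sketch that collects the preferred name for each language and falls back to wof:name for labels. The preferred_names and label_for helpers are hypothetical and assume the single-letter name types shown in this example.

```python
properties = {
    "wof:name": "Montreal",
    "wof:lang": ["eng", "fre"],
    "name:eng_p": "Montreal",
    "name:eng_a": "YMQ",
    "name:fre_p": "Montréal",
    "name:kor_p": "몬트리올",
}


def preferred_names(props: dict) -> dict:
    """Map each language code to its preferred ("p") name."""
    names = {}
    for key, value in props.items():
        namespace, _, label = key.partition(":")
        if namespace != "name":
            continue  # skip wof:name, wof:lang and other namespaces
        lang, _, name_type = label.partition("_")
        if name_type == "p":
            names[lang] = value
    return names


def label_for(props: dict, lang: str) -> str:
    """Pick a label for a language, defaulting to wof:name."""
    return preferred_names(props).get(lang, props["wof:name"])


print(preferred_names(properties))
print(label_for(properties, "fre"))
print(label_for(properties, "deu"))
```

Because wof:name is always a plain string, it works as a safe default when no preferred name exists for the requested language.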
But wait, there's more!
One day we met @nyampire, who told us that he had a gazetteer of places published by the Japanese government that contained place names in Kanji, Kana and English. Since Kanji is a script rather than a language, the solution described above doesn't work. So now we're using RFC 5646 and subtags.