What things are called in Who's On First
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
inventory
.gitignore
LABELS.md
README.md

README.md

whosonfirst-names

What things are called in Who's On First

Names and aliases (and languages) – aka RFC 5646

Who's On First uses a variation of RFC 5646 for identifying names. The W3C's Language tags in HTML and XML page describes RFC 5646 like this:

RFC 5646 caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter - however, where you need additional power it will be available to you. In fact, for most people, RFC 5646 should actually make life simpler in a number of ways – for one thing, there is only one place you need to look now for valid subtags.

Although it provides some additional options for identifying common language variations, RFC 5646 includes all of the tags that were previously valid. If you have been using RFC 1766, RFC 3066, or RFC 4646 you do not need to make any changes to your tags.

The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.

language-extlang-script-region-variant-extension-privateuse

RFC 5646 (BCP 47) comformance

Sometimes RFC 5646 is referred to as BCP (Best Current Practice) 47.

  • We follow the same structure outlined in RFC 5646 but use _ (underbar) characters instead of - (dash) characters for delimiting individual properties of a language identifier.

  • We use three-letter language codes (rather than two-letter codes) to identify the primary language.

  • The use of either the "script" or "region" subtags is allowed, although neither is required.

  • While not explicitly forbidden neither the "extlang" or the "variant" subtags are commonly used, and Who's On First tools for parsing name labels may not support them.

  • We use private extensions, specifically a -x-[NAME_TYPE] label. We also use private extensions that are longer than the maximum 8-characters per RFC 5646. A mapping (between WOF names and shortened RFC 5646 compliant extensions) does not exist, as of this writing, but will be provided.

RFC 5646 (BCP 47) tag conversion

  1. Replace the "_" separators with "-"
  2. You probably want to replace 3-letter language codes with 2-letter language codes

For example

eng_x_preferred → en-x-preferred
fre_ca_x_variant → fr-ca-x-variant

Code-y bits (and backwards compatibility)

The mapzen.whosonfirst.names Python library provides libraries and functions for converting between Who's on First, Geoplanet and RFC 5646 subtags.

Note: When converting to subtags the library does convert three-letter language codes to two-letter language codes.

For example:

import mapzen.whosonfirst.names

lbl = mapzen.whosonfirst.names.labels()
names = ("fin_p", "eng_s", "unk_v")

for n in names:
	print n

        n2 = lbl.convert(n, 'geoplanet', 'wof')
        print n2

        n3 = lbl.convert(n2, 'wof', 'subtags')
        print n3

        n4 = lbl.convert(n3, 'subtags', 'wof')
        print n4

        n5 = lbl.convert(n4, 'wof', 'geoplanet')
        print n5

Would yield:

fin_p
fin_x_prefered
fin-x-prefered
fin_x_prefered
fin_p
eng_s
eng_x_colloquial
eng-x-colloquial
eng_x_colloquial
eng_s
unk_v
und_x_variant
und-x-variant
und_x_variant
unk_v

A short history

In the beginning:

  • We had names according to QuattroShapes

  • We re-indexed all the names, aliases and translations from WOE (7.10) and concordances between WOE and the Gazetteer for many of them. Those we don't have concordances for will simply be imported in to the Gazetteer as new records complete with their names and aliases.

  • We had concordances for many places in Geonames which also has many of its own aliases and translations, sometimes exceeding those of WOE

WOE defines two properties for a name:

  1. a ISO 639-3 language code
  2. a name "type", which is a canned list as defined by the WOE folks:
The Name_Type field is a single letter code that describes the alias
as follows:

  * P is a preferred English name
  * Q is a preferred name (in other languages)
  * V is a well-known (but unofficial) variant for the place
      (e.g. "New York City" for New York)
  * S is either a synonym or a colloquial name for the place
      (e.g. "Big Apple" for New York), or a version of the name which
      is stripped of accent characters.
  * A is an abbreviation or code for the place (e.g. "NYC" for New
      York)

WOE also distinguishes between a name and an alias so in their world you end up with something like:

Name:  Montréal 
Language: FRE
Alias (ENG_P): Montreal
Alias (KOR_Q): 몬트리올

WOE does not however account for the fact that some countries have multiple languages.

With all that in mind, decided that:

  • We should support multiple languages for a place and label placement
  • We should just use p for a preferred name, regardless of language
  • We should use a name namespace for names because it is explicit (likewise for fullname)
  • wof:name is English by default and is a string (rather than a list of strings) that can be used for labels

For example:

{
	"wof:name": "Montreal",
	"wof:lang": [ "eng", "fre" ],
	"name:eng_p": "Montreal",
	"name:eng_a": "YMQ",
	"name:fre_p": "Montréal",
	"name:kor_p": "몬트리올",
}

But wait, there's more!

One day we met @nyampire who told us that he had a gazetteer of places published by the Japanese government that contained place names in Kanji, Kana and English. Since Kanji is a script the solution described above doesn't work. So now we're using RFC 5646 and subtags.

See also