Jonas Gierer edited this page Jun 22, 2014 · 2 revisions

Files with main data

Files located in the folder data:

  • sections.txt — Unicode table sections
  • sets.txt — symbol sets
  • entities.txt — HTML entity mnemonics (e.g. &copy;)
  • types.txt — section types (e.g. alphabet, abugida)
  • languages.txt — section languages
  • countries.txt — section countries
  • specs.txt — control characters (e.g. \n)

These files contain only common, language-independent data. All names and descriptions are located in the localisation files.

Format of files

For example, the file sections.txt:

# Sections params

[greek-coptic]
	diap            : 0370:03FF
	type            : alphabet
	languages       : greek, coptic
	countries       : greece

[cyrillic]
	diap            : 0400:04FF
	type            : alphabet
	languages       : russian, ukrainian, bulgarian
	countries       : russia, ukraine, bulgaria, serbia, macedonia, moldova

Lines beginning with a # are comments and are ignored. Empty lines are ignored as well.

The example defines two objects: the Greek alphabet (greek-coptic) and Cyrillic (cyrillic).

Each section description begins with the section key (e.g. cyrillic) in square brackets, followed by a list of characteristics in the form characteristic : value.

The key of the object has several purposes:

The key must be unique and consist only of lowercase Latin letters, digits, or hyphens.

The list of arguments (characteristics) depends on the file. Arguments can be mandatory or optional. A value can be a string or a list of comma-separated values (e.g. russian, ukrainian, bulgarian).

Please note that objects refer to languages and countries by key rather than by name, because names differ from language to language. The keys are defined in the files languages.txt and countries.txt.
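The format rules above can be sketched as a small parser. This is a minimal illustration, not the project's actual loader: the function names parse_data_file and split_list are hypothetical, and the error handling (rejecting invalid or duplicate keys) is an assumption about how strictly the rules are enforced.

```python
import re

# Keys: lowercase Latin letters, digits, or hyphens (per the rule above).
KEY_RE = re.compile(r"[a-z0-9-]+")


def parse_data_file(text):
    """Parse the sectioned format into {key: {characteristic: value}}."""
    objects = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and empty lines are ignored
        if line.startswith("[") and line.endswith("]"):
            key = line[1:-1]
            if not KEY_RE.fullmatch(key):
                raise ValueError(f"invalid key: {key!r}")
            if key in objects:
                raise ValueError(f"duplicate key: {key!r}")
            current = objects[key] = {}
            continue
        if current is None:
            raise ValueError(f"characteristic outside of an object: {line!r}")
        # Split on the first colon only, so values like 0370:03FF stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'characteristic : value': {line!r}")
        current[name.strip()] = value.strip()
    return objects


def split_list(value):
    """Split a comma-separated value into a list of trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


sample = """# Sections params

[greek-coptic]
\tdiap            : 0370:03FF
\ttype            : alphabet
\tlanguages       : greek, coptic
\tcountries       : greece
"""
sections = parse_data_file(sample)
```

With this sketch, sections["greek-coptic"]["diap"] holds the raw string 0370:03FF, and split_list turns the languages value into separate keys. Files that contain only keys, such as types.txt, parse to objects with empty characteristic dictionaries.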

Sections (sections.txt)

Arguments:

  • diap — the range (diapason) of code points, written as two hexadecimal values separated by a colon (e.g. 0370:03FF). The ranges of different sections must not overlap.
  • type — the section type (e.g. alphabet or abugida). Must match a key in types.txt. Optional.
  • languages — a list of languages that use the symbols in this section. Each entry must match a key in languages.txt. Optional.
  • countries — a list of countries that use the symbols in this section. Each entry must match a key in countries.txt. Optional.
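The rule that ranges must not overlap is easy to check mechanically. The following is a sketch under two assumptions: both endpoints of diap are inclusive, and the values are hexadecimal. The helper names parse_diap and find_overlaps are hypothetical.

```python
def parse_diap(value):
    """Turn a diap value such as '0370:03FF' into an inclusive (low, high) pair."""
    start, end = value.split(":")
    low, high = int(start, 16), int(end, 16)
    if low > high:
        raise ValueError(f"range start is after range end: {value!r}")
    return low, high


def find_overlaps(ranges):
    """Return (earlier_key, later_key) pairs whose ranges overlap.

    ranges maps a section key to its (low, high) pair. Sections are visited
    in order of their start; each one is compared with the earlier section
    that reaches furthest, which is enough to flag every offending section.
    """
    overlaps = []
    prev_key, prev_high = None, -1
    for key, (low, high) in sorted(ranges.items(), key=lambda kv: kv[1]):
        if prev_key is not None and low <= prev_high:
            overlaps.append((prev_key, key))
        if high > prev_high:
            prev_key, prev_high = key, high
    return overlaps


ranges = {
    "greek-coptic": parse_diap("0370:03FF"),
    "cyrillic": parse_diap("0400:04FF"),
}
conflicts = find_overlaps(ranges)
```

For the two sections from the example, conflicts is empty, because 0400 starts right after 03FF ends.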

Sets (sets.txt)

Used for pages (http://unicode-table.com/sets/)

Arguments:

  • set — a list of characters in this set

Example:

[set-abcdef]
    set : a, b, c, d, e, f

Types (types.txt)

At the moment no arguments are defined for types, so the file is just a list of keys:

[abjad]

[abugida]

[alphabet]

Languages (languages.txt)

Like types, languages have no arguments.

Countries (countries.txt)

Arguments:

  • map — the coordinates of this country. Format: x:y (e.g. 110:75)

HTML-entities (entities.txt)

For example, &copy; is the copyright sign (©).

The file has a simple format:

copy     : 169
ordf     : 170
laquo    : 171
not      : 172

Each line gives the entity name (without & and ;), then the decimal code of the character.

At the moment this is used in searches: http://unicode-table.com/en/search/?q=%26copy%3B
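As an illustration of how such a table could serve a search, here is a minimal sketch. It is not the site's actual search code: parse_code_table and resolve_entity are hypothetical names, and skipping # comments and empty lines in this file is an assumption carried over from the general format.

```python
from urllib.parse import unquote


def parse_code_table(text):
    """Parse 'name : decimal code' lines into {name: code}."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumed: same comment rule as the other data files
        name, sep, code = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'name : code': {line!r}")
        table[name.strip()] = int(code)
    return table


def resolve_entity(query, table):
    """Return the character for a query like '&copy;', or None if unknown."""
    if query.startswith("&") and query.endswith(";"):
        code = table.get(query[1:-1])
        if code is not None:
            return chr(code)
    return None


entities = parse_code_table("""
copy     : 169
ordf     : 170
laquo    : 171
not      : 172
""")

# The q parameter of the search URL is percent-encoded.
query = unquote("%26copy%3B")
found = resolve_entity(query, entities)
```

Here query decodes to &copy; and found is the character with decimal code 169, the copyright sign.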

Control characters (specs.txt)

These are characters like \n, \t etc. The file format is similar to entities.txt:

0: 0
a: 7
b: 8
t: 9
n: 10
v: 11
f: 12
r: 13

Each line gives the escape sequence without the backslash, then the decimal code of the character. This is also used for searching.
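The codes in this table coincide with the usual C-style escape sequences, which Python string literals share. The sketch below cross-checks the table against Python's own escapes and shows one possible lookup; resolve_escape is a hypothetical helper, not part of the project.

```python
# Table copied from the specs.txt listing: escape letter -> decimal code.
SPECS = {"0": 0, "a": 7, "b": 8, "t": 9, "n": 10, "v": 11, "f": 12, "r": 13}

# The same escapes written as Python string literals, for cross-checking.
PYTHON_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t",
    "n": "\n", "v": "\v", "f": "\f", "r": "\r",
}


def resolve_escape(query, specs=SPECS):
    """Return the decimal code for a backslash query, or None if unknown."""
    if query.startswith("\\") and len(query) > 1:
        return specs.get(query[1:])
    return None


# Every entry should agree with the code point of Python's escape.
mismatches = [key for key, ch in PYTHON_ESCAPES.items() if ord(ch) != SPECS[key]]
```

A query consisting of a backslash followed by n resolves to 10, and mismatches stays empty because every listed code matches the corresponding Python escape.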

Adding new objects

Please note that you can only refer to existing objects. For example, if you want cyrillic to refer to lang-unknown:

[cyrillic]
	diap            : 0400:04FF
	type            : alphabet
	languages       : russian, ukrainian, bulgarian, lang-unknown

You have to create lang-unknown in languages.txt and translate it into as many languages as possible in the localisation files (at least into English).
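A check for dangling references could look like the sketch below. It assumes the data has already been loaded into plain dictionaries; find_missing_references and the shape of its arguments are hypothetical and only illustrate the rule that references must point at existing keys.

```python
def find_missing_references(sections, known):
    """List references to keys that are not defined.

    sections: {section_key: {characteristic: raw value}}
    known:    {characteristic: set of keys defined in the matching file}
    Returns (section_key, characteristic, missing_key) tuples.
    """
    missing = []
    for section_key, attrs in sections.items():
        for name, valid_keys in known.items():
            raw = attrs.get(name, "")
            for ref in (item.strip() for item in raw.split(",")):
                if ref and ref not in valid_keys:
                    missing.append((section_key, name, ref))
    return missing


sections = {
    "cyrillic": {
        "diap": "0400:04FF",
        "type": "alphabet",
        "languages": "russian, ukrainian, bulgarian, lang-unknown",
    }
}
known = {
    "type": {"abjad", "abugida", "alphabet"},
    "languages": {"russian", "ukrainian", "bulgarian"},
}
problems = find_missing_references(sections, known)
```

Until lang-unknown is added to languages.txt, problems contains one entry naming cyrillic, languages, and lang-unknown. After the key is added to the known set, the list is empty.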