# Basics of grapheme tokenization and transliteration in Python

[Steven Moran](http://www.comparativelinguistics.uzh.ch/de/moran.html)

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook/recipes/Basics](https://github.com/unicode-cookbook/recipes/Basics). 

This use case illustrates how to segment text into graphemes and how to transliterate those graphemes using an orthography profile. Details about orthography profiles, and more, are available in the [Unicode Cookbook for Linguists](https://github.com/unicode-cookbook/cookbook).

This recipe uses Python 3.5. 

GitHub renders Jupyter notebooks nicely, so you can copy and paste code into your interpreter or scripts. If, however, you `git clone` this [recipes repository](https://github.com/unicode-cookbook/recipes) and have Jupyter installed on your machine, this file is also executable in the browser. Run `jupyter notebook` in this directory.

## Overview

Let's use the Python [segments](https://pypi.python.org/pypi/segments/) package to tokenize graphemes and to transliterate input data with orthography profiles. Installation instructions are here: [https://github.com/unicode-cookbook/recipes](https://github.com/unicode-cookbook/recipes). This recipe shows examples using both the API and the command line.

## API access

Begin by importing the `Tokenizer` from the `segments` library.

In [1]:
from segments.tokenizer import Tokenizer

Next, instantiate a tokenizer object, which takes optional arguments for an orthography profile and an orthography profile rules file.

In [2]:
t = Tokenizer()

The default tokenization strategy is to segment the input text at the [Unicode Extended Grapheme Cluster](http://www.unicode.org/reports/tr18/tr18-19.html#Default_Grapheme_Clusters) boundaries and to return a space-delimited string of graphemes. White space between words in the input is by default rendered as a hash symbol <#>, a linguistic convention for marking word boundaries. The default grapheme tokenization is useful when you encounter a text that you want to tokenize to identify potential orthographic or transcription elements.
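As a rough illustration of what segmenting at grapheme cluster boundaries involves, the sketch below attaches every combining mark to the base character before it and joins words with a hash. It is a simplification that uses only the standard library; it is neither the `segments` implementation nor the full Unicode segmentation algorithm, and the function names `rough_graphemes` and `rough_tokenize` are made up for this example.

```python
import unicodedata


def rough_graphemes(text):
    """Approximate grapheme clusters: glue each combining mark (category M*)
    onto the character that precedes it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def rough_tokenize(text, segment_separator=" ", separator=" # "):
    """Segment each whitespace-delimited word and join the words with `separator`."""
    words = text.split()
    return separator.join(
        segment_separator.join(rough_graphemes(word)) for word in words
    )


# "a" followed by COMBINING TILDE and COMBINING CARON stays together as one unit.
print(rough_tokenize("a\u0303\u030cb cd"))
```

The two keyword arguments mirror the roles that `segment_separator` and `separator` play in the tokenizer calls in this section.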

In [3]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p


In [4]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p", segment_separator="-")
print(result)

ĉ-h-á-ɾ-ã̌-c-t-ʼ-ɛ-↗-ʐ-ː-| # k͡-p


In [5]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p", separator=" // ")
print(result)

ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | // k͡ p


The optional `ipa` parameter forces grapheme segmentation for [IPA strings](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet). Note here that [Unicode Spacing Modifier Letters](https://en.wikipedia.org/wiki/Spacing_Modifier_Letters), such as <ː>, and the combining tie bar <◌͡◌> will be segmented together with base characters (although you might need orthography profiles and rules to correct these in your input source).
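The extra merging step for IPA can be pictured as a second pass over already segmented clusters. The sketch below is an assumption about the general idea, not the logic `segments` actually uses: a cluster that starts with a modifier letter (Unicode category `Lm`) is attached to the previous cluster, and a cluster that follows a tie bar is attached as well. The name `merge_ipa` and the `TIE_BARS` set are hypothetical.

```python
import unicodedata

# Combining double inverted breve (above) and double breve below.
TIE_BARS = {"\u0361", "\u035c"}


def merge_ipa(clusters):
    """Merge modifier letters and tie-bar partners into the preceding cluster.

    `clusters` is a list of non-empty strings for a single word.
    """
    merged = []
    for cluster in clusters:
        attach = bool(merged) and (
            unicodedata.category(cluster[0]) == "Lm"
            or merged[-1][-1] in TIE_BARS
        )
        if attach:
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


# t + modifier apostrophe, retroflex z + length mark, k + tie bar + p
print(merge_ipa(["t", "\u02bc", "\u0290", "\u02d0", "k\u0361", "p"]))
```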

In [6]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p", ipa=True)
print(result)

ĉ h á ɾ ã̌ c tʼ ɛ ↗ ʐː | # k͡p


If you want to tokenize a string at [Unicode code point](https://unicode.org/glossary/#code_point) boundaries, for example to get the frequency of each and every character in your input, use the tokenizer's `characters` method. (Passing `characters=True` to the tokenizer call is planned but not yet available.)
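For the frequency use case mentioned here, a standard-library sketch could look like the following. It assumes that you want counts over the decomposed (NFD) form of the text, so that a base letter and its diacritics are counted separately, and that white space should be ignored; the function name `code_point_frequencies` is made up for this example.

```python
import unicodedata
from collections import Counter


def code_point_frequencies(text):
    """Count every code point in the NFD-normalized text, ignoring white space."""
    decomposed = unicodedata.normalize("NFD", text)
    return Counter(ch for ch in decomposed if not ch.isspace())


# Precomposed a-acute decomposes into "a" plus COMBINING ACUTE ACCENT.
counts = code_point_frequencies("\u00e1 a")
for ch, freq in counts.most_common():
    print("U+{0:04X}\t{1}".format(ord(ch), freq))
```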

In [7]:
# Forthcoming
# result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p", characters=True)
# Current
result = t.characters("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

c ̂ h a ́ ɾ a ̃ ̌ c t ʼ ɛ ↗ ʐ ː | # k ͡ p


### Orthography profiles

You can also load an orthography profile and tokenize input strings with it. In the `data` directory we've placed an example orthography profile. Let's have a look at it using `more` on the command line.

In [8]:
!!more data/orthography-profile.tsv

['Grapheme\tIPA\tXSAMPA\tCOMMENT',
 'a\ta\ta',
 'aa\taː\ta:',
 'b\tb\tb',
 'c\tc\tc',
 'ch\ttʃ\ttS',
 '-\tNULL\tNULL\t"comment with\ttab"',
 'on\tõ\to~',
 'n\tn\tn',
 'ih\tí\ti_H',
 'inh\tĩ́\ti~_H']

For a slightly nicer display of Jupyter Notebooks on GitHub, we pipe our [TSV file](https://en.wikipedia.org/wiki/Tab-separated_values) to `csvlook` in [csvkit](http://csvkit.readthedocs.io/).

In [9]:
!!more data/orthography-profile.tsv | csvlook -t

['| Grapheme | IPA | XSAMPA | COMMENT          |',
 '| -------- | --- | ------ | ---------------- |',
 '| a        | a   | a      |                  |',
 '| aa       | aː  | a:     |                  |',
 '| b        | b   | b      |                  |',
 '| c        | c   | c      |                  |',
 '| ch       | tʃ  | tS     |                  |',
 '| -        |     |        | comment with\ttab |',
 '| on       | õ  | o~     |                  |',
 '| n        | n   | n      |                  |',
 '| ih       | í  | i_H    |                  |',
 '| inh      | ĩ́ | i~_H   |                  |']

An orthography profile is a character-delimited UTF-8 text file. The first column of the orthography profile must be labeled `Grapheme`. Other columns are optional. The formal specification is defined in [Chapter 8](https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf) of The Unicode Cookbook for Linguists.

Each row in the `Grapheme` column specifies a grapheme that may be found in the orthography of the input text. In this example, we provide additional columns `IPA` and `XSAMPA`, which are mappings from our graphemes to their [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) and [XSAMPA](https://en.wikipedia.org/wiki/X-SAMPA) transliterations. The final column `COMMENT` is for comments; if a comment contains a tab, put the whole string in double quotes, as in the `"comment with<TAB>tab"` row of the profile above.
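To make the file format concrete, here is a small sketch that parses a tab-delimited profile with Python's `csv` module and checks the `Grapheme` header. It reads from an in-memory string rather than `data/orthography-profile.tsv`, only includes a few rows, and the helper name `read_profile` is hypothetical; `segments` has its own profile loader.

```python
import csv
import io

# A cut-down profile; the last row has a quoted comment containing a tab.
PROFILE_TEXT = (
    "Grapheme\tIPA\tXSAMPA\tCOMMENT\n"
    "aa\ta\u02d0\ta:\n"
    "ch\tt\u0283\ttS\n"
    '-\tNULL\tNULL\t"comment with\ttab"\n'
)


def read_profile(handle):
    """Return the profile rows as dictionaries keyed by column name."""
    reader = csv.DictReader(handle, delimiter="\t")
    if not reader.fieldnames or reader.fieldnames[0] != "Grapheme":
        raise ValueError("the first column must be labeled 'Grapheme'")
    return list(reader)


rows = read_profile(io.StringIO(PROFILE_TEXT))
for row in rows:
    print(row["Grapheme"], row["IPA"], repr(row["COMMENT"]))
```

Rows that omit the optional `COMMENT` field come back with `None` in that column, and the quoted comment keeps its embedded tab.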

Let's load the orthography profile with our tokenizer.

In [10]:
from segments.tokenizer import Profile

t = Tokenizer('data/orthography-profile.tsv')

  'Unspecified column "{0}" in table {1}'.format(k, self.local_name))
  'Unspecified column "{0}" in table {1}'.format(k, self.local_name))
  'Unspecified column "{0}" in table {1}'.format(k, self.local_name))


Now let's segment some input strings with our orthography profile. The output follows the graphemes defined in the profile: for example, the sequence &lt;a a&gt; is treated as the single unit &lt;aa&gt;, and likewise for the sequences &lt;c h&gt;, &lt;o n&gt; and &lt;i h&gt;.

In [11]:
t('aabchonn-ih')

'aa b ch on n - ih'
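One way to picture profile-based segmentation is greedy longest-match: at each position, take the longest string that is listed in the `Grapheme` column. The sketch below follows that idea under the assumption that it is a fair approximation; the actual matching logic in `segments` may differ, and the names `GRAPHEMES` and `tokenize_with_profile` are made up for this example.

```python
# Graphemes from the example profile above.
GRAPHEMES = {"a", "aa", "b", "c", "ch", "-", "on", "n", "ih", "inh"}


def tokenize_with_profile(text, graphemes, missing="\ufffd"):
    """Greedy longest-match segmentation against a set of known graphemes."""
    longest = max(len(g) for g in graphemes)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in graphemes:
                tokens.append(candidate)
                i += size
                break
        else:
            # Nothing in the profile matches at this position.
            tokens.append(missing)
            i += 1
    return tokens


tokens = tokenize_with_profile("aabchonn-ih", GRAPHEMES)
print(" ".join(tokens))
```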

This example shows how we can tokenize input text according to our orthographic specification. We can also segment graphemes and transliterate them into other formats, which is useful when sources use different orthographies and we want to compare them in a single orthographic representation, such as IPA or XSAMPA.

In [12]:
t("aabchonn-ih", column="IPA")

'aː b tʃ õ n í'

In [13]:
t("aabchonn-ih", column="XSAMPA")

'a: b tS o~ n i_H'
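Transliteration can be thought of as a lookup of each segmented grapheme in the chosen column. The sketch below assumes, based on the outputs above where the hyphen disappears, that a `NULL` value means "emit nothing"; the dictionary `IPA` and the function `transliterate` are hypothetical names, not part of the `segments` API.

```python
# Grapheme -> IPA column of the example profile (only the rows needed here).
IPA = {
    "aa": "a\u02d0",
    "b": "b",
    "ch": "t\u0283",
    "on": "o\u0303",
    "n": "n",
    "-": "NULL",
    "ih": "i\u0301",
}


def transliterate(tokens, mapping, null="NULL", missing="\ufffd"):
    """Map each token through `mapping`, dropping tokens whose value is `null`."""
    mapped = (mapping.get(token, missing) for token in tokens)
    return " ".join(value for value in mapped if value != null)


ipa = transliterate(["aa", "b", "ch", "on", "n", "-", "ih"], IPA)
print(ipa)
```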

It is also useful to know which characters in your input data are not in your orthography profile. By default, missing characters are displayed with the [Unicode REPLACEMENT CHARACTER U+FFFD](http://www.fileformat.info/info/unicode/char/fffd/index.htm), which appears below as a white question mark within a black diamond <�>.

In [14]:
t("aa b ch on n - ih x y z")

'aa # b # ch # on # n # - # ih # � # � # �'

You can change the default by specifying a different replacement character when you load the orthography profile with the tokenizer.

In [15]:
t = Tokenizer('data/orthography-profile.tsv', errors_replace=lambda c: '?')
t("aa b ch on n - ih x y z")

'aa # b # ch # on # n # - # ih # ? # ? # ?'

In [16]:
t = Tokenizer('data/orthography-profile.tsv', errors_replace=lambda c: '<{0}>'.format(c))
t("aa b ch on n - ih x y z")

'aa # b # ch # on # n # - # ih # <x> # <y> # <z>'

Perhaps you want to create an initial orthography profile that also contains those graphemes &lt;x&gt;, &lt;y&gt;, and &lt;z&gt;? (Note that the space character and its frequency are also captured in this initial profile.)

In [17]:
profile = Profile.from_text("aa b ch on n - ih x y z")
print(profile)

Grapheme	mapping	frequency
 	 	9
a	a	2
h	h	2
n	n	2
b	b	1
c	c	1
o	o	1
-	-	1
i	i	1
x	x	1
y	y	1
z	z	1
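The shape of such an initial profile is easy to reproduce by hand. The sketch below counts individual code points with `collections.Counter` and writes them out, most frequent first, in the same three-column layout. It is an illustration only: it does not group combining marks into grapheme clusters, and `initial_profile` is a made-up name rather than the `Profile.from_text` implementation.

```python
from collections import Counter


def initial_profile(text):
    """Build a tab-separated starter profile: grapheme, identity mapping, frequency."""
    counts = Counter(text)
    lines = ["Grapheme\tmapping\tfrequency"]
    # most_common() sorts by frequency; ties keep first-occurrence order.
    for grapheme, frequency in counts.most_common():
        lines.append("{0}\t{0}\t{1}".format(grapheme, frequency))
    return "\n".join(lines)


profile_text = initial_profile("abca")
print(profile_text)
```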


## Command line access

Make sure to `pip install segments` to install the command line tool.

Get some help with `segments -h`.

In [18]:
!!segments -h

['usage: segments [-h] [--verbosity VERBOSITY] [--encoding ENCODING]',
 '                [--profile PROFILE] [--mapping MAPPING]',
 '                command ...',
 '',
 'Main command line interface of the segments package.',
 '',
 'positional arguments:',
 '  command               tokenize | profile',
 '  args',
 '',
 'optional arguments:',
 '  -h, --help            show this help message and exit',
 '  --verbosity VERBOSITY',
 '                        increase output verbosity',
 '  --encoding ENCODING   input encoding',
 '  --profile PROFILE     path to an orthography profile',
 '  --mapping MAPPING     column name in ortho profile to map graphemes',
 '',
 "Use 'segments help <cmd>' to get help about individual commands."]

We have created some test data in the `sources/german.txt` file using the word 'Schächtelchen', which is the diminutive form of 'Schachtel', meaning 'box, packet, or carton' in English.

In [19]:
!!more sources/german.txt

['Schächtelchen']

We can create an initial orthography profile of the German text by passing it to the `segments profile` command. The initial profile tokenizes the text on Unicode grapheme clusters, lists the frequency of each grapheme, and provides an initial mapping column by default.

In [20]:
!!cat sources/german.txt | segments profile | csvlook -t

['| Grapheme | frequency | mapping |',
 '| -------- | --------- | ------- |',
 '| c        |         3 | c       |',
 '| h        |         3 | h       |',
 '| e        |         2 | e       |',
 '| S        |         1 | S       |',
 '| ä        |         1 | ä       |',
 '| t        |         1 | t       |',
 '| l        |         1 | l       |',
 '| n        |         1 | n       |']

Next, we know a bit about German orthography and which characters combine to form German graphemes. We can use the information from our initial orthography profile to hand-curate a more precise German orthography profile that takes into account capitalization (German orthography obligatorily capitalizes nouns) and grapheme clusters, such as &lt;sch&gt; and &lt;ch&gt;. We can use the initial orthography profile above as a starting point (note that in large texts, graphemes with a very low frequency in the frequency column may signal errors in the input, such as typos). The initial orthography profile can be edited with a text editor or spreadsheet program. As per the orthography profile specifications (see [Chapter 7](https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf)), we can adjust rows in the `Grapheme` column and then add additional columns for transliteration or for comments.

In [21]:
!!more data/german-orthography-profile.tsv | csvlook -t

['| Grapheme | IPA | XSAMPA | COMMENT                      |',
 '| -------- | --- | ------ | ---------------------------- |',
 '| Sch      | ʃ   | S      | German nouns are capitalized |',
 '| ä        | ɛː  | E:     |                              |',
 '| ch       | ç   | C      |                              |',
 '| t        | t   | t      |                              |',
 '| e        | e   | e      |                              |',
 '| l        | l   | l      |                              |',
 '| n        | n   | n      |                              |']

By passing our orthography profile to the `segments` command line tool, we can now segment our German text example into graphemes.

In [22]:
!!cat sources/german.txt | segments --profile=data/german-orthography-profile.tsv tokenize

 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 'Sch ä ch t e l ch e n']

And by providing `segments` a column for transliteration, we can convert the text into IPA or XSAMPA.

In [23]:
!!cat sources/german.txt | segments --mapping=IPA --profile=data/german-orthography-profile.tsv tokenize

 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 'ʃ ɛː ç t e l ç e n']

In [24]:
!!cat sources/german.txt | segments --mapping=XSAMPA --profile=data/german-orthography-profile.tsv tokenize

 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 'S E: C t e l C e n']
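For readers who prefer to see the pipeline in one place, the sketch below processes a stream of lines the way the shell commands above do, using only the standard library. It makes two assumptions that are not taken from `segments`: matching is greedy longest-match, and both the profile keys and the input are NFD-normalized first so that precomposed and decomposed <ä> compare equal. `GERMAN_IPA` and `transliterate_stream` are hypothetical names, and the in-memory stream stands in for `sources/german.txt`.

```python
import io
import unicodedata

# Grapheme -> IPA column of the German profile shown above.
GERMAN_IPA = {
    "Sch": "\u0283",
    "\u00e4": "\u025b\u02d0",
    "ch": "\u00e7",
    "t": "t",
    "e": "e",
    "l": "l",
    "n": "n",
}


def nfd(text):
    return unicodedata.normalize("NFD", text)


def transliterate_stream(stream, mapping, missing="\ufffd"):
    """Yield one space-delimited transliteration per input line."""
    table = {nfd(key): value for key, value in mapping.items()}
    longest = max(len(key) for key in table)
    for line in stream:
        word = nfd(line.strip())
        out, i = [], 0
        while i < len(word):
            for size in range(min(longest, len(word) - i), 0, -1):
                if word[i:i + size] in table:
                    out.append(table[word[i:i + size]])
                    i += size
                    break
            else:
                out.append(missing)
                i += 1
        yield " ".join(out)


result = list(transliterate_stream(io.StringIO("Sch\u00e4chtelchen\n"), GERMAN_IPA))
print(result[0])
```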

## An additional example

Next we provide another example of working with orthography profiles.

In [25]:
!!more sources/text.txt

['aäaaöaaüaa']

In [26]:
!!cat sources/text.txt | segments profile | csvlook -t

['| Grapheme | mapping | frequency |',
 '| -------- | ------- | --------- |',
 '| a        | a       |         7 |',
 '| ä        | ä       |         1 |',
 '| ö        | ö       |         1 |',
 '| ü        | ü       |         1 |']

In [27]:
!!cat sources/text.txt | segments profile > sandbox/orthography-profile.tsv

[]

In [28]:
!!more sandbox/orthography-profile.tsv

['Grapheme\tmapping\tfrequency', 'a\ta\t7', 'ä\tä\t1', 'ö\tö\t1', 'ü\tü\t1']

In [29]:
!!cat sources/text.txt | segments tokenize

['a ä a a ö a a ü a a']

In [30]:
!!cat sources/text.txt | segments --mapping=mapping --profile=sandbox/orthography-profile.tsv tokenize

 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 '  \'Unspecified column "{0}" in table {1}\'.format(k, self.local_name))',
 'a ä a a ö a a ü a a']

In [31]:
!!more sources/text.txt

['aäaaöaaüaa']