# Basics of grapheme segmentation

Steven Moran <bambooforest@gmail.com>

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook/recipes/Basics](https://github.com/unicode-cookbook/recipes/Basics). 

This use case illustrates how to segment text into graphemes. We also transliterate graphemes using an orthography profile. Details about orthography profiles, and more, are available in the [Unicode Cookbook for Linguists](https://github.com/unicode-cookbook/cookbook).

This recipe uses Python 3.5. 

GitHub renders Jupyter notebooks nicely, so you can copy and paste code into your interpreter or scripts. If you `git clone` the `recipes` repository and have Jupyter installed on your machine, this file is also executable in the browser: run `jupyter notebook` in this directory.

## Overview

Let's use the Python [segments](https://pypi.python.org/pypi/segments/) package to tokenize characters, graphemes and IPA. Installation instructions are here: [https://github.com/unicode-cookbook/recipes](https://github.com/unicode-cookbook/recipes). We illustrate both API access and the command line program.

## API access

In [1]:
from segments.tokenizer import Tokenizer

The `characters` function will segment a string at Unicode code points.

In [2]:
t = Tokenizer()
result = t.characters("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

c ̂ h a ́ ɾ a ̃ ̌ c t ʼ ɛ ↗ ʐ ː | # k ͡ p
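
The behaviour above can be approximated with the standard library alone. The following is a minimal sketch, not the `segments` implementation: it assumes that the input is first decomposed with NFD normalization and that spaces are displayed as `#`, as the output above suggests. The function name `split_code_points` is hypothetical.

```python
import unicodedata


def split_code_points(text, space_marker="#"):
    """Separate every code point of the NFD-decomposed text with a space."""
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: a literal space in the input is shown as a marker symbol.
    return " ".join(space_marker if ch == " " else ch for ch in decomposed)


# U+00E1 (a with acute) decomposes into "a" plus the combining acute U+0301.
print(split_code_points("\u00e1 k\u0361p"))
```

Because of the decomposition step, a precomposed letter with a diacritic comes out as two separate code points, which is why the accents appear detached from their base letters.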


The `grapheme_clusters` function will segment text at the [Unicode Extended Grapheme Cluster](http://www.unicode.org/reports/tr18/tr18-19.html#Default_Grapheme_Clusters) boundaries. 

In [3]:
result = t.grapheme_clusters("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p
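
To see why the diacritics now stay with their base letters, here is a simplified sketch that attaches every combining mark (Unicode general category `M*`) to the preceding character. This is only an approximation of the extended grapheme cluster rules, which are specified in full in Unicode Standard Annex #29, and it is not how `segments` is implemented. The name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        # Categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The tie bar U+0361 is a combining mark, so it attaches to "k" but not to "p".
print(simple_clusters("k\u0361p"))
```

Note that the tie bar only binds to the character before it, which matches the `k͡ p` split in the output above.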


The `grapheme_clusters` function is the default segmentation algorithm for the `segments.Tokenizer`. It is useful when you encounter a text that you want to tokenize to identify orthographic or transcription elements.

In [4]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p


The `ipa` parameter forces grapheme segmentation for [IPA strings](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet).

In [5]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p", ipa=True)
print(result)

ĉ h á ɾ ã̌ c tʼ ɛ ↗ ʐː | # k͡p
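
The difference from plain grapheme clusters can be sketched with two extra rules. This is an illustration under stated assumptions, not the `segments` algorithm: modifier letters (general category `Lm`, such as `ʼ` and `ː`) are attached to the preceding cluster, and a tie bar (U+0361 or U+035C) also pulls in the character that follows it. The name `ipa_clusters` is hypothetical, and real IPA has cases (for example pre-aspiration) that this sketch does not handle.

```python
import unicodedata

TIE_BARS = {"\u0361", "\u035c"}  # combining double inverted breve / breve below


def ipa_clusters(text):
    """Cluster combining marks, modifier letters and tie-bar sequences."""
    result = []
    join_next = False  # True right after a tie bar
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        attach = join_next or category.startswith("M") or category == "Lm"
        if result and attach:
            result[-1] += ch
        else:
            result.append(ch)
        join_next = ch in TIE_BARS
    return result


print(ipa_clusters("t\u02bc\u0290\u02d0k\u0361p"))
```

With these rules the ejective `tʼ`, the long `ʐː` and the doubly articulated `k͡p` each come out as one segment.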


We can also load an orthography profile and tokenize an input string with it. In the `data` directory we've placed an example orthography profile. Let's have a look at it using `more` on the command line.

In [6]:
!!more data/orthography-profile.tsv

['Grapheme\tIPA\tXSAMPA\tCOMMENT',
 'a\ta\ta',
 'aa\taː\ta:',
 'b\tb\tb',
 'c\tc\tc',
 'ch\ttʃ\ttS',
 '-\tNULL\tNULL\t"comment with\ttab"',
 'on\tõ\to~',
 'n\tn\tn',
 'ih\tí\ti_H',
 'inh\tĩ́\ti~_H']

Here we pipe this to `csvlook` for nicer display in Jupyter Notebook.

In [7]:
!!more data/orthography-profile.tsv | csvlook -t

['| Grapheme | IPA | XSAMPA | COMMENT          |',
 '| -------- | --- | ------ | ---------------- |',
 '| a        | a   | a      |                  |',
 '| aa       | aː  | a:     |                  |',
 '| b        | b   | b      |                  |',
 '| c        | c   | c      |                  |',
 '| ch       | tʃ  | tS     |                  |',
 '| -        |     |        | comment with\ttab |',
 '| on       | õ  | o~     |                  |',
 '| n        | n   | n      |                  |',
 '| ih       | í  | i_H    |                  |',
 '| inh      | ĩ́ | i~_H   |                  |']

Let's load the orthography profile with our tokenizer.

In [8]:
from segments.tokenizer import Profile

t = Tokenizer('data/orthography-profile.tsv')

Now let's segment the graphemes in some input strings with our orthography profile. The output is segmented according to the graphemes defined in the profile, e.g. we specified that the sequence <a a> should be a single unit <aa>, and likewise for the sequences <c h>, <o n> and <i h>.

In [9]:
t('aabchonn-ih')

'aa b ch on n - ih'
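
One way to picture what happens here is greedy longest-match segmentation: at each position, take the longest grapheme from the profile that matches. The sketch below assumes that strategy and is not the `segments` code. The names `GRAPHEMES` and `tokenize_with_profile` are hypothetical, and the grapheme set is copied from the `Grapheme` column shown earlier.

```python
GRAPHEMES = {"a", "aa", "b", "c", "ch", "-", "on", "n", "ih", "inh"}


def tokenize_with_profile(text, graphemes, missing="\ufffd"):
    """Segment text left to right, preferring the longest known grapheme."""
    longest = max(len(g) for g in graphemes)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in graphemes:
                tokens.append(chunk)
                i += size
                break
        else:
            # No grapheme matches here: mark the character as missing.
            tokens.append(missing)
            i += 1
    return " ".join(tokens)


print(tokenize_with_profile("aabchonn-ih", GRAPHEMES))  # aa b ch on n - ih
```

Longest-match is what makes `aa` win over two separate `a` segments and `ch` win over `c` followed by an unknown `h`.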

This example shows how we can tokenize input text according to our orthographic specification. We can also segment graphemes and transliterate them into other formats, which is useful when you have sources with different orthographies, but you want to be able to compare them using a single representation like IPA.

In [10]:
t.transform("aabchonn-ih", "IPA")

'aː b tʃ õ n í'

In [11]:
t.transform("aabchonn-ih", "XSAMPA")

'a: b tS o~ n i_H'
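
Conceptually, transliteration is a column lookup on the segmented string. The sketch below reads a small tab-separated profile with the standard `csv` module and maps each grapheme to the chosen column. It assumes, based on the outputs above, that graphemes mapped to `NULL` are dropped. The names `PROFILE`, `load_mapping` and `transliterate` are hypothetical, and the profile is an abridged copy of the one shown earlier.

```python
import csv
import io

PROFILE = (
    "Grapheme\tIPA\tXSAMPA\n"
    "aa\ta\u02d0\ta:\n"
    "b\tb\tb\n"
    "ch\tt\u0283\ttS\n"
    "-\tNULL\tNULL\n"
    "on\to\u0303\to~\n"
    "n\tn\tn\n"
    "ih\ti\u0301\ti_H\n"
)


def load_mapping(tsv_text, column):
    """Return a dict from grapheme to its value in the given column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Grapheme"]: row[column] for row in reader}


def transliterate(segmented, mapping):
    """Map space-separated graphemes; assumption: NULL means 'delete'."""
    converted = (mapping[g] for g in segmented.split())
    return " ".join(c for c in converted if c != "NULL")


print(transliterate("aa b ch on n - ih", load_mapping(PROFILE, "XSAMPA")))
# a: b tS o~ n i_H
```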

It is also useful to know which characters in your input string are not in your orthography profile. Use the function `find_missing_characters`.

In [12]:
t.find_missing_characters("aa b ch on n - ih x y z")

'aa b ch on n - ih � � �'

The default replacement is the [Unicode replacement character \ufffd](http://www.fileformat.info/info/unicode/char/fffd/index.htm). You can change this by passing an `errors_replace` function when you load the orthography profile with the tokenizer. The function receives the missing character and returns its replacement.

In [13]:
t = Tokenizer('data/orthography-profile.tsv', errors_replace=lambda c: '?')
t.find_missing_characters("aa b ch on n - ih x y z")

'aa b ch on n - ih ? ? ?'

In [14]:
t = Tokenizer('data/orthography-profile.tsv', errors_replace=lambda c: '<{0}>'.format(c))
t.find_missing_characters("aa b ch on n - ih x y z")

'aa b ch on n - ih <x> <y> <z>'
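
The callback pattern used by `errors_replace` can be illustrated without the library. This is a hedged sketch with hypothetical names (`mark_missing`, `known`): each space-separated token that is not a known grapheme is passed to a function, and whatever that function returns is written to the output.

```python
def mark_missing(segmented, known, replace=lambda c: "\ufffd"):
    """Replace every token that is not a known grapheme via a callback."""
    return " ".join(
        token if token in known else replace(token)
        for token in segmented.split()
    )


known = {"aa", "b", "ch", "on", "n", "-", "ih"}
text = "aa b ch on n - ih x y z"

print(mark_missing(text, known, lambda c: "?"))
# aa b ch on n - ih ? ? ?
print(mark_missing(text, known, lambda c: "<{0}>".format(c)))
# aa b ch on n - ih <x> <y> <z>
```

Passing a function rather than a fixed string is what allows the replacement to include the offending character itself.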

Perhaps you want to create an initial orthography profile that also contains those graphemes x, y, z? Note that the space character and its frequency are also captured in this initial profile.

In [15]:
profile = Profile.from_text("aa b ch on n - ih x y z")
print(profile)

Grapheme	frequency	mapping
 	9	 
a	2	a
h	2	h
n	2	n
b	1	b
c	1	c
o	1	o
-	1	-
i	1	i
x	1	x
y	1	y
z	1	z
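
An initial profile of this kind is essentially a frequency count over grapheme clusters. The sketch below reproduces the table layout with `collections.Counter`. It uses the simplified "base character plus combining marks" clustering rather than the full Unicode rules, and relies on dictionaries preserving insertion order (Python 3.7+) so that ties keep their order of first appearance. The name `initial_profile` is hypothetical, and this is not the `Profile.from_text` implementation.

```python
import unicodedata
from collections import Counter


def initial_profile(text):
    """Build a tab-separated grapheme/frequency/mapping table from text."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach combining marks to their base
        else:
            clusters.append(ch)
    lines = ["Grapheme\tfrequency\tmapping"]
    # most_common() sorts by descending frequency; ties keep first-seen order.
    for grapheme, frequency in Counter(clusters).most_common():
        lines.append("{0}\t{1}\t{0}".format(grapheme, frequency))
    return "\n".join(lines)


print(initial_profile("aa b ch on n - ih x y z"))
```

The mapping column starts out as a copy of the grapheme column, which gives you a row to edit for every grapheme found in the text.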


## Command line access

Make sure to `pip install segments` to install the command line tool.

Get some help with `segments -h`.

In [16]:
!!segments -h

['usage: segments [-h] [--verbosity VERBOSITY] [--encoding ENCODING]',
 '                [--profile PROFILE] [--mapping MAPPING]',
 '                command ...',
 '',
 'Main command line interface of the segments package.',
 '',
 'positional arguments:',
 '  command               tokenize | profile',
 '  args',
 '',
 'optional arguments:',
 '  -h, --help            show this help message and exit',
 '  --verbosity VERBOSITY',
 '                        increase output verbosity',
 '  --encoding ENCODING   input encoding',
 '  --profile PROFILE     path to an orthography profile',
 '  --mapping MAPPING     column name in ortho profile to map graphemes',
 '',
 "Use 'segments help <cmd>' to get help about individual commands."]

We have created some test data in the `sources/german.txt` file using the word 'Schächtelchen', which is the diminutive form of 'Schachtel', meaning 'box, packet, or carton' in English.

In [17]:
!!more sources/german.txt

['Schächtelchen']

We can create an initial orthography profile of the German text by passing it to the `segments profile` command. The initial profile tokenizes the text on Unicode grapheme clusters, lists the frequency of each grapheme, and provides an initial mapping column by default.

In [18]:
!!cat sources/german.txt | segments profile | csvlook -t

['| Grapheme | frequency | mapping |',
 '| -------- | --------- | ------- |',
 '| c        |         3 | c       |',
 '| h        |         3 | h       |',
 '| e        |         2 | e       |',
 '| S        |         1 | S       |',
 '| ä       |         1 | ä      |',
 '| t        |         1 | t       |',
 '| l        |         1 | l       |',
 '| n        |         1 | n       |']

Next, we use what we know about German orthography and which characters combine to form German graphemes. We can use the information from our initial orthography profile to hand-curate a more precise German orthography profile that takes into account capitalization (German orthography obligatorily capitalizes nouns) and grapheme clusters, such as $<$sch$>$ and $<$ch$>$. We can use the initial orthography profile above as a starting point (note that in large texts the frequency column may signal errors in the input, such as typos, if they occur with a very low frequency). The initial orthography profile can be edited with a text editor or spreadsheet program. As per the orthography profile specifications (see [Chapter 7](https://github.com/unicode-cookbook/cookbook/blob/master/unicode-cookbook.pdf)), we can adjust rows in the `Grapheme` column and then add additional columns for transliteration or for comments.
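
As a small aid for the frequency check mentioned above, the following sketch lists the graphemes of an initial profile whose frequency is at or below a threshold. It assumes the three-column layout (`Grapheme`, `frequency`, `mapping`) shown in this recipe. The name `rare_graphemes` and the threshold parameter are hypothetical, and in a one-word sample like ours a low count is of course not an error signal.

```python
import csv
import io


def rare_graphemes(profile_tsv, max_frequency=1):
    """Return graphemes whose frequency does not exceed max_frequency."""
    reader = csv.DictReader(io.StringIO(profile_tsv), delimiter="\t")
    return [
        row["Grapheme"]
        for row in reader
        if int(row["frequency"]) <= max_frequency
    ]


sample = (
    "Grapheme\tfrequency\tmapping\n"
    "c\t3\tc\n"
    "h\t3\th\n"
    "e\t2\te\n"
    "S\t1\tS\n"
)
print(rare_graphemes(sample))  # ['S']
```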

In [19]:
!!more data/german-orthography-profile.tsv | csvlook -t

['| Grapheme | IPA | XSAMPA | COMMENT                      |',
 '| -------- | --- | ------ | ---------------------------- |',
 '| Sch      | ʃ   | S      | German nouns are capitalized |',
 '| ä        | ɛː  | E:     |                              |',
 '| ch       | ç   | C      |                              |',
 '| t        | t   | t      |                              |',
 '| e        | e   | e      |                              |',
 '| l        | l   | l      |                              |',
 '| n        | n   | n      |                              |']

Using the `segments` command line tool and passing it our orthography profile, we can now segment our German text example into graphemes.

In [20]:
!!cat sources/german.txt | segments --profile=data/german-orthography-profile.tsv tokenize

['Sch ä ch t e l ch e n']

And by providing `segments` a column for transliteration, we can convert the text into IPA or XSAMPA.

In [21]:
!!cat sources/german.txt | segments --mapping=IPA --profile=data/german-orthography-profile.tsv tokenize

['ʃ ɛː ç t e l ç e n']

In [22]:
!!cat sources/german.txt | segments --mapping=XSAMPA --profile=data/german-orthography-profile.tsv tokenize

['S E: C t e l C e n']

## Additional example

Next we provide another example of working with orthography profiles.

In [23]:
!!more sources/text.txt

['aäaaöaaüaa']

In [24]:
!!cat sources/text.txt | segments profile | csvlook -t

['| Grapheme | frequency | mapping |',
 '| -------- | --------- | ------- |',
 '| a        |         7 | a       |',
 '| ä       |         1 | ä      |',
 '| ö       |         1 | ö      |',
 '| ü       |         1 | ü      |']

In [25]:
!!cat sources/text.txt | segments profile > sandbox/orthography-profile.tsv

[]

In [26]:
!!more sandbox/orthography-profile.tsv

['Grapheme\tfrequency\tmapping',
 'a\t7\ta',
 'ä\t1\tä',
 'ö\t1\tö',
 'ü\t1\tü']

In [27]:
!!cat sources/text.txt | segments tokenize

['a ä a a ö a a ü a a']

In [28]:
!!cat sources/text.txt | segments --mapping=mapping --profile=sandbox/orthography-profile.tsv tokenize

['a ä a a ö a a ü a a']

In [29]:
!!more sources/text.txt

['aäaaöaaüaa']