# Basics of grapheme segmentation

Steven Moran <bambooforest@gmail.com>

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook/recipes/Basics](https://github.com/unicode-cookbook/recipes/Basics). 

This use case illustrates how to segment text into graphemes. We also transliterate graphemes using an orthography profile. More details about orthography profiles are available in the [Unicode Cookbook for Linguists](https://github.com/unicode-cookbook/cookbook).

This recipe uses Python 3.5.

GitHub renders Jupyter notebooks nicely, so you can copy and paste code into your interpreter or scripts. If you `git clone` the `recipes` repository and have Jupyter installed on your machine, you can also execute this file in your own browser: run `jupyter notebook` in this directory.

## Overview

Let's use the Python [segments](https://pypi.python.org/pypi/segments/) package to tokenize characters, graphemes and IPA. Installation instructions are here: [https://github.com/unicode-cookbook/recipes](https://github.com/unicode-cookbook/recipes). We illustrate both API access and the command line program.

## API access

In [1]:
from segments.tokenizer import Tokenizer

The `characters` function will segment a string at Unicode code points.

In [12]:
t = Tokenizer()
result = t.characters("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

c ̂ h a ́ ɾ a ̃ ̌ c t ʼ ɛ ↗ ʐ ː | # k ͡ p
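
To see why the precomposed letter "ĉ" comes out as two units, it can help to look at the code points directly. The following is a minimal sketch using only the standard `unicodedata` module; it is not the `segments` implementation, and the helper name `split_code_points` is hypothetical. It assumes, based on the output above, that the text is decomposed (NFD) before splitting and that spaces are shown as `#`.

```python
import unicodedata


def split_code_points(text):
    """Split text into NFD code points; show each space as '#'."""
    decomposed = unicodedata.normalize("NFD", text)
    return " ".join("#" if ch == " " else ch for ch in decomposed)


# "ĉhá" written with precomposed characters (escapes keep the source unambiguous)
word = "\u0109h\u00e1"

# List every code point after decomposition, with its Unicode name
for ch in unicodedata.normalize("NFD", word):
    print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch)))

print(split_code_points(word))
```

Each combining mark is listed as its own code point, which is exactly what makes code-point segmentation too fine-grained for most linguistic work.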


The `grapheme_clusters` function will segment text at the [Unicode Extended Grapheme Cluster](http://www.unicode.org/reports/tr18/tr18-19.html#Default_Grapheme_Clusters) boundaries. 

In [10]:
result = t.grapheme_clusters("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p
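
As a rough illustration of what a grapheme cluster is, the sketch below attaches every combining mark (Unicode general category `M*`) to the base character before it. This is a deliberate simplification: the full Unicode rules cover more cases (Hangul syllables, emoji sequences, and so on), and the function name `simple_clusters` is hypothetical, not part of `segments`.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# a + combining tilde + combining caron, then k + tie bar, then p
print(simple_clusters("a\u0303\u030ck\u0361p"))
```

Note that the tie bar stays with "k" while "p" forms its own cluster, which matches the `k͡ p` split in the output above.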


The `grapheme_clusters` function is also the default segmentation algorithm for the `segments.Tokenizer`. It is useful when you encounter a text that you want to tokenize to identify orthographic or transcription elements.

In [15]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p")
print(result)

ĉ h á ɾ ã̌ c t ʼ ɛ ↗ ʐ ː | # k͡ p


The `ipa` parameter forces grapheme segmentation for [IPA strings](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) (a formal definition is given in [Chapter 5](https://github.com/unicode-cookbook/cookbook)).

In [11]:
result = t("ĉháɾã̌ctʼɛ↗ʐː| k͡p", ipa=True)
print(result)

ĉ h á ɾ ã̌ c tʼ ɛ ↗ ʐː | # k͡p
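
The difference from the previous output is that modifier letters such as "ʼ" and "ː" now join the preceding segment, and the tie bar binds the letters on both sides of it. The sketch below approximates this behaviour under two assumptions that are mine, not taken from the `segments` source: modifier letters are identified by the Unicode category `Lm`, and only U+0361 and U+035C count as tie bars. The name `ipa_like_clusters` is hypothetical.

```python
import unicodedata

# Assumption: these two combining characters are the only tie bars handled
TIE_BARS = {"\u0361", "\u035c"}


def ipa_like_clusters(text):
    """Approximate IPA segmentation: marks and modifier letters join leftwards."""
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        joins_previous = category.startswith("M") or category == "Lm"
        after_tie = bool(clusters) and clusters[-1][-1] in TIE_BARS
        if clusters and (joins_previous or after_tie):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# t + modifier apostrophe, ʐ + length mark, k + tie bar + p
print(ipa_like_clusters("t\u02bc\u0290\u02d0k\u0361p"))
```

A real IPA tokenizer has to handle more cases, for example modifier letters that attach to the following segment, so treat this only as a picture of the idea.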


We can also load an orthography profile and tokenize input strings with it. In the `data` directory we've placed an example orthography profile. Let's have a look at it using `more` on the command line.

In [17]:
!more data/orthography-profile.tsv

Grapheme        IPA     XSAMPA  COMMENT
a       a       a
aa      aː      a:
b       b       b
c       c       c
ch      tʃ      tS
-       NULL    NULL    "comment with   tab"
on      õ       o~
n       n       n
ih      í       i_H
inh     ĩ́       i~_H

An orthography profile is a tab-delimited UTF-8 text file. The first column must be labelled `Grapheme`. Each row in the `Grapheme` column specifies a grapheme that may be found in the orthography of the input text. In this example, we provide additional columns `IPA` and `XSAMPA`, which are mappings from our graphemes to their [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) and [XSAMPA](https://en.wikipedia.org/wiki/X-SAMPA) transliterations. The final column `COMMENT` is for comments; if a comment contains a tab, enclose the whole string in double quotes, as in the `-` row above.
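
The file format described above can be read with Python's standard `csv` module, which also takes care of the quoted tab. This is a minimal sketch with an inline three-row profile; the function name `read_profile` is hypothetical and this is not how `segments` loads profiles internally.

```python
import csv
import io

# A small inline profile; the last row has a quoted comment containing a tab
PROFILE_TEXT = (
    "Grapheme\tIPA\tXSAMPA\tCOMMENT\n"
    "aa\ta\u02d0\ta:\t\n"
    "ch\tt\u0283\ttS\t\n"
    "-\tNULL\tNULL\t\"comment with\ttab\"\n"
)


def read_profile(handle):
    """Read a tab-delimited profile into a dict keyed by grapheme."""
    reader = csv.DictReader(handle, delimiter="\t")
    if not reader.fieldnames or reader.fieldnames[0] != "Grapheme":
        raise ValueError("first column must be labelled 'Grapheme'")
    return {row["Grapheme"]: row for row in reader}


profile = read_profile(io.StringIO(PROFILE_TEXT))
print(profile["ch"]["IPA"])
print(repr(profile["-"]["COMMENT"]))
```

To read a file on disk instead, pass an open file handle, for example `open(path, encoding="utf-8", newline="")`.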

Let's load the orthography profile with our tokenizer.

In [23]:
from segments.tokenizer import Profile

t = Tokenizer('data/orthography-profile.tsv')

Now let's segment the graphemes in some input strings with our orthography profile. The output is segmented according to the graphemes defined in our orthography profile, e.g. we specified that the sequence &lt;a a&gt; should be a single unit &lt;aa&gt;, and likewise the sequences &lt;c h&gt;, &lt;o n&gt; and &lt;i h&gt;.

In [20]:
t('aabchonn-ih')

'aa b ch on n - ih'
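
One common way to get this behaviour is greedy longest-match segmentation: at each position, try the longest grapheme in the profile first. The sketch below illustrates that strategy; it is an assumption for illustration, since this recipe does not state which matching algorithm `segments` uses, and `greedy_segment` is a hypothetical name.

```python
def greedy_segment(text, graphemes, missing="\ufffd"):
    """Split text by always taking the longest grapheme that matches."""
    longest = max(len(g) for g in graphemes)
    result, i = [], 0
    while i < len(text):
        # Try candidates from longest to shortest at position i
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in graphemes:
                result.append(candidate)
                i += size
                break
        else:
            # Nothing in the profile matches here: mark and move on
            result.append(missing)
            i += 1
    return result


# The Grapheme column of the example profile
graphemes = {"a", "aa", "b", "c", "ch", "-", "on", "n", "ih", "inh"}
print(" ".join(greedy_segment("aabchonn-ih", graphemes)))  # aa b ch on n - ih
```

Longest-first matching is what makes "aa" win over two separate "a" graphemes.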

This example shows how we can tokenize input text according to our orthographic specification. We can also segment graphemes and transliterate them into other formats, which is useful when you have sources with different orthographies but want to compare them using a single representation like IPA.

In [24]:
t.transform("aabchonn-ih", "IPA")

'aː b tʃ õ n í'

In [25]:
t.transform("aabchonn-ih", "XSAMPA")

'a: b tS o~ n i_H'
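
Conceptually, transliteration is a lookup of each grapheme in the chosen column. In the outputs above the hyphen has no counterpart, which suggests that graphemes mapped to `NULL` are dropped; the sketch below makes that assumption explicit. The names `IPA` and `transliterate` are hypothetical, and the mapping is typed in by hand from the example profile.

```python
# Grapheme -> IPA, copied from the example profile (precomposed õ and í)
IPA = {
    "a": "a", "aa": "a\u02d0", "b": "b", "c": "c", "ch": "t\u0283",
    "-": "NULL", "on": "\u00f5", "n": "n", "ih": "\u00ed",
}


def transliterate(tokens, mapping, null="NULL"):
    """Map each grapheme through the table, dropping NULL entries."""
    mapped = (mapping[token] for token in tokens)
    return " ".join(value for value in mapped if value != null)


tokens = ["aa", "b", "ch", "on", "n", "-", "ih"]
print(transliterate(tokens, IPA))
```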

It is also useful to know which characters in your input string are not in your orthography profile. Use the function `find_missing_characters`.

In [26]:
t.find_missing_characters("aa b ch on n - ih x y z")

'aa b ch on n - ih � � �'

The default replacement is the [Unicode replacement character \ufffd](http://www.fileformat.info/info/unicode/char/fffd/index.htm), but you can change this by passing a replacement function as `errors_replace` when you load the orthography profile with the tokenizer.

In [30]:
t = Tokenizer('data/orthography-profile.tsv', errors_replace=lambda c: '?')
t.find_missing_characters("aa b ch on n - ih x y z")

'aa b ch on n - ih ? ? ?'

In [33]:
t = Tokenizer('data/orthography-profile.tsv', errors_replace=lambda c: '<{0}>'.format(c))
t.find_missing_characters("aa b ch on n - ih x y z")

'aa b ch on n - ih <x> <y> <z>'
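
The replacement callback pattern is easy to reproduce in plain Python. The sketch below checks whitespace-separated tokens against a set of known graphemes and passes each unknown token to a callable; `mark_missing` is a hypothetical helper written for illustration, not the library's code.

```python
def mark_missing(segmented, graphemes, replace=lambda token: "\ufffd"):
    """Replace every token that is not a known grapheme via a callback."""
    return " ".join(
        token if token in graphemes else replace(token)
        for token in segmented.split()
    )


graphemes = {"a", "aa", "b", "c", "ch", "-", "on", "n", "ih", "inh"}
text = "aa b ch on n - ih x y z"

print(mark_missing(text, graphemes, replace=lambda c: "?"))
print(mark_missing(text, graphemes, replace=lambda c: "<{0}>".format(c)))
```

Passing a function rather than a fixed character lets the replacement include the offending token itself, which is handy when you want to search the output for problems.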

Perhaps you want to create an initial orthography profile that also contains those graphemes x, y, z?

In [51]:
profile = Profile.from_text("aa b ch on n - ih x y z")
print(profile)

Grapheme	frequency	mapping
 	9	 
a	2	a
h	2	h
n	2	n
b	1	b
c	1	c
o	1	o
-	1	-
i	1	i
x	1	x
y	1	y
z	1	z
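
An initial profile of this kind is essentially a frequency table in which each grapheme maps to itself. The sketch below builds one with `collections.Counter`. It is a simplification: it counts single code points rather than grapheme clusters, so it only matches the table above for text without combining marks, and the tie-breaking order among equally frequent graphemes is my own choice. The name `draft_profile` is hypothetical.

```python
from collections import Counter


def draft_profile(text):
    """Build a tab-delimited draft profile: grapheme, frequency, identity mapping."""
    counts = Counter(text)
    lines = ["Grapheme\tfrequency\tmapping"]
    # Most frequent first; ties are ordered by code point for reproducibility
    for grapheme, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append("{0}\t{1}\t{0}".format(grapheme, freq))
    return "\n".join(lines)


print(draft_profile("aa b ch on n - ih x y z"))
```

The `mapping` column starts out as a copy of the grapheme, and you would then edit it by hand to add the transliterations you need.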


## Command line access

Make sure to `pip install segments` to install the command line tool.

Get some help with `segments`.

In [3]:
!segments -h

usage: segments [-h] [--verbosity VERBOSITY] [--encoding ENCODING]
                [--profile PROFILE] [--mapping MAPPING]
                command ...

Main command line interface of the segments package.

positional arguments:
  command               tokenize | profile
  args

optional arguments:
  -h, --help            show this help message and exit
  --verbosity VERBOSITY
                        increase output verbosity
  --encoding ENCODING   input encoding
  --profile PROFILE     path to an orthography profile
  --mapping MAPPING     column name in ortho profile to map graphemes

Use 'segments help <cmd>' to get help about individual commands.


In [5]:
!more sources/text.txt

aäaaöaaüaa

In [7]:
!cat sources/text.txt | segments profile

Grapheme	frequency	mapping
a	7	a
ä	1	ä
ö	1	ö
ü	1	ü


In [27]:
!cat sources/text.txt | segments profile > sandbox/orthography-profile.tsv

In [28]:
!more sandbox/orthography-profile.tsv

Grapheme        frequency       mapping
a       7       a
ä       1       ä
ö       1       ö
ü       1       ü

In [29]:
!cat sources/text.txt | segments tokenize

a ä a a ö a a ü a a


In [33]:
!cat sources/text.txt | segments --mapping=mapping --profile=sandbox/orthography-profile.tsv tokenize

a ä a a ö a a ü a a
