# Tokenize Dogon comparative wordlist

Steven Moran <bambooforest@gmail.com>

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook/recipes/Dogon](https://github.com/unicode-cookbook/recipes/Dogon). 

This use case illustrates how to tokenize a wordlist using an orthography profile. More details about orthography profiles are available in the [Unicode Cookbook for Linguists](https://github.com/unicode-cookbook/cookbook).

This recipe uses Python 3.5.

## Overview

The [Dogon and Bangime linguistics](http://dogonlanguages.org/) project collects and disseminates linguistic, cultural and geographic data from fieldwork undertaken on the Dogon languages, and the language isolate Bangime, spoken in central Mali. 

## Data

The data includes an extensive comparative [Dogon lexicon](https://github.com/clld/dogonlanguages-data) organized by the project members in an Excel spreadsheet. The columns include more than 20 lects (languages or dialects, depending on the pair) spoken in Dogon country. There are English and French glosses for each row, e.g. 'cow that has calved at least once', 'vache qui a mis bas au moins une fois'. A row may also contain information in columns generated by the project, e.g. semantic domain (animal), subdomain (camel), wordlist type, whether there is a representative media file (especially for flora and fauna), and so on.

The Dogon comparative lexicon contains over 8000 rows. Sparsity is an issue for several languages, which lack forms for many rows.

## Data processing

The comparative wordlist was compiled by several fieldworkers, each with their own transcription system. This becomes clear below when we create an initial orthography profile from the wordlist: the profile highlights idiosyncrasies across transcription practices, e.g. both `<aa>` and `<aː>` occur.

For this recipe, we will use a smaller curated version of the comparative Dogon wordlist available in the `sources` directory of this recipe. The wordlist is in CSV format with columns for a row ID, concept, doculect (language variety) and counterpart (word in the language).

We will use the Python library [Pandas](http://pandas.pydata.org/) to read and manipulate the wordlist. The orthography profile for the Dogon lexical data is located in the `data` directory.

To get started, import the [segments](https://pypi.python.org/pypi/segments/) and [Pandas](http://pandas.pydata.org/) modules:

In [1]:
from segments.tokenizer import Tokenizer
import pandas as pd

Load the wordlist, telling Pandas which column to use as the row index, and have a look at it:

In [2]:
df = pd.read_csv("sources/dogon-wordlist.tsv", index_col="ID")
print(df.head())

                 CONCEPT         DOCULECT COUNTERPART
ID                                                   
1   -teen ('11' to '19')          Ben_Tey        sâ:
2   -teen ('11' to '19')        Dogul_Dom      sìgà
3   -teen ('11' to '19')           Gourou      sáɣà
4   -teen ('11' to '19')  Jamsay_Douentza      sáɣà
5   -teen ('11' to '19')          Najamba      sìgà
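
Because the full lexicon is sparse, it can be useful to check how well each doculect is covered before tokenizing. The following is a minimal sketch in plain Python, not part of the original recipe. The function name `coverage_by_doculect` and the toy rows (including the placeholder concept and the placeholder form `xxx`) are made up for illustration, and the sketch assumes rows arrive as (concept, doculect, counterpart) triples, for example from `df.itertuples(index=False)`.

```python
from collections import defaultdict


def coverage_by_doculect(rows):
    """Return, per doculect, the share of concepts with a non-empty form."""
    concepts = set()
    attested = defaultdict(set)
    for concept, doculect, counterpart in rows:
        concepts.add(concept)
        found = attested[doculect]  # registers the doculect even if the form is empty
        # Missing values read by Pandas are floats (NaN), so test for strings.
        if isinstance(counterpart, str) and counterpart.strip():
            found.add(concept)
    return {d: len(found) / len(concepts) for d, found in attested.items()}


# Toy rows: the first two forms come from the table above,
# the last two rows are placeholders invented for this sketch.
toy_rows = [
    ("-teen ('11' to '19')", "Ben_Tey", "sâ:"),
    ("-teen ('11' to '19')", "Najamba", "sìgà"),
    ("placeholder concept", "Ben_Tey", ""),
    ("placeholder concept", "Najamba", "xxx"),
]

for doculect, share in sorted(coverage_by_doculect(toy_rows).items()):
    print(doculect, share)
```

A doculect with a low share has few attested forms, so its tokenized output will cover only a small part of the concept list.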


The Dogon orthography profile is in the `data` directory. Let's have a look:

In [7]:
!more data/Heath2016-profile.tsv

Grapheme        IPA     notes
àà      àː
áá      áː
àá      ǎː
áà      âː
ààⁿ     ã̀ː
ááⁿ     ã́ː
àáⁿ     ã̌ː
áàⁿ     ã̂ː
ɔ̀ɔ̀      ɔ̀ː
ɔ́ɔ́      ɔ́ː
ɔ̀ɔ́      ɔ̌ː
ɔ́ɔ̀      ɔ̂ː
ɔ̀ɔ̀ⁿ     ɔ̃̀ː
ɔ́ɔ́ⁿ     ɔ̃́ː
ɔ̀ɔ́ⁿ     ɔ̃̌ː
ɔ́ɔ̀ⁿ     ɔ̃̂ː
ɛ̀ɛ̀      ɛ̀ː
ɛ́ɛ́      ɛ́ː
ɛ̀ɛ́      ɛ̌ː
ɛ́ɛ̀      ɛ̂ː
ɛ̀ɛ̀ⁿ     ɛ̃̀ː
ɛ́ɛ́ⁿ     ɛ̃́ː
[K[?1l>eath2016-profile.tsv[m[K

Create a tokenizer object from the orthography profile:

In [8]:
t = Tokenizer("data/Heath2016-profile.tsv")
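
To make clearer what a tokenizer does with a profile, here is a simplified sketch of longest-match segmentation in plain Python. It is not the implementation used in `segments`. The function `greedy_tokenize`, the toy profile, the invented test string `sààgà`, and the use of U+FFFD for unmatched characters are all assumptions made for illustration.

```python
import unicodedata


def nfd(text):
    """Decompose characters so base letters and diacritics are separate."""
    return unicodedata.normalize("NFD", text)


def greedy_tokenize(word, mapping):
    """Segment `word` by always taking the longest grapheme found in `mapping`."""
    table = {nfd(grapheme): value for grapheme, value in mapping.items()}
    longest = max(len(grapheme) for grapheme in table)
    word = nfd(word)
    output = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                output.append(table[chunk])
                i += size
                break
        else:
            # Nothing in the toy profile matches this character.
            output.append("\ufffd")
            i += 1
    return " ".join(output)


# Toy profile (hypothetical): grapheme -> IPA mapping.
toy_profile = {
    "s": "s",
    "g": "\u0261",      # ASCII g mapped to IPA ɡ (U+0261)
    "à": "à",
    "ì": "ì",
    "àà": "à\u02d0",    # doubled vowel mapped to a vowel plus length mark ː
}

print(greedy_tokenize("sìgà", toy_profile))
print(greedy_tokenize("sààgà", toy_profile))  # invented string, for illustration only
```

Because longer graphemes are tried first, the doubled vowel is mapped as one long vowel rather than as two short ones.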

We will add to our wordlist an additional column called `TOKENS` that will contain the output of segmenting the `COUNTERPART` column with the orthography profile, using the profile's `IPA` column as the mapping.

In [9]:
tokenizer = lambda x: t.transform(x, column="IPA")

In [10]:
df['TOKENS'] = df['COUNTERPART'].apply(tokenizer)

In [11]:
print(df.head())

                 CONCEPT         DOCULECT COUNTERPART     TOKENS
ID                                                              
1   -teen ('11' to '19')          Ben_Tey        sâ:      s âː
2   -teen ('11' to '19')        Dogul_Dom      sìgà  s ì ɡ à
3   -teen ('11' to '19')           Gourou      sáɣà  s á ɣ à
4   -teen ('11' to '19')  Jamsay_Douentza      sáɣà  s á ɣ à
5   -teen ('11' to '19')          Najamba      sìgà  s ì ɡ à
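
Before relying on the segmented column, it is worth checking whether any forms contain characters that the profile does not cover. The sketch below assumes that such characters show up in the output as the Unicode replacement character U+FFFD, which appears to be the default in `segments` but should be checked against the installed version. The function name `rows_with_unknowns` and the toy data (whose third entry is invented) are hypothetical; with the dataframe above, the input could be `df['TOKENS'].items()`.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character


def rows_with_unknowns(tokenized):
    """Yield (row_id, tokens) pairs whose tokens contain U+FFFD."""
    for row_id, tokens in tokenized:
        # Skip missing values, which are not strings.
        if isinstance(tokens, str) and REPLACEMENT in tokens:
            yield row_id, tokens


# Toy data: rows 1 and 2 follow the table above, row 3 is invented
# to show what an uncovered character might look like.
toy_tokens = [
    (1, "s âː"),
    (2, "s ì ɡ à"),
    (3, "s \ufffd ɣ à"),
]

for row_id, tokens in rows_with_unknowns(toy_tokens):
    print(row_id, tokens)
```

Any rows reported this way point to graphemes that still need to be added to the orthography profile.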


Write the new segmented wordlist to the `sandbox` directory:

In [12]:
df.to_csv('sandbox/segmented-dogon-wordlist.tsv', sep="\t")

## Create an initial orthography profile from text input

I have already provided you with a well-tested and curated orthography profile for the Dogon data. But if you want to create an initial profile from scratch, this is how you could do it from a CSV file.

In [13]:
from segments.tokenizer import Profile

This next line is a bit cryptic. The `Profile.from_text` method in `segments` takes a text string as its parameter. Here we get the `COUNTERPART` column from the Pandas dataframe, which is itself a Pandas Series object. We convert it to a list of strings, join them into one long string, and pass that string to `Profile.from_text`.

In [14]:
profile = Profile.from_text(''.join(df['COUNTERPART'].tolist()))

In [15]:
print(profile)

Grapheme	frequency	mapping
n	2457	n
:	1935	:
à	1816	à
á	1778	á
m	1648	m
g	1608	g
ú	1552	ú
r	1529	r
k	1368	k
y	1345	y
í	1316	í
ɛ́	1272	ɛ́
l	1223	l
d	1199	d
ù	1193	ù
b	1171	b
ì	1080	ì
ⁿ	1033	ⁿ
ó	986	ó
ɛ̀	970	ɛ̀
 	914	 
w	914	w
é	884	é
ɔ́	881	ɔ́
ɔ̀	826	ɔ̀
s	820	s
ŋ	767	ŋ
t	750	t
è	731	è
j	696	j
ò	621	ò
p	500	p
ǎ	285	ǎ
-	255	-
ɲ	237	ɲ
c	162	c
â	155	â
ɛ̌	145	ɛ̌
î	143	î
ɔ̌	104	ɔ̌
=	92	=
ʔ	91	ʔ
ɛ̂	88	ɛ̂
z	87	z
ǐ	84	ǐ
ǹ	81	ǹ
ŋ̀	80	ŋ̀
û	78	û
ǔ	71	ǔ
ɔ̂	65	ɔ̂
ǒ	64	ǒ
ě	59	ě
ê	49	ê
a	47	a
ô	46	ô
m̀	44	m̀
ỳ	41	ỳ
ɣ	40	ɣ
ɡ	40	ɡ
h	38	h
e	36	e
ń	32	ń
ɛ	30	ɛ
v́	25	v́
ý	25	ý
ʒ	24	ʒ
ḿ	21	ḿ
o	20	o
f	20	f
ẁ	18	ẁ
ʷ	18	ʷ
i	16	i
ŋ́	15	ŋ́
ɔ	14	ɔ
v	14	v
≡	14	≡
→	11	→
ɥ	11	ɥ
ð	10	ð
a᷈	10	a᷈
ə̀	10	ə̀
ɔ᷈	9	ɔ᷈
r̃	7	r̃
Y	6	Y
∴	5	∴
v̀	5	v̀
ʃ	5	ʃ
ə́	5	ə́
ʤ	5	ʤ
ɛ᷈	4	ɛ᷈
u	4	u
w̃	4	w̃
š	4	š
ĩ́	4	ĩ́
ɪ́	4	ɪ́
o᷈	4	o᷈
ẃ	3	ẃ
…	3	…
ɲ́	3	ɲ́
ʋ	3	ʋ
u᷈	3	u᷈
ʸ	2	ʸ
ŷ	2	ŷ
/	2	/
ĩ̀	2	ĩ̀
ɕ	2	ɕ
ɪ̀	2	ɪ̀
ɔ̯	2	ɔ̯
ⁿ́	2	ⁿ́
Ǹ	1	Ǹ
ᵇ	1	ᵇ
N	1	N
È	1	È
'	1	'
C	1	C
ĺ	1	ĺ
(	1	(
õ̀	1	õ̀
ɔ̃́	1	ɔ̃́
i᷈	1	i᷈
ɛ̄	1	ɛ̄
;	1	;
ṍ	1	ṍ
ɔ̃̀	1	ɔ̃̀
ɲ̀	1	ɲ̀
ⁿ̀	1	ⁿ̀
	

The initial orthography profile is simply a unigram model with graphemes and their frequencies. Once it is saved to a file, you can open it in a text editor and define tailored grapheme clusters and their mappings for an even more powerful orthography profile.
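
To illustrate what such a unigram model counts, the sketch below builds a comparable frequency table in plain Python. It is an approximation and not the `segments` implementation: it treats a base character plus any following combining marks as one grapheme, which is simpler than full Unicode grapheme cluster segmentation. The names `simple_clusters` and `unigram_profile` are hypothetical.

```python
import unicodedata
from collections import Counter


def simple_clusters(text):
    """Split text into base characters with their combining marks attached."""
    clusters = []
    for char in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(char):
            clusters[-1] += char  # attach the diacritic to the preceding base
        else:
            clusters.append(char)
    return clusters


def unigram_profile(words):
    """Return (grapheme, frequency) pairs, most frequent first."""
    counts = Counter()
    for word in words:
        counts.update(simple_clusters(word))
    return counts.most_common()


# Two forms taken from the wordlist sample shown earlier.
for grapheme, frequency in unigram_profile(["sìgà", "sáɣà"]):
    print(grapheme, frequency, sep="\t")
```

Multi-character graphemes such as a doubled vowel are still counted as separate units here, which is why tailored clusters have to be added to the profile by hand.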