# Segmenting the Dogon comparative wordlist

Steven Moran &lt;bambooforest@gmail.com&gt;

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook](https://github.com/unicode-cookbook).

This use case illustrates how to segment wordlist data using an orthography profile. More details about orthography profiles are available in the [Unicode Cookbook](https://github.com/unicode-cookbook).


## Overview

The [Dogon and Bangime linguistics](http://dogonlanguages.org/) project collects and disseminates linguistic, cultural and geographic data from fieldwork undertaken on the Dogon languages (and Bangime) spoken in Mali. 

The data includes an extensive comparative [Dogon lexicon](https://github.com/clld/dogonlanguages-data) organized in an Excel spreadsheet. Columns include more than 20 linguistic varieties of Dogon, English and French glosses, and semantic domains. The spreadsheet contains over 8000 rows, each representing a semantic concept.

The comparative wordlist was compiled by several fieldworkers, and each fieldworker has their own transcription system. This becomes clear below when we create an initial orthography profile from the wordlist: the profile highlights idiosyncrasies between transcription practices, e.g. long vowels are written both as &lt;aa&gt; and as &lt;aː&gt;.
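Differences like these are easy to miss by eye, so it can help to look at the underlying Unicode code points. The following is a minimal sketch using only the Python standard library; the helper name `describe` is made up for this illustration and is not part of the recipe. It also includes the ASCII colon, which appears as a length mark in some of the wordlist entries shown later (e.g. `sâ:`).

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Three ways a long vowel might be transcribed:
# doubled vowel, vowel + IPA length mark (U+02D0), vowel + ASCII colon.
for variant in ("aa", "a\u02d0", "a:"):
    print(variant, describe(variant))
```

The IPA length mark and the ASCII colon look almost identical on screen but are different characters, which is one reason an explicit orthography profile is useful.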

For this recipe, we will use a smaller curated version of the comparative Dogon wordlist available in the `sources` directory of this recipe. The wordlist is in CSV format with columns for a row ID, concept, doculect (language variety) and counterpart (word in the language).

We will use the Python library [Pandas](http://pandas.pydata.org/) to read and manipulate the wordlist. The orthography profile for the Dogon lexical data is located in the `data` directory.

To get started, import the [segments](https://pypi.python.org/pypi/segments/) and [Pandas](http://pandas.pydata.org/) modules:

In [1]:
from segments.tokenizer import Tokenizer
import pandas as pd

Load the wordlist, telling Pandas to use the `ID` column as the row index, and have a look at it:

In [2]:
df = pd.read_csv("sources/dogon-wordlist.tsv", index_col="ID")
print(df.head())

                 CONCEPT         DOCULECT COUNTERPART
ID                                                   
1   -teen ('11' to '19')          Ben_Tey        sâ:
2   -teen ('11' to '19')        Dogul_Dom      sìgà
3   -teen ('11' to '19')           Gourou      sáɣà
4   -teen ('11' to '19')  Jamsay_Douentza      sáɣà
5   -teen ('11' to '19')          Najamba      sìgà


The Dogon orthography profile is in the `data` directory. Let's have a look:

In [None]:
! more data/Heath2016-profile.tsv

Create a tokenizer object from the orthography profile:

In [11]:
t = Tokenizer("data/Heath2016-profile.tsv")

We will add a column called `TOKENS` to the wordlist. It holds the result of segmenting each `COUNTERPART` value with the orthography profile and mapping the segments to the profile's `IPA` column.

In [14]:
tokenizer = lambda x: t.transform(x, column="IPA")

In [15]:
df['TOKENS'] = df['COUNTERPART'].apply(tokenizer)

In [16]:
print(df.head())

                 CONCEPT         DOCULECT COUNTERPART     TOKENS
ID                                                              
1   -teen ('11' to '19')          Ben_Tey        sâ:      s âː
2   -teen ('11' to '19')        Dogul_Dom      sìgà  s ì ɡ à
3   -teen ('11' to '19')           Gourou      sáɣà  s á ɣ à
4   -teen ('11' to '19')  Jamsay_Douentza      sáɣà  s á ɣ à
5   -teen ('11' to '19')          Najamba      sìgà  s ì ɡ à
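To make the segmentation step more concrete, here is a minimal, library-free sketch of what profile-based tokenization roughly amounts to: a greedy longest-match lookup of graphemes, each mapped to a replacement value. This is an assumption about the general approach, not a description of the internals of `segments`. The miniature profile, the helper `nfd`, the function `segment` and the choice of NFD normalization are all invented for this illustration; the real mappings live in `data/Heath2016-profile.tsv`.

```python
import unicodedata

# Hypothetical miniature profile (grapheme -> IPA), invented for illustration.
RAW_PROFILE = {
    "s": "s",
    "g": "\u0261",        # ASCII g -> IPA script g
    "\u0263": "\u0263",   # gamma (voiced velar fricative), unchanged
    "â:": "â\u02d0",      # vowel + ASCII colon -> vowel + IPA length mark
    "á": "á",
    "à": "à",
    "ì": "ì",
}


def nfd(text):
    """Normalize to NFD so precomposed and decomposed input compare equal."""
    return unicodedata.normalize("NFD", text)


PROFILE = {nfd(k): nfd(v) for k, v in RAW_PROFILE.items()}


def segment(word, profile=PROFILE, unknown="\ufffd"):
    """Greedy longest-match segmentation; returns space-separated tokens."""
    word = nfd(word)
    # Try longer graphemes first so "â:" wins over any shorter prefix.
    graphemes = sorted(profile, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                tokens.append(profile[g])
                i += len(g)
                break
        else:
            # No grapheme in the profile matches at this position.
            tokens.append(unknown)
            i += 1
    return " ".join(tokens)


for w in ("sâ:", "sìgà", "sáɣà"):
    print(w, "->", segment(w))
```

Characters that are missing from the profile are replaced by a placeholder in this sketch, which makes gaps in a profile easy to spot in the output.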


Write the new segmented wordlist to the `sandbox` directory:

In [17]:
df.to_csv('sandbox/segmented-dogon-wordlist.tsv', sep="\t")

## Create an initial orthography profile from text input

I have already provided you with a well-tested and curated orthography profile for the Dogon data. If you want to create an initial profile yourself, this is how you could do it from a CSV file.