# Segment graphemes in IPA text

[Steven Moran](http://www.comparativelinguistics.uzh.ch/de/moran.html)

The latest version of this [Jupyter notebook](http://jupyter.org/) is available at [https://github.com/unicode-cookbook/recipes/JIPA](https://github.com/unicode-cookbook/recipes/JIPA). 

This use case illustrates how to segment wordlist data using an orthography profile. More details about orthography profiles are available in the [Unicode Cookbook for Linguists](https://github.com/unicode-cookbook/cookbook).

This recipe uses Python 3.5.

## Overview

This use case illustrates how to segment graphemes in IPA text.

The [International Phonetic Alphabet](https://www.internationalphoneticassociation.org/content/full-ipa-chart) is a standardized system of phonetic notation often used by linguists to transcribe phonological data.

This recipe requires some text in IPA, so we typed up a few transcriptions of the [North Wind and the Sun passage](https://en.wikipedia.org/wiki/The_North_Wind_and_the_Sun) from the excellent series, Illustrations of the IPA, in the [Journal of the International Phonetic Association](https://www.cambridge.org/core/journals/journal-of-the-international-phonetic-association). These short articles illustrate the use of IPA transcription. The series comprises over one hundred richly detailed phonological descriptions of languages from around the world. Kenneth S. Olson provides a [useful list of the articles](http://www-01.sil.org/~olsonk/ipa.html) with additional metadata (ISO 639-3 codes and references).

These fine illustrations present a broad set of orthographic challenges, including tone, stress, diphthongs, and diacritics:

- Brazilian Portuguese (ISO 639-3: por) by Barbosa, Plínio A. & Eleonora C. Albano. 2004. JIPA 34(2). 227–232.
- Kabiye (kbp) by Padayodi, Cécile M. 2008. JIPA 38(2). 215–221.
- Vietnamese (vie) by Kirby, J.P. 2011. JIPA 41(3). 381–392.
- Zurich German (gsw) by Fleischer, Jürg & Stephan Schmid. 2006. JIPA 36(2). 243–253.

These files can be found in the `sources` folder. For each passage, there are two files, one suffixed with `_input` and the other with `_output`. Input files contain the raw text, in which words are separated by white space. The corresponding output files have been segmented into graphemes by hand: each grapheme is separated by white space and each word by the boundary marker `#`.
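The `_output` format described above is easy to read back into a nested structure. The following is a minimal sketch, assuming only the conventions stated here (white space between graphemes, `#` between words); the function name `parse_segmented` is hypothetical and not part of any library.

```python
def parse_segmented(line, word_boundary="#"):
    """Split a hand-segmented line into words, each a list of graphemes."""
    words = []
    current = []
    for token in line.split():
        if token == word_boundary:
            # A boundary marker closes the word collected so far.
            if current:
                words.append(current)
            current = []
        else:
            current.append(token)
    if current:
        words.append(current)
    return words


# The first three words of the Brazilian Portuguese gold file.
sample = "u # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ̥"
words = parse_segmented(sample)

# Joining the graphemes of each word recovers the unsegmented input form.
unsegmented = " ".join("".join(word) for word in words)
print(words)
print(unsegmented)
```

Keeping each word as a list of graphemes makes it straightforward to count graphemes or to rebuild the `_input` form for a round-trip check.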

Let's have a look at the files in the `sources` directory. Use `more` or `less` or `cat` on the command line.

In [1]:
!!more sources/Brazilian_Portuguese_input.txt

['u vẽⁿtʊ nɔɾt͡ʃɪ̥ u sɔʊ̯ dʲiskut͡ʃiɐ̃ʊ̯ kʊ̯aʊ̯ duz doɪ̯z ɛɾɐ u maɪ̯s fɔɾt͡ʃɪ̥ | kʊ̯ɐ̃ⁿdʊ suseˈdeʊ̯ paˈsaɾ ũ viaʒɐ̃ⁿt͡ʃɪ̃voʊ̯tʊ ˈnũma kapɐ ‖ aʊ̯ velʊ põɪ̯sɪ d͡ʒɪ̯akoɾdʊ ɪ̯̃ kõmakelɪ kɪ pɾimeɾʊ kõsɪˈgis̩ obɾiˈgaɾ u vɪaʒɐ̃ⁿt͡ʃɪ̯ɐt͡ʃiˈɾaɾ a kapɐ seɾiɐ kõs̩deˈɾadu maɪ̯s fɔɾt͡ʃɪ̥ ‖ u vẽⁿtʊ nɔɾɪ̥ komeˈsoɐ̯ soˈpɾaɾ kõ ˈmũɪ̯̃ta fuɾɪ̯ɐ | mas kʊ̯ɐ̃ⁿtʊ maɪ̯sopɾavɐ maɪ̯z u vɪaˈʒɐ̃ⁿt sɪ̯akõʃeˈgava suɐ kapɐ̰ | aˈtɛ kɪ̯ʊ vẽⁿtʊ nɔɾt͡ʃɪ dʲizisˈt͡ʃiʊ̯ ‖ u sɔʊ̯ bɾiˈʎoʊ̯ tevɪ̯aˈsɪ̃ dʲi ɣekõɲeˈseɾ a supeɾioɾidadʲɪ du sɔʊ̯']

In [2]:
!! more sources/Brazilian_Portuguese_output.txt

['u # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ̥ # u # s ɔ ʊ̯ # dʲ i s k u t͡ʃ i ɐ̃ ʊ̯ # k ʊ̯ a ʊ̯ # d u z # d o ɪ̯ z # ɛ ɾ ɐ # u # m a ɪ̯ s # f ɔ ɾ t͡ʃ ɪ̥ # | # k ʊ̯ ɐ̃ⁿ d ʊ # s u s e ˈd e ʊ̯ # p a ˈs a ɾ # ũ # v i a ʒ ɐ̃ⁿ t͡ʃ ɪ̃ v o ʊ̯ t ʊ # ˈn ũ m a # k a p ɐ # ‖ # a ʊ̯ # v e l ʊ # p õ ɪ̯ s ɪ # d͡ʒ ɪ̯ a k o ɾ d ʊ # ɪ̯̃ # k õ m a k e l ɪ # k ɪ # p ɾ i m e ɾ ʊ # k õ s ɪ ˈg i s̩ # o b ɾ i ˈg a ɾ # u # v ɪ a ʒ ɐ̃ⁿ t͡ʃ ɪ̯ ɐ t͡ʃ i ˈɾ a ɾ # a # k a p ɐ # s e ɾ i ɐ # k õ s̩ d e ˈɾ a d u # m a ɪ̯ s # f ɔ ɾ t͡ʃ ɪ̥ # ‖ # u # v ẽⁿ t ʊ # n ɔ ɾ ɪ̥ # k o m e ˈs o ɐ̯ # s o ˈp ɾ a ɾ # k õ # ˈm ũ ɪ̯̃ t a # f u ɾ ɪ̯ ɐ # | # m a s # k ʊ̯ ɐ̃ⁿ t ʊ # m a ɪ̯ s o p ɾ a v ɐ # m a ɪ̯ z # u # v ɪ a ˈʒ ɐ̃ⁿ t # s ɪ̯ a k õ ʃ e ˈg a v a # s u ɐ # k a p ɐ̰ # | # a ˈt ɛ # k ɪ̯ ʊ # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ # dʲ i z i s ˈt͡ʃ i ʊ̯ # ‖ # u # s ɔ ʊ̯ # b ɾ i ˈʎ o ʊ̯ # t e v ɪ̯ a ˈs ɪ̃ # dʲ i # ɣ e k õ ɲ e ˈs e ɾ # a # s u p e ɾ i o ɾ i d a dʲ ɪ # d u # s ɔ ʊ̯']

In [3]:
!!more sources/Kabiye_input.txt

['hà̙jìkíŋ́ hèˠèˠlím̀ nɛ̙̀ wí̙sì̙ hà̙jìkíŋ́ hèˠèˠlím̀ nɛ̙̀ wɪ̙́sɪ̙̀ bà̙à̙ mà̙ˠà̙ˠzɪ̙́nɪ̙́ ḿbʊ̙́ zɪ̙̀ bà̙nà̙ zɪ̙́ á̙kɪ̙́lɪ̙́ ɖóŋ̀ ‖ pɪ̙̀ɡɛ̙̀dá̙à̙ lɛ̙́ | nʊ̙́mɔ̙̀wòɽú nɔ̙́ɔ̙́jʊ̙̀ ɖɛ̙̀wà̙ˠá̙ˠ | ɛ̙̀hɔ̙̀kɪ̙́ ɛ̙̀dɪ̙̀ɪ̙̀ dókò ɡ͡bìŋ̀ɡìzìm ɲɪ̙́ŋ́ɡʊ̙̀ nà̙kʊ̙́jʊ̙̀ dà̙á̙ ‖ pá̙nɪ̙́ɪ̙́ná̙ ɖá̙má̙ zɪ̙̀ wèjí ɛ̙́ɽá̙ bɪ̙́zʊ̙́ʊ̙́ ɛ̙́lá̙ nʊ̙́mɔ̙̀wòɽú ɛ̙́nʊ̙́ ɛ̙́hɔ̙́zɪ̙̀ òdókòò lɛ̙́ | ɛ̙́nʊ̙́ ɡɪ̙́lɪ̙́ná̙ ɖóŋ̀ ‖ kɛ̙́lɛ̙́ | hà̙jìkíŋ́ hèˠèˠlíḿ bá̙zɪ̙́ bɪ̙́má̙ ɖéŋ́ɖéé bɪ̙́bɪ̙́zɪ̙̀ˠɪ̙̀ˠ jɔ̙́ ‖ kɔ̙́jɔ̙́ bɪ̙̀má̙kɪ̙́ ɖóŋ̀ lɛ̙́ | nʊ̙́mɔ̙̀wòɽú ɲí ɡ͡bííkíˠíˠ ɛ̙̀dɪ̙̀ɪ̙̀ òdókò dà̙à̙ ɡíɡ͡bììkú ‖ pɪ̙́nɪ̙́ɪ̙́ hà̙jìkíŋ́ hèˠèˠlím̀ nɛ̙̀ bɪ̙́tɛ̙́↓zɪ̙́ ↓jébù ‖ kɛ̙́lɛ̙́ wɪ̙́↓sɪ̙́ɲá̙ sɪ̙́ŋ́↓ɡɪ̙́ ↓zɪ̙́ɲá̙ˠá̙ˠ ɖéŋ́↓ ɖéé ↓zɪ̙́bɪ̙́zɪ̙̀ˠɪ̙̀ˠjɔ̙́ kà̙ʊ̙̀ʊ̙̀ʊ̙̀ ‖ nʊ̙́mɔ̙̀wòɽú dɪ̙́ ↓hɔ̙́zʊ̙́ ↓édókó ↓ɛ̙́lɔ̙́ ‖ pèéɽèè bɪ̙́ɽɛ̙́ hà̙jìkíŋ́ hèˠèˠlím̀ nɪ̙̀ bídísì zɪ̙̀ bà̙dà̙á̙ lɛ̙́ | wí̙sì̙ ɡɪ̙̀lɪ̙́↓ná̙ ɖóŋ̀ nòòò ‖']

In [4]:
!!more sources/Vietnamese_input.txt

['zɔ˩˨ ɓɤ̆k˦˥ va˧˨ măt˨˩ tɕɤj˧˨ kaj˧˨˦ ɲaw˦ sɛm˦ aj˦ mɛŋ̟˨ hɤn˦ tɕɔŋ͡m˦ luk͡p˦˥ ɗɔ˧˦ mot˨˩ zu˦ xɛk˦˥ măk˨˩ mot˨˩ aw˧˦ xʷak˦˥ ɤm˨˦ ɗi˦ kʷa˦ hɔ˨ zaw˦ kɛw˧˨ vɤj˧˦ ɲaw˦ zăŋ˧˨ ʔaj˦ la˧˨ ŋɯəj˧˨ ɗɤ̆w˧˨ tiən˦ ma˧˨ kɔ˨˧ tʰe˧˩˨ ɓăt˦˥ ŋɯəj˧˨ zu˦ xɛk˦˥ kiə˦ kɤj˧˩˨ ʔaw˧˦ tʰi˧˨ sɛ˧˨˥ ɗɯək˨˩ kɔj˦ la˧˨ mɛŋ̟˨ hɤn˦ săw˦ ɗɔ˨˦ zɔ˨˦ ɓɤ̆k˦˥ ɓăt˦˥ dɤ̆w˧˨ tʰoj˧˩˨ mɛŋ̟˨ het˦˥ sɯk˦˥ kɔ˨˦ tʰe˧˩˨ ɲɯŋ˦ kaŋ˧˨ tʰoj˧˩˨ tʰi˧˨ ŋɯəj˧˨ zu˦ xɛk˦˥ kaŋ˧˨ zɯ˧˨˦ tɕăt˨˩ ʔaw˧˦ xʷak˦˥ va˧˨ kuj˧˦ kuŋ͡m˧˨ zɔ˧˦ ɓɤ̆k˦˥ ɗa˧˨˦ faj˧˩˨ tɯ˧˨ ɓɔ˧˩˨ săw˦ ɗɔ˨˧ măt˨˩ tɕɤj˧˨ sɯəj˧˩˨ ʔɤ̆m˨˧ va˧˨ ŋɯəj˧˨ zu˦ xek˦˥ lien˧˨ kɤj˧˩˨ ʔaw˨˧ xʷak˦˥ ket˦˥ kuk͡p˨˩ la˧˨ zɔ˧˦ ɓɤ̆k˦˥ faj˧˩˨ tʰɯə˧˨ ɲɤ̆n˨ zăŋ˧˨ măt˨˩ tɕɤj˦˧ la˧˨ ŋɯəj˧˨ mɛŋ̟˨ hɤn˦ tɕɔŋ͡m˦ haj˦ ŋɯəj˧˨']

In [5]:
!!more sources/Zurich_German_input.txt

['d̥ə ˈb̥iːz̥ˌʋi̞nd̥ u̞n‿ˈt͡su̞nə ‖ əˈmɒːl hæn‿tə ˈb̥iːz̥ˌʋi̞nd̥ u̞n‿ˈt͡su̞nə ˈkʃtri̞tːə | ʋɛːr v̥o ˈb̥æi̯ʔnə d̥ɒz̥ æχ tə ˈʃtɛrɣ̥ər z̥ei̯ɡ̥ ‖ d̥ɒ ʁ̥u̞nt ə‿ˈmːɒː d̥ətˈhɛːʁ | ʋon ən ˈtːi̞k͡χə ˈmɒntl̩ ˈɒːkhɒ hæt̚ ‖ d̥o z̥i̞nt͡s ˈrœːtːi̞ɡ̥ ˈʋoːrd̥ə | d̥ɒs ˈtɛː d̥ə ˈʃtɛrɣ̥ər z̥ei̯ɡ | ʋo d̥ɛː mɒː d̥əˈt͡su̞ə̯ ˈb̥ri̞ŋi̞ | d̥ɒz̥ ər z̥i̞‿ˈmːɒntl̩ ˈɒpˌt͡si̞ə̯i̞ ‖ d̥ə ˈb̥iːz̥ʋi̞nd̥ | hæt ˈɒːv̥ɛ ˈb̥lɒːz̥ə z̥o v̥eʃ‿tɒz̥ ər hæ‿ˈk͡xønə | ˈɒb̥ər d̥ə ˈmɒː | hæ‿ʔnu̞ d̥ə ˈmɒntl̩ ˈæŋər knɒː ‖ d̥o hæ‿ˈt͡su̞nə | ˈɒːv̥ə ˈʒ̥iːnə | ˈi̞mər ˈʋɛːrmər | b̥i̞s tə mɒː | d̥ə ˈmɒntl̩ ˈɒpˌt͡soɡ̥ə hæt ‖ d̥o̞ hæ‿tːə ˈb̥iːz̥ˌʋi̞m‿ˈʔmøz̥ə ˈt͡su̞ə̯ɡ̥ɛː | d̥ɒs t͡su̞nː | ˈʃtɛrɣ̥ər z̥ei̯ɡ̥ ˈʋed̥ər ɛːʁ̥ ‖']

## Segment graphemes

One thing we can do is to read these files into Python and segment them with the `segments` library.

In [6]:
from segments.tokenizer import Tokenizer

In [7]:
with open('sources/Brazilian_Portuguese_input.txt') as infile:
    text = infile.read()

In [8]:
t = Tokenizer()
tokenized_text = t(text)
print(tokenized_text)

u # v ẽ ⁿ t ʊ # n ɔ ɾ t͡ ʃ ɪ̥ # u # s ɔ ʊ̯ # d ʲ i s k u t͡ ʃ i ɐ̃ ʊ̯ # k ʊ̯ a ʊ̯ # d u z # d o ɪ̯ z # ɛ ɾ ɐ # u # m a ɪ̯ s # f ɔ ɾ t͡ ʃ ɪ̥ # | # k ʊ̯ ɐ̃ ⁿ d ʊ # s u s e ˈ d e ʊ̯ # p a ˈ s a ɾ # ũ # v i a ʒ ɐ̃ ⁿ t͡ ʃ ɪ̃ v o ʊ̯ t ʊ # ˈ n ũ m a # k a p ɐ # ‖ # a ʊ̯ # v e l ʊ # p õ ɪ̯ s ɪ # d͡ ʒ ɪ̯ a k o ɾ d ʊ # ɪ̯̃ # k õ m a k e l ɪ # k ɪ # p ɾ i m e ɾ ʊ # k õ s ɪ ˈ g i s̩ # o b ɾ i ˈ g a ɾ # u # v ɪ a ʒ ɐ̃ ⁿ t͡ ʃ ɪ̯ ɐ t͡ ʃ i ˈ ɾ a ɾ # a # k a p ɐ # s e ɾ i ɐ # k õ s̩ d e ˈ ɾ a d u # m a ɪ̯ s # f ɔ ɾ t͡ ʃ ɪ̥ # ‖ # u # v ẽ ⁿ t ʊ # n ɔ ɾ ɪ̥ # k o m e ˈ s o ɐ̯ # s o ˈ p ɾ a ɾ # k õ # ˈ m ũ ɪ̯̃ t a # f u ɾ ɪ̯ ɐ # | # m a s # k ʊ̯ ɐ̃ ⁿ t ʊ # m a ɪ̯ s o p ɾ a v ɐ # m a ɪ̯ z # u # v ɪ a ˈ ʒ ɐ̃ ⁿ t # s ɪ̯ a k õ ʃ e ˈ g a v a # s u ɐ # k a p ɐ̰ # | # a ˈ t ɛ # k ɪ̯ ʊ # v ẽ ⁿ t ʊ # n ɔ ɾ t͡ ʃ ɪ # d ʲ i z i s ˈ t͡ ʃ i ʊ̯ # ‖ # u # s ɔ ʊ̯ # b ɾ i ˈ ʎ o ʊ̯ # t e v ɪ̯ a ˈ s ɪ̃ # d ʲ i # ɣ e k õ ɲ e ˈ s e ɾ # a # s u p e ɾ i o ɾ i d a d ʲ ɪ # d u # s ɔ ʊ̯


The `tokenized_text` variable contains a default segmentation of the text. In the `sources` directory we have some hand-vetted, gold-standard segmented files. We can compare `tokenized_text` against the vetted files.

In [9]:
with open('sources/Brazilian_Portuguese_output.txt') as goldfile:
    gold = goldfile.read()
    print(gold)

u # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ̥ # u # s ɔ ʊ̯ # dʲ i s k u t͡ʃ i ɐ̃ ʊ̯ # k ʊ̯ a ʊ̯ # d u z # d o ɪ̯ z # ɛ ɾ ɐ # u # m a ɪ̯ s # f ɔ ɾ t͡ʃ ɪ̥ # | # k ʊ̯ ɐ̃ⁿ d ʊ # s u s e ˈd e ʊ̯ # p a ˈs a ɾ # ũ # v i a ʒ ɐ̃ⁿ t͡ʃ ɪ̃ v o ʊ̯ t ʊ # ˈn ũ m a # k a p ɐ # ‖ # a ʊ̯ # v e l ʊ # p õ ɪ̯ s ɪ # d͡ʒ ɪ̯ a k o ɾ d ʊ # ɪ̯̃ # k õ m a k e l ɪ # k ɪ # p ɾ i m e ɾ ʊ # k õ s ɪ ˈg i s̩ # o b ɾ i ˈg a ɾ # u # v ɪ a ʒ ɐ̃ⁿ t͡ʃ ɪ̯ ɐ t͡ʃ i ˈɾ a ɾ # a # k a p ɐ # s e ɾ i ɐ # k õ s̩ d e ˈɾ a d u # m a ɪ̯ s # f ɔ ɾ t͡ʃ ɪ̥ # ‖ # u # v ẽⁿ t ʊ # n ɔ ɾ ɪ̥ # k o m e ˈs o ɐ̯ # s o ˈp ɾ a ɾ # k õ # ˈm ũ ɪ̯̃ t a # f u ɾ ɪ̯ ɐ # | # m a s # k ʊ̯ ɐ̃ⁿ t ʊ # m a ɪ̯ s o p ɾ a v ɐ # m a ɪ̯ z # u # v ɪ a ˈʒ ɐ̃ⁿ t # s ɪ̯ a k õ ʃ e ˈg a v a # s u ɐ # k a p ɐ̰ # | # a ˈt ɛ # k ɪ̯ ʊ # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ # dʲ i z i s ˈt͡ʃ i ʊ̯ # ‖ # u # s ɔ ʊ̯ # b ɾ i ˈʎ o ʊ̯ # t e v ɪ̯ a ˈs ɪ̃ # dʲ i # ɣ e k õ ɲ e ˈs e ɾ # a # s u p e ɾ i o ɾ i d a dʲ ɪ # d u # s ɔ ʊ̯


Hmm, something does not look quite right! Can you spot the differences between strings?

In [10]:
tokenized_words = tokenized_text.split(' # ')
print(tokenized_words)

['u', 'v ẽ ⁿ t ʊ', 'n ɔ ɾ t͡ ʃ ɪ̥', 'u', 's ɔ ʊ̯', 'd ʲ i s k u t͡ ʃ i ɐ̃ ʊ̯', 'k ʊ̯ a ʊ̯', 'd u z', 'd o ɪ̯ z', 'ɛ ɾ ɐ', 'u', 'm a ɪ̯ s', 'f ɔ ɾ t͡ ʃ ɪ̥', '|', 'k ʊ̯ ɐ̃ ⁿ d ʊ', 's u s e ˈ d e ʊ̯', 'p a ˈ s a ɾ', 'ũ', 'v i a ʒ ɐ̃ ⁿ t͡ ʃ ɪ̃ v o ʊ̯ t ʊ', 'ˈ n ũ m a', 'k a p ɐ', '‖', 'a ʊ̯', 'v e l ʊ', 'p õ ɪ̯ s ɪ', 'd͡ ʒ ɪ̯ a k o ɾ d ʊ', 'ɪ̯̃', 'k õ m a k e l ɪ', 'k ɪ', 'p ɾ i m e ɾ ʊ', 'k õ s ɪ ˈ g i s̩', 'o b ɾ i ˈ g a ɾ', 'u', 'v ɪ a ʒ ɐ̃ ⁿ t͡ ʃ ɪ̯ ɐ t͡ ʃ i ˈ ɾ a ɾ', 'a', 'k a p ɐ', 's e ɾ i ɐ', 'k õ s̩ d e ˈ ɾ a d u', 'm a ɪ̯ s', 'f ɔ ɾ t͡ ʃ ɪ̥', '‖', 'u', 'v ẽ ⁿ t ʊ', 'n ɔ ɾ ɪ̥', 'k o m e ˈ s o ɐ̯', 's o ˈ p ɾ a ɾ', 'k õ', 'ˈ m ũ ɪ̯̃ t a', 'f u ɾ ɪ̯ ɐ', '|', 'm a s', 'k ʊ̯ ɐ̃ ⁿ t ʊ', 'm a ɪ̯ s o p ɾ a v ɐ', 'm a ɪ̯ z', 'u', 'v ɪ a ˈ ʒ ɐ̃ ⁿ t', 's ɪ̯ a k õ ʃ e ˈ g a v a', 's u ɐ', 'k a p ɐ̰', '|', 'a ˈ t ɛ', 'k ɪ̯ ʊ', 'v ẽ ⁿ t ʊ', 'n ɔ ɾ t͡ ʃ ɪ', 'd ʲ i z i s ˈ t͡ ʃ i ʊ̯', '‖', 'u', 's ɔ ʊ̯', 'b ɾ i ˈ ʎ o ʊ̯', 't e v ɪ̯ a ˈ s ɪ̃', 'd ʲ i', 'ɣ e k õ ɲ e ˈ s e ɾ', 'a', 's u p e ɾ i o ɾ

In [11]:
gold_words = gold.split(' # ')
print(gold_words)

['u', 'v ẽⁿ t ʊ', 'n ɔ ɾ t͡ʃ ɪ̥', 'u', 's ɔ ʊ̯', 'dʲ i s k u t͡ʃ i ɐ̃ ʊ̯', 'k ʊ̯ a ʊ̯', 'd u z', 'd o ɪ̯ z', 'ɛ ɾ ɐ', 'u', 'm a ɪ̯ s', 'f ɔ ɾ t͡ʃ ɪ̥', '|', 'k ʊ̯ ɐ̃ⁿ d ʊ', 's u s e ˈd e ʊ̯', 'p a ˈs a ɾ', 'ũ', 'v i a ʒ ɐ̃ⁿ t͡ʃ ɪ̃ v o ʊ̯ t ʊ', 'ˈn ũ m a', 'k a p ɐ', '‖', 'a ʊ̯', 'v e l ʊ', 'p õ ɪ̯ s ɪ', 'd͡ʒ ɪ̯ a k o ɾ d ʊ', 'ɪ̯̃', 'k õ m a k e l ɪ', 'k ɪ', 'p ɾ i m e ɾ ʊ', 'k õ s ɪ ˈg i s̩', 'o b ɾ i ˈg a ɾ', 'u', 'v ɪ a ʒ ɐ̃ⁿ t͡ʃ ɪ̯ ɐ t͡ʃ i ˈɾ a ɾ', 'a', 'k a p ɐ', 's e ɾ i ɐ', 'k õ s̩ d e ˈɾ a d u', 'm a ɪ̯ s', 'f ɔ ɾ t͡ʃ ɪ̥', '‖', 'u', 'v ẽⁿ t ʊ', 'n ɔ ɾ ɪ̥', 'k o m e ˈs o ɐ̯', 's o ˈp ɾ a ɾ', 'k õ', 'ˈm ũ ɪ̯̃ t a', 'f u ɾ ɪ̯ ɐ', '|', 'm a s', 'k ʊ̯ ɐ̃ⁿ t ʊ', 'm a ɪ̯ s o p ɾ a v ɐ', 'm a ɪ̯ z', 'u', 'v ɪ a ˈʒ ɐ̃ⁿ t', 's ɪ̯ a k õ ʃ e ˈg a v a', 's u ɐ', 'k a p ɐ̰', '|', 'a ˈt ɛ', 'k ɪ̯ ʊ', 'v ẽⁿ t ʊ', 'n ɔ ɾ t͡ʃ ɪ', 'dʲ i z i s ˈt͡ʃ i ʊ̯', '‖', 'u', 's ɔ ʊ̯', 'b ɾ i ˈʎ o ʊ̯', 't e v ɪ̯ a ˈs ɪ̃', 'dʲ i', 'ɣ e k õ ɲ e ˈs e ɾ', 'a', 's u p e ɾ i o ɾ i d a dʲ ɪ', 'd u', 's ɔ ʊ̯']


Let's find the delta between the lists of words and identify where the problems are.

In [12]:
from difflib import unified_diff

d = unified_diff(tokenized_words, gold_words)
print('\n'.join(list(d)))

--- 

+++ 

@@ -1,76 +1,76 @@

 u
-v ẽ ⁿ t ʊ
-n ɔ ɾ t͡ ʃ ɪ̥
+v ẽⁿ t ʊ
+n ɔ ɾ t͡ʃ ɪ̥
 u
 s ɔ ʊ̯
-d ʲ i s k u t͡ ʃ i ɐ̃ ʊ̯
+dʲ i s k u t͡ʃ i ɐ̃ ʊ̯
 k ʊ̯ a ʊ̯
 d u z
 d o ɪ̯ z
 ɛ ɾ ɐ
 u
 m a ɪ̯ s
-f ɔ ɾ t͡ ʃ ɪ̥
+f ɔ ɾ t͡ʃ ɪ̥
 |
-k ʊ̯ ɐ̃ ⁿ d ʊ
-s u s e ˈ d e ʊ̯
-p a ˈ s a ɾ
+k ʊ̯ ɐ̃ⁿ d ʊ
+s u s e ˈd e ʊ̯
+p a ˈs a ɾ
 ũ
-v i a ʒ ɐ̃ ⁿ t͡ ʃ ɪ̃ v o ʊ̯ t ʊ
-ˈ n ũ m a
+v i a ʒ ɐ̃ⁿ t͡ʃ ɪ̃ v o ʊ̯ t ʊ
+ˈn ũ m a
 k a p ɐ
 ‖
 a ʊ̯
 v e l ʊ
 p õ ɪ̯ s ɪ
-d͡ ʒ ɪ̯ a k o ɾ d ʊ
+d͡ʒ ɪ̯ a k o ɾ d ʊ
 ɪ̯̃
 k õ m a k e l ɪ
 k ɪ
 p ɾ i m e ɾ ʊ
-k õ s ɪ ˈ g i s̩
-o b ɾ i ˈ g a ɾ
+k õ s ɪ ˈg i s̩
+o b ɾ i ˈg a ɾ
 u
-v ɪ a ʒ ɐ̃ ⁿ t͡ ʃ ɪ̯ ɐ t͡ ʃ i ˈ ɾ a ɾ
+v ɪ a ʒ ɐ̃ⁿ t͡ʃ ɪ̯ ɐ t͡ʃ i ˈɾ a ɾ
 a
 k a p ɐ
 s e ɾ i ɐ
-k õ s̩ d e ˈ ɾ a d u
+k õ s̩ d e ˈɾ a d u
 m a ɪ̯ s
-f ɔ ɾ t͡ ʃ ɪ̥
+f ɔ ɾ t͡ʃ ɪ̥
 ‖
 u
-v ẽ ⁿ t ʊ
+v ẽⁿ t ʊ
 n ɔ ɾ ɪ̥
-k o m e ˈ s o ɐ̯
-s o ˈ p ɾ a ɾ
+k o m e ˈs o ɐ̯
+s o ˈp ɾ a ɾ
 k õ
-ˈ m ũ ɪ̯̃ t a
+ˈm ũ ɪ̯̃ t a
 f u ɾ ɪ̯ ɐ
 |
 m a s
-k ʊ̯ ɐ̃ ⁿ t ʊ
+k ʊ̯ ɐ̃ⁿ t ʊ
 m a ɪ̯ s o p ɾ a v ɐ
 m a ɪ̯ z
 u
-v ɪ 

The output shows which words do not match. It is clear that the default Unicode segmentation (in regular expression terms, the grapheme cluster matcher `\X`) does not handle some of the Unicode IPA pitfalls discussed in the cookbook. For example, the tie bar is attached only to the preceding character instead of joining both characters, and the nasal release <ⁿ> is left as a separate segment instead of being attached to its base character. Luckily, the `segments` package has an IPA tokenization option, enabled with `ipa=True`, that is meant to deal with IPA text in particular.
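To see why the default segmentation behaves this way, it helps to inspect the characters involved. The sketch below uses only the standard library `unicodedata` module: the tie bar is a combining mark, so it clusters with the preceding base only, while <ⁿ>, <ʲ> and the stress mark are modifier letters, which form grapheme clusters of their own. The `regroup` function is a deliberately simplified illustration of the kind of regrouping the gold standard reflects. It is not the `segments` implementation, and the names `TIE_BARS`, `STRESS_MARKS`, `describe` and `regroup` are assumptions made for this example.

```python
import unicodedata

TIE_BARS = {"\u0361", "\u035C"}      # combining double inverted breve / double breve below
STRESS_MARKS = {"\u02C8", "\u02CC"}  # primary and secondary stress marks


def describe(char):
    """Return code point, general category and Unicode name of a character."""
    return "U+{:04X} {} {}".format(
        ord(char), unicodedata.category(char), unicodedata.name(char)
    )


# Tie bar, superscript n, superscript j, primary stress mark.
for char in ("\u0361", "\u207F", "\u02B2", "\u02C8"):
    print(describe(char))


def regroup(segments):
    """Regroup default grapheme clusters (simplified, illustrative rules only)."""
    out = []
    prefix = ""  # material waiting to be attached to the following segment
    for seg in segments:
        seg = prefix + seg
        prefix = ""
        if seg in STRESS_MARKS or seg[-1] in TIE_BARS:
            # Stress marks and tie-bar-final clusters join what follows.
            prefix = seg
        elif out and all(unicodedata.category(c) == "Lm" for c in seg):
            # Other modifier letters join the preceding segment.
            out[-1] += seg
        else:
            out.append(seg)
    if prefix:
        out.append(prefix)
    return out


# Default clusters for one word: d, superscript j, i, z, i, s, stress, t + tie bar, esh, i
default = ["d", "\u02B2", "i", "z", "i", "s", "\u02C8", "t\u0361", "\u0283", "i"]
regrouped = regroup(default)
print(" ".join(regrouped))
```

Real IPA text has more cases than these three rules cover (tone letters, length marks, pre-aspiration and so on), which is one reason to rely on a tested library rather than ad hoc rules like these.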

In [13]:
ipa_tokenized_text = t(text, ipa=True)
print(ipa_tokenized_text)

u # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ̥ # u # s ɔ ʊ̯ # dʲ i s k u t͡ʃ i ɐ̃ ʊ̯ # k ʊ̯ a ʊ̯ # d u z # d o ɪ̯ z # ɛ ɾ ɐ # u # m a ɪ̯ s # f ɔ ɾ t͡ʃ ɪ̥ # | # k ʊ̯ ɐ̃ⁿ d ʊ # s u s e ˈd e ʊ̯ # p a ˈs a ɾ # ũ # v i a ʒ ɐ̃ⁿ t͡ʃ ɪ̃ v o ʊ̯ t ʊ # ˈn ũ m a # k a p ɐ # ‖ # a ʊ̯ # v e l ʊ # p õ ɪ̯ s ɪ # d͡ʒ ɪ̯ a k o ɾ d ʊ # ɪ̯̃ # k õ m a k e l ɪ # k ɪ # p ɾ i m e ɾ ʊ # k õ s ɪ ˈg i s̩ # o b ɾ i ˈg a ɾ # u # v ɪ a ʒ ɐ̃ⁿ t͡ʃ ɪ̯ ɐ t͡ʃ i ˈɾ a ɾ # a # k a p ɐ # s e ɾ i ɐ # k õ s̩ d e ˈɾ a d u # m a ɪ̯ s # f ɔ ɾ t͡ʃ ɪ̥ # ‖ # u # v ẽⁿ t ʊ # n ɔ ɾ ɪ̥ # k o m e ˈs o ɐ̯ # s o ˈp ɾ a ɾ # k õ # ˈm ũ ɪ̯̃ t a # f u ɾ ɪ̯ ɐ # | # m a s # k ʊ̯ ɐ̃ⁿ t ʊ # m a ɪ̯ s o p ɾ a v ɐ # m a ɪ̯ z # u # v ɪ a ˈʒ ɐ̃ⁿ t # s ɪ̯ a k õ ʃ e ˈg a v a # s u ɐ # k a p ɐ̰ # | # a ˈt ɛ # k ɪ̯ ʊ # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ # dʲ i z i s ˈt͡ʃ i ʊ̯ # ‖ # u # s ɔ ʊ̯ # b ɾ i ˈʎ o ʊ̯ # t e v ɪ̯ a ˈs ɪ̃ # dʲ i # ɣ e k õ ɲ e ˈs e ɾ # a # s u p e ɾ i o ɾ i d a dʲ ɪ # d u # s ɔ ʊ̯


In [14]:
ipa_tokenized_text == gold

True
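When an equality check like this one returns `False`, a word-by-word comparison helps locate the offending words quickly. The following is a minimal sketch using short stand-in strings taken from the passage above; `find_mismatches` is a hypothetical helper, and stripping surrounding white space is an assumption made here so that a trailing newline read from a file does not count as a difference.

```python
def find_mismatches(candidate, gold, separator=" # ", limit=5):
    """Return up to `limit` (index, candidate_word, gold_word) mismatches
    and a flag telling whether both texts have the same number of words."""
    cand_words = candidate.strip().split(separator)
    gold_words = gold.strip().split(separator)
    mismatches = [
        (index, cand, ref)
        for index, (cand, ref) in enumerate(zip(cand_words, gold_words))
        if cand != ref
    ]
    return mismatches[:limit], len(cand_words) == len(gold_words)


# Stand-ins for a default segmentation and the hand-segmented gold standard.
candidate_sample = "u # v ẽ ⁿ t ʊ # n ɔ ɾ t͡ ʃ ɪ̥\n"
gold_sample = "u # v ẽⁿ t ʊ # n ɔ ɾ t͡ʃ ɪ̥\n"

found, same_length = find_mismatches(candidate_sample, gold_sample)
for index, cand, ref in found:
    print(index, repr(cand), "!=", repr(ref))
print("same number of words:", same_length)
```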

In [15]:
t("Voilà!")

'V o i l à !'