# Generate pseudo-words using a Markov chain 

- We will read in a dataset of English words, with pronunciations, calculate phoneme transition probabilities from the data, and then use those phoneme transition probabilities to generate pseudo-words that follow the phonotactic probabilities of English.
- This is essentially a simple Markov chain.  Every phoneme is a state - specifically, being in the state corresponding to phoneme $p$ means that we have just seen phoneme $p$.  Transition probabilities from state $p$ to some other state $q$ are determined from bigrams in the data.  How?
- A bigram is a pair of consecutive symbolic units - in this case phonemes.  Thus a transition probability such as $p(p_2|p_1)$ is the probability of moving from the state that denotes our having just seen phoneme $p_1$ to the state that denotes our having just seen phoneme $p_2$.  We estimate it as the relative frequency of encountering $p_2$ in the data, given that we have just seen $p_1$.
- This notebook again uses [WikiPron](https://github.com/kylebgorman/wikipron), but this time we consider US pronunciation for English.  Analogous data for many other languages, in the same format, is available [here](https://github.com/kylebgorman/wikipron/tree/master/data), which you can download and use for extensions.
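
As a small illustration of the bigram idea described above, here is a minimal sketch on made-up data.  The three toy "words" and the names `toy_prons`, `bigram_counts` and `relative_frequency` are assumptions for illustration only; they are not part of the WikiPron data or of the code used later in this notebook.

```python
from collections import Counter

# Toy data (made up for illustration): three "words" as phoneme lists.
toy_prons = [["k", "æ", "t"], ["k", "æ", "b"], ["k", "ɪ", "t"]]

# Count every pair of consecutive phonemes (bigrams).
bigram_counts = Counter()
for pron in toy_prons:
    for p1, p2 in zip(pron, pron[1:]):
        bigram_counts[(p1, p2)] += 1

def relative_frequency(p1, p2):
    """Relative frequency of p2 among everything that follows p1."""
    total_after_p1 = sum(c for (a, _), c in bigram_counts.items() if a == p1)
    return bigram_counts[(p1, p2)] / total_after_p1

# "k" is followed by "æ" in two of the three bigrams that start with "k".
print(relative_frequency("k", "æ"))
```

Here `zip(pron, pron[1:])` pairs each phoneme with its successor, which is one convenient way to walk over bigrams without index arithmetic.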

In [1]:
import re
import pandas as pd
import numpy as np
import unicodedata
import random

### Read in data

We also format the data so that pulling out phoneme transitions will be easy.

In [2]:
# read in english words.  
df = pd.read_csv("eng_us_phonemic.tsv", sep='\t', names=['Word', 'Pron'])
df.Pron = df.Pron.str.split()  # turn space-separated strings into lists of phonemes
df

Unnamed: 0,Word,Pron
0,$deity,"[d, e, ɪ, ɪ, t, i]"
1,$deity,"[d, iː, ɪ, t, i]"
2,'cause,"[k, ʌ, z]"
3,'dswounds,"[d, z, w, u, n, d, z]"
4,'em,[m̩]
...,...,...
53394,œconomy,"[ɪː, k, ɒ, n, ɒ, m, i]"
53395,œneus,"[ɛ, n, j, uː, s]"
53396,ǃkung,"[k, u, ŋ]"
53397,ǃkung,"[k, ʊ, ŋ]"


In [3]:
# sanity check
# find the US pronunciation of the word "aardvark"
df[df.Word=='aardvark']

Unnamed: 0,Word,Pron
67,aardvark,"[ɑ, ɹ, d, v, ɑ, ɹ, k]"


In [4]:
# sanity check - and a chance to introduce the very useful pandas "sample" routine.
# find the US pronunciation of several other words.
df.sample(n=25)

Unnamed: 0,Word,Pron
51772,wff,"[w, ʊ, f]"
6088,beamer,"[b, i, m, ɚ]"
39941,reading,"[ɹ, i, d, i, ŋ]"
14694,deshler,"[d, ɛ, ʃ, l, ə, ɹ]"
35586,parasitism,"[p, æ, ɹ, ə, s, a, ɪ, t, ɪ, z, m̩]"
9885,cesspool,"[s, ɛ, s, p, uː, l]"
14393,deleon,"[d, ə, l, i, ɑ, n]"
15505,distressing,"[d, ɪ, s, t, ɹ, ɛ, s, ɪ, ŋ]"
37237,plate,"[p, l, e, ɪ, t]"
30573,martinet,"[m, æ, ɹ, t, ɪ, n, ɛ, t]"


In [5]:
# now add START and STOP pseudo-phonemes at beginning and end of each word.
# a chance to introduce the very useful pandas "apply" routine.
df.Pron = df.Pron.apply(lambda x: ['START'] + x + ['STOP'])
df

Unnamed: 0,Word,Pron
0,$deity,"[START, d, e, ɪ, ɪ, t, i, STOP]"
1,$deity,"[START, d, iː, ɪ, t, i, STOP]"
2,'cause,"[START, k, ʌ, z, STOP]"
3,'dswounds,"[START, d, z, w, u, n, d, z, STOP]"
4,'em,"[START, m̩, STOP]"
...,...,...
53394,œconomy,"[START, ɪː, k, ɒ, n, ɒ, m, i, STOP]"
53395,œneus,"[START, ɛ, n, j, uː, s, STOP]"
53396,ǃkung,"[START, k, u, ŋ, STOP]"
53397,ǃkung,"[START, k, ʊ, ŋ, STOP]"


### Get bigram conditional probabilities for each pair of phonemes

In [6]:
# start by getting the set of all phonemes in the dataset.
# we will have one state for each phoneme.
phonemes = {p for pron in df.Pron for p in pron}
print(len(phonemes))
print(phonemes)

186
{'ʏ', 'ʃː', 'ð', 'ŭ̥', 'ɖ', 'lʲ', 'n̩', 'ɪ̯', 'tʰ', 'ʙ̩ː', 'd', 'u', 'eː', 'ɹ̩', 'ɕː', 'x', 'd̪', 'dʲ', 'ʌ̈', 'ɹ', 'm', 'k̚', 'ə̯', 'r̩', 'mː', 'ɣ', 'aː', 'c', 'ʌː', 'ɪˑ', 'p', 'ɛ', 't͡s', 'ʊ̩', 'l', 'ɡː', 'ɲ', 'bʲ', 's', 'ɔ', 'ɘ', 'l̩', 'l͡s', 'oˀ', 'pʰ', 'ʊ̪', 'j', 'ɜ', '˞', 'w', 'œ', 'æː', 'ɝː', 'ä', 'ɑ', 'ʈ', 'ɭ', 'ɝ', 'v', 'ɔ͡ɪ', 'ʃʰ', 'n', 'kᵊ', 'õ', 'p̬', 't͡ʃ̩̩', 'uː', 'ʉː', 'ú', 'ɵ̞', 'fː', 't͡ɕ', 'əˑ', 'ĭ̥', 'ʌ', 'ɑː', 'ɔ̃', 'o͡ʊ', 'h', 'ɑˑ', 'e', 'ŋ', 'o̞', 'ɛ͡ɹ', 'χ', 'ɡ', 'iː', 'q', 'ɫ̩', '~', 'ṽ', 'a', 'bᵊ', 'ɾ', 'ɒː', 'ɪ̈', 'ɔː', 'ʊ̯', 'ɪ̪', 'm̩ː', 'ɦ', 'ɱ', 'START', 'ɪ', 'ə', 'ɪː', 'ɑ̃', 'kʰ', 'nː', 'ə̆', 'ø', 'ʍ', 't͡ʃ', 'e͡ɪ', 'ɨ', 'əʳ', 'ô', 't', 'ɫ', 'ʀ', 'ɜː', 'oː', 'ɜ̃', 'ɵ', 'tː', 'ʊ̂', 'ɪʲ', 'ɽ', 'ɹʷ', 'ɒ̃', 'ɹː', '↓', 'ɬ', 'ɛ͡ɪ', 'ē', 'ɯ', 'ɛː', 'ĭ', 'ʉ', 'θ', 't̬ᵊ', 'ɚ', 'o', 'ĩ', 'ɛ̃', 'k', 'ʊ̆', 'd͡ʒ', 'ʊ', 'əː', 'z', 'e͜ɚ', 'm̩', 'ɛ̝', 'ɛˑ', 'sː', 'nˡ', 'tᵊ', 'ç', 'ʁ', 'i', 'îː', 'dʰ', 'æ', 't͡ʃʰ', 'STOP', 'äː', 'ʃ', 'ɒ', 'y', 'pː', 'r', 'ɐ', 'ɾ̃', 't̬'

Now get simple phoneme transition **counts** for every pair of phonemes.  Later, we will turn these into probabilities.

We will use [Laplace (add-one) smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), which means that every phoneme pair gets a "freebie" count of 1 to start with, before actually seeing any data.  Ask me why we might want to do such a thing.
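
To make the effect of add-one smoothing concrete, here is a minimal sketch with made-up counts for the transitions out of a single state.  The numbers and the names `raw_counts` and `normalize` are hypothetical; the point is only to compare the probabilities with and without the "freebie" count.

```python
# Toy counts for transitions out of one state (made-up numbers).
# "c" was never observed after this state.
raw_counts = {"a": 3, "b": 1, "c": 0}

def normalize(counts):
    """Turn a dict of counts into a dict of probabilities that sum to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

unsmoothed = normalize(raw_counts)
smoothed = normalize({k: v + 1 for k, v in raw_counts.items()})  # add-one

print("unsmoothed:", unsmoothed)  # the unseen transition gets probability 0
print("smoothed:  ", smoothed)    # every transition gets some probability
```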

In [7]:
# ptc = phoneme transition counts.
# every count is indexed by a tuple of 2 phonemes: (p1, p2).
# here, we start by initializing each count with 1: add-one smoothing.
ptc = {(p1,p2):1 for p1 in phonemes for p2 in phonemes}
ptc

{('ʏ', 'ʏ'): 1,
 ('ʏ', 'ʃː'): 1,
 ('ʏ', 'ð'): 1,
 ('ʏ', 'ŭ̥'): 1,
 ('ʏ', 'ɖ'): 1,
 ('ʏ', 'lʲ'): 1,
 ('ʏ', 'n̩'): 1,
 ('ʏ', 'ɪ̯'): 1,
 ('ʏ', 'tʰ'): 1,
 ('ʏ', 'ʙ̩ː'): 1,
 ('ʏ', 'd'): 1,
 ('ʏ', 'u'): 1,
 ('ʏ', 'eː'): 1,
 ('ʏ', 'ɹ̩'): 1,
 ('ʏ', 'ɕː'): 1,
 ('ʏ', 'x'): 1,
 ('ʏ', 'd̪'): 1,
 ('ʏ', 'dʲ'): 1,
 ('ʏ', 'ʌ̈'): 1,
 ('ʏ', 'ɹ'): 1,
 ('ʏ', 'm'): 1,
 ('ʏ', 'k̚'): 1,
 ('ʏ', 'ə̯'): 1,
 ('ʏ', 'r̩'): 1,
 ('ʏ', 'mː'): 1,
 ('ʏ', 'ɣ'): 1,
 ('ʏ', 'aː'): 1,
 ('ʏ', 'c'): 1,
 ('ʏ', 'ʌː'): 1,
 ('ʏ', 'ɪˑ'): 1,
 ('ʏ', 'p'): 1,
 ('ʏ', 'ɛ'): 1,
 ('ʏ', 't͡s'): 1,
 ('ʏ', 'ʊ̩'): 1,
 ('ʏ', 'l'): 1,
 ('ʏ', 'ɡː'): 1,
 ('ʏ', 'ɲ'): 1,
 ('ʏ', 'bʲ'): 1,
 ('ʏ', 's'): 1,
 ('ʏ', 'ɔ'): 1,
 ('ʏ', 'ɘ'): 1,
 ('ʏ', 'l̩'): 1,
 ('ʏ', 'l͡s'): 1,
 ('ʏ', 'oˀ'): 1,
 ('ʏ', 'pʰ'): 1,
 ('ʏ', 'ʊ̪'): 1,
 ('ʏ', 'j'): 1,
 ('ʏ', 'ɜ'): 1,
 ('ʏ', '˞'): 1,
 ('ʏ', 'w'): 1,
 ('ʏ', 'œ'): 1,
 ('ʏ', 'æː'): 1,
 ('ʏ', 'ɝː'): 1,
 ('ʏ', 'ä'): 1,
 ('ʏ', 'ɑ'): 1,
 ('ʏ', 'ʈ'): 1,
 ('ʏ', 'ɭ'): 1,
 ('ʏ', 'ɝ'): 1,
 ('ʏ', 'v'): 1,
 ('ʏ', 'ɔ͡ɪ'): 1,
 ('ʏ

In [8]:
# now add in the actual bigram counts
prons = df.Pron
for pron in prons:
    for ix in range(1,len(pron)):
        p1 = pron[ix-1]
        p2 = pron[ix]
        ptc[(p1,p2)] += 1

# sanity-check some of the counts.
print("START d", ptc[('START','d')])
print("d START", ptc[('d','START')])
print("f ɹ", ptc[('f','ɹ')])
print("ɹ f", ptc[('ɹ','f')])
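# note: the dictionary keys use the IPA symbol 'ɡ' (U+0261), not the ASCII letter 'g' shown in the labels.
print("g ɹ", ptc[('ɡ','ɹ')])
print("ɹ g", ptc[('ɹ','ɡ')])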
print("s l", ptc[('s','l')])
print("l s", ptc[('l','s')])

START d 3380
d START 1
f ɹ 500
ɹ f 153
g ɹ 875
ɹ g 125
s l 404
l s 135


Now transform the phoneme transition **counts** into phoneme transition **probabilities**.  

For every pair of phonemes $p_1, p_2$, we already have ptc($p_1$,$p_2$): the phoneme transition count from $p_1$ to $p_2$, i.e. the number of times we encountered the bigram $p_1 p_2$ in the data.  

From this, we want to obtain the transition **probability** $p(p_2|p_1)$, i.e. the probability of encountering $p_2$, given that we have just seen $p_1$.  Note the order of the two phonemes: our bigram is $p_1 p_2$, and we use the count for it (and other bigrams) to obtain $p(p_2|p_1)$.  

How?  This is a conditional probability, and following the Bishop chapter, or Jurafsky and Martin eqn 3.11, we may calculate it as:

\begin{equation}
p(p_2|p_1) = \frac{c(p_1 p_2)}{c(p_1)}
\end{equation}

where $c(\cdot)$ denotes count.  The numerator is simply the count for the bigram $p_1 p_2$.  The denominator is the count for $p_1$, whatever phoneme may follow it.  We have bigram counts, so we can easily obtain the denominator $c(p_1)$ as:

\begin{equation}
c(p_1) = \sum_{q \in P} c(p_1 q)
\end{equation}

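where $P$ is the set of all phonemes.  Since every count in ptc started at 1, these are smoothed counts, so no transition probability comes out as exactly zero.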
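
As a worked instance of the two equations above, here is a minimal sketch on a toy inventory of four states.  The data and the names `toy_phonemes`, `toy_prons`, `toy_ptc`, `count_of` and `cond_prob` are made up for illustration; the structure mirrors the tuple-keyed count dictionary used in this notebook, including the add-one initialization.

```python
toy_phonemes = ["START", "a", "b", "STOP"]
toy_prons = [
    ["START", "a", "b", "STOP"],
    ["START", "a", "STOP"],
    ["START", "b", "a", "STOP"],
]

# Add-one smoothing: every pair starts with a count of 1.
toy_ptc = {(p1, p2): 1 for p1 in toy_phonemes for p2 in toy_phonemes}
for pron in toy_prons:
    for p1, p2 in zip(pron, pron[1:]):
        toy_ptc[(p1, p2)] += 1

def count_of(p1):
    """c(p1): sum of the counts c(p1 q) over every phoneme q."""
    return sum(toy_ptc[(p1, q)] for q in toy_phonemes)

def cond_prob(p2, p1):
    """p(p2 | p1) = c(p1 p2) / c(p1)."""
    return toy_ptc[(p1, p2)] / count_of(p1)

# Row for "a": (a,START)=1, (a,a)=1, (a,b)=2, (a,STOP)=3, so c(a) = 7.
print(count_of("a"), cond_prob("STOP", "a"))
```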

In [9]:
# data structure for transition probabilities is a dict of dicts.
# the outer dict is keyed by phoneme (state), and has value inner dict.
# the inner dict is also keyed by phoneme, and holds the probability of moving from the outer phoneme
# to the inner phoneme, i.e. p(inner|outer).
# this structure should make it easy to convert counts to probs, normalize etc.

tp = dict()  # transition probs, outer dict.
for p1 in phonemes:
    d = dict()  # inner dict
    for p2 in phonemes:
        d[p2] = ptc[(p1,p2)]
    total = sum(d.values())
    d = {k: v / total for k, v in d.items()}  # normalize
    tp[p1] = d

In [10]:
# sanity check.
# the conditional probabilities p(p2|p1) should add up to 1.0 when summing across p2, for each p1.
print("Sanity check: do the conditional probabilities conditioned on each phoneme sum to 1.0?")
for p1 in phonemes:
    total = sum(tp[p1].values())
    print("Sum of conditional probabilities conditioned on", p1, total)

Sanity check: do the conditional probabilities conditioned on each phoneme sum to 1.0?
Sum of conditional probabilities conditioned on ʏ 1.0000000000000024
Sum of conditional probabilities conditioned on ʃː 1.0000000000000047
Sum of conditional probabilities conditioned on ð 0.999999999999999
Sum of conditional probabilities conditioned on ŭ̥ 1.0000000000000027
Sum of conditional probabilities conditioned on ɖ 1.0000000000000024
Sum of conditional probabilities conditioned on lʲ 1.0000000000000027
Sum of conditional probabilities conditioned on n̩ 0.9999999999999997
Sum of conditional probabilities conditioned on ɪ̯ 0.9999999999999978
Sum of conditional probabilities conditioned on tʰ 1.0000000000000024
Sum of conditional probabilities conditioned on ʙ̩ː 1.0000000000000027
Sum of conditional probabilities conditioned on d 1.000000000000001
Sum of conditional probabilities conditioned on u 0.9999999999999979
Sum of conditional probabilities conditioned on eː 0.9999999999999977
Sum of co

### Congratulations!  You have created a Markov chain.

Now use it to generate pseudo-words.

In [11]:
# specifically, let's generate 20 pseudo-words.
for i in range(20):
    # for each pseudo-word, we start by setting state to 'START'.
    state = 'START'
    print(state, end='')   # print state, but no newline.

    # now keep repeating this until you arrive at the 'STOP' state
    while (state != 'STOP'):
        next_state_options = list(tp[state].keys())
        next_state_probs = list(tp[state].values())
        new_state = np.random.choice(next_state_options, p=next_state_probs)
        print(" ", new_state, end='')
        state = new_state
    print(); print()

START  v  d  u  t  STOP

START  æ  n  STOP

START  v  STOP

START  ɑː  STOP

START  ə  n  STOP

START  ɪ  d  ɪ  l  ə  ɹ  ɪ  ɑ  n  STOP

START  t͡ʃ  STOP

START  m  ɪ  s  STOP

START  d  STOP

START  s  ʃ  ə  s  STOP

START  n  ə  t  l̩  STOP

START  æ  b  ɔ  ɹ  ʌ̈  æː  æ̃  ɛ͡ɹ  d̪  ĭ  ɝː  ɪ  k  ɛ  d  ʒ  a  ɪ  n  t  ɹ  i  e  ɪ  STOP

START  ɹ  ə  n  k  STOP

START  v  e  ɪ  k  æ  k  j  u  l  v  ə  v  ə  ɹ  d  a  ʊ  s  STOP

START  d  STOP

START  p  i  ə  f  ɛ  l  i  ə  n  e  ɪ  k  a  ɪ  n  ə  ɹ  ɛ  n  o  ʊ  t͡ʃ  STOP

START  t  STOP

START  d͡ʒ  uː  k  w  ɔ  ŋ  STOP

START  ə  l  i  ə  p  l  ɪ  n  d  ə  ɹ  o  ʊ  ɹ  o  ʊ  STOP

START  ɹ  iː  d  ʒ  ə  ɹ  ə  STOP



OK, that seemed to more or less work.  We get some reasonable-looking outputs, and also some that look a little strange.
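
If you want to reuse the generation loop, one option is to wrap it in a function.  The sketch below is an assumption about how that might look, not the code used above: the name `generate_pseudo_word`, the `max_len` guard, and the hand-made table `toy_tp` are all hypothetical.  It expects a dict of dicts shaped like `tp`, and uses a seeded NumPy `Generator` so that runs are repeatable.

```python
import numpy as np

def generate_pseudo_word(tp, rng, max_len=50):
    """Walk the chain from 'START' until 'STOP', or until max_len phonemes."""
    state = "START"
    word = []
    while len(word) < max_len:
        options = list(tp[state].keys())
        probs = list(tp[state].values())
        state = str(rng.choice(options, p=probs))  # sample the next state
        if state == "STOP":
            break
        word.append(state)
    return word

# Tiny hand-made transition table, just to exercise the function.
toy_tp = {
    "START": {"b": 1.0, "a": 0.0, "STOP": 0.0},
    "b": {"a": 1.0, "b": 0.0, "STOP": 0.0},
    "a": {"STOP": 0.5, "b": 0.5, "a": 0.0},
}

rng = np.random.default_rng(seed=0)
print(" ".join(generate_pseudo_word(toy_tp, rng)))
```

The `max_len` guard is a design choice: it keeps the loop from running for a very long time if `STOP` happens to be rarely reached.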

To highlight how natural even the stranger-looking outputs above were, we'll try the same thing again, but with the roles of $p_1$ and $p_2$ flipped in the counts.  That is, when we see the bigram $p_1 p_2$, we increment the count for $p_2 p_1$.

What does this do?  It rewards or strengthens transitions that tend to show up in the **opposite order** from that actually encountered.  
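
Mechanically, if the counts are arranged as a matrix with one row per "from" state and one column per "to" state, the swap amounts to transposing that matrix before normalizing each row.  The sketch below shows this on made-up counts; the matrix values and the names `states`, `counts`, `row_normalize`, `tp_forward` and `tp_reversed` are hypothetical and only meant to illustrate the bookkeeping.

```python
import numpy as np

states = ["START", "a", "b", "STOP"]

# counts[i, j] = smoothed count of the bigram states[i] states[j] (made-up numbers).
counts = np.array([
    [1, 3, 2, 1],
    [1, 1, 2, 3],
    [1, 2, 1, 2],
    [1, 1, 1, 1],
], dtype=float)

def row_normalize(m):
    """Divide each row by its sum so that every row sums to 1."""
    return m / m.sum(axis=1, keepdims=True)

tp_forward = row_normalize(counts)     # row i is built from count(states[i] states[j])
tp_reversed = row_normalize(counts.T)  # row i is built from count(states[j] states[i])

print(tp_forward.round(3))
print(tp_reversed.round(3))
```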

What sort of output should we predict?

**[ YOUR ANSWER GOES HERE. ]**

In [12]:
# generate transition probabilities from counts again, but this time with p1 and p2 swapped, 
# as described above.  this will be a lightly but strategically edited variant of a code box
# above.
tp_swapped = dict()  # transition probs
for p1 in phonemes:
    d = dict()
    for p2 in phonemes:
        d[p2] = ptc[(p2,p1)]
    total = sum(d.values())
    d = {k: v / total for k, v in d.items()}  # normalize
    tp_swapped[p1] = d

In [13]:
# now generate 20 pseudo-words with these swapped transition probabilities.
for i in range(20):
    # for each pseudo-word, we start by setting state to 'START'.
    state = 'START'
    print(state, end='')   # print state, but no newline.

    # now keep repeating this until you arrive at the 'STOP' state
    while (state != 'STOP'):
        next_state_options = list(tp_swapped[state].keys())
        next_state_probs = list(tp_swapped[state].values())
        new_state = np.random.choice(next_state_options, p=next_state_probs)
        print(" ", new_state, end='')
        state = new_state
    print(); print()

START  c  d  START  ɹ  ə  START  e͡ɪ  n  ɪ  t  ɪ  ɪ  l  k  ɪ  a  l  ɪ  t  ɪ  ʒ  d  ɛ  v  ɛ  ɛ̃  ɹʷ  ɡ  ŋ  ɪ  d  eː  ːs  ɹʷ  ʊ̆  õ  ɪ̯  e  ɹ  ə  z  n  ə̯  eː  bʲ  ɪʲ  tʰ  kʰ  ʌ̈  i  n  uː  f  START  ɑ̃  o͡ʊ  ʌː  ʃʰ  ɒ̃  θ  ɛ  h  ʊ  ə  l  ə  t  n  ʌ  p  START  t͡s  c  b  æ  v  START  nˡ  ɫ  ə  t  ɪ  d  START  ɑ  p  START  ɜ  k  ɪ  s  ə  ɹ  t  ɪ  m  ʌ  s  n  ə  t͡ʃ  ↓  ʉ  bᵊ  ɛː  i  s  ɪ  a  t  k  START  dʲ  r  t  k  START  ɽ  ɣ  əˑ  c  STOP

START  ŋ  ɪ  l  ɑ  f  START  ð  ɪ  a  d  ɝ  w  ɪ  m  START  ↓  sː  s  START  ʊ  o  START  pʰ  e͜ɚ  ːs  b  æ  ɡ  ɽ  ɨ  kᵊ  lʲ  ʉ  ĭ  ɡ  START  u  m  ə  m  START  t͡ʃ  ɪ  ɹ  ɜː  ɬ  œ  uː  n  iː  z  ɪ  ʃ  ɪ  e  START  kʰ  k  ɹ  ɡ  ʌ  START  ɝː  bᵊ  l  æ  l  ɪ  n  ɪ  a  l  ə  s  ɒ  l  ə  v  æ  n  ə  b  START  ð  ɪ  t  ɪ  ɹ  ə  ɹ  ɛ  j  h  ə  d  ɛ  v  ɻ  t̬ᵊ  əˑ  ʊ̆  ʀ  nː  ʒ  d  ɪ  b  START  w  ə  d  n  ə  k  ɑ  ɹ  ɡ  ɔ  f  ɪ  s  k  ɪ  v  START  ʀ  ɔ̃  d͡ʒ  ɔ  m  ə  k  ɪ  l  ɪ  b  ə  k  ʌ  l  ɛ  ɹ  t  ɛ  n  ɪ  d  ɪ  e  l  ɹ  START  ɣ  ŋ  ʌ  ɹ  ɑ  s  ɪ  e

START  ɛ͡ɪ  ĭ̥  ʌ  d  START  ɑ  l  t  uː  j  START  æː  ɹ̩  œ  fː  bᵊ  t̬ᵊ  ɱ  ɔ  t͡ʃʰ  n̩  d  START  ɪː  äː  ɑ  v  ɹ  ɑ  START  r  kᵊ  f  l  k  ɪ  b  START  ĭ̥  ɛ̝  t͡ʃ  h  START  sː  k̚  j  m  ʊ  ə  m  uː  j  k  START  ð  ɪ  e  m  ɪ  ɡ  ɹ  ə  v  i  m  d  n  iː  START  START  ~  ɣ  ɝː  p  s  ɪ  a  l  ɡ  START  ɑ̃  ɾ  əˑ  ɹ  p  s  START  m̩ː  ɑ  START  ʃʰ  ɑ  b  a  d  START  θ  START  j  l  æ  f  START  ɛ͡ɪ  ɑˑ  q  ɝː  ɒ̃  ɪ̯  e  z  ɪ  b  START  a  ɹ  ə  m  ɪ  ʃ  ɪ  n  ə  ɹ  ɡ  ɪ  e  ɹ  ɔ  START  ɜ̃  äː  ɛˑ  ɾ  ʃ  START  l͡s  ɑˑ  ɻ  n  ə  d͡ʒ  ɹʷ  ʙ̩ː  ɑ̃  x  ʊ̂  ʊ̆  ɜ  t  ɪ  e  l  ɒ  v  ə  z  uː  t  s  k  START  ɒ  l  ə  n  START  ɯ  ɹː  ʌ̈  ʁ  ŋ  ɪ  ɹ  p  START  nˡ  ɕː  tᵊ  t͡ɕ  ʊ̪  d͡ʒ  START  ɣ  h  START  ʔ  ːs  ʈ  ä  m̩ː  d̪  ə̯  ɪː  ʌ  k  ə  s  ə  ʃ  p  START  t̬ᵊ  t  ɛ  s  START  b  ɪ  l  k  s  æ  START  ɒ  l  ɛ  t  s  START  ʈ  ɑ̃  c  ɒ  START  m̩ː  ˞  dʲ  ɡ  START  ɛ̃  pː  θ  θ  æ  d  n  ɪ  d  ɚ  v  ɝ  w  ɛ  s  ɪ  START  ʏ  ɪ̯  ɑ  b  START  ʊ̂  e͜ɚ  ɭ  ɑ̃  ɘ  ɝː  k  s  ə  m  ə

START  ɪ̪  ɕː  ɫ  ɽ  tᵊ  ɒː  l͡s  ɜ̃  ɘ  ɛ͡ɹ  iː  r  ˞  d͡ʒ  START  ɡː  s  ə  iː  t͡ʃ  w  START  ɡː  iː  t  k  START  START  d͡ʒ  b  START  ɒ̃  o  n  ə  ʒ  d  ē  ɦ  ɫ̩  START  t͡ʃʰ  ú  ↓  ɾ  ʊː  ú  t͡ʃ̩̩  əː  ʃ  t  s  n  ə  START  y  e͜ɚ  t͡ʃ̩̩  ʊ̆  tᵊ  v  ɪ  t  s  æ  m  iː  w  k  ɪ  l  æ  START  ɔː  ṽ  e  t  ɪ  l  ə  k  START  d  ɹ  ɜː  w  k  ʊ  o  ɹ  START  ʊ̪  ã  θ  ə  ɡ  ŋ  æ  START  m̩  h  START  ɑ̃  z  iː  s  n  ɚ  d  n  ə  z  l  ʌ  ɹ  START  m  æ  START  ɵ  æː  ʔ  îː  ɖ  d  ə  n  ə  v  ɪ  ɛ  START  ɚː  bʲ  ɪ̯  ɹ̩  ɬ  l͡s  ĩ  ɭ  θ  ɚ  s  ə  ɹ  ʊ  o  START  tʰ  p̬  ɑː  v  ɪ  l  b  START  n  ə  b  m  ø  fː  ʉ  j  p  START  z  n  ə  ɡ  ɚː  ĩ  c  ˞  n̩  d  ʌ  m  ɒ  START  ʊ̪  ɘ  ʊ̆  d̪  ɵ  ʊː  ʌː  ɛˑ  ɑ̃  j  t  n  ə  ɪ  h  START  t͡ʃ̩̩  ɛ͡ɹ  ɛ͡ɪ  t͡ʃ̩̩  k  ɪ  d  ɹ  ə  START  ɕː  ˞  ɒ  k  ə  START  ʊ  ə  t  s  ɪ  e  l  n  ɛ  v  ɪ  e  b  ə  m  START  ã  ɱ  ɵ  ʌː  bʲ  a  l  ə  m  ə  v  START  ä  z  ɪ  a  s  i  s  START  ɒː  õ  ɜː  ɑ  k  s  ʊ  b  ə  ɡ  ɑ  l  ə  d͡ʒ  b  ɪ̈  ɚ  t  START  ~

START  æ̃  ɜː  f  START  ð  tʰ  dʰ  p  ɪ  ɹ  k  ɪ  l  ɪ  a  f  START  tː  r̩  ä  p̬  əʳ  ə̯  ɲ  ɦ  ɻ  bʲ  ɕː  ĩ  v  m  ɪ  w  START  ɕː  l̩  b  ʌ  d  n  ə  l  v  START  ʌː  ɑ  ɡ  START  d  ɪ  ɔ  START  ə  k  ɝ  n  ə  v  START  ɬ  ô  kᵊ  ɽ  i  l  ə  s  START  ɪʲ  pː  bʲ  d  ɪ  n  æ  START  uː  ʌː  iː  ɹ  START  ô  o̞  ʌ  START  ɖ  oː  ɚː  ã  ʒ  d  l  æ  f  START  b  p̬  ɻ  j  k  START  ɹː  θ  ʊ  o  ɹ  t  ʃ  ɪ  d  START  ç  ç  χ  ɛˑ  b  ɔ  START  ʊː  t͡ʃ  æ  l  ɪ  t  l  i  m  æ  START  ʊ  a  t  n  ɑː  n  ɪ  p  START  əˑ  ɬ  õ  tʰ  e͡ɪ  ʊ  ɜ  ɹ̩  ʌ̈  ŋ  ɪ  e  m  z  ɔ  s  ɪ  a  ɹ  f  æ̃  ø  ɝ  m  ɹ  START  ø  n̩  b  ɪ  f  START  aː  v  n  ə  n  ə  n  ɪ  s  k  æ  START  ɐː  ɛ̃  l͡s  ɪ  a  START  æː  χ  ʊ̩  œ  pʰ  ɻ  d  ɪ  t  ɪ  s  START  h  b  START  ɑˑ  ɚ  ð  ɪ  n  ʌ  b  ɹ  ɑ  ɹ̩  ʉ  t͡ɕ  ʃʰ  d̪  i  l  ə  z  b  START  n  ə  d  ə  START  ʒ  d  ɑ  n  ə  k  ɑ  ɹ  d  n  uː  ɡ  START  ʃ  START  ʀ  ɫ̩  dʲ  ã  ɭ  ɡː  pː  ʏ  u  t  ɪ  s  START  e͜ɚ  ɾ  u  j  n  ɪ  s  ɪ  t  ɹ  ɑ  s  START  ɹː  pʰ  ø 

START  t͡ʃʰ  ɪ̈  ʁ  ɔ̃  äː  ɻ  ʁ  m  START  õ  ɛː  mː  ɪ̯  a  ɡ  START  aː  əˑ  ð  ɛ  s  ɛ  ɑː  b  START  i  d  l  ə  k  START  ɑˑ  ɛ͡ɹ  y  p  START  ɜ̃  l  ɛ  s  ɪ  ʃ  t  æ  f  ɹ  b  ʊ  ə  ʃ  r̩  ē  ʃʰ  j  b  START  ɛˑ  ɭ  ɹ̩  ʀ  ĩ  ṽ  ʊ̪  sː  ʊ̩  ɵ  k̚  aː  îː  uː  ɹ  ɒ  k  START  ɔ̃  o͡ʊ  d  n  ə  ʃ  k  ʊ  o  i  d̪  ɛ̃  χ  ɪ̯  eː  äː  ˞  q  ɛˑ  æ  p  ɛ  START  a  s  n  n  i  ɡ  START  ē  ɜː  v  ɪ  a  l  ɑ  k  START  ʊ̂  ↓  tʰ  æː  ɾ  ɣ  q  ɵ  t͡ɕ  ɹː  s  ə  n  ʊ  o  n  START  e͡ɪ  ð  ɑː  b  ɪ  e  ɹ  p  ʊ  o  f  n  ɪ  e  START  ʁ  ã  q  əː  ɪˑ  p  æ  ɡ  START  ɛ̃  k  l  æ  ɡ  ø  t̬ᵊ  ɪʲ  ɡː  ɣ  ʍ  ʊ̆  ɜ  k  ə  m  ə  k  ɹ  ə  t  x  ə  START  m̩ː  æ  START  ʏ  ĭ  t̬ᵊ  ɑː  d  n  s  START  ɡ  n  ə  n  ə  j  ɛ̝  ʉ  əː  m  START  ɦ  ô  ɪʲ  ɵ  ɹː  ə̯  ɐ  ɫ  ə  m  æ  ɡ  ɛ  b  æ  m  ʌ  m  START  ə̯  t͡ɕ  θ  ĩ  ɑˑ  ɾ  ɪ  e  h  n  ɛ  ð  ɛ  m  START  b  ʊ  b  START  e  ɹ  ɔ  d  n  ɪ  d͡ʒ  ə  v  uː  j  p  e  ɹ  t  iː  t  ɪ  a  p  START  iː  ɹ  b  z  d  ɪ  l  ə  f  n  ə  f  l  ɪ  p  iː  d  ɪ  e  v

START  ɭ  ʈ  ɑ̃  ʊ  j  n  æ  l  ɡ  ə  n  ə  ʃ  t  p  ɪ  t  ɛ  s  START  θ  START  əː  ʃ  t  START  a  h  ʌ̈  y  ʁ  ʍ  START  tᵊ  ʁ  ä  ɚ  t  ə  l  b  d̪  e͡ɪ  ã  nː  ʔ  ɑː  m  ɚ  ɾ  s  ə  i  l  t  s  n  START  ú  ɚ  ð  oː  ɡ  ŋ  ɪ  z  ɡ  START  ʊ̩  r̩  c  m  æ  START  əʳ  aː  ɹː  ɝ  m  START  iː  b  START  e  n  d  n  i  θ  n  START  ɔː  l  ʙ̩ː  θ  æ  iː  ʒ  d  uː  j  l  ɪ  e  h  b  æ  m  ɹ  START  i  d  b  ɪ  ɹ  ə  v  t̬  ʒ  d  ɪ  a  ɹ  ɔ  z  i  ɹ  ə  ɹ  ɔ  f  ɪ  m  s  START  eː  ē  ɑ  s  ʊ  o  h  START  ɦ  t͡ʃ̩̩  n  ʌ  ɹ  ɑː  ˞  d͡ʒ  n  ɪ  s  START  o̞  əˑ  aː  nː  ʊ̯  ʌ̈  o̞  n  æ  p  START  f  ɪ  t  START  ɒː  dʲ  ʉ  ɜ  f  ɪ  n  ʊ  o  l  t  START  ʊ̆  l  ə  t  ɑ  m  ə  n  ɪ  e  v  u  j  n  ə  l  f  START  t͡ʃʰ  iː  l  s  ɛ  ɹ  ɛ  h  k  ɛ  p  START  dʰ  fː  ɔ  n  ə  d  ɹ  ə  m  ɪ  e  b  æ  START  ʏ  ɡː  ɾ̃  ɘ  sː  ɾ  ɛˑ  ɘ  ɾ  tᵊ  d  ə  d  n  ʌ  k  ɔ  k  START  kʰ  ɪː  ē  ɯ  ɵ̞  r̩  t͡ʃ  ɪ  z  ɚ  θ  n  ʌ  m  æ  START  l  ə  t  ə  l  ə  m  ɑː  s  ə  b  m  START  t͡ɕ  l͡s  ʈ  ʌː  ʈ  ɑ

START  ɪˑ  t͡ʃʰ  ʏ  nː  k̚  ŭ̥  ɣ  ŋ  ɒ  i  ɹ  ɛː  ĭ̥  ɘ  ɘ  l  ə  ʃ  l̩  t  k  ɝ  z  ɪ  w  s  ɹ  ɡ  ɚ  v  START  ɚː  l  ə  d  START  oˀ  ʏ  q  i  n  ɑ  b  æ  l  ə  t  START  ɪ̈  ɡː  ʊ̯  o  m  ə  ɛ  t  ɪ  ʃ  t  ɪ  k  START  l͡s  bᵊ  ɜ̃  ɑˑ  mː  ʌ̈  t͡ɕ  ʁ  ɱ  sː  l͡s  ɭ  ɒ  ɹ  ɔ  s  ə  ʃ  ɪ  v  ɪ  p  ɪ̯  ɭ  mː  n  ə  l  k  æ  n  æ  p  ɒ  n  ɪ  ʒ  d  n  ɪ  e  l  ʌ  j  ↓  æ  m  START  ʃʰ  oː  t  ʌ  d  l  ɪ  e  l  ɛ  t  ɛ  t  n  START  ē  f  START  ˞  ɖ  ɘ  ē  ŭ̥  k  ʊ  o  p  ə  t  ɛ  s  ɛ  əʳ  ʈ  ɘ  ʔ  t̬  fː  ʊ̯  ɒ̃  o  ɹ  k  ɑ  h  START  ã  ʊ̩  îː  ɛː  ɪ̪  l͡s  ú  oˀ  j  ç  ŭ̥  mː  ʉː  kᵊ  mː  ɪˑ  p̬  t͡ʃʰ  ɖ  START  ʌ  b  æ  k  START  p̬  e͡ɪ  ɜː  START  oˀ  ɑ  d  ʊ  f  START  r  ɜ  o͡ʊ  ɒː  k  START  ə̆  ɹ  ɛ  ʒ  d  START  ʊ  o  t͡ʃ  ə  s  uː  f  ʊ  o  d  ɹ  d  START  v  ə  m  u  j  START  ĭ̥  y  ʙ̩ː  ē  ɜ̃  ē  t  START  ô  ɔ͡ɪ  s  ɪ  a  b  æ  START  ɫ̩  ʙ̩ː  sː  tʰ  ɹː  ɑ̃  ʍ  s  ə̯  b  ɜː  l͡s  ʙ̩ː  ɝ  v  l  uː  d  ɛ  s  t  n  æ  s  ɪ  a  s  ɪ  a  l  ɑ  b  ɑ  z  ɡ  uː  j  START  χ 

# Observations

- Write your general observations about the results here.  A few short sentences would be fine.

# Extensions (optional)

- Do the same thing for other languages.  See [here](https://github.com/kylebgorman/wikipron/tree/master/data) for analogous WikiPron data from other languages, in the same format as the data we've used here.
- Do the same thing for words rather than phonemes.  You could download a book in plain text from [Project Gutenberg](https://www.gutenberg.org/ebooks/search/?sort_order=downloads), and proceed from there.
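
For the word-level extension, here is a minimal sketch of one possible starting point.  It is an assumption-laden toy: the text is a made-up stand-in for a real book, sentences are split naively on periods, there is no smoothing, and the names `word_counts` and `generate_sentence` are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Stand-in text (made up); a real run would read a plain-text book instead.
text = "the cat sat on the mat. the dog sat on the cat. the cat saw the dog."

# Treat each sentence as one sequence, with START and STOP markers.
sentences = [s.split() for s in text.split(".") if s.strip()]
sequences = [["START"] + words + ["STOP"] for words in sentences]

# Count word bigrams (no smoothing here, to keep the sketch short).
word_counts = defaultdict(Counter)
for seq in sequences:
    for w1, w2 in zip(seq, seq[1:]):
        word_counts[w1][w2] += 1

def generate_sentence(rng, max_len=30):
    """Sample words from the bigram counts until STOP or max_len words."""
    state, out = "START", []
    while len(out) < max_len:
        followers = word_counts[state]
        state = rng.choices(list(followers.keys()),
                            weights=list(followers.values()))[0]
        if state == "STOP":
            break
        out.append(state)
    return out

rng = random.Random(0)
print(" ".join(generate_sentence(rng)))
```

Because the counts here are unsmoothed, the sampler can only follow word pairs that actually occur in the text; adding the same add-one initialization used for phonemes would change that.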