# Generate pseudo-words using a Markov chain 

- We will read in a dataset of English words, with pronunciations, calculate phoneme transition probabilities from the data, and then use those phoneme transition probabilities to generate pseudo-words that follow the phonotactic probabilities of English.
- This is essentially a simple Markov chain.  Every phoneme is a state - specifically, being in the state corresponding to phoneme $p$ means that we have just seen phoneme $p$.  Transition probabilities from state $p$ to some other state $q$ are determined from bigrams in the data.  How?
- A bigram is a pair of consecutive symbolic units - in this case phonemes.  Thus transition probabilities such as $P(p_2|p_1)$ represent the probability of moving from the state that denotes our having just seen phoneme $p_1$ to the state that denotes our having just seen phoneme $p_2$;  this probability is determined by the relative frequency of encountering $p_2$ in the data, given that we have just seen $p_1$.
- This notebook again uses [WikiPron](https://github.com/kylebgorman/wikipron), but this time we consider US pronunciation for English.  Analogous data for many other languages, in the same format, is available [here](https://github.com/kylebgorman/wikipron/tree/master/data), which you can download and use for extensions.
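
As a minimal sketch of the bigram idea described above, the snippet below pulls the consecutive phoneme pairs out of a single pronunciation. The helper name `bigrams` and the toy pronunciation are illustrative assumptions, not part of the notebook's own code.

```python
def bigrams(seq):
    """Return the list of consecutive pairs (seq[i], seq[i + 1])."""
    return list(zip(seq, seq[1:]))

# toy pronunciation (illustrative), already split into phonemes
pron = ['k', 'æ', 't']
pairs = bigrams(pron)
print(pairs)  # [('k', 'æ'), ('æ', 't')]
```

Each pair contributes one observation of a transition from the first phoneme's state to the second phoneme's state.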

In [1]:
import re  # not necessary here
import pandas as pd
import numpy as np
import unicodedata
import random  # useful for sampling from a distribution

### Read in data

And format it so that pulling out phoneme transitions will be easy.

In [2]:
# read in english words.  
df = pd.read_csv("eng_us_phonemic.tsv", sep='\t', names=['Word', 'Pron'])
df.Pron = df.Pron.str.split()  # turn space-separated strings into lists of phonemes
df

Unnamed: 0,Word,Pron
0,$deity,"[d, e, ɪ, ɪ, t, i]"
1,$deity,"[d, iː, ɪ, t, i]"
2,'cause,"[k, ʌ, z]"
3,'dswounds,"[d, z, w, u, n, d, z]"
4,'em,[m̩]
...,...,...
53394,œconomy,"[ɪː, k, ɒ, n, ɒ, m, i]"
53395,œneus,"[ɛ, n, j, uː, s]"
53396,ǃkung,"[k, u, ŋ]"
53397,ǃkung,"[k, ʊ, ŋ]"


In [3]:
# sanity check
# find the US pronunciation of the word "aardvark"
df[df.Word=='aardvark']

Unnamed: 0,Word,Pron
67,aardvark,"[ɑ, ɹ, d, v, ɑ, ɹ, k]"


In [4]:
# sanity check - and a chance to introduce the very useful pandas "sample" routine.
# find the US pronunciation of several other words.
df.sample(n=25)  # pick 25 random samples

Unnamed: 0,Word,Pron
6842,biota,"[b, a, ɪ, ə, ʊ, t, ə]"
5959,baster,"[b, e, ɪ, s, t, ɚ]"
22706,gular,"[ɡ, j, uː, l, ə, r]"
16148,druggeting,"[d, ɹ, ʌ, ɡ, ɪ, t, ɪ, ŋ]"
29932,madd,"[m, æ, d]"
24349,hooverize,"[h, u, v, ɚ, a, ɪ, z]"
26871,jennifer,"[d, ʒ, ɛ, n, ɪ, f, ɚ]"
44560,snooze,"[s, n, uː, z]"
25102,idolatry,"[a, ɪ, d, ɑ, l, ə, t, ɹ, i]"
49925,uneasy,"[ʌ, n, iː, z, i]"


In [5]:
# now add START and STOP pseudo-phonemes at beginning and end of each word.
# a chance to introduce the very useful pandas "apply" routine.
df.Pron = df.Pron.apply(lambda x: ['START'] + x + ['STOP'])
df

Unnamed: 0,Word,Pron
0,$deity,"[START, d, e, ɪ, ɪ, t, i, STOP]"
1,$deity,"[START, d, iː, ɪ, t, i, STOP]"
2,'cause,"[START, k, ʌ, z, STOP]"
3,'dswounds,"[START, d, z, w, u, n, d, z, STOP]"
4,'em,"[START, m̩, STOP]"
...,...,...
53394,œconomy,"[START, ɪː, k, ɒ, n, ɒ, m, i, STOP]"
53395,œneus,"[START, ɛ, n, j, uː, s, STOP]"
53396,ǃkung,"[START, k, u, ŋ, STOP]"
53397,ǃkung,"[START, k, ʊ, ŋ, STOP]"


### Get bigram conditional probabilities for each pair of phonemes

In [6]:
# start by getting the set of all phonemes in the dataset.
# we will have one state for each phoneme.
phonemes = {p for pron in df.Pron for p in pron}  # for every list in df.Pron; for every phoneme in that list
print(len(phonemes))
print(phonemes)

186
{'y', 'ɡ', 't͡ʃ̩̩', 'əː', 'z', 'p', 't͡s', 'b', 'ɛː', 'ʊ̂', 'ʊ̯', 'ɛˑ', 'ɜː', 'ɻ', 'ʁ', 'bᵊ', 'n̩', 'ɪ', 'ɹ', 'ĭ', 'd̪', 'nˡ', 't̬', 'ʌː', 'r', 'ɔ͡ɪ', 's', 'nː', 't', 'ɒ', 't͡ɕ', 'e͜ɚ', 'ɫ̩', 'f', 'l͡s', 'ɹː', 'ɪˑ', 'ɪː', 'ʙ̩ː', 'õ', 'ə̯', 'ɝ', 'j', 'uː', 'd͡ʒ', 'ø', 'ɑ̃', 'ɲ', 'ɒ̃', 'ɑː', 'pː', 'æ', 'o', 'ē', 'dʰ', 'ɑ', 'ɛ͡ɪ', 'ɕː', 'START', 'ʊ̪', 'm̩', 'ç', 'ɒː', 'ɪʲ', 'ɖ', 'ʒ', 'l̩', 'îː', 'm', 'ô', 'ĭ̥', 'œ', 'ɐː', 'a', 'ːs', 'ɭ', 'ɛ', 'ú', 'ɛ̝', 'q', 'i', 'aː', 'e͡ɪ', 'æ̃', 'ɘ', 'ɱ', 'e', 'ṽ', 'ɾ', 'ʀ', 't͡ʃ', 'ʈ', 'c', 'ä', 'ɦ', 'ə', 'ɔː', 'sː', 'ɣ', 'ɛ̃', 'n', 'kᵊ', 'p̬', 'k', 'ŭ̥', 'ɪ̪', 'mː', 'æː', 'd', 'ʃː', 'dʲ', 'pʰ', 'ʃ', 'ã', 'o̞', 'ŋ', 'tː', 'v', 'əʳ', 'ɵ̞', 'STOP', 'θ', 'ɽ', 'fː', 'u', '~', 'ɫ', 'lʲ', 'äː', '˞', 'tᵊ', 'bʲ', 'ʉ', 'χ', 'ɐ', 'm̩ː', 'ɜ̃', 'k̚', 'ʌ̈', 'ʔ', 'ʌ', 'oˀ', 'ə̆', 'ʊː', 'ɝː', 't͡ʃʰ', 'ɵ', 'tʰ', 'kʰ', 'ɯ', 'ð', 'ɔ̃', 'ɛ͡ɹ', 'ɹʷ', 'ʊ̩', 'ɨ', 'oː', 'iː', '↓', 'ʏ', 'ɜ', 'ɪ̯', 'ɔ', 'əˑ', 'ʍ', 'h', 'r̩', 'ɪ̈', 'ɹ̩', 'ɑˑ', 'ʊ̆', 'ɡː', 'ʉː', 'ɚ', 'ɬ', '

Now get simple phoneme transition **counts** for every pair of phonemes.  Later, we will turn these into probabilities.

We will use [Laplace (add-one) smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), which means that every phoneme pair gets a "freebie" count of 1 to start with, before actually seeing any data.  Ask me why we might want to do such a thing.
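
To make the effect of add-one smoothing concrete, here is a small sketch on made-up data. The three-phoneme inventory and the observed bigrams are assumptions chosen only for illustration. It compares the estimate of $p(q|a)$ with and without the extra count.

```python
from collections import Counter

toy_phonemes = ['a', 'b', 'c']                   # hypothetical inventory
observed = [('a', 'b'), ('a', 'b'), ('a', 'c')]  # hypothetical bigram observations

raw = Counter(observed)  # unseen bigrams count as 0

# p(q | 'a') without smoothing: the unseen bigram ('a', 'a') gets probability 0
raw_total = sum(raw[('a', q)] for q in toy_phonemes)
p_raw = {q: raw[('a', q)] / raw_total for q in toy_phonemes}

# p(q | 'a') with add-one smoothing: every bigram starts with a count of 1
smoothed_total = sum(raw[('a', q)] + 1 for q in toy_phonemes)
p_smoothed = {q: (raw[('a', q)] + 1) / smoothed_total for q in toy_phonemes}

print(p_raw)
print(p_smoothed)
```

Without smoothing, any bigram that happens to be absent from the data becomes impossible to generate. There is also a practical consequence in this notebook's setup: STOP never begins a bigram, so without smoothing its row total would be zero and normalizing that row would divide by zero.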

In [7]:
# ptc = phoneme transition counts.
# this structure is a dict.
# its key is a tuple of 2 phonemes: (p1, p2), and the corresponding value is the count for those 2 phonemes.
# here, we start by initializing each count with 1: add-one smoothing.
# MODIFY THIS NEXT PART TO INITIALIZE ALL COUNTS TO 1.
ptc = {}  # YOUR CODE GOES HERE.

for phoneme1 in phonemes:
    for phoneme2 in phonemes:
        ptc[(phoneme1, phoneme2)] = 1

print(len(ptc)) # sanity check
        
ptc

34596


{('y', 'y'): 1,
 ('y', 'ɡ'): 1,
 ('y', 't͡ʃ̩̩'): 1,
 ('y', 'əː'): 1,
 ('y', 'z'): 1,
 ('y', 'p'): 1,
 ('y', 't͡s'): 1,
 ('y', 'b'): 1,
 ('y', 'ɛː'): 1,
 ('y', 'ʊ̂'): 1,
 ('y', 'ʊ̯'): 1,
 ('y', 'ɛˑ'): 1,
 ('y', 'ɜː'): 1,
 ('y', 'ɻ'): 1,
 ('y', 'ʁ'): 1,
 ('y', 'bᵊ'): 1,
 ('y', 'n̩'): 1,
 ('y', 'ɪ'): 1,
 ('y', 'ɹ'): 1,
 ('y', 'ĭ'): 1,
 ('y', 'd̪'): 1,
 ('y', 'nˡ'): 1,
 ('y', 't̬'): 1,
 ('y', 'ʌː'): 1,
 ('y', 'r'): 1,
 ('y', 'ɔ͡ɪ'): 1,
 ('y', 's'): 1,
 ('y', 'nː'): 1,
 ('y', 't'): 1,
 ('y', 'ɒ'): 1,
 ('y', 't͡ɕ'): 1,
 ('y', 'e͜ɚ'): 1,
 ('y', 'ɫ̩'): 1,
 ('y', 'f'): 1,
 ('y', 'l͡s'): 1,
 ('y', 'ɹː'): 1,
 ('y', 'ɪˑ'): 1,
 ('y', 'ɪː'): 1,
 ('y', 'ʙ̩ː'): 1,
 ('y', 'õ'): 1,
 ('y', 'ə̯'): 1,
 ('y', 'ɝ'): 1,
 ('y', 'j'): 1,
 ('y', 'uː'): 1,
 ('y', 'd͡ʒ'): 1,
 ('y', 'ø'): 1,
 ('y', 'ɑ̃'): 1,
 ('y', 'ɲ'): 1,
 ('y', 'ɒ̃'): 1,
 ('y', 'ɑː'): 1,
 ('y', 'pː'): 1,
 ('y', 'æ'): 1,
 ('y', 'o'): 1,
 ('y', 'ē'): 1,
 ('y', 'dʰ'): 1,
 ('y', 'ɑ'): 1,
 ('y', 'ɛ͡ɪ'): 1,
 ('y', 'ɕː'): 1,
 ('y', 'START'): 1,
 ('y', 

In [8]:
# now add in the actual bigram counts
prons = df.Pron
for pron in prons:
    # YOUR CODE GOES HERE.
    # increment the ptc count for (p1,p2) each time you encounter the bigram "p1 p2" in the data.
    for i in range(len(pron) - 1):
        ptc[(pron[i], pron[i + 1])] += 1
    

# sanity-check some of the counts.
print("START d", ptc[('START','d')])
print("d START", ptc[('d','START')])
print("f ɹ", ptc[('f','ɹ')])
print("ɹ f", ptc[('ɹ','f')])
print("g ɹ", ptc[('ɡ','ɹ')])
print("ɹ g", ptc[('ɹ','ɡ')])
print("s l", ptc[('s','l')])
print("l s", ptc[('l','s')])

START d 3380
d START 1
f ɹ 500
ɹ f 153
g ɹ 875
ɹ g 125
s l 404
l s 135


Now transform the phoneme transition **counts** into phoneme transition **probabilities**.  

For every pair of phonemes $p_1, p_2$, we already have ptc($p_1$,$p_2$): the phoneme transition count from $p_1$ to $p_2$, i.e. the number of times we encountered the bigram $p_1 p_2$ in the data.  

From this, we want to obtain the transition **probability** $p(p_2|p_1)$, i.e. the probability of encountering $p_2$, given that we have just seen $p_1$.  Note the order of the two phonemes: our bigram is $p_1 p_2$, and we use the count for it (and other bigrams) to obtain $p(p_2|p_1)$.  

How?  This is a conditional probability, and following the Bishop chapter, or Jurafsky and Martin eqn 3.11, we may calculate it as:

\begin{equation}
p(p_2|p_1) = \frac{c(p_1 p_2)}{c(p_1)}
\end{equation}

where $c(\cdot)$ denotes count.  The numerator is simply the count for the bigram $p_1 p_2$.  The denominator is the count for $p_1$, regardless of which phoneme follows it.  Since we have bigram counts, we can obtain the denominator $c(p_1)$ by summing over every phoneme that could follow $p_1$:

\begin{equation}
c(p_1) = \sum_{q \in P} c(p_1 q)
\end{equation}

where $P$ is the set of all phonemes.
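
The two equations above can be sketched on a tiny hand-made count table. The counts below are invented for illustration, and `row_totals` and `probs` are hypothetical names. The sketch gets every $c(p_1)$ in a single pass over the bigram counts and then divides.

```python
from collections import defaultdict

# hypothetical bigram counts c(p1 p2) over the inventory {a, b, c}
toy_counts = {
    ('a', 'a'): 1, ('a', 'b'): 6, ('a', 'c'): 3,
    ('b', 'a'): 2, ('b', 'b'): 1, ('b', 'c'): 1,
    ('c', 'a'): 1, ('c', 'b'): 1, ('c', 'c'): 2,
}

# c(p1) = sum over q of c(p1 q)
row_totals = defaultdict(int)
for (p1, p2), count in toy_counts.items():
    row_totals[p1] += count

# p(p2 | p1) = c(p1 p2) / c(p1)
probs = {(p1, p2): count / row_totals[p1] for (p1, p2), count in toy_counts.items()}

print(row_totals['a'])    # 10
print(probs[('a', 'b')])  # 0.6
```

Here $c(a) = 1 + 6 + 3 = 10$, so $p(b|a) = 6/10$.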

In [9]:
# data structure for transition probabilities is a dict of dicts.
# the outer dict is keyed by phoneme (state), and has value inner dict.
# the inner dict is also keyed by phoneme, and holds the probability of moving from the outer phoneme
# to the inner phoneme, i.e. p(inner|outer).
# this structure should make it easy to convert counts to probs, normalize etc.

tp = dict()  # transition probs, outer dict.
for p1 in phonemes:
    d = dict()  # inner dict
    # YOUR CODE GOES HERE.
    # c(p1): total count of all bigrams that start with p1
    p1_total = sum(ptc[(p1, q)] for q in phonemes)
    for p2 in phonemes:
        # p(p2|p1) = c(p1 p2) / c(p1)
        d[p2] = ptc[(p1, p2)] / p1_total
    
    tp[p1] = d

In [10]:
# sanity check.
# the conditional probabilities p(p2|p1) should add up to 1.0 when summing across p2, for each p1.
print("Sanity check: do the conditional probabilities conditioned on each phoneme sum to 1.0?")
for p1 in phonemes:
    total = sum(tp[p1].values())
    print("Sum of conditional probabilities conditioned on", p1, total)

Sanity check: do the conditional probabilities conditioned on each phoneme sum to 1.0?
Sum of conditional probabilities conditioned on y 1.0000000000000022
Sum of conditional probabilities conditioned on ɡ 0.999999999999996
Sum of conditional probabilities conditioned on t͡ʃ̩̩ 1.0000000000000027
Sum of conditional probabilities conditioned on əː 1.000000000000002
Sum of conditional probabilities conditioned on z 1.000000000000002
Sum of conditional probabilities conditioned on p 0.9999999999999982
Sum of conditional probabilities conditioned on t͡s 0.9999999999999959
Sum of conditional probabilities conditioned on b 0.9999999999999969
Sum of conditional probabilities conditioned on ɛː 1.000000000000002
Sum of conditional probabilities conditioned on ʊ̂ 1.0000000000000027
Sum of conditional probabilities conditioned on ʊ̯ 0.9999999999999987
Sum of conditional probabilities conditioned on ɛˑ 1.0000000000000027
Sum of conditional probabilities conditioned on ɜː 0.9999999999999972
Sum of c

### Congratulations!  You have created a Markov chain.

Now use it to generate pseudo-words.

In [11]:
# specifically, let's generate 20 pseudo-words.
for i in range(20):
    # for each pseudo-word, we start by setting state to 'START'.
    state = 'START'
    print(state, end='')   # print state, but no newline.

    # now keep repeating this until you arrive at the 'STOP' state
    while (state != 'STOP'):
        next_state_options = list(tp[state].keys())
        next_state_probs = list(tp[state].values())
        new_state = np.random.choice(next_state_options, p=next_state_probs)
        print(" ", new_state, end='')
        state = new_state
    print(); print()

START  ɪ  l  STOP

START  ə  n  ɪ  ɡ  z  STOP

START  f  ɛ  l  ɪ  d  STOP

START  l  ə  l  STOP

START  t  ɪ  l  ɡ  STOP

START  ə  n  d͡ʒ  ɛ̝  uː  STOP

START  d  i  ə  d  ɛ  n  ɒ  m  ɑː  n  STOP

START  h  a  ɪ  k  ə  s  ɑ  s  STOP

START  f  ɛ  ə  p  ɹ  æ  f  ə  ɹ  ə  k  t  STOP

START  n  ɪ  ə  ɹ  STOP

START  s  ə  m  STOP

START  ɛ  s  l  a  ɪ  n  STOP

START  ɹ  STOP

START  s  ɪ  t  æ  s  w  ɪ  STOP

START  z  STOP

START  ɹ  d  ə  s  t  ɪ  n  ɪ  ɾ  ə  n  s  t  ʃ  uː  m  z  STOP

START  s  t  ə  d  ʒ  STOP

START  f  ɝ  d  j  ɒ  ɡ  ɫ  pʰ  ɹ̩  ɪ̪  ə̆  ä  ɚː  ɛ̝  pʰ  ɜ̃  ã  ɾ  ə̯  d̪  p  ə  n  STOP

START  k  STOP

START  ʌ  p  ɪ  l  i  STOP



OK, that seemed to more or less work.  We get some reasonable-looking outputs, and also some that look a little strange.
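
For reuse, the sampling loop can be wrapped in a function. The sketch below is one possible way to do that, not the notebook's own code: `generate_word`, the `max_len` guard, and the tiny `toy_tp` table are all assumptions made for illustration. It uses NumPy's `default_rng` so that runs can be reproduced with a seed.

```python
import numpy as np

def generate_word(tp, rng, max_len=50):
    """Walk the chain from 'START' until 'STOP' (or until max_len states)."""
    state = 'START'
    word = [state]
    while state != 'STOP' and len(word) < max_len:
        options = list(tp[state].keys())
        probs = list(tp[state].values())
        # sample the next state in proportion to its transition probability
        state = str(rng.choice(options, p=probs))
        word.append(state)
    return word

# hypothetical transition table, same dict-of-dicts layout as tp
toy_tp = {
    'START': {'k': 0.5, 'æ': 0.5},
    'k': {'æ': 1.0},
    'æ': {'t': 0.5, 'STOP': 0.5},
    't': {'STOP': 1.0},
}

rng = np.random.default_rng(0)  # fixed seed for reproducibility
for _ in range(3):
    print(' '.join(generate_word(toy_tp, rng)))
```

The `max_len` guard matters little for a well-behaved chain, but it keeps generation bounded when STOP is only rarely reachable.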

Now, just to highlight how natural some of the stranger-looking outputs we just saw were, we'll try the same thing, but flipping the roles of $p_1$ and $p_2$ in the counts.  That is, when we see the bigram $p_1 p_2$, we increment the count for $p_2 p_1$.  

What does this do?  It rewards or strengthens transitions that tend to show up in the **opposite order** from that actually encountered.  

What sort of output should we predict?

**[ YOUR ANSWER GOES HERE. ]**

**1. Each output word tends to be very long, often passing through "START" several times in the middle before reaching "STOP."**

**2. The sequence of phonemes in each output word will sound more natural when read backwards than when read forwards.**
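
A small sketch may help show where the first prediction comes from. It rebuilds add-one-smoothed counts for two made-up pronunciations and then inspects two rows of the swapped transition table. The toy data and the helper `swapped_row` are assumptions for illustration only.

```python
# two hypothetical pronunciations, already padded with START and STOP
toy_prons = [
    ['START', 'k', 'æ', 't', 'STOP'],
    ['START', 'k', 'ɪ', 't', 'STOP'],
]
states = sorted({p for pron in toy_prons for p in pron})

# add-one smoothing, then the observed bigram counts
counts = {(a, b): 1 for a in states for b in states}
for pron in toy_prons:
    for a, b in zip(pron, pron[1:]):
        counts[(a, b)] += 1

def swapped_row(p1):
    """Swapped probabilities out of p1: proportional to counts[(p2, p1)]."""
    total = sum(counts[(q, p1)] for q in states)
    return {p2: counts[(p2, p1)] / total for p2 in states}

start_row = swapped_row('START')
k_row = swapped_row('k')
print(start_row)       # uniform: every entry is 1/6
print(k_row['START'])  # 0.375
print(k_row['STOP'])   # 0.125
```

Nothing ever precedes START in the data, so the swapped row out of START contains only smoothing counts and is uniform over all states. Nothing ever follows STOP, so the swapped chain can enter STOP only through a smoothing count, while it returns to START readily from any phoneme that is common word-initially.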

In [12]:
# generate transition probabilities from counts again, but this time with p1 and p2 swapped, 
# as described above.  this will be a lightly but strategically edited variant of a code box
# above.
tp_swapped = dict()  # transition probs
for p1 in phonemes:
    d = dict()
    # YOUR CODE GOES HERE.
    # total count of all bigrams that end with p1
    p1_total = sum(ptc[(q, p1)] for q in phonemes)
    for p2 in phonemes:
        # swapped roles: the count for the bigram "p2 p1" drives the move from p1 to p2
        d[p2] = ptc[(p2, p1)] / p1_total
    
    tp_swapped[p1] = d

In [13]:
# now generate 20 pseudo-words with these swapped transition probabilities.
for i in range(20):
    # for each pseudo-word, we start by setting state to 'START'.
    state = 'START'
    print(state, end='')   # print state, but no newline.

    # now keep repeating this until you arrive at the 'STOP' state
    while (state != 'STOP'):
        next_state_options = list(tp_swapped[state].keys())
        next_state_probs = list(tp_swapped[state].values())
        new_state = np.random.choice(next_state_options, p=next_state_probs)
        print(" ", new_state, end='')
        state = new_state
    print(); print()

START  p̬  k̚  uː  ʍ  dʰ  χ  ô  c  ɔː  ɛ͡ɪ  e͡ɪ  w  START  fː  ɜ̃  d  ʊ  o  b  s  ɪ  a  h  t  ə  n  ə  ɹ  b  ɝ  l  ɪ  v  ɪ  e  ɹ  ə  ʒ  d  START  ʌ  k  ɹ  ʊ  o  m  ɛ  ʃ  k  ɑ  k  ə  START  k̚  ɑ̃  dʲ  ɑ̃  ŋ  ɪ  s  iː  w  k  ɛ  START  ã  ú  tʰ  θ  START  d͡ʒ  ɪ  ɹ  k  START  STOP

START  əː  ə̆  ʌ̈  ʊː  m̩ː  h  START  ĭ̥  ʈ  ʔ  ɛ͡ɪ  l͡s  ɚː  q  ɔ̃  ʌ̈  x  ɫ̩  ô  ŋ  ɜː  ɡː  ɔ͡ɪ  o̞  dʲ  ɭ  y  æː  ʊː  ɹː  ɾ̃  oˀ  ɹ  ɑ  START  oː  æ  u  b  uː  j  n  ə  p  START  t̬  ɪ  ɹ  p  ə  k  ɪ  e  l  ə  n  ʌ  b  ə  ɹ  d  START  tʰ  a  ɹ  p  START  p̬  ɻ  ɑ  l  ə  j  n  ʊ  o  s  ɪ  n  ɪ  s  k  ŋ  ɪ  f  START  t͡ɕ  ɚː  ɔ͡ɪ  æ  h  START  ɹ  t  ɪ  ŋ  æ  n  ʊ  a  ɹ  ɔ  k  ɪ  p  ɛ  t  ɑ  f  ɪ  ɹ  ə  m  ɪ  a  l  ə  k  ɛ  h  START  v  ɪ  s  ɪ  n  START  ɑˑ  ɐ  x  ~  ã  ɛ͡ɹ  ɔː  p  START  p  START  ʌ̈  v  ɪ  d  START  ʏ  ə̯  ɪ  START  t͡ʃ  START  o̞  d͡ʒ  ɚ  ɪ̯  a  ɝ  START  ɝː  l͡s  ɹː  ɵ  e͡ɪ  ʒ  d  START  kᵊ  ɪ̯  a  w  START  ú  ɛː  ɣ  ʃː  ʙ̩ː  n  iː  l  ə  n  ə  t  s  uː  ʒ  d  ɪ  a  l  ɑ  k  iː  ʊ  ə  ST

START  ɔ͡ɪ  y  ɔ̃  ɛˑ  ʃ  ɪ  ɹ  ɡ  START  ɹː  ã  äː  ʊ̪  ʃː  ɒː  o̞  ɭ  ɐː  c  ɒ̃  ʊ̂  o  ɹ  ɛ  ʒ  d  START  ʃʰ  bʲ  t͡s  ə̆  ʃ  ɪ  d  ʊ  ə  ɹ  k  ɪ  b  START  ɐː  ɛˑ  ɹ  t  START  ĭ  pː  i  t  ʊ  ə  m  ʌ  h  d  ɛ  θ  ə  m  æ  k  START  b  START  t͡ʃ  ɑ  l  ʒ  d  ɪ  f  ə  i  θ  s  ɛ  b  ə  n  ɡ  ʊ  o  v  l  ɪ  ɹ  p  ə  ʒ  START  p̬  lʲ  ĭ  ɪː  r  ɔː  v  ɪ  e  ɡ  ŋ  ɑ  w  æ̃  ç  dʰ  l̩  b  ɑ  d  n  START  ʍ  ɡ  ɫ̩  ṽ  ɨ  d  n  ə  s  ɹ  ɔ  θ  ɛ  θ  n  ɚ  f  START  ʍ  START  ʌ̈  ɻ  bʲ  ʀ  əˑ  ʃː  l͡s  o͡ʊ  k̚  START  ɑˑ  ɒ̃  əˑ  ɛː  ṽ  ɹ  ə  ʃ  æ  l  ʊ  o  n  ə  k  ə  f  d  ʊ  a  ð  i  l  k  START  x  ɔ̃  θ  p̬  ~  ɒː  dʰ  ɘ  ɾ̃  ɔ͡ɪ  æ̃  ɑ  m  ɛ  w  k  l  ə  ɡ  ŋ  æ  ɹ  k  ə  t  s  n  ɛ  l  ʊ  o  l  ʃ  ʊ  ə  ɪ  ɹ  iː  ʊ  o  ɡ  START  p̬  ɹʷ  ɻ  uː  l  START  ɪʲ  ɵ  ↓  ɪ̈  ɹ  ə  k  ɪ  ɹ  ɔ  s  æ  ɹ  æ  n  ʌ  ʃ  ɪ  ɹ  ə  p  START  əː  ɚː  ɲ  ä  ʙ̩ː  n̩  ʃ  t  ɛ  n  ɡ  START  e  w  k  START  ʊ̪  θ  START  ɕː  ɔ͡ɪ  fː  lʲ  mː  ɱ  fː  w  START  t̬ᵊ  y  p̬  i  f  m  æ  START  k̚  ə̆  dʲ  õ  ç 

START  aː  ɑ  f  ɪ  ɹ  START  ɹ̩  ʒ  d  ʊ  o  k  ɪ  d  n  æ  ɹ  ɡ  ɪ  l  ʌ  k  ɪ  b  æ  k  ɪ  e  ɹ  ɑː  w  ɡ  START  oː  ɐ  ɭ  d͡ʒ  n  ʌ  k  ɪ  ɔ  h  START  x  ʃʰ  ɹ̩  fː  pː  ɛˑ  ɪˑ  ô  ɲ  ɚː  ɛ  k  START  ɒ̃  ʊ̆  tᵊ  t̬ᵊ  uː  d  n  ə  k  START  ŭ̥  ɪˑ  æ̃  ʊ̩  y  ə  ɹ  p  START  ɵ̞  pʰ  ˞  ə  k  ʌ  m  ə  v  ʊ  w  s  ə  l  k  START  ʊ̆  ɖ  START  e͜ɚ  kʰ  ɽ  ɒː  ɐ  œ  o͡ʊ  t͡ʃ  ə  s  ə  l  ɛ  w  ə  b  START  ːs  ˞  ɻ  ɻ  äː  œ  ē  ɛˑ  o  START  ɪʲ  t̬ᵊ  ɜ  pʰ  θ  n  ə  m  p  ɒ  b  START  l͡s  ɐː  ĭ̥  lʲ  h  START  ʃ  t  ɪ  e  w  i  ʒ  d  ð  ɪ  h  START  ʃː  p  ə  k  ɪ  m  START  ɛː  w  s  ə  d  n  ɑ  s  k  ɪ  f  START  f  ə  n  ɪ  e  p  m  uː  l  ʊ  ə  k  START  n̩  ʃ  t  ɪ  l  ə  START  ɔ̃  îː  r̩  w  s  ɪ  t  ʊ  a  w  ʊ̪  ɑˑ  ɒ  p  START  k̚  ɐː  ɽ  ʊ̪  p  æ  ʃ  t  ə  d  n  ɛ  ʃ  ɪ  d  iː  d  START  ɜː  b  ə  ɹ  s  START  ʉː  əː  uː  t  s  ɑː  p  m  d  n  ɑ  k  START  ã  ʌ  START  ɐ  ɡː  ɒː  ɹː  v  ə  t  START  sː  ɪˑ  q  t͡ʃ̩̩  o  n  ə  ɹ  ə  i  l  ə  k  START  tᵊ  əʳ  ɝ  m̩  ɪː  ↓

START  d  ʊ  o  f  ɪ  e  p  ʊ  o  b  ə  START  ɑ̃  r̩  ŋ  æ  l  ɪ  p  START  START  ɬ  îː  ä  ʊ̩  o  m  ɑ  ɹ  ʊ  ə  ʃ  p  ə  ɪ  e  START  ɪ̈  START  ɨ  l  b  START  ʊ̆  ɭ  ɲ  m̩ː  ɜ̃  ʙ̩ː  ɪ̪  START  ɹ  b  ɪ  v  ə  t  START  əː  ɫ  ɵ̞  kᵊ  ç  ʌ  w  k  ɪ  t  s  ə  iː  ɹ  ɡ  ɒ  k  l  ə  i  s  ə  k  ɑː  d  START  ɛ͡ɹ  ä  ʊ̆  ɪ  START  œ  t͡ʃʰ  p  m  ɹ  d  START  ɽ  ɜː  ɹ  ə  p  START  æ  ʃ  ɪ  e  j  n  ɑː  l  t  ɪ  ɹ  ɔ  k  ə  d  START  ɔ  h  START  l  ɹʷ  ɔː  START  j  ɑ  j  uː  k  ɛ  d  ɝ  ɯ  ɽ  ṽ  oː  ã  ɵ  ɪ̈  oˀ  ɔ  n  ə  t  k  ɛ  v  i  ɹ  ə  m  START  p  m  START  fː  ú  y  k  ɛ  f  ɨ  ʊ̩  k̚  ʈ  n̩  ʃ  t  iː  θ  ɪ  e  ɹ  k  i  n  ɪ  a  k  START  ɑˑ  x  t͡ʃ̩̩  ɣ  STOP

START  ↓  ɑ̃  ʃ  æ  n  ə  m  START  t̬ᵊ  ɪˑ  l͡s  ṽ  ɜ̃  ɵ̞  ã  k  ɪ  e  ɫ̩  ɪ  e  ɹ  n  ɪ  v  ɪ  d  ɑ  START  ɹ  ɔː  ʊː  ʊː  îː  ʊ̩  ɛ̃  n  æ  s  ɪ  t  k  ɪ  s  START  ɫ̩  ɔ  f  START  tʰ  oː  y  ɹ̩  mː  æ  ɹ  ɔ  t  ɪ  l͡s  ~  a  ɹ  ʊ  a  b  START  t̬ᵊ  ʉ  ɭ  nˡ  t͡ɕ  ɽ  ɖ  ɚː  ɒː  ɔ͡ɪ  i  d  b  START  n  ʌ  t  s  n 

START  ã  z  i  l  æ  s  n  ɪ  a  p  ɪ  p  æ  m  START  kʰ  ã  ~  r  ə  v  START  o  l  ɑ  l  ɛ  t  ə  l  ɪ  p  START  aː  ē  ˞  o͡ʊ  ɔ  START  îː  ˞  ɑ̃  kᵊ  m  ə  k  ə  n  ɪ  a  s  START  ɭ  tː  ɪː  t͡ʃ  START  ɝː  ç  k  ɑ  b  ə  i  ʒ  d  ʊ  o  ʃ  t  s  k  ɑ  p  START  æː  ɜ  j  n  START  ɪˑ  oː  tː  ɦ  ɛ͡ɪ  t  ʃ  n  ɪ  n  ə  l  æ  ɹ  t  ɹ  ɪ  a  START  æː  əʳ  pː  m̩  əʳ  ↓  ə  t  ə  d  ɑ  p  s  START  ɪˑ  ä  ʙ̩ː  t̬ᵊ  ɯ  v  iː  l  ɔ  ɡ  ɪ  p  START  m  START  y  ʊ̩  b  START  ɻ  ʃʰ  ʊ̩  l  ʊ  o  d  ə  m  s  ʌ  START  ɛ͡ɹ  lʲ  ʃʰ  n  iː  d͡ʒ  ɑ  i  ɹ  ɪ  d  START  f  r  ə  START  ʌː  ɪː  ~  ɣ  l̩  b  m  ð  ɬ  əʳ  ɾ̃  ʊ̯  ʙ̩ː  nˡ  ɛ̝  bᵊ  v  ɝ  m  ɑ  k  ŋ  ɪ  a  b  START  ːs  ɔ͡ɪ  ɵ̞  ʊ̯  o  k  ɪ  l  k  p  m  START  l̩  b  ɪ  a  f  ɑ  t  n  ɚ  t  ɑ  p  START  dʰ  θ  f  START  ɒ̃  ɜː  v  ɛ  v  ʌ  s  t  ɔ  k  ɪ  f  ʊ  o  d  z  ɪ  w  ɵ̞  t̬ᵊ  ɝː  ʊ  o  ɡː  r  ə  ĩ  χ  ŋ  ɪ  a  t  ɑ  m  ɪ  e  l  ə  ɡ  ɪ  f  θ  ɹ  ɛ  START  əʳ  a  n  ʌ  s  b  START  ɘ  oˀ  ə̆  ɑː  k  ɪ  k  i  t  ɛ  s  ɪ  

# Observations

- Write your general observations about the results here.  A few short sentences would be fine.

It seems that phoneme transition probabilities are generally good at generating English pseudo-words, i.e., words that comply with the "rules" of English pronunciation, apart from several single-phoneme outputs. However, it is reasonable to speculate that transition probabilities alone would not generalize well from phonemes to words (i.e., to generating sentences that comply with the rules of English grammar). On the one hand, such a model may generate sentences that are statistically probable but ungrammatical, e.g., "I like to the cafeteria"; on the other hand, it is very unlikely to generate sentences that are grammatical but meaningless, such as the good old "colorless green ideas sleep furiously." Therefore, the behaviorist view of language as purely "sequential" has its limitations. Sentential grammar is more than a sequence probability; that is where other views of language, such as Chomsky's generative grammar, come in.

# Extensions (optional)

- Do the same thing for other languages.  See [here](https://github.com/kylebgorman/wikipron/tree/master/data) for analogous WikiPron data from other languages, in the same format as the data we've used here.
- Do the same thing for words rather than phonemes.  You could download a book in plain text from [Project Gutenberg](https://www.gutenberg.org/ebooks/search/?sort_order=downloads), and proceed from there.
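
As a possible starting point for the word-level extension, here is a minimal sketch that builds word-bigram counts from a short made-up text and samples a sentence from them. The toy text, the splitting on periods, and the names `successors` and `generate_sentence` are assumptions; a real book would need more careful tokenization, and this sketch uses no smoothing.

```python
import random
from collections import Counter, defaultdict

# stand-in for a plain-text book (illustrative only)
text = "the cat sat on the mat. the dog sat on the rug."

# crude sentence and word splitting
sentences = [s.split() for s in text.split('.') if s.strip()]

# successors[w1][w2] = number of times w2 directly follows w1
successors = defaultdict(Counter)
for sent in sentences:
    padded = ['START'] + sent + ['STOP']
    for w1, w2 in zip(padded, padded[1:]):
        successors[w1][w2] += 1

def generate_sentence(successors, rng, max_len=30):
    """Sample words from 'START' until 'STOP' or until max_len words."""
    word, out = 'START', []
    while len(out) < max_len:
        options = successors[word]
        # random.choices samples in proportion to the raw counts
        word = rng.choices(list(options), weights=list(options.values()))[0]
        if word == 'STOP':
            break
        out.append(word)
    return out

rng = random.Random(0)  # fixed seed for reproducibility
print(' '.join(generate_sentence(successors, rng)))
```

Because the weights are raw counts, there is no need to normalize them into probabilities before sampling.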