# Decoding Tool

This script calculates a number of decoding-related features for individual words. Many of the features come from the CMU pronunciation dictionary.

It is used to build a dataframe in which each observation is an English word and each variable is a decoding-related count or ratio for that word.

It calculates the following:

1. Syllables per word
2. Number of letters per word
3. Number of phonemes per word
4. Discrepancy (raw and ratio): differences between the number of characters in a word and the number of phonemes in that word
5. Average syllable length
6. Blends (consonants, vowels, and both): average characters per phoneme
7. Grapheme‑Phoneme Complexity (Berndt et al., 1987), calculated in two ways

    7a. The mid, max, prior, and average Grapheme‑Phoneme Complexity by word

    7b. The phonological matching of Grapheme‑Phoneme Complexity by word

We will start by calculating variables using the CMU pronunciation dictionary.

First, where are we?

In [1]:
import os

print(os.listdir())

['everything_except_vowel_prob.csv', 'calculate_decoding_measures_from_texts_spacy.ipynb', 'CLEAR_corpus_prac.csv', 'problems_w_vowel_list_lengths.csv', 'CLEAR_corpus_final.csv', 'Screenshot 2023-06-02 at 1.10.55 PM.png', 'consonant_cond_prob_berndt_no_except.csv', '.DS_Store', 'vow_cond_prob_berndt_no_except_one_char.csv', 'joon_cond_prob', 'decoding_measure_notes.rtf', 'cons_cond_prob_berndt_no_except_one_char.csv', 'dic_prac.ipynb', 'cons_cond_prob_berndt_no_except_two_char.csv', 'corr_matrix_decoding_variables_clear.csv', 'decoding_1.0_for_Joon.zip', 'vow_cond_prob_berndt_no_except_two_char.csv', 'berndt_data_with_cmu_phones.xlsx', 'vowel_cond_prob_berndt.csv', 'decoder_dataframe_analytics', 'vow_cond_prob_berndt_no_except_three_char.csv', 'cmu_phones_vowels.csv', 'berndt_notes.rtf', 'prac_words_berndt.csv', 'python-regular-expressions-cheat-sheet.pdf', 'decoding_2.ipynb', 'decoding_1_dataframe.csv', 'consonant_cond_prob_berndt.csv', 'All_variables_in_decoder_project.xlsx', 'berndt

In [1]:
# safe divide function to stop zero counts from causing problems
def safe_divide(a, b):
    if b != 0:
        return a/b
    else:
        return 0

## Call in CMU pronunciation dictionary and wrangle data into a dictionary

In [2]:
#call in dictionary

import pandas as pd

#start with small dataframe
#rhyme_df = pd.read_csv('cmu_rhyme_dic_small.csv', na_values='', header=None)

#call in dataframe by number of words
pronounce_df = pd.read_csv('cmu_rhyme_dic.csv', na_values='', header=None)


pronounce_df = pronounce_df.fillna("") #replace nan with nothing
result_df = pronounce_df.head(10) #small dataset to view
print(result_df)
print(type(pronounce_df))
print(pronounce_df.shape)



       0    1    2    3  4    5    6    7    8  9   ... 23 24 25 26 27 28 29  \
0       A  AH0                                      ...                        
1    A(1)  EY1                                      ...                        
2     A'S  EY1    Z                                 ...                        
3      A.  EY1                                      ...                        
4    A.'S  EY1    Z                                 ...                        
5     A.S  EY1    Z                                 ...                        
6  A42128  EY1    F  AO1  R    T  UW1    W  AH1  N  ...                        
7      AA  EY2  EY1                                 ...                        
8     AAA    T    R  IH2  P  AH0    L  EY1          ...                        
9  AABERG  AA1    B  ER0  G                         ...                        

  30 31 32  
0           
1           
2           
3           
4           
5           
6           
7           
8 

Next, clean the word column in two steps:

- Remove all words/observations that contain ( or ). These are alternative pronunciations, such as A(1).

- Strip . and ' from the remaining words, because these characters get in the way of character counts.

**You will also need to strip these characters from the corpus you are analyzing, so that a word like parent's becomes parents (and wouldn't becomes wouldnt).**
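A minimal sketch of the corpus-side cleaning described above, assuming tokens are plain strings. The function name `clean_token` and the sample tokens are hypothetical, and the upper-casing is an assumption made so that tokens match the upper-case dictionary keys shown in the preview.

```python
def clean_token(token):
    """Upper-case a token and strip periods and apostrophes,
    mirroring the cleaning applied to the dictionary words."""
    return token.upper().replace(".", "").replace("'", "")


# Hypothetical sample tokens from a corpus
corpus_tokens = ["parent's", "wouldn't", "U.S.", "dog"]
cleaned = [clean_token(t) for t in corpus_tokens]
print(cleaned)
```

Curly apostrophes and other punctuation are not handled here; a real corpus may need more cases.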


In [3]:
# Remove rows whose word contains ( because these are alternative pronunciations
# regex=False matches the literal character; .copy() avoids a SettingWithCopyWarning on the assignments below
pronounce_df2 = pronounce_df[pronounce_df[0].str.contains("(", regex=False) == False].copy()

# Remove . and ' from words
# regex=False treats the pattern as literal text regardless of the pandas version
pronounce_df2[0] = pronounce_df2[0].str.replace('.', '', regex=False) #replace . with nothing
pronounce_df2[0] = pronounce_df2[0].str.replace("'", '', regex=False) #replace ' with nothing

print(pronounce_df2.shape)

(125000, 33)


**Remove redundant words**

Some spellings now appear more than once. For example, A'S, A.'S, and A.S all become AS once . and ' are stripped. We keep only the first row for each spelling and drop the rest.

This removes 3,247 rows (125,000 down to 121,753).
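The following plain-Python sketch mirrors what keeping the first duplicate does, using rows from the dictionary preview above. It is an illustration only: the helper `clean` and the `first_only` dictionary are hypothetical names, and the notebook itself uses pandas for this step.

```python
# (spelling, pronunciation) pairs copied from the dictionary preview
rows = [
    ("A", ["AH0"]),
    ("A'S", ["EY1", "Z"]),
    ("A.", ["EY1"]),
    ("A.'S", ["EY1", "Z"]),
    ("A.S", ["EY1", "Z"]),
]


def clean(word):
    """Strip periods and apostrophes from a spelling."""
    return word.replace(".", "").replace("'", "")


first_only = {}
for word, phones in rows:
    # setdefault stores the value only when the key is new,
    # so the first row for each cleaned spelling wins
    first_only.setdefault(clean(word), phones)

print(first_only)
```

Note the side effect: A. collapses onto A, so its pronunciation is discarded in favor of the earlier row.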

In [4]:
#remove rows based on first column but keep first instance of duplicate
pronounce_df3 = pronounce_df2.drop_duplicates(subset=[0], keep= 'first')

pronounce_df3
print(pronounce_df3.shape)

(121753, 33)


In [5]:
#Convert to dictionary

pronounce_df_tran = pronounce_df3.set_index(0).transpose() #this transposes, but also makes column 0 (the words) the header

#print(pronounce_df_tran)

pronounce_dic_tran =pronounce_df_tran.to_dict('list') #makes it into a single dictionary and uses header as key and list as values

#print out key and values


#see what is in there
counter = 0

for key, val in pronounce_dic_tran.items():
    print(key)
    print(val)
    counter += 1
    if counter == 5:
        break

#lots of empty values


A
['AH0', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
AS
['EY1', 'Z', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
A42128
['EY1', 'F', 'AO1', 'R', 'T', 'UW1', 'W', 'AH1', 'N', 'T', 'UW1', 'EY1', 'T', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
AA
['EY2', 'EY1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
AAA
['T', 'R', 'IH2', 'P', 'AH0', 'L', 'EY1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [6]:
#remove all empty items from dictionary
pronounce_dic_tran = {key: [phone for phone in val if phone != ""] #keep only the non-empty values
                      for key, val in pronounce_dic_tran.items()}



In [7]:
#list first 4 items
dict(list(pronounce_dic_tran.items())[0:4]) #does not work sometimes?
#they are cleaned

#counter = 0

#for key, val in pronounce_dic_tran.items():
#    print(key)
#    print(val)
#    counter += 1
#    if counter == 5:
#        break


{'A': ['AH0'],
 'AS': ['EY1', 'Z'],
 'A42128': ['EY1',
  'F',
  'AO1',
  'R',
  'T',
  'UW1',
  'W',
  'AH1',
  'N',
  'T',
  'UW1',
  'EY1',
  'T'],
 'AA': ['EY2', 'EY1']}

## Count number of syllables per word

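1. Count the vowel phonemes in each dictionary element. In the CMU notation, vowel phonemes are the ones that end in a stress digit (0, 1, or 2), and each vowel phoneme corresponds to one syllable.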
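As a small illustration of the counting rule, the sketch below counts phonemes that end in a digit. It assumes every vowel phoneme in the pronunciation carries a stress digit; the function name `count_syllables` is hypothetical, and the notebook itself checks membership in a vowel list read from a CSV file instead.

```python
def count_syllables(phones):
    """Count phonemes whose last character is a stress digit."""
    # p[-1:] is safe for empty strings, unlike p[-1]
    return sum(1 for p in phones if p[-1:].isdigit())


# Pronunciations copied from the dictionary preview
aaa = ["T", "R", "IH2", "P", "AH0", "L", "EY1"]
a42128 = ["EY1", "F", "AO1", "R", "T", "UW1", "W", "AH1", "N", "T", "UW1", "EY1", "T"]

print(count_syllables(aaa))
print(count_syllables(a42128))
```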

In [8]:

# read in vowels from cmu
vow = pd.read_csv("cmu_phones_vowels.csv")
 
# convert vowel column to list
vowels = vow['vowel'].tolist()

words = [] #keep track of all the words in pronunciation dictionary
pronounciation = [] #keep track of all the pronunciations
for i in pronounce_dic_tran:
    words.append(i) #put words in list above
    pronounciation.append(pronounce_dic_tran[i]) #put pronunciations in list above

decoding_df = pd.DataFrame() #create pandas dataframe
decoding_df['words'] = words
decoding_df['pronounciations'] = pronounciation #add words and pronunciations to dataframe

syllables = []

for key, value in pronounce_dic_tran.items(): #read cmu dict
    #print(value)
    
    count = 0 #start a count
    for val in value:  
        #print(v)
        if val in vowels: #count the number of vowels in each word (i.e., syllables)
            #print(val)
            count += 1
            #print(count)

    syllables.append(count) #append syllable count

decoding_df['num_syllables'] = syllables
    
decoding_df


Unnamed: 0,words,pronounciations,num_syllables
0,A,[AH0],1
1,AS,"[EY1, Z]",1
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6
3,AA,"[EY2, EY1]",2
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3
...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3


## Count number of letters per word

In [9]:
#words_small = words[1055:1065]

#print(words_small)

letter_per_word = []

for word in words:
    #print(len(word))
    letter_per_word.append(len(word)) #count characters per word
    #print(count)
    
#print(letter_per_word[0:20])

decoding_df['num_letters'] = letter_per_word #add to dataframe
decoding_df

Unnamed: 0,words,pronounciations,num_syllables,num_letters
0,A,[AH0],1,1
1,AS,"[EY1, Z]",1,2
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6,6
3,AA,"[EY2, EY1]",2,2
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3,3
...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1,4
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3,9
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3,8
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3,9


## Number of phonemes

In [10]:
#small_list_list = pronounciation[1000:1006]
#print(small_list_list)

phoneme_count = []

for phones in pronounciation: #use pronunciation list to count phonemes (naming the loop variable list would shadow the builtin list() and break later calls to it)
    #print(len(phones))
    phoneme_count.append(len(phones))

#print(phoneme_count[0:5])

decoding_df['num_phonemes'] = phoneme_count

decoding_df


Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes
0,A,[AH0],1,1,1
1,AS,"[EY1, Z]",1,2,2
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6,6,13
3,AA,"[EY2, EY1]",2,2,2
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3,3,7
...,...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1,4,4
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3,9,9
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3,8,8
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3,9,9


## Compute discrepancy 

The discrepancy is the difference between the number of characters in a word and the number of phonemes in that word. It is computed in two ways:

- Raw: number of phonemes minus number of letters

- Ratio: number of phonemes/number of letters
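A worked example for a single word, following the direction used in the code cell (phonemes minus letters, and phonemes divided by letters). The function name `discrepancy` is hypothetical; `safe_divide` is repeated here so the sketch runs on its own.

```python
def safe_divide(a, b):
    """Return a / b, or 0 when b is zero."""
    return a / b if b != 0 else 0


def discrepancy(word, phones):
    """Return (raw, ratio) discrepancy for one word."""
    raw = len(phones) - len(word)                 # phonemes minus letters
    ratio = safe_divide(len(phones), len(word))   # phonemes per letter
    return raw, ratio


# HOBBS has 5 letters and 4 phonemes
print(discrepancy("HOBBS", ["HH", "AA1", "B", "Z"]))
```

A negative raw value means the word has more letters than phonemes, as with the doubled B in HOBBS.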

In [11]:

#change counts to floats
decoding_df['num_syllables'] = decoding_df['num_syllables'].astype(float)
decoding_df['num_letters'] = decoding_df['num_letters'].astype(float)
decoding_df['num_phonemes'] = decoding_df['num_phonemes'].astype(float)

decoding_df['discrepancy_raw'] = decoding_df.apply(lambda row: row.num_phonemes - row.num_letters, axis=1)
#number of phonemes minus number of letters

#use safe_divide function in .apply function for potential division by 0 problems
decoding_df['discrepancy_ratio'] = decoding_df.apply(lambda row: safe_divide(row.num_phonemes,row.num_letters), axis=1)
#number of phonemes/number of letters

decoding_df


Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes,discrepancy_raw,discrepancy_ratio
0,A,[AH0],1.0,1.0,1.0,0.0,1.000000
1,AS,"[EY1, Z]",1.0,2.0,2.0,0.0,1.000000
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6.0,6.0,13.0,7.0,2.166667
3,AA,"[EY2, EY1]",2.0,2.0,2.0,0.0,1.000000
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3.0,3.0,7.0,4.0,2.333333
...,...,...,...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1.0,4.0,4.0,0.0,1.000000
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3.0,9.0,9.0,0.0,1.000000
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3.0,8.0,8.0,0.0,1.000000
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3.0,9.0,9.0,0.0,1.000000


## Compute Average Syllable Length

Number of letters/number of syllables
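A one-word sketch of this ratio, assuming syllables are counted as phonemes that end in a stress digit. The function name `avg_syllable_length` is hypothetical.

```python
def avg_syllable_length(word, phones):
    """Letters per syllable for one word; 0 when no syllable is found."""
    syllables = sum(1 for p in phones if p[-1:].isdigit())
    return len(word) / syllables if syllables else 0


# HOBBY has 5 letters and 2 vowel phonemes
print(avg_syllable_length("HOBBY", ["HH", "AA1", "B", "IY0"]))
```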

In [12]:
#use safe_divide function in .apply function for potential division by 0 problems
decoding_df['avg_syllable_length'] = decoding_df.apply(lambda row: safe_divide(row.num_letters,row.num_syllables), axis=1)


decoding_df


Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes,discrepancy_raw,discrepancy_ratio,avg_syllable_length
0,A,[AH0],1.0,1.0,1.0,0.0,1.000000,1.000000
1,AS,"[EY1, Z]",1.0,2.0,2.0,0.0,1.000000,2.000000
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6.0,6.0,13.0,7.0,2.166667,1.000000
3,AA,"[EY2, EY1]",2.0,2.0,2.0,0.0,1.000000,1.000000
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3.0,3.0,7.0,4.0,2.333333,1.000000
...,...,...,...,...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1.0,4.0,4.0,0.0,1.000000,4.000000
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3.0,9.0,9.0,0.0,1.000000,3.000000
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3.0,8.0,8.0,0.0,1.000000,2.666667
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3.0,9.0,9.0,0.0,1.000000,3.000000


## Count up blends

Blends are combinations of two or three consonants which, when pronounced, blend into sounds that still retain elements of the individual consonants.

We need character counts for vowels and consonants (i.e., how many vowel characters and how many consonant characters per word).

Then we need phoneme counts for vowels and consonants.


### Start with character counts

Counts for vowel characters and consonant characters
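A compact sketch of the character counts for a single word. It assumes the same vowel letters as the cell below (A, E, I, O, U, and Y); the function name `count_letters_by_type` is hypothetical. Two consequences are worth seeing in a small example: Y always counts as a vowel, and any other character, including a digit, counts as a consonant.

```python
CHR_VOWELS = set("AEIOUY")  # Y is always treated as a vowel character


def count_letters_by_type(word):
    """Return (consonant_characters, vowel_characters) for one word."""
    vowels = sum(1 for ch in word if ch in CHR_VOWELS)
    return len(word) - vowels, vowels


print(count_letters_by_type("ZYUGANOV"))
print(count_letters_by_type("A42128"))  # digits land in the consonant count
```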

In [13]:
#prac_words = words[11000:11005]
#print(prac_words)

#list of character vowels
chr_vowels = ['A', 'E', 'I', 'O', 'U', 'Y']
#print(chr_vowels)

#count consonant characters per word
#every character that is not in chr_vowels counts as a consonant (this includes digits)

cons_char_count_words = [] #list of consonant character counts, one per word

for w in words:
    cons_char_count_words.append(sum(1 for char in w if char not in chr_vowels))

#count number of vowel characters per word

vowel_char_count_words = [] #list of vowel character counts, one per word

for w in words:
    vowel_char_count_words.append(sum(1 for char in w if char in chr_vowels))



decoding_df['num_consonants_characters'] = cons_char_count_words
decoding_df['num_vowel_characters'] = vowel_char_count_words
decoding_df




Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes,discrepancy_raw,discrepancy_ratio,avg_syllable_length,num_consonants_characters,num_vowel_characters
0,A,[AH0],1.0,1.0,1.0,0.0,1.000000,1.000000,0,1
1,AS,"[EY1, Z]",1.0,2.0,2.0,0.0,1.000000,2.000000,1,1
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6.0,6.0,13.0,7.0,2.166667,1.000000,5,1
3,AA,"[EY2, EY1]",2.0,2.0,2.0,0.0,1.000000,1.000000,0,2
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3.0,3.0,7.0,4.0,2.333333,1.000000,0,3
...,...,...,...,...,...,...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1.0,4.0,4.0,0.0,1.000000,4.000000,3,1
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3.0,9.0,9.0,0.0,1.000000,3.000000,6,3
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3.0,8.0,8.0,0.0,1.000000,2.666667,4,4
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3.0,9.0,9.0,0.0,1.000000,3.000000,5,4


### Second, get phoneme counts

for vowels and for consonants

In [15]:
#prac_words = pronounciation[0:11]
#print(prac_words)

#list of phoneme vowels from Pronunciation dictionary
#print(vowels)

#count consonant phonemes per word

cons_phon_count_words = [] #list of consonant phoneme counts, one per word

for w in pronounciation:
    cons_phon_count_words.append(sum(1 for phon in w if phon not in vowels)) #vowels are from the Pronunciation dictionary

#count vowel phonemes per word

vowel_phon_count_words = [] #list of vowel phoneme counts, one per word

for w in pronounciation:
    vowel_phon_count_words.append(sum(1 for phon in w if phon in vowels))




decoding_df['num_consonants_phonemes'] = cons_phon_count_words
decoding_df['num_vowel_phonemes'] = vowel_phon_count_words

decoding_df



Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes,discrepancy_raw,discrepancy_ratio,avg_syllable_length,num_consonants_characters,num_vowel_characters,num_consonants_phonemes,num_vowel_phonemes
0,A,[AH0],1.0,1.0,1.0,0.0,1.000000,1.000000,0,1,0,1
1,AS,"[EY1, Z]",1.0,2.0,2.0,0.0,1.000000,2.000000,1,1,1,1
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6.0,6.0,13.0,7.0,2.166667,1.000000,5,1,7,6
3,AA,"[EY2, EY1]",2.0,2.0,2.0,0.0,1.000000,1.000000,0,2,0,2
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3.0,3.0,7.0,4.0,2.333333,1.000000,0,3,4,3
...,...,...,...,...,...,...,...,...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1.0,4.0,4.0,0.0,1.000000,4.000000,3,1,3,1
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3.0,9.0,9.0,0.0,1.000000,3.000000,6,3,6,3
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3.0,8.0,8.0,0.0,1.000000,2.666667,4,4,5,3
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3.0,9.0,9.0,0.0,1.000000,3.000000,5,4,6,3


### Lastly, calculate the following ratios

- Number of consonant characters/number of consonant phonemes
- Number of vowel characters/number of vowel phonemes
- Number of letters/number of phonemes
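The three ratios can be sketched for a single word as follows. This assumes vowel letters are A, E, I, O, U, and Y, and that vowel phonemes are the ones ending in a stress digit; the function name `chars_per_phoneme` and the dictionary keys are hypothetical.

```python
CHR_VOWELS = set("AEIOUY")


def safe_divide(a, b):
    """Return a / b, or 0 when b is zero."""
    return a / b if b != 0 else 0


def chars_per_phoneme(word, phones):
    """Characters divided by phonemes for consonants, vowels, and the whole word."""
    vowel_chars = sum(1 for ch in word if ch in CHR_VOWELS)
    cons_chars = len(word) - vowel_chars
    vowel_phones = sum(1 for p in phones if p[-1:].isdigit())
    cons_phones = len(phones) - vowel_phones
    return {
        "consonants": safe_divide(cons_chars, cons_phones),
        "vowels": safe_divide(vowel_chars, vowel_phones),
        "all": safe_divide(len(word), len(phones)),
    }


# HOBBS: 4 consonant letters over 3 consonant phonemes, 1 vowel letter over 1 vowel phoneme
result = chars_per_phoneme("HOBBS", ["HH", "AA1", "B", "Z"])
print(result)
```

A value above 1 means several characters share one phoneme, which is what a doubled consonant such as BB produces.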

In [16]:
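#average characters per phoneme (consonants)
#average characters per phoneme (vowels)
#average characters per phoneme (all)
#note: the column names below say phonemes_per_character, but each value is characters divided by phonemes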


#use safe_divide function in .apply function for potential division by 0 problems

decoding_df['avg_phonemes_per_character_consonants'] = decoding_df.apply(lambda row: safe_divide(row.num_consonants_characters, row.num_consonants_phonemes), axis=1)
decoding_df['avg_phonemes_per_character_vowels'] = decoding_df.apply(lambda row: safe_divide(row.num_vowel_characters, row.num_vowel_phonemes), axis=1)
#this is the reciprocal of discrepancy ratio
decoding_df['avg_phonemes_per_character_all'] = decoding_df.apply(lambda row: safe_divide(row.num_letters, row.num_phonemes), axis=1)


decoding_df[50000:50020]


Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes,discrepancy_raw,discrepancy_ratio,avg_syllable_length,num_consonants_characters,num_vowel_characters,num_consonants_phonemes,num_vowel_phonemes,avg_phonemes_per_character_consonants,avg_phonemes_per_character_vowels,avg_phonemes_per_character_all
50000,HOBBS,"[HH, AA1, B, Z]",1.0,5.0,4.0,-1.0,0.8,5.0,4,1,3,1,1.333333,1.0,1.25
50001,HOBBY,"[HH, AA1, B, IY0]",2.0,5.0,4.0,-1.0,0.8,2.5,3,2,2,2,1.5,1.0,1.25
50002,HOBBYIST,"[HH, AA1, B, IY0, IH0, S, T]",3.0,8.0,7.0,-1.0,0.875,2.666667,5,3,4,3,1.25,1.0,1.142857
50003,HOBBYISTS,"[HH, AA1, B, IY0, IH0, S, T, S]",3.0,9.0,8.0,-1.0,0.888889,3.0,6,3,5,3,1.2,1.0,1.125
50004,HOBDAY,"[HH, AA1, B, D, EY2]",2.0,6.0,5.0,-1.0,0.833333,3.0,3,3,3,2,1.0,1.5,1.2
50005,HOBDY,"[HH, AA1, B, D, IY0]",2.0,5.0,5.0,0.0,1.0,2.5,3,2,3,2,1.0,1.0,1.0
50006,HOBEN,"[HH, AA1, B, AH0, N]",2.0,5.0,5.0,0.0,1.0,2.5,3,2,3,2,1.0,1.0,1.0
50007,HOBERG,"[HH, OW1, B, ER0, G]",2.0,6.0,5.0,-1.0,0.833333,3.0,4,2,3,2,1.333333,1.0,1.2
50008,HOBERMAN,"[HH, OW1, B, ER0, M, AH0, N]",3.0,8.0,7.0,-1.0,0.875,2.666667,5,3,4,3,1.25,1.0,1.142857
50009,HOBERT,"[HH, AA1, B, ER0, T]",2.0,6.0,5.0,-1.0,0.833333,3.0,4,2,3,2,1.333333,1.0,1.2


In [17]:
decoding_df

Unnamed: 0,words,pronounciations,num_syllables,num_letters,num_phonemes,discrepancy_raw,discrepancy_ratio,avg_syllable_length,num_consonants_characters,num_vowel_characters,num_consonants_phonemes,num_vowel_phonemes,avg_phonemes_per_character_consonants,avg_phonemes_per_character_vowels,avg_phonemes_per_character_all
0,A,[AH0],1.0,1.0,1.0,0.0,1.000000,1.000000,0,1,0,1,0.000000,1.000000,1.000000
1,AS,"[EY1, Z]",1.0,2.0,2.0,0.0,1.000000,2.000000,1,1,1,1,1.000000,1.000000,1.000000
2,A42128,"[EY1, F, AO1, R, T, UW1, W, AH1, N, T, UW1, EY...",6.0,6.0,13.0,7.0,2.166667,1.000000,5,1,7,6,0.714286,0.166667,0.461538
3,AA,"[EY2, EY1]",2.0,2.0,2.0,0.0,1.000000,1.000000,0,2,0,2,0.000000,1.000000,1.000000
4,AAA,"[T, R, IH2, P, AH0, L, EY1]",3.0,3.0,7.0,4.0,2.333333,1.000000,0,3,4,3,0.000000,1.000000,0.428571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121748,ZYSK,"[Z, IH1, S, K]",1.0,4.0,4.0,0.0,1.000000,4.000000,3,1,3,1,1.000000,1.000000,1.000000
121749,ZYSKOWSKI,"[Z, IH0, S, K, AO1, F, S, K, IY0]",3.0,9.0,9.0,0.0,1.000000,3.000000,6,3,6,3,1.000000,1.000000,1.000000
121750,ZYUGANOV,"[Z, Y, UW1, G, AA0, N, AA0, V]",3.0,8.0,8.0,0.0,1.000000,2.666667,4,4,5,3,0.800000,1.333333,1.000000
121751,ZYUGANOVS,"[Z, Y, UW1, G, AA0, N, AA0, V, Z]",3.0,9.0,9.0,0.0,1.000000,3.000000,5,4,6,3,0.833333,1.333333,1.000000


## Grapheme‑Phoneme Complexity using Berndt et al. (1987)




### Average of Grapheme‑Phoneme Complexity by word

We will calculate the following for vowels, for consonants, and for vowels and consonants combined:

1. Prior probability
2. Max probability: The maximum probability strength between the grapheme(s) and a phoneme
3. Mid probability: The mid-range probability strength between the grapheme(s) and a phoneme
4. Min probability: The minimum probability strength between the grapheme(s) and a phoneme
5. Number of phonemes: The number of phonemes associated with a grapheme
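As a quick check on what mid-range means here, the sketch below recomputes it as the midpoint of the max and min probabilities for three graphemes. The values are copied from the one-character consonant table printed later in this notebook; reading the mid value as (max + min) / 2 is an inference from those rows, not something the source states. The function name `mid_range` is hypothetical.

```python
import math

# (max_prob, min_prob, mid_prob) copied from the one-character consonant table
rows = {
    "C": (0.757, 0.008, 0.3825),
    "D": (0.991, 0.008, 0.4995),
    "G": (0.64, 0.008, 0.324),
}


def mid_range(max_prob, min_prob):
    """Midpoint between the strongest and weakest grapheme-phoneme probability."""
    return (max_prob + min_prob) / 2


# Compare the recomputed midpoint with the tabulated mid value
matches = {g: math.isclose(mid_range(mx, mn), mid) for g, (mx, mn, mid) in rows.items()}
print(matches)
```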

Do this first for consonants (after removing the vowel graphemes that contain consonant characters), then for vowels, and then combine the two sets of results. Do not process vowels and consonants in the same pass.

Start with the longer character clusters and remove each one from the word once it has been counted, so that nothing is double counted. The order is:

1. Three- and four-character vowels (just remove them, because a word like bought needs to have the ough removed first, leaving bt)
2. Four-character graphemes
3. Three-character graphemes
4. Two-character graphemes that are exceptions (see regular expressions)
5. Two-character graphemes that are not exceptions
6. One-character graphemes
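The longest-first idea can be sketched with two tiny lookup tables. The entries are copied from the tables printed later in this notebook, but the function name `strip_graphemes` and the way the tables are passed in are hypothetical; the notebook's own cells write the loops out per table.

```python
def strip_graphemes(word, tables):
    """Match graphemes table by table (longest graphemes first) and
    remove each match so shorter graphemes cannot count it again."""
    matched = []
    for table in tables:
        for grapheme, stats in table.items():
            while grapheme in word:  # loop so repeated graphemes are all counted
                matched.append((grapheme, stats))
                word = word.replace(grapheme, "", 1)  # remove one occurrence at a time
    return word, matched


# Values are [prior, max, min, mid, number of phonemes]
four_char = {"NGUE": [1e-05, 1.0, 1.0, 1.0, 1.0], "SSIO": [0.0004, 1.0, 1.0, 1.0, 1.0]}
one_char = {"M": [0.0313, 0.971, 0.028, 0.4995, 2.0]}

leftover, matched = strip_graphemes("TONGUE", [four_char, one_char])
print(leftover, [g for g, _ in matched])

leftover2, matched2 = strip_graphemes("MOM", [four_char, one_char])
print(leftover2, [g for g, _ in matched2])
```

One caveat of removing matches from the string: the characters on either side of a removed cluster become adjacent, so a later table could match a cluster that was never contiguous in the original word.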

Call in dataframes

In the data frames, the columns are

0 - Grapheme

1 - Prior_prob

2 - Max_prob

3 - Min_prob

4 - Mid_prob

5 - Num_phonemes

In [18]:
#call in dictionaries

import pandas as pd


#call in dataframe by number of words
berndt_1_df = pd.read_csv('cons_cond_prob_berndt_no_except_one_char.csv', na_values='', header=None)

berndt_2_df = pd.read_csv('cons_cond_prob_berndt_no_except_two_char.csv', na_values='', header=None)

berndt_3_df = pd.read_csv('cons_cond_prob_berndt_no_except_three_char.csv', na_values='', header=None)
#this has following rules added
#si as sh/zh needs to become sio (for words like passion and mansion) 
#ti as sh to tio (captures everything in nation and similar words)

berndt_4_df = pd.read_csv('cons_cond_prob_berndt_no_except_four_char.csv', na_values='', header=None)
#this has following rule added (ssi as sh/zh needs to become ssio to distinguish harnessing from possession)

berndt_1_df
#berndt_2_df

Unnamed: 0,0,1,2,3,4,5
0,B,0.0206,1.0,1.0,1.0,1
1,C,0.042,0.757,0.008,0.3825,3
2,D,0.0336,0.991,0.008,0.4995,2
3,F,0.0146,0.998,0.001,0.4995,2
4,G,0.0169,0.64,0.008,0.324,3
5,H,0.007,1.0,1.0,1.0,1
6,J,0.002,1.0,1.0,1.0,1
7,K,0.0055,1.0,1.0,1.0,1
8,L,0.0451,1.0,1.0,1.0,1
9,M,0.0313,0.971,0.028,0.4995,2


In [19]:
#Start small

#Convert to dictionary

berndt_4_df_tran = berndt_4_df.set_index(0).transpose() #this transposes, but also makes column 0 (the graphemes) the header
berndt_4_df_tran = berndt_4_df_tran.to_dict('list') #makes it into a single dictionary with each grapheme as key and its list of values as value


berndt_3_df_tran = berndt_3_df.set_index(0).transpose() #same steps for the three-character graphemes
berndt_3_df_tran = berndt_3_df_tran.to_dict('list')


berndt_2_df_tran = berndt_2_df.set_index(0).transpose() #same steps for the two-character graphemes
berndt_2_df_tran = berndt_2_df_tran.to_dict('list')


berndt_1_df_tran = berndt_1_df.set_index(0).transpose() #same steps for the one-character graphemes
berndt_1_df_tran = berndt_1_df_tran.to_dict('list')

#print out key and values

print(berndt_4_df_tran)

#see what is in there
counter = 0

for key, val in berndt_1_df_tran.items():
    print(key)
    print(val)
    counter += 1
    if counter == 5:
        break



{'NGUE': [1e-05, 1.0, 1.0, 1.0, 1.0], 'SSIO': [0.0004, 1.0, 1.0, 1.0, 1.0]}
B
[0.0206, 1.0, 1.0, 1.0, 1.0]
C
[0.042, 0.757, 0.008, 0.3825, 3.0]
D
[0.0336, 0.991, 0.008, 0.4995, 2.0]
F
[0.0146, 0.998, 0.001, 0.4995, 2.0]
G
[0.0169, 0.64, 0.008, 0.324, 3.0]


**Start here with counts for consonants**

Calculate values for 3 and 4 character consonant phonemes first.

In [20]:

#Full list of words
#words = ['BOUGHT', 'DOUGH', 'HEIGHT', 'STRAIGHT', 'CAUGHT', 'VIEW', 'NIGHT', 'AWE', 'BROWSE', 'GARGOYLE', 'JUICE', 'SCREEN', 'MEN', 'EATEN', 'BOBBY', 'TONGUE', 'ROGUE', 'TALK', 'ALL', 'LATER', 'QUILTED', 'SOCIAL', 'FUSION', 'PASSED', 'SKETCH', 'SPECIAL', 'MISSION', 'CAPTION', 'A', 'MEDAL', 'INCONSEQUENTIAL', 'GOAL', 'WASHED', 'PACED', 'PICKED', 'GLARED', 'EASEL', 'ABLE', 'FEEL', "EEL", "YES", 'BELIEVE', 'KISSES', 'TEST', 'GIRL', 'ENGINE', 'DETAIL', 'PENCIL', 'NIMBLE', 'EVIL', 'GENTILE', 'CLEAN', 'PISTOL', 'DROOL', 'PYLON', 'GOON', 'NATION', 'CHANGES', 'BRAVES', 'TIMES', 'DROOL']

#smaller list of words
#words = ['BOUGHT',  'MEN', 'EATEN', 'TONGUE', 'AWE', 'MISSIONED', 'GOAL', 'WASHED', 'ABLE', 'KISSES', 'TIMES', 'MOM']

#words from the actual CMU dict
#words_2 = ['A', 'AS', 'A42128', 'AA', 'AAA', 'AABERG', 'AACHEN', 'AACHENER', 'AAH', 'AAKER']

#set up list for cleaned words
clean_words = []

#set up list of lists to index results
prior_prob = [[] for x in range(len(words))] #set up a list of lists that is as long as the list of words
max_prob = [[] for x in range(len(words))] 
min_prob = [[] for x in range(len(words))] 
mid_prob = [[] for x in range(len(words))] 
number_phonemes = [[] for x in range(len(words))] 


for i, word in enumerate(words):  # iterate over the list of words; i is the index of the current word, starting at 0
    for key, val in berndt_4_df_tran.items(): #start with 4 character phonemes. Call in dictionary
        while key in word: #while key in dictionary is in word (allows for counts of multiple keys as in the M in MOM)
            #print(key) #the key to make sure it is working
            #how the values are numbered
            #print(val[0]) #prior_prob
            #print(val[1]) #max_prob
            #print(val[2]) #min_prob
            #print(val[3]) #mid_prob
            #print(val[4]) #num_phonemes
            prior_prob[i].append(val[0]) #appends the value to the i-th inner list, i.e. the list for the current word
            max_prob[i].append(val[1]) 
            min_prob[i].append(val[2]) 
            mid_prob[i].append(val[3]) 
            number_phonemes[i].append(val[4]) 
            word = word.replace(key, '', 1) #replace the key with empty characters. Does this one at a time though to count all incidences that may repeat (M in MOM)
    for key, val in berndt_3_df_tran.items(): #next is 3 character phonemes
        while key in word:
            #print(key) #the key
            prior_prob[i].append(val[0])
            max_prob[i].append(val[1]) 
            min_prob[i].append(val[2]) 
            mid_prob[i].append(val[3]) 
            number_phonemes[i].append(val[4]) 
            word = word.replace(key, '', 1)
    clean_words.append(word) #this is a check to make sure the code is removing character phonemes

#print(words_2)
#print(prior_prob)
#print(max_prob)
#print(min_prob)
#print(mid_prob)
#print(number_phonemes)
#print(clean_words)



In [21]:
print(clean_words[:10])
print(len(clean_words))
prior_prob[:10]


['A', 'AS', 'A42128', 'AA', 'AAA', 'AABERG', 'AACHEN', 'AACHENER', 'AAH', 'AAKER']
121753


[[], [], [], [], [], [], [], [], [], []]

**Next, call in exception consonant phonemes that are all two characters long**

These mostly happen at the end of a word and should not overlap with the 3-4 character graphemes. One known overlap is special: once 'cia' is removed, the remainder is spel, and the el at the end is then treated as the el exception, as in a word like 'easel'.

The rules are:

- al as ul at the end of a word, as long as the word is longer than 4 characters (medal, commercial, but not veal, goal, pal, gal)

- ed as t in the past tense at the end of a word when preceded by a voiceless consonant (picked, paced, passed, washed). The voiceless-consonant condition is not in the data, so the rule probably has to apply to all words ending in ed. Words should be longer than 3 characters, so it picks up 'aced' but not 'red'

- el as l at the end of a word when not preceded by a vowel (easel but not feel)

- en as un at the end of a word when preceded by a consonant and the word is over 3 characters (so raven and eaten, but not screen or men)

- es as z at the end of a word, except after a sibilant (wishes, whizzes, kisses, races, prizes, watches, changes). Words should be longer than 3 letters (aces but not yes)

- gi as dg except at the beginning of a word (engine but not girl)

- il as ul at the end of a word, except when preceded by a vowel (pencil but not detail)

- le as ul at the end of a word when preceded by a consonant (able and nimble but not gentile)

- ol as ul at the end of a word when preceded by a consonant (pistol and carol but not drool)

- on as un at the end of a word when preceded by a consonant (pylon and iron but not goon or nation)
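Several of these rules can be written as anchored regular expressions, as in the sketch below. It covers only the rules whose conditions are simple to state as a pattern (al, ed, el, il, le, ol, on) and leaves out en, es, and gi. The names `END_RULES` and `matching_rule` are hypothetical, and the vowel class deliberately excludes Y.

```python
import re

# One anchored pattern per end-of-word rule
END_RULES = {
    "AL": re.compile(r"^.{3,}AL$"),       # word longer than 4 characters
    "ED": re.compile(r"^.{2,}ED$"),       # word longer than 3 characters
    "EL": re.compile(r"^.+[^AEIOU]EL$"),  # not preceded by a vowel
    "IL": re.compile(r"^.*[^AEIOU]IL$"),  # not preceded by a vowel
    "LE": re.compile(r"^.*[^AEIOU]LE$"),  # preceded by a consonant
    "OL": re.compile(r"^.*[^AEIOU]OL$"),  # preceded by a consonant
    "ON": re.compile(r"^.*[^AEIOU]ON$"),  # preceded by a consonant
}


def matching_rule(word):
    """Return the key of the first rule whose pattern matches, or None."""
    for key, pattern in END_RULES.items():
        if pattern.search(word):
            return key
    return None


for w in ["MEDAL", "GOAL", "PICKED", "RED", "EASEL", "FEEL", "PENCIL", "DETAIL",
          "NIMBLE", "GENTILE", "PISTOL", "DROOL", "PYLON", "NATION"]:
    print(w, matching_rule(w))
```

Testing the ending with an anchored pattern separates the decision about whether a rule applies from the later step of removing the grapheme from the word.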


In [None]:
import re

#print(clean_words)

al_dict = {'AL': [.00003, 1.0, 1.0, 1.0, 1.0]}
ed_dict = {'ED': [0.0002, 1, 1, 1, 1]}
el_dict = {'EL': [0.0001, 1, 1, 1, 1]}
en_dict = {'EN': [.00007, 1.0, 1.0, 1.0, 1.0]}
es_dict = {'ES': [0.0004, 1, 1, 1, 1]}
gi_dict = {'GI': [0.0001, 1, 1, 1, 1]}
il_dict = {'IL': [0.00006, 1, 1, 1, 1]}
le_dict = {'LE': [0.0057, 1, 1, 1, 1]}
ol_dict = {'OL': [0.000009, 1, 1, 1, 1]}
on_dict = {'ON': [0.00002, 1, 1, 1, 1]}
qu_dict = {'QU': [0.002, 0.876, 0.123, 0.4995, 2]}
ti_dict = {'TI': [0.0076, 0.983, 0.001, 0.449, 3]}

except_words_2 = [] #list of words that need to be removed from the words_2 before the next stage of cleaning
clean_words_2 = [] #this is a list of partial words that need to be moved back into clean_words

for i, word in enumerate(clean_words):  # iterate over the list of words, starting with word 0. For i starts count at 0 and enumerates through list
    except_words = [] #holder list of words to remove
    # for words that end in al and are over 4 letters long
    if len(word) > 4: #if word longer than 4 characters
        if re.match(r'.*AL$', word): #if word ends in AL
            #print(word)
            except_words.append(word) #put the word in the exception list
            for word in except_words:
                #print(word)
                #need to remove word from words_2 now. Later, will need to put clean_words back into words_2
                for key, val in al_dict.items(): #call in dictionary
                    #print(key)
                    while key in word: #if key is in word (i.e., does word include 'al', append information
                        #print(key)
                        prior_prob[i].append(val[0])
                        max_prob[i].append(val[1]) 
                        min_prob[i].append(val[2]) 
                        mid_prob[i].append(val[3]) 
                        number_phonemes[i].append(val[4]) 
                        word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                clean_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
            #print(except_words)
    except_words_2.append(except_words) #store this word's exception list once; matches from the rules further down still show up here because they append to the same list object
    # for words that end in ed
    if len(word) > 3: #if word longer than 3
        if re.match(r'.*ED$', word): #if it ends in ED
            #print(word)
            except_words.append(word)
            for word in except_words:
                for key, val in ed_dict.items():
                    while key in word:
                        #print(key)
                        prior_prob[i].append(val[0])
                        max_prob[i].append(val[1]) 
                        min_prob[i].append(val[2]) 
                        mid_prob[i].append(val[3]) 
                        number_phonemes[i].append(val[4]) 
                        word = word.replace(key, '', 1)
                clean_words_2.append(word) 
    if re.match(r'.+[^AEIOU]EL$', word):
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in el_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word) 
    if len(word) > 3: #if word longer than 3
        if re.match(r'.+[^AEIOU]EN$', word): #if it ends in EN
            #print(word)
            except_words.append(word)
            for word in except_words:
                for key, val in en_dict.items():
                    while key in word:
                        #print(key)
                        prior_prob[i].append(val[0])
                        max_prob[i].append(val[1]) 
                        min_prob[i].append(val[2]) 
                        mid_prob[i].append(val[3]) 
                        number_phonemes[i].append(val[4]) 
                        word = word.replace(key, '', 1)
                clean_words_2.append(word) 
    if re.match(r'.+[^SZHGC]ES$', word): #if word ends in ES and the letter before ES is not S, Z, H, G, or C (i.e., not after a sibilant spelling)
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in es_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word)
    if re.match(r'(?!GI)\w*GI\w*(?<!GI)', word): #if words contains GI, except at the beginning
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in gi_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word)
    if re.match(r'.+[^AEIOU]IL$', word): #if word ends in IL
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in il_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word)
    if re.match(r'.+[^AEIOU]LE$', word): #if word ends in LE
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in le_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word)
    if re.match(r'.+[^AEIOU]OL$', word): #if words ends in OL
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in ol_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word) 
    if re.match(r'.+[^AEIOU]ON$', word): #if word ends in ON
        #print(word)
        except_words.append(word)
        for word in except_words:
            for key, val in on_dict.items():
                while key in word:
                    #print(key)
                    prior_prob[i].append(val[0])
                    max_prob[i].append(val[1]) 
                    min_prob[i].append(val[2]) 
                    mid_prob[i].append(val[3]) 
                    number_phonemes[i].append(val[4]) 
                    word = word.replace(key, '', 1)
            clean_words_2.append(word) 
            
#print(f'these are the words to remove later {except_words_2}')                      
#print(f'these are what remains of the words above to add back to the original list of words after the complete words have been removed {clean_words_2}')
#print(prior_prob)
#print(max_prob)
#print(min_prob)
#print(mid_prob)
#print(number_phonemes)


these are the words to remove later [[], [], [], ['EATEN'], [], [], ['MINED'], [], ['WASHED'], ['ABLE'], [], ['TIMES'], []]
these are what remains of the words above to add back to the original list of words after the complete words have been removed ['EAT', 'MIN', 'WASH', 'AB', 'TIM']
[[], [], [], [7e-05], [1e-05], [], [0.0004, 0.0002], [], [0.0002], [0.0057], [], [0.0004], []]
[[], [], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1], []]
[[], [], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1], []]
[[], [], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1], []]
[[], [], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1], []]
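
One behaviour of the loop above is worth checking: `while key in word` combined with `word.replace(key, '', 1)` removes the first occurrence of the grapheme wherever it sits in the word. A word such as EDITED would therefore lose both its leading and its final ED and be counted twice. If only the word-final grapheme should be counted, a suffix-only helper is one option. This is a minimal sketch; `strip_final_grapheme` and `strip_all_occurrences` are hypothetical names and are not part of the notebook.

```python
def strip_final_grapheme(word, grapheme):
    """Remove `grapheme` only when it ends `word`; return (remainder, removed?)."""
    if word.endswith(grapheme):
        return word[:-len(grapheme)], True
    return word, False

def strip_all_occurrences(word, grapheme):
    """Mimic the replace-first loop: remove every occurrence, counting each one."""
    count = 0
    while grapheme in word:
        word = word.replace(grapheme, '', 1)
        count += 1
    return word, count

print(strip_all_occurrences('EDITED', 'ED'))  # ('IT', 2)
print(strip_final_grapheme('EDITED', 'ED'))   # ('EDIT', True)
```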


**Replace existing word list with shortened words as a result of above for VOWEL COUNTS later**

This will give us a list of the remaining words in which 3-4 character phonemes have been removed along with 2 character exception phonemes.

This list will be used for vowel probabilities.

We will continue with extracting consonants and then move to vowels. But the vowel rules require that the words are mostly intact except for the removal of exceptions.

So, for instance, the word TONGUE needs to be shortened to TO, or the vowel counts will be calculated on the UE at the end of the word (even though that UE is part of NGUE, which is a consonant).



In [None]:
#print(f'these are the original words {words}\n') #the original list

#print(f'these are the original words after first pass for 3-4 character phonemes {clean_words}\n') #the original list

words_to_remove = []
for sublist in except_words_2:
    for word in sublist:
        words_to_remove.append(word)

#print(f'These are the words with exceptions that were processed and need to be removed {words_to_remove}\n') #note, special is a hard one... cia is removed [sh]
# leaving spel, which ends in an 'el'

#print(f'these are the parts of the exception words left over after processing {clean_words_2}\n')



In [None]:
#now, we need a word list for the next set of consonants and, later, for extracting vowel probabilities
#1. take the original words
#2. remove the exception words that were processed
#3. replace those words with their remaining parts after processing in the same order


# create a new list that will store the updated word list
clean_words_vowels = []

# loop through the words in the original list
for word in clean_words:
    #print(word)
    # if the word was processed as an exception, replace it with the corresponding set of characters that remain after processing
    if word in words_to_remove:
        #print(word)
        clean_words_vowels.append(clean_words_2[words_to_remove.index(word)])#index returns the position at the first occurrence of the specified value
    # if the word is not in words_to_remove, add it to the clean_words_vowels list unchanged
    else:
        clean_words_vowels.append(word)

# print the final list for vowel probability extraction
#print(f'this was the original list of words {words_2}\n')
#print(f'this is the list for vowel extraction and next steps in consonants {clean_words_vowels}') #Think this is right except for special, which should be spel, but is sp
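
The index lookup above relies on `words_to_remove` and `clean_words_2` staying in the same order and at the same length, which can drift if one word triggers more than one rule. A dictionary built while the exceptions are processed avoids that dependency. The sketch below assumes a hypothetical `remainders` mapping from each processed word to what is left of it; the word pairs are taken from the printed output earlier in this section, and `sample_words` is illustrative input only.

```python
# Hypothetical mapping from processed exception word to its remainder
remainders = {
    'EATEN': 'EAT',
    'MINED': 'MIN',
    'WASHED': 'WASH',
    'ABLE': 'AB',
    'TIMES': 'TIM',
}

sample_words = ['MEN', 'EATEN', 'GOAL', 'WASHED', 'ABLE']  # illustrative input only

# dict.get falls back to the word itself when no exception was processed
clean_words_vowels_sketch = [remainders.get(word, word) for word in sample_words]
print(clean_words_vowels_sketch)  # ['MEN', 'EAT', 'GOAL', 'WASH', 'AB']
```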



**Next, call in 1 and 2 character phonemes (consonants)**

***HERE WE CONTINUE WITH THE CONSONANT COUNTS***

Run counts for 1 and 2 consonant character phonemes on the remaining characters

NOTE that the characters left over after this step will be useless for counting vowel probabilities. For example:

['E', '', 'OY', 'O', 'O', 'A', 'A', 'AE', 'O', 'U', 'E', 'I', 'AIO', 'A', 'OA', 'EE', 'EE', 'YE', 'EIEE', 'IE', 'E', 'I', 'EAI', 'IE', 'EA', 'OO', 'OO', 'AIO', 'AE', 'OO', 'EA', 'UI', 'A', '', 'E', 'IOEUI', 'A', 'A', 'I', 'A', 'EA', 'A', 'E', '', 'I', 'E', 'I', 'Y', 'A', 'I']

This is because we no longer know where the vowels were broken up by consonants. Hence, for the vowel counts we have to use the words as they were after only the exception consonants were removed (i.e., clean_words_vowels).

Also, we first remove vowels that contain consonants (e.g., AIGH in STRAIGHT) so they are not counted twice.

In [None]:
#print(f'this was the original list of words {words}\n')
#print(f'this is the list of words for final consonant cleaning with 2 and 1 character consonant phonemes {clean_words_vowels}\n')


vowels_to_remove = ['AIGH', 'AUGH', 'EIGH', 'OUGH', 'IGH', 'IEW'] #these are long vowels that contain consonants. Need to remove them so the consonants are not counted (e.g., STRAIGHT, VIEW)

clean_words_3 = []

for word in clean_words_vowels:
    for vowel in vowels_to_remove:
        if vowel in word:
            word = word.replace(vowel, '_') #replace the key with _ that represents a vowel. If you replace with nothing, for the word BOUGHT you get BT, which is a consonant with two phonemes
    clean_words_3.append(word)

#print(f'this is the list of words with vowels containing consonants removed{clean_words_3}')
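
The order of `vowels_to_remove` matters: the four-letter patterns must be tried before IGH, otherwise IGH would be cut out of AIGH or EIGH first and leave a stray vowel behind. A small sketch of that effect, using a hypothetical helper `mask_vowels`:

```python
def mask_vowels(word, patterns):
    """Replace each pattern found in `word` with '_' in the order given."""
    for pattern in patterns:
        if pattern in word:
            word = word.replace(pattern, '_')
    return word

longest_first = ['AIGH', 'AUGH', 'EIGH', 'OUGH', 'IGH', 'IEW']
shortest_first = sorted(longest_first, key=len)  # IGH and IEW now come first

print(mask_vowels('STRAIGHT', longest_first))   # STR_T
print(mask_vowels('STRAIGHT', shortest_first))  # STRA_T
print(mask_vowels('BOUGHT', longest_first))     # B_T
```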

            
            

In [None]:

clean_words_4 = []

for i, word in enumerate(clean_words_3):  # iterate over the list of words, starting with word 0. For i starts count at 0 and enumerates through list
    for key, val in berndt_2_df_tran.items(): #next is 2 character phonemes
        while key in word:
            #print(key) #the key
            prior_prob[i].append(val[0])
            max_prob[i].append(val[1]) 
            min_prob[i].append(val[2]) 
            mid_prob[i].append(val[3]) 
            number_phonemes[i].append(val[4]) 
            word = word.replace(key, '', 1)
    for key, val in berndt_1_df_tran.items(): #next is 1 character phonemes
        while key in word:
            #print(key) #the key
            prior_prob[i].append(val[0])
            max_prob[i].append(val[1]) 
            min_prob[i].append(val[2]) 
            mid_prob[i].append(val[3]) 
            number_phonemes[i].append(val[4]) 
            word = word.replace(key, '', 1)
    clean_words_4.append(word) #this is a check to make sure the code is removing character phonemes

#print(f'this was the original list of words {words_2}\n')
#print(f'this is the list of words for final consonant cleaning with 2 and 1 character consonant phonemes {clean_words_3}\n')
#print(prior_prob)
#print(max_prob)
#print(min_prob)
#print(mid_prob)
#print(number_phonemes)
#print(f'this is what is left of the words with all the consonants counted {clean_words_4}')

this was the original list of words ['LAUGH', 'BOUGHT', 'MEN', 'EATEN', 'TONGUE', 'AWE', 'MISSIONED', 'GOAL', 'WASHED', 'ABLE', 'KISSES', 'TIMES', 'MOM']

this is the list of words for final consonant cleaning with 2 and 1 character consonant phonemes ['L_', 'B_T', 'MEN', 'EAT', 'TO', 'AWE', 'MIN', 'GOAL', 'WASH', 'AB', 'KISSES', 'TIM', 'MOM']

[[0.0451, 0.0451], [0.0206, 0.0713, 0.0206, 0.0713], [0.0313, 0.071, 0.0313, 0.071], [7e-05, 0.0713, 0.0713], [1e-05, 0.0713, 0.0713], [0.0053, 0.0053], [0.0004, 0.0002, 0.0313, 0.071, 0.0313, 0.071], [0.0169, 0.0451, 0.0169, 0.0451], [0.0002, 0.0036, 0.0053, 0.0036, 0.0053], [0.0057, 0.0206, 0.0206], [0.0042, 0.0055, 0.0488, 0.0042, 0.0055, 0.0488], [0.0004, 0.0313, 0.0713, 0.0313, 0.0713], [0.0313, 0.0313, 0.0313, 0.0313]]
[[1.0, 1.0], [1.0, 0.973, 1.0, 0.973], [0.971, 0.967, 0.971, 0.967], [1.0, 0.973, 0.973], [1.0, 0.973, 0.973], [1.0, 1.0], [1.0, 1, 0.971, 0.967, 0.971, 0.967], [0.64, 1.0, 0.64, 1.0], [1, 1.0, 1.0, 1.0, 1.0], [1, 1.0, 1.0],

In [None]:
print(words[:10])
print(prior_prob[:10])

This is what we had after the exceptions and 3 and 4 character consonant phonemes

[[], [], [7e-05], [1e-05], [], [0.0004, 0.0002], [], [0.0002], [0.0057], [], [0.0004]]
[[], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1]]
[[], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1]]
[[], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1]]
[[], [], [1.0], [1.0], [], [1.0, 1], [], [1], [1], [], [1]]

So, it is looking good

Next steps: average each list. If a list is empty, first assign it a value of 0. Then move the averages to the pandas df.

In [None]:

# give every empty sub-list a single 0 so that it can be averaged
for prob_lists in (prior_prob, max_prob, min_prob, mid_prob, number_phonemes):
    for item in prob_lists:
        if len(item) == 0:
            item.append(0)

#print(prior_prob)
#print(max_prob)
#print(min_prob)
#print(mid_prob)
#print(number_phonemes)

prior_prob_avg = [sum(sub_list) / len(sub_list) for sub_list in prior_prob]
max_prob_avg = [sum(sub_list) / len(sub_list) for sub_list in max_prob]
min_prob_avg = [sum(sub_list) / len(sub_list) for sub_list in min_prob]
mid_prob_avg = [sum(sub_list) / len(sub_list) for sub_list in mid_prob]
number_phonemes_avg = [sum(sub_list) / len(sub_list) for sub_list in number_phonemes]

#print(prior_prob_avg)
#print(max_prob_avg)
#print(min_prob_avg)
#print(mid_prob_avg)
#print(number_phonemes_avg)
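
The fill-then-average steps above follow one pattern, so they can be expressed once. A minimal sketch, assuming a hypothetical helper `mean_or_zero`; unlike the cell above, it does not modify the input lists. The sample values are copied from the printed prior-probability output.

```python
def mean_or_zero(values):
    """Average a list of numbers, returning 0 for an empty list."""
    return sum(values) / len(values) if values else 0

sample_prior = [[], [7e-05], [0.0004, 0.0002]]  # values from the printed output
sample_prior_avg = [mean_or_zero(sub_list) for sub_list in sample_prior]
print(sample_prior_avg)
```

Keeping the empty lists untouched also makes it possible to tell later whether a 0 means "no matching grapheme" or an actual averaged value.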



In [None]:
print(prior_prob_avg[:10])
print(max_prob_avg[:10])
print(min_prob_avg[:10])
print(mid_prob_avg[:10])
print(number_phonemes_avg[:10])

In [None]:

decoding_df['prior_prob_cons'] = prior_prob_avg #add to dataframe
decoding_df['max_prob_cons'] = max_prob_avg #add to dataframe
decoding_df['min_prob_cons'] = min_prob_avg #add to dataframe
decoding_df['mid_prob_cons'] = mid_prob_avg #add to dataframe
decoding_df['number_phonemes_cons'] = number_phonemes_avg #add to dataframe



In [None]:
decoding_df[25000:25010]

In [None]:
#save to a .csv for later use if needed

decoding_df.to_csv("everything_except_vowel_prob.csv")


**Start here with counts for vowels**

Calculate values for 3 and 4 character vowel phonemes first.

Start all over with the word list after consonant exceptions. This will have the meaningful vowels left in it


First, call in dataframes with conditional probabilities for vowels

In [1]:
import pandas as pd

#call in dataframes by number of characters in the vowel (1 to 4)
berndt_1_vow_df = pd.read_csv('vow_cond_prob_berndt_no_except_one_char.csv', na_values='', header=None)

berndt_2_vow_df = pd.read_csv('vow_cond_prob_berndt_no_except_two_char.csv', na_values='', header=None)

berndt_3_vow_df = pd.read_csv('vow_cond_prob_berndt_no_except_three_char.csv', na_values='', header=None)

berndt_4_vow_df = pd.read_csv('vow_cond_prob_berndt_no_except_four_char.csv', na_values='', header=None)

berndt_3_vow_df
#berndt_2_df

Unnamed: 0,0,1,2,3,4,5
0,EAU,0.0001,0.545,0.454,0.4995,2
1,EOU,7e-05,1.0,1.0,1.0,1
2,IGH,0.0008,1.0,1.0,1.0,1
3,lEU,3e-05,1.0,1.0,1.0,1
4,lEW,3e-05,1.0,1.0,1.0,1


Turn these dataframes into useable dictionaries

In [2]:
#Convert to dictionary

berndt_4_vow_df_tran = berndt_4_vow_df.set_index(0).transpose() #this transposes, but also makes column 0 (the words) the header
berndt_4_vow_df_tran =berndt_4_vow_df_tran.to_dict('list') #makes it into a single dictionary and uses header as key and list as values


berndt_3_vow_df_tran = berndt_3_vow_df.set_index(0).transpose() #this transposes, but also makes column 0 (the words) the header
berndt_3_vow_df_tran =berndt_3_vow_df_tran.to_dict('list') #makes it into a single dictionary and uses header as key and list as values


berndt_2_vow_df_tran = berndt_2_vow_df.set_index(0).transpose() #this transposes, but also makes column 0 (the words) the header
berndt_2_vow_df_tran =berndt_2_vow_df_tran.to_dict('list') #makes it into a single dictionary and uses header as key and list as values


berndt_1_vow_df_tran = berndt_1_vow_df.set_index(0).transpose() #this transposes, but also makes column 0 (the words) the header
berndt_1_vow_df_tran =berndt_1_vow_df_tran.to_dict('list') #makes it into a single dictionary and uses header as key and list as values

#print out key and values

#print(berndt_4_vow_df_tran)
#print(berndt_3_vow_df_tran)

#see what is in there
counter = 0

#for key, val in berndt_1_vow_df_tran.items():
#    print(key)
#    print(val)
#    counter += 1
#    if counter == 5:
#        break




{'AIGH': [3e-05, 1.0, 1.0, 1.0, 1.0], 'AUGH': [0.0001, 1.0, 1.0, 1.0, 1.0], 'EIGH': [0.0001, 0.857, 0.142, 0.4995, 2.0], 'OUGH': [0.0002, 0.517, 0.068, 0.2925, 4.0]}
{'EAU': [0.0001, 0.545, 0.454, 0.4995, 2.0], 'EOU': [7e-05, 1.0, 1.0, 1.0, 1.0], 'IGH': [0.0008, 1.0, 1.0, 1.0, 1.0], 'lEU': [3e-05, 1.0, 1.0, 1.0, 1.0], 'lEW': [3e-05, 1.0, 1.0, 1.0, 1.0]}
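
Because the later lookups test `key in word` against upper-case words, every dictionary key needs to be upper-case as well. The printed three-character dictionary contains `lEU` and `lEW`, which begin with a lower-case l and so would never match an upper-case word. A quick check like the one below can flag such keys. `suspicious_keys` is a hypothetical helper, and whether those two keys were meant to be IEU and IEW is an assumption to confirm against the source CSV.

```python
def suspicious_keys(table):
    """Return keys that are not purely upper-case ASCII letters."""
    return [key for key in table
            if not (key.isascii() and key.isalpha() and key.isupper())]

# Values copied from the printed three-character vowel dictionary
berndt_3_sample = {
    'EAU': [0.0001, 0.545, 0.454, 0.4995, 2.0],
    'EOU': [7e-05, 1.0, 1.0, 1.0, 1.0],
    'IGH': [0.0008, 1.0, 1.0, 1.0, 1.0],
    'lEU': [3e-05, 1.0, 1.0, 1.0, 1.0],
    'lEW': [3e-05, 1.0, 1.0, 1.0, 1.0],
}

print(suspicious_keys(berndt_3_sample))  # ['lEU', 'lEW']
```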


**Next set up new probability counts for vowels**

These will be added to the consonant counts later to compute the following:

1. Probability counts all
2. Probability counts vowels
3. Probability counts consonants

In [None]:
#print(f'this is the list for vowel extraction {clean_words_vowels}') #Think this is right except for special, which should be spel, but is sp. 
#Gargoyle also went to gargoy (because le is meaningful), but this seems fine.


**Start with 3-4 character exception vowels**

In [None]:
print(len(clean_words_vowels)) #a bare len() is only displayed when it is the last line of the cell
print(clean_words_vowels[:10])

In [None]:
#THIS IS WHAT WE SHOULD USE (the list below clean_words_vowels)
#print(f'this is the list for vowel extraction {clean_words_vowels}') #Think this is right except for special, which should be spel, but is sp. 

#BUT, SET UP CODE WITH THIS FOLLOWING LIST to test outcomes on more vowels

#clean_words_vowels_prac = ['DEVALUATE', 'CASTLE', 'SHOPPE', 'SHOVE', 'TO', 'OYE', 'AWE', 'HATE', 'HEARSE', 'PROTRUDE', 'PROTOTYPE', 'COUNTERWEIGHT', 'COYOTE', 'EQUATE']
#print(clean_words_vowels_prac)

#set up count lists
clean_words_vow = []

#set up list of lists to index vowel results
prior_prob_vow = [[] for x in range(len(clean_words_vowels))] #set up a list of lists that is as long as the list of words
max_prob_vow = [[] for x in range(len(clean_words_vowels))] 
min_prob_vow = [[] for x in range(len(clean_words_vowels))] 
mid_prob_vow = [[] for x in range(len(clean_words_vowels))] 
number_phonemes_vow = [[] for x in range(len(clean_words_vowels))] 


for i, word in enumerate(clean_words_vowels):  # iterate over the list of words, starting with word 0. For i starts count at 0 and enumerates through list
    for key, val in berndt_4_vow_df_tran.items(): #start with 4 character phonemes. Call in dictionary
        while key in word: #while key in dictionary is in word. This allows for multiple keys (i.e., the EA in MEATHEAD will be counted twice)
            #print(key) #the key to make sure it is working
            #how the values are numbered
            #print(val[0]) #prior_prob
            #print(val[1]) #max_prob
            #print(val[2]) #min_prob
            #print(val[3]) #mid_prob
            #print(val[4]) #num_phonemes
            prior_prob_vow[i].append(val[0])
            max_prob_vow[i].append(val[1]) 
            min_prob_vow[i].append(val[2]) 
            mid_prob_vow[i].append(val[3]) 
            number_phonemes_vow[i].append(val[4]) 
            word = word.replace(key, '', 1) #replace the key with empty characters, but replace only one occurrence at a time. Will ensure each item counted and put into respective lists.
    for key, val in berndt_3_vow_df_tran.items(): #next is 3 character phonemes
        while key in word:
            #print(key) #the key
            prior_prob_vow[i].append(val[0])
            max_prob_vow[i].append(val[1]) 
            min_prob_vow[i].append(val[2]) 
            mid_prob_vow[i].append(val[3]) 
            number_phonemes_vow[i].append(val[4]) 
            word = word.replace(key, '', 1)

    clean_words_vow.append(word) #this is a check to make sure the code is removing character phonemes

#print(prior_prob_vow)
#print(max_prob_vow)
#print(min_prob_vow)
#print(mid_prob_vow)
#print(number_phonemes_vow)
#print(clean_words_vow)

#seems to work fine

[[0.0001], [], [], [], [], [], [], [], [], [], [], [0.0001], [], [0.0001]]
[[1.0], [], [], [], [], [], [], [], [], [], [], [0.857], [], [1.0]]
[[1.0], [], [], [], [], [], [], [], [], [], [], [0.142], [], [1.0]]
[[1.0], [], [], [], [], [], [], [], [], [], [], [0.4995], [], [1.0]]
[[1.0], [], [], [], [], [], [], [], [], [], [], [2.0], [], [1.0]]
['L', 'CASTLE', 'SHOPPE', 'SHOVE', 'TO', 'OYE', 'AWE', 'HATE', 'HEARSE', 'PROTRUDE', 'PROTOTYPE', 'COUNTERWT', 'COYOTE', 'GT']
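
The four- and three-character passes share the same inner loop, so the pattern can be factored into a helper. This is a sketch under assumptions: `consume_graphemes` is a hypothetical name, and the sample table holds only the four-character values printed earlier.

```python
def consume_graphemes(word, table):
    """Remove every occurrence of each key from `word`, one at a time.

    Returns the remaining characters and the list of value rows collected.
    """
    rows = []
    for key, val in table.items():
        while key in word:
            rows.append(val)
            word = word.replace(key, '', 1)
    return word, rows

# Values copied from the printed four-character vowel dictionary
berndt_4_sample = {
    'AIGH': [3e-05, 1.0, 1.0, 1.0, 1.0],
    'AUGH': [0.0001, 1.0, 1.0, 1.0, 1.0],
    'EIGH': [0.0001, 0.857, 0.142, 0.4995, 2.0],
    'OUGH': [0.0002, 0.517, 0.068, 0.2925, 4.0],
}

remaining, rows = consume_graphemes('COUNTERWEIGHT', berndt_4_sample)
print(remaining)  # COUNTERWT
print(rows)       # [[0.0001, 0.857, 0.142, 0.4995, 2.0]]
```

Each collected row holds the prior, max, min and mid probabilities and the number of phonemes, in that order, so it can be split into the five per-word lists afterwards.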


**Then move onto 2 and 1 character exception vowels**

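Okay, now we need to move to regex to remove split vowels, where the vowel is followed by a consonant and a final E. First, two-letter vowels such as

AI-E, AU-E, AW-E

and then one-letter vowels such as

A-E, E-E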
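
The two-letter patterns all have the same shape: the vowel pair, one non-vowel letter, then a final E. A minimal sketch of that test follows. `has_split_pattern` is a hypothetical helper, and it mirrors the single-consonant pattern used in the cell below.

```python
import re

def has_split_pattern(word, vowels):
    """True when `word` ends in <vowels> + one non-vowel letter + E."""
    return re.search(rf'{re.escape(vowels)}[^AEIOU]E$', word) is not None

for word, vowels in [('AIDE', 'AI'), ('EASE', 'EA'), ('GOOSE', 'OO'),
                     ('GOOD', 'OO'), ('HEARSE', 'EA')]:
    print(word, vowels, has_split_pattern(word, vowels))
```

Words with two consonants before the final E, such as HEARSE or AISLE, do not match this pattern. Widening `[^AEIOU]` to `[^AEIOU]+` would include them, if that is the intended behaviour.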

In [None]:
import re

ai_dict = {'AI': [0.0002, 0.818, 0.045, 0.4315, 3]}
au_dict = {'AU': [0.0001, 0.75, 0.083, 0.4165, 3]}
aw_dict = {'AW': [0.00001, 1, 1, 1, 1]}
ay_dict = {'AY': [0.000009, 1, 1, 1, 1]}
ea_dict = {'EA': [.00003, 1.0, 1.0, 1.0, 1.0]} #for HEARSE
ee_dict = {'EE': [0.00008, 1, 1, 1, 1]}
ei_dict = {'EI': [.00007, 0.75, 0.25, 0.5, 2]}
eu_dict = {'EU': [0.000009, 1, 1, 1, 1]}
ew_dict = {'EW': [0.000009, 1, 1, 1, 1]}
ey_dict = {'EY': [0.00006, 1, 1, 1, 1]}
ia_dict = {'IA': [0.00002, 1, 1, 1, 1]}
ie_dict = {'IE': [0.0002, 0.838, 0.032, 0.435, 3]}
oa_dict = {'OA': [0.00002, 1, 1, 1, 1]}
oi_dict = {'OI': [0.00009, 0.8, 0.2, 0.5, 2]}
oo_dict = {'OO': [0.0001, 1, 1, 1, 1]}
ou_dict = {'OU': [0.0006, 0.794, 0.014, 0.404, 4]}
ow_dict = {'OW': [0.00002, 0.666, 0.333, 0.4995, 2]}
oy_dict = {'OY': [0.000009, 1, 1, 1, 1]}
ui_dict = {'UI': [0.00003, 1, 1, 1, 1]}


a_dict = {'A': [0.0111, 0.651, 0.002, 0.3265, 7]} #for AWE
e_dict = {'E': [0.0032, 0.321, 0.002, 0.1615, 6]}
i_dict = {'I': [0.0086, 0.589, 0.001, 0.295, 5]}
o_dict = {'O': [0.0043, 0.785, 0.002, 0.3935, 7]}
u_dict = {'U': [0.0033, 0.703, 0.008, 0.3555, 7]} #for RUDE
y_dict = {'Y': [0.0002, 0.958, 0.041, 0.4995, 2]} #for TYPE


#words_prac = ['BT', 'MEN', 'EAT', 'TO', 'AWE', 'MIN', 'GOAL', 'WASH', 'AB', 'BTY', 'HEARSE', 'PROTRUDE', 'TYPE', 'EASE', 'TOOTSIE', 'GOOSE', 'GOOD', 'AIDE', 'VOICE']
except_vowel_words_2 = [] #holder list of words to remove. 
clean_vow_words_2 = []

#first, let's call in all the words and find those that have xx_e and x_e patterns and get counts for those words and clean the words of these vowel phonemes


for i, word in enumerate(clean_words_vow):  # iterate over the list of words with their index i, starting at 0
    except_vowel_words = []
#START HERE for TWO letter exceptions (i.e., AI_E as in AIDE)
    if re.search(r'AI[^AEIOU]E$', word):# finds words that end in AI, then exactly one non-vowel letter, then E (AIDE, but not AISLE)
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ai-e
            for key, val in ai_dict.items(): #call in dictionary
                #print(key)
                while key in word: #while key is in word (to allow for multiple keys in a word, e.g., MEATHEAD)
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1])
                    min_prob_vow[i].append(val[2])
                    mid_prob_vow[i].append(val[3])
                    number_phonemes_vow[i].append(val[4])
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'AIDE' becomes 'DE')
                    word = word[:-1] #remove the final e (i.e., 'DE' becomes 'D')
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'AU[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ea-e
            for key, val in au_dict.items(): #call in dictionary
                #print(key)
                while key in word: #while key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'AW[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ea-e
            for key, val in aw_dict.items(): #call in dictionary
                #print(key)
                while key in word: #while key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)   
    if re.search(r'AY[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ea-e
            for key, val in ay_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)   
    if re.search(r'EA[^AEIOU]E$', word):# EA, then [^AEIOU] (exactly one non-vowel letter), then E$ (ends in E)
        #print(word) #this matches EASE; HEARSE has two consonants before the final E, so this pattern does not match it
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ea-e
            for key, val in ea_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word (i.e., does word EA, like EASE, but not EAT
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'EE[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ea-e
            for key, val in ee_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words) 
    if re.search(r'EI[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ei-e
            for key, val in ei_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'EU[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for eu-e
            for key, val in eu_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'EW[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ew-e
            for key, val in ew_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'EY[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ey-e
            for key, val in ey_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'IA[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ia-e
            for key, val in ia_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'IE[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ie-e
            for key, val in ie_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'OA[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for oa-e
            for key, val in oa_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'OI[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for oi-e
            for key, val in oi_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'OO[^AEIOU]E$', word):# 
        #print(word) #this is GOOSE but not GOOD and TOOTSIE (works well)
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for oo-e
            for key, val in oo_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)
    if re.search(r'OU[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ou-e
            for key, val in ou_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'OW[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ow-e
            for key, val in ow_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'OY[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for oy-e
            for key, val in oy_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'UI[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for ui-e
            for key, val in ui_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)   
        
        
#NEXT, let's work on ONE letter exceptions (i.e., A_E as in ATE and SENATE)
    if re.search(r'A[^AEIOU]E$', word):#matches AWE but not CASTLE. EQUATE is mishandled for now...
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(f' the word {word} matches A-E') #these are the exception words for a-e. 
            for key, val in a_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word. This will count a word like DEVALUATE twice.... once as DEVALUATE, which will be changed to DEVALUTE, which will be counted a second time. Minor glitch in coding. Very rare.
                    #print(f'the word {word} includes this key {key}')
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'E[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for e-e
            for key, val in e_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'I[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for i-e
            for key, val in i_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'O[^AEIOU]E$', word):# the NORE in IGNORE but not the OTOTYPE in PROTOTYPE
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for o-e
            for key, val in o_dict.items(): #call in dictionary
                #print(key)
                while key in word: #while key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'U[^AEIOU]E$', word):# 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for u-e
            for key, val in u_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    if re.search(r'Y[^AEIOU]E$', word):#The TYPE in PROTOTYPE 
        #print(word) #
        except_vowel_words.append(word)#put it into exception list to act on in a bit
        for word in except_vowel_words:
            #print(word) #these are the exception words for y-e
            for key, val in y_dict.items(): #call in dictionary
                #print(key)
                while key in word: #if key is in word
                    #print(key)
                    prior_prob_vow[i].append(val[0])
                    max_prob_vow[i].append(val[1]) 
                    min_prob_vow[i].append(val[2]) 
                    mid_prob_vow[i].append(val[3]) 
                    number_phonemes_vow[i].append(val[4]) 
                    word = word.replace(key, '', 1) #replace the key in the word with nothing (i.e., 'medal' becomes 'med')
                    word = word[:-1]
            clean_vow_words_2.append(word) #this is to keep list of remaining characters in words to return to larger list of words later
        except_vowel_words_2.append(except_vowel_words)     
    
#clean_vow_words_3 = [] #these are the words stripped of final E
        
#for word in clean_vow_words_2:
#    word = word[:-1] #this will replace all E's
    #print(word)
#    clean_vow_words_3.append(word)
        
 
#print(f'this is a list of words that had vowel exceptions {except_vowel_words_2}\n')
#print(f'this is a list of words after removing vowel exceptions including removing the final E {clean_vow_words_2}\n')
#print(f'this is a list of words cleaned of all exception vowel phonemes {clean_vow_words_3}\n')
#print(prior_prob_vow)
#print(max_prob_vow)
#print(min_prob_vow)
#print(mid_prob_vow)
#print(number_phonemes_vow)
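
The vowel + consonant + final E blocks above all follow one pattern, so the same idea can be sketched as a single table-driven helper. This is a minimal sketch rather than the notebook's implementation: `TOY_E_TABLE`, its placeholder numbers, and `strip_e_exception` are hypothetical names introduced here. Unlike the `while key in word` loop, it removes only the grapheme matched by the regex, so a word like DEVALUATE is counted once.

```python
import re

# Hypothetical lookup: vowel grapheme -> [prior, max, min, mid, n_phonemes].
# The numbers are placeholders, not values from Berndt et al. (1987).
TOY_E_TABLE = {
    "OO": [0.001, 0.6, 0.1, 0.35, 2.0],
    "A": [0.002, 0.7, 0.05, 0.375, 3.0],
}


def strip_e_exception(word, table):
    """Return (remaining_letters, values) for the first matching pattern.

    Two-letter graphemes are tried before one-letter ones so that GOOSE
    matches OO_E rather than O_E. values is None when nothing matches.
    """
    for grapheme in sorted(table, key=len, reverse=True):
        match = re.search(re.escape(grapheme) + r"([^AEIOU])E$", word)
        if match:
            # Drop the matched grapheme and the final E, keep the consonant.
            return word[:match.start()] + match.group(1), table[grapheme]
    return word, None


for w in ["GOOSE", "DEVALUATE", "GOOD"]:
    print(w, strip_e_exception(w, TOY_E_TABLE))
```

With real tables, the returned values could be appended to `prior_prob_vow[i]` and the other per-word lists in the same order as above.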


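These were the counts after the 3- and 4-character vowels were processed. In order, the rows are: the words, the prior, max, min, and mid probabilities, the number of phonemes, and the letters that remain.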

['BOUGHT', 'MEN', 'EAT', 'TO', 'AWE', 'MIN', 'GOAL', 'WASH', 'AB', 'BEAUTY', 'HEARSE', 'RUDE', 'TYPE']
[[0.0002], [], [], [], [], [], [], [], [], [0.0001], [], [], []]
[[0.517], [], [], [], [], [], [], [], [], [0.545], [], [], []]
[[0.068], [], [], [], [], [], [], [], [], [0.454], [], [], []]
[[0.2925], [], [], [], [], [], [], [], [], [0.4995], [], [], []]
[[4.0], [], [], [], [], [], [], [], [], [2.0], [], [], []]
['BT', 'MEN', 'EAT', 'TO', 'AWE', 'MIN', 'GOAL', 'WASH', 'AB', 'BTY', 'HEARSE', 'RUDE', 'TYPE']

In [None]:
print(clean_words_vow[:15]) #original words/character strings after 3-4 character vowels removed
print(except_vowel_words_2[:15]) #words to remove from list above
print(clean_vow_words_2[:15]) #character strings to replace them with

In [None]:
print(len(clean_words_vow))
print(len(clean_vow_words_2))

In [None]:
#now, we need a list for the next set of vowels
#1. Make the words to remove a list
#2. take the original words
#3. remove the exception words that were processed
#4. replace those words with their remaining parts after processing in the same order

#change list of lists into a list
vow_words_to_remove = []
for sublist in except_vowel_words_2:
    for word in sublist:
        vow_words_to_remove.append(word)

print(len(vow_words_to_remove))
print(f'these are the words to remove and replace {vow_words_to_remove[:20]}')

print(f'these are the remaining characters to replace the words above with {clean_vow_words_2[:20]}')

print(len(clean_words_vow))
print(f'these are the words cleaned of 3-4 character vowel phonemes {clean_words_vow[:20]}')


# create a new list that will store the updated list of words where we replace the exception words with the leftovers
clean_words_vow_4 = []

# loop through the words in the original list
for word in clean_words_vow:
    #print(word)
    # if the word was processed as an exception, replace it with the corresponding set of characters that remain after processing
    if word in vow_words_to_remove:
        #print(word)
        #clean_words_vow_4.append(word) #there are 7668 words in this list, but there are 7917 words in vow_words_to_remove
        clean_words_vow_4.append(clean_vow_words_2[vow_words_to_remove.index(word)])#index returns the position at the first occurrence of the specified value
    # if the word is not in vow_words_to_remove, add it unchanged to clean_words_vow_4
    else:
        clean_words_vow_4.append(word)

print(len(clean_words_vow_4))
print(f'these are the words for the next section of counting (i.e., words with exception vowels removed) {clean_words_vow_4[:20]}')




# print the final list for vowel probability extraction
#print(f'this was the original list of words after 3-4 character vowels removed {clean_words_vow}\n')
#print(f'this is the list for next steps in vowel cleaning after exception vowels removed {clean_words_vow_4}') #Think this is right except for special, which should be spel, but is sp
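
The replacement step above pairs each exception word with its leftover letters by list position, which only works when the two lists have the same length and order (the commented note about 7668 versus 7917 words suggests they may not). One possible alternative, sketched here with made-up data, is to key the leftovers by word. The names `words`, `leftovers`, and `cleaned` are hypothetical.

```python
# Hypothetical data: the original words and, for the ones handled as
# exceptions, the letters left over after processing.
words = ["GOOSE", "MEN", "RUDE", "GOOSE"]
leftovers = {"GOOSE": "GS", "RUDE": "RD"}

# dict.get falls back to the word itself when it was not an exception,
# so no positional alignment between two lists is needed.
cleaned = [leftovers.get(w, w) for w in words]
print(cleaned)
```

A dictionary lookup also avoids the repeated linear scans done by `in` and `.index()` on a long list.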


**The next steps are the following**

1. Count and remove two character vowels
2. Count and remove single character vowels

This will give us vowel counts
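
A toy illustration of why the two-character pass runs first: in GOAL, OA should be counted as one grapheme rather than as O and A separately. The tables and the function name `count_vowel_graphemes` are hypothetical, and the values are placeholders, not Berndt et al. (1987) probabilities.

```python
# Hypothetical lookup tables: grapheme -> [prior, max, min, mid, n_phonemes].
TOY_TWO_CHAR = {"OA": [0.01, 0.9, 0.1, 0.5, 2.0], "EA": [0.02, 0.6, 0.1, 0.35, 3.0]}
TOY_ONE_CHAR = {"A": [0.05, 0.5, 0.1, 0.3, 4.0], "E": [0.06, 0.6, 0.1, 0.35, 3.0],
                "O": [0.04, 0.4, 0.1, 0.25, 4.0]}


def count_vowel_graphemes(word, two_char, one_char):
    """Remove two-character vowels first, then single-character vowels."""
    found = []
    for table in (two_char, one_char):
        for key, values in table.items():
            while key in word:
                found.append((key, values))
                word = word.replace(key, "", 1)  # remove one occurrence at a time
    return found, word


found, rest = count_vowel_graphemes("GOAL", TOY_TWO_CHAR, TOY_ONE_CHAR)
print([key for key, _ in found], rest)
```

Here GOAL yields the single grapheme OA and leaves GL; reversing the order of the two tables would instead count A and O separately.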

In [None]:
#print(clean_words_vow_4)
clean_words_vow_5 = []

for i, word in enumerate(clean_words_vow_4):  # enumerate gives each word's index i (starting at 0), which selects that word's sub-list in the probability lists
    for key, val in berndt_2_vow_df_tran.items(): #next is 2 character vowel phonemes
        while key in word: 
            #print(key) #the key
            prior_prob_vow[i].append(val[0])
            max_prob_vow[i].append(val[1]) 
            min_prob_vow[i].append(val[2]) 
            mid_prob_vow[i].append(val[3]) 
            number_phonemes_vow[i].append(val[4]) 
            word = word.replace(key, '', 1)
    for key, val in berndt_1_vow_df_tran.items(): #next is 1 character vowel phonemes
        while key in word:
            #print(key) #the key
            prior_prob_vow[i].append(val[0])
            max_prob_vow[i].append(val[1]) 
            min_prob_vow[i].append(val[2]) 
            mid_prob_vow[i].append(val[3]) 
            number_phonemes_vow[i].append(val[4]) 
            word = word.replace(key, '', 1)
    clean_words_vow_5.append(word) #this is a check to make sure the code is removing character phonemes

#print(f'this was the original list of words {clean_words_vowels}\n')
#print(f'this is the list of words for final vowel cleaning with 2 and 1 character phonemes {clean_words_vow_4}\n')
#print(prior_prob_vow)
#print(max_prob_vow)
#print(min_prob_vow)
#print(mid_prob_vow)
#print(number_phonemes_vow)
#print(f'this is what is left of the words with all the vowels counted {clean_words_vow_5}')

In [None]:
print(f'this is what is left of the words with all the vowels counted {clean_words_vow_5[:20]}')

Now we need to handle any empty lists (unlikely with vowels, but there are probably exceptions).

We also need to average the values within each word's list.

In [None]:
#pad any empty sub-list with a single 0 so the averages below do not divide by zero
for prob_list in (prior_prob_vow, max_prob_vow, min_prob_vow, mid_prob_vow, number_phonemes_vow):
    for item in prob_list:
        if len(item) == 0:
            item.append(0)


prior_prob_vow_avg = [sum(sub_list) / len(sub_list) for sub_list in prior_prob_vow]
max_prob_vow_avg = [sum(sub_list) / len(sub_list) for sub_list in max_prob_vow]
min_prob_vow_avg = [sum(sub_list) / len(sub_list) for sub_list in min_prob_vow]
mid_prob_vow_avg = [sum(sub_list) / len(sub_list) for sub_list in mid_prob_vow]
number_phonemes_vow_avg = [sum(sub_list) / len(sub_list) for sub_list in number_phonemes_vow]
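
Padding with 0 makes a word with no matched vowels look the same as a word whose average really is 0. One possible alternative, shown only as a sketch, is to record a missing value instead. `mean_or_nan` and `per_word` are hypothetical names with made-up numbers.

```python
import math


def mean_or_nan(values):
    """Average a list, returning NaN (missing) for an empty list."""
    return sum(values) / len(values) if values else math.nan


# Hypothetical per-word lists of probabilities; the second word had no matches.
per_word = [[0.517, 0.3], [], [0.545]]
averages = [mean_or_nan(values) for values in per_word]
print(averages)
```

If these averages go into a pandas column, `mean` skips NaN by default, so a later row-wise average would use only the values that exist.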



In [None]:
print(prior_prob_vow_avg[:20])
print(max_prob_vow_avg[:20])
print(min_prob_vow_avg[:20])
print(mid_prob_vow_avg[:20])
print(number_phonemes_vow_avg[:20])

In [None]:
#add to dataframe

decoding_df['prior_prob_vowel'] = prior_prob_vow_avg #add to dataframe
decoding_df['max_prob_vowel'] = max_prob_vow_avg #add to dataframe
decoding_df['min_prob_vowel'] = min_prob_vow_avg #add to dataframe
decoding_df['mid_prob_vowel'] = mid_prob_vow_avg #add to dataframe
decoding_df['number_phonemes_vowel'] = number_phonemes_vow_avg #add to dataframe

In [None]:
decoding_df[34000:34100]

In [None]:
#decoding_df.to_csv("everything_including_vowel_prob.csv")

Now we probably need a single value per word that combines the consonant and vowel probabilities.

One option is the following

1. Simple average (the mean of prior_prob_cons and prior_prob_vowel, and likewise for the other measures)


In [None]:
decoding_df['prior_prob_all'] = decoding_df[['prior_prob_cons', 'prior_prob_vowel']].mean(axis=1)
decoding_df['max_prob_all'] = decoding_df[['max_prob_cons', 'max_prob_vowel']].mean(axis=1)
decoding_df['mid_prob_all'] = decoding_df[['mid_prob_cons', 'mid_prob_vowel']].mean(axis=1)
decoding_df['min_prob_all'] = decoding_df[['min_prob_cons', 'min_prob_vowel']].mean(axis=1)
decoding_df['number_phonemes_all'] = decoding_df[['number_phonemes_cons', 'number_phonemes_vowel']].mean(axis=1)
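
The simple average treats the consonant average and the vowel average as equally important, regardless of how many graphemes each one summarizes. The sketch below, with made-up numbers and hypothetical variable names, contrasts it with a pooled average over all graphemes. The pooled version is not computed in this notebook and is shown only for comparison.

```python
# Hypothetical per-grapheme values for one word.
cons_values = [0.2, 0.4, 0.6]   # three consonant graphemes
vowel_values = [0.9]            # one vowel grapheme

cons_avg = sum(cons_values) / len(cons_values)
vowel_avg = sum(vowel_values) / len(vowel_values)

# Simple average: the mean of the two per-type averages.
simple_average = (cons_avg + vowel_avg) / 2

# Pooled average: every grapheme counts once, whatever its type.
all_values = cons_values + vowel_values
pooled_average = sum(all_values) / len(all_values)

print(round(simple_average, 3), round(pooled_average, 3))
```

The two agree only when a word has the same number of consonant and vowel graphemes, or when the two per-type averages happen to be equal.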

In [None]:
decoding_df

In [None]:
decoding_df.to_csv("everything_including_vowel_prob.csv")