Notebook to generate 10, 20, or 50 Finnish phoneme representations for each English word in the game data.
Created from the base notebook: english_word_to_finnish_phoneme.ipynb.

Author: Sujith Padaru
- *Inputs*: 
    - Game words.
    - English dictionary in word, phoneme-representation form (hard coded for en_uk).
    - Finnish dictionary.
    - English to global map. Pickle of a dictionary with {phoneme -> phoneme rep}.
    - Finnish to global map.
    - Global phone distances.

- *Outputs*: 
    - n Finnish phonetic representations for each word.
- *HyperParameters*:
    - n: number of samples to generate.

In [1]:
import os
import numpy as np
import pandas as pd
from pprint import pprint
from itertools import product
from pathlib import Path


Code to run this notebook as a script that generates Finnish phoneme representations of English words,
based on different distance metrics and on the English and Finnish dictionaries.

In [52]:
def map_eng_sentence_2_en_uk_phone_rep(sentence):
    '''
    Inputs:
    sentence: (str) Sentence to be represented in phones.
    Outputs:
    transcript: (str) 'ph ph ph' Phoneme representation of the sentence.
    '''
    
    # Look up the transcript of each word in the module-level en_uk_dict
    # and collect the results in the list word_transcripts.
    words = sentence.split(' ')    
    print(words)
    word_transcripts = []
    for word in words:
        if word not in en_uk_dict:
            # Words missing from the dictionary are reported and skipped.
            print('Some words not found')
            # return ''
        else:
            word_transcripts.append(en_uk_dict[word])
    
    # Build the sentence transcript from the word transcripts.
    # A single word is used as-is; otherwise " sil " is appended after every word.
    print(word_transcripts)
    sentence_transcript = ''
    if len(word_transcripts) == 1:
        sentence_transcript += word_transcripts[0]
    else:
        for i in word_transcripts:
            sentence_transcript += i 
            sentence_transcript += " sil "

    return sentence_transcript.rstrip()

In [3]:
def eng_ph_2_global_ph(english_ph_transcript):
    '''
    input: 
    english_ph_transcript - English phone transcripts from mapping file. 'ph ph ph ...'
    output: 
    global_ph_transcript: 'ph ph ph ...'
    Hyperparameter: eng_to_global_map.
    '''
    global_ph_transcript = ''
    for phone in english_ph_transcript.split(" "):
        global_ph_transcript += eng_to_global_map[phone]
        global_ph_transcript += ' '
    
    return global_ph_transcript.rstrip()

In [4]:
def glob_transcript_2_fin_nearest(transcript):
    '''Returns the nearest Finnish phoneme for each phone of the global transcript.
       Input: Global phoneme representation separated by spaces.
       Output: Finnish phoneme representation separated by spaces.
    '''
    fin_transcript = ''
    for phone in transcript.split(" "):
        # A list holds candidates ordered nearest first; take the nearest one.
        if isinstance(global_2_fin_map[phone], list):
            fin_transcript += global_2_fin_map[phone][0]
        else:
            fin_transcript += global_2_fin_map[phone]       
        
        fin_transcript += ' '
    return fin_transcript.rstrip()

In [5]:
def glob_transcript_2_N_fin(transcript):
    '''Returns every Finnish transcript obtained by combining the candidate
       Finnish phonemes of each phone in the global transcript.
       Input: Global phoneme representation separated by spaces.
       Output: List of Finnish phoneme representations, each separated by spaces.
    '''
    
    phone_maps = []
    for phone in transcript.split(" "):
        if phone == 'sil':
            phone_maps.append(['sil'])
        else:
            phone_maps.append(global_2_fin_map[phone])
    
    list_of_trans = list(product(*phone_maps))    
    list_of_trans = [" ".join(list(trans)) for trans in list_of_trans]
    return list_of_trans
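The expansion done by glob_transcript_2_N_fin can be illustrated on a toy example. The sketch below is only an illustration: `toy_global_2_fin_map` and `toy_transcript_2_all_fin` are hypothetical names, and the candidate lists are invented rather than taken from the real map.

```python
from itertools import product

# Hypothetical toy map: each global phone has a list of candidate Finnish phones.
toy_global_2_fin_map = {
    'b': ['b', 'p'],
    'ʊ': ['u', 'y'],
    'k': ['k'],
}

def toy_transcript_2_all_fin(transcript, phone_map):
    """Expand a space-separated global transcript into every Finnish candidate."""
    phone_options = []
    for phone in transcript.split(' '):
        # 'sil' marks a word boundary and is passed through unchanged.
        phone_options.append(['sil'] if phone == 'sil' else phone_map[phone])
    # product(*options) yields one tuple per combination of candidates.
    return [' '.join(combo) for combo in product(*phone_options)]

candidates = toy_transcript_2_all_fin('b ʊ k', toy_global_2_fin_map)
print(len(candidates))  # 2 * 2 * 1 combinations
print(candidates)
```

The number of candidates is the product of the per-phone list lengths, so with three candidates per phone it grows as 3^k for a transcript of k phones. This is presumably why the writer function later samples at most a fixed number of transcripts per word.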

In [61]:
mapping.fin_transcript[0]

'f y sil sil d y sil f i s t sil tʰ æe m sil e n sil m i m y rː iː sil p rː y d j uː s y sil sil y n d sil k y n s j uː m y sil sil p rː æe s sil rː e p o t s sil h e t sil b æ k sil tʰ y sil b æ k sil e n sil d y sil s e m sil m iː k sil'

In [65]:
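def phone_sentences(sentence):
    '''Collapses a 'ph ph sil ph' transcript into words: the phones of a word
       are joined together and each 'sil' becomes a single space.'''
    sentence = sentence.replace("  ","sil")
    sentence = sentence.replace(" ","")
    sentence = sentence.replace("sil"," ")
    return sentence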

def mappings_2_text(index='all',speakers=[600], name='eng_game_words'):
    '''
    Input : 
    index - Index of the rows of the mapping file to be written into an evaluation file.
                Default : all the words
    speakers - List of speakers to be generated by tacotron.
    name - Name of the output text file
    
    Output:
    Returns 0. Saves one file per speaker in the eval_text/ folder. 
    '''
    if index == 'all':
        transcript_2_text = mapping[['sentence','fin_transcript']]
    else:
        transcript_2_text = mapping.iloc[index][['sentence','fin_transcript']]

    #transcript_2_text.set_index('sentence',inplace=True)
    transcript_2_text = transcript_2_text.assign(
                                fin_transcript=transcript_2_text.fin_transcript.apply(phone_sentences))
    for speaker in speakers:
        transcript_2_text = transcript_2_text.assign(speaker_id = len(transcript_2_text)*[speaker])
        
        my_file = Path('eval_text/{}_{}.txt'.format(name,speaker))
        
        if my_file.is_file():
            print('eval_text/{}_{}.txt already exists; not writing the file'.format(name,speaker))
        else:    
            transcript_2_text.to_csv('eval_text/{}_{}.txt'.format(name,speaker), sep='|', header=False)
    
    return 0
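A short worked example may help to see what phone_sentences produces before the rows are written to file. The function is repeated here only so that the snippet runs on its own; the input strings are taken from transcripts shown elsewhere in this notebook.

```python
def phone_sentences(sentence):
    # Same three replacements as above, repeated so this snippet is self-contained.
    sentence = sentence.replace("  ", "sil")  # double space -> word boundary
    sentence = sentence.replace(" ", "")      # glue the phones of a word together
    sentence = sentence.replace("sil", " ")   # word boundary -> single space
    return sentence

result = phone_sentences('b æe sil b æe sil')
print(repr(result))

# Two consecutive 'sil' markers become two consecutive spaces.
double_sil = phone_sentences('f y sil sil d y sil')
print(repr(double_sil))
```

Note that a trailing `sil` leaves a trailing space. Also, because the spaces are removed before `sil` is replaced, the phones `s i l` inside a word would be read as a boundary as well; whether that occurs in the actual data has not been checked here.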

In [7]:

def mappings_2_text_N(N=20, index='all', speakers=[600], name='eng_game_words'):
    '''
    Input : 
    N - Maximum number of alternate Finnish phonemic representations for each game word.
    index - Index of the rows of the mapping file to be written into an evaluation file.
                Default : all the words
    speakers - List of speakers to be generated by tacotron.
    name - Name of the output text file
    
    Output:
    Returns 0. Saves one file per speaker in the eval_text/ folder. 
    '''
    if index == 'all':
        transcript_2_text = mapping[['sentence','all_fin_transcripts']]
    else:
        transcript_2_text = mapping.iloc[index][['sentence','all_fin_transcripts']]

    # set_index returns a new frame, so the global mapping is left untouched.
    transcript_2_text = transcript_2_text.set_index('sentence')
    
    dfs = []
    for sentence, all_fin_transcripts in transcript_2_text.iterrows():
        list_of_trans = all_fin_transcripts.values[0]
        number_of_trans = len(list_of_trans)

        if number_of_trans > N:
            # Fixed seed so the same N transcripts are sampled on every run.
            np.random.seed(0)
            transcripts = np.random.choice(list_of_trans, size=N, replace=False)
            df = pd.DataFrame(transcripts)
            df = df.assign(sentence = N*[sentence])
        else:
            df = pd.DataFrame(list_of_trans)
            df = df.assign(sentence = number_of_trans*[sentence])
        dfs.append(df)
    
    transcript_2_text = pd.concat(dfs)
    transcript_2_text.set_index('sentence',inplace=True)
    
    for speaker in speakers:
        transcript_2_text = transcript_2_text.assign(speaker_id = len(transcript_2_text)*[speaker])
        
        # Check the same path that is written below.
        my_file = Path('eval_text/{}_spk_{}.txt'.format(name,speaker))
        
        if my_file.is_file():
            print('{} already exists; not writing the file'.format(my_file))
        else:    
            transcript_2_text.to_csv(my_file, sep='|', header=False)
    
    return 0


## __main__():
Read the Game Words

In [8]:
'''
import argparse

parser = argparse.ArgumentParser(description='Outputs the Finnish Phoneme representation of the english words')

parser.add_argument("word_list",help="Path to the text file to be converted to Finnish phonetic symbols",type=str)
parser.add_argument("-ed","--english_dict",default="dict/en_uk_dict.txt")
parser.add_argument('--eng_to_global_map', default='../mappings/en_uk_ph_dist_phones_map.pkl',
                    help="Path to the english to global phoneme dictionary mapping", type=str)
parser.add_argument('--fin_to_global_map', default='../mappings/fin_2_global_phones_map.pkl',
                    help="Path to the finnish to global phoneme dictionary mapping", type=str)
parser.add_argument('--global_phone_distances', default='../mappings/global_phone_distances.pkl',
                    help="Path to the global phone distances", type=str)
args = parser.parse_args()

'''


In [67]:
### Adding the words/sentences to be translated to a **mapping** dataframe, where the further mappings can be represented.
mapping = pd.read_csv('eval_text/w1_s1newart_clean_no_unk.trn',header=None,names=['sentence'])

In [68]:
mapping.head(5)

Unnamed: 0,sentence
0,for the first time in memory producer and cons...
1,unlisted share prices slipped in muted trading...
2,a stronger dollar helped clobber tokyo stocks ...
3,tomorrow the house will vote for only the seco...
4,brother pierre is extolling the virtues of his...


- Read the English Dictionary 
- preprocess it
- Convert english word rep to eng phoneme rep

In [69]:
# Reading and preprocessing the English dictionary. 
# The English dictionary should be a text file with each line represented as follows:
# language_dialect_word [\t tab] phone[space]phone[space]...

en_uk_dict = pd.read_csv('dict/en_uk_dict.txt',header=None,names=['word','en_rep'],sep='\t')#,index_col=['word'])

#Removing 'language_dialect_' part
word_after_remov_en_uk = en_uk_dict.word.apply(lambda en_uk_word: en_uk_word.split('_')[-1])
en_uk_dict.set_index(word_after_remov_en_uk,inplace=True)
en_uk_dict.drop('word',axis=1,inplace=True)
#pprint('English Phoneme Representation samples:')
#pprint(en_uk_dict.head())

en_uk_dict = en_uk_dict.to_dict()['en_rep']
# Add the English phoneme representation to the mapping dataframe
mapping = mapping.assign(eng_transcript=mapping.sentence.apply(map_eng_sentence_2_en_uk_phone_rep))
mapping = mapping.assign(no_transcript_flag=(mapping.eng_transcript==''))
pprint('The Dataframe after mapping english words to phonemes:')
pprint(mapping.head())

['for', 'the', 'first', 'time', 'in', 'memory', 'producer', 'and', 'consumer', 'price', 'reports', 'hit', 'back', 'to', 'back', 'in', 'the', 'same', 'week', '']
Some words not found
['f ə ', 'ð ə', 'f ɜː s t', 'tʰ aɪ m', 'ɪ n', 'm ɛ m ə ɹ iː', 'p ɹ ə d j uː s ə ', 'ə n d', 'kʰ ə n s j uː m ə ', 'p ɹ aɪ s', 'ɹ ɪ pʰ ɔː t s', 'h ɪ t', 'b a k', 'tʰ ə', 'b a k', 'ɪ n', 'ð ə', 's eɪ m', 'w iː k']
['unlisted', 'share', 'prices', 'slipped', 'in', 'muted', 'trading', 'yesterday', 'depressed', 'by', 'weaker', 'equity', 'markets', 'overseas', '']
Some words not found
['ʌ n l ɪ s t ɪ d', 'ʃ ɛə ', 'p ɹ aɪ s ɪ z', 's l ɪ p t', 'ɪ n', 'm j uː t ɪ d', 't ɹ eɪ d ɪ ŋ', 'j ɛ s t ə d iː', 'd ɪ p ɹ ɛ s t', 'b aɪ', 'w iː k ə ', 'ɛ k w ə t iː', 'm ɑː k ɪ t s', 'əʊ v ə s iː z']
['a', 'stronger', 'dollar', 'helped', 'clobber', 'tokyo', 'stocks', 'which', 'in', 'turn', 'pulled', 'u.', 's.', 'stock', 'prices', 'modestly', 'lower', '']
Some words not found
Some words not found
Some words not found
['ə', 's t ɹ ɒ 
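The dictionary format described in the cell above can be illustrated without pandas. This is a minimal sketch: `sample_lines` and `toy_dict` are hypothetical names, and the two entries only mimic the `language_dialect_word<TAB>phones` layout using transcripts that appear elsewhere in this notebook.

```python
# Hypothetical sample lines in the expected format: language_dialect_word<TAB>phones
sample_lines = [
    "en_uk_girl\tg ɜː ɫ",
    "en_uk_book\tb ʊ k",
]

toy_dict = {}
for line in sample_lines:
    tagged_word, phones = line.split("\t")
    # Keep only the part after the last underscore, as the cell above does.
    word = tagged_word.split("_")[-1]
    toy_dict[word] = phones

print(toy_dict)
```

Taking the part after the last underscore assumes that the words themselves contain no underscore; an entry that does would be shortened to its final segment.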

- Read English phoneme to Global Phoneme Map
- Eng phoneme rep to Global Phoneme rep

In [70]:
# Read the English to Global rep and make the mapping in the file.


'''Format of the pprinted English to global map dictionary:
{'': '',
 'ɒ': 'ɒ',
 .
 .
 .
 }
'''

eng_to_global_map = pd.read_pickle('mappings/en_uk_ph_dist_phones_map.pkl')
#eng_to_global_map = pd.read_pickle(args.eng_to_global_map)


mapping = mapping.assign(global_transcript = mapping.eng_transcript.apply(eng_ph_2_global_ph))
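The phone-by-phone lookup can be sketched on a toy map. `toy_eng_to_global_map` and `toy_eng_ph_2_global_ph` are hypothetical names; the three entries are chosen to match the `girl` row shown in this notebook and are not read from the real pickle.

```python
# Hypothetical three-entry map; the real map is loaded from a pickle.
toy_eng_to_global_map = {'g': 'g', 'ɜː': 'ɜ', 'ɫ': 'ɫ'}

def toy_eng_ph_2_global_ph(transcript, phone_map):
    """Map each space-separated English phone to its global phone."""
    return ' '.join(phone_map[phone] for phone in transcript.split(' '))

global_transcript = toy_eng_ph_2_global_ph('g ɜː ɫ', toy_eng_to_global_map)
print(global_transcript)
```

A phone that is missing from the map raises a `KeyError` in this sketch, which makes gaps in the map visible early.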

- Read Finnish phoneme map to Global Map
- Read Global Phone Distance
- Infer Global phone to Finnish Phone Map
- Find Finnish Phoneme Rep from Global Rep

In [74]:
# Read the Fin to Global Map and compute Global to Finnish Map.

fin_to_global_map = pd.read_pickle('mappings/fin_2_global_phones_map.pkl')
# Invert the map: global phone -> Finnish phone.
global_2_fin_map = {value: key for key, value in fin_to_global_map.items()}

global_phone_dist = pd.read_pickle('mappings/phone_distances.pickle')

global_ph = global_phone_dist['phones']
global_ph_dist = global_phone_dist['phone_distances']

# Square distance table indexed by global phone on both axes.
global_ph_dist = pd.DataFrame(data=global_ph_dist,index=list(global_ph.values()),columns=list(global_ph.values()))

# Keep only the rows of global phones that have a Finnish counterpart.
distance_2_fin = global_ph_dist.loc[list(global_2_fin_map.keys())]

# For every global phone, keep the three nearest phones that exist in Finnish.
for ph in global_ph.values():
    three_nearest_phones = list(distance_2_fin[ph].sort_values()[:3].index)
    global_2_fin_map[ph] = three_nearest_phones


# Manual overrides: force 'v' and 's' as the third candidate for 'w' and 'z'.
global_2_fin_map['w'] = list(global_2_fin_map['w'][:2]) + ['v']
global_2_fin_map['z'] = list(global_2_fin_map['z'][:2]) + ['s']

mapping = mapping.assign(fin_transcript=mapping.global_transcript.apply(glob_transcript_2_fin_nearest))
pprint(mapping.head(5))


#mapping = mapping.assign(all_fin_transcripts =
#                         mapping.global_transcript.apply(glob_transcript_2_N_fin))

                                            sentence  \
0  for the first time in memory producer and cons...   
1  unlisted share prices slipped in muted trading...   
2  a stronger dollar helped clobber tokyo stocks ...   
3  tomorrow the house will vote for only the seco...   
4  brother pierre is extolling the virtues of his...   

                                      eng_transcript  no_transcript_flag  \
0  f ə  sil ð ə sil f ɜː s t sil tʰ aɪ m sil ɪ n ...               False   
1  ʌ n l ɪ s t ɪ d sil ʃ ɛə  sil p ɹ aɪ s ɪ z sil...               False   
2  ə sil s t ɹ ɒ ŋ g ə  sil d ɒ l ə  sil h ɛ ɫ p ...               False   
3  tʰ ə m ɒ ɹ əʊ sil ð ə sil h aʊ z sil w ɪ ɫ sil...               False   
4  b ɹ ʌ ð ə  sil pʰ ɪə  sil ɪ z sil ɪ k s t əʊ l...               False   

                                   global_transcript  \
0  f ə sil sil ð ə sil f ɜ s t sil tʰ aɪ m sil ɪ ...   
1  ʌ n l ɪ s t ɪ d sil ʃ ɛə sil sil p ɹ aɪ s ɪ z ...   
2  ə sil s t ɹ ɒ ŋ g ə sil sil d ɒ l ə
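The nearest-phone selection in the cell above sorts one column of the distance table and keeps the first three entries. The sketch below shows the same idea with a plain dictionary instead of a DataFrame; the distances are invented for illustration and `nearest_phones` is a hypothetical helper.

```python
# Hypothetical distances from the global phone 'w' to a few Finnish phones.
toy_distances_to_w = {'v': 0.8, 'u': 0.2, 'f': 0.7, 'o': 0.3, 's': 0.9}

def nearest_phones(distances, n=3):
    """Return the n phones with the smallest distance, nearest first."""
    return sorted(distances, key=distances.get)[:n]

nearest = nearest_phones(toy_distances_to_w)
print(nearest)

# Manual override in the spirit of the cell above: keep the two nearest
# phones and force 'v' into the third slot.
overridden = nearest[:2] + ['v']
print(overridden)
```

The override replaces the third candidate regardless of its distance, so it encodes a hand-made preference on top of the distance metric.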

- Create the Dataframe to save the evaluation file.
- Save it

In [66]:
# Writing the nearest Finnish transcript of every sentence for speakers 600, 555, 567, 57 and 258

rand_spks = [600, 555, 567, 57, 258]
mappings_2_text(speakers=rand_spks, name='wsj_test')

0

In [132]:
mapping.head()

Unnamed: 0,sentence,eng_transcript,no_transcript_flag,global_transcript,fin_transcript,all_fin_transcripts
0,girl,g ɜː ɫ,False,g ɜ ɫ,g i l,"[g i l, g i lː, g i rː, g æ l, g æ lː, g æ rː,..."
1,hello,h ɛ l əʊ,False,h ɛ l əʊ,h i l ou,"[h i l ou, h i l u, h i l y, h i lː ou, h i lː..."
2,book,b ʊ k,False,b ʊ k,b u k,"[b u k, b u kː, b u p, b y k, b y kː, b y p, b..."
3,learn,l ɜː n,False,l ɜ n,l i n,"[l i n, l i nː, l i rː, l æ n, l æ nː, l æ rː,..."
4,bye bye,b aɪ sil b aɪ sil,False,b aɪ sil b aɪ sil,b æe sil b æe sil,"[b æe sil b æe sil, b æe sil b ɑe sil, b æe si..."
