In [1]:
__author__="Emily Hua"

Next, create the text files that let Kaldi work with your audio data. These files are mandatory. Each file created in this section (and in the Language data section) is a plain text file with one entry per line, and the lines must be sorted. If you run into sorting issues, use the Kaldi scripts utils/validate_data_dir.sh to check the data order and utils/fix_data_dir.sh to fix it. The utils directory will be added to your project in the Tools attachment section.
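
The sorting requirement can be checked in Python before running the Kaldi scripts. The sketch below is only an illustration and not a replacement for utils/validate_data_dir.sh. The helper name `is_sorted_by_key` is hypothetical, and it assumes the first whitespace-separated field of each line is the key and that keys must be unique and in plain string order.

```python
def is_sorted_by_key(lines):
    """Return True if the first field of each non-empty line is strictly increasing."""
    keys = [line.split(maxsplit=1)[0] for line in lines if line.strip()]
    # Strictly increasing also rules out duplicate keys.
    return all(a < b for a, b in zip(keys, keys[1:]))


example = ["01MA m", "03FA f", "08MA m"]
print(is_sorted_by_key(example))                  # True
print(is_sorted_by_key(list(reversed(example))))  # False
```

To check a real file, pass the open file object, for example `is_sorted_by_key(open(path))`.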

In [2]:
import os
parent_path = os.path.split(os.getcwd())[0]
print (parent_path)

/Volumes/STARTRACK/deep-learning/code-switch


In the kaldi/egs/code-switch directory, create a folder **data**, then create **test** and **train** subfolders inside it. Create the following files in each subfolder. The files have the same names in test and train, but they describe the two different data sets you created before:

In [3]:
directory = parent_path + "/data/train"
if not os.path.exists(directory):
    os.makedirs(directory)

In [4]:
directory = parent_path + "/data/test"
if not os.path.exists(directory):
    os.makedirs(directory)

In [7]:
ls -l ../data

total 0
drwxr-xr-x  3 yehua  staff  102 Mar 24 15:11 test/
drwxr-xr-x  3 yehua  staff  102 Mar 24 15:03 train/


a.) spk2gender 
This file gives each speaker's gender. As assumed earlier, 'speakerID' is a unique name for each speaker.

Pattern: [speakerID] [gender]

In [8]:
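import os
import re
from collections import defaultdict

audio_path = parent_path + '/LDC2015S04/seame_d2/data/interview/audio'
# Skip hidden entries such as .DS_Store; os.listdir returns names in arbitrary order.
dir_list = sorted(f for f in os.listdir(audio_path) if not f.startswith('.'))

# Count recordings per speaker; e.g. "NI01MAX_0101.flac" gives the speaker ID "01MA".
id_dic = defaultdict(int)
for file in dir_list:
    id_dic[re.split('_', file)[0][2:-1]] += 1
print ('there are {} unique speaker id'.format(len(id_dic)))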

there are 94 unique speaker id
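
The slicing used to derive the speaker ID is easier to follow on a single file name. The sketch below walks through it for `NI01MAX_0101.flac`, the file listed in this notebook. The helper names are hypothetical, and it assumes every file name follows the same pattern: a two-character prefix, the speaker ID, one trailing character, then `_` and a session number.

```python
def speaker_id_from_filename(filename):
    """Extract the speaker ID, e.g. "NI01MAX_0101.flac" -> "01MA"."""
    recording = filename.split("_")[0]   # "NI01MAX"
    return recording[2:-1]               # drop the 2-char prefix and the last char


def gender_from_speaker_id(speaker_id):
    """The third character of the speaker ID is M or F; return it in lowercase."""
    return speaker_id[2].lower()


sid = speaker_id_from_filename("NI01MAX_0101.flac")
print(sid, gender_from_speaker_id(sid))  # 01MA m
```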


In [9]:
test_ids = ['01MA', '03FA','08MA', '29FA','29MB','42FB','44MB','45FB','67MB','55FB']
train_ids = []
for key in id_dic:
    if key not in test_ids:
        train_ids.append(key)
print ('there are {} speaker ids in the training set'.format(len(train_ids)))

there are 84 speaker ids in the training set


In [11]:
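spk2gender_path = parent_path + "/data/train/spk2gender"
with open(spk2gender_path, 'w') as outfile:
    # Kaldi expects sorted speaker IDs and a lowercase gender (m or f).
    # The third character of an ID such as "01MA" is M or F.
    for speakerid in sorted(train_ids):
        outfile.write("{} {}\n".format(speakerid, speakerid[2].lower()))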

In [12]:
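spk2gender_path = parent_path + "/data/test/spk2gender"
with open(spk2gender_path, 'w') as outfile:
    # Kaldi expects sorted speaker IDs and a lowercase gender (m or f).
    # The third character of an ID such as "01MA" is M or F.
    for speakerid in sorted(test_ids):
        outfile.write("{} {}\n".format(speakerid, speakerid[2].lower()))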

In [13]:
less ../data/test/spk2gender

b.) wav.scp 
This file maps every utterance (a sentence spoken by one person during a particular recording session) to its audio file. If you follow my naming approach, 'utteranceID' is simply the 'speakerID' (the speaker's folder name) joined with the *.wav file name without the '.wav' extension (see the examples below).

Pattern: [utteranceID] [full_path_to_audio_file]

In [15]:
ls -l ../interview_audio/test/01MA

total 93848
-rw-r--r--  1 yehua  staff  48047643 Mar 24 12:35 NI01MAX_0101.flac


In [27]:
wav_scp_path = parent_path + "/data/train/wav.scp"
with open(wav_scp_path, 'w') as outfile:
    for file in sorted(dir_list):
        speaker_id = re.split("_", file)[0][2:-1]
        if speaker_id in train_ids:
            # Audio lives in interview_audio/train/<speakerID>/<file>.
            path = parent_path + "/interview_audio/train/" + speaker_id + "/" + file
            # The utterance ID is the file name without its extension.
            outfile.write("{} {}\n".format(re.split(r"\.", file)[0], path))

In [30]:
wav_scp_path = parent_path + "/data/test/wav.scp"
with open(wav_scp_path, 'w') as outfile:
    for file in sorted(dir_list):
        speaker_id = re.split("_", file)[0][2:-1]
        if speaker_id in test_ids:
            # Audio lives in interview_audio/test/<speakerID>/<file>.
            path = parent_path + "/interview_audio/test/" + speaker_id + "/" + file
            # The utterance ID is the file name without its extension.
            outfile.write("{} {}\n".format(re.split(r"\.", file)[0], path))

In [31]:
less ../data/test/wav.scp
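
The audio files in this corpus are FLAC, while the second field of a wav.scp entry is read by Kaldi either as a path to a WAV file or as a command ending in `|` whose output is WAV data. One common option, sketched below, is to write a `flac` decoding command instead of the bare path. This assumes the `flac` command-line tool is installed; the helper name `wav_scp_line` and the example path are hypothetical.

```python
def wav_scp_line(utterance_id, audio_path):
    """Build one wav.scp line; FLAC files are decoded to WAV on the fly."""
    if audio_path.lower().endswith(".flac"):
        # flac flags: -c write to stdout, -d decode, -s silent
        return "{} flac -c -d -s {} |".format(utterance_id, audio_path)
    return "{} {}".format(utterance_id, audio_path)


# Hypothetical location of the file listed in this notebook.
print(wav_scp_line("NI01MAX_0101",
                   "/path/to/interview_audio/test/01MA/NI01MAX_0101.flac"))
```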

c.) text 
This file contains every utterance matched with its text transcription.

Pattern: [utteranceID] [text_transcription]

In [38]:
trans_path = parent_path + "/LDC2015S04/seame_d2/data/interview/transcript"
# Skip hidden entries such as .DS_Store; os.listdir returns names in arbitrary order.
trans_list = sorted(f for f in os.listdir(trans_path) if not f.startswith('.'))
text_path = parent_path + "/data/train/text"
with open(text_path, 'w') as outputfile:
    for file in trans_list:
        speaker_id = re.split("_", file)[0][2:-1]
        if speaker_id in train_ids:
            trans_file = trans_path + "/" + file
            # Copy every transcript line of a training speaker unchanged.
            with open(trans_file, 'r') as inputfile:
                for line in inputfile:
                    outputfile.write(line)

In [39]:
less ../data/train/text

In [40]:
trans_path = parent_path + "/LDC2015S04/seame_d2/data/interview/transcript"
# Skip hidden entries such as .DS_Store; os.listdir returns names in arbitrary order.
trans_list = sorted(f for f in os.listdir(trans_path) if not f.startswith('.'))
text_path = parent_path + "/data/test/text"
with open(text_path, 'w') as outputfile:
    for file in trans_list:
        speaker_id = re.split("_", file)[0][2:-1]
        if speaker_id in test_ids:
            trans_file = trans_path + "/" + file
            # Copy every transcript line of a test speaker unchanged.
            with open(trans_file, 'r') as inputfile:
                for line in inputfile:
                    outputfile.write(line)

In [41]:
less ../data/test/text
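
The cells above copy transcript lines as they are, whereas the pattern for this file is an utterance ID followed by the transcription. The sketch below shows one way to reshape a single line. It assumes each transcript line has four tab-separated fields (recording ID, start, end, transcription), which matches the use of the fourth field in the corpus.txt cell but is not otherwise confirmed here. The helper name, the sample line, and the segment-level ID scheme are hypothetical, and any ID scheme chosen must match the IDs used in the other files.

```python
def to_kaldi_text_line(raw_line):
    """Turn 'recording<TAB>start<TAB>end<TAB>text' into 'utteranceID text'."""
    fields = raw_line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return None  # malformed line; the caller decides how to handle it
    recording_id, start, end = fields[0], fields[1], fields[2]
    # Collapse repeated whitespace inside the transcription.
    transcription = " ".join(fields[3].split())
    utterance_id = "{}_{}_{}".format(recording_id, start, end)  # hypothetical scheme
    return "{} {}".format(utterance_id, transcription)


sample = "NI01MAX_0101\t0\t1500\thello  world\n"  # made-up line for illustration
print(to_kaldi_text_line(sample))  # NI01MAX_0101_0_1500 hello world
```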

d.) utt2spk 
This file tells the ASR system which utterance belongs to which speaker.

Pattern: [utteranceID] [speakerID]

In [47]:
utt2spk_path = parent_path + "/data/train/utt2spk"
with open(utt2spk_path, 'w') as outputfile:
    for file in sorted(dir_list):
        speaker_id = re.split("_", file)[0][2:-1]
        if speaker_id in train_ids:
            # Utterance ID (file name without extension) followed by its speaker ID.
            outputfile.write("{} {}\n".format(re.split(r"\.", file)[0], speaker_id))

In [49]:
utt2spk_path = parent_path + "/data/test/utt2spk"
with open(utt2spk_path, 'w') as outputfile:
    for file in sorted(dir_list):
        speaker_id = re.split("_", file)[0][2:-1]
        if speaker_id in test_ids:
            # Utterance ID (file name without extension) followed by its speaker ID.
            outputfile.write("{} {}\n".format(re.split(r"\.", file)[0], speaker_id))

In [48]:
less ../data/train/utt2spk

In [50]:
less ../data/test/utt2spk
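
Because wav.scp, utt2spk and spk2gender must agree on utterance and speaker IDs, a quick cross-check can catch mismatches early. The sketch below is a simplified illustration and does not reproduce what utils/validate_data_dir.sh does. The function names are hypothetical, and each argument is an iterable of lines, such as an open file.

```python
def read_keys(lines, column=0):
    """Return the given whitespace-separated column of every non-empty line."""
    return [line.split()[column] for line in lines if line.strip()]


def check_consistency(wav_scp, utt2spk, spk2gender):
    """Return a list of human-readable problems; an empty list means consistent."""
    utt2spk = list(utt2spk)  # read twice below, so materialize it first
    wav_ids = set(read_keys(wav_scp))
    utt_ids = set(read_keys(utt2spk))
    utt_speakers = set(read_keys(utt2spk, column=1))
    known_speakers = set(read_keys(spk2gender))

    problems = []
    for utt in sorted(utt_ids - wav_ids):
        problems.append("utt2spk utterance missing from wav.scp: " + utt)
    for utt in sorted(wav_ids - utt_ids):
        problems.append("wav.scp utterance missing from utt2spk: " + utt)
    for spk in sorted(utt_speakers - known_speakers):
        problems.append("speaker missing from spk2gender: " + spk)
    return problems


print(check_consistency(["NI01MAX_0101 /path/to/NI01MAX_0101.flac"],
                        ["NI01MAX_0101 01MA"],
                        ["01MA m"]))
```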

e.) corpus.txt 
This file goes in a slightly different directory. In kaldi/egs/code-switch/data, create another folder, local. In kaldi/egs/code-switch/data/local, create a file corpus.txt containing every utterance transcription that can occur in your ASR system (here, the transcription field of every line in the transcript files).

Pattern: [text_transcription]

In [60]:
local_path = parent_path + "/data/local"
if not os.path.exists(local_path):
    os.makedirs(local_path)

corpus_path = local_path + "/corpus.txt"
trans_path = parent_path + "/LDC2015S04/seame_d2/data/interview/transcript"
# Skip hidden entries such as .DS_Store; os.listdir returns names in arbitrary order.
trans_list = sorted(f for f in os.listdir(trans_path) if not f.startswith('.'))

with open(corpus_path, 'w') as outputfile:
    for file in trans_list:
        trans_file = trans_path + "/" + file
        with open(trans_file, 'r') as inputfile:
            for line in inputfile:
                # Transcript lines are tab-separated; the fourth field is the transcription.
                outputfile.write(re.split("\t", line)[3])


In [61]:
less ../data/local/corpus.txt