# Moth Transcripts to Gentle

The Huth Moth transcripts are provided as Praat TextGrid files. There are two issues with this format:
1. There is no single joint transcript that includes punctuation (which we need in order to present the next-word prediction framework)
2. Our pipeline uses Gentle output as its starting point for processing files

We load the Praat files and align them with a punctuated transcript generated through ChatGPT (adjusting mismatched words).
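
For orientation, the target structure can be sketched as below. The field names mirror the ones written by `textgrid_to_gentle` in this notebook; the transcript text and timings are invented for illustration, and real Gentle output carries additional fields (such as per-word phones) that this sketch leaves out.

```python
import json

# Illustrative values only: the transcript and timings below are invented.
transcript = "I grew up a huge fan."

example_align = {
    "transcript": transcript,
    "words": [
        {
            "word": "I",
            "alignedWord": "i",
            "case": "success",
            "start": 0.09,          # onset in seconds
            "end": 0.28,            # offset in seconds
            "startOffset": 0,       # character span within the transcript
            "endOffset": 1,
        },
        {
            "word": "grew",
            "alignedWord": "grew",
            "case": "success",
            "start": 0.28,
            "end": 0.57,
            "startOffset": 2,
            "endOffset": 6,
        },
    ],
}

# the character offsets should index back into the punctuated transcript
for entry in example_align["words"]:
    assert transcript[entry["startOffset"]:entry["endOffset"]] == entry["word"]

print(json.dumps(example_align, indent=2))
```

The character offsets are what tie each timed word back to the punctuated text, which is why the transcript and the TextGrid need the same number of words.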

In [1]:
%load_ext autoreload
import os, sys, glob
import json
import re
import numpy as np
import pandas as pd
from pathlib import Path
from praatio import textgrid as tgio
import shutil

sys.path.append('../utils/')

from text_utils import strip_punctuation
# from text_utils import get_pos_tags, get_lemma

In [1437]:
def load_clean_textgrid(praat_fn):
    '''
    Load a praat textgrid file using PraatIO
    '''
    
    # things to remove from the textgrid (indicates laughing, chewing, pauses etc)
    REMOVE_CHARACTERS = ['sp', 'br', 'lg', 'cg', 'ls', 'ns', 'sl', 'ig',
                         '{sp}', '{br}', '{lg}', '{cg}', '{ls}', '{ns}', '{sl}', '{ig}', 'pause']
    
    # open the textgrid
    tg = tgio.openTextgrid(praat_fn, includeEmptyIntervals=False, reportingMode="warning") 
    
    # remove entries of unwanted characters
    for tier_name in tg.tierNames:
        # get the current tier
        tier = tg.getTier(tier_name)
        
        # iterate over a copy so that deleting entries cannot disturb the loop
        for x in list(tier.entries):
            if x[-1].lower() in REMOVE_CHARACTERS:
                tier.deleteEntry(x)

#         for char in REMOVE_CHARACTERS:
#             upper_set = set(tier.find(char.upper()))
#             lower_set = set(tier.find(char.lower()))
#             remove_idxs = sorted(upper_set.union(lower_set))

#             # go through each index and remove
#             for idx in remove_idxs:
#                 try:
#                     tier.deleteEntry(tier.entries[idx])
#                 except:
#                     print (idx)
    
#     # go through each entry at the word tier, remove the items
#     words = [x for x in tg.getTier('word').entries if x[-1].lower() not in REMOVE_CHARACTERS]
#     phones = [x for x in tg.getTier('phone').entries if x[-1].lower() not in REMOVE_CHARACTERS]
#     words = tg.getTier('word').entries
#     phones = tg.getTier('phone').entries
    return tg

def load_transcription(transcript_fn):
    '''
    Load the punctuated transcript and a punctuation-free list of its words
    '''
    
    with open(transcript_fn, 'r') as f:
        contents = f.readlines()  # list of lines in the file
        
    # get the transcription stripped of punctuation
    # (lines are joined first, assuming strip_punctuation expects a single string)
    words_transcribed = strip_punctuation(''.join(contents)).split()
    
    return contents, words_transcribed

def textgrid_to_gentle(praat_fn, transcript_fn):
    '''
    Transform Moth dataset textgrid files into gentle format
    '''
    
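    # load_clean_textgrid returns the whole TextGrid, so pull out the word tier
    textgrid = load_clean_textgrid(praat_fn)
    tg_words = textgrid.getTier('word').entries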
    
    contents, words_transcribed = load_transcription(transcript_fn)
    
    assert (len(tg_words) == len(words_transcribed))
    
    # create the dictionary to store things in
    # put the transcript in the raw form
    align = {}
    # use the full text so the character offsets below index into it
    align['transcript'] = ''.join(contents)
    align['words'] = []
    
    # Taken from Kaldi metasentence tokenizer
    # splits the transcript based on any punctuation besides for apostrophes and hyphens
    regex_split_pattern = r'(\w|\’\w|\'\w|\-\w)+'
    
    iterator = list(re.finditer(regex_split_pattern, ''.join(contents), re.UNICODE))
    n_items = len(iterator)
    
    # make sure the iterator matches the length
    assert (n_items == len(tg_words) == len(words_transcribed))
    
    # if all matches we're good to go
    for tg_info, m in zip(tg_words, iterator):
        # span of the word in characters relative to the overall string
        start_offset, end_offset = m.span()
        word = m.group()
        
        word_align = {
            'alignedWord': word.lower(),
            "case": "success",
            'word': word,
            'start': tg_info[0],
            'end': tg_info[1],
            "startOffset": start_offset,
            "endOffset": end_offset,
        }
        
        align['words'].append(word_align)
        
    return align
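
To see what the tokenizer pattern does, here is a small self-contained sketch. It uses the same alternatives as the pattern in `textgrid_to_gentle`, written without the redundant escapes (which should not change what it matches); the sample sentence is invented.

```python
import re

# Word characters, plus apostrophes or hyphens that are followed by a
# word character (so "I'd" and "eighty-four" each stay one token).
TOKEN_PATTERN = r"(\w|’\w|'\w|-\w)+"

sample = "I'd never seen eighty-four."  # invented sample text

# collect each token with its character span in the sample
tokens = [(m.group(), m.span()) for m in re.finditer(TOKEN_PATTERN, sample)]

for word, (start, end) in tokens:
    print(f"{word!r}: characters {start}-{end}")
```

Trailing punctuation is excluded from each span, so the spans can be used directly as `startOffset` and `endOffset` values.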

## Set paths 

These are paths to the main directory and the stimulus directory

CHANGE THE PATH BELOW TO MATCH YOUR DIRECTORY --> FinnLabTasks/transcript_alignment/

In [1341]:
base_dir = '/dartfs/rc/lab/F/FinnLab/tommy/isc_asynchrony_behavior/'
datasets_dir = '/dartfs/rc/lab/F/FinnLab/datasets/'
stim_dir = os.path.join(datasets_dir, 'huth-moth', 'stimuli')

# for prepping for online expt
# stim_dir = os.path.join(base_dir, 'stimuli')

## Load Praat files

We first get all the filenames of TextGrid files within the stimulus directory. We also print out the number of files within this directory.

In [1342]:
praat_fns = sorted(glob.glob(os.path.join(stim_dir, 'praat', '*.TextGrid')))

print (f'Total files in dataset: {len(praat_fns)}')

Total files in dataset: 27


<b>Note:</b> This is <b>very</b> likely to fail on the first try. Follow the steps below to get the file to load!

We are going to load a Praat TextGrid file. This will probably fail on the first attempt because of overlapping timestamps. To address this, do the following:
1. Open the .TextGrid file in a text editor (e.g., TextEdit, SublimeText)
2. Look at the Python error -- you will need to manually adjust the overlapping times. Copy the first number in the second set of parentheses:
    - <b>Example error:</b> Two intervals in the same tier overlap in time: (START_1, END_1, sp) and (START_2, END_2, B)
    - For this error, copy the number "START_2"
3. Go to the text editor, and search (cmd + F) for the copied number (e.g., "START_2").
4. Set the end time of the preceding word/phoneme (e.g., END_1) to the copied number ("START_2").
5. Save the file and rerun the code
6. Repeat until the file loads
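
The adjustment in step 4 can be sketched in code. This is only an illustration on plain `(start, end, label)` tuples: the helper name `clamp_overlaps` and the example values are hypothetical, and it does not read or write TextGrid files, so the manual edit above is still what fixes the file on disk.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start, end, label)


def clamp_overlaps(intervals: List[Interval]) -> List[Interval]:
    """Trim each interval's end time so it never exceeds the next start time."""
    ordered = sorted(intervals, key=lambda iv: iv[0])
    fixed = []
    for i, (start, end, label) in enumerate(ordered):
        if i + 1 < len(ordered):
            next_start = ordered[i + 1][0]
            # an overlap means this interval ends after the next one begins
            if end > next_start:
                end = next_start
        fixed.append((start, end, label))
    return fixed


# invented example: 'sp' ends at 0.12 but 'B' already starts at 0.10
example = [(0.00, 0.12, "sp"), (0.10, 0.30, "B"), (0.30, 0.45, "AH1")]
print(clamp_overlaps(example))
```

Clamping the earlier interval's end (rather than moving the later interval's start) matches the manual procedure, which leaves every onset time untouched.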

In [1345]:
# select a file number to load -- we then select that file from the list of alphabetized file names
file_num = 15
praat_fn = praat_fns[file_num]

# now grab the current filename as a path -- print out only the filename (no extension)
filepath = Path(praat_fn)
stim_name = filepath.stem
print (f'Stimulus name: {filepath.stem}')

# attempt to load the praat file -- if this doesn't work, follow the steps above 
tg = tgio.openTextgrid(praat_fns[file_num], includeEmptyIntervals=False, reportingMode="warning") 

print (f'Successfully loaded Praat file!')


Stimulus name: myfirstdaywiththeyankees
Successfully loaded Praat file!


## Adjust the words to have punctuation

After loading the transcript from the Praat file, we concatenate all of its words and pass them to ChatGPT to add punctuation. We then compare the result word by word, making sure of the following:
- The new transcript has the same number of words as the original
- Words are spelled correctly (as full words)

This cell below will print out all the words of the TextGrid as a string. You will need to do the following:
1. Open ChatGPT: https://chat.openai.com/chat
2. Type the following instructions: "Add punctuation and capitalization to the following but change nothing else:"
3. Copy and paste the transcript below <i>after</i> the instructions
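
If it helps to avoid copy-paste slips, the full message can be assembled in one go. A minimal sketch, assuming the words are already available as a list of strings; the helper name `build_prompt` is hypothetical and nothing here calls ChatGPT itself.

```python
# the instruction text quoted in step 2
INSTRUCTIONS = "Add punctuation and capitalization to the following but change nothing else:"


def build_prompt(words):
    """Join the instruction text and the space-separated words into one message."""
    return f"{INSTRUCTIONS}\n\n{' '.join(words)}"


# invented short example; in the notebook this would be the TextGrid word list
example_words = ["I", "GREW", "UP", "A", "UH", "A", "HUGE", "FAN"]
print(build_prompt(example_words))
```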

In [1442]:
def get_textgrid_words(textgrid):
    '''
    Extracts the words in the textgrid to show in a legible format
    '''
    words = [strip_punctuation(x[-1]) for x in textgrid.getTier('word').entries]
    return words

# load the textgrid, removing non-speech annotations (pauses, laughter, etc.)
textgrid = load_clean_textgrid(praat_fn)

# gets all the words in the textgrid as an interpretable string
tg_words = get_textgrid_words(textgrid)
print (' '.join(tg_words))

I GREW UP A UH A HUGE FAN OF OF THE NEW YORK YANKEES WHICH WHEN I WAS VERY SMALL INVOLVED GOING TO GAMES MAYBE ONCE A YEAR WITH MY MY FATHER AND MY LITTLE BROTHER WATCHING UH REGGIE JACKSON AND A LITTLE BIT OLDER WATCHING UH DAVE WINFIELD AND THEN WHEN I KIND OF CAME INTO MY TEENS UH DON MATTINGLY WHO WAS YOU KNOW MY ABSOLUTE FAVORITE PLAYER AND AS I I WENT TO HIGH SCHOOL IN NEW YORK AND IT WAS KIND OF A TURNING POINT THE FIRST TIME THAT I WENT TO A YANKEE GAME BY MYSELF AND I STARTED GOING TO YANKEE GAMES BY MYSELF AND IT WAS AT ONE OF THESE GAMES IN THE FALL OF NINETEEN NINETYONE THAT I WENT UP TO THE STADIUM BOUGHT A TICKET TO THE BLEACHERS AND WENT AND SAT IN THE BLEACHERS AND WAS WATCHING UM THE GAME AND NOTICED FOR THE FIRST TIME SOMETHING THAT ID ID BEEN TO THE STADIUM SO MANY TIMES BEFORE BUT ID NEVER SEEN UH THIS KID IN RIGHT FIELD WEARING A YANKEE UNIFORM WHO WAS A BAT BOY PLAYING CATCH WITH THE RIGHT FIELDER AND ID NEVER NOTICED THE BAT BOY BEFORE AND THIS KID COULD NOT PLAY

## Create a transcript file

ChatGPT will then print out a version of the transcript with punctuation. However, we need to double-check that the words match the original transcript. After getting the transcript from ChatGPT:
1. Go to the directory '/stimuli/transcripts/' 
2. Create a text file named "STIMULUSNAME_transcript.txt" (where STIMULUSNAME is the name of the stimulus, printed out above), which is the filename the comparison cell loads
3. Paste the transcript from ChatGPT into the text file

You should now be able to load the file in this notebook
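
The file can also be written from Python instead of by hand. A minimal sketch, assuming the ChatGPT output has been pasted into a string; the helper name `save_transcript` is hypothetical, and the filename follows the `{stim_name}_transcript.txt` pattern that the comparison cell in this notebook reads. The demo writes to a temporary directory rather than the real stimulus directory.

```python
import tempfile
from pathlib import Path


def save_transcript(text, transcripts_dir, stim_name):
    """Write the punctuated transcript to <transcripts_dir>/<stim_name>_transcript.txt."""
    transcripts_dir = Path(transcripts_dir)
    transcripts_dir.mkdir(parents=True, exist_ok=True)
    out_fn = transcripts_dir / f"{stim_name}_transcript.txt"
    out_fn.write_text(text.strip(), encoding="utf-8")
    return out_fn


# demo in a temporary directory with an invented transcript and stimulus name
with tempfile.TemporaryDirectory() as tmp_dir:
    out_fn = save_transcript("I grew up a, uh, a huge fan.", tmp_dir, "examplestory")
    saved_name = out_fn.name
    saved_text = out_fn.read_text(encoding="utf-8")

print(saved_name)
```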

In [1432]:
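def compare_praat_to_transcript(words_original, words_transcribed):
    '''
    Compares words from TextGrid and ChatGPT transcript word by word,
    stopping at the first mismatch
    '''
    
    for i, (word_orig, word_transc) in enumerate(zip(words_original, words_transcribed)):
        if word_orig.lower() != word_transc.lower():
            print (f'Word index: {i}')
            print (f'TextGrid word: {word_orig}')
            print (f'Transcript word: {word_transc}')
            # clamp the start so a mismatch in the first five words still shows context
            print (f'Word context: {words_original[max(0, i-5):i+5]}')
            break
    else:
        # no mismatch found -- zip stops at the shorter list, so also compare lengths
        if len(words_original) == len(words_transcribed):
            print (f'Finished transcript!')
        else:
            print (f'Word counts differ: {len(words_original)} (TextGrid) vs {len(words_transcribed)} (transcript)')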

## Check the transcript with the original file

Run the following cell to compare words from the TextGrid to words from the ChatGPT transcript.

Sometimes words will be misaligned:
- ChatGPT may have missed some words
- The Praat words may be misspelled, or hyphenated words may have been treated separately (e.g., eighty-four --> eighty four)

You will need to correct this in either 1) the transcript or 2) the Praat file and make note of the change within the tracking document
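
When several words are off, it can be quicker to list every mismatch at once rather than fixing them one at a time. The sketch below uses the standard library's `difflib.SequenceMatcher` for that; it is an alternative to the word-by-word check in this notebook, not part of the original pipeline, and the helper name `list_mismatches` and the word lists are invented for illustration.

```python
from difflib import SequenceMatcher


def list_mismatches(words_original, words_transcribed):
    """Return (tag, index, original_words, transcript_words) for each differing run."""
    a = [w.lower() for w in words_original]
    b = [w.lower() for w in words_transcribed]
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)

    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((tag, i1, words_original[i1:i2], words_transcribed[j1:j2]))
    return mismatches


# invented example: a merged number and a filler word dropped from the transcript
original = ["EIGHTY", "FOUR", "GAMES", "UH", "A", "YEAR"]
transcribed = ["eightyfour", "games", "a", "year"]

for mismatch in list_mismatches(original, transcribed):
    print(mismatch)
```

Each tuple gives the kind of difference (`replace`, `delete`, or `insert`) and the index in the TextGrid word list where it starts, which is the position to look up when correcting the transcript or the Praat file.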

In [1443]:
transcript_fn = os.path.join(stim_dir, 'transcripts', f'{stim_name}_transcript.txt')

# load the textgrid and get all words
textgrid = load_clean_textgrid(praat_fn)
words_original = get_textgrid_words(textgrid)

# load the ChatGPT created transcript
_, words_transcribed = load_transcription(transcript_fn)

compare_praat_to_transcript(words_original, words_transcribed)

Finished transcript!


In [1434]:
praat_fn

'/dartfs/rc/lab/F/FinnLab/datasets/huth-moth/stimuli/praat/myfirstdaywiththeyankees.TextGrid'

## Create a gentle align file from Praat

In [1350]:
gentle_stim_dir = os.path.join(stim_dir, 'gentle', stim_name)

# make the directory if it does not already exist
os.makedirs(gentle_stim_dir, exist_ok=True)

Now that the directory is created, we will do the following:
- Write the aligned file to the directory
- Copy the stimulus audio to the directory
- Copy the transcript to the directory
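
Before writing the aligned file, it may be worth confirming that every word's character offsets point back at that word in the stored transcript. A minimal sketch, assuming an alignment dictionary shaped like the one built in this notebook; the helper name `find_offset_errors` and the example data are hypothetical.

```python
def find_offset_errors(align):
    """Return the entries whose character offsets do not reproduce their word."""
    transcript = align["transcript"]
    return [
        entry
        for entry in align["words"]
        if transcript[entry["startOffset"]:entry["endOffset"]] != entry["word"]
    ]


# invented example: the second entry has deliberately wrong offsets
example_align = {
    "transcript": "I grew up.",
    "words": [
        {"word": "I", "startOffset": 0, "endOffset": 1},
        {"word": "grew", "startOffset": 3, "endOffset": 7},
    ],
}

for entry in find_offset_errors(example_align):
    print(f"Offsets for {entry['word']!r} do not match the transcript")
```

An empty result means the offsets and the stored transcript agree; any entries returned point to a mismatch between the text the offsets were computed on and the text saved under `transcript`.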

In [1375]:
tg.getTier('word').find

<bound method TextgridTier.find of <praatio.data_classes.interval_tier.IntervalTier object at 0x2b22e7eab820>>

2757

In [1450]:
word_info

Interval(start=0.0922902494331, end=0.281859410431, label='I')

In [1454]:
?textgrid.crop

In [1460]:
cropped_grid.getTier('phone').entries

(Interval(start=0.0922902494331, end=0.281859410431, label='AY1'),)

In [1461]:

textgrid = load_clean_textgrid(praat_fn)
tg_words = textgrid.getTier('word')

contents, words_transcribed = load_transcription(transcript_fn)

assert (len(tg_words) == len(words_transcribed))

# create the dictionary to store things in
# put the transcript in the raw form
align = {}
align['transcript'] = contents[0]
align['words'] = []

# Taken from Kaldi metasentence tokenizer
# splits the transcript based on any punctuation besides for apostrophes and hyphens
regex_split_pattern = r'(\w|\.\w|\’\w|\'\w|\-\w)+'

iterator = list(re.finditer(regex_split_pattern, ''.join(contents), re.UNICODE))
n_items = len(iterator)
# make sure the iterator matches the length
assert (n_items == len(tg_words) == len(words_transcribed))

# if all matches we're good to go
for word_info, m in zip(tg_words, iterator):
    
    # span of the word in characters relative to the overall string
    start_offset, end_offset = m.span()
    word = m.group()
    
    # crop textgrid to the word
    cropped_grid = textgrid.crop(cropStart=word_info[0], cropEnd=word_info[1], mode='truncated', rebaseToZero=False)
    tg_phones = cropped_grid.getTier('phone').entries
    word_phones = []
    
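    for phone_info in tg_phones:
        # strip stress digits from the label (e.g., 'AY1' --> 'AY')
        phone = re.sub(r'\d+', '', phone_info[-1])
        # compute the duration from the interval, not from the label string
        duration = phone_info[1] - phone_info[0]
        word_phones.append({'duration': duration, 'phone': phone})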
    
    word_align = {
        'alignedWord': word.lower(),
        "case": "success",
        'word': word,
        'start': word_info[0],
        'end': word_info[1],
        "startOffset": start_offset,
        "endOffset": end_offset,
        # per-word phones collected above
        'phones': word_phones,
    }

    align['words'].append(word_align)
    
    sys.exit(0)

# return align

SystemExit: 0

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [1470]:
set([re.sub(r'\d+', '', entry[-1]) for entry in textgrid.getTier('phone').entries])

{'AA',
 'AE',
 'AH',
 'AO',
 'AW',
 'AY',
 'B',
 'CH',
 'D',
 'DH',
 'EH',
 'ER',
 'EY',
 'F',
 'G',
 'HH',
 'IH',
 'IY',
 'JH',
 'K',
 'L',
 'M',
 'N',
 'NG',
 'OW',
 'OY',
 'P',
 'R',
 'S',
 'SH',
 'T',
 'TH',
 'UH',
 'UW',
 'V',
 'W',
 'Y',
 'Z',
 'ZH'}

In [1384]:
tg.crop(tg_info[0], tg_info[1], mode='truncated', rebaseToZero=False).getTier('phone').entries

(Interval(start=727.738548753, end=727.798412698, label='Y'),
 Interval(start=727.798412698, end=727.888208617, label='UW1'))

In [1376]:
tg_info

Interval(start=727.738548753, end=727.888208617, label='YOU')

In [1304]:
# open the textgrid
tg = tgio.openTextgrid(praat_fn, includeEmptyIntervals=False, reportingMode="warning") 

(Interval(start=0.0124716553288, end=0.0922902494331, label='sp'),
 Interval(start=0.0922902494331, end=0.281859410431, label='AY1'),
 Interval(start=0.281859410431, end=0.371655328798, label='G'),
 Interval(start=0.371655328798, end=0.54126984127, label='R'),
 Interval(start=0.54126984127, end=0.571201814059, label='UW1'),
 Interval(start=0.571201814059, end=0.700907029478, label='AH1'),
 Interval(start=0.700907029478, end=0.750793650794, label='P'),
 Interval(start=0.750793650794, end=1.11995464853, label='EY1'),
 Interval(start=1.11995464853, end=1.46916099773, label='AH1'),
 Interval(start=1.46916099773, end=1.58888888889, label='sp'),
 Interval(start=1.58888888889, end=1.69863945578, label='AH0'),
 Interval(start=1.69863945578, end=1.8283446712, label='HH'),
 Interval(start=1.8283446712, end=1.92811791383, label='Y'),
 Interval(start=1.92811791383, end=1.96802721088, label='UW1'),
 Interval(start=1.96802721088, end=2.07777777778, label='JH'),
 Interval(start=2.07777777778, end=2.1

In [1303]:
textgrid

[Interval(start=0.0922902494331, end=0.281859410431, label='I'),
 Interval(start=0.281859410431, end=0.571201814059, label='GREW'),
 Interval(start=0.571201814059, end=0.750793650794, label='UP'),
 Interval(start=0.750793650794, end=1.11995464853, label='A'),
 Interval(start=1.11995464853, end=1.46916099773, label='UH'),
 Interval(start=1.58888888889, end=1.69863945578, label='A'),
 Interval(start=1.69863945578, end=2.07777777778, label='HUGE'),
 Interval(start=2.07777777778, end=2.46689342404, label='FAN'),
 Interval(start=2.46689342404, end=2.70634920635, label='OF'),
 Interval(start=2.84603174603, end=3.0156462585, label='OF'),
 Interval(start=3.0156462585, end=3.09546485261, label='THE'),
 Interval(start=3.09546485261, end=3.2052154195, label='NEW'),
 Interval(start=3.2052154195, end=3.4746031746, label='YORK'),
 Interval(start=3.4746031746, end=3.87369614512, label='YANKEES'),
 Interval(start=3.87369614512, end=4.02335600907, label='WHICH'),
 Interval(start=4.02335600907, end=4.15

In [14]:
# given the two files, creates a file in gentle aligned format
align_json = textgrid_to_gentle(praat_fn, transcript_fn)

In [18]:
for word in align_json['words']:
    print (word['alignedWord'])

under
the
influence
is
our
topic
tonight
and
uh
i
i
thought
of
a
lot
of
different
things
and
and
the
one
i
kinda
wanted
to
talk
about
was
my
secret
influence
um
there
are
people
in
our
lives
that
are
incredibly
important
to
us
and
i
would
even
mention
to
say
some
of
the
most
important
people
everybody
aside
from
my
mother
in
my
life
doesn't
know
anything
about
this
person
who
i
think
about
every
day
and
his
name
was
michael
marquis
so
if
this
were
a
violin
piece
it
would
be
called
ode
to
michael
marquis
um
i
met
him
when
i
was
seven
years
old
and
i
was
growing
up
uh
my
mother
had
left
my
father
when
i
was
about
two
or
three
and
this
was
the
first
man
that
my
mother
had
brought
home
like
for
me
to
meet
and
uh
he
was
a
vietnam
vet
he
was
a
professional
athlete
ah
he
had
played
college
football
college
baseball
he
was
uh
a
ski
instructor
and
a
golf
instructor
uh
he
had
like
a
big
beard
and
he
drank
scotch
and
he
had
a
german
shepherd
and
he
had
a
lab
and
he
had
this
unbelievable
m
g
and
n

In [None]:
# given the two files, creates a file in gentle aligned format
align_json = textgrid_to_gentle(praat_fn, transcript_fn)

# write the file out to the directory
with open(os.path.join(gentle_stim_dir, 'align.json'), 'w') as f:
    json.dump(align_json, f)
    
# copy the transcript file renaming it to "transcript.txt" matching gentle convention
shutil.copyfile(
    transcript_fn, 
    os.path.join(gentle_stim_dir, 'transcript.txt')
)

# copy the stimulus audio file renaming it to "a.wav" matching gentle convention
shutil.copyfile(
    os.path.join(stim_dir, 'audio', f'{stim_name}.wav'), 
    os.path.join(gentle_stim_dir, 'a.wav')
)