## Textgrids to dataframes

Process the inventory of xrmbdb `.TextGrid` files and create `word` and `phone` dataframes for all subjects and utterances. Store results in `.feather` files that can be loaded in the `load_dataframes` notebook.

In [1]:
import os
import pandas as pd
import numpy as np
from audiolabel import read_label  # from https://github.com/rsprouse/audiolabel
from phonlab.utils import dir2df   # from https://github.com/rsprouse/phonlab

In [2]:
# Location of speaker folders containing .TextGrid files
datadir = '../annotation'

Define a function to load a `.TextGrid` file and return `phone` and `word` dataframes with additional metadata columns.

In [3]:
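def tg2phwd(row):
    '''Read phones and words from a textgrid and return a tuple of (phone, word) dataframes.'''

    # Read the textgrid identified by `row` into one dataframe per tier.
    # The unpacking below assumes `read_label` returns the phone tier first.
    wavpath = os.path.join(row.relpath, row.fname.replace('.TextGrid', '.wav'))
    phdf, wddf = read_label(os.path.join(datadir, row.relpath, row.fname), ftype='praat')

    # Add metadata columns and remove `fname` column.
    # A basename such as `tp102_2` splits into utterance id and repetition;
    # a basename without a suffix, such as `tp001`, is treated as repetition '1'.
    bname = os.path.splitext(row.fname)[0]
    try:
        uttid, rep = bname.split('_')
    except ValueError:
        # The basename did not split into exactly two parts.
        uttid = bname
        rep = '1'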
    phdf = phdf.assign(
        speaker=row.relpath, uttid=uttid, rep=rep, wavpath=wavpath
    ).drop('fname', axis='columns')
    wddf = wddf.assign(
        speaker=row.relpath, uttid=uttid, rep=rep, wavpath=wavpath
    ).drop('fname', axis='columns')
    
    return (phdf, wddf)
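
The filename handling inside `tg2phwd` can be checked in isolation. The sketch below is not part of the original notebook; `split_uttid_rep` is a hypothetical helper that mirrors the `try`/`except` logic, under the assumption that basenames look like `tp001` or `tp102_2`, as in the inventory listing.

```python
def split_uttid_rep(bname):
    """Hypothetical helper mirroring the uttid/rep logic in tg2phwd."""
    parts = bname.split('_')
    if len(parts) == 2:
        # e.g. 'tp102_2' -> utterance 'tp102', repetition '2'
        return parts[0], parts[1]
    # No suffix (or an unexpected number of underscores): keep the whole
    # basename as the utterance id and treat it as the first repetition.
    return bname, '1'

examples = {name: split_uttid_rep(name) for name in ('tp001', 'tp102_2')}
```

Note that `rep` stays a string in both cases, which matches the later cast of the `rep` column to a categorical dtype.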

Get an inventory of `.TextGrid` files for processing.

In [4]:
tgdf = dir2df(datadir, fnpat='.TextGrid$')
tgdf

Unnamed: 0,relpath,fname
0,JW11,tp001.TextGrid
1,JW11,tp002.TextGrid
2,JW11,tp003.TextGrid
3,JW11,tp004.TextGrid
4,JW11,tp005.TextGrid
...,...,...
5041,JW63,tp102_2.TextGrid
5042,JW63,tp103.TextGrid
5043,JW63,tp103_2.TextGrid
5044,JW63,tp104_2.TextGrid


For a convenient end result, sort speakers by their numeric identifier rather than alphabetically, so that `JW502` comes after `JW63`.

In [5]:
tgdf = tgdf.loc[
    tgdf.relpath.str.replace('JW', '').astype(int).sort_values().index
]
tgdf

Unnamed: 0,relpath,fname
0,JW11,tp001.TextGrid
77,JW11,tp077.TextGrid
76,JW11,tp076.TextGrid
75,JW11,tp075.TextGrid
74,JW11,tp074.TextGrid
...,...,...
3593,JW502,tp029.TextGrid
3592,JW502,tp028.TextGrid
3591,JW502,tp027.TextGrid
3589,JW502,tp025.TextGrid
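
In the listing above the rows within each speaker are no longer in filename order, which is consistent with the default sort algorithm of `sort_values` not being stable. If within-speaker order matters, a stable sort is one option. The following is a minimal sketch on a made-up inventory (the `toy` dataframe is an assumption for illustration), not a change to the notebook's own cell.

```python
import pandas as pd

# Toy inventory standing in for tgdf; values are illustrative only.
toy = pd.DataFrame({
    'relpath': ['JW502', 'JW11', 'JW63', 'JW11', 'JW63'],
    'fname': ['tp001.TextGrid', 'tp001.TextGrid', 'tp001.TextGrid',
              'tp002.TextGrid', 'tp002.TextGrid'],
})

# Numeric speaker id, e.g. 'JW502' -> 502.
speaker_num = toy.relpath.str.replace('JW', '', regex=False).astype(int)

# 'mergesort' is a stable algorithm, so rows that share a speaker
# keep their original relative order.
order = speaker_num.sort_values(kind='mergesort').index
toy_sorted = toy.loc[order]
```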


Applying `tg2phwd` to each row yields a Series of tuples, where the first element of each tuple is a `phone` dataframe and the second element is a `word` dataframe. The two kinds of dataframe are concatenated separately to create master `phone` and `word` dataframes covering all `.TextGrid` files.

In [6]:
dftuples = tgdf.apply(tg2phwd, axis='columns')

In [7]:
allph = pd.concat([t[0] for t in dftuples]).reset_index(drop=True)
allwd = pd.concat([t[1] for t in dftuples]).reset_index(drop=True)
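
The same unpack-and-concatenate pattern can be seen on a small scale. In this sketch a plain list of tuples stands in for `dftuples`, and the one-column dataframes and their labels are made up for illustration. `zip(*...)` is used here as an alternative to the two list comprehensions, and `ignore_index=True` has the same effect as the `reset_index(drop=True)` call.

```python
import pandas as pd

# Each tuple is (phone dataframe, word dataframe) for one hypothetical file.
toy_tuples = [
    (pd.DataFrame({'label': ['a', 'b']}), pd.DataFrame({'label': ['ab']})),
    (pd.DataFrame({'label': ['c']}), pd.DataFrame({'label': ['c']})),
]

# Transpose the list of pairs into a tuple of phone frames and a tuple of word frames.
ph_parts, wd_parts = zip(*toy_tuples)

# Concatenate each kind separately and renumber the rows from 0.
toy_ph = pd.concat(ph_parts, ignore_index=True)
toy_wd = pd.concat(wd_parts, ignore_index=True)
```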

Cast columns to more memory-efficient dtypes and rename the `label` columns to `phone` and `word`.

In [8]:
for col in ('label', 'speaker', 'uttid', 'rep', 'wavpath'):
    allph[col] = allph[col].astype('category')
    allwd[col] = allwd[col].astype('category')
for col in ('t1', 't2'):
    allph[col] = allph[col].astype(np.float32)
    allwd[col] = allwd[col].astype(np.float32)
allph = allph.rename({'label': 'phone'}, axis='columns')
allwd = allwd.rename({'label': 'word'}, axis='columns')
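
To see what these casts do, here is a small sketch on a made-up dataframe (the labels, speakers, and times are illustrative assumptions, not values from the corpus). Repeated strings are stored once as categories, and times are stored in 32-bit rather than 64-bit floats, which keeps roughly seven significant decimal digits.

```python
import numpy as np
import pandas as pd

# Toy stand-in for allph; all values are invented for illustration.
toy = pd.DataFrame({
    'label': ['DH', 'AH', 'DH', 'AH'],
    'speaker': ['JW11', 'JW11', 'JW12', 'JW12'],
    't1': [0.0, 0.1, 0.0, 0.12],
    't2': [0.1, 0.25, 0.12, 0.3],
})

# Repeated string values become integer codes plus a small category table.
for col in ('label', 'speaker'):
    toy[col] = toy[col].astype('category')

# Halve the storage for the time columns.
for col in ('t1', 't2'):
    toy[col] = toy[col].astype(np.float32)

toy = toy.rename({'label': 'phone'}, axis='columns')
```

Calling `toy.dtypes` or `toy.memory_usage(deep=True)` before and after the casts is a quick way to check the effect on a real dataframe.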

Store the dataframes in `.feather` files.

In [9]:
allwd.to_feather(os.path.join(datadir, 'all_words.feather'))
allph.to_feather(os.path.join(datadir, 'all_phones.feather'))

Also store as `.csv` files.

In [None]:
allwd.to_csv(os.path.join(datadir, 'all_words.csv'), index=False)
allph.to_csv(os.path.join(datadir, 'all_phones.csv'), index=False)
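
Unlike `.feather`, a `.csv` file does not record column dtypes, so the categorical and `float32` columns come back with default dtypes unless they are requested when reading. The sketch below illustrates this with an in-memory buffer and a made-up two-row dataframe (an assumption for illustration). The column names follow the `word` dataframe built above.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for allwd; values are invented for illustration.
toy = pd.DataFrame({
    't1': np.array([0.0, 0.1], dtype=np.float32),
    't2': np.array([0.1, 0.4], dtype=np.float32),
    'word': pd.Categorical(['the', 'cat']),
    'speaker': pd.Categorical(['JW11', 'JW11']),
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default read: dtypes are inferred from the text.
buf.seek(0)
plain = pd.read_csv(buf)

# Read again, asking for the original dtypes explicitly.
buf.seek(0)
restored = pd.read_csv(
    buf,
    dtype={'t1': np.float32, 't2': np.float32,
           'word': 'category', 'speaker': 'category'},
)
```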