# Automated Praat formant measurements, continued

This notebook continues the formant analysis [started in a previous notebook](Praat%20formant%20measurements.ipynb). In the first notebook, formant analysis was performed on a set of `.wav` files and the results were cached in `.csv` files. Here we combine the cached formant measurements with annotated vowel regions to extract time-aligned vowel formant measurements.

In [1]:
from pathlib import Path
import pandas as pd
from audiolabel import read_label
from phonlab.utils import dir2df

Define variables that pertain to the project.

* `tgdir` is the location of textgrids with `phone` and `word` tiers
* `csvdir` is the location of `.csv` files with time-aligned formant measurements
* `of_interest` is a list of phones for which formant measurements are desired

In [9]:
tgdir = Path.home() / 'src/xray_microbeam_database/annotation'
csvdir = Path.home() / 'xray_formants_praat'

# Phones for which we will extract formant measurements
of_interest = ['AH0', 'AO0', 'AO1', 'AO2']

Make a dataframe of textgrid files, extracting a `task` identifier from each filename. The corpus is organized into one directory per subject, so the `relpath` value also serves as the subject identifier.

In [11]:
dirpat = '^JW6'
#dirpat = '^JW' # use this instead for all subject directories
fnpat = r'^tp(?P<task>00\d)\.TextGrid$'  # raw string so that \d and \. reach the regex engine unchanged
#fnpat = r'^t[ap](?P<task>\d+)\.TextGrid$' # use this instead for all .TextGrid files
tgdf = dir2df(
    tgdir,
    dirpat=dirpat,
    fnpat=fnpat,
    addcols=['barename']
)
tgdf

,relpath,fname,barename,task
0,JW60,tp001.TextGrid,tp001,1
1,JW60,tp002.TextGrid,tp002,2
2,JW60,tp003.TextGrid,tp003,3
3,JW60,tp004.TextGrid,tp004,4
4,JW60,tp005.TextGrid,tp005,5
5,JW60,tp006.TextGrid,tp006,6
6,JW60,tp007.TextGrid,tp007,7
7,JW60,tp008.TextGrid,tp008,8
8,JW60,tp009.TextGrid,tp009,9
9,JW61,tp001.TextGrid,tp001,1
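To see how the named group in `fnpat` produces the `task` column, the pattern can be tried directly with the standard `re` module. This is only a sketch: the filenames are made up, and it assumes `dir2df` applies the pattern with ordinary `re` semantics, which is not shown in this notebook.

```python
import re

# Same pattern as `fnpat` above, written as a raw string
fnpat = re.compile(r'^tp(?P<task>00\d)\.TextGrid$')

# Hypothetical filenames, for illustration only
names = ['tp001.TextGrid', 'tp010.TextGrid', 'ta001.TextGrid', 'tp001.wav']

matched = {}
for name in names:
    m = fnpat.search(name)
    if m:
        # group('task') returns the text captured by (?P<task>...)
        matched[name] = m.group('task')

print(matched)
```

Only `tp001.TextGrid` matches here, because `00\d` restricts `task` to `000` through `009`. Note that the captured value is the string `'001'`, not the integer 1.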


Make a dataframe of `.csv` files with formant measurements, extracting `task` from the filenames along with the `ceil` and `numformant` analysis parameters that were used to produce them. The subject identifier is in `relpath`.

In [4]:
csvpat = r'^tp(?P<task>00\d)\.(?P<ceil>\d+)ceil\.(?P<numformant>\d)formant\.csv$'
csvdf = dir2df(
    csvdir,
    dirpat=dirpat,
    fnpat=csvpat,
    addcols=['barename']
)
csvdf.head()

,relpath,fname,barename,task,ceil,numformant
0,JW60,tp001.5500ceil.5formant.csv,tp001.5500ceil.5formant,1,5500,5
1,JW60,tp002.5500ceil.5formant.csv,tp002.5500ceil.5formant,2,5500,5
2,JW60,tp003.5500ceil.5formant.csv,tp003.5500ceil.5formant,3,5500,5
3,JW60,tp004.5500ceil.5formant.csv,tp004.5500ceil.5formant,4,5500,5
4,JW60,tp005.5500ceil.5formant.csv,tp005.5500ceil.5formant,5,5500,5
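If the analysis parameters are needed as numbers (for example, to compare ceilings), the captured strings have to be converted. The sketch below uses plain pandas (`Series.str.extract`) on made-up filenames rather than `dir2df`; whether `dir2df` performs any conversion itself is not shown in this notebook.

```python
import pandas as pd

# Hypothetical filenames in the style of the cached measurement files
fnames = pd.Series([
    'tp001.5500ceil.5formant.csv',
    'tp002.5000ceil.5formant.csv',
])

csvpat = r'^tp(?P<task>00\d)\.(?P<ceil>\d+)ceil\.(?P<numformant>\d)formant\.csv$'

# Named groups become column names; the captured values are strings
params = fnames.str.extract(csvpat)

# Convert the analysis parameters to integers. `task` is left as a
# string so that its leading zeros are preserved.
params = params.astype({'ceil': int, 'numformant': int})
print(params)
```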


Merge the textgrid dataframe with the csv dataframe, so that subject (in `relpath`) and `task` match. The `how='inner'` parameter means that a textgrid without a corresponding `.csv` file will not be in the merge result, nor will a `.csv` file that doesn't have a corresponding textgrid.

In [12]:
readydf = tgdf.merge(
    csvdf,
    on=['relpath', 'task'],
    how='inner',
    suffixes=['_tg', '_csv']
)
readydf.head()

,relpath,fname_tg,barename_tg,task,fname_csv,barename_csv,ceil,numformant
0,JW60,tp001.TextGrid,tp001,1,tp001.5500ceil.5formant.csv,tp001.5500ceil.5formant,5500,5
1,JW60,tp002.TextGrid,tp002,2,tp002.5500ceil.5formant.csv,tp002.5500ceil.5formant,5500,5
2,JW60,tp003.TextGrid,tp003,3,tp003.5500ceil.5formant.csv,tp003.5500ceil.5formant,5500,5
3,JW60,tp004.TextGrid,tp004,4,tp004.5500ceil.5formant.csv,tp004.5500ceil.5formant,5500,5
4,JW60,tp005.TextGrid,tp005,5,tp005.5500ceil.5formant.csv,tp005.5500ceil.5formant,5500,5


For each textgrid, read and merge phone and word annotations and make a subset of the phones of interest. Also read formant measurements from `.csv` files and combine with phone/word metadata.

Concatenate all such dataframes into a master dataframe containing all phones of interest from every textgrid and its associated formant measurements.

In [13]:
# Iterate over all textgrids for which there is a matching formant .csv
dflist = []
for row in readydf.itertuples():
    # Load and combine phone and word tokens from textgrid.
    phdf, wddf = read_label(
        tgdir / row.relpath / row.fname_tg,
        ftype='praat',
        tiers=['phone', 'word']
    )
    phwddf = pd.merge_asof(
        phdf.rename({'t1': 't1_ph', 't2': 't2_ph'}, axis='columns'),
        wddf.drop('fname', axis='columns')
            .rename({'t1': 't1_wd', 't2': 't2_wd'}, axis='columns'),
        left_on='t1_ph',
        right_on='t1_wd'
    )

    # Make a subset of phones of interest
    intdf = phwddf[phwddf['phone'].isin(of_interest)]
    if len(intdf) == 0:
        print(f'No phones of interest in {row.fname_tg}')
        continue

    # Load formant measurements
    fdf = pd.read_csv(
        csvdir / row.relpath / row.fname_csv,
        usecols=['t1', 'f1', 'f2', 'bw1', 'bw2']
    ).rename({'t1': 't1_fmt'}, axis='columns')

    # Extract formant measurements associated with each phone token
    # of interest, as identified by `t1_ph`
    measdf = intdf.groupby('t1_ph').apply(
        lambda x: fdf[
            (fdf['t1_fmt'] >= x['t1_ph'].iloc[0]) &
            (fdf['t1_fmt'] <= x['t2_ph'].iloc[0])
        ]
    ).reset_index(level='t1_ph').reset_index(drop=True)

    # Merge formant measurements with phone/word token metadata
    intdf = intdf.merge(measdf, on='t1_ph')

    # Add metadata from filenames to combined phone/word/measurement tokens.
    intdf = intdf.assign(
        subject=row.relpath,
        task=row.task,
        ceil=row.ceil,
        numformant=row.numformant
    )

    dflist.append(intdf)
df = pd.concat(dflist)
df

,t1_ph,t2_ph,phone,fname,t1_wd,t2_wd,word,t1_fmt,f1,f2,bw1,bw2,subject,task,ceil,numformant
0,1.491271,1.572168,AH0,/Users/ronald/src/xray_microbeam_database/anno...,1.133978,1.724331,PROBLEM,1.496312,545.432020,1095.857657,83.795915,69.447219,JW60,001,5500,5
1,1.491271,1.572168,AH0,/Users/ronald/src/xray_microbeam_database/anno...,1.133978,1.724331,PROBLEM,1.502562,566.833958,1153.626435,158.799320,107.668172,JW60,001,5500,5
2,1.491271,1.572168,AH0,/Users/ronald/src/xray_microbeam_database/anno...,1.133978,1.724331,PROBLEM,1.508812,578.607474,1214.521658,211.767481,101.353490,JW60,001,5500,5
3,1.491271,1.572168,AH0,/Users/ronald/src/xray_microbeam_database/anno...,1.133978,1.724331,PROBLEM,1.515062,574.573435,1231.101758,209.987057,86.476241,JW60,001,5500,5
4,1.491271,1.572168,AH0,/Users/ronald/src/xray_microbeam_database/anno...,1.133978,1.724331,PROBLEM,1.521312,572.916680,1271.646685,251.478525,134.587741,JW60,001,5500,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14,1.232507,1.346688,AO1,/Users/ronald/src/xray_microbeam_database/anno...,1.232507,1.668300,ORDER,1.321312,476.506547,1002.216895,186.815441,204.240203,JW63,009,5000,5
15,1.232507,1.346688,AO1,/Users/ronald/src/xray_microbeam_database/anno...,1.232507,1.668300,ORDER,1.327562,482.209137,1022.324106,218.639543,137.232934,JW63,009,5000,5
16,1.232507,1.346688,AO1,/Users/ronald/src/xray_microbeam_database/anno...,1.232507,1.668300,ORDER,1.333812,499.246916,1050.953559,254.895630,136.721045,JW63,009,5000,5
17,1.232507,1.346688,AO1,/Users/ronald/src/xray_microbeam_database/anno...,1.232507,1.668300,ORDER,1.340062,515.317129,1108.855434,261.191323,223.110803,JW63,009,5000,5
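The two time-alignment steps in the loop can be tried on toy data without any textgrids. In this sketch all times, labels, and frame values are made up. It pairs phones with words via `pd.merge_asof`, then selects the frames inside each phone of interest with an explicit loop and `Series.between`, which is a simpler stand-in for the `groupby().apply()` used above, not the same code.

```python
import pandas as pd

# Toy phone and word tiers (hypothetical times, in seconds)
phones = pd.DataFrame({
    't1_ph': [0.10, 0.20, 0.50, 0.60],
    't2_ph': [0.20, 0.30, 0.60, 0.70],
    'phone': ['P', 'AO1', 'AH0', 'T'],
})
words = pd.DataFrame({
    't1_wd': [0.10, 0.50],
    't2_wd': [0.30, 0.70],
    'word': ['PAW', 'UT'],
})

# Each phone is paired with the latest word that starts at or before
# the phone's start time. Both frames must be sorted on their keys.
phwd = pd.merge_asof(phones, words, left_on='t1_ph', right_on='t1_wd')

# Toy formant frames (times chosen to avoid the phone boundaries)
frames = pd.DataFrame({
    't1_fmt': [0.15, 0.22, 0.25, 0.28, 0.35, 0.52, 0.55, 0.58, 0.65],
})

tokens = phwd[phwd['phone'].isin(['AH0', 'AO1'])]

# Collect the frames that fall inside each token, inclusive at both
# ends, and tag them with the token's start time
pieces = []
for tok in tokens.itertuples(index=False):
    inside = frames[frames['t1_fmt'].between(tok.t1_ph, tok.t2_ph)]
    pieces.append(inside.assign(t1_ph=tok.t1_ph))
meas = pd.concat(pieces, ignore_index=True)

# One row per (token, frame), with phone and word metadata attached
result = tokens.merge(meas, on='t1_ph')
print(result[['phone', 'word', 't1_fmt']])
```

`merge_asof` compares start times only, so a phone label that begins after the preceding word has ended would still be paired with that word. Comparing `t1_ph` with `t2_wd` in the merged result is one way to detect such cases.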


The master dataframe contains observations from multiple subjects, phones, and recordings (identified by the `subject`, `phone`, and `fname` columns). Group on these columns to compute summary statistics of the formant measurements.

In [16]:
df[['subject', 'phone', 'f1', 'f2', 'bw1', 'bw2']].groupby(['subject', 'phone']).describe()

,,f1,f1,f1,f1,f1,f1,f1,f1,f2,f2,...,bw1,bw1,bw2,bw2,bw2,bw2,bw2,bw2,bw2,bw2
subject,phone,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
JW60,AH0,182.0,484.310298,59.638841,324.392773,452.268214,495.378997,522.695592,595.787319,182.0,1461.099171,...,195.831873,354.388095,182.0,171.901077,92.196981,35.154771,109.671967,155.566927,214.610737,641.944
JW60,AO0,10.0,648.93273,43.61391,535.500318,648.758744,656.28599,669.520749,686.979077,10.0,1142.111996,...,187.503471,414.325035,10.0,121.434173,66.936456,51.537166,83.749851,109.579288,134.525091,293.058168
JW60,AO1,198.0,524.018439,99.527867,83.281851,460.011587,516.147544,590.189529,988.528853,198.0,1225.485176,...,168.308829,8210.842752,198.0,393.319122,782.156719,24.472302,100.479448,160.033085,301.717828,4990.158529
JW60,AO2,27.0,509.354165,95.051367,285.0848,450.094378,503.237799,601.002773,691.858347,27.0,1083.591018,...,129.557424,431.10106,27.0,189.242962,151.63221,21.851148,86.765291,173.52099,228.121861,667.401349
JW61,AH0,117.0,410.712939,58.581055,251.601999,369.142753,409.769971,460.455508,511.746421,117.0,1305.234911,...,235.585745,512.760933,117.0,178.734335,88.651359,53.901334,123.192249,171.090833,219.751994,773.637009
JW61,AO0,14.0,582.670966,49.266404,473.741497,580.239602,599.654421,618.068465,625.71261,14.0,1045.888247,...,146.688695,293.933055,14.0,119.066911,43.301616,71.877489,90.18196,105.935083,141.263387,229.073184
JW61,AO1,197.0,506.669411,96.893347,228.260303,462.155675,490.832453,522.302457,994.771576,197.0,963.26708,...,161.447401,1068.618666,197.0,145.393393,186.828459,45.776764,90.079734,110.135597,139.866875,1873.209023
JW61,AO2,37.0,585.146977,177.622292,332.822626,495.888738,510.446801,640.757412,1146.067224,37.0,1104.163408,...,217.150376,541.64972,37.0,176.407567,112.710372,79.204989,107.662124,147.188226,188.380139,611.613198
JW62,AH0,229.0,561.745819,107.513663,139.913806,503.465385,569.327997,630.045717,801.068274,229.0,1733.401198,...,210.268998,809.531769,229.0,225.447141,111.511451,31.107732,144.065809,205.88069,286.429639,647.8313
JW62,AO0,15.0,723.292273,109.091558,545.618521,635.935782,745.234526,804.298322,892.271901,15.0,1218.374474,...,152.699642,282.247558,15.0,225.765328,69.831661,109.085526,177.243586,217.296642,266.778526,364.394474
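The statistics above are computed over individual analysis frames, so longer tokens contribute more rows, and isolated extreme values (such as the very large `bw1` maximum for JW60 `AO1`) affect the means. One possible refinement, sketched here on made-up rows shaped like the master dataframe, is to reduce each token to its median first and then summarize across tokens. Grouping on `subject`, `fname`, and `t1_ph` to identify a token is an assumption about how tokens are uniquely keyed.

```python
import pandas as pd

# Toy rows shaped like the master dataframe (hypothetical values)
toy = pd.DataFrame({
    'subject': ['JW60'] * 5 + ['JW61'] * 3,
    'fname': ['a'] * 3 + ['b'] * 2 + ['c'] * 3,
    't1_ph': [1.0] * 3 + [2.0] * 2 + [1.0] * 3,
    'phone': ['AO1'] * 8,
    'f1': [500., 520., 900., 480., 500., 450., 470., 460.],
    'f2': [1000., 1020., 1500., 980., 990., 950., 960., 940.],
})

# Step 1: one row per token, holding the median of its frames
token_medians = toy.groupby(
    ['subject', 'fname', 't1_ph', 'phone'], as_index=False
)[['f1', 'f2']].median()

# Step 2: average the token medians within each subject and phone
by_subject = token_medians.groupby(['subject', 'phone'])[['f1', 'f2']].mean()
print(by_subject)
```

In the toy data the outlying frame (`f1` of 900) does not pull the first token's value upward, because the median of its three frames is 520.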
