In [None]:
import os, sys
import pandas as pd
import subprocess
from phonlab.utils import dir2df

# Phonlab tasks

This notebook contains short articles on common data analysis tasks you might need for your work in the Phonlab.

1. [A sample post-processing workflow](#sample_workflow)
    1. [Mirror a directory structure](#mirror_directory)
    1. [Find rows in a source DataFrame that do not have a match in a second DataFrame](#find_non_matched_rows)
    1. [Perform a task for every row in a DataFrame](#task_per_df_row)
    1. [Summary of the post-processing workflow](#sample_workflow_summary)
1. [Collecting results for analysis](#collecting_results)


## <a name="sample_workflow"></a>A sample post-processing workflow

A typical step in a data analysis pipeline is post-processing a dataset. The dataset may contain a number of audio recordings, for instance, and you wish to perform formant analysis on each of them, then collect the results into a DataFrame for statistical analysis. This section covers the first part: post-processing the files to produce the formant measurements.

Sometimes you have a complete dataset already in hand and need to do post-processing only once. Other times you may have a dataset that grows incrementally, perhaps as new subjects are included. The steps you will see are modular and can be repeated (or skipped) so that incremental additions can be updated with reasonable efficiency. For very large datasets a more sophisticated workflow might be necessary.

In our example we will perform formant analysis on a set of .wav files found in the dataset. The directory that contains the .wav files is the source directory `srcdir`. We will write the formant measurements to a separate directory with the same internal structure as `srcdir`, and we will refer to this directory as the `cachedir` because it holds cached results obtained from the source data files.

The steps in the workflow are:

1. [Load filenames in `srcdir` and add analysis parameters](#load_filenames_and_parameters)
1. [Find filenames in `srcdir` that require post-processing](#find_non_matched_rows)
1. [Mirror the directory structure](#mirror_directory) (copy the internal structure of `srcdir` to `cachedir` in preparation for writing formant measurements)
1. [Perform a task for every row in a DataFrame](#task_per_df_row) (do formant analysis)

These steps are summarized in the tl;dr section [Summary of the post-processing workflow](#sample_workflow_summary).

### <a name="load_filenames_and_parameters"></a>Load filenames in `srcdir` and add analysis parameters

In this section we load the filenames of the .wav files in `srcdir` and add analysis parameters to be used by `ifcformant`. First we define the locations of the source and cache directories.

In [None]:
srcdir = '../resource/postproc/orig_data'
cachedir = '../resource/postproc/cache'

***WARNING:*** `cachedir` must not be contained anywhere under `srcdir`! The mirroring technique described in this article is very simple to implement but may produce unexpected results if `cachedir` is part of `srcdir`. It is okay if `cachedir` is a sibling of `srcdir`, as they are defined above, but avoid this:

```python
# This is not okay!
srcdir = '../resource/postproc/orig_data'
cachedir = '../resource/postproc/orig_data/cache' # cachedir inside srcdir!
```
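
If you want to guard against this mistake in code, a check along the following lines is one option. This is only a sketch: `is_inside()` is a hypothetical helper name (it is not part of `phonlab`), and it compares absolute paths without resolving symbolic links.

```python
import os

def is_inside(child, parent):
    '''Return True if `child` is `parent` itself or lies anywhere under it.

    Hypothetical helper for illustration; symbolic links are not resolved.
    '''
    child = os.path.abspath(child)
    parent = os.path.abspath(parent)
    return os.path.commonpath([child, parent]) == parent

# The sibling layout used in this notebook passes the check.
sibling_ok = not is_inside(
    '../resource/postproc/cache',
    '../resource/postproc/orig_data'
)

# The nested layout from the warning fails it.
nested_bad = is_inside(
    '../resource/postproc/orig_data/cache',
    '../resource/postproc/orig_data'
)

print(sibling_ok, nested_bad)
```

You could call a check like this once, right after defining `srcdir` and `cachedir`, and stop if it reports that the cache is nested inside the source.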

#### Load filenames from `srcdir`

The [`dir2df()` function](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb) makes it easy to load the set of filenames of the .wav files in `srcdir`. We use a [named capture](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb#adding_variables_named_capture) to also extract the subject identifier and [`addcols`](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb#adding_variables_addcols) to add the file's barename. The barename will make it easy to match .ifc files, as you will see shortly.

In [None]:
srcdf = dir2df(
    srcdir,
    dirpat=r'^(?P<subject>subj\d+)',  # raw string avoids invalid-escape warnings
    fnpat=r'\.wav$',
    addcols=['barename']
)
srcdf

#### Add analysis parameters

The `ifcformant` command requires two parameters, 1) the name of the input .wav file; and 2) the speaker type ('female', 'male', or 'child'). The first parameter is already available in `srcdf`, and we need to add the second.

If you code speaker type into your filenames or directory structure, you can [extract speaker type](Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb#adding_variables) when you call `dir2df()` and skip the rest of this section.

If you do not encode speaker type in your filename, you may load speaker type from an external file using one of [Pandas' Input/Output functions](https://pandas.pydata.org/pandas-docs/stable/api.html#input-output), most likely either [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) or [`read_excel()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html#pandas.read_excel). We'll use `read_csv()` to load metadata from a metadata file in `srcdir`. The 'speaker_metadata.csv' file contains two comma-separated columns labelled 'subject' and 'sex'.

In [None]:
md = pd.read_csv(os.path.join(srcdir, 'speaker_metadata.csv'))
md

A left merge adds speaker type to `srcdf`. In order for this merge to work properly, the values of one of the columns in `srcdf` must match the values of one of the columns in `md`. Here the columns named 'subject' match and we merge `on` those columns. (If the column names don't match you can use `left_on` and `right_on` instead of `on`.)

In [None]:
srcdf = srcdf.merge(md, on='subject', how='left')
srcdf

The left merge ensures that all of the rows in `srcdf` are in the merge output. If you find a NaN value in the output, it means that `md` is missing a subject and you should update your metadata document.
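
The check for missing metadata can be done in code as well as by eye. The sketch below uses made-up stand-ins for `srcdf` and the metadata file (the names `files`, `meta`, and `missing_subjects` are illustrative only). It also passes pandas' `validate='many_to_one'` option to `merge()`, which is an optional extra: it raises a `MergeError` if a subject is listed more than once in the metadata.

```python
import io
import pandas as pd

# Made-up stand-ins for srcdf and the metadata file.
files = pd.DataFrame({
    'subject': ['subj1', 'subj1', 'subj2', 'subj3'],
    'fname': ['acq1.wav', 'acq2.wav', 'acq1.wav', 'acq1.wav'],
})
csv_text = 'subject,sex\nsubj1,female\nsubj2,male\n'
meta = pd.read_csv(io.StringIO(csv_text))

# Left merge keeps every row of `files`; validate guards against
# duplicate subjects in the metadata.
merged = files.merge(meta, on='subject', how='left', validate='many_to_one')

# Subjects with no metadata row have NaN in the 'sex' column.
missing_subjects = merged.loc[merged['sex'].isna(), 'subject'].unique().tolist()
print(missing_subjects)
```

An empty list means every subject in the file listing has a metadata entry.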

### <a name="find_non_matched_rows"></a>Find filenames in `srcdir` that require post-processing

The next step in the workflow is to find existing `ifcformant` output files in `cachedir` and use that result to find the rows in `srcdf` that do ***not*** have a corresponding output file. These are the files that require post-processing.

In general, if you have a source DataFrame and want to check whether a second DataFrame has a corresponding row, you can use a left merge with the source on the lefthand side. A left merge returns all of the merge keys from the lefthand DataFrame regardless of whether a matching key is found on the right. When there is no match, the columns from the right DataFrame are filled with NaN.
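
As an alternative to checking for NaN, `merge()` accepts an `indicator=True` option that adds a `_merge` column recording where each row's key was found. The following toy example (all names and values are invented for illustration) shows the general pattern with a single key column:

```python
import pandas as pd

# Toy source and second DataFrame; 'b' has no match on the right.
source = pd.DataFrame({'item': ['a', 'b', 'c']})
other = pd.DataFrame({'item': ['a', 'c'], 'score': [1.0, 3.0]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'.
merged = source.merge(other, on='item', how='left', indicator=True)

# Rows whose key exists only in the left DataFrame are the non-matches.
unmatched = merged[merged['_merge'] == 'left_only']
unmatched_items = unmatched['item'].tolist()
print(unmatched_items)
```

The indicator column is convenient when the right DataFrame may legitimately contain NaN values of its own, since it does not depend on any particular column being empty.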

For this example the left DataFrame will be `srcdf`. The right DataFrame will be created from `cachedir`. We load the .ifc files in `cachedir` into `cachedf`. As you can see, the formant measurements for the first two acquisitions for subj1 have already been created and cached in .ifc files that correspond to the .wav files in the relative path 'subj1/trial1'.

In [None]:
cachedf = dir2df(cachedir, fnpat=r'\.ifc$', addcols=['barename'])
cachedf

Our goal is to find each '.wav' file in `srcdf` that does not have a corresponding '.ifc' file in `cachedf`, and we do this in part by matching the barename values: each 'acq1.wav' should match an 'acq1.ifc'. It is not enough to match the barename values in the two DataFrames, however, because the barenames are not unique in the global dataset:

In [None]:
srcdf[srcdf.barename == 'acq1']   # barename 'acq1' appears four times

To fully identify each .wav file in `srcdf` it is necessary to include 'relpath' as well as 'barename'. The combination of these two columns uniquely identifies each row in `srcdf`.
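
Whether a set of columns uniquely identifies rows can be confirmed with `DataFrame.duplicated()`. This is a small sketch on an invented DataFrame that imitates the layout of `srcdf`; the variable names are illustrative only.

```python
import pandas as pd

# Invented rows that imitate the layout of srcdf.
df = pd.DataFrame({
    'relpath': ['subj1/trial1', 'subj1/trial1', 'subj1/trial2', 'subj2/trial1'],
    'barename': ['acq1', 'acq2', 'acq1', 'acq1'],
})

# duplicated() flags rows that repeat an earlier row in the given columns.
barename_is_unique = not df.duplicated(subset=['barename']).any()
pair_is_unique = not df.duplicated(subset=['relpath', 'barename']).any()

print(barename_is_unique, pair_is_unique)
```

Running the same check on your own `srcdf` before merging is a cheap way to catch an unexpected duplicate filename.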

Another left merge, with `srcdf` as the left DataFrame and `cachedf` as the right, matches .wav input files with cached .ifc files, using `relpath` and `barename` together as a composite merge key. Since the `fname` column is in both `srcdf` and `cachedf`, we provide suffixes to append to those column names in the merged DataFrame.

In [None]:
mrgdf = srcdf.merge(
    cachedf,
    on=['relpath', 'barename'],
    how='left',
    suffixes=['_wav', '_ifc']
)
mrgdf

Select the rows of `mrgdf` that have a value of NaN in the `fname_ifc` column. These represent the .wav files that do not have cached .ifc measurements.

In [None]:
noifcdf = mrgdf[mrgdf.fname_ifc.isnull()]
noifcdf

### <a name="mirror_directory"></a>Mirror the directory structure

We have already seen that the directory structure of `srcdir` organizes files into trial subdirectories nested inside subject directories. We must create the same directory structure in `cachedir` before running `ifcformant`. The `ifcformant` command will fail if a writeable output directory does not exist for the .ifc file.

The directory structure is provided by the `relpath` column, and the unique values of it in the `noifcdf` provide all of the directory names that must exist prior to running `ifcformant`.

In [None]:
unique_relpath = noifcdf.relpath.unique()
unique_relpath

The simple way to copy the structure is to loop over the unique set of relative paths and create them in `cachedir` using [os.makedirs()](https://docs.python.org/3/library/os.html#os.makedirs).

The following cell will copy the required directory structure to `cachedir` and print a success message if it succeeds. If there are any problems in creating a directory an error will be raised instead.

In [None]:
for destdir in unique_relpath:
    os.makedirs(
        os.path.join(cachedir, destdir),  # e.g. ../resource/postproc/cache/subj1/trial1
        exist_ok=True
    )
sys.stderr.write('Directory mirroring succeeded.')

The `os.makedirs()` function automatically creates parent directories where necessary. For instance, the first relative path in the example above is `subj1/trial2`, and the first call to `os.makedirs()` is a request to create the directory `../resource/postproc/cache/subj1/trial2`. If `../resource/postproc/cache/subj1` does not exist already, then that directory will be created first.

The `exist_ok=True` means that `os.makedirs()` will not raise an error if the target directory already exists. This behavior is convenient for mirroring a `srcdir` incrementally. If you add `subj3/trial1` and `subj3/trial2` directories after running the above cell, then you can simply append them to `unique_relpath` and rerun the loop without raising an error for the existing directories under `subj1` and `subj2`.
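
The following self-contained sketch demonstrates both behaviors in a temporary directory, so it does not touch your real `cachedir`. The helper names `mirror_dirs()` and `list_dirs()` and the relative paths are made up for the illustration.

```python
import os
import tempfile

def mirror_dirs(base, relpaths):
    '''Create base/relpath for each relative path, leaving existing ones alone.'''
    for relpath in relpaths:
        os.makedirs(os.path.join(base, relpath), exist_ok=True)

def list_dirs(base):
    '''Return a sorted list of all directories under base, relative to base.'''
    found = []
    for root, dirs, _files in os.walk(base):
        for d in dirs:
            rel = os.path.relpath(os.path.join(root, d), base)
            found.append(rel.replace(os.sep, '/'))
    return sorted(found)

with tempfile.TemporaryDirectory() as tmp_cache:
    # First pass: parent directory 'subj1' is created automatically.
    mirror_dirs(tmp_cache, ['subj1/trial1', 'subj1/trial2'])
    # Second pass: existing directories are skipped, the new one is added.
    mirror_dirs(tmp_cache, ['subj1/trial1', 'subj1/trial2', 'subj3/trial1'])
    created = list_dirs(tmp_cache)

print(created)
```

The second call succeeds even though two of its three targets already exist, which is what makes rerunning the mirroring step safe.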

### <a name="task_per_df_row"></a>Perform a task for every row in a DataFrame

The `noifcdf` DataFrame contains the names of .wav files that require formant analysis and the speaker analysis parameters. In this section we'll construct a function for doing formant analysis that uses the rows of `noifcdf` as its inputs. As a reminder, `noifcdf` contains:

In [None]:
noifcdf

#### Iterate over rows with `itertuples()`

The [`itertuples()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples) iterates over the rows of a DataFrame and returns each row as a [namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple).

***Aside:*** You might come across a similarly named method, `iterrows()`, and it is recommended that you avoid it. The `iterrows()` method is less convenient than `itertuples()` because 1) it returns each row as a Series, which does not preserve the dtypes of the individual columns; and 2) it is slower to execute than `itertuples()`.
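
One concrete difference between the two methods is how they treat column types. The sketch below uses an invented two-column DataFrame (the column names are illustrative) with one integer and one float column, and inspects the first row produced by each method.

```python
import numpy as np
import pandas as pd

# Invented data: one integer column and one float column.
df = pd.DataFrame({'n_tokens': [3, 5], 'duration': [0.25, 0.5]})

# iterrows() yields (index, Series); the Series has a single dtype,
# so the integer value is upcast to float.
_, first_series = next(df.iterrows())
series_dtype = str(first_series.dtype)

# itertuples() yields a namedtuple; each field keeps its column's type.
first_tuple = next(df.itertuples())
tuple_keeps_int = isinstance(first_tuple.n_tokens, (int, np.integer))

print(series_dtype, tuple_keeps_int)
```

This matters whenever a row mixes integers with floats, since an upcast value may no longer behave as you expect (for example when used as a list index).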

Here is a simple example that prints the values that `itertuples()` returns:

In [None]:
for row in noifcdf.itertuples():
    print(row)

Notice that the value of the row index is added as the first value and is named 'Index'. The DataFrame column labels are the other attributes and are easily accessed by name with attribute '.' notation.

In [None]:
for row in noifcdf.itertuples():
    print(row.fname_wav)

It makes sense to put your task in a named function when the task you wish to perform is more complicated than a simple print statement. Doing so helps make your code easier to debug, and you can re-use the function in multiple places.

The `my_print` function prints the 'subject' attribute of a row, then uses `os.path.join` to construct and print two filepaths: the expected .ifc path (from 'relpath' and 'barename') and the .wav path (from 'relpath' and 'fname_wav'). The `for` loop calls `my_print` on each row in turn.

In [None]:
# Function definition.
def my_print(rowtuple):
    '''Print values from a DataFrame row provided as a namedtuple.'''
    print(rowtuple.subject)
    print(os.path.join(rowtuple.relpath, rowtuple.barename + '.ifc'))
    print(os.path.join(rowtuple.relpath, rowtuple.fname_wav))
    print('******************')

# Loop over rows and call the `my_print` function.
for row in noifcdf.itertuples():
    my_print(row)

#### The `ifcformant` function

Now that we know the mechanics of calling a named function for every row in the DataFrame, let's construct a function that runs the `ifcformant` command using the parameters provided by a namedtuple.

If we were working at the command line, a representative example of calling `ifcformant` is:

```
ifcformant --speaker female --print-header --output myfile.ifc myfile.wav
```

The arguments to `ifcformant` include the speaker type, the name of the output file that will contain the formant measurements, and the input '.wav' file. The `--print-header` argument is used to print column labels as the first row of the output file.
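
Before wrapping the command in a function, it can help to see how a command line like this maps onto a Python list of arguments, and how `subprocess.check_call()` reports failure. In this sketch `build_command()` is a hypothetical helper and the file paths are invented. Because `ifcformant` may not be installed where you run this, the failing command is a stand-in that simply asks the Python interpreter to exit with status 3. `shlex.join()` requires Python 3.8 or later.

```python
import os
import shlex
import subprocess
import sys

def build_command(speaker, wav_path, ifc_path):
    '''Return the ifcformant argument list for one input file (not executed).'''
    return [
        'ifcformant',
        '--speaker', speaker,
        '--print-header',
        '--output', ifc_path,
        wav_path
    ]

cmd = build_command(
    'female',
    os.path.join('orig_data', 'subj1', 'trial1', 'acq1.wav'),
    os.path.join('cache', 'subj1', 'trial1', 'acq1.ifc')
)
# Preview the command as it would be typed in a shell.
print(shlex.join(cmd))

# Stand-in for a failing command: Python exits with status 3.
failing = [sys.executable, '-c', 'import sys; sys.exit(3)']
try:
    subprocess.check_call(failing)
    returncode = 0
except subprocess.CalledProcessError as e:
    # check_call() raises when the exit status is non-zero.
    returncode = e.returncode

print(returncode)
```

Passing the arguments as a list, rather than as one string, means filenames that contain spaces need no special quoting.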

The `do_ifcformant` function shown below constructs a list of arguments from an input namedtuple and output directory, then uses the [`subprocess` module](https://docs.python.org/3/library/subprocess.html) to execute `ifcformant`.


In [None]:
def do_ifcformant(rt, srcdir, outdir, errors='raise'):
    '''Perform formant analysis with the ifcformant command.
    
    Parameters
    ----------
    
    rt : namedtuple that contains formant analysis parameters
         in fields:
         'relpath' (relative path to audio file),
         'fname_wav' (name of .wav file),
         'barename' (name of .wav file without extension),
         'sex' (ifcformant speaker type, one of 'female',
             'male', 'child')
             
    srcdir : str
        Base pathname to input .wav file. The path to the input file
        will be: srcdir/rt.relpath/rt.fname_wav.
             
    outdir : str
        Base pathname to ifcformant output. The output file will
        be written to: outdir/rt.relpath/rt.barename + '.ifc'.
             
    errors : str (default 'raise')
        How to handle errors if `check_call()` fails. If
        'ignore', print debug statement to STDERR and return the
        ifcformant return code; if 'raise' immediately reraise
        the CalledProcessError.
        
    Returns
    -------
    
    The `ifcformant` return code is returned by this function,
    0 for success or non-zero for errors.
    '''
    ifcargs = [
        'ifcformant',
        '--speaker', rt.sex,
        '--print-header',
        '--output', os.path.join(
            outdir, rt.relpath, rt.barename + '.ifc'
        ),
        os.path.join(srcdir, rt.relpath, rt.fname_wav)
    ]
    try:
        subprocess.check_call(ifcargs)
    except subprocess.CalledProcessError as e:
        if errors == 'ignore':
            msg = 'Caught error while invoking ifcformant:\n{:}'.format(e)
            sys.stderr.write(msg)
            return e.returncode
        else:
            raise
    return 0

It's always best to include a docstring at the top of your named functions. A docstring documents your workflow and makes it easier to re-use the function in another project. Execute the following cell to see the documentation for the new function.

In [None]:
do_ifcformant?

Once your function is created and debugged, use `itertuples()` to run the function on every row of your DataFrame.

In [None]:
for row in noifcdf.itertuples():
    do_ifcformant(row, srcdir=srcdir, outdir=cachedir)    

Use `dir2df()` to list the .ifc files in `cachedir`. All but two should show recent mtime values.

In [None]:
dir2df(cachedir, fnpat=r'\.ifc$', addcols=['mtime'])

### <a name="sample_workflow_summary"></a>Summary of the post-processing workflow

This section contains a summary of the post-processing workflow with minimal explanation. Each step is in a separate cell to make it easy to execute each separately, in modular fashion.

In [None]:
# Define source and cache directories.
srcdir = '../resource/postproc/orig_data'
cachedir = '../resource/postproc/cache'

In [None]:
# Load .wav filenames from srcdir.
srcdf = dir2df(
    srcdir,
    dirpat=r'^(?P<subject>subj\d+)',
    fnpat=r'\.wav$',
    addcols=['barename']
)

In [None]:
# Load speaker metadata and merge with srcdf.
md = pd.read_csv(os.path.join(srcdir, 'speaker_metadata.csv'))
srcdf = srcdf.merge(md, on='subject', how='left')

In [None]:
# Load cached .ifc filenames and merge with srcdf.
cachedf = dir2df(cachedir, fnpat=r'\.ifc$', addcols=['barename'])
mrgdf = srcdf.merge(
    cachedf,
    on=['relpath', 'barename'],
    how='left',
    suffixes=['_wav', '_ifc']
)

In [None]:
# Select .wav files that do not have a corresponding cached .ifc file.
noifcdf = mrgdf[mrgdf.fname_ifc.isna()]

In [None]:
# Mirror srcdir directory structure in cachedir.
unique_relpath = noifcdf.relpath.unique()
for destdir in unique_relpath:
    os.makedirs(
        os.path.join(cachedir, destdir),  # e.g. ../resource/postproc/cache/subj1/trial1
        exist_ok=True
    )
sys.stderr.write('Directory mirroring succeeded.')

In [None]:
# Run ifcformant on .wav files and output to cachedir.
# NOTE: do_ifcformant() function must already be defined.
for row in noifcdf.itertuples():
    do_ifcformant(row, srcdir=srcdir, outdir=cachedir)