In [1]:
import os, sys
import pandas as pd
import subprocess
from phonlab.utils import dir2df

# Phonlab tasks

This notebook contains short articles on common data analysis tasks you might need for your work in the Phonlab.

1. [Mirror a directory structure](#mirror_directory)
1. [Perform a task for every row in a DataFrame](#task_per_df_row)
1. [Find rows in a source DataFrame that do not have a match in a second DataFrame](#find_non_matched_rows)


## <a name="mirror_directory"></a>Mirror a directory structure

It can be very useful to copy the directory structure of one directory to another. For example, take a set of files in a source data directory `srcdir`, which we enumerate using [`dir2df()`](https://github.com/rsprouse/phonlab/blob/master/doc/Retrieving%20filenames%20in%20a%20directory%20tree%20with%20%60dir2df%28%29%60.ipynb):

In [2]:
srcdir = '../resource/mirrorexp/orig_data'
srcdf = dir2df(srcdir)
srcdf

Unnamed: 0,relpath,fname
0,subj1/trial1,acq1.tg
1,subj1/trial1,acq1.wav
2,subj1/trial1,acq2.tg
3,subj1/trial1,acq2.wav
4,subj1/trial2,acq1.tg
5,subj1/trial2,acq1.wav
6,subj1/trial2,acq2.tg
7,subj1/trial2,acq2.wav
8,subj2/trial1,acq1.tg
9,subj2/trial1,acq1.wav


The `srcdir` directory contains per-trial subdirectories nested in per-subject subdirectories of the top-level directory. The unique set of subdirectory combinations is:

In [3]:
unique_relpath = srcdf.relpath.unique()
unique_relpath

[subj1/trial1, subj1/trial2, subj2/trial1, subj2/trial2]
Categories (4, object): [subj1/trial1, subj1/trial2, subj2/trial1, subj2/trial2]

As part of your analysis workflow you wish to downsample all of the '.wav' files under `srcdir` using the `sox` command line utility and write the downsampled files to a cache directory `cachedir` that is separate from `srcdir`. To keep `cachedir` organized you will use the same directory structure as `srcdir`. Since `sox` will not create this directory structure for you, you must create the directory structure in `cachedir` first, then run `sox`.



In [4]:
cachedir = '../resource/mirrorexp/ds_files'

***WARNING:*** `cachedir` must not be contained anywhere under `srcdir`! The mirroring technique described in this article is very simple to implement but may produce unexpected results if `cachedir` is part of `srcdir`. It is okay if `cachedir` is a sibling of `srcdir`, e.g.:

```
# Acceptable organization of srcdir and cachedir in 'mirrorexp'
../resource/mirrorexp/orig_data    # srcdir
../resource/mirrorexp/ds_files     # cachedir

# This is not okay!
../resource/mirrorexp/orig_data/ds_files  # cachedir contained in srcdir
```
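The containment condition in the warning can be checked before mirroring. The following is a minimal sketch, assuming both paths are on the same filesystem root; the helper name `is_inside` is hypothetical and not part of `phonlab`:

```python
import os

def is_inside(child, parent):
    '''Return True if `child` is the same as, or located under, `parent`.'''
    child = os.path.abspath(child)
    parent = os.path.abspath(parent)
    # commonpath() returns the longest shared leading path of its inputs.
    return os.path.commonpath([child, parent]) == parent

srcdir = '../resource/mirrorexp/orig_data'
good_cachedir = '../resource/mirrorexp/ds_files'
bad_cachedir = '../resource/mirrorexp/orig_data/ds_files'

print(is_inside(good_cachedir, srcdir))  # sibling directory
print(is_inside(bad_cachedir, srcdir))   # nested under srcdir
```

A check like this could be placed before the mirroring loop to stop early when the cache directory is nested under the source directory.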

The simple way to copy the structure is to loop over the unique set of relative paths that were found in `srcdir` and create them in `cachedir` using [os.makedirs()](https://docs.python.org/3/library/os.html#os.makedirs).

The following cell will copy the `srcdir` directory structure to `cachedir` and print a success message if it succeeds. A problem with creating a directory in `cachedir` will raise an error instead.

In [6]:
for destdir in unique_relpath:
    os.makedirs(
        os.path.join(cachedir, destdir),  # e.g. ../resource/mirrorexp/ds_files/subj1/trial1
        exist_ok=True
    )
sys.stderr.write('Directory mirroring succeeded.')

Directory mirroring succeeded.

The `os.makedirs()` function automatically creates parent directories where necessary. For instance, the first relative path in the example above is `subj1/trial1`, and the first call to `os.makedirs()` is a request to create the directory `../resource/mirrorexp/ds_files/subj1/trial1`. If `../resource/mirrorexp/ds_files/subj1` does not exist already, then that directory will be created first.

The `exist_ok=True` argument means that `os.makedirs()` will not raise an error if the target directory already exists. This behavior is convenient for mirroring a `srcdir` incrementally. If you add `subj3/trial1` and `subj3/trial2` directories after running the above cell, you can simply add them to `unique_relpath` and re-run the loop without raising an error for the existing directories under `subj1` and `subj2`.
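The incremental behavior can be sketched with a small helper. The name `mirror_dirs` and the use of a temporary directory as the destination are illustrative assumptions, not part of the notebook:

```python
import os
import tempfile

def mirror_dirs(relpaths, destroot):
    '''Create each relative path under destroot; existing directories are left alone.'''
    created = []
    for relpath in relpaths:
        target = os.path.join(destroot, relpath)
        os.makedirs(target, exist_ok=True)  # no error if target already exists
        created.append(target)
    return created

with tempfile.TemporaryDirectory() as tmp:
    # First pass: two trials for subj1.
    mirror_dirs(['subj1/trial1', 'subj1/trial2'], tmp)
    # Second pass: same paths plus a new subject; the re-run does not raise.
    mirror_dirs(['subj1/trial1', 'subj1/trial2', 'subj3/trial1'], tmp)
    print(sorted(os.listdir(tmp)))
```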

Now that the output directories are created in `cachedir` you can run `sox` (or whatever process you want) and use locations in `cachedir` as the places to write the output files.
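As a hedged sketch of that last step, the function below builds one `sox` argument list per '.wav' row, writing each output to the mirrored location in `cachedir`. The function name `sox_commands`, the small demonstration DataFrame, and the 16000 Hz target rate are assumptions for illustration. The commands are only printed here; passing each list to `subprocess.check_call()` would execute it.

```python
import os
import pandas as pd

def sox_commands(df, srcdir, cachedir, rate=16000):
    '''Return one sox argument list per '.wav' row of df.

    df must have 'relpath' and 'fname' columns, as produced by dir2df().
    '''
    wavdf = df[df.fname.str.endswith('.wav')]
    cmds = []
    for row in wavdf.itertuples():
        cmds.append([
            'sox',
            os.path.join(srcdir, row.relpath, row.fname),    # input file
            '-r', str(rate),                                 # output sample rate
            os.path.join(cachedir, row.relpath, row.fname),  # output file
        ])
    return cmds

# Hypothetical stand-in for the DataFrame returned by dir2df(srcdir).
demo_df = pd.DataFrame({
    'relpath': ['subj1/trial1', 'subj1/trial1', 'subj1/trial2'],
    'fname': ['acq1.tg', 'acq1.wav', 'acq1.wav'],
})

for cmd in sox_commands(demo_df, 'orig_data', 'ds_files'):
    print(' '.join(cmd))
```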

## <a name="task_per_df_row"></a>Perform a task for every row in a DataFrame

Sometimes it is useful to iterate over the rows of a DataFrame and perform a task that uses the row values. This is a common need in post-processing, for example, if the DataFrame contains names of data files and you want to run a command line utility to do some analysis on each.

In this section we'll start with a DataFrame that contains names of '.wav' files, and we'll construct a function for doing formant analysis.

In [7]:
df = pd.DataFrame.from_records([
    ('dir1', 'file1.wav', 'file1', '.wav', 'female'),
    ('dir1', 'file2.wav', 'file2', '.wav', 'female'),
    ('dir2', 'file3.wav', 'file3', '.wav', 'male')
], columns=['relpath', 'fname', 'barename', 'ext', 'speaker'])
df

Unnamed: 0,relpath,fname,barename,ext,speaker
0,dir1,file1.wav,file1,.wav,female
1,dir1,file2.wav,file2,.wav,female
2,dir2,file3.wav,file3,.wav,male


The DataFrame [`itertuples()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples) iterates over the rows of a DataFrame and returns each row as a [namedtuple](https://docs.python.org/3/library/collections.html#collections.namedtuple).

Here is a simple example that prints the values that `itertuples()` returns:

In [8]:
for row in df.itertuples():
    print(row)

Pandas(Index=0, relpath='dir1', fname='file1.wav', barename='file1', ext='.wav', speaker='female')
Pandas(Index=1, relpath='dir1', fname='file2.wav', barename='file2', ext='.wav', speaker='female')
Pandas(Index=2, relpath='dir2', fname='file3.wav', barename='file3', ext='.wav', speaker='male')


Notice that the value of the row index is added as the first value and is named 'Index'. The DataFrame column labels are the other attributes and are easily accessed by name with attribute '.' notation:

In [10]:
for row in df.itertuples():
    print(row.fname)

file1.wav
file2.wav
file3.wav


***Aside:*** You might come across a similarly named method `iterrows()`, and it is recommended that you avoid it. The `iterrows()` method is less convenient than `itertuples()` because 1) it returns each row as an (index, Series) pair rather than a namedtuple, and the Series does not preserve column dtypes across the row; and 2) it is generally slower to execute than `itertuples()`.
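The pandas documentation notes that `iterrows()` yields each row as a Series, which can change value types when columns have different dtypes. The following sketch, using a hypothetical two-column DataFrame, shows the effect:

```python
import pandas as pd

# Hypothetical DataFrame with one integer and one float column.
demo = pd.DataFrame({'ntokens': [3], 'dur': [1.5]})

# iterrows() packs the row into a single Series, so the integer
# value is upcast to floating point to share a dtype with 'dur'.
_, series_row = next(demo.iterrows())
print(series_row.dtype, repr(series_row['ntokens']))

# itertuples() keeps each column's value separate, so the integer
# column is not converted to floating point.
tuple_row = next(demo.itertuples())
print(repr(tuple_row.ntokens), repr(tuple_row.dur))
```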

It makes sense to put your task in a named function when the task you wish to perform is more complicated than a simple print statement. Doing so helps make your code easier to debug, and you can re-use the function in multiple places.

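The `my_print` function below prints the 'speaker' value of a row, then uses `os.path.join` to construct and print two filepaths: a '.ifc' output path built from the 'relpath' and 'barename' attributes, and the input path built from 'relpath' and 'fname'. The `for` loop calls `my_print` on each row in turn.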

In [11]:
# Function definition.
def my_print(rowtuple):
    '''Print values from a DataFrame row provided as a namedtuple.'''
    print(rowtuple.speaker)
    print(os.path.join(rowtuple.relpath, rowtuple.barename + '.ifc'))
    print(os.path.join(rowtuple.relpath, rowtuple.fname))
    print('******************')

# Loop over rows and call the `my_print` function.
for row in df.itertuples():
    my_print(row)

female
dir1/file1.ifc
dir1/file1.wav
******************
female
dir1/file2.ifc
dir1/file2.wav
******************
male
dir2/file3.ifc
dir2/file3.wav
******************


### The `ifcformant` function

Now that we know the mechanics of calling a named function for every row in the DataFrame, let's construct a function that runs the `ifcformant` command using the parameters provided by a namedtuple.

If we were working at the command line, a representative example of calling `ifcformant` is:

```
ifcformant --speaker female --print-header --output myfile.ifc myfile.wav
```

The arguments to `ifcformant` include the speaker type, name of the output file to contain the formant measurements, and the input '.wav' file. The `--print-header` argument is used to print column labels as the first row of the output file.

The `do_ifcformant` function shown below constructs a list of arguments from an input namedtuple and output directory, then uses the [`subprocess`](https://docs.python.org/3/library/subprocess.html) module to execute `ifcformant`.


In [14]:
def do_ifcformant(rt, outdir, errors='raise'):
    '''Perform formant analysis with the ifcformant command.
    
    Parameters
    ----------
    
    rt : namedtuple that contains formant analysis parameters
         in fields:
         'relpath' (relative path to audio file),
         'fname' (name of .wav file),
         'barename' (name of .wav file without extension),
         'speaker' (ifcformant speaker type, one of 'female',
             'male', 'child')
             
    outdir : str
        Base pathname to ifcformant output. The output file will
        be written to: outdir/rt.relpath/rt.barename + '.ifc'.
             
    errors : str (default 'raise')
        How to handle errors if `check_call()` fails. If
        'ignore', print debug statement to STDERR and return the
        ifcformant return code; if 'raise' immediately reraise
        the CalledProcessError.
        
    Returns
    -------
    
    The `ifcformant` return code is returned by this function,
    0 for success or non-zero for errors.
    '''
    ifcargs = [
        'ifcformant',
        '--speaker', rt.speaker,
        '--print-header',
        '--output', os.path.join(
            outdir, rt.relpath, rt.barename + '.ifc'
        ),
        os.path.join(rt.relpath, rt.fname)
    ]
    try:
        subprocess.check_call(ifcargs)
    except subprocess.CalledProcessError as e:
        if errors == 'ignore':
            msg = 'Caught error while invoking ifcformant:\n{:}'.format(e)
            sys.stderr.write(msg)
            return e.returncode
        else:
            raise
    return 0

It's always best to include a docstring at the top of your named functions. A docstring documents your workflow and makes it easier to re-use the function in another project. Execute the following cell to see the documentation for the new function.

In [16]:
do_ifcformant?

Once your function is created and debugged, use `itertuples()` to run the function on every row of your DataFrame.

In [17]:
for row in df.itertuples():
    # Expand '~' explicitly; subprocess does not do shell-style tilde expansion.
    do_ifcformant(row, outdir=os.path.expanduser('~/myexp/cache'))
    

FileNotFoundError: [Errno 2] No such file or directory: 'ifcformant': 'ifcformant'
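A `FileNotFoundError` like the one above typically means that the `ifcformant` executable was not found on the system search path. One way to guard against this is to check for the command with `shutil.which()` before looping. The sketch below is illustrative: the helper name `run_for_rows` is hypothetical, and the lambda is a stand-in for a real per-row task such as `do_ifcformant`.

```python
import shutil
import sys
import pandas as pd

def run_for_rows(df, task, command):
    '''Call task(row) for every row of df if `command` can be found.

    Returns the list of task results, or None if `command` is missing.
    '''
    if shutil.which(command) is None:
        sys.stderr.write('{} not found; skipping.\n'.format(command))
        return None
    return [task(row) for row in df.itertuples()]

demo = pd.DataFrame({
    'relpath': ['dir1', 'dir1'],
    'fname': ['file1.wav', 'file2.wav'],
})

# Stand-in task that returns 0 (success) for each row. With the real
# function this might be:
#   lambda row: do_ifcformant(row, outdir, errors='ignore')
codes = run_for_rows(demo, lambda row: 0, 'ifcformant')
print(codes)
```

Collecting the return codes in a list makes it possible to add them to the DataFrame as a new column and inspect which rows failed.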

## <a name="find_non_matched_rows"></a>Find rows in a source DataFrame that do not have a match in a second DataFrame

If you have a source DataFrame and want to check whether a second DataFrame has a corresponding row, you can use a left merge with the source on the lefthand side. A left merge returns all of the merge keys from the lefthand DataFrame regardless of whether a matching key is found on the right. When there is no match, the columns from the right DataFrame are filled with NaN.

A common use case for this merge is for finding files that require post-processing. If you have a set of source filenames (e.g. '.wav' files) in one DataFrame and a set of existing post-processed filenames (e.g. '.fb' files from the `formant` command) in another, you can use the merge described in this section to find the source filenames that don't have a matching post-processed file. These are the ones that require post-processing.

For this example the source DataFrame will be the left DataFrame `ldf`. It contains a list of '.wav' filenames, parsed into a barename and extension.

In [18]:
ldf = pd.DataFrame.from_records([
    ('file1.wav', 'file1', '.wav'),
    ('file2.wav', 'file2', '.wav'),
    ('file3.wav', 'file3', '.wav')
], columns=['fname', 'barename', 'ext'])
ldf

Unnamed: 0,fname,barename,ext
0,file1.wav,file1,.wav
1,file2.wav,file2,.wav
2,file3.wav,file3,.wav


The `rdf` DataFrame has a corresponding set of filenames that have the same barenames as `ldf` with different extensions. Notice that it does not contain a filename that corresponds to the 'file2' barename in `ldf`. It does contain a barename not found in `ldf` ('file4').

In [19]:
rdf = pd.DataFrame.from_records([
    ('file1.fb', 'file1', '.fb'),
    ('file3.fb', 'file3', '.fb'),
    ('file4.fb', 'file4', '.fb')
], columns=['fname', 'barename', 'ext'])
rdf

Unnamed: 0,fname,barename,ext
0,file1.fb,file1,.fb
1,file3.fb,file3,.fb
2,file4.fb,file4,.fb


Our goal is to find each '.wav' file in `ldf` that does not have a corresponding '.fb' file in the second DataFrame `rdf`.

Performing a left merge on `ldf` preserves all of its key values, in this case 'barename'. This means that each barename value in `ldf` is represented by at least one row in the output. When there is no matching key from the right DataFrame, then the columns contributed from the right are filled with NaN. Notice that the 'file4' barename from the right DataFrame is not in the merge result. A left merge does not preserve all key values from the right DataFrame.

In [20]:
mdf = ldf.merge(rdf, on='barename', how='left', suffixes=['_lt', '_rt'])
mdf

Unnamed: 0,fname_lt,barename,ext_lt,fname_rt,ext_rt
0,file1.wav,file1,.wav,file1.fb,.fb
1,file2.wav,file2,.wav,,
2,file3.wav,file3,.wav,file3.fb,.fb


To find all of the '.wav' files that do not have a corresponding '.fb' file, select the rows from the merged DataFrame where the 'ext_rt' column has a value of NaN.

In [21]:
mdf[mdf.ext_rt.isna()]

Unnamed: 0,fname_lt,barename,ext_lt,fname_rt,ext_rt
1,file2.wav,file2,.wav,,
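As an alternative to testing for NaN, `merge()` accepts an `indicator=True` argument that adds a `_merge` column recording whether each row was found in the left DataFrame only, the right only, or both. The sketch below recreates the `ldf` and `rdf` DataFrames from this section and merges against only the key column of `rdf`, which is a choice made here to avoid suffixed columns:

```python
import pandas as pd

ldf = pd.DataFrame.from_records([
    ('file1.wav', 'file1', '.wav'),
    ('file2.wav', 'file2', '.wav'),
    ('file3.wav', 'file3', '.wav')
], columns=['fname', 'barename', 'ext'])

rdf = pd.DataFrame.from_records([
    ('file1.fb', 'file1', '.fb'),
    ('file3.fb', 'file3', '.fb'),
    ('file4.fb', 'file4', '.fb')
], columns=['fname', 'barename', 'ext'])

# Only the key column is needed from the right DataFrame.
imdf = ldf.merge(rdf[['barename']], on='barename', how='left', indicator=True)

# Rows marked 'left_only' have no match on the right.
missing = imdf[imdf['_merge'] == 'left_only']
print(missing.fname.tolist())  # ['file2.wav']
```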


***Important:*** Be careful! If your right DataFrame contains files you don't expect, the merge can hide missing files. In `rdf2` there are two '.txt' files in addition to the '.fb' files.

In [22]:
rdf2 = pd.DataFrame.from_records([
    ('file1.fb', 'file1', '.fb'),
    ('file2.txt', 'file2', '.txt'),
    ('file3.fb', 'file3', '.fb'),
    ('file3.txt', 'file3', '.txt')
], columns=['fname', 'barename', 'ext'])
rdf2

Unnamed: 0,fname,barename,ext
0,file1.fb,file1,.fb
1,file2.txt,file2,.txt
2,file3.fb,file3,.fb
3,file3.txt,file3,.txt


Merging `rdf2` with `ldf` does not produce the result we want! Since we only match on barename values, the existence of 'file2.txt' masks the fact that 'file2.fb' is missing, and there are no NaN values in the merge result.

(Secondarily, the barename 'file3' matches twice, once each for the '.fb' and '.txt' file.)

In [23]:
mdf2 = ldf.merge(rdf2, on='barename', how='left', suffixes=['_lt', '_rt'])
mdf2

Unnamed: 0,fname_lt,barename,ext_lt,fname_rt,ext_rt
0,file1.wav,file1,.wav,file1.fb,.fb
1,file2.wav,file2,.wav,file2.txt,.txt
2,file3.wav,file3,.wav,file3.fb,.fb
3,file3.wav,file3,.wav,file3.txt,.txt


To fix this problem, ensure that the right DataFrame does not contain any files that are not relevant to the task. In our case, we want to find '.wav' files that do not have a matching '.fb' file, so we take a subset of `rdf2` that contains only '.fb' files.

The subset looks like this:

In [24]:
rdf2[rdf2.ext == '.fb']

Unnamed: 0,fname,barename,ext
0,file1.fb,file1,.fb
2,file3.fb,file3,.fb


Using this subset in the merge produces the intended result, where 'file2' has a NaN value in the 'ext_rt' column.

In [25]:
mdf2 = ldf.merge(
    rdf2[rdf2.ext == '.fb'],   # Subset of '.fb' files
    on='barename',
    how='left',
    suffixes=['_lt', '_rt']
)
mdf2

Unnamed: 0,fname_lt,barename,ext_lt,fname_rt,ext_rt
0,file1.wav,file1,.wav,file1.fb,.fb
1,file2.wav,file2,.wav,,
2,file3.wav,file3,.wav,file3.fb,.fb
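
When only the list of unmatched source files is needed, a membership test with `isin()` is a possible alternative to the subset-then-merge approach. This is a sketch of a different technique, not the merge described in this section. The DataFrames below are reduced to the columns the test requires:

```python
import pandas as pd

ldf = pd.DataFrame({
    'fname': ['file1.wav', 'file2.wav', 'file3.wav'],
    'barename': ['file1', 'file2', 'file3'],
})

rdf2 = pd.DataFrame({
    'barename': ['file1', 'file2', 'file3', 'file3'],
    'ext': ['.fb', '.txt', '.fb', '.txt'],
})

# Barenames that already have a '.fb' file.
done = rdf2.loc[rdf2.ext == '.fb', 'barename']

# Keep the source rows whose barename is not among them.
todo = ldf[~ldf.barename.isin(done)]
print(todo.fname.tolist())  # ['file2.wav']
```

Unlike the merge, this approach never duplicates rows of `ldf` when a barename appears more than once on the right, but it also does not bring along any columns from the right DataFrame.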
