In [1]:
%load_ext autoreload
%autoreload 2

# Working with experiment files using `dir2df()`

The `dir2df()` function is designed to help you manage the observations stored in a set of files in a directory, which we will call the `datadir`. `dir2df()` recursively descends into the subdirectories of `datadir` and returns a DataFrame with all of the files it finds as the rows.
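
To make the recursive descent concrete, here is a minimal standard-library sketch of the idea. It is not `phonlab`'s implementation: the helper name `list_files` and the throwaway directory layout are assumptions made for illustration only.

```python
import os
import tempfile


def list_files(datadir):
    """Return (relpath, filename) pairs for every file under datadir.

    Hypothetical helper for illustration; not part of phonlab.
    """
    rows = []
    for dirpath, dirnames, filenames in os.walk(datadir):
        dirnames.sort()  # visit subdirectories in a predictable order
        relpath = os.path.relpath(dirpath, datadir)
        for filename in sorted(filenames):
            rows.append((relpath, filename))
    return rows


# Build a tiny temporary tree with one subdirectory per subject.
with tempfile.TemporaryDirectory() as tmp_datadir:
    for subject, name in [('JW11', 'tp001.wav'), ('JW12', 'tp001.txy')]:
        os.makedirs(os.path.join(tmp_datadir, subject), exist_ok=True)
        open(os.path.join(tmp_datadir, subject, name), 'w').close()
    rows = list_files(tmp_datadir)

print(rows)
```

Each pair corresponds to one row of the DataFrame that `dir2df()` returns, with the directory part kept relative to `datadir`.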

## The X-Ray microbeam database

The [X-Ray microbeam database](https://github.com/rsprouse/xray_microbeam_database) serves as the first example to illustrate `dir2df()`. The only required argument is `datadir`. The return value is a DataFrame with one row per file: the `filename` column holds the file's name, and the `relpath` column holds the path from `datadir` to the directory that contains it.

In [38]:
from phonlab.utils import dir2df

datadir = '../resource/xrmbdb'
df = dir2df(datadir)
df

Unnamed: 0,filename,relpath
0,tp014_2.wav,JW21
1,tp029.txy,JW21
2,tp041.wav,JW21
3,tp055.wav,JW21
4,tp015.txy,JW21
5,tp069.wav,JW21
6,tp001.txy,JW21
7,tp083.TextGrid,JW21
8,tp082.wav,JW21
9,ta008.frm,JW21


The `relpath` column has the `dtype` [`Categorical`](https://pandas.pydata.org/pandas-docs/stable/categorical.html) (analogous to R's `factor`), which makes it easy to list all of the subdirectories found in `datadir` by displaying its categories (equivalent to R factor levels).

In [44]:
df.relpath.cat.categories

Index(['JW11', 'JW12', 'JW13', 'JW14', 'JW15', 'JW16', 'JW18', 'JW19', 'JW20',
       'JW21', 'JW24', 'JW25', 'JW26', 'JW27', 'JW28', 'JW29', 'JW30', 'JW31',
       'JW32', 'JW33', 'JW34', 'JW35', 'JW36', 'JW37', 'JW39', 'JW40', 'JW41',
       'JW42', 'JW43', 'JW44', 'JW45', 'JW46', 'JW48', 'JW49', 'JW50', 'JW51',
       'JW52', 'JW53', 'JW54', 'JW55', 'JW56', 'JW57', 'JW58', 'JW59', 'JW60',
       'JW61', 'JW62', 'JW63'],
      dtype='object')

The organization of files in the dataset is simple: for each subject there is a single subdirectory named 'JW' + the subject number.

## Adding file metadata

The `stats` parameter accepts one or more attribute names of the [`stat_result` returned by `os.stat()`](https://docs.python.org/3/library/os.html#os.stat_result), and you can use it to store these attributes as output columns with the corresponding names. The most interesting attributes are `size` (size of the file in bytes) and `mtime` (last modification time of the file).

**NOTE:** The interpretation of `mtime` depends on your system platform. For more, see the [`stat_result` documentation](https://docs.python.org/3/library/os.html#os.stat_result).
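
For orientation, the sketch below reads the same two values directly with `os.stat()`. It assumes that the `stats` names `size` and `mtime` correspond to the `st_size` and `st_mtime` attributes of `stat_result`; the temporary file and the variable names are illustrative only.

```python
import os
import tempfile
from datetime import datetime

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'tp001.wav')
    with open(path, 'wb') as f:
        f.write(b'\x00' * 16)  # 16 bytes of placeholder data

    st = os.stat(path)
    size = st.st_size  # file size in bytes
    # st_mtime is seconds since the epoch; convert to a local-time datetime.
    mtime = datetime.fromtimestamp(st.st_mtime)

print(size, mtime)
```

The raw `st_mtime` value is a float, so it has to be converted before it reads as a timestamp like the ones in the `mtime` column.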

In [75]:
df = dir2df(datadir, stats='mtime')
df

Unnamed: 0,filename,mtime,relpath
0,tp014_2.wav,2018-06-05 20:15:15.876573,JW21
1,tp029.txy,2018-06-05 20:15:15.880641,JW21
2,tp041.wav,2018-06-05 20:15:15.884534,JW21
3,tp055.wav,2018-06-05 20:15:15.887464,JW21
4,tp015.txy,2018-06-05 20:15:15.890435,JW21
5,tp069.wav,2018-06-05 20:15:15.893174,JW21
6,tp001.txy,2018-06-05 20:15:15.896432,JW21
7,tp083.TextGrid,2018-06-05 20:15:15.899190,JW21
8,tp082.wav,2018-06-05 20:15:15.901960,JW21
9,ta008.frm,2018-06-05 20:15:15.904737,JW21


You can use the `mtime` values to select a subset of filenames from the DataFrame. The next cell selects files that were last modified after a particular date and time.

In [79]:
df[df.mtime > '2018-06-05 20:16:49.8']

Unnamed: 0,filename,mtime,relpath
24194,tp002.TextGrid,2018-06-05 20:16:49.801808,JW62
24195,tp003.TextGrid,2018-06-05 20:16:49.804680,JW62
24196,tp104.wav,2018-06-05 20:16:49.807550,JW62
24197,tp110.wav,2018-06-05 20:16:49.810653,JW62
24198,tp074.TextGrid,2018-06-05 20:16:49.813829,JW62
24199,tp008.frm,2018-06-05 20:16:49.816782,JW62
24200,tp034.frm,2018-06-05 20:16:49.819773,JW62
24201,tp020.frm,2018-06-05 20:16:49.823348,JW62
24202,ta005.frm,2018-06-05 20:16:49.826156,JW62
24203,ta011.frm,2018-06-05 20:16:49.829741,JW62


To also include `size` we pass a list to the `stats` parameter. Note that the size of all of the files is 0 bytes. This is because only the file structure of the database was copied, with none of the data.

In [84]:
df = dir2df(datadir, stats=['mtime', 'size'])
df

Unnamed: 0,filename,mtime,relpath,size
0,tp014_2.wav,2018-06-05 20:15:15.876573,JW21,0
1,tp029.txy,2018-06-05 20:15:15.880641,JW21,0
2,tp041.wav,2018-06-05 20:15:15.884534,JW21,0
3,tp055.wav,2018-06-05 20:15:15.887464,JW21,0
4,tp015.txy,2018-06-05 20:15:15.890435,JW21,0
5,tp069.wav,2018-06-05 20:15:15.893174,JW21,0
6,tp001.txy,2018-06-05 20:15:15.896432,JW21,0
7,tp083.TextGrid,2018-06-05 20:15:15.899190,JW21,0
8,tp082.wav,2018-06-05 20:15:15.901960,JW21,0
9,ta008.frm,2018-06-05 20:15:15.904737,JW21,0


## Filtering your dataset

We have already seen one example of filtering the dataset by selecting only the files modified more recently than a particular date and time. Using pandas' built-in methods and indexing is the most flexible way to filter your dataset.
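
As a small illustration of that flexibility, the sketch below combines boolean masks on a synthetic DataFrame. The rows are made up to resemble `dir2df()` output, and the name `files` is an assumption chosen to avoid clobbering `df`.

```python
import pandas as pd

# Synthetic rows shaped like dir2df() output; the values are invented.
files = pd.DataFrame({
    'filename': ['tp001.wav', 'tp001.txy', 'tp002.wav', 'ta001.frm'],
    'relpath': pd.Categorical(['JW11', 'JW11', 'JW12', 'JW12']),
    'size': [0, 0, 0, 0],
})

# Boolean masks combine with & (and), | (or), and ~ (not).
is_wav = files.filename.str.endswith('.wav')
in_jw12 = files.relpath == 'JW12'

wav_in_jw12 = files[is_wav & in_jw12]
not_wav = files[~is_wav]

print(wav_in_jw12.filename.tolist())
print(not_wav.filename.tolist())
```

Each comparison needs its own parentheses when it is written inline, because `&` and `|` bind more tightly than `==` and `>`.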

Here's another example, which selects all the files that end with '.wav' and have modification times newer than a given point in time.

In [82]:
df[(df.filename.str.endswith('.wav')) & (df.mtime > '2018-06-05 20:16:49.8')]

Unnamed: 0,filename,mtime,relpath,size
24196,tp104.wav,2018-06-05 20:16:49.807550,JW62,0
24197,tp110.wav,2018-06-05 20:16:49.810653,JW62,0
24205,tp070.wav,2018-06-05 20:16:49.843370,JW62,0
24206,tp064.wav,2018-06-05 20:16:49.846566,JW62,0
24209,tp058.wav,2018-06-05 20:16:49.856169,JW62,0


## Filtering with the `searchpat` and `dirpat` parameters

Using Pandas built-in indexing is flexible, but it can also be inefficient when `datadir` contains a large number of files and you want to ignore many of them. You can speed things up by using the `searchpat` parameter to apply a [regular expression](https://docs.python.org/3/howto/regex.html) to each filename and reject the ones that don't match.

[pythex.org](http://pythex.org) runs a useful service for developing and debugging regular expressions.

In the next example, `dir2df()` returns only those files that end with '.wav'. The `$` symbol in the regex anchors the match to the end of the filename. The `.` in a regex matches any character, and we use `\.` to ensure it matches only a literal dot.
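
The effect of the escape and the anchor can be checked with Python's `re` module alone. This sketch assumes that `searchpat` is applied with search semantics (the tutorial does not say which `re` function `dir2df()` calls), and the filenames are invented.

```python
import re

names = ['tp001.wav', 'tp001.wav.bak', 'tp001_wav', 'xwav']

# Escaped dot and end anchor: only names that really end in '.wav'.
escaped = [n for n in names if re.search(r'\.wav$', n)]

# Unescaped dot: any character may precede 'wav' at the end.
unescaped = [n for n in names if re.search(r'.wav$', n)]

# No end anchor: '.wav' may appear anywhere in the name.
unanchored = [n for n in names if re.search(r'\.wav', n)]

print(escaped)
print(unescaped)
print(unanchored)
```

Writing the pattern as a raw string (`r'...'`) keeps Python from interpreting the backslash before the regex engine sees it.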

In [85]:
# Use a raw string so the backslash reaches the regex engine unchanged.
df = dir2df(datadir, searchpat=r'\.wav$', stats=['mtime', 'size'])
df

Unnamed: 0,filename,mtime,relpath,size
0,tp014_2.wav,2018-06-05 20:15:15.876573000,JW21,0
1,tp041.wav,2018-06-05 20:15:15.884534000,JW21,0
2,tp055.wav,2018-06-05 20:15:15.887464000,JW21,0
3,tp069.wav,2018-06-05 20:15:15.893174000,JW21,0
4,tp082.wav,2018-06-05 20:15:15.901960000,JW21,0
5,tp096.wav,2018-06-05 20:15:15.907822000,JW21,0
6,ta001_2.wav,2018-06-05 20:15:15.934402000,JW21,0
7,tp097.wav,2018-06-05 20:15:15.953573000,JW21,0
8,tp083.wav,2018-06-05 20:15:15.959965000,JW21,0
9,tp068.wav,2018-06-05 20:15:15.963871000,JW21,0


You can also include [named capture groups](https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups) in your search pattern to parse your filenames and add columns to your output DataFrame.

In the next example, the only filenames returned start with 'tp', followed by one or more digits that correspond to an utterance number, followed by an optional underscore and instance number, and ending with the extension '.wav'. The utterance and instance number are recorded in their own columns. The instance number is optional because of the `?` modifier.
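
To see what the named groups capture, the same pattern can be tried on a few filenames with the `re` module. This is only a sketch of the regex behavior, not of how `dir2df()` builds its columns, and the dictionary name `parsed` is illustrative.

```python
import re

pattern = re.compile(r'^tp(?P<utterance>\d+)(_(?P<instance>\d))?\.wav$')

parsed = {}
for name in ['tp014_2.wav', 'tp041.wav', 'ta001_2.wav', 'tp029.txy']:
    m = pattern.search(name)
    # groupdict() maps each group name to its captured text, or None
    # when an optional group did not take part in the match.
    parsed[name] = m.groupdict() if m else None

for name, groups in parsed.items():
    print(name, groups)
```

The captured values are strings, so an utterance such as `'014'` keeps its leading zero until you convert it explicitly.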

In [97]:
df = dir2df(
    datadir,
    searchpat=r'^tp(?P<utterance>\d+)(_(?P<instance>\d))?\.wav$'
)
df

Unnamed: 0,filename,instance,relpath,utterance
0,tp014_2.wav,2,JW21,014
1,tp041.wav,,JW21,041
2,tp055.wav,,JW21,055
3,tp069.wav,,JW21,069
4,tp082.wav,,JW21,082
5,tp096.wav,,JW21,096
6,tp097.wav,,JW21,097
7,tp083.wav,,JW21,083
8,tp068.wav,,JW21,068
9,tp054.wav,,JW21,054


The `dirpat` parameter is like `searchpat` but is applied against `relpath` rather than the `filename`. Directories that do not match `dirpat` are ignored, and no filenames are returned from them.

Unlike `searchpat`, `dirpat` does not add columns based on named capture groups.
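
When writing a `dirpat`, it can matter whether the pattern is anchored. The sketch below assumes search semantics, which the tutorial does not confirm for `dir2df()`, and uses hypothetical directory names such as 'JW110' and 'backup/JW11' that are not in the database.

```python
import re

relpaths = ['JW11', 'JW12', 'JW13', 'JW110', 'backup/JW11']

# Unanchored: matches wherever 'JW11' or 'JW12' occurs in the path.
loose = [p for p in relpaths if re.search(r'JW(11|12)', p)]

# Anchored at both ends: matches only those two directory names.
strict = [p for p in relpaths if re.search(r'^JW(11|12)$', p)]

print(loose)
print(strict)
```

For a dataset laid out like this one, with a single level of two-digit subject directories, both patterns select the same directories.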

The next example returns '.wav' files only from subjects 'JW11' and 'JW12'.

In [100]:
df = dir2df(
    datadir,
    searchpat=r'^tp(?P<utterance>\d+)(_(?P<instance>\d))?\.wav$',
    dirpat='JW(11|12)'
)
df

Unnamed: 0,filename,instance,relpath,utterance
0,tp041.wav,,JW11,041
1,tp055.wav,,JW11,055
2,tp069.wav,,JW11,069
3,tp082.wav,,JW11,082
4,tp096.wav,,JW11,096
5,tp109_2.wav,2,JW11,109
6,tp109.wav,,JW11,109
7,tp108.wav,,JW11,108
8,tp110_2.wav,2,JW11,110
9,tp097.wav,,JW11,097
