# Retrieving filenames in a directory tree with `dir2df()`

The `dir2df()` function is designed to help you manage a set of filenames in an experiment directory. It recursively descends from a top-level directory and returns a DataFrame in which the files are the rows.

Because the output is a DataFrame, you can use standard Pandas methods to select subsets of filenames, extract experiment variables from filenames and paths, generate additional data by applying functions to filename rows, and do certain kinds of error checking. This notebook explores some of these topics.

In [1]:
# Import libraries used in this notebook.
import os
import re
from phonlab.utils import dir2df

## The `dir2df()` function

`dir2df()` has one required parameter, the top-level directory name. The function returns a DataFrame with two columns, one for the path from the top-level directory to the filename, and the other for the filename itself. These columns are named `relpath` and `fname`, respectively. 

In [2]:
expdir = '../resource/myexp'  # top-level experiment directory
df = dir2df(expdir)
df

Unnamed: 0,relpath,fname
0,.,README.md
1,subj1/trial1,acq1.tg
2,subj1/trial1,acq1.wav
3,subj1/trial1,acq2.tg
4,subj1/trial1,acq2.wav
5,subj1/trial2,acq1.tg
6,subj1/trial2,acq1.wav
7,subj1/trial2,acq2.tg
8,subj1/trial2,acq2.wav
9,subj2,junk.txt
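
Because the result is an ordinary DataFrame, subsets and simple consistency checks need nothing beyond standard pandas string methods. The sketch below uses a small hand-built DataFrame as a stand-in for `dir2df()` output (an assumption, so that it runs without the experiment directory). It selects the '.wav' rows and then lists any '.wav' file that has no '.tg' file of the same name in the same directory.

```python
import pandas as pd

# Hand-built stand-in for dir2df() output (illustrative data only).
df = pd.DataFrame({
    'relpath': ['.', 'subj1/trial1', 'subj1/trial1', 'subj1/trial2', 'subj2'],
    'fname': ['README.md', 'acq1.tg', 'acq1.wav', 'acq1.wav', 'junk.txt'],
})

# Select subsets of rows by filename extension.
wavdf = df[df['fname'].str.endswith('.wav')]
tgdf = df[df['fname'].str.endswith('.tg')]

# Build (relpath, name-without-extension) keys for each subset.
tg_keys = set(zip(tgdf['relpath'], tgdf['fname'].str[:-len('.tg')]))
wav_keys = list(zip(wavdf['relpath'], wavdf['fname'].str[:-len('.wav')]))

# Error check: '.wav' files that lack a matching '.tg' file.
missing_tg = [key for key in wav_keys if key not in tg_keys]
print(missing_tg)
```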


## Filtering filenames and paths

You can filter which rows are returned by `dir2df()` by using the `fnpat` and `dirpat` parameters. These are [regular expression](https://docs.python.org/3/howto/regex.html) patterns that define valid filenames to return. In order for a row to be returned, the filename must match `fnpat`, and the relative path must match `dirpat`.

***Aside:*** [pythex.org](http://pythex.org) runs a useful service for developing and debugging regular expressions.

The next example filters rows so that the only filenames returned are the '.wav' files for subject 3.

In [3]:
df = dir2df(
    '../resource/myexp/',
    fnpat=r'\.wav$',
    dirpat='^subj3/'
)
df

Unnamed: 0,relpath,fname
0,subj3/trial1,acq1.wav
1,subj3/trial1,acq2.wav
2,subj3/trial2,acq1.wav
3,subj3/trial2,acq2.wav
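
The filtering rule can be sketched in plain Python. The sketch assumes that a pattern may match anywhere in the string unless it is anchored (`re.search()` semantics); that is consistent with the unanchored `fnpat` used here, but it is an assumption about the internals of `dir2df()`. The list of `(relpath, fname)` pairs is hypothetical.

```python
import re

# Hypothetical (relpath, fname) pairs standing in for a directory listing.
files = [
    ('subj2/trial1', 'acq1.wav'),
    ('subj3/trial1', 'acq1.tg'),
    ('subj3/trial1', 'acq1.wav'),
    ('subj3/trial2', 'acq2.wav'),
]

fnpat = re.compile(r'\.wav$')    # filename must end in '.wav'
dirpat = re.compile(r'^subj3/')  # relative path must start with 'subj3/'

# A row is kept only when BOTH patterns match.
kept = [
    (relpath, fname)
    for relpath, fname in files
    if dirpat.search(relpath) and fnpat.search(fname)
]
print(kept)
```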


## Adding variables to the output

`dir2df()` will return additional metadata as variables (i.e. columns) of the output DataFrame on request.

### Variables parsed from filenames and paths

If you include experiment variables in your file-naming conventions, you can extract them by using [named capture groups](https://docs.python.org/3/howto/regex.html#non-capturing-and-named-groups) in `fnpat` and `dirpat`. The named captures will be included as Categorical variables (analogous to R's factor variables) in the output DataFrame.

The next cell returns '.wav' files for subjects 2 and 3 and also parses the subject, trial, and acquisition variables from the relative path and filename.

In [4]:
df = dir2df(
    '../resource/myexp/',
    fnpat=r'^acq(?P<acqnum>\d+)\.wav$',
    dirpat=r'^subj(?P<subject>[23])/trial(?P<trial>\d+)$'
)
df

Unnamed: 0,relpath,fname,subject,trial,acqnum
0,subj2/trial1,acq1.wav,2,1,1
1,subj2/trial1,acq2.wav,2,1,2
2,subj2/trial2,acq1.wav,2,2,1
3,subj2/trial2,acq2.wav,2,2,2
4,subj3/trial1,acq1.wav,3,1,1
5,subj3/trial1,acq2.wav,3,1,2
6,subj3/trial2,acq1.wav,3,2,1
7,subj3/trial2,acq2.wav,3,2,2
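
To see what a named capture group yields for a single path, the standard `re` module can be used directly. The sketch below applies the same two patterns to one hypothetical path and filename, merges the captures into one dict, and stores them as Categorical columns. Regex captures are always strings, so the example also shows one way to get a numeric column; the `trial_num` column name is made up for illustration.

```python
import re
import pandas as pd

dirpat = re.compile(r'^subj(?P<subject>[23])/trial(?P<trial>\d+)$')
fnpat = re.compile(r'^acq(?P<acqnum>\d+)\.wav$')

# Match one hypothetical relative path and filename.
dm = dirpat.search('subj3/trial2')
fm = fnpat.search('acq1.wav')

# groupdict() maps each group name to the text it captured.
fields = {**dm.groupdict(), **fm.groupdict()}
print(fields)

# Store the captures as Categorical columns.
df = pd.DataFrame([fields]).astype('category')

# Captured values are strings; convert explicitly when a number is needed.
df['trial_num'] = df['trial'].astype(str).astype(int)
```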


If you need more control over your regex, you can precompile the values of `fnpat` and `dirpat` using `re.compile()`, which allows for the use of flags in interpreting the pattern.

In the next cell, `dirpat` is passed as a precompiled regex using flags that allow for case-insensitivity (`re.IGNORECASE`) and multi-line formatting with comments (`re.VERBOSE`). See the [regex documentation](https://docs.python.org/3/library/re.html#contents-of-module-re) for more information on flags. Multiple flags are combined with the '|' operator.

In [5]:
dirpat = re.compile(
    r'''
    ^                      # directory path starts with...
    subj(?P<subject>[23])  # subj2 or subj3, extracted as variable
    /                      # directory separator
    trial(?P<trial>\d+)    # trialN, extracted as variable
    $                      # ...and no more
    ''',
    flags=re.IGNORECASE|re.VERBOSE
)
df = dir2df(
    '../resource/myexp/',
    fnpat=r'^acq(?P<acqnum>\d+)\.wav$',
    dirpat=dirpat          # Use precompiled regex
)
df

Unnamed: 0,relpath,fname,subject,trial,acqnum
0,subj2/TRIAL3,acq1.wav,2,3,1
1,subj2/trial1,acq1.wav,2,1,1
2,subj2/trial1,acq2.wav,2,1,2
3,subj2/trial2,acq1.wav,2,2,1
4,subj2/trial2,acq2.wav,2,2,2
5,subj3/trial1,acq1.wav,3,1,1
6,subj3/trial1,acq2.wav,3,1,2
7,subj3/trial2,acq1.wav,3,2,1
8,subj3/trial2,acq2.wav,3,2,2
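
A verbose, case-insensitive pattern is easy to get subtly wrong, so it can help to try it on a few sample paths before walking a large directory tree. The following sketch does this with the standard `re` module only; the sample paths are hypothetical.

```python
import re

dirpat = re.compile(
    r'''
    ^                      # directory path starts with...
    subj(?P<subject>[23])  # subj2 or subj3, extracted as variable
    /                      # directory separator
    trial(?P<trial>\d+)    # trialN, extracted as variable
    $                      # ...and no more
    ''',
    flags=re.IGNORECASE | re.VERBOSE
)

# Map each sample path to its captured variables, or None if no match.
results = {}
for relpath in ['subj2/TRIAL3', 'subj3/trial1', 'subj4/trial1', 'subj2']:
    m = dirpat.search(relpath)
    results[relpath] = m.groupdict() if m else None
    print(relpath, results[relpath])
```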


### Variables added with `addcols`

The `addcols` parameter takes a list of columns that `dir2df()` can add to the output. The first three columns that you can add are 'dirname', 'barename', and 'ext'. These correspond respectively to the top-level directory name (the one passed as the first argument to `dir2df()`), the filename without extension, and the filename extension.

Two additional columns are 'bytes' and 'mtime', which are file attributes retrieved by `os.stat()`. The 'bytes' column contains the file size ('st_size' from `os.stat()`), and 'mtime' contains the last modification time of the file ('st_mtime').

***NOTE:*** The interpretation of `mtime` depends on your system platform. For more, see the [`stat_result` documentation](https://docs.python.org/3/library/os.html#os.stat_result).

The next cell repeats the previous filename listing and adds all of the possible `addcols` values.

In [7]:
df = dir2df(
    '../resource/myexp/',
    fnpat=r'^acq(?P<acqnum>\d+)\.wav$',
    dirpat=r'^subj(?P<subject>[23])/trial(?P<trial>\d+)$',
    addcols=['dirname', 'barename', 'ext', 'bytes', 'mtime']
)
df

Unnamed: 0,dirname,relpath,fname,barename,ext,bytes,mtime,subject,trial,acqnum
0,../resource/myexp/,subj2/trial1,acq1.wav,acq1,.wav,5,2018-06-06 15:07:20.681955,2,1,1
1,../resource/myexp/,subj2/trial1,acq2.wav,acq2,.wav,5,2018-06-06 15:07:20.682321,2,1,2
2,../resource/myexp/,subj2/trial2,acq1.wav,acq1,.wav,5,2018-06-06 15:07:45.210344,2,2,1
3,../resource/myexp/,subj2/trial2,acq2.wav,acq2,.wav,5,2018-06-06 15:07:45.210727,2,2,2
4,../resource/myexp/,subj3/trial1,acq1.wav,acq1,.wav,5,2018-06-06 15:07:56.697829,3,1,1
5,../resource/myexp/,subj3/trial1,acq2.wav,acq2,.wav,5,2018-06-06 15:07:56.698221,3,1,2
6,../resource/myexp/,subj3/trial2,acq1.wav,acq1,.wav,5,2018-06-06 15:08:02.866098,3,2,1
7,../resource/myexp/,subj3/trial2,acq2.wav,acq2,.wav,5,2018-06-06 15:08:02.866462,3,2,2
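
The extra columns can be related to standard-library calls. The sketch below creates a temporary file, joins 'dirname', 'relpath', and 'fname' into a full path, and reads the values that correspond to 'barename', 'ext', 'bytes', and 'mtime'. Converting `st_mtime` with `datetime.fromtimestamp()` is an assumption made for illustration; it is not necessarily how `dir2df()` produces its 'mtime' column.

```python
import os
import tempfile
from datetime import datetime

with tempfile.TemporaryDirectory() as dirname:
    # Hypothetical file laid out like the experiment directory.
    relpath, fname = os.path.join('subj2', 'trial1'), 'acq1.wav'
    os.makedirs(os.path.join(dirname, relpath))

    # The three path columns join into a usable full path.
    fullpath = os.path.join(dirname, relpath, fname)
    with open(fullpath, 'wb') as f:
        f.write(b'dummy')  # 5 bytes of placeholder content

    # Split the filename into its bare name and extension.
    barename, ext = os.path.splitext(fname)

    # File attributes from os.stat().
    st = os.stat(fullpath)
    nbytes = st.st_size                          # file size in bytes
    mtime = datetime.fromtimestamp(st.st_mtime)  # local-time datetime

print(barename, ext, nbytes, mtime)
```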


## Files that are ignored by `dir2df()`

By default `dir2df()` ignores certain filepaths and does not return them. There are two kinds of parameters that can be used to change these defaults.

### The `sentinel` file parameter

`dir2df()` looks for a sentinel file in every subdirectory it traverses, and if the sentinel is found, the files in that directory and any of its subdirectories are not returned in the output. The default name of the sentinel file is '.bad.txt', and this name can be changed to another via the `sentinel` parameter. If the value is changed to '', then the check for the sentinel file will not be performed, and all subdirectories are included in the output.

The 'subj7' directory was skipped when `dir2df()` returned the files for `expdir` in a previous cell. If we set the `sentinel` value to '', then 'subj7' is returned in the output.

In [214]:
df = dir2df(expdir, sentinel='')
df[df.relpath.str.startswith('subj7')]

Unnamed: 0,relpath,fname
43,subj7/trial1,acq1.tg
44,subj7/trial1,acq1.wav
45,subj7/trial1,acq2.tg
46,subj7/trial1,acq2.wav
47,subj7/trial2,acq1.tg
48,subj7/trial2,acq1.wav
49,subj7/trial2,acq2.tg
50,subj7/trial2,acq2.wav


Notice that the '.bad.txt' file itself is not returned. This is because it is a 'hidden' file, and the treatment of these files depends on the values of the `dot*` parameters.
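
The sentinel behavior can be sketched with `os.walk()`. This is a minimal illustration of the idea, not the `phonlab` implementation: the function name `walk_without_sentinel_dirs` is hypothetical, and unlike `dir2df()` the sketch does not hide dotfiles, so the sentinel file itself is listed when the check is disabled.

```python
import os
import tempfile

def walk_without_sentinel_dirs(top, sentinel='.bad.txt'):
    """Return (relpath, fname) pairs, skipping sentinel-marked directories."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # make the traversal order deterministic
        if sentinel and sentinel in filenames:
            dirnames[:] = []  # prune: do not descend into subdirectories
            continue          # and skip this directory's own files
        relpath = os.path.relpath(dirpath, top)
        for fname in sorted(filenames):
            rows.append((relpath, fname))
    return rows

with tempfile.TemporaryDirectory() as top:
    # Build a tiny hypothetical tree in which 'subj7' is marked as bad.
    for relpath, fname in [
        (os.path.join('subj6', 'trial1'), 'acq1.wav'),
        ('subj7', '.bad.txt'),
        (os.path.join('subj7', 'trial1'), 'acq1.wav'),
    ]:
        os.makedirs(os.path.join(top, relpath), exist_ok=True)
        open(os.path.join(top, relpath, fname), 'w').close()

    pruned = walk_without_sentinel_dirs(top)                  # default check
    everything = walk_without_sentinel_dirs(top, sentinel='')  # check disabled

print(pruned)
print(everything)
```

Assigning to `dirnames[:]` modifies the list in place, which is how a top-down `os.walk()` is told not to descend further.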

### 'Hidden' files and the `dotfiles` and `dotdirs` parameters

Filenames and directory names that start with '.' are considered 'hidden' files. By default, filenames that start with '.' are ignored, and `dir2df()` will not descend into any subdirectory that has a name starting with '.'. To change this behavior and include 'hidden' files or directories, set `dotfiles` or `dotdirs` to `True`, respectively.

In the following cell the sentinel file is included in the output by setting `dotfiles` to `True`.

In [8]:
df = dir2df(expdir, sentinel='', dotfiles=True)
df[df.relpath.str.startswith('subj7')]

Unnamed: 0,relpath,fname
44,subj7,.bad.txt
45,subj7/trial1,acq1.tg
46,subj7/trial1,acq1.wav
47,subj7/trial1,acq2.tg
48,subj7/trial1,acq2.wav
49,subj7/trial2,acq1.tg
50,subj7/trial2,acq1.wav
51,subj7/trial2,acq2.tg
52,subj7/trial2,acq2.wav
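
The effect of the two flags can be sketched in the same way with `os.walk()`. Again this is an illustration under stated assumptions rather than the library's code: `list_files` is a hypothetical helper whose `dotfiles` and `dotdirs` arguments mimic the parameters described above.

```python
import os
import tempfile

def list_files(top, dotfiles=False, dotdirs=False):
    """Return (relpath, fname) pairs, optionally including hidden entries."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(top):
        if not dotdirs:
            # Prune hidden directories so os.walk() never descends into them.
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        dirnames.sort()  # make the traversal order deterministic
        relpath = os.path.relpath(dirpath, top)
        for fname in sorted(filenames):
            if dotfiles or not fname.startswith('.'):
                rows.append((relpath, fname))
    return rows

with tempfile.TemporaryDirectory() as top:
    # Tiny hypothetical tree with one hidden file and one hidden directory.
    for relpath, fname in [
        ('subj7', '.bad.txt'),
        ('subj7', 'acq1.wav'),
        ('.cache', 'tmp.wav'),
    ]:
        os.makedirs(os.path.join(top, relpath), exist_ok=True)
        open(os.path.join(top, relpath, fname), 'w').close()

    default = list_files(top)
    with_dotfiles = list_files(top, dotfiles=True)
    with_dotdirs = list_files(top, dotdirs=True)

print(default)
print(with_dotfiles)
print(with_dotdirs)
```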


## Rarely-used parameters

`dir2df()` supports additional parameters that you are not likely to need. For additional information, execute the next cell to see the full function signature and docstring.

In [10]:
dir2df?
# or use dir2df?? to also see the function code