# How to capture the output of command line tools to pandas DataFrames

Here's a general problem: How do you run a command line tool and use its output in a script?

One useful answer is to use the `subprocess` module to execute the command and then read its output into a pandas DataFrame. We'll illustrate this approach using tools found in the Berkeley Phonetics Machine.

## Getting started

We require three Python modules to perform the tasks, one for running the tools and two for creating DataFrames.

In [126]:
import subprocess       # Used to execute external commands

import pandas as pd     # Provides DataFrames
from io import StringIO # Converts command outputs to file-like objects
                        # that pandas can read.

The `subprocess` module executes command line tools in the context of a script. In a notebook it is also possible to execute command line tools directly, and in the cells that follow you'll sometimes see a line that begins with `!`. The `!` executes the tool that follows it and displays its output in the cell.

This example runs the `ls` command to list the contents of the current directory:

In [127]:
!ls

Command line tools to DataFrames.ipynb


Options and arguments can be provided in the same way as they are at the command line:

In [128]:
!ls -al

total 28
drwxrwxr-x 3 ubuntu ubuntu  4096 Sep 21 20:54 .
drwxrwxr-x 5 ubuntu ubuntu  4096 Sep 21 13:46 ..
-rw-rw-r-- 1 ubuntu ubuntu 12646 Sep 21 20:54 Command line tools to DataFrames.ipynb
drwxr-xr-x 2 ubuntu ubuntu  4096 Sep 21 13:37 .ipynb_checkpoints


### Piping the output of one command to the input of another

It is possible to combine commands into a series, where the output of one command is used as the input of another. Such series build more complicated behaviors from a set of simple tools. For example, the `echo` command simply outputs its arguments as a string. Notice in this example that `\n` represents the newline character, so the '2' and '1' appear on separate lines (whether `echo` interprets `\n` depends on the shell; it does on the system used here):

In [129]:
!echo '2\n1'

2
1


Now we'll make things slightly more complicated by piping to the `sort` command, which sorts input lines and prints out the result. The pipe symbol `|` is used to link the output of the first command (aka STDOUT) to the input of the second (aka STDIN), and the two lines produced by `echo` become the input to `sort`:

In [130]:
!echo '2\n1' | sort

1
2


The lines are now ordered.
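The same pipeline can be approximated from a Python script without going through a shell. This is a minimal sketch, assuming a POSIX `sort` is available on the PATH: the text that `echo` would have produced is handed to `sort` on STDIN through the `input` parameter of `subprocess.run()`.

```python
import subprocess

# Feed two lines to `sort` on STDIN, as the pipe does with echo's output.
result = subprocess.run(
    ['sort'],
    input='2\n1\n',        # what `echo` would have written to STDOUT
    capture_output=True,   # collect sort's STDOUT and STDERR
    text=True,             # work with str instead of bytes
    check=True,            # raise CalledProcessError if sort fails
)
sorted_lines = result.stdout.splitlines()
print(sorted_lines)
```

Passing the text directly avoids starting a second process just to produce the input.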

## ifcformant

The first tool we'll look at is `ifcformant`, which performs formant analysis on an input audio file.

For a basic description of `ifcformant` see http://linguistics.berkeley.edu/plab/guestwiki/index.php?title=IFC_formant_tracker or do `ifcformant --help`.

It's always important to understand what kind of outputs your tools create. Execute `ifcformant` directly and observe what it produces.

In [131]:
!ifcformant --speaker male --print-header ../resources/two_plus_two.wav

sec	rms	f1	f2	f3	f4	f0
0.0050	96.5	264.6	1456.6	3314.5	3694.6	255.3
0.0150	60.1	259.2	1327.6	3376.1	3790.2	0.0
0.0250	49.5	271.7	1100.8	2774.3	3632.4	69.8
0.0350	49.5	249.0	1084.6	2689.8	3568.3	69.8
0.0450	75.5	234.6	1401.1	3068.3	3454.2	85.7
0.0550	429.1	327.8	1572.1	3163.7	3474.9	352.9
0.0650	817.1	359.9	1645.9	2800.7	3499.2	0.0
0.0750	1152.5	356.5	1617.0	2601.8	3450.1	0.0
0.0850	1152.5	350.3	1597.6	2460.5	3472.1	106.2
0.0950	1192.9	362.9	1595.5	2467.6	3526.0	78.4
0.1050	1286.3	352.8	1577.0	2439.2	3523.1	80.5
0.1150	1363.7	334.8	1641.5	2429.4	3512.3	79.6
0.1250	1363.7	329.7	1622.6	2476.5	3534.8	77.6
0.1350	1231.4	319.4	1638.6	2556.7	3551.9	78.2
0.1450	1099.3	343.2	1221.1	2352.4	3482.1	79.5
0.1550	1019.9	325.6	1310.1	2398.8	3558.2	79.3
0.1650	839.0	307.2	1616.7	2422.4	3564.1	80.5
0.1750	497.3	275.2	1659.6	2433.8	3522.7	78.7
0.1850	405.2	289.0	1511.5	2386.6	3410.9	77.6
0.1950	394.5	343.5	1279.6	2346.5	3431.3	0.0
0.2050	245.6	363.5	1394.9	2364.8	3507.0	149.2
0.2150	134.5	335.9	1588.0	24

As you can see, `ifcformant` creates a table of measurements at 10 ms intervals, one row per analysis frame. We use the `--print-header` option to include the names of the columns as the first row of output.

The columns are tab-separated by default. See `ifcformant --help` if you want a different separator.

### Using `subprocess`

Now we use the `subprocess` module to execute `ifcformant`. Start by creating a list of arguments. Notice how the command line we used to run `ifcformant` earlier is split into a list of strings, which is what `subprocess` methods will expect. The `ifcformant` command itself is the first item in the list, and its options and arguments follow.

In [132]:
ifcargs = [
    'ifcformant',
    '--speaker',
    'male',
    '--print-header',
    '../resources/two_plus_two.wav'
]
ifcargs

['ifcformant',
 '--speaker',
 'male',
 '--print-header',
 '../resources/two_plus_two.wav']
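If you already have the command line as a single string, the standard library's `shlex.split()` can build the argument list, and it respects shell-style quoting. A small sketch; the variable names `cmdline` and `ifcargs_from_string` are only for illustration:

```python
import shlex

# The command line exactly as it was typed after the `!` earlier.
cmdline = 'ifcformant --speaker male --print-header ../resources/two_plus_two.wav'

# shlex.split() breaks the string into arguments the way a POSIX shell would.
ifcargs_from_string = shlex.split(cmdline)
print(ifcargs_from_string)
```

Writing the list by hand, as above, is just as valid; `shlex.split()` is mainly convenient when arguments contain quoted spaces.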

The `check_output()` function executes the command defined by `ifcargs`, waits for it to finish, and returns its output, which is stored in `ifcout`.

In [133]:
try:
    ifcout = subprocess.check_output(ifcargs)
except subprocess.CalledProcessError:
    print('ifcformant failed with args ', ifcargs)

`check_output()` raises a `CalledProcessError` if the external command exits with a failure status. The try/except block prints a useful error message when that occurs. Notice what happens when the arguments to `ifcformant` are not correct.

In [134]:
badargs = ['ifcformant', '--speaker', 'male', 'missing_file']
try:
    ifcout = subprocess.check_output(badargs)
except subprocess.CalledProcessError:
    print('ifcformant failed with args ', badargs)

ifcformant failed with args  ['ifcformant', '--speaker', 'male', 'missing_file']
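The exception object carries more detail than the message printed above. The sketch below uses a stand-in command (a child Python process that writes to STDERR and exits with status 3, not `ifcformant` itself) to show the `returncode` and `stderr` attributes of `CalledProcessError`, along with the `FileNotFoundError` raised when the executable cannot be found at all.

```python
import subprocess
import sys

# Stand-in for a failing tool: writes a message to STDERR, exits with status 3.
failing_cmd = [
    sys.executable, '-c',
    "import sys; sys.stderr.write('no such file\\n'); sys.exit(3)",
]

try:
    subprocess.check_output(failing_cmd, stderr=subprocess.PIPE)
    status, message = 0, ''
except subprocess.CalledProcessError as err:
    status = err.returncode          # exit status reported by the command
    message = err.stderr.decode()    # whatever the command wrote to STDERR
except FileNotFoundError:
    # Raised when the executable itself does not exist.
    status, message = None, 'command not found'

print(status, message.strip())
```

Including the exit status and the STDERR text in your error message usually makes a failure much easier to diagnose.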


### Reading into a DataFrame

The pandas `read_csv()` function reads tabular text data into a DataFrame. Often it reads from a real text file, but it can also read from any file-like object. The output captured from `ifcformant` is a sequence of bytes, which can be converted to a string with `decode()`, and `StringIO` turns a string into a file-like object (one that responds to the `read()` method).

In [135]:
df = pd.read_csv(
    StringIO(                  # make file-like
        ifcout.decode('ascii') # convert to string; utf-8 also works
    ),
    sep='\t'
)
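The run-and-parse steps can be wrapped in one helper. This is a sketch rather than a fixed recipe: `command_to_dataframe` is a hypothetical name, and `fake_tool` is a stand-in command that prints three columns taken from the `ifcformant` output shown earlier, so the example runs without the tool installed. Passing `text=True` (Python 3.7+) makes `check_output()` return a string, so no explicit `decode()` is needed.

```python
import subprocess
import sys
from io import StringIO

import pandas as pd

def command_to_dataframe(args, sep='\t'):
    """Run a command and parse its delimited STDOUT into a DataFrame."""
    out = subprocess.check_output(args, text=True)  # str rather than bytes
    return pd.read_csv(StringIO(out), sep=sep)

# Stand-in for ifcformant: prints a header row and two tab-separated rows.
fake_tool = [
    sys.executable, '-c',
    r"print('sec\trms\tf1\n0.0050\t96.5\t264.6\n0.0150\t60.1\t259.2')",
]

df_demo = command_to_dataframe(fake_tool)
print(df_demo)
```

With the real tool you would call `command_to_dataframe(ifcargs)` inside the same try/except block used earlier.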

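When you run external processes with `subprocess` you should *always* check whether the command was successful. By convention a process returns 0 for success and a non-zero value to indicate an error. `check_output()` waits for the process to end and raises `CalledProcessError` on a non-zero return code, so the try/except block used earlier already performs this check. If you use the lower-level `Popen` interface instead, call `communicate()` to collect the output and wait for the process to end, then check `returncode` yourself.

In [136]:
ifcproc = subprocess.Popen(ifcargs, stdout=subprocess.PIPE)
ifcout, _ = ifcproc.communicate()   # read all output and wait for exit
if ifcproc.returncode != 0:
    print('ERROR running ifcformant with args ', ifcargs)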
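Another way to inspect the return code is `subprocess.run()` without `check=True`, which never raises on a non-zero exit and leaves the decision to you. A minimal sketch using a stand-in command; `exit_status` is a hypothetical helper, and the child Python process simply exits with the requested status:

```python
import subprocess
import sys

def exit_status(code):
    """Run a stand-in command that exits with `code`; return its status."""
    cmd = [sys.executable, '-c', f'import sys; sys.exit({code})']
    completed = subprocess.run(cmd, capture_output=True)  # waits for exit
    return completed.returncode

for code in (0, 2):
    status = exit_status(code)
    print('success' if status == 0 else f'ERROR: exit status {status}')
```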

If `ifcformant` exited successfully, then the DataFrame `df` should be filled with measurements. Take a look at the results.

In [None]:
df.mean()