# How to capture the output of command line tools to pandas DataFrames

Here's a general problem: How do you run a command line tool and use its output in a script?

One useful answer is to use the `subprocess` module to execute the command and then read its output into a pandas DataFrame. We'll illustrate this approach using tools found in the Berkeley Phonetics Machine.

## Getting started

We require three Python modules to perform the tasks, one for running the tools and two for creating DataFrames.

In [124]:
import subprocess       # Used to execute external commands

import pandas as pd     # Provides DataFrames
from io import StringIO # Converts command outputs to file-like objects
                        # that pandas can read.

The `subprocess` module executes command line tools in the context of a script. In a notebook it is also possible to execute command line tools directly, and in the cells that follow you'll sometimes see a line that begins with `!`. The `!` executes the tool that follows it and displays its output in the cell.

This example runs the `ls` command to list the contents of the current directory:

In [None]:
!ls

Options and arguments can be provided in the same way as they are at the command line:

In [None]:
!ls -al

### Piping the output of one command to the input of another

It is possible to combine commands into a series, where the output of one command is used as the input of another. Such a series, called a pipeline, builds more complicated behavior from a set of simple tools. For example, the `echo` command simply outputs its arguments as a string. Notice in this example that `\n` represents the newline character, so the '2' and '1' appear on separate lines:

In [None]:
!echo '2\n1'

Now we'll make things slightly more complicated by piping to the `sort` command, which sorts input lines and prints out the result. The pipe symbol `|` is used to link the output of the first command (aka STDOUT) to the input of the second (aka STDIN), and the two lines produced by `echo` become the input to `sort`:

In [None]:
!echo '2\n1' | sort

The lines are now ordered.
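The same pipeline can be approximated from Python without the `!` shortcut. This is a minimal sketch, assuming a Unix-like system where `sort` is on the PATH: the string passed as `input` plays the role of `echo`'s output and is fed to the STDIN of `sort`. The variable names are made up for illustration.

```python
import subprocess

# Text that `echo` would have produced; it becomes STDIN for `sort`.
unsorted_text = '2\n1\n'

result = subprocess.run(
    ['sort'],
    input=unsorted_text,   # sent to the STDIN of `sort`
    capture_output=True,   # collect STDOUT and STDERR
    text=True,             # work with str instead of bytes
    check=True,            # raise CalledProcessError on a non-zero exit
)

sorted_lines = result.stdout.splitlines()
print(sorted_lines)
```

Here `result.stdout` holds what `sort` wrote to STDOUT, which is the text the notebook cell above displays.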

## ifcformant

The first tool we'll look at is `ifcformant`, which performs formant analysis on an input audio file.

For a basic description of `ifcformant` see http://linguistics.berkeley.edu/plab/guestwiki/index.php?title=IFC_formant_tracker or do `ifcformant --help`.

It's always important to understand what kind of outputs your tools create. Execute `ifcformant` directly and observe what it produces.

In [None]:
!ifcformant --speaker male --print-header ../resources/two_plus_two.wav

As you can see, `ifcformant` creates a table of measurements at 10 ms intervals. We use the `--print-header` option to include the names of the columns as the first row of output.

The columns are tab-separated by default. See `ifcformant --help` if you want a different separator.

### Using `subprocess`

Now we use the `subprocess` module to execute `ifcformant`. Start by creating a list of arguments. Notice how the command line we used to run `ifcformant` earlier is split into a list of strings, which is what `subprocess` methods will expect. The `ifcformant` command itself is the first item in the list, and its options and arguments follow.

In [115]:
ifcargs = [
    'ifcformant',
    '--speaker',
    'male',
    '--print-header',
    '../resources/two_plus_two.wav'
]
ifcargs

['ifcformant',
 '--speaker',
 'male',
 '--print-header',
 '../resources/two_plus_two.wav']
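If you already have the command line as a single string, the standard library's `shlex.split()` can produce the list for you, following shell-like quoting rules. A small sketch; `command_line` and `split_args` are names chosen here for illustration:

```python
import shlex

# The command line exactly as it was typed after the `!` earlier.
command_line = (
    'ifcformant --speaker male --print-header ../resources/two_plus_two.wav'
)

# Split on whitespace while respecting any quoted arguments.
split_args = shlex.split(command_line)
print(split_args)
```

The result should match the hand-written `ifcargs` list. Building the list yourself remains the safer habit when file names may contain spaces or quotes.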

The `check_output()` function executes the command defined by `ifcargs`, waits for it to finish, and returns everything the command wrote to STDOUT, which we store in `ifcout`.

In [118]:
try:
    ifcout = subprocess.check_output(ifcargs)
except subprocess.CalledProcessError:
    print('ifcformant failed with args ', ifcargs)

`check_output()` raises a `CalledProcessError` if the external command exits with a failure status. The try/except block prints a useful error message if that occurs. Notice what happens when the arguments to `ifcformant` are not correct.

In [121]:
badargs = ['ifcformant', '--speaker', 'male', 'missing_file']
try:
    ifcout = subprocess.check_output(badargs)
except subprocess.CalledProcessError:
    print('ifcformant failed with args ', badargs)

ifcformant failed with args  ['ifcformant', '--speaker', 'male', 'missing_file']
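The exception object itself carries details that can make the error message more informative. The sketch below uses the Python interpreter (`sys.executable`) as a stand-in for a failing tool so that it runs without `ifcformant` installed; `failing_args` and `failure` are hypothetical names. A separate `FileNotFoundError` is raised when the executable itself cannot be found.

```python
import subprocess
import sys

# Stand-in for a failing tool: a Python one-liner that writes a message
# to STDERR and exits with status 3.
failing_args = [
    sys.executable, '-c',
    'import sys; sys.stderr.write("no such file\\n"); sys.exit(3)',
]

failure = None
try:
    # stderr=PIPE captures the tool's error text on the exception object.
    subprocess.check_output(failing_args, stderr=subprocess.PIPE)
except subprocess.CalledProcessError as err:
    failure = err
    print('command failed with return code', err.returncode)
    print('stderr was:', err.stderr.decode().strip())
except FileNotFoundError:
    # Raised when the executable is missing, e.g. the tool is not installed.
    print('command not found:', failing_args[0])
```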


### Reading into a DataFrame

The pandas `read_csv()` function reads tabular text data into a DataFrame. Often it reads from a real text file, but it can also read from any file-like object, that is, one that provides a `read()` method. The output captured from `ifcformant` is a `bytes` object, which `decode()` converts to a string, and `StringIO` wraps that string in a file-like object.

In [None]:
df = pd.read_csv(
    StringIO(                  # make file-like
        ifcout.decode('ascii') # convert to string; utf-8 also works
    ),
    sep='\t'
)
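To try the conversion without running `ifcformant`, the sketch below uses a few made-up tab-separated bytes in place of real tool output (the values are invented, not actual measurements). It also shows an alternative: `read_csv()` accepts a binary file-like object, so wrapping the bytes in `BytesIO` skips the explicit `decode()` step.

```python
from io import BytesIO, StringIO

import pandas as pd

# Made-up bytes shaped like tab-separated tool output; not real data.
sample_out = b'sec\tf1\tf2\n0.01\t500.0\t1500.0\n0.02\t510.0\t1490.0\n'

# Route 1: decode to str, then wrap in StringIO (as in the cell above).
df_from_text = pd.read_csv(StringIO(sample_out.decode('utf-8')), sep='\t')

# Route 2: hand the raw bytes to pandas through BytesIO.
df_from_bytes = pd.read_csv(BytesIO(sample_out), sep='\t')

print(df_from_text.equals(df_from_bytes))
```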

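When you run external processes with `subprocess` you should *always* check whether the command was successful. By convention, a process returns 0 for success and a non-zero value to indicate an error. `check_output()` already waits for the process to end and performs this check, raising `CalledProcessError` on a non-zero return code, so no separate call to `communicate()` is needed. If you prefer to inspect the return code yourself, `subprocess.run()` returns a `CompletedProcess` object with a `returncode` attribute, and its `stdout` attribute holds the same bytes that `check_output()` returns.

In [122]:
ifcproc = subprocess.run(ifcargs, stdout=subprocess.PIPE)
if ifcproc.returncode != 0:
    print('ERROR running ifcformant with args ', ifcargs)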

If `ifcformant` exited successfully, then the DataFrame `df` should be filled with measurements. Take a look at the results.

In [125]:
df.mean()

sec       1.075000
rms    2645.992093
f1      441.726977
f2     1423.562791
f3     2498.030233
f4     3434.841860
f0      109.912558
dtype: float64
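The steps above can be collected into one small helper. This is a sketch under stated assumptions, not part of the Berkeley Phonetics Machine: `command_to_df` is a hypothetical name, and the demonstration runs a Python one-liner (via `sys.executable`) that prints a made-up table, standing in for a real tool.

```python
import subprocess
import sys
from io import StringIO

import pandas as pd


def command_to_df(args, sep='\t', encoding='utf-8'):
    """Run `args` and parse its STDOUT as delimited text into a DataFrame.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    raw = subprocess.check_output(args)  # bytes written to STDOUT
    return pd.read_csv(StringIO(raw.decode(encoding)), sep=sep)


# Stand-in command that prints a header row and two rows of made-up values.
demo_args = [
    sys.executable, '-c',
    r'print("sec\tf1\n0.01\t500\n0.02\t510")',
]
demo_df = command_to_df(demo_args)
print(demo_df)
```

With `ifcformant` installed, `command_to_df(ifcargs)` would be the equivalent call for the example in this notebook.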