# How to capture the output of command line tools to Pandas DataFrames, part 2

This notebook continues the topic introduced in [part 1](command_line_tools_to_df_1.ipynb): how to read the output of a command line tool into a pandas DataFrame.

## Overview

[Part 1](command_line_tools_to_df_1.ipynb) of this series explores how to capture text data from an external command in a way that `read_csv()` can use. If you understand that notebook you are ready to work through this one, where a second command is introduced to transform the output of one external command into a text format that `read_csv()` can handle.

Specifically, we'll use the ESPS `get_f0` command, which creates binary output that pandas does not read directly. A second ESPS tool `pplain` will be used to read `get_f0`'s output and convert it to a text format.

We'll look at three different ways of using these tools. The main difference among them is how many intermediate files are created during the processing chain. Generally speaking, techniques that eliminate intermediate files are a little more complicated to follow but a little cleaner to use, since they don't leave extra files lying around on disk.

These are the three approaches we'll explore:

1. [Caching of intermediate files](#caching)
1. [Hybrid caching/non-caching of intermediate files](#hybrid_caching)
1. [No caching of intermediate files](#no_caching)

Before we start with the different approaches, first we have to know how to run the ESPS tools we'll be using.

## Get to know your tools

There are three ESPS tools that will be used in this notebook, `get_f0`, `rem_dc`, and `pplain`. We'll get a sense of what each does by running them as `bash` commands.

**Important reminder**

This notebook is not a comprehensive introduction to ESPS tools; it introduces just enough detail about each one for you to understand the examples. Before you use any of these commands in your research you should read and understand their `man` pages. You can do this by typing `man <command>` in a terminal window--for the tools in this notebook, `man get_f0`, `man pplain`, or `man rem_dc`.

### `get_f0`

The `get_f0` command reads an input audio file and produces four voicing-related measurements (f0, prob_voice, rms, ac_peak) at regular frame intervals (default 10ms). These measurements can be written to a binary output file or to STDOUT.

A simple command line that uses `get_f0` looks like:

In [5]:
!get_f0 ../resources/two_plus_two.wav ../resources/two_plus_two.f0

The first filename is the input file and the second filename specifies the output file. To write to STDOUT you can use `-` as the filename:

In [6]:
!get_f0 ../resources/two_plus_two.wav -

   ¸          j                 j  Tue Sep 26 13:52:56 2017  1.97    get_f0          1.14    10/21/96                                             (   e  ubuntu                                   óÅô?ü      `Mg÷H8üÈ\ü                                                                                                                     F0  
 prob_voice   rms   ac_peak    n_cands        src_sf           pÇ@  voice_bias            lag_weight      >  cand_thresh     >  wind_dur        Âõ;  frame_step      
×#<  record_freq      `	  Y@  trans_amp          ?  min_f0        HB  max_f0       	D  double_cost     33³>  freq_weight     
×£<  trans_spec         ?  start_time         àQ¸n?  trans_cost      
×£;  get_f0 ../resources/two_plus_two.wav - 
      BCE:/home/ubuntu/src/phonlab-illustrations/notebook   ../resources/two_plus_two.wav    è    j                            1.97    p

It's not very useful to look at, is it?

### `rem_dc`

The `get_f0` documentation recommends running `rem_dc` on the input audio to remove the DC component (zero offset) of your audio before processing with `get_f0`. Like `get_f0`, the first argument to `rem_dc` identifies the input file and the second specifies the output file.

We can combine the two commands by having `get_f0` work on the output file created by `rem_dc`. In the next cell the `../resources/two_plus_two.rem_dc.wav` file is used to link the output of `rem_dc` to the input of `get_f0`.

In [8]:
%%bash
rem_dc ../resources/two_plus_two.wav ../resources/two_plus_two.rem_dc.wav
get_f0 ../resources/two_plus_two.rem_dc.wav ../resources/two_plus_two.f0

We can skip the file created by `rem_dc` by having it write to STDOUT and piping that output into `get_f0`. In this example only the final output is written to a binary file.

In [13]:
!rem_dc ../resources/two_plus_two.wav - | get_f0 - ../resources/two_plus_two.f0

The pipe symbol `|` connects STDOUT of the first command to STDIN of the second. The intermediate output from `rem_dc` is never saved as a separate file.

And the last example produces no intermediate files or final output file. The `get_f0` output is sent to STDOUT.

In [11]:
!rem_dc ../resources/two_plus_two.wav - | get_f0 - -

   ¸          j                 j  Tue Sep 26 13:59:15 2017  1.97    get_f0          1.14    10/21/96                                             (      ubuntu                                   óÅ;	      `½n÷ð;	p0;	                                                                                                                     F0  
 prob_voice   rms   ac_peak    n_cands        src_sf           pÇ@  voice_bias            lag_weight      >  cand_thresh     >  wind_dur        Âõ;  frame_step      
×#<  record_freq      `	  Y@  trans_amp          ?  min_f0        HB  max_f0       	D  double_cost     33³>  freq_weight     
×£<  trans_spec         ?  start_time         àQ¸n?  trans_cost      
×£;  get_f0 - - 
      BCE:/home/ubuntu/src/phonlab-illustrations/notebook   <stdin>     j  Tue Sep 26 13:59:15 2017  1.97    rem_dc          3.10    3/23/97                   ÿÿÿ

Again, the output is not easy for a human to interpret. For that we need `pplain`.

### `pplain`

The `pplain` command reads the binary output of ESPS tools and transforms it to a text format. Try the last command again and pipe its output to `pplain`. Like `rem_dc` and `get_f0`, `pplain` can read its input from a real file, or you can use `-` to indicate it should read from STDIN:

In [12]:
!rem_dc ../resources/two_plus_two.wav - | get_f0 - - | pplain -

116.97 1 0 0.915904 
116.426 1 85.3791 0.272178 
120.038 1 43.7041 0.570703 
113.028 1 27.5266 0.434183 
112.996 1 23.1241 0.599601 
115.1 1 49.0686 0.619396 
129.306 1 271.072 0.584563 
104.235 1 555.881 0.959619 
78.9803 1 669.251 0.905303 
80.584 1 755.271 0.937208 
80.0428 1 825.016 0.916295 
77.4492 1 878.609 0.971195 
76.7525 1 855.739 0.962965 
79.3813 1 834.785 0.978924 
79.4154 1 792.175 0.974874 
80.3465 1 783.147 0.989194 
79.0875 1 670.08 0.975477 
76.8653 1 533.766 0.928102 
77.0093 1 282.969 0.707372 
68.2491 1 249.097 0.690759 
61.1677 1 264.859 0.776066 
60.8746 1 154.763 0.737538 
57.914 1 74.3248 0.730815 
0 0 63.1954 0.722898 
0 0 45.799 0.669247 
0 0 47.0152 0.664364 
0 0 51.3827 0.426164 
0 0 395.121 0.668454 
0 0 837.448 0.398838 
0 0 746.569 0.54801 
0 0 718.077 0.41717 
0 0 533.599 0.491895 
0 0 338.788 0.398434 
0 0 441.555 0.496035 
0 0 2347.89 0.611058 
130.58 1 4218.09 0.930457 
116.288 1 4647.71 0.952275 
115.722 1 5013.

That output is something humans can read and `read_csv()` can parse.



<a id='caching'></a>
## Caching of intermediate files

We've seen how to run several ESPS utilities to produce a table of voicing-related measurements, and now we'll look at how to run these tools in a script. The first approach creates an intermediate file at each step of the process and is equivalent to the following series of command line statements:

    rem_dc ../resources/two_plus_two.wav ../resources/two_plus_two.rem_dc.wav
    get_f0 ../resources/two_plus_two.rem_dc.wav ../resources/two_plus_two.f0
    pplain ../resources/two_plus_two.f0 > ../resources/two_plus_two.f0.txt

Each of the ESPS tools reads a real input file and writes its output to a new file. To perform these steps in a script we need to translate each command to a separate `subprocess` call. Since we do not need to capture the output of the `rem_dc` and `get_f0` commands, we use `check_call()` to run them. `pplain` does not create an output file, however, and we need to save its output for reading into a DataFrame, so we use `check_output()`:
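The difference between the two calls can be seen without ESPS installed. The following is a minimal sketch that uses the Python interpreter itself as a stand-in for an external tool; the variable names (`py`, `status`, `captured`, `failed_code`) are illustrative and not part of the original workflow.

```python
import subprocess
import sys

py = sys.executable  # stand-in for an external command such as rem_dc or pplain

# check_call() runs the command and returns its exit status (0 on success).
# It does not capture anything the command writes to STDOUT.
status = subprocess.check_call([py, '-c', 'pass'])

# check_output() runs the command and returns what it wrote to STDOUT, as bytes.
captured = subprocess.check_output([py, '-c', "print('116.97 1 0 0.915904 ')"])

# Both functions raise CalledProcessError if the command exits with a
# non-zero status, which is why the try/except pattern works for either.
try:
    subprocess.check_call([py, '-c', 'import sys; sys.exit(2)'])
except subprocess.CalledProcessError as e:
    failed_code = e.returncode
```

In short, use `check_call()` when the tool writes its own output file and `check_output()` when you need to keep what the tool prints.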

In [189]:
# Load libraries
import sys
import subprocess
import pandas as pd
from io import StringIO

In [20]:
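# Build the names of the intermediate files from the input filename.
fname = '../resources/two_plus_two.wav'
remdc_out = fname.replace('.wav', '.rem_dc.wav')
f0_out = fname.replace('.wav', '.f0')
pplain_out = fname.replace('.wav', '.f0.txt')  # defined for reference; this cell does not write it
try:
    # rem_dc and get_f0 write their own output files, so only the exit status matters.
    subprocess.check_call(['rem_dc', fname, remdc_out])
    subprocess.check_call(['get_f0', remdc_out, f0_out])
    # pplain writes text to STDOUT, which check_output() returns as bytes.
    f0meas = subprocess.check_output(['pplain', f0_out])
except subprocess.CalledProcessError:
    print('Could not process file ', fname)
    raise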

The `get_f0` measurements, converted to text by `pplain`, are now stored in `f0meas` as a series of bytes. Take a close look at the first 105 bytes:

In [44]:
f0meas[:105]

b'116.97 1 0 0.915904 \n116.426 1 85.3791 0.272178 \n120.038 1 43.7041 0.570703 \n113.028 1 27.5266 0.434183 \n'

Several features stand out that are important to note for proper parsing by `read_csv()`.

1. Columns are separated by the space character, not a tab or comma. Use the `sep` parameter to select the separator.
1. There is an extraneous space character just before the record separator (newline `\n`). This means there are four space characters per record, and splitting the record on space will result in *five* columns produced, the last of which is empty. Use the `usecols` parameter to select only the first four columns for import.
1. There is no header provided. Use the `header` parameter to indicate that fact, and use `names` to label the columns.

In [52]:
df = pd.read_csv(
    StringIO(f0meas.decode('utf-8')),
    sep=' ',
    usecols=[0, 1, 2, 3],
    header=None,
    names=['f0', 'prob_voice', 'rms', 'ac_peak']
)
df

Unnamed: 0,f0,prob_voice,rms,ac_peak
0,116.9700,1,0.0000,0.915904
1,116.4260,1,85.3791,0.272178
2,120.0380,1,43.7041,0.570703
3,113.0280,1,27.5266,0.434183
4,112.9960,1,23.1241,0.599601
5,115.1000,1,49.0686,0.619396
6,129.3060,1,271.0720,0.584563
7,104.2350,1,555.8810,0.959619
8,78.9803,1,669.2510,0.905303
9,80.5840,1,755.2710,0.937208


You probably want to add a time column, [as discussed below](#time_column).

<a id='hybrid_caching'></a>
## Hybrid caching/non-caching of intermediate files

The next technique caches the output of `get_f0` but not the output of `rem_dc`. To accomplish this we link the output of `rem_dc` to the input of `get_f0` with a pipe, so the `rem_dc` output is never saved as a separate file.

In [257]:
try:
    rdproc = subprocess.Popen(['rem_dc', fname, '-'], stdout=subprocess.PIPE)
    f0proc = subprocess.Popen(['get_f0', '-', f0_out], stdin=rdproc.stdout)
    
    # Clean up and check for returncode errors.
    rdproc.stdout.close()
    rdproc.wait()
    if rdproc.returncode != 0:
        msg = 'Error in rem_dc. Returncode {:d}.'.format(rdproc.returncode)
        raise RuntimeError(msg)

    f0proc.wait()
    if f0proc.returncode != 0:
        msg = 'Error in get_f0. Returncode {:d}.'.format(f0proc.returncode)
        raise RuntimeError(msg)

    f0meas = subprocess.check_output(['pplain', f0_out])
except (subprocess.CalledProcessError, RuntimeError) as e:
    print('Could not process file ', fname)
    raise e

The process of creating the DataFrame from `f0meas` is the same as it was with the full cache approach.

In [258]:
df = pd.read_csv(
    StringIO(f0meas.decode('utf-8')),
    sep=' ',
    usecols=[0, 1, 2, 3],
    header=None,
    names=['f0', 'prob_voice', 'rms', 'ac_peak']
)
df

Unnamed: 0,f0,prob_voice,rms,ac_peak
0,116.9700,1,0.0000,0.915904
1,116.4260,1,85.3791,0.272178
2,120.0380,1,43.7041,0.570703
3,113.0280,1,27.5266,0.434183
4,112.9960,1,23.1241,0.599601
5,115.1000,1,49.0686,0.619396
6,129.3060,1,271.0720,0.584563
7,104.2350,1,555.8810,0.959619
8,78.9803,1,669.2510,0.905303
9,80.5840,1,755.2710,0.937208


<a id='no_caching'></a>
## No caching of intermediate files

The final approach we'll use does not create any intermediate files as part of the processing chain.

**It is not possible to use the pure no-caching technique with all ESPS commands. Some commands, like `formant`, always create output binary files and do not allow the possibility of sending output to STDOUT.** For these commands you can use the caching or hybrid caching techniques.
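For commands that do support `-` for STDIN and STDOUT, one way to chain them without intermediate files is to connect several `Popen` objects with pipes. The following is a sketch under stated assumptions, not a definitive implementation: `run_pipeline` is a hypothetical helper name, and the demonstration uses the Python interpreter as a stand-in so it runs without ESPS installed.

```python
import subprocess
import sys

def run_pipeline(commands):
    """Run a list of commands as a pipeline and return the final STDOUT as bytes.

    Hypothetical helper: each command is a list of arguments, and STDOUT of
    each command is connected to STDIN of the next.
    """
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Close our copy of the upstream pipe so the upstream command
            # is notified if the downstream command exits early.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)

    # Read everything the last command writes; nothing touches the disk.
    output, _ = procs[-1].communicate()

    # Check every command in the chain for a non-zero return code.
    for cmd, proc in zip(commands, procs):
        if proc.wait() != 0:
            msg = 'Error in {}. Returncode {:d}.'.format(cmd[0], proc.returncode)
            raise RuntimeError(msg)
    return output

# Stand-in demonstration: the first command prints text, the second upper-cases it.
py = sys.executable
demo = run_pipeline([
    [py, '-c', "print('f0 data')"],
    [py, '-c', "import sys; sys.stdout.write(sys.stdin.read().upper())"],
])

# With ESPS installed, the same helper could mirror the shell pipeline shown earlier:
# f0meas = run_pipeline([
#     ['rem_dc', fname, '-'],
#     ['get_f0', '-', '-'],
#     ['pplain', '-'],
# ])
```

The bytes returned for the ESPS pipeline would then be passed to `read_csv()` in the same way as in the caching approaches.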

<a id='time_column'></a>
## Adding a time column to your DataFrame

Whichever technique you use, the DataFrame that you construct from `get_f0` output does not contain a time column to indicate when each row's measurement occurred. The `get_f0` command simply does not include this value in its output.

So how do you know the timepoint for each row? You read the `man` page with `man get_f0`, where you can see that the default value for the `frame_step` parameter is 0.01 sec (10ms). Since we did not specify a non-default value for `frame_step`, we'll label the first row at 0.0 sec and increment each row's time value by 0.01 sec.

By default the `read_csv()` function creates an index for the DataFrame that begins with 0 and increases by 1 for each row; you've already seen this index as the row labels. If we simply multiply the index by the frame step, we have our desired time points:

In [56]:
df.index * 0.01

Float64Index([ 0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09,
              ...
              2.06, 2.07, 2.08, 2.09,  2.1, 2.11, 2.12, 2.13, 2.14, 2.15],
             dtype='float64', length=216)

And we can `assign()` these timepoints to a new column named `seconds`:

In [57]:
df = df.assign(seconds=df.index * 0.01)
df

Unnamed: 0,f0,prob_voice,rms,ac_peak,seconds
0,116.9700,1,0.0000,0.915904,0.00
1,116.4260,1,85.3791,0.272178,0.01
2,120.0380,1,43.7041,0.570703,0.02
3,113.0280,1,27.5266,0.434183,0.03
4,112.9960,1,23.1241,0.599601,0.04
5,115.1000,1,49.0686,0.619396,0.05
6,129.3060,1,271.0720,0.584563,0.06
7,104.2350,1,555.8810,0.959619,0.07
8,78.9803,1,669.2510,0.905303,0.08
9,80.5840,1,755.2710,0.937208,0.09
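If you run `get_f0` with a non-default `frame_step`, the multiplier has to change to match. The sketch below wraps the calculation in a small function; `add_time_column`, `start_time`, and `name` are hypothetical names introduced for illustration, and the function assumes the DataFrame still has the default 0, 1, 2, ... index created by `read_csv()`. Rounding is included because multiplying by a float can leave tiny representation errors in the result.

```python
import pandas as pd

def add_time_column(df, frame_step=0.01, start_time=0.0, name='seconds'):
    """Return a copy of df with a time column computed from the row index.

    Assumes df has the default integer index (0, 1, 2, ...).
    """
    times = start_time + df.index.to_numpy() * frame_step
    # Round to microseconds to hide floating point noise.
    return df.assign(**{name: times.round(6)})

# Small stand-in DataFrame using values from the get_f0 output above.
demo = pd.DataFrame({
    'f0': [116.97, 116.426, 120.038],
    'prob_voice': [1, 1, 1],
})
timed = add_time_column(demo, frame_step=0.005)
```

Here `timed` has a `seconds` column that starts at 0.0 and increases by the chosen frame step for each row, and `demo` itself is left unchanged.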


In [2]:
%%bash
man formant

FORMANT(1-ESPS)                                                FORMANT(1-ESPS)

NAME
       formant - speech formant and fundamental frequency (pitch) analysis

SYNOPSIS
       formant  [  -p  preemphasis ] [ -n num_formants ] [ -o lpc_order ] [ -i
       frame_step ] [ -w window_duration ] [ -W window_type ] [ -t lpc_type  ]
       [  -F  ]  [  -O  output_path  ] [ -r range ] [ -S ] [ -f ds_freq ] [ -y
       f0_max ] [ -z f0_min ] [ -N nom_f0_freq ] [ -B max_buff_bytes  ]  [  -R
       maxrms_duration ] [ -M maxrms_value ] [ -x debug_level ]  infile

DESCRIPTION
       Formant  estimates  speech  formant trajectories, fundamental frequency
       (F0) and related information.  In particular, for each frame of sampled
       data,  formant  estimates formant frequencies, formant bandwidths, pole
       frequencies corresponding to linear predictor coefficients, and voicing
       information   (fundamental  frequency,  voiced/unvoiced  decision,  rms
       energy, first normalized  a