# Reading text files into dataframes

In this notebook we will learn to read formatted text files into dataframes. Reading from a file is probably the most common way you will create dataframes in your research.

We'll use `read_csv()` from the pandas library and `read_label()` from audiolabel.

In [1]:
import pandas as pd
from audiolabel import read_label

# `read_csv()`

The `read_csv()` function reads tabular text files. Let's start with the output of `ifcformant`, which has been saved to a file with the extension `.ifc`. We'll use `!` to run the shell command `head` to look at the first few lines of this file.

In [2]:
!head resource/two_plus_two_1.ifc

sec	rms	f1	f2	f3	f4	f0
0.0050	96.5	264.6	1456.6	3314.5	3694.6	255.3
0.0150	60.1	259.2	1327.6	3376.1	3790.2	0.0
0.0250	49.5	271.7	1100.8	2774.3	3632.4	69.8
0.0350	49.5	249.0	1084.6	2689.8	3568.3	69.8
0.0450	75.5	234.6	1401.1	3068.3	3454.2	85.7
0.0550	429.1	327.8	1572.1	3163.7	3474.9	352.9
0.0650	817.1	359.9	1645.9	2800.7	3499.2	0.0
0.0750	1152.5	356.5	1617.0	2601.8	3450.1	0.0
0.0850	1152.5	350.3	1597.6	2460.5	3472.1	106.2


The output shows a table of data with a header row to define the column names. The first column contains time values corresponding to the time of the analysis frames, and the remaining columns show the RMS and F0, F1-F4 measures for each frame.

## Files with a header

Since the `.ifc` file is clearly tabular data, let's try reading it with `read_csv()`.  By default `read_csv()` determines the column names from the header in the first row of the file. To shorten the output we'll use the dataframe's `head()` method to show only the first few rows of the result.

In [3]:
ifcfile = 'resource/two_plus_two_1.ifc'
ifcdf = pd.read_csv(ifcfile)
ifcdf.head()

Unnamed: 0,sec	rms	f1	f2	f3	f4	f0
0,0.0050\t96.5\t264.6\t1456.6\t3314.5\t3694.6\t2...
1,0.0150\t60.1\t259.2\t1327.6\t3376.1\t3790.2\t0.0
2,0.0250\t49.5\t271.7\t1100.8\t2774.3\t3632.4\t69.8
3,0.0350\t49.5\t249.0\t1084.6\t2689.8\t3568.3\t69.8
4,0.0450\t75.5\t234.6\t1401.1\t3068.3\t3454.2\t85.7


The results don't look quite right. The header and data rows were identified correctly, but there is only one column. By default `read_csv()` expects `','` as the column separator, and our columns are separated by a tab, `'\t'`. Use the `sep` parameter to choose a non-default separator.

In [4]:
ifcdf = pd.read_csv(ifcfile, sep='\t')
ifcdf.head()

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0
0,0.005,96.5,264.6,1456.6,3314.5,3694.6,255.3
1,0.015,60.1,259.2,1327.6,3376.1,3790.2,0.0
2,0.025,49.5,271.7,1100.8,2774.3,3632.4,69.8
3,0.035,49.5,249.0,1084.6,2689.8,3568.3,69.8
4,0.045,75.5,234.6,1401.1,3068.3,3454.2,85.7


The output correctly separates the columns as well as the rows.
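
To see the effect of `sep` without depending on the resource file, the following minimal sketch reads the first lines of the output shown earlier from an in-memory string. The `sample` string and the use of `io.StringIO` are assumptions made for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Header plus the first two data rows of the .ifc output, tab-separated.
sample = (
    "sec\trms\tf1\tf2\tf3\tf4\tf0\n"
    "0.0050\t96.5\t264.6\t1456.6\t3314.5\t3694.6\t255.3\n"
    "0.0150\t60.1\t259.2\t1327.6\t3376.1\t3790.2\t0.0\n"
)

# Default separator (','): no commas are found, so each line is one column.
comma_df = pd.read_csv(io.StringIO(sample))

# Tab separator: one column per measure.
tab_df = pd.read_csv(io.StringIO(sample), sep='\t')

print(comma_df.shape)        # (2, 1)
print(tab_df.shape)          # (2, 7)
print(list(tab_df.columns))  # ['sec', 'rms', 'f1', 'f2', 'f3', 'f4', 'f0']
```

Checking `shape` right after reading is a quick way to notice a wrong separator: a single column usually means `sep` does not match the file.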

## Files without a header

Some tabular text files do not include a header line. The `ifcformant` command, for instance, does not include a header unless you use the `--print-header` option. If you forgot to use it, then the output looks like:

In [5]:
!head resource/two_plus_two_1.nohead.ifc

0.0050	96.5	264.6	1456.6	3314.5	3694.6	255.3
0.0150	60.1	259.2	1327.6	3376.1	3790.2	0.0
0.0250	49.5	271.7	1100.8	2774.3	3632.4	69.8
0.0350	49.5	249.0	1084.6	2689.8	3568.3	69.8
0.0450	75.5	234.6	1401.1	3068.3	3454.2	85.7
0.0550	429.1	327.8	1572.1	3163.7	3474.9	352.9
0.0650	817.1	359.9	1645.9	2800.7	3499.2	0.0
0.0750	1152.5	356.5	1617.0	2601.8	3450.1	0.0
0.0850	1152.5	350.3	1597.6	2460.5	3472.1	106.2
0.0950	1192.9	362.9	1595.5	2467.6	3526.0	78.4


If we use the same `read_csv()` call that we used previously, then the first row is interpreted as column labels rather than data.

In [6]:
nohead = 'resource/two_plus_two_1.nohead.ifc'
nhdf = pd.read_csv(nohead, sep='\t')
nhdf.head()

Unnamed: 0,0.0050,96.5,264.6,1456.6,3314.5,3694.6,255.3
0,0.015,60.1,259.2,1327.6,3376.1,3790.2,0.0
1,0.025,49.5,271.7,1100.8,2774.3,3632.4,69.8
2,0.035,49.5,249.0,1084.6,2689.8,3568.3,69.8
3,0.045,75.5,234.6,1401.1,3068.3,3454.2,85.7
4,0.055,429.1,327.8,1572.1,3163.7,3474.9,352.9


We can signal that `read_csv()` should not expect to find a header row by adding `header=None` as an additional parameter.

In [7]:
nhdf = pd.read_csv(nohead, sep='\t', header=None)
nhdf.head()

Unnamed: 0,0,1,2,3,4,5,6
0,0.005,96.5,264.6,1456.6,3314.5,3694.6,255.3
1,0.015,60.1,259.2,1327.6,3376.1,3790.2,0.0
2,0.025,49.5,271.7,1100.8,2774.3,3632.4,69.8
3,0.035,49.5,249.0,1084.6,2689.8,3568.3,69.8
4,0.045,75.5,234.6,1401.1,3068.3,3454.2,85.7


The first row is now correctly read as a data row. The columns receive default labels that are integers starting with `0`, just as the rows are numbered. Use the `names` parameter to provide meaningful names when you read the file.

In [8]:
nhdf = pd.read_csv(
    nohead, sep='\t', header=None,
    names=['sec', 'rms', 'f1', 'f2', 'f3', 'f4', 'f0']
)
nhdf.head()

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0
0,0.005,96.5,264.6,1456.6,3314.5,3694.6,255.3
1,0.015,60.1,259.2,1327.6,3376.1,3790.2,0.0
2,0.025,49.5,271.7,1100.8,2774.3,3632.4,69.8
3,0.035,49.5,249.0,1084.6,2689.8,3568.3,69.8
4,0.045,75.5,234.6,1401.1,3068.3,3454.2,85.7
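
The three ways of reading a header-less file can be compared side by side. This is a small sketch that assumes an in-memory copy of the first three data rows (the `sample` string is illustrative, not a file from the notebook); it shows that the default call silently uses the first data row as column labels and returns one fewer row.

```python
import io
import pandas as pd

# First three data rows of the header-less .ifc output, tab-separated.
sample = (
    "0.0050\t96.5\t264.6\t1456.6\t3314.5\t3694.6\t255.3\n"
    "0.0150\t60.1\t259.2\t1327.6\t3376.1\t3790.2\t0.0\n"
    "0.0250\t49.5\t271.7\t1100.8\t2774.3\t3632.4\t69.8\n"
)
cols = ['sec', 'rms', 'f1', 'f2', 'f3', 'f4', 'f0']

# Default: the first row is consumed as the header.
as_header = pd.read_csv(io.StringIO(sample), sep='\t')

# header=None: every row is data; columns get integer labels.
no_header = pd.read_csv(io.StringIO(sample), sep='\t', header=None)

# header=None plus names: every row is data; columns get our labels.
named = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols)

print(len(as_header))           # 2
print(list(no_header.columns))  # [0, 1, 2, 3, 4, 5, 6]
print(len(named), list(named.columns))
```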


# `read_label()`

The `audiolabel` package provides the `read_label()` function that reads one or more phonetic label files and produces dataframes. Label files can be more complicated than tabular text data; in particular, they can contain multiple annotation tiers, and `read_label()` maps each tier to a separate dataframe. The tier dataframes are returned as a list.

The `two_plus_two_1.tg` file is a Praat textgrid as created by `pyalign`. This file contains alignments of phonemes to an audio signal in the `phone` tier and word alignments in the `word` tier.

To read the file and produce dataframes for each tier, we provide `read_label()` with the name of the label file and its format.

In [9]:
two1 = 'resource/two_plus_two_1.tg'
[phdf, wddf] = read_label(two1, 'praat')
phdf.head()

Unnamed: 0,t1,t2,label,fname
0,0.0125,0.3417,T,resource/two_plus_two_1.tg
1,0.3417,0.4914,UW1,resource/two_plus_two_1.tg
2,0.4914,0.5912,P,resource/two_plus_two_1.tg
3,0.5912,0.6211,L,resource/two_plus_two_1.tg
4,0.6211,0.6909,AH1,resource/two_plus_two_1.tg


The `phdf` dataframe contains the labels from the `phone` tier. The `t1` value records the start time of the phone alignment, and `t2` records the end of the alignment. The label text is in the `label` column, and `fname` preserves the name of the input file that is the source of the row.

The `wddf` dataframe has the same format with different data. Now the rows correspond to labels from the `word` tier.

In [10]:
wddf

Unnamed: 0,t1,t2,label,fname
0,0.0125,0.4914,TWO,resource/two_plus_two_1.tg
1,0.4914,0.8805,PLUS,resource/two_plus_two_1.tg
2,0.8805,1.3195,TWO,resource/two_plus_two_1.tg
3,1.3195,1.3594,sp,resource/two_plus_two_1.tg
4,1.3594,1.7585,EQUALS,resource/two_plus_two_1.tg
5,1.7585,1.8283,sp,resource/two_plus_two_1.tg
6,1.8283,2.1975,FOUR,resource/two_plus_two_1.tg


You can also pass a list of files to `read_label()`. If they all have compatible tier structures, then the corresponding tiers from each file are concatenated to form the tier dataframes that are returned.

In [11]:
three1 = 'resource/three_plus_five_1.tg'
flist = [two1, three1]
[phdf, wddf] = read_label(flist, 'praat')
wddf

Unnamed: 0,t1,t2,label,fname
0,0.0125,0.4914,TWO,resource/two_plus_two_1.tg
1,0.4914,0.8805,PLUS,resource/two_plus_two_1.tg
2,0.8805,1.3195,TWO,resource/two_plus_two_1.tg
3,1.3195,1.3594,sp,resource/two_plus_two_1.tg
4,1.3594,1.7585,EQUALS,resource/two_plus_two_1.tg
5,1.7585,1.8283,sp,resource/two_plus_two_1.tg
6,1.8283,2.1975,FOUR,resource/two_plus_two_1.tg
7,0.0125,0.4116,THREE,resource/three_plus_five_1.tg
8,0.4116,0.8107,PLUS,resource/three_plus_five_1.tg
9,0.8107,1.2696,FIVE,resource/three_plus_five_1.tg


The `wddf` dataframe now contains each of the labels from both input files. The `fname` column tells you which file was the source of the label. 
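
One possible use of the `fname` column is to summarize labels per input file. The sketch below does not call `read_label()`; it assumes a hand-built dataframe that copies the `label` and `fname` values from the output above, then counts the rows that came from each file.

```python
import pandas as pd

two = 'resource/two_plus_two_1.tg'
three = 'resource/three_plus_five_1.tg'

# Hand-built stand-in for the combined word tier shown above.
wddf = pd.DataFrame({
    'label': ['TWO', 'PLUS', 'TWO', 'sp', 'EQUALS', 'sp', 'FOUR',
              'THREE', 'PLUS', 'FIVE'],
    'fname': [two] * 7 + [three] * 3,
})

# Number of word labels contributed by each source file.
counts = wddf.groupby('fname').size()
print(counts[two])    # 7
print(counts[three])  # 3
```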

## `read_label()` options

### `codec`

Normally the encoding of Praat textgrids can be detected automatically. If automatic detection fails, you can use the `codec` parameter to specify the file encoding. All input files to `read_label()` must have the same encoding.

### `tiers`

Use the `tiers` parameter to select which tiers from the label files to return. The tiers are specified as a list of tier names. For label files that do not have named tiers, you can use a list of tier indexes instead.

### `addcols`

`read_label()` automatically adds the `fname` column to the output tier dataframes. You can ask for additional file-related columns by passing a list of column names to the `addcols` parameter. The names must be one or more of:

- `'barename'`: the label file's bare name, with no path information or extension
- `'dirname'`: the user-provided path to the label file, without the filename
- `'ext'`: the label file's extension
- `'fidx'`: the index of the label file in `fname` (its position in the list of input files)
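
The meaning of these columns can be approximated with the standard library. The following is only a sketch using `os.path`, under the assumption that the values are derived from the path string as described above; it is not `audiolabel`'s own code, and the helper name `file_columns` is hypothetical.

```python
import os

def file_columns(flist):
    """Hypothetical helper: derive the file-related values for each input path."""
    rows = []
    for fidx, fname in enumerate(flist):
        # Split off the extension from the final path component.
        barename, ext = os.path.splitext(os.path.basename(fname))
        rows.append({
            'barename': barename,                # no path, no extension
            'dirname': os.path.dirname(fname),   # path without the filename
            'ext': ext,                          # extension, including the dot
            'fidx': fidx,                        # position in the input list
            'fname': fname,
        })
    return rows

rows = file_columns([
    'resource/two_plus_two_1.tg',
    'resource/three_plus_five_1.tg',
])
for row in rows:
    print(row)
```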
  
### `stop_on_error`

By default `read_label()` will raise an error if it is unable to process any of its input files. You can set `stop_on_error=False` so that a warning is printed to STDERR instead and processing continues, if possible. Labels from unprocessed files will not be included in the output tier dataframes.

### `ignore_index`

By default `read_label()` creates an index for each output tier dataframe that starts at `0` and increases by one for each row. If you use `ignore_index=False` then the index starts at `0` for each group of labels from a single input file.

The following call illustrates all of the options except `stop_on_error`. Notice that the index is `0` through `6` for the labels from the first file, then starts at `0` again for the second file.

In [12]:
[wddf] = read_label(
    flist,
    'praat',
    codec='utf-8',
    tiers=['word'],
    addcols=['barename', 'dirname', 'ext', 'fidx'],
    ignore_index=False
)
wddf

Unnamed: 0,t1,t2,label,barename,dirname,ext,fidx,fname
0,0.0125,0.4914,TWO,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
1,0.4914,0.8805,PLUS,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
2,0.8805,1.3195,TWO,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
3,1.3195,1.3594,sp,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
4,1.3594,1.7585,EQUALS,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
5,1.7585,1.8283,sp,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
6,1.8283,2.1975,FOUR,two_plus_two_1,resource,.tg,0,resource/two_plus_two_1.tg
0,0.0125,0.4116,THREE,three_plus_five_1,resource,.tg,1,resource/three_plus_five_1.tg
1,0.4116,0.8107,PLUS,three_plus_five_1,resource,.tg,1,resource/three_plus_five_1.tg
2,0.8107,1.2696,FIVE,three_plus_five_1,resource,.tg,1,resource/three_plus_five_1.tg
