# Selecting and indexing

In this chapter we explore how to select columns and/or rows from dataframes to create subsets of data.

In [1]:
import pandas as pd
from audiolabel import read_label

In [2]:
flist = ['resource/two_plus_two_1.tg', 'resource/three_plus_five_1.tg']
[phdf, wddf] = read_label(flist, 'praat', addcols=['barename'], ignore_index=False)
phdf.head()

Unnamed: 0,t1,t2,label,barename,fname
0,0.0125,0.3417,T,two_plus_two_1,resource/two_plus_two_1.tg
1,0.3417,0.4914,UW1,two_plus_two_1,resource/two_plus_two_1.tg
2,0.4914,0.5912,P,two_plus_two_1,resource/two_plus_two_1.tg
3,0.5912,0.6211,L,two_plus_two_1,resource/two_plus_two_1.tg
4,0.6211,0.6909,AH1,two_plus_two_1,resource/two_plus_two_1.tg


## Selecting columns

There are (at least) three ways to access a single column of a dataframe.

### attribute access

Column names are added as attributes of the dataframe:

In [3]:
phdf.t1.head()

0    0.0125
1    0.3417
2    0.4914
3    0.5912
4    0.6211
Name: t1, dtype: float64

Attribute access is particularly useful for interactive data exploration because it works well with tab completion. Click after the dot `'.'` in the following cell and press `Tab` to explore available methods.

In [None]:
phdf.t1.

### dict key style access

In addition to attribute style selection, you can select columns with a dict-like syntax.

In [4]:
phdf['t1'].head()

0    0.0125
1    0.3417
2    0.4914
3    0.5912
4    0.6211
Name: t1, dtype: float64

Unlike a dict, however, this syntax also accepts a list of column names, which selects multiple columns and returns a dataframe rather than a series.

In [5]:
phdf[['t1', 't2', 'label']].head()

Unnamed: 0,t1,t2,label
0,0.0125,0.3417,T
1,0.3417,0.4914,UW1
2,0.4914,0.5912,P
3,0.5912,0.6211,L
4,0.6211,0.6909,AH1
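
Bracket access also works in cases where attribute access does not. The sketch below uses a small made-up dataframe (`toy` and its column names are assumptions for illustration, not part of the chapter's data) in which one column name contains a space and another has the same name as the dataframe method `count`.

```python
import pandas as pd

# Hypothetical toy data: 'my col' is not a valid Python identifier and
# 'count' collides with the DataFrame.count method.
toy = pd.DataFrame({'t1': [0.1, 0.2], 'my col': [1, 2], 'count': [5, 6]})

print(toy.t1.tolist())         # a plain identifier works as an attribute
print(callable(toy.count))     # toy.count is the method, not the column
print(toy['count'].tolist())   # bracket access reaches the column
print(toy['my col'].tolist())  # names with spaces need brackets too
```

For this reason attribute access is probably best kept for interactive exploration, with bracket or `.loc` access used in code that has to handle arbitrary column names.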


### `.loc` access

The most general kind of selection is with `.loc`, which allows label-based indexing of rows and columns. To get entire columns, use `:` as the row indexer, which indicates all rows should be returned.

Notice that `.loc` is accessed like an attribute (it is introduced with a dot `'.'`) but is followed by indexing square brackets `[]` rather than the parens `()` of a method call.

In [6]:
phdf.loc[:, 't1'].head()

0    0.0125
1    0.3417
2    0.4914
3    0.5912
4    0.6211
Name: t1, dtype: float64

In [7]:
phdf.loc[:, ['t1', 't2', 'label']].head()

Unnamed: 0,t1,t2,label
0,0.0125,0.3417,T
1,0.3417,0.4914,UW1
2,0.4914,0.5912,P
3,0.5912,0.6211,L
4,0.6211,0.6909,AH1
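
Because `.loc` is label-based, it also accepts a slice of column labels, and such a slice includes both endpoints. The following is a minimal sketch on a made-up dataframe (`toy` and its contents are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with the same kinds of columns as phdf.
toy = pd.DataFrame({
    't1': [0.0, 0.5],
    't2': [0.5, 1.0],
    'label': ['TWO', 'PLUS'],
    'fname': ['a.tg', 'a.tg'],
})

# A label slice selects every column from 't1' through 'label', inclusive.
subset = toy.loc[:, 't1':'label']
print(subset.columns.tolist())   # ['t1', 't2', 'label']
```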


## Row selection

You can do label-based selection of rows as well as columns with `.loc`, though you are less likely to need it. The labels you use are drawn from the dataframe's index. As with the row indexer, `:` in the column position indicates that all columns are selected.

In [8]:
phdf.loc[[0, 1], :]

Unnamed: 0,t1,t2,label,barename,fname
0,0.0125,0.3417,T,two_plus_two_1,resource/two_plus_two_1.tg
0,0.0125,0.1222,TH,three_plus_five_1,resource/three_plus_five_1.tg
1,0.3417,0.4914,UW1,two_plus_two_1,resource/two_plus_two_1.tg
1,0.1222,0.222,R,three_plus_five_1,resource/three_plus_five_1.tg
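
Selecting two labels returned four rows here because the index labels repeat: each label occurs once per input file. The sketch below reproduces that behavior with two made-up dataframes (`a`, `b`, and the `src` column are assumptions for illustration) and shows one way to obtain unique labels if you need them.

```python
import pandas as pd

# Hypothetical stand-ins for the rows read from two separate files.
a = pd.DataFrame({'label': ['T', 'UW1'], 'src': ['first', 'first']})
b = pd.DataFrame({'label': ['TH', 'R'], 'src': ['second', 'second']})

stacked = pd.concat([a, b])                 # each frame keeps its own 0, 1
print(stacked.index.tolist())               # [0, 1, 0, 1]
print(stacked.loc[[0], 'label'].tolist())   # ['T', 'TH']

# reset_index(drop=True) renumbers the rows so every label is unique.
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())              # [0, 1, 2, 3]
print(renumbered.loc[[0], 'label'].tolist())  # ['T']
```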


## Boolean indexing

More commonly you will select rows based on whether the rows meet some condition(s), and `.loc` supports this style of boolean indexing as well.

Many of the comparison operators do element-by-element evaluation of each row in a dataframe column, which means the result of the comparison is a series of True/False values that has the same number of rows as the input column.

In [9]:
wddf.t1 < 1.0

0     True
1     True
2     True
3    False
4    False
5    False
6    False
0     True
1     True
2     True
3    False
4    False
5    False
6    False
Name: t1, dtype: bool

This series of boolean values can be used to select all the rows where the value is True.

In [10]:
wddf.loc[wddf.t1 < 1.0, :]

Unnamed: 0,t1,t2,label,barename,fname
0,0.0125,0.4914,TWO,two_plus_two_1,resource/two_plus_two_1.tg
1,0.4914,0.8805,PLUS,two_plus_two_1,resource/two_plus_two_1.tg
2,0.8805,1.3195,TWO,two_plus_two_1,resource/two_plus_two_1.tg
0,0.0125,0.4116,THREE,three_plus_five_1,resource/three_plus_five_1.tg
1,0.4116,0.8107,PLUS,three_plus_five_1,resource/three_plus_five_1.tg
2,0.8107,1.2696,FIVE,three_plus_five_1,resource/three_plus_five_1.tg
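
The same steps can be sketched on a small made-up dataframe (`toy` is an assumption for illustration). The mask has one True/False value per row, and summing it counts the rows that will be selected, which is a handy check before indexing.

```python
import pandas as pd

# Hypothetical toy data: three word intervals.
toy = pd.DataFrame({'t1': [0.0, 0.5, 1.2], 'label': ['TWO', 'PLUS', 'TWO']})

mask = toy.t1 < 1.0
print(mask.tolist())                     # [True, True, False]
print(len(mask) == len(toy))             # True
print(int(mask.sum()))                   # 2, since True counts as 1
print(toy.loc[mask, 'label'].tolist())   # ['TWO', 'PLUS']
```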


You can build more complex conditions by combining multiple conditions with logical AND `&` or OR `|`.

In [11]:
(wddf.t1 < 1.0) & (wddf.label.isin(['TWO', 'THREE']))

0     True
1    False
2     True
3    False
4    False
5    False
6    False
0     True
1    False
2    False
3    False
4    False
5    False
6    False
dtype: bool

The parens are obligatory because `&` and `|` bind more tightly than comparison operators such as `<`. Without them Python first tries to evaluate `1.0 & wddf.label.isin(['TWO', 'THREE'])` rather than comparing `wddf.t1` with `1.0`. If you forget the parens expect to see an error (or in older versions of pandas, a warning).

In [12]:
wddf.t1 < 1.0 & wddf.label.isin(['TWO', 'THREE'])

TypeError: cannot compare a dtyped [bool] array with a scalar of type [bool]
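
One way to see how Python groups the unparenthesized expression is to inspect its parse tree with the standard library's `ast` module. In this sketch the names `a`, `b`, and `c` are placeholders and `toy` is a made-up dataframe, all assumptions for illustration.

```python
import ast
import pandas as pd

# Parse `a < b & c` without evaluating it. The top-level node is a
# comparison whose right-hand side is the `b & c` operation, so `&`
# is applied before `<`.
tree = ast.parse("a < b & c", mode="eval").body
print(type(tree).__name__)                  # Compare
print(type(tree.comparators[0]).__name__)   # BinOp

# With explicit parens each comparison is evaluated first, then combined.
toy = pd.DataFrame({'t1': [0.0, 0.5, 1.2], 'label': ['TWO', 'PLUS', 'TWO']})
mask = (toy.t1 < 1.0) & (toy.label.isin(['TWO', 'THREE']))
print(mask.tolist())                        # [True, False, False]
```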

And of course you can perform row and column selection simultaneously.

In [14]:
wddf.loc[
    (wddf.t1 < 1.0) & (wddf.label.isin(['TWO', 'THREE'])),   # rows
    ['t1', 't2', 'label']                                    # columns
]

Unnamed: 0,t1,t2,label
0,0.0125,0.4914,TWO
2,0.8805,1.3195,TWO
0,0.0125,0.4116,THREE


You can assign your row and column selections to variables and use those as your indexers. For complex conditions this could enhance the readability of your code, or you might do this because you intend to reuse the conditions in multiple statements.

In [15]:
rowmask = (wddf.t1 < 1.0) & (wddf.label.isin(['TWO', 'THREE']))
cols = ['t1', 't2', 'label']
wddf.loc[rowmask, cols]

Unnamed: 0,t1,t2,label
0,0.0125,0.4914,TWO
2,0.8805,1.3195,TWO
0,0.0125,0.4116,THREE
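
A stored mask can also be reused in negated form. The sketch below, on a made-up dataframe (`toy` is an assumption for illustration), uses the `~` operator to select the rows that do not meet the condition, so the two selections together cover every row.

```python
import pandas as pd

# Hypothetical toy data: three word intervals.
toy = pd.DataFrame({
    't1': [0.0, 0.5, 1.2],
    't2': [0.5, 1.2, 1.6],
    'label': ['TWO', 'PLUS', 'TWO'],
})

rowmask = (toy.t1 < 1.0) & (toy.label.isin(['TWO', 'THREE']))
cols = ['t1', 'label']

matched = toy.loc[rowmask, cols]    # rows where the mask is True
rest = toy.loc[~rowmask, cols]      # ~ flips True and False

print(matched.label.tolist())                 # ['TWO']
print(rest.label.tolist())                    # ['PLUS', 'TWO']
print(len(matched) + len(rest) == len(toy))   # True
```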


## Getting just the values

All of the selections we just performed return a `pd.Series` or `pd.DataFrame` object, which is often what you want.

You can also get access to the underlying values of a `pd.Series` or `pd.DataFrame` object through the `values` attribute. Compare the return value of the statements with and without the `values` attribute.

In [16]:
phdf.t1.head()   # pd.Series

0    0.0125
1    0.3417
2    0.4914
3    0.5912
4    0.6211
Name: t1, dtype: float64

In [17]:
phdf.label.head().values  # numpy ndarray

array(['T', 'UW1', 'P', 'L', 'AH1'], dtype=object)

Underneath every dataframe column lies a numpy ndarray, and using `values` returns this array for a `pd.Series`. The results are similar for a `pd.DataFrame`.

In [18]:
phdf[['t1', 'label']].head()  # pd.DataFrame

Unnamed: 0,t1,label
0,0.0125,T
1,0.3417,UW1
2,0.4914,P
3,0.5912,L
4,0.6211,AH1


In [19]:
phdf[['t1', 'label']].head().values

array([[0.0125, 'T'],
       [0.3417, 'UW1'],
       [0.4914, 'P'],
       [0.5912, 'L'],
       [0.6211, 'AH1']], dtype=object)
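
The `dtype=object` in this result arises because a single ndarray has one dtype, so a selection that mixes floats and strings falls back to generic Python objects. A minimal sketch on a made-up dataframe (`toy` is an assumption for illustration) contrasts this with a numeric-only selection, which keeps a numeric dtype and supports ordinary numpy arithmetic.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one string column.
toy = pd.DataFrame({'t1': [0.0, 0.5], 't2': [0.5, 1.0], 'label': ['TWO', 'PLUS']})

mixed = toy[['t1', 'label']].values     # floats and strings together
numeric = toy[['t1', 't2']].values      # floats only

print(mixed.dtype)      # object
print(numeric.dtype)    # float64
print(numeric.shape)    # (2, 2)

# Column arithmetic on the plain 2-D array: duration = t2 - t1
print((numeric[:, 1] - numeric[:, 0]).tolist())   # [0.5, 0.5]
```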

Why would you want access to `values`? The first reason is that `values` returns a numpy ndarray, so you can use regular numpy methods on it (though pandas methods are no longer available), and in certain cases this can be faster.

The second reason you might want access to the values is that the semantics of numpy ndarrays are different from those of `pd.Series` and `pd.DataFrame` objects. In particular, **numpy ndarrays do not have a label-based index that can be used in combining operations**. We'll return to this point in the next chapter.
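
As a brief preview, the following sketch adds two made-up series (`s1` and `s2` are assumptions for illustration) whose index labels are in opposite order. Adding the series matches values by label, while adding their `values` matches them by position.

```python
import pandas as pd

# Hypothetical series with the same labels in opposite order.
s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s1 + s2                    # aligned on index labels
by_position = s1.values + s2.values   # plain ndarrays, no labels

# Sort by label so the printed order does not depend on alignment order.
print(by_label.sort_index().tolist())   # [31, 22, 13]
print(by_position.tolist())             # [11, 22, 33]
```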