# Basic dataframe operations

In this chapter we will explore some of the basic operations you can perform on dataframes.

The first task is to read some data into a dataframe.

In [1]:
import pandas as pd
from audiolabel import read_label

In [2]:
flist = ['resource/two_plus_two_1.tg', 'resource/three_plus_five_1.tg']
[phdf, wddf] = read_label(flist, 'praat', addcols=['fidx'])
wddf

Unnamed: 0,t1,t2,label,fidx,fname
0,0.0125,0.4914,TWO,0,resource/two_plus_two_1.tg
1,0.4914,0.8805,PLUS,0,resource/two_plus_two_1.tg
2,0.8805,1.3195,TWO,0,resource/two_plus_two_1.tg
3,1.3195,1.3594,sp,0,resource/two_plus_two_1.tg
4,1.3594,1.7585,EQUALS,0,resource/two_plus_two_1.tg
5,1.7585,1.8283,sp,0,resource/two_plus_two_1.tg
6,1.8283,2.1975,FOUR,0,resource/two_plus_two_1.tg
7,0.0125,0.4116,THREE,1,resource/three_plus_five_1.tg
8,0.4116,0.8107,PLUS,1,resource/three_plus_five_1.tg
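9,0.8107,1.2696,FIVE,1,resource/three_plus_five_1.tg
10,1.2696,1.4592,sp,1,resource/three_plus_five_1.tg
11,1.4592,1.8583,EQUALS,1,resource/three_plus_five_1.tg
12,1.8583,2.2274,EIGHT,1,resource/three_plus_five_1.tg
13,2.2274,2.5966,sp,1,resource/three_plus_five_1.tg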


## Viewing dataframes

Here are a few ways to explore and interact with the contents of a dataframe. Let's start with a dataframe object. The dot `'.'` following the dataframe's name is how we access its methods. Try clicking after the dot in the following cell and then pressing the `Tab` key.

You'll see a list of available methods. Scroll through the list with the arrow keys to review the possible actions you can perform on a dataframe.

In [None]:
phdf.

Chapter 1 introduced the `head()` method to show the first few rows of a dataframe. The `tail()` method shows the last few rows.

In [3]:
wddf.head()

Unnamed: 0,t1,t2,label,fidx,fname
0,0.0125,0.4914,TWO,0,resource/two_plus_two_1.tg
1,0.4914,0.8805,PLUS,0,resource/two_plus_two_1.tg
2,0.8805,1.3195,TWO,0,resource/two_plus_two_1.tg
3,1.3195,1.3594,sp,0,resource/two_plus_two_1.tg
4,1.3594,1.7585,EQUALS,0,resource/two_plus_two_1.tg


In [4]:
wddf.tail()

Unnamed: 0,t1,t2,label,fidx,fname
9,0.8107,1.2696,FIVE,1,resource/three_plus_five_1.tg
10,1.2696,1.4592,sp,1,resource/three_plus_five_1.tg
11,1.4592,1.8583,EQUALS,1,resource/three_plus_five_1.tg
12,1.8583,2.2274,EIGHT,1,resource/three_plus_five_1.tg
13,2.2274,2.5966,sp,1,resource/three_plus_five_1.tg
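Both methods return five rows by default and accept an optional row count. The following is a minimal sketch using a small made-up dataframe (`toy` and its values are illustrative and are not read from the chapter's resource files):

```python
import pandas as pd

# Toy stand-in for a word-label dataframe; the values are illustrative only.
toy = pd.DataFrame({
    't1': [0.0, 0.5, 0.9, 1.3, 1.4, 1.8],
    't2': [0.5, 0.9, 1.3, 1.4, 1.8, 2.2],
    'label': ['TWO', 'PLUS', 'TWO', 'sp', 'EQUALS', 'FOUR'],
})

first_two = toy.head(2)    # first two rows instead of the default five
last_three = toy.tail(3)   # last three rows instead of the default five

print(first_two)
print(last_three)
```

The original index labels are preserved in the result, so the rows returned by `tail(3)` here keep the labels 3, 4, and 5.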


## Getting basic dataframe info

A number of dataframe attributes give detailed information about a dataframe's contents.

The `shape` attribute tells you how many rows and columns are present.

In [5]:
wddf.shape   # rows, columns

(14, 5)

The `len()` function returns the number of dataframe rows. Note that `len()` is not a dataframe method.

In [6]:
len(wddf)    # not wddf.len()

14

In [7]:
wddf.shape[0] == len(wddf)

True

The column labels are accessible through the `columns` attribute.

In [8]:
wddf.columns

Index(['t1', 't2', 'label', 'fidx', 'fname'], dtype='object')

The length of `columns` is the number of columns.

In [9]:
len(wddf.columns)

5

In [10]:
wddf.shape[1] == len(wddf.columns)

True

To find out what kinds of values are stored in your columns, use the `dtypes` attribute.

In [11]:
wddf.dtypes

t1        float64
t2        float64
label      object
fidx        int64
fname    category
dtype: object

You can also view the dataframe's index, which is used in row selection and combining operations.

In [12]:
wddf.index    # append .values to get the index labels as an array

RangeIndex(start=0, stop=14, step=1)
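To illustrate how the index takes part in row selection, here is a small sketch on a made-up dataframe (`toy` is hypothetical, not one of the chapter's dataframes). It shows that filtered rows keep their original index labels, that `.loc` selects by label, and that `reset_index(drop=True)` renumbers the rows from zero.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({'label': ['TWO', 'PLUS', 'TWO', 'sp']})

# Boolean selection keeps the original index labels of the matching rows.
subset = toy[toy['label'] == 'TWO']
print(subset.index)

# .loc selects a row by its index label.
row = toy.loc[2]
print(row['label'])

# reset_index(drop=True) replaces the index with a fresh 0..n-1 range.
renumbered = subset.reset_index(drop=True)
print(renumbered.index)
```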

## Renaming columns

Sometimes you need to assign names to your columns, perhaps because you read a headerless text file with `read_csv()` and didn't assign column names with the `names` parameter. You can add names to an existing dataframe by assigning to the `columns` attribute.

In [13]:
nhdf = pd.read_csv('resource/two_plus_two_1.nohead.ifc', sep='\t', header=None)
nhdf.tail()

Unnamed: 0,0,1,2,3,4,5,6
210,2.105,612.5,684.4,1187.4,1489.2,3059.5,129.8
211,2.115,550.0,676.4,1228.3,1609.7,3078.8,0.0
212,2.125,511.3,881.6,1240.9,1628.9,2982.4,101.5
213,2.135,442.6,951.5,1254.8,1654.1,3177.6,106.3
214,2.145,260.4,768.5,1239.6,1871.3,3146.8,107.1


In [14]:
nhdf.columns = ['sec', 'rms', 'f1', 'f2', 'f3', 'f4', 'f0']
nhdf.tail()

Unnamed: 0,sec,rms,f1,f2,f3,f4,f0
210,2.105,612.5,684.4,1187.4,1489.2,3059.5,129.8
211,2.115,550.0,676.4,1228.3,1609.7,3078.8,0.0
212,2.125,511.3,881.6,1240.9,1628.9,2982.4,101.5
213,2.135,442.6,951.5,1254.8,1654.1,3177.6,106.3
214,2.145,260.4,768.5,1239.6,1871.3,3146.8,107.1
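For comparison, the `names` parameter mentioned above can supply the labels at read time. This sketch reads from an in-memory string rather than a file, and the two rows of tab-separated values are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical headerless, tab-separated text standing in for a file on disk.
raw = "0.005\t100.0\t300.0\n0.015\t110.0\t310.0\n"

# names= assigns the column labels while reading; header=None says that
# the first line is data, not a header row.
named = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                    names=['sec', 'rms', 'f1'])

print(named.columns.tolist())
```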


You can overwrite existing column names. The next cell converts all the column names to upper case. Execute the cell, then try converting back to lower case with `lower()`.

In [15]:
phdf.columns = [c.upper() for c in phdf.columns]
phdf.tail()

Unnamed: 0,T1,T2,LABEL,FIDX,FNAME
34,1.6986,1.7685,L,1,resource/three_plus_five_1.tg
35,1.7685,1.8583,Z,1,resource/three_plus_five_1.tg
36,1.8583,2.1376,EY1,1,resource/three_plus_five_1.tg
37,2.1376,2.2274,T,1,resource/three_plus_five_1.tg
38,2.2274,2.5966,sp,1,resource/three_plus_five_1.tg
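One possible way to convert names back to lower case is sketched below on a made-up dataframe (`toy` is hypothetical). It uses the `str` accessor of the `columns` index as an alternative to the list comprehension; either form should work.

```python
import pandas as pd

# Illustrative dataframe with upper-case column names.
toy = pd.DataFrame({'T1': [0.0], 'T2': [0.5], 'LABEL': ['TWO']})

# The .str accessor applies lower() to every column label at once.
toy.columns = toy.columns.str.lower()

print(toy.columns.tolist())
```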


If you want to rename only some of the columns, you can use `rename()` with a dict that maps old names to new names.

In [16]:
nhdf = nhdf.rename(columns={'sec': 'seconds', 'rms': 'rootmnsq'})
nhdf.tail()

Unnamed: 0,seconds,rootmnsq,f1,f2,f3,f4,f0
210,2.105,612.5,684.4,1187.4,1489.2,3059.5,129.8
211,2.115,550.0,676.4,1228.3,1609.7,3078.8,0.0
212,2.125,511.3,881.6,1240.9,1628.9,2982.4,101.5
213,2.135,442.6,951.5,1254.8,1654.1,3177.6,106.3
214,2.145,260.4,768.5,1239.6,1871.3,3146.8,107.1


Notice that dataframe methods generally do not modify an existing dataframe unless you ask for modification. These methods usually return a modified copy of the dataframe, which you can assign to a variable of the same name as the input. Alternatively, many of these methods accept `inplace=True` as a parameter to modify the dataframe directly.
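The difference between the two styles can be sketched with `rename()` on a small made-up dataframe (`toy` and its values are illustrative only):

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({'sec': [0.005, 0.015], 'rms': [100.0, 110.0]})

# Without inplace=True, rename() returns a new dataframe and leaves toy alone.
renamed = toy.rename(columns={'sec': 'seconds'})
print(toy.columns.tolist())       # still contains 'sec'
print(renamed.columns.tolist())   # contains 'seconds'

# With inplace=True, toy itself is changed and the call returns None.
result = toy.rename(columns={'rms': 'rootmnsq'}, inplace=True)
print(toy.columns.tolist())
```

Because the in-place call returns `None`, writing `toy = toy.rename(..., inplace=True)` would discard the dataframe; use one style or the other, not both.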

## Getting summary information

You can get a quick summary of your dataframe with `describe()`.

In [17]:
nhdf.describe()

Unnamed: 0,seconds,rootmnsq,f1,f2,f3,f4,f0
count,215.0,215.0,215.0,215.0,215.0,215.0,215.0
mean,1.075,2645.992093,441.726977,1423.562791,2498.030233,3434.84186,109.912558
std,0.622093,2861.467948,176.761018,280.27118,438.113048,204.961373,91.344017
min,0.005,49.5,234.6,660.2,1489.2,2889.2,0.0
25%,0.54,504.3,315.85,1256.5,2260.05,3266.55,71.75
50%,1.075,1238.6,365.4,1416.9,2363.1,3446.6,96.0
75%,1.61,4946.0,529.4,1636.7,2721.75,3577.25,117.65
max,2.145,9835.1,1240.5,2149.0,3521.9,3882.4,413.8


Many other descriptive statistics are available as dataframe methods. See the pandas docs for a [convenient list of available methods](https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics).

In [18]:
nhdf.mean()

seconds        1.075000
rootmnsq    2645.992093
f1           441.726977
f2          1423.562791
f3          2498.030233
f4          3434.841860
f0           109.912558
dtype: float64

In [19]:
nhdf.std()

seconds        0.622093
rootmnsq    2861.467948
f1           176.761018
f2           280.271180
f3           438.113048
f4           204.961373
f0            91.344017
dtype: float64