# Data frames

In this tutorial we'll explore data frames, the primary object in many data analytics and machine learning applications. We'll start with one of R's built-in data frames, mtcars.

## Basics

In [1]:
mtcars

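| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |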


Instead of dumping the entire contents of the data frame, we can display the first or last rows using the head and tail functions, respectively. Called with a single argument, the name of the data frame, these functions return the header plus six rows of data. An optional second argument specifies the number of rows to return.

In [2]:
head(mtcars)

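| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |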


In [3]:
tail(mtcars,3)

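| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ferrari Dino | 19.7 | 6 | 145 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
| Maserati Bora | 15.0 | 8 | 301 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
| Volvo 142E | 21.4 | 4 | 121 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |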
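
As a small sketch of the optional second argument to head and tail, the calls below request an explicit number of rows. The argument name n is the one base R uses for both functions; the variable names are only illustrative.

```r
# Default: six rows from the top of the data frame
first_six <- head(mtcars)
nrow(first_six)

# Explicit row count, passed by position or by name
first_three <- head(mtcars, 3)
last_three <- tail(mtcars, n = 3)

# Inspect which cars were returned
rownames(first_three)
rownames(last_three)
```

Checking the result with nrow or rownames is a quick way to confirm the size of a slice without printing every column.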


The row names are returned by the rownames function.

In [5]:
rownames(mtcars)

The number of rows and columns can be returned using the nrow and ncol functions.

In [6]:
nrow(mtcars)

In [7]:
ncol(mtcars)
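
Two related base R functions, not covered in the text above, are dim and colnames. The sketch below shows how they line up with nrow, ncol and rownames; the variable names are only illustrative.

```r
# dim returns the row count followed by the column count
size <- dim(mtcars)
size

# colnames is the column counterpart of rownames
cols <- colnames(mtcars)
cols

# The pieces agree with nrow and ncol
size[1] == nrow(mtcars)
size[2] == ncol(mtcars)
length(cols) == ncol(mtcars)
```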

The contents of a cell can be accessed by its row and column indexes or names. The pair is comma-separated and enclosed in square brackets, and we can mix and match indexes and names.

In [8]:
mtcars[6,4]

In [8]:
mtcars["Valiant", "hp"]

In [9]:
mtcars["Valiant", 4]

In [10]:
mtcars[6, "hp"]
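
To make the equivalence of the four forms explicit, the sketch below stores each result and compares them. It also shows one way to look up the index that corresponds to a name, using which, as an illustration rather than the only approach.

```r
# The same cell addressed four ways
by_index <- mtcars[6, 4]
by_name  <- mtcars["Valiant", "hp"]
mixed_a  <- mtcars["Valiant", 4]
mixed_b  <- mtcars[6, "hp"]

# All four forms return the Valiant's horsepower
all(c(by_index, by_name, mixed_a, mixed_b) == 105)

# Find the index behind a row or column name
row_pos <- which(rownames(mtcars) == "Valiant")
col_pos <- which(colnames(mtcars) == "hp")
row_pos
col_pos
```

Names are usually the safer choice in scripts, since an index silently points at a different cell if rows or columns are reordered.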

## Column operations

The contents of a column can be returned by its index or name enclosed in double square brackets, by the dollar (\$) operator followed by the name, or by the index or name following a comma in single square brackets. In this last form, the missing argument before the comma acts as a wildcard that selects every row.

In [11]:
mtcars[["wt"]]

In [12]:
mtcars[[6]]

In [13]:
mtcars$wt

In [14]:
mtcars[,"wt"]
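
The four forms above return the same vector. A minimal check, using illustrative variable names:

```r
# Four ways to extract the "wt" column (column 6)
wt_by_name  <- mtcars[["wt"]]
wt_by_index <- mtcars[[6]]
wt_dollar   <- mtcars$wt
wt_comma    <- mtcars[, "wt"]

# Each comparison is against the first form
identical(wt_by_name, wt_by_index)
identical(wt_by_name, wt_dollar)
identical(wt_by_name, wt_comma)

# The result is a plain numeric vector with one element per row
is.numeric(wt_by_name)
length(wt_by_name) == nrow(mtcars)
```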

Because a column is returned as a vector, the usual vector functions can be applied to it.

In [15]:
max(mtcars$wt)

In [16]:
sum(mtcars$wt)
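
A few more base R vector functions work the same way. The sketch below is illustrative; which.max is used here to recover the row name of the heaviest car, which is one possible approach among several.

```r
weights <- mtcars$wt

# Summary statistics on the column
mean(weights)
range(weights)
summary(weights)

# which.max gives the position of the largest value,
# which can then be used to look up the matching row name
heaviest <- rownames(mtcars)[which.max(weights)]
heaviest
```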

## Slices

One of the key data frame operations is generating slices by column, row or a combination of the two.

### Column slices

Accessing a column by index or name using single square brackets returns a "column slice" of the data frame, which is itself a data frame. A slice containing multiple columns can be created using a vector of indexes or names, and the result can be assigned to a new data frame.

In [17]:
head(mtcars["hp"])

| | hp |
|---|---|
| Mazda RX4 | 110 |
| Mazda RX4 Wag | 110 |
| Datsun 710 | 93 |
| Hornet 4 Drive | 110 |
| Hornet Sportabout | 175 |
| Valiant | 105 |


In [18]:
df1 <- mtcars[c("mpg", "cyl", "wt", "carb")]

In [19]:
head(df1)

| | mpg | cyl | wt | carb |
|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 2.62 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 2.875 | 4 |
| Datsun 710 | 22.8 | 4 | 2.32 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 3.215 | 1 |
| Hornet Sportabout | 18.7 | 8 | 3.44 | 2 |
| Valiant | 18.1 | 6 | 3.46 | 1 |
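
The difference between single and double square brackets is easy to miss, so here is a short sketch contrasting them. It also selects the same four columns by index instead of by name; the variable names are only illustrative.

```r
# Single brackets keep the data frame structure
hp_slice <- mtcars["hp"]
class(hp_slice)

# Double brackets extract the underlying vector
hp_vector <- mtcars[["hp"]]
class(hp_vector)

# A multi-column slice by name and the same slice by index
by_name  <- mtcars[c("mpg", "cyl", "wt", "carb")]
by_index <- mtcars[c(1, 2, 6, 11)]
identical(by_name, by_index)
```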


A column can be removed from a data frame by assigning NULL to the column. In the example below, we remove the "wt" column.

In [20]:
df1$wt <- NULL

In [21]:
head(df1)

| | mpg | cyl | carb |
|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 4 |
| Datsun 710 | 22.8 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 1 |
| Hornet Sportabout | 18.7 | 8 | 2 |
| Valiant | 18.1 | 6 | 1 |
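
Assigning NULL modifies the data frame in place. If the original should stay intact, one alternative (a sketch, not the approach used above) is to build a new slice that leaves the unwanted column out. The names original and trimmed are hypothetical.

```r
# Start from a fresh four-column slice
original <- mtcars[c("mpg", "cyl", "wt", "carb")]

# Keep every column except "wt"; setdiff drops the name from the list
trimmed <- original[setdiff(names(original), "wt")]

names(trimmed)   # "wt" is gone
names(original)  # the source slice is unchanged
```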


### Row slices

We can generate a "row slice" of the data frame by specifying a name, index, vector of names or vector of indexes in single square brackets followed by a comma. The missing argument after the comma acts as a wildcard that selects every column.

In [22]:
mtcars[3,]

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |


In [23]:
mtcars[c(3,5,7),]

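| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |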


In [24]:
mtcars[c("Datsun 710", "Duster 360"),]

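| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |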
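
Two further variations on row slices may be useful: a range of indexes built with the colon operator, and negative indexes, which drop rows instead of selecting them. Neither appears in the examples above, so treat this as a supplementary sketch with illustrative variable names.

```r
# A contiguous range of rows
first_three <- mtcars[1:3, ]
rownames(first_three)

# A negative index drops the given row and keeps the rest
without_first <- mtcars[-1, ]
nrow(without_first)

# Selecting by index and by name yields the same rows
by_index <- mtcars[c(3, 7), ]
by_name  <- mtcars[c("Datsun 710", "Duster 360"), ]
all.equal(by_index, by_name)
```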


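Row slices can also be generated by filtering on values. The which function returns the indexes of the rows that satisfy a condition.

In [25]:
which(mtcars$hp > 200)

Passing those indexes as the row argument selects the matching rows. In the example below, we select all cars with "hp" > 240.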

In [26]:
mtcars[which(mtcars$hp > 240),]

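| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Duster 360 | 14.3 | 8 | 360 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
| Camaro Z28 | 13.3 | 8 | 350 | 245 | 3.73 | 3.84 | 15.41 | 0 | 0 | 3 | 4 |
| Ford Pantera L | 15.8 | 8 | 351 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 | 4 |
| Maserati Bora | 15.0 | 8 | 301 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |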
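
The logical vector produced by the comparison can also be used as the row argument directly, without which. The two forms agree for mtcars, which has no missing values; in a column containing NA, the logical form would return extra rows of NA while which skips them. The sketch below also combines two conditions with the & operator; the variable names are only illustrative.

```r
# Filtering with which and with a bare logical vector
by_which   <- mtcars[which(mtcars$hp > 240), ]
by_logical <- mtcars[mtcars$hp > 240, ]
identical(rownames(by_which), rownames(by_logical))

# Two conditions: high horsepower and am equal to 1
powerful_am1 <- mtcars[mtcars$hp > 240 & mtcars$am == 1, ]
rownames(powerful_am1)
```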


### Slicing by row and column

As you might expect, we can generate arbitrary slices by combining row and column selections.

In [27]:
mtcars[c("Datsun 710", "Duster 360", "Valiant"), c("mpg", "wt", "carb")]

| | mpg | wt | carb |
|---|---|---|---|
| Datsun 710 | 22.8 | 2.32 | 1 |
| Duster 360 | 14.3 | 3.57 | 4 |
| Valiant | 18.1 | 3.46 | 1 |
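
The row argument can be a filter rather than a list of names, and the result collapses to a vector when only one column is requested. The sketch below illustrates both points; drop = FALSE is the base R argument that keeps the data frame structure, and the variable names are hypothetical.

```r
# Filter rows by value and pick columns by name in one step
powerful <- mtcars[mtcars$hp > 240, c("mpg", "hp", "wt")]
dim(powerful)

# A single column comes back as a vector by default
mpg_vector <- mtcars[c("Datsun 710", "Valiant"), "mpg"]
is.data.frame(mpg_vector)

# drop = FALSE keeps the result as a one-column data frame
mpg_frame <- mtcars[c("Datsun 710", "Valiant"), "mpg", drop = FALSE]
is.data.frame(mpg_frame)
```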
