# Data frames

In this tutorial we'll explore data frames, the primary object in many data analytics and machine learning applications. We'll start with one of R's built-in data frames, mtcars.

## Basics

In [None]:
mtcars

Instead of dumping the entire contents of the data frame, we can display the first or last rows using the head and tail functions, respectively. With a single argument, the name of the data frame, these functions return the header plus six rows of data. By adding a second optional argument, we can specify the number of rows.

In [None]:
head(mtcars)

In [None]:
tail(mtcars,3)

The row names are returned using the rownames function.

In [None]:
rownames(mtcars)

The number of rows and columns can be returned using the nrow and ncol functions.

In [None]:
nrow(mtcars)

In [None]:
ncol(mtcars)
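
Both counts can also be obtained in a single call. As a small sketch, the base function dim returns the number of rows and columns together, and colnames is the column counterpart of rownames. The variable names dims and cols are only illustrative.

```r
# dim() returns c(number of rows, number of columns)
dims <- dim(mtcars)
dims

# colnames() lists the column names, mirroring rownames() for rows
cols <- colnames(mtcars)
cols
```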

The contents of a cell can be accessed by the row and column indexes or names. Note that the pair is comma separated and enclosed in square brackets, and that we can mix and match indexes and names.

In [None]:
mtcars[6,4]

In [None]:
mtcars["Valiant", "hp"]

In [None]:
mtcars["Valiant", 4]

In [None]:
mtcars[6, "hp"]
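
To confirm that the four forms above address the same cell, the sketch below stores each result and compares them with identical. The variable names are hypothetical and used only for illustration.

```r
# Row 6 is "Valiant" and column 4 is "hp", so all four forms address one cell
by_index <- mtcars[6, 4]
by_name  <- mtcars["Valiant", "hp"]
mixed_a  <- mtcars["Valiant", 4]
mixed_b  <- mtcars[6, "hp"]

# Each comparison should evaluate to TRUE
identical(by_index, by_name)
identical(by_index, mixed_a)
identical(by_index, mixed_b)
```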

## Column operations

The contents of a column can be returned by index/name enclosed in double square brackets, the dollar (\$) operator or the index/name following a comma in single square brackets. In this last option, the missing argument before the comma represents a wildcard.

In [None]:
mtcars[["wt"]]

In [None]:
mtcars[[6]]

In [None]:
mtcars$wt

In [None]:
mtcars[,"wt"]
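
The access styles above all return a plain vector. A minimal sketch of that equivalence follows. It also shows the drop = FALSE argument of the single-bracket form, which keeps the result as a one-column data frame. The variable names are illustrative.

```r
# Three ways to pull out the wt column as a vector
wt_a <- mtcars[["wt"]]
wt_b <- mtcars$wt
wt_c <- mtcars[, "wt"]
identical(wt_a, wt_b) && identical(wt_b, wt_c)

# drop = FALSE keeps the result as a data frame with a single column
wt_df <- mtcars[, "wt", drop = FALSE]
class(wt_df)
```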

Since a column is returned as a vector, the usual vector functions can be applied to it.

In [None]:
max(mtcars$wt)

In [None]:
sum(mtcars$wt)
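
A few more base functions that work on a column in the same way are sketched below. The variable wt is just a local copy of the column, made for brevity.

```r
wt <- mtcars$wt

length(wt)   # number of values, equal to the number of rows
range(wt)    # smallest and largest value as a vector of length two
summary(wt)  # minimum, quartiles, mean and maximum in one call
```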

## Slices

One of the key data frame operations is generating slices by column, row or a combination of the two.

### Column slices

Accessing a column by index or name using single square brackets returns a "column slice" of the data frame. A slice containing multiple columns can be created using a vector of indexes or names, and the result can be assigned to a new data frame.

In [None]:
head(mtcars["hp"])

In [None]:
df1 <- mtcars[c("mpg", "cyl", "wt", "carb")]

In [None]:
head(df1)
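
The difference between single and double square brackets is easy to miss, so here is a short sketch that checks the class of each result. It also shows the same multi-column slice selected by position instead of by name. The variable names are illustrative.

```r
slice  <- mtcars["hp"]     # single brackets: a data frame with one column
values <- mtcars[["hp"]]   # double brackets: a plain numeric vector

class(slice)
class(values)

# Columns 1, 2, 6 and 11 are mpg, cyl, wt and carb
by_pos  <- mtcars[c(1, 2, 6, 11)]
by_name <- mtcars[c("mpg", "cyl", "wt", "carb")]
identical(by_pos, by_name)
```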

A column can be removed from a data frame by assigning NULL to the column. In the example below, we remove the "wt" column.

In [None]:
df1$wt <- NULL

In [None]:
head(df1)
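
Assigning NULL modifies the data frame in place. If the original should stay untouched, a column can instead be left out while slicing. The sketch below shows two ways to do that. df_cols and the other variable names are hypothetical.

```r
df_cols <- mtcars[c("mpg", "cyl", "wt", "carb")]

# A negative index drops a column by position (here the third column, wt)
no_wt_by_index <- df_cols[-3]

# setdiff() drops a column by name
no_wt_by_name <- df_cols[setdiff(colnames(df_cols), "wt")]

colnames(no_wt_by_index)
colnames(df_cols)  # the source data frame still contains wt
```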

### Row slices

We can generate a "row slice" of the data frame by specifying a name, index, vector of names or vector of indexes in single square brackets followed by a comma. The missing argument after the comma represents a wildcard.

In [None]:
mtcars[3,]

In [None]:
mtcars[c(3,5,7),]

In [None]:
mtcars[c("Datsun 710", "Duster 360"),]

In [None]:
which(mtcars$hp > 200)

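Row slices can also be generated by filtering on values. The which function returns the indexes of the rows that satisfy a condition, and those indexes can then be used as a row slice. In the example below, we select all cars with "hp" > 240.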

In [None]:
mtcars[which(mtcars$hp > 240),]
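
As a variation on the filter above, the logical vector produced by the comparison can index the rows directly, and several conditions can be combined with & and |. This is only a sketch. The two forms behave differently when a column contains missing values (which drops them), but mtcars has none. The variable names are illustrative.

```r
# A logical vector can index rows directly, without which()
fast <- mtcars[mtcars$hp > 240, ]
rownames(fast)

# Conditions combine with & (and) and | (or)
fast_and_light <- mtcars[mtcars$hp > 200 & mtcars$wt < 4, ]
rownames(fast_and_light)
```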

### Slicing by row and column

As you might expect, we can generate arbitrary slices by row and column.

In [None]:
mtcars[c("Datsun 710", "Duster 360", "Valiant"), c("mpg", "wt", "carb")]
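
The row part of such a slice does not have to be a list of names. A brief sketch that combines a value filter on the rows with a selection of columns is shown below. four_cyl is a hypothetical variable name.

```r
# Rows: cars with four cylinders; columns: mpg, wt and carb
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "wt", "carb")]
head(four_cyl, 3)
```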

## Extracting information from data frames

We already covered the basic functionality of data frames. Let's now explore how we can extract useful information from the data frame. Continuing with the mtcars example, we'll ask questions such as "what is the highest mpg achieved by a car in our data frame?" and "which car has the highest weight (wt)?"

Getting statistics from a column is easy, as we saw earlier. A few more examples are shown below.

In [None]:
max(mtcars$mpg) # Highest miles per gallon (mpg)

In [None]:
min(mtcars$cyl) # Lowest number of cylinders (cyl)

In [None]:
mean(mtcars$hp) # Average horsepower (hp)

Asking questions such as "which car has the highest weight?" is a little trickier. We'll break this down into a few steps.

In [None]:
which.max(mtcars$wt) # Return the row number of the car with the highest weight

In [None]:
mtcars[which.max(mtcars$wt),] # Return a row slice of the car with the highest weight

In [None]:
rownames(mtcars[which.max(mtcars$wt),]) # Extract the row name using the rownames function

Let's now look at a slightly more complex example - "which car has the lowest horsepower (hp) to weight (wt) ratio?"

In [None]:
rownames(mtcars[which.min(mtcars$hp/mtcars$wt),])
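
The one-liner above can be unpacked in the same step-by-step way as the weight example. The sketch below is one possible breakdown. It attaches the car names to the ratio vector so that which.min reports the name directly. The variables ratio and lowest are illustrative.

```r
# Horsepower divided by weight for every car
ratio <- mtcars$hp / mtcars$wt

# Attach the car names so each value is labelled
names(ratio) <- rownames(mtcars)

# Position of the smallest ratio; the result carries the car name
lowest <- which.min(ratio)
names(lowest)
```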

## Sorting data frames

Data frames can be sorted on one or more columns using the order function. To reverse the sort on a numeric column, prefix it with a minus sign. In the example below, we sort the mtcars data frame by number of cylinders (cyl), ascending, and then by horsepower (hp), descending. Note that the missing argument after the comma in the square brackets operator is interpreted as a wildcard.

In [None]:
mtcars[order(mtcars$cyl, -mtcars$hp),]
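
The minus sign only works on numeric columns. As an alternative sketch, order also accepts a decreasing argument, which can be given per column when method = "radix" is used. The example below reproduces the same cyl-ascending, hp-descending sort that way. The variable sorted is illustrative.

```r
# cyl ascending (FALSE), then hp descending (TRUE)
sorted <- mtcars[order(mtcars$cyl, mtcars$hp,
                       decreasing = c(FALSE, TRUE), method = "radix"), ]
head(sorted[c("cyl", "hp")])
```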

## Constructing a data frame

So far we've been working with one of R's built-in data frames. We'll now explore how to build a data frame from scratch by calling the data.frame function with one or more vectors.

In [None]:
names <- c('Nancy', 'Mahidhar', 'Mariano', 'Jorge', 'Susan')
fruits <- c('apple', 'banana', 'orange', 'lemon', 'pineapple')
sports <- c('cycling', 'cricket', 'football', 'basketball', 'swimming')
cars <- c('Toyota', 'Dodge', 'BMW', 'Ford', 'Audi')
colors <- c('blue', 'green', 'yellow', 'gray', 'red')

In [None]:
df2 <- data.frame(names, fruits, sports, cars, colors)

In [None]:
df2

In [None]:
rownames(df2)

By default our data frame has no meaningful row names, only automatic row numbers. We can fix this by either calling data.frame with an additional row.names argument or by assigning one of the columns to be the row names after the data frame has been created.

In [None]:
df2 <- data.frame(fruits, sports, cars, colors, row.names=names)
df2

In [None]:
df2 <- data.frame(fruits, sports, cars, colors)
rownames(df2) <- names
df2
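
The data.frame function also accepts name = vector pairs, which sets the column names explicitly. The sketch below uses a small, hypothetical two-row data frame called people. It passes stringsAsFactors = FALSE, which is already the default from R 4.0.0 onwards, so that text columns stay character vectors on older versions too.

```r
# Hypothetical example data with explicit column names
people <- data.frame(
  fruit = c("apple", "banana"),
  sport = c("cycling", "cricket"),
  row.names = c("Nancy", "Mahidhar"),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

# Cells can now be addressed by row name and column name
people["Nancy", "sport"]
```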

## Reading a data frame from a file

Most often, you'll read your data frame from a file. Base R provides functions such as read.csv for delimited text files, and add-on packages handle Excel and other formats.

In [3]:
df3 <- read.csv("people.csv")

In [4]:
df3

names,fruits,sports,cars,colors
Nancy,apple,cycling,Toyota,blue
Mahidhar,banana,cricket,Dodge,green
Mariano,orange,football,BMW,yellow
Jorge,lemon,basketball,Ford,gray
Susan,pineapple,swimming,Audi,red
Ilkay,blueberry,running,Saab,purple
Mike,lime,hiking,Chevy,teal
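
To try read.csv without having people.csv at hand, the sketch below writes a few of the rows listed above to a temporary file and reads them back. It assumes the listing above is the content of the file. The csv_path variable and the use of tempfile are illustrative choices, and row.names = 1 turns the first column into the row names.

```r
# Write a few of the rows listed above to a temporary file (a stand-in for people.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "names,fruits,sports,cars,colors",
  "Nancy,apple,cycling,Toyota,blue",
  "Mahidhar,banana,cricket,Dodge,green",
  "Mariano,orange,football,BMW,yellow"
), csv_path)

# row.names = 1 uses the first column (names) as the row names
people <- read.csv(csv_path, row.names = 1, stringsAsFactors = FALSE)
people["Mariano", "cars"]
```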
