# Chapter 4 The tidyverse

Up to now we have been manipulating vectors by reordering and subsetting them through indexing. However, once we start more advanced analyses, the preferred unit for data storage is not the vector but the _data frame_. In this chapter we learn to work directly with data frames, which greatly facilitate the organization of information. We will be using data frames for the majority of this book. We will focus on a specific data format referred to as _tidy_ and on a specific collection of packages, referred to as the _tidyverse_, that are particularly helpful for working with _tidy_ data.

We can load all the tidyverse packages at once by installing and loading the __tidyverse__ package:

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



We will learn how to implement the tidyverse approach throughout the book, but before delving into the details, in this chapter we introduce some of the most widely used tidyverse functionality, starting with the __dplyr__ package for manipulating data frames and the __purrr__ package for working with functions. Note that the tidyverse also includes a graphing package, __ggplot2__, which we introduce in Chapter 7 in the Data Visualization part of the book; the __readr__ package, discussed in Chapter 5; and many others. In this chapter, we first introduce the concept of _tidy_ data and then demonstrate how we use the tidyverse to work with data frames in this format.

## 4.1 Tidy data

We say that a data table is in _tidy_ format if each row represents one observation and columns represent the different variables available for each of these observations. The `murders` dataset is an example of a tidy data frame.

In [2]:
library(dslabs)
data(murders)
head(murders)

#>        state   abb region population total
#>        <chr> <chr>  <fct>      <dbl> <dbl>
#> 1    Alabama    AL  South    4779736   135
#> 2     Alaska    AK   West     710231    19
#> 3    Arizona    AZ   West    6392017   232
#> 4   Arkansas    AR  South    2915918    93
#> 5 California    CA   West   37253956  1257
#> 6   Colorado    CO   West    5029196    65


Each row represents a state, and each of the five columns provides a different variable related to these states: name, abbreviation, region, population, and total murders.

To see how the same information can be provided in different formats, consider the following example:

In [3]:
#>       country year fertility
#> 1     Germany 1960      2.41
#> 2 South Korea 1960      6.16
#> 3     Germany 1961      2.44
#> 4 South Korea 1961      5.99
#> 5     Germany 1962      2.47
#> 6 South Korea 1962      5.79

This tidy dataset provides fertility rates for two countries across the years. This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate. However, this dataset originally came in another format and was reshaped for the __dslabs__ package. Originally, the data was in the following format:

In [4]:
#>       country 1960 1961 1962
#> 1     Germany 2.41 2.44 2.47
#> 2 South Korea 6.16 5.99 5.79

The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables, year, is stored in the header. For the tidyverse packages to be optimally used, data need to be reshaped into _tidy_ format, which you will learn to do in the Data Wrangling part of the book. Until then, we will use example datasets that are already in tidy format.
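
As a preview only, here is a minimal sketch of how the wide table above could be rebuilt by hand and reshaped into the tidy one. It assumes the __tidyr__ function `pivot_longer` is available; the object names `wide` and `tidy` are our own, and the values are copied from the two tables shown above.

```r
library(tidyr)

# Rebuild the wide table by hand (values copied from the text above).
wide <- data.frame(
  country = c("Germany", "South Korea"),
  `1960` = c(2.41, 6.16),
  `1961` = c(2.44, 5.99),
  `1962` = c(2.47, 5.79),
  check.names = FALSE  # keep the years as column names
)

# Move the year out of the header and into its own column,
# so that each row holds a single observation.
tidy <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "fertility")
tidy$year <- as.integer(tidy$year)
tidy
```

The result has six rows and three columns. The rows come out grouped by country rather than by year, but the information is the same as in the tidy table above.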

Although not immediately obvious, as you go through the book you will start to appreciate the advantages of working in a framework in which functions use tidy formats for both inputs and outputs. You will see how this permits the data analyst to focus on more important aspects of the analysis rather than the format of the data.

## 4.2 Exercises

1. Examine the built-in dataset `co2`. Which of the following is true:

 a. `co2` is tidy data: it has one year for each row.
 
 b. `co2` is not tidy: we need at least one column with a character vector.
 
 c. `co2` is not tidy: it is a matrix instead of a data frame.
 
 d. `co2` is not tidy: to be tidy we would have to wrangle it to have three columns (year, month and value), then each co2 observation would have a row.


In [5]:
head(co2)

Answer: d
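
To make answer d concrete, here is a minimal sketch of one way to wrangle `co2` into three columns using only base R. The object name `co2_tidy` is our own, and the year calculation assumes a regularly spaced series, which is what a `ts` object stores.

```r
# co2 is a monthly time series (class "ts"), not a data frame.
class(co2)

# Number of periods elapsed since the start of the first year.
elapsed <- seq_along(co2) + start(co2)[2] - 2

# One row per observation, with year, month and value as columns.
co2_tidy <- data.frame(
  year  = as.integer(start(co2)[1] + elapsed %/% frequency(co2)),
  month = as.integer(cycle(co2)),
  value = as.numeric(co2)
)
head(co2_tidy)
```

Each monthly measurement now has its own row, and year and month are ordinary columns rather than attributes of the time series.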

2. Examine the built-in dataset `ChickWeight`. Which of the following is true:

 a. `ChickWeight` is not tidy: each chick has more than one row.

 b. `ChickWeight` is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.

 c. `ChickWeight` is not tidy: we are missing the year column.
 
 d. `ChickWeight` is tidy: it is stored in a data frame.


In [6]:
head(ChickWeight)

#>   weight  Time Chick  Diet
#>    <dbl> <dbl> <ord> <fct>
#> 1     42     0     1     1
#> 2     51     2     1     1
#> 3     59     4     1     1
#> 4     64     6     1     1
#> 5     76     8     1     1
#> 6     93    10     1     1


Answer: b

3. Examine the built-in dataset `BOD`. Which of the following is true:

 a. `BOD` is not tidy: it only has six rows.
 
 b. `BOD` is not tidy: the first column is just an index.
 
 c. `BOD` is tidy: each row is an observation with two values (time and demand).

 d. `BOD` is tidy: all small datasets are tidy by definition.


In [7]:
head(BOD)

#>    Time demand
#>   <dbl>  <dbl>
#> 1     1    8.3
#> 2     2   10.3
#> 3     3   19.0
#> 4     4   16.0
#> 5     5   15.6
#> 6     7   19.8


Answer: c

4. Which of the following built-in datasets is tidy (you can pick more than one):

 a. `BJsales`

 b. `EuStockMarkets`

 c. `DNase`

 d. `Formaldehyde`

 e. `Orange`

 f. `UCBAdmissions`

In [8]:
head(BJsales)
head(EuStockMarkets)
head(DNase)
head(Formaldehyde)
head(Orange)
head(UCBAdmissions)


#>     DAX    SMI    CAC   FTSE
#> 1628.75 1678.1 1772.8 2443.6
#> 1613.63 1688.5 1750.5 2460.2
#> 1606.51 1678.6 1718.0 2448.2
#> 1621.04 1684.1 1708.1 2470.4
#> 1618.16 1686.6 1723.1 2484.7
#> 1610.61 1671.6 1714.3 2466.8


#>     Run       conc density
#>   <ord>      <dbl>   <dbl>
#> 1     1 0.04882812   0.017
#> 2     1 0.04882812   0.018
#> 3     1  0.1953125   0.121
#> 4     1  0.1953125   0.124
#> 5     1   0.390625   0.206
#> 6     1   0.390625   0.215


#>    carb optden
#>   <dbl>  <dbl>
#> 1   0.1  0.086
#> 2   0.3  0.269
#> 3   0.5  0.446
#> 4   0.6  0.538
#> 5   0.7  0.626
#> 6   0.9  0.782


#>    Tree   age circumference
#>   <ord> <dbl>         <dbl>
#> 1     1   118            30
#> 2     1   484            58
#> 3     1   664            87
#> 4     1  1004           115
#> 5     1  1231           120
#> 6     1  1372           142


Answer: b, c, d, e

## 4.3 Manipulating data frames

The __dplyr__ package from the __tidyverse__ introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember. For instance, to change the data table by adding a new column, we use `mutate`. To filter the data table to a subset of rows, we use `filter`. Finally, to subset the data by selecting specific columns, we use `select`.


### 4.3.1 Adding a column with `mutate`

We want all the necessary information for our analysis to be included in the data table. So the first task is to add the murder rates to our murders data frame. The function `mutate` takes the data frame as a first argument and the name and values of the variable as a second argument using the convention `name = values`. So, to add murder rates, we use:

In [9]:
library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)

Notice that here we used `total` and `population` inside the function, which are objects that are __not__ defined in our workspace. But why don’t we get an error?

This is one of __dplyr__’s main features. Functions in this package, such as `mutate`, know to look for variables in the data frame provided in the first argument. In the call to `mutate` above, `total` will have the values in `murders$total`. This approach makes the code much more readable.

We can see that the new column is added:

In [10]:
head(murders)

#>        state   abb region population total     rate
#>        <chr> <chr>  <fct>      <dbl> <dbl>    <dbl>
#> 1    Alabama    AL  South    4779736   135 2.824424
#> 2     Alaska    AK   West     710231    19 2.675186
#> 3    Arizona    AZ   West    6392017   232 3.629527
#> 4   Arkansas    AR  South    2915918    93  3.18939
#> 5 California    CA   West   37253956  1257 3.374138
#> 6   Colorado    CO   West    5029196    65 1.292453


Although we have overwritten the original `murders` object, this does not change the object that is loaded with `data(murders)`. If we load the `murders` data again, the original will overwrite our mutated version.
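
The following minimal sketch illustrates both points on a small, hypothetical data frame called `toy`, whose two rows are copied from `murders`. It shows that the unquoted names inside `mutate` refer to columns, and that `mutate` returns a new data frame rather than changing its input. The names `toy`, `with_rate` and `base_rate` are our own.

```r
library(dplyr)

# Hypothetical toy data; the two rows are copied from murders.
toy <- data.frame(
  state = c("Alabama", "Alaska"),
  population = c(4779736, 710231),
  total = c(135, 19)
)

# total and population are looked up inside toy, not in the workspace.
with_rate <- mutate(toy, rate = total / population * 100000)

# The same column computed with the $ accessor.
base_rate <- toy$total / toy$population * 100000

all.equal(with_rate$rate, base_rate)  # TRUE
names(toy)                            # toy has no rate column
```

Because we did not assign the result back to `toy`, the original object keeps its three columns. Writing `murders <- mutate(murders, ...)` is what replaces the object in our workspace.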

### 4.3.2 Subsetting with `filter`

Now suppose that we want to filter the data table to only show the entries for which the murder rate is at most 0.71. To do this we use the `filter` function, which takes the data table as the first argument and then the conditional statement as the second. As with `mutate`, we can use the unquoted variable names from `murders` inside the function and it will know we mean the columns and not objects in the workspace.

In [11]:
filter(murders, rate <= 0.71)

#>         state   abb        region population total      rate
#>         <chr> <chr>         <fct>      <dbl> <dbl>     <dbl>
#>        Hawaii    HI          West    1360301     7  0.514592
#>          Iowa    IA North Central    3046355    21 0.6893484
#> New Hampshire    NH     Northeast    1316470     5 0.3798036
#>  North Dakota    ND North Central     672591     4 0.5947151
#>       Vermont    VT     Northeast     625741     2 0.3196211
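
For comparison with the indexing approach used in earlier chapters, here is a minimal sketch on a hypothetical data frame `toy`. The rates are rounded versions of values shown in this chapter, and the object names are our own.

```r
library(dplyr)

# Hypothetical toy data with rounded murder rates.
toy <- data.frame(
  state = c("Alabama", "Hawaii", "Iowa", "Vermont"),
  rate = c(2.82, 0.51, 0.69, 0.32)
)

kept <- filter(toy, rate <= 0.71)     # dplyr: unquoted column name
kept_base <- toy[toy$rate <= 0.71, ]  # base R: logical index on the rows

kept$state
```

Both approaches keep the same three states here. One difference worth knowing is that `filter` drops rows for which the condition is `NA`, whereas logical indexing returns a row of `NA`s for them.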


### 4.3.3 Selecting columns with `select`

Although our data table only has six columns, some data tables include hundreds. If we want to view just a few, we can use the __dplyr__ `select` function. In the code below we select three columns, assign this to a new object and then filter the new object:

In [12]:
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)

#>         state        region      rate
#>         <chr>         <fct>     <dbl>
#>        Hawaii          West  0.514592
#>          Iowa North Central 0.6893484
#> New Hampshire     Northeast 0.3798036
#>  North Dakota North Central 0.5947151
#>       Vermont     Northeast 0.3196211


In the call to `select`, the first argument `murders` is an object, but `state`, `region`, and `rate` are variable names.
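
As a minimal sketch of this distinction, the code below selects the same three columns from a hypothetical data frame `toy` in two ways: with unquoted variable names in `select`, and with quoted names in base R indexing. The data values are rounded and the object names are our own.

```r
library(dplyr)

# Hypothetical toy data with four columns.
toy <- data.frame(
  state = c("Alabama", "Alaska"),
  abb = c("AL", "AK"),
  region = c("South", "West"),
  rate = c(2.82, 2.68)
)

by_name <- select(toy, state, region, rate)       # unquoted variable names
by_string <- toy[, c("state", "region", "rate")]  # base R: quoted names

names(by_name)
names(by_string)
```

In both cases the `abb` column is dropped and the remaining columns appear in the order in which we listed them.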

## 4.4 Exercises

 1. Load the __dplyr__ package and the `murders` dataset.

In [13]:
library(dplyr)
library(dslabs)
data(murders)

You can add columns using the __dplyr__ function `mutate`. This function is aware of the column names and inside the function you can call them unquoted:

In [14]:
murders <- mutate(murders, population_in_million = population / 10^6)

We can write `population` rather than `murders$population`. The function `mutate` knows we are grabbing columns from `murders`.

Use the function `mutate` to add a murders column named `rate` with the per 100,000 murder rate as in the example code above. Make sure you redefine `murders` as done in the example code above (`murders <- [your code]`) so we can keep using this variable.

In [15]:
murders <- mutate(murders, rate = total / population * 100000)
head(murders, 2)

#>     state   abb region population total population_in_million     rate
#>     <chr> <chr>  <fct>      <dbl> <dbl>                 <dbl>    <dbl>
#> 1 Alabama    AL  South    4779736   135              4.779736 2.824424
#> 2  Alaska    AK   West     710231    19              0.710231 2.675186


2. If `rank(x)` gives you the ranks of `x` from lowest to highest, `rank(-x)` gives you the ranks from highest to lowest. Use the function `mutate` to add a column `rank` containing the rank, from highest to lowest murder rate. Make sure you redefine `murders` so we can keep using this variable.

In [16]:
murders <- mutate(murders, rank = rank(-rate))
head(murders)

#>        state   abb region population total population_in_million     rate  rank
#>        <chr> <chr>  <fct>      <dbl> <dbl>                 <dbl>    <dbl> <dbl>
#> 1    Alabama    AL  South    4779736   135              4.779736 2.824424    23
#> 2     Alaska    AK   West     710231    19              0.710231 2.675186    27
#> 3    Arizona    AZ   West    6392017   232              6.392017 3.629527    10
#> 4   Arkansas    AR  South    2915918    93              2.915918  3.18939    17
#> 5 California    CA   West   37253956  1257             37.253956 3.374138    14
#> 6   Colorado    CO   West    5029196    65              5.029196 1.292453    38
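
To see why negating the vector reverses the ranking, here is a minimal sketch on a small made-up vector `x`:

```r
x <- c(10, 30, 20)

rank(x)   # 1 3 2: the smallest value gets rank 1
rank(-x)  # 3 1 2: the largest value gets rank 1
```

This is why District of Columbia, which has the highest murder rate, ends up with rank 1 when we use `rank(-rate)`.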


3. With __dplyr__, we can use `select` to show only certain columns. For example, with this code we would only show the states and population sizes:

In [17]:
select(murders, state, population) %>% head()

#>        state population
#>        <chr>      <dbl>
#> 1    Alabama    4779736
#> 2     Alaska     710231
#> 3    Arizona    6392017
#> 4   Arkansas    2915918
#> 5 California   37253956
#> 6   Colorado    5029196


Use `select` to show the state names and abbreviations in `murders`. Do not redefine `murders`, just show the results.

In [18]:
select(murders, state, abb) %>% head()

#>        state   abb
#>        <chr> <chr>
#> 1    Alabama    AL
#> 2     Alaska    AK
#> 3    Arizona    AZ
#> 4   Arkansas    AR
#> 5 California    CA
#> 6   Colorado    CO


4. The __dplyr__ function `filter` is used to choose specific rows of the data frame to keep. Unlike `select`, which is for columns, `filter` is for rows. For example, you can show just the New York row like this:

In [19]:
filter(murders, state == "New York")

#>    state   abb    region population total population_in_million    rate  rank
#>    <chr> <chr>     <fct>      <dbl> <dbl>                 <dbl>   <dbl> <dbl>
#> New York    NY Northeast   19378102   517               19.3781 2.66796    29


You can use other logical vectors to filter rows.

Use `filter` to show the top 5 states with the highest murder rates. After we add murder rate and rank, do not change the murders dataset, just show the result. Remember that you can filter based on the `rank` column.

In [20]:
a <- select(murders, state, rate, rank)
filter(a, rank <= 5)

#>                state      rate  rank
#>                <chr>     <dbl> <dbl>
#> District of Columbia 16.452753     1
#>            Louisiana  7.742581     2
#>             Maryland  5.074866     4
#>             Missouri  5.359892     3
#>       South Carolina  4.475323     5


5. We can remove rows using the `!=` operator. For example, to remove Florida, we would do this:

In [21]:
no_florida <- filter(murders, state != "Florida" )

Create a new data frame called `no_south` that removes states from the South region. How many states are in this category? You can use the function `nrow` for this.

In [22]:
no_south <- filter(murders, region != "South")
nrow(no_south)

6. We can also use `%in%` to filter with __dplyr__. You can therefore see the data from New York and Texas like this:

In [23]:
filter(murders, state %in% c("New York", "Texas"))

#>    state   abb    region population total population_in_million    rate  rank
#>    <chr> <chr>     <fct>      <dbl> <dbl>                 <dbl>   <dbl> <dbl>
#> New York    NY Northeast   19378102   517               19.3781 2.66796    29
#>    Texas    TX     South   25145561   805              25.14556 3.20136    16
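
As a minimal sketch of what `%in%` computes, the code below applies it to a small made-up character vector `state` and compares the result with two `==` tests joined by `|`:

```r
state <- c("New York", "Texas", "Utah")

in_set <- state %in% c("New York", "Texas")
either <- state == "New York" | state == "Texas"

in_set                     # TRUE TRUE FALSE
identical(in_set, either)  # TRUE
```

`filter` keeps the rows for which this logical vector is `TRUE`. With more than a couple of values, `%in%` is much shorter to write than a chain of `|`.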


Create a new data frame called `murders_nw` with only the states from the Northeast and the West. How many states are in this category?

In [24]:
murders_nw <- filter(murders, region %in% c("Northeast", "West"))
nrow(murders_nw)

7. Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with `filter`. Here is an example in which we filter to keep only small states in the Northeast region.

In [25]:
filter(murders, population < 5000000 & region == "Northeast")


Make sure `murders` has been defined with `rate` and `rank` and still has all states. Create a table called `my_states` that contains rows for states satisfying both the conditions: it is in the Northeast or West and the murder rate is less than 1. Use `select` to show only the state name, the rate, and the rank.

In [26]:
my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
select(my_states, state, rate, rank)

#>         state      rate  rank
#>         <chr>     <dbl> <dbl>
#>        Hawaii  0.514592    49
#>         Idaho 0.7655102    46
#>         Maine 0.8280881    44
#> New Hampshire 0.3798036    50
#>        Oregon 0.9396843    42
#>          Utah  0.795981    45
#>       Vermont 0.3196211    51
#>       Wyoming 0.8871131    43


## 4.5 The pipe: `%>%`

With __dplyr__ we can perform a series of operations, for example `select` and then `filter`, by sending the result of one function to another using what is called the _pipe operator_: `%>%`. Some details are included below.

We wrote code above to show three variables (state, region, rate) for states that have murder rates of at most 0.71. To do this, we defined the intermediate object `new_table`. In __dplyr__ we can write code that looks more like a description of what we want to do without intermediate objects:

$$ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } $$

For such an operation, we can use the pipe `%>%`. The code looks like this:

In [27]:
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)

#>         state        region      rate
#>         <chr>         <fct>     <dbl>
#>        Hawaii          West  0.514592
#>          Iowa North Central 0.6893484
#> New Hampshire     Northeast 0.3798036
#>  North Dakota North Central 0.5947151
#>       Vermont     Northeast 0.3196211


This line of code is equivalent to the two lines of code above. What is going on here?

In general, the pipe _sends_ the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. Here is a very simple example:

In [28]:
16 %>% sqrt()

We can continue to pipe values along:

In [29]:
16 %>% sqrt() %>% log2()

The above statement is equivalent to `log2(sqrt(16))`.

Remember that the pipe sends values to the first argument, so we can define other arguments as if the first argument is already defined:

In [30]:
16 %>% sqrt() %>% log(base = 2)
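
As a quick check, the minimal sketch below compares the piped call with its nested equivalent. The object names `piped` and `nested` are our own.

```r
library(dplyr)  # attaching dplyr makes %>% available

piped <- 16 %>% sqrt() %>% log(base = 2)
nested <- log(sqrt(16), base = 2)

piped                     # 2
identical(piped, nested)  # TRUE
```

The pipe only changes how the code reads: `sqrt(16)` is 4, and the base 2 logarithm of 4 is 2 either way.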

Therefore, when using the pipe with data frames and __dplyr__, we no longer need to specify the required first argument since the __dplyr__ functions we have described all take the data as the first argument. In the code we wrote:

In [31]:
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)

#>         state        region      rate
#>         <chr>         <fct>     <dbl>
#>        Hawaii          West  0.514592
#>          Iowa North Central 0.6893484
#> New Hampshire     Northeast 0.3798036
#>  North Dakota North Central 0.5947151
#>       Vermont     Northeast 0.3196211


`murders` is the first argument of the `select` function, and the new data frame (formerly `new_table`) is the first argument of the `filter` function.
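
The minimal sketch below writes the same operation in three ways on a hypothetical data frame `toy` (rounded values, names of our own choosing) and checks that the results agree:

```r
library(dplyr)

# Hypothetical toy data with rounded murder rates.
toy <- data.frame(
  state = c("Alabama", "Hawaii", "Vermont"),
  region = c("South", "West", "Northeast"),
  population = c(4779736, 1360301, 625741),
  rate = c(2.82, 0.51, 0.32)
)

# With an intermediate object.
new_table <- select(toy, state, region, rate)
two_step <- filter(new_table, rate <= 0.71)

# Nested calls: read from the inside out.
nested <- filter(select(toy, state, region, rate), rate <= 0.71)

# Piped: read from left to right.
piped <- toy %>% select(state, region, rate) %>% filter(rate <= 0.71)

identical(piped, two_step)  # TRUE
identical(piped, nested)    # TRUE
```

All three versions run exactly the same function calls. The piped version simply avoids both the extra object and the inside-out reading order.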

Note that the pipe works well with functions where the first argument is the input data. Functions in __tidyverse__ packages like __dplyr__ have this format and can be used easily with the pipe.
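
When the data is not the first argument, the `%>%` pipe can still be used through its placeholder `.`, a feature of the __magrittr__ package that provides the operator and that this chapter does not otherwise cover. The following is a minimal sketch under that assumption, using a made-up data frame `toy` and the base R function `lm`, which takes a formula first and the data second:

```r
library(dplyr)

# Made-up toy data.
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# The dot marks where the left side of the pipe should go.
fit <- toy %>% lm(y ~ x, data = .)

coef(fit)
```

Without the placeholder, `toy` would be passed as the first argument, where `lm` expects the formula.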

## 4.6 Exercises


1. The pipe `%>%` can be used to perform operations sequentially without having to define intermediate objects. Start by redefining `murders` to include rate and rank.

In [32]:
data(murders)
murders <- mutate(murders, rate = total / population * 100000, rank = rank(-rate))

In the solution to the previous exercise, we did the following:

In [33]:
my_states <- filter(murders, region %in% c("Northeast", "West") & rate < 1)
select(my_states, state, rate, rank)

#>         state      rate  rank
#>         <chr>     <dbl> <dbl>
#>        Hawaii  0.514592    49
#>         Idaho 0.7655102    46
#>         Maine 0.8280881    44
#> New Hampshire 0.3798036    50
#>        Oregon 0.9396843    42
#>          Utah  0.795981    45
#>       Vermont 0.3196211    51
#>       Wyoming 0.8871131    43


The pipe `%>%` permits us to perform both operations sequentially without having to define an intermediate variable `my_states`. We therefore could have mutated and selected in the same line like this:

In [34]:
mutate(murders, rate = total / population * 100000, rank = rank(-rate)) %>% select(state, rate, rank)

#>                state       rate  rank
#>                <chr>      <dbl> <dbl>
#>              Alabama  2.8244238    23
#>               Alaska   2.675186    27
#>              Arizona  3.6295273    10
#>             Arkansas  3.1893901    17
#>           California  3.3741383    14
#>             Colorado  1.2924531    38
#>          Connecticut  2.7139722    25
#>             Delaware  4.2319369     6
#> District of Columbia 16.4527532     1
#>              Florida  3.3980688    13


Notice that `select` no longer has a data frame as the first argument. The first argument is assumed to be the result of the operation conducted right before the `%>%`.

Repeat the previous exercise, but now instead of creating a new object, show the result and only include the state, rate and rank columns. Use a pipe `%>%` to do this in just one line.

In [36]:
data(murders)
mutate(murders, rate = total / population * 100000, rank = rank(-rate)) %>% filter(region %in% c("Northeast", "West") & rate < 1) %>% select(state, rate, rank)

#>         state      rate  rank
#>         <chr>     <dbl> <dbl>
#>        Hawaii  0.514592    49
#>         Idaho 0.7655102    46
#>         Maine 0.8280881    44
#> New Hampshire 0.3798036    50
#>        Oregon 0.9396843    42
#>          Utah  0.795981    45
#>       Vermont 0.3196211    51
#>       Wyoming 0.8871131    43


2. Reset `murders` to the original table by using `data(murders)`. Use a pipe to create a new data frame called `my_states` that considers only states in the Northeast or West which have a murder rate lower than 1, and contains only the state, rate and rank columns. The pipe should have four components separated by three `%>%` operators. The code should look something like this:

In [None]:
#my_states <- murders %>%
#  mutate SOMETHING %>% 
#  filter SOMETHING %>% 
#  select SOMETHING

In [38]:
data(murders)
my_states <- murders %>%
  mutate(rate = total / population * 100000, rank = rank(-rate)) %>%
  filter(region %in% c("Northeast", "West") & rate < 1) %>%
  select(state, rate, rank)
my_states