-
Notifications
You must be signed in to change notification settings - Fork 25
/
joining-tables.qmd
323 lines (202 loc) · 10.4 KB
/
joining-tables.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
# Joining tables
The information we need for a given analysis may not be just in one table. Here we use a simple examples to illustrate the general challenge of combining tables.
Suppose we want to explore the relationship between population size for US states and electoral votes. We have the population size in this table:
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
library(dslabs)
head(murders)
```
and electoral votes in this one:
```{r}
head(results_us_election_2016)
```
Just concatenating these two tables together will not work since the order of the states is not the same.
```{r}
identical(results_us_election_2016$state, murders$state)
```
The _join_ functions, described below, are designed to handle this challenge.
## Joins {#sec-joins}
The _join_ functions in the __dplyr__ package make sure that the tables are combined so that matching rows are together. If you know SQL, you will see that the approach and syntax is very similar. The general idea is that one needs to identify one or more columns that will serve to match the two tables. Then a new table with the combined information is returned. Notice what happens if we join the two tables above by state using `left_join` (we will remove the `others` column and rename `electoral_votes` so that the tables fit on the page):
```{r}
tab <- left_join(murders, results_us_election_2016, by = "state") |>
select(-others) |> rename(ev = electoral_votes)
head(tab)
```
The data has been successfully joined and we can now, for example, make a plot to explore the relationship:
```{r ev-vs-population, message=FALSE, warning=FALSE, echo=FALSE}
library(ggrepel)
tab |> ggplot(aes(population/10^6, ev, label = abb)) +
geom_point() +
scale_x_log10() + scale_y_log10() +
geom_text_repel() +
geom_smooth(method = "lm", se = FALSE)
```
We see the relationship is close to linear with about 2 electoral votes for every million persons, but with very small states getting higher ratios.
In practice, it is not always the case that each row in one table has a matching row in the other. For this reason, we have several versions of join. To illustrate this challenge, we will take subsets of the tables above. We create the tables `tab1` and `tab2` so that they have some states in common but not all:
```{r}
tab_1 <- slice(murders, 1:6) |> select(state, population)
tab_2 <- results_us_election_2016 |>
filter(state %in% c("Alabama", "Alaska", "Arizona",
"California", "Connecticut", "Delaware")) |>
select(state, electoral_votes) |> rename(ev = electoral_votes)
```
We will use these two tables as examples in the next sections.
### Left join
Suppose we want a table like `tab_1`, but adding electoral votes to whatever states we have available. For this, we use `left_join` with `tab_1` as the first argument. We specify which column to use to match with the `by` argument.
```{r}
left_join(tab_1, tab_2, by = "state")
```
Note that `NA`s are added to the two states not appearing in `tab_2`. Also, notice that this function, as well as all the other joins, can receive the first arguments through the pipe:
```{r, eval=FALSE}
tab_1 |> left_join(tab_2, by = "state")
```
### Right join
If instead of a table with the same rows as first table, we want one with the same rows as second table, we can use `right_join`:
```{r}
tab_1 |> right_join(tab_2, by = "state")
```
Now the NAs are in the column coming from `tab_1`.
### Inner join
If we want to keep only the rows that have information in both tables, we use `inner_join`. You can think of this as an intersection:
```{r}
inner_join(tab_1, tab_2, by = "state")
```
### Full join
If we want to keep all the rows and fill the missing parts with NAs, we can use `full_join`. You can think of this as a union:
```{r}
full_join(tab_1, tab_2, by = "state")
```
### Semi join
The `semi_join` function lets us keep the part of first table for which we have information in the second. It does not add the columns of the second:
```{r}
semi_join(tab_1, tab_2, by = "state")
```
### Anti join
The function `anti_join` is the opposite of `semi_join`. It keeps the elements of the first table for which there is no information in the second:
```{r}
anti_join(tab_1, tab_2, by = "state")
```
The following diagram summarizes the above joins:
![](img/joins.png){width="50%" fig-align="center"}
(Image courtesy of RStudio^[<https://github.com/rstudio/cheatsheets>]. CC-BY-4.0 license^[<https://github.com/rstudio/cheatsheets/blob/master/LICENSE>]. Cropped from original.)
## Binding
Although we have yet to use it in this book, another common way in which datasets are combined is by _binding_ them. Unlike the join function, the binding functions do not try to match by a variable, but instead simply combine datasets. If the datasets don't match by the appropriate dimensions, one obtains an error.
### Binding columns
The __dplyr__ function _bind_cols_ binds two objects by making them columns in a tibble. For example, we quickly want to make a data frame consisting of numbers we can use
```{r}
bind_cols(a = 1:3, b = 4:6)
```
This function requires that we assign names to the columns. Here we chose `a` and `b`.
Note that there is an R-base function `cbind` with the exact same functionality. An important difference is that `cbind` can create different types of objects, while `bind_cols` always produces a data frame.
`bind_cols` can also bind two different data frames. For example, here we break up the `tab` data frame and then bind them back together:
```{r}
tab_1 <- tab[, 1:3]
tab_2 <- tab[, 4:6]
tab_3 <- tab[, 7:8]
new_tab <- bind_cols(tab_1, tab_2, tab_3)
head(new_tab)
```
### Binding by rows
The `bind_rows` function is similar to `bind_cols`, but binds rows instead of columns:
```{r}
tab_1 <- tab[1:2,]
tab_2 <- tab[3:4,]
bind_rows(tab_1, tab_2)
```
This is based on an R-base function `rbind`.
## Set operators
Another set of commands useful for combining datasets are the set operators. When applied to vectors, these behave as their names suggest. Examples are `intersect`, `union`, `setdiff`, and `setequal`. However, if the __tidyverse__, or more specifically __dplyr__, is loaded, these functions can be used on data frames as opposed to just on vectors.
### Intersect
You can take intersections of vectors of any type, such as numeric:
```{r}
intersect(1:10, 6:15)
```
or characters:
```{r}
intersect(c("a","b","c"), c("b","c","d"))
```
The __dplyr__ package includes an `intersect` function that can be applied to tables with the same column names. This function returns the rows in common between two tables. To make sure we use the __dplyr__ version of `intersect` rather than the base R version, we can use `dplyr::intersect` like this:
```{r}
tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
dplyr::intersect(tab_1, tab_2)
```
### Union
Similarly _union_ takes the union of vectors. For example:
```{r}
union(1:10, 6:15)
union(c("a","b","c"), c("b","c","d"))
```
The __dplyr__ package includes a version of `union` that combines all the rows of two tables with the same column names.
```{r}
tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
dplyr::union(tab_1, tab_2)
```
### `setdiff`
The set difference between a first and second argument can be obtained with `setdiff`. Unlike `intersect` and `union`, this function is not symmetric:
```{r}
setdiff(1:10, 6:15)
setdiff(6:15, 1:10)
```
As with the functions shown above, __dplyr__ has a version for data frames:
```{r}
tab_1 <- tab[1:5,]
tab_2 <- tab[3:7,]
dplyr::setdiff(tab_1, tab_2)
```
### `setequal`
Finally, the function `setequal` tells us if two sets are the same, regardless of order. So notice that:
```{r}
setequal(1:5, 1:6)
```
but:
```{r}
setequal(1:5, 5:1)
```
The __dplyr__ version checks whether data frames are equal, regardless of order of rows _or_ columns:
```{r}
dplyr::setequal(tab_1, tab_2)
```
## Joining with data.table
The **data.table** package includes `merge`, a very efficient function for joining tables.
In **tidyverse** we joined two tables with `left_join`:
```{r}
tab <- left_join(murders, results_us_election_2016, by = "state")
```
In **data.table** the `merge` functions works similarly:
```{r}
#| message: false
#| warning: false
library(data.table)
tab <- merge(murders, results_us_election_2016, by = "state", all.x = TRUE)
```
Instead of defining different functions for the different type of joins, `merge` uses the the logical arguments `all` (full join), `all.x` (left join), and `all.y` (right join).
## Exercises
1\. Install and load the __Lahman__ library. This database includes data related to baseball teams. It includes summary statistics about how the players performed on offense and defense for several years. It also includes personal information about the players.
The `Batting` data frame contains the offensive statistics for all players for many years. You can see, for example, the top 10 hitters by running this code:
```{r, eval=FALSE}
library(Lahman)
top <- Batting |>
filter(yearID == 2016) |>
arrange(desc(HR)) |>
slice(1:10)
top |> as_tibble()
```
But who are these players? We see an ID, but not the names. The player names are in this table
```{r, eval=FALSE}
People |> as_tibble()
```
We can see column names `nameFirst` and `nameLast`. Use the `left_join` function to create a table of the top home run hitters. The table should have `playerID`, first name, last name, and number of home runs (HR). Rewrite the object `top` with this new table.
2\. Now use the `Salaries` data frame to add each player's salary to the table you created in exercise 1. Note that salaries are different every year so make sure to filter for the year 2016, then use `right_join`. This time show first name, last name, team, HR, and salary.
3\. In a previous exercise, we created a tidy version of the `co2` dataset:
```{r, eval=FALSE}
co2_wide <- data.frame(matrix(co2, ncol = 12, byrow = TRUE)) |>
setNames(1:12) |>
mutate(year = 1959:1997) |>
pivot_longer(-year, names_to = "month", values_to = "co2") |>
mutate(month = as.numeric(month))
```
We want to see if the monthly trend is changing, so we are going to remove the year effects and then plot the results. We will first compute the year averages. Use the `group_by` and `summarize` to compute the average co2 for each year. Save in an object called `yearly_avg`.
4\. Now use the `left_join` function to add the yearly average to the `co2_wide` dataset. Then compute the residuals: observed co2 measure - yearly average.
5\. Make a plot of the seasonal trends by year but only after removing the year effect.