-
Notifications
You must be signed in to change notification settings - Fork 2
/
dplyr-joins.Rmd
209 lines (128 loc) · 5.47 KB
/
dplyr-joins.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
title: "dplyr joins wrappers"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{dplyr joins wrappers}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
withr::local_options(joyn.verbose = TRUE)
```
## Overview
Joining data tables with `joyn` is particularly convenient as it allows you to analyze/be aware of the quality of the merging.
This vignette explores dplyr-like join functions available in `joyn`. Their major objective is to let you employ a syntax you are supposedly already familiar with - the `dplyr` one - while at the same time benefiting of the additional tools that `joyn` offers. That is, obtaining additional information and verification of the joining.
There are four types of dplyr-like join functions in `joyn`:
- Left joins: `joyn::left_join()`
- Right joins: `joyn::right_join()`
- Full joins: `joyn::full_join()`
- Inner joins: `joyn::inner_join()`
Each of them is a wrapper that works in a similar way as the corresponding `dplyr` function.
```{r setup}
library(joyn)
library(data.table)
```
## Rationale
```{r}
x1 <- data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
t = c(1L, 2L, 1L, 2L, NA_integer_),
x = 11:15)
y1 <- data.table(id = c(1,2, 4),
y = c(11L, 15L, 16))
```
Suppose you want to perform a **simple left join** between tables `x1` and `y1`.
With `joyn` you have two possibilities:
- using the `joyn()` function, specifying `keep = "left"`
- using the **`joyn::left_join()`** function
In addition, you could use `dplyr::left_join()` or base R merging functions.
Consider these three options:
```{r example}
# Option 1
joyn(x = x1,
y = y1,
keep = "left",
match_type = "m:1")
# Option 2
joyn::left_join(x = x1,
y = y1,
relationship = "many-to-one")
# Option 3
dplyr::left_join(x = x1,
y = y1,
relationship = "many-to-one")
```
Comparing the results, the same returning data table is produced.
However, `joyn::left_join()` allows you to enjoy both the intuitive syntax from `dplyr` and the additional tools from `joyn`. These include additional options to customize how the join is performed, the availability of the joyn report, messages informing you on time of execution and the status of the join as well as the execution of various checks during the merging. (For additional information on each of these `joyn`'s features, please take a look at all the other articles in this website.)
## Some examples
#### 1. Left join
ℹ️ Left joins return in the output table all rows from `x`, i.e., the left table, and only matching rows from `y`, i.e., the right table.
```{r left-ex1}
# Data tables to be joined
df1 <- data.frame(id = c(1L, 1L, 2L, 3L, NA_integer_, NA_integer_),
t = c(1L, 2L, 1L, 2L, NA_integer_, 4L),
x = 11:16)
df2 <- data.frame(id = c(1,2, 4, NA_integer_, 8),
y = c(11L, 15L, 16, 17L, 18L),
t = c(13:17))
```
*Example usage of some of the `joyn`'s additional options:*
***Updating NAs in left table***
Using the `update_NAs` argument from `joyn` you can update the values that are NA in the *t* variable in the left table with the actual values from the matching column *t* in the right one
```{r left-ex2}
left_join(x = df1,
y = df2,
relationship = "many-to-one",
by = "id",
update_NAs = TRUE)
```
***Specifying which variables to keep from the right table after the join***
```{r}
left_join(x = df1,
y = df2,
relationship = "many-to-one",
by = "id",
y_vars_to_keep = "y")
```
#### 2. Right join
ℹ️ Right joins return in the output table matching rows from `x`, i.e., the left table, and all rows from `y`, i.e., the right table.
*Example usage of some of the `joyn`'s additional options:*
***Specifying a name for the reporting variable***
```{r right-ex1}
right_join(x = df1,
y = df2,
relationship = "many-to-one",
by = "id",
reportvar = "right.joyn")
```
***Updating values in common variables***
By setting `update_values = TRUE`, all values in x (both NAs and not) will be updated with the actual values of variables in y with the same name as the ones in x. You can then see the status of the update in the reporting variable.
```{r right-ex2}
right_join(x = df1,
y = df2,
relationship = "many-to-one",
by = "id",
reportvar = "right.joyn")
```
#### 3. Full join
ℹ️ Full joins return in the output table all rows, both matching and non matching rows from `x`, i.e., the left table, and `y`, i.e., the right table.
```{r}
full_join(x = x1,
y = y1,
relationship = "many-to-one",
keep = TRUE)
```
#### 4. Inner join
ℹ️ Inner joins return in the output table only rows that match between `x`, i.e., the left table, and `y`, i.e., the right table.
***Simple inner join***
```{r }
inner_join(x = df1,
y = df2,
relationship = "many-to-one",
by = "id")
```