-
Notifications
You must be signed in to change notification settings - Fork 190
Expand file tree
/
Copy pathtranslation-function.Rmd
More file actions
428 lines (303 loc) · 15.1 KB
/
Copy pathtranslation-function.Rmd
File metadata and controls
428 lines (303 loc) · 15.1 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
---
title: "Function translation"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Function translation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r}
#| include: false
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 6L, tibble.print_max = 6L, digits = 3)
```
There are two parts to dbplyr SQL translation: translating dplyr verbs, and translating expressions within those verbs. This vignette describes how individual expressions (function calls) are translated; `vignette("translation-verb")` describes how entire verbs are translated.
```{r}
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
```
## Getting started with translations
In this vignette, I'll use `lazy_frame()` to create a toy lazy table that allows us to see the translation without needing to connect to a real database:
```{r}
lf <- lazy_frame(x = 1, y = 2, g = "a")
lf |> mutate(z = (x + y) / 2)
```
The default `lazy_frame()` uses a generic database that generates (approximately) SQL-92 compliant SQL. You can use `simulate_*()` connections to see the translations used by different backends. Different databases generate slightly different SQL; see `vignette("new-backend")` for more details.
```{r}
lf_sqlite <- lazy_frame(x = 1, con = simulate_sqlite())
lf_access <- lazy_frame(x = 1, con = simulate_access())
lf_sqlite |> transmute(z = x^2)
lf_access |> transmute(z = x^2)
```
One key difference between dbplyr-generated SQL and hand-written SQL is that dbplyr always quotes all table and column names. This is verbose but necessary because column names in database tables can be any string, including SQL reserved words like `select` or `if`. Quoting all names ensures that dbplyr-generated SQL always works regardless of the table and column names involved.
In general, perfect translation is not possible because databases don't have all the functions that R does. The goal of dbplyr is to provide a semantic rather than a literal translation: what you mean, rather than precisely what is done. In fact, even for functions that exist both in databases and in R, you shouldn't expect results to be identical; database programmers have different priorities than R core programmers. For example, in R in order to get a higher level of numerical accuracy, `mean()` loops through the data twice. R's `mean()` also provides a `trim` option for computing trimmed means; this is something that databases do not provide.
If you're interested in how `translate_sql()` is implemented, the basic techniques that underlie the implementation of `translate_sql()` are described in ["Advanced R"](https://adv-r.hadley.nz/translation.html).
## Basic differences
There are two fundamental differences between R and SQL:
* `"` and `'` mean different things.
R can use either `"` or `'` for strings, but in ANSI SQL, must be `"` used for names and must be `'` used for strings.
```{r}
lf |> filter(x == "x")
```
* R and SQL have different defaults for integers and reals.
In R, 1 is a real, and 1L is an integer. In SQL, 1 is an integer, and 1.0 is a real.
```{r}
lf |> transmute(z = 1)
lf |> transmute(z = 1L)
```
## Known functions
### Mathematics
* basic math operators: `+`, `-`, `*`, `/`, `^`
* trigonometry: `acos()`, `asin()`, `atan()`, `atan2()`, `cos()`, `cot()`, `tan()`, `sin()`
* hyperbolic: `cosh()`, `coth()`, `sinh()`, `tanh()`
* logarithmic: `log()`, `log10()`, `exp()`
* misc: `abs()`, `ceiling()`, `floor()`, `sqrt()`, `sign()`, `round()`
```{r}
lf |> transmute(x = x / 2, y = x^2 + y^2)
lf |> transmute(x = log(x), y = round(y, 1))
```
### Modulo arithmetic
dbplyr translates `%%` to the SQL equivalents but note that it's not precisely the same: most databases use truncated division where the modulo operator takes the sign of the dividend, where R using the mathematically preferred floored division with the modulo sign taking the sign of the divisor.
```{r}
df <- tibble(
x = c(10L, 10L, -10L, -10L),
y = c(3L, -3L, 3L, -3L)
)
db <- copy_to(memdb(), df)
df |> mutate(x %% y)
db |> mutate(x %% y)
```
dbplyr no longer translates `%/%` because there's no robust cross-database translation available.
### Logical comparisons and boolean operations
* logical comparisons: `<`, `<=`, `!=`, `>=`, `>`, `==`, `%in%`, `between()`
* boolean operations: `&`, `&&`, `|`, `||`, `!`, `xor()`
```{r}
lf |> filter(x > 5 | y == 2)
lf |> filter(x %in% c(1, 2, 3))
lf |> filter(between(x, 1, 5))
```
### Bitwise operations
`bitwNot()`, `bitwAnd()`, `bitwOr()`, `bitwXor()`, `bitwShiftL()`, and `bitwShiftR()` are all supported:
```{r}
lf |> transmute(x = bitwAnd(x, 3L), y = bitwShiftL(x, 2L))
```
### Type coercion
Type coercion functions use the corresponding SQL `CAST()` call:
```{r}
lf |> transmute(x = as.integer(y), y = as.character(x))
```
* integer types: `as.integer()`, `as.integer64()`
* floating point: `as.numeric()`, `as.double()`
* character: `as.character()`
* logical: `as.logical()`
* date/time: `as.Date()`, `as.POSIXct()`
For database-specific types not covered by these functions, use `as()`:
```{r}
lf |> transmute(x = as(x, "TIME"), y = as(y, "DECIMAL(10, 2)"))
```
### `NULL`/`NA` handling
* `is.na()`, `is.null()`: test for `NULL`.
* `na_if()`: replace a value with `NULL`.
* `coalesce()`: replace `NULL` with a default value.
```{r}
lf |> filter(!is.na(x))
lf |> transmute(x = coalesce(x, 0L))
lf |> transmute(x = na_if(x, 0L))
```
### Aggregation
All databases provide translation for the basic aggregations: `mean()`, `sum()`, `min()`, `max()`. Databases automatically drop NULLs (their equivalent of missing values) whereas in R you have to ask nicely. The aggregation functions warn you about this important difference:
```{r}
lf |> summarise(z = mean(x))
lf |> summarise(z = mean(x, na.rm = TRUE))
```
Note that aggregation functions used inside `mutate()` or `filter()` generate a window translation:
```{r}
lf |> mutate(z = mean(x, na.rm = TRUE))
lf |> filter(mean(x, na.rm = TRUE) > 0)
```
Most backends also support:
* `sd()`, `var()`, `cor()`, `cov()`
* `median()`, `quantile()`
* `n()`, `n_distinct()`
* `all()`, `any()`
* `str_flatten()`
### Conditional evaluation
`if`, `ifelse()`, and `if_else()` are translated to `CASE WHEN`:
```{r}
lf |> transmute(z = ifelse(x > 5, "big", "small"))
```
`case_when()`, `case_match()`, and `switch()` are also supported:
```{r}
lf |>
mutate(z = case_when(
x > 10 ~ "medium",
x > 30 ~ "big",
.default = "small"
))
lf |> mutate(z = switch(g, a = 1L, b = 2L, 3L))
```
### String functions
Base R string functions and their stringr equivalents are widely supported:
* `nchar()`, `str_length()`
* `tolower()`, `toupper()`, `str_to_lower()`, `str_to_upper()`, `str_to_title()`
* `trimws()`, `str_trim()`
* `paste()`, `paste0()`, `str_c()`
* `substr()`, `substring()`, `str_sub()`
```{r}
lf |> transmute(x = paste0(g, " dog"))
lf |> transmute(x = substr(g, 1L, 2L))
```
Many backends also support regular expression functions like `str_detect()`, `str_replace()`, `str_replace_all()`, `str_remove()`, `str_remove_all()`, `str_squish()`, and `str_like()`. Support varies by backend; see the individual backend documentation for details.
### Date/time functions
dbplyr supports many lubridate functions for extracting date components:
* `today()`, `now()`
* `year()`, `month()`, `day()`, `mday()`, `hour()`, `minute()`, `second()`
```{r}
lf_dt <- lazy_frame(dt = Sys.time())
lf_dt |> transmute(
year = year(dt),
month = month(dt),
day = day(dt)
)
```
Some backends also support additional lubridate functions including `yday()`, `wday()`, `week()`, `isoweek()`, `quarter()`, `isoyear()`, `floor_date()`, and period functions like `seconds()`, `minutes()`, `hours()`, `days()`, `weeks()`, `months()`, `years()`.
Several backends (including PostgreSQL, Snowflake, SQL Server, Redshift, and Spark SQL) support [clock](https://clock.r-lib.org) functions for date arithmetic.
* `add_days()`, `add_years()`
* `date_build()`
* `get_year()`, `get_month()`, `get_day()`
* `date_count_between()`
* `difftime()`
clock functions tend to be easier to translate than lubridate functions because they are more specific.
### Other functions
* `pmin()`, `pmax()` for parallel min/max
* `desc()` for descending order
* `cut()` for binning numeric values into categories
## Unknown functions
Any function that dbplyr doesn't know how to convert is left as is. This means that database functions that are not covered by dbplyr can often be used directly.
### Prefix functions
Any function that dbplyr doesn't know about will be left as is:
```{r}
lf |> mutate(z = foofify(x, y))
```
But to make it clear that you're deliberately calling a SQL function, we recommend using the `.sql` pronoun:
```{r}
lf |> transmute(z = .sql$foofify(x, y))
```
If you're working inside a package, this also makes it easier to avoid `R CMD CHECK` notes. Just import `.sql` from dbplyr using a roxygen2 tag like `@importFrom dbplyr .sql`
### Infix functions
As well as prefix functions (where the name of the function comes before the arguments), dbplyr also translates infix functions. That allows you to use expressions like `LIKE`, which does a limited form of pattern matching:
```{r}
lf |> filter(x %LIKE% "%foo%")
```
You can also use `str_like()` for this common case:
```{r}
lf |> filter(str_like(x, "%foo%"))
```
You could use `%||%` for string concatenation, but in most cases it's more R-like to use `paste()` or `paste0()`:
```{r}
lf |> transmute(z = x %||% y)
lf |> transmute(z = paste0(x, y))
lf |> transmute(z = paste(x, y))
```
### Special forms
SQL functions tend to have a greater variety of syntax than R. That means there are a number of expressions that can't be translated directly from R code. To insert these in your own queries, you can use literal SQL inside `sql()`:
```{r}
lf |> transmute(z = sql("x!"))
lf |> transmute(z = x == sql("ANY VALUES(1, 2, 3)"))
```
This gives you a lot of freedom to generate the SQL you need:
```{r}
lf |> transmute(factorial = sql("x!"))
lf |> transmute(factorial = sql("CAST(x AS FLOAT)"))
```
### Error for unknown translations
If needed, you can also use the `dplyr.strict_sql` option to force dbplyr to error if it doesn't know how to translate a function:
```{r}
#| error = TRUE
options(dplyr.strict_sql = TRUE)
lf |> mutate(z = glob(x, y))
```
## Window functions
Things get a little trickier with window functions, because SQL's window functions are considerably more expressive than the specific variants provided by base R or dplyr. They have the form `[expression] OVER ([partition clause] [order clause] [frame_clause])`:
* The __expression__ is a combination of variable names and window functions.
Support for window functions varies from database to database, but most
support:
- ranking: `row_number()`, `min_rank()`, `rank()`, `dense_rank()`,
`percent_rank()`, `cume_dist()`, `ntile()`;
- offsets: `lead()`, `lag()`, `first()`, `last()`, `nth()`;
- aggregates: `mean()`, `sum()`, `min()`, `max()`, `n()`, `n_distinct()`;
- cumulative: `cummean()`, `cumsum()`, `cummin()`, `cummax()`.
* The __partition clause__ specifies how the window function is broken down
over groups. It plays an analogous role to `GROUP BY` for aggregate functions,
and `group_by()` in dplyr. It is possible for different window functions to
be partitioned into different groups, but not all databases support it, and
neither does dplyr.
* The __order clause__ controls the ordering (when it makes a difference).
This is important for the ranking functions since it specifies which
variables to rank by, but it's also needed for cumulative functions and lead.
Whenever you're thinking about before and after in SQL, you must always tell
it which variable defines the order. If the order clause is missing when
needed, some databases fail with an error message while others return
non-deterministic results.
* The __frame clause__ defines which rows, or __frame__, that are passed
to the window function, describing which rows (relative to the current row)
should be included. The frame clause provides two offsets which determine
the start and end of frame. There are three special values: -Inf means
to include all preceding rows (in SQL, "unbounded preceding"), 0 means the
current row ("current row"), and Inf means all following rows ("unbounded
following"). The complete set of options is comprehensive, but fairly
confusing, and is summarised visually below.
```{r}
#| echo: false
#| out-width: 100%
#| fig-alt: >
#| A visual summary of the frame clause using the real line labelled
#| with negative infinity, -3, -2, -1, 0, 1, 2, 3, infinity. The most
#| important clauses are rolling, cumulative, and recycling.
#| Rolling, e.g. between 1 preceding and 1, following, run from
#| -1 to -1. Cumulative, between unbounded preceding and
#| current row, runs from negative infinity to 0. Recycled,
#| between unbound preceeding and unbound following, runs from
#| negative infinity to positive infinity.
knitr::include_graphics("windows.png", dpi = 300)
```
Of the many possible specifications, only three are commonly
used. They select between aggregation variants:
* Recycled: `BETWEEN UNBOUND PRECEDING AND UNBOUND FOLLOWING`
* Cumulative: `BETWEEN UNBOUND PRECEDING AND CURRENT ROW`
* Rolling: `BETWEEN 2 PRECEDING AND 2 FOLLOWING`
dbplyr generates the frame clause based on whether you're using a recycled
aggregate or a cumulative aggregate.
To see how individual window functions are translated to SQL, we can use `transmute()`:
```{r}
lf <- lazy_frame(g = 1, year = 2020, id = 3, con = simulate_dbi())
lf |> transmute(
mean = mean(g),
rank = min_rank(g),
cumsum = cumsum(g),
lag = lag(g)
)
```
If the lazy frame has been grouped or arranged previously in the pipeline, then dbplyr will use that information to set the "partition by" and "order by" clauses:
```{r}
lf |> arrange(year) |> mutate(z = cummean(g))
lf |> group_by(id) |> mutate(z = rank())
```
There are some challenges when translating window functions between R and SQL, because dbplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you're using:
* For ranking functions, the ordering variable is the first argument: `rank(x)`,
`ntile(y, 2)`. If omitted or `NULL`, will use the default ordering associated
with the tbl (as set by `arrange()`).
* Accumulating aggregates only take a single argument (the vector to aggregate).
To control ordering, use `order_by()`.
* Aggregates implemented in dplyr (`lead()`, `lag()`, `nth()`, `first()`,
`last()`) have an `order_by` argument. Supply it to override the
default ordering.
The three options are illustrated in the snippet below:
```{r}
lf |> transmute(
x1 = min_rank(g),
x2 = order_by(year, cumsum(g)),
x3 = lead(g, order_by = year)
)
```
Currently there is no way to order by multiple variables, except by setting the default ordering with `arrange()`. This will be added in a future release.