-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathintroduction.Rmd
451 lines (316 loc) · 18.4 KB
/
introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
---
title: "A Guide to Working with Quantities"
author: "Iñaki Ucar"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: yes
vignette: >
%\VignetteIndexEntry{A Guide to Working with Quantities}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, cache = FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = T, comment = "#>",
fig.width = 6, fig.height = 4, fig.align = "center")
```
## Introduction
This document intends to be a guide on how to work with quantities data (magnitudes with units and/or uncertainty) in two distinct workflows: R *base* and the so-called [*tidyverse*](https://www.tidyverse.org/). Units and errors (and, by extension, quantities) objects essentially are numeric vectors, arrays and matrices with associated metadata. This metadata is not always compatible with some functions, and thus we here explore the most common operations in data wrangling (subsetting, ordering, transformations, aggregations...) to identify potential issues and propose possible workarounds.
Let us consider the traditional `iris` data set for this exercise. According to its documentation,
> `iris` is a data frame with 150 cases (rows) and 5 variables (columns) named `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.
And values are provided in centimeters. If we consider, for instance, a 2% of uncertainty, the first step is to define proper quantities. Then we will work on the resulting data frame for the rest of this article.
```{r}
library(quantities)
iris.q <- iris
for (i in 1:4)
quantities(iris.q[,i]) <- list("cm", iris.q[,i] * 0.02)
head(iris.q)
```
Note that, throughout this document, and unless otherwise stated, we will talk about *quantities objects* as a shortcut for *quantities, units and errors objects*.
## R Base
In this section, we consider all the methods and functions included in the default packages, i.e., those that are automatically installed along with any R distribution:
```{r, eval=FALSE}
rownames(installed.packages(priority="base"))
```
```{r, echo=FALSE}
print(c("base", "compiler", "datasets", "graphics", "grDevices", "grid", "methods", "parallel", "splines", "stats", "stats4", "tcltk", "tools", "utils"))
```
### Row Subsetting
Quantities objects have all the subsetting methods defined (`[`, `[[`, `[<-`, `[[<-`). Therefore they can be used in the same way as with plain numeric vectors, and in conjunction with `which` and other functions to perform subsetting. The `subset` function is very handy too and achieves the same result:
```{r}
iris.q[which(iris.q$Sepal.Length > set_quantities(75, mm)), ]
subset(iris.q, Sepal.Length > set_quantities(75, mm))
```
Note that another quantities object is defined for the comparison. This is needed because different units are incomparable. Also note that the first line throws a warning telling us that the uncertainty was dropped for this operation. This kind of warning is thrown once, and this is why `subset` succeeds silently.
### Row Ordering
The `sort` function, as its name suggests, *sorts* vectors, and it is compatible with quantities:
```{r}
iris.q$Sepal.Length[1:5]
sort(iris.q$Sepal.Length[1:5])
```
More generally, the `order` function can be used for data frame ordering:
```{r}
head(iris.q[order(iris.q$Sepal.Length), ])
```
### Column Transformation
The `transform` function is able to modify variables in a data frame or to create new ones. The `within` function provides a similar but more flexible approach though. Both are fully compatible with quantities:
```{r}
head(within(iris.q, {
Sepal.Area <- Sepal.Length * Sepal.Width
Petal.Area <- Petal.Length * Petal.Width
rm(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
}))
```
### Row Aggregation
Row aggregation is the process of summarising data based on some grouping variable(s). There are several ways of working with data split by factors in R base, and, although they tend to preserve classes, they are generally not very kind to other metadata (i.e., attributes) by default.
In the following example, the average `Sepal.Length` is computed per `Species`, but the metadata gets dropped:
```{r}
tapply(iris.q$Sepal.Length, iris.q$Species, mean)
```
Many of these functions include a `simplify` parameter which, if set to `FALSE`, preserves quantities metadata:
```{r}
(sepal.length.agg <-
tapply(iris.q$Sepal.Length, iris.q$Species, mean, simplify=FALSE))
```
The only drawback is that the result is a list, and such a list must be unlisted with care, otherwise, metadata gets dropped again:
```{r}
# drops quantities
unlist(sepal.length.agg)
# preserves quantities
do.call(c, sepal.length.agg)
```
The `by` function is an object-oriented wrapper for `tapply` applied to data frames which also provides a `simplify` parameter. A more convenient way of working with summary statistics is the `aggregate` generic, from the `stats` namespace. Although there is a `aggregate.data.frame` method, there is a more intuitive interface to it through the `aggregate.formula` method. Again, it is necessary to set `simplify=FALSE` to keep quantities:
```{r}
(iris.q.agg <- aggregate(. ~ Species, data = iris.q, mean, simplify=FALSE))
```
Apparently, the output has no metadata associated, but what really happens is that the resulting columns are lists:
```{r}
class(iris.q.agg$Sepal.Length)
```
Therefore, as in the `tapply`/`by` case, they must be unlisted with care to still preserve the metadata:
```{r}
unlist_quantities <- function(x) {
stopifnot(is.list(x) || is.data.frame(x))
unlist <- function(x) {
if (any(class(x[[1]]) %in% c("quantities", "units", "errors")))
do.call(c, x)
else x
}
if (is.data.frame(x))
as.data.frame(lapply(x, unlist), col.names=colnames(x))
else unlist(x)
}
unlist_quantities(iris.q.agg)
```
And this method works for the `tapply`/`by` case too:
```{r}
unlist_quantities(sepal.length.agg)
```
### Column Joining
Joining data frames by common columns can done with the `merge` generic. Such operations are based on appending columns, which may be subset or replicated to fit the length of the merged observations. Therefore, quantities should be preserved in all cases. In the following example, we generate a data frame with the height per species and then merge it with the main data set:
```{r}
height <- data.frame(
Height = set_quantities(c(55, 60, 45), cm, c(45, 30, 35)),
Species = c("setosa", "virginica", "versicolor")
)
head(merge(iris.q, height))
```
### (Un)Pivoting
The `reshape` function, from the `stats` namespace, provides an interface for both pivoting and unpivoting (i.e., *tidyfying* data). In the case of the `iris` data set, we would say that it is in the *wide format*, because each row has more than one observation.
This function has a quite peculiar nomenclature. First of all, the unpivoting operation is accessed by providing the argument `direction="long"`. We need to define the `varying` columns (columns to unpivot), as character or indices, and they are unpivoted based on their names. By default, the separator `sep="."` is used, which means that `Sepal.Width` will be broken down into `Sepal` and `Width`, and the former will be unpivoted with the latter as grouping variable. We can specify the name of the grouping variable with the `timevar` argument.
Putting everything together, this is how to unpivot the data set by the dimension (which we will call it `dim`) of the petal/sepal:
```{r}
long.1 <- reshape(iris.q, varying=1:4, timevar="dim", idvar="dim.id", direction="long")
head(long.1)
```
It can be noted that the unpivoting also generates an index to indentify multiple records from the same group. We have changed the name of that identifier to `dim.id` (just `id` by default).
We can further unpivot sepal and petal as the `part` of the flower. First, we need to prepend a common identifier to columns 3 and 4, which are to be unpivoted:
```{r}
names(long.1)[3:4] <- paste0("value.", names(long.1)[3:4])
long.2 <- reshape(long.1, varying=3:4, timevar="part", idvar="part.id", direction="long")
head(long.2)
```
And the final result has one tidy observation per row.
The pivoting operation can be accessed by providing the argument `direction="wide"`. The process is almost symmetrical, but we need to specify `v.names`, as character, instead of `varying` columns. First, we can pivot by flower part:
```{r}
wide.1 <- reshape(long.2, v.names="value", timevar="part", idvar="part.id", direction="wide")
head(wide.1)
```
Then, we remove `"value."` from the column names and pivot by dimension (note that indices are removed to match the initial data frame):
```{r}
names(wide.1)[5:6] <- sub("value\\.", "", names(wide.1)[5:6])
wide.2 <- reshape(wide.1, v.names=c("Sepal", "Petal"), timevar="dim", idvar="dim.id", direction="wide")
wide.2$dim.id <- NULL
wide.2$part.id <- NULL
head(wide.2)
```
We have seen that quantities have been correctly preserved through the whole process. Finally, we can check whether both data frames are identical. Given that the order of columns have changed, we can simply check this column name by column name and then put everything together:
```{r}
all(sapply(colnames(iris.q), function(col) all(iris.q[[col]] == wide.2[[col]])))
```
### Plotting
```{r, cache = FALSE, include=FALSE}
required <- c(
packageVersion("errors") >= "0.3.6.1",
packageVersion("units") >= "0.8-0"
)
if (!all(required))
knitr::opts_chunk$set(eval = FALSE)
```
Quantities support R base scatterplots out of the box: errors are displayed as segments around each point and units are automatically added to the corresponding axis label.
Here is an example of a simple plot of a single quantity, where each value is automatically indexed in the x-axis:
```{r}
# vector plots
with(iris.q, plot(Sepal.Width, col=Species))
```
X-Y scatterplots support units and errors in both axes:
```{r, eval=FALSE}
# x-y plots
with(iris.q, plot(Sepal.Length, Sepal.Width, col=Species))
# dataframe plots
plot(iris.q[, c("Sepal.Length", "Sepal.Width")], col=iris.q$Species)
```
which are equivalent, and produce the same result, as the formula method:
```{r}
plot(Sepal.Width ~ Sepal.Length, iris.q, col=Species)
```
There is a fundamental limitation in R base for mixed quantities and non-quantities data due to S3 dispatch. It is possible, for instance, to plot quantities in the x-axis and numeric data in the y-axis:
```{r}
plot(as.numeric(Sepal.Width) ~ Sepal.Length, iris.q, col=Species)
```
However, when the x-axis has numeric data, quantities methods will not be dispatched for the y-axis:
```{r}
plot(Sepal.Width ~ as.numeric(Sepal.Length), iris.q, col=Species)
```
One way to overcome this limitation is to set unitless and errorless quantities in the x-axis:
```{r}
plot(Sepal.Width ~ set_quantities(as.numeric(Sepal.Length), 1, 0), iris.q, col=Species)
```
## Tidyverse
```{r, cache = FALSE, include=FALSE}
required <- c("dplyr", "tidyr")
knitr::opts_chunk$set(eval = TRUE)
if (!all(sapply(required, requireNamespace, quietly = TRUE)))
knitr::opts_chunk$set(eval = FALSE)
```
The [*core tidyverse*](https://www.tidyverse.org/packages/) includes the following packages: `ggplot2`, `dplyr`, `tidyr`, `readr`, `purrr`, `tibble`, `stringr` and `forcats`. This section covers use cases for [`dplyr`](https://dplyr.tidyverse.org/) (everything except for pivoting and unpivoting), [`tidyr`](https://tidyr.tidyverse.org/) (for pivoting and unpivoting) and `ggplot2` (for plotting).
```{r, message=FALSE, warning=FALSE}
library(dplyr); packageVersion("dplyr")
library(tidyr); packageVersion("tidyr")
```
Although not strictly necessary, we will convert the data frame to `tibble` format to enjoy the formatting provided by `pillar`.
```{r}
iris.q <- as_tibble(iris.q)
head(iris.q)
```
Since `dplyr` 1.0.0, as we will see, there is enhanced support for custom S3 classes thanks to the new implementation based on `vctrs` >= 0.3.0. Packages `units` >= 0.6-7, `errors` >= 0.3.4 and `quantities` >= 0.1.5 add support for this approach.
### Row Subsetting
The `filter` generic finds observations where conditions hold. The main difference with base subsetting is that, if a condition evaluates to `NA` for a certain row, it is dropped. As in the base case, another quantities object must be defined for the comparison:
```{r}
iris.q %>%
filter(Sepal.Length > set_quantities(75, mm)) %>%
head()
```
There are also three scoped variants available (`filter_all`, `filter_if`, `filter_at`) and a subsetting function by row number called `slice`. All of them preserve quantities.
### Row Ordering
The `arrange` generic sorts variables in a straightforward way, and it is compatible with quantities:
```{r}
iris.q %>%
arrange(Sepal.Length) %>%
head()
```
The `desc` function can be applied to individual variables to arrange in descending order.
### Column Transformation
There are two generics for column transformations: `mutate` modifies or adds new variables preserving the existing ones, while `transmute` drops the existing variables. The syntax is very similar to base functions `transform` and `within`, and equally compatible with quantities:
```{r}
iris.q %>%
transmute(
Species = Species,
Petal.Area = Petal.Length * Petal.Width,
Sepal.Area = Sepal.Length * Sepal.Width
) %>%
head()
```
### Row Aggregation
`dplyr` breaks down aggregation operations in two distinct parts: grouping (with `group_by`) and summarising (using `summarise` and others). Since `dplyr` >= 1.0.0, operations on aggregated data is now fully compatible with quantities and,compared to base methods, no fancy unlisting is required:
```{r}
iris.q %>%
group_by(Species) %>%
summarise_all(mean)
```
### Column Joining
Several verbs are provided for different types of joins, such as `inner_join`, `left_join`, `right_join` or `full_join`. Internally, they use the same grouping mechanism than summaries. Therefore, since `dplyr` >= 1.0.0, these are fully compatible with quantities too:
```{r}
iris.q %>%
left_join(data.frame(
Height = set_quantities(c(55, 60, 45), cm, c(45, 30, 35)),
Species = c("setosa", "virginica", "versicolor")
)) %>%
head()
```
The only difference with base `merge` here is that `dplyr` does not reorder columns with respect to the left-hand side.
### (Un)Pivoting
Finally, pivoting and unpivoting is handled by a separate package, `tidyr`. Historically, this was managed using the verbs `spread` (pivot) and `gather` (unpivot). These verbs, which are not compatible with quantities, are deprecated and no longer maintained.
Instead, there are new and more straightforward verbs for (un)pivoting data frames called `pivot_wider` (equivalent to `spread`) and `pivot_longer` (equivalent to `gather`). These verbs do make use of the new approach brought by `vctrs` and therefore are fully compatible with quantities.
Compared to base R, the unpivoting operation is substantially more straightforward. In the next example, we directly merge the four columns of interest into the `value` column, and the correspoding column names are gathered into the `name` column. Such a column is then separated into flower `part` (sepal, petal) and `dim` (length, height):
```{r}
iris.q %>%
pivot_longer(1:4) %>%
separate(name, c("part", "dim")) %>%
head()
```
In the following example, we first unpivot the original data set, then we assign quantities and try to pivot it to obtain `iris.q` back, and it just works:
```{r}
iris %>%
# first gather, with row numbers as row_id
mutate(row_id = 1:n()) %>%
pivot_longer(1:4) %>%
# assign quantities
mutate(value = set_quantities(value, cm, value * 0.05)) %>%
# now spread and remove the row_id
pivot_wider() %>%
select(-row_id) %>%
head()
```
### Plotting
```{r, cache = FALSE, include=FALSE}
required <- c(
packageVersion("errors") >= "0.3.6.1",
packageVersion("units") >= "0.8-0"
)
if (!all(required))
knitr::opts_chunk$set(eval = FALSE)
```
```{r, message=FALSE, warning=FALSE}
library(ggplot2); packageVersion("ggplot2")
```
Quantities packages provide `ggplot2` elements to make scatterplots straightforward:
- `units` provides automatic detection of units scale type, with optional conversion and customization via `scale_x_units()` and `scale_y_units()`.
- `errors` provides automatic placement of errorbars via `geom_errors()`.
- `quantities` provides a compatibility layer between them, so that conversions from `scale_[x|y]_units` affect errorbars too.
By default, units are automatically placed in the axes:
```{r}
p0 <- ggplot(iris.q) + aes(Sepal.Length, Sepal.Width, color=Species) +
geom_point()
p0
```
And errobars can be requested via `geom_errors()`:
```{r}
p0 + geom_errors()
```
Errors may be dropped from any axis:
```{r}
p0 + geom_errors(aes(x=drop_errors(Sepal.Length)))
```
And units can be converted for display:
```{r}
p0 + geom_errors() + scale_x_units(unit="mm") + scale_y_units(unit="m")
```
## Summary
R base works smoothly with quantities in most cases. The only shortcoming is that some care must be applied to aggregations. In particular, simplification must be explicitly disabled (`simplify=FALSE`), and such a simplification (i.e., converting lists to vectors of quantities) must be applied manually while avoiding `unlist`.
Since `dplyr` 1.0.0 and `tidyr` 1.1.0 (for `units` >= 0.6-7, `errors` >= 0.3.4 and `quantities` >= 0.1.5), the new `vctrs`-based approach brings us full compatibility with quantities for all the operations considered in this document, including grouped operations, joining and pivoting, which did not work for previous versions.
Both R base and `ggplot2` plots work out of the box, although the latter provides much more flexibility and can be used independently of the tidyverse.
## A Note on `data.table`
The [`data.table`](https://github.com/Rdatatable/data.table/wiki) package is another popular data tools, which provides *a high-performance version of base R's `data.frame` with syntax and feature enhancements for ease of use, convenience and programming speed*.
Long story short, we have not included a section on `data.table` because *currently* (v1.11.4) it does not work well with vectorised attributes. The underlying problem is similar to `dplyr`'s issue, but unfortunately it affects more operations, including row subsetting and ordering. Only column transformation seems to work, and other operations generate corrupted objects.
We have found that defining quantities columns as lists (where each element consists of a single value, with unit and uncertainty) may be a workaround, but this probably would be a serious performance penalty for a package that is typically chosen for speed reasons.