/
intro_sjlabelled.Rmd
402 lines (303 loc) · 13.6 KB
/
intro_sjlabelled.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
---
title: "Labelled Data and the sjlabelled-Package"
author: "Daniel Lüdecke"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Labelled Data and the sjlabelled-Package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
if (!requireNamespace("sjmisc", quietly = TRUE) ||
!requireNamespace("haven", quietly = TRUE) ||
!requireNamespace("magrittr", quietly = TRUE) ||
!requireNamespace("dplyr", quietly = TRUE)) {
knitr::opts_chunk$set(eval = FALSE)
}
```
This package provides functions to read and write data between R and other statistical software packages like _SPSS_, _SAS_ or _Stata_ and to work with labelled data; this includes easy ways to get and set label attributes, to convert labelled vectors into factors (and vice versa), or to deal with multiple declared missing values etc.
This vignette gives an overview of functions to work with labelled data.
# Labelled Data
_Labelled data_ (or labelled vectors) is a common data structure in other statistical environments to store meta-information about variables, like variable names, value labels or multiple defined missing values.
Labelled data not only extends **R**'s capabilities to deal with proper value _and_ variable labels, but also facilitates the representation of different types of missing values, like in other statistical software packages. Typically, in R, multiple declared missings cannot be represented in a similar way, like in 'SPSS' or 'SAS', with the regular missing values. However, the **haven**-package introduced `tagged_na` values, which can do this. Tagged NA's work exactly like regular R missing values except that they store one additional byte of information: a tag, which is usually a letter ("a" to "z") or also may be a character number ("0" to "9"). This allows to indicate different missings.
Functions of **sjlabelled** do not necessarily require vectors of class `labelled` or `haven_labelled`. The `labelled` class, implemented by the packages **haven** and **labelled**, may cause troubles with other packages, thus it's only intended as being an intermediate data structure that should be converted to common R classes. However, coercing a `labelled` vector to other classes (like factor or numeric) typically means that meta information like value and variable label attributes are lost. Actually, there is no need to drop these attributes for non-`labelled`-class vectors. Functions like `lm()` simply copy these attributes to the data that is included in the returned object. Packages like **sjPlot** support labelled data for easily annotated data visualization. **sjlabelled** supports working with _labelled data_ and offers functions to benefit from these features.
**Note:** Since package-version 2.0 of the **haven**-package, the `labelled`-class attribute was changed to `haven_labelled`, to avoid interferences with the **Hmisc**-package.
## Labelled Data in haven and labelled
The **labelled**-package is intended to support `labelled` / `haven_labelled` metadata structures, thus the data structure of labelled vectors in **haven** and **labelled** is the same.
Labelled data in this format stores information about value labels, variable names and multiple defined missing values. However, _variable names_ are only part of this information if data was imported with one of **haven**'s read-functions. Adding a variable label attribute is (at least up to version 1.0.0) not possible via the `labelled()`-constructor method.
```{r}
library(haven)
x <- labelled(
c(1:3, tagged_na("a", "c", "z"), 4:1),
c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
"Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
print(x)
```
A `labelled` vector can either be a numeric or character vector. Conversion to factors copies the value labels as factor levels, but drops the label attributes and missing information:
```{r}
is.na(x)
as_factor(x)
is.na(as_factor(x))
```
## Labelled Data in sjlabelled
**sjlabelled** supports label attributes in **haven**-style (`label` and `labels`). You're not restricted to the `labelled` class for vectors when working with **sjlabelled** and labelled data. Hence, you can have vectors of common R classes and still use information like variable or value labels.
```{r message=FALSE}
library(sjlabelled)
# sjlabelled-sample data, an atomic vector with label attributes
data(efc)
str(efc$e16sex)
```
# Value Labels
## Getting value labels
The `get_labels()`-method is a generic method to return value labels of a vector or data frame.
```{r}
get_labels(efc$e42dep)
```
You can prefix the value labels with the associated values or return them as named vector with the `values` argument.
```{r}
get_labels(efc$e42dep, values = "p")
```
`get_labels()` also returns "labels" of factors, even if the factor has no label attributes.
```{r}
x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x)
```
To ensure that labels are only returned for vectors with label-attribute, use the `attr.only` argument.
```{r}
x <- factor(c("low", "mid", "low", "hi", "mid", "low"))
get_labels(x, attr.only = TRUE)
```
If a vector has a label attribute, only these labels are returned. Non-labelled values are excluded from the output by default...
```{r}
# get labels, including tagged NA values
x <- labelled(
c(1:3, tagged_na("a", "c", "z"), 4:1),
c("Agreement" = 1, "Disagreement" = 4, "First" = tagged_na("c"),
"Refused" = tagged_na("a"), "Not home" = tagged_na("z"))
)
get_labels(x)
```
... however, you can add non-labelled values to the return value as well, using the `non.labelled` argument.
```{r}
get_labels(x, non.labelled = TRUE)
```
Tagged missing values can also be included in the output, using the `drop.na` argument.
```{r}
get_labels(x, values = "n", drop.na = FALSE)
```
## Getting labelled values
The `get_values()` method returns the values for labelled values (i.e. values that have an associated label). We still use the vector `x` from the above examples.
```{r}
print(x)
get_values(x)
```
With the `drop.na` argument you can omit those values from the return values that are defined as missing.
```{r}
get_values(x, drop.na = TRUE)
```
## Setting value labels
With `set_labels()` you can add label attributes to any vector.
```{r}
x <- sample(1:4, 20, replace = TRUE)
# return new labelled vector
x <- set_labels(x, labels = c("very low", "low", "mid", "hi"))
x
```
If more labels than values are given, only as many labels elements are used as values are present.
```{r}
x <- c(2, 2, 3, 3, 2)
x <- set_labels(x, labels = c("a", "b", "c"))
x
```
However, you can force to use all labels, even for values that are not in the vector, using the `force.labels` argument.
```{r}
x <- c(2, 2, 3, 3, 2)
x <- set_labels(
x,
labels = c("a", "b", "c"),
force.labels = TRUE
)
x
```
For vectors with more unique values than labels, additional labels for non-labelled values are added.
```{r}
x <- c(1, 2, 3, 2, 4, NA)
x <- set_labels(x, labels = c("yes", "maybe", "no"))
x
```
Use `force.values` to add only those labels that have been passed as argument.
```{r}
x <- c(1, 2, 3, 2, 4, NA)
x <- set_labels(
x,
labels = c("yes", "maybe", "no"),
force.values = FALSE
)
x
```
To add explicit labels for values (without adding more labels than wanted and without dropping labels for values that do not appear in the vector), use a named vector of labels as argument. The arguments `force.values` and `force.labels` are ignored when using named vectors.
```{r}
x <- c(1, 2, 3, 2, 4, 5)
x <- set_labels(
x,
labels = c("strongly agree" = 1,
"totally disagree" = 4,
"refused" = 5,
"missing" = 9)
)
x
```
If you want to set different value labels for a complete data frame, if you provide the labels as a `list`. For each variable in the data frame, provide a list element with value labels as character vector. Note that the length of the list must be equal to the number of variables (columns) in the data frame.
```{r}
tmp <- data.frame(
a = c(1, 2, 3),
b = c(1, 2, 3),
c = c(1, 2, 3)
)
labels <- list(
c("one", "two", "three"),
c("eins", "zwei", "drei"),
c("un", "dos", "tres")
)
tmp <- set_labels(tmp, labels = labels)
str(tmp)
```
You can use `set_labels()` within a pipe-workflow with _dplyr_.
```{r message=FALSE}
library(dplyr)
library(sjmisc) # for frq()
data(efc)
efc %>%
select(c82cop1, c83cop2, c84cop3) %>%
set_labels(labels = c("not often" = 1, "very often" = 4)) %>%
frq()
```
# Variable Labels
## Getting variable labels
The `get_label()`-method returns the variable label of a vector or all variable labels from a data frame.
```{r}
get_label(efc$e42dep)
get_label(efc, e42dep, e16sex, e15relat)
```
If a vector has no variable label, `NULL` is returned. However, `get_label()` also allows returning a standard value instead of `NULL`, in case the vector has no label attribute. This is useful to combine with `deparse(substitute())` in function calls, so - for instance - the name of the vector can be used as default value if no variable labels are present.
```{r}
dummy <- c(1, 2, 3)
testit <- function(x) get_label(x, def.value = deparse(substitute(x)))
# returns name of vector, if it has no variable label
testit(dummy)
```
If you want human-readable labels, you can use the `case`-argument, which will pass the labels to a string parser in the [snakecase-package](https://cran.r-project.org/package=snakecase).
```{r}
data(iris)
# returns no labels, because iris-data is not labelled
get_label(iris)
# returns the column name as default labels, if data is not labelled
get_label(iris, def.value = colnames(iris))
# labels are parsed in a readable way
get_label(iris, def.value = colnames(iris), case = "parsed")
```
## Setting variable labels
The `set_label()` function adds the variable label attribute to a vector. You can either return a new vector, or label an existing vector
```{r}
x <- sample(1:4, 10, replace = TRUE)
# return new vector
x <- set_label(x, label = "Dummy-variable")
str(x)
# label existing vector
set_label(x) <- "Another Dummy-variable"
str(x)
```
`set_label()` can also set variable labels for a data frame. In this case, the variable attributes get an additional `name` attribute with the vector's name. This makes it easier to see which label belongs to which vector.
```{r}
x <- data.frame(
a = sample(1:4, 10, replace = TRUE),
b = sample(1:4, 10, replace = TRUE),
c = sample(1:4, 10, replace = TRUE)
)
x <- set_label(x, label = c("Variable A",
"Variable B",
"Variable C"))
str(x)
get_label(x)
```
An alternative to `set_label()` is `var_labels()`, which also works within pipe-workflows. `var_labels()` requires named vectors as arguments to match the column names of the input, and set the associated variable labels.
```{r}
x <- data.frame(
a = sample(1:4, 10, replace = TRUE),
b = sample(1:4, 10, replace = TRUE),
c = sample(1:4, 10, replace = TRUE)
)
library(magrittr) # for pipe
x %>%
var_labels(
a = "Variable A",
b = "Variable B",
c = "Variable C"
) %>%
str()
```
# Missing Values
## Defining missing values
`set_na()` converts values of a vector or of multiple vectors in a data frame into `NA`s. With `as.tag = TRUE`, `set_na()` creates tagged `NA` values, which means that these missing values get an information tag and a value label (which is, by default, the former value that was converted to NA). You can either return a new vector/data frame, or set `NA`s into an existing vector/data frame.
```{r}
x <- sample(1:8, 100, replace = TRUE)
# show value distribution
table(x)
# set value 1 and 8 as tagged missings
x <- set_na(x, na = c(1, 8), as.tag = TRUE)
x
# show value distribution, including missings
table(x, useNA = "always")
# now let's see, which NA's were "1" and which were "8"
print_tagged_na(x)
x <- factor(c("a", "b", "c"))
x
# set NA into existing vector
x <- set_na(x, na = "b", as.tag = TRUE)
x
```
## Getting missing values
The `get_na()` function returns all tagged NA values. We still use the vector `x` from the previous example.
```{r}
get_na(x)
```
To see the tags of the NA values, use the `as.tag` argument.
```{r}
get_na(x, as.tag = TRUE)
```
## Replacing specific NA with values
While `set_na()` allows you to replace values with (tagged) NA's, `replace_na()` (from package **sjmisc**) allows you to replace either all NA values of a vector or specific tagged NA values with a non-NA value.
```{r}
library(sjmisc) # for replace_na()
data(efc)
str(efc$c84cop3)
efc$c84cop3 <- set_na(efc$c84cop3, na = c(2, 3), as.tag = TRUE)
get_na(efc$c84cop3, as.tag = TRUE)
# this would replace all NA's into "2"
dummy <- replace_na(efc$c84cop3, value = 2)
# labels of former tagged NA's are preserved
get_labels(dummy, drop.na = FALSE, values = "p")
get_na(dummy, as.tag = TRUE)
# No more NA values
frq(dummy)
# In this example, the tagged NA(2) is replaced with value 2
# the new value label for value 2 is "restored NA"
dummy <- replace_na(efc$c84cop3, value = 2, na.label = "restored NA", tagged.na = "2")
# Only one tagged NA remains
get_labels(dummy, drop.na = FALSE, values = "p")
get_na(dummy, as.tag = TRUE)
# Some NA values remain
frq(dummy)
```
## Replacing values labels
With `replace_labels()`, you can replace (change) value labels of labelled values. This can also be used to change the labels of tagged missing values. Make sure to know the missing tag, which can be accessed via `get_na()`.
```{r}
str(efc$c82cop1)
efc$c82cop1 <- set_na(efc$c82cop1, na = c(2, 3), as.tag = TRUE)
get_na(efc$c82cop1, as.tag = TRUE)
efc$c82cop1 <- replace_labels(efc$c82cop1, labels = c("new NA label" = tagged_na("2")))
get_na(efc$c82cop1, as.tag = TRUE)
```