The more I review my own code and other people's, the more I realize that base::factor() causes a lot of problems by not throwing a warning when the input contains a value that is not among the specified levels. Instead, it silently generates NA, which can cause serious confusion later.
Since it makes little sense to me to specify levels that do not exist in the actual input, I would expect some verbosity about it.
Here is an example:
library(glue)
library(rlang)
library(magrittr)
df_labels = read.table(header=TRUE, text="
level label
setsa SETO #typo
verssicolor VERSICO #typo
virginica VIRGINI
")
x=as.character(iris$Species)
f1 = factor(x, levels=df_labels$level, labels=df_labels$label)
table(f1)
#> f1
#>    SETO VERSICO VIRGINI 
#>       0       0      50 
Before calling table(), the user has no idea that the previous call had "failing" cases.
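As a rough sketch of how the silent loss could be spotted without table(), one can compare NA counts before and after the conversion. This uses base R only; `typo_levels` and `f_check` are hypothetical names introduced for illustration and mirror the misspelled levels above.

```r
# Hypothetical names for illustration: `typo_levels`, `f_check`.
x <- as.character(iris$Species)
typo_levels <- c("setsa", "verssicolor", "virginica")  # same typos as above

f_check <- factor(x, levels = typo_levels)

sum(is.na(x))        # NAs already present in the input
sum(is.na(f_check))  # NAs after conversion; the difference was created by factor()

# Input values that match none of the requested levels
setdiff(unique(x), typo_levels)
```

With the iris data, the input has no NA while the factor has 100, and the unmatched values are "setosa" and "versicolor".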
Here is the function I'm using instead:
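fct = function(x=character(), levels, labels=levels, ...){
  # Flag every element of `x` that matches none of the requested levels
  miss_x = !x %in% levels
  if(any(miss_x)){
    # List each unknown value once in the warning message
    miss_x_s = unique(x[miss_x]) %>% glue_collapse(", ")
    warn(c("Unknown factor level in `x`, NA generated.",
           x=glue("Unknown levels: {miss_x_s}")))
  }
  # Delegate to base::factor(); unknown values still become NA
  factor(x, levels, labels, ...)
}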
f2 = fct(x, levels=df_labels$level, labels=df_labels$label)
#> Warning: Unknown factor level in `x`, NA generated.
#> x Unknown levels: setosa, versicolor
table(f2)
#> f2
#>    SETO VERSICO VIRGINI 
#>       0       0      50 
Created on 2022-02-13 by the reprex package (v2.0.1)
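For comparison, here is a minimal base-R sketch of the same idea, with no dependency on glue, rlang, or magrittr. The name `fct_base` is hypothetical. It also makes one assumption that differs from the function above: values that are already NA in `x` are not reported as unknown levels, since factor() did not create them.

```r
# Hypothetical base-R variant; `fct_base` is an illustrative name.
fct_base <- function(x = character(), levels, labels = levels, ...) {
  # Assumption: pre-existing NA values are not treated as unknown levels
  unknown <- unique(x[!is.na(x) & !x %in% levels])
  if (length(unknown) > 0) {
    warning("Unknown factor level in `x`, NA generated: ",
            paste(unknown, collapse = ", "), call. = FALSE)
  }
  factor(x, levels = levels, labels = labels, ...)
}

# Warns about "z" only; the existing NA is left alone
f3 <- suppressWarnings(fct_base(c("a", "b", NA, "z"), levels = c("a", "b")))
```

Whether pre-existing NA values should trigger the warning is a design choice; the version above stays quiet about them so that the warning only fires for values the conversion itself turns into NA.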
As more and more people rely on the tidyverse to write cleaner code, I think this could belong here.