clean_names could transliterate accented characters #120
Comments
|
I love this idea. I will look into it more; could you share a quick reproducible example ("reprex") I could test out? |
|
This might be of interest. |
|
Well, the special characters got corrupted when I used I was not aware of the platform problem pointed out by @Tazinho, but I believe this solution is a good start until we find a better one. Here it is:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
clean_names <- function(dat) {
  old_names <- names(dat)
  new_names <- old_names %>%
    gsub("'", "", .) %>%
    gsub("\"","", .) %>%
    gsub("%", "percent", .) %>%
    gsub("^[ ]+", "", .) %>%
    make.names(.) %>%
    gsub("[.]+", "_", .) %>%
    gsub("[_]+", "_", .) %>%
    tolower(.) %>%
    gsub("_$", "", .) %>%
    ## here is the new line to transliterate the characters ##
    stringi::stri_trans_general("latin-ascii")
  dupe_count <- sapply(1:length(new_names), function(i) {
    sum(new_names[i] == new_names[1:i])
  })
  new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_")
  stats::setNames(dat, new_names)
}
tmp_df <- data_frame(a = 1, b = 2, c = 3, d = 4, e = 5)
names(tmp_df) <- c("á", "ê", "ï", "õ", "ù")
tmp_df
#> # A tibble: 1 x 5
#> á ê ï õ ù
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4 5
df_clean <- clean_names(tmp_df)
df_clean
#> # A tibble: 1 x 5
#> a e i o u
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4 5
Session info
devtools::session_info() |
|
Thanks @Tazinho for looking into this w/ stringi. As a Windows user I can (unfortunately) attest to the cross-platform differences. The example with But, it looks like gagolews/stringi#270 will enable @fernandovmacedo's changes to |
|
Looks like that stringi fix for Windows worked, I was just making user errors on my end. That fix isn't on CRAN yet. Until stringi 1.1.6 is on CRAN, if we add the line @fernandovmacedo suggests to Then after stringi goes on CRAN, I suggest janitor be dependent on the latest version of stringi. @fernandovmacedo - would you like to add that line, and some tests (like moving your above example into a formal test with |
|
Sure, I will work on the Pull Request this weekend. |
|
added this into snakecase, see #96 (couldn't reference from there for some reason...) |
clean_names() transliterates accented letters. Closes sfirke#120
clean_names() transliterates accented letters sfirke#120
Data frames whose column names contain accented characters (things like áôü) have to have those names wrapped in quotes in dplyr. If clean_names transliterated them to ASCII, the problem would be solved.
The solution is quite simple: you would only need to add this line to the function before the duplicate-handling part, and also include stringi in the dependencies of the package.
stringi::stri_trans_general(., "Latin-ASCII")
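To see what that one line does in isolation, here is a minimal sketch that applies the same stringi call to a plain character vector instead of running it inside clean_names. The vector name `accented` and the use of `\u` escapes are my own choices for illustration; the escapes only keep the example independent of the file encoding, which seems worth doing given the platform differences mentioned in this thread.

```r
# Hypothetical example vector (not from janitor): the names á, ê, ï, õ, ù,
# written as Unicode escapes so the script does not depend on file encoding.
accented <- c("\u00e1", "\u00ea", "\u00ef", "\u00f5", "\u00f9")

# stri_trans_general(str, id) applies an ICU transform to each element.
# "Latin-ASCII" maps accented Latin letters to their closest ASCII letters.
ascii_names <- stringi::stri_trans_general(accented, "Latin-ASCII")

print(ascii_names)
```

Inside the pipe chain the `.` placeholder plays the role of `accented` here, so the transliteration happens on the whole vector of names before the duplicate handling runs.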