separate() fails with Arabic text #1085
Minimal reprex:
library(tidyr)
df <- tibble(
id = 1:4,
string = c("أ", "آب", "أباجور", "دار")
)
df %>% separate(string, "", into = paste0("r", 6:1), fill = "left")
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 1
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 2
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 3
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#> 'bad offset into UTF string'
#> for element 4
#> # A tibble: 4 x 7
#> id r6 r5 r4 r3 r2 r1
#> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 <NA> <NA> <NA> <NA> "" أ
#> 2 2 <NA> <NA> <NA> <NA> "" آب
#> 3 3 <NA> <NA> <NA> <NA> "" أباجور
#> 4 4 <NA> <NA> <NA> <NA> "" دار
Created on 2021-02-17 by the reprex package (v1.0.0)
I'm not yet sure how to fix this, but in the meantime, you can work around it by splitting the string yourself and then using unnest_wider():
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
id = 1:4,
string = c("أ", "آب", "أباجور", "دار")
)
df %>%
rowwise() %>%
mutate(
letters = strsplit(string, ""),
letters = list(setNames(letters, paste0("r", seq_along(letters))))
) %>%
unnest_wider(letters)
#> # A tibble: 4 x 8
#> id string r1 r2 r3 r4 r5 r6
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 أ أ <NA> <NA> <NA> <NA> <NA>
#> 2 2 آب آ ب <NA> <NA> <NA> <NA>
#> 3 3 أباجور أ ب ا ج و ر
#> 4 4 دار د ا ر <NA> <NA> <NA>
Created on 2021-02-17 by the reprex package (v1.0.0)
This prints incorrectly, but using
library(tidyr)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
id = 1:4,
string = c("أ", "آب", "أباجور", "دار")
)
stringr::str_split(df$string[[3]], stringr::boundary("character", locale = "ar"))
#> [[1]]
#> [1] "أ" "ب" "ا" "ج" "و" "ر"
strsplit(df$string[[3]], "")
#> [[1]]
#> [1] "أ" "ب" "ا" "ج" "و" "ر"
Created on 2021-02-17 by the reprex package (v1.0.0)
This will be resolved by the new family of separate functions proposed in tidyverse/tidyups#23, since they'll be backed by stringi, not base R.
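Until then, a stringi-based split can be done by hand. The following is only a rough sketch, not the implementation proposed in the tidyup: split_chars_left() is a hypothetical helper name, and it assumes no string has more characters than the requested width. It uses stringi::stri_split_boundaries() to split on character boundaries and pads on the left with NA, mirroring fill = "left" and the r6 ... r1 column names from the reprex.

```r
library(stringi)

# Hypothetical helper (not part of tidyr or stringi): split each string into
# user-perceived characters and left-pad with NA up to `width` pieces.
split_chars_left <- function(x, width) {
  pieces <- stri_split_boundaries(x, type = "character")
  # Assumption: every string fits into `width` pieces.
  stopifnot(all(lengths(pieces) <= width))
  padded <- lapply(pieces, function(p) {
    c(rep(NA_character_, width - length(p)), p)
  })
  out <- do.call(rbind, padded)
  # Name the columns r<width>, ..., r1, as in the original separate() call.
  colnames(out) <- paste0("r", rev(seq_len(width)))
  out
}

strings <- c("أ", "آب", "أباجور", "دار")
res <- split_chars_left(strings, width = 6)
res
```

The result is a character matrix; if a tibble is needed, it can be converted with tibble::as_tibble(res) and bound to the id column with dplyr::bind_cols().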
Greetings! Thank you in advance for your time!
In the dataframe dict below, I am trying to separate the character strings in the root_letters column so that:
Unfortunately, the separate() function doesn't seem to recognize the Arabic text (see the "bad offset" warnings in the output).
Some workarounds have been kindly proposed in RStudio Community, but I still end up with a strange reversed order when I try to open the edited data frame in Excel.
I'd still like to know whether tidyr can currently handle this (and, if not, whether support for parsing right-to-left text could be added). Thank you!