separate() fails with Arabic text #1085

Closed
lizzhuntley opened this issue Jan 6, 2021 · 3 comments
Labels
bug (an unexpected problem or unintended behavior), strings 🎻

@lizzhuntley lizzhuntley commented Jan 6, 2021

Greetings! Thank you in advance for your time!

In the dataframe dict below, I am trying to separate the character strings in the root_letters column so that:

  1. Each letter will appear in its own column, AND
  2. The words are parsed correctly for a right-to-left script, i.e.
    • the first letter of the word is correctly identified as the right-most letter in the string AND
    • the direction of the split goes from right to left [although if the split defaults to starting on the left and moving right, this second issue could be fixed by reordering the columns]
    • (the original post linked to the desired outcome here) using tidyr

Unfortunately, the separate() function doesn't seem to recognize the Arabic text (see the "bad offset" warnings in the output).

Some workarounds have been kindly proposed in RStudio Community, but I still end up with a strange reversed order when I open the edited dataframe in Excel.

I'd still like to know if tidyr can actually currently handle this (and if not, if such support for parsing of right-to-left texts could be added). Thank you!

# Load packages
library(tidyr)

# Create sample dataframe
root_letters <- c("أ", "آب", "أباجور", "دار")
entry <- c(1:4)

dict <- data.frame(entry, root_letters)
dict # display dataframe
#>   entry root_letters
#> 1     1            أ
#> 2     2           آب
#> 3     3       أباجور
#> 4     4          دار

# Find the maximum number of letters in a root
long <- max(nchar(dict$root_letters))

# Separate function in tidyr package
dict_sep <- dict %>% separate(
     root_letters, # column to separate
     sep = "", # separate every character
     into = paste0("r", long:1), # names of the new columns, as a character vector
     remove = FALSE, # keep original input column
     extra = "drop", # drop any extra values without a warning
     fill = "left") # fill values on the left
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 1
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 2
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 3
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 4

dict_sep # display outcome
#>   entry root_letters   r6   r5   r4   r3 r2     r1
#> 1     1            أ <NA> <NA> <NA> <NA>         أ
#> 2     2           آب <NA> <NA> <NA> <NA>        آب
#> 3     3       أباجور <NA> <NA> <NA> <NA>    أباجور
#> 4     4          دار <NA> <NA> <NA> <NA>       دار

@hadley hadley commented Feb 17, 2021

Minimal reprex:

library(tidyr)

df <- tibble(
  id = 1:4,
  string = c("أ", "آب", "أباجور", "دار")
)
df %>% separate(string, "", into = paste0("r", 6:1), fill = "left")
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 1
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 2
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 3
#> Warning in gregexpr(pattern, x, perl = TRUE): PCRE error
#>  'bad offset into UTF string'
#>  for element 4
#> # A tibble: 4 x 7
#>      id r6    r5    r4    r3    r2    r1    
#>   <int> <chr> <chr> <chr> <chr> <chr> <chr> 
#> 1     1 <NA>  <NA>  <NA>  <NA>  ""    أ     
#> 2     2 <NA>  <NA>  <NA>  <NA>  ""    آب    
#> 3     3 <NA>  <NA>  <NA>  <NA>  ""    أباجور
#> 4     4 <NA>  <NA>  <NA>  <NA>  ""    دار

Created on 2021-02-17 by the reprex package (v1.0.0)

I'm not yet sure how to fix this, but in the meantime, you can work around it by splitting the string yourself and then using unnest_wider():

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:4,
  string = c("أ", "آب", "أباجور", "دار")
)

df %>% 
  rowwise() %>% 
  mutate(
    letters = strsplit(string, ""),
    letters = list(setNames(letters, paste0("r", seq_along(letters))))
  ) %>% 
  unnest_wider(letters)
#> # A tibble: 4 x 8
#>      id string r1    r2    r3    r4    r5    r6   
#>   <int> <chr>  <chr> <chr> <chr> <chr> <chr> <chr>
#> 1     1 أ      أ     <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2     2 آب     آ     ب     <NA>  <NA>  <NA>  <NA> 
#> 3     3 أباجور أ     ب     ا     ج     و     ر    
#> 4     4 دار    د     ا     ر     <NA>  <NA>  <NA>

Created on 2021-02-17 by the reprex package (v1.0.0)

This prints incorrectly, but using View() in RStudio suggests the underlying data is ok.
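The original request also asked for the first letter to sit in the right-most column. As a minimal sketch building on the workaround above (not a tidyr feature), the columns can be reordered after unnesting. The names `n_max` and `df_rtl` are only illustrative, and the sketch assumes `nchar()` counts one character per letter for these strings, as the output above suggests.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:4,
  string = c("أ", "آب", "أباجور", "دار")
)

# The longest word decides how many letter columns exist (assumed name: n_max)
n_max <- max(nchar(df$string))

df_rtl <- df %>%
  rowwise() %>%
  mutate(
    # Split each word into its characters, in reading order
    letters = strsplit(string, ""),
    # Name them r1, r2, ... so r1 is always the first letter of the word
    letters = list(setNames(letters, paste0("r", seq_along(letters))))
  ) %>%
  ungroup() %>%
  unnest_wider(letters) %>%
  # Reorder so the columns run r6, r5, ..., r1 from left to right
  select(id, string, all_of(paste0("r", n_max:1)))
```

With this layout `r1` holds the first letter of each word and is the last column, so shorter words are padded with missing values on the left.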

@hadley hadley changed the title from "Support for parsing Arabic texts (right-to-left scripts)" to "separate() fails with Arabic text" Feb 17, 2021
@hadley hadley added the bug (an unexpected problem or unintended behavior) and strings 🎻 labels Feb 17, 2021

@hadley hadley commented Feb 17, 2021

stringr::str_split() (which is powered by stringi, which uses ICU, which should give the correct result) orders characters in the same way as strsplit(), so if that's not what you want, you'll need a rev().

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:4,
  string = c("أ", "آب", "أباجور", "دار")
)

stringr::str_split(df$string[[3]], stringr::boundary("character", locale = "ar"))
#> [[1]]
#> [1] "أ" "ب" "ا" "ج" "و" "ر"
strsplit(df$string[[3]], "")
#> [[1]]
#> [1] "أ" "ب" "ا" "ج" "و" "ر"

Created on 2021-02-17 by the reprex package (v1.0.0)


@hadley hadley commented Dec 17, 2021

This will be resolved by the new family of separate functions proposed in tidyverse/tidyups#23, since they'll be backed by stringi, not base R.
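For readers arriving later: a hedged sketch of what this could look like, assuming tidyr 1.3.0 or later, where `separate_wider_position()` is available. The `widths` vector and the names `n_max` and `df_wide` are illustrative choices, not something taken from this thread.

```r
library(tidyr)

df <- tibble(
  id = 1:4,
  string = c("أ", "آب", "أباجور", "دار")
)

# One column of width 1 per letter, named r1 ... r6 (assumed naming scheme)
n_max <- max(nchar(df$string))
widths <- setNames(rep(1L, n_max), paste0("r", seq_len(n_max)))

df_wide <- df %>%
  separate_wider_position(
    string,
    widths = widths,
    too_few = "align_start", # allow words shorter than n_max letters
    cols_remove = FALSE      # keep the original string column
  )
```

Here `r1` is the first letter in reading order. To lay the columns out right to left, they can be reordered afterwards, for example with `dplyr::select()`.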

@hadley hadley closed this as completed Dec 17, 2021