separate() right edge fixed position splitting is off by 1 #315
Would you mind updating your reprex to use the reprex package? That way I can see the output more easily.
On reading the documentation for the

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag(): dplyr, stats
df <- tribble(
  ~foo_bar,
  "ABC",
  "DEF"
)
# split left-most character
df %>%
  separate(foo_bar, sep = 1, into = c("foo", "bar"))
#> # A tibble: 2 x 2
#>     foo   bar
#> * <chr> <chr>
#> 1     A    BC
#> 2     D    EF
# expected right-most character split based on documentation, but get a blank column instead
df %>%
  separate(foo_bar, sep = -1, into = c("foo", "bar"))
#> # A tibble: 2 x 2
#>     foo   bar
#> * <chr> <chr>
#> 1   ABC
#> 2   DEF
# split right-most character using sep = -2 (i.e. the 'right-to-left' equivalent of sep = 1)
df %>%
  separate(foo_bar, sep = -2, into = c("foo", "bar"))
#> # A tibble: 2 x 2
#>     foo   bar
#> * <chr> <chr>
#> 1    AB     C
#> 2    DE     F

If I'm looking in the right place, perhaps slightly modifying the strsep function could address this, if a code change is considered worthwhile (assuming the documentation is OK as it is)?

# possible solution?
strsep <- function(x, sep) {
  nchar <- stringi::stri_length(x)
  sep <- c(0, sep, nchar)
  pos <- map(sep, function(i) {
    if (i >= 0) return(i)
    nchar + i
  })
  map(1:(length(pos) - 1), function(i) {
    stringi::stri_sub(x, pos[[i]] + 1, pos[[i + 1]])
  })
}
@markdly thanks for this reprex!
Given this has now been flagged as a bug, here's a reprex for the possible modification to the

library(purrr)
# existing
strsep_old <- function(x, sep) {
  sep <- c(0, sep, -1)
  nchar <- stringi::stri_length(x)
  pos <- map(sep, function(i) {
    if (i >= 0) return(i)
    nchar + i + 1
  })
  map(1:(length(pos) - 1), function(i) {
    stringi::stri_sub(x, pos[[i]] + 1, pos[[i + 1]])
  })
}
# proposed
strsep_new <- function(x, sep) {
  nchar <- stringi::stri_length(x)
  sep <- c(0, sep, nchar)
  pos <- map(sep, function(i) {
    if (i >= 0) return(i)
    nchar + i
  })
  map(1:(length(pos) - 1), function(i) {
    stringi::stri_sub(x, pos[[i]] + 1, pos[[i + 1]])
  })
}
x <- "ABC"
# still the same for +ve sep
identical(strsep_old(x, 0), strsep_new(x, 0))
#> [1] TRUE
identical(strsep_old(x, 1), strsep_new(x, 1))
#> [1] TRUE
identical(strsep_old(x, 2), strsep_new(x, 2))
#> [1] TRUE
identical(strsep_old(x, 3), strsep_new(x, 3))
#> [1] TRUE
identical(strsep_old(x, 4), strsep_new(x, 4)) # nonsense extreme value
#> [1] TRUE
# different for -ve sep
strsep_old(x, -1)
#> [[1]]
#> [1] "ABC"
#>
#> [[2]]
#> [1] ""
strsep_new(x, -1)
#> [[1]]
#> [1] "AB"
#>
#> [[2]]
#> [1] "C"
# other -ve sep results
strsep_new(x, -2)
#> [[1]]
#> [1] "A"
#>
#> [[2]]
#> [1] "BC"
strsep_new(x, -3)
#> [[1]]
#> [1] ""
#>
#> [[2]]
#> [1] "ABC"
Do you want to do a PR? Your change looks correct, so it just needs a couple of unit tests.
Sure! I should be able to submit one in the next 24-48 hrs...
I've marked this issue as WIP, which means I won't work on it (so please let me know if you change your mind).
When putting together the PR, I noticed my original proposed change above wasn't going to be suitable, as there is a separate issue where large negative values for

library(tidyr)
library(dplyr)
df <- tibble(x = c("ab"))
# OK. "ab" in column x only using +ve sep
separate(df, x, c("x", "y"), 4)
#> # A tibble: 1 x 2
#>       x     y
#> * <chr> <chr>
#> 1    ab
# Issue: "ab" appears in both x and y columns using -ve sep
separate(df, x, c("x", "y"), -4)
#> # A tibble: 1 x 2
#>       x     y
#> * <chr> <chr>
#> 1    ab    ab
# Expected:
tibble(x = "", y = "ab")
#> # A tibble: 1 x 2
#>       x     y
#>   <chr> <chr>
#> 1          ab

Because of this, PR #380 also handles this issue to return something reasonable if/when extreme sep values are provided (i.e. when
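
For illustration only, the clamping idea can be sketched in base R. This is not the code from PR #380: strsep_clamped is a hypothetical name, and the sketch assumes that a negative position reaching past the start of the string should be treated as position 0, which is what produces the expected output above. The duplicated "ab" seems to arise because a negative end position handed to stringi::stri_sub() is read as counting from the end of the string.

```r
# Hypothetical helper (not tidyr's implementation): split strings at fixed
# positions, clamping negative positions so they never fall before the start.
strsep_clamped <- function(x, sep) {
  n <- nchar(x)
  # Turn every separator into an absolute, non-negative position per string.
  pos <- lapply(sep, function(i) {
    if (i >= 0) rep(i, length(x)) else pmax(0, n + i)
  })
  # Add the implicit boundaries: start of string (0) and end of string (n).
  pos <- c(list(rep(0, length(x))), pos, list(n))
  # Piece i runs from just after boundary i up to boundary i + 1;
  # substring() returns "" when the start lies beyond the end.
  lapply(seq_len(length(pos) - 1), function(i) {
    substring(x, pos[[i]] + 1, pos[[i + 1]])
  })
}

strsep_clamped("ab", -4)   # pieces: "" and "ab"
strsep_clamped("ABC", -1)  # pieces: "AB" and "C"
```

Clamping at 0 mirrors what already happens for large positive values, where everything lands in the first column and the second is left empty.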
According to the documentation for the sep argument:

However, to split the right-most character into a new column, sep must be set to -2.

Reproducible example: