New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

separate() right edge fixed position splitting is off by 1 #315

Closed
JoFrhwld opened this Issue Jun 21, 2017 · 8 comments

Comments

Projects
None yet
3 participants
@JoFrhwld

JoFrhwld commented Jun 21, 2017

According to the documentation for the sep argument:

If numeric, interpreted as positions to split at. Positive values start at 1 at the far-left of the string; negative value start at -1 at the far-right of the string.

However, to split the right-most character into a new column, sepmust be set to -2. Reproducible example:

library(tidyverse)

df <- expand.grid(group = c("alpha", "beta"),
                  dimension = c(1, 2),
                  stringsAsFactors = F) %>%
        tbl_df() %>%
        mutate(gr_dim = paste0(group, dimension))


# splits the one left most character
df %>%
  separate(gr_dim, sep = 1, into = c("gr", "dim"))

# creates two columns, one of which is the original column, the other blank
df %>%
  separate(gr_dim, sep = -1, into = c("gr", "dim"))

# splits the two left-most characters
df %>%
  separate(gr_dim, sep = 2, into = c("gr", "dim"))

# splits the one right-most character
df %>%
  separate(gr_dim, sep = -2, into = c("gr", "dim"))

@JoFrhwld JoFrhwld changed the title from separate() right edged fixed position splitting is off by 1 to separate() right edge fixed position splitting is off by 1 Jun 21, 2017

@hadley hadley added the reprex label Jun 23, 2017

@hadley

This comment has been minimized.

Member

hadley commented Jun 23, 2017

Would you mind updating your reprex to use the reprex package? That way I can see the output more easily.

@markdly

This comment has been minimized.

Contributor

markdly commented Jul 9, 2017

On reading the documentation for the sep argument I also thought negative values would work right-to-left in the same way as positive values work from left-to-right. Reprex example:

library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

df <- tribble(
  ~foo_bar,
  "ABC",
  "DEF"
)

# split left most character  
df %>%
  separate(foo_bar, sep = 1, into = c("foo", "bar"))
#> # A tibble: 2 x 2
#>     foo   bar
#> * <chr> <chr>
#> 1     A    BC
#> 2     D    EF

# expected right most character split based on documentation but get blank column instead
df %>%
  separate(foo_bar, sep = -1, into = c("foo", "bar"))
#> # A tibble: 2 x 2
#>     foo   bar
#> * <chr> <chr>
#> 1   ABC      
#> 2   DEF

# split right most character using sep = -2 (i.e. is the 'right-to-left' equivalent of sep = 1) 
df %>%
  separate(foo_bar, sep = -2, into = c("foo", "bar"))
#> # A tibble: 2 x 2
#>     foo   bar
#> * <chr> <chr>
#> 1    AB     C
#> 2    DE     F

If I'm looking in the right place, perhaps slightly modifying the strsep function could address this if a code change was considered worthwhile (assuming documentation is OK as it is)?

# possible solution?
strsep <- function(x, sep) {
  nchar <- stringi::stri_length(x)
  sep <- c(0, sep, nchar)
  
  pos <- map(sep, function(i) {
    if (i >= 0) return(i)
    nchar + i
  })
  
  map(1:(length(pos) - 1), function(i) {
    stringi::stri_sub(x, pos[[i]] + 1, pos[[i + 1]])
  })
}

@hadley hadley added bug and removed reprex labels Nov 15, 2017

@hadley

This comment has been minimized.

Member

hadley commented Nov 15, 2017

@markdly thanks for this reprex!

@markdly

This comment has been minimized.

Contributor

markdly commented Nov 15, 2017

Given this has now been flagged as a bug, here's a reprex for the possible modification to the strsep function included in my previous post. strsep_old is copied from strsep, while strsep_new was the proposed change. This looks OK to me for a single negative sep value, however I don't think this works as a general solution. For example, what output should be expected for strsep_new("ABCD", c(-1, -3))? Should it be equivalent to strsep_new("ABCD", c(1, 3)) (albeit in reverse order)? Intuitively I thought this would be what to expect but that's probably a misconception on my part!

library(purrr)

# existing
strsep_old <- function(x, sep) {
  sep <- c(0, sep, -1)
  
  nchar <- stringi::stri_length(x)
  pos <- map(sep, function(i) {
    if (i >= 0) return(i)
    nchar + i + 1
  })
  
  map(1:(length(pos) - 1), function(i) {
    stringi::stri_sub(x, pos[[i]] + 1, pos[[i + 1]])
  })
}

# proposed
strsep_new <- function(x, sep) {
  nchar <- stringi::stri_length(x)
  sep <- c(0, sep, nchar)
  
  pos <- map(sep, function(i) {
    if (i >= 0) return(i)
    nchar + i
  })
  
  map(1:(length(pos) - 1), function(i) {
    stringi::stri_sub(x, pos[[i]] + 1, pos[[i + 1]])
  })
}

x <- "ABC"

# still the same for +ve sep
identical(strsep_old(x,  0), strsep_new(x, 0))
#> [1] TRUE
identical(strsep_old(x,  1), strsep_new(x, 1))
#> [1] TRUE
identical(strsep_old(x,  2), strsep_new(x, 2))
#> [1] TRUE
identical(strsep_old(x,  3), strsep_new(x, 3))
#> [1] TRUE
identical(strsep_old(x,  4), strsep_new(x, 4))  # nonsense extreme value
#> [1] TRUE

# different for -ve sep
strsep_old(x, -1)
#> [[1]]
#> [1] "ABC"
#> 
#> [[2]]
#> [1] ""
strsep_new(x, -1)
#> [[1]]
#> [1] "AB"
#> 
#> [[2]]
#> [1] "C"

# other -ve sep results
strsep_new(x, -2)
#> [[1]]
#> [1] "A"
#> 
#> [[2]]
#> [1] "BC"
strsep_new(x, -3)
#> [[1]]
#> [1] ""
#> 
#> [[2]]
#> [1] "ABC"
@hadley

This comment has been minimized.

Member

hadley commented Nov 16, 2017

Do you want to do a PR? Your change looks correct, so it just needs a couple of unit tests.

@markdly

This comment has been minimized.

Contributor

markdly commented Nov 16, 2017

Sure! I should be able to submit one in the next 24-48 hrs...

@hadley

This comment has been minimized.

Member

hadley commented Nov 16, 2017

I've marked this issue as WIP which means I won't work on it (So please let me know if you change your mind)

@markdly

This comment has been minimized.

Contributor

markdly commented Nov 17, 2017

When putting together the PR, I noticed my original proposed change above wasn't going to be suitable as there is a separate issue where large negative values for sep can sometimes lead to unexpected results. See reprex:

library(tidyr)
library(dplyr)

df <- tibble(x = c("ab"))

# OK. "ab" in column x only using +ve sep
separate(df, x, c("x", "y"),  4)
#> # A tibble: 1 x 2
#>       x     y
#> * <chr> <chr>
#> 1    ab

# Issue: "ab" appears in both x and y columns using -ve sep
separate(df, x, c("x", "y"), -4)
#> # A tibble: 1 x 2
#>       x     y
#> * <chr> <chr>
#> 1    ab    ab
# Expected:
tibble(x = "", y = "ab")
#> # A tibble: 1 x 2
#>       x     y
#>   <chr> <chr>
#> 1          ab

Because of this, PR #380 also handles this issue to return something reasonable if/when extreme sep values are provided (i.e. when abs(sep) > nchar(x)). For positive values no change is required. For negative values I've used pmax to achieve the desired result

@hadley hadley closed this in #380 Nov 17, 2017

hadley added a commit that referenced this issue Nov 17, 2017

Fix handling of negative integer values in strsep (#380)
* Fix strsep negative value handling. Closes #315

* add new tests when sep argument is integer

* update news (#315)

* add author, shorten line length
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment