Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In separate an into= column name can be NA to not generate that column #397

Closed
ggrothendieck opened this issue Jan 15, 2018 · 6 comments · Fixed by #445
Closed

In separate an into= column name can be NA to not generate that column #397

ggrothendieck opened this issue Jan 15, 2018 · 6 comments · Fixed by #445
Labels
feature a feature request or enhancement strings 🎻

Comments

@ggrothendieck
Copy link

ggrothendieck commented Jan 15, 2018

In the example below, modified from https://stackoverflow.com/questions/48255809/to-change-tabular-data-to-a-different-format-in-r/48256012#48256012 , the X column is a junk column that just contains the second underscore delimiter for those rows that have it. If the separate statement would allow NA or "", say, as a column name with the meaning do not generate a column for that component then the select(-X) could have been omitted and the separate statement would become separate(variable, c("CurveGroup", NA, "suffix"), sep = 5:6, fill = "right") . Not only would this reduce the code by one line but it would also eliminate the need to generate a junk name for a column that is subsequently dropped.

DF <- structure(list(pos = 1:2, BZ_SP = c(-300000L, 0L), BZ_SP_m1 = c(2L, 
  0L), BZ_SP_m2 = c(3L, 0L), CL_SP = c(2540544L, -118621L), CL_SP_m1 = c(1L, 
  3L), CL_SP_m2 = c(2L, 4L)), .Names = c("pos", "BZ_SP", "BZ_SP_m1", 
  "BZ_SP_m2", "CL_SP", "CL_SP_m1", "CL_SP_m2"), 
  class = "data.frame", row.names = c(NA, -2L))

DF
##   pos   BZ_SP BZ_SP_m1 BZ_SP_m2   CL_SP CL_SP_m1 CL_SP_m2
## 1   1 -300000        2        3 2540544        1        2
## 2   2       0        0        0 -118621        3        4

library(dplyr)
library(tidyr)

DF %>%
   gather(variable, value, -pos) %>% 
   separate(variable, c("CurveGroup", "X", "suffix"), sep = 5:6, fill = "right") %>% 
   select(-X) %>%
   spread(suffix, value)

giving:

  pos CurveGroup      V1 m1 m2
1   1      BZ_SP -300000  2  3
2   1      CL_SP 2540544  1  2
3   2      BZ_SP       0  0  0
4   2      CL_SP -118621  3  4

@hadley
Copy link
Member

hadley commented Jan 15, 2018

Nice idea!

@ggrothendieck
Copy link
Author

ggrothendieck commented Feb 9, 2018

Here is a second example . This one is taken from https://stackoverflow.com/questions/48707294/turn-singular-row-with-interval-into-multiple-rows-which-equal-the-interval/48707563#48707563

This example is not as compelling because there is no subsequent reduction in code but it still does reduce the mental load of the developer by freeing them from having to come up with junk names so it still demonstrates an advantage and also reduces the size of the data relative to specifying names for all columns.

In this example with the feature under discussion, X1 and X2 could have been replaced with NA to emphasize that they are not subsequently used. Maybe parity too.

library(dplyr)
library(tidyr)

DF <- structure(list(Col1 = c("a", "b", "c"), Col2 = c("odd from 1 to 9", 
"even from 2 to 14", "even from 30 to 50")), .Names = c("Col1", 
"Col2"), row.names = c(NA, -3L), class = "data.frame")

DF %>% 
   separate(Col2, into = c("parity", "X1", "from", "X2", "to")) %>% 
   group_by(Col1) %>% 
   do(data.frame(Col2 = seq(.$from, .$to, 2))) %>%
   ungroup
```

@ggrothendieck
Copy link
Author

ggrothendieck commented Feb 11, 2018

Here is a third example. The objective is to create two columns. It is based on https://stackoverflow.com/questions/48727230/how-to-split-columns-when-writing-csv-files/48727340#48727360

Had we been able to use NA instead of X1 and X2 in into it would have eliminated the need to come up with junk names, the data from separate would be smaller and the select could have been eliminated.

Lines <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)") # test data

library(dplyr)
library(tidyr)
library(tibble)

Lines %>% 
      as.tibble %>%
      separate(value, into = c("X1", "Name", "Time", "X2"), sep = ", |[()]") %>%
      select(Name, Time)

Also note that it would have been nice to be able to write the sep argument as sep = "[,()]"; however, that would have left the Time column contents with a space in front of it. That could be trimmed off later but it would be nice if the field splitting in separate had some way to specify white space trimming at the same time since I would think it would be often needed. For example, read.table can do white space trimming.

@hadley hadley added feature a feature request or enhancement strings 🎻 labels Feb 19, 2018
@markdly
Copy link
Contributor

markdly commented Apr 18, 2018

+1 for this idea. This is a use case I've found myself in several times and needing to use dummy column names felt clunky. I've tried using extract() instead in these situations which might be OK for something like example 3 but I would prefer to avoid the regexes needed for examples 1 and 2 if separate() can handle the job.

Collation of previous examples also using `extract()`
library(tidyverse)

### Example 1
ex1 <- structure(
  list(pos = 1:2, BZ_SP = c(-300000L, 0L), BZ_SP_m1 = c(2L, 0L), 
       BZ_SP_m2 = c(3L, 0L), CL_SP = c(2540544L, -118621L), 
       CL_SP_m1 = c(1L, 3L), CL_SP_m2 = c(2L, 4L)), 
  .Names = c("pos", "BZ_SP", "BZ_SP_m1", "BZ_SP_m2", 
             "CL_SP", "CL_SP_m1", "CL_SP_m2"), 
  class = "data.frame", row.names = c(NA, -2L))

sep1 <- ex1 %>%
  gather(variable, value, -pos) %>% 
  separate(variable, c("CurveGroup", "X", "suffix"), sep = 5:6, fill = "right") %>% 
  select(-X) %>%
  spread(suffix, value)

ext1 <- ex1 %>%
  gather(variable, value, -pos) %>% 
  extract(variable, c("CurveGroup", "suffix"), "(.{5})[_]*(.*)") %>% 
  spread(suffix, value)

identical(sep1, ext1)
#> [1] TRUE

### Example 2
ex2 <- structure(
  list(Col1 = c("a", "b", "c"), 
       Col2 = c("odd from 1 to 9", "even from 2 to 14", "even from 30 to 50")),
  .Names = c("Col1", "Col2"), row.names = c(NA, -3L), class = "data.frame")

sep2 <- ex2 %>% 
  separate(Col2, into = c("parity", "X1", "from", "X2", "to")) %>% 
  group_by(Col1) %>% 
  do(data.frame(Col2 = seq(.$from, .$to, 2))) %>%
  ungroup

ext2 <- ex2 %>%
  extract(Col2, c("from", "to"), regex = ("([0-9]+) to ([0-9]+)")) %>%
  mutate(Col2 = map2(from, to, ~ seq(.x, .y, 2))) %>%
  unnest() %>% select(Col1, Col2) %>% as_tibble()

identical(sep2, ext2)
#> [1] TRUE

### Example 3
ex3 <- c("(Wirtschaft, 00:00)", "(Kultur, 23:42)")

sep3 <- ex3 %>% 
  as.tibble %>%
  separate(value, into = c("X1", "Name", "Time", "X2"), sep = ", |[()]") %>%
  select(Name, Time)

ext3 <- ex3 %>% as_tibble() %>% 
  extract(value, into = c("Name", "Time"), "\\((.+), (.+)\\)")

identical(sep3, ext3)
#> [1] TRUE

Created on 2018-04-18 by the reprex package (v0.2.0).

@markdly
Copy link
Contributor

markdly commented Apr 19, 2018

I've proposed a draft PR for this in #445 which can be installed with

devtools::install_github("tidyverse/tidyr#445")

Short example

library(tidyverse)

df <- tribble(
  ~foo,
  "Amanda likes apple and asparagus",
  "Barry likes banana and broccoli",
  "Cornelia likes cherry and carrot"
)

df %>% separate(foo, into = c("name", NA, "fruit", NA, "vegetable"))
#> # A tibble: 3 x 3
#>   name     fruit  vegetable
#>   <chr>    <chr>  <chr>    
#> 1 Amanda   apple  asparagus
#> 2 Barry    banana broccoli 
#> 3 Cornelia cherry carrot

@ggrothendieck
Copy link
Author

ggrothendieck commented Sep 20, 2018

Here is a yet another example, this one based on https://stackoverflow.com/questions/52431841/how-to-find-the-first-space-in-a-sentence-with-regular-expressions-within-r/52432371#52432371

The input is has a single column each element of which looks like [1] some text and it is desired to split that into two columns with the first column containing the row number, i.e. 1, and the second column containing the text that follows the space.

library(dplyr)
library(tidyr)

DF <-
structure(list(X = c("[2] wererea 187w 50HH (1)", "[3] werewefrea 145y 50HH (4)", 
"[5] weredssda 100NH (2)")), class = "data.frame", row.names = c(NA, 
-3L))

DF %>%
  separate(X, into = c("junk", "X", "Y"), 
    extra = "merge", convert = TRUE, remove = TRUE) %>%
  select(-junk)
##   X                        Y
## 1 2    wererea 187w 50HH (1)
## 2 3 werewefrea 145y 50HH (4)
## 3 5      weredssda 100NH (2)

With the suggested feature this one reduces to the following where the last line of code above is no longer needed:

DF %>%
  separate(X, into = c(NA, "X", "Y"), 
    extra = "merge", convert = TRUE, remove = TRUE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement strings 🎻
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants