Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow components of names_to in pivot_longer to be NA #793

Closed
ggrothendieck opened this issue Oct 26, 2019 · 9 comments
Closed

Allow components of names_to in pivot_longer to be NA #793

ggrothendieck opened this issue Oct 26, 2019 · 9 comments
Labels
feature pivoting ♻️
Milestone

Comments

@ggrothendieck
Copy link

@ggrothendieck ggrothendieck commented Oct 26, 2019

In the example below taken from https://stackoverflow.com/questions/58566740/pivot-by-group-for-unequal-data-size/58567045#58567045 we don't need the Num column so it would be nice to be able to specify NA instead of giving a dummy name. Also because that column is generated values_drop_na=TRUE won't eliminate NA rows and we need to add additional statements (drop_na, select) to drop them and remove that column. These could have been eliminated had NA components been available.

dat <- structure(list(ID = c(21785L, 21785L, 21785L), HR1 = c(0.828273303, 
6.404590021, 0.775568448), Weekday1 = c(2L, 3L, 2L), HR2 = c(NA, 
1.122899914, 0.850113234), Weekday2 = c(NA, 4L, 3L), HR3 = c(NA, 
0.866757168, 0.868943246), Weekday3 = c(NA, 5L, 4L), HR4 = c(NA, 
0.563804788, 0.728656328), Weekday4 = c(NA, 6L, 5L), HR5 = c(NA, 
0.888109208, 0.823803733), Weekday5 = c(NA, 7L, 6L), HR6 = c(NA, 
0.578834113, 0.863467391), Weekday6 = c(NA, 1L, 7L), HR7 = c(NA, 
NA, 0.939920869), Weekday7 = c(NA, 2, 8)), row.names = c(5L, 
163L, 167L), class = "data.frame")

library(dplyr)
library(tidyr)

dat %>% pivot_longer(-ID,
   names_to = c(".value", "Num"),  # names_to = c(".value", NA),
   names_pattern = "(\\D+)(\\d+)") %>%   
   drop_na %>% # would not need this line if NA in names_to were allowed
   select(-Num) # would not need this line if NA in names_to were allowed

@ggrothendieck
Copy link
Author

@ggrothendieck ggrothendieck commented Nov 11, 2019

Another example where this would have been useful is shown here:
https://stackoverflow.com/questions/58792882/appending-values-in-a-column-according-to-row-index/58794307#58794307

In that example we could have omitted the select(-name) had we been able to write names_to = NA .

@ggrothendieck
Copy link
Author

@ggrothendieck ggrothendieck commented Nov 11, 2019

Another example from SO can be found here: https://stackoverflow.com/questions/58785329/is-there-a-function-in-r-that-will-let-me-convert-a-dataset-into-long-format-b/58785660#58785660

In the first pivot_longer in that example if we could have written to_names = NA we could have omitted the subsequent select(-name) making the pipeline more compact and eliminating the need for generating a column which is never used.

@hadley
Copy link
Member

@hadley hadley commented Nov 13, 2019

Yeah, I like this idea, and it matches the syntax in separate()

@hadley hadley added feature pivoting ♻️ labels Nov 13, 2019
@hadley
Copy link
Member

@hadley hadley commented Nov 24, 2019

You can already use NA in names, provided that you're using names_sep because it uses str_separate() which already supports this behaviour. I'll also add it to str_extract() for consistency, but note that you don't really use it since you can just drop the () grouping.

However, there's another problem that prevents this from working: there's currently no way to have a single names_to = ".value":

library(tidyr)

dat <- data.frame(
  ID = c(21785L, 21785L, 21785L),
  HR_1 = c(0.828273303, 6.404590021, 0.775568448), 
  Weekday_1 = c(2L, 3L, 2L), 
  HR_2 = c(NA, 1.122899914, 0.850113234),
  Weekday_2 = c(NA, 4L, 3L), 
  HR_3 = c(NA, 0.866757168, 0.868943246), 
  Weekday_3 = c(NA, 5L, 4L), 
  HR_4 = c(NA, 0.563804788, 0.728656328), 
  Weekday_4 = c(NA, 6L, 5L)
)

dat %>% pivot_longer(-ID, 
  names_to = ".value", 
  names_pattern = "\\D+_(\\d+)"
)
#> New names:
#> * .value -> .value...2
#> * .value -> .value...3
#> Error: `spec` must have `.name` and `.value` columns

dat %>% pivot_longer(-ID,
  names_to = c(".value", NA),
  names_sep = "_"
)
#> Error: `spec` must have at least 3 columns

Created on 2019-11-24 by the reprex package (v0.3.0)

I'll need to think about that assumption more.

hadley added a commit that referenced this issue Nov 24, 2019
@hadley hadley added this to the v1.1.0 milestone Nov 28, 2019
@hadley
Copy link
Member

@hadley hadley commented Dec 6, 2019

Reprex:

library(tidyr)

url <- "https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv"
df2 <- readr::read_csv(url, col_types = list())
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [2]
df2 <- df2[-1]
df2$X68 <- as.numeric(df2$X68) # fix non-numeric year data
#> Warning: NAs introduced by coercion

spec <- tibble(.name = setdiff(names(df2), "id"))
spec$.value <- rep(c("title", "year", "artist"), length = nrow(spec))
spec$seq <- rep(1:(nrow(spec) / 3), each = 3)

df2 %>% pivot_longer_spec(spec)
#> # A tibble: 1,150 x 5
#>       id   seq title                      year artist                           
#>    <dbl> <int> <chr>                     <dbl> <chr>                            
#>  1     1     1 Sophisticated Lady / Tea…  1933 Art Tatum                        
#>  2     1     2 The Genius Of Art Tatum,…  1955 Art Tatum                        
#>  3     1     3 The Tatum Group Masterpi…  1964 Art Tatum / Lionel Hampton / Har…
#>  4     1     4 Live Sessions 1940 / 1941  1975 Art Tatum                        
#>  5     1     5 20th Century Piano Genius  1986 Art Tatum                        
#>  6     1     6 Jazz Masters (100 Ans De…  1998 Art Tatum                        
#>  7     1     7 The Art Tatum - Ben Webs…  2015 Art Tatum / Ben Webster          
#>  8     1     8 El Gran Tatum                NA Art Tatum                        
#>  9     1     9 Sweet Georgia Brown / Sh…  1945 Benny Goodman Quintet* / Esquire…
#> 10     1    10 The Immortal Live Sessio…  1975 Louis Armstrong                  
#> # … with 1,140 more rows

Created on 2019-12-06 by the reprex package (v0.3.0)

The problem is that we have to generate a sequence placeholder that is not useful afterwards. This seems somehow related to #792.

@hadley
Copy link
Member

@hadley hadley commented Dec 6, 2019

Yes, this is exactly the same problem as #792. With local experimental fix:

library(tidyr)

df <- data.frame(
  ID = c(21785L, 21785L, 21785L),
  HR_1 = c(0.828273303, 6.404590021, 0.775568448), 
  Weekday_1 = c(2L, 3L, 2L), 
  HR_2 = c(NA, 1.122899914, 0.850113234),
  Weekday_2 = c(NA, 4L, 3L), 
  HR_3 = c(NA, 0.866757168, 0.868943246), 
  Weekday_3 = c(NA, 5L, 4L), 
  HR_4 = c(NA, 0.563804788, 0.728656328), 
  Weekday_4 = c(NA, 6L, 5L)
)

df %>% pivot_longer(-ID, names_to = ".value", names_pattern = "(\\D+)_\\d+")
#> # A tibble: 12 x 3
#>       ID     HR Weekday
#>    <int>  <dbl>   <int>
#>  1 21785  0.828       2
#>  2 21785 NA          NA
#>  3 21785 NA          NA
#>  4 21785 NA          NA
#>  5 21785  6.40        3
#>  6 21785  1.12        4
#>  7 21785  0.867       5
#>  8 21785  0.564       6
#>  9 21785  0.776       2
#> 10 21785  0.850       3
#> 11 21785  0.869       4
#> 12 21785  0.729       5
df %>% pivot_longer(-ID, names_to = c(".value", NA), names_sep = "_")
#> # A tibble: 12 x 3
#>       ID     HR Weekday
#>    <int>  <dbl>   <int>
#>  1 21785  0.828       2
#>  2 21785 NA          NA
#>  3 21785 NA          NA
#>  4 21785 NA          NA
#>  5 21785  6.40        3
#>  6 21785  1.12        4
#>  7 21785  0.867       5
#>  8 21785  0.564       6
#>  9 21785  0.776       2
#> 10 21785  0.850       3
#> 11 21785  0.869       4
#> 12 21785  0.729       5

df2 <- setNames(df, c("ID", "HR", "Weekday", "HR", "Weekday", "HR", "Weekday", "HR", "Weekday"))
df2 %>% pivot_longer(-ID, names_to = ".value")
#> Warning: Duplicate column names detected, adding .copy variable
#> # A tibble: 12 x 4
#>       ID .copy     HR Weekday
#>    <int> <int>  <dbl>   <int>
#>  1 21785     1  0.828       2
#>  2 21785     2 NA          NA
#>  3 21785     3 NA          NA
#>  4 21785     4 NA          NA
#>  5 21785     1  6.40        3
#>  6 21785     2  1.12        4
#>  7 21785     3  0.867       5
#>  8 21785     4  0.564       6
#>  9 21785     1  0.776       2
#> 10 21785     2  0.850       3
#> 11 21785     3  0.869       4
#> 12 21785     4  0.729       5
url <- "https://raw.githubusercontent.com/bac3917/Cauldron/master/jazz.csv"
df2 <- readr::read_csv(url, col_types = list())
#> Warning: Missing column names filled in: 'X1' [1]
#> Warning: Duplicated column names deduplicated: 'X1' => 'X1_1' [2]
df2 <- df2[-1]
df2$X68 <- as.numeric(df2$X68) # fix non-numeric year data
#> Warning: NAs introduced by coercion

names(df2) <- c(outer(c("title", "year", "artist"), 1:(ncol(df2) / 3), paste0), "id")
df2 %>% pivot_longer(-id, names_to = c(".value", NA), names_pattern = "(.*?)(\\d+)")
#> # A tibble: 1,150 x 4
#>       id title                         year artist                              
#>    <dbl> <chr>                        <dbl> <chr>                               
#>  1     1 Sophisticated Lady / Tea Fo…  1933 Art Tatum                           
#>  2     1 The Genius Of Art Tatum, No…  1955 Art Tatum                           
#>  3     1 The Tatum Group Masterpiece…  1964 Art Tatum / Lionel Hampton / Harry …
#>  4     1 Live Sessions 1940 / 1941     1975 Art Tatum                           
#>  5     1 20th Century Piano Genius     1986 Art Tatum                           
#>  6     1 Jazz Masters (100 Ans De Ja…  1998 Art Tatum                           
#>  7     1 The Art Tatum - Ben Webster…  2015 Art Tatum / Ben Webster             
#>  8     1 El Gran Tatum                   NA Art Tatum                           
#>  9     1 Sweet Georgia Brown / Shiek…  1945 Benny Goodman Quintet* / Esquire Al…
#> 10     1 The Immortal Live Sessions …  1975 Louis Armstrong                     
#> # … with 1,140 more rows

names(df2) <- c(rep(c("title", "year", "artist"), (ncol(df2) - 1) / 3), "id")
df2 %>% pivot_longer(-id, names_to = ".value")
#> Warning: Duplicate column names detected, adding .copy variable
#> # A tibble: 1,150 x 5
#>       id .copy title                      year artist                           
#>    <dbl> <int> <chr>                     <dbl> <chr>                            
#>  1     1     1 Sophisticated Lady / Tea…  1933 Art Tatum                        
#>  2     1     2 The Genius Of Art Tatum,…  1955 Art Tatum                        
#>  3     1     3 The Tatum Group Masterpi…  1964 Art Tatum / Lionel Hampton / Har…
#>  4     1     4 Live Sessions 1940 / 1941  1975 Art Tatum                        
#>  5     1     5 20th Century Piano Genius  1986 Art Tatum                        
#>  6     1     6 Jazz Masters (100 Ans De…  1998 Art Tatum                        
#>  7     1     7 The Art Tatum - Ben Webs…  2015 Art Tatum / Ben Webster          
#>  8     1     8 El Gran Tatum                NA Art Tatum                        
#>  9     1     9 Sweet Georgia Brown / Sh…  1945 Benny Goodman Quintet* / Esquire…
#> 10     1    10 The Immortal Live Sessio…  1975 Louis Armstrong                  
#> # … with 1,140 more rows

Created on 2019-12-06 by the reprex package (v0.3.0)

This makes it more obvious that the behaviour with duplicate column names is inconsistent; it should probably work silently via the new special .seq variable.

@hadley
Copy link
Member

@hadley hadley commented Dec 6, 2019

Steps to finish this up:

  • Write unit tests
  • Update duplicated variable name behaviour
  • Can we move .seq generation in to check_spec()?
  • Document new special .seq variable
  • Document use of NA in names_to

@ggrothendieck
Copy link
Author

@ggrothendieck ggrothendieck commented Dec 7, 2019

As an aside, regarding your observation about base reshape not working correctly the problem seems to be that base reshape gets confused if there are multiple columns witih the same name. It does seem work to if we make them unique.

# d is same as above
d <- structure(list(country = c("Austria", "Belgium", "Bulgaria", 
"Estonia", "France", "Germany", "Switzerland"), A = c(NA, -2.58060150400384, 
-0.0623757258909573, 0.267776001395166, NA, NA, 0.048219924249952
), A = c(NA, -0.249416955035044, 1.3525450891501, 1.04446768824697, 
NA, -0.0582347596434839, -0.891400228849837), A = c(1.82469277697851, 
NA, NA, 1.04231605324821, NA, -0.900145118946308, -1.19320727433597
), A = c(NA, NA, NA, NA, NA, NA, NA), B = c(NA, NA, NA, NA, NA, 
NA, NA), B = c(0.633712375393134, NA, 1.24760861316098, -0.092964787061478, 
-0.59403260962332, NA, -0.650348234181285), B = c(0.587318286831079, 
NA, NA, 0.348890470222513, NA, NA, NA), B = c(0.0645316087966406, 
-0.279456557428068, NA, NA, -0.0627400036074545, 1.30419117694731, 
-0.484654596062051)), row.names = c(NA, -7L), class = "data.frame")

# dd is like d except it has unique column names
dd <- d 
names(dd) <- make.unique(names(dd))

names1 <- names(dd)[-1]
v <- split(names1, substr(names1, 1, 1))
long <- reshape(dd, dir = "long", varying = v, v.names = names(v))
long

giving the following which now seems to work:

        country time           A           B id
1.1     Austria    1          NA          NA  1
2.1     Belgium    1 -2.58060150          NA  2
3.1    Bulgaria    1 -0.06237573          NA  3
4.1     Estonia    1  0.26777600          NA  4
5.1      France    1          NA          NA  5
6.1     Germany    1          NA          NA  6
7.1 Switzerland    1  0.04821992          NA  7
1.2     Austria    2          NA  0.63371238  1
2.2     Belgium    2 -0.24941696          NA  2
3.2    Bulgaria    2  1.35254509  1.24760861  3
4.2     Estonia    2  1.04446769 -0.09296479  4
5.2      France    2          NA -0.59403261  5
6.2     Germany    2 -0.05823476          NA  6
7.2 Switzerland    2 -0.89140023 -0.65034823  7
1.3     Austria    3  1.82469278  0.58731829  1
2.3     Belgium    3          NA          NA  2
3.3    Bulgaria    3          NA          NA  3
4.3     Estonia    3  1.04231605  0.34889047  4
5.3      France    3          NA          NA  5
6.3     Germany    3 -0.90014512          NA  6
7.3 Switzerland    3 -1.19320727          NA  7
1.4     Austria    4          NA  0.06453161  1
2.4     Belgium    4          NA -0.27945656  2
3.4    Bulgaria    4          NA          NA  3
4.4     Estonia    4          NA          NA  4
5.4      France    4          NA -0.06274000  5
6.4     Germany    4          NA  1.30419118  6
7.4 Switzerland    4          NA -0.48465460  7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature pivoting ♻️
Projects
None yet
Development

No branches or pull requests

2 participants