Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add drop to pivot_wider #770

Closed
romagnolid opened this issue Oct 3, 2019 · 11 comments · Fixed by #1277
Closed

Add drop to pivot_wider #770

romagnolid opened this issue Oct 3, 2019 · 11 comments · Fixed by #1277
Labels
feature a feature request or enhancement pivoting ♻️ pivot rectangular data to different "shapes"

Comments

@romagnolid
Copy link

romagnolid commented Oct 3, 2019

pivot_wider is missing the drop option that was present in spread which is quite useful
to fill a matrix with empty rows.

df <- tibble(x=factor(c(1,1,2), 1:3), y=c("a", "b", "b"), z=11:13)
print(df)                                                                       
#> # A tibble: 3 x 3
#>   x     y         z
#>   <fct> <chr> <int>
#> 1 1     a        11
#> 2 1     b        12
#> 3 2     b        13                                                                                                                 
pivot_wider(df, names_from = y, values_from=z)                                      
#> # A tibble: 2 x 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13
spread(df, key=y, value=z, drop=FALSE)     
#> # A tibble: 3 x 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13
#> 3 3        NA    NA
@hadley hadley added feature a feature request or enhancement pivoting ♻️ pivot rectangular data to different "shapes" labels Nov 24, 2019
@hadley
Copy link
Member

hadley commented Nov 28, 2019

Needs r-lib/vctrs#686.

Not sure what to call this argument; it could be names_drop = FALSE but drop is not terribly descriptive. It's more like names_use_factor_levels = TRUE but that's rather verbose.

@romagnolid
Copy link
Author

What about names_keep or names_keep_all?

@huftis
Copy link

huftis commented Mar 31, 2020

Note that the spread() function with drop = FALSE also kept factor levels that were not present in the data, creating columns for them (this is not shown in the above example). This was a very useful feature, which is not present in pivot_wider(). Here is a regexp for this:

library(tidyr)
d = tibble(day_int = c(4,5,1,2),
           day_fac = factor(day_int, levels=1:5,
                            labels=c("Mon","Tue", "Wed","Thu","Fri")))
d
#> # A tibble: 4 x 2
#>   day_int day_fac
#>     <dbl> <fct>  
#> 1       4 Thu    
#> 2       5 Fri    
#> 3       1 Mon    
#> 4       2 Tue
levels(d$day_fac)
#> [1] "Mon" "Tue" "Wed" "Thu" "Fri"

# spread() automatically creates a `Wed` column
spread(d, key = "day_fac", value = "day_int", drop = FALSE)
#> # A tibble: 1 x 5
#>     Mon   Tue   Wed   Thu   Fri
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     4     5

# ... but pivot_wider() does not (and does not respect
# the level ordering, but that is issue #839)
pivot_wider(d, names_from = day_fac, values_from = day_int,
               values_fill = list(day_fac = NA))
#> # A tibble: 1 x 4
#>     Thu   Fri   Mon   Tue
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4     5     1     2

For naming the argument, I like the consistency of names_fill = TRUE (cf. values_fill).

@hadley
Copy link
Member

hadley commented Apr 27, 2020

Thinking about this more, I think it's unlikely the full spectrum of needs can be resolved with just a couple of new arguments to pivot_wider(). I think the right approach is probably to document the spec data frame in much more detail, and show how you can use expand_grid() and friends to generate exactly the output that you want.

@smarc
Copy link

smarc commented May 5, 2020

Hi,

I think it's a shame that spread/pivot_wider can't be told what columns to expect when given an empty input. In extensive pipes, empty data frames/tibbles do happen and missing columns can break the whole mechanism e.g. mutate after pivot_wider. I'd be happier with a result albeit an empty data frames than a broken pipe.

thanks,
Stefan

@hadley
Copy link
Member

hadley commented May 5, 2020

@smarc it can be. That's the purpose of the spec.

@huftis
Copy link

huftis commented May 6, 2020

Here’s another use case, a bit different from my earlier example, but similar to data I work with. Here, each level of a factor (gender) is present, but not always for each value/label of the other names_from variable.

library(tidyverse)
# Example data
# Note: In 2019, only males responded
d = tibble(
  year = c(2018, 2018, 2019, 2020, 2020),
  gender = factor(c("female", "male", "male", "female", "male")),
  percentage = seq(30, 70, 10)
)
pivot_wider(d, names_from = c(year, gender), values_from = percentage)
#> # A tibble: 1 x 5
#>   `2018_female` `2018_male` `2019_male` `2020_female` `2020_male`
#>           <dbl>       <dbl>       <dbl>         <dbl>       <dbl>
#> 1            30          40          50            60          70

Expected syntax and results:

pivot_wider(d, names_from = c(year, gender), values_from = percentage,
               names_fill = TRUE)
#> # A tibble: 1 x 6
#>   `2018_female` `2018_male` `2019_female` `2019_male` `2020_female` `2020_male`
#>           <dbl>       <dbl>         <dbl>       <dbl>         <dbl>       <dbl>
#> 1            30          40            NA          50            60          70

(I am aware that the complete() function can be used to preprocess the data to achieve the above outcome.)

@Bonnie-Buyuklieva
Copy link

names_fill = TRUE would be a useful feature!

@gosianow
Copy link

gosianow commented Jun 3, 2021

Any progress on this feature? It would be very useful and would save the manual filling in of the names and values.

@DavisVaughan
Copy link
Member

DavisVaughan commented Dec 14, 2021

I think there are generally two classes of problems that drop = FALSE from spread() was solving (both of which have examples here)

Problem 1 - Missing levels in names_from columns

In this case, you'd like pivot_wider() to create more columns to represent levels that aren't represented in the data. Additionally, you'd like pivot_wider() to order those columns using the order seen in the levels attribute of the factor. Both of these can be resolved by calling complete() before pivoting.

Here is an example with 1 names_from column:

library(tidyr)

df <- tibble(
  day_int = c(4, 5, 1, 2),
  day_fac = factor(
    day_int, 
    levels = 1:5, 
    labels = c("Mon", "Tue", "Wed", "Thu", "Fri")
  )
)

# Wednesday isn't represented in the data
df
#> # A tibble: 4 × 2
#>   day_int day_fac
#>     <dbl> <fct>  
#> 1       4 Thu    
#> 2       5 Fri    
#> 3       1 Mon    
#> 4       2 Tue

# So pivot_wider() won't show it
pivot_wider(df, names_from = day_fac, values_from = day_int)
#> # A tibble: 1 × 4
#>     Thu   Fri   Mon   Tue
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4     5     1     2

# spread() will if you do `drop = FALSE`
spread(df, key = "day_fac", value = "day_int", drop = FALSE)
#> # A tibble: 1 × 5
#>     Mon   Tue   Wed   Thu   Fri
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     4     5

# Solution: `complete()` first
df <- complete(df, day_fac)
df
#> # A tibble: 5 × 2
#>   day_fac day_int
#>   <fct>     <dbl>
#> 1 Mon           1
#> 2 Tue           2
#> 3 Wed          NA
#> 4 Thu           4
#> 5 Fri           5

# Note that this also reordered them as you expected!
pivot_wider(df, names_from = day_fac, values_from = day_int)
#> # A tibble: 1 × 5
#>     Mon   Tue   Wed   Thu   Fri
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2    NA     4     5

Here is an example with 2 names_from columns:

library(tidyr)

df <- tibble(
  year = c(2018, 2018, 2019, 2020, 2020),
  gender = factor(c("F", "M", "M", "F", "M"), levels = c("F", "M")),
  percentage = seq(30, 70, 10)
)

df
#> # A tibble: 5 × 3
#>    year gender percentage
#>   <dbl> <fct>       <dbl>
#> 1  2018 F              30
#> 2  2018 M              40
#> 3  2019 M              50
#> 4  2020 F              60
#> 5  2020 M              70

# We'd like to see `2019_F` in here
pivot_wider(df, names_from = c(year, gender), values_from = percentage)
#> # A tibble: 1 × 5
#>   `2018_F` `2018_M` `2019_M` `2020_F` `2020_M`
#>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1       30       40       50       60       70

# We were missing one of the combinations
df <- complete(df, year, gender)
df
#> # A tibble: 6 × 3
#>    year gender percentage
#>   <dbl> <fct>       <dbl>
#> 1  2018 F              30
#> 2  2018 M              40
#> 3  2019 F              NA
#> 4  2019 M              50
#> 5  2020 F              60
#> 6  2020 M              70

pivot_wider(df, names_from = c(year, gender), values_from = percentage)
#> # A tibble: 1 × 6
#>   `2018_F` `2018_M` `2019_F` `2019_M` `2020_F` `2020_M`
#>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1       30       40       NA       50       60       70

Problem 2 - Missing levels in id_cols columns

In this case, you'd like pivot_wider() to create more rows to represent levels from the id_cols that aren't represented in the data. You'd also like the rows to be in the order of the levels attribute. This can be resolved by using complete() on both the id_cols and the names_from cols.

Here is an example:

library(tidyr)

df <- tibble(
  x = factor(c(1, 1, 2), levels = c(1, 2, 3)), 
  y = c("a", "b", "b"), 
  z = 11:13
)

# Level 3 of `x` isn't represented in the data
df
#> # A tibble: 3 × 3
#>   x     y         z
#>   <fct> <chr> <int>
#> 1 1     a        11
#> 2 1     b        12
#> 3 2     b        13

# So we don't get a row for `x == 3` in the result
pivot_wider(df, id_cols = x, names_from = y, values_from = z)
#> # A tibble: 2 × 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13

# spread() will add that row with `drop = FALSE`
spread(df, y, z, drop = FALSE)
#> # A tibble: 3 × 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13
#> 3 3        NA    NA

# `complete()` on both the id_cols (x) and names_from (y)
df <- complete(df, x, y)
df
#> # A tibble: 6 × 3
#>   x     y         z
#>   <fct> <chr> <int>
#> 1 1     a        11
#> 2 1     b        12
#> 3 2     a        NA
#> 4 2     b        13
#> 5 3     a        NA
#> 6 3     b        NA

pivot_wider(df, id_cols = x, names_from = y, values_from = z)
#> # A tibble: 3 × 3
#>   x         a     b
#>   <fct> <int> <int>
#> 1 1        11    12
#> 2 2        NA    13
#> 3 3        NA    NA

@DavisVaughan
Copy link
Member

Here is an example exhibiting both problem 1 and problem 2 from above, which is solved by a call to complete()

library(tidyr)

df <- tibble(
  id = factor(c(2, 1, 1, 2, 1), levels = c(1, 2, 3)),
  year = c(2018, 2018, 2019, 2020, 2020),
  gender = factor(c("F", "M", "M", "F", "M"), levels = c("F", "M")),
  percentage = seq(30, 70, 10)
)

df
#> # A tibble: 5 × 4
#>   id     year gender percentage
#>   <fct> <dbl> <fct>       <dbl>
#> 1 2      2018 F              30
#> 2 1      2018 M              40
#> 3 1      2019 M              50
#> 4 2      2020 F              60
#> 5 1      2020 M              70

# - rows are in the wrong order (2 then 1)
# - rows are missing level 3
# - cols are missing combination `2019_F`
pivot_wider(df, id_cols = id, names_from = c(year, gender), values_from = percentage)
#> # A tibble: 2 × 6
#>   id    `2018_F` `2018_M` `2019_M` `2020_F` `2020_M`
#>   <fct>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 2           30       NA       NA       60       NA
#> 2 1           NA       40       50       NA       70

df <- complete(df, id, year, gender)
df
#> # A tibble: 18 × 4
#>    id     year gender percentage
#>    <fct> <dbl> <fct>       <dbl>
#>  1 1      2018 F              NA
#>  2 1      2018 M              40
#>  3 1      2019 F              NA
#>  4 1      2019 M              50
#>  5 1      2020 F              NA
#>  6 1      2020 M              70
#>  7 2      2018 F              30
#>  8 2      2018 M              NA
#>  9 2      2019 F              NA
#> 10 2      2019 M              NA
#> 11 2      2020 F              60
#> 12 2      2020 M              NA
#> 13 3      2018 F              NA
#> 14 3      2018 M              NA
#> 15 3      2019 F              NA
#> 16 3      2019 M              NA
#> 17 3      2020 F              NA
#> 18 3      2020 M              NA

pivot_wider(df, id_cols = id, names_from = c(year, gender), values_from = percentage)
#> # A tibble: 3 × 7
#>   id    `2018_F` `2018_M` `2019_F` `2019_M` `2020_F` `2020_M`
#>   <fct>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1 1           NA       40       NA       50       NA       70
#> 2 2           30       NA       NA       NA       60       NA
#> 3 3           NA       NA       NA       NA       NA       NA

Created on 2021-12-14 by the reprex package (v2.0.1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement pivoting ♻️ pivot rectangular data to different "shapes"
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants