Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivot_longer: when .value needs to be described by two distinct parenthesis? #619

Closed
MatthieuStigler opened this issue Apr 27, 2019 · 3 comments

Comments

@MatthieuStigler
Copy link

commented Apr 27, 2019

Let us say my column name is x_1_mean, x_1_sd. I want to make it longer by pivoting the 1 in between, i.e. I want a resulting data with x_mean, x_sd

The .value feature is great, but is assumes that the corresponding regex for .value is defined by a single parenthesis "()". I.e., if columns were x_mean_1 and x_sd_1 one could just use

names_to = c(".value","time"),
names_pattern = "(.+)_([0-9]$)"

But now if my first .value pattern needs to be described by two distinct parenthesis, this will create a problem! What is the suggested solution? Obviously, I could make all longer, then selective wider, or use the *_spec way, but maybe could have a neater direct solution?

Possible solutions:

  • allow for a .repeated value, indicating that this pattern refers also to .value?
  • allows for repeated .value? Currently having repeated .value just transforms the seond .value into a V-named column. This is quite confusing for now?

Thanks!

library(tidyverse)

pnl <- tibble(
  x_1_mean = 1:4,
  x_2_mean = c(1, 1,0, 0),
  x_1_sd = c(0, 1, 1, 1),
  x_2_sd = rnorm(4),
  y_1_mean = 1:4,
  y_2_mean = c(1, 1,0, 0),
  y_1_sd = c(0, 1, 1, 1),
  y_2_sd = rnorm(4),
  unit = 1:4,
)

## current behaviour: .value gets transformed into variable
pnl %>% pivot_longer(
  cols = -unit, 
  names_to = c(".value","time", ".value"),
  names_pattern = "(x|y)_([0-9])_(mean|sd)")
#> # A tibble: 16 x 5
#>     unit time  V3          x      y
#>    <int> <chr> <chr>   <dbl>  <dbl>
#>  1     1 1     mean   1       1    
#>  2     1 2     mean   1       1    
#>  3     1 1     sd     0       0    
#>  4     1 2     sd    -0.579   0.989
#>  5     2 1     mean   2       2    
#>  6     2 2     mean   1       1    
#>  7     2 1     sd     1       1    
#>  8     2 2     sd    -0.389  -1.06 
#>  9     3 1     mean   3       3    
#> 10     3 2     mean   0       0    
#> 11     3 1     sd     1       1    
#> 12     3 2     sd     0.0427 -0.725
#> 13     4 1     mean   4       4    
#> 14     4 2     mean   0       0    
#> 15     4 1     sd     1       1    
#> 16     4 2     sd    -0.163   0.249


## end result, obtained with pivot_wider
pnl %>% pivot_longer(
  cols = -unit, 
  names_to = c(".value","time", ".value"),
  names_pattern = "(x|y)_([0-9])_(mean|sd)") %>% 
  pivot_wider(names_from = V3,
              values_from = c(x, y))
#> # A tibble: 8 x 6
#>    unit time  x_mean    x_sd y_mean   y_sd
#>   <int> <chr>  <dbl>   <dbl>  <dbl>  <dbl>
#> 1     1 1          1  0           1  0    
#> 2     1 2          1 -0.579       1  0.989
#> 3     2 1          2  1           2  1    
#> 4     2 2          1 -0.389       1 -1.06 
#> 5     3 1          3  1           3  1    
#> 6     3 2          0  0.0427      0 -0.725
#> 7     4 1          4  1           4  1    
#> 8     4 2          0 -0.163       0  0.249

Created on 2019-04-26 by the reprex package (v0.2.1)

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 27, 2019

I’ll need to think about this for a bit. I think this is sufficiently esoteric it would fine to require a spec, but maybe there’s a simple way to allow multiple value columns (and at least we should generate a better error message).

If it turns out the spec is needed, can I include this dataset in the vignette as another example of where the spec is useful?

@MatthieuStigler

This comment has been minimized.

Copy link
Author

commented Apr 28, 2019

Thanks for looking at this!

And yes, please feel free to use that dataset for sure!

Otherwise, if one wanted a slightly more realistic one, could image we got a wide dataset from worldbank, where the indicator (spread ) readsPRJ.POP.2024.2.FE, meaning variable - age - education - gender, and one would want to get only education longer, i.e. the .value variable would refer to (variable)(age)(...)(education).

library(wbstats)
library(tidyverse)
pop_data <- wb(indicator = c("PRJ.POP.2024.2.FE","PRJ.POP.2024.3.FE", "PRJ.POP.1519.3.FE", "PRJ.POP.1519.2.FE",
                             "PRJ.POP.2024.2.MA","PRJ.POP.2024.3.MA"),
               startdate =2055, enddate=2060) %>% 
  as_tibble() %>% 
  select(-indicator, -iso3c, -country) %>% 
  pivot_wider(names_from = "indicatorID")
colnames(pop_data)
#> [1] "date"              "iso2c"             "PRJ.POP.2024.2.FE"
#> [4] "PRJ.POP.2024.3.FE" "PRJ.POP.1519.3.FE" "PRJ.POP.1519.2.FE"
#> [7] "PRJ.POP.2024.2.MA" "PRJ.POP.2024.3.MA"
pop_data 
#> # A tibble: 334 x 8
#>    date  iso2c PRJ.POP.2024.2.… PRJ.POP.2024.3.… PRJ.POP.1519.3.…
#>    <chr> <chr>            <dbl>            <dbl>            <dbl>
#>  1 2060  1W            38729.           145620.          113890. 
#>  2 2055  1W            40059.           144049.          113540. 
#>  3 2060  AL               27                29.4             13.1
#>  4 2055  AL               30.0              32.1             13.6
#>  5 2060  DZ              232.              488.             863. 
#>  6 2055  DZ              234.              498.             847. 
#>  7 2060  AR              151.             1071.             375. 
#>  8 2055  AR              156.             1078.             382. 
#>  9 2060  AM                5.12             48.2             41.6
#> 10 2055  AM                5.2              50.0             43.5
#> # … with 324 more rows, and 3 more variables: PRJ.POP.1519.2.FE <dbl>,
#> #   PRJ.POP.2024.2.MA <dbl>, PRJ.POP.2024.3.MA <dbl>

Created on 2019-04-28 by the reprex package (v0.2.1)

@hadley

This comment has been minimized.

Copy link
Member

commented Apr 28, 2019

In the world bank case though, I think you'd use separate()/extract() on indicator before pivoting, so I think your original example is a bit more pure.

I'm now pretty sure we can just make this work in pivot_spec_wider(): if names_to continues duplicates, we should just paste them together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.