Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pivot with duplicate column names #472

Closed
GillesSanMartin opened this issue Jun 24, 2018 · 12 comments

Comments

@GillesSanMartin
Copy link

commented Jun 24, 2018

I'm more and more often confronted with messy spreadsheets with duplicate column names that need to be tidied up as for example in this SO question.

tidyr::gather simply refuse to do it (error message) and reshape::melt pr reshape2::melt return the wrong numbers without any warning. The data.table version of melt works as intended.

Here is a minimal reprex :
# Reprex
d <- data.frame(Group = c("A", "B"), 
                rbind(c(0, 0, 5, 5), 
                      c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")

# Dataframe with duplicate column names --> quite frequent situation in messy spreadsheets...
d
#>   Group Var1 Var2 Var1 Var2
#> 1     A    0    0    5    5
#> 2     B    0    0   10   10

# With tidyr we have an error message : definitively better than to have the 
# wrong numbers...
tidyr::gather(d,,,-1)
#> Error: Can't bind data because some arguments have the same name

# with reshape and reshape2 : wrong results (0 everywhere, the 5 and 10 values have disapeared)
reshape::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0
#> 5     A     Var1     0
#> 6     B     Var1     0
#> 7     A     Var2     0
#> 8     B     Var2     0
reshape2::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0

# data.table::melt fails similarly when we work on a data.frame 
# but provides exactly the intended result if we work on a data.table
data.table::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0
data.table::melt(data.table::as.data.table(d), id.vars = 1)
#>    Group variable value
#> 1:     A     Var1     0
#> 2:     B     Var1     0
#> 3:     A     Var2     0
#> 4:     B     Var2     0
#> 5:     A     Var1     5
#> 6:     B     Var1    10
#> 7:     A     Var2     5
#> 8:     B     Var2    10

# base::stack provides the right values but good luck for the other columns ...
stack(d[,-1])
#>   values    ind
#> 1      0   Var1
#> 2      0   Var1
#> 3      0   Var2
#> 4      0   Var2
#> 5      5 Var1.1
#> 6     10 Var1.1
#> 7      5 Var2.1
#> 8     10 Var2.1

Created on 2018-06-25 by the reprex package (v0.2.0).

@jennybc

This comment was marked as resolved.

Copy link
Member

commented Jun 25, 2018

It is possible that name repair should get done by gather(). Regardless, you could also implement name repair yourself right now, prior to gathering. If the names are OK, it will be a no op.

tibble::set_names() repairs names on an existing tibble and is an emerging standard across multiple tidyverse packages. The dev version of readxl is already using this, for example, before it returns a tibble read out of xls[x]. The design of this function is discussed here: tidyverse/tibble#217

d <- data.frame(Group = c("A", "B"), 
                rbind(c(0, 0, 5, 5), 
                      c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")

## tibble::set_tidy_names() is the emerging standard for making sure
## a tibble has non-duplicate and non-empty names
tidyr::gather(tibble::set_tidy_names(d),,,-1)
#> New names:
#> Var1 -> Var1..2
#> Var2 -> Var2..3
#> Var1 -> Var1..4
#> Var2 -> Var2..5
#>   Group     key value
#> 1     A Var1..2     0
#> 2     B Var1..2     0
#> 3     A Var2..3     0
#> 4     B Var2..3     0
#> 5     A Var1..4     5
#> 6     B Var1..4    10
#> 7     A Var2..5     5
#> 8     B Var2..5    10

Created on 2018-06-24 by the reprex package (v0.2.0.9000).

@GillesSanMartin

This comment was marked as resolved.

Copy link
Author

commented Jun 25, 2018

Thanks for the hint!
You still need to back transform the variables names to their original one as these are truly the same values.

Maybe with something like :

d <- tidyr::gather(tibble::set_tidy_names(d),,,-1)
d$key <- factor(gsub("(.*)\\.\\.\\d+$", "\\1", d$key))

I still feel very uncomfortable with the behavior of reshape::melt because I have used it a lot and I'm afraid that I could already have been trapped into its strange behavior (in this admittedly strange situation).

@jennybc

This comment was marked as resolved.

Copy link
Member

commented Jun 25, 2018

You still need to back transform the variables names to their original one as these are truly the same values.

Indeed! The need to do this explicitly (somewhere) seems like a good idea. Because your knowledge that the same-named VarX columns are indeed the same thing is external and probably shouldn't be silent default behaviour. Perhaps the tibble::set_tidy_names() / tibble::tidy_names() family should offer a helper or example code in the docs for post-processing.

@hadley

This comment was marked as resolved.

Copy link
Member

commented Jan 4, 2019

I think it's worth allowing gather() to work non-unique column names, since you are trying to get it into a tidy form. (Similarly it should probably allow missing column names)

@PMassicotte

This comment was marked as resolved.

Copy link

commented Jan 28, 2019

Same issue here. I have duplicated column names, which once gatherer I will calculate an average.

hadley added a commit that referenced this issue Mar 3, 2019
Part of #472
@hadley

This comment was marked as outdated.

Copy link
Member

commented Mar 3, 2019

Pivot currently silently returns the wrong number of rows:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d
#>   Group Var1 Var2 Var1 Var2
#> 1     A    0    0    5    5
#> 2     B    0    0   10   10

spec <- d %>% pivot_spec_long(-Group)
spec
#> # A tibble: 2 x 3
#>   col_name measure variable
#>   <chr>    <chr>   <chr>   
#> 1 Var1     value   Var1    
#> 2 Var2     value   Var2

d %>% pivot(spec)
#> # A tibble: 4 x 3
#>   Group variable value
#>   <fct> <chr>    <dbl>
#> 1 A     Var1         0
#> 2 A     Var2         0
#> 3 B     Var1         0
#> 4 B     Var2         0

Created on 2019-03-03 by the reprex package (v0.2.1.9000)

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 3, 2019

Also need to test with NA column names.

@hadley hadley changed the title gather/melt whith duplicate column names Pivot with duplicate column names Mar 3, 2019
@hadley hadley added this to the v1.0.0 milestone Mar 3, 2019
@PMassicotte

This comment was marked as resolved.

Copy link

commented Mar 4, 2019

Thank you!

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 5, 2019

Possibly related:

test_that("minimal spec works", {
  df <- tibble(x = "x", y = 1)
  sp <- tibble::tribble(
    ~.name, ~.value,
    "x", "X",
    "y", "X",
  )
  pv <- pivot_long(df, spec = sp)
  expect_named(pv, c("X"))
  expect_equal(pv$X, c("x", "y"))
})
@hadley

This comment has been minimized.

Copy link
Member

commented Mar 6, 2019

Reprex update for latest interface:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d <- as_tibble(d, .name_repair = "minimal")
d
#> # A tibble: 2 x 5
#>   Group  Var1  Var2  Var1  Var2
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A         0     0     5     5
#> 2 B         0     0    10    10

d %>% pivot_long(-Group)
#> # A tibble: 4 x 3
#>   Group name  value
#>   <fct> <chr> <dbl>
#> 1 A     Var1      0
#> 2 A     Var2      0
#> 3 B     Var1      0
#> 4 B     Var2      0

Created on 2019-03-06 by the reprex package (v0.2.1.9000)

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 7, 2019

After thinking about it for a while, I think the best approach is to make the copies explicit. This makes it straightforward to implement, makes it obvious from the outside, and gives you all the needed data.

# A tibble: 8 x 4
  Group name  .copy value
  <fct> <chr> <int> <dbl>
1 A     Var1      1     0
2 A     Var2      1     0
3 A     Var1      2     5
4 A     Var2      2     5
5 B     Var1      1     0
6 B     Var2      1     0
7 B     Var1      2    10
8 B     Var2      2    10
Warning message:
Duplicate column names detected, adding .copy variable 
@hadley hadley closed this in eac7ac7 Mar 7, 2019
@GillesSanMartin

This comment has been minimized.

Copy link
Author

commented Mar 7, 2019

That would be a very nice solution indeed! Thanks !!!
This configuration typically happens when people have measured the same variables several times (eg different seasons) or on different subsample units in nested designs (eg trees within sites).
Typically encoded in spreadsheets as merged cells above the column titles (in my experience)...
The new .copy variable would ease the "salvaging" of this information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.