Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pivot with duplicate column names #472

Closed
GillesSanMartin opened this issue Jun 24, 2018 · 12 comments
Closed

Pivot with duplicate column names #472

GillesSanMartin opened this issue Jun 24, 2018 · 12 comments

Comments

@GillesSanMartin
Copy link

@GillesSanMartin GillesSanMartin commented Jun 24, 2018

I'm more and more often confronted with messy spreadsheets with duplicate column names that need to be tidied up as for example in this SO question.

tidyr::gather simply refuse to do it (error message) and reshape::melt pr reshape2::melt return the wrong numbers without any warning. The data.table version of melt works as intended.

Here is a minimal reprex :
# Reprex
d <- data.frame(Group = c("A", "B"), 
                rbind(c(0, 0, 5, 5), 
                      c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")

# Dataframe with duplicate column names --> quite frequent situation in messy spreadsheets...
d
#>   Group Var1 Var2 Var1 Var2
#> 1     A    0    0    5    5
#> 2     B    0    0   10   10

# With tidyr we have an error message : definitively better than to have the 
# wrong numbers...
tidyr::gather(d,,,-1)
#> Error: Can't bind data because some arguments have the same name

# with reshape and reshape2 : wrong results (0 everywhere, the 5 and 10 values have disapeared)
reshape::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0
#> 5     A     Var1     0
#> 6     B     Var1     0
#> 7     A     Var2     0
#> 8     B     Var2     0
reshape2::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0

# data.table::melt fails similarly when we work on a data.frame 
# but provides exactly the intended result if we work on a data.table
data.table::melt(d, id.vars = 1)
#>   Group variable value
#> 1     A     Var1     0
#> 2     B     Var1     0
#> 3     A     Var2     0
#> 4     B     Var2     0
data.table::melt(data.table::as.data.table(d), id.vars = 1)
#>    Group variable value
#> 1:     A     Var1     0
#> 2:     B     Var1     0
#> 3:     A     Var2     0
#> 4:     B     Var2     0
#> 5:     A     Var1     5
#> 6:     B     Var1    10
#> 7:     A     Var2     5
#> 8:     B     Var2    10

# base::stack provides the right values but good luck for the other columns ...
stack(d[,-1])
#>   values    ind
#> 1      0   Var1
#> 2      0   Var1
#> 3      0   Var2
#> 4      0   Var2
#> 5      5 Var1.1
#> 6     10 Var1.1
#> 7      5 Var2.1
#> 8     10 Var2.1

Created on 2018-06-25 by the reprex package (v0.2.0).

@jennybc

This comment has been hidden.

@GillesSanMartin

This comment has been hidden.

@jennybc

This comment has been hidden.

@hadley

This comment has been hidden.

@PMassicotte

This comment has been hidden.

hadley added a commit that referenced this issue Mar 3, 2019
@hadley

This comment has been hidden.

@hadley
Copy link
Member

@hadley hadley commented Mar 3, 2019

Also need to test with NA column names.

@hadley hadley changed the title gather/melt whith duplicate column names Pivot with duplicate column names Mar 3, 2019
@hadley hadley added this to the v1.0.0 milestone Mar 3, 2019
@PMassicotte

This comment has been hidden.

@hadley
Copy link
Member

@hadley hadley commented Mar 5, 2019

Possibly related:

test_that("minimal spec works", {
  df <- tibble(x = "x", y = 1)
  sp <- tibble::tribble(
    ~.name, ~.value,
    "x", "X",
    "y", "X",
  )
  pv <- pivot_long(df, spec = sp)
  expect_named(pv, c("X"))
  expect_equal(pv$X, c("x", "y"))
})

@hadley
Copy link
Member

@hadley hadley commented Mar 6, 2019

Reprex update for latest interface:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

d <- data.frame(c("A", "B"), rbind(c(0, 0, 5, 5), c(0, 0, 10, 10)))
colnames(d) <- c("Group", "Var1", "Var2", "Var1", "Var2")
d <- as_tibble(d, .name_repair = "minimal")
d
#> # A tibble: 2 x 5
#>   Group  Var1  Var2  Var1  Var2
#>   <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 A         0     0     5     5
#> 2 B         0     0    10    10

d %>% pivot_long(-Group)
#> # A tibble: 4 x 3
#>   Group name  value
#>   <fct> <chr> <dbl>
#> 1 A     Var1      0
#> 2 A     Var2      0
#> 3 B     Var1      0
#> 4 B     Var2      0

Created on 2019-03-06 by the reprex package (v0.2.1.9000)

@hadley
Copy link
Member

@hadley hadley commented Mar 7, 2019

After thinking about it for a while, I think the best approach is to make the copies explicit. This makes it straightforward to implement, makes it obvious from the outside, and gives you all the needed data.

# A tibble: 8 x 4
  Group name  .copy value
  <fct> <chr> <int> <dbl>
1 A     Var1      1     0
2 A     Var2      1     0
3 A     Var1      2     5
4 A     Var2      2     5
5 B     Var1      1     0
6 B     Var2      1     0
7 B     Var1      2    10
8 B     Var2      2    10
Warning message:
Duplicate column names detected, adding .copy variable 

@hadley hadley closed this in eac7ac7 Mar 7, 2019
@GillesSanMartin
Copy link
Author

@GillesSanMartin GillesSanMartin commented Mar 7, 2019

That would be a very nice solution indeed! Thanks !!!
This configuration typically happens when people have measured the same variables several times (eg different seasons) or on different subsample units in nested designs (eg trees within sites).
Typically encoded in spreadsheets as merged cells above the column titles (in my experience)...
The new .copy variable would ease the "salvaging" of this information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants