separate() upset by tbl_dfs with x.1 column naming convention #61

Closed
brendan-r opened this issue Feb 23, 2015 · 3 comments

Comments

@brendan-r

Apologies if this is a dupe, or belongs in dplyr. The following behavior chokes separate(). Here's a convoluted 'reprex':

library(dplyr)
library(tidyr)

s <- summarise(
  group_by(iris, Species), 
  sepal_l = mean(Sepal.Length), sepal_w = mean(Sepal.Width),
  petal_l = mean(Petal.Length), petal_w = mean(Petal.Width)
)

# The result of gather()
gather(s, Species, value)
# Source: local data frame [12 x 3]
# 
#       Species Species.1 value
#1      setosa   sepal_l 5.006
#2  versicolor   sepal_l 5.936
#3   virginica   sepal_l 6.588
# ... (abridged by br)

# This is the problematic call
separate(
  gather(s, Species, value),
  Species.1, c("part", "orient"), sep = "_"
)

# Error in matrix(unlist(pieces), ncol = n, byrow = TRUE) : 
#   'data' must be of a vector type, was 'NULL'

In similar-ish cases (tidyverse/dplyr#860, #51) ungrouping seemed to help. However, here only calling data.frame() on the data does.

separate( # Same error as above
  ungroup(gather(s, Species, value)),
  Species.1, c("part", "orient"), sep = "_"
)

separate( # Same error as above
  gather(ungroup(s), Species, value),
  Species.1, c("part", "orient"), sep = "_"
)

separate( # This works
  data.frame(gather(s, Species, value)), 
  Species.1, c("part", "orient"), sep = "_"
)

# Desired result:
#       Species  part orient value
#1      setosa sepal      l 5.006
#2  versicolor sepal      l 5.936
#3   virginica sepal      l 6.588
#4      setosa sepal      w 3.428
# ... (abridged by br)

Comparing objects that work and those that don't, the x.1 renaming convention for duplicate column names (in this case Species.1) seems to be the only difference. The tbl_df prints Species.1 as the column name, but the unclassed object stores it as 'Species'. 'Species.1' is the col argument in the problematic call to separate().

# e.g.
a <- gather(s, Species, value); b <- data.frame(a)
unclass(a); unclass(b)

Versions:

other attached packages:
[1] tidyr_0.2.0 dplyr_0.4.1

loaded via a namespace (and not attached):
 [1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.2 
 [6] plyr_1.8.1      Rcpp_0.11.3     reshape2_1.4.1  stringi_0.4-1   stringr_0.6.2  
[11] tools_3.1.2

Apologies for the verbose report!

@hadley
Member

hadley commented Apr 16, 2015

The problem is that gather creates data frames with duplicated column names:

df <- data.frame(x = 1:2, y = 1:2)
names(gather(df, x, x, x))
#> [1] "y" "x" "x"
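
The same mechanism can be seen without tidyr. The following is a minimal base R sketch, not the code from the report: `dup` is a hypothetical stand-in for the gather() output, built by hand with two columns named Species. It suggests why wrapping the result in data.frame() made separate() work, since data.frame() repairs duplicated names by default (check.names = TRUE).

```r
# Hypothetical stand-in for the gather() output: two columns both named "Species".
# check.names = FALSE keeps the duplicated name instead of repairing it.
dup <- data.frame(
  Species = c("setosa", "versicolor"),
  Species = c("sepal_l", "sepal_l"),
  value = c(5.006, 5.936),
  check.names = FALSE
)

names(dup)
#> [1] "Species" "Species" "value"

# Index of the first repeated name, or 0 if all names are unique.
anyDuplicated(names(dup))
#> [1] 2

# data.frame() defaults to check.names = TRUE, which makes the names unique.
fixed <- data.frame(dup)
names(fixed)
#> [1] "Species"   "Species.1" "value"
```

So the Species.1 column only exists by that name after the data.frame() call; before it, the column is a second Species.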

@brendan-r
Author

Yup. Perhaps not a bug at all, just a different usage style from base R that confused me!

@hadley
Member

hadley commented Dec 30, 2015

I think this is mostly because of a printing bug in dplyr. I now see:

Source: local data frame [12 x 3]

      Species Species value
       (fctr)   (chr) (dbl)
1      setosa sepal_l 5.006

which should make the problem more obvious.
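
Given that diagnosis, one way to sidestep the clash is to give gather() a key name that does not already exist in the data. This is a sketch under assumptions rather than anything proposed in the thread: the key name `measure` is hypothetical, the `-Species` selection is added so Species is explicitly kept as an id column, and it assumes a tidyr version that still exports gather().

```r
library(dplyr)
library(tidyr)

s <- summarise(
  group_by(iris, Species),
  sepal_l = mean(Sepal.Length), sepal_w = mean(Sepal.Width),
  petal_l = mean(Petal.Length), petal_w = mean(Petal.Width)
)

# "measure" is a hypothetical key name, chosen so it cannot collide with Species.
# -Species gathers every column except Species.
long <- gather(s, measure, value, -Species)

# With unique column names, separate() can find the column to split.
tidy <- separate(long, measure, c("part", "orient"), sep = "_")
names(tidy)
```

The resulting columns should match the desired layout from the original report: Species, part, orient, value.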

@hadley hadley closed this as completed Dec 30, 2015