separate() upset by tbl_dfs with x.1 column naming convention #61

brendan-r · 2015-02-23T22:52:14Z

Apologies if this is a dupe, or if it belongs in dplyr. The following behavior chokes separate(). Here's a convoluted 'reprex':

library(dplyr)
library(tidyr)

s <- summarise(
  group_by(iris, Species), 
  sepal_l = mean(Sepal.Length), sepal_w = mean(Sepal.Width),
  petal_l = mean(Petal.Length), petal_w = mean(Petal.Width)
)

# The result of gather()
gather(s, Species, value)
# Source: local data frame [12 x 3]
# 
#       Species Species.1 value
#1      setosa   sepal_l 5.006
#2  versicolor   sepal_l 5.936
#3   virginica   sepal_l 6.588
# ... (abridged by br)

# This is the problematic call
separate(
  gather(s, Species, value),
  Species.1, c("part", "orient"), sep = "_"
)

# Error in matrix(unlist(pieces), ncol = n, byrow = TRUE) : 
#   'data' must be of a vector type, was 'NULL'

In similar-ish cases (tidyverse/dplyr#860, #51) ungrouping seemed to help. However, here only calling data.frame() on the data does.

separate( # Same error as above
  ungroup(gather(s, Species, value)),
  Species.1, c("part", "orient"), sep = "_"
)

separate( # Same error as above
  gather(ungroup(s), Species, value),
  Species.1, c("part", "orient"), sep = "_"
)

separate( # This works
  data.frame(gather(s, Species, value)), 
  Species.1, c("part", "orient"), sep = "_"
)

# Desired result:
#       Species  part orient value
#1      setosa sepal      l 5.006
#2  versicolor sepal      l 5.936
#3   virginica sepal      l 6.588
#4      setosa sepal      w 3.428
# ... (abridged by br)

Comparing objects that work with those that don't, the x.1 renaming convention for duplicate column names (in this case Species.1) seems to be the only difference. The tbl_df shows Species.1 as the column name, but seems to represent it as 'Species' internally when unclassed. 'Species.1' is the col argument in the problematic call to separate().

# e.g.
a <- gather(s, Species, value); b <- data.frame(a)
unclass(a); unclass(b)

Versions:

other attached packages:
[1] tidyr_0.2.0 dplyr_0.4.1

loaded via a namespace (and not attached):
 [1] assertthat_0.1  DBI_0.3.1       lazyeval_0.1.10 magrittr_1.5    parallel_3.1.2 
 [6] plyr_1.8.1      Rcpp_0.11.3     reshape2_1.4.1  stringi_0.4-1   stringr_0.6.2  
[11] tools_3.1.2

Apologies for the verbose report!

hadley · 2015-04-16T11:40:10Z

The problem is that gather creates data frames with duplicated column names:

df <- data.frame(x = 1:2, y = 1:2)
names(gather(df, x, x, x))
#> [1] "y" "x" "x"
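
As a rough illustration of the point above, the following base R sketch shows one way to detect duplicated column names and make them unique before calling separate(). The data frame and the names has_dupes and fixed are made up for illustration; this is not how tidyr or dplyr handle the case internally.

```r
# Hypothetical data frame whose names collide, mimicking the gather() result.
df <- data.frame(y = 1:2, key = c("a", "b"), value = c(10, 20))
names(df) <- c("y", "x", "x")

# anyDuplicated() returns the index of the first repeated name, or 0 if none.
has_dupes <- anyDuplicated(names(df)) > 0

# make.unique() appends .1, .2, ... to repeated names.
fixed <- df
names(fixed) <- make.unique(names(fixed))
names(fixed)  # the names are now y, x, x.1
```

This is also consistent with the data.frame() workaround in the original report: data.frame() defaults to check.names = TRUE, which makes duplicated names unique.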

brendan-r · 2015-04-21T02:02:59Z

Yup. Perhaps not a bug at all, just a different usage style from base R that confused me!

hadley · 2015-12-30T18:24:54Z

I think this is mostly because of a printing bug in dplyr. I now see:

Source: local data frame [12 x 3]

      Species Species value
       (fctr)   (chr) (dbl)
1      setosa sepal_l 5.006

which should make the problem more obvious.

hadley closed this as completed Dec 30, 2015
