Spread should work with drop=FALSE and fill=NA and create columns (not rows) from empty factor levels #56

alexbbrown opened this issue Jan 16, 2015 · 5 comments



Short description

The function id fails to consider the possibility that its input is a data frame containing (or consisting only of) zero-length factors, as does the function id_var, which id calls. This causes problems for the spread function, which should be able to handle these cases and generate named empty columns. It in turn causes problems with dplyr: column names that the factor transformation should have generated are missing in later stages of the pipe. The name is instead bound to some other object in the global scope, and the script (or Shiny app) errors or otherwise fails.

Long description

In the following non-empty example:

data.frame(U=c("foo","bar")[c(1,1,2,2)],K=factor(letters[c(1:2,1:2)],levels=letters[1:2]),V=c(1:2,1:2)) %>% spread(K,V)

We get a data frame (like) with columns "a", "b".

    U a b
1 bar 1 2
2 foo 1 2

Now assume I know it has "a" and "b", and in later steps I consume "a" and "b" in e.g. mutate, or ggvis.

What if there is no data?

data.frame(U=c("foo","bar")[c()],K=factor(letters[c()],levels=letters[c(1:2)]),V=numeric()) %>% spread(K,V)

This produces the following:

[1] U
<0 rows> (or 0-length row.names)

Which then goes horribly wrong if I try to consume "a"

> .Last.value %>% mutate(zep = a)
Error: unsupported type for column 'zep' (CLOSXP, classes = function)

What happened here? It pulled 'a' from the environment. In fact it's using shiny::a, a function that produces HTML.
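A defensive check can turn this silent lookup into an explicit error. The following is a minimal base R sketch, assuming a hypothetical helper named require_columns (it is not part of tidyr or dplyr) that stops the pipe as soon as an expected column is absent:

```r
# Hypothetical helper, not part of tidyr or dplyr:
# fail early if any expected column is absent from the data frame.
require_columns <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  df
}

# A zero-row result that only kept the U column, as in the example above.
empty <- data.frame(U = character())

# Capture the error message instead of aborting the script.
res <- tryCatch(
  require_columns(empty, c("a", "b")),
  error = function(e) conditionMessage(e)
)
```

Placed between spread and mutate, a check like this reports the missing names instead of letting 'a' resolve to an unrelated object.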

Ideally, drop=FALSE should address this - by creating columns for factor levels that don't exist in the data.

But right now that doesn't work - drop=FALSE does something different - it fills in missing data...

I can't even make this work properly, to produce an example, but never mind.
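Until drop=FALSE covers the empty case, one possible workaround is to add the missing level columns by hand after spreading. This is only a sketch in base R; add_level_columns is a hypothetical name, and filling with NA_real_ is an assumption that matches the numeric V column used here:

```r
# Hypothetical helper: add a fill-valued column for every expected
# factor level that is not already a column of the spread result.
add_level_columns <- function(df, lvls, fill = NA_real_) {
  for (lvl in setdiff(lvls, names(df))) {
    df[[lvl]] <- rep(fill, nrow(df))
  }
  df
}

# Stand-in for the zero-row result of spread() shown above.
wide <- data.frame(U = character(), stringsAsFactors = FALSE)
wide <- add_level_columns(wide, c("a", "b"))
```

After this call, wide still has zero rows but carries the columns "a" and "b", so later steps that refer to them find data frame columns rather than objects from the enclosing environment.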


Interestingly, if ONE level is missing, but the others are present, it works as expected:

data.frame(U=c("foo","bar")[c(1,1,2,2)],K=factor(letters[c(1:2,1:2)],levels=letters[1:3]),V=c(1:2,1:2)) %>% spread(K,V,drop=FALSE)
    U a b  c
1 bar 1 2 NA
2 foo 1 2 NA

But fails if no data is present:

data.frame(U=c("foo","bar")[c()],K=factor(c(),levels=letters[1:3]),V=numeric()) %>% spread(K,V,drop=FALSE)

Error in `colnames<-`(`*tmp*`, value = c("a", "b", "c")) : 
  'names' attribute [3] must be the same length as the vector [0]


This is the provided example:

data.frame(U=c("foo","bar")[c()],K=factor(c(),levels=letters[1:3]),V=numeric()) %>% spread(K,V,drop=FALSE)

In spread_.data_frame, this fails because

col_id <- dplyr::id(col, drop = drop)
[1] 0


col_labels <- split_labels(col, col_id, drop = drop)
1 a
2 b
3 c

Hence in

> dim(ordered) <- c(attr(row_id, "n"), attr(col_id, "n"))
[1] 0,0
> ordered <-, stringsAsFactors = FALSE)
> colnames(ordered) <- as.character(col_labels[[1]])

colnames fails because it is given the three labels from col_labels, while ordered in fact has a column count of 0.
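The mismatch can be reproduced outside tidyr. The sketch below is a simplified base R analogue using plain matrices, not the actual spread_ code, and the exact error text differs from the one quoted earlier: a 0 x 0 object rejects three column names, while a 0 x 3 object accepts them.

```r
labels <- c("a", "b", "c")

# Column count taken from the (empty) data: 0 columns, 3 labels.
bad <- matrix(numeric(0), nrow = 0, ncol = 0)
bad_result <- tryCatch({
  colnames(bad) <- labels
  "ok"
}, error = function(e) "error")

# Column count taken from the factor levels: 3 columns, 3 labels.
good <- matrix(numeric(0), nrow = 0, ncol = length(labels))
colnames(good) <- labels
good_df <- as.data.frame(good)
```

Here bad_result records that the assignment failed, whereas good_df is a zero-row data frame with the named empty columns the issue asks for.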

This failure to behave in a similar manner in empty cases (a seasonal problem in R code) starts:

in tidyr::id

.variables <- .variables[lengths != 0]

The id function, which nominally handles a data frame (for which lengths would be the same number repeated), is actually capable of handling a list (for which the lengths might differ). It's hard for me to see how that code, which filters out empty columns, would make sense for a data frame.

It continues in tidyr::id_var with

    if (length(x) == 0) 
    return(structure(integer(), n = 0L))

which precedes the factor handling code:

if (is.factor(x) && !drop) {
    id <- as.integer(addNA(x, ifany = TRUE))
    n <- length(levels(x))

I suggest that the order of these clauses in id_var be fixed to check factor first.
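The suggested reordering could look roughly like the following. This is a simplified sketch under a hypothetical name, id_var_sketch, built only from the clauses quoted above plus an assumed fallback for non-factors; it is not the actual tidyr source and not necessarily the fix that was eventually committed.

```r
# Hypothetical, simplified variant of id_var with the factor check first.
id_var_sketch <- function(x, drop = FALSE) {
  # Factor clause first: n comes from the levels, even with zero elements.
  if (is.factor(x) && !drop) {
    id <- as.integer(addNA(x, ifany = TRUE))
    n <- length(levels(x))
    return(structure(id, n = n))
  }
  # Empty non-factor input (or drop = TRUE): nothing to identify.
  if (length(x) == 0) {
    return(structure(integer(), n = 0L))
  }
  # Assumed fallback: ids are positions among the sorted unique values.
  lvls <- sort(unique(x), na.last = TRUE)
  id <- match(x, lvls)
  structure(id, n = max(id))
}

empty_factor <- factor(character(), levels = letters[1:3])
empty_id <- id_var_sketch(empty_factor)
```

With this ordering, empty_id has length zero but its "n" attribute equals the number of levels, which is the value the later dim and colnames steps need.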


hadley commented May 13, 2015

Gah, wrong issue.


hadley commented Dec 30, 2015

Here's my understanding of the issue, in code:

df_c <- data_frame(
  x = c("a", "a", "b", "b"),
  y = c("y", "z", "y", "z"),
  z = 1:4
)
df_f <- df_c %>% mutate(x = factor(x, levels = c("b", "a")), y = factor(y))

# Correct: only differ in order
df_c %>% spread(y, z) %>% str()
df_f %>% spread(y, z) %>% str()

# Correct: only see y
df_c[1,] %>% spread(y, z) %>% str()
df_f[1,] %>% spread(y, z) %>% str()

# Correct: expands out both y and z
df_f[1,] %>% spread(y, z, drop = FALSE) %>% str()

# Correct: don't see any values
df_c[0,] %>% spread(y, z) %>% str()

# Incorrect: from the levels of the factor, should have columns a and b
df_f[0,] %>% spread(y, z, drop = FALSE) %>% str()

hadley closed this as completed in d8fe889 Dec 30, 2015

hadley commented Dec 30, 2015

I'm pretty sure I correctly identified the underlying problem. Please let me know if I missed anything.
