unnest multiple columns at once #44

Closed
momeara opened this Issue Nov 21, 2014 · 2 comments

momeara commented Nov 21, 2014

I'd like unnest to support unnesting multiple columns at once. For example,

x <- data_frame(
  a=c("a:b", "c"), b=c("1:2", "3"), c=c(11,22)) %>%
  transform(
    a = strsplit(a,":"),
    b = strsplit(b,":")) %>%
  unnest(a, b)

would produce

  a b  c
1 a 1 11
2 b 2 11
3 c 3 22

As a real-world example where this comes up, the HGNC allows extracting gene family IDs and descriptions, but it packs multiple values into tab-delimited strings like this:

       hgnc_id                hgnc_gene_name hgnc_gene_family_ids                         hgnc_gene_family_descriptions
 1: HGNC:10006    Rh-associated glycoprotein  CD\tbloodgroup\tSLC   CD molecules\tBlood group antigens\tSolute carriers
 2: HGNC:10008 Rh blood group, CcEe antigens       CD\tbloodgroup                    CD molecules\tBlood group antigens
 3: HGNC:10009     Rh blood group, D antigen       CD\tbloodgroup                    CD molecules\tBlood group antigens
 4:  HGNC:1001         B-cell CLL/lymphoma 6      ZBTB\tZNF\tBTBD -\tZinc fingers, C2H2-type\tBTB/POZ domain containing

I'd like it to unnest hgnc_gene_family_ids and hgnc_gene_family_descriptions simultaneously:

       hgnc_id                hgnc_gene_name hgnc_gene_family_ids hgnc_gene_family_descriptions
 1  HGNC:10006    Rh-associated glycoprotein                   CD                  CD molecules
 2  HGNC:10006    Rh-associated glycoprotein           bloodgroup          Blood group antigens
 3  HGNC:10006    Rh-associated glycoprotein                  SLC               Solute carriers
 4  HGNC:10008 Rh blood group, CcEe antigens                   CD                  CD molecules
 5  HGNC:10008 Rh blood group, CcEe antigens           bloodgroup          Blood group antigens
 6  HGNC:10009     Rh blood group, D antigen                   CD                  CD molecules
 7  HGNC:10009     Rh blood group, D antigen           bloodgroup          Blood group antigens
 8   HGNC:1001         B-cell CLL/lymphoma 6                 ZBTB                             -
 9   HGNC:1001         B-cell CLL/lymphoma 6                  ZNF       Zinc fingers, C2H2-type
 10  HGNC:1001         B-cell CLL/lymphoma 6                 BTBD     BTB/POZ domain containing

As a preliminary implementation, I have this:

unnest <- function (data, cols){
    if(length(cols) > 1) {
       nested <- data[,cols]
       unnested <- apply(data[,cols], 2, function(x) list(unlist(x)))
       n <- lapply(nested,
           function(nested_col) vapply(nested_col, length, numeric(1)))
       if(length(unique(n)) != 1) {
           stop("nested columns must have the same number of elements in each cell")
       }
       data <- data[rep(1:nrow(data), n[[1]]),]
       which_cols <- which(names(data) %in% cols)

       for(i in 1:length(cols)){
           data[, which_cols[i] ] <- unnested[[i]]
       }
       rownames(data) <- NULL
       return(data)
    } else {
       nested <- data[[cols]]
       unnested <- list(unlist(nested))
       names(unnested) <- cols
       n <- vapply(nested, length, numeric(1))
       rest <- data[rep(1:nrow(data), n), setdiff(names(data), cols),
           drop = FALSE]
       rownames(rest) <- NULL
       return(tidyr:::append_df(rest, unnested, which(names(data) == cols) - 1))
    }
}

If this looks like something that would be generally useful, I'd be happy to make a pull request that fits it into the package.
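
For reference, the HGNC case above can be sketched in base R without any tidyr internals. This is only an illustration under stated assumptions: `unnest_parallel` is a hypothetical helper name (not a tidyr function), the columns are assumed to be plain character vectors, and the separator is treated as a fixed string.

```r
# Hypothetical helper (not part of tidyr): split several delimited character
# columns and expand them in parallel, repeating the remaining columns.
unnest_parallel <- function(data, cols, sep) {
  # Split each requested column into a list of character vectors.
  pieces <- lapply(data[cols], strsplit, split = sep, fixed = TRUE)
  # Number of elements per row, one integer vector per column.
  n <- lapply(pieces, lengths)
  # Every column must agree on the per-row element counts.
  same <- vapply(n, function(x) identical(x, n[[1]]), logical(1))
  if (!all(same)) {
    stop("nested columns must have the same number of elements in each row")
  }
  # Repeat each row once per element, then overwrite the split columns.
  out <- data[rep(seq_len(nrow(data)), n[[1]]), , drop = FALSE]
  for (col in cols) {
    out[[col]] <- unlist(pieces[[col]], use.names = FALSE)
  }
  rownames(out) <- NULL
  out
}

# Two rows taken from the HGNC example above.
hgnc <- data.frame(
  hgnc_id = c("HGNC:10006", "HGNC:10008"),
  hgnc_gene_family_ids = c("CD\tbloodgroup\tSLC", "CD\tbloodgroup"),
  hgnc_gene_family_descriptions = c(
    "CD molecules\tBlood group antigens\tSolute carriers",
    "CD molecules\tBlood group antigens"
  ),
  stringsAsFactors = FALSE
)

long <- unnest_parallel(
  hgnc,
  cols = c("hgnc_gene_family_ids", "hgnc_gene_family_descriptions"),
  sep = "\t"
)
print(long)
```

The sketch stops with an error when the per-row counts differ between columns, which is the stricter of the two behaviours one could choose for ragged input.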

Member

hadley commented May 13, 2015

What would you expect this to do?

data_frame(
  a = c("a:b", "c"), 
  b = c("1:2:3", "3"), 
  c = c(11,22)
) %>%
  transform(
    a = strsplit(a,":"),
    b = strsplit(b,":")
  ) %>%
  unnest(a, b)

momeara commented May 13, 2015

Either give an error or fill in with NA values, like this:

a  b c
a  1 11
b  2 11
NA 3 11
c  3 22
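
The NA-filling option can be sketched in base R as follows. This is a rough illustration, not how tidyr implements it: `unnest_pad` is a hypothetical helper name, and it assumes the nested columns are list-columns of a plain data frame.

```r
# Hypothetical helper (not part of tidyr): pad the list-columns to a common
# per-row length with NA, then expand them in parallel.
unnest_pad <- function(data, cols) {
  # Per-row target length: the longest element among the nested columns.
  n <- do.call(pmax, unname(lapply(data[cols], lengths)))
  # Repeat the non-nested columns once per output row.
  out <- data[rep(seq_len(nrow(data)), n), setdiff(names(data), cols), drop = FALSE]
  for (col in cols) {
    # `length<-` extends a vector with NA up to the requested length.
    padded <- Map(function(x, k) { length(x) <- k; x }, data[[col]], n)
    out[[col]] <- unlist(padded, use.names = FALSE)
  }
  rownames(out) <- NULL
  # Restore the original column order.
  out[names(data)]
}

# The ragged example from the question above, built with list-columns.
df <- data.frame(c = c(11, 22))
df$a <- strsplit(c("a:b", "c"), ":")
df$b <- strsplit(c("1:2:3", "3"), ":")
df <- df[c("a", "b", "c")]

res <- unnest_pad(df, c("a", "b"))
print(res)
```

Padding keeps every value from the longer column at the cost of introducing missing values in the shorter one; raising an error instead would make ragged input visible to the caller.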

hadley closed this in 30d6177 May 18, 2015
