FR: Make a data frame from a (possibly named) vector or list #31

Closed
jennybc opened this Issue Mar 1, 2016 · 12 comments

@jennybc
Member

jennybc commented Mar 1, 2016

Here's something I do fairly often, mostly with a list, but sometimes with a vector: Initialize a data frame with that list or vector as a variable and, at the same time, promote its names to a proper variable. Or, perhaps, add a variable of row numbers. Why is it so important to add the names or row numbers? Because later you'll want to process with tidyr, i.e. with unnest() and/or spread().

I could point to some real uses if I need to really sell this. But hopefully this will just make sense. Or someone will tell me it's already easy to do? It is already easy, but perhaps worth making a function for.

library(tibble)

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')

## wish it were easy to make the names a proper variable
data_frame(id = names(x), thing = x)
#> Source: local data frame [3 x 2]
#> 
#>      id    thing
#>   (chr)   (list)
#> 1 alpha <chr[1]>
#> 2  beta <chr[1]>
#> 3 gamma <chr[1]>

## where id can easily default to row number
data_frame(id = seq_along(x), thing = x)
#> Source: local data frame [3 x 2]
#> 
#>      id    thing
#>   (int)   (list)
#> 1     1 <chr[1]>
#> 2     2 <chr[1]>
#> 3     3 <chr[1]>
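
For reference, the behaviour requested above can be sketched in base R. The helper name `promote_names()` and its `id` and `value` arguments are hypothetical, made up for illustration, and are not part of tibble:

```r
# Hypothetical helper: build a two-column data frame from a vector or list,
# promoting its names to a proper variable. If the input has no names,
# fall back to row numbers.
promote_names <- function(x, id = "id", value = "value") {
  ids <- names(x)
  if (is.null(ids)) {
    ids <- seq_along(x)  # unnamed input: use positions instead
  }
  out <- data.frame(ids, stringsAsFactors = FALSE)
  names(out) <- id
  # Works for atomic vectors and for lists (the latter become a list column)
  out[[value]] <- unname(x)
  out
}

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')
named_df <- promote_names(x, id = "greek", value = "x")
unnamed_df <- promote_names(c(10, 20))
```

Here `named_df` carries the names of `x` in its `greek` column, and `unnamed_df` gets the integers 1 and 2 in its `id` column.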
@krlmlr

Member

krlmlr commented Mar 2, 2016

What verb would you use for this operation?

How about:

x %>% as_data_frame %>% tidyr::gather(id, thing)

Need to coerce to list if x is a vector. The following doesn't work:

x %>% tidyr::gather(id, thing)

I think this could be fixed by implementing gather_.list <- function(x, ...) gather_(tibble::data_frame(x), ...).

EDIT: The above also doesn't work if x is unnamed, but here you could use x %>% data_frame(thing=.) %>% add_rownames.

EDIT²: I think a new verb would help here, I'm not sure if this belongs here or in tidyr.

@jennybc

Member

jennybc commented Mar 2, 2016

I will try to propose a name.

@jennybc

Member

jennybc commented Mar 7, 2016

The more I think about it, maybe it makes sense to think about this as a treatment applied to a variable during the construction of a data frame. A way to say "add this variable AND promote its names to a proper variable". And also give some nice way of getting row numbers into the data frame? I have found dplyr::row_number() to be quite confusing / disappointing.

I realize id() is probably already overloaded with meaning. upname() isn't great either, but hopefully this conveys the idea.

library(tibble)
x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')

What if something like this:

df <- data_frame(id(x, "greek"))
## or
df <- data_frame(upname(x, "greek"))

produced this result

data_frame(greek = names(x), x = x)
#> Source: local data frame [3 x 2]
#> 
#>   greek        x
#>   (chr)   (list)
#> 1 alpha <chr[1]>
#> 2  beta <chr[1]>
#> 3 gamma <chr[1]>

I also wish it were easier to get plain row numbers. I wish this is what row_number() did but clearly it does not. So, again, fiction! What if something like this:

df <- data_frame(i = row_number(), upname(x, "greek"))

produced something like this:

data_frame(i = seq_along(x),
           greek = names(x),
           x = x)
#> Source: local data frame [3 x 3]
#> 
#>       i greek        x
#>   (int) (chr)   (list)
#> 1     1 alpha <chr[1]>
#> 2     2  beta <chr[1]>
#> 3     3 gamma <chr[1]>
@jennybc

Member

jennybc commented Mar 7, 2016

In addition to id() and upname(), dub() is a possible name for this name-promoting variable-pre-processing function.

@krlmlr

Member

krlmlr commented Mar 7, 2016

Row numbers: Have you seen #11? Your example would then be

data_frame(...) %>% rownames_to_column("i")

How would you like:

dub <- function(x) as_data_frame(setNames(list(names(x), x), c("name", "value")))

This allows at least the creation of a two-column data frame from a named object, which then can be massaged further with the other dplyr verbs, and combined with other data frames using cbind().

@hadley: Would this perhaps be suitable for purrr:

unzip_names <- function(x) set_names(list(x, names(x)), c("name", "value"))
zip_names <- function(x) set_names(x[[1]], x[[2]])

@krlmlr krlmlr added this to the 2.0 milestone Mar 7, 2016

@jennybc

Member

jennybc commented Mar 8, 2016

Sorry, I can't really tell what #11 does just from reading the discussion. But I take your word for it that it would add the integers 1 through nrow(.) as variable i. Would it add it as the first or last variable? Feels like you usually want it at the very front.

library(tibble)
x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')
dub <- function(x) as_data_frame(setNames(list(names(x), x), c("name", "value")))
dub(x)
#> Source: local data frame [3 x 2]
#> 
#>    name    value
#>   (chr)   (list)
#> 1 alpha <chr[1]>
#> 2  beta <chr[1]>
#> 3 gamma <chr[1]>

The variables themselves and the object look great. Would dub() gain some arguments or different defaults in order to produce less generic names?

UPDATE: I think x and names of x are reversed in those speculative purrr functions.

@hadley

Member

hadley commented Mar 8, 2016

What if this was just the as_data_frame() method for vectors?

@krlmlr

Member

krlmlr commented Mar 8, 2016

@jennybc: purrr: Right, revised definition below.

unzip_names <- function(x) set_names(list(names(x), x), c("name", "value"))
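
A quick round-trip check of the revised definition, as a sketch only: it uses base `setNames()` in place of purrr's `set_names()`, and it assumes `zip_names()` is adjusted to match the corrected column order (that adjustment is an inference, not something stated in this thread).

```r
# Split a named object into a list holding its names and the object itself
unzip_names <- function(x) setNames(list(names(x), x), c("name", "value"))

# Assumed counterpart: reattach the stored names to the stored values
zip_names <- function(x) setNames(x[["value"]], x[["name"]])

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')
parts <- unzip_names(x)
round_trip <- zip_names(parts)
```

With these definitions, `round_trip` should be identical to `x`.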

rownames_to_column() will add to the front. Actually, we already have add_rownames() in dplyr, but it does "too much" and will be deprecated in favor of the new functions.

Defaults: I think we should support them, even if the renaming could be handled with a simple rename() step. tidyr::gather() has them too.

@hadley: Is there a dispatch for vector, as in as_data_frame.vector()? Otherwise it looks like we need to have as_data_frame.logical(), as_data_frame.character(), ...; we also need as_data_frame.Date(), as_data_frame.POSIXt(), ..., all with the same implementation. It also includes a certain amount of surprise -- if functions use as_data_frame() to convert input, the column names are auto-generated and the user might be unaware of it. Do you think it's worth it?

@hadley

Member

hadley commented Mar 17, 2016

@krlmlr No, there's no "vector" virtual class, so implementation would be a bit tedious. But we don't need to have methods for Date and factor etc, because those will be caught by the method for the underlying atomic vector.

It seems like we're adding new functionality that previously was an error, so it doesn't seem too dangerous to me.

@krlmlr

Member

krlmlr commented Apr 8, 2016

@jennybc: For now you could try kimisc::list_to_df() -- I totally forgot about this guy. I still think this should be part of tibble.

@ijlyttle

Contributor

ijlyttle commented Apr 15, 2016

Apologies if this is not helpful, but could purrr::map_df be useful?

library("dplyr")
library("purrr")

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')

x %>% map_df(~ data_frame(thing = .x), .id = "name")

Could a new verb be put in place of the function within map_df?

krlmlr pushed a commit that referenced this issue May 7, 2016

Kirill Müller
new dub()
@hadley: Do we want this in tibble (#31)?

@krlmlr krlmlr referenced this issue May 7, 2016

Merged

New enframe() #74

@krlmlr

Member

krlmlr commented May 7, 2016

Note that this is already possible with #71:

x %>% as_data_frame %>% rownames_to_column

@krlmlr krlmlr closed this in #74 May 11, 2016

krlmlr added a commit that referenced this issue May 11, 2016

Merge pull request #74 from hadley/feature/31-dub
- New `enframe()` that converts vectors to two-column tibbles (#31, #74).
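
A brief usage sketch of the merged function, assuming an installed tibble release that exports `enframe()` with `name` and `value` arguments. Printed output is omitted because it varies between versions.

```r
library(tibble)

x <- list(alpha = 'horrible', beta = 'list', gamma = 'column')

# Default column names: "name" and "value"
df <- enframe(x)

# Less generic column names, as asked for earlier in this issue
df_greek <- enframe(x, name = "greek", value = "x")

# An unnamed vector also works; the first column then holds positions
df_num <- enframe(c(10, 20))
```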

krlmlr pushed a commit that referenced this issue May 11, 2016

Kirill Müller
Merge tag 'v1.0-4'
- New `enframe()` that converts vectors to two-column tibbles (#31, #74).
- Fix compatibility with `knitr` 1.13 (#76).
- Implement `as_data_frame.default()` (#71, tidyverse/dplyr#1752).

krlmlr pushed a commit that referenced this issue Jul 4, 2016

Kirill Müller
Merge tag 'v1.1'
Follow-up release.

- `tibble()` is no longer an alias for `frame_data()` (#82).
- Remove `tbl_df()` (#57).
- `$` returns `NULL` if column not found, without partial matching. A warning is given (#109).
- `[[` returns `NULL` if column not found (#109).

- Reworked output: More concise summary (begins with hash `#` and contains more text (#95)), removed empty line, showing number of hidden rows and columns (#51). The trailing metadata also begins with hash `#` (#101). Presence of row names is indicated by a star in printed output (#72).
- Format `NA` values in character columns as `<NA>`, like `print.data.frame()` does (#69).
- The number of printed extra cols is now an option (#68, @lionel-).
- Computation of column width properly handles wide (e.g., Chinese) characters, tests still fail on Windows (#100).
- `glimpse()` shows nesting structure for lists and uses angle brackets for type (#98).
- Tibbles with `POSIXlt` columns can be printed now, the text `<POSIXlt>` is shown as placeholder to encourage usage of `POSIXct` (#86).
- `type_sum()` shows only topmost class for S3 objects.

- Strict checking of integer and logical column indexes. For integers, passing a non-integer index or an out-of-bounds index raises an error. For logicals, only vectors of length 1 or `ncol` are supported. Passing a matrix or an array now raises an error in any case (#83).
- Warn if setting non-`NULL` row names (#75).
- Consistently surround variable names with single quotes in error messages.
- Use "Unknown column 'x'" as error message if column not found, like base R (#94).
- `stop()` and `warning()` are now always called with `call. = FALSE`.

- The `.Dim` attribute is silently stripped from columns that are 1d matrices (#84).
- Converting a tibble without row names to a regular data frame does not add explicit row names.
- `as_tibble.data.frame()` preserves attributes, and uses `as_tibble.list()` to avoid calling overridden methods, which may lead to endless recursion.

- New `has_name()` (#102).
- Prefer `tibble()` and `as_tibble()` over `data_frame()` and `as_data_frame()` in code and documentation (#82).
- New `is.tibble()` and `is_tibble()` (#79).
- New `enframe()` that converts vectors to two-column tibbles (#31, #74).
- `obj_sum()` and `type_sum()` show `"tibble"` instead of `"tbl_df"` for tibbles (#82).
- `as_tibble.data.frame()` gains `validate` argument (as in `as_tibble.list()`), if `TRUE` the input is validated.
- Implement `as_tibble.default()` (#71, tidyverse/dplyr#1752).
- `has_rownames()` supports arguments that are not data frames.

- Two-dimensional indexing with `[[` works (#58, #63).
- Subsetting with empty index (e.g., `x[]`) also removes row names.

- Document behavior of `as_tibble.tbl_df()` for subclasses (#60).
- Document and test that subsetting removes row names.

- Don't rely on `knitr` internals for testing (#78).
- Fix compatibility with `knitr` 1.13 (#76).
- Enhance `knit_print()` tests.
- Provide default implementation for `tbl_sum.tbl_sql()` and `tbl_sum.tbl_grouped_df()` to allow `dplyr` release before a `tibble` release.
- Explicit tests for `format_v()` (#98).
- Test output for `NULL` value of `tbl_sum()`.
- Test subsetting in all variants (#62).
- Add missing test from dplyr.
- Use new `expect_output_file()` from `testthat`.