New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as_data_frame(NULL) is 0-row 0-column data frame #17

Merged
merged 6 commits into from Jan 7, 2016

Conversation

Projects
None yet
4 participants
@jennybc
Member

jennybc commented Jan 5, 2016

Follow up to https://github.com/krlmlr/tibble/pull/16

I note that data_frame(NULL) and lst(NULL) both error, whereas data.frame(NULL) and list(NULL) do not. However, those all lead to lazyeval::lazy_dots(...), which is above my paygrade.

@codecov-io

This comment has been minimized.

codecov-io commented Jan 5, 2016

Current coverage is 95.98%

Merging #17 into master will increase coverage by +0.07% as of 5924f8e

@@            master     #17   diff @@
======================================
  Files           14      14       
  Stmts          441     448     +7
  Branches         0       0       
  Methods          0       0       
======================================
+ Hit            423     430     +7
  Partial          0       0       
  Missed          18      18       

Review entire Coverage Diff as of 5924f8e

Powered by Codecov. Updated on successful CI builds.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Jan 5, 2016

Thanks.

The behavior of data.frame() is confusing, I'm not sure we want to replicate it:

> data.frame(NULL)
data frame with 0 columns and 0 rows
> data.frame(a = NULL, b = character())
[1] b
<0 rows> (or 0-length row.names)
> data.frame(a = NULL)
data frame with 0 columns and 0 rows
> data.frame(NULL, NULL)
data frame with 0 columns and 0 rows

The error with lst() seems to be because a[[b]] <- NULL doesn't create a NULL-valued element at index b in list a, but removes it (if it exists): https://github.com/krlmlr/tibble/blob/a531bf834ce94c599ecbdc68f86d9f78c8cf3678/R/dataframe.R#L88. If this is fixed, both errors should go away.

Would you like to take another look?

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 6, 2016

OK ... I did something. I did not know any better way to preserve NULL-valued elements inside the for loop of lst_(). I made data_frame() match the behaviour ofdata.frame() in the cases above, because I did not know what would be better.

library(tibble)

list(NULL)
#> [[1]]
#> NULL
lst(NULL)
#> $`NULL`
#> NULL
## was previously
## Error in names(output)[i] <- col_names[[i]] :
##   'names' attribute [1] must be the same length as the vector [0]

data.frame(NULL)
#> data frame with 0 columns and 0 rows
data_frame(NULL)
#> Source: local data frame [0 x 0]
## was previously
## Error in names(output)[i] <- col_names[[i]] :
##   'names' attribute [1] must be the same length as the vector [0]

data.frame(a = NULL, b = character())
#> [1] b
#> <0 rows> (or 0-length row.names)
data_frame(a = NULL, b = character())
#> Source: local data frame [0 x 1]
#> 
#> Variables
#>   not
#>   shown:
#>   b
#>   (chr).
## was previously
## Error: Each variable must be a 1d atomic vector or list.
## Problem variables: a.
@@ -85,9 +85,11 @@ lst_ <- function(xs) {
names(output) <- character(n)
for (i in seq_len(n)) {
output[[i]] <- lazyeval::lazy_eval(xs[[i]], output)
output[[i]] <- lazyeval::lazy_eval(xs[[i]], output) %||% list(NULL)

This comment has been minimized.

@krlmlr

krlmlr Jan 6, 2016

Member

Would it work if you simply omit the assignment if the result of lazyeval::lazy_eval(xs[[i]], output) is NULL? That might get rid of the postprocessing below, too.

This comment has been minimized.

@jennybc

jennybc Jan 6, 2016

Member

Yes it does indeed. Please hold.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Jan 6, 2016

Thanks. I guess we don't want data frames with NULL columns after all. The code below gives a nice crash:

structure(list(a=NULL), class = "data.frame") %>% dplyr::bind_rows(data.frame(a=1))

Could you please add tests for using lst() with named and unnamed NULL arguments? Do lst(a=1, b=a) and lst(a=NULL, b=a) work as expected?

@jennybc

This comment has been minimized.

Member

jennybc commented Jan 7, 2016

I think names are more rational now re: named and unnamed NULLs. I couldn't find any existing tests for lst() so started new test file w/ the ones you requested.

You weren't kidding about the crash above. Bye bye RStudio!

@@ -76,7 +76,7 @@ lst_ <- function(xs) {
deparse2 <- function(x) paste(deparse(x$expr, 500L), collapse = "")
defaults <- vapply(xs[missing_names], deparse2, character(1),
USE.NAMES = FALSE)
defaults <- ifelse(defaults == "NULL", "", defaults)

This comment has been minimized.

@krlmlr

krlmlr Jan 7, 2016

Member

I think we don't need this line, and the corresponding equality test:

> tibble::lst(3)
$`3`
[1] 3

This comment has been minimized.

@jennybc

jennybc Jan 7, 2016

Member

I leave it in your hands. I did it to match the naming behaviour of list(). Was trying to avoid naming unnamed NULLs "NULL".

This comment has been minimized.

@krlmlr

krlmlr Jan 7, 2016

Member

Thanks. I see your intention, but I prefer the behavior to be consistent with passing other unnamed arguments.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Jan 7, 2016

Thanks, this looks good to me, except for my last inline comment. If you want to update one final time -- otherwise I'll merge in a couple of hours and take care of it myself.

Sorry about the crash -- I hope your data were safe? :-)

krlmlr added a commit that referenced this pull request Jan 7, 2016

Merge pull request #17 from jennybc/master
- `as_data_frame(NULL)` is 0-row 0-column data frame (#17, @jennybc).

@krlmlr krlmlr merged commit 12fde6b into tidyverse:master Jan 7, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

krlmlr pushed a commit that referenced this pull request Jan 7, 2016

Kirill Müller
Merge branch 'feature/un-null'
- Don't rename columns named `NULL` (on top of #17).

krlmlr pushed a commit that referenced this pull request Jan 7, 2016

Kirill Müller
Merge branch 'release/0.1-4' into production
- Non-scalar input to `frame_data()` and `tibble()` creates list-valued columns (#7).
- `frame_data()` and `tibble()` create empty `data_frame` if no rows are given (#20).
- `as_data_frame(NULL)` is 0-row 0-column data frame (#17, @jennybc).
- `lst(NULL)` doesn't raise an error anymore (#17, @jennybc), but always uses deparsed expression as name (even for `NULL`).
- `trunc_mat()` and `print()` use `width` argument also for zero-row and zero-column data frames (#18).

krlmlr pushed a commit that referenced this pull request Jan 7, 2016

Kirill Müller
Merge tag 'v0.1-4'
- Non-scalar input to `frame_data()` and `tibble()` creates list-valued columns (#7).
- `frame_data()` and `tibble()` create empty `data_frame` if no rows are given (#20).
- `as_data_frame(NULL)` is 0-row 0-column data frame (#17, @jennybc).
- `lst(NULL)` doesn't raise an error anymore (#17, @jennybc), but always uses deparsed expression as name (even for `NULL`).
- `trunc_mat()` and `print()` use `width` argument also for zero-row and zero-column data frames (#18).
@@ -157,6 +160,9 @@ as_data_frame.data.frame <- function(x, ...) {
#' @export
#' @rdname as_data_frame
as_data_frame.list <- function(x, validate = TRUE, ...) {
x <- compact(x)

This comment has been minimized.

@hadley

hadley Jan 7, 2016

Member

I think it would be useful to explicitly state the invariant that this preserves, which (I think) is that all these operations should give the same result:

data_frame(a = x, b = y)

as_data_frame(list(a = x, b = y))

df <- data_frame(a = x)
df$b <- y

Regardless of what the values of x and y are.

This comment has been minimized.

@krlmlr

krlmlr Jan 7, 2016

Member

The implementation of compact() filters NULL values. I am missing the connection with the invariant you stated (which should be documented, I agree).

This comment has been minimized.

@hadley

hadley Jan 7, 2016

Member

Why is compact() here? Why should we always drop NULL columns in a data frame? What's the underlying principle?

This comment has been minimized.

@jennybc

jennybc Jan 7, 2016

Member

Where does one record such principles? In a test or as a comment or docs?

This comment has been minimized.

@krlmlr

krlmlr Jan 7, 2016

Member

@hadley: NULL columns in data frames are difficult to create with base R, and they crash dplyr:

structure(list(a=NULL), class = "data.frame") %>% dplyr::bind_rows(data.frame(a=1))

While lst() handles NULL values just fine, I think NULL columns are not appropriate for data frames.

This comment has been minimized.

@hadley

hadley Jan 7, 2016

Member

Crashing dplyr is neither here nor there, because you could argue that's a dplyr bug that should be fixed.

I think you'd be better off taking the position that a data frame can only contain vectors and NULL is not a vector hence is not allowed. There should be a check for non-vectors (e.g. environments, symbols etc) that throws an informative error, rather than special casing NULL here.

But beware is.atomic(NULL) == TRUE so you might want to bring in purrr's is_atomic:

is_atomic <- function(x) function(x) {
  typeof(x) %in% c("logical", "integer", "double", "complex", "character", "raw")
}

This comment has been minimized.

@krlmlr

krlmlr Jan 7, 2016

Member

Not meaning to criticize dplyr -- rather pointing out that this is one of those corner cases...

If I leave out this line, the call data_frame(a = NULL) already throws the error "Each variable must be a 1d atomic vector or list.". No need for further action (13b0e49).

data.frame(a = NULL) swallows a, but data.frame(a = NULL, b = 1:3) throws an error. I think there's no need to be consistent with data.frame() here.

krlmlr pushed a commit that referenced this pull request Mar 2, 2016

Kirill Müller
Merge tag 'v0.2'
- Functions related to `tbl` and `src` stay in `dplyr` (#26). Remove unused `make_tbl()`.
- Non-scalar input to `frame_data()` and `tibble()` (including lists) creates list-valued columns (#7).
- Use C++ implementation for `as_data_frame.matrix()` (#14). Also add former `matrixToDataFrame()` tests, and fix unwanted conversion to factor.
- `as_data_frame(NULL)` is 0-row 0-column data frame (#17, @jennybc). `frame_data()` and `tibble()` create empty `data_frame` if no rows are given (#20).
- `data_frame(NULL)` raises error "must be a 1d atomic vector or list".
- `lst(NULL)` doesn't raise an error anymore (#17, @jennybc), but always uses deparsed expression as name (even for `NULL`).
- `trunc_mat()` and `print()` use `width` argument also for zero-row and zero-column data frames (#18).
- `glimpse()` now (invisibly) returns `x`, so it can be used within a chain of `dplyr` verbs (@edwindj).
- `base::getElement()` now works with tibbles (#9).
- Remove spurious usage of "dplyr" in documentation (#3).
- Almost full test coverage.

krlmlr pushed a commit that referenced this pull request Mar 22, 2016

Kirill Müller
Merge tag 'v1.0'
- Initial CRAN release

- Extracted from `dplyr` 0.4.3

- Exported functions:
    - `tbl_df()`
    - `as_data_frame()`
    - `data_frame()`, `data_frame_()`
    - `frame_data()`, `tibble()`
    - `glimpse()`
    - `trunc_mat()`, `knit_print.trunc_mat()`
    - `type_sum()`
    - New `lst()` and `lst_()` create lists in the same way that
      `data_frame()` and `data_frame_()` create data frames (tidyverse/dplyr#1290).
      `lst(NULL)` doesn't raise an error (#17, @jennybc), but always
      uses deparsed expression as name (even for `NULL`).
    - New `add_row()` makes it easy to add a new row to data frame
      (tidyverse/dplyr#1021).
    - New `rownames_to_column()` and `column_to_rownames()` (#11, @zhilongjia).
    - New `has_rownames()` and `remove_rownames()` (#44).
    - New `repair_names()` fixes missing and duplicate names (#10, #15,
      @r2evans).
    - New `is_vector_s3()`.

- Features
    - New `as_data_frame.table()` with argument `n` to control name of count
      column (#22, #23).
    - Use `tibble` prefix for options (#13, #36).
    - `glimpse()` now (invisibly) returns its argument (tidyverse/dplyr#1570). It
      is now a generic, the default method dispatches to `str()`
      (tidyverse/dplyr#1325).  The default width is obtained from the
      `tibble.width` option (#35, #56).
    - `as_data_frame()` is now an S3 generic with methods for lists (the old
      `as_data_frame()`), data frames (trivial), matrices (with efficient
      C++ implementation) (tidyverse/dplyr#876), and `NULL` (returns a 0-row
      0-column data frame) (#17, @jennybc).
    - Non-scalar input to `frame_data()` and `tibble()` (including lists)
      creates list-valued columns (#7). These functions return 0-row but n-col
      data frame if no data.

- Bug fixes
    - `frame_data()` properly constructs rectangular tables (tidyverse/dplyr#1377,
      @kevinushey).

- Minor modifications
    - Uses `setOldClass(c("tbl_df", "tbl", "data.frame"))` to help with S4
      (tidyverse/dplyr#969).
    - `tbl_df()` automatically generates column names (tidyverse/dplyr#1606).
    - `tbl_df`s gain `$` and `[[` methods that are ~5x faster than the defaults,
      never do partial matching (tidyverse/dplyr#1504), and throw an error if the
      variable does not exist.  `[[.tbl_df()` falls back to regular subsetting
      when used with anything other than a single string (#29).
      `base::getElement()` now works with tibbles (#9).
    - `all_equal()` allows to compare data frames ignoring row and column order,
      and optionally ignoring minor differences in type (e.g. int vs. double)
      (tidyverse/dplyr#821).  Used by `all.equal()` for tibbles.  (This package
      contains a pure R implementation of `all_equal()`, the `dplyr` code has
      identical behavior but is written in C++ and thus faster.)
    - The internals of `data_frame()` and `as_data_frame()` have been aligned,
      so `as_data_frame()` will now automatically recycle length-1 vectors.
      Both functions give more informative error messages if you are attempting
      to create an invalid data frame.  You can no longer create a data frame
      with duplicated names (tidyverse/dplyr#820).  Both functions now check that
      you don't have any `POSIXlt` columns, and tell you to use `POSIXct` if you
      do (tidyverse/dplyr#813).  `data_frame(NULL)` raises error "must be a 1d
      atomic vector or list".
    - `trunc_mat()` and `print.tbl_df()` are considerably faster if you have
      very wide data frames.  They will now also only list the first 100
      additional variables not already on screen - control this with the new
      `n_extra` parameter to `print()` (tidyverse/dplyr#1161).  The type of list
      columns is printed correctly (tidyverse/dplyr#1379).  The `width` argument is
      used also for 0-row or 0-column data frames (#18).
    - When used in list-columns, S4 objects only print the class name rather
      than the full class hierarchy (#33).
    - Add test that `[.tbl_df()` does not change class (#41, @jennybc).  Improve
      `[.tbl_df()` error message.

- Documentation
    - Update README, with edits (#52, @bhive01) and enhancements (#54,
      @jennybc).
    - `vignette("tibble")` describes the difference between tbl_dfs and
      regular data frames (tidyverse/dplyr#1468).

- Code quality
    - Test using new-style Travis-CI and AppVeyor. Full test coverage (#24,
      #53). Regression tests load known output from file (#49).
    - Renamed `obj_type()` to `obj_sum()`, improvements, better integration with
     `type_sum()`.
    - Internal cleanup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment