New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added repair_names (columns) and tests #15

Merged
merged 13 commits into from Mar 10, 2016

Conversation

Projects
None yet
3 participants
@r2evans

r2evans commented Jan 4, 2016

first cut at repair_names() for #10

@codecov-io

This comment has been minimized.

codecov-io commented Jan 4, 2016

Current coverage is 96.71%

Merging #15 into master will increase coverage by +0.14% as of adc8cf2

@@            master     #15   diff @@
======================================
  Files           11      12     +1
  Stmts          438     457    +19
  Branches         0       0       
  Methods          0       0       
======================================
+ Hit            423     442    +19
  Partial          0       0       
  Missed          15      15       

Review entire Coverage Diff as of adc8cf2

Powered by Codecov. Updated on successful CI builds.

#' ignored if \code{check.names} is \code{FALSE}
#' @return data.frame, with optionally different column names
#' @export
repair_names <- function(x, unique=TRUE) {

This comment has been minimized.

@krlmlr

krlmlr Jan 5, 2016

Member

Please add spaces before and after = (also in function calls).

repair_names <- function(x, unique=TRUE) {
col_names <- colnames(x)
if (is.null(col_names)) {
col_names <- rep('', ncol(x))

This comment has been minimized.

@krlmlr

krlmlr Jan 5, 2016

Member

This will give names "X", "X.1", "X.2", "X.3", ....

Also, make.names() below will "repair" existing but non-lexical names, and also reserved words. I'm afraid this is not in the spirit of this package. I like the way matrix conversion is handled by base R:

> as.data.frame(diag(3))
  V1 V2 V3
1  1  0  0
2  0  1  0
3  0  0  1

I suggest the following:

  • Empty names are set to "V1", "V2", "V3", ..., skipping already existing columns whose names match "^V[0-9]+$"
    • This logic might become messy and should be encapsulated in a separate function
  • Duplicate names are assigned a numeric suffix (without the dot), again watch out for name clashes
    • Perhaps the logic here shares a part of the logic from the point above

This comment has been minimized.

@r2evans

r2evans Jan 5, 2016

c("X", "X.1", "X.2") versus c("V1", "V2", "V3"): I had explicitly kept it that way, thinking simpler (fewer changes, more consistent with base R functions) is better.

I didn't realize you intended to reinvent a function to fix missing and duplicate headers. Your caveats are well-stated: I thought of doing it and argued myself out of it, precisely for the purpose of keeping simple code, using what has been working already, etc. I'll mull this over ...

Empty names: When you have a missing column, do you replace it with V1 or the nth V? Are you concerned that skipping already-existing columns ("^V[0-9]+$") can result in column names that are not sorted? I think I'd find that confusing. (Not entirely true ... I probably wouldn't let it get there in the first place :-)

This comment has been minimized.

@krlmlr

krlmlr Jan 5, 2016

Member

Unfortunately, make.names() doesn't allow tweaking its behavior. But good news is that we can use make.unique(sep = ""), which does most of the heavy lifting.

Not sorted column names in corner cases are fine, this is what make.unique() rightly does:

> make.unique(c("a", "b", "a", "a1"), sep = "")
[1] "a"  "b"  "a2" "a1"
context("repair_names")

test_that("repair column names when none are provided", {
dat <- as_data_frame(diag(3))

This comment has been minimized.

@krlmlr

krlmlr Jan 5, 2016

Member

This is not a particularly good initial value, because it's currently wrong and might change. Something like data_frame(a = 1, b = 2, c = 3) will do fine here.

colnames(dat) <- NULL

## ensure we start with a "bad" state
expect_true( is.null(colnames(dat)) )

This comment has been minimized.

@krlmlr

krlmlr Jan 5, 2016

Member

expect_null()


test_that("repair column names with some repeats", {
dat <- as_data_frame(diag(3))
colnames(dat) <- c('a', 'a', 'b')

This comment has been minimized.

@krlmlr

krlmlr Jan 5, 2016

Member

Also need to test evil name clashes: c("", "", "V1"), c("a", "b", "a", "a1"), ...

@krlmlr

This comment has been minimized.

Member

krlmlr commented Jan 5, 2016

Thanks for the pull request, the first draft looks nice. If you update the code, please simply commit to the same branch and push, no need to rebase or open a new pull request.

@r2evans

This comment has been minimized.

r2evans commented Jan 11, 2016

Current mod, still WIP, adjusts column names as follows:

        Null:  "            " --> "  V1, V2, V3"
       Empty:  "        , , " --> "  V1, V2, V3"
 EmptyWithNA:  "    , NA, NA" --> "  V1, V2, V3"
        Dup1:  "     a, a, b" --> "    a, a1, b"
       Evil1:  "    a, a, a1" --> "   a, a2, a1"
       OneNA:  "    a, b, NA" --> "    a, b, V1"
    Missing2:  "       , , b" --> "   V1, V2, b"
     Vnames1:  "      V1, , " --> "  V1, V2, V3"
     Vnames2:  "      V2, , " --> "  V2, V1, V3"
     Vnames3:  "     V1, , a" --> "   V1, V2, a"
  VnamesDup1:  "   V1, V1, c" --> "  V1, V11, c"
  VnamesDup2:  "    V1, V1, " --> " V1, V11, V2"

The last two could be changed to produce " V1, V2, c" (...), but I don't think that's necessary. Push coming shortly ...

@krlmlr

This comment has been minimized.

Member

krlmlr commented Jan 11, 2016

Thanks! I think there's no need to change the last two cases.

Could you please use two spaces for indentation, for the sake of consistency? Please let me know when the code is ready for review.

@r2evans

This comment has been minimized.

r2evans commented Jan 11, 2016

@krlmlr how about now?


test_that("repair various name problems", {
# still-to-add:
# missing_dup1 = c('a', '', 'V1')

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

Is this test case still missing?

expect_equal(length(setdiff(Filter(function(a) ! (is.na(a) | a == ''),
colnames(dat)),
fixed_names)), 0, info = combo_name)
message(sprintf('%12s: "%12s" --> "%12s"',

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

This is great for debugging, but not needed in the final version.

expect_false(any(table(fixed_names) > 1), info = combo_name)

# ensure all valid column names are retained
expect_equal(length(setdiff(Filter(function(a) ! (is.na(a) | a == ''),

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member
valid <- !is.na(old_names) & old_names !=  ''
expect ... (fixed_names[valid] == old_names[valid])

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

This is actually a more precise test, and IMO easier to parse for a human.

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

...and wrong, too. Needs

valid <- !is.na(old_names) & old_names !=  '' & !duplicated(old_names)
context("repair_names")

test_that("repair column names when none are provided", {
dat <- data.frame(a = 1, b = 2, c = 3)

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

Do we still need the specific test cases?

col_names[ col_names != '' ] <- sub_names

nmissing <- sum(col_names == '')
new_names <- paste(prefix, 1:ceiling(1.5 * nmissing), sep = sep)

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

Why is it 1.5 here, and not, say, 2 or 3?

This comment has been minimized.

@r2evans

r2evans Feb 16, 2016

It was an attempt at algorithmic precision, but it buys very little and adds confusion.

if (is.null(col_names)) {
col_names <- rep('', ncol(x))
} else {
ifelse(is.na(col_names), '', col_names)

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

Please assign to col_names in the "else" block, too.

fixed_dat <- repair_names(dat)
fixed_names <- colnames(fixed_dat)
# no empty names
expect_false(any(sapply(fixed_names, is.null)), info = combo_name)

This comment has been minimized.

@krlmlr

krlmlr Jan 12, 2016

Member

How can a vector element be NULL?

@r2evans

This comment has been minimized.

r2evans commented Feb 16, 2016

Sorry for the extended sabbatical. New PR in a moment.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Feb 29, 2016

No worries. Did you add the updates to this PR? Could you please merge with master?

@r2evans

This comment has been minimized.

r2evans commented Mar 2, 2016

@krlmlr, done, merged, ready (I think).

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 2, 2016

repair_names(data_frame()) and repair_names(data.frame(row.names = 1:10)) give an error. Looks good otherwise.

@r2evans

This comment has been minimized.

r2evans commented Mar 2, 2016

Dang it ... I'll look at it later this week, thanks for pointing that out.

@r2evans

This comment has been minimized.

r2evans commented Mar 4, 2016

any other things to test? an empty data.frame was a good thought. (I fixed that last bug, give it a try.)

DESCRIPTION Outdated
@@ -1,8 +1,8 @@
Encoding: UTF-8
Package: tibble
Type: Package
Version: 0.1-5
Date: 2016-02-29
Version: 0.2

This comment has been minimized.

@krlmlr

krlmlr Mar 4, 2016

Member

I'd rather bump version and date myself.

NEWS.md Outdated
@@ -1,39 +1,16 @@
Version 0.1-5 (2016-02-29)
Version 0.2 (2016-03-02)

This comment has been minimized.

@krlmlr

krlmlr Mar 4, 2016

Member

Please also leave NEWS unchanged.

@@ -70,7 +70,8 @@ frame_data <- function(...) {
frame_columns <- lapply(seq_len(frame_ncol), function(i) {
indices <- seq.default(from = i, to = length(frame_rest), by = frame_ncol)
col <- frame_rest[indices]
if (all(vapply(col, length, integer(1L)) == 1L)) {
if (!any(vapply(col, function(x) is.list(x) || length(x) != 1L,
logical(1L)))) {

This comment has been minimized.

@krlmlr

krlmlr Mar 4, 2016

Member

This change, and the ones below, look unrelated, too. Have you performed a merge with -X ours or even -s ours?

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 4, 2016

I think the easiest is to rebase/merge against master. Don't know why Git thinks these changes are introduced by your branch.

@r2evans

This comment has been minimized.

r2evans commented Mar 4, 2016

I did a git rebase -i master. I never edited NEWS, DESCRIPTION, or R/frame-data.R. I can try to reverse those, but those changes are not mine. (I must not understand git, since I thought I was following best practices.)

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 4, 2016

The rebase should have worked fine, I don't know what's wrong here -- it might be just the GitHub display. Could you please pull+rebase+force-push again, just to be sure?

@r2evans

This comment has been minimized.

r2evans commented Mar 4, 2016

So true. Is there a better way to check for "can have column names" than ...?

dx <- dim(x)
if (is.null(dx) || ! isTRUE(dx[2] > 0)) return(x)

Good question, should this work for lists. I don't see why it can't be extended to apply the same logic to list names; I'd always assumed that tibble with solely for data.frames. Is there a vision for extending tibble beyond?

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 4, 2016

names(iris) and names(tbl_df(iris)) both work for me...

There's lst() in tibble, and R has duck typing. The function is reasonable for anything that can have names -- I don't see a reason to be unnecessarily restrictive here.

@r2evans

This comment has been minimized.

r2evans commented Mar 5, 2016

Alright, can do. I'm jammed for the moment and will try to get to it later this weekend. (I'll look at the intersect call ... I think I prefer it for a few cases, but I'll check again.)

@r2evans r2evans force-pushed the r2evans:repairColNames branch from f8de0d3 to ad365b3 Mar 8, 2016

@r2evans

This comment has been minimized.

r2evans commented Mar 8, 2016

How now? It works with data.frames, lists, and vectors (anything that names touches).

sub_names <- make.unique(sub_names, sep = sep)
xnames[!blanks] <- sub_names

if (any(blanks)) {

This comment has been minimized.

@krlmlr

krlmlr Mar 8, 2016

Member

How about setting blanks <- which(blanks) and moving from there?

This comment has been minimized.

@r2evans

r2evans Mar 8, 2016

Doesn't work:

xnames <- c('a', '', 'c')
blanks <- which(xnames == '')
xnames[blanks]
# [1] ""
xnames[-blanks]
# [1] "a" "c"

... but ...

xnames <- c('a', 'b', 'c')
blanks <- which(xnames == '')
xnames[blanks]
# character(0)
xnames[-blanks]
# character(0)
@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 8, 2016

Looks good; please merge/rebase one last time (NAMESPACE conflict...), my nitpick above is optional.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 8, 2016

I haven't found a test for simple named lists; would you mind adding one? (Just change the very first test to use a list and not a data frame.)

@r2evans r2evans force-pushed the r2evans:repairColNames branch from ad365b3 to 66fa65d Mar 8, 2016

Bill Evans Bill Evans
@r2evans

This comment has been minimized.

r2evans commented Mar 10, 2016

How now?

krlmlr added a commit that referenced this pull request Mar 10, 2016

Merge pull request #15 from r2evans/repairColNames
- New function `repair_names()` fixes missing and duplicate names (#10, #15, @r2evans).

@krlmlr krlmlr merged commit 432bc7c into tidyverse:master Mar 10, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 10, 2016

Thanks! If you let GitHub know the e-mail address you used for your commits, the commits will be associated with your GitHub user.

krlmlr pushed a commit that referenced this pull request Mar 10, 2016

Kirill Müller
Merge tag 'v0.2-3'
- New function `repair_names()` fixes missing and duplicate names (#10, #15, @r2evans).
- Finer coverage analysis (#37).
- Use `tibble` prefix for options (#13, #36).
- Expand README.
- Fix typos in documentation.
- Remove use of `src()` from examples.
@r2evans

This comment has been minimized.

r2evans commented Mar 10, 2016

Ahhh, right ... I didn't change from my work .gitconfig, that's completely my bad. Bummer. (Would it make sense to change the email address associated with the commits?)

krlmlr pushed a commit that referenced this pull request Mar 10, 2016

Kirill Müller
Merge tag 'v0.3'
- Features
    - New `as_data_frame.table()` with argument `n` to control name of count column (#22, #23).
    - New function `repair_names()` fixes missing and duplicate names (#10, #15, @r2evans).
    - `frame_data()` now also creates a list column if one of the entries is a list (#32).
    - New `rownames_to_column()` and `column_to_rownames()` functions, replace `add_rownames()` (#11, @zhilongjia).
    - Use `tibble` prefix for options (#13, #36).

- Documentation
    - Add pre-tibble NEWS (#39, #40).
    - Include vignette (#38).
    - Expand README.
    - Fix typos in documentation.
    - Remove use of `src()` from examples.

- Prepare CRAN release
    - Use new-style `.travis.yml`
    - Use AppVeyor for testing.
    - Finer coverage analysis (#37).
    - Check with win-builder and valgrind.
    - Fix NOTE from `R CMD check`.

krlmlr pushed a commit that referenced this pull request Mar 22, 2016

Kirill Müller
Merge tag 'v1.0'
- Initial CRAN release

- Extracted from `dplyr` 0.4.3

- Exported functions:
    - `tbl_df()`
    - `as_data_frame()`
    - `data_frame()`, `data_frame_()`
    - `frame_data()`, `tibble()`
    - `glimpse()`
    - `trunc_mat()`, `knit_print.trunc_mat()`
    - `type_sum()`
    - New `lst()` and `lst_()` create lists in the same way that
      `data_frame()` and `data_frame_()` create data frames (tidyverse/dplyr#1290).
      `lst(NULL)` doesn't raise an error (#17, @jennybc), but always
      uses deparsed expression as name (even for `NULL`).
    - New `add_row()` makes it easy to add a new row to data frame
      (tidyverse/dplyr#1021).
    - New `rownames_to_column()` and `column_to_rownames()` (#11, @zhilongjia).
    - New `has_rownames()` and `remove_rownames()` (#44).
    - New `repair_names()` fixes missing and duplicate names (#10, #15,
      @r2evans).
    - New `is_vector_s3()`.

- Features
    - New `as_data_frame.table()` with argument `n` to control name of count
      column (#22, #23).
    - Use `tibble` prefix for options (#13, #36).
    - `glimpse()` now (invisibly) returns its argument (tidyverse/dplyr#1570). It
      is now a generic, the default method dispatches to `str()`
      (tidyverse/dplyr#1325).  The default width is obtained from the
      `tibble.width` option (#35, #56).
    - `as_data_frame()` is now an S3 generic with methods for lists (the old
      `as_data_frame()`), data frames (trivial), matrices (with efficient
      C++ implementation) (tidyverse/dplyr#876), and `NULL` (returns a 0-row
      0-column data frame) (#17, @jennybc).
    - Non-scalar input to `frame_data()` and `tibble()` (including lists)
      creates list-valued columns (#7). These functions return 0-row but n-col
      data frame if no data.

- Bug fixes
    - `frame_data()` properly constructs rectangular tables (tidyverse/dplyr#1377,
      @kevinushey).

- Minor modifications
    - Uses `setOldClass(c("tbl_df", "tbl", "data.frame"))` to help with S4
      (tidyverse/dplyr#969).
    - `tbl_df()` automatically generates column names (tidyverse/dplyr#1606).
    - `tbl_df`s gain `$` and `[[` methods that are ~5x faster than the defaults,
      never do partial matching (tidyverse/dplyr#1504), and throw an error if the
      variable does not exist.  `[[.tbl_df()` falls back to regular subsetting
      when used with anything other than a single string (#29).
      `base::getElement()` now works with tibbles (#9).
    - `all_equal()` allows to compare data frames ignoring row and column order,
      and optionally ignoring minor differences in type (e.g. int vs. double)
      (tidyverse/dplyr#821).  Used by `all.equal()` for tibbles.  (This package
      contains a pure R implementation of `all_equal()`, the `dplyr` code has
      identical behavior but is written in C++ and thus faster.)
    - The internals of `data_frame()` and `as_data_frame()` have been aligned,
      so `as_data_frame()` will now automatically recycle length-1 vectors.
      Both functions give more informative error messages if you are attempting
      to create an invalid data frame.  You can no longer create a data frame
      with duplicated names (tidyverse/dplyr#820).  Both functions now check that
      you don't have any `POSIXlt` columns, and tell you to use `POSIXct` if you
      do (tidyverse/dplyr#813).  `data_frame(NULL)` raises error "must be a 1d
      atomic vector or list".
    - `trunc_mat()` and `print.tbl_df()` are considerably faster if you have
      very wide data frames.  They will now also only list the first 100
      additional variables not already on screen - control this with the new
      `n_extra` parameter to `print()` (tidyverse/dplyr#1161).  The type of list
      columns is printed correctly (tidyverse/dplyr#1379).  The `width` argument is
      used also for 0-row or 0-column data frames (#18).
    - When used in list-columns, S4 objects only print the class name rather
      than the full class hierarchy (#33).
    - Add test that `[.tbl_df()` does not change class (#41, @jennybc).  Improve
      `[.tbl_df()` error message.

- Documentation
    - Update README, with edits (#52, @bhive01) and enhancements (#54,
      @jennybc).
    - `vignette("tibble")` describes the difference between tbl_dfs and
      regular data frames (tidyverse/dplyr#1468).

- Code quality
    - Test using new-style Travis-CI and AppVeyor. Full test coverage (#24,
      #53). Regression tests load known output from file (#49).
    - Renamed `obj_type()` to `obj_sum()`, improvements, better integration with
     `type_sum()`.
    - Internal cleanup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment