New function idea: add_row() #1021

Closed
smach opened this issue Mar 14, 2015 · 8 comments

smach commented Mar 14, 2015

From an exchange on Twitter: an idea for an add_row() function that would make it easy to add a row to a data frame whose columns are of different classes. As you then suggested:

you could make it like add_row(mtcars, cyl = 4, disp = 7) and it filled in the other values with missings
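
A minimal sketch of that behavior, assuming the `add_row()` that was later released in the tibble package (see the commit referenced further down). The small `cars` tibble is made up for illustration:

```r
library(tibble)

# Illustrative data; the column names mirror a few mtcars columns.
cars <- tibble(
  mpg  = c(21.0, 22.8),
  cyl  = c(6, 4),
  disp = c(160, 108)
)

# Only cyl and disp are supplied; mpg is filled in with NA.
out <- add_row(cars, cyl = 4, disp = 7)
out
```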

I took a first stab at writing such a function for myself here:

https://github.com/smach/rmiscutils/blob/master/R/add_row.R

Code that's a bit more elegant and efficient would be a nice addition to dplyr!

@hadley hadley added this to the 0.5 milestone May 19, 2015
@hadley hadley added feature a feature request or enhancement data frame labels Oct 22, 2015

hadley commented Oct 22, 2015

@kevinushey @jennybc any thoughts on what this should look like?

jennybc commented Oct 29, 2015

I've always found the way rbind.data.frame handles factors to be a minor miracle. It actually expands the levels of the new factor! By default! So that would be a nice feature.
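
A minimal base-R sketch of the level expansion described here. The data frame and its columns (`gene`, `score`) are made up for illustration:

```r
# Two-level factor in the original data frame.
df <- data.frame(gene = factor(c("a", "b")), score = c(1, 2))

# The new row carries a level ("c") that df$gene does not have.
new_row <- data.frame(gene = factor("c"), score = 3)

# rbind() on data frames widens the factor's levels instead of producing NA.
combined <- rbind(df, new_row)
levels(combined$gene)
```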

I guess I'd expect add_row() to just be a slicker version of this:

library(dplyr)
mtcars2 <- mtcars %>%
  add_rownames()
new_row <- data_frame(rowname = "novel", cyl = 4, disp = 7)
mtcars2 %>%
  full_join(new_row) %>% 
  tail(3) %>% 
  select(-rowname)
#> Joining by: c("rowname", "cyl", "disp")
#> Source: local data frame [3 x 11]
#> 
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#> 1  15.0     8   301   335  3.54  3.57  14.6     0     1     5     8
#> 2  21.4     4   121   109  4.11  2.78  18.6     1     1     4     2
#> 3    NA     4     7    NA    NA    NA    NA    NA    NA    NA    NA

I wonder if you would ever want the ability to insert a row, i.e. specify a row number? Or to make the new row the first row instead of the last?
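
For reference, later tibble releases of `add_row()` accept `.before` and `.after` arguments for this. A small sketch, assuming such a version is installed; the tibble `df` is illustrative:

```r
library(tibble)

df <- tibble(x = 1:3, y = c("a", "b", "c"))

# Make the new row the first row; y is not supplied, so it becomes NA.
first <- add_row(df, x = 0L, .before = 1)

# Insert the new row after row 2 of the original data.
middle <- add_row(df, x = 99L, .after = 2)
```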

hadley commented Oct 29, 2015

To me, silently expanding the levels of a factor just seems like the wrong thing to do - the levels of a factor represent a predefined set of values that you already know.

jennybc commented Oct 29, 2015

Your factors are not like my factors. Mine have ... one level per gene or other genomic feature, so lots of levels. During exploration the set that survive various steps of analysis is very fluid.

But OK, it seems like a row-inserting user should be content to get the usual dplyr behavior re: joining when it comes to same-name factors with different levels, same-name factor + character var, and same-name vars of incompatible type.

hadley commented Oct 29, 2015

@jennybc so why do you use factors instead of strings?

jennybc commented Oct 29, 2015

Some is probably habit? All the pieces of your stringsAsFactors = FALSE ecosystem haven't been in place for that long.

But also: I tend to be fitting models with fixed gene effects and I need to control the reference level. Also I make lots of facetted plots where facets/genes need to be reorder()ed in a principled way. That all feels easier with factor.

hadley commented Oct 29, 2015

@jennybc ok, that makes sense. I think we're missing a variable type that's a string + custom ordering.

@hadley hadley closed this as completed in 8b29bfb Oct 29, 2015

jennybc commented Oct 30, 2015

I didn't appreciate that bind_rows() would do this! I.e. that it will cope with missing vars. Nice.
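
A minimal sketch of what is meant here, using made-up data: `bind_rows()` matches columns by name and fills any column missing from one input with `NA`.

```r
library(dplyr)

df <- data.frame(cyl = c(6, 8), disp = c(160, 360), hp = c(110, 175))

# The new row has no hp column at all.
new_row <- data.frame(cyl = 4, disp = 7)

# Columns are matched by name; hp is NA for the appended row.
out <- bind_rows(df, new_row)
out
```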

But this add_row() business brings it home to me that it is going to be hard to hold on to a factor (e.g. #1485). Feels like there's a really strong pull towards character. Change is hard 😕.

krlmlr pushed a commit to krlmlr/dplyr that referenced this issue Mar 2, 2016
krlmlr pushed a commit to tidyverse/tibble that referenced this issue Mar 22, 2016
- Initial CRAN release

- Extracted from `dplyr` 0.4.3

- Exported functions:
    - `tbl_df()`
    - `as_data_frame()`
    - `data_frame()`, `data_frame_()`
    - `frame_data()`, `tibble()`
    - `glimpse()`
    - `trunc_mat()`, `knit_print.trunc_mat()`
    - `type_sum()`
    - New `lst()` and `lst_()` create lists in the same way that
      `data_frame()` and `data_frame_()` create data frames (tidyverse/dplyr#1290).
      `lst(NULL)` doesn't raise an error (#17, @jennybc), but always
      uses deparsed expression as name (even for `NULL`).
    - New `add_row()` makes it easy to add a new row to a data frame
      (tidyverse/dplyr#1021).
    - New `rownames_to_column()` and `column_to_rownames()` (#11, @zhilongjia).
    - New `has_rownames()` and `remove_rownames()` (#44).
    - New `repair_names()` fixes missing and duplicate names (#10, #15,
      @r2evans).
    - New `is_vector_s3()`.

- Features
    - New `as_data_frame.table()` with argument `n` to control name of count
      column (#22, #23).
    - Use `tibble` prefix for options (#13, #36).
    - `glimpse()` now (invisibly) returns its argument (tidyverse/dplyr#1570). It
      is now a generic, the default method dispatches to `str()`
      (tidyverse/dplyr#1325).  The default width is obtained from the
      `tibble.width` option (#35, #56).
    - `as_data_frame()` is now an S3 generic with methods for lists (the old
      `as_data_frame()`), data frames (trivial), matrices (with efficient
      C++ implementation) (tidyverse/dplyr#876), and `NULL` (returns a 0-row
      0-column data frame) (#17, @jennybc).
    - Non-scalar input to `frame_data()` and `tibble()` (including lists)
      creates list-valued columns (#7). These functions return a 0-row but
      n-col data frame if no data is supplied.

- Bug fixes
    - `frame_data()` properly constructs rectangular tables (tidyverse/dplyr#1377,
      @kevinushey).

- Minor modifications
    - Uses `setOldClass(c("tbl_df", "tbl", "data.frame"))` to help with S4
      (tidyverse/dplyr#969).
    - `tbl_df()` automatically generates column names (tidyverse/dplyr#1606).
    - `tbl_df`s gain `$` and `[[` methods that are ~5x faster than the defaults,
      never do partial matching (tidyverse/dplyr#1504), and throw an error if the
      variable does not exist.  `[[.tbl_df()` falls back to regular subsetting
      when used with anything other than a single string (#29).
      `base::getElement()` now works with tibbles (#9).
    - `all_equal()` allows comparing data frames while ignoring row and column order,
      and optionally ignoring minor differences in type (e.g. int vs. double)
      (tidyverse/dplyr#821).  Used by `all.equal()` for tibbles.  (This package
      contains a pure R implementation of `all_equal()`, the `dplyr` code has
      identical behavior but is written in C++ and thus faster.)
    - The internals of `data_frame()` and `as_data_frame()` have been aligned,
      so `as_data_frame()` will now automatically recycle length-1 vectors.
      Both functions give more informative error messages if you are attempting
      to create an invalid data frame.  You can no longer create a data frame
      with duplicated names (tidyverse/dplyr#820).  Both functions now check that
      you don't have any `POSIXlt` columns, and tell you to use `POSIXct` if you
      do (tidyverse/dplyr#813).  `data_frame(NULL)` raises error "must be a 1d
      atomic vector or list".
    - `trunc_mat()` and `print.tbl_df()` are considerably faster if you have
      very wide data frames.  They will now also only list the first 100
      additional variables not already on screen - control this with the new
      `n_extra` parameter to `print()` (tidyverse/dplyr#1161).  The type of list
      columns is printed correctly (tidyverse/dplyr#1379).  The `width` argument is
      used also for 0-row or 0-column data frames (#18).
    - When used in list-columns, S4 objects only print the class name rather
      than the full class hierarchy (#33).
    - Add test that `[.tbl_df()` does not change class (#41, @jennybc).  Improve
      `[.tbl_df()` error message.

- Documentation
    - Update README, with edits (#52, @bhive01) and enhancements (#54,
      @jennybc).
    - `vignette("tibble")` describes the difference between tbl_dfs and
      regular data frames (tidyverse/dplyr#1468).

- Code quality
    - Test using new-style Travis-CI and AppVeyor. Full test coverage (#24,
      #53). Regression tests load known output from file (#49).
    - Renamed `obj_type()` to `obj_sum()`, improvements, better integration with
      `type_sum()`.
    - Internal cleanup.
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018