option to clean/normalize data.frames? #10

Closed
krlmlr opened this Issue Dec 17, 2015 · 7 comments

@krlmlr
Member

krlmlr commented Dec 17, 2015

Fixes tidyverse/dplyr#1587.

@r2evans: Would you like to contribute to this package?

@r2evans

r2evans commented Dec 17, 2015

I haven't gone through the roxygen docs yet: what's the purpose of tibble? (I don't mind contributing to the package, but the description is a little brief.)

@krlmlr

Member

krlmlr commented Dec 17, 2015

Eventually, dplyr will import from tibble. This package extracts the functionality around tbl_df from dplyr to make it available for other packages such as tidyr and readr. See also https://github.com/krlmlr/tibble/releases/tag/v0.1 and tidyverse/dplyr#1488.

@r2evans

r2evans commented Dec 17, 2015

Perfect, I understand. Hadley was going to mull over the "if" and "where" of the suggested patch (from hadley/dplyr#1587). Did you have any thoughts on it? I'll be happy to create the PR for it.

@krlmlr

Member

krlmlr commented Dec 18, 2015

How about fix_names() or even repair_names()? I think that for any data frame where tbl_df() raises an error, repair_names() %>% tbl_df() should just work, and this is also what the unit tests should check. Data frames for which tbl_df() already works should not be touched.

We can work on this together and discuss the results with Hadley.
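
To make the proposed contract concrete, here is a minimal base R sketch. It is not the code that was later merged in #15; the name `repair_names_sketch()`, the `prefix` argument, and the choice of `make.unique()` for de-duplication are all assumptions made for illustration. It fills in missing names, de-duplicates the rest, and returns the input unchanged when the names are already usable.

```r
# Hypothetical sketch of the idea discussed above; NOT the merged implementation.
repair_names_sketch <- function(x, prefix = "V") {
  nms <- names(x)
  if (is.null(nms)) nms <- rep("", length(x))

  # A name counts as missing if it is NA or the empty string.
  missing_nm <- is.na(nms) | nms == ""

  # Inputs whose names are already complete and unique are not touched.
  if (!any(missing_nm) && !anyDuplicated(nms)) return(x)

  # Fill missing names by position, then make every name unique.
  nms[missing_nm] <- paste0(prefix, seq_along(nms))[missing_nm]
  names(x) <- make.unique(nms, sep = "")
  x
}

broken <- list(1, 2, 3)
names(broken) <- c("a", "", "a")
repaired <- repair_names_sketch(broken)

ok <- data.frame(a = 1, b = 2)
untouched <- repair_names_sketch(ok)
```

A unit test along the lines suggested above would then check two things: that a repaired object has complete, unique names, and that an already valid data frame comes back identical.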

@krlmlr

Member

krlmlr commented Jan 3, 2016

@r2evans: Are you still interested in contributing? This could be useful for as_data_frame.matrix():

> dplyr::as_data_frame(diag(3))
Source: local data frame [3 x 3]

     NA    NA    NA
  (dbl) (dbl) (dbl)
1     1     0     0
2     0     1     0
3     0     0     1
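
For comparison, a small base R sketch of the same situation: a matrix built with `diag()` carries no column names, and base `as.data.frame()` fills them in itself. The explicit `paste0("V", ...)` naming in the last step is only an assumed convention for illustration, not something this thread settles on.

```r
m <- diag(3)

# The matrix has no column names at all.
no_names <- is.null(colnames(m))

# Base R assigns default names when converting a nameless matrix.
base_names <- names(as.data.frame(m))

# Naming the columns up front avoids relying on any conversion default.
colnames(m) <- paste0("V", seq_len(ncol(m)))
named <- as.data.frame(m)
```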

krlmlr pushed a commit that referenced this issue Jan 3, 2016

Kirill Müller
use matrixToDataFrame()
- breaks existing test: missing column names are not auto-assigned anymore (#10)
@r2evans

r2evans commented Jan 3, 2016

Yes, thanks for the poke. This is a good test case, too.

@krlmlr krlmlr modified the milestone: 1.0 Jan 25, 2016

@krlmlr krlmlr modified the milestones: 2.0, 1.0 Mar 2, 2016

krlmlr added a commit that referenced this issue Mar 10, 2016

Merge pull request #15 from r2evans/repairColNames
- New function `repair_names()` fixes missing and duplicate names (#10, #15, @r2evans).

krlmlr pushed a commit that referenced this issue Mar 10, 2016

Kirill Müller
Merge tag 'v0.2-3'
- New function `repair_names()` fixes missing and duplicate names (#10, #15, @r2evans).
- Finer coverage analysis (#37).
- Use `tibble` prefix for options (#13, #36).
- Expand README.
- Fix typos in documentation.
- Remove use of `src()` from examples.

krlmlr pushed a commit that referenced this issue Mar 10, 2016

Kirill Müller
Merge tag 'v0.3'
- Features
    - New `as_data_frame.table()` with argument `n` to control name of count column (#22, #23).
    - New function `repair_names()` fixes missing and duplicate names (#10, #15, @r2evans).
    - `frame_data()` now also creates a list column if one of the entries is a list (#32).
    - New `rownames_to_column()` and `column_to_rownames()` functions, replace `add_rownames()` (#11, @zhilongjia).
    - Use `tibble` prefix for options (#13, #36).

- Documentation
    - Add pre-tibble NEWS (#39, #40).
    - Include vignette (#38).
    - Expand README.
    - Fix typos in documentation.
    - Remove use of `src()` from examples.

- Prepare CRAN release
    - Use new-style `.travis.yml`
    - Use AppVeyor for testing.
    - Finer coverage analysis (#37).
    - Check with win-builder and valgrind.
    - Fix NOTE from `R CMD check`.
@hadley

Member

hadley commented Mar 17, 2016

Seems like this is done, but see #47

@hadley hadley closed this Mar 17, 2016

krlmlr pushed a commit that referenced this issue Mar 22, 2016

Kirill Müller
Merge tag 'v1.0'
- Initial CRAN release

- Extracted from `dplyr` 0.4.3

- Exported functions:
    - `tbl_df()`
    - `as_data_frame()`
    - `data_frame()`, `data_frame_()`
    - `frame_data()`, `tibble()`
    - `glimpse()`
    - `trunc_mat()`, `knit_print.trunc_mat()`
    - `type_sum()`
    - New `lst()` and `lst_()` create lists in the same way that
      `data_frame()` and `data_frame_()` create data frames (tidyverse/dplyr#1290).
      `lst(NULL)` doesn't raise an error (#17, @jennybc), but always
      uses deparsed expression as name (even for `NULL`).
    - New `add_row()` makes it easy to add a new row to a data frame
      (tidyverse/dplyr#1021).
    - New `rownames_to_column()` and `column_to_rownames()` (#11, @zhilongjia).
    - New `has_rownames()` and `remove_rownames()` (#44).
    - New `repair_names()` fixes missing and duplicate names (#10, #15,
      @r2evans).
    - New `is_vector_s3()`.

- Features
    - New `as_data_frame.table()` with argument `n` to control name of count
      column (#22, #23).
    - Use `tibble` prefix for options (#13, #36).
    - `glimpse()` now (invisibly) returns its argument (tidyverse/dplyr#1570). It
      is now a generic, the default method dispatches to `str()`
      (tidyverse/dplyr#1325).  The default width is obtained from the
      `tibble.width` option (#35, #56).
    - `as_data_frame()` is now an S3 generic with methods for lists (the old
      `as_data_frame()`), data frames (trivial), matrices (with efficient
      C++ implementation) (tidyverse/dplyr#876), and `NULL` (returns a 0-row
      0-column data frame) (#17, @jennybc).
    - Non-scalar input to `frame_data()` and `tibble()` (including lists)
      creates list-valued columns (#7). These functions return a 0-row but
      n-col data frame if no data is supplied.

- Bug fixes
    - `frame_data()` properly constructs rectangular tables (tidyverse/dplyr#1377,
      @kevinushey).

- Minor modifications
    - Uses `setOldClass(c("tbl_df", "tbl", "data.frame"))` to help with S4
      (tidyverse/dplyr#969).
    - `tbl_df()` automatically generates column names (tidyverse/dplyr#1606).
    - `tbl_df`s gain `$` and `[[` methods that are ~5x faster than the defaults,
      never do partial matching (tidyverse/dplyr#1504), and throw an error if the
      variable does not exist.  `[[.tbl_df()` falls back to regular subsetting
      when used with anything other than a single string (#29).
      `base::getElement()` now works with tibbles (#9).
    - `all_equal()` allows comparing data frames ignoring row and column order,
      and optionally ignoring minor differences in type (e.g. int vs. double)
      (tidyverse/dplyr#821).  Used by `all.equal()` for tibbles.  (This package
      contains a pure R implementation of `all_equal()`, the `dplyr` code has
      identical behavior but is written in C++ and thus faster.)
    - The internals of `data_frame()` and `as_data_frame()` have been aligned,
      so `as_data_frame()` will now automatically recycle length-1 vectors.
      Both functions give more informative error messages if you are attempting
      to create an invalid data frame.  You can no longer create a data frame
      with duplicated names (tidyverse/dplyr#820).  Both functions now check that
      you don't have any `POSIXlt` columns, and tell you to use `POSIXct` if you
      do (tidyverse/dplyr#813).  `data_frame(NULL)` raises error "must be a 1d
      atomic vector or list".
    - `trunc_mat()` and `print.tbl_df()` are considerably faster if you have
      very wide data frames.  They will now also only list the first 100
      additional variables not already on screen - control this with the new
      `n_extra` parameter to `print()` (tidyverse/dplyr#1161).  The type of list
      columns is printed correctly (tidyverse/dplyr#1379).  The `width` argument is
      used also for 0-row or 0-column data frames (#18).
    - When used in list-columns, S4 objects only print the class name rather
      than the full class hierarchy (#33).
    - Add test that `[.tbl_df()` does not change class (#41, @jennybc).  Improve
      `[.tbl_df()` error message.

- Documentation
    - Update README, with edits (#52, @bhive01) and enhancements (#54,
      @jennybc).
    - `vignette("tibble")` describes the difference between tbl_dfs and
      regular data frames (tidyverse/dplyr#1468).

- Code quality
    - Test using new-style Travis-CI and AppVeyor. Full test coverage (#24,
      #53). Regression tests load known output from file (#49).
    - Renamed `obj_type()` to `obj_sum()`, improvements, better integration with
     `type_sum()`.
    - Internal cleanup.
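
The changelog entry about `$` and `[[` never doing partial matching is easiest to appreciate against base R behaviour. The sketch below uses only base R; `strict_get()` is a hypothetical helper written for this illustration and is not part of tibble. It mimics the "throw an error if the variable does not exist" behaviour described above.

```r
df <- data.frame(value = 1:3)

# Base `$` on a data frame partially matches: `val` silently finds `value`.
partial <- df$val

# Base `[[` matches exactly by default and returns NULL for an unknown name.
exact <- df[["val"]]

# Hypothetical strict accessor: exact match only, error on unknown names.
strict_get <- function(x, name) {
  if (!name %in% names(x)) {
    stop("Unknown column '", name, "'", call. = FALSE)
  }
  x[[name]]
}

result <- try(strict_get(df, "val"), silent = TRUE)
```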