print.tbl_df is very slow for wide datasets #1161

Closed
richierocks opened this issue May 21, 2015 · 4 comments

@richierocks richierocks commented May 21, 2015

Printing wide datasets is very slow because print.tbl_df also lists the names of all the columns that did not fit on screen.

Here's an example dataset with 1 row and 1e5 columns. (Many genomics datasets are much wider than this, so it isn't unrealistically large.)

library(dplyr)
ncols <- 1e5
# Build the tbl_df by hand: one row, ncols numeric columns with unique names.
d <- structure(
  as.list(runif(ncols)),
  class     = c("tbl_df", "tbl", "data.frame"),
  row.names = 1L,
  names     = make.names(rep.int("a", ncols), unique = TRUE)
)

Even if you limit the width of the printed content, the part where all the names are printed takes a long time.

d                    # slow
print(d, width = 50) # still slow

@richierocks richierocks commented May 21, 2015

There are a couple of optimisations that can be made in print.trunc_mat. If you add a call to flush.console() after print(x$table), then at least the user is guaranteed to be able to see their data while the column name printing calculations are happening.

If you change

var_types <- paste0(names(x$extra), " (", x$extra, ")", collapse = ", ")

to

var_types <- toString(paste0(names(x$extra), " (", x$extra, ")"), width = 1000)

then (on my machine, with the example in the original issue comment) the printing time is reduced from about 60 seconds to 25.

There are likely further optimisations that could be made in trunc_mat, but I haven't looked at those.

width = 1000 is a bit arbitrary; maybe width = getOption("max.print") is better.

@hadley hadley commented May 21, 2015

Seems like it might be reasonable to print only the first 100 extra column names or so?

@richierocks richierocks commented May 21, 2015

Yeah, that sounds fine. I don't think anyone will bother to actually read any more than that, and they can always use colnames if they really want to see the rest.
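
For illustration, the truncation discussed above might look roughly like this in base R. It is only a sketch: `format_extra_cols()` and its `n_extra` argument are hypothetical names, and `extra` is assumed to be a named character vector of type summaries, as `x$extra` appears to be in the snippet quoted earlier. It is not the code that was eventually committed.

```r
# Hypothetical helper: list at most `n_extra` of the columns that did not fit.
# `extra` is assumed to be a named character vector: names are column names,
# values are short type summaries such as "dbl".
format_extra_cols <- function(extra, n_extra = 100L) {
  if (length(extra) == 0L) {
    return("")
  }
  # Truncate *before* pasting, so that at most `n_extra` names are pasted.
  shown <- utils::head(extra, n_extra)
  out <- paste0(names(shown), " (", shown, ")", collapse = ", ")
  n_hidden <- length(extra) - length(shown)
  if (n_hidden > 0L) {
    out <- paste0(out, ", and ", n_hidden, " more")
  }
  out
}

# Usage with 1e5 made-up columns, mirroring the example in the issue.
extra <- stats::setNames(
  rep("dbl", 1e5),
  make.names(rep.int("a", 1e5), unique = TRUE)
)
summary_line <- format_extra_cols(extra)
nchar(summary_line)
```

Cutting the vector down before calling `paste0()` keeps the cost of building the summary line independent of how wide the data is; anyone who needs the full list can still call `colnames()`.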

@hadley hadley added the feature label Aug 24, 2015
@hadley hadley added this to the 0.5 milestone Aug 24, 2015

@hadley hadley commented Sep 21, 2015

Now that we're printing column types too, the minimum width of any column is 4: that should make it possible to do a quick approximate first pass and speed the process up.
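
One way to read the idea above: if every printed column needs at least 4 characters, the console width gives a cheap upper bound on how many columns can possibly fit, so only that many need to be formatted in full. A rough sketch follows. `max_cols_that_fit()` is a hypothetical helper, and the single-space separator and the absence of a row-name column are assumptions, not details taken from dplyr.

```r
# Hypothetical helper: upper bound on the number of columns that can fit
# in `width` characters, assuming each column needs at least `min_col_width`
# characters and columns are separated by `sep_width` characters.
max_cols_that_fit <- function(width, min_col_width = 4L, sep_width = 1L) {
  # k columns need at least k * min_col_width + (k - 1) * sep_width characters,
  # so k <= (width + sep_width) / (min_col_width + sep_width).
  k <- (width + sep_width) %/% (min_col_width + sep_width)
  max(0L, as.integer(k))
}

# Usage: only the first k columns are candidates for the exact (slow) pass.
wide <- as.data.frame(matrix(0, nrow = 1, ncol = 1000))
k <- max_cols_that_fit(width = 50)
candidates <- wide[, seq_len(min(ncol(wide), k)), drop = FALSE]
ncol(candidates)
```

The bound is deliberately optimistic: real columns are often wider than 4 characters, so an exact pass over the candidates would still be needed, but it would run over a handful of columns rather than all of them.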

@hadley hadley closed this in 8436893 Oct 28, 2015
krlmlr pushed a commit to krlmlr/dplyr that referenced this issue Mar 2, 2016
Only print first 100 variables in extra list.
Fixes tidyverse#1161
krlmlr pushed a commit to tidyverse/tibble that referenced this issue Mar 22, 2016
- Initial CRAN release

- Extracted from `dplyr` 0.4.3

- Exported functions:
    - `tbl_df()`
    - `as_data_frame()`
    - `data_frame()`, `data_frame_()`
    - `frame_data()`, `tibble()`
    - `glimpse()`
    - `trunc_mat()`, `knit_print.trunc_mat()`
    - `type_sum()`
    - New `lst()` and `lst_()` create lists in the same way that
      `data_frame()` and `data_frame_()` create data frames (tidyverse/dplyr#1290).
      `lst(NULL)` doesn't raise an error (#17, @jennybc), but always
      uses deparsed expression as name (even for `NULL`).
    - New `add_row()` makes it easy to add a new row to a data frame
      (tidyverse/dplyr#1021).
    - New `rownames_to_column()` and `column_to_rownames()` (#11, @zhilongjia).
    - New `has_rownames()` and `remove_rownames()` (#44).
    - New `repair_names()` fixes missing and duplicate names (#10, #15,
      @r2evans).
    - New `is_vector_s3()`.

- Features
    - New `as_data_frame.table()` with argument `n` to control name of count
      column (#22, #23).
    - Use `tibble` prefix for options (#13, #36).
    - `glimpse()` now (invisibly) returns its argument (tidyverse/dplyr#1570). It
      is now a generic, the default method dispatches to `str()`
      (tidyverse/dplyr#1325).  The default width is obtained from the
      `tibble.width` option (#35, #56).
    - `as_data_frame()` is now an S3 generic with methods for lists (the old
      `as_data_frame()`), data frames (trivial), matrices (with efficient
      C++ implementation) (tidyverse/dplyr#876), and `NULL` (returns a 0-row
      0-column data frame) (#17, @jennybc).
    - Non-scalar input to `frame_data()` and `tibble()` (including lists)
      creates list-valued columns (#7). These functions return a 0-row but
      n-column data frame if no data is supplied.

- Bug fixes
    - `frame_data()` properly constructs rectangular tables (tidyverse/dplyr#1377,
      @kevinushey).

- Minor modifications
    - Uses `setOldClass(c("tbl_df", "tbl", "data.frame"))` to help with S4
      (tidyverse/dplyr#969).
    - `tbl_df()` automatically generates column names (tidyverse/dplyr#1606).
    - `tbl_df`s gain `$` and `[[` methods that are ~5x faster than the defaults,
      never do partial matching (tidyverse/dplyr#1504), and throw an error if the
      variable does not exist.  `[[.tbl_df()` falls back to regular subsetting
      when used with anything other than a single string (#29).
      `base::getElement()` now works with tibbles (#9).
    - `all_equal()` compares data frames while ignoring row and column order,
      and optionally ignoring minor differences in type (e.g. int vs. double)
      (tidyverse/dplyr#821).  Used by `all.equal()` for tibbles.  (This package
      contains a pure R implementation of `all_equal()`; the `dplyr` code has
      identical behavior but is written in C++ and thus faster.)
    - The internals of `data_frame()` and `as_data_frame()` have been aligned,
      so `as_data_frame()` will now automatically recycle length-1 vectors.
      Both functions give more informative error messages if you are attempting
      to create an invalid data frame.  You can no longer create a data frame
      with duplicated names (tidyverse/dplyr#820).  Both functions now check that
      you don't have any `POSIXlt` columns, and tell you to use `POSIXct` if you
      do (tidyverse/dplyr#813).  `data_frame(NULL)` raises error "must be a 1d
      atomic vector or list".
    - `trunc_mat()` and `print.tbl_df()` are considerably faster if you have
      very wide data frames.  They will now also only list the first 100
      additional variables not already on screen - control this with the new
      `n_extra` parameter to `print()` (tidyverse/dplyr#1161).  The type of list
      columns is printed correctly (tidyverse/dplyr#1379).  The `width` argument is
      used also for 0-row or 0-column data frames (#18).
    - When used in list-columns, S4 objects only print the class name rather
      than the full class hierarchy (#33).
    - Add test that `[.tbl_df()` does not change class (#41, @jennybc).  Improve
      `[.tbl_df()` error message.

- Documentation
    - Update README, with edits (#52, @bhive01) and enhancements (#54,
      @jennybc).
    - `vignette("tibble")` describes the difference between tbl_dfs and
      regular data frames (tidyverse/dplyr#1468).

- Code quality
    - Test using new-style Travis-CI and AppVeyor. Full test coverage (#24,
      #53). Regression tests load known output from file (#49).
    - Renamed `obj_type()` to `obj_sum()`, improvements, better integration with
      `type_sum()`.
    - Internal cleanup.
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018