Slow printing with very wide tibbles #360

Closed
blueprint-ade opened this issue Jan 11, 2018 · 11 comments

@blueprint-ade blueprint-ade commented Jan 11, 2018

After updating to the most recent version of the package, I noticed that a) the new console output is great, and b) printing is substantially slower for tibbles with ~50 or more columns. Besides being slower overall, printing hangs between the tabular data preview and the list of columns excluded from it.

This reprex uses gapminder data to make a tibble with 1 row and 711 columns. I exaggerated the number of columns in an effort to make it reproducible on machines with better specs than my middling i5 with 8 GB of RAM.

load packages

library(gapminder)
library(tidyverse)
#-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
# v ggplot2 2.2.1     v purrr   0.2.4
# v tibble  1.4.1     v dplyr   0.7.4
# v tidyr   0.7.2     v stringr 1.2.0
# v readr   1.1.1     v forcats 0.2.0

make the test data

tst_tibble <- gapminder %>%
  
  # change the year filter to add or subtract columns 
  # from the final tibble
  filter(year < 1975) %>% 
  unite(loc_yr, continent, country, year) %>%
  select(loc_yr, lifeExp) %>% 
  spread(loc_yr, lifeExp)


tst_df     <- as.data.frame(tst_tibble)

simple timing

system.time(print(tst_tibble))

# user  system elapsed 
# 8.24    0.00    8.28 

system.time(print(tst_df))

# user  system elapsed 
# 0.55    0.00    0.56

conclusions

Obviously tibble is doing more work to print its output than data.frame(), but the ~15X jump in time seems like quite a lot more than it was in previous versions, and also more than it should be to produce the output that is actually shown on screen. I unfortunately don't have time to downgrade tibble, or test timing more rigorously, but I'll check later and update.

My only hypothesis is that tibble is applying its print processing to all the columns, including the hidden ones, before it shrinks the output and sends it to the console, but I don't know enough to figure out whether or not that's the case.
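
For anyone who wants to compare the two print paths without installing gapminder, the comparison can be repeated on synthetic data. This is only a sketch: the column names and the helper `time_print()` are made up for illustration and are not part of the reprex above. It wraps printing in `capture.output()` to keep the console quiet, which may change how colored output is rendered, so absolute timings can differ from an interactive session.

```r
# Build a 1-row, 711-column data frame of random numbers (a synthetic
# stand-in for the gapminder-derived example; column names are made up).
set.seed(1)
n_cols <- 711
wide_df <- as.data.frame(
  matrix(
    runif(n_cols),
    nrow = 1,
    dimnames = list(NULL, sprintf("col_%03d", seq_len(n_cols)))
  )
)

# Hypothetical helper: elapsed seconds needed to print `x`, with the
# printed text captured instead of flooding the console.
time_print <- function(x) {
  system.time(capture.output(print(x)))[["elapsed"]]
}

timings <- c(data.frame = time_print(wide_df))

# Only time the tibble path if the package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  wide_tbl <- tibble::as_tibble(wide_df)
  timings[["tibble"]] <- time_print(wide_tbl)
}

print(timings)
```

Running `time_print()` several times in a row gives a better picture than a single call, since the first print in a session can be slower than later ones.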

session info

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.2.0   stringr_1.2.0   dplyr_0.7.4     purrr_0.2.4     readr_1.1.1     tidyr_0.7.2    
[7] tibble_1.4.1    ggplot2_2.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14     cellranger_1.1.0 pillar_1.0.1     compiler_3.4.3   plyr_1.8.4      
 [6] bindr_0.1        tools_3.4.3      lubridate_1.7.1  jsonlite_1.5     nlme_3.1-131    
[11] gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.6      psych_1.7.8     
[16] cli_1.0.0        rstudioapi_0.7   yaml_2.1.16      parallel_3.4.3   haven_1.1.0     
[21] bindrcpp_0.2     xml2_1.1.1       httr_1.3.1       hms_0.4.0        grid_3.4.3      
[26] glue_1.2.0       R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.1    
[31] reshape2_1.4.3   magrittr_1.5     scales_0.5.0     rvest_0.3.2      assertthat_0.2.0
[36] mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.6    lazyeval_0.2.1   munsell_0.4.3   
[41] broom_0.4.3      crayon_1.3.4    
Member

@krlmlr krlmlr commented Jan 11, 2018

Thanks for raising this issue and for the example. We'll investigate the performance of printing and other operations for the upcoming tibble release.

Member

@krlmlr krlmlr commented Jan 14, 2018

We're doing expensive computations for all columns of a data frame, not only for those to be displayed. Need to revisit that.

@krlmlr krlmlr mentioned this issue Jan 14, 2018
1 of 3 tasks complete
Member

@krlmlr krlmlr commented Jan 14, 2018

I fixed some of the worst problems in pillar in r-lib/pillar#87 (now merged), install with

# install.packages("remotes")
remotes::install_github("r-lib/pillar")

Still, our implementation of col_strwrap() is too slow if 100 extra columns (the default) are displayed. I'll implement a workaround; r-lib/pillar#86 might be a cleaner solution.
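
Until the wrapping is faster, one possible user-side mitigation is to ask for fewer extra columns in the footer. The sketch below assumes the documented `tibble.max_extra_cols` option. Whether it actually shortens printing time in a given tibble/pillar version is an assumption that should be measured rather than taken for granted. The value 10 and the synthetic test data are arbitrary.

```r
# Lower the number of extra (non-displayed) columns listed in the footer.
# 10 is an arbitrary illustrative value; the default is 100.
old_opts <- options(tibble.max_extra_cols = 10)

if (requireNamespace("tibble", quietly = TRUE)) {
  # Synthetic 1-row, 300-column tibble (hypothetical test data).
  wide_tbl <- tibble::as_tibble(as.data.frame(matrix(0, nrow = 1, ncol = 300)))
  # Capture the printed lines so they can be inspected or timed.
  printed <- capture.output(print(wide_tbl))
}

# Restore the previous setting.
options(old_opts)
```

Saving the return value of `options()` and passing it back afterwards keeps the change local to the experiment.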

Member

@hadley hadley commented Jan 15, 2018

We should at least replicate the previous heuristic, which is to only look at the first options("width") / 4 columns.
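
As a rough illustration of that heuristic (not tibble's actual code), the idea is to cap the number of columns considered for display before doing any per-column formatting. The helper names `max_display_cols()` and `candidate_cols()` are hypothetical.

```r
# Upper bound on displayable columns: console width divided by 4.
max_display_cols <- function(width = getOption("width")) {
  max(1L, as.integer(width %/% 4))
}

# Keep only the leading columns that could possibly fit; the remaining
# columns would be reported by name only, without expensive formatting.
candidate_cols <- function(df, width = getOption("width")) {
  keep <- seq_len(min(ncol(df), max_display_cols(width)))
  df[, keep, drop = FALSE]
}

# Synthetic 1-row, 711-column data frame (hypothetical test data).
wide_df <- as.data.frame(matrix(0, nrow = 1, ncol = 711))

# With an 80-character console, at most 20 columns are formatted.
ncol(candidate_cols(wide_df, width = 80))
```

With this cap, the cost of formatting the data columns depends on the console width rather than on the total number of columns.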

Member

@krlmlr krlmlr commented Jan 17, 2018

That's what we do; it's the colored word wrap that now takes most of the time.

Fixing things here will also fix OS X test failures (because of different behavior of wrapping non-breaking spaces on Linux and OS X).

Member

@hadley hadley commented Jan 17, 2018

The wrap of the data columns or the extra columns?

Member

@krlmlr krlmlr commented Jan 17, 2018

Wrapping extra columns currently takes most of the time for wide tibbles.

Member

@krlmlr krlmlr commented Jan 19, 2018

The example above prints in 0.32s the first time, and in 0.16s from the second time on. This is about as fast as I could make it.

krlmlr added a commit that referenced this issue Jan 20, 2018
- `enframe(NULL)` now returns the same as `enframe(logical())` (#352).
- `tbl[1, , drop = TRUE]` now behaves identically to data frames (#367).
- The `tibble.width` option is honored again (#369).
- Faster printing of very wide tibbles (#360).
- Update vignette to match changes in 1.4.1 (#368, @bgreenwell).
- Don't rely on `ncol()` for `glimpse()`, only query `nrow()` and `head()`.
- Return input for zero-column data frames.
- Add test for `glimpse()` with unknown rows (#366, @kevinykuo).
- Faster construction and subsetting for tibbles (#353).
- `tribble()` now ignores trailing commas (#342, @LaDilettante).
- Fix error message when accessing columns using a logical index vector (#337, @mundl).
krlmlr added a commit that referenced this issue Jan 23, 2018
Bug fixes
---------

- Fix OS X builds.
- The `tibble.width` option is honored again (#369).
- `tbl[1, , drop = TRUE]` now behaves identically to data frames (#367).
- Fix error message when accessing columns using a logical index vector (#337, @mundl).
- `glimpse()` returns its input for zero-column data frames.

Features
--------

- `enframe(NULL)` now returns the same as `enframe(logical())` (#352).
- `tribble()` now ignores trailing commas (#342, @LaDilettante).
- Updated vignettes and website documentation.

Performance
-----------

- Faster printing of very wide tibbles (#360).
- Faster construction and subsetting for tibbles (#353).
- Only call `nrow()` and `head()` in `glimpse()`, not `ncol()`.

@gvfarns gvfarns commented Apr 26, 2019

This issue persists in current versions and is very troublesome---I've had to take up the practice of converting all my tbls into data frames because of this. I have a very new and powerful machine and the example above takes

system.time(print(tst_tibble))
   user  system elapsed 
  5.304   0.008   5.343 

versus

system.time(print(tst_df))
   user  system elapsed 
  0.035   0.004   0.052 

I think it's a serious mistake to close this issue. Are you OK waiting more than 5 seconds every time you glance at a dataset? Why should looking at a tibble take that much longer than looking at the same data as a data.frame? I think this should be reopened.

── Attaching packages ─────────── tidyverse 1.2.1 ──
ggplot2 3.1.1 readr 1.3.1
tibble 2.1.1 purrr 0.3.2
tidyr 0.8.3 stringr 1.4.0
ggplot2 3.1.1 forcats 0.4.0

Member

@jennybc jennybc commented Apr 26, 2019

@gvfarns I recommend you open a new issue and link to this thread. That fits better with our workflow.


@github-actions github-actions bot commented Dec 8, 2020

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 8, 2020
5 participants