Slow printing with very wide tibbles #360

Closed
blueprint-ade opened this Issue Jan 11, 2018 · 8 comments


blueprint-ade commented Jan 11, 2018

After updating to the most recent version of the package, I noticed that (a) the new console output is great, and (b) printing is substantially slower for tibbles with ~50 or more columns. Besides being slower overall, printing hangs between the tabular data preview and the list of columns excluded from it.

This reprex uses gapminder data to make a tibble with 1 row and 711 columns. I exaggerated the number of columns so that the slowdown is reproducible on machines with better specs than my middling i5 with 8 GB of RAM.

load packages

library(gapminder)
library(tidyverse)
#-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
# v ggplot2 2.2.1     v purrr   0.2.4
# v tibble  1.4.1     v dplyr   0.7.4
# v tidyr   0.7.2     v stringr 1.2.0
# v readr   1.1.1     v forcats 0.2.0

make the test data

tst_tibble <- gapminder %>%
  
  # change the year filter to add or subtract columns 
  # from the final tibble
  filter(year < 1975) %>% 
  unite(loc_yr, continent, country, year) %>%
  select(loc_yr, lifeExp) %>% 
  spread(loc_yr, lifeExp)


tst_df     <- as.data.frame(tst_tibble)

simple timing

system.time(print(tst_tibble))

# user  system elapsed 
# 8.24    0.00    8.28 

system.time(print(tst_df))

# user  system elapsed 
# 0.55    0.00    0.56

conclusions

Obviously tibble does more work to print its output than a plain data frame, but the ~15x jump in time seems much larger than in previous versions, and larger than it should be for the output that is actually shown on screen. Unfortunately, I don't have time right now to downgrade tibble or test the timing more rigorously, but I'll check later and update.

My only hypothesis is that tibble applies its print processing to all columns, including the hidden ones, before it shrinks the output and sends it to the console, but I don't know enough to tell whether that's the case.
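
For a more rigorous comparison than the single runs above, the timing could be repeated with the console output captured, so that terminal rendering does not dominate the measurement. This is a minimal sketch using only base R. `time_print()` is a hypothetical helper, and `wide_df` is a stand-in so the snippet runs without gapminder.

```r
# Hypothetical helper: median elapsed time of print(x) over `times` runs.
# capture.output() keeps the printed text off the console while timing.
time_print <- function(x, times = 5) {
  elapsed <- vapply(seq_len(times), function(i) {
    system.time(utils::capture.output(print(x)))[["elapsed"]]
  }, numeric(1))
  median(elapsed)
}

# Stand-in for tst_df: a data frame with 1 row and 711 columns.
wide_df <- as.data.frame(matrix(runif(711), nrow = 1))
time_print(wide_df)

# With the objects from the reprex, the ratio of interest would be:
# time_print(tst_tibble) / time_print(tst_df)
```

The median is used rather than the mean because the first call can be slower than later ones.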

session info

R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.2.0   stringr_1.2.0   dplyr_0.7.4     purrr_0.2.4     readr_1.1.1     tidyr_0.7.2    
[7] tibble_1.4.1    ggplot2_2.2.1   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14     cellranger_1.1.0 pillar_1.0.1     compiler_3.4.3   plyr_1.8.4      
 [6] bindr_0.1        tools_3.4.3      lubridate_1.7.1  jsonlite_1.5     nlme_3.1-131    
[11] gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.1.6      psych_1.7.8     
[16] cli_1.0.0        rstudioapi_0.7   yaml_2.1.16      parallel_3.4.3   haven_1.1.0     
[21] bindrcpp_0.2     xml2_1.1.1       httr_1.3.1       hms_0.4.0        grid_3.4.3      
[26] glue_1.2.0       R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.1    
[31] reshape2_1.4.3   magrittr_1.5     scales_0.5.0     rvest_0.3.2      assertthat_0.2.0
[36] mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.6    lazyeval_0.2.1   munsell_0.4.3   
[41] broom_0.4.3      crayon_1.3.4    

Member

krlmlr commented Jan 11, 2018

Thanks for raising this issue and for the example. We'll investigate the performance of printing and other operations for the upcoming tibble release.


Member

krlmlr commented Jan 14, 2018

We're doing expensive computations for all columns of a data frame, not only for those to be displayed. Need to revisit that.

krlmlr referenced this issue Jan 14, 2018: Performance problems #85 (closed, 1 of 3 tasks complete)


Member

krlmlr commented Jan 14, 2018

I fixed some of the worst problems in pillar in r-lib/pillar#87 (now merged), install with

# install.packages("remotes")
remotes::install_github("r-lib/pillar")

Still, our implementation of col_strwrap() is too slow if 100 extra columns (the default) are displayed. I'll implement a workaround; r-lib/pillar#86 might be a cleaner solution.
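
In the meantime, one possible user-side mitigation is to cap the number of extra columns listed in the footer. This is a minimal sketch, assuming the `tibble.max_extra_cols` option behaves as documented for tibble 1.4.x (newer releases may handle it differently). The tibble `wide` is made up for illustration.

```r
library(tibble)

# Hypothetical wide tibble for illustration: 1 row, 500 columns.
wide <- as_tibble(as.list(setNames(seq_len(500), paste0("col_", seq_len(500)))))

# Assumption: fewer extra column names in the footer means less text to wrap.
old <- options(tibble.max_extra_cols = 10)
print(wide)

# Restore the previous setting afterwards.
options(old)
```

Whether this helps noticeably depends on how much of the print time is spent wrapping the extra columns on a given setup.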

krlmlr added this to To Do in krlmlr Jan 15, 2018

krlmlr added the performance label Jan 15, 2018


Member

hadley commented Jan 15, 2018

We should at least replicate the previous heuristic, which is to only look at the first options("width") / 4 columns.
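
The heuristic can be sketched in base R as follows. This is an illustration only, not the code tibble or pillar actually uses: `candidate_cols()` is a hypothetical helper, and the cutoff of four characters per column is taken from the sentence above.

```r
# Hypothetical helper: keep only the columns that could possibly fit on
# screen, assuming each displayed column needs at least 4 characters.
candidate_cols <- function(x, width = getOption("width")) {
  max_cols <- max(1L, width %/% 4L)
  x[seq_len(min(ncol(x), max_cols))]
}

# Stand-in for the 711-column example from the reprex.
wide_df <- as.data.frame(matrix(0, nrow = 1, ncol = 711))

# At a console width of 80, only the first 20 columns would be formatted.
ncol(candidate_cols(wide_df, width = 80))
```

With this cutoff, the cost of formatting the data columns depends on the console width rather than on the total number of columns.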


Member

krlmlr commented Jan 17, 2018

That's what we do; it's the colored word wrap that now takes most of the time.

Fixing things here will also fix OS X test failures (because of different behavior of wrapping non-breaking spaces on Linux and OS X).


Member

hadley commented Jan 17, 2018

The wrap of the data columns or the extra columns?


Member

krlmlr commented Jan 17, 2018

Wrapping extra columns currently takes most of the time for wide tibbles.


Member

krlmlr commented Jan 19, 2018

The example above prints in 0.32s the first time, and in 0.16s from the second time on. This is about as fast as I could make it.

krlmlr added a commit that referenced this issue Jan 20, 2018

Merge tag 'v1.4.1.9001'
- `enframe(NULL)` now returns the same as `enframe(logical())` (#352).
- `tbl[1, , drop = TRUE]` now behaves identically to data frames (#367).
- The `tibble.width` option is honored again (#369).
- Faster printing of very wide tibbles (#360).
- Update vignette to match changes in 1.4.1 (#368, @bgreenwell).
- Don't rely on `ncol()` for `glimpse()`, only query `nrow()` and `head()`.
- Return input for zero-column data frames.
- Add test for `glimpse()` with unknown rows (#366, @kevinykuo).
- Faster construction and subsetting for tibbles (#353).
- `tribble()` now ignores trailing commas (#342, @LaDilettante).
- Fix error message when accessing columns using a logical index vector (#337, @mundl).

krlmlr added a commit that referenced this issue Jan 23, 2018

Merge tag 'v1.4.2'
Bug fixes
---------

- Fix OS X builds.
- The `tibble.width` option is honored again (#369).
- `tbl[1, , drop = TRUE]` now behaves identically to data frames (#367).
- Fix error message when accessing columns using a logical index vector (#337, @mundl).
- `glimpse()` returns its input for zero-column data frames.

Features
--------

- `enframe(NULL)` now returns the same as `enframe(logical())` (#352).
- `tribble()` now ignores trailing commas (#342, @LaDilettante).
- Updated vignettes and website documentation.

Performance
-----------

- Faster printing of very wide tibbles (#360).
- Faster construction and subsetting for tibbles (#353).
- Only call `nrow()` and `head()` in `glimpse()`, not `ncol()`.