Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate printing #747

Closed
wants to merge 2 commits into from
Closed

Conversation

etiennebr
Copy link
Member

@edzer, this is a first attempt to accelerate printing of geometries. For me, it really boosted the speed. Using the example from #703, a tbl_df (from read_sf) printed in around 30s, now below 3s. A data.frame prints also way faster.

There were not many tests for printing, so I don't know if it actually breaks something, but it seems fine from what I could test.

close #703, related #713

- accelerate wkt writing
- accelerate `print` for `sf` objects
- use the ellipsis rather than three dots

close r-spatial#703
@edzer
Copy link
Member

edzer commented May 15, 2018

I like the ellipsis, but it now does print a whole lot of digits:

> nc$geom
Geometry set for 100 features 
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -84.3 ymin: 33.9 xmax: -75.5 ymax: 36.6
epsg (SRID):    4267
proj4string:    +proj=longlat +datum=NAD27 +no_defs
First 5 geometries:
MULTIPOLYGON (((-81.4727554321289 36.234355926513…
MULTIPOLYGON (((-81.2398910522461 36.365364074707…
MULTIPOLYGON (((-80.4563446044922 36.242557525634…
MULTIPOLYGON (((-76.0089721679688 36.319595336914…
MULTIPOLYGON (((-77.2176666259766 36.240982055664

and also no longer respects e.g.

options(digits = 3)

to reduce those. Given that there's hardly anything we want to print, giving up digits control is a pretty big loss; there must be a simpler way to gain these 30 seconds.

@etiennebr
Copy link
Member Author

Thanks for the feedback, I agree. I'll look into it.

@JanMarvin
Copy link

@etiennebr maybe something along the lines of this: https://github.com/JanMarvin/sf/commit/4a595ee9282a02f7e7c94d6f28d7fda290ea4842

Though in general I'm in favor of ignoring the values (as data.table would do) or convert only a few selected values. Converting nested lists within lists will never be fast. Just came across this pull request as I was bugged by the slow print a little while ago

@etiennebr
Copy link
Member Author

@JanMarvin thanks for the suggestion. My issue with format is that it inserts spaces before the string (the reason for the sub in the current version).

      MACROAREA           geom
1    CENTRE MULTIPOLYGON (((( 625766 4754…
                             ^

Which is why I was trying to use as.character, but it does print all digits. Also looking at the doc for option digits: controls the number of digits to print when printing numeric values. It is a suggestion only. (emphasis mine). So it will be hard to reproduce the behavior, but we could rather use the precision attribute from the sf object with a default when it is missing.

I just did a quick test with sprintf and it is surprisingly fast (for the same example it went to 1.5 sec).

fmt = function(x, ..., digits = 3) {
	sprintf(paste0("%.", digits, "f"), x)
}

I was going for a quick fix for now, but maybe it's more complex than I thought. I think once we have the right behavior it will be worth using C++ for the bottlenecks.

@JanMarvin
Copy link

JanMarvin commented May 21, 2018

For the whitespace we could add trim=TRUE to format(). Still the example for Italy will take quite some time. I assume a check is required, how many values shall be printed and formated. Formating every value is costly and imho not required.
[edit] Nevermind. The benefit is marginally.
[edit2] If only the first 3-4 values of a multipolygon are converted (e.g. values that are actually printed out by R) a print call will only take ~0.2 seconds for the Italy data set (~20 seconds on my computer). I have added both to my branch.

@edzer
Copy link
Member

edzer commented May 27, 2018

I also think that printing the first few coordinates only should be the way to solve this issue; I just didn't come up with a bug free way of doing this, so went back to printing all.

@JanMarvin
Copy link

Forgot to push a follow up commit, which will fetches the first three coordinates for printing. Should be easy to integrate an if clause to check for length(x) to select values accordingly to ones needs and add e.g. ... to indicate more values. Maybe @etiennebr can have another look since I just picked up his idea. All I can say is, right now it is fast enough for me.

edzer added a commit that referenced this pull request Jun 23, 2018
etiennebr added a commit to etiennebr/sf that referenced this pull request Jan 19, 2019
@etiennebr etiennebr mentioned this pull request Jan 19, 2019
@etiennebr
Copy link
Member Author

Closing in favor of #957

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

show()ing large sf objects can be slow
3 participants