using sf::st_as_text in data.frame is very slow #947
Do you have a reproducible example? That would help with benchmarking, testing and, in the context of this issue, knowing whether it is 'solved'.
See also #800
@Robinlovelace |
Reproducible example with a smaller dataset (a 13 MB geojson file):

```r
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(sf))
f2 = "promenade-all.geojson"
u = "https://github.com/spnethack/spnethack/releases/download/0.1/promenade-all.geojson"
download.file(u, destfile = f2)
system.time({b = read_sf(f2)})
#>    user  system elapsed
#>   3.474   0.102   3.580
system.time({d = b %>% data.table::as.data.table()})
#>    user  system elapsed
#>   0.027   0.000   0.027
system.time(d[, geometry := sf::st_as_text(geometry)])
#>    user  system elapsed
#>   6.422   0.004   6.427
system.time(d %>% readr::write_tsv("tmp.tsv"))
#>    user  system elapsed
#>   1.038   0.040   1.077
```

Created on 2019-01-12 by the reprex package (v0.2.1)
Note, there's a bit on benchmarking here: https://geocompr.github.io/geocompkg/articles/benchmark.html Hoping that issues/conversations like this will motivate us to add more reproducible benchmarks to that document.
Thanks for your proposal. My file is around 400 MB, which is much greater than the 13 MB file in your link. When I take only the head (10,000 records) it works fine too, but it fails at larger scale. I guess the reason is that sf::st_as_text runs in only a single thread, so when the data is not big, the problem doesn't show up.
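Since converting each geometry to WKT is independent of the others, one workaround is to split the geometry column into chunks and convert them in parallel. This is only a sketch under assumptions not in the thread: the object name `b`, the chunk/core counts, and the use of `parallel::mclapply` (fork-based, so Linux/macOS only; on Windows use `parallel::parLapply` with a cluster instead):

```r
library(sf)
library(parallel)

# assumption: `b` is an sf object, e.g. b = read_sf("promenade-all.geojson")
geom = st_geometry(b)

n_cores = 4  # assumption: adjust to your machine
# contiguous, ordered chunks of row indices, one group per core
chunks = split(seq_along(geom), cut(seq_along(geom), n_cores))

# convert each chunk to WKT in a forked worker, then reassemble;
# chunks are contiguous and ascending, so unlist() preserves row order
wkt = unlist(mclapply(chunks, function(idx) st_as_text(geom[idx]),
                      mc.cores = n_cores))
```

Whether this helps at the 400 MB scale is not benchmarked here; forking also copies memory, which may matter for a file that large.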
Currently, I am working on building district polygons.
I want to read a .geojson file and convert it into .tsv for Hive.
This step takes almost 30 minutes with 1M rows of data.
data.table is commonly faster than dplyr in this scenario.
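For the geojson-to-tsv-for-Hive workflow above, an alternative worth trying is to skip the in-R `st_as_text()` step entirely and let GDAL's CSV driver serialize the geometry as WKT on write. A sketch, again assuming an sf object named `b`; `GEOMETRY=AS_WKT` and `SEPARATOR=TAB` are GDAL CSV layer creation options, and whether this is actually faster for a 400 MB file is not benchmarked in this thread:

```r
library(sf)

# assumption: `b` is an sf object, e.g. b = read_sf("promenade-all.geojson")
# driver = "CSV" is needed because the .tsv extension is not auto-detected;
# GEOMETRY=AS_WKT writes the geometry column as WKT text,
# SEPARATOR=TAB makes the output tab-delimited for Hive
st_write(b, "tmp.tsv", driver = "CSV",
         layer_options = c("GEOMETRY=AS_WKT", "SEPARATOR=TAB"))
```

This keeps the whole conversion inside GDAL's C++ code path instead of round-tripping through data.table and readr.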