using sf::st_as_text in data.frame is very slow #947

Closed
harryprince opened this issue Jan 11, 2019 · 6 comments

@harryprince

harryprince commented Jan 11, 2019

I am currently working on district building polygons.
I want to read a .geojson file and convert it to .tsv for Hive.

library(dplyr)
library(data.table)
library(sf)
building <- sf::st_read("~/buildings_wgs84_2016.geojson")
building  %>% 
  data.table::as.data.table() -> d

This step takes almost 30 minutes with 1M rows of data:

d[,geometry:=sf::st_as_text(geometry),]

data.table is usually faster than dplyr in this scenario.

d %>% readr::write_tsv("tmp.tsv")
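
One possible way to sidestep the st_as_text() call in R, sketched here with made-up data, is to let GDAL's CSV driver write the WKT itself through sf::st_write(). The three points and the temporary file name are stand-ins for the real building layer, GEOMETRY=AS_WKT and SEPARATOR=TAB are layer creation options of the GDAL CSV driver, and whether this is actually faster on 1M rows is untested.

```r
library(sf)

# Hypothetical stand-in for the building layer: three points with an id column
pts <- st_as_sf(
  data.frame(id = 1:3, x = c(0, 1, 2), y = c(0, 1, 2)),
  coords = c("x", "y"),
  crs = 4326
)

# Write to a .csv path so the CSV driver produces a single file;
# the file can be renamed to .tsv afterwards if Hive expects that extension
out <- tempfile(fileext = ".csv")

# GEOMETRY=AS_WKT writes the geometry as a WKT column,
# SEPARATOR=TAB makes the output tab-delimited
st_write(
  pts,
  dsn = out,
  driver = "CSV",
  layer_options = c("GEOMETRY=AS_WKT", "SEPARATOR=TAB"),
  quiet = TRUE
)

lines <- readLines(out)
```

The geometry never becomes an R character vector in this variant, so the cost moves from st_as_text() to the GDAL writer.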
@Robinlovelace
Contributor

Do you have a reproducible example? That could help benchmarking, testing and, in the context of this issue, knowing if it is 'solved'.

@edzer
Member

edzer commented Jan 11, 2019

See also #800

@harryprince
Author

harryprince commented Jan 11, 2019

@Robinlovelace
I can't share this .geojson because it is private company data. I have tried many .geojson files downloaded from OSM; when the table has more than 1,000,000 rows, the conversion is very slow and takes about 30 minutes. Apart from the .geojson file, everything else is reproducible.

@Robinlovelace
Contributor

Reproducible example with a smaller dataset (a 13 MB geojson file):

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(sf))
f2 = "promenade-all.geojson"
u = "https://github.com/spnethack/spnethack/releases/download/0.1/promenade-all.geojson"
download.file(u, destfile = f2)

system.time({b = read_sf(f2)})
#>    user  system elapsed 
#>   3.474   0.102   3.580
system.time({d = b  %>% data.table::as.data.table()})
#>    user  system elapsed 
#>   0.027   0.000   0.027
system.time(d[,geometry:=sf::st_as_text(geometry),])
#>    user  system elapsed 
#>   6.422   0.004   6.427
system.time(d %>% readr::write_tsv("tmp.tsv"))
#>    user  system elapsed 
#>   1.038   0.040   1.077

Created on 2019-01-12 by the reprex package (v0.2.1)

@Robinlovelace
Contributor

Note, there's a bit on benchmarking here: https://geocompr.github.io/geocompkg/articles/benchmark.html

Hoping that issues/conversations like this will motivate us to add more reproducible benchmarks to that document.

@harryprince
Author

harryprince commented Jan 13, 2019

Thanks for your proposal. My file is around 400 MB, which is much larger than the 13 MB file in your link. When I take only the first 10,000 records it works fine too, but it fails at larger scale. My guess is that sf::st_as_text runs in a single thread, so the problem does not show up when the data is small.
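
A minimal sketch of how the single-thread guess could be tested: split the geometry column into chunks, convert each chunk with st_as_text() in a separate process via parallel::mclapply(), and stitch the results back together. The random points, the chunk count and the core count are assumptions made for illustration, not the real building data, and this is not a confirmed fix for the slowdown.

```r
library(sf)
library(parallel)

# Hypothetical stand-in data: 10,000 random points instead of the private polygons
set.seed(1)
pts <- st_as_sf(
  data.frame(x = runif(10000), y = runif(10000)),
  coords = c("x", "y"),
  crs = 4326
)
geom <- st_geometry(pts)

# Split the row indices into contiguous chunks (the chunk count is arbitrary)
n_chunks <- 4
chunk_id <- cut(seq_along(geom), breaks = n_chunks, labels = FALSE)
idx <- split(seq_along(geom), chunk_id)

# mclapply() relies on forking, so mc.cores must stay at 1 on Windows
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Convert each chunk to WKT in its own worker, then flatten in the original order
wkt_chunks <- mclapply(idx, function(i) st_as_text(geom[i]), mc.cores = cores)
wkt <- unlist(wkt_chunks, use.names = FALSE)
```

Wrapping both this and a plain st_as_text(geom) call in system.time() on a larger file would show whether the conversion is CPU-bound per geometry or whether something else grows with the row count.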

etiennebr added a commit to etiennebr/sf that referenced this issue Jan 19, 2019
etiennebr added and removed the tidy-dev-day label (Tidyverse Developer Day, rstd.io/tidy-dev-day) Jan 19, 2019
edzer closed this as completed Mar 15, 2019

4 participants