New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
writing speed sqlite and GPKG compared to shapefile #470
Comments
|
On Fedora, CRAN sf and rgdal, same GDAL, no difference for N=1000 ESRI Shapefile ten times faster than GPKG and SQLite (0.008s against 0.08s). For n=25000, writeOGR 0.12s, st_write 0.16 for ESRI Shapefile, writeOGR 1.46s for GPKG and SQLite, st_write 1.58 for GPKG and 1.25 for SQLite. Still Shapefile about an order of magnitude faster. Could someone check whether it is the number of features, or the number of fields - does the Shapefile driver slow down more than SQLite and GPKG as the field count increases? |
@tim-salabim Could the fact that I only see a single order of magnitude slowing from shp to gpkg be related to dropping -g when installing GDAL from source:
|
@rsbivand I am not sure. I am on
With this code library(sf)
library(rgdal)
library(RSQLite)
library(rbenchmark)
n<- 1000
d <-data.frame(a=1:n, X=rnorm(n,1,1), Y=rnorm(n,1,1))
mp1 <- st_as_sf(d, coords=c("X","Y"))
mp1_sp = as(mp1, "Spatial")
benchmark(
st_write(mp1, dsn = 'software/testing/gpkg/data1.shp', delete_dsn = TRUE),
st_write(mp1, dsn = 'software/testing/gpkg/test.sqlite', delete_dsn = TRUE),
st_write(mp1, "software/testing/gpkg/data1.gpkg", delete_dsn = TRUE),
writeOGR(mp1_sp, "software/testing/gpkg/data2.gpkg",
driver = "GPKG", layer = "data2", overwrite_layer = TRUE,
delete_dsn = TRUE),
replications = 10
) I get similar timings to what you report (at least for the diff between replications elapsed relative user.self sys.self user.child sys.child
1 10 0.088 1.000 0.084 0.005 0 0
2 10 49.609 563.739 1.380 5.150 0 0
3 10 0.952 10.818 0.418 0.522 0 0
4 10 0.849 9.648 0.353 0.491 0 0 With replications elapsed relative user.self sys.self user.child sys.child
1 10 1.443 1.000 0.999 0.441 0 0
3 10 1227.511 850.666 31.640 119.604 0 0
2 10 1236.376 856.809 31.436 121.845 0 0
4 10 19.230 13.326 7.368 11.638 0 0
5 10 17.161 11.893 6.070 11.007 0 0 I did not test with increasing number of fields. |
I have just tested on a newer computer with Win10 64-Bit. It shows similar behaviour than Win7. user.self sys.self elapsed
t_shp 0.02 0.01 0.03
t_sqlite 0.31 2.47 11.97
t_gpkg 0.30 2.12 17.59 |
Here is a bit more testing on my Win7 laptop. nfeat<- 1000
nfields <- 100
d1 <-data.frame(matrix(rnorm(nfeat*(nfields+2),1,1),nrow=nfeat))
mp1 <- st_as_sf(d1, coords=c("X1","X2"))
nfields <- 1
d2 <-data.frame(matrix(rnorm(nfeat*(nfields+2),1,1),nrow=nfeat))
mp2 <- st_as_sf(d2, coords=c("X1","X2"))
mp3 <- st_sfc(st_multipoint(matrix(rnorm(nfeat*2,1,1),ncol=2)))
benchmark(
shp_mp1 = st_write(mp1, dsn = file.path(td, 'data1.shp'), driver = 'ESRI Shapefile', delete_dsn = TRUE),
shp_mp2 = st_write(mp2, dsn = file.path(td, 'data2.shp'), driver = 'ESRI Shapefile', delete_dsn = TRUE),
shp_mp3 = st_write(mp3, dsn = file.path(td, 'data3.shp'), driver = 'ESRI Shapefile', delete_dsn = TRUE),
sqlite_mp1 = st_write(mp1, dsn = file.path(td, 'test.sqlite'), layer = 'data1', driver = 'SQLite', delete_dsn = TRUE),
sqlite_mp2 = st_write(mp2, dsn = file.path(td, 'test2.sqlite'), layer = 'data1', driver = 'SQLite', delete_dsn = TRUE),
sqlite_mp3 = st_write(mp3, dsn = file.path(td, 'test3.sqlite'), layer = 'data1', driver = 'SQLite', delete_dsn = TRUE),
gpkg_mp1 = st_write(mp1, file.path(td, "data1.gpkg"), delete_dsn = TRUE),
gpkg_mp2 = st_write(mp2, file.path(td, "data2.gpkg"), delete_dsn = TRUE),
gpkg_mp3 = st_write(mp3, file.path(td, "data3.gpkg"), delete_dsn = TRUE),
ogr_mp1 = writeOGR(as(mp1, "Spatial"), "data_ogr1.gpkg", driver = "GPKG", layer = "data1",
overwrite_layer = TRUE, delete_dsn = TRUE),
ogr_mp2 = writeOGR(as(mp2, "Spatial"), "data_ogr2.gpkg", driver = "GPKG", layer = "data2",
overwrite_layer = TRUE, delete_dsn = TRUE),
replications=5
) It appears that the number of fields is not an issue for gpkg and sqlite but slows down shapefile. Writing a multipoint feature is very fast so it is rather the number of features that is critical. In addition, similar speed with st_write and writeOGR suggests a GDAL issue. test replications elapsed relative user.self sys.self user.child sys.child
7 gpkg_mp1 5 31.48 242.154 2.91 4.63 NA NA
8 gpkg_mp2 5 28.80 221.538 0.81 4.79 NA NA
9 gpkg_mp3 5 0.72 5.538 0.11 0.08 NA NA
10 ogr_mp1 5 29.80 229.231 3.15 4.37 NA NA
11 ogr_mp2 5 25.24 194.154 0.49 4.11 NA NA
1 shp_mp1 5 7.31 56.231 6.82 0.19 NA NA
2 shp_mp2 5 0.33 2.538 0.12 0.06 NA NA
3 shp_mp3 5 0.13 1.000 0.03 0.03 NA NA
4 sqlite_mp1 5 80.58 619.846 3.07 8.45 NA NA
5 sqlite_mp2 5 110.79 852.231 0.95 12.11 NA NA
6 sqlite_mp3 5 0.78 6.000 0.06 0.16 NA NA |
I have tested the transfer and appending of shp to gpkg using ogr2ogr and this works fast. Files are generated as in the post above. system.time(system("C:/Tools/gdal/apps/ogr2ogr.exe -f GPKG C:/Temp/Data1.gpkg C:/Temp/Data1.shp"))
User System verstrichen
0.02 0.32 0.98
system.time(system("C:/Tools/gdal/apps/ogr2ogr.exe -f GPKG C:/Temp/Data1.gpkg C:/Temp/Data2.shp -append -update"))
User System verstrichen
0.01 0.36 0.90
system("C:/Tools/release-1800-gdal-2-2-1-mapserver-7-0-6/bin/gdal/apps/ogrinfo.exe C:/Temp/Data1.gpkg")
INFO: Open of `C:/Temp/Data1.gpkg'
using driver `GPKG' successful.
1: Data1 (Point)
2: Data2 (Point) Does this suggest some bottleneck in writeOGR or OGR_write? Should the issue be transferred to rgdal? I have not yet tested Sqlite using ogr2ogr but could use this as a workaround for gpkg. |
Timings on Fedora 26 on a 6 year old laptop: There are so large variations between OS that no intervention is obvious here. No point i doing special tratment of drivers on the R side - it's the GDAL drivers that count. If all drivers were slow, doing something in sf or rgdal might be justified. I'm assuming sf and rgdal are CRAN binaries (built from the same sources but using other compilers, and sf/rgdal built static for CRAN) - have you tried to install from source built against alternative GDAL binaries? Yes, rgdal::writeOGR() is slower than sf::st_write() (read is too), but that was part of the reason for re-implementing the GDAL vector interface. rgdal::writeOGR() is also weighed down by handing special cases, and is not written for efficiency, just to work at all was the main goal. |
Sorry, I have not answered your earlier question. Yes, it is all CRAN binaries as we use restricted machines at work on which compiling is not possible. I'll have a try on my private laptop. |
I have now installed sf from source on Windows 7 64-bit with the effect of similarly slow writing of gpkg and R crashing with the SQlite driver. user.self sys.self elapsed
t_shp 0.03 0.02 0.06
t_gpkg 0.47 1.96 3.74 I have an operational workaround using ogr2ogr of a temporary shp into gpkg, which satisfies my needs for the moment. Since this issue has nothing to do with sf, I am going to close it here. Many thanks for your support. |
I've asked on the gdal-dev mailing list about this problem - https://lists.osgeo.org/pipermail/gdal-dev/2017-October/047383.html. |
Time to redo your timings. Thanks Even @rouault ! |
Timings for library(sf)
library(rgdal)
library(rbenchmark)
n = 1000
d = data.frame(a = 1:n, X = rnorm(n,1,1), Y = rnorm(n,1,1))
mp1 = st_as_sf(d, coords = c("X","Y"))
mp1_sp = as(mp1, "Spatial")
benchmark(
sf_shp = st_write(mp1, dsn = 'software/testing/gpkg/data1.shp', delete_dsn = TRUE),
sf_sqlite = st_write(mp1, dsn = 'software/testing/gpkg/test.sqlite', delete_dsn = TRUE),
sf_gpkg = st_write(mp1, "software/testing/gpkg/data1.gpkg", delete_dsn = TRUE),
rgdal_gpkg = writeOGR(mp1_sp, "software/testing/gpkg/data2.gpkg",
driver = "GPKG", layer = "data2", overwrite_layer = TRUE,
delete_dsn = TRUE),
replications = 10
)
test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg 10 0.598 7.973 0.238 0.357 0 0
3 sf_gpkg 10 0.603 8.040 0.244 0.360 0 0
1 sf_shp 10 0.075 1.000 0.059 0.017 0 0
2 sf_sqlite 10 48.894 651.920 1.495 4.426 0 0 |
I'm seeing
Are you sure you recompiled everything of sf? For the timings of @mkschneider I now see test replications elapsed relative user.self sys.self user.child sys.child
7 gpkg_mp1 5 0.449 40.818 0.436 0.012 0 0
8 gpkg_mp2 5 0.116 10.545 0.116 0.000 0 0
9 gpkg_mp3 5 0.055 5.000 0.056 0.000 0 0
10 ogr_mp1 5 0.705 64.091 0.516 0.188 0 0
11 ogr_mp2 5 0.403 36.636 0.208 0.192 0 0
1 shp_mp1 5 0.666 60.545 0.656 0.012 0 0
2 shp_mp2 5 0.030 2.727 0.024 0.004 0 0
3 shp_mp3 5 0.011 1.000 0.008 0.004 0 0
4 sqlite_mp1 5 0.673 61.182 0.356 0.012 0 0
5 sqlite_mp2 5 0.309 28.091 0.028 0.008 0 0
6 sqlite_mp3 5 0.281 25.545 0.024 0.000 0 0 which indicates gpkg is faster than shp in the case of many attributes (mp1). |
You're right, I now see test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg 10 0.591 6.156 0.284 0.305 0 0
3 sf_gpkg 10 0.206 2.146 0.200 0.004 0 0
1 sf_shp 10 0.096 1.000 0.081 0.016 0 0
2 sf_sqlite 10 0.285 2.969 0.108 0.009 0 0 which is really awesome! |
In the case of somewhat larger polygon datasets, e.g. library(sf)
library(rgdal)
library(rbenchmark)
options(width = 100)
demo(nc, ask = FALSE, echo = FALSE)
mp1 = rbind(nc, nc ,nc, nc, nc, nc, nc, nc, nc, nc) # 1000 features
mp1_sp = as(mp1, "Spatial")
benchmark(
sf_shp = st_write(mp1, dsn = 'data1.shp', delete_dsn = TRUE, quiet = TRUE),
sf_sqlite = st_write(mp1, dsn = 'test.sqlite', delete_dsn = TRUE, quiet = TRUE),
sf_gpkg = st_write(mp1, "data1.gpkg", delete_dsn = TRUE, quiet = TRUE),
rgdal_gpkg = writeOGR(mp1_sp, "data2.gpkg",
driver = "GPKG", layer = "data2", overwrite_layer = TRUE,
delete_dsn = TRUE),
replications = 10
) we see that gpkg is equally fast as shapefile:
|
I am looking forward to trying this at work where we work with large layers in the order of 10 mio features of all geometry types all the time. If I get the chance/time I will report back some timings. If the write speeds of gpkg turn out to be similar to shp, then I'm gonna make a strong case for moving our analysis portfolio to gpkg. |
For the record using Edzer's test above, rgdal 1.2.14, sf 0.5-5:
and forthcoming rgdal 1.2.15 (using TestCapabilities):
|
Good to see that rgdal is now also up to speed! |
Ok, so here's a benchmark with large road network LINESTRING data. Seems that gpkg and sqlite eat the shapefile for breakfast when data gets larger: > dim(mp1)
[1] 483479 51
test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg 5 1891.39 14.940 257.23 1464.25 NA NA
3 sf_gpkg 5 226.23 1.787 194.52 31.38 NA NA
1 sf_shp 5 1272.97 10.055 1224.75 46.47 NA NA
2 sf_sqlite 5 126.60 1.000 121.63 1.14 NA NA An added benefit is that there is no data loss as these warnings only occur when writing to shp. > warnings()
Warnmeldungen:
1: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value 12760003401453 of field ID of feature 0 not successfully written. Possibly due to too larger number with respect to field width
2: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value 12760019135320 of field F_JNCTID of feature 0 not successfully written. Possibly due to too larger number with respect to field width
3: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), ... :
GDAL Message 1: Value 12760018932615 of field T_JNCTID of feature 0 not successfully written. Possibly due to too larger number with respect to field width
... and so forth... |
@tim-salabim : which rgdal version was this? Might it be possible to check out the source from R-Forge (currently they don't have GDAL as I asked them to upgrade to GDAL 2, so no tarball there) and try that? If it's easier, I can put a tarball (or Windows binary from win-builder) on my server. |
This was rgdal 1.2.8. |
Sorry with ZIP i mean a Windows binary |
Yes, I'm waiting for winbuilder to respond; will post the link if it passes check. |
Here you have the check output and Windows binary of 1.2.15. |
Sweet, I'll give it a spin tomorrow at work |
@rsbivand sorry for the delay! Here you go (with rgdal: version: 1.2-15, (SVN revision 686)): test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg 5 223.84 1.786 191.95 31.31 NA NA
3 sf_gpkg 5 224.76 1.794 195.23 29.44 NA NA
1 sf_shp 5 1262.15 10.073 1215.30 45.63 NA NA
2 sf_sqlite 5 125.30 1.000 122.05 1.09 NA NA On par with sf! |
Thanks very much, very useful! I'll move to submission to CRAN ... |
Writing of datasets with many POINT features to SpatialLite-DB and Geopackage is several orders of magnitude slower than to shapefile for several OS, namely Win7 32-bit and 64-bit and Ubuntu, especially if multiple layers are written. Issue is not present in Fedora and also not for datasets with a small number of MULTIPOINT features (albeit these may contain many points).
See https://www.mail-archive.com/r-sig-geo@r-project.org/msg15669.html for details. I am a bit lost with ideas on how we can track this problem down in the computing system.
The text was updated successfully, but these errors were encountered: