Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

writing speed sqlite and GPKG compared to shapefile #470

Closed
mkschneider opened this issue Aug 25, 2017 · 28 comments
Closed

writing speed sqlite and GPKG compared to shapefile #470

mkschneider opened this issue Aug 25, 2017 · 28 comments

Comments

@mkschneider
Copy link

Writing of datasets with many POINT features to SpatialLite-DB and Geopackage is several orders of magnitude slower than to shapefile for several OS, namely Win7 32-bit and 64-bit and Ubuntu, especially if multiple layers are written. Issue is not present in Fedora and also not for datasets with a small number of MULTIPOINT features (albeit these may contain many points).
See https://www.mail-archive.com/r-sig-geo@r-project.org/msg15669.html for details. I am a bit lost with ideas on how we can track this problem down in the computing system.

@edzer
Copy link
Member

edzer commented Aug 25, 2017

sf doesn't treat writing ESRI shapefiles any different from writing gpkg or spatialite files; I assume therefore that this is an issue in the gdal library or the sqlite library gdal uses. If you want to be sure, check whether you observe the same effect in rgdal::writeOGR.

@rsbivand
Copy link
Member

On Fedora, CRAN sf and rgdal, same GDAL, no difference for N=1000 ESRI Shapefile ten times faster than GPKG and SQLite (0.008s against 0.08s). For n=25000, writeOGR 0.12s, st_write 0.16 for ESRI Shapefile, writeOGR 1.46s for GPKG and SQLite, st_write 1.58 for GPKG and 1.25 for SQLite. Still Shapefile about an order of magnitude faster.

Could someone check whether it is the number of features, or the number of fields - does the Shapefile driver slow down more than SQLite and GPKG as the field count increases?

@rsbivand
Copy link
Member

@tim-salabim Could the fact that I only see a single order of magnitude slowing from shp to gpkg be related to dropping -g when installing GDAL from source:

$ ls -l /usr/local/lib/libgdal.so.20.3.0
-rwxr-xr-x. 1 root root 19938984 Jul 23 16:14 /usr/local/lib/libgdal.so.20.3.0

@tim-salabim
Copy link
Member

tim-salabim commented Aug 26, 2017

@rsbivand I am not sure. I am on ubuntu mate 16.04 and use http://ppa.launchpad.net/ubuntugis/ubuntugis-unstable/ubuntuxenial/main to install GDAL via apt-get install. I see

root@ede-N130BU:/home/ede# ls -l /usr/lib/libgdal.so.20.3.0
-rw-r--r-- 1 root root 16260512 Jul 25 22:47 /usr/lib/libgdal.so.20.3.0

With this code

library(sf)
library(rgdal)
library(RSQLite)
library(rbenchmark)

n<- 1000
d <-data.frame(a=1:n, X=rnorm(n,1,1), Y=rnorm(n,1,1))
mp1 <- st_as_sf(d, coords=c("X","Y"))
mp1_sp = as(mp1, "Spatial")

benchmark(
  st_write(mp1, dsn = 'software/testing/gpkg/data1.shp', delete_dsn = TRUE),
  st_write(mp1, dsn = 'software/testing/gpkg/test.sqlite', delete_dsn = TRUE),
  st_write(mp1, "software/testing/gpkg/data1.gpkg", delete_dsn = TRUE),
  writeOGR(mp1_sp, "software/testing/gpkg/data2.gpkg", 
           driver = "GPKG", layer = "data2", overwrite_layer = TRUE,
           delete_dsn = TRUE),
  replications = 10
)

I get similar timings to what you report (at least for the diff between shp and gpkg):

  replications elapsed relative user.self sys.self user.child sys.child
1           10   0.088    1.000     0.084    0.005          0         0
2           10  49.609  563.739     1.380    5.150          0         0
3           10   0.952   10.818     0.418    0.522          0         0
4           10   0.849    9.648     0.353    0.491          0         0

With n = 25000 especially sqlite writing seems to explode.

  replications  elapsed relative user.self sys.self user.child sys.child
1           10    1.443    1.000     0.999    0.441          0         0
3           10 1227.511  850.666    31.640  119.604          0         0
2           10 1236.376  856.809    31.436  121.845          0         0
4           10   19.230   13.326     7.368   11.638          0         0
5           10   17.161   11.893     6.070   11.007          0         0

I did not test with increasing number of fields.

@mkschneider
Copy link
Author

mkschneider commented Aug 27, 2017

I have just tested on a newer computer with Win10 64-Bit. It shows similar behaviour than Win7.

         user.self sys.self elapsed
t_shp         0.02     0.01    0.03
t_sqlite      0.31     2.47   11.97
t_gpkg        0.30     2.12   17.59

@mkschneider
Copy link
Author

mkschneider commented Aug 28, 2017

Here is a bit more testing on my Win7 laptop.

nfeat<- 1000
nfields <- 100
d1 <-data.frame(matrix(rnorm(nfeat*(nfields+2),1,1),nrow=nfeat))
mp1 <- st_as_sf(d1, coords=c("X1","X2"))
nfields <- 1
d2 <-data.frame(matrix(rnorm(nfeat*(nfields+2),1,1),nrow=nfeat))
mp2 <- st_as_sf(d2, coords=c("X1","X2"))
mp3 <- st_sfc(st_multipoint(matrix(rnorm(nfeat*2,1,1),ncol=2)))

benchmark(
 shp_mp1 = st_write(mp1, dsn = file.path(td, 'data1.shp'), driver = 'ESRI Shapefile', delete_dsn = TRUE),
 shp_mp2 = st_write(mp2, dsn = file.path(td, 'data2.shp'), driver = 'ESRI Shapefile', delete_dsn = TRUE),
 shp_mp3 = st_write(mp3, dsn = file.path(td, 'data3.shp'), driver = 'ESRI Shapefile', delete_dsn = TRUE),
 sqlite_mp1 = st_write(mp1, dsn = file.path(td, 'test.sqlite'), layer = 'data1', driver = 'SQLite', delete_dsn = TRUE),
 sqlite_mp2 = st_write(mp2, dsn = file.path(td, 'test2.sqlite'), layer = 'data1', driver = 'SQLite', delete_dsn = TRUE),
 sqlite_mp3 = st_write(mp3, dsn = file.path(td, 'test3.sqlite'), layer = 'data1', driver = 'SQLite', delete_dsn = TRUE),
 gpkg_mp1 = st_write(mp1, file.path(td, "data1.gpkg"), delete_dsn = TRUE),
 gpkg_mp2 = st_write(mp2, file.path(td, "data2.gpkg"), delete_dsn = TRUE),
 gpkg_mp3 = st_write(mp3, file.path(td, "data3.gpkg"), delete_dsn = TRUE),
 ogr_mp1 = writeOGR(as(mp1, "Spatial"), "data_ogr1.gpkg", driver = "GPKG", layer = "data1", 
overwrite_layer = TRUE, delete_dsn = TRUE),
 ogr_mp2 = writeOGR(as(mp2, "Spatial"), "data_ogr2.gpkg", driver = "GPKG", layer = "data2", 
overwrite_layer = TRUE, delete_dsn = TRUE),
replications=5
)

It appears that the number of fields is not an issue for gpkg and sqlite but slows down shapefile. Writing a multipoint feature is very fast so it is rather the number of features that is critical. In addition, similar speed with st_write and writeOGR suggests a GDAL issue.

         test replications elapsed relative user.self sys.self user.child sys.child
7    gpkg_mp1            5   31.48  242.154      2.91     4.63         NA        NA
8    gpkg_mp2            5   28.80  221.538      0.81     4.79         NA        NA
9    gpkg_mp3            5    0.72    5.538      0.11     0.08         NA        NA
10    ogr_mp1            5   29.80  229.231      3.15     4.37         NA        NA
11    ogr_mp2            5   25.24  194.154      0.49     4.11         NA        NA
1     shp_mp1            5    7.31   56.231      6.82     0.19         NA        NA
2     shp_mp2            5    0.33    2.538      0.12     0.06         NA        NA
3     shp_mp3            5    0.13    1.000      0.03     0.03         NA        NA
4  sqlite_mp1            5   80.58  619.846      3.07     8.45         NA        NA
5  sqlite_mp2            5  110.79  852.231      0.95    12.11         NA        NA
6  sqlite_mp3            5    0.78    6.000      0.06     0.16         NA        NA

@mkschneider
Copy link
Author

I have tested the transfer and appending of shp to gpkg using ogr2ogr and this works fast. Files are generated as in the post above.

system.time(system("C:/Tools/gdal/apps/ogr2ogr.exe -f GPKG C:/Temp/Data1.gpkg C:/Temp/Data1.shp"))
       User      System verstrichen 
       0.02        0.32        0.98 
system.time(system("C:/Tools/gdal/apps/ogr2ogr.exe -f GPKG C:/Temp/Data1.gpkg C:/Temp/Data2.shp -append -update"))
       User      System verstrichen 
       0.01        0.36        0.90 
system("C:/Tools/release-1800-gdal-2-2-1-mapserver-7-0-6/bin/gdal/apps/ogrinfo.exe C:/Temp/Data1.gpkg")
INFO: Open of `C:/Temp/Data1.gpkg'
      using driver `GPKG' successful.
1: Data1 (Point)
2: Data2 (Point)

Does this suggest some bottleneck in writeOGR or OGR_write? Should the issue be transferred to rgdal?

I have not yet tested Sqlite using ogr2ogr but could use this as a workaround for gpkg.

@rsbivand
Copy link
Member

rsbivand commented Aug 29, 2017

Timings on Fedora 26 on a 6 year old laptop:

write_times.txt

There are so large variations between OS that no intervention is obvious here. No point i doing special tratment of drivers on the R side - it's the GDAL drivers that count. If all drivers were slow, doing something in sf or rgdal might be justified. I'm assuming sf and rgdal are CRAN binaries (built from the same sources but using other compilers, and sf/rgdal built static for CRAN) - have you tried to install from source built against alternative GDAL binaries?

Yes, rgdal::writeOGR() is slower than sf::st_write() (read is too), but that was part of the reason for re-implementing the GDAL vector interface. rgdal::writeOGR() is also weighed down by handing special cases, and is not written for efficiency, just to work at all was the main goal.

@mkschneider
Copy link
Author

Sorry, I have not answered your earlier question. Yes, it is all CRAN binaries as we use restricted machines at work on which compiling is not possible. I'll have a try on my private laptop.

@mkschneider
Copy link
Author

I have now installed sf from source on Windows 7 64-bit with the effect of similarly slow writing of gpkg and R crashing with the SQlite driver.

       user.self sys.self elapsed
t_shp       0.03     0.02    0.06
t_gpkg      0.47     1.96    3.74

I have an operational workaround using ogr2ogr of a temporary shp into gpkg, which satisfies my needs for the moment. Since this issue has nothing to do with sf, I am going to close it here. Many thanks for your support.

@Nowosad
Copy link
Contributor

Nowosad commented Oct 15, 2017

I've asked on the gdal-dev mailing list about this problem - https://lists.osgeo.org/pipermail/gdal-dev/2017-October/047383.html.
@edzer is it possible to implement Even's solution?

@edzer
Copy link
Member

edzer commented Oct 15, 2017

Time to redo your timings. Thanks Even @rouault !

@tim-salabim
Copy link
Member

Timings for gpkg are better now, but still nowhere near shp

library(sf)
library(rgdal)
library(rbenchmark)

n = 1000
d = data.frame(a = 1:n, X = rnorm(n,1,1), Y = rnorm(n,1,1))
mp1 = st_as_sf(d, coords = c("X","Y"))
mp1_sp = as(mp1, "Spatial")

benchmark(
  sf_shp =     st_write(mp1, dsn = 'software/testing/gpkg/data1.shp', delete_dsn = TRUE),
  sf_sqlite =  st_write(mp1, dsn = 'software/testing/gpkg/test.sqlite', delete_dsn = TRUE),
  sf_gpkg =    st_write(mp1, "software/testing/gpkg/data1.gpkg", delete_dsn = TRUE),
  rgdal_gpkg = writeOGR(mp1_sp, "software/testing/gpkg/data2.gpkg", 
                        driver = "GPKG", layer = "data2", overwrite_layer = TRUE,
                        delete_dsn = TRUE),
  replications = 10
)

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg           10   0.598    7.973     0.238    0.357          0         0
3    sf_gpkg           10   0.603    8.040     0.244    0.360          0         0
1     sf_shp           10   0.075    1.000     0.059    0.017          0         0
2  sf_sqlite           10  48.894  651.920     1.495    4.426          0         0

@edzer
Copy link
Member

edzer commented Oct 15, 2017

I'm seeing

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg           10   0.767    8.716     0.364    0.400          0         0
3    sf_gpkg           10   0.218    2.477     0.216    0.004          0         0
1     sf_shp           10   0.088    1.000     0.064    0.024          0         0
2  sf_sqlite           10   0.682    7.750     0.092    0.012          0         0

Are you sure you recompiled everything of sf?

For the timings of @mkschneider I now see

         test replications elapsed relative user.self sys.self user.child sys.child
7    gpkg_mp1            5   0.449   40.818     0.436    0.012          0         0
8    gpkg_mp2            5   0.116   10.545     0.116    0.000          0         0
9    gpkg_mp3            5   0.055    5.000     0.056    0.000          0         0
10    ogr_mp1            5   0.705   64.091     0.516    0.188          0         0
11    ogr_mp2            5   0.403   36.636     0.208    0.192          0         0
1     shp_mp1            5   0.666   60.545     0.656    0.012          0         0
2     shp_mp2            5   0.030    2.727     0.024    0.004          0         0
3     shp_mp3            5   0.011    1.000     0.008    0.004          0         0
4  sqlite_mp1            5   0.673   61.182     0.356    0.012          0         0
5  sqlite_mp2            5   0.309   28.091     0.028    0.008          0         0
6  sqlite_mp3            5   0.281   25.545     0.024    0.000          0         0

which indicates gpkg is faster than shp in the case of many attributes (mp1).

@tim-salabim
Copy link
Member

You're right, I now see

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg           10   0.591    6.156     0.284    0.305          0         0
3    sf_gpkg           10   0.206    2.146     0.200    0.004          0         0
1     sf_shp           10   0.096    1.000     0.081    0.016          0         0
2  sf_sqlite           10   0.285    2.969     0.108    0.009          0         0

which is really awesome!
Thanks for this!
Time to ditch the shapefile for good :-)

@edzer
Copy link
Member

edzer commented Oct 16, 2017

In the case of somewhat larger polygon datasets, e.g.

library(sf)
library(rgdal)
library(rbenchmark)
options(width = 100)

demo(nc, ask = FALSE, echo = FALSE)
mp1 = rbind(nc, nc ,nc, nc, nc, nc, nc, nc, nc, nc) # 1000 features
mp1_sp = as(mp1, "Spatial")

benchmark(
  sf_shp =     st_write(mp1, dsn = 'data1.shp', delete_dsn = TRUE, quiet = TRUE),
  sf_sqlite =  st_write(mp1, dsn = 'test.sqlite', delete_dsn = TRUE, quiet = TRUE),
  sf_gpkg =    st_write(mp1, "data1.gpkg", delete_dsn = TRUE, quiet = TRUE),
  rgdal_gpkg = writeOGR(mp1_sp, "data2.gpkg", 
                        driver = "GPKG", layer = "data2", overwrite_layer = TRUE,
                        delete_dsn = TRUE),
  replications = 10
)

we see that gpkg is equally fast as shapefile:

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg           10   0.982    3.262     0.560    0.416          0         0
3    sf_gpkg           10   0.301    1.000     0.284    0.016          0         0
1     sf_shp           10   0.306    1.017     0.272    0.032          0         0
2  sf_sqlite           10   0.827    2.748     0.216    0.012          0         0

@tim-salabim
Copy link
Member

I am looking forward to trying this at work where we work with large layers in the order of 10 mio features of all geometry types all the time. If I get the chance/time I will report back some timings. If the write speeds of gpkg turn out to be similar to shp, then I'm gonna make a strong case for moving our analysis portfolio to gpkg.

@rsbivand
Copy link
Member

For the record using Edzer's test above, rgdal 1.2.14, sf 0.5-5:

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg           10   2.056    5.925     0.595    0.802          0         0
3    sf_gpkg           10   0.377    1.086     0.330    0.022          0         0
1     sf_shp           10   0.459    1.323     0.382    0.044          0         0
2  sf_sqlite           10   0.347    1.000     0.235    0.020          0         0

and forthcoming rgdal 1.2.15 (using TestCapabilities):

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg           10   0.549    1.496     0.470    0.032          0         0
3    sf_gpkg           10   0.403    1.098     0.329    0.018          0         0
1     sf_shp           10   0.466    1.270     0.369    0.050          0         0
2  sf_sqlite           10   0.367    1.000     0.238    0.016          0         0

@edzer
Copy link
Member

edzer commented Oct 20, 2017

Good to see that rgdal is now also up to speed!

@tim-salabim
Copy link
Member

Ok, so here's a benchmark with large road network LINESTRING data. Seems that gpkg and sqlite eat the shapefile for breakfast when data gets larger:

> dim(mp1)
[1] 483479     51

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg            5 1891.39   14.940    257.23  1464.25         NA        NA
3    sf_gpkg            5  226.23    1.787    194.52    31.38         NA        NA
1     sf_shp            5 1272.97   10.055   1224.75    46.47         NA        NA
2  sf_sqlite            5  126.60    1.000    121.63     1.14         NA        NA

An added benefit is that there is no data loss as these warnings only occur when writing to shp.

> warnings()
Warnmeldungen:
1: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options),  ... :
  GDAL Message 1: Value 12760003401453 of field ID of feature 0 not successfully written. Possibly due to too larger number with respect to field width
2: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options),  ... :
  GDAL Message 1: Value 12760019135320 of field F_JNCTID of feature 0 not successfully written. Possibly due to too larger number with respect to field width
3: In CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options),  ... :
  GDAL Message 1: Value 12760018932615 of field T_JNCTID of feature 0 not successfully written. Possibly due to too larger number with respect to field width
...

and so forth...
We are very happy with this :-)

@rsbivand
Copy link
Member

@tim-salabim : which rgdal version was this? Might it be possible to check out the source from R-Forge (currently they don't have GDAL as I asked them to upgrade to GDAL 2, so no tarball there) and try that? If it's easier, I can put a tarball (or Windows binary from win-builder) on my server.

@tim-salabim
Copy link
Member

This was rgdal 1.2.8.
@rsbivand if you could provide a zip on a server that would be easiest for me.

@tim-salabim
Copy link
Member

Sorry with ZIP i mean a Windows binary

@rsbivand
Copy link
Member

Yes, I'm waiting for winbuilder to respond; will post the link if it passes check.

@rsbivand
Copy link
Member

Here you have the check output and Windows binary of 1.2.15.

@tim-salabim
Copy link
Member

Sweet, I'll give it a spin tomorrow at work

@tim-salabim
Copy link
Member

tim-salabim commented Oct 27, 2017

@rsbivand sorry for the delay! Here you go (with rgdal: version: 1.2-15, (SVN revision 686)):

        test replications elapsed relative user.self sys.self user.child sys.child
4 rgdal_gpkg            5  223.84    1.786    191.95    31.31         NA        NA
3    sf_gpkg            5  224.76    1.794    195.23    29.44         NA        NA
1     sf_shp            5 1262.15   10.073   1215.30    45.63         NA        NA
2  sf_sqlite            5  125.30    1.000    122.05     1.09         NA        NA

On par with sf!

@rsbivand
Copy link
Member

Thanks very much, very useful! I'll move to submission to CRAN ...

@edzer edzer changed the title slow writing to Sqlite and Gpkg writing speed sqlite and GPKG compared to shapefile Oct 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants