I'm trying to remove duplicate geometries, in this case points.
There are several ways to do so. My first idea was to use dplyr::distinct(), but it does not seem to work on geometry columns.
Some examples below:
#Create example dataset
library(sf)
library(dplyr)
d <- structure(list(layer = 274.146911621094, geometry = structure(list(
`1` = structure(list(structure(c(-3162000, -3150000, -3150000,
-3162000, -3162000, 3162000, 3162000, 3150000, 3150000, 3162000
), .Dim = c(5L, 2L))), class = c("XY", "POLYGON", "sfg"))), .Names = "1", class = c("sfc_POLYGON",
"sfc"), precision = 0, bbox = structure(c(-3162000, 3150000,
-3150000, 3162000), .Names = c("xmin", "ymin", "xmax", "ymax"
), class = "bbox"), crs = structure(list(epsg = NA_integer_,
proj4string = "+proj=lcc +lat_1=30 +lat_2=65 +lat_0=48 +lon_0=9.75 +x_0=-6000 +y_0=-6000 +a=6371229 +b=6371229 +units=m +no_defs"), .Names = c("epsg",
"proj4string"), class = "crs"), n_empty = 0L)), .Names = c("layer",
"geometry"), row.names = 1L, class = c("sf", "data.frame"), sf_column = "geometry", agr = structure(NA_integer_, .Names = "layer", .Label = c("constant",
"aggregate", "identity"), class = "factor"))
dpoint <- (st_cast(d, "POINT"))
#Now let's try to eliminate the duplicate point: 4 different ways come to mind
dpoint %>% distinct(geometry) #does nothing <---- would be my preferred solution
st_intersection(dpoint) #Works, adds columns
st_cast(st_union(dpoint), "POINT") #Works
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) #Works
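Note that the last approach rebuilds the object from bare coordinates, so the `layer` column is lost. A minimal sketch of a variant that keeps the attribute columns, assuming that "duplicate" means the stored X/Y values coincide (no snapping tolerance). `pts` is a hypothetical stand-in for `dpoint`, built here only so the snippet is self-contained:

```r
library(sf)

# Hypothetical stand-in for `dpoint`: five points, the last repeating the first
# (as happens when a closed polygon ring is cast to POINT).
xy <- rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))
pts <- st_sf(
  layer = rep(274.15, nrow(xy)),
  geometry = st_sfc(lapply(seq_len(nrow(xy)), function(i) st_point(xy[i, ])))
)

# duplicated() on the coordinate matrix flags every row whose X/Y pair
# repeats an earlier row.
keep <- !duplicated(st_coordinates(pts))

# Row subsetting keeps the sf class and the attribute columns.
pts_unique <- pts[keep, ]
nrow(pts_unique)
```

This has not been timed against the approaches above, so whether it is any faster on the large datasets is an open question.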
Now for some performance testing on a larger dataset, which I do not attach (10k points, most of which are duplicates):
library(microbenchmark)
mb <- microbenchmark(times=10,
st_intersection(dpoint),
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint))
)
Result:
Unit: milliseconds
       min         lq        mean     median          uq         max neval cld  expr
9683.35427 9777.54914 10036.55727 9975.24093 10205.31676 10667.48132    10   b  st_intersection(dpoint)
 106.32318  108.49353   115.39371  110.64525   111.65030   143.65393    10   a  st_cast(st_union(dpoint), "POINT")
  37.90596   38.33749    38.86942   38.61979    39.36003    40.67645    10   a  st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X, Y), coords = 1:2, crs = st_crs(dpoint))
And on a much larger dataset (1.4M points), for the two fastest methods:
mb <- microbenchmark(times=10,
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint))
)
Result:
Unit: seconds
     min        lq     mean    median        uq       max neval cld  expr
14.26685 14.531474 15.70820 15.675282 16.283850 18.349245    10   b  st_cast(st_union(dpoint), "POINT")
 5.06904  5.166566  5.98637  5.647217  6.737356  7.637938    10   a  st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X, Y), coords = 1:2, crs = st_crs(dpoint))
Is there a faster, more elegant method? Could dplyr::distinct(geometry) be made to work?