How to remove duplicate geometries? #669

@adrfantini

I'm trying to remove duplicate geometries, in this case points.
There are several ways to do so: my first idea was to use dplyr::distinct(), but it does not seem to work for geometry columns.

Some examples below:

#Create example dataset
library(sf)
library(dplyr)
d <- structure(list(layer = 274.146911621094, geometry = structure(list(
    `1` = structure(list(structure(c(-3162000, -3150000, -3150000, 
    -3162000, -3162000, 3162000, 3162000, 3150000, 3150000, 3162000
    ), .Dim = c(5L, 2L))), class = c("XY", "POLYGON", "sfg"))), .Names = "1", class = c("sfc_POLYGON", 
"sfc"), precision = 0, bbox = structure(c(-3162000, 3150000, 
-3150000, 3162000), .Names = c("xmin", "ymin", "xmax", "ymax"
), class = "bbox"), crs = structure(list(epsg = NA_integer_, 
    proj4string = "+proj=lcc +lat_1=30 +lat_2=65 +lat_0=48 +lon_0=9.75 +x_0=-6000 +y_0=-6000 +a=6371229 +b=6371229 +units=m +no_defs"), .Names = c("epsg", 
"proj4string"), class = "crs"), n_empty = 0L)), .Names = c("layer", 
"geometry"), row.names = 1L, class = c("sf", "data.frame"), sf_column = "geometry", agr = structure(NA_integer_, .Names = "layer", .Label = c("constant", 
"aggregate", "identity"), class = "factor"))
dpoint <- (st_cast(d, "POINT"))

#Now let's try to eliminate the duplicate point: 4 different ways come to mind
dpoint %>% distinct(geometry) #does nothing   <---- would be my preferred solution
st_intersection(dpoint) #Works, adds columns
st_cast(st_union(dpoint), "POINT") #Works
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) #Works
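
A possible fifth variant, sketched here on hypothetical toy data (`pts` and its `id` column are not part of the dataset above): flag repeated coordinate rows with base R's `duplicated()` and subset the `sf` object directly. Unlike the coordinate round-trip, this keeps the attribute columns of the first occurrence. It assumes POINT geometries and exact coordinate matches, with no tolerance.

```r
library(sf)

# Hypothetical toy data: four points, the first and third coincide.
pts <- st_sf(
  id = 1:4,
  geometry = st_sfc(
    st_point(c(0, 0)), st_point(c(1, 1)),
    st_point(c(0, 0)), st_point(c(2, 2))
  )
)

# For POINT geometries st_coordinates() returns a matrix with X and Y columns;
# duplicated() on a matrix flags rows that repeat an earlier row.
is_dup <- duplicated(st_coordinates(pts))

# Subsetting the sf object keeps the attributes of the first occurrence.
unique_pts <- pts[!is_dup, ]
```

Whether this is faster than the `distinct(X, Y)` round-trip on large inputs is not something measured here.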

Now for some performance testing on a larger dataset, which I do not attach (10k points, most of which are duplicates):

library(microbenchmark)
mb <- microbenchmark(times=10,
st_intersection(dpoint),
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) 
)

Result:

Unit: milliseconds
                                                                                                        expr
                                                                                     st_intersection(dpoint)
                                                                          st_cast(st_union(dpoint), "POINT")
 st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(dpoint))
        min         lq        mean     median          uq         max neval cld
 9683.35427 9777.54914 10036.55727 9975.24093 10205.31676 10667.48132    10   b
  106.32318  108.49353   115.39371  110.64525   111.65030   143.65393    10  a 
   37.90596   38.33749    38.86942   38.61979    39.36003    40.67645    10  a
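
Since the 10k-point dataset is not attached, a synthetic stand-in may help anyone who wants to rerun the comparison. The sketch below is an assumption about what such data could look like (500 random locations resampled to 10,000 points, no CRS); the names `sites` and `dsim` are made up for illustration, and timings on it will not match the figures above.

```r
library(sf)
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# Assumption: 500 distinct locations resampled to 10,000 points,
# so that most points are duplicates.
sites <- data.frame(X = runif(500), Y = runif(500))
dsim <- st_as_sf(sites[sample(500, 10000, replace = TRUE), ],
                 coords = c("X", "Y"))

# The two fastest approaches from the benchmark, applied to the synthetic data.
u_union <- st_cast(st_union(dsim), "POINT")
u_coords <- st_as_sf(
  as.data.frame(st_coordinates(dsim)) %>% distinct(X, Y),
  coords = 1:2, crs = st_crs(dsim)
)
```

Both results should contain the same number of unique points, which gives a quick sanity check before timing them with `microbenchmark()`.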

And on a much larger dataset (1.4M points), for the two fastest methods:

mb <- microbenchmark(times=10,
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) 
)

Result:

Unit: seconds
                                                                                                        expr
                                                                          st_cast(st_union(dpoint), "POINT")
 st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(dpoint))
      min        lq     mean    median        uq       max neval cld
 14.26685 14.531474 15.70820 15.675282 16.283850 18.349245    10   b
  5.06904  5.166566  5.98637  5.647217  6.737356  7.637938    10  a

Is there any faster, more elegant method? Can't dplyr::distinct(geometry) be made to work?
