Skip to content

Wrap clustering_cv()#126

Merged
mikemahoney218 merged 3 commits intomainfrom
wrap_clustering_cv
Dec 9, 2022
Merged

Wrap clustering_cv()#126
mikemahoney218 merged 3 commits intomainfrom
wrap_clustering_cv

Conversation

@mikemahoney218
Copy link
Member

@mikemahoney218 mikemahoney218 commented Dec 8, 2022

This fixes #120 and fixes #104 by wrapping rsample::clustering_cv().

There's three big (breaking) changes here that I'm aware of:

  1. spatial_clustering_cv() no longer handles non-sf objects, because I think they'd be better handled via rsample::clustering_cv().
  2. Distances are now calculated between edges, not centroids, of non-point geometries. This is how the rest of the package works, and it makes the most sense to me; if you have polygons which touch, you'd probably assume their data has some amount of spatial relationship, regardless of where the midpoint of each polygon is.
  3. Because the new distance_function argument is now a function by default, and gets assigned as an attribute to the resulting rset, the distance_function attribute winds up having a somewhat complex environment, which is non-intuitive:
library(spatialsample)

clust <- spatial_clustering_cv(boston_canopy, v = 2)
lobstr::obj_sizes(
  boston_canopy,
  environment(attr(clust, "distance_function"))
)
#> * 984.07 kB
#> *  60.74 kB

ls(environment(attr(clust, "distance_function")))
#>  [1] "buffer"            "cluster_function"  "cv_att"           
#>  [4] "data"              "distance_function" "n"                
#>  [7] "radius"            "repeats"           "rset"             
#> [10] "v"

Created on 2022-12-08 by the reprex package (v2.0.1)

I'm not sure if there's a good way to "zero out" that environment, so that we aren't accidentally dragging extra data along.

Otherwise, this function should work the same as it always has.

Copy link
Member

@hfrick hfrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good to me! Just a minor comment on the docs

#' to partition the data into disjointed sets via clustering.
#' This argument is ignored (with a warning) if `data` is an `sf` object.
#' @inheritParams buffer_indices
#' @inheritParams rsample::clustering_cv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this leads to the docs for distance_function to state that the default would be stats::dist()

@mikemahoney218 mikemahoney218 merged commit 70e9529 into main Dec 9, 2022
@mikemahoney218 mikemahoney218 deleted the wrap_clustering_cv branch December 9, 2022 16:52
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spatial_clustering_cv() should have a repeats argument Wrap clustering_cv()

2 participants