kmeans and kmedoids step functions #399

brian-j-smith · 2019-10-15T18:21:14Z

@topepo: I've written a couple of step functions (described below) that you might consider for inclusion in the package.

step_kmeans : conversion of numeric variables to a reduced set by averaging within a k-means cluster partitioning of them.

step_kmedoids : conversion of numeric variables to a reduced set by selecting the medoids from a k-medoids cluster partitioning.

Both are dimension reduction techniques that can be viewed as projections to a reduced number of components (num_comp), like step_pca. step_kmedoids can additionally be viewed as variable selection, like step_corr.
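To make the k-means reduction concrete, here is a minimal base R sketch of the idea as described above: cluster the variables, then average them within each cluster. It is an illustration under stated assumptions, not the code from the linked source. The use of the built-in attitude data, the prior standardization, and the KMeans1, KMeans2, ... column names are all choices made for this example.

```r
# Sketch only: base R, not the implementation from the linked source.
set.seed(123)                            # kmeans() uses random starts
x <- scale(as.matrix(attitude[, -1]))    # six standardized predictors (rating dropped)

# Cluster the variables: transpose so that each row is one variable.
fit <- kmeans(t(x), centers = 3, nstart = 10)
clusters <- fit$cluster                  # named integer vector: variable -> cluster

# Average the variables within each cluster to obtain the reduced set.
reduced <- sapply(sort(unique(clusters)), function(k) {
  rowMeans(x[, clusters == k, drop = FALSE])
})
colnames(reduced) <- paste0("KMeans", seq_len(ncol(reduced)))  # hypothetical names

dim(reduced)  # 30 rows, 3 columns
```

Each of the three new columns is the mean of the original variables that fell into the same cluster, so the number of columns equals num_comp.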

The source code is available here.

Feel free to let me know if you are open to a PR for these and, if so, whether you have any questions on or suggested changes to the implementations.


topepo · 2019-11-24T14:13:51Z

A PR would be good but it might be better in the embed package.

A few things:

How does bake() assign clusters to new samples?
It might be good to add a replace option so that people can either keep the original variables or replace them with the cluster variables.
Can you trim the cluster objects to keep only the parts needed to assign samples to clusters?
Instead of using ncol(training), count the number of variables with the appropriate roles or use length(col_names).
I don't think that the user should be able to pass in a pre-made cluster object. That should always be created by prep().
Adding a tunable method for the algorithm would be a good idea too. If you make one, we can add it to dials.
step_kmeans() should also run recipes_pkg_check().

brian-j-smith · 2019-11-25T22:13:52Z

Thanks for the feedback.

Both functions work similarly. Variables are partitioned into clusters during the call to prep(). The variable cluster assignments are then saved and passed to bake(). A new sample would contain the same variables, the cluster assignments would be known, and the variables would be averaged within the clusters (step_kmeans()) or replaced with the cluster medoids (step_kmedoids()).
A replace option makes sense for step_kmeans(). Could be done for step_kmedoids() too if the medoid variable names are changed to distinguish them from the originals. Will plan to do.
Yes, will trim the cluster objects.
Will use length(col_names).
Agree about not wanting the user to pass a pre-made object. I had included that res argument only for consistency with other step functions, like step_pca(); but would otherwise not include function arguments that should not be passed a value. Would you be ok with it being left out altogether? How about the trained argument? Could it be left out for the same reason? If not left out, could I simply ignore any values passed to res or trained?
Did you have something in mind other than the tunable.step_kmeans and tunable.step_kmedoids methods already in my code?
Is a call to recipes_pkg_check() needed if only the R system libraries Matrix and stats are used by step_kmeans()?
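The division of work between prep() and bake() described in the first point of this reply can be sketched in base R as follows. The helper names prep_var_kmeans() and bake_var_kmeans() are hypothetical and exist only for this illustration; they are not functions from recipes or from the linked source. The sketch also keeps only the variable-to-cluster map after training, in the spirit of the request to trim the cluster objects.

```r
# Hypothetical helpers for illustration; not functions from recipes.
prep_var_kmeans <- function(training, num_comp) {
  # Rows of the transposed matrix are variables, so variables get clustered.
  fit <- kmeans(t(as.matrix(training)), centers = num_comp, nstart = 10)
  # Keep only what baking needs: the variable-to-cluster map.
  list(clusters = fit$cluster)
}

bake_var_kmeans <- function(object, new_data) {
  clusters <- object$clusters
  x <- as.matrix(new_data[, names(clusters), drop = FALSE])
  out <- sapply(sort(unique(clusters)), function(k) {
    rowMeans(x[, clusters == k, drop = FALSE])
  })
  # Force a matrix shape even when new_data has a single row.
  out <- matrix(out, nrow = nrow(x))
  colnames(out) <- paste0("KMeans", seq_len(ncol(out)))
  as.data.frame(out)
}

set.seed(123)
predictors <- as.data.frame(scale(attitude[, -1]))   # standardized up front
trained <- prep_var_kmeans(predictors, num_comp = 3)

# Any rows with the same columns can be baked; no per-row cluster lookup is needed.
baked <- bake_var_kmeans(trained, predictors[1:5, ])
```

Because the stored object maps variables to clusters, baking a row never depends on which other rows are present.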

Feel free to let me know your preference about which package is the best fit for these. They are most similar to step_pca() and step_corr() and not applicable to categorical predictors. My first choice is probably recipes; my second would be to start a new package for these and some other related work.

Best regards.

brian-j-smith · 2019-12-01T20:53:47Z

The suggested changes have been made in this commit.

topepo · 2019-12-02T16:28:53Z

"A new sample would contain the same variables, the cluster assignments would be known"

But not for new data (that was not involved in the analysis that generated the clusters). In your example, attitude is used to model the data and get predictions but these are on the same data. Wouldn't you need some rule to associate new samples with their qualitative cluster membership (like nearest centroid)?
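For contrast, the nearest-centroid rule mentioned above applies when the cases are the clustering units. A minimal base R sketch of that setting follows; the train/new split of the attitude data and the helper name nearest_centroid() are assumptions made for this example and do not come from either implementation discussed in this thread.

```r
# Sketch only: here the cases (rows) are the clustering units.
set.seed(123)
train <- scale(as.matrix(attitude[1:20, -1]))
fit <- kmeans(train, centers = 3, nstart = 10)

# Standardize new cases with the training means and standard deviations.
new_cases <- scale(
  as.matrix(attitude[21:30, -1]),
  center = attr(train, "scaled:center"),
  scale = attr(train, "scaled:scale")
)

# Hypothetical helper: index of the closest centroid for each case.
nearest_centroid <- function(x, centers) {
  unname(apply(x, 1, function(row) {
    # t(centers) has one column per centroid; subtract the case from each.
    which.min(colSums((t(centers) - row)^2))
  }))
}

new_membership <- nearest_centroid(new_cases, fit$centers)
```

In this setting a rule like nearest_centroid() is required, because the new rows were not part of the fit and have no stored membership.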

I assumed that, since you are doing clustering, the main output of that analysis would be the qualitative cluster membership (as opposed to returning functions of the centroids). I see the projection that you are doing instead, but I'm not sure that I would associate it with the output of clustering.

I'll look at it more in a few days.

brian-j-smith · 2019-12-02T17:56:46Z

You might be thinking about the more common application of clustering in which cases/samples are the clustering units (things being clustered). I know I was when initially learning about these approaches. Here, it is the other way around. The variables are the clustering units – note kmeans/pam applied to the transposed training matrices in the implementations. The approaches only require that a new dataset has the same set of variables as the training dataset; the cases can and usually will differ. Put another way, cluster membership is in terms of the variables rather than the cases.
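The k-medoids counterpart of this idea can be sketched with pam() applied to the transposed matrix. This is an illustration only, not the code from the linked implementation, and it assumes the cluster package (a recommended package shipped with most R installations) is available. The attitude data and the prior standardization are choices made for this example.

```r
# Sketch only; assumes the recommended 'cluster' package is installed.
library(cluster)

x <- scale(as.matrix(attitude[, -1]))    # standardized predictors (rating dropped)

# Rows of t(x) are variables, so the variables are what get clustered.
fit <- pam(t(x), k = 3)

fit$clustering                           # cluster membership of each variable
medoid_vars <- colnames(x)[fit$id.med]   # variables chosen as cluster medoids

# The reduced set is a subset of the original columns (variable selection).
reduced <- x[, medoid_vars, drop = FALSE]
```

Since the medoids are actual variables, a new dataset only needs those named columns; nothing has to be looked up per case.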

Below is the step_kmeans() example modified to predict on the full and subsetted attitude data. Predictions on the common samples (1:10) are the same since the approach is using cluster memberships of the variables, as intended.

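library(recipes)
# step_kmeans() is assumed to be loaded from the source code linked in the first comment.

rec <- recipe(rating ~ ., data = attitude)
kmeans_rec <- rec %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_kmeans(all_predictors(), num_comp = 3)
kmeans_prep <- prep(kmeans_rec, training = attitude)

# Bake rows 1:10 alone, then bake all rows and keep rows 1:10; the two results should match.
bake(kmeans_prep, attitude[1:10, ])
bake(kmeans_prep, attitude)[1:10, ]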

topepo · 2020-06-05T13:54:37Z

It looks like this was implemented here

github-actions · 2021-02-21T00:03:16Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

pstraforelli mentioned this issue Feb 3, 2020

tidy tools for unsupervised learning methods? #465

Closed

topepo closed this as completed Jun 5, 2020

github-actions bot locked and limited conversation to collaborators Feb 21, 2021
