kmeans and kmedoids step functions #399

Closed
brian-j-smith opened this issue Oct 15, 2019 · 7 comments

@brian-j-smith

@topepo: I've written a couple of step functions (described below) that you might consider for inclusion in the package.

step_kmeans : conversion of numeric variables to a reduced set by averaging within a k-means cluster partitioning of them.

step_kmedoids : conversion of numeric variables to a reduced set by selecting the medoids from a k-medoids cluster partitioning.

Both are dimension reduction techniques that can be viewed as projections to a reduced number of components (num_comp), like step_pca. step_kmedoids can additionally be viewed as variable selection, like step_corr.

The source code is available here.

Feel free to let me know if you are open to a PR for these and, if so, whether you have any questions on or suggested changes to the implementations.
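
As a rough illustration of the two ideas described above, the following base R sketch clusters the variables (not the rows) of the built-in attitude data. It is not the proposed implementation: the "KMeans" column prefix and all object names are assumptions made for this example, and it assumes the cluster package is installed to provide pam().

```r
library(cluster)  # provides pam(); assumed to be installed

set.seed(123)  # kmeans() uses random starting centers
x <- scale(as.matrix(attitude[, -1]))  # centered and scaled predictors

# Variables are the clustering units, so cluster the rows of t(x).
km <- kmeans(t(x), centers = 3, nstart = 10)
pm <- pam(t(x), k = 3)

# k-means version: one new feature per cluster, computed as the mean of
# the variables assigned to that cluster.
km_features <- sapply(sort(unique(km$cluster)), function(j) {
  rowMeans(x[, km$cluster == j, drop = FALSE])
})
colnames(km_features) <- paste0("KMeans", seq_len(ncol(km_features)))

# k-medoids version: keep only the medoid variables, which amounts to
# selecting a subset of the original columns.
medoid_vars <- colnames(x)[pm$id.med]
kmed_features <- x[, medoid_vars, drop = FALSE]

dim(km_features)
medoid_vars
```

The k-means features are new averaged columns, while the k-medoids features are original columns, which is why the latter can also be read as variable selection.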

@topepo
Member

topepo commented Nov 24, 2019

A PR would be good but it might be better in the embed package.

A few things:

  • How does bake() assign clusters to new samples?

  • It might be good to add a replace option so that people can either keep the original variables or replace them with the cluster variables.

  • Can you trim the cluster objects to keep only the parts that we need to assign samples to clusters?

  • Instead of using ncol(training), count the number of variables with the appropriate roles or use length(col_names).

  • I don't think that the user should be able to pass in a pre-made cluster object. That should always be created by prep().

  • Adding a tunable method for the algorithm would be a good idea too. If you make one, we can add it to dials.

  • step_kmeans() should also run recipes_pkg_check().
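
Two of the points above (trimming the fitted object and counting selected columns rather than using ncol(training)) can be sketched in base R. This is only an illustration on the built-in attitude data; the object names trimmed and col_names are assumptions for the example, not the structure used by the proposed steps.

```r
set.seed(123)
training <- attitude                               # includes the outcome, rating
col_names <- setdiff(names(training), "rating")    # the selected predictors

x <- scale(as.matrix(training[, col_names]))
fit <- kmeans(t(x), centers = 3, nstart = 10)      # cluster the variables

# Keep only the variable-to-cluster map; centers, sums of squares, etc.
# are not needed to transform new data in this sketch.
trimmed <- list(cluster = fit$cluster)

ncol(training)      # counts every column, including the outcome
length(col_names)   # counts only the columns the step operates on
as.numeric(object.size(trimmed)) < as.numeric(object.size(fit))
```

Using length(col_names) keeps any upper bound on the number of components tied to the variables actually selected, rather than to everything in the training set.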

@brian-j-smith
Author

Thanks for the feedback.

  • Both functions work similarly. Variables are partitioned into clusters during the call to prep(). The variable cluster assignments are then saved and passed to bake(). A new sample would contain the same variables, the cluster assignments would be known, and the variables would be averaged within the clusters (step_kmeans()) or replaced with the cluster medoids (step_kmedoids()).

  • A replace option makes sense for step_kmeans(). Could be done for step_kmedoids() too if the medoid variable names are changed to distinguish them from the originals. Will plan to do.

  • Yes, will trim the cluster objects.

  • Will use length(col_names).

  • Agree about not wanting the user to pass a pre-made object. I had included the res argument only for consistency with other step functions, like step_pca(), but would otherwise leave out arguments that users should never set. Would you be ok with it being left out altogether? How about the trained argument? Could it be left out for the same reason? If not, could I simply ignore any values passed to res or trained?

  • Did you have something in mind other than the tunable.step_kmeans and tunable.step_kmedoids methods already in my code?

  • Is a call to recipes_pkg_check() needed if only the R system libraries Matrix and stats are used by step_kmeans()?
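
The prep/bake split and the replace option discussed in this list could look roughly like the following base R sketch. The helpers prep_var_kmeans() and bake_var_kmeans() are hypothetical names invented for this example and are not part of recipes or of the proposed code; centering and scaling are omitted for brevity.

```r
# Hypothetical helper: learn the variable-to-cluster map from training data.
prep_var_kmeans <- function(training, vars, k) {
  fit <- kmeans(t(as.matrix(training[, vars, drop = FALSE])),
                centers = k, nstart = 10)
  list(cluster = fit$cluster)  # named integer vector: variable -> cluster
}

# Hypothetical helper: apply the stored map to any data with the same variables.
bake_var_kmeans <- function(object, new_data, replace = TRUE) {
  vars <- names(object$cluster)
  x <- as.matrix(new_data[, vars, drop = FALSE])
  ids <- sort(unique(object$cluster))
  # cbind() over a list keeps a matrix shape even for a single row.
  feats <- do.call(cbind, lapply(ids, function(j) {
    rowMeans(x[, object$cluster == j, drop = FALSE])
  }))
  colnames(feats) <- paste0("KMeans", ids)
  keep <- if (replace) setdiff(names(new_data), vars) else names(new_data)
  out <- cbind(new_data[, keep, drop = FALSE], as.data.frame(feats))
  rownames(out) <- NULL
  out
}

set.seed(123)
vars <- setdiff(names(attitude), "rating")
obj <- prep_var_kmeans(attitude, vars, k = 3)

a <- bake_var_kmeans(obj, attitude[1:10, ])   # bake a subset of rows
b <- bake_var_kmeans(obj, attitude)[1:10, ]   # bake all rows, then subset
all.equal(a, b, check.attributes = FALSE)     # compares values only

names(bake_var_kmeans(obj, attitude, replace = FALSE))
```

Because only the variable assignments are stored, baking depends on the columns of the new data and not on which rows were used for training.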

Feel free to let me know your preference about which package is the best fit for these. They are most similar to step_pca() and step_corr() and not applicable to categorical predictors. My first choice is probably recipes and second to start a new package for these and some other related work.

Best regards.

@brian-j-smith
Author

The suggested changes have been made in this commit.

@topepo
Member

topepo commented Dec 2, 2019

> A new sample would contain the same variables, the cluster assignments would be known

But not for new data (that was not involved in the analysis that generated the clusters). In your example, attitude is used to model the data and get predictions but these are on the same data. Wouldn't you need some rule to associate new samples with their qualitative cluster membership (like nearest centroid)?

I assumed that, since you are doing clustering, the main output of that analysis would be the qualitative cluster membership (as opposed to returning functions of the centroids). I see the projection that you are doing instead, but I'm not sure that I would associate it with the output of clustering.
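
For contrast, the sample-clustering view described here, where rows are clustered and new rows are assigned by a nearest-centroid rule, might be sketched in base R as follows. The function nearest_centroid() is a hypothetical helper written for this example, and the train/new split of attitude is arbitrary.

```r
set.seed(123)
x <- as.matrix(attitude[, -1])
train <- x[1:20, ]    # rows used to form the clusters
new <- x[21:30, ]     # rows not seen during clustering

fit <- kmeans(train, centers = 3, nstart = 10)  # rows are the clustering units

# Hypothetical helper: assign each row of newdata to the closest centroid
# by squared Euclidean distance.
nearest_centroid <- function(centers, newdata) {
  d <- apply(centers, 1, function(ctr) rowSums(sweep(newdata, 2, ctr)^2))
  d <- matrix(d, nrow = nrow(newdata))  # keep matrix shape for a single row
  max.col(-d, ties.method = "first")    # index of the smallest distance
}

new_membership <- nearest_centroid(fit$centers, new)
new_membership
```

Here the output is a qualitative cluster label per sample, which needs an explicit assignment rule for unseen rows, unlike a transformation that relies only on a clustering of the variables.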

I'll look at it more in a few days.

@brian-j-smith
Author

You might be thinking about the more common application of clustering in which cases/samples are the clustering units (things being clustered). I know I was when initially learning about these approaches. Here, it is the other way around. The variables are the clustering units; note that kmeans/pam are applied to the transposed training matrices in the implementations. The approaches only require that a new dataset has the same set of variables as the training dataset; the cases can and usually will differ. Put another way, cluster membership is in terms of the variables rather than the cases.

Below is the step_kmeans() example modified to predict on the full and subsetted attitude data. Predictions on the common samples (1:10) are the same since the approach is using cluster memberships of the variables, as intended.

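library(recipes)
# step_kmeans() is the step proposed in this thread; its definition
# must be loaded before running this example.

rec <- recipe(rating ~ ., data = attitude)
kmeans_rec <- rec %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_kmeans(all_predictors(), num_comp = 3)
kmeans_prep <- prep(kmeans_rec, training = attitude)

# Bake the first 10 rows alone, then bake all rows and subset afterwards.
bake(kmeans_prep, attitude[1:10, ])
bake(kmeans_prep, attitude)[1:10, ]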

@topepo
Member

topepo commented Jun 5, 2020

It looks like this was implemented here

@topepo topepo closed this as completed Jun 5, 2020
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021