Tidyclust #12

Open · wants to merge 6 commits into main

Conversation

kbodwin

@kbodwin kbodwin commented Feb 26, 2021

This is my writeup with some details about clustering (unsupervised learning) and how I envision that fitting into the tidymodels framework.

Member

@juliasilge juliasilge left a comment


This is super exciting to see! 🙌

tidy-clustering/README.Rmd

## `parsnip`-like models

Obviously, any and all clustering methods would have to be implemented as model
Member


I might take out "obviously" here because if a goal is for folks to be able to use these as preprocessing steps, then building this out using a recipes framework might be worth considering (rather than parsnip).

Author


I edited the sentence, but I disagree with using the recipes framework.

Even if an unsupervised procedure informs a preprocessing step, the application of the clustering method is not itself preprocessing: the data are not altered by the search procedure.

Besides which, I think things could get a little confusing when clustering is the end point in and of itself. Would it appear as a recipe with no "baking" step? That feels strange to me.

kbodwin and others added 2 commits March 1, 2021 09:45
Co-authored-by: Julia Silge <julia.silge@gmail.com>
@alexpghayes

Just wanted to comment that for unsupervised methods there is often both a forward and a backward transformation. Less so with clustering, but for many PCA-like tools. I previously brought this up a little in tidymodels/recipes#264.

Another possible consideration before starting to prototype things is the difference between inductive/transductive models, or methods that can be applied to a new dataset versus those that cannot.

@michaelgaunt404

Heyo! Really excited to see this developing. I've recently fallen into the world of unsupervised clustering (via some gnarly text projects) and have been having a hard time understanding the literature and finding certain methods in tidymodels.

@brshallo

Very cool! Looking forward to developments here and this functionality coming to tidymodels!

I posted a toy solution on SO for validating k-means cluster partition stability on a holdout set: https://stackoverflow.com/a/68845111/9059865. (For anyone stumbling onto this thread and looking for something simple in the interim before {celery} 😊 gets implemented in tidymodels.)

Member

@topepo topepo left a comment


Overall it is great. My comments mostly reflect the idea that a different, but very similar, class of objects should be used to represent a clustering object.


## `parsnip`-like models

Any and all clustering methods could be implemented as model spec functions, and the setup would look similar, e.g.
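As a hedged illustration of the "e.g." above (the function name `k_means`, the `"partition"` mode, and the registration details are all hypothetical, not an existing tidymodels API), such a spec function might be sketched as:

```r
# Hypothetical sketch only: k_means() and the "partition" mode are
# illustrative; a real implementation would also need to register the
# model, mode, and engine with parsnip before this spec could be fit.
library(parsnip)

k_means <- function(mode = "partition", k = NULL) {
  new_model_spec(
    "k_means",
    args = list(k = rlang::enquo(k)),  # tunable args stored as quosures
    eng_args = NULL,
    mode = mode,
    method = NULL,
    engine = "stats"                   # e.g. dispatch to stats::kmeans()
  )
}

# Intended usage, mirroring existing parsnip models:
# k_means(k = 3) %>%
#   fit(~ ., data = mtcars)
```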
Member


I agree that a parsnip-like interface would be a great idea. I don't think that the exact same object structure and class are needed, though. For example, the mode of "partition" is only required if you fit it into an object of class model_spec.

I think that an alternate class name (cluster_spec?) would be good, with some minor differences. For example, the distance component is one that doesn't fit with a model spec.

In other words, the API is good, but using the entirety of the existing model specification format will be ungainly or kludgy.
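As a rough sketch of this alternate class (every name here is hypothetical), a `cluster_spec` constructor could parallel the shape of `parsnip::new_model_spec()` while carrying clustering-specific components such as a distance:

```r
# Hypothetical constructor for a clustering specification. It mirrors
# the shape of a parsnip model spec but uses its own class, so that
# clustering-only components (e.g. a distance) have somewhere to live
# without overloading model_spec.
new_cluster_spec <- function(cls, args, distance, engine = NULL) {
  spec <- list(
    args     = args,      # quosures of tunable arguments, as in parsnip
    distance = distance,  # e.g. "euclidean"; no analogue in model_spec
    engine   = engine
  )
  class(spec) <- c(cls, "cluster_spec")
  spec
}

# A spec function built on top of it:
k_means <- function(k = NULL, distance = "euclidean") {
  new_cluster_spec(
    "k_means",
    args = list(k = rlang::enquo(k)),
    distance = distance
  )
}
```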


It's been suggested that "fit" might not be the right verb for unsupervised methods. Technically, many supervised methods are not truly fitting a model either (see: k-nearest neighbors), so I'm not worried about this.

## `rsample` and cross-validating
Member


Is bootstrapping used at all in cluster validation? Would something like this come into play with the potential for tuning mentioned below?
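Bootstrapping does appear in cluster validation, typically as a stability check: refit on resamples and ask how much the partition moves. A minimal sketch using rsample and `stats::kmeans()` (the summarization step is only described, not implemented):

```r
# Sketch: refit k-means on bootstrap resamples of the data and collect
# the fitted centers from each resample. Illustrative only.
library(rsample)
library(purrr)

set.seed(123)
boots <- bootstraps(mtcars, times = 25)

centers_per_resample <- map(boots$splits, function(split) {
  # analysis() extracts the bootstrap sample; refit on it
  km <- kmeans(scale(analysis(split)), centers = 3, nstart = 10)
  km$centers
})

# Stability could then be summarized by matching centers across
# resamples, or by agreement (e.g. adjusted Rand index) between the
# full data's assignments under each resample's centers.
```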


## `yardstick`-like metrics

Unsupervised methods have very different (and more ambiguous) metrics for "success" than supervised methods. (see above for lots and lots of detail)
Member


Perhaps we could change from the nomenclature of "performance metrics" to something like "partition characteristics", the former being more of a supervised learning thing.
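To make the "partition characteristic" idea concrete, one such quantity is average silhouette width, computable today from `cluster::silhouette()`. A hedged sketch (the wrapper name `avg_silhouette` is hypothetical, not an existing yardstick function):

```r
# Hypothetical yardstick-style wrapper around cluster::silhouette().
# Note it returns one number per partition, not one per row, which is
# part of why "performance metric" sits oddly here.
library(cluster)

avg_silhouette <- function(data, cluster_ids) {
  d   <- dist(scale(data))
  sil <- silhouette(as.integer(cluster_ids), d)
  mean(sil[, "sil_width"])
}

km <- kmeans(scale(mtcars), centers = 3, nstart = 10)
avg_silhouette(mtcars, km$cluster)
```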
