Tidyclust #12
base: main
Conversation
This is super exciting to see! 🙌
tidy-clustering/README.Rmd (Outdated)
> ## `parsnip`-like models
>
> Obviously, any and all clustering methods would have to be implemented as model
I might take out "obviously" here because if a goal is for folks to be able to use these as preprocessing steps, then building this out using a recipes framework might be worth considering (rather than parsnip).
I edited the sentence, but I disagree with using the recipes framework.
Even if an unsupervised procedure informs a preprocessing step, the application of the clustering method is not itself a preprocessing step. The data is not altered by the search procedure.
Besides which, I think things could get a little confusing when clustering is the end point in and of itself - would it appear as a recipe with no "baking" step? That feels strange to me.
Co-authored-by: Julia Silge <julia.silge@gmail.com>
Just wanted to comment that for unsupervised methods there is often both a forward and a backward transformation. Less so with clustering, but for many PCA-like tools. I previously brought this up a little in tidymodels/recipes#264. Another possible consideration before starting to prototype things is the difference between inductive/transductive models, or methods that can be applied to a new dataset versus those that cannot.
Heyo! Really excited to see this developing. I've recently fallen into the world of unsupervised clustering (via some gnar text projects) and have been having a hard time understanding the literature and not being able to find certain methods in tidymodels.
Very cool! Looking forward to developments here and this functionality coming to tidymodels! I posted a toy solution on SO for validating kmeans cluster partition stability on a holdout set: https://stackoverflow.com/a/68845111/9059865 . (For anyone stumbling onto this thread and looking for something simple in the interim before {celery} 😊 gets implemented in tidymodels.)
Overall it is great. My comments mostly reflect the idea that a different, but very similar, class of objects should be used to represent a clustering object.
> ## `parsnip`-like models
>
> Any and all clustering methods could be implemented as model spec functions, and the setup would look similar, e.g.
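As a rough illustration of the shape such a spec function might take, here is a minimal base-R sketch. None of these functions (`k_means()`, `set_engine_cluster()`, `fit_cluster()`) exist in parsnip or any released package; the names and structure are invented purely to make the proposed API concrete.

```r
# Hypothetical sketch only: these functions are invented for illustration.
k_means <- function(k = 3) {
  structure(list(args = list(k = k), engine = NULL), class = "cluster_spec")
}

set_engine_cluster <- function(spec, engine) {
  spec$engine <- engine
  spec
}

fit_cluster <- function(spec, data) {
  stopifnot(identical(spec$engine, "stats"))  # only engine in this sketch
  stats::kmeans(data, centers = spec$args$k, nstart = 10)
}

set.seed(1)
spec <- set_engine_cluster(k_means(k = 3), "stats")
res  <- fit_cluster(spec, scale(mtcars[, c("mpg", "wt")]))
length(unique(res$cluster))  # the partition has k = 3 groups
```

In a real implementation `set_engine()` and `fit()` would presumably be S3 generics dispatching on the spec class, as in parsnip; plain function names are used here only to keep the sketch self-contained.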
I agree that a parsnip-like interface would be a great idea. I don't think that the exact same object structure and class are needed, though. For example, the mode of "partition" is only required if you fit it into an object of class `model_spec`.
I think that an alternate class name (`cluster_spec`?) would be good, with some minor differences. For example, the distance component is one that doesn't fit with a model spec.
In other words, the API is good, but using the entirety of the existing model specification format will be ungainly or kludgy.
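To make the distinction concrete, here is a minimal sketch of what a separate `cluster_spec` object could look like, with a distance component that has no analogue in parsnip's `model_spec`. The constructor name and field layout are invented for illustration, not taken from any existing package.

```r
# Hypothetical sketch: a `cluster_spec` object carrying a distance component.
# All names here are invented for illustration.
new_cluster_spec <- function(args, distance = "euclidean", engine = NULL) {
  structure(
    list(args = args, distance = distance, engine = engine),
    class = "cluster_spec"  # deliberately not "model_spec"
  )
}

spec <- new_cluster_spec(args = list(k = 4), distance = "manhattan")
class(spec)    # "cluster_spec"
spec$distance  # "manhattan"
```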
> It's been suggested that "fit" might not be the right verb for unsupervised methods. Technically, many supervised methods are not truly fitting a model either (see: K nearest neighbors), so I'm not worried about this.
>
> ## `rsample` and cross-validating
Is bootstrapping used at all in cluster validation? Would something like this come into play with the potential for tuning mentioned below?
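Bootstrapping does appear in cluster validation, most often as a stability check: refit on bootstrap samples and compare the induced labels against a reference fit. A base-R sketch of that idea (no rsample here; the Rand index is used because it is invariant to label permutation):

```r
# Bootstrap stability check for k-means, base R only.
# Rand index: proportion of point pairs on which two labelings agree.
rand_index <- function(a, b) {
  pairs  <- combn(length(a), 2)
  same_a <- a[pairs[1, ]] == a[pairs[2, ]]
  same_b <- b[pairs[1, ]] == b[pairs[2, ]]
  mean(same_a == same_b)
}

set.seed(123)
x   <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])
ref <- stats::kmeans(x, centers = 3, nstart = 10)

stability <- replicate(20, {
  idx  <- sample(nrow(x), replace = TRUE)
  boot <- stats::kmeans(x[idx, , drop = FALSE], centers = 3, nstart = 10)
  # assign every original point to its nearest bootstrap centroid
  d <- apply(boot$centers, 1, function(ctr) colSums((t(x) - ctr)^2))
  rand_index(max.col(-d), ref$cluster)
})
mean(stability)  # values near 1 suggest a stable partition
```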
> ## `yardstick`-like metrics
>
> Unsupervised methods have very different (and more ambiguous) metrics for "success" than supervised methods. (see above for lots and lots of detail)
Perhaps we could change from the nomenclature of "performance metrics" to something like "partition characteristics", the former being more of a supervised learning thing.
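One simple example of a quantity in that "partition characteristic" spirit: the within-cluster sum of squares as a share of the total sum of squares describes the partition itself, with no held-out truth involved. A base-R sketch using fields that `stats::kmeans` already returns:

```r
# A "partition characteristic" rather than a performance metric:
# share of total sum of squares left within clusters.
set.seed(42)
x   <- scale(mtcars[, c("mpg", "wt")])
fit <- stats::kmeans(x, centers = 3, nstart = 10)

wss_ratio <- fit$tot.withinss / fit$totss
wss_ratio  # lies in (0, 1); smaller means tighter clusters
```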
This is my writeup with some details about clustering (unsupervised learning) and how I envision that fitting into the `tidymodels` framework.