Tidyclust #12

Open · wants to merge 6 commits into main

Conversation

kbodwin

@kbodwin kbodwin commented Feb 26, 2021

This is my writeup with some details about clustering (unsupervised learning) and how I envision that fitting into the tidymodels framework.

Member

@juliasilge juliasilge left a comment


This is super exciting to see! 🙌

tidy-clustering/README.Rmd

## `parsnip`-like models

Obviously, any and all clustering methods would have to be implemented as model
Member


I might take out "obviously" here because if a goal is for folks to be able to use these as preprocessing steps, then building this out using a recipes framework might be worth considering (rather than parsnip).

Author


I edited the sentence, but I disagree with using the recipes framework.

Even if an unsupervised procedure informs a preprocessing step, the application of the clustering method is not itself preprocessing: the data are not altered by the search procedure.

Besides which, I think things could get a little confusing when clustering is the end point in and of itself. Would it appear as a recipe with no "baking" step? That feels strange to me.

kbodwin and others added 2 commits March 1, 2021 09:45
Co-authored-by: Julia Silge <julia.silge@gmail.com>
@alexpghayes

Just wanted to comment that for unsupervised methods there is often both a forward and a backward transformation. Less so with clustering, but for many PCA-like tools. I previously brought this up a little in tidymodels/recipes#264.

Another possible consideration before starting to prototype things is the difference between inductive/transductive models, or methods that can be applied to a new dataset versus those that cannot.

@michaelgaunt404

Heyo! Really excited to see this developing. I've recently fallen into the world of unsupervised clustering (via some gnarly text projects) and have been having a hard time understanding the literature and finding certain methods in tidymodels.

@brshallo

Very cool! Looking forward to developments here and this functionality coming to tidymodels!

I posted a toy solution on SO for validating k-means cluster partition stability on a holdout set: https://stackoverflow.com/a/68845111/9059865. (For anyone stumbling onto this thread and looking for something simple in the interim before {celery} 😊 gets implemented in tidymodels.)

Member

@topepo topepo left a comment


Overall it is great. My comments mostly reflect the idea that a different, but very similar, class of objects should be used to represent a clustering object.


## `parsnip`-like models

Any and all clustering methods could be implemented as model spec functions, and the setup would look similar, e.g.
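As a hedged illustration of the "e.g." above (the function name `k_means`, the `"partition"` mode, and the registration details are all hypothetical, not an existing tidymodels API), such a spec function might be sketched as:

```r
# Hypothetical sketch only: k_means() and the "partition" mode are
# illustrative; a real implementation would also need to register the
# model, mode, and engine with parsnip before this spec could be fit.
library(parsnip)

k_means <- function(mode = "partition", k = NULL) {
  new_model_spec(
    "k_means",
    args = list(k = rlang::enquo(k)),  # tunable args stored as quosures
    eng_args = NULL,
    mode = mode,
    method = NULL,
    engine = "stats"                   # e.g. dispatch to stats::kmeans()
  )
}

# Intended usage, mirroring existing parsnip models:
# k_means(k = 3) %>%
#   fit(~ ., data = mtcars)
```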
Member


I agree that a parsnip-like interface would be a great idea. I don't think that the exact same object structure and class are needed, though. For example, the mode of "partition" is only required if you fit it into an object of class model_spec.

I think that an alternate class name (cluster_spec?) would be good, with some minor differences. For example, the distance component is one that doesn't fit with a model spec.

In other words, the API is good, but using the entirety of the existing model specification format will be ungainly or kludgy.
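As a rough sketch of this alternate class (every name here is hypothetical), a `cluster_spec` constructor could parallel the shape of `parsnip::new_model_spec()` while carrying clustering-specific components such as a distance:

```r
# Hypothetical constructor for a clustering specification. It mirrors
# the shape of a parsnip model spec but uses its own class, so that
# clustering-only components (e.g. a distance) have somewhere to live
# without overloading model_spec.
new_cluster_spec <- function(cls, args, distance, engine = NULL) {
  spec <- list(
    args     = args,      # quosures of tunable arguments, as in parsnip
    distance = distance,  # e.g. "euclidean"; no analogue in model_spec
    engine   = engine
  )
  class(spec) <- c(cls, "cluster_spec")
  spec
}

# A spec function built on top of it:
k_means <- function(k = NULL, distance = "euclidean") {
  new_cluster_spec(
    "k_means",
    args = list(k = rlang::enquo(k)),
    distance = distance
  )
}
```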


It's been suggested that "fit" might not be the right verb for unsupervised methods. Technically, many supervised methods are not truly fitting a model either (see: k-nearest neighbors), so I'm not worried about this.

## `rsample` and cross-validating
Member


Is bootstrapping used at all in cluster validation? Would something like this come into play with the potential for tuning mentioned below?
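Bootstrapping does appear in cluster validation, typically as a stability check: refit on resamples and ask how much the partition moves. A minimal sketch using rsample and `stats::kmeans()` (the summarization step is only described, not implemented):

```r
# Sketch: refit k-means on bootstrap resamples of the data and collect
# the fitted centers from each resample. Illustrative only.
library(rsample)
library(purrr)

set.seed(123)
boots <- bootstraps(mtcars, times = 25)

centers_per_resample <- map(boots$splits, function(split) {
  # analysis() extracts the bootstrap sample; refit on it
  km <- kmeans(scale(analysis(split)), centers = 3, nstart = 10)
  km$centers
})

# Stability could then be summarized by matching centers across
# resamples, or by agreement (e.g. adjusted Rand index) between the
# full data's assignments under each resample's centers.
```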


## `yardstick`-like metrics

Unsupervised methods have very different (and more ambiguous) metrics for "success" than supervised methods. (see above for lots and lots of detail)
Member


Perhaps we could change from the nomenclature of "performance metrics" to something like "partition characteristics", the former being more of a supervised learning thing.
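To make the "partition characteristic" idea concrete, one such quantity is average silhouette width, computable today from `cluster::silhouette()`. A hedged sketch (the wrapper name `avg_silhouette` is hypothetical, not an existing yardstick function):

```r
# Hypothetical yardstick-style wrapper around cluster::silhouette().
# Note it returns one number per partition, not one per row, which is
# part of why "performance metric" sits oddly here.
library(cluster)

avg_silhouette <- function(data, cluster_ids) {
  d   <- dist(scale(data))
  sil <- silhouette(as.integer(cluster_ids), d)
  mean(sil[, "sil_width"])
}

km <- kmeans(scale(mtcars), centers = 3, nstart = 10)
avg_silhouette(mtcars, km$cluster)
```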
