From 71c7504842afe0616370b7e4418a67b68384124e Mon Sep 17 00:00:00 2001 From: kbodwin Date: Thu, 25 Feb 2021 15:00:13 -0800 Subject: [PATCH 1/6] readme --- tidy-clustering/README.Rmd | 157 +++++++++++++++++++++++++++++++++++++ 1 file changed, 157 insertions(+) create mode 100644 tidy-clustering/README.Rmd diff --git a/tidy-clustering/README.Rmd b/tidy-clustering/README.Rmd new file mode 100644 index 0000000..9bf4880 --- /dev/null +++ b/tidy-clustering/README.Rmd @@ -0,0 +1,157 @@ + +```{r} +library(tidyverse) +library(tidymodels) +``` + +# Overview + +`tidymodels` currently only implements supervised learning methods. This +document explores the possibility of including unsupervised methods in the same +framework. + +## Basic Definition + +I'll use the term "clustering" to refer generally to any method that does the +following: + +1. Define a **measure of similarity** between objects. + +2. Discover **groups of objects** that are similar. + +Many of the thoughts in this section are taken from [this paper](http://proceedings.mlr.press/v27/luxburg12a/luxburg12a.pdf) + +## Uses of Clustering + +1. **Semi-Supervised Learning:** If we have class labels on *some* of the objects, +we can apply unsupervised clustering, then let the clusters be defined by their +class enrichment of labelled objects. + + + A word of caution for this approach: Just because a clustering structure + doesn't align with known labels doesn't mean it is "wrong". It could be + capturing a different (true) aspect of the data than the one we have labels for. + +2. **EDA:** Sometimes clustering is applied as a first exploratory step, to get +a sense of the structure of the data. This is somewhat nebulous, and usually +involves eyeballing a visualization. + + + In my experience this is usually a precursor to classification. We'd do + an unsupervised clustering and see how well the objects "group up". This + gives us an idea of whether the measurements we're using for classification + are a good choice, before we fit a formal model. + +3. **Pre-processing:** Clustering can be used to discover relationships in data +that are undesirable, so that we can *residualize* or *decorrelate* the objects +before applying an analysis. + + + A great example of this is in genetics, where we have measurements of gene + expression for several subjects. Typically, gene expression is most + strongly correlated by race. If we cluster the subjects on gene expression, + we can then identify unwanted dependence to remove from the data. + +4. **Clusters as analysis:** Sometimes, assignment of cluster membership *is* the +end goal of the study. For example: + + + In the Enron corruption case in 2001, researchers created a network based + on who emailed who within the company. They then looked at which clusters + contained known consiprators, and investigated the other individuals in those + groups. + + + In the early days of breast cancer genetic studies, researchers clustered + known patients on genetic expression, which led to the discovery of different + tumor types (e.g. Basal, Her-2, Luminal). These have later been clinically + validated and better defined. + + +## Ways to validate a cluster + +The major departure from supervised learning is this: With a supervised method, +we have a very clear way to measure success, namely, how well does it predict? + +With clustering, there is no "right answer" to compare results againsts. + +There are several ways people typically validate a clustering result: + +1. 
**within-group versus without-group similarity:** The goal is to find +groups of similar objects. Thus, we can check how close objects in the same +cluster are as compared to how close objects in different clusters are. + + + A problem with this is that there's not objective baseline about what is + a "good" ratio. + + (Gao, Witten, Bien have a cool new paper suggesting a statistical test + for cluster concentration.) + + +2. **Stability:** If we regard the objects being clustered as a random subset +of a population, we can ask whether the same cluster structure would have emerged +in a different random subset. We can measure this with bootstrapped subsampling. + + + A cluster structure being stable doesn't necessarily mean it is meaningful. + See e.g. [this tweet](https://twitter.com/AndrewZalesky/status/1363741266761027585) + + +3. **Enrichment in known labels:** (See *semi-supervised learning* above) + +4. **Statistical significance/generative models:** If our clustering method +places model assumptions on how the data was generated, we can formulate a notion +of statistically signficant clusters. + + + A good example of this is "model-based clustering", which typically assumes + the data is generated from a mixture of multivariate Gaussians. We can then + estimate the probability that a particular object came from a particular + distribution in the mixture. + + +## Other details + +#### Partitions versus Extractions + +In many clustering algorithms, each object is placed into **exactly one** cluster. +This is called a **partition**. + +However, some algorithms allow for overlapping clusters, i.e. objects belonging to +multiple clusters - or for some objects to be "background" and have no cluster +membership at all. + +(This is sometimes called **community detection** or **cluster extraction**) + +#### What are the variables, what are the samples? + +Consider the example of data consisting of many gene expression measurements +for several individual subjects. + +From a statistical perspective, we'd typically regard the subjects as samples +from a population, and the gene expression levels as variables being studied. + +When we cluster this data, sometimes we look for clusters of **subjects**, i.e., +people whose gene signature is similar. Sometimes we look for clusters of **genes**, +i.e., genes that activate or deactivate together. + +Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we +cluster on. In statistical algorithms, your model assumptions need to match +your notion of samples/variables. + +#### Beyond the distance matrix + +In the vast majority of cases, all you need for a clustering algorithm is a +**similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity +measure between object $i$ and object $j$. + +Of course, the choice of similarity measure matters enormously. But the point is +you only need pairwise similarities to run e.g. kmeans. + +Some approaches, though, need more information. Statistical approaches in particular +often estimate nuisance variables (like variance) from multiple measurements before +collapsing the data into pairwise similarities. 
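
To make this concrete, here is a minimal sketch using only base R, where `x` stands in for a hypothetical numeric matrix of objects by measurements:

```{r, eval = FALSE}
# Hierarchical clustering needs nothing beyond pairwise dissimilarities:
# a distance matrix is the only input the algorithm ever sees.
d  <- dist(x, method = "euclidean")   # pairwise dissimilarities between rows of x
hc <- hclust(d, method = "complete")  # clusters built from the entries of d alone
cutree(hc, k = 3)                     # a partition of the objects into 3 clusters
```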
+ + +# Inclusion in tidymodels + + + +# Existing framework + + + +# New framework needed + From b10b8abf67fc660631d62c74cde78f047dd87397 Mon Sep 17 00:00:00 2001 From: kbodwin Date: Thu, 25 Feb 2021 16:50:14 -0800 Subject: [PATCH 2/6] update tidy clustering readme --- tidy-clustering/README.Rmd | 159 ++++++++++++- tidy-clustering/README.html | 422 +++++++++++++++++++++++++++++++++ tidy-clustering/README.utf8.md | 299 +++++++++++++++++++++++ 3 files changed, 870 insertions(+), 10 deletions(-) create mode 100644 tidy-clustering/README.html create mode 100644 tidy-clustering/README.utf8.md diff --git a/tidy-clustering/README.Rmd b/tidy-clustering/README.Rmd index 9bf4880..6f384d8 100644 --- a/tidy-clustering/README.Rmd +++ b/tidy-clustering/README.Rmd @@ -1,9 +1,3 @@ - -```{r} -library(tidyverse) -library(tidymodels) -``` - # Overview `tidymodels` currently only implements supervised learning methods. This @@ -131,10 +125,10 @@ Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we cluster on. In statistical algorithms, your model assumptions need to match your notion of samples/variables. -#### Beyond the distance matrix +#### Beyond the similarity matrix In the vast majority of cases, all you need for a clustering algorithm is a -**similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity +*similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity measure between object $i$ and object $j$. Of course, the choice of similarity measure matters enormously. But the point is @@ -147,11 +141,156 @@ collapsing the data into pairwise similarities. # Inclusion in tidymodels +Here's an overview of how I see unsupervised learning fitting into a tidymodels +framework. + +## Recipes and Workflow + +These steps can remain the same; there's no reason unsupervised methods need +different preprocessing options/structure. + +## `parsnip`-like models + +Obviously, any and all clustering methods would have to be implemented as model +spec functions, and the setup would look similar, e.g. + +```{r, eval = FALSE} +km_mod <- k_means(centers = 5) %>% + set_mode("partition") %>% + set_engine("stats") +``` + + +In my opinion, it makes sense to create a "fraternal twin" package to `parsnip` +that is essentially the same structure. The reason I think this should be a +separate package is that there are two major philosophical shifts for +supervised vs. unsupervised learning. + +I'm going to refer to the twin package as `celery` for now because that name makes +me laugh. + +#### Modes + +In `parsnip`, the possible modes are *classification* or *regression*, and they +correspond to the nature of the response variable. + +In `celery`, there is no response variable. I believe the modes should be +*partition* or *extraction*. (see above for details) The way that a model +fit should output results is fundamentally different for those two tasks. + +I don't really see a parallel between the classification/regression divide and the +partition/extraction divide. It doesn't feel right to just lump them together +into four "modes" options. + +#### Choosing a dissimilarity metric + +Every model comes with choices about parameters and assumptions and such. To +some degree, this would work the same in `parsnip` and `celery`: + +```{r, eval = FALSE} +knn_mod <- nearest_neighbors(neighbors = 5) + +km_mod <- k_means(centers = 5) + +``` + +However, for clustering, the decision of how to define "dissimilarity" feels like +a more major decision. 
That is, supervised methods require **one major choice** +(what model) and then many minor choices (the parameters/penalties) for that model. +Unsupervised methods require **two major choices**: the algorithm and the similarity +measure. + +I would like to see the dissimilarity measure be given "front-row billing", to + +a. make it easy to swap between choices and compare outputs +b. make it obvious what we are looking for in our clusters +c. force the user to choose very deliberately rather than relying on defaults + +For example, +```{r, eval = FALSE} +k_means(centers = 5) %>% + set_dist("euclidean") -# Existing framework +k_means(centers = 5) %>% + set_dist("manhattan") +``` + +Much like with *modes*, this would require the `celery` package to pre-specify +which disssimilarities were available for particular clustering algorithms. For +example, some methods require a true distance measure, while others are less strict. + +These would have to be more flexible than modes, though, since clustering methods +have **so** many possible dissimilarity measures. + +## fitting to data + +There is no real reason the `fit` step needs to look different, once a model +is specified: + +```{r, eval = FALSE} + +km_mod <- k_means(centers = 5) %>% + set_dist("euclidean") %>% + set_mode("partition") %>% + set_engine("stats") + + +km_mod %>% + fit(vars(bill_length_mm:body_mass_g), data = penguins) +``` + +It's been suggested that "fit" might not be the right verb for unsupervised +methods. Technicallly, many supervised methods are not truly fitting a model either +(see: K nearest neighbors), so I'm not worried about this. + +## `rsample` and cross-validating + +Cross-validation as such doesn't make sense for clustering, because there is no +notion of prediction on the test set. + +However, a similar framework could make sense for subsampling. Instead +of $v$ cross-validations, we'd have $v$ subsamples, and we'd look at certain +success metrics for each subsample to get a sense of variability in the metric. + + +## `yardstick`-like metrics + +Unsupervised methods have very different (and more ambiguous) metrics for "success" +than supervised methods. (see above for lots and lots of detail) + +The underlying structure of the package doesn't need to change like it would for +`parsnip`/`celery`. We just need more functions implementing metrics that make +sense for clustering. + +# Meta: What does this buy us? + +Here are the advantages to a tidymodels implementation of clustering: + +1. **Consistency and clarity:** It's nice that all my modeling code looks similar +even for different analyses with different models. I'd like that to extend to +unsupervised analyses. + +2. **Easy framework for beginners:** One problem tidymodels solves is that +beginners tend to shove their data into a model until they get an output without +error. Tidymodels separates the important decisions in the data processing and +the model specification and the validation. This advantage would also exist in +the unsupervised case. + +3. **Comparison of methods:** Sometimes it's hard to decide between clustering +algorithms. There's less of a clear measure of success in unsupervised learning - +but it would still be a huge help if it were easier to "just try a bunch of things +and see what works". + +4. **Wrappers to automated repetition:** In the same way that tidymodels condenses +the v-fold cross-validation process into a single function, it could condense +subsampling into a single function. 
Also, I haven't said much about tuning, but +it'd be great to be able to do something like: + +```{r, eval = FALSE} +k_means(clusters = tune()) +``` -# New framework needed diff --git a/tidy-clustering/README.html b/tidy-clustering/README.html new file mode 100644 index 0000000..5c2ce01 --- /dev/null +++ b/tidy-clustering/README.html @@ -0,0 +1,422 @@ + + + + + + + + + + + + + +README.utf8 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + diff --git a/tidy-clustering/README.utf8.md b/tidy-clustering/README.utf8.md new file mode 100644 index 0000000..6e5be45 --- /dev/null +++ b/tidy-clustering/README.utf8.md @@ -0,0 +1,299 @@ +# Overview + +`tidymodels` currently only implements supervised learning methods. This +document explores the possibility of including unsupervised methods in the same +framework. + +## Basic Definition + +I'll use the term "clustering" to refer generally to any method that does the +following: + +1. Define a **measure of similarity** between objects. + +2. Discover **groups of objects** that are similar. + +Many of the thoughts in this section are taken from [this paper](http://proceedings.mlr.press/v27/luxburg12a/luxburg12a.pdf) + +## Uses of Clustering + +1. **Semi-Supervised Learning:** If we have class labels on *some* of the objects, +we can apply unsupervised clustering, then let the clusters be defined by their +class enrichment of labelled objects. + + + A word of caution for this approach: Just because a clustering structure + doesn't align with known labels doesn't mean it is "wrong". It could be + capturing a different (true) aspect of the data than the one we have labels for. + +2. **EDA:** Sometimes clustering is applied as a first exploratory step, to get +a sense of the structure of the data. This is somewhat nebulous, and usually +involves eyeballing a visualization. + + + In my experience this is usually a precursor to classification. We'd do + an unsupervised clustering and see how well the objects "group up". This + gives us an idea of whether the measurements we're using for classification + are a good choice, before we fit a formal model. + +3. **Pre-processing:** Clustering can be used to discover relationships in data +that are undesirable, so that we can *residualize* or *decorrelate* the objects +before applying an analysis. + + + A great example of this is in genetics, where we have measurements of gene + expression for several subjects. Typically, gene expression is most + strongly correlated by race. If we cluster the subjects on gene expression, + we can then identify unwanted dependence to remove from the data. + +4. **Clusters as analysis:** Sometimes, assignment of cluster membership *is* the +end goal of the study. For example: + + + In the Enron corruption case in 2001, researchers created a network based + on who emailed who within the company. They then looked at which clusters + contained known consiprators, and investigated the other individuals in those + groups. + + + In the early days of breast cancer genetic studies, researchers clustered + known patients on genetic expression, which led to the discovery of different + tumor types (e.g. Basal, Her-2, Luminal). These have later been clinically + validated and better defined. + + +## Ways to validate a cluster + +The major departure from supervised learning is this: With a supervised method, +we have a very clear way to measure success, namely, how well does it predict? + +With clustering, there is no "right answer" to compare results againsts. + +There are several ways people typically validate a clustering result: + +1. **within-group versus without-group similarity:** The goal is to find +groups of similar objects. Thus, we can check how close objects in the same +cluster are as compared to how close objects in different clusters are. + + + A problem with this is that there's not objective baseline about what is + a "good" ratio. 
+ + (Gao, Witten, Bien have a cool new paper suggesting a statistical test + for cluster concentration.) + + +2. **Stability:** If we regard the objects being clustered as a random subset +of a population, we can ask whether the same cluster structure would have emerged +in a different random subset. We can measure this with bootstrapped subsampling. + + + A cluster structure being stable doesn't necessarily mean it is meaningful. + See e.g. [this tweet](https://twitter.com/AndrewZalesky/status/1363741266761027585) + + +3. **Enrichment in known labels:** (See *semi-supervised learning* above) + +4. **Statistical significance/generative models:** If our clustering method +places model assumptions on how the data was generated, we can formulate a notion +of statistically signficant clusters. + + + A good example of this is "model-based clustering", which typically assumes + the data is generated from a mixture of multivariate Gaussians. We can then + estimate the probability that a particular object came from a particular + distribution in the mixture. + + +## Other details + +#### Partitions versus Extractions + +In many clustering algorithms, each object is placed into **exactly one** cluster. +This is called a **partition**. + +However, some algorithms allow for overlapping clusters, i.e. objects belonging to +multiple clusters - or for some objects to be "background" and have no cluster +membership at all. + +(This is sometimes called **community detection** or **cluster extraction**) + +#### What are the variables, what are the samples? + +Consider the example of data consisting of many gene expression measurements +for several individual subjects. + +From a statistical perspective, we'd typically regard the subjects as samples +from a population, and the gene expression levels as variables being studied. + +When we cluster this data, sometimes we look for clusters of **subjects**, i.e., +people whose gene signature is similar. Sometimes we look for clusters of **genes**, +i.e., genes that activate or deactivate together. + +Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we +cluster on. In statistical algorithms, your model assumptions need to match +your notion of samples/variables. + +#### Beyond the similarity matrix + +In the vast majority of cases, all you need for a clustering algorithm is a +*similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity +measure between object $i$ and object $j$. + +Of course, the choice of similarity measure matters enormously. But the point is +you only need pairwise similarities to run e.g. kmeans. + +Some approaches, though, need more information. Statistical approaches in particular +often estimate nuisance variables (like variance) from multiple measurements before +collapsing the data into pairwise similarities. + + +# Inclusion in tidymodels + +Here's an overview of how I see unsupervised learning fitting into a tidymodels +framework. + +## Recipes and Workflow + +These steps can remain the same; there's no reason unsupervised methods need +different preprocessing options/structure. + +## `parsnip`-like models + +Obviously, any and all clustering methods would have to be implemented as model +spec functions, and the setup would look similar, e.g. + + +```r +km_mod <- k_means(centers = 5) %>% + set_mode("partition") %>% + set_engine("stats") +``` + + +In my opinion, it makes sense to create a "fraternal twin" package to `parsnip` +that is essentially the same structure. 
The reason I think this should be a +separate package is that there are two major philosophical shifts for +supervised vs. unsupervised learning. + +I'm going to refer to the twin package as `celery` for now because that name makes +me laugh. + +#### Modes + +In `parsnip`, the possible modes are *classification* or *regression*, and they +correspond to the nature of the response variable. + +In `celery`, there is no response variable. I believe the modes should be +*partition* or *extraction*. (see above for details) The way that a model +fit should output results is fundamentally different for those two tasks. + +I don't really see a parallel between the classification/regression divide and the +partition/extraction divide. It doesn't feel right to just lump them together +into four "modes" options. + +#### Choosing a dissimilarity metric + +Every model comes with choices about parameters and assumptions and such. To +some degree, this would work the same in `parsnip` and `celery`: + + +```r +knn_mod <- nearest_neighbors(neighbors = 5) + +km_mod <- k_means(centers = 5) +``` + +However, for clustering, the decision of how to define "dissimilarity" feels like +a more major decision. That is, supervised methods require **one major choice** +(what model) and then many minor choices (the parameters/penalties) for that model. +Unsupervised methods require **two major choices**: the algorithm and the similarity +measure. + +I would like to see the dissimilarity measure be given "front-row billing", to + +a. make it easy to swap between choices and compare outputs +b. make it obvious what we are looking for in our clusters +c. force the user to choose very deliberately rather than relying on defaults + +For example, + + +```r +k_means(centers = 5) %>% + set_dist("euclidean") + +k_means(centers = 5) %>% + set_dist("manhattan") +``` + +Much like with *modes*, this would require the `celery` package to pre-specify +which disssimilarities were available for particular clustering algorithms. For +example, some methods require a true distance measure, while others are less strict. + +These would have to be more flexible than modes, though, since clustering methods +have **so** many possible dissimilarity measures. + +## fitting to data + +There is no real reason the `fit` step needs to look different, once a model +is specified: + + +```r +km_mod <- k_means(centers = 5) %>% + set_dist("euclidean") %>% + set_mode("partition") %>% + set_engine("stats") + + +km_mod %>% + fit(vars(bill_length_mm:body_mass_g), data = penguins) +``` + +It's been suggested that "fit" might not be the right verb for unsupervised +methods. Technicallly, many supervised methods are not truly fitting a model either +(see: K nearest neighbors), so I'm not worried about this. + +## `rsample` and cross-validating + +Cross-validation as such doesn't make sense for clustering, because there is no +notion of prediction on the test set. + +However, a similar framework could make sense for subsampling. Instead +of $v$ cross-validations, we'd have $v$ subsamples, and we'd look at certain +success metrics for each subsample to get a sense of variability in the metric. + + +## `yardstick`-like metrics + +Unsupervised methods have very different (and more ambiguous) metrics for "success" +than supervised methods. (see above for lots and lots of detail) + +The underlying structure of the package doesn't need to change like it would for +`parsnip`/`celery`. We just need more functions implementing metrics that make +sense for clustering. 
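
As a rough sketch of what one such metric could look like, here is a hypothetical helper (the name `avg_silhouette` and its interface are made up; the heavy lifting is done by the existing `cluster::silhouette()`):

```r
library(cluster)

# Hypothetical helper: mean silhouette width of a partition, given the
# cluster assignments and the dissimilarity matrix they were built from.
avg_silhouette <- function(cluster_assignments, dissimilarity) {
  sil <- silhouette(as.integer(cluster_assignments), dissimilarity)
  mean(sil[, "sil_width"])
}
```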
+ +# Meta: What does this buy us? + +Here are the advantages to a tidymodels implementation of clustering: + +1. **Consistency and clarity:** It's nice that all my modeling code looks similar +even for different analyses with different models. I'd like that to extend to +unsupervised analyses. + +2. **Easy framework for beginners:** One problem tidymodels solves is that +beginners tend to shove their data into a model until they get an output without +error. Tidymodels separates the important decisions in the data processing and +the model specification and the validation. This advantage would also exist in +the unsupervised case. + +3. **Comparison of methods:** Sometimes it's hard to decide between clustering +algorithms. There's less of a clear measure of success in unsupervised learning - +but it would still be a huge help if it were easier to "just try a bunch of things +and see what works". + +4. **Wrappers to automated repetition:** In the same way that tidymodels condenses +the v-fold cross-validation process into a single function, it could condense +subsampling into a single function. Also, I haven't said much about tuning, but +it'd be great to be able to do something like: + + +```r +k_means(clusters = tune()) +``` + + + + From 9455315acbb8ed0b902044564ec99d27d86a859c Mon Sep 17 00:00:00 2001 From: "Kelly N. Bodwin" Date: Mon, 1 Mar 2021 09:45:30 -0800 Subject: [PATCH 3/6] Update tidy-clustering/README.Rmd Co-authored-by: Julia Silge --- tidy-clustering/README.Rmd | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/tidy-clustering/README.Rmd b/tidy-clustering/README.Rmd index 6f384d8..049edd4 100644 --- a/tidy-clustering/README.Rmd +++ b/tidy-clustering/README.Rmd @@ -1,6 +1,6 @@ # Overview -`tidymodels` currently only implements supervised learning methods. This +`tidymodels` currently focuses on supervised learning methods, with unsupervised methods in [recipes](https://recipes.tidymodels.org/) and related-packages like [embed](https://embed.tidymodels.org/) mainly framed as preprocessing steps for supervised learning. This document explores the possibility of including unsupervised methods in the same framework. @@ -293,4 +293,3 @@ k_means(clusters = tune()) - From 3f1a0ba775de61a51fa0ff3505ce95efa777df09 Mon Sep 17 00:00:00 2001 From: "Kelly N. Bodwin" Date: Mon, 1 Mar 2021 09:47:00 -0800 Subject: [PATCH 4/6] Update README.Rmd --- tidy-clustering/README.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tidy-clustering/README.Rmd b/tidy-clustering/README.Rmd index 049edd4..12eea77 100644 --- a/tidy-clustering/README.Rmd +++ b/tidy-clustering/README.Rmd @@ -151,7 +151,7 @@ different preprocessing options/structure. ## `parsnip`-like models -Obviously, any and all clustering methods would have to be implemented as model +Any and all clustering methods could be implemented as model spec functions, and the setup would look similar, e.g. 
```{r, eval = FALSE} From 6d2b445d97831eea6d14cb3866021e1d7b2e139f Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Tue, 21 Sep 2021 10:17:51 -0400 Subject: [PATCH 5/6] minor changes, remove hard breaks --- tidy-clustering/README.Rmd | 161 ++++++++++--------------------------- 1 file changed, 44 insertions(+), 117 deletions(-) diff --git a/tidy-clustering/README.Rmd b/tidy-clustering/README.Rmd index 12eea77..6d6aeaa 100644 --- a/tidy-clustering/README.Rmd +++ b/tidy-clustering/README.Rmd @@ -1,13 +1,10 @@ # Overview -`tidymodels` currently focuses on supervised learning methods, with unsupervised methods in [recipes](https://recipes.tidymodels.org/) and related-packages like [embed](https://embed.tidymodels.org/) mainly framed as preprocessing steps for supervised learning. This -document explores the possibility of including unsupervised methods in the same -framework. +`tidymodels` currently focuses on supervised learning methods, with unsupervised methods in [recipes](https://recipes.tidymodels.org/) and related-packages like [embed](https://embed.tidymodels.org/) mainly framed as preprocessing steps for supervised learning. This document explores the possibility of including unsupervised methods in the same framework. ## Basic Definition -I'll use the term "clustering" to refer generally to any method that does the -following: +I'll use the term "clustering" to refer generally to any method that does the following: 1. Define a **measure of similarity** between objects. @@ -17,34 +14,27 @@ Many of the thoughts in this section are taken from [this paper](http://proceedi ## Uses of Clustering -1. **Semi-Supervised Learning:** If we have class labels on *some* of the objects, -we can apply unsupervised clustering, then let the clusters be defined by their -class enrichment of labelled objects. +1. **Semi-Supervised Learning:** If we have class labels on *some* of the objects, we can apply unsupervised clustering, then let the clusters be defined by their class enrichment of labelled objects. + A word of caution for this approach: Just because a clustering structure doesn't align with known labels doesn't mean it is "wrong". It could be capturing a different (true) aspect of the data than the one we have labels for. -2. **EDA:** Sometimes clustering is applied as a first exploratory step, to get -a sense of the structure of the data. This is somewhat nebulous, and usually -involves eyeballing a visualization. +2. **EDA:** Sometimes clustering is applied as a first exploratory step, to get a sense of the structure of the data. This is somewhat nebulous, and usually involves eyeballing a visualization. + In my experience this is usually a precursor to classification. We'd do an unsupervised clustering and see how well the objects "group up". This gives us an idea of whether the measurements we're using for classification are a good choice, before we fit a formal model. -3. **Pre-processing:** Clustering can be used to discover relationships in data -that are undesirable, so that we can *residualize* or *decorrelate* the objects -before applying an analysis. +3. **Pre-processing:** Clustering can be used to discover relationships in data that are undesirable, so that we can *residualize* or *decorrelate* the objects before applying an analysis. + A great example of this is in genetics, where we have measurements of gene expression for several subjects. Typically, gene expression is most strongly correlated by race. 
If we cluster the subjects on gene expression, we can then identify unwanted dependence to remove from the data. -4. **Clusters as analysis:** Sometimes, assignment of cluster membership *is* the -end goal of the study. For example: +4. **Clusters as analysis:** Sometimes, assignment of cluster membership *is* the end goal of the study. For example: + In the Enron corruption case in 2001, researchers created a network based on who emailed who within the company. They then looked at which clusters @@ -59,16 +49,13 @@ end goal of the study. For example: ## Ways to validate a cluster -The major departure from supervised learning is this: With a supervised method, -we have a very clear way to measure success, namely, how well does it predict? +The major departure from supervised learning is this: With a supervised method, we have a very clear way to measure success, namely, how well does it predict? With clustering, there is no "right answer" to compare results againsts. There are several ways people typically validate a clustering result: -1. **within-group versus without-group similarity:** The goal is to find -groups of similar objects. Thus, we can check how close objects in the same -cluster are as compared to how close objects in different clusters are. +1. **within-group versus without-group similarity:** The goal is to find groups of similar objects. Thus, we can check how close objects in the same cluster are as compared to how close objects in different clusters are. + A problem with this is that there's not objective baseline about what is a "good" ratio. @@ -76,9 +63,7 @@ cluster are as compared to how close objects in different clusters are. for cluster concentration.) -2. **Stability:** If we regard the objects being clustered as a random subset -of a population, we can ask whether the same cluster structure would have emerged -in a different random subset. We can measure this with bootstrapped subsampling. +2. **Stability:** If we regard the objects being clustered as a random subset of a population, we can ask whether the same cluster structure would have emerged in a different random subset. We can measure this with bootstrapped subsampling. + A cluster structure being stable doesn't necessarily mean it is meaningful. See e.g. [this tweet](https://twitter.com/AndrewZalesky/status/1363741266761027585) @@ -86,9 +71,7 @@ in a different random subset. We can measure this with bootstrapped subsampling 3. **Enrichment in known labels:** (See *semi-supervised learning* above) -4. **Statistical significance/generative models:** If our clustering method -places model assumptions on how the data was generated, we can formulate a notion -of statistically signficant clusters. +4. **Statistical significance/generative models:** If our clustering method places model assumptions on how the data was generated, we can formulate a notion of statistically signficant clusters. + A good example of this is "model-based clustering", which typically assumes the data is generated from a mixture of multivariate Gaussians. We can then @@ -100,59 +83,42 @@ of statistically signficant clusters. #### Partitions versus Extractions -In many clustering algorithms, each object is placed into **exactly one** cluster. -This is called a **partition**. +In many clustering algorithms, each object is placed into **exactly one** cluster. This is called a **partition**. -However, some algorithms allow for overlapping clusters, i.e. 
objects belonging to -multiple clusters - or for some objects to be "background" and have no cluster -membership at all. +However, some algorithms allow for overlapping clusters, i.e. objects belonging to multiple clusters - or for some objects to be "background" and have no cluster membership at all. (This is sometimes called **community detection** or **cluster extraction**) #### What are the variables, what are the samples? -Consider the example of data consisting of many gene expression measurements -for several individual subjects. +Consider the example of data consisting of many gene expression measurements for several individual subjects. -From a statistical perspective, we'd typically regard the subjects as samples -from a population, and the gene expression levels as variables being studied. +From a statistical perspective, we'd typically regard the subjects as samples from a population, and the gene expression levels as variables being studied. -When we cluster this data, sometimes we look for clusters of **subjects**, i.e., -people whose gene signature is similar. Sometimes we look for clusters of **genes**, -i.e., genes that activate or deactivate together. +When we cluster this data, sometimes we look for clusters of **subjects**, i.e., people whose gene signature is similar. Sometimes we look for clusters of **genes**, i.e., genes that activate or deactivate together. -Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we -cluster on. In statistical algorithms, your model assumptions need to match -your notion of samples/variables. +Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we cluster on. In statistical algorithms, your model assumptions need to match your notion of samples/variables. #### Beyond the similarity matrix -In the vast majority of cases, all you need for a clustering algorithm is a -*similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity -measure between object $i$ and object $j$. +In the vast majority of cases, all you need for a clustering algorithm is a *similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity measure between object $i$ and object $j$. -Of course, the choice of similarity measure matters enormously. But the point is -you only need pairwise similarities to run e.g. kmeans. +Of course, the choice of similarity measure matters enormously. But the point is you only need pairwise similarities to run e.g. kmeans. -Some approaches, though, need more information. Statistical approaches in particular -often estimate nuisance variables (like variance) from multiple measurements before -collapsing the data into pairwise similarities. +Some approaches, though, need more information. Statistical approaches in particular often estimate nuisance variables (like variance) from multiple measurements before collapsing the data into pairwise similarities. # Inclusion in tidymodels -Here's an overview of how I see unsupervised learning fitting into a tidymodels -framework. +Here's an overview of how I see unsupervised learning fitting into a tidymodels framework. ## Recipes and Workflow -These steps can remain the same; there's no reason unsupervised methods need -different preprocessing options/structure. +These steps can remain the same; there's no reason unsupervised methods need different preprocessing options/structure. 
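
As a sketch, a no-outcome recipe written with today's interface would already cover the typical preprocessing for clustering (this just reuses the penguin measurements from the `fit()` example below):

```{r, eval = FALSE}
# A preprocessing recipe with no outcome: drop incomplete rows and
# put the numeric measurements on a common scale before clustering.
clust_rec <- recipe(~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
                    data = penguins) %>%
  step_naomit(all_predictors()) %>%
  step_normalize(all_numeric())
```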
## `parsnip`-like models -Any and all clustering methods could be implemented as model -spec functions, and the setup would look similar, e.g. +Any and all clustering methods could be implemented as model spec functions, and the setup would look similar, e.g. ```{r, eval = FALSE} km_mod <- k_means(centers = 5) %>% @@ -161,44 +127,29 @@ km_mod <- k_means(centers = 5) %>% ``` -In my opinion, it makes sense to create a "fraternal twin" package to `parsnip` -that is essentially the same structure. The reason I think this should be a -separate package is that there are two major philosophical shifts for -supervised vs. unsupervised learning. +In my opinion, it makes sense to create a "fraternal twin" package to `parsnip` that is essentially the same structure. The reason I think this should be a separate package is that there are two major philosophical shifts for supervised vs. unsupervised learning. -I'm going to refer to the twin package as `celery` for now because that name makes -me laugh. +I'm going to refer to the twin package as `celery` for now because that name makes me laugh. #### Modes -In `parsnip`, the possible modes are *classification* or *regression*, and they -correspond to the nature of the response variable. +In `parsnip`, the possible modes are *classification*, *censored regression*, or *regression*, and they correspond to the nature of the response variable. -In `celery`, there is no response variable. I believe the modes should be -*partition* or *extraction*. (see above for details) The way that a model -fit should output results is fundamentally different for those two tasks. +In `celery`, there is no response variable. I believe the modes should be *partition* or *extraction*. (see above for details) The way that a model fit should output results is fundamentally different for those two tasks. -I don't really see a parallel between the classification/regression divide and the -partition/extraction divide. It doesn't feel right to just lump them together -into four "modes" options. +I don't really see a parallel between the classification/regression divide and the partition/extraction divide. It doesn't feel right to just lump them together into four "modes" options. #### Choosing a dissimilarity metric -Every model comes with choices about parameters and assumptions and such. To -some degree, this would work the same in `parsnip` and `celery`: +Every model comes with choices about parameters and assumptions and such. To some degree, this would work the same in `parsnip` and `celery`: ```{r, eval = FALSE} knn_mod <- nearest_neighbors(neighbors = 5) km_mod <- k_means(centers = 5) - ``` -However, for clustering, the decision of how to define "dissimilarity" feels like -a more major decision. That is, supervised methods require **one major choice** -(what model) and then many minor choices (the parameters/penalties) for that model. -Unsupervised methods require **two major choices**: the algorithm and the similarity -measure. +However, for clustering, the decision of how to define "dissimilarity" feels like a more major decision. That is, supervised methods require **one major choice** (what model) and then many minor choices (the parameters/penalties) for that model. Unsupervised methods require **two major choices**: the algorithm and the similarity measure. 
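
As a toy illustration of why this is a major choice, the same algorithm run with two different distances can produce noticeably different partitions (a sketch; `x` is an assumed numeric matrix):

```{r, eval = FALSE}
# Same algorithm, two dissimilarities, potentially different partitions.
hc_euclidean <- hclust(dist(x, method = "euclidean"))
hc_manhattan <- hclust(dist(x, method = "manhattan"))

# Cross-tabulate the two three-cluster partitions to see how much they disagree.
table(cutree(hc_euclidean, k = 3), cutree(hc_manhattan, k = 3))
```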
I would like to see the dissimilarity measure be given "front-row billing", to @@ -216,17 +167,13 @@ k_means(centers = 5) %>% set_dist("manhattan") ``` -Much like with *modes*, this would require the `celery` package to pre-specify -which disssimilarities were available for particular clustering algorithms. For -example, some methods require a true distance measure, while others are less strict. +Much like with *modes*, this would require the `celery` package to pre-specify which disssimilarities were available for particular clustering algorithms. For example, some methods require a true distance measure, while others are less strict. -These would have to be more flexible than modes, though, since clustering methods -have **so** many possible dissimilarity measures. +These would have to be more flexible than modes, though, since clustering methods have **so** many possible dissimilarity measures. ## fitting to data -There is no real reason the `fit` step needs to look different, once a model -is specified: +There is no real reason the `fit` step needs to look different, once a model is specified: ```{r, eval = FALSE} @@ -240,52 +187,32 @@ km_mod %>% fit(vars(bill_length_mm:body_mass_g), data = penguins) ``` -It's been suggested that "fit" might not be the right verb for unsupervised -methods. Technicallly, many supervised methods are not truly fitting a model either -(see: K nearest neighbors), so I'm not worried about this. +It's been suggested that "fit" might not be the right verb for unsupervised methods. Technicallly, many supervised methods are not truly fitting a model either (see: K nearest neighbors), so I'm not worried about this. ## `rsample` and cross-validating -Cross-validation as such doesn't make sense for clustering, because there is no -notion of prediction on the test set. +Cross-validation as such doesn't make sense for clustering, because there is no notion of prediction on the test set. -However, a similar framework could make sense for subsampling. Instead -of $v$ cross-validations, we'd have $v$ subsamples, and we'd look at certain -success metrics for each subsample to get a sense of variability in the metric. +However, a similar framework could make sense for subsampling. Instead of $v$ cross-validations, we'd have $v$ subsamples, and we'd look at certain success metrics for each subsample to get a sense of variability in the metric. ## `yardstick`-like metrics -Unsupervised methods have very different (and more ambiguous) metrics for "success" -than supervised methods. (see above for lots and lots of detail) +Unsupervised methods have very different (and more ambiguous) metrics for "success" than supervised methods. (see above for lots and lots of detail) -The underlying structure of the package doesn't need to change like it would for -`parsnip`/`celery`. We just need more functions implementing metrics that make -sense for clustering. +The underlying structure of the package doesn't need to change like it would for `parsnip`/`celery`. We just need more functions implementing metrics that make sense for clustering. # Meta: What does this buy us? Here are the advantages to a tidymodels implementation of clustering: -1. **Consistency and clarity:** It's nice that all my modeling code looks similar -even for different analyses with different models. I'd like that to extend to -unsupervised analyses. - -2. **Easy framework for beginners:** One problem tidymodels solves is that -beginners tend to shove their data into a model until they get an output without -error. 
-the model specification and the validation. This advantage would also exist in
-the unsupervised case.
-
-3. **Comparison of methods:** Sometimes it's hard to decide between clustering
-algorithms. There's less of a clear measure of success in unsupervised learning -
-but it would still be a huge help if it were easier to "just try a bunch of things
-and see what works".
-
-4. **Wrappers to automated repetition:** In the same way that tidymodels condenses
-the v-fold cross-validation process into a single function, it could condense
-subsampling into a single function. Also, I haven't said much about tuning, but
-it'd be great to be able to do something like:
+1. **Consistency and clarity:** It's nice that all my modeling code looks similar even for different analyses with different models. I'd like that to extend to unsupervised analyses.
+
+2. **Easy framework for beginners:** One problem tidymodels solves is that beginners tend to shove their data into a model until they get an output without error. Tidymodels separates the important decisions: the data processing, the model specification, and the validation. This advantage would also exist in the unsupervised case.
+
+3. **Comparison of methods:** Sometimes it's hard to decide between clustering algorithms. There's less of a clear measure of success in unsupervised learning - but it would still be a huge help if it were easier to "just try a bunch of things and see what works".
+
+4. **Wrappers to automated repetition:** In the same way that tidymodels condenses the v-fold cross-validation process into a single function, it could condense subsampling into a single function. Also, I haven't said much about tuning, but it'd be great to be able to do something like:
 
 ```{r, eval = FALSE}
 k_means(clusters = tune())

From 6076709085deaa061d5c5912e5ca9add2a83122c Mon Sep 17 00:00:00 2001
From: Max Kuhn
Date: Tue, 21 Sep 2021 10:28:30 -0400
Subject: [PATCH 6/6] refresh readme

---
 tidy-clustering/{README.utf8.md => README.md} | 155 +++++-------------
 1 file changed, 41 insertions(+), 114 deletions(-)
 rename tidy-clustering/{README.utf8.md => README.md} (55%)

diff --git a/tidy-clustering/README.utf8.md b/tidy-clustering/README.md
similarity index 55%
rename from tidy-clustering/README.utf8.md
rename to tidy-clustering/README.md
index 6e5be45..738d4c1 100644
--- a/tidy-clustering/README.utf8.md
+++ b/tidy-clustering/README.md
@@ -1,13 +1,10 @@
 
 # Overview
 
-`tidymodels` currently only implements supervised learning methods. This
-document explores the possibility of including unsupervised methods in the same
-framework.
+`tidymodels` currently focuses on supervised learning methods, with unsupervised methods in [recipes](https://recipes.tidymodels.org/) and related packages like [embed](https://embed.tidymodels.org/) mainly framed as preprocessing steps for supervised learning. This document explores the possibility of including unsupervised methods in the same framework.
 
 ## Basic Definition
 
-I'll use the term "clustering" to refer generally to any method that does the
-following:
+I'll use the term "clustering" to refer generally to any method that does the following:
 
 1. Define a **measure of similarity** between objects.
 
@@ -17,34 +14,27 @@ Many of the thoughts in this section are taken from [this paper](http://proceedi
 
 ## Uses of Clustering
 
-1. **Semi-Supervised Learning:** If we have class labels on *some* of the objects,
-we can apply unsupervised clustering, then let the clusters be defined by their
-class enrichment of labelled objects.
+1. **Semi-Supervised Learning:** If we have class labels on *some* of the objects, we can apply unsupervised clustering, then let the clusters be defined by their class enrichment of labelled objects.
 
    + A word of caution for this approach: Just because a clustering structure
     doesn't align with known labels doesn't mean it is "wrong". It could be
     capturing a different (true) aspect of the data than the one we have labels for.
 
-2. **EDA:** Sometimes clustering is applied as a first exploratory step, to get
-a sense of the structure of the data. This is somewhat nebulous, and usually
-involves eyeballing a visualization.
+2. **EDA:** Sometimes clustering is applied as a first exploratory step, to get a sense of the structure of the data. This is somewhat nebulous, and usually involves eyeballing a visualization.
 
    + In my experience this is usually a precursor to classification. We'd do
     an unsupervised clustering and see how well the objects "group up". This
     gives us an idea of whether the measurements we're using for classification
     are a good choice, before we fit a formal model.
 
-3. **Pre-processing:** Clustering can be used to discover relationships in data
-that are undesirable, so that we can *residualize* or *decorrelate* the objects
-before applying an analysis.
+3. **Pre-processing:** Clustering can be used to discover relationships in data that are undesirable, so that we can *residualize* or *decorrelate* the objects before applying an analysis.
 
    + A great example of this is in genetics, where we have measurements of gene
     expression for several subjects. Typically, gene expression is most
     strongly correlated by race. If we cluster the subjects on gene expression,
     we can then identify unwanted dependence to remove from the data.
 
-4. **Clusters as analysis:** Sometimes, assignment of cluster membership *is* the
-end goal of the study. For example:
+4. **Clusters as analysis:** Sometimes, assignment of cluster membership *is* the end goal of the study. For example:
 
    + In the Enron corruption case in 2001, researchers created a network based
     on who emailed who within the company. They then looked at which clusters
@@ -59,16 +49,13 @@ end goal of the study. For example:
 
 ## Ways to validate a cluster
 
-The major departure from supervised learning is this: With a supervised method,
-we have a very clear way to measure success, namely, how well does it predict?
+The major departure from supervised learning is this: With a supervised method, we have a very clear way to measure success, namely, how well does it predict?
 
 With clustering, there is no "right answer" to compare results against.
 
 There are several ways people typically validate a clustering result:
 
-1. **within-group versus without-group similarity:** The goal is to find
-groups of similar objects. Thus, we can check how close objects in the same
-cluster are as compared to how close objects in different clusters are.
+1. **within-group versus without-group similarity:** The goal is to find groups of similar objects. Thus, we can check how close objects in the same cluster are as compared to how close objects in different clusters are.
 
    + A problem with this is that there's no objective baseline for what counts as
     a "good" ratio.
   + (Gao, Witten, Bien have a cool new paper suggesting a statistical test
     for cluster concentration.)
 
 
-2. **Stability:** If we regard the objects being clustered as a random subset
-of a population, we can ask whether the same cluster structure would have emerged
-in a different random subset. We can measure this with bootstrapped subsampling.
+2. **Stability:** If we regard the objects being clustered as a random subset of a population, we can ask whether the same cluster structure would have emerged in a different random subset. We can measure this with bootstrapped subsampling.
 
    + A cluster structure being stable doesn't necessarily mean it is meaningful.
     See e.g. [this tweet](https://twitter.com/AndrewZalesky/status/1363741266761027585)
 
 
 3. **Enrichment in known labels:** (See *semi-supervised learning* above)
 
-4. **Statistical significance/generative models:** If our clustering method
-places model assumptions on how the data was generated, we can formulate a notion
-of statistically signficant clusters.
+4. **Statistical significance/generative models:** If our clustering method places model assumptions on how the data was generated, we can formulate a notion of statistically significant clusters.
 
    + A good example of this is "model-based clustering", which typically assumes
     the data is generated from a mixture of multivariate Gaussians. We can then
    estimate the probability that a particular object came from a particular
    distribution in the mixture.
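To make that last point concrete, here is a minimal sketch of model-based clustering as it could be run today, assuming the `mclust` package and the built-in `faithful` data purely for illustration (none of this is tied to the proposed framework):

```r
library(mclust)

# Fit a mixture of multivariate Gaussians; the number of components
# (and the covariance structure) is chosen by BIC by default.
fit <- Mclust(faithful)

summary(fit)              # selected model and number of clusters
head(fit$z)               # probability that each object came from each component
head(fit$classification)  # hard assignments, if a partition is needed
```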
@@ -100,59 +83,42 @@ of statistically signficant clusters.
 
 #### Partitions versus Extractions
 
-In many clustering algorithms, each object is placed into **exactly one** cluster.
-This is called a **partition**.
+In many clustering algorithms, each object is placed into **exactly one** cluster. This is called a **partition**.
 
-However, some algorithms allow for overlapping clusters, i.e. objects belonging to
-multiple clusters - or for some objects to be "background" and have no cluster
-membership at all.
+However, some algorithms allow for overlapping clusters, i.e. objects belonging to multiple clusters - or for some objects to be "background" and have no cluster membership at all.
 
 (This is sometimes called **community detection** or **cluster extraction**)
 
 #### What are the variables, what are the samples?
 
-Consider the example of data consisting of many gene expression measurements
-for several individual subjects.
+Consider the example of data consisting of many gene expression measurements for several individual subjects.
 
-From a statistical perspective, we'd typically regard the subjects as samples
-from a population, and the gene expression levels as variables being studied.
+From a statistical perspective, we'd typically regard the subjects as samples from a population, and the gene expression levels as variables being studied.
 
-When we cluster this data, sometimes we look for clusters of **subjects**, i.e.,
-people whose gene signature is similar. Sometimes we look for clusters of **genes**,
-i.e., genes that activate or deactivate together.
+When we cluster this data, sometimes we look for clusters of **subjects**, i.e., people whose gene signature is similar. Sometimes we look for clusters of **genes**, i.e., genes that activate or deactivate together.
 
-Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we
-cluster on. In statistical algorithms, your model assumptions need to match
-your notion of samples/variables.
+Non-statistical algorithms (e.g. kmeans) are agnostic to what set of objects we cluster on. In statistical algorithms, your model assumptions need to match your notion of samples/variables.
 
 #### Beyond the similarity matrix
 
-In the vast majority of cases, all you need for a clustering algorithm is a
-*similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity
-measure between object $i$ and object $j$.
+In the vast majority of cases, all you need for a clustering algorithm is a *similarity matrix*; that is, entry $(i,j)$ of the matrix is the pairwise similarity measure between object $i$ and object $j$.
 
-Of course, the choice of similarity measure matters enormously. But the point is
-you only need pairwise similarities to run e.g. kmeans.
+Of course, the choice of similarity measure matters enormously. But the point is that, for many algorithms (e.g. hierarchical clustering or k-medoids), the pairwise similarities are the only input you need.
 
-Some approaches, though, need more information. Statistical approaches in particular
-often estimate nuisance variables (like variance) from multiple measurements before
-collapsing the data into pairwise similarities.
+Some approaches, though, need more information. Statistical approaches in particular often estimate nuisance variables (like variance) from multiple measurements before collapsing the data into pairwise similarities.
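Circling back to the first point: here is a small sketch in which the clustering step never sees the raw features, only a precomputed dissimilarity object, and swapping the dissimilarity changes the answer. I'm assuming the `cluster` package and using `USArrests` purely as placeholder data:

```r
library(cluster)  # pam() is k-medoids; it accepts a dissimilarity object directly

x <- scale(USArrests)

# Two different choices of pairwise dissimilarity over the same objects
d_euclidean <- dist(x, method = "euclidean")
d_manhattan <- dist(x, method = "manhattan")

# The algorithm only consumes the pairwise dissimilarities
pam_euclidean <- pam(d_euclidean, k = 3)
pam_manhattan <- pam(d_manhattan, k = 3)

# How much the two dissimilarity choices disagree about the grouping
table(pam_euclidean$clustering, pam_manhattan$clustering)
```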
 
 # Inclusion in tidymodels
 
-Here's an overview of how I see unsupervised learning fitting into a tidymodels
-framework.
+Here's an overview of how I see unsupervised learning fitting into a tidymodels framework.
 
 ## Recipes and Workflow
 
-These steps can remain the same; there's no reason unsupervised methods need
-different preprocessing options/structure.
+These steps can remain the same; there's no reason unsupervised methods need different preprocessing options/structure.
 
 ## `parsnip`-like models
 
-Obviously, any and all clustering methods would have to be implemented as model
-spec functions, and the setup would look similar, e.g.
+Any and all clustering methods could be implemented as model spec functions, and the setup would look similar, e.g.
 
 
 ```r
@@ -162,31 +128,21 @@ km_mod <- k_means(centers = 5) %>%
 
 ```
 
-In my opinion, it makes sense to create a "fraternal twin" package to `parsnip`
-that is essentially the same structure. The reason I think this should be a
-separate package is that there are two major philosophical shifts for
-supervised vs. unsupervised learning.
+In my opinion, it makes sense to create a "fraternal twin" package to `parsnip` with essentially the same structure. The reason I think this should be a separate package is that there are two major philosophical shifts for supervised vs. unsupervised learning.
 
-I'm going to refer to the twin package as `celery` for now because that name makes
-me laugh.
+I'm going to refer to the twin package as `celery` for now because that name makes me laugh.
 
 #### Modes
 
-In `parsnip`, the possible modes are *classification* or *regression*, and they
-correspond to the nature of the response variable.
+In `parsnip`, the possible modes are *classification*, *censored regression*, or *regression*, and they correspond to the nature of the response variable.
 
-In `celery`, there is no response variable. I believe the modes should be
-*partition* or *extraction*. (see above for details) The way that a model
-fit should output results is fundamentally different for those two tasks.
+In `celery`, there is no response variable. I believe the modes should be *partition* or *extraction*. (see above for details) The way that a model fit should output results is fundamentally different for those two tasks.
-I don't really see a parallel between the classification/regression divide and the
-partition/extraction divide. It doesn't feel right to just lump them together
-into four "modes" options.
+I don't really see a parallel between the classification/regression divide and the partition/extraction divide. It doesn't feel right to just lump them together into four "modes" options.
 
 #### Choosing a dissimilarity metric
 
-Every model comes with choices about parameters and assumptions and such. To
-some degree, this would work the same in `parsnip` and `celery`:
+Every model comes with choices about parameters and assumptions and such. To some degree, this would work the same in `parsnip` and `celery`:
 
 
 ```r
@@ -195,11 +151,7 @@ knn_mod <- nearest_neighbors(neighbors = 5)
 
 km_mod <- k_means(centers = 5)
 ```
 
-However, for clustering, the decision of how to define "dissimilarity" feels like
-a more major decision. That is, supervised methods require **one major choice**
-(what model) and then many minor choices (the parameters/penalties) for that model.
-Unsupervised methods require **two major choices**: the algorithm and the similarity
-measure.
+However, for clustering, the decision of how to define "dissimilarity" feels like a more major decision. That is, supervised methods require **one major choice** (what model) and then many minor choices (the parameters/penalties) for that model. Unsupervised methods require **two major choices**: the algorithm and the similarity measure.
 
 I would like to see the dissimilarity measure be given "front-row billing", to
@@ -218,17 +170,13 @@ k_means(centers = 5) %>%
   set_dist("manhattan")
 ```
 
-Much like with *modes*, this would require the `celery` package to pre-specify
-which disssimilarities were available for particular clustering algorithms. For
-example, some methods require a true distance measure, while others are less strict.
+Much like with *modes*, this would require the `celery` package to pre-specify which dissimilarities were available for particular clustering algorithms. For example, some methods require a true distance measure, while others are less strict.
 
-These would have to be more flexible than modes, though, since clustering methods
-have **so** many possible dissimilarity measures.
+These would have to be more flexible than modes, though, since clustering methods have **so** many possible dissimilarity measures.
 
 ## fitting to data
 
-There is no real reason the `fit` step needs to look different, once a model
-is specified:
+There is no real reason the `fit` step needs to look different, once a model is specified:
 
 
 ```r
@@ -242,52 +190,32 @@ km_mod %>%
   fit(vars(bill_length_mm:body_mass_g), data = penguins)
 ```
 
-It's been suggested that "fit" might not be the right verb for unsupervised
-methods. Technicallly, many supervised methods are not truly fitting a model either
-(see: K nearest neighbors), so I'm not worried about this.
+It's been suggested that "fit" might not be the right verb for unsupervised methods. Technically, many supervised methods are not truly fitting a model either (see: K nearest neighbors), so I'm not worried about this.
 
 ## `rsample` and cross-validating
 
-Cross-validation as such doesn't make sense for clustering, because there is no
-notion of prediction on the test set.
+Cross-validation as such doesn't make sense for clustering, because there is no notion of prediction on the test set.
 
-However, a similar framework could make sense for subsampling. Instead
-of $v$ cross-validations, we'd have $v$ subsamples, and we'd look at certain
-success metrics for each subsample to get a sense of variability in the metric.
+However, a similar framework could make sense for subsampling. Instead of $v$ cross-validations, we'd have $v$ subsamples, and we'd look at certain success metrics for each subsample to get a sense of variability in the metric.
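As a rough sketch of what that could look like with existing tools (assuming `rsample` for the subsamples, plain `kmeans()` as the algorithm, and average silhouette width as a stand-in "success" metric; all of these are placeholder choices, not a proposed API):

```r
library(rsample)
library(purrr)
library(cluster)  # silhouette()

set.seed(1234)
dat <- as.data.frame(scale(USArrests))

# v = 20 random subsamples instead of v cross-validation folds
subsamples <- mc_cv(dat, prop = 0.8, times = 20)

metric_by_subsample <- map_dbl(subsamples$splits, function(split) {
  x  <- as.matrix(analysis(split))
  km <- kmeans(x, centers = 3, nstart = 10)
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})

summary(metric_by_subsample)  # spread of the metric across subsamples
```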
 
 ## `yardstick`-like metrics
 
-Unsupervised methods have very different (and more ambiguous) metrics for "success"
-than supervised methods. (see above for lots and lots of detail)
+Unsupervised methods have very different (and more ambiguous) metrics for "success" than supervised methods. (see above for lots and lots of detail)
 
-The underlying structure of the package doesn't need to change like it would for
-`parsnip`/`celery`. We just need more functions implementing metrics that make
-sense for clustering.
+The underlying structure of the package doesn't need to change like it would for `parsnip`/`celery`. We just need more functions implementing metrics that make sense for clustering.
 
 # Meta: What does this buy us?
 
 Here are the advantages to a tidymodels implementation of clustering:
 
-1. **Consistency and clarity:** It's nice that all my modeling code looks similar
-even for different analyses with different models. I'd like that to extend to
-unsupervised analyses.
+1. **Consistency and clarity:** It's nice that all my modeling code looks similar even for different analyses with different models. I'd like that to extend to unsupervised analyses.
 
-2. **Easy framework for beginners:** One problem tidymodels solves is that
-beginners tend to shove their data into a model until they get an output without
-error. Tidymodels separates the important decisions in the data processing and
-the model specification and the validation. This advantage would also exist in
-the unsupervised case.
+2. **Easy framework for beginners:** One problem tidymodels solves is that beginners tend to shove their data into a model until they get an output without error. Tidymodels separates the important decisions: the data processing, the model specification, and the validation. This advantage would also exist in the unsupervised case.
 
-3. **Comparison of methods:** Sometimes it's hard to decide between clustering
-algorithms. There's less of a clear measure of success in unsupervised learning -
-but it would still be a huge help if it were easier to "just try a bunch of things
-and see what works".
+3. **Comparison of methods:** Sometimes it's hard to decide between clustering algorithms. There's less of a clear measure of success in unsupervised learning - but it would still be a huge help if it were easier to "just try a bunch of things and see what works".
 
-4. **Wrappers to automated repetition:** In the same way that tidymodels condenses
-the v-fold cross-validation process into a single function, it could condense
-subsampling into a single function. Also, I haven't said much about tuning, but
-it'd be great to be able to do something like:
+4. **Wrappers to automated repetition:** In the same way that tidymodels condenses the v-fold cross-validation process into a single function, it could condense subsampling into a single function. Also, I haven't said much about tuning, but it'd be great to be able to do something like:
 
 
 ```r
@@ -296,4 +224,3 @@ k_means(clusters = tune())
 
 
-
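Until a `tune()`-style interface exists for cluster counts, a rough manual stand-in for that last idea might look like the sketch below (plain `kmeans()` and total within-cluster sum of squares, both placeholder choices rather than a proposal for the eventual API):

```r
x <- scale(USArrests)

k_grid <- 2:8
tot_withinss <- vapply(
  k_grid,
  function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss,
  numeric(1)
)

# An "elbow" in this curve is one (imperfect) way to choose the number of clusters
data.frame(k = k_grid, tot_withinss = tot_withinss)
```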