Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
144 lines (105 sloc) 5.26 KB
---
title: "kmeans with dplyr and broom"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{kmeans with dplyr and broom}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Tidying k-means clustering
```{r setup, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
K-means clustering serves as a very useful example of tidy data, and especially the distinction between the three tidying functions: `tidy`, `augment`, and `glance`.
Let's start by generating some random two-dimensional data with three clusters. Data in each cluster will come from a multivariate gaussian distribution, with different means for each cluster:
```{r}
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
library(tidyr)
set.seed(27)
centers <- tibble(
cluster = factor(1:3),
num_points = c(100, 150, 50), # number points in each cluster
x1 = c(5, 0, -3), # x1 coordinate of cluster center
x2 = c(-1, 1, -2) # x2 coordinate of cluster center
)
labelled_points <- centers %>%
mutate(
x1 = map2(num_points, x1, rnorm),
x2 = map2(num_points, x2, rnorm)
) %>%
select(-num_points) %>%
unnest(x1, x2)
ggplot(labelled_points, aes(x1, x2, color = cluster)) +
geom_point()
```
This is an ideal case for k-means clustering. We'll use the built-in `kmeans` function, which accepts a dataframe with all numeric columns as it's primary argument.
```{r}
points <- labelled_points %>%
select(-cluster)
kclust <- kmeans(points, centers = 3)
kclust
summary(kclust)
```
The output is a list of vectors, where each component has a different length. There's one of length `r nrow(points)`: the same as our original dataset. There are a number of elements of length 3: `withinss`, `tot.withinss`, and `betweenss`- and `centers` is a matrix with 3 rows. And then there are the elements of length 1: `totss`, `tot.withinss`, `betweenss`, and `iter`.
These differing lengths have a deeper meaning when we want to tidy our dataset: they signify that each type of component communicates a *different kind* of information.
- `cluster` (`r nrow(points)` values) contains information about each *point*
- `centers`, `withinss` and `size` (3 values) contain information about each *cluster*
- `totss`, `tot.withinss`, `betweenss`, and `iter` (1 value) contain information about the *full clustering*
Which of these do we want to extract? There is no right answer: each of them may be interesting to an analyst. Because they communicate entirely different information (not to mention there's no straightforward way to combine them), they are extracted by separate functions. `augment` adds the point classifications to the original dataset:
```{r}
library(broom)
augment(kclust, points)
```
The `tidy` function summarizes on a per-cluster level:
```{r}
tidy(kclust)
```
And as it always does, the `glance` function extracts a single-row summary:
```{r}
glance(kclust)
```
## broom and dplyr for exploratory clustering
While these summaries are useful, they would not have been too difficult to extract out from the dataset yourself. The real power comes from combining these analyses with dplyr.
Let's say we want to explore the effect of different choices of `k`, from 1 to 9, on this clustering. First cluster the data 9 times, each using a different value of `k`, then create columns containing the tidied, glanced and augmented data:
```{r}
kclusts <- tibble(k = 1:9) %>%
mutate(
kclust = map(k, ~kmeans(points, .x)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, points)
)
kclusts
```
We can turn these into three separate datasets each representing a different type of data:
Then tidy the clusterings three ways: using `tidy`, using `augment`, and using `glance`. Each of these goes into a separate dataset as they represent different types of data.
```{r}
clusters <- kclusts %>%
unnest(tidied)
assignments <- kclusts %>%
unnest(augmented)
clusterings <- kclusts %>%
unnest(glanced, .drop = TRUE)
```
Now we can plot the original points, with each point colored according to the predicted cluster.
```{r}
p1 <- ggplot(assignments, aes(x1, x2)) +
geom_point(aes(color = .cluster)) +
facet_wrap(~ k)
p1
```
Already we get a good sense of the proper number of clusters (3), and how the k-means algorithm functions when k is too high or too low. We can then add the centers of the cluster using the data from `tidy`:
```{r}
p2 <- p1 + geom_point(data = clusters, size = 10, shape = "x")
p2
```
The data from `glance` fits a different but equally important purpose: it lets you view trends of some summary statistics across values of k. Of particular interest is the total within sum of squares, saved in the `tot.withinss` column.
```{r}
ggplot(clusterings, aes(k, tot.withinss)) +
geom_line()
```
This represents the variance within the clusters. It decreases as `k` increases, but one can notice a bend (or "elbow") right at k=3. This bend indicates that additional clusters beyond the third have little value. (See [here](http://web.stanford.edu/~hastie/Papers/gap.pdf) for a more mathematically rigorous interpretation and implementation of this method). Thus, all three methods of tidying data provided by broom are useful for summarizing clustering output.