## Clustering Data
*Multidimensional grouping*

Sometimes data in multiple dimensions shows trends that are not consistent throughout the set. Often, when the data is plotted, a human viewer can clearly see that it falls into multiple groups, or "clusters".

Let's take an example from the [Palmer Archipelago penguins data set](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data).

This tutorial uses an [inverted table](inverted.ipynb) format. See the [nested matrix]() format for comparison as well as the [performance, ergonomics and aesthetics comparison]().

We can use `⎕CSV` to bring in an inverted table directly, but we have to specify how to interpret each column. `1` means to interpret a column as character data; `3` means to interpret it as numeric, replacing invalid values with `0`.

In [63]:
ps←(⎕CSV⎕OPT'Invert' 1)'../csv/penguins_size.csv' '' (1 1 3 3 3 3 1) 1
≢⊃⊃ps   ⍝ How many rows?

Although this format is more efficient than a nested matrix, it is harder to read at a glance, so we define utilities to view (`View`) and filter (`I`) the table.

In [64]:
]box on -fns=on
]rows on -fns=on -fold=3
View←{↑(2⊃⍵)(⍪¨⊃⍵)}
I←{(⊂⍺)∘⌷¨⍵}
View ps

The three penguin species in this data set are physically quite distinct. If we plot beak length (culmen length) against flipper length and colour the points by species, a grouping emerges.

In [78]:
(data head)←ps
cols←'culmen_length_mm' 'flipper_length_mm' 'species'
lengths←(head⍳⊆cols) ⊃¨⊂ data
table←lengths cols
View table

In [81]:
'InitCauseway'⎕CY'sharpplot'
InitCauseway⍬

In [120]:
∇ svg←Scatter table;sp;header;beak;flipper;species
  ((beak flipper species) header)←table
  sp←⎕NEW Causeway.SharpPlot
  sp.SetYCaption 1⊃header
  sp.SetXCaption 2⊃header
  sp.SetKeyText ↓∪species
  sp.SplitBy ↓species
  sp.SetXRange 170 240
  sp.SetYRange 30 60
  sp.SetMarkerScales 1.5
  sp.DrawScatterPlot beak flipper
  svg←sp.RenderSvg''
∇

In [124]:
]html Scatter table

There is some overlap, but we can clearly see three clusters of data in these two dimensions. How could we compute these groups? One method is the k-means clustering algorithm. We choose a number of clusters, $k$, and compute a mean for each cluster such that every data point in a cluster is closer to that cluster's mean than to any other mean.
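
To make this condition concrete, it can be written in symbols. The notation is an assumption introduced here for illustration and does not appear in the notebook: $x_i$ is a data point, $S_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is the mean of $S_j$. Every point in a cluster should then satisfy

$$
x_i \in S_j \implies \lVert x_i - \mu_j \rVert \le \lVert x_i - \mu_l \rVert \quad \text{for all } l \in \{1, \dots, k\}
$$

k-means is usually described as minimising the within-cluster sum of squared distances, $\sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2$. The standard iterative method is only guaranteed to reach a local minimum of this quantity, not necessarily the best possible grouping.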

In [95]:
I←{(⊂⊂⍺)⌷¨⍵}              ⍝ select rows ⍺ from every column of ⍵
From←{((2⊃⍵)⍳⊆⍺)⊃¨1⌷⍵}    ⍝ select the columns named ⍺ from table ⍵

We begin by choosing random data points as our starting means. We call them `c` for the **C**enter of each cluster.

In [96]:
length←'culmen_length_mm' 'flipper_length_mm' From table
⎕←c←3(?∘≢∘⊃I⊢)length

How close are we? Let's find the closest center to each data point. Once we've labelled each point with its closest center, the standard deviation of the distances within each group will give an indication of how tight our clusters are.

In [97]:
⍴d←0.5*⍨⊃+/2*⍨length∘.-¨c
d⍳⍤1 0⌊/d
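
Written out, the two expressions above appear to compute the following. Here $b_i$ and $f_i$ are the beak and flipper length of penguin $i$, $(c^{b}_j, c^{f}_j)$ are the coordinates of center $j$, and $g_i$ is the label given to penguin $i$; these symbols are shorthand introduced for this note, not names from the notebook.

$$
d_{ij} = \sqrt{(b_i - c^{b}_j)^2 + (f_i - c^{f}_j)^2}, \qquad g_i = \operatorname*{arg\,min}_{j} \, d_{ij}
$$

So `d` has one row per penguin and one column per center, and the second expression gives the column index of the smallest distance in each row. If two centers are equally close, `⍳` returns the first match, so the lower-numbered center wins.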

In [98]:
:Namespace stats                      ⍝ statistical functions namespace
    AVG←{(+⌿⍵)÷≢⍵}                    ⍝   average
    STD←{(÷2)*⍨(+.×⍨⍵-AVG⍵)÷(≢⍵)-1}   ⍝   standard deviation (of the sample)
:EndNamespace
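
For reference, `STD` computes the sample standard deviation. For $n$ values $x_1, \dots, x_n$ with mean $\bar{x}$ (symbols chosen here for illustration only):

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

In the code, `+.×⍨` applied to the deviations supplies the sum of squares, `(≢⍵)-1` is the divisor $n-1$, and `(÷2)*⍨` takes the square root.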

In [100]:
(d⍳⍤1 0⌊/d){stats.STD ⍵}⌸⌊/d

Let's try again, but this time we'll use the mean of points in each cluster as the new centers.

In [105]:
⎕←c←(⊂d⍳⍤1 0⌊/d){(+⌿÷≢)⍵}⌸¨length
d←0.5*⍨⊃+/2*⍨length∘.-¨c
⎕←(d⍳⍤1 0⌊/d){stats.STD ⍵}⌸⌊/d
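
The update in the first line of this cell can be summarised as follows, where $g_i$ is the label of point $x_i$ and $n_j$ is the number of points labelled $j$ (notation assumed for this sketch):

$$
c_j \leftarrow \frac{1}{n_j} \sum_{i \,:\, g_i = j} x_i
$$

Key (`⌸`) returns the groups in order of first appearance of each label, so the new centers may come out in a different order from the old ones. This does not affect the clustering itself, because the labels are recomputed from the new centers in the next step.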

Let's keep iterating until our new mean estimates are equal to our old estimates, that is, until we reach a fixed point. The power operator expresses this as `⍣≡`.

In [106]:
∇ KMeans←{
⍝ ⍺: number of clusters :: scalar integer
⍝ ⍵: data set           :: inverted table
  Centroids←{
    d←0.5*⍨⊃+/2*⍨⍺∘.-¨⍵    ⍝ distances from points to centroids
    g←d⍳⍤1 0⌊/d            ⍝ cluster (group) for each data point
    (⊂g){(+⌿÷≢)⍵}⌸¨⍺       ⍝ new centroids are means of points in each group
  }
  I←{(⊂⊂⍺)⌷¨⍵}     ⍝ select rows ⍺ from each column of ⍵
  c←⍺(?∘≢∘⊃I⊢)⍵    ⍝ guess centroids: ⍺ distinct random data points
  ⍵ Centroids⍣≡c   ⍝ iterate until the centroids stop changing
}
∇
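
In symbols, if $F$ stands for one application of `Centroids` to the current centers (the name $F$ and the superscripts are introduced here only for illustration), the iteration is

$$
c^{(t+1)} = F\left(c^{(t)}\right), \qquad \text{stopping when } c^{(t+1)} \equiv c^{(t)}
$$

Because the starting centers are random, separate runs may stop at different fixed points.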

In [108]:
3 KMeans length

I don't think (0,0) is a good place for a cluster. The zeros are the invalid values that `⎕CSV` replaced with `0` when reading the file. Let's remove the zero entries and try again.

In [109]:
length~¨←0
3 KMeans length

That looks better. Let's see how well our clusters match up with the species.

In [110]:
c←3 KMeans length
d←⊃+/2*⍨length∘.-¨c
g←d⍳⍤1 0⌊/d
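
Note that this cell omits the square root when computing `d`. The labels should be unaffected, because the square root is monotonically increasing on non-negative numbers. Writing $s_{ij}$ for the squared distance from point $i$ to centroid $j$ (a symbol introduced here for illustration):

$$
\operatorname*{arg\,min}_{j} \, s_{ij} = \operatorname*{arg\,min}_{j} \, \sqrt{s_{ij}}
$$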

In [122]:
]html Scatter (length,⊂⍕⍪g)('Beak Length (mm)' 'Flipper Length (mm)' 'Cluster')

Not bad. The Adelie penguins with long flippers are being grouped in with the Chinstraps, but the clusters largely match the grouping by species.