## Clustering Data
*Multidimensional grouping*

Multidimensional data sometimes shows trends that do not hold across the whole set. When plotted, it is often clear to a human viewer that the data falls into several groups, or "clusters".

This version uses an inverted table format. See the [nested matrix](nested.ipynb) format for comparison as well as the [performance, ergonomics and aesthetics comparison](comparison.ipynb).
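
As a rough illustration of the inverted table format, the sketch below stores each dimension as its own vector and selects rows with the same `I` helper that `KMeans` defines. The sample values and the name `data` are made up for this example, and `⎕IO←1` is assumed.

```apl
⎕IO←1
data←(1 2 9 10)(1 3 8 9)   ⍝ hypothetical data: 4 points, 2 dimensions, one vector per column
≢data                      ⍝ 2 (number of dimensions)
≢⊃data                     ⍝ 4 (number of points)
I←{(⊂⊂⍺)⌷¨⍵}               ⍝ same helper as in KMeans: rows ⍺ of every column
2 3 I data                 ⍝ (2 9)(3 8)
```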

```apl
∇ KMeans←{
⍝ ⍺: number of clusters :: scalar integer
⍝ ⍵: data set           :: inverted table
  n←⍺
  ComputeCentroids←{
    d←0.5*⍨⊃+/2*⍨⍺∘.-¨⍵        ⍝ distances from points to centroids
    g←d⍳⍤1 0⌊/d                ⍝ cluster (group) for each data point
    (⊂g){(+⌿÷≢)⍵}⌸¨⍺           ⍝ new centroids are means of points in each group
  }
  I←{(⊂⊂⍺)⌷¨⍵}                 ⍝ select rows ⍺ from every column of ⍵
  c←n(?∘≢∘⊃I⊢)⍵                ⍝ guess n random centroids from the data
  ⍵ ComputeCentroids⍣≡c        ⍝ iterate until the centroids stop changing
}
∇
```
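
To make one pass of `ComputeCentroids` concrete, here is a hand-checkable sketch on made-up data with two fixed starting centroids. It assumes `⎕IO←1`; the values are illustrative and not taken from the notebook.

```apl
⎕IO←1
data←(1 2 9 10)(1 3 8 9)         ⍝ hypothetical points (1,1) (2,3) (9,8) (10,9)
c←(1 10)(1 9)                    ⍝ two starting centroids: (1,1) and (10,9)
d←0.5*⍨⊃+/2*⍨data∘.-¨c           ⍝ 4×2 matrix: one row per point, one column per centroid
g←d⍳⍤1 0⌊/d                      ⍝ 1 1 2 2 (nearest centroid for each point)
(⊂g){(+⌿÷≢)⍵}⌸¨data              ⍝ (1.5 9.5)(2 8.5), i.e. centroids (1.5,2) and (9.5,8.5)
```

Note that Key (`⌸`) orders the groups by first appearance in `g`, so the order of the centroids can differ from the order of the starting guesses.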

## Configuration

### Maximum Iterations
Provide a scalar integer indicating the maximum number of iterations allowed.

```apl
 (n max)←⍺
 i←0
 End←{
     ⍺≡⍵: 1       ⍝ Converged
     i=max⊣i+←1   ⍝ Maximum iterations reached
 }
 ⍵ ComputeCentroids⍣End c
```
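
When the right operand of `⍣` is a function, it is called with the new result as `⍺` and the previous result as `⍵`, and iteration stops when it returns 1. The sketch below shows the same stopping pattern on a toy function (repeated halving) rather than on centroids. `Demo` is a hypothetical name, and the increment is written as its own statement for readability.

```apl
Demo←{
  max←⍺
  i←0
  End←{
    ⍺≡⍵:1          ⍝ converged: the last step changed nothing
    i+←1           ⍝ count this iteration (updates i in the enclosing function)
    i=max          ⍝ stop once the limit is reached
  }
  r←{⌊⍵÷2}⍣End⊢⍵   ⍝ halve repeatedly until End returns 1
  r i              ⍝ final value and number of counted iterations
}
3 Demo 64          ⍝ stops because the limit of 3 is reached
10 Demo 64         ⍝ stops earlier, once halving no longer changes the value
```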

### Convergence Threshold

Iteration ends when every centroid has moved by no more than the convergence threshold, where movement is the [distance](#Distance-Metric) between a centroid's position in the previous iteration and its position in the current one.

```apl
 (n e)←⍺
 End←{
     d←0.5*⍨⊃+/2*⍨⍺-⍵
     ∧/d≤e
 }
 ⍵ ComputeCentroids⍣End c
```
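
As a small worked example of the test inside `End`, the sketch below measures how far two centroids moved between iterations and compares that against two thresholds. The centroid values and the names `prev` and `next` are made up for illustration.

```apl
prev←(1 10)(1 9)             ⍝ hypothetical centroids before the step: (1,1) and (10,9)
next←(1.5 9.5)(2 8.5)        ⍝ hypothetical centroids after the step: (1.5,2) and (9.5,8.5)
d←0.5*⍨⊃+/2*⍨next-prev       ⍝ movement of each centroid, roughly 1.118 and 0.707
∧/d≤0.1                      ⍝ 0: both moved further than 0.1, keep iterating
∧/d≤2                        ⍝ 1: both moved at most 2, stop
```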

### Distance Metric
Change the function that computes the distance between any two points.

```apl
d←0.5*⍨⊃+/×⍨⍺∘.-¨⍵               ⍝ Euclidean distance from points to centroids
d←⊃+/|⍺∘.-¨⍵                     ⍝ Manhattan distance
⍝ Cosine similarity
```
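
For comparison, here is a sketch of the Manhattan distance on a small made-up data set. The reduction over columns yields an enclosed matrix, so the sketch discloses it with `⊃` to leave `d` as a simple points-by-centroids matrix, which is the shape the grouping step expects. `⎕IO←1` is assumed.

```apl
⎕IO←1
data←(1 2 9 10)(1 3 8 9)     ⍝ hypothetical points (1,1) (2,3) (9,8) (10,9)
c←(1 10)(1 9)                ⍝ hypothetical centroids (1,1) and (10,9)
d←⊃+/|data∘.-¨c              ⍝ Manhattan distances, one row per point:
                             ⍝    0 17
                             ⍝    3 14
                             ⍝   15  2
                             ⍝   17  0
d⍳⍤1 0⌊/d                    ⍝ 1 1 2 2 (nearest centroid for each point)
```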

### Data Scaling / Normalisation Method
Optionally scale the data before using it to compute centroids.

```apl
s ← Scale ⍵
s ComputeCentroids⍣End⊢c
```

#### Min-max scaling
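One possible `Scale` function, sketched under the usual definition of min-max scaling: each column is shifted and divided so that its values lie between 0 and 1. The name `MinMax` and the sample data are assumptions made for this example.

```apl
MinMax←{{(⍵-⌊/⍵)÷(⌈/⍵)-⌊/⍵}¨⍵}   ⍝ hypothetical name: (x-min)÷(max-min) for each column
MinMax (0 5 10)(2 4 6)            ⍝ (0 0.5 1)(0 0.5 1)
```

A column whose values are all equal has a range of zero, so it needs separate handling if such columns can occur.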

#### Z-score standardisation
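A sketch of another possible `Scale` function, assuming the common definition of the z-score: subtract the column mean and divide by the column standard deviation. The name `ZScore` is hypothetical, and the population standard deviation (dividing by the number of points) is an assumption; a sample standard deviation would divide by one less.

```apl
ZScore←{
  {
    m←(+/⍵)÷≢⍵                   ⍝ column mean
    (⍵-m)÷0.5*⍨(+/2*⍨⍵-m)÷≢⍵     ⍝ deviations divided by the population standard deviation
  }¨⍵
}
ZScore (1 3)(10 30)              ⍝ (¯1 1)(¯1 1)
```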

#### Robust scaling with quartiles

#### Log transform scaling

#### Power transform


## Initialisation Method

### K-means++
Choose each new centroid from the data with probability proportional to the squared distance from a point to its nearest existing centroid.

```apl
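∇  Kpp ← {
⍝ K-means++ initialisation
⍝ ⍺: centroids chosen so far :: inverted table (optional)
⍝ ⍵: data set                :: inverted table
⍝ n: number of clusters, taken from the enclosing function
   I←{(⊂⊂⍺)⌷¨⍵}            ⍝ select rows ⍺ from every column of ⍵
   ⍺←(,?≢⊃⍵)I ⍵            ⍝ Random initial centroid
   n=≢⊃⍺:⍺                 ⍝ We have n centroids, stop
   d2←⌊/⊃+/×⍨⍵∘.-¨⍺        ⍝ Smallest squared distance from each point to a centroid
   p←d2÷+/d2               ⍝ Normalised distances as probabilities (weights)
   c←(,1+(+\p)⍸?0)I ⍵      ⍝ Choose new centroid (⍸ gives the interval below the chosen row, so add 1)
   (⍺,¨c) ∇ ⍵              ⍝ Append it to each column and repeat
 }
∇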
```
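
The weighting step can be checked by hand on a small example. The sketch below computes the squared distances and selection weights for made-up data with a single existing centroid, using the inverted table format; the values and names are illustrative only.

```apl
data←(1 2 9 10)(1 3 8 9)     ⍝ hypothetical points (1,1) (2,3) (9,8) (10,9)
c←(,1)(,1)                   ⍝ one existing centroid at (1,1), stored as 1-element columns
d2←⌊/⊃+/×⍨data∘.-¨c          ⍝ 0 5 113 145 (squared distance to the nearest centroid)
p←d2÷+/d2                    ⍝ selection weights; they sum to 1
+\p                          ⍝ cumulative weights used to pick a point from a random number in (0,1)
```

The point that is already a centroid has weight 0, so it cannot be picked again, while the farthest point has the largest weight (145÷263).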