umitanuki/kmeans-postgresql

Subversion checkout URL

You can clone with HTTPS or Subversion.

96 lines (70 sloc) 3.79 kb

kmeans 1.1.0

Synopsis

``````SELECT kmeans(ARRAY[x, y, z], 10) OVER (), * FROM samples;
SELECT kmeans(ARRAY[x, y, z], 2, ARRAY[0.5, 0.5, 1.0, 1.0]) OVER (), * FROM samples;
SELECT kmeans(ARRAY[x, y, z], 2, ARRAY[ARRAY[0.5, 0.5], ARRAY[1.0, 1.0]]) OVER (PARTITION BY group_key), * FROM samples
``````

Description

This module implements the K-means clustering algorithm as a user-defined window function in PostgreSQL. K-means clustering is a simple way to classify a data set by an N-dimensional vector.

It is known that the algorithm depends on how to define initial mean vectors. So the module provides two signatures, one for auto-initialization and the other for user-given initial vectors.

The `kmeans` function calculates and returns a class number which starts from 0, by using input vectors. The first argument is the vector, represented by an array of float8. You must give the same-length 1-dimensional array across all the input rows. The second argument is the integer K in "K-means", the number of class you want to partition. The third argument is optional, an array of vectors which represents initial mean vectors. Since the algorithm is sensitive with initial mean vectors, and although the function calculates them by input vectors in the first form, you may sometimes want to give fixed mean vectors. The vectors can be passed as a 1-d array or 2-d array of float8. In the case of 1-d, its length must match k * lengthof(vector).

If the third argument, initial centroids, is not given, it scans all the input vectors and stores min/max for each element of the vector. Then it decides initial hypothetical vector elements by the next formula:

``````init = (max - min) * (i + 1) / (k + 1) + min
``````

where `i` varies from 0 to k-1. This means that the vector of elements are decided as the linear interpolation from min to max divided by k. Then one of the input vectors nearest to this hypothetical vector is picked up as the centroid. Note that input vector is not picked more than twice as far as possible, to gain a good result.

See input/kmeans.source as an example for how to call the function.

History

v1.1.0 (2011-07-22)

• Fix NULL input case, thanks to the report from Mike Toews
• Improve initialization of centroids

v1.0.0 (2011-05-20)

• Initial version

Support

This library is stored in an open GitHub repository. Feel free to fork and contribute! Please file bug reports via GitHub Issues.