Add kmeans cookbook page
OXPHOS committed May 11, 2016
1 parent 112dd79 commit 2f7f73f
Showing 2 changed files with 70 additions and 0 deletions.
47 changes: 47 additions & 0 deletions doc/cookbook/source/examples/classifier/kmeans.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
==========================
:math:`k`-means clustering
==========================
:math:`k`-means clustering aims to partition :math:`n` observations into :math:`k` (:math:`\leq n`) clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

The :math:`n` observations are represented by :math:`n` :math:`d`-dimensional real vectors, :math:`(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)`.



In :math:`k`-means clustering, the :math:`n` observations are partitioned into :math:`k` (:math:`\leq n`) sets :math:`\mathbf{S} = \{S_1, S_2, \ldots, S_k\}` so as to minimize the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point to the center of its cluster.
In other words, the objective is to find:

.. math::

    \underset{\mathbf{S}}{\operatorname{arg\,min}} \sum_{i=1}^{k}\sum_{\mathbf{x}\in S_i}\left \|\mathbf{x} - \boldsymbol{\mu}_i \right \|^{2}

where :math:`\boldsymbol{\mu}_i` is the mean of points in :math:`S_i`.
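
To make the objective concrete, the following is a minimal pure-Python sketch that evaluates the WCSS for a given assignment of points to clusters. The function name ``wcss`` and the toy data are illustrative assumptions; this is not part of the Shogun API.

```python
def wcss(points, labels, k):
    """Within-cluster sum of squares for a given assignment (illustrative helper)."""
    dim = len(points[0])
    total = 0.0
    for i in range(k):
        # S_i: all points currently assigned to cluster i
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            continue  # an empty cluster contributes nothing to the sum
        # mu_i: component-wise mean of the points in S_i
        mu = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # add the squared Euclidean distance of every member to mu_i
        total += sum(sum((p[d] - mu[d]) ** 2 for d in range(dim)) for p in members)
    return total


# Toy data (assumed for illustration): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = wcss(points, [0, 0, 1, 1], k=2)  # groups nearby points together
bad = wcss(points, [0, 1, 0, 1], k=2)   # mixes distant points
```

On this toy data the grouping that keeps nearby points together yields a much smaller WCSS than the one that mixes distant points, which is exactly what the objective rewards.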

-------
Example
-------
Imagine we have a file with training data. We create :sgclass:`CDenseFeatures` (here 64-bit floats, aka RealFeatures) as

.. sgexample:: kmeans.sg:create_features

In order to run :sgclass:`CKMeans`, we need to choose a distance, for example :sgclass:`CEuclideanDistance`, or another sub-class of :sgclass:`CDistance`. The distance is initialized with the data we want to cluster.

.. sgexample:: kmeans.sg:choose_distance
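
As a rough illustration of the role the distance plays, here is a pure-Python sketch of the Euclidean distance and of picking the nearest of several centers. The helper names ``euclidean_distance`` and ``nearest_center`` are hypothetical and only mirror the idea, not Shogun's implementation.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors (illustrative helper)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(point, centers):
    """Index of the center closest to `point` under the Euclidean distance."""
    return min(range(len(centers)), key=lambda i: euclidean_distance(point, centers[i]))


d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
idx = nearest_center((1.0, 1.0), [(0.0, 0.0), (10.0, 10.0)])
```

Swapping in a different distance function changes which center counts as "nearest", and therefore changes the resulting clusters.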

Once we have chosen a distance, we create an instance of :sgclass:`CKMeans`.
We explicitly set the number of clusters :math:`k` (here 3) and select Lloyd's method as the training method.

.. sgexample:: kmeans.sg:create_instance
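
For intuition about what Lloyd's method does, the sketch below alternates an assignment step and an update step until the assignment stops changing. It is a simplified pure-Python illustration under assumed names (``lloyd_kmeans``, ``initial_centers``, ``max_iter``), not the code Shogun runs; in particular, the initial centers are passed in by hand and empty clusters simply keep their previous center.

```python
def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Simplified Lloyd iteration; returns (centers, labels)."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the center with the
        # smallest squared Euclidean distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda i: sum((x - c) ** 2 for x, c in zip(p, centers[i])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move either
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels


# Toy data and starting centers, assumed for illustration.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, labels = lloyd_kmeans(points, initial_centers=[(0.0, 0.0), (10.0, 0.0)])
```

Each iteration can only lower (or keep) the WCSS, so the procedure converges, although possibly to a local rather than a global minimum that depends on the starting centers.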

Then we train :sgclass:`CKMeans` on the data:

.. sgexample:: kmeans.sg:train_and_apply

Finally, we can extract the center and radius of each cluster:

.. sgexample:: kmeans.sg:extract_centers_and_radius

----------
References
----------
:wiki:`K-means_clustering`

:wiki:`Lloyd's_algorithm`
23 changes: 23 additions & 0 deletions examples/meta/src/classifier/kmeans.sg
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
CSVFile f_feats_train("../../data/classifier_binary_2d_linear_features_train.dat")

#![create_features]
RealFeatures features_train(f_feats_train)
#![create_features]

#![choose_distance]
EuclideanDistance distance(features_train, features_train)
#![choose_distance]

#![create_instance]
KMeans kmeans(3, distance, enum EKMeansMethod.KMM_LLOYD)
#![create_instance]

#![train_and_apply]
kmeans.train()
#![train_and_apply]

#![extract_centers_and_radius]
RealMatrix c = kmeans.get_cluster_centers()
RealVector r = kmeans.get_radiuses()
#![extract_centers_and_radius]
