Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add kmeans page to cookbook #3183

Merged
merged 1 commit into from
Jun 15, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion data
56 changes: 56 additions & 0 deletions doc/cookbook/source/examples/clustering/kmeans.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
=======
K-means
=======
:math:`K`-means clustering aims to partition :math:`n` observations into :math:`k\leq n` clusters (sets :math:`\mathbf{S}`),
in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

In other words, its objective is to minimize:

.. math::
\argmin_\mathbf{S} \sum_{i=1}^{k}\sum_{\mathbf{x}\in S_k}\left \|\boldsymbol{x} - \boldsymbol{\mu}_i \right \|^{2}

where :math:`\mathbf{μ}_i` is the mean of points in :math:`S_i`.

See Chapter 20 in :cite:`barber2012bayesian` for a detailed introduction.

-------
Example
-------
Imagine we have files with training and test data. We create CDenseFeatures (here 64 bit floats aka RealFeatures) as

.. sgexample:: kmeans.sg:create_features

In order to run :sgclass:`CKMeans`, we need to choose a distance, for example :sgclass:`CEuclideanDistance`, or other sub-classes of :sgclass:`CDistance`. The distance is initialized with the data we want to classify.

.. sgexample:: kmeans.sg:choose_distance

Once we have chosen a distance, we create an instance of the :sgclass:`CKMeans` classifier.
We explicitly set :math:`k`, the number of clusters we are expecting to have as 3 and pass it to :sgclass:`CKMeans`. In this example, we apply Lloyd's method for `k`-means clustering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe mention (or remove in code) this "fixed_centers"?

.. sgexample:: kmeans.sg:create_instance_lloyd

Then we train the model:

.. sgexample:: kmeans.sg:train_dataset

We can extract centers and radius of each cluster:

.. sgexample:: kmeans.sg:extract_centers_and_radius


:sgclass:`CKMeans` also supports mini batch :math:`k`-means clustering.
We can create an instance of :sgclass:`CKMeans` classifier with mini batch :math:`k`-means method by providing the batch size and iteration number.

.. sgexample:: kmeans.sg:create_instance_mb

Then train the model and extract the centers and radius information as mentioned above.

----------
References
----------
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add david barbers book as reference as well?

:wiki:`K-means_clustering`

:wiki:`Lloyd's_algorithm`

.. bibliography:: ../../references.bib
:filter: docname in docnames
28 changes: 28 additions & 0 deletions examples/meta/src/clustering/kmeans.sg
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
CSVFile f_feats_train("../../data/classifier_binary_2d_linear_features_train.dat")
Math:init_random(1)

#![create_features]
RealFeatures features_train(f_feats_train)
#![create_features]

#![choose_distance]
EuclideanDistance distance(features_train, features_train)
#![choose_distance]

#![create_instance_lloyd]
KMeans kmeans(2, distance)
#![create_instance_lloyd]

#![train_dataset]
kmeans.train()
#![train_dataset]

#![extract_centers_and_radius]
RealMatrix c = kmeans.get_cluster_centers()
RealVector r = kmeans.get_radiuses()
#![extract_centers_and_radius]

#![create_instance_mb]
KMeansMiniBatch kmeans_mb(2, distance)
kmeans_mb.set_mb_params(4, 1000)
#![create_instance_mb]
25 changes: 0 additions & 25 deletions examples/undocumented/csharp_modular/clustering_kmeans_modular.cs

This file was deleted.

31 changes: 0 additions & 31 deletions examples/undocumented/java_modular/clustering_kmeans_modular.java

This file was deleted.

19 changes: 0 additions & 19 deletions examples/undocumented/octave_modular/clustering_kmeans_modular.m

This file was deleted.

24 changes: 0 additions & 24 deletions examples/undocumented/python_modular/clustering_kmeans_modular.py

This file was deleted.