GitHub - solevillar/kmeans_sdp: Implementation of kmeans SDP and experiments from paper: Clustering subgaussian mixtures by semidefinite programming. Dustin Mixon, Soledad Villar, Rachel Ward

solevillar / kmeans_sdp Public

Notifications You must be signed in to change notification settings
Fork 3
Star 10

Implementation of kmeans SDP and experiments from paper: Clustering subgaussian mixtures by semidefinite programming. Dustin Mixon, Soledad Villar, Rachel Ward

arxiv.org/abs/1602.06612

10 stars 3 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
data		data
.gitignore		.gitignore
generate_data.py		generate_data.py
get_data.m		get_data.m
kmeans_sdp.m		kmeans_sdp.m
main.m		main.m
misclassification.m		misclassification.m
readme		readme
sdp_rounding.m		sdp_rounding.m
value_kmeans.m		value_kmeans.m

Repository files navigation

SDP clustering on NMIST data
*****
Authors: Dustin Mixon, Soledad Villar, Rachel Ward.
contact: mvillar@math.utexas.edu
*****
This code runs Peng and Wei's kmeans SDP [5] (using SDPNAL+ [6]) and 
Matlab's built in kmeans++ version of Lloyd's algorithm on NMIST data [3] 
previously preprocessed using TensorFlow [1] (mapped into feature space 
and saved in './data/data_features.mat'). 

It prints out a numerical comparison and produces the graphs in [4].

The python script generate_data.py uses TensorFlow to create the feature
data and saves it in './data/data_features.mat'.  
It is not necessary to run the python code in this experiment since data 
is already included but we include the code for completeness.
*****
Kmeans SDP implementation

The kmeans semidefinite relaxation has several advantages with respect 
to Lloyds algorithm and kmeans++.
- It is deterministic.
- It always converges to the global minimizer of kmeans SDP objective.
(kmeans++ often gets stuck in local optimizers).
- Kmeans SDP produces the kmeans optimal for data coming from the
stochastic ball model with high probability [2].
- It produces an approximate optimal solution for points coming from
separated subgaussian mixtures [4].
- If P is a matrix which columns are the coordinates of the data points
to cluster; and X is the solution of kmeans SDP, then P*X is a ‘denoised’
version of the data points. In fact, we observe empirically that, even 
when the relaxation is not tight, X has many repeated columns.
- Dual certificates of this SDP can be leveraged to implement a fast 
certifier of kmeans optimality of clusterings found by faster algorithms
(see [2]).
*****
Requires CVX and SDPNAL+.

Note: CVX is only required for running misclassification.m which is not
critic for this experiment.

*****
How to run
Run main.m (in Matlab)

Use kmeans_sdp.m to solve the kmeans SDP optimization problem for a 
provided set of points and number of clusters.
*****

 References:
[1] Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous 
systems.
[2] Iguchi, Mixon, Peterson, Villar. Probably certifiably correct kmeans 
clustering.
[3] LeCun, Cortes. Mnist handwritten digit database.
[4] Mixon, Villar, Ward. Clustering subgaussian mixtures via semidefinite
programming
[5] Peng, Wei. Approximating k-means-type clustering via semidefinite 
programming.
[6] Yang, Sun, Toh. Sdpnal+: a majorized semismooth newton-cg augmented
lagrangian method for semidefinite programming with nonnegative 
constraints.
[7] CVX Research, Inc. CVX: Matlab Software for Disciplined Convex 
Programming

About

Implementation of kmeans SDP and experiments from paper: Clustering subgaussian mixtures by semidefinite programming. Dustin Mixon, Soledad Villar, Rachel Ward

arxiv.org/abs/1602.06612

Readme

Activity

10 stars

2 watching

3 forks

Report repository