Implementation of kmeans SDP and experiments from paper: Clustering subgaussian mixtures by semidefinite programming. Dustin Mixon, Soledad Villar, Rachel Ward
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
data
.gitignore
generate_data.py
get_data.m
kmeans_sdp.m
main.m
misclassification.m
readme
sdp_rounding.m
value_kmeans.m

readme

SDP clustering on NMIST data
*****
Authors: Dustin Mixon, Soledad Villar, Rachel Ward.
contact: mvillar@math.utexas.edu
*****
This code runs Peng and Wei's kmeans SDP [5] (using SDPNAL+ [6]) and 
Matlab's built in kmeans++ version of Lloyd's algorithm on NMIST data [3] 
previously preprocessed using TensorFlow [1] (mapped into feature space 
and saved in './data/data_features.mat'). 

It prints out a numerical comparison and produces the graphs in [4].

The python script generate_data.py uses TensorFlow to create the feature
data and saves it in './data/data_features.mat'.  
It is not necessary to run the python code in this experiment since data 
is already included but we include the code for completeness.
*****
Kmeans SDP implementation

The kmeans semidefinite relaxation has several advantages with respect 
to Lloyds algorithm and kmeans++.
- It is deterministic.
- It always converges to the global minimizer of kmeans SDP objective.
(kmeans++ often gets stuck in local optimizers).
- Kmeans SDP produces the kmeans optimal for data coming from the
stochastic ball model with high probability [2].
- It produces an approximate optimal solution for points coming from
separated subgaussian mixtures [4].
- If P is a matrix which columns are the coordinates of the data points
to cluster; and X is the solution of kmeans SDP, then P*X is a ‘denoised’
version of the data points. In fact, we observe empirically that, even 
when the relaxation is not tight, X has many repeated columns.
- Dual certificates of this SDP can be leveraged to implement a fast 
certifier of kmeans optimality of clusterings found by faster algorithms
(see [2]).
*****
Requires CVX and SDPNAL+.

Note: CVX is only required for running misclassification.m which is not
critic for this experiment.

*****
How to run
Run main.m (in Matlab)

Use kmeans_sdp.m to solve the kmeans SDP optimization problem for a 
provided set of points and number of clusters.
*****

 References:
[1] Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous 
systems.
[2] Iguchi, Mixon, Peterson, Villar. Probably certifiably correct kmeans 
clustering.
[3] LeCun, Cortes. Mnist handwritten digit database.
[4] Mixon, Villar, Ward. Clustering subgaussian mixtures via semidefinite
programming
[5] Peng, Wei. Approximating k-means-type clustering via semidefinite 
programming.
[6] Yang, Sun, Toh. Sdpnal+: a majorized semismooth newton-cg augmented
lagrangian method for semidefinite programming with nonnegative 
constraints.
[7] CVX Research, Inc. CVX: Matlab Software for Disciplined Convex 
Programming