## KMeans

In [14]:
from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.contrib.factorization import KMeans
# Ignore all GPUs; this small KMeans example runs on the CPU.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [15]:
# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
full_data_x = mnist.train.images

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


#### Setting up

In [129]:
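epochs = 100         # number of training steps, each over the full training set
batch_size = 50      # not used in the cells below
display_points = 50  # not used in the cells below
clusters = 25        # number of KMeans clusters
features = 784       # 28x28 pixels per MNIST image
classes = 10         # digits 0-9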

X = tf.placeholder(tf.float32, shape=[None, features])
Y = tf.placeholder(tf.float32, shape=[None, classes])

kmeans = KMeans(inputs=X, num_clusters=clusters, distance_metric='cosine', use_mini_batch=True)

kmeans.training_graph() returns a tuple with the following elements, in this order:

1. all_scores: A matrix (or list of matrices) of dimensions (num_input, num_clusters) where each value is the distance between an input vector and a cluster center.  
2. cluster_idx: A vector (or list of vectors) in which each element corresponds to an input row and gives the id of the cluster assigned to that input.  
3. scores: Similar to cluster_idx, but gives the distance to the assigned cluster instead of its id.  
4. cluster_centers_initialized: A scalar indicating whether the cluster centers have been initialized.  
5. init_op: An op that initializes the clusters.  
6. training_op: An op that runs one iteration of training.  
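To make the first three outputs concrete, here is a minimal NumPy sketch of how they relate to each other. It is not TensorFlow's implementation: it assumes that the cosine distance is 1 minus the cosine similarity, and the function name cosine_scores and the toy arrays are made up for illustration.

```python
import numpy as np

def cosine_scores(inputs, centers):
    """Illustrative stand-in for all_scores, cluster_idx and scores (assumed: distance = 1 - cosine similarity)."""
    # Normalize rows so that a dot product equals the cosine similarity.
    inputs_n = inputs / np.linalg.norm(inputs, axis=1, keepdims=True)
    centers_n = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    all_scores = 1.0 - inputs_n @ centers_n.T                  # shape (num_input, num_clusters)
    cluster_idx = np.argmin(all_scores, axis=1)                # shape (num_input,)
    scores = all_scores[np.arange(len(inputs)), cluster_idx]   # distance to the assigned center
    return all_scores, cluster_idx, scores

# Toy data: three 2-D inputs and two cluster centers.
inputs = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
all_scores, cluster_idx, scores = cosine_scores(inputs, centers)
print(all_scores.shape, cluster_idx, scores)
```

Taking the mean of scores gives the kind of average distance that the notebook tracks during training.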

#### training_graph()  
Generate a training graph for kmeans algorithm.  

This returns, among other things, an op that chooses initial centers (init_op), a boolean variable that is set to True when the initial centers are chosen (cluster_centers_initialized), and an op to perform either an entire Lloyd iteration or a mini-batch of a Lloyd iteration (training_op). The caller should use these components as follows. A single worker should execute init_op multiple times until cluster_centers_initialized becomes True. Then multiple workers may execute training_op any number of times.
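For intuition about what one full Lloyd iteration does, the following NumPy sketch assigns every point to its nearest center and then moves each center to the mean of its points. It is a simplified illustration rather than what training_op executes: it uses squared Euclidean distance instead of cosine distance, processes the full batch, and the names lloyd_iteration, points and centers are hypothetical.

```python
import numpy as np

def lloyd_iteration(points, centers):
    """One full Lloyd step: assign each point, then recompute each center."""
    # Squared Euclidean distance from every point to every center.
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = points[assignment == k]
        if len(members) > 0:  # leave the center of an empty cluster unchanged
            new_centers[k] = members.mean(axis=0)
    return new_centers, assignment

# Toy data: two well-separated groups and two rough initial centers.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
for _ in range(5):  # loosely analogous to running training_op repeatedly
    centers, assignment = lloyd_iteration(points, centers)
print(centers, assignment)
```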

In [130]:
# Build KMeans graph
(all_scores, cluster_idx, scores, cluster_centers_initialized, init_op, train_op) = kmeans.training_graph()
cluster_idx = cluster_idx[0]  # cluster_idx is returned as a tuple; take its first element
init = tf.global_variables_initializer()
# Metric to monitor during training: average distance of points from their assigned cluster center
avg_distance = tf.reduce_mean(scores)

In [131]:
# Start the session, then initialize the variables and the cluster centers
sess = tf.Session()
sess.run(init)
sess.run(init_op, feed_dict={X: full_data_x})

# Training
for epoch in range(epochs):
    _, d, idx = sess.run([train_op, avg_distance, cluster_idx], feed_dict={X: full_data_x})
    if epoch % 10 == 0 or epoch == 1:
        print("Epoch %i, Avg Distance: %f" % (epoch, d))

Epoch 0, Avg Distance: 0.341471
Epoch 1, Avg Distance: 0.234025
Epoch 10, Avg Distance: 0.221393
Epoch 20, Avg Distance: 0.220257
Epoch 30, Avg Distance: 0.219734
Epoch 40, Avg Distance: 0.219390
Epoch 50, Avg Distance: 0.219131
Epoch 60, Avg Distance: 0.218920
Epoch 70, Avg Distance: 0.218748
Epoch 80, Avg Distance: 0.218600
Epoch 90, Avg Distance: 0.218473


In [132]:
# idx holds the cluster id assigned to each training image
idx.shape

(55000,)

In [133]:
idx

array([13, 10,  9, ..., 24,  3,  2])

In [134]:
# mnist.train.labels is a 2-D array of one-hot labels; this row encodes the digit 3
mnist.train.labels[1]

array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])

In [135]:
# shape of labels
mnist.train.labels.shape

(55000, 10)

In [136]:
# To assign a label to each cluster, count how many training images of each digit fall into it
counts = np.zeros(shape=(clusters, classes))
for i in range(len(idx)):
    counts[idx[i]] += mnist.train.labels[i]

In [137]:
# Label each cluster with the digit that occurs most often in it (majority vote)
labels_map = [np.argmax(c) for c in counts]
labels_map

[3, 1, 8, 6, 2, 6, 4, 5, 7, 9, 3, 8, 2, 4, 8, 0, 0, 6, 7, 1, 2, 3, 6, 1, 5]
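The counting loop and the majority vote can also be written without an explicit Python loop. The sketch below is one possible vectorized form using np.add.at; the helper name majority_label_map and the toy arrays are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def majority_label_map(idx, one_hot_labels, num_clusters):
    """Map each cluster id to the most frequent class among its members."""
    counts = np.zeros((num_clusters, one_hot_labels.shape[1]))
    # Unbuffered add: rows of one_hot_labels are accumulated into counts[idx],
    # which handles repeated cluster ids correctly.
    np.add.at(counts, idx, one_hot_labels)
    return counts.argmax(axis=1), counts

# Toy data: six samples, three clusters, three classes.
toy_idx = np.array([0, 0, 1, 1, 1, 2])
toy_labels = np.eye(3)[[2, 2, 0, 1, 1, 0]]  # one-hot rows for classes 2, 2, 0, 1, 1, 0
toy_map, toy_counts = majority_label_map(toy_idx, toy_labels, num_clusters=3)
print(toy_map)
```

With the notebook's variables, the call would be majority_label_map(idx, mnist.train.labels, clusters). Note that several clusters can map to the same digit, as in the labels_map output above.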

In [138]:
# Convert the list to a tensor so it can be used inside the graph
labels_map_tensor = tf.convert_to_tensor(labels_map)

In [139]:
# Predict a digit for each input after training:
# labels_map_tensor maps each cluster id to its majority digit and cluster_idx holds the cluster id
# of each input, so this lookup converts cluster ids into MNIST digit values
cluster_labels = tf.nn.embedding_lookup(labels_map_tensor, cluster_idx)

In [140]:
# Accuracy: tf.argmax returns int64, so cast it to tf.int32 to match the dtype of cluster_labels
correct_prediction = tf.equal(cluster_labels, tf.cast(tf.argmax(Y, 1), tf.int32))
accuracy_op = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
# test the model
test_x, test_y = mnist.test.images, mnist.test.labels
print("Test Accuracy:", sess.run([accuracy_op], feed_dict={X: test_x, Y: test_y}))

Test Accuracy: [0.7273]
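The accuracy computation amounts to a table lookup followed by a comparison. As a rough NumPy equivalent, assuming the cluster ids for the evaluated inputs are already available as an array (the function cluster_accuracy and the toy values below are hypothetical):

```python
import numpy as np

def cluster_accuracy(labels_map, cluster_idx, one_hot_labels):
    """Fraction of inputs whose cluster's majority label equals the true label."""
    predicted = np.asarray(labels_map)[cluster_idx]  # same role as the embedding lookup
    actual = one_hot_labels.argmax(axis=1)           # same role as tf.argmax(Y, 1)
    return float((predicted == actual).mean())

# Toy data: three clusters mapped to classes 2, 1, 0 and four inputs.
toy_map = [2, 1, 0]
toy_idx = np.array([0, 1, 2, 1])
toy_y = np.eye(3)[[2, 1, 1, 1]]  # true classes 2, 1, 1, 1
acc = cluster_accuracy(toy_map, toy_idx, toy_y)
print(acc)
```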


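#### Learning
1. Dependency chain: accuracy_op depends on correct_prediction, which depends on cluster_labels, which looks up cluster_idx in labels_map_tensor. labels_map_tensor is built from labels_map, which comes from counts, which is computed from idx, the cluster assignments returned by the last sess.run of train_op, avg_distance and cluster_idx. Every graph evaluation in this chain needs X to be fed; accuracy_op also needs Y.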