# DAAL KMeans

K-Means is among the most popular and simplest clustering methods. It is intended to partition a data set into a small number of clusters such that feature vectors within a cluster have greater similarity with one another than with feature vectors from other clusters. Each cluster is characterized by a representative point, called a centroid, and a cluster radius.

In other words, clustering reduces the analysis of an entire data set to the analysis of a small number of clusters.

- Read more about [KMeans in the intel developer zone](https://software.intel.com/en-us/node/564616)
- See the [daaltk documentation](https://github.com/trustedanalytics/daal-tk) for more information about the APIs
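
To make the roles of centroids and cluster assignment concrete, here is a minimal sketch of the classic K-Means iteration (Lloyd's algorithm) in plain Python on one-dimensional points. It is an illustration only, not how DAAL implements K-Means; the function `kmeans_1d`, the toy points, and the starting centroids are all hypothetical.

```python
def kmeans_1d(points, centroids, max_iterations=20):
    """Illustrative Lloyd's algorithm on 1-D points; returns (centroids, labels)."""
    centroids = [float(c) for c in centroids]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
                  for p in points]
        # Update step: each centroid moves to the mean of its current members.
        updated = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(sum(members) / float(len(members)))
            else:
                updated.append(centroids[j])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, labels


# Hypothetical toy data: two well-separated groups.
toy_points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, labels = kmeans_1d(toy_points, centroids=[1.0, 12.0])
print(centroids)
print(labels)
```

Each pass alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its members, stopping once the centroids no longer move.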

In [1]:
# First, let's verify that the SparkTK and daaltk libraries are installed
import sparktk
import daaltk

print "sparktk installation path = %s" % (sparktk.__path__)
print "daaltk installation path = %s" % (daaltk.__path__)

sparktk installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']
daaltk installation path = ['/opt/anaconda2/lib/python2.7/site-packages/daaltk']


In [2]:
# Create a TkContext that also loads the daaltk library
from sparktk import TkContext
tc = TkContext(other_libs=[daaltk])

In [3]:
# Create a new frame by providing data and schema
data = [[2,"ab"],[1,"cd"],[7,"ef"],[1,"gh"],[9,"ij"],[2,"kl"],[0,"mn"],[6,"op"],[5,"qr"], [120, "outlier"]]
schema = [("data", float),("name", str)]

frame = tc.frame.create(data, schema)
frame.inspect()

[#]  data  name   
[0]     2  ab
[1]     1  cd
[2]     7  ef
[3]     1  gh
[4]     9  ij
[5]     2  kl
[6]     0  mn
[7]     6  op
[8]     5  qr
[9]   120  outlier

In [4]:
# Consider the following frame containing two columns.
frame.inspect()

[#]  data  name   
[0]     2  ab
[1]     1  cd
[2]     7  ef
[3]     1  gh
[4]     9  ij
[5]     2  kl
[6]     0  mn
[7]     6  op
[8]     5  qr
[9]   120  outlier

In [5]:
# DAAL KMeans model is trained using the frame from above
model = tc.daaltk.models.clustering.kmeans.train(frame, ["data"], k=2, max_iterations=20)
model

centroids           = {u'Cluster:1': [120.0], u'Cluster:0': [3.6666666666666665]}
cluster_sizes       = {u'Cluster:1': 1L, u'Cluster:0': 9L}
column_scalings     = []
k                   = 2
label_column        = predicted_cluster
observation_columns = [u'data']
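
The reported centroids can be sanity-checked by hand. The sketch below assumes, based on the `cluster_sizes` output, that the nine small values form one cluster and the outlier forms the other; under that assumption each centroid is simply the mean of its members. It uses plain Python only and does not call daaltk.

```python
# Values from the "data" column of the frame above.
values = [2.0, 1.0, 7.0, 1.0, 9.0, 2.0, 0.0, 6.0, 5.0, 120.0]

# Assumed membership, inferred from cluster_sizes (9 and 1).
cluster_0 = [v for v in values if v != 120.0]
cluster_1 = [v for v in values if v == 120.0]

# A centroid is the mean of the points assigned to the cluster.
centroid_0 = sum(cluster_0) / len(cluster_0)
centroid_1 = sum(cluster_1) / len(cluster_1)

print(centroid_0)
print(centroid_1)
```

The two means agree with the `centroids` entry in the model summary (about 3.667 and 120.0).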

In [6]:
# Call the model to predict the cluster for each row
predicted_frame = model.predict(frame, ["data"])
predicted_frame.inspect()

[#]  data   name     distance_from_cluster_0  distance_from_cluster_1
[0]    2.0  ab                 2.77777777778                  13924.0
[1]    1.0  cd                 7.11111111111                  14161.0
[2]    7.0  ef                 11.1111111111                  12769.0
[3]    1.0  gh                 7.11111111111                  14161.0
[4]    9.0  ij                 28.4444444444                  12321.0
[5]    2.0  kl                 2.77777777778                  13924.0
[6]    0.0  mn                 13.4444444444                  14400.0
[7]    6.0  op                 5.44444444444                  12996.0
[8]    5.0  qr                 1.77777777778                  13225.0
[9]  120.0  outlier            13533.4444444                      0.0

[#]  predicted_cluster
[0]                  0
[1]                  0
[2]                  0
[3]                  0
[4]                  0
[5]                  0
[6]                  0
[7]                  0
[8]                  0
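
The numbers in the `distance_from_cluster_*` columns appear to be squared Euclidean distances to each centroid rather than plain distances; for example, (2.0 - 3.667)^2 is about 2.778 and (2.0 - 120.0)^2 is 13924. The sketch below checks this reading in plain Python. The helper names `squared_distances` and `nearest_cluster` are hypothetical and are not part of daaltk.

```python
# Centroids copied from the trained model summary above.
centroids = [3.6666666666666665, 120.0]


def squared_distances(value, centroids):
    """Squared Euclidean distance from a 1-D value to each centroid."""
    return [(value - c) ** 2 for c in centroids]


def nearest_cluster(value, centroids):
    """Index of the centroid with the smallest squared distance."""
    distances = squared_distances(value, centroids)
    return distances.index(min(distances))


for value in [2.0, 9.0, 120.0]:
    print(value)
    print(squared_distances(value, centroids))
    print(nearest_cluster(value, centroids))
```

Under this rule the predicted cluster is the one with the smaller distance column, so the nine small values fall in cluster 0 and the value 120.0 is nearest to cluster 1.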

In [7]:
# Inspect HDFS directly using hdfsclient

import hdfsclient
from hdfsclient import ls, mkdir, rm, mv

In [8]:
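# Remove any model left over from a previous run; ignore the error if the path does not exist
try:
    rm("sandbox/myKMeansModel", recurse=True)
except Exception:
    pass
model.save("sandbox/myKMeansModel")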

In [9]:
restored = tc.load("sandbox/myKMeansModel")

In [10]:
restored

centroids           = {u'Cluster:1': [120.0], u'Cluster:0': [3.6666666666666665]}
cluster_sizes       = {u'Cluster:1': 1L, u'Cluster:0': 9L}
column_scalings     = []
k                   = 2
label_column        = predicted_cluster
observation_columns = [u'data']

In [11]:
full_path = model.export_to_mar("sandbox/myKMeansModel.mar")

In [12]:
full_path

u'/home/vcap/jupyter/examples/tklibs/daaltk/sandbox/myKMeansModel.mar'

In [13]:
model.save

<bound method KMeansModel.save of centroids           = {u'Cluster:1': [120.0], u'Cluster:0': [3.6666666666666665]}
cluster_sizes       = {u'Cluster:1': 1L, u'Cluster:0': 9L}
column_scalings     = []
k                   = 2
label_column        = predicted_cluster
observation_columns = [u'data']>