# DAAL KMeans

K-Means is among the most popular and simplest clustering methods. It is intended to partition a data set into a small number of clusters such that feature vectors within a cluster have greater similarity with one another than with feature vectors from other clusters. Each cluster is characterized by a representative point, called a centroid, and a cluster radius.

In other words, clustering reduces the analysis of the entire data set to the analysis of a small number of clusters.
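To make the idea of centroids concrete, here is a minimal plain-Python sketch of the classic K-Means iteration (Lloyd's algorithm) on one-dimensional data. It is only an illustration under stated assumptions, not the DAAL implementation: the function name `kmeans_1d` is hypothetical, and seeding the centroids with the first `k` points is a simplification chosen to keep the example deterministic. The input values mirror the `data` column used later in this notebook.

```python
def kmeans_1d(points, k, max_iterations=20):
    """Illustrative Lloyd's algorithm for 1-D data (not the DAAL implementation)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [float(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(sum(members) / float(len(members)))
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so the assignment is stable.
        centroids = new_centroids
    return centroids, labels


values = [2, 1, 7, 1, 9, 2, 0, 6, 5, 120]
centroids, labels = kmeans_1d(values, k=2)
print(centroids)
print(labels)
```

With these values and this seeding, the outlier 120 ends up alone in its own cluster while the remaining nine points share the other one. Real implementations typically use more careful initialization, so their results on other data may differ from this sketch.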

- Read more about [KMeans in the Intel Developer Zone](https://software.intel.com/en-us/node/564616)
- See the [daaltk documentation](https://github.com/trustedanalytics/daal-tk) for more information about the API

In [None]:
# First, let's verify that the SparkTK and daaltk libraries are installed
import sparktk
import daaltk

print("sparktk installation path = %s" % (sparktk.__path__))
print("daaltk installation path = %s" % (daaltk.__path__))

In [None]:
# Create a TkContext that also loads the daaltk library
from sparktk import TkContext
tc = TkContext(other_libs=[daaltk])

In [None]:
# Create a new frame by providing data and schema
data = [[2,"ab"],[1,"cd"],[7,"ef"],[1,"gh"],[9,"ij"],[2,"kl"],[0,"mn"],[6,"op"],[5,"qr"], [120, "outlier"]]
schema = [("data", float),("name", str)]

frame = tc.frame.create(data, schema)
frame.inspect()

In [None]:
# Consider the following frame containing two columns.
frame.inspect()

In [None]:
# DAAL KMeans model is trained using the frame from above
model = tc.daaltk.models.clustering.kmeans.train(frame, ["data"], k=2, max_iterations=20)
model

In [None]:
# Call the model to predict
predicted_frame = model.predict(frame, ["data"])
predicted_frame.inspect()