# Clustering Documentation Example

<h2 id="k-means">K-means</h2>

<p><a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> is one of the
most commonly used clustering algorithms. It groups the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the <a href="http://en.wikipedia.org/wiki/K-means%2B%2B">k-means++</a> method
called <a href="http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf">k-means||</a>.</p>

<p><code>KMeans</code> is implemented as an <code>Estimator</code> and generates a <code>KMeansModel</code> as the base model.</p>
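
As a rough illustration of what such an estimator computes, the following plain-Python sketch runs the basic Lloyd iteration: assign each point to its nearest center, then move each center to the mean of its points. It is not MLlib's implementation. The names `lloyd_kmeans` and `sq_dist` are hypothetical, the toy points are made up, and the initial centers are simply the first k points rather than a k-means|| initialization.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iter=20):
    """Basic Lloyd iteration; initial centers are the first k points (an assumption)."""
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the centers are stable.
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # Keep the old center if a cluster ends up empty.
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Toy data (illustrative only): two well-separated groups in 2-D.
toy_points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0), (5.0, 5.2)]
toy_centers, toy_assignment = lloyd_kmeans(toy_points, k=2)
print(toy_centers)
print(toy_assignment)
```

The result of Lloyd's iteration depends on the starting centers, which is why initialization schemes such as k-means++ and k-means|| exist.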

<h3 id="input-columns">Input Columns</h3>

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>featuresCol</td>
      <td>Vector</td>
      <td>"features"</td>
      <td>Feature vector</td>
    </tr>
  </tbody>
</table>

<h3 id="output-columns">Output Columns</h3>

<table class="table">
  <thead>
    <tr>
      <th align="left">Param name</th>
      <th align="left">Type(s)</th>
      <th align="left">Default</th>
      <th align="left">Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>predictionCol</td>
      <td>Int</td>
      <td>"prediction"</td>
      <td>Index of the predicted cluster</td>
    </tr>
  </tbody>
</table>

In [1]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [4]:
from pyspark.ml.clustering import KMeans

In [7]:
dataset = spark.read.format('libsvm').load('sample_kmeans_data.txt')

In [8]:
dataset.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



In [9]:
dataset.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+
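
The `features` column prints each row as a sparse vector in the form `(size,[indices],[values])`, so `(3,[],[])` is the all-zero vector of length 3. A minimal plain-Python sketch of that reading follows. The helper `sparse_to_dense` is hypothetical (it is not part of PySpark), and the second call uses a made-up vector purely for illustration.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


zero_row = sparse_to_dense(3, [], [])              # the first row of the table
example_row = sparse_to_dense(4, [1, 3], [2.5, 7.0])  # illustrative vector
print(zero_row)     # [0.0, 0.0, 0.0]
print(example_row)  # [0.0, 2.5, 0.0, 7.0]
```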



In [10]:
final_data = dataset.select(['features'])

In [33]:
# Configure a k-means estimator with k=2 and a fixed seed (training happens in fit()).
kmeans = KMeans().setK(2).setSeed(1)

In [12]:
model = kmeans.fit(final_data)

In [34]:
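# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost is deprecated as of Spark 2.4 and removed in Spark 3.0;
# on newer versions, model.summary.trainingCost gives the cost on the training data.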
wssse = model.computeCost(final_data)

In [14]:
print(wssse)

0.11999999999994547


In [17]:
for item in final_data.head(5):
    print(item[0])

(3,[],[])
(3,[0,1,2],[0.1,0.1,0.1])
(3,[0,1,2],[0.2,0.2,0.2])
(3,[0,1,2],[9.0,9.0,9.0])
(3,[0,1,2],[9.1,9.1,9.1])


In [18]:
centers = model.clusterCenters()

In [19]:
centers

[array([ 0.1,  0.1,  0.1]), array([ 9.1,  9.1,  9.1])]
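
The WSSSE printed earlier can be checked by hand from these two centers: it is the sum, over all points, of the squared distance to the nearest center. The sketch below does this in plain Python. The helper names are hypothetical, and the sixth data point is assumed to be (9.2, 9.2, 9.2) because its row is truncated in the displayed table.

```python
# The six points of the sample data; the sixth row is truncated in the
# displayed table and is assumed here to be (9.2, 9.2, 9.2).
hand_points = [(v, v, v) for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
hand_centers = [(0.1, 0.1, 0.1), (9.1, 9.1, 9.1)]  # values shown by clusterCenters()


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse_by_hand(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


hand_cost = wssse_by_hand(hand_points, hand_centers)
print(hand_cost)  # approximately 0.12, up to floating-point rounding
```

Four of the six points lie 0.1 away from their center in each of the three coordinates, contributing 3 × 0.01 = 0.03 each, and the other two sit exactly on a center.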

In [23]:
results = model.transform(final_data)

In [24]:
results.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         0|
|(3,[0,1,2],[0.1,0...|         0|
|(3,[0,1,2],[0.2,0...|         0|
|(3,[0,1,2],[9.0,9...|         1|
|(3,[0,1,2],[9.1,9...|         1|
|(3,[0,1,2],[9.2,9...|         1|
+--------------------+----------+



### What if we increase the value of K?

In [36]:
kmeans = KMeans().setK(3).setSeed(1)
model = kmeans.fit(final_data)
wssse = model.computeCost(final_data)
centers = model.clusterCenters()
print("K value: 3", "\n")
print("Sum of Squared Errors: ", wssse)
print("-" * 110, "\n")
print("Clusters' Centers: ")
for center in centers:
    print(center)

K value: 3 

Sum of Squared Errors:  0.07499999999994544
-------------------------------------------------------------------------------------------------------------- 

Clusters' Centers: 
[ 9.1  9.1  9.1]
[ 0.05  0.05  0.05]
[ 0.2  0.2  0.2]


Not surprisingly, the sum of squared errors decreases as the value of K increases.
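
To see why this must happen, the sketch below finds the smallest possible WSSSE for K = 1 to 4 by brute force: it tries every assignment of the six points to at most K clusters. This is exhaustive search, not the k-means algorithm, and it is only feasible because the dataset is tiny. The function name `best_wssse` is hypothetical, and the sixth value (9.2) is assumed because its row is truncated in the displayed table.

```python
from itertools import product

# Each sample row has three equal coordinates, so one value per row is enough;
# the sixth value (9.2) is an assumption because that row is truncated above.
values = [0.0, 0.1, 0.2, 9.0, 9.1, 9.2]


def best_wssse(values, k, dims=3):
    """Smallest WSSSE over every assignment of the values to at most k clusters."""
    best = float("inf")
    for labels in product(range(k), repeat=len(values)):
        cost = 0.0
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                mean = sum(members) / len(members)
                # Every coordinate contributes the same squared deviation.
                cost += dims * sum((v - mean) ** 2 for v in members)
        best = min(best, cost)
    return best


costs = {k: best_wssse(values, k) for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```

Any clustering with K clusters is also a valid clustering with K + 1 clusters (one of them empty), so the best achievable cost can never go up, and it reaches zero once every point is its own cluster. A lower WSSSE alone therefore does not justify a larger K.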

In [32]:
results = model.transform(final_data)
results.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|           (3,[],[])|         1|
|(3,[0,1,2],[0.1,0...|         1|
|(3,[0,1,2],[0.2,0...|         2|
|(3,[0,1,2],[9.0,9...|         0|
|(3,[0,1,2],[9.1,9...|         0|
|(3,[0,1,2],[9.2,9...|         0|
+--------------------+----------+

