A real data set of wheat seed measurements from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. The technique is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.


Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured: 
1. area A, 
2. perimeter P, 
3. compactness C = 4*pi*A/P^2, 
4. length of kernel, 
5. width of kernel, 
6. asymmetry coefficient, 
7. length of kernel groove. 
All of these parameters are real-valued and continuous.
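
As a quick sanity check on the compactness definition above, the sketch below recomputes C from the area and perimeter of the first row shown later in this notebook (area 15.26, perimeter 14.84). The helper name `compactness` is made up for illustration and is not part of the data set or of Spark.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2; a perfect circle gives exactly 1."""
    return 4 * math.pi * area / perimeter ** 2

# Area and perimeter of the first row of the data set.
first_row_c = compactness(15.26, 14.84)
print(round(first_row_c, 3))
```

The result agrees with the stored compactness of 0.871 to three decimals, which suggests the third column is derived from the first two rather than measured independently.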

We will try to cluster them into 3 groups with K-means!

## 1. Import and get spark instance

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import (VectorAssembler,
                                StandardScaler)
spark = SparkSession.builder.appName('clustering').getOrCreate()

## 2. Explore dataset

In [2]:
df_dataset = spark.read.csv("seeds_dataset.csv", header=True, inferSchema=True)

In [3]:
df_dataset.head()

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)

In [4]:
df_dataset.describe().show()

+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|summary|              area|         perimeter|         compactness|   length_of_kernel|   width_of_kernel|asymmetry_coefficient|   length_of_groove|
+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|  count|               210|               210|                 210|                210|               210|                  210|                210|
|   mean|14.847523809523816|14.559285714285718|  0.8709985714285714|  5.628533333333335| 3.258604761904762|   3.7001999999999997|  5.408071428571429|
| stddev|2.9096994306873647|1.3059587265640225|0.023629416583846364|0.44306347772644983|0.3777144449065867|   1.5035589702547392|0.49148049910240543|
|    min|             10.59|             12.41|              0.8081|              4.899|            

## 3. Make features column and scale it

#### Make features column

In [5]:
df_dataset.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

In [6]:
assembler = VectorAssembler(inputCols=df_dataset.columns, outputCol='features')

In [7]:
df_final_data = assembler.transform(df_dataset)

In [8]:
df_final_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)



#### Scale the Data
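K-means assigns points to clusters by Euclidean distance, so features with large numeric ranges (such as area) would otherwise dominate features with small ranges (such as compactness). Scaling each feature to unit standard deviation puts them on a comparable footing. For background on how distances behave as the number of features grows, see the curse of dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality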

In [9]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [10]:
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(df_final_data)

In [11]:
# Normalize each feature to have unit standard deviation.
df_final_data = scalerModel.transform(df_final_data)
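
To make the effect of `withStd=True, withMean=False` concrete: each feature is divided by its sample standard deviation and is not centered. The plain-Python sketch below reproduces this for the `area` column only, using the stddev reported by `describe()` and the first three `area` values; the variable names are illustrative.

```python
# Sample standard deviation of `area`, copied from the describe() output.
area_stddev = 2.9096994306873647

# First three `area` values of the data set.
areas = [15.26, 14.88, 14.29]

# withMean=False: no centering; withStd=True: divide by the standard deviation.
scaled_areas = [a / area_stddev for a in areas]
print(scaled_areas)
```

These values match the leading entries of `scaledFeatures` displayed for the first three rows (5.2445..., 5.1139..., 4.9111...).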

## 4. Train the Model and Evaluate

In [12]:
%%time
# Trains a k-means model.
kmeans = KMeans(featuresCol='scaledFeatures', k=3)  # k=3: ask for 3 clusters
model = kmeans.fit(df_final_data)

CPU times: user 7.11 ms, sys: 4.7 ms, total: 11.8 ms
Wall time: 7.04 s
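
For intuition about what `fit` does, here is a minimal pure-Python sketch of the classic Lloyd iteration behind K-means: assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. It is not Spark's implementation (Spark runs distributed and uses the `k-means||` initialization by default); the function names, toy points and hand-picked starting centers are all assumptions made for illustration.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean)."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # an empty cluster keeps its old center
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

def lloyd_kmeans(points, centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Toy 2-D data with three obvious groups and hand-picked starting centers.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
toy_init = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
toy_centers, toy_labels = lloyd_kmeans(toy_points, toy_init)
print(toy_centers, toy_labels)
```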


In [13]:
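# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
# KMeansModel.computeCost() was deprecated in Spark 2.4 and removed in Spark 3.0;
# the cost on the training data is available from the model summary instead.
wssse = model.summary.trainingCost
print("Within Set Sum of Squared Errors = " + str(wssse))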
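
The Within Set Sum of Squared Errors referred to in this cell is the sum, over all points, of the squared distance from each point to its nearest cluster center; lower values mean tighter clusters. The sketch below computes it by hand on made-up toy data. The function name `wssse` and the toy points are illustrative only and do not come from Spark.

```python
import math

def wssse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

# Toy data: two tight pairs of points.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two well-placed centers versus a single center in the middle.
two_centers = [(0.0, 1.0), (10.0, 1.0)]
one_center = [(5.0, 1.0)]

print(wssse(toy_points, two_centers))
print(wssse(toy_points, one_center))
```

Because this cost generally decreases as more clusters are added, it is usually compared across several values of k rather than read in isolation.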

In [14]:
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[ 4.91309043 10.92012526 37.32658724 12.37724251  8.59393872  1.8226071
 10.35389957]
[ 6.32636687 12.38115343 37.39222755 13.9206997   9.75485787  2.41428142
 12.28078861]
[ 4.0648023  10.14242485 35.82143905 11.81918014  7.51855717  3.19361875
 10.40520609]
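
Because the model was trained on `scaledFeatures`, these centers are expressed in units of standard deviations, not in the original measurement units. Since the scaler only divided by the standard deviation (no centering), multiplying each coordinate by the corresponding stddev from the `describe()` output recovers centers in the original units. The sketch below does this in plain Python with the numbers copied from the outputs above; the variable names are illustrative.

```python
# Column order matches df_dataset.columns.
columns = ['area', 'perimeter', 'compactness', 'length_of_kernel',
           'width_of_kernel', 'asymmetry_coefficient', 'length_of_groove']

# Per-feature sample standard deviations, copied from the describe() output.
stddevs = [2.9096994306873647, 1.3059587265640225, 0.023629416583846364,
           0.44306347772644983, 0.3777144449065867, 1.5035589702547392,
           0.49148049910240543]

# Cluster centers in scaled units, copied from the output above.
scaled_centers = [
    [4.91309043, 10.92012526, 37.32658724, 12.37724251, 8.59393872, 1.8226071, 10.35389957],
    [6.32636687, 12.38115343, 37.39222755, 13.9206997, 9.75485787, 2.41428142, 12.28078861],
    [4.0648023, 10.14242485, 35.82143905, 11.81918014, 7.51855717, 3.19361875, 10.40520609],
]

# Undo the scaling: original value = scaled value * stddev.
original_centers = [[c * s for c, s in zip(center, stddevs)]
                    for center in scaled_centers]

for i, center in enumerate(original_centers):
    print(i, dict(zip(columns, (round(v, 3) for v in center))))
```

Read this way, the clusters differ most clearly in kernel size: the rescaled center areas land at roughly 14.3, 18.4 and 11.8.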


In [15]:
output = model.transform(df_final_data)
output.show(3)

+-----+---------+-----------+-----------------+------------------+---------------------+----------------+--------------------+--------------------+----------+
| area|perimeter|compactness| length_of_kernel|   width_of_kernel|asymmetry_coefficient|length_of_groove|            features|      scaledFeatures|prediction|
+-----+---------+-----------+-----------------+------------------+---------------------+----------------+--------------------+--------------------+----------+
|15.26|    14.84|      0.871|            5.763|             3.312|                2.221|            5.22|[15.26,14.84,0.87...|[5.24452795332028...|         0|
|14.88|    14.57|     0.8811|5.553999999999999|             3.333|                1.018|           4.956|[14.88,14.57,0.88...|[5.11393027165175...|         0|
|14.29|    14.09|      0.905|            5.291|3.3369999999999997|                2.699|           4.825|[14.29,14.09,0.90...|[4.91116018695588...|         0|
+-----+---------+-----------+-----------------

In [16]:
output.select(['features', 'scaledFeatures', 'prediction']).show()

+--------------------+--------------------+----------+
|            features|      scaledFeatures|prediction|
+--------------------+--------------------+----------+
|[15.26,14.84,0.87...|[5.24452795332028...|         0|
|[14.88,14.57,0.88...|[5.11393027165175...|         0|
|[14.29,14.09,0.90...|[4.91116018695588...|         0|
|[13.84,13.94,0.89...|[4.75650503761158...|         0|
|[16.14,14.99,0.90...|[5.54696468981581...|         0|
|[14.38,14.21,0.89...|[4.94209121682475...|         0|
|[14.69,14.49,0.87...|[5.04863143081749...|         0|
|[14.11,14.1,0.891...|[4.84929812721816...|         0|
|[16.63,15.46,0.87...|[5.71536696354628...|         1|
|[16.44,15.25,0.88...|[5.65006812271202...|         0|
|[15.26,14.85,0.86...|[5.24452795332028...|         0|
|[14.03,14.16,0.87...|[4.82180387844584...|         0|
|[13.89,14.02,0.88...|[4.77368894309428...|         0|
|[13.78,14.06,0.87...|[4.73588435103234...|         0|
|[13.74,14.05,0.87...|[4.72213722664617...|         0|
|[14.59,14