# Clustering on Seeds Data

We'll be working with a real data set about wheat seeds from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds.

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.


Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:

1. area A
2. perimeter P
3. compactness `C = 4*pi*A/P^2`
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove

All of these parameters are real-valued and continuous.
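
As a quick sanity check of the compactness definition, here is a minimal sketch in plain Python. The helper name `compactness` is an illustration only and is not part of the data set or of Spark; the input values are the area and perimeter of the first row printed later in this notebook.

```python
import math

def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2; equals 1.0 for a perfect circle."""
    return 4 * math.pi * area / perimeter ** 2

# First row of the data set: area = 15.26, perimeter = 14.84
print(round(compactness(15.26, 14.84), 3))  # 0.871, matching the compactness column
```

Because compactness is derived from area and perimeter, it adds no independent measurement; it only re-expresses the other two as a shape descriptor.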

Let's see if we can cluster them into 3 groups with K-means!

In [1]:
# Initiate SparkSession

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("cluster").getOrCreate()

## Load the data

In [2]:
# Load the data

dataset = spark.read.csv("resources/seeds_dataset.csv", inferSchema=True, header=True)
dataset.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [3]:
# View the data
dataset.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355|             5.175|
|14.38|    14.21|     0.8951|             5.386|             3.312|   2.4619999999999997|             4.956|
|14.69|    14.49|  

## Data Transformation

In [6]:
# Combine the seven feature columns into a single vector column,
# which is the input format Spark ML estimators expect

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['area', 'perimeter', 'compactness', 'length_of_kernel', 'width_of_kernel', 
                                       'asymmetry_coefficient', 'length_of_groove'], 
                           outputCol='features')

final_data = assembler.transform(dataset)

final_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)



## Scale the data

In [8]:
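from pyspark.ml.feature import StandardScaler

# With default settings (withStd=True, withMean=False) each feature is scaled
# to unit standard deviation but is not mean-centered
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')

# Fit the scaler on the data, then add the scaled_features column
final_data = scaler.fit(final_data).transform(final_data)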

final_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaled_features: vector (nullable = true)
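
To illustrate what this scaling step does to a single feature, here is a minimal plain-Python sketch, assuming the default `StandardScaler` settings (divide by the sample standard deviation, no mean-centering). The function `scale_to_unit_std` is a hypothetical helper, and only the first six `area` values from the table above are used, so the numbers will not match the `scaled_features` column, which is fitted on the full data set.

```python
from statistics import stdev

# First six 'area' values from the table printed earlier in this notebook
area = [15.26, 14.88, 14.29, 13.84, 16.14, 14.38]

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean-centering)."""
    s = stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / s for v in values]

scaled = scale_to_unit_std(area)
print(round(stdev(scaled), 6))  # 1.0: the spread is now one, but the mean is not zero
```

Scaling matters for K-means because it relies on Euclidean distance: without it, features with large ranges such as area would dominate features with small ranges such as compactness.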



## Train the model and Evaluate

In [10]:
# Train a k-means model with k=3 clusters on the scaled features
from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol='scaled_features', k=3)
model = kmeans.fit(final_data)

In [11]:
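# Evaluate the clusters by computing the Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost is deprecated as of Spark 2.4 and is not available in Spark 3;
# there, model.summary.trainingCost gives the cost on the training data.
print(f"Within Set Sum of Squared Errors : {model.computeCost(final_data)}")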

Within Set Sum of Squared Errors : 428.76536612890413
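
The metric above is the sum, over all points, of the squared Euclidean distance to the nearest cluster center. The following is a minimal plain-Python sketch of that definition, not Spark's implementation; the names `squared_distance` and `wssse` and the toy points and centers are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)

# Hypothetical toy data: two tight pairs of points, one center per pair
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centers = [(0.0, 1.0), (10.0, 11.0)]

print(wssse(points, centers))  # 4.0: each point is at distance 1 from its center
```

The value always shrinks as k grows, so on its own it is mainly useful for comparing models with the same k or for looking for an "elbow" across several values of k.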


In [12]:
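# Show the cluster centers. The model was fit on scaled_features,
# so the centers are in scaled units, not the original measurement units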
centers = model.clusterCenters()
print('Cluster Centers')
for center in centers:
    print(center)

Cluster Centers
[ 4.078007   10.15076404 35.87686106 11.81860981  7.5430707   3.17727834
 10.39174095]
[ 6.32636687 12.38115343 37.39222755 13.9206997   9.75485787  2.41428142
 12.28078861]
[ 4.9360523  10.94499696 37.33487983 12.40173794  8.61516278  1.7804233
 10.36535821]


In [13]:
# Show the cluster assigned to each data point
model.transform(final_data).select('scaled_features', 'prediction').show()

+--------------------+----------+
|     scaled_features|prediction|
+--------------------+----------+
|[5.24452795332028...|         2|
|[5.11393027165175...|         2|
|[4.91116018695588...|         2|
|[4.75650503761158...|         2|
|[5.54696468981581...|         2|
|[4.94209121682475...|         2|
|[5.04863143081749...|         2|
|[4.84929812721816...|         2|
|[5.71536696354628...|         1|
|[5.65006812271202...|         2|
|[5.24452795332028...|         2|
|[4.82180387844584...|         2|
|[4.77368894309428...|         2|
|[4.73588435103234...|         2|
|[4.72213722664617...|         2|
|[5.01426361985209...|         2|
|[4.80805675405968...|         2|
|[5.39230954047151...|         2|
|[5.05206821191403...|         2|
|[4.37158555479908...|         0|
+--------------------+----------+
only showing top 20 rows
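
The `prediction` column is the index of the cluster center closest to each point. A minimal plain-Python sketch of that assignment rule, followed by a count of points per cluster, is shown below; `assign_cluster` and the toy centers and points are hypothetical and stand in for the fitted model and the seeds data.

```python
from collections import Counter

def assign_cluster(point, centers):
    """Return the index of the closest center by squared Euclidean distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return distances.index(min(distances))

# Hypothetical toy data: three centers and four points
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(0.5, 0.2), (4.0, 6.0), (9.0, 1.0), (0.1, -0.3)]

predictions = [assign_cluster(p, centers) for p in points]
print(predictions)           # [0, 1, 2, 0]
print(Counter(predictions))  # cluster sizes: two points in cluster 0, one each in 1 and 2
```

Since the data set contains 70 kernels of each variety, counting the points per cluster on the real predictions is a simple sanity check: cluster sizes far from 70 would suggest the clusters do not line up well with the three varieties.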

