# Clustering Code Along

We'll be working with a real data set about seeds from the UCI repository: https://archive.ics.uci.edu/ml/datasets/seeds.

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique. It is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.


Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:

1. area A,
2. perimeter P,
3. compactness `C = 4*pi*A/P^2`,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.

All of these parameters are real-valued and continuous.
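
As a quick sanity check on the compactness formula, the sketch below recomputes it in plain Python from the area and perimeter of the first kernel in the data set (15.26 and 14.84, the values returned by `dataset.head(1)`). The helper name `compactness` is only illustrative and is not part of the data set or of Spark.

```python
import math

def compactness(area: float, perimeter: float) -> float:
    """Compactness C = 4*pi*A / P^2; a perfect circle gives 1.0."""
    return 4 * math.pi * area / perimeter ** 2

# Area and perimeter of the first kernel in the data set.
c = compactness(15.26, 14.84)
print(round(c, 3))  # 0.871, matching the compactness column of that row
```

Values closer to 1 indicate a rounder kernel outline.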

Let's see if we can cluster them into 3 groups with K-means!

In [30]:
from pyspark.sql import SparkSession

In [31]:
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [32]:
dataset = spark.read.csv('seeds_dataset.csv', header=True, inferSchema=True)

In [33]:
dataset.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [34]:
dataset.head(1)

[Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)]

## Format the Data

In [35]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

In [36]:
dataset.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

In [37]:
assembler = VectorAssembler(inputCols=dataset.columns, outputCol='features')

In [38]:
final_data = assembler.transform(dataset)

In [39]:
final_data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)



In [40]:
final_data.select(['features']).show()

+--------------------+
|            features|
+--------------------+
|[15.26,14.84,0.87...|
|[14.88,14.57,0.88...|
|[14.29,14.09,0.90...|
|[13.84,13.94,0.89...|
|[16.14,14.99,0.90...|
|[14.38,14.21,0.89...|
|[14.69,14.49,0.87...|
|[14.11,14.1,0.891...|
|[16.63,15.46,0.87...|
|[16.44,15.25,0.88...|
|[15.26,14.85,0.86...|
|[14.03,14.16,0.87...|
|[13.89,14.02,0.88...|
|[13.78,14.06,0.87...|
|[13.74,14.05,0.87...|
|[14.59,14.28,0.89...|
|[13.99,13.83,0.91...|
|[15.69,14.75,0.90...|
|[14.7,14.21,0.915...|
|[12.72,13.57,0.86...|
+--------------------+
only showing top 20 rows
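
Conceptually, `VectorAssembler` packs the chosen input columns of each row into a single feature vector, in the order given by `inputCols`. The plain-Python sketch below mimics that for the first row of the data set. The `assemble` helper is hypothetical and only stands in for the idea. Spark itself produces a `Vector` column, not a Python list.

```python
# Column order as reported by dataset.columns.
columns = ['area', 'perimeter', 'compactness', 'length_of_kernel',
           'width_of_kernel', 'asymmetry_coefficient', 'length_of_groove']

# First row of the data set, as returned by dataset.head(1).
row = {'area': 15.26, 'perimeter': 14.84, 'compactness': 0.871,
       'length_of_kernel': 5.763, 'width_of_kernel': 3.312,
       'asymmetry_coefficient': 2.221, 'length_of_groove': 5.22}

def assemble(row, input_cols):
    """Hypothetical helper: pack the chosen columns into one feature list."""
    return [float(row[name]) for name in input_cols]

features = assemble(row, columns)
print(features)
```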



## Scale the Data
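It is a good idea to scale our data before clustering. K-means relies on Euclidean distance, so features with large numeric values (such as area, around 15) would otherwise dominate features with small values (such as compactness, below 1). Distance-based methods are also sensitive to the curse of dimensionality: https://en.wikipedia.org/wiki/Curse_of_dimensionality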
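
To make the effect of feature scale on distances concrete, here is a small plain-Python sketch. The two kernels and their (area, compactness) values are made up for illustration and are not taken from the data set.

```python
# Two made-up kernels described by (area, compactness); illustrative values only.
kernel_a = (15.0, 0.87)
kernel_b = (12.0, 0.90)

# Per-feature contribution to the squared Euclidean distance.
contributions = [(x - y) ** 2 for x, y in zip(kernel_a, kernel_b)]
area_share = contributions[0] / sum(contributions)

print(contributions)
print(f"share of the squared distance due to area: {area_share:.4f}")
```

Almost all of the distance comes from area, so without scaling the compactness column would have next to no influence on the clusters.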

In [41]:
from pyspark.ml.feature import StandardScaler

In [42]:
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')

In [43]:
# Compute summary statistics by fitting the StandardScaler
scaler_model = scaler.fit(final_data)

In [44]:
final_data = scaler_model.transform(final_data)
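
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The sketch below reproduces that idea in plain Python on a few made-up rows. The `scale_columns` helper and the row values are assumptions for illustration, not Spark's implementation.

```python
import statistics

def scale_columns(rows):
    """Divide each column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    stds = [statistics.stdev(col) for col in columns]
    return [[value / std for value, std in zip(row, stds)] for row in rows]

# Made-up (area, compactness) rows, for illustration only.
rows = [[15.0, 0.87], [12.0, 0.90], [18.0, 0.84]]
scaled = scale_columns(rows)

for original, result in zip(rows, scaled):
    print(original, '->', [round(v, 3) for v in result])
```

Because the mean is not subtracted, scaled values stay positive and are not centered on 0, which is consistent with the scaled area values of roughly 5 in the notebook output.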

In [45]:
final_data.select(['features', 'scaledFeatures']).show()

+--------------------+--------------------+
|            features|      scaledFeatures|
+--------------------+--------------------+
|[15.26,14.84,0.87...|[5.24452795332028...|
|[14.88,14.57,0.88...|[5.11393027165175...|
|[14.29,14.09,0.90...|[4.91116018695588...|
|[13.84,13.94,0.89...|[4.75650503761158...|
|[16.14,14.99,0.90...|[5.54696468981581...|
|[14.38,14.21,0.89...|[4.94209121682475...|
|[14.69,14.49,0.87...|[5.04863143081749...|
|[14.11,14.1,0.891...|[4.84929812721816...|
|[16.63,15.46,0.87...|[5.71536696354628...|
|[16.44,15.25,0.88...|[5.65006812271202...|
|[15.26,14.85,0.86...|[5.24452795332028...|
|[14.03,14.16,0.87...|[4.82180387844584...|
|[13.89,14.02,0.88...|[4.77368894309428...|
|[13.78,14.06,0.87...|[4.73588435103234...|
|[13.74,14.05,0.87...|[4.72213722664617...|
|[14.59,14.28,0.89...|[5.01426361985209...|
|[13.99,13.83,0.91...|[4.80805675405968...|
|[15.69,14.75,0.90...|[5.39230954047151...|
|[14.7,14.21,0.915...|[5.05206821191403...|
|[12.72,13.57,0.86...|[4.3715855

## Train the Model and Evaluate

In [51]:
# Trains a k-means model.
kmeans = KMeans(featuresCol='scaledFeatures', k=3)

In [47]:
model = kmeans.fit(final_data)
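
For intuition about what fitting a k-means model involves, here is a minimal plain-Python sketch of Lloyd's algorithm, the classic assign-then-update loop. It is a simplification under stated assumptions: the starting centers are passed in explicitly, whereas Spark's `KMeans` picks them itself (its default `initMode` is `k-means||`) and runs distributed. The names `closest` and `lloyd` and the sample points are hypothetical.

```python
import math

def closest(point, centers):
    """Index of the center nearest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def lloyd(points, initial_centers, iterations=20):
    """Alternate between assigning points and recomputing centers."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            [sum(dim) / len(group) for dim in zip(*group)] if group else center
            for group, center in zip(groups, centers)
        ]
    return centers

# Made-up 2-D points forming two well-separated groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
centers = lloyd(points, initial_centers=[points[0], points[-1]])
labels = [closest(p, centers) for p in points]
print(labels)  # [0, 0, 0, 1, 1, 1]
```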

In [52]:
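# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
# Note: computeCost() was deprecated in Spark 2.4 and removed in Spark 3.0;
# on newer versions, model.summary.trainingCost gives this cost for the training data.
print('WSSSE')
print(model.computeCost(final_data))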

WSSSE
428.60820118716356
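
The WSSSE is the sum, over all points, of the squared distance from each point to its nearest cluster center. A minimal plain-Python sketch of that definition follows. The `wssse` helper and the sample points are made up for illustration and are not Spark's code.

```python
def wssse(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers
        )
    return total

# Made-up points on a line, for illustration only.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]

print(wssse(points, [[1.0, 0.0], [11.0, 0.0]]))  # 4.0 with two centers
print(wssse(points, [[6.0, 0.0]]))               # 104.0 with one center
```

Adding well-placed centers lowers the WSSSE, so the value is most useful for comparing runs with the same `k` or for looking at how quickly it drops as `k` grows.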


In [49]:
print("Clusters' centers: ")
for center in model.clusterCenters():
    print(center)

Clusters' centers: 
[  6.35645488  12.40730852  37.41990178  13.93860446   9.7892399
   2.41585013  12.29286107]
[  4.07497225  10.14410142  35.89816849  11.80812742   7.54416916
   3.15410901  10.38031464]
[  4.96198582  10.97871333  37.30930808  12.44647267   8.62880781
   1.80061978  10.41913733]


In [50]:
features_and_predictions = model.transform(final_data).select(['features', 'prediction'])
features_and_predictions.show()

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|[15.26,14.84,0.87...|         2|
|[14.88,14.57,0.88...|         2|
|[14.29,14.09,0.90...|         2|
|[13.84,13.94,0.89...|         2|
|[16.14,14.99,0.90...|         2|
|[14.38,14.21,0.89...|         2|
|[14.69,14.49,0.87...|         2|
|[14.11,14.1,0.891...|         2|
|[16.63,15.46,0.87...|         0|
|[16.44,15.25,0.88...|         2|
|[15.26,14.85,0.86...|         2|
|[14.03,14.16,0.87...|         2|
|[13.89,14.02,0.88...|         2|
|[13.78,14.06,0.87...|         2|
|[13.74,14.05,0.87...|         2|
|[14.59,14.28,0.89...|         2|
|[13.99,13.83,0.91...|         2|
|[15.69,14.75,0.90...|         2|
|[14.7,14.21,0.915...|         2|
|[12.72,13.57,0.86...|         1|
+--------------------+----------+
only showing top 20 rows
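
Since each variety contributes 70 kernels, a clustering that recovers the varieties well should produce three clusters of roughly similar size. The sketch below tallies cluster sizes in plain Python, using only the 20 predictions displayed above. On the full DataFrame the same tally could likely be obtained by grouping on the `prediction` column, for example with `groupBy('prediction').count()`.

```python
from collections import Counter

# Predictions for the 20 rows displayed by features_and_predictions.show().
shown_predictions = [2, 2, 2, 2, 2, 2, 2, 2, 0, 2,
                     2, 2, 2, 2, 2, 2, 2, 2, 2, 1]

counts = Counter(shown_predictions)
for cluster, size in sorted(counts.items()):
    print(f"cluster {cluster}: {size} rows")
```

Cluster labels are arbitrary, so cluster 2 does not correspond to any particular variety by name, and the labels may differ between runs unless a `seed` is set on `KMeans`.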

