We'll be working with a real data set about wheat seeds from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/seeds.

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was obtained using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:

1. area A,
2. perimeter P,
3. compactness C = 4πA/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.

All of these parameters are real-valued and continuous.
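
The compactness formula can be checked by hand. The sketch below is plain Python (no Spark needed); the helper name `compactness` is made up for illustration. It uses the area and perimeter from the first row of the data set and compares the result with a perfect circle, for which the formula gives exactly 1.

```python
import math

def compactness(area: float, perimeter: float) -> float:
    """Return C = 4*pi*A / P^2, which equals 1 for a perfect circle."""
    return 4 * math.pi * area / perimeter ** 2

# First row of the data set: area=15.26, perimeter=14.84
c_first_row = compactness(15.26, 14.84)
print(round(c_first_row, 3))

# Sanity check: a circle of radius 2 has A = pi*r^2 and P = 2*pi*r
c_circle = compactness(math.pi * 2 ** 2, 2 * math.pi * 2)
print(round(c_circle, 3))
```

The first value agrees with the `compactness` column of that row after rounding to three decimals, so the column is derived from the other two rather than being an independent measurement.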

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName('seed').getOrCreate()

In [0]:
data = spark.read.csv('dbfs:/FileStore/seeds_dataset.csv', inferSchema=True, header=True)

In [0]:
data.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355|             5.175|
|14.38|    14.21|     0.8951|             5.386|             3.312|   2.4619999999999997|             4.956|
|14.69|    14.49|  

In [0]:
data.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



In [0]:
print((data.count(), len(data.columns)))

(210, 7)


In [0]:
# Count missing (null) values per column
from pyspark.sql.functions import col, sum as _sum

missing_val = data.select([_sum(col(c).isNull().cast('int')).alias(c) for c in data.columns])
missing_val.show()

+----+---------+-----------+----------------+---------------+---------------------+----------------+
|area|perimeter|compactness|length_of_kernel|width_of_kernel|asymmetry_coefficient|length_of_groove|
+----+---------+-----------+----------------+---------------+---------------------+----------------+
|   0|        0|          0|               0|              0|                    0|               0|
+----+---------+-----------+----------------+---------------+---------------------+----------------+



In [0]:
# Find duplicated rows: exceptAll keeps the extra copies that dropDuplicates removes
duplicated = data.exceptAll(data.dropDuplicates())
duplicated.show()

+----+---------+-----------+----------------+---------------+---------------------+----------------+
|area|perimeter|compactness|length_of_kernel|width_of_kernel|asymmetry_coefficient|length_of_groove|
+----+---------+-----------+----------------+---------------+---------------------+----------------+
+----+---------+-----------+----------------+---------------+---------------------+----------------+
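
Why the `exceptAll` / `dropDuplicates` combination surfaces duplicates: `exceptAll` is a multiset difference, so subtracting one copy of every distinct row leaves only the surplus copies. Here is a plain-Python sketch of the same idea using `collections.Counter`; the rows are hypothetical and not taken from the seeds data.

```python
from collections import Counter

# Hypothetical rows (not from the seeds data): the second tuple appears three times
rows = [(1.0, 2.0), (3.0, 4.0), (3.0, 4.0), (3.0, 4.0), (5.0, 6.0)]

all_rows = Counter(rows)            # multiset of every row
distinct_rows = Counter(set(rows))  # one copy of each row, like dropDuplicates()

# Counter subtraction drops non-positive counts, mirroring a multiset difference
surplus = all_rows - distinct_rows
extra_copies = sorted(surplus.elements())
print(extra_copies)
```

A row that occurs n times shows up n - 1 times in the result, so an empty result, as above for the seeds data, means every row is unique.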



## Format for MLlib

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
data.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

In [0]:
assembler = VectorAssembler(inputCols=data.columns, outputCol='features')

In [0]:
output = assembler.transform(data)

## StandardScaler

In [0]:
from pyspark.ml.feature import StandardScaler

In [0]:
# Scale each feature to unit standard deviation; withMean=False leaves the features uncentered
scaler = StandardScaler(inputCol='features', outputCol='scalerfeatures', withStd=True, withMean=False)

In [0]:
final_data = scaler.fit(output).transform(output)
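
To see what `withStd=True, withMean=False` does to a single column, here is a plain-Python sketch. It assumes the scaler divides by the corrected (n - 1) sample standard deviation, and it uses only the five `area` values visible in the first rows of `data.show()`, so the numbers will differ from what Spark computes on all 210 rows. The helper name `scale_to_unit_std` is made up for illustration.

```python
import statistics

def scale_to_unit_std(values):
    """Divide by the sample standard deviation without subtracting the mean."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / std for v in values]

# Area values from the first five rows printed by data.show()
area = [15.26, 14.88, 14.29, 13.84, 16.14]
scaled_area = scale_to_unit_std(area)

print(statistics.stdev(scaled_area))  # spread is now 1 (up to rounding)
print(statistics.mean(scaled_area))   # the mean is rescaled, not moved to 0
```

Because the mean is not subtracted, the scaled features stay well away from zero, which is why the cluster centers printed further down are not centered around the origin.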

## Clustering model and silhouette score evaluation

In [0]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [0]:
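silhouette_score = []
k_values = range(2, 12)

# Note: ClusteringEvaluator defaults to featuresCol='features', so the silhouette
# is computed on the unscaled vectors, not on 'scalerfeatures'.
evalcluster = ClusteringEvaluator()

for k in k_values:
    # Fit k-means on the scaled features and assign each row to a cluster
    kmeans = KMeans(featuresCol='scalerfeatures').setK(k)
    model = kmeans.fit(final_data)
    predict = model.transform(final_data)

    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    silhouette = evalcluster.evaluate(predict)
    silhouette_score.append(silhouette)
    print(f'k : {k}, silhouette score : {silhouette}')
    print('*' * 100)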

k : 2, silhouette score : 0.7093882893548508
****************************************************************************************************
k : 3, silhouette score : 0.616267393520126
****************************************************************************************************
k : 4, silhouette score : 0.5118605771341731
****************************************************************************************************
k : 5, silhouette score : 0.3527725468098333
****************************************************************************************************
k : 6, silhouette score : 0.4457065728535895
****************************************************************************************************
k : 7, silhouette score : 0.36862594319683506
****************************************************************************************************
k : 8, silhouette score : 0.3513475917622866
*******************************************************************************
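
Picking the final k from these numbers is an argmax over the scores. The sketch below copies only the values visible in the output above (k = 2 to 8) into a dictionary and selects the k with the highest silhouette.

```python
# Silhouette scores copied from the output above (k = 2..8)
scores = {
    2: 0.7093882893548508,
    3: 0.616267393520126,
    4: 0.5118605771341731,
    5: 0.3527725468098333,
    6: 0.4457065728535895,
    7: 0.36862594319683506,
    8: 0.3513475917622866,
}

# The key with the largest score is the candidate for the final model
best_k = max(scores, key=scores.get)

# Full ranking, best first, to see how close the runners-up are
ranking = sorted(scores, key=scores.get, reverse=True)
print(best_k, ranking)
```

The silhouette is a heuristic, not ground truth: the data set description lists three wheat varieties, and k=3 is the runner-up here, so it is a reasonable alternative to compare against.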

## Final cluster model k=2

In [0]:
kmeans = KMeans(featuresCol='scalerfeatures', k=2)

In [0]:
model = kmeans.fit(final_data)

In [0]:
prediction = model.transform(final_data)

In [0]:
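# Evaluate the final model. ClusteringEvaluator defaults to featuresCol='features',
# so the score is computed on the unscaled vectors.
evalcluster = ClusteringEvaluator()
final_silhouette = evalcluster.evaluate(prediction)
print(f'Silhouette score cluster k_2 : {final_silhouette}')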

Silhouette score cluster k_2 : 0.7093882893548508
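
For intuition about what this number measures, here is a plain-Python sketch of the textbook silhouette definition. It is not Spark's implementation; it uses squared Euclidean distance on the assumption that this matches the evaluator's default distance measure, and the four points are hypothetical. It needs at least two clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(sq_dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: two tight, well-separated pairs of points
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = mean_silhouette(points, [0, 0, 1, 1])  # sensible grouping
bad = mean_silhouette(points, [0, 1, 0, 1])   # each cluster mixes both pairs
print(good, bad)
```

A sensible grouping scores close to 1, while a grouping that mixes the two pairs goes negative.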


In [0]:
centers = model.clusterCenters()
for center in centers:
  print(center)

[ 4.42210624 10.46640451 36.50749337 12.04573012  7.98091715  2.56066803
 10.33364421]
[ 6.20884577 12.25651292 37.43485358 13.77282897  9.67731721  2.2989371
 12.09236686]
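
These centers live in the scaled feature space, and `model.transform` assigns each row to the closest one. A plain-Python sketch of that assignment rule, using the two centers printed above and squared Euclidean distance; the helper name `nearest_center` and the test vector are made up for illustration.

```python
# Cluster centers copied from the output above (scaled feature space)
centers = [
    [4.42210624, 10.46640451, 36.50749337, 12.04573012, 7.98091715, 2.56066803, 10.33364421],
    [6.20884577, 12.25651292, 37.43485358, 13.77282897, 9.67731721, 2.2989371, 12.09236686],
]

def nearest_center(vector, centers):
    """Return the index of the center with the smallest squared Euclidean distance."""
    distances = [
        sum((v - c) ** 2 for v, c in zip(vector, center))
        for center in centers
    ]
    return distances.index(min(distances))

# Hypothetical scaled vector: center 1 shifted slightly on every feature
candidate = [c - 0.1 for c in centers[1]]
assigned = nearest_center(candidate, centers)
print(assigned)
```

Comparing the two centers feature by feature, cluster 1 has the larger scaled area, perimeter, length, width and groove length, so it groups the bigger kernels.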


In [0]:
prediction.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   80|
|         0|  130|
+----------+-----+



In [0]:
prediction.select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         1|
|         0|
|         0|
|         0|
|         1|
|         1|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         1|
|         0|
|         0|
+----------+
only showing top 20 rows


