## DS/CMPSC 410 MiniProject #2

### Spring 2020
### Instructor: John Yen
### TA: Dongkuan Xu and Rupesh Prajapati
### Learning Objectives
- Be able to apply k-means clustering to the Darknet dataset
- Be able to choose a set of top ports for one-hot encoding the set of ports scanned by a scanner.
- Be able to interpret the features of the cluster centers generated
- Be able to compare the results of k-means clustering for different values of k using the Silhouette score.

### Total points: 100 
- Exercise 1: 5 points
- Exercise 2: 5 points 
- Exercise 3: 10 points 
- Exercise 4: 10 points
- Exercise 5: 20 points
- Exercise 6: 10 points
- Exercise 7: 15 points
- Exercise 8: 25 points
  
### Due: 5 pm, April 14, 2021

In [1]:
import pyspark
import csv

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql.functions import array_contains
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [3]:
ss = SparkSession.builder.master("local").appName("ClusteringOHE").getOrCreate()

## Exercise 1 (5 points)
Complete the path for input file in the code below and enter your name in this Markdown cell:
- Name: Kangdong Yuan

In [4]:
Scanners_df = ss.read.csv("/storage/home/kky5082/ds410/Lab10/sampled_profile.csv", header= True, inferSchema=True )

## We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

In [5]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



# Part A: One Hot Encoding
We want to apply one hot encoding to the set of ports scanned by scanners.  
- A.1 Like Mini Project deliverable 1, we first convert the feature "ports_scanned_str" to a feature that is an Array of ports
- A.2 We then calculate the total number of scanners for each port
- A.3 We identify the top n ports to use for one-hot encoding (you choose the number n).
- A.4 Generate one-hot encoded features for these top n ports.

In [6]:
Scanners_df.select("ports_scanned_str").show(30)

+--------------------+
|   ports_scanned_str|
+--------------------+
|               13716|
|         17128-17136|
|               35134|
|               17140|
|               54594|
|               17130|
|               54594|
|               37876|
|               17142|
|17128-17130-17132...|
|               54594|
|               12941|
|               30188|
|23-80-81-1023-232...|
|               54594|
|17128-17132-17136...|
|               17136|
|               54594|
|               17134|
|                 445|
|               34226|
|               17130|
|               17134|
|           137-17130|
|               17142|
|               17142|
|17128-17130-17132...|
|                  23|
|               54594|
|               54594|
+--------------------+
only showing top 30 rows



In [7]:
Scanners_df2=Scanners_df.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )
Scanners_df2.persist().show(10)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|   ports_scanned_str|host_tags_per_censys|host_services_per_censys|         Ports_Array|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+
|1645181|1645181|       1|     0.0|   60|      1|                60|             1|             1|               1|         

## A.1 We only need the column ```Ports_Array``` to calculate the top ports being scanned

In [8]:
Ports_Scanned_RDD = Scanners_df2.select("Ports_Array").rdd

In [9]:
Ports_Scanned_RDD.persist().take(5)

[Row(Ports_Array=['13716']),
 Row(Ports_Array=['17128', '17136']),
 Row(Ports_Array=['35134']),
 Row(Ports_Array=['17140']),
 Row(Ports_Array=['54594'])]

## Because each port number occurs only once in the Ports_Array column of each row, we can count the total occurrences of each port number by flattening the arrays with flatMap.

In [10]:
Ports_list_RDD = Ports_Scanned_RDD.map(lambda row: row[0] )

In [11]:
Ports_list_RDD.persist()

PythonRDD[27] at RDD at PythonRDD.scala:53

In [12]:
Ports_list2_RDD = Ports_Scanned_RDD.flatMap(lambda row: row[0] )

In [13]:
Port_count_RDD = Ports_list2_RDD.map(lambda x: (x, 1))
Port_count_RDD.take(2)

[('13716', 1), ('17128', 1)]

In [14]:
Port_count_total_RDD = Port_count_RDD.reduceByKey(lambda x,y: x+y, 1)
Port_count_total_RDD.persist().take(5)

[('13716', 14),
 ('17128', 31850),
 ('17136', 31617),
 ('35134', 13),
 ('17140', 31865)]
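
The counting steps above (flatten each row's port list, pair every port with 1, then sum per key) can be mirrored in plain Python as a rough sketch. The sample rows are copied from the `take(5)` output above; `chain.from_iterable` stands in for `flatMap` and `Counter` for `map` plus `reduceByKey`. This only illustrates the logic, not how Spark distributes the work.

```python
from collections import Counter
from itertools import chain

# First five rows of Ports_Array, copied from the take(5) output above.
rows = [["13716"], ["17128", "17136"], ["35134"], ["17140"], ["54594"]]

# flatMap: flatten the per-scanner lists into a single stream of ports.
flat_ports = list(chain.from_iterable(rows))

# map + reduceByKey: count how many rows mention each port.
port_counts = Counter(flat_ports)

# Equivalent of sorting (count, port) pairs in descending order of count.
sorted_ports = sorted(port_counts.items(), key=lambda kv: kv[1], reverse=True)

print(flat_ports)
print(sorted_ports)
```

With only five rows every port has a count of 1; on the full dataset the same steps produce the totals shown above.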

## Exercise 2 (5%)
Find the total number of ports being scanned.

In [15]:
Port_count_total_RDD.count()

65536

 ## Exercise 2 Answer: 65536
Type your answer after you find out the answer from completing and executing the Pyspark code above.

In [16]:
Sorted_Count_Port_RDD = Port_count_total_RDD.map(lambda x: (x[1], x[0])).sortByKey( ascending = False)

In [17]:
Sorted_Count_Port_RDD.persist().take(50)

[(32014, '17132'),
 (31865, '17140'),
 (31850, '17128'),
 (31805, '17138'),
 (31630, '17130'),
 (31617, '17136'),
 (29199, '23'),
 (25466, '445'),
 (25216, '54594'),
 (21700, '17142'),
 (21560, '17134'),
 (15010, '80'),
 (13698, '8080'),
 (8778, '0'),
 (6265, '2323'),
 (5552, '5555'),
 (4930, '81'),
 (4103, '1023'),
 (4058, '52869'),
 (4012, '8443'),
 (3954, '49152'),
 (3885, '7574'),
 (3874, '37215'),
 (3318, '34218'),
 (3279, '34220'),
 (3258, '33968'),
 (3257, '34224'),
 (3253, '34228'),
 (3252, '33962'),
 (3236, '33960'),
 (3209, '33964'),
 (3179, '34216'),
 (3167, '34226'),
 (3155, '33970'),
 (3130, '33972'),
 (2428, '50401'),
 (1954, '34222'),
 (1921, '34230'),
 (1919, '33966'),
 (1819, '33974'),
 (1225, '3389'),
 (1064, '1433'),
 (885, '22'),
 (878, '5353'),
 (604, '21'),
 (594, '8291'),
 (554, '8728'),
 (512, '443'),
 (382, '5900'),
 (330, '8000')]

## Exercise 3 (10%)
Select top_ports to be the number of top ports you want to use for one-hot encoding.  I recommend a number between 20 and 40.

In [18]:
top_ports= 30
Sorted_Ports_RDD= Sorted_Count_Port_RDD.map(lambda x: x[1])
Top_Ports_list = Sorted_Ports_RDD.take(top_ports)

In [19]:
Top_Ports_list

['17132',
 '17140',
 '17128',
 '17138',
 '17130',
 '17136',
 '23',
 '445',
 '54594',
 '17142',
 '17134',
 '80',
 '8080',
 '0',
 '2323',
 '5555',
 '81',
 '1023',
 '52869',
 '8443',
 '49152',
 '7574',
 '37215',
 '34218',
 '34220',
 '33968',
 '34224',
 '34228',
 '33962',
 '33960']

In [20]:
Top_Ports_list[0]

'17132'

In [21]:
FeatureName = "Port"+Top_Ports_list[0]

In [22]:
FeatureName

'Port17132'

In [23]:
from pyspark.sql.functions import array_contains

In [24]:
Scanners_df3=Scanners_df2.withColumn(FeatureName, array_contains("Ports_Array", Top_Ports_list[0]))

In [25]:
Scanners_df3.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Port17132: 

In [26]:
Scanners_df3.show(10)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|   ports_scanned_str|host_tags_per_censys|host_services_per_censys|         Ports_Array|Port17132|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+
|1645181|1645181|       1|     0.0|   60|      1|                60|             1|           

## Exercise 4 (10%)
Check whether one-hot encoding of the first top port is encoded correctly. Complete and execute the code below. Fill your answer in the Markdown cell for Solution for Exercise 4.

In [27]:
First_top_port_scanners_count = Scanners_df3.where(col("Port17132") == True).rdd.count()

In [28]:
print(First_top_port_scanners_count)

32014


## Answer for Exercise 4:
- The total number of scanners that scan the first top port is:
## 32014
- Is this number the same as what you saw from Sorted_Count_Port_RDD?  
## yes (32014 is also the count listed for port 17132 in Sorted_Count_Port_RDD)
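
Why the two counts should agree: a port appears at most once in a scanner's `Ports_Array`, so the number of rows where `array_contains` is true equals the number of times that port shows up after flattening. A small sketch on made-up rows (the `toy_rows` data is hypothetical, not taken from the dataset):

```python
from collections import Counter

# Hypothetical scanners; each inner list plays the role of one Ports_Array.
toy_rows = [["17132"], ["23", "80"], ["17132", "23"], ["445"], ["17132"]]
port = "17132"

# Count via flattening, as done for Port_count_total_RDD.
flat_count = Counter(p for row in toy_rows for p in row)[port]

# Count via the one-hot column, as done with Scanners_df3.where(...).
one_hot = [port in row for row in toy_rows]
one_hot_count = sum(one_hot)

print(flat_count, one_hot_count)
```

If a port could repeat within a single row, the flattened count would exceed the one-hot count, so the equality also serves as a sanity check on the data.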

## A.4 Generate One-Hot Encoded Features for each of the top ports in the Top_Ports_list

- Iterate through the Top_Ports_list so that each top port is one-hot encoded.

## Exercise 5 (20%)
Complete the following PySpark code for encoding the n ports using One Hot Encoding, where n is specified by the variable ```top_ports```

In [29]:
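# Note: range(0, top_ports-1) visits indices 0..top_ports-2, so this loop encodes
# top_ports-1 ports; range(top_ports) would encode all top_ports of them.
for i in range(0, top_ports-1):
    # "Port" + Top_Ports_list[i] is the name of each new feature created through one-hot encoding
    Scanners_df3 = Scanners_df2.withColumn("Port" + Top_Ports_list[i], array_contains("Ports_Array", Top_Ports_list[i]))
    Scanners_df2 = Scanners_df3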

In [30]:
Scanners_df2.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Port17132: 
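
One detail worth checking in the loop above is how many ports it visits: `range(0, top_ports-1)` stops one short of `top_ports`, which matches the 29 `Port...` names in the feature list printed further down. The sketch below reproduces the encoding on hypothetical rows with a plain-Python helper (`one_hot_rows` is an illustrative name, not a PySpark function) and contrasts the two ranges.

```python
def one_hot_rows(rows, ports):
    """Return one dict per row mapping 'Port<p>' to True/False."""
    return [{"Port" + p: (p in row) for p in ports} for row in rows]

# Hypothetical data: three scanners and a short ranked list of top ports.
rows = [["23", "80"], ["17132"], ["445", "23"]]
top_ports_list = ["17132", "23", "445", "80"]
top_ports = len(top_ports_list)

# range(0, top_ports - 1) skips the last port in the list ...
short = one_hot_rows(rows, [top_ports_list[i] for i in range(0, top_ports - 1)])
# ... while range(top_ports) covers all of them.
full = one_hot_rows(rows, [top_ports_list[i] for i in range(top_ports)])

print(sorted(short[0]))  # feature names produced by the shorter range
print(sorted(full[0]))   # feature names produced by the full range
```

Whichever range is used, the loop that builds the one-hot columns and the loop that builds the `VectorAssembler` input list need to agree, otherwise the assembler will ask for a column that does not exist.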

## Exercise 6 (10 points)
Use k-means to cluster the scanners using the one-hot-encoded feature and the following input features:
- numports  : The total number of ports scanned by each scanner.
- lifetime  : The average lifetime of scanners.
- Packets   : The average number of packets scanned by each scanner.
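
As a reminder of what the `KMeans` estimator fits, here is a minimal plain-Python sketch of Lloyd's algorithm: assign each point to its nearest center by Euclidean distance, then move each center to the mean of its points. It is not Spark's implementation, which defaults to k-means|| initialization and runs distributed; the points, the starting centers, and the helper name `lloyd_kmeans` are all made up for illustration.

```python
import math

def lloyd_kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations; returns (centers, assignments)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        assignments = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assignments

# Hypothetical (numports, Port23) pairs: two low-port scanners, two high-port ones.
points = [(1.0, 0.0), (2.0, 1.0), (40.0, 1.0), (42.0, 0.0)]
centers, labels = lloyd_kmeans(points, centers=[(0.0, 0.0), (50.0, 0.0)])
print(centers, labels)
```

In this toy data the `numports` coordinate dominates the distance and the 0/1 coordinate barely matters. Something similar may happen with the real features, since `lifetime` and `Packets` can be many orders of magnitude larger than the one-hot port columns.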

## Specify Parameters for k Means Clustering

In [31]:
km = KMeans(featuresCol="features", predictionCol="prediction").setK(50).setSeed(123)
km.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 50)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction)\nseed: random seed. (default: 2393636147997110994, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [32]:
input_features = ["numports", "lifetime", "Packets"]
for i in range(0, top_ports - 1):
    input_features.append( "Port"+Top_Ports_list[i] )

In [33]:
print(input_features)

['numports', 'lifetime', 'Packets', 'Port17132', 'Port17140', 'Port17128', 'Port17138', 'Port17130', 'Port17136', 'Port23', 'Port445', 'Port54594', 'Port17142', 'Port17134', 'Port80', 'Port8080', 'Port0', 'Port2323', 'Port5555', 'Port81', 'Port1023', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port34218', 'Port34220', 'Port33968', 'Port34224', 'Port34228', 'Port33962']


In [34]:
va = VectorAssembler().setInputCols(input_features).setOutputCol("features")

In [35]:
data= va.transform(Scanners_df2)

In [36]:
data.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: boolean, Port33962: boo

In [37]:
kmModel=km.fit(data)

In [38]:
kmModel

KMeansModel: uid=KMeans_8653e5d89c4f, k=30, distanceMeasure=euclidean, numFeatures=32

In [39]:
predictions = kmModel.transform(data)

In [40]:
predictions.persist().show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+--------------------+----------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_scanned_str|host_tags_per_censys|host_services_per_censys|   Ports_Array|Port17132|Port17140|Port17128|Port17138|Port17130|Port17136|Port23|Port445|Port54594|Port17142|Port17134|Port80|Port8080|Port0|Port232

In [41]:
Cluster1_df=predictions.where(col("prediction")==0)

In [42]:
Cluster1_df.persist().count()

217137

In [43]:
summary = kmModel.summary

In [44]:
summary.clusterSizes

[217137,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 9593,
 1,
 1,
 235,
 1,
 1,
 1,
 1,
 6,
 1,
 1,
 62,
 1,
 1,
 1,
 1,
 1,
 1,
 4,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]
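
The size list is easier to read once the empty slots are dropped. The sketch below copies the 30 non-zero sizes from the output above (the 20 trailing zeros are left out) and reports how many clusters are non-empty and which two are largest, using 1-based cluster numbers.

```python
# Non-zero entries of summary.clusterSizes, copied from the output above.
sizes = [217137, 1, 1, 1, 2, 1, 1, 1, 1, 9593,
         1, 1, 235, 1, 1, 1, 1, 6, 1, 1,
         62, 1, 1, 1, 1, 1, 1, 4, 1, 1]

non_empty = sum(1 for s in sizes if s > 0)
total = sum(sizes)

# Rank clusters by size; enumerate from 1 so numbers read as "cluster 1", "cluster 10".
ranked = sorted(enumerate(sizes, start=1), key=lambda kv: kv[1], reverse=True)
largest_two = ranked[:2]
singletons = sum(1 for s in sizes if s == 1)

print(non_empty, total, largest_two, singletons)
```

The count of non-empty clusters agrees with the `k=30` reported by the fitted model, even though `setK(50)` was requested.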

In [45]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

In [46]:
print('Silhouette Score of the Clustering Result is ', silhouette)

Silhouette Score of the Clustering Result is  0.9401983976474807


In [47]:
centers = kmModel.clusterCenters()

In [48]:
print("Cluster Centers:")
i=0
for center in centers:
    print("Cluster ", str(i+1), center, len(center))
    i = i+1

Cluster Centers:
Cluster  1 [2.95576986e+00 1.99681251e+03 9.40357608e+01 1.38355048e-01
 1.37701083e-01 1.37673450e-01 1.37461603e-01 1.36623422e-01
 1.36535920e-01 1.09999678e-01 1.14545195e-01 1.15816282e-01
 9.11498271e-02 9.05557321e-02 4.90105325e-02 4.34241976e-02
 3.84826170e-02 1.88176128e-02 1.79471946e-02 1.31299594e-02
 1.24069136e-02 1.22595412e-02 1.21536173e-02 1.19233479e-02
 1.19049264e-02 1.16378139e-02 1.14904415e-02 1.13200422e-02
 1.12371452e-02 1.11496429e-02 1.11956967e-02 1.12509614e-02] 32
Cluster  2 [3.82490000e+04 5.39254186e+08 2.29895860e+07 0.00000000e+00
 0.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
 1.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
 1.00000000e+00 1.00000000e+00 1.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 1.00000000e+00 1.00000000e+00
 1.00000000e+00 1.00000000e+00 1.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 1.00000000e+00 1.00000000e+00
 1.00000000e+00 0.00000000e+00 1.00000000e

## Exercise 7 Analyze the result of k-means clustering (k = 50) (15 points)
- (a) Compute the Silhouette Score of the clustering result. (5 points)
- (b) Describe the characteristics of the scanners in the two largest clusters in this clustering result. What characteristics distinguish them? (10 points)

## Answer for Exercise 7
- (a) 0.9401983976474807
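- (b) The two largest clusters are cluster 1 (217,137 scanners) and cluster 10 (9,593 scanners). Their centers are not near the origin, and the coordinates of the centers are not close to exactly 0 or 1. Cluster 1's center, for example, has numports of about 2.96, lifetime of about 1997, and Packets of about 94, with each port feature between roughly 0.01 and 0.14.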
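
For reference, the silhouette of a point compares its mean distance to its own cluster, a, with its mean distance to the nearest other cluster, b, as (b - a) / max(a, b); the score is the average over all points. The sketch below computes this from the definition on made-up 1-D data using plain absolute distance. It is meant only to show what the number measures: Spark's `ClusteringEvaluator` defaults to squared Euclidean distance and a more scalable computation, so its values are not directly comparable. `silhouette_score` is a hypothetical helper name.

```python
def silhouette_score(points, labels):
    """Mean silhouette over all points (1-D points, absolute-difference distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention in this sketch: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster.
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two tight, well-separated groups.
points = [1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(round(score, 4))
```

A score close to 1 means points sit much nearer to their own cluster than to any other. With one cluster holding most of the scanners, a high average can largely reflect that single cluster rather than the quality of the small ones.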

## Exercise 8 Perform k-means clustering for a different choice of the value of k  (25  points)
- a) INCREASE the value of k to a value of your choice. (10 points)
- b) Compare the "Silhouette Score" of this clustering result with that for k=30. (5 points)
- c) Compare the top two clusters generated with this k value with those generated with k=30. (10 points)

## 8(a) Set k = 120; the fitted model reports k = 42

In [58]:
# The estimator is named km120 to match the requested k (it was previously called km30).
km120 = KMeans(featuresCol="features", predictionCol="prediction").setK(120).setSeed(123)
kmModel1 = km120.fit(data)
predictions1 = kmModel1.transform(data)
silhouette1 = evaluator.evaluate(predictions1)
kmModel1

KMeansModel: uid=KMeans_09a15781ed34, k=42, distanceMeasure=euclidean, numFeatures=32

In [59]:
silhouette1

0.9100791604264801

In [61]:
summary1 = kmModel1.summary
summary1.clusterSizes

[202683,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 15758,
 1,
 1,
 1,
 32,
 6916,
 1,
 1,
 1,
 7,
 37,
 1,
 2,
 81,
 168,
 1,
 1,
 16,
 1,
 1,
 7,
 48,
 1,
 7,
 1,
 1,
 1,
 1206,
 2,
 66,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [60]:
centers1 = kmModel1.clusterCenters()
print("Cluster Centers:")
i=0
# Iterate over centers1 (the k=120 model); iterating over centers would reprint the first model's centers.
for center in centers1:
    print("Cluster ", str(i+1), center, len(center))
    i = i+1

## Answer for Exercise 8 
- (b) The silhouette score for k=30 is 0.9401, and the silhouette score for the higher k=42 is 0.9100, so the score decreases as k increases.
- (c) The top two clusters follow a similar pattern for both values of k. In each run one cluster dominates (217,137 scanners in cluster 1 for the first run, 202,683 scanners in cluster 1 for k=120) and the second-largest cluster is far smaller (9,593 scanners in cluster 10 versus 15,758 scanners in cluster 11). In the first run, the center of the dominant cluster is not near the origin. With the larger k, the dominant cluster shrinks while the second-largest cluster grows.
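
The comparison in (b) and (c) can be made concrete with the numbers already printed above. The sketch below hard-codes the two silhouette values, the two largest cluster sizes from each run, and the total of 227,062 scanners obtained by summing either size list; the variable and key names are illustrative only.

```python
# Values copied from the outputs above; "requested_k" is what setK() was given,
# "fitted_k" is what the fitted KMeansModel reported.
runs = {
    "first": {"requested_k": 50, "fitted_k": 30,
              "silhouette": 0.9401983976474807, "top_two": [217137, 9593]},
    "second": {"requested_k": 120, "fitted_k": 42,
               "silhouette": 0.9100791604264801, "top_two": [202683, 15758]},
}
total_scanners = 227062  # sum of either clusterSizes list

for name, run in runs.items():
    # Fraction of all scanners that falls into each of the two largest clusters.
    run["top_two_share"] = [size / total_scanners for size in run["top_two"]]
    print(name, run["fitted_k"], round(run["silhouette"], 4),
          [round(share, 4) for share in run["top_two_share"]])

silhouette_drop = runs["first"]["silhouette"] - runs["second"]["silhouette"]
print(round(silhouette_drop, 4))
```

Looking at shares rather than raw sizes makes it easier to see how much of the dominant cluster is redistributed when more clusters are allowed.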