## DS/CMPSC 410 MiniProject #3

### Spring 2021
### Instructor: John Yen
### TAs: Rupesh Prajapati and Dongkuan Xu
### Learning Objectives
- Be able to apply thermometer encoding to encode numerical variables into binary variable format.
- Be able to apply k-means clustering to the Darknet dataset based on both thermometer encoding and one-hot encoding.
- Be able to use external labels (e.g., mirai, zmap, and masscan) to evaluate the result of k-means clustering.
- Be able to investigate the characteristics of a cluster using one-hot encoded features.

### Total points: 100 
- Exercise 1: 5 points
- Exercise 2: 5 points 
- Exercise 3: 5 points 
- Exercise 4: 15 points
- Exercise 5: 5 points
- Exercise 6: 10 points
- Exercise 7: 5 points
- Exercise 8: 5 points
- Exercise 9: 10 points
- Exercise 10: 5 points
- Exercise 11: 10 points
- Exercise 12: 20 points
  
### Due: 5 pm, April 23, 2021

In [10]:
import pyspark
import csv

In [11]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql.functions import array_contains
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString, PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [12]:
import pandas as pd
import numpy as np
import math

In [13]:
ss = SparkSession.builder.master("local").appName("ClusteringTE").getOrCreate()

## Exercise 1 (5 points)
Complete the path for input file in the code below and enter your name in this Markdown cell:
- Name: Kangdong Yuan

In [14]:
Scanners_df = ss.read.csv("/storage/home/kky5082/ds410/Lab10/sampled_profile.csv", header= True, inferSchema=True )

## We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

In [15]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



In [16]:
Scanners_df.where(col('mirai')).count()

17132

# Part A: One Hot Encoding 
## This part is identical to that of Miniproject Deliverable #2
We want to apply one hot encoding to the set of ports scanned by scanners.  
- A.1 Like Mini Project deliverable 1 and 2, we first convert the feature "ports_scanned_str" to a feature that is an Array of ports
- A.2 We then calculate the total number of scanners for each port
- A.3 We identify the top n ports to use for one-hot encoding (you choose the number n).
- A.4 Generate one-hot encoded feature for these top n ports.
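
As a rough illustration of steps A.1 to A.4 outside Spark, the plain-Python sketch below runs the same logic on a tiny made-up sample. The sample strings and the helper name `one_hot_top_ports` are hypothetical; only the "-" separator and the "Port" column prefix are taken from this notebook.

```python
from collections import Counter

def one_hot_top_ports(port_strings, n):
    """Return (top_ports, rows); each row maps 'Port<p>' to True/False."""
    # A.1: split each "-"-separated string into a list of ports
    port_arrays = [s.split("-") for s in port_strings]
    # A.2: count scanners per port (assumes a port appears at most once per row)
    counts = Counter(p for ports in port_arrays for p in ports)
    # A.3: keep the n most frequently scanned ports
    top = [p for p, _ in counts.most_common(n)]
    # A.4: one boolean feature per top port
    rows = [{"Port" + p: (p in ports) for p in top} for ports in port_arrays]
    return top, rows

# Hypothetical sample: four scanners and the ports each one scanned
sample = ["23-2323", "23-2323-80", "23", "445-23"]
top, rows = one_hot_top_ports(sample, n=2)
print(top)
print(rows[2])
```

Ports outside the top n (80 and 445 in this sample) get no feature of their own, which is the trade-off made when choosing n.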

In [17]:
# Scanners_df.select("ports_scanned_str").show(30)

In [18]:
Scanners_df2=Scanners_df.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )
# Scanners_df2.persist().show(10)

## A.1 We only need the column ```Ports_Array``` to calculate the top ports being scanned

In [19]:
Ports_Scanned_RDD = Scanners_df2.select("Ports_Array").rdd

In [20]:
# Ports_Scanned_RDD.persist().take(5)

## Because each port number in the Ports_Array column occurs only once per row, we can count the total occurrences of each port number through flatMap.

In [21]:
Ports_list_RDD = Ports_Scanned_RDD.map(lambda row: row[0] )

In [22]:
# Ports_list_RDD.persist()

In [23]:
Ports_list2_RDD = Ports_Scanned_RDD.flatMap(lambda row: row[0] )

In [24]:
Port_count_RDD = Ports_list2_RDD.map(lambda x: (x, 1))
# Port_count_RDD.take(2)

In [25]:
Port_count_total_RDD = Port_count_RDD.reduceByKey(lambda x,y: x+y, 1)
# Port_count_total_RDD.persist().take(5)

In [26]:
Sorted_Count_Port_RDD = Port_count_total_RDD.map(lambda x: (x[1], x[0])).sortByKey( ascending = False)

In [27]:
# Sorted_Count_Port_RDD.persist().take(50)

## Exercise 2 (5%)
Select top_ports to be the number of top ports you want to use for one-hot encoding.  I recommend a number between 20 and 40.

In [28]:
top_ports=30
Sorted_Ports_RDD= Sorted_Count_Port_RDD.map(lambda x: x[1])
Top_Ports_list = Sorted_Ports_RDD.take(top_ports)

In [29]:
# Top_Ports_list

In [30]:
# Scanners_df3=Scanners_df2.withColumn(FeatureName, array_contains("Ports_Array", Top_Ports_list[0]))

In [31]:
# Scanners_df3.show(10)

## A.4 Generate One-Hot Encoded Feature for each of the top ports in the Top_Ports_list

- Iterate through the Top_Ports_list so that each top port is one-hot encoded.

## Exercise 3 (5 %)
Complete the following PySpark code for encoding the n ports using One Hot Encoding, where n is specified by the variable ```top_ports```

In [32]:
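# Note: range(0, top_ports - 1) covers indices 0 .. top_ports - 2, so this loop
# encodes top_ports - 1 ports (29 when top_ports = 30), as the outputs below show.
for i in range(0, top_ports - 1):
    # "Port" + Top_Ports_list[i]  is the name of each new feature created through One Hot Encoding
    Scanners_df3 = Scanners_df2.withColumn("Port" + Top_Ports_list[i], array_contains("Ports_Array", Top_Ports_list[i]))
    Scanners_df2 = Scanners_df3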

In [33]:
Scanners_df2.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Port17132: 

# Part B Thermometer Encoding of Numerical Variables

## We encode the numerical variable numports (number of ports being scanned) using thermometer encoding

In [34]:
pow(2,15)

32768

In [35]:
Scanners_df3=Scanners_df2.withColumn("TE_numports_0", col("numports") > 0) 
Scanners_df2 = Scanners_df3

In [36]:
Scanners_df3.count()

227062

In [37]:
Scanners_df3.where(col('TE_numports_0')).count()

227062

# Exercise 4 (15%)
Complete the following pyspark code to use the column "numports" to create 16 additional columns as follows:
- TE_numports_0 : True, if the scanner scans more than 0 ports, otherwise False.
- TE_numports_1 : True, if the scanner scans more than 2**0 (1) port, otherwise False.
- TE_numports_2 : True, if the scanner scans more than 2**1 (2) ports, otherwise False.
- TE_numports_3 : True, if the scanner scans more than 2**2 (4) ports, otherwise False
        ...
- TE_numports_15 : True, if the scanner scans more than 2**14 ports, otherwise False
- TE_numports_16 : True, if the scanner scans more than 2**15 (32768) ports, otherwise False
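
The column definitions above can be sketched in plain Python as a list of thresholds and one comparison per threshold. This is only an illustration of the encoding rule, not the Spark code the exercise asks for; the function names `thermometer_thresholds` and `thermometer_encode` are hypothetical.

```python
def thermometer_thresholds(num_bits=17):
    """Thresholds for TE_numports_0 .. TE_numports_16: 0, 2**0, 2**1, ..., 2**15."""
    return [0] + [2 ** i for i in range(num_bits - 1)]

def thermometer_encode(value, thresholds):
    """Bit j is True when value is strictly greater than thresholds[j]."""
    return [value > t for t in thresholds]

thresholds = thermometer_thresholds()
code = thermometer_encode(5, thresholds)
print(thresholds[15], thresholds[16])  # thresholds of TE_numports_15 and TE_numports_16
print(code[:6])                        # bits 0..5 for a scanner with numports = 5
```

Because the thresholds increase, the True bits always form an unbroken prefix, like the mercury column of a thermometer. Index 16 is the bit for "more than 2**15 ports".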

In [38]:
for i in range(0, 16):
    # "TE_numports_" + str(i+1)  is the name of each new feature created for each Bin in Thermometer Encoding
    Scanners_df3 = Scanners_df2.withColumn("TE_numports_" + str(i+1), col("numports") > pow(2,i))
    Scanners_df2 = Scanners_df3

In [39]:
Scanners_df2.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Port17132: 

# Exercise 5 (5 points)
What is the total number of scanners that scan more than 2^15 (i.e., 32768) ports? Complete the code below using Scanners_df2 to find out the answer.

In [40]:
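# Note: per Exercise 4, TE_numports_15 flags numports > 2**14 (16384);
# the column for numports > 2**15 (32768) is TE_numports_16.
# The count reported below was produced with TE_numports_15.
HFScanners_df2 = Scanners_df2.where(col('TE_numports_15'))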

In [41]:
HFScanners_df2.count()

16

# Exercise 6 (10 points)
Complete the following code to use k-means to cluster the scanners using the following features:
- thermometer encoding of 'numports' numerical feature
- one-hot encoding of top k ports (k chosen by you in Exercise 2).
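
One way to see why thermometer codes combine well with one-hot features under Euclidean k-means: the squared distance between two thermometer codes equals the number of thresholds that separate the two values, so it stays on the same 0/1 scale as the port features. The sketch below is a plain-Python illustration with hypothetical helper names and example values; it is not part of the exercise solution.

```python
def thermometer_encode(value, num_bits=17):
    """Encode value against thresholds 0, 2**0, ..., 2**(num_bits - 2) as 0.0/1.0 bits."""
    thresholds = [0] + [2 ** i for i in range(num_bits - 1)]
    return [1.0 if value > t else 0.0 for t in thresholds]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical numports values: 3 vs 5 (similar) and 3 vs 40000 (very different)
near = squared_distance(thermometer_encode(3), thermometer_encode(5))
far = squared_distance(thermometer_encode(3), thermometer_encode(40000))
raw_far = (40000 - 3) ** 2  # what the raw, unencoded feature would contribute
print(near, far, raw_far)
```

With the raw value, a single numerical feature would swamp all the binary port features; with the thermometer code its contribution is bounded by the number of bits.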

## Specify Parameters for k Means Clustering

In [42]:
km = KMeans(featuresCol="features", predictionCol="prediction").setK(50).setSeed(123)
km.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 50)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction)\nseed: random seed. (default: -606032289246360211, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [43]:
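input_features = []
# One-hot port features: indices 0 .. top_ports - 2, matching the encoding loop in Exercise 3
for i in range(0, top_ports - 1):
    input_features.append( "Port"+Top_Ports_list[i] )
# Thermometer features TE_numports_0 .. TE_numports_14
# (TE_numports_15 and TE_numports_16 are not included by range(0, 15))
for i in range(0, 15):
    input_features.append( "TE_numports_" + str(i))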

In [44]:
print(input_features)

['Port17132', 'Port17140', 'Port17128', 'Port17138', 'Port17130', 'Port17136', 'Port23', 'Port445', 'Port54594', 'Port17142', 'Port17134', 'Port80', 'Port8080', 'Port0', 'Port2323', 'Port5555', 'Port81', 'Port1023', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port34218', 'Port34220', 'Port33968', 'Port34224', 'Port34228', 'Port33962', 'TE_numports_0', 'TE_numports_1', 'TE_numports_2', 'TE_numports_3', 'TE_numports_4', 'TE_numports_5', 'TE_numports_6', 'TE_numports_7', 'TE_numports_8', 'TE_numports_9', 'TE_numports_10', 'TE_numports_11', 'TE_numports_12', 'TE_numports_13', 'TE_numports_14']


In [45]:
va = VectorAssembler().setInputCols(input_features).setOutputCol("features")

In [46]:
data= va.transform(Scanners_df2)

In [47]:
data.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: boolean, Port33962: boo

In [48]:
kmModel=km.fit(data)

In [49]:
kmModel

KMeansModel: uid=KMeans_264e207bbf3d, k=50, distanceMeasure=euclidean, numFeatures=44

In [50]:
predictions = kmModel.transform(data)

In [51]:
predictions.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: boolean, Port33962: boo

In [52]:
Cluster1_df=predictions.where(col("prediction")==0)

In [53]:
Cluster1_df.persist().count()

5650

## Exercise 7 (5 points)
Complete the following code to find the size of all of the clusters generated.
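
Conceptually, the cluster sizes are just a tally of the cluster assignments. The sketch below shows that tally in plain Python on made-up assignments; `cluster_sizes` and the sample list are hypothetical and stand in for the `prediction` column.

```python
from collections import Counter

def cluster_sizes(assignments, k):
    """Return a list whose entry c is the number of points assigned to cluster c."""
    counts = Counter(assignments)
    return [counts.get(c, 0) for c in range(k)]

# Hypothetical cluster IDs for six points, with k = 4 (cluster 3 ends up empty)
assignments = [0, 2, 2, 1, 2, 0]
sizes = cluster_sizes(assignments, k=4)
print(sizes)
```

A useful sanity check is that the sizes sum to the number of rows clustered. In this notebook, the 50 sizes reported by `summary.clusterSizes` sum to 227062, which matches `Scanners_df3.count()`.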

In [54]:
summary = kmModel.summary

In [55]:
summary.clusterSizes

[5650,
 24562,
 3186,
 7920,
 8153,
 16044,
 6765,
 1800,
 1037,
 3869,
 3309,
 21722,
 8810,
 6804,
 2174,
 1037,
 1235,
 7734,
 36565,
 7042,
 916,
 1299,
 2548,
 2070,
 3520,
 2909,
 1561,
 3606,
 692,
 327,
 1084,
 1304,
 478,
 7269,
 1449,
 940,
 1262,
 1204,
 1440,
 4776,
 614,
 1031,
 738,
 507,
 1304,
 1043,
 582,
 616,
 3367,
 1188]

# Exercise 8 (5 points)
Complete the following code to find the Silhouette Score of the clustering result.
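
For intuition about what the score measures, here is a minimal plain-Python sketch of the silhouette definition. It uses squared Euclidean distance, which to my understanding is the default distance measure of Spark's `ClusteringEvaluator`; treat that as an assumption. The function names and the four sample points are hypothetical, and the sketch assumes at least two clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette(points, labels):
    """Mean silhouette over all points; assumes at least two distinct labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(sq_dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]  # hypothetical data
good = silhouette(points, [0, 0, 1, 1])        # labels follow the two groups
bad = silhouette(points, [0, 1, 0, 1])         # labels cut across the groups
print(good, bad)
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below indicate overlapping or misassigned points.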

In [56]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

In [57]:
print('Silhouette Score of the Clustering Result is ', silhouette)

Silhouette Score of the Clustering Result is  0.7291143984546943


In [58]:
centers = kmModel.clusterCenters()

In [59]:
centers[0]

array([9.87079646e-01, 9.83893805e-01, 9.85663717e-01, 9.87256637e-01,
       9.84424779e-01, 9.82477876e-01, 7.07964602e-04, 8.84955752e-04,
       1.94690265e-03, 1.00000000e+00, 9.40707965e-01, 5.30973451e-04,
       3.53982301e-04, 6.37168142e-03, 1.76991150e-04, 1.59292035e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.05486726e-01,
       1.99115044e-01, 1.97345133e-01, 2.03362832e-01, 2.00000000e-01,
       1.99469027e-01, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 1.00000000e+00, 1.52212389e-02, 1.59292035e-03,
       5.30973451e-04, 1.76991150e-04, 1.76991150e-04, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

In [60]:
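print("Cluster Centers:")
# Labels printed here are 1-based, while prediction IDs elsewhere in this notebook are 0-based.
for i, center in enumerate(centers, start=1):
    print("Cluster ", str(i), center)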

Cluster Centers:
Cluster  1 [9.87079646e-01 9.83893805e-01 9.85663717e-01 9.87256637e-01
 9.84424779e-01 9.82477876e-01 7.07964602e-04 8.84955752e-04
 1.94690265e-03 1.00000000e+00 9.40707965e-01 5.30973451e-04
 3.53982301e-04 6.37168142e-03 1.76991150e-04 1.59292035e-03
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.05486726e-01
 1.99115044e-01 1.97345133e-01 2.03362832e-01 2.00000000e-01
 1.99469027e-01 1.00000000e+00 1.00000000e+00 1.00000000e+00
 1.00000000e+00 1.00000000e+00 1.52212389e-02 1.59292035e-03
 5.30973451e-04 1.76991150e-04 1.76991150e-04 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
Cluster  2 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.22139891e-04
 0.00000000e+00 0.00000000e+00 0.00000000e+00

# Part C Percentage of Mirai Malware in Each Cluster


# Exercise 9 (10 points)
Complete the following code to compute the percentage of Mirai malware, Zmap, and Masscan in each cluster.
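
The quantity being computed is, for each cluster, the fraction of its members that carry a given external label. A minimal plain-Python sketch of that ratio on made-up data follows; `label_ratio_by_cluster` and the sample lists are hypothetical and stand in for the `prediction` and `mirai` columns.

```python
from collections import defaultdict

def label_ratio_by_cluster(assignments, flags):
    """Return {cluster: fraction of members whose flag is True} in a single pass."""
    size = defaultdict(int)
    hits = defaultdict(int)
    for cluster, flag in zip(assignments, flags):
        size[cluster] += 1
        if flag:
            hits[cluster] += 1
    return {c: hits[c] / size[c] for c in size}

# Hypothetical data: five scanners, their cluster IDs, and whether each is labeled mirai
assignments = [0, 0, 0, 1, 1]
mirai = [True, True, False, False, False]
ratios = label_ratio_by_cluster(assignments, mirai)
print(ratios)
```

The same function can be reused for the zmap and masscan labels by passing a different list of flags.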

In [62]:
cluster_eval_df = pd.DataFrame( columns = ['cluster ID', 'size', 'cluster center', 'mirai_ratio', 'zmap_ratio', 'masscan_ratio'] )

for i in range(0, 50):
    cluster_i = predictions.where(col('prediction')==i)
    cluster_i_size = cluster_i.count()
    cluster_i_mirai_count = cluster_i.where(col('mirai')).count()
    cluster_i_mirai_ratio = cluster_i_mirai_count/cluster_i_size
    if cluster_i_mirai_count > 0:
        print("Cluster ", i, "; Mirai Ratio:", cluster_i_mirai_ratio, "; Cluster Size: ", cluster_i_size)
    cluster_i_zmap_ratio = (cluster_i.where(col('zmap')).count())/cluster_i_size
    cluster_i_masscan_ratio = (cluster_i.where(col('masscan')).count())/cluster_i_size
    cluster_eval_df.loc[i]=[i, cluster_i_size, centers[i], cluster_i_mirai_ratio, cluster_i_zmap_ratio, cluster_i_masscan_ratio ]
                               

Cluster  5 ; Mirai Ratio: 0.8424333084018948 ; Cluster Size:  16044
Cluster  10 ; Mirai Ratio: 0.009066183136899365 ; Cluster Size:  3309
Cluster  18 ; Mirai Ratio: 0.06232736223164228 ; Cluster Size:  36565
Cluster  20 ; Mirai Ratio: 0.07641921397379912 ; Cluster Size:  916
Cluster  22 ; Mirai Ratio: 0.00706436420722135 ; Cluster Size:  2548
Cluster  33 ; Mirai Ratio: 0.001513275553721282 ; Cluster Size:  7269
Cluster  37 ; Mirai Ratio: 0.8878737541528239 ; Cluster Size:  1204
Cluster  39 ; Mirai Ratio: 0.027219430485762145 ; Cluster Size:  4776
Cluster  47 ; Mirai Ratio: 0.01461038961038961 ; Cluster Size:  616


# Exercise 10 (5 points) 
Identify all of the clusters that have a large percentage of Mirai malware. For example, you can choose clusters with a Mirai ratio of at least 80%. If you use a different threshold (other than 80%), describe the threshold you used and the rationale for your choice.

## Answer to Exercise 10:
## Using 80% as the threshold:
- Cluster  5 ; Mirai Ratio: 0.8424333084018948 ; Cluster Size:  16044
- Cluster  37 ; Mirai Ratio: 0.8878737541528239 ; Cluster Size:  1204
...
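
The effect of the threshold choice can be checked directly against the Mirai ratios printed in Exercise 9. The sketch below copies those nine reported ratios into a dictionary (clusters with no Mirai scanners were not printed and are omitted); the helper name `clusters_above` and the alternative 5% threshold are hypothetical choices for illustration.

```python
# Mirai ratios as printed in the Exercise 9 output, keyed by cluster ID
reported_mirai_ratio = {
    5: 0.8424333084018948,
    10: 0.009066183136899365,
    18: 0.06232736223164228,
    20: 0.07641921397379912,
    22: 0.00706436420722135,
    33: 0.001513275553721282,
    37: 0.8878737541528239,
    39: 0.027219430485762145,
    47: 0.01461038961038961,
}

def clusters_above(ratios, threshold):
    """Return the sorted cluster IDs whose ratio is at least the threshold."""
    return sorted(c for c, r in ratios.items() if r >= threshold)

print(clusters_above(reported_mirai_ratio, 0.8))
print(clusters_above(reported_mirai_ratio, 0.05))
```

The gap between the two high-ratio clusters and the rest is large, so any threshold between roughly 8% and 84% selects the same two clusters.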

In [71]:
# You can filter predictions DataFrame (Spark) to get all scanners in a cluster.
# The code below selects the scanners in clusters 5 and 37, the two clusters
# identified in Exercise 10. Replace these IDs with the clusters you want to investigate.
cluster_selected = predictions.where((col('prediction')==5) | (col('prediction')==37))

In [72]:
# If you prefer to use Pandas dataframe, you can use the following to convert a cluster to a Pandas dataframe
cluster_selected_df = cluster_selected.select("*").toPandas()

In [73]:
cluster_selected.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Port17132: 

# Exercise 11 (10 points)
Complete the following code to find out, for each of the clusters you identified in Exercise 10, 
- (1) (5 points) determine whether they scan a common port, and 
- (2) (5 points) what is the port number if most of them in a cluster scan a common port. 
You can use the code below to find out which top ports are scanned by the scanners in a cluster.
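
To decide whether "most" scanners in a cluster share a port, it helps to compare each port's count with the cluster size rather than look at raw counts alone. The sketch below does this for the per-port counts this notebook prints for cluster 37 (size 1204); the helper name `ports_scanned_by_most` and the 90% cutoff are hypothetical choices.

```python
def ports_scanned_by_most(port_counts, cluster_size, min_share=0.9):
    """Return ports scanned by at least min_share of the cluster, most common first."""
    common = [p for p, n in port_counts.items() if n / cluster_size >= min_share]
    return sorted(common, key=lambda p: -port_counts[p])

# Per-port scanner counts as printed in this notebook for cluster 37
cluster_37_counts = {
    "23": 1192, "445": 1, "54594": 1, "80": 30, "8080": 34, "0": 1,
    "2323": 1204, "5555": 17, "81": 14, "1023": 22, "52869": 17,
    "8443": 12, "49152": 17, "7574": 12, "37215": 9,
}
common_ports = ports_scanned_by_most(cluster_37_counts, cluster_size=1204)
print(common_ports)
```

A port whose count equals the cluster size is scanned by every member of the cluster.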

In [83]:
# Select the clusters to investigate: clusters 5 and 37, identified in Exercise 10.
cluster_5= predictions.where(col('prediction')==5)
cluster_37= predictions.where(col('prediction')==37)

In [84]:
for i in range(0, top_ports -1):
    port_num = "Port" + Top_Ports_list[i]
    port_i_count = cluster_5.where(col(port_num)).count()
    if port_i_count > 0:
        print("Scanners of Port ", Top_Ports_list[i], " = ", port_i_count)

Scanners of Port  23  =  16044


In [85]:
for i in range(0, top_ports -1):
    port_num = "Port" + Top_Ports_list[i]
    port_i_count = cluster_37.where(col(port_num)).count()
    if port_i_count > 0:
        print("Scanners of Port ", Top_Ports_list[i], " = ", port_i_count)

Scanners of Port  23  =  1192
Scanners of Port  445  =  1
Scanners of Port  54594  =  1
Scanners of Port  80  =  30
Scanners of Port  8080  =  34
Scanners of Port  0  =  1
Scanners of Port  2323  =  1204
Scanners of Port  5555  =  17
Scanners of Port  81  =  14
Scanners of Port  1023  =  22
Scanners of Port  52869  =  17
Scanners of Port  8443  =  12
Scanners of Port  49152  =  17
Scanners of Port  7574  =  12
Scanners of Port  37215  =  9


# Answer to Exercise 11
- (1) (5 points) Yes. Every scanner in cluster 5 scans port 23 (16044 of 16044). In cluster 37, every scanner scans port 2323 (1204 of 1204) and nearly all of them also scan port 23 (1192 of 1204). Port 23 is therefore common to both clusters.
- (2) (5 points) The top port in cluster 5 is port 23, scanned by 16044 scanners. The top port in cluster 37 is port 2323, scanned by 1204 scanners, followed by port 23 with 1192 scanners.

# Exercise 12 (20 points)
Based on the results above and those of mini project deliverable #2, answer the following questions:
- (a) Why is the clustering result of mini project #3 better than that of #2? (5 points)
- (b) Based on your answer of (a), what is the general lesson you learned for solving clustering problems? (5 points)
- (c) Did you find anything interesting and/or surprising using Mirai labels to evaluate the clustering result? (5 points)
- (d) Based on your answer of (c), what is the general lesson you learned regarding evaluating clustering? (5 points)

# Answer to Exercise 12: 
- (a) Mini project #3 uses thermometer encoding. Mini project #2 clustered on a mixture of numerical variables and one-hot encoded categorical variables, and thermometer encoding addresses that problem by turning the numerical variable into binary features. In my actual run, however, project 3 got a worse score than project 2. I think hyperparameter tuning could improve the performance of k-means.
- (b) To improve clustering performance, I need to avoid three things: mixing numerical variables with one-hot encoded variables, using numerical variables that are not normalized, and working in a high-dimensional feature space. Moreover, I can use thermometer encoding to improve k-means clustering.
- (c) I find that the Mirai labels give me a better way to validate the clustering. They give an accurate Mirai ratio for each cluster, which I can use as a validation score.
- (d) External labels are used when we propose a new clustering technique and want to validate it or compare it with existing techniques. In these cases, we take datasets for which the ground truth is known and check whether the clustering technique produces solutions that are similar to it. So we can use external validity measures to improve our clustering validation.