# Clustering-Hack 


## Task: Use k-means on the hack_data set to determine whether there were two or three hackers. If there were two hackers, the clusters should divide 50/50. If there were three hackers, the clusters should divide 33/33/33.

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.

## Result: Since the model with k=2 divided the hacks evenly 50/50, it is concluded that there were two hackers, not three.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hack').getOrCreate()

In [7]:
from pyspark.ml.clustering import KMeans

# Loads data.
data = spark.read.csv("spark_master/Spark_for_Machine_Learning/Clustering/hack_data.csv",header=True,inferSchema=True)

In [8]:
for item in data.head(5):
    print(item)
    print('\n')

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)


Row(Session_Connection_Time=20.0, Bytes Transferred=720.99, Kali_Trace_Used=0, Servers_Corrupted=3.04, Pages_Corrupted=9.0, Location='British Virgin Islands', WPM_Typing_Speed=69.08)


Row(Session_Connection_Time=31.0, Bytes Transferred=356.32, Kali_Trace_Used=1, Servers_Corrupted=3.71, Pages_Corrupted=8.0, Location='Tokelau', WPM_Typing_Speed=70.58)


Row(Session_Connection_Time=2.0, Bytes Transferred=228.08, Kali_Trace_Used=1, Servers_Corrupted=2.48, Pages_Corrupted=8.0, Location='Bolivia', WPM_Typing_Speed=70.8)


Row(Session_Connection_Time=20.0, Bytes Transferred=408.5, Kali_Trace_Used=0, Servers_Corrupted=3.57, Pages_Corrupted=8.0, Location='Iraq', WPM_Typing_Speed=71.28)




## Format the Data

In [9]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [10]:
#Print columns
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [12]:
#Print schema
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [15]:
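#Examine the Location variable. It is a string column in which most of the values shown appear only a few times, so it will not be used to train the model.
#show() prints the table and returns None, so its result is not assigned.
data.groupBy('Location').count().show()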

+--------------------+-----+
|            Location|count|
+--------------------+-----+
|            Anguilla|    1|
|            Paraguay|    2|
|               Macao|    2|
|Heard Island and ...|    2|
|               Yemen|    1|
|             Tokelau|    2|
|              Sweden|    3|
|French Southern T...|    3|
|            Kiribati|    1|
|              Guyana|    2|
|         Philippines|    3|
|            Malaysia|    2|
|           Singapore|    1|
|United States Vir...|    6|
|              Turkey|    1|
|      Western Sahara|    2|
|              Malawi|    2|
|                Iraq|    3|
|Northern Mariana ...|    3|
|             Germany|    1|
+--------------------+-----+
only showing top 20 rows



In [16]:
#Create vec_assembler object
vec_assembler = VectorAssembler(inputCols = ['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed'], outputCol='features')

In [17]:
#Create the features column in the form of a vector
final_data = vec_assembler.transform(data)

In [42]:
#Print schema (this cell was re-run after the scaling step below, so scaledFeatures already appears)
final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



## Scale the Data

In [18]:
from pyspark.ml.feature import StandardScaler

In [19]:
# Create scaler object
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [20]:
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)

In [21]:
# Normalize each feature to have unit standard deviation.
final_data = scalerModel.transform(final_data)
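
To make the effect of `withStd=True, withMean=False` concrete, here is a minimal plain-Python sketch (not Spark code) that divides one column by its sample standard deviation without centering it. The helper name `scale_to_unit_std` is hypothetical, and the sample values are the `Session_Connection_Time` entries from the five rows printed earlier. It assumes the corrected (n - 1) sample standard deviation, which is what Spark's `StandardScaler` is documented to use and what `statistics.stdev` computes.

```python
import statistics


def scale_to_unit_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    std = statistics.stdev(values)  # corrected sample std (divides by n - 1)
    if std == 0:
        # Constant column: there is nothing to scale by. This sketch maps it to zeros.
        return [0.0 for _ in values]
    return [v / std for v in values]


# Session_Connection_Time values from the first five rows shown earlier.
session_times = [8.0, 20.0, 31.0, 2.0, 20.0]
scaled = scale_to_unit_std(session_times)

print(scaled)
print(statistics.stdev(scaled))  # very close to 1.0 after scaling
```

Because the mean is not subtracted, the scaled values keep their sign and relative order; only their spread changes, so no single feature dominates the Euclidean distances that k-means relies on.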

## Train the Model and Evaluate

In [34]:
# Trains a k-means model.
kmeans_3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans_2 = KMeans(featuresCol='scaledFeatures',k=2)

In [35]:
#Fit model
model_3 = kmeans_3.fit(final_data)
model_2 = kmeans_2.fit(final_data)

In [36]:
# Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE),
# first for k=2, then for k=3.
# Note: computeCost was deprecated in Spark 2.4 and removed in Spark 3.0.
wssse_2 = model_2.computeCost(final_data)
print("Within Set Sum of Squared Errors = " + str(wssse_2))
wssse_3 = model_3.computeCost(final_data)
print("Within Set Sum of Squared Errors = " + str(wssse_3))

Within Set Sum of Squared Errors = 601.7707512676716
Within Set Sum of Squared Errors = 434.75507308487647
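
For reference, WSSSE is the sum of squared Euclidean distances from each point to its nearest cluster center. The following plain-Python sketch illustrates the quantity on toy data; the helpers `squared_distance` and `wssse` and the four sample points are made up for illustration and are not taken from the hack data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data (hypothetical): two tight pairs of points that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(wssse(points, one_center))
print(wssse(points, two_centers))
```

WSSSE generally decreases as k grows, so the lower value for k=3 does not by itself show that three clusters is the better answer; the cluster sizes are what the task uses to decide.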


In [24]:
# Shows the cluster centers of the k=3 model (in scaled feature space).
centers = model_3.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[ 1.26023837  1.31829808  0.99280765  1.36491885  2.5625043   5.26676612]
[ 2.93719177  2.88492202  0.          3.19938371  4.52857793  3.30407351]
[ 3.05623261  2.95754486  1.99757683  3.2079628   4.49941976  3.26738378]
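
One way to read the centers printed above is to compare them feature by feature. The sketch below is a hedged, plain-Python illustration that uses the printed values copied by hand. `feature_names` follows the `inputCols` order given to `VectorAssembler`, and the helper `largest_gap` is hypothetical. It reports which feature separates the second and third centers the most.

```python
feature_names = [
    'Session_Connection_Time',
    'Bytes Transferred',
    'Kali_Trace_Used',
    'Servers_Corrupted',
    'Pages_Corrupted',
    'WPM_Typing_Speed',
]

# Cluster centers copied from the output above (scaled feature space).
centers = [
    [1.26023837, 1.31829808, 0.99280765, 1.36491885, 2.5625043, 5.26676612],
    [2.93719177, 2.88492202, 0.0, 3.19938371, 4.52857793, 3.30407351],
    [3.05623261, 2.95754486, 1.99757683, 3.2079628, 4.49941976, 3.26738378],
]


def largest_gap(center_a, center_b, names):
    """Return the feature name and size of the largest per-feature difference."""
    gaps = [abs(a - b) for a, b in zip(center_a, center_b)]
    idx = max(range(len(gaps)), key=gaps.__getitem__)
    return names[idx], gaps[idx]


name, gap = largest_gap(centers[1], centers[2], feature_names)
print(name, round(gap, 4))
```

With these values, the two similar centers differ mainly in `Kali_Trace_Used`. One possible reading is that the k=3 model splits a single group on that flag instead of isolating a third distinct hacker.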


In [39]:
#Show cluster sizes for the k=3 model
model_3.transform(final_data).select('prediction').groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   79|
|         2|   88|
|         0|  167|
+----------+-----+



In [40]:
#Show cluster sizes for the k=2 model
model_2.transform(final_data).select('prediction').groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+
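
The decision rule from the task statement (equal-sized clusters for the correct k) can also be checked numerically. The following is a minimal sketch with the counts above copied by hand; the helper `max_share_deviation` and the 1% tolerance are assumptions made for illustration, not part of the original analysis.

```python
def max_share_deviation(counts):
    """Largest absolute gap between a cluster's share and the equal share 1/k."""
    total = sum(counts)
    equal_share = 1.0 / len(counts)
    return max(abs(c / total - equal_share) for c in counts)


# Cluster sizes copied from the two count tables above.
counts_k2 = [167, 167]
counts_k3 = [167, 79, 88]

TOLERANCE = 0.01  # assumed threshold for calling a split "balanced"

for label, counts in (("k=2", counts_k2), ("k=3", counts_k3)):
    deviation = max_share_deviation(counts)
    verdict = "balanced" if deviation <= TOLERANCE else "unbalanced"
    print(label, counts, round(deviation, 3), verdict)
```

Under this tolerance the k=2 split is exactly balanced, while in the k=3 split one cluster holds half of the sessions instead of a third.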



## Result: Since the model with k=2 divided the hacks evenly into two groups, it is concluded that there were two hackers, not three.