# Clustering Project 

A large technology firm needs help: they've been hacked! Luckily, their forensic engineers have grabbed 
valuable data about the hacks:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.


The technology firm has 3 potential hackers who may have perpetrated the attacks. They're certain of the first two hackers 
but aren't sure whether the third was involved. I will figure out whether the third suspect had anything to do with the attacks, or whether it was just
two hackers.


**Key fact:** the forensic engineer knows that the hackers trade off attacks, 
meaning they should each have roughly the same number of attacks. 
For example, if there were 100 total attacks, 
then in a two-hacker situation each should have about 50 attacks, 
and in a three-hacker situation each would have about 33. 
The engineer believes this is the key to solving the case, 
but needs to know how to separate this unlabeled data into groups of hackers.
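The arithmetic behind the key fact can be written as a small helper. This is only an illustrative sketch; the function name `expected_attacks_per_hacker` is made up here and is not part of the original notebook.

```python
def expected_attacks_per_hacker(total_attacks: int, n_hackers: int) -> float:
    """Attacks each hacker would carry out if the work were split evenly."""
    return total_attacks / n_hackers


# The example from the text: 100 attacks shared by 2 or 3 hackers.
for n_hackers in (2, 3):
    share = expected_attacks_per_hacker(100, n_hackers)
    print(f"{n_hackers} hackers -> about {share:.0f} attacks each")
```

A clustering whose group sizes sit close to these even shares is consistent with that number of hackers; one with very uneven groups is not.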

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('hackers').getOrCreate()

In [3]:
data = spark.read.csv('hack_data.csv', header = True, inferSchema = True)

In [4]:
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [5]:
data.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

In [6]:
data.head()

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)

# Format Data

In [26]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [28]:
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [50]:
assembler = VectorAssembler(inputCols = ['Session_Connection_Time',
                                        'Bytes Transferred',
                                        'Kali_Trace_Used',
                                        'Servers_Corrupted',
                                        'Pages_Corrupted',
                                         #'Location', is excluded as the hackers use VPNs 
                                        'WPM_Typing_Speed'],
                            outputCol = 'features')

In [51]:
datavec = assembler.transform(data)

# Scale The Data

In [46]:
from pyspark.ml.feature import StandardScaler

In [47]:
scaler = StandardScaler(inputCol='features', outputCol='scaled_features', withMean=False, withStd=True)

In [53]:
final_data = scaler.fit(datavec).transform(datavec)
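With `withMean=False` and `withStd=True`, each feature is divided by its standard deviation but is not centered. Below is a plain-Python sketch of that idea for a single column. It assumes the scaler uses the sample (n - 1) standard deviation, which is how the Spark documentation describes it. The helper name `scale_by_std` and all input values except 72.37 (taken from the first row shown earlier) are made up for illustration.

```python
from statistics import stdev


def scale_by_std(column):
    """Divide each value by the column's sample standard deviation, without centering."""
    s = stdev(column)  # sample (n - 1) standard deviation
    return [x / s for x in column]


# Illustrative typing speeds; not the real dataset.
typing_speeds = [40.0, 55.0, 72.37, 60.0]
scaled = scale_by_std(typing_speeds)

print(scaled)
print(stdev(scaled))  # close to 1.0: the scaled column has unit standard deviation
```

Because nothing is subtracted, the scaled values stay positive. Scaling matters for k-means because otherwise a wide-ranging column such as 'Bytes Transferred' would dominate the Euclidean distances.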

# Train the Model and Evaluate

In [54]:
from pyspark.ml.clustering import KMeans

In [55]:
#Train KMeans model
kmeans_model_2 = KMeans(featuresCol = 'scaled_features', k = 2)
model2 = kmeans_model_2.fit(final_data)
kmeans_model_3 = KMeans(featuresCol = 'scaled_features', k = 3)
model3 = kmeans_model_3.fit(final_data)

In [56]:
#Evaluate the cluster model using Within Set Sum of Squared Errors (WSSE).
# computeCost() is not available on KMeansModel in Spark 3.0 (calling it raised
# AttributeError: 'KMeansModel' object has no attribute 'computeCost'),
# so read the training cost from the model summary instead.
wsse2 = model2.summary.trainingCost
wsse3 = model3.summary.trainingCost
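For reference, WSSE is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster center. The sketch below computes it in plain Python on a tiny made-up dataset. The function `wsse` and the sample points are illustrative assumptions, not Spark's implementation.

```python
def wsse(points, centers):
    """Sum over all points of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Two well-separated pairs of points (made up for illustration).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(wsse(points, one_center))   # 104.0
print(wsse(points, two_centers))  # 4.0
```

WSSE generally falls as k grows, so a lower value for k = 3 than for k = 2 would not by itself show that there are three hackers. This is why the cluster sizes are the deciding evidence in this project.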

### Show the result

In [57]:
centers_of_clusters2 = model2.clusterCenters()
print('Cluster Center2 :')
for center in centers_of_clusters2 :
    print (center)

Cluster Center2 :
[1.26023837 1.31829808 0.99280765 1.36491885 2.5625043  5.26676612]
[2.99991988 2.92319035 1.05261534 3.20390443 4.51321315 3.28474   ]


In [58]:
centers_of_clusters3 = model3.clusterCenters()
print('Cluster Center :')
for center in centers_of_clusters3 :
    print (center)

Cluster Center :
[1.26023837 1.31829808 0.99280765 1.36491885 2.5625043  5.26676612]
[3.05623261 2.95754486 1.99757683 3.2079628  4.49941976 3.26738378]
[2.93719177 2.88492202 0.         3.19938371 4.52857793 3.30407351]
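A quick way to see what the k = 3 model did is to compare the second and third centers feature by feature. The sketch below copies the two centers from the output above and pairs them with the assembler's input column order. The variable names are illustrative.

```python
features = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
            'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']

# Second and third cluster centers from the k = 3 output (scaled units).
center_1 = [3.05623261, 2.95754486, 1.99757683, 3.2079628, 4.49941976, 3.26738378]
center_2 = [2.93719177, 2.88492202, 0.0, 3.19938371, 4.52857793, 3.30407351]

# Absolute per-feature gap between the two centers.
diffs = {name: abs(a - b) for name, a, b in zip(features, center_1, center_2)}
for name, gap in diffs.items():
    print(f"{name:25s} {gap:.4f}")

largest = max(diffs, key=diffs.get)
print(largest)  # Kali_Trace_Used
```

The two centers are nearly identical except for 'Kali_Trace_Used'. That suggests, though it does not prove, that k = 3 split one hacker's sessions on the binary Kali flag and did not find a third behavior pattern.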


In [59]:
model2.transform(final_data).select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 20 rows



In [60]:
model3.transform(final_data).select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 20 rows



In [61]:
hacker_pred2= model2.transform(final_data).select('prediction')
hacker_pred2.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



In [62]:
hacker_pred3= model3.transform(final_data).select('prediction')
hacker_pred3.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   88|
|         2|   79|
|         0|  167|
+----------+-----+



As the engineer hinted, each hacker should account for a similar number of attacks.
With k = 2 the clusters are evenly split (167 and 167), whereas with k = 3 they are uneven (88, 79 and 167). The counts therefore point to two hackers, not three.
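The comparison of cluster sizes can be made explicit with a simple imbalance ratio. The counts below are copied from the two `groupBy('prediction').count()` outputs. The helper `imbalance` is a hypothetical name used only for this sketch.

```python
def imbalance(counts):
    """Ratio of the largest cluster to the smallest; 1.0 means perfectly even."""
    return max(counts) / min(counts)


counts_k2 = [167, 167]     # cluster sizes for k = 2
counts_k3 = [88, 79, 167]  # cluster sizes for k = 3

print(imbalance(counts_k2))  # 1.0
print(imbalance(counts_k3))  # a little over 2
```

A ratio of 1.0 for k = 2 matches the trade-off pattern the engineer described, while the k = 3 grouping has one cluster about twice the size of the smallest.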