**KMeans Clustering Project**

Here is the scenario: a technology firm was hacked, and we have been asked to analyze the data its forensic engineers collected in order to determine whether the firm was attacked by 2 or 3 hackers.

The hint we are given is that the attacks were distributed evenly among the hackers. This clue may turn out to be the key to the mystery.

**Set up PySpark in the Google Colab environment**

In [0]:
!wget -q https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

In [0]:
!tar -zxf spark-2.1.1-bin-hadoop2.7.tgz

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [0]:
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.1.1-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hack_find').getOrCreate()

In [0]:
from pyspark.ml.clustering import KMeans

**Load the data**

In [0]:
dataset = spark.read.csv("hack_data.csv",header=True,inferSchema=True)

In [12]:
dataset.head()

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)

In [13]:
dataset.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

In [15]:
dataset.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

**Feature Transformation**

In [0]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [0]:
feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']

In [0]:
vec_assembler = VectorAssembler(inputCols = feat_cols, outputCol='features')

In [0]:
final_data = vec_assembler.transform(dataset)

**Feature normalization**

In [0]:
from pyspark.ml.feature import StandardScaler

In [0]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [0]:
scalerModel = scaler.fit(final_data)

In [0]:
cluster_final_data = scalerModel.transform(final_data)
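
As a rough illustration of what the scaling step above does: with `withStd=True` and `withMean=False`, each feature is divided by its standard deviation without being centered first. The pure-Python sketch below mimics that on a single column. The helper name `scale_to_unit_std` and the toy values are made up for illustration and are not taken from hack_data.csv; the sketch also assumes the corrected sample standard deviation (n - 1 in the denominator), which is what the Spark documentation describes for StandardScaler.

```python
import statistics


def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation.

    No mean is subtracted, mirroring withMean=False.
    """
    std = statistics.stdev(column)  # corrected sample std (n - 1)
    return [x / std for x in column]


# Hypothetical toy column, NOT real values from hack_data.csv.
toy_column = [100.0, 300.0, 500.0]
scaled = scale_to_unit_std(toy_column)

print(scaled)
print(statistics.stdev(scaled))  # the scaled column has unit standard deviation
```

Scaling matters here because KMeans relies on Euclidean distance: without it, a wide-ranging column such as 'Bytes Transferred' would dominate narrow ones such as 'Kali_Trace_Used'.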

**Answering the question: 2 or 3 hackers?**

In [0]:
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)

In [0]:
model_k3 = kmeans3.fit(cluster_final_data)
model_k2 = kmeans2.fit(cluster_final_data)

In [0]:
wssse_k3 = model_k3.computeCost(cluster_final_data)
wssse_k2 = model_k2.computeCost(cluster_final_data)

In [29]:
print("With K=3")
print("Within Set Sum of Squared Errors = " + str(wssse_k3))
print('--'*30)
print("With K=2")
print("Within Set Sum of Squared Errors = " + str(wssse_k2))

With K=3
Within Set Sum of Squared Errors = 434.75507308487647
------------------------------------------------------------
With K=2
Within Set Sum of Squared Errors = 601.7707512676716
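
For reference, the number reported by `computeCost` in Spark 2.x is the sum, over all points, of the squared distance from each point to its nearest cluster center. The sketch below recomputes that quantity in plain Python on hypothetical 2-D toy data; the function name `wssse` and the points and centers are assumptions chosen for illustration, not part of the notebook's dataset or of the Spark API.

```python
def wssse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Squared distance from p to every center; keep the smallest one.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Hypothetical toy data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print("1 center :", wssse(points, one_center))
print("2 centers:", wssse(points, two_centers))
```

Adding a center can only keep each point's nearest-center distance the same or shrink it, so a lower cost for a larger K is not by itself evidence of a better clustering. Note also that `computeCost` was deprecated in Spark 2.4 and removed in 3.0, so the cells above depend on the Spark 2.1.1 setup used in this notebook.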


There is not much to be gained from the WSSSE alone; after all, we would expect the WSSSE to decrease as K increases. We could, however, continue the analysis by looking at the drop from K=3 to K=4 and beyond, to check whether the clustering favors even or odd values of K. This won't be conclusive, but it's worth a look:

In [30]:
for k in range(2,9):
    kmeans = KMeans(featuresCol='scaledFeatures',k=k)
    model = kmeans.fit(cluster_final_data)
    wssse = model.computeCost(cluster_final_data)
    print("With K={}".format(k))
    print("Within Set Sum of Squared Errors = " + str(wssse))
    print('--'*30)

With K=2
Within Set Sum of Squared Errors = 601.7707512676716
------------------------------------------------------------
With K=3
Within Set Sum of Squared Errors = 434.75507308487647
------------------------------------------------------------
With K=4
Within Set Sum of Squared Errors = 419.2753165228254
------------------------------------------------------------
With K=5
Within Set Sum of Squared Errors = 398.07796475234807
------------------------------------------------------------
With K=6
Within Set Sum of Squared Errors = 228.37504907726444
------------------------------------------------------------
With K=7
Within Set Sum of Squared Errors = 214.52113001230254
------------------------------------------------------------
With K=8
Within Set Sum of Squared Errors = 201.09551851852257
------------------------------------------------------------
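
To make the size of each step easier to read, the relative drop in WSSSE from one K to the next can be computed directly from the values printed above. This is a small post-processing sketch, not part of the original analysis; the helper name `relative_drops` is made up here, and the exact numbers depend on the KMeans initialization of this particular run.

```python
# WSSSE values copied from the output of the loop above.
wssse_by_k = {
    2: 601.7707512676716,
    3: 434.75507308487647,
    4: 419.2753165228254,
    5: 398.07796475234807,
    6: 228.37504907726444,
    7: 214.52113001230254,
    8: 201.09551851852257,
}


def relative_drops(costs):
    """Map each K to the fractional decrease in cost compared with K - 1."""
    ks = sorted(costs)
    return {k: (costs[prev] - costs[k]) / costs[prev] for prev, k in zip(ks, ks[1:])}


drops = relative_drops(wssse_by_k)
for k, drop in drops.items():
    print("K={}: {:.1%} lower than K={}".format(k, drop, k - 1))
```

On these numbers the two largest relative drops occur at K=6 and K=3, with only small decreases elsewhere, so the cost curve alone does not single out 2 or 3.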


Nothing definitive can be concluded from the above, but let's not forget that the engineer mentioned "the attacks should be evenly numbered between the hackers!" Let's check this using the prediction column that results from calling transform:

In [31]:
model_k3.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   88|
|         2|   79|
|         0|  167|
+----------+-----+



In [32]:
model_k2.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+
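
The evenness hint can be turned into a simple number by comparing the largest and smallest cluster sizes. The sketch below uses the counts shown in the two tables above; the helper name `size_imbalance` and the choice of a max/min ratio are assumptions made for illustration, and other balance measures would work just as well.

```python
def size_imbalance(counts):
    """Ratio of the largest to the smallest cluster size; 1.0 means perfectly even."""
    return max(counts) / min(counts)


# Cluster sizes copied from the groupBy('prediction').count() outputs above.
counts_k3 = [88, 79, 167]
counts_k2 = [167, 167]

print("K=3 imbalance:", size_imbalance(counts_k3))
print("K=2 imbalance:", size_imbalance(counts_k2))
```

A ratio of exactly 1.0 for K=2, against a ratio above 2 for K=3, is what matches the hint that the attacks were split evenly.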



________

**From this we can conclude that there were 2 hackers rather than 3: with K=2 the attacks split evenly (167 each), whereas with K=3 the split is uneven (167, 88 and 79). Problem solved.**

