A company may have been hacked, and we want to find out which types of hackers were behind the threat.
We have data about the hacks, including session time, location, typing speed in words per minute (WPM), and so on.
The forensic department has also provided the metadata of each session that the hackers used to connect to the company's servers. We use these attributes to identify the potential groups of hackers.

The company has researched the hack and told us that three types of hackers may be causing the threat. One trend they have noticed in the market is that these hackers trade off hacks, which means each of them should have caused roughly the same number of attacks.

Our aim is to identify the potential groups of hackers and determine whether all three were actually involved in the threat.

In [0]:
from pyspark.sql import SparkSession

In [0]:
#creating the spark session
spark = SparkSession.builder.appName("kmeans_hack").getOrCreate()

In [0]:
#reading the data and loading the dataset in the spark dataframe 
df = spark.read.format("csv").load("dbfs:/FileStore/shared_uploads/sanchari.gautam@gmail.com/hack_data-1.csv",inferSchema=True,header=True)
df.show()

In [0]:
#display the columns present in the dataset
df.columns

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
#assembling the numeric input columns needed for clustering into a single 'features' vector column, which is the input format Spark ML estimators expect
assembler = VectorAssembler(inputCols=['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed'],outputCol='features')

In [0]:
#displaying the features column having the input columns in dense vector format
final_data = assembler.transform(df).select('features')
final_data.show()

In [0]:
from pyspark.ml.feature import StandardScaler

In [0]:
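#scaling each feature to unit standard deviation (StandardScaler defaults: withStd=True, withMean=False, so the data is not centered)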
scaler = StandardScaler(inputCol="features",outputCol="scaledFeatures")
scaler_model = scaler.fit(final_data)
final_data = scaler_model.transform(final_data).select('scaledFeatures')
final_data.show()
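
To make the scaling step concrete, here is a minimal plain-Python sketch of what scaling to unit standard deviation without centering does. The helper name `scale_to_unit_std` and the toy rows are hypothetical and are not taken from the dataset; the sketch only illustrates the idea and is not Spark's implementation.

```python
from statistics import stdev


def scale_to_unit_std(rows):
    """Divide every column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    # statistics.stdev uses the n-1 (sample) denominator.
    stds = [stdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [value / s if s != 0 else 0.0 for value, s in zip(row, stds)]
        for row in rows
    ]


# Hypothetical toy rows: (session minutes, bytes transferred).
rows = [[8.0, 391.0], [20.0, 720.0], [31.0, 356.0], [2.0, 228.0]]
scaled = scale_to_unit_std(rows)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, every column has a standard deviation of 1, so a feature measured in large units (such as bytes) no longer dominates the distance computations that K-means relies on.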

In [0]:
from pyspark.ml.clustering import KMeans

In [0]:
#creating the Kmeans model object with k=3
kmeans = KMeans(featuresCol='scaledFeatures',k=3)
kmeans_model = kmeans.fit(final_data)

In [0]:
#predicting the clusters on final data with k=3
results = kmeans_model.transform(final_data)
results.show()

In [0]:
#finding the number of threats caused by three potential hackers
results.groupby("prediction").count().show()

We can see that the third hacker's number of attacks differs greatly from that of the other two hackers, who have roughly the same number of attacks.

Since we know that the hackers trade off attacks, their counts should be roughly equal, so we can rule out the third hacker. Now, let us try with two hackers.

In [0]:
#creating the Kmeans model object with k=2
kmeans = KMeans(featuresCol='scaledFeatures',k=2)
kmeans_model = kmeans.fit(final_data)

In [0]:
#predicting the clusters on final data with k=2
results = kmeans_model.transform(final_data)
results.show()

In [0]:
#finding the number of threats caused by two potential hackers
results.groupby("prediction").count().show()

With k=2, the two clusters contain exactly the same number of hacks. This matches the trade-off pattern and strongly suggests that only two hackers were causing the threat to the firm.
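
The comparison of cluster sizes for k=3 and k=2 can be sketched as a simple balance check. The label lists below are made up for illustration and are not the notebook's output; in the notebook, the labels would come from the `prediction` column of `results`. The helper name `cluster_balance` is hypothetical.

```python
from collections import Counter


def cluster_balance(labels):
    """Return per-cluster counts and the ratio of smallest to largest cluster."""
    counts = Counter(labels)
    sizes = sorted(counts.values())
    # A ratio of 1.0 means perfectly balanced clusters; values near 0 mean a skewed split.
    return counts, sizes[0] / sizes[-1]


# Hypothetical cluster labels (illustrative only, not real results).
labels_k3 = [0] * 50 + [1] * 24 + [2] * 26
labels_k2 = [0] * 50 + [1] * 50

counts_k3, ratio_k3 = cluster_balance(labels_k3)
counts_k2, ratio_k2 = cluster_balance(labels_k2)
print("k=3:", dict(counts_k3), "balance ratio:", ratio_k3)
print("k=2:", dict(counts_k2), "balance ratio:", ratio_k2)
```

Under the assumption that the hackers trade off attacks, the value of k whose balance ratio is closest to 1 is the more plausible one. This is a heuristic tied to that assumption, not a general rule for choosing k.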

In [0]:
from pyspark.ml.evaluation import ClusteringEvaluator
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator(featuresCol='scaledFeatures')

silhouette = evaluator.evaluate(results)
print("Silhouette with squared euclidean distance = " + str(silhouette))


The silhouette score measures how similar an object is to its own cluster (cohesion) compared with other clusters (separation). The score ranges from -1 to 1. It is not a percentage: a score of 0.75 is close to 1, which indicates that our clusters are compact and well separated from each other.
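
To show how cohesion and separation combine into a single number, here is a naive pairwise sketch of the silhouette score using squared Euclidean distance. The function names and the toy points are hypothetical, and the sketch is not how `ClusteringEvaluator` computes the score internally, so its values are not guaranteed to match Spark's digit for digit.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (naive O(n^2) version)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster (cohesion).
        a = sum(sq_dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation).
        b = min(
            sum(sq_dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print("well-separated labels:", round(good, 3))
print("mixed-up labels:", round(bad, 3))
```

Each point's coefficient is (b - a) / max(a, b), so it approaches 1 when the point is much closer to its own cluster than to any other, and becomes negative when it sits closer to a different cluster.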