# Sherlock: Classifying Malicious Cell Phone Sessions 

This subset of the well-known Sherlock dataset contains data extracted from 37 users' cell phones over 3 months at the beginning of 2016. There are two tables involved in this analysis:  
T4.tsv: ~26 GB of data on battery level, memory usage, packet inflows and outflows, and the like. Each row represents a scan, and scans are conducted every 5 seconds.  
Moriartyprobe.tsv: ~90 MB of data from an app called Moriarty, which starts "sessions": a variety of realistic attacks on the user's cell phone that stop and start intermittently. The sessions are either benign or malicious.  

Our model looks at the activity on the cell phone while these sessions are occurring. We develop a machine learning model that could be deployed to track T4 usage statistics in real time and identify whether an attack is occurring.  

To use this dataset in a Spark ML pipeline, both tables must be imported, transformed, cleaned, and subsetted, and then combined. The code below uses awk, PySpark, and the Spark SQL API to do all of that.  

In [1]:
!pwd

/home/jovyan/SherLock


In [7]:
# Make the awk script executable
!chmod +x import.awk
# The next commands take a while to run, as they subset and transform the 26 GB dataset from tsv to csv line by line
# Each command spans three lines, in order:
# Path to the executable
# Path to the input .tsv dataset
#     Redirect (>) to the path where the output .csv dataset will be stored

!./import.awk \
tsvs/T4tiny.tsv \
    > csvs/T4tiny.csv 
!./import.awk \
tsvs/Moriartyprobe.tsv \
    > csvs/Moriartyprobe.csv
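
The contents of `import.awk` are not shown here. As a rough illustration of the line-by-line tsv-to-csv step only, the sketch below does the delimiter conversion in Python. The function name `tsv_to_csv` is hypothetical, and the sketch assumes the input has no quoted fields; it does not reproduce whatever subsetting the awk script performs.

```python
import csv


def tsv_to_csv(tsv_path, csv_path):
    """Stream a tab-separated file into a comma-separated one, row by row.

    Hypothetical helper: it only converts delimiters and does no subsetting.
    Returns the number of rows written.
    """
    rows_written = 0
    # newline="" lets the csv module manage line endings itself
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # QUOTE_NONE: treat quote characters in the tsv as ordinary text
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)  # quotes any field that contains a comma
        for row in reader:
            writer.writerow(row)
            rows_written += 1
    return rows_written
```

Because only one row is held in memory at a time, this approach would also work on a file far larger than RAM, which is the same reason a streaming tool such as awk suits the 26 GB table.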

In [1]:
# Import PySpark and supporting libraries (the session is started in the next cell)
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession
from functools import reduce
import pandas as pd
import numpy as np
import sklearn

In [2]:
# Start a local Spark session with 2 worker threads
spark = (
    SparkSession.builder
    .master('local[2]')
    .config("spark.executor.memory", "1g")
    .config("spark.driver.memory", "1g")
    .appName('spark_sh_data')
    .getOrCreate()
)

In [3]:
# Import data: t4 and Moriarty
t4 = spark.read.options(header=True, nullValue='NULL', inferSchema=True).csv('/home/jovyan/cybr/data/csvs/T4subset.csv')
mor = spark.read.options(header=True, nullValue='NULL', inferSchema=True).csv('/home/jovyan/cybr/data/csvs/Moriarityprobe.csv')

In [4]:
# create colnames T4
t4_colnames = ['userid', 'uuid', 'Version', 'CpuGHz', 'CPU_0', 'CPU_1', 'CPU_2', 'CPU_3', 'Total_CPU', 'TotalMemory_freeSize', 'TotalMemory_max_size',
'TotalMemory_total_size', 'TotalMemory_used_size', 'Traffic_MobileRxBytes', 'Traffic_MobileRxPackets', 'Traffic_MobileTxBytes',
'Traffic_MobileTxPackets','Traffic_TotalRxBytes', 'Traffic_TotalRxPackets', 'Traffic_TotalTxBytes', 'Traffic_TotalTxPackets',
'Traffic_TotalWifiRxBytes', 'Traffic_TotalWifiRxPackets', 'Traffic_TotalWifiTxBytes', 'Traffic_TotalWifiTxPackets',
'Traffic_timestamp', 'Battery_charge_type', 'Battery_current_avg']

# create colnames Moriarty 
mor_colnames = ['userid', 'uuid', 'actionType', 'action', 'behavior', 'sessionType', 'sessionID', 'version']
# Rename the columns positionally: the i-th existing column gets the i-th new name
t4_oldColumns = t4.schema.names
t4_newColumns = t4_colnames

mor_oldColumns = mor.schema.names
mor_newColumns = mor_colnames

# reduce applies withColumnRenamed once per column, threading the DataFrame through each step
t4 = reduce(lambda df, idx: df.withColumnRenamed(t4_oldColumns[idx], t4_newColumns[idx]), range(len(t4_oldColumns)), t4)
mor = reduce(lambda df, idx: df.withColumnRenamed(mor_oldColumns[idx], mor_newColumns[idx]), range(len(mor_oldColumns)), mor)
t4.printSchema()
mor.printSchema()

root
 |-- userid: string (nullable = true)
 |-- uuid: long (nullable = true)
 |-- Version: string (nullable = true)
 |-- CpuGHz: string (nullable = true)
 |-- CPU_0: string (nullable = true)
 |-- CPU_1: string (nullable = true)
 |-- CPU_2: string (nullable = true)
 |-- CPU_3: string (nullable = true)
 |-- Total_CPU: string (nullable = true)
 |-- TotalMemory_freeSize: string (nullable = true)
 |-- TotalMemory_max_size: string (nullable = true)
 |-- TotalMemory_total_size: string (nullable = true)
 |-- TotalMemory_used_size: string (nullable = true)
 |-- Traffic_MobileRxBytes: string (nullable = true)
 |-- Traffic_MobileRxPackets: string (nullable = true)
 |-- Traffic_MobileTxBytes: string (nullable = true)
 |-- Traffic_MobileTxPackets: string (nullable = true)
 |-- Traffic_TotalRxBytes: string (nullable = true)
 |-- Traffic_TotalRxPackets: string (nullable = true)
 |-- Traffic_TotalTxBytes: string (nullable = true)
 |-- Traffic_TotalTxPackets: string (nullable = true)
 |-- Traffic_Total
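
The positional rename above only works if the list of new names is at least as long as the list of existing columns, and it silently leaves names unmatched otherwise. As a small sketch of a guard that could run before the rename, the hypothetical helper `rename_pairs` below (not part of the original notebook) pairs the names by position and refuses a length mismatch.

```python
def rename_pairs(old_names, new_names):
    """Pair old and new column names by position.

    Hypothetical helper: raises ValueError when the two lists differ in
    length, so a wrong column count is caught before any renaming happens.
    """
    if len(old_names) != len(new_names):
        raise ValueError(
            f"expected {len(new_names)} columns, found {len(old_names)}"
        )
    return list(zip(old_names, new_names))
```

The resulting pairs could then be fed to `withColumnRenamed` one at a time, for example by looping over `rename_pairs(t4.schema.names, t4_colnames)`.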

In [5]:
# verify successful import (limit first so only 5 rows are collected to the driver)
import pandas as pd
pd.set_option('display.max_columns', None)
mor.limit(5).toPandas()

| | userid | uuid | actionType | action | behavior | sessionType | sessionID | version |
|---|---|---|---|---|---|---|---|---|
| 0 | 0a50e09262 | 1451638991449 | App entered onPause() | App Mode change | benign | benign | 1.0 | 21 |
| 1 | 0a50e09262 | 1451637887475 | Application entered onCreate() | Application started | benign | benign | 1.0 | 21 |
| 2 | 0a50e09262 | 1451637887633 | User started to play a game (name);solo | Game stared | benign | benign | 1.0 | 21 |
| 3 | 0a50e09262 | 1451637921510 | App entered onPause() | App Mode change | benign | benign | 1.0 | 21 |
| 4 | 0a50e09262 | 1451638167470 | App entered onResume | App Mode change | benign | benign | 1.0 | 21 |


## Join the t4 and Moriarty datasets
The two tables do not share a common key, which makes this step both critical and challenging. I join the tables on the uuid values, matching each T4 scan to the Moriarty event whose time range contains it. The uuid is a timestamp in milliseconds, so the malicious or benign label of each Moriarty event is carried forward across every scan up to the next event.
The query below does this with a CTE, a window function, and a subquery.
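
As a simplified illustration of that labeling rule, the sketch below uses plain Python rather than Spark. The function `label_scans` and the toy timestamps are hypothetical and not part of the notebook. It assumes a single user's data, no two events with the same timestamp, and that each scan takes the sessionType of the most recent Moriarty event at or before it.

```python
from bisect import bisect_right


def label_scans(scan_times, events):
    """Give each scan the sessionType of the latest event at or before it.

    scan_times: iterable of millisecond timestamps (the T4 uuid column).
    events: iterable of (timestamp_ms, session_type) pairs (Moriarty rows).
    Scans that precede every event get None, like a SQL NULL.
    """
    ordered = sorted(events)
    event_times = [ts for ts, _ in ordered]
    labels = []
    for ts in scan_times:
        # Index of the last event whose timestamp is <= the scan timestamp
        idx = bisect_right(event_times, ts) - 1
        labels.append(ordered[idx][1] if idx >= 0 else None)
    return labels


# Toy data: three events and six scans (all values are made up)
events = [(1000, "benign"), (20000, "malicious"), (45000, "benign")]
scans = [500, 1000, 6000, 21000, 26000, 46000]
print(label_scans(scans, events))
# → [None, 'benign', 'benign', 'malicious', 'malicious', 'benign']
```

The first scan falls before any event and stays unlabeled, while every later scan inherits the label of the event that opened its time range.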

In [6]:
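# Register both DataFrames as temporary views so they can be queried with SQL
t4.createOrReplaceTempView('t4')
mor.createOrReplaceTempView('mor')
# Subquery: the running count of non-null sessionType values (ordered by uuid) increases at every
# Moriarty event, so each event and the scans that follow it share one m_grp.
# CTE: min(sessionType) over each m_grp spreads that event's label to the scans in its group.
# Final join: keep only the rows that correspond to T4 scans.
t4_mor_sql = spark.sql("""
WITH CTE AS (SELECT uuid
    , min(sessionType) OVER (PARTITION BY m_grp) AS sessionType
FROM (
    SELECT uuid, m.sessionType
        , count(m.sessionType) OVER (ORDER BY uuid) AS m_grp
    FROM mor m
    FULL OUTER JOIN t4 t USING(uuid)
    ) sub)
SELECT *
FROM CTE
JOIN t4 USING(uuid)
""")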

In [7]:
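# Write the joined table with the built-in CSV writer; repartition(1) forces a single part file
# (Spark creates a directory named t4_mor.csv that contains it)
t4_mor_sql.repartition(1).write.option("header", "true").csv("t4_mor.csv")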

KeyboardInterrupt: 

In [None]:
t4_mor = spark.read.options(header=True, nullValue='NULL', inferSchema=True).csv('/home/jovyan/cybr/data/csvs/t4_mor.csv')

In [None]:
# Limit in Spark first so only 5 rows are collected to the driver
t4_mor.limit(5).toPandas()