# PySpark RDD

This notebook is based on https://github.com/jadianes/spark-py-notebooks

A Resilient Distributed Dataset (RDD) is a distributed collection of elements. Most work in Spark is expressed as creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

## Getting the data files
In this notebook, we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half a million network interactions. The file is provided as a Gzip file that we will download locally.

In [1]:
import urllib.request
data = urllib.request.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

## Create a PySpark session

In [2]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAppName('RDD-PySpark').setMaster('local')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)

### Create an RDD from a file

In [3]:
data_file = './kddcup.data_10_percent.gz'
raw_data = sc.textFile(data_file)

In [4]:
# some functions to check data
raw_data.count()
raw_data.take(3)

['0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 '0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 '0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.']

### The filter transformation

This transformation can be applied to an RDD in order to keep only the elements that satisfy a certain condition. More concretely, a function is evaluated on every element in the original RDD, and the resulting RDD contains just those elements for which the function returns True.

In [5]:
normal_raw_data = raw_data.filter(lambda x: 'normal.' in x)

In [6]:
# count how many elements we have in the new RDD
normal_count = normal_raw_data.count()
print("There are {} 'normal' interactions".format(normal_count))

There are 97278 'normal' interactions
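
Note that `'normal.' in x` is a substring test on the whole raw line. It works for this dataset, but comparing only the last field is a stricter check. A minimal plain-Python sketch of the difference follows; the shortened sample lines and the helper name `has_tag` are made up for illustration and are not part of the dataset or of Spark.

```python
# Made-up, shortened lines for illustration only (not real dataset rows).
lines = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
    "0,tcp,abnormal.,SF,neptune.",  # hypothetical field that merely contains the text
]

def has_tag(line, tag):
    """Return True only when the last comma-separated field equals `tag`."""
    return line.split(",")[-1] == tag

substring_matches = [l for l in lines if "normal." in l]
field_matches = [l for l in lines if has_tag(l, "normal.")]

print(len(substring_matches))  # 2
print(len(field_matches))      # 1
```

With an RDD, the same predicate could be passed to filter, for example `raw_data.filter(lambda x: has_tag(x, 'normal.'))`.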


### The map transformation

By using the map transformation in Spark, we can apply a function to every element in our RDD.

In [7]:
csv_data = raw_data.map(lambda x: x.split(","))
head_rows = csv_data.take(5)
print(head_rows[0])

['0', 'tcp', 'http', 'SF', '181', '5450', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '8', '8', '0.00', '0.00', '0.00', '0.00', '1.00', '0.00', '0.00', '9', '9', '1.00', '0.00', '0.11', '0.00', '0.00', '0.00', '0.00', '0.00', 'normal.']


#### Using map and predefined functions

Of course, we can use predefined functions with map. Imagine we want to have each element in the RDD as a key-value pair where the key is the tag (e.g. normal) and the value is the whole list of elements that represents the row in the CSV formatted file. We could proceed as follows.

In [8]:
def parse_interaction(line):
    elems = line.split(',')
    tag = elems[41]
    return (tag, elems)

key_csv_data = raw_data.map(parse_interaction)
print(key_csv_data.take(3)[0])

('normal.', ['0', 'tcp', 'http', 'SF', '181', '5450', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '8', '8', '0.00', '0.00', '0.00', '0.00', '1.00', '0.00', '0.00', '9', '9', '1.00', '0.00', '0.11', '0.00', '0.00', '0.00', '0.00', '0.00', 'normal.'])


### The collect action

So far we have used the actions count and take. Another basic action we need to learn is collect. It brings all the elements of the RDD into local memory for us to work with. For this reason it has to be used with care, especially when working with large RDDs.

In [10]:
all_raw_data = raw_data.collect()
print(len(all_raw_data))

494021


## Sampling RDDs

In Spark, there are two sampling operations: the transformation sample and the action takeSample. By using a transformation, we can tell Spark to apply successive transformations to a sample of a given RDD. By using an action, we retrieve a given sample into local memory, where it can be used by any other standard library (e.g. Scikit-learn).

### The sample transformation

The sample transformation takes up to three parameters. The first is whether the sampling is done with replacement or not. The second is the sample size as a fraction. Finally, we can optionally provide a random seed.

In [11]:
raw_data_sample = raw_data.sample(False, 0.1, 1234)
sample_size = raw_data_sample.count()
total_size = raw_data.count()
print("Sample size is {} of {}".format(sample_size, total_size))

# transformation to be applied
raw_data_sample_items = raw_data_sample.map(lambda x: x.split(','))
sample_normal_tags = raw_data_sample_items.filter(lambda x: 'normal.' in x)

sample_normal_tags_count = sample_normal_tags.count()

sample_normal_ratio = sample_normal_tags_count/float(sample_size)
print("The ratio of 'normal' interactions is {}".format(round(sample_normal_ratio,3)))

Sample size is 49493 of 494021
The ratio of 'normal' interactions is 0.195
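
The sample size above is close to, but not exactly, 10 percent of the data. That is expected: when sampling without replacement, the fraction acts as the probability of keeping each element rather than as an exact size. A rough plain-Python sketch of that idea follows. The helper `bernoulli_sample` is a hypothetical name, and Python's random generator will not reproduce Spark's exact output.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

data = list(range(100_000))
sample_a = bernoulli_sample(data, 0.1, seed=1234)
sample_b = bernoulli_sample(data, 0.1, seed=1234)

print(len(sample_a))         # close to 10000, but usually not exactly 10000
print(sample_a == sample_b)  # True: the same seed gives the same sample
```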


### The takeSample action

If what we need is to grab a sample of raw data from our RDD into local memory in order to be used by other non-Spark libraries, takeSample can be used.

Note that with takeSample Spark only distributes the sampling itself. The filtering and splitting of the result are done locally on a single node, so this approach can take longer than applying the same transformations to a sampled RDD.

In [12]:
raw_data_sample = raw_data.takeSample(False, 400000, 1234)
normal_data_sample = [x.split(",") for x in raw_data_sample if "normal." in x]

normal_sample_size = len(normal_data_sample)

normal_ratio = normal_sample_size / 400000.0
print("The ratio of 'normal' interactions is {}".format(normal_ratio))

The ratio of 'normal' interactions is 0.1967175
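
Unlike sample, takeSample takes an exact number of elements and returns an ordinary Python list, so everything after it runs as plain local code. A small sketch with the standard library standing in for the sampled data follows. The toy `lines` list is made up, and `random.sample` is only an analogy for the fixed-size behaviour, not Spark's algorithm.

```python
import random

# Made-up toy data: 1000 lines, 20 percent of them tagged 'normal.'
lines = ["0,tcp,http,SF,normal." if i % 5 == 0 else "0,icmp,ecr_i,SF,smurf."
         for i in range(1000)]

rng = random.Random(1234)
local_sample = rng.sample(lines, 200)  # exactly 200 elements, as a list

# From here on this is ordinary, single-machine Python.
normal_rows = [x.split(",") for x in local_sample if x.endswith(",normal.")]
ratio = len(normal_rows) / len(local_sample)

print(len(local_sample))  # 200
print(ratio)
```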


## Set operations on RDDs

Spark supports many of the operations we have in mathematical sets, such as union and intersection, even when the RDDs themselves are not properly sets. It is important to note that these operations require that the RDDs being operated on are of the same type.

Set operations are quite straightforward to understand, as they work as expected. The only consideration comes from the fact that RDDs are not real sets, and therefore operations such as the union of RDDs do not remove duplicates.

In [13]:
attack_raw_data = raw_data.subtract(normal_raw_data)
print('All count {} while attack count {}'.format(raw_data.count(), attack_raw_data.count()))

All count 494021 while attack count 396743
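
To make the handling of duplicates concrete, here is a plain-Python sketch that mimics the behaviour of union, distinct and subtract on small lists. The helper functions are hypothetical stand-ins; on real RDDs the corresponding calls are `rdd.union(other)`, `rdd.distinct()` and `rdd.subtract(other)`.

```python
def union(a, b):
    """Like RDD.union: concatenates and keeps duplicates."""
    return a + b

def distinct(a):
    """Like RDD.distinct: drops repeated elements (order kept here only for readability)."""
    return list(dict.fromkeys(a))

def subtract(a, b):
    """Like RDD.subtract: keeps the elements of `a` that do not appear in `b`."""
    exclude = set(b)
    return [x for x in a if x not in exclude]

a = ["tcp", "udp", "tcp"]
b = ["udp", "icmp"]

print(union(a, b))            # ['tcp', 'udp', 'tcp', 'udp', 'icmp']
print(distinct(union(a, b)))  # ['tcp', 'udp', 'icmp']
print(subtract(a, b))         # ['tcp', 'tcp']
```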


### Cartesian

We can compute the Cartesian product of two RDDs by using the cartesian transformation. It returns all possible pairs of elements between the two RDDs. In our case, we will use it to generate all the possible combinations of service and protocol in our network interactions.

First of all, we need to isolate each collection of values in two separate RDDs. For that we will use distinct on the CSV-parsed dataset. From the dataset description we know that protocol is the second column and service is the third (the tag is the last column, not the first as it appears on the description page).

In [14]:
# protocols 
csv_data = raw_data.map(lambda x: x.split(','))
protocols = csv_data.map(lambda x: x[1]).distinct()
protocols.collect()

# services 
services = csv_data.map(lambda x: x[2]).distinct()
services.collect()

# now we can do the cartesian product
product = protocols.cartesian(services).collect()
print("there are {} combinations of protocol X service".format(len(product)))

there are 198 combinations of protocol X service
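
The size of a Cartesian product is the product of the two input sizes, which is worth keeping in mind before calling cartesian on large RDDs. A plain-Python sketch with `itertools.product` follows, using a small hand-picked subset of values rather than the full lists from the dataset.

```python
from itertools import product

# Small illustrative subsets, not the complete sets found in the data.
protocols = ["tcp", "udp", "icmp"]
services = ["http", "smtp", "domain_u", "ecr_i"]

pairs = list(product(protocols, services))

print(pairs[0])    # ('tcp', 'http')
print(len(pairs))  # 12, i.e. len(protocols) * len(services)
```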


### Reduce action

The reduce action combines the elements of an RDD into a single result. The function that we pass to reduce takes two elements and returns one element of the same type as those in the RDD.

In [15]:
# parse data
csv_data = raw_data.map(lambda x:x.split(','))

# separate into different RDDs
normal_csv_data = csv_data.filter(lambda x: x[41]=='normal.')
attack_csv_data = csv_data.filter(lambda x: x[41]!='normal.')

normal_duration_data = normal_csv_data.map(lambda x: int(x[0]))
attack_duration_data = attack_csv_data.map(lambda x: int(x[0]))

total_normal_duration = normal_duration_data.reduce(lambda x, y: x + y)
total_attack_duration = attack_duration_data.reduce(lambda x, y: x + y)

print("Total duration for 'normal' interactions is {}".format(total_normal_duration))
print("Total duration for 'attack' interactions is {}".format(total_attack_duration))

Total duration for 'normal' interactions is 21075991
Total duration for 'attack' interactions is 2626792
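
Because an RDD is split into partitions, reduce is applied within each partition first and the partial results are then combined. That is why the function passed to reduce should be associative and commutative. A plain-Python sketch of the idea follows; the explicit partition lists and the helper `partitioned_reduce` are illustrative assumptions, not Spark internals.

```python
from functools import reduce

def partitioned_reduce(partitions, func):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

durations = [3, 0, 12, 7, 0, 5]
one_partition = [durations]
three_partitions = [[3, 0], [12, 7], [0, 5]]

add = lambda x, y: x + y
sub = lambda x, y: x - y

# Addition is associative and commutative: partitioning does not matter.
print(partitioned_reduce(one_partition, add))     # 27
print(partitioned_reduce(three_partitions, add))  # 27

# Subtraction is neither: the result depends on how the data is split.
print(partitioned_reduce(one_partition, sub))     # -21
print(partitioned_reduce(three_partitions, sub))  # 3
```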


### Aggregate action

The aggregate action frees us from the constraint of having the return value be of the same type as the elements of the RDD we are working on. As with the fold action, we supply an initial zero value of the type we want to return. Then we provide two functions. The first one is used to combine the elements from our RDD with the accumulator. The second function is needed to merge two accumulators. Let's see it in action by calculating the mean duration of the 'normal' interactions.

In [16]:
normal_sum_count = normal_duration_data.aggregate(
    (0,0), # the initial value
    (lambda acc, value: (acc[0] + value, acc[1] + 1)), # combine value with acc
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])) # combine accumulators
)

print("Mean duration for 'normal' interactions is {}".format(round(normal_sum_count[0]/float(normal_sum_count[1]),3)))

Mean duration for 'normal' interactions is 216.657
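
The roles of the two functions are easier to see when the partitions are explicit. The following plain-Python sketch mimics aggregate on hand-made partitions; `local_aggregate` and the partition lists are hypothetical and for illustration only.

```python
from functools import reduce

def local_aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the accumulators with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

partitions = [[3, 0], [12, 7], [0, 5]]  # integer durations, split arbitrarily

total, count = local_aggregate(
    partitions,
    (0, 0),                                           # (sum, count) accumulator
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # fold one element into an accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge two accumulators
)

print(total, count)             # 27 6
print(round(total / count, 3))  # 4.5
```

Note that the elements are integers while the result is a tuple, which is exactly what reduce alone cannot express.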


## Working with key/value pair RDDs

Spark provides specific functions to deal with RDDs whose elements are key/value pairs. They are usually used to perform aggregations and other processing by key.

Key/value pair aggregations prove particularly effective when exploring each type of tag in our network attacks individually.

### Creating a pair RDD

We want to create an RDD in which each interaction, parsed as a CSV row, is the value and its corresponding tag is the key.

Normally we create key/value pair RDDs by applying a function using map to the original data. This function returns the corresponding pair for a given RDD element. 

In [17]:
csv_data = raw_data.map(lambda x: x.split(','))
key_value_data = csv_data.map(lambda x: (x[41], x)) # x[41] contains the network interaction tag
key_value_data.take(1)

[('normal.',
  ['0',
   'tcp',
   'http',
   'SF',
   '181',
   '5450',
   '0',
   '0',
   '0',
   '0',
   '0',
   '1',
   '0',
   '0',
   '0',
   '0',
   '0',
   '0',
   '0',
   '0',
   '0',
   '0',
   '8',
   '8',
   '0.00',
   '0.00',
   '0.00',
   '0.00',
   '1.00',
   '0.00',
   '0.00',
   '9',
   '9',
   '1.00',
   '0.00',
   '0.11',
   '0.00',
   '0.00',
   '0.00',
   '0.00',
   '0.00',
   'normal.'])]
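
With the data in (key, value) form, a per-key aggregation combines the values that share a key. On an RDD this is typically done with `reduceByKey`. The plain-Python sketch below mimics that behaviour on a few made-up pairs; the helper `reduce_by_key` and the sample durations are assumptions for illustration.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with `func`, like RDD.reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

# Made-up (tag, duration) pairs, not taken from the dataset.
tag_durations = [
    ("normal.", 4), ("smurf.", 0), ("normal.", 10), ("neptune.", 1), ("smurf.", 0),
]

totals = reduce_by_key(tag_durations, lambda x, y: x + y)
print(totals)  # {'normal.': 14, 'smurf.': 0, 'neptune.': 1}
```

On the pair RDD above, a comparable call might look like `key_value_data.mapValues(lambda row: int(row[0])).reduceByKey(lambda x, y: x + y)`.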

In [None]:
# calculate the total duration of each network interaction type
