# Spark & Python

Some preliminary terms:

Term                   |Definition
----                   |-------
RDD                    |Resilient Distributed Dataset
Transformation         |Spark operation that produces an RDD
Action                 |Spark operation that produces a local object
Spark Job              |Sequence of transformations on data with a final action

### Creating the SparkContext

In [1]:
from pyspark import SparkContext

*Note: only one SparkContext can be active at a time the way we are running things here, so creating a second one without stopping the first raises an error.*

In [2]:
sc = SparkContext()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/05 17:25:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Basic Operations

Use the Jupyter `%%writefile` cell magic to write the contents of a cell to a file:

In [3]:
%%writefile example.txt
first line
second line
third line
fourth line

Overwriting example.txt
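`%%writefile` only works inside Jupyter/IPython. When running the same steps as a plain script, a rough equivalent using only the standard library might look like this (the variable name `lines` is just for illustration):

```python
from pathlib import Path

lines = ['first line', 'second line', 'third line', 'fourth line']

# Write one line per element, ending the file with a newline.
Path('example.txt').write_text('\n'.join(lines) + '\n')
```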


### Creating an RDD

There are two common ways to create an RDD:

Method                      |Result
----------                  |-------
`sc.parallelize(array)`     |Create RDD of elements of array (or list)
`sc.textFile(path/to/file)` |Create RDD of lines from file

`sc.textFile` reads a text file from HDFS, a local file system (the path must be available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of strings, one element per line:

In [4]:
# creates an RDD object using the textFile command
textFile = sc.textFile('example.txt')

In [5]:
# count num of elements in the RDD (4 lines of text)
textFile.count()

4

In [6]:
# show 1st element in the RDD object
textFile.first()

'first line'

### RDD Transformations

- Transformations build up a set of instructions to perform on the RDD; nothing is executed until we call an action.

- Because they are lazy, transformations display no output of their own; each one simply returns a new RDD.

Transformation Example                          |Result
----------                               |-------
`filter(lambda x: x % 2 == 0)`           |Discard non-even elements
`map(lambda x: x * 2)`                   |Multiply each RDD element by `2`
`map(lambda x: x.split())`               |Split each string into words
`flatMap(lambda x: x.split())`           |Split each string into words and flatten sequence
`sample(withReplacement=True, fraction=0.25)` |Random sample with replacement, roughly 25% the size of the RDD (size is not exact)
`union(rdd)`                             |Append `rdd` to existing RDD
`distinct()`                             |Remove duplicates in RDD
`sortBy(lambda x: x, ascending=False)`   |Sort elements in descending order

In [7]:
# creates a recipe without executing it
secfind = textFile.filter(lambda line: 'second' in line)

In [8]:
# the RDD is defined lazily; nothing has been executed yet
secfind

PythonRDD[4] at RDD at PythonRDD.scala:53

### RDD Actions

Once we have our 'recipe' of transformations ready, we can execute them by calling an action. Here are some common actions:

Action                             |Result
----------                             |-------
`collect()`                            |Return all RDD elements as a local in-memory list
`take(3)`                              |First 3 elements of RDD
`top(3)`                               |Largest 3 elements of RDD, in descending order
`takeSample(withReplacement=True, num=3)` |Create sample of 3 elements with replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element (population) standard deviation (assumes numeric elements)

In [9]:
secfind.collect()

['second line']

In [10]:
secfind.count()

1