# `spark`

## `emr` spinup

we will want to work with an `emr` `hadoop` cluster for the majority of this class, but it takes about 12 or so minutes to fully spin up, so let's get cracking on that early!

**<div align="center">starting up an `emr` cluster</div>**

the below steps will create a cluster which, if we leave on throughout class, will cost a little more than 1 USD.

+ in the `aws` web console open the `emr` service
+ click create cluster, and on the "quick options" screen select "advanced options"
+ software and steps
    + stay at emr-5.19
    + for software, click:
        + `hadoop`
        + `ganglia`
        + `hive`
        + `hue`
        + `spark`
        + `livy`
        + `pig`
    + notice but don't click: `jupyterhub`, `mxnet`, `tensorflow`
    + click next
+ hardware config -- leave all defaults
    + generally, think about space requirements [ala this](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html#emr-plan-instances-hdfs)
+ general options -- pick a name
+ security options -- ***choose a key pair!***

## `spark`


while it *is* actually still somewhat common to write `hadoop streaming` `python` scripts for doing `etl` work, it's really not great for data science.

as you've seen, it's pretty orchestrated and low-level. we're thinking hard about the simple things we want to do (like count words or calculate averages), and when we've figured them out, we're only doing them *once*

think about how that looks when we want to move on to something more complicated, like a gradient descent algorithm.

or anything iterative, for that matter.

1. take parameters and apply the model defined by those parameters to every record in our `hdfs` dataset -- map records of features to predicted `y` values, calculate the individual `y` error term and gradient for each record, and `emit` those
2. reduce those partial gradients to determine the parameter update
3. update parameters

each time we move between steps we are reading and writing to `hdfs` and that can be crazy wasteful
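
to make the three steps concrete, here is a minimal sketch in plain `python` (no cluster involved). it assumes a toy one-parameter model `y = w * x` and a few made-up records; the names `gradient` and `step` are invented for this illustration. in `hadoop streaming`, every call to `step` would be its own job, reading from and writing to `hdfs`.

```python
from functools import reduce

# toy records standing in for rows of an hdfs dataset: (x, y) pairs with y = 2x
records = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def gradient(w, record):
    """map step: one record -> its contribution to the squared-error gradient."""
    x, y = record
    error = w * x - y
    return 2 * error * x


def step(w, records, learning_rate=0.05):
    partials = map(lambda r: gradient(w, r), records)  # 1. map each record
    total = reduce(lambda a, b: a + b, partials)       # 2. reduce the partial gradients
    return w - learning_rate * total / len(records)    # 3. update the parameter


w = 0.0
for _ in range(50):  # every iteration repeats all three steps
    w = step(w, records)
print(w)
```

with these made-up records `w` settles very close to 2. the thing to notice is the loop: fifty iterations means fifty full passes over the data.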

also, this is 2017. we should expect that someone has just done this for us. that's fair.

the name of the game in data science applications is `spark`.

`spark` is a fast query and iterative computation platform written in `scala` (an OO functional programming language which runs on the `jvm`). it was built as a replacement for `mapreduce` for exactly the kind of calculation workflows in which we are most interested.

perhaps most importantly for us: there are libraries for the `spark` `api` in `python` and `R`, and this has led to pretty wide adoption in those communities.

### the `spark` stack

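`spark` ignores data management (storage is left to systems like `hdfs` or `s3`) and focuses exclusively on computation. on a cluster it leans on a resource manager such as `yarn` to hand out machines.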

the primary programmer interface is the `spark` core api module (i.e. the standard library of `spark`), and this library is entirely focused on implementing common computation tasks (file `io`, `mapping`, `reducing`, `filtering`) in as efficient a way as possible.

just like in `python` and `R`, after the core language has taken care of the low-level stuff, specialized tools are built on top of this. in `spark`, some of the most important are:

+ `spark sql`: an implementation of the `sql` standard against `rdd`s. dirty.
+ `spark streaming`: unbounded data stream processing
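+ `mllib`: common machine learning algorithms implemented on top of `spark`
    + `mahout` is a separate `apache` project with similar goals that can also run on `spark`
    + `spark-sklearn` is a separate package that uses `spark` to distribute `sklearn` work (e.g. a hyperparameter grid search) across the cluster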
+ `graphx`: manipulate and calculate graphs
+ `zeppelin`: the `jupyter notebook` of the `spark` world

### resilient distributed datasets

the fundamental data structure of `spark` is a resilient distributed dataset (`rdd`)

previously we cited the requirements of distributed computing frameworks to be *fault tolerance, recoverability, consistency, and scalability*

`rdd`s are `spark`'s way of performing distributed computation while hitting those requirements. at the simplest level, `spark` takes a functional plan of attack (a sequence of functions) and figures out how to distribute the data to many different nodes (in memory) to optimize that plan.

this is very similar to the `tensorflow` execution graph -- delay computation until the whole roadmap is defined and the user asks for something specific

some important facts about `rdd`s

+ `rdd`s are immutable, read-only collections of objects
+ they can be built from a lineage (a series of `fpl` function calls)
    + this makes them *fault tolerant, recoverable, consistent*
+ they work in parallel, so *scalable*
+ they are operated on by `scala`, and `fpl`, so *consistent*
+ they are immutable, so *recoverable*

when you're using the `spark` api, then, your basic abilities are to create, transform, or export these `rdd`s.

you need to shift paradigms into a functional mindset: think in terms of the operations you can apply to an `rdd`.

`spark` breaks these things down into basically two types

1. *transformations*: `rdd` $\rightarrow$ new `rdd`
    + `map`: take a big `rdd`, apply something, create a new `rdd` as a result
2. *actions*: return something back to the client (aggregation, e.g.)
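    + `reduce`: aggregate the elements of an `rdd` (a sum, e.g.) and return the single result to the driver (its cousin `reduceByKey`, which we use below, is a transformation: it returns a new `rdd`)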
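
the split between the two types is easier to feel with a toy. the sketch below is plain `python`, not `spark`: `Lineage` is a class made up here to stand in for an `rdd`, with a `map` transformation that only records what to do and a `collect` action that finally does it.

```python
class Lineage:
    """toy stand-in for an rdd: remembers its source and the functions to apply."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # transformation: returns a *new* object with a longer lineage, computes nothing
        return Lineage(self.source, self.steps + (fn,))

    def collect(self):
        # action: only now is the lineage replayed over the data
        out = []
        for item in self.source:
            for fn in self.steps:
                item = fn(item)
            out.append(item)
        return out


calls = []


def shout(word):
    calls.append(word)  # record that we were actually called
    return word.upper()


base = Lineage(['to', 'be'])
loud = base.map(shout)   # transformation: base is untouched, nothing runs
print(len(calls))        # 0
print(loud.collect())    # ['TO', 'BE']
print(len(calls))        # 2
```

because `loud` is only a source plus a list of functions, a lost result can always be rebuilt by replaying the lineage. that is the idea behind the fault tolerance claims above, in miniature.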
    

#### programming with `rdd`s

the way we actually deploy programs in `spark` is similar to how we deployed `mapreduce` jobs in `hadoop streaming`: we write some code and send it to one machine, which distributes the computation elsewhere

what changes in `spark` is that a master program (the "driver") creates `rdd`s by *parallelizing* a `hadoop` dataset (that is, it partitions a given dataset and pushes those partitions to nodes that perform local computations in memory).

an `rdd` is a structure that manages this partitioning / parallelizing.
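
as a rough picture of what "partition, compute locally, combine" means, here is a plain `python` sketch. the `partition` helper is invented for this example and is not how `spark` actually assigns records to partitions. it only shows the shape of the idea.

```python
def partition(data, n):
    """split a list into n roughly equal, contiguous partitions."""
    size, extra = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts


lines = ['a', 'b', 'c', 'd', 'e']
parts = partition(lines, 2)
print(parts)          # [['a', 'b', 'c'], ['d', 'e']]

# each "node" works only on its own partition ...
local_counts = [len(p) for p in parts]
# ... and the driver combines the small partial results
print(sum(local_counts))   # 5
```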

from the point of view of the `spark` program, the order of operations is

1. build `rdd`s
    + access data from `hdfs` or local disk storage
    + parallelize that collection of data
    + transform it as necessary
    + cache everything we can
2. pass *closures* (stateless functions, ignorant of the rest of the world) to each element of the `rdd`
    + *closures* are then locally applied in-memory and the outputs are also cached
3. output `rdd`s are *acted on* (aggregated)
    + this is the only place we actually have an eval step.

one quick note on some common terms: *variables* and *closures*

+ *closures* do not rely in any way on external data
    + if they have variables within, they are copied to the nodes with them, but kept in local scope
+ external data, if needed, is passed through shared variables
    + *broadcast* variables: read only, distributed (e.g. lookup tables / stopword lists)
    + *accumulators*: meant to be associatively updated (e.g. counters)

### interactive `spark`

`spark` itself is written in `scala` and runs on the `jvm`, which means it is compiled code at its heart. however, most development work with `spark` is done in interactive `repl` sessions.

`pyspark` is a `repl` for the `spark` api bindings in `python`, so if you want to code `spark` programs using `python`, this is your starting point.

just like with the `python` language, there are a few different ways you could execute `pyspark` commands:

+ in a terminal shell via the `pyspark` command
+ in a notebook via several options
    + `zeppelin`
    + amazon `emr` `notebooks`
    + extension kernels for `jupyter`

we are going to launch both the `pyspark` command line version and an amazon `notebook` just to demonstrate how, but in the lecture we will follow the `notebook`.

the underlying execution -- using `pyspark` as a kernel -- is the same for both methods, so feel free to follow along executing the following commands using whichever interface you prefer.

in our quick walkthrough we will do the canonical `hadoop` example (word count) again. we will find the frequency of words in the `shakespeare.txt` file. you'll see that the syntax is pretty familiar from traditional `python`, with a few twists.

#### `pyspark` `cli`

first, let's open the command line `repl`. on the `emr` master node (`ssh -i $KEY_PAIR hadoop@$MASTER_DNS`)

```bash
cd ~/code/hadoop-fundamentals
# you may get errors!
pyspark
```

verify it worked:

```py
sc.version
```

if you would like to execute code in this `cli`, leave it open. otherwise, exit with `exit()`

#### `emr` notebook

this is a special `aws`-specific thing (don't confuse it with `jupyter` or `zeppelin`). notes are [here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html)

notes

+ notebooks are saved to `s3`; create an `s3` bucket for them
+ `aws` `emr` web console > notebooks (left menu)
+ create notebook
+ update name, description, cluster, and `s3` bucket (leave rest as defaults)
+ click open button

### Shakespeare word frequency demo

let's go through some simple `pyspark` commands in one of our two `repl` environments (`pyspark` or an amazon `emr` `notebook` with a `pyspark`-kernel)

to start, we use the `sc` spark context object (built as part of the `pyspark` `repl`) to create a hook to the Shakespeare text file we loaded into `hdfs` long ago

```python
text = sc.textFile('/data/shakespeare.txt')
print(text)
help(text)

# look at flatMap
help(text.flatMap)
```

we can use `flatMap` to apply a tokenization function to every text string:

```python
# make a tokenize function
def tokenize(text):
    return text.split()

# looks good. let's tokenize our text into words
words = text.flatMap(tokenize)
```
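
if the difference between `map` and `flatMap` isn't obvious, this plain `python` sketch mimics both on two made-up lines. `itertools.chain.from_iterable` plays the role of the "flat" part. it is an analogy, not the `spark` implementation.

```python
from itertools import chain

lines = ['to be or', 'not to be']


def tokenize(text):
    return text.split()


# map: one output per input, so we get a list of lists
mapped = list(map(tokenize, lines))
print(mapped)       # [['to', 'be', 'or'], ['not', 'to', 'be']]

# flatMap: same function, but the per-line lists are flattened into one stream
flat_mapped = list(chain.from_iterable(map(tokenize, lines)))
print(flat_mapped)  # ['to', 'be', 'or', 'not', 'to', 'be']
```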

we can take that flat list of words and do our standard map and reduce. before we move on here, though, look at the lineage (the defining sequence of functions) via `wc.toDebugString()`

```python
# let's apply a map function for word counts
wc = words.map(lambda x: (x, 1))

# you can see the lineage:
print(wc.toDebugString())
```

next, we'll use built-in functions to do our summation `reduce` step

```python
# include a reduce step to sort/shuffle/partition by key and add the values
from operator import add
cts = wc.reduceByKey(add)
```
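
conceptually, `reduceByKey(add)` folds together all of the values that share a key. the plain `python` sketch below shows that logic on a handful of made-up pairs. `reduce_by_key` is a name invented here, and the real thing does this per partition before shuffling and merging across nodes.

```python
from operator import add


def reduce_by_key(pairs, fn):
    """fold the values of each key together with fn, one pair at a time."""
    totals = {}
    for key, value in pairs:
        totals[key] = fn(totals[key], value) if key in totals else value
    return totals


pairs = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
counts = reduce_by_key(pairs, add)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```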

finally, save the result to file. note that this is when something actually *happens*. up until this point, we were merely defining a *lineage*; now we're actually asking that some *action* take place, and `spark` springs into action

```python

# finally do something with it all: write to /tmp/wc in hdfs
# (this fails if /tmp/wc already exists)
cts.saveAsTextFile('/tmp/wc')

# exit so we can see the results
exit()
```

```bash
# look at them beautiful results
hadoop fs -ls /tmp/wc/
hadoop fs -cat /tmp/wc/part-00000 | head -n25
```

**<div align="center">what are your questions so far?</div>**

<strong><em><div align="center"><code>s = 'spark'; s.replace('a', 'o')</code></div></em></strong>
<div align="center"><img width=300 src="https://images-na.ssl-images-amazon.com/images/I/61u0oKyy3wL._SX466_.jpg"></div>

# END OF LECTURE