## MSAN-694-Lec5-Notes

Gonna have some fun with AWS



### Week 4 Material: Recapping The Pair RDD
- countByKey()
- lookup(key)

In [1]:
PATH2 = '/Users/tlee010/Desktop/github_repos/2017-msan694-example/Data/filtered_registered_business_sf.csv'

From "filtered_registered_business_sf.csv", create a pair RDD of (zip, (store name, city)).
- Count pairs which do not have a key.
- Filter pairs that do not include "San Francisco" in the city value.

#### Count Pairs which don't have a key

In [13]:
lines = sc.textFile(PATH2)
lines.collect()

parse = lines.map(lambda x : x.split(','))
parse = parse.map(lambda x : (x[0],(x[1],x[2])))

len(parse.lookup(''))

92

In [16]:
parse.countByKey()['']

92
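
One caveat with `x.split(',')`: a store name that contains a comma and is quoted in the CSV gets split into extra fields, which shifts the city column. Below is a minimal sketch of a parser that could be passed to `lines.map(...)` instead. It assumes the column order (zip, store name, city) used above; `to_pair` and the sample rows are made up for illustration and are not from the real file.

```python
import csv
from collections import Counter


def to_pair(line):
    """Parse one CSV line into (zip, (store name, city)).

    Assumes the first three columns are zip, name, city, as in the cell above.
    """
    fields = next(csv.reader([line]))
    return (fields[0], (fields[1], fields[2]))


# Made-up rows for illustration only.
sample = [
    '94107,"Cafe, Inc",San Francisco',
    ',No Zip Store,Oakland',
    '94110,Taqueria,San Francisco',
]

pairs = [to_pair(line) for line in sample]

# Local equivalent of countByKey(): number of pairs per key.
counts = Counter(key for key, _ in pairs)
missing = counts['']
```

On the cluster this would be `parse = lines.map(to_pair)`, followed by `parse.countByKey()['']` as above.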

#### Filter pairs that do not include “San Francisco” in the city value.

In [14]:
lines = sc.textFile(PATH2)
lines.collect()

parse = lines.map(lambda x : x.split(','))
parse = parse.map(lambda x : (x[0],(x[1],x[2])))
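# Note: x[1][1] is the city string, so this predicate keeps every pair with a
# non-empty city; test 'San Francisco' in x[1][1] to match the city itself.
parse.filter(lambda x : x[1][1]).count()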

198561
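
A predicate that tests the city value explicitly could look like the sketch below. The function name and the sample pairs are hypothetical; the pair layout (zip, (store name, city)) is the one built above.

```python
def mentions_san_francisco(pair):
    """True when the city part of (zip, (name, city)) contains 'San Francisco'."""
    _zip, (_name, city) = pair
    return 'San Francisco' in city


# Made-up pairs for illustration only.
sample_pairs = [
    ('94107', ('Cafe', 'San Francisco')),
    ('94607', ('Bakery', 'Oakland')),
    ('', ('No Zip Store', 'San Francisco')),
]

kept = [p for p in sample_pairs if mentions_san_francisco(p)]
dropped = [p for p in sample_pairs if not mentions_san_francisco(p)]
```

With the RDD, `parse.filter(mentions_san_francisco)` keeps the San Francisco pairs, and `parse.filter(lambda p: not mentions_san_francisco(p))` keeps the rest, depending on which reading of the exercise is intended.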

## Week 4: About them RDDs (resilient distributed dataset)

### 4.1 - how to add some data redundancy for data/system failures

RDDs are by default **recomputed each time** an action is called on them.

---

To cache the data for repeated use:
 
**`line_with_spark.persist(StorageLevel.persistency_level)`**

**`persistency_level`** - storage levels can be the following:

- **`MEMORY_ONLY`** : keep as much as possible in RAM, recompute the rest when needed

- **`MEMORY_AND_DISK`** : keep as much as possible in memory, spill the remainder to disk

- **`MEMORY_ONLY_SER`** : like `MEMORY_ONLY`, but the data is serialized into a stream of bytes (smaller footprint) and has to be converted back when it is read

- **`MEMORY_AND_DISK_SER`** : like `MEMORY_AND_DISK`, but the in-memory data is serialized the same way

- **`DISK_ONLY`** : keeps RDD data only on disk

- **`MEMORY_ONLY_2`, `MEMORY_AND_DISK_2`, etc.** : replicate each partition on two different nodes


### Why Replicate?

If a server fails, the replica on another node gives much faster recovery during computation, because the lost partitions do not have to be recomputed.

### Example

Python example 
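```python
from pyspark.storagelevel import StorageLevel

# `rdd` stands for any existing RDD
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # persist with an explicit storage level
rdd.unpersist()                            # remove the persisted copy

rdd.cache()                                # shorthand for rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.persist()                              # with no argument, also MEMORY_ONLY
```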

**`.cache()`** is the same thing as **`.persist(MEMORY_ONLY)`**
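
To make the "recomputed each time" behaviour concrete, here is a toy model in plain Python. It is not Spark code: `LazyDataset` is a hypothetical stand-in that holds a recipe for the data rather than the data, and it only mimics the idea that an action re-runs the recipe unless the result has been persisted.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores how to build the data, not the data."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data
        self._cached = None
        self._persisted = False
        self.computations = 0     # how many times the recipe actually ran

    def persist(self):
        self._persisted = True    # lazy: nothing is stored until the next action
        return self

    def unpersist(self):
        self._persisted = False
        self._cached = None       # drop the stored copy
        return self

    def _materialize(self):
        if self._persisted and self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._persisted:
            self._cached = data
        return data

    def count(self):              # an "action"
        return len(self._materialize())


ds = LazyDataset(lambda: (n * n for n in range(1000)))

ds.count()
ds.count()
runs_without_persist = ds.computations  # the recipe ran once per action

ds.persist()
ds.count()
ds.count()
ds.count()
runs_with_persist = ds.computations - runs_without_persist  # ran once, then reused
```

The first action after `persist()` still pays the full cost (and the cost of storing the result); only later actions benefit.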

### Example 6 - comparison


#### No persist management
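```python
import time

start = time.time()
lines.count()  # first action: reads and computes the RDD
print("first %.2f" % (time.time() - start))

start = time.time()
lines.count()  # second action: recomputed from scratch, nothing is cached
print("second %.2f" % (time.time() - start))
```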
#### Check the time
```
first 59.7 secs
second 56.72 secs
```

#### With persist
```python
import time
from pyspark.storagelevel import StorageLevel

lines.persist(StorageLevel.MEMORY_AND_DISK)
start = time.time()
lines.count()  # first action: computes the RDD and stores it
print("first %.2f" % (time.time() - start))

start = time.time()
lines.count()  # served from the persisted copy
print("second %.2f" % (time.time() - start))

start = time.time()
lines.count()
print("third %.2f" % (time.time() - start))
```
#### Check the time: repeat calls are faster
```
first 79.07
second 20.86
third 20.4
```

### Get storage level

**`getStorageLevel()`** - returns the storage option flags set for an RDD.

- **`StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)`** 
    - **`useDisk`** : If set, partitions that do not fit in memory will be written to disk.
    - **`useMemory`** : If set, the RDD will be stored in memory.
    - **`useOffHeap`** : If set, the RDD will be stored outside of the Spark executor in an external system such as Tachyon.
    - **`deserialized`** : If set, the RDD will be stored as deserialized Java objects.
    - **`replication`** : An integer that controls the number of copies of the persisted data to be stored.
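
To read a flag tuple at a glance, the five fields can be turned into a short description. The sketch below uses a plain `namedtuple` with the same field names as the signature above; `Level` and `describe` are hypothetical helpers, not part of pyspark.

```python
from collections import namedtuple

# Toy mirror of the five flags listed above; this is not the pyspark class.
Level = namedtuple('Level', 'useDisk useMemory useOffHeap deserialized replication')


def describe(level):
    """Summarize where the data lives, in what form, and how many copies exist."""
    places = [name for flag, name in (
        (level.useMemory, 'memory'),
        (level.useDisk, 'disk'),
        (level.useOffHeap, 'off-heap'),
    ) if flag]
    if not places:
        return 'not persisted'
    form = 'deserialized' if level.deserialized else 'serialized'
    return '%s, %s, x%d' % (' + '.join(places), form, level.replication)


print(describe(Level(False, False, False, False, 1)))  # not persisted
print(describe(Level(True, True, False, False, 2)))    # memory + disk, serialized, x2
```

With a real RDD, the same attributes (`useDisk`, `useMemory`, `useOffHeap`, `deserialized`, `replication`) can be read from the object returned by `lines.getStorageLevel()`.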

## Storage Guidance when designing a Spark Program

- **`MEMORY_ONLY`** - best choice when the data fits in memory
- **Disk** - use a disk-backed level only if recomputation is expensive and the data cannot fit in memory
- **Off-Heap** - data is stored outside of the Spark executors in an external system. Use if there are memory issues
- **Serialization** - use if there are memory issues, or if the data is too big to fit into memory
- **Replication** - faster fault recovery, but the trade-off is more storage space used on your cluster. Use when the connection to your cluster is unreliable, or you have a live web app
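
The guidance above can be written down as a small rule of thumb. This is only a sketch of the list, not an official recipe: `suggest_storage_level` and its parameters are hypothetical, and it returns the name of a storage level rather than a pyspark object. Serialized levels are left out on the assumption that PySpark pickles stored Python objects anyway.

```python
def suggest_storage_level(fits_in_memory, recompute_is_expensive,
                          need_fast_recovery=False):
    """Return the name of a storage level following the rules of thumb above."""
    if fits_in_memory:
        name = 'MEMORY_ONLY'
    elif recompute_is_expensive:
        # Spill what does not fit to disk instead of recomputing it.
        name = 'MEMORY_AND_DISK'
    else:
        # Cheap to rebuild: let partitions that do not fit be recomputed.
        name = 'MEMORY_ONLY'
    if need_fast_recovery:
        name += '_2'  # replicate each partition on two nodes
    return name


choice = suggest_storage_level(fits_in_memory=False, recompute_is_expensive=True)
```

The returned name could then be resolved with `getattr(StorageLevel, choice)` and passed to `persist(...)`.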

### Sample:

```python
    Output[1]:
        StorageLevel(False, False, False, False, 1)
```

In [19]:
PATH2 = '/Users/tlee010/Desktop/github_repos/2017-msan694-example/Data/USF_Mission.txt'

In [21]:
from pyspark.storagelevel import StorageLevel
lines = sc.textFile(PATH2)

### Example 7

Try persisting some of the data

In [22]:
lines.persist(StorageLevel.DISK_ONLY)

/Users/tlee010/Desktop/github_repos/2017-msan694-example/Data/USF_Mission.txt MapPartitionsRDD[35] at textFile at NativeMethodAccessorImpl.java:0

In [28]:
lines.count()

24

In [29]:
# will error, no java
#lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

In [30]:
# will error
#lines.persist(StorageLevel.MEMORY_ONLY_2)

### Once an action has been performed, the persisted data will show in the `Storage` tab of the Spark web UI


![](http://apache-spark-user-list.1001560.n3.nabble.com/attachment/6487/0/Cloudera-Training-Spark-Developer-VM-cdh5.0_ALPHA.png)

## Week 5 Material : Running Spark on a Cluster

### Spark Cluster

### Spark Standalone Cluster - using Spark's own cluster manager instead of Mesos / YARN

### Spark Cluster Types:
- Spark Standalone
- Hadoop YARN
- Mesos

## Components (when running)

![](https://jaceklaskowski.gitbooks.io/mastering-apache-spark/images/driver-sparkcontext-clustermanager-workers-executors.png)

**Client** 
- starts the driver program.
- prepares config + class options
- `spark-submit`, `pyspark`, or `spark-shell` scripts

**Driver** 
- 1 driver per application
- Monitors the Spark executors
- distributes tasks to each worker's executor

**Executor** 
- runs Spark tasks with a variable number of cores
- stores and caches data in memory



## Deploy Mode

#### Example: Driver in the cluster
![](https://cdn-images-1.medium.com/max/568/1*-d0reAUnrkZQhL72ks8paA.png)

#### Example: Driver outside the cluster
![](http://freecontent.manning.com/wp-content/uploads/bonaci_runtimeArch_02.png)
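
The deploy mode is picked at submit time. As a sketch, the helper below assembles a `spark-submit` command line using its `--master` and `--deploy-mode` options: `client` keeps the driver on the submitting machine (outside the cluster), `cluster` starts it on a node inside the cluster. The function name, host name, and script name are hypothetical; 7077 is the usual default port of a standalone master.

```python
def build_submit_command(master_url, app_path, deploy_mode='client'):
    """Return the spark-submit argument list for the given deploy mode."""
    if deploy_mode not in ('client', 'cluster'):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        'spark-submit',
        '--master', master_url,
        '--deploy-mode', deploy_mode,
        app_path,
    ]


# Hypothetical master host and application script.
cmd = build_submit_command('spark://master-host:7077', 'my_app.py')
print(' '.join(cmd))
```

The list can be handed to `subprocess.call(cmd)` on a machine where `spark-submit` is on the PATH.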

## Spark Cluster 


![](https://i.stack.imgur.com/KHekf.png)

**Cluster Manager**
- monitors workers, and allocates and reserves resources (processing, storage, memory). Kinda like my old boss at a restaurant

**Master**
- Takes in applications
- Requests resources
- Sometimes acts as a resource manager

**Worker**
- Receives instructions from the master
- Launches executors on the worker node
- Spark must be installed on every node


### Good Linkedin Article

https://www.linkedin.com/pulse/spark-architecture-ashish-kumar/

![](https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAs-AAAAJGIzNmIyOWFmLTVkODEtNDBlYS1hNGE4LTI1ZGE2NTQyZTQzYg.jpg)

Spark Application Workflow in Standalone Mode

1. The client connects to the master.

2. The master starts the driver on one of the nodes.

3. The driver connects to the master and requests executors to run the tasks.

4. The master connects to the worker nodes and requests that they create executors.

5. Each worker node creates one executor for each application.

6. The driver connects to the executors and schedules tasks on them.

7. The executors report task status back to the driver.

8. The driver sends the application output to the client.

