
# **Apache Hadoop (MapReduce)**

![Hadoop Logo](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Hadoop_logo.svg/220px-Hadoop_logo.svg.png)

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and should therefore be handled automatically in software by the framework.

The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce). Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code for nodes to process in parallel, based on the data each node needs to process. This approach takes advantage of data locality — nodes manipulating the data that they have on hand — to allow the data to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking.

![caption](./images/data_sharing_mapreduce.jpg)
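The map, shuffle and reduce steps described above can be sketched in plain Python. This is only an illustration that runs on one machine, not Hadoop code: the function names and the two text `blocks` are hypothetical stand-ins for file blocks stored on different nodes.

```python
from collections import defaultdict

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical blocks, standing in for file blocks held by different nodes.
blocks = ["cat rat cat", "rat elephant"]

# Each block is mapped independently; in a cluster this would happen
# on the node that already stores the block (data locality).
mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Only the small `(word, 1)` pairs need to move between the phases, while the blocks themselves stay where they are stored.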

# **Apache Spark**

![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png)

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation, which maintains it today. In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications. Because user programs can load data into a cluster's memory and query it repeatedly, Spark is well suited to iterative workloads such as machine learning algorithms.

![caption](./images/data_sharing_spark.jpg)


# **Spark Driver and Workers**

* A Spark application runs two kinds of programs:
    * One driver program and several worker programs
* The driver program runs on a given gateway node
* Worker programs run on cluster nodes or in local threads
* RDDs are distributed across workers

![caption](./images/driver_workers_spark_architecture.png)

# **Spark Context**

* A Spark program first creates a SparkContext object
    * It tells Spark how and where to access a cluster
    * The pySpark shell automatically creates it as the ``sc`` variable
    * Notebooks and programs must use a constructor to create a new SparkContext
* Use the SparkContext to create RDDs

* The master parameter of a SparkContext determines which type and size of cluster to use

In the Databricks cloud, the SparkContext is already created for you.

# **Resilient Distributed Datasets (RDD)**

* The primary abstraction in Spark
    * Immutable once constructed
    * Track lineage information to efficiently recompute lost data
    * Enable operations on collection of elements in parallel

* You construct RDDs
    * by parallelizing existing Python collections (lists)
    * by transforming an existing RDD
    * from files in HDFS or any other storage system

* The programmer specifies the number of partitions for an RDD (a default value is used if unspecified)
    * In general: more partitions = more parallelism (be careful to match the number of partitions to the number of nodes / threads available)
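To get a feeling for what the number of partitions means, the sketch below cuts a Python list into contiguous slices. It is a plain Python illustration; `split_into_partitions` is a hypothetical helper, and the slicing scheme Spark actually uses may differ.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous slices of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

words = ['cat', 'elephant', 'rat', 'rat', 'cat']

# Two partitions: each one could be processed by a different worker.
partitions = split_into_partitions(words, 2)
print(partitions)

# More partitions than elements: some partitions end up empty.
too_many = split_into_partitions(words, 8)
empty_partitions = sum(1 for p in too_many if not p)
print(empty_partitions)
```

Asking for far more partitions than there is data (or than there are threads to process them) adds scheduling work without adding useful parallelism.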
    

![caption](./images/RDD.png)

* Two types of operations: transformations and actions
    * Transformations are lazy (not computed immediately)
    * A transformed RDD is computed only when an action runs on it
    * RDDs can be persisted (cached) in memory or on disk

* Example:
    * Create an RDD from a data source: from a Python list (``sc.parallelize()``) or a file (``sc.textFile()``)
    * Apply transformations to an RDD: ``map``, ``filter``
    * Apply actions to an RDD: ``collect``
 
![caption](./images/RDD_actions.png)

* Further information about RDD: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf 

In [1]:
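# Create an RDD from a Python list
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList)
# Transformation: keep every word except "rat"
FilteredWordsRDD = wordsRDD.filter(lambda w: w != "rat")
# Transformation: append "s" to each remaining word
FilteredPluralWordsRDD = FilteredWordsRDD.map(lambda w: w + "s")
# Action: bring the results back to the driver
print(FilteredPluralWordsRDD.collect())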

['cats', 'elephants', 'cats']


# **Spark Transformations**

* Create new datasets from an existing one
* Use lazy evaluation: results are not computed right away. Instead, Spark remembers the set of transformations applied to the base dataset
    * Spark optimizes the required calculations
    * Spark recovers from failures and slow workers

* Hint: Think of this as a recipe for creating result
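The "recipe" idea can be mimicked in a few lines of plain Python. The `LazyDataset` class below is a hypothetical toy, not Spark's implementation: transformations only extend a stored list of steps, and the work happens when the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: base data plus a recipe of pending steps."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # the recorded transformations

    def map(self, func):
        # Transformation: only extends the recipe, computes nothing.
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        # Transformation: only extends the recipe, computes nothing.
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay the whole recipe over the base data.
        result = list(self._data)
        for kind, func in self._steps:
            if kind == "map":
                result = [func(x) for x in result]
            else:
                result = [x for x in result if func(x)]
        return result

calls = []  # records when the mapped function actually runs

def plural(word):
    calls.append(word)
    return word + "s"

base = LazyDataset(['cat', 'elephant', 'rat', 'rat', 'cat'])
recipe = base.filter(lambda w: w != "rat").map(plural)
calls_before_action = len(calls)  # nothing has been computed yet
result = recipe.collect()         # the action triggers the computation
print(calls_before_action, result)
```

Because the recipe is kept alongside the base data, the same steps can be replayed if a result is lost, which is the intuition behind recovering from failures.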

![caption](./images/transformations.png)

* Difference between ``map`` and ``flatMap``: 
    - ``map`` transforms an RDD of length N into another RDD of length N
    - ``flatMap`` transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results

In [2]:
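# flatMap example: each element yields a list holding one [x, x**2] pair,
# and flatMap removes exactly one level of nesting
rdd = sc.parallelize([1, 2, 3])
squaredRdd = rdd.flatMap(lambda x: [[x, x**2]])
print(squaredRdd.collect())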

[[1, 1], [2, 4], [3, 9]]
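The difference between ``map`` and ``flatMap`` can also be seen without a cluster. The sketch below uses plain Python lists; `local_map` and `local_flat_map` are hypothetical helpers that only imitate the semantics of the two transformations.

```python
from itertools import chain

def local_map(func, items):
    """Produce exactly one output element per input element."""
    return [func(x) for x in items]

def local_flat_map(func, items):
    """func returns an iterable per element; results are flattened one level."""
    return list(chain.from_iterable(func(x) for x in items))

numbers = [1, 2, 3]
mapped = local_map(lambda x: [x, x**2], numbers)
flat_mapped = local_flat_map(lambda x: [x, x**2], numbers)
print(mapped)       # three lists, one per input element
print(flat_mapped)  # six numbers in a single flat list
```

Flattening removes one level of nesting only, which is why the ``flatMap`` cell above, whose function wraps each pair in an extra list, still returns pairs.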


# **Spark Actions**

* Cause Spark to execute the recipe of transformations on the source data
* Are the mechanism for getting results out of Spark

![caption](./images/actions.png)

In [3]:
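# takeOrdered example: the key function negates each value,
# so the three largest elements are returned in descending order
rdd = sc.parallelize([5, 3, 1, 2])
print(rdd.takeOrdered(3, lambda s: -1 * s))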

[5, 3, 2]
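The effect of the key function can be reproduced locally with the standard library. This is a sketch of the ordering logic only, using `heapq.nsmallest` on a plain list as a stand-in for the distributed action.

```python
import heapq

numbers = [5, 3, 1, 2]

# Without a key: the three smallest elements in ascending order.
smallest = heapq.nsmallest(3, numbers)

# Negating each value in the key reverses the ordering,
# so the three largest elements come first.
largest = heapq.nsmallest(3, numbers, key=lambda s: -1 * s)
print(smallest, largest)
```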


# **Review: Python lambda Functions**

* Small anonymous functions (not bound to a name)
    * ``lambda a, b: a + b`` returns the sum of its two arguments
* Can use lambda functions wherever function objects are required
* Restricted to a single expression

In [4]:
a = 5
b = 3

f = lambda a, b: a + b

f(a,b)

8

# **Caching RDDs**

* You can mark an RDD to be persisted using the _persist()_ or _cache()_ methods on it 
* The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In [5]:
rdd.cache() # save, don't recompute!

ParallelCollectionRDD[4] at parallelize at PythonRDD.scala:392
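Why caching pays off can be illustrated with a toy class that counts how often its elements are computed. `CountingDataset` is hypothetical and runs on a single machine; it only mimics the behaviour described above, where an uncached dataset is recomputed by every action.

```python
class CountingDataset:
    """Toy dataset that counts how often its elements are (re)computed."""

    def __init__(self, data, func):
        self._data = data
        self._func = func
        self._use_cache = False
        self._cached = None
        self.evaluations = 0

    def cache(self):
        # Only marks the dataset; nothing is stored until the first action.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.evaluations += len(self._data)
        result = [self._func(x) for x in self._data]
        if self._use_cache:
            self._cached = result
        return result

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())

# Two actions on an uncached dataset: everything is computed twice.
uncached = CountingDataset([5, 3, 1, 2], lambda x: x * x)
uncached.count()
uncached.collect()

# Two actions on a cached dataset: computed once, then reused.
cached = CountingDataset([5, 3, 1, 2], lambda x: x * x).cache()
cached.count()
cached.collect()

print(uncached.evaluations, cached.evaluations)
```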

# **Spark Program Lifecycle**

1. Create RDDs from external data or parallelize a collection in your driver program
2. Lazily transform them into new RDDs
3. _cache()_ some RDDs for reuse
4. Perform actions to execute parallel computation and produce results

# **Spark Key-Value RDDs**

* Similar to MapReduce, Spark supports key-value pairs
* Each element of a pair RDD is a pair tuple, e.g. [(1, 2), (3, 4)]
* Some key-value transformations are listed in the figure below
* <font color="red"> Be careful using _groupByKey()_ as it can cause a lot of data movement across the network and create large Iterables at workers </font>

![caption](./images/key-value_transformations.png)

In [6]:
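# reduceByKey: merge the values of each key with the given function
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
reducedRdd = rdd.reduceByKey(lambda a, b: a + b)
print(reducedRdd.collect())

# sortByKey: order the pairs by key
rdd = sc.parallelize([(1, 'a'), (2, 'c'), (1, 'b')])
sortedRdd = rdd.sortByKey()
print(sortedRdd.collect())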

[(1, 2), (3, 10)]
[(1, 'a'), (1, 'b'), (2, 'c')]
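The warning about _groupByKey()_ can be made concrete with a rough single-machine sketch. The two `partitions` and the helper `combine_locally` are hypothetical; the point is only to compare how many pairs would have to cross the network when values are combined inside each partition first (the idea behind ``reduceByKey``) versus when every pair is shipped unchanged.

```python
# Hypothetical layout: the pairs live in two partitions on different workers.
partitions = [
    [(1, 2), (3, 4), (3, 6)],
    [(3, 1), (1, 5), (3, 2)],
]

def combine_locally(pairs, func):
    """Merge values that share a key within one collection of pairs."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())

def add(a, b):
    return a + b

# groupByKey-style: every pair crosses the network unchanged.
shuffled_group = [pair for part in partitions for pair in part]

# reduceByKey-style: each partition is pre-combined before the shuffle.
shuffled_reduce = [
    pair for part in partitions for pair in combine_locally(part, add)
]

# The final per-key totals are the same either way.
totals_group = sorted(combine_locally(shuffled_group, add))
totals_reduce = sorted(combine_locally(shuffled_reduce, add))

print(len(shuffled_group), len(shuffled_reduce))
print(totals_group, totals_reduce)
```

With only six pairs the saving is small, but it grows with the number of repeated keys per partition.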


# **pySpark Shared Variables**


* Spark automatically creates closures (functions bundled with their own environment) for
    * Functions that run on RDDs at workers (e.g. a map function)
    * Any global variables used by those functions

* Why is this useful? No communication between workers is required

* Problems:
    * Changes to global variables at workers are not sent back to the driver (closures are one-way: from driver to workers). Imagine that you want to count missing values at the workers and obtain the total number of missing values at the driver
    * Even worse: iterative or single jobs with large global variables ship large read-only data to the workers inside each closure, e.g. a lookup table or a large feature vector in an ML algorithm (very inefficient)
    
* Solution:
    * Broadcast Variables
        * Efficiently send large, read-only value to all workers (distributed using efficient broadcast algorithms)
        * Saved at workers for use in one or more Spark operations
        * Like sending a large, read-only lookup table to all the nodes
    * Accumulators (Types: integers, double, long, float)
        * Aggregate values from workers back to driver
        * Only driver can access value of accumulator (Tasks at workers cannot access accumulator’s values)
        * For tasks, accumulators are write-only
        * Accumulators can be used in actions or transformations, but remember that only actions guarantee their execution

In [7]:
#Broadcast variable

#At the driver:
broadcastVar = sc.broadcast([1, 2, 3])
#At a worker (in code passed via a closure)
broadcastVar.value

[1, 2, 3]

In [8]:
#Accumulator variable

#At the driver
accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

#At the worker
def f(x):
    global accum #global (like python global variables)
    accum += x

rdd.foreach(f)
accum.value

10
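The one-way nature of closures, and the accumulator-style fix, can be imitated in plain Python. In this sketch `pickle` stands in for shipping the driver's state to a worker, and `run_on_worker` is a hypothetical function; no real cluster or Spark API is involved.

```python
import pickle

counter = {"missing": 0}  # driver-side variable captured by the task

def run_on_worker(state_bytes, partition):
    """Mimic a worker: it receives a serialized copy of the driver's state."""
    state = pickle.loads(state_bytes)
    for value in partition:
        if value is None:
            state["missing"] += 1
    return state["missing"]  # partial count, local to this worker

# Hypothetical data split over two workers, with None marking missing values.
partitions = [[1, None, 3], [None, None, 6]]
shipped = pickle.dumps(counter)

partial_counts = [run_on_worker(shipped, part) for part in partitions]

driver_view = counter["missing"]     # unchanged: workers only touched copies
total_missing = sum(partial_counts)  # accumulator-style merge at the driver
print(driver_view, partial_counts, total_missing)
```

The driver's own counter never changes because each worker updates a private copy. An accumulator solves this by sending the partial results back to the driver to be merged.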