# Why Apache Spark for Big Data?

1. Easy to use. Provides a high-level API that focuses on the content of the computation.
2. Fast, enabling interactive use and complex algorithms.
3. General engine. Combines multiple types of computations (SQL queries, text processing, and ML).

# Chapter 1: Introduction to Data Analysis with Spark

## What is Apache Spark?

1. Apache Spark is a cluster computing platform designed to be fast and general-purpose.
2. It can run computations in memory.
3. It is more efficient than MapReduce for complex applications.
4. It integrates closely with other Big Data tools.

## A Unified Stack

1. Spark Core - Task scheduling, memory management, RDD API
2. Spark SQL - Structured data
3. Spark Streaming - Live streams of data in real time
4. MLlib - Machine learning
5. GraphX - Graph processing
6. Cluster Managers - Standalone, YARN, Mesos

## Users of Spark

1. Data Scientist
2. Engineer

# Chapter 2: Downloading Spark and Getting Started

The Spark shell allows us to interact with data that is distributed on disk or in memory across many machines.
Spark provides both Scala and Python shells.

1. Scala shell: bin/spark-shell
2. Python shell (PySpark): bin/pyspark

## Changing verbosity of logging in spark shell

Make a copy of conf/log4j.properties.template called conf/log4j.properties and find the following line:  
log4j.rootCategory=INFO, console  
And change it to  
log4j.rootCategory=WARN, console

## Working with RDD

In [4]:
lines = sc.textFile('file:///usr/local/spark/README.md')

In [5]:
lines

file:///usr/local/spark/README.md MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

In [6]:
lines.count()

104

In [7]:
lines.first()

u'# Apache Spark'

## Introduction to Core Spark Concepts

Every Spark application consists of a driver program that launches various parallel operations on a cluster.  
Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.  
Driver programs typically manage a number of nodes called executors.

## Standalone Applications

In standalone applications, such as scripts, we have to initialize our own SparkContext.  
In Java and Scala, one has to give the application a Maven dependency on the spark-core artifact.  
In Python, the application must be run using the bin/spark-submit script.

### Initializing a SparkContext

In [8]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local').setAppName('My App')
# sc = SparkContext(conf=conf)  # A SparkContext is already running inside the IPython notebook

# Chapter 3: Programming with RDDs

A Resilient Distributed Dataset (RDD) is a distributed collection of elements.  
In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result.

## RDD Basics

Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.  
RDDs can contain any type of Python, Java, or Scala objects, including user defined classes.  

RDDs can be created in two ways:

1. Loading an external dataset.
2. Distributing a collection of objects in the driver program.

Once created, RDDs offer two types of operations.

1. Transformations - construct a new RDD from a previous one.
2. Actions - compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS).

Spark performs transformations lazily, i.e., transformations are only computed when an action is called.

By default, Spark recomputes an RDD each time an action is run on it. To avoid this, persist the RDD:

In [9]:
# rdd.persist()

Every Spark application will work as follows:

1. Create RDD
2. Transform RDD
3. Persist RDD
4. Perform Action on RDD

## Creating RDDs

1. Take an existing collection in your program and pass it to SparkContext's parallelize()

In [10]:
lines = sc.parallelize(['pandas', 'I like pandas'])
lines

ParallelCollectionRDD[7] at parallelize at PythonRDD.scala:475

2. Load data from external storage

In [11]:
lines = sc.textFile('file:///usr/local/spark/README.md')
lines

file:///usr/local/spark/README.md MapPartitionsRDD[9] at textFile at NativeMethodAccessorImpl.java:0

## RDD Operations

1. Transformations - return a new RDD
2. Actions - return some other data type (a value or a collection, not an RDD)

### Transformations

Transformed RDDs are computed lazily, only when they are used in an action.

In [12]:
inputRDD = sc.textFile('log.txt')
errorsRDD = inputRDD.filter(lambda x: 'error' in x)
warningsRDD = inputRDD.filter(lambda x: 'warning' in x)
# badLinesRDD = errorsRDD.union(warningsRDD)

Spark keeps track of the set of dependencies between different RDDs, called the lineage graph.  
It uses this information to compute each RDD on demand and to recover lost data if part of a persistent RDD is lost.

### Actions

Operations that return a final value to the driver program or write data to an external storage system.

In [13]:
# count() returns an int, so it must be converted before string concatenation
# print('Input had ' + str(badLinesRDD.count()) + ' concerning lines')
# print('Here are 10 examples:')
# for line in badLinesRDD.take(10):
#     print(line)

RDDs also have a collect() function to retrieve the entire RDD. It should only be used when the whole dataset fits in memory on the driver.  
For a large RDD, it is better to write the contents to storage with the saveAsTextFile() function.

### Lazy Evaluation

When we call a transformation on an RDD, the operation is not immediately performed.  
Think of each RDD as consisting of instructions on how to compute the data that we build up through transformations.
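
The "RDD as a recipe" idea can be modeled in plain Python. The sketch below is not Spark and needs no cluster: `LazyDataset` and `read_log` are hypothetical names made up for illustration. Transformations only record a step, and the source is read only when an action (`collect()` or `count()`) is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a recipe, not computed data."""

    def __init__(self, source, steps=()):
        self._source = source   # callable that yields the raw elements
        self._steps = steps     # tuple of recorded (kind, function) steps

    # Transformations: record the step and return a new LazyDataset.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    # Actions: replay the recorded steps over the source data.
    def collect(self):
        data = self._source()
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def count(self):
        return len(self.collect())


reads = []  # grows by one each time the source is actually read


def read_log():
    reads.append(1)
    return iter(["error: disk full", "ok", "warning: slow", "error: timeout"])


lines = LazyDataset(read_log)
errors = lines.filter(lambda line: "error" in line).map(str.upper)
reads_before_action = len(reads)  # nothing has been read yet
result = errors.collect()         # the action runs the whole pipeline
```

Calling a second action on `errors` reads the source again, which mirrors why an RDD that is reused across several actions is usually persisted.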

## Passing functions to Spark

Most of Spark’s transformations, and some of its actions, depend on passing in functions that are used by Spark to compute data.

Three options for passing functions:
1. lambda
2. Top-level functions
3. Locally defined functions
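
The three styles look like this in plain Python. This is a minimal sketch that runs without Spark: the built-in `filter()` stands in for `rdd.filter()`, and the function names and sample lines are made up for illustration.

```python
# 1. Lambda: convenient for short, single-expression functions.
has_error = lambda line: "error" in line


# 2. Top-level function: defined at module level, easy to reuse and test.
def contains_warning(line):
    return "warning" in line


# 3. Locally defined function: created inside another function, so it can
#    close over local variables such as `keyword`.
def make_keyword_filter(keyword):
    def matches(line):
        return keyword in line
    return matches


log_lines = ["error: disk full", "warning: slow", "ok"]

# The built-in filter() plays the role of rdd.filter() here.
error_lines = list(filter(has_error, log_lines))
warning_lines = list(filter(contains_warning, log_lines))
ok_lines = list(filter(make_keyword_filter("ok"), log_lines))
```

With a real RDD, the same callables would be passed directly, for example `inputRDD.filter(contains_warning)`.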

## Common Transformations and Actions

### Element-wise transformations

1. map() - takes in a function and applies it to each element in the RDD with the result of the function being the new value of each element in the resulting RDD. 
2. filter() - takes in a function and returns an RDD that only has elements that pass the filter() function.
3. flatMap() - takes in a function that returns an iterator of values for each input element. Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators.

In [14]:
lines = sc.parallelize(['hello world', 'hi'])

In [15]:
words = lines.map(lambda line: line.split())
words.collect()

[['hello', 'world'], ['hi']]

In [16]:
words = lines.flatMap(lambda line: line.split())
words.collect()

['hello', 'world', 'hi']

### Pseudo set operations

1. rdd.distinct()
2. rdd1.union(rdd2)
3. rdd1.intersection(rdd2)
4. rdd1.subtract(rdd2)
5. rdd1.cartesian(rdd2)
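
The semantics of these operations can be illustrated with ordinary Python lists. This is only a local model, not Spark code: the variable names mirror the list above, the results are sorted where needed for readability, and a real RDD makes no ordering guarantee.

```python
from itertools import product

rdd1 = ["coffee", "coffee", "panda", "monkey", "tea"]
rdd2 = ["coffee", "monkey", "kitty"]

# distinct(): drop duplicate elements.
distinct = sorted(set(rdd1))

# union(): keep everything from both sides, duplicates included.
union = rdd1 + rdd2

# intersection(): elements present in both, with duplicates removed.
intersection = sorted(set(rdd1) & set(rdd2))

# subtract(): elements of the first dataset that are absent from the second.
other = set(rdd2)
subtract = [x for x in rdd1 if x not in other]

# cartesian(): every possible (a, b) pair; the size is len(a) * len(b).
cartesian = list(product(["a", "b"], [1, 2]))
```

Note that `union` keeps duplicates while `intersection` removes them. On a cluster, `distinct()` and `intersection()` require a shuffle, and `cartesian()` grows quickly with input size.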

### Actions

1. reduce() - takes a function that operates on two elements of the type in your RDD and returns a new element of the same type.
2. fold() - takes a function with the same signature as needed for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition. The zero value you provide should be the identity element for your operation.
3. aggregate() - we supply an initial zero value of the type we want to return. We then supply a function to combine the elements from our RDD with the accumulator. Finally, we need to supply a second function to merge two accumulators, given that each node accumulates its own results locally.
4. collect() - returns the entire RDD's content to the driver.
5. take(n) - returns n elements from the RDD and attempts to minimize the number of partitions it accesses, so it may represent a biased collection.
6. top() - extract the top elements from an RDD.
7. takeSample(withReplacement, num, seed) - take a sample of our data either with or without replacement.
8. foreach() - lets us perform computations on each element in the RDD without bringing it back locally.
9. count()
10. countByValue() - returns a map of each unique value to its count.

Note: Return type of the result in reduce() and fold() should be the same type as that of the elements in the RDD we are operating over.
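
The difference between fold() and aggregate() is easier to see with the partitions made explicit. The following is a plain-Python model of the semantics, not Spark itself: `rdd_fold` and `rdd_aggregate` are hypothetical helpers, and the two-partition layout is an assumption chosen for illustration.

```python
from functools import reduce

# Pretend the RDD [1, 2, 3, 4] is stored as two partitions.
partitions = [[1, 2], [3, 4]]


def rdd_fold(partitions, zero, op):
    # Fold each partition starting from the zero value, then fold the
    # per-partition results together, again starting from the zero value.
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


def rdd_aggregate(partitions, zero, seq_op, comb_op):
    # seq_op folds one element into an accumulator within a partition;
    # comb_op merges the accumulators produced by different partitions.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


total = rdd_fold(partitions, 0, lambda x, y: x + y)

# A (sum, count) accumulator: the result type differs from the element
# type, which reduce() and fold() cannot express.
sum_count = rdd_aggregate(
    partitions,
    (0, 0),
    lambda acc, value: (acc[0] + value, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
average = sum_count[0] / float(sum_count[1])

# A non-identity zero value is added once per partition and once more
# when merging, so the result depends on the number of partitions.
skewed = rdd_fold(partitions, 10, lambda x, y: x + y)
```

Here `skewed` is 40 rather than 20, which shows why the zero value should be the identity element for the operation.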


## Converting between RDD types

Some functions are available only on certain types of RDDs, such as mean() and variance() on numeric RDDs or join() on key/value pair RDDs.  
In Scala and Java, these methods aren’t defined on the standard RDD class, so to access this additional functionality we have to make sure we get the correct specialized class.

## Persistence (Caching)

To avoid computing an RDD multiple times, we can ask Spark to persist the data.

In [17]:
# rdd.persist(StorageLevel.MEMORY_ONLY)

# Chapter 4: Working with Key/Value Pairs

Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network.

## Creating Pair RDDs

1. Some input formats directly return pair RDDs for their key/value data.
2. Use map() to convert RDD into Pair RDD.

In [18]:
pairs = words.map(lambda w: (w, 1))
pairs.collect()

[('hello', 1), ('world', 1), ('hi', 1)]

## Transformations on Pair RDDs

1. reduceByKey()
2. groupByKey()
3. combineByKey()
4. mapValues()
5. flatMapValues()
6. keys()
7. values()
8. sortByKey()

### Aggregations

1. reduceByKey() - runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.
2. foldByKey()
3. combineByKey() - is the most general of the per-key aggregation functions. Most of the other per-key combiners are implemented using it.
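
A plain-Python model can make the three functions that combineByKey() takes more concrete. It is a sketch of the semantics only: `combine_by_key` is a hypothetical helper, the partition layout is assumed, and the parameter names follow PySpark's `createCombiner`, `mergeValue`, and `mergeCombiners` arguments. The example computes a per-key average.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Step 1: within each partition, build one combiner per key.
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                # First time this key is seen in this partition.
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    # Step 2: merge the combiners for the same key across partitions.
    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged


partitions = [
    [("coffee", 1), ("coffee", 2), ("panda", 3)],
    [("coffee", 9)],
]

sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                         # create_combiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # merge_value
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge_combiners
)
averages = {key: s / float(n) for key, (s, n) in sum_count.items()}
```

reduceByKey() is the special case where the combiner is the value itself and both merge functions are the same.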

In [19]:
## Word Count
rdd = sc.textFile('file:///usr/local/spark/README.md')
words = rdd.flatMap(lambda x: x.split())
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
result.take(10)

[(u'when', 1),
 (u'R,', 1),
 (u'including', 4),
 (u'computation', 1),
 (u'contributing', 1),
 (u'submit', 1),
 (u'using:', 1),
 (u'guidance', 2),
 (u'Scala,', 1),
 (u'environment', 1)]

#### Tuning the level of parallelism

Spark will always try to infer a sensible default value based on the size of your cluster, but in some cases you will want to tune the level of parallelism for better performance.  
Pass the number of partitions as an argument to the grouping or aggregation operation, or use repartition() or coalesce() to change the partitioning of an existing RDD.

### Grouping

1. groupByKey() - If our data is already keyed in the way we want, groupByKey() will group our data using the key in our RDD. On an RDD consisting of keys of type K and values of type V, we get back an RDD of type [K, Iterable[V]].
2. groupBy() - works on unpaired data or data where we want to use a different condition besides equality on the current key. It takes a function that it applies to every element in the source RDD and uses the result to determine the key.
3. cogroup() - group data sharing the same key from multiple RDDs.
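
The three grouping operations differ mainly in where the key comes from. The sketch below models them on local lists. The helper names and sample data are made up for illustration, and real RDDs return iterables rather than lists.

```python
from collections import defaultdict


def group_by_key(pairs):
    # Collect every value that shares a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def group_by(elements, key_fn):
    # Derive the key from each element, then group as above.
    return group_by_key((key_fn(x), x) for x in elements)


def cogroup(pairs1, pairs2):
    # For each key seen in either dataset, pair up both value lists.
    g1, g2 = group_by_key(pairs1), group_by_key(pairs2)
    return {k: (g1.get(k, []), g2.get(k, [])) for k in set(g1) | set(g2)}


by_key = group_by_key([("a", 1), ("b", 2), ("a", 3)])
by_parity = group_by([1, 2, 3, 4], lambda x: x % 2)
both = cogroup([("a", 1), ("b", 2)], [("a", "x"), ("c", "y")])
```

When the goal is an aggregate per key, reduceByKey() or combineByKey() usually performs better than grouping first and reducing afterwards, because values are combined before they are sent across the network.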

### Joins

1. rdd1.join(rdd2)
2. rdd1.leftOuterJoin(rdd2)
3. rdd1.rightOuterJoin(rdd2)
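
The three joins differ in which keys survive. The following plain-Python sketch models the result shapes on small lists of pairs. The helper names are hypothetical, and `None` marks the missing side as it does in PySpark.

```python
def inner_join(left, right):
    # Keep only keys present on both sides; emit every matching combination.
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


def left_outer_join(left, right):
    # Every key from the left side survives; a missing right value is None.
    right_keys = {k for k, _ in right}
    unmatched = [(k, (v, None)) for k, v in left if k not in right_keys]
    return inner_join(left, right) + unmatched


def right_outer_join(left, right):
    # Every key from the right side survives; a missing left value is None.
    left_keys = {k for k, _ in left}
    unmatched = [(k, (None, w)) for k, w in right if k not in left_keys]
    return inner_join(left, right) + unmatched


left = [("a", 1), ("b", 2)]
right = [("a", "x"), ("a", "y"), ("c", "z")]

joined = inner_join(left, right)
left_joined = left_outer_join(left, right)
right_joined = right_outer_join(left, right)
```

A key that appears several times on one side produces one output pair per matching combination, as with `"a"` above.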

### Sorting data

1. sortByKey()

## Actions Available on the Pair RDDs

1. countByKey()
2. collectAsMap()
3. lookup(key) - Return all values associated with the provided key.

## Data Partitioning (Advanced)

Spark programs can choose to control their RDDs’ partitioning to reduce communication.  
Use partitionBy() transformation at the start of the program.

### Determining an RDD's partitioner

Use rdd.partitioner in Scala and Java to determine how an RDD is partitioned.

### Operations that benefit from Partitioning

cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup()

### Example: PageRank

### Custom Partitioners

While Spark’s HashPartitioner and RangePartitioner are well suited to many use cases, Spark also allows you to tune how an RDD is partitioned by providing a custom Partitioner object.
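
In PySpark, a custom partitioner is expressed as a function that maps a key to an integer, passed as the second argument of partitionBy(). The sketch below is one possible such function, assuming the keys are URLs and that pages from the same domain should land in the same partition. `domain_hash` and the sample URLs are made up for illustration.

```python
import zlib

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def domain_hash(url):
    # Hash only the domain so that all pages of a site share a partition.
    # crc32 is used instead of hash() because it gives the same value in
    # every Python process.
    domain = urlparse(url).netloc
    return zlib.crc32(domain.encode("utf-8")) & 0xFFFFFFFF


num_partitions = 4
urls = [
    "http://www.cnn.com/WORLD",
    "http://www.cnn.com/US",
    "http://www.example.org/index.html",
]
# The partition index is the hash modulo the number of partitions.
assigned = [domain_hash(u) % num_partitions for u in urls]
```

With a pair RDD keyed by URL, this would be used as `rdd.partitionBy(num_partitions, domain_hash)`.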

# Chapter 5: Loading and Saving your data

Three common sets of data sources:

1. File formats and filesystems - local or distributed filesystem.
2. Structured data sources through Spark SQL
3. Databases and key/value stores

## File formats

### Text Files

Loading:  
sc.textFile() - load a single text file as an RDD, each input line becomes an element in the RDD.  
sc.wholeTextFiles() - load multiple whole text files at the same time into a pair RDD, with the key being the name and the value being the contents of each file.  

Saving:  
result.saveAsTextFile() - The path is treated as a directory and Spark will output multiple files underneath that directory.

### JSON

Loading the data as a text file and then parsing the JSON data is an approach that we can use in all of the supported languages. This works assuming that you have one JSON record per row.

In [20]:
import json
data = rdd.map(lambda x: json.loads(x))

# data.filter(lambda x: x['lovesPandas']).map(lambda x: json.dumps(x)).saveAsTextFile(outputFile)

### CSV and TSV

Loading CSV/TSV data is similar to loading JSON data in that we can first load it as text and then process it.
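
A per-line parser is enough when no field contains an embedded newline. The sketch below uses Python's standard `csv` module and is written for Python 3. `parse_csv_line` and the sample lines are made up for illustration.

```python
import csv
import io


def parse_csv_line(line, delimiter=","):
    # csv.reader expects a file-like object, so wrap the single line.
    reader = csv.reader(io.StringIO(line), delimiter=delimiter)
    return next(reader)


# Quoted fields may contain the delimiter itself.
fields = parse_csv_line('panda,"loves, eats bamboo",3')

# The same helper handles TSV by switching the delimiter.
tsv_fields = parse_csv_line("panda\tbamboo\t3", delimiter="\t")
```

With an RDD of lines loaded through sc.textFile(), the function would be applied with `lines.map(parse_csv_line)`. Records whose fields span several lines need the whole file to be parsed at once, for example after loading with sc.wholeTextFiles().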

### SequenceFiles

SequenceFiles are a popular Hadoop format composed of flat files with key/value pairs. SequenceFiles have sync markers that allow Spark to seek to a point in the file and then resynchronize with the record boundaries. This allows Spark to efficiently read SequenceFiles in parallel from multiple nodes.

Use sc.sequenceFile() function to read sequence files  
Use pairRDD.saveAsSequenceFile() to save sequence file

### Object Files

Use sc.objectFile() to read an object file  
Use rdd.saveAsObjectFile() to save an object file

In Python, use saveAsPickleFile() and pickleFile() instead.

### Hadoop Input/Output formats

Use sc.hadoopFile() to load a file using the old Hadoop API  
Use sc.newAPIHadoopFile() to load a file using the new Hadoop API  
Use rdd.saveAsHadoopFile() to save an RDD using the old Hadoop API  
Use rdd.saveAsNewAPIHadoopFile() to save an RDD using the new Hadoop API

### Non-filesystem data sources
1. Protocol buffers

### File Compression

When working with Big Data, we often need to use compressed data to save storage space and network overhead.

## Filesystems

Spark supports a large number of filesystems for reading and writing to, which we can use with any of the file formats we want.

### Local/Regular FS

While Spark supports loading files from the local filesystem, it requires that the files are available at the same path on all nodes in your cluster.

### Amazon S3

### HDFS

## Structured data with Spark SQL

Spark SQL is a component for working with structured and semistructured data. By structured data, we mean data that has a schema, that is, a consistent set of fields across data records.

### Apache Hive 

Hive can store tables in a variety of formats, from plain text to column-oriented formats, inside HDFS or other storage systems. Spark SQL can load any table supported by Hive.

### JSON

To load JSON data, first create a HiveContext as when using Hive. Then use the HiveContext.jsonFile method to get an RDD of Row objects for the whole file. Apart from using the whole Row object, you can also register this RDD as a table and select specific fields from it.

## Databases

### Java Database Connectivity

Spark can load data from any relational database that supports Java Database Connectivity (JDBC), including MySQL, Postgres, and other systems.

### Cassandra

The Spark Cassandra connector is currently only available in Java and Scala.

### HBase

### Elasticsearch