# What is Apache Spark?
* distributed framework
* in-memory data structures 
* data processing
* it improves on Hadoop workloads (most of the time)

Spark enables data scientists to tackle problems with larger data sizes than they could before with tools like R or Pandas

## First Steps with Apache Spark Interactive Programming

First of all, check that PySpark is running properly by evaluating sc in a cell, as in the first cell below.
If it is not loaded, you can follow these posts:
    * Windows (IPython): http://jmdvinodjmd.blogspot.com.es/2015/08/installing-ipython-notebook-with-apache.html 
    * Windows (Jupyter): http://www.ithinkcloud.com/tutorials/tutorial-on-how-to-install-apache-spark-on-windows/


In [1]:
sc

<pyspark.context.SparkContext at 0x3b737b8>

The first thing to note is that in Spark all computation is parallelized by means of distributed data structures that are spread across the cluster. These collections are called Resilient Distributed Datasets (RDDs). We will talk more about RDDs, as they are the main piece in Spark.

As we have successfully loaded the Spark Context, we are ready to do some interactive analysis. We can read a simple file:

In [2]:
lines = sc.textFile("../data/people.csv")
lines.count()

11

In [None]:
lines.first()

This is a very simple first example, where we create an RDD (the variable lines) and then apply some operations (count and first) to it in a parallel manner. Note that, as we are running all our examples on a single computer, the work is not distributed across a cluster.

In the next section we will cover the core Spark concepts that allow Spark users to do parallel computation.

## Core Spark Concepts

We will talk about **Spark applications**, which are in charge of loading data and applying some distributed computation over it. Every application has a **driver program** that launches parallel operations on the cluster. In the case of interactive programming, the driver program is the shell (or Notebook) itself.

The "access point" to Spark from the driver program is the Spark Context object. As we have previously seen, if the installation from the referenced posts was followed, the sc object is automatically loaded in the notebook.

Once we have a Spark Context we can use it to build RDDs. In the previous examples we used sc.textFile() to build an RDD that represents the lines of a text file. Then we ran different operations over the RDD lines.

To run these operations over RDDs, the driver program manages a number of worker processes called executors. For example, for the count operation, it is possible to run count on different ranges of the file and then add up the partial results.
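The idea of counting different ranges of a file separately can be sketched in plain Python, without Spark. This is only an illustrative model: the sample lines, the helper split_into_ranges and the number of ranges are assumptions made for the example, and in a real application Spark itself decides how a file is split.

```python
# Plain-Python model (not the Spark API) of counting a file in separate ranges.
lines = ["line %d" % i for i in range(11)]  # hypothetical file with 11 lines


def split_into_ranges(items, n_ranges):
    """Split a list into at most n_ranges consecutive chunks (hypothetical helper)."""
    size = -(-len(items) // n_ranges)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


ranges = split_into_ranges(lines, 3)

# each range could be counted by a different executor...
partial_counts = [len(chunk) for chunk in ranges]

# ...and the driver only has to add up the partial results
total = sum(partial_counts)
```

The driver never needs to see the lines themselves, only one small number per range.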

Spark's API allows passing functions to its operators to run them on the cluster. For example, we could extend our example by filtering the lines in the file that contain a word, such as individuum.


In [None]:
lines = sc.textFile("../data/people.csv")
filtered_lines = lines.filter(lambda line: "individuum" in line)
filtered_lines.first()

## RDD Basics

An RDD can be defined as a distributed collection of elements. All work done with Spark can be summarized as **creating**, **transforming** and **applying** operations over RDDs to compute a result. Under the hood, Spark automatically **distributes the data contained in RDDs** across your cluster and **parallelizes the operations** you perform on them.

RDD properties:
* it is an **immutable distributed** collection of objects
* it is split into multiple **partitions**
* it is computed on different nodes of the cluster
* it can contain any type of Python object (user defined ones included)

An RDD can be created in **two ways**:
1. loading an external dataset
2. distributing a collection of objects in the driver program

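So far we have only used the first way, loading an external dataset with sc.textFile(). The second way, distributing a collection that lives in the driver program, is done with sc.parallelize().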

In [None]:
# loading an external dataset
lines = sc.textFile("../data/people.csv")
print type(lines)
# applying a transformation to an existing RDD
filtered_lines = lines.filter(lambda line: "individuum" in line)
print type(filtered_lines)

It is important to note that once we have an RDD, we can run **two kinds of operations**:
* **transformations**: construct a new RDD from a previous one. For example, by filtering the lines RDD we create a new RDD that holds the lines that contain the string "individuum". Note that the returned result is an RDD.
* **actions**: *compute* a result based on an RDD, and return the result to the driver program or store it in an external storage system (e.g. HDFS). Note that the returned result is not an RDD but another kind of variable.

In [None]:
action_result = lines.first()
print type(action_result)
action_result

Transformations and actions are very different because of the way Spark computes RDDs. 

Transformations are defined in a **lazy** manner, that is, they are **only computed once they are used in an action**.

In [None]:
# filtered_lines is not computed until the next action is applied over it
# this makes sense when working with big data sets, as it is not necessary to
# transform the whole RDD to run an action over a subset
# Spark does not even need to read the complete file!
filtered_lines.first()
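Python generators behave in a similar lazy way, so they can serve as a small analogy for what happens above. The sketch below is not Spark code: the sample data and the bookkeeping list calls are assumptions added only to make the evaluation order visible.

```python
# Plain-Python analogy (not the Spark API): a generator is lazy like a transformation.
calls = []


def contains_word(line):
    calls.append(line)  # record every line the predicate really inspects
    return "individuum" in line


data = ["a,b", "x,individuum", "c,d", "individuum,z"]  # hypothetical lines

# "transformation": defining the filter runs nothing yet
lazy_filtered = (line for line in data if contains_word(line))
calls_before = len(calls)

# "action": asking for the first element pulls only as many lines as needed
first_match = next(lazy_filtered)
calls_after = len(calls)
```

Here only the first two lines are inspected before the first match is found; the rest of the data is never touched.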

The drawback is that Spark **recomputes** the RDD at **each action application**.

This means that the computing effort spent on an already computed RDD may be lost.

To mitigate this drawback, the user can decide to **persist** the RDD after computing it the first time. **Spark will store the RDD contents in memory** (partitioned across the machines in your cluster) and reuse them in future actions.

**Persisting RDDs on disk** instead of memory is also possible.

Let's see an example on the impact of persisting:

In [3]:
import time

lines = sc.textFile("../data/REFERENCE/*")
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
words = lines_nonempty.flatMap(lambda x: x.split())
words_persisted = lines_nonempty.flatMap(lambda x: x.split())

t1 = time.time()
words.count()
print "Word count 1:",time.time() - t1

t1 = time.time()
words.count()
print "Word count 2:",time.time() - t1

t1 = time.time()
words_persisted.persist()
words_persisted.count()
print "Word count persisted 1:",time.time() - t1

t1 = time.time()
words_persisted.count()
print "Word count persisted 2:", time.time() - t1


Word count 1: 17.3249998093
Word count 2: 16.6799998283
Word count persisted 1: 30.1779999733
Word count persisted 2: 14.0679998398
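The difference between recomputing and persisting can be modelled with a toy class in plain Python. LazyDataset is a hypothetical stand-in, not part of Spark; it only counts how many times the underlying computation really runs, under the assumption that a persisted dataset keeps its contents after the first action.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the elements
        self._persisted = False
        self._cache = None
        self.evaluations = 0      # how many times compute() really ran

    def persist(self):
        self._persisted = True    # only marks the dataset; nothing is computed yet
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache    # reuse the stored contents
        self.evaluations += 1
        result = list(self._compute())
        if self._persisted:
            self._cache = result  # keep the contents for future actions
        return result

    def count(self):
        return len(self._materialize())


def split_words():
    return "to be or not to be".split()


words = LazyDataset(split_words)
words.count()
words.count()                     # recomputed from scratch

words_persisted = LazyDataset(split_words).persist()
words_persisted.count()           # computed once and stored
words_persisted.count()           # served from the stored contents
```

In this model the first action on a persisted dataset still pays the full cost (plus the cost of storing the result), and the benefit only shows up from the second action on.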


## RDD Operations

We have already seen that RDDs have two basic operations: **transformations** and **actions**.

**Transformations** are operations that return a new RDD. *Examples:* filter, map.

Remember that transformed RDDs are **computed lazily**, only when you use them in an action.

Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is **not immediately performed**. 

Instead, Spark internally records **metadata** to indicate that this operation has been requested. 

**Loading data** into an RDD is lazily evaluated in the same way transformations are. So, when we call sc.textFile(), the data is **not loaded** until it is necessary.

As with transformations, the operation (in this case, reading the data) can occur multiple times. Keep in mind that transformations **DO HAVE** an impact on computation time.

Many transformations are **element-wise**; that is, they work on one element at a time; but this is not true for all transformations.


In [None]:
lines = sc.textFile("../data/REFERENCE/*")
lines_nonempty = lines.filter( lambda x: len(x) > 0 )
words = lines_nonempty.flatMap(lambda x: x.split())
words_persisted = lines_nonempty.flatMap(lambda x: x.split())
words.take(10)

* filter applies the lambda function to each line in the lines RDD; only lines that satisfy the condition (length greater than zero) end up in the lines_nonempty variable (**this RDD is not computed yet!**)
* flatMap applies the lambda function to each element of the RDD and then the result is flattened (i.e. a list of lists would be converted to a simple list)

**Actions** are operations that return an object to the driver program or write to external storage; they kick off a computation. *Examples:* first, count.

In [None]:
import time

t1 = time.time()
words.count()
print "Word count 1:",time.time() - t1

t1 = time.time()
words.count()
print "Word count 2:",time.time() - t1

t1 = time.time()
words_persisted.persist()
words_persisted.count()
print "Word count persisted 1:",time.time() - t1

t1 = time.time()
words_persisted.count()
print "Word count persisted 2:", time.time() - t1

Actions are the operations that return a **final value** to the driver program or write data to an external storage system. 

Actions **force the evaluation** of the transformations required for the **RDD** they were called on, since they need to actually produce output.

Returning to the previous example, the RDDs are not computed until we call count over words and words_persisted. Note that, although we persisted words_persisted, we cannot see the impact of persisting that RDD in memory until its second computation.

If we want to see a part of the RDD, we can use take, and to have the full RDD we can use collect.

In [None]:
lines = sc.textFile("../data/people.csv")
print "Three elements", lines.take(3)
print "The whole RDD", lines.collect()

## Passing functions to Spark

Most of Spark’s transformations, and some of its actions, depend on **passing in functions** that are used by Spark to **compute** data.

In Python, we have three options for passing functions into Spark:
 * For shorter functions, we can pass in lambda expressions.
 * We can pass in top-level functions.
 * We can pass in locally defined functions.

In [None]:
lines = sc.textFile("../data/people.csv")

first_cells = lines.map(lambda x: x.split(",")[0])
print first_cells.collect()

# the same transformation, passing a named top-level function instead of a lambda
def get_cell(x):
    return x.split(",")[0]
first_cells = lines.map(get_cell)
print first_cells.collect()
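The three options can be compared side by side in plain Python, using the built-in map() as a stand-in for RDD.map(). The sample lines and the helper make_cell_getter are assumptions made for this sketch; the locally defined function also shows one way to carry an extra argument (the column index) through a closure.

```python
# Plain-Python stand-in: the built-in map() plays the role of RDD.map().
lines = ["ana,23,madrid", "bob,31,paris"]  # hypothetical CSV lines

# 1. lambda expression
first_cells_lambda = list(map(lambda x: x.split(",")[0], lines))


# 2. top-level function
def get_cell(x):
    return x.split(",")[0]


first_cells_named = list(map(get_cell, lines))


# 3. locally defined function; the closure carries the extra argument `index`
def make_cell_getter(index):
    def get_cell_at(x):
        return x.split(",")[index]
    return get_cell_at


third_cells = list(map(make_cell_getter(2), lines))
```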

## Working with common Spark transformations

The two most common transformations you will likely be using are map() and filter(). 

The **map()** transformation takes in a function and applies it to each element in the RDD with the result of the function being the new value of each element in the resulting RDD. 

The **filter()** transformation takes in a function and returns an RDD that only has elements that pass the filter() function.

Sometimes the function passed to **map()** returns a list for each element, so the result is an RDD of nested lists. To flatten these nested lists we can use **flatMap()**. As with map(), the function we provide to **flatMap()** is called individually for each element in our input RDD, but instead of returning a single element it returns an iterator with our return values. Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators.
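The difference between the three transformations can be seen on a small list in plain Python. This is only an analogy using list comprehensions and itertools, with made-up sample lines; it is not Spark code.

```python
from itertools import chain

lines = ["hello world", "", "spark is fast"]  # hypothetical lines

# filter: keep only the elements that pass the predicate
nonempty = [line for line in lines if len(line) > 0]

# map: one output element per input element, here a list of words per line
mapped = [line.split() for line in nonempty]

# flatMap: the per-element results are flattened into a single collection
flat_mapped = list(chain.from_iterable(line.split() for line in nonempty))
```

With map we get one list of words per line; with flatMap we get the words themselves.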

### Set operations

* **distinct()** produces a new RDD with only distinct items. Note, however, that distinct() is expensive, as it requires shuffling all the data over the network to ensure that we receive only one copy of each element.

* **RDD.union(other)** gives back an RDD consisting of the data from both sources. Unlike the mathematical union, if there are duplicates in the input RDDs, the result of Spark’s union() will contain duplicates (which we can fix if desired with distinct()).

* **RDD.intersection(other)** returns only elements in both RDDs. intersection() also removes all duplicates (including duplicates from a single RDD) while running. While intersection() and union() are two similar concepts, the performance of intersection() is much worse since it requires a shuffle over the network to identify common elements.

* **RDD.subtract(other)** takes in another RDD and returns an RDD that has only values present in the first RDD and not the second RDD. Like intersection(), it performs a shuffle.

* **RDD.cartesian(other)** returns all possible pairs of (a,b) where a is in the source RDD and b is in the other RDD. The Cartesian product can be useful when we wish to consider the similarity between all possible pairs, such as computing every user’s expected interest in each offer. We can also take the Cartesian product of an RDD with itself, which can be useful for tasks like user similarity. Be warned, however, that the Cartesian product is very expensive for large RDDs.
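The behaviour of these operations with respect to duplicates can be mimicked on two small Python lists. This is a plain-Python model with made-up values, not Spark code; in particular, Spark does not guarantee the order of the elements in the results, so the results that go through a set are sorted here only to make them easy to read.

```python
from itertools import product

a = [1, 2, 2, 3]
b = [2, 3, 3, 4]

union = a + b                                 # duplicates are kept
distinct = sorted(set(a))                     # one copy of each element
intersection = sorted(set(a) & set(b))        # common elements, duplicates removed
subtract = [x for x in a if x not in set(b)]  # elements of a that are not in b
cartesian = list(product(a, b))               # every pair (x, y)
```

Note how quickly the Cartesian product grows: two lists of 4 elements already give 16 pairs.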

### Actions

* **reduce():** takes a function that operates on two elements of the type in your RDD and returns a new element of the same type.

* **aggregate():** takes an initial zero value of the type we want to return. We then supply a function to combine the elements from our RDD with the accumulator. Finally, we need to supply a second function to merge two accumulators, given that each node accumulates its own results locally. To know more:
    * http://stackoverflow.com/questions/28240706/explain-the-aggregate-functionality-in-spark
    * http://atlantageek.com/2015/05/30/python-aggregate-rdd/
    
* **collect():** returns the entire RDD’s contents. collect() is commonly used in unit tests where the entire contents of the RDD are expected to fit in memory, as that makes it easy to compare the value of our RDD with our expected result.

* **take(n):** returns n elements from the RDD and attempts to minimize the number of partitions it accesses, so it may represent a biased collection.

* **top(n):** returns the top n elements using the default ordering on the data, but we can supply our own key function to change which elements are extracted.
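The roles of the zero value and the two functions of aggregate() are easier to follow on a small example. The sketch below models an RDD as a list of partitions in plain Python and uses functools.reduce; the partition contents and the names seq_op and comb_op are assumptions for illustration, and this is not how Spark is implemented.

```python
from functools import reduce

# hypothetical RDD contents, already split into three partitions
partitions = [[100, 200], [300], [400, 500, 600]]

zero = (0, 0)  # accumulator: (running sum, running count)


def seq_op(acc, value):
    """Fold one element into the accumulator of its partition."""
    return (acc[0] + value, acc[1] + 1)


def comb_op(acc1, acc2):
    """Merge the accumulators produced by two partitions."""
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


# each partition is folded locally, starting from the zero value
per_partition = [reduce(seq_op, part, zero) for part in partitions]

# the local accumulators are then merged into the final result
total_sum, number = reduce(comb_op, per_partition, zero)
mean = float(total_sum) / number
```

The result type, a (sum, count) tuple, is different from the element type, which is what distinguishes aggregate() from reduce().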


### Exercises

**Exercise 1:** Download all books from books.csv using the map function.

**Exercise 2:** Identify transformations and actions. When is the returned data calculated?

**Exercise 3:** Imagine that you only want to download Dickens' books. How would you do that? What is the impact of not persisting dickens_books_content?

**Exercise 4:** Use flatMap() on the resulting RDD of the previous exercise. How is the result different?

**Exercise 5:** Find out which different book authors there are.

**Exercise 6:** Return the URLs of Poe's and Dickens' books (use the union function).

**Exercise 7:** Return the list of books without Dickens' and Poe's books.

**Exercise 8:** Count the number of books using the reduce function.

**Exercise 9:** Compute the mean price of the estates in the CSV file containing Sacramento's real estate transactions, using the aggregate function.

**Exercise 10:** Get the top 5 highest and lowest prices in Sacramento's real estate transactions.

**Answer 1:**

In [61]:
import urllib3

def download_file(csv_line):
    link = csv_line[0]
    http = urllib3.PoolManager()
    r = http.request('GET', link, preload_content=False)
    response = r.read()
    return response
    
books_info = sc.textFile("../data/books.csv").map(lambda x: x.split(","))
print books_info.take(10)

books_content = books_info.map(download_file)
print books_content.take(1)[0][:100]

[[u'http://www.textfiles.com/etext/REFERENCE/15-songs.txt', u'15-songs.txt', u'17619', u'A Civil War Songbook (January 1990)'], [u'http://www.textfiles.com/etext/REFERENCE/1776-va.rts', u'1776-va.rts', u'5907', u'The Virginia Declaration of Rights'], [u'http://www.textfiles.com/etext/REFERENCE/1mlkd11.txt', u'1mlkd11.txt', u'817486', u'"Project Gutenberg: Martin Luther King\'s ""I have a Dream"" Speech"'], [u'http://www.textfiles.com/etext/REFERENCE/1st_than.txt', u'1st_than.txt', u'2979', u'"The First Thanksgiving Proclomation', u' June 20', u' 1676"'], [u'http://www.textfiles.com/etext/REFERENCE/2sqrt10a.txt', u'2sqrt10a.txt', u'5262079', u'"Project Gutenberg: The Square Root of Two', u' to 5 Million digits"'], [u'http://www.textfiles.com/etext/REFERENCE/32pri10.txt', u'32pri10.txt', u'247391', u'Project Gutenberg: The 32nd Mersenne prime'], [u'http://www.textfiles.com/etext/REFERENCE/all11.txt', u'all11.txt', u'85580', u'Project Gutenberg: The Declaration of Independence of The Unit

**Answer 2:**
If we consider reading the text file as a transformation...
Transformations:
* books_info = sc.textFile("../data/books.csv").map(lambda x: x.split(","))
* books_content = books_info.map(download_file)

Actions:
* print books_info.take(10)
* print books_content.take(1)[0][:100]

Computation is carried out in actions. In this case we take advantage of it: to download data, we only need to apply the function to one element of the books_content RDD.

**Answer 3:**

In [60]:
import re

def is_dickens(csv_line):
    link = csv_line[0]
    t = re.match("http://www.textfiles.com/etext/AUTHORS/DICKENS/",link)
    return t != None

dickens_books_info = books_info.filter(is_dickens)
print dickens_books_info.take(4)

dickens_books_content = dickens_books_info.map(download_file)

# take into consideration that each time an action is performed over dickens_books_content, the files are downloaded again
# this has a big impact on computation time
print dickens_books_content.take(2)[1][:100]


[[u'http://www.textfiles.com/etext/AUTHORS/DICKENS/dickens-american-631.txt', u'dickens-american-631.txt', u'604047', u'"PROJECT GUTENBERG: American Notes for General Circulation', u' by Charles Dickens"'], [u'http://www.textfiles.com/etext/AUTHORS/DICKENS/dickens-battle-630.txt', u'dickens-battle-630.txt', u'181551', u'"PROJECT GUTENBERG: The Battle of Life', u' by Charles Dickens"'], [u'http://www.textfiles.com/etext/AUTHORS/DICKENS/dickens-childs-629.txt', u'dickens-childs-629.txt', u'934709', u'"PROJECT GUTENBERG: A Child\'s History of England', u' by Charles Dickens"'], [u'http://www.textfiles.com/etext/AUTHORS/DICKENS/dickens-chimes-379.txt', u'dickens-chimes-379.txt', u'170704', u'"The Chimes', u' by Charles Dickens"']]


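This cell needs download_file from Answer 1 to be defined first; otherwise the last statement fails with:

NameError: name 'download_file' is not defined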

**Answer 4:**

In [None]:
flat_content = dickens_books_info.flatMap(lambda x: x)
print flat_content.take(4)

**Answer 5:**

In [None]:
def get_author(csv_line):
    link = csv_line[0]
    t = re.match("http://www.textfiles.com/etext/AUTHORS/(\w+)/",link)
    if t:
        return t.group(1)
    return u'UNKNOWN'

authors = books_info.map(get_author)
authors.distinct().collect()

**Answer 6:**

In [7]:
import re

def get_author_and_link(csv_line):
    link = csv_line[0]
    t = re.match("http://www.textfiles.com/etext/AUTHORS/(\w+)/",link)
    if t:
        return (t.group(1), link)
    return (u'UNKNOWN',link)

authors_links = books_info.map(get_author_and_link)

# not very efficient
dickens_books = authors_links.filter(lambda x: x[0]=="DICKENS")
poes_books = authors_links.filter(lambda x: x[0]=="POE")

poes_dickens_books = poes_books.union(dickens_books)
poes_dickens_books.sample(True,0.05).collect()

[(u'POE', u'http://www.textfiles.com/etext/AUTHORS/POE/poe-al-425.txt'),
 (u'POE', u'http://www.textfiles.com/etext/AUTHORS/POE/poe-conqueror-676.txt'),
 (u'POE', u'http://www.textfiles.com/etext/AUTHORS/POE/poe-eldorado-436.txt'),
 (u'POE',
  u'http://www.textfiles.com/etext/AUTHORS/POE/poe-metzengerstein-557.txt'),
 (u'POE', u'http://www.textfiles.com/etext/AUTHORS/POE/poe-never-562.txt'),
 (u'POE', u'http://www.textfiles.com/etext/AUTHORS/POE/poe-premature-700.txt'),
 (u'POE', u'http://www.textfiles.com/etext/AUTHORS/POE/poe-x-726.txt')]

**Answer 7:**

In [None]:
authors_links.subtract(poes_dickens_books).map(lambda x: x[0]).distinct().collect()

**Answer 8:**

In [None]:
authors_links.map(lambda x: 1).reduce(lambda x,y: x+y) == authors_links.count()

**Answer 9**

In [None]:
sacramento_estate_csv = sc.textFile("../data/Sacramentorealestatetransactions.csv")
header = sacramento_estate_csv.first()

sacramento_estate = sacramento_estate_csv.filter(lambda x: x != header)\
        .map(lambda x: x.split(","))\
        .map(lambda x: int(x[9]))

seqOp = (lambda x,y: (x[0] + y, x[1] + 1))
combOp = (lambda x,y: (x[0] + y[0], x[1] + y[1]))

total_sum, number = sacramento_estate.aggregate((0,0),seqOp,combOp)
mean = float(total_sum)/number
mean

**Answer 10:**

In [None]:
print sacramento_estate.top(5)
print sacramento_estate.top(5, key=lambda x: -x)
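The trick of negating the key to obtain the lowest values can be checked locally with heapq.nlargest from the standard library, which also accepts a key function. This is a plain-Python analogy on made-up prices, not the Sacramento data.

```python
import heapq

prices = [300, 100, 500, 200, 400, 700, 600]  # hypothetical prices

# five largest values, using the default ordering
highest = heapq.nlargest(5, prices)

# negating the key reverses the ordering, so the "top" elements are the smallest
lowest = heapq.nlargest(5, prices, key=lambda x: -x)
```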

## Spark Key/Value Pairs

Spark provides special operations on RDDs containing key/value pairs. 

These RDDs are called pair RDDs, but they are simply RDDs with a special structure. In Python, for the functions on keyed data to work we need to return an RDD composed of tuples.

**Exercise 1:** Create a pair RDD from our books information data, having the author as key and the rest of the information as value. (Hint: the answer is very similar to Exercise 6 of the previous section)

**Exercise 2:** Check that pair RDDs are also RDDs and that common RDD operations work as well. Filter out the elements whose author equals "UNKNOWN" from the previous RDD.

**Exercise 3:** Check the mapValues function in the Spark API (http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.mapValues), which works on pair RDDs.

**Answer 1:**

In [2]:
import re

def get_author_data(csv_line):
    link = csv_line[0]
    t = re.match("http://www.textfiles.com/etext/AUTHORS/(\w+)/",link)
    if t:
        return (t.group(1), csv_line)
    return (u'UNKNOWN', csv_line)

books_info = sc.textFile("../data/books.csv").map(lambda x: x.split(","))
authors_info = books_info.map(get_author_data)

print authors_info.take(5)

[(u'UNKNOWN', [u'http://www.textfiles.com/etext/REFERENCE/15-songs.txt', u'15-songs.txt', u'17619', u'A Civil War Songbook (January 1990)']), (u'UNKNOWN', [u'http://www.textfiles.com/etext/REFERENCE/1776-va.rts', u'1776-va.rts', u'5907', u'The Virginia Declaration of Rights']), (u'UNKNOWN', [u'http://www.textfiles.com/etext/REFERENCE/1mlkd11.txt', u'1mlkd11.txt', u'817486', u'"Project Gutenberg: Martin Luther King\'s ""I have a Dream"" Speech"']), (u'UNKNOWN', [u'http://www.textfiles.com/etext/REFERENCE/1st_than.txt', u'1st_than.txt', u'2979', u'"The First Thanksgiving Proclomation', u' June 20', u' 1676"']), (u'UNKNOWN', [u'http://www.textfiles.com/etext/REFERENCE/2sqrt10a.txt', u'2sqrt10a.txt', u'5262079', u'"Project Gutenberg: The Square Root of Two', u' to 5 Million digits"'])]


**Answer 2:**

The operations over pair RDDs will also be slightly different.

But take into account that pair RDDs are just *special* RDDs to which some extra operations can be applied; common RDD operations also work on them.


In [10]:
authors_info.filter(lambda x: x[0] != "UNKNOWN").take(3)

[(u'WILDE',
  [u'http://www.textfiles.com/etext/AUTHORS/WILDE/wilde-ballad-611.txt',
   u'wilde-ballad-611.txt',
   u'27238',
   u'"The Ballad of Reading Gaol',
   u' by Oscar Wilde (1898)"']),
 (u'WILDE',
  [u'http://www.textfiles.com/etext/AUTHORS/WILDE/wilde-burden-612.txt',
   u'wilde-burden-612.txt',
   u'17887',
   u'"The Burden of Itys',
   u' by Oscar Wilde (1890)"']),
 (u'WILDE',
  [u'http://www.textfiles.com/etext/AUTHORS/WILDE/wilde-charmides-601.txt',
   u'wilde-charmides-601.txt',
   u'34648',
   u'"Charmides',
   u' by Oscar Wilde (1890)"'])]

**Answer 3:**

Sometimes it is awkward to work with pairs, so Spark provides a map function that operates only over the values.

In [12]:
authors_info.mapValues(lambda x: x[2]).take(5)

[(u'UNKNOWN', u'17619'),
 (u'UNKNOWN', u'5907'),
 (u'UNKNOWN', u'817486'),
 (u'UNKNOWN', u'2979'),
 (u'UNKNOWN', u'5262079')]

## Transformations on Pair RDDs

Since pair RDDs contain tuples, we need to pass functions that operate on tuples rather than on individual elements.

* reduceByKey(func): combines values with the same key.
* groupByKey(): groups values with the same key.
* combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner): combines values with the same key using a different result type.
* keys(): returns an RDD with just the keys.
* values(): returns an RDD with just the values.
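The three functions taken by combineByKey() can be hard to picture, so here is a plain-Python model in which an RDD is a list of partitions holding (key, value) tuples. The data and the helpers combine_by_key and reduce_by_key are hypothetical and written only for this sketch; they mimic the idea, not Spark's actual implementation.

```python
# hypothetical pair RDD, split into two partitions
partitions = [
    [("POE", 10), ("DICKENS", 30)],
    [("POE", 20), ("DICKENS", 50), ("KEATS", 5)],
]


def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey: combine locally, then merge across partitions."""
    partials = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                # key already seen in this partition: fold the value in
                combiners[key] = merge_value(combiners[key], value)
            else:
                # first time the key appears in this partition
                combiners[key] = create_combiner(value)
        partials.append(combiners)

    result = {}
    for combiners in partials:
        for key, combiner in combiners.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


def reduce_by_key(partitions, func):
    """reduceByKey as the special case where the combiner is the value itself."""
    return combine_by_key(partitions, lambda v: v, func, func)


totals = reduce_by_key(partitions, lambda x, y: x + y)

sum_count = combine_by_key(
    partitions,
    lambda value: (value, 1),                     # createCombiner
    lambda c, value: (c[0] + value, c[1] + 1),    # mergeValue
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]) # mergeCombiners
)
averages = dict((key, float(s) / n) for key, (s, n) in sum_count.items())
```

The combiner here is a (sum, count) tuple, a different type from the integer values, which is the case combineByKey() is meant for.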

**Exercise 1:** Get the total size of files for each author.

**Exercise 2:** Get the top 5 authors with the most data.

**Exercise 3:** Try combineByKey() with a randomly generated set of 5 values for 4 keys. Get the average value of the random variable for each key.

**Exercise 4:** Compute the average book size per author using combineByKey(). If you were an English Literature student and your teacher said: "Pick one author and I'll randomly pick a book for you to read", what would a Data Scientist's answer be?

**Exercise 5:** All Spark books have the word count example. Let's count words over all our books! (This might take some time)
 

**Answer 1:**

In [21]:
authors_data = authors_info.mapValues(lambda x: int(x[2]))
authors_data.reduceByKey(lambda y,x: y+x).collect()

[(u'BURROUGHS', 10070497),
 (u'DICKENS', 14236826),
 (u'STEVENSON', 6965452),
 (u'TWAIN', 13259786),
 (u'EMERSON', 2655619),
 (u'WILDE', 669926),
 (u'ARISTOTLE', 6219825),
 (u'DOYLE', 8450256),
 (u'KANT', 5901915),
 (u'UNKNOWN', 266421132),
 (u'HAWTHORNE', 1898878),
 (u'PLATO', 3648947),
 (u'IRVING', 2223565),
 (u'KEATS', 360556),
 (u'JEFFERSON', 6565921),
 (u'SHAKESPEARE', 5347823),
 (u'POE', 3395985),
 (u'MILTON', 911458)]

**Answer 2:**

In [24]:
authors_data.reduceByKey(lambda y,x: y+x).top(5,key=lambda x: x[1])

[(u'UNKNOWN', 266421132),
 (u'DICKENS', 14236826),
 (u'TWAIN', 13259786),
 (u'BURROUGHS', 10070497),
 (u'DOYLE', 8450256)]

**Answer 3:**

In [55]:
import numpy as np

# generate the data
rdd = sc.parallelize(zip(range(5)*5, np.random.normal(0,1,5*5)))

createCombiner = lambda value: (value,1)
# you can check what createCombiner does
# rdd.mapValues(createCombiner).collect()

# here x is the combiner (sum,count) and value is value in the 
# initial RDD (the random variable)
mergeValue = lambda x, value: (x[0] + value, x[1] + 1)

# here, all combiners are summed (sum,count)
mergeCombiner = lambda x, y: (x[0] + y[0], x[1] + y[1])

sumCount = rdd.combineByKey(createCombiner,
                        mergeValue,
                         mergeCombiner)

sumCount.mapValues(lambda x: x[0]/x[1]).collect()


[(0, 0.36496102496138977),
 (4, 0.11892759636084722),
 (1, -0.66545959341697147),
 (2, 0.30208685716557426),
 (3, -0.30334391176249054)]

**Answer 4:**

In [53]:
createCombiner = lambda value: (value,1)
# you can check what createCombiner does
# rdd.mapValues(createCombiner).collect()

# here x is the combiner (sum,count) and value is value in the 
# initial RDD (the random variable)
mergeValue = lambda x, value: (x[0] + value, x[1] + 1)

# here, all combiners are summed (sum,count)
mergeCombiner = lambda x, y: (x[0] + y[0], x[1] + y[1])

sumCount = authors_data.combineByKey(createCombiner,
                        mergeValue,
                         mergeCombiner)

sumCount.mapValues(lambda x: x[0]/x[1]).collect()
# I would choose the author with lowest average book size
sumCount.mapValues(lambda x: x[0]/x[1]).top(5,lambda x: -x[1])

[(u'KEATS', 10604),
 (u'POE', 24431),
 (u'MILTON', 30381),
 (u'WILDE', 39407),
 (u'IRVING', 55589)]

In [59]:
authors_data.values().collect()

[17619,
 5907,
 817486,
 2979,
 5262079,
 247391,
 85580,
 63397,
 2974,
 8446,
 27452,
 9163,
 5074,
 105559,
 1037461,
 260802,
 22360,
 1205791,
 1205850,
 10788,
 16519,
 27881,
 803301,
 15273,
 29849,
 12356,
 27879,
 83543,
 10516,
 1646,
 1577192,
 1588927,
 14651,
 1540646,
 8441343,
 1414431,
 524655,
 834319,
 11559,
 1589767,
 2634,
 41844,
 210073,
 1536408,
 1505344,
 1532901,
 18219,
 1925720,
 2473400,
 2638067,
 2872323,
 929797,
 46515,
 155150,
 972284,
 296995,
 84800,
 215746,
 206500,
 395591,
 811645,
 119149,
 304262,
 297713,
 123450,
 445194,
 141533,
 129524,
 173692,
 396640,
 227940,
 363949,
 613135,
 11748,
 296626,
 248020,
 378607,
 144042,
 851974,
 406935,
 865715,
 142229,
 16935,
 478567,
 10105,
 372787,
 1174684,
 1169724,
 260369,
 311275,
 123786,
 23289,
 146184,
 8031,
 231825,
 53484,
 162346,
 180278,
 840162,
 8033,
 9163,
 4432803,
 495700,
 17426,
 36958,
 1168206,
 282338,
 21341,
 1578,
 298001,
 1605768,
 113936,
 321521,
 437549,
 174