In [2]:
from __future__ import print_function
%matplotlib inline
import matplotlib.pylab as plt
import sys, os, glob
import numpy as np

plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')

# Gutenberg N-Grams

In this notebook, we will quantitatively explore the text of the [Gutenberg E-Books Project](https://www.gutenberg.org/), a free repository of e-books that are in the public domain. All of the English and German books have been downloaded for this tutorial, and a small Python package is provided that lets you easily parse the text and the associated metadata.

In this part "zero" notebook, we just ingest the data, process the text, and save the raw text RDD (with punctuation and html tags removed). 

In [3]:
import findspark
findspark.init()

In [4]:
import pyspark
from pyspark import SparkConf, SparkContext

In [5]:
# put the number of executors and cores into variables so we can refer to them later
num_execs = 20
exec_cores = 4

In [6]:
# initializing the SparkConf
conf = SparkConf()

In [7]:
conf.set('spark.executor.memory', '9g')
conf.set('spark.executor.instances', str(num_execs))
conf.set('spark.executor.cores', str(exec_cores))

conf.set('spark.storage.memoryFraction', 0.3)
conf.set('spark.shuffle.memoryFraction', 0.5)

conf.set('spark.yarn.am.memory', '8g')
conf.set('spark.yarn.am.cores', 2)

conf.set('spark.executorEnv.PYTHONPATH', 
         '/cluster/apps/spark/spark-current/python:/cluster/apps/spark/spark-current/python/lib/py4j-0.8.2.1-src.zip')

conf.set('spark.executorEnv.PATH', os.environ['PATH'])

<pyspark.conf.SparkConf at 0x2b005fb1ff10>

### Starting the `SparkContext`
This is our entry point to the Spark runtime - it is used to push data into spark or load RDDs from disk etc. 

In [13]:
sc = SparkContext(master = 'yarn-client', conf = conf)

If this works successfully, you can check the [YARN application scheduler](http://hadoop.ethz.ch:8088/cluster) and you should see your app listed there. Clicking on the "Application Master" link will bring up the familiar Spark Web UI. 

## Make a key-value RDD of book metadata and text

Getting data into spark from a collection of local files is a very common task. A useful pattern to keep in mind is the following: 

1. make a list of filenames and distribute it among the workers
2. "map" each filename to the data you want to get out
3. now you are left with the RDD of raw data distributed among the workers!
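
The pattern above can be tried without a cluster. The following is a minimal local sketch in which the built-in `map` stands in for `RDD.map`; the temporary directory, the file names, and the `read_file` helper are made up for illustration and are not part of the tutorial's dataset or of the `gutenberg_cleanup` module.

```python
import glob
import os
import tempfile

def read_file(path):
    """Hypothetical loader: map a filename to (filename, contents)."""
    with open(path) as f:
        return (os.path.basename(path), f.read())

# create a few small stand-in files (assumption: not the real Gutenberg data)
tmpdir = tempfile.mkdtemp()
for name, body in [('a.html', 'alpha'), ('b.html', 'beta'), ('c.html', 'gamma')]:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write(body)

# 1. make a list of filenames
paths = sorted(glob.glob(os.path.join(tmpdir, '*html')))

# 2. "map" each filename to the data we want
#    (with Spark this would be sc.parallelize(paths).map(read_file))
raw_data = list(map(read_file, paths))

# 3. raw_data now holds one (filename, contents) pair per file
print(raw_data)
```

With Spark, only step 2 changes: the list is distributed first, and the same function is then applied on the workers.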

In the case of the Gutenberg Project e-book data, we have a directory of `html` files which hold the actual book text, and another directory of associated metadata files (the RDF files). To make your life easier for the purpose of this tutorial, we have made a small Python module called `gutenberg_cleanup` that has some handy functions for pulling the relevant text and metadata out of the raw dataset.

The [`gutenberg_cleanup`](gutenberg_cleanup.py) module contains three functions that can help with this: `get_gid`, `get_metadata` and `get_text`.

They pretty much do the obvious: 

`get_gid` takes an html path and pulls out the book ID (`gid`)

`get_metadata` takes a `gid` and returns a metadata object with various useful fields that will be used to create a unique key for each book

`get_text` takes a path to an html file and returns the raw text extracted from HTML, cleaned of tags and punctuation and converted to lower case. 

### Initializing the raw dataset using `sc.parallelize`

In [14]:
import glob

# get a list of all html files in the data directory
flist = glob.glob('/cluster/work/sdid/roskarr/gutenberg/html/*html')
print('number of books: ', len(flist))

number of books:  42085


When you use `sc.parallelize` to distribute a dataset across the cluster, you can choose the number of partitions across which to distribute the dataset. The higher the number of partitions, the higher the "parallelism". When Spark subsequently executes maps and reduces on this dataset, it does so by dispatching tasks to different executors, which then use the cores under their control to do the actual work. By increasing the number of partitions, you increase the number of tasks. More tasks give the Spark scheduler more flexibility in distributing the work across the cluster and therefore in making full use of the compute resources at its disposal. In some cases, a single partition may require so much memory that it causes `Out of memory` errors; reducing the amount of data per task by increasing the parallelism can then help.

Note that as long as the tasks take a few hundred milliseconds the scheduler should have no trouble dispatching them. On the other hand, there is a bit of overhead associated with partitioning the data so you don't want an unreasonably high number of partitions. You can see the [Spark guide](http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism) for a bit more detail. 

Below, we will choose to use 5 times as many partitions as we have cores in the job. 

In [15]:
files_rdd = sc.parallelize(flist, num_execs*exec_cores*5)
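
To make the choice of partition count concrete, here is a small arithmetic sketch using the executor settings and the file count reported in the cells above. The variable names `n_files`, `total_cores`, `n_partitions` and `files_per_partition` are introduced here for illustration only, and the per-partition figure is an average rather than a statement about how Spark actually slices the list.

```python
# values taken from the cells above
num_execs = 20
exec_cores = 4
n_files = 42085   # number of html files reported by glob

total_cores = num_execs * exec_cores      # cores available to the job
n_partitions = total_cores * 5            # partitions requested from sc.parallelize
files_per_partition = n_files / float(n_partitions)

print('cores: %d, partitions: %d, files per partition: ~%.0f'
      % (total_cores, n_partitions, files_per_partition))
```

So each task handles on the order of a hundred files, which keeps individual tasks small while still giving every core several tasks to work through.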

In [16]:
files_rdd.take(5)

['/cluster/work/sdid/roskarr/gutenberg/html/1000.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10000.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10001.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10002.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10003.html']

### Transforming the list of filenames into a `key,value` pair RDD of metadata and text

The raw Gutenberg Project dataset consists of HTML files and files that hold metadata in RDF (XML) format. For example: 

In [17]:
!head /cluster/work/sdid/roskarr/gutenberg/html/10000.html

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title> </title><meta http-equiv="Content-Style-Type" content="text/css"/><meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/><link rel="schema.DCTERMS" href="http://purl.org/dc/terms/"/>
<link rel="schema.MARCREL" href="http://id.loc.gov/vocabulary/relators/"/>
<meta content="The Magna Carta" name="DCTERMS.title"/>
<meta content="http://www.gutenberg.orgfiles/10000/10000.txt" name="DCTERMS.source"/>
<meta content="en" scheme="DCTERMS.RFC4646" name="DCTERMS.language"/>
<meta content="2015-04-04T04:40:30.599547+00:00" scheme="DCTERMS.W3CDTF" name="DCTERMS.modified"/>
<meta content="Public domain in the USA." name="DCTERMS.rights"/>


In [18]:
!head /cluster/work/sdid/roskarr/gutenberg/rdf-files/10000/pg10000.rdf

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:cc="http://web.resource.org/cc/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcam="http://purl.org/dc/dcam/"
>
  <pgterms:ebook rdf:about="ebooks/10000">


#### Data Ingestion procedure

Our first task is to ingest this dataset by doing the following: 

1. convert the html into raw text
2. deal with special characters, HTML tags, etc.
3. match each metadata entry with its corresponding raw text and compose tuples of the type (metadata, text)


Steps like these are often the first stage of any analysis, and can be quite time consuming to get right. For the purposes of this exercise, we have already built the functions needed to perform these operations. They are found in [`gutenberg_cleanup.py`](gutenberg_cleanup.py) if you want to have a look. 

The important functions are:

* `get_gid` -- returns the Gutenberg ID given an html file
* `get_text` -- get cleaned, raw text out of an html file
* `get_metadata` -- return the metadata given an ID 

These will be used to construct a `key,value` pair RDD. The `key` will be the dictionary returned by `get_metadata`, while the `value` will be the raw text returned by `get_text`. 

In [19]:
import gutenberg_cleanup
from gutenberg_cleanup import get_metadata, get_text, get_gid

To pass the `gutenberg_cleanup` source file to the executors, we will use the `addPyFile` method of the `SparkContext`:

In [20]:
sc.addPyFile('{cwd}/gutenberg_cleanup.py'.format(cwd=os.getcwd()))

Use the `map` method of `files_rdd` to map the filenames to `(ID, text)` tuples using the `get_gid` and `get_text` functions:

In [21]:
# TODO
id_text_rdd = files_rdd.map(lambda filename: (get_gid(filename), get_text(filename)))

Now we have `(ID, text)`, and we need to make another `map` to get `(metadata, text)` tuples:

In [22]:
# TODO
text_rdd = (id_text_rdd.map(lambda (ID, text): (get_metadata(ID), text)))
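
The `lambda (ID, text): ...` form relies on tuple parameter unpacking, which exists only in Python 2 (it was removed in Python 3 by PEP 3113). Below is a minimal sketch of an equivalent that runs under both versions, using the built-in `map` on a plain list rather than an RDD. `fake_metadata` is a hypothetical stand-in for `get_metadata`, and the sample pairs are invented.

```python
def fake_metadata(gid):
    """Hypothetical stand-in for get_metadata: returns a small dict."""
    return {'gid': gid, 'title': 'book %d' % gid}

# invented (ID, text) pairs, shaped like the elements of id_text_rdd
id_text = [(1, 'first text'), (2, 'second text')]

# index into the pair instead of unpacking it in the lambda signature
meta_text = list(map(lambda pair: (fake_metadata(pair[0]), pair[1]), id_text))

print(meta_text[0])
```

The same indexing style can be passed to `RDD.map` unchanged if the notebook is ever run with a Python 3 kernel.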

So that we don't have to constantly re-load the data off disk, let's cache this RDD: 

In [None]:
%%time
text_rdd.cache()
text_rdd.count()

Since we called `count()`, the entire RDD was generated/calculated. This combination of `cache` and `count` is a common way to check how much memory your dataset needs: once `count` completes, you can check the memory taken up by the RDD by going to the "Storage" tab of the Spark UI. 

Because the data is cached, the next time you use `text_rdd` it will be much quicker. For example, 

In [22]:
%%time
text_rdd.count()

CPU times: user 37 ms, sys: 8 ms, total: 45 ms
Wall time: 25.8 s


21469

In [23]:
#assert(_ == 15081)

As an aside, we could call the native python `map` in exactly the same way (and run it on the local machine only), though this would take much longer to complete, i.e. 

    text = map(lambda f: (get_metadata(get_gid(f)), get_text(f)), flist)

## Save the raw dataset to HDFS (or local storage)

As a final bit of preparation before continuing with analysis, we save the raw data in a way that makes it faster to access later. We don't want to have to read the data off local disk every time we need to repeat some part of the analysis. Instead, it's much more advantageous to use the Hadoop Distributed File System (HDFS) to store the data once we've read it in and put it in a `key,value` format. 

By storing the data in HDFS, we make sure that the system can take advantage of data-locality at a later stage in our analysis. 

In [24]:
# TODO
text_rdd.saveAsPickleFile('hdfs:///user/roskarr/gutenberg/raw_text_rdd')

Now, whenever we need it, we can read the data off the HDFS instead: 

In [25]:
# TODO
text_rdd = sc.pickleFile('hdfs:///user/roskarr/gutenberg/raw_text_rdd')

In [26]:
%time text_rdd.cache().count()

CPU times: user 44 ms, sys: 9 ms, total: 53 ms
Wall time: 6 s


21469

In [32]:
meta_dict[101]

{'birth_year': u'1954',
 'death_year': None,
 'downloads': u'352',
 'first_name': u'Bruce',
 'gid': 101,
 'lang': u'en',
 'last_name': u'Sterling',
 'title': u'The Hacker Crackdown: Law and Disorder on the Electronic Frontier'}

To get, for example, the author birth year for book with `gid = 101`:

In [33]:
meta_dict[101]['birth_year']

u'1954'

Now we need to create the broadcast variable: 

In [34]:
# call it meta_b for 'broadcast'
meta_b = sc.broadcast(meta_dict)

The underlying data object stored in `meta_b` can be accessed simply by

    > meta_b.value
    
We'll make use of this soon. If you check the console output, you will see an INFO message that the broadcast has been created, i.e. 

```
15/06/24 17:18:44 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 910.7 KB, free 4.1 GB)
15/06/24 17:18:44 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.201.20.22:47821 (size: 910.7 KB, free: 4.1 GB)
15/06/24 17:18:44 INFO spark.SparkContext: Created broadcast 6 from broadcast at PythonRDD.scala:403
```

### Save the metadata dictionary for later use
We will need the metadata dictionary at a later point, so we save it to disk now to avoid having to regenerate it later. 

In [35]:
from cPickle import dump

In [36]:
# open in binary mode and let the with-block close the file once the dump is done
with open('{home}/gutenberg_metadata.dump'.format(home=os.environ['HOME']), 'wb') as f:
    dump(meta_dict, f)
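
Reading the dictionary back later follows the same pattern in reverse. The following is a minimal round-trip sketch, not the notebook's own code: it uses an invented one-entry dictionary and a temporary file (both assumptions) instead of the real `meta_dict` and the file in `$HOME`, and it falls back to `pickle` where `cPickle` is not available.

```python
import os
import tempfile

try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3

# invented sample, shaped like one entry of meta_dict
sample_meta = {101: {'gid': 101, 'last_name': 'Sterling', 'birth_year': '1954'}}

path = os.path.join(tempfile.mkdtemp(), 'metadata_example.dump')

# write in binary mode; the with-block closes the file when done
with open(path, 'wb') as f:
    pickle.dump(sample_meta, f)

# read it back the same way
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored[101]['birth_year'])
```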

## Cleaning the data with filtering

Now we're ready to do some quality checks on the data. Let's check out the first couple of metadata entries: 

In [37]:
text_rdd.keys().take(5)

[{'birth_year': None,
  'death_year': None,
  'downloads': u'243',
  'first_name': None,
  'gid': 1000,
  'lang': u'en',
  'last_name': None,
  'title': '- No Title -'},
 {'birth_year': None,
  'death_year': None,
  'downloads': u'269',
  'first_name': None,
  'gid': 10000,
  'lang': u'en',
  'last_name': u'Anonymous',
  'title': u'The Magna Carta'},
 {'birth_year': u'1863',
  'death_year': u'1950',
  'downloads': u'274',
  'first_name': u'Lucius Annaeus',
  'gid': 10001,
  'lang': u'en',
  'last_name': u'Seneca',
  'title': u'Apocolocyntosis'},
 {'birth_year': u'1877',
  'death_year': u'1918',
  'downloads': u'865',
  'first_name': u'William Hope',
  'gid': 10002,
  'lang': u'en',
  'last_name': u'Hodgson',
  'title': u'The House on the Borderland'},
 {'birth_year': u'1833',
  'death_year': u'1923',
  'downloads': u'15',
  'first_name': u'Mary King',
  'gid': 10003,
  'lang': u'en',
  'last_name': u'Waddington',
  'title': u'My First Years as a Frenchwoman, 1876-1879'}]

If you look at just the first few entries it becomes clear that we're going to have to do some quality control here. For example, we probably don't want books with "None" as either of the author names, and likewise we have to have the birth date in order to be able to create a time series out of the data in the end. 

Construct an RDD, as above, except that you filter out all the elements that have `None` for `title`, `first_name`, `last_name`, or `birth_year`. In addition, filter out the data with "BC" in either birth or death year. 

As a reminder, here is a cartoon illustration of the difference between `map` and `filter` RDD methods. `map` simply applies the function to each element, returning another element. 

![map](../figs/map_example.svg)

In the example below, we use `filter` to remove all the even elements of the RDD. The function that is passed to `filter` just has to evaluate to either `True` (1) or `False` (0) given the input data. The function `lambda (k,v): v%2` evaluates to 0 if `v` is even and 1 if `v` is odd. Hence, only the odd values pass the filter. 

![filter](../figs/filter_example.svg)
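
The two cartoons can be reproduced on a plain Python list. This is a local sketch only: the built-in `map` and `filter` stand in for the RDD methods of the same name, and the `(key, value)` pairs are invented.

```python
# invented (key, value) pairs
pairs = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]

# map: apply a function to every element, one output per input
doubled = list(map(lambda kv: (kv[0], kv[1] * 2), pairs))

# filter: keep only the elements for which the function is truthy;
# v % 2 is 1 (truthy) for odd values and 0 (falsy) for even ones
odd_only = list(filter(lambda kv: kv[1] % 2, pairs))

print(doubled)
print(odd_only)
```

`map` always returns as many elements as it was given, while `filter` returns at most as many.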

The `filter_func` has already been defined for you below, but you need to apply it to `text_rdd`. 

In [38]:
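def filter_func(meta):
    """Return True if a book's metadata is complete and contains no 'BC' years."""
    required = ['title', 'first_name', 'last_name', 'birth_year']
    if any(meta[name] is None for name in required):
        return False
    no_birth_bc = 'BC' not in meta['birth_year']
    # a missing death year is acceptable; only an explicit 'BC' year is rejected
    no_death_bc = meta['death_year'] is None or 'BC' not in meta['death_year']
    # both conditions must hold, so that 'BC' in either year rejects the book
    return no_birth_bc and no_death_bc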

In [39]:
# TODO
filtered_rdd = text_rdd.filter(lambda (meta, text): filter_func(meta))
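
It can help to try a metadata predicate on a few hand-made records before running it on the full RDD. The sketch below uses a hypothetical `keep_book` function that implements the rule as stated in the text (all four fields present, no "BC" in either year); the `record` helper and the sample records are invented for illustration.

```python
REQUIRED = ['title', 'first_name', 'last_name', 'birth_year']

def keep_book(meta):
    """Hypothetical predicate following the rule stated in the text."""
    if any(meta[name] is None for name in REQUIRED):
        return False
    years = [meta['birth_year'], meta['death_year']]
    # a missing death year is allowed; a 'BC' year is not
    return all('BC' not in y for y in years if y is not None)

def record(birth, death, first='A', last='B', title='T'):
    """Build an invented metadata dict with the fields used by the filter."""
    return {'title': title, 'first_name': first, 'last_name': last,
            'birth_year': birth, 'death_year': death}

samples = [
    record('1863', '1950'),              # complete record: kept
    record('1954', None),                # no death year: kept
    record(None, None),                  # no birth year: dropped
    record('4 BC', '65'),                # BC birth year: dropped
    record('1877', '1918', first=None),  # missing first name: dropped
]

kept = [keep_book(m) for m in samples]
print(kept)
```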

In [40]:
filtered_rdd.keys().take(5)

[{'birth_year': u'1863',
  'death_year': u'1950',
  'downloads': u'274',
  'first_name': u'Lucius Annaeus',
  'gid': 10001,
  'lang': u'en',
  'last_name': u'Seneca',
  'title': u'Apocolocyntosis'},
 {'birth_year': u'1877',
  'death_year': u'1918',
  'downloads': u'865',
  'first_name': u'William Hope',
  'gid': 10002,
  'lang': u'en',
  'last_name': u'Hodgson',
  'title': u'The House on the Borderland'},
 {'birth_year': u'1833',
  'death_year': u'1923',
  'downloads': u'15',
  'first_name': u'Mary King',
  'gid': 10003,
  'lang': u'en',
  'last_name': u'Waddington',
  'title': u'My First Years as a Frenchwoman, 1876-1879'},
 {'birth_year': u'1864',
  'death_year': u'1948',
  'downloads': u'9',
  'first_name': u'Anna Robertson Brown',
  'gid': 10004,
  'lang': u'en',
  'last_name': u'Lindsay',
  'title': u'The Warriors'},
 {'birth_year': u'1775',
  'death_year': u'1861',
  'downloads': u'17',
  'first_name': u'George',
  'gid': 10005,
  'lang': u'en',
  'last_name': u'Tucker',
  'title

How many do we have left? 

In [41]:
nfiltered = filtered_rdd.count()
print('number of books after filtering: ', nfiltered)
#assert(nfiltered == 11872)

number of books after filtering:  16227


A final bit of cleanup: some of the books end up split across multiple entries. Since it's the same book, each of the entries should have the same `gid`. 

To check for this we will use one of the most basic and common MapReduce patterns -- the key count: 

* map the data into `key`,`value` pairs where `key` is the quantity we want to count and `value` is just 1. In this case, the `key` will be `gid`
* invoke a reduction *by key*, where the reduction operator is a simple addition

Finally, we will sort the result in descending order and print out the first few elements to check whether we have to worry about documents spanning multiple files or not. 

The RDD operations that are needed are [`reduceByKey`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey) and [`sortBy`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy).

`reduceByKey` works by grouping all data of a key together and applying the reduction function just to that data. Here's a simple illustration, in this case using a simple addition of two elements as a reduction:

![reducebykey](../figs/reduceByKey_example.svg)



For the `keyfunc` argument of the call to `sortBy`, use a `lambda` function that extracts the counts obtained from the `reduceByKey`. 

So, the procedure should be : 

1. `map` the `filtered_rdd` using a lambda function to contain (`gid`, 1) tuples
2. `reduceByKey`
3. `sortBy` (specify decreasing order, see the API) 
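
The three steps can be emulated on a plain Python list to see what each one produces. This is a local sketch with invented records, using a dictionary in place of Spark's `reduceByKey` and `sorted` in place of `sortBy`; it illustrates the pattern, not how Spark executes it.

```python
from collections import defaultdict
from operator import add

# invented records: gid 7 is split across three entries
records = [({'gid': 7}, 'part one '), ({'gid': 3}, 'whole book'),
           ({'gid': 7}, 'part two '), ({'gid': 7}, 'part three')]

# 1. map each record to a (gid, 1) pair
pairs = [(meta['gid'], 1) for (meta, text) in records]

# 2. reduce by key: combine the values of each key with `add`
counts = defaultdict(int)
for gid, one in pairs:
    counts[gid] = add(counts[gid], one)

# 3. sort by count, largest first
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top)   # [(7, 3), (3, 1)]
```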

In [42]:
from operator import add

In [43]:
# FILL IN: map the filtered_rdd to contain just the tuple (gid, 1)
map_filtered = filtered_rdd.map(lambda (meta, text): (meta['gid'],1))

# reduce the map_filtered rdd by key to get the total counts per gid
reduced_gid_rdd = map_filtered.reduceByKey(add)

# sort by count and print out the top 10
reduced_gid_rdd.sortBy(lambda (key, count): count, False).take(10)

[(6478, 43),
 (3772, 40),
 (8700, 35),
 (3332, 33),
 (12233, 29),
 (3425, 23),
 (2440, 16),
 (6475, 15),
 (12145, 9),
 (12383, 7)]

In [44]:
#assert(_ == [(6478, 43), (3772, 40), (8700, 35), (3332, 33), (12233, 29), (3425, 23), (2440, 16), (6475, 15), (12145, 9), (12383, 7)])

Note that several transformations lead to the final result here. A common syntax is to group them all together by enclosing them in `( )` and chaining them: 

In [45]:
# TODO
(filtered_rdd.map(lambda (meta, text): (meta['gid'], 1))
             .reduceByKey(add)
             .sortBy(lambda (key,count): count, False)
             .take(10))

[(6478, 43),
 (3772, 40),
 (8700, 35),
 (3332, 33),
 (12233, 29),
 (3425, 23),
 (2440, 16),
 (6475, 15),
 (12145, 9),
 (12383, 7)]

Looks like we have a few that are made up of multiple sections. To combine them, we will use `reduceByKey`, which results in an RDD with `gid`s as keys and the combined text of each `gid` as values. The reduction function in `reduceByKey` can be a simple in-line function that just adds two elements together. (The `add` function from `operator` would work as well, since it is equivalent to `a + b` and therefore also concatenates strings.) 

In [46]:
cleaned_rdd = (filtered_rdd.map(lambda (meta, text): (meta['gid'], text))
                           .reduceByKey(lambda a,b: a+b))
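
What this reduction does to the text can be seen on a small local example. The sketch below groups invented `(gid, text)` pairs in a dictionary and folds each group with the same two-argument function; it is an emulation, not Spark's implementation. One caveat worth keeping in mind: Spark expects the function passed to `reduceByKey` to be commutative and associative, and string concatenation is not commutative, so on a real cluster the sections are not guaranteed to be joined in their original order.

```python
from functools import reduce

# invented (gid, text) pairs; gid 7 appears three times
pairs = [(7, 'part one '), (3, 'whole book'), (7, 'part two '), (7, 'part three')]

# group the texts by gid, then fold each group with the same
# two-argument function that is passed to reduceByKey
groups = {}
for gid, text in pairs:
    groups.setdefault(gid, []).append(text)

merged = {gid: reduce(lambda a, b: a + b, texts) for gid, texts in groups.items()}

print(merged[7])
```

For word and n-gram counts the joining order matters little, apart from n-grams that span the boundary between two sections.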

As a simple sanity check, let's look at `gid`=6478, which according to the cell above has 43 sections in the original dataset: 

In [47]:
len(filtered_rdd.map(lambda (meta, text): (meta['gid'],1))
                .lookup(6478))

43

In [48]:
len(cleaned_rdd.lookup(6478))

1

To avoid having to do all these pre-processing steps again at a later point, let's also save the `cleaned_rdd`:

In [49]:
cleaned_rdd.saveAsPickleFile('/user/roskarr/gutenberg/cleaned_rdd')

This is now saved in the directory we specified, one file per partition:

In [50]:
!hadoop fs -ls /user/roskarr/gutenberg/cleaned_rdd | head

Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1
15/09/04 09:32:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 401 items
-rw-r--r--   3 roskarr supergroup          0 2015-09-04 09:32 /user/roskarr/gutenberg/cleaned_rdd/_SUCCESS
-rw-r--r--   3 roskarr supergroup   63571800 2015-09-04 09:31 /user/roskarr/gutenberg/cleaned_rdd/part-00000
-rw-r--r--   3 roskarr supergroup   19921938 2015-09-04 09:31 /user/roskarr/gutenberg/cleaned_rdd/part-00001
-rw-r--r--   3 roskarr supergroup   13487504 2015-09-04 09:31 /user/roskarr/gutenberg/cleaned_rdd/part-00002
-rw-r--r--   3 roskarr supergroup   15221090 2015-09-04 09:31 /user/roskarr/gutenberg/cleaned_rdd/part-00003
-rw-r--r--   3 roskarr supergroup   17893803 2015-09-04 09:31 /user/roskarr/gutenberg/cleaned_rdd/part-00004
-rw-r--r--   3 roskarr supergroup   18726897 2015-09-04 09:31 /user/roskarr/gutenberg/cleaned_rdd/part-00005
-rw-r--r--   3 roskarr sup

Note that here we used the `hadoop` command in the local bash shell (the `!` at the beginning of the line means we are executing the command in the shell). This allows us to access the hadoop filesystem (HDFS), which is separate from the local file system we are used to. You'll notice, for example, that this directory doesn't exist in the local filesystem:

In [51]:
!ls /user/roskarr/gutenberg/

ls: cannot access /user/roskarr/gutenberg/: No such file or directory


You can also browse the filesystem via the [HDFS web UI](http://hadoop.ethz.ch:50070). The `hadoop fs` command has many of the same options as regular Linux/Unix shell commands you might be used to for manipulating files and directories. Try running

```bash
cluster $> module load hadoop
cluster $> hadoop fs -help
```

in a new shell to see all the options. 

### Recap of steps up until this point

We've done quite a lot already with our dataset in Spark, although it's only the beginning!

1. created an RDD of filenames (`files_rdd`)
2. transformed `files_rdd` into an RDD of `(metadata, text)` (`text_rdd`); we also saved this to HDFS
3. filtered out data with bad metadata, e.g. missing author names (`filtered_rdd`)
4. cleaned up the entries a bit more by merging ones with identical IDs; we called this `cleaned_rdd`

## Shutting down the `SparkContext`

Now that the pre-processing is done, we will shut down the `SparkContext` before continuing to the data analysis notebook. We have all of our results saved in HDFS, so to continue from where we left off will just require loading data from there. 

In [52]:
sc.stop()

Now that the pre-processing steps are complete, we can continue to the [analysis notebook](part2-ngram-viewer-SOLUTIONS.ipynb).