In [1]:
from __future__ import print_function
%matplotlib inline
import matplotlib.pylab as plt
import sys, os, glob
import numpy as np

plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')

# Gutenberg N-Grams

In this series of notebooks, we will quantitatively explore the text of the [Gutenberg E-Books Project](https://www.gutenberg.org/), a free repository of e-books that are in the public domain. All of the English and German books have been downloaded for this tutorial, and a small Python package is provided that lets you easily parse the text and the associated metadata.

In this "part zero" notebook, we only ingest the data, process the text, and save the raw text RDD (with punctuation and HTML tags removed).

In [2]:
import findspark
findspark.init()

In [3]:
import pyspark
from pyspark import SparkConf, SparkContext

In [4]:
# put the number of executors and cores into variables so we can refer to them later
num_execs = 20
exec_cores = 4

In [5]:
# initializing the SparkConf
conf = SparkConf()

In [6]:
conf.set('spark.executor.memory', '9g')
conf.set('spark.executor.instances', str(num_execs))
conf.set('spark.executor.cores', str(exec_cores))

conf.set('spark.storage.memoryFraction', 0.3)
conf.set('spark.shuffle.memoryFraction', 0.5)

conf.set('spark.yarn.am.memory', '8g')
conf.set('spark.yarn.am.cores', 2)

conf.set('spark.executorEnv.PYTHONPATH', 
         '/cluster/apps/spark/spark-current/python:/cluster/apps/spark/spark-current/python/lib/py4j-0.8.2.1-src.zip')

conf.set('spark.executorEnv.PATH', os.environ['PATH'])

<pyspark.conf.SparkConf at 0x2ae6b1c1a110>

### Starting the `SparkContext`
This is our entry point to the Spark runtime. It is used to push data into Spark, load RDDs from disk, and so on.

In [7]:
sc = SparkContext(master = 'yarn-client', conf = conf)

If this works successfully, you can check the [YARN application scheduler](http://hadoop.ethz.ch:8088/cluster) and you should see your app listed there. Clicking on the "Application Master" link will bring up the familiar Spark Web UI. 

## Make a key-value RDD of book metadata and text

Getting data into Spark from a collection of local files is a very common task. A useful pattern to keep in mind is the following:

1. make a list of filenames and distribute it among the workers
2. "map" each filename to the data you want to get out
3. now you are left with the RDD of raw data distributed among the workers!
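
The pattern above can be tried out without a cluster. The following is a minimal sketch that uses the built-in `map` on a throwaway directory of small files; `load_file` and the file names are hypothetical stand-ins, not part of the tutorial code. With Spark, the last step would instead be `sc.parallelize(paths).map(load_file)`, which assumes every worker can read the same paths.

```python
import glob
import os
import tempfile

def load_file(path):
    """Map one filename to a (filename, contents) pair."""
    with open(path) as f:
        return os.path.basename(path), f.read()

# build a throwaway directory with three small files to stand in for the dataset
data_dir = tempfile.mkdtemp()
for name, body in [('a.txt', 'alpha'), ('b.txt', 'beta'), ('c.txt', 'gamma')]:
    with open(os.path.join(data_dir, name), 'w') as f:
        f.write(body)

# step 1: make a list of filenames
paths = sorted(glob.glob(os.path.join(data_dir, '*.txt')))

# step 2: map each filename to the data we want out of it
# (on a cluster: sc.parallelize(paths).map(load_file))
records = list(map(load_file, paths))

# step 3: we are left with one record per file
print(records)
```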

In the case of the Gutenberg Project e-book data, we have a directory of `html` files which hold the actual book text, and another directory of associated metadata files (the RDF files). To make your life easier for the purpose of this tutorial, we have made a small Python module called `gutenberg_cleanup` that has some handy functions for pulling the relevant text and metadata out of the raw dataset.

The [`gutenberg_cleanup`](gutenberg_cleanup.py) module contains three functions that can help with this: `get_gid`, `get_metadata` and `get_text`.

They pretty much do the obvious: 

`get_gid` takes an html path and pulls out the book ID (`gid`)

`get_metadata` takes a `gid` and returns a metadata object with various useful fields that will be used to create a unique key for each book

`get_text` takes a path to an html file and returns the raw text extracted from HTML, cleaned of tags and punctuation and converted to lower case. 
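
To give a feeling for what such helpers might do, here is a rough sketch. The functions `gid_from_path` and `clean_html` are hypothetical stand-ins written for illustration; they are not the actual implementations in `gutenberg_cleanup`, which may differ in return types and in how carefully they parse the HTML. The sketch assumes that the book ID is simply the file name without its extension, as the paths in this dataset suggest.

```python
import os
import re

def gid_from_path(path):
    """Hypothetical stand-in for get_gid: '/some/dir/10000.html' -> '10000'."""
    return os.path.splitext(os.path.basename(path))[0]

def clean_html(html):
    """Hypothetical stand-in for the cleaning done by get_text."""
    no_tags = re.sub(r'<[^>]+>', ' ', html)      # drop anything that looks like a tag
    no_punct = re.sub(r'[^\w\s]', ' ', no_tags)  # replace punctuation with spaces
    return ' '.join(no_punct.lower().split())    # lower-case and collapse whitespace

example_gid = gid_from_path('/cluster/work/sdid/roskarr/gutenberg/html/10000.html')
example_text = clean_html('<p>The Magna Carta, <b>1215</b>!</p>')

print(example_gid)
print(example_text)
```

A regular expression is a crude way to strip tags; a real HTML parser is more robust against malformed markup, which is one reason to rely on the provided module rather than on this sketch.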

### Initializing the raw dataset using `sc.parallelize`

In [8]:
import glob

# get a list of all html files in the data directory
flist = glob.glob('/cluster/work/sdid/roskarr/gutenberg/html/*html')
print('number of books: ', len(flist))

number of books:  48177


When you use `sc.parallelize` to distribute a dataset across the cluster, you can choose the number of partitions across which to distribute the dataset. The higher the number of partitions, the higher the "parallelism". When Spark subsequently executes maps and reduces on this dataset, it does so by dispatching tasks to different executors, which then use the cores under their control to do the actual work. Increasing the number of partitions increases the number of tasks. More tasks give the Spark scheduler more flexibility in distributing the work across the cluster, so it can make better use of the compute resources at its disposal. If a single partition requires a lot of memory, it can cause `Out of memory` errors; in such cases, simply reducing the amount of data per task by increasing the parallelism can help.

Note that as long as the tasks take a few hundred milliseconds the scheduler should have no trouble dispatching them. On the other hand, there is a bit of overhead associated with partitioning the data so you don't want an unreasonably high number of partitions. You can see the [Spark guide](http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism) for a bit more detail. 
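
As a rough illustration of what partitioning means, the sketch below cuts a list into a chosen number of contiguous, nearly equal slices. The function `split_into_partitions` is a hypothetical helper written for this example; it is not Spark's own code, and Spark's actual slicing may differ in detail.

```python
def split_into_partitions(items, num_partitions):
    """Cut a list into num_partitions contiguous slices of nearly equal size."""
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

items = list(range(10))
parts = split_into_partitions(items, 4)
sizes = [len(p) for p in parts]

print(parts)
print(sizes)
```

Each slice would become one task. With more partitions the slices get smaller, so each task needs less memory and the scheduler has more pieces to balance across the cores.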

Below, we will choose to use 5 times as many partitions as we have cores in the job. 

In [9]:
files_rdd = sc.parallelize(flist, num_execs*exec_cores*5)
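
For orientation, the arithmetic behind this choice can be spelled out using the values set earlier in this notebook and the file count printed above. This is only a back-of-the-envelope sketch; the per-partition figure is an average and says nothing about how large the individual books are.

```python
num_execs = 20    # executors requested in the SparkConf
exec_cores = 4    # cores per executor
n_files = 48177   # number of html files found by glob

total_cores = num_execs * exec_cores
num_partitions = total_cores * 5
files_per_partition = n_files / float(num_partitions)

print('total cores: %d' % total_cores)
print('partitions: %d' % num_partitions)
print('average files per partition: %.1f' % files_per_partition)
```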

In [10]:
files_rdd.take(5)

['/cluster/work/sdid/roskarr/gutenberg/html/1000.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10000.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10001.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10002.html',
 '/cluster/work/sdid/roskarr/gutenberg/html/10003.html']

### Transforming the list of filenames into a `key,value` pair RDD of metadata and text

The raw Gutenberg Project dataset consists of HTML files and files that hold metadata in RDF (XML) format. For example:

In [11]:
!head /cluster/work/sdid/roskarr/gutenberg/html/10000.html

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title> </title><meta http-equiv="Content-Style-Type" content="text/css"/><meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/><link rel="schema.DCTERMS" href="http://purl.org/dc/terms/"/>
<link rel="schema.MARCREL" href="http://id.loc.gov/vocabulary/relators/"/>
<meta content="The Magna Carta" name="DCTERMS.title"/>
<meta content="http://www.gutenberg.orgfiles/10000/10000.txt" name="DCTERMS.source"/>
<meta content="en" scheme="DCTERMS.RFC4646" name="DCTERMS.language"/>
<meta content="2015-04-04T04:40:30.599547+00:00" scheme="DCTERMS.W3CDTF" name="DCTERMS.modified"/>
<meta content="Public domain in the USA." name="DCTERMS.rights"/>


In [12]:
!head /cluster/work/sdid/roskarr/gutenberg/rdf-files/10000/pg10000.rdf

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:cc="http://web.resource.org/cc/"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcam="http://purl.org/dc/dcam/"
>
  <pgterms:ebook rdf:about="ebooks/10000">


#### Data Ingestion procedure

Our first task is to ingest this dataset by doing the following: 

1. convert the html into raw text
2. deal with special characters, HTML tags, etc.
3. match each metadata entry with its corresponding raw text and compose tuples of the type (metadata, text)


Steps like these are often the first stage of any analysis, and can be quite time consuming to get right. For the purposes of this exercise, we have already built the functions needed to perform these operations. They are found in [`gutenberg_cleanup.py`](gutenberg_cleanup.py) if you want to have a look.

The important functions are:

* `get_gid` -- returns the Gutenberg ID given an html file
* `get_text` -- get cleaned, raw text out of an html file
* `get_metadata` -- return the metadata given an ID 

These will be used to construct a `key,value` pair RDD. The `key` will be the dictionary returned by `get_metadata`, while the `value` will be the raw text returned by `get_text`.

In [13]:
import gutenberg_cleanup
from gutenberg_cleanup import get_metadata, get_text, get_gid

To pass the `gutenberg_cleanup` source file to the executors, we will use the `addPyFile` method of the `SparkContext`:

In [14]:
sc.addPyFile('{cwd}/gutenberg_cleanup.py'.format(cwd=os.getcwd()))

Use the `map` method of `files_rdd` to map the filenames to `(ID, text)` tuples using the `get_gid` and `get_text` functions:

In [15]:
# TODO
id_text_rdd = files_rdd.map(lambda filename: (get_gid(filename), get_text(filename)))

Now we have (ID, text), and we need to make another `map` to get (`metadata, text`) tuples:

In [16]:
# TODO
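# index into the pair instead of unpacking it in the lambda signature,
# which is a syntax error in Python 3
text_rdd = id_text_rdd.map(lambda kv: (get_metadata(kv[0]), kv[1]))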

So that we don't have to constantly re-load the data off disk, let's cache this RDD:

In [17]:
%%time
text_rdd.cache()
text_rdd.count()

CPU times: user 111 ms, sys: 31 ms, total: 142 ms
Wall time: 4min 10s


Since we called `count()`, the entire RDD was computed. This combination of `cache` and `count` is a common way to check how much memory your dataset needs: once `count` completes, you can check the memory taken up by the RDD by going to the "Storage" tab of the Spark UI.

Because the data is cached, the next time you use `text_rdd` it will be much quicker. For example,

In [19]:
%%time
assert(text_rdd.count() == 48177)

CPU times: user 60 ms, sys: 4 ms, total: 64 ms
Wall time: 3.02 s


As an aside, we could call the native Python `map` in exactly the same way (and run it on the local machine only), though this would take much longer to complete, i.e.

    text = map(lambda f: (get_metadata(get_gid(f)), get_text(f)), flist)
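
The sketch below checks, on a tiny made-up file list, that the single `map` shown above produces the same pairs as the two-step approach used earlier. The functions `fake_gid`, `fake_metadata` and `fake_text` are hypothetical stand-ins for the `gutenberg_cleanup` functions so that the example runs without the dataset. Note that in Python 3 the built-in `map` returns a lazy iterator, so it is wrapped in `list` here to force the computation.

```python
def fake_gid(path):
    """Stand-in for get_gid: file name without directory or extension."""
    return path.rsplit('/', 1)[-1].split('.')[0]

def fake_metadata(gid):
    """Stand-in for get_metadata: a small dictionary keyed on the ID."""
    return {'gid': gid}

def fake_text(path):
    """Stand-in for get_text: a placeholder string instead of book text."""
    return 'text of ' + path

flist = ['/data/html/1000.html', '/data/html/10000.html']

# two steps: (ID, text) first, then (metadata, text)
id_text = [(fake_gid(f), fake_text(f)) for f in flist]
two_step = [(fake_metadata(gid), text) for gid, text in id_text]

# one step, as in the aside above
one_step = list(map(lambda f: (fake_metadata(fake_gid(f)), fake_text(f)), flist))

print(one_step == two_step)
```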

## Save the raw dataset to HDFS (or local storage)

As a final bit of preparation before continuing with analysis, we save the raw data in a way that makes it faster to access later. We don't want to have to read the data off local disk every time we need to repeat some part of the analysis. Instead, it's much more advantageous to use the Hadoop Distributed File System (HDFS) to store the data once we've read it in and put it in a `key,value` format. 

By storing the data in HDFS, we make sure that the system can take advantage of data-locality at a later stage in our analysis. 

In [20]:
# TODO
text_rdd.saveAsPickleFile('hdfs:///user/<YOUR USERNAME>/gutenberg/raw_text_rdd')

Now, whenever we need it, we can read the data off the HDFS instead: 

In [21]:
# TODO
text_rdd = sc.pickleFile('hdfs:///user/<YOUR USERNAME>/gutenberg/raw_text_rdd')

In [22]:
%time text_rdd.cache().count()

CPU times: user 72 ms, sys: 16 ms, total: 88 ms
Wall time: 11 s


48177

In [23]:
sc.stop()