In [1]:
from __future__ import print_function
%matplotlib inline
import matplotlib.pylab as plt
import sys, os, glob
import numpy as np
import subprocess
import re

# Gutenberg N-Grams

In this series of notebooks, we will quantitatively explore the text of the [Gutenberg E-Books Project](https://www.gutenberg.org/), a free repository of e-books that are in the public domain. A small Python package has been created that allows you to easily parse the text and the associated metadata. 

In this part "zero" notebook, we just ingest the data, process the text, and save the raw text RDD (with punctuation and html tags removed). 


## Raw data setup

### The books
To begin, download the DVD image using a torrent client and mount it on your system. See the instructions on the [Gutenberg DVD page](http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project).

### The metadata
The metadata (things like author birth date, language etc.) are stored in a series of '.rdf' files, which need to be [downloaded separately](http://www.gutenberg.org/wiki/Gutenberg:Feeds). Once you download the `rdf-files.tar.gz`, untar and unzip it into a directory on your computer. 

In the cell below, set `rdf_path` to where you extracted the metadata, `data_path` to where the DVD volume is mounted, and `extract_path` to where the text of all the books should be extracted. 

In [2]:
rdf_path = '/Users/rok/gutenberg_data/cache/epub'
data_path = '/Volumes/PGDVD_2010_04_RC2/'
extract_path = '/Users/rok/gutenberg_data/new_dload/'

### Extracting the text

First, we use the `gutenberg_cleanup` code to extract all data from the zip files located in the DVD archive. This will take some time. 

In [3]:
import gutenberg_cleanup

In [4]:
#gutenberg_cleanup.extract_data(data_path, extract_path)

## Ingesting raw data into Spark

With the raw data on disk, we are ready to start processing it in Spark. 

### Spark configuration

Below we specify that this notebook should use the configuration stored in <code>./spark_config</code> -- the options will be discussed in detail in the next notebook.

<div class="alert alert-info">
Note that the environment variables have to be declared before any other Spark initialization takes place (including creating a <code>SparkConf</code> object).
</div>
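
As a minimal sketch, the settings from the commented-out lines in the cell below can be collected in a helper that runs before `pyspark` is imported. The helper name `spark_env` is hypothetical, and `sys.executable` is used here in place of the `which python` subprocess call; the memory formula is copied from the cell below.

```python
import os
import sys

def spark_env(ncores, conf_dir='./spark_config'):
    """Return the environment variables to set before Spark is initialized.

    `spark_env` is a hypothetical helper; the variable names and the
    memory formula mirror the commented-out lines in this notebook.
    """
    return {
        'SPARK_CONF_DIR': os.path.abspath(conf_dir),
        # 70% of 2 GB per core, truncated to a whole number of gigabytes
        'SPARK_DRIVER_MEMORY': '%dG' % (ncores * 2 * 0.7),
        # assumption: workers should use the interpreter running this notebook
        'PYSPARK_PYTHON': sys.executable,
    }

# LSB_DJOB_NUMPROC is read as in the cell below; fall back to one core
ncores = int(os.environ.get('LSB_DJOB_NUMPROC', 1))
env = spark_env(ncores)
# os.environ.update(env)  # uncomment to apply, before importing pyspark
```

Returning a dictionary instead of writing to `os.environ` directly makes it easy to inspect the values before they take effect.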

In [5]:
# ncores = int(os.environ.get('LSB_DJOB_NUMPROC', 1))

# os.environ['SPARK_CONF_DIR'] = os.path.abspath('./spark_config')
# os.environ['SPARK_DRIVER_MEMORY'] = '%dG'%(ncores*2*0.7)
# os.environ['PYSPARK_PYTHON'] = subprocess.check_output('which python', shell=True).rstrip()

import findspark
findspark.init()

import pyspark
from pyspark import SparkConf, SparkContext

### Starting the `SparkContext`
This is our entry point to the Spark runtime: it is used to push data into Spark, load RDDs from disk, and so on. If you are running in a Hadoop environment, set the `master` keyword in `SparkContext` to 'yarn-client'; otherwise use the 'local[\*]' master, which will run Spark locally on all available cores. 

In [6]:
sc = SparkContext(master = 'local[*]')

If this works, you can check the Spark UI at the URL returned by the cell below (run it first): 

In [7]:
sc.uiWebUrl

'http://129.132.179.130:4040'

## Make a key-value RDD of book metadata and text

Getting data into spark from a collection of local files is a very common task. A useful pattern to keep in mind is the following: 

1. make a list of filenames and distribute it among the workers
2. "map" each filename to the data you want to get out
3. now you are left with the RDD of raw data distributed among the workers!
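
As a rough local illustration of this pattern, the sketch below uses Python's built-in `map` on a plain list in place of an RDD. The helper `read_first_line` and the temporary files are hypothetical stand-ins; in Spark, the list would be passed to `sc.parallelize` and the function to the RDD's `map` method.

```python
import os
import tempfile

def read_first_line(filename):
    """Map a filename to a (name, data) pair; here 'data' is the first line."""
    with open(filename) as f:
        return os.path.basename(filename), f.readline().strip()

# create a few small files to stand in for the dataset
tmpdir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(tmpdir, 'book%d.txt' % i), 'w') as f:
        f.write('title of book %d\nbody text\n' % i)

# make a list of filenames
filenames = sorted(os.path.join(tmpdir, name) for name in os.listdir(tmpdir))

# map each filename to the data we want
# (with Spark: sc.parallelize(filenames).map(read_first_line))
data = list(map(read_first_line, filenames))
print(data)
```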

In the case of the Gutenberg Project e-book data, we have `.zip` files which hold the actual book text in `.txt` files, and another directory of associated metadata files (the RDF files). To make your life easier for the purpose of this tutorial, we have made a small Python module called `gutenberg_cleanup` that has some handy functions for pulling the relevant text and metadata out of the raw dataset. 

The [`gutenberg_cleanup`](gutenberg_cleanup.py) module contains several functions that can help with this: `get_filelist`, `read_file`, `get_gid`, `get_metadata` and `get_text`.

They pretty much do the obvious: 

`get_gid` takes an html path and pulls out the book ID (`gid`)

`get_metadata` takes a `gid` and returns a metadata object with various useful fields that will be used to create a unique key for each book

`get_text` takes a path to an html file and returns the raw text extracted from HTML, cleaned of tags and punctuation and converted to lower case. 

First, we create a lookup table for the `.rdf` metadata files so we don't have to search the filesystem repeatedly: 

In [8]:
rdf_lookup = {}
find_gid = re.compile(r'(\d+)')  # raw string so that \d is not treated as a string escape
for root, dirs, files in os.walk(rdf_path):
    for f in files:
        name, ext = os.path.splitext(f)
        if ext == '.rdf':
            rdf_lookup[find_gid.findall(name)[0]] = os.path.join(root,f)
rdf_lookup_b = sc.broadcast(rdf_lookup)
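
The same lookup can be wrapped in a function, which makes it easy to try on a small directory tree. This is a sketch under assumptions: the name `build_rdf_lookup` is hypothetical, and file names that contain no digits are skipped rather than raising an `IndexError`.

```python
import os
import re
import tempfile

def build_rdf_lookup(rdf_path):
    """Map each Gutenberg ID (as a string) to the path of its .rdf file."""
    find_gid = re.compile(r'(\d+)')
    lookup = {}
    for root, dirs, files in os.walk(rdf_path):
        for f in files:
            name, ext = os.path.splitext(f)
            match = find_gid.search(name)
            # skip non-RDF files and names that contain no ID
            if ext == '.rdf' and match:
                lookup[match.group(1)] = os.path.join(root, f)
    return lookup

# try it on a tiny made-up directory tree
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, '1000'))
open(os.path.join(tmp, '1000', 'pg1000.rdf'), 'w').close()
open(os.path.join(tmp, 'README.rdf'), 'w').close()  # no ID, so it is skipped
demo_lookup = build_rdf_lookup(tmp)
print(demo_lookup)
```

With the real data this would be called as `build_rdf_lookup(rdf_path)` and the result broadcast with `sc.broadcast`, as in the cell above.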

### Initializing the raw dataset using `sc.parallelize`

In [9]:
filelist = gutenberg_cleanup.get_filelist(extract_path)

print('Total number of books: %d'%len(filelist))

Total number of books: 30807


When you use `sc.parallelize` to distribute a dataset across the cluster, you can choose the number of partitions across which to distribute it. The higher the number of partitions, the higher the "parallelism". When Spark subsequently executes maps and reduces on this dataset, it dispatches tasks to the executors, which use the cores under their control to do the actual work. Each partition is processed by one task, so more partitions means more tasks, which gives the Spark scheduler more flexibility in distributing the work across the cluster and making full use of the available compute resources. If a single partition requires a lot of memory, it can cause `Out of memory` errors; in such cases, reducing the amount of data per task by increasing the parallelism can help. 

Note that as long as the tasks take a few hundred milliseconds the scheduler should have no trouble dispatching them. On the other hand, there is a bit of overhead associated with partitioning the data so you don't want an unreasonably high number of partitions. You can see the [Spark guide](http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism) for a bit more detail. 
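
To get a feel for what the partition count means, the sketch below splits a list into contiguous slices of nearly equal size. The function `split_into_partitions` is a hypothetical illustration and not Spark's own code, although `sc.parallelize` distributes a list in a similar spirit.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous, nearly equal slices."""
    n = len(items)
    return [items[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

# ten made-up filenames spread over four partitions
filenames = ['book%d.txt' % i for i in range(10)]
partitions = split_into_partitions(filenames, 4)
sizes = [len(p) for p in partitions]
print(sizes)
```

Each slice would become one task, so doubling the number of partitions roughly halves the amount of data each task has to hold in memory.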

Below, we will choose to use 5 times as many partitions as we have cores in the job. 

In [10]:
ncores = sc.defaultParallelism
files_rdd = sc.parallelize(filelist, ncores*5)

In [11]:
files_rdd.take(5)

['/Users/rok/gutenberg_data/new_dload/1/0/0/0/10001/10001.txt',
 '/Users/rok/gutenberg_data/new_dload/1/0/0/0/10002/10002-8.txt',
 '/Users/rok/gutenberg_data/new_dload/1/0/0/0/10003/10003.txt',
 '/Users/rok/gutenberg_data/new_dload/1/0/0/0/10004/10004-8.txt',
 '/Users/rok/gutenberg_data/new_dload/1/0/0/0/10005/10005-8.txt']

### Transforming the list of filenames into a `key,value` pair RDD of metadata and text

The raw Gutenberg Project dataset consists of `txt` files and files that hold metadata in XML format. For example, here are the first few lines of one of the metadata files:

In [12]:
with open('/Users/rok/gutenberg_data/cache/epub/1000/pg1000.rdf') as f: 
    print(' '.join(f.readlines()[:20]))

<?xml version="1.0" encoding="utf-8"?>
 <rdf:RDF xml:base="http://www.gutenberg.org/"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
   xmlns:dcam="http://purl.org/dc/dcam/"
   xmlns:cc="http://web.resource.org/cc/"
   xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:dcterms="http://purl.org/dc/terms/"
 >
   <pgterms:ebook rdf:about="ebooks/1000">
     <dcterms:hasFormat>
       <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/1000.txt.utf-8">
         <dcterms:format>
           <rdf:Description rdf:nodeID="N1d6fbe7c5c724eb9a80228a47d8a07c5">
             <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
             <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain</rdf:value>
           </rdf:Description>
         </dcterms:format>
         <dcterms:isFormatOf rdf:resource="ebooks/1000"/>
         <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">

#### Data Ingestion procedure

Our first task is to ingest this dataset by doing the following: 

1. open and read the text file
2. match each metadata entry with its corresponding raw text 
3. produce an RDD of `(metadata, text)` pairs

These steps are often very similar at the beginning of any analysis, and can be quite time consuming to get right. For the purposes of this exercise, we have already built the functions needed to perform these operations. They are found in [`gutenberg_cleanup.py`](gutenberg_cleanup.py) if you want to have a look. 

The important functions are:

* `get_gid` -- returns the Gutenberg ID given a filename
* `get_metadata` -- returns the metadata given an ID 

These will be used to construct a `key,value` pair RDD. The `key` will be the dictionary returned by `get_metadata`, while the `value` will be the raw text returned by `get_text`. 
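
To make the ID extraction concrete, here is a minimal sketch of how a Gutenberg ID could be pulled out of a path like the ones listed earlier. The function `get_gid_sketch` is hypothetical; the actual `get_gid` in `gutenberg_cleanup.py` may work differently.

```python
import os
import re

_leading_digits = re.compile(r'^(\d+)')

def get_gid_sketch(filename):
    """Return the leading digits of the file's base name, or None."""
    match = _leading_digits.match(os.path.basename(filename))
    return match.group(1) if match else None

gid = get_gid_sketch('/Users/rok/gutenberg_data/new_dload/1/0/0/0/10002/10002-8.txt')
print(gid)
```

Returning the ID as a string keeps it compatible with the string keys used in `rdf_lookup`.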

In [13]:
from gutenberg_cleanup import get_metadata, clean_text, get_gid

To pass the `gutenberg_cleanup` source file to the executors, we will use the `addPyFile` method of the `SparkContext`:

In [14]:
sc.addPyFile('{cwd}/gutenberg_cleanup.py'.format(cwd=os.getcwd()))

Use the `map` method of `files_rdd` to map the filenames to `(metadata, text)` tuples. Here we call the module's `read_file` function with the broadcast lookup table, and filter out entries whose metadata is `None`:

In [15]:
text_rdd = (files_rdd.map(lambda filename: gutenberg_cleanup.read_file(filename, rdf_lookup_b.value))
                     .filter(lambda x: x[0] is not None))

So that we don't have to constantly reload the data from disk, let's cache this RDD: 

In [18]:
gutenberg_cleanup.get_metadata('2895', rdf_lookup)

{'author': <dcterms:creator>
 <pgterms:agent rdf:about="2009/agents/53">
 <pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/Mark_Twain"></pgterms:webpage>
 <pgterms:alias>Twain, Mark (Samuel Clemens)</pgterms:alias>
 <pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1835</pgterms:birthdate>
 <pgterms:name>Twain, Mark</pgterms:name>
 <pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1910</pgterms:deathdate>
 <pgterms:alias>Clemens, Samuel Langhorne</pgterms:alias>
 </pgterms:agent>
 </dcterms:creator>,
 'author_id': '53',
 'author_name': ['Twain', ' Mark'],
 'birth_year': '1835',
 'death_year': '1910',
 'downloads': '1495',
 'file_types': {'2895-h.htm': 'text/html; charset=iso-8859-1',
  '2895-h.zip': 'text/html; charset=iso-8859-1',
  '2895.epub.images': 'application/epub+zip',
  '2895.epub.noimages': 'application/epub+zip',
  '2895.kindle.images': 'application/x-mobipocket-ebook',
  '2895.kindle.noimages': 'application/x-mobi

In [16]:
text_rdd.take(2)

[({'author': <dcterms:creator>
   <pgterms:agent rdf:about="2009/agents/1308">
   <pgterms:alias>Seneca, L. A. (Lucius Annaeus)</pgterms:alias>
   <pgterms:name>Seneca, Lucius Annaeus</pgterms:name>
   <pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">65</pgterms:deathdate>
   <pgterms:alias>Seneca, Annaeus</pgterms:alias>
   <pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/Seneca_the_Younger"></pgterms:webpage>
   </pgterms:agent>
   </dcterms:creator>,
   'author_id': '1308',
   'author_name': ['Seneca', ' Lucius Annaeus'],
   'birth_year': '1863',
   'death_year': '1950',
   'downloads': '274',
   'file_types': {'10001-h.htm': 'text/html; charset=us-ascii',
    '10001-h.zip': 'text/html; charset=us-ascii',
    '10001.epub.images': 'application/epub+zip',
    '10001.epub.noimages': 'application/epub+zip',
    '10001.kindle.images': 'application/x-mobipocket-ebook',
    '10001.kindle.noimages': 'application/x-mobipocket-ebook',
    '10001.plucker': 

In [16]:
%%time
text_rdd.cache()  # mark the RDD for caching; it is materialized by the count() below
ndocs = text_rdd.filter(lambda x: x[0] is not None).count()
print('number of documents: ', ndocs)

number of documents:  25406
CPU times: user 32.4 ms, sys: 12.8 ms, total: 45.2 ms
Wall time: 6min


In [18]:
text_rdd.map(lambda x: len(x[1])).sum()

9346480671

In [39]:
# raw strings keep \w and \s from being treated as string escapes
words_re = re.compile(r"[\w']+")

no_punctuation = re.compile(r"[^a-zA-Z0-9\s'-]")
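
A short sketch of what these two patterns do, applied to a made-up sample sentence: `no_punctuation` matches every character that is not a letter, digit, whitespace, apostrophe, or hyphen, and `words_re` matches runs of word characters and apostrophes.

```python
import re

words_re = re.compile(r"[\w']+")
no_punctuation = re.compile(r"[^a-zA-Z0-9\s'-]")

sample = "Hello, World! It's well-known."   # made-up example text

# drop punctuation and lower-case the result
cleaned = no_punctuation.sub('', sample).lower()
# split into word tokens; the hyphen is not matched by words_re,
# so a hyphenated word is split into its parts
tokens = words_re.findall(cleaned)
print(cleaned)
print(tokens)
```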

Since we called `count()`, the entire RDD was computed. This combination of `cache` and `count` is a common way to check how much memory your dataset needs: once `count` completes, you can check the memory taken up by the RDD in the "Storage" tab of the Spark UI. 

Because the data is cached, the next time you use `text_rdd` it will be much quicker. For example, 

In [17]:
%%time
assert(text_rdd.count() == ndocs)

CPU times: user 47 ms, sys: 10 ms, total: 57 ms
Wall time: 3.09 s


## Save the raw dataset to HDFS (or local storage)

As a final bit of preparation before continuing with analysis, we save the raw data in a way that makes it faster to access later. We don't want to have to read the data off local disk every time we need to repeat some part of the analysis. Instead, it's much more advantageous to use the Hadoop Distributed File System (HDFS) to store the data once we've read it in and put it in a `key,value` format. 

By storing the data in HDFS, we make sure that the system can take advantage of data-locality at a later stage in our analysis. 

In [18]:
!hadoop fs -rm -r -f /user/roskarr/gutenberg/raw_text_rdd

Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1
15/11/20 15:00:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/20 15:00:59 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/roskarr/gutenberg/raw_text_rdd


In [19]:
text_rdd.saveAsPickleFile('hdfs:///user/roskarr/gutenberg/raw_text_rdd')

Now, whenever we need it, we can read the data from HDFS instead: 

In [20]:
text_rdd = sc.pickleFile('hdfs:///user/roskarr/gutenberg/raw_text_rdd')
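
`saveAsPickleFile` serializes the RDD elements with Python's `pickle`, so each `(metadata, text)` pair comes back as the same kind of Python object. A local sketch of that round trip, with made-up records and a temporary file standing in for HDFS:

```python
import os
import pickle
import tempfile

# made-up records standing in for the (metadata, text) pairs
records = [({'author_id': '1308', 'author_name': ['Seneca', ' Lucius Annaeus']}, 'some book text'),
           ({'author_id': '53', 'author_name': ['Twain', ' Mark']}, 'more book text')]

path = os.path.join(tempfile.mkdtemp(), 'raw_text.pkl')

# write once ...
with open(path, 'wb') as f:
    pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)

# ... and read back whenever needed
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == records)
```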

In [21]:
%time text_rdd.count()

CPU times: user 48 ms, sys: 9 ms, total: 57 ms
Wall time: 6.17 s


48177

In [22]:
sc.stop()