# Gutenberg N-Grams

Start the Spark Context: 

In [8]:
import sys, os
os.environ['SPARK_HOME'] = '/Users/rok/spark'
sys.path.insert(0, '/Users/rok/spark/python/')
sys.path.insert(0, '/Users/rok/spark/python/lib/py4j-0.8.2.1-src.zip')

import pyspark

In [72]:
conf = pyspark.SparkConf()

In [77]:
conf.set('spark.executor.memory', '2g')
conf.set('spark.driver.memory', '4g')

<pyspark.conf.SparkConf at 0x10aa01790>

In [79]:
sc = pyspark.SparkContext(master = "local[4]", conf = conf)

## Make a key-value RDD of book metadata and text

Getting data into Spark from a collection of local files is a very common task. A useful pattern to keep in mind is the following:

1. make a list of filenames and distribute it among the workers
2. "map" each filename to the data you want to get out
3. now you are left with the RDD of raw data distributed among the workers!
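
The pattern above can be tried without a cluster by writing the per-file function first and checking it with Python's built-in `map`. This is only a sketch: `load_file` and `make_sample_files` are hypothetical helpers, not part of `gutenberg_cleanup`, and the built-in `map` stands in for `RDD.map`.

```python
import io
import os
import tempfile


def load_file(filename):
    """Map one filename to a (filename, contents) pair.

    Hypothetical helper for illustration; it only reads the file.
    """
    with io.open(filename, encoding='utf-8') as f:
        return (filename, f.read())


def make_sample_files(directory, contents):
    """Write each string in `contents` to its own file and return the sorted paths."""
    paths = []
    for i, text in enumerate(contents):
        path = os.path.join(directory, '%d.html' % i)
        with io.open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        paths.append(path)
    return sorted(paths)


tmp_dir = tempfile.mkdtemp()
sample_files = make_sample_files(tmp_dir, [u'first book', u'second book'])

# Step 1 would be sc.parallelize(sample_files); step 2 is the map below.
# The built-in map stands in for RDD.map so the sketch runs without Spark.
pairs = list(map(load_file, sample_files))
```

With a running `SparkContext`, the same function could be passed along as `sc.parallelize(sample_files).map(load_file)`, since it only depends on the filename it receives.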

The [`gutenberg_cleanup`](gutenberg_cleanup.py) module contains two functions that can help with this: `get_text` and `get_metadata`.

They pretty much do the obvious: 

`get_metadata` returns a metadata object with various useful fields, which will be used to create a unique key for each book.

`get_text` returns the raw text extracted from HTML, cleaned of tags and punctuation and converted to lower case. 
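
To make the description of `get_text` concrete, here is a rough sketch of that kind of cleanup applied to an HTML string. It is not the module's actual implementation: `clean_text` is a hypothetical name, and the regular expressions are a crude stand-in for whatever HTML handling `gutenberg_cleanup` really does.

```python
import re


def clean_text(html):
    """Strip tags and punctuation from an HTML string and lower-case it.

    Illustrative only: the real get_text may work differently.
    """
    no_tags = re.sub(r'<[^>]+>', ' ', html)        # drop anything that looks like a tag
    no_punct = re.sub(r'[^\w\s]', ' ', no_tags)    # replace punctuation with spaces
    return ' '.join(no_punct.lower().split())      # lower-case and collapse whitespace


sample = '<html><body><h1>Moby-Dick</h1><p>Call me Ishmael.</p></body></html>'
cleaned = clean_text(sample)
```

Regex-based tag stripping is fragile on real-world HTML, so a proper parser is usually the safer choice for a full corpus.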

### 1. Distributing the filenames

In [98]:
import glob
flist = glob.glob('/Users/rok/python_src/gutenberg/dl-cache/*html')
print 'number of books: ', len(flist)

number of books:  3206


In [99]:
files_rdd = sc.parallelize(flist)

In [83]:
files_rdd.take(5)

['/Users/rok/python_src/gutenberg/dl-cache/1000.html',
 '/Users/rok/python_src/gutenberg/dl-cache/1001.html',
 '/Users/rok/python_src/gutenberg/dl-cache/1002.html',
 '/Users/rok/python_src/gutenberg/dl-cache/1003.html',
 '/Users/rok/python_src/gutenberg/dl-cache/1004.html']

### 2. Map the filenames to metadata, text key-value pairs

Use the `get_text` and `get_metadata` functions to construct a key-value pair RDD, where each `key` is a string of the form:

`title||first_name||last_name||birth_year||death_year`

Use "||" as a delimiter. 

For the `value` of each `key`,`value` pair use the raw text returned by `get_text`. 

Hint: in Python, there are many ways to make a string, but a pretty easy one is like this:

    "bla_%s" % (var,)

where `var` matches the `%s` and is a variable that can be converted to a string. You can include more `%s` (or `%d`, `%f` etc) and more variables in the tuple that follows. 
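
As a small worked example of the hint, the sketch below builds one key with `%s` formatting. `Meta` is a hypothetical stand-in for the object returned by `get_metadata` (only the attribute names are taken from the notebook), and the field order follows the notebook's own code cell and output, where the first name precedes the last name.

```python
from collections import namedtuple

# Hypothetical stand-in for the object returned by get_metadata.
Meta = namedtuple('Meta', ['title', 'first_name', 'last_name',
                           'birth_year', 'death_year'])


def make_key(meta):
    """Join the metadata fields into one '||'-delimited string."""
    return "%s||%s||%s||%s||%s" % (meta.title, meta.first_name, meta.last_name,
                                   meta.birth_year, meta.death_year)


wells = Meta('The First Men in the Moon', 'H. G. (Herbert George)', 'Wells', 1866, 1946)
key = make_key(wells)

# Missing fields are formatted as the literal string 'None'.
untitled_key = make_key(Meta('- No Title -', None, None, None, None))
```

Because `%s` turns `None` into the text `'None'`, missing metadata cannot be told apart from a real value once the key is built, so any filtering is easier to do on the metadata fields before formatting.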

In [91]:
from gutenberg_cleanup import get_metadata, get_text

In [88]:
text_rdd = (files_rdd.map(lambda filename: 
                         (get_metadata(filename), get_text(filename)))
                     # build the "||"-delimited key from the metadata fields
                     .map(lambda (meta, text): 
                          ("%s||%s||%s||%s||%s" % (meta.title, meta.first_name, meta.last_name,
                                                   meta.birth_year, meta.death_year), 
                           text)))

So that we don't have to constantly re-load the data off disk, let's cache this RDD:

In [87]:
# text_rdd.cache()

In [85]:
text_rdd.count()

3130

Take a look at the first set of keys:

In [90]:
text_rdd.keys().take(50)

['- No Title -||None||None||None||None',
 u"Divine Comedy, Longfellow's Translation, Hell||None||Dante Alighieri||1265||1321",
 u"Divine Comedy, Longfellow's Translation, Purgatory||None||Dante Alighieri||1807||1882",
 u"Divine Comedy, Longfellow's Translation, Paradise||None||Dante Alighieri||1265||1321",
 u"Divine Comedy, Longfellow's Translation, Complete||None||Dante Alighieri||1807||1882",
 u"Divine Comedy, Cary's Translation, Hell||None||Dante Alighieri||1772||1844",
 u"Divine Comedy, Cary's Translation, Purgatory||None||Dante Alighieri||1265||1321",
 u"Divine Comedy, Cary's Translation, Paradise||None||Dante Alighieri||1265||1321",
 u"Divine Comedy, Cary's Translation, Complete||None||Dante Alighieri||1772||1844",
 u'The Hacker Crackdown: Law and Disorder on the Electronic Frontier||Bruce||Sterling||1954||None',
 u'The First Men in the Moon||H. G. (Herbert George)||Wells||1866||1946',
 u'The Lure of the Dim Trails||B. M.||Bower||1874||1940',
 u'The Oregon Trail: Sketches of Prai

It's clear that we're going to have to do some quality control here. For example, we probably don't want books with "None" for either part of the author's name, and we need the birth year in order to build a time series out of the data in the end.

Construct an RDD, as above, except that you filter out all elements that lack a value for `title`, `first_name`, `last_name`, or `birth_year`.
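
The filter condition can be written as a named predicate and checked on a few hand-made records before it is handed to Spark. This is a sketch under assumptions: `Meta` is a hypothetical stand-in for the metadata object, and missing fields are assumed to be `None`.

```python
from collections import namedtuple

# Hypothetical stand-in for the object returned by get_metadata.
Meta = namedtuple('Meta', ['title', 'first_name', 'last_name',
                           'birth_year', 'death_year'])

REQUIRED_FIELDS = ('title', 'first_name', 'last_name', 'birth_year')


def has_required_fields(meta):
    """Return True only if none of the required fields is None."""
    return all(getattr(meta, field) is not None for field in REQUIRED_FIELDS)


books = [
    Meta('The First Men in the Moon', 'H. G. (Herbert George)', 'Wells', 1866, 1946),
    Meta('- No Title -', None, None, None, None),
    Meta('The Hacker Crackdown: Law and Disorder on the Electronic Frontier',
         'Bruce', 'Sterling', 1954, None),
]

# With Spark this would be rdd.filter(has_required_fields).
kept = [m.title for m in books if has_required_fields(m)]
```

The record with a missing `death_year` is kept, because only the title, both names, and the birth year are required.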

In [100]:
filtered_rdd = (files_rdd.map(lambda filename: get_metadata(filename))
                         .filter(lambda metadata: metadata.title is not None and 
                                                  metadata.first_name is not None and
                                                  metadata.last_name is not None and 
                                                  metadata.birth_year is not None))

In [101]:
filtered_rdd.first()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 11, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/rok/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/Users/rok/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/rok/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 267, in dump_stream
    bytes = self.serializer.dumps(vs)
  File "/Users/rok/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 414, in dumps
    return pickle.dumps(obj, protocol)
RuntimeError: maximum recursion depth exceeded in cmp

	at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
	at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
	at org.apache.spark.scheduler.Task.run(Task.scala:70)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
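
The Python part of the traceback ends inside `pickle.dumps`, which suggests that the objects returned by `get_metadata` could not be serialized when `first()` tried to send them back to the driver. One possible workaround, sketched here under that assumption, is to copy the needed fields into a plain tuple inside the worker so that only built-in values leave it. `Meta` and `to_record` are hypothetical names, and whether this resolves the error depends on what `get_metadata` actually returns.

```python
import pickle
from collections import namedtuple

# Hypothetical stand-in for the object returned by get_metadata.
Meta = namedtuple('Meta', ['title', 'first_name', 'last_name',
                           'birth_year', 'death_year'])


def to_record(meta):
    """Copy the needed fields into a plain tuple of built-in values."""
    def plain(value):
        # Leave None and ints alone; coerce anything else to a plain string.
        if value is None or isinstance(value, int):
            return value
        return str(value)

    return (plain(meta.title), plain(meta.first_name),
            plain(meta.last_name), plain(meta.birth_year))


record = to_record(Meta('The Lure of the Dim Trails', 'B. M.', 'Bower', 1874, 1940))

# A tuple of strings and ints survives a pickle round trip unchanged.
round_tripped = pickle.loads(pickle.dumps(record))
```

In the pipeline this might look like `files_rdd.map(lambda f: to_record(get_metadata(f))).filter(lambda r: None not in r)`, which keeps the metadata object itself from ever being pickled.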
