In [1]:
from __future__ import print_function
%matplotlib inline
import matplotlib.pylab as plt
import sys, os, glob
import numpy as np

plt.rcParams['figure.figsize'] = (10,6)
plt.rcParams['font.size'] = 18
plt.style.use('fivethirtyeight')

# Analyzing the Gutenberg Books Corpus

In this notebook, we will use the cleaned, pre-processed data that we created in the [pre-processing part](gutenberg-preprocessing-SOLUTIONS.ipynb). As a reminder, we ended up with a cleaned RDD of `(gid, text)` tuples, which we stored on HDFS at `/user/<YOUR_USERNAME>/gutenberg/cleaned_rdd`. 

In the [first analysis notebook](gutenberg-analysis-SOLUTIONS.ipynb) we built an N-gram viewer for the Gutenberg books project. Now we will use the corpus to train a simple language classification model using [Spark's machine learning library](http://spark.apache.org/docs/latest/mllib-guide.html).

## Setting up Python and Spark

These steps are identical to those used in the previous notebook, so we have omitted the lengthy explanations -- if you need to check what any of this is doing, have a look at the pre-processing notebook.

In [102]:
import findspark, os
findspark.init()

import pyspark
from pyspark import SparkConf, SparkContext

In [103]:
# put the number of executors and cores into variables so we can refer to it later
num_execs = 10
exec_cores = 4

In [104]:
# initializing the SparkConf
os.environ['SPARK_DRIVER_MEMORY'] = '4g'
os.environ['SPARK_CONF_DIR'] = '%s/../../spark_config'%os.getcwd()
conf = (SparkConf()
            .set('spark.executor.memory', '8g')
            .set('spark.executor.instances', str(num_execs))
            .set('spark.executor.cores', str(exec_cores))
            .set('spark.storage.memoryFraction', 0.3)
            .set('spark.shuffle.memoryFraction', 0.5)
            .set('spark.yarn.executor.memoryOverhead', 3072)
            .set('spark.yarn.am.memory', '8g')
            .set('spark.yarn.am.cores', 4)
            .set('spark.executorEnv.PYTHONPATH', 
                 '{home}/spark_workshop/notebooks/gutenberg'.format(home=os.environ['HOME']))
            .set('spark.executorEnv.PATH', os.environ['PATH']))

In [105]:
sc = SparkContext(master = 'yarn-client', conf = conf)

If this works successfully, you can check the [YARN application scheduler](http://hadoop.hpc-net.ethz.ch:8088/cluster) and you should see your app listed there. Clicking on the "Application Master" link will bring up the familiar Spark Web UI. 

## Load the data from HDFS

In [106]:
# TODO: load cleaned_rdd from the HDFS
cleaned_rdd = sc.pickleFile('/user/roskarr/gutenberg/cleaned_rdd').cache()

### Load in the metadata dictionary and broadcast it

In [107]:
from cPickle import load

# pickle files should be opened in binary mode
with open('{home}/gutenberg_metadata.dump'.format(home=os.environ['HOME']), 'rb') as f :
    meta_dict = load(f)

In [108]:
# TODO: create meta_b by broadcasting meta_dict
meta_b = sc.broadcast(meta_dict)

Now our `cleaned_rdd` contains `gid`s as keys and text as values. If we want some other piece of metadata, we can access it via the lookup table, for example `meta_b.value[gid][meta_name]`. 

We will use the same `extract_ngrams` and `vectorize_doc` functions as in the previous notebook: 

In [109]:
from scipy.sparse import csr_matrix
import re

def extract_ngrams(tokens, ngram_range=[1,1], select_ngrams = None, character = False):
    """
    Turn tokens into a sequence of n-grams 

    **Inputs**:

    *tokens*: a list of tokens

    **Optional Keywords**:

    *ngram_range*: a (min_n, max_n) pair giving the smallest and largest n-gram size
    
    *select_ngrams*: the vocabulary to use
    
    *character*: True if using character ngrams; default is False

    **Output**

    Generator yielding, one at a time, the ngrams in the desired range
    generated from the input list of tokens

    """
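    # separator for joining tokens; note that the join below always uses ""
    # (which is what character n-grams need), so join_str is currently unused
    join_str = "" if character else " "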
    
    # handle token n-grams
    min_n, max_n = ngram_range
    n_tokens = len(tokens)
    for n in xrange(min_n, min(max_n + 1, n_tokens + 1)):
        for i in xrange(n_tokens - n + 1):
            if n == 1: 
                res = tokens[i]
            else : 
                res = "".join(tokens[i: i+n])
           
            if select_ngrams is not None : 
                if res in select_ngrams: 
                    yield res
            else : 
                yield res
            
def vectorize_doc(doc, vocab, ngram_range = [1,1]) : 
    """
    Returns a vector representation of `doc` given the reference 
    vocabulary `vocab`; `doc` must already be a sequence of tokens
    
    Arguments: 
        
        doc: a sequence of tokens (words or characters)
        
        vocab: the vocabulary mapping
        
        ngram_range: the range of ngrams to process
        
    Returns:
    
        a sparse vector representation of the document given the vocabulary mapping
    """
    from collections import defaultdict
    from scipy.sparse import csr_matrix 
    
    d = defaultdict(int)
    
    for ngram in extract_ngrams(doc, ngram_range, vocab) : 
        d[ngram] += 1
        
    values = np.empty(len(d))
    # row and column indices must be integers for the sparse matrix constructor
    indices = np.empty(len(d), dtype=int)
    
    for i, (ngram, val) in enumerate(d.iteritems()) : 
        indices[i] = vocab[ngram]
        values[i] = val
        
    return csr_matrix((values, (indices, np.zeros(len(d), dtype=int))), shape = (len(vocab), 1))

# Language classification

Here we will try to use some of the same techniques we developed before, but apply them to a classification problem: determining whether a text is English or German. 

We will use the rather straightforward method outlined in [Cavnar & Trenkle 1994](http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf):

For each of the English/German training sets:

1. tokenize the text into characters (spaces are also tokens, so we replace them with "_")
2. extract N-grams where 1 ≤ N ≤ 5
3. determine the 300 most common N-grams for the whole corpus (in the cells below we keep the top 1000 instead)
4. encode both sets of documents using the combined set of top N-grams
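
The steps above can be sketched in plain Python, without Spark, on a toy corpus. The helper names (`char_ngrams`, `top_ngrams`, `encode`) and the two-sentence corpora are made up for illustration, and the small `top_n` value of 20 stands in for the much larger cutoff used on real data.

```python
from collections import Counter

def char_ngrams(text, min_n=1, max_n=5):
    """Yield all character n-grams of `text`, with spaces replaced by '_'."""
    chars = text.replace(' ', '_')
    for n in range(min_n, max_n + 1):
        for i in range(len(chars) - n + 1):
            yield chars[i:i + n]

def top_ngrams(docs, top_n):
    """Return the `top_n` most common n-grams over all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc))
    return [ngram for ngram, _ in counts.most_common(top_n)]

# hypothetical toy corpora, for illustration only
english_docs = ["the dog bit me", "the dog is back"]
german_docs = ["der hund hat mich gebissen", "der hund ist wieder da"]

# combined vocabulary: union of the top n-grams of both languages
vocab = sorted(set(top_ngrams(english_docs, 20)) | set(top_ngrams(german_docs, 20)))
vocab_index = {ngram: i for i, ngram in enumerate(vocab)}

def encode(doc):
    """Encode a document as a list of counts over the combined vocabulary."""
    vec = [0] * len(vocab_index)
    for ngram in char_ngrams(doc):
        if ngram in vocab_index:
            vec[vocab_index[ngram]] += 1
    return vec
```

Calling `encode("the dog")` returns one count per vocabulary entry; n-grams that are not in the combined vocabulary are simply ignored.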



Before, we used words as "tokens" -- now we will use characters, even accounting for white space (which we will replace with "_"). We will use the two example sentences again:

    document 1: "a dog bit me"
    document 2: "i bit the dog back"
    
First, we use the `extract_ngrams` function: 

In [110]:
s1 = "a dog bit me"
s2 = "i bit the dog back"

In [111]:
ngrams1 = list(extract_ngrams(s1.replace(' ','_'), ngram_range=[1,5], character=True))
ngrams2 = list(extract_ngrams(s2.replace(' ','_'), ngram_range=[1,5], character=True))

In [112]:
print(list(ngrams1))

['a', '_', 'd', 'o', 'g', '_', 'b', 'i', 't', '_', 'm', 'e', 'a_', '_d', 'do', 'og', 'g_', '_b', 'bi', 'it', 't_', '_m', 'me', 'a_d', '_do', 'dog', 'og_', 'g_b', '_bi', 'bit', 'it_', 't_m', '_me', 'a_do', '_dog', 'dog_', 'og_b', 'g_bi', '_bit', 'bit_', 'it_m', 't_me', 'a_dog', '_dog_', 'dog_b', 'og_bi', 'g_bit', '_bit_', 'bit_m', 'it_me']


We can create the vocabulary by getting the set of all ngrams and building a lookup table:

In [113]:
vocab = set(list(ngrams1)) | set(list(ngrams2))

In [114]:
vocab_dict = {word:ind for ind,word in enumerate(vocab)}
print('number of ngrams: ',len(vocab_dict))

number of ngrams:  89


Note that extracting ngrams can increase the size of the data quite a lot!

In [115]:
vocab_dict

{'_': 73,
 '_b': 24,
 '_ba': 12,
 '_bac': 19,
 '_back': 27,
 '_bi': 8,
 '_bit': 29,
 '_bit_': 54,
 '_d': 23,
 '_do': 55,
 '_dog': 83,
 '_dog_': 72,
 '_m': 18,
 '_me': 84,
 '_t': 15,
 '_th': 71,
 '_the': 36,
 '_the_': 31,
 'a': 74,
 'a_': 66,
 'a_d': 34,
 'a_do': 58,
 'a_dog': 88,
 'ac': 5,
 'ack': 62,
 'b': 41,
 'ba': 68,
 'bac': 67,
 'back': 25,
 'bi': 70,
 'bit': 37,
 'bit_': 6,
 'bit_m': 35,
 'bit_t': 30,
 'c': 75,
 'ck': 0,
 'd': 42,
 'do': 26,
 'dog': 81,
 'dog_': 7,
 'dog_b': 9,
 'e': 76,
 'e_': 64,
 'e_d': 49,
 'e_do': 21,
 'e_dog': 51,
 'g': 77,
 'g_': 14,
 'g_b': 59,
 'g_ba': 63,
 'g_bac': 22,
 'g_bi': 65,
 'g_bit': 3,
 'h': 44,
 'he': 38,
 'he_': 4,
 'he_d': 50,
 'he_do': 11,
 'i': 78,
 'i_': 60,
 'i_b': 57,
 'i_bi': 28,
 'i_bit': 85,
 'it': 56,
 'it_': 17,
 'it_m': 53,
 'it_me': 33,
 'it_t': 43,
 'it_th': 69,
 'k': 79,
 'm': 80,
 'me': 39,
 'o': 82,
 'og': 47,
 'og_': 40,
 'og_b': 2,
 'og_ba': 13,
 'og_bi': 16,
 't': 48,
 't_': 1,
 't_m': 61,
 't_me': 32,
 't_t': 87,
 't_th'

And finally, we can use `vectorize_doc` with the vocabulary mapping to turn the documents into vectors: 

In [116]:
vectorize_doc(s1.replace(' ','_'), vocab_dict, ngram_range=[1,5]).toarray().squeeze()

array([ 0.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,
        0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,
        1.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,
        1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,
        1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  3.,  1.,  0.,  1.,  1.,
        1.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  1.])

### Reducing memory consumption with `mapPartitions`

As you can see from the simple example above, extracting the 1- to 5-grams from a simple 12-character string produced 50 N-grams and a vector with 48 non-zero values. Our actual documents will therefore swell in size very rapidly -- we definitely don't want to be holding all of those huge lists in memory! 
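
These numbers are easy to verify: a string of length L contains L - n + 1 n-grams of size n, so for L = 12 and n from 1 to 5 there are 12 + 11 + 10 + 9 + 8 = 50 n-grams in total, of which 48 are distinct (`'_'` appears three times). Here is a small plain-Python check, using a throwaway helper `char_ngrams` that is not part of the notebook:

```python
def char_ngrams(chars, min_n, max_n):
    """Return the list of all character n-grams with min_n <= n <= max_n."""
    return [chars[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(chars) - n + 1)]

s1 = "a dog bit me".replace(' ', '_')
ngrams = char_ngrams(s1, 1, 5)

total = len(ngrams)          # every n-gram occurrence
distinct = len(set(ngrams))  # number of non-zero entries in the document vector
print(total, distinct)
```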

What we want in the first part is the set of most frequently used N-grams. For this, we just need to create an RDD of N-grams, and to avoid building lists we'll use the technique of generators discussed on the first day. 

Note that the `extract_ngrams` function above is already a generator: now we just want to make a small wrapper function that uses `extract_ngrams` to "yield" ngrams one by one into the RDD. 

A slight complication is that `mapPartitions` gives us an *iterator* over the data in the partition - the items returned by this iterator will be just individual documents, which we can then pass to `extract_ngrams`. 
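
To see why generators help here, the following plain-Python sketch mimics what happens inside a single partition. `fake_partition` is a hypothetical stand-in for the iterator that `mapPartitions` hands to our function, and `lazy_ngrams` is a simplified version of the wrapper. Nothing is computed until something consumes the generator, and n-grams are produced one at a time rather than collected into a list.

```python
from itertools import islice

def lazy_ngrams(iterator, min_n=1, max_n=3):
    """Yield character n-grams one by one for every document in `iterator`."""
    for text in iterator:
        chars = text.replace(' ', '_')
        for n in range(min_n, max_n + 1):
            for i in range(len(chars) - n + 1):
                yield chars[i:i + n]

# hypothetical stand-in for the documents of one partition
fake_partition = iter(["a dog bit me", "i bit the dog back"])

gen = lazy_ngrams(fake_partition)   # no work has been done yet
first_five = list(islice(gen, 5))   # pull only the first five n-grams
print(first_five)
```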

In [117]:
from collections import defaultdict

def ngram_generator(iterator, ngram_range=[1,1]) : 
    """Take an iterator of documents and create a generator of ngrams
    
    Arguments:
        
        iterator: the document iterator
        
    Keywords:
        
        ngram_range: the range of ngrams to consider
    """ 
    for text in iterator : 
        for ngram in extract_ngrams(text.replace(' ', '_'), ngram_range): 
            yield ngram

We will subsample the `cleaned_rdd` in order to make this next set of cells complete in a reasonable amount of time -- once it's working, you can go back and do it for the full dataset, but it will take approximately 30 minutes. Note that we also repartition the data in order to ease the resource requirements of individual partitions. 

In [118]:
sampled_data = cleaned_rdd.sample(False, 0.1).repartition(2000).cache()

In [119]:
english_rdd = sampled_data.filter(lambda (gid,text): (meta_b.value[gid]['lang'] == 'en')).cache()
german_rdd = sampled_data.filter(lambda (gid, text): (meta_b.value[gid]['lang'] == 'de')).cache()

In [120]:
ngram_range = [1,3] # should use 1-5 ngram range, but make it smaller to speed up the processing a bit

#### Making the sets of most common english and german ngrams

To build the sets of ngrams, we use the now-familiar pattern: 

1. map the documents in the RDDs to their constituent ngrams (here we use the `mapPartitions` call with the `ngram_generator` defined above)
2. do the distributed key count using the `map` --> `(key, 1)` --> `reduceByKey` pattern
3. sort the result (in descending order) and take the top 1000 ngrams
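
For reference, here is what the three steps compute, written in plain Python on a small in-memory list. This is only a local illustration of the pattern (the names `docs`, `pairs` and `counts` are made up); on the cluster the same logic is expressed with `mapPartitions`, `map`, `reduceByKey` and `sortBy`.

```python
from collections import defaultdict

docs = ["a dog bit me", "i bit the dog back"]

# step 1: map documents to their constituent n-grams (here: 1- and 2-grams)
ngrams = []
for text in docs:
    chars = text.replace(' ', '_')
    for n in (1, 2):
        ngrams.extend(chars[i:i + n] for i in range(len(chars) - n + 1))

# step 2: emit (key, 1) pairs and sum the values per key
pairs = [(ngram, 1) for ngram in ngrams]
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

# step 3: sort by count in descending order and keep the top few
top_3 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_3)
```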

In [121]:
# TODO: 
en_ngram_counts = (english_rdd.values()
                              .mapPartitions(lambda it: ngram_generator(it, ngram_range))
                              .map(lambda ngram: (ngram,1))
                              .reduceByKey(lambda a,b:a+b).cache())

In [122]:
# TODO
de_ngram_counts = (german_rdd.values()
                             .mapPartitions(lambda it: ngram_generator(it, ngram_range))
                             .map(lambda ngram: (ngram,1))
                             .reduceByKey(lambda a,b:a+b).cache())

In [123]:
%%time
top_1000_en_ngrams = (en_ngram_counts.sortBy(lambda (ngram,count): count, False)
                                     .map(lambda (ngram, count): ngram)
                                     .take(1000))

CPU times: user 670 ms, sys: 158 ms, total: 828 ms
Wall time: 3min 24s


In [124]:
%%time
top_1000_de_ngrams = (de_ngram_counts.sortBy(lambda (ngram,count): count, False)
                                     .map(lambda (ngram, count): ngram)
                                     .take(1000))

CPU times: user 645 ms, sys: 151 ms, total: 796 ms
Wall time: 44.7 s


In [125]:
# building the top_ngrams dictionaries
# combine the german and english ngrams
top_ngrams = set(top_1000_de_ngrams) | set(top_1000_en_ngrams)

# build the ngrams dictionary lookup
top_ngrams_dict = {ngram:i for (i,ngram) in enumerate(top_ngrams)}

# broadcast the ngrams dictionary
top_ngrams_dict_b = sc.broadcast(top_ngrams_dict)

In [126]:
# TODO: map the sampled_data RDD into (gid, vector) tuples 
# by using the vectorize_doc function and the broadcasted ngram dictionary

vector_rdd = sampled_data.map(lambda (gid,text): (gid, 
                                                 vectorize_doc(text.replace(' ','_'),top_ngrams_dict_b.value, ngram_range)))

The `ngram_generator` function followed by the `map` and `reduceByKey` calls above is clear but a bit inefficient -- can you move some of the reduction into `mapPartitions` and `ngram_generator`? 
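
One possible approach, sketched here under the assumption that the distinct n-grams of a single partition fit comfortably in executor memory, is to count the n-grams locally inside the partition function and emit `(ngram, count)` pairs. This avoids creating one `(ngram, 1)` tuple per n-gram occurrence. The function name `ngram_counter` is made up for this sketch, and it inlines a simple character n-gram loop rather than calling `extract_ngrams`.

```python
from collections import Counter

def ngram_counter(iterator, ngram_range=(1, 1)):
    """Count the n-grams of all documents in one partition.

    Yields (ngram, count) pairs, one per distinct n-gram in the partition,
    instead of one record per n-gram occurrence.
    """
    min_n, max_n = ngram_range
    counts = Counter()
    for text in iterator:
        chars = text.replace(' ', '_')
        for n in range(min_n, max_n + 1):
            for i in range(len(chars) - n + 1):
                counts[chars[i:i + n]] += 1
    for item in counts.items():
        yield item

# local check on a hypothetical two-document partition
partition = iter(["a dog bit me", "i bit the dog back"])
partial_counts = dict(ngram_counter(partition, (1, 3)))
print(partial_counts['_'], partial_counts['dog'])
```

With such a function, the intermediate `map` step is no longer needed: the output of `mapPartitions(lambda it: ngram_counter(it, ngram_range))` can be passed straight to `reduceByKey(lambda a, b: a + b)`.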

## Train the language classification model

To train the model, we first need to map the `vector_rdd` elements into [`LabeledPoint`](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=labeledpoint#pyspark.mllib.regression.LabeledPoint), which is just a Spark abstraction that encompasses a *label* and a *vector*. We then split the data into training and validation sets and produce the trained model. 

In [127]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

First, create a `vector_lp` RDD by mapping the contents of `vector_rdd` into `LabeledPoint`s, using 0 if the language is English and 1 if it is anything else. 

In [128]:
# TODO: create an RDD of LabeledPoint with 0 for english and 1 for any other language
vector_lp = vector_rdd.map(lambda (gid, vec): LabeledPoint(0 if meta_b.value[gid]['lang'] == 'en' else 1, vec))

The Spark machine learning library provides a simple method for creating training and validation sets, which we will use below. These will be our inputs for the logistic regression model fitting -- it is *always* a good idea to cache the inputs, since the training requires many iterations over the data. 

In [129]:
training, validation = vector_lp.randomSplit([0.7,0.3])
training.cache()
validation.cache()

PythonRDD[42] at RDD at PythonRDD.scala:43

#### Pass the training set to the model

Here we will use the basic logistic regression model with stochastic gradient descent -- feel free to experiment with different parameters and other models from [pyspark.mllib.classification](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.classification).
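
As a reminder of what such a model does at prediction time, here is a minimal plain-Python sketch of a logistic regression decision rule: a weighted sum of the features plus an intercept is passed through the logistic function and compared with a threshold. The weights, intercept and feature vectors are made up for illustration and have nothing to do with the model trained below.

```python
import math

def predict(weights, intercept, features, threshold=0.5):
    """Return 1 if the logistic score exceeds `threshold`, else 0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    score = 1.0 / (1.0 + math.exp(-margin))
    return 1 if score > threshold else 0

# made-up weights for a three-feature toy model
weights = [-1.5, 2.0, 0.3]
intercept = -0.2

print(predict(weights, intercept, [2.0, 0.0, 1.0]))
print(predict(weights, intercept, [0.0, 2.0, 1.0]))
```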

In [130]:
model = LogisticRegressionWithSGD.train(training, regType='l1')

To check the performance of the model, we define a function that takes a data RDD and a model as parameters and computes the error:

In [131]:
def check_model(data, model) : 
    """Calculates the model error on the data
    
    Arguments: 
        
        data: the data RDD
        
        model: the classification model
        
    Returns:
    
        the error, which is the fraction of incorrectly predicted elements
    """
    
    error = (data.map(lambda p: (p.label, model.predict(p.features)))
                 .filter(lambda (v,p): v!=p).count())/float(data.count())
    return error
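
The quantity computed by `check_model` is simply the misclassification rate. As a plain-Python illustration on made-up labels and predictions (not results from the model), with a hypothetical helper `error_rate`:

```python
def error_rate(labels, predictions):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for v, p in zip(labels, predictions) if v != p)
    return wrong / float(len(labels))

# made-up values for illustration only
labels      = [0, 0, 1, 1, 0, 1, 0, 0]
predictions = [0, 0, 1, 0, 0, 1, 0, 1]
print(error_rate(labels, predictions))
```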

In [132]:
train_error = check_model(training, model)
print("Training Error = " + str(train_error))
validation_error = check_model(validation, model)
print("Validation Error = " + str(validation_error))

Training Error = 0.00856793145655
Validation Error = 0.0061919504644


For fun, let's create a function that will score a new string of text: 

In [133]:
def predict_language(text, model) : 
    """Predict the language given the pre-trained model
    
    Arguments: 
        text: a string
        
        model: the trained logistic regression model
    """
    # use the same ngram_range as in training so that the features match
    vec = vectorize_doc(text.replace(' ','_'), top_ngrams_dict_b.value, ngram_range)
    
    return model.predict(vec)

Try it out! Enter your own sentence:

In [134]:
text = "a dog bit me!"
predict_language(text, model)

0

In [135]:
sc.stop()