# Lab Sheet 2 - Solutions: Extracting Word Frequency Vectors with Spark

These tasks are for the lab session and for working on during the week. In task 1) we do a bit of word preprocessing; in task 2) we load a number of files and go through the processing steps to extract word frequencies. 


## 1) Word preparation

Define your own mapper function for removing the plural “s” at the end of words and turning them to lower case as a rough approximation towards stemming. 

Use the Python `def` syntax ([see here](https://docs.python.org/release/1.5.1p1/tut/functions.html)) to define your own function `stripFinalS(word)` that takes a word as its argument and returns the word in lower case without a trailing “s”.

For this task, you can treat strings as lists and apply "list slicing": <br>
`lst[0:3] # the first three elements` <br>
`lst[:-2] # all but the last two elements`

You need to check that the string is not empty (test `len(word)`) before accessing the letters in the string, otherwise you'll raise an exception.
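
As a quick illustration of the slicing syntax and of why the length test matters, here is a minimal sketch in plain Python. The sample strings are arbitrary and not part of the lab data.

```python
# Strings support the same slicing syntax as lists.
word = "houses"
print(word[0:3])   # the first three characters
print(word[:-2])   # all but the last two characters
print(word[-1])    # the last character

# Indexing into an empty string raises an IndexError,
# which is why the length should be tested first.
empty = ""
try:
    empty[-1]
except IndexError as err:
    print("IndexError:", err)

# Slicing, unlike indexing, does not fail on an empty string.
print(repr(empty[:-1]))
```

Only the direct index access (`word[-1]`) needs the guard; the slice on its own would simply return an empty string.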

In [18]:
def stripFinalS(word):
#>>> add code here
    wordl = word.lower() # lower case
    if len(wordl) > 1 and wordl[-1] == 's': # check length and final letter
        return wordl[:-1] # remove final letter
    else:
        return wordl # return unchanged
    
print(stripFinalS('Houses')) # for testing, should return 'house'

house


### Comments
- Slicing the Python string: `word[-1]` gives the last letter, `word[:-1]` gives all letters from the beginning up to, but not including, the last. For more information, look [here](https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists) in section 3.1.2 for 'slice'.
- Checking that the word string is not empty (`len(word)>0`) is useful. I filter empty strings out in the code below, but there is no guarantee that this happens in other contexts. The solution tests `len(wordl) > 1`, which also leaves a lone 's' unchanged.
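
To see how rough this approximation to stemming is, it can help to run the function on a few edge cases. The sketch below repeats the solution's logic so that it runs on its own; the test words are chosen for illustration only.

```python
def stripFinalS(word):
    """Lower-case the word and drop one trailing 's' (same logic as the solution)."""
    wordl = word.lower()
    if len(wordl) > 1 and wordl[-1] == 's':
        return wordl[:-1]
    return wordl

# A plural, a verb form, a word ending in double 's', a lone 's', an empty string
for w in ['Houses', 'is', 'glass', 's', '']:
    print(repr(w), '->', repr(stripFinalS(w)))
```

Note that 'is' becomes 'i' and 'glass' becomes 'glas', so a word count built on this function merges 'is' with 'I'. The empty string and the single letter 's' pass through unchanged.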

Add your new function into the word count example below for testing, replacing the `lower()` method.

In [19]:
from operator import add
import re

linesRDD = sc.textFile("./City-Data-Science/library/hamlet.txt") # read text as RDD
wordsRDD = linesRDD.flatMap(lambda line: re.split(r'\W+',line)) # split words, break lists (raw string for the regex)
wordsFilteredRDD = wordsRDD.filter(lambda word: len(word)>0) # filter empty words out
#>>>
words1RDD = wordsFilteredRDD.map(lambda word: (stripFinalS(word),1)) # lower case, (w,1) pairs
wordCountRDD = words1RDD.reduceByKey(add) # reduce and add up counts
freqWordsRDD = wordCountRDD.filter(lambda x:  x[1] >= 3 ) # remove rare words
output = freqWordsRDD.sortBy(lambda x: -x[1]).take(10) # collect 10 most frequent words
for (word, count) in output: # iterate over (w,c) pairs
    print("%s: %i" % (word, count)) #  … and print

the: 1218
i: 1021
and: 1019
to: 834
a: 817
of: 733
you: 610
my: 516
in: 464
it: 442


### Comment
This task is very similar to last week, no surprises ... ;-)
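
For checking the logic without a Spark context, the same steps can be mimicked in plain Python. This is only a local sketch: the two sample lines are made up, and list comprehensions and `Counter` stand in for the RDD operations.

```python
import re
from collections import Counter

def stripFinalS(word):
    """Lower-case the word and drop one trailing 's' (same logic as the solution)."""
    wordl = word.lower()
    if len(wordl) > 1 and wordl[-1] == 's':
        return wordl[:-1]
    return wordl

# Hypothetical sample lines standing in for the lines of a text file
lines = ["The houses, the house!", "A house is a house."]

words = [w for line in lines for w in re.split(r'\W+', line)]  # like flatMap
words = [w for w in words if len(w) > 0]                       # like filter
counts = Counter(stripFinalS(w) for w in words)                # like map + reduceByKey(add)
top = sorted(counts.items(), key=lambda x: -x[1])[:3]          # like sortBy(...).take(3)
for word, count in top:
    print("%s: %i" % (word, count))
```

Unlike the RDD version, everything here runs in a single process, so it is only useful for testing the individual functions on small inputs.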


# 2) Extracting word frequency vectors from text documents

Now we start a new script, which reads in a whole directory with text files and extracts word frequency information.

This involves some tuple restructuring and list transformation. It is important to use meaningful variable names. It also helps to use pen and paper (or a text editor) to write down the structures that you intend to create. Keep in mind the final goal of getting a list of words and their frequencies for each file, i.e. (filename,[(w,c), ... , (w,c)]). 


## 2a) Load the files
Load all text files in the directory /data/student/bigdatastud/library on the server lewes using sc.wholeTextFiles <br>(see [http://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles](http://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles)). This will create an RDD with tuples of the structure (filepath,content), where content is the whole text from the file. 

In [20]:
dirPath = "./City-Data-Science/library/"
fw_RDD = sc.wholeTextFiles(dirPath) #<<< add code to create an RDD with wholeTextFiles
print("partitions: ", fw_RDD.getNumPartitions()) # on IBM DSX we have 2 executors by default with one partition each
print("elements: ", fw_RDD.count())

partitions:  2
elements:  19


### Comment
Straightforward, just read from the given path. Reads all the files into (path,content) pairs.
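
To make the `(path,content)` structure concrete, here is a local analogue in plain Python. The helper `whole_text_files` is hypothetical and only mimics the shape of the result; Spark additionally distributes the data and returns paths as URIs such as `file:/...`.

```python
import tempfile
from pathlib import Path

def whole_text_files(dir_path):
    """Return a list of (path, content) pairs, one per file in dir_path (hypothetical helper)."""
    files = sorted(p for p in Path(dir_path).iterdir() if p.is_file())
    return [(str(p), p.read_text(encoding="utf-8")) for p in files]

# Build a throwaway directory with two tiny files for demonstration
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first file", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second file", encoding="utf-8")
    pairs = whole_text_files(tmp)

for path, content in pairs:
    print(Path(path).name, "->", content)
```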

## 2b) Split the RDD elements using flatMap to get (filename, word) elements.

For this, define a function that takes a pair `(filename,content)` and outputs a list of pairs `[(filename, word1), ...(filename, wordN)]`. You can get the words as usual with `re.split(r'\W+',x)`. 

Iterate through the word list in a for loop and append the (filename,word) tuples to a new list. The same result can also be written as a list comprehension (see http://www.pythonforbeginners.com/basics/list-comprehensions-in-python).  

Below is a template; you need to fill in the lines marked with `<<<`.

In [21]:
def splitFileWords(filenameContent): # your splitting function
    f,c = filenameContent # split the input tuple  
    fwLst = [] # the new list for (filename,word) tuples
    wLst = re.split(r'\W+',c) # <<< now create a word list wLst
    for w in wLst: # iterate through the list
        if len(w) > 0: # skip empty strings
            fwLst.append((f,w)) # <<< and append (f,w) to the list
    return fwLst #return a list of (f,w) tuples 
    
fw_RDD = fw_RDD.flatMap(splitFileWords)
fw_RDD.take(3)

[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/prideandpredjudice.txt',
  'The'),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/prideandpredjudice.txt',
  'Project'),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/prideandpredjudice.txt',
  'Gutenberg')]

### Comments
- Building the list `fwLst` is the main new concept here. 
- Creating tuples with round brackets `()` is a technique that is frequently used.
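
The splitting function can also be written with a list comprehension instead of the explicit loop. The sketch below shows that variant and, on made-up sample pairs, how `map` and `flatMap` differ; plain list comprehensions stand in for the RDD operations.

```python
import re

def splitFileWords(filenameContent):
    """Turn one (filename, content) pair into a list of (filename, word) pairs."""
    f, c = filenameContent
    return [(f, w) for w in re.split(r'\W+', c) if len(w) > 0]

# Hypothetical sample pairs standing in for the RDD elements
pairs = [("a.txt", "To be, or not"), ("b.txt", "to be")]

mapped = [splitFileWords(p) for p in pairs]             # like map: one list per file
flat = [fw for p in pairs for fw in splitFileWords(p)]  # like flatMap: the lists are merged

print(mapped)
print(flat)
```

With `map` the result would contain one list per file, while `flatMap` yields the individual `(filename, word)` tuples that the later steps need.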

Now use filter to keep only the tuples with stopwords (remember, the words are now the 2nd element of the tuple).

In [22]:
stopwlst = ['the','a','in','of','on','at','for','by','I','you','me'] # stopword list
fw_RDD2 = fw_RDD.filter(lambda x: x[1] in stopwlst) #<<< filter, keeping only stopwords
fw_RDD2.top(3)

[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
  'you'),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
  'you'),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
  'you')]

### Comment
- With `RDD.filter()`, it is important that the provided function returns a boolean.
- Important: `filter` keeps only the elements for which the provided function (here a lambda) returns `True`.
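
The behaviour of the predicate can be tried out locally with Python's built-in `filter`, used here as a stand-in for `RDD.filter()`. The sample pairs are made up. The case-insensitive variant at the end is an assumption about what one might want, not part of the lab solution.

```python
stopwlst = ['the', 'a', 'in', 'of', 'on', 'at', 'for', 'by', 'I', 'you', 'me']

# Hypothetical (filename, word) pairs
fw = [("a.txt", "The"), ("a.txt", "the"), ("a.txt", "I"), ("a.txt", "i"), ("a.txt", "king")]

# The predicate returns True for elements to keep and False for elements to drop.
kept = list(filter(lambda x: x[1] in stopwlst, fw))
print(kept)  # membership tests are case-sensitive, so 'The' and 'i' are dropped

# Possible variant (an assumption, not the lab solution): compare in lower case
stopset = {w.lower() for w in stopwlst}
kept_ci = [x for x in fw if x[1].lower() in stopset]
print(kept_ci)
```

Because the words have not been lower-cased at this point, the counts in the following steps cover only the exact spellings in `stopwlst`.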


## 2c) Count the words and reorganise the tuples to count: ((filename,word), count)

Now you can package the elements into tuples with 1s and use reduceByKey(add) to get the counts of the words per filename, similar to last week and in task 1 above.

In [23]:
fw_1_RDD = fw_RDD2.map(lambda x: (x,1))  #<<< change (f,w) to ((f,w),1)
fw_c_RDD = fw_1_RDD.reduceByKey(add) #<<< count the words
fw_c_RDD.top(3)
# the printed elements should look similar to this:
# (('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
#   'you'),
#  260)

[(('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
   'you'),
  260),
 (('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
   'the'),
  695),
 (('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
   'on'),
  85)]

### Comment 
This example follows the word count example, with the difference of keeping the filename in addition to the word.
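
What `reduceByKey(add)` does with the `((filename,word),1)` pairs can be sketched in plain Python. The helper `reduce_by_key` is a hypothetical local stand-in, not Spark's implementation, and the sample pairs are made up.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func (hypothetical stand-in for RDD.reduceByKey)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())

# Hypothetical (filename, word) pairs
fw = [("a.txt", "the"), ("a.txt", "the"), ("a.txt", "of"), ("b.txt", "the")]

fw_1 = [(x, 1) for x in fw]      # (f,w) -> ((f,w),1)
fw_c = reduce_by_key(fw_1, add)  # ((f,w),1) -> ((f,w),count)
print(fw_c)
```

The whole `(filename, word)` tuple acts as the key, which is why the same word is counted separately for each file.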

## 2d) Creating and concatenating lists

As a next step, map the `((filename,word),count)` elements to a `( filename, [ (word, count) ])` structure, i.e. rearrange the tuple and wrap a list around the inner tuple (just by writing square brackets). For this, create a function `reGrpLst` to regroup and create a list. Check that the output has the intended structure.

In [24]:
def reGrpLst(fw_c): # we get a nested tuple
    fw,c = fw_c
    f,w = fw
    return (f,[(w,c)]) # return (f,[(w,c)]) structure. Can be used verbatim, if your variable names match.

f_wcL_RDD = fw_c_RDD.map(reGrpLst) 
f_wcL_RDD.top(3)

[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
  [('you', 260)]),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
  [('the', 695)]),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/tempest.txt',
  [('on', 85)])]

Next we can concatenate the lists per filename using reduceByKey(). Write a lambda that concatenates the lists per filename (or pass `add`, which does the same).  Concatenation of lists in Python is done with '+', e.g.  `[1,2] + [3,4]` returns `[1,2,3,4]`.

### Comments
Here we have a new technique: creating lists instead of tuples (using `[]` instead of `()`). The approach is similar to that of word counting, but adding lists (with `+` or `add`) means concatenating them, so that we produce a long list.  
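
The regrouping and concatenation can be traced on a few made-up elements in plain Python. This is a local sketch in which a dictionary stands in for `reduceByKey(add)`.

```python
from operator import add

print(add([1, 2], [3, 4]))  # same as [1, 2] + [3, 4]

# Hypothetical ((filename, word), count) elements
fw_c = [(("a.txt", "the"), 2), (("a.txt", "of"), 1), (("b.txt", "the"), 1)]

# Regroup to (filename, [(word, count)])
f_wcL = [(f, [(w, c)]) for ((f, w), c) in fw_c]

# Concatenate the one-element lists per filename
merged = {}
for f, wcL in f_wcL:
    merged[f] = add(merged[f], wcL) if f in merged else wcL
print(merged)
```

In Spark the order of the pairs inside each concatenated list depends on how the partitions are combined, so it should not be relied upon.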

In [25]:
f_wcL2_RDD = f_wcL_RDD.reduceByKey(add) #<<< create [(w,c), ... ,(w,c)] lists per file 

In [26]:
output = f_wcL2_RDD.collect() 
for el in output[1:4]:
    print(el)
    print()

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/king_lear.txt', [('me', 228), ('I', 705), ('for', 130), ('on', 104), ('you', 412), ('a', 364), ('the', 746), ('by', 84), ('of', 495), ('in', 280), ('at', 66)])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/othello.txt', [('by', 106), ('me', 278), ('a', 432), ('on', 123), ('I', 874), ('you', 490), ('the', 699), ('for', 189), ('of', 503), ('at', 70), ('in', 322)])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/emma.txt', [('on', 688), ('by', 576), ('me', 574), ('for', 1342), ('the', 5006), ('a', 3062), ('I', 3200), ('you', 1743), ('at', 1013), ('in', 2172), ('of', 4389)])



## Extra task: Creating Hash Vectors

If we want to compare the word counts for different files, and in particular if we want to use not just the stopwords, we need to represent them as vectors of the same dimensionality. For this we use the 'Hashing Trick' shown in the lecture. 

Start by writing a function that takes a (word,count) list and transforms it into a vector of fixed size. For that you need to take the hash value of each word modulo the size (`hash(word) % size`) to get a position in the vector, and add up the counts of all words that map to the same position. 

In [27]:
def hashWcList(lst,size):
    lst2 = [0] * size # create a vector of the needed size filled with '0's
    for (w,c) in lst: # for every (word,count) pair in the given list
        lst2[hash(w)%size] += c # add the count to the position where the word gets hashed to. 
    return lst2 # return the new list, containing only numbers
        
hashWcList([('this',23),('is',12),('a',34),('little',13),('test',24)],5) # for testing
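# output should look similar to this: [36, 0, 36, 0, 34] (positions depend on hash(), which Python 3 randomises for strings per process unless PYTHONHASHSEED is set)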

[36, 0, 36, 0, 34]

### Comments
This method represents every text document as a compact vector of fixed dimension. 
Vectors like this can be used for finding documents in databases, grouping them by similarity, studying writing styles etc. 
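
One caveat: in Python 3 the built-in `hash()` of a string changes from one interpreter process to the next unless `PYTHONHASHSEED` is fixed, so vectors from different runs may not be comparable. A possible workaround, sketched below, is to use a deterministic hash function. The choice of CRC32 and the names `stable_hash` and `hashWcListStable` are assumptions for illustration, not part of the lab solution.

```python
import zlib

def stable_hash(word):
    """Deterministic hash of a string (assumption: CRC32 as a replacement for hash())."""
    return zlib.crc32(word.encode('utf-8'))

def hashWcListStable(lst, size):
    """Like hashWcList, but with a hash that is the same in every process."""
    vec = [0] * size
    for w, c in lst:
        vec[stable_hash(w) % size] += c  # add the count at the hashed position
    return vec

wc = [('this', 23), ('is', 12), ('a', 34), ('little', 13), ('test', 24)]
vec = hashWcListStable(wc, 5)
print(vec)
print(sum(vec) == sum(c for _, c in wc))  # the total count is preserved
```

Whichever hash function is used, the sum of the vector equals the sum of the counts, because every count is added to exactly one position.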

In [28]:
f_hv_RDD = f_wcL2_RDD.map(lambda f_wcl: (f_wcl[0],hashWcList(f_wcl[1],10)))
output = f_hv_RDD.collect()
for el in output[1:4]:
    print(el)
    print()
# now we can display a hashed vector for every text file 

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/king_lear.txt', [561, 496, 280, 705, 0, 332, 0, 876, 0, 364])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/prideandpredjudice.txt', [4464, 1824, 1850, 2066, 0, 1122, 0, 5280, 0, 1962])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/emma.txt', [5402, 2319, 2172, 3200, 0, 1262, 0, 6348, 0, 3062])
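
The hashed vectors printed above can be compared numerically. The sheet does not prescribe a similarity measure; cosine similarity is used below as one common choice (an assumption). The vectors are copied from the output above, with shortened names as dictionary keys.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hashed vectors taken from the output above
vectors = {
    'king_lear': [561, 496, 280, 705, 0, 332, 0, 876, 0, 364],
    'prideandpredjudice': [4464, 1824, 1850, 2066, 0, 1122, 0, 5280, 0, 1962],
    'emma': [5402, 2319, 2172, 3200, 0, 1262, 0, 6348, 0, 3062],
}

for a, b in [('king_lear', 'emma'), ('prideandpredjudice', 'emma')]:
    print(a, b, round(cosine_similarity(vectors[a], vectors[b]), 3))
```

Cosine similarity ignores the overall length of the vectors, which is useful here because the texts differ considerably in size.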



### Extra Demos: extracting file names with regular expressions and creating a DataFrame to show with Pixiedust
We first use splitting with regular expressions to get just the filename without the path and the extension. 
Then we convert the RDD into a DataFrame.

In [29]:
fn_hv_RDD = f_hv_RDD.map(lambda x: (re.split(r'[/.]',x[0])[-2],x[1])) # split on '/' and '.', keep the part before the extension
fn_hv_RDD.take(3)

[('julius_cesar', [458, 460, 214, 531, 0, 272, 0, 695, 0, 241]),
 ('king_lear', [561, 496, 280, 705, 0, 332, 0, 876, 0, 364]),
 ('othello', [573, 596, 322, 874, 0, 401, 0, 888, 0, 432])]
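
The splitting step can be inspected on a single path. The sketch below uses a made-up path and, for comparison, an alternative based on `posixpath` from the standard library. That alternative is an assumption about what might be preferable for file names containing extra dots, not part of the lab solution.

```python
import re
import posixpath

path = 'file:/some/dir/library/king_lear.txt'  # hypothetical path

parts = re.split(r'[/.]', path)  # split on every '/' and '.'
print(parts)
print(parts[-2])                 # second-to-last part: the name before the extension

# Alternative: take the last path component, then strip the extension
name = posixpath.splitext(posixpath.basename(path))[0]
print(name)

# A file name with an extra dot shows where the two approaches differ
dotted = 'file:/some/dir/library/vol.1.txt'
print(re.split(r'[/.]', dotted)[-2])
print(posixpath.splitext(posixpath.basename(dotted))[0])
```

For the file names in this lab both approaches give the same result; they differ only when the name itself contains a dot.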

In [None]:
import pixiedust
display(fn_hv_RDD.toDF(['title', 'vec']))

title,vec
julius_cesar,"[458, 460, 214, 531, 0, 272, 0, 695, 0, 241]"
king_lear,"[561, 496, 280, 705, 0, 332, 0, 876, 0, 364]"
othello,"[573, 596, 322, 874, 0, 401, 0, 888, 0, 432]"
emma,"[5402, 2319, 2172, 3200, 0, 1262, 0, 6348, 0, 3062]"
romeo_and_juliet,"[564, 428, 342, 655, 0, 345, 0, 955, 0, 462]"
persuasion,"[3083, 959, 1346, 1124, 0, 574, 0, 3815, 0, 1529]"
mansfield_park,"[5912, 2099, 2501, 2364, 0, 1140, 0, 7512, 0, 3065]"
northanger_abbey,"[2835, 1282, 1222, 1285, 0, 613, 0, 3629, 0, 1474]"
macbeth,"[454, 293, 198, 346, 0, 184, 0, 730, 0, 258]"
lady_susan,"[925, 462, 391, 804, 0, 337, 0, 994, 0, 365]"
