# Lab Sheet 3: Extracting Word Frequency Vectors with Spark

These tasks are for the lab session and for the rest of the week. We will use the same data as last week (19 files in './City-Data-Science/library/') together with some more RDD functions, and apply two different approaches to creating and using fixed-size vectors.

First update the repo.

In [None]:
!git clone https://github.com/tweyde/City-Data-Science.git
%cd City-Data-Science/
!git pull https://github.com/tweyde/City-Data-Science.git
%cd ..

Here is code from last week that we run first, and then extend. 

In [None]:
import re 

def stripFinalS( word ):
    word = word.lower() # lower case
    if len(word) >0 and word[-1] == 's': # check for final letter
        return word[:-1]
    else:
        return word
    
def splitFileWords(filenameContent): # your splitting function
    f,c = filenameContent # split the input tuple  
    fwLst = [] # the new list for (filename,word) tuples
    wLst = re.split(r'\W+',c) # <<< now create a word list wLst (raw string avoids an invalid-escape warning)
    for w in wLst : # iterate through the list
        fwLst.append((f,stripFinalS(w))) # and append (f,w) to the list
    return fwLst # return a list of (f,w) tuples 

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

dirPath = './City-Data-Science/library/' #  path
ft_RDD = sc.wholeTextFiles(dirPath) # create an RDD with wholeTextFiles
fnt_RDD = ft_RDD.map(lambda ft: (re.split(r'[/\.]',ft[0])[-2],ft[1])) # just take filenames, 
                                                # drop path and extension for readability
fw_RDD1 = fnt_RDD.flatMap(splitFileWords) # split words per file, strip final 's'
fw_RDD = fw_RDD1.filter(lambda fw: fw[1] not in ['','project','gutenberg', 'ebook']) # get rid of some unwanted words
fw_RDD.take(3)
# output should look like this: [('emma', 'the'), ('emma', 'of'), ('emma', 'emma')]
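
To see why the empty string is filtered out above, here is a minimal plain-Python sketch (no Spark needed) of what the splitting and stripping steps do to a short string. The sample text and the names `strip_final_s`, `sample`, `raw_tokens` and `tokens` are made up for this illustration.

```python
import re

def strip_final_s(word):
    """Lower-case a word and drop one trailing 's', like stripFinalS above."""
    word = word.lower()
    return word[:-1] if word.endswith('s') else word

sample = "The Works of Jane Austen."      # hypothetical input text
raw_tokens = re.split(r'\W+', sample)     # split on runs of non-word characters
tokens = [strip_final_s(w) for w in raw_tokens]

print(raw_tokens)  # ['The', 'Works', 'of', 'Jane', 'Austen', '']
print(tokens)      # ['the', 'work', 'of', 'jane', 'austen', '']
```

The text ends with a separator, so `re.split` returns an empty string as the last element; this is the `''` that the filter removes.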

## 1) Warm-up
Let's start with a few small tasks, to become more fluent with RDDs and lambda expressions.

a) Count the number of documents.
b) Determine the number of distinct words in total (the vocabulary size) using RDD.distinct(). This involves removing the filenames f from the (f,w) pairs and getting the RDD size (with RDD.count()). 
c) Get the number of words (including repeated ones) per book. 
d) Determine the number of distinct words per book. This involves determining the distinct (f,w) pairs, getting a list of words per file, and getting the list size.
e) Compute the average number of occurrences per word per file (words/vocabulary). Use RDD.join() to get both numbers into one RDD. 

Remember that `>>>` indicates a line where you should do something - you need to remove it for any code to work. 
Typically, you'll find a `...` placeholder in that line at the place where you should add the code.  
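
If the RDD operations feel abstract, it can help to look at what they compute on a tiny in-memory list. The following is a plain-Python sketch, not Spark code and not the solution: `pairs` is made-up data, and the sets and dictionaries only mimic what `distinct()`, `reduceByKey(add)` and `join()` would return for it.

```python
from collections import Counter

# hypothetical (filename, word) pairs, standing in for a very small fw_RDD
pairs = [('a', 'the'), ('a', 'cat'), ('a', 'the'), ('b', 'the'), ('b', 'dog')]

# like RDD.distinct(): keep each (f, w) pair only once
unique_pairs = set(pairs)

# like mapping to (f, 1) and then reduceByKey(add): count pairs per file
words_per_file = Counter(f for f, _ in pairs)
vocab_per_file = Counter(f for f, _ in unique_pairs)

# like RDD.join(): combine the values that share a key into (key, (v1, v2))
joined = {f: (words_per_file[f], vocab_per_file[f]) for f in words_per_file}

print(joined)  # {'a': (3, 2), 'b': (2, 2)}
```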

In [None]:
# a) Library size
>>>print("Number of documents: ",ft_RDD ...) # count the number of docs

# b) Vocabulary size
>>>w_RDD = fw_RDD.map( ... ) # remove the file names, keep just the words
>>>w_RDDu = w_RDD # keep only one unique instance of every word
print('Total vocabulary size: ',w_RDDu.count()) # 


# c) words per book
from operator import add
>>>f1_RDD = fw_RDD.map(lambda fw: ...) # swap and wrap (f,w) to (f,1)
>>>fc_RDD = f1_RDD.reduceByKey(... ) # add the 1s up
print('Words per book: ',fc_RDD.take(3))
# extra task: try also to express this with a single function that appeared in today's lecture

# d) Vocabulary per book
fw_RDDu = fw_RDD.distinct() # get unique (f,w) pairs - i.e. every word only once per file. I use postfix u to mark 'unique'
f1_RDDu = fw_RDDu.map(lambda fw: (fw[0],1)) # wrap (f,w) to (f,1)
fcu_RDD = f1_RDDu.reduceByKey(add) # add the 1s up
print('Vocabulary per book: ',fcu_RDD.take(3)) 
# extra task: try also replacing the map and reduce by one function 

# e) Average occurrences of words per book (i.e. words/vocab per book)
>>>f_wv_RDD = fc_RDD ... fcu_RDD # join the two RDDs to get (f,(w,v)) tuples
print(f_wv_RDD.take(3)) 
>>>f_awo_RDD = f_wv_RDD.map(lambda f_wv: (f_wv[0], ... )) # this is the tricky part. 
            # Resolve nested tuples in the lambda to get (filename,words/vocab) tuples
print('Average word occurrences: ',f_awo_RDD.take(3))
# should look like this [('henry_V', 6.237027812370278), ('king_lear', 7.815661103979461), ('lady_susan', 8.531763947113834)]


## 2) Fixed vectors: Reduced vocabulary approach

The first task in this lab is to use a reduced vocabulary, only the stopwords from a list, to make sure that we have a fixed size vector. This is a common approach in stylometry. The problem is that some stopwords might not appear in some documents. We will deal with that by creating an RDD with ((f,w),0) tuples that we then merge with the ((f,w),count) RDD. 

Start by running the code above, then you can add 1s and use reduceByKey(add) like last week to get the counts of the words per filename.

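Then make sure that all stopwords are present for every file by creating a new RDD that contains the keys of fw_RDD, i.e. the filenames, using the keys() method of class RDD. Note that keys() returns one filename per (f,w) pair, so filenames are repeated; this is harmless here because the added values are 0, but calling distinct() first avoids creating redundant tuples. Then you can use flatMap with a list comprehension to create a [((filename,stopword),0), ...] list per filename. The values must be 0s rather than 1s, as we don't want to add extra counts.
The RDD with ((filename,stopword),0) tuples can then be merged with the RDD of ((filename,word),1) tuples using union(). Then you can count as normal.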
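
The effect of the zero-count entries can be checked on toy data without Spark. This sketch uses a shortened, made-up stopword list, made-up filenames and made-up occurrences; the list concatenation and the dictionary loop only stand in for `union()` and `reduceByKey(add)`.

```python
stopwords = ['the', 'a', 'in']     # shortened, hypothetical stopword list
files = ['doc1', 'doc2']           # hypothetical filenames

# ((f, w), 1) entries for the stopwords that actually occur
ones = [(('doc1', 'the'), 1), (('doc1', 'the'), 1), (('doc2', 'a'), 1)]
# ((f, sw), 0) entries for every file/stopword combination
zeros = [((f, sw), 0) for f in files for sw in stopwords]

counts = {}
for key, value in ones + zeros:                  # plays the role of union()
    counts[key] = counts.get(key, 0) + value     # plays the role of reduceByKey(add)

print(len(counts))               # 6: every (file, stopword) key is present
print(counts[('doc1', 'the')])   # 2: the zeros did not change the count
print(counts[('doc2', 'in')])    # 0: a missing stopword is now an explicit zero
```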

In [None]:
from operator import add

stopwlst = ['the','a','in','of','on','at','for','by','i','you','me'] # stopword list
>>>fw_RDD2 = fw_RDD.filter(lambda x: ... ) # filter, keeping only stopwords

>>>fsw_0_RDD = fw_RDD.keys().flatMap(lambda f: [((f,sw),0) for sw in stopwlst])
print(fsw_0_RDD.take(3))

>>>fw_1_RDD = fw_RDD2.map(lambda fw: ...)  #<<< change (f,w) to ((f,w),1)
print(fw_1_RDD.take(3))

>>>fw_10_RDD = fw_1_RDD ... fsw_0_RDD #<<< create the union on the two RDDs
print(fw_10_RDD.take(3))

fw_c_RDD = fw_10_RDD.reduceByKey(add) #<<< count the words
print(fw_c_RDD.take(3))
# output should look like this:
#[(('emma', 'the'), 0), (('emma', 'a'), 0), (('emma', 'in'), 0)]
#[(('emma', 'the'), 1), (('emma', 'of'), 1), (('emma', 'by'), 1)]
#[(('emma', 'the'), 1), (('emma', 'of'), 1), (('emma', 'by'), 1)]
#[(('emma', 'the'), 5380), (('emma', 'by'), 591), (('emma', 'you'), 2068)]

## 3) Creating sorted lists

As a next step, map the `((filename,word),count)` to `( filename, [ (word, count) ])` using the function `reGrpLst` to regroup and create a list. 

Then sort the [(word,count),...] lists in the values (i.e. the 2nd part of the tuple) with the words as keys. Have a [look at the Python docs](https://docs.python.org/3.5/library/functions.html?highlight=sorted#sorted) for how to do this. Hint: use a lambda that extracts the words as the key, e.g. `sorted(f_wcL[1], key = lambda wc: ... )`.
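
As a reminder of how the `key` argument of `sorted` works, this small sketch sorts a made-up list of (word, count) pairs by count. In the task itself the key should pick out the word instead; the data and variable names here are hypothetical.

```python
wc_list = [('of', 7), ('a', 12), ('the', 30)]   # hypothetical (word, count) pairs

# key receives one list element and returns the value to sort by
by_count = sorted(wc_list, key=lambda wc: wc[1], reverse=True)

print(by_count)  # [('the', 30), ('a', 12), ('of', 7)]
```

Sorting every file's list by the same key puts the count for a given word at the same position in every list, which is what makes the resulting vectors comparable.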

In [None]:
def reGrpLst(fw_c): # we get a nested tuple
    >>>     # split the outer tuple
    >>>     # split the inner tuple
    return (f,[(w,c)]) # return (f,[(w,c)]) structure. Can be used verbatim, if your variable names match.


>>>f_wcL_RDD = fw_c_RDD.map(reGrpLst) # apply reGrpLst
f_wcL2_RDD = f_wcL_RDD.reduceByKey(add) # create [(w,c), ... ,(w,c)] lists per file 
>>>f_wcLsort_RDD = f_wcL2_RDD.map(lambda f_wcL: (f_wcL[0], sorted(...))) #<<< sort the word count lists by word
print(f_wcLsort_RDD.take(3)) 
>>>f_wVec_RDD = f_wcLsort_RDD.map(lambda f_wc: (f_wc[0],[float(c) for ...])) # remove the words from the wc pairs and convert the numbers to floats
f_wVec_RDD.take(3)
# output:
# [('lady_susan', [('a', 611), ('at', 161), ('by', 152), ('for', 262), ('i', 1106), ('in', 402), ('me', 200), ('of', 787), ...

## 4) Clustering

Now that we have feature vectors of fixed size, we can use KMeans as provided by Spark.

The files in our library are by two authors. After clustering, check if the clusters reflect authorship:

WILLIAM SHAKESPEARE: 
merchant_of_venice, 
richard_III, 
midsummer,
tempest,
romeoandjuliet,
othello,
henry_V,
macbeth,
king_lear,
julius_cesar,
hamlet

JANE AUSTEN:
mansfield_park,
emma,
northanger_abbey,
lady_susan,
persuasion,
prideandpredjudice,
senseandsensibility

In [None]:

from math import sqrt

from pyspark.mllib.clustering import KMeans #, KMeansModel

#print('f_wVec_RDD.take(2): ', f_wVec_RDD.take(1))
>>>wVec_RDD = f_wVec_RDD.map(lambda f_wcl: ...) # strip the filenames, keep only the vectors
#print(wVec_RDD.take(3))

# Build the model (cluster the data)
clusterModel = KMeans.train(wVec_RDD, 2, maxIterations=10, initializationMode="random")

# Assign the filenames to the clusters
fc_RDD = f_wVec_RDD.map(lambda fv: (fv[0],clusterModel.predict(fv[1])))
for s in fc_RDD.collect():
    print(s)

# Evaluate clustering by computing the Within Set Sum of Squared Errors
def error(point):
    center = clusterModel.centers[clusterModel.predict(point)]
    # squared Euclidean distance to the assigned centre (no square root, as we sum squared errors)
    return sum([x**2 for x in (point - center)])

WSSSE = wVec_RDD.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
# now check if the clusters match the authors
# output:
#('lady_susan', 1)
#('macbeth', 1)
#('merchant_of_venice', 1)
#('othello', 1)
#('persuasion', 0)
#('emma', 0)
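
For reference, the within-set sum of squared errors adds up, over all points, the squared Euclidean distance to the nearest cluster centre. The following is a minimal plain-Python sketch with made-up 2-D points and centres; it does not use the Spark model, and all names in it are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the centre that is closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]   # hypothetical data
centers = [[0.0, 1.0], [10.0, 11.0]]                            # hypothetical centres

assignments = [nearest(p, centers) for p in points]
wssse = sum(squared_distance(p, centers[c]) for p, c in zip(points, assignments))

print(assignments)  # [0, 0, 1, 1]
print(wssse)        # 4.0
```

A lower value means that the points lie closer to their centres, so the measure is useful for comparing runs with the same number of clusters.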

## 5) Alternative approach: feature hashing

Instead of the previous approach, we now use feature hashing, as introduced last week.
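
The idea behind feature hashing can be tried out in plain Python. The sketch below is an illustration under assumptions, not the lab solution: it uses `zlib.crc32` as a stand-in hash function, because Python's built-in `hash()` of a string differs between interpreter runs unless `PYTHONHASHSEED` is fixed, and the word counts and helper names are made up.

```python
import zlib

def stable_hash(word):
    """Deterministic hash of a word (a stand-in for hash(), assumed for this sketch)."""
    return zlib.crc32(word.encode('utf-8'))

def hash_counts(word_counts, n):
    """Fold (word, count) pairs into a vector of length n by hashing the words."""
    vec = [0] * n
    for word, count in word_counts:
        vec[stable_hash(word) % n] += count   # words with the same index share a bucket
    return vec

word_counts = [('the', 30), ('cat', 2), ('sat', 1), ('mat', 1)]   # hypothetical counts
vec = hash_counts(word_counts, 4)

print(vec)
print(len(vec), sum(vec))  # 4 34: the size is fixed and the total count is preserved
```

Words that land on the same index have their counts added together, so with a small vector size such collisions are frequent and some information is lost.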

In [None]:
def hashing_vectorizer(word_count_list, N):
    v = [0] * N  # create fixed size vector of 0s
    for word_count in word_count_list: 
>>>        ...     # unpack tuple
        h = hash(word)              # get hash value
        v[h % N] = v[h % N] + count # add count
    return v # return hashed word vector

from operator import add

N = 10

# we use fw_RDD from the beginning with all the words, not just stopwords
>>>fw_1_RDD = fw_RDD.map(lambda fw: ...)  #<<< change (f,w) to ((f,w),1)
fw_c_RDD = fw_1_RDD.reduceByKey(add) #as above
f_wcL_RDD = fw_c_RDD.map(reGrpLst) #as above
f_wcL2_RDD = f_wcL_RDD.reduceByKey(add) #create [(w,c), ... ,(w,c)] lists per file 
>>>f_wVec_RDD = f_wcL2_RDD.map(lambda f_wcl: (f_wcl[0], ...)) # apply the hashing_vectorizer to the word-count list
print(f_wVec_RDD.take(3))
# output (this sample was produced with N = 20, so each vector has 20 entries):
# [('henry_V', [2277, 2293, 1182, 1792, 2058, 1550, 787, 1821, 814, 1916, 902, 752, 1249, 1022, 888, 1702, 1357, 2886, 1007, 1645]), ('king_lear', [2149, 2217, 1010, 2167, 2331, 1372, 726, 1682, 747, 1623, 1470, 889, 1248, 1371, 1062, 1472, 1510, 2456, 1364, 1253]), ('lady_susan', [2015, 1850, 823, 1828, 2099, 1658, 704, 1656, 588, 1319, 1433, 789, 1051, 909, 748, 1236, 1290, 2195, 570, 1348])]

In [None]:
from math import sqrt

from pyspark.mllib.clustering import KMeans #, KMeansModel

#print('f_wVec_RDD.take(2): ', f_wVec_RDD.take(1))
wVec_RDD = f_wVec_RDD.map(lambda f_wcl: f_wcl[1]) # strip the filenames
#print(wVec_RDD.collect())

# Build the model (cluster the data)
clusterModel = KMeans.train(wVec_RDD, 2, maxIterations=10, initializationMode="random")

# Assign the files to the clusters
fc_RDD = f_wVec_RDD.map(lambda fv: (fv[0],clusterModel.predict(fv[1]))) 
for s in fc_RDD.collect():
    print(s)
    
# reusing the 'error' function from above
WSSSE = wVec_RDD.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

## 6) Neutralising document length: normalised vectors

'Lady Susan' ends up reliably in the wrong cluster. A possible explanation is that it is shorter than the other Austen works. Try normalising the word counts, i.e. dividing each count by the sum of all counts in its vector. That takes away the effect of length. What is the effect on the clustering?
    
You can use a list comprehension for the normalisation.
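
To see why normalising removes the effect of document length, consider two made-up count vectors with the same proportions but different totals. This is a plain-Python sketch on toy data, and the helper name `normalise` is hypothetical.

```python
def normalise(counts):
    """Divide every count by the total, so that the result sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]

short_doc = [2, 1, 1]         # hypothetical counts from a short text
long_doc = [200, 100, 100]    # same proportions, 100 times longer

print(normalise(short_doc))   # [0.5, 0.25, 0.25]
print(normalise(long_doc))    # [0.5, 0.25, 0.25]
```

Before normalising, the two vectors are far apart in Euclidean distance; afterwards they are identical, so only the relative word frequencies influence the clustering.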

In [None]:
>>>nwVec_RDD = wVec_RDD.map(lambda v: (...)) # provide a list comprehension that 
                            #normalises the values by dividing by the sum over the list
print("Normalised vectors: ",nwVec_RDD.take(3))

# Build the model (cluster the data)
clusterModel = KMeans.train(nwVec_RDD, 2, maxIterations=10, initializationMode="random")

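# Assign the files to the clusters. The model was trained on normalised vectors,
# so each vector must be normalised in the same way before calling predict().
>>>fc_RDD = f_wVec_RDD.map(lambda fv: (fv[0],clusterModel.predict(...))) # normalise fv[1] as above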
for s in fc_RDD.collect():
    print(s)
# output
# Normalised vectors:  [[0.07615384615384616, 0.07668896321070234, 0.03953177257525083, 0.05993311036789298, 
# ...                       
# ('henry_V', 0)
# ('king_lear', 0)
#('lady_susan', 1)
# ..

## 7) Building an index

Starting from fw_RDD, we now build the index and calculate the IDF values. Since we have the TF values already, we only need to keep the unique filenames per word using [RDD.distinct()](https://spark.apache.org/docs/2.1.0/api/python/pyspark.html#pyspark.RDD.distinct).  
Then we create a list of filenames per word. The length of the list is the document frequency DF of the word.
From the DF value we can calculate the IDF value as log(18/DF).
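
As a quick check of the formula, here is the IDF computed in plain Python for the three document frequencies that appear in the sample output of this section (18, 15 and 9). The constant 18 is taken from the formula above, where it plays the role of the number of documents; `math.log` is the natural logarithm.

```python
from math import log

n_docs = 18                                       # the constant from log(18/DF)
df = {'of': 18, 'shakespeare': 15, 'henry': 9}    # DF values from the sample output

idf = {w: log(n_docs / d) for w, d in df.items()}
for w, value in idf.items():
    print(w, value)
# a word that occurs in every document gets an IDF of 0.0
```

The rarer a word is across documents, the larger its IDF, so words that appear everywhere contribute nothing to distinguishing documents.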

In [None]:
from math import log

fwu_RDD = fw_RDD.distinct() # get unique file/word pairs
>>>wfl_RDD = fwu_RDD.map(lambda fw: (fw[1],...)) # create (w,[f]) tuples 
wfL_RDD = wfl_RDD.reduceByKey(add) # concatenate the lists with 'add'
print(wfL_RDD.take(3))

>>>wdf_RDD = wfL_RDD.map(lambda wfl: (wfl[0],...)) # get the DF by replacing the file list with its length
print("DF: ",wdf_RDD.take(3))
>>>widf_RDD = wdf_RDD.map(lambda wdf: (wdf[0],...)) # get the IDF by replacing DF with log(18/DF)
print("IDF: ",widf_RDD.take(3))
# output:
# [('of', ['henry_V', 'king_lear', 'lady_susan', 'macbeth', 'merchant_of_venice', 'midsummer', 'northanger_abbey', 
# DF:  [('of', 18), ('shakespeare', 15), ('henry', 9)]
# IDF:  [('of', 0.0), ('shakespeare', 0.1823215567939546), ('henry', 0.6931471805599453)]