# Lab Sheet 2: Extracting Word Frequency Vectors with Spark

These tasks are for working in the lab session and during the week. First we'll go through (almost) the same preliminaries as last week. In task 1) we'll do a bit of **word preprocessing** and in task 2) we'll load a number of files and will go through the processing steps to **extract word frequencies**.

As last week, the places where you need to add code are marked with "..." and/or ">>>". In most cases the marked code is not valid Python, so you need to edit it before the cell will run. I recommend commenting the marker out like this, "# >>>", to keep a record of the original task.

The **purpose** of this lab is to develop your **skills** in **programming** against the **Spark interface**.
It is also to get practical experience with **vector representations of documents** and the **Hashing Trick**. (We'll cover TF-IDF next week).

## Preliminaries
Mount Drive and install Spark

Mount drive (you'll need to open the link and copy over the authorization code).

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Next, we check if we can read the `Big_Data` folder. If the command below fails, go back to the shared [`Big_Data`]() folder and click on **"Add to My Drive"** in the folder menu.

In [None]:
%ls "/content/drive/My Drive/Big_Data/data/"

Next, we **install Spark** (may take a minute or two). This will need to be done **every time a new machine is created**.


In [None]:
%cd
!tar -xzf "/content/drive/My Drive/Big_Data/data/spark/spark-3.5.0-bin-hadoop3.tgz" # unpacking
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # installing java
import os # Python package for interaction with the operating system
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64" # tell the system where Java lives
os.environ["SPARK_HOME"] = "/root/spark-3.5.0-bin-hadoop3" # and where spark lives
!pip install -q findspark # install helper package
import findspark # use the helper package
findspark.init() # to set up spark
%cd "/content/drive/My Drive/Big_Data"

**Speaker notes**
So we set up a SparkContext, which, if you remember, is the entry point for accessing the PySpark interface.


Get a SparkContext

In [None]:
import pyspark
# get a spark context
sc = pyspark.SparkContext.getOrCreate()
print(sc)

In [None]:
!pwd
!ls
!cd data
!ls


## Using Jupyter in Colab: system commands and magics

**Speaker Notes**

So what we have to remember is that when we are working with a Google Colab notebook, we are working on a virtual Linux machine that exists somewhere in the Google Cloud ecosystem. When we use the exclamation mark, we interact with the shell on this VM in a new process, and that process is closed as soon as the command has been executed. So, for example, if you change directory using `!cd` and then execute an `ls` command, you will find that you are not in the directory you changed to in the previous command.

If you want a lasting change, you need to use the `%cd` magic.

**!** at the beginning of a line **executes** what follows in the **system shell**, rather than in Python.

In `!ls -l "$dirPath"`,  the **`$`** references the **Python variable** dirPath instead of the string 'dirPath'

**Changing the current directory** requires the use of the **magic "`%cd`"** (see
https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd ). "`!cd`" will run, but not actually change the directory.

PS: auto-completion does unfortunately not work after the space.
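
The difference between `!cd` and `%cd` can be reproduced in plain Python. The sketch below is an illustration only: it assumes that `subprocess.run(..., shell=True)` stands in for `!` (a short-lived child shell) and that `os.chdir` stands in for `%cd` (changing the directory of the Python process itself). The variable names are made up for this example.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.gettempdir()  # any existing directory would do

# Like "!cd": the cd happens in a child shell that exits immediately,
# so the working directory of this Python process is unchanged.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
after_shell_cd = os.getcwd()
print("unchanged after shell cd:", after_shell_cd == start_dir)

# Like "%cd": the directory of the Python process itself is changed.
os.chdir(target_dir)
after_chdir = os.getcwd()
print("now in:", after_chdir)

os.chdir(start_dir)  # restore the original directory
```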


In [None]:
!ls "/content/drive/My Drive/Big_Data/data"
dirPath = "/content/drive/My Drive/Big_Data/data"
%ls -l "$dirPath" # Get the content of the directory in long format (note the '$')

## 1) Word preparation

Define **your own mapper function** for removing the plural “s” at the end of words and turning them to lower case as a rough approximation towards **stemming**.

Use the Python `def` syntax [see here](https://docs.python.org/release/3.6.9/tutorial/controlflow.html#defining-functions) to define your own function `stripFinalS(word)` that takes a word as its argument and outputs the word in lower case without a trailing “s”, if there is one.

For this task, you can treat strings as lists and apply "**list slicing**": <br>
`lst[0:3] # the first three elements` <br>
`lst[:-2] # all but the last two elements`

For more information, look [here](https://docs.python.org/release/3.6.9/tutorial/introduction.html#strings) in section 3.1.2 for 'slice'.

You need to **check** that the string is **not empty** (test `len(word)`) before accessing the letters in the string, otherwise you'll raise an exception.

Alternatively, you can use the string method [`rstrip`](https://docs.python.org/release/3.6.9/library/stdtypes.html#str.rstrip), which removes all trailing occurrences of the given characters from the right side of a string (so it strips every final "s", not just one). There are also [`strip`](https://docs.python.org/release/3.6.9/library/stdtypes.html#str.strip) and [`lstrip`](https://docs.python.org/release/3.6.9/library/stdtypes.html#str.lstrip) variants that work on both sides or the left side, respectively.
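
As a quick illustration of the building blocks mentioned above (not a solution to the task), the following sketch applies slicing, indexing and `rstrip` to two example words. The variable names are chosen just for this example.

```python
word = "Houses".lower()          # 'houses'

first_three = word[0:3]          # 'hou'
all_but_last_two = word[:-2]     # 'hous'
last_letter = word[-1]           # 's'

# Slicing removes exactly one character, rstrip removes all trailing 's'.
sliced = "class"[:-1]            # 'clas'
stripped = "class".rstrip("s")   # 'cla'

# Indexing an empty string raises IndexError, so test the length first.
empty = ""
ends_in_s = len(empty) > 0 and empty[-1] == "s"   # False, no exception

print(first_three, all_but_last_two, last_letter, sliced, stripped, ends_in_s)
```

Note how the two approaches differ on words such as "class", which matters when you decide which one to use.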

In [None]:
def stripFinalS( word ):
  wordl = word.lower() # lower case
>>> add code here
  return wordl

print(stripFinalS('houses')) # for testing, should return 'house'

Add your new function into the word count example below for testing, replacing `word.lower()` in the call to `map()`. The code below will parse and run, but your task is still to change it. This task is very similar to last week, no surprises ... ;-)

In [None]:
from operator import add
import re

filepath = "/content/drive/My Drive/Big_Data/data/hamlet.txt"

linesRDD = sc.textFile(filepath) # read text as RDD
wordsRDD = linesRDD.flatMap(lambda line: re.split(r'\W+', line)) # split words, break lists (raw string avoids an invalid-escape warning)
wordsFilteredRDD = wordsRDD.filter(lambda word: len(word)>0) # filter empty words out
# >>> replace word.lower() below using your stripFinalS method
words1RDD = wordsFilteredRDD.map(lambda word: (word.lower(), 1)) # lower case, (w, 1) pairs
wordCountRDD = words1RDD.reduceByKey(add) # reduce and add up counts
freqWordsRDD = wordCountRDD.filter(lambda x:  x[1] >= 3 ) # remove rare words
output = freqWordsRDD.sortBy(lambda x: -x[1]).take(15) # collect 15 most frequent words
for (word, count) in output: # iterate over (w,c) pairs
    print("{}: {}".format(word, count)) #  … and print

You should see no words ending in 's' when using your function, including some cases where the "s" shouldn't have been removed, but we are just getting started...
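
To see what each step of the pipeline does to the data, here is a plain-Python sketch of the same sequence on a hard-coded two-line sample. It does not use Spark; list comprehensions and a `Counter` stand in for `flatMap`, `filter`, `map` and `reduceByKey`, and the sample text is an assumption made for this example.

```python
import re
from collections import Counter

lines = ["To be, or not to be", "that is the question"]

# flatMap: split every line and flatten the results into one list of words
words = [w for line in lines for w in re.split(r'\W+', line)]
# filter: drop empty strings
words = [w for w in words if len(w) > 0]
# map: lower case and pair each word with a 1
pairs = [(w.lower(), 1) for w in words]
# reduceByKey(add): add up the 1s per word
counts = Counter()
for w, one in pairs:
    counts[w] += one
# sortBy + take: most frequent words first
top = sorted(counts.items(), key=lambda x: -x[1])[:3]
print(top)
```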

## 2) Extracting word frequency vectors from text documents

Now we start a **new task**, which is reading in a **whole directory** with text files and **extracting word frequency** information. This is done in several steps.

The steps involve some **tuple restructuring** and **list transformation**. It is helpful to use **meaningful variable names** to keep track of what's going on.

It is also helpful to use pen and paper (or a text editor) to **write down the structures** that you are intending to create. I recommend using a pseudocode with brackets and meaningful descriptions for the content, e.g. `(word,count)` for a tuple, or  `[(file,word), ... , (file,word)]` for a list of tuples.

Keep in mind the final goal of getting a list of words and their frequencies for each file, i.e. `(filename,[(w,c), ... , (w,c)])`. This is not a full vocabulary vector, as we don't have a representation of words that did not appear in this file. However, this can be beneficial when creating an inverted index.
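
One way to write down the intended structures is as Python literals. The sketch below lists one hypothetical element per stage; the file name, words and counts are invented for illustration and do not come from the library folder.

```python
# Hypothetical example elements, one per processing stage.
doc = ("a.txt", "the cat in the hat")                   # (filename, content)
file_word = ("a.txt", "the")                            # (filename, word)
file_word_one = (("a.txt", "the"), 1)                   # ((filename, word), 1)
file_word_count = (("a.txt", "the"), 2)                 # ((filename, word), count)
file_wc_single = ("a.txt", [("the", 2)])                # (filename, [(word, count)])
file_wc_list = ("a.txt", [("the", 2), ("in", 1)])       # (filename, [(w,c), ... , (w,c)])

# Nested tuples can be unpacked step by step.
(f, w), c = file_word_count
print(f, w, c)
```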

### 2a) Load the files

To start, **load all text files** in the directory `"/content/drive/My Drive/Big_Data/data/library"` using `sc.wholeTextFiles()` (see [here](http://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles)). This will create an RDD with tuples of the structure **(filepath,content)**, where content is the whole text from the file.

In [None]:
dirPath = "/content/drive/My Drive/Big_Data/data/library" # the path to our data
%ls -l "$dirPath" # show the files
doc_RDD = sc. ... # <<< add code to create an RDD with wholeTextFiles
print("partitions: ", doc_RDD.getNumPartitions()) # default is 2
print("elements: ", doc_RDD.count()) # should be as many as there are files in the library folder

### 2b) Split the RDD elements using flatMap to get (filename, word) elements.

For this, **define a function** that takes a pair `(filename,content)` and outputs a list of pairs `[(filename, word1), ..., (filename, wordN)]`. You can get the words from a string `x` as usual with `re.split(r'\W+', x)`.

Iterate through the word list in a for loop and append the (filename,word) tuples to a new list. Alternatively, a list comprehension (see http://www.pythonforbeginners.com/basics/list-comprehensions-in-python) builds the same list in a single expression.
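
The following sketch shows, on a toy example, that a for loop with `append` and a list comprehension produce the same list of pairs. The names `label` and `items` are placeholders for this illustration, not part of the task template.

```python
label = "doc1"
items = ["alpha", "beta", "gamma"]

# Version 1: for loop with append
pairs_loop = []
for item in items:
    pairs_loop.append((label, item))

# Version 2: list comprehension, same result in one expression
pairs_comp = [(label, item) for item in items]

print(pairs_loop == pairs_comp)
print(pairs_comp)
```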

Below is a template; you need to complete the lines that start with `>>>`.

In [None]:
def splitFileWords(filenameContent): # your splitting function
    f,c = filenameContent # unpack the input tuple
    fwLst = [] # the new list for (filename,word) tuples
    >>> wLst =  # now create a word list wLst by splitting c (the content)
    for w in wLst: # iterate through the list
        >>> # and append (f, w) to the fwLst
    return fwLst # return a list of (f, w) tuples

fw_RDD = doc_RDD.flatMap(splitFileWords)
print(fw_RDD.take(3))
# should print something similar to this:
# [('file:/content/drive/My Drive/BigData/library/emma.txt', 'The'), ...

#### Comments
- Building the list `fwLst` is the main new concept here.
- Creating tuples with brackets is a technique that is frequently used.

Now use filter to keep only the tuples with stopwords (remember, the words are now the 2nd element of the tuple).

In [None]:
stopwlst = ['the','a','in','of','on','at','for','by','I','you','me'] # stopword list
fw_RDD2 = fw_RDD.filter ... # <<< filter, keeping only tuples with a stopword as 2nd element
fw_RDD2.top(3)

#### Comments
- With `RDD.filter()`, the function you pass in should return a boolean.
- Important: `filter` keeps only those elements for which the provided function (here a lambda) returns `True`.
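
Python's built-in `filter` follows the same rule, so the behaviour can be tried out without Spark. This is a minimal sketch on invented data; `pairs` and `keep` are hypothetical names used only here.

```python
pairs = [("a.txt", "the"), ("a.txt", "cat"), ("b.txt", "of"), ("b.txt", "dog")]
keep = ["the", "of"]

# The lambda returns True or False for each tuple; only the True ones survive.
is_kept = lambda fw: fw[1] in keep
kept = list(filter(is_kept, pairs))

print(is_kept(("a.txt", "the")), is_kept(("a.txt", "cat")))
print(kept)
```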

### 2c) Count the words and reorganise the tuples to count: ((filename,word), count)

Now you can package the elements into tuples with 1s and use reduceByKey(add) to get the counts of the words per filename, similar to last week and in task 1 above.

In [None]:
fw_1_RDD = fw_RDD2.map(lambda x: ...)  # <<< change (file,word) to ((file,word),1)
fw_c_RDD = fw_1_RDD. # <<< reduceByKey to count the words using "add" (imported above)
fw_c_RDD.top(3)
# the printed elements should look similar to this:
# [(('file:/content/drive/My Drive/BigData/library/tempest.txt', 'you'), 260), ...

#### Comment
This example follows the word count example, with the difference of keeping the filename in addition to the word.

### 2d) Creating and concatenating lists

As a next step, map the `((filename, word), count)` elements to a `(filename, [(word, count)])` structure, i.e. rearrange the elements and wrap a list around the single `(word, count)` tuple (just by writing square brackets). For this, create a function `reGrpLst` to regroup and create a list. Check that the output has the intended structure.

In [None]:
def reGrpLst(fw_c): # we get a nested tuple ((f,w),c)
    fw, c = fw_c # unpack the outer tuple fw, c = fw_c
>>> # unpack the inner tuple f, w = fw
>>> # return (f,[(w,c)]) structure. Can be used verbatim, if your variable names match.

f_wcL_RDD = fw_c_RDD.map(reGrpLst)
f_wcL_RDD.top(3)
# output should look like this:
# [('file:/content/drive/My Drive/BigData/library/tempest.txt', [('you', 260)]), ...

Next we can concatenate the lists per filename using `reduceByKey()`. Write a lambda that takes two lists and concatenates them. Concatenation of lists is done in Python with '`+`', e.g. `[1,2] + [3,4]` returns `[1,2,3,4]`.

#### Comment
Here we have a new technique: creating lists instead of tuples (using `[]` instead of `()`). The approach is similar to that of word counting, but adding lists (with `+` or `add`) means concatenating them, so that we produce a long list.  
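
The effect of adding lists can be checked in plain Python. The sketch below first reduces a few one-element lists with `add`, then mimics the per-key behaviour with a dictionary; the data and the dictionary-based grouping are assumptions for illustration and not how Spark implements `reduceByKey` internally.

```python
from functools import reduce
from operator import add

# Adding lists concatenates them.
single_lists = [[("the", 3)], [("a", 2)], [("of", 1)]]
merged = reduce(add, single_lists)
print(merged)

# Per-key concatenation, similar in spirit to reduceByKey on (filename, list) pairs.
elements = [("a.txt", [("the", 3)]), ("b.txt", [("of", 1)]), ("a.txt", [("a", 2)])]
by_file = {}
for f, wc_list in elements:
    by_file[f] = by_file[f] + wc_list if f in by_file else wc_list
print(by_file)
```

In Spark the order in which the lists are concatenated is not guaranteed, so the order of the `(word, count)` pairs within a file's list may vary.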

In [None]:
>>> f_wcL2_RDD = f_wcL_RDD.reduceByKey(lambda wc1, wc2: ... ) # <<< create [(w,c), ... ,(w,c)] lists per file

In [None]:
output = f_wcL2_RDD.collect()
for el in output[1:4]:
    print(el)
    print()

## 3) Creating Hash Vectors

If we want to **compare** the word-counts for different files, and in particular if we want to use not just the stopwords, we **need** to bring them to the **same dimensionality** as vectors. For this we use the **'Hashing Trick'** shown in the lecture.

Start by writing a function that takes a (word, count) list and transforms it into a vector of fixed size. For that you need to take the hash value of each word modulo the size (`hash(word) % size`) and add up the counts of all words that map to the same position.
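
The sketch below illustrates the same idea with one deliberate change: instead of the built-in `hash()`, whose values for strings normally differ between interpreter runs unless `PYTHONHASHSEED` is fixed, it uses an MD5 digest from `hashlib` so that the result is reproducible. The function names `stable_hash` and `hash_vector` are hypothetical and not part of the lab template.

```python
import hashlib

def stable_hash(word):
    # Interpret the MD5 digest of the word as a (large) non-negative integer.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def hash_vector(word_counts, size):
    vec = [0] * size                      # fixed-size vector of zeros
    for w, c in word_counts:
        vec[stable_hash(w) % size] += c   # counts of colliding words add up
    return vec

wc = [('this', 23), ('is', 12), ('a', 34), ('little', 13), ('test', 24)]
vec = hash_vector(wc, 5)
print(vec, sum(vec))
```

Whatever the hash function, the vector always has `size` entries and its entries sum to the total of all counts, because every count is added to exactly one position.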

In [None]:
def hashWcList(lst, size):
    lst2 = [0] * size # create a vector of the needed size filled with '0's
    for (w, c) in lst: # for every (word,count) pair in the given list
        lst2[...] += ... # determine the position [...] with hash(w) % size and add c there
    return lst2 # return the new list, containing only numbers

hashWcList([('this',23),('is',12),('a',34),('little',13),('test',24)],5) # for testing
# output should look similar to this: [35, 13, 0, 24, 34]

### Comment
This method gives us, for every text document, a single compact vector of fixed dimension.
Vectors like this can be used for finding documents in databases, grouping them by similarity, studying writing styles, etc.
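
As one possible way of comparing two such vectors, the following sketch computes their cosine similarity. The choice of cosine similarity and the example vectors are assumptions for illustration; the lab itself does not prescribe a similarity measure.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                        # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])   # vectors point the same way
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])       # no shared positions
print(same_direction, orthogonal)
```

Because the measure only depends on direction, a long and a short document with the same word proportions come out as very similar.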

In [None]:
f_hv_RDD = f_wcL2_RDD.map(lambda f_wcl: (f_wcl[0], hashWcList(f_wcl[1], 10)))
output = f_hv_RDD.collect()
for el in output[1:4]:
    print(el)
    print()
# now we can display a hashed vector for every text file

## Reading
Read sections 2.1 and 2.2 of Leskovec et al. (2019), "Mining of Massive Datasets". Work out the answers to exercise 2.2.1 on page 30. Also, work out the answers to exercise 2.3.1 a), c), d) on page 40. If you have time, have a look at section 2.3 (no need to go into full detail).

### Demo: extracting file names and creating a DataFrame

We first load the module [`os.path`](https://docs.python.org/3.6/library/os.path.html), which contains utilities to manipulate paths in Python. From that module, we use the [`basename`](https://docs.python.org/3.6/library/os.path.html#os.path.basename) function to get just the filename, without parent folders, and [`splitext`](https://docs.python.org/3.6/library/os.path.html#os.path.splitext) to remove the extension.
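
On a single path string the two functions behave as follows. The path used here is an example in the style of the file paths above, not taken from actual output.

```python
import os.path

path = "file:/content/drive/My Drive/Big_Data/data/library/emma.txt"

file_name = os.path.basename(path)            # 'emma.txt'
root, ext = os.path.splitext(file_name)       # ('emma', '.txt')

print(file_name, root, ext)
```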

In [None]:
import os.path
fn_hv_RDD = f_hv_RDD.map(lambda x: (os.path.splitext(os.path.basename(x[0]))[0], x[1]))
fn_hv_RDD.take(3)

Then we convert the RDD into a DataFrame and show it.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession(sc)

df = spark.createDataFrame(fn_hv_RDD,["Text name","Hash vector"])
df.show()

You can convert a Spark DataFrame to a Pandas DataFrame to use other tools, e.g. for plotting.

In [None]:
df1 = df.toPandas()
display(df1)