### Getting data from git
The cell below clones our GitHub repository into the local filesystem. This only needs to be done once.

In [1]:
!git clone https://github.com/tweyde/City-Data-Science.git

Cloning into 'City-Data-Science'...
remote: Counting objects: 41, done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 41 (delta 7), reused 21 (delta 1), pack-reused 0
Unpacking objects: 100% (41/41), done.


Update the local copy of the repository

In [2]:
%cd City-Data-Science/
!git pull https://github.com/tweyde/City-Data-Science.git
%cd ..

/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science
From https://github.com/tweyde/City-Data-Science
 * branch            HEAD       -> FETCH_HEAD
Already up-to-date.
/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work


# 1) Word preparation

In [27]:
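def stripFinalS(word):
    """Lower-case a word and drop a single trailing 's'."""
    if len(word) == 0:
        print('The string is empty')
    elif len(word) > 1 and word[-1] == 's':
        word = word[:-1] # strip the final 's', but leave one-letter words alone
    return word.lower()

print(stripFinalS('houses')) # for testing, should print 'house'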

house


In [28]:
from operator import add
import re

linesRDD = sc.textFile("./City-Data-Science/library/hamlet.txt") # read text as RDD
wordsRDD = linesRDD.flatMap(lambda line: re.split(r'\W+',line)) # split words, break lists (raw string for the regex)
wordsFilteredRDD = wordsRDD.filter(lambda word: len(word)>0)

words1RDD = wordsFilteredRDD.map(lambda word: (stripFinalS(word),1)) # lower case, (w,1) pairs
wordCountRDD = words1RDD.reduceByKey(add) # reduce and add up counts
freqWordsRDD = wordCountRDD.filter(lambda x:  x[1] >= 5 ) # remove rare words
output = freqWordsRDD.sortBy(lambda x: -x[1]).take(10) # collect the 10 most frequent words
for (word, count) in output: # iterate over (w,c) pairs
    print("%s: %i" % (word, count)) #  … and print
# this should print the stopwords with their counts

the: 1218
i: 1019
and: 1019
to: 834
a: 815
of: 733
you: 610
my: 516
in: 464
it: 442
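
For readers without a Spark context at hand, the steps of the cell above (split, filter, normalise, count, sort) can be sketched in plain Python. This is only a single-machine illustration under stated assumptions: the names `strip_final_s`, `word_counts` and `sample_lines` are made up for this sketch, and `collections.Counter` stands in for the `map` plus `reduceByKey(add)` step.

```python
import re
from collections import Counter

def strip_final_s(word):
    # Same rule as stripFinalS above: drop one trailing 's' from words
    # longer than one character, then lower-case.
    if len(word) > 1 and word[-1] == 's':
        word = word[:-1]
    return word.lower()

def word_counts(lines, min_count=1):
    # flatMap + filter: split every line into words and drop empty strings
    words = [w for line in lines for w in re.split(r'\W+', line) if len(w) > 0]
    # map + reduceByKey(add): count the normalised words
    counts = Counter(strip_final_s(w) for w in words)
    # filter rare words, then sort by descending count
    frequent = [(w, c) for (w, c) in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda x: -x[1])

sample_lines = ["To be, or not to be", "that is the question"]  # made-up input
top_words = word_counts(sample_lines)[:3]
print(top_words)
```

Note that the trailing-'s' rule also maps `is` to `i`, so the count reported for `i` includes occurrences of `is`.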


# 2) Extracting word frequency vectors from text documents

Now we start a new script, which reads in a whole directory with text files and extracts word frequency information.

This involves some tuple restructuring and list transformation. It is important to use meaningful variable names. It also helps to use pen and paper (or a text editor) to write down the structures that you intend to create. Keep in mind the final goal of getting a list of words and their frequencies for each file, i.e. (filename,[(w,c), ... , (w,c)]).


## 2a) Load the files
Load all text files in the library directory using `sc.wholeTextFiles` (see [http://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles](http://spark.apache.org/docs/2.0.0/api/python/pyspark.html#pyspark.SparkContext.wholeTextFiles)). The cell below reads the local clone at `./City-Data-Science/library/`; on the server lewes the directory is /data/student/bigdatastud/library. This will create an RDD with tuples of the structure (filepath,content), where content is the whole text from the file.

In [57]:
dirPath = "./City-Data-Science/library/"
fw_RDD = sc.wholeTextFiles(dirPath) # RDD of (filepath, content) tuples

print("partitions: ", fw_RDD.getNumPartitions()) # on IBM DSX we have 2 executors by default with one partition each
print("elements: ", fw_RDD.count())
# this should print partitions:  2 and elements:  19

partitions:  2
elements:  19
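
To make the (filepath,content) structure concrete, here is a plain-Python stand-in that reads a directory into a list of such tuples. It is a sketch only and not how Spark implements `wholeTextFiles`: there are no partitions, the paths carry no `file:` scheme, and the helper name `whole_text_files` and the two tiny files are invented for the illustration.

```python
import os
import tempfile

def whole_text_files(dir_path):
    # Return a list of (filepath, content) tuples, one per file in dir_path.
    pairs = []
    for name in sorted(os.listdir(dir_path)):
        path = os.path.join(dir_path, name)
        if os.path.isfile(path):
            with open(path, encoding='utf-8') as fh:
                pairs.append((path, fh.read()))
    return pairs

# Build a throwaway directory with two tiny files (made-up content).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [('a.txt', 'the cat'), ('b.txt', 'a dog in the park')]:
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as fh:
            fh.write(text)
    file_pairs = whole_text_files(tmp_dir)

print(len(file_pairs))                              # number of files read
print([content for (_, content) in file_pairs])     # the whole text of each file
```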


## 2b) Split the RDD elements using flatMap to make the (filename, word) tuples the key.

For this, define a function that takes a pair `(filename,content)` and outputs a list of pairs `[(filename, word1), ... ,(filename, wordN)]`.

Iterate through the word list in a for loop and append the (filename,word) tuples to a new list; a list comprehension (see http://www.pythonforbeginners.com/basics/list-comprehensions-in-python) achieves the same more compactly.

Below is a template; you need to fill in the lines marked with `<<<`.

In [58]:
def splitFileWords(filenameContent):
    f,c = filenameContent # unpack the input tuple
    fwLst = [] # the new list for (filename,word) tuples
    wLst = re.split(r'\W+', c) # <<< now create a word list wLst by splitting c (the content)
    for w in wLst: # iterate through the list
        if len(w) > 0: # skip the empty strings that re.split can produce
            fwLst.append((f,w)) # <<< and append (f,w) to the fwLst
    return fwLst # return a list of (f,w) tuples
    
fw_RDD = fw_RDD.flatMap(splitFileWords)
fw_RDD.take(1)

# should print something similar to this:
# [('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/prideandpredjudice.txt',
# 'The'),

[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/prideandpredjudice.txt',
  'The')]
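
The template above uses an explicit for loop. The list-comprehension variant mentioned in the text could look roughly like the sketch below; the function name `split_file_words` and the sample file name and text are hypothetical.

```python
import re

def split_file_words(filename_content):
    f, c = filename_content  # unpack the (filename, content) tuple
    # one (filename, word) tuple per non-empty token
    return [(f, w) for w in re.split(r'\W+', c) if len(w) > 0]

pairs = split_file_words(('demo.txt', 'The quick, brown fox.'))  # made-up input
print(pairs)
# [('demo.txt', 'The'), ('demo.txt', 'quick'), ('demo.txt', 'brown'), ('demo.txt', 'fox')]
```

The `len(w) > 0` condition matters here because `re.split` returns an empty string after the final full stop.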

### Question: 
Looking at the new elements, what might be problematic if we were using a large dataset, and what could we do to prevent this from happening?

Now use filter to keep only the tuples with stopwords (remember, the words are now the 2nd element of the tuple).

In [59]:
stopwlst = ['the','a','in','of','on','at','for','by','I','you','me'] # stopword list
fw_RDD2 = fw_RDD.filter(lambda x: x[1] in stopwlst) #<<< filter, keeping only stopwords as 2nd part of the tuples
fw_RDD2.top(3) # show the three largest elements

[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
  'you'),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
  'you'),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
  'you')]
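
The filter above is case-sensitive, so with this stopword list `'The'` is dropped while `'the'` is kept. If case-insensitive matching were wanted, one possible variant is sketched below in plain Python. It is an assumption about what one might want, not part of the exercise, and it would change the counts shown in the following cells; `stop_words`, `is_stopword` and the sample tuples are made up.

```python
# lower-cased copy of stopwlst, stored as a set for fast membership tests
stop_words = {'the', 'a', 'in', 'of', 'on', 'at', 'for', 'by', 'i', 'you', 'me'}

def is_stopword(file_word):
    # file_word is a (filename, word) tuple; compare the lower-cased word
    return file_word[1].lower() in stop_words

fw_pairs = [('demo.txt', 'The'), ('demo.txt', 'cat'),
            ('demo.txt', 'I'), ('demo.txt', 'on')]  # made-up data
kept = [fw for fw in fw_pairs if is_stopword(fw)]
print(kept)
# [('demo.txt', 'The'), ('demo.txt', 'I'), ('demo.txt', 'on')]
```

With an RDD, the same predicate could be passed to `filter` in place of the lambda.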


## 2c) Count the words and reorganise the tuples to count: ((filename,word), count)

Now you can package the elements into tuples with 1s and use reduceByKey(add) to get the counts of the words per filename, similar to last week and in task 1 above.

In [60]:
fw_1_RDD = fw_RDD2.map(lambda x: (x,1))  #<<< change (f,w) to ((f,w),1)
fw_c_RDD = fw_1_RDD.reduceByKey(add) #<<< count the words
fw_c_RDD.top(3)

[(('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
   'you'),
  260),
 (('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
   'the'),
  695),
 (('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
   'on'),
  85)]
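
To see what `reduceByKey(add)` does to the `((filename,word),1)` tuples, here is a single-machine sketch. The helper `reduce_by_key` is a hypothetical stand-in that ignores partitioning and shuffling, and the file names are invented.

```python
from operator import add

def reduce_by_key(pairs, func):
    # Combine the values of equal keys with func, one pair at a time.
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return list(result.items())

fw_pairs = [('f1', 'the'), ('f1', 'a'), ('f1', 'the'), ('f2', 'the')]  # made-up data
fw_1 = [(fw, 1) for fw in fw_pairs]   # (f,w) -> ((f,w),1)
fw_c = reduce_by_key(fw_1, add)       # add up the 1s per (f,w) key
print(fw_c)
# [(('f1', 'the'), 2), (('f1', 'a'), 1), (('f2', 'the'), 1)]
```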

## 2d) Creating and concatenating lists

As a next step, map the `((filename,word),count)` elements to a `(filename, [(word, count)])` structure, i.e. rearrange the tuple and wrap a list around the `(word, count)` tuple (just by writing square brackets). For this, create a function `reGrpLst` to regroup and create a list. Check that the output has the intended structure.

In [62]:
def reGrpLst(fw_c): # we get a nested tuple
    fw,c = fw_c # unpack the outer tuple
    f,w = fw
    return (f, [(w,c)]) # return (f,[(w,c)]) structure.

f_wcL_RDD = fw_c_RDD.map(reGrpLst) 
f_wcL_RDD.top(3)

[('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
  [('you', 260)]),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
  [('the', 695)]),
 ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt',
  [('on', 85)])]

Next we can concatenate the lists per filename using reduceByKey(). Write a lambda that concatenates the lists of elements that share a filename. Concatenation of lists in Python is done with '+', e.g. `[1,2] + [3,4]` returns `[1,2,3,4]`.

In [63]:
f_wcL2_RDD = f_wcL_RDD.reduceByKey(lambda wc1,wc2: wc1+wc2 ) #<<< create [(w,c), ... ,(w,c)] lists per file 
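
The regroup-and-concatenate step can be traced on a toy example in plain Python. This is a sketch with invented data: a `defaultdict` plays the role of `reduceByKey`, and `re_grp_lst` mirrors `reGrpLst` above.

```python
from collections import defaultdict

def re_grp_lst(fw_c):
    (f, w), c = fw_c        # unpack ((filename, word), count)
    return (f, [(w, c)])    # regroup to (filename, [(word, count)])

fw_c = [(('f1', 'the'), 2), (('f1', 'a'), 1), (('f2', 'the'), 1)]  # made-up counts
f_wcl = defaultdict(list)
for f, wc_list in map(re_grp_lst, fw_c):
    f_wcl[f] = f_wcl[f] + wc_list  # same '+' concatenation as in the lambda
print(dict(f_wcl))
# {'f1': [('the', 2), ('a', 1)], 'f2': [('the', 1)]}
```

In Spark the order of the `(word, count)` tuples inside each list is not fixed, as the differing orders in the collected output show.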

In [64]:
output = f_wcL2_RDD.collect() 
for el in output[1:4]:
    print(el)
    print()
# should show something like this:
# ('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s832-dfe96c6e1f1d61-70d619a53771/notebook/work/City-Data-Science/library/king_lear.txt', 
# [('of', 495), ('in', 280), ('at', 66), ('me', 228), ('I', 705), ('for', 130), ('on', 104), ('you', 412), ('a', 364), ('the', 746), ('by', 84)])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/hamlet.txt', [('in', 427), ('of', 685), ('at', 86), ('you', 544), ('I', 617), ('the', 1050), ('for', 202), ('on', 141), ('by', 124), ('a', 496), ('me', 236)])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/merchant_of_venice.txt', [('at', 73), ('of', 499), ('in', 280), ('for', 201), ('me', 253), ('on', 77), ('you', 440), ('by', 108), ('a', 444), ('the', 832), ('I', 676)])

('file:/gpfs/global_fs01/sym_shared/YPProdSpark/user/s8bb-4c006897e4ec94-b51c55267e90/notebook/work/City-Data-Science/library/tempest.txt', [('at', 80), ('of', 435), ('in', 262), ('a', 355), ('you', 260), ('by', 106), ('the', 695), ('me', 165), ('I', 559), ('on', 85), ('for', 134)])



## 2e) Creating Hash Vectors

If we want to use all the words in each text, we need to reduce the dimensionality of the vectors. For this we use the 'Hashing Trick' shown in the lecture. 

Start by writing a function that takes a (word,count) list and transforms it into a vector of fixed size. For that, take the hash value of each word modulo the size (`hash(word) % size`) and add up all counts that map to the same position.

In [65]:
def hashWcList(lst,size):
    lst2 = [0] * size # vector of zeros with one slot per hash bucket
    for (w,c) in lst:
        lst2[hash(w) % size] += c  # determine the position with hash(w)%size and add c there
    return lst2
        
hashWcList([('this',23),('is',12),('a',34),('little',13),('test',24)],5) # for testing
# output should look like this: [36, 0, 36, 0, 34] (positions depend on hash(), which can differ between Python versions and processes)

[36, 0, 36, 0, 34]
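
The exact vector printed above depends on Python's built-in `hash()`, which for strings is randomised per process in Python 3.3 and later unless `PYTHONHASHSEED` is set. A minimal sketch of a reproducible variant follows, assuming CRC32 from the standard `zlib` module as the hash function. That choice is an assumption for this illustration, not the method from the lecture, and the names `stable_hash` and `hash_wc_list` are made up.

```python
import zlib

def stable_hash(word):
    # CRC32 of the UTF-8 bytes: the same value in every Python process
    return zlib.crc32(word.encode('utf-8'))

def hash_wc_list(wc_list, size):
    vec = [0] * size
    for w, c in wc_list:
        vec[stable_hash(w) % size] += c  # add the count at the word's bucket
    return vec

vec = hash_wc_list([('this', 23), ('is', 12), ('a', 34), ('little', 13), ('test', 24)], 5)
print(vec)       # a 5-dimensional vector, identical on every run
print(sum(vec))  # 106: hashing moves counts into buckets but loses none
```

Words that collide in the same bucket have their counts added together, which is the price of the fixed vector size.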

In [66]:
f_hv_RDD = f_wcL2_RDD.map(lambda f_wcl: (f_wcl[0],hashWcList(f_wcl[1],10))) # hash the per-file (w,c) lists

output = f_hv_RDD.collect() # collect the (filename, vector) pairs
for el in output[1:4]:
    print(el)
    print()
# now we can display a 10-dimensional vector for every text file

