### Spark Project Assignment:

In this project assignment, you will use the Spark MLlib package to search for words within a body of text. Like a search engine, you will surface every document that contains the **word** you want to look up.


**Question 1:**

As you work through this assignment, you will get a chance to apply some of the NLP techniques covered in the last session. Note, however, that your solution must be built with Spark programming constructs.

**a)** Compute the TF-IDF for every word in the entire body of documents that we have

**b)** Construct a Document Term Matrix that contains the TF-IDF for each word across the entire set of documents

**c)** Look up the column index for the *word* you are searching for in the *Document Term Matrix* that you constructed; then, using that index, extract all documents that contain the word (based on the TF-IDF score) into an RDD

**d)** Output the document(s) that contain the word you were searching for, and also output the name of the document with the maximum **tf-idf** score

**Dataset**: https://www.dropbox.com/s/m6u5v1mt18crxtv/small_body_of_text_on_history.tsv?raw=1 - Download the file, save it to a local folder, and then use the `textFile` method of `SparkContext` to read it in


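As a point of reference for parts (a) and (b), here is a minimal pure-Python sketch of the TF-IDF arithmetic on a toy corpus. It assumes raw counts for term frequency and the smoothed inverse document frequency log((N + 1) / (df + 1)) described in the Spark MLlib documentation. The corpus and the helper names `idf_smoothed` and `tf_idf` are hypothetical and are not part of the assignment.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["stars", "and", "planets"],
    ["stars", "stars", "telescope"],
    ["history", "of", "rome"],
]

def idf_smoothed(term, corpus):
    """Smoothed IDF: log((N + 1) / (df + 1)), where df counts documents containing term."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((n_docs + 1) / (doc_freq + 1))

def tf_idf(term, doc, corpus):
    """Raw count of term in doc, multiplied by the smoothed IDF over corpus."""
    return Counter(doc)[term] * idf_smoothed(term, corpus)

# One score per document for the query term.
scores = [tf_idf("stars", doc, docs) for doc in docs]
```

A document that never mentions the term scores zero, and among documents that do mention it, the one with the higher raw count scores higher because the IDF factor is shared.
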

In [1]:
from pyspark import SparkContext, SparkConf, SparkFiles
from pyspark.mllib.feature import HashingTF, IDF

import numpy as np
import string

In [2]:
conf = SparkConf().setAppName("Lambda").setMaster("local[2]")
sc = SparkContext(conf=conf)

In [3]:
sc.addFile('https://ucd30554c87233f77ae39e2a5d78.dl.dropboxusercontent.com/cd/0/inline/AJVC7evwNMoBnuAMIzu4A90IDgdVEbXBum96fEatTFYcxGbyTOjjg-iKeM9ofCCmZW-iESA09JKs4YL4Tmki3NNy7hnA__hnw2D1tYyyJfFcQAMGqaHyL90WbBC4piHt5w16a-zznfs1jv66OS7sJgA9DkDIA17Fx3IA5HKvPWckb1nTzwhkvoLi_E0-K-S_W2s/file')

In [4]:
data = sc.textFile(SparkFiles.get('file'))

In [5]:
print("There are {} documents".format(data.count()))

There are 1000 documents


### Data Preprocessing

In [6]:
def remove_punct(text):
    """Strip ASCII punctuation from text and lowercase it."""
    text = text.replace('\n', ' ')
    for char in string.punctuation:
        text = text.replace(char, '')

    return text.lower()

In [7]:
documents = data.map(lambda line: remove_punct(line).split())

In [8]:
names = documents.map(lambda line: ': '.join(line[:2]))
documents = documents.map(lambda line: line[4:])

In [9]:
print(documents.first()[:100])

['anarchism', 'sometimes', 'referred', 'to', 'as', 'libertarianism', 'though', 'that', 'term', 'sometimes', 'has', 'other', 'meanings', 'as', 'well', 'is', 'a', 'political', 'philosophy', 'encompassing', 'theories', 'and', 'attitudes', 'which', 'support', 'the', 'elimination', 'of', 'all', 'forms', 'of', 'compulsory', 'government', 'the', 'term', 'anarchism', 'derives', 'from', 'the', 'greek', 'ἀναρχος', 'anarchos', 'meaning', 'without', 'rulers', 'from', 'the', 'prefix', 'ἀν', 'an', 'without', 'ἄρχή', 'archê', 'sovereignty', 'realm', 'magistracy', 'ισμός', 'ismos', 'from', 'a', 'stem', 'ιζειν', 'izein', 'it', 'is', 'defined', 'by', 'the', 'concise', 'oxford', 'dictionary', 'of', 'politics', 'as', 'the', 'view', 'that', 'society', 'can', 'and', 'should', 'be', 'organized', 'without', 'a', 'coercive', 'state', 'specific', 'anarchists', 'may', 'have', 'additional', 'criteria', 'for', 'what', 'constitutes', 'anarchism', 'and', 'they', 'often']


In [10]:
print(documents.take(2)[1][:100])

['autism', 'is', 'a', 'brain', 'development', 'disorder', 'that', 'is', 'characterized', 'by', 'impaired', 'social', 'interaction', 'and', 'communication', 'and', 'restricted', 'and', 'repetitive', 'behavior', 'all', 'starting', 'before', 'a', 'child', 'is', 'three', 'years', 'old', 'this', 'set', 'of', 'signs', 'distinguishes', 'autism', 'from', 'milder', 'autism', 'spectrum', 'disorder', 'asd', 'such', 'as', 'pervasive', 'developmental', 'disorder', 'not', 'otherwise', 'specified', 'pddnosnnautism', 'has', 'a', 'strong', 'genetic', 'basis', 'although', 'the', 'genetics', 'of', 'autism', 'are', 'complex', 'and', 'it', 'is', 'unclear', 'whether', 'asd', 'is', 'explained', 'more', 'by', 'multigene', 'interactions', 'or', 'by', 'rare', 'mutation', 'in', 'rare', 'cases', 'autism', 'is', 'strongly', 'associated', 'with', 'agents', 'that', 'cause', 'birth', 'defects', 'other', 'proposed', 'causes', 'such', 'as', 'childhood', 'vaccine', 'are', 'controversial']

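The token lists above contain fused words such as `pddnosnnautism`. One plausible cause, stated here as an assumption, is that the raw file stores line breaks as the two literal characters backslash and `n`: stripping punctuation removes the backslash and leaves the `n` glued to its neighbours. A minimal sketch of a cleanup variant that handles this case follows; `clean_text` is a hypothetical helper, not the function used in this notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Lowercase text after turning real and escaped newlines into spaces."""
    # Assumption: the file encodes line breaks as backslash + 'n' (two characters).
    for marker in ("\\n", "\n"):
        text = text.replace(marker, " ")
    return text.translate(_STRIP_PUNCT).lower()

# The sample string contains two escaped newlines between the sentences.
tokens = clean_text("Celestial objects.\\n\\nThe typical amateur").split()
```

Replacing the escaped marker before stripping punctuation is the important ordering: once the backslash is gone, the stray `n` can no longer be told apart from a real letter.
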

### Vectorizing

In [11]:
hashingTF = HashingTF()
tf = hashingTF.transform(documents)

In [12]:
idf = IDF()
idfModel = idf.fit(tf)
tfidf = idfModel.transform(tf)

In [13]:
dataset = names.zip(tfidf)

In [14]:
term_frequencies = names.zip(tf)

In [15]:
documents = names.zip(documents)

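`HashingTF` keeps no vocabulary: each term is hashed to a column index, so two different words can land in the same column. The sketch below illustrates the idea in plain Python with a deliberately tiny feature space. `zlib.crc32` is a stand-in hash chosen for determinism, not the hash Spark uses, and `term_index`, `hashed_tf`, and `NUM_FEATURES` are hypothetical names.

```python
import zlib
from collections import Counter

NUM_FEATURES = 16  # hypothetical, kept tiny so that collisions are plausible

def term_index(term, num_features=NUM_FEATURES):
    """Map a term to a column index (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Sparse term-frequency row as {column index: count}."""
    return dict(Counter(term_index(t, num_features) for t in tokens))

row = hashed_tf(["astronomy", "is", "a", "hobby", "astronomy"])
col = term_index("astronomy")

# row[col] is at least 2; it is larger if another token hashed to the same column.
count_in_column = row[col]
```

One consequence for the lookup that follows: a nonzero entry in the query word's column shows that the document contains some term hashing to that column, which is usually, but not necessarily, the query word itself. A large feature space makes such collisions rare.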
### Word Lookup

In [16]:
query = 'astronomy'

# indexOf returns the column index that HashingTF assigns to a term,
# without launching a Spark job
index = hashingTF.indexOf(query)

In [17]:
filtered = dataset.filter(lambda doc: index in doc[1].indices)

### Matching Document Names and Maximum TF-IDF

In [18]:
names = filtered.map(lambda x: x[0])
print(names.take(10))

['39: albedo', '308: aristotle', '580: astronomer', '737: afghanistan', '748: amateur', '1049: amateur', '1130: avicenna', '1168: anaximander', '1181: astrometry', '1196: angle']


In [19]:
scores = filtered.map(lambda x: x[1][index])

In [20]:
# Can't see how to do this without collect() or a Spark SQL context, since order matters
top_document_name = names.collect()[np.argmax(scores.collect())]

In [21]:
print(documents.filter(lambda x: x[0] == top_document_name).collect())

[('748: amateur', ['030754', 'thumbnnamateur', 'astronomy', 'a', 'subset', 'of', 'astronomy', 'is', 'a', 'hobby', 'whose', 'participants', 'enjoy', 'studying', 'and', 'observing', 'celestial', 'objectsnnthe', 'typical', 'amateur', 'astronomer', 'is', 'one', 'who', 'does', 'not', 'depend', 'on', 'the', 'field', 'of', 'astronomy', 'as', 'a', 'primary', 'source', 'of', 'income', 'or', 'support', 'and', 'does', 'not', 'have', 'a', 'professional', 'degree', 'or', 'advanced', 'academic', 'training', 'many', 'amateurs', 'are', 'beginners', 'while', 'others', 'have', 'a', 'high', 'degree', 'in', 'astronomy', 'and', 'often', 'assist', 'and', 'work', 'alongside', 'professional', 'astronomersnnamateur', 'astronomy', 'is', 'usually', 'associated', 'with', 'viewing', 'the', 'night', 'sky', 'when', 'most', 'celestial', 'objects', 'and', 'events', 'are', 'visible', 'but', 'sometimes', 'amateur', 'astronomers', 'also', 'operate', 'during', 'the', 'day', 'for', 'events', 'such', 'as', 'sunspot', 'and',

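The argmax step in cell 20 relies on `names` and `scores` coming back in the same order. One way to avoid that dependency is to keep each name paired with its score and take the maximum over the pairs. The sketch below shows the idea on a plain Python list with made-up names and scores; PySpark's `RDD.max` also accepts a `key` function, so the same selection could in principle run directly on an RDD of pairs.

```python
# Hypothetical (name, score) pairs standing in for the contents of a filtered RDD.
scored = [
    ("doc-a", 1.7),
    ("doc-b", 9.4),
    ("doc-c", 3.2),
]

# Selecting on the pair keeps each name attached to its own score.
top_name, top_score = max(scored, key=lambda pair: pair[1])
```
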
**Question 2:** Using the same dataset:


**a)** Construct a Document Term Matrix that contains the Term Frequency of each word in the set of documents

**b)** Look up the column index for the *word* you are searching for in the *Document Term Matrix* that you constructed; then, using that index, extract all documents that contain the word (based on the TF score) into an RDD

**c)** Identify the document that has the maximum count for the word you are looking for and output the name of the document along with the term frequency

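Part (c) asks for both the document name and the term frequency. As a small reference for what the result should look like, the sketch below counts exact occurrences on a toy list of named documents, with no hashing involved. The data and the helper `top_by_count` are hypothetical.

```python
from collections import Counter

# Hypothetical (name, tokens) pairs; the real ones come from the dataset.
named_docs = [
    ("doc-1", ["the", "machine", "code"]),
    ("doc-2", ["machine", "learning", "machine", "translation"]),
    ("doc-3", ["ancient", "history"]),
]

def top_by_count(word, docs):
    """Return (name, count) for the document where word occurs most, or None."""
    counts = [(name, Counter(tokens)[word]) for name, tokens in docs]
    matches = [pair for pair in counts if pair[1] > 0]
    return max(matches, key=lambda pair: pair[1]) if matches else None

result = top_by_count("machine", named_docs)
```

Returning `None` when no document contains the word keeps the no-match case explicit instead of raising on an empty maximum.
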
In [22]:
def top_document(query, frequencies):
    # Column index that HashingTF assigns to the query term
    index = hashingTF.indexOf(query)

    # Keep only documents with a nonzero entry in that column
    filtered = frequencies.filter(lambda doc: index in doc[1].indices)
    names = filtered.map(lambda x: x[0])
    scores = filtered.map(lambda x: x[1][index])

    # Both RDDs derive from `filtered`, so names and scores share the same order
    top_document_name = names.collect()[np.argmax(scores.collect())]
    return documents.filter(lambda x: x[0] == top_document_name)

In [23]:
query = 'astronomy'

doc = top_document(query, term_frequencies)
print(doc.collect())

[('748: amateur', ['030754', 'thumbnnamateur', 'astronomy', 'a', 'subset', 'of', 'astronomy', 'is', 'a', 'hobby', 'whose', 'participants', 'enjoy', 'studying', 'and', 'observing', 'celestial', 'objectsnnthe', 'typical', 'amateur', 'astronomer', 'is', 'one', 'who', 'does', 'not', 'depend', 'on', 'the', 'field', 'of', 'astronomy', 'as', 'a', 'primary', 'source', 'of', 'income', 'or', 'support', 'and', 'does', 'not', 'have', 'a', 'professional', 'degree', 'or', 'advanced', 'academic', 'training', 'many', 'amateurs', 'are', 'beginners', 'while', 'others', 'have', 'a', 'high', 'degree', 'in', 'astronomy', 'and', 'often', 'assist', 'and', 'work', 'alongside', 'professional', 'astronomersnnamateur', 'astronomy', 'is', 'usually', 'associated', 'with', 'viewing', 'the', 'night', 'sky', 'when', 'most', 'celestial', 'objects', 'and', 'events', 'are', 'visible', 'but', 'sometimes', 'amateur', 'astronomers', 'also', 'operate', 'during', 'the', 'day', 'for', 'events', 'such', 'as', 'sunspot', 'and',

In [24]:
query = 'machine'

doc = top_document(query, term_frequencies)
print(doc.collect())

[('1368: assembly', ['215845', 'an', 'assembly', 'language', 'is', 'a', 'lowlevel', 'language', 'for', 'programming', 'computer', 'it', 'implements', 'a', 'symbolic', 'representation', 'of', 'the', 'numeric', 'machine', 'code', 'and', 'other', 'constants', 'needed', 'to', 'program', 'a', 'particular', 'cpu', 'architecture', 'this', 'representation', 'is', 'usually', 'defined', 'by', 'the', 'hardware', 'manufacturer', 'and', 'is', 'based', 'on', 'abbreviations', 'called', 'mnemonics', 'that', 'help', 'the', 'programmer', 'remember', 'individual', 'instructions', 'registers', 'etc', 'an', 'assembly', 'language', 'is', 'thus', 'specific', 'to', 'a', 'certain', 'physical', 'or', 'virtual', 'computer', 'architecture', 'as', 'opposed', 'to', 'most', 'highlevel', 'language', 'which', 'are', 'usually', 'portablennassembly', 'languages', 'were', 'first', 'developed', 'in', 'the', '1950s', 'when', 'they', 'were', 'referred', 'to', 'as', 'second', 'generation', 'programming', 'language', 'they', 