# Spark Term Frequency for Search Algorithms

TF-IDF: Term Frequency and Inverse Document Frequency

- Figures out which terms are most relevant to a document

**Term Frequency**: measures how often a word occurs in a document. A word that occurs frequently is probably important to that document's meaning.

**Document Frequency**: measures how often a word occurs across an entire set of documents, e.g., all of Wikipedia or every web page. It picks out common words that appear everywhere regardless of topic, like 'a', 'the', 'and'.

Relevancy of a word to a document: $\frac{\text{Term Frequency}}{\text{Document Frequency}}$

or $\text{Term Frequency} \times \text{Inverse Document Frequency}$

This is how often the word appears in a document, divided by how often it appears everywhere. That gives you a measure of how important and unique the word is for that document.

## In Practice

We use the log of the IDF, since word frequencies are distributed exponentially. That gives us a better weighting of a word's overall popularity.
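As a rough illustration of the log-scaled weighting, the sketch below computes TF-IDF by hand for a tiny made-up corpus. The corpus, the function name `tf_idf`, and the plain `log(N / df)` form of IDF are assumptions chosen for readability; Spark's `IDF` uses a smoothed variant, `log((N + 1) / (df + 1))`, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already split into words.
corpus = {
    "doc_a": ["the", "battle", "of", "gettysburg"],
    "doc_b": ["the", "history", "of", "the", "war"],
    "doc_c": ["the", "cat", "sat"],
}


def tf_idf(term, doc_name):
    """TF-IDF of `term` in one document, using the unsmoothed log(N / df)."""
    words = corpus[doc_name]
    # Term frequency: share of this document's words that are `term`.
    tf = Counter(words)[term] / len(words)
    # Document frequency: number of documents that contain `term`.
    df = sum(1 for doc in corpus.values() if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


# "the" appears in every document, so its IDF is log(1) = 0.
print(tf_idf("the", "doc_a"))
# "gettysburg" appears in only one document, so it gets a positive score.
print(tf_idf("gettysburg", "doc_a"))
```

A word that shows up in every document scores zero no matter how often it occurs, which is the behavior the division by document frequency is meant to produce.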

TF-IDF assumes a document is just a 'bunch of words':
- Parsing documents into a 'bunch of words' can be most of the work
- Words can be represented as a hash value (a number) for efficiency
- What about synonyms? Various tenses? Abbreviations? Capitalizations? Misspellings?
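The points above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Spark does it: the helper names (`tokenize`, `bucket`, `hashed_term_counts`) are hypothetical, and CRC32 is used only as a stable stand-in hash. Lowercasing handles the capitalization question; synonyms, tenses, abbreviations, and misspellings are left untouched.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 100_000  # assumed bucket count for this sketch


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive parser."""
    return text.lower().split()


def bucket(word, num_buckets=NUM_BUCKETS):
    """Map a word to a bucket index, using CRC32 as a stand-in hash."""
    return zlib.crc32(word.encode("utf-8")) % num_buckets


def hashed_term_counts(text):
    """Return {bucket index: count} for one document."""
    return Counter(bucket(word) for word in tokenize(text))


counts = hashed_term_counts("The battle the Battle")
print(counts)
```

Because the number of buckets is fixed, two different words can land in the same bucket. A larger bucket count makes such collisions rarer at the cost of wider vectors.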

A simple search algorithm could be:

1) Compute TF-IDF for every word in a corpus

2) For a given search word, sort the documents by their TF-IDF score for that word

3) Display the results
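The three steps above can be sketched without Spark on a toy corpus. Everything here is an assumption for illustration: the document titles and texts are made up, `build_index` and `search` are hypothetical helper names, and the IDF is the unsmoothed `log(N / df)`.

```python
import math
from collections import Counter


def build_index(corpus):
    """Step 1: compute a TF-IDF score for every (document, word) pair."""
    n_docs = len(corpus)
    # Count, for each word, how many documents contain it.
    doc_freq = Counter(word for words in corpus.values() for word in set(words))
    index = {}
    for name, words in corpus.items():
        counts = Counter(words)
        index[name] = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return index


def search(index, word):
    """Step 2: rank documents by their TF-IDF score for `word`, best first."""
    scored = [(scores.get(word, 0.0), name) for name, scores in index.items()]
    # Drop documents with a zero score: they do not match the word at all.
    return sorted((pair for pair in scored if pair[0] > 0.0), reverse=True)


# Hypothetical three-document corpus, already split into words.
corpus = {
    "Abraham Lincoln": "lincoln gave the gettysburg address".split(),
    "Battle of Gettysburg": "gettysburg gettysburg battle the war".split(),
    "Cats": "the cat sat on the mat".split(),
}

index = build_index(corpus)

# Step 3: display the results.
for score, name in search(index, "gettysburg"):
    print(f"{score:.3f}  {name}")
```

Filtering out zero scores before sorting means a search for a word that matches nothing returns an empty list instead of an arbitrary document.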


# Search Wikipedia with Spark

In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
     317.3/317.3 MB 4.3 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=20bd6245fb4be0db9fd7fce68c6a4dabde0a11a8cecc6d5d1f524c9da9e981d3
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [24]:
from pyspark.mllib.feature import HashingTF, IDF
from pyspark import SparkConf, SparkContext

In [None]:
# Run Spark locally in a single process and name the application.
conf = SparkConf().setMaster("local").setAppName("SparkTFIDF")
sc = SparkContext(conf=conf)

In [40]:
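# Load the tab-separated data, one document per line.
rawData = sc.textFile("subset-small.tsv")
fields = rawData.map(lambda x: x.split("\t"))
# Column 3 is used as the document body; split it into a list of words.
documents = fields.map(lambda x: x[3].split(" "))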

In [33]:
# Column 1 is used as the document name, kept for printing results later.
documentNames = fields.map(lambda x: x[1])

In [34]:
# Hash every word into one of 100,000 buckets and count occurrences per document.
hashingTF = HashingTF(100000)
tf = hashingTF.transform(documents)

In [35]:
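# TF * IDF: cache the term frequencies, since they are used twice below.
tf.cache()
# Terms that appear in fewer than 2 documents are ignored (their IDF is set to 0).
idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)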

In [36]:
# Hash the search term on its own to find which bucket index it maps to.
gettysburgTF = hashingTF.transform(["Gettysburg"])
gettysburgHashValue = int(gettysburgTF.indices[0])

In [37]:
# Pull out each document's TF-IDF score for that bucket.
gettysburgRelevance = tfidf.map(lambda x: x[gettysburgHashValue])

In [38]:
# Pair each score with its document name, giving (score, name) tuples.
zippedResults = gettysburgRelevance.zip(documentNames)

In [39]:
print("Best document for Gettysburg is:")
print(zippedResults.max())

Best document for Gettysburg is:
(0.0, 'Ælle of Sussex')
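A score of 0.0 suggests that the looked-up hash bucket carried no weight in any document, so the title printed here is probably not a real match. When every score ties, `max()` on `(score, name)` tuples falls back to comparing the names, and a title starting with 'Æ' sorts after titles that start with ASCII letters. The sketch below reproduces that effect with made-up titles and scores; only 'Ælle of Sussex' comes from the output above. Two possible causes of all-zero scores, stated as assumptions and not verified here: the term appears in fewer than `minDocFreq` documents, or the notebook process hashes the string differently from the Spark worker processes, since Python 3 randomizes string hashes per process unless `PYTHONHASHSEED` is fixed.

```python
# Hypothetical (score, title) pairs where no document matched the search term.
results = [
    (0.0, "Abraham Lincoln"),
    (0.0, "Zebra"),
    (0.0, "Ælle of Sussex"),
]

# Tuples compare element by element: the scores tie, so the title decides.
best = max(results)
print(best)

# Checking for an all-zero result makes the failed lookup visible.
if best[0] == 0.0:
    print("No document has a non-zero score for this term.")
```

Checking the best score before reporting it avoids presenting an alphabetical tie-break as a search result.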
