#### TFIDF Passage Retrieval

A simple and efficient way to retrieve relevant documents (e.g. sentences or paragraphs) from a document store is `TFIDF`. Given a collection of documents, we can create a TFIDF vector for each document, i.e. a vector of TFIDF weights with one entry for each word in the vocabulary.

Term-frequency for term $t$ in document $d$: 

$TF(t,d) = 1 + \log_{10}(\text{count}(t,d))$

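where $\text{count}(t,d)$ is the number of times term $t$ appears in document $d$. The formula applies when $\text{count}(t,d) > 0$; if the term does not appear in the document, we set $TF(t,d) = 0$, since the logarithm is undefined at zero. (Note that we take the log to compress the range of count values.)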

Inverse-document-frequency for term $t$:

$IDF(t) = \log_{10}(\frac{N}{\text{df}(t)})$

where $N$ is the total number of documents in the collection and $\text{df}(t)$ is the number of documents in which term $t$ occurs. The TFIDF weight is then given by the following product:

$TFIDF(t,d) = TF(t,d) IDF(t)$
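
As a minimal sketch of the three formulas above, the following computes TFIDF weights for a tiny toy collection. The function name `tfidf_vectors`, the whitespace tokenization, and the example sentences are illustrative assumptions, not part of the original method. Each document vector is stored sparsely as a dictionary, so terms that do not occur in a document are simply absent (weight 0).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (list of {term: weight}, idf dict)."""
    n_docs = len(docs)
    # df(t): number of documents in which term t occurs
    df = Counter(term for doc in docs for term in set(doc))
    # IDF(t) = log10(N / df(t))
    idf = {t: math.log10(n_docs / df[t]) for t in df}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # TFIDF(t, d) = (1 + log10(count(t, d))) * IDF(t), only for terms with count > 0
        vectors.append({t: (1 + math.log10(c)) * idf[t] for t, c in counts.items()})
    return vectors, idf

# Toy collection (assumed for illustration), tokenized by whitespace
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
vectors, idf = tfidf_vectors(docs)
```

Here `the` occurs in two of the three documents, so its IDF is $\log_{10}(3/2)$, while `mat` occurs in only one and gets the larger IDF $\log_{10}(3)$.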


Given a query $q$, which is a sequence of terms, we can construct a TFIDF vector for it in the same way. However, since queries are usually short and are likely to contain a single occurrence of each unique term, we can simplify the query's TFIDF vector by setting the weight for each unique query term to 1. We can then compute a score for each document as the cosine similarity between the query vector and the corresponding document TFIDF vector $d$:

$score(q,d) = \frac{q \cdot d}{|q| |d|}$

Since $|q|$ is the same for every document, we can ignore it because it does not affect the ranking of the documents. Then, using our simplifying assumption that $q$ is a vector of binary weights, the dot product $q \cdot d$ reduces to a sum over the query terms, and we have the following document score function:

$score(q,d) = \frac{\sum_{t\in q} TFIDF(t,d)}{\sqrt{\sum_{t\in d} TFIDF(t,d)^2}}$

where the square root term in the denominator is the norm of the document TFIDF vector. So the score for each document is just the sum of the TFIDF weights of the query terms that also appear in that document, divided by the norm of that document's TFIDF vector.
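
The score function above can be sketched directly in code. This is a minimal illustration, assuming a document vector is stored as a dictionary from term to TFIDF weight; the name `score` and the example weights are made up for the example.

```python
import math

def score(query_terms, doc_vector):
    """Sum of the document's TFIDF weights for the query terms, divided by the document norm."""
    norm = math.sqrt(sum(w * w for w in doc_vector.values()))
    if norm == 0.0:
        return 0.0  # a document with all-zero weights cannot match anything
    # Each unique query term has weight 1, so the dot product is a plain sum
    return sum(doc_vector.get(t, 0.0) for t in set(query_terms)) / norm

# Illustrative weights: the norm is sqrt(0.3^2 + 0.4^2) = 0.5
doc_vector = {"cat": 0.3, "mat": 0.4}
s = score(["cat", "hat"], doc_vector)  # only "cat" matches, giving 0.3 / 0.5
```

Query terms that do not appear in the document, such as `hat` here, contribute nothing to the numerator.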

A given query word will typically occur in only a small fraction of the documents, so we only need to score documents that contain at least one query word, rather than iterating over every document in the collection. To do this we can maintain an `inverted index`: a dictionary that maps each unique word to a list of tuples, each containing a document and the TFIDF weight of that word in that document.

e.g. `inverted_index = {'w1': [(d1, TFIDF(w1,d1)), (d2, TFIDF(w1,d2)), ...], 'w2': ...}`

This data structure will allow us to compute and rank the document scores very efficiently.
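
A minimal sketch of retrieval with an inverted index follows. The function names `build_index` and `retrieve`, the parameter `k`, and the example weights are assumptions for illustration. The sketch also precomputes each document's norm once at indexing time, which is one possible design choice for making the denominator of the score available without revisiting the document.

```python
import math
from collections import defaultdict

def build_index(doc_vectors):
    """doc_vectors: {doc_id: {term: weight}} -> (inverted index, document norms)."""
    index = defaultdict(list)
    norms = {}
    for doc_id, vec in doc_vectors.items():
        norms[doc_id] = math.sqrt(sum(w * w for w in vec.values()))
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index, norms

def retrieve(query_terms, index, norms, k=3):
    """Return up to k (doc_id, score) pairs, highest score first."""
    totals = defaultdict(float)
    # Only documents listed under some query term are ever touched
    for term in set(query_terms):
        for doc_id, w in index.get(term, []):
            totals[doc_id] += w
    ranked = sorted(
        ((doc_id, total / norms[doc_id])
         for doc_id, total in totals.items() if norms[doc_id] > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]

# Illustrative TFIDF weights for three documents
doc_vectors = {
    "d1": {"cat": 0.3, "mat": 0.4},
    "d2": {"cat": 0.6, "dog": 0.8},
    "d3": {"pets": 1.0},
}
index, norms = build_index(doc_vectors)
results = retrieve(["cat", "mat"], index, norms)
```

In this example `d3` shares no term with the query, so it is never visited and does not appear in `results`; `d1` ranks above `d2` because it matches both query terms.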
