<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/math-and-architectures-of-deep-learning/04-linear-algebra/06_document_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Document Retrieval

Suppose we have a set of documents $\{d_0, \ldots, d_6\}$.

Given an incoming query phrase, we have to retrieve the documents that match it.
We will use the bag-of-words model: our matching approach ignores where a word
appears in a document and only considers how many times the word appears in
that document.

Our documents are:

* $d_0$: Roses are lovely. Nobody hates roses.
* $d_1$: **Gun violence** has reached an epidemic proportion in America.
* $d_2$: The issue of **gun violence** is really over-hyped. One can find many instances of **violence** where no **guns** were involved.
* $d_3$: **Guns** are for **violence** prone people. **Violence** begets **guns**. **Guns** beget **violence**.
* $d_4$: I like **guns** but I hate **violence**. I have never been involved in **violence**. But I own many **guns**. **Gun violence** is incomprehensible to me. I do believe **gun** owners are the most anti **violence** people on the planet. He who never uses a **gun** will be prone to senseless **violence**.
* $d_5$: **Guns** were used in an armed robbery in San Francisco last night.
* $d_6$: Acts of **violence** usually involve a weapon.

Let us look at this toy dataset as a term-frequency matrix. Rows correspond to documents and columns correspond to terms. Each cell
contains the number of times the term occurs in the document. The terms “gun” and “violence” occur an equal number of times in most documents, indicating a clear correlation.

$$
\begin{array}{c|cccc}
  & \text{violence} & \text{gun} & \text{america} & \text{roses} \\
  \hline
  d_{0} & 0 & 0 & 0 & 2 \\
  d_{1} & 1 & 1 & 1 & 0 \\
  d_{2} & 2 & 2 & 0 & 0 \\
  d_{3} & 3 & 3 & 0 & 0 \\
  d_{4} & 5 & 5 & 0 & 0 \\
  d_{5} & 0 & 1 & 0 & 0 \\
  d_{6} & 1 & 0 & 0 & 0 \\
\end{array}
$$
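
As a rough illustration of how such term frequencies could be counted, here is a minimal bag-of-words sketch using only the Python standard library. The `normalize` map is an assumption that stands in for a real stemmer (it only folds "guns" into "gun"), and `term_frequencies` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re
from collections import Counter

# Toy corpus, taken from the list of documents above.
documents = [
    "Roses are lovely. Nobody hates roses.",
    "Gun violence has reached an epidemic proportion in America.",
    "The issue of gun violence is really over-hyped. One can find many "
    "instances of violence where no guns were involved.",
    "Guns are for violence prone people. Violence begets guns. "
    "Guns beget violence.",
    "I like guns but I hate violence. I have never been involved in violence. "
    "But I own many guns. Gun violence is incomprehensible to me. "
    "I do believe gun owners are the most anti violence people on the planet. "
    "He who never uses a gun will be prone to senseless violence.",
    "Guns were used in an armed robbery in San Francisco last night.",
    "Acts of violence usually involve a weapon.",
]

terms = ["violence", "gun", "america", "roses"]

# Assumption: a hand-written map stands in for a real stemmer/lemmatizer.
normalize = {"guns": "gun"}


def term_frequencies(text):
    """Count vocabulary terms in `text`, ignoring case and word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(normalize.get(token, token) for token in tokens)
    return [counts[term] for term in terms]


tf_matrix = [term_frequencies(doc) for doc in documents]
for i, row in enumerate(tf_matrix):
    print(f"d_{i}: {row}")
```

Because only counts are kept, shuffling the words inside a document would leave its row unchanged.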

Cosine similarity between document vectors is often used to measure how similar two documents are. However, cosine similarity only considers direct overlap of terms.

The terms "gun" and "violence" are clearly correlated: they appear together in many documents, so a document containing "gun" should be similar to a document containing "violence".

Cosine similarity will not see that. Latent semantic analysis (LSA) will. In LSA, terms that often occur together become part of the same topic.

Documents are projected into topic space (e.g., "gun-violence" is a topic), where indirect similarities become visible.
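
One way to put a number on the correlation between the two terms is to compare the term columns of the table above instead of the document rows. The sketch below does this in plain Python; the `cosine` helper and the column variables are illustrative names, not part of the notebook.

```python
import math

# Term-frequency columns from the table above, one entry per document d_0..d_6.
violence = [0, 1, 2, 3, 5, 0, 1]
gun = [0, 1, 2, 3, 5, 1, 0]
roses = [2, 0, 0, 0, 0, 0, 0]


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(f"violence vs gun:   {cosine(violence, gun):.3f}")    # 0.975
print(f"violence vs roses: {cosine(violence, roses):.3f}")  # 0.000
```

The two term columns are almost parallel, which is the kind of co-occurrence that LSA folds into a single topic.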

##Setup

In [1]:
import torch

In [10]:
def cosine_similarity(vec_1, vec_2):
  return torch.dot(vec_1, vec_2) / (torch.linalg.norm(vec_1) * torch.linalg.norm(vec_2))

##SVD and LSA

In [4]:
# Consider only 4 terms for simplicity
terms = ["violence", "gun", "america", "roses"]

# Pre-computed document-term matrix: each row describes a document and
# each column contains the term frequency (TF) of one term.
# IDF is ignored for simplicity.
doc_term_matrix = torch.tensor([
  [0, 0, 0, 2], 
  [1, 1, 1, 0], 
  [2, 2, 0, 0], 
  [3, 3, 0, 0], 
  [5, 5, 0, 0], 
  [0, 1, 0, 0], 
  [1, 0, 0, 0]
]).float()

Let's perform SVD on the doc-term matrix $X$. The columns of the
resulting matrix `V` correspond to topics.

These are eigenvectors of $X^TX$, i.e., the principal vectors of the
doc-term matrix.

Thus, a topic corresponds to a direction of maximum spread in the document feature
space (strictly, the largest sum of squared projections, since the matrix is not mean-centered).

In [5]:
# Let us perform SVD
U, S, V_t = torch.linalg.svd(doc_term_matrix)
print(f"Principal Values: {S[0]:.2f} {S[1]:.2f} {S[2]:.2f} {S[3]:.2f}")

Principal Values: 8.89 2.00 1.00 0.99
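
The statement that the topic vectors are eigenvectors of $X^TX$ can be checked numerically. The following is a small self-contained sketch: it rebuilds the same matrix in `float64` (an assumption made only to tighten the comparison) and compares the squared principal values with the eigenvalues of $X^TX$.

```python
import torch

# Same doc-term matrix as above, in float64 for a tighter numerical comparison.
X = torch.tensor([
    [0, 0, 0, 2],
    [1, 1, 1, 0],
    [2, 2, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=torch.float64)

_, S, V_t = torch.linalg.svd(X)
gram = X.T @ X

# eigvalsh returns eigenvalues in ascending order, so flip them to match
# the descending order of the principal values.
eigenvalues = torch.linalg.eigvalsh(gram).flip(0)
print(S ** 2)
print(eigenvalues)

# Column j of V should satisfy (X^T X) v_j = S[j]^2 * v_j.
V = V_t.T
residual = gram @ V - V * S ** 2
print(residual.abs().max())
```

The two printed vectors should agree up to rounding, and the residual should be close to zero.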


The columns of V are the topic vectors. Each topic vector can be seen as a weighted sum of the terms in our vocabulary.

In [6]:
V = V_t.T

`S` holds the principal values (the singular values of the doc-term matrix) as a 1-D tensor. These
signify topic weights (importance).

We choose a cut-off and discard all topics below that weight. This is dimensionality reduction.

Let us reduce this to a lower-rank representation.

There is a big drop in principal value from `S[0]` to `S[1]`. Hence, we choose to cut off all principal vectors beyond `V[:, 0]`.

We will retain only the first column of `V`, the principal axis.

In [7]:
rank = 1
U = U[:, :rank]
V = V[:, :rank]
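
To get a feel for how much is lost by keeping a single topic, one can compare the doc-term matrix with its rank-1 reconstruction. This is only a sketch: it recomputes the SVD on its own copy of the matrix (in `float64`, with `full_matrices=False`) and the variable names are illustrative.

```python
import torch

X = torch.tensor([
    [0, 0, 0, 2],
    [1, 1, 1, 0],
    [2, 2, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=torch.float64)

U, S, V_t = torch.linalg.svd(X, full_matrices=False)

rank = 1
# Rank-1 reconstruction built from the leading singular triplet.
X_approx = U[:, :rank] @ torch.diag(S[:rank]) @ V_t[:rank, :]

# Share of the total squared weight carried by the retained topic.
retained = (S[:rank] ** 2).sum() / (S ** 2).sum()

# Frobenius-norm error; it should match the discarded principal values.
error = torch.linalg.norm(X - X_approx)
expected_error = torch.sqrt((S[rank:] ** 2).sum())

print(f"Fraction of squared weight retained: {retained:.3f}")
print(f"Reconstruction error: {error:.3f} (from discarded values: {expected_error:.3f})")
```

For this matrix the first topic carries roughly 93% of the total squared weight, which is consistent with the large drop after `S[0]`.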

Now that we have reduced the dimensionality to a single topic, let us look at the weighted contributions of terms to this topic.

Note that both "violence" and "gun" have very high affinity (in magnitude) and contribute equally to this topic.

In [9]:
term_topic_affinity = list(zip(terms, V[:, 0]))
print(term_topic_affinity)

[('violence', tensor(-0.7070)), ('gun', tensor(-0.7070)), ('america', tensor(-0.0181)), ('roses', tensor(1.1381e-09))]
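
The negative signs in the output above carry no meaning of their own: a singular vector is only determined up to sign, so what matters is the magnitude and the relative sign of the entries. As a presentation choice (an assumption, not a part of LSA), the sketch below flips the topic vector so that its largest-magnitude entry is positive.

```python
import torch

terms = ["violence", "gun", "america", "roses"]
X = torch.tensor([
    [0, 0, 0, 2],
    [1, 1, 1, 0],
    [2, 2, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]).float()

_, _, V_t = torch.linalg.svd(X)
topic = V_t[0]  # first row of V^T, i.e. the first column of V

# Flip the sign so that the largest-magnitude entry is positive.
largest = topic[topic.abs().argmax()]
topic = topic * torch.sign(largest)

for term, weight in zip(terms, topic):
    print(f"{term:>8}: {weight.item():+.3f}")
```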


Let us consider two documents, $d_5$ and $d_6$.

Note that the cosine similarity between them in the original term space is 0, because they share no term, even though intuitively they are similar.

In [11]:
d5_d6_similarity = cosine_similarity(doc_term_matrix[5], doc_term_matrix[6])
assert d5_d6_similarity == 0
print(f"Cosine similarity between document 5 and document 6 in original space is {d5_d6_similarity}")

Cosine similarity between document 5 and document 6 in original space is 0.0


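Now let us instead look at the document representations in the topic space.

In this new space, documents 5 and 6 are close. Since only one topic is retained, each document is described by a single number, so the cosine similarity is 1 whenever two documents load on the topic with the same sign.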

In [12]:
doc_topic_matrix = torch.matmul(doc_term_matrix, V)
d5_d6_similarity = cosine_similarity(doc_topic_matrix[5], doc_topic_matrix[6])
print(f"LSA topic based Cosine similarity between document 5 and document 6 is {d5_d6_similarity}")

LSA topic based Cosine similarity between document 5 and document 6 is 1.0
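
The same projection can be applied to an incoming query, which is the retrieval task stated at the start. The sketch below is illustrative only: the one-word query "gun" is hypothetical, and it keeps two topics instead of one (an assumption) so that documents can point in different directions in topic space.

```python
import torch

X = torch.tensor([
    [0, 0, 0, 2],
    [1, 1, 1, 0],
    [2, 2, 0, 0],
    [3, 3, 0, 0],
    [5, 5, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=torch.float64)

_, _, V_t = torch.linalg.svd(X)

# Assumption: retain two topics rather than one.
rank = 2
V = V_t.T[:, :rank]

# Hypothetical query containing only the term "gun"
# (column order: violence, gun, america, roses).
query = torch.tensor([0, 1, 0, 0], dtype=torch.float64)

doc_topics = X @ V       # shape (7, 2)
query_topic = query @ V  # shape (2,)

# Cosine similarity between the query and every document in topic space.
similarities = (doc_topics @ query_topic) / (
    torch.linalg.norm(doc_topics, dim=1) * torch.linalg.norm(query_topic)
)

for i in torch.argsort(similarities, descending=True).tolist():
    print(f"d_{i}: {similarities[i].item():.3f}")
```

With these settings the query matches $d_6$, which never mentions "gun", while $d_0$ (the document about roses) gets a similarity of approximately 0.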


The cosine similarity between document vectors does not look at such secondary evidence.

This is the blind spot that LSA tries to overcome.

Words are known by the company they keep: if terms appear together in many
documents, they are likely to share some semantic similarity, as "gun" and
"violence" do in the examples above. Such terms should be grouped together into a common
pool of semantically similar terms. We will call this pool a topic.

Document similarity should be measured in terms of common topics rather than
common terms. Thus, we have expanded the notion of shared terms between
documents to shared topics between documents.