In [1]:
import torch

## LSA on toy dataset

Let us look at a toy dataset. Rows correspond to documents and columns correspond to terms. Each cell
contains a term frequency. The terms "gun" and "violence" occur the same number of times in most documents, indicating a clear correlation.

\begin{array}{c|cccc}
  & \text{violence} & \text{gun} & \text{america} & \text{roses} \\
  \hline
  d_{0} & 0 & 0 & 0 & 2 \\
  d_{1} & 1 & 1 & 1 & 0 \\
  d_{2} & 2 & 2 & 0 & 0 \\
  d_{3} & 3 & 3 & 0 & 0 \\
  d_{4} & 5 & 5 & 0 & 0 \\
  d_{5} & 0 & 1 & 0 & 0 \\
  d_{6} & 1 & 0 & 0 & 0 \\
\end{array}

Cosine similarity between document vectors is often used to measure how similar two documents are. In the original term space, however, it only captures direct overlap of terms. The terms "gun" and "violence" are clearly correlated: they appear together in most documents, so a document containing only "gun" should be similar to a document containing only "violence". Cosine similarity on the raw term vectors will not see that; LSA will. In LSA, terms that often occur together become part of the same topic. Documents are projected into topic space (for example, "gun-violence" is a topic), where such indirect similarities become visible.
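The projection into topic space can be written down compactly. The following is a sketch of the standard algebra, not a derivation taken from the notebook; the symbols $A$ (the document-term matrix), $\Sigma$ (the diagonal matrix of singular values), $k$ (the number of topics kept), $A_k$ and $D_k$ are notation assumed here. $U$ and $V$ play the same role as the variables of the same name in the code cells.

\begin{align*}
A &= U \Sigma V^{T} \\
A_k &= U_k \Sigma_k V_k^{T} \\
D_k &= A V_k = U_k \Sigma_k
\end{align*}

Here $U_k$ and $V_k$ keep only the first $k$ columns of $U$ and $V$, and $\Sigma_k$ keeps the $k$ largest singular values. Each row of $D_k$ is one document expressed in terms of the $k$ topics, and the last equality follows because the columns of $V$ are orthonormal.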

In [2]:
terms = ["violence", "gun", "america", "roses"]
doc_term_matrix = torch.tensor([[0, 0, 0, 2],
                                [1, 1, 1, 0],
                                [2, 2, 0, 0],
                                [3, 3, 0, 0],
                                [5, 5, 0, 0],
                                [0, 1, 0, 0],
                                [1, 0, 0, 0]]).float()

In [3]:
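# Let us perform SVD. torch.linalg.svd returns U, the singular values S
# (sorted in descending order), and the transpose of V.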
U, S, V_t = torch.linalg.svd(doc_term_matrix)


print("Principal Values. %0.2f %0.2f %0.2f %0.2f"%(S[0], S[1], S[2], S[3]))

# The columns of V are the topic vectors. Each topic vector can
# be seen as a weighted sum of the terms in our vocabulary.
V = V_t.T


# Let us reduce this to a lower-rank representation.
# There is a big drop in principal (singular) value from S[0] to S[1].
# Hence, we cut off all principal vectors beyond V[:, 0].
# We retain only the first column of V, the principal axis.
rank = 1
U = U[:, :rank]
V = V[:, :rank]

Principal Values. 8.89 2.00 1.00 0.99
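The cutoff above is chosen by eye. One common way to make the choice less ad hoc is to look at the fraction of the squared singular values ("energy") captured by the first k components. The cell below is a minimal sketch of that idea; the 90% threshold and the names `energy`, `threshold` and `chosen_rank` are illustrative assumptions, not part of the original notebook.

```python
import torch

# Same toy matrix as above (rows: documents, columns: terms).
doc_term_matrix = torch.tensor([[0, 0, 0, 2], [1, 1, 1, 0], [2, 2, 0, 0],
                                [3, 3, 0, 0], [5, 5, 0, 0], [0, 1, 0, 0],
                                [1, 0, 0, 0]]).float()

# svdvals returns only the singular values, in descending order.
S = torch.linalg.svdvals(doc_term_matrix)

# energy[i] is the fraction of the total squared singular values
# captured by the first i + 1 components.
energy = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)

# Assumed threshold: keep the smallest rank that captures at least 90%.
threshold = 0.90
chosen_rank = min(int((energy < threshold).sum().item()) + 1, len(S))

print("Cumulative energy per rank:", energy)
print("Smallest rank reaching the threshold:", chosen_rank)
```

On this toy matrix the first component alone clears the assumed threshold, which agrees with the choice of rank = 1 made above. A different threshold could give a different rank.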


In [4]:
# Now that we have reduced the dimensionality to a single topic,
# let us look at the weighted contributions of terms to this topic.
term_topic_affinity = list(zip(terms, V[:, 0]))

# Note that both violence and gun have very high affinity (large magnitude; the
# overall sign of a singular vector is arbitrary) and contribute equally to this topic.
print(term_topic_affinity)

[('violence', tensor(-0.7070)), ('gun', tensor(-0.7070)), ('america', tensor(-0.0181)), ('roses', tensor(1.1381e-09))]


In [5]:
# Let us consider two documents, d5 and d6.

def cosine_similarity(vec_1, vec_2):
    return torch.dot(vec_1, vec_2) / (torch.linalg.norm(vec_1) * torch.linalg.norm(vec_2))


# Note that the similarity between the two documents is 0, even though
# intuitively they are similar: d5 contains only "gun" and d6 only "violence",
# so their term vectors share no nonzero entry.
d5_d6_similarity = cosine_similarity(doc_term_matrix[5], doc_term_matrix[6])
assert d5_d6_similarity == 0
print("Cosine similarity between document 5 and document 6 in original space is {}".format(d5_d6_similarity))

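# Now let us instead look at the document representation in the topic space.
# In this new space, documents 5 and 6 are close. Note that with rank = 1 each
# document is a single number, so the cosine similarity of any two nonzero
# projections is either +1 or -1; here both projections have the same sign.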
doc_topic_matrix = torch.matmul(doc_term_matrix, V)
d5_d6_similarity = cosine_similarity(doc_topic_matrix[5], doc_topic_matrix[6])
print("LSA topic based Cosine similarity between document 5 and document 6 is {}".format(d5_d6_similarity))

Cosine similarity between document 5 and document 6 in original space is 0.0
LSA topic based Cosine similarity between document 5 and document 6 is 1.0
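With a single topic, every document collapses to one number, so the similarity of 1.0 above says little about how well topic space separates unrelated documents. As a hedged illustration (keeping two topics is an assumption made here, not a choice from the notebook above), the sketch below repeats the comparison with two topics and also compares d5 with d0, the "roses" document. The names `V2` and `doc_topic_2` are hypothetical.

```python
import torch

# Same toy matrix as above (rows: documents, columns: terms).
doc_term_matrix = torch.tensor([[0, 0, 0, 2], [1, 1, 1, 0], [2, 2, 0, 0],
                                [3, 3, 0, 0], [5, 5, 0, 0], [0, 1, 0, 0],
                                [1, 0, 0, 0]]).float()

def cosine_similarity(vec_1, vec_2):
    return torch.dot(vec_1, vec_2) / (torch.linalg.norm(vec_1) * torch.linalg.norm(vec_2))

# Reduced SVD; the rows of V_t are the topic vectors.
_, _, V_t = torch.linalg.svd(doc_term_matrix, full_matrices=False)

# Keep the first two topics (assumed here for illustration).
V2 = V_t.T[:, :2]

# Project every document into the two-dimensional topic space.
doc_topic_2 = doc_term_matrix @ V2

sim_d5_d6 = cosine_similarity(doc_topic_2[5], doc_topic_2[6])
sim_d0_d5 = cosine_similarity(doc_topic_2[0], doc_topic_2[5])

print("Rank-2 similarity between d5 and d6: {:.4f}".format(sim_d5_d6.item()))
print("Rank-2 similarity between d0 and d5: {:.4f}".format(sim_d0_d5.item()))
```

In this sketch d5 and d6 should remain close, while d0 should come out nearly orthogonal to d5, since "roses" never co-occurs with "gun" or "violence" in the toy data.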
