## Generate Similarity Matrix

Documents are vectorized using Term Frequency / Inverse Document Frequency (TF-IDF) and stored in a Term-Document (TD) Matrix. Each document is represented by a vector of TF-IDF values. The size of the TF-IDF vector is equal to the size of the vocabulary.

We then generate a similarity matrix S of cosine similarities by multiplying the L2-normalized TD matrix with its transpose. Each entry S<sub>ij</sub> of the similarity matrix represents the similarity between document<sub>i</sub> and document<sub>j</sub>.
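
To see why this product yields cosine similarities, here is a minimal NumPy sketch on a made-up 3 x 4 matrix. The values and the names `toy_td` and `toy_S` are illustrative assumptions, not taken from the corpus. Once every row has unit L2 norm, the dot product of two rows equals the cosine of the angle between them.

```python
import numpy as np

# Hypothetical toy term-document matrix: 3 documents x 4 terms.
toy_td = np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as document 0
    [0.0, 0.0, 3.0, 4.0],   # shares no terms with documents 0 and 1
])

# L2-normalize each row so that it has unit length.
row_norms = np.linalg.norm(toy_td, axis=1, keepdims=True)
toy_td_normed = toy_td / row_norms

# Multiplying by the transpose gives all pairwise cosine similarities.
toy_S = toy_td_normed @ toy_td_normed.T
print(np.round(toy_S, 3))
```

Documents 0 and 1 point in the same direction and get similarity 1, while document 2 shares no terms with them and gets similarity 0.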

The resulting similarity matrix forms the basis for computing the **random walk** generation probabilities. To prevent self-loops, the diagonal of the similarity matrix is set to 0.

This notebook contains the steps needed to build a TD matrix, and prepare and save the similarity matrix S.

In [1]:
import numpy as np
import os

from scipy.sparse import save_npz
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
DATA_DIR = "../data"
TEXT_FILEPATH = os.path.join(DATA_DIR, "texts.tsv")

TD_MATRIX_FILEPATH = os.path.join(DATA_DIR, "tdmatrix.npz")
COSIM_FILEPATH = os.path.join(DATA_DIR, "cosim.npy")

NUM_TOP_GENERATORS = 80

### Read text into memory

We will read the `texts.tsv` file and save the text into a local list object. We also save the `doc_id` values into a list, so we can correlate a row in the TD matrix with an actual `doc_id` later.

In [3]:
doc_ids, texts = [], []
with open(TEXT_FILEPATH, "r") as ftext:
    for lid, line in enumerate(ftext, start=1):
        cols = line.strip().split('\t')
        if len(cols) != 2:
            # report and skip malformed lines instead of appending stale values
            print("line {:d}, num cols {:d}".format(lid, len(cols)))
            continue
        doc_id, text = cols
        doc_ids.append(doc_id)
        texts.append(text)

### Vectorize and create TD Matrix

We declare a Scikit-Learn TF-IDF Vectorizer with:
* minimum document frequency -- token must appear in at least 5 documents to be counted in the vocabulary.
* L2 normalization -- each row is normalized by the square root of the sum of its squared elements. 

L2 normalization is applied at vectorization time because, once every row has unit length, the cosine similarity we compute later reduces to a plain dot product.
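
As a small illustration of these two settings, the sketch below runs `TfidfVectorizer` on a made-up three-sentence corpus. The sentences and the lower threshold `min_df=2` are assumptions chosen so the effect is visible on so little text. Tokens that appear in fewer than two documents are dropped, and a document left with no tokens ends up as an all-zero row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real notebook reads texts.tsv instead.
toy_texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# min_df=2 here (not 5) only because the toy corpus has three documents.
toy_vectorizer = TfidfVectorizer(min_df=2, norm="l2")
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

# Tokens that survived the document-frequency filter.
vocab = sorted(toy_vectorizer.vocabulary_)

# L2 norm of every row, computed from the sparse matrix.
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()

print(vocab)
print(row_norms)
```

The first two rows have unit norm. The third document contains none of the retained tokens, so its row is all zeros, which is the case the renormalization step later has to guard against.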

The TD matrix is a `scipy.sparse` matrix. We save it as an `.npz` file to use for evaluation later.

In [4]:
vectorizer = TfidfVectorizer(min_df=5, norm="l2")
td_matrix = vectorizer.fit_transform(texts)

save_npz(TD_MATRIX_FILEPATH, td_matrix)

print(td_matrix.shape)

(18810, 28813)


### Create Similarity Matrix

The TD Matrix represents a corpus of 18810 documents, each represented by a vector of 28813 TF-IDF token features.

We can generate a (18810, 18810) document-document similarity matrix by multiplying the TD matrix with its transpose.

In [5]:
# matrix product of the L2-normalized rows gives pairwise cosine similarities
S = td_matrix @ td_matrix.T
print(S.shape)

(18810, 18810)


### Retain only top generators

We want to sparsify the similarity matrix by considering only the **top generators** (Kurland and Lee, 2005). The paper mentions that using 80 top generators gives good results downstream.

So for each row (document), we will discard all elements except for the ones whose values are within the top 80 values for the row.

`np.argpartition` returns indices that partition each row (along the direction given by `axis`) around a chosen position, so the indices of the top N values can be sliced off without a full sort. The top N are not returned in sorted order, but we don't need them sorted for our application, so that's fine. This is faster than the naive sort-and-slice approach.
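
The idea can be sketched on a small made-up matrix. The values and the names `toy_S`, `top_n` and `toy_sparse` are illustrative assumptions. Here we keep the top 2 values in each row and zero out the rest, using `np.put_along_axis` in place of an explicit loop.

```python
import numpy as np

# Hypothetical similarity rows; keep the top 2 values in each row.
toy_S = np.array([
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.4, 0.6],
])
top_n = 2
num_to_discard = toy_S.shape[1] - top_n

# After partitioning at position -top_n, the first num_to_discard
# column indices of each row point at the smallest values (unsorted).
discard_idx = np.argpartition(toy_S, -top_n, axis=1)[:, :num_to_discard]

# Zero out the discarded entries on a copy, leaving toy_S untouched.
toy_sparse = toy_S.copy()
np.put_along_axis(toy_sparse, discard_idx, 0.0, axis=1)
print(toy_sparse)
```

When a row contains ties at the cut-off, which of the tied entries survives is not specified.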

In [6]:
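S = S.todense()
num_to_discard = S.shape[1] - NUM_TOP_GENERATORS
# column indices of the (n - NUM_TOP_GENERATORS) smallest values in each row, unsorted
zero_indices = np.asarray(
    np.argpartition(S, -NUM_TOP_GENERATORS, axis=1))[:, 0:num_to_discard]
# zero out one whole row of discarded entries at a time
for i in range(zero_indices.shape[0]):
    S[i, zero_indices[i]] = 0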

### Remove self-loops

The algorithm calls for generating random walks on the similarity graph, i.e., the graph generated by considering the similarity matrix S as an adjacency matrix. 

In this similarity graph, each node represents a document and each edge weight represents the probability of transitioning from the source document to the target document. The cosine similarity is a number between 0 and 1 and can be thought of as a proxy for this transition probability.

We will execute random walks on this graph to estimate the generation probabilities, i.e., the probability of generating one document from another. We don't want a walk to step from a node back to itself, so self-transitions are given zero probability, as shown in the equation below.

$$
g(d_i \mid d_j) = \begin{cases} 0 & \text{if } i = j \\ p(d_i \mid d_j) & \text{otherwise} \end{cases}
$$
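
In matrix terms, the rule above amounts to zeroing the diagonal of S. A minimal sketch on a made-up 3 x 3 matrix (the values are illustrative assumptions) using `np.fill_diagonal`, which modifies the array in place:

```python
import numpy as np

# Hypothetical symmetric similarity matrix with 1.0 on the diagonal.
toy_S = np.array([
    [1.0, 0.4, 0.2],
    [0.4, 1.0, 0.7],
    [0.2, 0.7, 1.0],
])

# Remove self-loops: set every S[i, i] to 0 in place.
np.fill_diagonal(toy_S, 0.0)
print(toy_S)
```

The off-diagonal entries are left unchanged.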


In [7]:
for i in range(S.shape[0]):
    S[i, i] = 0

### Renormalize Similarity Matrix

In order for the resulting matrix to represent transition probabilities, we have to re-normalize the similarity matrix so the remaining elements sum to 1 across every row (unless all elements in the row are 0, in which case they sum to 0).
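
A minimal sketch of this renormalization on a made-up matrix (values and names are illustrative assumptions). The all-zero row is guarded by dividing by 1 instead of 0, so it stays all zero rather than becoming NaN.

```python
import numpy as np

# Hypothetical similarity matrix after sparsifying and removing self-loops.
toy_S = np.array([
    [0.0, 0.3, 0.1],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0],   # a document with no surviving neighbors
])

# Row sums as a column vector so the division broadcasts across each row.
row_sums = toy_S.sum(axis=1, keepdims=True)

# Replace zero sums with 1 to avoid division by zero.
safe_sums = np.where(row_sums == 0, 1.0, row_sums)
toy_P = toy_S / safe_sums
print(toy_P)
```

Each non-empty row of `toy_P` now sums to 1 and can be read as a distribution over next documents.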

We will save the similarity matrix for the next step in the process.

In [8]:
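S_rowsum = np.sum(S, axis=1).reshape(S.shape[0], 1)
# avoid division by zero for rows whose elements are all 0
S_rowsum[S_rowsum == 0] = 1e-19
# divide by the guarded row sums, not the raw ones
Snorm = S / S_rowsum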

np.save(COSIM_FILEPATH, Snorm)

