# 3. Document Embedding

In the previous section, we represented variable-length texts as fixed-length numeric vectors. The approach used so far is the traditional Bag of Words (BoW), which tokenizes a text into words (tokens), ignores the order of the tokens, and may preserve their counts. The resulting vectors are high-dimensional and very sparse, which can lead to overfitting and high time complexity.
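To make the sparsity concrete, here is a minimal sketch of a count-based BoW vectorizer. The function name `bag_of_words`, the whitespace tokenization, and the two toy documents are assumptions made for illustration; they are not the implementation used in the previous section.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct token, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One entry per vocabulary word; word order in the text is lost.
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Even with two short documents, several entries are zero. With a realistic vocabulary of tens of thousands of words, almost every entry of each vector would be zero.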

A more modern approach to text vectorization is word embedding (often simply called embedding), which relies on neural representations. This approach takes distributional semantics into account; that is, a word's meaning is given by the words that frequently appear close by. Hence, we can construct a word's context from the set of words that appear within a fixed-size window around it.
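The fixed-size window can be sketched in a few lines. The helper `context_windows` and the example sentence are hypothetical and only illustrate how (word, context) pairs might be collected before any model is trained.

```python
def context_windows(tokens, window=2):
    """Pair each token with the tokens at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((target, left + right))
    return pairs

tokens = "the quick brown fox jumps".split()
for target, context in context_windows(tokens, window=1):
    print(target, context)
```

Words that share many context words across a large corpus end up with similar embeddings.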

Semantically similar texts then appear closer to each other in the vector space. We may also be able to capture semantic operations with operations in the vector space; for example, the similarity between two texts can be measured by the dot product of their vectors. We can also perform algebraic operations; for example,

$\text{vector("King")} - \text{vector("Man")} + \text{vector("Woman")} \approx \text{vector("Queen")}$.
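The following sketch illustrates the analogy with hand-made vectors. The three-dimensional values in `toy` are invented so that the arithmetic works out; real embeddings are learned from data and have hundreds of dimensions, and the relation holds only approximately.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up 3-dimensional "embeddings", chosen by hand for illustration only.
toy = {
    "king":  [0.9, 0.9, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
}

# vector("King") - vector("Man") + vector("Woman")
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Find the word whose vector points in the most similar direction.
nearest = max(toy, key=lambda word: cosine(analogy, toy[word]))
print(nearest)
```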

Modern representations are typically learned from vast bodies of text, often with deep neural networks, and are typically distributed as pre-trained models.


## 3.1 Universal Sentence Encoder

The Universal Sentence Encoder (USE) was published by Google in 2018. It maps a word, sentence, or short paragraph to a fixed-length (typically 512-dimensional) numeric vector. As a result, semantically similar sentences are placed closer to each other in the embedding space.

The embeddings are typically computed from raw text, so no pre-processing is involved. The resulting sentence embeddings can then be used for downstream applications, e.g., classification, clustering, and language prediction.

USE is a pre-trained model trained on a variety of data, e.g., Wikipedia and books. It was trained with a deep averaging network (DAN) encoder; more information on the process behind USE can be found at https://arxiv.org/pdf/1803.11175.pdf.

To utilize USE, we can take one of three approaches:
<ol>
<li>We could take our desired document, split it into a collection of sentences, and then map each sentence to its respective vector;</li>
<li>We could treat each document as a short paragraph and map each document to its respective vector; or</li>
<li>We could take a similar approach to #1, but then aggregate the sentence vectors of each document to form a single vector per document.</li>
</ol>
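The three approaches can be sketched as follows. This is only an outline under stated assumptions: `embed_fn` stands for any function that maps a list of strings to a list of vectors, `toy_embed` is a hypothetical stand-in so that the example runs without downloading a model, the sentence splitter is deliberately naive, and averaging is just one possible way to aggregate.

```python
def split_sentences(document):
    """Naive sentence splitter (an assumption): break on periods."""
    return [s.strip() for s in document.split(".") if s.strip()]

def per_sentence(document, embed_fn):
    """Approach 1: one vector per sentence."""
    return embed_fn(split_sentences(document))

def per_document(document, embed_fn):
    """Approach 2: treat the whole document as one short paragraph."""
    return embed_fn([document])[0]

def mean_of_sentences(document, embed_fn):
    """Approach 3: average the sentence vectors into one document vector."""
    vectors = per_sentence(document, embed_fn)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def toy_embed(texts):
    """Hypothetical stand-in for a real encoder: [characters, words]."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]

doc = "Poverty fell. Jobs grew fast."
print(per_sentence(doc, toy_embed))
print(per_document(doc, toy_embed))
print(mean_of_sentences(doc, toy_embed))
```

With a real encoder, `toy_embed` would be replaced by a wrapper around the model that converts its output to plain lists. Approach 1 keeps the most detail but produces a variable number of vectors per document, while approaches 2 and 3 give exactly one vector per document.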

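To install the required packages and load USE, run the following code:


In [None]:
# Notebook magic; in a terminal, run the same command without the leading %.
%pip install tensorflow tensorflow_hub

import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder (version 4) from TensorFlow Hub.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
print("module %s loaded" % module_url)

def embed(texts):
    # Map a list of strings to their embedding vectors (one row per string).
    return model(texts)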


Note that the first time you run this, it may take some time (5+ minutes) to complete, since the model must first be downloaded.

## 3.2 t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a method of dimension reduction (like PCA, although t-SNE is nonlinear). It is most often used to project high-dimensional features, such as text embeddings, into two or three dimensions for visualization. t-SNE converts the distances between points into probabilities, using a Student's t-distribution in the low-dimensional space; it then uses some randomization (hence Stochastic) to embed the points, paying particular attention to the neighbors of each point.
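For readers who want slightly more detail, the standard formulation of t-SNE (due to van der Maaten and Hinton) can be summarized as follows. The symbols are introduced here for illustration and are not used elsewhere in this section: $x_i$ are the $n$ original high-dimensional points, $y_i$ are their low-dimensional images, and $\sigma_i$ is a per-point bandwidth set from the chosen perplexity.

Similarities in the original space are modeled with a Gaussian and then symmetrized:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

Similarities in the low-dimensional space use a Student's t-distribution with one degree of freedom:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

The points $y_i$ start from a random initialization and are moved by gradient descent to minimize the Kullback-Leibler divergence between the two distributions:

$$C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Because $p_{ij}$ is large only for nearby points, the cost mainly penalizes placing true neighbors far apart, which is why the method preserves local neighborhoods.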

We will not discuss the specifics of t-SNE further, but it can nonetheless be used for document embedding. More explanation of t-SNE can be found on the scikit-learn website, https://scikit-learn.org/stable/modules/manifold.html#t-sne.

## 3.3 More Exercises

<b>Exercise 3.1</b>: Take two documents, one labeled as SDG 1 and the other as SDG 8. Segment each document into sentences, compute the embedding of each sentence, and find the dot products between the embeddings.