# Word Embeddings

## Vector Space Models

This notebook introduces vector space models. It describes what kind of information word vectors can encode and which applications are built on them. We start with the general idea behind vector space models.

Suppose we have two questions. 

> Where are you <font color="green">heading?</font>
>
> Where are you <font color="green">from?</font>

These sentences have identical words except for the last one, yet they have different meanings. On the other hand, say we have two more questions whose words are completely different but that mean the same thing.

> What is your age?
> 
> How old are you?

Vector space models help identify whether a pair of questions, like the first or the second pair above, is similar in meaning, even when the questions do not share the same words. They can be used to identify similarity for question answering, paraphrasing, and summarization. Vector space models also allow us to capture dependencies between words. Consider this sentence.

> You eat <font color="green">cereal</font> from a <font color="green">bowl</font>

Here, we can see that the word "cereal" and the word "bowl" are related. Now let's look at this other sentence.

> You <font color="green">buy</font> something and someone else <font color="green">sells</font> it

In other words, someone sells something because someone else buys it. The second half of the sentence is dependent on the first half. With vector space models, we will be able to capture this and many other types of relationships among different sets of words. Vector space models are used in information extraction to answer questions in the style of *who?*, *what?*, *where?*, and *how?*, as well as in machine translation and chatbot programming.

As a final thought, we share this quote from **John Firth**, a famous English linguist.

<img style="float:left" src="https://upload.wikimedia.org/wikipedia/commons/c/c3/John_Rupert_Firth.png" width="150"/><br/><br/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "You shall know a word by the company it keeps." (Firth, 1957)

<br/><br/><br/><br/><br/><br/><br/>

This is one of the most fundamental concepts in NLP. Vector space models build the representation of each word from the context that surrounds it in the text, and this is what captures relative meaning. To recap, vector space models allow us to represent words and documents as vectors that capture relative meaning.

## Word by Word and Word by Doc

Now, we will learn how to construct vectors based on a co-occurrence matrix. Depending on the task we are trying to solve, we can choose between several possible designs. To get a vector space model using a word by word design, we build a co-occurrence matrix and extract vector representations for the words in the corpus. We can get a vector space model using a word by document design with a similar approach.

### Word by Word Matrix

The co-occurrence of two different words is the number of times that they appear together in the corpus within a certain word distance $k$. For instance, suppose that the corpus has the following two sentences. 

> I like simple data.
> 
> I prefer simple raw data.

We can extract the co-occurrence between words using a word distance $k=2$. Thus, for the word "data" we extract the relation with the words marked in green.

> I <font color="green">like simple</font> <font color="blue">data</font>
> 
> I prefer <font color="green">simple raw</font> <font color="blue">data</font>

The row of the co-occurrence matrix corresponding to the word "data" with a $k=2$ would be populated with the following values. 

|      | simple | raw | like |  I  |
| :--- | :----: | :-: | :--: | :-: |
| data |   2    |  1  |   1  |  0  |

For the column corresponding to the word "simple", we get a value equal to two, because "data" and "simple" co-occur in the first sentence within a distance of one word, and in the second sentence within a distance of two words. Considering the co-occurrence with the words "simple", "raw", "like", and "I", the vector representation of the word "data" would be equal to `data=[2, 1, 1, 0]`. With a word by word design, we get a representation with $n$ entries, with $n$ between one and the size of the entire vocabulary.
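The counting procedure above can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `cooccurrence_row` and the simple whitespace tokenization (lowercasing and dropping the final period) are assumptions made for this example, not part of the original notebook.

```python
from collections import Counter

def cooccurrence_row(sentences, target, k):
    """Count how often each word appears within k positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization (an assumption): lowercase, drop the final period
        tokens = sentence.lower().rstrip(".").split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            # Window of k words to the left and k words to the right
            start = max(0, i - k)
            end = min(len(tokens), i + k + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

corpus = ["I like simple data.", "I prefer simple raw data."]
row = cooccurrence_row(corpus, "data", k=2)

# Same column order as the table: simple, raw, like, I
print([row[w] for w in ["simple", "raw", "like", "i"]])  # [2, 1, 1, 0]
```

Because a `Counter` returns zero for missing keys, words that never fall inside the window (such as "I") get a zero entry without any extra handling.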

### Word by Document Matrix

For a word by document design, the process is quite similar. In this case, we count the number of times that words from the vocabulary appear in documents that belong to specific categories. For instance, we could have a corpus consisting of documents on different topics like "Entertainment", "Economy", and "Machine Learning". Here, we count the number of times that each word appears in the documents that belong to each of the three categories. In the example below, suppose that the word "data" appears 500 times in documents related to "Entertainment", 6,620 times in "Economy" documents, and 9,320 times in documents related to "Machine Learning". The word "film" appears 7,000, 4,000, and 1,000 times in those categories, respectively.

|      | Entertainment | Economy | Machine Learning |
| :--- | :-----------: | :-----: | :--------------: |
| data |      500      |  6620   |      9320        |
| film |     7000      |  4000   |      1000        |

Once we have constructed the representations for multiple sets of documents or words, we get the vector space. Let's take the matrix from the last example. We could take a representation for the words "data" and "film" from the rows of the table. Here, however, we take the representation for every category of documents by looking at the columns. So the vector space has two dimensions: the number of times that the words "data" and "film" appear in each type of document. For each corpus, we have the following vector representations.

<img src="images/vector_space.svg" width="35%"/>

Note that in this space, it is easy to see that the "Economy" and "Machine Learning" documents are much more similar to each other than either is to the "Entertainment" category.
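As a small sketch of the two ways to read the table, the snippet below stores the counts in a numpy array and slices it by row (word vectors) and by column (document vectors). The variable names are chosen here for illustration only.

```python
import numpy as np

words = ["data", "film"]
categories = ["Entertainment", "Economy", "Machine Learning"]

# Word by document counts, copied from the table above
counts = np.array([
    [500, 6620, 9320],   # data
    [7000, 4000, 1000],  # film
])

# Rows give one vector per word (three dimensions, one per category)
word_vectors = {word: counts[i] for i, word in enumerate(words)}

# Columns give one vector per category (two dimensions: data, film)
doc_vectors = {cat: counts[:, j] for j, cat in enumerate(categories)}

for cat, vec in doc_vectors.items():
    print(cat, vec.tolist())
```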

## Euclidean Distance

Now, we are going to learn about Euclidean distance, a distance metric that we can use to judge similarity. It tells us how far apart two points or two vectors are. Let's use two of the corpus vectors we saw previously. Remember, in that example there are two dimensions: the number of times that the word "data" and the word "film" appear in each corpus. Below, we consider the "Entertainment" corpus (red dot) and the "Machine Learning" corpus (green dot). Let's represent those vectors as points in the vector space.

<img src="images/euclidean_distance.svg" width="35%"/>

The **Euclidean distance** is the length of the straight line segment connecting them. To get that value, we should use the following formula. 

$$
d(B, A) = \sqrt{(\color{blue}{B_1 - A_1})^2 + (\color{orange}{B_2 - A_2})^2}
$$

The first term is their horizontal distance (marked as a dashed blue line) squared and the second term is their vertical distance (marked as a dashed orange line) squared. This formula is an application of the Pythagorean theorem. Plugging in the values for each term, we get a Euclidean distance approximately equal to 10,667.

$$
d(B, A) = \sqrt{(9320 - 500)^2 + (1000 - 7000)^2} = \sqrt{(8820)^2 + (-6000)^2} \approx 10667
$$

When we have higher dimensions, the Euclidean distance is not much more difficult to compute. Let's use the following co-occurrence matrix.

|        | data | boba | ice-cream |
| :----- | :--: | :--: | :-------: |
| AI     |  6   |  0   |     1     |
| drinks |  0   |  4   |     6     |
| food   |  0   |  6   |     8     |

Suppose that we want to know the Euclidean distance between the vector representation $\vec{v}$ of the word "ice-cream" and the vector representation $\vec{w}$ of the word "boba". To start, we get the difference between each of their dimensions, square those differences, sum them up, and then take the square root of the result.

$$
d(\vec{v}, \vec{w}) = \sqrt{(1-0)^2 + (6-4)^2 + (8-6)^2} = \sqrt{1+4+4} = \sqrt{9} = 3
$$

This process is the generalization of the one from the previous example. It gives the formula that we would use to get the Euclidean distance between vector representations in an n-dimensional vector space. In linear algebra, this formula is known as the norm of the difference between the vectors that we are comparing.

$$
d(\vec{v}, \vec{w}) = \sqrt{\sum\limits_{i=1}^n (v_i - w_i)^2}\ \ \ \ \ \rightarrow \ \ \ \ \ \text{Norm of } (\vec{v} - \vec{w})
$$

Let's take a look at the implementation of the Euclidean distance in Python. If we have two vectors like the ones from the previous example, we can use the `linalg` module from numpy to get the norm of the difference between them. Running this code gives the result shown below. The `norm` function works for n-dimensional spaces.

```python
import numpy as np

# Create numpy vectors v and w
v = np.array([1, 6, 8])
w = np.array([0, 4, 6])

# Calculate the Euclidean distance d
d = np.linalg.norm(v - w)

# Print the result
print("The Euclidean distance between v and w is: ", d)
```

```
The Euclidean distance between v and w is:  3.0
```


## Cosine Similarity

Now, we learn about cosine similarity, another type of similarity function. It uses the cosine of the angle between two vectors to tell whether they are close or not.

To illustrate how the **Euclidean distance** might be problematic, let's take the following example. Suppose that we are in a vector space where the corpora are represented by the occurrences of the words "disease" and "eggs". We have the representations of a "Food" corpus, an "Agriculture" corpus, and a "History" corpus. Each of these corpora has text related to its subject. We know that the word totals in the corpora differ from one another. In fact, the "Agriculture" and "History" corpora have a similar number of words, while the "Food" corpus has a relatively small number. Let's define the Euclidean distance between the "Food" and the "Agriculture" corpus as $d_1$, and the Euclidean distance between the "Agriculture" and the "History" corpus as $d_2$.

<img src="images/distances_food.svg" width="35%"/>

As we can see, the distance $d_2$ is smaller than the distance $d_1$, which would suggest that the "Agriculture" and "History" corpora are more similar than the "Agriculture" and "Food" corpora. Another common method for determining the similarity between vectors is computing the **cosine** of the angle between them. If the angle is small, the cosine is close to one, and as the angle approaches 90 degrees, the cosine approaches zero. As we can see in the image below, the angle $\alpha$ between "Food" and "Agriculture" is smaller than the angle $\beta$ between "Agriculture" and "History". In this particular case, the cosine of those angles is a better proxy of similarity between these vector representations than their Euclidean distance.

<img src="images/angles_food.svg" width="35%"/>

Remember that the main advantage of this metric over the Euclidean distance is that it is not biased by the size difference between the representations. 
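The effect can be reproduced numerically. In the sketch below, the "Agriculture" and "History" vectors use the counts from the worked example later in this section, while the "Food" vector is a hypothetical value, chosen only to mimic a much smaller corpus with a similar word mix to "Agriculture".

```python
import numpy as np

# Counts of ("disease", "eggs") per corpus.
# food is a hypothetical vector (assumption); the other two come from the text.
food = np.array([5, 15])
agriculture = np.array([20, 40])
history = np.array([30, 20])

def cosine(v, w):
    """Cosine of the angle between v and w."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Euclidean distances d1 (Food, Agriculture) and d2 (Agriculture, History)
d1 = np.linalg.norm(food - agriculture)
d2 = np.linalg.norm(agriculture - history)
print("d1 =", round(d1, 2), " d2 =", round(d2, 2))

# Cosines of alpha (Food, Agriculture) and beta (Agriculture, History)
cos_alpha = cosine(food, agriculture)
cos_beta = cosine(agriculture, history)
print("cos(alpha) =", round(cos_alpha, 2), " cos(beta) =", round(cos_beta, 2))
```

With these numbers, the Euclidean distance ranks "History" as closer to "Agriculture", while the cosine ranks "Food" as closer, which matches the picture described above.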

Now that we have the intuition behind using the cosine of the angle between two vector representations as a similarity metric, we go deeper into an explanation. First, recall some definitions from linear algebra. The norm of a vector, or its magnitude, is defined as the square root of the sum of its squared elements.

$$
||\hat{v}|| = \sqrt{\sum\limits_{i=1}^{n} v_i^2}
$$

The dot product between two vectors is the sum of the products between their elements in each dimension of the vector space.

$$
\hat{v}\cdot\hat{w} = \sum\limits_{i=1}^{n} v_i w_i
$$

Let's take another look at two of the corpora from the last section. Remember that in this example, we have a vector space where the representation of the corpora is given by the number of occurrences of the words "disease" and "eggs". The angle between those vector representations is denoted by $\beta$. The "Agriculture" corpus is represented by the vector $\hat{v}$, and the "History" corpus is going to be vector $\hat{w}$. 

<img src="images/beta_angle.svg" width="35%"/>

The dot product between those vectors can also be written in terms of their norms and the angle between them.

$$
\hat{v}\cdot\hat{w} = ||\hat{v}||\ ||\hat{w}||\ \cos(\beta)
$$

From this equation, we see that the cosine of the angle beta is equal to the dot product between the vectors divided by the product of the two norms. 

$$
\cos(\beta) = \frac{\hat{v}\cdot\hat{w}}{||\hat{v}||\ ||\hat{w}||}
$$

Replacing the actual values from the vector representations should give this expression. 

$$
\cos(\beta) = \frac{(20 \times 30) + (40 \times 20)}{\sqrt{20^2 + 40^2} \times \sqrt{30^2 + 20^2}} \approx 0.87
$$

In the numerator, we have the sum of the products of the occurrences of the words "disease" and "eggs" in the two corpora, and in the denominator, we have the product of the norms of the vector representations of the "Agriculture" and "History" corpora. Ultimately, we get a cosine similarity of approximately 0.87.

Now consider the case where two vectors are orthogonal. Because the vector components here are counts, they cannot be negative, so the maximum angle between a pair of vectors is 90 degrees. In that case, the cosine is equal to zero, which means that the two vectors have orthogonal directions, or that they are maximally **dissimilar**.

<img src="images/beta_angle90.svg" width="35%"/>

Now, let's look at the case where the vectors have the same direction. In this case, the angle between them is zero degrees and the cosine is equal to one, because the cosine of zero is one. As we can see, the closer the cosine of the angle between two vectors is to one, the closer their directions are.

<img src="images/beta_angle0.svg" width="35%"/>

An important takeaway is that this metric increases with the similarity between the directions of the vectors that we are comparing. Also, for vectors with non-negative entries such as these counts, the cosine similarity takes values between 0 and 1.
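The formula and its two boundary cases can be checked with a short numpy sketch. The function name `cosine_similarity` is defined here for illustration, and the orthogonal and parallel vectors are made-up examples, not values from the figures.

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product divided by the product of the norms."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Worked example from the text: Agriculture vs. History
agriculture = np.array([20, 40])
history = np.array([30, 20])
print(round(cosine_similarity(agriculture, history), 2))  # 0.87

# Orthogonal directions (illustrative vectors): cosine is zero
orthogonal = cosine_similarity(np.array([0, 5]), np.array([3, 0]))
print(orthogonal)

# Same direction, different magnitude (illustrative vectors): cosine is
# one, up to floating-point rounding
parallel = cosine_similarity(np.array([1, 2]), np.array([10, 20]))
print(round(parallel, 6))
```

Note that the parallel vectors differ in length by a factor of ten, yet their cosine similarity is still one, which is the size-independence property discussed earlier.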

## Manipulating Words in Vector Spaces

Now, we learn to manipulate vectors by performing some simple vector arithmetic. By adding and subtracting vectors, we will be able to predict the capitals of certain countries. Suppose that we have a vector space with countries and their capital cities. We know that the capital of the United States is Washington DC. We do not know the capital of Russia, but we would like to use the known relationship between Washington DC and the USA to figure it out. For that, we use some simple vector algebra.

<img src="images/countries_capital.svg" width="40%"/>

For this example, we are in a hypothetical two-dimensional vector space that has different representations for different countries and capital cities. First, we have to find the relationship between the Washington DC and USA vector representations. In other words, which vector connects them? To find it, we take the difference between the two vectors. Its values tell us how many units along each dimension we should move in order to find a country's capital in that vector space.

$$
d(\text{USA-Washington}) = [5, -1]
$$

So, to find the capital city of Russia, we add the difference vector from the last step to Russia's vector representation. In the end, we deduce that the capital of Russia should have a vector representation of `(10, 4)`.

<img src="images/countries.svg" width="40%"/>

However, there is no city with exactly that representation, so we take the one that is most similar to it by comparing each vector using Euclidean distance or cosine similarity. In this case, the vector representation that is closest to `(10, 4)` is the one for Moscow. Using this simple process, we could have predicted the capital of Russia if we knew the capital of the USA. The only catch here is that we need a vector space where the representations capture the relative meaning of words. Now we have a simple process to infer unknown relationships between words from known relationships between others. This is why it is important to have vector spaces where the representations of words capture relative meaning in natural language.
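The whole procedure can be sketched as follows. All coordinates below are hypothetical: they were picked only so that the Washington DC minus USA difference is `[5, -1]` and the predicted point is `(10, 4)`, as in the text. The helper `closest_word` is also an assumption made for this illustration.

```python
import numpy as np

# Hypothetical 2-D embeddings (assumed values, not taken from the figures)
embeddings = {
    "USA": np.array([5.0, 6.0]),
    "Washington": np.array([10.0, 5.0]),
    "Russia": np.array([5.0, 5.0]),
    "Moscow": np.array([9.0, 3.0]),
    "Japan": np.array([4.0, 3.0]),
    "Tokyo": np.array([8.5, 2.0]),
    "Turkey": np.array([3.0, 1.0]),
    "Ankara": np.array([8.5, 0.9]),
}

def closest_word(target, embeddings, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

# Vector that connects a country to its capital, learned from the known pair
offset = embeddings["Washington"] - embeddings["USA"]

# Apply the same offset to Russia to predict where its capital should be
predicted = embeddings["Russia"] + offset
print(predicted.tolist())  # [10.0, 4.0]

# No city sits exactly at the predicted point, so take the nearest vector
capital = closest_word(predicted, embeddings, exclude=("USA", "Washington", "Russia"))
print(capital)  # Moscow
```

The words used to build the query are excluded from the search so that the input country itself cannot be returned as the answer.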

## Visualization and PCA

It is often the case that we end up with vectors in very high dimensions. We want a way to reduce these vectors to two dimensions so that we can plot them on an XY plane. **Principal Components Analysis** (PCA) allows us to do so, and it can therefore be used to visualize higher-dimensional vector representations.

Imagine we have the following representation for words in a vector space. 

| word | $d_1$ | $\ldots$ | $d_n$ |
| :--- | :---: | :------: | :---: |
| oil  | 0.20  | $\ldots$ | 0.10  |
| gas  | 2.10  | $\ldots$ | 3.40  |
| city | 9.30  | $\ldots$ | 52.1  |
| town | 6.20  | $\ldots$ | 34.3  |

In this scenario, the dimension of the vector space is higher than two. We know that the words "oil" and "gas", and "city" and "town", are related. We want to see if that relationship is captured by the representation of the words. Dimensionality reduction is a perfect choice for this task. When we have a representation of words in a high-dimensional space, we can use an algorithm like PCA to get a representation in a vector space with fewer dimensions. If we want to visualize the data, we can get a reduced representation with three or fewer features.

| word | $d_1$ | $d_2$ |
| :--- | :---: | :---: |
| oil  | 2.30  | 21.2  |
| gas  | 1.56  | 19.3  |
| city | 13.4  | 34.1  |
| town | 15.6  | 29.8  |

If we perform **Principal Components Analysis** on the data and get a two-dimensional representation, we can then plot the words. In this case, we might find that the initial representation captured the relationship between the words "oil" and "gas", and between "city" and "town", because in the two-dimensional space they appear clustered with their related words. We can even find other relationships among the words that we did not expect, which is a fun and useful possibility.

<img src="images/word_embeddings.svg" width="50%"/>

For the sake of simplicity, we begin with a two dimensional vector space. Say that we want data to be represented by one feature instead. Using PCA, first we find a set of uncorrelated features, and then project data to a one dimensional space, trying to retain as much information as possible. 

<img src="images/pca.svg" width="80%"/>

PCA is an algorithm used for dimensionality reduction that can find uncorrelated features for data. It's very helpful for visualizing data to check if the representation is capturing relationships among words.

## PCA Algorithm

Now, we learn about **eigenvalues** and **eigenvectors**, and see how we can use them to reduce the dimension of the features. There are two things to work out: how to get uncorrelated features for the data, and how to reduce the dimensions of the word representations while keeping as much information as possible from the original embedding. To perform dimensionality reduction using PCA, we begin with the original vector space. Then we get uncorrelated features for the data. Finally, we project the data onto the desired number of features that retain the most information.


The eigenvectors of the covariance matrix of the data give the directions of uncorrelated features, and the eigenvalues are the variances of the data along each of those new features. So, to perform PCA, we need the eigenvectors and eigenvalues of the covariance matrix of the data. The first step is to get a set of uncorrelated features. For this step, we first mean-normalize the data.

$$
x_i = \frac{x_i - \mu_{x_i}}{\sigma_{x_i}}
$$

Then get the covariance matrix $\Sigma$, and finally, perform a Singular Value Decomposition (SVD) to get a set of three matrices.

$$
\left(
\begin{matrix}
    \color{red}{\blacksquare} & \color{blue}{\blacksquare} & \color{green}{\blacksquare} & \ldots & \color{navy}{\blacksquare} \\
    \color{red}{\blacksquare} & \color{blue}{\blacksquare} & \color{green}{\blacksquare} & \ldots & \color{navy}{\blacksquare} \\
    \color{red}{\blacksquare} & \color{blue}{\blacksquare} & \color{green}{\blacksquare} & \ldots & \color{navy}{\blacksquare} \\
    \color{red}{\blacksquare} & \color{blue}{\blacksquare} & \color{green}{\blacksquare} & \ldots & \color{navy}{\blacksquare} \\
    \color{red}{\blacksquare} & \color{blue}{\blacksquare} & \color{green}{\blacksquare} & \ldots & \color{navy}{\blacksquare}\\
  \end{matrix}\right) \ \ \ \left (\begin{matrix}
 \color{red}{\blacksquare} &                           &                            &  &  \\
                          & \color{blue}{\blacksquare} &                            &  &  \\
                          &                           & \color{green}{\blacksquare} &  &  \\
                          &                           &                             & \ldots & \\
                          &                           &                            &  & \color{navy}{\blacksquare}
\end{matrix}\right)\ \ \ \ \ \left( \begin{matrix}
    \color{red}{\blacksquare} & \color{red}{\blacksquare} & \color{red}{\blacksquare} & \ldots & \color{red}{\blacksquare} \\
    \color{blue}{\blacksquare} & \color{blue}{\blacksquare} & \color{blue}{\blacksquare} & \ldots & \color{blue}{\blacksquare} \\
    \color{green}{\blacksquare} & \color{green}{\blacksquare} & \color{green}{\blacksquare} & \ldots & \color{green}{\blacksquare} \\
    \ldots &  \ldots &  \ldots & \ldots &  \ldots \\
    \color{navy}{\blacksquare} & \color{navy}{\blacksquare} & \color{navy}{\blacksquare} & \ldots & \color{navy}{\blacksquare}\\
  \end{matrix}\right)
$$

The first of those matrices contains the eigenvectors stacked column-wise, and the second one has the eigenvalues on its diagonal. The singular value decomposition is already implemented in many programming libraries.

<img src="images/eigenvectors.svg" width="80%"/>

The next step is to project the data onto a new set of features. We use the eigenvectors and eigenvalues in this step. Let's denote the matrix of eigenvectors by $U$ and the matrix of eigenvalues by $S$. First, we perform the dot product between the matrix $X$ containing the word embeddings and the first $n$ columns of the $U$ matrix, where $n$ equals the number of dimensions that we want to have at the end. The expression below shows the case $n = 2$.

$$
X' = XU[:,0:2]
$$

<img src="images/svd.svg" width="60%"/>

For visualization, it is common practice to have two dimensions. Then we get the percentage of variance retained in the new vector space. 

$$
\text{% of Retained variance} = \frac{\sum\limits_{i=0}^1 S_{ii}}{\sum\limits_{j=0}^d S_{jj}}
$$

As an important side note, the eigenvectors and eigenvalues should be organized according to the eigenvalues in descending order. This condition will ensure that we retain as much information as possible from the original embedding.

Eigenvectors from the covariance matrix of the normalized data give the directions of uncorrelated features. The eigenvalues associated with those eigenvectors tell us the variance of the data on those features. The dot products between the word embeddings and the matrix of eigenvectors will project the data onto a new vector space of the dimension that we choose. 
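The steps of this section can be put together in a short numpy sketch. It is a minimal illustration under a few assumptions: the function name `pca_project` is made up for this example, the input is random data standing in for real word embeddings, and every feature is assumed to have a non-zero standard deviation so that the normalization is well defined.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first n_components principal directions."""
    # Mean-normalize each feature (column); assumes no feature is constant
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

    # Covariance matrix of the features
    cov = np.cov(X_norm, rowvar=False)

    # SVD of the covariance matrix: the columns of U are the eigenvectors and
    # S holds the eigenvalues, which numpy returns in descending order
    U, S, _ = np.linalg.svd(cov)

    # Project the data onto the first n_components eigenvectors
    X_reduced = X_norm @ U[:, :n_components]

    # Fraction of the total variance kept by the selected components
    retained = S[:n_components].sum() / S.sum()
    return X_reduced, retained

# Random stand-in for 50 word embeddings with 10 dimensions (assumed data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))

X_2d, retained = pca_project(X, n_components=2)
print(X_2d.shape)  # (50, 2)
print("Retained variance:", round(retained, 3))
```

Because `np.linalg.svd` already returns the singular values sorted from largest to smallest, the descending-order condition mentioned above is satisfied without an explicit sort. The two columns of `X_2d` are uncorrelated, which can be checked by looking at the off-diagonal entries of `np.cov(X_2d, rowvar=False)`.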