# A brief introduction to word embeddings and neural networks

![alt text](w2v_image.png)

In [None]:
#!pip install gensim
#!pip install plotly
import gensim
import gensim.downloader as api
import pandas as pd
import numpy as np
import nltk


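from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.corpus import stopwords
# Resources used below: stopword list, WordNet (lemmatizer), Punkt (tokenizer)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this name instead of 'punkt'
lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words('english'))
import string
punct = list(string.punctuation)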

import seaborn as sns
sns.set()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from IPython.display import IFrame
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
from scipy.spatial import distance

# Preliminary: matrix multiplication

A matrix is a rectangular array of numbers (or symbols standing for numbers) with dimensions $m \times n$, where $m$ is the number of rows and $n$ the number of columns. These are all examples of matrices:

$$A = \begin{bmatrix}1& 5 & 7 & 23& 6\end{bmatrix}$$ 
\
$$B = \begin{bmatrix}1& 5 \\ 7 & 23 \\ 6 & 9\end{bmatrix}$$
\
$$C = \begin{bmatrix}5 \\ 7 \\ 9\end{bmatrix}$$
\
$$D = \begin{bmatrix}1& 5 & i\\ 7 & 3i & e\\ \pi & 6 & 9\end{bmatrix} \ E = \begin{bmatrix}x^3& x+y & y^2\\ \sqrt{x} & x^2 + y^2 & x^y\end{bmatrix}$$

Linear algebra is the branch of mathematics that deals with matrix operations, and it underwrites much of NLP. For our purposes in learning `word2vec`, we need to know one operation: matrix multiplication. 

To multiply two matrices, the number of columns of the first matrix must equal the number of rows of the second matrix. That is, they must have dimensions $m \times n$ and $n \times p$; the result has dimensions $m \times p$. Start with the first row of the first matrix: multiply each entry in that row by the corresponding entry in the first column of the second matrix and add the results. This gives the first entry of the new matrix. Repeating this over all the columns of the second matrix gives the entire first row of the new matrix. Then move on to the second row of the first matrix and repeat the process; this gives the second row of the new matrix. Once the process has been completed for all the rows of the first matrix, the operation is complete. This is easier to follow with an example:

$$\begin{bmatrix}1& 5 \\ 7 & 23 \\ 6 & 9\end{bmatrix} \times \begin{bmatrix}4 & 5 & 10 & 9 \\ 7 & 2 & 5 & 1 \end{bmatrix} = \begin{bmatrix}1\times4+5\times7 & 1\times5+5\times2 & 1\times10+5\times5 & 1\times9 + 5\times1 \\ 7\times4+23\times7 & 7\times5+23\times2 & 7\times10+23\times5 & 7\times9 + 23\times1 \\ 6\times4+9\times7 & 6\times5+9\times2 & 6\times10+9\times5 & 6\times9 + 9\times1\end{bmatrix} = \begin{bmatrix}39 & 15 & 35 & 14 \\ 189 & 81 & 185 & 86 \\ 87 & 48 & 105 & 63\end{bmatrix}$$
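The row-by-column procedure can be checked in a few lines of Python. This is a minimal sketch: the helper name `matmul` is introduced here for illustration, and plain lists are used so that each multiply-and-add step stays visible.

```python
def matmul(first, second):
    """Multiply an m x n matrix by an n x p matrix, both given as lists of rows."""
    n = len(second)
    if any(len(row) != n for row in first):
        raise ValueError("columns of the first matrix must equal rows of the second")
    p = len(second[0])
    # Entry (i, j) is the sum of products of row i of `first` and column j of `second`.
    return [[sum(row[k] * second[k][j] for k in range(n)) for j in range(p)]
            for row in first]


first = [[1, 5], [7, 23], [6, 9]]         # 3 x 2
second = [[4, 5, 10, 9], [7, 2, 5, 1]]    # 2 x 4

product = matmul(first, second)           # 3 x 4
for row in product:
    print(row)
```

With NumPy, `np.array(first) @ np.array(second)` performs the same operation.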



# The theory of word embeddings

Most present-day NLP applications are based on word embeddings. These are a way of representing words and larger items of text based on how they behave across a corpus. Typically, these behaviours are learned by neural networks, which can then be used to describe or generate language samples. Large Language Models (LLMs) like ChatGPT, GPT-3, and T5 were all made possible by the emergence of word embeddings as a way of describing language.

The linguistic principle behind word embeddings is what's known as the distributional hypothesis. According to [Sahlgren (2006)](https://www.diva-portal.org/smash/get/diva2:1041938/FULLTEXT01.pdf), this can be stated as the claim that:

>there is a correlation between distributional similarity and meaning similarity, which allows us to utilize the former in order to estimate the latter.

That is, words with similar meanings tend to occur in similar contexts. For example, the words 'cat' and 'feline' are more likely to share contexts with each other than with the words 'dog' or 'canine'. Word embedding models seek to represent this shared context mathematically, so that we can use it with NLP methods.
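A minimal sketch of the idea, using four invented sentences (they are an assumption made for illustration, not data from any corpus): collect the words that appear alongside each target word and measure how much those context sets overlap. The helper names `contexts` and `overlap` are hypothetical.

```python
# Invented sentences, for illustration only.
sentences = [
    "the cat chased the mouse",
    "the feline chased the mouse",
    "the dog fetched the stick",
    "the canine fetched the stick",
]


def contexts(target):
    """Return the set of words that share a sentence with `target`."""
    found = set()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            found.update(w for w in words if w != target)
    return found


def overlap(word_a, word_b):
    """Jaccard overlap of two words' context sets: 1.0 identical, 0.0 disjoint."""
    a, b = contexts(word_a), contexts(word_b)
    return len(a & b) / len(a | b)


print(overlap("cat", "feline"))
print(overlap("cat", "dog"))
```

In this invented data, 'cat' and 'feline' have identical contexts, while 'cat' and 'dog' share only the word 'the'.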

The first widely adopted neural word embedding model was `word2vec`, which was developed by [Mikolov et al.](https://arxiv.org/abs/1301.3781) in 2013. Though `word2vec` has since been superseded by more sophisticated embedding models, the basic principles that underwrite them are the same. So, how does `word2vec` work?

The key insight is that `word2vec` wants to create vector representations of words, such that the vectors for words that share contexts with each other are closer in space. The model starts out by assigning each word a random vector (otherwise called a set of weights). During training on a corpus, the model learns which words are close to each other and adjusts the weights in an iterative process. Usually, the vectors have 300 dimensions. Why 300? The choice is somewhat arbitrary and can be any number, but 300 was found to provide enough parameters to be empirically useful. In any event, 300 parameters give a lot of degrees of freedom; to quote John von Neumann:

> with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

In more detail, `word2vec` proceeds as follows:

1. The corpus is tokenized into words.
2. Each word is represented as a $1 \times V$ matrix using one-hot encoding; this is a vector of size $V$, where $V$ is the vocabulary size.
3. A matrix of size $V \times 300$ is randomly initialised; this is called the hidden layer. Multiplying the one-hot representation of a word by this matrix picks out one row of the matrix, a vector of 300 values. This vector is the word's embedding.
4. A second random matrix of size $300 \times V$ is created; this is called the output layer. It has the same shape as the embedding matrix, but transposed, so that an embedding can be multiplied by it. (The toy example below simply reuses the transposed embeddings.)
5. For a given word, its behaviour relative to every other word can be captured by getting a similarity measure of its embedding and each column of the output layer weights: this gives a single number that measures vector similarity. Doing this for every word in the vocabulary gives a $1 \times V$ matrix of values.
6. This matrix is converted into a probability distribution using the softmax function. This gives the probability of the word co-occurring with every other word in the vocabulary.
7. In the process of training, the `word2vec` model learns what words are *actually* associated with each other. It uses this knowledge to adjust the word embeddings and output weights to match the actual probability distributions. It does this iteratively until training ends. 
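
Steps 1 to 3 can be sketched in plain Python. The corpus is the one used in the toy example below; the embedding size of 3, the random seed, and the helper names (`one_hot`, `vector_times_matrix`) are assumptions chosen for illustration. The point of the sketch is that multiplying a one-hot vector by the hidden layer simply selects the row for that word.

```python
import random

corpus = ["Dogs eat bones", "Cats eat fish", "Plants absorb nutrients"]

# Step 1: tokenize (here, lowercase and split on whitespace).
tokens = [sentence.lower().split() for sentence in corpus]

# The vocabulary is the sorted set of distinct words; V is its size.
vocab = sorted({word for sentence in tokens for word in sentence})
V = len(vocab)
M = 3  # embedding size (300 in the text; 3 here to keep things readable)


def one_hot(word):
    """Step 2: a 1 x V vector with a single 1 at the word's vocabulary index."""
    return [1 if entry == word else 0 for entry in vocab]


# Step 3: a randomly initialised V x M hidden layer (seed chosen arbitrarily).
rng = random.Random(0)
hidden = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(V)]


def vector_times_matrix(vector, matrix):
    """Multiply a 1 x V vector by a V x M matrix, giving a 1 x M vector."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]


embedding = vector_times_matrix(one_hot("plants"), hidden)

# The product is exactly the row of `hidden` at the word's index.
print(vocab)
print(embedding == hidden[vocab.index("plants")])
```

Because the product only ever selects a row, implementations typically skip the multiplication and look the row up by index.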

## Toy example

1. Corpus = ['Dogs eat bones', 'Cats eat fish', 'Plants absorb nutrients']

   Vocabulary = ['absorb', 'bones', 'cats', 'dogs', 'eat', 'fish', 'nutrients', 'plants']
   

2. One-hot encoding:

$$\text{absorb} = \begin{bmatrix}1& 0 & 0& 0& 0& 0& 0& 0\end{bmatrix}$$
$$\text{bones} = \begin{bmatrix}0& 1 & 0& 0& 0& 0& 0& 0\end{bmatrix}$$
$$\text{cats} = \begin{bmatrix}0& 0 & 1& 0& 0& 0& 0& 0\end{bmatrix}$$
$$\ldots$$
$$\text{plants} = \begin{bmatrix}0& 0 & 0& 0& 0& 0& 0& 1\end{bmatrix}$$


3. Create a random $V \times M$ matrix, where $V$ is the vocabulary size and $M$ is the desired size of the embedding (here, we use 3 for convenience). This is the hidden layer:

$$\text{Hidden Layer} = \begin{bmatrix}2 & 5 & 5 \\7 & 5 & 9 \\2 & 5 & 5 \\1 & 3 & 2 \\0 & 6 & 2 \\3 & 5 & 1 \\5 & 4 & 8 \\3 & 1 & 0\end{bmatrix}$$

  Multiplying the one-hot encoding of a word (here, 'plants') by this matrix gives the word embedding for that word:

$$\text{plants} = \begin{bmatrix}0& 0 & 0& 0& 0& 0& 0& 1\end{bmatrix} \times \begin{bmatrix}2 & 5 & 5 \\7 & 5 & 9 \\2 & 5 & 5 \\1 & 3 & 2 \\0 & 6 & 2 \\3 & 5 & 1 \\5 & 4 & 8 \\3 & 1 & 0\end{bmatrix} = \begin{bmatrix}3 & 1& 0\end{bmatrix}$$

4. Next we define a second $M \times V$ matrix by transposing our word embeddings; these are the output layer weights:

$$\text{Output Layer Weights} = \begin{bmatrix}2 & 7& 2& 1& 0& 3& 5 & 3 \\5& 5& 5& 3& 6& 5& 4& 1 \\5& 9& 5& 2& 2& 1& 8& 0\end{bmatrix}$$

5. If we take a word (here, 'plants') and get the cosine similarity of its word embedding with every column in the weights matrix, we get a similarity score between that word and every word in the vocabulary:

$$\text{Similarity scores} = \text{cosine similarity}\left(\begin{bmatrix}3 & 1 &0\end{bmatrix}, \begin{bmatrix}2 & 7& 2& 1& 0& 3& 5 & 3 \\5& 5& 5& 3& 6& 5& 4& 1 \\5& 9& 5& 2& 2& 1& 8& 0\end{bmatrix}\right) = \begin{bmatrix}0.47 & 0.66 & 0.47 & 0.51 & 0.30 & 0.75 & 0.59 & 1.00\end{bmatrix}$$

   (For example, $\text{cosine similarity}(\begin{bmatrix}3 & 1 &0\end{bmatrix}, \begin{bmatrix}2 & 5 & 5\end{bmatrix}) = \frac{3\times2 + 1\times5 + 0\times5}{\sqrt{10}\times\sqrt{54}} \approx 0.47$)

6. The softmax function $\frac{e^{s_i}}{\sum_{j=1}^{V} e^{s_j}}$, where $s_i$ is the $i$-th similarity score, converts this matrix of similarity scores into a probability distribution for how likely the word of interest ('plants') is to occur with every word in the vocabulary:

$$P(x \mid \text{plants}),\ x\in V = \begin{bmatrix} 0.11 & 0.13 & 0.11 & 0.11 & 0.09 & 0.14 & 0.12 & 0.18 \end{bmatrix}$$
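
The scores and probabilities in steps 5 and 6 can be computed with the standard library. This is a sketch under the toy example's own assumptions (cosine similarity as the score, output weights equal to the transposed hidden layer); the function names are introduced here for illustration.

```python
import math

vocab = ["absorb", "bones", "cats", "dogs", "eat", "fish", "nutrients", "plants"]

# Hidden layer from step 3: one row (embedding) per vocabulary word.
hidden = [
    [2, 5, 5], [7, 5, 9], [2, 5, 5], [1, 3, 2],
    [0, 6, 2], [3, 5, 1], [5, 4, 8], [3, 1, 0],
]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)


def softmax(scores):
    """Exponentiate each score and normalise so the results sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


plants = hidden[vocab.index("plants")]

# The columns of the output layer are the rows of `hidden`,
# so comparing against each row is equivalent.
scores = [cosine_similarity(plants, column) for column in hidden]
probabilities = softmax(scores)

for word, score, prob in zip(vocab, scores, probabilities):
    print(f"{word:<10} similarity={score:.2f}  probability={prob:.2f}")
```

Note that 'absorb' and 'cats' receive the same score because their randomly assigned embeddings happen to be identical.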

During training, a `word2vec` model learns the actual probability distribution of 'plants' with respect to all the other words and calculates the difference between this true distribution and the one produced by the randomly initialised weights. The function that measures this difference is called the loss function. The model then adjusts the word embeddings to minimise the loss in an iterative process of updating. Training is complete when further changes no longer reduce the loss (in practice, usually after a fixed number of passes over the corpus). The word embeddings are retained as a representation of how each word behaves in the corpus. Because similar words will generate similar probability distributions due to co-occurrence, their word embeddings will be similar.
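
A minimal sketch of this update loop, for a single training pair. Several things here are assumptions made for illustration rather than details from the text: the score is the raw dot product (not the cosine similarity of the toy example), the loss is the cross-entropy between the softmax output and the observed context word, the output weights are a separate matrix from the embeddings, and the learning rate, seed and number of steps are arbitrary. Production implementations such as Gensim use faster approximations of this full softmax.

```python
import math
import random

vocab = ["absorb", "bones", "cats", "dogs", "eat", "fish", "nutrients", "plants"]
V, M = len(vocab), 3
rng = random.Random(0)  # arbitrary seed

# Randomly initialised embeddings (V x M) and output weights (M x V).
embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(M)] for _ in range(V)]
output = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(M)]


def predict(word_index):
    """Softmax over the dot products of one embedding with every output column."""
    h = embeddings[word_index]
    scores = [sum(h[k] * output[k][j] for k in range(M)) for j in range(V)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(word_index, context_index, learning_rate=0.1):
    """One gradient-descent update; returns the loss measured before the update."""
    h = list(embeddings[word_index])
    probs = predict(word_index)
    loss = -math.log(probs[context_index])
    # Gradient of the loss with respect to each score: predicted minus observed.
    errors = [p - (1.0 if j == context_index else 0.0) for j, p in enumerate(probs)]
    # Gradient with respect to the embedding, computed before the output weights change.
    grad_h = [sum(output[k][j] * errors[j] for j in range(V)) for k in range(M)]
    for k in range(M):
        for j in range(V):
            output[k][j] -= learning_rate * h[k] * errors[j]
        embeddings[word_index][k] -= learning_rate * grad_h[k]
    return loss


# Observed pair from the toy corpus: 'plants' occurs with 'absorb'.
plants, absorb = vocab.index("plants"), vocab.index("absorb")
losses = [train_step(plants, absorb) for _ in range(200)]

print(f"loss before training: {losses[0]:.3f}")
print(f"loss after training:  {losses[-1]:.3f}")
print(f"P(absorb | plants):   {predict(plants)[absorb]:.3f}")
```

A real model repeats this update over every (word, context) pair in the corpus, so each embedding is pulled towards the contexts the word actually appears in.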



# Example: The King James Bible

Let's see what it looks like to train a `word2vec` model with reference to a text––here, the King James version (KJV) of the Bible.

![alt text](king-james-bible.jpeg)

## Step 1: Preprocessing our data

To train, `word2vec` needs a list of sentences, each split into tokens; here we also lemmatize the tokens and remove stopwords. This means we must:

1. Open our text file
2. Split on sentences
3. Tokenize and lemmatize our sentences and remove stopwords

In [None]:

#Open and read the text file
with open('KJV.txt', 'r') as f:
    bible = f.read()

# Get rid of newline characters and non-ascii gibberish and make everything lowercase
bible = bible.encode('ascii', 'ignore')
bible = bible.decode()
bible = ' '.join(bible.splitlines())
bible = bible.lower()



In [None]:
# Split text into sentences and remove trailing and leading whitespace
bible = bible.split('.')
bible = [i.strip() for i in bible]

In [None]:
bible[0]

In [None]:
# Create a good tokenizer and lemmatizer

def good_tokens(text):
    words = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(i).lower() for i in words]
    lemmas = [i for i in lemmas if i not in stops and i not in punct]
    return lemmas

In [None]:
# tokenize into words

bible_tokens = [good_tokens(i) for i in bible]

## Step 2: Call and train our `word2vec` model

In [None]:
bible_model = gensim.models.Word2Vec(bible_tokens, min_count= 20, vector_size = 300)

In [None]:
bible_model.wv.most_similar(['jesus'])

In [None]:
bible_model.wv.doesnt_match(['man', 'woman', 'child', 'horse'])

In [None]:
vocab = bible_model.wv.index_to_key
vectors = [bible_model.wv[i] for i in vocab]

df = pd.DataFrame(vectors)


In [None]:
df.columns = [str(i) for i in df.columns]

## Step 3: Visualising our data

In [None]:
# Reduce the dimensionality of the data so we can plot it

pca_1 = PCA(n_components = 3)
comps_1 = pca_1.fit_transform(df)
pc_df_1 = pd.DataFrame(data = comps_1, columns = ['PC '+str(i) for i in range(1, comps_1.shape[1]+1)])
df = pd.concat([df, pc_df_1], axis = 1)
df.index = vocab


In [None]:
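# Cluster our data with k-means to see what groups together
# (the column is named 'clusters_knn', but the algorithm is k-means, not k-nearest neighbours)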

kmeans = KMeans(n_clusters=20, random_state=0, n_init="auto").fit(df)
df['clusters_knn'] = [str(i) for i in kmeans.labels_]



In [None]:
fig = px.scatter_3d(df, x='PC 1', y='PC 2', z='PC 3',
              color='clusters_knn', hover_data = [df.index])

fig.update_traces(marker=dict(size = 5, line=dict(width=2,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

## Use a pretrained model

Training an accurate model is time-consuming and computationally intensive. Many NLP packages provide pretrained models that can be downloaded. Gensim, for example, offers a pretrained model based on Google News:

In [None]:
model = api.load("word2vec-google-news-300")

In [None]:
model.similarity('UK', 'France')

In [None]:
model.similarity('UK', 'Mars')

In [None]:
vocab = model.index_to_key


In [None]:
import random

sample = random.choices(vocab, k = 5)

In [None]:
sample