# Overview

# History

I started by reading the book "Modern Approaches in Natural Language Processing" by Becker et al. (2020).

## Problem: Representing Words as Numbers

From the beginning of NLP and related fields, creating numeric representations of language has been a question at the forefront.

## One Hot Encoding

### Origin and Intuition

From what I [gather](https://stats.stackexchange.com/questions/308916/what-is-one-hot-encoding-called-in-scientific-literature), the term originated in the field of electrical engineering, though I was unable to find a precise publication of origin. In this context, one typically thinks of a state machine, i.e. a circuit used to maintain or track the "state" of some process. Abstractly, the process can take on a set of mutually exclusive states over its lifetime. The state machine observes the process and records its current state. This record, however, is typically an encoding: a number used to represent each unique state. And because we are dealing with electrical circuits, that number is represented in binary. There are many schemes for such an encoding, one hot encoding being one of them.

With one hot encoding, each state is represented by a unique position in the binary bit stream. The rules for the stream are that a bit can only be "hot" (i.e. have a value of 1) when the rest of the bits in the stream are "cold" (i.e. have a value of 0); thus only one state can be realized at a given time. In other words, only one bit can be "hot" in the group of bits (sometimes called a one hot or a one hot encoding).

We can see a book talking about this [here](https://www.sciencedirect.com/topics/computer-science/one-hot-encoding).

Although the application and implementation are physically different, one hot encoding in the context of machine learning works the same way: we are using bits to represent states.

### Basic Concept
In the context of machine learning, one-hot encodings are often used to encode the words of a vocabulary (language) or text. Each word is indexed and represented by its own number. Generally this is done using vectors, where each dimension in the vector corresponds to a specific word from the set of possible words. Thus each word has a one-to-one mapping with a one-hot vector.

The problem with this approach is that it creates "sparse vectors", i.e. vectors which have high dimensionality but are mostly empty: each one carries only a single non-zero entry.
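As a minimal sketch of the idea above, assuming a made-up five-word vocabulary (the words and the helper name `one_hot` are purely illustrative):

```python
# Toy vocabulary, assumed for illustration only.
vocab = ["cat", "car", "dog", "tiger", "tree"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


cat_vector = one_hot("cat")
tiger_vector = one_hot("tiger")
print(cat_vector)
print(tiger_vector)
```

Note that the vector length equals the vocabulary size, so a realistic vocabulary of tens of thousands of words yields vectors that are almost entirely zeros.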

<center><img src="images/one_hot_encoding.png" style="width:50%"></center>

## Bag Of Words

An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on Distributional Structure.

With bag of words, we are again representing a word as a vector, but this time the vector's dimensions are encoded with counts of co-occurrence. In this encoding, any time a word occurs before or after another given word, the adjacent word's co-occurrence dimension is incremented. This can be done with words in the same sentence, different sentences, different chapters, etc.

<center><img src="images/bag_of_words.png" style="width:50%"></center>

While this is less sparse, it is still sparse.
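The co-occurrence counting described above can be sketched roughly as follows. The two sentences, the `window=1` setting (only directly adjacent words), and the function name `cooccurrence_counts` are assumptions made for this example:

```python
from collections import defaultdict

# Toy corpus, assumed for illustration only.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]


def cooccurrence_counts(sentences, window=1):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


counts = cooccurrence_counts(sentences)
vocab = sorted(counts)

# The vector for "sat": one dimension per vocabulary word.
sat_vector = [counts["sat"][word] for word in vocab]
print(vocab)
print(sat_vector)
```

Most entries of `sat_vector` are still zero, which is the remaining sparsity mentioned above.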

## Problem: Sparse Vectors & Word Similarity

> Many machine learning models won’t work well with high dimensional and sparse features (Goldberg (2016)). Neural networks in particular struggle with this type of data. And with growing vocabulary the feature size vectors also increases by the same length. So, the dimensionality of these approaches is the same as the number of different words in your text. That means estimating more parameters and therefore using exponentially more data is required to build a reasonably generalizable model. This is known as the curse of dimensionality. But these problems can be solved with dimensionality reduction methods such as Principal Component Analysis or feature selection models where less informative context words, such as the and a are dropped.
>
> The major drawback of these methods is that there is no notion of similarity between words. That means words like cat and tiger are represented as similar as cat and car. If the words cat and tiger would be represented as similar words one could use the information won from the more frequent word “cat” for sentences in which the less frequent word tiger appears. If the word embedding for tiger is similar to that of cat the network model can take a similar path instead of having to learn how to handle it completely anew.
>
> Becker et al. (2020)
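To make the "no notion of similarity" point concrete, here is a small sketch using cosine similarity on one-hot vectors. The three-word vocabulary and the helper `cosine_similarity` are assumptions for this example:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors over an assumed vocabulary ["cat", "car", "tiger"].
cat = [1, 0, 0]
car = [0, 1, 0]
tiger = [0, 0, 1]

cat_vs_tiger = cosine_similarity(cat, tiger)
cat_vs_car = cosine_similarity(cat, car)
print(cat_vs_tiger, cat_vs_car)
```

Any two distinct one-hot vectors are orthogonal, so "cat" is exactly as (dis)similar to "tiger" as it is to "car".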

## TODO: Intermediary representations?

> Vector space models have been used in distributional semantics since the 1990s. Since then, we have seen the development of a number models used for estimating continuous representations of words, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) being two such examples.
>
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove#reference1)

## Context Vectors ??

## Problem: Computational Power

> After Bengio et al.'s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.
>
> https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove

## Solution: More Computing Power

The first commercially available quad-core processors hit the market towards the end of 2006. Possibilities began to open up as operating systems were able to support more memory and larger file systems. Advancements continued through the 2010s, pushing computing power, memory, and storage space far beyond the levels of the early 2000s. This greatly enabled the continued development of neural networks and thus research into context vectors.

## Solution: More Efficient Algorithms and Less Expensive Loss Functions
As we will see, while computing power increased and research continued, new methods were introduced which require less computation than their predecessors (e.g. Collobert and Weston).

## Word Embeddings

To overcome the problem of sparse vectors and the problem of encoding a word's meaning, word embeddings were introduced.

Word embeddings also represent words as vectors in a high-dimensional space. The difference with this encoding scheme is that words with similar meanings or relationships are positioned closer to each other within the vector space.

Note: As we will see, the traditional word embeddings lacked context (i.e. semantic relationships at the phrase level).

### Origin

> Word embeddings are based on the idea that contextual information alone constitutes a viable representation of linguistic items, in stark contrast to formal linguistics and the Chomsky tradition. This idea has its theoretical roots in structuralist linguistics and ordinary language philosophy, and in particular in the works of Zellig Harris, John Firth, and Ludwig Wittgenstein, all publishing important works in the 1950s (in the case of Wittgenstein, posthumously). The earliest attempts at using feature representations to quantify (semantic) similarity used hand-crafted features. Charles Osgood’s semantic differentials in the 1960s is a good example, and similar representations were also used in early works on connectionism and artificial intelligence in the 1980s.
>
> Methods for using automatically generated contextual features were developed more or less simultaneously around 1990 in several different research areas...
>
> Later developments are basically only refinements of these early models...
>
> The main difference between these various models is the type of contextual information they use...
> 
> These different contextual representations capture different types of semantic similarity; the document-based models capture semantic relatedness (e.g. “boat” – “water”) while the word-based models capture semantic similarity (e.g. “boat” – “ship”). This very basic difference is too often misunderstood.
>
> [*A Brief History of Word Embeddings*](https://www.gavagai.io/text-analytics/a-brief-history-of-word-embeddings/)

#### Bengio et al. (2003)

I have seen claims ([e.g.](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove)) that the term word embeddings was originally coined in a 2003 [paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) titled "A Neural Probabilistic Language Model" by Bengio et al. But if you read the paper, the term embedding is not found.

Instead we see them using terms like "distributed representations (of words)" or "distributed word feature vectors".

I think it's more accurate to say: Bengio was the first to propose a neural network-based word embedding model, and his work inspired several other researchers like Mikolov who, as we will see, made a major publication in 2013. [ref](https://medium.com/co-learning-lounge/nlp-word-embedding-tfidf-bert-word2vec-d7f04340af7f)



Additionally, I believe Bengio et al. were the first to train word embeddings in a neural language model jointly with the model's parameters. We will see this practice carry forward and influence the development of the Transformer model.

#### Collobert and Weston (2008)

I have seen several articles ([e.g.](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove) [e.g.](https://www.ruder.io/word-embeddings-1)) arguing that it was Collobert and Weston who first showed the value of pre-training word embeddings so that they can be used in downstream tasks.

Essentially they showed how to apply word embeddings to [transfer learning](Transfer%20Learning%20And%20Pre-Trained%20Models.ipynb).

Their [paper](http://machinelearning.org/archive/icml2008/papers/391.pdf) titled *A unified architecture for natural language processing* also introduces a neural network architecture that forms the foundation for many current approaches.

In their paper they also note that:

>  (Bengio & Ducharme, 2001) and (Schwenk & Gauvain, 2002) already presented very similar language models. However, their goal was to give a probability of a word given previous ones in a sentence. Here, we only want to have a good representation of words

>In their 2011 paper, they further expand on this [8].

##### Language Representations vs. Word Predictors

Rather than constructing devices explicitly for next word prediction, they began looking at vectors more generally as a means of storing meaning based on context: thus the term context vectors. This abstraction allowed a more generic approach that could then be applied downstream through transfer learning.

Another big win was an optimization of the training process to use a less computationally demanding loss function:

> In order to avoid computing the expensive softmax, their solution is to employ an alternative objective function: rather than the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score ... for a correct word sequence (a probable word sequence in Bengio's model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion
>
> https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove
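As a rough sketch of what a pairwise ranking criterion looks like, here is a hinge-style loss on two scalar scores. The scores below are made-up numbers (in the actual model they would be produced by the network for a correct window and a corrupted window), and the margin of 1 and the function name are assumptions for this example:

```python
def pairwise_ranking_loss(score_correct, score_corrupted, margin=1.0):
    """Hinge-style ranking loss.

    The loss is zero once the correct sequence outscores the corrupted
    one by at least `margin`; otherwise it grows linearly with the gap.
    """
    return max(0.0, margin - score_correct + score_corrupted)


# Correct window already scores well above the corrupted one: no penalty.
well_separated = pairwise_ranking_loss(score_correct=2.5, score_corrupted=0.5)

# Corrupted window scores higher than the correct one: penalised.
badly_ranked = pairwise_ranking_loss(score_correct=0.5, score_corrupted=1.0)

print(well_separated, badly_ranked)
```

The point of the design is that only two scores are compared per training example, so no normalisation over the whole vocabulary (as in a softmax) is needed.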

#### Mikolov et al. (2013) - Word2vec

> It was Mikolov et al. (2013), however, who really brought word embedding to the fore through the creation of word2vec, a toolkit enabling the training and use of pre-trained embeddings.
>
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove#reference1)

#### Pennington et al. (2014) - GloVe

> A year later, Pennington et al. introduced us to GloVe, a competitive set of pre-trained embeddings, suggesting that word embeddings was suddenly among the mainstream.
>
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove#reference1)

### Basic Concept

Word embeddings are, again, a type of encoding that uses a vector to store the encoding information. However, as opposed to some of the previous implementations, which use discrete vectors, word embeddings use continuous vectors to represent each word in a vocabulary.

The trick with word embeddings is that each dimension corresponds to a particular aspect of a word's meaning. Additionally, meaning is not mutually exclusive: words with similar meaning can share non-zero values in a dimension, or take negative values if they have the opposite meaning, for example.

Words that are similar to each other in meaning have similar vector values and are thus geometrically close to one another.

This is an important design consideration; the overlap allows the vectors to be much more dense than the prior encodings which were all sparse.

So the question becomes: what are the meanings associated with each dimension? This is a difficult question to answer because the dimensions are chosen dynamically by a machine learning algorithm which selects the optimal dimensions based on some loss function.

We will also see that there are many implementations which use their own loss functions etc. to determine the dimensionality and the values of the vectors.

But a common analogy is to think of the principal axes or principal components one discovers through [Principal Component Analysis (PCA)](../../../../Data%20Science/Principal%20Component%20Analysis%20(PCA)/Principal%20Component%20Analysis.ipynb). According to Becker et al. (2020), the dimensionality is usually between 100 and 500 principal meanings of words.

Using this analogy, the word embedding algorithm will seek to minimize the distance between similar words while minimizing the dimensionality.

<center><img src="images/word_embeddings.png"></center>

Another important feature of word embeddings is that, because they are continuous and encode meaning, they allow us to perform algebra:

> With such word vectors even algebraic computations become possible as shown in Tomáš Mikolov, Yih, and Zweig (2013). For example, vector(King)−vector(Man)+vector(Woman) results in a vector that is closest to the vector representation of the word Queen. Another possibility to use word embeddings vectors is translation between languages. Tomas Mikolov, Le, and Sutskever (2013) showed that they can find word translations by comparing vectors generated from different languages. By searching for a translation one can use the word vector from the source language and search for the closest vector in the target language vector space, this word can then be used as a translation. The reason this works is that if a word vector from one language is similar to the word vector of the other language, this word is used in a similar context. This method can be used to infer missing dictionary entries. An example for this method depicted in figure 3.4. In figure 3.4 the vectors for numbers and animals are depicted on the left side and the same words are depicted on the right side. It can be seen that the vectors for the correct translation align in similar geometric spaces. Again, two-dimensional representation was achieved by using dimension reduction methods.


<center><img src='images/word_embedding_algebra.png'></center>

> FIGURE 3.4: Distributed word vector representations of numbers and animals in English (left) and Spanish (right). Source: Tomas Mikolov, Le, and Sutskever (2013)
>
> Becker et al. (2020)
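The King/Queen arithmetic can be illustrated with a toy example. The two-dimensional vectors below are hand-picked for illustration (roughly "royalty" and "gender" axes), not learned embeddings, and the helper names are assumptions. As is common in such demonstrations, the three query words are excluded from the candidate answers:

```python
import math

# Hand-picked toy vectors, NOT learned embeddings.
embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(
        candidates,
        key=lambda word: cosine_similarity(embeddings[word], target),
    )


result = analogy("king", "man", "woman")
print(result)
```

With real embeddings the result is only approximately equal to the target vector, which is why a nearest-neighbour search is used rather than an exact match.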

There are a number of implementations of word embeddings, as we will see.

### Intuition

I started wondering why word embeddings were given their name. I googled and found the following answer on [Quora](https://www.quora.com/Why-are-word-embeddings-called-word-embeddings):

> "Word": This part of the term specifies that we are dealing with individual words in the vocabulary. In reality we have evolved the term to deal with "tokens" which may be only parts of words.
>
> The term "embedding" comes from the field of mathematics, where it refers to the process of mapping objects from one space into another, often with the goal of preserving certain relationships or properties. In the context of word embeddings: ... (the term) signifies the act of placing or mapping words into a continuous vector space. Just as embedding a physical object in a material might mean surrounding it or encapsulating it within that material, word embeddings encapsulate the semantic meaning and relationships of words within a vector space.

### Implementations

#### Word2vec

##### Overview

From googling, I see that word2vec is not a singular algorithm, rather, it is a family of model architectures and optimizations for learning word embeddings.

Within the word2vec umbrella, there are two architectures for learning word embeddings:

- **Continuous bag-of-words model**: predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
- **Continuous skip-gram model**: predicts words within a certain range before and after the current word in the same sentence.

Generally speaking, the two algorithms rely on shallow neural networks.
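To make the difference between the two architectures concrete, here is a sketch of how training examples could be generated from a sentence in each case. It covers only the example generation, not the neural network itself. The sentence, the window size, and the function names are assumptions for this example:

```python
def context_window(tokens, position, window):
    """Words within `window` positions of `position`, excluding the word itself."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[j] for j in range(start, end) if j != position]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (current word, one context word) pair is an example."""
    pairs = []
    for i, center in enumerate(tokens):
        for context_word in context_window(tokens, i, window):
            pairs.append((center, context_word))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the whole context is the input, the middle word is the target."""
    return [
        (context_window(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]


tokens = "the quick brown fox jumps".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram goes from the current word to each surrounding word, while CBOW goes from the surrounding words (treated as an unordered bag) to the middle word.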

Great articles can be found [here](https://jalammar.github.io/illustrated-word2vec/) and [here](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa).

##### History

According to [Wikipedia](https://en.wikipedia.org/wiki/Word2vec), word2vec was published in 2013 by a team of researchers at Google led by Mikolov, across two papers titled [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) and [Distributed Representations of Words and Phrases and their Compositionality](https://ui.adsabs.harvard.edu/abs/2013arXiv1310.4546M/abstract). Among the authors, Tomas Mikolov is the most widely cited. Additionally, the algorithm was [patented](https://worldwide.espacenet.com/patent/search/family/053054725/publication/US9037464B1?q=pn%3DUS9037464) in 2015.

Additionally, a follow-up paper was [published](https://arxiv.org/abs/1402.3722) in 2014 by Goldberg and Levy explaining the math and rationale behind it.

With the invention of the [Transformer](./Transformers.ipynb), the word2vec algorithm is seen as outdated as a means of producing word embeddings.

##### Tutorials

[Using tensorflow](https://www.tensorflow.org/text/tutorials/word2vec)

#### Transformer ?

More on how this works ^^

Mikolov et al. [2] recommend two architectures for learning word embeddings that, when compared with previous models, are computationally less expensive.

Here are two key benefits that these architectures have over Bengio's and the C&W model:

- They forgo the costly hidden layer.

- They allow the language model to take additional context into account.

The success of their model cannot be attributed to these differences alone; importantly, it also comes from specific training strategies.

#### GloVe

GloVe was [published](https://aclanthology.org/D14-1162/) in 2014 by a team of researchers at Stanford University (Pennington et al.). The project is open source and the home page can be found [here](https://nlp.stanford.edu/projects/glove/). It hosts several iterations and versions of word vectors trained through various means.

According to the home page:
> Training is performed on aggregated global word-word co-occurrence statistics from a corpus

And according to the [paper's](https://aclanthology.org/D14-1162.pdf) abstract:
> Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. 

While algorithms like word2vec implicitly derive word meaning, GloVe tries to do this explicitly.

> the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.
>
> For this to be accomplished, they propose a weighted least squares objective (J) that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences
>
> ...
>
> With GloVe, we have already seen that the differences are not as obvious: While GloVe is considered a predict model by Levy et al. (2015) [10], it is clearly factorizing a word-context co-occurrence matrix, which brings it close to traditional methods such as PCA and LSA. Even more, Levy et al. [12] demonstrate that word2vec implicitly factorizes a word-context PMI matrix.
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove)
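The weighted least squares objective described in the quote can be sketched for a single word pair as follows. The bias terms and the weighting function (with `x_max=100` and `alpha=0.75`) follow my reading of the GloVe paper and should be treated as assumptions here; the function names and the example vectors are made up for illustration:

```python
import math


def weighting(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences and cap the weight of frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one word pair.

    Penalises the gap between (dot product + biases) and the log of the
    co-occurrence count x_ij, which must be greater than zero.
    """
    dot = sum(a * b for a, b in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(x_ij)
    return weighting(x_ij) * error ** 2


# Vectors whose dot product already equals log(100): nothing to correct.
perfect_fit = glove_pair_loss([1.0, 0.0], [math.log(100.0), 0.0], 0.0, 0.0, 100.0)

# All-zero vectors: the full log(100) gap is penalised.
untrained = glove_pair_loss([0.0, 0.0], [0.0, 0.0], 0.0, 0.0, 100.0)

print(perfect_fit, untrained)
```

The full objective J would sum this quantity over all word pairs with a non-zero co-occurrence count.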

## Context Embeddings

### Word Embeddings vs. Contextual Embeddings

> Both embedding techniques, traditional word embedding (e.g. word2vec, Glove) and contextual embedding (e.g. ELMo, BERT), aim to learn a continuous (vector) representation for each word in the documents...
>
> Traditional word embedding techniques learn a global word embedding. They first build a global vocabulary using unique words in the documents by ignoring the meaning of words in different context. Then, similar representations are learnt for the words appeared more frequently close each other in the documents. The problem is that in such word representations the words' contextual meaning (the meaning derived from the words' surroundings), is ignored. For example, only one representation is learnt for "left" in sentence "I left my phone on the left side of the table." However, "left" has two different meanings in the sentence, and needs to have two different representations in the embedding space.
>
> Word embeddings provided by word2vec or fastText has a vocabulary (dictionary) of words. The elements of this vocabulary (or dictionary) are words and its corresponding word embeddings. Hence, given a word, its embeddings is always the same in whichever sentence it occurs. Here, the pre-trained word embeddings are static.
>
> On the other hand, contextual embedding methods are used to learn sequence-level semantics by considering the sequence of all words in the documents. Thus, such techniques learn different representations for polysemous words (words with multiple meanings, i.e. homonyms), e.g. "left" in example above, based on their context.
>
> Contextual embeddings depend on the other words in a given sentence... Thus, given a word, it will not have a static embeddings, but the embeddings are dynamically generated.
>
> For example, consider the two sentences:
>
> 1. I will show you a valid point of reference and talk to the point.
> 2. Where have you placed the point.
>
> Now, the word embeddings from a pre-trained embeddings such as word2vec, the embeddings for the word 'point' is same for both of its occurrences in example 1 and also the same for the word 'point' in example 2. (all three occurrences has same embeddings).
>
> While, the embeddings from BERT or ELMO or any such transformer based models, the the two occurrences of the word 'point' in example 1 will have different embeddings. Also, the word 'point' occurring in example 2 will have different embeddings than the ones in example 1.
>
> [stackoverflow](https://stackoverflow.com/questions/62272056/what-are-the-differences-between-contextual-embedding-and-word-embedding)
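The static versus contextual distinction can be illustrated with a deliberately crude toy. The static vectors below are hand-picked, and the "contextual" function simply averages a word's vector with those of its immediate neighbours. This is only a stand-in to show the behaviour; it is not how ELMo or BERT compute their representations:

```python
# Hand-picked 2-d static vectors, assumed for illustration only.
static_embeddings = {
    "i": [0.1, 0.9],
    "left": [0.5, 0.5],
    "my": [0.2, 0.7],
    "phone": [0.9, 0.1],
    "on": [0.4, 0.4],
    "the": [0.3, 0.3],
    "side": [0.8, 0.6],
    "of": [0.35, 0.3],
    "table": [0.9, 0.2],
}

tokens = "i left my phone on the left side of the table".split()


def static_vector(tokens, position):
    """Static lookup: the vector depends only on the word itself."""
    return static_embeddings[tokens[position]]


def toy_contextual_vector(tokens, position):
    """Toy stand-in for a contextual model: average the word's vector
    with the vectors of its immediate neighbours."""
    start = max(0, position - 1)
    end = min(len(tokens), position + 2)
    window = [static_embeddings[token] for token in tokens[start:end]]
    return [sum(dim) / len(window) for dim in zip(*window)]


first_left, second_left = 1, 6  # positions of the two occurrences of "left"

print(static_vector(tokens, first_left), static_vector(tokens, second_left))
print(toy_contextual_vector(tokens, first_left))
print(toy_contextual_vector(tokens, second_left))
```

The static lookup returns the same vector for both occurrences of "left", while the context-dependent version returns two different vectors because the neighbouring words differ.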