# Overview

# History

I started reading the book "Modern Approaches in Natural Language Processing" by Becker et al. (2020).

The term word embedding has a broad meaning which has expanded with advancements in the field.

## Problem: Representing Words as Numbers

From the beginning of NLP and related fields, creating numeric representations of language has been a question at the forefront.

## One Hot Encoding

### Origin and Intuition

From what I [gather](https://stats.stackexchange.com/questions/308916/what-is-one-hot-encoding-called-in-scientific-literature), the term originated in the field of electrical engineering, though I was unable to find a precise publication of origin. In this context, one typically thinks of a state machine, i.e. a circuit used to maintain or track the "state" of some process. Abstractly, the process can take on a set of mutually exclusive states over its lifetime. The state machine observes the process and records its current state. This record, however, is typically an encoding: a number used to represent each unique state. And because we are dealing with electrical circuits, that number is represented in binary. There are many schemes for such an encoding, One Hot Encoding being one of them.

With one hot encoding, each state is represented by a unique position in the binary bit stream. The rules for the stream are that a bit can only be "hot" (i.e. have a value of 1) when the rest of the bits in the stream are "cold" (i.e. have a value of 0); thus only one state can be realized at a given time. In other words, only one bit can be "hot" in the group of bits (sometimes called a one hot or a one hot encoding).

We can see a book talking about this [here](https://www.sciencedirect.com/topics/computer-science/one-hot-encoding).

Although the application and implementation are physically different, One Hot Encoding in the context of machine learning is conceptually the same; we are using bits to represent states.

### Basic Concept
In the context of machine learning, one-hot encodings are often used to encode words in a vocabulary (language) or text. Each word is indexed and represented by its own number. Generally this is done using vectors, where each dimension in the vector corresponds to a specific word from the set of possible words. Thus each word has a one-to-one mapping with a one-hot vector.

The problem with this approach is that it creates "sparse vectors", i.e. vectors with high dimensionality that are mostly empty and contain only a single non-zero entry.
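
To make the mapping concrete, here is a minimal sketch of one-hot encoding over a tiny vocabulary. The vocabulary, the `word_to_index` mapping, and the `one_hot` helper are all hypothetical names chosen for illustration; real vocabularies have tens of thousands of entries, which is where the sparsity becomes a problem.

```python
# Hypothetical toy vocabulary; any fixed ordering of unique words works.
vocabulary = ["cat", "car", "tiger", "the"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


print(one_hot("tiger"))  # [0, 0, 1, 0]
```

Note that every pair of distinct words is equally far apart under this encoding, so the vectors carry no notion of similarity.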

<center><img src="images/one_hot_encoding.png" style="width:50%"></center>

## Bag Of Words

An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on Distributional Structure.

With bag of words, we are again representing a word as a vector, but this time the vector's dimensions are encoded with counts of co-occurrence. In this encoding, any time a word occurs before or after another given word, the dimension corresponding to that adjacent word is incremented. This can be done with words in the same sentence, different sentences, different chapters, etc.
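
The counting scheme described above can be sketched as follows. This is a simplified illustration, assuming sentences are already lowercased and tokenized and that "before or after" means a window of one position on each side; the function name `cooccurrence_vectors` and the two-sentence corpus are hypothetical.

```python
def cooccurrence_vectors(sentences, window=1):
    """Count, for each word, how often every other word appears within `window` positions."""
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = {word: [0] * len(vocabulary) for word in vocabulary}
    for sentence in sentences:
        for position, word in enumerate(sentence):
            start = max(0, position - window)
            end = min(len(sentence), position + window + 1)
            for neighbor_position in range(start, end):
                if neighbor_position != position:
                    # Increment the dimension that belongs to the adjacent word.
                    vectors[word][index[sentence[neighbor_position]]] += 1
    return vocabulary, vectors


# Hypothetical two-sentence corpus, already lowercased and tokenized.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
vocabulary, vectors = cooccurrence_vectors(corpus)
print(vocabulary)      # ['cat', 'ran', 'sat', 'the']
print(vectors["cat"])  # [0, 1, 1, 2]
```

Widening `window` corresponds to counting co-occurrences across a whole sentence or a larger unit of text.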

<center><img src="images/bag_of_words.png" style="width:50%"></center>

While this is less sparse, it is still sparse.

## Problem: Sparse Vectors & Word Similarity

> Many machine learning models won’t work well with high dimensional and sparse features (Goldberg (2016)). Neural networks in particular struggle with this type of data. And with growing vocabulary the feature size vectors also increases by the same length. So, the dimensionality of these approaches is the same as the number of different words in your text. That means estimating more parameters and therefore using exponentially more data is required to build a reasonably generalizable model. This is known as the curse of dimensionality. But these problems can be solved with dimensionality reduction methods such as Principal Component Analysis or feature selection models where less informative context words, such as the and a are dropped.
>
> The major drawback of these methods is that there is no notion of similarity between words. That means words like cat and tiger are represented as similar as cat and car. If the words cat and tiger would be represented as similar words one could use the information won from the more frequent word “cat” for sentences in which the less frequent word tiger appears. If the word embedding for tiger is similar to that of cat the network model can take a similar path instead of having to learn how to handle it completely anew.
>
>  Becker et. al. (2020)

## TODO: Intermediary representations?

> Vector space models have been used in distributional semantics since the 1990s. Since then, we have seen the development of a number models used for estimating continuous representations of words, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) being two such examples.
>
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove#reference1)

### Context Vectors

To overcome some of the problems with previous encodings that are sparse and/or do not convey word meaning, context vectors were introduced.

> Context Vectors are fixed-length vector representations useful for document retrieval and word sense disambiguation. Context vectors were motivated by four goals:
>
>1. Capture “similarity of use” among words (“car” is similar to “auto”, but not similar to “hippopotamus”).
>2. Quickly find constituent objects (eg., documents that contain specified words).
>3. Generate context vectors automatically from an unlabeled corpus.
>4. Use context vectors as input to standard learning algorithms.
>
> Context Vectors lack, however, a natural way to represent syntax, discourse, or logic.
>
> [Hybrid Neural Systems 1998](https://link.springer.com/chapter/10.1007/10719871_14)

**Note**: Colloquially, context vectors are used synonymously with word embeddings ([e.g.](https://www.baeldung.com/cs/word2vec-word-embeddings), [e.g.](https://medium.com/@RobinVetsch/nlp-from-word-embedding-to-transformers-76ae124e6281), [e.g.](https://towardsdatascience.com/what-in-the-corpus-is-a-word-embedding-2e1a4e2ef04d)). But there is a distinction between the two. Word embeddings are a particular class of context vectors that are typically trained using neural networks. While context vectors have been around since at least the 1980s, word embeddings were not seen until the early 2000s.

## Problem: Computational Power

> After Bengio et al.'s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.
>
> https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove

## Solution: More Computing Power

The first commercially available quad-core processors hit the market towards the end of 2006. Possibilities began to open up as operating systems became able to support more memory and larger file systems. Advancements continued through the 2010s, pushing computing power, memory, and storage space far beyond the levels of the early 2000s. This greatly enabled the continued development of neural networks and thus research into context vectors.

## Solution: More Efficient Algorithms and Less Expensive Loss Functions
As we will see, while computing power increased and research continued, new methods were introduced that require less computation than their predecessors (e.g. Collobert and Weston).

## Word Embeddings

To overcome the problem of sparse vectors and the problem of encoding a word's meaning, word embeddings were introduced.

Word embeddings also represent words as vectors in a high-dimensional space. The difference with this encoding scheme is that words with similar meanings are positioned closer to each other within the vector space.

**Note**: Sometimes word embeddings are referred to as "distributed word representations" or "word representations" (e.g. [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)).

**Note**: As we will see, the traditional word embeddings lacked context (i.e. information about semantic relationships at the phrase level was not collected and encoded). This gave rise to Contextual Word Embeddings.

> Word embeddings are based on the idea that contextual information alone constitutes a viable representation of linguistic items, in stark contrast to formal linguistics and the Chomsky tradition. This idea has its theoretical roots in structuralist linguistics and ordinary language philosophy, and in particular in the works of Zellig Harris, John Firth, and Ludwig Wittgenstein, all publishing important works in the 1950s (in the case of Wittgenstein, posthumously). The earliest attempts at using feature representations to quantify (semantic) similarity used hand-crafted features. Charles Osgood’s semantic differentials in the 1960s is a good example, and similar representations were also used in early works on connectionism and artificial intelligence in the 1980s.
>
> Methods for using automatically generated contextual features were developed more or less simultaneously around 1990 in several different research areas...
>
> Later developments are basically only refinements of these early models...
>
> The main difference between these various models is the type of contextual information they use...
> 
> These different contextual representations capture different types of semantic similarity; the document-based models capture semantic relatedness (e.g. “boat” – “water”) while the word-based models capture semantic similarity (e.g. “boat” – “ship”). This very basic difference is too often misunderstood.
>
> [*A Brief History of Word Embeddings*](https://www.gavagai.io/text-analytics/a-brief-history-of-word-embeddings/)

### Bengio et. al. (2003)

I have seen claims ([e.g.](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove)) that the term word embeddings was originally coined in a 2003 [paper](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) titled "A Neural Probabilistic Language Model" by Bengio et al. But if you read the paper, the term embedding is not found.

Instead we see them using terms like "distributed representations (of words)" or "distributed word feature vectors".

I think it's more accurate to say: Bengio was the first to propose a neural network-based word embedding model, and his work inspired several other researchers like Mikolov who, as we will see, made a major publication in 2013. [ref](https://medium.com/co-learning-lounge/nlp-word-embedding-tfidf-bert-word2vec-d7f04340af7f)



Additionally, I believe Bengio et al. were the first to train word representations in a neural language model jointly with the model's parameters. We will see this practice carry forward and influence the development of the Transformer model.

### Collobert and Weston (2008)

I have seen several articles ([e.g.](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove) [e.g.](https://www.ruder.io/word-embeddings-1)) arguing that Collobert and Weston were the first to show the value of pre-training word embeddings so that they can be used in downstream tasks.

Essentially they showed how to apply word embeddings to [transfer learning](Transfer%20Learning%20And%20Pre-Trained%20Models.ipynb).

Their [paper](http://machinelearning.org/archive/icml2008/papers/391.pdf) titled *A unified architecture for natural language processing* also introduces a neural network architecture that forms the foundation for many current approaches.

In their paper they also note that:

>  (Bengio & Ducharme, 2001) and (Schwenk & Gauvain, 2002) already presented very similar language models. However, their goal was to give a probability of a word given previous ones in a sentence. Here, we only want to have a good representation of words

>In their 2011 paper, they further expand on this [8].

#### Language Representations vs. Word Predictors

Rather than constructing devices explicitly for next word prediction, they began looking at vectors more generally as a means of storing meaning based on context: thus the term context vectors. This abstraction allowed a more generic approach that could then be applied downstream through transfer learning.

Another big win was an optimization of the training process to use a less computationally demanding loss function:

> In order to avoid computing the expensive softmax, their solution is to employ an alternative objective function: rather than the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score ... for a correct word sequence (a probable word sequence in Bengio's model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion
>
> https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove
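
A pairwise ranking criterion of this kind can be sketched as a hinge-style loss. This is a minimal illustration rather than the authors' implementation: the margin of 1 and the function name `pairwise_ranking_loss` are assumptions, and the scores stand in for whatever the network outputs for a correct and a corrupted word window.

```python
def pairwise_ranking_loss(correct_score, corrupted_score, margin=1.0):
    """Hinge-style loss: zero once the correct window outscores the corrupted one by `margin`."""
    return max(0.0, margin - correct_score + corrupted_score)


# Hypothetical scores a network might assign to a real and a corrupted word window.
print(pairwise_ranking_loss(correct_score=2.5, corrupted_score=0.5))   # 0.0
print(pairwise_ranking_loss(correct_score=0.25, corrupted_score=1.0))  # 1.75
```

Because the loss only compares two scores, no normalization over the whole vocabulary is needed, which is where the savings over a full softmax come from.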

### Mikolov et. al. (2013) - Word2vec

In 2013, Mikolov et al. [published](https://arxiv.org/abs/1301.3781) *Efficient Estimation of Word Representations in Vector Space*, which proposes two novel model architectures for computing continuous vector representations of words from very large data sets (~1.6 billion words).

> It was Mikolov et al. (2013), however, who really brought word embedding to the fore through the creation of word2vec, a toolkit enabling the training and use of pre-trained embeddings.
>
> [aylien - blog](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove)

It is important to note that these new architectures are computationally less expensive than previous models (like those of Bengio or Collobert) because of two features:

- They forgo the costly hidden layer.
- They allow the language model to take additional context into account.

[aylien - blog](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove)
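
To illustrate what forgoing the hidden layer means, here is a rough sketch of a CBOW-style forward pass: the context embeddings are simply averaged and the result is scored against every vocabulary word. The vocabulary, the weight values, and the function name `cbow_probabilities` are hypothetical, and a full softmax is used for readability even though word2vec relies on cheaper alternatives such as hierarchical softmax.

```python
import math

# Hypothetical toy parameters: 4-word vocabulary, 2-dimensional embeddings.
vocabulary = ["the", "cat", "sat", "down"]
input_embeddings = [[0.1, 0.3], [0.7, -0.2], [0.4, 0.9], [-0.5, 0.6]]
output_embeddings = [[0.2, 0.1], [0.6, -0.1], [0.3, 0.8], [-0.4, 0.5]]


def cbow_probabilities(context_words):
    """Average the context embeddings, then score every vocabulary word directly."""
    indices = [vocabulary.index(word) for word in context_words]
    dimensions = len(input_embeddings[0])
    # Projection step: a plain average, with no non-linear hidden layer.
    averaged = [
        sum(input_embeddings[i][d] for i in indices) / len(indices)
        for d in range(dimensions)
    ]
    scores = [sum(a * o for a, o in zip(averaged, row)) for row in output_embeddings]
    # Full softmax (assumption made for clarity, not for efficiency).
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Probability of each vocabulary word being the word between "the" and "sat".
probabilities = cbow_probabilities(["the", "sat"])
for word, probability in zip(vocabulary, probabilities):
    print(word, round(probability, 3))
```

Training would adjust both embedding tables so that the true center word receives the highest probability; that step is omitted here.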

In addition, Mikolov and team contribute to the discussion of how training time and accuracy depend on the dimensionality of the word vectors and on the amount of training data. Building on previous Mikolov publications as well as the work of others, they construct an equation to model the time complexity of the training process.


### Pennington et al. (2014) - GloVe

> A year later, Pennington et al. introduced us to GloVe, a competitive set of pre-trained embeddings, suggesting that word embeddings was suddenly among the mainstream.
>
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove#reference1)

## Problem: Global Meaning vs. Contextual Meaning

Traditional word embedding techniques learn meaning globally. In other words, they learn an embedding that is applicable to all instances of a given word within the training corpus. They do this by computing and encoding global co-occurrence statistics for a given word into the corresponding vector. As such, every instance of a word maps to the same vector, regardless of its context.

In some writings, we use the term "static" to reflect the fact that the vectors are not changing based on the context.

The problem with this approach is that while we do capture some contextual information, there is a "blurring" that "mutes" the nuance of a particular word. 

For example, only one representation is learnt for the word "left" in the sentence "I left my phone on the left side of the table." However, "left" has two different meanings which depend on the context of the word. A better model would be able to handle two different representations in the embedding space.

[stackoverflow](https://stackoverflow.com/questions/62272056/what-are-the-differences-between-contextual-embedding-and-word-embedding)

## Solution: Context Embeddings

Contextual embedding methods, however, are designed to learn sequence-level semantics, with context being expressed as a function over the entire input sequence. As such, Context Embeddings are able to transcend the limitations of traditional, i.e. global, word representations (word2vec, GloVe), which look at context as a function of co-occurrence within the input sequence.
They are able to learn different representations for polysemous words (words with multiple meanings, e.g. homonyms), such as "left" in the example above, based on their context.

Additionally, even if a word has a similar meaning and is used in a similar way, the context embedding will still be slightly different due to the difference in context.

For example, consider the two sentences:
- I will show you a valid point of reference and talk to the point.
- Where have you placed the point.

With pre-trained word embeddings such as word2vec, the embedding for the word 'point' is the same for both of its occurrences in sentence one and also the same for the word 'point' in sentence two (all three occurrences have the same embedding).

The context embeddings, on the other hand, will be different for all three occurrences of the word 'point'.

[stackoverflow](https://stackoverflow.com/questions/62272056/what-are-the-differences-between-contextual-embedding-and-word-embedding)
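
The contrast can be sketched with a toy example. The static lookup below behaves like a word2vec-style table, while `toy_contextual_vectors` is only a stand-in for a contextual model: it averages each token with its neighbors (two on each side), which is a deliberate simplification and not how ELMo or BERT work. All vectors and names are hypothetical.

```python
# Hypothetical 2-dimensional static embeddings; real models learn these from a corpus.
static_embeddings = {
    "where": [0.1, 0.2], "have": [0.0, 0.5], "you": [0.3, 0.3],
    "placed": [0.9, 0.1], "the": [0.2, 0.2], "point": [0.5, 0.8],
    "talk": [0.4, 0.7], "to": [0.1, 0.1],
}


def static_vectors(tokens):
    """Static lookup: a token maps to the same vector in every sentence."""
    return [static_embeddings[token] for token in tokens]


def toy_contextual_vectors(tokens):
    """Toy stand-in for a contextual model: average each token with nearby tokens."""
    vectors = static_vectors(tokens)
    mixed = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 2): i + 3]
        mixed.append([sum(column) / len(window) for column in zip(*window)])
    return mixed


first = ["talk", "to", "the", "point"]
second = ["where", "have", "you", "placed", "the", "point"]

# Static: "point" gets the same vector in both sentences.
print(static_vectors(first)[-1] == static_vectors(second)[-1])  # True
# Toy contextual: the vector for "point" now depends on its neighbors.
print(toy_contextual_vectors(first)[-1] == toy_contextual_vectors(second)[-1])  # False
```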

This distinguishes Context Embeddings as a new class with fundamentally different philosophies on the nature of context and thus the inherent meaning of words.

An additional difference is that context embeddings are generally oriented at the token level rather than the word level. They assign each token a representation based on its context. As such, it is possible for the same token to have multiple representations based on its context.

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

Another distinction is the ability to differentiate nuance between differing context words.

> A limitation of CBOW is that it equally weights the context words when making a prediction, which is inefficient, since some words have higher predictive value than others.
>
> https://aclanthology.org/2020.coling-main.608.pdf

These advancements have led to the discovery that Context Embeddings are transferable between languages. A great article on the subject can be found [here](https://www.ruder.io/cross-lingual-embeddings/).

> Further analyses (Liu et al., 2019a; Hewitt and Liang , 2019 ; Hewitt and Manning , 2019 ; Tenney et al. , 2019a) demonstrate that contextual embeddings are capable of learning useful and transferable representations across languages

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

**Note**: Despite significant differences, the two terms are used interchangeably in a number of texts.

Below we list a number of major developments within the field:

### Dai and Le (2015) - Precursor To Modern Context Embeddings

> Dai and Le (2015) is the first work we are aware of that uses language modelling together with a sequence autoencoder to improve sequence learning with recurrent networks. Thus, it can be thought of as a precursor to modern contextual embedding methods.
>
> [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

### (Rocktaschel et al. 2015) - Traversal-style Approaches
> These approaches pre-process each text input as a single contiguous sequence of tokens through special tokens including [START] (the start of a sequence), [DELIM] (delimiting two sequences from the text input) and [EXTRACT] (the end of a sequence).
>
> [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

### (2015) - OpenAI Founded

> OpenAI was founded in 2015 as a nonprofit research organization by Altman, Elon Musk, Peter Thiel, and LinkedIn cofounder Reid Hoffman, among other tech leaders.
>
> [vice](https://www.vice.com/en/article/5d3naz/openai-is-now-everything-it-promised-not-to-be-corporate-closed-source-and-for-profit)

### Ramachandran et al. (2016) - Initializing With Pretrained Weights
> Ramachandran et al. (2016) extends Dai and Le (2015) by proposing a pre-training method to improve the accuracy of sequence to sequence (seq2seq) models. The encoder and decoder of the seq2seq model is initialized with the pre-trained weights (as opposed to random weights, which are resolved based on the input and output sequence corpus).
>
> [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

### (Vaswani et al., 2017) - Transformer Architecture
> [The Transformer] has been shown to better capture global dependencies from the inputs compared to its alternatives, e.g. recurrent networks, and perform strongly on a range of sequence learning tasks, such as machine translation (Vaswani et al., 2017) and document generation (Liu et al., 2018).
>
> [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

This model is discussed in further detail in the [Transformer notebook](Transformers.ipynb).

### (Peters et al., 2018) - Bidirectional Training

> The ELMo model (Peters et al., 2018) generalizes traditional word embeddings by extracting context-dependent representations from a bidirectional language model.
>
> [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

The paper is titled *Deep contextualized word representations* and can be found [here](https://arxiv.org/abs/1802.05365).

### (Radford et al., 2018) - GPT

> GPT adopts a two-stage learning paradigm: (a) unsupervised pre-training using a language modelling objective and (b) supervised fine-tuning.
>
> The goal is to learn universal representations transferable to a wide range of downstream tasks.
>
> [ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)

Radford and the rest of the OpenAI team published *Improving Language Understanding by Generative Pre-Training* on June 11, 2018, which can be found [here](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf).

GPT was developed by OpenAI.

### (Devlin et al., 2018) - BERT - Masked Language Modeling (MLM)

ELMo concatenates representations from the forward and backward LSTMs without considering the interactions between the left and right contexts. GPT and GPT-2 use a left-to-right decoder, where every token can only attend to its left context. These architectures are sub-optimal for sentence-level tasks, e.g. named entity recognition and sentiment analysis, as it is crucial to incorporate contexts from both directions.

BERT proposes a masked language modelling (MLM) objective, where some of the tokens of an input sequence are randomly masked, and the objective is to predict these masked positions taking the corrupted sequence as input. BERT applies a Transformer encoder to attend to bi-directional contexts during pre-training.

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)
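
The masking step of the MLM objective can be sketched as below. This is a simplified illustration: the function name `mask_tokens`, the whitespace tokenization, and the fixed seed are assumptions, and the default rate of 15% mirrors the figure commonly cited for BERT. The original recipe also sometimes keeps a selected token or swaps in a random one, which is omitted here.

```python
import random


def mask_tokens(tokens, mask_probability=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and record the originals as targets."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_probability:
            corrupted.append("[MASK]")
            targets[position] = token  # the model is trained to recover this token
        else:
            corrupted.append(token)
    return corrupted, targets


tokens = "i left my phone on the left side of the table".split()
# A higher rate than the default is used so that this short sentence likely shows a mask.
corrupted, targets = mask_tokens(tokens, mask_probability=0.3)
print(corrupted)
print(targets)
```

The model receives `corrupted` as input and is scored only on the positions listed in `targets`, so it can use tokens on both sides of each mask.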

### (Radford et al., 2019) GPT-2

GPT-2 follows a similar architecture to the original GPT.

It trains on a significantly larger corpus named WebText (scraped from Reddit posts and corresponding outbound links), which inherently exhibits instances of text formatted in a question and answer style structure as well as summarization information. This results in the model being able to solve a wide variety of NLP tasks without explicit supervision.

Like GPT, GPT-2 uses a left-to-right decoder rather than a bi-directional encoder like ELMo.

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)
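
The left-to-right constraint is usually enforced with a causal attention mask. The sketch below only builds the mask, assuming the common convention that a position may attend to itself and to everything before it; the function name `causal_mask` is hypothetical.

```python
def causal_mask(sequence_length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(sequence_length)] for i in range(sequence_length)]


# Each row is a token position; "x" marks the positions it is allowed to see.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

A bi-directional encoder corresponds to a mask that is `True` everywhere.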

In the original paper, the authors compare GPT-2 to GPT and BERT, highlighting that GPT-2 is a "larger" model, meaning that the neural network is larger and thus more weights need to be trained. The GPT-2 model is said to have ~1.5 billion parameters.

GPT-2 was published in February 2019 in a paper titled *Language Models are Unsupervised Multitask Learners*, which can be found [here](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).

### (2018 - 2019) - BERT Variants 
Variants of BERT, including ERNIE, SpanBERT, StructBERT, RoBERTa, and ALBERT, further study and improve the objective and architecture of BERT.

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)


### (Yang et al., 2019) - XLNet, Traversal Without Artificial Symbols

The XLNet model identifies two weaknesses of BERT:
1. BERT assumes conditional independence of corrupted tokens.
2. The symbols such as [MASK] are introduced by BERT during pre-training, yet they never occur in real data, resulting in a discrepancy between pre-training and fine-tuning.

XLNet proposes a new auto-regressive method based on permutation language modelling (PLM) (Uria et al., 2016) without introducing any new symbols. 

XLNet further adopts two-stream self-attention and Transformer-XL (Dai et al., 2019) to take into account the target positions and learn long-range dependencies, respectively.

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)


### (Clark et al., 2019) - ELECTRA - Traversing Via Imputation

> Compared to BERT, ELECTRA (Clark et al., 2019) proposes a more effective pretraining method. Instead of corrupting some positions of inputs with [MASK], ELECTRA replaces some tokens of the inputs with their plausible alternatives sampled from a small generator network. ELECTRA trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can then be used in downstream tasks for fine-tuning, improving upon the pre-trained representation learned by the generator.

[ Liu et. al. (2020) - A Survey on Contextual Embeddings](https://arxiv.org/abs/2003.07278)
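
The discriminator's training targets can be sketched as a per-token binary label. This is only an illustration of the labeling step: the generator is not modeled, the "corrupted" sentence is written by hand, and the function name `replaced_token_labels` is hypothetical.

```python
def replaced_token_labels(original_tokens, corrupted_tokens):
    """Label each position 1 if the token differs from the original, else 0."""
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hypothetical generator output
print(replaced_token_labels(original, corrupted))  # [0, 0, 1, 0, 0]
```

Unlike MLM, every position yields a training signal, not just the masked ones.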


### (Lewis et al., 2019) - BART - Advanced MLM

For more information see the [BART notebook](BART.ipynb)

### (March 2019) - OpenAI goes Closed-source

> The code behind both GPT-1 and GPT-2 has been officially released by OpenAI and is available on GitHub for any developer to utilize and make improvements on. The same cannot be said for GPT-3. Rather than deliver on their original promise listed in their mission statement claiming that “[our code] will be shared with the world,” (OpenAI) OpenAI instead decided to not release the source code for GPT-3 and instead release the model in the form of service.
>
> [source](https://sites.imsa.edu/hadron/2021/02/03/openai-was-the-shift-to-closed-source-justified/)

> In (March) 2019, OpenAI became a for-profit company called OpenAI LP, controlled by a parent company called OpenAI Inc. The result was a “capped-profit” structure that would limit the return of investment at 100-fold the original sum. If you invested \\$10 million, at most you’d get \\$1 billion. Not exactly what I’d call capped.
>
> A few months after the change, Microsoft injected $1 billion. OpenAI’s partnership with Microsoft was sealed on the grounds of allowing the latter to commercialize part of the tech, as we’ve seen happening with GPT-3 and Codex.
>
> [source](https://onezero.medium.com/openai-sold-its-soul-for-1-billion-cf35ff9e8cd4)
>
> [source](https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/)


### (Brown et al., 2020) - GPT-3 - 175 Billion Parameters

GPT-3 was announced in May and released in June 2020.

In the [paper](https://arxiv.org/abs/2005.14165) the authors note that increasing the size of the language model improves its ability to perform downstream few-shot tasks:

> Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

### (March 2022) - GPT-3.5 - Enhancements And Newer Data

In March 2022, OpenAI made available new versions of GPT-3 which were trained on newer data sets (including text as well as code) spanning up to June 2021. Additionally, the public API added new features like edit and insert.

There were several models/releases included in the 3.5 family:
- gpt-3.5-turbo (chat)
- text-davinci-002 (text completion)
- text-davinci-003 (text completion)

https://en.wikipedia.org/wiki/GPT-3

> GPT-3.5 is an upgraded version of GPT-3 with fewer parameters that includes a fine-tuning process for machine learning algorithms. The fine-tuning process involves reinforcement learning with human feedback, which helps to improve the accuracy and effectiveness of the algorithms. Additionally, GPT-3.5 is designed to work within policies based on ethical human values, ensuring that the AI systems it powers are safe and reliable for human use.
>
> [source](https://www.iffort.com/blog/2023/03/31/gpt-3-vs-gpt-3-5)

> Instead of releasing GPT-3.5 in its fully trained form, OpenAI utilized it to develop several systems specifically optimized for various tasks, all accessible via the OpenAI API. One of these, text-davinci-003, is said to handle more intricate commands than models constructed on GPT-3 and produce higher quality, longer-form writing.
>
> OpenAI data scientist Jan Leike stated that text-davinci-003 is comparable to InstructGPT, a series of GPT-3-based models that OpenAI introduced earlier this year. These models are designed to minimize the generation of problematic text, like toxic or highly biased content, while better adhering to a user’s intentions.
>
> https://blog.accubits.com/gpt-3-vs-gpt-3-5-whats-new-in-openais-latest-update/#What%E2%80%99s-different-in-GPT-3.5

### (November 2022) - ChatGPT - User Facing Chat Bot

The original release of ChatGPT was based on the GPT-3 family of models (specifically the GPT-3.5 series) but has since been updated to support newer versions (such as GPT-4) as well.

While the GPT family of models is geared towards researchers and developers, ChatGPT is geared toward end users. Through the Web UI or API, users can now interact with the GPT models in a much friendlier way.

This marked the beginning of a monumental shift in the world's perception of this technology.

### (March 2023) - GPT-4 - Parameter Count Undisclosed

In addition to its large size, GPT-4 boasts the following enhancements:

- Improved model alignment — the ability to follow user intention
- Lower likelihood of generating offensive or dangerous output
- Increased factual accuracy
- Better steerability — the ability to change behavior according to user requests
- Internet connectivity – the latest feature includes the ability to search the Internet in real-time

[source](https://www.forbes.com/sites/bernardmarr/2023/05/19/a-short-history-of-chatgpt-how-we-got-to-where-we-are-today/?sh=515ef82f674f)

# Basic Concept

Word embeddings are, again, a type of encoding that uses a vector to store the encoding information. However, as opposed to some of the previous implementations, which use discrete vectors, word embeddings use continuous vectors to represent each word in a vocabulary.

The trick with word embeddings is that each dimension corresponds to a particular aspect of a word's meaning. Additionally, meanings are not mutually exclusive: words with similar meanings can share non-zero values in a dimension, and words with opposite meanings can take negative values, for example.

Words that are similar to each other in meaning have similar vector values and are thus located close to one another geometrically.

This is an important design consideration; the overlap allows the vectors to be much denser than the prior encodings, which were all sparse.
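
To make the sparse versus dense contrast concrete, here is a minimal sketch in plain Python. The vectors are made up by hand purely for illustration (real embeddings are learned and have hundreds of dimensions), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot encoding: one dimension per word, so no two words overlap.
one_hot = {
    "cat": [1, 0, 0],
    "dog": [0, 1, 0],
    "car": [0, 0, 1],
}

# Hypothetical dense embeddings (assumed values, not from a trained model).
# Dimension 0 loosely means "animal-ness", dimension 1 "vehicle-ness".
dense = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal: similarity is 0.0.
print(cosine_similarity(one_hot["cat"], one_hot["dog"]))

# Dense vectors can overlap: "cat" is close to "dog" and far from "car".
print(cosine_similarity(dense["cat"], dense["dog"]))
print(cosine_similarity(dense["cat"], dense["car"]))
```

With one-hot vectors every pair of words looks equally unrelated, whereas the dense vectors need fewer dimensions and still capture that "cat" and "dog" are neighbours.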

So the question becomes: what meaning is associated with each dimension? This is a difficult question to answer because the dimensions are not chosen by hand; they are learned by a machine learning algorithm that optimizes some loss function.

We will also see that there are many implementations which use their own loss functions, etc., to determine the dimensionality and the values of the vectors.

But a common analogy is to think of the principal axes or principal components one discovers through [Principal Component Analysis (PCA)](../../../../Data%20Science/Principal%20Component%20Analysis%20(PCA)/Principal%20Component%20Analysis.ipynb). According to Becker et al. (2020), the dimensionality is usually between 100 and 500 principal meanings of words.

Using this analogy, the word embedding algorithm seeks to minimize the distance between similar words while keeping the dimensionality low.

<center><img src="images/word_embeddings.png"></center>

Another important feature of word embeddings is that, because they are continuous and encode meaning, they allow us to perform algebra:

> With such word vectors even algebraic computations become possible as shown in Tomáš Mikolov, Yih, and Zweig (2013). For example, vector(King)−vector(Man)+vector(Woman) results in a vector that is closest to the vector representation of the word Queen. Another possibility to use word embeddings vectors is translation between languages. Tomas Mikolov, Le, and Sutskever (2013) showed that they can find word translations by comparing vectors generated from different languages. By searching for a translation one can use the word vector from the source language and search for the closest vector in the target language vector space, this word can then be used as a translation. The reason this works is that if a word vector from one language is similar to the word vector of the other language, this word is used in a similar context. This method can be used to infer missing dictionary entries. An example for this method depicted in figure 3.4. In figure 3.4 the vectors for numbers and animals are depicted on the left side and the same words are depicted on the right side. It can be seen that the vectors for the correct translation align in similar geometric spaces. Again, two-dimensional representation was achieved by using dimension reduction methods.


<center><img src='images/word_embedding_algebra.png'></center>

> FIGURE 3.4: Distributed word vector representations of numbers and animals in English (left) and Spanish (right). Source: Tomas Mikolov, Le, and Sutskever (2013)
>
> Becker et al. (2020)

There are a number of implementations of word embeddings, as we will see.
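
The vector algebra described in the quote above (King − Man + Woman ≈ Queen) can be sketched with toy vectors. Everything here is an assumption for illustration: the three dimensions and their values are hand-picked rather than learned, and `analogy` is a hypothetical helper. The input words are excluded from the candidates, which is a common convention when evaluating analogies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the three query words so the answer is a genuinely new word.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))


# Hand-made toy vectors: [royalty, gender (+ male / - female), food].
toy_vectors = {
    "king": [0.9, 0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man": [0.1, 0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

Subtracting "man" removes the male component while keeping royalty, and adding "woman" puts the female component back, which lands on the toy vector for "queen".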

# Intuition Behind Terminology

I started wondering why word embeddings were given their name. I googled and found the following answer on [Quora](https://www.quora.com/Why-are-word-embeddings-called-word-embeddings):

> "Word": This part of the term specifies that we are dealing with individual words in the vocabulary. In reality we have evolved the term to deal with "tokens" which may be only parts of words.
>
> The term "embedding" comes from the field of mathematics, where it refers to the process of mapping objects from one space into another, often with the goal of preserving certain relationships or properties. In the context of word embeddings: ... (the term) signifies the act of placing or mapping words into a continuous vector space. Just as embedding a physical object in a material might mean surrounding it or encapsulating it within that material, word embeddings encapsulate the semantic meaning and relationships of words within a vector space.

# Implementations

## Word2vec

### Overview

From googling, I see that word2vec is not a singular algorithm, rather, it is a family of model architectures and optimizations for learning word embeddings.

Within the word2vec umbrella, there are two implementations of word embeddings:

- **Continuous bag-of-words model**: predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
- **Continuous skip-gram model**: predicts words within a certain range before and after the current word in the same sentence. A worked example of this is given below.

Generally speaking, the two algorithms rely on shallow neural networks.
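
The difference between the two architectures shows up in how training examples are built from a sentence. The sketch below is a simplified illustration, not the reference implementation: the function names are hypothetical, the window size is an assumed parameter, and refinements such as subsampling and negative sampling are left out.

```python
def context_window(tokens, index, window):
    """Words within `window` positions before and after tokens[index]."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbour."""
    return [
        (center, context)
        for i, center in enumerate(tokens)
        for context in context_window(tokens, i, window)
    ]


def cbow_examples(tokens, window=2):
    """(context words, center) examples: the neighbours jointly predict the center."""
    return [
        (context_window(tokens, i, window), center)
        for i, center in enumerate(tokens)
    ]


sentence = "the quick brown fox jumps".split()

# Skip-gram: one training pair per (center, neighbour) combination.
print(skipgram_pairs(sentence, window=1))

# CBOW: one training example per position, with the context as a bag of words.
print(cbow_examples(sentence, window=1))
```

In CBOW the context words are combined into a single input, which is why their order does not matter, while skip-gram treats every neighbour as its own prediction target.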

Great articles can be found [here](https://jalammar.github.io/illustrated-word2vec/), [here](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa), and [here](https://towardsdatascience.com/what-in-the-corpus-is-a-word-embedding-2e1a4e2ef04d).

### History

According to [Wikipedia](https://en.wikipedia.org/wiki/Word2vec), word2vec was published in 2013 by a team of researchers led by Tomas Mikolov at Google in two papers titled [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781) and [Distributed Representations of Words and Phrases and their Compositionality](https://ui.adsabs.harvard.edu/abs/2013arXiv1310.4546M/abstract). Among the authors, Tomas Mikolov is the most widely cited. Additionally, the algorithm was [patented](https://worldwide.espacenet.com/patent/search/family/053054725/publication/US9037464B1?q=pn%3DUS9037464) in 2015.

Additionally, a follow-up paper was [published](https://arxiv.org/abs/1402.3722) in 2014 by Goldberg and Levy explaining the math and rationale behind it.

With the invention of the [Transformer](./Transformers.ipynb), the word2vec algorithm is now seen as outdated as a means of producing word embeddings.

### Tutorials

[Using tensorflow](https://www.tensorflow.org/text/tutorials/word2vec)

## GloVe



GloVe was [published](https://aclanthology.org/D14-1162/) in 2014 by a team of researchers at Stanford University (Pennington et al.). The project is open source and the home page can be found [here](https://nlp.stanford.edu/projects/glove/). It hosts several iterations and versions of word vectors trained through various means.

According to the paper:

> (we introduce... ) a new model for word representation which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.

According to the home page:
> Training is performed on aggregated global word-word co-occurrence statistics from a corpus

And according to the [paper's](https://aclanthology.org/D14-1162.pdf) abstract:
> Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods.

While algorithms like word2vec implicitly derive word meaning, GloVe tries to do this explicitly.

> the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.
>
> For this to be accomplished, they propose a weighted least squares objective (J) that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences
>
> ...
>
> With GloVe, we have already seen that the differences are not as obvious: While GloVe is considered a predict model by Levy et al. (2015) [10], it is clearly factorizing a word-context co-occurrence matrix, which brings it close to traditional methods such as PCA and LSA. Even more, Levy et al. [12] demonstrate that word2vec implicitly factorizes a word-context PMI matrix.
> [source](https://aylien.com/blog/overview-word-embeddings-history-word2vec-cbow-glove)
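
The weighted least squares objective mentioned in the quote can be written as $J = \sum_{i,j} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$, where $X_{ij}$ is the number of times word $j$ appears in the context of word $i$. Below is a minimal sketch that only evaluates this loss for given vectors; it does not train anything. The function names, the tiny co-occurrence table, and the vector values are assumptions for illustration, while the weighting function with `x_max = 100` and `alpha = 0.75` follows the values reported in the GloVe paper.

```python
import math


def weighting(x, x_max=100.0, alpha=0.75):
    """f(x): down-weights rare co-occurrences and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(cooccurrence, word_vecs, context_vecs, word_bias, context_bias):
    """Weighted squared error between w_i . w~_j + biases and log(X_ij)."""
    total = 0.0
    for (i, j), count in cooccurrence.items():
        if count <= 0:
            continue  # log(0) is undefined; zero counts contribute nothing
        dot = sum(a * b for a, b in zip(word_vecs[i], context_vecs[j]))
        error = dot + word_bias[i] + context_bias[j] - math.log(count)
        total += weighting(count) * error ** 2
    return total


# Toy co-occurrence counts X_ij for a 3-word vocabulary (made-up numbers).
cooccurrence = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 1.0}

# Arbitrary 2-dimensional starting vectors and zero biases (assumed values).
word_vecs = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
context_vecs = [[0.2, 0.1], [-0.1, 0.4], [0.3, 0.3]]
word_bias = [0.0, 0.0, 0.0]
context_bias = [0.0, 0.0, 0.0]

loss = glove_loss(cooccurrence, word_vecs, context_vecs, word_bias, context_bias)
print(loss)
```

Training GloVe amounts to adjusting the vectors and biases to drive this number down, so that dot products between word vectors approximate the logarithm of their co-occurrence counts.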