# Word Embeddings

We have learned a few ways to extract features from text data and represent them as vectors. For large language models and other ambitious natural language processing projects, word embeddings are a critical technique, and they serve as the capstone to this course.

**Word embeddings** are a technique for representing words as vectors so that similar or related words sit close together in a vector space. They have enabled many breakthroughs in natural language processing, including the large language models we know today. Their technical advantage is density: instead of one sparse dimension per vocabulary word, each word gets a compact vector, and related words cluster together. This greatly reduces the number of features, and because related words share nearby representations, what is learned from one word's context carries over to its neighbors.

Each word is mapped to a vector, and that vector is a point in space. Words that tend to appear in similar contexts have vectors that are closer together. For example, "dog" and "dalmatian" are likely to be close together. "Dog" and "cat" will also be close together, because those two words often appear in the same sentences and contexts. The word "pet" should be close to those vectors as well.

"Candy" and "chocolate" will be close together, but nowhere near "dog", "pet", "cat", and related words.
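
To make "close together" concrete, here is a minimal sketch that compares words with cosine similarity, a common way to measure how closely two vectors point in the same direction. The three-dimensional vectors in `toy_vectors` are invented by hand for this illustration. Real embeddings are learned from text and usually have far more dimensions.

```python
import math

# Hand-made 3-dimensional vectors, invented purely for illustration.
# Real embeddings are learned from data, not written by hand.
toy_vectors = {
    "dog":       [0.90, 0.80, 0.10],
    "dalmatian": [0.85, 0.90, 0.05],
    "cat":       [0.80, 0.70, 0.20],
    "pet":       [0.75, 0.75, 0.15],
    "candy":     [0.10, 0.05, 0.90],
    "chocolate": [0.05, 0.10, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

pairs = [("dog", "dalmatian"), ("dog", "cat"), ("candy", "chocolate"), ("dog", "candy")]
for left, right in pairs:
    score = cosine_similarity(toy_vectors[left], toy_vectors[right])
    print(f"{left:>9} vs {right:<9} -> {score:.3f}")
```

With these made-up values, the animal words score close to 1.0 against each other, while "dog" against "candy" scores much lower.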

So how are these word embeddings built? Generally speaking, a neural network or other algorithm looks at the words surrounding each word in the training text. The result is a model that, for a given input word, outputs the probability that each other word is associated with it, or perhaps is the next word in a sentence. Think of a given input word producing a probability distribution over the words likely to surround it. The embeddings themselves are the vectors the model learns along the way in order to make these predictions. Given a large enough text dataset, there is enough context to predict relevant words reliably.
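
The idea of an input word producing a probability distribution over its neighbors can be sketched without a neural network at all. The example below simply counts neighbors in a tiny made-up corpus and normalizes the counts. It is a counting stand-in for what a trained model would predict, not how an embedding model is actually trained, and the function name `context_distribution` and the `window` parameter are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# A tiny made-up corpus; real embeddings need far more text.
corpus = [
    "my dog is a good pet",
    "my cat is a good pet",
    "the dog chased the cat",
    "chocolate candy is sweet",
]

def context_distribution(sentences, window=2):
    """For each word, estimate P(context word | word) from neighbor counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    # Normalize counts so the probabilities for each word sum to 1.
    return {
        word: {ctx: n / sum(ctx_counts.values()) for ctx, n in ctx_counts.items()}
        for word, ctx_counts in counts.items()
    }

distribution = context_distribution(corpus)
for ctx, prob in sorted(distribution["dog"].items(), key=lambda kv: -kv[1]):
    print(f"P({ctx} | dog) = {prob:.2f}")
```

In this corpus "candy" never appears near "dog", so it gets no probability at all. A learned model would instead assign it a small but nonzero probability.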

What gets really interesting is how contexts can be navigated with vector arithmetic. With a sufficiently well-trained word embedding model, you can take the vector for "king," subtract the vector for "man," add the vector for "woman," and land close to the vector for "queen." Similarly, if you take the vector for "shirt," subtract "man," and add "woman," you might get "blouse." Or if you take the vector for "Berlin," subtract "Germany," and add "England," you *should* land near "London."
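
Here is a minimal sketch of that arithmetic. The two-dimensional vectors are invented by hand so that one axis loosely means "royalty" and the other "gender"; a trained model has no such labeled axes, and its results are only approximate. The helper `analogy` is a hypothetical name for this illustration.

```python
import math

# Hand-made 2-D vectors: axis 0 is "royalty", axis 1 is "gender".
# These values are invented so the arithmetic is easy to follow.
toy_vectors = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1,  0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.7,  0.7],
    "apple":  [-0.5, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Excluding the three input words from the candidates matters in practice, because the result of the arithmetic often stays closest to one of the words you started with.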

## Word2Vec

In 2013, a team at Google led by Tomas Mikolov published [Word2Vec](https://en.wikipedia.org/wiki/Word2vec), a now-famous method for building word embeddings. It became a standard approach to word embeddings, and later methods such as GloVe built on its ideas. It can train two types of models:

* Continuous Bag-of-Words (CBOW) - Builds a word embedding by predicting the current word from its surrounding context words.
* Continuous Skip-Gram - Builds a word embedding by predicting the surrounding context words from the current word.

Word2Vec looks at a window of neighboring words around each current word, and the size of this window is configurable. Because the algorithm is computationally efficient, it can train embeddings on very large amounts of text.
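
The difference between the two model types comes down to which side of the window is the input and which is the prediction target. The sketch below only generates the training examples each model would see; it does not train anything, and the function name `training_examples` is an assumption for this illustration rather than part of any Word2Vec library.

```python
def training_examples(tokens, window=2):
    """Yield (context_words, current_word) for every position in tokens."""
    for i, current in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, current

tokens = "the quick brown fox jumps".split()

# CBOW: the context words are the input, the current word is the target.
cbow_examples = list(training_examples(tokens, window=1))

# Skip-gram: the current word is the input, each context word is a target.
skipgram_pairs = [
    (current, ctx)
    for context, current in training_examples(tokens, window=1)
    for ctx in context
]

print(cbow_examples[1])    # (['the', 'brown'], 'quick')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

A larger `window` produces more examples per word and pulls in looser, more topical context, while a smaller one keeps the context tightly local.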

Global Vectors for Word Representation ([GloVe](https://en.wikipedia.org/wiki/GloVe)) was developed in 2014 by researchers at Stanford, building on the ideas behind Word2Vec. It draws on global matrix factorization techniques such as Latent Semantic Analysis (LSA) to make better use of corpus-wide word statistics. Rather than learning from one local context window at a time, it fits vectors to word co-occurrence statistics aggregated across the entire text corpus. Its authors reported better results than Word2Vec on several benchmark tasks, including word analogies.
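
As a rough sketch of what "statistics across the entire text corpus" can look like, the example below accumulates a single co-occurrence table over every sentence before any vectors are fit. Pairs are still found by scanning nearby words, but the totals are summed over the whole corpus. The `1 / distance` weighting and the function name `cooccurrence_counts` are choices made for this illustration, and the step of fitting vectors to these counts is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate distance-weighted co-occurrence counts over the whole corpus."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for distance in range(1, window + 1):
                j = i + distance
                if j >= len(tokens):
                    break
                # Nearer neighbors contribute more; record the pair in both directions.
                weight = 1.0 / distance
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

# A tiny made-up corpus for illustration.
corpus = ["the dog chased the cat", "the cat chased the dog"]
X = cooccurrence_counts(corpus)
print(X[("dog", "chased")])  # 1.5
```

Here "dog" and "chased" are adjacent in the first sentence (weight 1.0) and two words apart in the second (weight 0.5), so the corpus-wide total is 1.5.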