# Word Embeddings

We have learned a few ways to extract features from text data and represent them as vectors. For large language models and more ambitious natural language processing projects, word embeddings are a critical technique, and they will serve as a capstone to this course.

**Word embeddings** are a technique for representing words as vectors so that similar or related words sit close together in a vector space. They have enabled many breakthroughs in natural language processing, including the large language models we know today. Their technical advantage is density: related words are clumped together, so far fewer dimensions are needed than with one sparse feature per word, and the model has more data available for any given context. This greatly reduces the dimensionality and the number of features.

Each word is mapped to a vector, and that vector occupies a point in space. Words that tend to be used in similar contexts will have vectors that are closer together. For example, "dog" and "dalmatian" are likely to be close together. However, "dog" and "cat" are also going to be close together because those two words are often used together in the same sentence or context as well. The word "pet" should be close to those vectors too.

"Candy" and "chocolate" will be close together, but will not be anywhere near the "dog", "pet", or "cat"-related words.

![svg image](media/89maFwKN.svg)
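"Close together" is commonly measured with cosine similarity, which compares the direction of two vectors. The sketch below uses hand-made three-dimensional vectors that are invented purely for illustration (real embeddings are learned from text and are much larger), and the helper name `cosine_similarity` is my own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors, NOT learned from any data
toy_vectors = {
    "dog":       [0.9, 0.8, 0.1],
    "cat":       [0.8, 0.9, 0.1],
    "pet":       [0.7, 0.7, 0.2],
    "candy":     [0.1, 0.1, 0.9],
    "chocolate": [0.2, 0.1, 0.8],
}

# Compare "dog" against every other toy word
for other in ["cat", "pet", "candy", "chocolate"]:
    score = cosine_similarity(toy_vectors["dog"], toy_vectors[other])
    print(f"dog vs {other}: {score:.3f}")
```

With these toy values, "dog" scores much higher against "cat" and "pet" than against "candy" or "chocolate", which mirrors the clusters in the picture above.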

So how are these word embeddings built? Generally speaking, a neural network or other algorithm looks at the words surrounding each word in the text. The result is a model that, for a given input word, outputs probabilities that other words are associated with it, or perhaps are the next word in a sentence. Think of a given input word producing a probability distribution over the words that could surround it. Given a large enough text dataset, enough context can be captured to usefully predict relevant words.
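To make the "probability distribution" idea concrete, here is a minimal sketch that scores each candidate word by a dot product with the input word's vector and normalizes the scores with a softmax. All vectors and names (`input_vector`, `output_vectors`, `context_distribution`) are invented for illustration. Real Word2Vec training typically uses cheaper approximations such as negative sampling rather than a full softmax over the vocabulary.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Invented vectors; a trained model would learn these values
input_vector = {"wine": [1.0, 0.5]}
output_vectors = {
    "shop":   [0.9, 0.6],
    "bottle": [0.8, 0.4],
    "horse":  [-0.5, 0.2],
}

def context_distribution(word):
    """Probability of each candidate word appearing near `word`."""
    vec = input_vector[word]
    candidates = list(output_vectors)
    scores = [sum(a * b for a, b in zip(vec, output_vectors[c])) for c in candidates]
    return dict(zip(candidates, softmax(scores)))

print(context_distribution("wine"))
```

Training amounts to nudging the vectors so that words actually observed near "wine" receive higher probability.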

What gets really interesting is how contexts can be navigated. If you have a sufficiently good word embedding model, you can take the vector for "king," subtract the vector for "man," add the vector for "woman," and land close to the vector for "queen." Similarly, if I take the vector for "shirt," subtract the vector for "man," then add the vector for "woman," I might get the word "blouse." Or if I take the vector for "Berlin," subtract "Germany," then add "England," I *should* get "London."
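The arithmetic can be sketched with toy data. The two-dimensional vectors below are invented so that one axis loosely means "royalty" and the other "gender"; the `analogy` helper is a hypothetical illustration of the idea, not how any particular library implements it.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Invented toy vectors: [royalty, gender]
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.1],
}

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the input words so the answer is a new word
    candidates = {w: v for w, v in vectors.items() if w not in positive + negative}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

# king - man + woman
print(analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors))  # prints "queen"
```

In a learned embedding the result is rarely this clean, which is why the text says you land *close* to "queen."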

## Word2Vec and GloVe

Back in 2013, a team at Google led by Tomas Mikolov developed a famous method called [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) to build word embeddings. It became a standard approach for word embeddings, and related methods like GloVe followed it. It can build two types of models for word embeddings:

* CBOW - Continuous bag of words, builds a word embedding by predicting the current word based on its context.
* Continuous Skip-Gram Model - Builds a word embedding by predicting the surrounding words given a current word.

Word2Vec takes each current word and looks at a window of neighboring words, and the size of this window is configurable. Because the algorithm is efficient, larger embeddings can be built from larger amounts of training data. Many of the previous techniques we learned produce sparse vectors with lots of zeros, which is not ideal: we want dense vectors that carry more information with fewer zeros.
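The two model types differ mainly in how training examples are formed from the window. The following is a minimal sketch of that idea only, with hypothetical helper names (`skip_gram_pairs`, `cbow_examples`); it does not train anything.

```python
def skip_gram_pairs(tokens, window=2):
    """Yield (current_word, neighbor) pairs: predict each neighbor from the current word."""
    for i, current in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield current, tokens[j]

def cbow_examples(tokens, window=2):
    """Yield (context_words, current_word): predict the current word from its neighbors."""
    for i, current in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        yield context, current

tokens = ["it", "was", "the", "best", "of", "times"]
print(list(skip_gram_pairs(tokens, window=1)))
print(list(cbow_examples(tokens, window=1))[:3])
```

A larger `window` produces more pairs per word and pulls in broader context, at the cost of more training work.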

Global Vectors for Word Representation ([GloVe](https://en.wikipedia.org/wiki/GloVe)) was developed in 2014 by researchers at Stanford as a follow-up to Word2Vec. It combines the local context window approach with ideas from global matrix factorization methods such as Latent Semantic Analysis (LSA), so that global word statistics are better incorporated. Instead of learning from one local window at a time, it trains on word co-occurrence statistics gathered across the entire text corpus. Its authors reported improved results over Word2Vec on several benchmark tasks.
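As a rough illustration of "statistics across the entire text corpus," the sketch below builds word co-occurrence counts over a tiny invented corpus. This covers only the counting step; the GloVe training procedure that fits vectors to such counts is not shown, and the function name `cooccurrence_counts` is my own.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` tokens, corpus-wide."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Invented toy corpus
corpus = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "chased", "the", "mouse"],
]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("the", "cat")], counts[("the", "mouse")], counts[("dog", "mouse")])
```

Pairs that co-occur often across the whole corpus get large counts, while pairs that never appear together stay at zero.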

## Word Embedding with Gensim 

To build a useful word embedding, you will need a large amount of text data. Think millions or billions of words. You can create a word embedding that is reusable across multiple models, or tie one to a specific application model so that the two are trained jointly. We will take the former approach, and as you might guess, pre-built word embeddings are available for free from researchers under permissive licenses. In practice, using one of these is often preferable to building your own from scratch. You can also update the embedding during the training of your own model so it is better tailored to your purposes.

During the "learning" phase of word embedding, each word vector is moved around the vector space to be near other words relevant to it. We are going to leverage [Gensim](https://radimrehurek.com/gensim/) to manage this task for us. You can install it using `pip` or `conda` as shown below:

```
pip install gensim

conda install -c conda-forge gensim
```

### Building a Word2Vec from Scratch

To use the Word2Vec model, you need a lot of data, such as a large collection of news articles or the entirety of Wikipedia. But we can still learn a few things from a smaller dataset, such as the full text of Charles Dickens' novel *A Tale of Two Cities*. Let's download and load the plain-text version of the book from [Project Gutenberg](https://www.gutenberg.org/ebooks/98). Let's also remove the license boilerplate and make the words lowercase.

In [None]:
import re 

import urllib.request

urllib.request.urlretrieve(
    r"https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/llm/tale_of_two_cities.txt", 
    "tale_of_two_cities.txt"
)


filename = 'tale_of_two_cities.txt'

# read the whole book; the context manager closes the file for us
with open(filename, encoding="utf-8") as file:
    text = file.read()

text = re.sub(r"^(.|\n)+START OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES \*{3}", '', text)
text = re.sub(r"\*{3} END OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES (.|\n)+", '', text)
text = text.strip().lower()
text

We will need to break up the book into sentences, and then into the tokens within those sentences, to create a nested list (a list of sentences, each of which is a list of tokens). We will also lemmatize the tokens and filter out punctuation and stop words to keep things simple, since our dataset is already small.

In [None]:
import spacy 
nlp = spacy.load("en_core_web_sm")

tale_of_two_cities = nlp(text) 

sentences = [[token.lemma_ for token in sent if token.is_alpha and not token.is_stop] 
                           for sent in tale_of_two_cities.sents]

# display each sentence and its tokens 
for sent in sentences:
    print(f"{sent}\n")

So let's get this book loaded into `Word2Vec`. There are a number of parameters we can set.

* vector_size (default 100) - This sets the size of the word vector, or the number of dimensions of the word embedding.
* window (default 5) - The maximum distance between the target word and the words around it.
* min_count (default 5) - The minimum number of instances a word must occur to be captured for the model.
* workers (default 3) - The number of CPU worker threads to build the word embeddings.
* sg (default 0) - Selects the CBOW (0) or skip gram (1) algorithm for training.
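On a small corpus, `min_count` has a large effect because many words appear only a handful of times. The sketch below is an assumption-labeled approximation of that filtering step using plain word counts; `vocabulary_after_min_count` is a hypothetical helper, not part of Gensim.

```python
from collections import Counter

def vocabulary_after_min_count(sentences, min_count=5):
    """Return the words that occur at least `min_count` times across all sentences."""
    counts = Counter(token for sent in sentences for token in sent)
    return {word for word, n in counts.items() if n >= min_count}

# Invented toy data in the same nested-list shape that Word2Vec expects
toy_sentences = [
    ["wine", "shop", "door"],
    ["wine", "street"],
    ["wine", "shop"],
]

print(sorted(vocabulary_after_min_count(toy_sentences, min_count=2)))  # prints ['shop', 'wine']
```

Running the same check on your own tokenized sentences is a quick way to see how many words would survive a given `min_count` before training.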

A single Charles Dickens novel is not a lot of text data for word embeddings, but after some experimentation I found that the following model produces meaningful results. I then look up the words that are most similar to "wine."

In [None]:
from gensim.models import Word2Vec

# train model
model = Word2Vec(sentences, vector_size=50, min_count=5, window=5, sg=1)

# summarize the loaded model
print(model)

# look up words related to wine
model.wv.most_similar(positive=['wine'], topn=10)

Now you may look at these words and scratch your head. Words like "shop" and "defarge" are related to the word "wine"? Well, if you read the book (particularly Chapter 5), it makes sense for many of these words to be associated with "wine." The chapter is literally titled "The Wine-shop," which is its setting. You can read the [CliffsNotes](https://www.cliffsnotes.com/literature/t/a-tale-of-two-cities/summary-and-analysis/book-1-chapter-5) for a quick summary, but essentially a wine cask breaks in the **street** in **Saint Antoine** and crowds of people descend to scoop up the free wine. **Monsieur** and **Madame Defarge** are the **keepers** of the wine **shop** and talk to three men all named **Jacques**. They then unlock a **door** to a back room for their other guests.

So that largely explains why the word embedding associated **wine** with all these words. Unfortunately, the book does not provide any further context about wine that would associate it with words like red and white, or Cabernet and Chardonnay. We need a much larger dataset to do that. Thankfully, we can get a prebuilt word embedding that was trained on part of the Google News dataset, which has about 100 billion words.

Let's see how "wine" performs there. This may take a moment to download. 

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
wv

Now let's see what happens when we look up "wine." Sure enough, we get some words we strongly associate with wine. 

In [None]:
print(wv.most_similar(positive=['wine'], topn=25))

To make it extra interesting, let's navigate the vector space to find what `wine + hops - grapes` yields. Sure enough, we get `beer` as our top result.

In [None]:
print(wv.most_similar(positive=['wine', 'hops'], negative=['grapes'], topn=10))

There are some other interesting words in the results. "Stumptown Coffee?" That's a brand of cold brew coffee that is marketed to look like a beer, but it is not beer. Still, that loosely explains why it ended up near "beer" in the word embedding.

And of course, there's the classic `king + woman - man` example. 

In [None]:
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=10))

## Conclusion

Word embeddings are a powerful tool that is foundational for large language models and for many natural language processing problems in general. Understanding how they are built behind the scenes is a bit of a rabbit hole, but knowing the concepts and what they achieve is enough to make you productive. That being said, experiment and play with word embeddings, and [definitely spend some time in the Gensim documentation](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py).

## Exercise 

Calculate a vector that finds Sony's equivalent competing product for Microsoft Xbox, using the word embedding model built from Google News. Replace the question marks "?" below to achieve this objective. 

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

# put your code here 
wv.most_similar(positive=[?], negative=[?], topn=10)

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

# answer: Xbox - Microsoft + Sony
wv.most_similar(positive=['Xbox', 'Sony'], negative=['Microsoft'], topn=10)