# Embeddings
Embeddings are vector representations of words and are a fundamental concept in natural language processing and AI in general.

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import gensim
import gensim.downloader

# for plotting
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'iframe'

<br><br><br><br>

## Word Embeddings
<hr>
Before we can do anything with words, they need to be converted into a numeric representation that captures their semantic meaning. This numeric representation is called an *embedding* and is a vector with hundreds or even thousands of dimensions, depending on the embedding model used.

A modern embedding model like OpenAI's `text-embedding-3-small` produces 1536-dimensional vectors and can't be run locally. Typically people pay to use it through the OpenAI API.

We will instead be using some small embedding models that can be easily used locally. We will use the gensim library to download the pretrained models and use them. Each model takes a while to download the first time you use it, but it is cached so it will load quickly the next time.

- Google News 300: 300-dimensional model trained on part of the Google News corpus (about 100 billion words), with a vocabulary of 3 million words and phrases.
- GloVe Twitter 100: 100-dimensional model trained on a corpus of 2 billion tweets with a 1.2M vocab.

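The first is a [Word2Vec](https://en.wikipedia.org/wiki/Word2vec#:~:text=Word2vec%20is%20a%20technique%20in,text%20in%20a%20large%20corpus.) model. GloVe uses a different training algorithm, but it produces the same kind of per-word vectors, and gensim loads both as `KeyedVectors`.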

In [2]:
# Show all available models in gensim-data, they will be saved for next time in ~/gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [70]:
import os
import gensim.downloader


def load_model(model_name, folder="models", limit=200000):
    """Download a gensim model on first use, cache it as a binary, and load the first `limit` vectors."""
    binary_path = f"{folder}/{model_name}.bin"

    if not os.path.exists(binary_path):
        # first run: download the model and save it as a binary for faster loading next time
        os.makedirs(folder, exist_ok=True)
        full_model = gensim.downloader.load(model_name)
        full_model.save_word2vec_format(binary_path, binary=True, write_header=True)

    # always load from the binary so that `limit` applies on the first run as well
    w2v_model = gensim.models.KeyedVectors.load_word2vec_format(binary_path, binary=True, limit=limit)
    print(f"loaded from binary: {binary_path}")
    return w2v_model

In [71]:
# Download one of these models (might take a while the first time)
model_name = 'word2vec-google-news-300'
# model_name = 'glove-twitter-100'
w2v_model = load_model(model_name)

loaded from binary: models/word2vec-google-news-300.bin


In [72]:
len(w2v_model.vectors)

200000

### Word Embeddings as Vectors


In [51]:
# Get word vector
word_vector = w2v_model['dog']
print(f'dimension: {word_vector.shape}')
print(f"components: \n{str(word_vector)[0:100]}...]")

dimension: (300,)
components: 
[ 5.12695312e-02 -2.23388672e-02 -1.72851562e-01  1.61132812e-01
 -8.44726562e-02  5.73730469e-02  5...]


The vectors are NOT normalized to unit length, so different embeddings have different lengths.

In [52]:
# use np.linalg.norm to get the vector lengths
word = "dog"
vector = w2v_model[word]
length = np.linalg.norm(vector)
print(f"The vector for '{word}' has length   {length}")

word = "puppy"
vector = w2v_model[word]
length = np.linalg.norm(vector)
print(f"The vector for '{word}' has length   {length}")

The vector for 'dog' has length   2.981123447418213
The vector for 'puppy' has length   3.2765257358551025


So each word is a vector in a 300-dimensional space. Words whose vectors are closer together are more semantically similar. But since the vectors have different lengths, we use the **cosine similarity**, which measures how closely aligned two vectors are regardless of their lengths.
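As a minimal sketch of what cosine similarity measures, the plain-Python function below divides the dot product of two vectors by the product of their lengths, so only the angle between them matters:

$$\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

The name `cosine_similarity` and the toy 3-dimensional vectors are made up for illustration; they are not taken from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


# toy vectors, invented for this example (not real embeddings)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # v scaled by 2: longer, but same direction
perpendicular = [3.0, 0.0, -1.0]   # dot product with v is 0

print(cosine_similarity(v, same_direction))  # 1.0 up to floating-point rounding
print(cosine_similarity(v, perpendicular))   # 0.0
```

Scaling a vector changes its length but not its cosine similarity to other vectors, which is why this measure suits embeddings of different lengths.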



In [53]:
w2v_model.similarity('dog', 'puppy')

0.81064284

In [54]:
w2v_model.similarity('dog', 'potato')

0.1720275

In [77]:
# similarity between one word and many others
word = "dog"
other_words = ["puppy", "cat", "horse", "frog", "cookie", "submarine", "automobile", "ionization"]


print("Similarity:")
for other_word in other_words:
    try:
        similarity = w2v_model.similarity(word, other_word)
        print(f"\t{similarity:.2f} : {word} and {other_word}")
    except KeyError as e:
        print("\t", e)

Similarity:
	0.81 : dog and puppy
	0.76 : dog and cat
	0.48 : dog and horse
	0.36 : dog and frog
	0.28 : dog and cookie
	0.21 : dog and submarine
	0.13 : dog and automobile
	0.09 : dog and ionization


## Visualizing Embeddings
It's impossible to visualize a 300-dimensional vector directly, but we can apply a dimensionality reduction technique called principal component analysis (PCA) to reduce it to 3 dimensions while preserving as much of the variance as possible. We can then plot the 3-dimensional embeddings and see how words with different meanings have vectors that point in different directions. This obviously loses A LOT of information, but it demonstrates the concept of word embeddings as vectors in a space.
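To make the idea of keeping the most variance concrete, here is a rough sketch of what PCA does under the hood, written with plain NumPy instead of scikit-learn: center the data, take a singular value decomposition, and keep the top directions. The helper name `pca_sketch` and the random stand-in data are assumptions for illustration only; scikit-learn's `PCA` may differ in details such as the sign of each component.

```python
import numpy as np


def pca_sketch(data, n_components=3):
    """Project rows of `data` onto their top principal components (illustrative helper)."""
    centered = data - data.mean(axis=0)
    # rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # share of the total variance captured by each kept component
    variance = singular_values ** 2
    explained_ratio = variance[:n_components] / variance.sum()
    return reduced, explained_ratio


# random stand-in for 16 word vectors with 300 dimensions (not real embeddings)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(16, 300))

reduced, explained_ratio = pca_sketch(fake_embeddings)
print(reduced.shape)          # (16, 3)
print(explained_ratio.sum())  # fraction of variance kept; the rest is lost
```

With scikit-learn, the same per-component ratios are available as `explained_variance_ratio_` on a fitted `PCA` object, which is a quick way to check how much information a 3D plot throws away.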

In [56]:
words = [
    "football","soccer", "hockey",
    "blackjack", "chess", "poker", "roulette",
    "river","ocean", "lake",
    "brownie","cookie","cake", 
    "tomato", "grapefruit", "peach"
]
# make an array containing the embeddings for each word
embeddings = np.array([w2v_model[w] for w in words])

# perform Principal Component Analysis to reduce to 3 dimensions
pca = PCA(n_components=3)
embeddings_reduced = pca.fit_transform(embeddings)

In [57]:
# Assemble into a data frame for easy reference and to plot using plotly express
df = pd.DataFrame(
    {
        "word": words,
        "x": embeddings_reduced[:,0],
        "y": embeddings_reduced[:,1],
        "z": embeddings_reduced[:,2],
    }
)
df.head()

| | word | x | y | z |
|---|---|---|---|---|
| 0 | football | 0.381601 | 0.905858 | -1.662725 |
| 1 | soccer | 0.385773 | 1.055156 | -1.924351 |
| 2 | hockey | 0.705126 | 1.117215 | -1.838518 |
| 3 | blackjack | 2.493462 | -1.016062 | 0.982311 |
| 4 | chess | 1.304086 | 0.194028 | -0.589587 |


Let's produce an interactive 3D plot that can be rotated by dragging with the mouse. Similar words have vectors that are closer together.

In [58]:
# plot each embedding as a point
fig = px.scatter_3d(
    df, x = "x", y = "y", z = "z",
    color = "z", 
    text = "word", 
    width = 700, height = 600, 
    opacity = 0.7,
    title = "Reduced Embedding Space",
)

# add lines from origin to point to make it look like a vector
for word, coord in zip(words, embeddings_reduced):
    fig.add_trace(go.Scatter3d(
        x=[0, coord[0]], y=[0, coord[1]], z=[0, coord[2]],
        mode='lines',
        line_width = 1,
        line_color = "SlateGrey",
        showlegend = False
    ))

fig.update_layout(uniformtext_minsize=6, uniformtext_mode='hide', showlegend=False)
fig.update_scenes(xaxis_visible=False, yaxis_visible=False, zaxis_visible=False)

fig.show()

### Plot Embeddings function
Let's wrap these steps in convenient functions so that we can easily repeat this process later.

In [59]:
def embed_words(words, w2v_model):
    return np.array([w2v_model[w] for w in words])

def reduce_dimensions(embeddings, n_components=3):
    pca = PCA(n_components=n_components)
    embeddings_reduced = pca.fit_transform(embeddings)
    return embeddings_reduced

def plot_embeddings(embeddings_reduced, words, title = "Reduced Embedding Space"):
    # Assemble into a data frame
    df = pd.DataFrame({"word": words, "x": embeddings_reduced[:,0], "y": embeddings_reduced[:,1], "z": embeddings_reduced[:,2]})

    # plot each embedding as a point with a label
    fig = px.scatter_3d(
        df, x = "x", y = "y", z = "z",
        color = "z", 
        text = "word", 
        width = 700, height = 600, 
        opacity = 0.7,
        title = title
    )
    
    # add lines from origin to point to make it look like a vector
    for word, coord in zip(words, embeddings_reduced):
        fig.add_trace(go.Scatter3d(
            x=[0, coord[0]], y=[0, coord[1]], z=[0, coord[2]],
            mode='lines',
            line_width = 1,
            line_color = "SlateGrey",
            showlegend = False
        ))
    fig.update_layout(uniformtext_minsize=6, uniformtext_mode='hide')
    fig.update_scenes(xaxis_visible=False, yaxis_visible=False,zaxis_visible=False )

    return fig

In [60]:
# it's fun to play with different words
words = [
    "car", "truck", "pickup", "bicycle", "tricycle", "motorcycle", 
    "scooter", "stroller", "speedboat", "ferry", "sailboat", "freighter"
]
embeddings = embed_words(words, w2v_model)
embeddings_reduced = reduce_dimensions(embeddings)
plot_embeddings(embeddings_reduced, words, title="New plot with new words")

<br><br>

### Doing math with embeddings
Since word embeddings are vectors, we can do vector arithmetic with them. 
For example, if we take the embeddings for the words "king", "queen", "man", and "woman", we can do a kind of semantic math.

The idea is that the concepts of "king" and "queen" include aspects of gender, royalty, and humanness. If we take "queen" (royal and feminine) and add "man" (masculine), we get something that has royalty plus both masculine and feminine elements. If we then subtract "king" (royal and masculine), the royalty and the masculine elements cancel out and we are left with just the feminine "woman":

$$\text{queen (royal, feminine)} + \text{man (masculine)} - \text{king (royal, masculine)} = \text{woman (feminine)}$$


This can be visualized with these 2D embedding vectors:

![](img/king-man-queen-woman.png)
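
The same arithmetic can be tried on hand-made 2D vectors. In the sketch below, the first coordinate loosely stands for "royalty" and the second for "gender"; the numbers, the tiny vocabulary, and the helper `nearest_word` are all invented for illustration and are not taken from any trained model.

```python
import math

# invented 2D "embeddings": (royalty, gender), masculine = +1, feminine = -1
toy_vectors = {
    "king":   (1.0,  1.0),
    "queen":  (1.0, -1.0),
    "man":    (0.0,  1.0),
    "woman":  (0.0, -1.0),
    "prince": (0.8,  1.0),
}


def cosine(a, b):
    """Cosine similarity between two 2D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest_word(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `target`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# queen + man - king, computed coordinate by coordinate
result = tuple(
    q + m - k
    for q, m, k in zip(toy_vectors["queen"], toy_vectors["man"], toy_vectors["king"])
)
print(result)  # (0.0, -1.0)
print(nearest_word(result, toy_vectors, exclude=("queen", "man", "king")))  # woman
```

The three input words are excluded from the search because gensim's `most_similar` does the same; otherwise one of the inputs would often come back as its own nearest neighbor.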


Now this is a very simplified analysis, but it gets at the core idea that the dimensions of the embedding encode different aspects of the meaning.
We don't really know what each dimension encodes, but the meaning is there. Modern embedding models can have thousands of dimensions and thus can encode a lot more information. 

The function `analogy` below performs this arithmetic and returns the nearest n words to the result.

In [61]:
def analogy(word1, word2, word3, model, n=5):
    """
    Solves "word1 is to word2 as word3 is to ____" using the given model.

    Parameters
    --------------
    word1 : (str)  word1 in the analogy relation
    word2 : (str)  word2 in the analogy relation
    word3 : (str)  word3 in the analogy relation
    model : gensim KeyedVectors embedding model
    n : (int) the number of most similar words to return. Default is 5
    
    Returns
    ---------------
        pd.DataFrame of the n nearest words and their similarity scores
    """
    print(f"{word1.upper()} is to  {word2.upper()} is as {word3.upper()} is to : ____")
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1], topn = n)
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])

In [62]:
word_1 = "king"
word_2 = "queen"
word_3 = "man"
analogy(word_1, word_2, word_3, w2v_model)

KING is to  QUEEN is as MAN is to : ____


| | Analogy word | Score |
|---|---|---|
| 0 | woman | 0.760944 |
| 1 | girl | 0.613999 |
| 2 | teenage_girl | 0.604096 |
| 3 | teenager | 0.582576 |
| 4 | lady | 0.575256 |


In [63]:
word_1 = "dog"
word_2 = "puppy"
word_3 = "cat"

analogy(word_1, word_2, word_3, w2v_model)

DOG is to  PUPPY is as CAT is to : ____


| | Analogy word | Score |
|---|---|---|
| 0 | kitten | 0.763499 |
| 1 | puppies | 0.71109 |
| 2 | pup | 0.692949 |
| 3 | kittens | 0.688839 |
| 4 | cats | 0.679649 |


In [64]:
word_1 = "eat"
word_2 = "ate"
word_3 = "walk"

analogy(word_1, word_2, word_3, w2v_model)

EAT is to  ATE is as WALK is to : ____


| | Analogy word | Score |
|---|---|---|
| 0 | walked | 0.732153 |
| 1 | walking | 0.626509 |
| 2 | jogged | 0.580705 |
| 3 | walks | 0.572665 |
| 4 | strolled | 0.558068 |


Wow! So the embeddings capture enough of the meaning of the words to handle gendered words, verb tenses, and the names of baby animals.

### Bias in embeddings
The pre-trained embeddings we are using may reflect the biases present in the texts they were trained on. In this exercise you'll explore whether any worrisome biases are present in the embeddings.

Try comparing the Google News and Twitter models to see which one contains more bias.
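
One simple, admittedly crude way to put a number on such a comparison is to check whether a word sits closer to "man" or to "woman". The helpers `similarity_gap` and `similarity_gaps` below are hypothetical names, not part of gensim; they only assume that the model object offers the `similarity(word1, word2)` method used earlier in this notebook.

```python
def similarity_gap(model, word, a="man", b="woman"):
    """similarity(word, a) - similarity(word, b); positive means `word` is closer to `a`.

    Hypothetical helper: `model` is any object with a gensim-style
    `similarity(word1, word2)` method, such as a loaded KeyedVectors model.
    """
    return float(model.similarity(word, a)) - float(model.similarity(word, b))


def similarity_gaps(model, words, a="man", b="woman"):
    """Return (word, gap) pairs sorted from most `a`-leaning to most `b`-leaning."""
    gaps = [(w, similarity_gap(model, w, a, b)) for w in words]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)
```

Calling `similarity_gaps(w2v_model, ["boss", "nurse", "engineer", "teacher"])` on each model gives one rough number per word to compare. The word list is only a suggestion, and a gap near zero does not prove the absence of bias.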

In [65]:
# model_name = 'word2vec-google-news-300'
model_name = 'glove-twitter-100'
w2v_model = gensim.downloader.load(model_name)

In [66]:
word_1 = "man"
word_2 = "woman"
word_3 = "boss"

analogy(word_1, word_2, word_3, w2v_model)

MAN is to  WOMAN is as BOSS is to : ____


| | Analogy word | Score |
|---|---|---|
| 0 | wife | 0.656686 |
| 1 | mother | 0.62274 |
| 2 | husband | 0.595436 |
| 3 | daughter | 0.594694 |
| 4 | bosses | 0.592432 |


In [30]:
len(w2v_model.vectors)


1193514


In [33]:
w2v_model["i"]


array([-3.9621e-04,  4.5670e-01,  3.3890e-01,  2.9695e-01, -3.6924e-01,
       -2.6325e-01, -2.7247e-01, -5.5130e-01,  4.5820e-01,  6.3605e-01,
       -8.0225e-03, -4.3155e-01, -5.5607e+00,  2.9010e-01, -1.8375e-01,
       -1.1136e-01, -9.4750e-02,  3.8869e-03, -6.6665e-01,  2.7977e-01,
       -2.6449e-02, -7.9124e-02, -3.0858e-02, -1.9652e-01, -2.4584e-01,
       -1.1291e+00, -1.6832e-02, -3.2932e-01,  1.2434e-01, -2.7388e-01,
       -5.1654e-01,  7.9321e-02, -1.5876e-02,  2.0981e-01,  1.9013e-01,
        2.8153e-01, -1.6484e-02,  1.1702e-01,  5.8550e-01,  5.6655e-01,
       -1.6504e+00,  3.2778e-02, -2.9156e-01, -4.9912e-02, -2.5162e-01,
        1.3975e-01,  8.0455e-01, -5.0464e-01, -4.7144e-01, -4.3065e-01,
       -4.8675e-01,  3.1117e-01, -2.0250e-01,  1.7717e-02,  1.1674e-01,
        3.2407e-01, -8.8009e-03, -3.3196e-01,  6.3339e-01,  4.5964e-01,
        8.8130e-02,  5.1968e-01, -4.3081e-01, -1.1251e-01, -1.0954e-01,
       -2.9048e-01, -3.4017e-01,  6.6440e-01, -2.6080e-01,  2.66
