<a href="https://colab.research.google.com/github/sarahkaarina/lazy-language/blob/main/vectorizing/creating_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating word embeddings

**This basic notebook walks you through how to create word embeddings.**

*What is a word embedding?*

A word embedding is the vector representation of a word: a list of numbers (768 per word in this notebook) that a model uses to stand in for that word.
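To make the idea concrete, here is a minimal sketch with three hand-written 3-dimensional vectors. The values are made up for illustration and do not come from any model; `toy_embeddings` and `cosine_similarity` are hypothetical names used only here.

```python
import numpy as np

# Toy embeddings with made-up values (real models use hundreds of dimensions)
toy_embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "carrot": np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_carrot = cosine_similarity(toy_embeddings["cat"], toy_embeddings["carrot"])

# In this toy setup the two animals are closer to each other than to the vegetable
print(cat_dog > cat_carrot)
```

Once words are vectors, "how similar are these two words?" becomes a question about the angle between two arrays.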

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# libraries for calculating cosine similarity
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

# using GPT-2
from transformers import GPT2Tokenizer, GPT2Model, GPT2LMHeadModel

**Step 1**

Create a semi-random list of words to vectorize.

In [None]:
# Create random list of words

animals = ['cat', 'dog', 'sheep', 'cow', 'moose', 'leopard', 'fish']
vegetables = ['onion', 'cucumber', 'zucchini', 'tomato', 'cabbage', 'carrot', 'broccoli']
clothing_items = ['dress', 'shirt', 'pants', 'jeans', 'jumper', 'jacket', 'tie', 'skirt']

random_words = list(animals + vegetables + clothing_items)

# check the length so we can later confirm that every word received an embedding
len(random_words)

22

**Step 2**

Vectorize/embed the words by looking each one up in GPT-2's token embedding matrix (`wte`).

In [None]:
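import torch

model_name = 'gpt2'

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2's input embedding matrix: one 768-dimensional row per vocabulary token
wte = model.transformer.wte.weight

# Encode each word on its own, with a leading space so GPT-2 tokenizes it
# as it would mid-sentence. Words that split into several sub-tokens are
# averaged into a single vector.
embeddings = torch.stack(
    [wte[tokenizer.encode(' ' + word)].mean(dim=0) for word in random_words]
)

# Let's check what they look like!
embeddings

# Let's also check we have all 22 words in here
print(embeddings.size())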


torch.Size([22, 768])
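The lookup above is, at its core, row indexing into an embedding matrix. The sketch below shows the same idea with NumPy only. The vocabulary, the sub-token split of "zucchini", and the matrix values are all hypothetical and chosen for illustration; they are not GPT-2's actual tokens or weights.

```python
import numpy as np

# Hypothetical vocabulary and embedding matrix (random values, illustration only)
toy_vocab = {"cat": 0, "dog": 1, "zuc": 2, "chini": 3}
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(len(toy_vocab), 4))  # 4 tokens x 4 dimensions

# A single-token word: its embedding is just one row of the matrix
cat_vector = toy_matrix[toy_vocab["cat"]]

# A word split into several sub-tokens: average the rows to get one vector
piece_ids = [toy_vocab["zuc"], toy_vocab["chini"]]
zucchini_vector = toy_matrix[piece_ids].mean(axis=0)

print(cat_vector.shape, zucchini_vector.shape)
```

Averaging is only one possible way to pool sub-token vectors; taking the first or last sub-token is another common choice.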

**Step 3**

Compare the words by computing the cosine similarity between every pair of embeddings.

In [1]:
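# Convert the embeddings to a NumPy array so we can calculate cosine similarity

np_embed = embeddings.squeeze().detach().numpy()

# pairwise_distances returns cosine *distance*; 1 - distance gives similarity
# (rounded to 2 decimal places)
cosine_distribution = np.around(1-pairwise_distances(np_embed, metric="cosine"), 2)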

cosine_distribution

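For readers who want to see what the pairwise cosine computation does under the hood, here is a minimal NumPy-only sketch. The helper name `cosine_similarity_matrix` and the toy input are assumptions made for illustration; the notebook itself relies on scikit-learn's `pairwise_distances`.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length, then take all pairwise dot products
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T

# Toy input: three 2-dimensional vectors
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = np.around(cosine_similarity_matrix(toy), 2)
print(sims)
```

The diagonal is always 1 (each vector compared with itself) and the matrix is symmetric, so for 22 words only the entries above the diagonal carry new information.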