<a href="https://colab.research.google.com/github/sunmyeonglee/2025-1-NLP/blob/main/1_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word2Vec Implementation from Scratch

This notebook demonstrates how to implement the Word2Vec algorithm from scratch using PyTorch. We'll use the first Harry Potter book as our corpus to train word embeddings.


## 1. Setting Up the Environment

First, we need to import the necessary libraries:
- `torch` and `torch.nn` for tensor operations and neural network functionality
- `string` for string manipulations (removing punctuation)


In [2]:
import torch
import torch.nn as nn
import string


## 2. Getting the Text Data

We'll download the first Harry Potter book to use as our corpus.

In [3]:
!wget "https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt"


--2025-03-20 04:46:54--  https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 439742 (429K) [text/plain]
Saving to: ‘J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt’


2025-03-20 04:46:54 (14.5 MB/s) - ‘J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt’ saved [439742/439742]



## 3. Text Preprocessing

Before we can use the text data, we need to preprocess it:
- Remove punctuation
- Convert text to lowercase
- Split text into tokens (words)

These two functions will help us clean and tokenize the text.

In [4]:
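def remove_punctuation(x):
  # Delete every character listed in string.punctuation
  return x.translate(str.maketrans('', '', string.punctuation))

def make_tokenized_corpus(corpus):
  # Strip punctuation, split on spaces, lowercase, and drop empty tokens
  out = [[y.lower() for y in remove_punctuation(sentence).split(' ') if y] for sentence in corpus]
  # Drop sentences that have no tokens left
  return [x for x in out if x != []]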
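As a quick check of the two functions above, the sketch below applies them to a few made-up strings (the sample strings are illustrative and not taken from the corpus). It shows that punctuation is removed, words are lowercased, and sentences with no tokens are dropped.

```python
import string

def remove_punctuation(x):
    # Delete every character listed in string.punctuation
    return x.translate(str.maketrans('', '', string.punctuation))

def make_tokenized_corpus(corpus):
    # Strip punctuation, split on single spaces, lowercase, drop empty tokens
    out = [[y.lower() for y in remove_punctuation(sentence).split(' ') if y]
           for sentence in corpus]
    # Drop sentences that have no tokens left
    return [x for x in out if x != []]

# Made-up input: the middle string contains only spaces
sample = ["They didn't hold with such nonsense", "  ", "Mr Dursley, of number four"]
tokens = make_tokenized_corpus(sample)
print(tokens)
```

Note that `split(' ')` splits only on the space character, so other whitespace (tabs, for example) stays attached to the neighboring word.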


## 4. Loading and Formatting the Text

Now we'll load the text file, replace some special characters, and split the text into sentences.


In [5]:
with open("J. K. Rowling - Harry Potter 1 - Sorcerer's Stone.txt", 'r') as f:
  strings = f.readlines()
list_of_sentences = "".join(strings).replace('\n', ' ').replace('Mr.', 'mr').replace('Mrs.', 'mrs').split('. ')

Let's preview the first few sentences, then tokenize the text using our preprocessing function `make_tokenized_corpus`:

In [6]:
# Preview the first 10 raw sentences (before tokenization)

for sentence in list_of_sentences[:10]:
  print(sentence)

Harry Potter and the Sorcerer's Stone   CHAPTER ONE  THE BOY WHO LIVED  mr and mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much
They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense
 mr Dursley was the director of a firm called Grunnings, which made drills
He was a big, beefy man with hardly any neck, although he did have a very large mustache
mrs Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors
The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere
 The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it
They didn't think they could bear it if anyone found out about the Potters
mrs Potter was

In [7]:
# corpus is a list of lists of strings (words)
corpus = make_tokenized_corpus(list_of_sentences)

type(corpus), type(corpus[0]), type(corpus[0][0])

(list, list, str)

## 5. Creating Context Word Pairs

A key concept in Word2Vec is learning from context. We need to create pairs of words that appear near each other in the text. We'll use a sliding window approach to create these pairs.

For example, with a window size of 2, for the word "to" in the sentence "they were the last people youd expect to be involved...", we would create pairs with the two words on each side:
- ("to", "youd")
- ("to", "expect")
- ("to", "be")
- ("to", "involved")

These pairs will be our training data.
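To make the sliding window concrete, here is a minimal sketch that builds the pairs for a single sentence and then picks out the pairs whose center word is "to". The helper name `make_pairs` is hypothetical and introduced only for this illustration.

```python
def make_pairs(sentence, window_size=2):
    # Pair every word with up to `window_size` neighbors on each side
    pairs = []
    for cur_idx, center_word in enumerate(sentence):
        window_begin = max(cur_idx - window_size, 0)
        window_end = min(cur_idx + window_size + 1, len(sentence))
        for j in range(window_begin, window_end):
            if j == cur_idx:
                continue  # skip the center word itself
            pairs.append((center_word, sentence[j]))
    return pairs

sentence = ['they', 'were', 'the', 'last', 'people', 'youd',
            'expect', 'to', 'be', 'involved', 'in', 'anything']
pairs = make_pairs(sentence)
to_pairs = [p for p in pairs if p[0] == 'to']
print(to_pairs)
# [('to', 'youd'), ('to', 'expect'), ('to', 'be'), ('to', 'involved')]
```

Words near the start or end of a sentence get fewer pairs because the window is clipped at the sentence boundaries.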

In [8]:
from tqdm import tqdm

# One tokenized sentence, shown here only to illustrate the input format
sample_sentence = ['they', 'were', 'the', 'last', 'people', 'youd', 'expect', 'to', 'be', 'involved', 'in', 'anything', 'strange', 'or', 'mysterious', 'because', 'they', 'just', 'didnt', 'hold', 'with', 'such', 'nonsense']

word_pairs = []
window_size = 2

for sentence in tqdm(corpus):
  for cur_idx, center_word in enumerate(sentence):
    # The window covers up to window_size words on each side, clipped at the sentence boundaries
    window_begin = max(cur_idx - window_size, 0)
    window_end = min(cur_idx + window_size + 1, len(sentence))
    for j in range(window_begin, window_end):
      if cur_idx == j: continue  # skip the center word itself
      word_pairs.append((center_word, sentence[j]))

print(f"\nLength of word_pairs is {len(word_pairs)}")
print(f"First 5 example of word_pairs is {word_pairs[:5]}")

100%|██████████| 4682/4682 [00:00<00:00, 7982.27it/s]


Length of word_pairs is 282372
First 5 example of word_pairs is [('harry', 'potter'), ('harry', 'and'), ('potter', 'harry'), ('potter', 'and'), ('potter', 'the')]





## 6. Building the Vocabulary

To work with word vectors, we need to create a vocabulary that maps each unique word to an index. We'll also filter out rare words, keeping only those that appear more than a certain number of times in the corpus.

### 6.1 Collecting All Words

First, let's collect all words in our corpus:


In [9]:
# we have to make vocabulary
sentence = corpus[0]
entire_words = []

for sentence in corpus:
  for word in sentence:
    entire_words.append(word)

len(entire_words)

77597


### 6.2 Finding Unique Words

Now, let's find the unique words in our corpus:

In [10]:
# we have to get the "unique" item among total words

vocab_set = set(entire_words)
len(vocab_set)

unique_words = list(vocab_set)
len(unique_words)

6038

### 6.3 Converting to a List and Sorting

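We'll convert the set of unique words to a sorted list. The first entry in the output, `'\the'`, is a tab character followed by `he`: `split(' ')` splits only on spaces, so other whitespace such as tabs stays attached to the tokens.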

In [11]:
# vocab_set[0] raises a TypeError: a set has no order, so it is not subscriptable

unique_words = sorted(unique_words)
unique_words[0]

'\the'

### 6.4 Filtering by Frequency

Now, let's filter out rare words, keeping only those that occur more than `threshold` times:
- We can use the `Counter` class from the `collections` module to count the frequency of each word in the corpus.
- Caution: `alist.sort()` sorts the list in place and returns `None`, so don't assign its result.
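Here is a small sketch of both points on a made-up word list: counting with `Counter`, keeping words above a threshold, and what `sort()` returns. The word list and the threshold of 1 are illustrative assumptions.

```python
from collections import Counter

words = ['harry', 'ron', 'harry', 'owl', 'ron', 'harry']  # made-up tokens
counter = Counter(words)  # maps each word to its count

threshold = 1  # keep words that occur more than `threshold` times
kept = [word for word, count in counter.items() if count > threshold]

result = kept.sort()  # sorts `kept` in place and returns None
print(kept, result)
# ['harry', 'ron'] None
```

If you need a new sorted list instead of an in-place sort, `sorted(kept)` returns one.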

In [12]:
# How can we filter the vocabulary by word frequency?
filtered_vocab = None
# A Counter behaves like a dictionary that maps each word to its count.
# For a Python dictionary, dict.keys() gives the keys, dict.values() gives the values,
# and dict.items() gives (key, value) pairs.

from collections import Counter
word_counter = Counter(entire_words)
word_counter.most_common(10)
word_counter['harry']

threshold = 5
filtered_vocab = []
for key, value in word_counter.items():
  if value > threshold:
    filtered_vocab.append(key)

filtered_vocab.sort()
filtered_vocab[:10]

['a',
 'able',
 'abou',
 'about',
 'above',
 'across',
 'added',
 'afford',
 'afraid',
 'after']

## 7. Filtering Word Pairs

Now that we have our filtered vocabulary, we need to filter our word pairs to only include words that are in our vocabulary:

In [13]:
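# Keep only the pairs whose two words are both in the filtered vocabulary.
# word_pairs is a list of (center_word, context_word) tuples.

filtered_word_pairs = []
vocab_set = set(filtered_vocab)  # set membership is O(1); list membership is O(n)

for pair in tqdm(word_pairs):
  a, b = pair
  # Checking against the list works but is slow (see the timing below);
  # the next cell uses vocab_set instead.
  if a in filtered_vocab and b in filtered_vocab:
    filtered_word_pairs.append(pair)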

100%|██████████| 282372/282372 [00:11<00:00, 23648.45it/s]


In [14]:
# the same filter as a list comprehension, using the set for fast lookups

filtered_word_pairs = [pair for pair in word_pairs if pair[0] in vocab_set and pair[1] in vocab_set]

In [15]:
len(filtered_word_pairs), len(word_pairs)

(226846, 282372)

## 8. Converting Words to Indices

For efficiency, we'll convert our words to indices according to their position in our vocabulary:

In [16]:
# convert word into index of vocab
# filtered_vocab.index('happily')
filtered_vocab.index('harry')

527

This is inefficient because `list.index()` has to scan the list every time. Let's use a dictionary for faster lookups:

In [17]:
# we can make it faster
# use dictionary to find the index of string
word2idx = dict()
for idx, word in enumerate(filtered_vocab):
  word2idx[word] = idx

word2idx['harry']

527

Now, let's convert our word pairs to index pairs more efficiently:

In [18]:
index_pairs = [(word2idx[pair[0]], word2idx[pair[1]]) for pair in filtered_word_pairs]
index_pairs[0]

(527, 953)

In [19]:
# Why don't we need idx2tok? The vocabulary list already maps each index back to its word.

filtered_vocab[527]

'harry'

## 9. Creating Initial Word Vectors

Now we'll create random vectors for each word in our vocabulary. These vectors will be adjusted during training:
- We can use `torch.randn` to create random vectors drawn from a standard normal distribution, then divide by 10 to keep the initial values small.

In [20]:
# we have to make random vectors for each word in the vocab
# we also have to decide the dimension of the vector

dim = 100
vocab_size = len(filtered_vocab)

word_vectors = torch.randn(vocab_size, dim) / 10
word_vectors

tensor([[-0.0124,  0.0829, -0.0483,  ...,  0.0534, -0.1356, -0.1032],
        [-0.0813, -0.0419,  0.1096,  ...,  0.0738, -0.0739, -0.0765],
        [-0.0604, -0.1448, -0.0114,  ..., -0.0085, -0.0507, -0.0507],
        ...,
        [ 0.0722,  0.0876, -0.0965,  ...,  0.0842,  0.0322, -0.0457],
        [-0.0476,  0.0510, -0.0890,  ..., -0.2315,  0.1206,  0.0551],
        [ 0.0007,  0.0694, -0.0221,  ..., -0.1533, -0.0268,  0.0693]])

In [21]:
# what is the vector for harry?
word_vectors[word2idx['harry']]

tensor([ 0.1135,  0.0922, -0.0107,  0.2302,  0.0872, -0.1071, -0.1155, -0.0481,
         0.0341,  0.0131, -0.1134, -0.0725, -0.1494, -0.0842, -0.1923, -0.0043,
        -0.0182,  0.0025, -0.0130,  0.0004, -0.1578, -0.0950,  0.0114, -0.0695,
         0.0222, -0.0447,  0.0081, -0.0524, -0.0580,  0.0531, -0.1617,  0.0366,
         0.0006,  0.1494, -0.1193, -0.0283,  0.1395, -0.0770,  0.1175, -0.1774,
         0.0281, -0.2518, -0.1144,  0.0343, -0.0375, -0.1438, -0.0705,  0.0879,
         0.0670, -0.0391, -0.0186, -0.0102, -0.0120,  0.0077, -0.1177,  0.1268,
         0.1821,  0.0365,  0.1456, -0.0744, -0.0426,  0.0289,  0.0071,  0.1051,
        -0.0773,  0.0410, -0.1463, -0.0144,  0.2803, -0.0849, -0.2018, -0.1808,
        -0.0660, -0.0142, -0.0138, -0.1480, -0.0075,  0.1440,  0.2027, -0.0161,
        -0.1862,  0.0309,  0.0975,  0.0306, -0.0144,  0.1110, -0.1405, -0.1502,
        -0.0082, -0.0117,  0.0157,  0.0185,  0.0133, -0.1423,  0.0589, -0.2247,
        -0.0724, -0.0357,  0.0350,  0.14

## 10. Understanding Word Relationships with Dot Products

The core of Word2Vec is using dot products to measure relationships between words. Let's explore this concept:

In [22]:
torch.set_printoptions(sci_mode=False) # Do this to avoid scientific notation

### Dot Product
- Assume we have two vectors $a$ and $b$.
  - $a = [a_1, a_2, a_3, a_4, \dots, a_n]$
  - $b = [b_1, b_2, b_3, b_4, \dots, b_n]$
- $a \cdot b = \sum_{i=1}^n a_i b_i = a_1b_1 + a_2b_2 + a_3b_3 + a_4b_4 + \dots + a_nb_n$
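The formula can be checked on two small made-up vectors. The sketch below computes the dot product three ways: term by term in plain Python, as an element-wise product followed by a sum, and with `torch.dot`.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Term by term: 1*4 + 2*5 + 3*6
manual = sum(a_i * b_i for a_i, b_i in zip(a.tolist(), b.tolist()))
# Element-wise product, then sum
elementwise = (a * b).sum()
# Built-in dot product for 1-D tensors
builtin = torch.dot(a, b)

print(manual, elementwise.item(), builtin.item())
# 32.0 32.0 32.0
```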

Let's calculate the dot product between "harry" and "potter":


In [23]:
# calculate the dot product between harry and potter
# (the unnormalized score behind P(potter|harry))
harry = word_vectors[word2idx['harry']]
potter = word_vectors[word2idx['potter']]
dot_product_value_between_potter_harry = torch.dot(harry, potter)
dot_product_value_between_potter_harry

tensor(-0.0597)

In [24]:
# we can get the dot product with every word in the vocab
# these are the scores we need for P(word | harry)
word_dot_dict = {}
for word in filtered_vocab:
  w_idx = word2idx[word]
  w_vector = word_vectors[w_idx]
  word_dot_dict[word] = torch.dot(harry, w_vector)
word_dot_dict

{'a': tensor(0.0831),
 'able': tensor(0.0892),
 'abou': tensor(0.1700),
 'about': tensor(0.0123),
 'above': tensor(-0.0422),
 'across': tensor(-0.1013),
 'added': tensor(-0.0162),
 'afford': tensor(-0.2145),
 'afraid': tensor(0.0356),
 'after': tensor(-0.0948),
 'afternoon': tensor(0.0923),
 'again': tensor(0.0270),
 'against': tensor(-0.1651),
 'ages': tensor(-0.0419),
 'ago': tensor(0.0498),
 'agreed': tensor(0.0805),
 'ah': tensor(0.0125),
 'ahead': tensor(0.0302),
 'air': tensor(-0.1065),
 'albus': tensor(-0.0609),
 'alive': tensor(-0.0157),
 'all': tensor(-0.1298),
 'alley': tensor(0.0777),
 'allowed': tensor(0.0134),
 'almost': tensor(-0.0927),
 'alone': tensor(0.0419),
 'along': tensor(0.0134),
 'already': tensor(0.1327),
 'also': tensor(0.0066),
 'although': tensor(-0.0189),
 'always': tensor(0.2100),
 'am': tensor(-0.0530),
 'an': tensor(-0.2792),
 'and': tensor(0.2426),
 'angrily': tensor(-0.0309),
 'angry': tensor(-0.0026),
 'another': tensor(0.0047),
 'answer': tensor(-0.13

Now, let's convert these dot products to probabilities using the softmax function:
- We need a probability distribution P(word|harry), so that P(a|harry) + ... + P(potter|harry) + ... + P(ron|harry) + ... = 1.
- A raw dot product can be any real number. Such an unnormalized score is often called a *logit*.
  - The term comes from logistic regression: it is the value before it is squashed into the range (0, 1), for example by a sigmoid function.
  - Every probability must lie in the range (0, 1) (greater than 0, smaller than 1).
  - We get this by taking the exponential of each dot product and dividing by the sum of all the exponentials.
  - This function is called **Softmax**.

- Why do we use the exponential?
  - Because it makes every value positive while preserving the order of the scores.

In [27]:
# exponentiate every dot product
word_exp_dict = {}
for word, dot_value in word_dot_dict.items():
  exp_value = torch.exp(dot_value)
  word_exp_dict[word] = exp_value

# normalize by the sum of all exponentials so the values sum to 1
sum_exp = sum(word_exp_dict.values())
word_prob_dict = {}
for word, exp_value in word_exp_dict.items():
  word_prob_dict[word] = exp_value / sum_exp

word_prob_dict

{'a': tensor(0.0007),
 'able': tensor(0.0007),
 'abou': tensor(0.0008),
 'about': tensor(0.0007),
 'above': tensor(0.0006),
 'across': tensor(0.0006),
 'added': tensor(0.0006),
 'afford': tensor(0.0005),
 'afraid': tensor(0.0007),
 'after': tensor(0.0006),
 'afternoon': tensor(0.0007),
 'again': tensor(0.0007),
 'against': tensor(0.0006),
 'ages': tensor(0.0006),
 'ago': tensor(0.0007),
 'agreed': tensor(0.0007),
 'ah': tensor(0.0007),
 'ahead': tensor(0.0007),
 'air': tensor(0.0006),
 'albus': tensor(0.0006),
 'alive': tensor(0.0006),
 'all': tensor(0.0006),
 'alley': tensor(0.0007),
 'allowed': tensor(0.0007),
 'almost': tensor(0.0006),
 'alone': tensor(0.0007),
 'along': tensor(0.0007),
 'already': tensor(0.0008),
 'also': tensor(0.0007),
 'although': tensor(0.0006),
 'always': tensor(0.0008),
 'am': tensor(0.0006),
 'an': tensor(0.0005),
 'and': tensor(0.0008),
 'angrily': tensor(0.0006),
 'angry': tensor(0.0007),
 'another': tensor(0.0007),
 'answer': tensor(0.0006),
 'any': tenso

In [None]:
# Get P(potter|harry)
word_prob_dict['potter']

## 13. Efficient Matrix Operations
![img](https://mkang32.github.io/images/python/khan_academy_matrix_product.png)

Instead of calculating dot products one by one, we can use matrix multiplication for efficiency:


In [41]:
# stack the two center word vectors into a 2 x dim matrix
center_word_mat = torch.stack([harry, potter])
center_word_mat.shape

torch.Size([2, 100])

In [43]:
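# get the dot product of each center word with every word in the vocabulary
# a single vector can be turned into a 1 x dim matrix with unsqueeze(0)
harry_mat = harry.unsqueeze(0)
# here we use the 2 x dim matrix of harry and potter instead:
# (2, dim) @ (dim, vocab_size) -> (2, vocab_size)
dot_by_mat = torch.mm(center_word_mat, word_vectors.T)
# transpose so that each column holds the scores for one center word
dot_by_mat = dot_by_mat.T
dot_by_mat.shape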

torch.Size([1506, 2])

Let's verify that our matrix multiplication gives the same result as the individual dot product. Column 0 of `dot_by_mat` holds the scores for "harry":

In [35]:
dot_by_mat[word2idx['potter'], :1], word_dot_dict['potter']

(tensor([-0.0597]), tensor(-0.0597))
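The same check can be run for every entry at once. This sketch uses small random matrices (the sizes and the seed are arbitrary assumptions) and confirms that one `torch.mm` call matches the dot products computed one by one.

```python
import torch

torch.manual_seed(0)
vectors = torch.randn(5, 4)  # 5 made-up "words", 4 dimensions each
queries = vectors[[1, 3]]    # two center words, shape (2, 4)

# One matrix multiplication: (2, 4) @ (4, 5) -> (2, 5)
scores = torch.mm(queries, vectors.T)

# The same numbers, computed one dot product at a time
one_by_one = torch.stack([
    torch.stack([torch.dot(q, v) for v in vectors]) for q in queries
])

print(scores.shape, torch.allclose(scores, one_by_one, atol=1e-6))
```

Entry `[i, j]` of `scores` is the dot product of query `i` with word `j`, which is why the notebook transposes the result to get one column per center word.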

Now let's implement the complete softmax calculation using matrix operations:


In [44]:
# convert dot product result into exponential
mat_exp = torch.exp(dot_by_mat)
mat_exp.shape

torch.Size([1506, 2])

In [45]:
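# Sum of exponentials.
# Note: dim=1 adds up the two center-word columns for each vocabulary word,
# which is what the output below shows. The softmax normalizer has to sum
# over the vocabulary instead (dim=0), as the next cell does.
sum_of_mat_exp = torch.sum(mat_exp, dim=1, keepdim=True)
sum_of_mat_exp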

tensor([[2.0586],
        [2.2399],
        [2.0874],
        ...,
        [2.1022],
        [2.2362],
        [2.0015]])

In [47]:
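# divide each exponential by the sum over the vocabulary (dim=0)
prob = mat_exp / mat_exp.sum(dim=0, keepdim=True)
# each column is a probability distribution, so it sums to 1
prob.sum(dim=0)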

tensor([1.0000, 1.0000])

## 14. Creating a Probability Function

Let's create a function to calculate probabilities efficiently:

In [51]:
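def get_probs(query_vectors, entire_vectors):
  # dot product of every query vector with every word vector: (num_queries, vocab_size)
  dot_by_mat = torch.mm(query_vectors, entire_vectors.T)
  # transpose to (vocab_size, num_queries) so each column belongs to one query
  dot_by_mat = dot_by_mat.T
  mat_exp = torch.exp(dot_by_mat)
  # softmax normalizer: sum over the vocabulary for each query
  sum_of_mat_exp = torch.sum(mat_exp, dim=0)
  prob = mat_exp / sum_of_mat_exp
  return prob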

get_probs(center_word_mat, word_vectors)

tensor([[0.0007, 0.0006],
        [0.0007, 0.0008],
        [0.0008, 0.0006],
        ...,
        [0.0007, 0.0007],
        [0.0007, 0.0007],
        [0.0007, 0.0006]])
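One practical caveat: `torch.exp` can overflow when dot products become large, which may happen once the vectors grow during training. A common remedy is to subtract the largest score in each column before exponentiating, which leaves the softmax result unchanged. The function below is an optional variation written for illustration (the name `get_probs_stable` is hypothetical), not part of the notebook's own method.

```python
import torch

def get_probs_stable(query_vectors, entire_vectors):
    # Scores with shape (vocab_size, num_queries), as in get_probs
    logits = torch.mm(query_vectors, entire_vectors.T).T
    # Subtract the per-column maximum; softmax is unchanged by this shift
    logits = logits - logits.max(dim=0, keepdim=True).values
    exp = torch.exp(logits)
    return exp / exp.sum(dim=0, keepdim=True)

torch.manual_seed(0)
vectors = torch.randn(6, 3) * 50  # deliberately large made-up values
probs = get_probs_stable(vectors[:2], vectors)

print(probs.sum(dim=0))          # each column still sums to 1
print(torch.isnan(probs).any())  # no NaN despite the large scores
```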

## 15. Preparing for Training

Before training our Word2Vec model, we need to split our dataset into training and testing sets:

In [57]:
# Now we can train the word2vec
import random

# Let's think about the training pairs
index_pairs # this is our dataset: a list of tuples of two integers
print(len(index_pairs))
# each tuple holds the indices of a pair of neighboring words

# Training set and test set
# We hold out a test set to check that the model generalizes to 'unseen' pairs,
# so we have to split the dataset before training.

# To split the dataset randomly, we first shuffle it

random.shuffle(index_pairs) # shuffles the list in place

226846


In [58]:
train_set = index_pairs[:200000]
test_set = index_pairs[200000:]

In [59]:
len(train_set), len(test_set)

(200000, 26846)
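The same shuffle-then-slice idea can be written with a ratio instead of a fixed cut-off, and with a seed so the split is reproducible. The sketch below uses made-up pairs; the 80/20 ratio and the seed value are assumptions, not settings from this notebook.

```python
import random

pairs = [(i, i + 1) for i in range(10)]  # made-up index pairs

random.seed(0)                           # assumed seed, for a reproducible shuffle
shuffle_result = random.shuffle(pairs)   # shuffles in place and returns None

split_point = int(len(pairs) * 0.8)      # assumed 80/20 split
train_part = pairs[:split_point]
test_part = pairs[split_point:]

print(len(train_part), len(test_part))
# 8 2
```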

## 16. Training the Word2Vec Model

Now we'll train our Word2Vec model using mini-batch gradient descent. We start by building a single batch and computing the predicted probabilities for its center words:

In [60]:
# Make a batch from train_set.
# A batch is a set of training samples that are processed together,
# and the model is updated once per batch.

batch = train_set[:20]
center_words = [x[0] for x in batch]
context_words = [x[1] for x in batch]

center_words_vectors = word_vectors[center_words]
prob = get_probs(center_words_vectors, word_vectors)
prob

tensor([[0.0006, 0.0006, 0.0007,  ..., 0.0007, 0.0006, 0.0008],
        [0.0006, 0.0007, 0.0007,  ..., 0.0007, 0.0007, 0.0008],
        [0.0006, 0.0006, 0.0007,  ..., 0.0007, 0.0007, 0.0006],
        ...,
        [0.0007, 0.0005, 0.0006,  ..., 0.0007, 0.0006, 0.0006],
        [0.0008, 0.0007, 0.0007,  ..., 0.0008, 0.0006, 0.0006],
        [0.0007, 0.0007, 0.0007,  ..., 0.0007, 0.0008, 0.0007]])
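From these probabilities, one way to complete the training step is sketched below. It rests on several assumptions that are not confirmed by the cells shown here: the loss is the average negative log probability of the true context word, gradients come from PyTorch autograd, and the update is plain gradient descent. The toy data, `batch_size`, `learning_rate`, and `num_epochs` are made-up values for illustration.

```python
import torch

torch.manual_seed(0)

def get_probs(query_vectors, entire_vectors):
    # Softmax over the vocabulary for each query -> (vocab_size, num_queries)
    dot_by_mat = torch.mm(query_vectors, entire_vectors.T).T
    mat_exp = torch.exp(dot_by_mat)
    return mat_exp / torch.sum(mat_exp, dim=0)

# Toy stand-ins for the notebook's variables (not the real corpus)
vocab_size, dim = 6, 8
train_set = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5), (5, 4)] * 20
word_vectors = (torch.randn(vocab_size, dim) / 10).requires_grad_(True)

batch_size = 20      # assumed value
learning_rate = 0.2  # assumed value
num_epochs = 30      # assumed value
loss_record = []

for epoch in range(num_epochs):
    for start in range(0, len(train_set), batch_size):
        batch = train_set[start:start + batch_size]
        center_words = torch.tensor([x[0] for x in batch])
        context_words = torch.tensor([x[1] for x in batch])

        # Predicted distribution for each center word: (vocab_size, batch)
        prob = get_probs(word_vectors[center_words], word_vectors)
        # Probability assigned to the true context word of each pair
        correct_prob = prob[context_words, torch.arange(len(batch))]
        # Average negative log-likelihood over the batch
        loss = -torch.log(correct_prob).mean()

        loss.backward()
        with torch.no_grad():
            # One gradient descent step, then reset the gradient
            word_vectors -= learning_rate * word_vectors.grad
            word_vectors.grad.zero_()
        loss_record.append(loss.item())

print(loss_record[0], loss_record[-1])
```

The list `loss_record` holds one loss value per batch, so plotting it shows whether the loss goes down as training proceeds.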

## 17. Evaluating the Training

Let's visualize the training loss to see if our model is learning:

In [None]:
import matplotlib.pyplot as plt
plt.plot(loss_record)

## 18. Testing the Model

Now we'll test our model on the test set:
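A minimal sketch of such a test, assuming the metric is the average negative log probability of the true context word (the helper name `evaluate` and the toy data are hypothetical). No gradients are needed here, so the computation runs under `torch.no_grad()`.

```python
import math
import torch

def get_probs(query_vectors, entire_vectors):
    # Softmax over the vocabulary for each query -> (vocab_size, num_queries)
    dot_by_mat = torch.mm(query_vectors, entire_vectors.T).T
    mat_exp = torch.exp(dot_by_mat)
    return mat_exp / torch.sum(mat_exp, dim=0)

def evaluate(word_vectors, pairs, batch_size=1000):
    # Average negative log-likelihood over all (center, context) pairs
    total_loss, count = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            center = torch.tensor([x[0] for x in batch])
            context = torch.tensor([x[1] for x in batch])
            prob = get_probs(word_vectors[center], word_vectors)
            correct_prob = prob[context, torch.arange(len(batch))]
            total_loss += -torch.log(correct_prob).sum().item()
            count += len(batch)
    return total_loss / count

# Toy check: all-zero vectors give a uniform distribution over 5 words,
# so the loss should equal log(5)
toy_vectors = torch.zeros(5, 4)
toy_test_set = [(0, 1), (2, 3), (4, 0)]
test_loss = evaluate(toy_vectors, toy_test_set)
print(test_loss, math.log(5))
```

A test loss close to the training loss suggests the model is not simply memorizing the training pairs.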

## 19. Exploring Learned Word Relationships

Let's explore what our model has learned by finding the words most closely related to "harry":

In [None]:
# P(potter|harry)?
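One way to answer this is to compute P(word|harry) for the whole vocabulary and keep the top few entries. The sketch below is an illustration on hand-made vectors; the helper name `most_probable_words` and the toy vocabulary are hypothetical.

```python
import torch

def most_probable_words(query, vocab, word2idx, word_vectors, k=3):
    # Scores of the query word against every word: (vocab_size,)
    query_vector = word_vectors[word2idx[query]]
    scores = torch.mv(word_vectors, query_vector)
    probs = torch.softmax(scores, dim=0)
    # Keep the k largest probabilities and their indices
    top_probs, top_idx = torch.topk(probs, k)
    return [(vocab[i], p) for i, p in zip(top_idx.tolist(), top_probs.tolist())]

toy_vocab = ['harry', 'potter', 'ron', 'owl']
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_vectors = torch.tensor([[1.0, 0.0],
                            [0.9, 0.1],
                            [0.0, 1.0],
                            [-1.0, 0.0]])

result = most_probable_words('harry', toy_vocab, toy_word2idx, toy_vectors, k=2)
print(result)
```

Because the same vectors are used for center and context words here, the query word itself tends to rank near the top, so you may want to skip it when reading the results.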
