# Lab 03 Part 2 - Word Embeddings
In this lab we will look into word embeddings with word2vec and other similar methods

In [2]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a href="https://colab.research.google.com/github/surrey-nlp/NLP-2025/blob/main/lab03/lab03_word_embeddings.ipynb" target="_parent">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/></a>'
)
display(colab_button)

### Build your own
Let's start by building our own word2vec model instead of downloading a pre-trained one. For that we are going to use the 20 newsgroups dataset from sklearn, since it is not too large for a lab exercise.

In [None]:
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

# let's check the first two documents
documents[:2]

The first thing to do is to format the documents into a list of sentences, where each sentence is itself a list of tokens. We are not going to do any further cleaning or pre-processing for now (to keep things simple for the labs), but that would be advisable.
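
To make the target shape concrete, here is a minimal sketch on two toy documents. It deliberately uses a naive regex splitter rather than NLTK: `naive_sentences` and `naive_tokens` are hypothetical helpers introduced only for this illustration, and they are not a substitute for `sent_tokenize` and `word_tokenize`.

```python
import re

def naive_sentences(text):
    # Hypothetical helper: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_tokens(sentence):
    # Hypothetical helper: words and single punctuation marks become tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

toy_documents = ["I like cats. Do you?", "Dogs bark!"]

# One inner list per sentence, regardless of which document it came from.
toy_token_list = []
for doc in toy_documents:
    for sentence in naive_sentences(doc):
        toy_token_list.append(naive_tokens(sentence))

print(toy_token_list)
```

Note that the document boundaries disappear: the result is a flat list of sentences, each of which is a list of string tokens.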

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

# This will take a minute or so
token_list = []
## TODO: use this loop to tokenize each document using sent_tokenize, and then to tokenize each word using word_tokenize; 
##       do remember to add all the resultant tokens to token_list

for d in documents:
    s = ...
    token_list = ...

# check the first three sentences
token_list[:3]

Now it is time to import the word2vec algorithm and set the key parameters
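
To get a feel for what the minimum word count and the context window control, the sketch below applies both ideas to a toy corpus in plain Python. The helpers `kept_vocabulary` and `context_pairs` are hypothetical names made up for this illustration; gensim's real training loop differs in its details.

```python
from collections import Counter

toy_sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

def kept_vocabulary(sentences, min_count):
    # Count every token and keep those seen at least min_count times.
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    # Pair each centre word with neighbours at most `window` positions away.
    pairs = []
    for i, centre in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs.extend((centre, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

print(sorted(kept_vocabulary(toy_sentences, min_count=2)))
print(context_pairs(["the", "dog", "sat"], window=1))
```

A higher minimum count shrinks the vocabulary, and a larger window produces more (centre, context) pairs per sentence.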

In [None]:
from gensim.models.word2vec import Word2Vec

# Number of vector elements (dimensions) to represent the word vector
num_features = 300
# Minimum number of occurrences for a word to be included in the Word2vec model. If your corpus is small, reduce the min count. If you’re training with a large corpus, increase the min count.
min_word_count = 1
# Number of CPU cores used for the training. If you want to set the number of cores dynamically, check out import multiprocessing: 
#num_workers = multiprocessing.cpu_count()
num_workers = 2
# Context window size
window_size = 3
# Subsampling rate for frequent terms
subsampling = 1e-3
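
The subsampling rate is easier to interpret with numbers. The sketch below assumes the keep-probability formula from the original word2vec C code, where a word that makes up a fraction `f` of the corpus is kept with probability `(sqrt(f / sample) + 1) * sample / f`, capped at 1. The function name `keep_probability` is hypothetical, and this is an illustration of the idea rather than gensim's exact code.

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    # Assumed formula (original word2vec C code): frequent words are kept
    # with probability (sqrt(f / sample) + 1) * sample / f, capped at 1.
    ratio = word_fraction / sample
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)

# Fractions of the corpus taken up by a very frequent, a borderline and a rare word.
for fraction in (0.05, 0.001, 0.0001):
    print(fraction, round(keep_probability(fraction), 3))
```

Under this assumption, only words that are much more frequent than the `sample` threshold get randomly dropped during training; rarer words are always kept.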

Let's train the model!

In [None]:
%%time
## TODO: Use the parameters defined in the code cell above to start Word2Vec model training.
model = Word2Vec(token_list, workers=..., vector_size=..., min_count=..., window=..., sample=...)

Once you have trained your word model, you can reduce its memory footprint by about half by freezing it and discarding information that is only needed for further training, such as the output weights of the neural network. In gensim 4.x, one way to do this is to keep only the word vectors stored in `model.wv` (a `KeyedVectors` object) rather than the full model.

The model cannot be trained further once the weights of the output layer have been discarded.

Save the trained model with the following command and preserve it for later use:

In [None]:
model_name = "my_own_domain_specific_word2vec_model"
model.save(model_name)

Now let's say we want to load the model that we previously saved.

In [None]:
from gensim.models.word2vec import Word2Vec
model_name = "my_own_domain_specific_word2vec_model"
## TODO: Load the model you just saved in your lab folder
model = ...

Let's check the most similar words to "justice"
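
`most_similar` ranks the other words in the vocabulary by the cosine similarity of their vectors to the query word's vector. The sketch below reproduces that idea on made-up 3-dimensional vectors; `toy_vectors` and `toy_most_similar` are hypothetical names, and the numbers are invented purely for illustration.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "justice": [0.9, 0.1, 0.0],
    "court":   [0.8, 0.3, 0.1],
    "law":     [0.7, 0.2, 0.0],
    "banana":  [0.0, 0.1, 0.9],
}

def toy_most_similar(word, vectors, topn=2):
    # Score every other word against the query, then sort best-first.
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar("justice", toy_vectors))
```

Words whose vectors point in nearly the same direction as the query come first, while "banana", which points elsewhere, does not make the top two.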

In [None]:
print(model.wv.most_similar('justice'))

### Challenge - 1
Try a few more words and observe if what is retrieved makes sense

### Using the gensim API
Having built our own model is great, but let's now load a model that was trained on MANY documents.

In [None]:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

We will be using the downloader for the embedding models

In [None]:
import gensim.downloader as api

# this command can be used to check what models are available
#api.info()

Let's load the word2vec model trained on Google News, which has 300 features

In [None]:
# This will also take a minute or so
## TODO: Use the api above to load a new model 'word2vec-google-news-300'
word2vec_model = ...

Now check the embedding vector for "beautiful"... you will see a 300 dimensional vector

In [None]:
word2vec_model["beautiful"]

Let's check some similar words to the word "girl"

In [None]:
word2vec_model.most_similar("girl")

How about some maths with vectors! Try the following:

queen - girl + boy = king
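
To see why this arithmetic can work, here is a minimal sketch with hand-made vectors whose three axes stand for royalty, maleness and youth. The vectors and the helper `toy_analogy` are assumptions made for this illustration; real embeddings are learned, have hundreds of dimensions, and gensim's `most_similar` differs in details such as normalisation.

```python
import math

# Hand-made vectors; the axes are (royalty, maleness, youth). Illustrative only.
toy = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "boy":   [0.0,  1.0, 1.0],
    "girl":  [0.0, -1.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_analogy(positive, negative, vectors):
    # Add the positive vectors and subtract the negative ones.
    target = [0.0, 0.0, 0.0]
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Rank the words that were not part of the query by similarity to the target.
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(toy_analogy(["boy", "queen"], ["girl"], toy))
```

Subtracting "girl" from "queen" removes the female and youth components, and adding "boy" puts the male component back in, so the result lands on the toy vector for "king".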

In [None]:
word2vec_model.most_similar(positive=['boy', 'queen'], negative=['girl'], topn=1)

Time to do some visualisations and see how similar words end up close together and far from other words that are not as similar

In [None]:
vocab = ["boy", "girl", "man", "woman", "king", "queen", "banana", "apple", "mango", "fruit", "coconut", "orange"]

def tsne_plot(model):
    labels = []
    wordvecs = []

    for word in vocab:
        wordvecs.append(model[word])
        labels.append(word)
    
    tsne_model = TSNE(perplexity=3, n_components=2, init='pca', random_state=42)
    coordinates = tsne_model.fit_transform(np.array(wordvecs))

    x = []
    y = []
    for value in coordinates:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(8,8)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(2, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(word2vec_model)

### Challenge - 2
Try a few more examples to visualise and see if similar words land close together

## GloVe
Let's try another model (GloVe) and see if that is any different to word2vec

In [None]:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

In [None]:
import gensim.downloader as api
glove_model = api.load('glove-wiki-gigaword-300')

In [None]:
glove_model["beautiful"]

It will be interesting to see if this will find similar words to "girl" like word2vec did

In [None]:
glove_model.most_similar("girl")

Let's also see if it can solve the same analogy too

In [None]:
glove_model.most_similar(positive=['boy', 'queen'], negative=['girl'], topn=1)

In [None]:
vocab = ["boy", "girl", "man", "woman", "king", "queen", "banana", "apple", "mango", "fruit", "coconut", "orange"]

def tsne_plot(model):
    labels = []
    wordvecs = []

    for word in vocab:
        wordvecs.append(model[word])
        labels.append(word)
    
    tsne_model = TSNE(perplexity=3, n_components=2, init='pca', random_state=42)
    coordinates = tsne_model.fit_transform(np.array(wordvecs))

    x = []
    y = []
    for value in coordinates:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(8,8)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(2, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(glove_model)

Let's continue with GloVe and check how close plural words are to their singular forms
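
The `distance` method used here is the cosine distance, that is, one minus the cosine similarity of the two word vectors. The short sketch below, with a hypothetical `cosine_distance` helper and made-up 2-dimensional vectors, shows the scale to expect.

```python
import math

def cosine_distance(u, v):
    # One minus the cosine of the angle between u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Smaller values mean the vectors point in more similar directions; vector length plays no role.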

In [None]:
print(glove_model.distance("fruit", "fruits"))
print(glove_model.distance("girl", "girls"))
print(glove_model.distance("girl", "boy"))

### Challenge - 3
Calculate the distance for "king" and "queen", then for "woman" and "man". Are they similar? Check the plot to confirm.

Calculate the distance for "king" and "apple", then for "queen" and "apple". Is it similar again? Check the plot to confirm.

Now let's try and see if the model can find the capitals of different countries

In [None]:
import pandas as pd
# pretty print function
def pp(obj):
    print(pd.DataFrame(obj))

## TODO: Use the most_similar function over glove_model to find out the capitals of the countries in the list below.
##       HINT: [worda] is to be passed as a value to the negative parameter. 
##       Ques: Should [wordb, wordc] be passed as a positive parameter? 
##             What should be the topn results returned? Think about this and complete the function.
def analogy(worda, wordb, wordc):
    result = ...
    return result[0][0]

countries = ['australia', 'canada', 'germany', 'ireland', 'italy']
capitals = [analogy('usa', 'washington', country) for country in countries]
pp(zip(countries,capitals))

### Challenge - 4
Looks good... but what if you change "usa" to "us"? Or if you used a different example to start with like "greece" and "athens"?

Now let's plot the results on a graph
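
The plot relies on PCA to squeeze each 300-dimensional vector into two coordinates. As a rough sketch of what that projection does, the block below centres the data and projects it onto the two directions of greatest variance using NumPy's SVD. The function `pca_2d` is a hypothetical helper and the random vectors are stand-ins for real embeddings; sklearn's `PCA` may return coordinates with flipped signs.

```python
import numpy as np

def pca_2d(vectors):
    # Centre the data, then project onto the top two right singular vectors.
    X = np.asarray(vectors, dtype=float)
    X_centred = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ vt[:2].T

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 300))  # stand-in for ten 300-d word vectors
projected = pca_2d(toy_vectors)
print(projected.shape)
```

Each row of `projected` is a 2-D point that can be passed to a scatter plot, one per word.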

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

def plot_data(orig_data, labels):
    pca = PCA(n_components=2)
    data = pca.fit_transform(orig_data)
    plt.figure(figsize=(7, 5), dpi=100)
    plt.plot(data[:,0], data[:,1], '.')
    for i in range(len(data)):
        plt.annotate(labels[i], xy = data[i])
    for i in range(len(data)//2):
        plt.annotate("",
                xy=data[i],
                xytext=data[i+len(data)//2],
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3")
        )
       
labels = countries + capitals
data = [glove_model[w] for w in labels]
plot_data(data, labels)

## doc2vec
Now let's look into generating feature vectors for documents instead of just words. For that we are going to use doc2vec

In [None]:
import multiprocessing
num_cores = multiprocessing.cpu_count()

from gensim.models.doc2vec import TaggedDocument,Doc2Vec
from gensim.utils import simple_preprocess

First let's load some data

In [None]:
corpus = ['This is the first document','another document']

training_corpus = []
for i, text in enumerate(corpus):
    tagged_doc = TaggedDocument(simple_preprocess(text), [i])
    ## TODO: Append tagged_doc to the training_corpus you have. 
    ...
    
# If you’re running low on RAM, and you know the number of documents ahead of time (your corpus object isn’t an iterator or generator),
# you might want to use a preallocated numpy array instead of Python list for your training_corpus:
#training_corpus = np.empty(len(corpus), dtype=object);
#… 
#training_corpus[i] = …

Now we will build the model and train it

In [None]:
doc2vec_model = Doc2Vec(vector_size=100, min_count=2, workers=num_cores, epochs=10)
doc2vec_model.build_vocab(training_corpus)
doc2vec_model.train(training_corpus, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

This is how you train a doc2vec model. Before you move on to inference, try training the same model with different vector sizes and with more training epochs, then compare the vectors that each of these models infers.

In [None]:
## TODO: Increase the vector size to 300 with this one
doc2vec_model_2 = ...
doc2vec_model_2.build_vocab(training_corpus)
doc2vec_model_2.train(...)

In [None]:
## TODO: Increase the number of epochs to 20 with this one.
doc2vec_model_3 = ...
doc2vec_model_3.build_vocab(training_corpus)
doc2vec_model_3.train(...)

In [None]:
## TODO: Try increasing both the parameters for this one.
doc2vec_model_4 = ...
doc2vec_model_4.build_vocab(training_corpus)
doc2vec_model_4.train(...)

Time to generate the feature vector of a new document!

In [None]:
doc2vec_model.infer_vector(simple_preprocess('This is a completely unseen document'), epochs=10)

In [None]:
doc2vec_model_2.infer_vector(simple_preprocess('This is a completely unseen document'), epochs=10)

In [None]:
doc2vec_model_3.infer_vector(simple_preprocess('This is a completely unseen document'), epochs=...) # Use the increased number of epochs

In [None]:
doc2vec_model_4.infer_vector(simple_preprocess('This is a completely unseen document'), epochs=...) # Use the increased number of epochs

### Challenge - 5
Use the fetch_20newsgroups dataset from sklearn (see code above) and re-train doc2vec with that data instead.

Then, check using the most similar function to see if the documents you test are indeed similar.