# **Word Embedding with Gensim**

The advancement of deep learning in Natural Language Processing is often attributed to the advent of word embeddings. Rather than using the words themselves as features, neural network methods typically take as input dense, relatively low-dimensional vectors that model the meaning and usage of a word. The concept of word embeddings gained prominence through models like Word2Vec, pioneered by Tomas Mikolov and his team at Google. Subsequently, various other methods emerged, including GloVe and FastText embeddings. In this notebook, we'll delve into word embeddings using the original Word2Vec method, as implemented in the Gensim library.

# Training word embedding

Training word embeddings with Gensim couldn't be easier. The only thing we need is a corpus of sentences in the language under investigation.

In [None]:
!pip install gdown

In [None]:
!gdown "https://drive.google.com/uc?id=1UmvSodP7wo_L8OBu_9ImaLe5lVEVaiPX"

In [None]:
!pip install spacy
!pip install scikit-learn
!pip install --upgrade gensim
!pip install pandas
!pip install matplotlib
!pip install datasets
!pip install tensorflow

In [18]:
# Import libraries
import os
import csv
import spacy
import pandas as pd
import gensim

class Corpus(object):

    def __init__(self, filename):
        self.filename = filename
        self.nlp = spacy.blank("en")

    def __iter__(self):
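        # newline="" is what the csv module expects; utf-8 avoids platform-dependent decoding
        with open(self.filename, "r", encoding="utf-8", newline="") as f:
            reader = csv.reader(f, delimiter=",")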
            # tokenise the lower cased text using spacy tokeniser
            for _, abstract in reader:
                tokens = [t.text.lower() for t in self.nlp(abstract)]
                yield tokens


documents = Corpus("data.csv")

When we train our word embeddings, gensim allows us to set a number of parameters. The most important of these are min_count, window, vector_size and sg:

* **min_count** is the minimum frequency of the words in our corpus. For infrequent words, we just don't have enough information to train reliable word embeddings. It therefore makes sense to set this minimum frequency to at least 10. In these experiments, we'll set it to 100 to limit the size of our model even more.

* **window** is the number of words to the left and to the right that make up the context that word2vec will take into account.

* **vector_size** is the dimensionality of the word vectors. This is generally between 100 and 1000. This dimensionality often forces us to make a trade-off: embeddings with a higher dimensionality are able to model more information, but also need more data to train.

* **sg**: there are two algorithms to train word2vec: skip-gram and CBOW. Skip-gram tries to predict the context on the basis of the target word; CBOW tries to find the target on the basis of the context. By default, Gensim uses CBOW (sg=0).


We'll investigate the impact of some of these parameters later.
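To make window and sg a little more concrete, here is a minimal sketch of how (target, context) pairs can be read off a sentence with a fixed symmetric window. The helper `context_pairs` and the toy sentence are illustrative assumptions, not part of Gensim. The real implementation additionally shrinks the window at random and applies subsampling and negative sampling, all of which this sketch ignores.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # leftmost context position
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical toy sentence, already tokenised and lower-cased
sentence = ["we", "train", "word", "embeddings", "with", "gensim"]
pairs = context_pairs(sentence, window=2)

# Context words seen for the target "word" when window=2
print([context for target, context in pairs if target == "word"])
```

With skip-gram, each pair is roughly one training example (predict the context word from the target). With CBOW, all context words of one target are combined to predict that target.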

In [3]:
model = gensim.models.Word2Vec(documents, min_count=100, window=5, vector_size=100)

# Using word embeddings

Let's take a look at the trained model. The word embeddings are stored on its wv attribute, and we can access them by using the token as key. For example, here is the embedding for nlp, with the requested 100 dimensions.

In [4]:
model.wv["nlp"]

array([ 1.2049196e+00, -2.0399175e+00,  1.9319603e+00, -1.1644354e-01,
        1.7962197e+00,  6.4193588e-01,  1.2741158e+00, -7.9043919e-01,
       -5.4795825e-01,  1.5752438e+00, -1.2307194e-02, -3.3175652e+00,
        6.7342067e-01, -4.6384311e-01, -5.6133050e-01,  9.0628380e-01,
       -1.1305474e+00,  1.1876471e+00,  4.4961435e-01,  1.5232416e+00,
        2.1306667e+00, -3.9594689e-01,  2.3165701e-01,  8.6593503e-01,
       -2.4511082e+00, -1.4231625e+00,  6.9664586e-01,  1.1180078e+00,
        1.1939554e+00, -1.0265014e+00,  7.2400504e-01,  1.1364397e+00,
        4.1103932e-01,  1.1459923e+00,  1.0287075e+00, -1.1662725e+00,
       -1.5545501e+00, -2.6318336e+00, -2.6789132e-01,  1.9964441e+00,
       -1.2992334e+00, -2.4077046e+00,  2.4514947e+00,  3.3987838e-01,
       -1.0033230e+00,  3.0160430e-01,  2.0661438e+00,  1.8518491e+00,
       -6.1098385e-01, -2.7785852e+00,  2.0116267e+00,  1.9540480e+00,
       -5.5057816e-02, -1.1721641e+00,  3.4381017e-01,  7.1585095e-01,
      

We can also easily find the similarity between two words. Similarity is measured as the cosine between the two word embeddings, and therefore ranges between -1 and +1. The higher the cosine, the more similar two words are. As expected, the figures below show that nmt (neural machine translation) is closer to smt (statistical machine translation) than to ner (named entity recognition).

In [5]:
print(model.wv.similarity("experiments","results"))
print(model.wv.similarity("linguistics","translation"))

# More specific example
print(model.wv.similarity("nmt", "smt"))
print(model.wv.similarity("nmt", "ner"))

0.5043842
-0.08405011
0.69067127
0.39934808
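The scores above are plain cosines between embedding vectors. As a minimal sketch of that computation, the snippet below uses NumPy on made-up three-dimensional vectors; the function name `cosine` and the vectors are illustrative assumptions.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])    # same direction as a
c = np.array([-1.0, -2.0, 0.0])  # opposite direction to a
d = np.array([0.0, 0.0, 3.0])    # orthogonal to a

print(cosine(a, b), cosine(a, c), cosine(a, d))
```

Applied to the trained Word2Vec model, `cosine(model.wv["nmt"], model.wv["smt"])` should agree with `model.wv.similarity("nmt", "smt")` up to floating-point error.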


In a similar vein, we can find the words that are most similar to a target word. The words with the most similar embedding to bert are all semantically related to it: other types of pretrained models such as roberta, mbert, xlm, as well as the more general model type BERT represents (transformer and transformers), and more generally related words (pretrained).

In [6]:
model.wv.similar_by_word("bert", topn=10)

[('roberta', 0.7877159714698792),
 ('transformer', 0.7542476654052734),
 ('elmo', 0.7362399101257324),
 ('transformers', 0.7179213166236877),
 ('pretrained', 0.7090664505958557),
 ('gpt-2', 0.6605957746505737),
 ('mbert', 0.6548022627830505),
 ('xlm', 0.6529796123504639),
 ('xlnet', 0.6369994878768921),
 ('lstm', 0.6300441026687622)]

In [7]:
model.wv.similar_by_word("sentences", topn=10)

[('paraphrases', 0.7356393337249756),
 ('documents', 0.7144497036933899),
 ('strings', 0.696727991104126),
 ('texts', 0.6827948689460754),
 ('keyphrases', 0.6799679398536682),
 ('tokens', 0.6754116415977478),
 ('words', 0.6572448015213013),
 ('phrases', 0.6515894532203674),
 ('poems', 0.6500973105430603),
 ('paragraphs', 0.6317392587661743)]

Interestingly, we can look for words that are similar to one set of words and dissimilar to another set at the same time. This allows us to look for analogies of the type BERT is to transformer as LSTM is to .... Our embedding model correctly predicts that LSTMs are a type of RNN, just as BERT is a particular type of transformer.

In [8]:
model.wv.most_similar(positive=["transformer", "lstm"], negative=["bert"], topn=1)

[('rnn', 0.8247852921485901)]
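The arithmetic behind such an analogy query can be sketched with NumPy. To the best of my understanding, most_similar adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining vocabulary by cosine to the result. The function `analogy` and the hand-made three-dimensional vectors below are illustrative assumptions, not Gensim's code or real embeddings.

```python
import numpy as np


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative)."""
    query = np.zeros_like(next(iter(vectors.values())))
    for word in positive:
        query += unit(vectors[word])
    for word in negative:
        query -= unit(vectors[word])
    query = unit(query)
    # The query words themselves are excluded from the candidates
    exclude = set(positive) | set(negative)
    scores = {
        word: float(np.dot(query, unit(vec)))
        for word, vec in vectors.items()
        if word not in exclude
    }
    return max(scores, key=scores.get)


# Toy vectors: axis 0 ~ "transformer family", axis 1 ~ "recurrent family",
# axis 2 ~ "specific model rather than a model class"
toy = {
    "bert": np.array([1.0, 0.0, 1.0]),
    "transformer": np.array([1.0, 0.0, 0.0]),
    "lstm": np.array([0.0, 1.0, 1.0]),
    "rnn": np.array([0.0, 1.0, 0.0]),
    "svm": np.array([-1.0, -1.0, 0.0]),
}

print(analogy(toy, positive=["transformer", "lstm"], negative=["bert"]))  # rnn
```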

Similarly, we can also zoom in on one of the meanings of ambiguous words. For example, in NLP tree has a very specific meaning, which is obvious from its nearest neighbours constituency, parse, dependency and syntax.

In [9]:
model.wv.most_similar(positive=["tree"], topn=10)

[('trees', 0.7817927002906799),
 ('constituency', 0.7085326910018921),
 ('recursive', 0.6886435151100159),
 ('parse', 0.6862645149230957),
 ('formalism', 0.6298339366912842),
 ('dependency', 0.6251062750816345),
 ('constituent', 0.6246699094772339),
 ('path', 0.616413950920105),
 ('hierarchical', 0.607570469379425),
 ('kernels', 0.6073637008666992)]

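However, if we look for words that are similar to tree but dissimilar to syntax, the syntactic sense recedes and forest crops up among its nearest neighbours. Judging by the company it keeps (random, bayes, logistic), this is most likely the machine-learning sense of tree, as in decision trees and random forests, rather than the botanical one.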

In [10]:
model.wv.most_similar(positive=["tree"], negative=["syntax"], topn=10)

[('modified', 0.4574902653694153),
 ('bayes', 0.4566059112548828),
 ('greedy', 0.4538378119468689),
 ('forest', 0.4479946196079254),
 ('random', 0.4259032607078552),
 ('logistic', 0.4248599112033844),
 ('feed', 0.4096589982509613),
 ('nearest', 0.4025936424732208),
 ('softmax', 0.39892685413360596),
 ('stochastic', 0.3971480429172516)]

Finally, we can present the word2vec model with a list of words and ask it to identify the odd one out. It then uses the word embeddings to identify the word that is least similar to the other ones. For example, in the list lstm cnn gru svm transformer, it correctly identifies svm as the only non-neural model. In the list bert word2vec gpt-2 roberta xlnet, it correctly singles out word2vec as the only non-transformer model. In word2vec bert glove fasttext elmo, bert is singled out as the only transformer.

In [11]:
print(model.wv.doesnt_match("lstm cnn gru svm transformer".split()))
print(model.wv.doesnt_match("bert word2vec gpt-2 roberta xlnet".split()))
print(model.wv.doesnt_match("word2vec bert glove fasttext elmo".split()))

svm
word2vec
bert
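One plausible way to implement the odd-one-out test is to average the unit vectors of the listed words and return the word whose vector has the lowest cosine to that average. I believe this is close to what doesnt_match does, but treat that as an assumption. The function `odd_one_out` and the two-dimensional toy vectors are made up for illustration.

```python
import numpy as np


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the given words."""
    # Normalise every vector to unit length so that dot products are cosines
    units = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = units.mean(axis=0)
    centre = centre / np.linalg.norm(centre)
    similarities = units @ centre
    return words[int(np.argmin(similarities))]


# Toy vectors: the three neural models point one way, svm another
toy = {
    "lstm": np.array([1.0, 0.1]),
    "gru": np.array([0.9, 0.2]),
    "cnn": np.array([1.0, 0.3]),
    "svm": np.array([0.1, 1.0]),
}

print(odd_one_out(toy, ["lstm", "cnn", "gru", "svm"]))  # svm
```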


# Exploring hyperparameters

We mentioned above there are a number of parameters we can set when training our embeddings. Let's investigate the impact some of these have on the result. Quantifying the quality of embeddings is a hard task. There exist quite a few data sets for evaluating the quality of English embeddings, but this is not the case for other languages or very specialized domains, such as NLP. Moreover, it's unclear what information good embeddings should capture. Should they model syntactic information as well as semantic knowledge? Should they capture semantic similarity, or merely topical relatedness? Often, the answer depends on the end task you want to use the embeddings for.

Here we'll use a simple method for evaluating our embeddings. We'll count how many times two nearest neighbours in the vector space have the same part of speech. After all, if our embeddings model similarity (and not just relatedness) in meaning, we expect a noun to have another noun as nearest neighbour, and the same for verbs, adjectives, and so on.

First we'll use spaCy to determine the part of speech of all the words in our vocabulary. Note that our evaluation metric does rely on the quality of spaCy's part-of-speech tagging, which may not be very accurate for low-frequency words out of context. Nevertheless, we'll assume it's good enough for our purposes.

In [None]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl

In [12]:
import spacy
# Get spaCy's installation path
spacy_path = spacy.__file__
print(f"SpaCy installation path: {spacy_path}")

SpaCy installation path: /venv/main/lib/python3.12/site-packages/spacy/__init__.py


In [13]:
import spacy
from tqdm.notebook import tqdm

nlp = spacy.load("en_core_web_sm")

word2pos = {}
for word in tqdm(model.wv.key_to_index):
    word2pos[word] = nlp(word)[0].pos_

word2pos["translation"]

  0%|          | 0/3099 [00:00<?, ?it/s]

'NOUN'


Then we write a simple function that takes a model and looks up the nearest neighbour of every word in its vocabulary. It returns the fraction of words whose nearest neighbour has the same part of speech: a proportion we'll call the accuracy.

In [14]:
import numpy as np

def evaluate(model, word2pos):
    same = 0
    for word in tqdm(model.wv.key_to_index):
        most_similar = model.wv.similar_by_word(word, topn=1)[0][0]
        if word2pos[most_similar] == word2pos[word]:
            same += 1
    return same/len(model.wv.key_to_index)

evaluate(model, word2pos)

  0%|          | 0/3099 [00:00<?, ?it/s]

0.6479509519199742
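To put a score like this in context, it can help to compare it with a chance baseline. The sketch below assumes, purely for illustration, that each word's neighbour is drawn at random from the vocabulary, so that the expected agreement is the sum of the squared part-of-speech proportions. The function `chance_agreement` and the toy dictionary are hypothetical.

```python
from collections import Counter


def chance_agreement(word2pos):
    """Expected same-POS rate if neighbours were drawn at random."""
    counts = Counter(word2pos.values())
    total = sum(counts.values())
    # Probability that two independent draws land on the same tag
    return sum((n / total) ** 2 for n in counts.values())


# Made-up tags for five words: 3 nouns, 1 verb, 1 adjective
toy_word2pos = {
    "translation": "NOUN",
    "model": "NOUN",
    "corpus": "NOUN",
    "train": "VERB",
    "neural": "ADJ",
}

print(round(chance_agreement(toy_word2pos), 2))
```

Running `chance_agreement(word2pos)` on the real tag dictionary would give the figure to compare the accuracy against. A vocabulary dominated by nouns will have a fairly high baseline.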

Now we vary some of the settings we introduced above. In particular, we're interested in the influence of the embedding size (the dimensionality of the trained embeddings) and the size of the context window. We vary the embedding size between 100, 200 and 300, and the context window between 2, 5 and 10. This means we'll train 9 models in total, which obviously takes a bit of time. Feel free to go grab a coffee.

In [17]:
sizes = [100, 200, 300]
windows = [2,5,10]

df = pd.DataFrame(index=windows, columns=sizes)

for size in sizes:
    for window in windows:
        print("Size:", size, "Window:", window)
        model = gensim.models.Word2Vec(documents, min_count=100, window=window, vector_size=size)
        acc = evaluate(model, word2pos)
        # Assign in a single step with .loc; chained indexing (df[size][window]) is unreliable in pandas
        df.loc[window, size] = acc

df

Size: 100 Window: 2


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 100 Window: 5


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 100 Window: 10


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 200 Window: 2


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 200 Window: 5


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 200 Window: 10


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 300 Window: 2


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 300 Window: 5


  0%|          | 0/3099 [00:00<?, ?it/s]

Size: 300 Window: 10


  0%|          | 0/3099 [00:00<?, ?it/s]

| window \ vector_size | 100 | 200 | 300 |
|---|---|---|---|
| 2 | 0.680865 | 0.679897 | 0.674088 |
| 5 | 0.655373 | 0.644079 | 0.643756 |
| 10 | 0.616651 | 0.622459 | 0.623104 |


Although the accuracies of all models are very similar, the results do show some interesting patterns.

First, it looks like smaller contexts work better than larger ones. This is logical, as our evaluation metric is a syntactic one: the closest context words contain much more useful information about the part of speech of a word than those further away.

Second, higher-dimensional word embeddings do not necessarily work better than lower-dimensional ones. This may sound counter-intuitive, as higher-dimensional embeddings are able to capture more information. Still, larger embeddings also require more data, while we're using a pretty small corpus.

In [None]:
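# The frame was created empty, so its columns have dtype object; cast to float before plotting
df.astype(float).plot()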

# Conclusions

Word embeddings are one of the most exciting trends in Natural Language Processing since the 2000s. They allow us to model the meaning and usage of a word, and discover words that behave similarly. This is crucial for the generalization capacity of many machine learning models. Moving from raw strings to embeddings allows them to generalize across words that have a similar meaning, and discover patterns that had previously escaped them.

### **Simple Task for you**
Download a pretrained word2vec word embedding model, run the similarity checks above on it, then evaluate it and compare the results with those of the model trained in this notebook.

In [None]:
# Your code goes here

# ***LSTM & CNN for classification***
This part describes how to implement LSTM models for binary text classification using TensorFlow and Keras.

In [None]:
import random
from datasets import load_dataset
from sklearn.model_selection import train_test_split

RANDOM_SEED = 500
VALIDATION_SIZE = 0.2

imdb = load_dataset("imdb")

# Convert the training split to a pandas DataFrame so it can be split with scikit-learn
df_train = imdb['train'].to_pandas()

# Split the data
train_split, validation_split = train_test_split(
    df_train, 
    test_size=VALIDATION_SIZE, 
    random_state=RANDOM_SEED
)

# Extract as lists
train_txt = train_split['text'].tolist()
train_lbl = train_split['label'].tolist()
validation_txt = validation_split['text'].tolist()
validation_lbl = validation_split['label'].tolist()

test_txt = imdb['test']['text']
test_lbl = imdb['test']['label']

print(f"Training samples: {len(train_txt)}")
print(f"Validation samples: {len(validation_txt)}")

## Vectorise the dataset and build the vocabulary

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

MAX_LENGTH = 300
MAX_VOCAB_SIZE = 20000
BATCH_SIZE = 128

# Batch the raw training texts as a TensorFlow dataset
text_ds = tf.data.Dataset.from_tensor_slices(train_txt).batch(BATCH_SIZE)

# Build a vectorizer that maps each text to a fixed-length sequence of token ids
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=MAX_VOCAB_SIZE, 
    output_sequence_length=MAX_LENGTH
)
vectorizer.adapt(text_ds)

print("Vectorizer adapted successfully!")

In [None]:
vectorizer.get_vocabulary()[:5]

In [None]:
output = vectorizer([["You are welcome to the RANLP conference"]])
output.numpy()[0, :8]
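What the vectorizer does can be approximated in a few lines of plain Python: lower-case the text, strip punctuation, split on whitespace, map tokens to integer ids, then pad or truncate to a fixed length. The sketch below is only an approximation under stated assumptions. The helper names are hypothetical, the punctuation regex is simpler than the one Keras uses, and ties in frequency may be ordered differently. It does follow the layer's default convention of reserving index 0 for padding and index 1 for out-of-vocabulary tokens.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case, drop punctuation, split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(texts, max_tokens):
    """Most frequent tokens first, after the two reserved entries."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common  # 0 = padding, 1 = out-of-vocabulary


def vectorize(text, vocab, length):
    """Map a text to exactly `length` token ids."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))


toy_texts = ["the movie was great", "the plot was thin"]
toy_vocab = build_vocab(toy_texts, max_tokens=10)
print(toy_vocab)
print(vectorize("The movie was AWESOME!", toy_vocab, length=6))  # [2, 4, 3, 1, 0, 0]
```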

## Download embeddings

In [None]:
!gdown "https://drive.google.com/uc?id=12tAq-AbroB9hHwi597lIyjktr3o62w8o"

In [None]:
# !wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
# !unzip -q glove.6B.zip
# !ls

## Create word index and embeddings index

In [None]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

path_to_glove_file = 'glove.6B.100d.txt'

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

## Build embeddings matrix

In [None]:
num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
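The key point of the matrix is that row i holds the vector of the token with id i, so the embedding layer can look vectors up by token id. A toy version with a made-up five-word vocabulary and three-dimensional vectors (all names and values here are illustrative assumptions) shows that padding, OOV and unknown words keep all-zero rows:

```python
import numpy as np

# Hypothetical vocabulary in vectorizer order: index 0 = padding, 1 = OOV
toy_voc = ["", "[UNK]", "movie", "great", "zzzunknown"]

# Hypothetical pretrained vectors; "zzzunknown" is deliberately missing
toy_embeddings = {
    "movie": np.array([0.1, 0.2, 0.3]),
    "great": np.array([0.4, 0.5, 0.6]),
}

toy_matrix = np.zeros((len(toy_voc), 3))
for i, word in enumerate(toy_voc):
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # row index equals token id

print(toy_matrix)
```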

## Create LSTM model

In [None]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras import layers

NUM_CLASSES = 2

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    name='embeddings'
)
# Named inputs/outputs to avoid shadowing Python's built-in input()
inputs = tf.keras.Input(shape=(None,), dtype="int64", name="input")
x = embedding_layer(inputs)
x = layers.LSTM(128, name="lstm_1", return_sequences=True)(x)
x = layers.LSTM(128, name="lstm_2")(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax", name="dense_predictions")(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="lstm_model")
model.summary()

## Train Model

In [None]:
x_train = vectorizer(np.array([[s] for s in train_txt])).numpy()
x_val = vectorizer(np.array([[s] for s in validation_txt])).numpy()

y_train = np.array(train_lbl)
y_val = np.array(validation_lbl)

LEARNING_RATE = 0.01
optimiser = keras.optimizers.Adam(learning_rate=LEARNING_RATE)
# optimiser = keras.optimizers.SGD(learning_rate=LEARNING_RATE)
# optimiser = keras.optimizers.RMSprop(learning_rate=LEARNING_RATE)
model.compile(
    loss="sparse_categorical_crossentropy", optimizer=optimiser, metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=256, epochs=3, validation_data=(x_val, y_val))

x_test = vectorizer(np.array([[s] for s in test_txt])).numpy()
y_test = np.array(test_lbl)
scores = model.evaluate(x_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

## Early Stopping

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2,min_delta=0.001)
model.fit(x_train, y_train, batch_size=64, epochs=5, validation_data=(x_val, y_val),callbacks=[callback])
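The stopping rule configured above can be sketched in plain Python. This is a simplified reading of the patience and min_delta settings, not the Keras source: a validation loss only counts as an improvement if it beats the best value so far by more than min_delta, and training stops once patience consecutive epochs pass without one. The function `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience=2, min_delta=0.001):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last real improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Epoch 3 improves by only 0.0005 (< min_delta), epoch 4 gets worse
toy_losses = [0.50, 0.40, 0.3995, 0.41, 0.39]
print(stopping_epoch(toy_losses))  # 4
```

Note that with epochs=5 and patience=2, as in the cell above, the callback can only cut training short if the validation loss stalls within the first few epochs.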

## Inferencing

In [None]:
string_input = keras.Input(shape=(1,), dtype="string", name="text_input")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

# Shape (1, 1): one sample with one string, matching the Input shape above
sample_text = tf.constant([["I like this movie"]], dtype=tf.string)

probabilities = end_to_end_model.predict(sample_text)
print("Probabilities:", probabilities)
print("Predicted class:", np.argmax(probabilities[0]))

# Bi-LSTM

Change the LSTM model above into a bidirectional LSTM (Bi-LSTM) model.

In [None]:
# Your code goes here