In [1]:
import numpy as np
from pprint import pprint
from gensim.models import Word2Vec
from gensim.models import KeyedVectors


# Word Embeddings

## Introduction

Word embeddings are vector representations of words that capture their semantic meaning. They are a fundamental component of many downstream natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and text summarization. They are typically generated by training a shallow neural network on a large corpus of text.



## Word2Vec

Word2Vec is a popular algorithm for generating word embeddings. Gensim is a Python library that provides an easy-to-use interface for training and using Word2Vec models. With Gensim, you can train Word2Vec models on your own text data or use pretrained models. In this notebook, we'll explore how to use Gensim to train a Word2Vec model on your own text data, load a pretrained model, and fine-tune one.

Though there are many implementations, Gensim is one of the more popular and more efficient ones.

https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html?highlight=king%20man




## Using Word2Vec

Word2Vec is an algorithm. To generate word embeddings with it, we need a trained model. There are three ways to get one:

1) Train your own model. You provide a large corpus of text, and the Word2Vec algorithm learns a vector representation for each word in the corpus.
2) Use a pretrained Word2Vec model. Many pretrained models are available online. They have been trained on large text corpora and can be used as-is.
3) Fine-tune a pretrained Word2Vec model on your own corpus of text. This is a good option if you have a small corpus and want to improve the performance of the pretrained model on your own data.

### 1) Train your own Word2Vec model

The following is a small toy example of how to train your own Word2Vec model using Gensim. The corpus has only four short sentences, so the resulting vectors are not meaningful; the goal is to show the API.

In [2]:
%%time

# Define a list of sentences to train on
sentences = [["I", "love", "chocolate"],
             ["Chocolate", "is", "my", "favorite", "food"],
             ["I", "love", "ice", "cream"],
             ["Ice", "cream", "is", "delicious"]]

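# Train a Word2Vec model on the sentences.
# min_count=1 keeps words that appear only once (the default of 5 would drop every word in this tiny corpus).
model = Word2Vec(sentences, min_count=1)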

CPU times: user 2.16 ms, sys: 1.77 ms, total: 3.93 ms
Wall time: 2.8 ms
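
One thing to notice in the toy corpus above is that tokens are case-sensitive: "Chocolate" and "chocolate" (and "Ice" and "ice") are separate vocabulary entries with separate vectors. A common preprocessing step is to lowercase tokens before training. The sketch below is a minimal illustration in plain Python. The helper name `lowercase_tokens` is made up for this example and is not part of Gensim, and whether lowercasing is appropriate depends on your data.

```python
# Hypothetical helper (not part of Gensim): lowercase every token so that
# "Chocolate" and "chocolate" map to the same vocabulary entry.
def lowercase_tokens(tokenized_sentences):
    return [[token.lower() for token in sentence] for sentence in tokenized_sentences]

raw_sentences = [["I", "love", "chocolate"],
                 ["Chocolate", "is", "my", "favorite", "food"],
                 ["I", "love", "ice", "cream"],
                 ["Ice", "cream", "is", "delicious"]]

normalized_sentences = lowercase_tokens(raw_sentences)

# Compare the number of distinct tokens before and after normalization
vocab_before = {token for sentence in raw_sentences for token in sentence}
vocab_after = {token for sentence in normalized_sentences for token in sentence}
print(len(vocab_before), len(vocab_after))
```

Passing `normalized_sentences` to `Word2Vec` instead of the raw list would give the model more occurrences of each word to learn from.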


In [3]:
%%time

# Get the vector representation of a word
vector = model.wv["chocolate"]

pprint(vector)

array([ 9.7702928e-03,  8.1651136e-03,  1.2809718e-03,  5.0975787e-03,
        1.4081288e-03, -6.4551616e-03, -1.4280510e-03,  6.4491653e-03,
       -4.6173059e-03, -3.9930656e-03,  4.9244044e-03,  2.7130984e-03,
       -1.8479753e-03, -2.8769434e-03,  6.0107317e-03, -5.7167388e-03,
       -3.2367026e-03, -6.4878250e-03, -4.2346325e-03, -8.5809948e-03,
       -4.4697891e-03, -8.5112294e-03,  1.4037776e-03, -8.6181965e-03,
       -9.9166557e-03, -8.2016252e-03, -6.7726658e-03,  6.6805850e-03,
        3.7845564e-03,  3.5616636e-04, -2.9579818e-03, -7.4283206e-03,
        5.3341867e-04,  4.9989222e-04,  1.9561886e-04,  8.5259555e-04,
        7.8633073e-04, -6.8160298e-05, -8.0070542e-03, -5.8702733e-03,
       -8.3829118e-03, -1.3120425e-03,  1.8206370e-03,  7.4171280e-03,
       -1.9634271e-03, -2.3252917e-03,  9.4871549e-03,  7.9704521e-05,
       -2.4045217e-03,  8.6048469e-03,  2.6870037e-03, -5.3439722e-03,
        6.5881060e-03,  4.5101536e-03, -7.0544672e-03, -3.2317400e-04,
      

In [4]:
# Find the most similar words to a given word
similar_words = model.wv.most_similar("chocolate")

pprint(similar_words)

[('love', 0.17272792756557465),
 ('Ice', 0.16694681346416473),
 ('food', 0.11117953062057495),
 ('Chocolate', 0.10941851884126663),
 ('cream', 0.07963486015796661),
 ('I', 0.04130808636546135),
 ('favorite', 0.03771297261118889),
 ('is', 0.008315940387547016),
 ('delicious', -0.005896823015064001),
 ('my', -0.07424271106719971)]
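
The scores returned by `most_similar` are cosine similarities, which range from -1 to 1. As a rough illustration of what that number measures, here is a minimal pure-Python sketch. The function name `cosine_similarity` and the three-dimensional vectors are invented for this example; Gensim computes the same quantity internally with NumPy.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors, purely for illustration
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score close to 0, and opposite vectors score close to -1. The scores near 0 in the toy output above suggest the model has seen too little text to place related words near each other.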


In [5]:
sorted_array = sorted(similar_words, key=lambda x: x[1])

pprint(sorted_array)

[('my', -0.07424271106719971),
 ('delicious', -0.005896823015064001),
 ('is', 0.008315940387547016),
 ('favorite', 0.03771297261118889),
 ('I', 0.04130808636546135),
 ('cream', 0.07963486015796661),
 ('Chocolate', 0.10941851884126663),
 ('food', 0.11117953062057495),
 ('Ice', 0.16694681346416473),
 ('love', 0.17272792756557465)]


### 2) Load a pretrained Word2Vec model

In this example we will use Google's word2vec-google-news-300 model, which contains 300-dimensional vectors for 3 million words and phrases. You can download the model file from the following GitHub repository, which mirrors the original release:
* https://github.com/mmihaltz/word2vec-GoogleNews-vectors/blob/master/GoogleNews-vectors-negative300.bin.gz
  
NOTE: You can also use the Gensim downloader to fetch the model. However, that download can be slow and may take longer than downloading the file from GitHub.




In [6]:
%%time

# Load a pre-trained Word2Vec model
model_path = "./data/GoogleNews-vectors-negative300.bin.gz" # change ./data if you do not have the data folder in the same directory as this file
model = KeyedVectors.load_word2vec_format(model_path, binary=True)

# or, as discussed above...
#import gensim.downloader as api
#model = api.load('word2vec-google-news-300')

CPU times: user 20.8 s, sys: 890 ms, total: 21.7 s
Wall time: 21.7 s


In [7]:
# Get the vector representation of a word
vector = model["house"]

pprint(vector)

array([ 1.57226562e-01, -7.08007812e-02,  5.39550781e-02, -1.89208984e-02,
        9.17968750e-02,  2.55126953e-02,  7.37304688e-02, -5.68847656e-02,
        1.79687500e-01,  9.27734375e-02,  9.03320312e-02, -4.12109375e-01,
       -8.30078125e-02, -1.45507812e-01, -2.37304688e-01, -3.68652344e-02,
        8.74023438e-02, -2.77099609e-02,  1.13677979e-03,  8.30078125e-02,
        3.57421875e-01, -2.61718750e-01,  7.47070312e-02, -8.10546875e-02,
       -2.35595703e-02, -1.61132812e-01, -4.78515625e-02,  1.85546875e-01,
       -3.97949219e-02, -1.58203125e-01, -4.37011719e-02, -1.11328125e-01,
       -1.05957031e-01,  9.86328125e-02, -8.34960938e-02, -1.27929688e-01,
       -1.39648438e-01, -1.86523438e-01, -5.71289062e-02, -1.17675781e-01,
       -1.32812500e-01,  1.55639648e-02,  1.34765625e-01,  8.39843750e-02,
       -9.03320312e-02, -4.12597656e-02, -2.51953125e-01, -2.27539062e-01,
       -6.64062500e-02, -7.66601562e-02,  5.15136719e-02,  5.90820312e-02,
        3.49609375e-01, -

In [8]:
# Find the most similar words to a given word
similar_words = model.most_similar("house")

pprint(similar_words)

[('houses', 0.7072390913963318),
 ('bungalow', 0.6878558993339539),
 ('apartment', 0.6628996729850769),
 ('bedroom', 0.6496937274932861),
 ('townhouse', 0.6384080052375793),
 ('residence', 0.6198420524597168),
 ('mansion', 0.6058192253112793),
 ('farmhouse', 0.5857570171356201),
 ('duplex', 0.5757936835289001),
 ('appartment', 0.5690325498580933)]


### 3) Fine-tune an existing pretrained model

The third option is to fine-tune an existing pretrained model on our own corpus of text. This is a good option if you have a small corpus and want to improve the performance of the pretrained model on your own data.

NOTE: This is tricky, so be careful. With a small corpus, you may not have enough data to fine-tune the model and may end up overfitting it to your data. With a large corpus, you may be able to fine-tune without overfitting, but you still may not improve the model's performance on your own data, in which case you are better off using the pretrained model as-is. I've included this option for completeness, but I don't recommend using it unless you have a good reason to do so (and much more experience than you have now).

In [9]:
%%time

# fine tune this model using sample sentences
new_sentences = [
    ["The", "p220", "laser", "printer", "can", "print", "15", "pages", "per", "minute"],
    ["The", "p220", "laser", "printer", "is", "a", "black", "and", "white", "printer"],
    ["The", "p320", "laser", "printer", "is", "a", "color", "printer"],
    ["The", "p320", "laser", "printer", "can", "print", "21", "pages", "per", "minute"],
    ["The", "p220", "laser", "printer", "is", "a", "fast", "printer"],
    ["The", "p320", "laser", "printer", "is", "a", "faster", "printer"],
    ["The", "p220", "laser", "printer", "is", "a", "cheap", "printer"],
    ["The", "p320", "laser", "printer", "is", "an", "expensive", "printer"],
    ["The", "p220", "laser", "printer", "is", "a", "reliable", "printer"],
    ["The", "p320", "laser", "printer", "is", "a", "reliable", "printer"],
    ["The", "p220", "laser", "printer", "is", "a", "small", "printer"],
    ["The", "p320", "laser", "printer", "is", "a", "large", "printer"],
    ["The", "p220", "laser", "printer", "is", "a", "light", "printer"],
    ["The", "p320", "laser", "printer", "is", "a", "heavy", "printer"],
    ["The", "p220", "laser", "printer", "is", "a", "loud", "printer"],
    ["The", "p320", "laser", "printer", "is", "a", "quiet", "printer"],
]

#model = Word2Vec(sentences, min_count=1)

model = Word2Vec(sentences=new_sentences, vector_size=300, window=5, min_count=1, workers=4)
#model.build_vocab(sentences)

total_examples = model.corpus_count
print('total_examples:', total_examples)

print('Total epochs to be used in training:', model.epochs)
# model.epochs is 5 by default; increasing it lengthens training and may improve the vectors.
model.train(new_sentences, total_examples=total_examples, epochs=model.epochs)

model.build_vocab(new_sentences) # (re)builds the vocabulary from the new sentences (for instance, the p220 and p320 printer names in our example)
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32) # one lock factor per word; gensim 4+ needs this array sized to the vocabulary before intersect_word2vec_format can set per-word values
model.wv.intersect_word2vec_format("./data/GoogleNews-vectors-negative300.bin.gz", lockf=0.0, binary=True) # copies pretrained vectors for words already in the vocabulary; lockf = 1.0 allows further training of those vectors, lockf = 0.0 freezes them

model.save("./data/fine-tuned_word2vec_model.bin")


total_examples: 16
Total epochs to be used in training: 5
CPU times: user 28.7 s, sys: 624 ms, total: 29.3 s
Wall time: 29.3 s


In [10]:
# Find the most similar words to a given word
similar_words = model.wv.most_similar("p220")

pprint(similar_words)


[('faster', 0.10351143032312393),
 ('p320', 0.0540100522339344),
 ('and', 0.05245136469602585),
 ('fast', 0.03871935233473778),
 ('white', 0.03503745049238205),
 ('black', 0.02766772173345089),
 ('pages', 0.012000232003629208),
 ('21', 0.002772286534309387),
 ('minute', -0.014604938216507435),
 ('printer', -0.018044397234916687)]
