<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/practical-natural-language-processing/3-text-representation/6_training_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word Embeddings

What does it mean when we say a text representation should capture “distributional similarities between words”?

Let’s consider some examples. If we’re given the word “USA,” distributionally similar words could be other countries (e.g., Canada, Germany, India, etc.) or cities in the USA. If we’re given the word “beautiful,” words that share some relationship with this word (e.g., synonyms, antonyms) could be considered distributionally similar words. These are words that are likely to occur in similar contexts.

In 2013, a seminal work by Mikolov et al. showed that their neural network–based word representation model known as “Word2vec,” based on “distributional similarity,” can capture word analogy relationships such as:

`King – Man + Woman ≈ Queen`

While learning such semantically rich relationships, Word2vec ensures that the learned word representations are low dimensional (vectors of dimensions 50–500, instead of several thousands) and dense (that is, most values in these vectors are non-zero).

Such representations make ML tasks more tractable and efficient. Word2vec led to a lot of work (both pure and applied) in the direction of learning text representations using neural networks. These representations are also called “embeddings.”

To “derive” the meaning of a word, Word2vec relies on distributional similarity and the distributional hypothesis. That is, it derives the meaning of a word from its context: the words that appear in its neighborhood in the text. So, if two different words often occur in similar contexts, it’s highly likely that their meanings are also similar.

Word2vec operationalizes this by projecting the meaning of the words in a vector space where words with similar meanings will tend to cluster together, and words with very different meanings are far from one another.

Conceptually, Word2vec takes a large corpus of text as input and “learns” to represent the words in a common vector space based on the contexts in which they appear in the corpus.
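To make the vector-space picture and the `King – Man + Woman ≈ Queen` analogy concrete, here is a minimal sketch. The 2-D vectors in `toy_vectors` are hypothetical values picked by hand for illustration, not learned embeddings, and the helper names `cosine_similarity` and `analogy` are our own.

```python
import math

# Hypothetical 2-D embeddings, hand-picked for illustration only
# (dimension 0 ~ "royalty", dimension 1 ~ "gender"); not learned from data.
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is usual for analogy lookups.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

nearest = analogy("king", "man", "woman", toy_vectors)
print(nearest)
print(cosine_similarity(toy_vectors["king"], toy_vectors["man"]))
```

Real embeddings have tens to hundreds of dimensions, and their individual dimensions are generally not interpretable the way these hand-made ones are.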

## Training our own embeddings

Now we’ll focus on training our own word embeddings. For this, we’ll look at two architectural variants that were proposed in the original Word2vec approach. The two variants are:

- **Continuous bag of words (CBOW)**
- **SkipGram**

The two variants are similar in many respects. We’ll begin by understanding the CBOW model, then we’ll look at SkipGram.

Throughout this section, we’ll use the sentence “The quick brown fox jumps over the lazy dog” as our toy corpus.

## Continuous bag of words (CBOW)

In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.

A language model is a (statistical) model that tries to give a probability distribution over sequences of words. Given a sentence of, say, n words, it assigns a probability $Pr(w_1, w_2, \ldots, w_n)$ to the whole sentence.

The objective of a language model is to assign probabilities in such a way that it gives high probability to “good” sentences and low probabilities to “bad” sentences.

By good, we mean sentences that are semantically and syntactically correct. By bad, we mean sentences that are incorrect—semantically or syntactically or both. So, for a sentence like “The cat jumped over the dog,” it will try to assign a probability close to 1.0, whereas for a sentence like “jumped over the the cat dog,” it tries to assign a probability close to 0.0.
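One standard way to compute such a sentence probability is the chain rule of probability, which factors it into one conditional probability per word. This is a general identity rather than anything specific to Word2vec:

$$
Pr(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} Pr(w_i \mid w_1, \ldots, w_{i-1})
$$

A model that assigns a high conditional probability to every word of “The cat jumped over the dog” therefore assigns a high probability to the sentence as a whole.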

CBOW tries to learn a language model that predicts the “center” word from the words in its context.

Let’s understand this using our toy corpus. If we take the word “jumps” as the center word, then its context is formed by the words in its vicinity. With a context size of 2, the context for our example is given by brown, fox, over, the. CBOW uses the context words to predict the target word, “jumps.” It does this for every word in the corpus; i.e., it takes every word in the corpus as the target word and tries to predict it from its corresponding context words.

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/cbow.png?raw=1' width='800'/>

The idea is then extended to the entire corpus to build the training set. Details are as follows: we run a sliding window of size 2k+1 over the text corpus. For our example, we took k as 2. Each position of the window marks the set of 2k+1 words that are under consideration.

<img src='https://github.com/rahiakela/img-repo/blob/master/deeplearning.ai-NLPS/sliding-window.png?raw=1' width='800'/>

The center word in the window is the target, and k words on either side of the center word form the context. This gives us one data point. If the point is represented as (X,Y), then the context is the X and the target word is the Y. A single data point is a pair: (the 2k indices of the words in the context, the index of the target word). To get the next data point, we simply shift the window to the right on the corpus by one word and repeat the process. This way, we slide the window across the entire corpus to create the training set.
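The sliding-window procedure can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the toy sentence is lowercased and that only full windows are used (so the first and last k words never act as the target); the function name `cbow_training_pairs` is our own.

```python
def cbow_training_pairs(tokens, k=2):
    """Slide a window of size 2k+1 over tokens; return (context, target) pairs.

    Assumption: only full windows are used, so the first and last k words
    never act as the target.
    """
    pairs = []
    for center in range(k, len(tokens) - k):
        context = tokens[center - k:center] + tokens[center + 1:center + k + 1]
        pairs.append((context, tokens[center]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()

# Map each distinct word to an integer index, in order of first appearance.
word_to_index = {}
for token in tokens:
    word_to_index.setdefault(token, len(word_to_index))

cbow_pairs = cbow_training_pairs(tokens, k=2)
for context, target in cbow_pairs:
    x = [word_to_index[w] for w in context]  # the 2k context indices (X)
    y = word_to_index[target]                # the index of the target word (Y)
    print(context, "->", target, "|", x, "->", y)
```

Each printed line is one (X, Y) data point; the third one corresponds to the “jumps” example discussed earlier.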

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/cbow-corpus.png?raw=1' width='800'/>

Now that we have the training data ready, let’s focus on the model. For this, we construct a shallow net (it’s shallow since it has a single hidden layer).
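To give a feel for what such a shallow net computes, here is a minimal forward-pass sketch in plain Python. It assumes the common CBOW formulation: the hidden layer is the average of the context word embeddings, and the output layer is a full softmax over the vocabulary. Library implementations such as Gensim use faster approximations of the softmax, and the names `E`, `W`, `cbow_forward`, and the embedding size of 4 are illustrative choices of ours.

```python
import math
import random

random.seed(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}
vocab_size, embedding_dim = len(vocab), 4  # embedding_dim is an arbitrary toy value

# Two randomly initialized weight matrices:
# E holds the input embeddings (V x d), W the output weights (d x V).
E = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)] for _ in range(vocab_size)]
W = [[random.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(embedding_dim)]

def cbow_forward(context_words):
    """Return a probability for every vocabulary word being the center word."""
    rows = [E[word_to_index[w]] for w in context_words]
    # Hidden layer: the average of the context word embeddings.
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    # Output layer: one score per vocabulary word, turned into probabilities by softmax.
    scores = [sum(hidden[d] * W[d][j] for d in range(embedding_dim)) for j in range(vocab_size)]
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["brown", "fox", "over", "the"])
loss = -math.log(probs[word_to_index["jumps"]])  # cross-entropy for the true center word
print(len(probs), round(sum(probs), 6), loss)
```

In this formulation, training adjusts `E` and `W` to reduce the loss over all data points, and the rows of `E` are then used as the word embeddings.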

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/cbow-model.png?raw=1' width='800'/>


#### Training CBOW Embeddings Using Gensim

Word embeddings are an approach to representing text in NLP. In this notebook, we demonstrate how to train embeddings using Gensim. [Gensim](https://radimrehurek.com/gensim/index.html) is an open source Python library for natural language processing, with a focus on topic modeling.

In [None]:
from gensim.models import Word2Vec

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Define training data.
# Gensim's Word2Vec expects a list of lists for training: the outer list holds the documents,
# and each inner list holds the tokens of one document.
corpus = [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats', 'meat'], ['man', 'eats', 'food']]

# Train the model
model_cbow = Word2Vec(corpus, min_count=1, sg=0)  # sg=0 selects the CBOW architecture

In [None]:
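# Summarize the trained model
print(model_cbow)

# Summarize vocabulary (Gensim 3.x API; Gensim 4.x exposes this as model_cbow.wv.key_to_index)
words = list(model_cbow.wv.vocab)
print(words)

# Access the vector for one word through the KeyedVectors object
print(model_cbow.wv['dog'])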

Word2Vec(vocab=6, size=100, alpha=0.025)
['dog', 'bites', 'man', 'eats', 'meat', 'food']
[ 0.00367401  0.0018915  -0.00268964 -0.00114343 -0.00424361  0.00235518
 -0.00083533 -0.00069819  0.00207098 -0.00392796  0.00096701  0.00371491
  0.00135592 -0.00066026  0.00498563  0.00217539  0.00380146 -0.00379186
  0.00153982  0.00369387  0.00249354  0.00405935 -0.00121351 -0.0018121
 -0.00486681  0.00206851 -0.00200907  0.00457367 -0.00095888 -0.0002709
 -0.00102884  0.00301103 -0.00188044 -0.00229532 -0.00333086  0.00416533
  0.00011025  0.00483275  0.0023524  -0.00180897 -0.00354142  0.00171692
 -0.00058051 -0.0023132  -0.00347794 -0.00094115 -0.00231837  0.0021582
  0.00209103 -0.00307385 -0.00237195 -0.00423406  0.00348398  0.00466104
  0.00306665  0.00388333 -0.00422608 -0.00268456 -0.00384427 -0.00076647
  0.0015278  -0.00109259 -0.00322287  0.00179591  0.00460067 -0.00202937
  0.00254326  0.00462279  0.00418871 -0.00152545  0.00296148  0.00081408
 -0.00118316 -0.00012839 -0.00050644  

In [None]:
# Compute cosine similarity between word vectors
print("Similarity between eats and bites: ", model_cbow.wv.similarity('eats', 'bites'))
print("Similarity between eats and man: ", model_cbow.wv.similarity('eats', 'man'))

Similarity between eats and bites:  -0.100368656
Similarity between eats and man:  0.009752579


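From the above similarity scores, eats is actually scored as closer to man (0.0098) than to bites (-0.1004), the opposite of what we would expect. With a corpus of only four short sentences, the vectors most likely remain close to their random initial values, so these scores are not meaningful; a much larger corpus is needed before the similarities reflect word usage.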

In [None]:
# Most similar words
model_cbow.wv.most_similar('meat')

[('dog', 0.07002865523099899),
 ('food', 0.003390274941921234),
 ('eats', -0.02171921730041504),
 ('man', -0.04722334444522858),
 ('bites', -0.0998893454670906)]

In [None]:
# save model
model_cbow.save('model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=6, size=100, alpha=0.025)


## SkipGram

SkipGram is very similar to CBOW, with some minor changes. In SkipGram, the task is to predict the context words from the center word.

For our toy corpus with context size 2, using the center word “jumps,” we try to predict every word in context—“brown,” “fox,” “over,” “the”—as shown below. This constitutes one step. SkipGram repeats this one step for every word in the corpus as the center word.

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/skipgram.png?raw=1' width='800'/>

The dataset to train a SkipGram is prepared as follows: we run a sliding window of size 2k+1 over the text corpus to get the set of 2k+1 words that are under consideration. The center word in the window is the X, and k words on either side of the center word are Y.

**Unlike CBOW, each window position gives us 2k data points. A single data point consists of a pair: (index of the center word, index of a target word).** We then shift the window to the right on the corpus by one word and repeat the process. This way, we slide the window across the entire corpus to create the training set.
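The SkipGram data preparation can be sketched in the same way. This is a minimal illustration, again assuming a lowercased toy sentence and full windows only; the function name `skipgram_training_pairs` is our own.

```python
def skipgram_training_pairs(tokens, k=2):
    """Slide a window of size 2k+1 over tokens; return (center, context) pairs.

    Assumption: only full windows are used, so every window yields 2k pairs.
    """
    pairs = []
    for center in range(k, len(tokens) - k):
        for offset in range(-k, k + 1):
            if offset != 0:  # skip the center word itself
                pairs.append((tokens[center], tokens[center + offset]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
skipgram_pairs = skipgram_training_pairs(tokens, k=2)

# The 2k = 4 data points produced by the window centered on "jumps".
jumps_pairs = [pair for pair in skipgram_pairs if pair[0] == "jumps"]
print(len(skipgram_pairs))
print(jumps_pairs)
```

Where CBOW produces one data point with 2k inputs per window, SkipGram produces 2k data points with one input each.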

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/skipgram-corpus.png?raw=1' width='800'/>

The shallow network used to train the SkipGram model is very similar to the network used for CBOW, with some minor changes.

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/skipgram-model.png?raw=1' width='800'/>



#### Training SkipGram Embeddings Using Gensim

To use both the CBOW and SkipGram algorithms in practice, there are several available implementations that abstract the mathematical details for us. One of the most commonly used implementations is gensim.

In [None]:
# Train the model
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1 selects the SkipGram architecture

In skipgram, the task is to predict the context words from the center word.

In [None]:
# Summarize the loaded model
print(model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


In [None]:
# Summarize vocabulary
words = list(model_skipgram.wv.vocab)
print(words)

['dog', 'bites', 'man', 'eats', 'meat', 'food']


In [None]:
# Access the vector for one word through the KeyedVectors object
print(model_skipgram.wv['dog'])

[ 0.00367401  0.0018915  -0.00268964 -0.00114343 -0.00424361  0.00235518
 -0.00083533 -0.00069819  0.00207098 -0.00392796  0.00096701  0.00371491
  0.00135592 -0.00066026  0.00498563  0.00217539  0.00380146 -0.00379186
  0.00153982  0.00369387  0.00249354  0.00405935 -0.00121351 -0.0018121
 -0.00486681  0.00206851 -0.00200907  0.00457367 -0.00095888 -0.0002709
 -0.00102884  0.00301103 -0.00188044 -0.00229532 -0.00333086  0.00416533
  0.00011025  0.00483275  0.0023524  -0.00180897 -0.00354142  0.00171692
 -0.00058051 -0.0023132  -0.00347794 -0.00094115 -0.00231837  0.0021582
  0.00209103 -0.00307385 -0.00237195 -0.00423406  0.00348398  0.00466104
  0.00306665  0.00388333 -0.00422608 -0.00268456 -0.00384427 -0.00076647
  0.0015278  -0.00109259 -0.00322287  0.00179591  0.00460067 -0.00202937
  0.00254326  0.00462279  0.00418871 -0.00152545  0.00296148  0.00081408
 -0.00118316 -0.00012839 -0.00050644  0.00136482  0.00319222 -0.00354986
  0.00442137 -0.00309065  0.00274137  0.00032026  0.00

In [None]:
# Compute cosine similarity between word vectors
print("Similarity between eats and bites:", model_skipgram.wv.similarity('eats', 'bites'))
print("Similarity between eats and man:", model_skipgram.wv.similarity('eats', 'man'))

Similarity between eats and bites: -0.100360826
Similarity between eats and man: 0.009756193


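As with CBOW, eats is scored as closer to man (0.0098) than to bites (-0.1004), the opposite of what we would expect. The scores are also almost identical to the CBOW ones, which suggests that on this four-sentence corpus the vectors have barely moved from their random initial values, so the similarities are not meaningful.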

In [None]:
# Most similar words
model_skipgram.wv.most_similar("meat")

[('dog', 0.07002865523099899),
 ('food', 0.003390274941921234),
 ('eats', -0.021795928478240967),
 ('man', -0.04722334444522858),
 ('bites', -0.0998893529176712)]

In [None]:
# save model
model_skipgram.save('model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=6, size=100, alpha=0.025)


## Training Your Embedding on Wiki Corpus

The corpus can be downloaded from:

https://dumps.wikimedia.org/enwiki/20200120/

The entire wiki corpus as of 28/04/2020 is just over 16GB in size.
Due to computational constraints, we will use only part of this corpus to train our Word2vec and fastText embeddings.


In [None]:
!mkdir -p data/en/
!wget -P data/en/ https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p6197595p7697594.bz2

In [None]:
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec
from gensim.models.fasttext import FastText
import time

In [None]:
# Preparing the Training data
wiki = WikiCorpus("data/en/enwiki-latest-pages-articles-multistream14.xml-p6197595p7697594.bz2", lemmatize=False, dictionary={})
sentences = list(wiki.get_texts())

#### Hyperparameters

1.   **sg** - Selects the training algorithm: 1 for SkipGram, 0 for CBOW. The default is CBOW.
2.   **min_count** - Ignores all words with a total frequency lower than this.<br>
There are many more hyperparameters; the full list can be found in the official documentation [here.](https://radimrehurek.com/gensim/models/word2vec.html)

##### CBOW

In [None]:
start = time.time()

# CBOW
word2vec_cbow = Word2Vec(sentences, min_count=10, sg=0)
end = time.time()

print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} hrs ".format((end-start) / 3600.0))

CBOW Model Training Complete.
Time taken for training is:0.24 hrs 


In [None]:
# Summarize the loaded model
print(word2vec_cbow)
print("-"*30)

# Summarize vocabulary
words = list(word2vec_cbow.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(word2vec_cbow.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", word2vec_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=162668, size=100, alpha=0.025)
------------------------------
------------------------------
[ 1.90431190e+00 -5.51376462e-01  7.97884941e-01  4.51891613e+00
  1.83556151e+00 -1.25354812e-01 -3.44340611e+00  1.92727733e+00
  1.66459754e-02 -2.63959318e-02 -4.01172304e+00 -1.64397676e-02
 -3.56255442e-01  1.13677466e+00 -4.75881004e+00 -1.36026585e+00
 -4.36697483e-01  3.92295957e+00  4.92906272e-01  1.43513417e+00
  2.05463266e+00 -2.90527773e+00 -7.83046424e-01  1.01278067e+00
 -1.01803839e+00  1.52562940e+00  2.87656808e+00  3.77922505e-01
 -1.33817863e+00 -6.68153644e-01 -3.13017058e+00 -5.41101694e-01
 -1.04485846e+00 -2.49612063e-01 -6.30465150e-01 -1.39774442e-01
  3.12476897e+00  1.30469701e-03 -4.02023345e-02 -1.75379765e+00
  1.31179202e+00  1.07884556e-01  6.26050889e-01  3.39268237e-01
 -3.06691599e+00  3.42720532e+00  1.59238458e+00  4.08480138e-01
  9.38098729e-01  1.16553712e+00 -2.44246531e+00  7.20634520e-01
 -2.15016437e+00 -9.75300848e-01 -4.25113589e-0

In [None]:
# Save the word vectors in word2vec format (text format by default, despite the .bin extension)
from gensim.models import Word2Vec, KeyedVectors
word2vec_cbow.wv.save_word2vec_format('word2vec_cbow.bin')

# Load the word vectors back as a KeyedVectors object
new_modelword2vec_cbow = KeyedVectors.load_word2vec_format('word2vec_cbow.bin')
print(new_modelword2vec_cbow)

<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7fe3b56a74e0>


In [None]:
# Inspect the model by looking for the most similar words for a test word.
print(new_modelword2vec_cbow.most_similar('computer', topn=5))
# Let us see what the 100-dimensional vector for 'computer' looks like.
print(new_modelword2vec_cbow['computer'])

[('computers', 0.7920218110084534), ('software', 0.752328634262085), ('computing', 0.7421990633010864), ('hardware', 0.7111683487892151), ('automation', 0.7084893584251404)]
[ 1.65838695e+00 -1.07701540e+00 -1.01486778e+00  1.43544555e+00
  2.91654253e+00  1.20623636e+00 -1.96161366e+00 -2.26975009e-01
 -1.39565706e+00 -2.78424412e-01  7.63749540e-01  2.23809791e+00
 -8.51706564e-01 -2.51426548e-01 -5.26211381e-01 -4.44194460e+00
 -2.28046989e+00  2.37675285e+00 -8.17845643e-01  2.57680964e+00
  7.50591397e-01  1.39947844e+00 -3.60294014e-01  3.50890577e-01
  1.07450080e+00 -3.66429067e+00 -1.29366207e+00  2.02568229e-02
 -2.53869581e+00 -6.86380506e-01  5.84168911e-01 -1.16385150e+00
 -9.85027403e-02 -3.32514852e-01  2.45172763e+00 -6.85921848e-01
  1.31153536e+00 -5.70244133e-01  2.78922176e+00 -4.60699722e-02
 -9.63826001e-01 -4.60999012e+00  1.24686623e+00 -3.25452924e+00
 -6.86777430e-03 -7.59981453e-01 -5.75147927e-01 -4.21143860e-01
 -1.97391897e-01 -2.27526593e+00  5.14465511e-

##### SkipGram

In [None]:
start = time.time()

# SkipGram
word2vec_skipgram = Word2Vec(sentences, min_count=10, sg=1)
end = time.time()

print("SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start) / 3600.0))

SkipGram Model Training Complete
Time taken for training is:0.76 hrs 


In [None]:
# Summarize the loaded model
print(word2vec_skipgram)
print("-"*30)

# Summarize vocabulary
words = list(word2vec_skipgram.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(word2vec_skipgram.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", word2vec_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", word2vec_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

Word2Vec(vocab=162668, size=100, alpha=0.025)
------------------------------
------------------------------
[-0.23111866  0.05441886  0.03372803  0.13707177 -0.21057019 -0.21719156
 -0.50942016 -0.0959969   0.10738868 -0.22697847 -0.2006202   0.09307311
  0.49558416  0.09980547 -0.32388633  0.8389933  -0.26217332  0.15804051
  0.04422802 -0.6422501  -0.13988324  0.17808583  0.2748994   0.406608
  0.16889557  0.40298256  0.3314519  -0.38180158 -0.08413954  0.03168397
  0.21765189 -0.29584822 -0.20922557 -0.20921154  0.04986164 -0.17977662
  0.3854816   0.00452686 -0.14921552  0.08367458 -0.06871274 -0.46198443
 -0.00661817  0.00296341 -0.03707964  0.3179535   0.18510874 -0.00520774
 -0.22150153  0.12291416 -0.3287973   0.27427495 -0.5516082  -0.31229776
 -0.22935799  0.30577254 -0.6270615   0.3815057   0.47746992 -0.08065654
 -0.3544498   0.23296663  0.25502983  0.48915774  0.30910882  0.23649693
 -0.20445469  0.4176479  -0.17707215  0.42989367  0.13742803 -0.28408983
 -0.20521085  0.25

In [None]:
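# Save the word vectors in word2vec format (text format by default, despite the .bin extension)
from gensim.models import KeyedVectors
word2vec_skipgram.wv.save_word2vec_format('model_skipgram.bin')

# Load the word vectors back as a KeyedVectors object
new_model_skipgram = KeyedVectors.load_word2vec_format('model_skipgram.bin')
print(new_model_skipgram)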

<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7fe292b31860>


In [None]:
# Inspect the model by looking for the most similar words for a test word.
print(new_model_skipgram.most_similar('computer', topn=5))
# Let us see what the 100-dimensional vector for 'computer' looks like.
print(new_model_skipgram['computer'])

[('computers', 0.8504198789596558), ('computing', 0.8227981328964233), ('mainframe', 0.7985801100730896), ('software', 0.7935036420822144), ('technology', 0.7695398330688477)]
[ 0.10162751 -0.07809585 -0.01056721  0.54227394  0.29926765 -0.13967286
 -0.46590078  0.03058971 -0.08964784 -0.18385878 -0.09924877  0.35488087
  0.20336919 -0.20089795 -0.14352345  0.20861967 -0.01653686  0.05604462
 -0.07125833  0.18457292 -0.7390356  -0.44062987  0.6726722  -0.16653307
 -0.05239529  0.04981231 -0.00443736 -0.06904971  0.00559284 -0.19760463
 -0.06876412  0.04488349 -0.82058173  0.26111662  0.14513853 -0.07710455
  0.11813916 -0.03395562 -0.32472324 -0.31857947 -0.47601563 -0.07920008
  0.6101308  -0.18105038 -0.17018622 -0.18113896  0.05818631 -0.08810731
 -0.33484054  0.307189    0.02268581 -0.22856185 -0.39135367  0.38185912
 -0.2928929  -0.07489775 -0.5594251  -0.18492967  0.5026957   0.42202488
  0.04350985  0.03074365 -0.44976184  0.5416055  -0.03505092  0.09679759
  0.6513282   0.47367

## Going Beyond Words

So far, we’ve seen examples of how to use pre-trained word embeddings and train our own word embeddings. This gives us a compact and dense representation for words in our vocabulary. However, in most NLP applications, we seldom deal with atomic units like words—we deal with sentences, paragraphs, or even full texts. So, we need a way to represent larger units of text.

A simple approach is to break the text into constituent words, take the embeddings for individual words, and combine them to form the representation for the text. There are various ways to combine them, the most popular being sum, average, etc., but these may not capture many aspects of the text as a whole, such as ordering. Surprisingly, they work very well in practice.
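As a minimal sketch of the averaging approach, the snippet below combines word vectors into one text vector. The 3-D values in `toy_embeddings` are hypothetical numbers made up for illustration (real vectors would come from a trained or pre-trained model), and `average_vector` is a helper name of our own.

```python
# Hypothetical 3-D word vectors, made up for illustration only.
toy_embeddings = {
    "canada":  [0.2, 0.8, 0.1],
    "is":      [0.0, 0.1, 0.0],
    "a":       [0.0, 0.0, 0.1],
    "large":   [0.6, 0.2, 0.4],
    "country": [0.4, 0.9, 0.3],
}

def average_vector(tokens, embeddings):
    """Represent a text as the element-wise average of its word vectors."""
    vectors = [embeddings[token] for token in tokens]
    return [sum(values) / len(vectors) for values in zip(*vectors)]

sentence = "canada is a large country".split()
sentence_vector = average_vector(sentence, toy_embeddings)
shuffled_vector = average_vector(list(reversed(sentence)), toy_embeddings)

print(sentence_vector)
print(shuffled_vector)
```

Reversing the word order yields the same vector (up to floating-point rounding), which is exactly the loss of ordering information mentioned above.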

#### spaCy

It’s always a good idea to experiment with this before moving to other representations. The following code shows how to obtain the vector representation for text by averaging word vectors using the library spaCy.

In [None]:
import spacy

# Load the spacy model that we already installed in Chapter 2. This takes a few seconds.
%time nlp = spacy.load('en_core_web_sm')
# process a sentence using the model
mydoc = nlp("Canada is a large country")
# Get a vector for individual words
print(mydoc[0].vector)  # vector for 'Canada', the first word in the text
print(mydoc.vector)   # Averaged vector for the entire sentence

CPU times: user 2.12 s, sys: 71.6 ms, total: 2.19 s
Wall time: 2.2 s
[ 0.21828675 -1.8616467  -1.8246782   3.9640498   3.1702113   3.4044275
  0.01638174  1.0882976   4.298643    2.6220412   4.9655504  -0.55880976
  0.40846556 -0.5311527  -1.936613    0.4890322   0.01409918  2.017309
  1.9619753  -0.3345103  -2.0549996  -1.9780366   1.4000814  -3.9780545
 -1.4757934  -0.0758431  -0.15845299 -2.4204073  -0.22936638 -2.2050803
 -0.3578331  -1.9166979  -1.1512874  -2.2362876   3.0028348  -2.8118327
  4.337387    0.9023212  -1.3791567   1.4033097   0.36686432  1.2876883
 -0.96107507 -4.0383596  -2.529714    1.3005439  -0.3787139  -0.9970173
  0.580034    3.9643373  -0.5534672  -1.7696033  -1.928772   -1.1359324
 -4.5521493   2.0064287   3.4537764   0.8355992   2.3804865  -2.7131867
 -1.2354802  -0.41219887 -0.83612657  0.42878282  3.4744697  -0.57708937
  2.292963   -4.568878    1.1227657   1.0749986  -3.4492185   1.0809329
 -1.3786955   0.21817788  1.15269    -0.93688565  1.9326314  -2.35

Both pre-trained and self-trained word embeddings depend on the vocabulary they see in the training data. However, there is no guarantee that we will only see those words in the production data for the application we’re building. Despite the ease of using Word2vec or any such word embedding to do feature extraction from texts, we don’t have a good way of handling out-of-vocabulary (OOV) words yet. This has been a recurring problem in all the representations we’ve seen so far.

#### fastText

There are also other approaches that handle the OOV problem by modifying the training process by bringing in characters and other subword-level linguistic components.

Let’s look at one such approach now. The key idea is that one can potentially handle the OOV problem by using subword information, such as morphological properties (e.g., prefixes, suffixes, word endings, etc.), or by using character representations.

**fastText**, from Facebook AI Research, is one of the popular algorithms that follow this approach. A word can be represented by its constituent character n-grams. Following a similar architecture to Word2vec, fastText learns embeddings for words and character n-grams together and views a word’s embedding vector as an aggregation of its constituent character n-grams. This makes it possible to generate embeddings even for words that are not present in the vocabulary.

Say there’s a word, “gregarious,” that’s not found in the embedding’s word vocabulary. We break it into character n-grams (gre, reg, ega, …, ous) and combine the embeddings of these n-grams to arrive at the embedding of “gregarious.”
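The n-gram splitting step can be sketched as follows. This is a simplified illustration rather than fastText’s actual implementation; the function name `char_ngrams` and its `boundaries` flag are our own.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of word, left to right.

    With boundaries=True the word is first wrapped in '<' and '>',
    the markers fastText uses for word beginnings and endings.
    """
    text = f"<{word}>" if boundaries else word
    return [text[i:i + n] for i in range(len(text) - n + 1)]

trigrams = char_ngrams("gregarious", n=3)
marked_trigrams = char_ngrams("gregarious", n=3, boundaries=True)
print(trigrams)
print(marked_trigrams)
```

fastText itself works with several n-gram lengths at once (3 to 6 by default in Gensim’s `FastText`, controlled by `min_n` and `max_n`), so a single value of n is a simplification here.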

Gensim’s fastText wrapper can be used both for loading pre-trained models and for training models using fastText, in a way similar to Word2vec.

##### CBOW

In [None]:
start = time.time()
fasttext_cbow = FastText(sentences, sg=0, min_count=10)
end = time.time()

print("FastText CBOW Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start) / 3600.0))

FastText CBOW Model Training Complete
Time taken for training is:0.77 hrs 


In [None]:
# Summarize the loaded model
print(fasttext_cbow)
print("-"*30)

# Summarize vocabulary
words = list(fasttext_cbow.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(fasttext_cbow.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", fasttext_cbow.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_cbow.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=162668, size=100, alpha=0.025)
------------------------------
------------------------------
[-0.613545   -4.3581495  -3.8775482  -0.9321701  -3.359072   -2.3807166
 -3.432002    0.27316695  3.6700144  -4.854859    1.579452   -2.3757536
  1.9457221  -0.04775961  1.69137    -5.889899    5.475778    8.481184
  0.02071534 -1.9858435  -6.5664      1.1549423  -1.0983676  -0.913909
 -5.3280616  -0.0156348   4.2389894   0.1931456  -1.5580894  -0.23353629
 -2.5563855  -1.9006141  -3.568673    0.12540655  2.3656619   0.1127377
 -0.1109978  -0.35732678  0.619922    1.3349954   5.7399755  -0.58467007
 -1.5871537   1.8969312   7.1031523  -1.066831    1.6255599   3.301603
  1.927579   -3.5048401  -5.435487   -1.929023   -1.2138747  -3.1027207
 -1.3200952   1.6980739   4.472812    1.7617748   3.0159202  -3.7370062
 -5.3022366  -3.931471   -1.0397302  -2.6964092   6.311006   -1.9893135
 -5.0324836  -0.55707085  0.9929683  -3.839034    1.0199924  -0.1544579
 -0.42761108  6.06173     0.5

In [None]:
# Inspect the model by looking for the most similar words for a test word.
print(fasttext_cbow.wv.most_similar('computer', topn=5))
# Let us see what the 100-dimensional vector for 'computer' looks like.
print(fasttext_cbow.wv['computer'])

[('minicomputer', 0.9500270485877991), ('microcomputer', 0.9418912529945374), ('compute', 0.919439435005188), ('supercomputer', 0.9156723618507385), ('computers', 0.9115834832191467)]
[-1.6007822  -3.5363698  -1.5684026  -1.7621754  -1.5476772   0.58849007
 -0.5509931   1.2230161  -0.06600951 -0.26381722  1.1617024   3.7479646
  0.78999406 -2.1242847  -0.48248807 -1.0404977  -2.5311005   0.7052628
 -0.8575047   2.292256   -0.31231782 -1.4247077   1.32259     0.1455514
  0.83393234 -2.2491322  -0.03513742 -2.338033    1.3117082   0.5517698
 -3.2233133  -2.7003415   0.72980374 -0.97131485 -0.13022925  1.50669
  1.321786   -1.5551939   0.9832513   3.6322742  -0.29874444 -0.5923763
 -3.6576498   0.68468744  0.9369087  -2.057765   -3.203215    2.194668
  0.35317063  0.36316088 -1.6526406  -1.692081   -3.9637175  -0.66239154
 -0.89309835 -0.6420646  -0.15519533  0.47471026 -1.748186   -0.44098687
  0.99130476 -3.2163641   0.88699645 -3.7377868  -0.09469435 -2.21426
 -1.540081    2.6324596   

In [None]:
# Inspect the model by querying the most similar words for an unknown (out-of-vocabulary) word.
print(fasttext_cbow.wv.most_similar('Ryaan', topn=5))

[('naan', 0.7530190348625183), ('ayaan', 0.7500364780426025), ('maan', 0.7242642641067505), ('daan', 0.7175418138504028), ('baan', 0.7160359621047974)]


##### SkipGram

In [None]:
start = time.time()
fasttext_skipgram = FastText(sentences, sg=1, min_count=10)
end = time.time()

print("FastText SkipGram Model Training Complete\nTime taken for training is:{:.2f} hrs ".format((end-start) / 3600.0))

FastText SkipGram Model Training Complete
Time taken for training is:1.28 hrs 


In [None]:
# Summarize the loaded model
print(fasttext_skipgram)
print("-"*30)

# Summarize vocabulary
words = list(fasttext_skipgram.wv.vocab)
print(words)
print("-"*30)

# Access the vector for one word
print(fasttext_skipgram.wv['film'])
print("-"*30)

# Compute similarity
print("Similarity between film and drama:", fasttext_skipgram.wv.similarity('film', 'drama'))
print("Similarity between film and tiger:", fasttext_skipgram.wv.similarity('film', 'tiger'))
print("-"*30)

FastText(vocab=162668, size=100, alpha=0.025)
------------------------------
------------------------------
[-3.5486487e-01 -7.7937938e-02  6.2126172e-01  4.1887084e-01
 -3.2191923e-01 -6.0315456e-02 -2.0158468e-01  3.5473049e-01
  3.8680422e-01  1.4700870e-01  6.7584082e-03  1.1864316e-01
 -4.4011834e-01  8.9877516e-02  2.7886161e-01 -5.6735440e-03
  3.5107535e-01  4.1561729e-01 -4.6577194e-01 -6.2601618e-02
 -1.2515570e-01  1.8534912e-01 -1.9400490e-02  4.6600264e-01
 -2.7024463e-01  2.0591022e-01  4.1407746e-01 -5.1443201e-01
 -4.2226657e-01  3.2539743e-01  7.8593090e-02 -7.7325350e-01
 -1.5215129e-01 -1.1755857e-01 -2.7412999e-01  1.7897597e-01
  3.1273958e-01  1.5031134e-01  4.1338900e-01  4.0595913e-01
 -2.1709563e-02  1.8654524e-01  2.2302848e-01 -3.4526920e-01
  2.6287466e-01  3.5187706e-02 -4.8806038e-01  1.2495136e-01
 -4.5304667e-02 -3.4461889e-02  1.6955017e-01 -7.4013257e-01
 -2.9166558e-01  1.5609342e-01  4.2513612e-01  7.2491813e-01
 -2.1763289e-02 -7.4966878e-02  2.6403

In [None]:
# Inspect the model by looking for the most similar words for a test word.
print(fasttext_skipgram.wv.most_similar('computer', topn=5))
# Let us see what the 100-dimensional vector for 'computer' looks like.
print(fasttext_skipgram.wv['computer'])

[('computers', 0.9025884866714478), ('microcomputer', 0.8895590305328369), ('computing', 0.8587552905082703), ('minicomputer', 0.8545572757720947), ('microcomputers', 0.8477522134780884)]
[-0.26792672 -0.14854085 -0.13344114  0.01600254 -0.6828093   0.51479566
 -0.08283313  0.2883284  -0.19388802  0.76201576 -0.27318364 -0.32122388
 -0.41548398  0.04701537 -0.13784863 -0.3732795   0.46770307  0.15497878
 -0.23144081  0.41879857  0.08658265 -0.10125585  0.03467295  0.08455643
 -0.38988537  0.31450734 -0.04470623 -0.26623806  0.6234862  -0.03211192
 -0.54713297 -0.1909403   0.21665423 -0.0129564   0.3568217   0.42970824
  0.4596436  -0.05161433  0.30174243  0.72991705 -0.35760084 -0.29019243
  0.01022349  0.26688683  0.08019576 -0.07168911 -0.6152223   0.5514338
 -0.16556856 -0.16901505 -0.178999   -0.12277797  0.37053686  0.31875262
  0.5350165  -0.10235685 -0.31089485 -0.461324   -0.17796701  0.01647321
  0.2448681  -0.1856353  -0.01456876  0.2598886  -0.520902   -0.10449862
 -0.324463

In [None]:
# Inspect the model by querying the most similar words for an unknown (out-of-vocabulary) word.
print(fasttext_cbow.wv.most_similar('Rahi', topn=5))

[('yahi', 0.6589512825012207), ('asahi', 0.6400054693222046), ('ahi', 0.6399035453796387), ('shahi', 0.6395415663719177), ('yaha', 0.6394219398498535)]


In [None]:
# Inspect the model by looking for the most similar words for a common word.
print(fasttext_cbow.wv.most_similar('village', topn=5))

[('hillage', 0.8593059182167053), ('villager', 0.8587302565574646), ('millage', 0.8344053626060486), ('pillage', 0.8275806307792664), ('town', 0.8264790773391724)]
