## Practical 2. Training Embeddings Using Gensim
### Strictly used for internal purpose in Singapore Polytechnic. Do not disclose!
Word embeddings are an approach to representing text in NLP: each word is mapped to a dense numeric vector. In this notebook we demonstrate how to train embeddings using Gensim, an open-source Python library for natural language processing with a focus on topic modeling.

In [1]:
from gensim.models import Word2Vec
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Define training data
# Gensim's Word2Vec expects a 'list of lists' for training:
# the outer list holds the documents, and each inner list holds the tokens of one document.
corpus = [['students', 'learn', 'nlp'],
          ['nlp', 'workshop', 'interesting', 'students', 'like', 'nlp'],
          ['students', 'study', 'math'],
          ['math', 'foundation', 'nlp']]

# Train the models
model_cbow = Word2Vec(corpus, min_count=1, sg=0)      # sg=0: CBOW architecture
model_skipgram = Word2Vec(corpus, min_count=1, sg=1)  # sg=1: SkipGram architecture

## Continuous Bag of Words (CBOW)
In CBOW, the primary task is to build a language model that correctly predicts the center word given the context words in which the center word appears.
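To make the CBOW task concrete, the sketch below lists the (context words, center word) pairs that such a model would learn from for one sentence of the toy corpus. The helper `cbow_pairs` and the window size of 2 are illustrative assumptions, not part of Gensim; Gensim builds its training examples internally.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, center_word) pairs for one tokenized sentence.

    Hypothetical helper for illustration only; `window` is the number of
    words taken on each side of the center word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            pairs.append((context, center))
    return pairs


sentence = ['nlp', 'workshop', 'interesting', 'students', 'like', 'nlp']
for context, center in cbow_pairs(sentence, window=2):
    print(context, '->', center)
```

Each printed line is one prediction problem: given the words on the left of the arrow, predict the word on the right.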

In [3]:
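# Summarize the trained model
print(model_cbow)

# Summarize vector dimension
print(f'\nvectors dimension: {model_cbow.vector_size}')

# Summarize vocabulary
# (wv.vocab is the Gensim 3.x attribute; Gensim 4.x uses wv.key_to_index instead)
words = list(model_cbow.wv.vocab)
print(f'\n {words}')

# Access the vector for one word (through the model's KeyedVectors, model.wv)
print(f"\n {model_cbow.wv['math']}")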

Word2Vec(vocab=9, size=100, alpha=0.025)

vectors dimension: 100

 ['students', 'learn', 'nlp', 'workshop', 'interesting', 'like', 'study', 'math', 'foundation']

 [-4.7427327e-03  2.7009989e-03  2.4552299e-03  1.8800957e-03
 -4.0438152e-03  3.4139876e-03  4.7604516e-03  1.1196997e-03
 -3.0536046e-03  6.1592931e-04  2.1515526e-03 -3.3468381e-03
 -2.3813201e-03 -8.1891289e-05 -4.7015934e-03 -2.3964678e-03
  3.3518907e-03  3.6684147e-04  3.3118124e-03  4.1195285e-03
  1.9303175e-04 -3.1120996e-03  3.7053400e-03  2.5978107e-03
 -4.4858209e-03  3.9598527e-03 -3.5895035e-03  2.1809381e-03
 -3.7893979e-03  2.7558657e-03  6.0624594e-04 -1.1191514e-03
  3.0170311e-03  4.4734483e-03 -4.6698484e-04 -2.0793968e-04
 -1.4159449e-03  6.1119179e-04  6.5389578e-04 -8.5016299e-04
 -3.8867856e-03  1.3034858e-04 -9.3512301e-04 -1.1513102e-03
  3.2631170e-03 -3.1818855e-03  3.6724450e-03  4.7610798e-03
  4.0262188e-03 -2.9033674e-03  2.7980015e-04  6.5813999e-04
  4.1850712e-03  3.6857466e-03 -4.7332514e-

In [4]:
# Compute similarity (through model.wv; calling it on the model itself is deprecated)
print("Similarity between math and nlp:", model_cbow.wv.similarity('math', 'nlp'))
print("Similarity between math and foundation:", model_cbow.wv.similarity('math', 'foundation'))

Similarity between math and nlp: 0.09816745
Similarity between math and foundation: 0.034215108
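
The similarity score reported above is the cosine similarity between the two word vectors. The following sketch computes it by hand in plain Python; the function name `cosine_similarity` and the three short example vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative vectors (assumptions, not model output)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: no shared direction
```

Scores near 1 mean the vectors point the same way, scores near 0 mean they are unrelated. On a corpus as small as the toy one, the scores are close to 0 because the vectors have barely moved from their random initial values.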


In [5]:
# Most similar words
model_cbow.wv.most_similar('math')

[('workshop', 0.28242602944374084),
 ('students', 0.23566681146621704),
 ('nlp', 0.09816744923591614),
 ('like', 0.09050403535366058),
 ('learn', 0.04797886312007904),
 ('foundation', 0.034215107560157776),
 ('interesting', 0.021503645926713943),
 ('study', 0.01100192591547966)]

In [6]:
# save model
model_cbow.save('model/model_cbow.bin')

# load model
new_model_cbow = Word2Vec.load('model/model_cbow.bin')
print(new_model_cbow)

Word2Vec(vocab=9, size=100, alpha=0.025)


## SkipGram
In skipgram, the task is to predict the context words from the center word.
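As an illustration of the SkipGram task, the sketch below enumerates the (center word, context word) pairs for one sentence of the toy corpus. The helper `skipgram_pairs` and the window size of 1 are assumptions chosen for readability; Gensim generates its training pairs internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for one tokenized sentence.

    Hypothetical helper for illustration only; every word within `window`
    positions of the center word yields one pair.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = ['students', 'study', 'math']
for center, context in skipgram_pairs(sentence, window=1):
    print(center, '->', context)
```

Each center word produces one pair per neighbour, so a sentence yields more SkipGram training pairs than it has words.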

In [7]:
# Summarize the trained model
print(model_skipgram)

# Summarize vector dimension
print(f'\nvectors dimension: {model_skipgram.vector_size}')

# Summarize vocabulary
# (wv.vocab is the Gensim 3.x attribute; Gensim 4.x uses wv.key_to_index instead)
words = list(model_skipgram.wv.vocab)
print(f'\n {words}')

# Access the vector for one word (through the model's KeyedVectors, model.wv)
print(f"\n {model_skipgram.wv['math']}")

Word2Vec(vocab=9, size=100, alpha=0.025)

vectors dimension: 100

 ['students', 'learn', 'nlp', 'workshop', 'interesting', 'like', 'study', 'math', 'foundation']

 [-4.7427327e-03  2.7009989e-03  2.4552299e-03  1.8800957e-03
 -4.0438152e-03  3.4139876e-03  4.7604516e-03  1.1196997e-03
 -3.0536046e-03  6.1592931e-04  2.1515526e-03 -3.3468381e-03
 -2.3813201e-03 -8.1891289e-05 -4.7015934e-03 -2.3964678e-03
  3.3518907e-03  3.6684147e-04  3.3118124e-03  4.1195285e-03
  1.9303175e-04 -3.1120996e-03  3.7053400e-03  2.5978107e-03
 -4.4858209e-03  3.9598527e-03 -3.5895035e-03  2.1809381e-03
 -3.7893979e-03  2.7558657e-03  6.0624594e-04 -1.1191514e-03
  3.0170311e-03  4.4734483e-03 -4.6698484e-04 -2.0793968e-04
 -1.4159449e-03  6.1119179e-04  6.5389578e-04 -8.5016299e-04
 -3.8867856e-03  1.3034858e-04 -9.3512301e-04 -1.1513102e-03
  3.2631170e-03 -3.1818855e-03  3.6724450e-03  4.7610798e-03
  4.0262188e-03 -2.9033674e-03  2.7980015e-04  6.5813999e-04
  4.1850712e-03  3.6857466e-03 -4.7332514e-

In [8]:
# Compute similarity (through model.wv; calling it on the model itself is deprecated)
print("Similarity between math and nlp:", model_skipgram.wv.similarity('math', 'nlp'))
print("Similarity between math and foundation:", model_skipgram.wv.similarity('math', 'foundation'))

Similarity between math and nlp: 0.09816745
Similarity between math and foundation: 0.034215108


In [9]:
# Most similar words
model_skipgram.wv.most_similar('math')

[('workshop', 0.28242602944374084),
 ('students', 0.23566681146621704),
 ('nlp', 0.09816744923591614),
 ('like', 0.09050403535366058),
 ('learn', 0.04797886312007904),
 ('foundation', 0.034215107560157776),
 ('interesting', 0.021503645926713943),
 ('study', 0.01100192591547966)]

In [10]:
# save model
model_skipgram.save('model/model_skipgram.bin')

# load model
new_model_skipgram = Word2Vec.load('model/model_skipgram.bin')
print(new_model_skipgram)

Word2Vec(vocab=9, size=100, alpha=0.025)


## Training Word Embedding on Wiki Corpus
Corpus download page: https://www.corpusdata.org/formats.asp
The entire wiki corpus as of 28/04/2020 is just over 16 GB in size. Because of computation constraints, we take only a part of this corpus to train our Word2Vec and fastText embeddings.

In [11]:
import re
import string
import time

In [12]:
# Read the whole corpus; the 'with' block closes the file automatically
with open('data/wiki.txt', 'r') as text_file:
    corpus = text_file.read()

In [13]:
# Pre-processing
corpus = corpus.lower()                      # lowercase everything
corpus = re.sub(r'\d', '', corpus)           # remove digits
corpus = corpus.split('.')                   # split into rough sentences on full stops
# Strip punctuation from every sentence
punctuation_table = str.maketrans('', '', string.punctuation)
corpus = [i.translate(punctuation_table) for i in corpus]

In [14]:
corpus = [i.split() for i in corpus]
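
The pre-processing steps above can be tried on a short string before running them on the full file. The sketch below bundles them into one function; the name `preprocess`, the sample sentence, and the final step that drops empty sentences are assumptions added for illustration and are not part of the original cells.

```python
import re
import string


def preprocess(text):
    """Lowercase, remove digits, split on full stops, strip punctuation, tokenize."""
    text = text.lower()
    text = re.sub(r'\d', '', text)
    sentences = text.split('.')
    table = str.maketrans('', '', string.punctuation)
    tokenized = [s.translate(table).split() for s in sentences]
    # Assumption: drop empty sentences (for example, the text after the last full stop)
    return [tokens for tokens in tokenized if tokens]


sample = "Students learn NLP in 2020. The workshop is interesting!"
print(preprocess(sample))
```

The result is a list of token lists, which is the 'list of lists' format that Word2Vec expects.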

In [15]:
#CBOW
start = time.time()
word2vec_cbow = Word2Vec(corpus,min_count=10, sg=0)
end = time.time()
print("CBOW Model Training Complete.\nTime taken for training is:{:.2f} s ".format((end-start)))

CBOW Model Training Complete.
Time taken for training is:9.39 s 


In [16]:
# Skipgram
start = time.time()
word2vec_skipgram = Word2Vec(corpus,min_count=10, sg=1)
end = time.time()
print("SkipGram Model Training Complete.\nTime taken for training is:{:.2f} s ".format((end-start)))

SkipGram Model Training Complete.
Time taken for training is:20.50 s 


### An interesting observation: CBOW trained faster than SkipGram on this corpus (9.39 s versus 20.50 s).
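
One plausible reason for the timing gap is the number of prediction steps each architecture performs: CBOW makes one prediction per center word, whereas SkipGram makes one prediction per (center word, context word) pair. The sketch below counts both for the toy corpus. The function `count_training_examples` and the fixed window of 2 are illustrative assumptions; Gensim's actual training loop differs in its details.

```python
def count_training_examples(sentences, window=2):
    """Count prediction steps: one per center word (CBOW), one per pair (SkipGram)."""
    cbow_examples = 0
    skipgram_examples = 0
    for tokens in sentences:
        for i in range(len(tokens)):
            n_left = len(tokens[max(0, i - window):i])
            n_right = len(tokens[i + 1:i + 1 + window])
            n_context = n_left + n_right
            if n_context:
                cbow_examples += 1              # one averaged-context prediction
                skipgram_examples += n_context  # one prediction per context word
    return cbow_examples, skipgram_examples


toy_corpus = [['students', 'learn', 'nlp'],
              ['nlp', 'workshop', 'interesting', 'students', 'like', 'nlp'],
              ['students', 'study', 'math'],
              ['math', 'foundation', 'nlp']]

n_cbow, n_skipgram = count_training_examples(toy_corpus, window=2)
print(f'CBOW examples: {n_cbow}, SkipGram examples: {n_skipgram}')
```

With a window of 2, SkipGram sees more than twice as many examples as CBOW on this toy corpus, and the gap widens as the window grows.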

In [17]:
#Summarize the CBOW model
print(word2vec_cbow)
print("-"*50)

Word2Vec(vocab=13588, size=100, alpha=0.025)
--------------------------------------------------


In [18]:
#Summarize vocabulary
words = list(word2vec_cbow.wv.vocab)
print(words[:100])
print("-"*100)

['text', 'albert', 'of', 'prussia', 'may', 'march', 'was', 'the', 'last', 'grand', 'master', 'teutonic', 'knights', 'who', 'after', 'converting', 'to', 'became', 'first', 'monarch', 'duchy', 'state', 'that', 'emerged', 'from', 'former', 'monastic', 'european', 'ruler', 'establish', 'as', 'official', 'religion', 'his', 'lands', 'he', 'proved', 'instrumental', 'in', 'political', 'spread', 'its', 'early', 'stage', 'ruling', 'prussian', 'for', 'nearly', 'six', 'decades', 'a', 'member', 'branch', 'house', 'hohenzollern', 's', 'election', 'had', 'brought', 'about', 'hopes', 'fortune', 'skilled', 'administrator', 'and', 'leader', 'did', 'indeed', 'reverse', 'decline', 'order', 'however', 'demands', 'martin', 'luther', 'rebelled', 'against', 'catholic', 'church', 'holy', 'roman', 'empire', 'by', 'into', 'protestant', 'hereditary', 'realm', 'uncle', 'king', 'poland', 'arrangement', 'confirmed', 'treaty', 'krakw', 'pledged', 'personal', 'oath', 'return', 'invested', 'with']
---------------------

In [19]:
# Access the vector for one word (through the model's KeyedVectors, model.wv)
print(word2vec_cbow.wv['science'])
print("-"*100)

[ 0.738786    0.8861788  -0.31611848  0.5307762  -1.5027008   1.3605808
 -1.8534917  -0.26915264 -1.0699195   0.22573121  0.34850603  0.0945635
 -0.41275886 -0.09613997 -0.33295634  1.0319223  -0.01559319  1.1972445
 -1.0469635   0.20529337 -0.01645879  0.24352127 -0.32144377  0.02693684
 -0.14583959 -0.39322737  0.6679208  -0.57211834  0.61633205  0.18112165
 -0.26600587  0.38309324  0.08289367  0.53765285 -0.37124366 -0.7403174
 -0.06275565 -0.08273292  0.7806035   0.04157685  0.7516202  -0.06333906
  0.58929735  0.28825414  0.05223166  0.29272842 -0.59094536  0.68880457
 -0.1688067   0.36569107 -0.89279836  1.0106404   0.29204327 -0.71944004
 -0.3171779   0.25453657 -1.3759314   0.26642907 -1.0476274   0.906037
  0.36368522 -0.47202477 -0.6266167   0.72990084  0.88393855  0.6026101
  0.34305233  0.4921666   0.11134485  0.6547872   0.5056012  -0.4189302
 -0.13545096 -0.18129024 -0.33972195 -0.4294363   0.6499062  -0.7497354
  0.03588779  0.14677809  0.0846085  -0.03514655  1.5866584 

In [20]:
# Compute similarity (through model.wv; calling it on the model itself is deprecated)
print("Similarity between science and engineering:", word2vec_cbow.wv.similarity('science', 'engineering'))
print("Similarity between science and movie:", word2vec_cbow.wv.similarity('science', 'movie'))
print("-"*50)

Similarity between science and engineering: 0.9046574
Similarity between science and movie: 0.09772175
--------------------------------------------------


In [21]:
# save model
word2vec_cbow.save('model/wiki_cbow.bin')
word2vec_skipgram.save('model/wiki_skipgram.bin')