# Gensim - Fast Building Framework for Word Embedding

## 1. Introduction

### What is Word Embedding

Word embedding is an efficient way to represent words in machines, where each word is usually stored as a vector of numbers.

The simplest representation is the one-hot vector: each word becomes a high-dimensional vector that contains a single "1" and zeros everywhere else. For example, if the vocabulary contains only the 4 words "man", "king", "woman" and "queen", they can be represented as [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1]. However, a real vocabulary can contain millions of words, and a one-hot vector needs one dimension per word, which may cause out-of-memory problems. In addition, one-hot vectors tell the machine nothing about the meaning of a word or the relationship between two words.

To solve these problems, low-dimensional word vectors are needed. If we manually choose features to describe a word, machines can recognize the word by these features. For example, with the four features "royal", "normal", "masculine" and "feminine", the word "king" can be represented as [0.99, 0.01, 0.8, 0.1] and "queen" as [0.98, 0.01, 0.2, 0.8]. Selecting features manually is troublesome, but letting machines learn such features automatically may work.
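
To make the contrast concrete, here is a minimal sketch in plain Python using the four-word vocabulary and the feature values from the text. The helper names `one_hot` and `dot` are made up for this illustration. It shows that the dot product of two different one-hot vectors is always zero, so they carry no notion of relatedness, while the feature vectors of "king" and "queen" overlap.

```python
def one_hot(word, vocabulary):
    """Return the one-hot vector of `word` for an ordered vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vocabulary = ["man", "king", "woman", "queen"]
one_hot_vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

# Hand-picked features from the text: royal, normal, masculine, feminine.
feature_vectors = {
    "king": [0.99, 0.01, 0.8, 0.1],
    "queen": [0.98, 0.01, 0.2, 0.8],
}

print(one_hot_vectors["king"])                                 # [0, 1, 0, 0]
print(dot(one_hot_vectors["king"], one_hot_vectors["queen"]))  # 0
print(dot(feature_vectors["king"], feature_vectors["queen"]))  # clearly above 0
```

Note also that the one-hot vector grows with the vocabulary, while the feature vector keeps its four dimensions no matter how many words are added.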

Neural language models had already learned word vectors as a by-product of training, but the output layer of such a network must score every word in the vocabulary, so its computing cost is non-trivial. word2vec makes this practical with two shallow architectures, CBOW and Skip-gram: CBOW predicts a word from its surrounding context, while Skip-gram predicts the context from a word. Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The outputs are low-dimensional vectors with remarkable linear relationships. For example, vec("king") - vec("man") + vec("woman") ≈ vec("queen").
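
The king/queen relationship can be tried by hand with the manually chosen feature vectors from the previous paragraph. This is only a sketch: the vectors for "man" and "woman" are hypothetical values invented for the illustration, and a trained model learns its own (much less interpretable) dimensions.

```python
# Feature order: royal, normal, masculine, feminine.
vectors = {
    "king": [0.99, 0.01, 0.8, 0.1],    # from the text above
    "queen": [0.98, 0.01, 0.2, 0.8],   # from the text above
    "man": [0.01, 0.99, 0.8, 0.1],     # hypothetical, for illustration only
    "woman": [0.02, 0.99, 0.2, 0.8],   # hypothetical, for illustration only
}


def analogy(a, b, c):
    """Return vec(a) - vec(b) + vec(c), computed component by component."""
    return [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(u, v))


result = analogy("king", "man", "woman")

# Find the known word whose vector lies closest to the computed point.
nearest = min(vectors, key=lambda word: squared_distance(vectors[word], result))

print([round(x, 2) for x in result])
print(nearest)  # queen
```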

With the help of word embedding, words are represented efficiently, which makes it possible for machines to capture the meaning of words.

### What is Gensim

Gensim is a free Python library with the following features:

- Scalable statistical semantics

- Analyze plain-text documents for semantic structure

- Retrieve semantically similar documents

Gensim makes the following tasks convenient:

- Training word embedding models

- Similarity queries

- Text summarization

- Topic modelling

Gensim is a powerful tool for many IR and NLP tasks. In this spotlight, we mainly focus on training word embedding models, namely Word2Vec and FastText.

## 2. Installation

Gensim depends on NumPy and SciPy, two Python packages for scientific computing. Make sure they are installed before installing Gensim. The Gensim version used in this spotlight requires Python 2.7 or >=3.5.

Gensim can be installed in one of the following ways:

1. pip

`pip install -U gensim`

2. Anaconda

`conda install -c conda-forge gensim`
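
After installing, it can help to confirm which versions are actually present. The cells in this spotlight use parameter and attribute names such as `size` and `wv.vocab`, which Gensim 4.0 renamed (to `vector_size` and `wv.key_to_index` / `wv.index_to_key`). The sketch below assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library; `installed_version` is a helper name made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this spotlight relies on.
for package in ("numpy", "scipy", "gensim"):
    print(package, installed_version(package) or "not installed")
```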


## 3. Dataset

In this section we download and preprocess the dataset. The dataset is available from Kaggle at https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset and contains a collection of fake and real news articles. The directory "fake-and-real-news-dataset" holds two files, "Fake.csv" and "True.csv". A text dataset used to train a model such as Word2Vec is called a **corpus**.

First, read the files and check how many rows they have.

In [1]:
import pandas as pd

fake_df = pd.read_csv('./fake-and-real-news-dataset/Fake.csv')
real_df = pd.read_csv('./fake-and-real-news-dataset/True.csv')
print(fake_df.shape)
print(real_df.shape)


(23481, 4)
(21417, 4)


Each file has four columns: title, text, subject and date. In this example, we take the text column and use it to build the corpus.

In [2]:
print(fake_df.head())
print(real_df.head())

                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   
3  On Christmas day, Donald Trump announced that ...    News   
4  Pope Francis used his annual Christmas Day mes...    News   

                date  
0  December 31, 2017  
1  December 31, 2017  
2  December 30, 2017  
3  December 29, 2017  
4  December 25, 2017  
                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept t

Next, we remove noise from the text. First, check whether the dataset has any missing values.

In [3]:
print(fake_df.isnull().sum())
print(real_df.isnull().sum())

title      0
text       0
subject    0
date       0
dtype: int64
title      0
text       0
subject    0
date       0
dtype: int64


Since there are no missing values, no rows need to be removed.

The next step is preprocessing the corpus.

Gensim provides a number of functions for parsing, preprocessing and tokenizing strings. Preprocessing usually means removing punctuation, tags, numbers, extra whitespace and so on, and then tokenizing the cleaned text. Gensim's **preprocess_documents** function does all of this in one call. With its default filters it also lowercases the text, removes stopwords and very short tokens, and stems each word, which is why the output below contains tokens such as 'happi' and 'leav'.

In [4]:
from gensim.parsing.preprocessing import preprocess_documents

# Collect the article bodies from both files as plain strings.
raw_dataset = [str(row) for row in fake_df['text']]
raw_dataset += [str(row) for row in real_df['text']]

# Each document becomes a list of cleaned tokens.
sentences = preprocess_documents(raw_dataset)

In [5]:
print(len(sentences))
print(sentences[0])

44898
['donald', 'trump', 'couldn', 'wish', 'american', 'happi', 'new', 'year', 'leav', 'instead', 'shout', 'enemi', 'hater', 'dishonest', 'fake', 'new', 'media', 'realiti', 'star', 'job', 'couldn', 'countri', 'rapidli', 'grow', 'stronger', 'smarter', 'want', 'wish', 'friend', 'support', 'enemi', 'hater', 'dishonest', 'fake', 'new', 'media', 'happi', 'healthi', 'new', 'year', 'presid', 'angri', 'pant', 'tweet', 'great', 'year', 'america', 'countri', 'rapidli', 'grow', 'stronger', 'smarter', 'want', 'wish', 'friend', 'support', 'enemi', 'hater', 'dishonest', 'fake', 'new', 'media', 'happi', 'healthi', 'new', 'year', 'great', 'year', 'america', 'donald', 'trump', 'realdonaldtrump', 'decemb', 'trump', 'tweet', 'went', 'welll', 'expect', 'kind', 'presid', 'send', 'new', 'year', 'greet', 'like', 'despic', 'petti', 'infantil', 'gibberish', 'trump', 'lack', 'decenc', 'won', 'allow', 'rise', 'gutter', 'long', 'wish', 'american', 'citizen', 'happi', 'new', 'year', 'bishop', 'talbert', 'swan', '
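
To show what such a cleaning pipeline does, here is a simplified stand-in written with the standard library only. It is not Gensim's implementation: the function name, the tiny stopword list and the sample sentence are made up for illustration, it keeps ASCII letters only, and it does not stem words (so it yields "happy" where the output above shows "happi").

```python
import re

# A tiny illustrative stopword list; a real one is much longer.
STOPWORDS = {"a", "all", "an", "and", "the", "to", "of", "just", "out", "his"}


def simple_preprocess_document(text, min_length=3):
    """Lowercase, strip tags, punctuation and digits, then tokenize."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # strip HTML-like tags
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    tokens = text.split()                  # split() also collapses whitespace
    # Drop very short tokens and stopwords.
    return [t for t in tokens if len(t) >= min_length and t not in STOPWORDS]


# Made-up sample in the style of the dataset.
sample = "Donald Trump just couldn t wish all Americans <b>Happy</b> New Year 2018!"
print(simple_preprocess_document(sample))
```

Applied to every document, a function like this yields the list-of-token-lists shape that `Word2Vec` expects as `sentences`.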

## 4. Train the Model

After preparing the dataset, we train a Word2Vec model and a FastText model.

### 4.1 Word2Vec Model Training

We train the Word2Vec model first.

In [6]:
from gensim.models import Word2Vec
import time

start = time.time()
w2v_model = Word2Vec(sentences=sentences)
end = time.time()
print("Time cost: "+str(end-start)+"s")

Time cost: 34.75360059738159s


Once the model is trained, we can, among other things, obtain the vector of a word and retrieve the model's vocabulary. For example, here is the word vector for "trump".

In [7]:
vec_trump = w2v_model.wv['trump']
print(vec_trump)

[ 0.90112185 -0.06394381 -0.54695094  0.5451448   3.9546165   1.4378763
  1.0281826  -1.2107242   1.1606402  -0.7177944   0.58483505  0.328974
 -2.9239256  -1.7333181  -0.7306151  -1.1336333   1.9647046  -0.17106111
  0.26653388 -1.2920225  -0.85093015  0.4205638  -0.7908393  -0.65425694
  0.84474224 -0.5225834  -2.5841691  -0.58454174 -0.275691   -0.13299435
 -2.2737842  -0.67487735 -1.6917374  -0.95548    -1.908789   -0.9495424
  1.6894523  -0.2289263   1.6128705   0.33084643  1.6679616  -2.4691775
 -0.8925712   0.78315806  0.60559136  0.47666115  0.69272935  3.3046427
  0.32235762 -0.6916537  -1.1403162  -1.034852   -0.6216303   0.34747735
 -1.5019106   1.6279614  -0.5610111   0.9031772   2.5842667  -3.317516
  0.4004678   0.16276014  0.96519786 -0.97517735  1.0138454  -1.3710843
 -0.4389914   2.9887648  -0.3712641  -2.5208259  -0.49166197  1.0353457
  1.5504236   0.8116096  -0.1808369   1.7955558  -2.576719   -0.31995437
  1.9097854  -1.4624949  -1.2789673  -0.40529037  0.78812593 

We can also retrieve the vocabulary of the model. Here are ten of its words.

In [8]:
from itertools import islice

# Gensim 3.x API: wv.vocab maps each word to its vocabulary entry.
# In Gensim 4+ use wv.index_to_key instead.
for word in islice(w2v_model.wv.vocab, 10):
    print(word)

donald
trump
couldn
wish
american
happi
new
year
leav
instead


Word2Vec can also calculate the similarity between two words. Here is an example.

In [9]:
pairs = [
    ('trump', 'american'),   
    ('trump', 'donald'),   
    ('new', 'tweet'),  
    ('dishonest', 'fake'),    
    ('fake', 'real'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, w2v_model.wv.similarity(w1, w2)))

'trump'	'american'	0.16
'trump'	'donald'	0.25
'new'	'tweet'	0.05
'dishonest'	'fake'	0.53
'fake'	'real'	0.36


We can also retrieve the top n words most similar to a given word, here "trump".

In [10]:
print(w2v_model.wv.most_similar(positive=['trump'], topn=5))

[('trump’', 0.8477027416229248), ('rumsfeld', 0.5220260620117188), ('president’', 0.5177746415138245), ('presumpt', 0.463289737701416), ('candid', 0.4583457112312317)]
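
Both queries above rest on cosine similarity between word vectors. The sketch below reimplements the idea in plain Python to show what is being computed. It is not Gensim's code, and the 3-dimensional vectors are invented for the illustration, so the scores do not match the output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=5):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "trump": [0.9, 0.1, 0.2],
    "donald": [0.8, 0.2, 0.1],
    "tweet": [0.1, 0.9, 0.3],
    "year": [0.0, 0.2, 0.9],
}

for other, score in most_similar("trump", toy_vectors, topn=2):
    print(other, round(score, 2))
```

Because cosine similarity depends only on the angle between vectors, two words can be similar even when their vectors have very different lengths.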


### Parameters

Several parameters will affect the training speed and the quality of the model.

**min_count**

min_count prunes the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there is not enough data to learn anything meaningful about those words, so it is best to ignore them.

The default value of min_count is 5.

In [None]:
model = Word2Vec(sentences=sentences, min_count=10)
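
The effect of min_count can be sketched with a simple frequency count. This illustrates the pruning rule only, not Gensim's internal vocabulary code; `prune_vocabulary` and the toy sentences are made up for the example.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


toy_sentences = [
    ["fake", "new", "media"],
    ["fake", "new", "year"],
    ["new", "year", "welll"],
]

vocab = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(vocab.items()))  # [('fake', 2), ('new', 3), ('year', 2)]
```

The rare tokens "media" and "welll" (a typo that also appears in the preprocessed output earlier) are dropped, which is exactly the kind of noise min_count is meant to remove.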

**size**

size is the number of dimensions (N) of the N-dimensional space onto which Gensim Word2Vec maps the words. The default is 100. (Gensim 4.0 renamed this parameter to vector_size.)

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

In [None]:
model = Word2Vec(sentences=sentences, size=200)

**workers**

workers is the number of worker threads used to train the model. It only has an effect if Cython is installed; without Cython, training is limited to one core because of the GIL.

In [None]:
model = Word2Vec(sentences=sentences, workers=4)

Now we are going to modify the parameters and train another model.

In [11]:
start = time.time()
model = Word2Vec(sentences=sentences, min_count=10, size=200)
end = time.time()
print("Time cost: "+str(end-start)+"s")

Time cost: 37.1169114112854s


Now obtain the word vector of "trump" from the new model.

In [12]:
vec_trump = model.wv['trump']
print(vec_trump)

[ 1.1083409e+00  2.8412825e-01 -4.0333152e-01  4.2425072e-01
  1.2265630e+00  1.3978356e+00  1.9230708e+00 -7.0238733e-01
  1.8900141e+00 -7.5629696e-02  6.2934566e-01  1.6841220e+00
 -1.0395401e+00 -8.9119613e-01  4.1590594e-02 -2.5856951e-01
  7.0791280e-01 -1.4080112e+00  4.6715644e-01  1.3355272e-01
 -1.7386986e-01  2.5997877e-01 -2.0607309e-01 -7.9682446e-01
 -1.4383588e+00 -9.3288469e-01 -2.0245752e+00 -1.1665574e+00
 -1.3097171e-01 -6.1054868e-01  7.3613666e-02 -1.5768647e+00
 -8.1336451e-01  2.2355423e+00 -2.2096384e+00 -8.5814375e-01
  5.5592966e-01  7.4210280e-01  1.1821085e+00  7.4717379e-01
  4.7867921e-01 -1.3896639e+00 -6.0631406e-01 -7.3368832e-02
  1.1755929e+00  5.3119159e-01 -3.4468132e-01  2.7831409e+00
  2.6191559e-01 -1.0885777e+00 -3.8575473e-01  1.6791103e+00
  3.5325789e-01  8.7553543e-01 -7.3722500e-01  9.6064168e-01
 -1.5864632e+00 -3.3833343e-01 -3.3062354e-02 -4.5033127e-01
 -8.7628447e-02 -2.3689389e-01  1.5903516e+00 -1.4993639e-01
  2.9186693e-01  1.76122

List ten words from the vocabulary of the new model.

In [13]:
from itertools import islice

# Gensim 3.x API; in Gensim 4+ use model.wv.index_to_key instead.
for word in islice(model.wv.vocab, 10):
    print(word)

donald
trump
couldn
wish
american
happi
new
year
leav
instead


### 4.2 FastText Model Training

Compared with Word2Vec, FastText does significantly better on syntactic tasks, especially when the training corpus is small, but it takes longer to train. FastText can also produce vectors for out-of-vocabulary words by summing the vectors of their component character n-grams, provided at least one of those n-grams was present in the training data.
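
To see why character n-grams help with unseen words, the sketch below extracts the n-grams of two related words and counts how many they share. It is a simplified illustration, not Gensim's implementation: the function name is made up, the word is wrapped in "<" and ">" boundary markers as in the FastText approach, and n-gram lengths 3 to 6 are assumed here as typical settings.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with boundary markers."""
    wrapped = "<" + word + ">"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


night = char_ngrams("night")
nights = char_ngrams("nights")
shared = night & nights

print(sorted(char_ngrams("night", 3, 3)))
print(len(shared), "of the", len(nights), "n-grams of 'nights' also occur in 'night'")
```

A word that never appeared in the training data can therefore still receive a vector, as long as some of its n-grams were seen inside other words.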

To train a FastText model, we will use the Lee Corpus that ships with Gensim. Be careful: training on a large corpus may exhaust memory, so in this spotlight we do not train FastText on the previous corpus. Training proceeds in the following steps.

In [14]:
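# NOTE: this rebinds print to pprint for the remaining cells, so strings
# and arrays below are displayed in their repr form.
from pprint import pprint as print
from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

# Path to the Lee background corpus bundled with Gensim
corpus_file = datapath('lee_background.cor')

# Gensim 3.x API: size is the vector dimensionality (vector_size in Gensim 4+)
model = FT_gensim(size=100)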

# build the vocabulary
model.build_vocab(corpus_file=corpus_file)

# train the model
start = time.time()
model.train(
    corpus_file=corpus_file, epochs=model.epochs,
    total_examples=model.corpus_count, total_words=model.corpus_total_words
)
end = time.time()

print("Time cost: "+str(end-start)+"s")

'Time cost: 0.8102421760559082s'


After training, Gensim lets us use the model for tasks such as word vector lookup and similarity queries.

Unlike Word2Vec models, FastText models support vector lookups for out-of-vocabulary words by summing the character n-grams that belong to the word.

For example, here we look up the word vectors of "night" and "nights". Only "night" is in the vocabulary.

In [15]:
print('night' in model.wv.vocab)
print('nights' in model.wv.vocab)

True
False


In [16]:
print(model.wv['night'])

array([ 0.1941219 , -0.15062791, -0.5355393 ,  0.5484641 ,  0.54986024,
       -0.21399635, -0.19775148, -0.01883192,  0.35155442,  0.38947713,
       -0.54005367,  0.03728401, -0.71751434,  0.42678547,  0.19532633,
       -0.02315019, -0.22061744,  0.26909608,  0.18943693, -0.39142597,
       -0.25171676,  0.36400837, -0.3321525 ,  0.07453673, -0.6866999 ,
        0.73278326,  0.19264132,  0.2135463 ,  0.50038254,  0.10496826,
       -0.6929901 ,  0.22178094,  0.05497402, -0.5014203 ,  0.37060383,
       -0.0215224 , -0.15111026, -0.08223297,  0.46393153,  0.27434742,
       -0.01615404, -0.04809566,  0.3160232 , -0.01038613,  0.14094114,
        0.24010792, -0.12259277,  0.22934291, -0.04977904, -0.46853322,
       -0.5398909 , -0.6163394 , -0.01623671, -0.01325615,  0.46008006,
       -0.7649052 , -0.15356226, -0.238984  ,  0.09233637, -0.11875038,
        0.22653954, -0.08755895, -0.4213021 , -0.09683564, -0.5114078 ,
        0.26933733,  0.08372572,  0.17297511,  0.01307578,  0.50


In [17]:
print(model.wv['nights'])

array([ 0.16976325, -0.13037507, -0.4659069 ,  0.4763106 ,  0.47740814,
       -0.18761593, -0.17260599, -0.01553334,  0.3055244 ,  0.33936536,
       -0.47178084,  0.03113239, -0.62432075,  0.3725923 ,  0.16590124,
       -0.01995055, -0.19381092,  0.23225938,  0.16306035, -0.340871  ,
       -0.21879405,  0.317462  , -0.28945407,  0.06545433, -0.5977323 ,
        0.63712573,  0.16729671,  0.1860668 ,  0.43515775,  0.09081806,
       -0.6005572 ,  0.19395107,  0.0465325 , -0.43817773,  0.32224125,
       -0.01889641, -0.12995604, -0.07268477,  0.4036368 ,  0.23933122,
       -0.01307502, -0.0421077 ,  0.27434275, -0.0073134 ,  0.12151323,
        0.2090702 , -0.10995457,  0.20143966, -0.04402449, -0.4061034 ,
       -0.47224367, -0.5351996 , -0.01290479, -0.01055295,  0.40079945,
       -0.66518825, -0.13369225, -0.20678155,  0.0778923 , -0.10292069,
        0.19665094, -0.07701752, -0.3661156 , -0.08469696, -0.4453588 ,
        0.23374492,  0.07293572,  0.15031497,  0.01318435,  0.44


Similarity operations work the same way as in Word2Vec, but out-of-vocabulary words can also be used.

In [18]:
print(model.wv.similarity("night", "nights"))

0.999993



As for evaluation, training word embeddings is an unsupervised learning task, so there is no single method to evaluate the model; the appropriate evaluation depends on the application.

## 5. Conclusion

In this spotlight we introduced word embedding and its benefits. We then showed how to train Word2Vec and FastText models with the Python library Gensim and demonstrated some uses of the trained models. I hope this spotlight gives you a better understanding of word embedding and Gensim.

## References

[Fake news and real news corpora](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset)

[gensim: Word2Vec Documentation](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py)

[Gensim: Word2Vec Tutorial](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial/notebook)

[Gensim: FastText Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-auto-examples-tutorials-run-fasttext-py)