# Word Embedding

![Word Embedding](https://cdn-images-1.medium.com/max/800/0*g24VvkPOJPaYDw6W.jpg)
Photo Credit: https://cdn.pixabay.com/photo/2016/03/09/09/14/books-1245690_960_720.jpg

Word embedding is often treated as a silver bullet for many NLP problems. Most modern NLP architectures have adopted word embedding and moved away from bag-of-words (BoW), Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and similar representations.

After reading this article, you will understand:
- History of Word Embedding
- Word Embedding Design
- Apply off-the-shelf word embedding model
- Embedding Visualization
- Take Away

# History of Word Embedding
Traditionally, we use bag-of-words to represent text as features (e.g. TF-IDF or count vectors). Besides BoW, we can apply LDA or LSA to word features. However, these approaches have limitations such as high-dimensional, sparse feature vectors. A word embedding is a dense, low-dimensional vector. Word embedding has been shown to provide better vector features for most NLP problems.

In 2013, Mikolov et al. made word embedding popular, and it eventually became state-of-the-art in NLP. They released the word2vec toolkit, which lets us use a pre-trained model. Later on, gensim provided a wrapper so that we can load different pre-trained word embedding models, including Word2Vec (by Google), GloVe (by Stanford) and fastText (by Facebook).

Twelve years before Mikolov et al. introduced Word2Vec, Bengio et al. published a paper [1] on language modeling that contains the initial idea of word embedding. At that time, they described the process as "learning a distributed representation for words".

![](https://cdn-images-1.medium.com/max/800/1*FZVMHwCLO3fFo7FvMyA94Q.png)
Capture from A Neural Probabilistic Language Model [2] (Bengio et al., 2003)

In 2008, Collobert and Weston [3] introduced the concept of a pre-trained model and showed that it is an effective approach for NLP problems. However, word embedding did not become widely known until Mikolov et al. released the pre-trained Word2Vec model in 2013.

![](https://cdn-images-1.medium.com/max/800/1*D6A44ZN5_zwTyuCAODM0fA.png)
Capture from A Unified Architecture for Natural Language Processing [3] (Collobert & Weston, 2008)

Timeline:
- 2001: Bengio et al. introduced the concept of word embedding
- 2008: Collobert and Weston introduced the concept of a pre-trained model
- 2013: Mikolov et al. released a pre-trained model, Word2Vec

# Word Embedding Design

##### Low Dimensional
![](https://food.fnr.sndimg.com/content/dam/images/food/fullset/2014/3/17/0/FNM_040114-KidsCake-rainbow-recipe_s4x3.jpg.rend.hgtvcom.616.462.suffix/1395082987380.jpeg)
Photo Credit: https://www.foodnetwork.com/recipes/food-network-kitchen/four-layer-birthday-cake-3363221

To tackle the high-dimensionality issue, word embedding uses a vector space of pre-defined size (such as 300 dimensions) to represent every word. For demo purposes, I use 3 dimensions to represent the following words:
- Apple: [1.11, 2.24, 7.88]
- Orange: [1.01, 2.04, 7.22]
- Car: [8.41, 2.34, -1.28]
- Table: [-1.41, 7.34, 3.01]

Because the size of the vector space is pre-defined (i.e. 3 in the above demo), the number of dimensions (or features) is fixed no matter how large the corpus is. In BoW, by contrast, the number of dimensions increases as the number of unique words increases. Imagine we have 10k unique words in our documents: the number of features in BoW is 10k (without filtering high/low frequency words), while the dimension stays at 3 in our demo.
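
To make the contrast concrete, here is a minimal sketch in plain Python. The two toy documents and the helper `bow_vector` are made up for illustration, and the embedding table simply reuses the demo vectors listed above rather than values from a trained model.

```python
# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "apple and orange are fruit",
    "the car is next to the table",
]

# BoW: one dimension per unique word, so the vector grows with the vocabulary
vocab = sorted({word for doc in docs for word in doc.split()})


def bow_vector(doc, vocab):
    """Count how many times each vocabulary word occurs in doc."""
    tokens = doc.split()
    return [tokens.count(word) for word in vocab]


bow = bow_vector(docs[0], vocab)

# Embedding: every word maps to a vector of fixed size (3, as in the demo)
embedding = {
    "apple": [1.11, 2.24, 7.88],
    "orange": [1.01, 2.04, 7.22],
    "car": [8.41, 2.34, -1.28],
    "table": [-1.41, 7.34, 3.01],
}

print("BoW dimension:", len(bow))  # equals the vocabulary size
print("Embedding dimension:", len(embedding["apple"]))  # fixed at 3
```

Adding a third document with new words would lengthen every BoW vector, while the embedding dimension would stay the same.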

##### Semantic Relationship
![](https://cdn-images-1.medium.com/max/1600/1*oF1QyMamN5jXCXfffSRrqA.png)
Photo Credit: https://gointothestory.blcklst.com/similar-but-different-c722f39d923d


In general, word vectors encode semantic relationships among words. This is a very important concept in word embedding because it helps in tackling NLP problems. Word vectors are close to each other if the words have similar meanings. For example, the vectors of "buy" and "purchase" will be close. BoW, by contrast, only records whether (or how often) a word occurs, so it cannot represent whether two words have similar meanings.

In the above example, you may notice that Apple's vector and Orange's vector are closer to each other than to the others, while Apple's vector is relatively far away from Car's vector.
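
One common way to measure "closeness" is cosine similarity. The sketch below applies it to the demo vectors from the previous section. Choosing cosine similarity is my assumption here (the demo values are illustrative and any distance measure could be used), and `cosine_similarity` is a helper written for this example.

```python
import math

# Demo vectors from the "Low Dimensional" section (illustrative values)
vectors = {
    "apple": [1.11, 2.24, 7.88],
    "orange": [1.01, 2.04, 7.22],
    "car": [8.41, 2.34, -1.28],
    "table": [-1.41, 7.34, 3.01],
}


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for other in ["orange", "car", "table"]:
    score = cosine_similarity(vectors["apple"], vectors[other])
    print("apple vs %s: %.3f" % (other, score))
```

With these numbers, apple scores highest against orange and lowest against car, which matches the observation above.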

##### Continuous bag-of-words (CBOW) & Skip-gram
Mikolov et al. proposed two new architectures [4] that reduce computational complexity and incorporate additional context.

CBOW uses the n words before and after the target word (w) to predict it. Take the sentence "the word vector encodes semantic relationship among words" as an example. If the window (n) is 3, here is a subset of the prediction list:
- Case 1, Before Words: {Empty}, After Words: (word, vector, encodes), Predict Word: "the"
- Case 2, Before Words: (the), After Words: (vector, encodes, semantic), Predict Word: "word"

Skip-gram takes the opposite approach: it uses the target word to predict the n words before and after it. Take the sentence "the word vector encodes semantic relationship among words" again. If the window (n) is 3, here is a subset of the prediction list:
- Case 1, Input Word: "the", Predict Words: (word, vector, encodes)
- Case 2, Input Word: "word", Predict Words: (the, vector, encodes, semantic)
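
The cases listed for CBOW and skip-gram can be reproduced with a few lines of Python. This is only a sketch of how the training examples are formed, not of the neural network itself. The helper names (`context_window`, `cbow_examples`, `skipgram_examples`) are made up for this example, and splitting on whitespace is a simplifying assumption.

```python
def context_window(tokens, index, window):
    """Return (words before, words after) the token at index, up to `window` per side."""
    before = tokens[max(0, index - window):index]
    after = tokens[index + 1:index + 1 + window]
    return before, after


def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the centre word is the label."""
    examples = []
    for i, target in enumerate(tokens):
        before, after = context_window(tokens, i, window)
        examples.append((before + after, target))
    return examples


def skipgram_examples(tokens, window):
    """Skip-gram: the centre word is the input, each context word is a label."""
    examples = []
    for i, center in enumerate(tokens):
        before, after = context_window(tokens, i, window)
        for context in before + after:
            examples.append((center, context))
    return examples


tokens = "the word vector encodes semantic relationship among words".split()

for context, target in cbow_examples(tokens, window=3)[:2]:
    print("CBOW      input:", context, "-> predict:", target)

for center, context in skipgram_examples(tokens, window=3)[:3]:
    print("Skip-gram input:", center, "-> predict:", context)
```

Both functions walk over the same windows. The only difference is which side of the pair is treated as input and which as the label.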

![](https://cdn-images-1.medium.com/max/800/1*QwiTOcVmwesADjQ3zMvSjA.png)
Capture from Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)

##### Negative Sampling
Instead of using all other words as negative training examples, Mikolov et al. proposed using a small number of sampled negative examples to train the model, which makes the whole operation much faster.
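
As a rough illustration, the sketch below draws a handful of negative words for one training pair. It assumes the noise distribution is the unigram count raised to the power 0.75, which is the choice I recall from the follow-up word2vec paper and is not stated in this article. The helper names and the toy sentence are hypothetical.

```python
import random
from collections import Counter


def build_noise_distribution(tokens, power=0.75):
    """Unigram counts raised to `power` (0.75 is an assumption, see the text above)."""
    counts = Counter(tokens)
    words = sorted(counts)
    weights = [counts[word] ** power for word in words]
    return words, weights


def sample_negatives(words, weights, positive_words, k, rng):
    """Draw k words from the noise distribution, skipping the true (positive) words."""
    negatives = []
    while len(negatives) < k:
        candidate = rng.choices(words, weights=weights, k=1)[0]
        if candidate not in positive_words:
            negatives.append(candidate)
    return negatives


tokens = "the word vector encodes semantic relationship among words".split()
words, weights = build_noise_distribution(tokens)

# Positive pair: input "word", true context "the"; draw 2 negatives instead of
# treating all remaining vocabulary words as negative labels.
rng = random.Random(0)
negatives = sample_negatives(words, weights, {"word", "the"}, k=2, rng=rng)
print("negatives:", negatives)
```

During training, the model would then update weights only for the true context word and these few negatives rather than for the whole vocabulary.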

If you are not familiar with negative sampling, you may check out this article for more information.

# Apply off-the-shelf word embedding model
Having introduced the history and the model architecture, how can we use word embedding to tackle NLP problems?
There are two approaches to working with word embedding:
- Leveraging an off-the-shelf model
- Building a domain-specific model

This article takes the first approach: it selects 3 well-known pre-trained models and uses gensim to load them. Gensim, a well-known NLP library, already implements an interface for these 3 models.

In [2]:
!pip install tensorflow




In [4]:
import datetime
import numpy as np
import os

import gensim
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec

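# NOTE: tf.contrib and the session/placeholder calls used below exist only in
# TensorFlow 1.x; this code will not run unmodified on TensorFlow 2.x.
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector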

print('gensim Version: %s' % (gensim.__version__))

class WordEmbedding:
    __author__ = "Edward Ma"
    __copyright__ = "Copyright 2018, Edward Ma"
    __credits__ = ["Edward Ma"]
    __license__ = "Apache"
    __version__ = "2.0"
    __maintainer__ = "Edward Ma"
    __email__ = "makcedward@gmail.com"

    def __init__(self, verbose=0):
        self.verbose = verbose
        
        self.model = {}
        
    def convert(self, source, ipnut_file_path, output_file_path):
        if source == 'glove':
            input_file = datapath(ipnut_file_path)
            output_file = get_tmpfile(output_file_path)
            glove2word2vec(input_file, output_file)
        elif source == 'word2vec':
            pass
        elif source == 'fasttext':
            pass
        else:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
        
    def load(self, source, file_path):
        print(datetime.datetime.now(), 'start: loading', source)
        if source == 'glove':
            self.model[source] = gensim.models.KeyedVectors.load_word2vec_format(file_path)
        elif source == 'word2vec':
            self.model[source] = gensim.models.KeyedVectors.load_word2vec_format(file_path, binary=True)
        elif source == 'fasttext':
            self.model[source] = gensim.models.wrappers.FastText.load_fasttext_format(file_path)
        else:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
            
        print(datetime.datetime.now(), 'end: loading', source)
            
        return self
    
    def get_model(self, source):
        if source not in ['glove', 'word2vec', 'fasttext']:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
            
        return self.model[source]
    
    def get_words(self, source, size=None):
        if source not in ['glove', 'word2vec', 'fasttext']:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
        
        if source in ['glove', 'word2vec']:
            if size is None:
                return [w for w in self.get_model(source=source).vocab]
            else:
                results = []
                for i, word in enumerate(self.get_model(source=source).vocab):
                    if i >= size:
                        break
                        
                    results.append(word)
                return results
            
        elif source in ['fasttext']:
            if size is None:
                return [w for w in self.get_model(source=source).wv.vocab]
            else:
                results = []
                for i, word in enumerate(self.get_model(source=source).wv.vocab):
                    if i >= size:
                        break
                        
                    results.append(word)
                return results
        
        raise Exception('Unexpected flow')
    
    def get_dimension(self, source):
        if source not in ['glove', 'word2vec', 'fasttext']:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
        
        if source in ['glove', 'word2vec']:
            return self.get_model(source=source).vectors[0].shape[0]
            
        elif source in ['fasttext']:
            word = self.get_words(source=source, size=1)[0]
            return self.get_model(source=source).wv[word].shape[0]
        
        raise Exception('Unexpected flow')
    
    def get_vectors(self, source, words=None):
        if source not in ['glove', 'word2vec', 'fasttext']:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
        
        if source in ['glove', 'word2vec', 'fasttext']:
            if words is None:
                words = self.get_words(source=source)
            
            embedding = np.empty((len(words), self.get_dimension(source=source)), dtype=np.float32)            
            for i, word in enumerate(words):
                embedding[i] = self.get_vector(source=source, word=word)
                
            return embedding
        
        raise Exception('Unexpected flow')
    
    def get_vector(self, source, word, oov=None):
        if source not in ['glove', 'word2vec', 'fasttext']:
            raise ValueError('Possible value of source are glove, word2vec, fasttext')
            
        if source not in self.model:
            raise ValueError('Did not load %s model yet' % source)
        
        try:
            return self.model[source][word]
        except KeyError as e:
            raise
            
            #TODO
#             if oov is None:
#                 raise
            
#             if 'not in vocabulary' in str(e):
#                 if oov == ''

    def build_visual_metadata(self, embedding, words, file_dir, 
                              metadata_name='metadata.csv', project_model_name='model.ckpt'):
        # Create output directory if not exist
        if not os.path.exists(file_dir):
            os.makedirs(file_dir)

        # Build graph
        tf.reset_default_graph()
        sess = tf.InteractiveSession()

        embedding_graph = tf.Variable([0.0], name='embedding')
        place = tf.placeholder(tf.float32, shape=embedding.shape)

        set_embedding_graph = tf.assign(embedding_graph, place, validate_shape=False)
        sess.run(tf.global_variables_initializer())
        sess.run(set_embedding_graph, feed_dict={place: embedding})

        # Build metadata
        with open(os.path.join(file_dir, metadata_name), 'w') as f:
            for word in words:
                f.write(word + '\n')

        # Build projector
        summary_writer = tf.summary.FileWriter(file_dir, sess.graph)
        config = projector.ProjectorConfig()
        embedding_conf = config.embeddings.add()
        embedding_conf.tensor_name = 'embedding:0'
        embedding_conf.metadata_path = metadata_name
        projector.visualize_embeddings(summary_writer, config)

        # Save model
        saver = tf.train.Saver()
        saver.save(sess, os.path.join(file_dir, project_model_name))

        # Clear
        sess.close()

        
downloaded_glove_file_path = '../text/stanford/glove/glove.6B.50d.txt'
glove_file_path = '../text/stanford/glove/glove.840B.300d.vec'

word2vec_file_path = '../text/google/word2vec/GoogleNews-vectors-negative300.bin'
fasttext_file_path = '../text/facebook/fasttext/wiki.en.bin'

word_embedding = WordEmbedding()


In [None]:
# You may need to convert the text file (downloaded from the GloVe website) to word2vec vector format
# word_embedding.convert(
#      source='glove', ipnut_file_path=downloaded_glove_file_path, output_file_path=glove_file_path)

##### Word2Vec
[Word2Vec](https://code.google.com/archive/p/word2vec/) is provided by Google and trained on Google News data. The model has 300 dimensions and was trained on about 100 billion words.

Mikolov et al. used skip-gram and negative sampling to build this model, which was released in 2013.

##### GloVe
Global Vectors for Word Representation ([GloVe](https://nlp.stanford.edu/projects/glove/)) is provided by the Stanford NLP team. Stanford provides various models with 25, 50, 100, 200 or 300 dimensions, trained on corpora of 2, 6, 42 or 840 billion tokens.

The Stanford NLP team uses word-word co-occurrence probabilities to build the embedding. In other words, if two words co-occur many times, they may have similar meanings, so their vectors will be closer.
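
To show what a word-word co-occurrence statistic looks like, here is a minimal sketch that counts how often two words appear within a small window of each other. It covers only the counting step. The real GloVe pipeline then fits vectors to these statistics, which is not shown. The toy sentences, the window size of 2, the unweighted counts, and the helper names are all assumptions made for this example.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair occurs within `window` positions."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if i != j:
                    counts[(word, tokens[j])] += 1.0
    return counts


def cooccurrence_row(counts, word):
    """Collect the neighbours of `word` and their counts."""
    return {b: c for (a, b), c in counts.items() if a == word}


sentences = ["i buy an apple", "i purchase an apple"]
counts = cooccurrence_counts(sentences, window=2)

print("buy:     ", cooccurrence_row(counts, "buy"))
print("purchase:", cooccurrence_row(counts, "purchase"))
```

In this toy corpus, "buy" and "purchase" end up with identical neighbour counts, which is the kind of signal that pushes their vectors together.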

##### fastText
[fastText](https://fasttext.cc/) is released by Facebook, which provides 3 models with 300 dimensions. One of the pre-trained models is trained with subword information. For example, the word "difference" is also represented by subwords such as "di", "dif", "diff" and so on.
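
The subword idea can be sketched as extracting character n-grams from a word. The helper `char_ngrams` below is written for this example and is not part of the fastText API. The `<` and `>` boundary markers and the 3 to 6 length range follow what I understand to be fastText's defaults, so treat them as assumptions.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("difference", min_n=3, max_n=4))
```

Because a word's vector is built from the vectors of its n-grams, a model trained this way can still produce a vector for a word it never saw, as long as it shares n-grams with known words.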

In [None]:
word_embedding.load(source='word2vec', file_path=word2vec_file_path)
word_embedding.load(source='glove', file_path=glove_file_path)
word_embedding.load(source='fasttext', file_path=fasttext_file_path)

In [None]:
for source in ['glove', 'word2vec', 'fasttext']:
    print('Source: %s' % (source))
    print(word_embedding.get_vector(source=source, word='apple'))

In [None]:
source = 'word2vec'

embedding = word_embedding.get_vectors(source=source)
words = word_embedding.get_words(source=source)
sub_embedding = embedding[:100000]
sub_words = words[:100000]

# Embedding Visualization
Word embedding is one of the state-of-the-art techniques in NLP, but what is it actually? It is a matrix in which each row is the vector of one word. The simplest way to inspect it would be to plot x and y coordinates, but we have 300 dimensions rather than 2.
We can visualize it by using principal component analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimensions. With TensorBoard, the visualization can be presented easily.
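
For intuition about what the PCA option does, here is a minimal NumPy sketch that projects a 300-dimensional matrix down to 2 dimensions. It is a stand-in for illustration, not what TensorBoard runs internally. The random `demo_embedding` matrix and the helper `pca_project` are assumptions made for this example, and in practice you would pass real vectors.

```python
import numpy as np


def pca_project(embedding, n_components=2):
    """Project the rows of `embedding` onto their top principal components."""
    centered = embedding - embedding.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in for a (number of words, 300) embedding matrix
rng = np.random.default_rng(0)
demo_embedding = rng.normal(size=(10, 300)).astype(np.float32)

points_2d = pca_project(demo_embedding)
print(points_2d.shape)  # one (x, y) pair per word
```

Each row of `points_2d` can be drawn on a scatter plot and labelled with its word.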

In [None]:
word_embedding.build_visual_metadata(embedding=sub_embedding, words=sub_words, file_dir='./word_embedding')

![](https://cdn-images-1.medium.com/max/800/1*glwOAs3oK5IOOegT9ru7sw.png)

In [None]:
"""
    To start the tensorboard.
    1. Open terminal
    2. Go to parent directory of file_dir (e.g. parent directory of word_embedding)
    3. execute "tensorboard --logdir=word_embedding" (i.e. the value of --logdir should be the same 
        as what you provided in the previous step)
    4. Open browser to access http://localhost:6006 (depending on your host, the default port is 6006)
"""

# Take Away
To access all code, you can visit my github repo.

- Which off-the-shelf model should be used? It depends on your data, and __it is possible that none of them is useful for your domain-specific data__.
- Should we train a word embedding layer based on our own data? In my experience, if you deal with __domain-specific text and most of your words cannot be found in an off-the-shelf model__, you may consider building a customized word embedding layer.
- TensorBoard shows only the first 100000 vectors due to browser resource concerns. I recommend picking a small portion of the vectors yourself.
- The maximum model sizes of GloVe, Word2Vec and fastText are ~5.5GB, ~3.5GB and ~8.2GB respectively. Loading them takes about 9, 1 and 9 minutes for GloVe, Word2Vec and fastText respectively.

# Reference
- [1] Yoshua Bengio, Réjean Ducharme & Pascal Vincent. A Neural Probabilistic Language Model. 2001. https://papers.nips.cc/paper/1839-a-neural-probabilistic-language-model.pdf
- [2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent & Christian Jauvin. A Neural Probabilistic Language Model. March 2003. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- [3] Ronan Collobert & Jason Weston. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. 2008. https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf
- [4] Tomas Mikolov, Kai Chen, Greg Corrado & Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. September 2013. https://arxiv.org/pdf/1301.3781.pdf