<a href="https://colab.research.google.com/github/rahiakela/deep-learning-for-nlp-by-jason-brownlee/blob/part-3-word-embeddings/1_develop_word_embeddings_with_gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Develop Word Embeddings with Gensim

Word embeddings are a modern approach for representing text in natural language processing. Embedding algorithms like Word2Vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems like machine translation.

We will discover how to train and load word embedding models for natural language processing applications in Python using Gensim.

## Word Embeddings

A word embedding is an approach that provides a dense vector representation of words that captures something about their meaning. Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as word counts and frequencies, which produce large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
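
To make the contrast concrete, here is a minimal sketch of a bag-of-words count encoding in plain Python. The two toy documents and the helper name bag_of_words are illustrative assumptions and are not part of Gensim.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked'],
]

# Vocabulary: one vector position per distinct word in the corpus
vocab = sorted({word for doc in docs for word in doc})

def bag_of_words(doc, vocab):
    """Return a count vector with one entry per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
for vec in vectors:
    print(vec)
```

The vector length grows with the vocabulary, and with a realistic vocabulary most entries are zero. Nothing in these vectors indicates that "cat" and "dog" are related words.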

Word embeddings work by using an algorithm to train a set of fixed-length, dense, continuous-valued vectors on a large corpus of text. Each word is represented by a point in the embedding space, and these points are learned and moved around based on the words that surround the target word. Defining a word by the company it keeps is what allows the embedding to learn something about the meaning of words. The vector space representation of the words provides a projection in which words with similar meanings are locally clustered within the space.
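
Closeness in the embedding space is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors, which are assumptions for illustration and not learned values, to show how words with similar meanings would score higher than unrelated ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real ones are learned and have many more dimensions
embedding = {
    'king':  [0.9, 0.8, 0.1],
    'queen': [0.8, 0.9, 0.1],
    'apple': [0.1, 0.0, 0.9],
}

similar = cosine_similarity(embedding['king'], embedding['queen'])
unrelated = cosine_similarity(embedding['king'], embedding['apple'])
print(similar, unrelated)
```

With these made-up values, the first score is close to 1 and the second is much lower, which is the pattern a trained embedding aims to produce for related and unrelated words.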

The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.

We are going to look at how to use two different word embedding methods: Word2Vec, by researchers at Google, and GloVe, by researchers at Stanford.

## Gensim Python Library

Gensim is an open source Python library for natural language processing, with a focus on topic modeling. It is billed as "topic modeling for humans".

It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. It supports an implementation of the Word2Vec word embedding for learning new word vectors from text.

It also provides tools for loading pre-trained word embeddings in a few formats and for using and querying a loaded embedding.

## Develop Word2Vec Embedding

Word2Vec is one algorithm for learning a word embedding from a text corpus. There are two main training algorithms that can be used to learn the embedding from text:
* **Continuous Bag-of-Words (CBOW)**
* **Skip-gram**

We will not get into the algorithms other than to say that they generally look at a window of words for each target word to provide context and in turn meaning for words.
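
As a rough illustration of what a window of words means, the sketch below lists the (target, context) pairs that a fixed window produces for one sentence. The function name context_pairs is hypothetical, and this is a simplification for intuition only; it is not how Gensim trains the model internally.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for every word within `window` positions."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

sentence = ['this', 'is', 'the', 'first', 'sentence']
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```

Skip-gram learns to predict each context word from the target word, while CBOW predicts the target word from its context words taken together.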

Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. Nevertheless, we will demonstrate the principles using a small in-memory example of text. 

Gensim provides the Word2Vec class for working with a Word2Vec model. Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance.

```python
sentences = ...
model = Word2Vec(sentences)
```

Specifically, each sentence must be tokenized, meaning divided into words and prepared (e.g. perhaps pre-filtered and perhaps converted to a preferred case). The sentences could be text loaded into memory, or an iterator that progressively loads text, required for very large text corpora. 
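
One way to stream a large corpus is a small iterable class that reads a file line by line and yields one token list at a time. This is a sketch under assumptions: the class name SentenceStream, the one-sentence-per-line file layout, and the lowercase-and-split tokenization are all illustrative choices. A class is used rather than a plain generator because training typically makes more than one pass over the data, and an exhausted generator cannot be restarted.

```python
import os
import tempfile

class SentenceStream:
    """Iterate over a text file, yielding one tokenized sentence per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        with open(self.path, encoding='utf-8') as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# Write a tiny demo corpus to a temporary file (a stand-in for a large corpus)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('This is the first sentence\n\nYet another sentence\n')
    corpus_path = tmp.name

sentences = SentenceStream(corpus_path)
first_pass = list(sentences)
second_pass = list(sentences)  # a second pass yields the same sentences
print(first_pass)

os.remove(corpus_path)  # clean up the demo file
```

An instance of such a class can be passed wherever the text refers to sentences, so only one line of the corpus needs to be in memory at a time.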

There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

* **size**: (default 100) The number of dimensions of the embedding, i.e. the length of the dense vector to represent each token (word). In Gensim 4.0 and later this argument is named vector_size.
* **window**: (default 5) The maximum distance between a target word and words around the target word.
* **min_count**: (default 5) The minimum count of words to consider when training the model; words that occur fewer times than this count will be ignored.
* **workers**: (default 3) The number of threads to use while training.
* **sg**: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).

The defaults are often good enough when just getting started. If you have a lot of cores, as most modern computers do, I strongly encourage you to increase workers to match the number of cores (e.g. 8).
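
The effect of min_count can be sketched without Gensim by counting tokens and dropping the rare ones. The helper filter_vocab below is hypothetical and only mimics the thresholding described above; it is not Gensim's vocabulary-building code.

```python
from collections import Counter

def filter_vocab(sentences, min_count=5):
    """Return the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

sentences = [
    ['this', 'is', 'the', 'first', 'sentence'],
    ['this', 'is', 'the', 'second', 'sentence'],
    ['yet', 'another', 'sentence'],
]

keep_all = filter_vocab(sentences, min_count=1)
keep_frequent = filter_vocab(sentences, min_count=2)
keep_default = filter_vocab(sentences, min_count=5)
print(sorted(keep_all))
print(sorted(keep_frequent))
print(sorted(keep_default))
```

With a threshold of 5, every word in a tiny corpus like this one is discarded, which is why small demonstrations usually set min_count to 1.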

After the model is trained, it is accessible via the wv attribute. This is the actual word vector model on which queries can be made.

For example, you can print the learned vocabulary of tokens (words) as follows:

```python
# Gensim 3.x; in Gensim 4.0 and later use model.wv.key_to_index instead
words = list(model.wv.vocab)
print(words)
```

You can review the embedded vector for a specific token as follows:

```python
print(model.wv['word'])
```

Finally, the trained word vectors can be saved to file by calling the save_word2vec_format() function on the word vector model. Passing binary=True writes the vectors in a binary format to save space.

```python
model.wv.save_word2vec_format('model.bin', binary=True)
```

When getting started, you can save the learned vectors in ASCII format and review the contents. This is the default behavior, and you can make it explicit by setting binary=False when calling the save_word2vec_format() function.

```python
model.wv.save_word2vec_format('model.txt', binary=False)
```

Vectors saved this way can be loaded again by calling the KeyedVectors.load_word2vec_format() function. To keep the full model, including the state needed to continue training, call save() on the model itself and restore it with Word2Vec.load().

```python
from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format('model.bin', binary=True)
```
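
The ASCII format is plain text: a header line with the vocabulary size and vector length, followed by one line per word holding the token and its values separated by spaces. A minimal reader, assuming that layout and using made-up values in place of a real file, might look like the sketch below. The function name parse_word2vec_text is hypothetical.

```python
def parse_word2vec_text(text):
    """Parse word2vec text-format content into a {word: vector} dict."""
    lines = text.strip().splitlines()
    vocab_size, dimensions = (int(n) for n in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    # Sanity-check the body against the header
    assert len(vectors) == vocab_size
    assert all(len(v) == dimensions for v in vectors.values())
    return vectors

# Made-up content for illustration; a real file comes from save_word2vec_format()
sample = """2 3
sentence 0.1 -0.2 0.3
the 0.4 0.5 -0.6
"""

vectors = parse_word2vec_text(sample)
print(vectors['sentence'])
```

This is only meant to show what the file contains; in practice Gensim's own loading functions should be preferred.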

We can tie all of this together with a worked example. Rather than loading a large text document or corpus from file, we will work with a small, in-memory list of pre-tokenized sentences. 

The model is trained with the minimum count for words set to 1 so that no words are ignored. After the model is trained, we summarize it, print the vocabulary, and then print a single vector for the word "sentence".

Finally, the model is saved to a file in binary format, loaded, and then summarized.



```python
from gensim.models import Word2Vec

# define training data
sentences = [
  ['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
  ['this', 'is', 'the', 'second', 'sentence'],
  ['yet', 'another', 'sentence'],
  ['one', 'more', 'sentence'],
  ['and', 'the', 'final', 'sentence']
]

# train model
model = Word2Vec(sentences, min_count=1)

# summarize the trained model
print(model)

# summarize vocabulary
words = list(model.wv.vocab)
print(words)

# access vector for one word (via the wv attribute)
print(model.wv['sentence'])

# save model
model.save('model.bin')

# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
```

```
Word2Vec(vocab=14, size=100, alpha=0.025)
['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec', 'second', 'yet', 'another', 'one', 'more', 'and', 'final']
[-2.5527617e-03 -2.4090428e-03 -2.6642578e-03 -4.2772451e-03
  3.5824208e-03  2.7083138e-03  3.0410127e-04 -1.8435670e-03
 -2.1708049e-03 -6.9017644e-04 -2.5778562e-03  3.5773404e-03
  2.3550398e-03 -1.3412331e-03 -2.3548866e-03 -4.1006762e-03
 -3.5198894e-03  1.3351815e-03 -4.1320659e-03 -1.6311593e-03
 -3.5127799e-03 -8.3889306e-04  1.2147062e-03  6.5140001e-04
 -2.3203322e-03 -2.2121693e-04 -2.8515861e-03 -3.9685895e-03
  3.7177545e-03 -4.8733721e-03 -3.2821156e-03 -4.1479613e-03
 -4.9392227e-03  1.4273257e-03  1.9067340e-03  2.2379463e-03
  2.3476388e-03  1.1623964e-03 -3.9065577e-05 -1.3478211e-03
  3.9785937e-03 -4.0813209e-03 -2.7081587e-03  1.0029969e-03
 -2.5956549e-03 -3.2614968e-03 -3.7464250e-03  7.7152805e-04
  3.6150618e-03 -8.0331301e-05  9.4698288e-04 -4.4294670e-03
  5.5344781e-04 -1.9595972e-03
 ...
```


You can see that with a little work to prepare your text document, you can create your own word embedding very easily with Gensim.

## Visualize Word Embedding