<a href="https://colab.research.google.com/github/victorviro/Deep_learning_python/blob/master/Text_Vectorization_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Text Vectorization

Text data requires special preparation before we can use it for predictive modeling. Besides cleaning the text, we cannot feed it directly to machine learning algorithms because they operate on numbers. For example, when we pass an image as input, we really pass a matrix of numeric values where each number represents the intensity of a pixel. Likewise, when fitting a linear regression model with numerical and categorical features, we need to transform the categorical variables into numerical ones. Therefore, we need to convert the text to numbers. This process is called *feature extraction* (or *vectorization*).

Suppose we have a dataset of film reviews together with a variable that tells us whether each review has a positive or negative sentiment (a binary variable where 1 means that the review is positive and 0 otherwise). In this classification task, we want to train a model to predict the sentiment of new reviews that are not yet labeled. In the NLP context, each raw text is usually called a *document* (in this case each review is a document). The set of all raw texts is usually known as the *corpus* (in this case the corpus is the set of all reviews). The *vocabulary* of the corpus is the set of all distinct words in the documents.

**Note**: We can encode sequences of words (sentences or documents), words, or characters. The choice depends on the problem we want to solve and on the level of detail we want to focus on. For example, when generating text, encoding words makes it easier to produce realistic text, whereas if we encode characters the algorithm could even generate words that do not exist. The units that form our sequences of data are called *tokens*. The process of dividing the sequences into these units is called *tokenization*.
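
As a minimal sketch of the difference between word-level and character-level tokens, the snippet below splits a made-up sentence both ways. The regular expression is only one simple assumption about what counts as a word; real tokenizers handle many more cases.

```python
import re

text = "The movie was good!"

# Word-level tokens: lowercase, then keep runs of letters, digits or apostrophes.
word_tokens = re.findall(r"[a-z0-9']+", text.lower())

# Character-level tokens: every character (including spaces) is a token.
char_tokens = list(text.lower())

print(word_tokens)      # ['the', 'movie', 'was', 'good']
print(char_tokens[:5])  # ['t', 'h', 'e', ' ', 'm']
```

Word tokens give a larger vocabulary but shorter sequences; character tokens give a tiny vocabulary but much longer sequences.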



## Integer Encoding and One-hot encoding


A first idea to encode sentences or documents is to assign a unique number (an integer value) to each word of the vocabulary.

![](https://i.ibb.co/zs0yb2J/integer-encoding.png)

This method preserves the order of the words as they occur in the sentences. It's called *Index-Based Encoding* since it maps each word to one index (a number). Note that the sequences of indexes have variable length.
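
A minimal sketch of index-based encoding, assuming whitespace tokenization and indexes that start at 0 in order of first appearance (the helper names `build_vocabulary` and `integer_encode` are hypothetical, chosen for illustration):

```python
def build_vocabulary(documents):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def integer_encode(document, vocabulary):
    """Replace each word by its index, preserving word order."""
    return [vocabulary[word] for word in document.lower().split()]

corpus = ["the movie was good", "the plot was not good at all"]
vocabulary = build_vocabulary(corpus)
encoded = [integer_encode(doc, vocabulary) for doc in corpus]
print(encoded)  # [[0, 1, 2, 3], [0, 4, 2, 5, 3, 6, 7]]
```

The two encoded sequences have different lengths, one index per word of the original sentence.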

This technique is usually used to encode the output target in classification tasks. In the example of sentiment analysis, the sentiment of the review is encoded as 1 positive and 0 negative. In this context, this method is known as *label encoding* or *integer encoding*. 

However, this method is usually a poor choice for encoding documents or for transforming categorical features into numerical ones, because integer values have a natural order and machine learning algorithms may pick up and exploit that order. For ordinal variables this can be acceptable, but for nominal ones it is misleading. Imagine, outside the NLP context, a dataset about people with columns like race, age, etc. A training algorithm can reasonably treat a person who is 30 years old as twice as old as a person who is 15. Now suppose we transform the categorical variable race into a numerical variable using this encoding (0 -> asian, 1 -> european, and 2 -> african). The algorithm could conclude that an african person has "twice as much race" as a european person, which is meaningless.

To solve this, *one-hot encoding* is used. Instead of assigning a single number to each label, we assign a vector. A binary variable is added for each category.

![](https://i.ibb.co/zxGhQkB/one-hot-encoding.png)


In this case, the distance between all encoded words is the same so we do not have the problem we saw previously. A very similar method called *dummy variables*  is commonly used for transforming categorical features to numerical in many machine learning tasks. 
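
The equal-distance property can be checked with a short sketch. The three-word vocabulary below is made up for illustration, and Euclidean distance is one possible choice of distance:

```python
import math

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocabulary = {"cat": 0, "dog": 1, "bicycle": 2}
vectors = {word: one_hot(i, len(vocabulary)) for word, i in vocabulary.items()}

print(vectors["dog"])  # [0, 1, 0]
d_cat_dog = euclidean_distance(vectors["cat"], vectors["dog"])
d_cat_bicycle = euclidean_distance(vectors["cat"], vectors["bicycle"])
print(d_cat_dog == d_cat_bicycle)  # True
```

Any two distinct one-hot vectors differ in exactly two positions, so every pair is at distance $\sqrt{2}$.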

Let's see one of the most classical methods to encode documents of text.

## Bag of words

A bag-of-words model, or BoW for short, is a way of extracting features from sentences and documents for use in modeling, such as with machine learning algorithms. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The intuition is that documents are similar if they have similar content.

We represent every document from the corpus as a fixed-length vector whose length is equal to the size of the vocabulary of the corpus.

![](https://i.ibb.co/d03jmGM/encoding-documents-as-vectors.png)


The bag-of-words model can be as simple or complex as we like. The complexity comes both from deciding how to design the vocabulary of known words (or tokens) and from deciding how to score the presence of known words. We will look at three ways of scoring the presence of words in a document: one-hot, frequency, and TF–IDF.

### One-hot document vectorization (not ordered)



In this method, each document is encoded as a fixed-length binary vector whose length is the size of the vocabulary of known words, where 1 encodes the presence of a word and 0 its absence.

![](https://i.ibb.co/XkX8tyG/one-hot-document-not-ordered2.png)
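
A minimal sketch of this binary encoding, assuming whitespace tokenization and an alphabetically sorted vocabulary (both are illustrative choices, and `binary_bag_of_words` is a hypothetical helper):

```python
def binary_bag_of_words(document, vocabulary):
    """Return a 0/1 vector with one position per vocabulary word."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

corpus = ["the movie was good", "the movie was not good at all"]
# Sorted list of distinct words: the position in this list is the vector index.
vocabulary = sorted({word for doc in corpus for word in doc.lower().split()})
print(vocabulary)  # ['all', 'at', 'good', 'movie', 'not', 'the', 'was']

vectors = [binary_bag_of_words(doc, vocabulary) for doc in corpus]
print(vectors[0])  # [0, 0, 1, 1, 0, 1, 1]
print(vectors[1])  # [1, 1, 1, 1, 1, 1, 1]
```

Both vectors have the same length regardless of how many words each document contains, and the word order is lost.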







### Frequency-based document vectorization

A variation of the previous method uses the frequency or count of each word in the document instead of just its presence/absence. This variation is called *frequency-based* vectorization.

![](https://i.ibb.co/8X3HJxq/frequence-based-vectorization.png)

This representation can either be a straight count (integer) encoding as shown in the previous figure or a normalized encoding where each word is weighted by the total number of words in the document.
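
Both variants can be sketched as follows, assuming a small hand-written vocabulary and whitespace tokenization (the helper names are hypothetical):

```python
from collections import Counter

def count_vector(document, vocabulary):
    """Number of occurrences of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

def normalized_vector(document, vocabulary):
    """Counts divided by the total number of words in the document."""
    total = len(document.split())
    return [count / total for count in count_vector(document, vocabulary)]

vocabulary = ["acting", "good", "movie", "the", "was"]
document = "the movie was good the acting was good"

counts = count_vector(document, vocabulary)
frequencies = normalized_vector(document, vocabulary)
print(counts)       # [1, 2, 1, 2, 2]
print(frequencies)  # [0.125, 0.25, 0.125, 0.25, 0.25]
```

The normalized version makes documents of different lengths comparable, since its entries sum to 1 when every word of the document is in the vocabulary.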







### TF-IDF

A problem with the previous technique is that highly frequent words start to dominate in the document (larger score), but may not contain as much “informational content” to the model as rarer but perhaps domain-specific words.

One approach is to rescale the frequency of words by how often they appear in all documents so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short, where:

- Term Frequency: is a scoring of the frequency of the word in the current document.
- Inverse Document Frequency: is a scoring of how rare the word is across documents.


We interpret the score as follows: when the scores are normalized to lie between 0 and 1, the closer the TF–IDF score of a term is to 1, the more informative that term is to that document. The closer the score is to zero, the less informative that term is.

In the next figure, the token "studio" has higher relevance to this document since it appears with less frequency in the rest of the documents.


![](https://i.ibb.co/6P9wFJ5/tf-idf.png)





#### Computing TF–IDF

TF–IDF is computed on a per-term basis, such that the relevance of a token to a document is measured by the scaled frequency of the appearance of the term in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus.

The term frequency of a term given a document, $\text{tf}(t,d)$, can be the boolean frequency (as in one-hot encoding, 1 if $t$ occurs in $d$ and 0 otherwise), or the raw count $f_{t,d}$. However, both the term frequency and the inverse document frequency are generally scaled logarithmically to prevent bias towards longer documents or towards terms that appear much more frequently than others:
$\text{tf}(t,d)=1+\text{log}(f_{t,d})$ for $f_{t,d}>0$.

Similarly, the inverse document frequency of a term given the set of documents can be logarithmically scaled as follows: $\text{idf}(t,D)=\text{log}\left(1+\frac{N}{n_t}\right)$, where $N$ is the number of documents and $n_t$ is the number of documents in which the term $t$ occurs. TF–IDF is then computed as $\text{tfidf}(t,d,D)=\text{tf}(t,d) \cdot \text{idf}(t,D)$.
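
As a worked example, the sketch below computes TF–IDF on a made-up three-document corpus. It assumes the log-scaled variants $\text{tf}(t,d)=1+\text{log}(f_{t,d})$ (0 when the term is absent) and $\text{idf}(t,D)=\text{log}(1+N/n_t)$, with $n_t$ taken as the number of documents containing $t$; libraries often use slightly different formulas.

```python
import math
from collections import Counter

def tf(term, document_tokens):
    """Log-scaled term frequency: 1 + log(count), or 0 if the term is absent."""
    count = Counter(document_tokens)[term]
    return 1 + math.log(count) if count > 0 else 0.0

def idf(term, corpus_tokens):
    """log(1 + N / n_t), with n_t the number of documents containing the term."""
    n_docs = len(corpus_tokens)
    n_t = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(1 + n_docs / n_t) if n_t > 0 else 0.0

def tfidf(term, document_tokens, corpus_tokens):
    return tf(term, document_tokens) * idf(term, corpus_tokens)

corpus = [
    "the studio released the movie",
    "the movie was good",
    "the plot was bad",
]
corpus_tokens = [doc.split() for doc in corpus]
first = corpus_tokens[0]

score_the = tfidf("the", first, corpus_tokens)
score_studio = tfidf("studio", first, corpus_tokens)
print(score_studio > score_the)  # True
```

Although "the" occurs twice in the first document and "studio" only once, "studio" gets the higher score because it appears in a single document of the corpus.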


### Design the vocabulary of words

Since every document from the corpus is represented as a fixed-length vector whose length is equal to the size of the vocabulary, the vector representation of documents grows as the vocabulary grows. We can imagine that for a very large corpus, such as thousands of books, the length of the vector might be thousands or millions of positions. Further, each document may contain very few of the known words in the vocabulary. This results in a vector with lots of zero scores, which is known as a sparse vector or sparse representation.

Sparse vectors require more memory and computational resources when modeling and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.

As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model. There are simple text cleaning techniques that can be used as a first step, such as:

- Ignoring punctuation
- Ignoring frequent words that don’t contain much information, called *stop words*, like "a", "of", etc.
- Fixing misspelled words.
- Reducing words to their stem (e.g. "play" from "playing") is called  *Stemming*. Reducing words to their lemma (e.g. "be" from "was") is called *Lemmatization*.

A more sophisticated approach is to create a vocabulary of grouped words. This both changes the scope of the vocabulary and allows the bag-of-words to capture a little more meaning from the document. In this approach, each word or token is called a "gram". A vocabulary of two-word pairs is, in turn, called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible bigrams. A vocabulary that tracks triplets of words is called a trigram model, and the general approach is called the n-gram model, where n refers to the number of grouped words.



![](https://i.ibb.co/r7m59h8/n-grams.png)

For example, the sentences "the movie was good, not bad at all" and "the movie was bad, not good at all" have the opposite meaning. However, their encoding will be equal if we use a unigram model since both sentences have the same words. We can solve this issue using a bigram model.
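
The claim about these two sentences can be checked with a short sketch. The `ngrams` helper and its simple regular-expression tokenizer are assumptions made for illustration:

```python
import re

def ngrams(text, n):
    """Return the list of word n-grams (as tuples) in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "the movie was good, not bad at all"
b = "the movie was bad, not good at all"

# Compare the multisets of n-grams, ignoring their order as a bag-of-words does.
same_unigrams = sorted(ngrams(a, 1)) == sorted(ngrams(b, 1))
same_bigrams = sorted(ngrams(a, 2)) == sorted(ngrams(b, 2))
print(same_unigrams, same_bigrams)  # True False
```

Bigrams such as `('was', 'good')` and `('not', 'bad')` occur only in the first sentence, so the two bigram encodings differ.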


### Limitations of Bag-of-Words


The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on our specific text data. It has been used with great success on prediction problems like language modeling and document classification.

Nevertheless, it suffers from some shortcomings, such as:

- **Vocabulary**: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.

- **Sparsity**: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.

- **Meaning**: Discarding word order ignores the context and in turn meaning of words in the document (semantics). Context and meaning can offer a lot to the model, that if modeled could tell the difference between the same words differently arranged ("this is interesting" vs "is this interesting"), synonyms ("old bike" vs "used bike"), and much more. The order of the words in a text is also important, for example, to take into account negations or grammar structures. On the other hand, some more primitive NLP techniques and machine learning algorithms might not make use of the word order anyway.




### Implementation of bag of words in Python

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of our text data. The class `CountVectorizer` can be used for frequency-based document vectorization and the class `TfidfVectorizer` for TF-IDF. An example using the TF-IDF vectorizer for a sentiment analysis task can be seen [here](https://www.kaggle.com/viroviro/sentiment-analysis-tf-idf-logistic-regression). Alternatively, we can use Gensim or NLTK (see implementation [here](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html)).
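
A minimal sketch of both vectorizers on a made-up two-sentence corpus, assuming scikit-learn is installed. Note that `TfidfVectorizer` uses a smoothed variant of idf and L2-normalizes each row by default, so its scores will not match a hand computation with other formulas.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was good",
    "the movie was not good at all",
]

# Frequency-based vectors; binary=True would give presence/absence instead.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(corpus)  # sparse matrix, one row per document
print(sorted(count_vectorizer.vocabulary_))      # learned vocabulary
print(counts.toarray())

# TF-IDF vectors; ngram_range=(1, 2) adds bigrams to the vocabulary.
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = tfidf_vectorizer.fit_transform(corpus)
print(tfidf.shape)  # (2, 14): 7 unigrams + 7 distinct bigrams
```

Both classes return sparse matrices, which avoids storing the many zeros discussed above.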



## One-Hot encoding ordered

Some machine learning algorithms can build an internal representation of items in a sequence, like ordered words in a sentence. Recurrent Neural Networks (RNNs), for example, can exploit the sequence order for better classification results (see notebook [Introduction to RNNs](https://github.com/victorviro/Deep_learning_python/blob/master/Introduction_to_RNNs.ipynb)).

The encoded words are fed sequentially into the model. Note that the sentences can have a variable number of words, but RNNs can handle variable-length inputs.

![](https://i.ibb.co/3pQxWCf/one-hot-ordered.png)

Remember that with this technique, each word is encoded as a fixed-length vector whose length is equal to the number of words of the vocabulary.
Therefore, when we encode characters, this method can work well since the size of the vocabulary is not too large. But, when we want to encode words, the size of the vocabulary increases considerably, leading to a very large and very sparse representation of the words.


Furthermore, the distance or similarity between all encoded words will be the same (since each vector points along its own dimension). So, this technique has limitations. For example, the tokens "cat" and "dog" (both animals) will be as far from each other as from an unrelated token like "bicycle".
 
![](https://i.ibb.co/1fFJqPd/3d-plo.png)


Can we use the direction of the vectors to encode the relationship between these words?

![](https://i.ibb.co/kXyf7ZG/3d-plo2.png)

Yes, we can do it. Let's see how word embeddings work.

## Word embeddings

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

One of the benefits of using dense and low-dimensional vectors is computational: the majority of neural network toolkits do not play well with very high-dimensional, sparse vectors. The key to the approach is the idea of using a densely distributed representation for each word. Words are represented as real-valued vectors in a predefined vector space, often tens or hundreds of dimensions. This is contrasted to the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding.

Word embedding methods learn a real-valued vector representation for a predefined fixed-sized vocabulary from a corpus of text. The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process, using document statistics. The distributed representation is learned based on the usage of words. This allows words that are used in similar ways to result in having similar representations, naturally capturing their meaning. This can be contrasted with the crisp but fragile representation in a bag-of-words model where, unless explicitly managed, different words have different representations, regardless of how they are used.

Let's see the principal Word Embedding Algorithms:

### Embedding Layer

An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification. It usually requires that document text be cleaned and prepared such that each word is one-hot encoded (in practice, libraries such as Keras take the equivalent integer index of each word). The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.

If a Multilayer Perceptron model is used, then the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, then each word may be taken as one input in a sequence. This approach of learning an embedding layer requires a lot of training data and can be slow, but will learn an embedding targeted both to the specific text data and to the NLP task.

Keras offers a flexible `Embedding` layer that can be used for neural networks on text data in a variety of ways (more info [here](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)).
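
Conceptually, an embedding layer is a trainable matrix with one row per vocabulary word. The NumPy sketch below only illustrates the forward lookup (no training), with made-up sizes; it is not how any particular library implements its layer.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocabulary_size = 5  # hypothetical vocabulary of 5 words
embedding_dim = 3    # size of the dense vector space

# The layer's weights: one row per word, initialized with small random numbers.
embedding_matrix = rng.normal(scale=0.01, size=(vocabulary_size, embedding_dim))

word_index = 2
one_hot = np.zeros(vocabulary_size)
one_hot[word_index] = 1.0

# Multiplying a one-hot vector by the matrix selects a single row...
via_product = one_hot @ embedding_matrix
# ...so in practice the matrix is simply indexed with the integer word index.
via_lookup = embedding_matrix[word_index]

# A sentence (sequence of word indices) becomes a sequence of dense vectors.
sentence = np.array([0, 2, 4, 2])
sentence_vectors = embedding_matrix[sentence]
print(sentence_vectors.shape)  # (4, 3)
```

During training, backpropagation would update the rows of `embedding_matrix` that correspond to the words seen in each batch.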


### Word2Vec

[Word2Vec](https://arxiv.org/abs/1301.3781) is a statistical method for efficiently learning a standalone word embedding from a text corpus. It was developed by Google in 2013.

Additionally, the work involved analysis of the learned vectors and the exploration of vector math on the representations of words. The word vectors can be reduced to two or three-dimensions using a dimensionality reduction technique to explore relationships between certain words.

![](https://i.ibb.co/8dHdyKk/wor2vec-relatiuonships.png)

An interesting tool to visualize these relationships can be seen in a [tensorflow projection](https://projector.tensorflow.org/) where the word vectors are reduced to a three-dimensional space.

Two different learning models were introduced that can be used as part of the Word2Vec approach to learn the word embedding:

-  Continuous Bag-of-Words, or CBOW model.
-  Continuous Skip-Gram Model.

The CBOW model learns the embedding by predicting the current target word (the center word) based on its context words (its surrounding words). The continuous skip-gram model learns by predicting the surrounding words given a current word.

Both models are focused on learning about words given their local usage context, where the context is defined by a window of neighboring words. This window is a configurable parameter of the model. The size of the sliding window has a strong effect on the resulting vector similarities. Large windows tend to produce more topical similarities, while smaller windows tend to produce more functional and syntactic similarities.
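
To make the difference between the two models concrete, the sketch below builds the training examples that each one would see for a toy sentence. It only generates (input, target) pairs with a symmetric window; the neural network itself is not shown, and the helper names are hypothetical.

```python
def context_windows(tokens, window):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]

tokens = "the quick brown fox jumps".split()
print(cbow_examples(tokens, window=1)[1])
# (['the', 'brown'], 'quick')
print(skipgram_examples(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Increasing `window` adds more distant words to each context, which is the parameter discussed above.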

The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings to be learned (more dimensions) from much larger corpora of text (billions of words).

A notebook explaining the CBOW model can be seen [here](https://github.com/victorviro/Deep_learning_python/blob/master/Word2vec_CBOW_model.ipynb).

### Doc2vec

[Doc2vec](https://cs.stanford.edu/~quocle/paragraph_vector.pdf) is a model for representing documents as vectors and is a generalization of the word2vec method. After understanding what word2vec is, it will be easier to understand how doc2vec works.

Let's remember the CBOW model where the task
is to predict a word given the other words in a context.

![](https://i.ibb.co/gtXVNB0/CBOW-model.png)

In this model, every word is mapped to a unique vector, represented by a column in a matrix $W$. The column is indexed by the position of the word in the vocabulary. 
The concatenation or sum of the vectors is then used as features
for prediction of the next word in a sentence (for more details see notebook [CBOW model](https://github.com/victorviro/Deep_learning_python/blob/master/Word2vec_CBOW_model.ipynb)).



#### PV-DM

The goal of doc2vec is to create a numeric representation of a document, regardless of its length. The idea that Le and Mikolov used is simple: they took the word2vec model and added another vector called the *paragraph vector* (Paragraph id in the image below).

![](https://i.ibb.co/vXDcRRT/doc2vec.png)

That is, instead of using just words to predict the next word, we also add another feature vector, which is unique to each document. So, when training the word vectors, the document vector is trained as well, and at the end of the training it holds a numeric representation of the document, represented by a column in matrix $D$. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.

This model is called *Distributed Memory model of Paragraph Vector* (PV-DM) since it acts as a memory that remembers what is missing from the current context or the topic of the paragraph.

The paragraph vector takes into consideration the ordering of words within a narrow context, similar to an n-gram model. The combined result is much more effective than a bag-of-words or bag-of-n-grams model because it generalizes better and has a lower dimensionality but still is of a fixed length so it can be used in common machine learning algorithms.

#### PV-DBOW

The previous method considers the concatenation of the paragraph vector with the word vectors to predict the next word in a text window. Another way is to ignore the context words in the input,  but force the model to predict words randomly sampled from the paragraph in the output.

![](https://i.ibb.co/HNHxfjT/doc2vec-2.png)

This means that at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector. This version is called the *Distributed Bag of Words version of Paragraph Vector* (PV-DBOW).

This model requires storing less data, since we only need to store the softmax weights, as opposed to both softmax weights and word vectors in the previous model. This model is also similar to the Skip-gram model for word vectors.

Training a doc2vec model is rather straightforward in Gensim.




### GloVe

While word2vec is a predictive model (a feed-forward neural network that learns vectors to improve the predictive ability), Global Vectors for Word Representation, or [GloVe](https://nlp.stanford.edu/projects/glove/), is a count-based model for efficiently learning word vectors, developed by Pennington, et al. at Stanford. Classical vector space model representations of words were developed using matrix factorization techniques such as [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) (LSA) that do a good job of using global text statistics but are not as good as the learned methods like Word2Vec at capturing meaning and demonstrating it on tasks like calculating analogies (e.g. the King and Queen example).
GloVe is an approach that marries the global statistics of matrix factorization techniques like LSA with the local context-based learning in Word2Vec. Rather than using a window to define local context, GloVe constructs a word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may yield generally better word embeddings. GloVe is a global log-bilinear regression model with a weighted least-squares objective for the unsupervised learning of word representations; according to its authors, it outperformed other models on word analogy, word similarity, and named entity recognition tasks.



The model is based on the simple idea that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning, which can be represented as vector differences. Let the matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ denote the number of times word $j$ occurs in the context of word $i$. An example co-occurrence matrix might look as follows.

![](https://i.ibb.co/c1VRfqv/co-ocurrence-matrix.png)



Let $X_i=\sum_{k} X_{ik}$ be the number of times any word appears in the context of word $i$. Finally, let $P_{ij}=P(j|i)=\frac{X_{ij}}{X_i}$ be the probability that word $j$ appears in the context of word $i$.
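
These quantities can be sketched on a made-up corpus. The example assumes a symmetric context window and plain (unweighted) counts, which is a simplification of how co-occurrences may be counted in practice; the helper names are hypothetical.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window):
    """X[i][j] = number of times word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.split()
        for pos, word in enumerate(tokens):
            start = max(0, pos - window)
            end = min(len(tokens), pos + window + 1)
            for other in range(start, end):
                if other != pos:
                    X[word][tokens[other]] += 1
    return X

def cooccurrence_probability(X, i, j):
    """P(j | i) = X_ij / X_i, with X_i the sum of row i."""
    X_i = sum(X[i].values())
    return X[i].get(j, 0) / X_i if X_i else 0.0

sentences = ["ice is solid water", "steam is hot water", "ice is cold"]
X = cooccurrence_counts(sentences, window=2)

print(dict(X["ice"]))                               # {'is': 2, 'solid': 1, 'cold': 1}
print(cooccurrence_probability(X, "ice", "solid"))  # 0.25
```

With a symmetric window the matrix is symmetric, so `X["ice"]["solid"]` equals `X["solid"]["ice"]`.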

Suppose we are interested in the concept of thermodynamic phase, for which we might take $i=\text{ice}$ and $j=\text{steam}$. The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, $k$. For words $k$ related to ice but not steam, say $k=\text{solid}$, $P_{ik}=P(\text{solid}|\text{ice})$ will be relatively high, and $P_{jk}=P(\text{solid}|\text{steam})$ will be relatively low, so the ratio $\frac{P_{ik}}{P_{jk}}$ will be large. Similarly, for words $k$ related to steam but not ice, say $k=\text{gas}$, $P_{ik}$ will be relatively low, and $P_{jk}$ will be relatively high, so the ratio should be small. For words $k$ like water or fashion, which are either related to both ice and steam or to neither, the ratio should be close to one.

 ![](https://i.ibb.co/jLq6GJq/Selection-742.png)



The ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion) than the raw probabilities are, and it is also better able to discriminate between the two relevant words. So, if we can find a way to incorporate $\frac{P_{ik}}{P_{jk}}$ into the computation of word vectors, we will achieve the goal of using global statistics when learning them. The most general model takes the form

$$F(w_i,w_j,\tilde{w}_k)=\frac{P_{ik}}{P_{jk}}$$

that is, a function $F$ which takes the word vectors of $i$, $j$, and $k$ and outputs the ratio we are interested in. Note that $w$ and $\tilde{w}$ are two embedding layers. According to the paper, both layers often perform equivalently and differ only because of their different random initializations. However, having two layers helps the model reduce overfitting.

Word vector spaces have an approximately linear structure. For example, we can perform arithmetic in embedding space, such as $w_{\text{king}}-w_{\text{male}}+w_{\text{female}} \approx w_{\text{queen}}$. With this in mind, we can restrict our consideration to those functions $F$ that depend only on the difference of the two target words, modifying the above equation to 

$$F(w_i-w_j,\tilde{w}_k)=\frac{P_{ik}}{P_{jk}}$$


Word vectors are high-dimensional vectors, however $\frac{P_{ik}}{P_{jk}}$ is a scalar. We can introduce a transpose and a dot product between the two entities to prevent $F$ from mixing the vector dimensions in undesirable ways.

$$F((w_i-w_j)^T\tilde{w}_k)=\frac{P_{ik}}{P_{jk}}$$

So, if we treat a word vector as a $D \times 1$ matrix, $(w_i-w_j)^T$ will have shape $1 \times D$, which gives a scalar when multiplied by $\tilde{w}_k$.

Next, we assume that $F$ has a certain property (it is a homomorphism between the additive group and the multiplicative group), which gives

$$F((w_i-w_j)^T\tilde{w}_k)=\frac{F(w_i^T\tilde{w}_k)}{F(w_j^T\tilde{w}_k)}=\frac{P_{ik}}{P_{jk}}$$

In other words, this particular homomorphism ensures that the subtraction $F(A-B)$ can also be represented as a division $\frac{F(A)}{F(B)}$ and get the same result. Therefore,

$$F(w_i^T\tilde{w}_k)=P_{ik}=\frac{X_{ik}}{X_{i}}$$

If $F$ is the exponential function then

$$w_i^T\tilde{w}_k=\text{log}(P_{ik})=\text{log}(\frac{X_{ik}}{X_{i}})=\text{log}(X_{ik})-\text{log}(X_{i})$$

The term $\text{log}(X_{i})$ is independent of $k$ so it can be absorbed into a bias $b_i$ for $w_i$. Finally, an additional bias $\tilde{b}_k$ is added for $\tilde{w}_k$.

$$w_i^T\tilde{w}_k + b_i + \tilde{b}_k = \text{log}(X_{ik})$$

In an ideal setting, where we have perfect word vectors, the difference between the two sides of the above equation will be zero. So we use the square of that difference as our cost function.

$$J(w_i,w_j)=(w_i^T\tilde{w}_j + b_i + \tilde{b}_j-\text{log}(X_{ij}))^2$$

Note that squaring the difference makes this a squared-error cost function. Also, $k$ has been replaced with $j$.

But if $X_{ik}=0$ then $\text{log}(X_{ik})=\text{log}(0)$, which is undefined. An easy fix would be to use $\text{log}(1+X_{ik})$, an additive shift inside the logarithm. But the GloVe paper proposes a sleeker alternative: introducing a weighting function $f$ that vanishes for zero counts.

$$J(w_i,w_j)=f(X_{ij})(w_i^T\tilde{w}_j + b_i + \tilde{b}_j-\text{log}(X_{ij}))^2$$

where 
$$
f(x)=\begin{cases}
              (\frac{x}{x_{\text{max}}})^{\alpha} & \text{if }x< x_{\text{max}},\\
              1 & \text{otherwise}
\end{cases}
$$


The final cost function gives us the model

$$J=\sum_{i,j=1}^{V}f(X_{ij})(w_i^T\tilde{w}_j + b_i + \tilde{b}_j-\text{log}(X_{ij}))^2$$

where $V$ is the size of the vocabulary.
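
The cost can be sketched in NumPy as follows. The sketch assumes the weighting function $f(x)=(x/x_{\text{max}})^{\alpha}$ for $x<x_{\text{max}}$ and $f(x)=1$ otherwise, with $x_{\text{max}}=100$ and $\alpha=3/4$ as default values; the co-occurrence matrix and the parameters are made up, and no optimization step is shown.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha below x_max, and 1 otherwise."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares cost summed over the non-zero entries of X."""
    i_idx, j_idx = np.nonzero(X)  # f(0) = 0, so zero counts contribute nothing
    counts = X[i_idx, j_idx]
    predictions = np.sum(W[i_idx] * W_tilde[j_idx], axis=1) + b[i_idx] + b_tilde[j_idx]
    errors = predictions - np.log(counts)
    return float(np.sum(weight(counts) * errors ** 2))

rng = np.random.default_rng(seed=0)
V, D = 4, 3  # toy vocabulary size and vector dimension
X = np.array([[0, 2, 1, 0],
              [2, 0, 0, 3],
              [1, 0, 0, 1],
              [0, 3, 1, 0]], dtype=float)  # made-up co-occurrence counts
W, W_tilde = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, b_tilde = np.zeros(V), np.zeros(V)

cost = glove_cost(W, W_tilde, b, b_tilde, X)
print(cost >= 0)  # True
```

Restricting the sum to non-zero counts is what makes the undefined $\text{log}(0)$ terms harmless: their weight is zero, so they can be skipped entirely.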


### FastText

One major drawback of word-embedding techniques like word2vec and GloVe is their inability to deal with out-of-vocabulary words. These embedding techniques treat words as the minimal unit and learn one embedding vector per word. Hence, when a word does not appear in the training corpus, word2vec and GloVe cannot produce a vector representation for it. This is a limitation, especially for languages with large vocabularies and many rare words. Another weakness of these models is that for words sharing the same root, such as *eat* and *eaten*, Word2Vec doesn’t do any parameter sharing. Each word is learned independently based on the contexts it appears in. Thus, there is scope for utilizing the internal structure of the word to make the process more efficient.

[FastText](https://arxiv.org/abs/1607.04606) is an extension to Word2Vec proposed by Facebook in 2016. Instead of feeding individual words into the neural network, FastText breaks words into several character $n$-grams (sub-words). Taking the word *where* and $n=3$ as an example, it will be represented by the character $n$-grams: `<wh, whe, her, ere, re>`. The symbols `<` and `>` are special boundary symbols added to mark the start and end of the token, which makes it possible to distinguish prefixes and suffixes from other character sequences (the tri-gram `her` from the word *where* is different from the sequence `<her>` that represents the whole word *her*).

Each word is represented by the sum of the vector representations of its $n$-grams.

After training the neural network, we will have embeddings for all the $n$-grams found in the training dataset. Words that do not appear in the training corpus can now be represented, since it is highly likely that their $n$-grams also appear in other words.

One of the major drawbacks of this model is its high memory requirement, since it learns embeddings for character $n$-grams and not just for words. We can control the number of $n$-gram embeddings by applying hashing. Instead of learning an embedding for each unique $n$-gram, we learn a total of $K$ embeddings, where $K$ denotes the number of buckets. Each character $n$-gram is mapped to an integer between 1 and $K$. Though this could result in collisions, it helps control the vocabulary size. The paper uses the FNV-1a variant of the Fowler-Noll-Vo hashing function to hash character sequences to integer values.
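
The $n$-gram extraction and the hashing trick can be sketched as follows. This is an illustration under simplifying assumptions (a single value of $n$, a standard 32-bit FNV-1a hash, 0-based buckets, hypothetical helper names) and is not the official fastText code, which may differ in details.

```python
def char_ngrams(word, n):
    """Character n-grams of `word` after adding the boundary symbols < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261                  # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % 2**32  # multiply by the FNV prime, keep 32 bits
    return h

def bucket(ngram, num_buckets):
    """Map an n-gram to one of `num_buckets` embedding rows (0-based here)."""
    return fnv1a_32(ngram) % num_buckets

print(char_ngrams("where", 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

K = 1000  # hypothetical number of buckets
rows = [bucket(g, K) for g in char_ngrams("where", 3)]
```

The vector of a word is then the sum of the embedding rows selected by `rows`, so two different $n$-grams that collide in the same bucket simply share one embedding.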

To train our own embeddings, we can either use the official [tool](https://github.com/facebookresearch/fastText) created by Facebook (see [tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html)) or we can use the [fastText implementation](https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html) available in gensim (see [tutorial](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb)).

Pre-trained word vectors trained on Common Crawl and Wikipedia for 157 languages are available [here](https://fasttext.cc/docs/en/crawl-vectors.html) and variants of English word vectors are available [here](https://fasttext.cc/docs/en/english-vectors.html).

### Recent word embedding models

Consider the sentences "The man was accused of robbing a bank" and "The man went fishing by the bank of the river". The word "bank" has a different meaning in these two sentences, but models like word2vec, GloVe, or fastText would produce the same word embedding for it because these models are context-independent (they output just one vector or embedding for each word, combining all the different senses of the word into one vector).

Recent models like [ELMo](https://arxiv.org/abs/1802.05365) or [BERT](https://arxiv.org/abs/1810.04805) can generate different embeddings for a word depending on its context, that is, on the sentence in which it appears. For instance, in the example above, the word "bank" from the first sentence (financial establishment) would be closer to words like "finance", "investment", "fund", etc., whereas the word "bank" from the second sentence (the land alongside a river or lake) would be closer to words like "river", "lake", "edge", etc.

This difference is a consequence of the fact that word2vec and GloVe do not take word order into account in their training, while ELMo and BERT do (ELMo uses LSTMs; BERT uses the Transformer, an attention-based model with positional encodings to represent word positions).

A practical implication of this difference is that we can use word2vec and GloVe vectors trained on a large corpus directly for downstream tasks. All we need is the vectors for the words. There is no need for the model that was used to train these vectors. However, in the case of ELMo and BERT, since they are context-dependent, we still need the model after training, because it generates the vector for a word based on its context. We can use just the context-independent vectors for a word if we choose to (taking the raw trained vector from the trained model), but that would defeat the very purpose of these models.

In the next notebooks, we will explore these models.

# References



- [Text Vectorization, Get Applied Text Analysis with Python](https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html)

- [Introduction bag of words](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)

- [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)

- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)


- [Introduction to word embedding Keras](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

- [Skip-gram model explained](https://medium.com/datadriveninvestor/word2vec-skip-gram-model-explained-383fa6ddc4ae)

- [Word Vectors with Word2Vec](https://deeplearningdemystified.com/article/nlp-1)

- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)


- [Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)

- [A gentle introduction to Doc2Vec](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)

- [Text classification with doc2vec](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4)

- [Word Embedding & GloVe](https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6)

- [Intuitive Guide to Understanding GloVe Embeddings](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010)

- [GloVe word vectors Coursera](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)

- [Global Vectors (GloVe)](https://medium.com/sciforce/word-vectors-in-natural-language-processing-global-vectors-glove-51339db89639)


- [A Visual Guide to FastText Word Embeddings](https://amitness.com/2020/06/fasttext-embeddings/)

- [Understanding fasttext](https://medium.com/@adityamohanty/understanding-fasttext-an-embedding-to-look-forward-to-3ee9aa08787)

- [FastText model in gensim other (tutorial)](https://radimrehurek.com/gensim/models/fasttext.html)

- [Word2Vec and FastText Word Embedding with Gensim](https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c)



