# Text embeddings with Word2Vec, Sentence2Vec and Doc2Vec

Many machine learning models require numeric features, so text must be converted to numbers before it can be used in a model. A straightforward way is to one-hot encode the text, but this results in extremely sparse datasets and an explosion in the number of columns. Text embedding is a smarter way to convert text to numeric features. We will look at Word2Vec, Sentence2Vec and Doc2Vec.

### One hot encoding

To better understand why word embeddings are useful, we start by looking at how one-hot encoding works. Suppose we have the sentence "The cat sat on the mat". Ignoring case, there are 5 unique words in this sentence, so after one-hot encoding it would look like the matrix below:

![one_hot_encode](one_hot_encode.png)

With just 5 words in our vocabulary, we have 5 new columns. Not only is this sparse, there is also no meaningful relationship between the numeric representations of words, as each is just a vector of 0s with a single 1 in the column mapped to the word. It would be better if we could compress words into denser numeric features that can be compared against each other.
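As a rough illustration of the encoding described above, here is a minimal sketch in plain Python. The names `vocab`, `word_to_index` and `one_hot` are made up for this example and are not part of any library.

```python
# Toy sentence from the text; lowercased so "The" and "the" count as one word.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Vocabulary in order of first appearance: one column per unique word.
vocab = list(dict.fromkeys(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of 0s with a single 1 in the word's column."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

# One row per token in the sentence, one column per vocabulary word.
one_hot_matrix = [one_hot(token) for token in tokens]
for token, row in zip(tokens, one_hot_matrix):
    print(f"{token:>4} {row}")

# The dot product of two different one-hot vectors is always 0,
# so the encoding says nothing about how related two words are.
dot = sum(a * b for a, b in zip(one_hot("cat"), one_hot("mat")))
print(dot)
```

Every pair of distinct words ends up equally far apart, which is exactly the limitation that embeddings try to address.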

### Word2Vec

How can we map words to meaningful numeric vectors? What do we mean when we intuitively think of 2 words as similar? The subtleties of language cannot be captured in any simple way; however, word2vec takes two approaches which seem to work well in practice:

1. Skip-gram approach - given a word, predict its surrounding words
2. Continuous bag of words (CBOW) approach - given surrounding words, predict the word

In both cases, the intuition is that if two words are similar, their contexts (i.e. the words surrounding them) should be similar too. The beauty of this approach is that we do not need to provide any labels. Given any text, our labels are simply the surrounding words (or the word itself in the case of CBOW).

Word2Vec is actually just a simple neural network.

![neural_net_word2vec](neural_net_word2vec.png)

The diagram above shows 3 layers:

1. First, we have the input layer: a one-hot vector over all the unique words in the vocabulary. The length of the vector is the number of unique words.
2. In the middle, we have the hidden layer. The length of this vector will be the size of our embedded numeric representation of each word.
3. Lastly, we have an output layer. This is the same length as the input layer, with each node being a word.

In the case of skip-gram, the input is just a word. The labels for this input are the words that fall within the window before and after it. For example, for the sentence 'the cat sat on the mat', suppose we have window size 1. Given the input 'cat', there would be two targets: 'the' and 'sat'.
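To make the windowing concrete, the sketch below generates (input, target) pairs for a tokenised sentence. The helper `skipgram_pairs` is a hypothetical name written for this post, and it uses a fixed window; it is only meant to show the idea, not how gensim builds its training examples internally.

```python
def skipgram_pairs(tokens, window=1):
    """Return (input_word, target_word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the start and end of the sentence.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own target
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)

# Targets generated for the input word 'cat'.
cat_targets = [target for word, target in pairs if word == "cat"]
print(cat_targets)  # ['the', 'sat']
```

Each pair becomes one training example: the first word is the one-hot input and the second is the word the network should predict.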

That is all there really is to it. We have an input and an output. There is no activation function in the hidden layer, and the output layer is just a softmax layer. You may ask, 'OK, but where is the embedding for each word?'. For each input word, there is a connection to each node in the hidden layer. In the diagram above, each input node is connected to 300 nodes in the hidden layer. Thus, there will be 300 weights from each input node to the nodes in the hidden layer. These 300 weights are the embedding for that word! For each input node, we have a different set of 300 weights.
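The claim that the input-to-hidden weights are the embedding can be checked with a few lines of NumPy. This is only a sketch of a single forward pass: the weights are random and untrained, the embedding size is 3 rather than 300 to keep things readable, and the names `W_in` and `W_out` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab = ["the", "cat", "sat", "on", "mat"]
vocab_size = len(vocab)
embedding_size = 3  # the diagram uses 300; 3 keeps the example small

# Input-to-hidden weights: one row of `embedding_size` weights per word.
W_in = rng.normal(size=(vocab_size, embedding_size))
# Hidden-to-output weights: one column of scores per vocabulary word.
W_out = rng.normal(size=(embedding_size, vocab_size))

# One-hot input for 'cat'.
x = np.zeros(vocab_size)
x[vocab.index("cat")] = 1.0

# No activation in the hidden layer: it is just a matrix product,
# which picks out the row of W_in that belongs to 'cat'.
hidden = x @ W_in
print(np.allclose(hidden, W_in[vocab.index("cat")]))

# Softmax output: one probability per word in the vocabulary.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.round(3))
```

Because the input is one-hot, multiplying by `W_in` is the same as looking up a row, which is why the rows of that matrix can be read off as the word embeddings once training is done.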

CBOW is very similar, with only the input and output effectively being swapped. This is clearly shown in the image below (with window size 2).

![cbow_vs_skipgram](cbow_vs_skipgram.png)
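For CBOW, each training example pairs a list of context words with the single word in the middle. A minimal sketch, again with a fixed window and a hypothetical helper name (`cbow_examples`), might look like this:

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target_word) examples for every position."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        examples.append((left + right, target))
    return examples

tokens = "the cat sat on the mat".split()
for context, target in cbow_examples(tokens, window=2):
    print(context, "->", target)
```

Compared with skip-gram, the roles are reversed: several context words go in together and one word comes out.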

### Python implementation of Word2Vec

We will now have a go at implementing this in Python. We will use a dataset which contains text from different news articles. We will generate word embeddings for each word in this text, using the gensim package.

Let us read in this data and have a look at a few rows.

In [21]:
import pandas as pd
from gensim.models import Word2Vec
from gensim.utils import tokenize

In [11]:
# read in data
true = pd.read_csv("True.csv")
fake = pd.read_csv("Fake.csv")

# assign target for true or fake news
true['label'] = 'True'
fake['label'] = 'Fake'

df = pd.concat([true, fake], axis=0)
df.head()

| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | True |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | True |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | True |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 | True |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 | True |


For each news article, we have the title, the text of the article, as well as the subject, date and label. For creating the word embeddings, we will just use the text column.

The gensim Word2Vec model requires the input to be a list of lists. Each item in the list is a 'document' and each document is also a list of tokenised words. We will iterate through each row, and use gensim's 'tokenize' function, which tokenises the article with some basic text cleaning.

In [40]:
# generate list of tokenized articles
sent = [list(tokenize(row, lowercase=True)) for row in df['text']]

In [44]:
# build word2vec model
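# note: `vector_size` is the gensim >= 4.0 name; older releases call it `size`
word2vec_model = Word2Vec(
    sent,
    min_count=1,     # keep every word, even those seen only once
    vector_size=50,  # length of each word embedding
    workers=8,       # number of training threads
    window=3,        # maximum distance between a word and its context words
    sg=1,            # 1 = skip-gram, 0 = CBOW
)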

In [48]:
word2vec_model.wv['trump']

array([-0.197617  , -0.09111612,  0.536869  ,  0.5075222 , -0.20479237,
       -0.12949395, -0.8391654 ,  0.34476176, -0.58287984, -0.2688873 ,
       -0.5941344 ,  0.31438622,  0.14963606, -0.22333175,  0.36819482,
        0.16479246, -0.41388914,  0.20203425,  0.13636146, -0.65705323,
       -0.56397605,  0.23625855,  0.5943967 , -0.06735351,  0.00571863,
        0.00537719,  0.45918745, -0.35381147, -0.8086109 , -0.07890741,
        0.22426248,  0.29491538, -0.4335018 ,  0.05501224,  1.0677358 ,
        0.21837595,  0.32808843, -0.03840185,  0.617614  , -0.20691352,
       -0.4464837 , -0.47698906,  0.26200438, -0.939531  ,  0.18013324,
        0.1676601 , -0.66991186, -0.29150403,  0.6372676 , -0.191315  ],
      dtype=float32)
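Once words are dense vectors like the one above, they can be compared numerically. One common choice is cosine similarity; the sketch below uses NumPy and made-up 3-dimensional vectors purely for illustration. The helper `cosine_similarity` is written for this example, and the vectors are not taken from the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, for illustration only. With the trained model you
# would pass embeddings such as word2vec_model.wv['trump'] instead.
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # points in the same direction as v1
v3 = [0.0, 0.0, 5.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

gensim's `KeyedVectors` object (the `wv` attribute used above) also provides `similarity` and `most_similar` methods for this kind of comparison.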