# Manual Word2Vec
This notebook sheds light on some of the details of word2vec by showing how you can 're-implement' the w2v model using basic keras functionality. If you would like to take it one step further and code everything yourself using just numpy, I recommend [Nathan Rooy's post](https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/), the code for which is available from his [GitHub repo](https://github.com/nathanrooy/word2vec-from-scratch-with-python/blob/master/word2vec.py). A re-implementation of this example with a nice spreadsheet demo is available on [Towards Data Science](https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281); [here is yet another example](https://towardsdatascience.com/word2vec-from-scratch-with-numpy-8786ddd49e72).

In [1]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn import model_selection
import nltk
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

import gensim
import collections

#Setting dataframe max limit of columns in output to 10
pd.set_option('display.max_columns', 10)

#Loading data and printing the columns
data = pd.read_csv("political_social_media.csv", encoding = "ISO-8859-1")
print(data.columns)

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'audience', 'audience:confidence', 'bias',
       'bias:confidence', 'message', 'message:confidence', 'orig__golden',
       'audience_gold', 'bias_gold', 'bioid', 'embed', 'id', 'label',
       'message_gold', 'source', 'text'],
      dtype='object')


In [2]:
data=data[["message","message:confidence","label","source","text"]]

It seems that we can trust the labels. Let's now look at the text:

### Text Cleaning

In [3]:
raw = list(data.text)
raw[10]
# OK, we will probably have to clean it

'As POTUS golfs, pushes amnesty &amp; ignores Keystone, American people are concerned about jobs, econ &amp; health care costs http://t.co/p9sPDYOAca'

In [4]:
# word2vec expects a list of lists: each document is a list of tokens
import re
prep = []
for line in raw:
    # Strip URLs, then lowercase and tokenize the remaining text
    prep.append(gensim.utils.simple_preprocess(re.sub(r'http\S+', '', line)))
prep[10]

['as',
 'potus',
 'golfs',
 'pushes',
 'amnesty',
 'amp',
 'ignores',
 'keystone',
 'american',
 'people',
 'are',
 'concerned',
 'about',
 'jobs',
 'econ',
 'amp',
 'health',
 'care',
 'costs']
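
Note the stray `amp` tokens in the output above: they are left over from the HTML entity `&amp;` in the raw tweet. A minimal sketch of one way to avoid them, assuming we decode HTML entities before tokenizing. The helper `clean_tweet` is hypothetical, and it uses a simple regex tokenizer as a stand-in for `gensim.utils.simple_preprocess`, so the tokens may differ slightly from gensim's.

```python
import html
import re

def clean_tweet(text):
    """Strip URLs and decode HTML entities such as '&amp;' before tokenizing."""
    text = re.sub(r"http\S+", "", text)   # same URL pattern as in the cell above
    text = html.unescape(text)            # '&amp;' becomes '&'
    # Simplified stand-in tokenizer: lowercase alphabetic runs only
    return re.findall(r"[a-z]+", text.lower())

example = "POTUS golfs &amp; ignores Keystone http://t.co/p9sPDYOAca"
tokens = clean_tweet(example)
print(tokens)
```

The rest of the notebook keeps the original preprocessing, so the outputs shown below still contain `amp` as a regular token.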

### Map Text to Integer Dictionary

The `keras` skip-gram utilities use a heuristic to approximate the frequency of words: the rank of the words sorted by frequency. Thus, we want to construct a dictionary from words to integers, where the words/integers are sorted from most frequent to least frequent.

In [31]:
# Count how often each word occurs across all documents
word_counter = collections.Counter()
for sentence in prep:
    word_counter.update(sentence)

Select only the most common words. Fewer than 6,000 words occur more than once in this dataset of tweets! The 3,000 most common words occur about 5 times or more.

In [32]:
vocab = word_counter.most_common(3000)
vocab = [x[0] for x in vocab]

In [33]:
#vocab = list(set.union(*map(set, prep))) 

In [8]:
idx = range(1,len(vocab)+1)
word2idx = dict(zip(vocab, idx))
idx2word = dict(zip(idx, vocab))

Map unknown words to a special index. This is how we deal with uncommon words or new words that come in during day-to-day business.

In [9]:
word2idx["UNKNOWN"] = 0
idx2word[0] = "UNKNOWN"
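
The steps above (count, rank by frequency, reserve index 0 for unknown words) can be sketched on a toy corpus. The sentences and the `toy_` names below are made up for illustration; only the procedure mirrors the notebook.

```python
import collections

# Hypothetical toy corpus: 'the' occurs 3 times, 'senate' twice, the rest once
toy_sentences = [["the", "senate", "votes"], ["the", "senate"], ["the", "house"]]

counts = collections.Counter(w for s in toy_sentences for w in s)

# Keep the 2 most common words; rank 1 = most frequent
toy_vocab = [w for w, _ in counts.most_common(2)]
toy_word2idx = {w: i for i, w in enumerate(toy_vocab, start=1)}
toy_word2idx["UNKNOWN"] = 0

# Every word outside the vocabulary falls back to index 0
encoded = [[toy_word2idx.get(w, 0) for w in s] for s in toy_sentences]
print(toy_word2idx)
print(encoded)
```

Because the integers follow the frequency rank, a small index always means a frequent word, which is what the rank-based sampling heuristic relies on.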

In [34]:
# Helper function that maps each word to its index, falling back to the 'UNKNOWN' index
def words_to_labels(sentence, dictionary):
    unknown = dictionary["UNKNOWN"]
    return [dictionary.get(word, unknown) for word in sentence]

In [35]:
VOCAB_SIZE = len(word2idx)

A quick test of our dictionaries:

In [36]:
idx2word[1]

'the'

In [37]:
word2idx["potus"]

627

In [38]:
idx2word[627]

'potus'

Turn text into labels

In [39]:
corpus = [] 
for sentence in prep:
    corpus.append(words_to_labels(sentence, word2idx))

In [40]:
corpus[0:2]

[[79, 0, 206, 0, 0, 1312, 0, 68, 1004],
 [332, 68, 232, 4, 407, 369, 3, 745, 1368]]

## Manual word2vec

In [17]:
import keras
from keras.models import Model
from keras.preprocessing import sequence
from keras.layers import Embedding, Input, Reshape, Dot, Activation

Using TensorFlow backend.


https://adventuresinmachinelearning.com/word2vec-keras-tutorial/ has more explanations, but it uses a single embedding layer shared between target and context words, which is not correct IMHO: word2vec learns separate target and context embeddings.

In [41]:
# Embedding dimension. Typically 50-300 for words.
EMB_DIM = 10

In [19]:
# Set up embedding layers
# Embed the target word
embedding_target = Embedding(VOCAB_SIZE, EMB_DIM, input_length=1, name='embedding_target')
# Embed the context word
# As in matrix factorization, we factorize into two matrices: target embeddings and context embeddings
embedding_context = Embedding(VOCAB_SIZE, EMB_DIM, input_length=1, name='embedding_context')

The reshaping below is necessary to make the output dimensions of each layer conform to the next layer. Check the `model.summary()` when you build functional models yourself!

In [43]:
# Build the model architecture

# Take a single target word
input_target = Input((1,))
target = embedding_target(input_target)
target = Reshape((EMB_DIM, 1))(target)

# Take another word either from the context or a random word from vocabulary
input_context = Input((1,))
context = embedding_context(input_context)
context = Reshape((EMB_DIM, 1))(context)

# Calculate the dot product, i.e. an unnormalized cosine similarity
dot_product = keras.layers.Dot(axes=1, normalize=False)([target, context])
dot_product = Reshape((1,))(dot_product)

# Predict if the words are in the same context -> Binary yes/no
output = Activation(activation='sigmoid')(dot_product)

In [46]:
# Compile the model
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam')

See how the model is not much of a neural network? The only trainable parameters are the embeddings, which are then dot-multiplied. We thus have two hidden layers side-by-side rather than one after the other and no non-linear activation of the hidden layers! This is very similar to matrix factorization and you can use the same architecture to build a collaborative filter on users (one embedding matrix) and items (one embedding matrix). 
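
To make the "not much of a neural network" point concrete, here is a minimal numpy sketch of the forward pass: two embedding lookups, a dot product, a sigmoid, and the binary cross-entropy loss. The matrices `E_target` and `E_context`, the toy sizes, and the helper names are assumptions for illustration, not the Keras internals.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 6, 4  # toy sizes, not the notebook's values

# Two separate embedding matrices, one row per word index
E_target = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
E_context = rng.normal(scale=0.1, size=(vocab_size, emb_dim))

def predict(target_ids, context_ids):
    """sigmoid of the dot product between target and context embeddings, per pair."""
    dots = np.sum(E_target[target_ids] * E_context[context_ids], axis=1)
    return 1.0 / (1.0 + np.exp(-dots))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, clipped for numerical stability."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

targets = np.array([1, 1, 2])
contexts = np.array([3, 5, 4])
labels = np.array([1, 0, 1])   # 1 = true context word, 0 = random word

probs = predict(targets, contexts)
loss = binary_crossentropy(labels, probs)
print(probs, loss)
```

With small random embeddings every probability starts close to 0.5, so the loss starts near $\ln 2 \approx 0.69$.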

In [47]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_target (Embedding)    (None, 1, 10)        30010       input_5[0][0]                    
__________________________________________________________________________________________________
embedding_context (Embedding)   (None, 1, 10)        30010       input_6[0][0]                    
__________________________________________________________________________________________________
reshape_7 

### Prepare the data

A window size of `i` translates to $[w_{-i},\ldots, w, \ldots, w_{+i}]$, so the number of context words to consider is window size $\times 2$. 
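
A small sketch of what the window means in practice, using the example sentence that appears further down. The helper `context_window` is hypothetical; near the sentence boundaries the window is simply cut off, so fewer than window size $\times 2$ context words are returned.

```python
def context_window(sentence, position, window_size):
    """Words within `window_size` positions of `position`, excluding the target itself."""
    start = max(0, position - window_size)
    end = min(len(sentence), position + window_size + 1)
    return [sentence[j] for j in range(start, end) if j != position]

sentence = [332, 68, 232, 4, 407, 369, 3, 745, 1368]

middle = context_window(sentence, 4, 2)  # target 407, full window on both sides
edge = context_window(sentence, 8, 2)    # target 1368, last word of the sentence
print(middle, edge)
```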

In [56]:
WINDOW_SIZE = 2

In [57]:
# Subsample frequent words so that word occurrences are more balanced; roughly a more complicated version of 1/frequency(word)
# make_sampling_table approximates frequency by the rank of occurrence frequency in the vocabulary list
sampling_table = sequence.make_sampling_table(VOCAB_SIZE)

The `sampling_table` is an array of sampling probabilities, one for each word index. Frequent words (low indices) get low probabilities, rare words get higher ones.

In [58]:
sampling_table

array([0.00315225, 0.00315225, 0.00547597, ..., 0.50726163, 0.50735608,
       0.50745052])
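
The numbers above can be reproduced by hand. The sketch below follows my reading of the rank-based formula behind `make_sampling_table` (a Zipf-style frequency estimate from the rank, with a default sampling factor of `1e-5`); treat the function `sampling_table_sketch` and its constants as an assumption to check against the Keras source rather than as the library's definitive implementation.

```python
import math

def sampling_table_sketch(size, sampling_factor=1e-5):
    """Rank-based keep probabilities: low for frequent (low-rank) words."""
    gamma = 0.577  # approximation of the Euler-Mascheroni constant
    table = []
    for i in range(size):
        rank = max(i, 1)  # index 0 is treated like rank 1
        # Estimated inverse frequency of the word at this rank under Zipf's law
        inv_fq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_fq
        table.append(min(1.0, f / math.sqrt(f)))
    return table

table = sampling_table_sketch(3001)  # 3,000 words plus the UNKNOWN index
print(table[0], table[2], table[-1])
```

The first, third, and last entries agree with the `sampling_table` output above to the printed precision.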

`skipgrams` takes a sentence and outputs:

1. target words in combination with a context word
2. a label indicating whether the context word is from the actual context (1) or randomly sampled (0).

Example for, importantly, a *single* sentence.

In [102]:
example_sentence = corpus[1]
example_sentence

[332, 68, 232, 4, 407, 369, 3, 745, 1368]

In [103]:
sequence.skipgrams(example_sentence, 
                   VOCAB_SIZE, window_size=WINDOW_SIZE, 
                   negative_samples=1.,  # Ratio of negative to positive samples
                   sampling_table=sampling_table)

([[1368, 1700], [1368, 802], [1368, 3], [1368, 745]], [0, 0, 1, 1])
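
In the output above only the target word 1368 survived the subsampling, which is why there are just two positive and two negative couples. A simplified pure-Python sketch of the idea behind `skipgrams` follows. The function `skipgram_pairs` is hypothetical; unlike the Keras function it does not shuffle the result, and it assumes index 0 marks a non-word that is skipped.

```python
import random

def skipgram_pairs(sequence, vocab_size, window_size, sampling_table=None, seed=0):
    """Positive (target, context) couples plus one random negative per positive."""
    rng = random.Random(seed)
    couples, labels = [], []
    for i, target in enumerate(sequence):
        if target == 0:
            continue  # index 0 (UNKNOWN) is treated as a non-word
        if sampling_table is not None and sampling_table[target] < rng.random():
            continue  # subsample frequent target words
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j] != 0:
                couples.append([target, sequence[j]])
                labels.append(1)
    # Negative samples: pair each positive target with a uniformly random word index
    for target in [c[0] for c in couples]:
        couples.append([target, rng.randint(1, vocab_size - 1)])
        labels.append(0)
    return couples, labels

couples, labels = skipgram_pairs([332, 68, 232], vocab_size=3001, window_size=1)
print(couples, labels)
```

Since unknown words are mapped to index 0, they never appear as a target or as a positive context word.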

Training loop

In [28]:
NO_EPOCH = 10

A batch should be sampled from a single sentence, and we update the model after each batch/sentence. In other words, the standard keras training loop doesn't fit, and we train manually to make sure the model is updated sentence by sentence.
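
What one such per-sentence update does can be sketched in numpy. This is an illustration under stated assumptions: it uses plain gradient descent instead of the Adam optimizer the model is compiled with, and the toy sizes, the learning rate `lr`, and the function `sgd_step` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, lr = 6, 4, 0.5  # toy values

E_target = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
E_context = rng.normal(scale=0.1, size=(vocab_size, emb_dim))

def sgd_step(couples, labels):
    """One plain-SGD update on a batch of couples; returns the loss before the update."""
    t, c = couples[:, 0], couples[:, 1]
    vt, vc = E_target[t], E_context[c]  # fancy indexing returns copies
    p = 1.0 / (1.0 + np.exp(-np.sum(vt * vc, axis=1)))
    loss = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    # Gradient of the mean binary cross-entropy w.r.t. each dot product
    grad = ((p - labels) / len(labels))[:, None]
    # np.add.at accumulates updates when the same index occurs several times
    np.add.at(E_target, t, -lr * grad * vc)
    np.add.at(E_context, c, -lr * grad * vt)
    return float(loss)

# One 'sentence': target word 1 with two true context words and two random words
couples = np.array([[1, 2], [1, 3], [1, 4], [1, 5]])
labels = np.array([1, 1, 0, 0])
losses = [sgd_step(couples, labels) for _ in range(100)]
print(losses[0], losses[-1])
```

Repeating the update on the same batch drives the loss down from roughly $\ln 2$, because the target embedding moves towards its true context embeddings and away from the random ones.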

In [105]:
for e in range(NO_EPOCH):
    print('-'*40)
    print('Epoch', e)
    print('-'*40)

    losses = []

    for seq in corpus:
        # Get skipgram couples for one text in the dataset
        couples, labels = sequence.skipgrams(seq, VOCAB_SIZE, window_size=WINDOW_SIZE,
                                             negative_samples=1., sampling_table=sampling_table)
        if couples:
            # One gradient update per sentence (one sentence = one batch of word couples)
            X = np.array(couples, dtype="int32")
            y = np.array(labels, dtype="int32")
            loss = model.train_on_batch([X[:, 0], X[:, 1]], y)
            losses.append(loss)
    print(f'Average loss over last 1000 batches: {np.mean(losses[-1000:])}')

----------------------------------------
Epoch 0
----------------------------------------
Average loss over last 1000 batches: 0.4232877790927887
----------------------------------------
Epoch 1
----------------------------------------
Average loss over last 1000 batches: 0.4340161085128784
----------------------------------------
Epoch 2
----------------------------------------
Average loss over last 1000 batches: 0.4258624017238617
----------------------------------------
Epoch 3
----------------------------------------
Average loss over last 1000 batches: 0.4115869402885437
----------------------------------------
Epoch 4
----------------------------------------
Average loss over last 1000 batches: 0.4147600829601288
----------------------------------------
Epoch 5
----------------------------------------
Average loss over last 1000 batches: 0.41622471809387207
----------------------------------------
Epoch 6
----------------------------------------
Average loss over last 1000 batch