## Skipgrams in Keras

- In this lecture, we will implement Skipgrams in `Keras`.

#### Loading in and preprocessing data
- Load the Alice in Wonderland text into `corpus` using the Keras utility `get_file`.
- `Keras` has some nice text preprocessing features too!
- Split the text into sentences.
- Use `Keras`' `Tokenizer` to tokenize sentences into words.
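
To make the tokenization step concrete, here is a minimal pure-Python sketch of what turning sentences into integer sequences involves. The helpers `build_word_index` and `to_sequences` are hypothetical names used for illustration only, not the Keras implementation: they skip punctuation filtering and break frequency ties alphabetically, which is an assumption of this sketch. Like the Keras `Tokenizer`, the most frequent word gets index 1, so index 0 stays free.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first (index 1)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending frequency; ties are broken alphabetically (sketch assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_sequences(sentences, word_index):
    """Replace each known word by its integer index."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]

toy_sentences = ["the cat sat", "the cat ran", "the dog sat"]
toy_index = build_word_index(toy_sentences)
toy_sequences = to_sequences(toy_sentences, toy_index)
print(toy_index)
print(toy_sequences)
```

Each sentence becomes a list of integers, which is the form the skipgram step further down expects.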

In [1]:
# Imports
# Basics
from __future__ import print_function, division
import pandas as pd 
import numpy as np
import random
from IPython.display import SVG
%matplotlib inline

# nltk
from nltk import sent_tokenize

# keras
np.random.seed(13)
from keras.models import Sequential
from keras.layers import Dense, Embedding, Reshape, Activation
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.preprocessing.text import Tokenizer
from keras.utils.vis_utils import model_to_dot 
from keras.preprocessing.sequence import skipgrams

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lucas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# We'll use Alice in Wonderland
path = get_file('carrol-alice.txt', origin='http://www.gutenberg.org/files/11/11-0.txt')
corpus = open(path, encoding='utf-8').read()

In [3]:
# Split document into sentences first
corpus = corpus[corpus.index('\n\n')+2:]
sentences = sent_tokenize(corpus)

base_filter = '!"#$%&()*+,-./:;`<=>?@[\\]^_{|}~\t\n' + "'"
tokenizer = Tokenizer(filters=base_filter)
tokenizer.fit_on_texts(sentences)

sequences = tokenizer.texts_to_sequences(sentences)
nb_samples = sum(len(s) for s in sequences)  # total number of word tokens

print(len(sequences), tokenizer.document_count)

1104 1104


In [4]:
# Compare a raw sentence with its integer-encoded sequence
print(sentences[324])
print(sequences[324])

“Keep your temper,” said the Caterpillar.
[2354, 66, 769, 2, 9, 1, 166]


#### Skipgrams: Generating Input and Output Labels
- Now that we have sentences and word-level tokenization, we are in a good position to create our training set for skipgrams.
- Next, we generate our `X_train` (target words) and `y_train` (context words).
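
Before calling the Keras helper, it may help to see the idea in plain Python. The sketch below pairs every word with the words at most `window_size` positions away from it. `skipgram_pairs` is a hypothetical helper written for this illustration. It leaves out negative sampling, shuffling and the sampling table that the Keras function supports.

```python
def skipgram_pairs(sequence, window_size):
    """Return every (target, context) pair within window_size positions."""
    pairs = []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sequence[j]))
    return pairs

toy_sequence = [5, 8, 2, 9]
toy_pairs = skipgram_pairs(toy_sequence, window_size=1)
print(toy_pairs)
```

The first element of each pair plays the role of the input word, the second the word to predict.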

In [5]:
# Let's first see how Keras' skipgrams function works.
couples, labels = skipgrams(sequences[324],
                            len(tokenizer.word_index) + 1,
                            window_size=2, 
                            negative_samples=0, 
                            shuffle=True, 
                            categorical=False, 
                            sampling_table=None)

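# Reverse lookup: integer index -> word
index_2_word = {val : key for key, val in tokenizer.word_index.items()}

# Print the context words paired with one target word from this sentence
target = tokenizer.word_index['said']
for w1, w2 in couples:
    if w1 == target:
        print(index_2_word[w1], index_2_word[w2])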


In [12]:
# Function to generate the inputs and outputs for all windows
vocab_size = len(tokenizer.word_index) + 1
dim = 100
window_size = 2

def generate_data(sequences, window_size, vocab_size):
    for seq in sequences:
        X, y = [], []
        couples, _ = skipgrams(seq, 
                              vocab_size, 
                              window_size=window_size, 
                              negative_samples=0, 
                              shuffle=True, 
                              categorical=False, 
                              sampling_table=None)
        
        if not couples:
            continue
        
        for in_word, out_word in couples:
            X.append(in_word)
            y.append(np_utils.to_categorical(out_word, vocab_size))
            
        X, y = np.array(X), np.array(y)
        X = X.reshape(len(X), 1)
        y = y.reshape(len(X), vocab_size)
        yield X, y
        
data_generator = generate_data(sequences, window_size, vocab_size)

### Skipgrams: Creating the Model
- Next, we create the (shallow) network: an embedding layer followed by a softmax layer over the vocabulary.

In [13]:
# Create the Keras model and view it 
skipgram = Sequential()
skipgram.add(Embedding(input_dim=vocab_size, output_dim=dim, embeddings_initializer='glorot_uniform', input_length=1))
skipgram.add(Reshape((dim,)))
skipgram.add(Dense(input_dim=dim, units=vocab_size, activation='softmax'))
#SVG(model_to_dot(skipgram, show_shapes=True).create(prog='dot', format='svg'))
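
To see what the model above computes for a single input word, here is a rough pure-Python sketch of the forward pass: look up the word's embedding row, multiply by the dense weights, and apply a softmax. The sizes `toy_vocab` and `toy_dim`, the random weights and the helper names are all assumptions made for illustration. The real model uses `vocab_size` and `dim` with Glorot-initialized weights.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_id, embeddings, dense_w, dense_b):
    """Embedding lookup followed by a dense softmax layer."""
    hidden = embeddings[word_id]  # one row of the embedding matrix
    # One score per vocabulary word: dot(hidden, weights for that word) + bias.
    scores = [sum(h * w for h, w in zip(hidden, col)) + b
              for col, b in zip(dense_w, dense_b)]
    return softmax(scores)

rng = random.Random(13)
toy_vocab, toy_dim = 6, 3
toy_embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(toy_dim)]
                  for _ in range(toy_vocab)]
# Stored as one weight vector per output word (the transpose of Keras' kernel layout).
toy_dense_w = [[rng.uniform(-0.5, 0.5) for _ in range(toy_dim)]
               for _ in range(toy_vocab)]
toy_dense_b = [0.0] * toy_vocab

probs = forward(2, toy_embeddings, toy_dense_w, toy_dense_b)
print(len(probs), round(sum(probs), 6))
```

The output is a probability distribution over the whole vocabulary. The rows of the embedding matrix are what we later read out as word vectors.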

### Skipgrams: Compiling and Training
- Time to compile and train.
- We use categorical cross-entropy, a common loss for multi-class classification.
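
As a small worked example of this loss: with a one-hot target, the cross-entropy reduces to the negative log of the probability the model assigns to the correct word. The function and the toy distributions below are illustrative assumptions, not the Keras internals.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

toy_target = [0, 0, 1, 0]                 # the true context word is index 2
confident = [0.05, 0.05, 0.85, 0.05]      # most mass on the right word
uniform = [0.25, 0.25, 0.25, 0.25]        # no preference at all

loss_confident = categorical_crossentropy(toy_target, confident)
loss_uniform = categorical_crossentropy(toy_target, uniform)
print(loss_confident, loss_uniform)
```

A uniform guess over `V` words costs `log(V)` per sample, so the loss only drops below that once the model puts more than `1/V` of its probability on the right word.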

In [14]:
# Compile the Keras Model
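from keras.optimizers import SGD

# NOTE: this SGD optimizer is defined but never used; the model below is compiled with Adadelta.
# To train with it instead, pass optimizer=sgd to compile().
sgd = SGD(lr=1e-4, decay=1e-6, momentum=0.9)

skipgram.compile(loss='categorical_crossentropy', optimizer='adadelta')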

for iteration in range(10):
    loss = 0
    for x, y in generate_data(sequences, window_size, vocab_size):
        loss = loss + skipgram.train_on_batch(x, y)
        
    print('iteration {}, loss is {}'.format(iteration, loss))

iteration 0, loss is 8942.264931678772
iteration 1, loss is 8941.894713401794
iteration 2, loss is 8941.525113105774
iteration 3, loss is 8941.155524253845
iteration 4, loss is 8940.785878181458
iteration 5, loss is 8940.416198730469
iteration 6, loss is 8940.046796798706
iteration 7, loss is 8939.677290916443
iteration 8, loss is 8939.3076171875
iteration 9, loss is 8938.938042640686


### Skipgrams: Looking at the vectors

To get the word vectors, we read the weights of the first (embedding) layer: row `i` is the vector for the word with index `i`.

Let's also write functions that measure how similar two words are.
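
One detail worth keeping in mind here: `scipy.spatial.distance.cosine` returns the cosine distance, which is 1 minus the cosine similarity. The pure-Python sketch below shows both quantities on toy vectors. The function names and vectors are chosen for illustration only.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def cosine_distance(v1, v2):
    """Cosine distance: 0 = same direction, 2 = opposite."""
    return 1.0 - cosine_similarity(v1, v2)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

So a high similarity corresponds to a low distance, and a ranking of "most similar" words should put the smallest distances first.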

In [15]:
word_vectors = skipgram.get_weights()[0]

from scipy.spatial.distance import cosine

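def get_dist(w1, w2):
    # scipy's `cosine` is the cosine *distance*, i.e. 1 - cosine similarity
    i1, i2 = tokenizer.word_index[w1], tokenizer.word_index[w2]
    v1, v2 = word_vectors[i1], word_vectors[i2]
    return cosine(v1, v2)

def get_similarity(w1, w2):
    # NOTE: as written this returns the distance, so smaller means more similar;
    # return 1 - get_dist(w1, w2) to get a true cosine similarity
    return get_dist(w1, w2)

def get_most_similar(w1, n=10):
    sims = {word : get_similarity(w1, word) for word in tokenizer.word_index.keys() if word != w1}
    sims = pd.Series(sims)
    # NOTE: with distances, ascending=False ranks the *least* similar words first;
    # sort ascending (or switch to a true similarity) to get the nearest neighbours
    sims.sort_values(inplace=True, ascending=False)
    return sims.iloc[:n]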


print(get_similarity('king', 'queen'))
print('')
print(get_most_similar('queen'))

0.8762330189347267

yelled       1.372109
piteous      1.342071
engaged      1.336292
coils        1.316476
tremble      1.292187
shock        1.289189
financial    1.288006
across       1.283690
utf          1.280388
hedge        1.277146
dtype: float64


## Your turn -- Modify the code above to create a CBOW Model