## Testing embedding models

In this notebook we set up an LSTM model for the word generation task and use it to compare pretrained embedding models.

In [1]:
from __future__ import print_function

import numpy as np
import gensim
import string

from keras.callbacks import LambdaCallback
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Activation
from keras.models import Sequential
from keras.utils.data_utils import get_file

Using TensorFlow backend.


### Load models 

In [2]:
word_model = gensim.models.Word2Vec.load('./data/meta-n2v/meta-n2v100MB')

In [3]:
pretrained_weights = word_model.wv.syn0
vocab_size, emdedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)
print('Checking similar words:')

for word in ['model']:
  most_similar = ', '.join('%s (%.2f)' % (similar, score) for similar, score in word_model.wv.most_similar(word)[:8])
  print('  %s -> %s' % (word, most_similar))

Result embedding shape: (725, 100)
Checking similar words:
  model -> standard (0.85), embraer (0.82), brazil (0.60), between (0.54), sum (0.52), honor (0.52), fitted (0.51), aside (0.50)
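The numbers in parentheses are cosine similarities between word vectors, which is the measure gensim's `most_similar` ranks by. A minimal sketch of that ranking on made-up 3-dimensional vectors follows; the toy vectors and the `cosine` and `top_k` helpers are illustrative assumptions, not part of gensim or of the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, vectors, k=2):
    """Rank every other word by cosine similarity to `query`."""
    scores = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical vectors, chosen only to make the ranking easy to follow.
toy_vectors = {
    'model': [1.0, 0.0, 0.0],
    'standard': [0.9, 0.1, 0.0],
    'brazil': [0.5, 0.5, 0.5],
    'sum': [0.0, 1.0, 0.0],
}
for word, score in top_k('model', toy_vectors, k=3):
    print('%s (%.2f)' % (word, score))
```

Words whose vectors point in nearly the same direction as the query score close to 1, and orthogonal vectors score 0.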


In [3]:
# To load the corresponding number of sentences, pass the size parameter (chunk_size) to lazy_load.
from sentence_loader import lazy_load

In [4]:
tokenised_sents, sents = lazy_load(chunk_size=10240)

In [5]:
print(len(sents))

73


In [26]:
#ub = np.max([len(sent) for sent in tokenised_sents])

long_sents = 0
for sent in tokenised_sents:
    if len(sent) > 45:
        long_sents += 1

long_sents

3

In [8]:
# Use 45 as the upper bound on sentence length.
# Keep only the sentences that are shorter than this maximum.
sentences = [sentence for sentence in tokenised_sents if len(sentence) < 45]
print(len(sentences))

673
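Note that the count uses `len(sent) > 45` while the filter uses `len(sentence) < 45`, so a sentence with exactly 45 tokens is neither counted as long nor kept. A toy illustration with made-up token lists (the sentences here are placeholders, not corpus data):

```python
max_len = 45  # same threshold as in the cells above

# Hypothetical tokenised sentences of length 44, 45 and 46.
toy_sents = [['w'] * n for n in (44, 45, 46)]

long_sents = sum(1 for s in toy_sents if len(s) > max_len)
kept = [s for s in toy_sents if len(s) < max_len]
kept_inclusive = [s for s in toy_sents if len(s) <= max_len]

print(long_sents, len(kept), len(kept_inclusive))
```

Whether to keep the boundary case is a design choice. With `<=`, a 45-token sentence would still fit a 45-column input array, because only its first 44 words are used as input.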


In [31]:
print(sentences[0])

['knightmare', 'chess', 'is', 'fantasy', 'chess', 'variant', 'published', 'by', 'steve', 'jackson', 'games', 'in', '0000']


In [28]:
max_sentence_len = 45

def word2idx(word):
  return word_model.wv.vocab[word].index
def idx2word(idx):
  return word_model.wv.index2word[idx]

In [29]:
train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
  for t, word in enumerate(sentence[:-1]):
    train_x[i, t] = word2idx(word)
  train_y[i] = word2idx(sentence[-1])
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)

train_x shape: (70, 45)
train_y shape: (70,)
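Each row of `train_x` holds the indices of every word except the last, left-aligned and padded with zeros on the right; `train_y` holds the index of the final word. A dependency-free sketch of the same construction on a toy vocabulary, using plain lists instead of NumPy arrays (the `toy_vocab` mapping is an assumption standing in for `word2idx`):

```python
# Hypothetical vocabulary; in the notebook the indices come from the word2vec model.
toy_vocab = {'the': 0, 'chess': 1, 'is': 2, 'fantasy': 3, 'variant': 4}
toy_sentences = [
    ['chess', 'is', 'fantasy', 'variant'],
    ['the', 'variant', 'is'],
]
max_len = 5

x = [[0] * max_len for _ in toy_sentences]   # zero-filled, like np.zeros
y = [0] * len(toy_sentences)
for i, sentence in enumerate(toy_sentences):
    for t, word in enumerate(sentence[:-1]):  # all words but the last are inputs
        x[i][t] = toy_vocab[word]
    y[i] = toy_vocab[sentence[-1]]            # the last word is the target

print(x)
print(y)
```

In this sketch index 0 doubles as padding and as a real word (`'the'`), so the two cannot be told apart in the second row. The notebook's arrays have the same property, since unused positions keep the zero that `np.zeros` put there.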


### Training

In [30]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, weights=[pretrained_weights]))
model.add(LSTM(units=emdedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

Instructions for updating:
keep_dims is deprecated, use keepdims instead
Instructions for updating:
keep_dims is deprecated, use keepdims instead
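As a sanity check on the model size, the parameter count can be worked out by hand from the shapes printed earlier (vocabulary 725, embedding size 100). The sketch below assumes Keras defaults: a standard LSTM with four gates and bias vectors, and a Dense layer with a bias.

```python
vocab_size, embedding_size = 725, 100   # from the embedding shape printed earlier
units = embedding_size                  # the LSTM uses as many units as the embedding size

embedding_params = vocab_size * embedding_size
# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (units * (embedding_size + units) + units)
dense_params = units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

If those defaults hold, `model.summary()` should report the same figures. With 70 training sentences, this is far more parameters than training examples.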


In [38]:
# implement a random sampling mechanism

def sample(preds, temperature=1.0):
  if temperature <= 0:
    return np.argmax(preds)
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

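def generate_next(text, num_generated=10):
  # Seed with the indices of the prompt words, then append one sampled word per step.
  word_idxs = [word2idx(word) for word in text.lower().split()]
  for i in range(num_generated):
    # Note: word_idxs is passed as a 1-D array, which Keras 2.x appears to expand into
    # a batch of length-1 sequences, so prediction[-1] depends on the last word only.
    prediction = model.predict(x=np.array(word_idxs))
    idx = sample(prediction[-1], temperature=0.7)
    word_idxs.append(idx)
  return ' '.join(idx2word(idx) for idx in word_idxs)

def on_epoch_end(epoch, _):
  # Print a short continuation for a few fixed prompts after every epoch.
  print('\nGenerating text after epoch: %d' % epoch)
  texts = [
    'chess',
    'is',
    'fantasy',
    'by',
    'steve'
  ]
  for text in texts:
    generated = generate_next(text)
    print('%s... -> %s' % (text, generated))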
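The `sample` function divides the log-probabilities by the temperature before renormalising, so temperatures below 1 sharpen the distribution and temperatures above 1 flatten it. A pure-Python sketch of just that rescaling step, without the random draw (the `apply_temperature` helper and the example probabilities are illustrative):

```python
import math

def apply_temperature(probs, temperature):
    """Return probs re-weighted as softmax(log(p) / temperature)."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]  # made-up next-word distribution
for temperature in (0.5, 0.7, 1.0, 2.0):
    reweighted = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in reweighted])
```

At temperature 1.0 the distribution is unchanged. The value 0.7 used in `generate_next` shifts some probability mass towards the most likely words while still allowing variety.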

In [39]:
model.fit(train_x, train_y,
          batch_size=128,
          epochs=20,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])

Epoch 1/20

Generating text after epoch: 0
chess... -> chess joe points western canterbury shorts astrophysical play nothing children range
is... -> is kenya among during broken methven affairs into season jim tone
fantasy... -> fantasy services musical german put given requirement little laid branching shp
by... -> by see 0000 bradford writer invented customized model among too methven
steve... -> steve references storm going hunting cast introduction indicating can branching originally
Epoch 2/20

Generating text after epoch: 1
chess... -> chess consecutive style how relationship rooms sketches culture ashburton move enjoying
is... -> is greater problems appeared strike silverware playing perform tv preference create
fantasy... -> fantasy footnotes alter accommodation incredible behemoth orchestral perhaps will decks target
by... -> by funders late providers death trailing introduced bill baker rhyl branch
steve... -> steve teacher aside lineup claim refuelled faidutti embraer base f

Epoch 16/20

Generating text after epoch: 15
chess... -> chess rules references variant ranks leaves waiter template intended see kilometres
is... -> is athletic decision organ named if region square hence original remained
fantasy... -> fantasy broken february 0000 lineup cup noggin winners wolverines era decks
by... -> by consisting branching footballer image australian nothing just sold athletic career
steve... -> steve botham the fact island currently built wilhelm honor most around
Epoch 17/20

Generating text after epoch: 16
chess... -> chess hunting private embraer otherwise mathematical children for former seat figurehead
is... -> is that becoming per carrier epitomised op order given marked squares
fantasy... -> fantasy vulgar liverpool squad increasing side british turbulent zealand number courts
by... -> by along voted show abstract asterisk pilatus craters turboprop deck career
steve... -> steve characters claim dueling for seat instruction success bill echiquier removed
Ep

<keras.callbacks.History at 0x7f1391980250>

In [7]:
%%file lstm_model.py

from __future__ import print_function
import numpy as np

from keras.callbacks import LambdaCallback
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.layers import Dense, Activation
from keras.models import Sequential

def sample(preds, temperature=1.0):
    if temperature <= 0:
        return np.argmax(preds)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def eval_on_lstm(tokenised_sents, word_model, max_sentence_len, test_ratio=0.2):
    
    sentences = [sentence for sentence in tokenised_sents if len(sentence) < max_sentence_len]
    pretrained_weights = word_model.wv.syn0
    vocab_size, emdedding_size = pretrained_weights.shape
    def word2idx(word):
        try:
            idx = word_model.wv.vocab[word].index
        except KeyError:
            # Unknown words fall back to index 0.
            print("word: {} not in vocab using default word card\n".format(word))
            idx = 0
        return idx
    
    def idx2word(idx):
        return word_model.wv.index2word[idx]
    
    total = len(sentences)
    train_size = int(total * (1 - test_ratio))
    test_size = total - train_size
    
    train_x = np.zeros([train_size, max_sentence_len], dtype=np.int32)
    train_y = np.zeros([train_size], dtype=np.int32)
    test_x = np.zeros([test_size, max_sentence_len], dtype=np.int32)
    test_y = np.zeros([test_size], dtype=np.int32)

    for i, sentence in enumerate(sentences[:train_size]):
        for t, word in enumerate(sentence[:-1]):
            train_x[i, t] = word2idx(word)
        train_y[i] = word2idx(sentence[-1])
    for i, sentence in enumerate(sentences[train_size:]):
        for t, word in enumerate(sentence[:-1]):
            test_x[i, t] = word2idx(word)
        test_y[i] = word2idx(sentence[-1])
    
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, weights=[pretrained_weights]))
    model.add(LSTM(units=emdedding_size))
    model.add(Dense(units=vocab_size))
    model.add(Activation('softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    
    def generate_next(text, num_generated=10):
        word_idxs = [word2idx(word) for word in text.lower().split()]
        for i in range(num_generated):
            prediction = model.predict(x=np.array(word_idxs))
            idx = sample(prediction[-1], temperature=0.7)
            word_idxs.append(idx)
        return ' '.join(idx2word(idx) for idx in word_idxs)
    
    def on_epoch_end(epoch, _):
        print('\nGenerating text after epoch: %d' % epoch)
        texts = [
        'chess',
        'is',
        'fantasy',
        'by',
        'steve',
        'lasting'
        ]
        for text in texts:
            sample = generate_next(text)
            print('%s... -> %s' % (text, sample))
            
    model.fit(train_x, train_y,
          batch_size=128,
          epochs=20,
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])
    
    # The model is compiled without metrics, so evaluate() returns only the loss.
    scores = model.evaluate(test_x, test_y, verbose=0)
    print("Test loss of the model is {}".format(scores))

Overwriting lstm_model.py
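`eval_on_lstm` splits the filtered sentences by position: the first `1 - test_ratio` fraction is used for training and the remainder is held out, with no shuffling. A small sketch of that arithmetic, assuming 70 sentences survive the length filter (a made-up figure here; the actual count for a limit of 40 is not shown in the notebook):

```python
def split_sizes(total, test_ratio=0.2):
    """Mirror the train/test size computation used in eval_on_lstm."""
    train_size = int(total * (1 - test_ratio))
    return train_size, total - train_size

items = list(range(70))            # stand-ins for 70 filtered sentences
train_size, test_size = split_sizes(len(items))
train, test = items[:train_size], items[train_size:]

print(train_size, test_size)
print(train[-1], test[0])
```

Because the split is positional, the test set is whatever comes last in the corpus. Shuffling before splitting is a common alternative when the ordering carries no meaning.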


In [2]:
from lstm_model import eval_on_lstm

Using TensorFlow backend.


In [6]:
eval_on_lstm(tokenised_sents, word_model, 40)

Instructions for updating:
keep_dims is deprecated, use keepdims instead
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Epoch 1/20

Generating text after epoch: 0
chess... -> chess blues professional his bruno months starting vilela awarded move blank
is... -> is uk sat cost redbridge figurehead winning throughout funders epitomised producer
fantasy... -> fantasy revealed aircraft he default producer pt rural defeat otherwise steve
by... -> by it 0000s workshop doncaster different concerned friend more lyndhurst pun
steve... -> steve in there fitting desired kuwait bird writer off culture published
lasting... -> lasting games teenage little accommodation lower which wilhelm version using championships
Epoch 2/20

Generating text after epoch: 1
chess... -> chess little rural life travelled wife former hutt rewarded goalkeepers awarded
is... -> is hutt states rare translation greater revived total from the blues
fantasy... -> fantasy offer success shuffling shar

Epoch 13/20

Generating text after epoch: 12
chess... -> chess nightmarish friend fly transfer blues victory skifield channel university remained
is... -> is november prevent figurehead marked family huddersfield revived opponent several is
fantasy... -> fantasy op pbs south can goals waiter standards unavailable copy orchestral
by... -> by brazilian graduates paintings fly seats december university 00 rovers walked
steve... -> steve botham out flying goalkeeper trade friend rules given turboprop engined
lasting... -> lasting celtic australian back crew ex pitch than blues services north
Epoch 14/20

Generating text after epoch: 13
chess... -> chess chess faidutti our shrewsbury home powerful requests square engined have
is... -> is away bruno through town indicating pbs channel point says successful
fantasy... -> fantasy half storm steve ballooning use place february action brazil be
by... -> by transformation canopy region medals david low handicap 0000 requests apparently
steve... -

In [1]:
# Now let's test the 1 MB model of syntactic_n2v.
from sentence_loader import lazy_load
# Integer division keeps chunk_size an int under Python 3 as well.
tokenised_sents, sents = lazy_load(chunk_size=1048576 // 2)

In [4]:
len(sents)

8

In [2]:
from lstm_model import eval_on_lstm

Using TensorFlow backend.


In [3]:
import gensim

In [4]:
word_model = gensim.models.Word2Vec.load('./data/syncode/syncode_model_1MB')
eval_on_lstm(tokenised_sents, word_model, 40)

word: ibm not in vocab using default word card

word: 000mm not in vocab using default word card

word: antenna not in vocab using default word card

word: wmvp not in vocab using default word card

word: anna not in vocab using default word card

word: backquotes not in vocab using default word card

word: Ruo not in vocab using default word card

word: Lin not in vocab using default word card

word: Guang not in vocab using default word card

word: Xing not in vocab using default word card

word: ponna not in vocab using default word card

word: ponna not in vocab using default word card

word: janna not in vocab using default word card

word: janna not in vocab using default word card

word: janna not in vocab using default word card

word: janna not in vocab using default word card

word: 00m not in vocab using default word card

word: Da not in vocab using default word card

word: Tai not in vocab using default word card

word: Dao not in vocab using default word card

word: Tai n

steve... -> steve girls defending barbados crisis mountains ready supplements altering slide dewar
lasting... -> lasting baked kary discuss lizzie temporarily damage denomination avoids serve vomma
Epoch 7/20
Generating text after epoch: 6
chess... -> chess fondation kevin drivetrain street outcome hits persisting number differs disputes
is... -> is stuttering sandstone year proved shaping frederick list rochester bath commanders
fantasy... -> fantasy ovaj lanes rakaia wamira exterior way ark really terminology islands
by... -> by peaked last hurd realism set gully pilatus tabernacle mysterious compromised
steve... -> steve nanjo housewife categorized sullivan byname kannada tasked pipes affixed money
lasting... -> lasting remote township transformation today vismistananda hearts coast eastbound affiliation epidemic
Epoch 8/20
Generating text after epoch: 7
chess... -> chess springs developed threatening friendly asking raja assisting hartwick lines hoping
is... -> is powerful roret li

steve... -> steve florence doubt kn fm geilenkirchen elle appropriately pop britain quechua
lasting... -> lasting sortied speculate share waste contacted horticultural kelowna lowry officials ensuing
Epoch 18/20
Generating text after epoch: 17
chess... -> chess enough indira grant figure defect three serbia lyndon entrance witness
is... -> is rush insanity madagascar merhtens register smugglers penny quite haydon cameos
fantasy... -> fantasy volunteers prior plates club fenton heftel leuca hour readers centuries
by... -> by deer campaing advance steal yerba frequent fluorescent track organism operations
steve... -> steve mariamma destruction recanted chalukya tender modes olfaction experiments share wanted
lasting... -> lasting neighborhoods premise amanda graduate enhance gen weekday refusal bra container
Epoch 19/20
Generating text after epoch: 18
chess... -> chess trekking seriously perl footbridge canted obtained dab jump breastplate bilums
is... -> is fatalities boosters scores sp
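The "not in vocab" messages at the top of this output come from the fallback in `word2idx`, which maps any unknown word to index 0. Several of the reported words (`Ruo`, `Lin`, `Guang`) are capitalised, so a case mismatch between the tokens and the model vocabulary may account for some of the misses. A sketch for measuring the out-of-vocabulary rate with and without lowercasing, on a toy vocabulary (the `oov_rate` helper and the data are hypothetical):

```python
def oov_rate(sentences, vocab, normalise=None):
    """Fraction of tokens that are missing from `vocab`."""
    total = missing = 0
    for sentence in sentences:
        for word in sentence:
            if normalise is not None:
                word = normalise(word)
            total += 1
            if word not in vocab:
                missing += 1
    return missing / float(total) if total else 0.0

# Made-up vocabulary and sentences, for illustration only.
toy_vocab = {'lin', 'plays', 'chess', 'with', 'anna'}
toy_sentences = [['Lin', 'plays', 'chess'], ['with', 'Anna', 'ibm']]

print(oov_rate(toy_sentences, toy_vocab))
print(oov_rate(toy_sentences, toy_vocab, str.lower))
```

A high rate matters here because every unknown word is replaced by the same index, so the model sees the word stored at index 0 in its place.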

In [5]:
word_model = gensim.models.Word2Vec.load('./data/w2v/w2v_model_1MB')
eval_on_lstm(tokenised_sents, word_model, 40)

word: ibm not in vocab using default word card

word: 000mm not in vocab using default word card

word: antenna not in vocab using default word card

word: wmvp not in vocab using default word card

word: anna not in vocab using default word card

word: backquotes not in vocab using default word card

word: Ruo not in vocab using default word card

word: Lin not in vocab using default word card

word: Guang not in vocab using default word card

word: Xing not in vocab using default word card

word: ponna not in vocab using default word card

word: ponna not in vocab using default word card

word: janna not in vocab using default word card

word: janna not in vocab using default word card

word: janna not in vocab using default word card

word: janna not in vocab using default word card

word: 00m not in vocab using default word card

word: Da not in vocab using default word card

word: Tai not in vocab using default word card

word: Dao not in vocab using default word card

word: Tai n

Epoch 7/20
Generating text after epoch: 6
chess... -> chess lifespan classroom ludicorp pervert thatched precursors wanted wind alleges contains
is... -> is char bill landmarks vitus vernacular comparative kronfeld direct ruficaudatus syndrome
fantasy... -> fantasy sprouted deities responded puzzled satisfy amount online tacitly married alone
by... -> by cleared abbreviations peruvian ours passive surfaced comparisons anyway caprice factories
steve... -> steve district think liu monopolize mediatised resolved judas friendly glacial retirement
lasting... -> lasting cobbling cree enlargement tattvartha rangers jargon shivamara beta auto tenets
Epoch 8/20
Generating text after epoch: 7
chess... -> chess ship lingayatism clarinet k0 collectors outfitters slams dislike hills dixieland
is... -> is worth playoff shaft eugene circus craw slogan biochips quality whitewashing
fantasy... -> fantasy workers granules importance packed protested has armed destroyer hanson gas
by... -> by motifs spri

steve... -> steve person recycled saxons burst product christians spring metals repulsive engineer
lasting... -> lasting balfour narasimhachar dyno variants accoutrements employee whitewashing scale brca0 brett
Epoch 18/20
Generating text after epoch: 17
chess... -> chess rumours bodies severance lujack subjects honored rejoined lucy reactivated historiography
is... -> is sean treatise clothes strife economy essex miyazaki samples visual administrative
fantasy... -> fantasy spielberg bohannon it schleitheim schemes centers retour secession attributed shamar
by... -> by mystics passamaquoddy strictly advertising experimenting father unique selective provider ij
steve... -> steve overlap lemur communities artillery stack move wounding nicknames circuses titled
lasting... -> lasting punishment suga crusaders ones unwillingness gary skylight shown chevrolet pressing
Epoch 19/20
Generating text after epoch: 18
chess... -> chess pilgrimage stabilizers duties op altering escapist yerupaja com

In [10]:
%%file evaluate_models.py

from sentence_loader import lazy_load
from lstm_model import eval_on_lstm
import gensim

sizes = [1, 2, 4, 8]
tokenized_sents = []
sents = []
for size in sizes:
    # Load the next chunk and add it to the sentences gathered so far.
    ts, s = lazy_load(chunk_size=10240 * size)
    tokenized_sents.extend(ts)
    sents.extend(s)
    # Evaluate both embedding models of the same size on the same sentences.
    word_model = gensim.models.Word2Vec.load('./data/syncode/syncode_model_' + str(size) + 'MB')
    eval_on_lstm(tokenized_sents, word_model, 40)
    print("\n done with syncode " + str(size) + "\n")
    word_model = gensim.models.Word2Vec.load('./data/w2v/w2v_model_' + str(size) + 'MB')
    eval_on_lstm(tokenized_sents, word_model, 40)
    print("\n done with word2vec " + str(size) + "\n")

Overwriting evaluate_models.py
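The score printed at the end of each `eval_on_lstm` run is whatever `model.evaluate` returns. Since the model is compiled with a loss and no extra metrics, that value is the mean sparse categorical cross-entropy on the held-out sentences (in nats), not an accuracy. A sketch for putting such a value in context, assuming the 725-word vocabulary of the first model; the `example_loss` figure is made up for illustration and is not a result from this notebook.

```python
import math

vocab_size = 725                      # vocabulary of the first model above

# Cross-entropy of a model that assigns equal probability to every word.
uniform_loss = math.log(vocab_size)

def perplexity(loss):
    """Convert a mean cross-entropy in nats into perplexity."""
    return math.exp(loss)

example_loss = 6.0                    # hypothetical value, for illustration only
print('uniform baseline loss: %.3f' % uniform_loss)
print('perplexity at loss %.1f: %.1f' % (example_loss, perplexity(example_loss)))
```

A test loss near the uniform baseline would mean the model predicts the final word no better than guessing. Comparing each embedding model against that baseline, and against each other at the same corpus size, gives a more direct comparison than the generated samples.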
