# Word Embeddings/RNNs

## Copyright notice

Parts of this code are adapted from the [Keras example](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py), (c) 2015 - 2018, François Chollet, [MIT License](https://github.com/keras-team/keras/blob/master/LICENSE). This version (c) 2018 Fabian Offert, [MIT License](LICENSE). 

## Imports

We use the Gensim and spaCy libraries, which provide high-level interfaces for many common NLP tasks.

In [1]:
import warnings
warnings.filterwarnings('ignore')

import gensim
import numpy as np
import string
import os
import spacy
import random
from collections import Counter, defaultdict

# python -m spacy download en
nlp = spacy.load('en')

from keras.callbacks import LambdaCallback
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation, Dropout, TimeDistributed, Flatten, Embedding
from keras.layers import LSTM, GRU
from keras.optimizers import RMSprop

Using TensorFlow backend.


## Pre-trained embeddings (Google News corpus, about $10^{11}$ words)

In [2]:
# C binary format
wv_news = gensim.models.KeyedVectors.load_word2vec_format('7-nlp/google300.bin', binary=True)  

In [3]:
# wv_news is already a KeyedVectors instance, so no .wv indirection is needed
wv_news.most_similar(positive=['man', 'programmer'], negative=['dog'])

[('computer_programmer', 0.5240642428398132),
 ('programer', 0.4957401156425476),
 ('engineer', 0.44241905212402344),
 ('mathematician', 0.41245338320732117),
 ('programmers', 0.4080742597579956),
 ('polymath', 0.4050711393356323),
 ('animator', 0.3985060453414917),
 ('Programmer', 0.3977847993373871),
 ('mechanical_engineer', 0.39661023020744324),
 ('Kaelin_Jacobson', 0.394615113735199)]
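
The query above combines word vectors and ranks the remaining vocabulary by cosine similarity to the result. A minimal NumPy sketch of that idea follows. The 3-dimensional `toy_vectors` are made up for illustration, and `unit` and `analogy` are hypothetical helpers, not part of Gensim, whose actual implementation may differ in details.

```python
import numpy as np

# Made-up toy embeddings (assumption); the real vectors have 300 dimensions
toy_vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'apple': np.array([0.5, 0.5, 0.5]),
}

def unit(v):
    # Scale a vector to length 1 so that dot products are cosine similarities
    return v / np.linalg.norm(v)

def analogy(positive, negative, vectors, topn=3):
    # Add the positive vectors, subtract the negative ones
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # Score every word that is not part of the query itself
    scores = {w: float(np.dot(query, unit(v)))
              for w, v in vectors.items() if w not in positive + negative}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

result = analogy(['woman', 'king'], ['man'], toy_vectors)
print(result)
```

With these toy vectors, `queen` ranks above `apple`, because `woman + king - man` points in nearly the same direction as `queen`.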

In [None]:
# This will generate a lot of output
wv_news.accuracy('7-nlp/questions-words.txt')

## Self-trained embeddings ("In Search of Lost Time")

"In Search of Lost Time" is all about [links and similarities](https://en.wikipedia.org/wiki/Involuntary_memory): between times, places, things, senses, and people. Its arguably most famous scene is the "Madeleine" passage, where the experience of eating a simple [French coffee cake](https://en.wikipedia.org/wiki/Madeleine_(cake) leads the narrator to remember a childhood episode and, subsequently, his whole childhood and youth. How can we explore these links and similarities computationally? With word embeddings and RNNs, of course.

> Many years had elapsed during which nothing of Combray, save what was comprised in the theatre and the drama of my going to bed there, had any existence for me, when one day in winter, as I came home, my mother, seeing that I was cold, offered me some tea, a thing I did not ordinarily take. I declined at first, and then, for no particular reason, changed my mind. She sent out for one of those short, plump little cakes called 'petites madeleines,' which look as though they had been moulded in the fluted scallop of a pilgrim's shell. And soon, mechanically, weary after a dull day with the prospect of a depressing morrow, I raised to my lips a spoonful of the tea in which I had soaked a morsel of the cake. No sooner had the warm liquid, and the crumbs with it, touched my palate than a shudder ran through my whole body, and I stopped, intent upon the extraordinary changes that were taking place. An exquisite pleasure had invaded my senses, but individual, detached, with no suggestion of its origin. And at once the vicissitudes of life had become indifferent to me, its disasters innocuous, its brevity illusory - this new sensation having had on me the effect which love has of filling me with a precious essence; or rather this essence was not in me, it was myself. I had ceased now to feel mediocre, accidental, mortal. Whence could it have come to me, this all-powerful joy? I was conscious that it was connected with the taste of tea and cake, but that it infinitely transcended those savours, could not, indeed, be of the same nature as theirs. Whence did it come? What did it signify? How could I seize upon and define it?
I drink a second mouthful, in which I find nothing more than in the first, a third, which gives me rather less than the second. It is time to stop; the potion is losing its magic. It is plain that the object of my quest, the truth, lies not in the cup but in myself. The tea has called up in me, but does not itself understand, and can only repeat indefinitely with a gradual loss of strength, the same testimony; which I, too, cannot interpret, though I hope at least to be able to call upon the tea for it again and to find it there presently, intact and at my disposal, for my final enlightenment. I put down my cup and examine my own mind. It is for it to discover the truth. But how? What an abyss of uncertainty whenever the mind feels that some part of it has strayed beyond its own borders; when it, the seeker, is at once the dark region through which it must go seeking, where all its equipment will avail it nothing. Seek? More than that: create. It is face to face with something which does not so far exist, to which it alone can give reality and substance, which it alone can bring into the light of day.
And I begin again to ask myself what it could have been, this unremembered state which brought with it no logical proof of its existence, but only the sense that it was a happy, that it was a real state in whose presence other states of consciousness melted and vanished. I decide to attempt to make it reappear. I retrace my thoughts to the moment at which I drank the first spoonful of tea. I find again the same state, illumined by no fresh light. I compel my mind to make one further effort, to follow and recapture once again the fleeting sensation. And that nothing may interrupt it in its course I shut out every obstacle, every extraneous idea, I stop my ears and inhibit all attention to the sounds which come from the next room. And then, feeling that my mind is growing fatigued without having any success to report, I compel it for a change to enjoy that distraction which I have just denied it, to think of other things, to rest and refresh itself before the supreme attempt. And then for the second time I clear an empty space in front of it. I place in position before my mind's eye the still recent taste of that first mouthful, and I feel something start within me, something that leaves its resting-place and attempts to rise, something that has been embedded like an anchor at a great depth; I do not know yet what it is, but I can feel it mounting slowly; I can measure the resistance, I can hear the echo of great spaces traversed.
Undoubtedly what is thus palpitating in the depths of my being must be the image, the visual memory which, being linked to that taste, has tried to follow it into my conscious mind. But its struggles are too far off, too much confused; scarcely can I perceive the colourless reflection in which are blended the uncapturable whirling medley of radiant hues, and I cannot distinguish its form, cannot invite it, as the one possible interpreter, to translate to me the evidence of its contemporary, its inseparable paramour, the taste of cake soaked in tea; cannot ask it to inform me what special circumstance is in question, of what period in my past life.
Will it ultimately reach the clear surface of my consciousness, this memory, this old, dead moment which the magnetism of an identical moment has travelled so far to importune, to disturb, to raise up out of the very depths of my being? I cannot tell. Now that I feel nothing, it has stopped, has perhaps gone down again into its darkness, from which who can say whether it will ever rise? Ten times over I must essay the task, must lean down over the abyss. And each time the natural laziness which deters us from every difficult enterprise, every work of importance, has urged me to leave the thing alone, to drink my tea and to think merely of the worries of to-day and of my hopes for to-morrow, which let themselves be pondered over without effort or distress of mind.
And suddenly the memory returns. The taste was that of the little crumb of madeleine which on Sunday mornings at Combray (because on those mornings I did not go out before church-time), when I went to say good day to her in her bedroom, my aunt Leonie used to give me, dipping it first in her own cup of real or of lime-flower tea. The sight of the little madeleine had recalled nothing to my mind before I tasted it; perhaps because I had so often seen such things in the interval, without tasting them, on the trays in pastry-cooks' windows, that their image had dissociated itself from those Combray days to take its place among others more recent; perhaps because of those memories, so long abandoned and put out of mind, nothing now survived, everything was scattered; the forms of things, including that of the little scallop-shell of pastry, so richly sensual under its severe, religious folds, were either obliterated or had been so long dormant as to have lost the power of expansion which would have allowed them to resume their place in my consciousness. But when from a long-distant past nothing subsists, after the people are dead, after the things are broken and scattered, still, alone, more fragile, but with more vitality, more unsubstantial, more persistent, more faithful, the smell and taste of things remain poised a long time, like souls, ready to remind us, waiting and hoping for their moment, amid the ruins of all the rest; and bear unfaltering, in the tiny and almost impalpable drop of their essence, the vast structure of recollection.
And once I had recognized the taste of the crumb of madeleine soaked in her decoction of lime-flowers which my aunt used to give me (although I did not yet know and must long postpone the discovery of why this memory made me so happy) immediately the old grey house upon the street, where her room was, rose up like the scenery of a theatre to attach itself to the little pavilion, opening on to the garden, which had been built out behind it for my parents (the isolated panel which until that moment had been all that I could see); and with the house the town, from morning to night and in all weathers, the Square where I was sent before luncheon, the streets along which I used to run errands, the country roads we took when it was fine. And just as the Japanese amuse themselves by filling a porcelain bowl with water and steeping in it little crumbs of paper which until then are without character or form, but, the moment they become wet, stretch themselves and bend, take on colour and distinctive shape, become flowers or houses or people, permanent and recognisable, so in that moment all the flowers in our garden and in M. Swann's park, and the water-lilies on the Vivonne and the good folk of the village and their little dwellings and the parish church and the whole of Combray and of its surroundings, taking their proper shapes and growing solid, sprang into being, town and gardens alike, from my cup of tea.

In [5]:
class yield_file(object):
    def __init__(self, filename):
        self.filename = filename
        # Punctuation characters to strip from every sentence
        self.exclude = set(string.punctuation)
 
    # Yields clean lists of lower-case words, one list per sentence
    def __iter__(self):
        # File only has line breaks at paragraph boundaries
        # Always remove possible BOMs with vim -c "set nobomb" -c wq! myfile
        with open(self.filename) as f:
            for paragraph in f:
                # Naive sentence splitting at full stops
                for sentence in paragraph.split('.'):

                    # Use only lower case
                    sentence = sentence.lower()

                    # Remove all punctuation
                    sentence = ''.join(char for char in sentence if char not in self.exclude)

                    # Remove surrounding whitespace
                    sentence = sentence.strip()

                    # Sentence as list of words
                    sentence = sentence.split()

                    # Only yield non-empty sentences
                    if len(sentence) > 0:
                        yield sentence
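
To see what this iterator feeds into Word2Vec, the same cleaning steps can be applied to a string in memory. The sketch below mirrors the steps of the class above; `clean_sentences` is a hypothetical helper introduced only for illustration.

```python
import string

def clean_sentences(paragraph):
    # Same steps as in the iterator: split at full stops, lower-case,
    # strip punctuation, split into words, drop empty sentences
    exclude = set(string.punctuation)
    result = []
    for sentence in paragraph.split('.'):
        sentence = sentence.lower()
        sentence = ''.join(char for char in sentence if char not in exclude)
        words = sentence.strip().split()
        if len(words) > 0:
            result.append(words)
    return result

print(clean_sentences("Swann's way. It's here!"))
print(clean_sentences("M. Swann came."))
```

The second call shows a limitation of splitting at full stops: the abbreviation in "M. Swann" ends a "sentence", so `m` becomes a one-word sentence of its own. This is one motivation for the spaCy-based sentence segmentation used in the next section.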

In [6]:
sentences = yield_file('7-nlp/proust_ascii.txt') 
wv_proust = gensim.models.Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
wv_proust.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# wv_proust.wv.accuracy('7-nlp/questions-words.txt')

[('queen', 0.840269923210144),
 ('laundress', 0.8244017958641052),
 ('historian', 0.82276451587677),
 ('patronage', 0.8200148344039917),
 ('jardin', 0.8188639879226685),
 ('south', 0.808866560459137),
 ('balloon', 0.8062515258789062),
 ('manager', 0.8060844540596008),
 ('painter', 0.803223729133606),
 ('ambassador', 0.799126148223877)]

## Improving embeddings with named entity recognition

In [7]:
class yield_file_tagged(object):
    
    def __init__(self, filename):
        self.filename = filename
    
    def _tag_word(self, word):
        text = word.text
        if word.ent_type_: tag = word.ent_type_
        else: tag = word.pos_
        return text + '|' + tag
 
    # Yields one list of tagged tokens ('text|TAG') per sentence
    def __iter__(self):
        # File only has line breaks at paragraph boundaries
        # Always remove possible BOMs with vim -c "set nobomb" -c wq! myfile
        for paragraph in open(self.filename):
            # SpaCy magic
            doc = nlp(paragraph)
    
            # Detect and merge entities
            for ent in doc.ents:
                ent.merge(tag=ent.root.tag_, lemma=ent.text, ent_type=ent.root.ent_type_)
    
            # Detect and merge noun chunks
            for nc in doc.noun_chunks:
                while len(nc) > 1 and nc[0].dep_ not in ('advmod', 'amod', 'compound'):
                    nc = nc[1:]
                nc.merge(tag=nc.root.tag_, lemma=nc.text, ent_type=nc.root.ent_type_)
            
            for sentence in doc.sents:
                words = []
                for word in sentence:
                    if not word.is_space: 
                        words.append(self._tag_word(word))
                            
                yield words

In [8]:
sentences = yield_file_tagged('7-nlp/proust_ascii.txt') 
wv_proust_tagged = gensim.models.Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
wv_proust_tagged.save('7-nlp/wv_proust_tagged.gensimmodel')

In [9]:
wv_proust_tagged_reloaded = gensim.models.Word2Vec.load('7-nlp/wv_proust_tagged.gensimmodel')
print(wv_proust_tagged_reloaded)
wv_proust_tagged_reloaded.wv.most_similar(positive=['Albertine|PERSON', 'M. de Charlus|PERSON'], negative=['I|PRON'])

Word2Vec(vocab=10617, size=300, alpha=0.025)


[('Morel|PERSON', 0.8458129167556763),
 ('Bloch|PERSON', 0.8372248411178589),
 ('Saint-Loup|ORG', 0.8185210824012756),
 ('Swann|PERSON', 0.8098665475845337),
 ('M. de Norpois|ORG', 0.8037635087966919),
 ('M. de Guermantes|PERSON', 0.7948862910270691),
 ('Elstir|PERSON', 0.7704919576644897),
 ('Robert|PERSON', 0.765677809715271),
 ('Odette|PROPN', 0.7484169006347656),
 ('Cottard|PERSON', 0.7473478317260742)]

## Generating Proust from scratch with a character-level LSTM RNN

In [25]:
DATA_SIZE = 1000000 # Limit dataset to DATA_SIZE characters
SEQUENCE_LEN = 40 # Order of the language model
SEQUENCE_STEP = 3 # Redundancy of the training samples
EPOCHS = 100
LSTM_SIZE = 128
LR = 0.001
BATCH_SIZE = 128

In [26]:
# Read the whole corpus into memory, only use lower case
text = ''
with open('7-nlp/proust_ascii.txt') as f: text = f.read().lower()

# If the dataset is too large, the loss either explodes (this model) or converges far too fast (other models)
# Concretely, the loss rises slightly during later epochs and explodes if an epoch "takes too long"
# "Cutting" epochs "dips" the loss, so that, averaged over epochs, it still decreases
# For this model, ~1M characters seems to work reasonably well
text = text[:DATA_SIZE]
    
print('Corpus length (characters):', len(text))

# Create the set of all characters that appear in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
print(chars)

# Create two dictionaries to "translate" from a char to an index and vice versa
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# Cut the text in semi-redundant sequences of SEQUENCE_LEN characters
sentences = []
next_chars = []
for i in range(0, len(text) - SEQUENCE_LEN, SEQUENCE_STEP):
    # A sequence of length SEQUENCE_LEN
    sentences.append(text[i: i + SEQUENCE_LEN])
    # The following character
    next_chars.append(text[i + SEQUENCE_LEN])
print('Sequences:', len(sentences))

# Generate one-hot vectors, x contains the sequence of SEQUENCE_LEN size, y contains the next char
print('Vectorization...')

# Use the built-in bool: the np.bool alias was removed from recent NumPy versions
x = np.zeros((len(sentences), SEQUENCE_LEN, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Corpus length (characters): 1000000
Unique characters: 44
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '2', '5', '7', '9', ':', ';', '?', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Sequences: 333320
Vectorization...
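
The windowing and one-hot encoding are easier to follow on a tiny input. The sketch below repeats the same steps on a made-up toy corpus with a sequence length of 4 and a step of 3. All `toy_` names are illustrative and chosen so that they do not overwrite the notebook's variables.

```python
import numpy as np

toy_text = 'hello world'  # made-up corpus (assumption)
toy_len, toy_step = 4, 3

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Slide a window of toy_len characters over the text in steps of toy_step
toy_sequences, toy_next_chars = [], []
for i in range(0, len(toy_text) - toy_len, toy_step):
    toy_sequences.append(toy_text[i:i + toy_len])
    toy_next_chars.append(toy_text[i + toy_len])

# One row per sequence, one one-hot vector per character position
toy_x = np.zeros((len(toy_sequences), toy_len, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_sequences), len(toy_chars)), dtype=bool)
for i, seq in enumerate(toy_sequences):
    for t, char in enumerate(seq):
        toy_x[i, t, toy_char_indices[char]] = True
    toy_y[i, toy_char_indices[toy_next_chars[i]]] = True

print(toy_sequences, toy_next_chars)
print(toy_x.shape, toy_y.shape)
```

Each sequence sets exactly one entry per character position in `toy_x`, and each target sets exactly one entry in `toy_y`. A smaller step produces more, and more strongly overlapping, training samples from the same text.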


In [None]:
print('Building model...')

char_model = Sequential()
char_model.add(LSTM(LSTM_SIZE, input_shape=(SEQUENCE_LEN, len(chars))))
char_model.add(Dense(len(chars)))
char_model.add(Activation('softmax'))
optimizer = RMSprop(lr=LR)
char_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

print('Model built\n')
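
As a rough size check, the number of trainable parameters can be worked out by hand. The sketch below assumes a standard LSTM layer with biases (four weight blocks, each with input weights, recurrent weights, and a bias) and the values from this notebook: 44 unique characters and 128 LSTM units. `lstm_params` and `dense_params` are hypothetical helpers, not Keras functions.

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input/output pair plus one bias per output
    return input_dim * units + units

n_chars, lstm_size = 44, 128
n_lstm = lstm_params(n_chars, lstm_size)
n_dense = dense_params(lstm_size, n_chars)
print(n_lstm, n_dense, n_lstm + n_dense)
```

Under these assumptions the model has fewer than 100,000 parameters in total. Calling `char_model.summary()` is the authoritative way to check the actual count.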

In [29]:
# To get an actual character from the probability array returned as a prediction by the model,
# we rescale the distribution by the diversity (a "temperature") and draw one sample from it
def sample(preds, diversity):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / diversity
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
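
Dividing the log-probabilities by the diversity $T$ and renormalizing amounts to $p_i' = p_i^{1/T} / \sum_j p_j^{1/T}$. The sketch below shows the effect on a made-up three-character distribution. `rescale` is a hypothetical helper that repeats the first steps of `sample` without the random draw.

```python
import numpy as np

def rescale(preds, diversity):
    # Temperature-rescaled distribution, without the sampling step
    logits = np.log(np.asarray(preds, dtype='float64')) / diversity
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

toy_preds = [0.5, 0.3, 0.2]  # made-up model output (assumption)
for diversity in [0.5, 1.0, 2.0]:
    print(diversity, np.round(rescale(toy_preds, diversity), 3))
```

A diversity below 1 sharpens the distribution, so the most likely character is chosen even more often. A diversity of 1 leaves it unchanged, and a diversity above 1 flattens it, which leads to more "exotic" output.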

In [22]:
def generate(model, diversities, length):
     
    # A random point in the text
    start_index = random.randint(0, len(text) - SEQUENCE_LEN - 1)
    
    # An empty dictionary
    returns = {}
    
    for diversity in diversities:
        
        generated = ''
        sentence = text[start_index: start_index + SEQUENCE_LEN]
        generated += sentence

        for i in range(length):
            
            # Vectorize sentence
            x_pred = np.zeros((1, SEQUENCE_LEN, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            # Predict next character probabilities
            preds = model.predict(x_pred, verbose=0)[0]
            
            # Add char
            next_index = sample(preds, diversity)  
            next_char = indices_char[next_index]
            generated += next_char
            
            # "Move" the sentence one char over to predict the next next character
            sentence = sentence[1:] + next_char
        
        returns[diversity] = generated
          
    return returns

In [None]:
# A callback function to sample some text after each epoch
def on_epoch_end(epoch, logs):
    print('\n' + '### DIVERSITY: 0.5 ###')
    print(generate(char_model, [0.5], 300)[0.5] + '\n')

history = char_model.fit(x, 
           y, 
           batch_size=BATCH_SIZE, 
           epochs=EPOCHS, 
           callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)],
           validation_split=0.1)
char_model.save('7-nlp/char_model_100.hdf5')

In [30]:
loaded_model = load_model('7-nlp/char_model_100.hdf5')
# Try different diversity values, i.e. less and more "exotic" predictions
for d in [0.1, 0.2, 0.3, 0.5, 1.0, 2.0]:
    print('\n' + '### DIVERSITY: ' + str(d) + ' ###')
    print(generate(loaded_model, [d], 1000)[d] + '\n')


### DIVERSITY: 0.1 ###
more, to pay an infinitely scrupulous at once again, and the presencial before he had not to leave her that he was not the princess of the pink of the present its withing the place of odette's lever, who had supposed that she was not the same said of the probuil in a serious changess who had been able to see her all the least of the probull of the pink of the place in the same said of the portralte talls face of the probull of the country was to him by the family that he was not the first time there was nothing for the princess that he had been able to see her all the last that he was not the seepers and conceation.
of and to see the course of the place in the same said of the promise of the presence of the present its sure do really the same time a disting of the probull of the provision of the presence of the street chimpselse, which she had not to the same of the pleasure which the most seck the company of the pleasure which she had not allowed that it was a 

## Generating Proust from scratch with an unsmoothed maximum likelihood character-level language model

In [31]:
def train_char_lm(fname, order=4):
    # Read the whole corpus and close the file afterwards
    with open(fname) as f:
        data = f.read()
    lm = defaultdict(Counter)
    for i in range(len(data)-order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char]+=1
    def normalize(counter):
        s = float(sum(counter.values()))
        return [(c, cnt/s) for c, cnt in counter.items()]
    outlm = {hist:normalize(chars) for hist, chars in lm.items()}
    return outlm

def generate_letter(lm, history, order):
    # Only the last `order` characters determine the next one
    history = history[-order:]
    # Each entry is a (character, probability) pair
    keys = [k[0] for k in lm[history]]
    values = [k[1] for k in lm[history]]
    return np.random.choice(keys, p=values)
        
def generate_text(lm, order, nletters=1000):
    # Seed is a random history (character n-gram) observed in the corpus
    history = random.choice(list(lm.keys()))
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)
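
A toy example shows what the trained model looks like. The sketch below applies the same counting and normalization to a short made-up string instead of a file. `train_from_string` is a hypothetical variant of `train_char_lm` that stores the distributions as dictionaries for readability.

```python
from collections import Counter, defaultdict

def train_from_string(data, order):
    # Count which character follows each history of length `order`
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[history][char] += 1
    # Turn the counts into relative frequencies (maximum likelihood estimates)
    return {hist: {c: n / sum(ctr.values()) for c, n in ctr.items()}
            for hist, ctr in counts.items()}

toy_lm = train_from_string('abababac', order=1)
print(toy_lm)
```

After `a`, the model predicts `b` three times out of four and `c` once, while `b` is always followed by `a`. The history `c` only occurs at the very end of the string and therefore has no entry at all. Because the model is unsmoothed, generation cannot continue from a history that was never observed with a following character.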

In [14]:
for order in [1,2,3,4,5,10,15,20]:
    print('### ORDER: %d ###' % order)
    lm = train_char_lm('7-nlp/proust_ascii.txt', order)
    print(generate_text(lm, order) + '\n')

### ORDER: 1 ###
pldise Ely waghir at, ther myhomenscincast ndit rcrin dind pathe orer l wer all om mmby ysed h whitusisal, at w o he, ha ile aby th alsh ssiffache dntey ws meve stterentat Vermiche ithads tesideran ifasith tongor co t Le orofare onf hather, ad wabe pstosstolanthise at t d mathero I anxal thare or ce thawhern aulome og por mpre amen Conenay ime, tofe l s, we arnghou, linle ice tmsincthasosonso alich I tha thet tof t ea d tre, a e lich friexhel fouberes a teblderlad sp ang Intcuther how trero d mam her. myene sm, arfr spll thes, rrarme lf, whinttin iosa. d wht higr, epeweminowa int warthent ghem ucarney puno t me msarevemus s f. th, hes hedrta tr outloke ite oun ot hurrrechees, e jula werink bl e on o tomysthato bing h lur, tatendialathes th l Bulle raltharatos showhena w y me ldof y foof Anco fe icerin asir fusisound henknt finad ing ie s n culyoullllff bud keduthe bean prof ughowerhin, tese wertesott Bly f, wedon d ldr theape fr ie, bs l d thef se d hinsicecedoulenof t

In [32]:
order = 15
print('### ORDER: %d ###' % order)
lm = train_char_lm('7-nlp/proust_ascii.txt', order)
print(generate_text(lm, order, nletters=10000))

### ORDER: 15 ###
 panes are mere transparencies. He would try to make out Odette. And then, how were we not to include in it the uncles who are not really related by blood, being the uncles only of their nephews' wives. The Messieurs de Charlus are indeed so convinced that a certain man in society who had taken her out of the room that it remembered, which he had been bored to tears in her friend's presence, as much as the dazzling glories of the beach, albeit my jealousy was suddenly revealing her anxiety to look my best before Albertine had paid me. I saw her in successive years of my life occupying, with regard to whom one completely forgot about dinner and the time; here again as at Balbec I had not been mistaken, for he would think of me now no longer had any idea what it is?" "Well . . . I've heard people say 'the Rohans' or in contempt, as she herself still supposed that one could examine here in public, for, the Princesse de Parme and announced by one verbal blow after another

## RNN resources

Research areas: text-to-speech, translation, image captions, handwriting, music, ...

- [Keras recurrent layers](https://keras.io/layers/recurrent/)
- [Emergence of grid-like representations by training recurrent neural networks to perform spatial localization](https://openreview.net/pdf?id=B17JTOe0-)
- [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [The unreasonable effectiveness of Character-level Language Models](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)
- [Sascha Pohflepp: Recursion (sic!)](http://pohflepp.net/Recursion)
- [Polyphonic Music Generation Using Tied Parallel Networks](https://www.cs.hmc.edu/~ddjohnson/tied-parallel/)
- [A Connectionist Approach to Algorithmic Composition](http://www.indiana.edu/~abcwest/pmwiki/pdf/todd.compmusic.1989.pdf)
- [Four Experiments in Handwriting with a Neural Network](https://distill.pub/2016/handwriting/)
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [do androids dream of cooking?](https://gist.github.com/nylki/1efbaa36635956d35bcc)
- [Folk music style modelling using LSTMs](https://github.com/IraKorshunova/folk-rnn)
- [RNN Bible Bot](https://twitter.com/RNN_Bible)