# Fulfillomatic

##### Adriana Souza, Roger Filmyer

![NLG](http://www.pngall.com/wp-content/uploads/2016/07/Meditation-Transparent.png)

In general, sentence generation can be really easy or really hard, depending on the quality of output you'd like. NLG (Natural Language Generation) mainly deals with generating text that describes a set of data. In our case, we are taking a training corpus of quotes and generating our own context-free-grammar sentences.

One common approach is using n-gram based sentence generators. Here, the output is often ungrammatical nonsense, since these generators just chain together words that tend to appear in sequence. However, word on the street is that 3- and 4-gram generators look quite okay sometimes.

In this notebook, we tried 4 approaches: 

1. Unweighted Parts-of-Speech (Fulfillomatic v0)
2. Bigram Model (Fulfillomatic v1)
3. Trigram Model (Fulfillomatic v2)
4. LSTM (Fulfillomatic v3)

Results varied and, as expected, improved steadily as we moved from v0 to v2, but the quality of the output changed dramatically with the LSTM. We believe this might be because of the strange structure of our training data: inspirational/zen quotes are single sentences that almost have to be treated as individual documents. Single sentences have significantly fewer words than more common corpora, which usually include one or more full bodies of text.

Future work includes implementing some kind of subject-verb agreement, some smarter padding and input sequences for the LSTM, and using Markov chains to generate sentences. This could be done using a transition matrix that says how likely it is to transition from each part of speech to every other.
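
As a rough sketch of that last idea (not something we implemented), a part-of-speech transition table could be estimated by counting which tag follows which. The tag sequences below are hand-written stand-ins for POS-tagged quotes, and `build_transition_table` and `sample_skeleton` are hypothetical helper names.

```python
import random
from collections import Counter, defaultdict

# Hand-written tag sequences standing in for POS-tagged quotes (assumption).
tag_sequences = [
    ["PRP", "VBP", "DT", "NN"],
    ["DT", "NN", "VBZ", "JJ"],
    ["PRP", "VBP", "JJ"],
]

def build_transition_table(sequences):
    """Estimate P(next_tag | tag) from counts, with START/STOP markers."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["START"] + seq + ["STOP"]
        for current_tag, next_tag in zip(padded, padded[1:]):
            counts[current_tag][next_tag] += 1
    table = {}
    for tag, followers in counts.items():
        total = sum(followers.values())
        table[tag] = {nxt: n / total for nxt, n in followers.items()}
    return table

def sample_skeleton(table, rng=random, max_len=30):
    """Walk the chain from START until STOP (or max_len tags) is reached."""
    skeleton, tag = [], "START"
    while len(skeleton) < max_len:
        followers = table[tag]
        tag = rng.choices(list(followers), weights=list(followers.values()))[0]
        if tag == "STOP":
            break
        skeleton.append(tag)
    return skeleton

transitions = build_transition_table(tag_sequences)
print(transitions["START"])          # two of three sequences start with PRP
print(sample_skeleton(transitions))  # a random tag skeleton
```

Each row of the table sums to 1, so a sampled skeleton could then be filled with words per tag, much like Version 0 does with whole sentence templates.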

Ultimately, the most important thing is to have inner peace. 

Namaste ॐ

### Loading data

In [1]:
# Packages
import numpy as np
import nltk
import random
import string

from collections import defaultdict

In [2]:
# Selecting the file to use
#file = 'training/inspirational_quotes.txt'
#file = 'training/nietzsche_quotes.txt'
file = 'training/zen_quotes.txt'
#file = 'training/everything.txt'  # Worse quote results due to different styles being mixed together.

# Storing quotes from file in a list (one quote per line)
with open(file, encoding='utf-8') as opened_file:
    quotes = opened_file.read().splitlines()

***

## Version 0: Unweighted Parts-of-Speech

To start, we tried...

In [3]:
# Tokenize
tokenized_corpus = []
for quote in quotes:
    tokenized_quote = nltk.tokenize.word_tokenize(quote)
    tagged_quote = nltk.pos_tag(tokenized_quote)
    tokenized_corpus.append(tagged_quote)

# Set up the language "model"
parts_of_speech = defaultdict(list)
sentence_structures = []
for quote in tokenized_corpus:
    sentence_structure = []
    for word, pos in quote:
        parts_of_speech[pos].append(word)
        sentence_structure.append(pos)
    sentence_structures.append(sentence_structure)

# Generate an example sentence
# get_mindful_v0()
def chaos():
    """
    Generate an inspirational sentence. 
    
    Ensure that you are in the proper state of mind before running. ॐ
    """
    sentence_skeleton = random.choice(sentence_structures)
    reconstituted_sentence = []
    for part_of_speech in sentence_skeleton:
        new_word = random.choice(parts_of_speech[part_of_speech])
        reconstituted_sentence.append(new_word)
    return " ".join(reconstituted_sentence)

In [4]:
# Output
chaos()

'of we exist motionless , buddhahood up . with yourself are possible , awaken out . you have the opinion on the bottle with rebellion and eating .'

### Version 0 - Chaos: Results

* your ready Speak begins when you can hear you not and never .
* in I think busy forwards of coffee , forever it . in you will live aware library you , make your education .
* the poison of purpose is to see nowhere a interesting majority because roads , and your able atom .
* without t denies my anything that bulk , yourself can once call You .
* as I are dreams to don grief , never the sun is to tolerate able you .
* all valuable choice is than the painful comfort , it can keep imprisoned believe only not that you ’ you .

## Next:

We see we need to do a lot of things, most of which we should've done even before we started (like lowercasing, removing punctuation, taking care of contractions). It seems that just assuming words would have a uniform distribution if we know the input is some sort of "quote"-esque type sentence wasn't enough. Since we kept our quotes separate and they aren't particularly long sentences, let's start with a bigram model.
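
A minimal sketch of that cleanup, assuming we only want lowercase words with punctuation stripped and a handful of contractions expanded. The `CONTRACTIONS` table and the `normalize_quote` helper are illustrative assumptions, not what the later cells use.

```python
import string

# A small, hand-picked contraction table (assumption; far from complete).
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "you're": "you are",
}

# Strip ASCII punctuation plus the curly quotes and dashes seen in quote files.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation + "…”“–‘’")

def normalize_quote(quote):
    """Lowercase, expand known contractions, then drop punctuation."""
    text = quote.lower().replace("’", "'")  # unify curly apostrophes first
    words = [CONTRACTIONS.get(word.strip('.,!?;:"'), word) for word in text.split()]
    return " ".join(words).translate(PUNCTUATION_TABLE).split()

print(normalize_quote("Don’t worry, it's fine."))
# ['do', 'not', 'worry', 'it', 'is', 'fine']
```

Expanding contractions before stripping punctuation avoids tokens like `dont` or a stray `t`, both of which show up in our generated quotes.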

***

## Version 1: Bigram Model

Well, that worked great. Maybe some context _would_ be good.

In [5]:
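# Turning list into string
# Translation table that strips punctuation (built once, reused for every quote)
table = str.maketrans('', '', string.punctuation + '…”“–')

corpus = ""
for quote in quotes:
    # Lowercasing
    quote = quote.lower()

    # Adding end tokens to mark the end of quotes
    quote = quote.replace('.', ' END ')

    # Remove punctuation
    quote = quote.translate(table)

    # Adding cleaned text to corpus; the trailing space keeps the last word
    # of a quote without a final period from fusing with the next quote
    corpus = corpus + quote + ' '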

# Tokenizing
def tokenize(input_string):
    return input_string.split()

# Getting bigram model
def get_bigrams(corpus):
    corpus_fd_unigram = nltk.FreqDist(tokenize(corpus))
    bigrams = nltk.bigrams(['END'] + tokenize(corpus))
    bigrams_fd = nltk.FreqDist(bigrams)
    results = {}
    for bigram, bigram_frequency in bigrams_fd.items():
        first_word, second_word = bigram
        probability = (bigram_frequency / corpus_fd_unigram[first_word])    
        results[bigram] = probability
    return results

bigram_model = get_bigrams(corpus)
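
To make the estimate concrete, here is a toy version of the same calculation using only the standard library. It assumes the relative-frequency estimate P(w2 | w1) = count(w1, w2) / count(w1), where count(w1) only counts occurrences that have a successor, and it runs on a made-up three-quote corpus. `toy_bigram_model` is a hypothetical name.

```python
from collections import Counter

# Made-up corpus in the same cleaned format: lowercase words, END after each quote.
toy_tokens = "be here now END be kind END now is enough END".split()

def toy_bigram_model(tokens):
    """Return {(w1, w2): count(w1, w2) / count of w1 as a bigram's first word}."""
    padded = ["END"] + tokens              # lets the first quote start from END
    pair_counts = Counter(zip(padded, padded[1:]))
    context_counts = Counter(padded[:-1])  # every token that has a successor
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}

toy_model = toy_bigram_model(toy_tokens)
print(toy_model[("END", "be")])   # 2 of the 3 quotes start with "be"
print(toy_model[("be", "here")])  # "be" is followed by "here" once and "kind" once
```

Counting contexts this way guarantees that the probabilities for each first word sum to 1, which is what a sampler needs.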

## New version 

Below, we use a bigram model and also take some care in structuring how the sentence will come out. We make sure that our quote starts with a bigram of the form `[END, word]` and ends with a bigram of the form `[word, END]`. 

In [6]:
# Creating a function that generates a sentence from an n-gram model
def get_sentence_with_ngram_model(num_words, model):
    words_in_sentence = ['END' for i in range(0, num_words - 1)] # Pad the start of the sentence with 'END' tokens
    final_word = None
    
    while final_word != 'END':        
        initial_n_gram_words = words_in_sentence[-(num_words - 1):]
        matching_n_gram_keys = []
        
        # Collect the n-grams whose leading words match the current context
        for n_gram in model.keys():
            words_to_match = zip(n_gram, initial_n_gram_words)
            if all(a == b for a, b in words_to_match):
                matching_n_gram_keys.append(n_gram)

        # Sample the next word in proportion to its n-gram probability,
        # normalizing so the probabilities always sum to 1
        n_gram_probabilities = [model[n_gram] for n_gram in matching_n_gram_keys]
        total_probability = sum(n_gram_probabilities)
        final_word = np.random.choice(
                        a=[n_gram[-1] for n_gram in matching_n_gram_keys],
                        p=[p / total_probability for p in n_gram_probabilities])
        words_in_sentence.append(final_word)
        
    words_in_sentence = words_in_sentence[(num_words - 1): -1]
    
    # Capitalize first letter of first word
    if len(words_in_sentence) > 0:
        first_word = words_in_sentence[0]
        first_word = first_word[0].upper() + first_word[1:]
        words_in_sentence[0] = first_word
        sentence = " ".join(words_in_sentence) + '.'
    else:
        sentence = get_sentence_with_ngram_model(num_words, model)
    return sentence

Let's try it with a bigram model:

In [7]:
# Version 1 of Fulfillomatic
def duality():    
    """
    You must only concentrate on the next step, the next breath, 
    the next stroke of the broom, and the next, and the next. Nothing else.
    ॐ
    
    (Bigram Model)
    """    
    sentence = ""
    while len(sentence.split()) < 4:
        sentence = get_sentence_with_ngram_model(2, bigram_model)
    return sentence

In [8]:
# Creating a function that will print a desired number of generated quotes
def repeat(times, f):
    for i in range(times): f()
    
def do_v1():
    print(duality())

# Printing 5 generated quotes
repeat(5, do_v1)

By what our task must be serene in the future is not happening in its very nature of the night rain just let its searing power.
The unreality of zen merely points.
It is only because he takes a cult.
Life is always longing for the way will not interfering not seek nothing unless we should rather than continuing to balance.
Those who seek a liberated one that supreme love is ready for happiness only one loses joy and it.


In [9]:
def do_v0():
    print(chaos())

# Printing 5 generated quotes
repeat(5, do_v0)

of you are to go them should only feel Other in living .
about it are to be you , you will attain it . go to be it , and yourself will Keep I .
you see thought up not by he achieve then of I , of we achieve to let accomplished within your everyday things but possibilities . what He put and enlightenment still is positive to it . but if seeking demonic because the dissatisfaction at a world , the way dwell never find right No more .
Miller activities of of we should establish out to The master behind our faith , us will find to be and use up of you .
of a yourselves , body others . Zen from an effort and hold you my wood .


### Version 1 results

* Just do it.
* In my friends you can get the fire you grow from it should scare you do drunk.
* You.
* I believe in the least for anything i believe in god from a man to exist.
* Dont bother just take rest is too little one that you better.
* If you can not what we know what you will remain constant.
* What we are travelling more difficult than to forget is no greatness.
* Anything you look for what you do not being yourself.
* Let the wilderness of all else is still looking for us entirely happy because i told dismiss that can do something.

***

## Version 2: Trigram Model

It's... marginally better. Our ratio of "potentially good" generated quotes to "gibberish quotes" is still pretty awful. Let's see how a trigram model does instead.

In the steps above, we took some risks with our tokens. Since we ended up turning our corpus back into a long string instead of a list, now we just have quotes after quotes that aren't necessarily related. This is a problem because we don't necessarily want trigrams that span from the end of one quote to the next. Those trigrams do not represent tokens that could follow each other in a text -- they are completely accidental.

To address this, we added double end tokens for the trigrams: now, starting tokens look like `[END, END, word]` and end tokens like `[word, END, END]`.

In [10]:
# Adding extra END tokens
def add_extra_end_token(tokenized_document):
    new_document = []
    for token in tokenized_document:
        new_document.append(token)
        if token == "END":
            new_document.append("END")
    return new_document

def get_trigrams(document):
    corpus = tokenize(document)
    corpus = add_extra_end_token(corpus)
    corpus_fd_bigram = nltk.FreqDist(nltk.bigrams(["END"] + corpus))
    trigrams = nltk.trigrams(["END", "END"] + corpus)
    trigrams_fd = nltk.FreqDist(trigrams)
    results = {}
    for trigram, trigram_frequency in trigrams_fd.items():
        first_word, second_word, third_word = trigram
        probability = (trigram_frequency) / (corpus_fd_bigram[(first_word, second_word)])
        results[trigram] = probability
    return results

#get_trigrams(corpus)

trigram_model = get_trigrams(corpus)
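
The cell above relies on every quote ending in a period, since that is the only place an `END` token gets inserted. One alternative, sketched here under the assumption that the quotes are still available as separate token lists, is to pad each quote on its own so that no trigram can span two quotes. `trigram_counts_per_quote` is a hypothetical helper, and the two quotes are made up.

```python
from collections import Counter

def trigram_counts_per_quote(tokenized_quotes):
    """Count trigrams quote by quote, padding each side with two END tokens."""
    counts = Counter()
    for tokens in tokenized_quotes:
        padded = ["END", "END"] + tokens + ["END", "END"]
        counts.update(zip(padded, padded[1:], padded[2:]))
    return counts

toy_quotes = [["be", "here", "now"], ["now", "is", "enough"]]
toy_counts = trigram_counts_per_quote(toy_quotes)

print(toy_counts[("END", "END", "be")])   # 1: a quote starts with "be"
print(toy_counts[("now", "END", "END")])  # 1: a quote ends with "now"
print(toy_counts[("here", "now", "is")])  # 0: would cross a quote boundary
```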

`get_sentence_with_ngram_model` above works with any n-gram model, so version 2 only needs to pass it the trigram model:

In [11]:
# Get mindful with Fulfillomatic version 2
def open_your_third_eye():
    """
    Three things cannot long be hidden: the sun, the moon, and the truth. ॐ
    
    (Trigram Model)
    """
    sentence = ""
    while len(sentence.split()) < 4:
        sentence = get_sentence_with_ngram_model(3, trigram_model)
    return sentence

Let's generate some examples:

In [12]:
# Print 15 generated sentences
def do_v2():
    print(open_your_third_eye())
    
repeat(15,do_v2)

He plunges recklessly towards an irrational death.
Be present above all else.
When i am the infinite the vastness that is likely to hurt.
Purity is something that is brought about by a calm mind and such peace of mind produces right values produce right thoughts produce right thoughts.
Calmness in activity is true calmness.
The more it tends to be.
Those who worship do not let go of old judgments and opinions.
Whether we like it or not change comes and goes comes and the entire sky are reflected in one dewdrop on the tops of mountains is the lesson.
Anger ego jealousy are the slave to them.
In the act of being open to all that.
It will take quite a long time before you find out for yourself.
Unless we die to ourselves we can open up our small mind.
And when they played they really played.
Not engaging in ignorance is wisdom.
And when they played they really played.


***

## Thoughts

Let's take a look at how some of our quotes are being put together:

##### Example: "It takes courage **to grow** sharper."

Take: *"The world is full of magic things, patiently waiting for our senses* **to grow** *sharper."*

And: *"It takes courage* **to grow** *up and become who you really are."*

##### Get: It takes courage **to grow** sharper.
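
The splice can be reproduced on just those two quotes. The sketch below (standard library only; `successors` is a name chosen for illustration) records which words follow each two-word context, and shows that the context `('to', 'grow')` is the junction where the generator can hop from one quote to the other.

```python
from collections import defaultdict

quotes_pair = [
    "the world is full of magic things patiently waiting for our senses to grow sharper",
    "it takes courage to grow up and become who you really are",
]

# Map each two-word context to the set of words seen right after it.
successors = defaultdict(set)
for quote in quotes_pair:
    words = quote.split()
    for first, second, third in zip(words, words[1:], words[2:]):
        successors[(first, second)].add(third)

print(sorted(successors[("to", "grow")]))     # ['sharper', 'up']
print(sorted(successors[("courage", "to")]))  # ['grow']
```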



### What if we feed the model a bunch of Nietzsche quotes?

* Without music life would be a means to conceal oneself.
* The noble soul reveres itself.
* What is the struggle of opinions that is to preserve the distance which separates us from other men.
* God is a rope over an abyss.
* But there is also always some reason in madness.
* We have forgotten are illusions.
* Christianity is our taste no longer our reasons.
* The end of a bad memory is too good.
* The advantage of a strong faith is infallible.
* There are two different types of people in the enemy’s staying alive.

### What if we feed the model a bunch of Zen quotes?

* When another person makes you rise to new heights no matter what.
* The foolish reject what they crave.
* The waters are in motion but the love of the need for complicated philosophy.
* Wisdom is letting go of who you are.
* So do the wise to resist pleasures but the moon does not last.
* Nurture your mind you should burn yourself completely.

Not great, but not too bad either! On average, in a batch of 10, about 4 of the quotes will be *pretty okay.*

**Conclusion:** 40% of the time it works every time!

### How do we know if our model is any good?

Since we were generating sentences without a pre-defined grammar, it was hard to justify using some of the metrics we learned this semester (F1 score, etc.). Our methodology was to look at batches, cherry-pick the good quotes, and compute the ratio of good to total. This varied every time we ran things, but it's safe to say that, except for a few gems, most of the quotes were pretty nonsensical.

With this in mind, we thought about pushing things forward by using an LSTM and implementing some kind of subject/verb agreement. We didn't do the latter, but our attempt at the former is below.

***

## Trying an LSTM

The next step was, naturally, trying an RNN -- because why not? Anything longer than a trigram counts as a long-term dependency for us. Unfortunately, a plain RNN does not work well in practice here. During training, as information loops through the network, error gradients accumulate and can produce very large updates to the model weights. [This results in an unstable network](https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47).
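
A toy illustration of why this happens, under the simplifying assumption that backpropagating through each time step multiplies the gradient by one constant factor (real networks multiply by a matrix that changes from step to step). A factor slightly above 1 blows up over many steps, and a factor slightly below 1 shrinks toward zero. `gradient_after` is a made-up helper.

```python
def gradient_after(steps, factor, start=1.0):
    """Scale a starting gradient by `factor` once per time step."""
    gradient = start
    for _ in range(steps):
        gradient *= factor
    return gradient

# Compare a shrinking, a stable, and a growing factor over 50 time steps.
for factor in (0.9, 1.0, 1.1):
    print(factor, gradient_after(steps=50, factor=factor))
```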

Fortunately, LSTMs are a thing! A Long Short-Term Memory (LSTM) model is a kind of RNN designed to learn long-term dependencies. Its ability to forget, remember, and update information puts it one step ahead of plain RNNs. To give this a shot, we followed and adapted the [How to Develop a Word-Level Neural Language Model and Use it to Generate Text](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/) tutorial to try to generate some mindfulness and go one step beyond opening our third eye.

Right at the beginning of the article, the author says:

*"Neural network models are a preferred method for developing statistical language models because they can use a distributed representation where different words with similar meanings have similar representation and because they can use a large context of recently observed words when making predictions."*

This is already alarming, since our corpus (~614 quotes in the `inspirational_quotes.txt` file) isn't exactly big. The author also uses the 50 previous words to predict the next one, training on Plato's *The Republic*; 50 words of context is way too much for us. We already thought we were at the limit with trigrams, but we need to try something bigger to give the LSTM a shot.

In [13]:
# Loading packages
import h5py  # Warning: this was a headache along with making sure all the HDF5 stuff was good too
import keras
from pickle import dump, load
from random import randint
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

Using TensorFlow backend.


Below is the code from the tutorial with a few adjustments for our data, like the size of the input sequences. You will note it is not exactly the same because there were some errors in the code itself that we corrected.

### Loading and cleaning the file

In [14]:
# Load doc into memory
def load_doc(filename):
    # open the file as read only; the with-block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# Turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# Save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

# Load document
in_filename = 'training/inspirational_quotes.txt'
doc = load_doc(in_filename)
#print(doc[:200])

# Clean document
tokens = clean_doc(doc)


# Organize into sequences of tokens
length = 3 + 1     # Changed from 50+1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)

# Save sequences to file
out_filename = 'inspirational_sequences.txt'
save_doc(sequences, out_filename)

Here's a look at some statistics about our corpus:

In [15]:
# Print some statistics about our quotes
#print(tokens[:50])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
print('Total Sequences: %d' % len(sequences))

Total Tokens: 14479
Unique Tokens: 2319
Total Sequences: 14475


### Tokenizing and setting up the sequences and model

In [16]:
# Load
in_filename = 'inspirational_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# Integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

# Vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# Separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

# Define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 3, 50)             115950    
_________________________________________________________________
lstm_1 (LSTM)                (None, 3, 100)            60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 2319)              234219    
Total params: 501,069
Trainable params: 501,069
Non-trainable params: 0
_________________________________________________________________
None


In [17]:
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=200)

Epoch 1/200
...
Epoch 77/200
Epoch 78

<keras.callbacks.History at 0x213cc604160>

We had some issues installing `h5py` and getting a current `HDF5` on Windows, so the two commented-out lines in the next cell work around an error that Keras raises about these two dependencies when you try to save the model.

In [23]:
# from importlib import reload
# reload(keras.models)

# save the model to file
model.save('model.h5')
# save the tokenizer
with open('tokenizer.pkl', 'wb') as tokenizer_file:
    dump(tokenizer, tokenizer_file)

In [24]:
# load cleaned text sequences
in_filename = 'inspirational_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model.h5')

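# load the tokenizer
with open('tokenizer.pkl', 'rb') as tokenizer_file:
    tokenizer = load(tokenizer_file)

# select a seed text (randint includes both endpoints, hence the - 1)
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

# select a second seed text; this is the one used below
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')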

encoded = tokenizer.texts_to_sequences([seed_text])

calligraphy painting music or

whatever you think you



In [20]:
# predict the index of the most likely next word
# (the seed has seq_length + 1 words, so drop the first one)
yhat = model.predict_classes(np.array(encoded)[:, 1:], verbose=0)

In [25]:
out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break
        
encoded = pad_sequences(encoded, maxlen=seq_length, truncating='pre')  # encoded is already a list of sequences

In [29]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 200)
print(generated)

cannot do interfere with what you can do that you can at all the times you can to all the people you can as long as you keep yourself centered the time is always right to do what is essential what the logos of a social being requires and in the requisite way which brings a double satisfaction to do less better because most of what we say and do is not doing not being able to choose which way to go and for what purpose to make the right things happen what really distinguishes this generation in all countries from earlier generations is its determination to act its joy in action the assurance of being able to choose which way to go and for what purpose to make the right things happen what really distinguishes this generation in all countries from earlier generations is its determination to act its joy in action the assurance of being able to choose which way to go and for what purpose to make the right things happen what really distinguishes this generation in all countries from earlier 

![NLG](https://supportivedivorcesolutions.com/wp-content/uploads/2017/03/iStock-468140568.jpg)