## Word representation
The collection of documents is named a corpus. The documents are vectors, and the collection, the corpus, is the vector space. Each dimension or axis is often called a term or token, signifying that it may be either a word or a character. The mapping between the words and their vector axes is a dictionary. For the subsequent examples we can use the following dictionary and documents:

| Term | Mapping |
| --- | --- |
| where | 0 |
| is | 1 |
| my | 2 |
| money | 3 |
| car | 4 |
| wallet | 5 |

doc1 = "where is my money"

doc2 = "i keep my money in my wallet" 

doc3 = "my car is where my money is"

In [1]:
dct = {"where": 0, "is": 1, "my": 2, "money":3, "car": 4, "wallet": 5}

doc1 = "where is my money"
doc2 = "i keep my money in my wallet"
doc3 = "my car is where my money is"

### Word counts or bags of words

Here we represent each document as an array or a tuple. Matter doesn't order word. At least not in this representation. Remember that the model is measured by its usefulness--so let's ignore order and see how far we get.

In the case of a *word-count* array we can assume that the position in the array represents the related word dimension. So doc1 could be represented as the array [1,1,1,1,0,0]. There is one occurrence of "where", so there is a corresponding value of unity at the zeroth position. There is one occurrence of "is", so there is a value of unity at the first position of the array. Et cetera.

When a document is represented as a list of tuples, each tuple works like a vector component: the first position indicates the term axis and the second the count along that axis. Rather than wallet=1, the tuple for wallet would be given as (5, 1), since wallet maps to index 5. So doc3 would be represented as [(2,2),(4,1),(1,2),(0,1),(3,1)], given that "is" and "my" occur twice. The benefit of this representation is that it is sparse: terms with a zero count are left out of the description, which removes the need for lots of zeroes. This is particularly useful where the dictionary may consist of 3000 to 100,000 terms.

Note that in the bag-of-words format, "money my is where" is identical to doc1. We lose the context in this format.

In [2]:
doc2_array = [0, 0, 2, 1, 0, 1]  # word count vector
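Both formats can be sketched in a few lines of plain Python. The helper names `to_count_vec` and `to_bow` below are illustrative stand-ins written for this example, and they simply skip words that are not in the toy dictionary.

```python
from collections import Counter

dct = {"where": 0, "is": 1, "my": 2, "money": 3, "car": 4, "wallet": 5}

def to_count_vec(doc, dct):
    """Dense word-count array: one slot per dictionary term."""
    vec = [0] * len(dct)
    for word in doc.split():
        if word in dct:  # words outside the dictionary are skipped
            vec[dct[word]] += 1
    return vec

def to_bow(doc, dct):
    """Sparse bag-of-words: (term index, count) pairs for terms that occur."""
    counts = Counter(dct[w] for w in doc.split() if w in dct)
    return sorted(counts.items())

doc1 = "where is my money"
doc3 = "my car is where my money is"
print(to_count_vec(doc1, dct))  # [1, 1, 1, 1, 0, 0]
print(to_bow(doc3, dct))        # [(0, 1), (1, 2), (2, 2), (3, 1), (4, 1)]
```

The tuples are sorted by term index here, so the order differs from the hand-written doc3 example, but the content is the same.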

## Vectors & vampires
Let's parse the text [Carmilla](https://www.gutenberg.org/ebooks/10007) into a series of bag-of-words tuples and arrays.

P.S. Carmilla is a wonderful book!

In [3]:
with open('carmilla.txt', 'r') as f:
    corpus = f.read()
    
print(corpus[:300])  # the index is by character in a string

CARMILLA

J. Sheridan LeFanu

1872



PROLOGUE

_Upon a paper attached to the Narrative which follows, Doctor Hesselius
has written a rather elaborate note, which he accompanies with a
reference to his Essay on the strange subject which the MS. illuminates.

This mysterious subject he treats, in tha


In [4]:
# Here we are using a custom method to reveal the basic workings.
# Excellent prebuilt methods are available with:
# gensim, Hugging Face, spaCy, and scikit-learn

from utils import dctConstr  # custom class
dct = dctConstr()  # initialize the method
dct.constructor(corpus)  # build the dictionary of terms
print("Term/token count:", len(dct.terms))

Term/token count: 4368


In [5]:
# Here we can see the bag-of-words and word-index array formats
# The bag-of-words should be the same length as the number of unique terms.
# The word-index array should be the same length as the number of words in the selection
# The word-count array will be the same length as the dictionary (>4000 terms), so we won't print it here

sample = corpus[103:257]
print(sample, "\n")
print(dct(sample), "\n\n", dct.to_idx(sample))

Doctor Hesselius
has written a rather elaborate note, which he accompanies with a
reference to his Essay on the strange subject which the MS. illuminates. 

[(8, 2), (11, 1), (12, 2), (14, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)] 

 [16, 17, 18, 19, 8, 20, 21, 22, 14, 23, 24, 25, 8, 26, 11, 27, 28, 29, 12, 30, 31, 14, 12, 32, 33]


## Selection of terms in the dictionary: do we need all those?

What are words? Perhaps too heavy an epistemological question before coffee or alcohol... Thankfully there is a Vsauce video to help discuss [Zipf's law](https://www.youtube.com/watch?v=fCn8zs912OE) and some of its implications.

Some words offer more "useful" information than others for many tasks. If the subject of a conversation is inherent, the term "I" is redundant; in many languages its equivalent is commonly omitted. This is the first of many terms that offer limited information for determining what the document concerns. In knowledge extraction and classification tasks, the common practice is to remove these stop-words: they clutter the vector representations, and removing them most often improves the accuracy of the model that follows.

On the opposite end of Zipf's distribution are the terms that are used very infrequently. Does the inclusion of the term "parsimonious" in your dictionary help you improve a model? Perhaps if the object of your model is to separate documents written by academics from those written by everyone else it may be useful... However, if it occurs infrequently within a corpus it represents an outlier in the data. Any model we develop against a corpus will include these infrequent terms. In doing so the model will fit the training data more closely, but therein lies the problem. It will fit the training data and not necessarily the real data.

Determining where to cull the most frequent and the most infrequent terms is ultimately a question of the language, the dataset, and the model being used. Build the model based on a best estimate, revise the dictionary, rebuild the model, repeat, and graph the outcomes. If the model fit quickly becomes poor with further reduction, stop.
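As a rough illustration of culling both ends of the distribution, the sketch below drops the `top_n` most frequent terms and anything seen fewer than `min_count` times. The function `trim_terms` and its parameters are hypothetical names for this example and are not the `utils` implementation used in this notebook.

```python
from collections import Counter

def trim_terms(counts, top_n=0, min_count=1):
    """Drop the top_n most frequent terms and any term seen fewer than min_count times."""
    stop_words = {term for term, _ in counts.most_common(top_n)}
    return {t: c for t, c in counts.items() if t not in stop_words and c >= min_count}

text = "the cat and the dog and the bird saw the cat"
counts = Counter(text.split())
kept = trim_terms(counts, top_n=1, min_count=2)
print(kept)  # {'cat': 2, 'and': 2}
```

Rerunning the downstream model with different `top_n` and `min_count` values is one way to carry out the revise-and-rebuild loop described above.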


In [6]:
common_words = [i for i, j in dct.counts.most_common(30)]  # this uses the Counter class
print(common_words)

['the', 'and', 'I', 'of', 'a', 'to', 'in', 'was', '"', 'my', 'her', 'that', 'with', 'you', 'it', 'had', 'me', 'as', 'which', 'she', 'not', 'he', 'is', 'for', 'at', 'have', 'so', 'his', 'on', 'very']


In [7]:
infrequent_words = sorted(dct.counts.items(), key=lambda x: x[1])
print(infrequent_words[:30])

[('⊹', 1), ('CARMILLA', 1), ('J', 1), ('Sheridan', 1), ('LeFanu', 1), ('1872', 1), ('PROLOGUE', 1), ('Upon', 1), ('follows', 1), ('elaborate', 1), ('accompanies', 1), ('reference', 1), ('MS', 1), ('illuminates', 1), ('treats', 1), ('acumen', 1), ('condensation', 1), ('publish', 1), ('"laity', 1), ('forestall', 1), ('relates', 1), ('due', 1), ('consideration', 1), ('abstain', 1), ('presenting', 1), ('précis', 1), ('reasoning', 1), ('extract', 1), ('describes', 1), ('"involving', 1)]


Note the character '⊹'. This is used where the given word is unknown. So if the *dct* vectorizer translates "parsimonious" or "proactive", it will assign the word to index zero. We could have used a special word like *UNKN* here, but our method would no longer work with a language like Chinese, where words are not delimited by spaces.
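A minimal sketch of the unknown-token idea, assuming whitespace-delimited words and a placeholder reserved at index zero. The helpers `build_vocab` and `to_idx` are written for this illustration and may differ from the `utils` class.

```python
UNKNOWN = "⊹"  # placeholder token, reserved at index 0

def build_vocab(text):
    """Map each unique whitespace-delimited word to an index; 0 is reserved."""
    vocab = {UNKNOWN: 0}
    for word in text.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def to_idx(text, vocab):
    """Translate words to indices, sending unseen words to index 0."""
    return [vocab.get(word, 0) for word in text.split()]

vocab = build_vocab("where is my money")
print(to_idx("where is my proactive money", vocab))  # [1, 2, 3, 0, 4]
```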

In [8]:
from utils import dctConstr
dct = dctConstr(stop_words=common_words, ignore_case=True)  # initialize the method
dct.constructor(corpus)  # build the dictionary of terms
dct.trimmer(min_num=10)  # remove infrequent terms (min_num=10 is the count threshold)
print("Term/token count:", len(dct.counts))

before trim number of terms: 4167
after trim: 338
Term/token count: 338


## Natural delimitation of language
So far we have taken an entire corpus and constructed a method for translating it atomically into a simple machine-interpretable vector form. However, we also have the natural units of sentences and paragraphs to work with. 

Unlike paragraphs, sentences can be hard to delimit. Where sentences contain quotations, colons, or semicolons, the period may no longer represent the end of the sentence. If sentences are of interest, tools such as the [punkt tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) are a good place to begin.

Here, we will have to separate the corpus into a series of "documents" by dividing the corpus by paragraphs.

In [9]:
from utils import split_by_paragraphs  # this is separating by \n\n
paragraph_corp = split_by_paragraphs(corpus)
print(paragraph_corp[100])

we sat here this night, and with candles lighted, were talking over the adventure of the evening.
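For readers without the `utils` module, a paragraph splitter can be approximated as below. This is a sketch under the assumption that paragraphs are separated by blank lines; `split_paragraphs` is a hypothetical name, and unlike the output above it does not change the case of the text.

```python
import re

def split_paragraphs(text):
    """Split on blank lines, drop empty chunks, and join wrapped lines."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(c.split()) for c in chunks if c.strip()]

sample_text = "first line\nsecond line\n\n\n\nnext paragraph\n"
print(split_paragraphs(sample_text))  # ['first line second line', 'next paragraph']
```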


In [10]:
# Convert each paragraph into a word-count vector using the trimmed dictionary
carmilla_paragraphs = [dct.to_count_vec(paragraph) for paragraph in paragraph_corp]
print(carmilla_paragraphs[100][:100]) # limiting output to first 100 terms

[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


## Frankenstein's monster vs. vampires: a classification task
Now that we have a method for translating a complete corpus into a series of machine-interpretable objects, let's apply some machine-learning methods to it.

Here we will decide if a given document came from a book about vampires or about Frankenstein's monster (let's just say Frankenstein henceforth even though it is incorrect). Let's build the vectorizer from the collated corpora of Frankenstein & Carmilla, trim it, separate each corpus into paragraphs, and vectorize each set of paragraphs. With the two sets of vectorized documents available, let's then train a logistic regression model. Subsequently, we can combine our vectorizer and the model to decide if any given document is vampire or Frankenstein.

In [11]:
# begin by building the vectorizer
import random
from sklearn.linear_model import LogisticRegression

from utils import dctConstr, split_by_paragraphs

with open('carmilla.txt', 'r') as f:
    carmilla = f.read()
with open('frankenstein.txt', 'r') as f:
    frankenstein = f.read()
    
dct = dctConstr(ignore_case=True)
dct.constructor(carmilla + frankenstein)  # combine the two strings
dct.trimmer(max_num=500, min_num=22)  # remove "less useful" terms

before trim number of terms: 9116
after trim: 513


In [12]:
# let's split each corpus into paragraphs and vectorize them
c_para = split_by_paragraphs(carmilla)
f_para = split_by_paragraphs(frankenstein)
print(len(c_para), len(f_para))  # be certain that we have a useful number of documents for each

c_docs = [dct.to_count_vec(p) for p in c_para]
f_docs = [dct.to_count_vec(f) for f in f_para]

print(c_para[224], "\n\n", f_para[222])

676 790
"yes, a long time. i suffered from this very illness; but i forget all but my pain and weakness, and they were not so bad as are suffered in other diseases." 

 i remained motionless. the thunder ceased; but the rain still continued, and the scene was enveloped in an impenetrable darkness. i revolved in my mind the events which i had until now sought to forget: the whole train of my progress toward the creation; the appearance of the works of my own hands at my bedside; its departure. two years had now nearly elapsed since the night on which he first received life; and was this his first crime? alas! i had turned loose into the world a depraved wretch, whose delight was in carnage and misery; had he not murdered my brother?


In [13]:
# now we can label and randomize their order before we feed them into a model
c_labels = [1] * len(c_docs)  # use unity to indicate carmilla -> vampire
f_labels = [0] * len(f_docs)  # use zero to indicate frankenstein
num_docs = len(c_docs) + len(f_docs)

X_data = c_docs + f_docs
y_data = c_labels + f_labels

print(num_docs, len(X_data), len(y_data))  # double check lengths

Z = list(zip(X_data, y_data))  # pair values
random.shuffle(Z)
X_data, y_data = zip(*Z)

print(X_data[0][:10], y_data[0])

1466 1466 1466
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0] 0


In [14]:
# we can now use a scikit-learn model
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
clf = LogisticRegression(random_state=42).fit(X_data, y_data)

clf.score(X_data, y_data)  # how well does it score against the trained data?

0.9822646657571623
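The score above is measured on the same data the model was trained on, so it says little about unseen paragraphs. One common remedy is to hold back part of the data for testing. The sketch below shows the idea with the standard library only; `holdout_split` is a hypothetical helper, and the toy data merely stands in for the count vectors and labels.

```python
import random

def holdout_split(X, y, test_fraction=0.2, seed=42):
    """Shuffle paired samples and split them into train and test portions."""
    pairs = list(zip(X, y))
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    test, train = pairs[:n_test], pairs[n_test:]
    X_train, y_train = map(list, zip(*train))
    X_test, y_test = map(list, zip(*test))
    return X_train, X_test, y_train, y_test

# Toy data standing in for the count vectors and labels above
X_toy = [[i, i % 3] for i in range(10)]
y_toy = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = holdout_split(X_toy, y_toy)
print(len(X_train), len(X_test))  # 8 2
```

Fitting on the training portion and calling `score` on the test portion would give a less optimistic estimate. scikit-learn offers `train_test_split` in `sklearn.model_selection` for the same purpose.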

In [15]:
print(clf.predict(
    [dct.to_count_vec("yes, a long time. i suffered from this very illness; but i forget all but my pain and weakness, and they were not so bad as are suffered in other diseases.")]))
print(clf.predict(
    [dct.to_count_vec(" i remained motionless. the thunder ceased; but the rain still continued, ")]))

[1]
[0]


In [16]:
# Bram Stoker's Dracula
dracula = """Poor fellow! He looked desperately sad and broken; even his stalwart
manhood seemed to have shrunk somewhat under the strain of his
much-tried emotions. He had, I knew, been very genuinely and devotedly
attached to his father; and to lose him, and at such a time, was a
bitter blow to him. With me he was warm as ever, and to Van Helsing he
was sweetly courteous; but I could not help seeing that there was some
constraint with him. The Professor noticed it, too, and motioned me to
bring him upstairs. I did so, and left him at the door of the room, as I
felt he would like to be quite alone with her, but he took my arm and
led me in, saying huskily:--

"You loved her too, old fellow; she told me all about it, and there was
no friend had a closer place in her heart than you. I don't know how to
thank you for all you have done for her. I can't think yet...."""

d_docs = [dct.to_count_vec(d) for d in split_by_paragraphs(dracula)]

In [17]:
clf.predict_proba(d_docs)  # these are the probabilities for both labels: 0=frankenstein

array([[0.06541074, 0.93458926],
       [0.14465824, 0.85534176]])

This is only a sample of two, but we have a strong probability for both paragraphs of being labelled as vampire... Apparently, the bag-of-words approach works well.

In [18]:
import pickle
complete_path = "cloud-run-app/models/literary_monsters.pkl"
with open(complete_path, 'wb') as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)
complete_path = "cloud-run-app/models/text_vectorizer.pkl"
with open(complete_path, 'wb') as f:
    pickle.dump(dct, f, protocol=pickle.HIGHEST_PROTOCOL)

## Activity: Predict a vampire
Paste the following URL into your browser and add your own sentence/paragraph after the **?x=** to return a prediction. E.g.:

[gcp-cloud-run](https://nlp-demo.fraign.dev/api/vampire?x=i%20remained%20motionless.%20the%20thunder%20ceased;%20but%20the%20rain%20still%20continued)

```https://nlp-demo.fraign.dev/api/vampire?x=```

This is a Google Cloud Run service, which is basically a cloud-hosted Docker container. The code to run it locally is in the repo.

# Beget a vampire: A generative neural network
Let's now build a simple recurrent neural network with TensorFlow & Keras. We'll train this network to predict the next individual character/token from the previous characters in the sequence.

For our simple model, let's follow the TensorFlow [introduction to text generation](https://www.tensorflow.org/tutorials/text/text_generation) tutorial.

In [19]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import numpy as np
import os
"""
If using a GPU you often have to set the memory allocation. Without setting "growth",
all GPU memory is allocated automatically, which can cause it to fall over...
"""
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
    
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
print(tf.__version__)

1 Physical GPUs, 1 Logical GPUs
2.6.0


In [20]:
with open('carmilla.txt', 'r') as f:
    carmilla = f.read()
    
print(carmilla[:1000])

CARMILLA

J. Sheridan LeFanu

1872



PROLOGUE

_Upon a paper attached to the Narrative which follows, Doctor Hesselius
has written a rather elaborate note, which he accompanies with a
reference to his Essay on the strange subject which the MS. illuminates.

This mysterious subject he treats, in that Essay, with his usual
learning and acumen, and with remarkable directness and condensation. It
will form but one volume of the series of that extraordinary man's
collected papers.

As I publish the case, in this volume, simply to interest the "laity," I
shall forestall the intelligent lady, who relates it, in nothing; and
after due consideration, I have determined, therefore, to abstain from
presenting any précis of the learned Doctor's reasoning, or extract from
his statement on a subject which he describes as "involving, not
improbably, some of the profoundest arcana of our dual existence, and
its intermediates."

I was anxious on discovering this paper, to reopen the correspondence
comm

In this case the tutorial parses and makes predictions at the character level. For English, this limits our dictionary/vocab to the 27 letters (& is [and-per-se-and](https://en.wikipedia.org/wiki/Ampersand)) in upper and lower case, plus punctuation, \n, and so on. So our NN has a small number of possible outputs. For generalized multilingual support, this approach does not work so elegantly.

In [21]:
from collections import Counter
print(Counter(carmilla))

Counter({' ': 25391, 'e': 15246, 'a': 10127, 't': 9989, 'o': 8600, 'n': 8176, 'i': 7467, 's': 7323, 'r': 7228, 'h': 7072, 'd': 5699, 'l': 5105, 'u': 3504, 'm': 3294, '\n': 3235, 'c': 2981, 'w': 2797, 'y': 2659, ',': 2603, 'f': 2407, 'g': 2328, 'p': 1910, 'b': 1462, '.': 1287, 'v': 1202, 'I': 1041, 'k': 865, '"': 769, ';': 290, 'T': 229, '-': 213, "'": 206, 'x': 202, 'M': 177, 'H': 136, 'S': 135, 'q': 129, 'C': 116, '?': 110, 'A': 107, 'W': 98, 'j': 96, 'B': 85, 'G': 79, '_': 65, 'Y': 59, 'z': 49, 'N': 48, '!': 42, 'P': 37, 'D': 36, 'L': 31, ':': 30, 'K': 26, 'O': 23, 'E': 19, 'V': 19, 'F': 17, 'R': 13, 'X': 8, 'U': 7, 'J': 5, '1': 3, '8': 3, '6': 2, '9': 2, '7': 1, '2': 1, 'é': 1, 'Q': 1})


In [22]:
import re
carmilla = carmilla[49:] # shift to start
carmilla = carmilla.replace('é', 'e')
carmilla, _ = re.subn("[0-9]", "Z", carmilla)
carmilla = re.sub(r"(?<=[\w\d])\n(?=[\w\d])", " ", carmilla)  # raw string avoids invalid-escape warnings
print(carmilla[:1000])

Upon a paper attached to the Narrative which follows, Doctor Hesselius has written a rather elaborate note, which he accompanies with a reference to his Essay on the strange subject which the MS. illuminates.

This mysterious subject he treats, in that Essay, with his usual learning and acumen, and with remarkable directness and condensation. It will form but one volume of the series of that extraordinary man's collected papers.

As I publish the case, in this volume, simply to interest the "laity," I shall forestall the intelligent lady, who relates it, in nothing; and after due consideration, I have determined, therefore, to abstain from presenting any precis of the learned Doctor's reasoning, or extract from his statement on a subject which he describes as "involving, not improbably, some of the profoundest arcana of our dual existence, and its intermediates."

I was anxious on discovering this paper, to reopen the correspondence commenced by Doctor Hesselius, so many years before, 

In [23]:
vocab = sorted(set(carmilla))
print('{} unique characters: {}'.format(len(vocab), vocab))

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in carmilla])

64 unique characters: ['\n', ' ', '!', '"', "'", ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Here we construct a data pipeline that our RNN can consume. The corpus is cut into consecutive character sequences of length 100 (plus one extra character for the target), and these sequences are shuffled before batching. We define the number of examples per epoch so that we have an idea of when we have run over the complete corpus about once.

In [24]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(carmilla)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [25]:
# here the drop_remainder ensures that all batches are the same length
# the +1 is to handle input/output. See below.
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

The model input is a sequence of 100 characters and the output is the character following each character of the input sequence.
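The chunk-and-shift idea can be seen on a short string without TensorFlow. This is a plain-Python sketch of the same logic, with `make_examples` as a hypothetical helper and a tiny `seq_length` chosen for readability.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length + 1 and split each into (input, target)."""
    step = seq_length + 1
    n_chunks = len(text) // step  # the incomplete final chunk is dropped
    chunks = [text[i * step:(i + 1) * step] for i in range(n_chunks)]
    return [(c[:-1], c[1:]) for c in chunks]

examples = make_examples("Carmilla was here", seq_length=4)
for inp, tgt in examples:
    print(repr(inp), "->", repr(tgt))
# 'Carm' -> 'armi'
# 'lla ' -> 'la w'
# 'as h' -> 's he'
```

Each target is the input shifted by one character, so at every position the model is asked to predict the character that comes next.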

In [26]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

The batch and buffer sizes are determined by the memory available to iterate over the model. Generally, start at a small size to avoid trying to push 10GB to a GPU with only 6GB of memory... TensorFlow errors can be hard to debug.
Since the PCIe can be a bottleneck, it is likely quicker to use most of the GPU memory.

In [27]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

### The model itself
Unless you are running this model on a GPU, it is best to stick with the pretrained model.

This is a simple sequential model that forms a single pipeline. You can add more layers or change any of those layers to see if you can obtain a more effective or efficient model.

In [28]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024

In [29]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size, 
                          name="t_out")
  ])
  return model

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

In [30]:
model = build_model(
  vocab_size=len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [31]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           16384     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
t_out (Dense)                (64, None, 64)            65600     
Total params: 4,020,288
Trainable params: 4,020,288
Non-trainable params: 0
_________________________________________________________________
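The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the Keras GRU default in TensorFlow 2 (`reset_after=True`), which keeps separate input and recurrent bias vectors for each of its three weight blocks.

```python
vocab_size, embedding_dim, rnn_units = 64, 256, 1024

# One embedding vector per character in the vocabulary
embedding_params = vocab_size * embedding_dim

# Three blocks (update gate, reset gate, candidate state), each with
# input weights, recurrent weights, and two bias vectors
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense output layer: weights plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params)  # 16384 3938304 65600
print(total_params)  # 4020288
```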


In [32]:
model.compile(optimizer='adam', loss=loss)

In [33]:
# Directory where the checkpoints will be saved
checkpoint_dir = 'storage/training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_freq=10,  # an integer saves every N batches; 'epoch' saves once per epoch
    save_weights_only=True)

In [34]:
# NOTE THAT THIS WILL TAKE A LOOOOOOOOOOOOOOOOOOOOOOOOONG TIME ON CPU
# ~90 seconds per epoch
# Nvidia 1660 ~2s
# filesize is around 50MB for this model.

history = model.fit(dataset, epochs=200, callbacks=[checkpoint_callback])

Epoch 1/200
Epoch 2/200
Epoch 3/200
...

In [41]:
def generate_text(_model, start_string, temperature = 0.001):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  ######################################################################
  # Note this idea of "temperature" found in the tutorial is misleading. 
  # We are re-weighting the multinomial distribution, such that the most 
  # likely character is almost certainly the chosen output.
  ######################################################################

  # *Low temperatures* result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.

  # Here batch size == 1
  period_count = 0
  _model.reset_states()
  for i in range(num_generate):
      predictions = _model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])
    
      # We're stopping early here if there are several periods just to speed up our prediction
      if predicted_id == char2idx["."]:
          period_count += 1
      if period_count >= 5:
          break

  return (start_string + ''.join(text_generated))
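To see what dividing the logits by a temperature does, here is a small standalone sketch using only the standard library. It converts made-up logits into probabilities at a few temperatures; `softmax_with_temperature` is a hypothetical helper, and sampling from a categorical distribution over scaled logits is assumed to be equivalent to sampling from these probabilities.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate characters
for t in (1.0, 0.5, 0.001):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the default temperature of 0.001 nearly all of the probability falls on the highest-scoring character, so generation is close to deterministic. Raising the temperature spreads probability onto the alternatives.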

In [42]:
# In this case the model is relatively small, so we can load another into memory safely.
modelp = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)  # note the batch size
modelp.load_weights("storage/training_checkpoints/ckpt_1")
print(generate_text(modelp, start_string=u"wild nonsense"))

wild nonsense th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th th

In [43]:
modelp = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)  # note the batch size
modelp.load_weights("storage/training_checkpoints/ckpt_10")
print(generate_text(modelp, start_string=u"wild nonsense"))

wild nonsensed the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the said the 

In [44]:
modelp = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)  # note the batch size
modelp.load_weights("storage/training_checkpoints/ckpt_200")
print(generate_text(modelp, start_string=u"wild nonsense"))

wild nonsense apart, deep into my breast. I waked with a scream. The room was lighted by the candle that burnt the men to force the lock. They did so, and we stood, holding our lights aloft, in the doorway, and so that in a little while its site was quite forgotten."

"Can you point out where it stone, and accompanied her to the window.


In [49]:
print(generate_text(modelp, start_string=u"wild nonsense", temperature=0.5))

wild nonsense of strangulation begins?"

"Yes," I answered.

"And--recollect as well as you can--the same person who long ago was called Mircalla, Countess Karnstein. Depapa, but for two opposite reasons. At one time I thought he would laugh at my story, and I could not even point my inquiry, attested the marvelous fact that there was no sign visible that any such thing had happened to me.

The housekeeper whispered to the nurse: "Lay your hand along that hollow her face,' I said; 'and she could not know the shriek at a pace that was perfectly frightful, swerved so as to bring the wheel over the projecting roots, or some other malady, as they often do, he said, knocks at the door, and I really thought, for some seconds,
I saw a dark figure near the chimney-piece, but I felt us so like the night you came to us," I said.


In [45]:
print(generate_text(modelp, start_string=u"fascism"))

fascismall livid mark which all concurred in describing as that induced by the demon's lips,
and every symptoms that more than reconcing with manifest delight.
"My dear Baron, how happy I am to see you, I had no hope of meet this evening."

And then they repeated their directions to me and to Madame, you will be so good as not to let Miss Laura be alone for one moment. That is the only direct to herself, her mother, her history,
everything in fact connected with her life, plans, and people to sleep in their coffins, they exhibit all the symptoms that more than reconcing with many swinghis panic to the rest, and after a plunge or two, the whole team broke into a wild gallop together, and I must confess the refined and beautiful face white fleets of water lilies.

Over all this the schloss shows its many-windowed front; its towed, caught him in her tiny grasp by the wrist.


In [46]:
print(generate_text(modelp, start_string=u"she fell"))

she fell moonlight with you."

"How do you feel now, dear Carmilla? Are you really better?" I asked.

I was very near the turning point from which began the doctor had been broaching, but I think I guess it now.



V

_A Wonderful Likeness_

The first thing I recollect after, is Madame standing at the foot of the bed, a little at the dreams. I used to think that evil spirits made dreams, but our doctor told me it is no such things.


## Activity: Carmilla Copilot

Paste the following URL into your browser and add your own sentence/phrase after the ?x= to continue the text in the style of Carmilla. E.g.:

https://nlp-demo.fraign.dev/api/carmilla?x=

This is a Google Cloud Run service, which is basically a cloud-hosted Docker container. The code to run it locally is in the repo.

## Useful references
- [A fun podcast about English words and their origins](http://www.lexitecture.com/)
- [A wonderfully informative review of recurrent neural networks](https://arxiv.org/pdf/1506.00019)
- [Huggingface](https://huggingface.co/)
- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- [Natural language toolkit](https://www.nltk.org/)