# A few simple corpus-driven approaches to narrative analysis and generation

By [Allison Parrish](http://www.decontextualize.com/)

This notebook is a fast introduction to a few techniques for working with narrative corpora. By "narrative corpora," I mean pre-existing bodies of text that mostly contain the texts of narratives. In particular, we're going to use Mark Riedl's [WikiPlots corpus](https://github.com/markriedl/WikiPlots), which has the titles and plot summaries of more than one hundred thousand movies, books, television shows and other media from Wikipedia.

The notebook takes you through using [spaCy](http://spacy.io) to extract words, noun chunks, parts of speech and entities from the text and then sew them back together with [Tracery](http://tracery.io). It then shows how to use [Markovify](https://github.com/jsvine/markovify) to create new narratives from existing narrative text, along with a quick example of recurrent neural network text generation with [textgenrnn](https://github.com/minimaxir/textgenrnn).

The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed. Even if the notebook itself doesn't end up being useful to you, hopefully it spurs a few ideas that you can take with you into your practice as a storyteller and/or programmer.

## Loading the corpus

The first step is to get the narrative corpus into the program. Because WikiPlots is so big, we're actually going to be working with a smaller subset: only the plot summaries for romantic comedy movies. The subcorpus was made using [this notebook on creating a subcorpus of WikiPlots](creating-a-wikiplots-subcorpus.ipynb), which you can consult if you want to make your own with a different subset of WikiPlots.

The corpus we're working with takes the form of a TSV file ("tab separated values"), with each line containing the title of the movie, a number indicating where in the plot summary the sentence for this line occurs, the total number of sentences in the summary, and the actual text of the sentence. The following cell loads the data into a list of dictionaries:

In [383]:
sentences = []
# each line: title <tab> sentence index <tab> total sentences <tab> sentence text
with open("romcom_plot_sentences.tsv") as infile:
    for line in infile:
        line = line.strip()
        items = line.split("\t")
        sentences.append(
            {'title': items[0],
             'index': int(items[1]),
             'total': int(items[2]),
             'text': items[3]})

Just to make sure it worked, we'll print out a random sentence:

In [384]:
import random

In [385]:
random.choice(sentences)

{'index': 15,
 'text': 'Throughout the first season, it is hinted that Jimmy expected the arrangement to be temporary and he sometimes wonders aloud when Edgar will move out.',
 'title': "You're the Worst",
 'total': 17}
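
Because every record carries its title and its position, you can put whole plot summaries back together when you need them. The cell below is a minimal sketch of that idea using made-up records in the same shape as `sentences`; the function name `group_by_title` and the toy data are my own inventions for illustration, and the sketch assumes that titles are unique (two different movies with the same title would be merged).

```python
from collections import defaultdict

# Toy records in the same shape as the `sentences` list built above.
# (The titles and text here are made up for illustration.)
toy_sentences = [
    {'title': 'Movie A', 'index': 1, 'total': 2, 'text': 'They fall in love.'},
    {'title': 'Movie B', 'index': 0, 'total': 1, 'text': 'A baker inherits a castle.'},
    {'title': 'Movie A', 'index': 0, 'total': 2, 'text': 'Two rivals meet.'},
]

def group_by_title(records):
    """Return a dict mapping each title to its sentences in plot order."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec['title']].append(rec)
    return {
        title: [r['text'] for r in sorted(recs, key=lambda r: r['index'])]
        for title, recs in grouped.items()
    }

plots = group_by_title(toy_sentences)
print(" ".join(plots['Movie A']))
```

To try it on the real data, pass `sentences` instead of `toy_sentences`.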

## Natural language processing

To get an idea of what's happening in the text of the plots, we can do a bit of Natural Language Processing. I cover just the bare essentials in this notebook. [Here's a more in-depth tutorial that I wrote](https://github.com/aparrish/rwet/blob/master/nlp-concepts-with-spacy.ipynb).

Most natural language processing is done with the aid of third-party libraries. We're going to use one called spaCy. To use spaCy, you first need to install it (i.e., download the code and put it in a place where Python can find it) and download the language model. (The language model contains statistical information about a particular language that makes it possible for spaCy to do things like parse sentences into their constituent parts.)

If you're using this notebook in Binder, then spaCy has already been installed for you! Otherwise, to install spaCy, [follow the instructions here](https://spacy.io/usage/). If you're using Anaconda, you'll need to open a Terminal window (or the equivalent on your operating system) and type:

    conda install -c conda-forge spacy

This line installs the library. You'll also need to download a language model. For that, type:

    python -m spacy download en_core_web_sm

(To work with a different language, use the name of a model for that language instead, if there's one available.) The language model contains the statistical information necessary to parse text into sentences and sentences into parts of speech. Depending on the model you choose, this download can be large, so it might take a while!

Once you've installed the library and downloaded the model, you should be able to load the model in the following cell:

In [386]:
import spacy
nlp = spacy.load('en_core_web_sm')

(This could also take a while–the model is potentially very large and your computer needs to load it from your hard drive and into memory. When you see a `[*]` next to a cell, that means that your computer is still working on executing the code in the cell.)

Right off the bat, the spaCy library gives us access to a number of interesting units of text:

* All of the sentences (`doc.sents`)
* All of the words (`doc`)
* All of the "named entities," like names of places, people, brands, etc. (`doc.ents`)
* All of the "noun chunks," i.e., nouns in the text plus surrounding matter like adjectives and articles (`doc.noun_chunks`)

In the cell below, we extract these into variables so we can play around with them a little bit. (Parsing sentences is hungry work and the following cell will take a while to execute.)

In [483]:
words = []
noun_chunks = []
entities = []
# only use 1000 sentences sampled at random by default; comment out this `for...`
# and uncomment the `for...` beneath to use every sentence in the corpus.
for i, sent in enumerate(random.sample(sentences, 1000)):
#for i, sent in enumerate(sentences):
    if i % 100 == 0:
        print(i, len(sentences))
    doc = nlp(sent['text'])
    words.extend([w for w in list(doc) if w.is_alpha])
    noun_chunks.extend(list(doc.noun_chunks))
    entities.extend(list(doc.ents))

0 22593
100 22593
200 22593
300 22593
400 22593
500 22593
600 22593
700 22593
800 22593
900 22593


Just to make sure it worked, print out ten random words:

In [484]:
for item in random.sample(words, 10):
    print(item.text)

the
he
bathroom
being
work
made
work
Kit
her
new


Ten random noun chunks:

In [389]:
for item in random.sample(noun_chunks, 10):
    print(item.text)

him
a nightmare
store manager
her own daughter
harp music
her
he
Kate
ndash
Sam Beacham


Ten random entities:

In [390]:
for item in random.sample(entities, 10):
    print(item.text)

Tom
Ted
Ramona
Scott
Chester Hooton
Andie
Distraught
Harold
Rodeo Drive
Mabel


### Grammatical roles

The parser included with spaCy can also give us information about the grammatical roles in the sentence. For example, the `.root.dep_` attribute of a noun chunk tells us whether that noun chunk is the subject of the sentence ("nsubj") or a direct object ("dobj") of the sentence. (See the "Universal Dependency Labels" of spaCy's [annotation specs](https://spacy.io/api/annotation) for more possible roles.) Using this information, we can make a list of sentence subjects and sentence objects:

In [391]:
subjects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'nsubj']
objects = [chunk for chunk in noun_chunks if chunk.root.dep_ == 'dobj']

In [392]:
random.sample(subjects, 10)

[he, he, Kasim, you, Chris, she, the best man, Max, Danielle, it]

In [393]:
random.sample(objects, 10)

[Kevin,
 genital herpes,
 the blame,
 a Christmas present,
 David,
 Murray,
 her,
 manners,
 what,
 the business]

### Parts of speech

The spaCy parser allows us to check what part of speech a word belongs to. In the cell below, we create four different lists (`nouns`, `verbs`, `adjs` and `advs`) that contain only words of the specified parts of speech. Using the `.tag_` attribute, we can easily get only particular forms of verbs; in this case, a fifth list, `past_tense_verbs`, holds just the verbs that are in the past tense. ([There's a full list of part of speech tags here](https://spacy.io/docs/usage/pos-tagging#pos-tagging-english).)

In [394]:
nouns = [w for w in words if w.pos_ == "NOUN"]
verbs = [w for w in words if w.pos_ == "VERB"]
past_tense_verbs = [w for w in words if w.tag_ == 'VBD']
adjs = [w for w in words if w.pos_ == "ADJ"]
advs = [w for w in words if w.pos_ == "ADV"]

And now we can print out a random sample of any of these:

In [395]:
for item in random.sample(nouns, 12): # change "nouns" to "verbs" or "adjs" or "advs" to sample from those lists!
    print(item.text)

love
relationship
insurance
daughter
adultery
who
events
accident
arms
security
mother
activities


### Entity types

The parser in spaCy not only identifies "entities" but also assigns them to a particular type. [See a full list of entity types here.](https://spacy.io/docs/usage/entity-recognition#entity-types) Using this information, the following cell builds lists of the people, locations, and times mentioned in the text:

In [396]:
people = [e for e in entities if e.label_ == "PERSON"]
locations = [e for e in entities if e.label_ == "LOC"]
times = [e for e in entities if e.label_ == "TIME"]

And then you can print out a random sample:

In [397]:
for item in random.sample(times, 12): # change "times" to "people" or "locations" to sample those lists
    print(item.text.strip())

the night
night
the night
the night
later that night
That night
late-night
the night
morning
one-night
one-night
night


### Finding the most common

We won't go too deep into text analysis in this tutorial, but it's useful to be able to do the most fundamental task in text analysis: finding the things that are most common. The code to do this task looks like the following, which gives us a way to look up how often any word occurs in the text:

In [398]:
from collections import Counter
word_count = Counter([w.text for w in words])

In [399]:
word_count['Meanwhile']

67

... and also tells us which words are most common:

In [400]:
word_count.most_common(12)

[('the', 4430),
 ('to', 4166),
 ('and', 3527),
 ('a', 2749),
 ('her', 2058),
 ('is', 1805),
 ('of', 1541),
 ('in', 1523),
 ('his', 1396),
 ('he', 1294),
 ('with', 1275),
 ('that', 1260)]

You can make a counter for any of the other lists we've worked with using the same syntax. Just make up a unique variable name on the left of the `=` sign and put the name of the list you want to count in the brackets to the right (replacing `words`). E.g., to find the most common people:

In [402]:
people_count = Counter([w.text for w in people])

In [403]:
people_count.most_common(12)

[('Joe', 81),
 ('Tom', 63),
 ('Steve', 59),
 ('Mary', 56),
 ('Sam', 55),
 ('Peter', 52),
 ('George', 50),
 ('Chris', 40),
 ('Jeff', 39),
 ('Nick', 37),
 ('Jack', 37),
 ('Michael', 36)]

The most common past-tense verbs:

In [404]:
vbd_count = Counter([w.text for w in past_tense_verbs])

In [406]:
vbd_count.most_common(12)

[('was', 190),
 ('had', 129),
 ('did', 38),
 ('were', 25),
 ('left', 19),
 ('got', 18),
 ('gave', 16),
 ('wanted', 14),
 ('married', 13),
 ('made', 13),
 ('died', 12),
 ('met', 12)]

### Writing to a file

The following cell defines a function for writing data from a `Counter` object to a file. The file is in "tab-separated values" format, which you can open using most spreadsheet programs. Execute it before you continue:

In [407]:
def save_counter_tsv(filename, counter, limit=1000):
    # write the `limit` most common items, one "key<TAB>count" row per line
    with open(filename, "w") as outfile:
        outfile.write("key\tvalue\n")
        for item, count in counter.most_common(limit):
            outfile.write(item.strip() + "\t" + str(count) + "\n")

Now, run the following cell. You'll end up with a file in the same directory as this notebook called `100_common_words.tsv` that has two columns, one for the words and one for their associated counts:

In [408]:
save_counter_tsv("100_common_words.tsv", word_count, 100)

Try opening this file in Excel or Google Sheets or Numbers!

If you want to write the data from another `Counter` object to a file:

* Change the filename to whatever you want (though you should probably keep the `.tsv` extension)
* Replace `word_count` with the name of any of the `Counter` objects we've made in this sheet
* Change the number to the number of rows you want to include in your spreadsheet.
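
If you'd like to check the file without leaving Python, you can read it back in with the standard library's `csv` module. The cell below is a self-contained sketch: it carries its own copy of `save_counter_tsv` (one that passes `limit` to `most_common()`, so the number really does cap the rows), and `load_counter_tsv`, the toy text and the temporary file name are all made up for illustration.

```python
import csv
import os
import tempfile
from collections import Counter

def save_counter_tsv(filename, counter, limit=1000):
    # Same idea as the function above; `limit` caps the number of rows written.
    with open(filename, "w", encoding="utf-8") as outfile:
        outfile.write("key\tvalue\n")
        for item, count in counter.most_common(limit):
            outfile.write(item.strip() + "\t" + str(count) + "\n")

def load_counter_tsv(filename):
    """Read a key/value TSV back into a list of (key, count) pairs."""
    with open(filename, newline="", encoding="utf-8") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        return [(row["key"], int(row["value"])) for row in reader]

# Toy counts, written to a throwaway directory so nothing real is overwritten.
demo_counts = Counter("the cat and the dog and the bird".split())
path = os.path.join(tempfile.mkdtemp(), "demo_counts.tsv")
save_counter_tsv(path, demo_counts, 2)
rows = load_counter_tsv(path)
print(rows)
```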

### When do things happen in this text?

Here's another example. Using the `times` entities, we can make a spreadsheet of how often particular "times" (durations, times of day, etc.) are mentioned in the text.

In [409]:
time_counter = Counter([e.text.lower().strip() for e in times])
save_counter_tsv("time_count.tsv", time_counter, 100)

Do the same thing, but with people:

In [410]:
people_counter = Counter([e.text.lower() for e in people])
save_counter_tsv("people_count.tsv", people_counter, 100)

### Generating stories from a corpus and Tracery grammars

Once you've isolated entities and parts of speech, you can recombine them in interesting ways. One is to use a Tracery grammar to write sentences that include the isolated parts. Because the parts have been labelled using spaCy, you can be reasonably sure that they'll fit into particular slots in the sentence. (I used a similar technique for my [Cheap Space Nine](https://twitter.com/cheapspacenine) bot.)

In [411]:
import tracery
from tracery.modifiers import base_english

In [412]:
rules = {
    "subject": [w.text for w in subjects],
    "object": [w.text for w in objects],
    "verb": [w.text for w in past_tense_verbs],
    "adj": [w.text for w in adjs],
    "people": [w.text for w in people],
    "loc": [w.text for w in locations],
    "time": [w.text for w in times],
    "origin": "#scene#\n\n[charA:#subject#][charB:#subject#][prop:#object#]#sentences#",
    "scene": "SCENE: #loc#, #time.lowercase#",
    "sentences": [
        "#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#",
        "#sentence#\n#sentence#\n#sentence#\n#sentence#"
    ],
    "sentence": [
        "#charA.capitalize# #verb# #prop#.",
        "#charB.capitalize# #verb# #prop#.",
        "#prop.capitalize# became #adj#.",
        "#charA.capitalize# and #charB# greeted each other.",
        "'Did you hear about #object.lowercase#?' said #charA#.",
        "'#object.capitalize# is #adj#,' said #charB#.",
        "#charA.capitalize# and #charB# #verb# #object#.",
        "#charA.capitalize# and #charB# looked at each other.",
        "#sentence#\n#sentence#"
    ]
}

In [413]:
grammar = tracery.Grammar(rules)
grammar.add_modifiers(base_english)

In [417]:
for i in range(3):
    print(grammar.flatten("#origin#"))
    print()

SCENE: Ozarks, the night

Dudley was the train.
'Did you hear about changes?' said Anabel.

SCENE: Europe, every morning

What and she looked at each other.
'Did you hear about commercial jingles?' said what.

SCENE: Vortex, a little of his

'Crawl is her,' said The chickens.
She and The chickens greeted each other.
She and The chickens lost the subject.
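
If you're curious what happens when a grammar is "flattened," the cell below is a rough sketch of the core idea in plain Python: find each `#symbol#`, pick one of its expansions at random, and keep expanding until no symbols are left. This is not Tracery's actual implementation. The `expand` function and the toy rules are hypothetical, and the sketch handles only the `.capitalize` modifier, not the `[charA:#subject#]` style of saved values used in the grammar above.

```python
import random
import re

def expand(rules, text, rng=random):
    """Recursively replace each #symbol# in `text` with a random expansion."""
    def replace(match):
        # "subject.capitalize" -> symbol "subject", modifier "capitalize"
        symbol, _, modifier = match.group(1).partition(".")
        options = rules[symbol]
        if isinstance(options, str):
            options = [options]
        result = expand(rules, rng.choice(options), rng)
        if modifier == "capitalize":
            result = result[:1].upper() + result[1:]
        return result
    return re.sub(r"#([^#]+)#", replace, text)

# Toy rules; the values are borrowed from the sample output earlier on.
toy_rules = {
    "origin": "#subject.capitalize# #verb# #object#.",
    "subject": ["the best man", "she"],
    "verb": ["met", "left"],
    "object": ["the blame", "a Christmas present"],
}
print(expand(toy_rules, "#origin#"))
```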



## Markov chain text generation

Another way to produce new narratives from existing narrative text is to find statistical patterns in the text itself and then make the computer create new text that follows those statistical patterns. Markov chain text generation has been a pastime of poets and programmers going back [all the way to 1983](https://www.jstor.org/stable/24969024), so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is [Markovify](https://github.com/jsvine/markovify), a Markov chain text generation library originally developed for BuzzFeed, apparently. Writing [code to implement a Markov chain generator](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) on your own is certainly possible, but Markovify comes with a lot of extra niceties that will make our lives easier.

To install Markovify on your computer, run the cell below. (You can skip this step if you're using this notebook in Binder.)

In [418]:
!pip install markovify

You are using pip version 9.0.3, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


And then run this cell to make the library available in your notebook:

In [419]:
import markovify

We need a list of strings to train the Markov generator. For now, let's just get all of the sentences from any movie in the corpus:

In [420]:
all_text = [item['text'] for item in sentences]

The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable `all_text_gen`.

In [421]:
all_text_gen = markovify.Text(all_text)

You can then call the `.make_sentence()` method to generate a sentence from the model:

In [429]:
print(all_text_gen.make_sentence())

Shirley Mae's troubles come to an end on the history and artworks of Paris.


The `.make_short_sentence()` method allows you to specify a maximum length for the generated sentence:

In [433]:
print(all_text_gen.make_short_sentence(50))

Kenneth carries her into the Italian Music Awards.


By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the `.make_sentence()` or `.make_short_sentence()` methods will return `None`, which means that in ten tries it wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence using the `tries` parameter:

In [437]:
print(all_text_gen.make_short_sentence(40, tries=100))

She reminds him of insincerity.


Or by disabling the check altogether with `test_output=False`:

In [445]:
print(all_text_gen.make_short_sentence(40, test_output=False))

Their talk is interrupted by Stifler.
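
If you'd rather keep the check and simply ask again whenever you get `None`, a small wrapper does the trick. The cell below is a sketch: `first_sentence` is a hypothetical helper of my own, not part of Markovify, and a stand-in function plays the role of the generator so that the cell runs on its own.

```python
def first_sentence(make, attempts=5, fallback=""):
    """Call `make()` up to `attempts` times; return the first non-None result."""
    for _ in range(attempts):
        sentence = make()
        if sentence is not None:
            return sentence
    return fallback

# Stand-in for a generator method that fails twice before succeeding.
# With Markovify you might pass something like
#   lambda: all_text_gen.make_short_sentence(40)
results = iter([None, None, "She reminds him of insincerity."])
print(first_sentence(lambda: next(results)))
```

The `fallback` value is what you get back if every attempt fails, which saves you from printing `None` in the middle of a story.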


### Changing the order

When you create the model, you can specify the order of the model using the `state_size` parameter. It defaults to 2. Let's make two models with different orders and compare:

In [446]:
gen_1 = markovify.Text(all_text, state_size=1)
gen_4 = markovify.Text(all_text, state_size=4)

In [447]:
print("order 1")
print(gen_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_4.make_sentence(test_output=False))

order 1
Even though the hope that the professors drive back to doctors.

order 4
Ken who is the law office receptionist, is smitten with her.


In general, the higher the order, the more the sentences will seem "coherent" (i.e., more closely resembling the source text). Lower order models will produce more variation. Deciding on the order is usually a matter of taste and trial-and-error.
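
To make "order" a little more concrete, here's a minimal word-level Markov chain in plain Python. It's a sketch for illustration, not how Markovify is written internally; `build_model`, `generate` and the toy sentence are all made up. The order is simply the number of preceding words the model looks at when choosing the next one.

```python
import random
from collections import defaultdict

def build_model(tokens, order):
    """Map each run of `order` tokens to the list of tokens seen right after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        model[state].append(tokens[i + order])
    return model

def generate(model, start, length, rng=random):
    """Extend `start` one token at a time until `length` tokens or a dead end."""
    out = list(start)
    order = len(start)
    while len(out) < length:
        followers = model.get(tuple(out[-order:]))
        if not followers:
            break  # this state never had a follower in the source
        out.append(rng.choice(followers))
    return out

toy = "she loves him but he loves her and she leaves him".split()
for order in (1, 2):
    model = build_model(toy, order)
    print(order, " ".join(generate(model, toy[:order], 8)))
```

In this tiny text every pair of words occurs only once, so the order-2 model can do nothing but replay the source, while the order-1 model has real choices to make (after "loves" it can pick "him" or "her"). The same trade-off shows up, less starkly, in a big corpus.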

### Changing the level

Markovify, by default, works with *words* as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:

In [448]:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

Any of the parameters you passed to `markovify.Text` you can also pass to `SentencesByChar`. The `state_size` parameter still controls the order of the model, but now the n-grams are characters, not words.

The following cell implements a character-level Markov text generator for the word "condescendences":

In [449]:
con_model = SentencesByChar("condescendences", state_size=2)

Execute the cell below to see the output. Because the model only ever looks at two characters of context, it tends to loop through the same few fragments of the word:

In [450]:
con_model.make_sentence()

'condencescencencencencescesces'
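
To see why the output loops like that, it helps to look at the transitions the model has to choose from. The cell below is a small sketch in plain Python (the function `char_transitions` is made up for this example and is not part of Markovify) that lists, for every pair of characters in "condescendences", the characters that follow that pair.

```python
from collections import defaultdict

def char_transitions(text, order):
    """Map each run of `order` characters to the characters that follow it."""
    table = defaultdict(list)
    for i in range(len(text) - order):
        table[text[i:i + order]].append(text[i + order])
    return dict(table)

table = char_transitions("condescendences", 2)
for state in sorted(table):
    print(state, "->", table[state])
```

Several pairs have more than one possible follower (after `de` comes either `s` or `n`, and after `ce` either `n` or `s`), and each of those choices leads back to a pair the model has already seen, which is how it ends up wandering around inside the word.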

Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from the plot sentences in `all_text`:

In [451]:
gen_char = SentencesByChar(all_text, state_size=7)

And the cell below prints out a random sentence from this generator:

In [452]:
print(gen_char.make_sentence(test_output=False))

Jeff realizes that she lost that they do, Lee carrying her himself off from a devastating the Hostess Room of the objects him.


### Thinking about structure

It's one thing to be able to produce one plausible sentence of a plot summary using Markov chains, but another to create a sense of overall structure between sentences, and generating narratives with these kinds of long-term dependencies is still an open problem in computational creativity. The approach I'm going to suggest below relies on the intuition that sentences in a plot summary share characteristics based on their position in the summary. First sentences will generally introduce characters and present an initial situation; last sentences will generally describe how the situation was resolved; and sentences in between will describe developing action.

Following this intuition, let's create *three different Markov chains*: one for beginning sentences, one for middle sentences, and one for final sentences. We can use the `index` of each sentence in our corpus to give us this information.

First, the beginnings are lines whose index is zero (i.e., they're the first sentence for this plot):

In [453]:
beginnings = [line['text'] for line in sentences if line['index'] == 0]

In [454]:
random.sample(beginnings, 5)

['Lucky Fritz was hit by a lightning strike twice.',
 'Judy (Swanson) and Nicholas Randall (Olivier) are a newly married couple who agree to a marriage based on "perfect understanding".',
 'Confirmed bachelor Jud Parker (Larry Parks) likes his life the way it is.',
 'Gary Starke (Garcia) is a New York City ticket scalper who, in the old traditional style of scalping "works the street", known as "the walk", as opposed to the plethora of modern-day ticket brokers who ply the Internet for sales.',
 'Renowned surgeon Sir Lancelot Spratt (James Robertson Justice) arranges a cruise for his patient, the famous television star Basil Beauchamp (Simon Dee).']

And endings are sentences that come last in the plot (i.e., their index is one less than the total number of sentences):

In [455]:
endings = [line['text'] for line in sentences if line['index'] == line['total'] - 1]

In [456]:
random.sample(endings, 5)

['Regrettably surveying the wreckage of his work, David soon accepts the destruction and chaos, gives in, and hugs and kisses Susan.',
 'Upon learning the news, Alfie drives to the crash site and cries over the wreckage.',
 "(Zak Orth) attempt to figure out why McKinley (Michael Ian Black) hasn't been with a woman, the reason being that McKinley is in love with Ben (Bradley Cooper), whom he marries in a ceremony in the lake; Victor (Ken Marino) attempts to lose his virginity with the resident loose-girl Abby (Marisa Ryan); and Susie (Amy Poehler) and Ben attempts to produce and choreograph the greatest talent show Camp Firewood has ever seen.",
 'But this simple transaction will bring problems after Uma gets pregnant, and the feelings that no one was waiting for appear, transforming the lives of each of the characters.',
 'Mickey and Ellen arrive at the restaurant together and re-tell Liz the story of their relationship.']

And "middles" are anything in between:

In [457]:
middles = [line['text'] for line in sentences if 0 < line['index'] < line['total'] - 1]

In [458]:
random.sample(middles, 5)

['Later, apparently on a spur of the moment, she lends Randy a copy of Walt Whitman’s Leaves of Grass, which Randy starts to devour.',
 'Serving with the 3rd Armored "Spearhead" Division in West Germany, McLean dreams of running his own nightclub when he leaves the army, but such dreams don\'t come cheap.',
 'When Maisie notices that they are being billed for twice as many parts (the plotters are building a second copy of the prototype), Barbara invites her to a Sunday social at an exclusive club, where she spikes her drink.',
 "The friendship is soon to become more, as Brendan appears unexpectedly late one night at Leo's door and sleeps with him; after which they become something of a couple, to the consternation of one man in their men's group, though it encourages another, Terry (Con O'Neill), to explore his sexuality.",
 'The relationship between Melvin and Carol remains complicated until Simon (whom Melvin has allowed to move in with him until he can fully heal from his injuries a

The following cell creates the models:

In [459]:
beginning_gen = markovify.Text(beginnings)
middle_gen = markovify.Text(middles)
ending_gen = markovify.Text(endings)

Now you can generate tiny narratives by producing a beginning sentence, a middle sentence, and an ending sentence:

In [474]:
print(beginning_gen.make_short_sentence(100))
print(middle_gen.make_short_sentence(100))
print(ending_gen.make_short_sentence(100))

The film opens with a soldier and nurse getting out of college and ready to propose to her.
Things get so complicated that the film leads to a children's playing area to the Royal Court.
Logan, however, decides to marry him.


The narratives still feel disconnected (and there are often jarring mismatches in pronoun antecedents), but the artifacts produced with this method do feel a bit narrative-like? Maybe?

### Combining models

Markovify has a handy feature that allows you to *combine* models, creating a new model that draws on probabilities from both of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then call `.combine()` to combine them.

The code below combines models for beginning sentences, middle sentences, and ending sentences into one model:

In [475]:
combo = markovify.combine([beginning_gen, middle_gen, ending_gen], [10, 1, 10])

The bit of code `[10, 1, 10]` controls the "weights" of the models, i.e., how much to emphasize the probabilities of any model. You can change this to suit your tastes. (E.g., if you want mostly beginnings with but a bit of middles and a *soupçon* of ends, try `[10, 2, 1]`.)

Then you can create sentences using the combined model:

In [480]:
print(combo.make_short_sentence(120))

Jeff realizes he loves her and the film closes.
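
One way to picture what the weights do: imagine each model as a table of counts, recording how often each word followed a given state, and imagine combining as multiplying every count by its model's weight and then adding the tables together. The cell below sketches that picture with toy numbers. I believe this is close to what Markovify does, but treat that as an assumption; `combine_counts` and the tables are hypothetical and made up for illustration.

```python
from collections import Counter

def combine_counts(tables, weights):
    """Add up follower counts from several tables, scaling each by its weight."""
    combined = {}
    for table, weight in zip(tables, weights):
        for state, followers in table.items():
            target = combined.setdefault(state, Counter())
            for token, count in followers.items():
                target[token] += count * weight
    return combined

# Toy follower counts for the state ("she",) in two hypothetical models.
beginnings_table = {("she",): Counter({"meets": 3, "leaves": 1})}
endings_table = {("she",): Counter({"leaves": 2, "marries": 2})}

combo_table = combine_counts([beginnings_table, endings_table], [10, 1])
total = sum(combo_table[("she",)].values())
for token, count in combo_table[("she",)].most_common():
    print(token, round(count / total, 2))
```

With weights of `[10, 1]`, the first table's preferences dominate, but words that only ever appeared in the second table still have a small chance of being picked.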


## Neural network text prediction with `textgenrnn`

Like a [Markov chain](ngrams-and-markov-chains.ipynb), a recurrent neural network (RNN) is a way to make predictions about what will come next in a sequence. For our purposes, the sequence in question is a sequence of characters, and the prediction we want to make is *which character will come next*. Both Markov models and recurrent neural networks do this by using statistical properties of text to make a *probability distribution* for what character will come next, given some information about what comes before. The two procedures work very differently internally, and we're not going to go into the gory details about implementation here. (But if you're interested in the gory details, [here's a good place to start](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).) For our purposes, the main *functional* difference between a Markov chain and a recurrent neural network is the *portion* of the sequence used to make the prediction. A Markov model uses a fixed window of history from the sequence, while an RNN (theoretically) uses the *entire history* of the sequence.

The primary benefit of an RNN over a Markov model for text generation is that an RNN takes into account *the entire history* of a sequence when generating the next character. This means that, for example, an RNN can theoretically learn how to close quotes and parentheses, which a Markov chain will never be able to reliably do (at least for pairs of quotes and parentheses longer than the n-gram of the Markov chain).

The drawback of RNNs is that they are *computationally expensive*, from both a processing and memory perspective. This is (again) a simplification, but internally, RNNs work by "squishing" information about the training data down into large matrices, and make predictions by performing calculations on these large matrices. That means that you need a lot of CPU and RAM to train an RNN, and the resulting models (when stored to disk) can be very large. Training an RNN also (usually) takes a lot of time.

Another consideration is the size of your corpus. Markov models will give interesting and useful results even for very small datasets, but RNNs require large amounts of data to train—the more data the better.

So what do you do if you *don't* have a very large corpus? Or if you don't have a lot of time to train on your corpus?

### RNN generation from pre-trained models

Fortunately for us, developer and data scientist [Max Woolf](https://github.com/minimaxir) has made a Python library called [textgenrnn](https://github.com/minimaxir/textgenrnn) that makes it really easy to experiment with RNN text generation. This library includes a model (according to the documentation) "trained on hundreds of thousands of text documents, from Reddit submissions (via BigQuery) and Facebook Pages (via my Facebook Page Post Scraper), from a very diverse variety of subreddits/Pages," and allows you to use this model as a starting point for your own training.

First, install textgenrnn with `pip`. (Again, you can skip this step if you're working with this notebook in Binder.)

In [32]:
!pip install --upgrade textgenrnn

Requirement already up-to-date: textgenrnn in /Users/allison/anaconda/lib/python3.6/site-packages
Requirement already up-to-date: h5py in /Users/allison/anaconda/lib/python3.6/site-packages (from textgenrnn)
Requirement already up-to-date: keras>=2.1.5 in /Users/allison/anaconda/lib/python3.6/site-packages (from textgenrnn)
Requirement already up-to-date: scikit-learn in /Users/allison/anaconda/lib/python3.6/site-packages (from textgenrnn)
Requirement already up-to-date: six in /Users/allison/anaconda/lib/python3.6/site-packages (from h5py->textgenrnn)
Requirement already up-to-date: numpy>=1.7 in /Users/allison/anaconda/lib/python3.6/site-packages (from h5py->textgenrnn)
Requirement already up-to-date: keras-applications>=1.0.6 in /Users/allison/anaconda/lib/python3.6/site-packages (from keras>=2.1.5->textgenrnn)
Requirement already up-to-date: scipy>=0.14 in /Users/allison/anaconda/lib/python3.6/site-packages (from keras>=2.1.5->textgenrnn)
Requirement already up-to-date: keras-prepr

Once it's installed, import the `textgenrnn` class from the package:

In [366]:
from textgenrnn import textgenrnn

Using TensorFlow backend.


And create a new `textgenrnn` object like so. (The `name` parameter controls the filename used when automatically saving the model to disk, so pick something descriptive!)

In [367]:
textgen = textgenrnn(name="all_text")

This object has a `.generate()` method which will, by default, generate text from the pre-trained model only.

In [369]:
textgen.generate()

A collection of a lady about a new month and can stop strings and I think it's higher than the current falls screen and they are a good movie for it.



To train a text generator on your own text, use the `.train_on_texts()` method, passing in a list of strings. The `num_epochs` parameter allows you to indicate how many epochs (i.e., passes over the data) should be performed. The more epochs the better, especially for shorter texts, but you'll get okay results even with just a few.

Training a neural network usually takes a really long time! So it makes sense to "try out" a text before committing to the many hours it might take to train the network on the full text. The following example trains the neural network on 100 randomly sampled lines from all plot sentences, which lets you get an idea of what the output will look like when training on its entire contents. You'll notice that the `train_on_texts()` function prints output as it goes, showing what the generated text is likely to look like.

In [264]:
textgen.train_on_texts(random.sample(all_text, 100), num_epochs=3)

Training on 11,723 character sequences.
Epoch 1/3
####################
Temperature: 0.2
####################
The confuses at the confuser and when he can the confuse too her but he convends the strik and she wants to get the suburber and she was all and the collecter and the car and she got his collectivition who has her and and when he can get the collecter and as a sheet of the story, and and where the

He will she was all he wants to and she is surfers to have the and the confuse too her and she wants to his better and party and she was an and he was an and where the competition when he confeds and she she is an and the confuser tools from the shot to the first confuse to her buy who as a sheet 

He who she was an and he wants to have an and and where he wills and the cannot play the game and as a sheer and and where he she was an and he was an are the sing and he was an and where and he continuees to his singing and she is surfreeling the story of the shoot and his wife and and he 
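
The "try it on a sample first" step can be wrapped in a small helper. This is only a sketch: `trial_subset` and `all_text_demo` are hypothetical names, and the demo list stands in for the notebook's `all_text`. It caps the sample size so that short corpora don't raise an error, and it accepts a seed so you can get the same trial subset twice.

```python
import random

def trial_subset(texts, k=100, seed=None):
    """Return up to k randomly chosen items from texts, without repeats."""
    rng = random.Random(seed)
    # random.sample raises ValueError if k exceeds the list length,
    # so cap k at the number of available texts.
    return rng.sample(texts, min(k, len(texts)))

# Stand-in for the notebook's all_text list (made-up data).
all_text_demo = ["Sentence number %d." % i for i in range(1000)]

subset = trial_subset(all_text_demo, k=100, seed=42)
print(len(subset), subset[:3])
```

You could then pass the subset to `.train_on_texts()` in place of `random.sample(all_text, 100)`.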

After training, you can generate new text using the `.generate()` method again:

In [370]:
textgen.generate()

[Review] India was to improve your student and she wants to find a hand and a good point



The results aren't very interesting because by default the generator is very conservative in how it samples from the probability distribution. You can use the `temperature` parameter to make the sampling a bit more likely to pick improbable outcomes. The higher the value, the weirder the results. The default is 0.2, and going above 1.0 is likely to produce unacceptably strange results:

In [371]:
textgen.generate(temperature=0.5)

I was really ever aren't to start a good piece of experience. I drink the game of the game.



In [372]:
textgen.generate(temperature=0.9)

fun-hearing in the pic, theories are one so what do you guys?



In [374]:
textgen.generate(temperature=1.5)

This Per Projects Undersave Sedul Stam Top Has Laweing For Everything:✋
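
To get a feel for what the temperature value does, here is a minimal sketch of a common way to apply temperature when sampling: take the log of each probability, divide by the temperature, and renormalize. I'm not claiming this is exactly what textgenrnn does internally. The function names and the next-character probabilities below are made up for illustration.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature value."""
    # Work in log space; the small epsilon avoids log(0).
    logits = [math.log(p + 1e-10) / temperature for p in probs]
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, temperature, rng=random):
    """Pick an index at random according to the rescaled distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Hypothetical next-character probabilities, made up for illustration.
next_char_probs = {"e": 0.6, "a": 0.25, "q": 0.1, "z": 0.05}
chars = list(next_char_probs)
probs = list(next_char_probs.values())

# Show how the distribution changes as the temperature rises.
for t in (0.2, 0.5, 1.0, 1.5):
    rescaled = apply_temperature(probs, t)
    print(t, [round(p, 3) for p in rescaled])

# Draw one character at a middling temperature.
print(chars[sample_index(probs, 0.5)])
```

At low temperatures nearly all of the probability piles onto the most likely character, which is why the output feels repetitive. At a temperature of 1.0 the distribution is left as it was, and above 1.0 the unlikely characters start getting picked much more often.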



If you pass a number `n` to the `.generate()` method as its first parameter, `.generate()` will print out `n` instances of text generation from the model. The code in the following cell prints out ten examples at the specified temperature:

In [269]:
textgen.generate(10, temperature=0.5)

I ametriced to spend the personal and storys insists with a charle and in a suburbling and song the girl who was preparinced to soon a speeches and to deadline to his father and casse to even his crosses at the same town and has a car throughout.

A her finally amare that he says a more is just and a grower and any resume.

Werning the cast of the punch and the singing and her telling in a friendly cast.

A his and he telling the producing and publicly to an anime professional.

Hail and he can arrad his and her finally asked to be a part tool in the stair on his party.

However, she let himself, I discovered his married and she can get a secrument and the can to leave to the girl who made her and goes a sold better, and manage and gets arriving to her and becomes that he should and eventually sees him to a hot and he undendinings that he sees their party party an

I finally seen to be a warn of his are he she wants to be a google and his restaurates to an aside to hit but are the care

(This may take a little while.)

When you're satisfied with the results and you're ready to train on all of the sentences, just pass `all_text` directly to `.train_on_texts()` instead of `random.sample(all_text, 100)`. (Given the size of the corpus we're working with in this example, the following will take a *long* time on most computers. You might consider using a machine with a GPU.)

In [None]:
textgen.train_on_texts(all_text, num_epochs=5)

The textgenrnn library automatically saves the model to disk after each epoch in the same directory as this notebook. You can load a model you've previously trained by passing its filename to the `textgenrnn` function:

In [270]:
textgen = textgenrnn("all_text_weights.hdf5")

And then you can call the `.generate()` method as normal, for example with `textgen.generate()`.

### Generating with shorter texts

I've found that `textgenrnn` works especially well with very short texts. For example, let's generate romantic comedy titles using the information in our corpus!

The code in the following cell makes a list of all of the titles:

In [375]:
titles = list(set([item['title'] for item in sentences]))

In [376]:
random.sample(titles, 5)

['How to Commit Marriage',
 'No More Ladies',
 'College Swing',
 'The Saturday Night Kid',
 'Ball of Fire']
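
The `set()` call matters because every sentence row in the corpus repeats the title of its movie. Here's a tiny sketch of the same idea on made-up rows: `sentences_demo` and its `"text"` key are stand-ins I'm assuming for illustration, not the notebook's actual data. Sorting the result is optional, but it gives you a stable order, which a plain `set` doesn't.

```python
# Stand-in rows shaped like the notebook's `sentences` list (made-up data;
# the "text" key is an assumption for illustration).
sentences_demo = [
    {"title": "Ball of Fire", "text": "First sentence."},
    {"title": "Ball of Fire", "text": "Second sentence."},
    {"title": "College Swing", "text": "First sentence."},
]

# A set comprehension keeps one copy of each title; sorted() fixes the order.
unique_titles = sorted({row["title"] for row in sentences_demo})
print(unique_titles)
```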

And create another textgenrnn object:

In [377]:
title_gen = textgenrnn(name="titles")

Now, train the RNN on these titles. One epoch will do the trick:

In [378]:
title_gen.train_on_texts(titles, num_epochs=1)

Training on 22,247 character sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
The Same Santa in Strip

The Pressions of the Night

The Marriers Sally Sally

####################
Temperature: 0.5
####################
Rio Girls

Pars in Stepa

The Onlinge Something

####################
Temperature: 1.0
####################
Tootthousmphit

E N HDDa HP

Mandie on a Liget andy



Now generate a list of new titles:

In [482]:
title_gen.generate(25, temperature=0.5)

Thing Lib and the Friend

Turing Purse

The Big Girls

The Booss Amanthean

New Stripper Handy findis

Tesla and Taying to Cananta

Comminch Travely

The Parisa and Teslan Marria

Water Story

Traveling Bill Marie

Seven Labur Nights

True I Day Canatt

Just Night San Broelly

The Barty Sall Counstry

The Trail fan

Setting Marry

Trearing Brollah Spectachube

Termand Promet Sally

The Deverot Mariean Sander

The Andy Only Benet

Three Rost Darlo

The Married Affanting

The Finderbody Paris

Sexen Broll Andiment

Skin Man Charrentes



## Further reading

* [This notebook from the creator of textgenrnn](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb) covers everything about the library that I covered in this tutorial—and much more, including how to start generation from a particular "seed" and how to save and load models (useful if you spent an afternoon training a model on your own corpus and don't want to have to do it again!)
* Take a look at [Janelle Shane's wonderful overview of how she uses RNNs in her process](http://aiweirdness.com/faq). And then take a look at her [wonderful creative work with RNNs](http://aiweirdness.com/).
* Hayes, Brian. “Computer recreations.” Scientific American, vol. 249, no. 5, 1983, pp. 18–31. JSTOR, http://www.jstor.org/stable/24969024. (Original column from Scientific American that described how Markov chain text generation works—very readable! I can send a PDF, hit me up.)
* [A Travesty Generator for Micros](https://elmcip.net/critical-writing/travesty-generator-micros) is a follow-up to Hayes' article that has some more theory and an actual Pascal listing (which is now mostly of only historical interest).
* [This notebook](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) shows how to implement a Markov chain generator from scratch in Python, if you're interested in such things!