Socrates and Aristotle Are Fighting Again
This is my entry for NaNoGenMo 2015.
Take a look at the generated text here!
Using a corpus of philosophical texts and a simple n-gram/Markov chain model, this program generates text intended to be recognizable as Socratic dialogue. Within the corpus I use, each line is considered a separate document. The corpus is not included in the repo out of respect for the authors' copyrights.
Overall, I'm somewhat happy with the output of this generator. It has the potential to make some very funny passages, but it often produces conversation that doesn't make much sense.
Beyond the basic Markov chain, which has been done many times before, I added a few interesting techniques, listed below.
Socrates and Aristotle each have their own n-gram tree, with some overlap, that they use to generate text.
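A minimal sketch of how two overlapping n-gram trees could be built; the function name, the bigram order, and the sample corpora are my own assumptions, not the repo's actual code.

```python
from collections import defaultdict

def build_ngram_tree(sentences, n=2):
    """Map each (n-1)-token context to the list of tokens that follow it."""
    tree = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            tree[context].append(tokens[i + n - 1])
    return tree

# Hypothetical corpora: each speaker gets his own texts plus a shared set,
# so the two trees overlap exactly where the shared texts do.
shared = ["virtue is knowledge"]
socrates_tree = build_ngram_tree(["what is justice"] + shared)
aristotle_tree = build_ngram_tree(["happiness is activity"] + shared)
```

Because `shared` feeds both trees, both speakers can continue "is" with "knowledge", while "happiness" remains exclusive to Aristotle.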
I split sentences before putting them into my Markov model and add special tokens at the start and end. This allows for starting from a semi-known state and ending in one of two states. Either the Markov model completes a sentence, or it runs out of matches and bails out early.
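The start/end markers and the two possible endings can be sketched like this; the marker strings and helper names are assumptions for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"

def build_tree(sentences):
    """Bigram tree over sentences wrapped in start/end markers."""
    tree = defaultdict(list)
    for s in sentences:
        tokens = [START] + s.split() + [END]
        for a, b in zip(tokens, tokens[1:]):
            tree[a].append(b)
    return tree

def generate(tree, max_len=50):
    """Walk from START; stop cleanly at END, or bail when no match exists."""
    out, cur = [], START
    while len(out) < max_len:
        followers = tree.get(cur)
        if not followers:          # ran out of matches: bail out early
            break
        cur = random.choice(followers)
        if cur == END:             # sentence completed
            break
        out.append(cur)
    return " ".join(out)
```

Starting from the `START` marker gives the semi-known state; the walk then either reaches `END` (a complete sentence) or hits a context with no followers (an early bail-out).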
During the model-building phase, sentences are classified as a question, a fact, or a declaration, and one of these types can also be requested during text generation. If a particular type cannot complete a phrase, the model takes the next token from any phrase type, so that text generation doesn't break when the corpus contains too few questions or facts.
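The typed lookup with fallback might look like the sketch below. The crude punctuation-based classifier and all names here are my assumptions; the repo's actual classification heuristics may differ.

```python
import random
from collections import defaultdict

def classify(sentence):
    """Crude illustrative classifier (the real heuristics are assumptions)."""
    if sentence.endswith("?"):
        return "question"
    if sentence.endswith("!"):
        return "declaration"
    return "fact"

def build_typed_tree(sentences):
    """context token -> sentence type -> follower tokens."""
    tree = defaultdict(lambda: defaultdict(list))
    for s in sentences:
        kind = classify(s)
        tokens = s.rstrip("?!.").split()
        for a, b in zip(tokens, tokens[1:]):
            tree[a][kind].append(b)
    return tree

def next_token(tree, context, wanted):
    """Prefer followers of the requested type, but fall back to any type."""
    by_type = tree.get(context, {})
    followers = by_type.get(wanted)
    if not followers:  # too few sentences of this type: use any type
        followers = [t for ts in by_type.values() for t in ts]
    return random.choice(followers) if followers else None
```

The fallback in `next_token` is what keeps generation alive when, say, a question is requested but the context only ever appeared in facts.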
Even a short citation like "(Socrates 2015)" contains four tokens, which can cause even more mid-sentence topic changes. To counteract this problem, citations are detected while building the corpus and replaced with a single special token, which lets us keep context from before a citation to after it. During text generation, each citation token is replaced with a reference to Aristotle or Socrates and a random year.
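One way to sketch the citation pass; the regex, the `<CITE>` token, and the year range are assumptions, not the project's exact pattern.

```python
import random
import re

CITE = "<CITE>"
# Matches parenthesized citations like "(Socrates 2015)". The real project's
# detection pattern is assumed, not copied.
CITE_RE = re.compile(r"\([A-Z][a-z]+ \d{4}\)")

def tokenize(sentence):
    """Collapse each citation to one token so context survives across it."""
    return CITE_RE.sub(CITE, sentence).split()

def render(tokens):
    """At generation time, expand each citation token into a fresh citation."""
    return " ".join(
        "({} {})".format(random.choice(["Socrates", "Aristotle"]),
                         random.randint(380, 2015))
        if t == CITE else t
        for t in tokens)
```

With the citation collapsed to one token, the chain sees "good `<CITE>` indeed" as three adjacent tokens instead of six, so context is preserved across the citation.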
The Socratic Method at work
As Socrates and Aristotle talk, Aristotle learns by adding whatever Socrates says into his own corpus, spontaneously discovering new information by learning from text generated by the Socrates corpus.
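The learning step amounts to folding a heard sentence back into the listener's tree; `merge_into` is a hypothetical helper name, shown here for a bigram tree.

```python
from collections import defaultdict

def merge_into(tree, sentence):
    """Add one heard sentence's bigrams into a speaker's tree in place."""
    tokens = sentence.split()
    for a, b in zip(tokens, tokens[1:]):
        tree[a].append(b)

# Illustration: Aristotle's tree starts without this Socratic phrase;
# after "hearing" the line, he can continue it himself.
aristotle = defaultdict(list)
merge_into(aristotle, "the unexamined life is not worth living")
```

Repeating this after every Socrates line is what lets phrases migrate from one speaker's model into the other's over the course of the dialogue.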
To make some of the unfortunate artifacts of the Markov chain model more palatable, I introduced stage directions.
If an actor repeats himself in conversation, it can seem unrealistic. Adding a stage direction that makes the repetition appear intentional makes it a bit less jarring.
Similarly, due to the overlap between the corpora and the learning method used, one actor may repeat the same phrase as the other actor. In this case a stage direction is added as well.
If an actor fails to generate a sentence to completion, and another actor responds, the stage direction "interrupting" is added.
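The three rules above can be sketched as a single selection function. Only "interrupting" is named in this README; the other direction strings, the function name, and its parameters are my assumptions.

```python
import random

def stage_direction(line, prev_own, prev_other, completed):
    """Pick a stage direction for a line, or None if nothing is needed.

    completed: whether the line's sentence generation ran to completion.
    prev_own / prev_other: the previous lines by this actor and the other.
    """
    if not completed:
        return "interrupting"          # the other actor cut in early
    if line == prev_own:
        # Hypothetical labels: make self-repetition look intentional.
        return random.choice(["insistently", "for emphasis"])
    if line == prev_other:
        # Hypothetical label: acknowledge echoing the other actor.
        return "echoing"
    return None
```

Returning `None` for the common case keeps stage directions rare, so they read as deliberate touches rather than constant scaffolding.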
There are a few other directions that are used only rarely, but I think are kind of fun!