# Markov Chain Text Generator

How would you create a program that can write a poem that reads like Shakespeare? This is actually an interview question from Khan Academy for a summer internship... You might consider using Machine Learning, Deep Learning, and all those fancy algorithms and techniques, since writing seems like a highly creative and unpredictable process, and you might assume that the program has to be trained very well in order to sound like Shakespeare.

However, what if I told you that the program can be extremely simple, that you can implement it within 30 minutes, and that all the resources you need to teach it Shakespeare's style are just a couple of his works? This tutorial will teach you to implement a text generator that creates new gibberish that reads almost like normal text. Moreover, depending on the resources you choose to "train" the algorithm, the text generator can mimic the style of a particular writer (e.g. Shakespeare).

### This program is based on the idea of Markov Chain,
which is a probabilistic model of the transitions between different states. Below is a diagram of a Markov Chain:

![title](markovdiag.png)

Given your current state, a Markov Chain randomly chooses the next state from the connected states specified by the graph. Moreover, different states have different probabilities of being chosen.

Take the above diagram as an example: if the current state is sunny, then the next state might turn rainy with 10% probability, stay sunny with 50% probability, or turn cloudy with 40% probability.
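
To make the sunny example concrete, here is a minimal sketch that samples the next state many times and checks that the observed frequencies come out close to the stated probabilities. Only the transitions out of "sunny" are given in the text, so only that row is modeled. The names `SUNNY_TRANSITIONS` and `next_state` are made up for this illustration, and the sketch assumes Python 3.6+ for `random.choices`.

```python
import random

# Transition probabilities out of the "sunny" state, taken from the example above.
# The rows for the other states are not given in the text, so they are left out.
SUNNY_TRANSITIONS = {"sunny": 0.5, "cloudy": 0.4, "rainy": 0.1}

def next_state(transitions, rng=random):
    """Pick the next state at random, weighted by its transition probability."""
    states = list(transitions.keys())
    weights = list(transitions.values())
    return rng.choices(states, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [next_state(SUNNY_TRANSITIONS, rng) for _ in range(10000)]

# Print the observed frequency of each next state
for state in SUNNY_TRANSITIONS:
    print(state, samples.count(state) / len(samples))
```

The text generator built below follows the same idea, except that a state is a tuple of words and the weights come implicitly from how often each next word appears in the source.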

### In general, Markov Chain will randomly choose the next state based on the current state.
Thus a Markov Chain can create new material based on recent history. In our case, each state is the last several words we have generated, and the transitions out of a state correspond to the words we could generate next.

Before we use the program to generate any text, we need to feed some text to the program so that it can construct all the possible states and the corresponding probabilistic transitions between them. In other words, the program first needs to build up the Markov chain by collecting statistics over some legitimate text.

### Helper functions

In [1]:
import random

First of all, as each state of our Markov Chain consists of words, we need a function that takes in a file name, reads the file, and then splits its contents into words.

In [2]:
def read_and_split(file_name):
    
    # open the file; the "with" block closes it once we are done
    with open(file_name) as stream:
        # read the whole file into a single string
        processed_file = stream.read()
    
    # split the string into a list of words (on any whitespace)
    words = processed_file.split()
    
    return words

Next, we need to build a Python dictionary that maps a tuple of the last several words seen to a list of words that can potentially come next. We will write a function that takes in a list of words (parameter source) and the number of words we should look back at to determine the next word (parameter tuple_size). Note that "tuple_size" must be consistent with the value passed to the function we will write later to generate the new text.

In [3]:
def build_dictionary(source, tuple_size):
    
    # initialize an empty dictionary
    dictionary = {}
    
    # get the length of the input "source"
    n = len(source)
    
    # iterate over every word in the list; for the first tuple_size words
    # the "history" wraps around to the end of the list
    for i in range(0, n):
        
        # determine the endpoints of the segment we use to build the key
        startIndex = i - tuple_size
        endIndex = i
        
        # fetch the segment
        if startIndex < 0:
            offset_front_index = (startIndex + n)
            first_part = source[offset_front_index : n]
            second_part = source[0 : endIndex]
            segment = first_part + second_part
        else:
            segment = source[startIndex : endIndex]
        
        # turn the segment list into a tuple so that it can act as a key
        key = tuple(segment)
        
        # the word that follows this key in the source
        current_word = source[i]
        
        if key in dictionary:
            # append the current_word at the end of the existing list
            dictionary[key].append(current_word)
        else:
            # create a new list with the current_word
            dictionary[key] = [current_word]
            
    return dictionary       
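
For comparison, here is a compact sketch of the same mapping, run on a tiny word list so the result is easy to inspect by hand. It is an illustrative alternative rather than a replacement for the function above: the name `build_dictionary_compact` is made up, it assumes `tuple_size` is no larger than the number of words, and it uses modulo indexing to reproduce the wrap-around at the start of the list.

```python
from collections import defaultdict

def build_dictionary_compact(source, tuple_size):
    """Map each tuple of `tuple_size` consecutive words to the words that follow it."""
    n = len(source)
    dictionary = defaultdict(list)
    for i in range(n):
        # Modulo indexing wraps around to the end of the list for the first few words
        key = tuple(source[(i - tuple_size + j) % n] for j in range(tuple_size))
        dictionary[key].append(source[i])
    return dict(dictionary)

words = "the cat sat on the mat".split()
result = build_dictionary_compact(words, 1)
print(result)
```

In this toy input, `('the',)` maps to `['cat', 'mat']` because "the" is followed by both words, and `('mat',)` maps to `['the']` because the last word wraps around to the first.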

### Let's test the functions above a little bit, 
and have a general idea about the direction we are proceeding:

In [4]:
preprocessed = read_and_split("test_prep.txt")
print(preprocessed)

['How', 'to', 'create', 'a', 'program', 'that', 'can', 'write', 'poem', 'that', 'reads', 'like', 'Shakespeare?', 'This', 'is', 'actually', 'an', 'interview', 'question', 'from', 'Khan', 'Academy', 'for', 'summer', 'internship...', 'You', 'may', 'consider', 'using', 'Machine', 'Learning,', 'Deep', 'Leaning', 'and', 'all', 'those', 'fancy', 'algorithms', 'and', 'techniques,', 'as', 'it', 'seems', 'a', 'highly', 'creative', 'and', 'unpredictable', 'process,', 'and', 'you', 'may', 'assume', 'that', 'the', 'program', 'has', 'to', 'be', 'trained', 'very', 'well', 'in', 'order', 'to', 'sound', 'like', 'Shakespeare.']


What does our Markov chain look like? Let's test the build_dictionary function! Recall that this function takes two arguments: the source text (as a list of words) and the number of words in each state. We will test our function on the same file with two different tuple_size parameters.

In [5]:
temp_result = build_dictionary(preprocessed, 1)
print(temp_result)

{('creative',): ['and'], ('the',): ['program'], ('internship...',): ['You'], ('you',): ['may'], ('in',): ['order'], ('This',): ['is'], ('from',): ['Khan'], ('may',): ['consider', 'assume'], ('like',): ['Shakespeare?', 'Shakespeare.'], ('sound',): ['like'], ('for',): ['summer'], ('to',): ['create', 'be', 'sound'], ('has',): ['to'], ('it',): ['seems'], ('program',): ['that', 'has'], ('using',): ['Machine'], ('algorithms',): ['and'], ('an',): ['interview'], ('You',): ['may'], ('summer',): ['internship...'], ('assume',): ['that'], ('very',): ['well'], ('Academy',): ['for'], ('Learning,',): ['Deep'], ('consider',): ['using'], ('Machine',): ['Learning,'], ('well',): ['in'], ('reads',): ['like'], ('interview',): ['question'], ('Deep',): ['Leaning'], ('actually',): ['an'], ('highly',): ['creative'], ('a',): ['program', 'highly'], ('trained',): ['very'], ('poem',): ['that'], ('Leaning',): ['and'], ('order',): ['to'], ('seems',): ['a'], ('process,',): ['and'], ('is',): ['actually'], ('write',): 

In [6]:
temp_result = build_dictionary(preprocessed, 2)
print(temp_result)

{('and', 'techniques,'): ['as'], ('Khan', 'Academy'): ['for'], ('unpredictable', 'process,'): ['and'], ('a', 'highly'): ['creative'], ('those', 'fancy'): ['algorithms'], ('to', 'create'): ['a'], ('program', 'that'): ['can'], ('for', 'summer'): ['internship...'], ('as', 'it'): ['seems'], ('like', 'Shakespeare.'): ['How'], ('algorithms', 'and'): ['techniques,'], ('you', 'may'): ['assume'], ('order', 'to'): ['sound'], ('may', 'assume'): ['that'], ('sound', 'like'): ['Shakespeare.'], ('highly', 'creative'): ['and'], ('write', 'poem'): ['that'], ('summer', 'internship...'): ['You'], ('be', 'trained'): ['very'], ('to', 'sound'): ['like'], ('Machine', 'Learning,'): ['Deep'], ('You', 'may'): ['consider'], ('may', 'consider'): ['using'], ('Shakespeare?', 'This'): ['is'], ('very', 'well'): ['in'], ('to', 'be'): ['trained'], ('How', 'to'): ['create'], ('is', 'actually'): ['an'], ('Leaning', 'and'): ['all'], ('trained', 'very'): ['well'], ('interview', 'question'): ['from'], ('well', 'in'): ['orde

### Do you notice the difference between 'build_dictionary(preprocessed, 1)' and 'build_dictionary(preprocessed, 2)'? 
Most entries have exactly one choice of word, because the file we are experimenting on is basically the first paragraph of this tutorial, so the resources and variation are very limited. However, we can still find some entries in 'build_dictionary(preprocessed, 1)' that have several choices (for instance, ('and',): ['all', 'techniques,', 'unpredictable', 'you'] and ('to',): ['create', 'be', 'sound']), whereas there isn't any entry with multiple choices of words in 'build_dictionary(preprocessed, 2)'.

The reason is that the more words each state consists of, the more specific each state becomes, and so the behavior of each state is more predictable. Since we want a random text generator, does that mean we should make the 'tuple_size' parameter as small as possible? We will discuss this tradeoff later, after we complete a functional Markov chain text generator.
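
One rough way to put a number on this effect is to count, for each state, how many distinct next words it allows, and then average over all states. The sketch below does that on a small made-up sentence. The helper `average_choices` and the sample text are assumptions for illustration, and for brevity it skips the wrap-around that build_dictionary performs.

```python
from collections import defaultdict

def average_choices(words, tuple_size):
    """Average number of distinct next words per state (no wrap-around, for brevity)."""
    followers = defaultdict(set)
    for i in range(tuple_size, len(words)):
        key = tuple(words[i - tuple_size:i])
        followers[key].add(words[i])
    total = sum(len(next_words) for next_words in followers.values())
    return total / float(len(followers))

# A made-up sample sentence with some repeated words
sample = ("the cat sat on the mat and the cat ran to the mat "
          "and the dog sat on the rug").split()

# The average shrinks toward 1 as each state covers more words
for size in (1, 2, 3):
    print(size, round(average_choices(sample, size), 3))
```

An average close to 1 means that almost every state has a single possible continuation, so the generator can do little more than replay its source.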

### Implement the text generator

Now that we have done all the preparation work, we can move on to the "magical" function that actually creates new text, based on the source text preprocessed by the two functions above.

Remember that we rely on a markov chain, so when we choose the next word, it only depends on the last several words we generated.

In [7]:
def generate_text(lib, tuple_size, length):
    
    # initialize an empty string
    result = ""
    
    # initialize a random number generator based on time
    random.seed()
    
    # first, choose the starting state
    # (wrap the keys in a list so they can be indexed on Python 3 as well)
    keys = list(lib.keys())
    # obtain a random 'key', that is, our initial state
    init = random.choice(keys)
    init = list(init)
    # add every string in the list to our result, separated by space
    for word in init:
        result = result + " "
        result = result + word
    
    # a counter keeps track of the length of already generated text
    count = tuple_size
    
    # maintain a queue keeping track of the last tuple_size many words we generated
    # this queue represents the state in our markov chain
    state = init
    
    # start generating the text
    while count < length:
        
        # fetch the choices of the next word based on the current state
        state_tuple = tuple(state)
        choices = lib[state_tuple]
        
        # randomly select the next word from the choices
        choices_length = len(choices)
        last_index = choices_length - 1
        choice = random.randint(0, last_index)
        next_word = choices[choice]
        
        # add the word into our result
        result = result + " " 
        result = result + next_word
        
        # maintain our state
        state.pop(0)
        state.append(next_word)
        
        # increment the counter
        count = count + 1
    
    return result
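
The queue that tracks the current state can also be expressed with `collections.deque`, which drops the oldest word automatically once it holds `tuple_size` items. The following is a sketch under assumptions, not a drop-in replacement: `generate_words` and the toy `lib` are made up for illustration, and unlike generate_text the result has no leading space.

```python
import random
from collections import deque

def generate_words(lib, tuple_size, length, rng=random):
    """Generate `length` words by walking the chain stored in `lib`."""
    # Pick a random starting state; maxlen keeps only the last tuple_size words
    state = deque(rng.choice(list(lib.keys())), maxlen=tuple_size)
    words = list(state)
    while len(words) < length:
        # Choose the next word uniformly from the list recorded for this state
        next_word = rng.choice(lib[tuple(state)])
        words.append(next_word)
        state.append(next_word)  # the oldest word falls out automatically
    return " ".join(words)

# A toy chain in which every state has at least one successor
lib = {("a",): ["b"], ("b",): ["c"], ("c",): ["a", "b"]}
text = generate_words(lib, 1, 8, random.Random(1))
print(text)
```

Because a word is stored once per occurrence in each list, choosing uniformly from the list already weights each next word by how often it follows the state in the source.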

### Wrap up everything
Before we test our final product, it is good practice to wrap our text generator in a class. Each instance of the class is initialized with a file and can generate different new texts according to the user's choice of length and tuple_size.

In [8]:
class MarkovChainTextGenerator:
    
    # constructor
    def __init__ (self, file_name):
        
        # save the file in the good format
        processed_file = read_and_split(file_name)
        self.source = processed_file
        
    def generate (self, tuple_size, output_length):
        
        # build up our markov chain
        library = build_dictionary(self.source, tuple_size)
        
        # generate the speech
        result = generate_text(library, tuple_size, output_length)
        
        return result
        

Fantastic! Now, we should test our generator on something REAL!!! We will use Shakespeare's actual sonnets to train our library.

### Test it out!

In [9]:
generator_from_sonnets = MarkovChainTextGenerator("sonnets.txt")

In [10]:
temp_result = generator_from_sonnets.generate(1, 100)
print(temp_result)

 bred Where all you alone kingdoms of year set, And sweets and poets can lend, And my heart's right gracious, And play the weary car, Like to this huge rondure hems. O' let your countenance fill'd his rank thoughts canst not, if never shaken; It suffers not every where! And take a gilded monuments Of my mind, thy mind; Those hours, that deep vermilion in youth convertest. Herein lives in my love's might. O, that I feel thou turn back the subject lends not press My life repair, Which husbandry in selling hours in my heart doth shadow doth come


In [11]:
temp_result = generator_from_sonnets.generate(2, 100)
print(temp_result)

 she lies, That she hath thee, is of my wailing chief, A loss in love loves not to tell me so; As testy sick men, when their deaths be near, No news but health from their physicians know; For if you read this line, remember not The hand that writ it; for I love you dearer: Yet then my eye doth feast And to the most of praise add something more; But that thou mayst in me is wanting, And so of you, As he takes thee hence. O, that you alone are you? In whose confine immured is the


If we compare the two generated texts above, the one with tuple_size 1 is clearly more random, but it reads more like gibberish. The one with tuple_size 2 reads more legit, but it has many segments that are exactly the same as Shakespeare's original works.

Therefore, a good text generator based on a Markov Chain should strike a balance between randomness and legitimacy, so the tuple_size parameter can't be too high or too low.

Now let's test it on something more interesting. Maybe our generator can write a play? 

In [12]:
generator_from_hamlet = MarkovChainTextGenerator("hamlet.txt")

In [13]:
temp_result = generator_from_hamlet.generate(1, 100)
print(temp_result)

 sleep: Thus bad but mutes or fortune's star, Their residence, both friend and helpful to this more violent property and bloody deed is eaten. A sister be free, Confound the minutes of actions fair thought thy face doth hourly grow great, the most grave, Who shall stand a state, The inward breaks, and the question: Whether in together; And call the Dane. Fran. I forbid my lord? Ham. How strangely? Clown. A little of fierce events, As may the matter if the stars with you- why the hobby-horse, whose huge spokes ten thousand souls and dole, Taken to double grace;


In [14]:
temp_result = generator_from_hamlet.generate(2, 100)
print(temp_result)

 me most To my sick soul (as sin's true nature is) Each toy seems Prologue to some great amiss. So full of threats to all- To you alone. Mar. Look with what courteous action It waves you to ravel all this matter out, That I essentially am not in their birth,- wherein they are actions that a man faithful and honourable. Pol. I did proceed? Hor. I think I saw him yesterday, or t'other day, Or then, or then, with such or such; and, as you please; But if you mouth it, as many of our neglected tribute. Haply the


Maybe a mixture of Shakespeare's different works? (sonnets + Macbeth + King Lear + Hamlet)

Let's implement a helper function that combines several source files into one single file that our generator can be initialized with.

In [15]:
def combine_files(file_list, output_file_name):
    with open(output_file_name, 'w') as outfile:
        # copy every input file, line by line, into the output file
        for file_name in file_list:
            with open(file_name) as infile:
                for line in infile:
                    outfile.write(line)

In [16]:
files = ["sonnets.txt", "hamlet.txt", "macbeth.txt", "KingLear.txt"]
output_file_name = "Mixed.txt"
combine_files(files, output_file_name)
generator_from_mixed = MarkovChainTextGenerator(output_file_name)

In [17]:
temp_result = generator_from_mixed.generate(1,100)
print(temp_result)

 indeed, it feed and on the scenes, set forth? OSWALD OSWALD Why, then, finding By his speech, but he loved mansionry, that fair name. So much of art; They were sent To offer you at my man's scope, Each toy in the noble having weigh'd it, And braggart with swift As 'Well, well, my friends, deserved at the riotous appetite. Down from thee, And that more worthier way of view in by your beauty shall be the sisters saluted me, true event, and sue a fast, Thence comes here? Follow me like a vice must not too rough night. EDMUND


In [18]:
temp_result = generator_from_mixed.generate(2,100)
print(temp_result)

 But that thou teachest how to make one twain, By praising him here by me, Do thou for him if I be so able as now. Lord. The Queen his mother shall uncharge the practice And call it winter, which being full of changes his age is; the observation we have shuffled off this mortal coil, Must give us bearing To tell us what Lord Hamlet said. We heard it all.- My lord, as will fill up The cistern of my mystery; you would drive me into him, I say. Servants bind him REGAN Hard, hard. O filthy traitor! GLOUCESTER


In [19]:
temp_result = generator_from_mixed.generate(3,100)
print(temp_result)

 may be, very like. Pol. Hath there been such a time- I would fain know that- That I have frequent been with unknown minds And given to time your own dear-purchased right That I have hoisted sail to all the world besides methinks are dead. Since I left you, mine eye is famish'd for a look, Or heart in love with sighs himself doth smother, With my love's picture then my eye doth feast And to the last bended their light on me. Pol. Come, go with me. Exeunt SCENE III. The British camp near Dover. Enter GLOUCESTER, and EDGAR


As we use more resources to train the Markov chain, we can use a bigger value for the tuple_size parameter, and so obtain more legitimate text with an acceptable degree of randomness.

### Further Thoughts
1. Once the generator starts working, the tuple_size is fixed. What if we changed it randomly throughout the process to improve the randomness?
2. When we train our Markov Chain, we don't take care of the unwanted effects of punctuation. That is, "me" and "me." and "me," are treated as three different things. We should be able to improve this by separating the words from the punctuation.
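
As a starting point for the second idea, here is a minimal sketch of a tokenizer that splits punctuation off into separate tokens, so that "me", "me." and "me," all yield the same word token. The function name and the regular expression are assumptions for illustration. The pattern keeps apostrophes inside words such as "heart's" and treats every other punctuation mark as its own token.

```python
import re

# A word (optionally with inner apostrophes, e.g. "heart's" or "fill'd"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def split_words_and_punctuation(text):
    """Split text into word tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = split_words_and_punctuation("Give it to me, not to me.")
print(tokens)
```

With tokens like these, the generated words would need to be joined back together without a space before punctuation, which is left out of this sketch.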

# Thank you so much for reading my tutorial!

Shangda (Harry) Li, andrewID: shangdal