# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.

Markov chains don't generate text as well as deep learning, but it's a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript,total_words
Amos,the words of amos who was among the herdmen o...,38376
Chronicles,adam seth enosh kenan mahalalel jared enoc...,32707
Daniel,in the third year of the reign of jehoiakim k...,24312
Deuteronomy,these are the words which moses spoke unto al...,32656
Ecclesiastes,the words of koheleth the son of david king i...,28595
Esther,now it came to pass in the days of ahasuerust...,19046
Exodus,now these are the names of the sons of israel...,18975
Ezekiel,now it came to pass in the thirtieth year in ...,46058
EzraNehemiah,e now in the first year of cyrus king of pers...,48206
Genesis,in the beginning god created the heaven and t...,36932


In [21]:
# Extract only Ali Wong's text
proverbs_text = data.transcript.loc['Proverbs']
proverbs_text[:200]

' the proverbs of solomon the son of david king of israel to know wisdom and instruction to comprehend the words of understanding to receive the discipline of wisdom justice and right and equity to giv'

In [22]:
texts = {book:data.transcript.loc[book] for book in data.index}
texts['Proverbs'][:200]

' the proverbs of solomon the son of david king of israel to know wisdom and instruction to comprehend the words of understanding to receive the discipline of wisdom justice and right and equity to giv'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys

In [33]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
#     words = text.split(' ')
    words = text.split()
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [34]:
# Create the dictionary for Ali's routine, take a look at it
proverbs_dict = markov_chain(proverbs_text)
# proverbs_dict

In [35]:
dicts = {book:markov_chain(texts[book]) for book in data.index}

In [36]:
proverbs_dict == dicts['Proverbs']

True

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [39]:
import random
from nltk import word_tokenize, pos_tag

In [42]:
pos_tag(['bird'])[0][1]

'NN'

In [45]:
def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # Make sure sentence ends with a noun
    noun_ending = False
    word_lst = sentence.split()
    while noun_ending == False:
        
        last_word = word_lst[-1]
        word_lst = word_lst[:-1]
        
        noun_ending = pos_tag([last_word])[0][1]
        word_lst = word_lst[:-1]
    
    sentence = ' '.join(word_lst)
        
    # End it with a period
    sentence += '.'
    return(sentence)

In [46]:
for book in data.index:
    print('---')
    print(book + ':')
    for i in range(0,1):
        print(generate_sentence(dicts[book]))

---
Amos:
Ease in the lord hear ye that swear by a cart creaketh that speaketh.
---
Chronicles:
Ishpan and prostrated himself he walked in gilead whom the increase of judah and.
---
Daniel:
Gold its head with thee that they shall set my bed then answered and.
---
Deuteronomy:
Forgive o israel for thy house of the causes between whom thou shalt dwell.
---
Ecclesiastes:
Tell a wise man what profit hath no preeminence above all his neighbour this.
---
Esther:
Long as joined themselves together the words of the rest from among the jews.
---
Exodus:
Foot beside the land of her unto the sockets twenty boards two men and.
---
Ezekiel:
Solicited and sent them thus saith the time of thine eyes to all the.
---
EzraNehemiah:
Jahaziel and the sons of silver with me timber to pass when i gave.
---
Genesis:
Years old and the men into egypt for a few and that is therein.
---
Habakkuk:
Law and say woe to scatter me upon tables that saith to the sea.
---
Haggai:
Garment and how do ye clothe you that th

## Additional Exercises

1. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.