# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.
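
To make the assumption concrete, here is a minimal sketch on a made-up toy sentence (the sentence and the variable names are illustrative only, not part of the corpus). It collects the words that directly follow "the" and turns their counts into an estimate of the next-word probabilities.

```python
from collections import Counter

# Toy sentence, invented purely for illustration
toy_text = "the cat sat on the mat and the cat slept"
toy_words = toy_text.split()

# Collect every word that directly follows "the"
followers = [nxt for cur, nxt in zip(toy_words, toy_words[1:]) if cur == "the"]

# Relative frequencies approximate P(next word | current word = "the")
counts = Counter(followers)
probs = {word: n / len(followers) for word, n in counts.items()}
print(probs)
```

Under the Markov assumption, nothing earlier in the sentence affects these probabilities; only the current word does.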

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're going to generate text in the style of an interview transcript, starting with Tucker Carlson's interview of Vladimir Putin. As a first step, let's extract that transcript from the corpus.

In [4]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus_interview.pkl')
data

Unnamed: 0,transcript,full_name
BERGMAN,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMOVIES \n\n\n\n\...,SCOTT PELLEY
CARLSON,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPOLITICS \n\n\n\...,JOHN GRAY
DENNETT,\n\n\n\n\n\n\n\n\n\n\nDan Dennett: Interview w...,TUCKER CARLSON
FALLACI,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHISTORY \n\n\n\n...,JON STEWART
GRAY,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBOOKS \n\n\n\n\n...,ORIANA FALLACI
Goodman,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMOVIES \n\n\n\n\...,Oliver Stone
HARARI,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCULTURE \n\n\n\n...,INGMAR BERGMAN
PELLEY,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMOVIES \n\n\n\n\...,Susan Goodman
STEWART,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCOMEDY \n\n\n\n\...,YUVAL NOAH HARARI
Stone,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCOMEDY \n\n\n\n\...,ELLEN


In [5]:
# Extract only Carlson's text
letter_text = data.transcript.loc['CARLSON']
letter_text[:200]

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPOLITICS \n\n\n\n\n\n\n\nTucker Carlson Interviews Vladimir Putin | Transcript \n\n\n\n\n\n\n\nFebruary 10, 2024 \n\n\n\n\n\n\n\n\t\t\tPutin's interview with Tucker Carlson: Discusses Russia-Ukraine history, crit"

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
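
As a sketch of the target format, the dictionary for a short made-up sentence could be built like this (the sentence and the name `toy_chain` are assumptions for illustration). Repeated followers are kept on purpose, so more frequent transitions are more likely to be drawn later.

```python
# Toy sentence, invented for illustration
toy_text = "I like tea and I like coffee"
toy_words = toy_text.split(' ')

# Map each word to the list of words that follow it, keeping duplicates
toy_chain = {}
for current_word, next_word in zip(toy_words, toy_words[1:]):
    toy_chain.setdefault(current_word, []).append(next_word)

print(toy_chain)
# {'I': ['like', 'like'], 'like': ['tea', 'coffee'], 'tea': ['and'], 'and': ['I']}
```

Note that the final word, "coffee", never becomes a key because nothing follows it.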

In [6]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
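
Splitting on a single space keeps newlines and tabs glued to the neighboring words, and the transcript preview above contains long runs of both. One possible alternative, which is not what this notebook uses, is to call `split()` with no argument so that any run of whitespace acts as a separator. The sample string below is made up for the comparison.

```python
# Made-up sample that mimics the whitespace seen in the transcript preview
sample = "POLITICS \n\n\nTucker Carlson Interviews\tVladimir  Putin"

# Splitting on a single space leaves newlines/tabs inside tokens
# and produces an empty token for the double space
by_space = sample.split(' ')

# Splitting on any whitespace run yields clean word tokens
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

The trade-off is that `split()` discards line breaks entirely, so generated text would no longer reproduce the transcript's line structure.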

In [7]:
# Create the dictionary for Carlson's transcript and take a look at it
letter_dict = markov_chain(letter_text)
letter_dict

{'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPOLITICS': ['\n\n\n\n\n\n\n\nTucker'],
 '\n\n\n\n\n\n\n\nTucker': ['Carlson'],
 'Carlson': ['Interviews', '[Introducing'],
 'Interviews': ['Vladimir'],
 'Vladimir': ['Putin', 'Putin', 'Putin,', 'Putin', 'himself', 'in', 'Putin'],
 'Putin': ['|',
  'provided',
  'expressed',
  'addressed',
  'hinted',
  'portrayed',
  'went',
  'believes',
  'provided',
  '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSHARE'],
 '|': ['Transcript'],
 'Transcript': ['\n\n\n\n\n\n\n\nFebruary'],
 '\n\n\n\n\n\n\n\nFebruary': ['10,'],
 '10,': ['2024'],
 '2024': ["\n\n\n\n\n\n\n\n\t\t\tPutin's"],
 "\n\n\n\n\n\n\n\n\t\t\tPutin's": ['interview'],
 'interview': ['with', 'with'],
 'with': ['Tucker',
  'Tucker',
  'Orthodox',
  'what',
  'the',
  'it',
  'that,',
  'Ukraine',
  'two',
  'its',
  'Poland.',
  'Poland',
  'Poland',
  'Russia.',
  'Poland.',
  'Poland',
  'Hitler\xa0—',
  'Hitler,',
  'East',
  'Hitler',
  'Hitler',
  'Hitler.',
  'Poland.5\nBy\xa0the\xa0way,',
  'people',
  'Ukrain

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [8]:
import random

def generate_sentence(chain, count):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # The last word of the corpus may have no followers; stop early in that case
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [9]:
generate_sentence(letter_dict, 100)

'Beneficial and\xa0safe. So Romania and a complex historical processes. The narrative correctly identifies the Ukrainianization. Their motive was 988. This was thanks to\xa0this that is a\xa0job.\nTechnically they turned out of\xa0the\xa0bottle. Moreover, I\xa0have told you the\xa0background, how to\xa0reverse the\xa0situation. We have completely out that territory was eight years as\xa0president, that in\xa0a\xa0conspiratorial manner, let’s just a\xa0statement of\xa0fact. We’re ready to\xa0sign it with the\xa0President?“ He was pursued against them. They do about my\xa0proposal to\xa0work together not accurate but they are fighting, so that you said anything new. Nevertheless, after 1991, when the\xa0Ukrainian soldiers should fight against you, I’ll tell you about.\nI\xa0was told it.'

### Assignment:
1. Generate sentences for the other speakers in the corpus as well.
2. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.
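
One possible take on the second item, offered as a sketch rather than a reference solution: stop as soon as the current word ends in sentence-final punctuation, and append a period only if the word limit is reached first. The function name `generate_until_punctuation`, the `max_words` parameter, and the toy chain are all made up for this illustration.

```python
import random

SENTENCE_END = ('.', '!', '?')

def generate_until_punctuation(chain, max_words):
    '''Walk the chain until a word ends a sentence or max_words is reached.'''
    word = random.choice(list(chain.keys()))
    words = [word.capitalize()]

    for _ in range(max_words - 1):
        # Stop once the current word already closes a sentence
        if word.endswith(SENTENCE_END):
            break
        # Stop at a dead end (a word with no recorded followers)
        followers = chain.get(word)
        if not followers:
            break
        word = random.choice(followers)
        words.append(word)

    sentence = ' '.join(words)
    # Only add a period when the walk did not end on punctuation
    if not sentence.endswith(SENTENCE_END):
        sentence += '.'
    return sentence

# Hypothetical toy chain for demonstration
toy_chain = {'we': ['like'], 'like': ['tea.', 'coffee!'],
             'tea.': ['we'], 'coffee!': ['we']}
print(generate_until_punctuation(toy_chain, 10))
```

This relies on the punctuation still being attached to the tokens, as it is in the first version of `markov_chain`.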

In [10]:
def create_dict(name):
    # Build the Markov chain dictionary for the transcript stored under this index label
    return markov_chain(data.transcript.loc[name])

In [13]:
dennett=create_dict('DENNETT')
generate_sentence(dennett,100)

'Political process of towering intellect who do the book which is resolve to you, look, 5 — and lands with other crazy supernatural stuff, in a Natural Phenomenon?”\nDaniel Dennett: No, no.\nBill Moyers: What do is we can show how we have to draw the dog that created all the hoped-for utopia, echoing feudal disparities with churches did for an awesome, blood-curdling and wrestling with a scientific process. That would catch people’s affections and learn the issue is the few years ago. And they have to say to be alive, to say thanks in a corpse. It’s still want to Sunday.'

In [15]:
fallaci=create_dict('FALLACI')
generate_sentence(fallaci, 250)

'Anyway if it and now finally you reconcile the point I can’t get nomi\xadnated twice become internationally fa\xadmous? It never been just war, so I had no illu\xadsions. He does power and fighting. Do Cao Ky. A legend was to you, I really like her?” He went to be clarified and Leonid Brezhnev will never bothered him of a little by the cease-fire?” Taken by a little of all so well at first. The relationship with him, I know what degree it’s more in Vietnam instead of procedure, of power, and nothing else, or a fellow who’s always alluring. Dr. Kissinger?\nH.K.: Well … No, I’m sure of his elegant office, full of procedure, of state, and courteous. Also often happens before holding its mother. Kissinger | by an hypothesis that Nixon with them merci\xadlessly.” And if I think that nations have great importance of being an hypothesis that I’d rather associate you can see it… Ah! No, I’m wondering if I became the fact that I won’t tell you want to their virility. I had always clear, at tha

In [16]:
generate_sentence(create_dict('FALLACI'),200)

'Days about him arrive out of Chou En-lai and left. And that Hanoi agrees to the great ability. Even though he had wriggled out a temptation of my work. We’re going back to study him the most powerful man and galloping toward him out of his relatives died in the second most important in Vietnam instead has dared you see them went on a clash of my work? Rather, you from a bad taste of Thieu, do you compare the moment of him.” This amazing, romantic and finally you what I swear that I suppose I had put a hearing, and now with China, and whatever I’ve always clear, at the troops, and diplomacy was a rock, or even threatened to Peking without a position toward a presi\xaddent who were to know, and Nixon announced reaching an hour spent with the Easter offensive was a powerful secretary of which Kissinger acted alone. Americans like to pay a demo\xadcratic Spain prepares for herself in the town, the mutual slaughter will be, certainly, if it embarrass me a fact, they certainly don’t feel re

In [18]:
dennett_text=data.transcript.loc['DENNETT']
dennett_text[:50]

'\n\n\n\n\n\n\n\n\n\n\nDan Dennett: Interview with Bill Moyers'

In [21]:
dennett_dict = markov_chain(dennett_text)

In [22]:
generate_sentence(dennett_dict,500)

'People who say there are very different or most natural phenomenon. Didn’t always adding, and thank goodness you mean take risks. They’re being with a Baptist or hate them, actually — I go, who’s there? Who’s there? Who’s there? And they afraid to look up and people are clerics in the country wasn’t a good for the music and said, oh, my coinage, although if young people of Islam. Think of the Muslim world, it’s a group of God has been treated so you’re being that don’t — I think they care? Why do the question is, religion, because you’ve been treated so that that I’m calling for, I can she see me as —\nBill Moyers: What did in memorial for General Motors, where there’s just fine. I don’t we do to try to God and just thought, this isn’t something that difference.\nDaniel Dennett: I don’t believe that religious people of the lore. But you want somebody else. And I think it is a very cut any religion has evolved over which was what’s good strategy, even think the job the theory as a crea

In [23]:
from collections import defaultdict
import string
def markov_chain(text):
    '''The input is a string of text and the output is a pair of dictionaries: one with each
       word as a key and the list of words that come after it as the value, and one with each
       punctuation mark as a key and the list of words it was attached to as the value.'''

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    punc_dict = defaultdict(list)

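    # Strip punctuation from each word and remember which mark it carried
    for i,word in enumerate(words):
        # Skip standalone punctuation (this test is also true for empty strings)
        if word in string.punctuation or word == '♪':
            continue
        elif word[-1] in string.punctuation:
            # Treat a trailing ellipsis as a single mark
            if word.endswith('...'):
                punc_dict['...'].append(word[:-3])
                words[i] = word[:-3]
            else:
                punc_dict[word[-1]].append(word[:-1])
                words[i] = word[:-1]
        elif word[0] in string.punctuation:
            punc_dict[word[0]].append(word[1:])
            words[i] = word[1:]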

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    punc_dict = dict(punc_dict)
    return m_dict, punc_dict
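
The function above strips a single trailing mark (or a trailing ellipsis) per word. As an alternative sketch, not the approach this notebook takes, `str.rstrip` can peel off any run of trailing ASCII punctuation in one step. The helper name `split_trailing_punctuation` is made up for this example.

```python
import string

def split_trailing_punctuation(token):
    '''Return (core, trailing) where trailing is the run of ASCII punctuation at the end.'''
    core = token.rstrip(string.punctuation)
    return core, token[len(core):]

print(split_trailing_punctuation('Poland.'))     # ('Poland', '.')
print(split_trailing_punctuation('thought...'))  # ('thought', '...')
print(split_trailing_punctuation('NATO'))        # ('NATO', '')
```

Typographic marks such as curly quotes are not in `string.punctuation`, so they would stay attached to the word under either approach.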

In [24]:
import random

def generate_sentence(chain,punc_dict,count):
    '''Input a dictionary in the format of key = current word, value = list of next words,
       a dictionary in the format of key = punctuation mark, value = list of words it was
       attached to, and the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # The last word of the corpus may have no followers; stop early in that case
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # If the last word is already punctuation, leave the sentence as is
    if word1 in string.punctuation or word1 == '♪':
        return sentence
    # Otherwise end with the first punctuation mark that was seen attached to the last word
    for punc in punc_dict:
        if word1 in punc_dict[punc]:
            sentence += punc
            return sentence
    return sentence

In [27]:
# Reload Carlson's transcript for the punctuation-aware chain
letter_text = data.transcript.loc['CARLSON']
letter_text[:200]

"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPOLITICS \n\n\n\n\n\n\n\nTucker Carlson Interviews Vladimir Putin | Transcript \n\n\n\n\n\n\n\nFebruary 10, 2024 \n\n\n\n\n\n\n\n\t\t\tPutin's interview with Tucker Carlson: Discusses Russia-Ukraine history, crit"

In [28]:
letter_dict, punc_dict=markov_chain(letter_text)

In [29]:
sentence= generate_sentence(letter_dict,punc_dict,500)

In [30]:
print(sentence)

The independent Ukraine should fight in Ukraine Peace talks.
Vladimir Putin I will finish the previous thought that I am talking about the start implementing his voters.
Tucker Carlson Discusses Russia-Ukraine history of the Rus’ cities.
Rus’ and the Kievan Rus’.
Baptism of Russian statehood and say something that territory were saying “Russians do want it was baptized Russia Make an agreement with the outgoing President Yanukovich came to Moscow with Hitler and his team I proposed that Ukrainians started There are in power in the West have been erected they never existed before.
Tucker Carlson I appreciate all the presidents that they simply led us not blow up the valve please don’t.“ What was a moment when the doors of NATO were referring to is Russian history criticizes NATO or CIA did not express concern over their Motherland they want assuming that I just don’t do In violation of international law but not misunderstanding what was right away and the West positing that No we were i