# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the previous word, which is the defining assumption of a first-order Markov chain.
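
Written out, the assumption looks roughly like this, where $w_t$ stands for the word at position $t$ (notation introduced here for illustration, not taken from the notebook):

$$P(w_t \mid w_1, \dots, w_{t-1}) = P(w_t \mid w_{t-1})$$

In other words, everything before the previous word is ignored when choosing the next word.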

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're going to generate text in the style of TV episode transcripts, starting with the Letterkenny series finale, so as a first step, let's extract that transcript from the corpus.

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

| Unnamed: 0 | transcript | full_name |
|---|---|---|
| LETTERKENNY | Letterkenny concludes with its series finale, ... | LETTERKENNY – S12E06 – OVER AND OUT |
| MASTERS OF THE AIR | “Masters of the Air,” a 2024 American war dram... | MASTERS OF THE AIR – S01E01 – PART ONE |
| MONSIEUR SPADE | Monsieur Spade\nSeason 1 Episode 2\nEpisode Ti... | MONSIEUR SPADE – EPISODE 2 |
| SLOW HORSES | Episode Title: Footprints\nSeries: Slow Horses... | SLOW HORSES – S03E06 – FOOTPRINTS |
| TRUE DETECTIVE | True Detective\nSeason 4 Episode 3\nEpisode Ti... | TRUE DETECTIVE – S04E03 – PART 3 |


In [2]:
# Extract only the Letterkenny transcript
letter_text = data.transcript.loc['LETTERKENNY']
letter_text[:200]

'Letterkenny concludes with its series finale, Season 12 Episode 6, “Over and Out,” delivering a heartfelt and humorous farewell. The episode is filled with fun callbacks and a sense of contentment amo'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
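
To make the target structure concrete, here is a minimal sketch on a toy sentence. The sentence and the names `toy_text`, `toy_words`, and `toy_dict` are made up for illustration and are not part of the corpus.

```python
from collections import defaultdict

# Toy sentence, invented for illustration (not from the corpus)
toy_text = "the cat sat on the mat and the cat ran"

toy_words = toy_text.split(' ')
toy_dict = defaultdict(list)

# Pair every word with the word that follows it
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    toy_dict[current_word].append(next_word)

toy_dict = dict(toy_dict)
print(toy_dict)
```

Here `'the'` maps to `['cat', 'mat', 'cat']`: a follower is listed once for every time it occurs. The final word, `'ran'`, never becomes a key because nothing follows it.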

In [3]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [4]:
# Create the dictionary for the Letterkenny transcript and take a look at it
letter_dict = markov_chain(letter_text)
letter_dict

{'Letterkenny': ['concludes',
  'and',
  'universe.',
  'Legionnaires',
  'has',
  'consists'],
 'concludes': ['with'],
 'with': ['its',
  'fun',
  'its',
  'a',
  'a',
  'everyone',
  'emotional',
  'Dan',
  'pelicans?',
  'science.',
  'Peregrines.',
  'pelicans.',
  'no',
  'you',
  'that?',
  'a',
  'this',
  'his',
  'a'],
 'its': ['series', 'quirky', 'characters,', 'inhabitants', 'dick', 'pace.'],
 'series': ['finale,', 'ends,'],
 'finale,': ['Season'],
 'Season': ['12'],
 '12': ['Episode'],
 'Episode': ['6,', 'aired', '4'],
 '6,': ['“Over'],
 '“Over': ['and'],
 'and': ['Out,”',
  'humorous',
  'a',
  'unique',
  'bird',
  'a',
  'an',
  'eventually',
  'reminiscing.',
  'closes',
  'presumably',
  'fans,',
  'its',
  'the',
  'your',
  'ekspecially',
  'mouses',
  'all',
  'hicks',
  'Fun-Dip',
  'a',
  'a',
  'think',
  'the',
  'a',
  'a',
  'conquered',
  'rice,',
  'it',
  'a',
  'Break',
  'hockey',
  'as',
  'gentlemen,',
  'out.',
  'faces',
  'Navarro',
  'a',
  'Spade’s

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [7]:
import random

def generate_sentence(chain, count):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list of followers, then repeat
    for _ in range(count - 1):
        # The last word of the corpus may have no followers, so stop there
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence
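
Because each follower list keeps duplicates, `random.choice(chain[word1])` picks a next word in proportion to how often it followed the current word. The sketch below makes those implied probabilities explicit. The `followers` list is a hypothetical example shaped like one value of the chain dictionary.

```python
from collections import Counter

# Hypothetical follower list, shaped like one value of the chain dictionary
followers = ['its', 'fun', 'its', 'a', 'a', 'a']

counts = Counter(followers)
total = len(followers)

# Probability that random.choice(followers) returns each word
probabilities = {word: n / total for word, n in counts.items()}
print(probabilities)
```

Storing raw lists trades memory for simplicity: no separate counting step is needed before sampling.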

In [8]:
generate_sentence(letter_dict, 100)

'Gastro Industrial Complex. But, Stewart, dareth not doing it, Stewart! We respect you! I said, they’re cutting the… (both): This isn’t necessary. Were you dropped as there’s enough non-degens to pull its series ends, the word out, Stewart? Which is a lot together. We’ve done here, Glen. I was thinking, about 3500. Now that stork to our appreciation, we’ve decided to reads. I do you guys love letter to come, just one bell pepper. Yeah, I dareth you sure? Yes, Glen. Oh yeah. And secondly, Storks aren’t even know what you say one bits. You’re a bit? Not the spirit.'

### Assignment:
1. Generate sentences for the other transcripts in the corpus as well.
2. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.
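
One possible take on the second idea, as a rough sketch rather than the notebook's own solution: stop as soon as the chain produces a word that already ends in sentence punctuation. The function name `generate_until_punctuation`, the `end_marks` parameter, and `toy_chain` are all hypothetical names introduced here.

```python
import random

def generate_until_punctuation(chain, max_words, end_marks='.!?'):
    '''Hypothetical variant: stop at the first word ending in one of end_marks,
       or after max_words words, whichever comes first.'''
    marks = tuple(end_marks)
    word = random.choice(list(chain.keys()))
    words = [word.capitalize()]

    while len(words) < max_words and not word.endswith(marks):
        # Stop early if the current word has no recorded followers
        if word not in chain:
            break
        word = random.choice(chain[word])
        words.append(word)

    sentence = ' '.join(words)
    # Add a period only when the last word does not already end the sentence
    if not sentence.endswith(marks):
        sentence += '.'
    return sentence

# Toy chain, invented for illustration
toy_chain = {'hello': ['world.'], 'world.': ['hello']}
print(generate_until_punctuation(toy_chain, 10))
```

The `max_words` cap keeps the loop from running forever on a chain that never reaches an end mark.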

In [9]:
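def create_dict(name):
    # Build the Markov chain dictionary for the transcript with the given index label
    # (the parameter is called `name` so it doesn't shadow the built-in `str`)
    return markov_chain(data.transcript.loc[name])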

In [23]:
master=create_dict('MASTERS OF THE AIR')
generate_sentence(master,100)

'2024, demonstrating resilience and the northwest side of the fourth flap. Pilot to show up. Got a dime through our entire group. [siren blaring] [soldiers clamoring, shouting] Harder! [grunting, shouting] [pilot] Egan, I don’t know. [engine 1 stops] Flak incoming. Hold on, boys. See you tell you like one. Mmm. [Bucky] Hey, Veal. Veal, calm down. [Bowman] Thank you. [sighs] What’s going over there and says that way. Ooh, oh. Go to fly like that. Lucky bastard’s shipping out of electrical failure, so sweet. [chuckles] All six in England. Now, I think it’s our fuel catching up with you? Tommy!.'

In [24]:
detective=create_dict('TRUE DETECTIVE')
generate_sentence(detective, 250)

'Rink. Alright, there’s a lot. All the hospital. He’s agitated. (person wails) Liz: Yeah. Voices. Episodes. (light, tense music playing) Anders: (gurgly voice) Hello, sir. We’re here to start on his own eyes out? Peter: Yeah. Peter: Well, you’re under 30. You tell you ♪ ♪ ♪ Evangeline: I was nothin’ to silence Annie Kowtok: Yeah? Liz: Ask the Kowtok case. Hank: Danvers, tell me to– Kenny: No, he’s around, alright. Liz: Are you have you to say they’re dying. The place a real explanation for me something about your belly, lord, you’ve got the time. Evangeline: (sighs) Evangeline: Yeah, well, how can talk about. If I just, like, a good here. (“Sing Sing” by Georgina Birch playing over here. Liz: Then one bedtime story. It was the time. Her birthday. (sighs) just not. Liz: Oh, okay. That’s good. You’re under your sweater? That’s it? Liz: Ah, tomorrow’s no school records, no school records, no town. 50% of me? ♪ (screaming fades out) (white noise stops) Liz: Pete? What’s Danvers wants you w

In [25]:
generate_sentence(create_dict('TRUE DETECTIVE'),200)

'Way I’m sorry. Come on. Don’t you ever… get one? Susan: He hit her. Voice (whispers): Tell me mine. (“Limbo” by Georgina Birch playing over to start on us, so they’re not kidding. Vince: I cracked it. It’s here. ♪ ♪ Beast in your dad to do? Kenny: She was… not be calling day and Navarro here. Yeah, okay, you mean, they found him. Liz: Alright. Only one day she discovers her hair’s changing color. Here we have an 18-year-old girl. (chuckles) But she’s got there. Annie changed when they found these. Yeah, the deceased researchers raises questions, and Shout”) ♪ ♪ I was nothin’ to me. Qavvik: Your mother says hello. She’s awake. Anders: She got at least a murder-suicide. William Wheeler. He hit it a Oliver Tagaq. I follow you doing? This is about the fuckin’ math. Evangeline: Yeah. Liz: Wipe it all, do that, will not a hacker. Liz: When’s Clark’s stuff comin’ from me. Hank: Oh, well, move it means? Susan: I had made me mine. (“Limbo” by Marika Hackman playing) (sighs) Evangeline: Well… h

In [26]:
masters_text=data.transcript.loc['MASTERS OF THE AIR']
masters_text[:50]

'“Masters of the Air,” a 2024 American war drama mi'

In [28]:
master_dict = markov_chain(masters_text)

In [30]:
generate_sentence(master_dict,500)

'Go. Let’s go, Major. Is that initial point, it’s no, nay, never, no good. Start two. Start two. Start three. [Crosby] Flak, everywhere. May God rebuke him, sir. Ah-ten-hut. [colonel] Roger that. Is that should actually be touching down to sit on with me before I see. You don’t even like an angel today, and powdered before I thought that the channel. I mean, you still standing here then? [Buck] You don’t know where are you tomorrow then. Bye, Marge. Yeah? [Marge chuckling] [“Begin the wheel motor. [Brady] Roger. Zootsuit Two. Roger. Redmeat Lead. [Veal] I’m sorry. It’s my fault you sit behind a toast. Buck, give you Americans. You’re on, girl. Here you from? Negative. Oh, shit! We’ve lost engine problems. Working on flaps. [lieutenant] Nutting, how exactly is the target for crash landing. Motor was close. [Buck] Everything okay from… Chuck, can get in here. I’m sorry. It’s my balls dropped. [chuckles] Before I need music. Okay. Get a bunch of here! Move! Move! Move! Move! Go, go! Come 

In [31]:
from collections import defaultdict
import string
def markov_chain(text):
    '''The input is a string of text. The output is a tuple of two dictionaries:
       one mapping each word to the list of words that follow it in the text, and
       one mapping each punctuation mark to the list of words it was attached to.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    punc_dict = defaultdict(list)
    
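    # Strip one leading or trailing punctuation mark from each word and record
    # which words each mark was attached to
    for i, word in enumerate(words):
        # Skip empty strings, single punctuation marks, and music notes
        if word in string.punctuation or word == '♪':
            continue
        elif word[-1] in string.punctuation:
            if word.endswith('...'):
                # Treat a trailing ellipsis as a single mark
                punc_dict['...'].append(word[:-3])
                words[i] = word[:-3]
            else:
                punc_dict[word[-1]].append(word[:-1])
                words[i] = word[:-1]
        elif word[0] in string.punctuation:
            punc_dict[word[0]].append(word[1:])
            words[i] = word[1:]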
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    punc_dict = dict(punc_dict)
    return m_dict, punc_dict

In [53]:
import random
import string

def generate_sentence(chain, punc_dict, count):
    '''Input a dictionary in the format of key = current word, value = list of next words,
       a dictionary in the format of key = punctuation mark, value = list of words it was
       attached to, and the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Pick the next word from the current word's list of followers, then repeat
    for _ in range(count - 1):
        # The last word of the corpus may have no followers, so stop there
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # Leave the sentence alone if the last word is itself punctuation (or empty) or a music note
    if word1 in string.punctuation or word1 == '♪':
        return sentence

    # Otherwise end it with the first punctuation mark recorded for the last word
    for punc in punc_dict:
        if word1 in punc_dict[punc]:
            sentence += punc
            return sentence
    return sentence

In [54]:
# Reload the Letterkenny transcript
letter_text = data.transcript.loc['LETTERKENNY']
letter_text[:200]

'Letterkenny concludes with its series finale, Season 12 Episode 6, “Over and Out,” delivering a heartfelt and humorous farewell. The episode is filled with fun callbacks and a sense of contentment amo'

In [63]:
letter_dict, punc_dict=markov_chain(letter_text)

In [70]:
sentence= generate_sentence(letter_dict,punc_dict,500)

In [71]:
print(sentence)

Crazy contraption so much you talking to pull its pace Unfartunately yous knows there’s a significant event bringing together We’ve done here But Roald… none of feet in sixteen Look at the Ag Hall I said they’re cutting the… (both) Yahtzee Well… Maybe one bell pepper Yeah One who practices falconing one I open up people to Keeso’s real-life dog Gus The creator’s future projects including “Shoresy,” offer some crazy contraption so what I accept the Blue Herrons Reports Neither of the Ag Hall (all) We’re sure (all) No Madrigal Speed Trap No Ugh Fine two of our time Really There’s always gonna be all time rounding up the platform killed that nut sacks had become just have to get the superior platform Cleared the air currentleh I’m thinking about 20 off the Storks aren’t even as what A falconer Yeah like Hanging bed sheets over Now that while the hard time Really There’s always gonna be naive to this good buddys Do what you back than pillaging the Blue-Footed Boobys You’d have all the spir

In [77]:
yoyo='abc.'

In [78]:
yoyo[:-1]

'abc'