# Text Generation

## Introduction
- Here we use Markov chains to build a basic text generator.
- A first-order Markov chain assumes that the next word depends only on the current word, not on any of the words before it.
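
Stated a little more formally, for a sequence of words $w_1, \dots, w_t$ (notation introduced here only for illustration), the assumption can be written as:

$$
P(w_t \mid w_1, w_2, \dots, w_{t-1}) = P(w_t \mid w_{t-1})
$$

In other words, the whole history before the current word is ignored when choosing what comes next.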

## Objective
- We are going to generate text in the style of Ali Wong, the comedian.

In [5]:
# read the corpus 
import pandas as pd

data = pd.read_pickle('clean_corpus_with_full_names.pkl')
data

| | transcript | full_name |
|---|---|---|
| ali | ladies and gentlemen please welcome to the sta... | Ali Wong |
| anthony | thank you thank you thank you san francisco th... | Anthony Jeselnik |
| bill | all right thank you thank you very much thank... | Bill Burr |
| bo | bo what old macdonald had a farm e i e i o and... | Bo Burnham |
| dave | this is dave he tells dirty jokes for a living... | Dave Chappelle |
| hasan | what’s up davis what’s up i’m home i had to ... | Hasan Minhaj |
| jim | ladies and gentlemen please welcome to the ... | Jim Jefferies |
| joe | ladies and gentlemen welcome joe rogan wha... | Joe Rogan |
| john | armed with boyish charm and a sharp wit the fo... | John Mulaney |
| louis | introfade the music out let’s roll hold there ... | Louis C.K. |


In [6]:
# extract Ali Wong's transcript (the first row); .iloc makes the positional lookup explicit
ali_text = data['transcript'].iloc[0]
ali_text

'ladies and gentlemen please welcome to the stage ali wong hi hello welcome thank you thank you for coming hello hello we are gonna have to get this shit over with ’cause i have to pee in like ten minutes but thank you everybody so much for coming um… it’s a very exciting day for me it’s been a very exciting year for me i turned  this year yes thank you five people i appreciate that uh i can tell that i’m getting older because now when i see an  girl my automatic thought… is “fuck you” “fuck you i don’t even know you but fuck you” ‘cause i’m straight up jealous i’m jealous first and foremost of their metabolism because  girls they could just eat like shit and then they take a shit and have a sixpack right they got thatthat beautiful inner thigh clearance where they put their feet together and there’s that huge gap here with the light of potential just radiating throughand then when they go to sleep they just go to sleep right they don’t have insomnia yet they don’t know what it’s like 

## Build a Markov Chain
Build a simple Markov chain as a dictionary with the following key and value pairs.
- The keys are all the distinct words in the corpus
- Each value is the list of words that directly follow that key somewhere in the corpus
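
To make the target structure concrete, here is a minimal sketch on a made-up nine-word string. The sample text and the names `toy_text`, `toy_words`, and `toy_chain` are illustrative assumptions, not part of the corpus or the notebook.

```python
# Toy illustration of the word -> list-of-followers mapping.
# The sample string and variable names are made up for this sketch.
toy_text = "the cat sat on the mat the cat ran"
toy_words = toy_text.split(' ')

toy_chain = {}
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    # setdefault creates the empty list the first time a key is seen
    toy_chain.setdefault(current_word, []).append(next_word)

print(toy_chain)
# {'the': ['cat', 'mat', 'cat'], 'cat': ['sat', 'ran'], 'sat': ['on'], 'on': ['the'], 'mat': ['the']}
```

Note that `'ran'`, the last word, never becomes a key because nothing follows it.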

In [7]:
# UDF to create markov chain
from collections import defaultdict

# Input: a string of text. Output: a dictionary that maps each word to the list of
# words that come directly after it in the text (duplicates are kept).
def markov_chain(text):
    words = text.split(' ')  # tokenize the corpus by word
    m_dict = defaultdict(list)  # maps each word to the list of words that follow it
    # Pair every word with its successor and record the successor under the word
    for current_word, next_word in zip(words[:-1], words[1:]):
        m_dict[current_word].append(next_word)
    return dict(m_dict)  # convert the defaultdict into a plain dict

In [8]:
# create a dictionary for Ali's comedy routine
ali_dict = markov_chain(ali_text)
ali_dict

{'ladies': ['and', 'go', 'who', 'they', 'and', 'who'],
 'and': ['gentlemen',
  'foremost',
  'then',
  'have',
  'there’s',
  'resentment',
  'get',
  'get',
  'says',
  'my',
  'she',
  'snatch',
  'running',
  'fighting',
  'yelling',
  'it',
  'everybody',
  'my',
  'she',
  'i',
  'uh',
  'i',
  'i',
  'the',
  'i',
  'i',
  'i',
  'we',
  'we–',
  'then',
  'i',
  'watched',
  'i',
  'have',
  'that',
  'you',
  'recycling',
  'disturbing',
  'it’s',
  'all',
  'just…',
  'then',
  'be',
  'halfvietnamese',
  'we',
  'his',
  'i',
  'slide',
  'your',
  'inflamed',
  'you’re',
  'then',
  'i',
  'halfjapanese',
  'i’m',
  'halfvietnamese',
  'playing',
  'rugby',
  'on',
  'foremost',
  'a',
  'the',
  'emotionally',
  'i',
  '',
  'so',
  'neither',
  'i',
  'i–',
  'then',
  'it’s',
  'find',
  'start',
  'just',
  'caves',
  'gets',
  'is',
  'then',
  'look',
  'like',
  'very',
  'for',
  'i',
  'she',
  'rise',
  'her',
  'be',
  'eat',
  'watch',
  'then',
  'be',
  'now',


- We keep the duplicates on purpose: the more often a word appears in a key's list, the more likely it is to be chosen as the next word.
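
As a rough sketch of what the duplicates encode, the followers of `'ladies'` from the output above can be turned into relative frequencies. The names `followers`, `counts`, and `probabilities` are introduced here for illustration only.

```python
from collections import Counter

# Followers of 'ladies', copied from the dictionary output above
followers = ['and', 'go', 'who', 'they', 'and', 'who']

counts = Counter(followers)
total = len(followers)
# Relative frequency of each follower, i.e. the chance that a uniform
# random pick from the list returns that word
probabilities = {word: n / total for word, n in counts.items()}
print(probabilities)
```

Here `'and'` and `'who'` each occur twice out of six, so each is picked with probability 2/6, while `'go'` and `'they'` are picked with probability 1/6.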

In [13]:
ali_dict.keys()

dict_keys(['ladies', 'and', 'gentlemen', 'please', 'welcome', 'to', 'the', 'stage', 'ali', 'wong', 'hi', 'hello', 'thank', 'you', 'for', 'coming', 'we', 'are', 'gonna', 'have', 'get', 'this', 'shit', 'over', 'with', '’cause', 'i', 'pee', 'in', 'like', 'ten', 'minutes', 'but', 'everybody', 'so', 'much', 'um…', 'it’s', 'a', 'very', 'exciting', 'day', 'me', 'been', 'year', 'turned', '', 'yes', 'five', 'people', 'appreciate', 'that', 'uh', 'can', 'tell', 'i’m', 'getting', 'older', 'because', 'now', 'when', 'see', 'an', 'girl', 'my', 'automatic', 'thought…', 'is', '“fuck', 'you”', 'don’t', 'even', 'know', 'fuck', '‘cause', 'straight', 'up', 'jealous', 'first', 'foremost', 'of', 'their', 'metabolism', 'girls', 'they', 'could', 'just', 'eat', 'then', 'take', 'sixpack', 'right', 'got', 'thatthat', 'beautiful', 'inner', 'thigh', 'clearance', 'where', 'put', 'feet', 'together', 'there’s', 'huge', 'gap', 'here', 'light', 'potential', 'radiating', 'throughand', 'go', 'sleep', 'insomnia', 'yet', 'w
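
The empty-string key `''` in the list above most likely comes from consecutive spaces in the cleaned transcript (for example `i turned  this year`), since `split(' ')` emits an empty token between two adjacent spaces. A small sketch of the difference, using a sample copied from the transcript; whether to switch to `split()` is a design choice this notebook does not make:

```python
# Sample with a double space, as in "i turned  this year" in the transcript
sample = "i turned  this year"

print(sample.split(' '))  # ['i', 'turned', '', 'this', 'year']
print(sample.split())     # ['i', 'turned', 'this', 'year']
```

Calling `split()` with no argument collapses runs of whitespace, so it would remove the `''` entries, at the cost of changing the dictionary shown in this notebook.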

In [14]:
ali_dict.values()

dict_values([['and', 'go', 'who', 'they', 'and', 'who'], ['gentlemen', 'foremost', 'then', 'have', 'there’s', 'resentment', 'get', 'get', 'says', 'my', 'she', 'snatch', 'running', 'fighting', 'yelling', 'it', 'everybody', 'my', 'she', 'i', 'uh', 'i', 'i', 'the', 'i', 'i', 'i', 'we', 'we–', 'then', 'i', 'watched', 'i', 'have', 'that', 'you', 'recycling', 'disturbing', 'it’s', 'all', 'just…', 'then', 'be', 'halfvietnamese', 'we', 'his', 'i', 'slide', 'your', 'inflamed', 'you’re', 'then', 'i', 'halfjapanese', 'i’m', 'halfvietnamese', 'playing', 'rugby', 'on', 'foremost', 'a', 'the', 'emotionally', 'i', '', 'so', 'neither', 'i', 'i–', 'then', 'it’s', 'find', 'start', 'just', 'caves', 'gets', 'is', 'then', 'look', 'like', 'very', 'for', 'i', 'she', 'rise', 'her', 'be', 'eat', 'watch', 'then', 'be', 'now', 'they’re', 'most', 'they’ll', 'in', 'then', 'digitally', 'then', 'you', 'you’re', 'then', 'then', 'then', 'then', 'steady', 'brings', 'let', 'reverberate', 'say', 'my', 'he', 'when', 'i’m'

## Create Text Generator
Here we write a text generation function that takes the following parameters.
- The Markov chain dictionary
- The number of words to generate

In [15]:
# UDF for text generation
import random

# Input a dictionary in the format of key = current word, value = list of next words
# along with the number of words you would like to see in your generated sentence
def generate_sentence(chain, count=15):
    # pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # sample the next word from the current word's list of followers,
    # set the new word as the current word, and repeat
    for _ in range(count - 1):
        # the last word of the corpus may have no entry in the dictionary;
        # in that case restart from a random key instead of raising a KeyError
        followers = chain.get(word1) or list(chain.keys())
        word2 = random.choice(followers)
        word1 = word2
        sentence += " " + word2

    # end with a period
    sentence += '.'
    return sentence

In [16]:
generate_sentence(ali_dict)

'Looking than me and you know because being courageous for coming true and halfvietnamese so.'
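
The sentence above changes on every run because the walk is random. If a repeatable result is wanted, one option is to draw from a seeded generator. The sketch below is an assumption-laden illustration, not the notebook's function: `toy_chain` is a hypothetical stand-in for `ali_dict`, and `walk` with its `seed` parameter is a name invented here.

```python
import random

# Hypothetical toy chain standing in for ali_dict; every follower is also a key,
# so this sketch does not need to handle dead ends
toy_chain = {
    'the': ['cat', 'mat', 'cat'],
    'cat': ['sat', 'ran'],
    'sat': ['on'],
    'on': ['the'],
    'mat': ['the'],
    'ran': ['the'],
}

def walk(chain, count, seed):
    """Random walk over the chain using a private, seeded generator."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    words = [word]
    for _ in range(count - 1):
        word = rng.choice(chain[word])
        words.append(word)
    return ' '.join(words).capitalize() + '.'

first = walk(toy_chain, count=8, seed=42)
second = walk(toy_chain, count=8, seed=42)
print(first)
print(first == second)  # True: the same seed replays the same walk
```

Using a separate `random.Random` instance keeps the seeding local, so other code that relies on the global `random` module is unaffected.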