# Text Generation

Markov chains can be used for very basic text generation. Every word in the corpus is treated as a state, and we make the simplifying assumption that the next word depends only on the previous word.
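Written as a formula, the assumption looks like this. The notation is introduced here for illustration and is not from the original notebook: $w_t$ is the word at position $t$, and $\operatorname{count}(a, b)$ is the number of times word $b$ directly follows word $a$ in the corpus.

$$
P(w_t \mid w_1, \dots, w_{t-1}) = P(w_t \mid w_{t-1})
$$

A natural estimate of the right-hand side is the relative frequency of the word pair:

$$
\hat{P}(b \mid a) = \frac{\operatorname{count}(a, b)}{\sum_{c} \operatorname{count}(a, c)}
$$

Picking uniformly at random from a list of every word observed after $a$, duplicates included, samples from exactly this estimate.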

In [24]:
import pandas as pd
import numpy as np
import random
import re

import nltk
nltk.download('punkt')

from collections import defaultdict

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SUNSHINE\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [25]:
def tokenize_text(text):
    '''Tokenize the input text by sentence and word.'''
    sentences = nltk.sent_tokenize(text)
    tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
    return tokenized_sentences

Build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
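To make the target structure concrete, here is a minimal sketch on a toy corpus. The names `toy_sentences` and `toy_chain` are made up for this illustration, and a plain whitespace split stands in for the NLTK tokenizer so the snippet has no external dependencies.

```python
from collections import defaultdict

# Toy corpus, already split into sentences and words.
toy_sentences = [
    "the cat sat".split(),
    "the dog sat down".split(),
]

# Map every word to the list of words observed directly after it.
toy_chain = defaultdict(list)
for sentence in toy_sentences:
    for current_word, next_word in zip(sentence, sentence[1:]):
        toy_chain[current_word].append(next_word)
toy_chain = dict(toy_chain)

print(toy_chain)
# {'the': ['cat', 'dog'], 'cat': ['sat'], 'dog': ['sat'], 'sat': ['down']}
```

Note that `down` never becomes a key because nothing follows it in the toy corpus, so in practice the keys are the words that have at least one successor. Repeated followers are kept as duplicates, which is what lets a uniform random pick from the list reflect how often each pair occurs.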

In [26]:
def build_markov_chain(tokenized_text):
    '''Build a Markov chain dictionary from the tokenized text.
       The input is a list of tokenized sentences (each a list of words) and the output
       is a dictionary with each word as a key and each value as the list of words that
       come directly after the key in the text. The special key 'START' maps to the
       first word of every sentence.'''

    m_dict = defaultdict(list)
    for sentence in tokenized_text:
        if not sentence:  # Skip empty sentences so sentence[0] cannot fail
            continue
        m_dict['START'].append(sentence[0])  # Record sentence openers under 'START'
        for current_word, next_word in zip(sentence, sentence[1:]):
            m_dict[current_word].append(next_word)
    return dict(m_dict)

In [27]:
def generate_sentence(chain, start_word='START', count=15, randomness=1.0):
    '''Generate up to `count` words by walking the Markov chain dictionary.
       Note: `randomness` is accepted but not used yet.'''
    sentence = []
    current_word = start_word

    for _ in range(count):
        # .get avoids a KeyError for words that never have a successor
        next_word_options = chain.get(current_word)
        if not next_word_options:
            break
        next_word = random.choice(next_word_options)
        sentence.append(next_word)
        current_word = next_word

    return ' '.join(sentence)
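The `randomness` argument of `generate_sentence` is never read in the function body. One possible way to give it a meaning, sketched here purely as an assumption and not as the author's intended design, is to treat it like a sampling temperature. The helper name `pick_next_word` and its `rng` parameter are hypothetical.

```python
import random
from collections import Counter

def pick_next_word(followers, randomness=1.0, rng=random):
    """Hypothetical helper: sample one follower, treating `randomness` (> 0)
    as a temperature applied to the follower frequencies."""
    counts = Counter(followers)
    words = list(counts)
    top = max(counts.values())
    # Scale counts into (0, 1] before exponentiating to avoid overflow.
    # randomness == 1.0 keeps the raw frequencies (same distribution as
    # random.choice on the list); larger values flatten the distribution,
    # smaller values favor the most frequent follower.
    weights = [(counts[w] / top) ** (1.0 / randomness) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Usage: a seeded generator makes the draw reproducible.
followers = ['a', 'a', 'a', 'b']
print(pick_next_word(followers, randomness=1.0, rng=random.Random(0)))
```

Swapping `random.choice(next_word_options)` for a call like this would be one way to wire the parameter in; whether that matches what the notebook had in mind is not stated in the source.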

In [28]:
data = pd.read_csv("frame2.csv")
com_text = data.Transcript.loc[0]  # Transcript for the row labeled 0; change the label to pick another comedian

In [29]:
# Tokenize the text
tokenized_text = tokenize_text(com_text)

In [30]:
# Build Markov chain dictionary
com_dict = build_markov_chain(tokenized_text)

In [31]:
# Generate a sentence
generated_sentence = generate_sentence(com_dict, start_word='START', count=60)
print(generated_sentence)

hey hey did have you know what struck me up and night he goes i told me i have to me two times because well sometimes theyre like that happened i tom he goes yeah it six months ago i do you get fuckin piece on a sixyearold let me a magazine or meeting tom the same it is nice


##### The generated text looks like English, apart from a few spelling errors, but it doesn't really make much sense. There are grammar and syntax errors everywhere, though this is partly to be expected given that the source text consists of transcripts of spoken stand-up comedy routines.