<a href="https://colab.research.google.com/github/youavang/NLP_Inaugural_Speech/blob/main/NLP_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation

In this notebook, we generate sentences using a Markov chain. Markov chains can be used for very basic text generation: the method assumes that the next word depends only on the previous word.
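
Written as a formula, the assumption above is the first-order Markov property. This is a sketch in notation that does not appear in the original notebook, where $w_t$ stands for the word at position $t$:

$$P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1})$$

In other words, everything before the previous word is ignored when choosing the next one.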

## Select Text to Imitate

We are going to generate text in the style of President Barack Obama. The first step is to extract the text of his inaugural speech from the corpus.

In [1]:
# Mount Google Drive to access saved files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('/content/drive/MyDrive/pickle/corpus.pkl')
data

| | transcript | full_name |
|---|---|---|
| bush | Thank you, all. Chief Justice Rehnquist, Presi... | George W. Bush |
| clinton | My fellow citizens, today we celebrate the mys... | Bill Clinton |
| obama | My fellow citizens, I stand here today humbled... | Barack Obama |
| trump | Chief Justice Roberts, President Carter, Presi... | Donald Trump |


In [3]:
# Extract only President Obama's text
obama_text = data.transcript.loc['obama']
obama_text

'My fellow citizens, I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our Nation, as well as the generosity and cooperation he has shown throughout this transition. Forty-four Americans have now taken the Presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet every so often, the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because we the people have remained faithful to the ideals of our forebears and true to our founding documents. So it has been; so it must be with this generation of Americans. That we are in the midst of crisis is now well understood. Our Nation is at war against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence

## Build a Markov Chain Function

Build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys

In [4]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output is a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''

    # Tokenize the text on single spaces, so punctuation stays attached to its word
    words = text.split(' ')

    # Initialize a default dictionary to hold each word and its list of next words
    m_dict = defaultdict(list)

    # Pair each word with the word that follows it and record the follower under the current word
    for current_word, next_word in zip(words[:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a regular dictionary
    return dict(m_dict)
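
To see what this structure looks like on a small input, here is a minimal sketch on a made-up sentence. The names `build_chain`, `toy_text`, and `toy_chain` are hypothetical; `build_chain` is only a compact stand-in that mirrors the function above so the cell runs on its own.

```python
from collections import defaultdict

def build_chain(text):
    # Hypothetical stand-in that mirrors the markov_chain function above
    words = text.split(' ')
    chain = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)

# Made-up example sentence, not taken from the corpus
toy_text = 'the cat sat on the mat and the cat ran'
toy_chain = build_chain(toy_text)
print(toy_chain)
# {'the': ['cat', 'mat', 'cat'], 'cat': ['sat', 'ran'], 'sat': ['on'],
#  'on': ['the'], 'mat': ['and'], 'and': ['the']}
```

Note that `'ran'`, the last word of the text, never becomes a key because nothing follows it. Any code that walks the chain should be prepared to hit such a dead end.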

In [5]:
# Create the dictionary for President Obama's speech.
obama_dict = markov_chain(obama_text)
obama_dict

{'My': ['fellow'],
 'fellow': ['citizens,'],
 'citizens,': ['I'],
 'I': ['stand', 'thank', 'say'],
 'stand': ['here', 'before'],
 'here': ['today'],
 'today': ['humbled', 'is'],
 'humbled': ['by'],
 'by': ['the',
  'our',
  'the',
  'these',
  'inducing',
  'every',
  'dying',
  'our',
  'the',
  'Gerhard'],
 'the': ['task',
  'trust',
  'sacrifices',
  'generosity',
  'Presidential',
  'still',
  'oath',
  'skill',
  'people',
  'ideals',
  'midst',
  'part',
  'Nation',
  'ways',
  'indicators',
  'next',
  'challenges',
  'petty',
  'recriminations',
  'words',
  'time',
  'God-given',
  'greatness',
  'path',
  'fainthearted,',
  'pleasures',
  'risk-takers,',
  'doers,',
  'makers',
  'long,',
  'West,',
  'lash',
  'whip,',
  'hard',
  'sum',
  'differences',
  'journey',
  'most',
  'work',
  'economy',
  'roads',
  'electric',
  'sun',
  'winds',
  'soil',
  'demands',
  'scale',
  'cynics',
  'ground',
  'stale',
  'answer',
  'answer',
  "public's",
  'light',
  'vital',
  'q
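
The follower lists keep duplicates, which is what lets a uniform random pick from a list favor the more frequent next words. As a rough illustration, the sketch below counts the followers of `'by'` as shown in the output above and turns them into relative frequencies. The variable names are assumptions made for this example.

```python
from collections import Counter

# Followers of 'by', copied from the dictionary output above
followers = ['the', 'our', 'the', 'these', 'inducing',
             'every', 'dying', 'our', 'the', 'Gerhard']

# Count how often each follower occurs and normalize to relative frequencies
counts = Counter(followers)
total = len(followers)
probabilities = {word: n / total for word, n in counts.items()}

# Print the followers from most to least likely
for word, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f'{word}: {p:.1f}')
```

Here `'the'` accounts for 3 of the 10 followers and `'our'` for 2, so a random pick from this list returns `'the'` about 30% of the time.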

## Create a Text Generator

Create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

In [6]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize only its first letter
    word1 = random.choice(list(chain.keys()))
    sentence = word1[:1].upper() + word1[1:]

    # Pick the next word from the current word's list of followers. Repeat.
    for _ in range(count - 1):
        # The last word of the corpus may not be a key, so stop early to avoid a KeyError
        if word1 not in chain:
            break
        word1 = random.choice(chain[word1])
        sentence += ' ' + word1

    # End it with a period
    sentence += '.'
    return sentence

In [11]:
generate_sentence(obama_dict)

'Job, which this we will restore science to power to you can now is the.'
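
Each run of the generator produces a different sentence. If repeatable output is useful, one option is to draw from a seeded `random.Random` instance. The sketch below is one possible variant and not the notebook's own function: `generate_words`, its `seed` parameter, and `toy_chain` are hypothetical names, and the toy dictionary is made up for illustration.

```python
import random

def generate_words(chain, count=15, seed=None):
    # A private generator keeps the seed from affecting other random calls
    rng = random.Random(seed)

    # Start from a random key, then keep following the chain
    word = rng.choice(list(chain))
    words = [word]

    # Stop at the requested length or when the current word has no followers
    while len(words) < count and word in chain:
        word = rng.choice(chain[word])
        words.append(word)
    return words

# Made-up chain in the same key -> list-of-next-words format
toy_chain = {
    'the': ['cat', 'mat', 'cat'],
    'cat': ['sat', 'ran'],
    'sat': ['on'],
    'on': ['the'],
    'mat': ['and'],
    'and': ['the'],
}

print(' '.join(generate_words(toy_chain, count=8, seed=0)))
```

Calling the function twice with the same `seed` yields the same words, which makes it easier to compare changes to the chain itself.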

## Next Step
The next step in improving text generation is to build a deep learning model that can generate well-thought-out sentences that actually make sense.
