# Presidential Speech Generator #

This project demonstrates how to generate text using a character-based RNN. It is based on the following tutorial:

https://www.tensorflow.org/tutorials/text/text_generation
 

The purpose of the project is to create a model that can automatically generate a speech by a US president. The dataset used to train the model is a collection of all speeches made by US presidents until September 25, 2019. The data can be found here:

https://www.kaggle.com/littleotter/united-states-presidential-speeches

Since linguistic styles have changed over time, and because the dataset is quite large, this project will focus on speeches made during the Sixth Party System, from 1964 until the present.

This notebook is meant to be run in Google Colaboratory.



In [1]:
#%tensorflow_version 2.x  # this line is not required unless you are in a notebook
# Keras is used through tf.keras below, so the standalone keras imports are not needed
import tensorflow as tf
import os
import numpy as np
import pandas as pd
from google.colab import files

In [2]:
# Find the file on your local computer (must be downloaded first)
speech_file = files.upload()

Saving sixth_party_corpus.csv to sixth_party_corpus (4).csv


In [3]:
# Create DataFrame
speeches = pd.read_csv('sixth_party_corpus.csv')
speeches.shape

(10, 3)

In [4]:
# Take a look
speeches.head(10)

Unnamed: 0.1,Unnamed: 0,Party,transcripts
0,Lyndon B. Johnson,Democratic,"Mr. Speaker, Mr. President, Members of the Hou..."
1,Richard M. Nixon,Republican,"Senator Dirksen, Mr. Chief Justice, Mr. Vice P..."
2,Gerald Ford,Republican,"Mr. Chief Justice, my dear friends, my fellow ..."
3,Jimmy Carter,Democratic,"I am Edwin Newman, moderator of this first deb..."
4,Ronald Reagan,Republican,Thank you. Thank you very much. Thank you and ...
5,George H. W. Bush,Republican,I have many friends to thank tonight. I thank ...
6,Bill Clinton,Democratic,"My fellow citizens, today we celebrate the mys..."
7,George W. Bush,Republican,"President Clinton, distinguished guests and my..."
8,Barack Obama,Democratic,To Chairman Dean and my great friend Dick Durb...
9,Donald Trump,Republican,"Chief Justice Roberts, President Carter, Presi..."


In [5]:
# Display the first 500 characters of President Johnson's speeches
speeches['transcripts'][0][:500]



In [6]:
# Check the length. Johnson's speeches total more than 1.4 million characters
len(speeches['transcripts'][0])

1417677

In [7]:
# Concatenate all of the speech transcripts into a single string
all_speeches = ''.join(speeches['transcripts'])

print(len(all_speeches))

7334371


In [8]:
all_speeches[:500]

'Mr. Speaker, Mr. President, Members of the House, Members of the Senate, my fellow Americans: All I have I would have given gladly not to be standing here today. The greatest leader of our time has been struck down by the foulest deed of our time. Today John Fitzgerald Kennedy lives on in the immortal words and works that he left behind. He lives on in the mind and memories of mankind. He lives on in the hearts of his countrymen. No words are sad enough to express our sense of loss. No words are'

In [9]:
vocab = sorted(set(all_speeches))
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(all_speeches)

In [10]:
# Let's look at how part of our text is encoded
print("Text:", all_speeches[:30])
print("Encoded:", text_to_int(all_speeches[:30]))

Text: Mr. Speaker, Mr. President, Me
Encoded: [38 71 11  0 44 69 58 54 64 58 71  9  0 38 71 11  0 41 71 58 72 62 57 58
 67 73  9  0 38 58]
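
The same mapping can be tried on a toy string. This is a minimal sketch in plain Python; `sample_text` and the `sample_*` names are made up for illustration and are not part of the notebook.

```python
# Toy illustration of the character-to-index mapping (sample text is hypothetical)
sample_text = "hello world"

sample_vocab = sorted(set(sample_text))   # unique characters, sorted
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = sample_vocab            # position in the list -> character

# Encode every character as its index, then map the indices back to text
encoded = [sample_char2idx[ch] for ch in sample_text]
decoded = "".join(sample_idx2char[i] for i in encoded)

print(sample_vocab)  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)       # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decoded)       # hello world
```

Because the vocabulary is sorted, the space character receives index 0, which matches the `0` values that appear between words in the encoded output above.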


In [11]:
# Converts integers back to text
def int_to_text(ints):
  try:
    ints = ints.numpy()
  except AttributeError:  # already a NumPy array, so there is nothing to convert
    pass
  return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:30]))

Mr. Speaker, Mr. President, Me


In [12]:
seq_length = 100  # length of sequence for a training example
examples_per_epoch = len(all_speeches)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [13]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
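
As a rough check of how much data this produces, the arithmetic can be worked through by hand. The sketch below uses the corpus length printed earlier (7,334,371 characters) and the batch size of 64 that the notebook sets in a later cell; the variable names are illustrative only.

```python
total_chars = 7334371   # len(all_speeches), as printed earlier
seq_length = 100
batch_size = 64         # BATCH_SIZE, defined in a later cell

chunk_len = seq_length + 1                   # 100 inputs plus 1 extra character for the shifted target
num_chunks = total_chars // chunk_len        # drop_remainder=True discards the incomplete tail
leftover_chars = total_chars % chunk_len     # characters that do not fit in a full chunk
steps_per_epoch = num_chunks // batch_size   # full batches of 64 chunks each

print(num_chunks, leftover_chars, steps_per_epoch)  # 72617 54 1134
```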

In [14]:
def split_input_target(chunk):  # for the example: hello
    input_text = chunk[:-1]  # hell
    target_text = chunk[1:]  # ello
    return input_text, target_text  # hell, ello

dataset = sequences.map(split_input_target)  # we use map to apply the above function to every entry
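
To make the chunking and the one-character shift concrete, here is a small pure-Python sketch of the same two steps. It uses a made-up word and a sequence length of 4 instead of 100, and `toy_text`, `toy_seq_length` and `pairs` are hypothetical names.

```python
def split_input_target(chunk):
    # The target is the input shifted one character to the left
    return chunk[:-1], chunk[1:]

toy_text = "presidential"        # 12 characters
toy_seq_length = 4
chunk_len = toy_seq_length + 1

# Equivalent of batch(seq_length + 1, drop_remainder=True): keep only full chunks
chunks = [toy_text[i:i + chunk_len]
          for i in range(0, len(toy_text) - chunk_len + 1, chunk_len)]

pairs = [split_input_target(chunk) for chunk in chunks]
for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

At each position the model sees the input character and is trained to predict the target character at the same position, which is the next character in the original text.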

In [15]:
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # vocab is number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
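
The buffer-based shuffling described in the comment can be sketched in plain Python. This is a simplified illustration of the idea, not TensorFlow's actual implementation; `buffered_shuffle` and its parameters are hypothetical names.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]             # emit a random element from the buffer
        buffer[idx] = item            # replace it with the incoming element
    rng.shuffle(buffer)               # drain what is left at the end
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

An element can only be emitted once it has entered the buffer, so a small buffer gives a weak, local shuffle, while a buffer at least as large as the dataset gives a full shuffle. With roughly 72,000 sequences in this dataset, a buffer of 10,000 shuffles only partially.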

In [16]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           23296     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 91)            93275     
Total params: 5,363,547
Trainable params: 5,363,547
Non-trainable params: 0
_________________________________________________________________
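
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias, and takes the vocabulary size of 91 from the Dense layer's output shape.

```python
vocab_size, embedding_dim, rnn_units = 91, 256, 1024

# Embedding: one vector of length embedding_dim per character
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense: weight matrix plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 23296 5246976 93275 5363547
```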


In [17]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
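
For intuition about the loss values seen during training, the per-character loss can be computed by hand. This is a plain-Python sketch of sparse categorical cross-entropy on logits, not the TensorFlow code path; `sparse_ce_from_logits` is a hypothetical helper.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# A model that has learned nothing assigns equal logits to all 91 characters
uniform_loss = sparse_ce_from_logits(label=3, logits=[0.0] * 91)
print(uniform_loss)  # equals ln(91), about 4.51

# A confident, correct prediction gives a loss close to zero
confident_logits = [0.0] * 91
confident_logits[3] = 10.0
confident_loss = sparse_ce_from_logits(label=3, logits=confident_logits)
print(confident_loss)
```

A loss near ln(91) therefore corresponds to guessing uniformly over the vocabulary, and lower values indicate that the model has learned something about the text.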

In [18]:
model.compile(optimizer='adam', loss=loss)

In [19]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
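
The `{epoch}` placeholder in the checkpoint name is filled in each time a checkpoint is written. The sketch below assumes the callback substitutes a 1-based epoch number using Python string formatting, so a 20-epoch run would produce names from `ckpt_1` to `ckpt_20`.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Assumption: the callback fills in the 1-based epoch number via str.format
for epoch in (1, 2, 20):
    print(checkpoint_prefix.format(epoch=epoch))
```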

In [20]:
# Train the model
history = model.fit(data, epochs=20, callbacks=[checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [21]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

In [22]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

In [None]:
# Don't run unless you want to use the model from a specific checkpoint
# checkpoint_num = 10  # must be an epoch that was saved during training (1 to 20 here)
# model.load_weights("./training_checkpoints/ckpt_" + str(checkpoint_num))
# model.build(tf.TensorShape([1, None]))

In [23]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
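
The effect of `temperature` can be seen on a small example. Dividing the logits by the temperature before sampling is equivalent to sampling from a softmax of the scaled logits; the sketch below computes those probabilities in plain Python, with made-up logits and a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Probabilities obtained from logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring character, so the generated text becomes more repetitive and predictable. Higher temperatures flatten the distribution and produce more invented words.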

In [24]:
inp = input("Type a starting string: ")
output = generate_text(model, inp)
output

Type a starting string: Good evening


"Good evening as professionalism, and restore stability and community. I made it clear some of the worries of business income fraps move to get rock solution – and they will send D- not just what you still has in your carse. We can't stop security. America is also calling outlined at a time when Republicans survived by shelpingless representatives to work with your own business and out of retirees. In addition, there have been insurance that could not be true trade rules: there's no dollar do things perhaps here in America. Our military families on decades, including our greatest of our institutions, and fixty financill the United States. We'll do it to the colontax ally, but no one knows how important this way, we have received ranchers from our great violent. And on the tecan school, ICE. If you've been tautible ) I approe it to be an approval department of immigration rural and industries that they need to make a difference table. I think it is far away, they're going to get it this

## Conclusion ##

The end result is a body of text which, although not particularly coherent, does a surprisingly good job of capturing the cadence and vocabulary of a presidential speech. The model has even invented some words, such as "shelpingless", which sounds almost like it could be a word in English.