<a href="https://colab.research.google.com/github/romerik/fcc_Machine_Learning/blob/main/Obama_Speech_Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%tensorflow_version 2.x  # this line is not required unless you are in a notebook
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `2.x  # this line is not required unless you are in a notebook`. This will be interpreted as: `2.x`.


TensorFlow 2.x selected.


In [2]:
from google.colab import files
path_to_file = list(files.upload().keys())[0]

Saving Discours.txt to Discours.txt


In [3]:
# Read the uploaded file as bytes, then decode it as UTF-8.
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')
# The length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 14328 characters


In [4]:
# Take a look at the first 1000 characters in text
print(text[:1000])

THE PRESIDENT: Hello, everybody! Thank you. Thank you. Thank you, everybody. All right, everybody go ahead and have a seat. How is everybody doing today? (Applause.) How about Tim Spicer? (Applause.) I am here with students at Wakefield High School in Arlington, Virginia. And we've got students tuning in from all across America, from kindergarten through 12th grade. And I am just so glad that all could join us today. And I want to thank Wakefield for being such an outstanding host. Give yourselves a big round of applause. (Applause.)
I know that for many of you, today is the first day of school. And for those of you in kindergarten, or starting middle or high school, it's your first day in a new school, so it's understandable if you're a little nervous. I imagine there are some seniors out there who are feeling pretty good right now -- (applause) -- with just one more year to go. And no matter what grade you're in, some of you are probably wishing it were still summer and you could've

###Encoding
Since this text isn't encoded yet, we'll need to do that ourselves. We are going to encode each unique character as a different integer.
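
As a toy illustration of this mapping, here is a minimal sketch in plain Python. The sample string and the names `toy_vocab`, `toy_char2idx`, `encode`, and `decode` are made up for this example and are not part of the notebook's code.

```python
# Toy example: map each unique character of a made-up string to an integer.
sample = "hello world"

# The sorted unique characters form the vocabulary.
toy_vocab = sorted(set(sample))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

def encode(s):
    """Turn a string into a list of integer ids."""
    return [toy_char2idx[ch] for ch in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(toy_vocab[i] for i in ids)

encoded = encode(sample)
print(toy_vocab)        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(encoded)          # [3, 2, 4, 4, 5, 0, 7, 5, 6, 4, 1]
print(decode(encoded))  # hello world
```

Decoding the encoded ids returns the original string, which is the property the rest of the notebook relies on.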



In [5]:
vocab = sorted(set(text))
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [6]:
print(vocab)
print(len(vocab))

['\n', '\r', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '7', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
71


In [7]:
# let's look at how part of our text is encoded
print("Text:", text[:13])
print("Encoded:", text_to_int(text[:13]))

Text: THE PRESIDENT
Encoded: [39 28 25  2 36 37 25 38 29 24 25 34 39]


In [8]:
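def int_to_text(ints):
  # Tensors expose .numpy(); plain NumPy arrays and lists do not, so fall through for those
  try:
    ints = ints.numpy()
  except AttributeError:
    pass
  return ''.join(idx2char[ints])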

print(int_to_text(text_as_int[:13]))

THE PRESIDENT


###Creating Training Examples
Our task is to feed the model a sequence and have it return to us the next character. This means we need to split our text data from above into many shorter sequences that we can pass to the model as training examples. 

The training examples we will prepare will use a *seq_length* sequence as input and a *seq_length* sequence as the output, where the output is the input sequence shifted one character to the right. For example:

```input: Hell | output: ello```
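
The same idea can be sketched without TensorFlow. The helper below is a hypothetical plain-Python stand-in (the name `make_examples` is not from the notebook): it cuts the text into non-overlapping chunks of `seq_length + 1` characters and drops any incomplete tail, then pairs each chunk's input with its shifted target.

```python
def make_examples(text, seq_length):
    """Split text into chunks of seq_length + 1 characters and return
    (input, target) pairs where target is input shifted by one character."""
    chunk_len = seq_length + 1
    examples = []
    # Only full chunks are kept; an incomplete tail is dropped.
    for start in range(0, len(text) - chunk_len + 1, chunk_len):
        chunk = text[start:start + chunk_len]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

pairs = make_examples("Hello world", seq_length=4)
for inp, target in pairs:
    print(repr(inp), "->", repr(target))
# 'Hell' -> 'ello'
# ' wor' -> 'worl'
```

The trailing `d` does not fill a whole chunk, so it is discarded.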

Our first step will be to create a stream of characters from our text data.

In [9]:
seq_length = 100  # length of sequence for a training example
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

In [10]:
print(char_dataset.take(1))

<TakeDataset shapes: (), types: tf.int64>


Next we can use the batch method to turn this stream of characters into batches of desired length.

In [11]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
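
A quick arithmetic check, using the text length of 14328 characters printed earlier, shows how many full sequences this produces. This is only a sketch of the bookkeeping; the variable names other than `seq_length` are chosen for the example.

```python
text_length = 14328            # "Length of text" reported above
seq_length = 100
chunk_length = seq_length + 1  # each sequence holds input plus one extra character

# divmod gives the number of full chunks and the characters left over.
num_sequences, leftover = divmod(text_length, chunk_length)
print(num_sequences)  # 141
print(leftover)       # 87
```

With `drop_remainder=True`, the 87 leftover characters at the end of the text are discarded. The count of 141 is also the value of `examples_per_epoch`.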

Now we need to use these sequences of length 101 and split them into input and output.

In [12]:
def split_input_target(chunk):  # for the example: hello
    input_text = chunk[:-1]  # hell
    target_text = chunk[1:]  # ello
    return input_text, target_text  # hell, ello

dataset = sequences.map(split_input_target)  # we use map to apply the above function to every entry

In [13]:
print(dataset)

<MapDataset shapes: ((100,), (100,)), types: (tf.int64, tf.int64)>


In [14]:
for x, y in dataset.take(2):
  print("\n\nEXAMPLE\n")
  print("INPUT")
  print(int_to_text(x))
  print("\nOUTPUT")
  print(int_to_text(y))



EXAMPLE

INPUT
THE PRESIDENT: Hello, everybody! Thank you. Thank you. Thank you, everybody. All right, everybody go

OUTPUT
HE PRESIDENT: Hello, everybody! Thank you. Thank you. Thank you, everybody. All right, everybody go 


EXAMPLE

INPUT
ahead and have a seat. How is everybody doing today? (Applause.) How about Tim Spicer? (Applause.) I

OUTPUT
head and have a seat. How is everybody doing today? (Applause.) How about Tim Spicer? (Applause.) I 


Finally we need to make training batches.

In [15]:
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # number of unique characters in the vocabulary
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
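
The buffer-based shuffling described in the comment can be approximated in plain Python. This is a simplified sketch of the idea, not TensorFlow's actual implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input is exhausted: emit whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a small buffer, elements can only move a limited distance from their original position. Here the dataset has only 141 sequences and `BUFFER_SIZE` is 10000, so the buffer holds everything and the result is a full shuffle.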

###Building the Model


In [16]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(VOCAB_SIZE,EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           18176     
_________________________________________________________________
lstm (LSTM)                  (64, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (64, None, 71)            72775     
Total params: 5,337,927
Trainable params: 5,337,927
Non-trainable params: 0
_________________________________________________________________
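
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector.

```python
vocab_size, embedding_dim, rnn_units = 71, 256, 1024

# Embedding: one embedding_dim-sized vector per character.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + rnn_units) * rnn_units + rnn_units)

# Dense: weights from every LSTM unit to every character, plus biases.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params)  # 18176
print(lstm_params)       # 5246976
print(dense_params)      # 72775
print(total_params)      # 5337927
```

Almost all of the model's capacity sits in the LSTM layer.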


###Creating a Loss Function
Now we are going to create our own loss function for this problem. This is because our model will output a (64, sequence_length, 71) shaped tensor of unnormalized scores (logits) for each character at each timestep for every sequence in the batch, so the loss has to be told to treat the outputs as logits rather than probabilities.



However, before we do that let's have a look at a sample input and the output from our untrained model. This is so we can understand what the model is giving us.



In [17]:
for input_example_batch, target_example_batch in data.take(1):
  example_batch_predictions = model(input_example_batch)  # ask our model for a prediction on our first batch of training data (64 entries)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")  # print out the output shape

(64, 100, 71) # (batch_size, sequence_length, vocab_size)


In [18]:
# we can see that the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)

64
tf.Tensor(
[[[ 7.62699265e-03  1.10152166e-03 -5.25636598e-03 ... -4.35336353e-03
   -8.28266144e-04 -2.02604313e-03]
  [ 6.04327675e-03  2.86248047e-03 -9.24623758e-03 ...  2.23226426e-03
    1.29912049e-03 -1.92878593e-03]
  [ 3.59292817e-03  2.43182317e-03 -8.23112950e-03 ... -1.44667551e-03
   -1.01471459e-02  5.04916441e-03]
  ...
  [ 8.13252479e-03  8.13604333e-04 -7.46793812e-03 ... -3.92857473e-03
    3.27172689e-03 -6.48423564e-03]
  [ 1.01223588e-02  2.08001677e-03 -9.94499866e-03 ... -9.70614236e-03
    7.45589845e-03 -1.04377419e-02]
  [ 8.06148630e-03 -2.60718819e-03 -9.54290386e-03 ... -1.22782192e-03
   -5.12548722e-04 -5.06054470e-03]]

 [[-4.11526160e-03  4.17520991e-04  1.18001818e-03 ...  2.91393546e-04
    9.04617738e-03 -2.77288212e-03]
  [ 4.83230641e-03  9.93007794e-04 -4.12813900e-03 ... -4.57193376e-03
    5.76864230e-03 -5.16943866e-03]
  [-5.95908612e-04  1.65244867e-03 -2.74502579e-03 ... -3.34624201e-03
    1.27961477e-02 -7.95506407e-03]
  ...
  [ 1.372

In [19]:
# let's examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
# notice this is a 2D array of length 100, where each interior array is the prediction for the next character at each time step

100
tf.Tensor(
[[ 0.00762699  0.00110152 -0.00525637 ... -0.00435336 -0.00082827
  -0.00202604]
 [ 0.00604328  0.00286248 -0.00924624 ...  0.00223226  0.00129912
  -0.00192879]
 [ 0.00359293  0.00243182 -0.00823113 ... -0.00144668 -0.01014715
   0.00504916]
 ...
 [ 0.00813252  0.0008136  -0.00746794 ... -0.00392857  0.00327173
  -0.00648424]
 [ 0.01012236  0.00208002 -0.009945   ... -0.00970614  0.0074559
  -0.01043774]
 [ 0.00806149 -0.00260719 -0.0095429  ... -0.00122782 -0.00051255
  -0.00506054]], shape=(100, 71), dtype=float32)


In [20]:
# and finally we'll look at a prediction at the first timestep
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
# and of course it's 71 values, one score (logit) per character; a higher score means the character is more likely to occur next

71
tf.Tensor(
[ 0.00762699  0.00110152 -0.00525637 -0.00745045 -0.00363152  0.00123794
 -0.00023096  0.00579792 -0.00155136  0.00397557 -0.00246714 -0.00068156
 -0.00641047  0.00050122  0.00417974  0.00793499 -0.00331001 -0.00047541
 -0.00625166 -0.00166564 -0.00200743 -0.00389417  0.00098266 -0.00189935
 -0.00281346  0.00356629  0.00299683  0.00073833 -0.00132915 -0.00163202
 -0.00059693  0.00193355 -0.00667748 -0.00165157 -0.00169201 -0.00279811
  0.0044852  -0.00376152  0.00438907 -0.00011781  0.00175983  0.00271078
  0.00302047 -0.00236719  0.00013213  0.0082527  -0.00096923  0.00051052
  0.00098145  0.00084023  0.00264101 -0.00087558 -0.00169617 -0.00305396
  0.00516298  0.00357213  0.00316941 -0.00347713  0.0024611   0.0001195
 -0.00226035 -0.00230259 -0.00627074 -0.00199328 -0.00169039 -0.00017979
  0.00210524  0.00038536 -0.00435336 -0.00082827 -0.00202604], shape=(71,), dtype=float32)
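
The 71 values above include negative numbers and do not sum to 1, so they are raw scores (logits) rather than probabilities. A softmax turns them into a distribution. The sketch below uses NumPy and stand-in logits drawn from a similar small range (an assumption for illustration, not the notebook's actual values).

```python
import numpy as np

def softmax(logits):
    """Convert a 1-D array of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Stand-in logits: 71 small values, roughly like an untrained model's output.
rng = np.random.default_rng(0)
fake_logits = rng.uniform(-0.01, 0.01, size=71)

probs = softmax(fake_logits)
print(probs.sum())               # very close to 1.0
print(probs.min(), probs.max())  # both close to 1/71
```

Because the logits of an untrained model are all near zero, every character gets a probability close to 1/71, which is why the sampled text looks random.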


In [21]:
# To determine the predicted character we sample from the output distribution (pick an index with probability given by the softmax of the logits)
sampled_indices = tf.random.categorical(pred, num_samples=1)

# now we can reshape that array and convert all the integers to characters to see the actual text
sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars  # and this is what the model predicted for training sequence 1

'Wz\rEILe3Pv,aTm\nsEu13TUk-0BMz7ESO)pCf(Dzo.hhejHelf\rWpMqA!dEpd2BANNYyEm\rhm0HzCvK".\r FoL"HuDCLHy):s?I);'

So now we need to create a loss function that can compare that output to the expected output and give us some numeric value representing how close the two were. 

In [22]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
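
To make the computation concrete, here is a NumPy sketch of what sparse categorical cross-entropy on logits works out to. It is an illustrative re-implementation under that assumption, not the Keras source, and the function name is made up.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """labels: int array (batch, seq); logits: float array (batch, seq, vocab).
    Returns the per-character loss with shape (batch, seq)."""
    # Log-softmax over the vocabulary axis, shifted for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the correct character.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

toy_labels = np.array([[0, 2], [1, 1]])
uniform_logits = np.zeros((2, 2, 71))  # all-zero logits give a uniform distribution
uniform_loss = sparse_ce_from_logits(toy_labels, uniform_logits)
print(uniform_loss.shape)  # (2, 2)
print(uniform_loss[0, 0])  # ln(71), about 4.26

# Raising the logit of the correct character lowers the loss at that position.
confident_logits = uniform_logits.copy()
confident_logits[0, 0, 0] = 10.0
confident_loss = sparse_ce_from_logits(toy_labels, confident_logits)
print(confident_loss[0, 0])
```

A loss near ln(71) is what to expect from a model that spreads its predictions evenly over the 71 characters.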

###Compiling the Model
At this point we can think of our problem as a classification problem where the model predicts the probability of each unique character coming next.


In [23]:
model.compile(optimizer='adam', loss=loss)

###Creating Checkpoints
Now we are going to set up and configure our model to save checkpoints as it trains. This will allow us to load our model from a checkpoint and continue training it.

In [24]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

###Training
Finally, we will start training the model. 

**If this is taking a while, go to Runtime > Change Runtime Type and choose "GPU" under hardware accelerator.**
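
It helps to know how much work one epoch is on a text this short. The arithmetic below uses the numbers from this notebook (14328 characters, sequences of 101 characters, batches of 64, 500 epochs); the variable names are chosen for the example.

```python
num_sequences = 14328 // 101                  # full 101-character sequences
steps_per_epoch = num_sequences // 64         # full batches of 64 (remainder dropped)
unused_per_epoch = num_sequences - steps_per_epoch * 64
total_updates = steps_per_epoch * 500

print(num_sequences)     # 141
print(steps_per_epoch)   # 2
print(unused_per_epoch)  # 13
print(total_updates)     # 1000
```

Each epoch is only two gradient updates, which is why so many epochs are used. Because the dataset is shuffled before batching, the 13 sequences left out of an epoch are, by default, not the same ones every time.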



In [25]:
history = model.fit(data, epochs=500, callbacks=[checkpoint_callback])

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 21/500
Epoch 22/500
Epoch 23/500
Epoch 24/500
Epoch 25/500
Epoch 26/500
Epoch 27/500
Epoch 28/500
Epoch 29/500
Epoch 30/500
Epoch 31/500
Epoch 32/500
Epoch 33/500
Epoch 34/500
Epoch 35/500
Epoch 36/500
Epoch 37/500
Epoch 38/500
Epoch 39/500
Epoch 40/500
Epoch 41/500
Epoch 42/500
Epoch 43/500
Epoch 44/500
Epoch 45/500
Epoch 46/500
Epoch 47/500
Epoch 48/500
Epoch 49/500
Epoch 50/500
Epoch 51/500
Epoch 52/500
Epoch 53/500
Epoch 54/500
Epoch 55/500
Epoch 56/500
Epoch 57/500
Epoch 58/500
Epoch 59/500
Epoch 60/500
Epoch 61/500
Epoch 62/500
Epoch 63/500
Epoch 64/500
Epoch 65/500
Epoch 66/500
Epoch 67/500
Epoch 68/500
Epoch 69/500
Epoch 70/500
Epoch 71/500
Epoch 72/500
Epoch 73/500
Epoch 74/500
Epoch 75/500
Epoch 76/500
Epoch 77/500
Epoch 78

###Loading the Model
We'll rebuild the model from a checkpoint using a batch_size of 1 so that we can feed one piece of text to the model and have it make a prediction.

In [26]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

Once the model is finished training, we can find the **latest checkpoint** that stores the model's weights using the following line.



In [27]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

We can load **any checkpoint** we want by specifying the exact file to load.
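
A minimal sketch of building such a path, assuming the `ckpt_{epoch}` naming used when the checkpoints were created. The helper name `checkpoint_path` and the choice of epoch 10 are hypothetical; whether that file exists depends on how far training got.

```python
import os

checkpoint_dir = './training_checkpoints'

def checkpoint_path(epoch, directory=checkpoint_dir):
    """Build the checkpoint prefix for a given epoch number."""
    return os.path.join(directory, "ckpt_{epoch}".format(epoch=epoch))

path = checkpoint_path(10)
print(path)

# With the rebuilt model from above in scope, the weights could then be loaded with:
# model.load_weights(path)
# model.build(tf.TensorShape([1, None]))
```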

###Generating Text
Now we can use the lovely function provided by TensorFlow to generate some text using any starting string we'd like.

In [28]:
def generate_text(model, start_string, num_generate=10000):
  # Evaluation step (generating text using the learned model)

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to collect the generated characters
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
    
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
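
The effect of `temperature` is easier to see on a tiny example. The NumPy sketch below mirrors the divide-then-sample step of the function above; `sample_with_temperature` and the three toy logits are made up for illustration.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Scale logits by 1/temperature, apply softmax, and sample one index.
    Returns the sampled index and the probabilities it was drawn from."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()  # numerical stability
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

toy_logits = [2.0, 1.0, 0.1]
rng = np.random.default_rng(0)

_, cold = sample_with_temperature(toy_logits, temperature=0.5, rng=rng)
_, hot = sample_with_temperature(toy_logits, temperature=2.0, rng=rng)
print(cold)  # sharper: most of the mass on the largest logit
print(hot)   # flatter: probabilities closer to each other
```

A low temperature concentrates probability on the most likely character, which gives more predictable text, while a high temperature flattens the distribution and gives more surprising text.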

In [30]:
inp = input("Type a starting string: ")
print(generate_text(model, inp))

Type a starting string: Hello, everybody
Hello, everybody. All right, everybody go deiff you quit on school -- you're not just quitting on yourself, you're quitting on your country.
Now of you, what's your contribution going to be? What problems are you goingot to fecused or vaccine frobaterg a ped just a little bit longer this monybulking a for your fort life your few times beffere you get it right. You might have to read something a few times before you understant to your family down or your country down. Most of all, don't let yourself down. Make us all proud.
That's and puse thought this mights and put a man on the moon. Students who sat where you sit 20 years ago who founded Google and her way to becoming Dr. Jazmin Perez.
The to focus on todant where a tifference or how it un you give up on yourself, you give ou sond of shomsn't mit to bighou can be rich and successful without any hard work -- that your ticket to success is through rappingither and AIDS, and to develop new energy

In [33]:
inp = input('Type some text :  ')
print(generate_text(model, inp, 100))

Type some text :  Thanks
Thanks who are Obaieared, and you get your who Prting from his her way to becoming Dr. Jazmin Perez.
I'm 


In [34]:
model.save('obama_speech_generator.h5')

