## Text Generation with a Recurrent Neural Network

In this notebook, we build and train a character-level text generation model using TensorFlow. The workflow includes:

1. **Data Preparation:**
   - Load and preprocess text data from a CSV file.
   - Convert the text into a sequence of integers based on a character vocabulary.

2. **Dataset Creation:**
   - Create sequences of fixed length from the text data.
   - Prepare input-target pairs for training.

3. **Model Definition:**
   - Define a Recurrent Neural Network (RNN) model with an embedding layer, LSTM layer, and a dense output layer.

4. **Training:**
   - Compile and train the model on the prepared dataset.
   - Save model checkpoints to enable recovery and further evaluation.

5. **Text Generation:**
   - Use the trained model to generate text based on various starting strings.
   - Evaluate the model's ability to produce coherent and contextually relevant text.

This process demonstrates how to use TensorFlow to create a text generation model, explore its training dynamics, and assess its performance through generated text samples.


In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import os


#### Data Loading and Preprocessing

In [2]:
# Load data
data = pd.read_csv('data/training.1600000.processed.noemoticon.csv', 
                   encoding='latin-1',
                   header=None)
data.head(3)

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |


In [3]:
# concatenate the text
text = ' '.join(data[5])
text[:300]

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah! @Kenichan I dived many times for the ball. Managed to save 50%  The rest"

In [4]:
# vectorization
# see: https://www.tensorflow.org/tutorials/text/text_generation
# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

193 unique characters
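
The vectorization step can be checked on a tiny string. The sketch below is illustrative only: it uses a hypothetical sample text and the hypothetical names `toy_vocab`, `toy_char2idx`, `encode`, and `decode` rather than the tweet corpus, and shows that the character-to-index mapping is reversible.

```python
# Toy stand-in for the tweet corpus (hypothetical sample, not the real data)
toy_text = "hello world"

# Same recipe as above: sorted unique characters define the vocabulary
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

def encode(s):
    """Map each character to its integer index."""
    return [toy_char2idx[ch] for ch in s]

def decode(ids):
    """Map integer indices back to characters."""
    return "".join(toy_vocab[i] for i in ids)

ids = encode("hello")
print(toy_vocab)
print(ids)
print(decode(ids))
```

Note that any character missing from the vocabulary raises a `KeyError` during encoding, so a start string passed to the model later must use only characters seen in the training text.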



#### Training data


In [5]:
# Sequence length and examples per epoch
# The maximum length sentence we want for a single input in characters
seq_length = 128
examples_per_epoch = len(text) // (seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
    print(idx2char[i.numpy()])

@
s
w
i
t


In [6]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))

"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D is upset that"
" he can't update his Facebook by texting it... and might cry as a result  School today also. Blah! @Kenichan I dived many times f"
'or the ball. Managed to save 50%  The rest go out of bounds my whole body feels itchy and like its on fire  @nationwideclass no, '
"it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.  @Kwesidei not the whole crew  Need a h"
"ug  @LOLTrish hey  long time no see! Yes.. Rains a bit ,only a bit  LOL , I'm fine thanks , how's you ? @Tatiana_K nope they didn"


In [7]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [8]:
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D is upset tha"
Target data: "switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D is upset that"
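
The chunking and input/target split above can be reproduced in plain Python on a short string. This is a minimal sketch, assuming a hypothetical helper `make_chunks` that mimics `batch(seq_length+1, drop_remainder=True)` and a toy sequence length of 4. It is not how `tf.data` is implemented.

```python
def make_chunks(seq, chunk_len):
    """Split seq into consecutive chunks of chunk_len, dropping any remainder."""
    n_full = len(seq) // chunk_len
    return [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]

def split_pair(chunk):
    """Input is everything but the last item; target is shifted one step right."""
    return chunk[:-1], chunk[1:]

toy_text = "Hello world"
toy_seq_length = 4

# Each chunk holds seq_length + 1 characters so that input and target
# both end up with exactly seq_length characters
toy_chunks = make_chunks(toy_text, toy_seq_length + 1)
for chunk in toy_chunks:
    print(split_pair(chunk))
```

The trailing `"d"` does not fill a whole chunk and is dropped, which is why `examples_per_epoch` is computed with integer division by `seq_length+1`.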


In [9]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 30 ('@')
  expected output: 81 ('s')
Step    1
  input: 81 ('s')
  expected output: 85 ('w')
Step    2
  input: 85 ('w')
  expected output: 71 ('i')
Step    3
  input: 71 ('i')
  expected output: 82 ('t')
Step    4
  input: 82 ('t')
  expected output: 65 ('c')


In [10]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<_BatchDataset element_spec=(TensorSpec(shape=(64, 128), dtype=tf.int32, name=None), TensorSpec(shape=(64, 128), dtype=tf.int32, name=None))>
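
The buffer comment above describes a streaming shuffle. The sketch below is a simplified pure-Python model of that idea, not the `tf.data` implementation: the function name `buffered_shuffle` and the fixed seed are assumptions made for illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot
        j = rng.randrange(buffer_size)
        yield buffer[j]
        buffer[j] = item
    # Input exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

In this model an element can only be emitted early by fewer positions than the buffer holds, so a buffer much smaller than the dataset gives only a local shuffle. This is why `BUFFER_SIZE` is set to a fairly large value.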

#### Text generation


In [11]:
# Define the model function
def get_model(vocab, embedding_dim=256, rnn_units=512):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(len(vocab), embedding_dim,
                                  input_shape=[None]),  # accept sequences of any length
        tf.keras.layers.LSTM(rnn_units,
                            return_sequences=True,  # one output per input character
                            recurrent_initializer='glorot_uniform'),  # not stateful: each call starts from a zero state
        tf.keras.layers.Dense(len(vocab))
    ])
    return model

# Initialize the model
model = get_model(vocab)

# Test the model with a batch from the dataset
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)




In [12]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([133,   0, 127, 177,  15,  24, 100, 139, 115, 167, 129,  33,  13,
       174,  26, 116, 110,  78, 148, 183, 183, 158,  28, 141,  48, 182,
        44,  20, 167,  50, 167,  84, 134,   7, 160,  20,  90, 186, 188,
       175,  77,   1,  32, 190,  69, 147, 121,  74,  77,  55,  64, 187,
        95, 131, 160, 108,  64, 133, 134,  77, 157, 169,  63,  56, 181,
        83, 148,  24, 186, 180,   8,  85, 134,  37, 189, 106,  17, 133,
       153,  94,  22,  62,  46,  16, 155,  68,  64,  95,  36, 175, 120,
        27, 124,  81, 102, 189, 127,  23,  72, 119,  52, 104,  59, 164,
        81,  57, 149,  92, 168,  89, 124,  52, 192,  73,  72,  66, 136,
        62, 152,  10,  73, 138, 134, 121,  95,  43, 167, 185], dtype=int64)

In [13]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices])))

Input: 
 "ancurtis  i'm here, friend, and i love you. @kariajay  All this time you didn't notice I was gone...just needed DB2 is it?? @aar"

Next Char Predictions: 
 "¬\t¦à/8\x87²\x99Ð¨C-Ú:\x9a\x94p»ææÇ=´RåN4ÐTÐv\xad'É4|éëÛo Bígº\xa0loYbê\x82ªÉ\x92b¬\xadoÆÒaZäu»8éã(w\xadGì\x8e1¬Â\x806`P0Äfb\x82FÛ\x9f;£s\x89ì¦7j\x9eV\x8b]Ís[¼~Ñ{£Vïkjd¯`¿*k±\xad\xa0\x82MÐè"

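The untrained model produces gibberish because its output distribution over the 193 characters is close to uniform. As a rough sanity check, a model that guesses uniformly has a cross-entropy of `ln(vocab_size)`, so the loss at the start of training should be near that value. The snippet below only does this arithmetic and does not call the model. The name `baseline_loss` is introduced here for illustration.

```python
import math

vocab_size = 193  # number of unique characters reported above

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
baseline_loss = math.log(vocab_size)
print(f"Uniform-guess loss: {baseline_loss:.3f}")
```

A loss well below this baseline indicates that the model has learned something about character statistics.
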

In [14]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}.weights.h5")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

model.compile(optimizer='adam', 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

model.fit(dataset, 
          epochs=2,
          callbacks=[checkpoint_callback],
          verbose=1)


Epoch 1/2
 4342/14552 ━━━━━━━━━━━━━━━━━━━━ 3:11:52 1s/step - loss: 2.0509

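To make the reported loss easier to interpret, the sketch below computes sparse categorical cross-entropy from raw logits for a single prediction, and converts the loss shown in the progress bar into a per-character perplexity. It is a hand-rolled illustration in plain Python with hypothetical toy logits, not the Keras implementation.

```python
import math

def sparse_cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits for a 3-character vocabulary; the true next character is index 0
toy_logits = [2.0, 1.0, 0.1]
toy_loss = sparse_cross_entropy_from_logits(toy_logits, target=0)
print(f"toy loss: {toy_loss:.3f}")

# Perplexity is exp(loss): roughly how many characters the model is choosing between
reported_loss = 2.0509  # value from the training output above
perplexity = math.exp(reported_loss)
print(f"perplexity: {perplexity:.2f}")
```

Because the final `Dense` layer has no softmax, the loss must be built with `from_logits=True`, as in the `compile` call above.
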
In [None]:
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    """Generate `num_generate` characters that continue `start_string`."""
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.

    # Convert the start string to numbers (vectorizing)
    generated_ids = [char2idx[s] for s in start_string]
    num_start = len(generated_ids)

    for _ in range(num_generate):
        # The model is not stateful, so feed it the most recent context
        # (at most seq_length characters) on every step; batch size == 1
        input_eval = tf.expand_dims(generated_ids[-seq_length:], 0)
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # Sample the next character from the categorical distribution
        # defined by the temperature-scaled logits at the last time step
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Append the sampled character so it becomes part of the next input
        generated_ids.append(int(predicted_id))

    return start_string + ''.join(idx2char[i] for i in generated_ids[num_start:])
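
The effect of the temperature setting can be seen without running the model. The sketch below applies a temperature-scaled softmax to a few hypothetical logits in plain Python. `softmax_with_temperature` is an illustrative helper, not a TensorFlow function, and is equivalent in spirit to dividing the logits by `temperature` before sampling.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-character vocabulary
toy_logits = [2.0, 1.0, 0.1]
probs_by_temperature = {t: softmax_with_temperature(toy_logits, t) for t in (0.5, 1.0, 2.0)}
for t, probs in probs_by_temperature.items():
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring character, while higher temperatures flatten the distribution and make rare characters more likely to be sampled.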

In [None]:
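# Rebuild the model for generation and restore the latest saved weights
import glob

generating_model = get_model(vocab)
generating_model.build(tf.TensorShape([1, None]))

# tf.train.latest_checkpoint only finds TensorFlow-format checkpoints, so
# pick the most recently written "*.weights.h5" file instead
checkpoint_files = glob.glob(os.path.join(checkpoint_dir, "ckpt_*.weights.h5"))
generating_model.load_weights(max(checkpoint_files, key=os.path.getmtime))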

In [None]:
print(generate_text(generating_model, start_string=u"Well, "))

In [None]:
# Define start strings to generate text
start_strings = [
    "Once upon a time",
    "In a galaxy far, far away",
    "The quick brown fox",
    "To be or not to be",
    "Hello world"
]

# Generate and print text for each start string
for start_string in start_strings:
    print(f"Start String: {start_string}")
    generated_text = generate_text(generating_model, start_string)
    print(f"Generated Text:\n{generated_text}\n")
    print("="*50)  # Separator for readability