<a href="https://colab.research.google.com/github/woodstone10/nlp-natural-language-lab/blob/main/NLP_text_generation_RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation using character RNN

References:
- Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
- https://github.com/ageron/handson-ml2/blob/master/16_nlp_with_rnns_and_attention.ipynb

In [None]:
import glob
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import sklearn.preprocessing

**Data**

---



First, let’s download all of Shakespeare’s work, using Keras’s handy get_file() function to fetch the data from Andrej Karpathy’s Char-RNN project:

In [None]:
url = "https://homl.info/shakespeare" 

In [None]:
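import urllib.request  # a bare `import urllib` does not guarantee the request submodule is loaded
with urllib.request.urlopen(url) as f:
  html = f.read()  # raw bytes of the plain-text file (not actually HTML)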

In [None]:
print(html[:250])

b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n'


In [None]:
filepath = tf.keras.utils.get_file("shakespeare.txt", url) 
with open(filepath) as f: 
  shakespeare_text = f.read()
print(len(shakespeare_text))
print(shakespeare_text[:250]) 

Downloading data from https://homl.info/shakespeare
1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



Save the text as shakespeare_text.txt on Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Open in "w" mode rather than "a" so that re-running this cell does not append a second copy
with open("/content/drive/My Drive/My Colab/deep_learning_study-python/shakespeare_text.txt", "w") as f:
  f.write(shakespeare_text)

Vocabulary

In [None]:
vocab = sorted(set(shakespeare_text))
print('{} unique characters'.format(len(vocab)))
print(vocab)

65 unique characters
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [None]:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

We must encode every character as an integer. One option is to create a custom preprocessing layer, but in this case it is simpler to use Keras’s Tokenizer class. First we need to fit a tokenizer to the text: it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters. Because the Tokenizer lowercases the text by default, it sees the 39 lowercased characters listed above rather than all 65.

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text) 
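
To make the mapping concrete, here is a rough pure-Python sketch of what a character-level tokenizer does: lowercase the text, count the characters, and give the most frequent character ID 1. The helper names `build_char_index`, `encode` and `decode` are hypothetical, and the alphabetical tie-breaking is an assumption made for determinism. This approximates the behavior; it is not Keras's implementation, so the IDs it produces will not necessarily match the tokenizer's.

```python
from collections import Counter

def build_char_index(text):
    """Map each lowercased character to an ID starting at 1, most frequent first."""
    counts = Counter(text.lower())
    # Sort by descending count; ties are broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode(text, char_index):
    """Convert a string to a list of character IDs."""
    return [char_index[ch] for ch in text.lower()]

def decode(ids, char_index):
    """Convert a list of character IDs back to a (lowercased) string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)

sample = "First Citizen:\nBefore we proceed any further, hear me speak."
char_index = build_char_index(sample)
ids = encode("First", char_index)
print(ids)
print(decode(ids, char_index))  # the round trip loses the original capitalization
```

As with the real tokenizer, the round trip is lossy: "First" comes back lowercased.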

Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs and back, and it tells us how many distinct characters there are and the total number of characters in the text:

In [None]:
tokenizer.texts_to_sequences(["First"])  

[[20, 6, 9, 8, 3]]

In [None]:
tokenizer.sequences_to_texts([[ 20 , 6 , 9 , 8 , 3 ]]) 


['f i r s t']

Number of distinct characters:

In [None]:
max_id = len(tokenizer.word_index) 
max_id

39

Total number of characters:

In [None]:
dataset_size = tokenizer.document_count 
dataset_size

1115394

Let’s encode the full text so each character is represented by its ID. We subtract 1 so that the IDs run from 0 to 38 rather than from 1 to 39:

In [None]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1 
encoded

array([19,  5,  8, ..., 20, 26, 10])

We will use the first 90% of the text for the training set:

In [None]:
train_size = dataset_size * 90 // 100 
train_size

1003854

In [None]:
int(dataset_size * 0.9)

1003854

Create a tf.data.Dataset that will return each character one by one from this set:

In [None]:
dataset = tf.data.Dataset.from_tensor_slices( encoded[:train_size] ) 

**Chopping the Sequential Dataset into Multiple Windows**

The training set now consists of a single sequence of over a million characters, so we can’t train the network directly on it: we would have a single (very long) instance to train on. Instead, we will use the dataset’s window() method to convert this long sequence of characters into many smaller windows of text.




In [None]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

In [None]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))
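
The windowing idea can be sketched in plain Python on a toy string. The function `sliding_windows` and the `toy_text` value are hypothetical names used only for illustration. Unlike the tf.data pipeline above, this version is finite (there is no repeat()) and works on characters rather than IDs. Each window holds `n_steps + 1` items, so that the target is the input shifted one character ahead.

```python
def sliding_windows(sequence, window_length, shift=1):
    """Yield every full window of `window_length` items, advancing `shift` items each time."""
    # Stopping at len - window_length mirrors drop_remainder=True: no short windows.
    for start in range(0, len(sequence) - window_length + 1, shift):
        yield sequence[start:start + window_length]

toy_text = "to be or"
n_steps = 4
window_length = n_steps + 1  # one extra character so the target can be shifted by one

windows = list(sliding_windows(toy_text, window_length))
for window in windows:
    inputs, targets = window[:-1], window[1:]
    print(repr(inputs), "->", repr(targets))
```

With shift=1, consecutive windows overlap heavily, which is why a text of N characters yields roughly N training windows.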

In [None]:
np.random.seed(42)
tf.random.set_seed(42)
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
for X_batch, Y_batch in dataset.take(1):
  print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
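
The shapes above differ because only the inputs are one-hot encoded. The targets stay as integer IDs, which is the form the sparse categorical cross-entropy loss expects. A minimal pure-Python sketch of one-hot encoding follows; the `one_hot` helper and the short `window_ids` list are hypothetical, with the depth of 39 taken from `max_id`.

```python
def one_hot(char_id, depth):
    """Return a list of `depth` zeros with a single 1.0 at position `char_id`."""
    vec = [0.0] * depth
    vec[char_id] = 1.0
    return vec

depth = 39                 # number of distinct characters (max_id in the notebook)
window_ids = [19, 5, 8]    # a few zero-based character IDs, for illustration only

encoded_window = [one_hot(i, depth) for i in window_ids]
print(len(encoded_window), len(encoded_window[0]))  # time steps, depth
print(encoded_window[0].index(1.0))                 # position of the single 1.0
```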


**Model**

---



Use a GPU runtime for faster training. The cell below raises an error if no GPU is found:

In [None]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


The GRU class only uses the GPU-accelerated (cuDNN) implementation when the following arguments keep their default values: activation, recurrent_activation, recurrent_dropout, unroll, use_bias and reset_after. This is why recurrent_dropout=0.2, which the book uses, is left out here.

In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.GRU(128, return_sequences=True, 
                      input_shape=[None, max_id],
                      dropout=0.2),
  tf.keras.layers.GRU(128, return_sequences=True,
                      dropout=0.2),
  tf.keras.layers.TimeDistributed(
      tf.keras.layers.Dense(max_id, activation="softmax"))
])

model.compile(loss="sparse_categorical_crossentropy", 
              optimizer="adam")

with tf.device('/device:GPU:0'):
  history = model.fit(dataset, 
                      steps_per_epoch=train_size // batch_size,
                      epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


**Test**

---



In [None]:
def preprocess(texts):
  X = np.array(tokenizer.texts_to_sequences(texts)) - 1
  return tf.one_hot(X, max_id)

In [None]:
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]

'u'

In [None]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
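
The effect of the temperature can be illustrated without TensorFlow. The sketch below divides the log-probabilities by the temperature and renormalizes with a softmax, which is the distribution that sampling from the rescaled logits amounts to. The `rescale` helper and the three-class `probs` list are hypothetical, chosen only for illustration.

```python
import math

def rescale(probs, temperature):
    """Divide log-probabilities by `temperature`, then renormalize with a softmax."""
    logits = [math.log(p) / temperature for p in probs]
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - highest) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]
for temperature in (0.2, 1.0, 2.0):
    rescaled = rescale(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature below 1 concentrates probability on the most likely character, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it, giving more varied but less reliable text.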

In [None]:
next_char("How are yo", temperature=1)

'u'

In [None]:
y = next_char("How are yo", temperature=1)
y

'u'

In [None]:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [None]:
print(complete_text("how are you doing", temperature=1))

how are you doing for a senuty.
it comes him to great all the eseme
