https://colab.research.google.com/drive/1D9vdshsEujmMlbdyXtxe22uAOB5bdeVo?usp=sharing

# Implementing character-level RNN text generation
## Prepare the data
Let's start by downloading a corpus (a large plain-text file) and applying a bit of pre-processing.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

# the source text here is "I promessi sposi" (the slice used for Dante's Divina Commedia is kept commented out below); any other text can be used
path = keras.utils.get_file(
    'pg45334.txt',
    origin = 'https://www.gutenberg.org/cache/epub/45334/pg45334.txt' # promessi sposi
    #origin = 'https://www.gutenberg.org/cache/epub/84/pg84.txt'
)
# read as UTF-8 so the accented characters are decoded the same way on every platform
with open(path, encoding = 'utf-8') as f:
    text = f.read().lower()
# reduce it a bit (for the Divina Commedia, Inferno starts at character 2215)
#text = text[2215:500000]
text = text[94387:500000] # start of the novel


print('corpus length:', len(text))



Downloading data from https://www.gutenberg.org/cache/epub/45334/pg45334.txt
corpus length: 405613


Now we extract partially overlapping sequences of length maxlen, one-hot encode them, and pack them into a 3D NumPy array x of shape (sequences, maxlen, unique_characters). At the same time, we prepare an array y containing the corresponding targets: the one-hot-encoded characters that follow each extracted sequence.
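
As a small illustration of this windowing, here is a sketch on a toy string. The names toy_text, toy_maxlen, toy_step and the values chosen are hypothetical and only meant to keep the arrays small enough to inspect by hand; the real cell below uses the full corpus.

```python
import numpy as np

# Toy stand-ins for the notebook's variables (hypothetical values)
toy_text = "hello world"
toy_maxlen = 4   # window length
toy_step = 3     # a new window starts every toy_step characters

# Extract partially overlapping windows and the character that follows each one
starts = range(0, len(toy_text) - toy_maxlen, toy_step)
windows = [toy_text[i:i + toy_maxlen] for i in starts]
targets = [toy_text[i + toy_maxlen] for i in starts]

# Map each unique character to an index
toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

# One-hot encode the windows (3D) and the targets (2D)
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        toy_x[i, t, toy_index[c]] = True
    toy_y[i, toy_index[targets[i]]] = True

print(windows)
print(targets)
print(toy_x.shape, toy_y.shape)
```

Each row of toy_x[i] contains exactly one True value, marking which character sits at that position of window i.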

In [2]:
maxlen = 60 #length of the extracted sequences for training
step = 3 #we sample a sequence every step characters


sentences = [] #this will hold the extracted sequences
next_chars = [] #this will hold the target characters (the follow-up characters)
for i in range(0, len(text)-maxlen, step):
  sentences.append(text[i:i+maxlen])
  next_chars.append(text[i+maxlen])

print('Number of sequences:', len(sentences))

chars = sorted(list(set(text))) #list of unique characters in the corpus
print('Unique characters:', len(chars))
print(chars)
#create a dictionary that maps each unique character to its index in the list "chars"
char_indices = {char: i for i, char in enumerate(chars)}

print('Vectorization...', end = '')
#we one-hot encode the characters into binary arrays
#(np.bool was deprecated in NumPy 1.20 and later removed; the built-in bool is equivalent)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype = bool)
y = np.zeros((len(sentences), len(chars)), dtype = bool)
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
  y[i,char_indices[next_chars[i]]] = 1
print('completed')

Number of sequences: 135185
Unique characters: 61
['\n', ' ', '!', "'", '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'x', 'y', 'z', '«', '»', 'à', 'æ', 'è', 'é', 'ì', 'í', 'ò', 'ô', 'ù']
Vectorization...



completed


In [3]:
x.shape

(135185, 60, 61)

In [4]:
y.shape

(135185, 61)

## Building the neural network model

In this example, let us use an LSTM layer followed by a Dense classifier with softmax over all the possible characters.

In [5]:
model = keras.models.Sequential()
model.add(keras.layers.LSTM(128, input_shape = (maxlen, len(chars))))
model.add(keras.layers.Dense(len(chars), activation = 'softmax'))

optimizer = keras.optimizers.RMSprop(learning_rate = 0.01)
#we use categorical crossentropy because the targets are one-hot encoded
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer)
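
To make the loss choice concrete, here is a minimal NumPy sketch of what a softmax output and categorical crossentropy compute for a single one-hot target. The helper names and the three-character vocabulary are hypothetical, and this is an illustration of the formula rather than Keras's internal implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def categorical_crossentropy(one_hot_target, probs):
    # With a one-hot target, only the true class contributes to the sum
    return float(-np.sum(one_hot_target * np.log(probs)))

logits = np.array([2.0, 0.5, -1.0])  # hypothetical scores for a 3-character vocabulary
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])   # the true next character has index 0

loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

Because the target is one-hot, the loss reduces to the negative log-probability the model assigns to the correct next character.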



## Training the language model and sampling from it

Given a trained model and a seed text, you can generate new text by repeating:

1. draw from the model a probability distribution for the next character, given the generated text available so far
2. reweight the distribution to a certain temperature
3. sample the next character at random following the reweighted distribution
4. add the new character at the end of the available text




The following code implements the sampling function.

In [6]:
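import random
import sys

def sample(preds, temperature = 1.0):
  #work in float64 so the renormalized probabilities sum to 1 accurately enough for multinomial
  preds = np.asarray(preds).astype('float64')
  #reweight: divide the log-probabilities by the temperature...
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  #...and renormalize so that they sum to 1 again
  preds = exp_preds / np.sum(exp_preds)
  #draw one sample from the reweighted distribution (the result is a one-hot vector)
  probas = np.random.multinomial(1, preds, 1)
  #return the index of the sampled character
  return np.argmax(probas)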
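
The reweighting step inside sample can be inspected on its own. The sketch below isolates it in a hypothetical helper, reweight, and applies it to a made-up three-character distribution (toy_preds), so the effect of each temperature can be read directly from the printed probabilities.

```python
import numpy as np

def reweight(preds, temperature=1.0):
    # Same transformation as in sample: scale the log-probabilities, then renormalize
    preds = np.asarray(preds, dtype="float64")
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / scaled.sum()

toy_preds = [0.6, 0.3, 0.1]  # hypothetical next-character distribution
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

Temperatures below 1 concentrate the probability mass on the most likely character, which gives more repetitive and predictable text; temperatures above 1 flatten the distribution, which gives more surprising but also more error-prone text. A temperature of exactly 1 leaves the model's distribution unchanged.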


Let's fit the model for a few epochs.

In [7]:
model.fit(x,y,batch_size=1024, epochs = 30)


Epoch 1/30



Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
  6/133 [>.............................] - ETA: 1:38 - loss: 1.7908

KeyboardInterrupt: 

Let us now generate text using a range of temperatures.


In [None]:
#print('epoch', epoch)
#model.fit(x,y,batch_size=128, epochs = 1)

#select a text seed at random
start_index = random.randint(0,len(text)-maxlen -1)
generated_text = text[start_index:start_index+maxlen]
print(' --- Generating text with seed:"' + generated_text + '"')

#try a range of different temperatures (the text window carries over from one temperature to the next)
for temperature in [0.2, 0.5, 1.0, 1.2]:
  print('----temperature:', temperature)
  sys.stdout.write(generated_text)
  #generate 400 characters starting from the current window
  for i in range(400):
    #one-hot encode the current window (the last maxlen characters)
    sampled = np.zeros((1,maxlen,len(chars)))
    for t, char in enumerate(generated_text):
      sampled[0, t, char_indices[char]] = 1.
    
    #sample the next character
    preds = model.predict(sampled, verbose = 0)[0]
    next_index = sample(preds, temperature)
    next_char = chars[next_index]
    
    #append the new character and drop the oldest one, so the window stays maxlen characters long
    generated_text += next_char
    generated_text = generated_text[1:]

    sys.stdout.write(next_char)
  sys.stdout.write('\n')

 --- Generating text with seed:" file should be named 18457-8.txt or 18457-8.zip *****
this "
----temperature: 0.2
 file should be named 18457-8.txt or 18457-8.zip *****
this 


al capoco io non poffoffetto
della sentrro da sua madre!

_indicherà la rigaratala. e anche compodo, e più di non possono
di sitaro in centanto di scena.

_il capocomico (alzandosi e i provo di terrattando)._ e lei loro di figlio, a quel come nen può si personaggio compagnio di quellioscre che così, una veder mi scena.

_il capocomico (subito, e promere ne sono dintiti, di dire, per luegato di sce
----temperature: 0.5
bito, e promere ne sono dintiti, di dire, per luegato di scente noll'abbia--che si recazzo.

_il capocomico (subito, e da un po' di scena._ e sono? per con gui tenta vedere con lei dieste un'ascia mia cuolla quel siapoco comedicia da
quell'altro così!

_al capocomico (in una gavdo)._ e alla signora, perché come vode, signorino,
di questa voggerà col figlio chi pocere; e come vede. non si non fuola!

_il capocomico._ e poto
gire lai sare in testo mio a tutt
----temperature: 1.0
!

_il capocomico._ e poto
gire lai sare in testo mio a tutti
io
conticchiona di
modre, fua di
g