<a href="https://colab.research.google.com/github/rodrigorenemenegazzo/Artificial-Intelligence/blob/main/RNN_e_Classifica%C3%A7%C3%A3o_de_Textos_Shakespeare_Gera%C3%A7%C3%A3o_de_texto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Rodrigo Rene Menegazzo

Practice: Text Generation with an RNN

  * Shakespeare text (taken from the official TensorFlow site)
  * Input: "Shakespear"
  * Prediction: "e"
  * Text is produced by calling the model repeatedly, feeding each predicted character back in as input

https://www.tensorflow.org/tutorials/text/text_generation
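
The "call the model repeatedly" idea can be sketched without any neural network. In this minimal sketch, `generate` and `toy_predict_next` are hypothetical names, and the toy predictor is only a stand-in for a trained model: it returns the next character of a fixed word.

```python
def generate(predict_next, seed, steps):
    """Extend `seed` one character at a time by calling `predict_next` repeatedly."""
    text = seed
    for _ in range(steps):
        # Each call sees everything generated so far and returns one character.
        text += predict_next(text)
    return text

# Hypothetical stand-in for a trained model: it continues a fixed word.
TARGET = "Shakespeare"

def toy_predict_next(context):
    return TARGET[len(context) % len(TARGET)]

print(generate(toy_predict_next, "Shakespear", 1))
```

A real model replaces `toy_predict_next` with a sampled prediction, which is what the generation loop at the end of this notebook does.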

Imports

In [1]:
import tensorflow as tf
import numpy as np
import os
import time

Loading the training text

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt',
'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
# Read the text
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# Length of the text in characters
print(f'Text length: {len(text)} characters')

# First 250 characters of the text
print(text[:250])

# Unique characters
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
Text length: 1115394 characters
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

65 unique characters


Text processing

In [3]:
# Text processing
# Maps each character to a unique ID
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)

# The inverse mapping: converts IDs back to characters
chars_from_ids = tf.keras.layers.StringLookup(vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

# Given a list of IDs, rebuilds the text
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
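
The two lookup layers behave roughly like a pair of dictionaries. The following is a plain-Python sketch of that idea, not the StringLookup implementation; `encode`, `decode`, and the tiny vocabulary are hypothetical. It assumes the layer's default of one out-of-vocabulary slot at index 0 ("[UNK]"), which is why the model later uses a vocabulary one larger than the number of unique characters.

```python
UNK = "[UNK]"

# Tiny illustrative vocabulary (the notebook builds its own from the corpus).
vocab = sorted(set("hello world"))

# Index 0 is reserved for unknown characters, so real characters start at 1.
id_to_char = [UNK] + vocab
char_to_id = {ch: i for i, ch in enumerate(id_to_char)}

def encode(chars):
    """Map each character to its ID, falling back to 0 for unknown characters."""
    return [char_to_id.get(ch, 0) for ch in chars]

def decode(ids):
    """Map IDs back to characters and join them into a string."""
    return "".join(id_to_char[i] for i in ids)

ids = encode("hello")
print(ids, decode(ids))
print(encode("z"))  # 'z' is not in the toy vocabulary
```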

Building the training set

In [4]:
# Build the training set
# Example: for the word "Hello"
# Suppose seq_length = 4
# Then: input "Hell" and output "ello"
# The text must be split into chunks of length seq_length+1
# from_tensor_slices creates a dataset from the data
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
# Group the IDs into sequences of the desired length: seq_length+1
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
b"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
b'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'
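
The chunk-and-shift scheme can be sketched in plain Python on a short string. `chunk_text` is a hypothetical helper that mimics batching with the remainder dropped; `split_input_target` mirrors the "Hello" example from the comments above.

```python
def chunk_text(text, seq_length):
    """Split text into pieces of seq_length + 1 characters, dropping any leftover."""
    size = seq_length + 1
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]

def split_input_target(sequence):
    """Input is everything but the last character; target is shifted by one."""
    return sequence[:-1], sequence[1:]

chunks = chunk_text("Hello world!", seq_length=4)
pairs = [split_input_target(c) for c in chunks]
print(chunks)
print(pairs)
```

Each target character is the character that follows the corresponding input position, so a sequence of length seq_length + 1 yields seq_length training predictions.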


Generating input/target pairs and batches

In [5]:
# Given a sequence such as "Hello", returns input and target: "Hell" and "ello"

def split_input_target(sequence):
  input_text = sequence[:-1]
  target_text = sequence[1:]
  return input_text, target_text

# dataset holds (input, target) pairs built from the sequences
dataset = sequences.map(split_input_target)

# Create training batches
# Batch size
BATCH_SIZE = 64

# Buffer size used to shuffle the dataset
BUFFER_SIZE = 10000
dataset = (
  dataset
  .shuffle(BUFFER_SIZE)
  .batch(BATCH_SIZE, drop_remainder=True)
  .prefetch(tf.data.AUTOTUNE))
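
As a quick sanity check, the number of sequences and batches per epoch follows from the sizes printed earlier. This is only arithmetic on the values used in this notebook; both divisions round down because `drop_remainder=True` is set in both batching steps, and shuffling does not change the counts.

```python
text_length = 1115394   # characters in the corpus, as printed above
seq_length = 100
batch_size = 64

# Each training sequence consumes seq_length + 1 characters.
sequences_per_epoch = text_length // (seq_length + 1)

# Sequences are then grouped into batches of batch_size.
batches_per_epoch = sequences_per_epoch // batch_size

print(sequences_per_epoch, batches_per_epoch)
```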

Building the model

In [6]:
# Build the model

# Vocabulary size in characters
vocab_size = len(vocab)

# Embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [7]:
# Model class: Embedding -> GRU -> Dense
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                  return_sequences=True,
                                  return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
      x = inputs
      x = self.embedding(x, training=training)
      if states is None:
        states = self.gru.get_initial_state(x)
      x, states = self.gru(x, initial_state=states, training=training)
      x = self.dense(x, training=training)

      if return_state:
        return x, states
      else:
        return x

In [8]:
# Create the model
model = MyModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(ids_from_chars.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)
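
A rough parameter count for the Embedding -> GRU -> Dense stack can be worked out by hand. This sketch makes two assumptions: the vocabulary has 66 entries (the 65 characters plus the "[UNK]" entry added by StringLookup), and the GRU uses Keras's default formulation with two bias vectors per gate (`reset_after=True`). The result should be compared against `model.summary()` rather than taken as authoritative.

```python
vocab_size = 66        # assumed: 65 characters + "[UNK]"
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with input weights, recurrent weights and two biases.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output logit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```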

Compiling the model

  * Optimizer: Adam
  * Loss: Sparse Categorical Cross-Entropy
  * from_logits: whether the predictions are logits. By default they are treated as probabilities
    * Logits are raw, unnormalized values that are fed into a softmax
  * from_logits=True is used here because the Dense layer has no activation function
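
To make the role of `from_logits` concrete, here is a minimal NumPy sketch of sparse categorical cross-entropy computed directly from logits. `sparse_ce_from_logits` is a hypothetical helper written for illustration, not the Keras implementation; it applies a log-softmax to the raw scores and then picks the log-probability of the correct class.

```python
import numpy as np

def sparse_ce_from_logits(logits, labels):
    """Mean cross-entropy between integer labels and unnormalized logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the correct class of each example.
    picked = log_probs[np.arange(len(labels)), labels]
    return -picked.mean()

# With all-equal logits over 66 classes, every class gets probability 1/66.
uniform_logits = np.zeros((4, 66))
labels = np.array([0, 10, 20, 65])
loss = sparse_ce_from_logits(uniform_logits, labels)
print(loss, np.log(66))
```

An untrained model that spreads probability evenly over the vocabulary would therefore start with a loss close to the natural log of the vocabulary size.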

In [9]:
# The loss function is sparse categorical cross-entropy
# The model returns logits, so from_logits is set
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model
model.compile(optimizer='adam', loss=loss)

# Train
EPOCHS = 20
history = model.fit(dataset, epochs=EPOCHS)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


One-step generation class

In [21]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=2.0):
      super().__init__()
      self.temperature = temperature
      self.model = model
      self.chars_from_ids = chars_from_ids
      self.ids_from_chars = ids_from_chars

      # Create a mask to prevent "[UNK]" from being generated.
      skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
      sparse_mask = tf.SparseTensor(
          # Put a -inf at each bad index.
          values=[-float('inf')]*len(skip_ids),
          indices=skip_ids,
          # Match the shape to the vocabulary
          dense_shape=[len(ids_from_chars.get_vocabulary())])
      self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()
    
    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask
    
    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)
    
    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)
    
    # Return the characters and model state.
    return predicted_chars, states

one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

# Run in a loop to generate the text
start = time.time()
states = None
next_char = tf.constant(['LEONTES: I am happy'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*50)
print('\nRun time:', end - start)

LEONTES: I am happy brues bewempt; why
Havinate heir only poor madee ou,
But in whicE JDUTh Cameagom wills, Butwif Junipht-move, blows blinders,
And poorf your
sword elizate murtrey asks,
Where's of Grenzecom: thy Jastain's
Skilour's, aJdwiMs away?

ISABELLA:
Give me whom?
JBuoffer!' and go presser than you loved!
First, mulder merry jelley, nose, ougiqua;
Richard-nad interrup, celeted, methin my fault
Un'D; Baptus-armity,
Upon my noble votio's dount; blies own gons!
Lign-cive absance, madam lad, Wablist:
ReXephand-gety downright:
Wherein byrr face withal,
of Ladmas sha, nick and lousk'd, that of
Rome; I charge ye?
The effects sitoly or grmoulded mins
her-feel to my resertly's Gram.
Was it no wronced partons whom, lay
Iscands I little.

DUKE VILERDUCETH:
Oh: get Pife 'twere's now nichm's talky: vexart, good, true!
Worthy CimilPo. Is Mercutio about?
teesh,-dear lord? who?'t
Ou, Romer, Offorgh, nor pbace;
Where riddle moue? 'hippany.
IAmno as penO; alsoad unw;
Lebt's temp the umpertainme
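
The sample above was drawn with the OneStep default of temperature=2.0. Dividing the logits by a temperature above 1 flattens the sampling distribution, which tends to make the text noisier. The NumPy sketch below illustrates that effect, along with the -inf mask used to block "[UNK]". `softmax_with_temperature` and the example logits are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()   # numerical stability
    probs = np.exp(scaled)           # exp(-inf) evaluates to 0.0
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.0])

p_t1 = softmax_with_temperature(logits, temperature=1.0)
p_t2 = softmax_with_temperature(logits, temperature=2.0)
print(p_t1)
print(p_t2)  # flatter: less mass on the most likely entry

# Adding -inf to a logit removes that entry from sampling entirely.
mask = np.array([-np.inf, 0.0, 0.0])
p_masked = softmax_with_temperature(logits + mask, temperature=1.0)
print(p_masked)
```

Passing a lower value, for example `OneStep(model, chars_from_ids, ids_from_chars, temperature=1.0)`, would concentrate sampling on the model's higher-scoring characters.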