<a href="https://colab.research.google.com/github/teticio/aventuras-con-textos/blob/master/BERT_predict.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the missing words in a sentence with BERT

Pre-trained BERT models are trained on two unsupervised tasks: (1) guessing the words that are missing from a sentence and (2) deciding whether one sentence follows another. In this notebook, we put a pre-trained model to the test on the first task.

This notebook can be adapted to generate whole sentences of text at random. See [Bert Babble](https://colab.research.google.com/drive/1MxKZGtQ9SSBjTK5ArsZ5LKhkztzg52RV).
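To make the first task concrete, here is a minimal sketch in plain Python of what "hiding" words means: selected tokens are replaced with `[MASK]` and their positions are remembered so that predictions can be read back later. The helper `mask_tokens` and the toy sentence are assumptions made for illustration; they are not part of BERT.

```python
def mask_tokens(tokens, words_to_mask, mask_token="[MASK]"):
    """Mask the first occurrence of each word (hypothetical helper, not a BERT function)."""
    masked = list(tokens)
    positions = []
    for word in words_to_mask:
        if word in masked:
            index = masked.index(word)
            masked[index] = mask_token
            positions.append(index)
    return masked, positions

toy_tokens = "the cat sat on the mat".split()
toy_masked, toy_positions = mask_tokens(toy_tokens, ["cat", "mat"])
print(toy_masked)     # ['the', '[MASK]', 'sat', 'on', 'the', '[MASK]']
print(toy_positions)  # [1, 5]
```

The model is then asked to fill in each `[MASK]` using the surrounding context.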

### Download the pre-trained BERT model

In [29]:
from keras.utils import get_file

# https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
# https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip

url = 'https://storage.googleapis.com/bert_models/2019_05_30/' #@param {type : 'string'}
modelo = 'wwm_uncased_L-24_H-1024_A-16' #@param {type : 'string'}
get_file(modelo + '.zip', origin=url + modelo + '.zip', extract=True, archive_format="zip", cache_dir='./', cache_subdir='')

Downloading data from https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip


'./wwm_uncased_L-24_H-1024_A-16.zip'

### Install and import BERT

In [0]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
    sys.path.append('bert_repo')

# import python modules defined by BERT
import modeling as tfm
import tokenization as tft
import run_pretraining as rp

### Import the libraries

In [0]:
import tensorflow as tf
from keras.preprocessing import sequence
import numpy as np

model_dir = './' + modelo + '/'
vocab_file = model_dir + "vocab.txt"
bert_config_file = model_dir + "bert_config.json"
init_checkpoint = model_dir + "bert_model.ckpt"
max_seq_length = 512
max_predictions_per_seq = 20

### Configure BERT and prepare the tokenizer

In [0]:
convertir_a_minusculas = 'uncased' in modelo
bert_config = tfm.BertConfig.from_json_file(bert_config_file)
tokenizer = tft.FullTokenizer(vocab_file=vocab_file, do_lower_case=convertir_a_minusculas)
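The tokenizer splits words that are not in the vocabulary into smaller WordPiece units, with continuation pieces prefixed by `##`. The sketch below illustrates the greedy longest-match-first idea on a toy vocabulary. The function `wordpiece` and the vocabulary are made up for illustration, and the real `FullTokenizer` also handles lower-casing, punctuation and other details not shown here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word (illustrative sketch only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able", "watch", "##es"}  # made-up vocabulary
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("watches", toy_vocab))    # ['watch', '##es']
print(wordpiece("xyz", toy_vocab))        # ['[UNK]']
```

This matters further down: a word can only be masked as a single token if it appears whole in the tokenized sentence.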

### Define the model inputs

In [0]:
input_ids = tf.placeholder(name='input_ids', shape=(1, max_seq_length), dtype='int32')
input_mask = tf.placeholder(name='input_mask', shape=(1, max_seq_length), dtype='int32')
segment_ids = tf.placeholder(name='segment_ids', shape=(1, max_seq_length), dtype='int32')
masked_lm_positions = tf.placeholder(name='masked_lm_positions', shape=(1, max_predictions_per_seq), dtype='int32')
masked_lm_ids = tf.placeholder(name='masked_lm_ids', shape=(1, max_predictions_per_seq), dtype='int32')
masked_lm_weights = tf.placeholder(name='masked_lm_weights', shape=(1, max_predictions_per_seq), dtype='float32')

### Build the model

In [34]:
model = tfm.BertModel(config=bert_config,
                      is_training=False,
                      input_ids=input_ids,
                      input_mask=input_mask,
                      token_type_ids=segment_ids,
                      use_one_hot_embeddings=False)

W0816 14:47:27.435554 140376845076352 deprecation_wrapper.py:119] From bert_repo/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0816 14:47:27.452529 140376845076352 deprecation_wrapper.py:119] From bert_repo/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0816 14:47:27.479144 140376845076352 deprecation_wrapper.py:119] From bert_repo/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

W0816 14:47:30.494494 140376845076352 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.


### Define the model outputs

In [0]:
(masked_lm_loss,
 masked_lm_example_loss, masked_lm_log_probs) = rp.get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)
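The tensor `masked_lm_log_probs` holds, for each masked position, one log-probability per vocabulary entry. As far as we can tell, the BERT repository obtains these by applying a log-softmax to the vocabulary scores. The sketch below shows that operation in plain Python on made-up scores; `log_softmax` here is our own helper, not the TensorFlow function.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    highest = max(logits)
    log_total = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return [x - log_total for x in logits]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-word vocabulary
toy_log_probs = log_softmax(toy_logits)
toy_probs = [math.exp(x) for x in toy_log_probs]
print(toy_log_probs)   # all values are <= 0
print(sum(toy_probs))  # approximately 1.0
```

Because the logarithm is monotonic, the highest log-probability and the highest probability always point to the same word.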

### Initialize the weights from the pre-trained model checkpoint

In [0]:
(assignment_map,
 initialized_variable_names) = tfm.get_assignment_map_from_checkpoint(
    tf.trainable_variables(), init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

### Create the inputs

In [60]:
ejemplo = "Orbiting this at a distance of roughly  ninety-two million miles is  an utterly insignificant little blue green planet whose ape- descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea." #@param {type : "string"}
palabras_a_adivinar = ['miles', 'insignificant', 'planet', 'neat', 'primitive', 'digital'] #@param
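# Split the sentence into WordPiece tokens
tokens = tokenizer.tokenize(ejemplo)
# Position of the first occurrence of each word to guess; words that do not
# appear as a single token are skipped
masked_lm_positions_ = positions = [tokens.index(word) for word in palabras_a_adivinar if word in tokens]
# The true ids are not needed for prediction, so dummy values are used
masked_lm_ids_ = [0] * len(masked_lm_positions_)
masked_lm_weights_ = [1.0] * len(masked_lm_positions_)
# Replace each word to guess with the [MASK] token
for position in masked_lm_positions_:
    tokens[position] = '[MASK]'
input_ids_ = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
# A single segment, and every token so far is a real (non-padding) token
segment_ids_ = [0] * len(input_ids_)
input_mask_ = [1] * len(input_ids_)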

['orbiting', 'this', 'at', 'a', 'distance', 'of', 'roughly', 'ninety', '-', 'two', 'million', '[MASK]', 'is', 'an', 'utterly', '[MASK]', 'little', 'blue', 'green', '[MASK]', 'whose', 'ape', '-', 'descended', 'life', 'forms', 'are', 'so', 'amazingly', '[MASK]', 'that', 'they', 'still', 'think', '[MASK]', 'watches', 'are', 'a', 'pretty', '[MASK]', 'idea', '.']


### Pad the sequences to the same length

In [0]:
input_ids_ = sequence.pad_sequences([input_ids_], maxlen=max_seq_length, padding='post', value=0)
segment_ids_ = sequence.pad_sequences([segment_ids_], maxlen=max_seq_length, padding='post', value=0)
input_mask_ = sequence.pad_sequences([input_mask_], maxlen=max_seq_length, padding='post', value=0)
masked_lm_positions_ = sequence.pad_sequences([masked_lm_positions_], maxlen=max_predictions_per_seq, padding='post', value=0)
masked_lm_ids_ = sequence.pad_sequences([masked_lm_ids_], maxlen=max_predictions_per_seq, padding='post', value=0)
# The weights are floats, so override the default int32 dtype of pad_sequences
masked_lm_weights_ = sequence.pad_sequences([masked_lm_weights_], maxlen=max_predictions_per_seq, padding='post', value=0.0, dtype='float32')
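To see what post-padding does without running Keras, here is a small sketch in plain Python. `pad_post` is a hypothetical helper written for this illustration, and the token ids are made up. Note that it cuts over-long lists at the end, which is an assumption and may not match the truncation behaviour of `pad_sequences`.

```python
def pad_post(values, maxlen, pad_value=0):
    """Pad a list on the right up to maxlen (hypothetical helper, not a Keras function)."""
    # Assumption: an over-long list is simply cut at the end
    clipped = list(values[:maxlen])
    return clipped + [pad_value] * (maxlen - len(clipped))

toy_ids = [7, 8, 9]  # made-up token ids
padded_ids = pad_post(toy_ids, 6)
padded_mask = pad_post([1] * len(toy_ids), 6)
print(padded_ids)   # [7, 8, 9, 0, 0, 0]
print(padded_mask)  # [1, 1, 1, 0, 0, 0]
```

The mask is what tells the model which positions hold real tokens (1) and which are padding (0).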

### Run the model

In [0]:
with tf.Session() as sess:
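    # init_from_checkpoint overrides the variable initializers, so running
    # them loads the pre-trained weights from the checkpoint
    tf.global_variables_initializer().run()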
    matrix = sess.run(masked_lm_log_probs,
                      feed_dict={
                          input_ids: input_ids_,
                          input_mask: input_mask_,
                          segment_ids: segment_ids_,
                          masked_lm_positions: masked_lm_positions_,
                          masked_lm_ids: masked_lm_ids_,
                          masked_lm_weights: masked_lm_weights_,})

### Convert the tokens to text (predictions are in upper case)

In [63]:
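print(ejemplo)
print('->')
# Most likely token for each masked position (one row of matrix per position)
predictions = tokenizer.convert_ids_to_tokens(np.argmax(matrix, axis=-1))
text = ''
leave_space = False
for i, token in enumerate(tokens):
    if i in positions:
        # Show the predicted words in upper case
        token = predictions[positions.index(i)].upper()
    if token[0:2] == '##':
        # WordPiece continuation: attach it to the previous token
        text += token[2:]
    else:
        if leave_space and token[0].isalpha():
            text += ' '
        text += token
    # No space after an apostrophe or a hyphen
    leave_space = token[-1] != "'" and token[-1] != "-"
print(text)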

Orbiting this at a distance of roughly  ninety-two million miles is  an utterly insignificant little blue green planet whose ape- descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.
->
orbiting this at a distance of roughly ninety-two million KILOMETERS is an utterly PEACEFUL little blue green PLANET whose ape-descended life forms are so amazingly ADVANCED that they still think POCKET watches are a pretty GOOD idea.
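The cell above keeps only the single most likely word for each gap. It can also be informative to look at the runners-up. The sketch below ranks candidates from one row of log-probabilities in plain Python. The helper `top_k`, the four-word vocabulary and the probabilities are all invented for illustration and are not output from the model.

```python
import math

def top_k(log_probs, id_to_token, k=3):
    """Return the k most likely (token, probability) pairs for one masked position."""
    ranked = sorted(range(len(log_probs)), key=lambda i: log_probs[i], reverse=True)
    return [(id_to_token[i], math.exp(log_probs[i])) for i in ranked[:k]]

toy_vocab_list = ["miles", "kilometers", "years", "planet"]  # made-up vocabulary
toy_row = [math.log(p) for p in (0.30, 0.50, 0.15, 0.05)]    # made-up probabilities
best = top_k(toy_row, toy_vocab_list, k=2)
print([token for token, _ in best])  # ['kilometers', 'miles']
```

With the real model, a row of `matrix` and the tokenizer's vocabulary would take the place of the toy inputs.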
