# Predicting the missing words in a sentence with BERT

Pre-trained BERT models are trained on two unsupervised tasks: (1) guessing the words that have been masked out of a sentence and (2) deciding whether one sentence follows another. In this notebook we put a pre-trained model to the test on the first task.
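To make the first task concrete, here is a minimal sketch of how a masked example could be built from a tokenized sentence. The helper `mask_tokens` is hypothetical (it is not part of the BERT repository), and it masks the positions it is given, whereas real pre-training chooses the positions at random.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    masked = list(tokens)  # copy so the caller's list is left untouched
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, labels = mask_tokens(tokens, positions=[2])
print(masked)  # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
print(labels)  # {2: 'sat'}
```

The model sees `masked` and is asked to recover the tokens stored in `labels`.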

The same approach can be adapted to generate whole sentences of random text. See [Bert Babble](https://colab.research.google.com/drive/1MxKZGtQ9SSBjTK5ArsZ5LKhkztzg52RV).

### Download the pre-trained BERT model

In [0]:
# https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
# https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip

url = 'https://storage.googleapis.com/bert_models/2018_11_23/' #@param {type : 'string'}
modelo = 'multi_cased_L-12_H-768_A-12' #@param {type : 'string'}
command = 'wget '+ url + modelo + '.zip && unzip '+ modelo + '.zip'
!{command}

--2019-07-11 10:33:24--  https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.125.128, 2404:6800:4008:c02::80
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.125.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 662903077 (632M) [application/zip]
Saving to: ‘multi_cased_L-12_H-768_A-12.zip.1’


2019-07-11 10:33:28 (154 MB/s) - ‘multi_cased_L-12_H-768_A-12.zip.1’ saved [662903077/662903077]

Archive:  multi_cased_L-12_H-768_A-12.zip
replace multi_cased_L-12_H-768_A-12/bert_model.ckpt.meta? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
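The output above shows the archive being downloaded a second time (`.zip.1`) and `unzip` asking whether to overwrite existing files. One way to avoid that is to skip both steps when the model folder already exists. The sketch below uses only the standard library; `model_url` and `fetch_model` are hypothetical helper names, and the code assumes the archive unpacks into a folder named after the model, as the output above suggests.

```python
import os
import urllib.request
import zipfile

def model_url(base_url, model_name):
    """Build the archive URL the same way the cell above does."""
    return base_url + model_name + ".zip"

def fetch_model(base_url, model_name, dest_dir="."):
    """Download and extract the archive only if the model folder is missing."""
    model_dir = os.path.join(dest_dir, model_name)
    if os.path.isdir(model_dir):
        return model_dir  # already extracted: nothing to do
    zip_path = os.path.join(dest_dir, model_name + ".zip")
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(model_url(base_url, model_name), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return model_dir
```

Calling `fetch_model(url, modelo)` would then be safe to re-run in the same session.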


### Install and import BERT

In [0]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
    sys.path.append('bert_repo')

# import python modules defined by BERT
import modeling as tfm
import tokenization as tft
import run_pretraining as rp

W0711 10:33:37.661941 140706045007744 deprecation_wrapper.py:119] From bert_repo/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.



### Import the libraries

In [0]:
import tensorflow as tf
from keras.preprocessing import sequence
import numpy as np

model_dir = './' + modelo +'/'
vocab_file = model_dir + "vocab.txt"
bert_config_file = model_dir + "bert_config.json"
init_checkpoint = model_dir + "bert_model.ckpt"
max_seq_length = 512
max_predictions_per_seq = 20

Using TensorFlow backend.


### Configure BERT and prepare the tokenizer

In [0]:
convertir_a_minusculas = 'uncased' in modelo
bert_config = tfm.BertConfig.from_json_file(bert_config_file)
tokenizer = tft.FullTokenizer(vocab_file=vocab_file, do_lower_case=convertir_a_minusculas)

W0711 10:33:37.750049 140706045007744 deprecation_wrapper.py:119] From bert_repo/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
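
The tokenizer splits words that are not in the vocabulary into sub-word pieces marked with `##`. As an illustration of the idea, here is a simplified sketch of greedy longest-match-first splitting for a single word. The function `wordpiece` and the toy vocabulary are assumptions made for this example; the real `FullTokenizer` also handles punctuation, casing and Unicode normalization.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"BE", "##RT", "interesante", "##s", "hacer"}
print(wordpiece("BERT", toy_vocab))          # ['BE', '##RT']
print(wordpiece("interesantes", toy_vocab))  # ['interesante', '##s']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```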



### Define the model inputs

In [0]:
input_ids = tf.placeholder(name='input_ids', shape=(1, max_seq_length), dtype='int32')
input_mask = tf.placeholder(name='input_mask', shape=(1, max_seq_length), dtype='int32')
segment_ids = tf.placeholder(name='segment_ids', shape=(1, max_seq_length), dtype='int32')
masked_lm_positions = tf.placeholder(name='masked_lm_positions', shape=(1, max_predictions_per_seq), dtype='int32')
masked_lm_ids = tf.placeholder(name='masked_lm_ids', shape=(1, max_predictions_per_seq), dtype='int32')
masked_lm_weights = tf.placeholder(name='masked_lm_weights', shape=(1, max_predictions_per_seq), dtype='float32')

### Build the model

In [0]:
model = tfm.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False)

W0711 10:33:38.511967 140706045007744 deprecation_wrapper.py:119] From bert_repo/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0711 10:33:38.519322 140706045007744 deprecation_wrapper.py:119] From bert_repo/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0711 10:33:38.567062 140706045007744 deprecation_wrapper.py:119] From bert_repo/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

W0711 10:33:40.195948 140706045007744 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.


### Define the model outputs

In [0]:
(masked_lm_loss,
 masked_lm_example_loss, masked_lm_log_probs) = rp.get_masked_lm_output(
    bert_config, model.get_sequence_output(), model.get_embedding_table(),
    masked_lm_positions, masked_lm_ids, masked_lm_weights)
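
To give an intuition for what `masked_lm_log_probs` contains, the following NumPy sketch gathers the hidden vectors at the masked positions, scores them against the embedding table and applies a log-softmax. This is a simplification under stated assumptions: the function name is hypothetical, the data is random, and the extra transform layer, layer normalization and output bias that `get_masked_lm_output` applies are left out.

```python
import numpy as np

def masked_lm_log_probs_sketch(sequence_output, embedding_table, positions):
    """sequence_output: (seq_len, hidden); embedding_table: (vocab, hidden)."""
    gathered = sequence_output[positions]                  # (num_masked, hidden)
    logits = gathered @ embedding_table.T                  # (num_masked, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
seq_out = rng.normal(size=(8, 4))   # toy input: 8 tokens, hidden size 4
emb = rng.normal(size=(10, 4))      # toy vocabulary of 10 tokens
log_probs = masked_lm_log_probs_sketch(seq_out, emb, positions=[2, 5])
print(log_probs.shape)              # (2, 10)
```

Each row is a log-probability distribution over the vocabulary for one masked position, so exponentiating a row and summing it gives 1 up to rounding.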

### Initialize the weights from the pre-trained model checkpoint

In [0]:
(assignment_map,
 initialized_variable_names) = tfm.get_assignment_map_from_checkpoint(
    tf.trainable_variables(), init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

### Create the inputs

In [0]:
ejemplo = "Esto es un ejemplo de lo que se puede hacer con BERT. Es capaz de hacer cosas muy interesantes!" #@param {type : "string"}
palabras_a_adivinar = ['hacer'] #@param
tokens = tokenizer.tokenize(ejemplo)
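# Position of the first occurrence of each word to guess; words not found among the tokens are skipped
masked_lm_positions_ = [tokens.index(word) for word in palabras_a_adivinar if word in tokens]
# Dummy labels: they only affect the loss, which is not used here
masked_lm_ids_ = [0] * len(masked_lm_positions_)
masked_lm_weights_ = [1.0] * len(masked_lm_positions_)
# Replace the chosen tokens with the mask token
for position in masked_lm_positions_:
    tokens[position] = '[MASK]'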
input_ids_ = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
segment_ids_ = [0] * len(input_ids_)
input_mask_ = [1] * len(input_ids_)

['Esto', 'es', 'un', 'ejemplo', 'de', 'lo', 'que', 'se', 'puede', '[MASK]', 'con', 'BE', '##RT', '.', 'Es', 'capaz', 'de', 'hacer', 'cosas', 'muy', 'interesante', '##s', '!']
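
BERT's pre-training inputs are normally wrapped in the special tokens `[CLS]` and `[SEP]`, which the cell above does not add, and `tokens.index` only finds the first occurrence of each word (the second 'hacer' stays unmasked in the output above). The sketch below shows one possible variant that adds the special tokens and masks every occurrence. The helper `prepare_masked_input` is hypothetical, and whether it changes the predictions for a given sentence has not been tested here.

```python
def prepare_masked_input(tokens, words_to_guess, mask_token="[MASK]"):
    """Wrap tokens in [CLS]/[SEP] and mask every occurrence of the given words."""
    wrapped = ["[CLS]"] + list(tokens) + ["[SEP]"]
    # Positions are computed after wrapping, so they already account for [CLS].
    positions = [i for i, tok in enumerate(wrapped) if tok in words_to_guess]
    for i in positions:
        wrapped[i] = mask_token
    return wrapped, positions

toy_tokens = ["Es", "capaz", "de", "hacer", "cosas"]
wrapped, positions = prepare_masked_input(toy_tokens, ["hacer"])
print(wrapped)    # ['[CLS]', 'Es', 'capaz', 'de', '[MASK]', 'cosas', '[SEP]']
print(positions)  # [4]
```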


### Pad the sequences to a fixed length

In [0]:
input_ids_ = sequence.pad_sequences([input_ids_], maxlen=max_seq_length, padding='post', value=0)
segment_ids_ = sequence.pad_sequences([segment_ids_], maxlen=max_seq_length, padding='post', value=0)
input_mask_ = sequence.pad_sequences([input_mask_], maxlen=max_seq_length, padding='post', value=0)
masked_lm_positions_ = sequence.pad_sequences([masked_lm_positions_], maxlen=max_predictions_per_seq, padding='post', value=0)
masked_lm_ids_ = sequence.pad_sequences([masked_lm_ids_], maxlen=max_predictions_per_seq, padding='post', value=0)
masked_lm_weights_ = sequence.pad_sequences([masked_lm_weights_], maxlen=max_predictions_per_seq, padding='post', value=0, dtype='float32')
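
For readers without Keras at hand, the padding step can be approximated in plain NumPy. This is a sketch, not a drop-in replacement: `pad_post` is a hypothetical helper, and unlike `pad_sequences` with its default settings it truncates over-long inputs at the end rather than at the start.

```python
import numpy as np

def pad_post(values, maxlen, pad_value=0, dtype="int32"):
    """Right-pad (or truncate) a 1-D list to maxlen and add a batch axis."""
    padded = np.full((1, maxlen), pad_value, dtype=dtype)
    trimmed = values[:maxlen]
    padded[0, :len(trimmed)] = trimmed
    return padded

print(pad_post([101, 7, 9], maxlen=6).tolist())             # [[101, 7, 9, 0, 0, 0]]
print(pad_post([1.0], maxlen=3, dtype="float32").tolist())  # [[1.0, 0.0, 0.0]]
```

The leading axis of size 1 matches the batch dimension of the placeholders defined earlier.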

### Run the model

In [0]:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    matrix = sess.run(masked_lm_log_probs,
                      feed_dict={
                          input_ids: input_ids_,
                          input_mask: input_mask_,
                          segment_ids: segment_ids_,
                          masked_lm_positions: masked_lm_positions_,
                          masked_lm_ids: masked_lm_ids_,
                          masked_lm_weights: masked_lm_weights_,})
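
Each row of `matrix` holds log-probabilities over the vocabulary for one prediction slot, so it is possible to look beyond the single best token. The sketch below lists the top candidates for one row. The helper `top_k_tokens`, the toy vocabulary and the toy probabilities are assumptions made for illustration only.

```python
import numpy as np

def top_k_tokens(log_probs_row, id_to_token, k=3):
    """Return the k most likely (token, probability) pairs for one masked position."""
    best_ids = np.argsort(log_probs_row)[::-1][:k]  # ids sorted from most to least likely
    return [(id_to_token[i], float(np.exp(log_probs_row[i]))) for i in best_ids]

toy_vocab = ["[PAD]", "hacer", "ver", "decir", "ser"]
toy_row = np.log(np.array([0.05, 0.6, 0.2, 0.1, 0.05]))
for token, prob in top_k_tokens(toy_row, toy_vocab):
    print(token, round(prob, 2))
```

With the real outputs, `matrix[0]` would be the row for the first masked position, and `tokenizer.convert_ids_to_tokens` could take the place of the list lookup. Rows beyond the number of masked words correspond to padding and can be ignored.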

### Convert the tokens to text (predictions are shown in upper case)

In [0]:
i = 0
text = ''
apostrophe = False
for token in tokens:
    if token == '[MASK]':
        token = tokenizer.convert_ids_to_tokens(np.argmax(matrix, axis=-1))[i].upper()
        i += 1
    if token[0:2] == '##':
        text += token[2:]
    else:
        if token[0].isalpha() and not apostrophe:
            text += ' '
        text += token
    apostrophe = token[-1] == "'"
print(text)

 Esto es un ejemplo de lo que se puede HACER con BERT. Es capaz de hacer cosas muy interesantes!
