<a href="https://colab.research.google.com/github/teticio/aventuras-con-textos/blob/master/BERT_entiende.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Answer questions about a text

BERT has achieved state-of-the-art results on SQuAD (the Stanford Question Answering Dataset), a task that consists of identifying the span of a text that answers a given question.

For example:

Q: *Where do water droplets collide with ice crystals to form precipitation?*

A: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals **within a cloud**. Short, intense periods of rain in scattered locations are called “showers”.

We are going to use a model that has been fine-tuned on this task to answer questions about arbitrary texts.

### Download the BERT model fine-tuned on SQuAD

In [3]:
# from https://github.com/Maaarcocr
!test -f squad_bert_base.tgz || wget https://s3.eu-west-2.amazonaws.com/nlpfiles/squad_bert_base.tgz
!test -e squad_bert_base || tar -xvf squad_bert_base.tgz

squad_bert_base/
squad_bert_base/bert_config.json
squad_bert_base/model.ckpt-14599.meta
squad_bert_base/vocab.txt
squad_bert_base/model.ckpt-14599.data-00000-of-00001
squad_bert_base/model.ckpt-14599.index


### Install and import BERT

In [4]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
    sys.path.append('bert_repo')

# import python modules defined by BERT
import modeling as tfm
import tokenization as tft
import run_squad as rs

Cloning into 'bert_repo'...
remote: Enumerating objects: 333, done.
remote: Total 333 (delta 0), reused 0 (delta 0), pack-reused 333
Receiving objects: 100% (333/333), 282.45 KiB | 945.00 KiB/s, done.
Resolving deltas: 100% (183/183), done.



### Import libraries

In [0]:
import os
import numpy as np
import tensorflow as tf
from keras.preprocessing import sequence
from keras.utils import get_file
from IPython.display import HTML, display

model_dir = './squad_bert_base/'
vocab_file = model_dir + "vocab.txt"
bert_config_file = model_dir + "bert_config.json"
init_checkpoint = model_dir + "model.ckpt-14599"
max_seq_length = 512

### Configure BERT and prepare the tokenizer

In [6]:
# convert to lower case?
convertir_a_minusculas = True
bert_config = tfm.BertConfig.from_json_file(bert_config_file)
tokenizer = tft.FullTokenizer(vocab_file=vocab_file,
                              do_lower_case=convertir_a_minusculas)




### Define the inputs to the model

In [0]:
input_ids = tf.placeholder(name='input_ids',
                           shape=(1, max_seq_length),
                           dtype='int32')
input_mask = tf.placeholder(name='input_mask',
                            shape=(1, max_seq_length),
                            dtype='int32')
segment_ids = tf.placeholder(name='segment_ids',
                             shape=(1, max_seq_length),
                             dtype='int32')

### Build the model

In [0]:
(start_logits, end_logits) = rs.create_model(bert_config=bert_config,
                                             is_training=False,
                                             input_ids=input_ids,
                                             input_mask=input_mask,
                                             segment_ids=segment_ids,
                                             use_one_hot_embeddings=False)

### Initialize the weights from the fine-tuned model checkpoint

In [0]:
(assignment_map,
 initialized_variable_names) = tfm.get_assignment_map_from_checkpoint(
     tf.trainable_variables(), init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

### Create the inputs

In [0]:
# context
contexto = "In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”."  #@param {type : "string"}
# question
pregunta = "Where do water droplets collide with ice crystals to form precipitation?"  #@param
tokens = ['[CLS]'] + tokenizer.tokenize(pregunta) + ['[SEP]'] + tokenizer.tokenize(contexto) + ['[SEP]']
input_ids_ = tokenizer.convert_tokens_to_ids(tokens)
len_seg = tokens.index('[SEP]') + 1
segment_ids_ = [0] * len_seg + [1] * (len(input_ids_) - len_seg)
input_mask_ = [1] * len(input_ids_)
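
The layout built in the cell above can be illustrated without loading BERT. The following is a minimal sketch with hand-written toy tokens; `build_inputs` is a hypothetical helper name that is not part of the BERT repository, and the real tokens would come from `tokenizer.tokenize`.

```python
def build_inputs(question_tokens, context_tokens):
    """Lay tokens out as [CLS] question [SEP] context [SEP]."""
    tokens = (['[CLS]'] + question_tokens + ['[SEP]'] +
              context_tokens + ['[SEP]'])
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    len_seg = len(question_tokens) + 2
    segment_ids = [0] * len_seg + [1] * (len(tokens) - len_seg)
    # Every real (non-padding) token gets a mask value of 1.
    input_mask = [1] * len(tokens)
    return tokens, segment_ids, input_mask


# Toy tokens, assumed for illustration only.
toy_tokens, toy_segment_ids, toy_input_mask = build_inputs(
    ['where', 'is', 'it', '?'], ['within', 'a', 'cloud', '.'])
for tok, seg, mask in zip(toy_tokens, toy_segment_ids, toy_input_mask):
    print(f'{tok:8} segment={seg} mask={mask}')
```

The segment ids are what let the model tell the question apart from the context, which is why the boundary is placed just after the first `[SEP]`.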

### Pad the sequences to the same length

In [0]:
input_ids_ = sequence.pad_sequences([input_ids_],
                                    maxlen=max_seq_length,
                                    padding='post',
                                    value=0)
segment_ids_ = sequence.pad_sequences([segment_ids_],
                                      maxlen=max_seq_length,
                                      padding='post',
                                      value=0)
input_mask_ = sequence.pad_sequences([input_mask_],
                                     maxlen=max_seq_length,
                                     padding='post',
                                     value=0)
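
For reference, post-padding amounts to appending zeros on the right until the sequence reaches `max_seq_length`. The sketch below uses a hypothetical helper, `pad_post`, and made-up token ids. Unlike `pad_sequences`, whose default `truncating='pre'` silently drops items from the front of an over-long sequence, this version raises an error instead.

```python
def pad_post(ids, maxlen, value=0):
    """Right-pad a list to exactly maxlen items; refuse to truncate."""
    if len(ids) > maxlen:
        raise ValueError(
            f'sequence of length {len(ids)} exceeds maxlen={maxlen}')
    return ids + [value] * (maxlen - len(ids))


example_ids = [101, 2073, 102, 2306, 102]  # made-up ids, for illustration
padded_ids = pad_post(example_ids, maxlen=8)
padded_mask = pad_post([1] * len(example_ids), maxlen=8)
print(padded_ids)
print(padded_mask)
```

The padded mask is what tells the model which positions hold real tokens and which are filler.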

### Invoke the model

In [0]:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    tup = sess.run(
        (start_logits, end_logits),
        feed_dict={
            input_ids: input_ids_,
            input_mask: input_mask_,
            segment_ids: segment_ids_,
        })

In [0]:
# Most likely start and end positions of the answer (indices into `tokens`)
start = np.argmax(tup[0])
end = np.argmax(tup[1])
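
Taking the two argmaxes independently can yield an end position that comes before the start, or a span that falls inside the question. One possible remedy, sketched below in plain Python, is to score every valid span by the sum of its start and end logits and keep the best. The function `best_span` and its `max_answer_length` parameter are assumptions made for this illustration, and the logits are made up.

```python
def best_span(start_logits, end_logits, first_index, num_tokens,
              max_answer_length=30):
    """Return the (start, end) pair with the highest summed logit,
    subject to first_index <= start <= end < num_tokens."""
    best, best_score = None, float('-inf')
    for s in range(first_index, num_tokens):
        for e in range(s, min(s + max_answer_length, num_tokens)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits in which the independent argmaxes disagree:
# the start peaks at index 4 but the end peaks at index 3.
toy_start_logits = [0.1, 0.2, 0.0, 0.3, 2.0, 0.5, 0.1]
toy_end_logits = [0.0, 0.1, 0.2, 3.0, 0.4, 0.3, 1.5]
span = best_span(toy_start_logits, toy_end_logits,
                 first_index=3, num_tokens=7)
print(span)
```

With the notebook's variables, a call along the lines of `best_span(tup[0][0], tup[1][0], len_seg, len(tokens) - 1)` would restrict the search to the context tokens and exclude the final `[SEP]`.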

### Convert tokens to text (the predicted answer is shown in bold)

In [16]:
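text = '<p>'
leave_space = False
for i, token in enumerate(tokens):
    # Open the bold tag at the predicted start token
    if i == start:
        text += "<b>"
    # Close it just after the predicted end token
    if i == end + 1:
        text += "</b>"
    # Special tokens are not part of the text
    if token == '[CLS]' or token == '[SEP]':
        continue
    if token[0:2] == '##':
        # WordPiece continuation: glue it onto the previous token
        text += token[2:]
    else:
        # Put a space before words, but not before punctuation
        if leave_space and token[0].isalnum():
            text += ' '
        text += token
    # No space after an apostrophe or a hyphen
    leave_space = token[-1] != "'" and token[-1] != "-"
text += '</p>'
display(HTML(text))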
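
The core of this conversion is merging WordPiece continuation tokens (those prefixed with `##`) back into whole words. A stripped-down sketch without the HTML or punctuation handling might look as follows; `wordpieces_to_text` is a hypothetical helper name and the tokens are hand-written.

```python
def wordpieces_to_text(tokens):
    """Join WordPiece tokens into space-separated words."""
    words = []
    for token in tokens:
        # Skip BERT's special tokens.
        if token in ('[CLS]', '[SEP]'):
            continue
        if token.startswith('##') and words:
            # Continuation piece: append to the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return ' '.join(words)


print(wordpieces_to_text(
    ['[CLS]', 'gr', '##au', '##pel', 'and', 'hail', '[SEP]']))
```

This simplified version puts a space before every token, including punctuation, which is what the `leave_space` logic in the cell above avoids.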

### Now let's look for the answer to a question in a longer text...

In [19]:
# Harry Potter and the Goblet of Fire
get_file(
    os.getcwd() + '/example.txt',
    origin=
    'https://docs.google.com/uc?export=download&id=10OhbIQHNJrtBiKer8tP_LbxjASqItNzZ'
)

Downloading data from https://docs.google.com/uc?export=download&id=10OhbIQHNJrtBiKer8tP_LbxjASqItNzZ


'/content/example.txt'
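
A whole book is presumably far longer than the 512 tokens allowed by `max_seq_length`, so the context has to be processed in pieces. One common approach, sketched here under that assumption, is to slide an overlapping window over the context tokens and run the model once per window. The names `sliding_windows`, `window_size` and `stride` are hypothetical; `run_squad.py` in the BERT repository exposes a comparable `doc_stride` flag.

```python
def sliding_windows(context_tokens, window_size, stride):
    """Split tokens into overlapping windows.

    Returns a list of (offset, window) pairs, where offset is the
    index of the window's first token in the full context.
    """
    windows = []
    start = 0
    while True:
        windows.append((start, context_tokens[start:start + window_size]))
        # Stop once the current window reaches the end of the context.
        if start + window_size >= len(context_tokens):
            break
        start += stride
    return windows


toy_context = [f't{i}' for i in range(10)]  # stand-in for real tokens
toy_windows = sliding_windows(toy_context, window_size=4, stride=2)
for offset, window in toy_windows:
    print(offset, window)
```

In this setting the window budget would be roughly `max_seq_length` minus the number of question tokens minus 3, to leave room for `[CLS]` and the two `[SEP]` tokens. Choosing a stride smaller than the window makes consecutive windows overlap, which reduces the chance that an answer is cut in two.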

In [0]:
pregunta = 'Who killed Cedric?'