
# Answer questions about a text

BERT achieved state-of-the-art results on SQuAD (the Stanford Question Answering Dataset), a task that consists of identifying the span of a text that answers a given question.

For example (the answer is the span shown in bold):

Q: *Where do water droplets collide with ice crystals to form precipitation?*

A: In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals **within a cloud**. Short, intense periods of rain in scattered locations are called “showers”.

We are going to use a model that has been fine-tuned on this task to answer questions about arbitrary texts.

### Download the BERT model fine-tuned on SQuAD

In [1]:
# from https://github.com/Maaarcocr
!test -f squad_bert_base.tgz || wget https://s3.eu-west-2.amazonaws.com/nlpfiles/squad_bert_base.tgz
!test -e squad_bert_base || tar -xvf squad_bert_base.tgz

--2019-10-02 20:05:23--  https://s3.eu-west-2.amazonaws.com/nlpfiles/squad_bert_base.tgz
Resolving s3.eu-west-2.amazonaws.com (s3.eu-west-2.amazonaws.com)... 52.95.148.56
Connecting to s3.eu-west-2.amazonaws.com (s3.eu-west-2.amazonaws.com)|52.95.148.56|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1313280000 (1,2G) [application/x-compressed]
Saving to: ‘squad_bert_base.tgz’


2019-10-02 20:07:08 (12,0 MB/s) - ‘squad_bert_base.tgz’ saved [1313280000/1313280000]

squad_bert_base/
squad_bert_base/bert_config.json
squad_bert_base/model.ckpt-14599.meta
squad_bert_base/vocab.txt
squad_bert_base/model.ckpt-14599.data-00000-of-00001
squad_bert_base/model.ckpt-14599.index


### Install and import BERT

In [1]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
    sys.path.append('bert_repo')

# import python modules defined by BERT
import modeling as tfm
import tokenization as tft
import run_squad as rs

W1002 20:22:11.620714 140572954515264 deprecation_wrapper.py:119] From bert_repo/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.



### Import the libraries

In [2]:
import os
import numpy as np
import tensorflow as tf
from keras.preprocessing import sequence
from keras.utils import get_file
from scipy.special import softmax
from IPython.display import HTML, display

model_dir = './squad_bert_base/'
vocab_file = model_dir + "vocab.txt"
bert_config_file = model_dir + "bert_config.json"
init_checkpoint = model_dir + "model.ckpt-14599"
max_seq_length = 512

Using TensorFlow backend.


### Configure BERT and prepare the tokenizer

In [3]:
# convert to lower case?
convertir_a_minusculas = True
bert_config = tfm.BertConfig.from_json_file(bert_config_file)
tokenizer = tft.FullTokenizer(vocab_file=vocab_file,
                              do_lower_case=convertir_a_minusculas)

W1002 20:22:13.978984 140572954515264 deprecation_wrapper.py:119] From bert_repo/modeling.py:93: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
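
The tokenizer splits words into WordPiece sub-tokens, which is why some tokens later in the notebook start with `##`. The following is a minimal sketch of greedy longest-match-first splitting, assuming a made-up vocabulary. The names `wordpiece` and `toy_vocab` are illustrative only; the real `FullTokenizer` also cleans the text, splits punctuation and optionally lower-cases it before this step.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-token matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# hypothetical vocabulary, not the one shipped in vocab.txt
toy_vocab = {"precipitation", "drop", "##let", "##s", "gra", "##up", "##el"}

print(wordpiece("droplets", toy_vocab))  # ['drop', '##let', '##s']
print(wordpiece("graupel", toy_vocab))   # ['gra', '##up', '##el']
print(wordpiece("hail", toy_vocab))      # ['[UNK]']
```

With the real vocabulary the splits will differ; the point is only that a `##` prefix marks a piece that continues the previous token.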



### Define the inputs to the model

In [4]:
input_ids = tf.placeholder(name='input_ids',
                           shape=(None, max_seq_length),
                           dtype='int32')
input_mask = tf.placeholder(name='input_mask',
                            shape=(None, max_seq_length),
                            dtype='int32')
segment_ids = tf.placeholder(name='segment_ids',
                             shape=(None, max_seq_length),
                             dtype='int32')

### Build the model

In [5]:
(start_logits, end_logits) = rs.create_model(bert_config=bert_config,
                                             is_training=False,
                                             input_ids=input_ids,
                                             input_mask=input_mask,
                                             segment_ids=segment_ids,
                                             use_one_hot_embeddings=False)

W1002 20:22:29.883412 140572954515264 deprecation_wrapper.py:119] From bert_repo/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W1002 20:22:29.885444 140572954515264 deprecation_wrapper.py:119] From bert_repo/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W1002 20:22:29.904827 140572954515264 deprecation_wrapper.py:119] From bert_repo/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

W1002 20:22:30.373260 140572954515264 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.


### Initialize the weights from the checkpoint of the fine-tuned model

In [6]:
(assignment_map,
 initialized_variable_names) = tfm.get_assignment_map_from_checkpoint(
     tf.trainable_variables(), init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

### Create the inputs

In [56]:
# context
contexto = "In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”."  #@param {type : "string"}
# question
pregunta = "Where do water droplets collide with ice crystals to form precipitation?"  #@param {type : "string"}
tokens = ['[CLS]'] + tokenizer.tokenize(pregunta) + ['[SEP]'] + tokenizer.tokenize(contexto) + ['[SEP]']
input_ids_ = tokenizer.convert_tokens_to_ids(tokens)
len_seg = tokens.index('[SEP]') + 1
segment_ids_ = [0] * len_seg + [1] * (len(input_ids_) - len_seg)
input_mask_ = [1] * len(input_ids_)
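
To make the layout of these three inputs easier to see, here is a small sketch that repeats the same steps on a handful of hand-written tokens. The helper `build_inputs` and the `_demo` names are assumptions made for illustration and are not part of the notebook or of the BERT repository.

```python
def build_inputs(question_tokens, context_tokens):
    """Lay out [CLS] question [SEP] context [SEP] with segment ids and mask."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # segment 0 runs up to and including the first [SEP]; the rest is segment 1
    len_seg = tokens.index("[SEP]") + 1
    segment_ids = [0] * len_seg + [1] * (len(tokens) - len_seg)
    # every real token gets a 1; padding added later gets a 0
    input_mask = [1] * len(tokens)
    return tokens, segment_ids, input_mask


tokens_demo, segments_demo, mask_demo = build_inputs(
    ["where", "?"], ["in", "a", "cloud"])
for tok, seg in zip(tokens_demo, segments_demo):
    print(f"{tok:>6}  segment {seg}")
```

The question and its separator fall in segment 0, and the context with the closing separator falls in segment 1.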

### Make the sequences the same length

In [57]:
input_ids_ = sequence.pad_sequences([input_ids_],
                                    maxlen=max_seq_length,
                                    padding='post',
                                    value=0)
segment_ids_ = sequence.pad_sequences([segment_ids_],
                                      maxlen=max_seq_length,
                                      padding='post',
                                      value=0)
input_mask_ = sequence.pad_sequences([input_mask_],
                                     maxlen=max_seq_length,
                                     padding='post',
                                     value=0)
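
For a single sequence, the padding above amounts to appending zeros up to `max_seq_length`. The sketch below mirrors that in plain Python; `pad_post` is a hypothetical helper, not a Keras function. It also reproduces one detail worth knowing: with its default `truncating='pre'`, `pad_sequences` drops tokens from the front of a sequence that is too long, which would remove `[CLS]` and part of the question.

```python
def pad_post(ids, max_len, value=0):
    """Pad on the right; over-long input keeps only its last max_len items."""
    ids = ids[-max_len:]  # mimics Keras' default truncating='pre'
    return ids + [value] * (max_len - len(ids))


print(pad_post([101, 7, 102], 5))  # [101, 7, 102, 0, 0]
print(pad_post([1, 2, 3, 4], 3))   # [2, 3, 4]
```

A question plus context longer than `max_seq_length` tokens is therefore better split into chunks than passed through padding as is.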

### Invoke the model

In [58]:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    tup = sess.run(
        (start_logits, end_logits),
        feed_dict={
            input_ids: input_ids_,
            input_mask: input_mask_,
            segment_ids: segment_ids_,
        })

In [59]:
# most likely start and end positions (the batch holds a single sequence)
start = np.argmax(tup[0])
end = np.argmax(tup[1])
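
Taking the two argmax values independently can occasionally produce an end position that comes before the start. One possible alternative, sketched here under stated assumptions, is to pick the pair with the highest combined score among pairs where the start does not come after the end. The function `best_span`, its `max_answer_len` parameter and the scores are invented for illustration and are not what the notebook does.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) with start <= end and the highest summed score."""
    best = (float("-inf"), 0, 0)
    for s, s_score in enumerate(start_scores):
        # only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[0]:
                best = (score, s, e)
    return best[1], best[2]


# made-up scores: the independent argmax would give start=2, end=1
start_demo = [0.1, 0.2, 3.0, 0.3, 2.5]
end_demo = [0.0, 4.0, 0.5, 2.0, 1.0]
print(best_span(start_demo, end_demo))  # (2, 3)
```

The double loop is quadratic in the sequence length, which is acceptable for 512 positions but could be restricted to the top few candidates if speed matters.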

### Convert the tokens to text (the predicted answer is shown in bold)

In [60]:
text = '<p>'
leave_space = False
for i, token in enumerate(tokens):
    if i == start:
        text += "<b>"
    if i == end + 1:
        text += "</b>"
    if token == '[CLS]' or token == '[SEP]':
        continue
    if token[0:2] == '##':
        text += token[2:]
    else:
        if leave_space and token[0].isalnum():
            text += ' '
        text += token
    leave_space = token[-1] != "'" and token[-1] != "-"
text += '</p>'
display(HTML(text))
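
The same joining rules can be wrapped in a small function that returns plain text, which is convenient for checking them outside a notebook. This is a sketch only: `join_wordpieces` is a hypothetical name and the token list is hand-written.

```python
def join_wordpieces(tokens):
    """Rebuild readable text from WordPiece tokens, skipping [CLS] and [SEP]."""
    text = ""
    leave_space = False
    for token in tokens:
        if token in ("[CLS]", "[SEP]"):
            continue
        if token.startswith("##"):
            text += token[2:]  # glue a continuation piece to the previous one
        else:
            # put a space before words, but not before punctuation
            if leave_space and token[0].isalnum():
                text += " "
            text += token
        # no space after an apostrophe or a hyphen
        leave_space = token[-1] not in ("'", "-")
    return text


print(join_wordpieces(
    ["[CLS]", "within", "a", "cloud", ".", "drop", "##let", "##s", "[SEP]"]))
# within a cloud. droplets
```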

### Now let's look for the answer to a question in a longer text...

In [13]:
# Harry Potter and the Goblet of Fire
get_file(
    os.getcwd() + '/example.txt',
    origin=
    'https://docs.google.com/uc?export=download&id=10OhbIQHNJrtBiKer8tP_LbxjASqItNzZ'
)

Downloading data from https://docs.google.com/uc?export=download&id=10OhbIQHNJrtBiKer8tP_LbxjASqItNzZ


'/home/teticio/ML/aventuras-con-textos/example.txt'

In [198]:
with open(os.getcwd() + '/example.txt', 'rt') as file:
    text = file.read()

In [199]:
tokens = tokenizer.tokenize(text)

In [218]:
pregunta = "Where does Harry go to school?"  #@param {type : "string"}

In [219]:
batch_size = 32  #@param {'type' : 'integer'}
q = tokenizer.tokenize(pregunta)
# tokens left for the text after [CLS], the question and two [SEP]
chunk_size = max_seq_length - 3 - len(q)
stride = chunk_size - chunk_size // 2  # consecutive chunks overlap by half
results = []

# start offset of each chunk; the last one is aligned with the end of the text
last_start = max(len(tokens) - chunk_size, 0)
offsets = list(range(0, last_start, stride)) + [last_start]
pad = dict(maxlen=max_seq_length, padding='post', value=0)

with tf.Session() as sess:
    # load the checkpoint weights once rather than once per batch
    tf.global_variables_initializer().run()
    for b in range(0, len(offsets), batch_size):
        index = offsets[b:b + batch_size]
        input_ids_, segment_ids_, input_mask_ = [], [], []
        for i in index:
            chunk = (['[CLS]'] + q + ['[SEP]'] + tokens[i:i + chunk_size] +
                     ['[SEP]'])
            ids = tokenizer.convert_tokens_to_ids(chunk)
            len_seg = chunk.index('[SEP]') + 1
            input_ids_.append(ids)
            segment_ids_.append([0] * len_seg + [1] * (len(ids) - len_seg))
            input_mask_.append([1] * len(ids))

        # the final batch may hold fewer than batch_size chunks
        tup = sess.run(
            (start_logits, end_logits),
            feed_dict={
                input_ids: sequence.pad_sequences(input_ids_, **pad),
                input_mask: sequence.pad_sequences(input_mask_, **pad),
                segment_ids: sequence.pad_sequences(segment_ids_, **pad),
            })

        for k, offset in enumerate(index):
            prob = np.max(softmax(tup[0][k])) * np.max(softmax(tup[1][k]))
            start = np.argmax(tup[0][k])
            end = np.argmax(tup[1][k])
            # skip predictions where both positions fall on [CLS]
            if start > 0 or end > 0:
                results += [(prob, offset, start, end)]

# most confident answers first
results = sorted(results, key=lambda x: -x[0])
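
The long text is scanned with windows that overlap by half, so an answer cut by one window boundary should fall whole inside a neighbouring window. The sketch below shows, on small numbers, how the window offsets can be computed and how a position inside a chunk maps back to the full token list. Both helper names are assumptions introduced here for illustration.

```python
def chunk_starts(n_tokens, chunk_size):
    """Start offsets of half-overlapping windows that cover n_tokens tokens."""
    stride = chunk_size - chunk_size // 2
    last = max(n_tokens - chunk_size, 0)  # final window ends with the text
    return list(range(0, last, stride)) + [last]


def to_document_index(offset, position, question_len):
    """Map a position inside a chunk to an index into the full token list.

    A chunk is laid out as [CLS] + question + [SEP] + text window + [SEP],
    so the text window begins at position question_len + 2.
    """
    return offset + position - (question_len + 2)


print(chunk_starts(10, 4))  # [0, 2, 4, 6]
print(to_document_index(offset=6, position=5, question_len=3))  # 6
```

Aligning the last window with the end of the text keeps every window the same length, at the cost of a larger overlap for the final one.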

In [220]:
display(HTML('<p><i>' + pregunta + '</i></p>'))
# a position inside a chunk is shifted by [CLS] + question + [SEP]
shift = len(q) + 2
# show the five most confident answers
for prob, offset, start, end in results[:5]:
    text = f'<p>[{prob:.2f}] <b>'
    leave_space = False
    for token in tokens[offset + start - shift:offset + end + 1 - shift]:
        if token[0:2] == '##':
            text += token[2:]
        else:
            if leave_space and token[0].isalnum():
                text += ' '
            text += token
        leave_space = token[-1] != "'" and token[-1] != "-"
    text += '</b></p>'
    display(HTML(text))
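
The number in square brackets is the product of the highest softmax probability among the start positions and the highest among the end positions. A minimal sketch of that calculation in plain Python follows; `softmax_list` and `span_confidence` are illustrative names and the logits are invented. Because the softmax is taken separately within each chunk, the value is probably best read as a rough ranking heuristic rather than a calibrated probability.

```python
import math


def softmax_list(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_confidence(start_logits, end_logits):
    """Product of the top start probability and the top end probability."""
    return max(softmax_list(start_logits)) * max(softmax_list(end_logits))


# flat logits over two positions: each maximum is 0.5, so the product is 0.25
print(f"{span_confidence([0.0, 0.0], [0.0, 0.0]):.2f}")  # 0.25
# sharply peaked logits give a value close to 1
print(f"{span_confidence([10.0, 0.0, 0.0], [0.0, 10.0, 0.0]):.2f}")  # 1.00
```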