<a href="https://colab.research.google.com/github/teticio/aventuras-con-textos/blob/master/Dr%20Bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dr Bert
Vamos a seguir la tradicción de [ELIZA](https://en.wikipedia.org/wiki/ELIZA) y crear un psicoterapeuta con inteligencia artificial. Vamos a aprovechar la capacidad que tiene el modelo de BERT de reconocer frases consecutivas.

# Dr Bert
We will follow the tradition of [ELIZA](https://en.wikipedia.org/wiki/ELIZA) and create a psychotherapist using artificial intelligence. We will take advantage of the ability of the BERT model to recognize consecutive sentences.

### Importar las librerías

### Import libraries

In [1]:
# instalar BERT
# install BERT
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if not 'bert_repo' in sys.path:
    sys.path += ['bert_repo']

# import python modules defined by BERT
# import módulos de python de BERT
import tokenization

Cloning into 'bert_repo'...
remote: Enumerating objects: 333, done.[K
remote: Total 333 (delta 0), reused 0 (delta 0), pack-reused 333
Receiving objects: 100% (333/333), 282.45 KiB | 510.00 KiB/s, done.
Resolving deltas: 100% (183/183), done.


In [2]:
import os
import random
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K
import keras.layers as layers
from keras.preprocessing import sequence
from keras.models import Model, Sequential, load_model
from keras.layers import Input, Dense, Dropout
from keras.engine import Layer
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.utils import get_file, to_categorical
from scipy.special import softmax

os.environ['TFHUB_CACHE_DIR'] = './tfhub'
# limit of words in the sequences
limite_de_palabras_en_la_secuencia = 512  #@param {type : "number"}

Using TensorFlow backend.


In [0]:
sess = tf.Session()

### Preparamos los datos

### We prepare the data

In [4]:
get_file(
    os.getcwd() + '/transcript.txt',
    origin=
    'https://docs.google.com/uc?export=download&id=1ygz01H8ugP9QKs7VJk1GkFQjROeeF9S0'
)

Downloading data from https://docs.google.com/uc?export=download&id=1ygz01H8ugP9QKs7VJk1GkFQjROeeF9S0


'/content/transcript.txt'

In [0]:
lines = []
last = ''
with open('transcript.txt', 'rt', encoding='utf-8') as file:
    for line in file:
        if '(Rogers stands as Gloria enters.) Good morning.' in line or \
            'Well, where would you like to start this morning?' in line or \
            'Where do you want to start this morning?' in line or \
            'Whatever we do is up to us.' in line or \
            'Well, I\'m really glad to meet you.' in line:
            continue
        if 'This transcript is available for purposes of research' in line:
            last = ''
            continue
        who = line.lstrip()
        if who[:5] == 'Carl:' or who[:7] == 'Sylvia:' or who[:8] == 'Sylvia1:' or \
                (len(who) > 1 and (who[0] == 'C' or who[0] == 'T' or who[0] == 'S')
                and not (who[1].isalpha())):
            if 'Commentary' in line:
                continue
            this = who[0]
            continued = '(continued)' in line
            if '\t' in line:
                line = line[line.find('\t') + 1:]
            elif ':' in line:
                line = line[line.find(':') + 1:]
        else:
            continue
        if '[' in line:
            while '[' in line and ']' in line:
                line = line[:line.find('[')] + line[line.find(']') + 1:]
            line = ' '.join(line.split()).strip()
            if len(line) == 0:
                line = next(file)
        line = ' '.join(line.split()).strip()
        if len(line) > 0:
            if continued or last == this:
                lines[-1] = lines[-1] + ' ' + line
            else:
                lines.append(line)
            last = this

In [6]:
from IPython.core.display import HTML, display
for i in range(0, len(lines), 2):
    display(HTML('Client: ' + lines[i] + '</p>'))
    display(HTML('<p><b>Therapist: ' + lines[i + 1] + '</b>'))

In [0]:
data = [(lines[i], lines[i + 1]) for i in range(0, len(lines), 2)]
random.seed(12345)  # para resultados reproducibles
                    # for reproducible results
random.shuffle(data)

In [8]:
len(data)

481

### Definir el modelo

### Define the model

In [0]:
modelo_de_bert = 'bert_uncased_L-12_H-768_A-12/1'  #@param ["bert_uncased_L-12_H-768_A-12/1", "bert_cased_L-12_H-768_A-12/1", "bert_uncased_L-24_H-1024_A-16/1", "bert_cased_L-24_H-1024_A-16/1", "bert_multi_cased_L-12_H-768_A-12/1"]
bert = hub.Module('https://tfhub.dev/google/' + modelo_de_bert)

# instanciar el tokenizador
# create instance of tokenizer
tokenization_info = bert(signature='tokenization_info', as_dict=True)
vocab_file, do_lower_case = sess.run([
    tokenization_info['vocab_file'],
    tokenization_info['do_lower_case'],
])
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file,
                                       do_lower_case=do_lower_case)

In [0]:
class BertEmbeddingLayer(Layer):
    def __init__(
            self,
            output_key='sequence_output',  # 'sequence_output': embedding de las palabras, 'pooled_ouput': embedding de la frase
                                           # 'sequence_output': word embedding, 'pooled_output': sentence embdding
            n_fine_tune_layers=0,  # número de capas a entrenar (sin contar la de pooling)
                                   # number of layers to train (not counting the pooling layer)
            bert_path="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1",  # modelo de BERT preentrenado
                                                                                  # pre-trained BERT model
            max_len=512,  # número máximo de tokens en las secuencias
                          # maximum number of tokens in the sequences
            **kwargs):
        assert output_key == 'sequence_output' or output_key == 'pooled_output'
        super(BertEmbeddingLayer, self).__init__(**kwargs)
        self.output_key = output_key
        self.n_fine_tune_layers = n_fine_tune_layers
        self.bert_path = bert_path
        self.max_len = max_len

    def build(self, input_shape):
        self.bert = hub.Module(self.bert_path,
                               trainable=self.trainable,
                               name="{}_module".format(self.name))
        if self.trainable:
            if self.output_key == 'pooled_output':
                # añadir las variables de la capa de pooling a las que vamos a entrenar
                # add the pooling layer variables to those we are going to train
                self.trainable_weights += [
                    var for var in self.bert.variables if 'pooler/' in var.name
                ]
            # añadir las variables de las últimas n capas a las que vamos a entrenar
            # add the variables from the last n layers to those we are going to train
            top_layer = max([
                int(_[_.find('layer_'):][6:_[_.find('layer_'):].find('/')])
                for _ in
                [var.name for var in bert.variables if 'layer_' in var.name]
            ])
            self.trainable_weights += [
                var for var in self.bert.variables if any([
                    f'layer_{top_layer-i}/' in var.name
                    for i in range(self.n_fine_tune_layers)
                ])
            ]
            self.non_trainable_weights += [
                var for var in self.bert.variables
                if var not in self.trainable_weights
            ]
        super(BertEmbeddingLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype='int32') for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(input_ids=input_ids,
                           input_mask=input_mask,
                           segment_ids=segment_ids)
        result = self.bert(inputs=bert_inputs,
                           signature='tokens',
                           as_dict=True)[self.output_key]
        return result

    def compute_output_shape(self, input_shape):
        if self.output_key == 'pooled_output':
            # embedding de la frase
            # sentence embedding
            return (input_shape[0], self.bert.get_output_info_dict('tokens')[
                self.output_key].get_shape()[1].value)
        else:
            # embedding de las palabras
            # word embedding
            return (input_shape[0], self.max_len,
                    self.bert.get_output_info_dict('tokens')[
                        self.output_key].get_shape()[2].value)

In [0]:
def build_bert_classification_model(
        trainable=True,
        n_fine_tune_layers=10,
        bert_path="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1",
        max_len=limite_de_palabras_en_la_secuencia,
        num_classes=2):
    in_id = Input(shape=(max_len, ), name="input_ids")
    in_mask = Input(shape=(max_len, ), name="input_masks")
    in_segment = Input(shape=(max_len, ), name="segment_ids")
    bert_inputs = [in_id, in_mask, in_segment]
    embedding = BertEmbeddingLayer(trainable=trainable,
                                   output_key='pooled_output',
                                   n_fine_tune_layers=n_fine_tune_layers,
                                   bert_path=bert_path,
                                   max_len=max_len)(bert_inputs)
    dropout = Dropout(0.1)(embedding)
    pred = Dense(num_classes, activation='sigmoid')(dropout)
    model = Model(inputs=bert_inputs, outputs=pred)
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(),
                  metrics=['accuracy'])
    model.summary()
    return model

### Preparar los datos en el formato que espera BERT
```python
example = [CLS] How are you? [SEP] Fine, thanks [SEP]

mask    =   1    1   1   1     1     1     1      1    0 ... 0

segment =   0    0   0   0     1     1     1      1    0 ... 0
```

### Prepare the data in the format expected by BERT
```python
example = [CLS] How are you? [SEP] Fine, thanks [SEP]

mask    =   1    1   1   1     1     1     1      1    0 ... 0

segment =   0    0   0   0     1     1     1      1    0 ... 0
```

In [0]:
train_test_split = 400


# generar casos positivos y negativos
# generate positive and negative cases
def get_data(data, max_len):
    examples = []
    mask = []
    segment = []
    label = []
    for i in range(len(data)):
        # consecutivos
        # consecutive phrases
        q = tokenizer.tokenize(data[i][0])
        a = tokenizer.tokenize(data[i][1])
        pad = [0] * (max_len - (len(q) + len(a) + 3))
        examples.append(
            (tokenizer.convert_tokens_to_ids(['[CLS]'] + q + ['[SEP]'] + a +
                                             ['[SEP]'])[:max_len] +
             pad)[:max_len])
        mask.append(([1] * (len(q) + len(a) + 3) + pad)[:max_len])
        segment.append(
            ([0] * (len(q) + 2) + [1] * (len(a) + 1) + pad)[:max_len])
        label.append('1')  # resultado positivo
        # positive result

        # no consecutivos
        # non-consecutive phrases
        for _ in range(1):
            noti = (random.randrange(len(data) - 3) + i + 2) % len(data)
            assert (noti < i - 1 or noti > i + 1)
            q = tokenizer.tokenize(data[i][0])
            a = tokenizer.tokenize(data[noti][1])
            pad = [0] * (max_len - (len(q) + len(a) + 3))
            examples.append(
                (tokenizer.convert_tokens_to_ids(['[CLS]'] + q + ['[SEP]'] +
                                                 a + ['[SEP]'])[:max_len] +
                 pad)[:max_len])
            mask.append(([1] * (len(q) + len(a) + 3) + pad)[:max_len])
            segment.append(
                ([0] * (len(q) + 2) + [1] * (len(a) + 1) + pad)[:max_len])
            label.append('0')  # resultado negativo
            # negative result
    return (np.array(examples), np.array(mask), np.array(segment),
            to_categorical(label, 2))


train_examples, train_mask, train_segment, train_label = get_data(
    data[:train_test_split], limite_de_palabras_en_la_secuencia)
test_examples, test_mask, test_segment, test_label = get_data(
    data[train_test_split:], limite_de_palabras_en_la_secuencia)

### Entrenar el modelo

### Train the model

In [16]:
# trainable?
entrenable = False  #@param {type : 'boolean'}
# number of layers to fine-tune
numero_de_capas_a_tunear = 0  #@param {type: 'slider', min : 0, max : 24}
checkpoint_filename = '/DrBertModel.h5'
bert_model = build_bert_classification_model(
    trainable=entrenable,
    n_fine_tune_layers=numero_de_capas_a_tunear,
    bert_path='https://tfhub.dev/google/' + modelo_de_bert,
    max_len=limite_de_palabras_en_la_secuencia,
    num_classes=2)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          (None, 512)          0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        (None, 512)          0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        (None, 512)          0                                            
__________________________________________________________________________________________________
bert_embedding_layer_2 (BertEmb ((None, 512), 768)   0           input_ids[0][0]                  
                                                                 input_masks[0][0]          

In [17]:
bert_model.fit(
    [train_examples, train_mask, train_segment],
    train_label,
    validation_data=([test_examples, test_mask, test_segment], test_label),
    epochs=1,
    batch_size=32  #@param {type : "number"}
    #@markdown La memoría utilizada por el GPU depende del tamaño del batch y el número de palabras en las sequencias
    #@markdown The memory used by the GPU depends on the batch size and the number of words in the sequences
    )

Train on 800 samples, validate on 162 samples
Epoch 1/1


<keras.callbacks.History at 0x7f8a128840b8>

### Ahora podemos empezar la sesión de psicoterapia...

### Now we can start the psychotherapy session ...

In [0]:
max_len = limite_de_palabras_en_la_secuencia
try:
    while True:
        texto = input('You: ')
        examples = []
        mask = []
        segment = []
        label = []
        for i in range(len(data)):
            # consecutivos
            # consecutive phrases
            q = tokenizer.tokenize(texto)
            a = tokenizer.tokenize(data[i][1])
            pad = [0] * (max_len - (len(q) + len(a) + 3))
            examples.append(
                tokenizer.convert_tokens_to_ids(['[CLS]'] + q + ['[SEP]'] + a +
                                                ['[SEP]'])[:max_len] + pad)
            mask.append([1] * (len(q) + len(a) + 3) + pad)
            segment.append([0] * (len(q) + 2) + [1] * (len(a) + 1) + pad)
        result = bert_model.predict([examples, mask, segment])
        print('Dr Bert: ' + data[np.argmax(softmax(result, axis=1)[:,1])][1])
except:
    print('Dr Bert: Bye!')

You: Hello Dr Bert
Dr Bert: Hi.
You: I am feeling so depressed, I just don't know what to do!
Dr Bert: Well, all I can do is what I am feeling- that is I feel close to you in this moment.
You: That's nice of you to say that. I feel fine when I am in session, but when I go home, I feel much worse.
Dr Bert: Mmm, you say you're not sure whether crying for yourself is constructive. I feel also you're afraid of crying for yourself.
You: I am afraid of doing that. But what good does it do?
Dr Bert: Are you afraid of the responsibility or, or what aspect of it is most frightening?
You: I'm afraid of hurting people around me.
Dr Bert: Perhaps at a deeper level you're afraid of the hurt that you may experience if you let yourself experience the anger.
You: That could be it. I am terrified of losing control.
Dr Bert: Mhm. Yeah being, being frightened of what you’re taking on and sad of what you’re losing for yourself. (S: Nods.) Those are two very real (S: Hmm.) feelings.
You: OK, thanks. I have