# **Author Profiling**  
**Mario Xavier Canche Uc**, mario.canche@cimat.mx  
*PhD in Computer Science*  
Centro de Investigación en Matemáticas, A.C. (CIMAT)



This notebook is part of the PhD candidacy exam and evaluates the material covered in the Machine Learning II course. It presents the results obtained for the Author Profiling problem, whose goal is to identify demographic traits (e.g., gender, age, nationality, personality traits) of a target population. Here we focus only on identifying whether each Twitter user profile in the dataset is labeled "male" or "female". The data were obtained from https://pan.webis.de/clef17/pan17-web/author-profiling.html

The code below is an adaptation of code used to detect aggressiveness in Spanish tweets. Everything needed for the new task was added, along with a new hierarchical attention layer for evaluation.

# Running on Google Colab with Drive

In [None]:
# Mount Drive in the notebook
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# Check the current working directory
!pwd
!ls

/content
drive  sample_data


In [None]:
# Change directory to the Drive folder
import os
os.chdir("drive/My Drive/PruebasCOLAB4/candidatura/NLP/")
!ls

Author_Profiling_MarioXavierCancheUc.ipynb  word2vec_col.txt
author_profiling_pan


# Loading the Libraries


In [None]:
import pandas as pd
import pickle
import numpy as np
import nltk
nltk.download('punkt')
from tqdm.auto import tqdm
import copy

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pad_packed_sequence, pack_padded_sequence
import torch.nn.functional as F

from sklearn.metrics import f1_score

import xml.etree.ElementTree as etree
import os
from nltk.tokenize import RegexpTokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Defining the Function to Load the Data

We define the function that reads the train and test datasets, which are in XML format.  

It groups the tweets of the same person into a single string, so each sample represents a different user with all of their tweets.
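The grouping step can be illustrated on a small in-memory document. This is only a sketch: the XML layout (`<author><documents><document>`) and the sample tweets are assumptions made for illustration, and `join_user_tweets` is a hypothetical helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory document mimicking the assumed file layout:
# <author><documents><document>tweet</document>...</documents></author>
SAMPLE_XML = """<author lang="es">
  <documents>
    <document><![CDATA[Hola Mundo]]></document>
    <document><![CDATA[Segundo TWEET]]></document>
  </documents>
</author>"""

TWEET_SEP = ":::::"  # same separator the loader places between tweets


def join_user_tweets(xml_text):
    """Return all tweets of one user as a single lower-cased string."""
    root = ET.fromstring(xml_text)
    # root[0] is the <documents> element; each child holds one tweet
    tweets = [(doc.text or "").lower() for doc in root[0]]
    return TWEET_SEP.join(tweets)


joined = join_user_tweets(SAMPLE_XML)
print(joined)                   # hola mundo:::::segundo tweet
print(joined.split(TWEET_SEP))  # ['hola mundo', 'segundo tweet']
```

Keeping an explicit separator makes it possible to recover the individual tweets later, which the hierarchical model needs.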

In [None]:
def load_dataset(path_dir):

    # Load the list of labels for each user (one XML file per user).
    # The multi-character separator ':::' requires the Python parsing engine.
    labels = pd.read_csv(path_dir + "/"+ "truth.txt", sep=':::', header=None, index_col=0, engine='python')

    # Get the list of files in the folder
    archivos = os.listdir(path_dir)

    data = []
    # Iterate over the files in the folder
    for archivo in archivos:

        # Process the XML files containing the tweets
        if archivo.endswith('.xml'):
            # Extract the user name from the file name
            user = archivo.split(".")[0]
            # Extract the user's gender from the labels
            genero = labels.loc[user,1]
            # Extract the user's country from the labels
            #pais = labels.loc[user,2]

            # Read the XML file
            tree = etree.parse(path_dir + "/" + archivo)
            root = tree.getroot()

            tweet_user = []
            # Extract the tweets from the file
            for i in range(len(root[0])):
                # Tweet
                tweet_user.append(str(root[0][i].text).lower())
            
            tweet = ":::::".join(tweet_user)
            # Store the information (0 = female, 1 = male)
            if genero == 'female':
                data.append([user, tweet, 0])
            else:
                data.append([user, tweet, 1])
            #data.append([user, tweet, genero, pais])

    # Store in a DataFrame
    df = pd.DataFrame(data, columns=["user","text","target"])
    df = df.set_index(["user"])
    return df


# Task 1: Data Preprocessing and Cleaning

PyTorch's Dataset class allows orderly handling of our data and simple interaction with the DataLoader object used to create and load the data batches.

Tweet cleaning is also performed here.
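As a minimal sketch of the encoding step, the snippet below maps tokens to vocabulary ids and records how many tokens each tweet has. The toy vocabulary and the `encode_user` helper are hypothetical, and a plain whitespace split stands in for `nltk.word_tokenize` to keep the example dependency-free.

```python
# Hypothetical toy vocabulary; ids 0 and 1 are reserved for padding and unknown words.
vocab = {"[PAD]": 0, "[UNK]": 1, "hola": 2, "mundo": 3}
TWEET_SEP = ":::::"


def encode_user(text, vocab):
    """Return (word_ids, tweet_lengths) for one user's joined tweets."""
    word_ids, tweet_lengths = [], []
    for tweet in text.split(TWEET_SEP):
        tokens = tweet.split()  # whitespace split stands in for nltk.word_tokenize
        tweet_lengths.append(len(tokens))
        # Unknown tokens fall back to the [UNK] id
        word_ids.extend(vocab.get(tok, vocab["[UNK]"]) for tok in tokens)
    return word_ids, tweet_lengths


ids, lengths = encode_user("hola mundo:::::hola amigos", vocab)
print(ids)      # [2, 3, 2, 1]
print(lengths)  # [2, 2]
```

The per-tweet lengths are what later allows the flat list of word ids to be cut back into tweets.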

In [None]:
class author_profiling_dataset(Dataset):
    def __init__(self, split):
        super().__init__()
        self.load_data(split)
        self.vocab, self.emb_mat = self.load_vocab_embeddings()
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        '''Main method to load one observation from the dataset.
           label: category the observation belongs to.
           word_ids: list of vocabulary indices of the words.
        '''
        label = self.data.iloc[index]['target']
        words, word_ids, num_tweets = self.preprocessed_text(index)
        return word_ids, label, words, num_tweets
        
    def preprocessed_text(self, index):
        '''Tokenize the user's tweets and map each token to its vocabulary id.'''

        text = self.data.iloc[index]['text']

        # num_tweets holds the number of tokens in each tweet of the user
        num_tweets = []
        words = []
        for tweet in text.split(":::::"):
            tokens = nltk.word_tokenize(tweet)
            num_tweets.append(len(tokens))
            words += tokens

        # Out-of-vocabulary words map to the '[UNK]' row of the embedding matrix
        unk_id = self.vocab['[UNK]']
        word_ids = [self.vocab.get(word, unk_id) for word in words]
        return words, word_ids, num_tweets

    def load_data(self, split):
        '''Method to load the data.
           The text is in the "text" column and the categories in the "target" column.
        '''
        #self.data = pd.read_csv('%s.csv'%(split))
        self.data = load_dataset(split)

        print("Muestra del dataset: ", split)
        print(self.data)

    def load_vocab_embeddings(self):
        '''Embeddings pretrained on Twitter.
           emb_mat: Embedding matrix. One vector of size 100 for each word in the vocabulary.
           vocab: Dictionary mapping each word to its corresponding row in the embedding matrix.
        '''
        embeddings_list = []
        self.vocab_dict = {}
        vocab = {}
        with open('word2vec_col.txt', 'r') as f:
            for i, line in enumerate(f):
                # Skip the first line of the file; rows 0 and 1 are reserved for [PAD] and [UNK]
                if i!=0:
                    values = line.split()
                    self.vocab_dict[i+1] = values[0]
                    vocab[values[0]] = i+1
                    vector = np.asarray(values[1:], "float32")
                    embeddings_list.append(vector)
        # Row 1 ([UNK]) is the mean of all embeddings; row 0 ([PAD]) is all zeros
        embeddings_list.insert(0,np.mean(np.vstack(embeddings_list), axis=0))
        embeddings_list.insert(0,np.zeros(100))
        self.vocab_dict[0] = '[PAD]'
        self.vocab_dict[1] = '[UNK]'
        vocab['[PAD]'] = 0
        vocab['[UNK]'] = 1
        emb_mat = np.vstack(embeddings_list)

        return vocab, emb_mat

    def get_weights(self):
        '''Return inverse-frequency weights for each category. Larger weight for the category with fewer observations.'''

        cat_0 = len(self.data[self.data['target']==0])
        cat_1 = len(self.data[self.data['target']==1])
        maxi = max(cat_0, cat_1)
        return torch.tensor([maxi/cat_0, maxi/cat_1])

    def collate_fn(self, batch):
        '''Function run by the dataloader to build the data batches.'''

        zipped_batch = list(zip(*batch))

        # All word ids of the batch are concatenated into one flat tensor;
        # `lengths` keeps the number of words per user so they can be split again.
        word_ids = [torch.tensor(t) for t in zipped_batch[0]]
        word_ids = torch.cat(word_ids, dim=0)
        lengths = torch.tensor([len(t) for t in zipped_batch[0]])
        num_tweets = torch.tensor(zipped_batch[3])

        labels = torch.tensor(zipped_batch[1])
        words = zipped_batch[2]
        return word_ids, lengths, labels, words, num_tweets

## Reading the Data

In [None]:
train_val_dataset = author_profiling_dataset('author_profiling_pan/es_train')
test_dataset = author_profiling_dataset('author_profiling_pan/es_test')

Muestra del dataset:  author_profiling_pan/es_train
                                                                               text  target
user                                                                                       
6ab3b49b1edfa8b13c2c080141835ddb  nuevos seguidores: 0, unfollowers: 1 (17:01) #...       0
5b1b1e9e9ef62dd40e8a4744fd025368  viendo la eficacia de #supertanker el gobierno...       0
fa63d3d1399be48e325b92376b9d658   lady gaga recita el pledge allegiance, al inic...       0
2b0deda35c16864b466df4fc25661278  si existe una forma correcta de leer esta libr...       0
a5a287c0ce20ff56c0427f58634ec2ae  buenos dias ^^  , (acabo de volver del putho d...       0
...                                                                             ...     ...
7a07493129d3085a29703dcb3bd1ac51  #entérate estos son los próximos #shows de @my...       0
964a53d429b17a146ee3bbad8c32ec7e  como cuando una gorda se lleva tu barra en el ...       0
138706982a29b9d655edff5039e8

Muestra del dataset:  author_profiling_pan/es_test
                                                                               text  target
user                                                                                       
1c5dc65dac09e3ef8fca01d7ad685a2d  @pressrperu y @rmapalacios ? tiene más de un m...       1
71055288ab6e01e42317ab3eeca513f   me sentí bien cuando me preguntaban y no me ju...       1
117e1af20c362d89f4ef8fc493476981  me gustó un video de @youtube de @sebasparadis...       1
fc0f02f571852e9d68747104f13ac032  odebrecht pagó us$ 20 mllns en sobornos para e...       1
c352ba9c4aff480900b40b2d6c7633f9  @alangarciaperu obvio que a disposión si tiene...       1
...                                                                             ...     ...
1d8d0e866df789bd500e424b68c7af96  esto de @mmunizvilla https://t.co/qnb5brvwv9::...       1
7a56048e0247da6392c71fbf78e2d832  @julia_de_lucia  quienes las dos peazos de tia...       1
cf6a0eb7318455e0138318a206a2d

## Preparing the Data in Batches

We define the "dataloader" iterables that will generate the data batches.
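The batching scheme used in this notebook concatenates the word ids of every user in a batch into one flat tensor and keeps the per-user lengths alongside it. A small sketch of that round trip, with made-up ids:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three hypothetical users with 3, 1 and 2 word ids each.
batch = [[5, 6, 7], [8], [9, 10]]

# Collate step: one flat tensor of ids plus the length of every sequence.
flat_ids = torch.cat([torch.tensor(ids) for ids in batch])
lengths = torch.tensor([len(ids) for ids in batch])

# Model side: recover the sequences and pad them to a common length.
sequences = list(flat_ids.split(lengths.tolist()))
padded = pad_sequence(sequences)  # shape (max_len, batch_size), padded with 0

print(flat_ids.tolist())    # [5, 6, 7, 8, 9, 10]
print(padded.t().tolist())  # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
```

Padding with 0 matches the `[PAD]` id, so padded positions look up the all-zero embedding row.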

In [None]:
batch_size= 32

In [None]:
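# Split the dataset into 80% train and 20% validation
train_size = int(0.8 * len(train_val_dataset))
test_size = len(train_val_dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(train_val_dataset, [train_size, test_size])

# Note: `.dataset` returns the full underlying dataset, not the subset, so after
# these two lines both names refer to all of train_val_dataset and the 80/20 split
# is not applied. This is why train and validation F1 coincide in the logs below.
train_dataset = train_dataset.dataset
val_dataset = val_dataset.dataset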

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, collate_fn = train_dataset.collate_fn, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, collate_fn = val_dataset.collate_fn, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, collate_fn = test_dataset.collate_fn, shuffle=False)

# Task 2: RNN Classifier

## Defining the Model
The model is defined by subclassing nn.Module.
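The classifier in this section feeds the final hidden state of a GRU into an MLP. The sketch below, with toy sizes and random inputs chosen only for illustration, checks that when sequences are packed the final hidden state corresponds to each sequence's last real step rather than to a padded position.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=3)  # toy sizes, not the notebook's

# Two hypothetical sequences of lengths 5 and 2 (feature size 4).
seqs = [torch.randn(5, 4), torch.randn(2, 4)]
lengths = torch.tensor([5, 2])

padded = pad_sequence(seqs)  # (5, 2, 4)
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
output, hn = gru(packed)     # hn has shape (1, 2, 3)

# Compare hn with the output taken at each sequence's last real time step.
unpacked, _ = pad_packed_sequence(output)  # (5, 2, 3)
last_real = torch.stack([unpacked[n - 1, i] for i, n in enumerate(lengths.tolist())])
print(torch.allclose(hn[0], last_real))  # True
```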

In [None]:
class SimpleRNN(nn.Module):
    def __init__(self, input_size=100, hidden_size=128, num_layers=1,
                 bidirectional=False, emb_mat=None, dense_hidden_size=256):
        '''Constructor, where the layers are defined.
        input:
            input_size: Size of the word embeddings.
            hidden_size: Size of the GRU hidden layer.
            num_layers: Number of GRU layers.
            bidirectional: True for a bidirectional GRU.
            emb_mat: Embedding matrix of the vocabulary.
            dense_hidden_size: Size of the classifier's hidden layer.
        '''
        super(SimpleRNN, self).__init__()
        # Trainable embedding matrix, size vocab_size x 100
        self.embeddings = nn.Embedding.from_pretrained(\
                            torch.FloatTensor(emb_mat), freeze=False)
        # Gated Recurrent Unit
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, 
                          num_layers=num_layers, bidirectional=bidirectional)
        # Number of directions of the GRU
        directions = 2 if bidirectional else 1
        # MLP classifier
        self.classifier = nn.Sequential(\
                            nn.Linear(hidden_size*directions, dense_hidden_size),
                            nn.BatchNorm1d(dense_hidden_size),
                            nn.ReLU(),
                            nn.Linear(dense_hidden_size, 2))
    
    def forward(self, input_seq, lengths):
        '''Forward pass of the network.
        input:
            input_seq: Flat tensor with the ids of every word in the batch.
            lengths: Number of words in each observation of the batch.
        output:
            x: vectors to classify.
            return None for consistency with the next model
        '''
        # Compute the embedding of each word.
        x = self.embeddings(input_seq)
        # Build the word sequences that will enter the GRU.
        x = x.split(lengths.tolist())
        # Pad and pack the sequences (faster computation).
        x = pad_sequence(x)
        x = pack_padded_sequence(x, lengths, enforce_sorted=False)
        output, hn = self.gru(x)
        hn = torch.cat([h for h in hn], dim=-1)
        x = self.classifier(hn)
        return x, None


In [None]:
def eval_model(model, dataloader, criterion, device):
    '''Function to evaluate the model.'''
    with torch.no_grad():
        model.eval()
        losses = []
        preds = torch.empty(0).long()
        targets = torch.empty(0).long()
        scores_list = []
        words_list = []
        pred_list = []
        for data in tqdm(dataloader):
            torch.cuda.empty_cache()
            seq, seq_len, labels, words, num_tweets = data
            seq, labels = seq.to(device), labels.to(device)
            output, scores = model(seq, seq_len)
            output = F.log_softmax(output, dim=1)
            loss = criterion(output, labels)
            losses.append(loss.item())
            # `output` already holds log-probabilities, so argmax gives the prediction
            predictions = output.argmax(1)
            preds = torch.cat([preds, predictions.cpu()], dim=0)
            targets = torch.cat([targets, labels.cpu()], dim=0)
            if scores is not None:
                pred_list += predictions.tolist()
                scores = scores.cpu().squeeze(2).tolist()
                scores_list += scores
                words_list += words

        model.train()
        preds = preds.numpy()
        targets = targets.numpy()
        f1 = f1_score(targets, preds, average='binary')

        return np.mean(losses), f1, scores_list, words_list, pred_list, targets

## Defining the Model Parameters

We define the parameters of the Adam optimizer and the device on which the network will be trained, cuda or cpu.

In [None]:
lr = 0.001
epochs = 10
weight_decay=0.0001
beta1=0
beta2=0.999
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

We define the model, the optimizer, and the loss function (Negative Log-Likelihood).
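Because the training loop applies `F.log_softmax` to the network output before calling `nn.NLLLoss`, the combination is equivalent to `nn.CrossEntropyLoss` on the raw scores. A quick check with made-up logits and hypothetical class weights:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)         # hypothetical raw scores for 4 users
labels = torch.tensor([0, 1, 1, 0])
weight = torch.tensor([1.0, 2.0])  # hypothetical class weights

# Weighted NLL on log-probabilities versus weighted cross-entropy on logits.
nll = nn.NLLLoss(weight=weight)(F.log_softmax(logits, dim=1), labels)
ce = nn.CrossEntropyLoss(weight=weight)(logits, labels)
print(torch.allclose(nll, ce))  # True
```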

In [None]:
model = SimpleRNN(emb_mat=train_dataset.emb_mat, bidirectional=False).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr,weight_decay=weight_decay, betas = (beta1, beta2))
weight = train_dataset.get_weights().to(device)
criterion = nn.NLLLoss(weight = weight)

## Training the Model

We train the model for the desired number of epochs. The model with the best f1_score on the validation set is saved.
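For reference, the binary F1 score used for model selection is the harmonic mean of precision and recall for the positive class (label 1, which the loader assigns to "male"). The helper below is a hypothetical hand-rolled version for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def binary_f1(targets, preds, positive=1):
    """F1 of the positive class computed from raw counts."""
    tp = sum(t == positive and p == positive for t, p in zip(targets, preds))
    fp = sum(t != positive and p == positive for t, p in zip(targets, preds))
    fn = sum(t == positive and p != positive for t, p in zip(targets, preds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


targets = [1, 1, 0, 0, 1]
preds = [1, 0, 0, 1, 1]
# tp = 2, fp = 1, fn = 1, so precision = recall = 2/3
print(round(binary_f1(targets, preds), 4))  # 0.6667
```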

In [None]:
best_val_f1 = 0
for epoch in range(epochs):
    for data in tqdm(train_dataloader):
        # Free cached GPU memory
        torch.cuda.empty_cache()
        # Reset the gradients
        optimizer.zero_grad()
        # Unpack the data coming out of the dataloader
        seq, seq_len, labels, _, _ = data
        # Move the data to the same device as the model
        seq, labels = seq.to(device), labels.to(device)
        output, _ = model(seq, seq_len)
        output = F.log_softmax(output, dim=1)
        loss = criterion(output, labels)
        # Compute the gradient of the loss
        loss.backward()
        # Take one optimization step
        optimizer.step()
    
    # Evaluate the model on the training and validation sets
    train_loss, train_f1, _, _, _, _ = eval_model(model, train_dataloader, criterion, device)
    val_loss, val_f1, _, _, _, _ = eval_model(model, val_dataloader, criterion, device)
    print('epoch: %d'%(epoch))
    print('train_loss: %5f | val_loss: %5f | train_f1: %5f | val_f1: %5f'%(train_loss, val_loss, train_f1, val_f1)) 
    if val_f1>best_val_f1:
        best_val_f1=val_f1
        best_state_dict=copy.deepcopy(model.state_dict())

epoch: 0
train_loss: 0.488579 | val_loss: 0.489764 | train_f1: 0.850318 | val_f1: 0.850318
epoch: 1
train_loss: 0.391739 | val_loss: 0.393888 | train_f1: 0.796813 | val_f1: 0.796813
epoch: 2
train_loss: 0.264669 | val_loss: 0.262741 | train_f1: 0.926244 | val_f1: 0.926244
epoch: 3
train_loss: 0.155685 | val_loss: 0.156435 | train_f1: 0.959866 | val_f1: 0.959866
epoch: 4
train_loss: 0.127702 | val_loss: 0.126490 | train_f1: 0.953043 | val_f1: 0.953043
epoch: 5
train_loss: 0.083536 | val_loss: 0.087015 | train_f1: 0.980066 | val_f1: 0.980066
epoch: 6
train_loss: 0.031327 | val_loss: 0.031845 | train_f1: 0.995008 | val_f1: 0.995008
epoch: 7
train_loss: 0.040510 | val_loss: 0.038587 | train_f1: 0.988235 | val_f1: 0.988235
epoch: 8
train_loss: 0.014421 | val_loss: 0.014460 | train_f1: 0.998331 | val_f1: 0.998331
epoch: 9
train_loss: 0.027922 | val_loss: 0.028600 | train_f1: 0.993289 | val_f1: 0.993289


## Evaluating the Model

Once training is finished, we load the best model and evaluate it on the three sets.

In [None]:
model.load_state_dict(best_state_dict)
train_loss, train_f1, _, _, _, _ = eval_model(model, train_dataloader, criterion, device)
val_loss, val_f1, _, _, _, _ = eval_model(model, val_dataloader, criterion, device)
test_loss, test_f1, _, _, _, _ = eval_model(model, test_dataloader, criterion, device)
print('train_loss: %5f | train_f1: %5f'%(train_loss, train_f1)) 
print('val_loss: %5f | val_f1: %5f'%(val_loss, val_f1)) 
print('test_loss: %5f | test_f1: %5f'%(test_loss, test_f1)) 

train_loss: 0.014411 | train_f1: 0.998331
val_loss: 0.015026 | val_f1: 0.998331
test_loss: 1.577312 | test_f1: 0.611765


# Task 3: RNN Classifier with Attention

## Defining the Model with Attention

The model's syntax is similar to the previous one, but an attention module is added. The attention module takes the GRU output vectors $h_t$ and computes a representation $s$ as a weighted sum:

$$ s = \sum_t \alpha_t h_t,$$

where $u$ is a trainable context vector (the bias-free layer `fc2` in the code) and

\begin{align*}
    u_{t} &= \tanh(Wh_{t}+b),\\
    \alpha_{t} &= \frac{\exp(u_t^Tu)}{\sum_i\exp(u_{i}^Tu)}.
\end{align*}
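The two equations can be traced step by step on a small padded batch. This is a sketch with toy sizes and random, hypothetical parameters `W`, `b` and `u_ctx`; it masks padded positions with `-inf`, whereas the module in this notebook fills them with -100, which has practically the same effect after the softmax.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch, hid, attn_hid = 4, 2, 3, 5  # toy sizes
h = torch.randn(seq_len, batch, hid)         # GRU outputs h_t
lengths = torch.tensor([4, 2])               # second sequence has 2 padded steps

W = torch.randn(attn_hid, hid)               # hypothetical parameters
b = torch.randn(attn_hid)
u_ctx = torch.randn(attn_hid)                # context vector u

u = torch.tanh(h @ W.T + b)                  # u_t, shape (seq_len, batch, attn_hid)
scores = u @ u_ctx                           # u_t^T u, shape (seq_len, batch)

# Mask padded steps so they receive zero weight after the softmax.
steps = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
scores = scores.masked_fill(steps >= lengths.unsqueeze(0), float("-inf"))
alpha = F.softmax(scores, dim=0)             # alpha_t, normalized over time

s = (alpha.unsqueeze(-1) * h).sum(dim=0)     # weighted sum, shape (batch, hid)
print(alpha.sum(dim=0))  # every sequence's weights sum to 1
print(alpha[2:, 1])      # padded steps of the second sequence get weight 0
```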


In [None]:
class AttnModule(nn.Module):
    def __init__(self, input_size, attn_hidden_size=128):
        '''
        input:
            input_size: size of the GRU hidden layer.
            attn_hidden_size: size of the attention hidden layer.
        '''
        super(AttnModule, self).__init__()
        self.fc1 = nn.Linear(input_size, attn_hidden_size)
        self.fc2 = nn.Linear(attn_hidden_size, 1, bias=False)

    def forward(self, seq, lengths):
        '''
        input:
            seq: sequence of GRU hidden vectors.
            lengths: number of words in each observation.
        '''
        x = pad_packed_sequence(seq)[0]
        seq_len, batch_size, nhid = x.size()
        u = self.fc1(x.view(batch_size*seq_len, nhid))
        u = torch.tanh(u)
        scores = self.fc2(u)
        scores = scores.view(seq_len, batch_size, 1)
        # Assign -100 to the padded positions so they are not considered in the attention.
        scores = nn.utils.rnn.pack_padded_sequence(scores, lengths=lengths,enforce_sorted=False)
        scores = nn.utils.rnn.pad_packed_sequence(scores, padding_value=-100)[0]
        scores = F.softmax(scores, dim=0)
        scores = scores.transpose(0,1)
        x = x.transpose(0,1).transpose(1,2)
        x = torch.bmm(x, scores)
        return x.squeeze(2), scores

We define the RNN model with attention.

In [None]:
class AttnRNN(nn.Module):
    def __init__(self, input_size=100, hidden_size=128, num_layers=1,
                 bidirectional=False, emb_mat=None, dense_hidden_size=256,
                 attn_hidden_size=128):
        super(AttnRNN, self).__init__()
        self.embeddings = nn.Embedding.from_pretrained(\
                            torch.FloatTensor(emb_mat), freeze=False)
        self.gru = nn.GRU(input_size=input_size, hidden_size=hidden_size, 
                          num_layers=num_layers, bidirectional=bidirectional)
        directions = 2 if bidirectional else 1
        self.attn = AttnModule(input_size=hidden_size*directions)
        self.classifier = nn.Sequential(\
                            nn.Linear(hidden_size*directions, dense_hidden_size),
                            nn.BatchNorm1d(dense_hidden_size),
                            nn.ReLU(),
                            nn.Linear(dense_hidden_size, 2))
        
    def forward(self, input_seq, lengths):
        x = self.embeddings(input_seq)
        x = x.split(lengths.tolist())
        x = pad_sequence(x)
        x = pack_padded_sequence(x, lengths, enforce_sorted=False)
        output, hn = self.gru(x)
        x, scores = self.attn(output, lengths)
        x = self.classifier(x)
        return x, scores.detach()

## Defining the Model Parameters

In [None]:
lr = 0.0001
epochs = 20
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
weight_decay=0.0001
beta1=0
beta2=0.999

In [None]:
model = AttnRNN(emb_mat=train_dataset.emb_mat, bidirectional=False).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr,weight_decay=weight_decay, betas = (beta1, beta2))
weight = train_dataset.get_weights().to(device)
criterion = nn.NLLLoss(weight = weight)

## Training the Model

In [None]:
best_val_f1 = 0
for epoch in range(epochs):
    for data in tqdm(train_dataloader):
        torch.cuda.empty_cache()
        optimizer.zero_grad()
        seq, seq_len, labels, _, num_tweets = data
        seq, labels = seq.to(device), labels.to(device)
        output, _ = model(seq, seq_len)
        output = F.log_softmax(output, dim=1)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
    
    train_loss, train_f1, _, _, _, _ = eval_model(model, train_dataloader, criterion, device)
    val_loss, val_f1, _, _, _, _ = eval_model(model, val_dataloader, criterion, device)
    print('epoch: %d'%(epoch))
    print('train_loss: %5f | val_loss: %5f | train_f1: %5f | val_f1: %5f'%(train_loss, val_loss, train_f1, val_f1)) 
    if val_f1>best_val_f1:
        best_val_f1=val_f1
        best_state_dict=copy.deepcopy(model.state_dict())

epoch: 0
train_loss: 0.649511 | val_loss: 0.648890 | train_f1: 0.781250 | val_f1: 0.781250
epoch: 1
train_loss: 0.594944 | val_loss: 0.594745 | train_f1: 0.815057 | val_f1: 0.815057
epoch: 2
train_loss: 0.515688 | val_loss: 0.511211 | train_f1: 0.824348 | val_f1: 0.824348
epoch: 3
train_loss: 0.464706 | val_loss: 0.466780 | train_f1: 0.812500 | val_f1: 0.812500
epoch: 4
train_loss: 0.431014 | val_loss: 0.428996 | train_f1: 0.860465 | val_f1: 0.860465
epoch: 5
train_loss: 0.457458 | val_loss: 0.457549 | train_f1: 0.755020 | val_f1: 0.755020
epoch: 6
train_loss: 0.409802 | val_loss: 0.406159 | train_f1: 0.832090 | val_f1: 0.832090
epoch: 7
train_loss: 0.399943 | val_loss: 0.404722 | train_f1: 0.881789 | val_f1: 0.881789
epoch: 8
train_loss: 0.381312 | val_loss: 0.380096 | train_f1: 0.847015 | val_f1: 0.847015
epoch: 9
train_loss: 0.381697 | val_loss: 0.384442 | train_f1: 0.885400 | val_f1: 0.885400
epoch: 10
train_loss: 0.358441 | val_loss: 0.358854 | train_f1: 0.890302 | val_f1: 0.890302
epoch: 11
train_loss: 0.328963 | val_loss: 0.324375 | train_f1: 0.913333 | val_f1: 0.913333
epoch: 12
train_loss: 0.329793 | val_loss: 0.324757 | train_f1: 0.895683 | val_f1: 0.895683
epoch: 13
train_loss: 0.439551 | val_loss: 0.439278 | train_f1: 0.691304 | val_f1: 0.691304
epoch: 14
train_loss: 0.319029 | val_loss: 0.320570 | train_f1: 0.913738 | val_f1: 0.913738
epoch: 15
train_loss: 0.353895 | val_loss: 0.353462 | train_f1: 0.808679 | val_f1: 0.808679
epoch: 16
train_loss: 0.284509 | val_loss: 0.283223 | train_f1: 0.900901 | val_f1: 0.900901
epoch: 17
train_loss: 0.310858 | val_loss: 0.309772 | train_f1: 0.871028 | val_f1: 0.871028
epoch: 18
train_loss: 0.247280 | val_loss: 0.246163 | train_f1: 0.944538 | val_f1: 0.944538
epoch: 19
train_loss: 0.242564 | val_loss: 0.242806 | train_f1: 0.950495 | val_f1: 0.950495


## Evaluating the Model

In [None]:
model.load_state_dict(best_state_dict)
train_loss, train_f1, train_scores, train_words, train_pred, train_target = eval_model(model, train_dataloader, criterion, device)
val_loss, val_f1, val_scores, val_words, val_pred, val_target = eval_model(model, val_dataloader, criterion, device)
test_loss, test_f1, test_scores, test_words, test_pred, test_target = eval_model(model, test_dataloader, criterion, device)
print('train_loss: %5f | train_f1: %5f'%(train_loss, train_f1)) 
print('val_loss: %5f | val_f1: %5f'%(val_loss, val_f1)) 
print('test_loss: %5f | test_f1: %5f'%(test_loss, test_f1)) 

train_loss: 0.243130 | train_f1: 0.950495
val_loss: 0.247435 | val_f1: 0.950495
test_loss: 0.622944 | test_f1: 0.758148


# Task 5: Visualizing the Attention

One benefit of attention mechanisms is that they let us identify which elements of the sentences are most important.

In [None]:
from IPython.display import display, HTML
import matplotlib
import matplotlib.pyplot as plt

In [None]:
def colorize(words, color_array):
    '''
        Function to visualize attention, taken from https://gist.github.com/ihsgnef/f13c35cd46624c8f458a4d23589ac768
    '''
    # words is a list of words
    # color_array is an array of numbers between 0 and 1 of length equal to words

    # Normalize color_array to the range [0, 1]
    color_array = np.array(color_array)
    vmax = np.max(color_array)
    vmin = np.min(color_array)
    color_array = (color_array-vmin)/(vmax-vmin)

    cmap = plt.get_cmap('Reds')
    template = '<span class="barcode" style="color: black; background-color: {}">{}</span>'
    colored_string = ''
    for word, color in zip(words, color_array):
        color = matplotlib.colors.rgb2hex(cmap(color)[:3])
        colored_string += template.format(color, '&nbsp;' + word + '&nbsp;')
    return colored_string

Words with the most attention are shown in red and those with the least attention in white.

In [None]:
att = np.linspace(0,1,50)
p = [' ']*50
s = colorize(p, att)
# to display in ipython notebook
display(HTML(s))

In [None]:
max_attn = [np.max(scores) for scores in train_scores]
maxi = np.flip(np.argsort(max_attn))

for j in range(5):
    i = maxi[j]
    s = colorize(train_words[i], train_scores[i][:len(train_words[i])])
    # to display in ipython notebook
    category = 'Male' if train_pred[maxi[j]]==1 else 'Female'
    category0 = 'Male' if train_target[maxi[j]]==1 else 'Female'
    print('Categoría predicha:  %s'%(category))
    print('Categoría verdadera: %s'%(category0))
    display(HTML(s))

Categoría predicha:  Male
Categoría verdadera: Male


Categoría predicha:  Male
Categoría verdadera: Male


Categoría predicha:  Male
Categoría verdadera: Male


Categoría predicha:  Female
Categoría verdadera: Female


Categoría predicha:  Female
Categoría verdadera: Female


# Task 4: RNN Classifier with Hierarchical Attention

Attention module

In [None]:
class AttnModule(nn.Module):
    def __init__(self, input_size, attn_hidden_size=128):
        '''
        input:
            input_size: size of the GRU hidden layer.
            attn_hidden_size: size of the attention hidden layer.
        '''
        super(AttnModule, self).__init__()
        self.fc1 = nn.Linear(input_size, attn_hidden_size)
        self.fc2 = nn.Linear(attn_hidden_size, 1, bias=False)

    def forward(self, seq, lengths):
        '''
        input:
            seq: secuencia de vectores ocultos de la GRU.
            lengths: número de palabras en cada observación.
        '''
        x = pad_packed_sequence(seq)[0]
        seq_len, batch_size, nhid = x.size()
        u = self.fc1(x.view(batch_size*seq_len, nhid))
        u = torch.tanh(u)
        scores = self.fc2(u)
        scores = scores.view(seq_len, batch_size, 1)
        # Asigna -100 a las posiciones con padding para que no sean consideados en la atención.
        scores = nn.utils.rnn.pack_padded_sequence(scores, lengths=lengths,enforce_sorted=False)
        scores = nn.utils.rnn.pad_packed_sequence(scores, padding_value=-100)[0]
        scores = F.softmax(scores, dim=0)
        scores = scores.transpose(0,1)
        x = x.transpose(0,1).transpose(1,2)
        x = torch.bmm(x, scores)
        return x.squeeze(2), scores
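
As an illustration of what the module computes for a single sequence, here is a minimal sketch in plain Python (no PyTorch). The function `masked_attention_pool`, its `pad_score` argument, and the example numbers are assumptions made for this sketch; in the module the raw scores come from `fc2`.

```python
import math


def masked_attention_pool(vectors, scores, length, pad_score=-100.0):
    """Softmax-weighted sum of `vectors`, effectively ignoring positions >= length."""
    # Overwrite the scores of padded positions, mirroring padding_value=-100 above.
    masked = [s if t < length else pad_score for t, s in enumerate(scores)]
    # Numerically stable softmax over the time dimension.
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the hidden vectors (what torch.bmm does per batch element).
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


hidden = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # the last row is padding
raw_scores = [0.3, 0.3, 2.0]                    # hypothetical unnormalized scores
pooled, weights = masked_attention_pool(hidden, raw_scores, length=2)
print(pooled, weights)
```

Even though the padded position has the largest raw score, its weight after masking is negligible, so the pooled vector is the average of the two valid rows.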

We define the RNN model with hierarchical attention, similar to the paper "Hierarchical Attention Networks for Document Classification" (Yang et al., 2016):    
https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf

In [None]:
class Hierarchical_AttnRNN(nn.Module):
    def __init__(self, input_size=100, hidden_size=128, num_layers=1,
                 bidirectional=False, emb_mat=None, dense_hidden_size=256,
                 attn_hidden_size=128):
        super(Hierarchical_AttnRNN, self).__init__()

        directions = 2 if bidirectional else 1
        self.embeddings = nn.Embedding.from_pretrained(\
                            torch.FloatTensor(emb_mat), freeze=False)
        
        # Word Encoder
        self.gru_word = nn.GRU(input_size=input_size, hidden_size=hidden_size, 
                          num_layers=num_layers, bidirectional=bidirectional)
        # Word Attention
        self.attn_word = AttnModule(input_size=hidden_size*directions)


        # Sentence Encoder (its input is the output of the word-level attention)
        self.gru_sent = nn.GRU(input_size=hidden_size*directions, hidden_size=hidden_size, 
                          num_layers=num_layers, bidirectional=bidirectional)
        # Sentence Attention
        self.attn_sent = AttnModule(input_size=hidden_size*directions)


        # Classifier Document
        self.classifier = nn.Sequential(\
                            nn.Linear(hidden_size*directions, dense_hidden_size),
                            nn.BatchNorm1d(dense_hidden_size),
                            nn.ReLU(),
                            nn.Linear(dense_hidden_size, 2))
        
    def forward(self, input_seq, lengths, num_tweets):
        x = self.embeddings(input_seq)
        x = x.split(lengths.tolist())
        
        # Split each user's tokens into tweets,
        # iterating over the users
        tweets_users = []
        for i, user in enumerate(x):
            list_tweets = user.split(num_tweets.tolist()[i])
            list_tweets = pad_sequence(list_tweets)
            tweets_users.append(list_tweets)
            del list_tweets
        # Stack the padded tweets of all users into a single tensor
        del x
        x = pad_sequence(tweets_users)
        del tweets_users
        
        # Iterate over all tweets to process each one word by word
        x = x.permute(2,0,1,3) # Move the sentence (tweet) dimension to the front
        num_tweets = num_tweets.permute(1,0)
        sent_score = [] 
        sent_x = []
        for i, words in enumerate(x):
            # Word Encoder
            x_word = pack_padded_sequence(words, num_tweets.tolist()[i], enforce_sorted=False)
            output_word, hn_word = self.gru_word(x_word)
            # Word Attention
            x_word, scores_word = self.attn_word(output_word, num_tweets[i])
            sent_score.append(scores_word.transpose(0,1))
            sent_x.append(x_word)
        # Stack the sentence (tweet) representations
        sent_score = pad_sequence(sent_score).transpose(0,2)
        sent_x = pad_sequence(sent_x).permute(1,0,2)

        # Sentences Encoder
        lengths_sent = [sent_x.shape[0]]*sent_x.shape[1]
        sent_x = pack_padded_sequence(sent_x, lengths_sent, enforce_sorted=False)
        output, hn = self.gru_sent(sent_x)
        # Sentences Attention
        sent_x, scores = self.attn_sent(output, lengths_sent)
        
        # Classifier Document
        cl = self.classifier(sent_x)

        return cl, scores.detach(), sent_score.detach()
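
The first steps of `forward` cut one flat token sequence into users and then into tweets. The sketch below mimics that nested split with plain Python lists. `split_by_sizes` is a hypothetical helper standing in for `Tensor.split` with a list of sizes, and the tokens and sizes are made up for illustration.

```python
def split_by_sizes(items, sizes):
    """Cut `items` into consecutive chunks with the given sizes."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Hypothetical batch: two users whose tokens are concatenated in one flat list.
flat_tokens = ["hi", "all", "gm", "go", "team", "yes", "ok"]
tokens_per_user = [3, 4]              # plays the role of `lengths`
words_per_tweet = [[2, 1], [3, 1]]    # plays the role of `num_tweets`

users = split_by_sizes(flat_tokens, tokens_per_user)
tweets = [split_by_sizes(u, sizes) for u, sizes in zip(users, words_per_tweet)]
print(tweets)
```

In the model the resulting chunks are embedding tensors rather than strings, and `pad_sequence` then pads them to a common length so that they can be stacked.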

In [None]:
def eval_model(model, dataloader, criterion, device):
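    '''Evaluate the model on a dataloader; returns the mean loss, the F1 score, the attention scores, the words, the predictions and the targets.'''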
    with torch.no_grad():
        model.eval()
        losses = []
        preds = torch.empty(0).long()
        targets = torch.empty(0).long()
        scores_sent_list = []
        scores_word_list = []
        words_list = []
        pred_list = []
        for data in tqdm(dataloader):
            torch.cuda.empty_cache()
            seq, seq_len, labels, words, num_tweets = data
            seq, labels = seq.to(device), labels.to(device)
            output, scores_sent, scores_word = model(seq, seq_len, num_tweets)
            output = F.log_softmax(output, dim=1)
            loss = criterion(output, labels)
            losses.append(loss.item())
            predictions = output.argmax(1)  # output already holds log-probabilities
            preds = torch.cat([preds, predictions.cpu()], dim=0)
            targets = torch.cat([targets, labels.cpu()], dim=0)
            if scores_sent is not None:
                pred_list += predictions.tolist()
                scores_sent = scores_sent.cpu().squeeze(2).tolist()
                scores_word = scores_word.cpu().squeeze(3).tolist()
                scores_sent_list += scores_sent
                scores_word_list += scores_word
                words_list += words

        model.train()
        preds = preds.numpy()
        targets = targets.numpy()
        f1 = f1_score(targets, preds, average='binary')

        return np.mean(losses), f1, scores_sent_list, scores_word_list, words_list, pred_list, targets
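
For reference, the binary F1 score returned by `f1_score(..., average='binary')` considers only the positive class (label 1 by default, which the visualization cells display as "Male"). A minimal sketch of the same quantity, with a hypothetical helper `binary_f1` and made-up labels:

```python
def binary_f1(targets, preds, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(targets, preds))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


targets = [1, 1, 0, 0, 1]   # hypothetical true labels
preds = [1, 0, 0, 1, 1]     # hypothetical predictions
score = binary_f1(targets, preds)
print(score)
```

Here there are two true positives, one false positive and one false negative, so the score is 4 / 6.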

## Defining the model parameters

In [None]:
lr = 0.0001
epochs = 20
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to the CPU when no GPU is available
weight_decay=0.0001
beta1=0
beta2=0.999

In [None]:
model = Hierarchical_AttnRNN(emb_mat=train_dataset.emb_mat, bidirectional=True).to(device)
optimizer = optim.Adam(model.parameters(), lr=lr,weight_decay=weight_decay, betas = (beta1, beta2))
weight = train_dataset.get_weights().to(device)
criterion = nn.NLLLoss(weight = weight)
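
With class weights and the default mean reduction, `nn.NLLLoss` averages the weighted negative log-likelihoods and divides by the sum of the weights of the targets, not by the batch size. The sketch below reproduces that formula in plain Python. `weighted_nll` is a hypothetical helper and the weights are made up; the notebook obtains its weights from `train_dataset.get_weights()`.

```python
import math


def weighted_nll(log_probs, labels, class_weights):
    """Mean-reduced NLL with class weights: sum(w_y * -log p_y) / sum(w_y)."""
    num = sum(-class_weights[y] * lp[y] for lp, y in zip(log_probs, labels))
    den = sum(class_weights[y] for y in labels)
    return num / den


# Hypothetical log-probabilities for three users and two classes.
log_probs = [[math.log(0.9), math.log(0.1)],
             [math.log(0.4), math.log(0.6)],
             [math.log(0.7), math.log(0.3)]]
labels = [0, 1, 1]
class_weights = [1.0, 2.0]   # hypothetical: class 1 counts twice as much
loss = weighted_nll(log_probs, labels, class_weights)
print(loss)
```

A practical consequence is that the mean loss over batches depends on how examples are grouped into batches, so small differences between runs over the same data are expected when the loader shuffles.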

## Training the model

In [None]:
best_val_f1 = 0
for epoch in range(epochs):
    for data in tqdm(train_dataloader):
        torch.cuda.empty_cache()
        optimizer.zero_grad()
        seq, seq_len, labels, _, num_tweets = data
        seq, labels = seq.to(device), labels.to(device)
        output, _, _ = model(seq, seq_len, num_tweets)
        output = F.log_softmax(output, dim=1)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
    
    train_loss, train_f1, _, _, _, _, _ = eval_model(model, train_dataloader, criterion, device)
    val_loss, val_f1, _, _, _, _, _ = eval_model(model, val_dataloader, criterion, device)
    print('epoch: %d'%(epoch))
    print('train_loss: %5f | val_loss: %5f | train_f1: %5f | val_f1: %5f'%(train_loss, val_loss, train_f1, val_f1)) 
    if val_f1>best_val_f1:
        best_val_f1=val_f1
        best_state_dict=copy.deepcopy(model.state_dict())

epoch: 0
train_loss: 0.675812 | val_loss: 0.675492 | train_f1: 0.790257 | val_f1: 0.790257
epoch: 1
train_loss: 0.638237 | val_loss: 0.637976 | train_f1: 0.803540 | val_f1: 0.803540
epoch: 2
train_loss: 0.593701 | val_loss: 0.591988 | train_f1: 0.858420 | val_f1: 0.858420
epoch: 3
train_loss: 0.471306 | val_loss: 0.473283 | train_f1: 0.864000 | val_f1: 0.864000
epoch: 4
train_loss: 0.374797 | val_loss: 0.378605 | train_f1: 0.881239 | val_f1: 0.881239
epoch: 5
train_loss: 0.348111 | val_loss: 0.347741 | train_f1: 0.883636 | val_f1: 0.883636
epoch: 6
train_loss: 0.304542 | val_loss: 0.306280 | train_f1: 0.927487 | val_f1: 0.927487
epoch: 7
train_loss: 0.299047 | val_loss: 0.300588 | train_f1: 0.908108 | val_f1: 0.908108
epoch: 8
train_loss: 0.403019 | val_loss: 0.396028 | train_f1: 0.715203 | val_f1: 0.715203
epoch: 9
train_loss: 0.238373 | val_loss: 0.237776 | train_f1: 0.946087 | val_f1: 0.946087
epoch: 10
train_loss: 0.322722 | val_loss: 0.321222 | train_f1: 0.802395 | val_f1: 0.802395
epoch: 11
train_loss: 0.236677 | val_loss: 0.239954 | train_f1: 0.919210 | val_f1: 0.919210
epoch: 12
train_loss: 0.240797 | val_loss: 0.237453 | train_f1: 0.941548 | val_f1: 0.941548
epoch: 13
train_loss: 0.167077 | val_loss: 0.165331 | train_f1: 0.973422 | val_f1: 0.973422
epoch: 14
train_loss: 0.158248 | val_loss: 0.158206 | train_f1: 0.965636 | val_f1: 0.965636
epoch: 15
train_loss: 0.280908 | val_loss: 0.278461 | train_f1: 0.861480 | val_f1: 0.861480
epoch: 16
train_loss: 0.153532 | val_loss: 0.152761 | train_f1: 0.972603 | val_f1: 0.972603
epoch: 17
train_loss: 0.160444 | val_loss: 0.162328 | train_f1: 0.954704 | val_f1: 0.954704
epoch: 18
train_loss: 0.413625 | val_loss: 0.422508 | train_f1: 0.876833 | val_f1: 0.876833
epoch: 19
train_loss: 0.092935 | val_loss: 0.094364 | train_f1: 0.996667 | val_f1: 0.996667


## Model evaluation

In [None]:
model.load_state_dict(best_state_dict)
train_loss, train_f1, train_scores_sent, train_scores_word, train_words, train_pred, train_target = eval_model(model, train_dataloader, criterion, device)
val_loss, val_f1, val_scores_sent, val_scores_word, val_words, val_pred, val_target = eval_model(model, val_dataloader, criterion, device)
test_loss, test_f1, test_scores_sent, test_scores_word, test_words, test_pred, test_target = eval_model(model, test_dataloader, criterion, device)
print('train_loss: %5f | train_f1: %5f'%(train_loss, train_f1)) 
print('val_loss: %5f | val_f1: %5f'%(val_loss, val_f1)) 
print('test_loss: %5f | test_f1: %5f'%(test_loss, test_f1)) 

train_loss: 0.094357 | train_f1: 0.996667
val_loss: 0.093115 | val_f1: 0.996667
test_loss: 0.794768 | test_f1: 0.735395
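
To make the model selection explicit, the sketch below replays the `best_val_f1` rule of the training loop on the `val_f1` values printed in the training log and compares the selected value with the test F1 reported above. The variable names are illustrative. Note that `train_f1` and `val_f1` coincide in every logged epoch, which may indicate that the validation loader iterates over the same users as the training loader; the notebook does not state this, so treat it as an assumption to check.

```python
# val_f1 per epoch, copied from the training log
val_f1_log = [0.790257, 0.803540, 0.858420, 0.864000, 0.881239, 0.883636,
              0.927487, 0.908108, 0.715203, 0.946087, 0.802395, 0.919210,
              0.941548, 0.973422, 0.965636, 0.861480, 0.972603, 0.954704,
              0.876833, 0.996667]

best_val_f1, best_epoch = 0.0, None
for epoch, val_f1 in enumerate(val_f1_log):
    if val_f1 > best_val_f1:      # strict improvement, as in the training loop
        best_val_f1, best_epoch = val_f1, epoch

test_f1 = 0.735395                # reported by the evaluation cell
gap = best_val_f1 - test_f1
print(best_epoch, best_val_f1, round(gap, 6))
```

The selected checkpoint is the one from the last epoch, and the gap of roughly 0.26 between the validation and test F1 suggests that the model overfits the data used for selection.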


# Task 5: Visualizing the attention

One of the benefits of attention mechanisms is that they let us identify which elements of the sentences are most important for the prediction.

In [None]:
from IPython.display import display, HTML
import matplotlib
import matplotlib.pyplot as plt

In [None]:
def colorize(words, color_array, color_sent):
    '''
        Function to visualize the attention, taken from https://gist.github.com/ihsgnef/f13c35cd46624c8f458a4d23589ac768
    '''
    # words is a list of words
    # color_array is an array of numbers between 0 and 1 of length equal to words
    # color_sent is the tweet-level attention, a number between 0 and 1

    # min-max normalize color_array
    color_array = np.array(color_array)
    vmax = np.max(color_array)
    vmin = np.min(color_array)
    color_array = (color_array-vmin)/(vmax-vmin)

    # matplotlib.cm.get_cmap was removed in recent Matplotlib releases
    cmap_sent = matplotlib.colormaps['Reds']
    cmap_word = matplotlib.colormaps['Blues']
    template = '<span class="barcode"; style="color: black; background-color: {}">{}</span>'
    # leading marker: four blocks colored by the tweet-level attention
    colored_string = template.format(matplotlib.colors.rgb2hex(cmap_sent(color_sent)[:3]), '&nbsp  &nbsp')*4 + " "
    for word, color in zip(words, color_array):
        color = matplotlib.colors.rgb2hex(cmap_word(color)[:3])
        colored_string += template.format(color, '&nbsp' + word + '&nbsp')
    return colored_string

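def colorize_sent(sent, scores, scores_sent):
    '''
        Display every tweet of one user, colored by word-level and tweet-level attention.
        sent: flat list with all the words of the user.
        scores: padded word-level attention scores, one row per tweet.
        scores_sent: tweet-level attention scores.
    '''
    # min-max normalize scores_sent
    color_array = np.array(scores_sent)
    vmax = np.max(color_array)
    vmin = np.min(color_array)
    scores_sent = (color_array-vmin)/(vmax-vmin)

    pos0 = 0
    # Iterate over the sentences (tweets)
    for i, score_word in enumerate(scores):
        # number of words in the tweet = number of nonzero scores
        posf = len(np.where(score_word)[0])
        words = sent[pos0:(pos0+posf)]
        pos0 += posf
        if len(words)==0:
            return
        s = colorize(words, score_word[:posf], scores_sent[i])
        display(HTML(s))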
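
The way `colorize_sent` recovers tweet boundaries can be sketched with plain lists. `tweet_spans` is a hypothetical helper and the words and scores are made up; like the original function, it assumes that padded positions hold exactly zero and that the valid scores of each row come first.

```python
def tweet_spans(flat_words, padded_scores):
    """Yield (words, scores) per tweet; a tweet's length is its number of nonzero scores."""
    start = 0
    for row in padded_scores:
        n_words = sum(1 for s in row if s != 0)
        if n_words == 0:          # only padding left: stop, as colorize_sent does
            return
        yield flat_words[start:start + n_words], row[:n_words]
        start += n_words


words = ["good", "morning", "coffee", "time", "now"]
scores = [[0.7, 0.3, 0.0],    # tweet 1: two words and one padded slot
          [0.2, 0.5, 0.3],    # tweet 2: three words
          [0.0, 0.0, 0.0]]    # padding row
spans = list(tweet_spans(words, scores))
print(spans)
```

If a valid word ever received a score of exactly zero, the boundaries would shift, so passing the true tweet lengths explicitly would be a more robust alternative.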


Words with the highest attention are shown in blue and those with the lowest attention in white. Likewise, the tweets with the highest attention are marked in dark red (the bar on the left side), while those with the lowest attention range from pink to white.

In [None]:
att = np.linspace(0,1,50)
p = [' ']*50
s = colorize(p, att,0.9)
# to display in ipython notebook
display(HTML(s))

In [None]:
for j in range(4):
    category = 'Male' if train_pred[j]==1 else 'Female'
    category0 = 'Male' if train_target[j]==1 else 'Female'
    print('Predicted category: %s'%(category))
    print('True category:      %s'%(category0))
    # to display in ipython notebook
    colorize_sent(train_words[j], train_scores_word[j], train_scores_sent[j])

Predicted category: Female
True category:      Male


Predicted category: Male
True category:      Male


Predicted category: Female
True category:      Female


Predicted category: Female
True category:      Female
