<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width="500" height="450">
<h3 style="text-align: center;"><b>Phystech School of Applied Mathematics and Informatics (FPMI), MIPT</b></h3>

---

# Assignment 3

## Text Classification

In this assignment you will try several methods used for classification, and also find out how well the model understands the meaning of words and which words in an example influence the result.

In [None]:
# import torchtext
# torchtext.__version__

'0.13.1'

In [None]:
# Downgrade torchtext so that the imports below work
!pip install torchtext==0.8

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.8
  Downloading torchtext-0.8.0-cp37-cp37m-manylinux1_x86_64.whl (6.9 MB)
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.13.1
    Uninstalling torchtext-0.13.1:
      Successfully uninstalled torchtext-0.13.1
Successfully installed torchtext-0.8.0


In [None]:
import pandas as pd
import numpy as np
import torch

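# Retry the import once if the first attempt raises OSError
# (presumably a workaround for a failing first import after the downgrade).
try:
  from torchtext.datasets import IMDB
except OSError:
  from torchtext.datasets import IMDB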

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

from torchtext.vocab import Vectors, GloVe

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
from tqdm.autonotebook import tqdm

In [None]:
import warnings
warnings.filterwarnings("ignore") 

In [None]:
torch.__version__

'1.12.1+cu113'

In this assignment we will use the torchtext library. It is fairly easy to use and lets us concentrate on the task rather than on writing a DataLoader.

In [None]:
TEXT = Field(sequential=True, lower=True, include_lengths=True, batch_first=True)  # text field
LABEL = LabelField(dtype=torch.float)  # label field

In [None]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

The dataset we will experiment with consists of movie reviews from the IMDB website.

In [None]:
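train, test = IMDB.splits(TEXT, LABEL)  # load the dataset
random.seed(SEED)  # random.seed() returns None, so seed first and pass the state explicitly
train, valid = train.split(random_state=random.getstate())  # split off a validation set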

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:10<00:00, 8.08MB/s]


In [None]:
TEXT.build_vocab(train)
LABEL.build_vocab(train)

In [None]:
LABEL.vocab.stoi

defaultdict(None, {'neg': 0, 'pos': 1})

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

train_iter, valid_iter, test_iter = BucketIterator.splits(
    (train, valid, test), 
    batch_size = 64,
    sort_within_batch = True,
    device = device)
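
To make the role of the iterator more concrete, here is a minimal pure-Python sketch of length bucketing: examples of similar length are grouped into the same batch so that less padding is needed. The helper names `bucket_batches` and `padding_cost` and the parameter `pool_factor` are hypothetical, and the lengths are made up; this illustrates the general idea and is not torchtext's actual implementation.

```python
import random


def bucket_batches(lengths, batch_size, pool_factor=100, seed=0):
    """Yield batches of example indices so that each batch holds similar lengths."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    pool_size = batch_size * pool_factor
    for start in range(0, len(indices), pool_size):
        # Sort one large pool by length, then cut it into consecutive batches.
        pool = sorted(indices[start:start + pool_size], key=lambda i: lengths[i])
        batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
        rng.shuffle(batches)  # keep the order of the batches random
        yield from batches


def padding_cost(batches, lengths):
    """Number of pad tokens needed if every batch is padded to its longest example."""
    return sum(
        max(lengths[i] for i in batch) * len(batch) - sum(lengths[i] for i in batch)
        for batch in batches
    )


rng = random.Random(1)
lengths = [rng.randint(5, 500) for _ in range(1000)]  # made-up review lengths
bucketed = list(bucket_batches(lengths, batch_size=64))
unsorted = [list(range(i, min(i + 64, len(lengths)))) for i in range(0, len(lengths), 64)]
print(padding_cost(bucketed, lengths), padding_cost(unsorted, lengths))
```

On this toy data the bucketed batches need far fewer pad tokens than batches taken in the original order, which is the main reason to bucket by length.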

## RNN

Let's start with recurrent neural networks. You met GRU in the seminar; you can also try LSTM. For classification you can use either the hidden state or the output of the last token.
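
The two options can be compared directly. The sketch below is a toy setup with random inputs and assumed sizes, not the assignment's model: for a single-layer, unidirectional GRU without padding, the output at the last time step coincides with the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

gru = nn.GRU(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(4, 10, 8)      # (batch, seq_len, features), no padding

out, hidden = gru(x)           # out: (4, 10, 16), hidden: (1, 4, 16)
last_output = out[:, -1, :]    # output at the last time step
last_hidden = hidden[-1]       # final hidden state of the top layer

same = torch.allclose(last_output, last_hidden, atol=1e-6)
print(same)
```

With padded batches the last time step is usually a pad token, so the two choices differ unless the sequences are packed; with several layers or a bidirectional RNN, `hidden` also holds one state per layer and direction.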

In [None]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.GRU(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bias=True,
            batch_first=True,
            dropout=dropout,
            bidirectional=bidirectional
        )
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
        
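    def forward(self, text):        
        
        x = self.embedding(text)    # (batch, seq_len, embedding_dim)
        out, hidden = self.rnn(x)   # hidden: (n_layers * num_directions, batch, hidden_dim)
        # Sum the final hidden states over all layers and directions
        output = self.fc(hidden.sum(dim=0))
            
        return output.flatten()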

Experiment with the hyperparameters.

In [None]:
vocab_size = len(TEXT.vocab)
emb_dim = 300
hidden_dim = 256
output_dim = 1
n_layers = 3
bidirectional = True
dropout = 0.2
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
patience=3

In [None]:
model = RNNBaseline(
    vocab_size=vocab_size,
    embedding_dim=emb_dim,
    hidden_dim=hidden_dim,
    output_dim=output_dim,
    n_layers=n_layers,
    bidirectional=bidirectional,
    dropout=dropout,
    pad_idx=PAD_IDX
)

In [None]:
model = model.to(device)

In [None]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCEWithLogitsLoss()

max_epochs = 20

Train the network! Use whatever tools you find convenient: Catalyst, PyTorch Lightning, or your own hand-written loop.

In [None]:
def fit_model(model, opt, loss_func, train_iter, valid_iter, max_epochs, patience):

    min_loss = np.inf

    cur_patience = 0
    max_grad_norm = 2
    for epoch in range(1, max_epochs + 1):
        train_loss = 0.0
        train_acc = 0.0
        train_num_objs = 0    
        model.train()
        pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
        pbar.set_description(f"Epoch {epoch}. Training")
        for it, batch in pbar:
            opt.zero_grad()
            texts = batch.text[0].to(device)
            labels = batch.label.to(device)
            preds = model(texts)
            loss = loss_func(preds, labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            opt.step()
            preds = torch.sigmoid(preds) > 0.5
            cur_acc = (labels == preds).float().sum()
            train_acc += cur_acc
            train_num_objs += len(labels)
            train_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive
            pbar.set_description(f"Epoch {epoch}. Train Loss: {loss:.4}. Train acc: {cur_acc / len(labels):.4}")
        train_loss /= len(train_iter)
        train_acc /= train_num_objs
        val_loss = 0.0
        val_acc = 0.0
        val_num_objs = 0
        model.eval()
        pbar = tqdm(enumerate(valid_iter), total=len(valid_iter), leave=False)
        pbar.set_description(f"Epoch {epoch}. Validation")
        for it, batch in pbar:
            with torch.no_grad():
              texts = batch.text[0].to(device)
              labels = batch.label.to(device)
              preds = model(texts)
              loss = loss_func(preds, labels)
              preds = torch.sigmoid(preds) > 0.5
              cur_acc = (labels == preds).float().sum()
              val_acc += cur_acc
              val_num_objs += len(labels)
              val_loss += loss
              pbar.set_description(f"Epoch {epoch}. Val Loss: {loss:.4}. Val acc: {cur_acc / len(labels):.4}")
        val_loss /= len(valid_iter)
        val_acc /= val_num_objs
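        if val_loss < min_loss:
            min_loss = val_loss
            # state_dict() returns references to the live tensors, so clone them
            best_model = {k: v.detach().clone() for k, v in model.state_dict().items()}
            cur_patience = 0  # count only consecutive epochs without improvement
        else:
            cur_patience += 1
            if cur_patience == patience:
                break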
        
        print('Epoch: {}, Training Loss: {:.4}, Training Acc: {:.4}, Validation Loss: {:.4}, Validation Acc: {:.4}'.format(epoch, train_loss, train_acc, val_loss, val_acc))
    model.load_state_dict(best_model)
    return model
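
The stopping rule used in the loop can be isolated into a small helper. This is a minimal sketch under one assumption: the counter resets whenever the validation loss improves, so only consecutive bad epochs count. The class name `EarlyStopping` and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Track the best validation loss and count consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3)
val_losses = [0.45, 0.38, 0.39, 0.37, 0.40, 0.41, 0.42, 0.30]  # made-up values
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss, epoch):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

Here training stops at epoch 7, after three consecutive epochs that failed to beat the loss from epoch 4; the weights saved at `best_epoch` are the ones to restore.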

In [None]:
model = fit_model(model, opt, loss_func, train_iter, valid_iter, max_epochs, patience)

  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.6014, Training Acc: 0.6645, Validation Loss: 0.441, Validation Acc: 0.8012


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.3081, Training Acc: 0.868, Validation Loss: 0.3589, Validation Acc: 0.8437


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.1396, Training Acc: 0.9481, Validation Loss: 0.3591, Validation Acc: 0.8641


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Epoch: 4, Training Loss: 0.04034, Training Acc: 0.9859, Validation Loss: 0.6188, Validation Acc: 0.8373


  0%|          | 0/274 [00:00<?, ?it/s]

  0%|          | 0/118 [00:00<?, ?it/s]

Compute the f1-score of your classifier on the test dataset.

**Answer**: 0.8619

With nn.utils.rnn.pack_padded_sequence and concatenation of the hidden layers, the best result was **0.78**.
So the final version keeps the variant in which the hidden layers are summed.

In addition, I tested different dropout values, numbers of GRU layers, bidirectional on and off, and different learning rates. The final result is the solution with the highest f1-score obtained in these experiments.
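
For reference, the packed variant mentioned above might look roughly like the sketch below. It is not the code that produced the 0.78 score: the class name `PackedGRUClassifier`, the sizes, and the token indices are assumptions made for illustration. It packs the padded batch with the true lengths and concatenates the final forward and backward hidden states.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class PackedGRUClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, text, lengths):
        embedded = self.embedding(text)                       # (batch, seq_len, emb_dim)
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, hidden = self.rnn(packed)                          # (2, batch, hidden_dim)
        final = torch.cat((hidden[-2], hidden[-1]), dim=1)    # forward and backward states
        return self.fc(final).flatten()


torch.manual_seed(0)
model = PackedGRUClassifier(vocab_size=50, emb_dim=8, hidden_dim=16, pad_idx=1)
model.eval()

lengths = torch.tensor([3, 5])
text = torch.tensor([[5, 6, 7, 1, 1], [8, 9, 10, 11, 12]])
longer = torch.tensor([[5, 6, 7, 1, 1, 1, 1], [8, 9, 10, 11, 12, 1, 1]])

with torch.no_grad():
    logits = model(text, lengths)
    logits_longer = model(longer, lengths)   # same sentences, more padding
print(torch.allclose(logits, logits_longer, atol=1e-6))
```

Because the RNN only sees the first `lengths[i]` tokens of each sequence, adding more padding does not change the logits. With `include_lengths=True`, `batch.text` is a pair of tensors `(text, lengths)`, so both inputs are available in the training loop.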

In [None]:
def f1_score_from_model(model, test_iter):

    model.eval()
    pbar = tqdm(enumerate(test_iter), total=len(test_iter), leave=True)
    TP = 0
    FP = 0
    FN = 0
    for it, batch in pbar:
        with torch.no_grad():
          texts = batch.text[0].to(device)
          labels = batch.label.to(device)
          preds = model(texts)
          preds = torch.sigmoid(preds) > 0.5
          TP += ((preds == 1) & (labels == 1)).sum()
          FP += ((preds == 1) & (labels == 0)).sum()
          FN += ((preds == 0) & (labels == 1)).sum()

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    F1 = 2 * precision * recall / (precision + recall)

    print(f'F1 score test dataset: {F1:.4}')
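
As a sanity check of the formula, the same computation can be done on plain integer counts. The sketch below uses made-up counts and a hypothetical helper `f1_from_counts`; unlike the function above, it returns 0.0 instead of dividing by zero when there are no positive predictions or no positive labels.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# precision = 80 / 100, recall = 80 / 90, so F1 = 160 / 190
score = f1_from_counts(tp=80, fp=20, fn=10)
print(f"{score:.4f}")
```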

In [None]:
f1_score_from_model(model, test_iter)

  0%|          | 0/391 [00:00<?, ?it/s]

F1 score test dataset: 0.8619


## CNN

![](https://www.researchgate.net/publication/333752473/figure/fig1/AS:769346934673412@1560438011375/Standard-CNN-on-text-classification.png)

Convolutional neural networks are also often used for text classification. The idea is that sentiment is usually carried by phrases of two or three words, such as "very good movie" or "incredibly boring". Sliding a convolution over these words produces a large score, which MaxPool then picks out. An ordinary fully connected layer follows. An important point: the convolutions are applied in parallel, not sequentially. Let's try it!

In [None]:
# include_lengths=True is kept (as for the RNN) so that the training function does not have to change
TEXT = Field(sequential=True, lower=True, batch_first=True, include_lengths=True)  # batch_first because we use conv
LABEL = LabelField(batch_first=True, dtype=torch.float)

train, tst = IMDB.splits(TEXT, LABEL)
random.seed(SEED)  # random.seed() returns None, so seed first and pass the state explicitly
trn, vld = train.split(random_state=random.getstate())

TEXT.build_vocab(trn)
LABEL.build_vocab(trn)

device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(128, 256, 256),
        sort=False,
        sort_key= lambda x: len(x.text),  # the examples have a `text` field, not `src`
        sort_within_batch=False,
        device=device,
        repeat=False,
)

You can use Conv2d with `in_channels=1, kernel_size=(kernel_sizes[0], emb_dim)` or Conv1d with `in_channels=emb_dim, kernel_size=kernel_sizes[0]`. But think carefully about the shapes in both cases.
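
The shapes in the two cases can be checked on a random tensor. This is a sketch with assumed sizes, not part of the model: after copying the weights from one layer into the other, both layers compute the same n-gram features; only the expected input layout differs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, emb_dim, out_channels, k = 2, 10, 300, 64, 3
embedded = torch.randn(batch, seq_len, emb_dim)   # output of nn.Embedding with batch_first=True

# Conv1d treats the embedding dimension as channels: input (batch, emb_dim, seq_len)
conv1d = nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=k)

# Conv2d treats the sentence as a one-channel image: input (batch, 1, seq_len, emb_dim)
conv2d = nn.Conv2d(in_channels=1, out_channels=out_channels, kernel_size=(k, emb_dim))

# Give both layers the same weights so that the outputs can be compared
with torch.no_grad():
    conv1d.weight.copy_(conv2d.weight.squeeze(1).transpose(1, 2))
    conv1d.bias.copy_(conv2d.bias)

out1d = conv1d(embedded.transpose(1, 2))            # (batch, out_channels, seq_len - k + 1)
out2d = conv2d(embedded.unsqueeze(1)).squeeze(3)    # (batch, out_channels, seq_len - k + 1)

print(out1d.shape, out2d.shape)
print(torch.allclose(out1d, out2d, atol=1e-4))
```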

In [None]:
class CNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        emb_dim,
        out_channels,
        kernel_sizes,
        dropout=0.5,
    ):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.conv_0 = nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=kernel_sizes[0])
        
        self.conv_1 = nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=kernel_sizes[1])
        
        self.conv_2 = nn.Conv1d(in_channels=emb_dim, out_channels=out_channels, kernel_size=kernel_sizes[2])
        
        self.fc = nn.Linear(len(kernel_sizes) * out_channels, 1)
        
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, text): 
        
        embedded = self.embedding(text)       # (batch, seq_len, emb_dim)
        
        embedded = embedded.transpose(1, 2)   # (batch, emb_dim, seq_len), the layout Conv1d expects
        
        # Each convolution gives (batch, out_channels, seq_len - kernel_size + 1)
        conved_0 = F.relu(self.conv_0(embedded))
        conved_1 = F.relu(self.conv_1(embedded))
        conved_2 = F.relu(self.conv_2(embedded))
        
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
        
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))
            
        return self.fc(cat).flatten()

In [None]:
kernel_sizes = [3, 4, 5]
vocab_size = len(TEXT.vocab)
out_channels=64
dropout = 0.2
dim = 300

model = CNN(vocab_size=vocab_size, emb_dim=dim, out_channels=out_channels,
            kernel_sizes=kernel_sizes, dropout=dropout)

In [None]:
model.to(device)

CNN(
  (embedding): Embedding(201988, 300)
  (conv_0): Conv1d(300, 64, kernel_size=(3,), stride=(1,))
  (conv_1): Conv1d(300, 64, kernel_size=(4,), stride=(1,))
  (conv_2): Conv1d(300, 64, kernel_size=(5,), stride=(1,))
  (fc): Linear(in_features=192, out_features=1, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)

In [None]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCEWithLogitsLoss()

In [None]:
max_epochs = 30
patience = 3

Train it!

In [None]:
model = fit_model(model, opt, loss_func, train_iter, val_iter, max_epochs, patience)

  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.5653, Training Acc: 0.693, Validation Loss: 0.45, Validation Acc: 0.7871


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.4001, Training Acc: 0.8179, Validation Loss: 0.3773, Validation Acc: 0.8331


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.297, Training Acc: 0.8749, Validation Loss: 0.3469, Validation Acc: 0.85


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 4, Training Loss: 0.2055, Training Acc: 0.9209, Validation Loss: 0.3442, Validation Acc: 0.8511


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 5, Training Loss: 0.1478, Training Acc: 0.9459, Validation Loss: 0.3348, Validation Acc: 0.8619


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 6, Training Loss: 0.09931, Training Acc: 0.966, Validation Loss: 0.3548, Validation Acc: 0.8563


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 7, Training Loss: 0.07298, Training Acc: 0.9741, Validation Loss: 0.3368, Validation Acc: 0.864


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Compute the f1-score of your classifier.

**Answer**: 0.8652

In [None]:
f1_score_from_model(model, test_iter)

  0%|          | 0/98 [00:00<?, ?it/s]

F1 score test dataset: 0.8652


## Interpretability

Let's see what our model pays attention to. It is enough to run the code below.
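
The code below relies on Integrated Gradients. To show what the method computes, here is a minimal pure-Python sketch on a toy logistic model with made-up weights (all names are hypothetical, and this is not Captum's implementation). Gradients are averaged along the straight path from a baseline to the input, and the attributions should sum to the difference between the two model outputs. The remaining gap plays the role of the `delta` value printed for each sentence.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(x, w):
    """Toy classifier: sigmoid of a weighted sum of the input features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def toy_gradient(x, w):
    """Analytic gradient of toy_model with respect to x."""
    s = toy_model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, w, n_steps=1000):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg_grad = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(toy_gradient(point, w)):
            avg_grad[i] += g / n_steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


w = [2.0, -1.0, 0.5]          # made-up weights
x = [1.0, 0.5, -2.0]          # input to explain
baseline = [0.0, 0.0, 0.0]    # reference input, analogous to an all-padding sentence

attr = integrated_gradients(x, baseline, w)
delta = sum(attr) - (toy_model(x, w) - toy_model(baseline, w))
print(attr, delta)
```

The first feature gets a positive attribution and the other two negative ones, matching the signs of `w[i] * x[i]`.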

In [None]:
!pip install -q captum

In [None]:
from captum.attr import LayerIntegratedGradients, TokenReferenceBase, visualization

PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]  # index of the padding token ('<pad>' by default), not of the word 'pad'

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model, model.embedding)

In [None]:
# Unused below: the model returns a single logit per example, so the sigmoid version is used
def forward_with_softmax(inp):
    logits = model(inp)
    return torch.softmax(logits, 0)[0][1]

def forward_with_sigmoid(input):
    return torch.sigmoid(model(input))


# accumulate a few samples in this list for visualization purposes
vis_data_records_ig = []

def interpret_sentence(model, sentence, min_len = 7, label = 0):
    model.eval()
    # preprocess() tokenizes and lowercases, matching how the vocabulary was built
    text = TEXT.preprocess(sentence)
    if len(text) < min_len:
        text += [TEXT.pad_token] * (min_len - len(text))
    indexed = [TEXT.vocab.stoi[t] for t in text]

    model.zero_grad()

    input_indices = torch.tensor(indexed, device=device)
    input_indices = input_indices.unsqueeze(0)
    
    # input_indices dim: [1, sequence_length]
    seq_length = len(indexed)  # the reference must match the input length, also for sentences longer than min_len

    # predict
    pred = forward_with_sigmoid(input_indices).item()
    pred_ind = round(pred)

    # generate reference indices for each sample
    reference_indices = token_reference.generate_reference(seq_length, device=device).unsqueeze(0)

    # compute attributions and approximation delta using layer integrated gradients
    attributions_ig, delta = lig.attribute(input_indices, reference_indices, \
                                           n_steps=5000, return_convergence_delta=True)

    print('pred: ', LABEL.vocab.itos[pred_ind], '(', '%.2f'%pred, ')', ', delta: ', abs(delta))

    add_attributions_to_visualizer(attributions_ig, text, pred, pred_ind, label, delta, vis_data_records_ig)
    
def add_attributions_to_visualizer(attributions, text, pred, pred_ind, label, delta, vis_data_records):
    attributions = attributions.sum(dim=2).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    attributions = attributions.cpu().detach().numpy()

    # storing couple samples in an array for visualization purposes
    vis_data_records.append(visualization.VisualizationDataRecord(
                            attributions,
                            pred,
                            LABEL.vocab.itos[pred_ind],
                            LABEL.vocab.itos[label],
                            LABEL.vocab.itos[1],
                            attributions.sum(),       
                            text,
                            delta))

In [None]:
interpret_sentence(model, 'It was a fantastic performance !', label=1)
interpret_sentence(model, 'Best film ever', label=1)
interpret_sentence(model, 'Such a great show!', label=1)
interpret_sentence(model, 'It was a horrible movie', label=0)
interpret_sentence(model, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model, 'It is a disgusting movie!', label=0)
# my own examples
interpret_sentence(model, 'stupid movie! wasting time', label=0)
interpret_sentence(model, 'im not sure, perhaps its not bad', label=1)
interpret_sentence(model, 'something strange and inexplicit', label=0)
interpret_sentence(model, 'you shoud see it', label=1)
interpret_sentence(model, 'scary and impossible to break away', label=1)

pred:  pos ( 0.98 ) , delta:  tensor([3.1916e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.80 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  pos ( 1.00 ) , delta:  tensor([4.0493e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.01 ) , delta:  tensor([5.7669e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.01 ) , delta:  tensor([0.0003], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.44 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([5.1827e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.05 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.65 ) , delta:  tensor([8.6406e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.94 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.90 ) , delta:  tensor([1.8841e-05], device='cuda:0', dtype=torch.float64)


Try adding your own examples!

In [None]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)

Visualize attributions based on Integrated Gradients


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| pos | pos (0.98) | pos | 1.53 | It was a fantastic performance ! pad |
| pos | pos (0.80) | pos | 1.51 | Best film ever pad pad pad pad |
| pos | pos (1.00) | pos | 1.26 | Such a great show! pad pad pad |
| neg | neg (0.01) | pos | -0.72 | It was a horrible movie pad pad |
| neg | neg (0.01) | pos | -0.91 | I've never watched something as bad pad |



## Word Embeddings

You haven't forgotten how we can apply what we know about word2vec and GloVe, have you? Let's try it!

In [None]:
vec = GloVe(name='42B', dim=300)

.vector_cache/glove.42B.300d.zip: 1.88GB [05:54, 5.30MB/s]                            
100%|█████████▉| 1917493/1917494 [02:49<00:00, 11306.01it/s]


In [None]:
TEXT = Field(sequential=True, lower=True, batch_first=True, include_lengths=True)
LABEL = LabelField(batch_first=True, dtype=torch.float)

train, tst = IMDB.splits(TEXT, LABEL)
random.seed(SEED)  # random.seed() returns None, so seed first and pass the state explicitly
trn, vld = train.split(random_state=random.getstate())
TEXT.build_vocab(trn, vectors=vec) 
LABEL.build_vocab(trn)

word_embeddings = TEXT.vocab.vectors

kernel_sizes = [3, 4, 5]
vocab_size = len(TEXT.vocab)
dropout = 0.2
dim = 300

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(128, 256, 256),
        sort=False,
        sort_key= lambda x: len(x.text),  # the examples have a `text` field, not `src`
        sort_within_batch=False,
        device=device,
        repeat=False,
)

In [None]:
type(model.embedding.weight), type(word_embeddings)

(torch.nn.parameter.Parameter, torch.Tensor)

In [None]:
model = CNN(vocab_size=vocab_size, emb_dim=dim, out_channels=64,
            kernel_sizes=kernel_sizes, dropout=dropout)

# word_embeddings = TEXT.vocab.vectors

prev_shape = model.embedding.weight.shape

model.embedding.weight = nn.parameter.Parameter(data=word_embeddings, requires_grad=True)

assert prev_shape == model.embedding.weight.shape
model.to(device)

opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCEWithLogitsLoss()
max_epochs = 30
patience = 3

You know what to do.

In [None]:
# Keep the embeddings frozen for the first 3 epochs
def freeze_embeddings(model, req_grad=False):
    embeddings = model.embedding
    for c_p in embeddings.parameters():
        c_p.requires_grad = req_grad
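
An alternative to assigning the weight matrix and toggling `requires_grad` by hand is `nn.Embedding.from_pretrained`, which builds the layer from a tensor and can freeze it right away. The sketch below uses a small random tensor as a stand-in for `TEXT.vocab.vectors`; the sizes are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pretrained = torch.randn(10, 4)   # stand-in for TEXT.vocab.vectors: (vocab_size, dim)

# freeze=True sets requires_grad=False on the embedding weights
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)
frozen_at_start = not embedding.weight.requires_grad

# Later, e.g. after a few warm-up epochs, fine-tuning can be enabled
embedding.weight.requires_grad_(True)
trainable_now = embedding.weight.requires_grad

print(frozen_at_start, trainable_now)
```

If the optimizer was built only from parameters that required gradients at the time, the embedding weights would also have to be added to it after unfreezing; an optimizer created from all `model.parameters()` already includes them.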

In [None]:
freeze_embeddings(model, req_grad=False)
min_loss = np.inf

cur_patience = 0
max_grad_norm = 2
for epoch in range(1, max_epochs + 1):
    if epoch > 3:
      freeze_embeddings(model, req_grad=True)  # unfreeze the embeddings after the first 3 epochs
    train_loss = 0.0
    train_acc = 0.0
    train_num_objs = 0    
    model.train()
    pbar = tqdm(enumerate(train_iter), total=len(train_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}. Training")
    for it, batch in pbar:
        opt.zero_grad()
        texts = batch.text[0].to(device)
        labels = batch.label.to(device)
        preds = model(texts)
        loss = loss_func(preds.flatten(), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        opt.step()
        preds = torch.sigmoid(preds.flatten()) > 0.5
        cur_acc = (labels == preds).float().sum()
        train_acc += cur_acc
        train_num_objs += len(labels)
        train_loss += loss
        pbar.set_description(f"Epoch {epoch}. Train Loss: {loss:.4}. Train acc: {cur_acc / len(labels):.4}")
    train_loss /= len(train_iter)
    train_acc /= train_num_objs
    val_loss = 0.0
    val_acc = 0.0
    val_num_objs = 0
    model.eval()
    pbar = tqdm(enumerate(val_iter), total=len(val_iter), leave=False)
    pbar.set_description(f"Epoch {epoch}. Validation")
    for it, batch in pbar:
        with torch.no_grad():
          texts = batch.text[0].to(device)
          labels = batch.label.to(device)
          preds = model(texts)
          loss = loss_func(preds.flatten(), labels)
          preds = torch.sigmoid(preds.flatten()) > 0.5
          cur_acc = (labels == preds).float().sum()
          val_acc += cur_acc
          val_num_objs += len(labels)
          val_loss += loss
          pbar.set_description(f"Epoch {epoch}. Val Loss: {loss:.4}. Val acc: {cur_acc / len(labels):.4}")
    val_loss /= len(val_iter)
    val_acc /= val_num_objs
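    if val_loss < min_loss:
        min_loss = val_loss
        # state_dict() returns references to the live tensors, so clone them
        best_model = {k: v.detach().clone() for k, v in model.state_dict().items()}
        cur_patience = 0  # count only consecutive epochs without improvement
    else:
        cur_patience += 1
        if cur_patience == patience:
            break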
    
    print('Epoch: {}, Training Loss: {:.4}, Training Acc: {:.4}, Validation Loss: {:.4}, Validation Acc: {:.4}'.format(epoch, train_loss, train_acc, val_loss, val_acc))
model.load_state_dict(best_model)

  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 1, Training Loss: 0.4737, Training Acc: 0.7765, Validation Loss: 0.3467, Validation Acc: 0.8515


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 2, Training Loss: 0.3171, Training Acc: 0.8666, Validation Loss: 0.3139, Validation Acc: 0.866


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 3, Training Loss: 0.2555, Training Acc: 0.8975, Validation Loss: 0.2922, Validation Acc: 0.8767


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 4, Training Loss: 0.2048, Training Acc: 0.9223, Validation Loss: 0.2701, Validation Acc: 0.8865


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 5, Training Loss: 0.09815, Training Acc: 0.9742, Validation Loss: 0.2691, Validation Acc: 0.8896


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 6, Training Loss: 0.041, Training Acc: 0.9942, Validation Loss: 0.275, Validation Acc: 0.8892


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Epoch: 7, Training Loss: 0.01687, Training Acc: 0.9991, Validation Loss: 0.2803, Validation Acc: 0.8932


  0%|          | 0/137 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

<All keys matched successfully>

Compute the f1-score of your classifier.

**Answer**: 0.891

In [None]:
f1_score_from_model(model, test_iter)

  0%|          | 0/98 [00:00<?, ?it/s]

F1 score test dataset: 0.891


Let's check how well everything works!

In [None]:
PAD_IND = TEXT.vocab.stoi[TEXT.pad_token]  # index of the padding token ('<pad>' by default), not of the word 'pad'

token_reference = TokenReferenceBase(reference_token_idx=PAD_IND)
lig = LayerIntegratedGradients(model, model.embedding)
vis_data_records_ig = []

interpret_sentence(model, 'It was a fantastic performance !', label=1)
interpret_sentence(model, 'Best film ever', label=1)
interpret_sentence(model, 'Such a great show!', label=1)
interpret_sentence(model, 'It was a horrible movie', label=0)
interpret_sentence(model, 'I\'ve never watched something as bad', label=0)
interpret_sentence(model, 'It is a disgusting movie!', label=0)
# my own examples
interpret_sentence(model, 'stupid movie! wasting time', label=0)
interpret_sentence(model, 'im not sure, perhaps its not bad', label=1)
interpret_sentence(model, 'something strange and inexplicit', label=0)
interpret_sentence(model, 'you shoud see it', label=1)
interpret_sentence(model, 'scary and impossible to break away', label=1)

pred:  pos ( 0.99 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.55 ) , delta:  tensor([1.9757e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.97 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.23 ) , delta:  tensor([0.0001], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([0.0002], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.00 ) , delta:  tensor([0.0003], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.44 ) , delta:  tensor([5.3315e-06], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.68 ) , delta:  tensor([8.1160e-05], device='cuda:0', dtype=torch.float64)
pred:  pos ( 0.60 ) , delta:  tensor([6.4960e-05], device='cuda:0', dtype=torch.float64)
pred:  neg ( 0.43 ) , delta:  tensor([1.8369e-05], device='cuda:0', dtype=torch.float64)


In [None]:
print('Visualize attributions based on Integrated Gradients')
visualization.visualize_text(vis_data_records_ig)

Visualize attributions based on Integrated Gradients


| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| pos | pos (0.99) | pos | 0.98 | It was a fantastic performance ! pad |
| pos | pos (0.55) | pos | -1.11 | Best film ever pad pad pad pad |
| pos | pos (0.97) | pos | 1.08 | Such a great show! pad pad pad |
| neg | neg (0.00) | pos | -0.9 | It was a horrible movie pad pad |
| neg | neg (0.23) | pos | -0.94 | I've never watched something as bad pad |



Subjectively, the first CNN model is better. In particular, the first model confidently assigns the example "scary and impossible to break away" to the positive class (0.90, against 0.43 for the model with GloVe embeddings) and highlights the words "break away" as positive. Unfortunately, both models failed on "im not sure, perhaps its not bad".

On the other hand, using pretrained embeddings at the input instead of random initialization looks like a good idea: the test f1-score rose from 0.8652 to 0.891.