## Attention linking PoC: Part III

This is an overview of the research of utilizing attention in aim for more accurate and transparent Information Retrieval.

### Hyphotesis:

Transformers' attention reflects the internal relations of entities in the text and hence does reflect the importance of the pieces of text.

### Background

In `attention_kw_classification.ipynb`, we've seen that although some heads have significantly lower entropy towards predicting keywords, they are still not good enough at it, evenwhen combined.

Article [Are Sixteen Heads Really Better than One?](https://arxiv.org/abs/1905.10650) shows that "most attention heads can be individually removed after training without any significant downside in terms of test performance (§3.2)" on sequence classification tasks, but not so much on seq2seq tasks (Machine translation). "Remarkably, many attention layers can even be individually reduced to a single attention head without impacting test performance (§3.3)". 

Another [OpenAI blogger](https://aletheap.github.io/posts/2020/07/looking-for-grammar) shows that in GPT2, particular heads are most likely responsible for representing syntactical (PoS) properties of the text, by fine-tuning on them, or removing them. PoS is easier than keywords, but still similar task.

## In this notebook

We'll see, if certain heads can be responsible for handling the keywords. If that was the case, we can use the relevant subset of the attention links to:

1. weight the importance of parts of text
2. construct a graph of the coreferences, e.g. to weight the words as nodes, using Page Rank.

### Experiment Methodology:

We'll train the token classification, deciding upon each token, whether it is a keyword (class=1) or not (class=0).
We'll first evaluate, that Transformer can do that successfully.

If yes, we'll look for particular heads that are responsible for keyword classification:

1. We'll mask each one head and look how much the performance worsens.
2. We'll sort the heads by the impact of 1. and will gradually mask the least impactful to must impactful. We'll evaluate the performance for each selection.
3. We'll prune heads on training as in [this blog](https://towardsdatascience.com/head-pruning-in-transformer-models-ec222ca9ece7): we introduce a weight parameter for each head and will expect the weight of useless heads to converge to zero, as in seq2seq task here:

<div style="width:image width px; font-size:100%; text-align:center;"><img src='https://miro.medium.com/max/380/1*Uu_Fbe6l549HZevc6tVpaQ.gif' alt="alternate text" width="width" height="height" style="width:300px;height:200px;" /> </div>

### Evaluation

We'll measure an accuracy of token classification of the whole system, after the selected customization.

### Outputs

We'll see, if we can use only "purer" subset of the attentions for keyphrases detection, or weighting.

The best-performing methodology will be used for kephrase selection and respective weighting of the pieces of text in IRSystem.

In [1]:
from common import *
import random

In [2]:
ktexts = get_keyphrased_texts()

### Collect data and convert it to tensor features

In [3]:
from utils_ner import InputFeatures, MiniTokenClassificationDataset
import torch

In [4]:
model_id = "bert"
firstn = 100
max_length = 512

features = []

tokenizer, model = get_tokenizer_model(model_id)

for kw_key, ktext in tqdm(tuple(ktexts.items())[:firstn], desc="Converting texts to Token classification features for %s " % model_id):
    wpieces = get_wordpieces(ktext['text'], tokenizer, is_single_kword=False, max_length=max_length, as_tokens=True)
    wpieces_ids = get_wordpieces(ktext['text'], tokenizer, is_single_kword=False, max_length=max_length, 
                                 as_tokens=False, padding='max_length')
    kwds_postproc = [kw for kw in ktext['keywords'] if len(kw)>2]
    kwds_postproc += [" "+kw for kw in kwds_postproc]
    keywd_mask = get_keyphrases_mask(wpieces, [kw for kw in ktext['keywords'] if len(kw)>2], tokenizer).long()
    keywd_mask_padded = torch.cat((keywd_mask, torch.zeros((max_length-keywd_mask.__len__(),), dtype=torch.long)), dim=0) 

    non_padded_length = torch.full((wpieces.__len__(),), 1, dtype=torch.float)  # TODO
    attention_mask = torch.cat((non_padded_length, torch.zeros((max_length-wpieces.__len__(),))), dim=0)
    features.append(InputFeatures(input_ids=wpieces_ids, attention_mask=attention_mask, 
                                  token_type_ids=torch.zeros((max_length,)), label_ids=keywd_mask_padded))
    
train_dataset = MiniTokenClassificationDataset(features[:int(len(features)*0.9)])
eval_dataset = MiniTokenClassificationDataset(features[int(len(features)*0.9):])

Converting texts to Token classification features for bert : 100%|██████████| 100/100 [01:08<00:00,  1.46it/s]


In [12]:
from transformers import Trainer, AutoModelForTokenClassification, AutoConfig

config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
    cache_dir="cache_dir",
    device="cpu"
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    config=config,
    cache_dir="cache_dir",
)

trainer = Trainer(model, train_dataset=train_dataset, eval_dataset=eval_dataset, device="cpu")
# model(features)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

RuntimeError: CUDA out of memory. Tried to allocate 90.00 MiB (GPU 0; 1.96 GiB total capacity; 519.76 MiB already allocated; 4.62 MiB free; 562.00 MiB reserved in total by PyTorch)

In [8]:
trainer.train(model_path="bert_kw_trained")
trainer.save_model()

HBox(children=(FloatProgress(value=0.0, description='Epoch', max=3.0, style=ProgressStyle(description_width='i…

HBox(children=(FloatProgress(value=0.0, description='Iteration', max=12.0, style=ProgressStyle(description_wid…

TypeError: forward() got an unexpected keyword argument 'labels'