## Averaging the last 4 attention distributions using a GRU

We average together the last 4 attention distributions extracted for each input. Our classification model is an LSTM that reads the averaged attention distribution one word at a time. The maximum sequence length is generally 80, so the model reads 80 time steps, each of which is an 80-dimensional vector (one row of the 80 x 80 attention matrix).
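To make the input shape concrete, here is a minimal sketch of a recurrent classifier that consumes an 80 x 80 attention matrix row by row. It is not the model built by `ClassificationExperiment`: the class name `AttentionSequenceClassifier`, the hidden size of 64, and the single-logit binary output are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AttentionSequenceClassifier(nn.Module):
    """Hypothetical LSTM classifier over attention rows (not the notebook's model)."""

    def __init__(self, seq_len: int = 80, hidden_size: int = 64):
        super().__init__()
        # Each time step is one token's attention row, so input_size == seq_len.
        self.lstm = nn.LSTM(input_size=seq_len, hidden_size=hidden_size, batch_first=True)
        # Assumed binary task: a single logit per example.
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, attention: torch.Tensor) -> torch.Tensor:
        # attention: [batch, seq_len, seq_len]
        _, (h_n, _) = self.lstm(attention)
        # h_n: [num_layers, batch, hidden]; use the last layer's final state.
        return self.head(h_n[-1]).squeeze(-1)  # [batch] logits


model = AttentionSequenceClassifier()
# Fake attention rows that sum to 1, standing in for real extracted scores.
batch = torch.softmax(torch.randn(8, 80, 80), dim=-1)
logits = model(batch)
print(logits.shape)  # torch.Size([8])
```

Swapping `nn.LSTM` for `nn.GRU` would only change how the final state is unpacked, since a GRU returns `(output, h_n)` without a cell state.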


#### Notes
* One remaining question is how to experiment with different numbers of attention heads. More generally, `extract_attention_scores()` still has some bugs that need ironing out, such as only accepting a batch size of 1.

* We would also like to eventually use a transformer architecture on top of the attention distributions.

In [1]:
import sys; sys.path.append("../../../../..")
import torch 
from src.experiment import AttentionExperiment, ClassificationExperiment
from src.dataset import ExperimentDataset
from src.params import Params

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
params = Params.read_params("experiment_params.json")

In [4]:
# Load the dataset used in this experiment;
# typically this is the small set of ground-truth labels
dataset = ExperimentDataset.init_dataset(params)

02/19/2020 14:25:15 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
386it [00:00, 4915.57it/s]


In [5]:
attention_dataloader = dataset.return_dataloader() 

Attention Experiment: 
* Is a class that wraps useful methods for extracting attention distributions from a given BERT-based model.
* In the config file, the user needs to specify a .ckpt file for the trained BERT-based model from which we want to extract attention scores.
* The user needs to instantiate the attention experiment with a function that tells it how to run inference on the given model.

In [6]:
attention_experiment = AttentionExperiment.initialize_attention_experiment(params.intermediary_task, verbose=True)

02/19/2020 14:25:15 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084


The len of our vocabulary is 30523
Cuda is set to true


02/19/2020 14:25:16 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at ./cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
02/19/2020 14:25:16 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file ./cache/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpflrqtlwo
02/19/2020 14:25:19 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

02/19/2020 14:25:27 - INFO - pytor

Succesfully loaded in attention experiment!


In [7]:
dataset

Length: 324 Keys: dict_keys(['pre_ids', 'masks', 'pre_lens', 'post_in_ids', 'post_out_ids', 'pre_tok_label_ids', 'post_tok_label_ids', 'rel_ids', 'pos_ids', 'categories', 'index', 'bias_label'])

`extract_attention_scores()` works out of the box because the attention experiment stores the config file, so it knows which BERT model to load, which layers to extract attention scores from, and which inference function to use on that model.

`attention_scores` is then a list of dictionaries. In each dictionary, the keys are specific layers of the BERT model and the values are the attention distributions extracted from those layers.
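The structure described above can be sketched with fake data. This is an illustration only: the layer keys `8` to `11` (zero-indexed last four of twelve layers), the per-layer shape `[1, 1, 80, 80]`, and the helper `average_layers` are assumptions, and the helper is a stand-in rather than the actual `avg_attention_dist` implementation.

```python
import torch


def average_layers(example_scores: dict) -> torch.Tensor:
    """Hypothetical helper: element-wise mean over the per-layer tensors."""
    # Stack the layers on a new leading axis, then average that axis away.
    return torch.stack(list(example_scores.values()), dim=0).mean(dim=0)


seq_len = 80
# One dictionary per example; keys are layer indices (assumed), values are
# attention distributions whose rows sum to 1.
fake_scores = [
    {
        layer: torch.softmax(torch.randn(1, 1, seq_len, seq_len), dim=-1)
        for layer in (8, 9, 10, 11)
    }
    for _ in range(3)
]

averaged = [average_layers(scores) for scores in fake_scores]
print(averaged[0].shape)  # torch.Size([1, 1, 80, 80])
```

Because each row of every layer sums to 1, the rows of the layer average also sum to 1, so the result is still a valid attention distribution.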

In [8]:
attention_scores = attention_experiment.extract_attention_scores(attention_dataloader)




In [9]:
from src.utils.attention_utils import avg_attention_dist

In [10]:
avg_attention = avg_attention_dist(attention_scores)

In [11]:
avg_attention[0].shape

torch.Size([1, 1, 80, 80])

In [12]:
stacked_avg_attention = torch.stack(avg_attention).squeeze()
# squeezes from [324, 1, 1, 80, 80] --> [324, 80, 80]
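One caveat with a bare `.squeeze()` is that it removes every size-1 axis, including the example axis when only one example is stacked. The sketch below shows a variant that drops only the two singleton axes; the helper name `stack_attention` is hypothetical and not part of the notebook's codebase.

```python
import torch


def stack_attention(per_example):
    """Stack [1, 1, L, L] tensors into [N, L, L] without touching the N axis."""
    stacked = torch.stack(per_example)     # [N, 1, 1, L, L]
    return stacked.squeeze(1).squeeze(1)   # [N, L, L]


single = [torch.zeros(1, 1, 80, 80)]
print(torch.stack(single).squeeze().shape)  # torch.Size([80, 80])
print(stack_attention(single).shape)        # torch.Size([1, 80, 80])
```

With 324 examples both approaches give `[324, 80, 80]`; the difference only appears when an axis that should be kept happens to have length 1.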

In [13]:
dataset.add_data(stacked_avg_attention, "attention_dist")

In [14]:
dataset.shuffle_data()

In [15]:
dataset

Length: 324 Keys: dict_keys(['pre_ids', 'masks', 'pre_lens', 'post_in_ids', 'post_out_ids', 'pre_tok_label_ids', 'post_tok_label_ids', 'rel_ids', 'pos_ids', 'categories', 'index', 'bias_label', 'attention_dist'])

### This is where the classification experiment starts

We create a classification experiment that contains useful methods for classifying bias based on the attention distributions. 

In [16]:
classification_experiment = ClassificationExperiment.init_cls_experiment(params.final_task)

In [17]:
from src.utils.classification_utils import run_bootstrapping

In [18]:
train_dataloader, eval_dataloader, _ = dataset.split_train_eval_test(train_split=0.8, eval_split=0.2, batch_size=8)

In [19]:
classification_experiment.train_model(train_dataloader, eval_dataloader, input_key="attention_dist", label_key="bias_label")

HBox(children=(IntProgress(value=0, description='epochs', max=50, style=ProgressStyle(description_width='initi…

Step: 0 ; Loss 0.7184432744979858 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.5495619773864746 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.5514553189277649 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.5372989177703857 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.5394500494003296 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.5094419121742249 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.43724748492240906 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.46851640939712524 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.4385913908481598 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.43250665068626404 
All labels are of the same type – skipping AUC calculation
Step: 0 ; Loss 0.5869534611701965 
All labels are of the 

([[{'num_examples': 8, 'loss': 0.7184432744979858},
   {'num_examples': 8, 'loss': 0.7017520070075989},
   {'num_examples': 8, 'loss': 0.6846326589584351},
   {'num_examples': 8, 'loss': 0.6936505436897278},
   {'num_examples': 8, 'loss': 0.6837756037712097},
   {'num_examples': 8, 'loss': 0.6809084415435791},
   {'num_examples': 8, 'loss': 0.6608880162239075},
   {'num_examples': 8, 'loss': 0.6785640716552734},
   {'num_examples': 8, 'loss': 0.7253981232643127},
   {'num_examples': 8, 'loss': 0.7260427474975586},
   {'num_examples': 8, 'loss': 0.700170636177063},
   {'num_examples': 8, 'loss': 0.6108407974243164},
   {'num_examples': 8, 'loss': 0.60433030128479},
   {'num_examples': 8, 'loss': 0.6277279853820801},
   {'num_examples': 8, 'loss': 0.667755663394928},
   {'num_examples': 8, 'loss': 0.8380337953567505},
   {'num_examples': 8, 'loss': 0.6628445386886597},
   {'num_examples': 8, 'loss': 0.6714268922805786},
   {'num_examples': 8, 'loss': 0.7514424920082092},
   {'num_example