Contextual BERT embeddings as a weak labeling function
------
As in the GloVe -> classification case, this labeling function feeds a word embedding to a classifier, but the embedding now comes from BERT. As before, we convert the biased word into a vector and classify that vector. For each sentence, we first have to identify the first biased word. We know which word this is because the dataset contains ground-truth labels marking the words that were edited for bias. We can therefore extract the index of the first biased word and build the BERT embedding features from that word.
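The index-extraction step can be sketched as follows. This is a minimal illustration, not the code inside `get_bert_features`: the helper names `first_biased_index` and `select_biased_embedding` are hypothetical, and it assumes one 0/1 bias label per token, aligned with one embedding row per token. Real BERT tokenization splits words into word pieces, so an actual pipeline would also need to map word-level labels onto word-piece positions.

```python
from typing import Sequence


def first_biased_index(bias_labels: Sequence[int]) -> int:
    """Return the position of the first token labeled as biased (label == 1)."""
    for position, label in enumerate(bias_labels):
        if label == 1:
            return position
    raise ValueError("sentence has no token labeled as biased")


def select_biased_embedding(token_embeddings, bias_labels: Sequence[int]):
    """Pick the embedding row of the first biased token.

    `token_embeddings` can be any sequence indexable by position, for
    example a nested list or a tensor of shape (seq_len, hidden_size).
    Assumption: labels and embeddings are aligned token by token.
    """
    if len(token_embeddings) != len(bias_labels):
        raise ValueError("embeddings and labels must have the same length")
    return token_embeddings[first_biased_index(bias_labels)]


# Toy example: 4 tokens with 3-dimensional embeddings (BERT-base would use 768).
toy_embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
toy_labels = [0, 0, 1, 1]  # tokens 2 and 3 were edited for bias

biased_vector = select_biased_embedding(toy_embeddings, toy_labels)
```

Applying this to every sentence and stacking the selected rows would give one vector per example, which is consistent with the `(num_examples, 768)` shape of `bert_embeddings` in this notebook.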

In [1]:
import sys; sys.path.append("../../../../..")
import torch 
from src.experiment import ClassificationExperiment
from src.dataset import ExperimentDataset
from src.params import Params

%load_ext autoreload
%autoreload 2

In [2]:
params = Params.read_params("experiment_params.json")

In [3]:
# Load the dataset used in this experiment;
# typically this is the small set of ground-truth labels
dataset = ExperimentDataset.init_dataset(params.dataset)

03/06/2020 14:25:29 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./cache/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
386it [00:00, 4288.18it/s]


In [4]:
from src.utils.weak_labeling_utils import get_bert_features

In [5]:
bert_embeddings = get_bert_features(dataset)

03/06/2020 14:25:30 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /sailhome/rdm/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
03/06/2020 14:25:30 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /sailhome/rdm/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmp1nvkz2d5
03/06/2020 14:25:34 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 

In [6]:
bert_embeddings.shape

torch.Size([324, 768])

In [7]:
dataset.add_data(bert_embeddings, "bert_embeddings")

### This is where the classification experiment starts

In [8]:
classification_experiment = ClassificationExperiment.init_cls_experiment(params.final_task)

In [9]:
from src.utils.classification_utils import run_bootstrapping

In [10]:
statistics = run_bootstrapping(
    classification_experiment,
    dataset,
    params.final_task,
    num_bootstrap_iters=3,
    input_key='bert_embeddings',
    label_key='bias_label',
    threshold=0.42,
)

In [11]:
statistics

{'auc': [(0.8615131531815541, 0.8830869324115832), 0.8741890551582191],
 'accuracy': [(0.8239583333333335, 0.8467793367346939), 0.8378684807256236]}
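Each entry in `statistics` appears to pair an interval with a point estimate, in the form `[(low, high), mean]`. How `run_bootstrapping` computes these values is not shown in this notebook, so the following is only a sketch of one common approach, a percentile bootstrap over resampled examples. The function name `bootstrap_accuracy` and all of its parameters are hypothetical.

```python
import random


def bootstrap_accuracy(labels, predictions, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap estimate of accuracy.

    Returns [(low, high), mean], mirroring the layout of the output above.
    This is an illustrative assumption, not the logic of run_bootstrapping.
    """
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(num_resamples):
        # Resample example indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        correct = sum(labels[i] == predictions[i] for i in sample)
        scores.append(correct / n)

    scores.sort()
    # Approximate lower and upper percentiles of the resampled scores.
    low = scores[int((alpha / 2) * num_resamples)]
    high = scores[int((1 - alpha / 2) * num_resamples) - 1]
    mean = sum(scores) / num_resamples
    return [(low, high), mean]


# Toy data: 100 examples, 80 of which are predicted correctly.
toy_labels = [1] * 50 + [0] * 50
toy_predictions = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10

result = bootstrap_accuracy(toy_labels, toy_predictions)
```

With only `num_bootstrap_iters=3`, as used above, an interval of this kind is coarse; more iterations would give a steadier estimate at the cost of longer training time.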