<a href="https://colab.research.google.com/github/yuezhang23/nlp/blob/main/bert_probe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probing Language Model Representations

In this notebook, you will explore how much information language models have about linguistic structure even when they have not been explicitly trained to predict it. You will use the encoder language model BERT.

This is a kind of experiment called &ldquo;probing&rdquo;, where we use internal representations from a language model to predict certain information we have but the language model does not. In particular, we will use a named entity recognition (NER) task, `BIO` tags on each word for the classes person, location, organization, and miscellaneous. The base BERT model did not see any of these labels in training—although BERT has often been fine-tuned on token labeling tasks. For more on token classification for named entity recognition, and for some of the code we use here, see [this huggingface tutorial](https://huggingface.co/docs/transformers/en/tasks/token_classification).

Work through the notebook and complete the cells marked TODO to set up and run these experiments.

We start by installing the huggingface `transformers` and related libraries.

In [3]:
!pip install transformers datasets evaluate seqeval torch



In case you want them later, we'll load the sklearn functions you used for training logistic regression in assignment 2.

In [64]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, LeaveOneOut, KFold
import numpy as np

Then, we'll use the huggingface `datasets` library to download the CoNLL (Conference on Natural Language Learning) 2003 data for named-entity recognition.

In [65]:
from datasets import load_dataset
conll2003 = load_dataset("hgissbkh/conll2003-en")

To keep things simple, we'll work with a sample of 1000 sentences.

In [34]:
sample = conll2003['train'].select(range(1000))

Each record contains a list of word tokens and a list of NER labels. For efficiency, the labels have been turned into integers, which makes them hard to interpret.

In [66]:
sample[0]

{'words': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'ner': [4, 0, 8, 0, 0, 0, 8, 0, 0]}

Fortunately, the dataset object also contains information to map these integers back to readable strings. We can see tags such as `B-PER` (the beginning token of a personal name), `I-PER` (the following tokens inside a personal name, if any), and `O` (a token outside any named entities). We create two dictionaries `id2label` and `label2id` to make mapping between integers and labels easier.

In [8]:
labels = sample.features['ner'].feature.names
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}
print(labels)
print(id2label)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
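As a small illustration of how these mappings are used, the sketch below decodes the integer labels of `sample[0]` shown earlier and groups adjacent tagged words into entity spans. The label list and the record are copied by hand from the outputs above so that the block runs without downloading anything; `tags_to_spans` is a hypothetical helper written for this example, not part of `datasets` or `seqeval`.

```python
# Label inventory and first record, copied by hand from the notebook outputs above.
labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
id2label = dict(enumerate(labels))

words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_ids = [4, 0, 8, 0, 0, 0, 8, 0, 0]

# Map each integer label back to its readable tag.
tags = [id2label[i] for i in ner_ids]


def tags_to_spans(words, tags):
    """Group consecutive tagged words into (entity_type, text) spans.

    Hypothetical helper: a span ends at an 'O' tag, at a 'B-' tag,
    or whenever the entity type changes.
    """
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition('-')
        if tag == 'O' or prefix == 'B' or ent_type != current_type:
            if current_words:
                spans.append((current_type, ' '.join(current_words)))
            current_type, current_words = None, []
        if tag != 'O':
            current_type = ent_type
            current_words.append(word)
    if current_words:
        spans.append((current_type, ' '.join(current_words)))
    return spans


spans = tags_to_spans(words, tags)
print(tags)
print(spans)
```

In this record the entity words carry `I-` tags rather than `B-` tags, which is why the helper starts a new span whenever the entity type changes and not only when it sees a `B-` prefix.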


For a language model to interpret our data properly, we need to tokenize it in the same way as its training data. We download the tokenizer for the `bert-base-cased` model from huggingface.

In [67]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Let's see what happens when we run the tokenizer on a single sentence. We tell it that our sentence has already been split into words, in this case by the creators of the CoNLL 2003 NER dataset. BERT, like many language models, used **subword tokenization** to keep the size of its vocabulary manageable. The tokenizer turns $n$ words into $m \ge n$ tokens, represented as a list of integer token identifiers. We use the method `convert_ids_to_tokens` to turn these integers back into a string representation.

In [68]:
example = sample[10]
tokenized_input = tokenizer(example['words'], is_split_into_words=True)
print(tokenized_input)  # a dict-like object with input_ids, token_type_ids, and attention_mask
tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
tokens

{'input_ids': [101, 2124, 6865, 2110, 23828, 1260, 19585, 1742, 8174, 1125, 2206, 4806, 17355, 9022, 2879, 1120, 1126, 7270, 3922, 9813, 112, 2309, 1104, 3989, 8362, 9380, 2050, 6202, 8819, 1194, 107, 4249, 1704, 5771, 119, 107, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


['[CLS]',
 'Spanish',
 'Farm',
 'Minister',
 'Loyola',
 'de',
 'Pa',
 '##la',
 '##cio',
 'had',
 'earlier',
 'accused',
 'Fi',
 '##sch',
 '##ler',
 'at',
 'an',
 'EU',
 'farm',
 'ministers',
 "'",
 'meeting',
 'of',
 'causing',
 'un',
 '##ju',
 '##st',
 '##ified',
 'alarm',
 'through',
 '"',
 'dangerous',
 'general',
 '##isation',
 '.',
 '"',
 '[SEP]']
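To give a feel for where pieces such as `Pa`, `##la`, `##cio` come from, here is a minimal sketch of greedy longest-match-first subword splitting, the matching strategy used by WordPiece tokenizers. It is a simplification: the vocabulary below is a tiny invented one (the real `bert-base-cased` vocabulary is far larger), and the function `wordpiece` is a hypothetical stand-in, not the huggingface implementation, which also handles punctuation splitting, normalization, and length limits.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Split one word into subword pieces by greedy longest-match-first.

    Pieces that do not start the word are looked up with a '##' prefix.
    If some position cannot be matched at all, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {'Loyola', 'de', 'Pa', '##la', '##cio', '[UNK]'}

for w in ['Loyola', 'de', 'Palacio']:
    print(w, '->', wordpiece(w, toy_vocab))
```

Because each word yields at least one piece, $n$ words always produce at least $n$ tokens, before the `[CLS]` and `[SEP]` markers are added.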

Notice how the name `Palacio` has been split into three subword tokens: `Pa`, `##la`, and `##cio`. The prepended `##` indicates that this token is _not_ the start of a word. But the NER annotations we have are at the word level. We thus need to do some work to map the sequence of NER labels, linked to words, to the usually longer sequence of subword tokens. This is a common task when your annotations were not created with a particular language model's tokenizer in mind. We adapt a function from the huggingface tutorial to map the NER labels onto the subword tokens. We assign the label -100 to tokens not at the beginning of a word, as well as to the sentinel `[CLS]` and `[SEP]` tokens at the beginning and end of the sentence.

In [102]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['words'], truncation=True, is_split_into_words=True)

    labels = []
    cnt = 0
    for i, label in enumerate(examples['ner']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
        cnt += np.sum(np.array(label_ids) != -100)

    tokenized_inputs['labels'] = labels
    print(cnt)
    return tokenized_inputs
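
The inner loop of this function can be tried out on a single hand-written sentence without loading the tokenizer. In the sketch below, `word_ids` is written out by hand to mimic what `tokenized_inputs.word_ids(...)` would return for three words, the last of which is split into three pieces; the word labels are hypothetical. The value -100 is convenient because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Copy each word's label to its first subword token; ignore everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] belong to no word.
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First subword token of a new word carries the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Continuation pieces of the same word are ignored.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hand-written example (assumed, not produced by the real tokenizer).
tokens = ['[CLS]', 'Loyola', 'de', 'Pa', '##la', '##cio', '[SEP]']
word_ids = [None, 0, 1, 2, 2, 2, None]
word_labels = [2, 2, 2]  # hypothetical labels, one per word

aligned = align_labels(word_ids, word_labels)
for token, label in zip(tokens, aligned):
    print(f'{token:8s} {label}')
```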

In [12]:
# token_inputs = example.map(tokenize_and_align_labels, batched=True)
# token_inputs

We apply this function to the whole dataset.

In [103]:
tokenized_sample = sample.map(tokenize_and_align_labels, batched=True)
tokenized_sample.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


12057


Each record in the tokenized sample now has numeric IDs for each token, an attention mask (always 1 in this encoding task), and token-level labels.

In [14]:
tokenized_sample[0]

{'input_ids': tensor([  101,  7270, 22961,  1528,  1840,  1106, 21423,  1418,  2495, 12913,
           119,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor([-100,    4,    0,    8,    0,    0,    0,    8,    0, -100,    0, -100])}
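
As a quick sanity check on this record, the positions whose label is not -100 should correspond one-to-one to the nine original words of `sample[0]`. The sketch below copies the `labels` tensor and the original `ner` list from the outputs above as plain Python lists, so it runs without torch.

```python
# Copied by hand from the outputs above.
token_labels = [-100, 4, 0, 8, 0, 0, 0, 8, 0, -100, 0, -100]
word_labels = [4, 0, 8, 0, 0, 0, 8, 0, 0]  # the 'ner' field of sample[0]

# Positions that carry a real label: the first subword token of each word.
kept_positions = [i for i, lab in enumerate(token_labels) if lab != -100]
recovered = [token_labels[i] for i in kept_positions]

print(kept_positions)
print(recovered == word_labels)
```

Position 0 and the last position are `[CLS]` and `[SEP]`. Position 9 is the only other ignored position, so exactly one word (the eighth, `lamb`) was split into two subword tokens.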

Now let's load the BERT model itself. We use the version that was trained on data that hadn't been case-folded, since upper-case words might be useful features for NER in English.

In [104]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)

We run inference on the first sentence in our sample, passing the model the list of token identifiers (coerced into a tensor with a single batch dimension) and the attention mask, which is all 1s for this simple encoding task.

In [105]:
import torch
with torch.no_grad():
  outputs = model(input_ids=tokenized_sample[0]['input_ids'].unsqueeze(0), attention_mask=tokenized_sample[0]['attention_mask'].unsqueeze(0))
  hidden_states = outputs.hidden_states

The `hidden_states` object we just created is a tuple with 13 items, one for each layer of the BERT model. The initial token embedding is layer 0 and the output is layer 12. Each layer contains embeddings for each token&mdash;here, there are 12&mdash;each of which is a vector of length 768.

In [106]:
print(len(hidden_states))
print(hidden_states[0].shape)
# print(hidden_states[0])

13
torch.Size([1, 12, 768])


We now define a function to take a dataset of tokens, run it through BERT to produce embeddings at all 13 layers, and to produce features for predicting NER labels from token embeddings. This function uses two explicit nested loops, which is not the fastest way to do things in pytorch, but more clearly expresses what is being computed. It takes about a minute to run on colab. (This assignment isn't meant to be a pytorch tutorial, but if you know pytorch, or are learning it, feel free to speed up this code by batching the examples together.)

In [107]:
def compute_layer_representation(data, model, tokenizer):
  rep = []
  lab = []
  for example in data:
    with torch.no_grad():
      outputs = model(input_ids=example['input_ids'].unsqueeze(0), attention_mask=example['attention_mask'].unsqueeze(0))
      # hidden_states is a tuple of 13 tensors, each of shape [1, num_tokens, 768], for this one sentence.
      hidden_states = outputs.hidden_states
      for i in range(len(example['labels'])):
        # Keep only the first subword token of each word (label != -100).
        if example['labels'][i] != -100:
          lab.append(int(example['labels'][i]))
          # One entry per word: a list of 13 vectors, one per layer.
          rep.append([hidden_states[layer][0][i].numpy() for layer in range(len(hidden_states))])
  return [np.array(rep), np.array(lab)]

We compute embeddings for all layers for the full dataset. Note that the first dimension is now _words_ rather than sentences. This means that we can probe the information that each word's embedding has about named entities (or anything else).

In [108]:
X, y = compute_layer_representation(tokenized_sample, model, tokenizer)

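The array `X` has one row per word, each holding 13 layers of 768-dimensional embeddings. Selecting a single layer, for example the bottom (word embedding) layer with `X[:, 0, :]`, gives us a matrix of words by embedding dimensions.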

In [109]:
X[:,:,:].shape

(12057, 13, 768)

**TODO:** Your first task is to probe the information that these embedding layers have about named entities. Train one linear model for each of the 13 layers of BERT to predict the label of each word in `y` using the embeddings in `X`. Print the accuracy of this model for each of the 13 layers of BERT. By accuracy, we simply mean the proportion of words that have been assigned the correct tag. (Although NER is often evaluated at the level of the entity, which may span one or more words, we will keep things simple here.)

You may use the sklearn code for training logistic regression models that you ran in assignment 2. You may also train these classifiers using pytorch. In any case, perform 10-fold cross validation and return the average accuracy over all ten folds.

In [110]:
# TODO: Train linear models to predict the NER labels using embeddings from each layer of BERT.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, KFold

def train_model(X, y):
  # X has shape [num_examples, num_layers, hidden_dim]; y has shape [num_examples].
  for i in range(X.shape[1]):
    # Embeddings of every example at layer i: [num_examples, hidden_dim].
    X_layer = X[:, i, :]
    clf = LogisticRegression(penalty='l2', solver='liblinear')
    # cross_validate clones and refits clf on each training fold, so no separate fit is needed.
    result = cross_validate(clf, X_layer, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
    score = result['test_score']
    # Average accuracy over the ten held-out folds.
    ave_score = sum(score)/len(score)
    print(ave_score)

In [111]:
train_model(X, y)

0.9251900937910722
0.9532228208886412
0.9606868148882146
0.9600236714078294
0.9668244531147858
0.9707227348733511
0.9729620913413569
0.9736253036339739
0.9719663783434143
0.9723806967926618
0.9743709529806018
0.9728780716060086
0.9708882971036932
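
To make the 10-fold procedure behind these numbers concrete, here is a standard-library sketch of how shuffled k-fold cross-validation partitions example indices: every example lands in exactly one test fold, and the first `n % k` folds get one extra example. The function `kfold_indices` is a hypothetical illustration; it does not reproduce the particular shuffle that sklearn's `KFold(random_state=42)` draws, so the folds themselves will differ.

```python
import random


def kfold_indices(n_samples, n_splits=10, seed=42):
    """Return a list of (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds hold one more example than the rest.
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


# 12057 is the number of labelled words in the sample above.
folds = kfold_indices(12057)
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Each probe is therefore trained ten times on roughly 90% of the words and scored on the remaining 10%; the printed value for a layer is the mean of those ten test accuracies.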


**TODO:** How good are these accuracy levels? Since the `O` tag is very common, you can do quite well by always predicting `O`. Compute the baseline accuracy, i.e., the accuracy you would get on the sample data if you always predicted `O`.

In [112]:
# TODO: Compute and print the baseline accuracy of always predicting O.
def get_baseline(y):
  # Label id 0 is 'O' in the NER label set, so this counts how often
  # the constant prediction 0 would be correct.
  score = 0
  for label in y:
    score += 1 if label == 0 else 0
  ave_score = score/len(y)
  print(ave_score)


In [113]:
get_baseline(y)

0.7650327610516712
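
One way to put the probe accuracies in context is to ask how much of the baseline's error a probe removes. The sketch below is an illustration under assumptions: `majority_baseline` and `relative_error_reduction` are hypothetical helpers, the toy label list is invented, and the two accuracies are copied from the outputs above (the baseline and the highest of the 13 NER probe accuracies).

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def relative_error_reduction(accuracy, baseline):
    """Fraction of the baseline's errors that the model eliminates."""
    return (accuracy - baseline) / (1.0 - baseline)


# Invented toy labels: six of eight are class 0.
toy_labels = [0, 0, 0, 1, 0, 3, 0, 0]
majority_label, toy_baseline = majority_baseline(toy_labels)
print(majority_label, toy_baseline)

# Numbers copied from the notebook outputs above.
baseline = 0.7650327610516712
best_probe = 0.9743709529806018
reduction = relative_error_reduction(best_probe, baseline)
print(round(reduction, 3))
```

On these numbers the best probe removes roughly 89% of the errors made by always predicting `O`, which is a more informative comparison than the raw accuracies alone.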


**TODO:** Now try another probing experiment for capitalized words, a simple feature that, in English, is correlated with named entities. For each word in the sample data, create a feature that indicates whether that word's first character is a capital letter. Then train logistic regression models for each layer of BERT to see how well they predict capitalization. Perform 10-fold cross-validation as above. Note any differences you see with the NER probes.

In addition, compute the baseline accuracy, i.e., the accuracy of always predicting that a word is not capitalized.

In [96]:
# TODO: Train linear models to predict capitalization.
# Compute and print the accuracy of these models for each layer of BERT.
# Compute and print the baseline accuracy.
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

def tokenize_and_labels(examples):
    tokenized_input = tokenizer(examples['words'], is_split_into_words=True)
    return tokenized_input

# Extract features from Bert model
def compute_capitalized_representation(data, model, tokenizer, batch_size=5):
    # dataloader = DataLoader(data, batch_size=batch_size, shuffle=False)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    dataloader = DataLoader(data, batch_size=batch_size, shuffle=False, collate_fn=data_collator)

    all_reps = []
    all_labs = []

    model.eval()
    for batch in dataloader:
        with torch.no_grad():
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
            )
            hidden_states = outputs.hidden_states

            stacked_hidden = torch.stack(hidden_states)
            mask = batch['attention_mask'].bool()

            for batch_idx in range(mask.size(0)):
                # Extract valid positions for each batch item
                valid_positions = mask[batch_idx]
                valid_reps = stacked_hidden[:, batch_idx, valid_positions, :]

                # Transpose to [num_valid_tokens, num_layers, hidden_dim]
                valid_reps = valid_reps.permute(1, 0, 2).cpu().numpy()

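                # Generate capitalization labels from the token strings at valid positions.
                # Note: this labels every subword token, so '[CLS]', '[SEP]', and
                # continuation pieces such as '##la' all count as "not capitalized".
                input_ids = batch['input_ids'][batch_idx][valid_positions]
                tokens = tokenizer.convert_ids_to_tokens(input_ids)
                valid_labels = np.array([1 if token[0].isupper() else 0 for token in tokens])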

                all_reps.append(valid_reps)
                all_labs.append(valid_labels)

    return [np.vstack(all_reps), np.concatenate(all_labs)]
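
The task statement asks for a feature per _word_, whereas the function above labels every subword token. The difference can be seen on small hand-written inputs. In the sketch below, `is_capitalized` is a hypothetical helper, the word list is copied from `sample[0]`, and the token list is an abbreviated hand-written example.

```python
def is_capitalized(text):
    """Return 1 if the first character is an upper-case letter, else 0."""
    return int(text[:1].isupper())


# Word-level feature, as described in the task statement.
words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
word_features = [is_capitalized(w) for w in words]
word_baseline = word_features.count(0) / len(word_features)
print(word_features, word_baseline)

# Token-level feature, as computed in the cell above: special tokens and
# '##' continuation pieces start with '[' or '#', so they are labelled 0.
tokens = ['[CLS]', 'Pa', '##la', '##cio', '[SEP]']
token_features = [is_capitalized(t) for t in tokens]
print(token_features)
```

A word-level variant could reuse the -100 alignment from the NER probe, keeping only the first subword token of each word, so that both probes are evaluated on the same 12057 positions.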


In [88]:
tokenized_sample = sample.map(tokenize_and_labels, batched=True)
tokenized_sample.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask'])
X, y = compute_capitalized_representation(tokenized_sample, model, tokenizer, 16)


In [92]:
train_model(X, y)

1.0
1.0
1.0
1.0
0.9998966408268734
0.9996898690374074
0.9994831239695479
0.9992764056232948
0.9981911075838589
0.9976741513885884
0.9970539429066161
0.9966405062141096
0.9964337344246437


In [95]:
get_baseline(y)

0.8079805654623455
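
To compare the two probes layer by layer, the sketch below tabulates the accuracies printed above (rounded by hand) and reports where each probe peaks. The variable names are introduced here for illustration only.

```python
# Accuracies copied from the outputs above, rounded; index = BERT layer (0 to 12).
ner_acc = [0.9252, 0.9532, 0.9607, 0.9600, 0.9668, 0.9707, 0.9730,
           0.9736, 0.9720, 0.9724, 0.9744, 0.9729, 0.9709]
cap_acc = [1.0, 1.0, 1.0, 1.0, 0.99990, 0.99969, 0.99948,
           0.99928, 0.99819, 0.99767, 0.99705, 0.99664, 0.99643]

# Layer with the highest NER probe accuracy.
best_ner_layer = max(range(len(ner_acc)), key=ner_acc.__getitem__)

# Does capitalization accuracy never increase from one layer to the next?
cap_non_increasing = all(a >= b for a, b in zip(cap_acc, cap_acc[1:]))

ner_gain = ner_acc[best_ner_layer] - ner_acc[0]
cap_drop = cap_acc[0] - cap_acc[-1]

print('best NER layer:', best_ner_layer)
print('capitalization accuracy never increases with depth:', cap_non_increasing)
print('NER gain from layer 0 to best layer:', round(ner_gain, 4))
print('capitalization drop from layer 0 to layer 12:', round(cap_drop, 5))
```

On these numbers the NER probe improves by about five points from the embedding layer to its peak at layer 10, while the capitalization probe is perfect in the lowest layers and degrades only slightly with depth. One plausible reading, offered as a hypothesis rather than a conclusion, is that capitalization here is a property of the token string itself and so is directly recoverable from the input embeddings, whereas entity labels depend on context that is built up across layers.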
