
# Probing Language Model Representations

In this notebook, you will explore how much information language models have about linguistic structure even when they have not been explicitly trained to predict it. You will use the encoder language model BERT.

This is a kind of experiment called “probing”, where we use internal representations from a language model to predict information that we have but that the language model was never given. In particular, we will use a named entity recognition (NER) task: predicting `BIO` tags on each word for the classes person, location, organization, and miscellaneous. The base BERT model did not see any of these labels in training, although BERT has often been fine-tuned on token labeling tasks. For more on token classification for named entity recognition, and for some of the code we use here, see [this huggingface tutorial](https://huggingface.co/docs/transformers/en/tasks/token_classification).
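
To make the `BIO` scheme concrete, here is a minimal sketch of how word-level tags group into entity spans. The function name `bio_to_spans` and the example sentence are made up for illustration and are not part of the assignment or the dataset. The sketch is deliberately lenient: an `I-` tag with no matching open span simply starts a new one.

```python
def bio_to_spans(words, tags):
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition("-")
        # A span continues only on an I- tag whose type matches the open span.
        continues = prefix == "I" and etype == current_type
        if not continues and current_words:
            spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if tag != "O":
            current_type = etype
            current_words.append(word)
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hypothetical sentence, not taken from CoNLL 2003.
words = ["Ada", "Lovelace", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
spans = bio_to_spans(words, tags)
print(spans)
```

In this notebook we only score individual word tags, so span grouping like this is not needed for the experiments; it is shown only to clarify what the tags encode.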

Work through the notebook and complete the cells marked TODO to set up and run these experiments.

We start by installing the huggingface `transformers` and related libraries.

In [27]:
!pip install transformers datasets evaluate seqeval



In case you want them later, we'll load the sklearn functions you used for training logistic regression in assignment 2.

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, LeaveOneOut, KFold
import numpy as np

Then, we'll use the huggingface `datasets` library to download the CoNLL (Conference on Computational Natural Language Learning) 2003 data for named-entity recognition.

In [29]:
from datasets import load_dataset
conll2003 = load_dataset("hgissbkh/conll2003-en")

To keep things simple, we'll work with a sample of 1000 sentences.

In [30]:
sample = conll2003['train'].select(range(1000))

Each record contains a list of word tokens and a list of NER labels. For efficiency, the labels have been turned into integers, which makes them hard to interpret.

In [31]:
sample[0]

{'words': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'ner': [4, 0, 8, 0, 0, 0, 8, 0, 0]}

Fortunately, the dataset object also contains information to map these integers back to readable strings. We can see tags such as `B-PER` (the beginning token of a personal name), `I-PER` (the following tokens inside a personal name, if any), and `O` (a token outside any named entities). We create two dictionaries `id2label` and `label2id` to make mapping between integers and labels easier.

In [32]:
labels = sample.features['ner'].feature.names
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}
print(labels)
print(id2label)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
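
As a quick check of the mapping, the sketch below decodes the integer labels of the first record shown earlier. The label list and the record are copied from the outputs above so that the snippet runs without downloading anything; in the notebook you would use `id2label` and `sample[0]` directly.

```python
# Label names and first record, copied from the notebook outputs above.
labels = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
id2label = dict(enumerate(labels))

words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner = [4, 0, 8, 0, 0, 0, 8, 0, 0]

# Turn each integer back into its readable tag and print it next to the word.
decoded = [id2label[i] for i in ner]
for word, tag in zip(words, decoded):
    print(f"{word:10s} {tag}")
```

Note that the entity words in this record decode to `I-ORG` and `I-MISC` rather than `B-` tags. This looks like the IOB1 convention of the original CoNLL-2003 files, where `B-` is only used to separate adjacent entities of the same type, but that reading is an inference from this one record rather than something confirmed by the dataset's documentation.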


For a language model to interpret our data properly, we need to tokenize it in the same way as its training data. We download the tokenizer for the `bert-base-cased` model from huggingface.

In [33]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Let's see what happens when we run the tokenizer on a single sentence. We tell it that our sentence has already been split into words, in this case by the creators of the CoNLL 2003 NER dataset. BERT, like many language models, uses **subword tokenization** to keep the size of its vocabulary manageable. The tokenizer turns $n$ words into $m \ge n$ tokens, represented as a list of integer token identifiers. We use the method `convert_ids_to_tokens` to turn these integers back into a string representation.

In [34]:
example = sample[10]
tokenized_input = tokenizer(example['words'], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input['input_ids'])
tokens

['[CLS]',
 'Spanish',
 'Farm',
 'Minister',
 'Loyola',
 'de',
 'Pa',
 '##la',
 '##cio',
 'had',
 'earlier',
 'accused',
 'Fi',
 '##sch',
 '##ler',
 'at',
 'an',
 'EU',
 'farm',
 'ministers',
 "'",
 'meeting',
 'of',
 'causing',
 'un',
 '##ju',
 '##st',
 '##ified',
 'alarm',
 'through',
 '"',
 'dangerous',
 'general',
 '##isation',
 '.',
 '"',
 '[SEP]']
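
To see how the pieces in this output relate to the original words, here is a rough sketch that glues WordPiece tokens back together using the `##` continuation prefix. The helper name `merge_wordpieces` is hypothetical, and the input is a shortened excerpt of the output above with `[SEP]` appended. This is a heuristic only: the tokenizer can also split a word at punctuation without adding `##`, so the `word_ids()` mapping on the tokenizer output is the reliable way to recover word boundaries.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Rebuild words by appending '##' continuation pieces to the previous piece."""
    words = []
    for token in tokens:
        if token in special:
            continue  # sentinel tokens do not belong to any word
        if token.startswith("##") and words:
            words[-1] += token[2:]  # strip the '##' marker and append
        else:
            words.append(token)
    return words


# Shortened excerpt of the tokenizer output shown above.
pieces = ['[CLS]', 'Spanish', 'Farm', 'Minister', 'Loyola', 'de',
          'Pa', '##la', '##cio', 'had', '[SEP]']
merged = merge_wordpieces(pieces)
print(merged)
```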

Notice how the name `Palacio` has been split into three subword tokens: `Pa`, `##la`, and `##cio`. The prepended `##` indicates that this token is _not_ the start of a word. But the NER annotations we have are at the word level. We thus need to do some work to map the sequence of NER labels, linked to words, to the usually longer sequence of subword tokens. This is a common task when you have data that wasn't created for a particular language model's classification. We adapt a function from the huggingface tutorial to map the NER labels onto the subword tokens. We assign the label -100 to tokens not at the beginning of a word, as well as to the sentinel `[CLS]` and `[SEP]` tokens at the beginning and end of the sentence.

In [35]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['words'], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['ner']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs

We apply this function to the whole dataset.

In [37]:
tokenized_sample = sample.map(tokenize_and_align_labels, batched=True)
tokenized_sample.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

Each record in the tokenized sample now has numeric IDs for each token, an attention mask (always 1 in this encoding task), and token-level labels.

In [38]:
tokenized_sample[0]

{'input_ids': tensor([  101,  7270, 22961,  1528,  1840,  1106, 21423,  1418,  2495, 12913,
           119,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor([-100,    4,    0,    8,    0,    0,    0,    8,    0, -100,    0, -100])}
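
The `labels` tensor above can be reproduced by hand. The sketch below applies the same alignment rule as `tokenize_and_align_labels` to a hard-coded `word_ids` list, so it runs without the tokenizer. The `word_ids` values are an assumption inferred from the output above (12 tokens for 9 words, with the second piece of `lamb` masked); in the notebook they come from `tokenized_inputs.word_ids(...)`.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first token; mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(ignore_index)           # [CLS] or [SEP]
        elif word_idx != previous:
            aligned.append(word_labels[word_idx])  # first piece of a word
        else:
            aligned.append(ignore_index)           # later piece of the same word
        previous = word_idx
    return aligned


# Assumed token-to-word mapping for the first sentence: word 7 ("lamb")
# is split into two pieces, and None marks the sentinel tokens.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
word_labels = [4, 0, 8, 0, 0, 0, 8, 0, 0]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Exactly nine entries survive the mask, one per original word, which is why the probing features later have one row per word rather than one row per token.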

Now let's load the BERT model itself. We use the version that was trained on data that hadn't been case-folded, since upper-case words might be useful features for NER in English.

In [39]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)

We run inference on the first sentence in our sample, passing the model the list of token identifiers (coerced into a tensor with a single batch dimension) and the attention mask, which is all 1s for this simple encoding task.

In [40]:
import torch
with torch.no_grad():
  outputs = model(input_ids=tokenized_sample[0]['input_ids'].unsqueeze(0), attention_mask=tokenized_sample[0]['attention_mask'].unsqueeze(0))
  hidden_states = outputs.hidden_states

The `hidden_states` object we just created is a tuple with 13 items: the initial token embeddings (layer 0) followed by the output of each of BERT's 12 transformer layers, with the final output at layer 12. Each item contains one embedding per token (here, there are 12 tokens), and each embedding is a vector of length 768.

In [41]:
print(len(hidden_states))
print(hidden_states[0].shape)

13
torch.Size([1, 12, 768])


We now define a function to take a dataset of tokens, run it through BERT to produce embeddings at all 13 layers, and to produce features for predicting NER labels from token embeddings. This function uses two explicit nested loops, which is not the fastest way to do things in pytorch, but more clearly expresses what is being computed. It takes about a minute to run on colab. (This assignment isn't meant to be a pytorch tutorial, but if you know pytorch, or are learning it, feel free to speed up this code by batching the examples together.)

In [42]:
def compute_layer_representation(data, model, tokenizer):
  # `tokenizer` is not used here; it is kept so the call signature stays the same.
  rep = []  # one entry per labeled word: a list of per-layer embedding vectors
  lab = []  # the NER label id of each labeled word
  for example in data:
    with torch.no_grad():
      outputs = model(input_ids=example['input_ids'].unsqueeze(0), attention_mask=example['attention_mask'].unsqueeze(0))
      hidden_states = outputs.hidden_states
      for i in range(len(example['labels'])):
        # Skip special tokens and non-initial subword tokens (label -100).
        if example['labels'][i] != -100:
          lab.append(int(example['labels'][i]))
          rep.append([hidden_states[layer][0][i].numpy() for layer in range(len(hidden_states))])
  return [np.array(rep), np.array(lab)]
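
If you do want to batch examples, sentences of different lengths first have to be padded to a common length. The following is a minimal pure-Python sketch of that padding step only; the helper name `pad_batch` is hypothetical, the two examples are toy truncations of the first sentence, and `pad_id=0` is an assumption you should check against `tokenizer.pad_token_id`. The `transformers` library also provides `DataCollatorForTokenClassification` for this purpose.

```python
def pad_batch(examples, pad_id=0, ignore_index=-100):
    """Right-pad token ids, attention masks, and labels to the longest example."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(list(ex["input_ids"]) + [pad_id] * n_pad)
        # Padding positions get attention 0 so the model ignores them.
        batch["attention_mask"].append(list(ex["attention_mask"]) + [0] * n_pad)
        # Padding positions get the ignore label so they never become probe rows.
        batch["labels"].append(list(ex["labels"]) + [ignore_index] * n_pad)
    return batch


# Toy examples (made-up truncations, for illustration only).
examples = [
    {"input_ids": [101, 7270, 102], "attention_mask": [1, 1, 1], "labels": [-100, 4, -100]},
    {"input_ids": [101, 7270, 22961, 1528, 102], "attention_mask": [1, 1, 1, 1, 1],
     "labels": [-100, 4, 0, 8, -100]},
]
batch = pad_batch(examples)
print(batch["attention_mask"])
```

Once batches are padded, the attention mask is no longer all 1s, so it has to be passed to the model for the padded positions to be ignored.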

We compute embeddings for all layers for the full dataset. Note that the first dimension is now _words_ rather than sentences. This means that we can probe the information that each word's embedding has about named entities (or anything else).

In [43]:
X, y = compute_layer_representation(tokenized_sample, model, tokenizer)

We can select information about the bottom (word embedding) layer, which gives us a matrix of words by embedding dimensions.

In [44]:
X[:,0,:].shape

(12057, 768)

**TODO:** Your first task is to probe the information that these embedding layers have about named entities. Train one linear model for each of the 13 layers of BERT to predict the label of each word in `y` using the embeddings in `X`. Print the accuracy of this model for each of the 13 layers of BERT. By accuracy, we simply mean the proportion of words that have been assigned the correct tag. (Although NER is often evaluated at the level of the entity, which may span one or more words, we will keep things simple here.)

You may use the sklearn code for training logistic regression models that you ran in assignment 2. You may also train these classifiers using pytorch. In any case, perform 10-fold cross validation and return the average accuracy over all ten folds.
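
As a reminder of what 10-fold cross-validation computes, here is a small pure-Python sketch of the bookkeeping: split the indices into folds, score each held-out fold, and average. The helper names `kfold_indices` and `mean_cv_score` are hypothetical, and the folds here are contiguous and unshuffled, whereas sklearn's `KFold` can shuffle before splitting.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mean_cv_score(n_samples, n_splits, score_fold):
    """Average a per-fold score; `score_fold(train_idx, test_idx)` returns a float."""
    scores = [score_fold(tr, te) for tr, te in kfold_indices(n_samples, n_splits)]
    return sum(scores) / len(scores)


# Fold sizes for the 12057 words in our sample.
fold_sizes = [len(test) for _, test in kfold_indices(12057, 10)]
print(fold_sizes)
```

In the real experiment, `score_fold` would fit a classifier on the training rows and return its accuracy on the test rows. Each word is used for testing exactly once, so the averaged accuracy reflects performance on data the probe did not see during fitting.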

In [45]:
def train_ner_probes(X, y):

    print("Training NER probes for each BERT layer...")
    accuracies = []
    kf = KFold(n_splits=10, shuffle=True, random_state=42)

    for layer_idx in range(13):
        print(f"Processing Layer {layer_idx}...")
        X_layer = X[:, layer_idx, :]

        # Perform 10-fold cross-validation
        fold_accuracies = []

        for fold_num, (train_idx, test_idx) in enumerate(kf.split(X_layer), 1):
            X_train, X_test = X_layer[train_idx], X_layer[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            # Train logistic regression
            clf = LogisticRegression(max_iter=1000, random_state=42)
            clf.fit(X_train, y_train)

            accuracy = clf.score(X_test, y_test)
            fold_accuracies.append(accuracy)

        avg_accuracy = np.mean(fold_accuracies)
        accuracies.append(avg_accuracy)
        print(f"  Layer {layer_idx} - Average Accuracy: {avg_accuracy:.4f}")

    return accuracies

ner_accuracies = train_ner_probes(X, y)

print("\nNER Probe Results Summary:")
print("-" * 40)
for layer_idx, acc in enumerate(ner_accuracies):
    print(f"Layer {layer_idx:2d}: {acc:.4f}")
print("-" * 40)
best_layer = np.argmax(ner_accuracies)
print(f"Best performing layer: Layer {best_layer} with accuracy {ner_accuracies[best_layer]:.4f}")


Training NER probes for each BERT layer...
Processing Layer 0...
  Layer 0 - Average Accuracy: 0.9250
Processing Layer 1...
  Layer 1 - Average Accuracy: 0.9502
Processing Layer 2...
  Layer 2 - Average Accuracy: 0.9573
Processing Layer 3...
  Layer 3 - Average Accuracy: 0.9584
Processing Layer 4...
  Layer 4 - Average Accuracy: 0.9659
Processing Layer 5...
  Layer 5 - Average Accuracy: 0.9690
Processing Layer 6...
  Layer 6 - Average Accuracy: 0.9722
Processing Layer 7...
  Layer 7 - Average Accuracy: 0.9723
Processing Layer 8...
  Layer 8 - Average Accuracy: 0.9719
Processing Layer 9...
  Layer 9 - Average Accuracy: 0.9716
Processing Layer 10...
  Layer 10 - Average Accuracy: 0.9724
Processing Layer 11...
  Layer 11 - Average Accuracy: 0.9725
Processing Layer 12...
  Layer 12 - Average Accuracy: 0.9700

NER Probe Results Summary:
----------------------------------------
Layer  0: 0.9250
Layer  1: 0.9502
Layer  2: 0.9573
Layer  3: 0.9584
Layer  4: 0.9659
Layer  5: 0.9690
Layer  6: 0.9

**TODO:** How good are these accuracy levels? Since the `O` tag is very common, you can do quite well by always predicting `O`. Compute the baseline accuracy, i.e., the accuracy you would get on the sample data if you always predicted `O`.

In [46]:
def compute_baseline_accuracy(y):
    # The O tag is label 0, so the accuracy of always predicting O
    # is the fraction of labels equal to 0.
    o_tag_count = np.sum(y == 0)
    total_count = len(y)
    baseline_accuracy = o_tag_count / total_count

    return baseline_accuracy

baseline_acc = compute_baseline_accuracy(y)
print(f"Baseline accuracy (always predicting 'O' tag): {baseline_acc:.4f}")
print(f"This means {baseline_acc*100:.2f}% of words are not part of named entities")


Baseline accuracy (always predicting 'O' tag): 0.7650
This means 76.50% of words are not part of named entities
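
One way to put the probe accuracies in context is to ask how much of the baseline's error a probe removes. The sketch below computes a majority-class baseline for any label list and the relative error reduction of a probe over a baseline; both helper names are hypothetical, and the two numbers plugged in at the end are the layer 11 accuracy and the baseline printed above.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def relative_error_reduction(accuracy, baseline):
    """Fraction of the baseline's errors that the model eliminates."""
    return (accuracy - baseline) / (1.0 - baseline)


# First sentence of the sample: 6 of its 9 words are tagged O (label 0).
label, acc = majority_baseline([4, 0, 8, 0, 0, 0, 8, 0, 0])
print(label, round(acc, 4))

# Layer 11 probe accuracy versus the always-O baseline, from the outputs above.
reduction = relative_error_reduction(0.9725, 0.7650)
print(round(reduction, 3))
```

By this measure the best probe removes roughly 88% of the errors made by always predicting `O`, which is a more informative summary than the raw gap of about 21 accuracy points.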


**TODO:** Now try another probing experiment for capitalized words, a simple feature that, in English, is correlated with named entities. For each word in the sample data, create a feature that indicates whether that word's first character is a capital letter. Then train logistic regression models for each layer of BERT to see how well they predict capitalization. Perform 10-fold cross-validation as above. Note any differences you see with the NER probes.

In addition, compute the baseline accuracy, i.e., the accuracy of always predicting that a word is not capitalized.
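
At the word level, the capitalization feature is a one-liner. The sketch below applies it to the words of the first sentence and computes the always-not-capitalized baseline for that single sentence; the helper name `capitalization_labels` is hypothetical. In the notebook itself, the labels must be produced in the same order as the rows of `X`, that is, one label per word whose first token has a label other than -100.

```python
def capitalization_labels(words):
    """1 if the word's first character is an uppercase letter, else 0."""
    # word[:1] is safe even for an empty string, unlike word[0].
    return [int(word[:1].isupper()) for word in words]


# Words of the first sentence in the sample, copied from the output above.
words = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
cap = capitalization_labels(words)
print(cap)

# Accuracy of always predicting "not capitalized" on this one sentence.
baseline = cap.count(0) / len(cap)
print(round(baseline, 4))
```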

In [48]:

def create_capitalization_labels_aligned(tokenized_sample, tokenizer):
    # Produce one label per word, in the same order as the rows of X:
    # only tokens whose NER label is not -100 (the first subword token
    # of each word) contribute a label.
    cap_labels = []

    for example in tokenized_sample:
        tokens = tokenizer.convert_ids_to_tokens(example['input_ids'])
        ner_labels = example['labels']

        for i, token in enumerate(tokens):
            # Skip special tokens and non-initial subword tokens.
            if ner_labels[i] == -100:
                continue
            # A word-initial token does not carry the '##' prefix, so its
            # first character is the first character of the word.
            cap_labels.append(1 if token[:1].isupper() else 0)

    return np.array(cap_labels)

def train_capitalization_probes_aligned(X, tokenized_sample, tokenizer):

    print("Creating capitalization labels (aligned with embeddings)...")
    cap_labels = create_capitalization_labels_aligned(tokenized_sample, tokenizer)

    print(f"Label count: {len(cap_labels)}, Embedding count: {X.shape[0]}")

    if len(cap_labels) != X.shape[0]:
        print(f"WARNING: Still mismatched! Labels: {len(cap_labels)}, Embeddings: {X.shape[0]}")

        min_len = min(len(cap_labels), X.shape[0])
        cap_labels = cap_labels[:min_len]
        X_subset = X[:min_len]
    else:
        print("Labels and embeddings are properly aligned!")
        X_subset = X

    print("Training capitalization probes for each BERT layer...")
    accuracies = []
    kf = KFold(n_splits=10, shuffle=True, random_state=42)

    for layer_idx in range(13):
        print(f"Processing Layer {layer_idx}...")
        X_layer = X_subset[:, layer_idx, :]

        fold_accuracies = []

        for fold_num, (train_idx, test_idx) in enumerate(kf.split(X_layer), 1):
            X_train, X_test = X_layer[train_idx], X_layer[test_idx]
            y_train, y_test = cap_labels[train_idx], cap_labels[test_idx]

            clf = LogisticRegression(max_iter=1000, random_state=42)
            clf.fit(X_train, y_train)

            # Compute accuracy
            accuracy = clf.score(X_test, y_test)
            fold_accuracies.append(accuracy)

        # Average accuracy across folds
        avg_accuracy = np.mean(fold_accuracies)
        accuracies.append(avg_accuracy)
        print(f"  Layer {layer_idx} - Average Accuracy: {avg_accuracy:.4f}")

    return accuracies, cap_labels

def compute_cap_baseline(cap_labels):
    not_cap_count = np.sum(cap_labels == 0)
    total_count = len(cap_labels)
    return not_cap_count / total_count

cap_accuracies, cap_labels = train_capitalization_probes_aligned(X, tokenized_sample, tokenizer)

print("\nCapitalization Probe Results Summary:")

for layer_idx, acc in enumerate(cap_accuracies):
    print(f"Layer {layer_idx:2d}: {acc:.4f}")

best_cap_layer = np.argmax(cap_accuracies)
print(f"Best performing layer: Layer {best_cap_layer} with accuracy {cap_accuracies[best_cap_layer]:.4f}")

# Compute baseline
cap_baseline = compute_cap_baseline(cap_labels)
print(f"\nCapitalization baseline accuracy (always predicting not capitalized): {cap_baseline:.4f}")
print(f"This means {(1-cap_baseline)*100:.1f}% of words are capitalized")

# Performance check
if max(cap_accuracies) > cap_baseline:
    print(f"SUCCESS: Models ({max(cap_accuracies):.4f}) beat baseline ({cap_baseline:.4f})")
else:
    print(f"ERROR: Models ({max(cap_accuracies):.4f}) worse than baseline ({cap_baseline:.4f})")

print("FINAL COMPARATIVE ANALYSIS")


print("\n1. Task Performance Summary:")

print(f"NER best accuracy: {max(ner_accuracies):.4f} (Layer {np.argmax(ner_accuracies)})")
print(f"NER baseline: {baseline_acc:.4f}")
print(f"NER improvement: {max(ner_accuracies) - baseline_acc:.4f}")
print()
print(f"Cap best accuracy: {max(cap_accuracies):.4f} (Layer {np.argmax(cap_accuracies)})")
print(f"Cap baseline: {cap_baseline:.4f}")
print(f"Cap improvement: {max(cap_accuracies) - cap_baseline:.4f}")

print("\n2. Layer Preferences:")

if np.argmax(cap_accuracies) < np.argmax(ner_accuracies):
    print("EXPECTED PATTERN CONFIRMED:")
    print(f"Capitalization peaks early (Layer {np.argmax(cap_accuracies)})")
    print(f"NER peaks in middle/late layers (Layer {np.argmax(ner_accuracies)})")
else:
    print("! Unexpected pattern detected")

print("\n3. Detailed Layer Comparison:")
print("-" * 40)
print("Layer | NER Acc | Cap Acc | NER>Cap?")
print("-" * 40)
for i in range(13):
    ner_acc = ner_accuracies[i]
    cap_acc = cap_accuracies[i]
    better = "NER" if ner_acc > cap_acc else "CAP"
    marker_ner = "*" if i == np.argmax(ner_accuracies) else " "
    marker_cap = "*" if i == np.argmax(cap_accuracies) else " "
    print(f"{i:5d} | {ner_acc:.4f}{marker_ner} | {cap_acc:.4f}{marker_cap} | {better}")
print("-" * 40)
print("* = best layer for that task")

Creating capitalization labels (aligned with embeddings)...
Label count: 12057, Embedding count: 12057
Labels and embeddings are properly aligned!
Training capitalization probes for each BERT layer...
Processing Layer 0...
  Layer 0 - Average Accuracy: 1.0000
Processing Layer 1...
  Layer 1 - Average Accuracy: 1.0000
Processing Layer 2...
  Layer 2 - Average Accuracy: 1.0000
Processing Layer 3...
  Layer 3 - Average Accuracy: 0.9999
Processing Layer 4...
  Layer 4 - Average Accuracy: 0.9999
Processing Layer 5...
  Layer 5 - Average Accuracy: 0.9996
Processing Layer 6...
  Layer 6 - Average Accuracy: 0.9997
Processing Layer 7...
  Layer 7 - Average Accuracy: 0.9990
Processing Layer 8...
  Layer 8 - Average Accuracy: 0.9990
Processing Layer 9...
  Layer 9 - Average Accuracy: 0.9984
Processing Layer 10...
  Layer 10 - Average Accuracy: 0.9982
Processing Layer 11...
  Layer 11 - Average Accuracy: 0.9977
Processing Layer 12...
  Layer 12 - Average Accuracy: 0.9974

Capitalization Probe Resu