# Prepare Universal Dependencies Treebank Dataset

We use the UD English EWT treebank as our dataset. The corpus comprises 254,820 words and 16,622 sentences drawn from five genres of web media: weblogs, newsgroups, emails, reviews, and Yahoo! answers. Hugging Face hosts the treebank, so it can be loaded with the `datasets` library.

Documentation for the English UD EWT treebank: https://universaldependencies.org/treebanks/en_ewt/index.html

Aside: UD head indices are 1-based, while the tokens list for a sentence is an ordinary 0-based Python list. Head index 0 is reserved for the root, so head index 1 refers to the first token of the sentence, which is `tokens[0]`.
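
To make the indexing concrete, here is a minimal sketch on a hand-written toy sentence. The tokens, the head values, and the helper name `head_token` are illustrative assumptions, not taken from the treebank or the dataset loader.

```python
# Toy example (hypothetical values, not from EWT): heads follow UD's 1-based
# convention, where 0 means "attached to the root".
tokens = ["She", "reads", "books"]
heads = [2, 0, 2]  # "She" -> "reads", "reads" -> root, "books" -> "reads"

def head_token(tokens, heads, i):
    """Return the head word of tokens[i], or None if it attaches to the root."""
    head = heads[i]
    if head == 0:
        return None
    return tokens[head - 1]  # shift the 1-based UD index to a 0-based list index

for i, tok in enumerate(tokens):
    print(tok, "->", head_token(tokens, heads, i) or "ROOT")
```

The same `head - 1` shift is what the extraction code in this notebook applies when it looks up the verb for an `nsubj` relation.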

In [1]:
!pip install datasets
!pip install conllu


In [2]:
from datasets import load_dataset

# Loading dataset parameters
treebank_name = "en_ewt"
split = None  # Load all three splits: train, validation, and test.

# Load UD Treebank
dataset = load_dataset("universal_dependencies", treebank_name, split=split)

The repository for universal_dependencies contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/universal_dependencies.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/12543 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2002 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2077 [00:00<?, ? examples/s]

In [3]:
import ast

def get_dep_info(dep_str):
    """Parse a stringified dependency list into a (relation, head_index) tuple."""
    try:
        # literal_eval() is safe here because the string should contain only literals.
        dep_list = ast.literal_eval(dep_str)
        if dep_list:
            return dep_list[0]  # dep_list looks like [(dependency, head_index)], so we grab the first tuple.
        return ('N/A', 0)
    except (ValueError, SyntaxError, TypeError):
        # Malformed input: fall back to a placeholder instead of silently catching every exception.
        return ('N/A', 0)

def extract_subject_verb_pairs(sentence) -> list[dict]:
    """
    Extract subject-verb pairs from a sentence.
    """

    subject_verb_pairs = []
    for i, dep_str in enumerate(sentence['deps']):
        dep_rel, head = get_dep_info(dep_str)
        if dep_rel == 'nsubj' and head > 0:  # Skip subjects attached directly to the root (head index 0).
            subject_idx = i
            # UD head indices are 1-based (0 is the root), while sentence['tokens'] is a 0-based list, so subtract 1.
            verb_idx = head - 1

            subject_verb_pairs.append({
                'subject': sentence['tokens'][subject_idx],
                'verb': sentence['tokens'][verb_idx],
                'subject_idx': subject_idx,
                'verb_idx': verb_idx
            })

    return subject_verb_pairs

In [4]:
# Test subject-verb pair extraction on the first training sentence.
sentence = dataset["train"][0]
extract = extract_subject_verb_pairs(sentence)[0]
subj_idx = extract["subject_idx"]
verb_idx = extract["verb_idx"]

# Load model from Hugging Face

One thing to keep in mind: BERT uses WordPiece tokenization, which differs from UD tokenization. We therefore need to map BERT's tokens onto UD tokens so that BERT can be evaluated against the ground-truth subject-verb pairs from the UD dataset.
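
As an illustration of the mismatch, the sketch below groups WordPiece pieces back into whole words using the `##` continuation prefix. The function name `group_wordpieces` is hypothetical and the input is a hand-copied list of pieces. This is only a rough heuristic, not the mapping used later in this notebook (which relies on character offsets), and it would not regroup a UD token that WordPiece splits on punctuation.

```python
def group_wordpieces(pieces):
    """Group WordPiece indices into words: a piece starting with '##' continues the previous word."""
    groups = []
    for idx, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(idx)  # continuation piece joins the current word
        else:
            groups.append([idx])    # any other piece starts a new word
    return groups

# "rampaged" is a single UD token but two WordPiece tokens.
pieces = ["Bush", "has", "ramp", "##aged", "around"]
for group in group_wordpieces(pieces):
    print([pieces[i] for i in group])
```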

In [5]:
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')


TODO: Implement BertViz to visualize attention head scores

The code below takes one sentence from the UD dataset and extracts its subject-verb pairs, then feeds the sentence into BERT for analysis. Here, we examine the attention weights from layer 7. Because WordPiece can split a word into several subwords, we score a subject-verb pair by taking the max attention score across all subword combinations of the two words. For example, for a word split into ['run', '##ning'], we take the max over the scores involving 'run' and '##ning'.
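
The max-over-subwords step can be sketched on a toy attention matrix. The numbers and the helper name `max_word_attention` are made up for illustration; the notebook's own code operates on the real attention tensor instead.

```python
# Toy attention matrix for a single head (made-up numbers).
# Rows are the tokens doing the attending, columns the tokens being attended to.
attention = [
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.1, 0.3, 0.2],
    [0.2, 0.5, 0.1, 0.2],
    [0.25, 0.25, 0.25, 0.25],
]

def max_word_attention(attention, src_indices, tgt_indices):
    """Max attention from any source subword to any target subword."""
    return max(attention[i][j] for i in src_indices for j in tgt_indices)

# Suppose the subject is a single piece at position 0 and the verb spans positions 2 and 3.
score = max_word_attention(attention, [0], [2, 3])
print(score)
```

Taking the max is one of several possible pooling choices; a mean or sum over the subword cells would be equally easy to swap in.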

In [None]:
import torch

def get_bert_token_mapping(ud_tokens: list, bert_tokens: list, offset_mapping) -> dict:
    """
    Create mapping between UD tokens and BERT tokens.
    Returns a dict mapping UD token index -> list of BERT token indices.
    """
    # Remove special tokens for mapping calculation
    bert_tokens = bert_tokens[1:-1]  # Remove [CLS] and [SEP]
    offset_mapping = offset_mapping[1:-1]  # Remove special tokens' offsets

    # Join UD tokens with spaces to match original text
    text = ' '.join(ud_tokens)

    # Create mapping: UD index -> BERT token indices
    ud_to_bert = {}
    current_char_idx = 0  # Where in text to start searching for the next UD token

    # For each UD token, find all BERT tokens that correspond to it
    for ud_idx, ud_token in enumerate(ud_tokens):
        bert_indices = []
        token_start = text.find(ud_token, current_char_idx)
        token_end = token_start + len(ud_token)

        # Find all BERT tokens whose character span lies inside this UD token's span
        for bert_idx, (start, end) in enumerate(offset_mapping):
            # Keep the BERT token only if it is fully contained in the current UD token
            if start >= token_start and end <= token_end:
                # +1 to account for the [CLS] token removed above
                bert_indices.append(bert_idx + 1)

        ud_to_bert[ud_idx] = bert_indices
        current_char_idx = token_end

    return ud_to_bert

In [12]:
# Experiment setup: see how attention heads at a specified layer score the subject-verb pairs in a given sentence.

# Get a sentence with subject-verb pairs
sentence = dataset['train'][200]  # Change index to get different sentence samples
text = ' '.join(sentence['tokens'])
pairs = extract_subject_verb_pairs(sentence)

print(f"Sentence: {text}")
print(f"Subject-verb pairs: {pairs}")

# Tokenize with BERT
encoding = tokenizer(text, return_tensors='pt', return_offsets_mapping=True)
input_ids = encoding['input_ids']
offset_mapping = encoding['offset_mapping'][0].numpy()  # Remove batch dimension
bert_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

# Get token mapping
token_mapping = get_bert_token_mapping(sentence['tokens'], bert_tokens, offset_mapping)

print("\nBERT tokenization:", bert_tokens)
print("\nToken mapping:")
for ud_idx, bert_indices in token_mapping.items():
    print(f"UD token '{sentence['tokens'][ud_idx]}' -> BERT tokens: {[bert_tokens[i] for i in bert_indices]}")

# Get attention weights
with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)

# Get attention weights from layer 7
layer_idx = 6  # 0-based indexing
attention = outputs.attentions[layer_idx][0]  # Shape: [num_heads, seq_len, seq_len]
# For each attention head, there is a seq_len x seq_len matrix of attention scores.
# The rows are tokens DOING the attending
# The columns are tokens BEING attended to
# Each cell [i,j] shows how much token i pays attention to token j

# Analyze attention for subject-verb pairs with proper token mapping
print("\nAttention analysis for layer 7:")
for pair in pairs:
    subject_bert_indices = token_mapping[pair['subject_idx']]
    verb_bert_indices = token_mapping[pair['verb_idx']]

    print(f"\nPair: {pair['subject']} -> {pair['verb']}")
    print(f"BERT tokens: {[bert_tokens[i] for i in subject_bert_indices]} -> {[bert_tokens[i] for i in verb_bert_indices]}")

    # For each attention head
    for head in range(attention.size(0)):
        # Collect attention scores over all subword combinations.
        # A word can map to several BERT tokens because of WordPiece splitting,
        # so one word pair can correspond to several cells in the attention matrix.
        scores = []
        for subj_idx in subject_bert_indices:
            for verb_idx in verb_bert_indices:
                # How much does the subject token at subj_idx attend to the verb token at verb_idx in this head?
                scores.append(attention[head, subj_idx, verb_idx].item())
        max_score = max(scores)  # Summarize the word pair by its max subword score.

        print(f"Head {head}: {max_score:.3f}")

Sentence: Indeed , given how Bush has rampaged around the world alienating allies and ignoring vital conflicts with the potential to blow back on the US , one might well argue that Edwards knows more now than Bush does .
Subject-verb pairs: [{'subject': 'Bush', 'verb': 'rampaged', 'subject_idx': 4, 'verb_idx': 6}, {'subject': 'one', 'verb': 'argue', 'subject_idx': 26, 'verb_idx': 29}, {'subject': 'Edwards', 'verb': 'knows', 'subject_idx': 31, 'verb_idx': 32}, {'subject': 'Bush', 'verb': 'does', 'subject_idx': 36, 'verb_idx': 37}]

BERT tokenization: ['[CLS]', 'Indeed', ',', 'given', 'how', 'Bush', 'has', 'ramp', '##aged', 'around', 'the', 'world', 'alien', '##ating', 'allies', 'and', 'ignoring', 'vital', 'conflicts', 'with', 'the', 'potential', 'to', 'blow', 'back', 'on', 'the', 'US', ',', 'one', 'might', 'well', 'argue', 'that', 'Edwards', 'knows', 'more', 'now', 'than', 'Bush', 'does', '.', '[SEP]']

Token mapping:
UD token 'Indeed' -> BERT tokens: ['Indeed']
UD token ',' -> BERT tok