# Lab 4: Probing the capabilities of LLMs

Unlike previous assignments in this course, our primary goal in this lab is not to use NLP tools and techniques to model language *per se*, but rather to investigate properties of language models themselves, since large language models (LLMs) are "black boxes" whose inner workings cannot be directly observed.

In particular, you will use and interpret the outputs of language models to **probe** features of those models, especially how closely (or not) they resemble human language use and knowledge. We will design and implement probes for masked language modeling in BERT, in order to build on our knowledge from Lab 3, but these techniques are very generally applicable to models of all sorts, including generative models like GPT-3.




Masked Language Modeling is essentially a game of "fill in the blank". The model is given an input text of which a portion is "masked", and trained to predict what the masked element is given the surrounding context (both before and after the mask). Your job is to use these predictions to reason about how the model itself works.
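
As a minimal illustration of the input format (not part of the lab's provided code), the sketch below builds one masked variant of a sentence per word. The helper name `mask_each_word` and the whitespace splitting are assumptions made for illustration; BERT itself operates on WordPiece tokens rather than whitespace-separated words.

```python
def mask_each_word(sentence, mask_token="[MASK]"):
    """Return (masked_sentence, hidden_word) pairs, one per whitespace-separated word."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        # Replace only the i-th word with the mask token; keep the rest as context.
        masked = words[:i] + [mask_token] + words[i + 1:]
        variants.append((" ".join(masked), word))
    return variants


for masked, hidden in mask_each_word("The cat sat down"):
    print(f"{masked!r} hides {hidden!r}")
```

Each masked variant keeps the context on both sides of the blank, which is what the model conditions on when it predicts the hidden word.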

# Rules
* The assignment should be submitted to **Blackboard** as `.ipynb`. Only **one submission per group**.

* The **filename** should be the group number, e.g., `01.ipynb` or `31.ipynb`.

* The questions marked **Extra** or **Optional** are an additional challenge for those interested in going the extra mile. There are no points for them.

**Rules for implementation**

* You should **write your answers in this iPython Notebook**. (See http://ipython.org/notebook.html for reference material.) If you have problems, please contact your teaching assistant.

* Use only **one cell for markdown** answers!    

    * You do **not need to submit any code** for this lab, but you are free to leave any code you might run in your submission, so long as it does not interfere with readability of your written responses.
    * For text-based questions, put your solution in the `█████ YOUR ANSWER HERE █████` cell and keep the header.

* Don't change or delete any initially provided cells, either text or code, unless explicitly instructed to do so.
* Don't change the names of provided functions and variables or arguments of the functions.
* Leave the output of your code in the output cells.
* Test your code and **make sure we can run your notebook** in the colab environment.
* Don't forget to fill in the contribution information.

<font color="red">Following these rules helps us grade the submissions efficiently. Submissions that violate these rules will be subject to penalty points.</font>

All exercises were completed with the contribution of all the members by pair coding and discussion of results.

Group Members:

- Andreas Alexandrou

- Nikolas Stavrou

- Sotiris Zenios

# Setup

BERT, which you are already familiar with, is pre-trained on a masked language modeling task. We will use this model to make predictions about what "fills in the blank" in a masked language task. As before, we will need the transformers package to use BERT.

In [None]:
# !pip install transformers
import transformers
print(transformers.__version__) # 4.41.2

4.41.2


Then, we need to instantiate the tokenizer and the masked language model.

In [None]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM, RobertaTokenizer, RobertaModel, RobertaForMaskedLM

seed = 5
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertForMaskedLM.from_pretrained('bert-base-uncased')

#bert_model.eval()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Dealing with masked sentences

Next, we want to define some methods to allow us to see the probability of particular candidate tokens to "fill in the blank" in some text. We will use the token [MASK] to denote the blank.

The class ```MaskedSentence``` takes a sentence with a [MASK] token and uses softmax to turn the model's scores (logits) for every vocabulary token at the mask position into probabilities. The class has two additional methods:

* The method ```get_masked_token_probability``` takes a string  ```token``` and prints the probability that BERT assigns to [MASK] being replaced in the text with ```token```.
* The method ```predict_masked_sent``` prints the top *k* (5 by default) predictions for [MASK] with their probabilities.
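
To make the softmax step concrete, here is a small standalone sketch in plain Python. The three-token vocabulary and its scores are invented for illustration and are not real BERT outputs; the class below applies the same operation with `torch.nn.functional.softmax` over the whole vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a toy three-token vocabulary (assumed values).
toy_vocab = ["game", "stage", "team"]
toy_logits = [2.0, 0.5, 1.0]
toy_probs = softmax(toy_logits)

for token, p in zip(toy_vocab, toy_probs):
    print(f"{token}: {p:.3f}")
```

Softmax preserves the ranking of the raw scores, so the token with the highest logit is also the token with the highest probability.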

In [None]:
# adapted from code by Yuchen Liu

class MaskedSentence:

  """
  A tokenized sentence with a masked word
  Note: [MASK] is the default mask token for BERT, other MLMs may have different defaults
  """

  def __init__(self, text, model=bert_model, tokenizer=bert_tokenizer, mask_token="[MASK]"):

    # Tokenize text and obtain predictions for mask

    self.tokenizer = tokenizer
    self.mask = mask_token

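    # Note: [CLS] and [SEP] are BERT's special tokens. Other MLMs (e.g. RoBERTa,
    # which uses <s> and </s>) tokenize them as ordinary text.
    text = "[CLS] %s [SEP]"%text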
    tokenized_text = self.tokenizer.tokenize(text)
    masked_index = tokenized_text.index(mask_token)
    indexed_tokens = self.tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])

    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

    # Turn predictions into a probability distribution using softmax

    self.probs = torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)

  def get_masked_token_probability(self, token):

    # prints probability of mask being replaced by token

    token_id = self.tokenizer.convert_tokens_to_ids(token)
    token_prob = self.probs[token_id]

    print(f"{self.mask}: {token},  | probability:, {float(token_prob)}, \n")

  def predict_masked_sent(self, top_k=5):

    # prints the k most probable replacements for the mask

    top_k_weights, top_k_indices = torch.topk(self.probs, top_k, sorted=True)

    for i, pred_idx in enumerate(top_k_indices):
        predicted_token = self.tokenizer.convert_ids_to_tokens([pred_idx])[0]
        token_weight = top_k_weights[i]
        print(f"{self.mask}: {predicted_token}, | probability:, {float(token_weight)}")


Now, let's test these methods on a string with a mask.

In [None]:
test_sentence = MaskedSentence("All the world’s a [MASK], and all the men and women merely players.")
test_sentence.get_masked_token_probability("stage")
test_sentence.predict_masked_sent(top_k=5)

[MASK]: stage,  | probability:, 0.0009108720696531236, 

[MASK]: game, | probability:, 0.22936248779296875
[MASK]: team, | probability:, 0.2215639054775238
[MASK]: player, | probability:, 0.1818789690732956
[MASK]: champion, | probability:, 0.02110450714826584
[MASK]: winner, | probability:, 0.012764733284711838


We see that the model assigns a probability of about 0.0009 to *stage*, and the highest-probability token is *game* (0.229).

# Testing linguistic knowledge in MLMs

A major outstanding question in the study of large language models in general is how well (or poorly) the models are able to replicate aspects of humans' implicit linguistic knowledge and linguistic reasoning. One way to test this is to see how the model predicts a target word.

For example, consider the ways we could fill in the blank in the sentence *If cats were herbivores, they would probably eat _________.* This sentence is an example of what linguists call a **counterfactual conditional**: a description of what would happen if some hypothetical (but contrary to reality) condition were met. A speaker of English could recognize that this sentence is describing a hypothetical situation in which cats eat only plants, so the most logical continuation would be a word that describes edible plants, like *vegetables* or *carrots*.

Let's see what BERT predicts as the likeliest possible predictions for the mask:

In [None]:
cats_sent = MaskedSentence("If cats were herbivores, they would probably eat [MASK].")
cats_sent.predict_masked_sent(top_k=5)

[MASK]: them, | probability:, 0.2037343680858612
[MASK]: humans, | probability:, 0.19445215165615082
[MASK]: it, | probability:, 0.06679246574640274
[MASK]: animals, | probability:, 0.028084909543395042
[MASK]: meat, | probability:, 0.02714720368385315


We see that BERT's top 5 predictions for the mask are *them*, *humans*, *it*, *animals*, and *meat*. All of these predictions result in grammatical (syntactically well-formed) sentences, but they are either not very contentful (*them*, *it*) or nonsensical (*humans*, *meat*, *animals*); none is human-like. By contrast, the probabilities of *vegetables* and *carrots* are both relatively low:

In [None]:
cats_sent.get_masked_token_probability("vegetables")
cats_sent.get_masked_token_probability("carrots")

[MASK]: vegetables,  | probability:, 0.0011069991160184145, 

[MASK]: carrots,  | probability:, 2.656224978636601e-06, 
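
One way to quantify such a comparison is the ratio (or log-ratio) between two candidates' probabilities. The sketch below uses the probabilities printed above; the helper `log_odds` is a hypothetical name introduced only for this illustration.

```python
import math

# Probabilities copied from the outputs above for
# "If cats were herbivores, they would probably eat [MASK]."
recorded = {
    "meat": 0.02714720368385315,
    "vegetables": 0.0011069991160184145,
    "carrots": 2.656224978636601e-06,
}


def log_odds(probs, preferred, alternative):
    """Natural-log ratio of two candidates; positive means `preferred` is likelier."""
    return math.log(probs[preferred]) - math.log(probs[alternative])


ratio = recorded["meat"] / recorded["vegetables"]
print(f"meat is about {ratio:.1f}x as probable as vegetables")
print(f"log-odds(vegetables vs. meat) = {log_odds(recorded, 'vegetables', 'meat'):.2f}")
```

A negative log-odds value here means the model prefers the implausible continuation (*meat*) over the one consistent with the *if*-clause (*vegetables*).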



# Ex 1 [1pt] Evaluating counterfactual conditionals in BERT

Provide a possible explanation, in 150-200 words, as to why BERT gives such non-human-like results for this counterfactual conditional sentence. Your explanation should address the following questions: Do you think this is an arbitrary feature of this one sentence? Or does it reveal something more general about BERT? How could you go about testing whether your explanation is correct, using the class defined above?


<font color="red">█████ YOUR ANSWER HERE █████</font>

BERT probably gives such non-human-like results for this counterfactual conditional because it was trained on a large amount of text with an objective that rewards syntactically well-formed completions rather than deep semantic understanding. Here, for example, a human infers that a herbivorous cat eats only plants and so might eat a carrot.

Yes, we think this reveals something more general about BERT. It is trained to predict masked tokens from their context, but it does not have world knowledge or the ability to reason about hypothetical or counterfactual scenarios the way a human does. We therefore conclude that BERT struggles with semantic contexts that require world knowledge and logical inference.

We can test this explanation further with more counterfactual conditional sentences, like the ones below. For example, BERT fails to recognize that if water were not drinkable we would drink something else, such as milk; instead it predicts 'it' or 'water'.

Similarly, for 'If I could pick a type of dance, I would pick [MASK].', BERT does not fill the mask with a type of dance such as hip-hop, but with 'one', 'it', etc.


In [None]:
test_sent = MaskedSentence("If water was not drinkable, we would drink [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: it, | probability:, 0.6088706254959106
[MASK]: water, | probability:, 0.14153553545475006
[MASK]: nothing, | probability:, 0.03644610941410065
[MASK]: only, | probability:, 0.010709594003856182
[MASK]: less, | probability:, 0.00738625880330801


In [None]:
test_sent = MaskedSentence("If I could pick a type of dance, I would pick [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: one, | probability:, 0.1626421958208084
[MASK]: it, | probability:, 0.1510871797800064
[MASK]: another, | probability:, 0.04559238627552986
[MASK]: them, | probability:, 0.036757346242666245
[MASK]: dance, | probability:, 0.034330643713474274


# Ex2 [4pt] Design your own BERT probe experiment

We can reason about the capabilities of an LLM simply by choosing carefully designed inputs and evaluating the model's corresponding outputs. If we test many inputs with some shared property, such as a particular syntactic structure, we can start to generalize about BERT's behaviour with text that has that property. For example, we can investigate the question of whether or not BERT predicts continuations of counterfactual conditionals which are consistent with the hypothetical scenario presented in the *if*-clause of the conditional by evaluating what happens when we give it many such conditionals as inputs.
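
One simple way to obtain many inputs that share a structure is to fill a template. The following is only a sketch: the template, the word lists, and the helper name `build_probes` are hypothetical choices for illustration, not part of the lab's provided code.

```python
from itertools import product

# Hypothetical template and fillers, chosen only to illustrate the idea.
TEMPLATE = "If {animal} were {diet}, they would probably eat [MASK]."
ANIMALS = ["cats", "lions", "cows"]
DIETS = ["herbivores", "carnivores"]


def build_probes(template, animals, diets):
    """Fill the template with every animal/diet combination."""
    return [
        {"animal": a, "diet": d, "sentence": template.format(animal=a, diet=d)}
        for a, d in product(animals, diets)
    ]


probes = build_probes(TEMPLATE, ANIMALS, DIETS)
for probe in probes:
    print(probe["sentence"])
```

Each generated sentence could then be passed to `MaskedSentence(...)`, so that every probe differs only in the property under investigation.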

Your primary task for this final lab is to design a small experiment that tests, using the same kinds of techniques as above, the capabilities of BERT in a particular domain of your choice. To give you some ideas, here are a few suggestions of possible general domains that could be worth investigating, although the actual question you investigate should be small enough that it can be tested with a relatively modest selection of sentences. You are also free to come up with your own idea:

* The interpretation of pronouns (can BERT recognize which individual a pronoun like *it* is referring to when there are multiple possible options?)
* Does BERT fall for so-called "semantic illusions", in which it fails to recognize an inaccuracy in text, such as answering the question "How many of each animal did Moses take on the ark?" with "2"? (The Biblical story is about Noah, not Moses.)
* Bias: Does BERT make predictions which are more consistent with gender, racial, or other stereotypes?
* World knowledge: Does BERT make predictions which correspond with the way the world actually is?

Your description of your experiment should have the following parts and be approximately equivalent in length to 1 typed page (roughly 500 words, in addition to your test sentences):



*   **Research Question**: A clear formulation of the question you intend to investigate. It should be small and precise enough that it can reasonably be investigated using the functions defined above.
*   **Hypothesis**: The answer to the research question that you predict to be true, and *why you have that specific expectation*.
*   **At least 10 test sentences**, with a description of which of their properties are relevant. Be very clear about what, specifically, you are testing, and how the results will bear on your hypothesis.
*   **Test** your sentences and see what outputs you get. Do these provide evidence for or against your hypothesis? Why do you think you got the results you did?  

*   **Discuss** whether, given your own linguistic intuitions, the behaviour of BERT approximates that of a human language user with respect to your research question. If it is not human-like, how could the model be improved (in terms of training data, architecture, etc.) to achieve better results?
* **OPTIONAL**: Try investigating your research question in some other models (see https://huggingface.co/models for some options). You will likely need to adapt your probes for other kinds of models--for instance, a probe you test in a dialogue-based interface like ChatGPT will be different than those you designed for BERT. Some models, like [RoBERTa](https://huggingface.co/FacebookAI/roberta-base), are built upon the BERT architecture, so they are compatible with the code above (see example below--note that RoBERTa tokenizes slightly differently from BERT).

In [None]:
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaForMaskedLM.from_pretrained('roberta-base')

#Note 1: RoBERTa's default mask token is <mask>
#Note 2: Most tokens in RoBERTa begin with the unusual character Ġ.
#This is an artefact of the tokenization process, which includes the space preceding words.
#To get the probability of a token "word",
#we need to give get_masked_token_probability "Ġword"

roberta_test_sentence = MaskedSentence("All the world’s a <mask>, and all the men and women merely players.", model=roberta_model, tokenizer=roberta_tokenizer, mask_token='<mask>')
roberta_test_sentence.get_masked_token_probability("Ġstage")
roberta_test_sentence.predict_masked_sent(top_k=5)


<mask>: Ġstage,  | probability:, 0.02834651991724968, 

<mask>: Ġgame, | probability:, 0.2418021857738495
<mask>: Ġteam, | probability:, 0.036390986293554306
<mask>: Ġsport, | probability:, 0.028797855600714684
<mask>: Ġstage, | probability:, 0.02834651991724968
<mask>: Ġball, | probability:, 0.026413245126605034
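
The `Ġ` convention noted in the comments above can be wrapped in a small helper. This is a sketch under two assumptions: the helper name `to_roberta_token` is hypothetical, and the word must exist as a single token in RoBERTa's vocabulary (longer or rarer words are split into several pieces, which this helper does not handle).

```python
def to_roberta_token(word, starts_text=False):
    """Return the byte-level BPE form of a single-token word.

    Tokens that follow a space carry the prefix "Ġ"; a word at the very
    start of the text has no preceding space and therefore no prefix.
    """
    return word if starts_text else "Ġ" + word


print(to_roberta_token("stage"))
print(to_roberta_token("All", starts_text=True))
```

The result for a mid-sentence word is the string to pass to `get_masked_token_probability` when probing RoBERTa.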


<font color="red">█████ YOUR ANSWER HERE █████</font>

- **Research Question:**

Do BERT's predictions align with actual world knowledge?

- **Hypothesis:**

We think that BERT will produce predictions that are mostly consistent with actual world knowledge but may occasionally output incorrect information as well.

- **Test Sentences:**

  1. "The capital of France is [MASK]."

  We are testing whether BERT understands the connection between a country and its capital, and that the answer must be a city, since a capital is a city in this context. Essentially, we are testing geographical knowledge.

  An accurate prediction means it has the correct geographical knowledge.

  2. "The largest mammal in the world is [MASK]."

  BERT must understand that we are looking for an animal, and that it has to be the largest mammal in the world. Essentially, we are testing biological knowledge.

  An accurate prediction means it has the correct biological knowledge.

  3. "The fastest land animal is the [MASK]."

  Similarly, BERT must understand that we are looking for an animal that lives on land and is the fastest (in the world).

  An accurate prediction means it has the correct biological knowledge.

  4. "Water freezes at [MASK] degrees Celsius."

  Here we are testing a scientific fact: whether BERT can predict that the freezing point of water is 0 degrees Celsius.

  5. "The Great wall of China is located in [MASK]."

  Another geographical knowledge task.

  6. "The currency used in The Netherlands is [MASK]."

  Here we are testing whether BERT has economic knowledge, i.e. whether it can name the currency of a country.

  7. "A year has [MASK] months."

  A basic numerical fact: a year has 12 months. This tests whether BERT can correctly predict basic facts.

  8. "The primary language spoken in The Netherlands is [MASK]."

  Linguistic knowledge: which language is spoken in a country.

  9. "The Pyramids of [MASK] are one of the seven wonders of the world."

  This is historical knowledge: identifying that the pyramids of Egypt are one of the seven wonders of the world.

  10. "The currency used in [MASK] is the Yen."

  Lastly, another economic knowledge question.


  Overall, we are testing BERT's world knowledge by touching upon subjects that require knowledge in the domains of economics, history, language, basic numerical facts, geography, science and biology.



In [None]:
# Test sentences

test_sent = MaskedSentence("The capital of France is [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: paris, | probability:, 0.4167894423007965
[MASK]: lille, | probability:, 0.07141634821891785
[MASK]: lyon, | probability:, 0.06339266151189804
[MASK]: marseille, | probability:, 0.04444744810461998
[MASK]: tours, | probability:, 0.030297260731458664


In [None]:
test_sent = MaskedSentence("The largest mammal in the world is [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: lion, | probability:, 0.09077168256044388
[MASK]: gibbons, | probability:, 0.05036930367350578
[MASK]: jaguar, | probability:, 0.03229096904397011
[MASK]: turkey, | probability:, 0.03163881227374077
[MASK]: camel, | probability:, 0.031165773048996925


In [None]:
test_sent = MaskedSentence("The fastest land animal is the [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: elephant, | probability:, 0.19859649240970612
[MASK]: horse, | probability:, 0.048719849437475204
[MASK]: lion, | probability:, 0.041427768766880035
[MASK]: tiger, | probability:, 0.031486574560403824
[MASK]: deer, | probability:, 0.030749179422855377


In [None]:
test_sent = MaskedSentence("Water freezes at [MASK] degrees Celsius.")
test_sent.predict_masked_sent(top_k=5)

[MASK]: 100, | probability:, 0.04347000643610954
[MASK]: 60, | probability:, 0.04179767891764641
[MASK]: 50, | probability:, 0.04051332175731659
[MASK]: 30, | probability:, 0.03706340864300728
[MASK]: 90, | probability:, 0.03318754583597183


In [None]:
test_sent = MaskedSentence("The Great wall of China is located in [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: beijing, | probability:, 0.43733811378479004
[MASK]: china, | probability:, 0.15474867820739746
[MASK]: nanjing, | probability:, 0.05595695972442627
[MASK]: shanghai, | probability:, 0.054905593395233154
[MASK]: xinjiang, | probability:, 0.03460432589054108


In [None]:
test_sent = MaskedSentence("The currency used in The Netherlands is [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: euro, | probability:, 0.5109432339668274
[MASK]: euros, | probability:, 0.13625146448612213
[MASK]: silver, | probability:, 0.06058467924594879
[MASK]: gold, | probability:, 0.04999730736017227
[MASK]: paper, | probability:, 0.04012717306613922


In [None]:
test_sent = MaskedSentence("A year has [MASK] months.")
test_sent.predict_masked_sent(top_k=5)

[MASK]: six, | probability:, 0.059606391936540604
[MASK]: 12, | probability:, 0.05833644047379494
[MASK]: four, | probability:, 0.04770107567310333
[MASK]: three, | probability:, 0.046642009168863297
[MASK]: nine, | probability:, 0.041973307728767395


In [None]:
test_sent = MaskedSentence("The primary language spoken in The Netherlands is [MASK].")
test_sent.predict_masked_sent(top_k=5)

[MASK]: dutch, | probability:, 0.5414645671844482
[MASK]: english, | probability:, 0.17027541995048523
[MASK]: german, | probability:, 0.15079237520694733
[MASK]: french, | probability:, 0.06365203857421875
[MASK]: portuguese, | probability:, 0.01894487626850605


In [None]:
test_sent = MaskedSentence("The Pyramids of [MASK] are one of the seven wonders of the world.")
test_sent.predict_masked_sent(top_k=5)

[MASK]: egypt, | probability:, 0.8313038945198059
[MASK]: pharaoh, | probability:, 0.020176183432340622
[MASK]: omar, | probability:, 0.017123518511652946
[MASK]: hercules, | probability:, 0.013578164391219616
[MASK]: cairo, | probability:, 0.006811816710978746


In [None]:
test_sent = MaskedSentence("The currency used in [MASK] is the Yen.")
test_sent.predict_masked_sent(top_k=5)

[MASK]: japan, | probability:, 0.25111308693885803
[MASK]: taiwan, | probability:, 0.06420330703258514
[MASK]: togo, | probability:, 0.02597927488386631
[MASK]: china, | probability:, 0.02440056949853897
[MASK]: ghana, | probability:, 0.022728262469172478


- **Discussion:**

The results show a mix of correct and incorrect predictions by BERT, which is broadly in line with our hypothesis that most predictions would be correct but some would be wrong.

BERT accurately predicts several facts: the capital of France, the currency of The Netherlands, the primary language of The Netherlands, the location of the Pyramids, and the country whose currency is the Yen. For the location of the Great Wall of China, *china* is only the second prediction (0.155), after *beijing* (0.437), which we still count as approximately correct. These accurate predictions support the hypothesis that BERT can produce predictions consistent with actual world knowledge.

However, BERT also fails in several instances: the largest mammal, the fastest land animal, the freezing point of water, and the number of months in a year (where *12* is a close second to *six*). This shows that it sometimes struggles with certain types of knowledge, which supports the part of our hypothesis that it may occasionally output incorrect information.
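
These observations can be summarized as top-1 and top-5 hit rates. The sketch below scores the top-5 lists recorded in the outputs above against a set of gold answers; the gold answers, the dictionary keys, and the helper `hit_rate` are our own assumptions for illustration, and a different choice of acceptable answers would change the figures.

```python
# Top-5 predictions copied from the outputs above, in rank order.
recorded_top5 = {
    "capital of France": ["paris", "lille", "lyon", "marseille", "tours"],
    "largest mammal": ["lion", "gibbons", "jaguar", "turkey", "camel"],
    "fastest land animal": ["elephant", "horse", "lion", "tiger", "deer"],
    "water freezes": ["100", "60", "50", "30", "90"],
    "Great Wall location": ["beijing", "china", "nanjing", "shanghai", "xinjiang"],
    "currency of the Netherlands": ["euro", "euros", "silver", "gold", "paper"],
    "months in a year": ["six", "12", "four", "three", "nine"],
    "language of the Netherlands": ["dutch", "english", "german", "french", "portuguese"],
    "Pyramids of": ["egypt", "pharaoh", "omar", "hercules", "cairo"],
    "currency Yen": ["japan", "taiwan", "togo", "china", "ghana"],
}

# Assumed gold answers (a judgment call, not part of the recorded outputs).
gold = {
    "capital of France": {"paris"},
    "largest mammal": {"whale"},
    "fastest land animal": {"cheetah"},
    "water freezes": {"0", "zero"},
    "Great Wall location": {"china"},
    "currency of the Netherlands": {"euro", "euros"},
    "months in a year": {"12", "twelve"},
    "language of the Netherlands": {"dutch"},
    "Pyramids of": {"egypt", "giza"},
    "currency Yen": {"japan"},
}


def hit_rate(predictions, answers, k):
    """Fraction of sentences whose top-k predictions contain an accepted answer."""
    hits = sum(1 for key, preds in predictions.items() if answers[key] & set(preds[:k]))
    return hits / len(predictions)


top1 = hit_rate(recorded_top5, gold, k=1)
top5 = hit_rate(recorded_top5, gold, k=5)
print(f"top-1 hit rate: {top1:.1f}, top-5 hit rate: {top5:.1f}")
```

With these assumed answers, the recorded outputs give a top-1 hit rate of 0.5 and a top-5 hit rate of 0.7. Strict top-1 scoring counts *beijing* as a miss for the Great Wall sentence, so the figures depend on how leniently answers are judged.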

Compared with a human's linguistic intuitions, BERT's behaviour clearly does not fully approximate that of a person. A person with a basic education would easily identify almost all of the facts in the test sentences, or at least give a close answer. BERT's failures highlight its limitations in understanding and reasoning about factual information, especially when the context or the required information involves specific and less common knowledge.

Potential improvements for BERT could be:

  1. Incorporate more diverse and comprehensive training datasets that cover not only syntax but also deeper semantic understanding.

  2. Fine-tune the model on datasets designed specifically to improve accuracy on world knowledge and facts.

  3. Integrate external knowledge bases, such as Wikipedia, to enhance the model further.

# Acknowledgments

Concept and lab designed by Tom Roberts. BERT MLM script was heavily based on work by Yuchen Lin. Counterfactual example adapted from [Li, Yu, and Ettinger (2023)](https://aclanthology.org/2023.acl-short.70.pdf).