<a href="https://colab.research.google.com/github/tanmaysurve/Language-Models/blob/main/Minicons_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Please read and follow these instructions before getting started:

I have only shared with you a view-only copy of this assignment. If you run code on it, **it will not save!**

To save this to your Google Drive, click File > Save a copy in Drive and follow the instructions.

You are not required to save this notebook, but if you want to keep your code for future reference, please save it!

# Analyzing Contextualized Word Embeddings for knowledge of word-senses

By Kanishka Misra (kmisra@purdue.edu)


The goal of this assignment is to familiarize you with handling the vectors computed by (roughly) state-of-the-art pre-trained language models.

Recall from Tuesday's lecture that language modeling is a commonly used method for training neural-network-based sequence models, and allows them to learn vector representations of words *in context.* For instance, every layer of the BERT model represents a word by relying on **all other words** in the sentence context that the word occurs in. 

This may lead us to hypothesize that BERT could have attained a decent competency in representing lexical ambiguity -- a phenomenon in which the same word has multiple meanings.

---


## A brief primer on lexical ambiguity

Lexical ambiguity manifests in language in two elementary ways:

The first way is when the same word has multiple meanings that are related. In this case, what we have is an instance of **polysemy**:

Consider the many polysemous senses of the word **face** (taken from [WordNet](http://wordnetweb.princeton.edu/perl/webwn?s=face&sub=Search+WordNet&o2=&o0=1&o8=1&o1=1&o7=&o5=&o9=&o6=1&o3=&o4=&h=000000)):
  1. (n) the front of the human head from the forehead to the chin and ear to ear. E.g., *his **face** was injured*
  2. (n) the feelings expressed on a person's face. E.g., *his angry **face**.*
  3. (n) the general outward appearance of something. E.g., *the **face** of the city is changing* (metaphorically related)
  4. ... (including anything else that is related to the three above)

The second way is when two or more distinct, unrelated meanings happen to have the same word form (this usually happens by coincidence). In this case, we have an instance of **homonymy** (when two words share sounds, it is called **homophony**).

Consider the following senses of the word form **bow**:

  1. (n) a slightly curved piece of resilient wood with taut horsehair strands; used in playing certain stringed instruments. *She checked on her **bow** before performing that night.*
  2. (n) bending the head or body or knee as a sign of reverence or submission or shame or greeting. *He dropped into a **bow** before them.*
  3. (n) a weapon for shooting arrows, composed of a curved piece of resilient wood with a taut cord to propel the arrow. *a **bow** and arrow.*

Coming back to our assignment -- our goal here is to analyze how a given model (or several models, up to you) represents the above lexical phenomena.

---

## Analysing lexical ambiguity in models

While there are several ways in which one can test for lexical ambiguity, we will be using the notion of vector space similarity. 
It is reasonable to expect that vectors of words with the same or related senses should be much closer together than vectors of words with unrelated senses. That is, the vector for the word **bank** in (1.) should be closer to that in (2.) than to that in (3.):

1. *I went to the **bank** to withdraw some cash.*
2. *John had an appointment with the manager of the **bank** yesterday.*
3. *They pulled the canoe up on the **bank**.*

This closeness can be measured by the cosine similarity:

$$
cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}
$$

Therefore a good model will show us the following result: $cos(bank_1, bank_2) > cos(bank_1, bank_3)$. This is exactly what we will be exploring in this homework.
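
To make the formula concrete, here is a minimal sketch in plain Python. The three vectors are made-up 3-dimensional stand-ins, not real BERT embeddings (which have hundreds of dimensions), and `cosine_similarity` is a name chosen only for this illustration.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between x and y: (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Note: this raises ZeroDivisionError if either vector is all zeros.
    return dot / (norm_x * norm_y)


# Toy vectors (assumed values, for illustration only).
bank_1 = [1.0, 2.0, 0.0]  # "bank" as a financial institution
bank_2 = [2.0, 4.0, 0.5]  # points in nearly the same direction as bank_1
bank_3 = [0.0, 1.0, 3.0]  # points in a rather different direction

sim_12 = cosine_similarity(bank_1, bank_2)
sim_13 = cosine_similarity(bank_1, bank_3)

# The pattern we hope a good model shows: cos(bank_1, bank_2) > cos(bank_1, bank_3)
print(sim_12 > sim_13)
```

Note that cosine similarity ignores vector length: `bank_2` is roughly twice as long as `bank_1`, yet the two are still very similar because they point in almost the same direction.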

---

## Deliverables

A document (pdf, or any other format) that has a description of your results and discussions. Each question following the demo has its own set of discussion content. **Code is optional, but feel free to include it. We will be mostly paying attention to the discussion and results.**

## Getting started



We will begin by installing my package, `minicons`.
If you are interested, check out its [documentation](https://minicons.kanishka.website) (still in active development). I have also made "getting started examples" for the package. They can be found [here](https://github.com/kanishkamisra/minicons/blob/master/examples/word_representations.md).

I wrote this package to make it easy to extract representations for words from pre-trained LMs. It also contains other very important utilities that we may use in this class at some point.

To install the package, click on the grey cell below, and either click "play" to the left of the cell, or hit `shift + enter`, which will run the cell and take you to the next code cell.

In [None]:
!pip install minicons # will show an error towards the end, but that's not an error in the installation, so worry not!

Collecting minicons
  Downloading minicons-0.2.3-py3-none-any.whl (20 kB)
Collecting transformers<5.0.0,>=4.4.1
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting urllib3<2.0.0,>=1.26.7
  Downloading urllib3-1.26.8-py2.py3-none-any.whl (138 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  ...

The cell above may have shown an error beginning with `ERROR: pip's dependency resolver...`, but that is an issue on Google's side. The package should be installed regardless.

Next, load some useful libraries that we'll need:

In [None]:
from minicons import cwe
import torch

We will then write code to compute the cosine similarity between two vectors (or tensors, in general; you do not have to worry about this for the purposes of this homework, but feel free to ask us questions separately):

In [None]:
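def cosine(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pairwise cosine similarity between the rows of `a` and the rows of `b`."""
    # Row-wise L2 norms, kept as column vectors so they broadcast across each row
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    # Scale each row to unit length; `eps` guards against division by zero
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    # Entry (i, j) is the cosine similarity between a[i] and b[j]
    sims = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sims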

# Loading pre-trained models

We are now ready to load our first pre-trained model! 

For simplicity, I will show this demo on the `bert-base-uncased` model, the smallest official BERT model released in the original paper.

Every pre-trained model that can be loaded by minicons is an instance of the `cwe.CWE` class. `CWE` stands for 'contextual word embeddings'.

BERT, RoBERTa, etc., are all instances of contextual word embeddings, since they emit vectors that take their input context into account.

In theory, any model that is part of the [huggingface hub](https://huggingface.co/models) can be loaded with this class.

To load bert-base, run the following code:

In [None]:
model = cwe.CWE('bert-base-uncased')


# Extracting representations of words (and phrases)

The function primarily used for extracting representations from models is `model.extract_representation()`. It accepts batches of instances represented in either of the following formats:

```
data = [
  (sentence_1, word_1),
  (sentence_2, word_2),
  ....
  (sentence_n, word_n)
]
```
where `word_i` is the word whose vector is to be extracted from its corresponding sentence (`sentence_i`)

or

```
data = [
  (sentence_1, (start_1, end_1)),
  (sentence_2, (start_2, end_2)),
  ....
  (sentence_n, (start_n, end_n))
]
```
where `(start_i, end_i)` are the character span indices for the target word in the ith sentence, i.e., `start_i` is the start index, and `end_i` is the end index.

For example, the entry `["I like reading books.", (15, 20)]` corresponds to the word `"books"`, because:

```python
"I like reading books."[15:20] == "books"
```
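
If the target word occurs more than once in a sentence, the span format removes the ambiguity about which occurrence you mean. Below is a minimal sketch of a helper that computes such a span. `char_span` is a hypothetical name and is not part of minicons; it matches plain substrings, so it will also find "book" inside "books".

```python
from typing import Tuple


def char_span(sentence: str, word: str, occurrence: int = 0) -> Tuple[int, int]:
    """Return (start, end) character indices of the n-th occurrence of `word`.

    `occurrence` is 0-based: 0 is the first occurrence, 1 the second, and so on.
    """
    start = -1
    for _ in range(occurrence + 1):
        # Continue searching just past the previous match.
        start = sentence.find(word, start + 1)
        if start == -1:
            raise ValueError(
                f"{word!r} does not occur {occurrence + 1} time(s) in {sentence!r}"
            )
    return start, start + len(word)


sentence = "I like reading books."
start, end = char_span(sentence, "books")
print((start, end), sentence[start:end])  # (15, 20) books

# Picking the second "bank" in a sentence that contains the word twice:
print(char_span("the bank by the river bank", "bank", occurrence=1))  # (22, 26)
```

The resulting tuple can be paired with its sentence in the second input format shown above.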

To keep things simple, I will be using sentences where a word only occurs once, and use the first method of representing the input.

In [None]:
data = [
        ['The quick brown fox jumped over the lazy dog.', 'fox'],
        ['The slow pink fox ran into the fast cat.', 'fox']
]


The `model.extract_representation` function also takes another parameter, `layer`.

Recall from an earlier class that pre-trained LMs usually have multiple layers. To find out how many layers a model has, run:

In [None]:
model.layers

12

Note that the 12 layers here means that there have been 12 total "multi-headed self attention" operations. Apart from this, the model also contains a 0th layer, which consists of representations that are passed to the first self-attention layer. These representations are composed of the static embeddings of the model (one for each word) which are combined with the position and segment embeddings using vector-addition.

By default, minicons uses the model's last layer.


Let us now extract embeddings from the structure we created earlier (the sentences containing the word 'fox')

In [None]:
embedding = model.extract_representation(data)

In [None]:
embedding # embeddings of the word fox in the sentences contained in `data`

tensor([[-0.0571, -0.2493, -0.3621,  ..., -0.3584,  0.2380,  0.3966],
        [-0.1199, -0.7536, -0.3551,  ..., -0.5651,  0.4172,  0.6888]])

The result is interpreted as follows:

```
tensor([[-0.0571, -0.2493, -0.3621,  ..., -0.3584,  0.2380,  0.3966], <- embedding for the first instance
        [-0.1199, -0.7536, -0.3551,  ..., -0.5651,  0.4172,  0.6888]]) <- embedding for the second instance
```

`bert-base` encodes each word as a 768-dimensional vector. You can check the dimensions of the above result using:

In [None]:
embedding.shape # 2 vectors of 768 dimensions each

torch.Size([2, 768])

Let us now compute the cosine similarity between the two "fox" embeddings. While there are a number of different ways of doing this, we will take the similarity of the above result with itself:

In [None]:
pairwise = cosine(embedding, embedding)
pairwise # notice that cosine is symmetric

tensor([[1.0000, 0.9278],
        [0.9278, 1.0000]])

The top right (or bottom left) value is the similarity between the two "fox" embeddings. We can access it with:

In [None]:
pairwise[0, 1].item() # similarity of fox in the first sentence with that in the second

0.9278203248977661

## An example

Let us now apply our knowledge about computing similarities with contextualized word representations to test how well BERT represents lexical ambiguity.

We will adopt the paradigm of defining a set of query instances (sentence-word pairs) and take each instance's similarity with a set of reference sentence-word pairs.

In the following case, we have (focus word bolded) the following queries:

1. Please just **book** me a place to stay already, will you!
2. I'll **reserve** those tickets shortly.
3. I liked reading that **book**!

Similarly, we have the following references:

1. My children said they will **book** us a trip to Hawaii!
2. Please just buy the **book** already, will you!
3. Lester, can you **book** my entire schedule for all of Monday?

**Exercise for the reader:** What words should be more similar to each other? (Notice the contexts for query 1 and reference 2)

In [None]:
query = [
         ["Please just book me a place to stay already, will you!", "book"],
         ["I'll reserve those tickets shortly.", "reserve"],
         ["I liked reading that book!", "book"]
]

reference = [
             ["My children said they will book us a trip to Hawaii!", "book"],
             ["Please just buy the book already, will you!", "book"],
             ["Lester, can you book my entire schedule for all of Monday?", "book"]
]

In [None]:
# Extract embeddings for each set of instances. For demonstration, let us look at the second-to-last layer (11)
reference_emb = model.extract_representation(reference, layer = 11)
query_emb = model.extract_representation(query, layer = 11)

In [None]:
# Take the cosine of every query with every reference
sims = cosine(query_emb, reference_emb)

# explore the output:
sims

tensor([[0.7540, 0.4687, 0.6881],
        [0.5885, 0.3825, 0.6178],
        [0.5589, 0.8250, 0.5271]])

In [None]:
# To get the similarity between the first query, "Please just book me...", and the reference list:
sims[0]

tensor([0.7540, 0.4687, 0.6881])

We see that the similarity of "book" in *Please just **book** me a place to stay...* with:
1. the first reference is 0.754
2. the second reference is 0.469
3. the third reference is 0.688

This means that the "book" in the first query is closest to the "book" in:

"My children said they will **book** us a trip to Hawaii!"


We can make the process of finding the closest embedding a little easier:

In [None]:
# For the first query, what is the closest usage of the word "book" in the reference set?

closest1 = reference[sims[0].argmax().item()] # argmax finds the index with the greatest value, in this case, the greatest similarity!

print(f"Query: {query[0]}\nClosest Reference: {closest1}")

Query: ['Please just book me a place to stay already, will you!', 'book']
Closest Reference: ['My children said they will book us a trip to Hawaii!', 'book']


In [None]:
# Repeating the same for the second query:
closest2 = reference[sims[1].argmax().item()] # argmax finds the index with the greatest value, in this case, the greatest similarity!

print(f"Query: {query[1]}\nClosest Reference: {closest2}")

Query: ["I'll reserve those tickets shortly.", 'reserve']
Closest Reference: ['Lester, can you book my entire schedule for all of Monday?', 'book']



In [None]:
# Third query:
closest3 = reference[sims[2].argmax().item()] # argmax finds the index with the greatest value, in this case, the greatest similarity!

print(f"Query: {query[2]}\nClosest Reference: {closest3}")

Query: ['I liked reading that book!', 'book']
Closest Reference: ['Please just buy the book already, will you!', 'book']


We see here that in all three cases, BERT-base (layer 11) prefers the reference whose sense matches the query! To draw broader conclusions, however, we would need a large dataset of diverse sentences.
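
The three near-identical cells above can be folded into a single loop. Here is a minimal sketch that works on plain nested lists, so it runs without loading a model. The numbers are copied from the `sims` output above; with a real tensor you could obtain the same structure with `sims.tolist()`. The helper name `closest_reference_indices` is my own for this illustration and is not part of minicons.

```python
from typing import List, Sequence


def closest_reference_indices(sims: Sequence[Sequence[float]]) -> List[int]:
    """For each query (row), return the index of the most similar reference (column)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Rows are queries, columns are references (values copied from the output above).
sims_rows = [
    [0.7540, 0.4687, 0.6881],
    [0.5885, 0.3825, 0.6178],
    [0.5589, 0.8250, 0.5271],
]

query_sentences = [
    "Please just book me a place to stay already, will you!",
    "I'll reserve those tickets shortly.",
    "I liked reading that book!",
]
reference_sentences = [
    "My children said they will book us a trip to Hawaii!",
    "Please just buy the book already, will you!",
    "Lester, can you book my entire schedule for all of Monday?",
]

for q, idx in zip(query_sentences, closest_reference_indices(sims_rows)):
    print(f"Query: {q}\nClosest Reference: {reference_sentences[idx]}\n")
```

This pattern scales to any number of queries, which is handy once you start building your own stimuli.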

Now, it's your turn!

# Assignment objectives

Using the code from above, your objectives are as follows:

**Preliminary:** Select a model and layer of your choice. Here are some suggested options (`Name: <identifier to be used in cwe.CWE()>, <number of layers>`):
```
BERT-base: bert-base-uncased, 12 layers
BERT-large: bert-large-uncased, 24 layers
RoBERTa-base: roberta-base, 12 layers
RoBERTa-large: roberta-large, 24 layers
```

If you want to be a little adventurous, check out other models here: https://huggingface.co/models


## Question 1: Same words, different meanings

Analyze your model (and layer) on a new polysemous/homonymous word (should at least contain 2 different senses of the word) using the same format as above:

```
query = list of instances containing two distinct usages of the word. 

reference = list of instances containing two distinct usages of the word, with each having a similar usage with at least one instance in the query. 

Example:

query = [
  ["i like books", "books"], 
  ["please book me a hotel", "book"]
]

reference = [
  ["she read that book", "book"], 
  ["I will book those tickets shortly", "book"]
]
```

The word you select should be different from the ones discussed in this file. Therefore, you cannot use: `face, book, bow, bank`. In all cases, the word being compared should be the same (different tense and number allowed: *books* vs. *book* or *book* vs *booked*)

**In your write-up, write what word you chose, the sentences you chose for the various senses of the word, and what you found.**

In [None]:
# Your code here, add new code cells below if you wish to by:
# 1. using the shortcut: esc -> b
# 2. or using the shortcut: cmd/ctrl + m -> b
# 3. or hovering to the center+bottom of this cell and clicking on "+code"



## Question 2: Different words, (related or same) meaning

For your set of query sentences from Question 1, come up with new reference instances whose target words are each related to only one of the senses. For example, if I were comparing book (novel) vs. book (reserving something):

```
query = [
  ["i like books", "books"], 
  ["please book me a hotel", "book"]
]

references = [
  ["that was a good novel", "novel"], 
  ["i'd like to make a reservation", "reservation"]
]
```

Here, `novel` should be closer to the first query than to the second. Similarly, `reservation` should be closer to the second query than to the first.
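
One way to check this expectation programmatically is sketched below. It assumes a square similarity matrix in which `sims[i][j]` is the similarity between query `i` and reference `j`, and in which reference `j` was written to match query `j`. The function name and the numbers are hypothetical, chosen only to illustrate the check; they are not model outputs.

```python
from typing import List, Sequence


def matches_expected_pairing(sims: Sequence[Sequence[float]]) -> List[bool]:
    """For each reference j, report whether its intended query j is its closest query."""
    n = len(sims)
    results = []
    for j in range(n):
        # Similarities of reference j with every query (column j of the matrix).
        column = [sims[i][j] for i in range(n)]
        closest_query = max(range(n), key=column.__getitem__)
        results.append(closest_query == j)
    return results


# Made-up similarities: rows = queries (books, book), columns = references (novel, reservation).
toy_sims = [
    [0.62, 0.41],
    [0.48, 0.57],
]

print(matches_expected_pairing(toy_sims))  # [True, True]
```

With real embeddings, the nested list could come from `cosine(query_emb, reference_emb).tolist()`, using the functions defined earlier in this notebook.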

**Same as above, discuss the stimuli you created, and what you found.**

In [None]:
# Your code here:



## Question 3: Your turn!

Ask your own question! It could be about comparing the above results on different models, or different layers of the same model. Feel free to explore! 

**Write about your analysis: the choices you made, the results you got, and the conclusions you drew.**