During the IDF dict calculation, the weight associated with special tokens is zeroed (see bert_score/bert_score/score.py, lines 243 to 246 at commit dbcf6db). But, to my understanding of the code, this weight never actually prevents a non-special token embedding from getting matched with a [SEP] or [CLS] token embedding.
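As a minimal sketch of what I mean (toy numbers, not the actual bert_score code): the IDF weight only rescales each reference token's contribution to recall after the greedy max has been taken, so a regular reference token can still take its max against the hypothesis' [SEP] or [CLS] column.

import torch

# sim[i, j] = cosine similarity between ref token i and hyp token j
# hyp layout: [CLS] w1 w2 [SEP]; ref layout is the same
sim = torch.tensor([
    [0.9, 0.1, 0.2, 0.3],   # ref [CLS]
    [0.2, 0.4, 0.3, 0.95],  # ref w1: its best match is hyp [SEP]
    [0.1, 0.3, 0.8, 0.2],   # ref w2
    [0.3, 0.2, 0.1, 0.9],   # ref [SEP]
])
ref_idf = torch.tensor([0.0, 1.0, 1.0, 0.0])  # special tokens zeroed

word_recall = sim.max(dim=1).values                     # greedy max over hyp tokens
recall = (word_recall * ref_idf).sum() / ref_idf.sum()  # IDF only reweights afterwards
print(word_recall)  # ref w1 still matches hyp [SEP] with 0.95
print(recall)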
I noticed this because I was obtaining different recall/precision values on certain pairs in a custom implementation. The difference disappears if I stop masking pairs involving a special token in the cosine similarity matrix.
That code looks something like:
ref_mask = self._select_by_tokens(token_masks, ref_tokens)
hyp_mask = self._select_by_tokens(token_masks, hyp_tokens)

# mask rows according to ref_mask and columns according to hyp_mask
# reminder: this is the mask used to mask off special tokens
similarity_matrix[~ref_mask, :] = 0.0
similarity_matrix.transpose(1, 2)[~hyp_mask, :] = 0.0
Testing without IDF weighting, using google-bert/bert-base-uncased at layer 12 (not a carefully considered choice, it's just for the repro), the following pair of sentences reproduces the issue:
ref: "WE'LL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HAVE A REGULAR HOUSE CLEANING"
hyp: "WILL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HALF A REGULAR HOUSE CLEANING"
With my implementation, greedy selection through the matrix shows a difference at the 2nd (non-special) token:

with masking disabled: 0.70251393, 0.95448172, 0.45837021, ..., resulting in a recall of 0.82332665 (matches bert-score)
with masking enabled: 0.70251393, 0.18742326, 0.45837021, ..., resulting in a recall of 0.78071225
Inspecting the cosine similarity matrix indicates that 0.95448172 is the similarity between the 2nd token and the last token ([SEP]).
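To illustrate what the masking changes for that 2nd token: in the toy row below, 0.95448172 and 0.18742326 are the two values reported above, the other entries are placeholders.

import torch

# similarities of the 2nd ref token against the hyp tokens;
# first and last columns are [CLS] and [SEP]
row = torch.tensor([0.30, 0.18742326, 0.12, 0.95448172])
hyp_mask = torch.tensor([False, True, True, False])  # True = regular token

print(row.max())            # 0.9545 -> [SEP] wins without masking
print(row[hyp_mask].max())  # 0.1874 -> best match among regular tokens only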
I don't know if this is intended, but since those special tokens are weighted down to 0 in the IDF dict, I'm assuming the intent is to never actually consider them. I have not tried to check whether that degrades the quality of the metric, so maybe it doesn't matter. In any case, I felt like this was worth documenting as an issue.