During the IDF dict calculation, the weight associated with special tokens is zeroed (see bert_score/bert_score/score.py, lines 243 to 246 at commit dbcf6db). But, to my understanding of the code, this weight never actually prevents a non-special token embedding from getting matched with a [SEP] or [CLS] token embedding.
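As a minimal sketch of what I mean (toy numbers, not the actual bert_score code): the IDF weight only rescales each reference token's contribution to recall after the greedy max has been taken, so a regular reference token can still take its max against the hypothesis' [SEP] or [CLS] column.

import torch

# sim[i, j] = cosine similarity between ref token i and hyp token j
# hyp layout: [CLS] w1 w2 [SEP]; ref layout is the same
sim = torch.tensor([
    [0.9, 0.1, 0.2, 0.3],   # ref [CLS]
    [0.2, 0.4, 0.3, 0.95],  # ref w1: its best match is hyp [SEP]
    [0.1, 0.3, 0.8, 0.2],   # ref w2
    [0.3, 0.2, 0.1, 0.9],   # ref [SEP]
])
ref_idf = torch.tensor([0.0, 1.0, 1.0, 0.0])  # special tokens zeroed

word_recall = sim.max(dim=1).values                     # greedy max over hyp tokens
recall = (word_recall * ref_idf).sum() / ref_idf.sum()  # IDF only reweights afterwards
print(word_recall)  # ref w1 still matches hyp [SEP] with 0.95
print(recall)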
I noticed this because I was obtaining different recall/precision values on certain pairs in a custom implementation. The difference disappears if I stop masking pairs involving a special token in the cosine similarity matrix.
That code looks something like:
ref_mask = self._select_by_tokens(token_masks, ref_tokens)
hyp_mask = self._select_by_tokens(token_masks, hyp_tokens)

# mask rows according to ref_mask and columns according to hyp_mask
# reminder: this is the mask used to mask off special tokens
similarity_matrix[~ref_mask, :] = 0.0
similarity_matrix.transpose(1, 2)[~hyp_mask, :] = 0.0
Testing without IDF weighting, using google-bert/bert-base-uncased at layer 12 (not a carefully considered choice, it's just for the repro), the following pair of sentences reproduces the issue:
ref: "WE'LL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HAVE A REGULAR HOUSE CLEANING"
hyp: "WILL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HALF A REGULAR HOUSE CLEANING"
With my implementation, greedy selection through the matrix shows a difference at the 2nd (non-special) token:

with masking disabled: 0.70251393, 0.95448172, 0.45837021, ..., resulting in a recall of 0.82332665 (matches bert-score)
with masking enabled: 0.70251393, 0.18742326, 0.45837021, ..., resulting in a recall of 0.78071225
Inspecting the cosine similarity matrix indicates that 0.95448172 is the similarity between the 2nd token and the last token ([SEP]).
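To illustrate what the masking changes for that 2nd token: in the toy row below, 0.95448172 and 0.18742326 are the two values reported above, the other entries are placeholders.

import torch

# similarities of the 2nd ref token against the hyp tokens;
# first and last columns are [CLS] and [SEP]
row = torch.tensor([0.30, 0.18742326, 0.12, 0.95448172])
hyp_mask = torch.tensor([False, True, True, False])  # True = regular token

print(row.max())            # 0.9545 -> [SEP] wins without masking
print(row[hyp_mask].max())  # 0.1874 -> best match among regular tokens only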
I don't know if this is intended, but since those special tokens are weighted down to 0 in the IDF dict, I'm assuming the intent is to never actually consider them. I have not tried to check whether that degrades the quality of the metric, so maybe it doesn't matter. In any case, I felt like this was worth documenting as an issue.