In [1]:
pip install transformers



In [4]:
from transformers import RobertaForMaskedLM, RobertaTokenizer

# Initialize RoBERTa for Masked Language Modeling
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
mlm_model = RobertaForMaskedLM.from_pretrained('roberta-large')
mlm_head = mlm_model.lm_head


We directly use the framework of **prompt-based learning** to explain model predictions. Here, we first define the **template** and the **verbalizer** for the task of sentiment analysis.

**Please note that this is just a simple implementation of PromptExplainer to help understand our framework more easily. For experimental comparison purposes, please follow our original paper to select the templates and verbalizers.**

In [20]:
# template
template = "It was <mask>. "

# define label words in the verbalizer
positive = ['positive','good', 'great','superb','wonderful','fantastic','enjoy']
negative = ['negative','poor','bad','awful','terrible','horrible','hate']


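def first_token_id(word):
  # RoBERTa's BPE folds a leading space into the token, so we encode ' word'.
  # Index 0 of the encoding is <s>; index 1 is the first token of the word.
  return tokenizer.encode(' ' + word.strip())[1]

# map every label word to its vocabulary id
pos_ids = [first_token_id(cc) for cc in positive]
neg_ids = [first_token_id(cc) for cc in negative]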

print('pos_ids:',pos_ids,tokenizer.convert_ids_to_tokens(pos_ids))
print('neg_ids:',neg_ids,tokenizer.convert_ids_to_tokens(neg_ids))

pos_ids: [1313, 205, 372, 11415, 4613, 5500, 2254] ['Ġpositive', 'Ġgood', 'Ġgreat', 'Ġsuperb', 'Ġwonderful', 'Ġfantastic', 'Ġenjoy']
neg_ids: [2430, 2129, 1099, 11522, 6587, 11385, 4157] ['Ġnegative', 'Ġpoor', 'Ġbad', 'Ġawful', 'Ġterrible', 'Ġhorrible', 'Ġhate']


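Next, given the input "*I really enjoy this movie.*", we demonstrate how to calculate the **saliency scores**. The template is prepended to the sentence before tokenization, so the encoded sequence contains the template tokens followed by the input tokens.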

In [21]:
import torch
sentence = "I really enjoy this movie."
sentence = template + sentence
inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)

inputs_ids = inputs['input_ids'][0]

tokens = []
for i in inputs_ids:
  tokens.append(tokenizer.decode(i))
print('encoded tokens:',len(tokens),tokens)

with torch.no_grad():
    mlm_outputs = mlm_model(**inputs)
    # Equation 1 (feature disentanglement in the paper): the LM head projects
    # every token's hidden state into a disentangled space (the vocabulary).
    predictions = mlm_outputs.logits[0]
    print('predictions:',predictions.shape)


encoded tokens: 12 ['<s>', 'It', ' was', '<mask>', '.', ' I', ' really', ' enjoy', ' this', ' movie', '.', '</s>']
predictions: torch.Size([12, 50265])


In the last step, we performed the **feature disentanglement** of section 3.3. Every token is now represented by a vector of logits over the vocabulary, whose size is 50265. Each of the 50265 features corresponds to a unique token in the vocabulary, and its logit is an unnormalized score indicating how strongly the input token correlates with that vocabulary token. A logit is not yet a probability; it becomes one only after normalization.
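To make the difference between logits and probabilities concrete, here is a minimal plain-Python sketch. The five-word vocabulary and its logits are hypothetical values chosen for illustration, not model outputs, and `softmax` is a small helper written for this example.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one token position over a five-word toy vocabulary.
toy_vocab = ["good", "bad", "movie", "the", "enjoy"]
toy_logits = [6.0, 2.0, 1.0, 0.5, 5.0]

toy_probs = softmax(toy_logits)
for word, logit, prob in zip(toy_vocab, toy_logits, toy_probs):
    print(f"{word:>6}  logit={logit:5.2f}  prob={prob:.4f}")
```

The ordering of the words is the same before and after normalization; only the scale changes. In the notebook, each row of `predictions` plays the role of `toy_logits`, with 50265 entries instead of five.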

Next, we will generate the model's prediction and perform **discriminative feature extraction** in section 3.4.
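As a sanity check of the verbalizer step, the following plain-Python sketch averages the label-word logits per class and applies a softmax over the two averages. The logits are copied from the `<mask>` position output printed by the cell below (rounded to four decimals); `class_probabilities` is a helper name introduced here for illustration and does not come from the paper.

```python
import math

# Label-word logits at the <mask> position, copied from the printed output.
mask_logits_pos = [55.0499, 62.4359, 62.7249, 57.6104, 59.9281, 60.6659, 52.6526]
mask_logits_neg = [51.9433, 53.5667, 55.9558, 55.8370, 56.5814, 56.0261, 47.9566]

def class_probabilities(pos_logits, neg_logits):
    """Average each class's label-word logits, then softmax over the two means."""
    pos_mean = sum(pos_logits) / len(pos_logits)
    neg_mean = sum(neg_logits) / len(neg_logits)
    peak = max(pos_mean, neg_mean)  # subtract the max for numerical stability
    exp_pos = math.exp(pos_mean - peak)
    exp_neg = math.exp(neg_mean - peak)
    total = exp_pos + exp_neg
    return pos_mean, neg_mean, exp_pos / total, exp_neg / total

pos_mean, neg_mean, p_pos, p_neg = class_probabilities(mask_logits_pos, mask_logits_neg)
print(f"mean logits:   positive={pos_mean:.4f}, negative={neg_mean:.4f}")
print(f"probabilities: positive={p_pos:.4f}, negative={p_neg:.4f}")
```

Up to rounding, the results should agree with the `p_logits` and prediction values that the notebook cell prints. The same two steps, applied to every token position instead of only `<mask>`, yield the saliency scores.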


In [22]:
# generate model's prediction
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print('masked_index:',masked_index)

mask_logits_pos = predictions[masked_index][pos_ids]
print('mask_logits_pos:',mask_logits_pos)
mask_logits_neg = predictions[masked_index][neg_ids]
print('mask_logits_neg:',mask_logits_neg)
p_logits = [sum(mask_logits_pos)/len(mask_logits_pos),sum(mask_logits_neg)/len(mask_logits_neg)]
print('p_logits:',p_logits)
final_preds = torch.nn.functional.softmax(torch.tensor(p_logits),dim = -1)
print('The prediction (positive v.s. negative) is:',final_preds)

# Now we generate the saliency scores for each input token to explain the model's prediction

# equation 2
def extract_dis(ppp):
  ## this function is for discriminative feature extraction in section 3.4. It is basically a verbalizer to map model predictions to the label space.
  mask_logits_pos = ppp[pos_ids]
  # print('mask_logits_pos:',mask_logits_pos)
  mask_logits_neg = ppp[neg_ids]
  # print('mask_logits_neg:',mask_logits_neg)
  p_logits = [sum(mask_logits_pos)/len(mask_logits_pos),sum(mask_logits_neg)/len(mask_logits_neg)] # equation 3
  # print('p_logits:',p_logits)
  final_preds = torch.nn.functional.softmax(torch.tensor(p_logits),dim = -1) # equation 4
  return final_preds

label_id = 0
saliency_scores = []

feas = predictions # logits for every position; the template and special tokens are NOT removed here
for i in feas:
  s = extract_dis(i)
  saliency_scores.append(s[label_id])

# print('tokens:',tokens)
# print('saliency_scores:',saliency_scores)

for ii, jj in zip(tokens,saliency_scores):
  print('token:',ii,'saliency:',float(jj))


masked_index: 3
mask_logits_pos: tensor([55.0499, 62.4359, 62.7249, 57.6104, 59.9281, 60.6659, 52.6526])
mask_logits_neg: tensor([51.9433, 53.5667, 55.9558, 55.8370, 56.5814, 56.0261, 47.9566])
p_logits: [tensor(58.7239), tensor(53.9810)]
The prediction (positive v.s. negative) is: tensor([0.9914, 0.0086])
token: <s> saliency: 0.5877633690834045
token: It saliency: 0.9132519364356995
token:  was saliency: 0.9796391725540161
token: <mask> saliency: 0.9913626313209534
token: . saliency: 0.7945637106895447
token:  I saliency: 0.8072583079338074
token:  really saliency: 0.9590702056884766
token:  enjoy saliency: 0.986190915107727
token:  this saliency: 0.9251473546028137
token:  movie saliency: 0.5575780868530273
token: . saliency: 0.7945643067359924
token: </s> saliency: 0.8846296668052673
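The loop above scores every position, including the template and the special tokens. If only the input sentence is of interest, one possible post-processing step is to drop those positions and rank the rest. The sketch below does this on the printed scores (rounded to four decimals); the slice boundary `TEMPLATE_END` is an assumption read off the token list for this particular template, not a value defined in the paper.

```python
# (token, saliency) pairs copied from the printed output, rounded to four decimals.
scored_tokens = [
    ("<s>", 0.5878), ("It", 0.9133), (" was", 0.9796), ("<mask>", 0.9914),
    (".", 0.7946), (" I", 0.8073), (" really", 0.9591), (" enjoy", 0.9862),
    (" this", 0.9251), (" movie", 0.5576), (".", 0.7946), ("</s>", 0.8846),
]

# Assumption: <s> sits at position 0, the template "It was <mask>. " fills
# positions 1-4, and </s> is the final position.
TEMPLATE_END = 5  # index of the first token of the input sentence

# Keep only the input sentence, then sort by saliency (highest first).
input_scores = scored_tokens[TEMPLATE_END:-1]
ranked = sorted(input_scores, key=lambda pair: pair[1], reverse=True)

for token, score in ranked:
    print(f"{token.strip():>8}  {score:.4f}")
```

With these numbers, "enjoy" and "really" rank highest for the positive label and "movie" ranks lowest. A different template would need a different `TEMPLATE_END`.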



