# AIPI 590 - XAI | Assignment #10
### XAI in LLMs
### Shaunak Badani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AIPI-590-XAI/Duke-AI-XAI/blob/dev/templates/template.ipynb)

> This repository contains:
1. XAI in prompting with saliency scores
2. Embedding vectors mapped to 2 dimensions using t-SNE and UMAP.

# 1. XAI in prompting with Saliency methods

- This notebook uses vanilla gradients to explain the predictions of a sentiment analysis model by attributing each prediction to the tokens of the input prompt.

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from IPython.display import display, HTML
import numpy as np

In [2]:
MODEL_NAME = "finiteautomata/bertweet-base-sentiment-analysis"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)


- We first tokenize the sentence and run it through the model, then take the largest of the three output logits (negative, neutral, positive); its index is the predicted class.
- We then call `backward()` on that logit, which computes its gradient with respect to the model parameters.
- The gradients we need are those stored on the word-embedding matrix: the row for each input token ID holds the gradient with respect to that token's embedding.
- Each of those rows has 768 dimensions; we sum across them and take the absolute value to get a single saliency weight per token.
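The steps above can be sketched on a toy model that needs no download. This is a minimal illustration, not the notebook's model: the embedding size, the `tanh` plus mean-pool classifier, and the token IDs are all made-up assumptions. It also shows one caveat of reading gradients from the embedding matrix: rows are indexed by token ID, so a token that appears twice in a sentence gets the accumulated gradient of both positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the real model: embedding -> tanh -> mean pool -> linear.
vocab_size, embed_dim, num_classes = 10, 8, 3
embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, num_classes)

token_ids = torch.tensor([1, 4, 4, 7])   # token 4 is deliberately repeated

embedded = embedding(token_ids)          # shape: (seq_len, embed_dim)
embedded.retain_grad()                   # keep gradients on this non-leaf tensor
logits = classifier(torch.tanh(embedded).mean(dim=0))
logits.max().backward()                  # backpropagate from the top logit

# Option A (as in this notebook): rows of the embedding-matrix gradient.
# Repeated token IDs share one accumulated row.
per_id_grads = embedding.weight.grad[token_ids]       # (seq_len, embed_dim)

# Option B: gradients of the embedding output, one row per position.
per_position_grads = embedded.grad                    # (seq_len, embed_dim)

# Two common ways to collapse the embedding dimension into one score.
saliency_sum = per_position_grads.sum(dim=1).abs()    # |sum over dimensions|
saliency_l2 = per_position_grads.norm(dim=1)          # L2 norm over dimensions

print(saliency_sum)
print(saliency_l2)
```

For sentences without repeated tokens, options A and B give the same rows. The L2 norm is a common alternative to the absolute sum because positive and negative components cannot cancel each other out.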

In [4]:
sample_text = "This review is amazing!"
# Tokenize and run a forward pass with gradients enabled (no torch.no_grad()).
input_ids = tokenizer(sample_text, return_tensors="pt")
output_ids = model(**input_ids)
# Index and value of the largest logit, i.e. the predicted class.
predicted_class_index = torch.argmax(output_ids.logits)
max_val = torch.max(output_ids.logits)
print(max_val)
# Backpropagate from the top logit to populate .grad on the model parameters.
max_val.backward()
labels = ['negative', 'neutral', 'positive']
predicted_sentiment = labels[predicted_class_index]
print("Predicted Sentiment:", predicted_sentiment)

tensor(3.8919, grad_fn=<MaxBackward1>)
Predicted Sentiment: positive


In [5]:
indices = input_ids['input_ids'][0]
print(indices)

tensor([    0,   126,  2274,    17, 36148,    12,     2])


In [11]:
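# Rows of the embedding-matrix gradient for each input token ID: (seq_len, 768).
gradients = model.roberta.embeddings.word_embeddings.weight.grad[indices]
input_weights = torch.abs(gradients.sum(axis = 1))

# Drop the <s> and </s> special tokens from both arrays so they stay aligned.
tokens = np.array([tokenizer.decode(x) for x in indices])[1:-1]
weights = input_weights.numpy()[1:-1]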

- Write a function that color-codes tokens based on the weights given to it:

In [7]:
def visualize_token_attrs(tokens, attrs):
    """
      Visualize attributions for given set of tokens.
      Args:
      - tokens: An array of tokens
      - attrs: An array of attributions, of same size as 'tokens',
        with attrs[i] being the attribution to tokens[i]

      Returns:
      - visualization: An IPython.core.display.HTML object showing
        tokens color-coded based on strength of their attribution.
    """
    def get_color(attr):
        if attr > 0:
            g = int(128*attr) + 127
            b = 128 - int(64*attr)
            r = 128 - int(64*attr)
        else:
            g = 128 + int(64*attr)
            b = 128 + int(64*attr)
            r = int(-128*attr) + 127
        return r,g,b

    # normalize attributions for visualization.
    bound = max(abs(attrs.max()), abs(attrs.min()))
    attrs = attrs/bound
    html_text = ""
    for i, tok in enumerate(tokens):
        r, g, b = get_color(attrs[i])
        html_text += " <span style='color:rgb(%d,%d,%d)'>%s</span>" % \
                     (r, g, b, tok)
    return HTML(html_text)

In [12]:
display(visualize_token_attrs(tokens, weights))
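As a worked example of the normalization and color mapping, the standalone sketch below reproduces the arithmetic of `get_color` without IPython. The function name `attribution_to_rgb` and the raw weights are hypothetical, chosen only so the numbers are easy to follow.

```python
def attribution_to_rgb(attr):
    """Map a normalized attribution in [-1, 1] to an (r, g, b) tuple."""
    if attr > 0:
        # Positive attributions shift toward green.
        return 128 - int(64 * attr), int(128 * attr) + 127, 128 - int(64 * attr)
    # Zero and negative attributions shift from grey toward red.
    return int(-128 * attr) + 127, 128 + int(64 * attr), 128 + int(64 * attr)


raw_weights = [2.0, 1.0, 4.0, 0.0]             # hypothetical saliency weights
bound = max(abs(w) for w in raw_weights)       # largest magnitude: 4.0
normalized = [w / bound for w in raw_weights]  # [0.5, 0.25, 1.0, 0.0]
colors = [attribution_to_rgb(a) for a in normalized]

print(colors)
# [(96, 191, 96), (112, 159, 112), (64, 255, 64), (127, 128, 128)]
```

Because the weights in this notebook are absolute values, they are never negative, so tokens range from grey (no attribution) to bright green (strongest attribution) and the red branch is not exercised.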

- Encode all of the above logic in a function.

In [18]:
def visualize_sentence(sentence):
    # Reset gradients so they do not accumulate across sentences.
    model.zero_grad()

    # Forward pass, then backpropagate from the largest (predicted-class) logit.
    input_ids = tokenizer(sentence, return_tensors="pt")
    output_ids = model(**input_ids)
    predicted_class_index = torch.argmax(output_ids.logits)
    max_val = torch.max(output_ids.logits)
    max_val.backward()
    labels = ['negative', 'neutral', 'positive']
    predicted_sentiment = labels[predicted_class_index]
    print("Predicted sentiment: ", predicted_sentiment)

    # Per-token saliency: |sum over the 768 embedding dimensions| of the gradient.
    indices = input_ids['input_ids'][0]
    gradients = model.roberta.embeddings.word_embeddings.weight.grad[indices]
    input_weights = torch.abs(gradients.sum(axis = 1))

    # Drop the <s> and </s> special tokens from both arrays so they stay aligned.
    tokens = np.array([tokenizer.decode(x) for x in indices])[1:-1]
    weights = input_weights.numpy()[1:-1]

    display(visualize_token_attrs(tokens, weights))
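The gradient reset at the top of the function matters because PyTorch adds new gradients to whatever is already stored in `.grad`. A minimal sketch on a two-element tensor (the tensor `w` and the loss are made up for illustration):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()          # gradient of sum(3 * w) is 3 per element

(w * 3).sum().backward()        # second backward pass without a reset
accumulated = w.grad.clone()    # gradients were added together: 6 per element

w.grad.zero_()                  # reset, as visualize_sentence does
(w * 3).sum().backward()
after_reset = w.grad.clone()    # back to 3 per element

print(first, accumulated, after_reset)
```

Without the reset, the saliency of each new sentence would be mixed with gradients left over from earlier sentences.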


In [19]:
visualize_sentence("I did not like the food!")

Predicted sentiment:  negative


- In the rendered output, the token "not" is strongly highlighted, indicating that it had a large influence on the negative prediction for this sentence.

In [17]:
# some more examples

sentences = [
    "I absolutely love this product!",
    "The book was okay, not too bad, but not great either.",
    "The customer service was terrible, they didn’t help at all.",
    "I didn’t like the food at all, it was tasteless.",
    "The experience was horrible, I will never visit again.",
    "I regret buying this phone, it doesn’t meet my expectations.",
    "I’m waiting for the bus to arrive.",
    "I went to the store and bought some groceries.",
    "I hate how slow my internet is today.",
    "The movie was fantastic, I highly recommend it.",
    "This phone is amazing, so smooth and fast!",
    "I had a wonderful day at the beach with my friends.",
    "The weather is quite average today, not too hot or cold.",
    "I’m really upset about how my team performed in the game.",
]

In [20]:
for sentence in sentences:
  visualize_sentence(sentence)

Predicted sentiment:  positive
Predicted sentiment:  neutral
Predicted sentiment:  negative
Predicted sentiment:  negative
Predicted sentiment:  negative
Predicted sentiment:  negative
Predicted sentiment:  neutral
Predicted sentiment:  neutral
Predicted sentiment:  negative
Predicted sentiment:  positive
Predicted sentiment:  positive
Predicted sentiment:  positive
Predicted sentiment:  neutral
Predicted sentiment:  negative


# 2. Using UMAP, t-SNE and PCA to project the embedding space into 2 dimensions and visualize the outputs
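As a rough illustration of the PCA case only, the sketch below projects a matrix of embedding vectors onto its first two principal components using NumPy. It is a sketch under stated assumptions, not the code used for this assignment: the helper `pca_project_2d` is hypothetical, and the embeddings are random placeholder data standing in for real 768-dimensional vectors.

```python
import numpy as np


def pca_project_2d(embeddings):
    """Project rows of `embeddings` onto the top two principal components."""
    # Center each dimension so the components capture variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
placeholder_embeddings = rng.normal(size=(50, 768))   # synthetic data, not real embeddings
points_2d = pca_project_2d(placeholder_embeddings)

print(points_2d.shape)   # (50, 2)
```

PCA is a linear projection, whereas t-SNE and UMAP are nonlinear methods that try to preserve local neighborhood structure, so the three methods can produce quite different 2D layouts for the same embeddings.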

# AI Usage

- None of the code in this repository was written using AI.
- AI was used to help understand some of the concepts, such as saliency scores, UMAP, and t-SNE, in more depth.