# AIPI 590 - XAI | Assignment #08
### XAI in LLMs
### Yabei Zeng

#### Link to Colab: https://colab.research.google.com/github/yabeizeng1121/XAI/blob/main/Assignment8/XAI_in_LLMs.ipynb

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yabeizeng1121/XAI/blob/main/Assignment8/XAI_in_LLMs.ipynb)


In [None]:
# Please use this to connect your GitHub repository to your Google Colab notebook
# Connects to any needed files from GitHub and Google Drive
import os

# Remove Colab's default sample_data folder (-f avoids an error if it is already gone)
!rm -rf ./sample_data

# Clone GitHub files to colab workspace
repo_name = "XAI" # Change to your repo name
git_path = 'https://github.com/yabeizeng1121/XAI.git' #Change to your path
!git clone "{git_path}"

# Install the extra dependency directly with pip
!pip install nlp --quiet
# Change working directory to location of notebook
notebook_dir = 'Assignment8'
path_to_notebook = os.path.join(repo_name,notebook_dir)
%cd "{path_to_notebook}"
%ls

## Model
I used the `distilbert-base-uncased-finetuned-sst-2-english` model, a distilled version of BERT fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset. The model performs binary sentiment classification: for each input it returns two logits, ordered `[NEGATIVE, POSITIVE]`, and the larger logit determines the predicted label.

## Objective
The goal is to analyze the key components of a prompt by introducing perturbations and observing their impact on the model’s output. By altering specific parts of the input text and examining the resulting changes in model confidence, we can identify which words or phrases are most influential in the sentiment classification.
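The perturbations in this notebook are written by hand. As a minimal sketch of how the same kind of variants could be generated automatically, the hypothetical helper `leave_one_word_out` below (not part of the original notebook) drops one word at a time and tidies the spacing around punctuation. It assumes simple sentences without contractions or hyphenated words.

```python
import re


def leave_one_word_out(text):
    """Return (removed_word, perturbed_text) pairs, one per word in `text`.

    Hypothetical helper for illustration; assumes plain sentences
    without contractions or hyphenated words.
    """
    # Split into word tokens and single punctuation tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    variants = []
    for i, tok in enumerate(tokens):
        if not tok[0].isalnum():
            continue  # only words are removed, never punctuation
        kept = tokens[:i] + tokens[i + 1:]
        # Re-join with spaces, then delete the space left before punctuation
        perturbed = re.sub(r"\s+([^\w\s])", r"\1", " ".join(kept))
        variants.append((tok, perturbed))
    return variants


text = "The food was absolutely wonderful, from preparation to presentation."
for removed, perturbed in leave_one_word_out(text):
    print(f"-{removed}: {perturbed}")
```

Each generated sentence could then be passed to the notebook's scoring function in place of the hand-written list, which makes it easy to test every word rather than a selected few.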


In [10]:
# Import the packages
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model from the Hugging Face Hub
model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # Set the model to evaluation mode

# Check if CUDA is available and move the model to GPU if it is
if torch.cuda.is_available():
    model.cuda()


In [11]:
# Define a function to encode text and return logits
def evaluate_text(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits

# Original text
text = "The food was absolutely wonderful, from preparation to presentation."

# Perturbed texts: each removes one word, in order
# "absolutely", "wonderful", "preparation", "presentation"
texts = [
    "The food was wonderful, from preparation to presentation.",
    "The food was absolutely, from preparation to presentation.",
    "The food was absolutely wonderful, from to presentation.",
    "The food was absolutely wonderful, from preparation to."
]

# Analyze the impact of perturbations
original_logits = evaluate_text(text)
print("Original:", original_logits)

for perturbed_text in texts:
    perturbed_logits = evaluate_text(perturbed_text)
    print(f"Perturbed: {perturbed_text[:50]}...", perturbed_logits)


Original: tensor([[-4.3844,  4.7253]])
Perturbed: The food was wonderful, from preparation to presen... tensor([[-4.3770,  4.7237]])
Perturbed: The food was absolutely, from preparation to prese... tensor([[-3.8379,  4.0673]])
Perturbed: The food was absolutely wonderful, from to present... tensor([[-4.3855,  4.7130]])
Perturbed: The food was absolutely wonderful, from preparatio... tensor([[-4.3718,  4.7330]])


## Analysis of Perturbations

1. **Original Text**: "The food was absolutely wonderful, from preparation to presentation."
   - **Tensor Output**: `tensor([[-4.3844, 4.7253]])`
   - **Analysis**: The high positive logit (4.7253), paired with a strongly negative logit for the negative class (-4.3844), indicates that the model is highly confident in the positive sentiment of the original prompt. These values are the baseline for the comparisons below.

2. **Perturbation 1**: "The food was wonderful, from preparation to presentation."
   - **Tensor Output**: `tensor([[-4.3770, 4.7237]])`
   - **Analysis**: Removing "absolutely" left the logits almost unchanged (positive logit from 4.7253 to 4.7237). As long as "wonderful" is present, the intensifier contributes very little on its own.

3. **Perturbation 2**: "The food was absolutely, from preparation to presentation."
   - **Tensor Output**: `tensor([[-3.8379, 4.0673]])`
   - **Analysis**: Removing "wonderful" caused the largest change of any perturbation: the positive logit dropped from 4.7253 to 4.0673. This marks "wonderful" as the key word for positive sentiment. The prediction is still positive, so the removal lowers the model's confidence rather than flipping the label.

4. **Perturbation 3**: "The food was absolutely wonderful, from to presentation."
   - **Tensor Output**: `tensor([[-4.3855, 4.7130]])`
   - **Analysis**: Omitting "preparation" had a minimal effect, with the positive logit moving from 4.7253 to 4.7130. This suggests that "preparation" is not a critical word for the model in determining positive sentiment.

5. **Perturbation 4**: "The food was absolutely wonderful, from preparation to."
   - **Tensor Output**: `tensor([[-4.3718, 4.7330]])`
   - **Analysis**: Removing "presentation" caused only a slight increase in the positive logit (from 4.7253 to 4.7330). As long as the main sentiment-bearing phrase ("absolutely wonderful") remains, the model's confidence in positive sentiment is maintained.
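Raw logits are hard to read as confidence. The sketch below converts the logits printed by the perturbation cell into softmax probabilities using only the standard library. The values are copied from the notebook output; the `[negative, positive]` index order is an assumption that matches how the analysis reads the scores, and the dictionary keys are labels introduced here for illustration.

```python
import math

# Logits copied from the notebook output, assumed order: [negative, positive]
recorded_logits = {
    "original": (-4.3844, 4.7253),
    'without "absolutely"': (-4.3770, 4.7237),
    'without "wonderful"': (-3.8379, 4.0673),
    'without "preparation"': (-4.3855, 4.7130),
    'without "presentation"': (-4.3718, 4.7330),
}


def softmax(logits):
    """Numerically stable softmax over a sequence of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Probability assigned to the positive class for each variant
positive_prob = {name: softmax(l)[1] for name, l in recorded_logits.items()}

for name, prob in positive_prob.items():
    print(f"{name:<24} P(positive) = {prob:.5f}")
```

On these values every variant keeps a positive-class probability above 0.999, so the logit drop from removing "wonderful" is clear in logit space but small in probability space, because the softmax is already saturated.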

## Conclusion
The analysis shows that "wonderful" is the dominant driver of the model's positive sentiment score in this example: removing it lowers the positive logit from 4.7253 to 4.0673, while removing "absolutely", "preparation", or "presentation" shifts it by less than 0.02. The intensifier "absolutely" therefore adds little once "wonderful" is present, and the phrase "from preparation to presentation" is close to neutral. Even without "wonderful" the prediction stays positive, so the perturbation reduces the model's confidence rather than flipping the label. For this sentence, the model relies mainly on the sentiment-laden adjective rather than on intensifiers or topical nouns.
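One way to turn these observations into a single ranking is to score each removed word by how much the logit margin (positive minus negative) falls when it is deleted. This is a sketch over the logits recorded in the notebook output, not a method used in the original analysis; the margin-based score and the names `margin` and `importance` are choices made here for illustration.

```python
# Logits copied from the notebook output, assumed order: [negative, positive]
original = (-4.3844, 4.7253)
after_removing = {
    "absolutely": (-4.3770, 4.7237),
    "wonderful": (-3.8379, 4.0673),
    "preparation": (-4.3855, 4.7130),
    "presentation": (-4.3718, 4.7330),
}


def margin(logits):
    """Gap between the positive and negative logits."""
    return logits[1] - logits[0]


# Importance of a word = drop in margin when that word is removed
baseline = margin(original)
importance = {
    word: baseline - margin(logits) for word, logits in after_removing.items()
}

# Words sorted from most to least influential
ranking = sorted(importance, key=importance.get, reverse=True)

for word in ranking:
    print(f"{word:<13} margin drop = {importance[word]:.4f}")
```

With these numbers "wonderful" ranks first with a margin drop of about 1.20, while the other three words each change the margin by less than 0.02.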
