# Testing out NLI 

This notebook establishes baseline usage of the model `tals/albert-xlarge-vitaminc` from [HuggingFace](https://huggingface.co/tals/albert-xlarge-vitaminc).

Here are a few text examples for sanity checks:

In [None]:
evidence_claim_pairs = [
    {
        "evidence": "The new policy has led to a significant decrease in crime rates.",
        "claim": "The new policy reduces crime.",
        "label": "Supports"
    },
    {
        "evidence": "There are no studies showing a direct link between the policy and crime rates.",
        "claim": "The new policy has a high impact on crime rates.",
        "label": "Not enough info"
    },
    {
        "evidence": "Crime rates have increased since the policy was implemented.",
        "claim": "The new policy reduces crime.",
        "label": "Refutes"
    }
]


## Imports and Model Loading

In [None]:
%%writefile src/entailment_vitc.py 
# This will write the function to a file for use downstream in the pipeline. It can be commented out while experimenting.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("tals/albert-xlarge-vitaminc")
model = AutoModelForSequenceClassification.from_pretrained("tals/albert-xlarge-vitaminc")


## Prediction pipeline

Takes two strings, the evidence and the claim, and returns a tuple of the predicted label and its confidence score.

In [None]:
%%writefile -a src/entailment_vitc.py 

def NLI(text_a, text_b):
    """Predicts the relationship between a claim and evidence. 

    Args:
        text_a (str): The evidence statement.
        text_b (str): The claim statement.

    Returns:
        tuple[str, float]: The predicted relationship label ("SUPPORTS", "REFUTES", or
            "NOT ENOUGH INFO") and the confidence score of that prediction.
    """

    # Tokenize the input
    inputs = tokenizer(
        text_a, text_b,
        return_tensors='pt',  # Return PyTorch tensors
        padding=True,         # Pad to the longest sequence in the batch
        truncation=True       # Truncate to the model's max length
    )

    # Make predictions
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)
        logits = outputs.logits
    
    # Convert logits to probabilities
    probabilities = F.softmax(logits, dim=1)
    
    # Get the predicted class and its score
    predicted_class = torch.argmax(probabilities, dim=1).item()
    predicted_score = probabilities[0, predicted_class].item()

    # Label mapping
    label_map = {0: "SUPPORTS", 1: "REFUTES", 2: "NOT ENOUGH INFO"}  # Updated label mapping

    return label_map[predicted_class], predicted_score  # Return label and score
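
To make the logits-to-label step concrete, here is a minimal sketch in plain Python that needs no model. The logit values are made up for illustration, and `softmax` is a hypothetical stand-in for `F.softmax`; the label order is the same as in `label_map` above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one (evidence, claim) pair; not real model output.
example_logits = [2.0, 0.5, -1.0]
label_map = {0: "SUPPORTS", 1: "REFUTES", 2: "NOT ENOUGH INFO"}

probs = softmax(example_logits)
# Index of the largest probability, i.e. the argmax.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(label_map[predicted_class], round(probs[predicted_class], 3))
```

The returned score is the probability of the winning class only, so it says nothing about how the remaining probability is split between the other two labels.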


Silly example:

In [None]:
# Example usage
label, score = NLI("The sky is blue.", "The color of the pineapple is orange.")
print(f"Predicted class: {label}, Score: {score}")



Another one

In [None]:
label, score = NLI(
    "The sky is light blue.", # evidence
    "The color of the sky is blue." # claim
    )
print(f"Predicted class: {label}, Score: {score}")


## Testing
Let's try our original examples.

In [None]:
for pair in evidence_claim_pairs:
    print(
        NLI(pair['evidence'],pair['claim']))
    print("ground truth:", pair['label'])

That looks good! Yay!

Let's move on to applying this to Value Kaleidoscope.

# Value Kaleidoscope

## Optional: downloading the original VP dataset (skip otherwise)

In order to download this, you need a HuggingFace access token. You should add this by running `huggingface-cli login` in the command line before running the next cell.

In [None]:
import pandas as pd
df = pd.read_csv("hf://datasets/allenai/ValuePrism/full/full.csv")
df.head()

## Replicating VK experiment

In the paper, they describe the experimental setup as:

Concretely, for an LLM response with $n$ sentences $S = \{s_1, \dots, s_n\}$ and VK’s explanation $e$ of how this value is related to the given situation, we calculate
$$
\max^n_{i=1} \mathbb{1}(NLI(s_i, e) \textit{ is most\_probable})
$$
as whether the value is reflected somewhere in the LLM’s response, with $\mathbb{1}$ as the indicator function, NLI produces the entailment score, and $\textit{most\_probable}$ indicates that entailment is the most likely in the three-way classification (contradiction, entailment, neutral). The scores are then averaged across all values associated with each situation and then across situations.

So basically, you loop over each sentence of the LLM response and check whether it entails the value's explanation sentence. If any sentence does, the response gets a score of 1 for that value, and 0 otherwise.

Each VP situation contains several values, so to score an LLM response for a situation, we divide the number of values present by the total number of values for that situation.

Then we average over all situations to get a final score for this LLM on this dataset.

As per Tobin's request, we will also be storing which sentence(s) correspond to which value(s) so that we can do span metrics later on. 
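
The scoring scheme described above can be sketched as follows. This is only an illustration under stated assumptions: `value_present`, `situation_score`, `dataset_score`, and `toy_nli` are hypothetical names, and `toy_nli` is a word-overlap stand-in for the real classifier so the example runs without loading the model.

```python
from statistics import mean

def value_present(sentences, explanation, nli):
    """Indicator: 1 if any sentence is labeled SUPPORTS for the explanation, else 0."""
    return int(any(nli(s, explanation)[0] == "SUPPORTS" for s in sentences))

def situation_score(sentences, explanations, nli):
    """Fraction of a situation's values that are reflected in the response."""
    return mean(value_present(sentences, e, nli) for e in explanations)

def dataset_score(situations, nli):
    """Mean of per-situation scores; `situations` is a list of (sentences, explanations)."""
    return mean(situation_score(s, es, nli) for s, es in situations)

# Toy stand-in for the real classifier: "supports" when the two texts share a word.
def toy_nli(sentence, explanation):
    shared = set(sentence.lower().split()) & set(explanation.lower().split())
    return ("SUPPORTS", 1.0) if shared else ("NOT ENOUGH INFO", 1.0)

situations = [
    (["honesty matters here", "be kind"], ["honesty builds trust", "safety first"]),
    (["nothing relevant"], ["fairness for all"]),
]
print(dataset_score(situations, toy_nli))
```

Passing the classifier in as an argument keeps the metric logic testable on its own; in the notebook, the `NLI` function would take the place of `toy_nli`.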

## LLM Responses for ValuePrism

In [None]:
vp = pd.read_csv('data/questions_and_human_perspectives_with_responses.csv')
vp = vp[vp.source=='valueprism']
vp.head()

## Experiment

### Helper functions

In [None]:
%%writefile -a src/entailment_vitc.py 

import nltk
nltk.download('punkt')  # newer NLTK releases may need nltk.download('punkt_tab') instead

def tokenize_sentences(text):
    # Use NLTK's sent_tokenize to split the text into sentences
    sentences = nltk.sent_tokenize(text)
    return sentences

def find_span_indices(string, substring):
    start_index = string.find(substring)
    if start_index == -1:
        return None  # Return None if the substring is not found
    end_index = start_index + len(substring)
    return (start_index, end_index)
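
One caveat with `find_span_indices`: `str.find` always returns the first occurrence, so a sentence that appears twice in a response gets the same span both times. Below is a minimal sketch of one way around that, searching from the end of the previous match. The helper name `find_all_spans` is hypothetical, and it assumes the sentences are passed in the order they appear in the text.

```python
def find_all_spans(text, sentences):
    """Locate each sentence in `text`, searching from the end of the previous match."""
    spans = []
    cursor = 0
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            spans.append(None)  # not found verbatim; leave the cursor where it was
            continue
        end = start + len(sentence)
        spans.append((start, end))
        cursor = end
    return spans

text = "Yes. It depends. Yes."
print(find_all_spans(text, ["Yes.", "It depends.", "Yes."]))
# [(0, 4), (5, 16), (17, 21)]
```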


### Experiment loop
Let's just do gpt-4o-mini for now.

Results json structure:

```json
rj = {
    "gpt-4o-mini": {
        "<question1_text>": {
            "model_response": "<model_response>",
            "values": { # dict of all explanations for each value for the given question and the results
                "<explanation1_text>": {
                    "labels": [], # list of predicted labels ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO") for each sentence in model_response
                    "scores": [], # list of scores of each predicted label for each sentence in model_response
                    "spans": [] # list of (start, end) character spans of each sentence in model_response
                },
                ...
                # rest of explanations
            },
            "avg": "<avg score over the values>" # avg num values present over the total num values for this question
        },
        ...
        # rest of questions
    },
    ...
    # (eventually) rest of models
}
```
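
Here is a small sketch of how results in this shape could be consumed. It works at the per-question level (the dict that sits under a model key), uses the label strings returned by `NLI`, and runs on hand-written toy data rather than real results. The helper names `dataset_average` and `supporting_spans` are hypothetical.

```python
def dataset_average(per_question):
    """Mean of the per-question "avg" fields."""
    avgs = [entry["avg"] for entry in per_question.values()]
    return sum(avgs) / len(avgs)

def supporting_spans(entry, explanation):
    """Spans of the response sentences labeled SUPPORTS for one explanation."""
    value = entry["values"][explanation]
    return [span for label, span in zip(value["labels"], value["spans"])
            if label == "SUPPORTS"]

# Hand-written toy data in the per-question shape; not real results.
toy = {
    "q1": {
        "model_response": "Be honest. Stay safe.",
        "values": {
            "Honesty builds trust.": {
                "labels": ["SUPPORTS", "NOT ENOUGH INFO"],
                "scores": [0.9, 0.8],
                "spans": [(0, 10), (11, 21)],
            }
        },
        "avg": 1.0,
    },
    "q2": {"model_response": "No.", "values": {}, "avg": 0.5},
}

print(dataset_average(toy))
print(supporting_spans(toy["q1"], "Honesty builds trust."))
```

Note that after a round trip through `json.dump` and `json.load`, the span tuples come back as two-element lists.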

In [None]:
%%writefile -a src/entailment_vitc.py 

import ast
import json
import numpy as np
# models = ['gpt-4o-mini','gpt-3.5-turbo','gemini-1.5-flash-002','mistral-7b-instruct','gemma-2-2b-it','llama-3.1-8B-it']
# rj = {model: {} for model in models}
results = {}


# just doing gpt-4o-mini for now
for _, row in vp.iterrows():  # the row index is unused; avoid shadowing the built-in `id`
    question = row.question
    perspectives = ast.literal_eval(row.perspectives) # list of str of the form "Value: <value>\nExplanation: <explanation>"
    # we just want the explanation
    explanations = [p.split("Explanation: ")[-1] for p in perspectives]
    model_response = row['gpt-4o-mini']
    results[question] = {
        "model_response": model_response,
        "values": {},
        "avg": 0
    }
    S = tokenize_sentences(model_response) # sentences of LLM response S = {s_1, ... , s_n}
    presence = [] # list to store the binary indicator function results
    for e in explanations:
        results[question]["values"][e] = {
            "labels": [], 
            "scores": [],
            "spans": []
        }
        for si in S:
            label, score = NLI(si,e)
            span = find_span_indices(model_response,si)
            results[question]["values"][e]["labels"].append(label)
            results[question]["values"][e]["scores"].append(score)
            results[question]["values"][e]["spans"].append(span) 
        presence.append(1 if "SUPPORTS" in results[question]["values"][e]["labels"] else 0)
    # now let's calculate the average over all the values
    results[question]["avg"] = np.mean(presence)
    
    # save results (the file is rewritten after every question, so partial results survive an interruption)
    with open('data/results/NLI_VP_results_gpt-4o-mini.json', 'w') as f:
        json.dump(results,f,indent=2)


## Benchmarking on [MNLI](https://huggingface.co/datasets/nyu-mll/multi_nli)

The VitaminC paper evaluates the model on MNLI (Williams et al., 2018), using "the hypothesis as the claim and the premise as the evidence... on the “mismatched” evaluation set." The paper reports an accuracy of 78.89%.

The MNLI labels are 0: entailment, 1: neutral, and 2: contradiction.



In [None]:
%%writefile -a src/entailment_vitc.py 

import pandas as pd  # needed when this file is run outside the notebook

# Load the "mismatched" validation split
splits = {
    'train': 'data/train-00000-of-00001.parquet',
    'validation_matched': 'data/validation_matched-00000-of-00001.parquet',
    'validation_mismatched': 'data/validation_mismatched-00000-of-00001.parquet',
}
df_mnli = pd.read_parquet("hf://datasets/nyu-mll/multi_nli/" + splits["validation_mismatched"])

# label_mapping = {0: "entails", 1: "neutral", 2: "contradiction"} # true MNLI label mapping
# label mapping to match the vitaminc labels
label_mapping = {0: "SUPPORTS", 1: "NOT ENOUGH INFO", 2: "REFUTES"}

In [None]:
df_mnli.head()

Let's look at one example and see if the model can predict it correctly!

In [None]:
ex = df_mnli.sample(1)
label_mapping[ex['label'].values[0]]

In [None]:
NLI(ex['premise'].values[0],ex['hypothesis'].values[0])

### Run baseline on a sample of the mismatched validation set

Let's sample 200 examples from the mismatched validation set and predict on them to see whether we match the 78.89% accuracy from the paper.

In [None]:
%%writefile -a src/entailment_vitc.py 

n = 200
seed = 0
sample_df_mnli = df_mnli.sample(n, random_state=seed)

In [None]:
%%writefile -a src/entailment_vitc.py 

import tqdm
results = {}
correct = 0
for index, row in tqdm.tqdm(sample_df_mnli.iterrows(), total=n):
    premise = row['premise']
    hypothesis = row['hypothesis']
    label = label_mapping[row["label"]]
    prediction, score = NLI(premise, hypothesis)
    results[index] = {
        "promptID": int(row["promptID"]),  # plain int, so json.dump can always serialize it
        "pairID": row["pairID"],
        "premise": premise,
        "hypothesis": hypothesis,
        "prediction": prediction,
        "score": score,
        "label": label
    }
    if prediction == label:
        correct += 1
print(f"Accuracy on MNLI sample size n={n} is {round(100 * correct / n, 1)}%")  # round after scaling to avoid float artifacts

with open(f'data/results/MNLI_predictions_seed-{seed}_n-{n}.json', 'w') as f:
    json.dump(results, f, indent=2)


The above prints "Accuracy on MNLI sample size n=200 is 73.5%", which is reasonably close to the paper's 78.89% for a sample of only 200 examples. Yay!
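
To judge how close 73.5% on 200 examples is to the reported 78.89%, a rough confidence interval helps. The sketch below uses the normal-approximation (Wald) interval for a proportion, which is a simplifying assumption and not something taken from the paper; `accuracy_interval` is a hypothetical helper name.

```python
import math

def accuracy_interval(correct, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, returned in percent."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * (p - half_width), 100 * (p + half_width)

# 73.5% of 200 corresponds to 147 correct predictions.
low, high = accuracy_interval(147, 200)
print(f"{low:.1f}% to {high:.1f}%")
```

With `z=1.96` (a 95% interval) this gives roughly 67% to 80%, so the paper's figure is consistent with this sample; a larger `n` would narrow the interval.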


Here's a small function to calculate the accuracy when loading from a previously saved JSON file:

In [None]:
%%writefile -a src/entailment_vitc.py 

def process_results_json(file_path: str) -> float:
    """Return the accuracy (in percent) of the predictions stored in a results JSON file."""
    with open(file_path, 'r') as f:
        results = json.load(f)
    
    n = len(results)  # number of predictions stored in the file
    correct = sum(1 for result in results.values() if result['prediction'] == result['label'])
    accuracy = round(100 * correct / n, 1)  # percent with one decimal place
    return accuracy

# Example usage
process_results_json('data/results/MNLI_predictions_seed-0_n-200.json')
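
Overall accuracy can hide differences between classes, so a per-label breakdown of the same saved records may be useful. This is a minimal sketch on hand-written toy records that use only the `prediction` and `label` fields; `per_label_accuracy` is a hypothetical helper name.

```python
from collections import Counter

def per_label_accuracy(results):
    """Accuracy per gold label, from a dict of prediction records."""
    totals, hits = Counter(), Counter()
    for record in results.values():
        totals[record["label"]] += 1
        if record["prediction"] == record["label"]:
            hits[record["label"]] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Toy records, not real predictions.
toy_results = {
    "0": {"prediction": "SUPPORTS", "label": "SUPPORTS"},
    "1": {"prediction": "REFUTES", "label": "SUPPORTS"},
    "2": {"prediction": "REFUTES", "label": "REFUTES"},
}
print(per_label_accuracy(toy_results))
```

The same function could be applied to the dict returned by `json.load` on a saved predictions file.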