## Logs
- Factuality NLI
    - Without CoT
    - With CoT (WIN)
    - WikiQA
        - Generated non-factual answers to measure factuality agreement.
        - Kendall score = 0.7
    - HotpotQA
        - Accuracy = 0.75
    - Possible improvements
        - Improve statement generation

- Relevance scores
    - QGen method
        - Models tried: t5-base / gptneo-125M
        - WikiQA
            - Kendall score = 0.65
            - Observation: fine-tuning the model on prompt/answer pairs may improve performance.
    - Cross-encoder method
        - Models tried: distilbert
        - WikiQA
            - Kendall score = 0.63
            

In [1]:
import json
from datasets import load_dataset
import re
import os
import openai
from tqdm import tqdm
import numpy as np
import random
from scipy.stats import kendalltau, spearmanr


In [2]:
os.chdir('/Users/shahules/belar/src/')

In [3]:
with open("/Users/shahules/openai-key.json") as key_file:
    OPENAI_KEY = json.load(key_file)["jj"]

In [4]:
os.environ["OPENAI_API_KEY"] = OPENAI_KEY

## OpenAI API

In [5]:
openai.api_key = OPENAI_KEY


def llm(prompt, **kwargs):
    response = openai.Completion.create(
        model=kwargs.get("model", "text-davinci-003"),
        prompt=prompt,
        temperature=kwargs.get("temperature", 0),
        top_p=kwargs.get("top_p", 1),
        frequency_penalty=kwargs.get("frequency_penalty", 0.0),
        presence_penalty=kwargs.get("presence_penalty", 0.0),
        max_tokens=kwargs.get("max_tokens", 500),
        logprobs=kwargs.get("logprobs", 1),
        n=kwargs.get("n", 1),
    )
    return response

In [6]:
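def json_logger(data, filename="nli_check"):
    path = filename + ".json"
    # Start from an empty list if the log file does not exist yet.
    output = []
    if os.path.exists(path):
        with open(path) as file:
            output = json.load(file)
    output.append(data)
    with open(path, "w") as file:
        json.dump(output, file, indent=4)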

## Datasets

In [7]:
wikiqa_ragas = load_dataset("explodinggradients/ragas-wikiqa")

Found cached dataset parquet (/Users/shahules/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--ragas-wikiqa-5b5116e5cb909aca/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


## Correlation

In [8]:
def get_corr(target, prediction):
    # Kendall's tau per example, averaged into a single score.
    return np.mean([kendalltau(x, y).correlation for x, y in zip(target, prediction)])
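
`get_corr` computes Kendall's tau between the target ranks and the predicted scores of each example. To make the metric concrete, here is a minimal pure-Python sketch for a single example. It is an illustration only: `kendall_tau_a` is a hypothetical helper that applies no tie correction, so it can differ from `scipy.stats.kendalltau` when scores are tied, and the prediction scores are made up.

```python
from itertools import combinations


def kendall_tau_a(target, prediction):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(target)), 2):
        # A pair is concordant when both lists order items i and j the same way.
        direction = (target[i] - target[j]) * (prediction[i] - prediction[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(target) * (len(target) - 1) // 2
    return (concordant - discordant) / n_pairs


target_ranks = [2, 1, 0]  # expected ordering of three answers
perfect = kendall_tau_a(target_ranks, [0.9, 0.5, 0.1])  # every pair agrees
one_swap = kendall_tau_a(target_ranks, [0.9, 0.1, 0.5])  # one pair disagrees
reversed_order = kendall_tau_a(target_ranks, [0.1, 0.5, 0.9])  # every pair disagrees
print(perfect, one_swap, reversed_order)
```

With only three answers per example, a single swapped pair already drops tau from 1 to 1/3, so per-example values are coarse.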

## QA-QG paradigm
- Generate question-answer pairs from the `generated answer`.
- Given the `context`, answer these questions.
- Verify that the answers obtained from the `context` match the answers taken from the `generated answer`.

In [9]:
Question_generation = """Given a text, extract {} noun phrases and create questions for each based on given text.
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
A: Germany
Q: Where was Albert Einstein born?
A: theory of relativity
Q: What is Albert Einstein best known for?
text: {}
"""

Question_answering = """Given a text and set of questions, answer the questions
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
questions: Where was Albert Einstein born?\n\nWhat is Albert Einstein best known for?
answers:Germany\n\ntheory of relativity
text: {}
questions:{}
answers:"""

Answer_verification = """Given a set of questions, correct answer and student's answer return the number of questions incorrectly answered by student.
Where was Albert Einstein born?\nCorrect answer: Germany\nStudent answer:India\n\n
What is Albert Einstein best known for?\nCorrect answer:  theory of relativity\nStudent answer: theory of relativity\n\n
Number of incorrect answers:1
{}
Number of incorrect answers:"""

In [10]:
def QAQG_fun(question, context, answer):
    """
    Returns the number of factual inconsistencies between `answer` and `context`.
    `question` is currently unused.
    """

    def answer_ver(qstn, answer, cand):
        return f"{qstn}\nCorrect answer: {answer}\nStudent answer: {cand}"

    # Step 1: request about one (answer, question) pair per sentence of `answer`,
    # and at least one even when `answer` contains no full stop.
    num = max(1, len(answer.split(".")) - 1)
    prompt = Question_generation.format(num, answer)
    output = llm(prompt)
    # The completion is expected to alternate "A: ..." and "Q: ..." lines.
    qa_pairs = [
        re.sub(r"A:|Q:", "", x).strip()
        for item in output["choices"][0]["text"].strip().split("\n\n")
        for x in item.split("\n")
    ]
    qa_pairs = [tuple(qa_pairs[i : i + 2]) for i in range(0, len(qa_pairs), 2)]
    print(qa_pairs)

    # Step 2: answer the generated questions using only `context`.
    questions = "\n\n".join([qstn for ans, qstn in qa_pairs])
    prompt = Question_answering.format(context, questions)
    answers = llm(prompt)["choices"][0]["text"].split("\n\n")

    # Step 3: count the context-based answers that disagree with the generated answer.
    prompt = "\n\n".join(
        [answer_ver(qstn, ans, cand) for (ans, qstn), cand in zip(qa_pairs, answers)]
    )
    output = llm(Answer_verification.format(prompt))["choices"][0]["text"].strip()
    return int(output)
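
The parsing step inside `QAQG_fun` can be checked offline, without calling the API. The sketch below applies the same regex and splitting logic to a hand-written completion. `parse_qa_pairs` is a hypothetical helper name and `sample_completion` is made up for illustration, not real model output.

```python
import re


def parse_qa_pairs(completion):
    """Turn alternating 'A: ...' / 'Q: ...' lines into (answer, question) tuples."""
    lines = [
        re.sub(r"A:|Q:", "", x).strip()
        for item in completion.strip().split("\n\n")
        for x in item.split("\n")
    ]
    return [tuple(lines[i : i + 2]) for i in range(0, len(lines), 2)]


# Hypothetical completion in the format requested by the question-generation prompt.
sample_completion = (
    "A: Sue Lyon\nQ: Who played the role of Lolita in the movie?\n\n"
    "A: 14\nQ: How old was Sue Lyon at the time of filming?"
)
pairs = parse_qa_pairs(sample_completion)
print(pairs)
```

The parser assumes the model strictly alternates answer and question lines. If a line is missing, the last tuple has a single element and the `for ans, qstn in qa_pairs` unpacking in `QAQG_fun` would fail.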

In [11]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming."
question = "What was the age of Sue Lyon when she played Lolita?"
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [13]:
QAQG_fun(question, context, answer)

[('Sue Lyon', 'Who played the role of Lolita in the movie?')]


0

## G-Eval
- Define criteria to evaluate the model output.
- Normalize: `score = prob(s) * s`, where `prob(s)` is the probability the model assigns to its predicted score `s`.
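
As a rough illustration of the normalization, the sketch below weights a predicted score by the probability of the completion, computed from token log-probabilities. The function names and all numbers are assumptions made for this example. `expected_score` shows a probability-weighted average over several candidate scores, which is a possible extension and not something this notebook computes.

```python
import math


def weighted_score(score, token_logprobs):
    """Score weighted by the probability of the generated completion."""
    prob = math.exp(sum(token_logprobs))
    return prob * score


def expected_score(score_probs):
    """Probability-weighted average over every candidate score."""
    return sum(score * prob for score, prob in score_probs.items())


# Made-up values: the model outputs "5" with probability 0.5.
single = weighted_score(5, [math.log(0.5)])
# Made-up distribution over two candidate scores.
expected = expected_score({4: 0.25, 5: 0.75})
print(single, expected)
```

Note that weighting a single sampled score pulls confident and unconfident predictions toward zero, so the result is no longer on the original 1 to 5 scale.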

In [14]:
relevence = """
Evaluation Criteria.\n
Relevance (1-5) - how relevant is the reply to the given question.
1. Read the reply and compare it to the question. Check if the given reply
actually answers the question, and if it presents them in a clear and logical order.
2. The reply should include only required information to answer the question.
3. Penalize replies that contain redundancies and excess information.
4. Assign a score for Relevance on a scale of 1 to 5, where 1 is the lowest and
5 is the highest based on the Evaluation Criteria.

question:{}
reply:{}
score:"""

In [15]:
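def g_eval(question, context, answer):
    # `context` is not used by the relevance prompt.
    prompt = relevence.format(question, answer)
    output = llm(prompt)["choices"][0]
    # Probability of the completion = exp(sum of its token log-probabilities).
    prob = np.exp(sum(output["logprobs"]["token_logprobs"]))
    score = int(output["text"].strip())
    print(score)
    # Weight the predicted score by the model's confidence in it.
    return prob * score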

In [16]:
question = "Which year did Lolita release?"
answer = "Lolita film released in 1947."

In [17]:
g_eval(question, context, answer)

5


3.5533440372846865

## Relevancy Score 
- Scores how relevant each of the `answers` is to the `prompt`.


### QGen scoring method

In [13]:
from ragas.metrics.answer_relevance import QGen

In [14]:
t5_qgen = QGen("t5-base", "cpu")

In [15]:
def predict_relevance(examples):
    scores = {}
    questions = examples["question"]
    for col in COLUMNS:
        passage = examples[col]
        inputs = list(zip(questions, passage))
        scores[f"{col}_relevance"] = t5_qgen.predict(inputs, show_progress=False)
    return scores

- We assume `generated_with_rag > correct_answer > incorrect_answer` for relevancy.

In [16]:
COLUMNS = ["generated_with_rag", "correct_answer", "incorrect_answer"]

In [130]:
output = (
    wikiqa_ragas["train"]
    .select(range(0, 10))
    .map(predict_relevance, batched=True, batch_size=4)
)

Loading cached processed dataset at /Users/shahules/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--ragas-wikiqa-5b5116e5cb909aca/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-6b548dfc2a4d5a4f.arrow


In [131]:
predictions = [[item[f"{k}_relevance"] for k in COLUMNS] for item in output]
target = [[2, 1, 0] for i in range(len(output))]
get_corr(target, predictions)

0.6518583971423787

### Cross encoder method

In [17]:
from ragas.metrics.context_relevance import context_relavancy

In [24]:
def predict_relevance(examples):
    scores = {}
    questions = examples["question"]
    for col in COLUMNS:
        passage = examples[col]
        inputs = list(zip(questions, passage))
        scores[f"{col}_relevance"] = cross_encoder.predict(inputs, show_progress=False)
    return scores

In [None]:
output = (
    wikiqa_ragas["train"]
    .select(range(0, 10))
    .map(predict_relevance, batched=True, batch_size=4)
)

In [None]:
predictions = [[item[f"{k}_relevance"] for k in COLUMNS] for item in output]
target = [[2, 1, 0] for i in range(len(output))]
get_corr(target, predictions)

## Factuality on HotpotQA


In [134]:
import experimental

In [135]:
from importlib import reload

reload(experimental)

<module 'experimental' (namespace)>

In [136]:
from experimental.nli import NLI

In [137]:
hotpot_qa = load_dataset(
    "hotpot_qa",
    "distractor",
    split="validation",
).select(range(0, 20))

Found cached dataset hotpot_qa (/Users/shahules/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5)


In [138]:
false_answer_prompt = """Given a question and correct answer, generate an incorrect answer
question: Were Scott Derrickson and Ed Wood of the same nationality?
correct answer: yes
answer: no
question: {}
correct answer: {}
answer:"""


def generate_false_answers(question, answer):
    answer = llm(false_answer_prompt.format(question, answer))["choices"][0][
        "text"
    ].strip()
    return {"false_answer": answer}

In [139]:
hotpot_qa = hotpot_qa.map(lambda x: generate_false_answers(x["question"], x["answer"]))

Loading cached processed dataset at /Users/shahules/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5/cache-593e03a966a13563.arrow


In [140]:
def get_context(item):
    titles, ids = item["supporting_facts"].values()
    title_ids = [item["context"]["title"].index(i) for i in titles]
    sentences = [
        item["context"]["sentences"][i][k]
        for i, k in zip(title_ids, item["supporting_facts"]["sent_id"])
    ]
    orig_context = " ".join(sentences)
    return {"answer_context": orig_context}
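
To see what `get_context` extracts, the sketch below runs the same lookup on a toy record. The record only mimics the nested `supporting_facts` / `context` layout that `get_context` indexes into. Its sentences are placeholders, and `supporting_sentences` is a hypothetical stand-in that reads the keys by name instead of relying on dictionary order.

```python
def supporting_sentences(item):
    """Join the sentences referenced by `supporting_facts` into one string."""
    facts = item["supporting_facts"]
    context = item["context"]
    picked = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        # Find the paragraph with this title, then pick the referenced sentence.
        paragraph_idx = context["title"].index(title)
        picked.append(context["sentences"][paragraph_idx][sent_id])
    return " ".join(picked)


# Toy record with placeholder sentences (not real dataset content).
toy_item = {
    "supporting_facts": {"title": ["Ed Wood", "Scott Derrickson"], "sent_id": [0, 1]},
    "context": {
        "title": ["Scott Derrickson", "Ed Wood"],
        "sentences": [
            ["Sentence D0.", "Sentence D1."],
            ["Sentence W0.", "Sentence W1."],
        ],
    },
}
result = supporting_sentences(toy_item)
print(result)
```

Only the supporting sentences are kept, so the distractor paragraphs never reach the factuality scorer.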

In [141]:
hotpot_qa = hotpot_qa.map(lambda x: get_context(x), batched=False)

Loading cached processed dataset at /Users/shahules/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5/cache-7badd24e430a747f.arrow


In [142]:
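def predict_factuality(examples, max_retries=5):
    scores = {}
    questions = examples["question"]
    contexts = examples["answer_context"]
    for col in COLUMNS:
        answers = examples[col]
        # Retry if scoring raises, but give up after `max_retries` attempts
        # instead of looping forever.
        for _ in range(max_retries):
            try:
                scores[f"{col}_factual"] = NLI.score(questions, contexts, answers)
                break
            except Exception as e:
                print(e)
        else:
            raise RuntimeError(f"NLI.score failed {max_retries} times for column {col}")
    return scores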

In [143]:
COLUMNS = ["answer", "false_answer"]
hotpot_qa = hotpot_qa.map(predict_factuality, batched=True, batch_size=8)

Loading cached processed dataset at /Users/shahules/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5/cache-d51f81546b2858f1.arrow


In [164]:
predictions = [[item[f"{k}_factual"] for k in COLUMNS] for item in hotpot_qa]
target = [[1, 0] for i in range(len(hotpot_qa))]
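# An example counts as incorrect unless the true answer (index 0) scores strictly
# higher than the false answer (index 1), i.e. unless argsort == [1, 0].
incorrect = [
    idx for idx, item in enumerate(predictions) if all(np.argsort(item) != [1.0, 0.0])
]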
print("Accuracy", 1 - (len(incorrect) / len(target)))

Accuracy 0.75
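
The accuracy above counts an example as correct when the true answer receives a strictly higher factuality score than the generated false answer. A minimal sketch of the same idea with plain comparisons follows. The scores are made up and `pairwise_accuracy` is a hypothetical helper, not part of the notebook.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of (answer, false_answer) score pairs ranked correctly."""
    correct = sum(
        1 for true_score, false_score in score_pairs if true_score > false_score
    )
    return correct / len(score_pairs)


# Made-up factuality scores: (score of true answer, score of false answer).
toy_scores = [(0.9, 0.1), (0.7, 0.8), (0.6, 0.2), (0.5, 0.5)]
accuracy = pairwise_accuracy(toy_scores)
print("Accuracy", accuracy)
```

Ties count as errors here, which matches the strict comparison used in the cell above.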
