
# Self-RAG Evaluation
## Advanced Topics in Natural Language Processing, BGU
## Final Project

# Loading the self-rag model:

In [None]:
!git clone https://github.com/AkariAsai/self-rag.git

In [None]:
# Modify requirements.txt to not use fixed versions (except for datasets).
# Code doesn't work without this modification:
import re

with open('self-rag/requirements.txt', 'r') as req_txt:
  req_txt_lines = req_txt.readlines()

req_txt_new = ''
for line in req_txt_lines:
  if 'datasets' in line:
    req_txt_new = req_txt_new + line
  else:
    line_no_version = re.sub(r'(=|>)=[\d.]+', '', line)
    req_txt_new = req_txt_new + line_no_version

with open('self-rag/requirements.txt', 'w') as req_txt:
  req_txt.write(req_txt_new)

# Install ninja for faster installation of flash-attn
# https://github.com/Dao-AILab/flash-attention
!pip install packaging && pip install ninja
!pip show ninja

!pip install -r self-rag/requirements.txt
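To check what the version-stripping regex in the cell above does, it can be run on a few sample lines. This is a minimal sketch: the package lines below are made-up examples, not the actual contents of `self-rag/requirements.txt`, and `strip_version` is a hypothetical helper name. Note that the pattern only covers `==` and `>=` pins, so other specifiers such as `<=` are left untouched.

```python
import re

# Same pattern as in the cell above: drops "==<version>" and ">=<version>".
VERSION_PIN = re.compile(r'(=|>)=[\d.]+')

def strip_version(line):
    """Return the requirement line without its == / >= version pin."""
    return VERSION_PIN.sub('', line)

# Hypothetical requirement lines, for illustration only.
samples = ['torch==2.1.2\n', 'vllm>=0.2.6\n', 'ninja\n', 'foo<=1.0\n']
stripped = [strip_version(s) for s in samples]
print(stripped)
```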

In [None]:
# The cell above prompts for a session restart; run this cell after restarting:
from vllm import LLM, SamplingParams

model = LLM("selfrag/selfrag_llama2_13b", download_dir="/model_cache", dtype="half")
sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=1024, skip_special_tokens=False)  # original max_tokens=100

def format_prompt(input, paragraph=None):
  prompt = f'### Instruction:\n{input}\n\n### Response:\n'
  if paragraph is not None:
    prompt += f'[Retrieval]<paragraph>{paragraph}</paragraph>'
  return prompt
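To make the prompt layout concrete, the sketch below prints the two variants that `format_prompt` produces. The function is copied from the cell above so the snippet runs on its own; the question and passage strings are placeholder examples.

```python
def format_prompt(input, paragraph=None):
    # Copied from the cell above so this snippet is self-contained.
    prompt = f'### Instruction:\n{input}\n\n### Response:\n'
    if paragraph is not None:
        prompt += f'[Retrieval]<paragraph>{paragraph}</paragraph>'
    return prompt

# Placeholder question and passage, for illustration only.
no_passage = format_prompt('What is an alpaca?')
with_passage = format_prompt('What is an alpaca?', 'Alpacas are camelids.')

print(repr(no_passage))
print(repr(with_passage))
```

With a passage, the prompt already ends in a `[Retrieval]` token followed by the paragraph, so the model continues from the relevance tag onward.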

## Self-rag playground:

In [None]:
def print_predictions(preds):
  for i, pred in enumerate(preds):
    print(f'\n\nPrediction #{i+1}:')
    print(f'\ttext: [{pred.outputs[0].text}]')
    # print(f'\tfinish reason: {pred.outputs[0].finish_reason}')

In [None]:
# Without passage:
query_1 = "Leave odd one out: twitter, instagram, whatsapp."
query_2 = "Can you tell me the difference between llamas and alpacas?"
queries = [query_1, query_2]
prompts = [format_prompt(query) for query in queries]

preds = model.generate(prompts, sampling_params)
print_predictions(preds)

Processed prompts: 100%|██████████| 2/2 [00:02<00:00,  1.04s/it, est. speed input: 26.93 toks/s, output: 102.91 toks/s]



Prediction #1:
	text: [Twitter, Instagram, and WhatsApp are all social media platforms.[No Retrieval]However, WhatsApp is a messaging app, while Twitter and Instagram are both primarily used for sharing photos and videos.[No Retrieval]Therefore, WhatsApp is the odd one out in this group.[Utility:5]]


Prediction #2:
	text: [Sure![Retrieval]<paragraph>

* Alpaca (left) and llama (right) in the Andes of southern Peru.

Alpacas and llamas are both domesticated species of South American camelids.[Continue to Use Evidence]Alpacas are a much smaller than llamas, with a shoulder height of 3 to 4 feet.[Continue to Use Evidence]They are also bred specifically for their fiber, which is used to make all sorts of textiles and clothing.[Continue to Use Evidence]Llamas, on the other hand, are larger and more social, and are often kept as pets or as guard animals.[Continue to Use Evidence]They are also used as pack animals, and can carry up to 20% of their body weight.[Utility:5]]





In [None]:
# With passage:
query = "Can you tell me the difference between llamas and alpacas?"
passage = "The alpaca (Lama pacos) is a species of South American camelid mammal. It is similar to, and often confused with, the llama. Alpacas are considerably smaller than llamas, and unlike llamas, they were not bred to be working animals, but were bred specifically for their fiber."
prompts = [format_prompt(query, passage)]

preds = model.generate(prompts, sampling_params)
print_predictions(preds)
# output from github: ['[Relevant]Alpacas are considerably smaller than llamas, and unlike llamas, they were not bred to be working animals, but were bred specifically for their fiber.[Fully supported][Utility:5]</s>']

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  4.47it/s, est. speed input: 472.47 toks/s, output: 71.99 toks/s]



Prediction #1:
	text: [[Relevant]Alpacas are considerably smaller than llamas.[Fully supported][Utility:5]]





In [None]:
# With bad but "relevant" passages:
query = "Can you tell me the difference between llamas and alpacas?"

passage1 = "alpaca is stupid and llama is smart"
passage2 = "bla bla bla bla bla bla bla alpaca bla bla bla bla bla bla llama bla bla bla"
passage3 = "The quick brown alpaca jumps over the lazy llama"
passage4 = "The quick brown llama jumps over the lazy alpaca"

passages = [passage1, passage2, passage3, passage4]
prompts = [format_prompt(query, p) for p in passages]

preds = model.generate(prompts, sampling_params)
print_predictions(preds)

Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 10.25it/s, est. speed input: 465.77 toks/s, output: 195.56 toks/s]



Prediction #1:
	text: [[Relevant]Alpacas are a much smaller than llamas.[No support / Contradictory][Utility:5]]


Prediction #2:
	text: [[Relevant]Alpacas are a much smaller than llamas.[Partially supported][Utility:5]]


Prediction #3:
	text: [[Relevant]Alpacas are a much smaller than llamas.[No support / Contradictory][Utility:5]]


Prediction #4:
	text: [[Relevant]Llamas and alpacas are both members of the Camelidae family, but they have some differences.[No support / Contradictory][Utility:5]]





In [None]:
# With bad and irrelevant passages:
query = "Can you tell me the difference between llamas and alpacas?"

passage1 = "Aliens built the pyramids and then went to their home planet"
passage2 = "Joe Rogan sure loves his DMT"
passage3 = "The quick brown fox jumps over the lazy dog"
passage4 = "bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla"
# passage5 = "The truth of a thing is in the feel of it, not in the think of it (Stanley Kubrick)" # this one took the model down a spiral...

passages = [passage1, passage2, passage3, passage4]
prompts = [format_prompt(query, p) for p in passages]

preds = model.generate(prompts, sampling_params)
print_predictions(preds)

Processed prompts: 100%|██████████| 4/4 [00:03<00:00,  1.26it/s, est. speed input: 55.42 toks/s, output: 92.89 toks/s]



Prediction #1:
	text: [[Irrelevant]I don't know, but I know that they are both from South America and are related to each other.[Utility:4]]


Prediction #2:
	text: [[Irrelevant]Sure![Continue to Use Evidence]Alpacas are a much smaller than llamas.[Utility:5]]


Prediction #3:
	text: [[Relevant]Sure, I'd be happy to help you with that![No support / Contradictory][No Retrieval]Alpacas and llamas are both members of the Camelidae family, but they are different species.[Continue to Use Evidence]Alpacas are a much smaller than llamas, standing only about 3 feet tall at the shoulder.[Continue to Use Evidence]They are also bred specifically for their fiber, which is used to make all sorts of textiles and clothing.[Continue to Use Evidence]Alpacas are also typically more docile than llamas, and are often kept as pets.[Continue to Use Evidence]Llamas, on the other hand, are larger and more social animals.[Continue to Use Evidence]They are often used as pack animals, and can be trained to her




# Downloading and processing the evaluation result CSV from Attributed-QA:

In [None]:
# Constants:
ATTR_QA_PATH = 'Attributed-QA'
RATINGS_FILENAME = 'ratings'
SELF_RAG_PATH = 'self-rag'

In [None]:
from zipfile import ZipFile
import pandas as pd

!git clone https://github.com/google-research-datasets/Attributed-QA.git

with ZipFile(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}.zip') as ratings_zip:
  ratings_zip.extract(f'{RATINGS_FILENAME}.csv', ATTR_QA_PATH)

ratings_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}.csv')

# Filter rows without human rating:
ratings_df = ratings_df[ratings_df['human_rating'] != '-']

# Filter unneeded columns:
columns_all = ratings_df.columns.values
columns_needed = ['question', 'passage', 'human_rating', 'attribution'] # 'answer' !!
columns_unneeded = set(columns_all) - set(columns_needed)
ratings_df = ratings_df.drop(columns=columns_unneeded)

# Filter duplicate rows:
ratings_df = ratings_df.drop_duplicates()
# ratings_df # 7121 rows

# Filter rows with different 'human_rating' values while the rest is the same:
columns_no_hr = [c for c in columns_needed if c != 'human_rating']
ratings_df = ratings_df.drop_duplicates(subset=columns_no_hr, keep=False)

ratings_df.to_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_filtered.csv', index=False)
ratings_df # 5785 rows

fatal: destination path 'Attributed-QA' already exists and is not an empty directory.


| | question | passage | human_rating | attribution |
| --- | --- | --- | --- | --- |
| 0 | who played hyde in league of extraordinary gen... | Title: Jason Flemyng\nSection: Television and ... | Y | http://en.wikipedia.org/wiki/Jason_Flemyng#Jas... |
| 7 | who signed the largest on the declaration of i... | Title: United States Declaration of Independen... | Y | http://en.wikipedia.org/wiki/United_States_Dec... |
| 16 | when was the last time the carolina hurricanes... | Title: 2009 Stanley Cup playoffs\nSection: Con... | Y | http://en.wikipedia.org/wiki/2009_Stanley_Cup_... |
| 21 | where was 2017 beauty and the beast filmed | Title: Beauty and the Beast (2017 film)\n\nA l... | Y | http://en.wikipedia.org/wiki/Beauty_and_the_Be... |
| 23 | when does the next warrior book come out | Title: Adonal Foyle\n\nAdonal David Foyle (bor... | N | http://en.wikipedia.org/wiki/Adonal_Foyle#Adon... |
| ... | ... | ... | ... | ... |
| 81842 | the road that connects the tombs is called | Title: Valley of the Kings (Tibet)\n\nThe Vall... | N | http://en.wikipedia.org/wiki/Valley_of_the_Kin... |
| 81843 | when was the miraculous journey of edward tula... | Title: The Miraculous Journey of Edward Tulane... | N | http://en.wikipedia.org/wiki/The_Miraculous_Jo... |
| 81844 | where was the remake of dirty dancing filmed | Title: Asheville, North Carolina\nSection: In ... | N | http://en.wikipedia.org/wiki/Asheville,_North_... |
| 81845 | how long is a whale shark in meters | Title: List of largest fish\nSection: Cartilag... | N | http://en.wikipedia.org/wiki/List_of_largest_f... |
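The two `drop_duplicates` calls in the cell above do different jobs, which is easier to see on a toy frame. The sketch below uses made-up rows: the first call collapses exact duplicates, and the second (with `keep=False`) removes every question/passage pair that still appears more than once, i.e. pairs whose human ratings conflict.

```python
import pandas as pd

# Toy data (hypothetical rows): q1/p1 was rated both Y and N.
toy_df = pd.DataFrame({
    'question':     ['q1', 'q1', 'q1', 'q2'],
    'passage':      ['p1', 'p1', 'p1', 'p2'],
    'human_rating': ['Y',  'Y',  'N',  'N'],
    'attribution':  ['a1', 'a1', 'a1', 'a2'],
})

# Step 1: collapse exact duplicates (the two identical q1/p1/Y rows).
deduped = toy_df.drop_duplicates()

# Step 2: drop ALL rows whose key columns repeat, i.e. conflicting ratings.
key_cols = ['question', 'passage', 'attribution']
consistent = deduped.drop_duplicates(subset=key_cols, keep=False)

print(len(deduped))
print(consistent)
```

Only the `q2` row survives here, because `keep=False` discards both the `Y` and the `N` version of the conflicting pair rather than keeping one of them.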


## Attributed-QA CSV playground:

In [None]:
# ratings_df[ratings_df['question'].str.contains('who played hyde in league of extraordinary gen')] # .iloc[0]

# Let's find the largest passage:
passage_len_max = (ratings_df['passage'].str.len()).max()
passage_max_len_df = ratings_df[ratings_df['passage'].str.len() == passage_len_max]
passage_max = passage_max_len_df.iloc[0].passage
question_passage_max = passage_max_len_df.iloc[0].question

print(f'passage_len_max: {passage_len_max}')
print(f'question_passage_max: [{question_passage_max}]')
print(f'passage_max: [{passage_max}]\n')

# Trying self-rag model on the largest passage:
prompts = [format_prompt(question_passage_max, passage_max)]
preds = model.generate(prompts, sampling_params)
print_predictions(preds)

passage_max_len_df

passage_len_max: 8400
question_passage_max: [what is the second book in the alchemyst series]
passage_max: [Title: Indigo Muldoon
Section: List of books

1. "The Awakening:" Indigo and her life are introduced to the reader through immersion into the story. Indigo's talents are growing stronger, and visions are taking over her everyday life. Problems in her family are presented: conflicts with her parents, and her paternal grandfather is diagnosed with lung cancer. His death serves as a bridge which leads Indigo to the Spirit World and ultimately, to fulfill her destiny as a medium. 2. "Spellbound:" Indigo begins her work as a medium. Spirits, good and bad, are popping up everywhere, and Indigo is becoming more powerful. But can she make the sacrifices necessary to become a medium, and will anyone ever understand why she's so different? 3. "Taking to the Sky:" Indigo is endowed with the wings of a crow, and she begins to learn to fight and slay demons. An attractive new boy appears in h

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.10it/s, est. speed input: 2415.95 toks/s, output: 7.68 toks/s]



Prediction #1:
	text: [[Irrelevant]The Alchemyst[Utility:5]]





| | question | passage | human_rating | attribution |
| --- | --- | --- | --- | --- |
| 72512 | what is the second book in the alchemyst series | Title: Indigo Muldoon\nSection: List of books\... | N | http://en.wikipedia.org/wiki/Indigo_Muldoon#In... |


# Checking 'Relevant' Tags

In [None]:
# Run "Load the self-rag model" and "Download and process..." first

import pandas as pd

ratings_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_filtered.csv')

self_rag_rating = []
preds_ = []
for i, row in ratings_df.iterrows():
  prompt = format_prompt(row['question'], row['passage'])
  pred = model.generate(prompt, sampling_params)
  preds_.append(pred[0].outputs[0].text)
  if '[Relevant]' in pred[0].outputs[0].text:
    self_rag_rating.append('Y')
  elif '[Irrelevant]' in pred[0].outputs[0].text:
    self_rag_rating.append('N')
  else:
    self_rag_rating.append('-')

ratings_df['self_rag_rating'] = self_rag_rating
ratings_df['preds'] = preds_
ratings_df.to_csv(f'{SELF_RAG_PATH}/relevant_rating.csv', index=False)

In [None]:
# Statistics:
rel_df = pd.read_csv(f'{SELF_RAG_PATH}/relevant_rating.csv')
total = len(rel_df)
yes = len(rel_df[rel_df['self_rag_rating'] == 'Y'])
no = len(rel_df[rel_df['self_rag_rating'] == 'N'])
none = len(rel_df[rel_df['self_rag_rating'] == '-'])
print(f'Total: {total}, #Yes: {yes}, #No: {no}, #-: {none}')

same = len(rel_df[rel_df['self_rag_rating'] == rel_df['human_rating']])
print(f'Same: {same}, Different: {total - same}, %Same: {round(same / total * 100, 2)}')

Total: 5785, #Yes: 5122, #No: 498, #-: 165
Same: 2567, Different: 3218, %Same: 44.37
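The single `%Same` figure does not show which way the disagreements go; with 5122 of the 5785 predictions tagged `[Relevant]`, a cross-tabulation of human and model labels makes the direction visible. The sketch below is only an illustration on made-up toy ratings, not a rerun of the experiment.

```python
import pandas as pd

# Toy ratings (hypothetical values), using the same Y / N / - labels as above.
toy_df = pd.DataFrame({
    'human_rating':    ['Y', 'Y', 'N', 'N', 'N', 'Y'],
    'self_rag_rating': ['Y', 'N', 'Y', 'Y', '-', 'Y'],
})

# Rows: human label, columns: model label.
confusion = pd.crosstab(toy_df['human_rating'], toy_df['self_rag_rating'])
print(confusion)

# Same agreement measure as the statistics cell, as a fraction.
agreement = (toy_df['human_rating'] == toy_df['self_rag_rating']).mean()
print(f'%Same: {round(agreement * 100, 2)}')
```

Applied to `rel_df`, the same `pd.crosstab` call would separate "human N, model Y" from "human Y, model N" disagreements.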


In [None]:
import csv

def load_csv(file_path):
    """Read a CSV file into a list of rows (the header is the first row)."""
    with open(file_path, mode='r', newline='', encoding='utf-8') as file:
        reader = csv.reader(file)
        data = [row for row in reader]
    return data

In [None]:
# Check whether some preds contain both Relevant AND Irrelevant tags:
import csv

file_path = f'{SELF_RAG_PATH}/relevant_rating.csv'
data = load_csv(file_path)
i = 0
for row in data:
    text = row[5]
    is_relevant = '[Relevant]' in text
    is_irrelevant = '[Irrelevant]' in text
    if is_relevant and is_irrelevant:
        print(text)
        i += 1

print(i) # 0

0


# Checking 'Retrieval' Tags

In [None]:
# Run "Load the self-rag model" and "Download and process..." first

import pandas as pd

ratings_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_filtered.csv')

self_rag_ratings = []
self_rag_preds = []
for i, row in ratings_df.iterrows():
  prompt = format_prompt(row['question'])
  pred = model.generate(prompt, sampling_params)
  text = pred[0].outputs[0].text
  self_rag_preds.append(text)
  if '[Retrieval]' in text:
    self_rag_ratings.append('Y')
  elif '[No Retrieval]' in text:
    self_rag_ratings.append('N')
  else:
    self_rag_ratings.append('-')

ratings_df['self_rag_rating'] = self_rag_ratings
ratings_df['self_rag_pred'] = self_rag_preds
ratings_df.to_csv(f'{SELF_RAG_PATH}/retrieval_rating.csv', index=False)

In [None]:
# Statistics:
ret_df = pd.read_csv(f'{SELF_RAG_PATH}/retrieval_rating.csv')
total = len(ret_df)
yes = len(ret_df[ret_df['self_rag_rating'] == 'Y'])
no = len(ret_df[ret_df['self_rag_rating'] == 'N'])
none = len(ret_df[ret_df['self_rag_rating'] == '-'])
print(f'Total: {total}, #Yes: {yes}, #No: {no}, #-: {none}')
print(f' %Yes: {round(yes / total * 100, 2)}')

Total: 5785, #Yes: 341, #No: 36, #-: 5408
 %Yes: 5.89


In [None]:
# Run 'load_csv' from above first.
# Check whether some preds contain Retrieval AND No-Retrieval tags:
import csv

file_path_ret = f'{SELF_RAG_PATH}/retrieval_rating.csv'
data_ret = load_csv(file_path_ret)
i = 0
for row in data_ret:
    text = row[5]
    is_ret = '[Retrieval]' in text
    is_no_ret = '[No Retrieval]' in text
    if is_ret and is_no_ret:
        print(f'\n\n{text}')
        i += 1

print(i) # 14



Yes, it is possible for a bowler to take a hat-trick in both innings of a Test match.[Retrieval]<paragraph>A hat-trick is when a bowler takes three wickets in three consecutive balls in an innings.[Retrieval]<paragraph>If a bowler takes three wickets in three consecutive balls in both innings, it would be considered a double hat-trick.[No Retrieval]However, this is a rare feat and has only been achieved by a few bowlers in the history of Test cricket.[Utility:5]


1.[Retrieval]<paragraph>The Cash Cab guy reads the questions by using a computer system that generates the questions and displays them on a screen.
2.[Retrieval]<paragraph>The Cash Cab guy reads the questions by using a computer system that generates the questions and displays them on a screen.
3.[Retrieval]<paragraph>The Cash Cab guy reads the questions by using a computer system that generates the questions and displays them on a screen.
4.[Retrieval]<paragraph>The Cash Cab guy reads the questions by using a computer syst
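The substring checks in the rating loop label a prediction `Y` whenever `[Retrieval]` appears anywhere in the text, so the 14 mixed predictions found above all count as `Y`. One alternative (not what this notebook does) is to classify by the first retrieval token only. The sketch below assumes that choice; `first_retrieval_tag` is a hypothetical helper name and the sample strings are shortened from outputs shown earlier.

```python
import re

# Matches either retrieval token. "[No Retrieval]" never contains
# "[Retrieval]" as a substring, so the alternatives cannot be confused.
RETRIEVAL_TAG = re.compile(r'\[(No Retrieval|Retrieval)\]')

def first_retrieval_tag(text):
    """Return 'Y', 'N' or '-' based on the first retrieval token in text."""
    match = RETRIEVAL_TAG.search(text)
    if match is None:
        return '-'
    return 'Y' if match.group(1) == 'Retrieval' else 'N'

samples = [
    'Sure![Retrieval]<paragraph>',
    'Twitter is a platform.[No Retrieval]However, WhatsApp is a messaging app.',
    'A plain answer with no tag.',
    'First sentence.[No Retrieval]Second.[Retrieval]<paragraph>',
]
labels = [first_retrieval_tag(s) for s in samples]
print(labels)
```

The last sample shows the difference: the substring check would rate it `Y`, while the first-token rule rates it `N`.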

# Trying Gemma-2 to imitate the behavior of 'Relevant' Tags

## Trying in Colab:

In [None]:
!pip install -U transformers
!pip install accelerate
!pip install bitsandbytes

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

from google.colab import userdata
access_token = userdata.get('huggingface')

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b", token=access_token)

# No quantization and no flash attention (simplest):
# On GPU: "Your session crashed after using all available RAM"
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b", token=access_token)

# No quantization and no flash attention:
# Works on TPU, but VERY slow. about 15 seconds for 10 tokens.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    token=access_token
)

In [None]:
# Using 8-bit quantization:
# GPU: Model outputs gibberish;  TPU: Doesn't work.

# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2-27b", quantization_config=quantization_config, token=access_token)

In [None]:
# Using 8-bit quantization + flash attention:
# Still gibberish.

# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# !pip install packaging && pip install ninja
# !pip show ninja
# !pip install flash-attn

# model = AutoModelForCausalLM.from_pretrained(
#     "google/gemma-2-27b",
#     quantization_config=quantization_config,
#     attn_implementation="flash_attention_2",
#     token=access_token
# )

In [None]:
# No quantization + flash attention:
# GPU: No gibberish, but slow as hell;  TPU: Doesn't work.

# !pip install packaging && pip install ninja
# !pip show ninja
# !pip install flash-attn

# model = AutoModelForCausalLM.from_pretrained(
#     "google/gemma-2-27b",
#     device_map="auto",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     token=access_token
# )# .to(0)

In [None]:
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")# .to("cuda")

from datetime import datetime
start_time = datetime.now()

outputs = model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))

print(datetime.now() - start_time)

<bos>Write me a poem about Machine Learning.

Machine learning, a field of AI,

0:00:15.049458
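The timing above (10 new tokens in about 15.05 seconds) corresponds to well under one token per second. A small helper makes that throughput explicit for any generation call. This is a minimal sketch: `tokens_per_second` is a hypothetical helper, and `fake_generate` is a stand-in that sleeps instead of calling `model.generate`.

```python
import time

def tokens_per_second(generate_fn, n_new_tokens):
    """Time one call to generate_fn; return (seconds, tokens per second)."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return elapsed, n_new_tokens / elapsed

# Stand-in for model.generate(...); sleeps instead of running a model.
def fake_generate():
    time.sleep(0.05)

elapsed, rate = tokens_per_second(fake_generate, n_new_tokens=10)
print(f'{elapsed:.3f} s, {rate:.1f} tokens/s')

# Throughput of the run above: 10 new tokens in about 15.05 s.
print(round(10 / 15.05, 2))
```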


## Trying on RunPod machine:

In [None]:
# See 'gemma2_is_relevant_passage.ipynb' in github.

# Calculating final statistics:
ratings_gemma_fs_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_gemma_merged_few-shot.csv')

total = len(ratings_gemma_fs_df)

human_yes_df = ratings_gemma_fs_df[ratings_gemma_fs_df['human_rating'] == 'Y']
human_yes_model_yes_df = human_yes_df[human_yes_df['gemma2_rating'] == 'Y']
human_yes_model_no_df = human_yes_df[human_yes_df['gemma2_rating'] == 'N']

human_no_df = ratings_gemma_fs_df[ratings_gemma_fs_df['human_rating'] == 'N']
human_no_model_yes_df = human_no_df[human_no_df['gemma2_rating'] == 'Y']
human_no_model_no_df = human_no_df[human_no_df['gemma2_rating'] == 'N']

print(f'human_yes_model_yes: {len(human_yes_model_yes_df)}, %: {round(len(human_yes_model_yes_df) / total * 100, 2)}')
print(f'human_yes_model_no: {len(human_yes_model_no_df)}, %: {round(len(human_yes_model_no_df) / total * 100, 2)}')
print(f'human_no_model_yes: {len(human_no_model_yes_df)}, %: {round(len(human_no_model_yes_df) / total * 100, 2)}')
print(f'human_no_model_no: {len(human_no_model_no_df)}, %: {round(len(human_no_model_no_df) / total * 100, 2)}')

human_yes_model_yes: 467, %: 16.87
human_yes_model_no: 676, %: 24.41
human_no_model_yes: 279, %: 10.08
human_no_model_no: 1342, %: 48.47
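The four counts above form a confusion matrix, so accuracy, precision and recall follow directly. The sketch below treats `Y` as the positive class, which is an assumption made here for illustration. It uses only the four printed counts; their printed percentages sum to slightly under 100, which suggests a few rows carry neither `Y` nor `N`, and those rows are ignored.

```python
# Counts copied from the output above.
tp = 467   # human Y, model Y
fn = 676   # human Y, model N
fp = 279   # human N, model Y
tn = 1342  # human N, model N

# Rows with a Y/N label from both the human and the model.
n_rated = tp + fn + fp + tn

accuracy = (tp + tn) / n_rated
precision = tp / (tp + fp)   # of passages the model calls relevant
recall = tp / (tp + fn)      # of passages humans call relevant
f1 = 2 * precision * recall / (precision + recall)

print(f'rated rows: {n_rated}')
print(f'accuracy: {accuracy:.4f}, precision: {precision:.4f}, '
      f'recall: {recall:.4f}, F1: {f1:.4f}')
```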


# Trying Claude

In [None]:
!pip install anthropic

import anthropic
from google.colab import userdata

claude_api_key = userdata.get('claude')
claude_client = anthropic.Anthropic(api_key=claude_api_key)

In [None]:
# Constants:
TAG_YES = '<Answer>Yes</Answer>'
TAG_NO = '<Answer>No</Answer>'

In [None]:
def claude_res_get_text(response):
  return response.content[0].text

def claude_get_response(client, prompt_raw, sys_prompt='', max_tokens=1000, temperature=0):
  prompt = claude_create_user_prompt(prompt_raw)
  return _claude_get_response(client, prompt, sys_prompt, max_tokens, temperature)

def _claude_get_response(client, prompt, sys_prompt='', max_tokens=1000, temperature=0):
  return client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=max_tokens,
    temperature=temperature,
    system=sys_prompt,
    messages=prompt
  )

def claude_create_user_prompt(text):
  return [{"role": "user", "content": [{"type": "text", "text": text}]}]

def create_prompt_is_relevant_passage(question, passage):
  return f"""Decide if the following passage <Passage> is relevant for answering the following question <Question>.
Explain your reasoning process, then write the final answer "Yes" or "No" between the tags <Answer> and </Answer>.
Here is the question <Question>: <Question>{question}</Question>
Here is the passage <Passage>: <Passage>{passage}</Passage>\n"""

def create_prompt_is_retrieve(question):
  return f"""Decide if you would use Wikipedia to answer the following question <Question>.
Write the final answer "Yes" or "No" between the tags <Answer> and </Answer>.
Here is the question <Question>: <Question>{question}</Question>"""

is_retrieve_sys_prompt = "Pretend that you have access to all Wikipedia, so you can use Wikipedia to answer trivia questions"

## Imitating the behavior of 'Relevant' Tags

In [None]:
# Run "Download and process..." first

from tqdm.notebook import tqdm

ratings_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_filtered.csv')

# Finish what's left, as previous session crashed due to token limit:
ratings_df = ratings_df.tail(821)
# -------------------------------------------------------------------

total = len(ratings_df)
claude_preds = []
claude_ratings = []

for i, row in tqdm(ratings_df.iterrows(), desc="Querying Claude...", total=total):
  prompt_raw = create_prompt_is_relevant_passage(row.question, row.passage)
  response = claude_get_response(claude_client, prompt_raw)
  claude_preds.append(response)
  response_txt = claude_res_get_text(response)
  if TAG_YES in response_txt:
    claude_ratings.append('Y')
  elif TAG_NO in response_txt:
    claude_ratings.append('N')
  else:
    claude_ratings.append('-')

ratings_df['claude_rating'] = claude_ratings
ratings_df['claude_pred'] = claude_preds
ratings_df.to_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_tail.csv', index=False)  # _tail !

Querying Claude...:   0%|          | 0/821 [00:00<?, ?it/s]
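Because the earlier session crashed partway through, the results here are produced as separate head and tail files and merged afterwards. An alternative would be to write results to disk in small batches and skip rows that are already saved when rerunning. The sketch below illustrates that idea only; `rate_with_checkpoints`, `rate_fn` and `every` are hypothetical names, and the lambda stands in for the Claude call.

```python
import os
import tempfile
import pandas as pd

def rate_with_checkpoints(df, rate_fn, out_path, every=2):
    """Rate each row with rate_fn, appending results to out_path in batches.

    Rows already present in out_path are skipped, so a rerun resumes.
    """
    done = len(pd.read_csv(out_path)) if os.path.exists(out_path) else 0
    batch = []
    for pos in range(done, len(df)):
        row = df.iloc[pos]
        batch.append({**row.to_dict(), 'rating': rate_fn(row)})
        if len(batch) == every or pos == len(df) - 1:
            # Write the header only when the file is first created.
            pd.DataFrame(batch).to_csv(
                out_path, mode='a', index=False,
                header=not os.path.exists(out_path))
            batch = []
    return pd.read_csv(out_path)

# Toy data and a dummy rater, for illustration only.
toy_df = pd.DataFrame({'question': ['a', 'b', 'c'],
                       'passage': ['x', 'y', 'z']})
out_path = os.path.join(tempfile.mkdtemp(), 'ratings_checkpoint.csv')
result = rate_with_checkpoints(toy_df, lambda row: 'Y', out_path)
print(result)
```

Calling the function again with the same `out_path` does not invoke `rate_fn` for rows that were already written, so at most one unfinished batch is lost if a session dies.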

In [None]:
# Merge with results from previous run:
ratings_cld_head_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_head.csv')
ratings_cld_tail_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_tail.csv')
ratings_cld_merged_df = pd.concat([ratings_cld_head_df, ratings_cld_tail_df])
ratings_cld_merged_df.to_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_merged.csv', index=False)

In [None]:
# Final merged statistics:
ratings_claude_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_merged.csv')

total = len(ratings_claude_df)
yes = len(ratings_claude_df[ratings_claude_df['claude_rating'] == 'Y'])
no = len(ratings_claude_df[ratings_claude_df['claude_rating'] == 'N'])
none = len(ratings_claude_df[ratings_claude_df['claude_rating'] == '-'])
print(f'Total: {total}, #Yes: {yes}, #No: {no}, #-: {none}')

same = len(ratings_claude_df[ratings_claude_df['claude_rating'] == ratings_claude_df['human_rating']])
print(f'Same: {same}, Different: {total - same}, %Same: {round(same / total * 100, 2)}')

Total: 5785, #Yes: 3113, #No: 2672, #-: 0
Same: 4418, Different: 1367, %Same: 76.37


### Exploring wrong predictions:

In [None]:
# Let's take 100 questions+passages with Claude's WRONG predictions:
ratings_cld_no_100_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_merged.csv')
ratings_cld_no_100_df = ratings_cld_no_100_df[ratings_cld_no_100_df['claude_rating'] != ratings_cld_no_100_df['human_rating']]
ratings_cld_no_100_df = ratings_cld_no_100_df.head(100)


total = len(ratings_cld_no_100_df)
yes = len(ratings_cld_no_100_df[ratings_cld_no_100_df['claude_rating'] == 'Y'])
no = len(ratings_cld_no_100_df[ratings_cld_no_100_df['claude_rating'] == 'N'])

print(f'Total: {total}, #Yes: {yes}, #No: {no}')

same = len(ratings_cld_no_100_df[ratings_cld_no_100_df['claude_rating'] == ratings_cld_no_100_df['human_rating']])
print(f'Same: {same}, Different: {total - same}')

Total: 100, #Yes: 76, #No: 24
Same: 0, Different: 100


In [None]:
# Let's explore above WRONG predictions:
for i in range(10):
  row = ratings_cld_no_100_df.iloc[i]
  passage = row.passage.replace('\n', '\n\t\t  ')
  passage = passage.replace('.', '.\n\t\t  ')
  print(f'row {i + 1}:')
  print(f'\tclaude_rating: {row.claude_rating}\n\thuman_rating: {row.human_rating}')
  print(f'\tquestion: [{row.question}]\n\tpassage: [{passage}]\n\n')

  # Rows where we mostly agree with the human rating: 1, 5, 6, 9
  # Rows where we mostly agree with Claude: 2, 3, 4, 7, 8, 10

row 1:
	claude_rating: Y
	human_rating: Y
	question: [who played hyde in league of extraordinary gentlemen]
	passage: [Title: Jason Flemyng
		  Section: Television and film work
		  
		  In the early 2000s he featured in two big-budget Hollywood films which were adaptations of Alan Moore comic books; as John Netley in 2001's From Hell, with Johnny Depp, and 2003's The League of Extraordinary Gentlemen, with Sean Connery, in which Flemyng played Dr.
		   Henry Jekyll and Edward Hyde.
		   The latter film was a disappointment, but Flemyng commented that: "It was a bit of a nightmare.
		  .
		  .
		   the film cost a fortune and didn't make back the money it was meant to.
		  .
		  .
		   But I still get a huge kick out of doing films like that and From Hell.
		   Any day you walk onto a set and Sean Connery or Johnny Depp or Brad Pitt is there has to be a good day.
		  "]


row 2:
	claude_rating: Y
	human_rating: Y
	question: [who signed the largest on the declaration of independence]
	p

### More prompting experiments:

In [None]:
# Let's take 100 questions+passages with Claude's relevant/irrelevant predictions:
ratings_cld_100_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_merged.csv')
ratings_cld_100_df = ratings_cld_100_df.head(100).copy()  # copy: new columns are assigned to this frame later

total = len(ratings_cld_100_df)
yes = len(ratings_cld_100_df[ratings_cld_100_df['claude_rating'] == 'Y'])
no = len(ratings_cld_100_df[ratings_cld_100_df['claude_rating'] == 'N'])

print(f'Total: {total}, #Yes: {yes}, #No: {no}')

same = len(ratings_cld_100_df[ratings_cld_100_df['claude_rating'] == ratings_cld_100_df['human_rating']])
print(f'Same: {same}, Different: {total - same}')

Total: 100, #Yes: 53, #No: 47
Same: 82, Different: 18


In [None]:
# Let's try Claude again on above 100 questions+passages, WITHOUT CoT:
def create_prompt_is_relevant_passage_no_cot(question, passage):
  return f"""Decide if the following passage <Passage> is relevant for answering the following question <Question>.
Write the final answer "Yes" or "No" between the tags <Answer> and </Answer>.
Here is the question <Question>: <Question>{question}</Question>
Here is the passage <Passage>: <Passage>{passage}</Passage>\n"""

total = len(ratings_cld_100_df)
claude_preds = []
claude_ratings = []

for i, row in tqdm(ratings_cld_100_df.iterrows(), desc="Querying Claude...", total=total):
  prompt_raw = create_prompt_is_relevant_passage_no_cot(row.question, row.passage) # <-- !!
  response = claude_get_response(claude_client, prompt_raw)
  claude_preds.append(response)
  response_txt = claude_res_get_text(response)
  if TAG_YES in response_txt:
    claude_ratings.append('Y')
  elif TAG_NO in response_txt:
    claude_ratings.append('N')
  else:
    claude_ratings.append('-')

ratings_cld_100_df['claude_rating'] = claude_ratings
ratings_cld_100_df['claude_pred'] = claude_preds
ratings_cld_100_df.to_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_100_no_cot.csv', index=False)

Querying Claude...:   0%|          | 0/100 [00:00<?, ?it/s]

In [None]:
# Statistics (WITHOUT CoT):
ratings_cld_100_nocot_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_relevant_claude_100_no_cot.csv')

total = len(ratings_cld_100_nocot_df)
yes = len(ratings_cld_100_nocot_df[ratings_cld_100_nocot_df['claude_rating'] == 'Y'])
no = len(ratings_cld_100_nocot_df[ratings_cld_100_nocot_df['claude_rating'] == 'N'])

print(f'Total: {total}, #Yes: {yes}, #No: {no}')

same = len(ratings_cld_100_nocot_df[ratings_cld_100_nocot_df['claude_rating'] == ratings_cld_100_nocot_df['human_rating']])
print(f'Same: {same}, Different: {total - same}')

Total: 100, #Yes: 55, #No: 45
Same: 80, Different: 20


## Imitating the behavior of 'Retrieve' Tags

In [None]:
# Let's take 100 questions with self-rag's [Retrieval]/[No Retrieval] predictions:
ret_df = pd.read_csv(f'{SELF_RAG_PATH}/retrieval_rating.csv')
ret_df_head = ret_df.head(100)

total = len(ret_df_head)
yes = len(ret_df_head[ret_df_head['self_rag_rating'] == 'Y'])
no = len(ret_df_head[ret_df_head['self_rag_rating'] == 'N'])
none = len(ret_df_head[ret_df_head['self_rag_rating'] == '-'])

print(f'Total: {total}, #Yes: {yes}, #No: {no}, #-: {none}')

Total: 100, #Yes: 5, #No: 0, #-: 95


In [None]:
# Let's try Claude on above 100 questions:
claude_preds = []
claude_ratings = []

for i, row in tqdm(ret_df_head.iterrows(), desc="Querying Claude...", total=total):
  prompt_raw = create_prompt_is_retrieve(row.question)
  response = claude_get_response(claude_client, prompt_raw, is_retrieve_sys_prompt) # <-- !!
  claude_preds.append(response)
  response_txt = claude_res_get_text(response)
  if TAG_YES in response_txt:
    claude_ratings.append('Y')
  elif TAG_NO in response_txt:
    claude_ratings.append('N')
  else:
    claude_ratings.append('-')

ret_df_head['claude_rating'] = claude_ratings
ret_df_head['claude_pred'] = claude_preds
ret_df_head.to_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_retrieve_claude_100.csv', index=False)


# Statistics:
ret_100_df = pd.read_csv(f'{ATTR_QA_PATH}/{RATINGS_FILENAME}_retrieve_claude_100.csv')

total = len(ret_100_df)
yes = len(ret_100_df[ret_100_df['claude_rating'] == 'Y'])
no = len(ret_100_df[ret_100_df['claude_rating'] == 'N'])
none = len(ret_100_df[ret_100_df['claude_rating'] == '-'])

print(f'Total: {total}, #Yes: {yes}, #No: {no}, #-: {none}')
print(f' %Yes: {round(yes / total * 100, 2)}')

Querying Claude...:   0%|          | 0/100 [00:00<?, ?it/s]

Total: 100, #Yes: 93, #No: 7, #-: 0
 %Yes: 93.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ret_df_head['claude_rating'] = claude_ratings
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ret_df_head['claude_pred'] = claude_preds
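The warning above is raised because `ret_df_head` comes from `ret_df.head(100)`, and pandas cannot tell whether the new columns are meant for that slice or for the original frame. The statistics were still printed, so the run was usable, but taking an explicit copy is a common way to avoid the ambiguity. A minimal sketch on toy data (the values are made up):

```python
import pandas as pd

# Toy stand-in for ret_df (hypothetical values).
ret_df = pd.DataFrame({'question': ['q1', 'q2', 'q3'],
                       'self_rag_rating': ['Y', '-', '-']})

# .copy() gives an independent DataFrame, so adding columns is unambiguous.
ret_df_head = ret_df.head(2).copy()
ret_df_head['claude_rating'] = ['Y', 'N']

print(ret_df_head)
print('claude_rating' in ret_df.columns)
```

The original `ret_df` is left without the new column, which is what the cell above intends.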
