# Finetuning Dataset with Cross-Encoder

## Process
- Download the QASPER dataset from the HuggingFace Hub using the Datasets library (https://huggingface.co/datasets/allenai/qasper)
- From the train and test splits of the dataset, extract 800 and 80 samples respectively

- Use the 800 samples from the train split, each of which contains questions framed on a research paper, to generate a dataset in the format required for CrossEncoder fine-tuning. Currently, a single fine-tuning sample consists of two sentences (a question and a context) and a score of either 0 or 1, where 1 means the question and context are relevant to each other and 0 means they are not.

- Use the 80 samples from the test split to build two kinds of evaluation datasets

- RAG Eval Dataset:
    - Each sample consists of the research paper content, a list of questions on the paper, and the answers to those questions. While forming this dataset we keep only questions that have long, free-form answers, for better comparison with RAG-generated answers.

- Reranking Eval Dataset:
    - Each sample consists of the research paper content, a list of questions on the paper, and a list of contexts from the paper that are relevant to each question.

- We fine-tune the cross-encoder using helper utilities written in LlamaIndex and push it to the HuggingFace Hub, logging in through the huggingface cli with an access token, which can be created here: https://huggingface.co/settings/tokens

- We evaluate on both datasets in three cases, using the metrics listed under the evaluation criteria:
    - Just BAAI/bge-small-en embeddings without any reranker
    - BAAI/bge-small-en embeddings combined with cross-encoder/ms-marco-MiniLM-L-12-v2 as reranker
    - BAAI/bge-small-en embeddings combined with our fine-tuned cross encoder model as reranker

- Evaluation Criteria for each Eval Dataset
  - F1-score metric: the metric used in the original QASPER paper.
  - Hits metric: for the Reranking Eval Dataset we use the retriever + post-processor functionality of LlamaIndex and count, in each of the three cases, how many times the relevant context gets retrieved. We call this count the hits metric.
  - RAGAS metric:- A third party  
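
The Hits metric and a token-level F1 score can be sketched in plain Python as follows. This is only an illustration, not the LlamaIndex or QASPER evaluation code: `hit_rate` and `token_f1` are hypothetical helper names, lowercasing plus whitespace tokenization is a simplifying assumption, and a "hit" is assumed to mean that the relevant context appears verbatim among the retrieved contexts.

```python
from collections import Counter


def hit_rate(retrieved_per_question, relevant_per_question):
    """Fraction of questions whose relevant context is among the retrieved ones."""
    if not relevant_per_question:
        return 0.0
    hits = sum(
        1
        for retrieved, relevant in zip(retrieved_per_question, relevant_per_question)
        if relevant in retrieved
    )
    return hits / len(relevant_per_question)


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, exactly one empty as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Running the same helpers for each of the three cases (no reranker, the MS MARCO reranker, the fine-tuned reranker) would give directly comparable numbers.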

## Load the Dataset

In [33]:
from qasper_data.qasper_dataset import QasperDataset
seed = 42
train_dataset = QasperDataset("train", seed=seed)
test_dataset = QasperDataset("test", seed=seed)
validation_dataset = QasperDataset("validation", seed=seed)

train_samples = train_dataset.random_sample(800)
test_samples = test_dataset.random_sample(80)

In [34]:
doc_qa_dict_list = [{"paper": sample.get_full_text(), "questions": sample.get_questions()} for sample in train_samples]

In [35]:
len(doc_qa_dict_list)

800
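
`QasperDataset` is a local helper whose source is not shown here. As a rough sketch of what a seeded `random_sample` call might do, the function below draws a reproducible sample from any sequence. The function body is an assumption for illustration and not the helper's actual implementation.

```python
import random


def random_sample(items, n, seed=42):
    """Return up to n distinct items chosen reproducibly from a sequence."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    items = list(items)
    n = min(n, len(items))  # never ask for more items than are available
    return rng.sample(items, n)
```

Using a fixed seed means the same 800 training papers and 80 test papers are selected on every run, which keeps the saved csv files and later evaluation numbers comparable.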

## Save Train Data into a csv file

In [36]:
import pandas as pd
import os

data_folder = "data"

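# Create the data folder if it does not exist yet
os.makedirs(data_folder, exist_ok=True)

# train.csv is assumed to have been written earlier from doc_qa_dict_list, e.g. with
# pd.DataFrame(doc_qa_dict_list).to_csv(os.path.join(data_folder, "train.csv"))
df_train = pd.read_csv(os.path.join(data_folder, "train.csv"))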

In [37]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,paper,questions
0,0,Multi-task learning (MTL) refers to machine le...,['Do they repot results only on English data?'...
1,1,Deep neural networks have been successfully ap...,['At which interval do they extract video and ...
2,2,With ever-increasing amounts of data available...,['By how much do they outperform standard BERT...
3,3,Urban legends are a genre of modern folklore c...,['How accurate is their predictive model?']
4,4,The recent years have seen unprecedented forwa...,['How does morphological analysis differ from ...


In [38]:
eval_doc_qa_answer_list = []

for test_sample in test_samples:
    final_answer_list = []
    full_text = test_sample.get_full_text()
    questions = test_sample.get_questions()
    for qa in test_sample.qas:
        for answer in qa.answers:
            final_answer_list.append(answer.local_answer)
    eval_doc_qa_answer_list.append({
        "paper": full_text, 
        "questions": questions, 
        "answers": final_answer_list
    })

In [39]:
len(eval_doc_qa_answer_list)

80
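
The Process section states that only questions with free-form answers are kept for the RAG eval dataset. A minimal sketch of such a filter over plain dictionaries is shown below. The field names `free_form_answer` and `unanswerable` follow the QASPER dataset card, but the exact structure returned by the local `QasperDataset` helper is not shown in this notebook, so the input shape here is an assumption and `free_form_qa_pairs` is a hypothetical name.

```python
def free_form_qa_pairs(questions, answers_per_question):
    """Keep (question, answer) pairs whose answer is a non-empty free-form string."""
    pairs = []
    for question, answers in zip(questions, answers_per_question):
        for answer in answers:
            text = (answer.get("free_form_answer") or "").strip()
            if text and not answer.get("unanswerable", False):
                pairs.append((question, text))
                break  # keep the first usable reference answer per question
    return pairs
```

Questions that only have extractive, yes/no, or unanswerable annotations are dropped entirely, so the question and answer lists stay aligned.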

## Save Eval Data into a csv file

In [40]:
import pandas as pd
import os

data_folder = "data"

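# Create the data folder if it does not exist yet
os.makedirs(data_folder, exist_ok=True)

# test.csv is assumed to have been written earlier from eval_doc_qa_answer_list, e.g. with
# pd.DataFrame(eval_doc_qa_answer_list).to_csv(os.path.join(data_folder, "test.csv"))
df_test = pd.read_csv(os.path.join(data_folder, "test.csv"))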

In [41]:
df_test.head()

Unnamed: 0.1,Unnamed: 0,paper,questions,answers
0,0,Deep neural models recently have achieved rema...,['How do they perform semi-supervised learning...,"['On each step, a generative network is used t..."
1,1,Evidence-based medicine (EBM) is of primary im...,"['what boosting techniques were used?', 'did t...","['Unacceptable', 'Unacceptable', 'Unacceptable..."
2,2,Twitter is a huge micro-blogging service with ...,['did the top teams experiment with lexicons?'...,"['Unacceptable', 'Unacceptable', 'Unacceptable..."
3,3,Recent progress in Automatic Speech Recognitio...,"['Which datasets do they evaluate on?', 'Do th...","['Unacceptable', 'Unacceptable', 'Unacceptable..."
4,4,The ability to make effective presentations ha...,['What linguistic model does the conventional ...,['Random Forest to perform humor recognition b...


# Generate the Finetuning Dataset

In [42]:
import os
from llama_index import SimpleDirectoryReader
import openai
from llama_index.finetuning.cross_encoders.dataset_gen import (
    generate_ce_fine_tuning_dataset,
    generate_synthetic_queries_over_documents,
)


# os.environ["OPENAI_API_KEY"] = "sk-"
# os.environ["OPENAI_API_BASE"] = "https://ai-yyds.com/v1"
# openai.api_key = os.environ["OPENAI_API_KEY"]
# openai.api_base = os.environ["OPENAI_API_BASE"]

In [43]:
import os
import pandas as pd

data_folder = "data"

doc_qa_dict_list = pd.read_csv(os.path.join(data_folder, "train.csv")).to_dict("records")
eval_doc_qa_answer_list = pd.read_csv(os.path.join(data_folder, "test.csv")).to_dict("records")

In [44]:
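import ast

from llama_index import Document
from llama_index.llms import OpenAI

llm = OpenAI(api_key=os.environ["OPENAI_API_KEY"],
             api_base=os.environ["OPENAI_API_BASE"],
             model="gpt-3.5-turbo",
             temperature=1)

final_finetuning_data_list = []
for paper in doc_qa_dict_list:
    # Rows come from train.csv, so the questions column holds the string form of a list;
    # parse it back, otherwise the string is iterated character by character
    questions_list = ast.literal_eval(paper["questions"])
    documents = [Document(text=paper["paper"])]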
    local_finetuning_dataset = generate_ce_fine_tuning_dataset(
        documents=documents,
        llm=llm,
        questions_list=questions_list,
        max_chunk_length=256,
        top_k=5,
    )
    final_finetuning_data_list.extend(local_finetuning_dataset)

  0%|          | 0/202 [00:00<?, ?it/s]

  0%|          | 0/209 [00:00<?, ?it/s]

  0%|          | 0/159 [00:00<?, ?it/s]

KeyboardInterrupt: 
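
To make the roles of `max_chunk_length` and `top_k` concrete, here is a dependency-free sketch of the general idea: split the paper into chunks, pick the `top_k` chunks per question, and label each (question, chunk) pair as relevant or not. It is not the LlamaIndex implementation. Word-overlap ranking stands in for embedding retrieval, chunks are measured in words rather than tokens, and `is_relevant` is a hypothetical caller-supplied judge that stands in for the LLM relevance check.

```python
def chunk_words(text, max_chunk_length=256):
    """Split text into consecutive chunks of at most max_chunk_length words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_chunk_length])
        for i in range(0, len(words), max_chunk_length)
    ]


def top_k_chunks(question, chunks, top_k=5):
    """Rank chunks by word overlap with the question (stand-in for embedding retrieval)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def label_pairs(text, questions, is_relevant, max_chunk_length=256, top_k=5):
    """Build {"query", "context", "score"} dicts with a 0/1 relevance score."""
    chunks = chunk_words(text, max_chunk_length)
    samples = []
    for question in questions:
        for chunk in top_k_chunks(question, chunks, top_k):
            score = int(bool(is_relevant(question, chunk)))
            samples.append({"query": question, "context": chunk, "score": score})
    return samples
```

With `top_k=5`, each question contributes at most five rows to the fine-tuning dataset, most of which are likely to be negatives.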

In [None]:
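import os
import pandas as pd

# Total samples in the final fine-tuning dataset
print(len(final_finetuning_data_list))

df_finetuning_dataset = pd.DataFrame(final_finetuning_data_list)
# Save into the data folder, since the fine-tuning data is later read from data/fine_tuning.csv
df_finetuning_dataset.to_csv(os.path.join(data_folder, "fine_tuning.csv"))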

# Load fine-tuning dataset

In [None]:
finetuning_dataset = final_finetuning_data_list


In [None]:
finetuning_dataset[0]

In [1]:
import pandas as pd
import os

data_folder = "data"

if not os.path.exists(data_folder):
    os.makedirs(data_folder)
    
df_finetuning = pd.read_csv(os.path.join(data_folder, "fine_tuning.csv"))

In [2]:
df_finetuning.head()

Unnamed: 0.1,Unnamed: 0,query,context,score
0,0,Do they repot results only on English data?,"addition to precision, recall, and F1 scores f...",0
1,1,Do they repot results only on English data?,"which contains 910 training, 243 dev, and 288 ...",0
2,2,Do they repot results only on English data?,value of taking the number of shared and task-...,0
3,3,Do they repot results only on English data?,types (Adverse-Effect and Drug) and a single r...,0
4,4,Do they repot results only on English data?,the setup of Nguyen and Verspoor's nguyen2019e...,0
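
Each row above is one fine-tuning sample: a query, a context, and a 0/1 score. The sketch below models that shape with a small dataclass and adds a label-balance check, which seems worth running given that the first five rows all have score 0. `RelevancePair` and `label_balance` are hypothetical names used for illustration only; they are not LlamaIndex classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelevancePair:
    query: str
    context: str
    score: int  # 1 = relevant, 0 = not relevant

    def __post_init__(self):
        # Reject anything other than the two labels described in the Process section
        if self.score not in (0, 1):
            raise ValueError(f"score must be 0 or 1, got {self.score!r}")


def label_balance(pairs):
    """Return the fraction of pairs labeled relevant (score == 1)."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(pair.score for pair in pairs) / len(pairs)
```

A very low positive fraction would suggest that the reranker sees few relevant examples during fine-tuning.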


In [6]:
import ast
import pandas as pd

df_test = pd.read_csv(os.path.join(data_folder, "test.csv"), index_col=0)

df_test["questions"] = df_test["questions"].apply(ast.literal_eval)
df_test["answers"] = df_test["answers"].apply(ast.literal_eval)
print(f"Number of papers in the test sample:- {len(df_test)}")

Number of papers in the test sample:- 80
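
The `ast.literal_eval` calls are needed because a list stored in a csv cell comes back as its string form, not as a list. The sketch below reproduces that round trip with the standard-library `csv` module instead of pandas, purely to stay dependency-free; the sample row is made up for illustration.

```python
import ast
import csv
import io

rows = [{"paper": "Some paper text", "questions": ["Q1?", "Q2?"]}]

# Write the row to an in-memory csv; the list is stored as its str() form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["paper", "questions"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

raw = loaded[0]["questions"]    # the string "['Q1?', 'Q2?']"
parsed = ast.literal_eval(raw)  # a real Python list again
```

Iterating over `raw` would yield single characters, while iterating over `parsed` yields the two questions.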


In [7]:
from llama_index import Document
from llama_index.finetuning.cross_encoders.dataset_gen import (
    generate_ce_fine_tuning_dataset,
)

final_eval_data_list = []
for _, row in df_test.iterrows():
    documents = [Document(text=row["paper"])]
    query_list = row["questions"]
    local_eval_dataset = generate_ce_fine_tuning_dataset(
        documents=documents,
        questions_list=query_list,
        max_chunk_length=256,
        top_k=5,
    )
    relevant_query_list = []
    relevant_context_list = []

    # Keep only the (query, context) pairs that were judged relevant
    for item in local_eval_dataset:
        if item.score == 1:
            relevant_query_list.append(item.query)
            relevant_context_list.append(item.context)

    # Skip papers for which no relevant context was found
    if len(relevant_query_list) > 0:
        final_eval_data_list.append(
            {
                "paper": row["paper"],
                "questions": relevant_query_list,
                "context": relevant_context_list,
            }
        )

In [6]:
import pandas as pd
import os

data_folder = "data"

# Create the data folder if it does not exist yet
os.makedirs(data_folder, exist_ok=True)

# Note: the eval dataset is written to the working directory, not to data_folder
df_eval_dataset = pd.DataFrame(final_eval_data_list)
df_eval_dataset.to_csv("eval_dataset.csv")