# Finetune Dataset with Qasper

In this notebook we build a fine-tuning dataset from Qasper references, fine-tune a cross-encoder on it, and compare retrieval with and without reranking.

## Set Up Dataset

We sample 200 random papers from the train split and 80 random papers from the test split.

The results are stored in pandas DataFrames.

In [1]:
import pandas as pd
from qasper_data.qasper_dataset import QasperDataset, PaperIndex
from qasper_data.qasper_evaluator import EvidenceEvaluator


train = QasperDataset("train", 64)
test = QasperDataset("test", 64)

train_papers = train.random_sample(200)
test_papers = test.random_sample(80)


## Define Train Dataframe

In [2]:
from llama_index import ServiceContext

df_train = pd.DataFrame()

# The LLM is disabled; embeddings come from a local model.
service_context = ServiceContext.from_defaults(llm=None, embed_model="local")

for paper in train_papers:
    # Index the paper and query it for the top 8 most similar chunks.
    paper_index = PaperIndex(paper, service_context)
    query_engine = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=8)
    evaluator = EvidenceEvaluator.from_defaults(service_context)
    # Rows of question, context, and score for this paper.
    df_paper = evaluator.get_evaluation_dataframe(paper, query_engine)
    if df_paper.empty:
        continue
    if df_train.empty:
        df_train = pd.DataFrame(df_paper)
    else:
        df_train = pd.concat([df_train, df_paper], ignore_index=True)
    # Release the index before moving on to the next paper.
    del paper_index


LLM is explicitly disabled. Using MockLLM.
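The loop above grows the DataFrame by concatenating one paper at a time. A common alternative, sketched here with small hypothetical frames standing in for the per-paper output of `get_evaluation_dataframe`, is to collect the non-empty frames in a list and call `pd.concat` once at the end. The helper name `combine_frames` is an assumption for illustration and is not part of `qasper_data`.

```python
import pandas as pd

COLUMNS = ["question", "context", "score"]


def combine_frames(frames):
    """Concatenate the non-empty per-paper frames in a single call."""
    non_empty = [frame for frame in frames if not frame.empty]
    if not non_empty:
        # Nothing to combine: return an empty frame with the expected columns.
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(non_empty, ignore_index=True)


# Hypothetical per-paper results; the second paper produced no rows.
paper_a = pd.DataFrame(
    {"question": ["q1", "q1"], "context": ["c1", "c2"], "score": [1.0, 0.0]}
)
paper_b = pd.DataFrame(columns=COLUMNS)
paper_c = pd.DataFrame({"question": ["q2"], "context": ["c3"], "score": [0.0]})

combined = combine_frames([paper_a, paper_b, paper_c])
print(len(combined))
```

Skipping empty frames inside the helper plays the same role as the `if df_paper.empty: continue` check in the loop.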


In [3]:
df_train.head(30)

|   | question | context | score |
|---|----------|---------|-------|
| 0 | What language are the conversations in? | The patients are from 31 provincial-level admi... | 1.0 |
| 1 | What language are the conversations in? | Telemedicine refers to the practice of deliver... | 0.0 |
| 2 | How is morphology knowledge implemented in the... | Huck BIBREF18 explored target-side segmentatio... | 0.0 |
| 3 | How is morphology knowledge implemented in the... | We denote this method as SSS. The segmented wo... | 1.0 |
| 4 | How is morphology knowledge implemented in the... | We will elaborate the number settings for our ... | 0.0 |
| 5 | How is morphology knowledge implemented in the... | Neural machine translation (NMT) has achieved ... | 0.0 |
| 6 | How does the word segmentation method work? | We denote this method as SSS. The segmented wo... | 1.0 |
| 7 | How does the word segmentation method work? | We will elaborate the number settings for our ... | 0.0 |
| 8 | How does the word segmentation method work? | Huck BIBREF18 explored target-side segmentatio... | 0.0 |
| 9 | How does the word segmentation method work? | Neural machine translation (NMT) has achieved ... | 1.0 |
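Before fine-tuning, it can help to check how the labels are distributed, since each question contributes several retrieved contexts and only some of them are scored 1.0. The sketch below retypes only the `score` column of the ten rows shown above, with the three questions abbreviated to the hypothetical labels `q1` to `q3`; it describes this preview, not the full training set.

```python
import pandas as pd

# Scores copied from the ten preview rows; question texts are abbreviated.
preview = pd.DataFrame(
    {
        "question": ["q1", "q1", "q2", "q2", "q2", "q2", "q3", "q3", "q3", "q3"],
        "score": [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
    }
)

# Share of rows labelled as positive examples.
positive_share = preview["score"].mean()

# Number of positive contexts per question.
positives_per_question = preview.groupby("question")["score"].sum()

print(positive_share)
print(positives_per_question.to_dict())
```

Running the same two lines on `df_train` would show whether negatives dominate the full dataset.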


In [4]:
data_folder = "data"

df_train.to_csv(f"{data_folder}/fine_tuning_2.csv")

In [5]:
validation = QasperDataset("validation", 64)

df_validation = pd.DataFrame()

for paper in validation.random_sample(20):
    paper_index = PaperIndex(paper, service_context)
    query_engine = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=8)
    evaluator = EvidenceEvaluator.from_defaults(service_context)
    df_paper = evaluator.get_evaluation_dataframe(paper, query_engine)
    if df_paper.empty:
        continue
    if df_validation.empty:
        df_validation = pd.DataFrame(df_paper)
    else:
        df_validation = pd.concat([df_validation, df_paper], ignore_index=True)
    del paper_index


In [6]:
df_validation.head(30)

|   | question | context | score |
|---|----------|---------|-------|
| 0 | by how much did nus outperform abus? | This could indicate that the policy “overfits”... | 0.0 |
| 1 | by how much did nus outperform abus? | The maximum dialogue length was 25 turns. The ... | 1.0 |
| 2 | by how much did nus outperform abus? | Spoken Dialogue Systems (SDS) allow human-comp... | 0.0 |
| 3 | by how much did nus outperform abus? | In that case the value will be kept. The behav... | 0.0 |
| 4 | by how much did nus outperform abus? | Thus, an SDS that is trained with a natural la... | 0.0 |
| 5 | by how much did nus outperform abus? | If the current dialogue turn is turn INLINEFOR... | 0.0 |
| 6 | by how much did nus outperform abus? | However, even if the size of the corpus is lar... | 0.0 |
| 7 | what corpus is used to learn behavior? | However, even if the size of the corpus is lar... | 0.0 |
| 8 | what corpus is used to learn behavior? | However, even if the size of the corpus is lar... | 0.0 |
| 9 | what corpus is used to learn behavior? | This could indicate that the policy “overfits”... | 0.0 |


In [7]:
df_validation.to_csv(f"{data_folder}/fine_tuning_2_validation.csv")

## Run Finetuning

In [8]:
from typing import List
from llama_index.finetuning.cross_encoders.cross_encoder import (
    CrossEncoderFinetuneEngine,
    CrossEncoderFinetuningDatasetSample
)
import os
import pandas as pd

data_folder = "data"

version = 2

df_finetuning = pd.read_csv(os.path.join(data_folder, f"fine_tuning_{version}.csv"), index_col=0)

finetuning_dataset: List[CrossEncoderFinetuningDatasetSample] = []

for _, row in df_finetuning.iterrows():
    finetuning_dataset.append(
        CrossEncoderFinetuningDatasetSample(
            query=row["question"],
            context=row["context"],
            score=row["score"]
        )
    )
    
validation_dataset: List[CrossEncoderFinetuningDatasetSample] = []

df_validation = pd.read_csv(os.path.join(data_folder, f"fine_tuning_{version}_validation.csv"), index_col=0)

for _, row in df_validation.iterrows():
    validation_dataset.append(
        CrossEncoderFinetuningDatasetSample(
            query=row["question"],
            context=row["context"],
            score=row["score"]
        )
    )

finetuning_engine = CrossEncoderFinetuneEngine(
    dataset=finetuning_dataset,
    epochs=2,
    batch_size=16,
    model_output_path="../models/fine_tuned_cross_encoder",
    val_dataset=validation_dataset,
)

# Finetune the cross-encoder model
finetuning_engine.finetune()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



## Evaluate Test Dataframe

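We count hits on the test papers for three setups: plain top-3 retrieval (baseline), top-8 retrieval reranked down to the top 3 with the base cross-encoder, and the same reranking with the fine-tuned cross-encoder.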

In [9]:
from llama_index import ServiceContext
from llama_index.indices.postprocessor import SentenceTransformerRerank

service_context = ServiceContext.from_defaults(llm=None, embed_model="local")

rerank_base = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=3
)

rerank_fine_tuned = SentenceTransformerRerank(
    model="../models/fine_tuned_cross_encoder", top_n=3
)

df_test_base = pd.DataFrame()
df_test_rerank = pd.DataFrame()
df_test_rerank_fine_tuned = pd.DataFrame()

for paper in test_papers:
    paper_index = PaperIndex(paper, service_context)
    query_base = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=3)
    query_rerank = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=8, node_postprocessors=[rerank_base])
    query_rerank_fine_tuned = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=8, node_postprocessors=[rerank_fine_tuned])
    evaluator = EvidenceEvaluator.from_defaults(service_context)
    df_paper_base = evaluator.get_evaluation_dataframe(paper, query_base)
    df_paper_rerank = evaluator.get_evaluation_dataframe(paper, query_rerank)
    df_paper_rerank_fine_tuned = evaluator.get_evaluation_dataframe(paper, query_rerank_fine_tuned)
    if df_test_base.empty:
        df_test_base = pd.DataFrame(df_paper_base)
    else:
        df_test_base = pd.concat([df_test_base, df_paper_base], ignore_index=True)
    if df_test_rerank.empty:
        df_test_rerank = pd.DataFrame(df_paper_rerank)
    else:
        df_test_rerank = pd.concat([df_test_rerank, df_paper_rerank], ignore_index=True)
    if df_test_rerank_fine_tuned.empty:
        df_test_rerank_fine_tuned = pd.DataFrame(df_paper_rerank_fine_tuned)
    else:
        df_test_rerank_fine_tuned = pd.concat([df_test_rerank_fine_tuned, df_paper_rerank_fine_tuned], ignore_index=True)
    del paper_index

LLM is explicitly disabled. Using MockLLM.




In [10]:
base_hit = df_test_base["score"].sum()
print(f"Base Hit: {base_hit}")

Base Hit: 430.0


In [11]:
rerank_hit = df_test_rerank["score"].sum()
print(f"Rerank Hit: {rerank_hit}")

Rerank Hit: 567.0


In [12]:
rerank_fine_tuned_hit = df_test_rerank_fine_tuned["score"].sum()
print(f"Rerank Fine Tuned Hit: {rerank_fine_tuned_hit}")

Rerank Fine Tuned Hit: 567.0
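The sums above are raw hit counts. To turn a count into a rate, one option is the fraction of questions with at least one context scored above zero. The sketch below uses hypothetical rows in the same `question` and `score` layout; the `hit_rate` helper is an assumption for illustration and is not part of `qasper_data`. It also assumes that question texts are unique across papers, which the real data may not guarantee.

```python
import pandas as pd


def hit_rate(df: pd.DataFrame) -> float:
    """Fraction of questions with at least one context scored above zero."""
    best_per_question = df.groupby("question")["score"].max()
    return float((best_per_question > 0).mean())


# Hypothetical results: three questions with three retrieved contexts each.
toy = pd.DataFrame(
    {
        "question": ["q1", "q1", "q1", "q2", "q2", "q2", "q3", "q3", "q3"],
        "score": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    }
)

total_hits = toy["score"].sum()  # raw count, as in the cells above
rate = hit_rate(toy)             # share of questions with any hit
print(total_hits, rate)
```

A rate of this kind stays comparable if the setups are later evaluated on different numbers of questions, which a raw sum does not.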


In [13]:
data_folder = "data"

df_test_base.to_csv(f"{data_folder}/test_base.csv")
df_test_rerank.to_csv(f"{data_folder}/test_rerank.csv")
df_test_rerank_fine_tuned.to_csv(f"{data_folder}/test_rerank_fine_tuned.csv")

In [14]:
f1_columns = ["baseline", "rerank", "rerank_fine_tuned"]
f1_frames = []

for paper in test_papers:
    paper_index = PaperIndex(paper, service_context)
    query_base = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=3)
    query_rerank = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=8, node_postprocessors=[rerank_base])
    query_rerank_fine_tuned = paper_index.as_index().as_query_engine(mode="no_text", similarity_top_k=8, node_postprocessors=[rerank_fine_tuned])
    evaluator = EvidenceEvaluator.from_defaults(service_context)

    f1s_base = pd.Series(evaluator.evaluate(paper, query_base))
    f1s_rerank = pd.Series(evaluator.evaluate(paper, query_rerank))
    f1s_rerank_fine_tuned = pd.Series(evaluator.evaluate(paper, query_rerank_fine_tuned))

    # Collect per-paper scores instead of concatenating onto an empty DataFrame.
    f1_frames.append(pd.DataFrame({"baseline": f1s_base, "rerank": f1s_rerank, "rerank_fine_tuned": f1s_rerank_fine_tuned}))

    del paper_index

# Concatenate once at the end; fall back to an empty frame if there were no papers.
df_f1s = pd.concat(f1_frames, ignore_index=True) if f1_frames else pd.DataFrame(columns=f1_columns)



In [15]:
df_f1s.head()

|   | baseline | rerank | rerank_fine_tuned |
|---|----------|--------|-------------------|
| 0 | 0.014868 | 0.065095 | 0.065095 |
| 1 | 0.013738 | 0.044581 | 0.044581 |
| 2 | 0.020323 | 0.065574 | 0.065574 |
| 3 | 0.017029 | 0.073781 | 0.073781 |
| 4 | 0.014601 | 0.044581 | 0.044581 |
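The preview can be summarized by averaging each column. The sketch below retypes the five rows shown above, so it describes this preview only and not the full test results. In these rows the `rerank` and `rerank_fine_tuned` columns are identical, which is consistent with the equal hit counts of 567.0; the last comparison is a quick way to check whether that also holds for the whole of `df_f1s`.

```python
import pandas as pd

# The five preview rows, copied from the output above.
preview = pd.DataFrame(
    {
        "baseline": [0.014868, 0.013738, 0.020323, 0.017029, 0.014601],
        "rerank": [0.065095, 0.044581, 0.065574, 0.073781, 0.044581],
        "rerank_fine_tuned": [0.065095, 0.044581, 0.065574, 0.073781, 0.044581],
    }
)

# Mean F1 per retrieval setup.
column_means = preview.mean()

# True when the fine-tuned reranker scored exactly like the base reranker.
identical = bool((preview["rerank"] == preview["rerank_fine_tuned"]).all())

print(column_means)
print(identical)
```

If the comparison is also true on the full frame, it may be worth confirming that the model at `../models/fine_tuned_cross_encoder` is the one produced by the fine-tuning run.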
