# Results Analysis for Generation

We use three evaluation metrics, each averaged across all queries in a dataset:
- **BLEU**: lexical similarity between generated and actual answers.
- **BERTScore**: semantic similarity between generated and actual answers.
- **Cosine Similarity (QA)**: cosine similarity between query and answer embeddings, produced by an embedding model pretrained on QA datasets.
    - BERTScore measures whether two sentences are semantically similar, not whether an answer addresses its question. An embedding model pretrained for question-answer alignment is therefore a better fit for this similarity.
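
As a rough illustration of the QA cosine similarity metric, the sketch below scores each query against its generated answer and averages over the dataset. It assumes the embeddings have already been computed by some QA-pretrained model. The function names `cosine_similarity` and `mean_qa_cos_sim` are hypothetical and do not come from the evaluation code behind the results in this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    return dot / (norm_u * norm_v)


def mean_qa_cos_sim(query_embeddings, answer_embeddings):
    """Average query-answer cosine similarity over one dataset.

    Assumption: query_embeddings[i] and answer_embeddings[i] belong to the
    same query, and both come from the same QA-pretrained embedding model.
    """
    scores = [
        cosine_similarity(q, a)
        for q, a in zip(query_embeddings, answer_embeddings)
    ]
    return sum(scores) / len(scores)
```

Averaging per-query scores in this way matches how all three metrics are reported here: one number per dataset and configuration.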

Alternatively, we could prompt an LLM to judge the correctness of each generated answer, but that would be costly and requires permission.

In [1]:
import json
import pickle

In [8]:
with open('results/results_generation.pkl', 'rb') as f:
    results = pickle.load(f)

In [9]:
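def log_results(results, metric):
    """Print each chunker's best score on `metric`, per dataset, in descending order."""
    for dataset, dataset_results in results.items():
        print(dataset)
        best = {}
        # Visit configurations from highest to lowest score so that `best`
        # (insertion-ordered) lists chunkers from best to worst.
        for model, model_results in sorted(dataset_results.items(), key=lambda x: x[1][metric], reverse=True):
            # The chunker name is the part of the key before the first '|'.
            chunker = model.split('|')[0]
            best[chunker] = max(best.get(chunker, 0), model_results[metric])
        for best_chunker, best_score in best.items():
            print(best_chunker, best_score)
        print()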

In [10]:
log_results(results, 'bleu')

conditionalqa
AbsoluteLangchainChunker 0.025614291872207838
SingleLinkageChunker 0.024923561057551367
PositionalChunker 0.024393325506382017
LangchainChunker 0.02137819499443067
DBSCANChunker 0.0211334618369416
BaseChunker 0.01964963641011951

cuad
PositionalChunker 0.09709458461164729
DBSCANChunker 0.09707640001750076
SingleLinkageChunker 0.09689114850435061
BaseChunker 0.09477256568052747
LangchainChunker 0.09372428288341171
AbsoluteLangchainChunker 0.06617159425354206



In [11]:
log_results(results, 'bertscore')

conditionalqa
LangchainChunker 0.4347497642040253
AbsoluteLangchainChunker 0.43426079243421556
DBSCANChunker 0.4327026492357254
PositionalChunker 0.41655766814947126
SingleLinkageChunker 0.41478910341858866
BaseChunker 0.3912048231065273

cuad
PositionalChunker 0.6153298634290695
DBSCANChunker 0.6126029688119888
SingleLinkageChunker 0.6107689866423607
LangchainChunker 0.6106900158524513
BaseChunker 0.6092399621009826
AbsoluteLangchainChunker 0.5903419572114944



In [12]:
log_results(results, 'qa_cos_sim')

conditionalqa
LangchainChunker 0.35828339397907255
PositionalChunker 0.3570630721747875
DBSCANChunker 0.3567922805249691
SingleLinkageChunker 0.35658444866538047
AbsoluteLangchainChunker 0.3563793709874153
BaseChunker 0.3546047221124172

cuad
LangchainChunker 0.8303691279888153
AbsoluteLangchainChunker 0.8287693744897843
DBSCANChunker 0.8283072090148926
SingleLinkageChunker 0.8221230947971344
PositionalChunker 0.8179567801952362
BaseChunker 0.7740970557928085

