## **Notebook Summary**

### **Purpose**
This notebook audits retrieval results from several clinical trial search models (such as BGE and SapBERT) by comparing each model's top-retrieved trials for a set of queries against gold-standard relevance annotations.

---

### **Key Steps and Logic**

#### **1. Configuration**
- The notebook sets up paths to:
  - Retrieval search result files (per model, e.g., `"bge-large-en-v1.5_hnsw_search_results.json"`).
  - Processed gold-standard files: `test.tsv`, `queries.jsonl`, `corpus.jsonl`.

#### **2. Data Loading**
- **Gold Data:** Loads truth labels (`test.tsv`), queries (`queries.jsonl`), and clinical trial info (`corpus.jsonl`) using pandas.
- **Retrieval Results:** Loads search results per model (typically nested JSON output, where each query contains a results list).
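
As a rough illustration of the file shapes these loaders expect, the sketch below reads toy in-memory stand-ins with the same pandas calls. The column and key names (`query-id`, `corpus-id`, `score`, `_id`, `query_id`, `results`, `doc_id`) are taken from the code cells in this notebook; the `text` field and every ID and value are assumptions made up for the example.

```python
import io
import pandas as pd

# Toy stand-ins for the three gold files; all IDs and text are made up.
tsv_text = "query-id\tcorpus-id\tscore\nq1\tNCT001\t2\nq1\tNCT002\t0\n"
queries_text = '{"_id": "q1", "text": "example patient description"}\n'
corpus_text = (
    '{"_id": "NCT001", "text": "Trial A"}\n'
    '{"_id": "NCT002", "text": "Trial B"}\n'
)

df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
df_queries = pd.read_json(io.StringIO(queries_text), lines=True)
df_corpus = pd.read_json(io.StringIO(corpus_text), lines=True)

# Nested search results: one record per query, each holding a list of hits.
results_text = (
    '[{"query_id": "q1", "results": ['
    '{"doc_id": "NCT001", "score": 0.83}, '
    '{"doc_id": "NCT003", "score": 0.79}]}]'
)
results_df = pd.read_json(io.StringIO(results_text))

print(df_tsv)
print(results_df.loc[0, "results"])
```

Each cell of the `results` column holds a Python list of dictionaries, which is why a separate flattening step is needed before merging.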

#### **3. Results Flattening**
- Defines a function (`flatten_results`) to transform nested search results into a flat DataFrame:
  - Columns: `query-id`, `corpus-id`, `score` (retrieval score).
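
The notebook's `flatten_results` uses an explicit loop. As an alternative sketch (not the author's method), `pd.json_normalize` can produce the same three columns from a list of nested records. The toy records below are made up; only the key names mirror those read by `flatten_results`.

```python
import pandas as pd

# Toy nested results; keys mirror those read by `flatten_results`, values are made up.
nested = [
    {"query_id": "q1", "results": [{"doc_id": "NCT001", "score": 0.83},
                                   {"doc_id": "NCT003", "score": 0.79}]},
    {"query_id": "q2", "results": [{"doc_id": "NCT002", "score": 0.91}]},
]

flat = (
    # One output row per entry in each record's `results` list,
    # with the parent `query_id` carried along as metadata.
    pd.json_normalize(nested, record_path="results", meta=["query_id"])
    .rename(columns={"query_id": "query-id", "doc_id": "corpus-id"})
    [["query-id", "corpus-id", "score"]]
)
print(flat)
```

This form takes a list of dictionaries rather than a DataFrame, so it would apply to the raw JSON loaded with `json.load` rather than to the output of `pd.read_json`.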

#### **4. Merging & Alignment**
- Defines `merge_with_gold`, which left-joins the model's flat results to the gold-standard relevance labels and then attaches query and trial metadata for further analysis. The main analysis loop does not call this helper.

#### **5. Error/Quality Analysis**
- **Missing Pairs:** `compute_missing_pairs` identifies gold-standard query-trial pairs with a non-zero label that the model did not retrieve at all (i.e., a relevant trial absent from the predictions).
- **Extra Pairs:** `compute_extra_pairs` identifies predicted pairs that have no entry in the gold set (unjudged pairs, and so only possible false positives).
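
Both checks rely on a left merge with `indicator=True`, which tags each row with where its key was found. A minimal sketch on made-up data (the IDs and scores are assumptions for illustration):

```python
import pandas as pd

# Toy gold judgments and toy predictions for a single query.
gold = pd.DataFrame({
    "query-id": ["q1", "q1", "q1"],
    "corpus-id": ["NCT001", "NCT002", "NCT004"],
    "score": [2, 0, 1],
})
pred = pd.DataFrame({
    "query-id": ["q1", "q1"],
    "corpus-id": ["NCT001", "NCT003"],
    "score": [0.83, 0.79],
})
keys = ["query-id", "corpus-id"]

# Gold pairs with no match in the predictions, excluding label-0 rows.
missing = gold.merge(pred[keys], on=keys, how="left", indicator=True)
missing = missing[(missing["_merge"] == "left_only") & (missing["score"] != 0)]

# Predicted pairs with no gold judgment at all.
extra = pred.merge(gold[keys], on=keys, how="left", indicator=True)
extra = extra[extra["_merge"] == "left_only"]

print(missing[keys + ["score"]])
print(extra[keys])
```

Here `NCT004` is the only missing relevant trial (`NCT002` is also unretrieved but carries label 0), and `NCT003` is the only extra prediction.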

#### **6. Per-Query Statistics**
- For every query:
  - Counts for "score 1" and "score 2" pairs in the gold set (typically 'relevant', 'definitely relevant').
  - Fraction/number of such pairs recovered by the model.
  - Computes the percentage of missing relevant trials (by score, per query).
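
The per-query bookkeeping can be sketched on a toy gold set as follows. This version groups by query and label in long format, which is a simplification of the notebook's wide per-query table; all data is made up.

```python
import pandas as pd

# Toy gold judgments for two queries.
gold = pd.DataFrame({
    "query-id": ["q1", "q1", "q1", "q2", "q2"],
    "corpus-id": ["NCT001", "NCT002", "NCT003", "NCT001", "NCT004"],
    "score": [2, 1, 1, 2, 0],
})
# Toy set of (query, trial) pairs returned by a model.
retrieved = {("q1", "NCT001"), ("q1", "NCT002"), ("q2", "NCT004")}

gold["found"] = [
    pair in retrieved for pair in zip(gold["query-id"], gold["corpus-id"])
]

# Keep only relevant labels, then count totals and hits per (query, label).
relevant = gold[gold["score"].isin([1, 2])]
stats = relevant.groupby(["query-id", "score"])["found"].agg(
    total="count", found="sum"
)
stats["missing"] = stats["total"] - stats["found"]
stats["percent_missing"] = (100 * stats["missing"] / stats["total"]).round(1)
print(stats)
```

Unlike the notebook's table, this long format simply omits a (query, label) combination with no gold pairs (here, `q2` at label 1) instead of reporting it as 0% missing.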

#### **7. Overall Summary**
- Totals the above statistics across all queries:
  - Total number of "score 1"/"score 2" pairs in the gold set.
  - Total number missed by the model.
  - Percentage missed, for reporting model coverage/gaps.
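
Because the summary sums counts across queries before dividing, it is a pooled figure and can differ from the mean of the per-query percentages. A small sketch with made-up counts shows the gap:

```python
import pandas as pd

# Toy per-query counts: q1 has many relevant trials, q2 has few.
per_query = pd.DataFrame(
    {"total_score_2": [10, 2], "missing_score_2": [1, 2]},
    index=pd.Index(["q1", "q2"], name="query-id"),
)

# Pooled figure: sum the counts first, then divide once.
pooled = 100 * per_query["missing_score_2"].sum() / per_query["total_score_2"].sum()

# Per-query mean: divide within each query, then average the percentages.
macro = (100 * per_query["missing_score_2"] / per_query["total_score_2"]).mean()

print(f"pooled: {pooled:.1f}%  per-query mean: {macro:.1f}%")
```

With these counts the pooled figure is 25.0% while the per-query mean is 55.0%, so queries with many relevant trials dominate the pooled number.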

#### **8. Main Analysis Loop**
- For each specified model:
  - Runs the above analysis.
  - Computes per-query recovery stats (the `display(per_query)` call is commented out, so the table is not shown by default).
  - Prints a summary row.
- Final output is a summary table, with one row per model.

---

### **Outputs**
- **Per-Query Table:** For each query, presents coverage of relevant trials by the model.
- **Summary Table:** Easy comparison across models: how many gold-standard relevant pairs each misses, by relevance level.

---

### **Intended Use**
- **Sanity check / QA:** For dataset and retrieval result integration.
- **Model evaluation:** Does not compute full IR metrics (e.g., MAP/MRR); focuses instead on recall/gap analysis for relevant pairs in the gold set.
- **Model comparison:** Facilitates head-to-head comparison of clinical retrieval models’ ability to surface relevant trials.

---

### **Technical Notes**
- **Modularity:** Add or remove models by editing the entries of the `RESULT_FILES` dictionary.
- **Not a full evaluation script:** Does not compute MAP/MRR; a relevant pair counts as recovered if it appears anywhere in a model's result list, regardless of rank.
- **Extendable:** Additional analysis and plotting can be built on top of the output tables.
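
As one possible extension, the sketch below turns the label-2 counts into a recall-style column and ranks the models. The counts are copied from the run output reported in this notebook; the `Recall Score 2 (%)` column name is a hypothetical addition, not something the notebook produces.

```python
import pandas as pd

# Label-2 counts copied from the run output reported in this notebook.
summary = pd.DataFrame({
    "Model": ["SapBERT", "bge-large-en-v1.5", "e5-large-v2",
              "Bio_ClinicalBERT", "bluebert"],
    "Total Score 2": [421] * 5,
    "Missing Score 2": [133, 61, 77, 315, 374],
})

# Hypothetical derived column: share of label-2 gold pairs that were retrieved.
summary["Recall Score 2 (%)"] = (
    100 * (1 - summary["Missing Score 2"] / summary["Total Score 2"])
).round(1)

ranked = summary.sort_values(
    "Recall Score 2 (%)", ascending=False
).reset_index(drop=True)
print(ranked.to_string(index=False))
```

This is simply 100 minus the notebook's "Percent Missing 2", so it carries no new information; it only presents the same gap as coverage.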

In [1]:
import os
import json
import pandas as pd
import numpy as np

In [2]:
ls ../data/sigir2016/results/

bge-large-en-v1.5_hnsw_search_results.json
Bio_ClinicalBERT_hnsw_search_results.json
bluebert_hnsw_search_results.json
e5-large-v2_hnsw_search_results.json
SapBERT_hnsw_search_results.json


In [3]:
# --- Config ---
# RESULT_FILES = {
#     "SapBERT": "../data/sigir2016/results/SapBERT_flat_search_results.json",
#     "bge-large-en-v1.5": "../data/sigir2016/results/bge-large-en-v1.5_flat_search_results.json",
#     "e5-large-v2": "../data/sigir2016/results/e5-large-v2_flat_search_results.json",
#     "Bio_ClinicalBERT": "../data/sigir2016/results/Bio_ClinicalBERT_flat_search_results.json",
#     "bluebert": "../data/sigir2016/results/bluebert_flat_search_results.json",
# }

RESULT_FILES = {
    "SapBERT": "../data/sigir2016/results/SapBERT_hnsw_search_results.json",
    "bge-large-en-v1.5": "../data/sigir2016/results/bge-large-en-v1.5_hnsw_search_results.json",
    "e5-large-v2": "../data/sigir2016/results/e5-large-v2_hnsw_search_results.json",
    "Bio_ClinicalBERT": "../data/sigir2016/results/Bio_ClinicalBERT_hnsw_search_results.json",
    "bluebert": "../data/sigir2016/results/bluebert_hnsw_search_results.json",
}

# RESULT_FILES = {"bge-large-en-v1.5": "../data/sigir2016/results/bge-large-en-v1.5_hnsw_search_results.json"}
DATA_DIR = "../data/sigir2016/processed_cut"
TSV_FILE = os.path.join(DATA_DIR, "test.tsv")
QUERIES_FILE = os.path.join(DATA_DIR, "queries.jsonl")
CORPUS_FILE = os.path.join(DATA_DIR, "corpus.jsonl")

In [4]:
# --- Data Loading ---
def load_gold_data(tsv_file, queries_file, corpus_file):
    df_tsv = pd.read_csv(tsv_file, sep='\t')
    df_queries = pd.read_json(queries_file, lines=True)
    df_corpus = pd.read_json(corpus_file, lines=True)
    print(f" df_tsv {len(df_tsv)} df_queries {len(df_queries)} df_corpus {len(df_corpus)}")
    return df_tsv, df_queries, df_corpus

def load_search_results(results_file):
    return pd.read_json(results_file)

In [5]:
# --- Processing ---
def flatten_results(results_df):
    """Flatten the nested results into a long DataFrame, one row per retrieved (query, trial) pair."""
    rows = []
    for _, row in results_df.iterrows():
        # Each row holds one query and its list of retrieved documents.
        for result in row['results']:
            rows.append({
                'query-id': row['query_id'],
                'corpus-id': result['doc_id'],
                'score': result['score']
            })
    print(f" results_df {len(rows)}")
    return pd.DataFrame(rows)

In [6]:
def merge_with_gold(df_results_long, df_tsv, df_queries, df_corpus):
    eval_df = df_results_long.merge(
        df_tsv,
        how='left',
        on=['query-id', 'corpus-id'],
        suffixes=('_pred', '_true')
    )
    eval_df = eval_df.merge(df_queries.rename(columns={'_id':'query-id'}), on='query-id', how='left')
    eval_df = eval_df.merge(df_corpus.rename(columns={'_id':'corpus-id'}), on='corpus-id', how='left')
    return eval_df

In [7]:
def compute_missing_pairs(df_tsv, df_results_long):
    merged = df_tsv.merge(
        df_results_long[['query-id', 'corpus-id']],
        on=['query-id', 'corpus-id'],
        how='left',
        indicator=True
    )
    missing_gold = merged[merged['_merge'] == 'left_only']
    missing_gold = missing_gold[missing_gold['score'] != 0].reset_index(drop=True)
    return missing_gold

In [8]:
def compute_extra_pairs(df_tsv, df_results_long):
    extra_preds = df_results_long.merge(
        df_tsv[['query-id', 'corpus-id']],
        on=['query-id', 'corpus-id'],
        how='left',
        indicator=True
    )
    extras = extra_preds[extra_preds['_merge'] == 'left_only']
    return extras

In [9]:
def per_query_stats(df_tsv, df_results_long):
    """Per query: count gold pairs labeled 1 and 2, and how many of each the model retrieved."""
    df_tsv = df_tsv.copy()
    df_tsv['label_1'] = (df_tsv['score'] == 1).astype(int)
    df_tsv['label_2'] = (df_tsv['score'] == 2).astype(int)
    # Set of retrieved (query, trial) pairs for fast membership tests.
    pred_pairs = set(zip(df_results_long['query-id'], df_results_long['corpus-id']))
    df_tsv['found'] = [pair in pred_pairs for pair in zip(df_tsv['query-id'], df_tsv['corpus-id'])]
    # A gold pair is recovered at a given label only if it carries that label and was retrieved.
    df_tsv['found_1'] = (df_tsv['found'] & (df_tsv['label_1'] == 1)).astype(int)
    df_tsv['found_2'] = (df_tsv['found'] & (df_tsv['label_2'] == 1)).astype(int)
    per_query = (
        df_tsv
        .groupby('query-id', as_index=True)
        .agg(
            total_score_1=('label_1', 'sum'),
            total_score_2=('label_2', 'sum'),
            found_score_1=('found_1', 'sum'),
            found_score_2=('found_2', 'sum')
        )
    )
    per_query['missing_score_1'] = per_query['total_score_1'] - per_query['found_score_1']
    per_query['missing_score_2'] = per_query['total_score_2'] - per_query['found_score_2']
    # Queries with no gold pairs at a label are reported as 0% missing rather than NaN.
    per_query['percent_missing_1'] = np.where(
        per_query['total_score_1'] == 0, 0.0,
        100 * per_query['missing_score_1'] / per_query['total_score_1']
    ).round(1)
    per_query['percent_missing_2'] = np.where(
        per_query['total_score_2'] == 0, 0.0,
        100 * per_query['missing_score_2'] / per_query['total_score_2']
    ).round(1)
    return per_query

In [10]:
def overall_stats(per_query):
    total_1 = int(per_query['total_score_1'].sum())
    total_2 = int(per_query['total_score_2'].sum())
    missing_1 = int(per_query['missing_score_1'].sum())
    missing_2 = int(per_query['missing_score_2'].sum())
    percent_missing_1 = (missing_1 / total_1 * 100) if total_1 > 0 else 0
    percent_missing_2 = (missing_2 / total_2 * 100) if total_2 > 0 else 0
    return {
        "Total Score 1": total_1,
        "Total Score 2": total_2,
        "Missing Score 1": missing_1,
        "Missing Score 2": missing_2,
        "Percent Missing 1": percent_missing_1,
        "Percent Missing 2": percent_missing_2,
    }

In [11]:
# --- Main Loop ---
def analyze_all_models(result_files, tsv_file, queries_file, corpus_file):
    df_tsv, df_queries, df_corpus = load_gold_data(tsv_file, queries_file, corpus_file)
    summary = []
    for model_name, results_file in result_files.items():
        print(f"\n--- {model_name} ---")
        results_df = load_search_results(results_file)
        df_results_long = flatten_results(results_df)
        per_query = per_query_stats(df_tsv, df_results_long)
        # Display or export per_query
        # display(per_query)
        stats = overall_stats(per_query)
        print(f"Total Score 1: {stats['Total Score 1']}")
        print(f"Total Score 2: {stats['Total Score 2']}")
        print(f"Missing Score 1: {stats['Missing Score 1']} ({stats['Percent Missing 1']:.1f}%)")
        print(f"Missing Score 2: {stats['Missing Score 2']} ({stats['Percent Missing 2']:.1f}%)")
        summary.append({
            "Model": model_name,
            **stats
        })
    return pd.DataFrame(summary)

In [12]:
# Set pandas display options 
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 600)

In [13]:
# --- Run ---
# if __name__ == "__main__":
summary_df = analyze_all_models(RESULT_FILES, TSV_FILE, QUERIES_FILE, CORPUS_FILE)
print("\n=== Summary Table ===")
print(summary_df.to_string(index=False))

 df_tsv 3870 df_queries 59 df_corpus 3626

--- SapBERT ---
 results_df 7552
Total Score 1: 685
Total Score 2: 421
Missing Score 1: 259 (37.8%)
Missing Score 2: 133 (31.6%)

--- bge-large-en-v1.5 ---
 results_df 7552
Total Score 1: 685
Total Score 2: 421
Missing Score 1: 152 (22.2%)
Missing Score 2: 61 (14.5%)

--- e5-large-v2 ---
 results_df 7552
Total Score 1: 685
Total Score 2: 421
Missing Score 1: 162 (23.6%)
Missing Score 2: 77 (18.3%)

--- Bio_ClinicalBERT ---
 results_df 7552
Total Score 1: 685
Total Score 2: 421
Missing Score 1: 476 (69.5%)
Missing Score 2: 315 (74.8%)

--- bluebert ---
 results_df 7552
Total Score 1: 685
Total Score 2: 421
Missing Score 1: 619 (90.4%)
Missing Score 2: 374 (88.8%)

=== Summary Table ===
            Model  Total Score 1  Total Score 2  Missing Score 1  Missing Score 2  Percent Missing 1  Percent Missing 2
          SapBERT            685            421              259              133          37.810219          31.591449
bge-large-en-v1.5      