# Data Curation for Drosophila Agent Benchmark

This notebook documents the process of curating ground truth data from FlyBase source files for the Drosophila Agent Benchmark.

## Overview

The benchmark evaluates AI agents on three tasks:
- **Task 1: Gene Function** - Extract GO terms from literature
- **Task 2: Expression** - Extract expression data (anatomy, developmental stage)
- **Task 3: Synonyms** - Discover gene synonyms and aliases

## Raw Data Files

Source files from FlyBase (FB2025_04 release):

| File | Description | Task |
|------|-------------|------|
| `gene_association.fb.gz` | GO annotations (GAF 2.2 format) | Task 1 |
| `curated_expression_fb_2025_04.tsv.gz` | Expression data | Task 2 |
| `fb_synonym_fb_2025_04.tsv.gz` | Gene synonyms | Task 3 |
| `fbrf_pmid_pmcid_doi_fb_2025_04.tsv` | Reference ID mapping | Coverage analysis |

In [1]:
import csv
import gzip
import json
import re
from collections import defaultdict
from pathlib import Path

# Paths
BASE_DIR = Path("..").resolve()
RAW_DATA_DIR = BASE_DIR / "raw_data"
OUTPUT_DIR = BASE_DIR / "data"

# Create output directory if needed
OUTPUT_DIR.mkdir(exist_ok=True)

print(f"Base directory: {BASE_DIR}")
print(f"Raw data directory: {RAW_DATA_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
print("\nRaw data files:")
for f in sorted(RAW_DATA_DIR.glob("*")):
    if f.is_file() and not f.name.startswith("."):
        size_mb = f.stat().st_size / 1024 / 1024
        print(f"  {f.name}: {size_mb:.1f} MB")

Base directory: /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark
Raw data directory: /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark/raw_data
Output directory: /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark/data

Raw data files:
  README.md: 0.0 MB
  curated_expression_fb_2025_04.tsv.gz: 2.3 MB
  fb_synonym_fb_2025_04.tsv.gz: 11.4 MB
  fbrf_pmid_pmcid_doi_fb_2025_04.tsv: 12.0 MB
  gene_association.fb.gz: 2.3 MB
  gene_to_pmcids.json: 2.6 MB
  genes.csv: 0.8 MB


## 1. Load Benchmark Genes

Load the list of 3,446 benchmark genes with expert-curated summaries.

In [2]:
# Load benchmark genes from genes.csv
genes_file = RAW_DATA_DIR / "genes.csv"

benchmark_genes = {}  # gene_id -> gene_symbol
gene_summaries = {}  # gene_id -> summary

with open(genes_file) as f:
    reader = csv.DictReader(f)
    for row in reader:
        gene_id = row["FBgn_ID"]
        benchmark_genes[gene_id] = row["Gene_Symbol"]
        gene_summaries[gene_id] = row["Summary"]

print(f"Loaded {len(benchmark_genes)} benchmark genes")
print("\nExample genes:")
for gid, symbol in list(benchmark_genes.items())[:5]:
    print(f"  {gid}: {symbol}")

Loaded 3446 benchmark genes

Example genes:
  FBgn0041626: Or19a
  FBgn0052350: Vps11
  FBgn0040372: G9a
  FBgn0029521: Or1a
  FBgn0004034: y


## 2. Load Literature Corpus Mapping

Load the mapping from genes to PMCIDs (papers in the literature corpus).

In [3]:
# Load gene to PMCIDs mapping
with open(RAW_DATA_DIR / "gene_to_pmcids.json") as f:
    gene_to_pmcids = json.load(f)

# Get all unique PMCIDs in corpus
corpus_pmcids = set()
for pmcids_list in gene_to_pmcids.values():
    corpus_pmcids.update(pmcids_list)

print(f"Genes with corpus mapping: {len(gene_to_pmcids)}")
print(f"Unique PMCIDs in corpus: {len(corpus_pmcids)}")

# Distribution of papers per gene
papers_per_gene = [len(v) for v in gene_to_pmcids.values()]
print("\nPapers per gene:")
print(f"  Min: {min(papers_per_gene)}")
print(f"  Max: {max(papers_per_gene)}")
print(f"  Mean: {sum(papers_per_gene) / len(papers_per_gene):.1f}")

Genes with corpus mapping: 3446
Unique PMCIDs in corpus: 16894

Papers per gene:
  Min: 10
  Max: 627
  Mean: 42.2


## 3. Load Reference ID Mapping

Load the mapping between FBrf (FlyBase reference), PMID, and PMCID.
This allows us to check which ground truth annotations have their source papers available in the corpus.

In [4]:
# Load FBrf -> PMID -> PMCID mapping
ref_mapping_file = RAW_DATA_DIR / "fbrf_pmid_pmcid_doi_fb_2025_04.tsv"

fbrf_to_pmcid = {}
pmid_to_pmcid = {}

with open(ref_mapping_file) as f:
    header = f.readline()
    for line in f:
        parts = line.strip().split("\t")
        if len(parts) >= 3:
            fbrf, pmid, pmcid = parts[0], parts[1], parts[2]
            if pmcid:  # Only if PMCID exists
                fbrf_to_pmcid[fbrf] = pmcid
                if pmid:
                    pmid_to_pmcid[pmid] = pmcid

print(f"FBrf entries with PMCID: {len(fbrf_to_pmcid)}")
print(f"PMID entries with PMCID: {len(pmid_to_pmcid)}")

FBrf entries with PMCID: 47828
PMID entries with PMCID: 47828


## 4. Parse GO Annotations (Task 1)

Parse Gene Ontology annotations from the GAF 2.2 file.
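
The qualifier handling in the cell below can be sketched on its own. This is a minimal illustration, assuming the GAF 2.2 convention that a negated annotation prefixes the relation with `NOT|`; `split_qualifier` is a hypothetical helper name that does not appear elsewhere in the notebook, and the input strings are synthetic.

```python
def split_qualifier(qualifier_raw):
    """Split a GAF 2.2 qualifier string into (relation, is_negated)."""
    parts = qualifier_raw.split("|")
    is_negated = "NOT" in parts
    # Keep everything except the NOT flag as the relation
    relation = "|".join(p for p in parts if p != "NOT")
    return relation, is_negated


# Synthetic qualifier values, for illustration only
for raw in ["involved_in", "NOT|enables", "NOT"]:
    print(raw, "->", split_qualifier(raw))
```

Keeping the negation as a separate boolean means a `NOT` annotation is never mistaken for a positive one when only `go_id` and `qualifier` are compared.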

In [5]:
# Parse GO annotations from GAF 2.2 file
go_file = RAW_DATA_DIR / "gene_association.fb.gz"

go_annotations = defaultdict(list)

with gzip.open(go_file, "rt") as f:
    for line in f:
        # Skip comment lines
        if line.startswith("!"):
            continue

        parts = line.strip().split("\t")
        if len(parts) < 15:
            continue

        gene_id = parts[1]  # Column 2: DB_Object_ID (FBgn)

        # Skip if not in benchmark genes
        if gene_id not in benchmark_genes:
            continue

        # Parse qualifier and NOT flag
        qualifier_raw = parts[3]  # Column 4: Qualifier
        is_negated = False
        if qualifier_raw.startswith("NOT|"):
            is_negated = True
            qualifier = qualifier_raw[4:]
        elif qualifier_raw == "NOT":
            is_negated = True
            qualifier = ""
        else:
            qualifier = qualifier_raw

        annotation = {
            "go_id": parts[4],  # Column 5: GO ID
            "qualifier": qualifier,
            "is_negated": is_negated,
            "aspect": parts[8],  # Column 9: Aspect (P/F/C)
            "evidence_code": parts[6],  # Column 7: Evidence code
            "reference": parts[5],  # Column 6: DB:Reference
        }

        go_annotations[gene_id].append(annotation)

print(f"Parsed GO annotations for {len(go_annotations)} genes")
total_annotations = sum(len(v) for v in go_annotations.values())
print(f"Total GO annotations: {total_annotations}")

# Show example
example_gene = list(go_annotations.keys())[0]
print(f"\nExample annotations for {example_gene} ({benchmark_genes[example_gene]}):")
for ann in go_annotations[example_gene][:3]:
    print(f"  {ann}")

Parsed GO annotations for 3445 genes
Total GO annotations: 66611

Example annotations for FBgn0020238 (14-3-3epsilon):
  {'go_id': 'GO:0000077', 'qualifier': 'involved_in', 'is_negated': False, 'aspect': 'P', 'evidence_code': 'IMP', 'reference': 'PMID:10733527'}
  {'go_id': 'GO:0005737', 'qualifier': 'located_in', 'is_negated': False, 'aspect': 'C', 'evidence_code': 'IDA', 'reference': 'PMID:11682604'}
  {'go_id': 'GO:0005634', 'qualifier': 'located_in', 'is_negated': False, 'aspect': 'C', 'evidence_code': 'IDA', 'reference': 'PMID:11682604'}


## 5. Parse Expression Data (Task 2)

Parse expression data with FBbt (anatomy) and FBdv (developmental stage) ontology IDs.

In [6]:
def parse_ontology_term(term_str):
    """
    Parse ontology term from format: "term name (FBxx:XXXXXXXX)"
    Returns (term_name, ontology_id) or (None, None) if empty/invalid.
    """
    if not term_str or term_str.strip() == "":
        return None, None

    # Pattern: "term name (ID)"
    match = re.match(r"^(.+?)\s*\(([A-Za-z]+:\d+)\)$", term_str.strip())
    if match:
        return match.group(1).strip(), match.group(2)

    # No ID in parentheses, return term as-is
    return term_str.strip(), None


# Test the parser
test_cases = [
    "embryonic stage 4 (FBdv:00005306)",
    "organism (FBbt:00000001)",
    "",
    "some term without ID",
]
print("Parser test:")
for tc in test_cases:
    term, ont_id = parse_ontology_term(tc)
    print(f"  '{tc}' -> term='{term}', id='{ont_id}'")

Parser test:
  'embryonic stage 4 (FBdv:00005306)' -> term='embryonic stage 4', id='FBdv:00005306'
  'organism (FBbt:00000001)' -> term='organism', id='FBbt:00000001'
  '' -> term='None', id='None'
  'some term without ID' -> term='some term without ID', id='None'


In [7]:
# Parse expression data from TSV file
expression_file = RAW_DATA_DIR / "curated_expression_fb_2025_04.tsv.gz"

expression_data = defaultdict(list)

with gzip.open(expression_file, "rt") as f:
    header = f.readline()  # Skip header

    for line in f:
        parts = line.strip().split("\t")
        if len(parts) < 13:
            continue

        gene_id = parts[0]  # Column 1: feature_id

        # Skip if not in benchmark genes
        if gene_id not in benchmark_genes:
            continue

        # Parse ontology terms (len(parts) >= 13 is guaranteed by the check above)
        stage_start_term, stage_start_id = parse_ontology_term(parts[5])
        stage_end_term, stage_end_id = parse_ontology_term(parts[6])
        anatomy_term, anatomy_id = parse_ontology_term(parts[9])
        substructure_term, substructure_id = parse_ontology_term(parts[12])

        # Empty fields are stored as None
        record = {
            "expression_type": parts[3] or None,
            "assay": parts[4] or None,
            "stage_start_id": stage_start_id,
            "stage_start_term": stage_start_term,
            "stage_end_id": stage_end_id,
            "stage_end_term": stage_end_term,
            "anatomy_id": anatomy_id,
            "anatomy_term": anatomy_term,
            "substructure_id": substructure_id,
            "substructure_term": substructure_term,
            "reference": parts[2] or None,
        }

        expression_data[gene_id].append(record)

print(f"Parsed expression data for {len(expression_data)} genes")
total_expression = sum(len(v) for v in expression_data.values())
print(f"Total expression records: {total_expression}")

# Show example
example_gene = list(expression_data.keys())[0]
print(f"\nExample expression for {example_gene} ({benchmark_genes[example_gene]}):")
for rec in expression_data[example_gene][:2]:
    print(f"  {rec}")

Parsed expression data for 2829 genes
Total expression records: 13067

Example expression for FBgn0000166 (bcd):
  {'expression_type': 'polypeptide', 'assay': 'immunolocalization', 'stage_start_id': 'FBdv:00005306', 'stage_start_term': 'embryonic stage 4', 'stage_end_id': None, 'stage_end_term': None, 'anatomy_id': 'FBbt:00000001', 'anatomy_term': 'organism', 'substructure_id': None, 'substructure_term': None, 'reference': 'FBrf0051587'}
  {'expression_type': 'polypeptide', 'assay': 'western blot', 'stage_start_id': 'FBdv:00005306', 'stage_start_term': 'embryonic stage 4', 'stage_end_id': 'FBdv:00005318', 'stage_end_term': 'embryonic stage 6', 'anatomy_id': None, 'anatomy_term': None, 'substructure_id': None, 'substructure_term': None, 'reference': 'FBrf0050616'}


## 6. Parse Synonyms (Task 3)

Parse gene synonyms from the dedicated synonyms file.

In [8]:
# Parse synonyms from TSV file
synonym_file = RAW_DATA_DIR / "fb_synonym_fb_2025_04.tsv.gz"

synonyms = {}

with gzip.open(synonym_file, "rt") as f:
    header = f.readline()  # Skip header

    for line in f:
        parts = line.strip().split("\t")
        if len(parts) < 6:
            continue

        gene_id = parts[0]  # Column 1: primary_FBid
        organism = parts[1]  # Column 2: organism_abbreviation

        # Filter for Dmel and benchmark genes
        if organism != "Dmel" or gene_id not in benchmark_genes:
            continue

        # Parse synonyms (pipe-separated); len(parts) >= 6 is guaranteed by the check above
        fullname = parts[3] or None
        fullname_syns = [s.strip() for s in parts[4].split("|") if s.strip()]
        symbol_syns = [s.strip() for s in parts[5].split("|") if s.strip()]

        synonyms[gene_id] = {
            "current_fullname": fullname,
            "fullname_synonyms": fullname_syns,
            "symbol_synonyms": symbol_syns,
        }

print(f"Parsed synonyms for {len(synonyms)} genes")

# Show example
example_gene = list(synonyms.keys())[0]
print(f"\nExample synonyms for {example_gene} ({benchmark_genes[example_gene]}):")
print(f"  {synonyms[example_gene]}")

Parsed synonyms for 3440 genes

Example synonyms for FBgn0000014 (abd-A):
  {'current_fullname': 'abdominal A', 'fullname_synonyms': ['Abdominal A', 'Abdominal-A', 'AbdominalA', 'Contrabithoraxoid', 'Front-ultraabdominal', 'Hyperabdominal', 'Ultraabdominal', 'abdominal-A', 'abdominal-a', 'abdominalA', 'infra-abdominal 2', 'infra-abdominal 3', 'infra-abdominal 4', 'infraabdominal'], 'symbol_synonyms': ['ABD-A', 'Abd A', 'Abd-A', 'AbdA', 'Abda', 'BX-C', 'CG10325', 'Cbxd', 'DmabdA', 'Dmabda', 'Fab-3', 'Fab-4', 'Fub', 'Hab', 'Uab', 'abd A', 'abd-a', 'abdA', 'abd_a', 'abda', 'iab', 'iab-2', 'iab-3', 'iab-4', 'iab-5', 'iab2', 'iab3', 'iab4', 'l(3)89Ec']}


## 7. Analyze Reference Coverage

Check how many ground truth annotations have their source references available in the literature corpus.

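**Key insight:** Ground truth cites PMIDs (Task 1) and FBrf IDs (Task 2), while the corpus is indexed by PMCID, so each reference must be mapped to a PMCID before checking corpus membership. GO annotations whose reference is not a PMID are counted as not in the corpus, and Task 3 is checked only for the presence of synonyms.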
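
As a small worked example of that mapping, the sketch below resolves references against toy lookup tables. Every ID in it is invented for illustration, and `resolves_to_corpus` is a hypothetical name; the notebook's own `ref_in_corpus` reads from the real `pmid_to_pmcid`, `fbrf_to_pmcid`, and `corpus_pmcids` instead.

```python
# Toy lookup tables; every ID below is made up for illustration
toy_pmid_to_pmcid = {"111": "PMC1", "222": "PMC2"}
toy_fbrf_to_pmcid = {"FBrf0000001": "PMC1"}
toy_corpus = {"PMC1"}


def resolves_to_corpus(ref, lookup, corpus):
    """Return True if ref maps to a PMCID that is present in the corpus."""
    pmcid = lookup.get(ref)
    return pmcid is not None and pmcid in corpus


# A PMID whose paper is in the corpus
print(resolves_to_corpus("111", toy_pmid_to_pmcid, toy_corpus))
# A PMID that has a PMCID, but the paper is not in the corpus
print(resolves_to_corpus("222", toy_pmid_to_pmcid, toy_corpus))
# A PMID with no PMCID at all
print(resolves_to_corpus("333", toy_pmid_to_pmcid, toy_corpus))
# An FBrf ID resolved through its own lookup table
print(resolves_to_corpus("FBrf0000001", toy_fbrf_to_pmcid, toy_corpus))
```

A reference can fail at either step: it may have no PMCID, or its PMCID may fall outside the corpus. Both cases count as "not in corpus" in the coverage numbers.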

In [9]:
def ref_in_corpus(ref, ref_type="pmid"):
    """
    Check if a reference is available in the corpus.

    Args:
        ref: Reference ID (PMID or FBrf)
        ref_type: "pmid" or "fbrf"

    Returns:
        True if reference maps to a PMCID in the corpus
    """
    if ref_type == "pmid":
        pmcid = pmid_to_pmcid.get(ref)
    else:  # fbrf
        pmcid = fbrf_to_pmcid.get(ref)

    return pmcid in corpus_pmcids if pmcid else False


# Calculate per-gene coverage
gene_coverage = []

for gene_id in benchmark_genes:
    # Task 1: Count GO annotations with corpus references
    task1_total = len(go_annotations.get(gene_id, []))
    task1_in_corpus = 0
    for ann in go_annotations.get(gene_id, []):
        ref = ann["reference"]
        if ref.startswith("PMID:"):
            pmid = ref.replace("PMID:", "")
            if ref_in_corpus(pmid, "pmid"):
                task1_in_corpus += 1

    # Task 2: Count expression records with corpus references
    task2_total = len(expression_data.get(gene_id, []))
    task2_in_corpus = 0
    for rec in expression_data.get(gene_id, []):
        ref = rec["reference"]
        if ref and ref_in_corpus(ref, "fbrf"):
            task2_in_corpus += 1

    # Task 3: Check if synonyms exist
    task3_data = synonyms.get(gene_id, {})
    task3_has_data = bool(task3_data.get("fullname_synonyms") or task3_data.get("symbol_synonyms"))

    gene_coverage.append(
        {
            "gene_id": gene_id,
            "gene_symbol": benchmark_genes[gene_id],
            "task1_total": task1_total,
            "task1_in_corpus": task1_in_corpus,
            "task2_total": task2_total,
            "task2_in_corpus": task2_in_corpus,
            "task3_has_data": task3_has_data,
            "total_in_corpus": task1_in_corpus + task2_in_corpus,
        }
    )

print("Coverage analysis complete!")

Coverage analysis complete!


In [10]:
# Overall coverage statistics
total_task1 = sum(g["task1_in_corpus"] for g in gene_coverage)
total_task1_all = sum(g["task1_total"] for g in gene_coverage)
total_task2 = sum(g["task2_in_corpus"] for g in gene_coverage)
total_task2_all = sum(g["task2_total"] for g in gene_coverage)

print("=== Overall Reference Coverage ===")
print("\nTask 1 (GO annotations):")
print(f"  Total annotations: {total_task1_all}")
print(f"  With corpus reference: {total_task1} ({100 * total_task1 / total_task1_all:.1f}%)")

print("\nTask 2 (Expression):")
print(f"  Total records: {total_task2_all}")
print(f"  With corpus reference: {total_task2} ({100 * total_task2 / total_task2_all:.1f}%)")

print(f"\nTask1:Task2 corpus ratio: {total_task1 / total_task2:.1f}:1")

=== Overall Reference Coverage ===

Task 1 (GO annotations):
  Total annotations: 66611
  With corpus reference: 12536 (18.8%)

Task 2 (Expression):
  Total records: 13067
  With corpus reference: 866 (6.6%)

Task1:Task2 corpus ratio: 14.5:1


## 8. Select Top 100 Genes (Balanced)

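Select genes with good coverage for **both** tasks. Across all benchmark genes, corpus-backed Task 1 annotations outnumber Task 2 records 14.5:1, so ranking by combined total alone could favor genes with little usable expression data.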

**Selection criteria:**
1. Require at least 1 annotation in corpus for both Task 1 AND Task 2
2. Sort by `min(task1_in_corpus, task2_in_corpus)` to ensure both tasks are strong
3. Use the combined count (`total_in_corpus`) as tiebreaker
4. Select top 100
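
The effect of these criteria can be seen on a toy example. The gene names and counts below are invented, and `balanced_rank` is a hypothetical helper that mirrors the filter and sort key described above rather than the notebook's exact cell.

```python
# Invented genes and counts, for illustration only
toy_genes = [
    {"gene": "geneA", "task1_in_corpus": 50, "task2_in_corpus": 1},
    {"gene": "geneB", "task1_in_corpus": 6, "task2_in_corpus": 5},
    {"gene": "geneC", "task1_in_corpus": 5, "task2_in_corpus": 9},
    {"gene": "geneD", "task1_in_corpus": 12, "task2_in_corpus": 0},
]


def balanced_rank(genes):
    """Drop genes missing either task, then rank by (min, total) descending."""
    qualified = [
        g for g in genes if g["task1_in_corpus"] >= 1 and g["task2_in_corpus"] >= 1
    ]
    return sorted(
        qualified,
        key=lambda g: (
            min(g["task1_in_corpus"], g["task2_in_corpus"]),
            g["task1_in_corpus"] + g["task2_in_corpus"],
        ),
        reverse=True,
    )


ranked = [g["gene"] for g in balanced_rank(toy_genes)]
print(ranked)
```

Here `geneA` has the largest combined count but ranks last among qualified genes because its weaker task has only one record, while `geneD` is excluded outright for having no Task 2 coverage.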

In [11]:
# Filter genes with coverage for both tasks
qualified_genes = [
    g for g in gene_coverage if g["task1_in_corpus"] >= 1 and g["task2_in_corpus"] >= 1
]

print(f"Genes with both Task1 >= 1 AND Task2 >= 1: {len(qualified_genes)}")

# Add min_in_corpus for sorting
for g in qualified_genes:
    g["min_in_corpus"] = min(g["task1_in_corpus"], g["task2_in_corpus"])

# Sort by min(task1, task2) first, then by total as tiebreaker
qualified_genes_sorted = sorted(
    qualified_genes, key=lambda x: (x["min_in_corpus"], x["total_in_corpus"]), reverse=True
)

# Select top 100
top_100 = qualified_genes_sorted[:100]
top_100_ids = {g["gene_id"] for g in top_100}

print("\nSelected top 100 genes by min(Task1, Task2) in corpus")

Genes with both Task1 >= 1 AND Task2 >= 1: 314

Selected top 100 genes by min(Task1, Task2) in corpus


In [12]:
# Display top 100 genes
print(f"{'Rank':<5} {'Gene':<15} {'Task1':<12} {'Task2':<12} {'Min':<6} {'Total':<8}")
print("-" * 60)

for i, g in enumerate(top_100, 1):
    t1 = f"{g['task1_in_corpus']}/{g['task1_total']}"
    t2 = f"{g['task2_in_corpus']}/{g['task2_total']}"
    print(
        f"{i:<5} {g['gene_symbol']:<15} {t1:<12} {t2:<12} {g['min_in_corpus']:<6} {g['total_in_corpus']:<8}"
    )

# Summary
total_t1 = sum(g["task1_in_corpus"] for g in top_100)
total_t2 = sum(g["task2_in_corpus"] for g in top_100)
print("\n" + "=" * 60)
print(f"Total Task 1 annotations in corpus: {total_t1}")
print(f"Total Task 2 annotations in corpus: {total_t2}")
print(f"Combined total: {total_t1 + total_t2}")
print(f"Task1:Task2 ratio: {total_t1 / total_t2:.2f}:1")

Rank  Gene            Task1        Task2        Min    Total   
------------------------------------------------------------
1     Apc2            22/77        16/31        16     38      
2     aPKC            31/82        14/21        14     45      
3     sqh             27/59        14/16        14     41      
4     Ir25a           9/23         14/15        9      23      
5     Ten-m           32/40        8/20         8      40      
6     ena             23/57        8/14         8      31      
7     btz             8/17         12/13        8      20      
8     gish            11/35        8/9          8      19      
9     fru             7/47         45/74        7      52      
10    Mad             7/115        9/26         7      16      
11    p120ctn         7/19         9/10         7      16      
12    Ten-a           26/30        6/10         6      32      
13    TBPH            17/38        6/14         6      23      
14    fra             6/42         14/32   

## 9. Generate Output Files

Create the final benchmark files:
- `ground_truth_top100.jsonl` - Ground truth annotations for top 100 genes
- `genes_top100.csv` - Gene list with summaries
- `gene_to_pmcids_top100.json` - Gene to corpus mapping

**Note on coverage:** The ground truth includes *all* annotations from FlyBase, not just those with references in the corpus. This means agents cannot achieve 100% recall; the maximum achievable is ~29% for Task 1 and ~37% for Task 2. This reflects a realistic scenario where curated databases contain knowledge from sources beyond any single literature corpus.
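
One way to read the recall ceiling is as the pooled fraction of ground-truth records whose source paper is in the corpus. The sketch below assumes that definition; the notebook does not show how the ~29% and ~37% figures were derived, `recall_ceiling` is a hypothetical helper, and the counts are invented.

```python
def recall_ceiling(coverage_rows, task):
    """Fraction of a task's ground-truth records whose source paper is in the corpus."""
    in_corpus = sum(row[f"{task}_in_corpus"] for row in coverage_rows)
    total = sum(row[f"{task}_total"] for row in coverage_rows)
    return in_corpus / total if total else 0.0


# Toy rows shaped like the entries of gene_coverage; the counts are invented
toy_rows = [
    {"task1_in_corpus": 3, "task1_total": 10, "task2_in_corpus": 2, "task2_total": 4},
    {"task1_in_corpus": 1, "task1_total": 10, "task2_in_corpus": 1, "task2_total": 4},
]
print(f"Task 1 ceiling: {recall_ceiling(toy_rows, 'task1'):.1%}")
print(f"Task 2 ceiling: {recall_ceiling(toy_rows, 'task2'):.1%}")
```

Applied to `top_100`, the same function would give the pooled ceiling per task, which is the natural denominator when interpreting an agent's recall.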

In [13]:
# Generate ground_truth_top100.jsonl
output_file = OUTPUT_DIR / "ground_truth_top100.jsonl"

with open(output_file, "w") as f:
    for gene_id in sorted(top_100_ids):
        record = {
            "gene_id": gene_id,
            "gene_symbol": benchmark_genes[gene_id],
            "task1_function": go_annotations.get(gene_id, []),
            "task2_expression": expression_data.get(gene_id, []),
            "task3_synonyms": synonyms.get(
                gene_id,
                {
                    "current_fullname": None,
                    "fullname_synonyms": [],
                    "symbol_synonyms": [],
                },
            ),
        }
        f.write(json.dumps(record) + "\n")

print(f"Created {output_file}")
print(f"  Genes: {len(top_100_ids)}")

Created /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark/data/ground_truth_top100.jsonl
  Genes: 100


In [14]:
# Generate genes_top100.csv
genes_output = OUTPUT_DIR / "genes_top100.csv"

with open(genes_output, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["FBgn_ID", "Gene_Symbol", "Summary"])

    for gene_id in sorted(top_100_ids):
        writer.writerow([gene_id, benchmark_genes[gene_id], gene_summaries.get(gene_id, "")])

print(f"Created {genes_output}")

Created /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark/data/genes_top100.csv


In [15]:
# Generate gene_to_pmcids_top100.json
pmcids_output = OUTPUT_DIR / "gene_to_pmcids_top100.json"

filtered_mapping = {k: v for k, v in gene_to_pmcids.items() if k in top_100_ids}

with open(pmcids_output, "w") as f:
    json.dump(filtered_mapping, f, indent=2)

print(f"Created {pmcids_output}")
print(f"  Genes: {len(filtered_mapping)}")
print(f"  Total PMCIDs: {sum(len(v) for v in filtered_mapping.values())}")

Created /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark/data/gene_to_pmcids_top100.json
  Genes: 100
  Total PMCIDs: 9991


In [16]:
# Save coverage statistics for reference
stats_output = OUTPUT_DIR / "top100_coverage_stats.json"

with open(stats_output, "w") as f:
    json.dump(top_100, f, indent=2)

print(f"Created {stats_output}")

Created /Users/jimmy/Projects/Research/drosophila_dr_benchmark/drosophila-agent-benchmark/data/top100_coverage_stats.json


## 10. Verification

Verify the generated files are correct.
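
Beyond counting lines, a structural check can catch missing fields or duplicated genes. The following is a minimal sketch, assuming the record keys written to `ground_truth_top100.jsonl` in section 9; `check_ground_truth_lines` is a hypothetical helper and the sample records are synthetic.

```python
import json

# Keys written for every record in section 9
REQUIRED_KEYS = {
    "gene_id",
    "gene_symbol",
    "task1_function",
    "task2_expression",
    "task3_synonyms",
}


def check_ground_truth_lines(lines):
    """Return a list of problems found in JSONL ground-truth lines."""
    problems = []
    seen = set()
    for n, line in enumerate(lines, 1):
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"line {n}: missing keys {sorted(missing)}")
        gene_id = record.get("gene_id")
        if gene_id in seen:
            problems.append(f"line {n}: duplicate gene_id {gene_id}")
        seen.add(gene_id)
    return problems


# Synthetic sample: the second line lacks fields and repeats the gene ID
sample = [
    json.dumps(
        {
            "gene_id": "FBgn0000001",
            "gene_symbol": "x",
            "task1_function": [],
            "task2_expression": [],
            "task3_synonyms": {},
        }
    ),
    json.dumps({"gene_id": "FBgn0000001", "gene_symbol": "x"}),
]
print(check_ground_truth_lines(sample))
```

Passing the `lines` list read in the cell below to this function should return an empty list if every record is complete and each gene appears once.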

In [17]:
# Verify ground_truth_top100.jsonl
with open(OUTPUT_DIR / "ground_truth_top100.jsonl") as f:
    lines = f.readlines()

print(f"ground_truth_top100.jsonl: {len(lines)} genes")

# Parse and show first entry
first_record = json.loads(lines[0])
print("\nFirst record:")
print(f"  Gene: {first_record['gene_id']} ({first_record['gene_symbol']})")
print(f"  Task 1 annotations: {len(first_record['task1_function'])}")
print(f"  Task 2 records: {len(first_record['task2_expression'])}")
print(f"  Task 3 symbol synonyms: {len(first_record['task3_synonyms']['symbol_synonyms'])}")

ground_truth_top100.jsonl: 100 genes

First record:
  Gene: FBgn0000014 (abd-A)
  Task 1 annotations: 59
  Task 2 records: 67
  Task 3 symbol synonyms: 29


In [18]:
# Final summary
print("=" * 60)
print("DATA CURATION COMPLETE")
print("=" * 60)
print("\nOutput files:")
print("  - ground_truth_top100.jsonl: 100 genes with ground truth")
print("  - genes_top100.csv: Gene list with summaries")
print("  - gene_to_pmcids_top100.json: Gene to corpus mapping")
print("  - top100_coverage_stats.json: Coverage statistics")
print("\nBenchmark statistics:")
print("  - Total genes: 100")
print(f"  - Task 1 (GO) annotations in corpus: {total_t1}")
print(f"  - Task 2 (Expression) records in corpus: {total_t2}")
print(f"  - Task1:Task2 ratio: {total_t1 / total_t2:.1f}:1")
print("=" * 60)

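============================================================
DATA CURATION COMPLETE
============================================================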

Output files:
  - ground_truth_top100.jsonl: 100 genes with ground truth
  - genes_top100.csv: Gene list with summaries
  - gene_to_pmcids_top100.json: Gene to corpus mapping
  - top100_coverage_stats.json: Coverage statistics

Benchmark statistics:
  - Total genes: 100
  - Task 1 (GO) annotations in corpus: 1047
  - Task 2 (Expression) records in corpus: 492
  - Task1:Task2 ratio: 2.1:1
============================================================
