### **Enhancements based on the baseline: Dataset Mention Extraction 📄🔍**

**Inspiration for using Regex & Context Chunking:**

Inspired by the need to extract dataset accessions and DOIs with high precision, I combined regex with smart context slicing and domain-specific heuristics.

**What I changed:**

1. **Regex-Based Identifier Extraction**

   * Added robust patterns to detect **DOIs**, **GSE/SRA**, **CHEMBL**, **UniProt**, and other dataset-related IDs.

2. **Heuristic Keyword Filtering**

   * Matched surrounding text against known **dataset-related phrases** (e.g., “data available at”, “repository”) to filter meaningful mentions.

3. **Smart Contextual Chunking**

   * Implemented a `TextChunker` that aligns context by sentence boundaries, ensuring that extracted snippets are informative and self-contained.

4. **Dataset DOI Classification**

   * Checked matched DOIs against a curated list of known **dataset DOI prefixes** to validate dataset relevance.

5. **Parallel PDF Processing**

   * Boosted performance with **ThreadPoolExecutor**, allowing multiple PDFs to be parsed concurrently.

6. **Model Testing: Non-Reasoning vs Reasoning**

   * This notebook includes evaluation with **Qwen 2.5** for non-reasoning classification and **Qwen 3** for reasoning-intensive classification — allowing comparison and ablation between the two modes.

7. **Detailed False Negative (FN) Analysis**

   * Added in-depth analysis to categorize and quantify **False Negatives** (FN), separated into:

     * **Wrongly classified**: Model predicted something, but not exactly correct.
     * **Completely missed**: No prediction was made for a ground-truth item.
   * Each group is further broken down into:

     * **DOI**-based errors (e.g., wrong prefix or mismatched)
     * **Accession ID** errors (e.g., GSE, PRJNA, etc.)
   * This helps reveal weaknesses such as:

     * Ambiguous contexts
     * Incomplete extraction logic
     * Confusions between similar dataset identifiers

**Next Goal:**

1. **Improve Chunk and Reduce Junk Chunk**
2. **What to Improve Regex**
3. **Added Prompt Caching**
4. **Reduce Runtime for run**
5. **More F1-Score Reduce FN**

**I hope this notebook to goal gold medal notebook**


In [1]:
!pip install /kaggle/input/mdcfitz/pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl
!pip install vllm --no-index --find-links file:///kaggle/input/mdcllm
!pip install logits-processor-zoo==0.1.10 --no-index --find-links file:///kaggle/input/mdcllm
!pip install triton==3.2.0 --no-index --find-links file:///kaggle/input/mdcllm

Processing /kaggle/input/mdcfitz/pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.3
Looking in links: file:///kaggle/input/mdcllm
Processing /kaggle/input/mdcllm/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
Processing /kaggle/input/mdcllm/blake3-1.0.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (from vllm)
Processing /kaggle/input/mdcllm/openai-1.90.0-py3-none-any.whl (from vllm)
Processing /kaggle/input/mdcllm/prometheus_fastapi_instrumentator-7.1.0-py3-none-any.whl (from vllm)
Processing /kaggle/input/mdcllm/lm_format_enforcer-0.10.11-py3-none-any.whl (from vllm)
Processing /kaggle/input/mdcllm/llguidance-0.7.30-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (from vllm)
Processing /kaggle/input/mdcllm/outlines-0.1.11-py3-none-any.whl (from vllm)
Processing /kaggle/input/mdcllm/lark-1.2.2-py3-none-any.whl (from vllm)
Processing /kaggle/input/mdcllm/xgrammar-0.1.19-cp311-cp311-manylinux_

In [2]:
import os

os.environ["VLLM_USE_V1"] = "0"

import re
import pymupdf
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm
from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor
import pickle
import vllm
import torch

In [3]:
os.environ["KAGGLE_IS_COMPETITION_RERUN"] = "1"

In [4]:
# vLLM V1 does not accept logits processor, so disable it
# https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#deprecated-features
pdf_directory = "/kaggle/input/make-data-count-finding-data-references/test/PDF" \
                if os.getenv('KAGGLE_IS_COMPETITION_RERUN') \
                else "/kaggle/input/make-data-count-finding-data-references/train/PDF"
chunks = []
chunks2 = []
text_span_len = 300

re_doi = re.compile(r"10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)
re_gsr = re.compile(r"GSE\d+|SR[APRX]\d+|PRJ[NAED][A-Z]?\d+|E-[A-Z]+-\d+", re.IGNORECASE)
re_ipe = re.compile(r"IPR\d{6}|PF\d{5}|EMPIAR-\d{5}|EMD-\d{4,5}", re.IGNORECASE)
re_c = re.compile(r"CHEMBL\d+|CVCL_[A-Z0-9]{4}|CID:\d+", re.IGNORECASE)
re_e = re.compile(r"ENS[A-Z]{0,6}[GT]\d{11}|ENSG\d{11}", re.IGNORECASE)
re_r = re.compile(r"N[MC]_\d+(?:\.\d+)?|rs\d+|XM_\d+|XP_\d+", re.IGNORECASE)
re_u = re.compile(r"(?:uniprot:)?(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9])", re.IGNORECASE)
re_g = re.compile(r"EPI(?:_ISL_)?\d+|GISAID", re.IGNORECASE)
re_p = re.compile(r"PXD\d{6}|SAM[ND]\d+|ERR\d+|DRR\d+|MSV\d+", re.IGNORECASE)
re_pdb = re.compile(r"\b[0-9][A-Z0-9]{3}\b", re.IGNORECASE)
re_geo = re.compile(r"GDS\d+|GPL\d+|GSM\d+", re.IGNORECASE)
re_arrayexpress = re.compile(r"E-[A-Z]+-\d+", re.IGNORECASE)

relist = [re_gsr, re_ipe, re_c, re_e, re_r, re_g, re_p, re_geo, re_arrayexpress]
ids = []

def remove_references_section(text):
    lines = text.split('\n')
    cut_index = -1
    
    # Look backwards from end of document
    for i in range(len(lines) - 1, max(0, int(len(lines) * 0.2)), -1):
        line = lines[i].strip()
        obvious_patterns = [
            r'^REFERENCES?$',
            r'^\d+\.?\s+REFERENCES?$',
            r'^\d+\.?\s+References?$',
            r'^References?:?$',
            r'^BIBLIOGRAPHY$',
            r'^\d+\.?\s+BIBLIOGRAPHY$',
            r'^\d+\.?\s+Bibliography$',
            r'^Bibliography:?$',
            r'^Literature\s+Cited$',
            r'^Works\s+Cited$',
            r'^ACKNOWLEDGMENTS?$',
            r'^Acknowledgments?$',
            r'^FUNDING$',
            r'^CONFLICTS?\s+OF\s+INTEREST$'
        ]
        if any(re.match(pattern, line, re.IGNORECASE) for pattern in obvious_patterns):
            # Double-check: look at following lines for citation patterns
            following_lines = lines[i+1:i+5]
            has_citations = False
            for follow_line in following_lines:
                if follow_line.strip():
                    # Check for obvious citation patterns
                    if (re.search(r'\(\d{4}\)', follow_line) or
                        re.search(r'\d{4}\.', follow_line) or
                        'doi:' in follow_line.lower() or
                        ' et al' in follow_line.lower() or
                        re.search(r'^\[\d+\]', follow_line.strip()) or
                        re.search(r'^\d+\.', follow_line.strip())):
                        has_citations = True
                        break
            # Only cut if we found citation-like content
            if has_citations or i >= len(lines) - 5:  # Or very near end
                cut_index = i
                break
    if cut_index != -1:
        return '\n'.join(lines[:cut_index]).strip()
    return text.strip()

def extract_context_with_keywords(text, match_start, match_end, span_len=300):
    keyword_scores = {
        "data are available": 5, "datasets are available": 5, "deposited in": 5, 
        "submitted to": 5, "accession number": 5, "accession code": 5, 
        "accession id": 5, "archived in": 4, "uploaded to": 4, "source code": 4, 
        "raw data": 4, "sequencing data": 4, "retrieved from": 3, "downloaded from": 3, 
        "obtained from": 3, "supplementary data": 3, "supporting information": 3, 
        'deposited': 3, 'submitted': 3, 'accession': 3, "available in the": 2, 
        "publicly available": 2, "freely available": 2, "supplementary material": 2, 
        'dataset': 2, 'datasets': 2, 'database': 2, 'repository': 2, 'code': 2, 
        'scripts': 2, 'available': 1, 'download': 1, 'supplementary': 1, 
        'supporting': 1, 'software': 1, 'protocol': 1, 'data': 0.5
    }
    
    contexts = {
        'standard': text[max(0, match_start - span_len):min(len(text), match_end + span_len)],
        'extended': text[max(0, match_start - span_len * 2):min(len(text), match_end + span_len * 2)]
    }
    
    def score_context(context):
        return sum(
            context.lower().count(k) * v if ' ' in k 
            else len(re.findall(r'\b' + re.escape(k) + r'\b', context.lower())) * v
            for k, v in keyword_scores.items()
        )
    
    scores = {k: score_context(v) for k, v in contexts.items()}
    return contexts['extended'] if scores['extended'] > scores['standard'] and scores['extended'] > 4 else contexts['standard']

rows = []
for filename in tqdm(os.listdir(pdf_directory), total=len(os.listdir(pdf_directory))):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join(pdf_directory, filename)
        article_id = filename.split(".pdf")[0]
        try:
            with pymupdf.open(pdf_path) as doc:
                text = "\n".join(page.get_text() for page in doc)
        except Exception as e:
            print(f"Could not process {filename}: {e}")
            continue

        text = remove_references_section(text)
        rows.append({"article_id": article_id, "text": text})
        doi_matches = list(re_doi.finditer(text))
        for match in doi_matches:
            # Exclude the article's own DOI if it's mentioned
            if match.group() in article_id:
                continue
            chunk = extract_context_with_keywords(text, match.start(), match.end(), text_span_len)
            chunks.append((article_id, chunk))
            
        for rr in relist:
            matches = list(rr.finditer(text))
            for match in matches:
                ids.append(match.group())
                chunk = extract_context_with_keywords(text, match.start(), match.end(), text_span_len)
                chunks2.append((article_id, chunk))

print(f"DOI chunks: {len(chunks)}")
print(f"Other ID chunks: {len(chunks2)}")

  0%|          | 0/30 [00:00<?, ?it/s]

DOI chunks: 293
Other ID chunks: 1


from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 🧠 Load model and tokenizer
model_path = "/kaggle/input/makedatacount-mixed-train/saved_model_dual_text"
token_path = model_path

tokenizer = AutoTokenizer.from_pretrained(token_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 🔮 Predict type
label_map = {0: "Primary", 1: "Secondary", 2: "Missing"}
batch_size = 8
preds = []

for i in tqdm(range(0, len(rows), batch_size)):
    batch_texts = [r["text"] for r in rows[i:i+batch_size]]
    enc = tokenizer(batch_texts, truncation=True, padding=True, max_length=512, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**enc).logits
        p = torch.argmax(logits, dim=1).cpu().tolist()
        preds.extend(p)

In [5]:
def extract_dataset_ids(text):
    """
    Extract DOIs and known accession IDs robustly.
    Returns a sorted list of unique dataset IDs.
    """
    #repos = ['dryad','zenodo','figshare','pangaea','tcia','p9','d9','pasta','cranfield','dtu','usn','f7','jb','xyb','dl']
    repos = ['dryad', 'zenodo', 'figshare', 'pangaea', 'tcia']
    candidates = set()

    # DOI pattern (strict)
    doi_pattern = r'10\.\d{4,9}/[^\s\)\]<]+'
    for match in re.findall(doi_pattern, text):
        clean = match.rstrip('.,;)]>').strip()
        # Keep only known repositories
        if any(repo in clean.lower() for repo in repos):
            candidates.add(f"https://doi.org/{clean}")

    for pat in relist:
        candidates.update(re.findall(pat, text))

    # Filter: drop very short garbage
    candidates = [c for c in candidates if len(c) >= 5]

    return None # sorted(candidates)

# 🏷️ Extract dataset IDs and build results
results = []
for i, r in enumerate(rows):
    t = r["text"]
    dataset_ids = extract_dataset_ids(t)
    if not dataset_ids:
        dataset_ids = ["Missing"]  # or optionally: ["Missing"]
    
    for did in dataset_ids:
        results.append({
            "article_id": r["article_id"],
            "dataset_id": did,
            "type": label_map[preds[i]]
        })

In [6]:
# df_preds = pd.DataFrame(results)

## Load LLM

In [7]:
think_mode = True

In [8]:
if think_mode:
    model_path = "/kaggle/input/qwen-3/transformers/8b-awq/1"
    llm = vllm.LLM(
        model_path,
        quantization='awq',
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.92,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=4096,
        disable_log_stats=True,
        enable_prefix_caching=True
    )
else:
    model_path = "/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1"
    llm = vllm.LLM(
        model_path,
        quantization='awq',
        tensor_parallel_size=torch.cuda.device_count(),
        gpu_memory_utilization=0.92,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=1024+512,
        disable_log_stats=True,
        enable_prefix_caching=True
    )
tokenizer = llm.get_tokenizer()

2025-07-17 05:12:56.054660: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752729176.248174      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752729176.305336      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


INFO 07-17 05:13:06 [__init__.py:244] Automatically detected platform cuda.
INFO 07-17 05:13:23 [config.py:841] This model supports multiple tasks: {'reward', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 07-17 05:13:23 [config.py:1472] Using max model len 4096
INFO 07-17 05:13:24 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/kaggle/input/qwen-3/transformers/8b-awq/1', speculative_config=None, tokenizer='/kaggle/input/qwen-3/transformers/8b-awq/1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, d

[W717 05:13:36.091873774 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W717 05:13:36.644822137 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W717 05:13:46.102506508 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3


[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:13:56 [__init__.py:1152] Found nccl from library libnccl.so.2
INFO 07-17 05:13:56 [__init__.py:1152] Found nccl from library libnccl.so.2
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:13:56 [pynccl.py:70] vLLM is using nccl==2.26.2
INFO 07-17 05:13:56 [pynccl.py:70] vLLM is using nccl==2.26.2


[W717 05:13:56.110591574 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 07-17 05:13:56 [custom_all_reduce_utils.py:208] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 07-17 05:14:19 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:14:19 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 07-17 05:14:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_30d80bdc'), local_subscribe_addr='ipc:///tmp/4fdb97c1-4287-4e85-b026-45e1ab30e3e8', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-17 05:14:19 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:14:19 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP ra

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 07-17 05:15:03 [default_loader.py:272] Loading weights took 43.91 seconds
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:15:04 [default_loader.py:272] Loading weights took 44.12 seconds
INFO 07-17 05:15:04 [model_runner.py:1203] Model loading took 2.8510 GiB and 44.161543 seconds
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:15:04 [model_runner.py:1203] Model loading took 2.8510 GiB and 44.368781 seconds
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:15:11 [worker.py:294] Memory profiling takes 6.17 seconds
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:15:11 [worker.py:294] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.92) = 13.56GiB
[1;36m(VllmWorkerProcess pid=121)[0;0m INFO 07-17 05:15:11 [worker.py:294] model weights take 2.85GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.32GiB; the rest of the memory reserved for KV Cache is 10.27GiB.
INFO 07-17 05:15:11 [worker.py:294

# System prompts

In [9]:
SYS_PROMPT_DOI = """
You are an expert at identifying RESEARCH DATA citations in academic papers.
Your task is to determine if a DOI in the provided text specifically refers to a dataset, software, or data repository, NOT another academic paper.

**Crucial Rules:**
1.  **LOOK FOR DATA CONTEXT:** The DOI must be near keywords like "data available", "deposited in", "repository", "accession number", "software", "code".
2.  **IGNORE BIBLIOGRAPHY:** If the DOI is clearly part of a numbered or author-year list in a "References" or "Bibliography" section, you MUST respond with "Irrelevant".
3.  **PRIORITIZE DATA DOIs:** If there are multiple DOIs, return the one most likely to be a dataset.

Only respond with either a full normalized DOI URL starting with "https://doi.org/" or the single word "Irrelevant".
Do NOT include any other text or explanation.
"""

if think_mode:
    
    SYS_PROMPT_ACCESSION = """
    You are an expert at analyzing research data usage in academic papers.
    
    Think step-by-step about the surrounding text, identifying clues such as:
    - PRIMARY data: “we deposited”, “data generated in this study”, “our data”, “submitted to”, “newly generated”
    - SECONDARY data: “downloaded from”, “obtained from”, “previously published”, “publicly available”, “existing dataset”
    - MISSING: mentioned only in references, general methodology descriptions without actual usage, or contexts unrelated to research data

    If any dataset is mentioned, extract its ID and classify its type as Primary, Secondary, or Missing. Do not leave type as Missing if the 
    dataset is clearly referenced as reused or newly generated
    
    Silently reason through the classification.
    
    Please show your choice in the answer field with only the choice letter, e.g.,  
    "answer": "C"
    """
    
    SYS_PROMPT_CLASSIFY_DOI = """
    You are an expert at analyzing research data citations in academic papers.
    
    First, reason step-by-step about whether the DOI refers to data that is:
    A) Primary – generated specifically for this study  
    B) Secondary – reused or derived from prior work  
    C) Missing – merely cited in references, not research data, or otherwise unrelated
    
    If any dataset is mentioned, extract its ID and classify its type as Primary, 
    Secondary, or Missing. Do not leave type as Missing if the dataset is clearly referenced as reused or newly generated
    
    Perform this reasoning silently.
    
    Please show your choice in the answer field with only the choice letter, e.g.,  
    "answer": "B"
    """

else:    
    SYS_PROMPT_ACCESSION = """
    You are an expert at analyzing research data usage in academic papers.
    
    Look for contextual clues:
    - For PRIMARY data: "we deposited", "data generated in this study", "our data", "submitted to", "newly generated"
    - For SECONDARY data: "downloaded from", "obtained from", "previously published", "publicly available", "existing dataset"
    - For MISSING: mentioned in references, methodology descriptions without actual usage, or unrelated contexts
    
    Respond with only one letter: A, B, or C.
    """
    
    SYS_PROMPT_CLASSIFY_DOI = """
    You are an expert at analyzing research data citations in academic papers.
    
    Classify the data as:
    A) Primary: if the data was generated specifically for this study
    B) Secondary: if the data was reused or derived from prior work  
    C) Missing: if the DOI is in references, doesn't refer to research data, or is unrelated
    
    Respond with only one letter: A, B, or C.
    """

## Ask LLM to extract DOI links

In [10]:
prompts = []
for article_id, academic_text in chunks:
    messages = [
        {"role": "system", "content": SYS_PROMPT_DOI},
        {"role": "user", "content": academic_text}
    ]

    if think_mode:

        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,enable_thinking=False
        )
    else:
         prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False
        )
    
    prompts.append(prompt)

outputs = llm.generate(
    prompts,
    vllm.SamplingParams(
        seed=0,
        skip_special_tokens=True,
        max_tokens=64,
        temperature=0
    ),
    use_tqdm=True
)

responses = [output.outputs[0].text.strip() for output in outputs]

doi_pattern = re.compile(r'(10\.\d{4,9}/[-._;()/:A-Z0-9]+)', re.I)

doi_urls = []
for response in responses:
    if response.lower() == "irrelevant":
        doi_urls.append("Irrelevant")
    else:
        match = doi_pattern.search(response)
        if match:
            doi_urls.append("https://doi.org/" + match.group(1))
        else:
            doi_urls.append("Irrelevant")


Adding requests:   0%|          | 0/293 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/293 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

In [11]:
import re

def parse_answer_with_regex(response_text: str):

    if not isinstance(response_text, str):
        return 'Missing'

    match = re.search(r'answer\b.*?([ABC])\b', response_text, re.IGNORECASE | re.DOTALL)
    if match:
        return match.group(1)

    all_choices = re.findall(r'[ABC]', response_text)
    if all_choices:
        return all_choices[-1]
        
    return 'Missing'

In [12]:
prompts = []
valid_indices = []

if think_mode:
    for i, (chunk, url) in enumerate(zip(chunks, doi_urls)):
        if url == "Irrelevant":
            continue
    
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_CLASSIFY_DOI},
            {"role": "user", "content": f"DOI: {url}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=True
        )
        prompts.append(prompt)
        valid_indices.append(i)
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.65,
            top_p=0.95,
            top_k=20,
            skip_special_tokens=True,
            max_tokens=2048+1024,
            presence_penalty=1.5
        ),
        use_tqdm=True
    )

    choice_to_type_map = {'A': 'Primary', 'B': 'Secondary', 'C': 'Missing'}

    responses = [output.outputs[0].text.strip() for output in outputs]
    
    parsed_doi_choices = [parse_answer_with_regex(resp) for resp in responses]
    final_doi_answers = [choice_to_type_map.get(choice) for choice in parsed_doi_choices]
    
    answers = ['Missing'] * len(chunks)
    for i, answer in zip(valid_indices, final_doi_answers):
        answers[i] = answer
    
    
else:
    for i, (chunk, url) in enumerate(zip(chunks, doi_urls)):
        if url == "Irrelevant":
            continue
    
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_CLASSIFY_DOI},
            {"role": "user", "content": f"DOI: {url}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,enable_thinking=False
        )
        prompts.append(prompt)
        valid_indices.append(i)
    
    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=["A", "B", "C"])
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.05, 
            skip_special_tokens=True,
            max_tokens=1,
            logits_processors=[mclp],
            logprobs=len(mclp.choices)
        ),
        use_tqdm=True
    )
    
    logprobs = []
    for lps in [output.outputs[0].logprobs[0].values() for output in outputs]:
        logprobs.append({lp.decoded_token: lp.logprob for lp in list(lps)})
    
    logit_matrix = pd.DataFrame(logprobs)[["A", "B", "C"]].values
    choices = ["Primary", "Secondary", 'Missing']
    answers = ['Missing'] * len(chunks)
    
    for i, (idx, logit_row) in enumerate(zip(valid_indices, logit_matrix)):
        max_logit = np.max(logit_row)
        max_idx = np.argmax(logit_row)
        
        if max_logit > -2.0:
            answers[idx] = choices[max_idx]

Adding requests:   0%|          | 0/69 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/69 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

In [13]:
prompts = []

if think_mode:
    for chunk, acc_id in zip(chunks2, ids):
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_ACCESSION},
            {"role": "user", "content": f"Accession ID: {acc_id}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,
            enable_thinking=True
        )
        prompts.append(prompt)
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.65,
            top_p=0.95,
            top_k=20,
            skip_special_tokens=True,
            max_tokens=2048+1024,
            presence_penalty=1.5
        ),
        use_tqdm=True
    )
    choice_to_type_map = {'A': 'Primary', 'B': 'Secondary', 'C': 'Missing'}

    responses = [output.outputs[0].text.strip() for output in outputs]
    
    parsed_doi_choices = [parse_answer_with_regex(resp) for resp in responses]
    answers2 = [choice_to_type_map.get(choice) for choice in parsed_doi_choices]

else:
    for chunk, acc_id in zip(chunks2, ids):
        article_id, academic_text = chunk
        messages = [
            {"role": "system", "content": SYS_PROMPT_ACCESSION},
            {"role": "user", "content": f"Accession ID: {acc_id}\n\nAcademic text:\n{academic_text}"}
        ]
    
        prompt = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=False,enable_thinking=False
        )
        prompts.append(prompt)
    
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0.05,
            skip_special_tokens=True,
            max_tokens=1,
            logits_processors=[mclp],
            logprobs=len(mclp.choices)
        ),
        use_tqdm=True
    )
    
    logprobs2 = []
    for lps in [output.outputs[0].logprobs[0].values() for output in outputs]:
        logprobs2.append({lp.decoded_token: lp.logprob for lp in list(lps)})
    
    logit_matrix2 = pd.DataFrame(logprobs2)[["A", "B", "C"]].values
    choices2 = ["Primary", "Secondary", 'Missing']
    
    answers2 = []
    for logit_row in logit_matrix2:
        max_logit = np.max(logit_row)
        max_idx = np.argmax(logit_row)
        
        if max_logit > -2.0:
            answers2.append(choices2[max_idx])
        else:
            answers2.append('')
    

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

## Prepare Submission

In [14]:
# change "llm_type" to "type" to use LLM predictions
# as is, will use /kaggle/input/makedatacount-mixed-train model
# for predicting "type"

sub_df = pd.DataFrame()
sub_df["article_id"] = [c[0] for c in chunks]
sub_df["dataset_id"] = doi_urls
sub_df["dataset_id"] = sub_df["dataset_id"].str.lower()
sub_df["type"] = answers
sub_df = sub_df[sub_df["type"].notnull()].reset_index(drop=True)

sub_df2 = pd.DataFrame()
sub_df2["article_id"] = [c[0] for c in chunks2]
sub_df2["dataset_id"] = ids

sub_df2["type"] = answers2
sub_df2 = sub_df2[sub_df2["type"].notnull()].reset_index(drop=True)

# Combine and clean
sub_df = pd.concat([sub_df, sub_df2], ignore_index=True)

print("Final submission stats:")
print(sub_df["type"].value_counts())
print(f"Total entries: {len(sub_df)}")

Final submission stats:
type
Missing      243
Primary       25
Secondary     10
Name: count, dtype: int64
Total entries: 278


In [15]:
mask = sub_df.applymap(lambda x: x is None)
cols = sub_df.columns[(mask).any()]
for col in sub_df[cols]:
    sub_df.loc[mask[col], col] = 'Missing'

  mask = sub_df.applymap(lambda x: x is None)


In [16]:
# Uses use /kaggle/input/makedatacount-mixed-train model
# predictions on "type"
# sub_df = pd.merge(sub_df, df_preds[['article_id','type']], on="article_id")

In [17]:
sub_df=sub_df[~sub_df['dataset_id'].isin(["irrelevant"])]

In [18]:
# sub_df.type_x = sub_df.type_y.where(sub_df.type_x == 'Missing', sub_df.type_x)

In [19]:
sub_df = sub_df[sub_df["type"].isin(["Primary", "Secondary"])].reset_index(drop=True)

In [20]:
# del sub_df['type_y']
# sub_df = sub_df.rename(columns={'type_x': 'type'})

In [21]:
# Enhanced deduplication with priority to Primary data
sub_df = sub_df.sort_values(by=["article_id", "dataset_id", "type"], 
                           key=lambda x: x.map({"Primary": 0, "Secondary": 1}) if x.name == "type" else x)\
               .drop_duplicates(subset=['article_id', 'dataset_id'], keep="first")\
               .reset_index(drop=True)

sub_df['row_id'] = range(len(sub_df))
sub_df.to_csv("submission.csv", index=False, columns=["row_id", "article_id", "dataset_id", "type"])

## Evaluate validation score

In [22]:
def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) != 0 else 0.0
    
if not os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    pred_df = pd.read_csv("submission.csv")
    label_df = pd.read_csv("/kaggle/input/make-data-count-finding-data-references/train_labels.csv")
    label_df = label_df[label_df['type'] != 'Missing'].reset_index(drop=True)

    hits_df = label_df.merge(pred_df, on=["article_id", "dataset_id", "type"])
    
    tp = hits_df.shape[0]
    fp = pred_df.shape[0] - tp
    fn = label_df.shape[0] - tp
    
    print("\nValidation Results:")
    print("TP:", tp)
    print("FP:", fp)
    print("FN:", fn)
    print("F1 Score:", round(f1_score(tp, fp, fn), 3))

In [23]:
import os
import pandas as pd

def calculate_f1_score(y_true, y_pred):
    if y_true.empty or y_pred.empty:
        tp = 0
        fp = len(y_pred)
        fn = len(y_true)
    else:
        hits = y_true.merge(y_pred, on=["article_id", "dataset_id", "type"])
        tp = len(hits)
        fp = len(y_pred) - tp
        fn = len(y_true) - tp
    
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    return tp, fp, fn, f1

def analyze_error_sources(pred_df, label_df):
    label_df_filtered = label_df[label_df['type'] != 'Missing'].copy()

    is_doi_pred = pred_df['dataset_id'].str.startswith('https://doi.org/')
    is_doi_label = label_df_filtered['dataset_id'].str.startswith('10.')

    pred_doi = pred_df[is_doi_pred]
    pred_accession = pred_df[~is_doi_pred]
    label_df_filtered['dataset_id_normalized'] = label_df_filtered['dataset_id'].apply(
        lambda x: f"https://doi.org/{x}" if x.startswith('10.') else x
    )
    label_df_filtered = label_df_filtered.rename(columns={'dataset_id': 'original_dataset_id', 'dataset_id_normalized': 'dataset_id'})
    
    is_doi_label_norm = label_df_filtered['dataset_id'].str.startswith('https://doi.org/')

    label_doi = label_df_filtered[is_doi_label_norm]
    label_accession = label_df_filtered[~is_doi_label_norm]

    tp_doi, fp_doi, fn_doi, f1_doi = calculate_f1_score(label_doi, pred_doi)
    tp_acc, fp_acc, fn_acc, f1_acc = calculate_f1_score(label_accession, pred_accession)
    
    print("="*40)
    print("🔬 Error Analysis by ID Type")
    print("="*40)

    print("\n--- DOI ---")
    print(f"Total Predictions: {len(pred_doi)}")
    print(f"True Positives (TP): {tp_doi}")
    print(f"False Positives (FP): {fp_doi}")
    print(f"False Negatives (FN): {fn_doi}")
    print(f"F1 Score: {f1_doi:.4f}")

    print("\n--- Accession ID ---")
    print(f"Total Predictions: {len(pred_accession)}")
    print(f"True Positives (TP): {tp_acc}")
    print(f"False Positives (FP): {fp_acc}")
    print(f"False Negatives (FN): {fn_acc}")
    print(f"F1 Score: {f1_acc:.4f}")
    
    print("\n" + "="*40)
    print("Total FP:", fp_doi + fp_acc)
    print("Total FN:", fn_doi + fn_acc)
    print("="*40)

if not os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    try:
        pred_df = pd.read_csv("submission.csv")
        pred_df['dataset_id'] = pred_df['dataset_id'].astype(str)
        
        label_df = pd.read_csv("/kaggle/input/make-data-count-finding-data-references/train_labels.csv")
        label_df['dataset_id'] = label_df['dataset_id'].astype(str)

        analyze_error_sources(pred_df, label_df)

    except FileNotFoundError as e:
        print(f"Error: Could not find a required file. {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

In [24]:
import pandas as pd

try:
    pred_df = pd.read_csv("submission.csv")
    label_df = pd.read_csv("/kaggle/input/make-data-count-finding-data-references/train_labels.csv")
    label_df_filtered = label_df[label_df['type'] != 'Missing'].copy()
except FileNotFoundError as e:
    print(f"An error occurred: File not found. - {e}")
    exit()

fn_df = pd.merge(
    label_df_filtered,
    pred_df,
    on=['article_id', 'dataset_id', 'type'],
    how='left',
    indicator=True
).query('_merge == "left_only"').drop(columns=['_merge'])

merged_df = pd.merge(
    fn_df,
    pred_df,
    on=['article_id', 'dataset_id'],
    how='left',
    indicator='source'
)

classified_incorrectly_df = merged_df[merged_df['source'] == 'both']
classified_incorrectly_count = len(classified_incorrectly_df)

completely_missed_df = merged_df[merged_df['source'] == 'left_only']
completely_missed_count = len(completely_missed_df)

incorrect_doi_count = classified_incorrectly_df[classified_incorrectly_df['dataset_id'].str.startswith('https://', na=False)].shape[0]
incorrect_accession_count = classified_incorrectly_df[~classified_incorrectly_df['dataset_id'].str.startswith('https://', na=False)].shape[0]


missed_doi_count = completely_missed_df[completely_missed_df['dataset_id'].str.startswith('https://', na=False)].shape[0]
missed_accession_count = completely_missed_df[~completely_missed_df['dataset_id'].str.startswith('https://', na=False)].shape[0]


print("="*55)
print("Analyst False Negatives (FN)")
print("="*55)
print(f"All FN: {fn_df.shape[0]} record")
print("-" * 55)
print(f"↳ It have but wrong answer: {classified_incorrectly_count} record")
print(f"    Wrong DOI: {incorrect_doi_count} record")
print(f"    Wrong Accession ID: {incorrect_accession_count} record")
print("-" * 55)
print(f"↳ Can't find this: {completely_missed_count} record")
print(f"    Can't find DOI: {missed_doi_count} record")
print(f"    Can't find Accession ID: {missed_accession_count} record")
print("="*55)

Analyst False Negatives (FN)
All FN: 714 record
-------------------------------------------------------
↳ It have but wrong answer: 1 record
    Wrong DOI: 1 record
    Wrong Accession ID: 0 record
-------------------------------------------------------
↳ Can't find this: 713 record
    Can't find DOI: 319 record
    Can't find Accession ID: 394 record
