# 🧑‍🔬 Make Data Count: Competition-Leading Pipeline

This notebook implements a modular pipeline for extracting and classifying research data citations from PDFs (and, optionally, XMLs), following the unique demands and quirks of the Make Data Count competition.

---

## Competition Context

- **Goal**: Identify data mentions in research articles and classify the *relationship* to the paper as **Primary** (data generated in this study) or **Secondary** (reused data).
- **Challenge**: Labels are *noisy and incomplete*, with major updates and forum debates (see: train labels, “Missing” type, new F1 scoring).
- **Key insight**: Regex + LLM hybrid approaches, with layered filtering, currently outperform end-to-end NER or deep learning models due to label quality.

---

## Pipeline Steps

### 1. Environment Setup
- Cleans up conflicting packages (e.g. TensorFlow) and pins versions for parsing, LLM inference, and regex support.
- Defensive: Handles CUDA, logging, and IO quirks on Kaggle.

### 2. Helpers & Logging
- Configures paths for Kaggle/notebook runs, sets up robust logging, and normalizes all text (including Unicode, stray whitespace, and weird publisher characters).
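
The normalization step can be illustrated without Polars. The sketch below is a standard-library approximation of what `string_normalization` in `helpers.py` does (NFKC folding, dropping non-ASCII characters, rewriting Zenodo record URLs into DOI form); the function name `normalize_text` is hypothetical and used only for illustration.

```python
import re
import unicodedata

# Same Zenodo URL pattern as in helpers.py, compiled for the stdlib `re` module.
ZENODO_RECORD = re.compile(r"https?://zenodo\.org/record/(\d+)")


def normalize_text(text: str) -> str:
    """Stdlib approximation of the notebook's Polars normalization chain."""
    # Fold ligatures, full-width forms, etc. into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop whatever is still outside ASCII (stray publisher glyphs and similar).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Rewrite Zenodo record URLs so the DOI regex can pick them up later.
    return ZENODO_RECORD.sub(r" 10.5281/zenodo.\1 ", text)


if __name__ == "__main__":
    print(normalize_text("Raw \ufb01les: https://zenodo.org/record/12345"))
```

Dropping every non-ASCII character is a blunt choice: it keeps the regexes simple, at the cost of mangling names with diacritics, which do not matter for ID extraction.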

### 3. PDF (and XML) Parsing
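- Converts each PDF to raw text using PyMuPDF, then appends the text extracted from the matching XML file (via lxml) to the same per-article `.txt` file.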
- *Forum wisdom*: Both PDF and XML are needed, as neither alone is complete or fully labeled.
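
As a rough illustration of the "PDF text plus XML text" merge, the sketch below flattens an XML string with the standard library's `xml.etree.ElementTree` and appends it to already-extracted PDF text. The notebook itself uses lxml and PyMuPDF; `xml_to_text` and `merge_sources` are hypothetical names, and the XML snippet is made up for the example.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_string: str) -> str:
    """Flatten every text node of an XML document into one space-joined string."""
    root = ET.fromstring(xml_string)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def merge_sources(pdf_text: str, xml_string: str) -> str:
    """Append XML-derived text to PDF-derived text, one source per line block."""
    return pdf_text.rstrip("\n") + "\n" + xml_to_text(xml_string) + "\n"


if __name__ == "__main__":
    xml = "<article><body><p>Reads are in <ext-link>GSE12345</ext-link>.</p></body></article>"
    print(merge_sources("Title page text", xml))
```

Appending rather than choosing one source means an ID only has to survive extraction in either format to be found by the regexes.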

### 4. Extraction Engine
- **Regex Patterns**: Finds DOIs, accessions, and other dataset IDs using hand-tuned, aggressively expanded patterns.
- **Section Splitting**: Identifies and splits out the References section to reduce noise and false positives.
- **Deduplication & Filters**: Removes non-dataset IDs, known bad prefixes, and article IDs.
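
A minimal sketch of the extract-clean-deduplicate idea, using only the standard library. The DOI pattern and the cleanup steps mirror the ones in `getid.py`; the accession pattern is a deliberately tiny subset of the notebook's `REGEX_IDS`, and `extract_ids` is a hypothetical helper name.

```python
import re

DOI_LINK = "https://doi.org/"
# Same whitespace-tolerant DOI pattern as the notebook.
DOI_RE = re.compile(r"10\s*\.\s*\d{4,9}\s*/\s*\S+")
# Illustrative subset only; the notebook's REGEX_IDS covers many more prefixes.
ACC_RE = re.compile(r"\b(?:GSE\d+|PRJNA\d+|PXD\d+|SRR\d+)\b")


def extract_ids(text: str) -> list[str]:
    """Return the sorted, deduplicated candidate IDs found in `text`."""
    found = set()
    for m in DOI_RE.finditer(text):
        doi = re.sub(r"\s+", "", m.group())              # rejoin DOIs split by PDF line breaks
        doi = re.sub(r"[^A-Za-z0-9]+$", "", doi).lower()  # strip trailing punctuation
        found.add(DOI_LINK + doi)
    for m in ACC_RE.finditer(text):
        found.add(m.group())
    return sorted(found)


if __name__ == "__main__":
    sample = "Data are at 10.5061/ dryad.abc123). Reads: GSE12345 and GSE12345."
    for candidate in extract_ids(sample):
        print(candidate)
```

Using a set makes deduplication a side effect of collection; the notebook achieves the same with `group_by` and `unique` on `(article_id, dataset_id)`.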

### 5. Context Window Construction
- For each candidate ID, builds a snippet (window) of surrounding text. This is crucial for LLM classification.
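
The window construction is simple enough to show in isolation. This sketch follows the same slicing logic as `get_context_window` in `getid.py`, with one assumed difference: it returns an empty string when the ID is not found instead of raising. The name `context_window` is hypothetical.

```python
def context_window(text: str, needle: str, window: int = 100) -> str:
    """Return `needle` plus up to `window` characters on each side of its first occurrence."""
    idx = text.find(needle)
    if idx == -1:
        return ""  # assumption: fail soft; the notebook's version raises ValueError
    start = max(idx - window, 0)
    end = min(idx + len(needle) + window, len(text))
    return text[start:end]


if __name__ == "__main__":
    text = "a" * 150 + "GSE1" + "b" * 30
    snippet = context_window(text, "GSE1")
    print(len(snippet), snippet[:5], snippet[-5:])
```

Only the first occurrence is used, so an ID that appears both in the body and in the reference list gets the context of whichever comes first in the text.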

### 6. LLM-Based Classification
- Uses a strongly structured prompt and vLLM (Qwen2.5 32B Instruct, AWQ-quantized) to decide if an ID is "Data" or "Literature" (A/B), leveraging explicit prefix rules and a few-shot template.
- Constrains output to single-character “A” or “B” for reliability.
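
The selection step after constrained decoding can be sketched on a plain dictionary. In the notebook, the restriction to "A"/"B" is enforced by `MultipleChoiceLogitsProcessor` inside vLLM; the code below only illustrates the final "pick the most likely allowed token" logic, and the `pick_choice` name and the example log-probabilities are assumptions.

```python
import math


def pick_choice(logprobs: dict[str, float], choices=("A", "B")) -> tuple[str, float]:
    """Pick the allowed token with the highest log-probability.

    Also returns its probability renormalized over the allowed choices only.
    """
    allowed = {tok: lp for tok, lp in logprobs.items() if tok in choices}
    if not allowed:
        raise ValueError("none of the allowed choices appear in the logprobs")
    best = max(allowed, key=allowed.get)
    total = sum(math.exp(lp) for lp in allowed.values())
    return best, math.exp(allowed[best]) / total


if __name__ == "__main__":
    # Hypothetical first-token logprobs; "C" is ignored because it is not allowed.
    label, confidence = pick_choice({"A": -0.2, "B": -1.8, "C": -0.1})
    print(label, round(confidence, 3))
```

The renormalized probability is not used by the notebook, but it is a natural hook for thresholding borderline cases.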

### 7. Post-Filtering
- Drops DOI candidates whose prefix is on the `PAPER_PREFIXES` watch list unless “data context” words are present in the window. Accession IDs pass through untouched.
- Aggressive final deduplication.

### 8. Scoring & Output
- Outputs submission as required (article_id, dataset_id, type).
- Local F1 scoring approximates the updated official metric: rows labeled "Missing" are excluded from the ground truth, and classic F1 is computed on the remaining rows, overall and separately for DOIs and accession IDs.
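
For reference, the local metric reduces to counting matches between predicted and labeled tuples. The sketch below uses Python sets, whereas the notebook's `score` helper uses a Polars join and row counts; treating both sides as duplicate-free sets is an assumption, and `f1_counts` is a hypothetical name.

```python
def f1_counts(pred, truth):
    """Return (f1, tp, fp, fn) for two collections of hashable keys."""
    pred, truth = set(pred), set(truth)
    tp = len(pred & truth)   # predicted and labeled
    fp = len(pred - truth)   # predicted but not labeled
    fn = len(truth - pred)   # labeled but not predicted
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, tp, fp, fn


if __name__ == "__main__":
    pred = [("a1", "GSE1"), ("a1", "GSE2"), ("a2", "X")]
    truth = [("a1", "GSE1"), ("a1", "GSE2"), ("a2", "Y")]
    f1, tp, fp, fn = f1_counts(pred, truth)
    print(f"f1: {f1:.4f} [{tp}/{fp}/{fn}]")
```

Adding `type` to each tuple gives the stricter variant that the notebook reports when it evaluates on `['article_id', 'dataset_id', 'type']`.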

---

## Why This Works

- **Sane Extraction**: Regex + layered filtering catches nearly all valid IDs, especially with hand-crafted postprocessing.
- **LLM for Context**: Prompt engineering beats naive NER for deciding “data” vs “literature,” especially with incomplete labels.
- **Flexible**: Each stage is isolated for quick patching—key in a noisy, changing competition.

---

*If you have a regex improvement, please fork and share. That said, I think the real gold lies in building a dataset that maps these regex-matched IDs to the other "aliases" each dataset goes by.*

*May the F1 be ever in your favor!*


In [1]:
! uv pip uninstall --system 'tensorflow'
! uv pip install --system --no-index --find-links='/kaggle/input/latest-mdc-whls/whls' 'pymupdf' 'vllm' 'triton' 'logits-processor-zoo' 'numpy<2'
! mkdir -p /tmp/src

Using Python 3.11.13 environment at: /usr
Uninstalled 1 package in 5.07s
 - tensorflow==2.18.0
Using Python 3.11.13 environment at: /usr
Resolved 157 packages in 509ms
Prepared 52 packages in 15.29s
Uninstalled 14 packages in 231ms
Installed 52 packages in 93ms
 + airportsdata==20250622
 + astor==0.8.1
 + blake3==1.0.5
 + compressed-tensors==0.9.3
 + depyf==0.18.0
 + diskcache==5.6.3
 + fastapi-cli==0.0.7
 + gguf==0.17.1
 + httptools==0.6.4
 - importlib-metadata==8.7.0
 + importlib-metadata==8.0.0
 + i

In [2]:
%%writefile /tmp/src/helpers.py
import logging, os, kagglehub, inspect
from pathlib import Path
import polars as pl

IS_KAGGLE_ENV = sum(['KAGGLE' in k for k in os.environ]) > 0
IS_KAGGLE_SUBMISSION = bool(os.getenv("KAGGLE_IS_COMPETITION_RERUN"))
COMP_DIR = Path(('/kaggle/input/make-data-count-finding-data-references' if IS_KAGGLE_SUBMISSION else kagglehub.competition_download('make-data-count-finding-data-references')))
PDF_DIR = COMP_DIR / ('test' if IS_KAGGLE_SUBMISSION else 'train') / 'PDF'
WORKING_DIR = Path(('/kaggle/working/' if IS_KAGGLE_ENV else '.working/'))

DOI_LINK = 'https://doi.org/'

DEFAULT_LOG_LEVEL = os.getenv("LOG_LEVEL", "DEBUG").upper() if not IS_KAGGLE_SUBMISSION else "WARNING"
LOG_FILE_PATH = os.getenv("LOG_FILE", "logs/project.log")
LOG_DIR = Path(LOG_FILE_PATH).parent

LOG_DIR.mkdir(parents=True, exist_ok=True)

LOG_FORMAT = "%(levelname)s %(asctime)s  [%(filename)s:%(lineno)d - %(funcName)s()] %(message)s"
LOG_DATEFMT = "%Y-%m-%d %H:%M:%S"

def get_logger(name=None):
    if name is None:
        frame = inspect.currentframe()
        if frame is None or frame.f_back is None:
            name = "__main__"
        else:
            name = frame.f_back.f_globals.get("__name__", "__main__")

    logger = logging.getLogger(name)

    if not logger.handlers:
        logger.setLevel(DEFAULT_LOG_LEVEL)
        formatter = logging.Formatter(fmt=LOG_FORMAT, datefmt=LOG_DATEFMT)
        ch = logging.StreamHandler()
        ch.setLevel(DEFAULT_LOG_LEVEL)
        ch.setFormatter(formatter)
        fh = logging.FileHandler(LOG_FILE_PATH)
        fh.setLevel(DEFAULT_LOG_LEVEL)
        fh.setFormatter(formatter)
        logger.addHandler(ch)
        logger.addHandler(fh)
        logger.propagate = False
    return logger

def is_doi_link(name: str) -> pl.Expr:
    return pl.col(name).str.starts_with(DOI_LINK)

def string_normalization(name: str) -> pl.Expr:
    return pl.col(name).str.normalize("NFKC").str.replace_all(r"[^\p{Ascii}]", '').str.replace_all(r"https?://zenodo\.org/record/(\d+)", r" 10.5281/zenodo.$1 ")

def get_df(parse_dir: str):
    records = []
    txt_files = list(Path(parse_dir).glob('*.txt'))
    for txt_file in txt_files:
        id_ = txt_file.stem
        # parse.py writes UTF-8 bytes, so read with an explicit encoding
        with open(txt_file, 'r', encoding='utf8') as f:
            text = f.read()
        records.append({'article_id': id_, 'text': text})
    return pl.DataFrame(records).with_columns(string_normalization('text').alias('text'))

def assume_type(df: pl.DataFrame) -> pl.DataFrame:
    return (
        df.with_columns(pl.when(is_doi_link('dataset_id').or_(pl.col('dataset_id').str.starts_with('SAMN'))).then(pl.lit('Primary')).otherwise(pl.lit('Secondary')).alias('type'))
    )

def score(df, gt, on, tag='all'):
    hits = gt.join(df, on=on)
    tp = hits.height
    fp = df.height - tp
    fn = gt.height - tp
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) != 0 else 0.0
    return f"{tag} - f1: {f1:.4f} [{tp}/{fp}/{fn}]"

def evaluate(df, on=['article_id', 'dataset_id']):
    gt = pl.read_csv(COMP_DIR/'train_labels.csv').filter(pl.col('type')!='Missing')
    return (
        score(df, gt, on),
        score(df.filter(is_doi_link('dataset_id')), gt.filter(is_doi_link('dataset_id')), on, 'doi'),
        score(df.filter(~is_doi_link('dataset_id')), gt.filter(~is_doi_link('dataset_id')), on, 'acc'),
    )

Writing /tmp/src/helpers.py


In [3]:
%%writefile /tmp/src/parse.py
"""
parse.py
========
• Detects competition-submission mode
• Parses both PDF and XML into plain text
• Cleans text with the same heuristics
• Writes /tmp/train_parse/<article_id>.txt
"""

from __future__ import annotations
from pathlib import Path
import os, re, pathlib
from lxml import etree
import pymupdf                                 # PyMuPDF
from tqdm.auto import tqdm

# ----------------------- Environment / paths ---------------------------------
IS_KAGGLE_SUBMISSION = bool(os.getenv("KAGGLE_IS_COMPETITION_RERUN"))
SPLIT = "test" if IS_KAGGLE_SUBMISSION else "train"

BASE = Path("/kaggle/input/make-data-count-finding-data-references") / SPLIT
PDF_DIR = BASE / "PDF"
XML_DIR = BASE / "XML"

OUT_DIR = Path("/tmp/train_parse")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# XML & PDF Parsing

def xml_kind(path: pathlib.Path) -> str:
    # Sniff only the first 2 KB; the context manager closes the file handle
    with path.open('rb') as fh:
        head = fh.read(2048).decode('utf8', 'ignore')
    if 'www.tei-c.org/ns' in head:
        return 'tei'
    if re.search(r'(NLM|TaxonX)//DTD', head):
        return 'jats'
    if 'www.wiley.com/namespaces' in head:
        return 'wiley'
    if 'BioC.dtd' in head:
        return 'bioc'
    return 'unknown'

def xml2text(path: pathlib.Path) -> str:
    kind = xml_kind(path)
    root = etree.parse(str(path)).getroot()
    if kind in ('tei', 'bioc', 'unknown'):
        txt = ' '.join(root.itertext())
    elif kind == 'jats':
        elems = root.xpath('//body//sec|//ref-list')
        txt = ' '.join(' '.join(e.itertext()) for e in elems)
    elif kind == 'wiley':
        elems = root.xpath('//*[local-name()="body"]|//*[local-name()="refList"]')
        txt = ' '.join(' '.join(e.itertext()) for e in elems)
    else:
        txt = ' '.join(root.itertext())
    # Rejoin DOIs split by whitespace after the slash, keeping the registrant prefix
    txt = re.sub(r'(10\.\d{4,9}/)\s+', r'\1', txt)
    return txt

def pdf2text(path: pathlib.Path, out_dir: pathlib.Path) -> None:
    doc = pymupdf.open(str(path))
    out = out_dir / f"{path.stem}.txt"
    with open(out, "wb") as f:
        for page in doc:
            f.write(page.get_text().encode("utf8"))
            f.write(b"\n")

# Parse All PDFs & XMLs to TXT

def parse_all_pdfs_xmls(pdf_dir, xml_dir, parsed_dir):
    pdf_files = list(pdf_dir.glob('*.pdf'))
    if not pdf_files and not xml_dir.exists():
        raise ValueError("No PDF or XML files found.")

    parsed_dir.mkdir(parents=True, exist_ok=True)

    # PDF → TXT
    for pdf in tqdm(pdf_files, desc="PDF→TXT"):
        try:
            pdf2text(pdf, parsed_dir)
        except Exception as e:
            print(f"PDF error {pdf.stem}: {e}")

    # XML → TXT (append mode)
    if xml_dir.exists():
        for xml in tqdm(xml_dir.glob('*.xml'), desc="XML→TXT"):
            try:
                txt = xml2text(xml).encode("utf8")
                out = parsed_dir / f"{xml.stem}.txt"
                with open(out, "ab") as f:  # 'ab' = append binary
                    f.write(txt)
                    f.write(b"\n")
            except Exception as e:
                print(f"XML error {xml.stem}: {e}")
    print("Done parsing to text.")

parse_all_pdfs_xmls(PDF_DIR, XML_DIR, OUT_DIR)

Writing /tmp/src/parse.py


In [4]:
%%writefile /tmp/src/check_parse.py
import polars as pl
from pathlib import Path
from helpers import *

l=get_logger()

def gt_dataset_id_normalization(name:str) -> pl.Expr:
    return (
        pl.when(is_doi_link(name))
        .then(pl.col(name).str.split(DOI_LINK).list.last())
        .otherwise(name)
        .str.to_lowercase()
    )

def main():
    if IS_KAGGLE_SUBMISSION:
        l.debug('skipping check_parse for submission')
        return
    df = (
        get_df('/tmp/train_parse')
        .with_columns(pl.col('text').str.replace_all(r'\s+', '').str.to_lowercase().alias('text'))
    )

    gt = (
        pl.read_csv(COMP_DIR/'train_labels.csv')
        .filter(pl.col('article_id').is_in(df['article_id']))
        .filter(pl.col('type')!='Missing')
        .with_columns(gt_dataset_id_normalization('dataset_id').alias('norm_id'))
    )

    l.info(f"pymupdf misses: {gt.join(df, on='article_id').with_columns(hit=pl.col('text').str.contains(pl.col('norm_id'), literal=True)).filter(~pl.col('hit')).height} dataset_ids")

if __name__=='__main__': main()

Writing /tmp/src/check_parse.py


In [5]:
%%writefile /tmp/src/getid.py
import re
import polars as pl
from typing import Optional, Tuple

from helpers import *


COMPILED_PATTERNS = {
    # Reference-section headings: cover common variants and tolerate flexible whitespace
    'ref_header_patterns': [
        re.compile(r'\b(R\s*E\s*F\s*E\s*R\s*E\s*N\s*C\s*E\s*S|BIBLIOGRAPHY|LITERATURE CITED|WORKS CITED|CITED WORKS)\b[:\s]*', re.IGNORECASE),
        re.compile(r'\b(REFERENCES|BIBLIOGRAPHY|LITERATURE\s*CITED|WORKS\s*CITED|CITED\s*WORKS)\b', re.IGNORECASE),
    ],

    # Citation-line patterns (numbered lists, ranges, author + year, DOIs, etc.)
    'citation_pattern': re.compile(
        r'^\s*('
        r'\[\d+(-\d+)?(,\s*\d+(-\d+)?)*\]'                  # [1], [12-15], [1,3,5-7]
        r'|\([A-Z][a-z]+ et al\., \d{4}\)'                  # (Smith et al., 2020)
        r'|doi:\s*10\.\d{4,9}/[-._;()/:A-Z0-9]+'            # doi:10.xxxx/xxxxx
        r'|https?://doi\.org/10\.\d{4,9}/[-._;()/:A-Z0-9]+' # https://doi.org/10.xxxx/xxxxx
        r'|\d+\.|\d+\)|\(\d+\)|\d+(?=\s|$)'                  # 1. 1) (1) 1 (bare numbers)
        r')\s*', re.IGNORECASE),

    # First-citation patterns; these can be extended as needed
    'first_citation_patterns': [
        re.compile(r'^\s*\[1\]\s*'),
        re.compile(r'^\s*\(1\)\s*'),
        re.compile(r'^\s*1\.\s*'),
        re.compile(r'^\s*1\)\s*'),
        re.compile(r'^\s*1(?=\s|$)'),
        # e.g. an author + year pattern for the first citation can be added as well
        re.compile(r'^\s*\([A-Z][a-z]+ et al\., 20\d{2}\)'),
    ],
}


l = get_logger()

def find_last_reference_header(text: str, header_patterns: list[re.Pattern]) -> Optional[int]:
    last_match_idx = None
    for pattern in header_patterns:
        matches = list(pattern.finditer(text))
        if matches:
            last_match_idx = matches[-1].start()
    return last_match_idx

def find_last_first_citation(text: str) -> Optional[int]:
    lines = text.splitlines()
    last_match_line = None
    for line_num, line in enumerate(lines):
        line = line.strip()
        for pattern in COMPILED_PATTERNS['first_citation_patterns']:
            if pattern.match(line):
                next_lines = lines[line_num:line_num+3]
                if any(COMPILED_PATTERNS['citation_pattern'].match(l.strip()) for l in next_lines[1:]):
                    last_match_line = line_num
                break
    return last_match_line

# Helper: decide whether a line is noise (a likely false detection)
def is_noise_line(line: str) -> bool:
    stripped = line.strip()
    if not stripped:
        return True  # treat empty lines as noise

    # Separator lines (---, ===, ***, etc.)
    if stripped in ['---', '===', '***']:
        return True

    # Page-number lines, e.g. "Page 12"
    if stripped.lower().startswith('page ') and stripped[5:].strip().isdigit():
        return True

    # Skip Figure/Table label lines (e.g. Figure 1, Table 2)
    if stripped.lower().startswith('figure ') or stripped.lower().startswith('table '):
        return True

    return False

def find_reference_start(text: str) -> Optional[int]:
    lines = text.splitlines()

    # (1) Heading detection: a match on any of COMPILED_PATTERNS['ref_header_patterns'] counts
    for i, line in enumerate(lines):
        for pattern in COMPILED_PATTERNS['ref_header_patterns']:
            if pattern.search(line):
                # Skip blank/noise lines in the few lines after the heading and return the first valid line
                for offset in range(1, 6):
                    idx = i + offset
                    if idx >= len(lines):
                        break
                    candidate_line = lines[idx].strip()
                    if candidate_line and not is_noise_line(candidate_line):
                        return idx
                # If no valid line follows the heading, return the line right after it
                return i + 1

    # (2) Fall back to the last "first citation" position
    last_first_citation = find_last_first_citation(text)
    if last_first_citation is not None:
        return last_first_citation

    # (3) Search the second half of the text for runs of consecutive citation-like lines
    start_search_idx = int(len(lines) * 0.5)
    for i in range(start_search_idx, len(lines)):
        line = lines[i].strip()
        if is_noise_line(line) or len(line) < 5:
            continue
        if COMPILED_PATTERNS['citation_pattern'].match(line):
            next_lines = lines[i:i + 5]
            count = sum(1 for l in next_lines if COMPILED_PATTERNS['citation_pattern'].match(l.strip()))
            if count >= 3:
                # Walk back (up to 15 lines) to the nearest preceding non-citation line
                for j in range(i, max(-1, i - 15), -1):
                    prev_line = lines[j].strip()
                    if prev_line and not COMPILED_PATTERNS['citation_pattern'].match(prev_line) and not is_noise_line(prev_line):
                        return j + 1
                return max(0, i - 15)

    return None



def split_text_and_references(text: str) -> Tuple[str, str]:
    header_idx = find_last_reference_header(text, COMPILED_PATTERNS['ref_header_patterns'])
    prev_idx = None
    while header_idx is not None and header_idx != prev_idx:
        prev_idx = header_idx
        header_idx = find_last_reference_header(text[:header_idx].strip(), COMPILED_PATTERNS['ref_header_patterns'])
    if prev_idx is not None:
        return text[:prev_idx].strip(), text[prev_idx:].strip()

    ref_start_line = find_reference_start(text)
    if ref_start_line is not None:
        lines = text.splitlines()
        body = '\n'.join(lines[:ref_start_line])
        refs = '\n'.join(lines[ref_start_line:])
        return body.strip(), refs.strip()

    return text.strip(), ''

def get_splits(df: pl.DataFrame) -> pl.DataFrame:
    bodies, refs = [], []
    for raw_text in df['text']:
        main, ref = split_text_and_references(raw_text)
        bodies.append(main)
        refs.append(ref)
    return df.with_columns(pl.Series('body', bodies), pl.Series('ref', refs))

def tidy_extraction(df) -> pl.DataFrame:
    bad_ids = [f'{DOI_LINK}{e}' for e in ['10.5061/dryad', '10.5281/zenodo', '10.6073/pasta']]

    doi_df = (
        df.with_columns(pl.col('body').str.extract_all(r'10\s*\.\s*\d{4,9}\s*/\s*\S+').alias('match'))
          .explode('match')
          .drop_nulls('match')
          .with_columns(
              pl.col('match').str.replace_all(r'\s+', '')
                             .str.replace(r'[^A-Za-z0-9]+$', '')
                             .str.to_lowercase()
                             .alias('dataset_id')
          )
          .group_by('article_id', 'dataset_id')
          .agg('match')
          .with_columns((DOI_LINK + pl.col('dataset_id')).alias('dataset_id'))
    )

    REGEX_IDS = (
    r"(?i)\b(?:"
    r"CHEMBL\s*\d+|"
    r"E-GEOD-\s*\d+|E-PROT-\s*\d+|E-MTAB-\s*\d+|E-MEXP-\s*\d+|EMPIAR-\s*\d+|"
    r"ENSBTAG\s*\d+|ENSOARG\s*\d+|"
    r"EPI\s*_?\s*ISL\s*_?\s*\d{5,}|EPI\s*\d{6,7}|"
    r"HPA\s*\d+|CP\s*\d{6}|IPR\s*\d{6}|PF\s*\d{5}|BX\s*\d{6}|KX\s*\d{6}|K0\s*\d{4}|CAB\s*\d{6}|"
    r"NC\s*_\s*\d{6}\.\d{1}|NM\s*_\s*\d{9}|"
    r"PRJNA\s*\d+|PRJDB\s*\d+|PRJEB\s*\d+|PXD\s*\d+|SAMN\s*\d+|"
    r"GSE\s*\d+|GSM\s*\d+|"
    r"CVCL\s*_\s*[A-Z0-9]{4}|"
    r"PDB\s*[1-9][A-Z0-9]{3}|HMDB\s*\d+|"
    r"dryad\.\s*[^\s\"<>]+|pasta\/\s*[^\s\"<>]+|"
    r"(?:SR[RPAX]|STH|ERR|DRR|DRX|DRP|ERP|ERX)\d+|"
    r"phs\d{6}(?:\.v\d{1,2}\.p\d{1,2})?|"   # dbGaP accession
    r"MTBLS\d+|"                            # MetaboLights
    r"E-[A-Z]{4}-\d+|"                      # ArrayExpress (general)
    r"ds\d{6}|"                             # OpenNeuro
    r"[1-5]\s*\.(?:10|20|30|40|50|60|70|80|90)\s*\.\d{2,4}\s*\.\d{2,4}"  # numeric DOI-like
    r")"
)




    
    acc_df = (
        df.with_columns(
            pl.col('text').str.extract_all(REGEX_IDS).alias('match')
        )
        .explode('match')
        .drop_nulls('match')
        .with_columns(
            pl.col('match').str.replace_all(r'\s+', '')
                           .str.replace(r'[^A-Za-z0-9]+$', '')
                           .str.replace(r'(?i)^PDB', '')
                           .alias('dataset_id')
        )
        .group_by('article_id', 'dataset_id')
        .agg('match')
        .with_columns(
            pl.when(pl.col('dataset_id').str.starts_with('dryad.'))
              .then(f'{DOI_LINK}10.5061/' + pl.col('dataset_id'))
              .otherwise('dataset_id')
              .alias('dataset_id')
        )
        .with_columns(
            pl.when(pl.col('dataset_id').str.starts_with('pasta/'))
              .then(f'{DOI_LINK}10.6073/' + pl.col('dataset_id'))
              .otherwise('dataset_id')
              .alias('dataset_id')
        )
    )

    df = pl.concat([doi_df, acc_df])

    df = (
        df.unique(['article_id', 'dataset_id'])  # CHANGED
          .filter(~pl.col('article_id').str.replace('_','/').str.contains(pl.col('dataset_id').str.split(DOI_LINK).list.last().str.escape_regex()))
          .filter(~pl.col('dataset_id').str.contains(pl.col('article_id').str.replace('_','/').str.escape_regex()))
          .filter(~pl.col('dataset_id').str.contains('figshare', literal=True))
          .filter(~pl.col('dataset_id').is_in(bad_ids))
          .filter(
              pl.when(is_doi_link('dataset_id') &
                      (pl.col('dataset_id').str.split('/').list.last().str.len_chars() < 5))
               .then(False)
               .otherwise(True)
          )
          .with_columns(pl.col('match').list.unique())
    )
    return df

def get_context_window(text: str, substring: str, window: int = 100) -> str:
    idx = text.find(substring)
    if idx == -1:
        raise ValueError(f"substring not found in text: {substring!r}")
    start = max(idx - window, 0)
    end = min(idx + len(substring) + window, len(text))
    return text[start:end]

def get_window_df(text_df, ids_df):
    df = ids_df.join(text_df, on='article_id')
    windows = []
    for text, match_ids in df.select('text', 'match').rows():
        windows.append(get_context_window(text, match_ids[0]))
    return df.with_columns(pl.Series('window', windows)).select('article_id', 'dataset_id', 'window')

def main():
    text_df = get_df('/tmp/train_parse')
    df = get_splits(text_df)
    df = tidy_extraction(df)
    df = get_window_df(text_df, df)
    df.write_parquet('/tmp/extracted.parquet')
    df = assume_type(df)
    df.select(['article_id', 'dataset_id', 'type']).with_row_index(name='row_id').write_csv('/kaggle/working/submission.csv')
    if not IS_KAGGLE_SUBMISSION:
        results = evaluate(df)
        for r in results: l.info(r)
        results = evaluate(df, on=['article_id', 'dataset_id', 'type'])
        for r in results: l.info(r)

if __name__=='__main__': main()

Writing /tmp/src/getid.py


In [6]:
%%writefile /tmp/src/llm_validate.py
import polars as pl
import os

from helpers import *

l = get_logger()

# ===============================
# Few-shot-enhanced prompt + explicit context patterns + added accession examples
# ===============================
SYS_PROMPT_CLASSIFY_DOI = """
1. Priority Rules (highest → lowest)

1.1 Always classify as A (Data) if:
DOI prefix matches a known data repository:

Dryad: 10.5061
Zenodo: 10.5281
Figshare: 10.6084
Mendeley Data: 10.24433/, 10.17632
Dataverse: 10.7910/DVN
OpenNeuro: 10.18112/openneuro.
PANGAEA: 10.1594/PANGAEA.
Neotoma Paleoecology: 10.21233
ICPSR: 10.3886
NOAA NCEI: 10.7289
UK Data Service: 10.5255
EMPIAR: 10.6019

Non-DOI dataset accession prefixes:
NCBI SRA / ENA: SRP, SRA, ERP, ERX
BioProject: PRJNA, PRJEB, PRJDB, SAMN
ProteomeXchange / PRIDE: PXD
ArrayExpress / EMBL-EBI: E-MTAB, E-
MetaboLights: MTBLS
GEO Series: GSE
GenBank: MN, NC_, CP, MT (context needed)
EMDB: EMD-
EMPIAR: EMPIAR-

1.2 Context keywords trigger A (Data)
If the context contains any of the following patterns → classify as A:
- hosted on
- deposited at
- available via
- uploaded to
- stored on
- accessible via
- provided by
- deposited in
- archived at
- supplementary dataset / supporting dataset
- experimental data / raw data

2. Classify as B (Literature) if:
DOI prefix belongs to a publisher (e.g., 10.1038, 10.1007, etc.).
Context patterns → classify as B:
- published in
- as described in
- reported in
- method details published in
- supplementary material / supplementary information only

3. Ambiguous cases
No repository prefix and no clear context → default to B.
Rare accession formats → rely on context keywords.

4. Output
Only output:
A → data repository / dataset
B → literature / non-data resource

5. Few-shot examples (accession examples added, context patterns made explicit, 25+ examples)
“Raw images are hosted on Figshare (DOI 10.6084/m9.figshare.1234567).” → A
“Sequence reads deposited at BioProject accession PRJNA765432.” → A
“Method details published in J. Proteome Res. DOI: 10.1021/acs.jproteome.0c00845.” → B
“As described in Nature Methods (DOI 10.1038/s41592-020-0793-2).” → B
“See Supplementary Data available via Zenodo (10.5281/zenodo.987654).” → A
“Data uploaded to Dryad (10.5061/dryad.x1y2z3).” → A
“Referenced paper reported in bioRxiv DOI 10.1101/2020.01.01.123456.” → B
“Metabolomics data in MetaboLights MTBLS1234.” → A
“The MRI scans are deposited at OpenNeuro (DOI 10.18112/openneuro.ds000001.v1.0.0).” → A
“Protein structure described in Science (DOI 10.1126/science.abc1234).” → B
“Microbiome raw data hosted on ArrayExpress E-MTAB-9876.” → A
“RNA sequencing reads available via SRA SRP098765.” → A
“Supplementary tables published in Nature Genetics (DOI 10.1038/ng.123456).” → B
“Data from neuroimaging study uploaded to OpenNeuro.” → A
“Proteomics dataset deposited at PRIDE PXD012345.” → A
“Sequencing metadata hosted on Zenodo.” → A
“Results presented in conference proceedings (DOI 10.1109/ICML.2020.12345).” → B
“GenBank accession MN1234567 contains raw sequences.” → A
“Cryo-EM map deposited at EMDB EMD-9876.” → A
“Clinical trial dataset uploaded to Dryad.” → A
“Experimental raw data in MetaboLights MTBLS5678.” → A
“Supplementary figures published in Science (DOI 10.1126/science.abc12345).” → B
“Raw imaging files stored on Figshare.” → A
“Referenced preprint DOI 10.1101/2021.01.01.123456.” → B
“Data portal provides access to large-scale RNA-seq datasets.” → A
“Study reported in Nature Communications (DOI 10.1038/s41467-020-12345).” → B
“SAMN123456 metadata contains raw sequences from microbiome study.” → A
""".strip()

# ===============================
# Build the DataFrame
# ===============================
def build_df():
    df = pl.read_parquet('/tmp/extracted.parquet')
    df.filter(~is_doi_link('dataset_id')).select('article_id', 'dataset_id').write_csv('/tmp/accid_sub.csv')
    return df.filter(is_doi_link('dataset_id'))

# ===============================
# Build the prompts
# ===============================
def build_prompt(tokenizer, df):
    prompts = []
    for doi, text in df.select('dataset_id', 'window').rows():
        messages = [{'role':'system','content': SYS_PROMPT_CLASSIFY_DOI},
                    {'role':'user', 'content': text}]
        prompts.append(tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False))
    return df.with_columns(pl.Series('prompt', prompts))

# ===============================
# Main
# ===============================
if __name__=='__main__':
    os.environ["VLLM_USE_V1"] = "0"
    import vllm
    from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor

    model_path = "/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1"
    llm = vllm.LLM(
        model_path,
        quantization='awq',
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=2048,
        disable_log_stats=True,
        disable_custom_all_reduce=True,
        enable_prefix_caching=True,
        task='generate'
    )

    tokenizer = llm.get_tokenizer()
    df = build_df()
    df = build_prompt(tokenizer, df)
    prompts = df['prompt'].to_list()

    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=["A", "B"])
    outputs = llm.generate(
        prompts,
        vllm.SamplingParams(
            seed=777,
            temperature=0,
            skip_special_tokens=True,
            max_tokens=1,
            logits_processors=[mclp],
            logprobs=len(mclp.choices)
        ),
        use_tqdm=True
    )

    logprobs = [
        {lp.decoded_token: lp.logprob for lp in list(lps)}
        for lps in [output.outputs[0].logprobs[0].values() for output in outputs]
    ]
    choices = [max(d, key=d.get) for d in logprobs]
    types = {'A': True, 'B': False}
    choices = [types[c] for c in choices]

    df = df.with_columns(pl.Series('type', choices))
    df.filter(pl.col('type')).select('article_id', 'dataset_id').write_csv('/tmp/doi_sub.csv')
    df = pl.concat([pl.read_csv('/tmp/doi_sub.csv'), pl.read_csv('/tmp/accid_sub.csv')])
    df = assume_type(df)
    df.select(['article_id', 'dataset_id', 'type']).with_row_index(name='row_id').write_csv('/kaggle/working/submission.csv')

    if not IS_KAGGLE_SUBMISSION:
        results = evaluate(df)
        for r in results:
            l.info(r)
        results = evaluate(df, on=['article_id', 'dataset_id', 'type'])
        for r in results:
            l.info(r)


Writing /tmp/src/llm_validate.py


In [7]:
%%writefile /tmp/src/post_filter.py
import polars as pl
from helpers import *

"""
Stage 4: post-filter to cut false-positive (FP) DOIs that look like literature.
- Read /kaggle/working/submission.csv (output of llm_validate.py)
- Join with /tmp/extracted.parquet to get context window
- Drop DOI rows that (1) start with a prefix listed in PAPER_PREFIXES AND (2) have no data-ish words nearby
- Keep accessions untouched
"""

l = get_logger()

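# NOTE: despite the name, several of these prefixes belong to data repositories
# (e.g. 10.5061 Dryad, 10.5281 Zenodo, 10.1594 PANGAEA, 10.3886 ICPSR).
# DOIs under these prefixes are kept only when CONTEXT_RE matches their window.
PAPER_PREFIXES = [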
    "10.5061","10.5281","10.17632","10.1594","10.15468","10.17882","10.7937","10.7910","10.6073",
    "10.3886","10.3334","10.4121","10.5066","10.5067","10.18150","10.25377","10.25387","10.23642","10.24381","10.22033"
]

CONTEXT_RE = r"(?i)\b(data(?:set)?|repository|archive|deposited|available|supplementary|raw(?:\s+data)?|uploaded|hosted|stored|accession)\b"

def remove_extra_digit(df: pl.DataFrame, column: str) -> pl.DataFrame:
    """
    Remove rows where the value in `column` is just the same DOI with one extra digit at the end.
    Keeps all other columns.
    """
    items_set = set(df[column].to_list())

    def keep_row(value):
        if (value[-1].isdigit() and value[:-1] in items_set) or \
           (len(value) > 2 and value[-2:].isdigit() and value[:-2] in items_set):
            return False
        return True

    return df.filter(pl.col(column).map_elements(keep_row, return_dtype=pl.Boolean))


def is_paper_prefix(col: str = "dataset_id") -> pl.Expr:
    expr = pl.lit(False)
    for p in PAPER_PREFIXES:
        expr = expr | pl.col(col).str.starts_with(f"{DOI_LINK}{p}")
    return expr

def main():
    sub = pl.read_csv("/kaggle/working/submission.csv")

    # Normalize columns: drop row_id if present so concat widths match
    if "row_id" in sub.columns:
        sub = sub.drop("row_id")

    # Context windows
    win = pl.read_parquet("/tmp/extracted.parquet").select("article_id", "dataset_id", "window")

    # DOI & ACC split
    doi_rows = sub.filter(is_doi_link("dataset_id")).join(win, on=["article_id", "dataset_id"], how="left")
    acc_rows = sub.filter(~is_doi_link("dataset_id"))

    keep_mask = (
        (~is_paper_prefix("dataset_id"))  # not a known paper prefix
        | doi_rows["window"].fill_null("").str.contains(CONTEXT_RE)
    )

    kept_doi = doi_rows.filter(keep_mask).select("article_id", "dataset_id", "type")
    doi_df = remove_extra_digit(kept_doi, "dataset_id")
    final = pl.concat([doi_df, acc_rows.select("article_id", "dataset_id", "type")])

    # Re-eval & save
    if not IS_KAGGLE_SUBMISSION:
        for r in evaluate(final): l.info(r)
        for r in evaluate(final, on=["article_id", "dataset_id", "type"]): l.info(r)

    final.with_row_index("row_id").write_csv("/kaggle/working/submission.csv")

if __name__ == "__main__":
    main()

Writing /tmp/src/post_filter.py


In [8]:
%%writefile /tmp/src/post_validate.py

from helpers import *
import polars as pl
import os


l = get_logger()


PROMPT_CLASSIFY_CITATION_TYPE = '''
# Role & Task
You are an expert data citation analyst. Your task is to classify a given citation from a scientific paper into one of two categories: **A** (Data) or **B** (Not Data). Base your decision strictly on the provided abstract and the context of the citation.

## Instructions
1.  **Read the provided abstract** to understand the research context.
2.  **Analyze the citation context** for key linguistic cues.
3.  **Classify the citation** as either **A** or **B** based on the definitions below.
4.  **Output only a single letter: A or B.** Do not output any other text, explanation, or formatting.

## Category Definitions

### **Category A: DATA**
The citation points to a dataset. This includes:
*   **Primary Data:** Raw or processed data that the current study's authors collected, generated, or created.
*   **Secondary Data:** Data that was originally produced by other researchers but is being *used as a dataset* in the current study.
*   **Key Phrases:** "data are available at", "we collected", "we measured", "data were obtained from", "dataset", "downloaded from", "deposited in", repository names (e.g., GenBank, Zenodo, Figshare, TCIA).

### **Category B: NOT DATA**
The citation points to a traditional scholarly publication or other non-data resource. This includes:
*   Journal articles, books, conference proceedings, preprints, protocols, methods papers.
*   **Key Phrases:** "as described in", "according to", "previous study", "et al.", "paper", "article", "methodology", "was used for analysis" (without indicating data access).
*   Citations that provide background context or methodological description but do not serve as the source of the data used in the analysis.

## Input Format
You will be provided with the following three pieces of information:
Paper Abstract: {abstract}
Citation: {dataset_id}
Citation Context: {context}

## Critical Thinking Guidelines
*   A DOI or URL can point to either data (A) or a paper (B). The context determines the classification.
*   If the citation is used to describe the *source* of the data for the current study's analysis, it is likely **A**.
*   If the citation is used to provide background, justify a method, or compare results, it is likely **B** (a reference to another paper).
*   When in doubt, rely on the linguistic cues in the "Citation Context".

## Examples for Pattern Recognition

**Example 1 (Classify as A):**
*   Context: "Three out of four cohorts used in this study can be found on The Cancer Imaging Archive (TCIA)24: Canadian benchmark dataset23: https://doi.org/10.7937/K9/TCIA.2017.8oje5q00."
*   **Reasoning:** The text states cohorts are "used in this study" and provides direct repository links. This is a clear case of citing external data for use.
*   **Output:** A

**Example 2 (Classify as B):**
*   Context: "data presented here are available at the SEANOE dataportal: https://doi.org/10.17882/94052 (ZooScan dataset Grandremy et al. 2023c)"
*   **Reasoning:** The phrase "data presented here" indicates this is the authors' own data being deposited, not a citation to an external source they are using. The "(Author et al. Year)" format is a classic literature citation style.
*   **Output:** B

**Example 3 (Classify as A):**
*   Context: "GBIF occurrence data: Vulpes vulpes: https://doi.org/10.15468/dl.wgtneb (28 May 2021)."
*   **Reasoning:** Explicitly names the data source (GBIF) and provides a direct access link/DOI for the specific dataset used.
*   **Output:** A

**Example 4 (Classify as B):**
*   Context: "North American soil NCBI SRA SRP035367 Smith & Peay [36] ITS2-Soil"
*   **Reasoning:** While it mentions a data repository ID (SRP035367), it couples it with a standard literature citation "[36]". The context suggests it is referencing the *paper* by Smith & Peay that describes the data, not directly citing the dataset itself for use.
*   **Output:** B

## Ready for Input
Begin your analysis. Remember: Output only **A** or **B**.
'''

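def get_context_window(text: str, substring: str, window: int = 600) -> tuple[str, str]:
    """Return (context, abstract).

    context:  up to `window` characters on each side of the first match.
    abstract: the first 1000 characters of the paper, used as a stand-in for the abstract.
    """
    idx = text.find(substring)
    if idx == -1:
        return "no context", "no abstract"
    start = max(idx - window, 0)
    end = min(idx + len(substring) + window, len(text))
    return text[start:end], text[:1000]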




def find_context_win(tokenizer,df):
    text_df = pl.read_parquet('/tmp/context_data.parquet')
    # print(text_df)
    df = df.join(text_df, on=["article_id","dataset_id"], how="inner")
    df = df.drop("type")
    print(df)

    prompts = []
    
    for article_id,dataset_id,text,match in df.select(["article_id","dataset_id","text",'match']).rows():

        context, abstract = get_context_window(text,match)
        user_content = f"""
        Paper Abstract: {abstract}
        
        Citation: {dataset_id}

        
        Citation Context: {context}
        """
        messages = [
            {"role": "system", "content": PROMPT_CLASSIFY_CITATION_TYPE},
            {"role": "user", "content": user_content.strip()}
        ]
        prompts.append(
            tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        )
        
    return df.with_columns(pl.Series("prompt", prompts))

    

if __name__=="__main__":
    os.environ["VLLM_USE_V1"] = "0"
    MODEL_PATH = "/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1"
    import vllm
    from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor

    llm = vllm.LLM(
        MODEL_PATH,
        quantization='awq',
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=16384,
        disable_log_stats=True, 
        disable_custom_all_reduce=True,
        enable_prefix_caching=True,
        task='generate')

    tokenizer = llm.get_tokenizer()

    df=pl.read_csv("/kaggle/working/submission.csv")
    
    if "row_id" in df.columns:
        df = df.drop("row_id")

    # print(df)

    doi_df = df.filter(is_doi_link("dataset_id"))
    acc_df = df.filter(~is_doi_link("dataset_id"))

    # print(doi_df)

    df = find_context_win(tokenizer,doi_df)

    
    
    prompts = df['prompt'].to_list()
    # The prompt allows only A or B; any other choice would raise a KeyError in the lookup below.
    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=["A", "B"])
    sampling_params = vllm.SamplingParams(
        seed=777,
        temperature=0.7,
        skip_special_tokens=True,
        max_tokens=1,
        logits_processors=[mclp],
        logprobs=len(mclp.choices),
    )
    outputs = llm.generate(prompts, sampling_params, use_tqdm=True)
    # Map each candidate token to its log-probability, then take the most likely one.
    logprobs = [
        {lp.decoded_token: lp.logprob for lp in output.outputs[0].logprobs[0].values()}
        for output in outputs
    ]
    choices = [max(d, key=d.get) for d in logprobs]
    types = {'A': True, 'B': False}  # True = keep the DOI as a data citation
    choices = [types[c] for c in choices]
    df = df.with_columns(pl.Series('type', choices))
    df.filter(pl.col('type')).select('article_id', 'dataset_id').write_csv('/tmp/doi_sub.csv')
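    # NOTE: /tmp/accid_sub.csv must already exist at this point; predict.py also writes it,
    # but runs after this script.
    df = pl.concat([pl.read_csv('/tmp/doi_sub.csv'), pl.read_csv('/tmp/accid_sub.csv')])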
    df = assume_type(df)
    df.select(['article_id', 'dataset_id', 'type']).with_row_index(name='row_id').write_csv('/kaggle/working/submission.csv')
    # print(df)
    if not IS_KAGGLE_SUBMISSION:
        results = evaluate(df)
        for r in results: l.info(r) 
        results = evaluate(df, on=['article_id', 'dataset_id', 'type'])
        for r in results: l.info(r)
    
    
    # Release the model and any cached GPU memory.
    try:
        del llm, tokenizer
    except NameError:
        pass

    import gc
    import torch
    gc.collect()
    torch.cuda.empty_cache()

Writing /tmp/src/post_validate.py
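
The script takes the highest-logprob token for each prompt and looks it up in a label map. The following is a minimal sketch of that selection step with a guard for tokens that are not in the map. It illustrates the idea and is not the notebook's code: `pick_choice` and its `default` argument are hypothetical.

```python
from typing import Dict


def pick_choice(logprobs: Dict[str, float], label_map: Dict[str, bool], default: bool = False) -> bool:
    """Pick the most likely token that the label map knows about."""
    # Ignore any token without a label, so an unexpected token cannot raise a KeyError.
    allowed = {tok: lp for tok, lp in logprobs.items() if tok in label_map}
    if not allowed:
        return default
    best = max(allowed, key=allowed.get)
    return label_map[best]


label_map = {"A": True, "B": False}
print(pick_choice({"A": -0.1, "B": -2.3}, label_map))
print(pick_choice({"A": -3.0, "B": -0.2, "C": -0.05}, label_map))
```

Falling back to a default instead of raising is a design choice: it keeps a long batch run alive, at the cost of silently dropping a candidate whose answer was unusable.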


In [9]:
%%writefile /tmp/src/predict.py

from helpers import *
import polars as pl
import os


l = get_logger()


PROMPT_CLASSIFY_CITATION_TYPE = '''
# Role & Task
You are an expert data citation analyst. Your task is to classify a given citation from a scientific paper into one of two categories based on the context: **A (Primary Data)** or **B (Secondary Data)**.

## Instructions
1.  **Read the provided abstract** to understand the research context.
2.  **Analyze the citation context** for key linguistic cues.
3.  **Classify the citation** as either **A** or **B** based on the definitions below.
4.  **Output only a single letter: A or B.** Do not output any other text, explanation, or formatting.

## Category Definitions

### **Category A: PRIMARY DATA**
The data was generated, collected, or created by the **authors of the current study**. This is *their* data.
*   **Key Phrases:** "we collected", "we generated", "our data", "data are available at [URL/DOI]", "data have been deposited", "this study presents", "supplementary data".

### **Category B: SECONDARY DATA**
The data was produced by **other researchers** or external sources and is being reused or analyzed by the current study's authors.
*   **Key Phrases:** "data were obtained from", "publicly available data", "previously published data", "retrieved from", "downloaded from", "[Dataset Name] dataset", "database", citing a specific external source.

## Input Format
You will be provided with the following three pieces of information:
Paper Abstract: {abstract}
Citation: {dataset_id}
Citation Context: {context}


## Decision Framework
Answer these questions based on the **Citation Context**:

1.  **Who is the source of the data?**
    *   If the context implies the **authors themselves** are the source (e.g., "we," "our"), classify as **A**.
    *   If the context names an **external source** (e.g., a repository, another study, a database), classify as **B**.

2.  **What is the action being described?**
    *   **A (Primary)** actions: *depositing, making available, presenting* their own data.
    *   **B (Secondary)** actions: *using, obtaining, accessing, downloading, analyzing* existing data from elsewhere.

## Examples for Pattern Recognition

**Example 1 (Classify as B):**
*   Context: "Three out of four cohorts **used in this study** can be found on The Cancer Imaging Archive (TCIA)24: Canadian benchmark dataset23: https://doi.org/10.7937/K9/TCIA.2017.8oje5q00."
*   **Reasoning:** The authors are describing external datasets they **used** (a Secondary action). The source is TCIA, not themselves.
*   **Output:** B

**Example 2 (Classify as A):**
*   Context: "Additional research data **supporting this publication are available** at 10.25377/sussex.21184705."
*   **Reasoning:** The authors are stating the availability of data that **supports their own publication**. The source is implied to be themselves.
*   **Output:** A

**Example 3 (Classify as B):**
*   Context: "GBIF occurrence data: Vulpes vulpes: https://doi.org/10.15468/dl.wgtneb (28 May 2021)."
*   **Reasoning:** The data is explicitly sourced from an external repository (GBIF). The authors are referring to data they reused.
*   **Output:** B

**Example 4 (Classify as A):**
*   Context: "Data referring to Barbieux et al. (2017; https://doi.org/10.17882/49388) are freely available on SEANOE."
*   **Reasoning:** This is a tricky case. The citation format "(Author et al. Year)" suggests a literature reference. However, the phrase "Data referring to" and the direct data DOI indicate the authors are citing **their own previously published dataset** (from a 2017 paper) that is now available. This is their Primary data.
*   **Output:** A

## Ready for Input
Begin your analysis. Remember: Output only **A** or **B**.

'''

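def get_context_window(text: str, substring: str, window: int = 600) -> tuple[str, str]:
    """Return (context, abstract).

    context:  up to `window` characters on each side of the first match.
    abstract: the first 1000 characters of the paper, used as a stand-in for the abstract.
    """
    idx = text.find(substring)
    if idx == -1:
        return "no context", "no abstract"
    start = max(idx - window, 0)
    end = min(idx + len(substring) + window, len(text))
    return text[start:end], text[:1000]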




def find_context_win(tokenizer,df):
    text_df = pl.read_parquet('/tmp/context_data.parquet')
    # print(text_df)
    df = df.join(text_df, on=["article_id","dataset_id"], how="inner")
    df = df.drop("type")
    print(df)

    prompts = []
    
    for article_id,dataset_id,text,match in df.select(["article_id","dataset_id","text",'match']).rows():

        context, abstract = get_context_window(text,match)
        user_content = f"""
        Paper Abstract: {abstract}
        
        Citation: {dataset_id}

        
        Citation Context: {context}
        """
        messages = [
            {"role": "system", "content": PROMPT_CLASSIFY_CITATION_TYPE},
            {"role": "user", "content": user_content.strip()}
        ]
        prompts.append(
            tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        )
        
    return df.with_columns(pl.Series("prompt", prompts))

    

if __name__=="__main__":
    os.environ["VLLM_USE_V1"] = "0"
    MODEL_PATH = "/kaggle/input/qwen2.5/transformers/32b-instruct-awq/1"
    import vllm
    from logits_processor_zoo.vllm import MultipleChoiceLogitsProcessor

    llm = vllm.LLM(
        MODEL_PATH,
        quantization='awq',
        tensor_parallel_size=2,
        gpu_memory_utilization=0.9,
        trust_remote_code=True,
        dtype="half",
        enforce_eager=True,
        max_model_len=16384,
        disable_log_stats=True, 
        disable_custom_all_reduce=True,
        enable_prefix_caching=True,
        task='generate')

    tokenizer = llm.get_tokenizer()

    df=pl.read_csv("/kaggle/working/submission.csv")
    
    if "row_id" in df.columns:
        df = df.drop("row_id")


    doi_df = df.filter(is_doi_link("dataset_id"))
    acc_df = df.filter(~is_doi_link("dataset_id"))



    df = find_context_win(tokenizer,doi_df)

    
    
    prompts = df['prompt'].to_list()
    mclp = MultipleChoiceLogitsProcessor(tokenizer, choices=["A", "B"])
    sampling_params = vllm.SamplingParams(
        seed=777,
        temperature=0.8,
        skip_special_tokens=True,
        max_tokens=1,
        logits_processors=[mclp],
        logprobs=len(mclp.choices),
    )
    outputs = llm.generate(prompts, sampling_params, use_tqdm=True)
    # Map each candidate token to its log-probability, then take the most likely one.
    logprobs = [
        {lp.decoded_token: lp.logprob for lp in output.outputs[0].logprobs[0].values()}
        for output in outputs
    ]
    choices = [max(d, key=d.get) for d in logprobs]
    types = {'A': 'Primary', 'B': 'Secondary'}
    choices = [types[c] for c in choices]


    
    df = df.with_columns(pl.Series('type', choices))
    df.select('article_id', 'dataset_id','type').write_csv('/tmp/doi_sub.csv')

    acc_df = assume_type(acc_df)
    acc_df.select('article_id','dataset_id','type').write_csv("/tmp/accid_sub.csv")
    df = pl.concat([pl.read_csv('/tmp/doi_sub.csv'), pl.read_csv('/tmp/accid_sub.csv')])
    
    df.select(['article_id', 'dataset_id', 'type']).with_row_index(name='row_id').write_csv('/kaggle/working/submission.csv')
    # print(df)
    if not IS_KAGGLE_SUBMISSION:
        results = evaluate(df)
        for r in results: l.info(r) 
        results = evaluate(df, on=['article_id', 'dataset_id', 'type'])
        for r in results: l.info(r)
    
    
    # Release the model and any cached GPU memory.
    try:
        del llm, tokenizer
    except NameError:
        pass

    import gc
    import torch
    gc.collect()
    torch.cuda.empty_cache()

Writing /tmp/src/predict.py
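
Both LLM scripts assemble a system/user message pair per candidate before calling `tokenizer.apply_chat_template`. The following is a minimal sketch of that assembly step on its own, without a tokenizer. It joins the three fields explicitly so that source-file indentation cannot leak into the prompt. `build_messages` and the sample strings are hypothetical, and this is an illustration rather than the notebook's exact formatting.

```python
from typing import Dict, List


def build_messages(system_prompt: str, abstract: str, dataset_id: str, context: str) -> List[Dict[str, str]]:
    """Build the chat messages for one candidate citation."""
    user_content = "\n\n".join([
        f"Paper Abstract: {abstract}",
        f"Citation: {dataset_id}",
        f"Citation Context: {context}",
    ])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "Classify the citation as A or B.",  # placeholder system prompt
    "We study foxes.",
    "https://doi.org/10.15468/dl.wgtneb",
    "GBIF occurrence data: Vulpes vulpes",
)
print(messages[1]["content"])
```

The resulting list has the same shape as the `messages` passed to the chat template in the scripts above.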


In [10]:
%cd /tmp
!LOG_LEVEL=INFO python src/parse.py /tmp/train_parse
! python src/check_parse.py
! python src/getid.py
! python src/llm_validate.py
! python src/post_filter.py
! python src/post_validate.py
! python src/predict.py
! grep "f1:" /tmp/logs/project.log

/tmp
PDF→TXT:  13%|████▏                            | 67/524 [00:11<01:55,  3.96it/s]MuPDF error: unsupported error: cannot create appearance stream for  annotations

MuPDF error: unsupported error: cannot create appearance stream for  annota