```
conda env create -f environment.yml
```

The four datasets used in the experiments are available at BIOC-GS, GSC-2024, ID-68, and NCBI. These datasets should be formatted according to the NCBI format and processed by the function `src.evaluate.GSCplus_corpus_gold`, following PhenoTagger's pipeline. The processed datasets are provided in `data/corpus.zip`.
The two ontologies used are available at HPO and MEDIC, and are also provided in `ontology.zip`.
Please unzip the files if you need to use them.
```
unzip data/corpus.zip -d data/
unzip ontology.zip
```

Activate the environment and enter the source directory:

```
conda activate autopcr
cd src
```

Build the ontology dictionary with `build_dict.py`.

Parameters:
- --input, -i, help='input the ontology .obo file', default='../ontology/hp_20240208.obo'
- --output, -o, help='the output path of dictionary', default='../dict/HPO'
- --rootnode, -r, help='input the root node of the ontology', nargs='+', default=['HP:0000118']
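Conceptually, building the dictionary amounts to parsing the `.obo` term stanzas and collecting every term under the given root node. The sketch below illustrates that idea under a simplified OBO layout; it is not the actual `build_dict.py` implementation.

```python
import re
from collections import defaultdict

def parse_obo(text):
    """Parse [Term] stanzas into {id: {"name", "synonyms", "is_a"}} (simplified)."""
    terms = {}
    for stanza in text.split("[Term]")[1:]:
        tid, term = None, {"name": None, "synonyms": [], "is_a": []}
        for line in stanza.strip().splitlines():
            if line.startswith("id: "):
                tid = line[4:].strip()
            elif line.startswith("name: "):
                term["name"] = line[6:].strip()
            elif line.startswith("synonym: "):
                m = re.match(r'synonym: "(.*?)"', line)
                if m:
                    term["synonyms"].append(m.group(1))
            elif line.startswith("is_a: "):
                term["is_a"].append(line[6:].split("!")[0].strip())
        if tid:
            terms[tid] = term
    return terms

def terms_under_root(terms, root):
    """Return the root plus every descendant reachable through is_a edges."""
    children = defaultdict(list)
    for tid, t in terms.items():
        for parent in t["is_a"]:
            children[parent].append(tid)
    seen, stack = set(), [root]
    while stack:
        tid = stack.pop()
        if tid not in seen:
            seen.add(tid)
            stack.extend(children[tid])
    return seen

# Tiny inline ontology for demonstration only.
OBO = """
[Term]
id: HP:0000001
name: All

[Term]
id: HP:0000118
name: Phenotypic abnormality
is_a: HP:0000001 ! All

[Term]
id: HP:0001250
name: Seizure
synonym: "Epileptic seizure" EXACT []
is_a: HP:0000118 ! Phenotypic abnormality
"""
terms = parse_obo(OBO)
```

Restricting to a root such as `HP:0000118` keeps only phenotype terms, which is why `--rootnode` defaults to it.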
Examples:

```
python build_dict.py -i ../ontology/hp_20240208.obo -o ../dict/HPO -r HP:0000118
python build_dict.py -i ../ontology/CTD_diseases.obo -o ../dict/MEDIC -r MESH:C
```

Build the retrieval index with `build_index.py`.

Parameters:
- --ontology_dict, help='folder of the ontology dictionary', type=str, required=True
- --ccr, help='candidate concept retrieval model', type=str, default='cambridgeltl/SapBERT-from-PubMedBERT-fulltext'
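The index step embeds each dictionary entry once so that mentions can later be matched by nearest-neighbor cosine search. Below is a toy sketch of those mechanics; the real pipeline embeds with the SapBERT model passed via `--ccr`, whereas the hashed-trigram embedding here is only a stand-in for illustration.

```python
import zlib
import numpy as np

def embed(text, dim=256):
    """Stand-in embedding: hashed character trigrams (SapBERT in the real pipeline)."""
    v = np.zeros(dim)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        v[zlib.crc32(t[i:i + 3].encode()) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def build_index(names):
    """Stack one normalized vector per dictionary name into a search matrix."""
    return np.stack([embed(n) for n in names])

def retrieve(index, names, mention, k=5):
    """Return the top-k (name, cosine score) candidates for a mention."""
    scores = index @ embed(mention)
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]

names = ["Seizure", "Short stature", "Hearing impairment"]
index = build_index(names)
```

Because the vectors are normalized at build time, retrieval at evaluation time reduces to a single matrix-vector product.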
Examples:

```
python build_index.py --ontology_dict ../dict/HPO
python build_index.py --ontology_dict ../dict/MEDIC
```

Run the evaluation with `HPO_evaluation.py`.

Parameters:
- --ontology_dict, help='folder of the ontology dictionary', type=str, required=True
- --corpus, -c, help='input corpus dataset', type=str, required=True
- --output, -o, help='output prediction file', type=str, required=True
- --doc_id, help='specific docs to test', nargs='*', type=str
- --ee, help='entity extraction method: (1) "rule": PhenoTagger-style rule-based, (2) "neural": BioNER, (3) "neural+": BioNER + benepar, (4) "neural++": BioNER + benepar + coordination decomposition', choices=['rule', 'neural', 'neural+', 'neural++'], type=str, default='neural++'
- --abbr_recog, help='whether to identify abbreviations', action='store_true'
- --ccr, help='candidate concept retrieval model', type=str, default='cambridgeltl/SapBERT-from-PubMedBERT-fulltext'
- --tau_1, help='high-confidence threshold for candidate concept retrieval', type=float, default=0.95
- --tau_2, help='low-confidence threshold for candidate concept retrieval', type=float, default=0.85
- --k, help='number of retrieved candidate concepts', type=int, default=5
- --el, help='entity linking llm backend; "none" to disable', type=str, default='Qwen/Qwen3-Next-80B-A3B-Instruct'
- --api_provider, help='api provider for llm inference', choices=['openai', 'groq', 'together', 'vllm'], type=str, default='together'
- --api_key, help='api key for api provider, api url if api provider is vllm, or loaded from .env (e.g., TOGETHER_API_KEY) if not provided', type=str
- --use_finetuning_prompt, help='simplified prompt for finetuned llm', action='store_true'
- --seed, help='seed for llm', type=int, default=0
- --use_cache, help='whether to cache results', action='store_true'
- --only_longest, help='whether the output only keeps the longest nested entity', action='store_true'
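One plausible reading of the two thresholds is a three-way decision: accept the top candidate outright above `tau_1`, defer the top-k candidates to the LLM linker between `tau_2` and `tau_1`, and treat the mention as unlinkable below `tau_2`. The sketch below follows that assumption; the authoritative logic is in `HPO_evaluation.py`.

```python
def link_decision(candidates, tau_1=0.95, tau_2=0.85):
    """candidates: list of (concept_id, score), sorted by score descending.

    Illustrative three-regime rule suggested by the parameter descriptions:
      score >= tau_1          -> accept the top concept without the LLM
      tau_2 <= score < tau_1  -> defer the candidate list to the LLM linker
      score < tau_2           -> reject the mention as unlinkable
    """
    if not candidates:
        return ("reject", None)
    top_id, top_score = candidates[0]
    if top_score >= tau_1:
        return ("accept", top_id)
    if top_score >= tau_2:
        return ("llm", [cid for cid, _ in candidates])
    return ("reject", None)
```

Under this reading, raising `tau_1` or lowering `tau_2` routes more mentions through the LLM backend selected by `--el`.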
Examples:

```
python HPO_evaluation.py --ontology_dict ../dict/HPO -c BIOC-GS -o ../results/bioc-gs.tsv --only_longest
python HPO_evaluation.py --ontology_dict ../dict/HPO -c GSC-2024 -o ../results/gsc-2024.tsv
python HPO_evaluation.py --ontology_dict ../dict/HPO -c ID-68 -o ../results/id-68.tsv --only_longest
python HPO_evaluation.py --ontology_dict ../dict/MEDIC -c NCBI -o ../results/ncbi.tsv --only_longest --abbr_recog
```

First, build the training data:
```
python build_training_dataset.py
```

Then, install unsloth in a new environment and use it to fine-tune the entity linker:
```
python unsloth_finetune_30b.py
```

To use the fine-tuned model, you may want to serve it using vllm and additionally set `--api_provider vllm --api_key <Your_API_URL> --use_finetuning_prompt` in the last step. A fine-tuned model is also provided here.
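For orientation, fine-tuning frameworks such as unsloth typically consume JSONL files of chat-style examples. The snippet below sketches one hypothetical record; the actual prompt/answer schema is whatever `build_training_dataset.py` emits, so treat the field contents here as invented.

```python
import json

# Hypothetical record: the real schema is defined by build_training_dataset.py.
# This only illustrates the JSONL chat format commonly used for fine-tuning.
example = {
    "messages": [
        {"role": "user",
         "content": ("Mention: seizures\n"
                     "Candidates: HP:0001250 Seizure; HP:0002133 Status epilepticus\n"
                     "Which concept matches?")},
        {"role": "assistant", "content": "HP:0001250"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

Each line of the JSONL is one independent training example, which is why the file is written record-by-record rather than as a single JSON array.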
We provide the pre-computed results of AutoPCR in the `results/submission/` folder and those of the baselines on Google Drive.
This project is adapted in part from the following repositories:
We thank the authors for open-sourcing their code.