```
conda env create -f environment.yml
```

The four datasets used in the experiments are available at BIOC-GS, GSC-2024, ID-68, and NCBI. These datasets should be formatted according to the NCBI format and processed by the function `src.evaluate.GSCplus_corpus_gold`, following PhenoTagger's pipeline. The processed datasets are provided in `data/corpus.zip`.
The two ontologies used are available at HPO and MEDIC, and are also provided in `ontology.zip`.
Please unzip the files if you need to use them.
```
unzip data/corpus.zip -d data/
unzip ontology.zip
```

Activate the environment and enter the source directory:

```
conda activate autopcr
cd src
```

Build the ontology dictionary with `build_dict.py`.

Parameters:
- --input, -i, help='input the ontology .obo file', default='../ontology/hp_20240208.obo'
- --output, -o, help='the output path of dictionary', default='../dict/HPO'
- --rootnode, -r, help='input the root node of the ontology', nargs='+', default=['HP:0000118']
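Conceptually, building the dictionary amounts to parsing the `.obo` term stanzas and collecting every term under the given root node. The sketch below illustrates that idea under a simplified OBO layout; it is not the actual `build_dict.py` implementation.

```python
import re
from collections import defaultdict

def parse_obo(text):
    """Parse [Term] stanzas into {id: {"name", "synonyms", "is_a"}} (simplified)."""
    terms = {}
    for stanza in text.split("[Term]")[1:]:
        tid, term = None, {"name": None, "synonyms": [], "is_a": []}
        for line in stanza.strip().splitlines():
            if line.startswith("id: "):
                tid = line[4:].strip()
            elif line.startswith("name: "):
                term["name"] = line[6:].strip()
            elif line.startswith("synonym: "):
                m = re.match(r'synonym: "(.*?)"', line)
                if m:
                    term["synonyms"].append(m.group(1))
            elif line.startswith("is_a: "):
                term["is_a"].append(line[6:].split("!")[0].strip())
        if tid:
            terms[tid] = term
    return terms

def terms_under_root(terms, root):
    """Return the root plus every descendant reachable through is_a edges."""
    children = defaultdict(list)
    for tid, t in terms.items():
        for parent in t["is_a"]:
            children[parent].append(tid)
    seen, stack = set(), [root]
    while stack:
        tid = stack.pop()
        if tid not in seen:
            seen.add(tid)
            stack.extend(children[tid])
    return seen

# Tiny inline ontology for demonstration only.
OBO = """
[Term]
id: HP:0000001
name: All

[Term]
id: HP:0000118
name: Phenotypic abnormality
is_a: HP:0000001 ! All

[Term]
id: HP:0001250
name: Seizure
synonym: "Epileptic seizure" EXACT []
is_a: HP:0000118 ! Phenotypic abnormality
"""
terms = parse_obo(OBO)
```

Restricting to a root such as `HP:0000118` keeps only phenotype terms, which is why `--rootnode` defaults to it.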
Examples:

```
python build_dict.py -i ../ontology/hp_20240208.obo -o ../dict/HPO -r HP:0000118
python build_dict.py -i ../ontology/CTD_diseases.obo -o ../dict/MEDIC -r MESH:C
```

Build the retrieval index with `build_index.py`.

Parameters:
- --ontology_dict, help='folder of the ontology dictionary', type=str, required=True
- --ccr, help='candidate concept retrieval model', type=str, default='cambridgeltl/SapBERT-from-PubMedBERT-fulltext'
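The index step embeds each dictionary entry once so that mentions can later be matched by nearest-neighbor cosine search. Below is a toy sketch of those mechanics; the real pipeline embeds with the SapBERT model passed via `--ccr`, whereas the hashed-trigram embedding here is only a stand-in for illustration.

```python
import zlib
import numpy as np

def embed(text, dim=256):
    """Stand-in embedding: hashed character trigrams (SapBERT in the real pipeline)."""
    v = np.zeros(dim)
    t = f"  {text.lower()}  "
    for i in range(len(t) - 2):
        v[zlib.crc32(t[i:i + 3].encode()) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def build_index(names):
    """Stack one normalized vector per dictionary name into a search matrix."""
    return np.stack([embed(n) for n in names])

def retrieve(index, names, mention, k=5):
    """Return the top-k (name, cosine score) candidates for a mention."""
    scores = index @ embed(mention)
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]

names = ["Seizure", "Short stature", "Hearing impairment"]
index = build_index(names)
```

Because the vectors are normalized at build time, retrieval at evaluation time reduces to a single matrix-vector product.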
Examples:

```
python build_index.py --ontology_dict ../dict/HPO
python build_index.py --ontology_dict ../dict/MEDIC
```

Run the evaluation with `HPO_evaluation.py`.

Parameters:
- --ontology_dict, help='folder of the ontology dictionary', type=str, required=True
- --corpus, -c, help='input corpus dataset', type=str, required=True
- --output, -o, help='output prediction file', type=str, required=True
- --doc_id, help='specific docs to test', nargs='*', type=str
- --ee, help='entity extraction method: (1) "rule": PhenoTagger-style rule-based, (2) "neural": BioNER, (3) "neural+": BioNER + benepar, (4) "neural++": BioNER + benepar + coordination decomposition', choices=['rule', 'neural', 'neural+', 'neural++'], type=str, default='neural++'
- --abbr_recog, help='whether to identify abbreviations', action='store_true'
- --ccr, help='candidate concept retrieval model', type=str, default='cambridgeltl/SapBERT-from-PubMedBERT-fulltext'
- --tau_1, help='high-confidence threshold for candidate concept retrieval', type=float, default=0.95
- --tau_2, help='low-confidence threshold for candidate concept retrieval', type=float, default=0.85
- --k, help='number of retrieved candidate concepts', type=int, default=5
- --el, help='entity linking llm backend; "none" to disable', type=str, default='Qwen/Qwen3-Next-80B-A3B-Instruct'
- --api_provider, help='api provider for llm inference', choices=['openai', 'groq', 'together', 'vllm'], type=str, default='together'
- --api_key, help='api key for api provider, api url if api provider is vllm, or loaded from .env (e.g., TOGETHER_API_KEY) if not provided', type=str
- --use_finetuning_prompt, help='simplified prompt for finetuned llm', action='store_true'
- --seed, help='seed for llm', type=int, default=0
- --use_cache, help='whether to cache results', action='store_true'
- --only_longest, help='whether the output only keeps the longest nested entity', action='store_true'
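One plausible reading of the two thresholds is a three-way decision: accept the top candidate outright above `tau_1`, defer the top-k candidates to the LLM linker between `tau_2` and `tau_1`, and treat the mention as unlinkable below `tau_2`. The sketch below follows that assumption; the authoritative logic is in `HPO_evaluation.py`.

```python
def link_decision(candidates, tau_1=0.95, tau_2=0.85):
    """candidates: list of (concept_id, score), sorted by score descending.

    Illustrative three-regime rule suggested by the parameter descriptions:
      score >= tau_1          -> accept the top concept without the LLM
      tau_2 <= score < tau_1  -> defer the candidate list to the LLM linker
      score < tau_2           -> reject the mention as unlinkable
    """
    if not candidates:
        return ("reject", None)
    top_id, top_score = candidates[0]
    if top_score >= tau_1:
        return ("accept", top_id)
    if top_score >= tau_2:
        return ("llm", [cid for cid, _ in candidates])
    return ("reject", None)
```

Under this reading, raising `tau_1` or lowering `tau_2` routes more mentions through the LLM backend selected by `--el`.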
Examples:

```
python HPO_evaluation.py --ontology_dict ../dict/HPO -c BIOC-GS -o ../results/bioc-gs.tsv --only_longest
python HPO_evaluation.py --ontology_dict ../dict/HPO -c GSC-2024 -o ../results/gsc-2024.tsv
python HPO_evaluation.py --ontology_dict ../dict/HPO -c ID-68 -o ../results/id-68.tsv --only_longest
python HPO_evaluation.py --ontology_dict ../dict/MEDIC -c NCBI -o ../results/ncbi.tsv --only_longest --abbr_recog
```

First, build the training data:
```
python build_training_dataset.py
```

Then, install unsloth in a new environment and use it to fine-tune the entity linker:
```
python unsloth_finetune_30b.py
```

To use the fine-tuned model, you may want to serve it using vllm and additionally set `--api_provider vllm --api_key <Your_API_URL> --use_finetuning_prompt` in the last step. A fine-tuned model is also provided here.
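For orientation, fine-tuning frameworks such as unsloth typically consume JSONL files of chat-style examples. The snippet below sketches one hypothetical record; the actual prompt/answer schema is whatever `build_training_dataset.py` emits, so treat the field contents here as invented.

```python
import json

# Hypothetical record: the real schema is defined by build_training_dataset.py.
# This only illustrates the JSONL chat format commonly used for fine-tuning.
example = {
    "messages": [
        {"role": "user",
         "content": ("Mention: seizures\n"
                     "Candidates: HP:0001250 Seizure; HP:0002133 Status epilepticus\n"
                     "Which concept matches?")},
        {"role": "assistant", "content": "HP:0001250"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

Each line of the JSONL is one independent training example, which is why the file is written record-by-record rather than as a single JSON array.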
We provide the pre-computed results of AutoPCR in the `results/submission/` folder and those of the baselines on Google Drive.
This project is adapted in part from the following repositories:
We thank the authors for open-sourcing their code.