GERBERA

We present GERBERA (Transfer Learning for General-to-Biomedical Entity Recognition Augmentation), a multi-task learning method that uses knowledge from general-domain NER datasets to improve performance on BioNER datasets, especially on datasets of limited size. Please refer to our paper, Augmenting biomedical named entity recognition with general-domain resources, for more details.

Install GERBERA Environment

To set up the GERBERA environment, follow these steps. Ensure that a working conda installation is available before you begin.

# Install torch
conda create -n GERBERA python=3.7
conda activate GERBERA
conda install pytorch==1.9.0 cudatoolkit=10.2 -c pytorch

# Install GERBERA
git clone https://github.com/qingyu-qc/bioner_gerbera.git
cd bioner_gerbera
pip install -r requirements.txt
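
After installation, it can be useful to confirm that the active environment matches the versions pinned above (Python 3.7, PyTorch 1.9.0). The following is a minimal sketch, not part of the repository; the function name `check_environment` and the report keys are hypothetical, and it only reads version information.

```python
import importlib
import sys


def check_environment(expected_python=(3, 7), expected_torch="1.9.0"):
    """Compare the active interpreter and PyTorch build with the pinned versions."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": tuple(sys.version_info[:2]) == tuple(expected_python),
        "torch": None,
        "torch_ok": False,
        "cuda_available": False,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return report  # PyTorch is not installed in this environment
    report["torch"] = torch.__version__
    # Builds may carry a local suffix such as "+cu102", so compare the prefix only.
    report["torch_ok"] = torch.__version__.startswith(expected_torch)
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated GERBERA environment; a `False` entry points at the component that differs from the pinned setup.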

Dataset

Please download the necessary BioNER and general-domain NER datasets from here. Ensure the datasets are placed in the correct directory structure as expected by the training scripts.
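
The expected directory structure is not spelled out here, so the sketch below relies on what the training commands in this README suggest: `--data_dir NERdata/` holds one subdirectory per dataset, named as in `--data_list`, and a `labels.txt` file sits inside the dataset directory. Both points are assumptions inferred from those commands, and `missing_dataset_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_dataset_files(data_dir, data_list, required=("labels.txt",)):
    """Return the paths that are expected but absent under data_dir.

    data_list uses the same '+'-separated form as --data_list,
    e.g. "NCBI-disease+CoNLL2003". The per-dataset subdirectory layout
    and the default required file are assumptions, not documented facts.
    """
    root = Path(data_dir)
    missing = []
    for name in data_list.split("+"):
        dataset_dir = root / name
        if not dataset_dir.is_dir():
            missing.append(dataset_dir)
            continue
        for filename in required:
            if not (dataset_dir / filename).is_file():
                missing.append(dataset_dir / filename)
    return missing


if __name__ == "__main__":
    for path in missing_dataset_files("NERdata/", "NCBI-disease+CoNLL2003"):
        print(f"missing: {path}")
```

Pass `required=()` to check only that the dataset directories exist.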

Models

You can download our GERBERA model from here for BioNER tasks, including disease, gene, chemical, species, DNA, RNA, cell type, and cell line.

Download the baseline model

Download the pre-trained baseline model for initialization.

wget https://dl.fbaipublicfiles.com/biolm/RoBERTa-large-PM-M3-Voc-hf.tar.gz
tar -zxvf RoBERTa-large-PM-M3-Voc-hf.tar.gz

Training

Multi-task training:

Train GERBERA jointly on a BioNER dataset and a general-domain NER dataset.

python run_ner.py \
--model_name_or_path ./RoBERTa-large-PM-M3-Voc-hf \
--data_dir NERdata/ \
--labels NERdata/NCBI-disease/labels.txt \
--output_dir ./gerbera_model \
--data_list NCBI-disease+CoNLL2003 \
--eval_data_list NCBI-disease \
--num_train_epochs 20 \
--max_seq_length 128 \
--warmup_steps 0 \
--learning_rate 3e-5 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--seed 1 \
--logging_steps 5000 \
--evaluate_during_training \
--save_steps 10000 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir
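
When the same command has to be repeated for several dataset pairs, building the argument list in Python avoids shell line-continuation mistakes. The sketch below reuses only the flags and values shown in the command above; `build_training_command` is a hypothetical helper, and the assumption that every BioNER dataset keeps its `labels.txt` under `NERdata/<dataset>/` is inferred from the NCBI-disease example.

```python
import shlex
import sys


def build_training_command(bio_dataset, general_dataset=None, epochs=20,
                           model_path="./RoBERTa-large-PM-M3-Voc-hf",
                           data_dir="NERdata/", output_dir="./gerbera_model"):
    """Build the run_ner.py argument list for one training run."""
    # Multi-task runs join the two dataset names with '+', as in --data_list above.
    data_list = bio_dataset if general_dataset is None else f"{bio_dataset}+{general_dataset}"
    return [
        sys.executable, "run_ner.py",
        "--model_name_or_path", model_path,
        "--data_dir", data_dir,
        # Assumed location of the label file for the chosen BioNER dataset.
        "--labels", f"{data_dir.rstrip('/')}/{bio_dataset}/labels.txt",
        "--output_dir", output_dir,
        "--data_list", data_list,
        "--eval_data_list", bio_dataset,
        "--num_train_epochs", str(epochs),
        "--max_seq_length", "128",
        "--warmup_steps", "0",
        "--learning_rate", "3e-5",
        "--per_device_train_batch_size", "16",
        "--per_device_eval_batch_size", "16",
        "--seed", "1",
        "--logging_steps", "5000",
        "--evaluate_during_training",
        "--save_steps", "10000",
        "--do_train", "--do_eval", "--do_predict",
        "--overwrite_output_dir",
    ]


if __name__ == "__main__":
    command = build_training_command("NCBI-disease", "CoNLL2003")
    # Print the command for inspection; to launch it, pass the list to
    # subprocess.run(command, check=True) from the repository root.
    print(" ".join(shlex.quote(part) for part in command))
```

Omitting `general_dataset` yields a single-dataset `--data_list`, which matches the form used for fine-tuning on one BioNER dataset.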

Biomedical finetuning

After the initial multi-task training, further fine-tune the saved model on the specific BioNER dataset.

# --model_name_or_path can also be set to "Euanyu/GERBERA-NCBI"
python run_ner.py \
--model_name_or_path ./gerbera_model/RoBERTa-ncbi \
--data_dir NERdata/ \
--labels NERdata/NCBI-disease/labels.txt \
--output_dir ./gerbera_model \
--data_list NCBI-disease \
--eval_data_list NCBI-disease \
--num_train_epochs 10 \
--max_seq_length 128 \
--warmup_steps 0 \
--learning_rate 3e-5 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--seed 1 \
--logging_steps 5000 \
--evaluate_during_training \
--save_steps 10000 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir

Evaluation

Evaluate the fine-tuned model on various BioNER datasets to measure its performance.

python run_eval.py \
--model_name_or_path ./gerbera_model/RoBERTa-ncbi \
--data_dir NERdata/ \
--labels NERdata/NCBI-disease/labels.txt \
--output_dir ./gerbera_model \
--eval_data_type linnaeus \
--eval_data_list linnaeus \
--max_seq_length 128 \
--per_device_eval_batch_size 32 \
--seed 1 \
--do_eval \
--do_predict \
--overwrite_output_dir

Colab example

This Colab tutorial guides you through setting up the GERBERA environment, running model training scripts, and performing evaluations. Additionally, it includes instructions for downloading our pre-trained model from Hugging Face and demonstrates how to conduct evaluations using this model.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact Information

For help or issues using GERBERA, please submit a GitHub issue. For other communication related to GERBERA, please contact Yu Yin (yinyu201906 (at) gmail (dot) com).

Citation

Yin, Y., Kim, H., Xiao, X., Wei, C.H., Kang, J., Lu, Z., Xu, H., Fang, M. and Chen, Q., 2024. Augmenting Biomedical Named Entity Recognition with General-domain Resources. Journal of Biomedical Informatics.

@article{YIN2024104731,
title = {Augmenting biomedical named entity recognition with general-domain resources},
author = {Yu Yin and Hyunjae Kim and Xiao Xiao and Chih Hsuan Wei and Jaewoo Kang and Zhiyong Lu and Hua Xu and Meng Fang and Qingyu Chen},
journal = {Journal of Biomedical Informatics},
volume = {159},
pages = {104731},
year = {2024},
issn = {1532-0464},
doi = {https://doi.org/10.1016/j.jbi.2024.104731}
}
