BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition.

This repo provides the model from our paper BIOptimus: Pre-training an Optimal Biomedical Language Model with Curriculum Learning for Named Entity Recognition (BioNLP workshop @ ACL 2023). BIOptimus is a biomedical language model pre-trained with contextualized weight distillation and Curriculum Learning; the checkpoint is available on HuggingFace.
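
The model can be loaded with the standard transformers API. Below is a minimal sketch; the Hub id `rttl-ai/BIOptimus-0.4` and the `[MASK]` token are assumptions based on the model being BERT-style, so check the model card for the exact identifier.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "rttl-ai/BIOptimus-0.4"  # assumed Hub id; verify on the model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

text = "The patient was treated with [MASK] for hypertension."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 predictions for the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```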


Contextualization of token embeddings leverages the BERT model's ability to create domain-specific word embeddings aligned with the corpus in which the tokens occur. Contextualization is performed with qik-find, our resource-efficient, performant tool written in Rust. Qik-find is purpose-built to find tokens of interest and extract the corresponding sentences from a large corpus, exploiting Rust's native support for efficient multiprocessing. (link to Qik-find repository).
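
Qik-find itself is a Rust tool and is not reproduced here; the following Python sketch only illustrates the underlying idea (scan corpus shards in parallel, keep sentences containing tokens of interest). The token set, shard paths, and sentence splitter are hypothetical placeholders.

```python
import multiprocessing as mp
import re

# Hypothetical inputs: tokens of interest and corpus shard paths.
TOKENS = {"aspirin", "ibuprofen"}
SHARDS = ["corpus/shard_000.txt", "corpus/shard_001.txt"]

SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary

def extract(path):
    """Return sentences from one shard that contain any token of interest."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            for sent in SENT_SPLIT.split(line.strip()):
                words = {w.lower() for w in re.findall(r"\w+", sent)}
                if words & TOKENS:
                    hits.append(sent)
    return hits

if __name__ == "__main__":
    # One worker per shard, mirroring the parallel scan described above.
    with mp.Pool() as pool:
        for shard_hits in pool.map(extract, SHARDS):
            for sent in shard_hits:
                print(sent)
```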

Curriculum Learning is an easy-to-hard strategy that guides the model's learning process more smoothly. We formulate our curriculum from the perspective of the complexity of the prediction task.

| Version | Phase | Masking strategy | Masking rate | Corruption strategy |
|---------|-------|------------------|--------------|---------------------|
| 0.1 | phase 1 | token-level | 0.15 | with corruption |
| 0.2 | phase 2 | WWM (whole-word masking) | 0.15 | with corruption |
| 0.3 | phase 3 | WWM | 0.2 | with corruption |
| 0.4 | phase 4 | WWM | 0.2 | no corruption |

Varying training techniques such as the masking rate and masking strategy broadens the model's experience, gives it more diversified knowledge of the textual input, speeds up pre-training, and enhances performance on downstream tasks like NER.
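
The schedule in the table above could be expressed as a simple phase table in code. This is a minimal sketch for illustration: the paper's exact step boundaries between phases are not given here, so equal-length phases are assumed, and the field names are hypothetical.

```python
# Curriculum phases from the table above.
# "token" = individual-token masking; "wwm" = whole-word masking.
# "corrupt" = BERT-style corruption of a fraction of masked positions.
PHASES = [
    {"name": "phase 1", "strategy": "token", "mask_rate": 0.15, "corrupt": True},
    {"name": "phase 2", "strategy": "wwm",   "mask_rate": 0.15, "corrupt": True},
    {"name": "phase 3", "strategy": "wwm",   "mask_rate": 0.20, "corrupt": True},
    {"name": "phase 4", "strategy": "wwm",   "mask_rate": 0.20, "corrupt": False},
]

def phase_for(step: int, total_steps: int) -> dict:
    """Pick the active phase for a training step, assuming equal-length phases."""
    idx = min(int(4 * step / total_steps), 3)
    return PHASES[idx]
```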

