BIOS_EntityClassification

Overview

A Semantic Type Annotator (STA) that predicts the semantic type of a medical term, using the term itself and its surrounding text as input. The training samples were multi-word terms from the UMLS, since such terms generally do not have ambiguous everyday meanings. The semantic types of these terms were used as labels; when a term has multiple semantic types in the UMLS, one is chosen at random, relying on the fact that a large sample size can overcome moderate label noise. The classification model was fine-tuned from PubMedBERT, a BERT model pretrained on PubMed abstracts.
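The random label choice described above can be illustrated with a minimal sketch. The function name `sample_labels`, the fixed seed, and the toy term-to-type mapping are assumptions made for illustration; they are not taken from this repository or from actual UMLS records.

```python
import random


def sample_labels(term_to_types, seed=0):
    """Pick one semantic type per term, uniformly at random.

    Hypothetical helper: terms with a single type keep it, terms with
    several types get one of them chosen at random.
    """
    rng = random.Random(seed)  # seeded so the sampled labels are reproducible
    labels = {}
    for term in sorted(term_to_types):
        # Sort the candidate types so the draw does not depend on set ordering.
        candidates = sorted(term_to_types[term])
        labels[term] = rng.choice(candidates)
    return labels


# Toy input, not real UMLS data.
toy_types = {
    "myocardial infarction": {"Disease or Syndrome"},
    "aspirin tablet": {"Pharmacologic Substance", "Organic Chemical"},
}
labels = sample_labels(toy_types)
```

Each term ends up with exactly one label, so the result can be fed to a standard single-label classifier. The occasional "wrong" draw for a multi-type term is the moderate noise that the large sample size is expected to absorb.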

Repo Contents

pretrain: scripts for downloading pretrained models
example: a small dataset for demonstrating the code
train: training code
predict: prediction code
utils: utility code

System Requirements

Hardware Requirements

For optimal performance, we recommend a computer with the following specs:

RAM: 16+ GB
CPU: 4+ cores, 3.3+ GHz/core
GPU Memory: 40 GB

Software requirements

The package has been tested on a Linux 20.04 operating system.

Install requirements

pip install -r requirements.txt

Installation Guide

Download cleanterms5.txt (password: d9ol) and put it under ./example/cleanterms/.
Download the pretrained models: cd pretrain && sh download_pubmedbert.sh
The download takes a few minutes to complete.

Demo

Train

Make sure cleanterms5.txt has been downloaded, then:

1. cd train
2. bash train.sh

After a few minutes, you can find your fine-tuned model and evaluation results under ./output_xxx by default.

Predict

Set the path of your fine-tuned model in predict.sh, then:

1. cd predict
2. bash predict.sh

After a few minutes, you can find the results under ./output_xxx.

Instructions for use

Generate your own cleanterms.txt from the UMLS according to your own rules, then run python utils/label_util.py to generate entity_type.json and entity_group.json.
Prepare your training texts and the FMM results.
Use your datasets to train and predict.
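The steps above mention FMM results without describing how they are produced. Assuming FMM stands for forward maximum matching (a greedy, longest-match-first dictionary lookup), the idea can be sketched as follows. The function name, the whitespace tokenization, the `max_len` window, and the toy vocabulary are all assumptions for illustration; the input and output formats the repository's scripts actually expect may differ.

```python
def forward_maximum_match(tokens, vocabulary, max_len=5):
    """Return (start, end, term) spans found by greedy longest-match-first scanning.

    Hypothetical helper: `vocabulary` is a set of lowercase, space-joined terms,
    and `end` is exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest window first and shrink until a dictionary term is found.
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = " ".join(tokens[i:j]).lower()
            if candidate in vocabulary:
                match_end = j
                break
        if match_end is None:
            i += 1  # no term starts here, move one token forward
        else:
            spans.append((i, match_end, " ".join(tokens[i:match_end])))
            i = match_end  # continue after the matched term
    return spans


# Toy vocabulary standing in for the terms in cleanterms.txt.
vocab = {"heart failure", "congestive heart failure", "beta blocker"}
text = "Patients with congestive heart failure received a beta blocker"
spans = forward_maximum_match(text.split(), vocab)
print(spans)  # [(2, 5, 'congestive heart failure'), (7, 9, 'beta blocker')]
```

Because the longest window is tried first, "congestive heart failure" is matched as one term instead of the shorter "heart failure". Each matched span, together with the sentence around it, is the kind of term-plus-context input the annotator is described as using.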

Citation

@misc{https://doi.org/10.48550/arxiv.2203.09975,
  doi = {10.48550/ARXIV.2203.09975},
  url = {https://arxiv.org/abs/2203.09975},
  author = {Yu, Sheng and Yuan, Zheng and Xia, Jun and Luo, Shengxuan and Ying, Huaiyuan and Zeng, Sihang and Ren, Jingyi and Yuan, Hongyi and Zhao, Zhengyun and Lin, Yucong and Lu, Keming and Wang, Jing and Xie, Yutao and Shum, Heung-Yeung},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {BIOS: An Algorithmically Generated Biomedical Knowledge Graph},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
