BELB: Biomedical Entity Linking Benchmark

The Biomedical Entity Linking Benchmark (BELB) is a collection of datasets and knowledge bases for training and evaluating biomedical entity linking models.

Citing
Data
- Knowledge Bases
- Corpora
Setup
API
Roadmap

Citing

If you use BELB in your work, please cite:

@article{10.1093/bioinformatics/btad698,
    author = {Garda, Samuele and Weber-Genzel, Leon and Martin, Robert and Leser, Ulf},
    title = {{BELB}: a {B}iomedical {E}ntity {L}inking {B}enchmark},
    journal = {Bioinformatics},
    pages = {btad698},
    year = {2023},
    month = {11},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btad698},
    url = {https://doi.org/10.1093/bioinformatics/btad698},
    eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad698/53483107/btad698.pdf},
}

Data

Knowledge Bases

Knowledge Base	Entity	Public	Versioned	Website	Download
NCBI Gene	Gene	✅	❌	homepage	kb, history
NCBI Taxonomy	Species	✅	❌	homepage	kb, history
CTD Diseases (MEDIC)	Disease	✅	❌	homepage	kb
CTD Chemicals	Chemical	✅	❌	homepage	kb
dbSNP	Variant	✅	✅	homepage	kb, history
Cellosaurus	Cell line	✅	❌	homepage	kb, history
UMLS	General	❌	✅	homepage	-

Corpora

Corpus	Entity	Public	Website	Download
GNormPlus (improved BC2)	Gene	✅	homepage	link
NLM-Gene	Gene	✅	homepage	link
NCBI-Disease	Disease	✅	homepage	link
BC5CDR	Disease, Chemical	✅	homepage	link
NLM-Chem	Chemical	✅	homepage	link
Linnaeus	Species	✅	homepage	link
S800	Species	✅	homepage	link
BioID	Cell, Species, Gene	✅	homepage	link
Osiris	Gene, Variant	✅	homepage	link
Thomas2011	Variant	✅	homepage	link
tmVar (v3)	Gene, Species, Variant	✅	homepage	link
MedMentions	UMLS	✅	homepage	link

Setup

We assume that all data will be stored in a single directory.

This reduces flexibility, but because all data (corpora and KBs) are interconnected, it is a trade-off that eases accessibility.

PubTator database

Download the PubTator raw data (compressed: ~19GB) and the PMCID->PMID mapping (compressed: ~155MB). These are needed to add annotations to certain corpora and to add the text to those that provide only annotations.

mkdir -p <PUBTATOR>
cd <PUBTATOR>
wget https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz 
wget https://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz

python -m scripts.build_pubtator \
       --pubtator <PUBTATOR>/bioconcepts2pubtatorcentral.offset.gz \
       --pmicid_pmid <PUBTATOR>/PMC-ids.csv.gz \
       --output pubtator.db \
       --overwrite
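
To illustrate what the PMCID->PMID mapping file contributes, here is a minimal sketch that reads a gzipped CSV into a lookup table. It is not the logic of the build script: the function name is hypothetical, the column names `PMCID` and `PMID` are an assumption about the header of PMC-ids.csv, and the rows are made up.

```python
import csv
import gzip
import io


def load_pmcid_to_pmid(handle):
    """Build a PMCID -> PMID lookup from an open text handle on a CSV file."""
    mapping = {}
    for row in csv.DictReader(handle):
        # Assumed column names; rows lacking either identifier are skipped
        pmcid, pmid = row.get("PMCID", ""), row.get("PMID", "")
        if pmcid and pmid:
            mapping[pmcid] = pmid
    return mapping


# Tiny in-memory stand-in for the real file (made-up identifiers)
sample = "PMCID,PMID\nPMC0000001,11111111\nPMC0000002,\n"
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as handle:
    mapping = load_pmcid_to_pmid(handle)

print(mapping)
```

For the real file, `gzip.open` would be pointed at `<PUBTATOR>/PMC-ids.csv.gz` instead of the in-memory buffer.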

Knowledge Bases

All knowledge bases will be automatically downloaded for you, with two exceptions: dbSNP and UMLS.

dbSNP

As dbSNP is a large resource (>100GB) it is best to launch a separate process to fetch it.

Essentially it boils down to:

mkdir -p <DBSNP> 
cd <DBSNP>

echo "Fetch dbSNP latest release..."
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-chr*.bz2"
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-unsupported.json.bz2"
wget --continue "ftp://ftp.ncbi.nlm.nih.gov/snp/redesign/latest_release/JSON/refsnp-withdrawn.json.bz2"

echo "Identify corrupted files: please delete and re-initiate download for all corrupted files..."
find . -name '*.bz2' -exec bunzip2 --test {} \;

See here for more details.
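
The integrity check above can also be run from Python, which makes it easier to collect the names of the files to re-download. This is a minimal sketch using only the standard library; the helper names and the demo file names are hypothetical.

```python
import bz2
import tempfile
from pathlib import Path


def is_valid_bz2(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with bz2.open(path, "rb") as handle:
            # Read in chunks so large archives never sit fully in memory
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, ValueError):
        return False
    return True


def find_corrupted(directory):
    """List names of .bz2 files under `directory` that fail to decompress."""
    return sorted(p.name for p in Path(directory).rglob("*.bz2") if not is_valid_bz2(p))


# Demo on a temporary directory with one intact and two damaged files
with tempfile.TemporaryDirectory() as tmp:
    payload = bz2.compress(b'{"refsnp_id": "1"}\n' * 100)
    Path(tmp, "good.json.bz2").write_bytes(payload)
    Path(tmp, "truncated.json.bz2").write_bytes(payload[:-5])
    Path(tmp, "garbage.json.bz2").write_bytes(b"this is not bzip2 data")
    corrupted = find_corrupted(tmp)

print(corrupted)
```

Like `bunzip2 --test`, this decompresses every file completely, so expect it to take a while on the full dbSNP download.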

UMLS

See here for more details on how to request a license.

You need to download the 2017AA full version as this is the one used by the corpus MedMentions.

In principle the parser should work with later versions too. It expects as input a folder (usually called META) containing the files MRCONSO.RRF and MRCUI.RRF.

The 2017AA is the last one that does not provide direct access to the UMLS raw data ("Metathesaurus Files"). To access the data without setting up a MySQL database you can do the following:

unzip umls-2017AA-full.zip
cd 2017AA-full
# poorly disguised zip files...
for archive in 2017aa-1-meta.nlm 2017aa-2-meta.nlm; do unzip "$archive"; done
cd 2017AA/META
gunzip MRCONSO.RRF.aa.gz MRCONSO.RRF.ab.gz MRCUI.RRF.gz
cat MRCONSO.RRF.aa MRCONSO.RRF.ab > MRCONSO.RRF
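
To give an idea of what the concatenated file contains, the sketch below reads pipe-delimited MRCONSO rows and keeps the English names per concept. It is not BELB's parser: the function names are hypothetical, the column order follows the UMLS reference manual as far as I know, and the sample rows are made up.

```python
import csv
import io

# MRCONSO column order (assumed from the UMLS reference manual)
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def iter_english_names(handle):
    """Yield (CUI, name) pairs for rows whose language is English."""
    # QUOTE_NONE: quote characters inside names are kept literally
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Rows end with a trailing '|'; zip drops the resulting empty extra field
        row = dict(zip(MRCONSO_COLUMNS, fields))
        if row["LAT"] == "ENG":
            yield row["CUI"], row["STR"]


def make_row(cui, lat, name):
    """Build a made-up row with only CUI, LAT and STR filled in."""
    values = dict.fromkeys(MRCONSO_COLUMNS, "")
    values.update(CUI=cui, LAT=lat, STR=name)
    return "|".join(values[column] for column in MRCONSO_COLUMNS) + "|"


sample = "\n".join([
    make_row("C0000001", "ENG", "example disease"),
    make_row("C0000001", "SPA", "enfermedad de ejemplo"),
]) + "\n"

names = list(iter_english_names(io.StringIO(sample)))
print(names)
```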

Once you have downloaded these two resources you can launch the script:

python -m belb.scripts.build_kbs --dir <BELB> --cores 20 --umls <path/to/umls/META> --dbsnp <path/to/dbsnp>

This will fetch the data for all the other kbs, convert them to a unified schema and store them as TSV files.

Each kb can be processed individually with its corresponding module, e.g.:

python -m belb.kbs.umls \
       --dir /belb/directory \
       --data_dir /path/to/umls/data \
       --db ./db.yaml

By default all kbs are stored as SQLite databases. The db.yaml can be edited if you wish to store the data in a different database. This feature is only partially tested and supports only PostgreSQL.

Corpora

Once all kbs are ready you can create all benchmark corpora via:

python -m belb.scripts.build_corpora --dir <BELB> --pubtator <BELB>/pubtator/pubtator.db

Similarly to kbs, you can also create a single corpus:

python -m belb.corpora.ncbi_disease --dir  /belb/directory --sentences

This will fetch the NCBI Disease corpus, preprocess it, split the text into sentences (--sentences) and store it in the BELB directory.

API

Every resource (corpus, kb) is represented by a module which acts as a standalone script as well. This means you can programmatically access a resource:

from belb.kbs.kb import BelbKb
from belb.kbs.ncbi_gene import NcbiGeneKbConfig
from belb.corpora.nlm_gene import NlmGeneCorpusParser

For ease of access we provide two classes to instantiate corpora and kbs, respectively, simply by providing an identifying name (a poor reproduction of what you see in the Auto* classes in the transformers library).

from belb import AutoBelbCorpus, AutoBelbKb
from belb.resources import Corpora, Kbs

corpus = AutoBelbCorpus.from_name(directory="path_to_belb", name=Corpora.NCBI_DISEASE.name)
kb = AutoBelbKb.from_name(directory="path_to_belb", name=Kbs.CTD_DISEASES.name)
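
For readers unfamiliar with the Auto* idea, the following toy sketch shows the general pattern of resolving an identifying name to a constructor. It does not reproduce BELB's implementation: `Corpora` here is a stand-in enum, and `ToyCorpus` and `AutoToyCorpus` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class Corpora(Enum):  # stand-in enum, not belb.resources.Corpora
    NCBI_DISEASE = "ncbi_disease"
    NLM_GENE = "nlm_gene"


@dataclass
class ToyCorpus:
    """Hypothetical resource object holding where it lives and what it is."""

    directory: Path
    name: str


class AutoToyCorpus:
    """Resolve an identifying name to a constructor via a registry."""

    # One entry per known name; a real registry could map to distinct classes
    _registry = {member.name: ToyCorpus for member in Corpora}

    @classmethod
    def from_name(cls, directory, name):
        try:
            factory = cls._registry[name]
        except KeyError:
            known = sorted(cls._registry)
            raise ValueError(f"Unknown corpus '{name}'. Choose from: {known}") from None
        return factory(directory=Path(directory), name=name)


corpus = AutoToyCorpus.from_name(directory="path_to_belb", name=Corpora.NCBI_DISEASE.name)
print(corpus)
```

Passing the enum member's name rather than a free-form string means a typo fails at the attribute lookup instead of silently selecting nothing.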

Roadmap

Datasets:

BioRED - data
CRAFT (v4.0)
BC5-CHEMDNER-patents-GPRO
AskAPatient - data
TwADR-L - data
COMETA - data: "COMETA is available by contacting the last author via e-mail or following the instructions on https://www.siphs.org/."
ShARe
2019 n2c2/UMass Lowell shared task
TAC2017ADR - data

Knowledge Bases:

Snapshot

Create snapshots regularly for ease of reproducibility. This would require contacting the resource providers and verifying that it is doable, since redistribution issues may arise.
