Readme

This repo contains the source code for the paper "Biomedical Concept Normalization by Leveraging Hypernyms".

Preprocessing

Dictionary Preprocessing

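Download the CTD vocabulary files from CTD. We used the XML file CTD_diseases.xml.gz (v2021.02.01). The command below expects the unzipped file at ./datasets/raw_data/CTD_diseases_MEDIC_2021.02.01.xml.
Preprocess the CTD vocabulary file and convert it to JSON.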

python ./datasets/data_preprocess/preprocess_CDT.py \
    --disease_xml ./datasets/raw_data/CTD_diseases_MEDIC_2021.02.01.xml \
    --output_dir ./datasets/NCBI_Disease
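As a rough illustration of what this XML-to-JSON step involves, here is a minimal sketch, not the repository's actual script. It assumes each concept is a `Row` element with `DiseaseID`, `DiseaseName`, `Synonyms`, and `ParentIDs` children whose multi-valued fields are separated by `|`; check these names against the downloaded file. The function names, the output keys, and the sample IDs are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def split_field(value):
    """Split a '|'-separated field into a list, dropping empty items."""
    return [item.strip() for item in (value or "").split("|") if item.strip()]


def parse_ctd_diseases(xml_text):
    """Map each disease ID to its name, synonyms, and parent (hypernym) IDs.

    Assumption: one <Row> element per concept; element names may differ
    in the real CTD export.
    """
    concepts = {}
    for row in ET.fromstring(xml_text).iter("Row"):
        disease_id = (row.findtext("DiseaseID") or "").strip()
        if not disease_id:
            continue  # skip rows without an identifier
        concepts[disease_id] = {
            "name": (row.findtext("DiseaseName") or "").strip(),
            "synonyms": split_field(row.findtext("Synonyms")),
            "parents": split_field(row.findtext("ParentIDs")),
        }
    return concepts


# Tiny made-up sample; the IDs are placeholders, not real CTD entries.
SAMPLE_XML = (
    "<Rows>"
    "<Row><DiseaseName>Parent Disease</DiseaseName>"
    "<DiseaseID>ID:0001</DiseaseID></Row>"
    "<Row><DiseaseName>Child Disease</DiseaseName>"
    "<DiseaseID>ID:0002</DiseaseID>"
    "<ParentIDs>ID:0001</ParentIDs>"
    "<Synonyms>Child Disorder|Child Syndrome</Synonyms></Row>"
    "</Rows>"
)

print(json.dumps(parse_ctd_diseases(SAMPLE_XML), indent=2))
```

Keeping the parent IDs next to each concept is what would later let a model look up hypernyms from the taxonomy file.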

Raw Dataset Preprocessing

Download the NCBI Disease Corpus (complete train, development, and test sets), then unzip the files to obtain NCBItrainset_corpus.txt, NCBIdevelopset_corpus.txt, and NCBItestset_corpus.txt.
Convert the dataset files to JSON.

python ./datasets/data_preprocess/preprocess_NCBI.py \
    --train_txt ./datasets/raw_data/NCBItrainset_corpus.txt \
    --dev_txt ./datasets/raw_data/NCBIdevelopset_corpus.txt \
    --test_txt ./datasets/raw_data/NCBItestset_corpus.txt \
    --output_dir ./datasets/raw_data
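The corpus files use a PubTator-style layout: a `PMID|t|title` line, a `PMID|a|abstract` line, tab-separated annotation lines, and a blank line between documents. The sketch below shows one way such a file could be read into JSON-ready dictionaries. It is not the repository's preprocess_NCBI.py; the function name and the output keys (`pmid`, `mentions`, `cui`, and so on) are hypothetical.

```python
def parse_ncbi_corpus(text):
    """Parse PubTator-style corpus text into a list of document dicts."""
    documents = []
    current = None
    for raw_line in text.splitlines():
        if not raw_line.strip():
            # A blank line closes the current document.
            if current is not None:
                documents.append(current)
                current = None
            continue
        if current is None:
            current = {"pmid": None, "title": "", "abstract": "", "mentions": []}
        head = raw_line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            # Title ("t") or abstract ("a") line.
            current["pmid"] = head[0]
            current["title" if head[1] == "t" else "abstract"] = head[2]
            continue
        fields = raw_line.split("\t")
        if len(fields) >= 6:
            # Annotation line: PMID, start, end, mention text, type, concept ID.
            current["mentions"].append({
                "start": int(fields[1]),
                "end": int(fields[2]),
                "text": fields[3],
                "type": fields[4],
                "cui": fields[5].strip(),
            })
    if current is not None:
        documents.append(current)  # last document may lack a trailing blank line
    return documents
```

A caller could read one of the `.txt` files, pass its contents to `parse_ncbi_corpus`, and write the result with `json.dump`.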

Preprocess the JSON files: lowercase the text, remove punctuation, and handle abbreviations with Ab3P.

python ./datasets/data_preprocess/preprocess_dataset.py \
    --input_file ./datasets/raw_data/NCBItrainset_corpus.json \
    --output_dir ./datasets/NCBI_Disease/processed_train \
    --ab3p_path ./Ab3P/identify_abbr \
    --dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dataset.py \
    --input_file ./datasets/raw_data/NCBIdevelopset_corpus.json \
    --output_dir ./datasets/NCBI_Disease/processed_dev \
    --ab3p_path ./Ab3P/identify_abbr \
    --dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dataset.py \
    --input_file ./datasets/raw_data/NCBItestset_corpus.json \
    --output_dir ./datasets/NCBI_Disease/processed_test \
    --ab3p_path ./Ab3P/identify_abbr \
    --dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --lowercase true \
    --remove_punctuation true
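To make the effect of `--lowercase` and `--remove_punctuation` concrete, here is a minimal sketch of that kind of string normalization. It is an assumption about what the flags do, not the code in preprocess_dataset.py, and it leaves out the Ab3P abbreviation step. `normalize_mention` is a hypothetical name.

```python
import re
import string

# Character class matching any ASCII punctuation character.
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def normalize_mention(text, lowercase=True, remove_punctuation=True):
    """Lowercase a mention and replace punctuation with spaces."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Substitute a space rather than deleting, so "non-hodgkin"
        # becomes two tokens instead of "nonhodgkin".
        text = _PUNCT_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind.
    return " ".join(text.split())
```

With both options on, `normalize_mention("Non-Hodgkin's Lymphoma (NHL)")` returns `"non hodgkin s lymphoma nhl"`. Applying the same function to mentions and dictionary names keeps the two sides comparable.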

Preprocess the extended dictionary.

# Note that the dictionaries differ only in the mentions added to increase coverage: dev_dictionary includes the train mentions, and test_dictionary includes the train and dev mentions.

python ./datasets/data_preprocess/preprocess_dictionary.py \
    --input_dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --output_dictionary_path ./datasets/NCBI_Disease/train_dictionary.txt \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dictionary.py \
    --input_dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --additional_data_dir ./datasets/NCBI_Disease/processed_train \
    --output_dictionary_path ./datasets/NCBI_Disease/dev_dictionary.txt \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dictionary.py \
    --input_dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --additional_data_dir ./datasets/NCBI_Disease/processed_train \
                          ./datasets/NCBI_Disease/processed_dev \
    --output_dictionary_path ./datasets/NCBI_Disease/test_dictionary.txt \
    --lowercase true \
    --remove_punctuation true
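The idea behind `--additional_data_dir` can be sketched as follows: annotated mentions from earlier splits are appended to the base dictionary as extra names for their concepts. This is an illustration under assumptions, not the repository's preprocess_dictionary.py. In particular, the `ID||name` line format and both function names are hypothetical.

```python
def extend_dictionary(base_entries, mentions):
    """Return base (cui, name) pairs plus unseen pairs from annotated mentions.

    Order is preserved and duplicates are dropped.
    """
    seen = set(base_entries)
    extended = list(base_entries)
    for cui, name in mentions:
        pair = (cui, name)
        if pair not in seen:
            seen.add(pair)
            extended.append(pair)
    return extended


def format_entries(entries, sep="||"):
    """Serialize (cui, name) pairs as one line each (separator is assumed)."""
    return [f"{cui}{sep}{name}" for cui, name in entries]
```

Under this reading, the train dictionary gets no extra mentions, the dev dictionary gets the train mentions, and the test dictionary gets the train and dev mentions, so no split is evaluated against its own gold mentions.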

# Merge the processed train and dev sets into one directory (used as --train_dir below)
mkdir ./datasets/NCBI_Disease/processed_train_dev
cp ./datasets/NCBI_Disease/processed_train/* ./datasets/NCBI_Disease/processed_train_dev
cp ./datasets/NCBI_Disease/processed_dev/* ./datasets/NCBI_Disease/processed_train_dev

Train Model

Use the following command to train the model.

CUDA_VISIBLE_DEVICES=0 python train.py \
    --bert_dir ./pretrained/pt_biobert1.1/ \
    --model_dir exp/BCNH \
    --train_dictionary_path ./datasets/NCBI_Disease/train_dictionary.txt \
    --train_dir ./datasets/NCBI_Disease/processed_train_dev \
    --dev_dictionary_path ./datasets/NCBI_Disease/dev_dictionary.txt \
    --dev_dir ./datasets/NCBI_Disease/processed_dev \
    --test_dictionary_path ./datasets/NCBI_Disease/test_dictionary.txt \
    --test_dir ./datasets/NCBI_Disease/processed_test \
    --epoch 10 \
    --hyper_num 10 \
    --hyper_norm_scale 1 \
    --taxonomy ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json
