yan-cheng/BCNH

This repository contains the source code for the paper "Biomedical Concept Normalization by Leveraging Hypernyms".

Preprocessing

Dictionary Preprocessing

  • Download the CTD vocabulary files from CTD. We used the XML file CTD_diseases.xml.gz (v2021.02.01).
  • Preprocess the CTD vocabulary files and convert them to JSON.
python ./datasets/data_preprocess/preprocess_CDT.py \
    --disease_xml ./datasets/raw_data/CTD_diseases_MEDIC_2021.02.01.xml \
    --output_dir ./datasets/NCBI_Disease
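The download is gzip-compressed, while the command above points at an uncompressed XML file. A minimal sketch of the decompression step follows. The archive location and the target file name are assumptions taken from the paths used in this README, and `decompress_gzip` is a hypothetical helper, not part of the repository.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src, dst):
    """Decompress the gzip file at src into dst and return dst as a Path."""
    src, dst = Path(src), Path(dst)
    # Create the output directory if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    # Assumed locations; adjust to wherever the CTD download was saved.
    archive = Path("./datasets/raw_data/CTD_diseases.xml.gz")
    target = Path("./datasets/raw_data/CTD_diseases_MEDIC_2021.02.01.xml")
    if archive.exists():
        decompress_gzip(archive, target)
```

Running `gunzip` and renaming the result by hand achieves the same thing; the script form is only convenient when the step has to be repeated.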

Raw Dataset Preprocessing

  • Convert the raw NCBI Disease corpus files to JSON.
python ./datasets/data_preprocess/preprocess_NCBI.py \
    --train_txt ./datasets/raw_data/NCBItrainset_corpus.txt \
    --dev_txt ./datasets/raw_data/NCBIdevelopset_corpus.txt \
    --test_txt ./datasets/raw_data/NCBItestset_corpus.txt \
    --output_dir ./datasets/raw_data
  • Preprocess the JSON files: lowercase the text, remove punctuation, and handle abbreviations with Ab3P.
python ./datasets/data_preprocess/preprocess_dataset.py \
    --input_file ./datasets/raw_data/NCBItrainset_corpus.json \
    --output_dir ./datasets/NCBI_Disease/processed_train \
    --ab3p_path ./Ab3P/identify_abbr \
    --dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dataset.py \
    --input_file ./datasets/raw_data/NCBIdevelopset_corpus.json \
    --output_dir ./datasets/NCBI_Disease/processed_dev \
    --ab3p_path ./Ab3P/identify_abbr \
    --dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dataset.py \
    --input_file ./datasets/raw_data/NCBItestset_corpus.json \
    --output_dir ./datasets/NCBI_Disease/processed_test \
    --ab3p_path ./Ab3P/identify_abbr \
    --dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --lowercase true \
    --remove_punctuation true
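To give a rough idea of what the --lowercase and --remove_punctuation options imply for a mention string, here is a minimal sketch. It is an illustration under assumptions, not the code in preprocess_dataset.py: `normalize_mention` is a hypothetical name, and the repository's actual rules (for example, which characters count as punctuation) may differ.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_mention(text, lowercase=True, remove_punctuation=True):
    """Illustrative normalization; the repository's own rules may differ."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace left behind by removed punctuation.
    return " ".join(text.split())
```

Under these assumptions, `normalize_mention("Breast-Ovarian Cancer, familial")` returns `"breast ovarian cancer familial"`. Applying the same normalization to both mentions and dictionary names keeps the two sides comparable.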
  • Preprocess the extended dictionaries.
# Note that the dictionaries differ only in the mentions added to increase coverage: dev_dictionary includes the train mentions, and test_dictionary includes the train and dev mentions.

python ./datasets/data_preprocess/preprocess_dictionary.py \
    --input_dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --output_dictionary_path ./datasets/NCBI_Disease/train_dictionary.txt \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dictionary.py \
    --input_dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --additional_data_dir ./datasets/NCBI_Disease/processed_train \
    --output_dictionary_path ./datasets/NCBI_Disease/dev_dictionary.txt \
    --lowercase true \
    --remove_punctuation true

python ./datasets/data_preprocess/preprocess_dictionary.py \
    --input_dictionary_path ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json \
    --additional_data_dir ./datasets/NCBI_Disease/processed_train \
                          ./datasets/NCBI_Disease/processed_dev \
    --output_dictionary_path ./datasets/NCBI_Disease/test_dictionary.txt \
    --lowercase true \
    --remove_punctuation true

# Merge the processed train and dev sets into one directory, used as --train_dir below.
mkdir -p ./datasets/NCBI_Disease/processed_train_dev
cp ./datasets/NCBI_Disease/processed_train/* ./datasets/NCBI_Disease/processed_train_dev
cp ./datasets/NCBI_Disease/processed_dev/* ./datasets/NCBI_Disease/processed_train_dev
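The relationship between the three dictionaries built above can be sketched as follows. This is a toy illustration only: the in-memory (concept ID, name) pairs, the placeholder IDs, and the `extend_dictionary` helper are all hypothetical and do not reflect the file formats or code used by preprocess_dictionary.py.

```python
def extend_dictionary(base_entries, mention_sets):
    """Return base entries plus mentions from each set, without duplicates.

    base_entries: list of (concept_id, name) pairs from the vocabulary.
    mention_sets: list of lists of (concept_id, mention) pairs, one list
                  per annotated split (hypothetical in-memory format).
    """
    seen = set()
    merged = []
    for entries in [base_entries, *mention_sets]:
        for pair in entries:
            # Keep the first occurrence of each pair and preserve order.
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


# Placeholder IDs and names, for illustration only.
base = [("ID1", "breast neoplasms")]
train = [("ID1", "breast cancer")]
dev = [("ID1", "breast carcinoma"), ("ID1", "breast cancer")]

train_dictionary = extend_dictionary(base, [])
dev_dictionary = extend_dictionary(base, [train])
test_dictionary = extend_dictionary(base, [train, dev])
```

With this toy data, train_dictionary holds one entry, dev_dictionary two, and test_dictionary three, since the repeated "breast cancer" mention is added only once.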

Train Model

Use the following command to train the model.

CUDA_VISIBLE_DEVICES=0 python train.py \
    --bert_dir ./pretrained/pt_biobert1.1/ \
    --model_dir exp/BCNH \
    --train_dictionary_path ./datasets/NCBI_Disease/train_dictionary.txt \
    --train_dir ./datasets/NCBI_Disease/processed_train_dev \
    --dev_dictionary_path ./datasets/NCBI_Disease/dev_dictionary.txt \
    --dev_dir ./datasets/NCBI_Disease/processed_dev \
    --test_dictionary_path ./datasets/NCBI_Disease/test_dictionary.txt \
    --test_dir ./datasets/NCBI_Disease/processed_test \
    --epoch 10 \
    --hyper_num 10 \
    --hyper_norm_scale 1 \
    --taxonomy ./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json
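Because the training command depends on the outputs of every preprocessing step, it can help to confirm that the inputs exist before starting a run. The sketch below checks the paths that appear in the command above; `missing_paths` is a hypothetical helper and not part of train.py.

```python
from pathlib import Path

# Input paths copied from the training command above.
REQUIRED_PATHS = [
    "./pretrained/pt_biobert1.1/",
    "./datasets/NCBI_Disease/train_dictionary.txt",
    "./datasets/NCBI_Disease/processed_train_dev",
    "./datasets/NCBI_Disease/dev_dictionary.txt",
    "./datasets/NCBI_Disease/processed_dev",
    "./datasets/NCBI_Disease/test_dictionary.txt",
    "./datasets/NCBI_Disease/processed_test",
    "./datasets/NCBI_Disease/CTD_diseases_MEDIC_2021.02.01.json",
]


def missing_paths(paths, root="."):
    """Return the entries of paths that do not exist relative to root."""
    return [p for p in paths if not (Path(root) / p).exists()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when all inputs exist.
    for path in missing_paths(REQUIRED_PATHS):
        print("missing:", path)
```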
