Raredis

This repository contains code for our paper: Comparison of pipeline, sequence-to-sequence, and generative language models for end-to-end relation extraction: experiments with the rare disease use-case.

Dataset

The full modified dataset is available at this link.

Seq2rel

All experiments were run on Google Colab Pro+ using an A100 GPU.

1. Preparing the environment

Please follow the original seq2rel repo for installation and environment preparation guidelines here.
Alternatively, run:

pip install git+https://github.com/JohnGiorgi/seq2rel.git

2. Prepare data

We follow the same linearization schema as provided by the authors.
Datasets are tab-separated files where each example is contained on its own line. The first column contains the text, and the second column contains the relations. Relations themselves must be serialized to strings.

SCAN1 has been identified in a single Saudi Arabian family. It has not been identified in other ataxic individuals. The diagnosis of SCAN1 is made on history and clinical signs as listed above. DNA testing for mutations in TDP1 is only available on a research basis.	SCAN1 @RAREDISEASE@ It @ANAPHOR@ @Anaphora@ 

Seq2rel/data_prep_REL.py will generate files in the desired format for seq2rel. The preprocessed input files are available in the Seq2rel/preprocees_data folder.
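
For illustration, the serialization boils down to something like the sketch below (our own illustrative code, not data_prep_REL.py itself; the relation tuple layout is an assumption based on the example above):

# Illustrative sketch only -- the repo's data_prep_REL.py is the authoritative version.
# Builds one tab-separated training line in the seq2rel linearization shown above.
def linearize_example(text, relations):
    # relations: list of (head mention, head type, tail mention, tail type, relation type)
    serialized = " ".join(
        f"{head} @{head_type}@ {tail} @{tail_type}@ @{rel_type}@"
        for head, head_type, tail, tail_type, rel_type in relations
    )
    return f"{text}\t{serialized}"

line = linearize_example(
    "SCAN1 has been identified in a single Saudi Arabian family. ...",
    [("SCAN1", "RAREDISEASE", "It", "ANAPHOR", "Anaphora")],
)
# -> "SCAN1 has been identified ...\tSCAN1 @RAREDISEASE@ It @ANAPHOR@ @Anaphora@"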

3. Model Training

We trained our model on Google Colab Pro+ using an A100 GPU.
Clone John Giorgi's seq2rel GitHub repo to the desired location in your drive: Seq2rel repo

Training

To train the model, use the allennlp train command with one of our configs (or write your own!)

For example, to train a model on Raredis, first preprocess the data as described in the previous step, or directly use the already preprocessed data from the Seq2rel/preprocees_data folder.

Then call allennlp train with the Raredis config we have provided:

train_data_path="path/to/preprocessed/raredis/train.txt" \
valid_data_path="path/to/preprocessed/raredis/valid.txt" \
dataset_size=600 \
allennlp train "training_config/raredis.jsonnet" \
    --serialization-dir "output" \
    --include-package "seq2rel" 

The best model checkpoint (selected by micro-F1 score on the validation set), vocabulary, configuration, and log files will be saved to --serialization-dir, which can be any directory you like. You can also follow our model-training Google Colab notebook here.
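
Once training finishes, the archived model can be loaded for quick predictions (a minimal sketch, assuming the seq2rel package's Seq2Rel wrapper, which accepts a path to the model.tar.gz written to --serialization-dir):

# Minimal sketch: load the trained archive and predict on a raw abstract.
from seq2rel import Seq2Rel

model = Seq2Rel("output/model.tar.gz")
prediction = model("SCAN1 has been identified in a single Saudi Arabian family. ...")
print(prediction)  # linearized relations, e.g. "SCAN1 @RAREDISEASE@ It @ANAPHOR@ @Anaphora@"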

4. Evaluation

For overall and per-relation-type scores, run Seq2rel/eval_rel_type.py. Make sure you update the paths to the trained model and the gold test file.

BioGPT

All experiments were run on Google Colab Pro+ using an A100 GPU.

1. Data Prep

  1. First, run BioGPT/scripts/data_preparation/rawToJSON.py to convert the original files into JSON format. This script adds or removes the instruction in the input sequence and adds or removes entity types in the target sequence.
  2. Run BioGPT/scripts/data_preparation/rel_is_preprocess.py to preprocess the JSON data into the rel-is input format. This will output .pmid, .x, and .y files for each split:
    split.pmid: contains the document name
    split.x: contains the input string
    split.y: contains the target string

For example, given the original text:

the incidence and prevalence of tarsal tunnel syndrome is unknown. the disorder is believed to affect males and females in equal numbers.

Running rel_is_preprocess.py with the copy-instruction and entity-type options enabled will generate:

split.pmid that contains

Tarsal-Tunnel-Syndrome

split.x that contains

consider the abstract: $ the incidence and prevalence of tarsal tunnel syndrome is unknown. the disorder is believed to affect males and females in equal numbers. $ from the given abstract, find all the entities and relations among them. do not generate any token outside the abstract.

split.y that contains

the relationship between raredisease tarsal tunnel syndrome and anaphor "the disorder" is antecedent.

Sample preprocessed data can be found here.
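
For reference, the gist of this preprocessing step looks roughly like the sketch below (illustrative only; the JSON field names are assumptions, and rel_is_preprocess.py is the authoritative version):

# Illustrative sketch only -- see BioGPT/scripts/data_preparation/rel_is_preprocess.py
# for the real logic. JSON field names ("doc_id", "abstract", "relations") are assumed.
import json

def write_split(docs, split):
    with open(f"{split}.pmid", "w") as f_pmid, \
         open(f"{split}.x", "w") as f_x, \
         open(f"{split}.y", "w") as f_y:
        for doc in docs:
            f_pmid.write(doc["doc_id"] + "\n")
            # copy-instruction variant: wrap the abstract in the prompt shown above
            f_x.write("consider the abstract: $ " + doc["abstract"]
                      + " $ from the given abstract, find all the entities and relations "
                        "among them. do not generate any token outside the abstract.\n")
            # rel-is target: one natural-language statement per relation
            f_y.write(" ".join(
                f"the relationship between {ht} {h} and {tt} {t} is {rel}."
                for ht, h, tt, t, rel in doc["relations"]) + "\n")

write_split(json.load(open("train.json")), "train")  # train.json from rawToJSON.py (path assumed)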

Training

1. Requirements and Installation

Git clone the BioGPT repo

!git clone https://github.com/microsoft/BioGPT.git

then follow the original GitHub repo's instructions to install the libraries needed for BioGPT here, or run the following cells.

!git clone https://github.com/pytorch/fairseq  
import os
os.chdir("/content/fairseq")
!git checkout v0.12.0
!pip install .
!python setup.py build_ext --inplace
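
An optional sanity check that the pinned fairseq version was picked up:

# optional: confirm the install matches the v0.12.0 checkout
import fairseq
print(fairseq.__version__)  # expect 0.12.0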

Moses

os.chdir("/content/BioGPT")
!git clone https://github.com/moses-smt/mosesdecoder.git
!export MOSES=${PWD}/mosesdecoder

FastBPE

!git clone https://github.com/glample/fastBPE.git
!export FASTBPE=${PWD}/fastBPE
os.chdir("fastBPE")
!g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast

Sacremoses

!pip install sacremoses
!pip install tensorboardX

You can also follow the installation steps in our Google Colab notebook here.

2. Model Download

  1. Links to the pre-trained BioGPT and BioGPT-Large models are provided on the original GitHub repo here. We observed that the URL sometimes doesn't work, so alternatively you can use this link to download BioGPT medium (4 GB) or this link to download BioGPT-Large (18 GB) from our Google Drive and save it to your local machine/Google Drive.
os.chdir("/content/BioGPT/")
os.mkdir("checkpoints")
os.chdir("checkpoints")
!wget https://msramllasc.blob.core.windows.net/modelrelease/BioGPT/checkpoints/Pre-trained-BioGPT.tgz
!tar -zxvf Pre-trained-BioGPT.tgz

If the above URL doesn't work (there is sometimes a public-access error), run the code below to copy the BioGPT checkpoint from your Google Drive to Google Colab:

os.chdir("/content/BioGPT/")
os.mkdir("checkpoints")
os.chdir("checkpoints")
os.mkdir("Pre-trained-BioGPT")

# copy the model checkpoint from google drive
%cp -av "/content/drive/MyDrive/BioGPT/pre_trained_model_med/checkpoint.pt" "/content/BioGPT/checkpoints/Pre-trained-BioGPT"

The model path should look like this: /content/BioGPT/checkpoints/Pre-trained-BioGPT/checkpoint.pt
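
A quick check that the checkpoint is where the bash scripts expect it:

# confirm the checkpoint is in place before preprocessing/training
import os
assert os.path.exists("/content/BioGPT/checkpoints/Pre-trained-BioGPT/checkpoint.pt")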

  2. Create a folder named "Raredis" under the data subfolder in the BioGPT path and paste the raw folder (BioGPT/data/raw) inside it, or alternatively choose different raw files from the preprocessed directory.
os.chdir("/content/BioGPT/data")
os.mkdir("Raredis")

# copy the files created by rel_is_preprocess.py (.pmid, .x, and .y)
%cp -av "/content/drive/MyDrive/raw" "/content/BioGPT/data/Raredis/"

The .pmid, .x, and .y files should now sit under /content/BioGPT/data/Raredis/raw.
  3. Copy the RE-Raredis folder under the "examples" subfolder in the BioGPT path. This folder contains the bash files for preprocessing, training, and inference.
%cp -av "content/drive/mydrive/RE-Raredis" "/content/BioGPT/examples/"

The RE-Raredis scripts should now sit under /content/BioGPT/examples/RE-Raredis.
  4. Run preprocess.sh:
os.chdir("/content/BioGPT/examples/RE-Raredis")
!bash preprocess.sh

The above command creates one more folder, named "relis-bin", alongside the raw folder (under /content/BioGPT/data/Raredis/).
  5. Run train.sh to begin training the model. This will create a folder "RE-Raredis-BioGPT" under the checkpoints folder. You can change the configuration in the train.sh bash file.
!bash train.sh
  6. After training, run infer.sh. This script runs inference on test.txt and generates a .detok file.
!bash infer.sh
  7. Post-processing
    After inference, run BioGPT/scripts/postprocess to convert the model output into the desired JSON format.

  8. Evaluation: run BioGPT/scripts/eval/eval_per_rel_type.py to get the overall and per-relation-type scores.

BioMedLM (Former PubMedGPT)

We use Lambda Labs to train Stanford's BioMedLM on a single H100 80GB GPU.

We follow the data-preparation and model-training guidelines provided in the BioMedLM authors' GitHub repo for the NLG (seq2seq) task.

1. Data Prep

We use the same JSON files we created earlier using BioGPT/scripts/data_preparation/rawToJSON.py to build the data required for BioMedLM input.

Run BioMedLM/scripts/databuilder to build the files required to train BioMedLM. Note that this Python script is similar to BioGPT/scripts/data_preparation and generates the same files, only with different extensions. It will produce split.pmid, split.source, and split.target for the train, dev, and test splits respectively, as described in the original GitHub repo.

2. Configuration & Model Training

Git clone the repo

!git clone https://github.com/stanford-crfm/BioMedLM.git

After cloning the BioMedLM repo, copy the train_contol file and place it under the gpt2 folder.

Make sure the task dataset is in ./textgen/data. The dataset folder should contain .source and .target files: the .source file should contain the original text in a one-example-per-line format, and the .target file should contain the desired output in the same format. See the example here.
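
As an illustration of that layout, each split can be written out along these lines (a sketch only; the JSON field names are assumptions, and BioMedLM/scripts/databuilder is the authoritative version):

# Sketch only: write one-example-per-line .source/.target files for a split.
# Field names ("abstract", "target") are assumed for this example.
import json

def write_split(json_path, split, out_dir="./textgen/data"):
    docs = json.load(open(json_path))
    with open(f"{out_dir}/{split}.source", "w") as src, \
         open(f"{out_dir}/{split}.target", "w") as tgt:
        for doc in docs:
            src.write(doc["abstract"].replace("\n", " ") + "\n")  # one input per line
            tgt.write(doc["target"].replace("\n", " ") + "\n")    # one output per line

write_split("train.json", "train")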

Go to ./textgen/gpt2. To finetune, run:

python finetune_for_summarization.py --output_dir /home/ubuntu/BioMedLM/output_dir \
  --model_name_or_path stanford-crfm/BioMedLM \
  --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --save_strategy steps \
  --do_eval \
  --train_data_file /home/ubuntu/BioMedLM/finetune/textgen/data/train.source \
  --eval_data_file /home/ubuntu/BioMedLM/finetune/textgen/data/valid.source \
  --max_source_length 510 \
  --train_max_target_length 500 \
  --save_total_limit 25 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 16 \
  --learning_rate 1e-5 \
  --warmup_ratio 0.1 \
  --weight_decay 0.01 \
  --seed 7 \
  --evaluation_strategy steps \
  --eval_steps 50 \
  --num_train_epochs 30 \
  --logging_steps 50 \
  --save_steps 50 \
  --logging_first_step \
  --load_best_model_at_end True \
  --metric_for_best_model eval_loss \
  --greater_is_better True \
  --adam_beta2 0.98

3. Prediction

Make sure you add the correct validation.source path in the run_generation_batch.py file.

After fine-tuning, run generation on the validation set for each saved checkpoint:

python -u run_generation_batch.py --max_source_length -1 --length 510 --model_name_or_path=finetune_checkpoint_path --num_return_sequences 1 --stop_token [SEP] --tokenizer_name=finetune_checkpoint_path --task_mode=raredis --control_mode=no --tuning_mode finetune --gen_dir user/output_dir --batch_size 1 --temperature 1.0

Prediction on validation set (Best checkpoint)

We save each training checkpoint and select the best one based on the validation-set F1 score using the eval script here. We stop when the validation F1 score has not improved over the last 5 checkpoints (patience = 5).
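
The selection logic amounts to the following (an illustrative sketch; checkpoint names and F1 values are placeholders):

# Illustrative sketch of the patience-based selection described above.
# scores lists (checkpoint, validation F1) pairs in training order; values are placeholders.
def select_best_checkpoint(scores, patience=5):
    best_ckpt, best_f1, since_best = None, float("-inf"), 0
    for ckpt, f1 in scores:
        if f1 > best_f1:
            best_ckpt, best_f1, since_best = ckpt, f1, 0
        else:
            since_best += 1
            if since_best >= patience:  # no improvement over the last 5 checkpoints: stop
                break
    return best_ckpt, best_f1

best, f1 = select_best_checkpoint([("checkpoint-50", 0.41), ("checkpoint-100", 0.47),
                                   ("checkpoint-150", 0.46), ("checkpoint-200", 0.45)])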

Prediction on test set

Once you have selected the best checkpoint based on validation F1 score, change the validation.source path in run_generation_batch.py to your test.source path and run the command below again with the best saved checkpoint:

python -u run_generation_batch.py --max_source_length -1 --length 510 --model_name_or_path=best_validation_checkpoint_path --num_return_sequences 1 --stop_token [SEP] --tokenizer_name=best_validation_checkpoint_path --task_mode=raredis --control_mode=no --tuning_mode finetune --gen_dir user/output_dir --batch_size 1 --temperature 1.0

4. Test Evaluation

Run BioMedLM/scripts/eval/test_eval to evaluate the predicted sequences.

Pipeline

For the pipeline approach, we use the truncated documents (up to 512 tokens, the BERT input limit) from this link.
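
For reference, truncation to the BERT limit can be done with a Hugging Face tokenizer along these lines (a sketch assuming a bert-base-uncased tokenizer; the truncated files themselves are at the link above):

# Sketch: truncate a document to BERT's 512-token limit (special tokens included)
# and decode it back to text. The input path is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
document_text = open("abstract.txt").read()
ids = tokenizer(document_text, truncation=True, max_length=512)["input_ids"]
truncated_text = tokenizer.decode(ids, skip_special_tokens=True)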

T5/Flan-T5

Training/valid/test data

Data for the copy/no-copy instruction variants of the natural-language and rel-is templates can be found here: link

Training

The training script for both formats can be found here: link

Command:

python PATH_TO_TRAINING_PYTHON_FILE \
  --output_dir PATH_TO_OUTPUT_DIR \
  --model_name_or_path t5-3b \
  --tokenizer_name t5-3b \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --save_strategy steps \
  --do_eval \
  --overwrite_output_dir \
  --learning_rate 3e-4 \
  --warmup_ratio 0.1 \
  --weight_decay 1e-5 \
  --seed 7 \
  --evaluation_strategy steps \
  --eval_steps 200 \
  --num_train_epochs 100 \
  --logging_steps 200 \
  --save_steps 200 \
  --logging_first_step \
  --load_best_model_at_end True \
  --metric_for_best_model eval_f1 \
  --save_total_limit=1 \
  --greater_is_better True \
  --adam_beta2 0.98 \
  --predict_with_generate True \
  --generation_num_beams 4 \
  --prediction_loss_only False \
  --generation_max_length 1024

Inference

After training, you can run inference scripts from link depending on the template you chose.
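
If you prefer to run generation directly, a minimal sketch with the Hugging Face API (using the same beam size and maximum length as the training command; the checkpoint path is a placeholder for your own output_dir) looks like this:

# Minimal sketch: generate relations with a fine-tuned T5/Flan-T5 checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "PATH_TO_OUTPUT_DIR"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

input_text = open("test_example.txt").read()  # placeholder input
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, num_beams=4, max_length=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))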

Evaluation

After inference, you can run evaluation scripts from link depending on the template you chose.

About

Project for end-to-end relation extraction on rare diseases
