Phenotype Extraction and Differential Diagnosis Pipeline in Diagnosing Rare Diseases Using EHRs
- `codes/core`: Core package, including data processing and the phenotype-based rare disease prediction models
- `codes/bert_syn_project`: Phenotype extraction models
- `codes/requirement.txt`: List of environment packages for running the code
- `Docker`: Docker image for running the code
- `PhenoBrain_Web_API`: PhenoBrain website API documentation
These are the operating system and software versions used by the authors of the paper. Note that, except for the Python version, other versions are not strictly required and are provided for reference.
Operating System: Ubuntu 22.04.3 LTS
Java: openjdk version 1.8.0_402, required for running the differential diagnosis method BOQA
Python: 3.6.12
# Create a new conda environment. Note that Python 3.6.12 must be
# installed to avoid potential conflicts with other environment packages.
conda create --name <xxxx> python=3.6.12
# install basic packages based on requirement.txt
pip install -r requirements.txt
Download the trained model and parameters from the following address: https://drive.google.com/drive/folders/1cVApHHw5yLLoLRYZht9Qx52AienJlgWN?usp=sharing.
Once downloaded, place the model file of the differential diagnosis module in the path '/codes/core/'.
The stored model parameters are considerably large, with a total size of approximately 14 GB; the models for our four methods account for around 4 GB. If you do not need to reproduce the results of the 12 baseline methods, we recommend downloading only the following five folders: ICTODQAcrossModel, HPOICCalculator, HPOProbMNBNModel, LRNeuronModel, CNBModel.
First, open a new terminal window and set up the default environment with the following commands.
# path to the code on your machine
CODE_PATH="<your address>/timgroup_disease_diagnosis"
CORE_PATH="${CODE_PATH}/core"
export PYTHONPATH=$CORE_PATH:$PYTHONPATH
cd $CORE_PATH
Second, put the related patient data in the folder "codes/core/data/preprocess/patient/".
The main test datasets have been publicly released at Zenodo.
To reproduce all the results discussed in the supplementary file of this study, run the "core/script/test/test_optimal_model.py" file.
# Running an Example
python core/script/test/test_optimal_model.py
The main file that contains all the settings for running the model and generating results is "core/script/test/test_optimal_model.py".
In "core/helper/data/data_helper.py", you can find all the addresses of the datasets used in this study. You can replace them with your own if you want to use a different address.
The "core/predict/" folder contains the rare disease diagnosis models, comprising 17 methods in total: 12 state-of-the-art baselines in "core/predict/sim_model/" and the 4 methods we developed or adapted (ICTO, PPO, CNB, MLP). After obtaining the results from the 4 models, a new disease prediction ranking can be generated by combining their predictions with the ensemble method based on order statistics.
The Ensemble model. We observed that no single assumption or model completely captures the characteristics of the diverse datasets. Hence, we developed the Ensemble model, which combines predictions from multiple methods using order statistics and achieves better results. The Ensemble model computes one overall prediction by integrating the rankings of the previous four methods. Suppose the number of methods is N. First, the Ensemble method normalizes the ranking of diseases within each method to obtain ranking ratios. It then calculates a Z statistic, which measures the likelihood that the observed ranking ratios are due solely to chance: the probability of obtaining, through random factors alone, ranking ratios smaller than those currently observed. Under the null hypothesis, the position of each disease in the overall ranking is random; thus, of two diseases, the one with the smaller statistic is more likely to deserve a top rank. The Z statistic is computed from the joint cumulative distribution of an N-dimensional order statistic:
$$Z(r_1,\dots,r_N) = N! \int_{r_0}^{r_1}\int_{s_1}^{r_2}\cdots\int_{s_{N-1}}^{r_N} \mathrm{d}s_N \cdots \mathrm{d}s_1$$

where $r_i$ is the rank ratio given by the $i$-th method (sorted so that $r_1 \le \dots \le r_N$) and $r_0 = 0$. Because evaluating this integral directly has high complexity, we implemented a faster recursive formula to compute it, as previously done:
$$V_k = \sum_{i=1}^{k} (-1)^{i-1}\,\frac{V_{k-i}}{i!}\,r_{N-k+1}^{\,i}, \qquad Z = N!\,V_N$$

where $V_0 = 1$ and $r_i$ is the rank ratio given by the $i$-th method.
For more information, please refer to the reference [1] .
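As an illustration, the recursive computation of the Z statistic described above can be sketched in Python. This is a minimal sketch, not the repository's implementation; the function name `ensemble_z` is ours.

```python
import math

def ensemble_z(rank_ratios):
    """Combine one disease's per-method rank ratios (each in [0, 1]) into a
    single order-statistics score; smaller values indicate a disease that is
    consistently ranked near the top across methods."""
    r = sorted(rank_ratios)                  # r_1 <= ... <= r_N
    n = len(r)
    v = [1.0]                                # V_0 = 1
    for k in range(1, n + 1):
        # V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} * r_{N-k+1}^i / i!
        v.append(sum((-1) ** (i - 1) * v[k - i] * r[n - k] ** i / math.factorial(i)
                     for i in range(1, k + 1)))
    return math.factorial(n) * v[n]          # Z = N! * V_N
```

For example, a disease ranked in the top 20% by one method and the top 50% by another yields `ensemble_z([0.2, 0.5])`, which is smaller than the score of a disease ranked at the 50% mark by both methods, so it rises in the combined ranking.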
The folder also contains additional deep learning models, such as a GCN and a Bayesian network, which users are free to try.
After running the test_optimal_model.py file, the program will generate seven folders, namely CaseResult, csv, delete, DisCategoryResult, Metric-test, RawResults and table.
The CaseResult folder contains the complete lists of predicted diseases (out of a total of 9,260 diseases) for all patients and each of the methods employed. Sample results for PUMCH-ADM and the validation set of RAMEDIS are included below.
In the table folder, a comparison of multiple statistical metrics is available for each method applied to every dataset. Sample results for PUMCH-ADM and the validation set of RAMEDIS are included below. These metrics provide detailed insight into the performance of each method on the analyzed datasets.
The RawResults folder comprises the complete ranked lists of predicted diseases (covering a total of 9,260 diseases) produced by each method for every patient of each dataset. The raw predictions saved in this folder can range from a few MB to several GB in size. Users can choose whether to store these raw predictions by changing the settings in the test_optimal_model.py file.
We deployed the trained models to PhenoBrain, which contains two modules: Phenotype Extraction and Differential Diagnosis/Disease Prediction. Users can select phenotypes in three ways: through the phenotype tree for precise selection, through the phenotype search function, or through the phenotype extraction function. The phenotype extraction module is in the input interface: users enter clinical text into the phenotype extraction box, select the extraction method (HPO/CHPO, CHPO-UMLS, CText2Hpo), and press the "Extract" button to extract the phenotypes. The right side of the interface presents detailed information about the extracted phenotypes.
Specifically, we prepared three examples in PhenoBrain to demonstrate the phenotype extraction function, with inputs of Chinese clinical text, English clinical text, and an HPO code list. Press the "Extract" button to see the extracted phenotypes.
Example of Phenotype Extraction in English Text
Example of Phenotype Extraction in Chinese Text
After selecting the phenotype:
- Go to the Diagnose interface.
- Select the diagnostic method and the number of predicted results to show.
- Finally, press the “Predict” button to get the prediction results for each method within a few seconds.
Example of Disease Diagnosis
You need to download the base models and place them in the following file locations. The three base models used in this document are as follows:
# base models
<your address>/bert_syn_project/model/bert
<your address>/bert_syn_project/model/albert_google
<your address>/bert_syn_project/model/albert_brightmart
When using different types of base models, you need to switch the corresponding packages in `bert_syn/bert_pkg/__init__.py`. The correspondence is as follows:

- `bert`: BERT model
- `albert`: ALBERT by Google
- `albert_zh (brightmart)`: ALBERT by brightmart
Set up the default environment with the following commands.
# path to the code on your machine
RAREDIS_PATH="<your address>/timgroup_disease_diagnosis"
CORE_PATH="${RAREDIS_PATH}/core"
BERT_SYN_PRJ_PATH="${RAREDIS_PATH}/bert_syn_project"
export PYTHONPATH=$CORE_PATH:$BERT_SYN_PRJ_PATH:$PYTHONPATH
cd $BERT_SYN_PRJ_PATH
Update source data (updated on 2022-11-01): If you need to update the source data for synonyms (used to generate training data), such as changing the version of chpo, you can follow the steps below:
1. Update the path of the CHPO file in `core/core/reader/hpo_reader`.
2. Delete the old version of `chpo_dict.json`: this file is generated directly from the CHPO source file in step 1, so after replacing the source file you need to generate a new `chpo_dict.json`. Run `core/core/reader/hpo_reader.py` to generate the new `chpo_dict.json` file.
3. Delete related files: as in step 2, delete the series of files generated from the CHPO source file. Specifically, see which files are used by `SynDictReader` in `bert_syn/core/data_helper.py`.
4. Regenerate the series of files: run `core/core/text_handler/syn_generator.py` to regenerate the files deleted in step 3. Then run `bert_syn/core/data_helper.py` to regenerate the corresponding datasets. For specific usage details, see `bert_syn/core/data_helper.py`.
File: bert_syn/script/run_bert_ddml_sim.py
Example command: python bert_syn/script/run_bert_ddml_sim.py --model_name xxx --gpu 0 --epoch xxx --lr xxx
Set the address of the training set:
# in bert_syn/core/bert_ddml_sim.py
# in BertDDMLConfig
self.train_data_path = os.path.join(dataset_path, 'train.csv')  # set to None if only predicting
Setting other parameters: you can set them directly in `BertDDMLConfig` in `bert_syn/core/bert_ddml_sim.py`, or pass them through the command line.
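For reference, the command-line flags in the example command above could be wired with a parser along these lines. This is an illustrative sketch only; the actual definitions and defaults in `bert_syn/script/run_bert_ddml_sim.py` may differ.

```python
import argparse

def build_parser():
    # Mirrors the flags shown in the example command (--model_name, --gpu,
    # --epoch, --lr); default values here are placeholders, not the
    # repository's actual defaults.
    parser = argparse.ArgumentParser(
        description='Train the BERT-DDML phenotype similarity model')
    parser.add_argument('--model_name', type=str, required=True,
                        help='name of the folder for the trained model (model/<model_name>)')
    parser.add_argument('--gpu', type=int, default=0,
                        help='index of the GPU to use')
    parser.add_argument('--epoch', type=int, default=3,
                        help='number of training epochs')
    parser.add_argument('--lr', type=float, default=2e-5,
                        help='learning rate')
    return parser
```

Flags passed on the command line then override whatever is set in `BertDDMLConfig`.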
Result files (using the example above):

- Stored model: `model/xxx`
  - `bert_sim_config.json`: model parameters
  - `loss.png`: loss changes during training
  - `pred_median_rank.png`: on the PUMC-S test set, uses the spans delimited by doctors in electronic medical records to see where the doctor-annotated HPO ranks, then calculates the median rank
  - `pred_recall_k.png`: on the PUMC-S test set, uses the spans delimited by doctors in electronic medical records to see whether the doctor-annotated HPO is recalled in the top k, then calculates the top-k accuracy
- Detailed matching results during training (PUMC-S test set): `result`
- File: `bert_syn/core/span_generator.py`
- Example command: `python bert_syn/core/span_generator.py`
- Current configuration:
  - Model: xxx
  - Matching algorithm: `CText2Hpo (S)` (the name of the algorithm used before PBTagger)
  - Data: `<your address>/core/data/raw/PUMC_PK/pumc_pk`
  - Result folder mark: "model name"
- You need to modify the `input_folder`.
- To switch to the model trained in the previous step, update `model_name, global_step = ...`.
- Result folder: `data/preprocess/pumc_pk/dict_bert-albertTinyDDMLSim-tune`
[1] Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent L-C, De Moor B, Marynen P, Hassan B. Gene prioritization through genomic data fusion. Nature Biotechnology. 2006;24.