# Run the Ranker
hp.obo version: 2024.04

In [1]:
from pyhpo import Ontology
import pickle
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F
import numpy as np
import torch
from PhenoDP_Preprocess import *
from PhenoDP import *



## Initializing Ontology from pyhpo

First, initialize the ontology from pyhpo.

To use a different `hp.obo` release, replace the file in the pyhpo data directory, for example `~/anaconda3/envs/PhenoDP/lib/python3.7/site-packages/pyhpo/data`.
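
The install location differs between environments, so the data directory can also be looked up programmatically. The following is a minimal sketch, assuming the installed package keeps its files in a `data` subdirectory as in the path above. `package_data_dir` is a hypothetical helper and not part of pyhpo.

```python
import importlib.util
from pathlib import Path


def package_data_dir(package_name, subdir="data"):
    """Return the path of `subdir` inside an installed package (hypothetical helper)."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ImportError(f"package {package_name!r} is not installed")
    # spec.origin points at the package's __init__.py; its parent is the package directory.
    return Path(spec.origin).parent / subdir


# Usage (requires pyhpo to be installed):
#     print(package_data_dir("pyhpo"))
```

The helper only builds the path. It does not check that `hp.obo` is actually present there.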


In [2]:
%%time 
Ontology()

CPU times: user 28.7 s, sys: 644 ms, total: 29.3 s
Wall time: 29.3 s


<pyhpo.ontology.OntologyClass at 0x7f79ed110ef0>

## Next Steps: Read Necessary Preprocessing Files

Next, load the two precomputed files. Download links are on the PhenoDP GitHub project homepage.

- `JC_sim_dict.pkl` is a dictionary that contains the HPO term-disease similarity matrix calculated using the JC method.
- `node_embedding_dict_T5_gcn.pkl` stores an embedding vector for each node in the HPO DAG, generated by PSD-HPOEncoder.
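
Before passing these files to PhenoDP, it can help to confirm that each one loaded as expected. The sketch below uses a hypothetical helper, `describe_pickle`, which is not part of PhenoDP. It reports only the top-level type and size of the pickled object and makes no assumption about the internal layout of the two files. Because `pickle.load` can execute arbitrary code, only load files from a source you trust.

```python
import pickle


def describe_pickle(path):
    """Load a pickle file and return (object, one-line summary). Hypothetical helper."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    # Containers such as dicts and lists report their number of entries.
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    summary = f"{path}: type={type(obj).__name__}, entries={size}"
    return obj, summary


# Usage with the files from this tutorial:
#     hp2d_sim_dict, info = describe_pickle('../JC_sim_dict.pkl')
#     print(info)
```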


In [3]:
%%time 
with open('../JC_sim_dict.pkl', 'rb') as f:
    hp2d_sim_dict = pickle.load(f)
    
with open('../node_embedding_dict_T5_gcn.pkl', 'rb') as f:
    node_embedding = pickle.load(f)
    


CPU times: user 14.7 s, sys: 4.56 s, total: 19.3 s
Wall time: 19.1 s


In [4]:
%%time 
pre_model = PhenoDP_Initial(Ontology)
phenodp = PhenoDP(pre_model=pre_model, hp2d_sim_dict=hp2d_sim_dict, node_embedding=node_embedding)

generate disease dict...
related hpo num: 8950
generate disease ic dict... 
calculating hp weights
PCL_HPOEncoder is None
CPU times: user 3.77 s, sys: 144 ms, total: 3.92 s
Wall time: 3.91 s


## Running Example

In [5]:
%%time
test_set = ['HP:0003521',
 'HP:0000470',
 'HP:0001249',
 'HP:0003422',
 'HP:0003418',
 'HP:0002751']

test_tar = 277300

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 11 µs
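
Malformed identifiers are easy to introduce when a term list is typed by hand. The following is a minimal sketch of a format check, assuming HPO IDs follow the `HP:` plus seven digits pattern seen in `test_set`. `invalid_hpo_ids` is a hypothetical helper. It checks only the format, not whether the term exists in the loaded ontology.

```python
import re

# Pattern assumed from the IDs in test_set: the prefix "HP:" followed by seven digits.
HPO_ID_PATTERN = re.compile(r"HP:\d{7}")


def invalid_hpo_ids(terms):
    """Return the entries of `terms` that do not look like HPO IDs (hypothetical helper)."""
    return [t for t in terms if not HPO_ID_PATTERN.fullmatch(t)]


# Usage: an empty list means every ID is well-formed.
#     print(invalid_hpo_ids(test_set))
```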


In [6]:
%%time
df = phenodp.run_Ranker(test_set)

Find Candidate Diseases: 100%|██████████| 2238/2238 [00:00<00:00, 30660.16it/s]
Calculating Phi Scores: 100%|██████████| 200/200 [00:00<00:00, 511.41it/s]
Calculating Embedding Similarity: 100%|██████████| 200/200 [00:01<00:00, 167.63it/s]

CPU times: user 1.76 s, sys: 92 ms, total: 1.85 s
Wall time: 1.71 s





In [7]:
df

| | Disease | Total_Similarity |
|---|---|---|
| 0 | 277300 | 0.820066 |
| 1 | 271630 | 0.759071 |
| 2 | 271530 | 0.736881 |
| 3 | 613330 | 0.731544 |
| 4 | 122600 | 0.730328 |
| ... | ... | ... |
| 195 | 106300 | 0.506284 |
| 196 | 300514 | 0.503490 |
| 197 | 300915 | 0.494481 |
| 198 | 600384 | 0.460572 |
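
To check where the known diagnosis lands, look up the position of `test_tar` in the ranked list. The following is a minimal sketch with a hypothetical helper, `rank_of`. It accepts any ordered sequence of disease IDs, such as `df['Disease'].tolist()`. The sample list reuses the top five IDs from the table above.

```python
def rank_of(target, ranked_diseases):
    """Return the 1-based rank of `target`, or None if it is absent (hypothetical helper)."""
    for position, disease in enumerate(ranked_diseases, start=1):
        if disease == target:
            return position
    return None


# Top five disease IDs from the Ranker output shown above.
top_five = [277300, 271630, 271530, 613330, 122600]
print(rank_of(277300, top_five))  # 1
```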


# Run the Recommender

## Load Weights

`transformer_encoder_infonce_norm.pth` is our pre-trained PCL_HPOEncoder. You can generate it by following our `.ipynb` file, or download it from the GitHub PhenoDP project homepage.


In [8]:
%%time
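from PCL_HPOEncoder import PCL_HPOEncoder

# Architecture hyperparameters for the encoder whose weights are loaded below.
input_dim = 256
num_heads = 8
num_layers = 3
hidden_dim = 512
output_dim = 1
max_seq_length = 128

# Note: this rebinds the name PCL_HPOEncoder from the class to the model instance.
PCL_HPOEncoder = PCL_HPOEncoder(input_dim, num_heads, num_layers, hidden_dim, output_dim, max_seq_length)
# map_location='cpu' lets the checkpoint load on machines without a GPU.
PCL_HPOEncoder.load_state_dict(torch.load('../20240723/res_baseline/transformer_encoder_infonce_norm.pth', map_location='cpu'))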

CPU times: user 2.15 s, sys: 4 ms, total: 2.16 s
Wall time: 294 ms


<All keys matched successfully>

In [9]:
%%time
phenodp = PhenoDP(pre_model=pre_model, hp2d_sim_dict=hp2d_sim_dict, node_embedding=node_embedding, PCL_HPOEncoder=PCL_HPOEncoder)

PCL_HPOEncoder is a pre-trained model
CPU times: user 360 ms, sys: 8 ms, total: 368 ms
Wall time: 360 ms


In [10]:
%%time
# Use the three top-ranked diseases from the Ranker output as candidates.
candidates_d = df['Disease'].values[:3]
print(candidates_d)

[277300 271630 271530]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.05 ms


In [11]:
%%time
phenodp.run_Recommender(test_set, target_disease=277300, candidate_diseases=candidates_d)

using default setting...


Calculating NCE Loss: 100%|██████████| 13/13 [00:00<00:00, 106.75it/s]

CPU times: user 4.52 s, sys: 20 ms, total: 4.54 s
Wall time: 490 ms





| | hp | importance |
|---|---|---|
| 0 | HP:0003510 | 6.947567 |
| 1 | HP:0004322 | 2.315607 |
| 2 | HP:0002948 | 0.817554 |
| 3 | HP:0003310 | 0.81753 |
| 4 | HP:0001522 | 0.737438 |
| 5 | HP:0002937 | 0.565266 |
| 6 | HP:0011461 | 0.401018 |
| 7 | HP:0000476 | 0.357949 |
| 8 | HP:0003305 | 0.324991 |
| 9 | HP:0001538 | 0.28419 |
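
The importance scores drop sharply after the first two terms, so one may want to follow up only on the highest-scoring suggestions. The following is a minimal sketch, assuming the result is available as (term, importance) pairs, for instance via `zip(result['hp'], result['importance'])`, where `result` is a hypothetical name for the returned table. The cutoff of 1.0 is an arbitrary illustration, not a PhenoDP recommendation.

```python
def terms_above(scored_terms, min_importance):
    """Return the terms whose importance is at least `min_importance` (hypothetical helper)."""
    return [term for term, score in scored_terms if score >= min_importance]


# First three rows of the Recommender output shown above.
scored = [("HP:0003510", 6.947567), ("HP:0004322", 2.315607), ("HP:0002948", 0.817554)]
print(terms_above(scored, 1.0))  # ['HP:0003510', 'HP:0004322']
```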


# Run the Summarizer

## Transformers Version 4.30.2

Download the `flan-t5-base` model from Hugging Face ([https://huggingface.co/google/flan-t5-base](https://huggingface.co/google/flan-t5-base)) and load it from a local directory (`/root/flanT5/` in this example).

`flan-model.pth` contains our pre-trained model weights and can be downloaded from the GitHub PhenoDP project homepage.


In [12]:
%%time
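from transformers import AutoTokenizer, T5ForConditionalGeneration
import torch

# Load the flan-t5-base architecture and tokenizer from a local directory.
model = T5ForConditionalGeneration.from_pretrained("/root/flanT5/")
tokenizer = AutoTokenizer.from_pretrained("/root/flanT5/")

# Load the PhenoDP weights onto the CPU; the model is moved to the target device later.
state_dict = torch.load('../flan-model.pth', map_location='cpu')
# Strip the 'module.' prefix that wrappers such as torch.nn.DataParallel add to parameter names.
# The prefix depends on how the weights were saved, not on whether a GPU is available now,
# so it is removed unconditionally.
state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v for k, v in state_dict.items()}

model.load_state_dict(state_dict)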

CPU times: user 7.86 s, sys: 3.26 s, total: 11.1 s
Wall time: 8.75 s


<All keys matched successfully>

In [13]:
%%time
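def get_output_txt_HPO2SUM(txt, p_model, p_tokenizer, device):
    """Generate a text summary for the concatenated HPO names and definitions in `txt`."""
    with torch.no_grad():
        # Prepend the task prompt and truncate the input to at most 1024 tokens.
        input_text = 'Help me Summarize Text:' + txt
        input_ids = p_tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True).input_ids
        p_model.to(device)
        # Sampling (do_sample=True) makes the output vary between runs.
        output_ids = p_model.generate(input_ids=input_ids.to(device),
                                      max_length=1024,
                                      min_length=128,
                                      early_stopping=False,
                                      do_sample=True,
                                      no_repeat_ngram_size=5,
                                      top_k=50,
                                      top_p=0.95)
        # Decode the first (and only) generated sequence back to text.
        output_text = p_tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text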

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 663 µs
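
The `top_k` and `top_p` arguments restrict which tokens can be sampled at each generation step. The sketch below illustrates the idea on a plain list of probabilities. It is a simplified illustration with a hypothetical function name, not the Transformers implementation, which operates on logits tensors. Because `do_sample=True` draws tokens at random from the filtered distribution, repeated calls with the same input can produce different summaries.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized cumulative probability reaches top_p. Returns {index: probability}."""
    # Step 1: keep only the top_k most probable tokens.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    # Step 2: walk down the ranking until the cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / total
        if cumulative >= top_p:
            break
    # Step 3: renormalize the surviving tokens so their probabilities sum to 1.
    kept_total = sum(p for _, p in kept)
    return {index: p / kept_total for index, p in kept}


filtered = top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.8)
print(sorted(filtered))  # [0, 1]
```

With `top_k=50` and `top_p=0.95`, as used above, only the unlikely tail of the vocabulary is excluded at each step.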


## Input Description

The input consists of the `name` and `definition` of each HPO term. Some definitions are followed by a bracketed reference such as `[PMID]`, so we split on the `[` delimiter and keep only the text before it.
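
The delimiter step can be sketched in isolation as follows. `strip_reference` is a hypothetical helper, and the trailing `.strip()` is an addition for tidiness that the notebook code does not apply. The sample string is the definition of `HP:0000670`.

```python
def strip_reference(definition):
    """Keep the text before the first '[' and drop surrounding whitespace (hypothetical helper)."""
    return definition.split('[')[0].strip()


raw = ('"Caries is a multifactorial bacterial infection affecting the structure of the tooth. '
       'This term has been used to describe the presence of more than expected dental caries." '
       '[https://orcid.org/0000-0002-0736-9199]')
print(strip_reference(raw))
```

One limitation of this approach is that a definition which itself contains a `[` character is cut at that point.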

For example, the raw entry for `HP:0000670` looks like this:
```text
HP:0000670
def: "Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries." [https://orcid.org/0000-0002-0736-9199]
```


In [14]:
%%time
text = [Ontology.get_hpo_object(t).name + ' ' + Ontology.get_hpo_object(t).definition.split('[')[0] for t in test_set]
text = ' '.join(text)
print(text)

Disproportionate short-trunk short stature "A type of disproportionate short stature characterized by a short trunk but a average-sized limbs."  Short neck "Diminished length of the neck."  Intellectual disability "Intellectual disability, previously referred to as mental retardation, is characterized by subnormal intellectual functioning that occurs during the developmental period. It is defined by an IQ score below 70."  Vertebral segmentation defect "An abnormality related to a defect of vertebral separation during development."  Back pain "An unpleasant sensation characterized by physical discomfort (such as pricking, throbbing, or aching) localized to the back."  Kyphoscoliosis "An abnormal curvature of the spine in both a coronal (lateral) and sagittal (back-to-front) plane." 
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 443 µs


In [15]:
%%time
# 'cuda:1' selects the second GPU; use 'cuda:0' or 'cpu' if your machine differs.
get_output_txt_HPO2SUM(text, p_model=model, p_tokenizer=tokenizer, device='cuda:1')

CPU times: user 3 s, sys: 172 ms, total: 3.18 s
Wall time: 3.17 s


'A rare, genetic, skeletal dysplasia characterized by disproportionate short-trunk short stature, short neck, vertebral defect and intellectual disability. Additional clinical features may include ophthalmologic abnormalities (such as amblyopia and cataract), seizures, hypotonia, global development delay, intellectual deficit and behavioral abnormalities, such as attention deficit hyper- or hyperactivity disorder. Facial dysmorphism may include upslanted palpebral fissures, hypertelorism, abnormal palpebral ptosis, arched eyebrows, long nose with upturned tip, anteverted nares, proptosis, depressed nasal bridge and micrognathia. Brain imaging may show high signal intensities, cerebral atrophy, or periventricular white matter abnormalities.'