# Run Ranker
hp.obo version: 2024.04

In [1]:
from pyhpo import Ontology
import pickle
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F
import numpy as np
import torch
from PhenoDP_Preprocess import *
from PhenoDP import *



## Initializing Ontology from pyhpo

First, initialize the ontology from pyhpo.

To use a different HPO release, replace the `hp.obo` file in the pyhpo data directory, for example `~/anaconda3/envs/PhenoDP/lib/python3.7/site-packages/pyhpo/data`.
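The exact location of the pyhpo data directory depends on your environment. As a minimal sketch, assuming (as the path above suggests) that the package keeps `hp.obo` in a `data` subfolder of its install directory, the folder can be located without hard-coding the environment path. The helper name `package_data_dir` is hypothetical and not part of pyhpo.

```python
import importlib.util
import os


def package_data_dir(package_name, subdir="data"):
    """Return the path of `subdir` inside an installed package's directory.

    Assumption: the package is a regular package with an __init__.py.
    The subfolder is not checked for existence.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ImportError(f"package {package_name!r} is not installed")
    return os.path.join(os.path.dirname(spec.origin), subdir)
```

With pyhpo installed, `package_data_dir("pyhpo")` should point at a folder like the one shown above. Back up the bundled `hp.obo` before replacing it.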


In [2]:
%%time 
Ontology()

CPU times: user 25.5 s, sys: 736 ms, total: 26.2 s
Wall time: 26.2 s


<pyhpo.ontology.OntologyClass at 0x7ff7a534ef60>

## Reading the Preprocessing Files

Next, load the two preprocessing files. Download links are listed on the PhenoDP project homepage on GitHub.

- `JC_sim_dict.pkl` is a dictionary that contains the HPO term-disease similarity matrix calculated using the JC method.
- `node_embedding_dict_T5_gcn.pkl` is a dictionary that maps each node in the HPO DAG to its embedding vector, generated by PSD-HPOEncoder.
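Before passing a loaded file to PhenoDP, it can help to check that it has the shape you expect. The following is a minimal sketch that round-trips a toy dictionary through pickle and summarizes it. The helper names `load_pickle` and `summarize_mapping` and the toy data are illustrative assumptions; the real files may be structured differently.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load one pickled object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_mapping(obj, n_keys=3):
    """Return the type name, size, and first few keys of a dict-like object."""
    keys = list(obj)[:n_keys]
    return {"type": type(obj).__name__, "size": len(obj), "first_keys": keys}


# Toy stand-in for a real file: two HPO IDs mapped to short vectors.
toy = {"HP:0000670": [0.1, 0.2], "HP:0004322": [0.3, 0.4]}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_embedding.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy, f)
    summary = summarize_mapping(load_pickle(toy_path))

print(summary)
```

Unpickling can execute arbitrary code, so only load `.pkl` files from a source you trust, such as the project homepage.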


In [3]:
%%time 
with open('../JC_sim_dict.pkl', 'rb') as f:
    hp2d_sim_dict = pickle.load(f)
    
with open('../node_embedding_dict_T5_gcn.pkl', 'rb') as f:
    node_embedding = pickle.load(f)
    


CPU times: user 13.8 s, sys: 4.43 s, total: 18.2 s
Wall time: 18.1 s


In [4]:
%%time 
pre_model = PhenoDP_Initial(Ontology)
phenodp = PhenoDP(pre_model=pre_model, hp2d_sim_dict=hp2d_sim_dict, node_embedding=node_embedding)

generate disease dict...
related hpo num: 8950
generate disease ic dict... 
calculating hp weights
PCL_HPOEncoder is None
CPU times: user 2.88 s, sys: 72 ms, total: 2.95 s
Wall time: 2.94 s


## Running Example

In [5]:
%%time
test_set = ['HP:0000670', 'HP:0004322', 'HP:0000992', 'HP:0001290', 'HP:0000407', 'HP:0000252', 'HP:0000490']
test_tar = 216400

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 10.3 µs


In [6]:
%%time
df = phenodp.run_Ranker(test_set)

Find Candidate Diseases: 100%|███████████████████████████████████████████████████| 2744/2744 [00:00<00:00, 23454.69it/s]
Calculating Phi Scores: 100%|████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 471.23it/s]
Calculating Embedding Similarity: 100%|██████████████████████████████████████████████| 200/200 [00:01<00:00, 140.00it/s]

CPU times: user 2 s, sys: 224 ms, total: 2.22 s
Wall time: 2.03 s





In [7]:
df

|  | Disease | Total_Similarity |
| --- | --- | --- |
| 0 | 216400 | 0.729222 |
| 1 | 278760 | 0.709141 |
| 2 | 133540 | 0.707141 |
| 3 | 618342 | 0.705748 |
| 4 | 615919 | 0.687560 |
| ... | ... | ... |
| 195 | 600430 | 0.558744 |
| 196 | 256550 | 0.557122 |
| 197 | 268400 | 0.554204 |
| 198 | 620494 | 0.552885 |
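The example defines `test_tar` but never compares it with the ranking. A minimal sketch of that check follows, assuming the ranking is available as a plain list of disease IDs ordered best first. The helper name `rank_of_target` is hypothetical and not part of PhenoDP.

```python
def rank_of_target(ranked_ids, target):
    """Return the 1-based rank of `target` in `ranked_ids`, or None if absent."""
    for rank, disease_id in enumerate(ranked_ids, start=1):
        if disease_id == target:
            return rank
    return None


# First five disease IDs from the ranking shown above.
top_ids = [216400, 278760, 133540, 618342, 615919]
print(rank_of_target(top_ids, 216400))
```

With the notebook variables, the equivalent call would be `rank_of_target(df['Disease'].tolist(), test_tar)`.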


# Run Recommender

## Load Weights

`transformer_encoder_infonce_norm.pth` contains the weights of our pre-trained PCL_HPOEncoder. You can generate it by following our training `.ipynb` notebook, or download it from the PhenoDP project homepage on GitHub.


In [8]:
%%time
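from PCL_HPOEncoder import PCL_HPOEncoder

# Architecture hyperparameters; they must match the saved checkpoint.
input_dim = 256
num_heads = 8
num_layers = 3
hidden_dim = 512
output_dim = 1
max_seq_length = 128
# Note: this rebinds the name PCL_HPOEncoder from the class to the model instance.
PCL_HPOEncoder = PCL_HPOEncoder(input_dim, num_heads, num_layers, hidden_dim, output_dim, max_seq_length)
# map_location='cpu' lets the checkpoint load on machines without a GPU.
PCL_HPOEncoder.load_state_dict(torch.load('../20240723/res_baseline/transformer_encoder_infonce_norm.pth', map_location='cpu'))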

CPU times: user 576 ms, sys: 12 ms, total: 588 ms
Wall time: 64.1 ms


<All keys matched successfully>

In [9]:
%%time
phenodp = PhenoDP(pre_model=pre_model, hp2d_sim_dict=hp2d_sim_dict, node_embedding=node_embedding, PCL_HPOEncoder=PCL_HPOEncoder)

PCL_HPOEncoder is a pre-trained model
CPU times: user 304 ms, sys: 0 ns, total: 304 ms
Wall time: 303 ms


In [10]:
%%time
# Use the three top-ranked diseases as candidates for the Recommender.
candidates_d = df['Disease'].values[:3]
print(candidates_d)

[216400 278760 133540]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.44 ms


In [11]:
%%time
phenodp.run_Recommender(test_set, target_disease=test_tar, candidate_diseases=candidates_d)

using default setting...


Calculating NCE Loss: 100%|████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 130.66it/s]

CPU times: user 3.26 s, sys: 60 ms, total: 3.32 s
Wall time: 170 ms





|  | hp | importance |
| --- | --- | --- |
| 0 | HP:0001260 | 0.989002 |
| 1 | HP:0011461 | 0.858314 |
| 2 | HP:0000448 | 0.712112 |
| 3 | HP:0000135 | 0.710988 |
| 4 | HP:0003357 | 0.697992 |
| 5 | HP:0006297 | 0.566507 |
| 6 | HP:0007814 | 0.511067 |
| 7 | HP:0005301 | 0.434549 |
| 8 | HP:0001288 | 0.432601 |
| 9 | HP:0001105 | 0.432421 |


# Run Summarizer

## Transformers Version 4.30.2

Users can download the `flan-t5-base` model from Hugging Face: [https://huggingface.co/google/flan-t5-base](https://huggingface.co/google/flan-t5-base), and then load it locally.

`flan-model.pth` is our pre-trained model weights, which can be downloaded from the GitHub PhenoDP project homepage.


In [12]:
%%time
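from transformers import AutoTokenizer, T5ForConditionalGeneration
import torch

# Local copy of the flan-t5-base model downloaded from Hugging Face.
model = T5ForConditionalGeneration.from_pretrained("/root/flanT5/")
tokenizer = AutoTokenizer.from_pretrained("/root/flanT5/")

# Load the fine-tuned weights on CPU first so this also works without a GPU.
state_dict = torch.load('../flan-model.pth', map_location='cpu')
# Strip the 'module.' prefix that nn.DataParallel adds to parameter names.
# Whether the prefix is present does not depend on CUDA being available.
state_dict = {(k[len('module.'):] if k.startswith('module.') else k): v for k, v in state_dict.items()}

model.load_state_dict(state_dict)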

CPU times: user 8.08 s, sys: 3.08 s, total: 11.2 s
Wall time: 8.54 s


<All keys matched successfully>

In [13]:
%%time
def get_output_txt_HPO2SUM(txt, p_model, p_tokenizer, device):
    """Generate a summary for `txt` (concatenated HPO names and definitions)."""
    p_model.to(device)
    # Truncate the input to at most 1024 tokens.
    input_ids = p_tokenizer(txt, return_tensors="pt", max_length=1024, truncation=True).input_ids
    with torch.no_grad():
        output_ids = p_model.generate(
            input_ids=input_ids.to(device),
            max_length=1024,
            min_length=128,
            early_stopping=False,
            do_sample=True,  # sampling makes the output vary between runs
            no_repeat_ngram_size=5,
            top_k=50,
            top_p=0.95,
        )
    return p_tokenizer.decode(output_ids[0], skip_special_tokens=True)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 10.7 µs


## Input Description

The input is the concatenated `name` and `definition` of each HPO term. Some definitions are followed by a bracketed source reference such as `[PMID]`, so we split on `[` and keep only the text before it.
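The splitting step can be isolated as a small helper. This is a minimal sketch using the HP:0000670 definition from this notebook. The helper name `strip_source_reference` is hypothetical, and unlike the notebook cell it also trims trailing whitespace.

```python
def strip_source_reference(definition):
    """Keep the text before the first '[' and trim surrounding whitespace.

    Assumption: the bracketed source reference comes after the definition
    text, and the definition itself contains no '['.
    """
    return definition.split("[")[0].strip()


raw = (
    '"Caries is a multifactorial bacterial infection affecting the structure '
    'of the tooth. This term has been used to describe the presence of more '
    'than expected dental caries." [https://orcid.org/0000-0002-0736-9199]'
)
cleaned = strip_source_reference(raw)
print(cleaned)
```

Splitting on the first `[` is simple, but it would also cut a definition that contains a bracket in its own text, so spot-check the output for terms you care about.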

For example:
```markdown
HP:0000670
def: "Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries." [https://orcid.org/0000-0002-0736-9199]
```


In [14]:
%%time
text = [Ontology.get_hpo_object(t).name + ' ' + Ontology.get_hpo_object(t).definition.split('[')[0] for t in test_set]
text = ' '.join(text)
print(text)

Carious teeth "Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries."  Short stature "A height below that which is expected according to age and gender norms. Although there is no universally accepted definition of short stature, many refer to \"short stature\" as height more than 2 standard deviations below the mean for age and gender (or below the 3rd percentile for age and gender dependent norms)."  Cutaneous photosensitivity "An increased sensitivity of the skin to light. Photosensitivity may result in a rash upon exposure to the sun (which is known as photodermatosis). Photosensitivity can be diagnosed by phototests in which light is shone on small areas of skin."  Generalized hypotonia "Generalized muscular hypotonia (abnormally low muscle tone)."  Sensorineural hearing impairment "A type of hearing impairment in one or both ears related to an abnormal functionalit

In [15]:
%%time
# Adjust `device` to your hardware, for example 'cuda:0' or 'cpu'.
get_output_txt_HPO2SUM(text, p_model=model, p_tokenizer=tokenizer, device='cuda:1')

CPU times: user 2.4 s, sys: 124 ms, total: 2.52 s
Wall time: 2.51 s


'A rare disorder of sex development due to a defect in the ovaries of the ovaries, characterized by tall stature and a very mild sex chromosome Y anomaly. Clinically, a mixture of testicular and ovarian tissue is present in individuals with normal testis development. External genitalia and ambiguous genitalia are reported in a 46,XY individual with normal testis. Cryptorchidism, hypergonadotropic hypogonadism and hypogonadima have been reported in some cases.'