
# 💡 Compatibility Notice for Ranker and Recommender Systems
## 🔧 Current Configuration
The Ranker and Recommender systems were developed and tested under the following environment:
- **Torch**: 1.9.0+cu111
- **Torchaudio**: 0.9.0
- **Torchvision**: 0.10.0+cu111

## ⚠️ Potential Issues with Upgraded Torch Versions
Running more recent large language models generally requires a newer version of Torch. However, the Ranker and Recommender systems were developed and tested against Torch 1.9.0, and upgrading to a later version may introduce compatibility issues or runtime errors.

We therefore use PyTorch 1.9.0 throughout this notebook. For the Summarizer, we recommend a separate Python environment with a more recent version of PyTorch.
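
One way to guard against running the Ranker and Recommender under an untested Torch build is to compare the installed version string against the tested one. The sketch below is only an illustration: the helper name `matches_tested_version` is hypothetical and not part of PhenoDP, and it compares the release number only, ignoring local build suffixes such as `+cu111`.

```python
def matches_tested_version(installed: str, tested: str = "1.9.0") -> bool:
    """Return True if `installed` has the same release number as `tested`.

    The local build tag after '+' (for example 'cu111') is ignored.
    """
    release = installed.split("+", 1)[0]
    return release == tested


print(matches_tested_version("1.9.0+cu111"))  # True
print(matches_tested_version("2.1.0"))        # False
```

In the notebook itself, the installed version string would come from `torch.__version__`.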

In [1]:
# Standard Libraries
import re
import pickle
from tqdm import tqdm

# Data Handling
import pandas as pd
import numpy as np

# PyTorch
import torch
import torch.nn.functional as F

# HPO Ontology
from pyhpo import Ontology

# Custom Modules
from PhenoDP_Preprocess import *
from PhenoDP import *
from PCLHPOEncoder import *



In [2]:

def calculate_coefficient_of_variation(df, top_n=3):
    """
    Compute the coefficient of variation (CV) of the 'Total_Similarity' column
    over the top n rows of a given DataFrame.

    Parameters:
    df (pd.DataFrame): A DataFrame containing the 'Total_Similarity' column.
    top_n (int): The number of top rows to consider for the calculation. Default is 3.

    Returns:
    cv (float): The coefficient of variation, i.e. the sample standard deviation
        divided by the mean, expressed as a percentage.
    """
    # Extract the top n rows of the 'Total_Similarity' column
    data = df.head(top_n)['Total_Similarity']
    
    # Calculate mean, standard deviation, and coefficient of variation
    mean = data.mean()
    std = data.std()
    cv = (std / mean) * 100
    
    return cv

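def Get_Definition(hpo_list):
    """
    Concatenate the definitions of the given HPO terms into a single string.

    For each term, only the first double-quoted segment of its definition is kept;
    terms whose definition contains no quoted segment are skipped.
    """
    definition_list = []
    for t in hpo_list:
        definition = Ontology.get_hpo_object(t).definition
        # Keep only the text between the first pair of double quotes
        match = re.search(r'"(.*?)"', definition)
        if match:
            definition_list.append(match.group(1))
    return ' '.join(definition_list)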
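
The regular expression in `Get_Definition` is non-greedy, so it stops at the first closing double quote it meets. The sketch below illustrates that behaviour on two made-up definition strings (they are hypothetical and not taken from the ontology), under the assumption that definitions are stored as a quoted sentence followed by a source tag. When the definition contains an escaped inner quote, the match ends right after the backslash, which appears consistent with the truncated `many refer to \` fragment in the generated prompt.

```python
import re


def first_quoted_segment(definition: str):
    """Return the text between the first pair of double quotes, or None."""
    match = re.search(r'"(.*?)"', definition)
    return match.group(1) if match else None


# Hypothetical definition strings, used only to illustrate the regex.
plain = '"A height below the expected norm." [source]'
nested = '"Many refer to \\"short stature\\" as a height below norm." [source]'

print(first_quoted_segment(plain))   # the full quoted sentence
print(first_quoted_segment(nested))  # cut off at the escaped inner quote
```

If complete definitions are needed, the pattern would have to account for escaped quotes; the original behaviour is left unchanged here.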

In [3]:
Ontology('./HPO_2025_3_3/')

with open('./JC_sim_dict_test.pkl', 'rb') as f:
    hp2d_sim_dict = pickle.load(f)
    
with open('./node_embedding_dict_test.plk', 'rb') as f:
    node_embedding = pickle.load(f)

input_dim = 256
num_heads = 8
num_layers = 3
hidden_dim = 512
output_dim = 1
max_seq_length = 128
recommender = PCL_HPOEncoder(input_dim, num_heads, num_layers, hidden_dim, output_dim, max_seq_length)
recommender.load_state_dict(torch.load('./transformer_encoder_infoNCE_test.pth'))


<All keys matched successfully>

In [4]:
Patient_hps = ['HP:0000670', 'HP:0004322', 'HP:0000992', 'HP:0001290', 'HP:0000407', 'HP:0000252', 'HP:0000490']

In [5]:
Input_text = Get_Definition(Patient_hps)

In [6]:
pre_model = PhenoDP_Initial(Ontology)
phenodp = PhenoDP(pre_model=pre_model, hp2d_sim_dict=hp2d_sim_dict, node_embedding=node_embedding, PCL_HPOEncoder=recommender)

generate disease dict...
related hpo num: 9211
generate disease ic dict... 
calculating hp weights
PCL_HPOEncoder is a pre-trained model


In [7]:
df = phenodp.run_Ranker(Patient_hps)
calculate_coefficient_of_variation(df, top_n=3)

Find Candidate Diseases: 100%|██████████| 2789/2789 [00:00<00:00, 24668.42it/s]
Calculating Phi Scores: 100%|██████████| 200/200 [00:00<00:00, 512.98it/s]
Calculating Embedding Similarity: 100%|██████████| 200/200 [00:01<00:00, 125.20it/s]


3.202152889291407

In [8]:
df

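|     | Disease | Total_Similarity |
|-----|---------|------------------|
| 0   | 216400  | 0.753126 |
| 1   | 133540  | 0.720296 |
| 2   | 278760  | 0.708106 |
| 3   | 618342  | 0.705713 |
| 4   | 268850  | 0.698422 |
| ... | ...     | ... |
| 195 | 194050  | 0.567014 |
| 196 | 300990  | 0.565087 |
| 197 | 268400  | 0.551475 |
| 198 | 256550  | 0.545349 |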
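
The coefficient of variation reported for the top three candidates can be checked by hand from the first three `Total_Similarity` values in the Ranker output above. The sketch below uses only the standard library; `statistics.stdev` is the sample standard deviation (divisor n - 1), which matches the default of pandas `Series.std()`. The scores are the rounded values shown in the output, so the result agrees with the notebook only to the displayed precision.

```python
import statistics

# Top three Total_Similarity values from the Ranker output (rounded as displayed)
top_scores = [0.753126, 0.720296, 0.708106]

mean = statistics.mean(top_scores)
std = statistics.stdev(top_scores)  # sample standard deviation (n - 1)
cv = std / mean * 100               # coefficient of variation in percent

print(round(cv, 2))  # 3.2
```

A small CV means the top candidates have very similar scores, so the ranking alone does not separate them clearly.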


In [9]:
phenodp.run_Recommender(Patient_hps, target_disease=216400, candidate_diseases=[216400, 278760, 133540])

using default setting...


Calculating NCE Loss: 100%|██████████| 20/20 [00:00<00:00, 130.03it/s]


|   | hp         | importance |
|---|------------|------------|
| 0 | HP:0002119 | 1.599746 |
| 1 | HP:0001105 | 1.291994 |
| 2 | HP:0000518 | 1.171731 |
| 3 | HP:0005301 | 1.149699 |
| 4 | HP:0000135 | 1.033585 |
| 5 | HP:0003273 | 1.016881 |
| 6 | HP:0000448 | 0.977473 |
| 7 | HP:0001260 | 0.883704 |
| 8 | HP:0000858 | 0.804751 |
| 9 | HP:0007814 | 0.778095 |


In [10]:
def generate_diagnosis_prompt(Patient_hps, df, Top_n=3, Top_Recom=2):
    """
    Generate a prompt for explaining potential symptoms to differentiate between candidate diseases.

    Args:
        Patient_hps (list): List of patient's observed HPO terms.
        df (pd.DataFrame): DataFrame containing candidate diseases and their details.
        Top_n (int): Number of top candidate diseases to consider. Default is 3.
        Top_Recom (int): Number of top recommended symptoms for each disease. Default is 2.

    Returns:
        str: A formatted prompt for disease differentiation.
    """
    # Get observed symptoms
    observered_syn = Get_Definition(Patient_hps)
    
    # Get top candidate diseases
    candidate_diseases = df.head(Top_n)['Disease'].values
    
    # Initialize lists for diseases and recommendations
    diseases_list = []
    recom_list = []
    txt_inputs = []
    
    # Process each candidate disease
    for index, disease_id in enumerate(candidate_diseases):
        # Get recommended symptoms for the disease
        recom = phenodp.run_Recommender(Patient_hps, target_disease=disease_id, candidate_diseases=candidate_diseases)
        recom_list.append([Ontology.get_hpo_object(hp).name for hp in recom.head(Top_Recom).hp.values])
        
        # Get disease details and append to diseases_list
        diseases_list.extend([str(index + 1) + '. [OMIM:' + str(disease_id) + '] ' + j.name for j in Ontology.omim_diseases if j.id == disease_id])
        
        # Format disease and symptoms for txt_inputs
        txt_inputs.append(diseases_list[-1] + ' : ' + ', '.join(recom_list[-1]))
    
    # Format diseases_list and txt_inputs as strings
    diseases_list_str = "\n".join(diseases_list)
    txt_inputs_str = "\n".join(txt_inputs)
    
    # Generate the prompt
    prompt = f"""
Assume you are an experienced clinical physician. Below is a patient’s symptom description using HPO (Human Phenotype Ontology) terms, along with three candidate diagnoses. To further differentiate between these diagnoses, the physician has provided potential symptoms that the patient does not currently exhibit but could help clarify or confirm the diagnosis. Your task is to explain why these potential symptoms are critical for distinguishing between the three diseases.  

**Patient’s Symptom Description**:  
{observered_syn}  

**Three Most Likely Disease Diagnoses**:  
{diseases_list_str}  

**Potential Symptoms for Further Differentiation**:  
{txt_inputs_str}  

**Instructions**:  
1. **Explain Potential Symptoms**: Provide a clear and concise rationale for why the listed potential symptoms are critical for distinguishing between the three diseases. Focus on how these symptoms are specific to or more prevalent in one disease compared to the others.  
2. **Do Not Diagnose**: Do not make any new diagnoses or suggest additional diseases. Your response should focus solely on explaining the potential symptoms for differentiation.  
3. **Length and Style**: The report should be approximately 200–300 words in length, written in a professional and authentic tone that mimics a human expert.  
4. **No References**: Do not include any references in the report.   
"""
    return prompt
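
The disease lookup inside `generate_diagnosis_prompt` scans `Ontology.omim_diseases` once per candidate. If that ever becomes slow, one option is to build an id-to-name dictionary once and reuse it. The sketch below is an illustration under stated assumptions: `Disease` is a hypothetical stand-in exposing only the `id` and `name` attributes the function relies on, `format_candidates` is not part of PhenoDP, and the disease names are placeholders, so the example runs without pyhpo.

```python
from typing import NamedTuple


class Disease(NamedTuple):
    """Hypothetical stand-in for the entries of Ontology.omim_diseases."""
    id: int
    name: str


def format_candidates(candidate_ids, diseases):
    """Return numbered '[OMIM:id] name' lines for the ids found in `diseases`."""
    name_by_id = {d.id: d.name for d in diseases}  # built once, reused per candidate
    lines = []
    for rank, disease_id in enumerate(candidate_ids, start=1):
        if disease_id in name_by_id:
            lines.append(f"{rank}. [OMIM:{disease_id}] {name_by_id[disease_id]}")
    return lines


# Placeholder names; real names would come from the ontology.
catalog = [Disease(216400, "Disease A"), Disease(133540, "Disease B")]
print("\n".join(format_candidates([216400, 133540], catalog)))
```

As in the original list comprehension, an id that is missing from the catalog simply produces no line.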


In [11]:
# Assumes Patient_hps and df have already been defined
prompt = generate_diagnosis_prompt(Patient_hps, df, Top_n=5, Top_Recom=3)
print(prompt)

# Save the prompt to a text file
with open('./Case_Report_Prompt.txt', 'w', encoding='utf-8') as f:
    f.write(prompt)

using default setting...

Calculating NCE Loss: 100%|██████████| 20/20 [00:00<00:00, 128.05it/s]

using default setting...

Calculating NCE Loss: 100%|██████████| 16/16 [00:00<00:00, 133.53it/s]

using default setting...

Calculating NCE Loss: 100%|██████████| 21/21 [00:00<00:00, 141.12it/s]

using default setting...

Calculating NCE Loss: 100%|██████████| 38/38 [00:00<00:00, 147.06it/s]

using default setting...

Calculating NCE Loss: 100%|██████████| 23/23 [00:00<00:00, 146.28it/s]



Assume you are an experienced clinical physician. Below is a patient’s symptom description using HPO (Human Phenotype Ontology) terms, along with three candidate diagnoses. To further differentiate between these diagnoses, the physician has provided potential symptoms that the patient does not currently exhibit but could help clarify or confirm the diagnosis. Your task is to explain why these potential symptoms are critical for distinguishing between the three diseases.  

**Patient’s Symptom Description**:  
Caries is a multifactorial bacterial infection affecting the structure of the tooth. This term has been used to describe the presence of more than expected dental caries. A height below that which is expected according to age and gender norms. Although there is no universally accepted definition of short stature, many refer to \ An increased sensitivity of the skin to light. Photosensitivity may result in a rash upon exposure to the sun (which is known as photodermatosis). Ph


# **Notebook Demonstration**
In this <kbd>Jupyter Notebook</kbd>, we demonstrated how to use the <mark>Ranker</mark> and <mark>Recommender</mark>. We combined the outputs of the two components into a <ins>prompt</ins>, which can be passed to the <u>Summarizer</u> to obtain the case report.