# Target Ranking

The code below reproduces the ranking results reported in the paper for the five indications.

It proceeds as follows:

1. Load the disease dataset and the TargetSeek scoring results.
2. Train and evaluate the model on the five indications.
3. Obtain the ranking results for the five indications.
4. Calculate Recall@20, Precision@20, and ROC-AUC for the five indications.



## 1. **Load Disease Dataset and TargetSeek Scoring Results**

In [1]:
from Target_Ranking_and_Evaluation.utils import *
from Report_Generation_AgentAnalyst.run_section import directory_set_up
import os

In [2]:
retrieve_path = directory_set_up() + '/Target_Ranking_and_Evaluation'

In [3]:
os.chdir(retrieve_path)

Load the disease dataset and the TargetSeek scoring results:

In [4]:
disease_path, geneseek_path = get_paths('atherosclerosis')
disease_dataset, geneseek = load_data(disease_path, geneseek_path)

In [5]:
all_data = process_diseases()

Processing atherosclerosis...
Processing ibd...
Processing type2_diabetes...
Processing rheumatoid_arthritis...
Processing non_small_cell_lung_cancer...
Processing metabolic_dysfunction_associated_steatohepatitis_mash...


In [6]:
all_data = get_T_max(all_data)

In [7]:
all_data

| Unnamed: 0 | CI_Genetic Association | CI_Differential expression | CI_Mechanism of Action | CI_In vitro_in vivo experiment | T_Small molecules | T_Antibody | T_siRNA | CP_Competitiveness_Small_Molecules | CP_Competitiveness_Antibody_or_siRNA | CP_Unmet Needs | DO_experimental_model_availability | DO_biomarkers | DO_Safety | NAN | DiseaseSpecific_ClinicLabel | Disease | T_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| npc1l1 | 1.0 | 0.5 | 1.0 | 0.5 | 1.0 | -0.5 | 0.5 | -1.0 | 1.0 | 0.5 | 1.0 | 0.0 | 0.5 |  | 4.0 | atherosclerosis | 1.0 |
| p2ry12 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | -1.0 | 0.5 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 |  | 4.0 | atherosclerosis | 1.0 |
| adrb1 | 0.0 | -0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | -0.5 |  | 4.0 | atherosclerosis | 0.5 |
| f10 | 0.5 | -0.5 | 1.0 | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 | 0.5 | -0.5 |  | 4.0 | atherosclerosis | 1.0 |
| pcsk9 | 1.0 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | -1.0 | 0.5 | 1.0 | 1.0 | 1.0 |  | 4.0 | atherosclerosis | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| eva1a | 0.0 | -0.5 | -0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.0 | 1.0 | -1.0 | -0.5 | 0.0 |  |  | metabolic_dysfunction_associated_steatohepatit... | 0.5 |
| cobll1 | 0.5 | -0.5 | 0.0 | 0.5 | -0.5 | -1.0 | 0.5 | 1.0 | 0.0 | 0.0 | 1.0 | 0.5 | 0.0 |  |  | metabolic_dysfunction_associated_steatohepatit... | 0.5 |
| sugp1 | 0.0 | -0.5 | 0.5 | 0.5 | 0.5 | -1.0 | 0.5 | 0.5 | 1.0 | 0.5 | 1.0 | 0.5 | 0.0 |  |  | metabolic_dysfunction_associated_steatohepatit... | 0.5 |
| macf1 | 0.0 | -0.5 | 0.5 | 0.5 | 0.5 | -0.5 | 0.5 | 1.0 | 1.0 | 0.0 | 0.5 | 0.5 | -0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 0.5 |
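The body of `get_T_max` is not shown here. Judging only from the displayed rows, `T_max` appears to equal the row-wise maximum of the three tractability columns (`T_Small molecules`, `T_Antibody`, `T_siRNA`). The sketch below implements that assumed rule on a few rows copied from the output; `add_t_max` is a hypothetical helper name, not a function from the repository.

```python
import pandas as pd

# Column names copied from the output above. The max-over-modalities rule is an
# assumption inferred from the displayed values, not the actual get_T_max code.
TRACTABILITY_COLUMNS = ["T_Small molecules", "T_Antibody", "T_siRNA"]


def add_t_max(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with T_max set to the row-wise max of the T_ columns."""
    out = df.copy()
    out["T_max"] = out[TRACTABILITY_COLUMNS].max(axis=1)
    return out


# Three rows taken from the displayed output (npc1l1, adrb1, cobll1).
example = pd.DataFrame(
    {
        "T_Small molecules": [1.0, 0.5, -0.5],
        "T_Antibody": [-0.5, 0.5, -1.0],
        "T_siRNA": [0.5, 0.5, 0.5],
    },
    index=["npc1l1", "adrb1", "cobll1"],
)

t_max_example = add_t_max(example)
print(t_max_example["T_max"].tolist())  # [1.0, 0.5, 0.5]
```

These values agree with the `T_max` column shown for the same three genes, which is consistent with the assumed rule but does not prove it.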


## 2. **Training and Evaluation on 5 Indications**

##### Model Type Selection: logistic_regression, hist_gradient_boosting, random_forest

In [8]:
import numpy as np  # explicit import instead of relying on the wildcard import from utils

n_iterations = 5

diseases = ['ibd', 'atherosclerosis', 'type2_diabetes', 'rheumatoid_arthritis', 'non_small_cell_lung_cancer']

# Model selection: 'logistic_regression', 'hist_gradient_boosting', or 'random_forest'
model_type = 'random_forest'

# Dictionary to store results for each disease
results = {disease: [] for disease in diseases}
# Ranked positive samples per disease; overwritten on every iteration, so only the last run is kept
positive_samples_ranking_dict = {}

for _ in range(n_iterations):
    for test_disease in diseases:
        # Hold out the test disease
        test_data = all_data[all_data['Disease'] == test_disease]

        # Combine the other diseases' datasets for training
        train_data = all_data[all_data['Disease'] != test_disease]

        # Train and evaluate; set show_rank=True to display the ranked predictions
        (accuracy, _, recall, f1, roc_auc, positive_samples_ranked,
         all_samples_ranked, avg_ranking, feature_importance,
         recall_at_10, precision_at_10) = train_and_evaluate(
            train_data, test_data, test_disease, model_type=model_type,
            show_rank=False, save_ranked_results=False)

        positive_samples_ranking_dict[test_disease] = positive_samples_ranked
        
        # Store results
        results[test_disease].append({
            'roc_auc': roc_auc,
            'avg_ranking': avg_ranking
        })

# Calculate and print average results with standard deviation
for disease in diseases:
    print(f"\nAverage results for {disease} over {n_iterations} iterations:")
    metrics = ['roc_auc', 'avg_ranking']
    for metric in metrics:
        values = [result[metric] for result in results[disease]]
        mean = np.mean(values)
        std = np.std(values)
        print(f"{metric.capitalize()}: {mean:.4f} ± {std:.4f}")


Average results for ibd over 5 iterations:
Roc_auc: 0.9154 ± 0.0044
Avg_ranking: 17.0800 ± 0.4045

Average results for atherosclerosis over 5 iterations:
Roc_auc: 0.8518 ± 0.0056
Avg_ranking: 20.5800 ± 0.2015

Average results for type2_diabetes over 5 iterations:
Roc_auc: 0.9453 ± 0.0068
Avg_ranking: 14.2600 ± 0.3121

Average results for rheumatoid_arthritis over 5 iterations:
Roc_auc: 0.9571 ± 0.0034
Avg_ranking: 13.6500 ± 0.2608

Average results for non_small_cell_lung_cancer over 5 iterations:
Roc_auc: 0.9716 ± 0.0018
Avg_ranking: 12.3900 ± 0.0735
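The loop above is a leave-one-disease-out evaluation repeated several times. Since the internals of `train_and_evaluate` are not shown, the following is only a minimal sketch of that scheme on synthetic data, using a scikit-learn random forest and ROC-AUC. The toy features, disease names, and hyperparameters are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for all_data: two features, a binary label, a disease tag.
frames = []
for disease in ["disease_a", "disease_b", "disease_c"]:
    X = rng.normal(size=(60, 2))
    y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0.8).astype(int)
    frames.append(
        pd.DataFrame({"f1": X[:, 0], "f2": X[:, 1], "label": y, "Disease": disease})
    )
toy = pd.concat(frames, ignore_index=True)

feature_columns = ["f1", "f2"]
auc_by_disease = {}

for held_out in toy["Disease"].unique():
    # Train on every disease except the held-out one, test on the held-out one.
    train = toy[toy["Disease"] != held_out]
    test = toy[toy["Disease"] == held_out]

    aucs = []
    for seed in range(3):  # repeat with different seeds, like n_iterations
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(train[feature_columns], train["label"])
        proba = model.predict_proba(test[feature_columns])[:, 1]
        aucs.append(roc_auc_score(test["label"], proba))

    # np.std defaults to the population standard deviation (ddof=0),
    # the same default the notebook cell relies on.
    auc_by_disease[held_out] = (float(np.mean(aucs)), float(np.std(aucs)))

for disease, (mean_auc, std_auc) in auc_by_disease.items():
    print(f"{disease}: ROC-AUC {mean_auc:.4f} ± {std_auc:.4f}")
```

Because the training data never contains the held-out disease, the spread across iterations reflects only the randomness of the model, not a change in the split.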


## 3. **Obtain Recall@20, Precision@20, and ROC-AUC for Each Disease**

In [9]:
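disease_ranking = {}
for disease in diseases:
    # Ranks of the positive samples for this disease (from the last iteration of the loop above)
    disease_ranking[disease] = [int(i) for i in positive_samples_ranking_dict[disease]['rank'].tolist()]

    # Only Recall@20 is printed here; the other two return values are discarded
    recall_20, _, _ = calculate_recall_precision(disease_ranking[disease])
    print(f"Disease: {disease}")
    print(f"Recall@20: {recall_20:.4f}")
    print("\n" + "="*50)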

Disease: ibd
Recall@20: 0.6000

Disease: atherosclerosis
Recall@20: 0.6000

Disease: type2_diabetes
Recall@20: 0.8000

Disease: rheumatoid_arthritis
Recall@20: 0.7000

Disease: non_small_cell_lung_cancer
Recall@20: 0.8000
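The implementation of `calculate_recall_precision` is not shown. The standard definitions it presumably follows are Recall@k = (positives ranked in the top k) / (all positives) and Precision@k = (positives ranked in the top k) / k. The sketch below computes both from a list of positive-sample ranks; `recall_precision_at_k` is a hypothetical helper, and the ranks are invented for illustration.

```python
def recall_precision_at_k(positive_ranks, k=20):
    """Compute (Recall@k, Precision@k) from the 1-based ranks of the positive samples.

    Assumes at least k candidates were ranked, so the top-k list has exactly k entries.
    """
    hits = sum(1 for rank in positive_ranks if rank <= k)
    recall = hits / len(positive_ranks) if positive_ranks else 0.0
    precision = hits / k
    return recall, precision


# Made-up ranks for ten positive samples; six of them fall in the top 20.
ranks = [1, 3, 8, 15, 19, 20, 27, 41, 60, 95]
recall_20, precision_20 = recall_precision_at_k(ranks, k=20)
print(f"Recall@20: {recall_20:.4f}, Precision@20: {precision_20:.4f}")
# Recall@20: 0.6000, Precision@20: 0.3000
```

With few positives per disease, Precision@20 is bounded above by (number of positives) / 20, so it is best read alongside Recall@20 rather than on its own.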



## 4. **Infer Ranking Results for a Disease Without Labels**

Train a model on all labeled diseases and use it to rank the targets of a disease without labels.

Here, the five indications serve as the labeled diseases, and metabolic dysfunction-associated steatohepatitis (MASH, stored as `metabolic_dysfunction_associated_steatohepatitis_mash`) is the disease without labels.
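The general pattern is to fit preprocessing and the classifier on the labeled rows only, then reuse the fitted objects on the unlabeled rows. A minimal sketch with a scikit-learn pipeline follows, assuming median imputation, standard scaling, and a random forest. The repository's own training helper may differ in each of these choices, and the feature names, gene names, and data below are invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_columns = ["score_a", "score_b"]  # hypothetical feature names

# Toy labeled data on the same -1..1 score grid as the tables in this notebook.
labeled = pd.DataFrame(
    rng.choice([-1.0, -0.5, 0.0, 0.5, 1.0], size=(80, 2)), columns=feature_columns
)
labeled["label"] = (labeled["score_a"] + labeled["score_b"] >= 1.0).astype(int)
labeled.loc[0, "score_b"] = np.nan  # one missing score for the imputer to fill

# Toy unlabeled data: candidates for a disease with no known positives.
unlabeled = pd.DataFrame(
    {"score_a": [1.0, 0.5, -1.0, 0.0], "score_b": [1.0, np.nan, -0.5, 0.5]},
    index=["gene_1", "gene_2", "gene_3", "gene_4"],
)

# Imputer and scaler are fitted on the labeled rows only and reused at prediction.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipeline.fit(labeled[feature_columns], labeled["label"])

ranked = unlabeled.copy()
ranked["predicted_proba"] = pipeline.predict_proba(unlabeled[feature_columns])[:, 1]
ranked = ranked.sort_values("predicted_proba", ascending=False)
print(ranked)
```

Scaling has no effect on a random forest; it is included so the same pipeline also works if the classifier is swapped for logistic regression.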



In [10]:
# List of training diseases
diseases = ['ibd', 'type2_diabetes', 'rheumatoid_arthritis', 'non_small_cell_lung_cancer', 'atherosclerosis']

# Get MASH data
mash_data = all_data[all_data['Disease'] == 'metabolic_dysfunction_associated_steatohepatitis_mash']

# Train the model on the five labeled diseases
train_data = all_data[all_data['Disease'].isin(diseases)]
model, scaler, imputer, feature_columns = train_model_only(train_data, model='random_forest')

# Get predictions for MASH
ranked_mash = predict_new_disease(model, scaler, imputer, feature_columns, mash_data, save_samples=True)

# Display ranked results
print("\nRanked predictions for MASH:")
print(ranked_mash[['predicted_proba', 'rank']].round(4))


Ranked predictions for MASH:
            predicted_proba  rank
mtarc1                 0.89   1.0
pparg                  0.88   2.0
tm6sf2                 0.75   3.0
cideb                  0.52   4.0
mttp                   0.49   5.0
gpam                   0.46   6.0
pnpla3                 0.41   7.0
serpina1               0.30   8.0
hfe                    0.28   9.0
rtel1                  0.20  10.0
abcc2                  0.16  11.0
dcaf12                 0.15  12.0
ptpn23                 0.15  12.0
cd19                   0.14  14.0
pam                    0.08  15.0
frat2                  0.08  15.0
gckr                   0.07  17.0
npc1                   0.06  18.0
sugp1                  0.06  18.0
dnaja3                 0.06  18.0
ccdc92                 0.06  18.0
ac022431.2             0.05  22.0
macf1                  0.05  22.0
lpl                    0.05  22.0
vasn                   0.04  25.0
dnah10                 0.03  26.0
map3k1                 0.03  26.0
cobll1            
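Tied probabilities in the output share a rank and the next rank is skipped (for example 12, 12, 14). That pattern matches pandas' `rank(method="min", ascending=False)`, although whether `predict_new_disease` uses exactly this call is an assumption. A small sketch using four probabilities copied from the output above:

```python
import pandas as pd

# Four consecutive entries from the printed ranking (ranks 11, 12, 12, 14 there).
proba = pd.Series(
    {"abcc2": 0.16, "dcaf12": 0.15, "ptpn23": 0.15, "cd19": 0.14},
    name="predicted_proba",
)

# method="min": tied values all receive the smallest rank of their group,
# and the following rank is skipped. Ranks are returned as floats.
ranks = proba.rank(method="min", ascending=False)
print(ranks.tolist())  # [1.0, 2.0, 2.0, 4.0]
```

Within this four-gene subset the ranks are 1, 2, 2, 4, the same tie structure as 11, 12, 12, 14 in the full list.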

In [11]:
ranked_mash

| Unnamed: 0 | CI_Genetic Association | CI_Differential expression | CI_Mechanism of Action | CI_In vitro_in vivo experiment | T_Small molecules | T_Antibody | T_siRNA | CP_Competitiveness_Small_Molecules | CP_Competitiveness_Antibody_or_siRNA | CP_Unmet Needs | DO_experimental_model_availability | DO_biomarkers | DO_Safety | NAN | DiseaseSpecific_ClinicLabel | Disease | T_max | predicted_proba | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mtarc1 | 0.5 | 0.5 | 1.0 | 1.0 | 0.0 | -1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.89 | 1.0 |
| pparg | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | -0.5 | 0.0 | -1.0 | 0.5 | 0.5 | 1.0 | -0.5 | 0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.88 | 2.0 |
| tm6sf2 | 1.0 | 1.0 | 1.0 | 0.5 | -1.0 | 0.5 | 1.0 | 0.5 | 0.0 | 0.5 | 1.0 | 0.5 | -0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.75 | 3.0 |
| cideb | 1.0 | 1.0 | 0.0 | 0.5 | 0.0 | -1.0 | 1.0 | 0.5 | 0.5 | 1.0 | 1.0 | 0.5 | 0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.52 | 4.0 |
| mttp | 0.5 | 0.5 | 1.0 | 0.5 | 1.0 | -1.0 | 1.0 | 0.0 | 0.5 | 0.5 | 1.0 | 0.0 | -0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.49 | 5.0 |
| gpam | 1.0 | -0.5 | 1.0 | 0.5 | 0.5 | -1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.46 | 6.0 |
| pnpla3 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 | 0.0 | 1.0 | 0.5 | 0.0 | 0.5 | 1.0 | 0.0 | 0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.41 | 7.0 |
| serpina1 | 1.0 | 0.5 | 1.0 | 0.5 | -0.5 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 1.0 | 0.0 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.3 | 8.0 |
| hfe | 0.5 | -0.5 | 0.5 | 0.5 | 0.0 | 0.5 | 1.0 | 0.5 | 0.5 | -0.5 | 1.0 | 0.5 | -0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 1.0 | 0.28 | 9.0 |
| rtel1 | 0.0 | -0.5 | 0.5 | 0.5 | 0.0 | -1.0 | 0.5 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | -0.5 |  |  | metabolic_dysfunction_associated_steatohepatit... | 0.5 | 0.2 | 10.0 |
