We propose the following confidence metrics for use in such a Disease--Gene score:

nDrug : Drug count for disease-target association.
nStudy_Weighted : Study count weighted by newness of study (newer better, completed better).
nPub_Weighted : Study publications count, weighted by type (results type better).
nDiseaseMention : Disease mention count for disease-target association.
nDrugMention : Drug mention count for disease-target association.
nAssay_Weighted : Assay count for drug-target association, weighted by pChembl.

In [1]:
import pandas as pd

# Read the tab-separated file into a DataFrame
df = pd.read_csv('sddt_links.tsv', sep='\t')

# Convert the 'drug_name' and 'disease_term' columns to lowercase
df['drug_name'] = df['drug_name'].str.lower()
df['disease_term'] = df['disease_term'].str.lower()

How many drugs, diseases, and genes do we have in this data-set?

In [2]:
# Count unique drugs, diseases, and genes
unique_drugs = df['drug_name'].nunique()
unique_diseases = df['disease_term'].nunique()
unique_genes = df['gene_symbol'].nunique()

print(f"Number of unique drugs: {unique_drugs}")
print(f"Number of unique diseases: {unique_diseases}")
print(f"Number of unique genes: {unique_genes}")

Number of unique drugs: 224
Number of unique diseases: 757
Number of unique genes: 790


Count how many different drugs have been developed or studied for each disease and gene combination.

In [3]:
# Group by disease_term and gene_symbol, then count unique drug_name
df2 = df.groupby(['disease_term', 'gene_symbol'])['drug_name'].nunique().reset_index()
# Rename the count column for clarity
df2 = df2.rename(columns={'drug_name': 'unique_drugs_count'})
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count
0,abdominal aortic aneurysms,ABCB11,1
1,abdominal aortic aneurysms,ABCC2,1
2,abdominal aortic aneurysms,ABCC3,1
3,abdominal aortic aneurysms,ABCC4,1
4,abdominal aortic aneurysms,ABL1,1
...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1
92415,wolff-parkinson-white syndrome,UGT1A3,1
92416,wolff-parkinson-white syndrome,USP1,1
92417,wolff-parkinson-white syndrome,USP2,1


Create a new column 'disease-target' by joining the values of the 'disease_term' and 'gene_symbol' columns with a hyphen.

In [4]:
df2['disease-target'] = df2['disease_term'] + '-' + df2['gene_symbol']
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1
...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2


Create a new column 'nStudy' in df2 and populate it with the number of studies (the 'nct_id' column in df) related to each disease-target association.

In [5]:
# The 'disease-target' key must also exist in df before studies can be counted per association
df['disease-target'] = df['disease_term'] + '-' + df['gene_symbol']

In [6]:
# Count how many rows of df belong to each disease-target association
association_counts = df['disease-target'].value_counts()

# Get the number of unique associations
unique_associations = len(association_counts)

# Get the association with the highest frequency
highest_frequency_association = association_counts.idxmax()
highest_frequency = association_counts.max()

print(f"Number of unique associations: {unique_associations}")
print(f"Association with the highest frequency: {highest_frequency_association} (Frequency: {highest_frequency})")

Number of unique associations: 92419
Association with the highest frequency: hypertension-CYP2D6 (Frequency: 440)


In [7]:
# Count how many rows of df belong to each disease-target association
association_counts = df['disease-target'].value_counts()

# Get the total number of associations
total_associations = len(df)

# Get the number of unique associations
unique_associations = len(association_counts)

# Get the association with the highest frequency
highest_frequency_association = association_counts.idxmax()
highest_frequency = association_counts.max()

print(f"Total number of associations: {total_associations}")
print(f"Number of unique associations: {unique_associations}")
print(f"Association with the highest frequency: {highest_frequency_association} (Frequency: {highest_frequency})")

Total number of associations: 422552
Number of unique associations: 92419
Association with the highest frequency: hypertension-CYP2D6 (Frequency: 440)


Populate 'nStudy' with the number of unique 'nct_id' values per disease-target association. Missing or non-finite values are replaced with 0 before the column is converted to integers.

In [8]:
df2['nStudy'] = df.groupby('disease-target')['nct_id'].transform('nunique').fillna(0).astype(int)
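
A note on alignment: transform('nunique') returns one value per row of df, and assigning that Series to df2 matches values by row index rather than by association, so the stored counts may not belong to the association on that row. A minimal sketch of a key-based assignment follows, using the same map pattern the notebook applies to 'nPub'. The toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy link table (hypothetical values): one row per study/drug/association link
links = pd.DataFrame({
    'disease-target': ['d1-G1', 'd1-G1', 'd1-G1', 'd2-G2'],
    'nct_id': ['NCT1', 'NCT1', 'NCT2', 'NCT3'],
})

# Toy summary table: one row per association
summary = pd.DataFrame({'disease-target': ['d1-G1', 'd2-G2']})

# Count distinct studies per association, then look each key up in that result
study_counts = links.groupby('disease-target')['nct_id'].nunique()
summary['nStudy'] = summary['disease-target'].map(study_counts).fillna(0).astype(int)
print(summary)
```

With this form the count on each row depends only on that row's 'disease-target' value, not on the row order of either frame.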

Which drugs are associated with "hypertension-CYP2D6"?

In [9]:
# Association to look up in df
association = "hypertension-CYP2D6"

# Filter the DataFrame to get drugs associated with the specified association
associated_drugs = df[df['disease-target'] == association]['drug_name'].unique()

# Print the associated drugs
print(f"Drugs associated with {association}:")
for drug in associated_drugs:
    print(drug)

Drugs associated with hypertension-CYP2D6:
candesartan
hydrochlorothiazide
chlorthalidone
azilsartan medoxomil
angiotensin ii
amiloride
amiloride hydrochloride
valsartan
chlortalidone
potassium chloride
progesterone
carbidopa
levodopa
simvastatin
acetyl-l-carnitine
ezetimibe
enalapril
benazepril
fluvastatin
pitavastatin
sacubitril
nebivolol
telmisartan
ramipril
bisoprolol
amlodipine
indapamide
diclofenac sodium
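
The list contains spelling and salt variants of what appear to be the same compound (for example "chlorthalidone" and "chlortalidone"), and counting by name treats them as separate drugs. A minimal sketch of counting by compound identifier instead, assuming 'molecule_chembl_id' identifies the drug compound on each row. The toy rows and placeholder IDs below are hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical) where two spellings share one placeholder molecule ID
toy = pd.DataFrame({
    'disease-target': ['hypertension-CYP2D6'] * 3,
    'drug_name': ['chlorthalidone', 'chlortalidone', 'valsartan'],
    'molecule_chembl_id': ['CHEMBL_A', 'CHEMBL_A', 'CHEMBL_B'],  # placeholder IDs
})

# Distinct names versus distinct compound identifiers per association
by_name = toy.groupby('disease-target')['drug_name'].nunique()
by_molecule = toy.groupby('disease-target')['molecule_chembl_id'].nunique()

print(by_name['hypertension-CYP2D6'], by_molecule['hypertension-CYP2D6'])
```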


Create a new column 'nStudy_Weighted' from 'nStudy' using the formula

$$nStudy\_weighted = \sum_{i=1}^{N_{study}} 2^{-a_i/k}$$

where:
N_study: the total number of studies related to the disease-target association.
a_i: age in years of the i-th study.
k: half-life age (typically 5 years).

In the code below, study ages are not read from the data. Instead, the variable 'a' inside the lambda function is the study index within the summation, and it stands in for the age a_i:

    The .apply() method is called on the 'n_study' Series, which holds the total number of studies for each row of the DataFrame.

    For each row, the lambda function receives the 'nStudy' value x.

    A list comprehension evaluates 2**(-a/half_life_age) for a ranging from 0 to x - 1, giving the contribution of each study to 'nStudy_Weighted'.

    The sum() function adds these contributions to give the 'nStudy_Weighted' value for that row.

Because 'a' runs over 0, 1, ..., x - 1, the result depends only on the study count and levels off near 1 / (1 - 2^(-1/5)) ≈ 7.725 for large counts.

In [10]:
# Define the constants
half_life_age = 5  # Half-life age in years
n_study = df2['nStudy']  # Assuming 'nStudy' is the column with the total number of studies
# Calculate nStudy_Weighted using the formula
df2['nStudy_Weighted'] = n_study.apply(lambda x: sum([2**(-a/half_life_age) for a in range(x)]))
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768
...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382
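
For comparison, here is a minimal sketch of the same half-life weighting applied to actual study ages rather than to the study index. The 'start_year' column and the reference year are assumptions made for illustration; no such column is shown in the link table above.

```python
import pandas as pd

def half_life_weight(ages_in_years, half_life=5.0):
    """Sum of 2**(-age / half_life) over the given study ages."""
    return sum(2 ** (-age / half_life) for age in ages_in_years)

# Toy studies (hypothetical): 'start_year' is an assumed column
studies = pd.DataFrame({
    'disease-target': ['d1-G1', 'd1-G1', 'd2-G2'],
    'start_year': [2023, 2018, 2013],
})
reference_year = 2023  # assumed "now" for the example
studies['age'] = reference_year - studies['start_year']

# One weighted count per association
weighted = studies.groupby('disease-target')['age'].apply(half_life_weight)
print(weighted)
```

A study of age 0 contributes 1, a study one half-life old contributes 0.5, and a study two half-lives old contributes 0.25.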


For every 'nct_id' in df, look up the study in the 'aact_study_refs.tsv' file and retrieve its 'reference_type' and 'pmid'.

In [11]:
# Load the 'aact_study_refs.tsv' file (replace with the actual file path)
aact_study_refs = pd.read_csv('aact_study_refs.tsv', sep='\t')

# Merge the two DataFrames on 'nct_id'
merged_df = df.merge(aact_study_refs[['nct_id', 'reference_type', 'pmid']], on='nct_id', how='left')
merged_df

Unnamed: 0,uniprot,CID,nct_id,doid,disease_term,itv_id,drug_name,target_chembl_id,molecule_chembl_id,gene_symbol,idgTDL,disease-target,reference_type,pmid
0,B2RXH2,1050,NCT00157716,DOID:6713,stroke,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,stroke-KDM4E,,
1,B2RXH2,1050,NCT00157716,DOID:3393,coronary artery disease,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,coronary artery disease-KDM4E,,
2,B2RXH2,1050,NCT00157716,DOID:326,ischemia,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,ischemia-KDM4E,,
3,B2RXH2,1050,NCT00157716,DOID:8805,unstable angina,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,unstable angina-KDM4E,,
4,B2RXH2,1050,NCT00157716,DOID:5844,myocardial infarction,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,myocardial infarction-KDM4E,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1769782,Q9Y6R4,151194,NCT00615160,DOID:1909,malignant melanoma,31378458,zk 222584,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,malignant melanoma-MAP3K4,,
1769783,Q9Y6R4,151194,NCT00655655,DOID:1909,melanoma,31442035,vatalanib,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,melanoma-MAP3K4,,
1769784,Q9Y6R4,151194,NCT00655655,DOID:263,kidney cancer,31442035,vatalanib,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,kidney cancer-MAP3K4,,
1769785,Q9Y6R4,151194,NCT00655655,DOID:10763,hypertension,31442035,vatalanib,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,hypertension-MAP3K4,,


In [12]:
# Filter rows where both 'reference_type' and 'pmid' are not null
filtered_df = merged_df.dropna(subset=['reference_type', 'pmid'])

# Count the number of rows in the filtered DataFrame
count = len(filtered_df)

# Print the count (merged rows, not distinct associations)
print(f'Number of merged rows with reference type and publication ID: {count}')

Number of merged rows with reference type and publication ID: 1537309
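
The left merge repeats each row of df once per matching reference, which is why merged_df has more rows than df. A minimal sketch of how the match status could be inspected with the indicator option of merge. The toy frames are hypothetical.

```python
import pandas as pd

# Toy link rows and reference rows (hypothetical values)
links = pd.DataFrame({'nct_id': ['NCT1', 'NCT2', 'NCT3']})
refs = pd.DataFrame({
    'nct_id': ['NCT1', 'NCT1'],
    'reference_type': ['background', 'result'],
    'pmid': [111, 222],
})

# indicator=True adds a '_merge' column: 'both' for matched rows, 'left_only' for unmatched
merged = links.merge(refs, on='nct_id', how='left', indicator=True)
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Here NCT1 appears twice in the result (once per reference), while NCT2 and NCT3 each appear once with missing reference fields.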


Create a new column 'nPub' in df2 and populate it with the count of unique PMIDs connected to each disease-target association from merged_df

In [13]:
# Count unique PMIDs per association in merged_df and look them up by 'disease-target'
df2['nPub'] = df2['disease-target'].map(merged_df.groupby('disease-target')['pmid'].nunique())

# If there are NaN values in the 'nPub' column, replace them with 0
df2['nPub'] = df2['nPub'].fillna(0).astype(int)
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4
...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0


Create a new column 'publication_type' in df2 listing the publication types (from the 'reference_type' column) connected to each disease-target association in merged_df.


The code groups merged_df by 'disease-target', collects the unique 'reference_type' values for each group, joins them into a comma-separated string, and maps the result onto df2. Associations with no publication type receive an empty string.

In [14]:
# Join the unique reference types per association and look them up by 'disease-target'
df2['publication_type'] = df2['disease-target'].map(merged_df.groupby('disease-target')['reference_type'].unique().str.join(', '))

# If there are NaN values in the 'publication_type' column, replace them with an empty string
df2['publication_type'] = df2['publication_type'].fillna('')
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived"
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived"
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived"
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived"
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived"
...,...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0,
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0,
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0,
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0,


In [15]:
df2['publication_type'].unique()

array(['background, derived', '', 'background', 'result', 'derived',
       'background, result, derived', 'background, result',
       'result, derived', 'derived, result',
       'background, derived, result', 'derived, result, background',
       'result, background', 'derived, background',
       'result, derived, background', 'derived, background, result',
       'result, background, derived'], dtype=object)

In [16]:
df2['publication_type'].nunique()

16
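
Several of the 16 labels are permutations of the same set (for example 'result, derived' and 'derived, result'), because unique() keeps values in first-seen order. Missing values inside a group may also cause the joined string to come out as NaN. A minimal sketch of building an order-independent label from the non-null types only. The toy data and the helper name canonical_types are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy merged rows (hypothetical values), including one missing reference type
merged = pd.DataFrame({
    'disease-target': ['d1-G1', 'd1-G1', 'd2-G2', 'd2-G2', 'd3-G3'],
    'reference_type': ['result', 'derived', 'derived', 'result', np.nan],
})

def canonical_types(values):
    """Join the distinct non-null types in sorted order."""
    return ', '.join(sorted(set(values.dropna())))

type_labels = merged.groupby('disease-target')['reference_type'].apply(canonical_types)
print(type_labels)
```

With sorted labels, at most 8 distinct values are possible for three reference types (the empty string plus 7 non-empty combinations).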

The code below defines a mapping from publication type to a numeric value, and a function calculate_t_sum that sums these values over the types listed in 'publication_type'. The function splits the string on commas, strips whitespace, and skips empty entries, so an empty 'publication_type' yields 0. Applying it to every row creates the 't_sum' column in df2.

In [17]:
# Define a mapping of publication types to their corresponding values
publication_type_values = {
    'result': 0,
    'background': 1,
    'derived': 2
}

# Create a function to calculate the sum based on the mapping
def calculate_t_sum(publication_types):
    # Split the publication types string and filter out empty strings
    # (the loop variable is named 't' to avoid shadowing the built-in 'type')
    types = [t.strip() for t in publication_types.split(',') if t.strip()]
    
    # Calculate the sum of values based on the publication types
    return sum(publication_type_values.get(t.lower(), 0) for t in types)

# Apply the function to the 'publication_type' column and create the 't_sum' column
df2['t_sum'] = df2['publication_type'].apply(calculate_t_sum)
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived",3
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived",3
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived",3
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived",3
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived",3
...,...,...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0,,0
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0,,0
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0,,0
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0,,0


In [18]:
# Find the rows with the maximum t_sum value
max_t_sum_row = df2[df2['t_sum'] == df2['t_sum'].max()]
max_t_sum_row

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived",3
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived",3
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived",3
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived",3
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived",3
...,...,...,...,...,...,...,...,...,...
91686,viral infections,THRB,1,viral infections-THRB,1,1.000000,26,"background, derived",3
91687,viral infections,TSHR,1,viral infections-TSHR,10,5.793768,26,"background, derived",3
91688,viral infections,UGT2B17,1,viral infections-UGT2B17,32,7.633548,26,"background, derived",3
91689,viral infections,VDR,1,viral infections-VDR,32,7.633548,26,"background, derived",3


Create a new column 'nPub_Weighted' in df2 using the formula

$$nPub\_weighted = \sum_{i=1}^{N_{study}} e^{2 t_i}$$

where t is replaced by the value in the 't_sum' column. The code evaluates a single term, e^(2 * t_sum), per association.

In [19]:
import numpy as np
# Calculate nPub_Weighted
df2['nPub_Weighted'] = np.exp(2 * df2['t_sum'])
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived",3,403.428793
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived",3,403.428793
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived",3,403.428793
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived",3,403.428793
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived",3,403.428793
...,...,...,...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0,,0,1.000000
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0,,0,1.000000
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0,,0,1.000000
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0,,0,1.000000
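
The formula is written as a sum over individual items, whereas the cell above evaluates one exponential per association. A minimal sketch of one possible reading, with one term e^(2 t_i) per distinct publication and t_i taken from the same type-to-value mapping. Treating each PMID as one term is an assumption, and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Same type-to-value mapping as in the notebook
publication_type_values = {'result': 0, 'background': 1, 'derived': 2}

# Toy reference rows (hypothetical values)
refs = pd.DataFrame({
    'disease-target': ['d1-G1', 'd1-G1', 'd2-G2'],
    'pmid': [111, 222, 333],
    'reference_type': ['result', 'background', 'derived'],
})

# Keep one row per (association, publication) with a known type
pubs = refs.dropna(subset=['pmid', 'reference_type']).drop_duplicates(['disease-target', 'pmid'])

# One term e^(2 * t_i) per publication, summed per association
pubs = pubs.assign(term=np.exp(2 * pubs['reference_type'].map(publication_type_values)))
n_pub_weighted = pubs.groupby('disease-target')['term'].sum()
print(n_pub_weighted)
```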


Create a new column 'nDiseaseMention' in DataFrame df2 by counting the disease mentions based on disease-target associations

In [20]:
# Calculate nDiseaseMention
df2['nDiseaseMention'] = df2.groupby('disease-target')['disease_term'].transform('count')
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived",3,403.428793,1
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived",3,403.428793,1
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived",3,403.428793,1
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived",3,403.428793,1
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived",3,403.428793,1
...,...,...,...,...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0,,0,1.000000,1
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0,,0,1.000000,1
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0,,0,1.000000,1
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0,,0,1.000000,1


In [21]:
# Find the rows with the maximum disease mention value
max_nDiseaseMention = df2[df2['nDiseaseMention'] == df2['nDiseaseMention'].max()]
max_nDiseaseMention

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived",3,403.428793,1
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived",3,403.428793,1
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived",3,403.428793,1
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived",3,403.428793,1
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived",3,403.428793,1
...,...,...,...,...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0,,0,1.000000,1
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0,,0,1.000000,1
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0,,0,1.000000,1
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0,,0,1.000000,1
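
Every row shows nDiseaseMention = 1 because each 'disease-target' value occurs exactly once in df2, so counting within df2 cannot vary. A minimal sketch of counting mentions in the link table instead, under the assumption that each row of df represents one mention of the association. The toy frames are hypothetical.

```python
import pandas as pd

# Toy link table (hypothetical): one row per mention of an association
links = pd.DataFrame({'disease-target': ['d1-G1', 'd1-G1', 'd1-G1', 'd2-G2']})

# Toy summary table: one row per association
summary = pd.DataFrame({'disease-target': ['d1-G1', 'd2-G2']})

# Count rows per association in the link table, then look each key up
mention_counts = links['disease-target'].value_counts()
summary['nDiseaseMention'] = summary['disease-target'].map(mention_counts).fillna(0).astype(int)
print(summary)
```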


In [22]:
# Find the rows with the maximum unique drug count
max_unique_drugs_count = df2[df2['unique_drugs_count'] == df2['unique_drugs_count'].max()]
max_unique_drugs_count

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention
44232,hypertension,CYP3A4,33,hypertension-CYP3A4,96,7.725011,680,,0,1.0,1
44646,hypertension,SLCO1B1,33,hypertension-SLCO1B1,10,5.793768,680,,0,1.0,1
44647,hypertension,SLCO1B3,33,hypertension-SLCO1B3,10,5.793768,680,,0,1.0,1


Create a new column 'nDrugMention' in df2 by copying the existing column 'unique_drugs_count'.

In [23]:
df2['nDrugMention'] = df2['unique_drugs_count'].copy()
df2

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention,nDrugMention
0,abdominal aortic aneurysms,ABCB11,1,abdominal aortic aneurysms-ABCB11,22,7.359119,4,"background, derived",3,403.428793,1,1
1,abdominal aortic aneurysms,ABCC2,1,abdominal aortic aneurysms-ABCC2,7,4.797787,4,"background, derived",3,403.428793,1,1
2,abdominal aortic aneurysms,ABCC3,1,abdominal aortic aneurysms-ABCC3,7,4.797787,4,"background, derived",3,403.428793,1,1
3,abdominal aortic aneurysms,ABCC4,1,abdominal aortic aneurysms-ABCC4,4,3.288163,4,"background, derived",3,403.428793,1,1
4,abdominal aortic aneurysms,ABL1,1,abdominal aortic aneurysms-ABL1,10,5.793768,4,"background, derived",3,403.428793,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
92414,wolff-parkinson-white syndrome,TRIM24,1,wolff-parkinson-white syndrome-TRIM24,3,2.628409,0,,0,1.000000,1,1
92415,wolff-parkinson-white syndrome,UGT1A3,1,wolff-parkinson-white syndrome-UGT1A3,3,2.628409,0,,0,1.000000,1,1
92416,wolff-parkinson-white syndrome,USP1,1,wolff-parkinson-white syndrome-USP1,3,2.628409,0,,0,1.000000,1,1
92417,wolff-parkinson-white syndrome,USP2,1,wolff-parkinson-white syndrome-USP2,61,7.723382,0,,0,1.000000,1,1


Read the "aact_drugs_chembl_activity.tsv" file, keeping only the columns 'pchembl_value', 'target_chembl_id', and 'molecule_chembl_id'.

In [24]:
# Define the columns you want to read
columns_to_read = ['pchembl_value', 'target_chembl_id', 'molecule_chembl_id']

# Read the tab-separated file with selected columns
pchembl_df = pd.read_csv("aact_drugs_chembl_activity.tsv", sep="\t", usecols=columns_to_read)
pchembl_df

Unnamed: 0,molecule_chembl_id,pchembl_value,target_chembl_id
0,CHEMBL100,,CHEMBL376
1,CHEMBL100,,CHEMBL376
2,CHEMBL100,6.79,CHEMBL376
3,CHEMBL100,,CHEMBL376
4,CHEMBL100,,CHEMBL376
...,...,...,...
32710,CHEMBL106939,,CHEMBL238
32711,CHEMBL106939,,CHEMBL222
32712,CHEMBL106939,8.95,CHEMBL228
32713,CHEMBL106939,,CHEMBL238
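
The activity table is loaded here but nAssay_Weighted is not computed in the cells shown. A minimal sketch of one way it might be derived, assuming that "weighted by pChembl" means summing 'pchembl_value' over the assays of each drug-target pair and that assays with a missing value contribute nothing. Both assumptions and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy activity rows (hypothetical values), with missing pChEMBL values as in the real file
activity = pd.DataFrame({
    'molecule_chembl_id': ['M1', 'M1', 'M1', 'M2'],
    'target_chembl_id': ['T1', 'T1', 'T1', 'T2'],
    'pchembl_value': [np.nan, 6.79, 8.0, np.nan],
})

grouped = activity.groupby(['molecule_chembl_id', 'target_chembl_id'])['pchembl_value']

# nAssay counts every assay row; nAssay_Weighted sums pChEMBL values (NaN is skipped)
assay_summary = pd.DataFrame({
    'nAssay': grouped.size(),
    'nAssay_Weighted': grouped.sum(),
}).reset_index()
print(assay_summary)
```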


Rank each disease-target association on four columns (nStudy_Weighted, nPub_Weighted, nDiseaseMention, and nDrugMention) using the pandas rank method.

In this code:

    A new column 'Rank' is created in df2 by applying rank to the four columns. With ascending=False, the largest value in each column receives rank 1, and tied values share the average of their ranks.

    The sum(axis=1) part adds the per-column ranks for each row, so a smaller total means a higher-ranked association.

    Finally, the DataFrame is sorted by 'Rank' in ascending order so that the highest-ranked disease-targets appear first.

In [25]:
df2['Rank'] = df2[['nStudy_Weighted', 'nPub_Weighted', 'nDiseaseMention', 'nDrugMention']].rank(ascending=False).sum(axis=1)
sorted_df = df2.sort_values(by='Rank')
sorted_df

Unnamed: 0,disease_term,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention,nDrugMention,Rank
61737,myocardial infarction,CA4,7,myocardial infarction-CA4,123,7.725024,101,"background, derived, result",3,403.428793,1,7,53128.5
26813,dementia,MMP1,7,dementia-MMP1,113,7.725023,26,"derived, background",3,403.428793,1,7,53343.5
26832,dementia,PDE5A,7,dementia-PDE5A,113,7.725023,26,"derived, background",3,403.428793,1,7,53343.5
26876,dementia,TACR2,7,dementia-TACR2,113,7.725023,26,"derived, background",3,403.428793,1,7,53343.5
26848,dementia,PTPRC,7,dementia-PTPRC,113,7.725023,26,"derived, background",3,403.428793,1,7,53343.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
75630,prostate adenocarcinoma,CYP2A6,1,prostate adenocarcinoma-CYP2A6,1,1.000000,0,,0,1.000000,1,1,253986.0
75629,prostate adenocarcinoma,CYP1A2,1,prostate adenocarcinoma-CYP1A2,1,1.000000,0,,0,1.000000,1,1,253986.0
75628,prostate adenocarcinoma,CXCR2,1,prostate adenocarcinoma-CXCR2,1,1.000000,0,,0,1.000000,1,1,253986.0
75633,prostate adenocarcinoma,CYP2D6,1,prostate adenocarcinoma-CYP2D6,1,1.000000,0,,0,1.000000,1,1,253986.0
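
A small worked example of the rank-sum step on toy data, showing how ties produce half-integer totals such as those in the table above. The metric names and values are hypothetical.

```python
import pandas as pd

# Toy metrics (hypothetical values); metric_a has a tie between assoc_2 and assoc_3
toy = pd.DataFrame(
    {'metric_a': [10.0, 5.0, 5.0], 'metric_b': [1.0, 3.0, 2.0]},
    index=['assoc_1', 'assoc_2', 'assoc_3'],
)

# Descending ranks per column: the largest value gets rank 1, ties get the average rank
ranks = toy.rank(ascending=False)

# Sum the per-column ranks; the smallest total is the best-ranked row
toy['Rank'] = ranks.sum(axis=1)
print(toy.sort_values(by='Rank'))
```

The two tied values in metric_a each receive rank 2.5, so assoc_2 totals 3.5 and ranks first.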


In [28]:
import numpy as np
from scipy.stats import rankdata

# Rank each metric column in ascending order with scipy's rankdata
# (larger value -> larger rank; tied values get the average rank)
df2['rank_nStudy_Weighted'] = rankdata(df2['nStudy_Weighted'])
df2['rank_nPub_Weighted'] = rankdata(df2['nPub_Weighted'])
df2['rank_nDiseaseMention'] = rankdata(df2['nDiseaseMention'])
df2['rank_nDrugMention'] = rankdata(df2['nDrugMention'])

# Calculate the mean rank across the four metrics
df2['mean_rank'] = np.mean(df2[['rank_nStudy_Weighted', 'rank_nPub_Weighted', 'rank_nDiseaseMention', 'rank_nDrugMention']], axis=1)

# Scale the mean rank by 100 / N, where N is the number of metrics (not the number of rows),
# so the result is not bounded to the 0-100 range
N = 4  # Number of variables considered
percentile = (1 / N) * df2['mean_rank'] * 100

# Calculate the mean rank score
df2['mean_rank_score'] = 100 - percentile

df2[['disease-target', 'mean_rank_score']]

Unnamed: 0,disease-target,mean_rank_score
0,abdominal aortic aneurysms-ABCB11,-1470978.125
1,abdominal aortic aneurysms-ABCC2,-1307137.500
2,abdominal aortic aneurysms-ABCC3,-1307137.500
3,abdominal aortic aneurysms-ABCC4,-1246228.125
4,abdominal aortic aneurysms-ABL1,-1350528.125
...,...,...
92414,wolff-parkinson-white syndrome-TRIM24,-857600.000
92415,wolff-parkinson-white syndrome-UGT1A3,-857600.000
92416,wolff-parkinson-white syndrome-USP1,-857600.000
92417,wolff-parkinson-white syndrome-USP2,-1215587.500


In [29]:
sorted_df = df2[['disease-target', 'mean_rank_score']].sort_values(by='mean_rank_score', ascending=False)
sorted_df

Unnamed: 0,disease-target,mean_rank_score
34683,esophageal varices-ADRB2,-722987.500
18911,chronic myelomonocytic leukemia-CSNK1G3,-722987.500
18901,chronic myelomonocytic leukemia-COQ8A,-722987.500
18902,chronic myelomonocytic leukemia-COQ8B,-722987.500
18903,chronic myelomonocytic leukemia-CSF1R,-722987.500
...,...,...
26832,dementia-PDE5A,-1977003.125
26811,dementia-MC4R,-1977003.125
26808,dementia-MAPK3,-1977003.125
26848,dementia-PTPRC,-1977003.125
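
The scores above are large negative numbers because the mean rank is divided by the number of metrics (4) rather than by the number of associations, and because ascending ranks subtracted from 100 put the associations with the largest metric values last. A minimal sketch of a bounded variant using percentile ranks, in which a higher score means stronger evidence. This is one possible normalization and not necessarily the intended one. The toy values are hypothetical.

```python
import pandas as pd

# Toy metrics (hypothetical values); larger is better for both columns
toy = pd.DataFrame(
    {'metric_a': [10.0, 5.0, 1.0, 1.0], 'metric_b': [4.0, 3.0, 2.0, 1.0]},
    index=['assoc_1', 'assoc_2', 'assoc_3', 'assoc_4'],
)

# pct=True divides each ascending rank by the number of rows, giving values in (0, 1]
pct_ranks = toy.rank(pct=True)

# Average the percentile ranks across metrics and scale to 0-100
toy['mean_rank_score'] = 100 * pct_ranks.mean(axis=1)
print(toy.sort_values(by='mean_rank_score', ascending=False))
```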
