We propose the following confidence metrics for use in such a Disease-Gene score:

nDrug : Drug count for disease-target association.
nStudy_Weighted : Study count weighted by newness of study (newer better, completed better).
nPub_Weighted : Study publications count, weighted by type (results type better).
nDiseaseMention : Disease mention count for disease-target association.
nDrugMention : Drug mention count for disease-target association.
nAssay_Weighted : Assay count for drug-target association, weighted by pChembl.

NEW assignment

Some further requests, which I hope require only modest effort:

    Generate an output file with all disease-to-gene associations with mean rank and mean rank scores. The file should include a separate column for disease and gene IDs. There should also be columns for all the measures used to compute the mean rank and mean rank score.
    The diseases should be referenced by IDs, not just disease name. Tagger output includes a code, which can be dereferenced to Disease Ontology ID (DOID). See jensenlab_diseases_entities.tsv.
    For genes, the gene symbol is satisfactory. We may consider including other gene IDs in future.

The comprehensive output file will be useful to share for review by others and define our first version. Eventually, such a file or files can be integrated by IDG for display by Pharos. Such a file can also be included as supplementary material for a publication.

In [1]:
import pandas as pd

# Read the tab-separated file into a DataFrame
df = pd.read_csv('sddt_links.tsv', sep='\t')

# Convert the 'drug_name' and 'doid' columns to lowercase
df['drug_name'] = df['drug_name'].str.lower()
df['doid'] = df['doid'].str.lower()

How many drugs, diseases, and genes do we have in this data-set?

In [2]:
# Count unique drugs, diseases, and genes
unique_drugs = df['drug_name'].nunique()
unique_diseases = df['doid'].nunique()
unique_genes = df['gene_symbol'].nunique()

print(f"Number of unique drugs: {unique_drugs}")
print(f"Number of unique doid: {unique_diseases}")
print(f"Number of unique genes: {unique_genes}")

Number of unique drugs: 224
Number of unique doid: 585
Number of unique genes: 790


Count how many different drugs have been developed or studied for each disease-gene combination.

In [3]:
# Group by doid and gene_symbol, then count unique drug_name
df2 = df.groupby(['doid', 'gene_symbol'])['drug_name'].nunique().reset_index()
# Rename the count column for clarity
df2 = df2.rename(columns={'drug_name': 'unique_drugs_count'})
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count
0,doid:0001816,ABCB11,1
1,doid:0001816,ABCC2,1
2,doid:0001816,ABCC3,1
3,doid:0001816,ABCC4,1
4,doid:0001816,ACHE,1
...,...,...,...
75816,doid:9993,USP1,3
75817,doid:9993,USP2,1
75818,doid:9993,VCP,1
75819,doid:9993,VDR,1


Create a new column 'disease-target' by joining the values of the 'doid' and 'gene_symbol' columns with a hyphen.

In [4]:
df2['disease-target'] = df2['doid'] + '-' + df2['gene_symbol']
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4
4,doid:0001816,ACHE,1,doid:0001816-ACHE
...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1
75817,doid:9993,USP2,1,doid:9993-USP2
75818,doid:9993,VCP,1,doid:9993-VCP
75819,doid:9993,VDR,1,doid:9993-VDR


Create a new column 'nStudy' in df2 by counting the studies ('nct_id' column) in df related to each disease-target association.

In [5]:
# df needs the 'disease-target' key as well
df['disease-target'] = df['doid'] + '-' + df['gene_symbol']

In [6]:
# Count how often each disease-target association occurs in df
association_counts = df['disease-target'].value_counts()

# Get the number of unique associations
unique_associations = len(association_counts)

# Get the association with the highest frequency
highest_frequency_association = association_counts.idxmax()
highest_frequency = association_counts.max()

print(f"Number of unique associations: {unique_associations}")
print(f"Association with the highest frequency: {highest_frequency_association} (Frequency: {highest_frequency})")

Number of unique associations: 75821
Association with the highest frequency: doid:10763-CYP2D6 (Frequency: 476)


In [7]:
# Count how often each disease-target association occurs in df
association_counts = df['disease-target'].value_counts()

# Get the total number of associations
total_associations = len(df)

# Get the number of unique associations
unique_associations = len(association_counts)

# Get the association with the highest frequency
highest_frequency_association = association_counts.idxmax()
highest_frequency = association_counts.max()

print(f"Total number of associations: {total_associations}")
print(f"Number of unique associations: {unique_associations}")
print(f"Association with the highest frequency: {highest_frequency_association} (Frequency: {highest_frequency})")

Total number of associations: 422552
Number of unique associations: 75821
Association with the highest frequency: doid:10763-CYP2D6 (Frequency: 476)


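Count the distinct studies ('nct_id') for each association. Missing or non-finite values are replaced with 0 before the column is converted to integers.

In [8]:
# transform('nunique') returns a Series indexed like df (one value per link row), not like df2.
# Assigning it to df2 therefore aligns on row index rather than on 'disease-target'.
df2['nStudy'] = df.groupby('disease-target')['nct_id'].transform('nunique').fillna(0).astype(int)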
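
Because df holds one row per link and df2 one row per association, one way to attach a per-association count is to aggregate first and then look the result up by key. The following is a minimal sketch on made-up data, not the method used in this notebook; the names links and assoc are hypothetical stand-ins for df and df2.

```python
import pandas as pd

# Toy link table (hypothetical values): one row per study/association link.
links = pd.DataFrame({
    "disease-target": ["d1-G1", "d1-G1", "d1-G1", "d2-G2", "d2-G2"],
    "nct_id": ["NCT1", "NCT1", "NCT2", "NCT3", None],
})

# One row per association, as in df2 (d3-G3 has no links at all).
assoc = pd.DataFrame({"disease-target": ["d2-G2", "d1-G1", "d3-G3"]})

# Aggregate once per association, then look each association up by key.
studies_per_assoc = links.groupby("disease-target")["nct_id"].nunique()
assoc["nStudy"] = (
    assoc["disease-target"].map(studies_per_assoc).fillna(0).astype(int)
)
print(assoc)
```

The map step matches on the 'disease-target' value, so the result does not depend on the row order of either table.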

Find which drugs are associated with hypertension-CYP2D6 (doid:10763-CYP2D6).

In [9]:
# The association key to look up in df
association = "doid:10763-CYP2D6"

# Filter the DataFrame to get drugs associated with the specified association
associated_drugs = df[df['disease-target'] == association]['drug_name'].unique()

# Print the associated drugs
print(f"Drugs associated with {association}:")
for drug in associated_drugs:
    print(drug)

Drugs associated with doid:10763-CYP2D6:
candesartan
hydrochlorothiazide
chlorthalidone
azilsartan medoxomil
angiotensin ii
amiloride
amiloride hydrochloride
valsartan
chlortalidone
potassium chloride
progesterone
carbidopa
levodopa
fenofibrate
simvastatin
ezetimibe
acetyl-l-carnitine
enalapril
benazepril
fluvastatin
pitavastatin
sacubitril
nebivolol
telmisartan
ramipril
hctz
bisoprolol
amlodipine
indapamide
diclofenac sodium


Create a new column 'nStudy_Weighted' from 'nStudy' using the formula

$$nStudy\_weighted = \sum_{i=1}^{N_{study}} 2^{-a_i/k}$$

where:
N_study: the total number of studies related to the disease-target association.
a_i: Age in years of the i-th study.
k: Half-life age (typically 5 years)

The code below does not use study dates. It substitutes the study index for the age a_i, so 'a' in the lambda function runs from 0 to x - 1, where x is the 'nStudy' value for the row:

    The .apply() method is called on the 'nStudy' Series, which holds the study count for each row of df2.

    For each row, a list comprehension evaluates 2**(-a/half_life_age) for a = 0, 1, ..., x - 1. Each term is the contribution of one study.

    The sum() function adds these contributions to give the 'nStudy_Weighted' value for that row.

With this substitution, 'nStudy_Weighted' depends only on 'nStudy': the first study contributes 1 and each later study contributes less, regardless of its actual age.

In [10]:
# Define the constants
half_life_age = 5  # Half-life age in years (k in the formula)
n_study = df2['nStudy']  # Study count per association
# Calculate nStudy_Weighted; the study index a stands in for the study age
df2['nStudy_Weighted'] = n_study.apply(lambda x: sum(2**(-a/half_life_age) for a in range(x)))
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768
...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768
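
For comparison, the formula can be evaluated with an actual age per study when a date is available. The sketch below is only an illustration under stated assumptions: the start_year column, the REFERENCE_YEAR constant, and the toy rows are hypothetical and do not come from the files used in this notebook.

```python
import pandas as pd

HALF_LIFE_YEARS = 5      # k in the formula
REFERENCE_YEAR = 2024    # hypothetical "current" year used to compute ages

# Hypothetical study table: one row per (association, study) with a start year.
studies = pd.DataFrame({
    "disease-target": ["d1-G1", "d1-G1", "d1-G1", "d2-G2"],
    "nct_id": ["NCT1", "NCT2", "NCT3", "NCT4"],
    "start_year": [2024, 2019, 2014, 2009],
})

# a_i: age of each study in years; the weight halves every HALF_LIFE_YEARS.
studies["age"] = REFERENCE_YEAR - studies["start_year"]
studies["weight"] = 2.0 ** (-studies["age"] / HALF_LIFE_YEARS)

# Sum the per-study weights within each association.
n_study_weighted = studies.groupby("disease-target")["weight"].sum()
print(n_study_weighted)
```

Here the three studies of d1-G1 have ages 0, 5, and 10 years and contribute 1, 0.5, and 0.25, so a study loses half its weight every five years.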


For every 'nct_id' in df, look up the matching rows in 'aact_study_refs.tsv' and retrieve the 'reference_type' and 'pmid' columns.

In [11]:
# Load the 'aact_study_refs.tsv' file (replace with the actual file path)
aact_study_refs = pd.read_csv('aact_study_refs.tsv', sep='\t')

# Merge the two DataFrames on 'nct_id'
merged_df = df.merge(aact_study_refs[['nct_id', 'reference_type', 'pmid']], on='nct_id', how='left')
merged_df

Unnamed: 0,uniprot,CID,nct_id,doid,disease_term,itv_id,drug_name,target_chembl_id,molecule_chembl_id,gene_symbol,idgTDL,disease-target,reference_type,pmid
0,B2RXH2,1050,NCT00157716,doid:6713,stroke,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,doid:6713-KDM4E,,
1,B2RXH2,1050,NCT00157716,doid:3393,coronary artery disease,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,doid:3393-KDM4E,,
2,B2RXH2,1050,NCT00157716,doid:326,ischemia,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,doid:326-KDM4E,,
3,B2RXH2,1050,NCT00157716,doid:8805,unstable angina,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,doid:8805-KDM4E,,
4,B2RXH2,1050,NCT00157716,doid:5844,myocardial infarction,31291349,pyridoxal,CHEMBL1293226,CHEMBL102970,KDM4E,Tchem,doid:5844-KDM4E,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1769782,Q9Y6R4,151194,NCT00615160,doid:1909,malignant melanoma,31378458,zk 222584,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,doid:1909-MAP3K4,,
1769783,Q9Y6R4,151194,NCT00655655,doid:1909,melanoma,31442035,vatalanib,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,doid:1909-MAP3K4,,
1769784,Q9Y6R4,151194,NCT00655655,doid:263,kidney cancer,31442035,vatalanib,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,doid:263-MAP3K4,,
1769785,Q9Y6R4,151194,NCT00655655,doid:10763,hypertension,31442035,vatalanib,CHEMBL4853,CHEMBL101253,MAP3K4,Tbio,doid:10763-MAP3K4,,


In [12]:
# Filter rows where both 'reference_type' and 'pmid' are not null
filtered_df = merged_df.dropna(subset=['reference_type', 'pmid'])

# Count the number of rows in the filtered DataFrame
count = len(filtered_df)

# Print the count
print(f'Number of disease-target associations with reference type and publication ID: {count}')

Number of disease-target associations with reference type and publication ID: 1537309
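
The count above is a row count of the merged table, and a left merge repeats a link row once for every matching reference. A small sketch on hypothetical data shows the difference between counting merged rows and counting distinct associations; the table contents are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the link table and the reference table.
links = pd.DataFrame({
    "nct_id": ["NCT1", "NCT2", "NCT3"],
    "disease-target": ["d1-G1", "d1-G1", "d2-G2"],
})
refs = pd.DataFrame({
    "nct_id": ["NCT1", "NCT1", "NCT2"],
    "reference_type": ["result", "background", "derived"],
    "pmid": [111, 222, 333],
})

# NCT1 has two references, so its link row appears twice after the merge.
merged = links.merge(refs, on="nct_id", how="left")
with_refs = merged.dropna(subset=["reference_type", "pmid"])

n_rows = len(with_refs)                          # link-reference rows
n_assoc = with_refs["disease-target"].nunique()  # distinct associations
print(n_rows, n_assoc)
```

In this toy case three merged rows carry a reference, but they all belong to a single association.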


Create a new column 'nPub' in df2 and populate it with the count of unique PMIDs connected to each disease-target association from merged_df

In [13]:
# Map each association in df2 to the number of unique PMIDs found for it in merged_df
df2['nPub'] = df2['disease-target'].map(merged_df.groupby('disease-target')['pmid'].nunique())

# Replace NaN values in the 'nPub' column with 0
df2['nPub'] = df2['nPub'].fillna(0)
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2
...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0


Create a new column 'publication_type' in df2 by determining the type of publication (from the 'reference_type' column) connected to each disease-target association in merged_df.


This code uses the groupby method to group merged_df by the 'disease-target' column and then retrieves the unique publication types ('reference_type') associated with each disease-target. It then maps these unique publication types to the corresponding associations in df2. Finally, it fills any NaN values in the 'publication_type' column with an empty string for associations with no associated publication types.

In [14]:
# Join the unique reference types found for each association into one comma-separated string
df2['publication_type'] = df2['disease-target'].map(merged_df.groupby('disease-target')['reference_type'].unique().str.join(', '))

# Replace NaN values in the 'publication_type' column with an empty string
df2['publication_type'] = df2['publication_type'].fillna('')
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived
...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,


In [15]:
df2['publication_type'].unique()

array(['derived', '', 'background', 'derived, background', 'result',
       'background, derived', 'result, background, derived',
       'background, result, derived', 'background, result',
       'result, derived, background', 'result, derived',
       'derived, background, result', 'result, background',
       'derived, result', 'background, derived, result',
       'derived, result, background'], dtype=object)

In [16]:
df2['publication_type'].nunique()

16
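
Several of the 16 values above are the same set of types in a different order (for example 'derived, background' and 'background, derived'). One possible way to get an order-independent label is to drop missing values and sort the types before joining. This is a sketch on hypothetical rows, and canonical_types is an illustrative helper rather than part of the notebook.

```python
import pandas as pd

# Hypothetical merged rows; None marks a study with no linked reference.
merged = pd.DataFrame({
    "disease-target": ["d1-G1", "d1-G1", "d1-G1", "d2-G2", "d2-G2", "d3-G3"],
    "reference_type": ["derived", "background", None, "background", "derived", None],
})

def canonical_types(values: pd.Series) -> str:
    """Join the distinct non-missing types in alphabetical order."""
    return ", ".join(sorted(values.dropna().unique()))

pub_types = merged.groupby("disease-target")["reference_type"].agg(canonical_types)
print(pub_types)
```

With sorted labels, d1-G1 and d2-G2 receive the same string, and an association without any reference gets an empty string.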

This code defines a mapping of publication types to numeric values and a function calculate_t_sum that sums those values over the types listed in the 'publication_type' column. The function splits the string on commas, strips whitespace, and drops empty entries, so an empty 'publication_type' yields 0. It is then applied to create the 't_sum' column in df2. Note that 'result' maps to 0, so a result-type publication adds nothing to 't_sum'.

In [17]:
# Define a mapping of publication types to their corresponding values
publication_type_values = {
    'result': 0,
    'background': 1,
    'derived': 2
}

# Create a function to calculate the sum based on the mapping
def calculate_t_sum(publication_types):
    # Split the publication types string and filter out empty strings
    types = [pub_type.strip() for pub_type in publication_types.split(',') if pub_type.strip()]

    # Calculate the sum of values based on the publication types
    return sum(publication_type_values.get(pub_type.lower(), 0) for pub_type in types)

# Apply the function to the 'publication_type' column and create the 't_sum' column
df2['t_sum'] = df2['publication_type'].apply(calculate_t_sum)
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived,2
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived,2
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived,2
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived,2
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived,2
...,...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,,0
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,,0
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,,0
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,,0


In [18]:
# Find the rows with the maximum t_sum value
max_t_sum_row = df2[df2['t_sum'] == df2['t_sum'].max()]
max_t_sum_row

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum
400,doid:0014667,TST,1,doid:0014667-TST,22,7.359119,42,"derived, background",3
548,doid:0050117,APOBEC3G,1,doid:0050117-APOBEC3G,44,7.707692,21,"background, derived",3
552,doid:0050117,BAZ2B,1,doid:0050117-BAZ2B,2,1.870551,21,"background, derived",3
611,doid:0050117,ICMT,1,doid:0050117-ICMT,11,6.043768,21,"background, derived",3
612,doid:0050117,IDH1,1,doid:0050117-IDH1,11,6.043768,21,"background, derived",3
...,...,...,...,...,...,...,...,...,...
74535,doid:9743,USP1,1,doid:9743-USP1,6,4.362512,2,"background, derived",3
74536,doid:9743,USP2,1,doid:9743-USP2,6,4.362512,2,"background, derived",3
74537,doid:9743,VDR,3,doid:9743-VDR,32,7.633548,4,"background, derived",3
74538,doid:9743,VIPR1,2,doid:9743-VIPR1,32,7.633548,2,"background, derived",3


Create a new column 'nPub_Weighted' in df2 based on the formula

$$nPub\_weighted = \sum_{i=1}^{N_{study}} e^{2 t_i}$$

The code below evaluates a single term, e^(2 * t_sum), for each association, with t taken from the 't_sum' column, rather than summing over individual items.

In [19]:
import numpy as np
# Calculate nPub_Weighted
df2['nPub_Weighted'] = np.exp(2 * df2['t_sum'])
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived,2,54.59815
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived,2,54.59815
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived,2,54.59815
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived,2,54.59815
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived,2,54.59815
...,...,...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,,0,1.00000
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,,0,1.00000
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,,0,1.00000
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,,0,1.00000
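
The summation in the formula can also be read as one exponential term per publication. The sketch below follows that reading on hypothetical rows. Treating the index i as running over publications is an assumption, and the pubs table is invented; the type values are copied from the mapping defined earlier.

```python
import numpy as np
import pandas as pd

# Type values copied from the mapping used for t_sum.
TYPE_VALUE = {"result": 0, "background": 1, "derived": 2}

# Hypothetical publication rows: one row per (association, PMID).
pubs = pd.DataFrame({
    "disease-target": ["d1-G1", "d1-G1", "d2-G2"],
    "pmid": [111, 222, 333],
    "reference_type": ["result", "background", "derived"],
})

# t_i for each publication, then one exponential term per publication.
pubs["t"] = pubs["reference_type"].str.lower().map(TYPE_VALUE)
pubs["term"] = np.exp(2 * pubs["t"])

# Sum the terms within each association.
n_pub_weighted = pubs.groupby("disease-target")["term"].sum()
print(n_pub_weighted)
```

Under this reading an association with more publications accumulates more terms, whereas the single-term version depends only on which types are present.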


Create a new column 'nDiseaseMention' in df2 by counting rows per disease-target association. Because df2 holds one row per association, this count is 1 for every row.

In [20]:
# Calculate nDiseaseMention
df2['nDiseaseMention'] = df2.groupby('disease-target')['doid'].transform('count')
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived,2,54.59815,1
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived,2,54.59815,1
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived,2,54.59815,1
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived,2,54.59815,1
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived,2,54.59815,1
...,...,...,...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,,0,1.00000,1
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,,0,1.00000,1
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,,0,1.00000,1
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,,0,1.00000,1


In [21]:
# Find the row with the maximum disease mention value
max_nDiseaseMention = df2[df2['nDiseaseMention'] == df2['nDiseaseMention'].max()]
max_nDiseaseMention

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived,2,54.59815,1
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived,2,54.59815,1
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived,2,54.59815,1
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived,2,54.59815,1
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived,2,54.59815,1
...,...,...,...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,,0,1.00000,1
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,,0,1.00000,1
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,,0,1.00000,1
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,,0,1.00000,1
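
Every row shares the maximum because each association occurs once in df2. If the mention count is meant to come from the link table instead, it could be computed there and mapped back by key. This is a sketch under that assumption, on hypothetical data; which table actually holds the disease mentions is not stated in this notebook.

```python
import pandas as pd

# Hypothetical link table: each row is one disease mention linked to a target.
links = pd.DataFrame({
    "disease-target": ["d1-G1", "d1-G1", "d1-G1", "d2-G2"],
})

# One row per association, as in df2.
assoc = pd.DataFrame({
    "doid": ["d1", "d2"],
    "disease-target": ["d1-G1", "d2-G2"],
})

# Counting within assoc always gives 1, because each key appears once.
within = assoc.groupby("disease-target")["doid"].transform("count")

# Counting in the link table and mapping by key gives per-association totals.
mentions = links["disease-target"].value_counts()
assoc["nDiseaseMention"] = (
    assoc["disease-target"].map(mentions).fillna(0).astype(int)
)
print(within.tolist(), assoc["nDiseaseMention"].tolist())
```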


In [22]:
# Find the rows with the maximum unique drug count
max_unique_drugs_count = df2[df2['unique_drugs_count'] == df2['unique_drugs_count'].max()]
max_unique_drugs_count

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention
13888,doid:10763,CYP3A4,36,doid:10763-CYP3A4,26,7.514867,810,,0,1.0,1
14302,doid:10763,SLCO1B1,36,doid:10763-SLCO1B1,1,1.0,810,,0,1.0,1
14303,doid:10763,SLCO1B3,36,doid:10763-SLCO1B3,1,1.0,810,,0,1.0,1


Create a new column 'nDrugMention' in df2 as a copy of the existing column 'unique_drugs_count'.

In [23]:
df2['nDrugMention'] = df2['unique_drugs_count'].copy()
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention,nDrugMention
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived,2,54.59815,1,1
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived,2,54.59815,1,1
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived,2,54.59815,1,1
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived,2,54.59815,1,1
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived,2,54.59815,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,,0,1.00000,1,3
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,,0,1.00000,1,1
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,,0,1.00000,1,1
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,,0,1.00000,1,1


Read the "aact_drugs_chembl_activity.tsv" file, keeping only the columns 'pchembl_value', 'target_chembl_id', and 'molecule_chembl_id'.

In [24]:
# Define the columns you want to read
columns_to_read = ['pchembl_value', 'target_chembl_id', 'molecule_chembl_id']

# Read the tab-separated file with selected columns
pchembl_df = pd.read_csv("aact_drugs_chembl_activity.tsv", sep="\t", usecols=columns_to_read)
pchembl_df

Unnamed: 0,molecule_chembl_id,pchembl_value,target_chembl_id
0,CHEMBL100,,CHEMBL376
1,CHEMBL100,,CHEMBL376
2,CHEMBL100,6.79,CHEMBL376
3,CHEMBL100,,CHEMBL376
4,CHEMBL100,,CHEMBL376
...,...,...,...
32710,CHEMBL106939,,CHEMBL238
32711,CHEMBL106939,,CHEMBL222
32712,CHEMBL106939,8.95,CHEMBL228
32713,CHEMBL106939,,CHEMBL238
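
The activity table is the input for the proposed nAssay_Weighted metric (assay count for drug-target association, weighted by pChembl). The exact weighting is not specified here, so the sketch below simply assumes that assays without a pChEMBL value are skipped and that the weight of an assay is its pchembl_value. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical activity rows in the shape of pchembl_df.
activity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL100", "CHEMBL100", "CHEMBL100", "CHEMBL200"],
    "target_chembl_id": ["CHEMBL376", "CHEMBL376", "CHEMBL376", "CHEMBL228"],
    "pchembl_value": [6.79, None, 8.21, 8.95],
})

# Keep assays that report a pChEMBL value, then aggregate per drug-target pair.
measured = activity.dropna(subset=["pchembl_value"])
assay_stats = (
    measured.groupby(["molecule_chembl_id", "target_chembl_id"])["pchembl_value"]
    .agg(nAssay="count", nAssay_Weighted="sum")  # assumed weighting: sum of pChEMBL
    .reset_index()
)
print(assay_stats)
```

The resulting table is keyed by 'molecule_chembl_id' and 'target_chembl_id', both of which also appear in the merged link table shown earlier.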


To rank each disease-target association on the four columns (nStudy_Weighted, nPub_Weighted, nDiseaseMention, and nDrugMention), use the pandas rank method.

In this code:

    A new column 'Rank' is created in df2 by applying the rank method to the four columns. ascending=False ranks in descending order, so the largest value in each column receives rank 1.

    The sum(axis=1) part adds the four ranks for each row.

    Finally, the DataFrame is sorted by the 'Rank' column in ascending order, so the associations with the smallest rank sums appear first.

In [27]:
df2['Rank'] = df2[['nStudy_Weighted', 'nPub_Weighted', 'nDiseaseMention', 'nDrugMention']].rank(ascending=False).sum(axis=1)
sorted_df = df2.sort_values(by='Rank')
sorted_df

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention,nDrugMention,Rank
59194,doid:5603,HTR2A,13,doid:5603-HTR2A,80,7.724906,36,"background, result, derived",3,403.428793,1,13,45792.0
59193,doid:5603,HRH2,13,doid:5603-HRH2,80,7.724906,36,"background, result, derived",3,403.428793,1,13,45792.0
23925,doid:1307,OPRK1,7,doid:1307-OPRK1,87,7.724979,26,"derived, background",3,403.428793,1,7,47706.5
23974,doid:1307,TACR2,7,doid:1307-TACR2,87,7.724979,26,"derived, background",3,403.428793,1,7,47706.5
23924,doid:1307,OPRD1,7,doid:1307-OPRD1,87,7.724979,26,"derived, background",3,403.428793,1,7,47706.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
56272,doid:5082,CYP1A2,1,doid:5082-CYP1A2,1,1.000000,9,,0,1.000000,1,1,210959.5
56273,doid:5082,CYP2A6,1,doid:5082-CYP2A6,1,1.000000,9,,0,1.000000,1,1,210959.5
20357,doid:1205,IDO1,1,doid:1205-IDO1,1,1.000000,0,,0,1.000000,1,1,210959.5
37453,doid:234,CYP2E1,1,doid:234-CYP2E1,1,1.000000,0,,0,1.000000,1,1,210959.5


Negative values of mean_rank_score likely arise from the earlier percentile calculation:

percentile = (1 / N) * df2['mean_rank'] * 100

This formula scales mean_rank by the number of rows N, but it may not guarantee that the mean_rank_score derived from it is positive when mean_rank is high.

To keep mean_rank_score between 0 and 100, rescale mean_rank by its observed range (min-max scaling):

percentile = (1 - (df2['mean_rank'] - df2['mean_rank'].min()) / (df2['mean_rank'].max() - df2['mean_rank'].min())) * 100

The lowest mean_rank maps to 100 and the highest to 0. Because rankdata assigns rank 1 to the smallest value, a low mean_rank in the cell below corresponds to small metric values.

In [29]:
import numpy as np
from scipy.stats import rankdata

# Rank each metric column of df2. rankdata ranks in ascending order
# (the smallest value gets rank 1) and gives tied values their average rank.
df2['rank_nStudy_Weighted'] = rankdata(df2['nStudy_Weighted'])
df2['rank_nPub_Weighted'] = rankdata(df2['nPub_Weighted'])
df2['rank_nDiseaseMention'] = rankdata(df2['nDiseaseMention'])
df2['rank_nDrugMention'] = rankdata(df2['nDrugMention'])

# Calculate the mean rank
df2['mean_rank'] = np.mean(df2[['rank_nStudy_Weighted', 'rank_nPub_Weighted', 'rank_nDiseaseMention', 'rank_nDrugMention']], axis=1)

# Rescale the mean rank to a 0-100 score (min-max scaling)
percentile = (1 - (df2['mean_rank'] - df2['mean_rank'].min()) / (df2['mean_rank'].max() - df2['mean_rank'].min())) * 100

# Store the result as the mean rank score
df2['mean_rank_score'] = percentile

# Display the relevant columns
df2[['disease-target', 'mean_rank_score']]

Unnamed: 0,disease-target,mean_rank_score
0,doid:0001816-ABCB11,47.067976
1,doid:0001816-ABCC2,54.697504
2,doid:0001816-ABCC3,59.199298
3,doid:0001816-ABCC4,63.809466
4,doid:0001816-ACHE,54.697504
...,...,...
75816,doid:9993-USP1,77.013335
75817,doid:9993-USP2,75.868739
75818,doid:9993-VCP,100.000000
75819,doid:9993-VDR,80.158021
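
If larger metric values are meant to yield higher scores, the ranking direction matters. The following sketch, on hypothetical values, ranks in descending order with pandas (as in the 'Rank' cell earlier) before applying the same min-max scaling. It illustrates the direction of the ranks and is not a rerun of the analysis.

```python
import pandas as pd

# Hypothetical metric values for three associations.
metrics = pd.DataFrame(
    {
        "nStudy_Weighted": [7.4, 1.0, 4.8],
        "nPub_Weighted": [403.4, 1.0, 54.6],
        "nDrugMention": [13, 1, 3],
    },
    index=["strong", "weak", "middle"],
)

# ascending=False gives rank 1 to the largest value in each column.
ranks = metrics.rank(ascending=False, method="average")
mean_rank = ranks.mean(axis=1)

# Min-max scale so the best (lowest) mean rank maps to 100 and the worst to 0.
span = mean_rank.max() - mean_rank.min()
score = (1 - (mean_rank - mean_rank.min()) / span) * 100
print(score)
```

With descending ranks, the association with the largest values in every column has mean rank 1 and a score of 100.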


This section prepares a requirements.txt file so that a virtual environment can be created and the analysis rerun with the same package versions.

Export Package List:

First, export the list of installed packages and their versions from your Jupyter Notebook environment to a text file. You can do this by running the following command in a notebook cell:

!pip freeze > requirements.txt

This command creates a requirements.txt file containing the package names and versions.

Copy requirements.txt:

Copy the requirements.txt file to the target machine where you want to create the isolated environment. You can use methods like SCP, FTP, or even a USB drive to transfer the file.

Create Isolated Environment:

On the target machine, ensure you have Python and virtualenv or conda (if you're using Anaconda) installed. Then, follow one of the two methods below, depending on whether you are using virtualenv or conda:

Using virtualenv:
# Create a new virtual environment
virtualenv myenv

# Activate the virtual environment
source myenv/bin/activate  # On Windows, use "myenv\Scripts\activate"

# Install packages from requirements.txt
pip install -r requirements.txt

Using conda:
# Create a new conda environment
conda create --name myenv python=3.x  # Replace '3.x' with the desired Python version

# Activate the conda environment
conda activate myenv

# Install packages from requirements.txt
pip install -r requirements.txt

Verify the Environment:

After installation, you can verify that the packages have been successfully installed in your isolated environment by running:
pip list

In [30]:
%pip list

Package                       Version
----------------------------- ---------------
aiobotocore                   2.5.0
aiofiles                      22.1.0
aiohttp                       3.8.3
aioitertools                  0.7.1
aiosignal                     1.2.0
aiosqlite                     0.18.0
alabaster                     0.7.12
anaconda-catalogs             0.2.0
anaconda-client               1.11.3
anaconda-navigator            2.4.2
anaconda-project              0.11.1
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.3
astroid                       2.14.2
astropy                       5.1
asttokens                     2.0.5
async-timeout                 4.0.2
atomicwrites                  1.4.0
attrs                         22.1.0
Automat                       20.2.0
autopep8                      1.6.0
Babel                         2.11.0
backcal

statsmodels                   0.13.5
sympy                         1.11.1
tables                        3.8.0
tabulate                      0.8.10
TBB                           0.2
tblib                         1.7.0
tenacity                      8.2.2
terminado                     0.17.1
text-unidecode                1.3
textdistance                  4.2.1
threadpoolctl                 2.2.0
three-merge                   0.1.1
tifffile                      2021.7.2
tinycss2                      1.2.1
tldextract                    3.2.0
tokenizers                    0.13.2
toml                          0.10.2
tomli                         2.0.1
tomlkit                       0.11.1
toolz                         0.12.0
torch                         2.0.1
tornado                       6.2
tqdm                          4.65.0
traitlets                     5.7.1
transformers                  4.29.2
Twisted                       22.10.0
typing_extensions             4.6.3
uc-micro-py        

In [31]:
!pip freeze > requirements.txt

In [32]:
df2

Unnamed: 0,doid,gene_symbol,unique_drugs_count,disease-target,nStudy,nStudy_Weighted,nPub,publication_type,t_sum,nPub_Weighted,nDiseaseMention,nDrugMention,Rank,rank_nStudy_Weighted,rank_nPub_Weighted,rank_nDiseaseMention,rank_nDrugMention,mean_rank,mean_rank_score
0,doid:0001816,ABCB11,1,doid:0001816-ABCB11,22,7.359119,2,derived,2,54.59815,1,1,123533.0,51980.5,68282.5,37911.0,21581.0,44938.750,47.067976
1,doid:0001816,ABCC2,1,doid:0001816-ABCC2,11,6.043768,2,derived,2,54.59815,1,1,136134.5,39379.0,68282.5,37911.0,21581.0,41788.375,54.697504
2,doid:0001816,ABCC3,1,doid:0001816-ABCC3,7,4.797787,2,derived,2,54.59815,1,1,143570.0,31943.5,68282.5,37911.0,21581.0,39929.500,59.199298
3,doid:0001816,ABCC4,1,doid:0001816-ABCC4,4,3.288163,2,derived,2,54.59815,1,1,151184.5,24329.0,68282.5,37911.0,21581.0,38025.875,63.809466
4,doid:0001816,ACHE,1,doid:0001816-ACHE,11,6.043768,2,derived,2,54.59815,1,1,136134.5,39379.0,68282.5,37911.0,21581.0,41788.375,54.697504
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75816,doid:9993,USP1,3,doid:9993-USP1,1,1.000000,0,,0,1.00000,1,3,172993.0,4838.5,27998.0,37911.0,59547.5,32573.750,77.013335
75817,doid:9993,USP2,1,doid:9993-USP2,14,6.615809,0,,0,1.00000,1,1,171102.5,44695.5,27998.0,37911.0,21581.0,33046.375,75.868739
75818,doid:9993,VCP,1,doid:9993-VCP,1,1.000000,0,,0,1.00000,1,1,210959.5,4838.5,27998.0,37911.0,21581.0,23082.125,100.000000
75819,doid:9993,VDR,1,doid:9993-VDR,10,5.793768,0,,0,1.00000,1,1,178187.0,37611.0,27998.0,37911.0,21581.0,31275.250,80.158021


In [33]:
df2.to_csv('tictac_genes_disease_associations.csv', index=False)
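
The requested output file needs separate disease ID and gene columns, the measures, and the mean rank and score. A sketch of selecting such columns explicitly before writing is shown below. The column subset and the two-row table are hypothetical, and the output goes to an in-memory buffer here so that no file is created.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for df2 with a subset of its columns.
result = pd.DataFrame({
    "doid": ["doid:1", "doid:2"],
    "gene_symbol": ["G1", "G2"],
    "nStudy_Weighted": [7.4, 1.0],
    "mean_rank": [1.0, 2.0],
    "mean_rank_score": [100.0, 0.0],
    "t_sum": [3, 0],  # intermediate column, left out of the export
})

export_columns = ["doid", "gene_symbol", "nStudy_Weighted", "mean_rank", "mean_rank_score"]

# Write tab-separated text to a buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
result[export_columns].to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```

Listing the columns explicitly keeps intermediate columns such as 't_sum' out of the shared file and fixes the column order.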