TICTAC: Target illumination clinical trials analytics with cheminformatics

Disease-Target links (via study, drug) => sddt_links.tsv

Columns - eleven columns
    uniprot	
    CID	
    nct_id	
    doid	
    disease_term	
    itv_id	
    drug_name	
    target_chembl_id	
    molecule_chembl_id	
    gene_symbol	
    idgTDL
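
Before analysing the file, it can help to confirm that all of the columns listed above are present. The following is a minimal sketch, not part of the original notebook: the helper name missing_columns and the header-only stand-in for sddt_links.tsv are assumptions made so that the example runs without the real file.

```python
import io
import pandas as pd

# Column names copied from the list above.
EXPECTED_COLUMNS = [
    "uniprot", "CID", "nct_id", "doid", "disease_term", "itv_id",
    "drug_name", "target_chembl_id", "molecule_chembl_id",
    "gene_symbol", "idgTDL",
]

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the frame (hypothetical helper)."""
    return [name for name in expected if name not in frame.columns]

# Header-only stand-in for sddt_links.tsv (an assumption, so the sketch is self-contained).
header_only = io.StringIO("\t".join(EXPECTED_COLUMNS) + "\n")
links = pd.read_csv(header_only, sep="\t")

missing = missing_columns(links)
print("Missing columns:", missing)
```

With the real file, the stand-in would be replaced by pd.read_csv('sddt_links.tsv', sep='\t'), assuming the file is in the working directory.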

In [1]:
import pandas as pd
df = pd.read_csv('example.tsv', sep = '\t')
df.head()

Unnamed: 0,nct_id,disease_term,drug_name,gene_symbol
0,NCT00157716,stroke,Pyridoxal,KDM4E
1,NCT00157716,coronary artery disease,Pyridoxal,KDM4E
2,NCT00157716,ischemia,Pyridoxal,KDM4E
3,NCT00157716,unstable angina,Pyridoxal,KDM4E
4,NCT00157716,myocardial infarction,Pyridoxal,KDM4E


Approach

1. Define a 'triplet term'

        A triplet term is the combination of disease_term, drug_name, and gene_symbol.

2. Define a function, ddg_triplet_pair

        It builds the triplet as disease_term + drug_name + gene_symbol; the occurrences of each triplet are then counted.

The ddg_triplet_pair function

    concatenates the values of the disease_term, drug_name, and gene_symbol columns for each row, treating missing values as empty strings.

The apply method

        applies this function to each row of the DataFrame, producing a new column, triplet_pair.

The value_counts method 

        counts the occurrences of each unique triplet pair in the DataFrame.

The count for a specific triplet pair is available through the triplet_pair_counts variable. For example, triplet_pair_counts['strokePyridoxalKDM4E'] returns the number of rows whose triplet pair is 'strokePyridoxalKDM4E'.
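
Plain concatenation can, in principle, map two different triplets to the same string, because nothing marks where one field ends and the next begins. As an alternative sketch (not the approach used in this notebook), the three columns can serve directly as a composite group key. The first three rows below are copied from the df.head() output above; the fourth row, with the placeholder study ID NCT_EXAMPLE, is invented purely to show a triplet seen in two studies.

```python
import pandas as pd

# Three rows from the df.head() output, plus one invented row (NCT_EXAMPLE)
# that repeats the stroke triplet under a second, hypothetical study.
sample = pd.DataFrame({
    "nct_id": ["NCT00157716", "NCT00157716", "NCT00157716", "NCT_EXAMPLE"],
    "disease_term": ["stroke", "coronary artery disease", "ischemia", "stroke"],
    "drug_name": ["Pyridoxal"] * 4,
    "gene_symbol": ["KDM4E"] * 4,
})

key_cols = ["disease_term", "drug_name", "gene_symbol"]

# Distinct studies per (disease, drug, gene) triplet.
# dropna=False keeps rows whose key contains a missing value.
studies_per_triplet = (
    sample.groupby(key_cols, dropna=False)["nct_id"]
    .nunique()
    .sort_values(ascending=False)
)

# Look up one triplet by its tuple key.
n_stroke = studies_per_triplet.loc[("stroke", "Pyridoxal", "KDM4E")]
print(n_stroke)
```

The tuple key keeps the three fields separate, so the result can be joined back to other tables by disease, drug, or gene without parsing a concatenated string.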

In [4]:
import pandas as pd

def ddg_triplet_pair(row):
    """Concatenate disease_term, drug_name and gene_symbol; missing values become ''."""
    disease = str(row['disease_term']) if pd.notnull(row['disease_term']) else ''
    drug = str(row['drug_name']) if pd.notnull(row['drug_name']) else ''
    gene = str(row['gene_symbol']) if pd.notnull(row['gene_symbol']) else ''
    return disease + drug + gene

# Build the triplet key for every row.
df['triplet_pair'] = df.apply(ddg_triplet_pair, axis=1)

# Number of rows per triplet.
triplet_pair_counts = df['triplet_pair'].value_counts()

# Number of distinct studies (nct_id) per triplet, attached to every row.
# Grouping on the existing triplet_pair column avoids rebuilding the key with str(),
# which would turn missing values into 'nan' and fail to match the column.
df['nct_id_count'] = df.groupby('triplet_pair')['nct_id'].transform('nunique')

# The same counts, one row per triplet.
triplet_nct_counts = df.groupby('triplet_pair')['nct_id'].nunique().reset_index()

In [5]:
triplet_nct_counts_sorted = triplet_nct_counts.sort_values('nct_id', ascending=False)
triplet_nct_counts_sorted.head()

Unnamed: 0,triplet_pair,nct_id
16,hypertensionCandesartan cilexetilKDM4E,3
19,ischemiaPyridoxalKDM4E,2
34,strokePyridoxalKDM4E,2
23,myocardial infarctionPyridoxalKDM4E,2
38,unstable anginaPyridoxalKDM4E,2


In [6]:
# Next, consider the other variables by working backwards from nct_id
df['nct_id'].unique()

array(['NCT00157716', 'NCT00157729', 'NCT00402506', 'NCT00917293',
       'NCT00081731', 'NCT00120003', 'NCT00125463', 'NCT00184587',
       'NCT01012479', 'NCT01052272', 'NCT01135212', 'NCT01289132'],
      dtype=object)

Evidence variables

nStudy - Study count for association

nStudyNewness - Study count weighted by newness of study (newer better)

nStudyPhase - Study count weighted by phase of study (completed better)

nPub - Study publications

nPubTypes - Study publications weighted by publication type (results better)
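
As a rough illustration of the first two study-level variables, the sketch below computes nStudy as the number of distinct studies per association and nStudyNewness as a sum of per-study weights. Everything specific here is an assumption: the start_year column, the reference year, the 1 / (1 + age) weighting, and the placeholder rows are all hypothetical, since the notes above do not define how newness is weighted.

```python
import pandas as pd

# Placeholder data: association labels, study IDs, and years are invented,
# not taken from TICTAC or AACT.
studies = pd.DataFrame({
    "association": ["A", "A", "B"],
    "nct_id": ["NCT_A1", "NCT_A2", "NCT_B1"],
    "start_year": [2010, 2020, 2015],  # hypothetical column
})

REFERENCE_YEAR = 2020  # assumed "current" year for computing study age

# Count each study once per association before weighting.
studies = studies.drop_duplicates(subset=["association", "nct_id"]).copy()

# Hypothetical weight: 1.0 for a study from the reference year, decaying with age.
studies["newness_weight"] = 1.0 / (1 + (REFERENCE_YEAR - studies["start_year"]))

evidence = studies.groupby("association").agg(
    nStudy=("nct_id", "nunique"),
    nStudyNewness=("newness_weight", "sum"),
)
print(evidence)
```

Any monotone weighting that favours newer studies would fit the description "newer better"; the reciprocal form is used here only because it is simple to read.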


Drug-target association
    
    nAssay - Assay count 
    
    nAssayPchembl - Assay count weighted by pChembl

Disease-target association
    
    nDiseaseMention_aact - Disease mention count (in AACT)
    nDiseaseMention_pubmed - Disease mention count (in papers)
    nDrugMention_aact - Drug mention count (in AACT)
    nDrugMention_pubmed - Drug mention count (in papers)
    nDrug - Drug count