# Computational prediction of drug-target interactions

## Exploratory Data Analysis

Original article: Computational prediction of drug–target interactions using chemogenomic approaches: an empirical survey, A. Ezzat et al.
<br>Data link: http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/


Supplementary data. Organic molecules (QM9 file): https://deepchemdata.s3-us-west-.amazonaws.com/datasets/molnet_publish/qm9.zip


## 1 - Pre-setup

### 1.1 - Imports (dependencies)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import numpy as np
import os
import json
import requests
from tqdm import tqdm
import time


from io import StringIO  # retrieve information for mlflow
import sys  # retrieve information for mlflow


#Chemistry Libraries
from rdkit import Chem
from rdkit.Chem import AllChem



# PubChem database API, e.g. https://pubchem.ncbi.nlm.nih.gov/compound/5388962
import pubchempy as pcp  # to retrieve features and SMILES



## 2 - Data imports

### 2.1 - Predicted drug-target interaction networks

Five types of data:
<ul>
<li>Predicted compound-protein interaction pairs</li>
<li>Binary relation list of the gold standard drug-target interaction data</li>
<li>Adjacency matrix of the gold standard drug-target interaction data</li>
<li>Compound structure similarity matrix</li>
<li>Protein sequence similarity matrix</li>
</ul>
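
The binary relation list and the adjacency matrix both describe the gold standard interactions, in two layouts. As a minimal sketch, assuming that a cell equal to 1 marks a known interaction, the matrix can be flattened into a list of pairs. The function name `adjacency_to_pairs`, the output column names and the toy labels are hypothetical, chosen only for illustration.

```python
import pandas as pd

def adjacency_to_pairs(adjacency: pd.DataFrame) -> pd.DataFrame:
    """Return one row per cell of `adjacency` that equals 1."""
    # stack() yields a Series indexed by (row label, column label)
    stacked = adjacency.stack()
    known = stacked[stacked == 1]
    pairs = known.index.to_frame(index=False)
    pairs.columns = ["row_id", "column_id"]  # hypothetical column names
    return pairs

# Toy matrix with made-up labels (not taken from the dataset)
toy = pd.DataFrame(
    [[1, 0, 0], [0, 1, 1]],
    index=["p1", "p2"],
    columns=["c1", "c2", "c3"],
)
pairs = adjacency_to_pairs(toy)
print(pairs)
```

Keeping the row and column labels in the pair list makes it easier to join the interactions with either similarity matrix later.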

In [2]:
# Base directory for the relative paths used below.
# os.getcwd() returns the current working directory, not the location of a script file.
base_dir = os.getcwd()
base_dir

ligants_type=['nuclear_receptor','GPCR','ion_channel','nuclear_receptor']

### 2.1.1 - Ion Channel

In [3]:
# Ion channel (index 2 of ligants_type)

ltype=ligants_type[2]

#set tables
files_matrix_temp={'df_adjacency_matrix_ion_channel_Y':'ic_admat_dgc.txt',
       'df_similarity_matrix_ion_channel_compound_St':'ic_simmat_dc.txt',
       'df_similarity_matrix_ion_channel_protein_Sd':'ic_simmat_dg.txt',
       }

df_temp_matrix={}


for df_name, file_name in files_matrix_temp.items():
    # Construct the file path using base_dir
    file_path = os.path.join(base_dir,'data','split',ltype, file_name)

    try:
        # Read the file
        print("Trying to read file at:", file_path) # Print the path for verification
        data_frame = pd.read_csv(file_path, delimiter='\t', index_col=0)
        df_temp_matrix[df_name] = data_frame
    except FileNotFoundError:
        print(f'File not found at the specified path: {file_path}')

df_adjacency_matrix_ion_channel_Y=df_temp_matrix['df_adjacency_matrix_ion_channel_Y']
df_similarity_matrix_ion_channel_compound_St=df_temp_matrix['df_similarity_matrix_ion_channel_compound_St']
df_similarity_matrix_ion_channel_protein_Sd=df_temp_matrix['df_similarity_matrix_ion_channel_protein_Sd']

Trying to read file at: C:\Users\riskf\OneDrive\DrugTargetSmilesBERT\data\split\ion_channel\ic_admat_dgc.txt
Trying to read file at: C:\Users\riskf\OneDrive\DrugTargetSmilesBERT\data\split\ion_channel\ic_simmat_dc.txt
Trying to read file at: C:\Users\riskf\OneDrive\DrugTargetSmilesBERT\data\split\ion_channel\ic_simmat_dg.txt


In [4]:
# Adjacency matrix Y
print('Lines (m): {}'.format(df_adjacency_matrix_ion_channel_Y.shape[0]))
print('Columns (n): {}'.format(df_adjacency_matrix_ion_channel_Y.shape[1]))
print('Size (m x n): {}'.format(df_adjacency_matrix_ion_channel_Y.size))

# Cells equal to 1 are known interactions; every other cell is counted as "no interaction"
number_interactions = (df_adjacency_matrix_ion_channel_Y.values == 1).sum()
print('Known interactions: {}'.format(number_interactions))
print('Known interactions (%): {:.4f}%'.format(number_interactions/df_adjacency_matrix_ion_channel_Y.size*100))
print('No interactions: {}'.format(df_adjacency_matrix_ion_channel_Y.size-number_interactions))
print('No interactions(%): {:.4f}%'.format((df_adjacency_matrix_ion_channel_Y.size-number_interactions)/df_adjacency_matrix_ion_channel_Y.size*100))


#print(df_adjacency_matrix_ion_channel_Y.head(5))

Lines (m): 204
Columns (n): 210
Size (m x n): 42840
Known interactions: 1476
Known interactions (%): 3.4454%
No interactions: 41364
No interactions(%): 96.5546%


In [5]:
# Compound similarity matrix (compounds are the columns of Y)
print('Lines (m): {}'.format(df_similarity_matrix_ion_channel_compound_St.shape[0]))
print('Columns (n): {}'.format(df_similarity_matrix_ion_channel_compound_St.shape[1]))
print('Size (m x n): {}'.format(df_similarity_matrix_ion_channel_compound_St.size))

#print(df_similarity_matrix_ion_channel_compound_St.head(5))

Lines (m): 210
Columns (n): 210
Size (m x n): 44100


In [6]:
# Human protein similarity matrix (proteins are the rows of Y)
print('Lines (m): {}'.format(df_similarity_matrix_ion_channel_protein_Sd.shape[0]))
print('Columns (n): {}'.format(df_similarity_matrix_ion_channel_protein_Sd.shape[1]))
print('Size (m x n): {}'.format(df_similarity_matrix_ion_channel_protein_Sd.size))

#print(df_similarity_matrix_ion_channel_protein_Sd.head(5))

Lines (m): 204
Columns (n): 204
Size (m x n): 41616
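
The three shapes above should agree: the adjacency matrix has 204 rows and 210 columns, matching the 204 x 204 and 210 x 210 similarity matrices. A minimal sketch of how such checks could be automated follows. The function name `check_consistency` and the toy matrices are hypothetical, and symmetry is only tested here, not assumed to hold for the real files.

```python
import numpy as np
import pandas as pd

def check_consistency(Y, S_rows, S_cols, atol=1e-8):
    """Compare an adjacency matrix with the similarity matrices of its rows and columns."""
    def is_square(df):
        return df.shape[0] == df.shape[1]

    def is_symmetric(df):
        # Only compare with the transpose when the matrix is square
        return is_square(df) and bool(np.allclose(df.values, df.values.T, atol=atol))

    return {
        "rows_match": Y.shape[0] == S_rows.shape[0],
        "columns_match": Y.shape[1] == S_cols.shape[0],
        "row_similarity_symmetric": is_symmetric(S_rows),
        "column_similarity_symmetric": is_symmetric(S_cols),
        "adjacency_is_binary": bool(np.isin(Y.values, [0, 1]).all()),
    }

# Toy data with made-up values
Y_toy = pd.DataFrame([[1, 0, 0], [0, 1, 1]])
S_rows_toy = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]])
S_cols_toy = pd.DataFrame([[1.0, 0.1, 0.3], [0.1, 1.0, 0.4], [0.3, 0.4, 1.0]])
report = check_consistency(Y_toy, S_rows_toy, S_cols_toy)
print(report)
```

With the notebook's variables, the call would be `check_consistency(df_adjacency_matrix_ion_channel_Y, df_similarity_matrix_ion_channel_protein_Sd, df_similarity_matrix_ion_channel_compound_St)`.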


### Non-negative matrix factorization

In [None]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

K = 50  # number of latent features per entity

# Y has one row per protein (204) and one column per compound (210), see the shapes above
Y = csr_matrix(df_adjacency_matrix_ion_channel_Y.values)

# Initialize NMF model
model = NMF(n_components=K, init='random', random_state=0, max_iter=1000)

# Fit the model
A = model.fit_transform(Y)  # Latent features of the rows of Y (proteins), shape (204, K)
B = model.components_  # Latent features of the columns of Y (compounds), shape (K, 210)

# Calculate the complete interaction matrix using the factorized matrices
Y_complete = np.dot(A, B)

# Generating the final dataset: one sample per (row, column) cell of Y
def generate_final_dataset(A, B, Y):
    B = B.T  # Transpose B so that each row holds the features of one column of Y
    final_dataset = []
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            features_row = A[i, :]  # Latent features of the i-th row of Y
            features_col = B[j, :]  # Latent features of the j-th column of Y
            interaction_class = Y[i, j]  # 1 for a known interaction, 0 otherwise; Y must be dense
            final_dataset.append(np.concatenate([features_row, features_col, [interaction_class]]))
    return np.array(final_dataset)
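
The nested loop in `generate_final_dataset` builds one sample per cell of Y, in row-major order. As a sketch of an alternative and not a replacement for the cell above, the same array can be assembled with `np.repeat` and `np.tile`. The function names `pair_features_loop` and `pair_features_vectorized` are hypothetical, and the toy factors are random values.

```python
import numpy as np

def pair_features_loop(A, B, Y):
    """Reference version: the same logic as the nested loop."""
    Bt = B.T
    rows = []
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            rows.append(np.concatenate([A[i, :], Bt[j, :], [Y[i, j]]]))
    return np.array(rows)

def pair_features_vectorized(A, B, Y):
    """Vectorized version producing the same rows in the same order."""
    Y = np.asarray(Y)
    n_rows, n_cols = Y.shape
    row_part = np.repeat(A, n_cols, axis=0)  # each row of A repeated n_cols times in a row
    col_part = np.tile(B.T, (n_rows, 1))     # the full block of column features, once per row of Y
    labels = Y.reshape(-1, 1)                # row-major flattening matches the loop order
    return np.hstack([row_part, col_part, labels])

# Toy factors: 3 x 4 interaction matrix, 2 latent features
rng = np.random.default_rng(0)
A_toy = rng.random((3, 2))
B_toy = rng.random((2, 4))
Y_toy = rng.integers(0, 2, size=(3, 4))

loop_result = pair_features_loop(A_toy, B_toy, Y_toy)
fast_result = pair_features_vectorized(A_toy, B_toy, Y_toy)
print(fast_result.shape)
```

Both versions return an array with `Y.size` rows and `2 * K + 1` columns; the vectorized one avoids the Python-level loop over every cell.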

In [8]:
# Output matrix, needed for the next steps.
Y_dense = Y.toarray() if isinstance(Y, csr_matrix) else Y  # Convert to dense if Y is sparse
final_dataset = generate_final_dataset(A, B, Y_dense)  # Generate the dataset
final_df = pd.DataFrame(final_dataset)

file_name = f'final_new_par_NNMF_{K}.csv'
file_path = os.path.join(base_dir, 'data', 'split', ltype, file_name)
final_df.to_csv(file_path, index=False)
print(f"Final dataset saved at {file_path}!")

Final dataset saved at C:\Users\riskf\OneDrive\DrugTargetSmilesBERT\data\split\ion_channel\final_new_par_NNMF_50.csv!
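
Each row of the saved file holds K latent features for one row of Y, K latent features for one column of Y, and the interaction class in the last position, for 2K + 1 columns in total. A minimal sketch of how a later step might separate these parts is shown below. The function name `split_features_labels` is hypothetical, and the toy frame stands in for the CSV with K = 2.

```python
import numpy as np
import pandas as pd

def split_features_labels(df: pd.DataFrame, k: int):
    """Split a pair dataset into row features, column features and labels."""
    values = df.to_numpy()
    if values.shape[1] != 2 * k + 1:
        raise ValueError(f"expected {2 * k + 1} columns, got {values.shape[1]}")
    row_features = values[:, :k]        # first K columns
    col_features = values[:, k:2 * k]   # next K columns
    labels = values[:, -1].astype(int)  # last column: interaction class
    return row_features, col_features, labels

# Toy stand-in for the saved dataset (made-up values, K = 2)
toy_df = pd.DataFrame([
    [0.1, 0.0, 0.3, 0.2, 1.0],
    [0.0, 0.5, 0.3, 0.2, 0.0],
])
row_feats, col_feats, y = split_features_labels(toy_df, k=2)
print(row_feats.shape, col_feats.shape, y)
```

With the real file, the frame would come from `pd.read_csv(file_path)` and `k` would be the same `K` used for the factorization.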


In [9]:
print(final_df)

            0    1    2    3    4    5    6    7    8    9    ...       91   \
0      0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
1      0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
2      0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
3      0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
4      0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
...         ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...       ...   
42835  0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
42836  0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   
42837  0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.992440   
42838  0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000116   
42839  0.000001  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.000000   

       92        93   94   95        96   97   98  