MLSyPred Morgan Fingerprint 2048 Combinations PIPELINE

The SMILES strings of all the drugs must first be converted into Morgan fingerprints so that the per-pair averages used to assess synergy can be computed.
The script averages the Morgan fingerprints of pairs of compounds; there are 79 compounds in total.
Each fingerprint has 2048 bits (features), and the average for a pair records, bit by bit, whether the feature is present in neither, one, or both of the compounds.

#56 of the drug compounds are in the training set 
#23 of the drug compounds are in the validation set 

For each of the 2048 Morgan fingerprint bits (features), the average for a pair of compounds is expected to be 0, 0.5, or 1:

#'0' means that neither of the 2 compounds has the bit set
#'0.5' means that exactly one of the 2 compounds has the bit set
#'1' means that both compounds have the bit set
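
The three possible values follow directly from averaging two binary vectors element by element. A minimal sketch, assuming two made-up 8-bit vectors (`fp_a` and `fp_b` are illustrative values, not data from the study; the real fingerprints have 2048 bits):

```python
# Two hypothetical 8-bit fingerprints (illustrative values only).
fp_a = [1, 0, 1, 1, 0, 0, 1, 0]
fp_b = [1, 1, 0, 1, 0, 0, 0, 0]


def average_bits(a, b):
    """Element-wise mean of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return [(x + y) / 2 for x, y in zip(a, b)]


pair_features = average_bits(fp_a, fp_b)
print(pair_features)
```

Every entry of `pair_features` is 0.0, 0.5, or 1.0, matching the three cases listed above.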

The SMILES column of the output shows the SMILES of the first compound in each combination. 
The working hypothesis is that pairs of compounds sharing many Morgan fingerprint bits are more likely to show synergy in combination therapy.

Datasets used:

Mason, D. J., Eastman, R. T., Lewis, R., Stott, I. P., Guha, R., & Bender, A. (2018). Using Machine Learning to Predict Synergistic Antimalarial Compound Combinations With Novel Structures. Frontiers in pharmacology, 9, 1096. https://doi.org/10.3389/fphar.2018.01096

Mott, B. T., Eastman, R. T., Guha, R., Sherlach, K. S., Siriwardana, A., Shinn, P., McKnight, C., Michael, S., Lacerda-Queiroz, N., Patel, P. R., Khine, P., Sun, H., Kasbekar, M., Aghdam, N., Fontaine, S. D., Liu, D., Mierzwa, T., Mathews-Griner, L. A., Ferrer, M., Renslo, A. R., … Thomas, C. J. (2015). High-throughput matrix screening identifies synergistic and antagonistic antimalarial drug combinations. Scientific reports, 5, 13891. https://doi.org/10.1038/srep13891

In [68]:
#packages needed for the SMILES conversion (SCRIPT 1)
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys

#packages needed for the script to run (SCRIPT 2 & 3)
import csv
import numpy as np
import pandas
from itertools import combinations

#other needed packages
from numpy import unique     #needed in script 4
import pandas as pd

# Script #1

In [69]:
#DEFINE FILES

#Input files



inputTraining_S1 = 'smiles_trainingset.csv'
#inputValidation_S1 = 'validation_2048_Morgan_trainingset.csv'

#OUTPUT FILES
outputTraining_S1 = 'smiles_validationset.csv' 
#outputValidation_S1 = 'validation_2048_Morgan_validationset.csv'

#specify whether to use MACCS keys or Morgan fingerprints

# for MACCS keys, set var_maccskey = 1 and columnas = 167
# for Morgan fingerprints, set var_morgan = 1 and columnas = 1024 or 2048

var_maccskey = 0
var_morgan  = 1

columnas = 2048


In [70]:
def dataModel(input, output):
    
    # open the input file in read mode
    with open(input, 'r') as file_handle:
        # read file content into list
        lines = file_handle.readlines()
        # number of lines in the input file, header included
        counter_lines = len(lines)

    # OUTPUT FILE in write mode
    OUTFILES1 = open(output, 'w')
    
    
    # Header row: "Compound", "SMILES", then one numbered column per fingerprint bit.
    # The bit count comes from the global columnas: 167 for MACCS keys, 1024 or 2048 for Morgan.
    miscolumns = "Compound,SMILES"
    for col in range(1, columnas + 1):
        miscolumns = miscolumns + "," + str(col)
    lineoutput = miscolumns + "\n"
    OUTFILES1.write(lineoutput)

    # Start at index 1 to skip the header row of the csv
    for n in range(1, counter_lines):
        # Split the row into its fields: compound name and SMILES
        milinea = lines[n].split(",")
        # Column 0 of the csv holds the compound name
        var0 = milinea[0]
        # Column 1 of the csv holds the SMILES; strip the trailing newline
        var1 = milinea[1].strip()
        # Parse the SMILES into an RDKit molecule object
        tmp = Chem.MolFromSmiles(var1)
        if var_morgan == 1:
            # Morgan fingerprint (radius 2) as a bit vector with columnas bits
            fp1 = AllChem.GetMorganFingerprintAsBitVect(tmp, radius=2, nBits=columnas)
        if var_maccskey == 1:
            # MACCS keys as a 167-bit vector
            fp1 = MACCSkeys.GenMACCSKeys(tmp)
        # Number of bits in the generated fingerprint
        largo = len(fp1)
        # Start the output row with the compound name and its SMILES
        lineoutput = str(var0) + "," + str(var1)
        # Append every bit (indices 0 to largo - 1) as 0 or 1
        for i in range(largo):
            var2 = int(fp1[i])
            lineoutput = lineoutput + "," + str(var2)
        # End the row so that each compound is written on its own line
        lineoutput = lineoutput + "\n"
        # Write the row to the output file
        OUTFILES1.write(lineoutput)
    OUTFILES1.close()
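
The function above builds each output row by joining strings with commas, which breaks if a compound name itself contains a comma. As a hedged alternative, the row-writing step could be delegated to the standard `csv` module. The sketch below is not the notebook's method: `write_fingerprint_rows` and the 4-bit example rows are hypothetical, and in the real pipeline the bits would come from the RDKit fingerprint.

```python
import csv
import io


def write_fingerprint_rows(rows, n_bits, out_handle):
    """Write a header plus one row per compound: name, SMILES, then the bits."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["Compound", "SMILES"] + [str(i) for i in range(1, n_bits + 1)])
    for name, smiles, bits in rows:
        if len(bits) != n_bits:
            raise ValueError(f"{name}: expected {n_bits} bits, got {len(bits)}")
        writer.writerow([name, smiles] + [int(b) for b in bits])


# Hypothetical 4-bit example rows (illustrative values only).
example_rows = [
    ("DrugA", "CCO", [1, 0, 0, 1]),
    ("Drug,B", "c1ccccc1", [0, 1, 1, 0]),  # the comma in the name gets quoted
]
buffer = io.StringIO()
write_fingerprint_rows(example_rows, 4, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with `newline=""` would work the same way.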

In [71]:
def main():

    ##TRAINING SET (MORGAN2048) (WITH 56 COMPOUNDS)
    dataModel(inputTraining_S1, outputTraining_S1)

    ##VALIDATION SET (SYNERGISM) (WITH 23 COMPOUNDS)
    # disabled: inputValidation_S1 and outputValidation_S1 are commented out above
    #dataModel(inputValidation_S1, outputValidation_S1)

#Executes all code in main
if __name__ == '__main__':
    main()

<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x7ff778259c60>
2048

# Script #2 

Reads the output of Script 1. The output file contains the drug combination ID, e.g. (Drug1+Drug2), along with the computed feature values:
	0.5 if exactly one of the two drugs has the fingerprint bit equal to 1, i.e. {Drug_1=1, Drug_2=0} or {Drug_1=0, Drug_2=1}
	0 if both drugs have the fingerprint bit equal to 0, i.e. {Drug_1=0, Drug_2=0}
	1 if both drugs have the fingerprint bit equal to 1, i.e. {Drug_1=1, Drug_2=1}
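
The pairing and averaging described above can be sketched with `itertools.combinations`, which yields every unordered pair exactly once, so n compounds give n(n-1)/2 combinations. This is a simplified illustration rather than the notebook's implementation: the `fingerprints` dictionary and its 4-bit values are made up, and `pair_features` is a hypothetical helper name.

```python
from itertools import combinations

# Hypothetical 4-bit fingerprints keyed by compound name (illustrative only).
fingerprints = {
    "DrugA": [1, 0, 1, 0],
    "DrugB": [1, 1, 0, 0],
    "DrugC": [0, 0, 1, 0],
}


def pair_features(fps):
    """Map each unordered pair ID to the element-wise mean of its fingerprints."""
    features = {}
    for (name1, bits1), (name2, bits2) in combinations(fps.items(), 2):
        pair_id = f"({name1}+{name2})"
        features[pair_id] = [(a + b) / 2 for a, b in zip(bits1, bits2)]
    return features


pairs = pair_features(fingerprints)
for pair_id, values in pairs.items():
    print(pair_id, values)
```

Three compounds produce three pairs here; the pair ID format mirrors the `(comp1+comp2)` strings written by Script 2.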



In [72]:
#DEFINE FILES SCRIPT 2

#Input files
inputTraining_S2 = outputTraining_S1
#inputValidation_S2 = outputValidation_S1
 
#OUTPUT FILES  
outputTraining_S2 = 'OUTS2.csv'
#outputValidation_S2 = 'validation_set_2048_Morgan_OUTS2.csv'



#Numbering of the fingerprint columns (1 to columnas), used to name the columns
COL_COUNT1 = 1
COL_COUNT2 = columnas + 1

In [73]:
#Function to open input and create output
def dataModel(input, output):

    # OUTPUT FILE in write mode
    OUTFILE = open(output, 'w')

    # open the input file in read mode
    with open(input, 'r') as file_handle:
        # read file content into list
        lines = file_handle.readlines()
    # the first line of the input file is the header row
    lineOutput=lines[0]
    
    # write in file
    OUTFILE.write(lineOutput)

    # stop editing the file
    OUTFILE.close()

    # establish a dataframe with the input file
    df = pandas.read_csv(input)

    # work on a copy of df with all its rows, so that renaming the columns below leaves df unchanged
    dff = df.copy()

    # OUTPUT FILE in append mode (add information without overriding the file)
    OUTFILE = open(output, 'a')

    # names each columns in dff (the updated version of df)
    dff_columns = ['ID', 'Morgan fingerprint or MACCSKeys']
    for i in range(COL_COUNT1,COL_COUNT2):
        dff_columns.append(str(i))
    dff.columns = dff_columns

    # resets the index in dff
    dff = dff.reset_index(drop=True)

    # dff2 is dff without the 'ID' and 'Morgan fingerprint or MACCSKeys' columns, i.e. only the fingerprint bits
    dff2 = dff.drop(['ID', 'Morgan fingerprint or MACCSKeys'], axis=1)
    dff2_columns = len(dff2.columns)

    # reindex dff2
    dff2 = dff2.reset_index(drop=True)

    # creates a list with all the combinations of compounds selecting all the rows from column 0
    # no combination pair is repeated
    cc = list(combinations(dff.iloc[:, 0], 2))


    for i in cc:

        # selecting first compound
        comp1 = i[0]

        # selecting second compond
        comp2 = i[1]

        # index of the row that holds comp1
        x = dff[dff['ID'] == comp1].index.values

        # index of the row that holds comp2
        y = dff[dff['ID'] == comp2].index.values

        # SMILES of the first compound of the pair, read from its own row in dff
        # (an exact match on 'ID', rather than a substring search over the raw lines)
        Morgan = dff.loc[x[0], 'Morgan fingerprint or MACCSKeys']

        # dff3 holds the fingerprint rows of the two compounds (all columns of dff2)
        dff3 = dff2.iloc[[int(x[0]), int(y[0])], :]
        lineaver = []
        lineaver = []

        # iterate over the fingerprint columns (indices 0 to dff2_columns - 1)
        for j in range(0, dff2_columns):

            # select all the cells with numerical values of the first compound in row '0' in dff3
            value1 = dff3.iloc[0, j]

            # selecting all the cells with numerical values of the second compound in row '1' in dff3
            value2 = dff3.iloc[1, j]

            # creating the variable to create the average of all the columns from row '0' and '1'
            Total = (int(value1) + int(value2)) / 2

            # perform rounding for whole numbers such as '0' and '1' only; not on '0.5'
            if Total == 0.0:
                Total = 0
            if Total == 1.0:
                Total = 1

            # make the result of Total into string
            Total = str(Total)

            # append Total into lineaver
            lineaver.append(Total)

        # join the averages of this pair into one comma-separated string
        lineaver = ', '.join(map(str, lineaver))

        # build the output row: pair ID, SMILES of the first compound, then the averages
        lineOutput = '(' + comp1 + '+' + comp2 + ')' + ',' + str(Morgan) \
            + ',' + str(lineaver) + '\n'

        # write in output file by changing the type of lineOutput to string
        OUTFILE.write(str(lineOutput))

    # closing output file
    OUTFILE.close()

In [74]:
#Function to release output from input through function dataModel
def main():

    ##TRAINING SET (Morgan fingerprint 2048) (WITH 56 COMPOUNDS)
    dataModel(inputTraining_S2, outputTraining_S2)

    ##VALIDATION SET (SYNERGISM) (WITH 23 COMPOUNDS)
    # disabled: inputValidation_S2 and outputValidation_S2 are commented out above
    #dataModel(inputValidation_S2, outputValidation_S2)

#Executes all code in main
if __name__ == '__main__':
    main()

# Script 3 

Adds the synergy value to each combination to create the files processed in ML.
Input: the output of Script 2 and an external file with the synergy between drug 1 and drug 2.
Output: the features (fingerprints) previously created for the training and validation sets, merged with the existing labels from the external file.
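
Because a pair may be listed as (A, B) in one file and (B, A) in the other, the label lookup has to ignore order. One way to sketch that, assuming a simplified in-memory list of records instead of the external CSV, is to index the labels by a `frozenset` of the two names. The helper names and `synergy_records` values below are hypothetical, and the sketch assumes compound names contain no `+` character.

```python
def build_synergy_lookup(records):
    """Index synergy labels (Yes -> 1, anything else -> 0) by unordered pair."""
    lookup = {}
    for comp1, comp2, label in records:
        lookup[frozenset((comp1, comp2))] = 1 if label == "Yes" else 0
    return lookup


def parse_pair_id(pair_id):
    """Split an ID of the form '(Drug1+Drug2)' into its two compound names."""
    comp1, comp2 = pair_id[1:-1].split("+")
    return comp1, comp2


def label_for(pair_id, lookup):
    """Return the label for a pair ID, or None when the pair is not listed."""
    return lookup.get(frozenset(parse_pair_id(pair_id)))


# Hypothetical synergy records (illustrative values only).
synergy_records = [("DrugA", "DrugB", "Yes"), ("DrugC", "DrugA", "No")]
lookup = build_synergy_lookup(synergy_records)
pair_ids = ["(DrugA+DrugB)", "(DrugA+DrugC)", "(DrugB+DrugC)"]
labels = {pid: label_for(pid, lookup) for pid in pair_ids}
print(labels)
```

Pairs missing from the synergy records come back as `None`, which corresponds to the "not found" counter in the function below.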

In [77]:


######################################################################
#     DEFINE SECTION
#   DEFINE FILES NAME AND NUMBER OF COLUMNS IF APPLY

#INPUT FILES

inputTraining_S3 = outputTraining_S2 
#inputValidation_S3= outputValidation_S2

INPUT_SYNERGY_TRAINING = 'COMBINATIONS_SYNERGY_TRAINING_3D7.csv'
#INPUT_SYNERGY_VALIDATION = 'COMBINATIONS_SYNERGY_VALIDATION_3D7.csv'


#OUTPUT FILES

OUTPUT_TRAINING_S3 = 'OUTS3.CSV'
#OUTPUT_VALIDATION_S3 = 'VALIDATION_SET_ML_2048_Morgan_OUTS3.CSV'


#Numbering of the fingerprint columns (1 to columnas), used to build the header
COL_COUNT1 = 1
COL_COUNT2 = columnas + 1



##########################################################################
######################    BEGIN FUNCTIONS     #############################
##########################################################################

##########################################################################
##############           PRINT HEADER FUNCTION              ##############
##########################################################################

def header(output):

    OUTFILE = open(output, 'w')
    #write header in file
    header_columns = ['SYNERGY', 'ID', 'Morgan fingerprint or MACCSkeys']

    for i in range(COL_COUNT1,COL_COUNT2):
        header_columns.append(str(i))
    header_columns = (', '.join(map(str,header_columns)))
    lineOutput = str(header_columns) +  "\n"

    #print((lineOutput))
    OUTFILE.write(lineOutput)
    OUTFILE.close()
    print("Output File header created")



##########################################################################



##########################################################################
#### FUNCTION TO JOIN SYNERGY VALUE TO COMBINATION FILE #####
##########################################################################

def ADD_SYNERGY(input1, input2, output):

    #OPEN OUTPUT FILE AGAIN AFTER PRINT HEADER THIS TIME AS APPEND TO ADD NEW LINES
    #WITH COMBINATIONS AND SYNERGY
    OUTFILE = open(output, 'a')

    #read the synergy file into a dataframe - it records whether each combination is synergistic
    df = pandas.read_csv(input1)
    #Convert Yes and No to 1 and 0 for ML (assigned back rather than a chained inplace replace)
    df['SYNERGY(Yes/No)'] = df['SYNERGY(Yes/No)'].replace({'Yes': '1', 'No': '0'})
    df = df.fillna(0)
    #print(df)
    
    
    # Open combinations file with the other information
    with open(input2, 'r') as file_handle:
        # read file content into list
        lines = file_handle.readlines()
         

    #counter of lines (use as counter to skip headers of input2 file)
    lread = 0
    lines_printed = 0
    not_found = 0

    for line in lines :
        lread = lread + 1
        
        if lread > 1  : #
            line = line.split(",")  #split line
            #print(line)
            #extract the compounds to process
            comp1, comp2 = line[0].split('+')
            comp1 = comp1[1:]
            comp2 = comp2[:-1]
            #print(comp1 + "/ " + comp2)
            
            #locate the row in the synergy df that holds the two compounds
            df2 = df[(df['COMPOUND_1']  == comp1) & (df['COMPOUND_2']  == comp2)]
            
            #if not found, try the pair in the reverse order
            if df2.empty:
                df2 = df[(df['COMPOUND_1']  == comp2) & (df['COMPOUND_2']  == comp1)]
                        
            if not df2.empty :  
                #print(df_status)
                synergy = df2.iloc[0].at['SYNERGY(Yes/No)'] #extract synergy value               
            
                #convert line for print
                line = (','.join(map(str,line)))

                #prepare line to print or write in file
                lineOutput = str(synergy) + ","  + str(line)
                OUTFILE.write(lineOutput)
                
                lines_printed = lines_printed + 1
                #print(comp1 + " / " + comp2 + "-- found wrote in file  ****** \n")
            else:
                #print(comp1 + " / " + comp2 + " -- not found and not wrote in file \n")
                not_found = not_found + 1

    # report the counts, then close the output file
    print("Total of combinations not found = " + str(not_found))
    print("Total of lines wrote in file without header  = " + str(lines_printed))
    print("Total lines read = " + str(lread))
    OUTFILE.close()


##########################################################################
######      function to create file to join synergy file and   ###########
######      combination file and use for ML File               ###########
##########################################################################

def add_synergy_value():


    
    ##TRAINING SET (SYNERGISM 3D7) (WITH 56 COMPOUNDS)
    print("Working with Training file")
    header(OUTPUT_TRAINING_S3)
    print("Processing Training File")
    ADD_SYNERGY(INPUT_SYNERGY_TRAINING, inputTraining_S3, OUTPUT_TRAINING_S3 )
    print("##################################" )
    print("End of processing training files \n" )
    print("##################################" )
    
    
    ##VALIDATION SET (MORGAN2048 3D7) (WITH 23 COMPOUNDS)
    #print("Working with Validation file")
    #header(OUTPUT_VALIDATION_S3)
    #print("Processing Validation File")
    #ADD_SYNERGY(INPUT_SYNERGY_VALIDATION, inputValidation_S3, OUTPUT_VALIDATION_S3 )
    #print("###################################" )
    #print("End of processing Validations files" )
    #print("################################### \n" )

    
#########################
##############                 BEGIN                        ##############  
##########################################################################

#Executes all code

add_synergy_value()

Working with Training file
Output File header created
Processing Training File
Total of combinations not found = 749
Total of lines wrote in file without header  = 71
Total lines read = 821
##################################
End of processing training files 

##################################


# Script 4
This module cleans the raw training and validation datasets. First, the features (file columns) that have the same value for all drug combinations are deleted from the training set. The same features (columns) are then deleted from the validation set. The output files contain the data without these uninformative features (columns). 
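
The cleaning step can be illustrated on a toy example. The sketch below assumes two small made-up dataframes (`train` and `valid`) and, unlike the cell that follows, restricts the constant-value check to the feature columns so that the `SYNERGY` label can never be dropped; that restriction is a design choice of this example, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only).
train = pd.DataFrame({
    "SYNERGY": [1, 0, 1],
    "1": [0.0, 0.0, 0.0],   # constant in training -> dropped
    "2": [0.5, 1.0, 0.0],
    "3": [1.0, 1.0, 1.0],   # constant in training -> dropped
})
valid = pd.DataFrame({
    "SYNERGY": [0, 1],
    "1": [0.5, 0.0],
    "2": [1.0, 1.0],
    "3": [1.0, 0.5],
})

# Find feature columns with a single distinct value in the training set.
feature_cols = [c for c in train.columns if c != "SYNERGY"]
constant_cols = [c for c in feature_cols if train[c].nunique() == 1]

# Drop those columns from both sets so that they keep identical features.
train_clean = train.drop(columns=constant_cols)
valid_clean = valid.drop(columns=constant_cols)

print(constant_cols)
print(list(train_clean.columns), list(valid_clean.columns))
```

The columns to delete are chosen from the training set only, so a column that happens to vary in the validation set is still removed there.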

In [78]:
import pandas as pd
#define input and output files

# INPUT FILES

INPUT_TRAINING_S4 =  OUTPUT_TRAINING_S3 
#INPUT_VALIDATION_S4 = OUTPUT_VALIDATION_S3


# OUTPUT  FILES

OUTFILE_TRAINING_S4 = 'UPDATED_OUTS4.CSV'
OUTFILE_VALIDATION_S4 = 'VALIDATION_SET_ML_2048_Morgan_UPDATED_OUTS4.CSV' 



In [79]:
# PROCESS TRAINING FILE

#establish a dataframe with the input file
df2=pd.read_csv(INPUT_TRAINING_S4)

print(df2.shape)
#count the distinct values per column; a column with a single value carries no information
nunique=df2.apply(pd.Series.nunique)
cols_to_del=nunique[nunique==1].index
print(cols_to_del)
#drop the constant columns (drop returns a new dataframe)
df3=df2.drop(cols_to_del,axis=1)
print(df3.shape)

df3.to_csv(OUTFILE_TRAINING_S4, index=False,encoding='utf8')

# PROCESS VALIDATION FILE
# disabled: INPUT_VALIDATION_S4 is commented out above; define it before enabling this block

#establish a dataframe with the input file
#df2=pd.read_csv(INPUT_VALIDATION_S4)

#drop the same columns that were removed from the training set
#print(cols_to_del)
#df3=df2.drop(cols_to_del,axis=1)
#print(df3.shape)

#df3.to_csv(OUTFILE_VALIDATION_S4, index=False,encoding='utf8')

(71, 2051)
Index([' 1', ' 3', ' 4', ' 9', ' 10', ' 12', ' 16', ' 17', ' 18', ' 19',
       ...
       ' 2032', ' 2033', ' 2037', ' 2038', ' 2040', ' 2041', ' 2042', ' 2044',
       ' 2046', ' 2047'],
      dtype='object', length=1199)
(71, 852)