# Piggybacking Experiment

This notebook implements the hit-to-lead optimization algorithm ("piggybacking") for lead rediscovery using the `fragment_lead_pairs.csv` data as a reference.

### Some miscellaneous notes:
* In this notebook, `frag_idx` (called `frag_lead_idx` in `piggyback_pt2`) is the 1-indexed (not 0-indexed) position of a fragment-lead pair in `fragment_lead_pairs.csv`. The CSV has a `Table_Entry` column that enumerates the pairs, but it restarts at 1 each year, so it cannot serve as a unique index. For instance, the row `2021,1,COc1ccc(cc1)C2CC(=O)NCCS2,COc1ccc(NC(=O)C[C@@H]2SCCNC2=O)cc1` has `frag_idx = 16` in this notebook because it is the 16th pair from the top of the table.
* For REINVENT4, we currently prompt each of the 6 priors (all except pubchem) to generate 20 molecules, then combine the outputs. To change this, edit the `num_smiles` value (line 35) in the TOML files in the `mol2mol_prior_tomls` folder.
* `pifp_counts` (`pifp_count` in the `metrics` array) is a bit of a misnomer: it is not the number of protein interactions, but the Tanimoto similarity between the analog's protein interaction fingerprint and that of the lead.
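
As a minimal sketch of the 1-indexed convention in the first note, the snippet below maps a `frag_idx` to a row position. The rows are the first five (Year, Table_Entry, Fragment) entries as displayed by `fragment_lead_pairs.head()` later in this notebook; the helper name `row_for_frag_idx` is hypothetical and not part of the notebook.

```python
# First five (Year, Table_Entry, Fragment) rows, as shown by fragment_lead_pairs.head().
rows = [
    (2022, 1, "Nc1cc(c[nH]c1=O)C(F)(F)F"),
    (2022, 2, "CN1C[C@@H](O)[C@H](C1=O)c2ccc(C)cc2"),
    (2022, 3, "Fc1cncc(c1)N2C(=O)N[C@@H](Cc3ccccc3)C2=O"),
    (2022, 4, "c1ccc(cc1)c2ccccc2c3nnn[nH]3"),
    (2022, 5, "CN(C)C(=O)C(N)Cc1ccc(F)cc1"),
]

def row_for_frag_idx(frag_idx, table=rows):
    """Return the row for a 1-indexed frag_idx (hypothetical helper)."""
    if not 1 <= frag_idx <= len(table):
        raise IndexError(f"frag_idx must be between 1 and {len(table)}")
    return table[frag_idx - 1]  # convert 1-indexed to 0-indexed

year, table_entry, fragment = row_for_frag_idx(3)
print(year, table_entry, fragment)
```

For these five rows `frag_idx` and `Table_Entry` happen to coincide because they all belong to the same year; the two diverge once the year changes.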

## Imports and setup
These are the basic imports and packages used in this notebook. In the first code block, set `OE_LICENSE` to the path of your OpenEye license file.

In [2]:
import subprocess
import os
import time
print('Current conda environment:', os.environ['CONDA_DEFAULT_ENV'])

os.environ["PATH"] += ":/usr/local/openeye/bin"
# set this path to the correct path
os.environ["OE_LICENSE"] = "/home/fts_g_ucla_edu/Projects/oe_license.txt"
os.environ['TOKENIZERS_PARALLELISM'] = "false"

cwd = os.getcwd()
print(cwd)

import warnings
warnings.filterwarnings('ignore')

import random
random.seed(42)

from datetime import datetime
import pytz

import shutil

Current conda environment: reinvent
/home/fts_g_ucla_edu/Projects/rips-relay_copy/experiments


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import mols2grid
import useful_rdkit_utils as uru
from rdkit import Chem
from rdkit.Chem import PandasTools
import MDAnalysis as mda
import prolif as plf
from rdkit.Chem.rdFMCS import FindMCS

import joblib
import tensorflow as tf

Failed to find the pandas get_adjustment() function to patch
Failed to patch pandas - PandasTools will have limited functionality
2024-08-21 16:27:13.475364: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-21 16:27:13.498735: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-21 16:27:13.505701: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-21 16:27:13.522896: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, 

In [4]:
from crem.crem import grow_mol, mutate_mol
crem_db = '../crem_db/crem_db2.5.db'

import mols2grid

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, rdFingerprintGenerator, CanonSmiles, Draw, MolFromSmiles, PandasTools
from rdkit.Chem.rdmolops import RDKFingerprint
from rdkit.Chem.Draw import MolsToGridImage
from rdkit.Chem.rdFMCS import FindMCS
from rdkit.DataStructs.cDataStructs import BulkTanimotoSimilarity
import useful_rdkit_utils as uru

import prolif as plf

import safe as sf
import datamol as dm

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc = {'figure.figsize':(15,8)})

from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, classification_report
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import torch
import itertools
from coati.generative.coati_purifications import embed_smiles
from coati.models.io.coati import load_e3gnn_smiles_clip_e2e
from coati.models.simple_coati2.io import load_coati2

from molscore import MolScore

## Reading in data and defining some useful reference constants
Here we read in the `data/fragment_lead_pairs.csv` data and the csv that converts the index of the fragment (in the fragment lead pairs csv) to the name of the pdb file (`data/pdb_name_map.csv`). We also set up some basic constants.

In [5]:
fragment_lead_pairs = pd.read_csv('data/fragment_lead_pairs.csv')

fragment_lead_pairs.head()

Unnamed: 0,Year,Table_Entry,Fragment,Lead
0,2022,1,Nc1cc(c[nH]c1=O)C(F)(F)F,N[C@H]1CCN(Cc2cccc(c2)c3ccc4c(=O)[nH]ccc4c3)C1
1,2022,2,CN1C[C@@H](O)[C@H](C1=O)c2ccc(C)cc2,COc1ccc(CN2C[C@H](O)[C@](CCC(C)C)(C2=O)c3ccc(c...
2,2022,3,Fc1cncc(c1)N2C(=O)N[C@@H](Cc3ccccc3)C2=O,Clc1ccccc1C2CC3(C2)NC(=O)N(C3=O)c4cncc5ccccc45
3,2022,4,c1ccc(cc1)c2ccccc2c3nnn[nH]3,Cc1ccc(cc1)c2cccc(c2c3nnn[nH]3)S(=O)(=O)N
4,2022,5,CN(C)C(=O)C(N)Cc1ccc(F)cc1,Clc1ccc(cc1)[C@H]2CN[C@H](C2)C(=O)N3CCN(CC3)c4...


In [6]:
# define global constants
models = ['reinvent', 'crem', 'safe', 'coati']
metrics = ['tanimoto_to_lead', 'docking_score', 'pifp_count', 'NNpredict_tani_to_lead']
pdb_names = pd.read_csv("data/pdb_name_map.csv")
mol2mol_priors = ['sim', 'medsim', 'highsim', 'scaffold', 'genscaffold', 'mmp']

## Helper functions
This section contains helper functions used when computing the metrics.

In [7]:
def remove_repeats(df, inchi_names):
    """
    Removes rows with repeated 'inchi' and 'Model' values.

    Parameters:
    df (pd.DataFrame): Input DataFrame with columns 'inchi' and 'Model'
    inchi_names (list): List of inchi names to check for repeats

    Returns:
    pd.DataFrame: DataFrame with repeats removed
    """

    # # TODO: remove this block of code
    # df.to_csv('data/garbage/remove_repeats_df.csv')
    # if 'Model' not in df.columns:
    #     raise KeyError("The 'Model' column is missing from the DataFrame!")

    # Iterate over each inchi name in inchi_names
    for inchi_name in inchi_names:
        # Get the subset of the DataFrame for the current inchi_name
        df_inchi = df[df['inchi'] == inchi_name]
        
        duplicates = df_inchi[df_inchi.duplicated(subset=['inchi'], keep='first')]
        # TODO: replace the above with the below (below is the original, above is a possible fix)
        # # Find duplicates based on 'inchi' and 'Model' columns
        # duplicates = df_inchi[df_inchi.duplicated(subset=['inchi', 'Model'], keep='first')]
        
        # Drop the duplicate rows from the original DataFrame
        df = df.drop(duplicates.index)
    
    return df

In [8]:
def concatenate_sdf_files(output_sdf, lead_sdf, analogs_sdf):
    # helper for pifp_distance_to_lead: writes the lead SDF followed by the analog SDF
    with open(output_sdf, 'w') as outfile:
        with open(lead_sdf, 'r') as infile:
            infile.readline()  # discard the lead's original title line
            outfile.write("MOL9999\n")  # Replace first line with "MOL9999"
            for line in infile:
                outfile.write(line)
        with open(analogs_sdf, 'r') as infile:
            for line in infile:
                outfile.write(line)


In [9]:
def general_tanimoto_similarity(row1, row2):
    # Compute the bitwise AND and OR
    intersection = np.sum(np.logical_and(row1, row2))
    union = np.sum(np.logical_or(row1, row2))
    # Calculate Tanimoto similarity
    return intersection / union
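
For reference, the same set-based Tanimoto calculation can be sketched in plain Python on two short bit vectors. The example vectors are made up for illustration, and returning 0.0 for an empty union is an assumption of this sketch (the function above does not handle that case explicitly).

```python
def tanimoto_bits(a, b):
    """Tanimoto similarity of two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero vectors are treated as having similarity 0.0.
    return intersection / union if union else 0.0

lead = [1, 1, 0, 1, 0]
analog = [1, 0, 0, 1, 1]
print(tanimoto_bits(lead, analog))  # 2 shared bits / 4 bits set in either = 0.5
```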

In [10]:
def clear_directory(directory_path):
    # Check if the directory exists
    if os.path.exists(directory_path):
        # Iterate over all the files and subdirectories in the directory
        for filename in os.listdir(directory_path):
            file_path = os.path.join(directory_path, filename)
            try:
                # Remove files
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)
                # Remove directories
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)
            except Exception as e:
                print(f'Failed to delete {file_path}. Reason: {e}')
    else:
        print(f'Directory {directory_path} does not exist')


In [11]:
def wait_for_file(directory, filename):
    file_path = os.path.join(directory, filename)
    print(f"Please upload the file '{filename}' to the folder '{directory}'.")

    # Pause the script until the file is found
    while not os.path.isfile(file_path):
        print(f"Waiting for '{filename}' to be uploaded...")
        time.sleep(5)  # Wait for 5 seconds before checking again

    print(f"File '{filename}' found.")
    return file_path
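
`wait_for_file` polls indefinitely until the file appears. A variant with a timeout might look like the sketch below; the function name and the `timeout` and `poll_interval` parameters are assumptions for illustration, not part of the notebook.

```python
import os
import tempfile
import time

def wait_for_file_with_timeout(directory, filename, timeout=60.0, poll_interval=5.0):
    """Poll for a file; return its path, or None if the timeout elapses first."""
    file_path = os.path.join(directory, filename)
    deadline = time.monotonic() + timeout
    while not os.path.isfile(file_path):
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
    return file_path

# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "ready.txt"), "w").close()
    found = wait_for_file_with_timeout(tmp, "ready.txt", timeout=1.0, poll_interval=0.1)
    missing = wait_for_file_with_timeout(tmp, "absent.txt", timeout=0.3, poll_interval=0.1)
```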

## Metric functions
The functions in this section define the metrics that we use to evaluate the quality of the molecules. Some of these functions depend on external files/directories (beyond the ones passed in as parameters). They are listed below:

pifp_distance_to_lead:
* `data/docking/pairs`
* `placeholder_concat.sdf`

NNpredict_tani_to_lead:
* `data/molscore_dummy`
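
Because these paths are used implicitly, it can help to check for the directories before starting a long run. The sketch below assumes the notebook is run from a directory where the relative paths resolve; `missing_paths` is a hypothetical helper, and whether every directory must exist beforehand depends on your setup. `placeholder_concat.sdf` is omitted because `pifp_distance_to_lead` writes it itself.

```python
import os

REQUIRED_DIRS = ["data/docking/pairs", "data/molscore_dummy"]

def missing_paths(paths, base_dir="."):
    """Return the entries of `paths` that are not existing directories under base_dir."""
    return [p for p in paths if not os.path.isdir(os.path.join(base_dir, p))]

missing = missing_paths(REQUIRED_DIRS)
if missing:
    print("Missing directories:", missing)
```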

In [12]:
def tanimoto_similarity(smi_1, smi_2, use_counts=True):
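    # Note this refers to molecular tanimoto similarity
    # NOTE: fpgen is created here but never used; the module-level GetCountFPs/GetFPs
    # helpers below build Morgan fingerprints with their own default settings.
    fpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2,fpSize=2048,countSimulation=True)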
    mol_1 = Chem.MolFromSmiles(smi_1)
    mol_2 = Chem.MolFromSmiles(smi_2)
    if use_counts:
        fp_1 = rdFingerprintGenerator.GetCountFPs([mol_1])[0]
        fp_2 = rdFingerprintGenerator.GetCountFPs([mol_2])[0]
    else:
        fp_1 = rdFingerprintGenerator.GetFPs([mol_1])[0]
        fp_2 = rdFingerprintGenerator.GetFPs([mol_2])[0]
    return DataStructs.TanimotoSimilarity(fp_1, fp_2)

In [13]:
def pifp_distance_to_lead(path_to_analog_sdf, frag_idx, pdb_type):

    # takes in path to analog sdfs and returns a list of pifp tani similarities

    PDB_FILEPATH = f"data/docking/pairs/{frag_idx}_{pdb_type}.pdb"

    fp = plf.Fingerprint()

    mol = Chem.MolFromPDBFile(PDB_FILEPATH, removeHs=False)
    prot = plf.Molecule(mol)

    # concat analog sdf with lead sdf
    path_to_lead_sdf = f"data/docking/pairs/{frag_idx}_lead_smi_to_{pdb_type}_pdb.sdf"
    concat_sdf_path = f"placeholder_concat.sdf"
    concatenate_sdf_files(output_sdf=concat_sdf_path, lead_sdf=path_to_lead_sdf, analogs_sdf=path_to_analog_sdf)

    # get pifp
    suppl = plf.sdf_supplier(concat_sdf_path)
    fp.run_from_iterable(suppl,prot,progress=True)
    df = fp.to_dataframe()

    numpy_df = (df.astype('int')).to_numpy()

    num_rows = numpy_df.shape[0]
    first_row = numpy_df[0]
    similarities = []
    
    for i in range(1, num_rows):
        similarity = general_tanimoto_similarity(first_row, numpy_df[i])
        similarities.append(similarity)
    
    return similarities

In [14]:
def docking_score(path_to_analog_smi, output_oeb_path, path_to_oedu, output_sdf_path, saved_sdf_path=None):
    
    # run through omega
    command = f"/usr/local/openeye/bin/omega2 -in {path_to_analog_smi} -out {output_oeb_path} -strictstereo false"
    os.system(command)

    # run through hybrid
    command = f"/usr/local/openeye/bin/hybrid -receptor {path_to_oedu} -dbase {output_oeb_path} -out {output_sdf_path}"
    os.system(command)
    
    # read in output sdf from hybrid
    docked_df = PandasTools.LoadSDF(output_sdf_path)

    # save docked_df as csv if desired
    if saved_sdf_path is not None:
        docked_df.to_csv(saved_sdf_path, index=False)

    # return the docked molecules, including their docking scores
    return docked_df

In [15]:
def NNpredict_tani_to_lead(df, path_to_scaler, path_to_model, curr_iter):

    columns_to_keep = ['SMILES', 'HYBRID Chemgauss4 score', 'min_freq']
    df = df.loc[:, columns_to_keep]

    smiles = df['SMILES'].to_list()

    ms = MolScore(model_name='mol2mol', task_config='molscore/feature_selection.json')
    scores = ms.score(smiles)

    # Once finished
    metrics = ms.compute_metrics(
        endpoints=None, # Optional list: by default will use the running final score/reward value
        thresholds=None,  # Optional list: if specified will calculate the yield of molecules above that threshold 
        # chemistry_filters_basic=False,  # Optional, bool: Additionally re-calculate metrics after filtering out unreasonable chemistry
        budget=10000,  # Optional, int: Calculate metrics only with molecules within this budget
        n_jobs=1,  # Optional, int: Multiprocessing
        benchmark=None,  # Optional, str: Name of benchmark, this may specify additional metrics to compute
    )

    directory_path = 'data/molscore_dummy'  
    # List all entries in the directory
    entries = os.listdir(directory_path)
    # Filter out only the directories
    folders = [entry for entry in entries if os.path.isdir(os.path.join(directory_path, entry))]
    # Assuming there is only one folder, add its name to the path
    file_path = f"{directory_path}/{folders[0]}/iterations/000001_scores.csv"
    # Read in the molscore output
    df2 = pd.read_csv(file_path, index_col=0)

    # Clear out the directory
    clear_directory(directory_path=directory_path)
    
    # Rename columns
    column_names = {
    'desc_Bertz' : 'Synthetic Complexity',
    'interaction weight ratio' : 'Avg Interaction Strength',
    'Weighted IFP Similarity' : 'Weighted Interaction Similarity',
    'RAScore_pred_proba' : 'Synthetic Accessibility',
    'desc_NumHeteroatoms' : '# Heteroatoms',
    'desc_HeavyAtomMolWt': "Heavy Atom MolWt", 
    'desc_NumHAcceptors': '# HAcceptors', 
    'desc_NumHDonors':"#HDonors",
    'desc_NumRotatableBonds': '# Rotatable Bonds',
    'desc_NumAromaticRings': '# Aromatic Rings', 
    'desc_NumAliphaticRings': 'Number Aliphatic Rings', 
    'desc_RingCount': 'Ring Count',
    'desc_TPSA': 'TPSA', 
    'desc_FormalCharge': 'Formal Charge',
    'desc_CLogP': 'CLogP',
    'desc_MolWt': 'MolWt', 
    'desc_HeavyAtomCount': 'Heavy Atom Count',
    'desc_MaxConsecutiveRotatableBonds': 'Max Consecutive Rotatable Bonds',
    'tanimoto_Sim': 'Tanimoto Sim',
    'dice_Sim': 'Dice Sim',
    'desc_FlourineCount':'Fluorine Count',
    'desc_QED': 'QED',
    'smiles': 'SMILES'
    }
    df2.rename(columns=column_names, inplace=True)
    
    # Define the columns to keep (i.e. to use for inference)
    columns_to_keep = ['SMILES', 'QED', 'CLogP', 'MolWt', 'Heavy Atom Count', 'Heavy Atom MolWt', 
                       '# HAcceptors', '#HDonors', '# Heteroatoms', '# Rotatable Bonds', 
                       '# Aromatic Rings', 'Number Aliphatic Rings', 'Ring Count', 'TPSA', 
                       'Formal Charge', 'Synthetic Complexity', 'Max Consecutive Rotatable Bonds', 
                       'Fluorine Count', 'Synthetic Accessibility']
    df2 = df2.loc[:, columns_to_keep]

    # Combine molscore data with existing data
    data = pd.merge(df2, df, on='SMILES')
    data = data.drop('SMILES', axis=1)
    data.drop_duplicates(inplace=True)

    # Load the scaler
    scaler = joblib.load(path_to_scaler)
    # Load the trained model
    model = tf.keras.models.load_model(path_to_model)

    # Apply the same scaling to the new data
    new_features = data.values
    new_features_scaled = scaler.transform(new_features)

    # Make predictions
    predictions = model.predict(new_features_scaled)

    return predictions['predicted']

## Piggybacking function definition
This section contains code for the actual hit-to-lead optimization algorithm.

In [16]:
# Defines the score weighting/aggregation function.
# Takes a weight array, a metric type (an index into the metrics array), and a raw score, and returns the weighted, sign-adjusted score.
# This is a very rudimentary aggregation scheme.
def weighted_score_contribution(weights, metric_type, score):
    # molecular tanimoto similarity to lead
    if metric_type == 0:
        # higher is already better so just return itself
        return float(weights[metric_type] * score)
    # docking score
    elif metric_type == 1:
        # makes it so that higher (more positive) is better
        return float(weights[metric_type] * (-1) * score)
    # pifp counts
    elif metric_type == 2:
        # higher is already better so just return itself
        return float(weights[metric_type] * score)
    # neural net predicted score
    elif metric_type == 3:
        # NN should be predicting tanimoto sim to lead so just return itself
        return float(weights[metric_type] * score)
    # nonsense
    else:
        raise ValueError('[weighted_score_contribution]: invalid value for metric_type.')

A note about the piggybacking parameters:
* `frag_lead_idx`: index of the fragment lead pair you would like to run the experiment on. Note that we may not have pdb/oedu files for all of the pairs (check `data/pdb_name_map.csv` to see if we were able to find the pdb files and generate oedu files).
* `pdb_type`: in the published table of fragment lead pairs, there are two types of pdb files: those associated with the fragment, and those associated with the lead. Use `pdb_type` to specify if you are using the fragment pdb (`pdb_type = 'frag'`) or the lead pdb (`pdb_type = 'lead'`)
* `model_idx`: index of the generative model to use (see the `models` array defined in the "Reading in data..." section). We have only implemented functionality for REINVENT4 and CReM.
* `weights`: weights for the four metrics, in the order of the `metrics` array defined in the "Reading in data..." section. The weights should add up to 1; the final score for each molecule is the weighted sum of its metric values, with the docking score negated so that higher is better.
* `k`: choose the top-`k` analogs at each iteration.
* `max_iterations`: maximum number of iterations to run before quitting.
* `threshold`: defines how close we need to get (in terms of molecular Tanimoto similarity to the lead) before considering an analog "successful". Set to `1.0` by default (complete Tanimoto similarity).
* `scaler_path`: [optional] path to scaler (for the neural net).
* `NNmodel_path`: [optional] path to neural net model.
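
To make the role of `weights` concrete, here is a small worked example of the aggregation performed by `weighted_score_contribution`, written as a standalone sketch. The metric values are made up for illustration, and `final_score` is a hypothetical helper; the notebook accumulates the same sum column by column.

```python
METRICS = ['tanimoto_to_lead', 'docking_score', 'pifp_count', 'NNpredict_tani_to_lead']

def final_score(weights, scores):
    """Weighted sum of metric values. The docking score (index 1) is negated
    so that a more negative docking score gives a higher final score."""
    total = 0.0
    for i, (w, s) in enumerate(zip(weights, scores)):
        if w == 0:
            continue  # zero-weight metrics are skipped, as in the notebook
        total += w * (-s if i == 1 else s)
    return total

weights = [0.5, 0.25, 0.25, 0.0]
scores = [0.4, -8.0, 0.6, 0.0]  # illustrative values only
print(final_score(weights, scores))  # 0.5*0.4 + 0.25*8.0 + 0.25*0.6, about 2.35
```

In this example the docking term (2.0) outweighs the two similarity terms combined, because docking scores are not confined to the 0 to 1 range the way Tanimoto similarities are.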

About the neural net: the neural net model is very particular about the number of features. If you encounter errors with it, comment out the neural-net code in the `piggyback_pt2` function where indicated in step 6.

Output: logs and output csvs are automatically saved to `data/piggybacking_output`

In [17]:
# THIS ASSUMES THE OEDU FILE FOR THE PROTEIN HAS ALREADY BEEN GENERATED AND STORED IN THE PROPER DIRECTORY

def piggyback_pt2(frag_lead_idx, pdb_type, model_idx, weights, k=10, max_iterations=15, threshold=1.0, scaler_path=None, NNmodel_path=None):



    ##### step 1: frontmatter #####

    frag_smi = fragment_lead_pairs['Fragment'][frag_lead_idx-1]
    lead_smi = fragment_lead_pairs['Lead'][frag_lead_idx-1]



    ##### step 2: write parameters to log file #####

    # Get the current date and time in Pacific Time
    pacific_time = pytz.timezone('America/Los_Angeles')
    now = datetime.now(pacific_time)
    folder_name = now.strftime("%Y-%m-%d_%H-%M-%S")
    new_folder_path = f"data/piggybacking_output/{folder_name}"

    # Create the new directory
    os.makedirs(new_folder_path, exist_ok=True)

    # Create and write to log file
    log_file_path = f"{new_folder_path}/log.txt"
    with open(log_file_path, 'w') as log_file:
        log_file.write(f"frag_lead_idx: {frag_lead_idx} ({frag_smi})\n")
        row = pdb_names[pdb_names['fragment_index'] == frag_lead_idx]
        log_file.write(f"pdb_type: {pdb_type} ({row[f'{pdb_type}_pdb_code'].values[0]})\n")
        log_file.write(f"model_idx: {model_idx} ({models[model_idx]})\n")
        log_file.write(f"metric weights: {weights} ({metrics})\n")
        log_file.write(f"k: {k}\nmax_iterations: {max_iterations}\nthreshold: {threshold}\n")
        log_file.write(f"scaler: {scaler_path}\nNNmodel: {NNmodel_path}\n")



    #######################################
    ########## ENTERING THE LOOP ##########
    #######################################

    # WHILE ITERATIONS IS < MAX ITERATIONS:
    #   CREATE AN EMPTY DATAFRAME TO STORE RESULTS
    #   FOR EACH INPUT ANALOG IN inputs:
    #       RUN THROUGH THE STEPS AND CONCAT RESULTS TO DATA FRAME
    #   SAVE THE FULL RESULTS DATA FRAME TO iterationXXXXX CSV
    #   IF THERE ARE ANALOGS IN THE RESULTS DATA FRAME THAT HAVE TANI SIM TO LEAD >= THRESHOLD:
    #       WRITE THE ROWS TO winner CSV AND EXIT
    #   ELSE:
    #       CHOOSE TOP k ANALOGS IN TERMS OF FINAL SCORE
    #       WRITE THE ROWS TO top_k CSV
    #       WRITE THE SMILES COLUMN TO THE inputs ARRAY
    #       INCREMENT ITERATION INDEX



    curr_iter = 1
    input_analogs = [frag_smi]

    while curr_iter <= max_iterations:

        results = pd.DataFrame()

        for input_smiles in input_analogs:



            ##### step 3: generate analogs #####

            # overwrite mol2mol.smi with the input molecule
            with open("data/mol2mol.smi", 'w') as file:
                file.write(input_smiles)

            # start an empty dataframe to store output from models
            smi_df = pd.DataFrame()

            # if the model is reinvent:
            if model_idx == 0:
                # for each of the 6 (functional) priors, run reinvent and append the output to smi_df
                for pri in mol2mol_priors:
                    toml_filename = "sampling_" + pri + ".toml"
                    !reinvent mol2mol_prior_tomls/{toml_filename} --seed 42
                    df = pd.read_csv("sampling.csv")
                    df['prior'] = pri
                    smi_df = pd.concat([smi_df, df], ignore_index=True)
            # if the model is crem:
            elif model_idx == 1:
                db = '../crem_db/crem_db2.5.db'
                print(input_smiles)
                input_mol = Chem.MolFromSmiles(input_smiles)
                out_list = []
                grow_list = list(mutate_mol(input_mol, db_name=db,return_mol=False))
                for idx,analog in enumerate(grow_list):
                    out_list.append([idx,analog,input_smiles])
                smi_df = pd.DataFrame(out_list,columns=["Idx","SMILES","Input_SMILES"])
                print(len(smi_df))
            # if the model is safe:
            elif model_idx == 2:
                raise ValueError("[piggyback_pt2]: oops! safe hasn't been implemented yet")
            # if the model is coati:
            elif model_idx == 3:
                raise ValueError("[piggyback_pt2]: oops! coati hasn't been implemented yet")
            else:
                raise ValueError("[piggyback_pt2]: model_idx parameter is invalid.")


            ##### step 4: filter odd rings and prep smi_df for evaluation #####

            # add a column in the data frame so we know which input molecule it came from
            smi_df['predecessor'] = input_smiles # a little redundant, should just be Input_SMILES column

            # filter out odd rings
            ring_system_lookup = uru.RingSystemLookup.default()
            smi_df['ring_systems'] = smi_df.SMILES.apply(ring_system_lookup.process_smiles)
            smi_df[['min_ring','min_freq']] = smi_df.ring_systems.apply(uru.get_min_ring_frequency).to_list()
            smi_df = smi_df.query('min_freq > 100').copy()
            
            # remove duplicate values
            smi_df.drop_duplicates(inplace=True, ignore_index=True, subset=['SMILES'])

            # remove the initial fragment from the generated distribution
            if input_smiles in smi_df['SMILES'].values:
                smi_df = smi_df[smi_df['SMILES'] != input_smiles]

            # add mol id column for identification purposes
            smi_df['Name'] = [f"MOL{i:04d}" for i in range(0,len(smi_df))]
            smi_df[["SMILES","Name"]].to_csv("/home/fts_g_ucla_edu/Projects/rips-relay_copy/experiments/placeholder.smi",sep=" ",header=None, index=False)

            smi_df.round(3)



            ##### step 5: compute similarities to lead molecule #####

            similarities_to_lead = [tanimoto_similarity(analog, lead_smi, True) for analog in smi_df['SMILES'].values]
            smi_df['sim_to_lead'] = similarities_to_lead



            ##### step 6: evaluate metrics #####

            # add a column to smi_df that contains the final score
            smi_df['final_score'] = 0.0



            # We assume that we want to compute docking score - comment it out if you don't want it
            docked_df = docking_score(path_to_analog_smi="placeholder.smi", output_oeb_path="placeholder.oeb", path_to_oedu=f"data/docking/pairs/{frag_lead_idx}_{pdb_type}.oedu", output_sdf_path="placeholder.sdf")
            # merge docked_df, which has docking scores, with smi_df by mol ID
            smi_df = smi_df.merge(docked_df, left_on='Name', right_on='ID', how='left')
            # delete the 'Name' column, as it is now identical to the 'ID' column and redundant
            smi_df = smi_df.drop(columns=['Name'])
            # convert all docking scores from str to float
            smi_df['HYBRID Chemgauss4 score'] = smi_df['HYBRID Chemgauss4 score'].astype('float')


            ##### PIFP COUNTS (ASSUMES SDF FROM DOCKING IS IN PLACEHOLDER.SDF) #################
            
            pifp_sims = pifp_distance_to_lead(path_to_analog_sdf='placeholder.sdf', frag_idx=frag_lead_idx, pdb_type=pdb_type)

            # Identify indices with NaNs
            nan_indices = smi_df[smi_df.isna().any(axis=1)].index

            # Drop rows with NaNs from smi_df
            smi_df = smi_df.drop(nan_indices)

            # Drop corresponding values from pifp_sims
            pifp_sims = np.delete(pifp_sims, nan_indices)

            if len(pifp_sims) != len(smi_df):
                raise ValueError("[piggyback_pt2]: length of pifp_sims does not match length of smi_df")

            ###################################
            

            # # This section computes the NN model-predicted molecular Tanimoto similarity to the lead
            # #   comment it out if necessary
            
            # if scaler_path != None and NNmodel_path != None:

            #     # Add a column with mol object information for each molecule
            #     smi_df['ROMol']=[Chem.MolFromSmiles(x) for x in smi_df['SMILES'].values]
            #     smi_df['inchi'] = smi_df.ROMol.apply(Chem.MolToInchiKey)
            #     smi_df.drop_duplicates(subset=['inchi'])

            #     dummy = NNpredict_tani_to_lead(df=smi_df, path_to_scaler=scaler_path, path_to_model=NNmodel_path, curr_iter=curr_iter)
                
            #     smi_df['NNpredict_tani_to_lead'] = dummy



            ##### PIFP COUNTS CONTINUED #################
            
            smi_df['pifp_sim'] = pifp_sims

            ###################################



            for w in range(len(weights)): 
                if weights[w] == 0: # if no weight then we aren't considering it as a contribution so we skip it
                    continue
                elif w == 0: # tanimoto
                    # compute contribution and add to final_score
                    smi_df['final_score'] = smi_df['final_score'] + [weighted_score_contribution(weights, w, tani_sim) for tani_sim in smi_df['sim_to_lead']]
                elif w == 1: # docking score
                    # compute contribution and add to final_score
                    smi_df['final_score'] = smi_df['final_score'] + [weighted_score_contribution(weights, w, dock_score) for dock_score in smi_df['HYBRID Chemgauss4 score']]
                elif w == 2: #pifp_counts
                    # compute contribution and add to final_score
                    smi_df['final_score'] = smi_df['final_score'] + [weighted_score_contribution(weights, w, pifp_sim) for pifp_sim in smi_df['pifp_sim']]
                elif w == 3: # neural net predicted tani sim to lead
                    # compute contribution and add to final_score
                    smi_df['final_score'] = smi_df['final_score'] + [weighted_score_contribution(weights, w, NN_score) for NN_score in smi_df['NNpredict_tani_to_lead']]
                else:
                    raise ValueError("[piggyback_pt2]: invalid weight")
            


            ##### step 7: concat smi_df to the results dataframe

            results = pd.concat([results, smi_df], axis=0)



        ##### step 8: save full results to csv #####

        # remove duplicate values
        results.drop_duplicates(inplace=True, ignore_index=True, subset=['SMILES'])
        results.to_csv(f"{new_folder_path}/iter{curr_iter:05d}_full.csv")



        ##### step 9: check if there are analogs that have tani sim to lead >= threshold #####

        successful_rows = results[results['sim_to_lead'] >= threshold]

        # check if the number of successful rows is non-zero
        if successful_rows.shape[0] > 0:
            # write the successful rows to a new CSV file
            successful_rows.to_csv(f'{new_folder_path}/iter{curr_iter:05d}_winners.csv', index=False)
            # break out of loop
            break



        ##### step 10: choose top k analogs and repeat #####

        # Select the top-k rows based on the 'final_score' column
        top_k_rows = results.nlargest(k, 'final_score')

        # Write the top-k rows to a new CSV file
        top_k_rows.to_csv(f'{new_folder_path}/iter{curr_iter:05d}_top-k_rows.csv', index=False)

        # Extract the 'SMILES' values from the top-k rows to an array
        input_analogs = top_k_rows['SMILES'].values



        ##### step 11: increment iteration

        curr_iter += 1
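
The control flow of the loop above (generate, check against the threshold, keep the top `k`, repeat) can be sketched in isolation. This is a toy version on strings, not the notebook's implementation: `top_k_loop`, `generate`, and `similarity` are hypothetical stand-ins for the generative model, the scoring, and the Tanimoto similarity to the lead.

```python
import heapq

def top_k_loop(seed, generate, score, similarity, k=2, max_iterations=5, threshold=1.0):
    """Toy version of the loop: returns (iteration, winners), or (None, []) if
    no candidate reaches the threshold within max_iterations."""
    inputs = [seed]
    for iteration in range(1, max_iterations + 1):
        # Generate analogs from every input, dropping duplicates but keeping order.
        candidates = list(dict.fromkeys(
            analog for parent in inputs for analog in generate(parent)))
        # Stop as soon as any candidate is similar enough to the target.
        winners = [c for c in candidates if similarity(c) >= threshold]
        if winners:
            return iteration, winners
        # Otherwise carry the k best-scoring candidates into the next round.
        inputs = heapq.nlargest(k, candidates, key=score)
    return None, []

TARGET = "abc"  # stands in for the lead

def generate(parent):
    return [parent + ch for ch in "abc"]  # stands in for the generative model

def similarity(candidate):
    matches = sum(1 for x, y in zip(candidate, TARGET) if x == y)
    return matches / len(TARGET)

result = top_k_loop("", generate, score=similarity, similarity=similarity)
print(result)  # (3, ['abc'])
```

Starting from the empty string, the toy loop reaches the target on the third iteration, one character at a time, which mirrors how the fragment is grown toward the lead over successive rounds.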

## Plotting results
To plot the results, pass in the path to the folder where the piggybacking output data is saved (of the form `data/piggybacking_output/[folder_name]`), as well as a path prefix for the saved plots. The function appends `_final_scores_plot.png` and `_sim_to_lead_plot.png` to that prefix.

In [18]:
def plot_piggybacking(path_to_folder, save_path):
    # Read in the log file
    with open(os.path.join(path_to_folder, 'log.txt'), 'r') as file:
        log_data = file.read().splitlines()
    
    # Extract relevant information from the log file
    frag_lead_idx = log_data[0].split(': ')[1]
    pdb_type = log_data[1].split(': ')[1]
    model_idx = log_data[2].split(': ')[1]
    metric_weights = log_data[3].split(': ')[1]
    k = int(log_data[4].split(': ')[1])
    max_iterations = int(log_data[5].split(': ')[1])
    threshold = log_data[6].split(': ')[1]

    # Initialize arrays to store top-k scores and similarities
    scores_array = []
    sim_to_lead_array = []

    # Loop through each iteration
    for i in range(1, max_iterations + 1):

        # Construct the filename for top-k rows
        filename_top_k = f'iter{str(i).zfill(5)}_top-k_rows.csv'
        filepath_top_k = os.path.join(path_to_folder, filename_top_k)
        
        # Check if the top-k file exists
        if os.path.exists(filepath_top_k):
            # Read the CSV file
            df = pd.read_csv(filepath_top_k)
        else:
            # Construct the filename for winners
            filename_winners = f'iter{str(i).zfill(5)}_winners.csv'
            filepath_winners = os.path.join(path_to_folder, filename_winners)
            
            # Check if the winners file exists
            if os.path.exists(filepath_winners):
                # Read the CSV file
                df = pd.read_csv(filepath_winners)
                print(f"Threshold met. Ending iterations early at iteration {i}.")
                # Extract the 'final_score' and 'sim_to_lead' columns
                scores = df['final_score'].to_numpy()
                sim_to_lead = df['sim_to_lead'].to_numpy()
                # Store the scores and similarities in the arrays
                scores_array.append(scores)
                sim_to_lead_array.append(sim_to_lead)
                break
            else:
                print(f"File {filename_top_k} or {filename_winners} not found. Exiting loop early.")
                break
        
        # Extract the 'final_score' and 'sim_to_lead' columns
        scores = df['final_score'].to_numpy()
        sim_to_lead = df['sim_to_lead'].to_numpy()
        
        # Store the scores and similarities in the arrays
        scores_array.append(scores)
        sim_to_lead_array.append(sim_to_lead)
    
    # Convert the arrays to DataFrames for easier plotting
    scores_df = pd.DataFrame(scores_array).T
    sim_to_lead_df = pd.DataFrame(sim_to_lead_array).T
    
    # Create the plot for final scores
    plt.figure(figsize=(10, 8))
    sns.violinplot(data=scores_df, inner=None, color="skyblue")
    sns.stripplot(data=scores_df, jitter=True, size=2.5, color="blue", linewidth=0.5)
    
    # Label the plot
    plt.title(f'Frag Lead Index: {frag_lead_idx}\nPDB Type: {pdb_type}, Model Index: {model_idx}\nMetric Weights: {metric_weights}\nTop-{k} Final Scores over {len(scores_array)} Iterations')
    plt.xlabel('Iteration')
    plt.ylabel('Final Scores')
    
    # Save the final scores plot to the specified path
    final_scores_save_path = f"{save_path}_final_scores_plot.png"
    plt.savefig(final_scores_save_path)
    plt.close()

    # Create the plot for similarities to lead
    plt.figure(figsize=(10, 8))
    sns.violinplot(data=sim_to_lead_df, inner=None, color="lightgreen", alpha=0.5)
    sns.stripplot(data=sim_to_lead_df, jitter=True, size=2.5, color="green", linewidth=0.5)
    
    # Label the plot
    plt.title(f'Frag Lead Index: {frag_lead_idx}\nPDB Type: {pdb_type}, Model Index: {model_idx}\nMetric Weights: {metric_weights}\nTop-{k} Similarities to Lead over {len(scores_array)} Iterations')
    plt.xlabel('Iteration')
    plt.ylabel('Similarity to Lead')
    plt.ylim(0, 1)
    
    # Save the similarities plot to the specified path
    sim_to_lead_save_path = f"{save_path}_sim_to_lead_plot.png"
    plt.savefig(sim_to_lead_save_path)
    plt.close()
    
    # Both plots have already been saved and closed above; nothing else to save.
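
`plot_piggybacking` recovers the run parameters from `log.txt` by line position. As an alternative sketch, the same `key: value` lines can be parsed into a dictionary by key. The sample text below follows the format written in step 2 of `piggyback_pt2`; its values (including the `XXXX` PDB code) are placeholders, and `parse_log` is a hypothetical helper.

```python
def parse_log(text):
    """Parse 'key: value' lines into a dict, splitting on the first ': ' only."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(': ')
        if sep:  # ignore lines without a 'key: value' shape
            params[key] = value
    return params

# Placeholder log contents in the format written by piggyback_pt2.
sample_log = """frag_lead_idx: 1 (Nc1cc(c[nH]c1=O)C(F)(F)F)
pdb_type: frag (XXXX)
model_idx: 0 (reinvent)
metric weights: [0, 0, 1, 0] (['tanimoto_to_lead', 'docking_score', 'pifp_count', 'NNpredict_tani_to_lead'])
k: 10
max_iterations: 15
threshold: 1.0
scaler: None
NNmodel: None
"""

params = parse_log(sample_log)
k = int(params['k'])
threshold = float(params['threshold'])
print(k, threshold, params['model_idx'])
```

Looking values up by key keeps the parsing working if lines are later added to or reordered in the log.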

## Example usage

In [None]:
scaler = 'data/goodness_scoring_models/final_scaler.pkl'
NNmodel = 'data/goodness_scoring_models/final_goodness_predictor.h5'
piggyback_pt2(frag_lead_idx=27, pdb_type='frag', model_idx=0, weights=[0, 0, 1, 0], k=10, max_iterations=15, threshold=1.0, scaler_path=None, NNmodel_path=None)