# Virtual Screening Basics -- What Does the Web App Do

The ultimate goal of this project is to summarize some of the principles of virtual screening in an accessible and visually friendly way, in order to maximize the usability of these tools and keep everything in one place. The web app format is intended to allow for easy addition of future features without users needing to update local environments or pull new source code.

This notebook covers only the molecular similarity module of the app for the sake of time and presentation. Once completed, this would be one "dashboard" option in a larger menu, for example.

A snippet of Flask code is provided at the end but SHOULD NOT be run within the notebook. This will make the notebook angry. 


## 1. Selecting the Desired Molecule

Let's start by selecting some molecules. 

The code below accomplishes the following:   

- Calls PubChem and grabs some basic properties we might want to use/know for each molecule
- Loads the information into a dataframe for easy visualization and access 

Note that get_pubchem_data here is taking in a list of names. That list can be one name or a whole bunch of names. Let's try it out with a small test sample first. 

In [43]:
# Imports

import io
from urllib.parse import quote

import pandas as pd
import requests

def get_pubchem_data(names):

    # First we make an empty list to collect one small data frame per compound
    frames = []
    # Now we can go through the list of names passed in and get the compound data we need from PubChem
    for name in names:
        # quote() escapes spaces and other special characters in the name
        url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"+quote(name)+"/property/HBondDonorCount,HBondAcceptorCount,RotatableBondCount,MolecularWeight,XLogP,IsomericSMILES/csv"
        res = requests.get(url)
        # If PubChem cannot find the name, skip it instead of parsing an error message as data
        if not res.ok:
            print("No PubChem data found for " + name)
            continue
        frames.append(pd.read_csv(io.StringIO(res.text)))
    # We combine everything into one data frame before returning just for convenience of displaying data (and also possibly building a SQL database later)
    # Note that the response only contains CIDs, not the names we searched with
    if not frames:
        return pd.DataFrame()

    data = pd.concat(frames, ignore_index=True)

    return data


We can try running it by calling:

In [44]:
# Some things I have lying around my house 

compounds = ['Aspirin', 'Ibuprofen','Caffeine']

get_pubchem_data(compounds)


| | CID | HBondDonorCount | HBondAcceptorCount | RotatableBondCount | MolecularWeight | XLogP | IsomericSMILES |
|---|---|---|---|---|---|---|---|
| 0 | 2244 | 1 | 4 | 3 | 180.16 | 1.2 | CC(=O)OC1=CC=CC=C1C(=O)O |
| 1 | 3672 | 1 | 2 | 4 | 206.28 | 3.5 | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |
| 2 | 2519 | 0 | 3 | 0 | 194.19 | -0.1 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
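
Compound names are not always single words: something like "acetylsalicylic acid" contains a space, which should be escaped before it goes into a request URL. The following is a minimal sketch of building the same request URL with escaping. The helper name build_property_url and the constants BASE_URL and PROPERTIES are hypothetical names introduced for illustration; the URL layout itself is copied from get_pubchem_data above.

```python
from urllib.parse import quote

# Hypothetical constants; the URL pieces are the ones used in get_pubchem_data
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
PROPERTIES = "HBondDonorCount,HBondAcceptorCount,RotatableBondCount,MolecularWeight,XLogP,IsomericSMILES"

def build_property_url(name):
    # safe="" also escapes "/" so a slash inside a name cannot change the URL path
    return BASE_URL + quote(name, safe="") + "/property/" + PROPERTIES + "/csv"

url = build_property_url("acetylsalicylic acid")
print(url)
```

Building the URL in its own small function also makes it easy to check the string without sending a request.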


Sweet, now let's make this into a generalized input such that we could use either a csv file or just one molecule name.

In [45]:
import os 
import csv 

# Call this function if you would like to use a csv 
# All it does is construct a list of strings from the first column of your csv so we can easily manipulate it later 
def from_csv(csv_file):
    data_list = []

    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.reader(file)
        for line in csv_reader:
            # Skip blank rows so they don't raise an IndexError
            if line:
                data_list.append(line[0])
    return data_list

# Call this function if you would like to just type something in 
def no_csv(name): 
    data_list= [name]
    return data_list

Why have two separate functions, one for a CSV upload and one for a typed-in name? It makes the code easier to reuse when wiring up something like a CSV upload button versus a type-in button in the app, since we'll need both options!
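
To see that both entry points hand back the same kind of object (a plain list of strings), here is a small self-contained sketch. It redefines compact versions of from_csv and no_csv so it can run on its own, and it writes a throwaway file named example_cabinet.csv with made-up contents; that file name and its rows are assumptions for illustration, not the julias_cabinet.csv file from the repo.

```python
import csv
import os
import tempfile

def from_csv(csv_file):
    # Keep the first column of every non-empty row
    with open(csv_file, "r", newline="") as file:
        return [row[0] for row in csv.reader(file) if row]

def no_csv(name):
    # Wrap a single typed-in name in a list
    return [name]

# Write a small throwaway CSV (hypothetical contents) into a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_cabinet.csv")
    with open(path, "w", newline="") as file:
        csv.writer(file).writerows([
            ["Aspirin", "pain relief"],
            ["Ibuprofen", "pain relief"],
            ["Caffeine", "coffee"],
        ])
    from_file = from_csv(path)

typed_in = no_csv("Ibuprofen")
print(from_file)
print(typed_in)
```

Only the first column is read, so extra columns in the CSV are ignored. Either list can be passed to get_pubchem_data unchanged.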

For now, let's try it out with a CSV that I compiled from things I found in my medicine cabinet (available on GitHub with this notebook as julias_cabinet.csv).





In [54]:
# Construct CSV file path -- this will be different depending on how the file structure is configured 
# so make sure to change that file path line to no longer reflect my personal desktop path 
desktop_path = os.path.join(os.path.expanduser('~'), 'Desktop')
file_path = os.path.join(desktop_path,'julias_cabinet.csv')

compounds = from_csv(file_path)

csv_data = get_pubchem_data(compounds)

csv_data

| | CID | HBondDonorCount | HBondAcceptorCount | RotatableBondCount | MolecularWeight | XLogP | IsomericSMILES |
|---|---|---|---|---|---|---|---|
| 0 | 5288826 | 2 | 4 | 0 | 285.34 | 0.8 | CN1CC[C@]23[C@@H]4[C@H]1CC5=C2C(=C(C=C5)O)O[C@... |
| 1 | 2244 | 1 | 4 | 3 | 180.16 | 1.2 | CC(=O)OC1=CC=CC=C1C(=O)O |
| 2 | 3672 | 1 | 2 | 4 | 206.28 | 3.5 | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |
| 3 | 5284371 | 1 | 4 | 1 | 299.4 | 1.1 | CN1CC[C@]23[C@@H]4[C@H]1CC5=C2C(=C(C=C5)OC)O[C... |


We can try it with a single name just to be sure.

In [None]:
compound = no_csv("ibuprofen")

data = get_pubchem_data(compound)

data

| | CID | HBondDonorCount | HBondAcceptorCount | RotatableBondCount | MolecularWeight | XLogP | IsomericSMILES |
|---|---|---|---|---|---|---|---|
| 0 | 3672 | 1 | 2 | 4 | 206.28 | 3.5 | CC(C)CC1=CC=C(C=C1)C(C)C(=O)O |


## 2. Running a 2D similarity search

Now that we have some basic information about our molecules in an easily accessible place, let's run a 2D similarity search. We'll keep it simple and do it for a single item in our list.

The code below does the following:

- Performs a 2D similarity search for a given compound from our previously loaded data frame
- Returns a data frame with the results
- Provides a function for saving results as a csv file
- Provides a function that batch searches for similar compounds and saves the results to individual csv files

In [None]:
# PubChem's similarity search can take in a CID so that's what we will pass in  
# We will return a data frame that we can save later (and convert to a csv too)

def two_d_similarity(cid): 
    # Note that you can actually request properties of the found similar compounds in this hit directly but we can just get 
    # the cids of these compounds for now  
    cid_str = str(cid)
    
    url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/cid/"+cid_str+"/cids/txt"

    results = requests.get(url).text
    # split() with no arguments drops the empty string that a trailing newline would otherwise leave behind
    to_list = results.split()

    data_frame = pd.DataFrame(to_list, columns=['CID'])

    # Really quickly let's remove the CID that we used to search
    data_frame = data_frame[data_frame['CID'] != cid_str]
    
    return data_frame

# Saves our data frame as a csv for later 
def save_csv(data, filename): 
    data.to_csv(filename)
    

We can get results for one of the items in our "csv_data" data frame.

In [None]:
results = two_d_similarity(csv_data.loc[1,'CID'])
save_csv(results, "my_cids_list.csv")
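
The cids/txt response is plain text with one CID per line. The sketch below shows, without any network call, how such text turns into a clean list. The sample_response string is hand-written from CIDs that appear in this notebook's outputs and is not a real PubChem response; parse_cids is a hypothetical helper name.

```python
def parse_cids(response_text, query_cid):
    # split() with no arguments splits on any whitespace and drops empty strings
    cids = response_text.split()
    # Remove the CID that was used for the search; compare as strings
    return [cid for cid in cids if cid != str(query_cid)]

# Hand-written sample text in the one-CID-per-line layout (not a real response)
sample_response = "114864\n3672\n12508533\n165363571\n"

hits = parse_cids(sample_response, 3672)
naive = sample_response.split("\n")

print(hits)
print(naive)
```

Splitting on "\n" alone leaves an empty string at the end whenever the text finishes with a newline, which would show up as an empty CID row in the data frame.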

You should now have a brand new csv file, my_cids_list.csv, in your working directory.
We can also loop through our data frame and make a whole batch of csvs at once.

In [39]:
# Takes in a data frame and saves individual csvs for similarity searches
def csv_batch(data):
    # Loop over the CID column directly; calling the argument "data" avoids shadowing Python's built-in list
    for cid in data['CID']:
        result = two_d_similarity(cid)
        file_name = str(cid)+".csv"
        save_csv(result, file_name)

csv_batch(csv_data)
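
A batch like this sends one request per compound in quick succession. PubChem's usage policy asks clients to stay at or below roughly five requests per second, so for larger batches it may be worth spacing the calls out. The sketch below is one simple way to do that; throttled_map is a hypothetical helper, and the stand-in function just builds a file name instead of calling two_d_similarity so the example runs offline.

```python
import time

def throttled_map(func, items, min_interval=0.2):
    # Call func on each item, waiting so calls start at least min_interval seconds apart
    results = []
    last_call = None
    for item in items:
        if last_call is not None:
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.append(func(item))
    return results

# Stand-in for a network call: build the file name csv_batch would use
start = time.monotonic()
labels = throttled_map(lambda cid: str(cid) + ".csv", [2244, 3672, 2519], min_interval=0.05)
elapsed = time.monotonic() - start

print(labels)
```

With the default min_interval of 0.2 seconds the loop stays at about five calls per second; the shorter interval above only keeps the demonstration quick.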

## 3. Running a 3D similarity search 

Now we can do the same thing with a 3D similarity search!   

The code below does the following:

- Performs a 3D similarity search for a given compound from our previously loaded data frame 

In [40]:
# This code is very similar to two_d_similarity so we could reuse the code and just include a conditional 
# But for the purposes of clarity let's rewrite it here 

def three_d_similarity(cid): 
    
    cid_str = str(cid)
    
    url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_3d/cid/"+cid_str+"/cids/txt"

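    results = requests.get(url).text
    # split() with no arguments drops the empty string that a trailing newline would otherwise leave behind
    to_list = results.split()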

    data_frame = pd.DataFrame(to_list, columns=['CID'])

    # Really quickly let's remove the CID that we used to search
    data_frame = data_frame[data_frame['CID'] != cid_str]
    
    return data_frame

We can test this out the same way that we tested the previous code.

In [41]:
three_d_similarity("3672")
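# You can also try out the following commented out lines if you want to get a csv 
# (save_csv needs a file name; "my_3d_cids_list.csv" is just an example name)
# results = three_d_similarity("3672")
# save_csv(results, "my_3d_cids_list.csv")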

| | CID |
|---|---|
| 0 | 114864 |
| 2 | 12508533 |
| 3 | 165363571 |
| 4 | 715794 |
| 5 | 101072 |
| ... | ... |
| 657 | 61336076 |
| 658 | 59129390 |
| 659 | 55265673 |
| 660 | 14707079 |


## 4. Finding Overlap 

Literature suggests that overlap between these two searches is somewhat low, but hey it's worth a shot.  

The code below does the following: 
- Searches the two data frames to see if there is any overlap between them
- Adds anything that lives in both places to a new data frame

In [42]:
# First let's make two data frames, one generated with a 2D search and one with a 3D 
# I'm using "3672" as the example but you can do it with any of them 

two_d_results = two_d_similarity("3672")
three_d_results = three_d_similarity("3672")

# Now we see if there's overlap between the two by merging the dataframes which gives us 
# a new dataframe that only contains anything that was overlapping between the two! 

dupes = pd.merge(two_d_results, three_d_results, how ='inner') 

dupes

| | CID |
|---|---|
| 0 | 39912 |
| 1 | 114864 |
| 2 | 15250 |
| 3 | 10443535 |
| 4 | 109101 |
| ... | ... |
| 178 | 102593121 |
| 179 | 150501996 |
| 180 | 165342873 |
| 181 | 165363571 |


We actually got a pretty decent amount of overlap using my test molecule, but that might not always be the case. Either way, we can use this new list of duplicates for further testing.
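
For intuition, the inner merge on a single CID column behaves like a set intersection. The plain-Python sketch below illustrates that with two short hand-picked lists (a few CIDs taken from the outputs above, not the full search results); overlap is a hypothetical helper name. It also shows one thing to watch for: CIDs stored as strings never match CIDs stored as integers.

```python
# Hand-picked sample CIDs (strings, like the ones the similarity functions return)
two_d_cids = ["39912", "114864", "15250", "165363571"]
three_d_cids = ["114864", "12508533", "165363571", "715794"]

def overlap(first, second):
    # Keep the items of first that also occur in second, preserving first's order
    second_set = set(second)
    return [cid for cid in first if cid in second_set]

shared = overlap(two_d_cids, three_d_cids)

# Same CIDs, but stored as integers on one side: nothing matches
mismatched = overlap(two_d_cids, [114864, 165363571])

print(shared)
print(mismatched)
```

If one list comes from a text response and the other from a parsed CSV, converting both to the same type first avoids an unexpectedly empty overlap.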

## 5. Narrow Down Results

If we find something that exists in both lists, that's awesome. We can even go a little further and see if they meet some of the criteria for being drug-like, such as Lipinski's Rule of Five (developed at Pfizer, and called "Pfizer" in the code below) or Congreve's Rule of Three. The code below accomplishes the following:
- Grabs the PubChem information for each item in our list
- Takes in which rule you would like to follow
- Returns a new data frame with the results


In [51]:
# Very similar to our previous get_pubchem function only we are using cids to get the information instead of names. 
# We can (and should) make one function that can handle the different possible inputs but for the sake of following along 
# let's just define another function

def get_pubchem_data_cid(cids):

    cids_list = cids['CID'].tolist()

    # First we make an empty list to collect one small data frame per CID
    frames = []
    # Now we can go through the list of CIDs 
    # Note that the only difference between this and our first function is the url! 
    for cid in cids_list:
        # str() lets this work whether the CIDs are stored as strings or as integers
        url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"+str(cid)+"/property/HBondDonorCount,HBondAcceptorCount,RotatableBondCount,MolecularWeight,XLogP,IsomericSMILES/csv"
        res = requests.get(url)
        # Skip any CID that PubChem returns an error for
        if not res.ok:
            continue
        frames.append(pd.read_csv(io.StringIO(res.text)))
    # We combine everything into one data frame before returning just for convenience of displaying data (and also possibly building a SQL database later)
    if not frames:
        return pd.DataFrame()

    data = pd.concat(frames, ignore_index=True)

    return data

def drug_like_filter(rule, data): 
    # Note: rows with a missing XLogP value fail the comparisons and are dropped
    if rule == "Pfizer":
        # Lipinski's Rule of Five
        new_data = data[ ( data['HBondDonorCount'] <= 5) &
            ( data['HBondAcceptorCount'] <= 10 ) &
            ( data['MolecularWeight'] <= 500 ) &
            ( data['XLogP'] <= 5 ) ]

    # elif (rather than a second if) so a valid "Pfizer" selection does not fall through to the else branch
    elif rule == "Congreve": 
        # Congreve's Rule of Three
        new_data = data[ ( data['HBondDonorCount'] <= 3) &
            ( data['HBondAcceptorCount'] <= 3 ) &
            ( data['MolecularWeight'] <= 300 ) &
            ( data['XLogP'] <= 3 ) &
            ( data['RotatableBondCount'] <= 3 )]
    else: 
        new_data = data 
        print ("invalid rule selected")
    
    return new_data 
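
To make the thresholds concrete, here is a plain-Python sketch that applies both rules to the three compounds from the first test run (property values copied from that output). RULES and passes_rule are hypothetical names, and the sketch assumes every threshold is an upper limit that may be met exactly (<=).

```python
# Upper limits for each rule (assumption: a value equal to the limit still passes)
RULES = {
    "Pfizer": {"HBondDonorCount": 5, "HBondAcceptorCount": 10,
               "MolecularWeight": 500, "XLogP": 5},
    "Congreve": {"HBondDonorCount": 3, "HBondAcceptorCount": 3,
                 "MolecularWeight": 300, "XLogP": 3, "RotatableBondCount": 3},
}

def passes_rule(rule, properties):
    if rule not in RULES:
        raise ValueError("invalid rule selected: " + rule)
    # A compound passes only if every listed property is within its limit
    return all(properties[name] <= limit for name, limit in RULES[rule].items())

# Property values copied from the first get_pubchem_data output in this notebook
compounds = {
    "Aspirin": {"HBondDonorCount": 1, "HBondAcceptorCount": 4,
                "RotatableBondCount": 3, "MolecularWeight": 180.16, "XLogP": 1.2},
    "Ibuprofen": {"HBondDonorCount": 1, "HBondAcceptorCount": 2,
                  "RotatableBondCount": 4, "MolecularWeight": 206.28, "XLogP": 3.5},
    "Caffeine": {"HBondDonorCount": 0, "HBondAcceptorCount": 3,
                 "RotatableBondCount": 0, "MolecularWeight": 194.19, "XLogP": -0.1},
}

report = {name: {rule: passes_rule(rule, props) for rule in RULES}
          for name, props in compounds.items()}
print(report)
```

All three compounds pass the Rule of Five. Only caffeine passes the Rule of Three: aspirin has four hydrogen bond acceptors, and ibuprofen exceeds the limits on XLogP and rotatable bonds.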



So now we can just select a rule and get everything filtered and saved to a data frame and/or csv. Neat and easy! 

In [53]:
results =get_pubchem_data_cid(dupes)
rule = "Congreve"

possible_leads = drug_like_filter(rule, results)

#Let's go ahead and save that to a csv as well 

save_csv(possible_leads, "filtered_leads.csv")
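
With a couple of hundred overlapping CIDs, get_pubchem_data_cid makes a couple of hundred separate requests, which is a big part of why this step is slow. PUG REST also accepts a comma-separated list of CIDs in one request, so the lookups can be grouped. The sketch below only builds the request URLs and does not send anything; batched_property_urls is a hypothetical helper, and the batch size of 100 is an arbitrary assumption rather than a documented limit.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
PROPERTIES = "HBondDonorCount,HBondAcceptorCount,RotatableBondCount,MolecularWeight,XLogP,IsomericSMILES"

def batched_property_urls(cids, batch_size=100):
    # Convert to strings so integer and string CIDs are both accepted
    cids = [str(cid) for cid in cids]
    urls = []
    for start in range(0, len(cids), batch_size):
        # Join one chunk of CIDs with commas to request them together
        chunk = ",".join(cids[start:start + batch_size])
        urls.append(BASE_URL + chunk + "/property/" + PROPERTIES + "/csv")
    return urls

# Small batch size only so the chunking is visible in the output
urls = batched_property_urls(["39912", "114864", "15250"], batch_size=2)
for url in urls:
    print(url)
```

Each URL returns one CSV with a row per CID, so the responses can be read with pd.read_csv and combined in the same way as before.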



## 6. Web App with Flask

Wouldn't it be much nicer if we could do all that with just some buttons though? 

If you want to give it a try, clone a copy of the repo and run the (very, very simple) Flask web app.

Just a warning -- this is very much in development, so it's quite possible it'll break on you (oops!) and take a very long time to run. The Flask development server is not very performance oriented, so expect it to be slower than even running the code in the notebook.

If you want to run it on your laptop: 
- Clone the repo at https://github.com/atmoosephere/flask_app 
- Open the terminal and type python3 my_web_app.py 
- Your terminal will tell you an http address where it is running 
- Copy and paste that into a web browser
    
The final version of this will be available on github (and hopefully just on the web) soon(ish)!

In the meantime we can take a look at a sample of the Flask app code here: 

In [None]:
# Imports like normal 
from flask import Flask, render_template, request, make_response
import pandas as pd 
from rdkit import Chem
from rdkit.Chem import Draw
from get_molecule_data import get_pubchem_data
from find_likes import overlap_search, save_csv
from drug_like import run_filter
import csv 

# This is how you start off any flask app pretty much 
app = Flask(__name__)

# Routes are more or less how you navigate your app. So here the index function has a '/' route which 
# indicates it's the "root" of the app -- so we want to make sure that this connects to the html 
# for what the user will first see 

@app.route('/')
def index():
    # welcome.html is just a simple html file that contains the code for rendering the text and search bar 
    return render_template('welcome.html')

# Here I'm letting Flask know that we're moving to a new page (you'll see it in your web browser like this)
# For now our main page only does one thing which is call this get_likes function through a button 
# But you can have a bunch of methods like this for other functions and features 

@app.route('/get_likes', methods = ['POST','GET'])
def get_likes(): 

    molecule = request.args.get('molecule')

    # Generates the pubchem data for the molecule in question 
    data = get_pubchem_data(molecule)

    overlap = overlap_search(data.loc[0,'CID'])
    # TODO: hardcoded to Congreve for testing, make controlled by switch once button is added 
    screened = run_filter("Congreve", overlap)

    # This code is meant to save the csv and then make it available for download directly on the app
    # However I broke my download button so for now, once you run it you will see the csv saved to the folder 
    # this app is in 
    csv_screened = save_csv(screened, "my_csv.csv")

    # Render the results page 
    return render_template('output.html',  table=screened[1:10].to_html())
        
if __name__ == '__main__':
    app.run(debug=True)
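
For the download button mentioned in the comments above, one possible direction is to build the CSV as text in memory instead of writing a file next to the app. The sketch below shows only that step, using the standard library; rows_to_csv_text is a hypothetical helper and the single row is caffeine's CID and molecular weight from the earlier output.

```python
import csv
import io

def rows_to_csv_text(header, rows):
    # Write into an in-memory text buffer instead of a file on disk
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv_text(["CID", "MolecularWeight"], [[2519, 194.19]])
print(csv_text)
```

In a Flask view, text like this could be passed to make_response (already imported above) and given a Content-Disposition header of the form attachment; filename=... so the browser offers it as a download. That wiring is not shown here and has not been tested against this app. For a data frame, calling to_csv() without a file name returns the CSV text directly.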



Note that our code would also strongly benefit from storing its information in a database instead of passing around data frames. This mini web app does not use one mostly because it is not the "real" version of the app, just a presentation of the idea.
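
As a rough idea of what that could look like, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name, column names, and query are all assumptions made for illustration; the three rows are the property values from the first test run in this notebook.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE compounds ("
    "cid INTEGER PRIMARY KEY, hbond_donors INTEGER, hbond_acceptors INTEGER, "
    "rotatable_bonds INTEGER, molecular_weight REAL, xlogp REAL, smiles TEXT)"
)

rows = [
    (2244, 1, 4, 3, 180.16, 1.2, "CC(=O)OC1=CC=CC=C1C(=O)O"),
    (3672, 1, 2, 4, 206.28, 3.5, "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"),
    (2519, 0, 3, 0, 194.19, -0.1, "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"),
]
# INSERT OR REPLACE means looking up the same CID twice does not create duplicates
connection.executemany(
    "INSERT OR REPLACE INTO compounds VALUES (?, ?, ?, ?, ?, ?, ?)", rows
)

# The Rule of Three filter expressed as a query
query = (
    "SELECT cid FROM compounds "
    "WHERE hbond_donors <= 3 AND hbond_acceptors <= 3 "
    "AND molecular_weight <= 300 AND xlogp <= 3 AND rotatable_bonds <= 3 "
    "ORDER BY cid"
)
fragment_like = [row[0] for row in connection.execute(query)]
connection.close()

print(fragment_like)
```

Using the CID as the primary key would let the app reuse properties it has already fetched instead of asking PubChem again.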