# CANDO tutorial
*Updated July 2024*

This notebook will walk you through the basics of using the Computational Analysis of Novel Drug Opportunities platform, or CANDO. This includes preparing data, setting up a CANDO object, making therapeutic predictions, and benchmarking the platform.

## Table of contents

- [Introduction](#Introduction)
- [Basic tutorial](#Basic-tutorial)
  - [Getting started](#Getting-started)
  - [Accessing data](#Accessing-data)
  - [Drug prediction](#Drug-prediction)
  - [Benchmarking](#Benchmarking)
- [Advanced topics](#Advanced-topics)
  - [Customizing dataset](#Customizing-dataset)
  - [Advanced drug prediction](#Advanced-drug-prediction)

We recommend starting with the Getting started section, but if you wish to skip directly to a later section, or if you re-start the notebook and must re-create your CANDO object from scratch, please run the code below.

In [1]:
import cando as cnd # shortened to cnd for ease of typing/clarity
import os

while os.getcwd().split(os.sep)[-1] == 'tutorial': 
    os.chdir('..')

if not os.path.exists('tutorial'):
    cnd.get_tutorial()
    cnd.get_data(v='test.0',org='tutorial')

os.chdir("tutorial")

# Define variables with filepaths for our data
cmpd_map='cmpds-v2.2.tsv' # Drug identifiers
ind_map='cmpds2inds-v2.2.tsv' # Drug-indication associations
matrix_file='tutorial_matrix-approved.tsv' # Drug-protein interaction matrix

# Define additional variables used later in the tutorial; will be discussed as they appear
dist_metric='cosine'
ncpus=1

# Create CANDO object
cando = cnd.CANDO(cmpd_map, ind_map, matrix=matrix_file, compound_set='approved', compute_distance=True, 
                  dist_metric=dist_metric, ncpus=ncpus)

Reading signatures from matrix...
Done reading signatures.

Building compound-indication tables...
  Checking if data exists in tables...
  Data already in tables.
Done building compound-indication tables.

Building cosine distance table...
  Checking if data exists in table...
  Data already exists in table.
Done building cosine distance table.



---

# Introduction

This introduction will overview the fundamental ideas of the **C**omputational **A**nalysis of **N**ovel **D**rug **O**pportunities (CANDO) platform. If you want to continue directly to the practical portion of the tutorial, please skip to the [Getting started](#Getting-started) section.

CANDO is a similarity-based drug discovery platform, meaning it predicts new drugs for given indications based on their similarity to existing drugs within that indication. It calculates similarity by comparing the multitarget interactions of a given compound to those of the drugs of interest. Any quantifiable interaction between a compound and biological system can be used for this purpose, and multiple types of interactions can even be combined. For this tutorial, we will be using a series of compound-protein binding scores as our interaction signature.

Drug-protein binding scores can be generated through a variety of methods. CANDO has multiple in-house pipelines to create such scores based on small molecule drug chemical structure and protein binding pocket data (generated by [the Zhang group's protein binding site prediction algorithm, COACH](https://zhanggroup.org/COACH/)). These pipelines, CANDOCK and BANDOCK, can be used to create a protein library or to add proteins you think might be relevant to an existing protein library. Once these binding scores are generated, they are arranged into a vector or list corresponding to each compound: the *proteomic signature* of that compound. These proteomic signatures are arranged into a drug-protein interaction matrix. Once this matrix is created, alongside corresponding lists of compounds and any known compound-indication associations, CANDO can be used to make novel drug predictions.

To prepare for novel drug or indication prediction, CANDO first creates a compound-compound similarity matrix. Similarity scores are calculated based on the compound interaction signatures, with more similar interaction signatures having a higher similarity score. Finally, for each drug with a known effect, a sorted list of the most similar compounds to that drug is created. To predict a new drug for a given indication, the most similar compounds to all drugs already in that indication are compared. If a non-indicated drug is very similar to multiple indicated drugs, that is, it appears in the top ranks of multiple similar lists, it will be predicted as a possible new drug in that indication. Likewise, to predict a new indication for a given compound, the indications of the drugs most similar to it are found. If multiple drugs similar to the compound have the same indication, the indication is returned as a potential use for that compound.
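
The idea of a sorted "similar list" can be illustrated with a minimal sketch. The signatures below are made-up toy values and the helper names (`cosine_distance`, `similar_list`) are hypothetical; this is not CANDO's implementation, only one way to rank compounds by the cosine distance between their signatures.

```python
import math

# Toy interaction signatures (hypothetical values, not CANDO data)
signatures = {
    "drug_a": [0.9, 0.1, 0.0, 0.4],
    "drug_b": [0.8, 0.2, 0.1, 0.5],
    "drug_c": [0.0, 0.9, 0.8, 0.1],
    "drug_d": [0.1, 0.8, 0.9, 0.0],
}

def cosine_distance(u, v):
    """Return 1 - cosine similarity; smaller values mean more similar signatures."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def similar_list(name, sigs):
    """Rank every other compound by increasing distance to `name`."""
    others = [other for other in sigs if other != name]
    return sorted(others, key=lambda other: cosine_distance(sigs[name], sigs[other]))

# One sorted similar list per compound
similar = {name: similar_list(name, signatures) for name in signatures}
print(similar["drug_a"])
```

In this toy data, `drug_b` heads the similar list of `drug_a` because their signatures point in nearly the same direction.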

Benchmarking is useful for determining how trustworthy these predictions are. In most cases, the overall performance of any major release of CANDO will already be benchmarked. However, if you customize the set of proteins/interaction scores, indications, and/or compounds available to CANDO, or if you want to know CANDO's performance on a subset of indications or compounds of interest, re-benchmarking will allow you to estimate the trustworthiness of your results more specifically. CANDO has multiple built-in benchmarking processes to allow rapid and easy benchmarking assessments to be carried out as necessary.

---

# Basic tutorial

This section will go through the basics of using CANDO for simple drug discovery purposes.

## Getting started

The first step of using CANDO is importing the package. We will also be importing the os module for use in the next step. 

If this code throws an ImportError, check to see that you have cando.py downloaded and all relevant modules are installed. CANDO can be installed using `pip install cando-py` or using an Anaconda environment. See the [CANDO Github page](https://github.com/ram-compbio/CANDO) for more installation information.

In [2]:
import cando as cnd # shortened to cnd for ease of typing/clarity
import os

Next, you must obtain all input data required to run CANDO. Most of the basic functions of CANDO require only three files:

- **c_map** (compound map) - str, the filename of a file containing information on all drugs to be examined
  - Each compound has two identifiers (a CANDO identifier and a DrugBank identifier), its name, and its approval status (e.g. approved or investigational) listed
- **i_map** (indication map) - str, the filename of a file containing information on all indications to be examined
  - Each line maps one CANDO compound identifier to an indication, represented by its name and two identifiers (a MeSH identifier and a CANDO identifier)
- **matrix** - str, the filename of a file containing drug-protein binding scores (i.e. the protein interaction signature)
  - Must contain the same number of compounds as the compound map file
    
In this case, we will be using real compound and indication maps, but we will use a **tutorial matrix** with a limited number of drug-protein interactions so that we can generate predictions rapidly. Other datasets can be downloaded using the command `cnd.get_data()` or [via this website](http://protinfo.compbio.buffalo.edu/cando/data/v2.2+/).

If you have previously downloaded an older version of this tutorial, run `cnd.clear_cache()` before the other commands below to clear out the old tutorial data.

In [3]:
# In case this box is run 2+ times, navigates up to the first non-tutorial folder
while os.getcwd().split(os.sep)[-1] == 'tutorial': 
    os.chdir('..')

# Pull data for this tutorial
cnd.get_tutorial()
cnd.get_data(v='test.0',org='tutorial')
os.chdir("tutorial") # navigates to the tutorial folder

# Define variables with filepaths for our data
cmpd_map='cmpds-v2.2.tsv' # Drug identifiers
ind_map='cmpds2inds-v2.2.tsv' # Drug-indication associations
matrix_file='tutorial_matrix-approved.tsv' # Drug-protein interaction matrix

# Define additional variables used later in the tutorial; will be discussed as they appear
dist_metric='cosine'
ncpus=1

Downloading data for tutorial...
./tutorial/lmk235.mol exists.
All data for tutorial downloaded.

Creating subdirs in /projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+...

Downloading library and mapping files...
Files will be downloaded to: /projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+/mappings/
/projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+/mappings/drugbank-test.0.tsv exists.
/projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+/mappings/drugbank2ctd-test.0.tsv exists.

Downloading test.0 compound library data...
Files will be downloaded to: /projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+/cmpds/

Downloading tutorial protein library data...
Files will be downloaded to: /projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+/prots/
tutorial-coach.tsv
/projects/academic/rams/zmfalls/src/CANDO-dev/cando/data/v2.2+/prots/tutorial-coach.tsv exists.

Downloading matrix for test.0 compound library, all sublibrary,

CANDO is an object-oriented program, and the majority of its methods require the creation of a CANDO object before they can be used. The CANDO object organizes and processes all drug, indication, and matrix data. To create a CANDO object, you will use the command `cnd.CANDO()` and pass the three files discussed above as arguments. We will be using a couple of optional arguments in addition to ensure our CANDO object is set up how we want it:

1. **compound_set** - string, indicates which compounds to include in the CANDO object. This is "all" by default, which allows for novel drug prediction. Using "approved" instead allows us to focus on drug repurposing (finding new uses for existing drugs).
2. **compute_distance** - boolean, determines whether CANDO will compute the distances between the compounds based on the similarity of their interaction signatures. Unless you have a saved distance file (we do not), you must use `compute_distance`. Note that either `compute_distance` or `read_dists` must be used.
3. **save_dists** - string, takes a filename to save the distances calculated (e.g. "big_protein_dists.tsv"). Must be used alongside `compute_distance`. This is useful for larger protein sets (e.g. >100,000 proteins) where it might be costly to repeatedly compute these distances. *Since we are using a small protein set, we will not use this.*
4. **read_dists** - string, takes the filename of a saved distance file (e.g. "big_protein_dists.tsv"). This can only be used if you have previously computed and saved distances. *Since we have not done that, we will not use this.*
5. **dist_metric** - string, distance metric used for compute_distance. This is "rmsd" (root mean squared distance) by default, but "cosine" can also be used. This is only used if compute_distance is True.
6. **ncpus** - integer, parallelizes certain processes across the specified number of CPUs, making them faster. You can increase or decrease this number based on the number of CPUs available to you.
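
The choice between computing and reading distances can be sketched as a small helper. `distance_kwargs` is a hypothetical function, not part of CANDO; it only builds a dictionary from the argument names described in the list above, and it assumes that `read_dists` needs no other distance arguments, which the documentation should confirm.

```python
import os

def distance_kwargs(dist_file, dist_metric="cosine"):
    """Choose distance-related arguments based on whether a saved file exists."""
    if os.path.exists(dist_file):
        # Reuse distances saved by an earlier run
        return {"read_dists": dist_file}
    # No saved file yet: compute distances and save them for next time
    return {
        "compute_distance": True,
        "dist_metric": dist_metric,
        "save_dists": dist_file,
    }

kwargs = distance_kwargs("big_protein_dists.tsv")
print(kwargs)
# The dictionary could then be unpacked into the constructor, for example:
# cando = cnd.CANDO(cmpd_map, ind_map, matrix=matrix_file, **kwargs)
```

A pattern like this is mainly useful for large protein sets, where recomputing distances on every run is costly.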

There are many other optional arguments available for the CANDO object instantiation. If you want to know more, please consult [the CANDO documentation](https://github.com/ram-compbio/CANDO/blob/master/docs/CANDO-v2.2.pdf).

In [4]:
cando = cnd.CANDO(cmpd_map, ind_map, matrix=matrix_file, compound_set='approved', compute_distance=True, 
                  dist_metric=dist_metric, ncpus=ncpus)

Reading signatures from matrix...
Done reading signatures.

Building compound-indication tables...
  Checking if data exists in tables...
  Data already in tables.
Done building compound-indication tables.

Building cosine distance table...
  Checking if data exists in table...
  Data already exists in table.
Done building cosine distance table.



Creating or instantiating the CANDO object, as above, makes all of the functions of CANDO available. However, before continuing onto the actual usage of CANDO, we will first cover how to access information about the major objects that make up CANDO. This will allow you to ensure that all data was integrated into the CANDO object as expected and that you are creating the predictions you intend to in later steps.

---

## Accessing data

CANDO contains a large quantity of data, and easy access to that data is essential for using its prediction functions.

CANDO consists of multiple objects, each working in tandem to create a drug prediction pipeline. The major objects are:

- **CANDO object** - object, contains and organizes most variables and methods necessary to run CANDO
- **Compound** - object, represents a single small molecule compound/drug
- **Indication** - object, represents a single disease or other indication
- **Protein** - object, represents a protein in the proteomic signature

Other objects also exist, but are less commonly used, including **Compound_pair**, **Pathway**, and **ADR**. You will not use these objects in most applications.

We will start by going over the CANDO object, which we just created/instantiated. The CANDO object contains lists of all objects in your model. The CANDO object also has many other properties that might be useful; consult [the CANDO documentation](https://github.com/ram-compbio/CANDO/blob/master/docs/CANDO-v2.2.pdf) for more information.

- **cando.compounds** - list, all Compound objects (taken from the compound map, `cmpd_map`)
  - Only compounds in this list can be associated with indications or predicted as novel therapies
- **cando.indications** - list, all Indication objects (taken from the indication map, `ind_map`)
  - Only indications in this list can be associated with drugs or have novel therapies predicted
- **cando.proteins** - list, all Protein objects (taken from the drug-protein interaction matrix, `matrix_file`)

To start with, we will check how many objects are in each group.

In [5]:
print(len(cando.compounds), 'compounds')
print(len(cando.indications), 'indications')
print(len(cando.proteins), 'proteins')

2449 compounds
2214 indications
64 proteins


For this tutorial, our input files have 2449 compounds, 2214 indications, and 64 proteins. As a reminder, **most real matrices contain significantly more than 64 proteins**, but the protein count has been greatly reduced for the purposes of this tutorial.

Let's look into the Compound objects in a little more depth. We can find a Compound if we know its CANDO ID, but we are unlikely to know that information offhand. Luckily, we can look up our compound by name using the `search_compound` function. This function will search for drugs with a similar name to your search term and return the top results.

In [6]:
print('Searching for "siprofloxasin"...')
cando.search_compound('siprofloxasin')

Searching for "siprofloxasin"...
id	name
421	ciprofloxacin
1075	sparfloxacin
1032	ofloxacin
5105	besifloxacin
990	proflavine
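
How CANDO scores name similarity is not described in this tutorial. As a rough illustration of approximate name matching, the sketch below uses Python's standard `difflib` module as a stand-in on the five names returned above; `closest_names` is a hypothetical helper, and its ordering may differ from that of `search_compound`.

```python
import difflib

# Compound names taken from the search output above
names = ["ciprofloxacin", "sparfloxacin", "ofloxacin", "besifloxacin", "proflavine"]

def closest_names(query, candidates, limit=3):
    """Return up to `limit` candidates, most similar to `query` first."""
    # cutoff=0.0 keeps every candidate eligible, so only the ordering matters
    return difflib.get_close_matches(query.lower(), candidates, n=limit, cutoff=0.0)

matches = closest_names("siprofloxasin", names)
print(matches)
```

Even with two wrong letters in the query, "ciprofloxacin" is the closest match in this small list.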


Despite some misspellings in our search term, we found "ciprofloxacin," the antibiotic we were looking for! Once we have our compound ID, we can use the `get_compound` function to get the corresponding Compound object without going through the cando.compounds list.

In [7]:
cipro = cando.get_compound(421)
print('The variable cipro contains the compound:', cipro.name)

The variable cipro contains the compound: ciprofloxacin


A Compound contains the information CANDO "knows" about the drug, metabolite, or small molecule it represents. This includes the following:

- **compound.id_** - integer, the CANDO ID of the compound (e.g. 421)
- **compound.name** - string, the name of the compound (e.g. ciprofloxacin)
- **compound.status** - the clinical trial status of the compound (from the compound map)
  - All included compounds will be "approved" if `compound_set='approved'`; otherwise, the status can be "approved" or "other"
- **compound.sig** - list, the (proteomic) signature of the drug, meaning all protein binding scores in order (from the matrix file)
- **compound.indications** - list of string, the MeSH IDs of the indications associated with the drug (from the indication map)
  - The drug in question is considered effective against all indications in this list
 
Other variables may also be associated with the Compound object. Consult [the CANDO documentation](https://github.com/ram-compbio/CANDO/blob/master/docs/CANDO-v2.2.pdf) for more information.

To explore ciprofloxacin, we will look into some of the properties listed above.

In [8]:
print(cipro.name, cipro.id_)
print(cipro.name, 'is', cipro.status)
print('It is associated with', len(cipro.indications), 'indications')
print('Its signature contains', len(cipro.sig), 'proteins')
print('Signature:', cipro.sig)

ciprofloxacin 421
ciprofloxacin is approved
It is associated with 77 indications
Its signature contains 64 proteins
Signature: [0.198, 0.102, 0.079, 0.166, 0.119, 0.071, 0.049, 0.103, 0.074, 0.093, 0.083, 0.123, 0.055, 0.078, 0.0, 0.094, 0.114, 0.387, 0.28, 0.209, 0.172, 0.087, 0.083, 0.226, 0.222, 0.136, 0.088, 0.059, 0.115, 0.157, 0.183, 0.102, 0.063, 0.095, 0.149, 0.161, 0.162, 0.0, 0.135, 0.102, 0.141, 0.168, 0.208, 0.094, 0.246, 0.125, 0.03, 0.244, 0.076, 0.134, 0.0, 0.131, 0.162, 0.119, 0.108, 0.179, 0.093, 0.143, 0.076, 0.164, 0.194, 0.064, 0.131, 0.137]


Ciprofloxacin has existed since the 1980s, and it has accumulated a large number of indication associations in that time. Its signature represents its interactions with the 64 proteins we included in our matrix. These scores are from 0.0 to 1.0; we can see that ciprofloxacin does not have a high predicted binding affinity with any of these proteins.
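
A quick summary of a signature makes this kind of observation easier to check. The sketch below copies the 64 scores printed above into a plain list; `summarize_signature` is a hypothetical helper, and the 0.5 threshold is an arbitrary cutoff chosen for illustration, not a CANDO convention.

```python
# Signature values copied from the ciprofloxacin output above
signature = [
    0.198, 0.102, 0.079, 0.166, 0.119, 0.071, 0.049, 0.103, 0.074, 0.093,
    0.083, 0.123, 0.055, 0.078, 0.0, 0.094, 0.114, 0.387, 0.28, 0.209,
    0.172, 0.087, 0.083, 0.226, 0.222, 0.136, 0.088, 0.059, 0.115, 0.157,
    0.183, 0.102, 0.063, 0.095, 0.149, 0.161, 0.162, 0.0, 0.135, 0.102,
    0.141, 0.168, 0.208, 0.094, 0.246, 0.125, 0.03, 0.244, 0.076, 0.134,
    0.0, 0.131, 0.162, 0.119, 0.108, 0.179, 0.093, 0.143, 0.076, 0.164,
    0.194, 0.064, 0.131, 0.137,
]

def summarize_signature(sig, threshold=0.5):
    """Report basic statistics for a list of binding scores."""
    top = max(sig)
    return {
        "n_proteins": len(sig),
        "max_score": top,
        "strongest_index": sig.index(top),  # zero-based position in the signature
        "n_zero": sum(1 for score in sig if score == 0.0),
        "n_above_threshold": sum(1 for score in sig if score >= threshold),
    }

summary = summarize_signature(signature)
print(summary)
```

With a real Compound object, the same helper could be applied to `cipro.sig` directly.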

Next, let's look at Indication objects. CANDO uses [**Me**dical **S**ubject **H**eadings (MeSH)](https://meshb.nlm.nih.gov/) IDs to internally catalogue indications. In order to predict new drugs for a given indication, you must first choose the most appropriate MeSH term. While you could search MeSH itself for the right term, not all terms have an Indication associated with them in CANDO. Searching CANDO for a MeSH ID ensures that your term is already present in your model.

Finding an Indication is similar to finding a Compound. You can start by using `cando.search_indication` to find the closest text match to the disease you are interested in. If we wanted to find pneumonia, we could start with the following:

In [9]:
cando.search_indication('Pneumonia')

Matches exactly containing Pneumonia:
id             	name
MESH:D011014	Pneumonia
MESH:D011015	Pneumonia, Aspiration
MESH:D018410	Pneumonia, Bacterial
MESH:D011018	Pneumonia, Pneumococcal
MESH:D011020	Pneumonia, Pneumocystis
MESH:D011023	Pneumonia, Staphylococcal
MESH:D011024	Pneumonia, Viral

Matches using string distance:
id             	name
MESH:D011014	Pneumonia
MESH:D011001	Pleuropneumonia
MESH:D011024	Pneumonia, Viral
MESH:D059445	Anhedonia
MESH:D000740	Anemia


Many diseases have multiple comorbidities, etiologies, or otherwise related indications, so it is very common to get multiple results for an indication search. In this case, it seems like viral, aspiration, and bacterial pneumonia should have different causes and possibly different treatments, so it makes sense to focus on a specific type rather than the general term. If we want to study bacterial pneumonia, we would look at "Pneumonia, Bacterial" with a MeSH ID of [MESH:D018410](https://meshb.nlm.nih.gov/record/ui?ui=D018410). 

We can use this identifier to access the Indication itself using `get_indication`.

In [10]:
pneu_bact = cando.get_indication("MESH:D018410")
print('The variable pneu_bact contains the indication:', pneu_bact.name)

The variable pneu_bact contains the indication: Pneumonia, Bacterial


Like a Compound, an Indication contains various information about the indication as it exists in your CANDO object. This includes:
- **indication.id_** - string, the MeSH identifier of the indication (e.g. MESH:D018410)
- **indication.name** - string, the English name of the indication (e.g. Pneumonia, Bacterial)
- **indication.compounds** - list of integer, the CANDO ID of all compounds associated with the indication
  - If a compound is listed here, it is considered effective against our indication

Other properties may also be associated with the Indication object. Consult [the CANDO documentation](https://github.com/ram-compbio/CANDO/blob/master/docs/CANDO-v2.2.pdf) for more information.

We can find out how many compounds are known to treat bacterial pneumonia using these properties.

In [11]:
print(pneu_bact.name, pneu_bact.id_)
print(len(pneu_bact.compounds), 'compounds are associated with bacterial pneumonia')

Pneumonia, Bacterial MESH:D018410
23 compounds are associated with bacterial pneumonia


By combining Compound and Indication properties, we can get even more information about what drugs are associated with bacterial pneumonia in our CANDO object.

In [12]:
print('The following compounds are associated with bacterial pneumonia:')
print(*[cando.get_compound(cmpd).name for cmpd in pneu_bact.compounds], sep=', ')

The following compounds are associated with bacterial pneumonia:
fosfomycin, clindamycin, pefloxacin, gentamicin, clarithromycin, sultamicillin, linezolid, ciprofloxacin, sulfadiazine, delafloxacin, ceftriaxone, amoxicillin, telavancin, ceftazidime, flucloxacillin, ofloxacin, vancomycin, ceftobiprole, omadacycline, azithromycin, cefpirome, erythromycin, cefotaxime


If you check the list above, you will see a familiar compound. Bacterial pneumonia is one of the 77 indications associated with ciprofloxacin.
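
The two-way link between compounds and indications can be sketched with small stand-in classes. `ToyCompound` and `ToyIndication` are hypothetical and only mirror the attribute names described above (`id_`, `name`, `indications`, `compounds`); the association lists are abbreviated, and proflavine's empty list is a placeholder rather than real data.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCompound:
    id_: int
    name: str
    indications: list = field(default_factory=list)  # MeSH IDs

@dataclass
class ToyIndication:
    id_: str
    name: str
    compounds: list = field(default_factory=list)  # CANDO compound IDs

# Abbreviated toy data using IDs and names that appear earlier in this tutorial
cipro = ToyCompound(421, "ciprofloxacin", ["MESH:D018410"])
oflox = ToyCompound(1032, "ofloxacin", ["MESH:D018410"])
proflavine = ToyCompound(990, "proflavine", [])
pneu = ToyIndication("MESH:D018410", "Pneumonia, Bacterial", [421, 1032])

def is_associated(compound, indication):
    """True only if the association is recorded on both objects."""
    return (compound.id_ in indication.compounds
            and indication.id_ in compound.indications)

associated = [c.name for c in (cipro, oflox, proflavine) if is_associated(c, pneu)]
print(associated)
```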

Now that we can access drug and indication information, we can continue on to completing practical predictions and assessments in the [Drug prediction](#Drug-prediction) and [Benchmarking](#Benchmarking) sections.

---

## Drug prediction

One of the most common ways to use CANDO is to predict new drugs for an indication of interest. Let's say we want to predict new drugs for human immunodeficiency virus (HIV). We can start by searching for "HIV" to find the relevant MeSH term.

In [13]:
cando.search_indication('HIV')

Matches exactly containing HIV:
id             	name
MESH:D015658	HIV Infections
MESH:D006679	HIV Seropositivity
MESH:D019247	HIV Wasting Syndrome
MESH:D039682	HIV-Associated Lipodystrophy Syndrome

Matches using string distance:
id             	name
MESH:D015658	HIV Infections


HIV is associated with the development of multiple other conditions, so multiple results are returned. However, if we are specifically looking for drugs effective against HIV as a *virus*, not treatments for the resulting conditions, "HIV Infections" is the result we are looking for. This is associated with [MESH:D015658](https://meshb.nlm.nih.gov/record/ui?ui=D015658).

Because CANDO is a similarity-based drug discovery method, it is generally more effective at predicting new drugs for indications that already have multiple drugs associated with them. In addition, if there are no drugs associated with the indication we want to predict effective drugs for, we would have to use a different approach to predict novel drugs (explored in the [Advanced topics](#Advanced-topics)).

To ensure we have enough drugs to create effective predictions against HIV, we should find the Indication associated with HIV infections and check how many drugs are already associated with it.

In [14]:
hiv = cando.get_indication("MESH:D015658")

print(len(hiv.compounds), 'compounds are associated with', hiv.name)
print(*[cando.get_compound(cmpd).name for cmpd in hiv.compounds], sep=', ')

32 compounds are associated with HIV Infections
efavirenz, saquinavir, beta_carotene, zidovudine, fosamprenavir, abacavir, lamivudine, adefovir_dipivoxil, zalcitabine, delavirdine, nelfinavir, isoniazid, deferasirox, methadone, maraviroc, hydroxyurea, amprenavir, atazanavir, didanosine, ritonavir, raltegravir, etravirine, miglustat, indinavir, nevirapine, vitamin_a, foscarnet, ribavirin, lopinavir, thalidomide, stavudine, pentamidine


There are 32 drugs associated with HIV Infections in our data, which makes sense as it is a well-studied disease. Most seem to be protease inhibitors (saquinavir, fosamprenavir) or nucleoside analogs (zalcitabine, didanosine). 

We can now make predictions for other drugs/compounds that may be useful against HIV using the `canpredict_compounds` function. Given an indication, canpredict_compounds takes the lists of most similar compounds to each drug already associated with that indication. It then checks to see which compounds are similar to multiple indicated drugs; compounds that are similar to more indicated drugs are ranked higher, and ties are broken based on average rank within the similar lists. Possible parameters include:

- **ind_id** - string, the indication for which new compounds should be predicted (e.g. MESH:D015658)
- **n** - integer, the number of similar compounds to be considered per indicated compound (10 by default)
- **topX** - integer, the maximum number of predictions to be outputted (10 by default)
- **consensus** - boolean, outputs only compounds that appear in multiple similar lists if True (True by default)
- **keep_associated** - boolean, includes indicated drugs in the output list if True (False by default)
- **cmpd_set** - string, "all", "approved", or "other"; determines which compound set to use ("all" by default)
  - Note, if the CANDO object was created with only approved compounds, "all" will still only include approved compounds
- **save** - string, name of an output file to save results (tsv format, nothing/not saved by default)

Because we have 32 drugs associated with HIV, the full output of canpredict_compounds would be very long. Therefore, we will only look at the top 10 results (`topX = 10`). There are plenty of opportunities for a compound to appear in at least 2 of the 32 similar lists, so looking at only the top 10 most similar compounds (`n = 10`) and excluding any compounds that appear in only one similar list (`consensus = True`) are also appropriate.

In [15]:
cando.canpredict_compounds("MESH:D015658", n=10, topX=10)

32 compounds found for MESH:D015658 --> HIV Infections
Generating compound predictions using top10 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	7	4.3	7.47e-13   	1128	true    	darunavir
2	4	3.0	2.09e-07   	756	true    	emtricitabine
3	4	5.0	2.09e-07   	331	true    	entecavir
4	4	5.2	2.09e-07   	2982	true    	brivudine
5	4	5.8	2.09e-07   	7854	true    	lifitegrast
6	4	5.8	2.09e-07   	321	true    	trifluridine
7	4	6.2	2.09e-07   	574	true    	moexipril
8	4	7.0	2.09e-07   	4817	true    	regadenoson
9	4	7.0	2.09e-07   	143	true    	idoxuridine
10	3	2.0	9.14e-06   	7256	true    	cobicistat




These results are a little complex at first glance, so let's break down some key columns:

- **score1** - integer, the number of similar lists in which the compound appears at or above rank n (here, rank 10)
  - This is the primary ranking factor
- **score2** - float, the average rank of each compound in the "score1" similar lists it appears in
  - When score1 is tied for two compounds, this breaks the tie
- **id** - integer, the CANDO ID of the compound; can be used to directly access the compound using `cando.get_compound()`
- **approved** - boolean, true if the compound is listed as approved in the input data; false otherwise
- **name** - string, the name of the compound

The results can change greatly if we alter the input parameters. Instead of considering only the top 10 most similar compounds to each indicated compound, let's change it to the top 25 (`n = 25`) and see what changes.

In [16]:
cando.canpredict_compounds("MESH:D015658", n=25, topX=10)

32 compounds found for MESH:D015658 --> HIV Infections
Generating compound predictions using top25 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	8	5.0	2.74e-11   	1128	true    	darunavir
2	7	11.3	1.00e-09   	1095	true    	paclitaxel
3	7	12.7	1.00e-09   	1126	true    	decitabine
4	6	6.0	3.12e-08   	756	true    	emtricitabine
5	6	8.5	3.12e-08   	7854	true    	lifitegrast
6	6	9.3	3.12e-08   	574	true    	moexipril
7	6	10.2	3.12e-08   	136	true    	cladribine
8	6	11.0	3.12e-08   	9053	true    	macimorelin
9	6	12.3	3.12e-08   	515	true    	clofarabine
10	6	13.2	3.12e-08   	5106	true    	cabazitaxel




Although the first compound is the same (darunavir), the second and third, paclitaxel and decitabine, are new; previously, these two did not even appear in the top 10 predictions. Meanwhile, the previous third place compound, entecavir, is not in the top 10 anymore. To find out where it ended up, we can see more results for the same assessment using `topX = 25`.

In [17]:
cando.canpredict_compounds("MESH:D015658", n=25, topX=25)

32 compounds found for MESH:D015658 --> HIV Infections
Generating compound predictions using top25 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	8	5.0	2.74e-11   	1128	true    	darunavir
2	7	11.3	1.00e-09   	1095	true    	paclitaxel
3	7	12.7	1.00e-09   	1126	true    	decitabine
4	6	6.0	3.12e-08   	756	true    	emtricitabine
5	6	8.5	3.12e-08   	7854	true    	lifitegrast
6	6	9.3	3.12e-08   	574	true    	moexipril
7	6	10.2	3.12e-08   	136	true    	cladribine
8	6	11.0	3.12e-08   	9053	true    	macimorelin
9	6	12.3	3.12e-08   	515	true    	clofarabine
10	6	13.2	3.12e-08   	5106	true    	cabazitaxel
11	6	14.2	3.12e-08   	7855	true    	velpatasvir
12	5	7.8	8.18e-07   	331	true    	entecavir
13	5	12.4	8.18e-07   	7431	true    	eluxadoline
14	5	14.2	8.18e-07   	969	true    	capecitabine
15	5	17.0	8.18e-07   	10070	true    	ivosidenib
16	5	19.4	8.18e-07   	198	true    	mitomycin
17	4	5.2	1.78e-05   	2982	true    	brivudine
18	4	5.8	1.78e-05   	321	true    	trifluri

It turns out entecavir only appeared in one more similar list when we considered the top 25 instead of the top 10 most similar compounds, pushing it down to rank 12.
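
Comparing two prediction lists by eye gets tedious, so a small helper can do it. The sketch below copies the top 10 names from the `n = 10` and `n = 25` runs above into plain lists; `compare_rankings` is a hypothetical helper and does not use any CANDO function.

```python
# Top 10 predicted names copied from the two runs above
top10_n10 = ["darunavir", "emtricitabine", "entecavir", "brivudine", "lifitegrast",
             "trifluridine", "moexipril", "regadenoson", "idoxuridine", "cobicistat"]
top10_n25 = ["darunavir", "paclitaxel", "decitabine", "emtricitabine", "lifitegrast",
             "moexipril", "cladribine", "macimorelin", "clofarabine", "cabazitaxel"]

def compare_rankings(before, after):
    """Split names into those kept, dropped, and newly added, preserving order."""
    kept = [name for name in before if name in after]
    dropped = [name for name in before if name not in after]
    added = [name for name in after if name not in before]
    return kept, dropped, added

kept, dropped, added = compare_rankings(top10_n10, top10_n25)
print("kept:", kept)
print("dropped:", dropped)
print("added:", added)
```

Only four of the original ten predictions remain in the top 10 after raising `n` to 25.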

If we want to, we can also see where compounds already associated with HIV appear relative to newly predicted compounds by running the same assessment with `keep_associated = True`. Technically, compounds associated with HIV are at a slight disadvantage, since they cannot appear in their own similar lists, but, because they are effective against HIV, we would still expect them to be pretty highly ranked overall if CANDO is functioning well. Let's run this assessment with `n = 10` and `topX = 10`.

In [18]:
cando.canpredict_compounds("MESH:D015658", n=10, topX=10, 
                           keep_associated=True)

32 compounds found for MESH:D015658 --> HIV Infections
Generating compound predictions using top10 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	7	4.3	7.47e-13   	1128	true    	darunavir
2	6	4.3	5.84e-11   	584	true*   	amprenavir
3	4	2.8	2.09e-07   	1098	true*   	saquinavir
4	4	3.0	2.09e-07   	756	true    	emtricitabine
5	4	5.0	2.09e-07   	331	true    	entecavir
6	4	5.2	2.09e-07   	2982	true    	brivudine
7	4	5.8	2.09e-07   	7854	true    	lifitegrast
8	4	5.8	2.09e-07   	321	true    	trifluridine
9	4	5.8	2.09e-07   	115	true*   	nelfinavir
10	4	6.2	2.09e-07   	574	true    	moexipril




If an asterisk * appears next to the label in the "approved" column, that means that drug is already associated with HIV. We can see there are three such compounds that appear in this list: amprenavir, saquinavir, and nelfinavir. As it turns out, CANDO ranks darunavir higher than every compound already associated with HIV in our data in this prediction.

Before moving on, let's repeat this assessment on an indication with many fewer approved drugs. When we searched "HIV" earlier, one of our results was "HIV Wasting Syndrome," MESH:D019247. Let's look at that indication now.

In [19]:
hiv_ws = cando.get_indication('MESH:D019247')
print(hiv_ws.name, hiv_ws.id_)
print(len(hiv_ws.compounds))

HIV Wasting Syndrome MESH:D019247
4


There are only 4 drugs associated with HIV wasting syndrome in our dataset. This will likely reduce the number of results available through `canpredict_compounds`. Let's try running it with the same parameters as we did previously.

In [20]:
cando.canpredict_compounds('MESH:D019247', n=10, topX=10)

4 compounds found for MESH:D019247 --> HIV Wasting Syndrome
Generating compound predictions using top10 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	2	1.0	2.72e-07   	5062	true    	stanozolol
2	2	2.0	2.72e-07   	5054	true    	methyltestosterone
3	2	3.0	2.72e-07   	4901	true    	oxymetholone
4	2	5.0	2.72e-07   	1224	true    	testosterone_propionate
5	2	5.5	2.72e-07   	976	true    	trilostane




As you can see, though we still requested the top 10 predictions, only five appeared because only five compounds appear in multiple similar lists of the four drugs associated with HIV wasting syndrome. Stanozolol, the top ranked prediction, appears in only two similar lists (score1 is 2), but it appears in rank 2 on average (score2 is 1.0; rank 1 is 0.0).
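
To make `score1` and `score2` concrete, here is a minimal sketch of the consensus idea described above. It is an illustration under stated assumptions, not the code inside `canpredict_compounds`: the function name `consensus_rank`, the toy similar lists, and the tie-breaking order (more lists first, then lower average rank) are invented for this example, and the probability column is not modelled.

```python
from collections import defaultdict


def consensus_rank(similar_lists, associated, consensus=True, keep_associated=False):
    """Toy consensus scoring over the similar lists of an indication's drugs.

    similar_lists maps each associated drug to its ranked list of similar
    compounds (position 0 = most similar). Hypothetical helper, not CANDO API.
    """
    positions_seen = defaultdict(list)
    for drug, similar in similar_lists.items():
        for position, candidate in enumerate(similar):
            if candidate == drug:
                continue  # a drug never appears in its own similar list
            if candidate in associated and not keep_associated:
                continue  # hide drugs already associated with the indication
            positions_seen[candidate].append(position)

    rows = []
    for candidate, positions in positions_seen.items():
        score1 = len(positions)            # number of similar lists containing it
        score2 = sum(positions) / score1   # average zero-based rank in those lists
        if consensus and score1 < 2:
            continue                       # consensus requires two or more lists
        rows.append((candidate, score1, round(score2, 1)))

    # More lists first; ties broken by the better (lower) average rank.
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows


similar_lists = {
    "drug_a": ["x", "y", "z"],
    "drug_b": ["y", "x", "w"],
    "drug_c": ["x", "drug_a", "v"],
}
associated = set(similar_lists)

for row in consensus_rank(similar_lists, associated):
    print(row)
```

Calling the same function with `consensus=False` appends the compounds that appear in only one list (`z`, `w`, and `v` here) after the consensus hits, which mirrors the behaviour of the `consensus` argument discussed later in this section.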

If we want to increase the number of predictions we receive, we can try a couple of things. First, as we have done previously, we can increase `n` to 25, which is likely to increase the number of compounds appearing in multiple similar lists.

In [21]:
cando.canpredict_compounds('MESH:D019247', n=25, topX=10)

4 compounds found for MESH:D019247 --> HIV Wasting Syndrome
Generating compound predictions using top25 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	3	9.7	1.09e-08   	1224	true    	testosterone_propionate
2	3	12.7	1.09e-08   	286	true    	progesterone
3	3	13.3	1.09e-08   	10123	true    	drostanolone_propionate
4	3	16.7	1.09e-08   	1083	true    	finasteride
5	3	20.0	1.09e-08   	7125	true    	formestane
6	2	1.0	4.23e-06   	5062	true    	stanozolol
7	2	2.0	4.23e-06   	5054	true    	methyltestosterone
8	2	3.0	4.23e-06   	4901	true    	oxymetholone
9	2	5.5	4.23e-06   	976	true    	trilostane
10	2	7.0	4.23e-06   	598	true    	norethisterone




Now we have a full 10 results, but the ranks of most of our previous results (aside from testosterone propionate) have fallen. If we want to extend the previous list instead of changing the results, we can keep `n = 10` but set `consensus = False`. This will include in the ranks compounds that only appear in one similar list, but are highly ranked in that list.

In [22]:
cando.canpredict_compounds('MESH:D019247', n=10, topX=10, consensus=False)

4 compounds found for MESH:D019247 --> HIV Wasting Syndrome
Generating compound predictions using top10 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	2	1.0	2.72e-07   	5062	true    	stanozolol
2	2	2.0	2.72e-07   	5054	true    	methyltestosterone
3	2	3.0	2.72e-07   	4901	true    	oxymetholone
4	2	5.0	2.72e-07   	1224	true    	testosterone_propionate
5	2	5.5	2.72e-07   	976	true    	trilostane
6	1	0.0	9.96e-05   	5120	true    	hydroxyprogesterone_caproate
7	1	0.0	9.96e-05   	8946	true    	methylprednisone
8	1	1.0	9.96e-05   	10145	true    	gestonorone_caproate
9	1	1.0	9.96e-05   	7870	true    	nomegestrol
10	1	2.0	9.96e-05   	9792	true    	testosterone_undecanoate




As you can see, this output has the same top ranks as our first results, but the table has been extended to include additional compounds.

While CANDO is more often used to predict novel drugs for an indication, it can also be used for the reverse, predicting new indications for an existing compound. This could be useful, for example, if one discovers a new small molecule in nature and wonders if it could have any pharmaceutical applications. `canpredict_indications` finds the most similar drugs to the compound in question, determines what those drugs are indicated to treat, and then returns any indications associated with multiple most similar drugs as potential uses. 

Key parameters of `canpredict_indications` are largely the same as those of `canpredict_compounds`, with three exceptions: the first argument is the Compound object, as opposed to the indication id; there is no `keep_associated` parameter; and there is an additional `sorting` parameter that takes a string to determine whether to rank the outputted indications by probability (`"prob"`) or score (`"score"`).

Let's look at the antimicrobial paromomycin as an example. First, we have to find its id:

In [23]:
cando.search_compound('paromomycin')

id	name
1225	paromomycin
207	capreomycin
198	mitomycin
7481	propoxycaine
102	azithromycin


Then, we get the Compound object and check which indications paromomycin is already associated with:

In [24]:
paro = cando.get_compound(1225)
for indic in paro.indications:
    print(cando.get_indication(indic).name, cando.get_indication(indic).id_)

AIDS-Related Opportunistic Infections MESH:D017088
Cryptosporidiosis MESH:D003457
Dysentery, Amebic MESH:D004404
Hepatic Encephalopathy MESH:D006501
Leishmaniasis, Cutaneous MESH:D016773
Leishmaniasis, Visceral MESH:D007898


It looks like paromomycin is associated with six indications, most of which are infections. We will use the top 10 most similar compounds to paromomycin (`n = 10`) and print the top 10 results (`topX = 10`).

In [25]:
cando.canpredict_indications(paro, n=10, topX=10)

Generating indication predictions for paromomycin...
  Compound id = 1225
  Compound index = 1225
  n = 10
  Printing the 10 highest predicted indications...

rank	probability	score	ind_id    	indication
1	5.22e-09   	4	MESH:D008581	Meningitis
2	4.72e-08   	4	MESH:D004927	Escherichia coli Infections
3	1.38e-07   	3	MESH:D007710	Klebsiella Infections
4	2.50e-07   	3	MESH:D011512	Proteus Infections
5	9.76e-07   	2	MESH:D000380	Agranulocytosis
6	1.20e-06   	3	MESH:D014376	Tuberculosis
7	1.71e-06   	2	MESH:D016870	Neisseriaceae Infections
8	2.72e-06   	2	MESH:D001996	Bronchopneumonia
9	2.72e-06   	2	MESH:D009877	Endophthalmitis
10	1.32e-05   	3	MESH:D004697	Endocarditis, Bacterial



Most of paromomycin's existing indications are infections, so it is unsurprising that nearly all of the predictions are also infection-related. As with the output of `canpredict_compounds`, the score indicates how many times an indication appears among the indications of the top 10 (or top n) most similar compounds to paromomycin.
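
The score column can be illustrated with a short sketch. This is a simplified reading of the description above, not the implementation of `canpredict_indications`: the function name `predict_indications`, the `min_score` parameter, and the toy data are assumptions made for the example, and the probability column is not computed.

```python
from collections import Counter


def predict_indications(similar_compounds, compound_indications, n=10, min_score=2):
    """Count how often each indication occurs among the n most similar compounds.

    Hypothetical helper: keeps indications supported by at least min_score
    similar compounds and ranks them by that count.
    """
    counts = Counter()
    for compound in similar_compounds[:n]:
        # A set ensures one compound cannot vote twice for the same indication.
        counts.update(set(compound_indications.get(compound, ())))

    ranked = [(ind, score) for ind, score in counts.items() if score >= min_score]
    ranked.sort(key=lambda item: (-item[1], item[0]))  # highest score first
    return ranked


# Toy data: the most similar compounds to a query, in order of similarity.
similar_compounds = ["cmpd_1", "cmpd_2", "cmpd_3", "cmpd_4"]
compound_indications = {
    "cmpd_1": ["ind_A", "ind_B"],
    "cmpd_2": ["ind_A"],
    "cmpd_3": ["ind_A", "ind_C"],
    "cmpd_4": ["ind_B"],
}

predictions = predict_indications(similar_compounds, compound_indications, n=4)
print(predictions)
```

Here `ind_C` is dropped because only one similar compound supports it, which matches the requirement that an indication be associated with multiple similar drugs.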

This section covered the most basic ways to predict a new drug for an indication and a new indication for a drug. To learn about additional drug discovery methods within CANDO, you may continue to the [Advanced topics](#Advanced-topics) section or consult [the CANDO documentation](https://github.com/ram-compbio/CANDO/blob/master/docs/CANDO-v2.2.pdf).

---

## Benchmarking

Benchmarking is essential for assessing how well CANDO is working, particularly when using a new dataset or prediction function, but also when looking at predictions for a specific indication in greater depth. Benchmarking ensures that our platform has predictive value and that we know how much to trust the predictions we are working with.

CANDO has a primary built-in benchmarking protocol, as well as multiple older benchmarking functions. We will focus on the primary benchmarking protocol, `canbenchmark_new`, here. This method uses a leave-one-out benchmarking protocol on every indication with two or more associated drugs. Every drug is withheld from its indication, and CANDO predicts novel drugs for that indication based on the remaining drugs. If the left out drug appears in these predictions, that is considered a success, as CANDO is able to predict the efficacy of that drug.
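
The leave-one-out idea can be sketched in a few lines. This is a toy illustration, not the code behind `canbenchmark_new`: the helper names `rank_candidates` and `leave_one_out_accuracy`, the simple consensus ranking, and the similar lists are all assumptions made for the example.

```python
from collections import defaultdict


def rank_candidates(similar_lists, known_drugs, n):
    """Rank compounds by how many top-n similar lists of known drugs contain them."""
    hits = defaultdict(list)
    for drug in known_drugs:
        for position, candidate in enumerate(similar_lists[drug][:n]):
            if candidate not in known_drugs:
                hits[candidate].append(position)
    # More lists first, then lower average position.
    return sorted(hits, key=lambda c: (-len(hits[c]), sum(hits[c]) / len(hits[c])))


def leave_one_out_accuracy(similar_lists, indication_drugs, n, cutoff):
    """Percent of drugs recovered within `cutoff` when each is withheld in turn."""
    successes = 0
    for held_out in indication_drugs:
        remaining = [d for d in indication_drugs if d != held_out]
        predictions = rank_candidates(similar_lists, remaining, n)
        if held_out in predictions[:cutoff]:
            successes += 1  # the withheld drug was predicted from the others
    return 100.0 * successes / len(indication_drugs)


# Toy similar lists: a, b, and c resemble each other; d resembles none of them.
similar_lists = {
    "a": ["b", "c", "x"],
    "b": ["a", "c", "y"],
    "c": ["a", "b", "x"],
    "d": ["x", "y", "z"],
}
indication_drugs = ["a", "b", "c", "d"]

for cutoff in (1, 2):
    print(cutoff, leave_one_out_accuracy(similar_lists, indication_drugs, n=3, cutoff=cutoff))
```

Drug `d` is never recovered because it does not appear in any other drug's similar list, so the accuracy for this toy indication cannot exceed 75% at any cutoff.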

The primary benchmarking function calculates four metrics:

- **New average indication accuracy (nAIA)** - float, average % of indicated drugs that can be predicted within a given cutoff
  - The "new" distinguishes these metrics from old average indication accuracy, as calculated by our older benchmarking functions
  - A **Control nAIA** is also provided; this is the nAIA we would anticipate if CANDO worked as well as random chance
- **Pairwise accuracy (PA)** - float, % of drugs that appear within a given cutoff in the similarity lists of drugs with the same indication
  - PA assesses similarity list quality, whereas nAIA assesses prediction quality
- **Indication coverage (IC)** - integer, the number of indications with a non-zero indication accuracy
- **Normalized discounted cumulative gain (nNDCG)** - float, scores the number of indicated drugs predicted within a given cutoff, prioritizing early recall

Higher scores in these metrics should lead to higher confidence in our predictions. Additional metrics, such as precision at a given rank or area under the receiver operating characteristic (AUROC/AUC), may also be calculated. See [the CANDO documentation](https://github.com/ram-compbio/CANDO/blob/master/docs/CANDO-v2.2.pdf) or [this paper on evaluating drug repurposing technologies](https://www.biorxiv.org/content/10.1101/2020.12.03.410274v1) for more information about metrics.
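
To show what "prioritization of early recall" means, the sketch below computes the textbook binary NDCG for a ranked list. It is meant only to convey the intuition: the function name `ndcg_at_k` is made up, and the nNDCG reported by `canbenchmark_new` may be normalized differently.

```python
import math


def ndcg_at_k(relevance, k):
    """Textbook binary NDCG: relevance holds 1 for an indicated drug, else 0,
    in ranked order (index 0 = top prediction)."""
    def dcg(flags):
        # Each hit is discounted by the log of its 1-based rank plus one.
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(flags))

    ideal = sorted(relevance, reverse=True)[:k]  # best possible ordering
    best = dcg(ideal)
    return dcg(relevance[:k]) / best if best else 0.0


early_hit = [1, 0, 0, 0, 0]  # indicated drug recovered at rank 1
late_hit = [0, 0, 0, 0, 1]   # same drug recovered at rank 5

print(ndcg_at_k(early_hit, 5))
print(ndcg_at_k(late_hit, 5))
```

Both lists recover the same drug within the top 5, so a plain accuracy at that cutoff would treat them equally; the discounted score rewards the list that recovers it earlier.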

`canbenchmark_new` has several key arguments:
- **file_name** - string, a descriptive name to be assigned to all benchmarking files created
- **n** - integer, the number of similar compounds to be considered per indicated compound (10 by default)
  - Use the same value for `n` as you used in `canpredict_compounds` to ensure your benchmarking results are applicable
- **indications** - list of string, a list of the subset of indications to be examined
  - If this argument is not used, all indications with at least two associated drugs will be assessed
- **approved** - boolean, if True (default), excludes all drugs that are not associated with any assessed indications from consideration
  - If a subset of indications is being used, **set this argument to False** or your results will be unrealistically optimistic
 
As with other CANDO functions, additional arguments are available. `canbenchmark_new` has not yet been added to the documentation, but some arguments are explained in the function header within cando.py.

We can choose to assess CANDO in two ways: we can assess the overall performance of CANDO, or we can assess a subset of indications of interest. The former is especially useful when we are using an unusual dataset that might affect the performance of CANDO. In this tutorial, we are using a matrix with only 64 proteins, which might alter performance, so it makes sense to run an overall assessment. We will give this assessment the name "tutorial_assessment" and use n=10, as we did when running `canpredict_compounds`. **Note that this function will take a long time to run.**

In [26]:
results = cando.canbenchmark_new('tutorial_assessment', n=10)

Begin running canbenchmark_new...
  Calculating scores...


100%|██████████| 1595/1595 [12:49<00:00,  2.07it/s] 


  Done calculating scores.
  Time to calculate scores: 13 min 49 s
  Compiling and saving results...


100%|██████████| 1595/1595 [01:26<00:00, 18.42it/s]


  Done compiling and saving results.

Summary
               top10   top25   top50  top100  top1464   top1%   top5%  \
nAIA           8.576  13.253  17.996  23.790  100.000  10.068  20.902   
control-nAIA   0.683   1.708   3.415   6.831  100.000   0.956   4.986   
PA             2.499   4.000   6.186  10.254  100.000   2.937   8.106   
IC           523.000 721.000 847.000 982.000 1595.000 607.000 923.000   
nNDCG          0.047   0.058   0.068   0.077    0.160   0.051   0.072   

               top10%   top50%  top100%  
nAIA           27.739   62.766  100.000  
control-nAIA    9.973   50.000  100.000  
PA             13.888   54.864  100.000  
IC           1074.000 1498.000 1595.000  
nNDCG           0.083    0.123    0.160  

Done running canbenchmark_new.
Total time to run canbenchmark_new: 14 min 16 s



The primary results of `canbenchmark_new` are printed out directly for you to read. Based on the nAIA results, across all indications, CANDO is able to recall 8.6% of indicated drugs in the top 10 compounds (out of 1464), 13.3% in the top 25, and 23.8% in the top 100. CANDO also outperforms random chance (control-nAIA) at every cutoff. In addition, the indication coverage (IC) tells us that 523 indications (out of 1595 assessed; see top100% column) have non-zero performance at the top 10 threshold, 721 at the top 25 threshold, and 982 at the top 100 threshold. Note that these results correspond to the 64-protein matrix used in this tutorial (plus the compound and indication mapping); as a general rule, performance is better when a larger protein set is considered.
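
The control-nAIA row is consistent with a simple reading of "random chance": the probability that a withheld drug lands in the top k of N compounds when ranks are assigned at random, k / N. The sketch below checks that arithmetic against the printed summary; the function name `random_control` is made up, and this is an observation about the numbers above rather than a statement of how `canbenchmark_new` computes the control.

```python
def random_control(cutoff, n_compounds):
    """Percent chance that a random ranking places a given drug within `cutoff`."""
    return 100.0 * cutoff / n_compounds


n_compounds = 1464  # number of ranked compounds in the summary above (top1464)
nAIA = {10: 8.576, 25: 13.253, 50: 17.996, 100: 23.790}  # copied from the summary

for cutoff, observed in nAIA.items():
    control = random_control(cutoff, n_compounds)
    # Fold improvement over random chance at this cutoff.
    print(cutoff, round(control, 3), round(observed / control, 1))
```

The ratio of nAIA to the control is a convenient way to express how far above chance the platform performs at each cutoff.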

Several additional files that may be of interest are also generated:

- **Summary file** - TSV, contains overall metrics; top-level folder with a name starting "summary"
- **Raw results** - CSV, contains the rank at which each indicated drug was predicted; raw_results folder
- **Pairwise results** - CSV, contains the ranks of each pair of drugs with the same indications in each others' similarity lists; pairwise_results folder
- **Results analysed** - TSV, contains indication accuracy or NDCG scores for each individual indication; results_analysed_named folder

These files can be viewed in the tutorial folder, and they can be opened in a text editor like Notepad or in certain spreadsheet editors, like Excel. For now, we will just look at the results of the summary file, which should reflect the results printed out when running `canbenchmark_new` above:

In [27]:
with open('summary-tutorial_assessment-10-approved.tsv', 'r') as f:
    print(f.read()) # Note: do NOT do this with the longer, non-summary files; it may crash or hang

	top10	top25	top50	top100	top1464	top1%	top5%	top10%	top50%	top100%
nAIA	8.57584	13.25254	17.99572	23.78991	100.00000	10.06810	20.90240	27.73863	62.76599	100.00000
control-nAIA	0.68306	1.70765	3.41530	6.83060	100.00000	0.95628	4.98634	9.97268	50.00000	100.00000
PA	2.49867	4.00044	6.18643	10.25412	100.00000	2.93720	8.10577	13.88756	54.86402	100.00000
IC	523.00000	721.00000	847.00000	982.00000	1595.00000	607.00000	923.00000	1074.00000	1498.00000	1595.00000
nNDCG	0.04706	0.05838	0.06752	0.07688	0.16049	0.05104	0.07240	0.08256	0.12346	0.16049
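
If you would rather work with these numbers than read them, the summary file can be parsed with the standard `csv` module. The sketch below uses an in-memory copy of two rows and two columns from the output above so that it runs on its own; the helper name `load_summary` is made up, and with the real file you would pass `open('summary-tutorial_assessment-10-approved.tsv')` instead of the `StringIO` object.

```python
import csv
import io


def load_summary(handle):
    """Parse a tab-separated summary into {metric: {cutoff: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    cutoffs = next(reader)[1:]  # the first header cell is empty
    return {
        row[0]: dict(zip(cutoffs, map(float, row[1:])))
        for row in reader
        if row  # skip blank trailing lines
    }


# Two rows and two columns copied from the printed summary above.
sample = (
    "\ttop10\ttop25\n"
    "nAIA\t8.57584\t13.25254\n"
    "control-nAIA\t0.68306\t1.70765\n"
)

summary = load_summary(io.StringIO(sample))
fold = summary["nAIA"]["top10"] / summary["control-nAIA"]["top10"]
print(round(fold, 1))  # fold improvement over the random control at top10
```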



The other reason one might want to complete benchmarking is to determine how well CANDO performs on a given indication. For example, earlier we were looking into HIV infections, MESH:D015658. Let's replicate those results here.

In [28]:
cando.canpredict_compounds("MESH:D015658", n=10)

32 compounds found for MESH:D015658 --> HIV Infections
Generating compound predictions using top10 most similar compounds...

rank	score1	score2	probability	id	approved	name
1	7	4.3	7.47e-13   	1128	true    	darunavir
2	4	3.0	2.09e-07   	756	true    	emtricitabine
3	4	5.0	2.09e-07   	331	true    	entecavir
4	4	5.2	2.09e-07   	2982	true    	brivudine
5	4	5.8	2.09e-07   	7854	true    	lifitegrast
6	4	5.8	2.09e-07   	321	true    	trifluridine
7	4	6.2	2.09e-07   	574	true    	moexipril
8	4	7.0	2.09e-07   	4817	true    	regadenoson
9	4	7.0	2.09e-07   	143	true    	idoxuridine
10	3	2.0	9.14e-06   	7256	true    	cobicistat




We have the same 10 results as before, but now we want to estimate how trustworthy these results are. We can use `canbenchmark_new` with the same `n` value as we used in our prediction to get such an estimate. Since we *only* want to assess how well CANDO performs on HIV here, we can pass a list with only the MeSH ID we are interested in to the `indications` parameter. Finally, since we are using a subset of indications, we should set `approved` to False.

In [29]:
results = cando.canbenchmark_new("HIV_assessment", n=10, indications=["MESH:D015658"], approved=False)

Begin running canbenchmark_new...
  Calculating scores...


100%|██████████| 1/1 [00:01<00:00,  1.79s/it]


  Done calculating scores.
  Time to calculate scores: 2 s
  Compiling and saving results...


100%|██████████| 1/1 [00:00<00:00,  8.02it/s]

  Done compiling and saving results.

Summary
             top10  top25  top50 top100 top2449  top1%  top5% top10% top50%  \
nAIA         9.375 21.875 21.875 40.625 100.000 21.875 43.750 53.125 71.875   
control-nAIA 0.408  1.021  2.042  4.083 100.000  0.980  4.982  9.963 49.980   
PA           3.327  6.552  8.972 11.694 100.000  6.351 12.399 17.944 43.750   
IC           1.000  1.000  1.000  1.000   1.000  1.000  1.000  1.000  1.000   
nNDCG        0.057  0.088  0.088  0.119   0.182  0.088  0.124  0.136  0.157   

             top100%  
nAIA         100.000  
control-nAIA 100.000  
PA           100.000  
IC             1.000  
nNDCG          0.182  

Done running canbenchmark_new.
Total time to run canbenchmark_new: 2 s






We found that 9.375% of drugs associated with HIV infections, or 3 of the 32 drugs, were ranked within the top 10 compounds predicted for this indication, as compared to 2449 compounds ranked overall. If you recall, this is also what we found when we ran `canpredict_compounds` with `keep_associated=True`. When we look at the top 25 compounds, this increases to 21.875%, or 7 out of 32 drugs.

Based on these results, we can expect that not all compounds effective against HIV infections are going to be in our top 10 predictions, which is unsurprising. However, by comparing our nAIA to the control nAIA, we can see that CANDO is performing far better than random chance on predicting effective drugs for HIV. CANDO also performs better on HIV infections than on indications in general. Therefore, it is reasonable to use CANDO to predict novel drugs for HIV infections.

Historically, we have assessed the performance of CANDO by calculating accuracies for the similar lists of every compound within an indication. This is the idea behind the original CANDO benchmarking algorithm, `canbenchmark`. Although the newer `canbenchmark_new` is more relevant to most applications, the original algorithm is preserved for posterity and for potential use in the development of CANDO.

As an example of the original `canbenchmark`, let's look at hyperparathyroidism, MESH:D006961. There are four drugs associated with this indication: cinacalcet (884), paricalcitol (786), dihydrotachysterol (938), and calcitriol (32). As it turns out, cinacalcet does not have any of the other three drugs in its top 10 most similar compounds. However, paricalcitol, dihydrotachysterol, and calcitriol are all in each other's top 10 most similar compounds lists. Since three out of the four compounds' similar lists have at least one other indicated compound in the top 10, we calculate an indication accuracy of 75% at the top 10 threshold for hyperparathyroidism. This can be repeated for multiple rank thresholds (top 25, 100, etc) and for every other indication with at least two associated compounds. From this, the unweighted AIA, weighted APA, and IC can be calculated for these thresholds. 
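
The hyperparathyroidism calculation can be reproduced with a short sketch. The neighbour lists below are invented and shortened to three entries; only the pattern described above (cinacalcet has no indicated neighbours, the other three drugs find each other) is taken from the text, and `indication_accuracy` is a hypothetical helper rather than a CANDO function.

```python
def indication_accuracy(similar_lists, indication_drugs, cutoff):
    """Percent of indicated drugs whose top-`cutoff` similar list contains
    at least one other drug with the same indication."""
    drugs = set(indication_drugs)
    hits = 0
    for drug in indication_drugs:
        top = similar_lists[drug][:cutoff]
        if any(other in drugs and other != drug for other in top):
            hits += 1
    return 100.0 * hits / len(indication_drugs)


# Invented neighbour lists that follow the pattern described in the text.
similar_lists = {
    "cinacalcet": ["other_1", "other_2", "other_3"],
    "paricalcitol": ["calcitriol", "other_1", "dihydrotachysterol"],
    "dihydrotachysterol": ["other_2", "paricalcitol", "calcitriol"],
    "calcitriol": ["paricalcitol", "dihydrotachysterol", "other_3"],
}
hyperparathyroidism = list(similar_lists)

print(indication_accuracy(similar_lists, hyperparathyroidism, cutoff=10))
```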

Doing this work by hand is tedious and repetitive, which is why we created `canbenchmark` to rapidly calculate these metrics for every indication. `canbenchmark` only requires one argument, a string that determines what the output files it creates will be named. The protein (or other interaction) matrix, compound list, and compound-indication associations given to the CANDO object as input may all affect the benchmarking performance.

In [30]:
cando.canbenchmark('tutorial')

100%|██████████| 2214/2214 [01:48<00:00, 20.40it/s]

	aia
top10	20.316
top25	26.808
top50	32.635
top100	41.836
top2449	100.000
top1%	26.186
top5%	44.843
top10%	57.426
top50%	89.378
top100%	100.000







Besides the abbreviated results it prints out, `canbenchmark` creates three additional files: a raw results file that contains information on every compound-indication pair (found at raw_results/raw_results-tutorial.csv), an analyzed results file that contains every individual indication accuracy (found at results_analysed_named/results_analysed_named-tutorial.csv), and a summary file containing overall metrics (found at summary-tutorial.tsv). You can open these results from your file navigator into spreadsheet software like Excel or text editors like Notepad, and these files can also be directly opened and examined via your code, as below:

In [31]:
with open('summary-tutorial.tsv', 'r') as f:
    print(f.read()) # Note: do NOT do this with the longer, non-summary files; it may crash or hang

	top10	top25	top50	top100	top2449	top1%	top5%	top10%	top50%	top100%
aia	20.316	26.808	32.635	41.836	100.000	26.186	44.843	57.426	89.378	100.000
apa	35.180	47.112	57.909	70.206	100.000	46.230	73.574	83.817	97.693	100.000
ic	819	907	969	1078	1595	904	1112	1254	1532	1595




As we can see, we have an AIA of 20% at the top 10 threshold, meaning that, across all indications, about 20% of compounds had at least one other compound with the same indication in their top 10 most similar lists. The APA is higher because the chance of recovering an indicated compound in the top ranks is higher when there are more indicated compounds, so more heavily weighting larger indications leads to a better result. Indication coverage is 819, meaning that 819 out of 1595 indications with at least two associated compounds had an indication accuracy greater than 0%.
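
The effect of weighting can be seen in a small example. The sketch below assumes that the weighted metric pools hits over all compounds, so that larger indications count for more; the helper name `unweighted_and_weighted` and the per-indication counts are invented, and the exact weighting used by `canbenchmark` may differ.

```python
def unweighted_and_weighted(per_indication):
    """per_indication: list of (hits, n_drugs) pairs, one per indication.

    Returns the plain mean of the indication accuracies and the pooled
    accuracy, in which each indication counts in proportion to its size.
    """
    accuracies = [100.0 * hits / n_drugs for hits, n_drugs in per_indication]
    unweighted = sum(accuracies) / len(accuracies)
    total_hits = sum(hits for hits, _ in per_indication)
    total_drugs = sum(n_drugs for _, n_drugs in per_indication)
    weighted = 100.0 * total_hits / total_drugs
    return unweighted, weighted


# Invented counts: two small indications and one large, well-recovered one.
per_indication = [(3, 4), (0, 2), (18, 20)]
unweighted, weighted = unweighted_and_weighted(per_indication)
print(round(unweighted, 1), round(weighted, 1))
```

The large indication dominates the pooled value, which is why a weighted average tends to sit above the unweighted one when large indications are recovered well.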

---

# Advanced topics

This section will cover more advanced and customizable ways of working with CANDO. Note that this section will not go into the same depth as the basic tutorial, as it is assumed you already understand the basics of CANDO.

## Customizing dataset

### Interaction matrix

The example interaction matrix, `matrix_file`, has already been downloaded via `get_tutorial()`. Generating a matrix yourself with `generate_matrix()` may take anywhere from ~1-10 mins on 3 cores, depending on your computer's processor.

In this step we will generate a matrix of 2,449 approved compounds by 64 proteins, populated with the corresponding interaction score between each drug and protein. The final matrix will have drugs as the columns (indexed according to the compound mapping file), and proteins as the indices (indexed by PDB and chain ID).

The function `generate_matrix()` first creates a dataframe of chemical fingerprint similarity scores comparing each compound in the specified library version, `v` (or only approved drugs with `approved_only=True`), to every potential binding site ligand from the PDB. RDKit is used to compute the chemical fingerprints, and `fp` denotes the type/radius of fingerprint (`rd_ecfp4`, `rd_ecfp8`, etc.). The vector type, `vect`, can be binary (`"1024_bit"` or `"2048_bit"`) or integer (`"int"`), denoting the presence or absence of molecular substructures or their counts, respectively. If using integer vectors, `dist` should be set to `"dice"`; with binary vectors, it can be set to Tanimoto (`"tani"`).

Next, it creates a dataframe of all potential binding sites for each protein in the specified protein library, `protlib`, with their corresponding binding site scores from the specified binding site prediction method, `bs` (`"coach"`, `"cof"`, `"ssite"`, `"tms"`). The function then iterates over all drugs and protein binding sites to populate the matrix with the best score based on the following input parameters:

1. `i_score` - the scoring protocol of choice, which can be `'C'`, `'dC'`, `'P'`, `'CxP'`, or `'dCxP'`, where C is the fingerprint similarity score (either Tanimoto or Sørensen-Dice coefficient), dC is the percentile of the C score for the compound compared to all ligands in the library, and P is the binding site score associated with the ligand predicted by the COACH algorithm
2. `c_cutoff` and `p_cutoff` - cutoffs below which any C or P scores, respectively, are ignored
3. `percentile_cutoff` - similar to `c_cutoff` but applied to the dC score (overrides `c_cutoff` if not None)

These scores serve as a proxy for binding strength/probability of the drug and protein target. We then output the matrix to a tsv file, `out_file`, which in this case is "tutorial_matrix-all.tsv" (`out_path` can be set to write the file to a specific directory).
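
The two similarity measures named by the `dist` argument have standard definitions, sketched below in plain Python. Binary fingerprints are represented as sets of "on" bit positions and integer fingerprints as dictionaries of substructure counts; these representations and the function names are choices made for this example, whereas CANDO itself relies on RDKit for the real fingerprints and similarity calculations.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for binary fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_counts(counts_a, counts_b):
    """Sorensen-Dice coefficient for count fingerprints given as
    {substructure_id: count} dictionaries."""
    shared_keys = counts_a.keys() & counts_b.keys()
    shared = sum(min(counts_a[k], counts_b[k]) for k in shared_keys)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 2.0 * shared / total if total else 0.0


# Toy fingerprints: substructure identifiers are arbitrary integers.
binary_a, binary_b = {1, 2, 3}, {2, 3, 4}
counts_a, counts_b = {101: 2, 102: 1}, {101: 1, 103: 3}

print(tanimoto(binary_a, binary_b))
print(dice_counts(counts_a, counts_b))
```

Both measures range from 0 (no shared substructures) to 1 (identical fingerprints); the count-based Dice score also rewards molecules that share the same substructure multiple times.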

We will create a matrix using the 'CxP' protocol, which just chooses the top Dice score to any ligand predicted to bind to a given protein, without any cutoffs. This function is parallelized, so setting the variable ncpus will change the number of processors that are used for this function. NOTE: percentile cutoff protocols ('dC', 'dCxP') take much longer to compute than 'C' and 'P' protocols. 
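
As a rough illustration of how a single matrix entry might be chosen, the sketch below takes the (C, P) pairs for every ligand predicted to bind one protein and keeps the best protocol score after applying the cutoffs. This is one reading of the description above, covering only the `'C'` and `'CxP'` protocols; the function name `best_interaction_score` and the numbers are invented, and the implementation in cando.py may select the ligand differently.

```python
def best_interaction_score(ligand_hits, i_score="CxP", c_cutoff=0.0, p_cutoff=0.0):
    """Best score for one compound-protein pair.

    ligand_hits: iterable of (C, P) pairs, where C is the fingerprint
    similarity of the compound to a binding site ligand and P is the
    score of that binding site.
    """
    if i_score not in ("C", "CxP"):
        raise ValueError("this sketch only covers the 'C' and 'CxP' protocols")

    best = 0.0
    for c, p in ligand_hits:
        if c < c_cutoff or p < p_cutoff:
            continue  # ignore scores below either cutoff
        score = c if i_score == "C" else c * p
        best = max(best, score)
    return best


# Invented (C, P) pairs for three binding site ligands of one protein.
ligand_hits = [(0.8, 0.2), (0.5, 0.6), (0.3, 0.9)]

print(best_interaction_score(ligand_hits, i_score="C"))
print(best_interaction_score(ligand_hits, i_score="CxP"))
print(best_interaction_score(ligand_hits, i_score="CxP", c_cutoff=0.6))
```

Note how the winning ligand changes with the protocol: the most similar ligand wins under `'C'`, while under `'CxP'` a less similar ligand with a more confident binding site can score higher.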

In [32]:
# generate example cando interaction matrix (2,449 compounds x 64 proteins)
cnd.generate_matrix(v="test.0", fp="rd_ecfp4", vect="int",
                    dist="dice", org="tutorial", bs="coach",
                    c_cutoff=0.0, p_cutoff=0.0, percentile_cutoff=0.0,
                    i_score="dCxP", out_file='', out_path=".",
                    nr_ligs=True, approved_only=True,
                    lib_path='', prot_path='', lig_name=False, ncpus=ncpus)

Generating CANDO matrix...


100%|██████████| 1/1 [00:00<00:00, 1166.06it/s]
100%|██████████| 52/52 [00:14<00:00,  3.50it/s]


  Matrix written to ./rd_ecfp4-int-dice-tutorial-coach-c0.0-p0.0-dCxP-approved.tsv.
Matrix generation completed in 17 s.



### Novel compound

The CANDO platform contains an extensive library of approved drugs and other compounds from DrugBank. However, if you wish to predict indications or similar drugs for a compound that is not present in our library, we make it possible with the `generate_signature()` function.

First, you must have the compound properly formatted in mol file format. There are many programs that provide conversion between many chemical file formats, such as OpenBabel.

Next, run `generate_signature()`. This will populate a tsv file with Tanimoto/Sørensen-Dice similarity scores of the provided compound to all binding site ligands in our database. These values will be used for the generation of the drug-proteome signature. The input parameters for this function are very similar to those for `generate_matrix()`; however, the first argument is the path to the compound structure file in mol format. The output signature file will be saved in tsv format with the name you provide for "out_file" (with the appended path from "out_path").

**NOTE: the interaction score protocol must match the input matrix protocol, which was 'CxP' as above. If we used a different protocol, these scores would not be directly comparable.**

In [33]:
cmpd_file = "lmk235.mol"
signature_file = "lmk235_signature.tsv"

cnd.generate_signature(cmpd_file, fp="rd_ecfp4", vect="int", dist="dice", 
                      org="tutorial", bs="coach", c_cutoff=0.0, p_cutoff=0.0, 
                      percentile_cutoff=0.0, i_score="CxP", out_file=signature_file, 
                      out_path=".", nr_ligs=True)

Generating CANDO signature...




Signature written to ./lmk235_signature.tsv.
signature generation took 1 s to finish.


array([0.1848    , 0.16993631, 0.13525424, 0.15766423, 0.10884956,
       0.12212766, 0.09036145, 0.091     , 0.14578313, 0.09340659,
       0.2705618 , 0.09791209, 0.05578947, 0.09369231, 0.        ,
       0.1248227 , 0.18481283, 0.19      , 0.26209677, 0.23526627,
       0.19830986, 0.08827586, 0.14720721, 0.15319149, 0.27016129,
       0.20377358, 0.13428571, 0.15909091, 0.14285714, 0.18481928,
       0.195     , 0.14853147, 0.15454545, 0.19417476, 0.14193548,
       0.17648352, 0.15144385, 0.        , 0.11092437, 0.15469027,
       0.12273292, 0.17103448, 0.0980198 , 0.14978102, 0.21375   ,
       0.11052632, 0.02769231, 0.2381457 , 0.07590361, 0.09912088,
       0.        , 0.144     , 0.110625  , 0.12048193, 0.15954545,
       0.16819672, 0.12082645, 0.21666667, 0.13175182, 0.1890411 ,
       0.17088608, 0.1275    , 0.15580645, 0.26373494])

We then must add the compound to the existing `CANDO` object with the `add_cmpd()` function. The inputs are the newly generated signature file and the desired name (which below is "lmk-235"). 

In [34]:
cando.add_cmpd(signature_file, new_name='lmk-235')

New compound is lmk-235
New compound has id 10674 and index 10674.



Now that our new compound is added to the platform, we can see which other compounds it is most similar to. We can use the `similar_compounds()` function from before to print those compounds.

In [35]:
lmk235 = cando.get_compound(10674)
cando.similar_compounds(lmk235, n=10)

Generating most similar compounds for lmk-235...
Compound id = 10674
Compound index = 10674
  Printing 10 most similar compounds...

  rank	dist	id	name
  1	0.000	0	bivalirudin
  2	0.000	1	leuprolide
  3	0.000	2	goserelin
  4	0.000	3	gramicidin_d
  5	0.000	4	desmopressin
  6	0.000	5	cetrorelix
  7	0.000	6	daptomycin
  8	0.000	7	cyclosporine
  9	0.000	9	octreotide
  10	0.000	10	abarelix
  11	0.000	11	pyridoxal_phosphate
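
For reference, the `dist` column is a distance between interaction signatures, and this tutorial's CANDO object was created with `dist_metric='cosine'`. The sketch below shows the standard cosine distance and a simple nearest-neighbour ranking on toy two-protein signatures; the function names, the toy data, and the convention of returning 1.0 for an all-zero signature are assumptions made for the example, not CANDO's implementation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length signatures."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 1.0  # assumption: treat an all-zero signature as maximally distant
    return 1.0 - dot / norm


def most_similar(query, signatures, n):
    """Return the n compounds whose signatures are closest to `query`."""
    ranked = sorted((cosine_distance(query, sig), name) for name, sig in signatures.items())
    return [(name, round(dist, 3)) for dist, name in ranked[:n]]


# Toy signatures over two proteins.
query = [1.0, 0.0]
signatures = {
    "same_direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}

print(most_similar(query, signatures, n=2))
```

Cosine distance depends only on the direction of a signature, not its magnitude, so a compound whose scores are uniformly twice as large as the query's is still at distance 0.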




Finally, we can predict potential indications for which our new compound may be useful. We can use `canpredict_indications()` as before to print those results.

In [36]:
cando.canpredict_indications(lmk235, n=25, topX=15)

Generating indication predictions for lmk-235...
  Compound id = 10674
  Compound index = 10674
  n = 25
  Printing the 15 highest predicted indications...

rank	probability	score	ind_id    	indication
1	9.41e-07   	2	MESH:D006474	Hemorrhagic Disorders
2	9.41e-07   	2	MESH:D008661	Metabolism, Inborn Errors
3	9.41e-07   	2	MESH:D020335	Paraparesis
4	7.59e-05   	2	MESH:D013118	Spinal Cord Diseases
5	1.47e-04   	2	MESH:D003711	Demyelinating Diseases
6	1.47e-04   	2	MESH:D005235	Fatty Liver, Alcoholic
7	1.95e-04   	2	MESH:D007889	Leiomyoma
8	2.07e-04   	6	MESH:D007249	Inflammation
9	3.18e-04   	2	MESH:D065626	Non-alcoholic Fatty Liver Disease
10	4.82e-04   	2	MESH:D004715	Endometriosis
11	6.64e-04   	3	MESH:D008106	Liver Cirrhosis, Experimental
12	8.18e-04   	2	MESH:D013921	Thrombocytopenia
13	1.11e-03   	2	MESH:D002779	Cholestasis
14	1.12e-03   	3	MESH:D006528	Carcinoma, Hepatocellular
15	3.18e-03   	2	MESH:D004195	Disease Models, Animal



Interestingly, both Hypertension (MESH:D006973) and Pain (MESH:D010146) are top predictions - the analgesic and hypotensive properties of lmk-235 are both supported by in vivo studies in the literature. 

### Compound library

In order to create a new compound set, you must first have a TSV (tab separated values) file that contains one of two chemical file types:

1. SMILES - `file_type='smi'`. The file must have the SMILES string as the first column and the corresponding compound name in the second column, e.g.
    `C1CNCCN(C1)S(=O)(=O)C2=CC=CC3=C2C=CN=C3 fasudil`
    
2. Mol - `file_type='mol'`. The file must list the name of each mol file, without the file extension. In addition, the path to the directory containing the mol files must be given in the argument `cmpd_dir`.
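
For the SMILES option, a file in this layout can be written with the standard `csv` module. The sketch below writes to an in-memory buffer so that it runs on its own; the helper name `write_smiles_file` is made up, the single fasudil entry is the example given above, and with a real library you would pass an open file handle (for example `open('my_set.smi', 'w', newline='')`, a hypothetical filename) instead.

```python
import csv
import io


def write_smiles_file(handle, compounds):
    """Write (SMILES, name) pairs as tab-separated lines, SMILES first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for smiles, name in compounds:
        writer.writerow([smiles, name])


compounds = [
    ("C1CNCCN(C1)S(=O)(=O)C2=CC=CC3=C2C=CN=C3", "fasudil"),
]

buffer = io.StringIO()
write_smiles_file(buffer, compounds)
print(buffer.getvalue(), end="")
```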

In our example, we will create a new compound library of select tyrosine kinase inhibitors (TKIs) that we will then use to create a new matrix with just these drugs. First, we generate the library.

In [37]:
cnd.add_cmpds("tki_set-test.smi", file_type='smi', fp="rd_ecfp4", vect="int", cmpd_dir=".", v='tki')

Creating new compound library tki...
The library will be built at /projects/academic/rams/zmfalls/src/CANDO-dev/tutorial/tki.


100%|██████████| 1/1 [00:00<00:00, 105.87it/s]
100%|██████████| 1/1 [00:00<00:00, 92.13it/s]


    Adding compound 0 - fasudil


AttributeError: 'DataFrame' object has no attribute 'append'

We have now generated a new compound library, located at `./tki/cmpds`. This location is partially hardcoded, so it will also create a directory in your current working directory, `./`, named after the `v` you choose, `./tki`. This makes it easier for the end-user.

Notice we chose to use the rdkit fingerprint ecfp4, `fp="rd_ecfp4"`, and the vector type int, `vect="int"`. We need to keep track of this for our matrix generation.

Next we will generate a matrix using this **tki** compound library. The generate matrix function can take in a customized compound library, as opposed to the pregenerated/downloaded libraries based upon our predefined `v`, e.g. v2.2, v2.3, etc. By setting `v` to the same name you used in the previous `add_cmpds` function, you can load those compounds to create a new matrix.

We will set `v='tki'` and `lib_path='.'`. This means we will use the **tki** library, which is located in the current working directory. In addition, since we created our compound library using fingerprint type ecfp4, we need to make sure we use it here as well. The remaining arguments have already been discussed in the previous [Interaction matrix](#Interaction-matrix) section.

In [None]:
cnd.generate_matrix(v="tki", lib_path='.',
                    fp="rd_ecfp4", vect="int",
                    dist="dice", org="tutorial", bs="coach",
                    c_cutoff=0.0, p_cutoff=0.0, percentile_cutoff=0.0,
                    i_score="CxP", out_file='', out_path="",
                    nr_ligs=True, approved_only=False,
                    lig_name=False, ncpus=ncpus)

We now have a brand new matrix with all of our new TKIs scored against our set of 64 test proteins. This file is located at `./tki/matrices/`. Again, these paths are mostly hardcoded. You can control where the matrix is written and what it is named using the `out_path` and `out_file` arguments. When these arguments are set to `''`, the hardcoded paths are used, as shown in this example.
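The default matrix file name loaded in the next cell appears to encode the arguments passed to `generate_matrix()`. The sketch below rebuilds that path from the arguments; the pattern is inferred from the single file name used in this tutorial, and `default_matrix_path` is a hypothetical helper, not a CANDO function, so treat it as an assumption rather than documented behavior.

```python
import posixpath


def default_matrix_path(v, lib_path, fp, vect, dist, org, bs,
                        c_cutoff, p_cutoff, i_score):
    """Guess the default matrix path (pattern inferred, not confirmed)."""
    name = (f"{fp}-{vect}-{dist}-{org}-{bs}-"
            f"c{c_cutoff}-p{p_cutoff}-{i_score}.tsv")
    # The matrix is written under <lib_path>/<v>/matrices/
    return posixpath.normpath(posixpath.join(lib_path, v, "matrices", name))


path = default_matrix_path(v="tki", lib_path=".", fp="rd_ecfp4", vect="int",
                           dist="dice", org="tutorial", bs="coach",
                           c_cutoff=0.0, p_cutoff=0.0, i_score="CxP")
print(path)
```

If the printed path does not exist after running `generate_matrix()`, list the contents of `./tki/matrices/` to find the actual file name.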

Now we will load the newly created matrix into a CANDO object.

In [None]:
# Set compound mapping variable to the new TKI compound mapping file
tki_map = 'tki/mappings/cmpds-tki.tsv'
tki_matrix = 'tki/matrices/rd_ecfp4-int-dice-tutorial-coach-c0.0-p0.0-CxP.tsv'
# Create CANDO object using the new compound mapping and TKI matrix
tki_cando = cnd.CANDO(tki_map, ind_map, matrix=tki_matrix, compound_set='all', compute_distance=True,
                      dist_metric=dist_metric, ncpus=ncpus)

Check the new CANDO object data and the most similar compounds to the first TKI.

In [None]:
# print cando object stats
print('compounds', len(tki_cando.compounds))
print('indications', len(tki_cando.indications))
print('proteins', len(tki_cando.proteins))
print('')

# print first TKI name and signature
c = tki_cando.compounds[0]
print(c.name, len(c.sig))
print(c.sig)
print('')

# top5 most similar compounds to first TKI
for s in c.similar[0:5]:
    print(s[0].name, round(s[1], 3))
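The `similar` list read in the cell above pairs each neighboring compound with a number, which we take to be its distance to the query compound under the chosen `dist_metric` (here `'cosine'`). As a rough illustration of that metric, the sketch below computes cosine distance between toy signatures and sorts the neighbors. The compound names and scores are made up, and this is not CANDO's own implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for all-zero signatures
    return 1.0 - dot / (norm_u * norm_v)


# Toy interaction signatures (illustrative values only)
signatures = {
    "cmpd_a": [0.9, 0.1, 0.0],
    "cmpd_b": [0.8, 0.2, 0.0],
    "cmpd_c": [0.0, 0.1, 0.9],
}

query = "cmpd_a"
# Smallest distance first, mirroring a "most similar" ranking
similar = sorted(
    ((name, cosine_distance(signatures[query], sig))
     for name, sig in signatures.items() if name != query),
    key=lambda pair: pair[1],
)
for name, dist in similar:
    print(name, round(dist, 3))
```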

---

## Advanced drug prediction

### De novo prediction

Sometimes there are no drugs associated with an indication (whether in reality or in your input indication mapping); in those cases we can use the `canpredict_denovo()` function to suggest putative candidates. For each compound, this function counts or sums the protein interaction scores above a set threshold and ranks the compounds by the frequency and strength of those interactions. This is particularly useful for pathogen proteomes (such as SARS-CoV-2 or bacterial proteomes) or for finding which compounds target a subset of proteins of interest (e.g. kinases). Our sample matrix contains a diverse set of proteins, so running the function on all of them would give meaningless results. Instead, we can input a list of the protein IDs in which we are interested. For convenience, we have precompiled a list of bacterial proteins already present in the sample matrix. Note: if we had read in an indication-genes mapping file, we could use the `ind_id=` parameter to automatically select all proteins associated with a given indication.
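To make the counting and summing idea concrete, here is a simplified sketch on toy data. It is not CANDO's implementation: the function `rank_denovo`, the `'count'` option, the strict `>` comparison, and the compound names and scores are all assumptions made for illustration. The protein IDs are taken from the bacterial list used in the next cell.

```python
def rank_denovo(signatures, proteins, threshold=0.6, method="sum", top=3):
    """Rank compounds by interaction scores above `threshold`.

    signatures: dict mapping compound name -> {protein id: score}
    method: 'sum' adds the qualifying scores, 'count' counts them
    """
    ranked = []
    for name, sig in signatures.items():
        hits = [sig[p] for p in proteins if sig.get(p, 0.0) > threshold]
        value = sum(hits) if method == "sum" else len(hits)
        ranked.append((name, len(hits), value))
    # Highest value first; ties are broken by the number of hits
    ranked.sort(key=lambda row: (row[2], row[1]), reverse=True)
    return ranked[:top]


# Toy scores (illustrative values only)
signatures = {
    "cmpd_a": {"3mk7C": 0.9, "3eziA": 0.7, "1u2mC": 0.2},
    "cmpd_b": {"3mk7C": 0.65, "3eziA": 0.1, "1u2mC": 0.1},
    "cmpd_c": {"3mk7C": 0.5, "3eziA": 0.55, "1u2mC": 0.59},
}
proteins = ["3mk7C", "3eziA", "1u2mC"]

ranked = rank_denovo(signatures, proteins, threshold=0.6, method="sum")
for name, n_hits, value in ranked:
    print(name, n_hits, round(value, 3))
```

Raising the threshold favors compounds with a few strong interactions, while lowering it rewards compounds with many moderate ones.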

In [None]:
bacterial_proteins = ["3mk7C", "3eziA", "1u2mC", "3atsA", "2x4mA", 
                      "4wliA", "1t4aA", "4zxkA", "1zhhA", "1eb0A", 
                      "4nqwB", "2gqrA", "2qlcA", "3rf1A", "2xpwA"]

cando.canpredict_denovo(method='sum', threshold=0.6, topX=30, proteins=bacterial_proteins)

### Machine learning

The "proteomic vectors" within CANDO lend themselves well to machine learning, which may capture more complex relationships between the proteins in the vector and their impact on the treatment of diseases. CANDO has built-in ML algorithms that provide two main functionalities:
1. Benchmark the platform using a hold-one-out protocol very similar to canbenchmark
2. Make predictions for novel or non-associated compounds that may be therapeutic for a given disease

The ML module currently supports two algorithms: random forests and logistic regression. The models are trained on drugs approved for the disease (positive samples) and an equal number of randomly selected "neutral samples", which are drugs/compounds not approved for the disease (negative samples). Random seeds may be set to ensure the same compounds are used in training.
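The balanced sampling described above can be sketched as follows. This is an illustration under assumptions, not the code CANDO runs: `build_training_set` and the compound names are hypothetical, and the only point is that a fixed seed reproduces the same negative samples.

```python
import random


def build_training_set(all_compounds, approved_for_disease, seed=50):
    """Return (compounds, labels) with equal positives and negatives."""
    positives = sorted(approved_for_disease)
    # Candidate negatives: every compound not approved for the disease
    candidates = sorted(set(all_compounds) - set(positives))
    rng = random.Random(seed)  # fixed seed gives the same negatives each run
    negatives = rng.sample(candidates, k=len(positives))
    labels = [1] * len(positives) + [0] * len(negatives)
    return positives + negatives, labels


all_compounds = [f"cmpd_{i}" for i in range(10)]
approved = ["cmpd_0", "cmpd_1", "cmpd_2"]

train_ids, train_labels = build_training_set(all_compounds, approved, seed=50)
print(list(zip(train_ids, train_labels)))
```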

We have the option to benchmark the platform with an ML algorithm; this module outputs files very similar to those of `canbenchmark`. For this tutorial, we will skip this function, as it takes a great deal of time to complete (it trains a separate model for essentially every drug-disease association). The command to do so with a random forest classifier is below; feel free to run it! The `out=` flag defines the name of the output files. Again, only diseases with 2+ associated compounds are benchmarked.

`cando.ml(method='rf', benchmark=True, seed=50, out='test_rf')`
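To see why this benchmark is slow, the sketch below enumerates hold-one-out splits on a toy mapping: each approved drug is held out in turn and a model would be trained on the remaining drugs for that disease. The generator `hold_one_out_splits` and the data are hypothetical; the actual protocol inside CANDO may differ in its details.

```python
def hold_one_out_splits(ind_to_cmpds):
    """Yield (indication, held_out, training_compounds) for each association."""
    for ind, cmpds in ind_to_cmpds.items():
        if len(cmpds) < 2:
            continue  # nothing left to train on if the only drug is held out
        for held_out in cmpds:
            train = [c for c in cmpds if c != held_out]
            yield ind, held_out, train


# Toy indication -> approved compounds mapping (illustrative only)
ind_to_cmpds = {
    "ind_a": ["cmpd_1", "cmpd_2", "cmpd_3"],
    "ind_b": ["cmpd_4"],  # skipped: fewer than 2 compounds
}

splits = list(hold_one_out_splits(ind_to_cmpds))
for ind, held_out, train in splits:
    print(ind, held_out, train)
print("models to train:", len(splits))
```

The number of models equals the number of drug-disease associations among qualifying diseases, which grows quickly on a full indication mapping.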

We can also use this module to predict whether a certain compound may be therapeutic for a given disease. The `predict=` flag specifies a list of compounds that we wish to classify. Let's use three drugs, imatinib, buprenorphine, and lisdexamfetamine, and see if they are predicted to have antibiotic activity using a random forest classifier.

In [None]:
baci = cando.get_indication('MESH:D001424')

imat = cando.get_compound(503)
bup = cando.get_compound(797)
lamf = cando.get_compound(1121)

cando.ml(method='rf', effect=baci, benchmark=False, seed=50, predict=[imat, bup, lamf])

The accuracy is quite poor, so fine-tuning and hyperparameter optimization are necessary to enhance performance.

### Custom protein sets

It may be useful for some users to probe compound-protein interaction similarity in the context of only a few particular proteins (e.g. a set of kinases). Generating a new matrix with just these proteins and their corresponding interaction values can take up a lot of storage if done multiple times; instead, the `protein_set=` flag can be specified when instantiating the CANDO object. This flag takes the path to the protein subset the user wishes to use, which is simply a list of UniProt protein IDs. For each ID, the CANDO object automatically checks whether it directly matches a UniProt ID within the matrix or whether it is associated with any PDB chains within the matrix (based on a mapping from the SIFTS project). If there are matches, the CANDO object will contain Compound objects with only those protein interaction values in their signatures. Below is how to use this functionality: this time we search for the bacterial proteins from the previous `canpredict_denovo()` section, but filter the proteins from the start using a list of UniProt IDs.
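The matching logic described above can be sketched on toy data as follows. Everything here is a placeholder: `select_proteins` is a hypothetical helper, and the IDs and the UniProt-to-chain mapping are invented for illustration rather than taken from SIFTS or from CANDO's code.

```python
def select_proteins(matrix_ids, wanted_uniprots, uniprot_to_chains):
    """Return the matrix protein IDs matched by a list of UniProt IDs."""
    in_matrix = set(matrix_ids)
    selected = []
    for uid in wanted_uniprots:
        if uid in in_matrix:
            selected.append(uid)  # direct UniProt match
        # Also keep every mapped PDB chain that is present in the matrix
        for chain in uniprot_to_chains.get(uid, []):
            if chain in in_matrix and chain not in selected:
                selected.append(chain)
    return selected


# Placeholder IDs (not real UniProt or SIFTS data)
matrix_ids = ["1abcA", "2defB", "3ghiC", "UP_C"]
uniprot_to_chains = {"UP_A": ["1abcA"], "UP_B": ["2defB", "9zzzZ"]}
wanted = ["UP_A", "UP_B", "UP_C", "UP_D"]

selected = select_proteins(matrix_ids, wanted, uniprot_to_chains)
print(selected)

# Reduce one compound signature to the selected proteins only
signature = {"1abcA": 0.8, "2defB": 0.1, "3ghiC": 0.4, "UP_C": 0.6}
reduced = [signature[p] for p in selected]
print(reduced)
```

Note that the number of selected proteins need not equal the number of UniProt IDs supplied: one ID can map to several chains, and others may have no match at all.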

In [None]:
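# protein_set must hold the path to a file listing the UniProt IDs to keep;
# define it before running this cell
cando_subset = cnd.CANDO(cmpd_map, ind_map, matrix=matrix_file, compound_set='approved',
                         compute_distance=True, protein_set=protein_set,
                         dist_metric=dist_metric, ncpus=ncpus)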

print("Number of proteins in new signature =", len(cando_subset.proteins))

The signature was successfully reduced to 15 proteins. Note: this does not necessarily mean that each UniProt ID had a corresponding PDB match, since multiple PDB chains can be associated with a given UniProt ID. If the matrix contains UniProt IDs in the first column instead of PDB IDs, as in this case, the "Direct UniProt matches" value is incremented for each ID that matches.

We can also repeat all benchmarks and predictive algorithms with the new signatures. Below are the default benchmarking results with the new signatures.

In [None]:
cando_subset.canbenchmark('test_subset')

Let's repeat the ML code from above, but this time with the reduced protein subset. 

In [None]:
baci = cando_subset.get_indication('MESH:D001424')

imat = cando_subset.get_compound(503)
bup = cando_subset.get_compound(797)
lamf = cando_subset.get_compound(1121)

cando_subset.ml(method='rf', effect=baci, benchmark=False, 
                seed=50, predict=[imat, bup, lamf])

As you can see, the output probability changes significantly for one of the compounds, lisdexamfetamine, from 0.340 to 0.600, illustrating the impact of protein subset composition on model behavior.

### Virtual screening

Though the CANDO platform is mainly intended for multitarget drug discovery and repurposing, it can also report the top compound hits for a given protein. This is accomplished via the `virtual_screen()` function. Let's check the top hits for two of the known bacterial proteins from above, "1u2mC" and "3atsA".

In [None]:
cando.virtual_screen("1u2mC")
cando.virtual_screen("3atsA")

Streptozocin is the top hit for 1u2mC, which suggests it shares significant structural similarity with a ligand known or predicted to bind to this protein. Similarly, kanamycin is the 5th hit for 3atsA. These scores and ranks can change significantly depending on the scoring protocol used for the matrix; changing `i_score` in `generate_matrix()` to "dC", for example, would greatly affect the output.
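Conceptually, a single-protein screen amounts to sorting compounds by their score for that protein. The sketch below shows this on toy data; `top_hits`, the compound names, and the scores are made up for illustration, and the real `virtual_screen()` may report additional information.

```python
def top_hits(signatures, protein_id, top=3):
    """Return the `top` compounds with the highest score for one protein."""
    scored = [(name, sig[protein_id])
              for name, sig in signatures.items() if protein_id in sig]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return scored[:top]


# Toy interaction scores (illustrative values only)
signatures = {
    "cmpd_a": {"1u2mC": 0.31, "3atsA": 0.72},
    "cmpd_b": {"1u2mC": 0.88, "3atsA": 0.15},
    "cmpd_c": {"1u2mC": 0.54, "3atsA": 0.40},
}

hits = top_hits(signatures, "1u2mC", top=2)
for name, score in hits:
    print(name, score)
```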

*Credits: CANDO tutorial created by Zackary Falls; revised by Melissa Van Norden*