# CANDO Tutorial

This notebook will walk you through how to generate a CANDO matrix, set up a CANDO object, probe the data, benchmark the platform, and make therapeutic predictions. 

## Introduction

## Getting Started

We begin by importing the cando package. Once cando has been imported, we can pull the example mappings and matrix data for this tutorial with `cnd.get_tutorial()`. Lastly, we define the variables that will be used throughout this tutorial; each is explained when it first becomes relevant to a function.

In [2]:
import sys, os
# import the cando package
import cando as cnd

# Pull data for this entire tutorial
cnd.get_tutorial()

# Define variables for matrix generation and CANDO object creation
matrix_file='./examples/example-matrix.tsv'
cmpd_map='./v2_0/mappings/drugbank-approved.tsv'
ind_map='./v2_0/mappings/ctd_2_drugbank.tsv'
cmpd_scores='./v2_0/cmpds/scores/drugbank-approved-rd_ecfp4.tsv'
prot_scores='./examples/example-prots_scores.tsv'
protein_set="./examples/example-uniprot_set"
dist_metric='rmsd'
ncpus=3

Downloading data for tutorial...
All data for tutorial downloaded.


## Generating interaction matrix

<span style="color:red">**This step is optional**</span> - The example interaction matrix, `matrix_file`, has already been downloaded via `get_tutorial()`. This function may take a few minutes.

In this step we will generate a matrix of 2,162 drugs by 64 proteins, populated with the corresponding interaction score between each drug and protein. The final matrix will have drugs as the columns (indexed according to the compound mapping file) and proteins as the rows (indexed by PDB and chain ID).

The function `generate_matrix()` first creates a dataframe of Tanimoto scores comparing each drug in our dataset (2,162) to every potential binding site ligand from the PDB, read from `cmpd_scores`. Next, it creates a dataframe of all potential binding sites for each protein in our dataset (example set = 64 proteins) with the corresponding binding site score, read from `prot_scores`. The function then iterates over all drugs and protein binding sites to populate the matrix with the highest Tanimoto score. These scores represent how strongly each drug may potentially bind to each protein target. The matrix is then written to a tsv file, `matrix_file`.

This function is parallelized, so setting the variable `ncpus` changes the number of processors it uses.
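
The population step described above can be sketched with toy data. This is only an illustration under stated assumptions, not the `generate_matrix()` implementation: the dictionaries, ligand names, chain IDs, and the helper `toy_interaction_matrix` are all hypothetical, and the binding site scores are carried along but not used in this simplified scoring.

```python
# Toy Tanimoto scores: drug -> {binding-site ligand -> similarity} (made-up values)
toy_cmpd_scores = {
    "drug_a": {"LIG1": 0.80, "LIG2": 0.30, "LIG3": 0.10},
    "drug_b": {"LIG1": 0.20, "LIG2": 0.55, "LIG3": 0.40},
}

# Toy binding sites: protein chain -> list of (template ligand, binding-site score)
toy_prot_sites = {
    "1abcA": [("LIG1", 0.9), ("LIG2", 0.7)],
    "2xyzB": [("LIG3", 0.6)],
}


def toy_interaction_matrix(cmpd_scores, prot_sites):
    """Return {protein: {drug: score}}, keeping the highest Tanimoto score per pair."""
    matrix = {}
    for prot, sites in prot_sites.items():
        row = {}
        for drug, sims in cmpd_scores.items():
            # Best similarity between this drug and any ligand bound at this protein
            row[drug] = max((sims.get(lig, 0.0) for lig, _ in sites), default=0.0)
        matrix[prot] = row
    return matrix


toy_matrix = toy_interaction_matrix(toy_cmpd_scores, toy_prot_sites)
for prot, row in toy_matrix.items():
    print(prot, row)
```

As in the real matrix, proteins index the rows and drugs index the columns.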

In [3]:
# generate example cando interaction matrix (2,162 drugs x 64 proteins)
cnd.generate_matrix(matrix_file=matrix_file, cmpd_scores=cmpd_scores,
                    prot_scores=prot_scores, ncpus=ncpus)

Compiling compound scores...


KeyboardInterrupt: 

## Setting up CANDO object

1. First argument, `cmpd_map`, is the compound mapping which specifies all of the compounds in the matrix by name and ID.

2. Second argument, `ind_map`, is the indication mapping which specifies which diseases the drugs/compounds are approved to treat.

3. `matrix` is the cando interaction matrix file which contains the names of the proteins and the scores to all compounds. **This file must be in tsv format.**

4. `compute_distance` tells the object to compute the distances between the compounds in the platform based on the similarity of their interactions with all of the proteins. The distance metric used, `dist_metric`, can be cosine, rmsd, or many other common distance metrics (default is rmsd). This can be relatively time consuming depending on the number of proteins, but with proteins on the order of hundreds it shouldn't take longer than a minute or two. If the computation does take a while (let's say >100,000 proteins), you can add a `save_rmsds='name_of_rmsd_file.tsv'` flag to save the result. Then, simply use the `read_rmsds='name_of_rmsd_file.tsv'` flag to read in the already computed RMSDs and save time.
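
To make the distance step concrete, here is a minimal sketch of an all-vs-all RMSD comparison on toy signatures. It assumes the conventional definition of RMSD (square root of the mean squared difference); the helpers `rmsd` and `rank_similar` and the three-protein signatures are hypothetical and are not part of the cando API.

```python
import math


def rmsd(sig_a, sig_b):
    """Root-mean-square deviation between two equal-length signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)) / len(sig_a))


# Hypothetical three-protein signatures (made-up values)
toy_sigs = {
    "cmpd_a": [0.5, 0.1, 0.0],
    "cmpd_b": [0.5, 0.2, 0.0],
    "cmpd_c": [0.0, 0.9, 0.4],
}


def rank_similar(name, sigs):
    """Sort all other compounds by increasing distance to `name`."""
    others = [(other, rmsd(sigs[name], sig)) for other, sig in sigs.items() if other != name]
    return sorted(others, key=lambda pair: pair[1])


print(rank_similar("cmpd_a", toy_sigs))
```

The sorted list of `(compound, distance)` pairs mirrors the idea behind the `Compound.similar` lists used later in this tutorial.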

In [4]:
# create object using example cando interaction matrix
# compute distances using rmsd distance metric
cando = cnd.CANDO(cmpd_map, ind_map, matrix=matrix_file, compute_distance=True, 
                  dist_metric=dist_metric, ncpus=ncpus)

Reading signatures from matrix...
Done reading signatures.

Computing rmsd distances...
Done computing rmsd distances.


### Check data
The data imported by `get_tutorial()` contains the CANDO v2.0 mappings with a sample of 64 proteins (for simplicity's sake). It should contain the v2.0 set of 2,162 drugs/compounds and 2,178 indications. Let's check that this is the case. The CANDO object should automagically assign each compound its signature of 64 proteins. Let's make sure the first compound in the mapping (bivalirudin) has 64 values in its signature and take a look at the values themselves. Depending on the scoring protocol used, the range/distribution of these values will vary.

The `compute_distance` flag above tells the CANDO object to compute the RMSD between each compound on an all-vs-all basis. Let's check out the top5 most similar compounds to bivalirudin. The `Compound.similar` lists are part of the Compound objects and contain a tuple of every other compound object and its computed RMSD.

In [5]:
# print cando object stats
print('compounds', len(cando.compounds))
print('indications', len(cando.indications))
print('proteins', len(cando.proteins))
print('')

# print bivalirudin signature
c = cando.compounds[0]
print(c.name, len(c.sig))
print(c.sig)
print('')

# top5 most similar compounds to bivalirudin
for s in c.similar[0:5]:
    print(s[0].name, s[1])

compounds 2162
indications 2178
proteins 64

bivalirudin 64
[0.496, 0.482, 0.385, 0.061, 0.0, 0.11, 0.114, 0.375, 0.109, 0.102, 0.12, 0.418, 0.227, 0.0, 0.417, 0.146, 0.109, 0.129, 0.09, 0.099, 0.137, 0.109, 0.1, 0.109, 0.04, 0.133, 0.101, 0.19, 0.129, 0.077, 0.09, 0.571, 0.134, 0.202, 0.108, 0.408, 0.211, 0.0, 0.188, 0.432, 0.136, 0.116, 0.0, 0.0, 0.195, 0.535, 0.143, 0.232, 0.021, 0.056, 0.09, 0.102, 0.447, 0.114, 0.135, 0.01, 0.127, 0.188, 0.039, 0.096, 0.114, 0.214, 0.128, 0.405]

semaglutide 0.0381114730101055
tetracosactide 0.038863342303512696
lixisenatide 0.04069090807539198
corticorelin_ovine_triflutate 0.044826052692602765
afamelanotide 0.04829838894828687


## Canbenchmark
An important part of CANDO is benchmarking how well we can recapture drugs known to treat the same diseases within their respective `Compound.similar` lists. Currently, our benchmarking code calculates three metrics:
1. Average indication accuracy (aia) - this value is the average of every individual indication accuracy
2. Average pairwise accuracy (apa) - this value is the weighted average of each individual indication accuracy based on the number of drugs approved to treat it
3. Indication coverage (ic) - this is the count of the number of non-zero indication accuracies

The higher the metric scores for a given matrix/compound-protein scoring protocol, the more confidence we will have in predictions made using it. 
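
The three metrics can be illustrated on toy counts. This is a sketch, not the cando benchmarking code: `summarize` and the indication names are hypothetical. It relies on the fact that weighting each indication accuracy by its number of approved drugs is the same as pooling all recaptured drugs over all approved drugs.

```python
def summarize(hits_by_indication):
    """hits_by_indication maps indication -> (compounds recaptured, compounds approved)."""
    accuracies = [hits / total for hits, total in hits_by_indication.values()]
    total_hits = sum(hits for hits, _ in hits_by_indication.values())
    total_cmpds = sum(total for _, total in hits_by_indication.values())
    aia = 100 * sum(accuracies) / len(accuracies)  # unweighted mean over indications
    apa = 100 * total_hits / total_cmpds           # weighted by number of approved drugs
    ic = sum(1 for acc in accuracies if acc > 0)   # indications with non-zero accuracy
    return aia, apa, ic


# Made-up counts for three indications at a single cutoff
toy_hits = {"ind_a": (1, 2), "ind_b": (8, 10), "ind_c": (0, 3)}
aia, apa, ic = summarize(toy_hits)
print(f"aia={aia:.2f} apa={apa:.2f} ic={ic}")
```

Here the large indication `ind_b` pulls apa above aia, which is the same pattern seen in the summary files below.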

An accuracy is calculated for each indication that has at least 2 compounds associated. To do this, we hold out each compound and look for any of the other approved compounds within a certain cutoff of the `Compound.similar` list for the held-out compound. The cutoffs are predetermined to be top10, top25, top50, and top100. There are also percent cutoffs that vary based on the number of compounds in the platform. So, for Indication-A with three drugs approved (D1, D2, and D3), we would hold out D1 and look for *either* D2 or D3 in the top10, top25, top50, and top100 compounds to D1. This would be repeated for D2 and D3. So let's say D3 was recaptured at rank 5 for D1, D1 was recaptured at rank 12 for D2, and D1 was recaptured at rank 27 for D3. The top10 accuracy for Indication-A would be (1+0+0) / 3 ≈ 33%, whereas the top25 would be (1+1+0) / 3 ≈ 67%, and the top50 would be (1+1+1) / 3 = 100%.
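
The Indication-A example can be reproduced in a few lines. The dictionary `best_rank` and the helper `indication_accuracy` are hypothetical names used only for this illustration.

```python
# Best rank at which any other approved drug is recaptured, per held-out drug
best_rank = {"D1": 5, "D2": 12, "D3": 27}


def indication_accuracy(ranks, cutoff):
    """Percentage of held-out drugs with a co-approved drug inside the cutoff."""
    hits = sum(1 for rank in ranks.values() if rank <= cutoff)
    return 100 * hits / len(ranks)


for cutoff in (10, 25, 50, 100):
    print(f"top{cutoff}: {indication_accuracy(best_rank, cutoff):.1f}%")
```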

### Canbenchmark - classic
Below is how to run the benchmarking code. The first argument of `benchmark_classic()` is the extension to put on the output files ("results_analysed_named" and "raw_results"), and the second argument is the full name of the summary file which contains the metric scores at each cutoff (described above). 

In [6]:
cando.benchmark_classic('test', 'summary-test.tsv')

	aia
top10	19.32
top25	26.24
top50	33.53
top100	42.92
top2162	100.00
top1%	24.74
top5%	44.05
top10%	55.89
top50%	87.19
top100%	100.00




Below is the printed summary file for the classic canbenchmark.

In [7]:
with open("summary-test.tsv", 'r') as f:
    for line in f:
        print(line.strip('\n'))

	top10	top25	top50	top100	top2162	top1%	top5%	top10%	top50%	top100%
aia	19.318	26.238	33.532	42.920	99.999	24.736	44.050	55.890	87.192	99.999
apa	33.506	46.909	59.057	71.217	100.000	44.014	72.344	82.614	97.154	100.000
ic	779	889	976	1083	1570	864	1096	1213	1483	1570


### Canbenchmark - associated
There are variations of the benchmarking code which may be of interest to some users. For example, not every compound in the mapping files is associated with a disease, which can decrease performance. We can remove these compounds by running another benchmarking method called `benchmark_associated()`, which automatically filters out the non-associated compounds.

In [8]:
cando.benchmark_associated('test_associated', 'summary-test_associated.tsv')

Making CANDO copy with only benchmarking-associated compounds
	aia
top10	22.02
top25	30.79
top50	38.96
top100	49.05
top1403	100.00
top1%	24.48
top5%	43.50
top10%	54.72
top50%	86.05
top100%	100.00




Below is the printed summary file for the associated canbenchmark.

In [9]:
with open("summary-test_associated.tsv", 'r') as f:
    for line in f:
        print(line.strip('\n'))

	top10	top25	top50	top100	top1403	top1%	top5%	top10%	top50%	top100%
aia	22.023	30.792	38.963	49.045	99.999	24.481	43.502	54.717	86.049	99.999
apa	38.815	54.201	66.465	77.012	100.000	43.417	71.714	81.791	96.828	100.000
ic	819	950	1040	1149	1570	865	1088	1203	1478	1570


### Canbenchmark - continuous
Yet another variation of our benchmarking is `benchmark_continuous()`. This method identifies percentiles of the compound distances, which define our cutoffs. Rather than using the top10 compounds, we can define RMSD cutoffs that will not punish a compound that falls at rank 11 but is still very similar to the hold-out compound. These RMSD cutoffs are determined empirically.
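
The idea of a distance-based cutoff can be sketched as follows. The exact percentile rule used by `benchmark_continuous()` is not described here, so the nearest-rank percentile below is an assumption, and `percentile_cutoff`, `recaptured`, and the toy distances are hypothetical.

```python
import math


def percentile_cutoff(distances, pct):
    """Distance at or below which about `pct` percent of all distances fall (nearest rank)."""
    ordered = sorted(distances)
    k = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[k - 1]


def recaptured(similar, approved, cutoff):
    """True if any approved compound lies within `cutoff` of the held-out compound."""
    return any(dist <= cutoff for name, dist in similar if name in approved)


# Made-up pool of pairwise distances: 0.01, 0.02, ..., 1.00
all_dists = [i / 100 for i in range(1, 101)]
cutoff_10pct = percentile_cutoff(all_dists, 10)

# Made-up similar list for one held-out compound: (name, distance)
toy_similar_list = [("x", 0.05), ("y", 0.09), ("z", 0.5)]
print(cutoff_10pct, recaptured(toy_similar_list, {"y"}, cutoff_10pct))
```

With a rank cutoff, a compound just outside the top list is a miss regardless of how close it is; with a distance cutoff, it counts as long as its distance is small enough.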

In [16]:
cando.benchmark_continuous('test_continuous', 'summary-test_continuous.tsv')

	aia
0.1%	10.37
0.5%	16.34
1%	19.72
5%	34.31
10%	42.77
20%	54.76
33%	66.35
50%	77.49
100%	100.00




Below is the printed summary file for the continuous canbenchmark.

In [17]:
with open('summary-test_continuous.tsv', 'r') as f:
    for line in f:
        print(line.strip('\n'))

	0.1%	0.5%	1%	5%	10%	20%	33%	50%	100%
aia	10.365	16.339	19.722	34.310	42.769	54.756	66.353	77.493	99.999
apa	15.391	26.760	33.053	53.289	61.792	71.962	80.122	87.221	100.000
ic	562	689	738	931	1026	1154	1278	1387	1570


## Canpredict

An important part of CANDO is generating putative drug candidates for a specific disease and predicting indications for which added or current drugs in our library can be therapeutic. We can do this with the canpredict functions.

### Canpredict - compounds
Generating putative drug candidates for a specific disease is one way we may want to use the predictive power of our platform. For this, we use the `canpredict_compounds()` function, which basically uses a consensus method to rank putative compounds based on how many times they show up as similar to drugs approved to treat the disease within some cutoff. The default cutoff is 10 (the most stringent from benchmarking), but this can be varied. In its current implementation, `canpredict_compounds()` ranks compounds based on the consensus count, but this can be changed in the future to prioritize distance as well. 

Let's make predictions for breast cancer, or "Breast Neoplasms" according to the CTD mapping. This has a MeSH ID of MESH:D001943, which is the input for `canpredict_compounds()`. The list of candidates is very long because of the large number of drugs approved to treat breast cancer, so we consider only the top10 most similar compounds to each approved drug (the default, `n=10`). We will also only print the first 10 compounds (`topX=10`).
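
The consensus ranking described above can be sketched with `collections.Counter`. This is a simplified illustration, not the `canpredict_compounds()` source: `consensus_rank`, the toy similar lists, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import Counter


def consensus_rank(similar_lists, approved, n=10, keep_approved=False):
    """Count how often each compound appears in the top-n list of an approved drug."""
    counts = Counter()
    for drug in approved:
        for name in similar_lists[drug][:n]:
            if keep_approved or name not in approved:
                counts[name] += 1
    # Highest consensus count first; ties broken alphabetically for determinism
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up similar lists (most similar first) for three approved drugs
toy_similar = {
    "drug_1": ["cand_x", "cand_y", "drug_2"],
    "drug_2": ["cand_x", "drug_1", "cand_z"],
    "drug_3": ["cand_y", "cand_x", "cand_z"],
}
toy_approved = {"drug_1", "drug_2", "drug_3"}

print(consensus_rank(toy_similar, toy_approved, n=2))
print(consensus_rank(toy_similar, toy_approved, n=2, keep_approved=True))
```

Raising `n` lets more neighbors vote, which is why the scores and the ordering can change between cutoffs.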

In [11]:
cando.canpredict_compounds("MESH:D001943", n=10, topX=10)

112 compounds found for MESH:D001943 --> Breast Neoplasms
Generating top 10 compound predictions...

rank	score	id	name
1	9	1364	omacetaxine_mepesuccinate
2	8	182	etonogestrel
3	7	450	nizatidine
4	7	1460	simeprevir
5	6	1365	nilotinib
6	6	1536	cabazitaxel
7	6	478	amodiaquine
8	6	550	fludrocortisone
9	5	397	lercanidipine
10	5	2061	venetoclax




Perhaps using the top10 compounds is not encompassing enough. Let's change it to top25 (n=25) and see if the predictions change. We will still print the first 10 compounds.

In [12]:
cando.canpredict_compounds("MESH:D001943", n=25, topX=10)

112 compounds found for MESH:D001943 --> Breast Neoplasms
Generating top 10 compound predictions...

rank	score	id	name
1	14	182	etonogestrel
2	13	1896	osimertinib
3	13	397	lercanidipine
4	13	447	voriconazole
5	12	1364	omacetaxine_mepesuccinate
6	11	682	ethynodiol_diacetate
7	11	1880	paritaprevir
8	11	1546	halcinonide
9	11	2061	venetoclax
10	11	478	amodiaquine




Now let's print the first 25 compounds predicted for 'Breast Neoplasms' using the top25 most similar compounds. We will also show the compounds that are already approved for 'Breast Neoplasms' (removed by default).

In [13]:
cando.canpredict_compounds("MESH:D001943", n=25, topX=25, keep_approved=True)

112 compounds found for MESH:D001943 --> Breast Neoplasms
Generating top 25 compound predictions...

rank	score	approved	id	name
1	14	False		182	etonogestrel
2	13	False		1896	osimertinib
3	13	False		397	lercanidipine
4	13	False		447	voriconazole
5	12	False		1364	omacetaxine_mepesuccinate
6	11	False		682	ethynodiol_diacetate
7	11	False		1880	paritaprevir
8	11	True		468	medroxyprogesterone_acetate
9	11	True		236	megestrol_acetate
10	11	False		1546	halcinonide
11	11	False		2061	venetoclax
12	11	False		478	amodiaquine
13	10	False		562	nicergoline
14	10	False		1536	cabazitaxel
15	10	False		1393	trabectedin
16	10	False		162	topiramate
17	10	True		1736	tibolone
18	10	False		519	estrone
19	10	False		1279	prasterone
20	10	False		488	testosterone
21	10	False		1641	bedaquiline
22	10	True		810	carboplatin
23	9	False		1139	vecuronium
24	9	False		450	nizatidine
25	9	False		895	fluocinonide




Sometimes there are no compounds associated with a disease, which makes canpredict impossible if considering interactomic homology to approved drugs. In these cases, `ind_id=None` and `sum_scores=True` can be set to simply sum the interaction scores within the matrix and output those with the greatest totals. These parameters can be particularly useful when considering matrices with proteins from pathogens or in combination with the `protein_set` flag (discussed below).
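
Summing the scores is straightforward to illustrate. The helper `rank_by_summed_scores` and the toy signatures below are hypothetical and only mirror the idea described above.

```python
def rank_by_summed_scores(signatures, top=3):
    """Rank compounds by the sum of their interaction scores, largest first."""
    totals = {name: sum(sig) for name, sig in signatures.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Made-up three-protein signatures
toy_signatures = {
    "cmpd_a": [0.5, 0.25, 0.0],
    "cmpd_b": [0.75, 0.5, 0.25],
    "cmpd_c": [0.0, 0.125, 0.0],
}
print(rank_by_summed_scores(toy_signatures, top=2))
```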

In [15]:
cando.canpredict_compounds(ind_id=None, topX=50, sum_scores=True)

Finding compounds with greatest summed scores in ./examples/example-matrix.tsv...
Generating top 50 compound predictions...

rank	score	id	name
1	19.46699999999999	25	adenosine_monophosphate
2	17.899	50	nadh
3	17.508999999999993	504	adenosine
4	17.508999999999993	85	vidarabine
5	16.704	1306	flavin_adenine_dinucleotide
6	15.976	1334	lactose
7	15.225	2131	inosine_pranobex
8	15.225	1331	inosine
9	14.905000000000005	13	ademetionine
10	14.759	2022	arbutin
11	14.256999999999996	919	fludarabine
12	14.240999999999994	838	cytarabine
13	14.187000000000001	1602	hyaluronic_acid
14	14.047999999999998	1108	nelarabine
15	13.875999999999998	1950	sodium_ferric_gluconate_complex
16	13.875999999999998	1795	iron_saccharate
17	13.875999999999998	1301	sucrose
18	13.847000000000001	1336	gluconolactone
19	13.616999999999997	781	azacitidine
20	13.411000000000003	1410	mipomersen
21	13.195999999999996	2036	polydatin
22	13.190999999999997	1013	kanamycin
23	13.129	446	lactulose
24	13.102	1440	regadenoson
25	12.928

### Canpredict - indications


Below we print the first 10 indications predicted for the compound at index 10 (pyridoxal phosphate) using the top10 most similar compounds. Again, this tallies how many times specific diseases show up as associated with the top10 most similar compounds to pyridoxal phosphate.
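
A minimal sketch of this tally, using a hypothetical helper `tally_indications` and made-up mappings (the alphabetical tie-break is also an assumption):

```python
from collections import Counter


def tally_indications(neighbors, cmpd_to_inds, n=10, top=10):
    """Count the indications associated with the n most similar compounds."""
    counts = Counter()
    for name in neighbors[:n]:
        counts.update(cmpd_to_inds.get(name, ()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]


# Made-up neighbors (most similar first) and compound -> indications mapping
toy_neighbors = ["cmpd_a", "cmpd_b", "cmpd_c", "cmpd_d"]
toy_ind_map = {
    "cmpd_a": {"Seizures", "Asthma"},
    "cmpd_b": {"Seizures"},
    "cmpd_d": {"Asthma"},
}
print(tally_indications(toy_neighbors, toy_ind_map, n=3))
```

Neighbors with no associated indication, such as `cmpd_c` here, simply contribute nothing to the tally.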

In [22]:
cando.canpredict_indications(cando_cmpd=cando.compounds[10], n=10, topN=10)

Using CANDO compound pyridoxal_phosphate
Compound has id 10 and index 10
Comparing signature to all CANDO compound signatures...
Generating top 10 indication predictions...

rank	score	mesh_id    	indication
1	1	MESH:D004830	Epilepsy, Tonic-Clonic
2	1	MESH:D012640	Seizures
3	1	MESH:D013226	Status Epilepticus
4	1	MESH:D017180	Tachycardia, Ventricular
5	1	MESH:D001249	Asthma
6	1	MESH:D029424	Pulmonary Disease, Chronic Obstructive
7	1	MESH:D015451	Leukemia, Lymphocytic, Chronic, B-Cell
8	1	MESH:D007945	Leukemia, Lymphoid
9	1	MESH:D016403	Lymphoma, Large B-Cell, Diffuse
10	1	MESH:D020522	Lymphoma, Mantle-Cell



### Similar compounds
`similar_compounds()` prints the first `n` most similar compounds for a given compound. Like `canpredict_indications()`, it can be used with cando compounds, `cando_cmpd`, or novel compounds with a signature file (we will explore this later).

Below we print the first 10 most similar compounds to the compound at index 10 (pyridoxal phosphate).

In [23]:
cando.similar_compounds(cando_cmpd=cando.compounds[10], n=10)

Using CANDO compound pyridoxal_phosphate
Compound has id 10 and index 10
Comparing signature to all CANDO compound signatures...
Generating 10 most similar compound predictions...

rank	dist	id	name
1	0.035	524	metaxalone
2	0.039	1721	tedizolid_phosphate
3	0.042	1124	fosphenytoin
4	0.042	736	emtricitabine
5	0.045	683	enprofylline
6	0.045	627	clavulanate
7	0.046	1726	ibrutinib
8	0.046	1512	fospropofol
9	0.046	946	capecitabine
10	0.047	896	abacavir




## Machine learning with CANDO
The "proteomic vectors" within CANDO lend themselves well to machine learning to perhaps learn more complex relationships between the proteins within the vector and their impacts on the treatment of diseases. CANDO has built-in ML algorithms that allow for two main functionalities:
1. Benchmark the platform using a hold-one-out protocol very similar to canbenchmark
2. Make predictions for novel or non-associated compounds that may be therapeutic for a given disease

The ML module currently supports 4 algorithms: support vector machines (SVMs), 1-class SVMs, random forests, and logistic regression. The models are trained on drugs approved for the disease (positive classes) and an equal number of randomly selected "neutral samples", which are drugs/compounds not approved for the disease (negative samples). Random seeds may be set to ensure the same compounds are used in training. 
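
The balanced training set described above can be sketched with the standard library. How cando draws and seeds its neutral samples internally is not shown in this tutorial, so `build_training_set` and the toy compound names are hypothetical.

```python
import random


def build_training_set(all_cmpds, approved, seed=50):
    """Label approved drugs 1 and an equal number of randomly drawn other compounds 0."""
    rng = random.Random(seed)  # fixed seed -> same neutral samples on every run
    positives = sorted(approved)
    pool = sorted(c for c in all_cmpds if c not in approved)
    negatives = rng.sample(pool, k=len(positives))
    labels = {c: 1 for c in positives}
    labels.update({c: 0 for c in negatives})
    return labels


# Made-up library of 20 compounds, 3 of them approved for the disease
toy_cmpds = [f"cmpd_{i}" for i in range(20)]
toy_positive = {"cmpd_1", "cmpd_5", "cmpd_9"}
labels = build_training_set(toy_cmpds, toy_positive, seed=50)
print(labels)
```

Calling the function again with the same seed returns the same labels, which is the point of setting a random seed.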

### ML - benchmark
We have the option to benchmark the platform with an ML algorithm - this module outputs files very similar to canbenchmark. For this tutorial, we will skip this function as it requires a great deal of time to complete (training a separate model for EVERY drug-disease association, basically). The command to do so with an SVM is below, feel free to run it! The `'out='` flag defines the name of the output files. Again, only diseases with 2+ compounds associated are benchmarked. 

`cando.ml(method='svm', benchmark=True, seed=50, out='test_svm')`

### ML - predict
We can also use this module to predict if a certain compound may be therapeutic for a given disease. We can use the 
`'predict='` flag to specify a list of compounds that we wish to predict with the classifier. Let's use three drugs, imatinib, buprenorphine, and lisdexamfetamine, and see if they are predicted to be anti-inflammatory using a random forest classifier. 

In [27]:
inflm = cando.get_indication('MESH:D007249')

imat = cando.get_compound(483)
bup = cando.get_compound(775)
lamf = cando.get_compound(1094)

cando.ml(method='rf', effect=inflm, benchmark=False, seed=50, predict=[imat, bup, lamf])

Inflammation
Indication: Inflammation
Leave-one-out cross validation: TP=64, FN=54, Acc=54.24
	Compound	Class
	imatinib	0
	buprenorphine	1
	lisdexamfetamine	0


## Custom protein subsets and signatures
It may be useful for some users to probe compound-protein interaction similarity, but only in the context of a few particular proteins (e.g. a set of kinases). Instead of generating a new matrix with all of these proteins and their corresponding interaction values, which can begin to take up a lot of storage if done multiple times, the `'protein_set='` flag can be specified during the instantiation of the CANDO object. This flag contains the path to the protein subset the user wishes to use, which is simply a list of UniProt protein IDs. For each ID, the CANDO object automatically checks whether it matches any UniProt ID within the matrix or is associated with any PDB chains within the matrix (based on a mapping from the SIFTS project). If there are matches, the CANDO object will contain Compound objects with only those protein interaction values in their signatures. Below is an example with accompanying benchmark and ML examples. Let's check whether some of the 20 UniProt IDs in the example list had corresponding PDBs.
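
The matching logic can be sketched on toy data. This is an illustration of the idea only: `subset_signatures`, the chain IDs, the UniProt IDs, and the mapping are all made up, and the real mapping comes from the files shipped with cando.

```python
def subset_signatures(matrix_rows, uniprot_ids, uniprot_to_chains):
    """Keep only the matrix rows (proteins) that match the requested UniProt IDs."""
    wanted = set()
    for uid in uniprot_ids:
        wanted.add(uid)                                # direct match on a UniProt ID row
        wanted.update(uniprot_to_chains.get(uid, ()))  # PDB chains mapped to that ID
    return {prot: scores for prot, scores in matrix_rows.items() if prot in wanted}


# Made-up matrix rows: protein -> interaction scores across two compounds
toy_rows = {
    "1abcA": [0.1, 0.2],
    "1abcB": [0.3, 0.4],
    "2xyzA": [0.5, 0.6],
    "P99999": [0.7, 0.8],
}
# Made-up UniProt -> PDB chain mapping
toy_mapping = {"P11111": ["1abcA", "1abcB"], "P22222": ["3zzzA"]}

subset = subset_signatures(toy_rows, ["P11111", "P22222", "P99999"], toy_mapping)
print(sorted(subset))
```

In this toy case three requested IDs also yield three rows, but one ID contributes two chains and another contributes none, so the row count alone does not show which IDs matched.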

In [29]:
cando_subset = cnd.CANDO(cmpd_map, ind_map, matrix=matrix_file, compute_distance=True, protein_set=protein_set,
                  dist_metric=dist_metric, ncpus=ncpus)

print("Number of proteins in new signature =", len(cando_subset.proteins))

Reading signatures from matrix...
Editing signatures according to proteins in ./examples/example-uniprot_set...
Done reading signatures.

Computing rmsd distances...
Done computing rmsd distances.
Number of proteins in new signature = 20


The signature was successfully edited to 20 proteins. Note: this does not necessarily mean that each UniProt ID had a corresponding PDB match; multiple PDB chains can be associated with a given UniProt ID.

We can also repeat all benchmarks and predictive algorithms with the new signatures. Below is the default benchmarking results with the new signatures. 

In [30]:
cando_subset.benchmark_classic('test_subset', 'summary-test_subset')

	aia
top10	16.82
top25	24.28
top50	31.07
top100	40.75
top2162	100.00
top1%	22.55
top5%	42.04
top10%	53.83
top50%	87.69
top100%	100.00




Let's repeat the ML code from above, but this time let's use an SVM. 

In [31]:
inflm = cando_subset.get_indication('MESH:D007249')

imat = cando_subset.get_compound(483)
bup = cando_subset.get_compound(775)
lamf = cando_subset.get_compound(1094)

cando_subset.ml(method='svm', effect=inflm, benchmark=False, seed=50, predict=[imat, bup, lamf])

Indication: Inflammation
Leave-one-out cross validation: TP=104, FN=14, Acc=88.14
	Compound	Class
	imatinib	1
	buprenorphine	1
	lisdexamfetamine	0


## Generate proteomic signatures for new compounds 

The CANDO platform contains an extensive library of approved drugs (2,162) and other compounds (additional 6,590) from DrugBank. However, if you wish to predict indications or similar drugs for a compound that is not present in our library, we make it possible with the following series of functions.

First, you must have the compound properly formatted in PDB file format. Many programs can convert between chemical file formats if you need to do so.

Next, run the `generate_scores()` function. This populates a tsv file with the Tanimoto similarity scores (Jaccard index) of the provided compound to all binding site ligands in our database. These values are then used by `generate_signature()` to build the drug-proteome signature. The fingerprint used for the Tanimoto scores is set by the `fp` argument (default "rd_ecfp4"). In the cell below, the scores file is named after the input PDB file and saved under `out_path` in a subdirectory named for the fingerprint (e.g. `cmpd_pdb="examples/8100.pdb"` with `out_path="examples"` gives "examples/rd_ecfp4/8100_scores.tsv").
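
For readers unfamiliar with the Tanimoto score, here is a minimal sketch on hand-written bit sets. The sets stand in for real ECFP4 fingerprints, which would come from a cheminformatics toolkit; the helper `tanimoto` and the bit positions are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Made-up 'on' bit positions for a query compound and one binding site ligand
query_bits = {1, 4, 7, 9}
ligand_bits = {1, 4, 8}

# 2 shared bits out of 5 distinct bits
print(tanimoto(query_bits, ligand_bits))
```

A score of 1.0 means the two fingerprints are identical, and 0.0 means they share no bits.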

In [33]:
cnd.generate_scores(fp="rd_ecfp4",
        cmpd_pdb="examples/8100.pdb", out_path="examples", ncpus=ncpus)
cnd.generate_signature(cmpd_scores="examples/rd_ecfp4/8100_scores.tsv",
        prot_scores=prot_scores, ncpus=ncpus)

Downloading file: /Users/williammangione/Documents/UB/samudrala_lab/CANDO/v2_0/ligands_fps/rd_ecfp4.tsv [540.0 KB] [] [ETA:   0:00:31] 

Downloading ligand fingerprints for rd_ecfp4...


Downloading file: /Users/williammangione/Documents/UB/samudrala_lab/CANDO/v2_0/ligands_fps/rd_ecfp4.tsv [164.5 MB] [] [Time:  0:01:08] 


Ligand fingerprints downloaded.
Generating rd_ecfp4 fingerprints and scores...
Calculating tanimoto scores for compound 8100 against all binding site ligands...
Tanimoto scores written to examples/rd_ecfp4/8100_scores.tsv

Compiling compound scores...
Compiling binding site scores...
Generating interaction signature...
8100
Signature written to 8100_signature.tsv.
Signature generation took 0 seconds to finish.


In [34]:
cando.similar_compounds(new_sig="8100_signature.tsv",
        new_name='scy-635', n=10)

New compound is scy-635
New compound has id 2162 and index 2162
Comparing signature to all CANDO compound signatures...
Generating 10 most similar compound predictions...

rank	dist	id	name
1	0.021	2154	glecaprevir
2	0.022	2064	asunaprevir
3	0.023	1741	edoxaban
4	0.023	2102	voxilaprevir
5	0.023	1454	udenafil
6	0.023	1129	cefazolin
7	0.024	1460	simeprevir
8	0.024	1712	ledipasvir
9	0.024	1132	cefotetan
10	0.024	1864	fimasartan




In [35]:
cando.canpredict_indications(new_sig="8100_signature.tsv",
        new_name='scy-635', n=10, topN=10)

New compound is scy-635
New compound has id 2162 and index 2162
Comparing signature to all CANDO compound signatures...
Generating top 10 indication predictions...

rank	score	mesh_id    	indication
1	2	MESH:D010019	Osteomyelitis
2	2	MESH:D006526	Hepatitis C
3	2	MESH:D019698	Hepatitis C, Chronic
4	1	MESH:D054556	Venous Thromboembolism
5	1	MESH:D007172	Erectile Dysfunction
6	1	MESH:D016470	Bacteremia
7	1	MESH:D001437	Bacteriuria
8	1	MESH:D002295	Carcinoma, Transitional Cell
9	1	MESH:D004697	Endocarditis, Bacterial
10	1	MESH:D004927	Escherichia coli Infections

