# TACTIC tutorials

This notebook provides quick tutorials on three applications of `TACTIC`:
1. Predicting drug interaction outcomes for new strains (using the already built model)
2. Predicting drug interaction outcomes for new drug combinations (using the already built model)
3. Building a TACTIC model with user-defined inputs

# How to use `TACTIC`

This notebook provides a quick tutorial on how to apply TACTIC using sample drug response and drug interaction data. The code below is adaptable to other datasets provided that they are formatted according to the [example datasets](). 

Of note, TACTIC requires `indigopy`, a Python package implementation of [INDIGO](https://doi.org/10.1007/978-1-4939-8891-4_13). Instructions on how to install `indigopy` are detailed in the package [README](https://github.com/sriram-lab/INDIGOpy/blob/main/README.md).

## 0. Set up environment

Import dependencies.

In [1]:
# Data handling
import webbrowser
import numpy as np
import pandas as pd
from itertools import chain, compress
from warnings import warn
from scipy import sparse
from indigopy.core import featurize, classify

# Data analysis
from scipy.stats import spearmanr

# Machine learning
from sklearn.ensemble import RandomForestRegressor

# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Prepare sample data

TACTIC requires 2 types of data inputs: 

1. Drug response (e.g., omics measurements to drug treatments)
2. Drug interactions (e.g., Bliss or Loewe scores)
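
As a reminder of what an interaction score encodes, the sketch below computes a Bliss-style score from fitness values. It assumes that fitness is expressed as a fraction of untreated growth and that the score is the observed combination fitness minus the product of the single-drug fitnesses, so that negative values point toward synergy and positive values toward antagonism. The function name and the numbers are illustrative only; the scores in the sample data come from the original studies, not from this code.

```python
def bliss_score(f_a, f_b, f_ab):
    """Observed minus expected fitness under Bliss independence.

    f_a, f_b : fitness under each single drug (fraction of untreated growth)
    f_ab     : fitness under the combination
    """
    expected = f_a * f_b  # independence: fitness effects multiply
    return f_ab - expected

# Illustrative values (not taken from the sample data)
score = bliss_score(0.8, 0.5, 0.25)
print(round(score, 3))  # negative: the combination inhibits more than expected
```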

For this tutorial, we will work with subsets of the [data](https://github.com/sriram-lab/TACTIC/tree/master/data) used for developing the TACTIC approach. Specifically: 

- We will use data from the [Brochado study](https://pubmed.ncbi.nlm.nih.gov/29973719/) for model training, and data measured in *A. baumannii* for model testing
- Based on the drugs represented in the sample drug interactions, we will take: 
    - The subset of chemogenomics data measured for the first 50 genes in *E. coli* MG1655 ([Nichols et al., 2011](https://pubmed.ncbi.nlm.nih.gov/21185072/))
    - The subset of transcriptomics data measured for the first 50 genes in *M. tb* H37Rv ([Ma et al., 2019](https://pubmed.ncbi.nlm.nih.gov/31719182/))

In [2]:
# Instantiate data variable
data = {'source': {}, 'response': {}, 'interactions': {}}

# Load source data
data['source']['interactions']  = pd.read_excel('./data/ixn_data.xlsx', sheet_name='data', engine='openpyxl')
data['source']['key']           = pd.read_excel('./data/ixn_data.xlsx', sheet_name='drug_key', engine='openpyxl')
data['source']['MG1655']        = pd.read_excel('./data/omics_data.xlsx', sheet_name='ecoli_chemogenomics', engine='openpyxl')
data['source']['H37Rv']         = pd.read_excel('./data/omics_data.xlsx', sheet_name='mtb_transcriptomics', engine='openpyxl')

# Prepare data: drug interactions (train and test sets)
dfi = data['source']['interactions'].copy()
dfo = dfi[dfi['Source']==1]
# Keep interactions that occur exactly 6 times in the Source 1 data
counts = dfo['Interaction'].value_counts()
ixns = list(counts.index[counts==6])
data['interactions']['train']   = dfo[dfo['Interaction'].isin(ixns)]
data['interactions']['test']    = dfi[dfi['Strain'].str.startswith('A.')]

# Prepare data: drug response (key, MG1655, H37Rv)
dfi, df1, df2 = data['source']['key'].copy(), data['source']['MG1655'].copy(), data['source']['H37Rv'].copy()
ixns = ixns + data['interactions']['test']['Interaction'].tolist()
drugs = sorted(list(set(chain.from_iterable([list(combo.split(', ')) for combo in ixns]))))
dfo = dfi[dfi['Code'].isin(drugs)].rename(columns={'Chemogenomic_label': 'MG1655', 'Transcriptomic_label': 'H37Rv'})
data['response']['key']         = dfo.copy()
data['response']['MG1655']      = df1[['Gene'] + dfo['MG1655'].tolist()].iloc[:50, :]
data['response']['H37Rv']       = df2[['Gene'] + dfo['H37Rv'].tolist()].iloc[:50, :]

# Inspect sample data
print('Sample datasets (N = 5):')
df = data['interactions']['train'].copy()
print('\t1. Drug interactions, train set {}:'.format(df.shape))
display(df.head())
df = data['interactions']['test'].copy()
print('\t2. Drug interactions, test set {}:'.format(df.shape))
display(df.head())
df = data['response']['key'].copy()
print('\t3. Drug response, key {}:'.format(df.shape))
display(df.head())
df = data['response']['MG1655'].copy()
print('\t4. Drug response, MG1655 {}:'.format(df.shape))
display(df.head())
df = data['response']['H37Rv'].copy()
print('\t5. Drug response, H37Rv{}:'.format(df.shape))
display(df.head())

# Save data for external inspection (one workbook per data type, one sheet per dataset)
fname = './data/sample/drug_response.xlsx'
with pd.ExcelWriter(fname, engine='openpyxl') as writer: 
    for key, value in data['response'].items(): 
        value.to_excel(writer, sheet_name=key, index=False, freeze_panes=(1, 1))
        print('Successfully saved drug response data for {}'.format(key))
fname = './data/sample/drug_interactions.xlsx'
with pd.ExcelWriter(fname, engine='openpyxl') as writer: 
    for key, value in data['interactions'].items(): 
        value.to_excel(writer, sheet_name=key, index=False, freeze_panes=(1, 1))
        print('Successfully saved drug interaction data for {}'.format(key))

Sample datasets (N = 5):
	1. Drug interactions, train set (192, 9):


| | Source | Set | Strain | Interaction | Degree | Score | Label | Metric | Occurrence |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | TACTIC | E. coli BW25113 | A22, MMC | 2 | -0.154161 | Synergy | Bliss | 6 |
| 8 | 1 | TACTIC | E. coli BW25113 | AMK, EGCG | 2 | 0.134583 | Antagonism | Bliss | 6 |
| 9 | 1 | TACTIC | E. coli BW25113 | AMK, ERY | 2 | 0.200988 | Antagonism | Bliss | 9 |
| 14 | 1 | TACTIC | E. coli BW25113 | AMK, PRC | 2 | 0.472862 | Antagonism | Bliss | 6 |
| 21 | 1 | TACTIC | E. coli BW25113 | AMK, AMX | 2 | -0.128012 | Synergy | Bliss | 8 |


	2. Drug interactions, test set (45, 9):


| | Source | Set | Strain | Interaction | Degree | Score | Label | Metric | Occurrence |
|---|---|---|---|---|---|---|---|---|---|
| 1850 | 3 | TACTIC | A. baumannii ATCC 17978 | AMK, AMP | 2 | 0.317064 | Antagonism | Loewe | 3 |
| 1851 | 3 | TACTIC | A. baumannii ATCC 17978 | AMK, AZT | 2 | -0.606979 | Synergy | Loewe | 5 |
| 1852 | 3 | TACTIC | A. baumannii ATCC 17978 | AMK, NAL | 2 | 0.246822 | Antagonism | Loewe | 4 |
| 1853 | 3 | TACTIC | A. baumannii ATCC 17978 | AMK, RIF | 2 | -0.132424 | Neutral | Loewe | 6 |
| 1854 | 3 | TACTIC | A. baumannii ATCC 17978 | AMK, TET | 2 | -0.458472 | Synergy | Loewe | 4 |


	3. Drug response, key (31, 3):


| | Code | MG1655 | H37Rv |
|---|---|---|---|
| 0 | A22 | A22-15.0 | A22 |
| 1 | AMK | AMIKACIN-0.2 | AMIKACIN |
| 2 | AMX | AMOXICILLIN-1.5 | AMOXICILLIN |
| 3 | AMP | AMPICILLIN-8.0 | AMP |
| 4 | AZI | AZITHROMYCIN-1.0 | AZITHROMYCIN |


	4. Drug response, MG1655 (50, 32):


| | Gene | A22-15.0 | AMIKACIN-0.2 | AMOXICILLIN-1.5 | AMPICILLIN-8.0 | AZITHROMYCIN-1.0 | AZTREONAM-0.04 | BACITRACIN-300 | BENZALKONIUM-25 | CCCP-2.0 | ... | MITOMYCINC-0.1 | NALIDIXICACID-2.0 | NOVOBIOCIN-30 | PARAQUAT-18.0 | POLYMYXINB-6.0 | PROCAINE-30 | RIFAMPICIN-2.0 | SPECTINOMYCIN-6.0 | TETRACYCLINE-1.0 | VERAPAMIL-1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b0002 | -0.559545 | 1.647844 | -0.994845 | -1.298203 | -0.468216 | -1.332281 | 1.796933 | 1.178283 | -1.030014 | ... | 1.94556 | 1.462836 | 1.138723 | -4.531073 | 2.045477 | 0.296932 | -0.265948 | -1.751131 | -0.507021 | 0.395577 |
| 1 | b0003 | 0.242666 | -0.693398 | -1.234871 | 0.411192 | 0.28587 | 0.469039 | 1.043688 | 0.66988 | -0.567867 | ... | 0.620079 | 1.579996 | 0.008923 | -1.120873 | 2.009723 | 0.325243 | -0.43547 | 0.357894 | 0.483735 | 0.599244 |
| 2 | b0004 | 1.809979 | 1.482407 | 0.687448 | 0.695141 | 0.367449 | 0.135136 | -0.680992 | 1.641956 | -0.098073 | ... | -1.623568 | 0.000469 | -1.053464 | 0.111009 | 1.346476 | -1.616272 | 0.805449 | -1.930057 | 0.384499 | 1.035534 |
| 3 | b0005 | -0.625091 | 0.140624 | 1.369324 | 0.780893 | 0.084454 | 0.090245 | -0.826658 | 1.112208 | 0.650183 | ... | 0.633023 | 0.0994 | 1.897863 | 0.318214 | -0.733713 | -1.782458 | -0.149242 | -0.814743 | 0.353229 | -1.52068 |
| 4 | b0006 | -0.357582 | -0.134298 | -0.649026 | 1.327997 | -0.260863 | -1.039351 | 0.01933 | -0.458028 | 0.238005 | ... | 0.467696 | 0.016489 | 0.008245 | -0.646982 | 1.30458 | 0.833593 | 0.459269 | -0.0747 | -1.631192 | -1.01594 |


	5. Drug response, H37Rv(50, 32):


| | Gene | A22 | AMIKACIN | AMOXICILLIN | AMP | AZITHROMYCIN | AZTREONAM | BACITRACIN | BENZALKONIUM | CCCP | ... | MTM | NALIDIXICACID | NOVOBIOCIN | PARAQUAT | POLYMYXINB | PROCAINE | RIF | SPECTINOMYCIN | TET | VERAPAMIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rv0001 | 0.280216 | -0.435491 | 0.874587 | 0.769623 | -0.59728 | -0.409214 | 0.455288 | 0.039375 | -1.716496 | ... | 0.012251 | -1.218246 | -0.502595 | -0.374935 | 0.247419 | -0.09333 | -0.684493 | 0.793279 | -0.839782 | 1.71083 |
| 1 | Rv0002 | -0.294181 | 0.224606 | -0.294181 | -0.562953 | -0.294181 | -0.294181 | -0.294181 | -0.294181 | -0.422903 | ... | 0.007786 | -0.294181 | -0.66794 | -0.294181 | -0.294181 | -0.294181 | -0.700379 | -0.294181 | 0.502027 | -0.17228 |
| 2 | Rv0003 | -0.478872 | 0.30068 | 0.173256 | -0.654577 | -1.029745 | 0.120626 | 0.107939 | -0.5871 | 0.111387 | ... | -0.161867 | -0.603865 | -0.176118 | 0.407015 | -0.138354 | 0.333373 | -0.996568 | -0.739771 | 1.053258 | -1.171968 |
| 3 | Rv0004 | 0.004665 | 0.487582 | 0.004665 | -0.046152 | 0.004665 | 0.004665 | 0.004665 | 0.004665 | -0.17234 | ... | 0.133206 | 0.004665 | 0.089403 | 0.004665 | 0.004665 | 0.004665 | 0.053454 | 0.004665 | -0.568754 | 0.129359 |
| 4 | Rv0005 | 0.137907 | -1.297794 | 0.137907 | 0.433942 | 0.137907 | 0.137907 | 0.137907 | 0.137907 | -0.002451 | ... | 0.173396 | 0.137907 | -0.396215 | 0.137907 | 0.137907 | 0.137907 | -0.317155 | 0.137907 | -1.039578 | -0.06196 |


Successfully saved drug response data for key
Successfully saved drug response data for MG1655
Successfully saved drug response data for H37Rv
Successfully saved drug interaction data for train
Successfully saved drug interaction data for test
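
When adapting this step to another dataset, a quick consistency check between the interaction table and the drug key can catch formatting problems early. The sketch below mirrors how the drug list is derived from the `Interaction` column in the cell above (splitting on `', '`), then reports any drug code that has no entry in the key. The toy tables, the `XYZ` code, and the helper names `drugs_in` and `missing_from_key` are made up for illustration.

```python
from itertools import chain

import pandas as pd

# Toy stand-ins for the interaction table and the drug key (illustrative values)
toy_ixns = pd.DataFrame({
    'Strain': ['E. coli BW25113', 'E. coli BW25113', 'A. baumannii ATCC 17978'],
    'Interaction': ['AMK, ERY', 'A22, MMC', 'AMK, XYZ'],
})
toy_key = pd.DataFrame({'Code': ['A22', 'AMK', 'ERY', 'MMC']})

def drugs_in(interactions):
    """Return the sorted, unique drug codes found in 'A, B' style labels."""
    return sorted(set(chain.from_iterable(s.split(', ') for s in interactions)))

def missing_from_key(interactions, codes):
    """Return drug codes used in the interactions but absent from the key."""
    known = set(codes)
    return [d for d in drugs_in(interactions) if d not in known]

drugs = drugs_in(toy_ixns['Interaction'])
missing = missing_from_key(toy_ixns['Interaction'], toy_key['Code'])
print('Drugs in interactions:', drugs)
print('Drugs missing from key:', missing)
```

An empty `missing` list suggests that every drug in the interaction data can be matched to a response profile.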


## 2. Obtain gene orthologs via OrtholugeDB

One of the key concepts integrated in TACTIC is **transfer learning via orthology mapping**, which extends drug response data measured in a reference organism (e.g., *E. coli* MG1655) to a new organism (e.g., *A. baumannii*). To do so, information on gene orthologs shared between the reference and new organisms must be available. In this section, we walk through how to extract gene orthology data from [OrtholugeDB](https://ortholugedb.ca/). The code below opens the [webpage](https://ortholugedb.ca/?page=matrix-setup) where we can specify the reference and comparison genomes: 

In [6]:
# Open genome comparison webpage in OrtholugeDB
webbrowser.open('https://ortholugedb.ca/?page=matrix-setup')

True

<div>
<img src='./data/sample/images/0_OrtholugeDB_setup.png' width='1000'/>
</div>

In this tutorial, our reference genomes are *E. coli* MG1655 and *M. tb* H37Rv. We will use the webpage above to collect gene orthology information for both references. Starting with *E. coli* MG1655, we can type the keyword `mg1655` to quickly find the reference genome: 

<div>
<img src='./data/sample/images/1_MG1655_reference.png' width='1000'/>
</div>

Based on our sample dataset, there are 9 strains that we are interested in: 

In [8]:
# Print out strains of interest
df = pd.concat([data['interactions']['train'], data['interactions']['test']])
strains = ['E. coli MG1655', 'M. tb H37Rv'] + list(df['Strain'].unique())
print('Strains of interest (N = {}):\n\t{}'.format(len(strains), '\n\t'.join(strains)))

Strains of interest (N = 9):
	E. coli MG1655
	M. tb H37Rv
	E. coli BW25113
	E. coli iAi1
	P. aeruginosa PAO1
	P. aeruginosa PA14
	S. typhimurium 14028s
	S. typhimurium LT2
	A. baumannii ATCC 17978


Against *E. coli* MG1655 as the reference genome, we can find genomes for seven of the remaining eight strains using the following keywords: 

- *M. tb* H37Rv: `h37rv`
- *E. coli* iAi1: `iai1`
- *P. aeruginosa* PAO1: `pao1`
- *P. aeruginosa* PA14: `pa14`
- *S.* Typhimurium 14028s: `14028s`
- *S.* Typhimurium LT2: `lt2`
- *A. baumannii* ATCC 17978: `17978`

Of note, the genome for *E. coli* BW25113 is not available through OrtholugeDB. Given that both BW25113 and MG1655 are [derived from *E. coli* K-12](https://ecoliwiki.org/colipedia/index.php/Category:Strain:BW25113), we assume that their genomes are equivalent.
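
The search keywords listed above are simply the last token of each strain name in lower case. As a minimal sketch, a small helper can generate them and apply the BW25113-to-MG1655 substitution described above. The names `search_keyword`, `aliases`, and `example_strains` are hypothetical and not part of TACTIC or `indigopy`.

```python
def search_keyword(strain, aliases=None):
    """Return a lower-case search keyword from the last token of a strain name.

    `aliases` maps an upper-case strain token to the token of a stand-in genome.
    """
    aliases = aliases or {}
    token = strain.split(' ')[-1].upper()
    return aliases.get(token, token).lower()

example_strains = [
    'M. tb H37Rv',
    'E. coli BW25113',
    'E. coli iAi1',
    'P. aeruginosa PAO1',
    'S. typhimurium 14028s',
    'A. baumannii ATCC 17978',
]
# BW25113 is searched under MG1655, following the assumption stated above
aliases = {'BW25113': 'MG1655'}
keywords = {s: search_keyword(s, aliases) for s in example_strains}
for strain, keyword in keywords.items():
    print('{}: {}'.format(strain, keyword))
```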

<div>
<img src='./data/sample/images/2_H37Rv_comparison.png' width='1000'/>
</div>

<div>
<img src='./data/sample/images/3_all_comparisons.png' width='1000'/>
</div>

Once we have all genomes of interest selected, we can advance to the next step where we specify the level of orthology information desired. For TACTIC, we recommend leaving the default options: 

<div>
<img src='./data/sample/images/4_gene_filter.png' width='1000'/>
</div>

The results page should look like the following: 

<div>
<img src='./data/sample/images/5_results_page.png' width='1000'/>
</div>

We can download the gene orthology data for *E. coli* MG1655 in either CSV or TAB (recommended) format. To collect gene orthology information for *M. tb* H37Rv, we can go back to step 1 and use `h37rv` as our reference genome keyword: 

<div>
<img src='./data/sample/images/6_H37Rv_reference.png' width='1000'/>
</div>

We also select *E. coli* MG1655 as one of our comparison genomes (keyword: `mg1655`) and de-select *M. tb* H37Rv: 

<div>
<img src='./data/sample/images/7_MG1655_comparison.png' width='1000'/>
</div>
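
Whichever format is downloaded, it can help to confirm that the file parses into the expected number of columns before building the orthology maps. The sketch below reads a tiny in-memory table; its column names and locus tags are invented and do not reproduce the actual OrtholugeDB layout. It assumes that a missing ortholog is encoded as `0`, as the comparison in section 3 suggests, and reads every column as text so that a test against the string `'0'` stays unambiguous. `read_ortholog_table` is a hypothetical helper.

```python
import io

import pandas as pd

# Toy tab-delimited export; column names and values are invented for illustration
toy_tab = (
    "Reference locus tag\tReference gene\tComparison locus tag\n"
    "b0001\tthrL\t0\n"
    "b0002\tthrA\tXYZ_0002\n"
)

def read_ortholog_table(handle, fmt='tab'):
    """Read an ortholog table as text; `fmt` is 'tab' or 'csv'."""
    sep = '\t' if fmt == 'tab' else ','
    # dtype=str keeps entries such as '0' as strings rather than integers
    return pd.read_csv(handle, sep=sep, dtype=str)

toy_df = read_ortholog_table(io.StringIO(toy_tab))
has_ortholog = toy_df['Comparison locus tag'] != '0'
print(toy_df.shape)
print(has_ortholog.tolist())
```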

## 3. Create orthology maps

Using the downloaded files from OrtholugeDB, we can define orthology maps for each reference strain in the following manner: 

In [None]:
# Load orthology data
data['source']['orthology'] = {
    'MG1655': pd.read_csv('./data/sample/MG1655_orthologs.txt', sep='\t'), 
    'H37Rv': pd.read_csv('./data/sample/H37Rv_orthologs.txt', sep='\t')
}

# Define orthology map for each reference
data['orthology'] = {}
for key, value in data['source']['orthology'].items(): 
    # Re-format orthology data
    jx = [1, 3] + [x for x in range(4, value.shape[1], 3)]
    df = value.copy().iloc[:, jx]
    df.columns = [s.split(' - ')[0].upper() if 'comparison' in s else s for s in df.columns]
    # Harmonize M. tb H37Rv locus tags ('RVBD_' -> 'Rv'); expected to leave MG1655 locus tags unchanged
    df.iloc[:, 0] = [s.replace('RVBD_', 'Rv') for s in df.iloc[:, 0]]
    # Create orthology map
    data['orthology'][key] = {}
    for strain in strains: 
        # Define keyword
        keyword = strain.split(' ')[-1].upper()
        # Account for special cases
        if key.upper()==keyword: 
            print('Queried strain ({}) is the same as reference ({}). Continuing...'.format(strain, key))
            continue
        elif (keyword=='BW25113') and (key=='MG1655'): 
            data['orthology'][key][strain] = df.iloc[:, 0].tolist()
        else: 
            if keyword=='BW25113': 
                keyword = 'MG1655'
            jx = df.columns.str.endswith(keyword)
            if not any(jx): 
                warn('No orthologs found for {} against {} reference'.format(strain, key))
                continue
            else: 
                # Keep reference genes with a non-'0' entry in any matching comparison column
                data['orthology'][key][strain] = list(compress(df.iloc[:, 0], (df.iloc[:, jx] != '0').any(axis=1)))
        # Print total number of orthologs
        print('Number of orthologs for {} against {} reference: {}'.format(strain, key, len(data['orthology'][key][strain])))

Queried strain (E. coli MG1655) is the same as reference (MG1655). Continuing...
Number of orthologs for M. tb H37Rv against MG1655 reference: 1124
Number of orthologs for E. coli BW25113 against MG1655 reference: 4157
Number of orthologs for E. coli iAi1 against MG1655 reference: 3772
Number of orthologs for P. aeruginosa PAO1 against MG1655 reference: 2078
Number of orthologs for P. aeruginosa PA14 against MG1655 reference: 2083
Number of orthologs for S. typhimurium 14028s against MG1655 reference: 3206
Number of orthologs for S. typhimurium LT2 against MG1655 reference: 3193
Number of orthologs for A. baumannii ATCC 17978 against MG1655 reference: 1563
Number of orthologs for E. coli MG1655 against H37Rv reference: 1064
Queried strain (M. tb H37Rv) is the same as reference (H37Rv). Continuing...
Number of orthologs for E. coli BW25113 against H37Rv reference: 1064
Number of orthologs for E. coli iAi1 against H37Rv reference: 1056
Number of orthologs for P. aeruginosa PAO1 against H
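
To illustrate how an orthology map of this form might be applied, the sketch below masks a toy drug response table so that only genes with an ortholog in the target strain keep their values. It assumes that genes without an ortholog are simply set to zero; this is one plausible convention for the sake of illustration, and the handling inside TACTIC and `indigopy` may differ. The table values, the ortholog list, and `mask_non_orthologs` are hypothetical.

```python
import pandas as pd

# Toy response table in the same layout as data['response'] (values are illustrative)
toy_response = pd.DataFrame({
    'Gene': ['b0002', 'b0003', 'b0004'],
    'DRUG_A': [1.6, -0.7, 1.5],
    'DRUG_B': [-1.0, -1.2, 0.7],
})
# Toy ortholog list in the same layout as data['orthology'][reference][strain]
toy_orthologs = ['b0002', 'b0004']

def mask_non_orthologs(response, orthologs, gene_col='Gene'):
    """Return a copy of `response` with non-orthologous genes set to zero."""
    masked = response.copy()
    keep = masked[gene_col].isin(orthologs)
    value_cols = [c for c in masked.columns if c != gene_col]
    masked.loc[~keep, value_cols] = 0.0
    return masked

masked = mask_non_orthologs(toy_response, toy_orthologs)
print(masked)
```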

## 4. Data featurization using `indigopy`

## 5. ML model training and testing