# 05 Creating Link Tables for the Multimodal Cancer Network

In this notebook, we step through the construction of the link tables. The multimodal cancer network has 11 link types:

<center>
**Chemical-Chemical, Chemical-Protein, Disease-Chemical, Disease-Disease, Disease-Function, Disease-Protein, Function-Function, Gene-Gene, Gene-Protein, Protein-Function,** and **Protein-Protein.**
</center>
    
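We start by importing the required Python packages and building a date string from today's date. The date string is part of the node table file names referenced throughout the notebook, so those tables must carry the same date string.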

In [1]:
import os
import datetime

from utils.create_mambo_crossnet_table import create_mambo_crossnet_table

today = datetime.date.today()
datestring = "%s%s%s" % (today.year, today.month, today.day)
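
The `"%s%s%s"` format above does not zero-pad the month or day, so two different dates can yield the same string. The sketch below illustrates the collision and shows a fixed-width alternative built with `strftime`. It is only an illustration: the function names are hypothetical, and the unpadded form should be kept if the node tables referenced below were written with it, since the file names have to match.

```python
import datetime


def unpadded_datestring(day):
    """Build the date string the same way the cell above does (no zero padding)."""
    return "%s%s%s" % (day.year, day.month, day.day)


def padded_datestring(day):
    """Hypothetical alternative: fixed-width YYYYMMDD."""
    return day.strftime("%Y%m%d")


jan_11 = datetime.date(2018, 1, 11)
nov_1 = datetime.date(2018, 11, 1)

# Both dates collapse to the same unpadded string.
print(unpadded_datestring(jan_11), unpadded_datestring(nov_1))
# The padded form keeps them distinct.
print(padded_datestring(jan_11), padded_datestring(nov_1))
```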

## Create Chemical-Chemical Link Tables

Information on the Chemical-Chemical link type is retrieved from the [DrugBank database](https://www.drugbank.ca/), which relates DrugBank IDs to each other.

In [2]:
filepath = "datasets/cancer_example/chemical-chemical/drugbank_parsed_chemical_chemical.tsv"
srcfile = "datasets/cancer_example/chemical/miner-chemical-0-drugbank-%s.tsv" % datestring
dstfile = "datasets/cancer_example/chemical/miner-chemical-0-drugbank-%s.tsv" % datestring 
datasetname = "DrugBank" 
dbid = 0
outputdir = "datasets/cancer_example/chemical-chemical/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=False,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 3648
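
The `src_node_index` and `dst_node_index` arguments select which tab-separated columns of the input file hold the two endpoint IDs. A minimal sketch of that column selection follows, assuming a plain TSV with optional `#` comment lines. `read_edge_pairs` is a hypothetical helper, not part of `utils`, and the rows are made up for illustration.

```python
import csv
import io


def read_edge_pairs(handle, src_index=0, dst_index=1):
    """Yield (source_id, destination_id) pairs from a tab-separated edge list."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        yield row[src_index], row[dst_index]


# Made-up rows standing in for the parsed DrugBank file.
sample = io.StringIO("# drug_a\tdrug_b\nDB00001\tDB00002\nDB00001\tDB00003\n")
pairs = list(read_edge_pairs(sample))
print(pairs)
```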


## Create Chemical-Protein Link Tables

Information on the Chemical-Protein link type is retrieved from the DrugBank database, which relates DrugBank IDs to UniProt IDs based on [drug target information](https://www.drugbank.ca/targets).

In [3]:
filepath = "datasets/cancer_example/chemical-protein/drugbank_parsed_chemical_protein.tsv"
srcfile = "datasets/cancer_example/chemical/miner-chemical-0-drugbank-%s.tsv" % datestring
dstfile = "datasets/cancer_example/protein/miner-protein-1-GO_UNIPROT-%s.tsv" % datestring 
datasetname = "DrugBank" 
dbid = 0
outputdir = "datasets/cancer_example/chemical-protein/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 531
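
This cell sets `skip_missing_ids=True`, unlike the Chemical-Chemical cell. The name suggests that edges whose endpoints are absent from the source or destination node table are dropped rather than treated as errors. The sketch below shows that kind of filtering under this assumption. `filter_known_edges` is hypothetical and does not reproduce the internals of `create_mambo_crossnet_table`; the IDs are illustrative.

```python
def filter_known_edges(edges, known_sources, known_destinations, skip_missing_ids=True):
    """Keep edges whose endpoints both exist in the node tables.

    With skip_missing_ids=False an unknown ID raises KeyError instead of being dropped.
    """
    kept = []
    for src, dst in edges:
        if src in known_sources and dst in known_destinations:
            kept.append((src, dst))
        elif not skip_missing_ids:
            raise KeyError("edge (%s, %s) refers to an ID missing from a node table" % (src, dst))
    return kept


chemicals = {"DB00001", "DB00002"}  # illustrative IDs only
proteins = {"P00734"}
edges = [("DB00001", "P00734"), ("DB00002", "Q99999"), ("DB09999", "P00734")]

# Only the first edge has both endpoints in the node tables.
print(filter_known_edges(edges, chemicals, proteins))
```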


## Create Disease-Chemical Link Tables

Information on the Disease-Chemical link type is retrieved from the [CTD database](http://ctdbase.org), which relates MeSH and OMIM IDs to DrugBank IDs based on the [published literature and the CTD biocurators](http://ctdbase.org/about/). For example, diseases associated with arsenic include arsenic poisoning, prostatic neoplasms, skin diseases, and myocardial ischemia. The first call below builds the table for MeSH IDs and the second for OMIM IDs.

In [4]:
filepath = "datasets/cancer_example/disease-chemical/ctd_disease_chem_parsed.tsv"
srcfile = "datasets/cancer_example/disease/miner-disease-3-CTD_MESH-%s.tsv" % datestring
dstfile = "datasets/cancer_example/chemical/miner-chemical-0-drugbank-%s.tsv" % datestring
datasetname = "CTD_MESH" 
dbid = 0
outputdir = "datasets/cancer_example/disease-chemical/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

srcfile = "datasets/cancer_example/disease/miner-disease-2-CTD_OMIM-%s.tsv" % datestring
datasetname = "CTD_OMIM" 
dbid = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid,  # use the dbid set just above (1 for CTD_OMIM)
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 261182
Starting at mambo id: 261182
Ending at mambo id: 261896
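
The two runs above report ID ranges in which the second table starts exactly where the first one ended. A small sketch for checking this from the printed log follows. It assumes the "Ending" value is exclusive, which the matching boundary suggests but the notebook does not state. `parse_id_ranges` is a hypothetical helper.

```python
import re

LOG = """\
Starting at mambo id: 0
Ending at mambo id: 261182
Starting at mambo id: 261182
Ending at mambo id: 261896
"""


def parse_id_ranges(log_text):
    """Return a list of (start, end) tuples from the verbose log output."""
    starts = [int(m) for m in re.findall(r"Starting at mambo id: (\d+)", log_text)]
    ends = [int(m) for m in re.findall(r"Ending at mambo id: (\d+)", log_text)]
    return list(zip(starts, ends))


ranges = parse_id_ranges(LOG)
# Each range should begin where the previous one ended.
contiguous = all(prev[1] == cur[0] for prev, cur in zip(ranges, ranges[1:]))
# Number of IDs assigned per table, under the exclusive-end assumption.
sizes = [end - start for start, end in ranges]
print(ranges, contiguous, sizes)
```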


## Create Disease-Disease Link Tables

Information on the Disease-Disease link type is retrieved from the [Disease Ontology](http://disease-ontology.org/), which relates Disease Ontology IDs to each other based on the anatomical organization of organs and tissues affected by the corresponding disease. For example, carotenemia (DOID:9969) and hyperuricemia (DOID:1920) are both acquired metabolic diseases.

In [5]:
filepath = "datasets/cancer_example/disease-disease/doid_disease_disease_parsed.tsv"
srcfile = "datasets/cancer_example/disease/miner-disease-0-DO-%s.tsv" % datestring
dstfile = "datasets/cancer_example/disease/miner-disease-0-DO-%s.tsv" % datestring
datasetname = "DO" 
dbid = 0
outputdir = "datasets/cancer_example/disease-disease/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=False,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 6357


## Create Disease-Function Link Tables

Information on the Disease-Function link type is retrieved from the [CTD database](http://ctdbase.org/about/), which relates MeSH IDs to Gene Ontology IDs based on the cellular functions, processes, and components involved in disease development and progression.

In [6]:
filepath = "datasets/cancer_example/disease-function/ctd_disease_func_parsed.tsv"
srcfile = "datasets/cancer_example/disease/miner-disease-3-CTD_MESH-%s.tsv" % datestring
dstfile = "datasets/cancer_example/function/miner-function-0-GO-%s.tsv" % datestring
datasetname = "CTD_MESH" 
dbid = 0
outputdir = "datasets/cancer_example/disease-function/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=False,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 361916
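
Every cell in this notebook passes the same defaults and varies only a handful of arguments. One way to make the differences stand out is to build the keyword arguments in a helper. The sketch below does this with a hypothetical `crossnet_kwargs` function whose keys mirror the keyword names used in the calls above. It only builds a dictionary and does not call `create_mambo_crossnet_table` itself.

```python
def crossnet_kwargs(input_file, src_file, dst_file, dataset_name, output_dir,
                    db_id=0, src_node_index=0, dst_node_index=1,
                    skip_missing_ids=False, verbose=True):
    """Collect the arguments that vary per link type and fill in the shared defaults."""
    return {
        "input_file": input_file,
        "src_file": src_file,
        "dst_file": dst_file,
        "dataset_name": dataset_name,
        "db_id": db_id,
        "src_node_index": src_node_index,
        "dst_node_index": dst_node_index,
        "mode_name1": None,
        "mode_name2": None,
        "output_dir": output_dir,
        "full_crossnet_file": None,
        "db_edge_file": None,
        "src_mode_filter": None,
        "dst_mode_filter": None,
        "mambo_id_counter_start": -1,
        "skip_missing_ids": skip_missing_ids,
        "verbose": verbose,
    }


datestring = "2018111"  # placeholder; the notebook derives this from today's date
kwargs = crossnet_kwargs(
    input_file="datasets/cancer_example/disease-function/ctd_disease_func_parsed.tsv",
    src_file="datasets/cancer_example/disease/miner-disease-3-CTD_MESH-%s.tsv" % datestring,
    dst_file="datasets/cancer_example/function/miner-function-0-GO-%s.tsv" % datestring,
    dataset_name="CTD_MESH",
    output_dir="datasets/cancer_example/disease-function/",
)
print(sorted(kwargs))
```

With `utils` importable, the call would then read `create_mambo_crossnet_table(**kwargs)`.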


## Create Disease-Protein Link Tables

Information on the Disease-Protein link type is retrieved from the [CTD database](http://ctdbase.org/about/), which relates diseases (encoded by MeSH and OMIM IDs) to proteins (encoded by UniProt IDs). We construct separate tables for MeSH IDs and OMIM IDs.

In [7]:
filepath = "datasets/cancer_example/disease-protein/ctd_disease_protein_parsed.tsv"
srcfile = "datasets/cancer_example/disease/miner-disease-3-CTD_MESH-%s.tsv" % datestring
dstfile = "datasets/cancer_example/protein/miner-protein-1-GO_UNIPROT-%s.tsv" % datestring 
datasetname = "CTD_MESH" 
dbid = 0
outputdir = "datasets/cancer_example/disease-protein/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

srcfile = "datasets/cancer_example/disease/miner-disease-2-CTD_OMIM-%s.tsv" % datestring
datasetname = "CTD_OMIM" 
dbid = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid,  # use the dbid set just above (1 for CTD_OMIM)
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 950171
Starting at mambo id: 950171
Ending at mambo id: 950541


## Create Function-Function Link Tables

Information on the Function-Function link type is retrieved from the [Gene Ontology](http://geneontology.org/), which relates Gene Ontology terms to one another based on their [ontological distance in Gene Ontology](http://geneontology.org/page/ontology-documentation). Examples of broad biological process terms are "cellular physiological process" and "signal transduction"; more specific terms include "pyrimidine metabolic process" and "alpha-glucoside transport".

In [8]:
filepath = "datasets/cancer_example/function-function/go_parsed.tsv"
srcfile = "datasets/cancer_example/function/miner-function-0-GO-%s.tsv" % datestring
dstfile = "datasets/cancer_example/function/miner-function-0-GO-%s.tsv" % datestring
datasetname = "GO" 
dbid = 0
outputdir = "datasets/cancer_example/function-function/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=False,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 2613


## Create Gene-Gene Link Tables

Information on the Gene-Gene link type is retrieved from [GeneMANIA](http://genemania.org/), which relates ENSEMBL Gene IDs to each other based on various types of functional interactions between genes, such as co-pathway membership, co-localization, co-expression, and physical protein interaction.

In [9]:
types = ["Co-expression", "Co-localization", "Genetic_interactions", "Pathway", "Physical_interactions", "Predicted"]
base_dir = "datasets/cancer_example/gene-gene/"
directory = "datasets/cancer_example/gene-gene/genemania_data/"
srcfile = "datasets/cancer_example/gene/miner-gene-0-ICGC-%s.tsv" % datestring
dstfile = "datasets/cancer_example/gene/miner-gene-0-ICGC-%s.tsv" % datestring 

typecount = {"Co-localization" : 0, 
             "Co-expression" : 0,
             "Genetic_interactions" : 0,
             "Pathway" : 0,
             "Physical_interactions" : 0,
             "Predicted" : 0}
dirname = {"Co-localization" : "colocalization_links", 
           "Co-expression" : "coexpression_links",
           "Genetic_interactions" : "genetic_interactions_links",
           "Pathway": "pathway_links",
           "Physical_interactions" : "physical_interactions_links",
           "Predicted" : "predicted_links"}

for d in dirname.values():
    new_dir = os.path.join(base_dir, d)
    if not os.path.exists(new_dir):
        os.makedirs(new_dir)

for filename in os.listdir(directory):
    if any(typename in filename for typename in types):
        filepath = os.path.join(directory, filename)
        datasetname = filename.split('.')[1]
        filetype = filename.split('.')[0]
        dbid = typecount[filetype]
        outputdir = os.path.join(base_dir, dirname[filetype])
        create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=0, 
                           dst_node_index=1, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True)
        typecount[filetype] += 1
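
The loop above derives the link type and the dataset name from each file name and numbers the datasets within a type in the order they are encountered. Because `os.listdir` returns entries in arbitrary order, those `db_id` values can differ between runs or machines; sorting the names first makes the numbering reproducible. The sketch below shows the parsing and numbering on a few made-up file names that follow the `<type>.<dataset>.txt` pattern the loop assumes. `assign_db_ids` is a hypothetical helper, and its exact match on the first name component is stricter than the substring test used in the loop.

```python
from collections import defaultdict


def assign_db_ids(filenames, types):
    """Map each file name to (link type, dataset name, db_id), numbering per type."""
    counters = defaultdict(int)
    assigned = {}
    for filename in sorted(filenames):  # sorted for a reproducible numbering
        parts = filename.split(".")
        filetype = parts[0]
        if filetype not in types or len(parts) < 2:
            continue  # ignore files that do not follow the expected pattern
        assigned[filename] = (filetype, parts[1], counters[filetype])
        counters[filetype] += 1
    return assigned


types = ["Co-expression", "Co-localization", "Genetic_interactions",
         "Pathway", "Physical_interactions", "Predicted"]
# Made-up file names for illustration.
files = ["Pathway.DatasetB.txt", "Co-expression.DatasetA.txt",
         "Pathway.DatasetA.txt", "README.txt"]

for name, info in assign_db_ids(files, types).items():
    print(name, info)
```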

## Create Gene-Protein Link Tables

Information on the Gene-Protein link type is retrieved from GeneMANIA and [Biomart](http://www.biomart.org/), both of which relate ENSEMBL Gene IDs to ENSEMBL Protein IDs.

In [10]:
filepath = "datasets/cancer_example/gene-protein/ensembl_mapping.tsv"
srcfile = "datasets/cancer_example/gene/miner-gene-0-ICGC-%s.tsv" % datestring
dstfile = "datasets/cancer_example/protein/miner-protein-0-STRING-%s.tsv" % datestring
datasetname = "ENSEMBL" 
dbid = 0
outputdir = "datasets/cancer_example/gene-protein/"
srcindex = 0
dstindex = 1

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 461


## Create Protein-Function Link Tables

Information on the Protein-Function link type is retrieved from the [Gene Ontology](http://geneontology.org/), which relates UniProt IDs to Gene Ontology IDs based on associations of gene products with cellular functions, biological processes, and cellular components.

In [11]:
filepath = "datasets/cancer_example/protein-function/gene_association.goa_human"
srcfile = "datasets/cancer_example/protein/miner-protein-1-GO_UNIPROT-%s.tsv" % datestring
dstfile = "datasets/cancer_example/function/miner-function-0-GO-%s.tsv" % datestring
datasetname = "GO" 
dbid = 0
outputdir = "datasets/cancer_example/protein-function/"
srcindex = 1
dstindex = 4

create_mambo_crossnet_table(input_file=filepath, 
                           src_file=srcfile, 
                           dst_file=dstfile, 
                           dataset_name=datasetname,
                           db_id=dbid, 
                           src_node_index=srcindex, 
                           dst_node_index=dstindex, 
                           mode_name1=None,
                           mode_name2=None, 
                           output_dir=outputdir, 
                           full_crossnet_file=None, 
                           db_edge_file=None,
                           src_mode_filter=None, 
                           dst_mode_filter=None, 
                           mambo_id_counter_start=-1,
                           skip_missing_ids=True,
                           verbose=True)

Starting at mambo id: 0
Ending at mambo id: 10047
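
The indices `srcindex = 1` and `dstindex = 4` differ from the other cells because the input is a GO annotation file rather than a two-column edge list. In the GAF format, lines starting with `!` are comments, the second tab-separated column is the DB Object ID (the UniProt accession for UniProtKB rows), and the fifth column is the GO ID. The sketch below pulls those two columns from a shortened, illustrative record. `read_gaf_pairs` is a hypothetical helper; how `create_mambo_crossnet_table` itself treats comment lines is not shown in this notebook.

```python
def read_gaf_pairs(lines, src_index=1, dst_index=4):
    """Yield (protein_id, go_id) pairs from GAF-formatted lines."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # header/comment or blank line
        fields = line.rstrip("\n").split("\t")
        yield fields[src_index], fields[dst_index]


sample = [
    "!gaf-version: 2.1\n",
    # Illustrative record with a placeholder reference, truncated after the evidence code.
    "UniProtKB\tP04637\tTP53\t\tGO:0006915\tPMID:0000000\tIDA\n",
]
gaf_pairs = list(read_gaf_pairs(sample))
print(gaf_pairs)
```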


## Create Protein-Protein Link Tables

Information on the Protein-Protein link type is retrieved from the [STRING database](https://string-db.org/), which relates protein ENSEMBL IDs to each other based on physical protein-protein interactions, proximity in genomic sequence space, gene fusions, and other types of evidence.

In [12]:
types = [
'neighborhood',
'fusion',
'cooccurence',
'coexpression',
'experimental',
'database',
'textmining',
'combined_score',
]

data_dir = 'datasets/cancer_example/protein-protein/string_data'
output_dir = 'datasets/cancer_example/protein-protein/'
srcfile = "datasets/cancer_example/protein/miner-protein-0-STRING-%s.tsv" % datestring
dstfile = srcfile
    
for t in types:
    new_dir = os.path.join(output_dir, t + '_links')
    if not os.path.exists(new_dir):
        os.makedirs(new_dir)

for filename in os.listdir(data_dir):
    filepath = os.path.join(data_dir, filename)
    t = filename.split('-')[0]
    outputdir = os.path.join(output_dir, t + '_links')
    create_mambo_crossnet_table(input_file=filepath, 
                       src_file=srcfile, 
                       dst_file=dstfile, 
                       dataset_name='STRING',
                       db_id=0, 
                       src_node_index=0, 
                       dst_node_index=1, 
                       mode_name1=None,
                       mode_name2=None, 
                       output_dir=outputdir, 
                       full_crossnet_file=None, 
                       db_edge_file=None,
                       src_mode_filter=None, 
                       dst_mode_filter=None, 
                       mambo_id_counter_start=-1,
                       skip_missing_ids=True)
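
The loop above takes the text before the first `-` in each file name as the evidence type and passes every entry of `data_dir` to `create_mambo_crossnet_table`. A stray file in that directory would therefore be routed to a directory that was never created. The sketch below shows one way to guard against that. `output_dir_for` is a hypothetical helper, and the file names in the example are made up.

```python
import os

# Evidence types, spelled as in the notebook's list.
TYPES = ['neighborhood', 'fusion', 'cooccurence', 'coexpression',
         'experimental', 'database', 'textmining', 'combined_score']


def output_dir_for(filename, base_dir, types=TYPES):
    """Return the '<type>_links' directory for a file, or None if its prefix is unknown."""
    link_type = filename.split('-')[0]
    if link_type not in types:
        return None
    return os.path.join(base_dir, link_type + '_links')


base = 'datasets/cancer_example/protein-protein/'
for name in ['fusion-example.tsv', '.DS_Store']:
    target = output_dir_for(name, base)
    if target is None:
        print('skipping', name)
    else:
        print(name, '->', target)
```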