# 04 Creating Mode Tables for the Multimodal Cancer Network

In this notebook, we construct mode tables for the multimodal cancer network, one for each of its five modes: Chemical, Disease, Gene, Function, and Protein.

We begin by importing the necessary Python packages.

In [1]:
from collections import defaultdict
import os
import time

from utils.create_mapped_mode_table import create_mapped_mode_table
from utils.create_mapping_table import create_mapping_table
from utils.create_mambo_mode_table import create_mambo_mode_table

## Using Controlled Vocabularies to Map between Different Naming Schemes

As explained in notebook [02 Data Representation in Mambo](02 Data Representation in Mambo.ipynb), dictionaries provide a mapping between different naming schemes used for the same biological entities. 

In this notebook, we show how to construct these dictionaries. This procedure is interspersed with the construction of mode tables.
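
As a rough illustration of the idea, and not of how Mambo implements it, a dictionary can be pictured as a lookup keyed on one identifier, with one entry per alternative naming scheme. The sketch below builds such a lookup from equivalence triples. The helper `build_dictionary`, the scheme names, and the identifiers are all hypothetical.

```python
from collections import defaultdict


def build_dictionary(equivalences):
    """Return {primary_id: {scheme: other_id}} from (primary_id, scheme, other_id) triples.

    Hypothetical helper for illustration only; it is not part of Mambo.
    """
    table = defaultdict(dict)
    for primary_id, scheme, other_id in equivalences:
        table[primary_id][scheme] = other_id
    return dict(table)


# Made-up identifiers in two made-up naming schemes.
equivalences = [
    ("A1", "scheme_x", "X-100"),
    ("A1", "scheme_y", "Y-7"),
    ("A2", "scheme_x", "X-200"),
]
dictionary = build_dictionary(equivalences)
print(dictionary["A1"])
```

An entity known under only one alternative scheme (here `A2`) simply has fewer entries, which mirrors the fact that not every database covers every entity.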

## Creating Chemical Mode Tables

We begin by constructing the chemical mode tables. We use a list of chemicals derived from the [DrugBank](https://www.drugbank.ca/) database. DrugBank assigns a DrugBank ID to each chemical and, for some chemicals, also provides a [PubChem](https://pubchem.ncbi.nlm.nih.gov/) Compound ID (CID) and/or a PubChem Substance ID (SID). The DrugBank database is distributed in XML format, but it has already been parsed into a tab-separated format that is ready to be used with Mambo.

The parsed data is located in:
- `datasets/cancer_example/chemical/drugbank_parsed.tsv`
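
Because only some chemicals carry a CID or SID, it can help to check how many rows have each identifier before building the mapping. The following is a minimal sketch that assumes a three-column layout (DrugBank ID, CID, SID) with empty fields for missing values. The helper `column_coverage` is hypothetical, and the rows are placeholders rather than real DrugBank records.

```python
import csv
import io

# Placeholder rows: DrugBank ID, PubChem CID, PubChem SID (empty when absent).
sample = "DB_A\t111\t9001\nDB_B\t\t9002\nDB_C\t333\t\n"


def column_coverage(handle, n_columns=3):
    """Count rows and non-empty values per column in a tab-separated file."""
    filled = [0] * n_columns
    total = 0
    for row in csv.reader(handle, delimiter="\t"):
        total += 1
        for i in range(n_columns):
            if i < len(row) and row[i].strip():
                filled[i] += 1
    return total, filled


total, filled = column_coverage(io.StringIO(sample))
print(total, filled)
```

To run the same check on the parsed file, pass `open("datasets/cancer_example/chemical/drugbank_parsed.tsv", newline="")` in place of the `io.StringIO` object, assuming the file follows the layout described above.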

In [2]:
mappingfile = "datasets/cancer_example/chemical/drugbank_parsed.tsv"
outputfile = "datasets/cancer_example/chemical/mapped_chemicals.tsv"
# Map DrugBank IDs (column 0) to PubChem Compound IDs (column 1).
create_mapping_table(mapping_file=mappingfile, 
                     mindex1=0, 
                     mindex2=1, 
                     output_file=outputfile, 
                     output_index1=0, 
                     output_index2=1, 
                     output_title1="drugbank", 
                     output_title2="PubChemCompound")
# Map DrugBank IDs (column 0) to PubChem Substance IDs (column 2).
create_mapping_table(mapping_file=mappingfile, 
                     mindex1=0, 
                     mindex2=2, 
                     output_file=outputfile, 
                     output_index1=0, 
                     output_index2=2, 
                     output_title1="drugbank", 
                     output_title2="PubChemSubstance")

inputfile = "datasets/cancer_example/chemical/drugbank_parsed.tsv"
outputdir = "datasets/cancer_example/chemical"
create_mapped_mode_table(mode_name="chemical", 
                         input_file=inputfile, 
                         dataset_name="drugbank", 
                         db_id=0,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=1, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)
create_mapped_mode_table(mode_name="chemical", 
                         input_file=inputfile, 
                         dataset_name="PubChemCompound", 
                         db_id=1,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=2, 
                         node_index=1,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)
create_mapped_mode_table(mode_name="chemical", 
                         input_file=inputfile, 
                         dataset_name="PubChemSubstance", 
                         db_id=2,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=3, 
                         node_index=2,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)

## Creating Disease Mode Tables

We use three knowledge databases for information about diseases: [Disease Ontology](http://disease-ontology.org/), [Comparative Toxicogenomics Database (CTD)](http://ctdbase.org/), and [Online Mendelian Inheritance in Man (OMIM)](https://www.omim.org/). Disease Ontology uses Disease Ontology IDs (DOIDs), CTD uses both MESH and OMIM IDs, and OMIM uses OMIM IDs. We therefore need to map disease identifiers between these three controlled vocabularies (i.e., DOIDs, MESH IDs, and OMIM IDs).

We provide the parsed disease information in the following four files:
- `datasets/cancer_example/disease/ctd_mesh_parsed.tsv`
- `datasets/cancer_example/disease/ctd_omim_parsed.tsv`
- `datasets/cancer_example/disease/do_parsed.tsv`
- `datasets/cancer_example/disease/omim_parsed.tsv`

We also provide two mapping files to be used to construct the disease dictionary:
- `datasets/cancer_example/disease/do_mesh_equiv.tsv`
- `datasets/cancer_example/disease/do_omim_equiv.tsv`
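
Conceptually, combining the two mapping files yields one table keyed on DOID with an OMIM column and a MESH column. The sketch below shows that merge in plain Python under the assumption that each file holds two tab-separated columns with the DOID first. It is not the implementation of `create_mapping_table`; the helpers `read_pairs` and `merge_equivalences` are hypothetical, and the identifiers are made-up placeholders.

```python
import csv
import io

# Made-up placeholder rows standing in for the two equivalence files.
do_omim = "DOID:0000001\tOMIM:000001\nDOID:0000002\tOMIM:000002\n"
do_mesh = "DOID:0000001\tMESH:D000001\nDOID:0000003\tMESH:D000003\n"


def read_pairs(text):
    """Parse two-column tab-separated text into a list of [doid, other_id] rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


def merge_equivalences(omim_pairs, mesh_pairs):
    """Merge DOID-OMIM and DOID-MESH pairs into (DOID, OMIM, MESH) rows."""
    table = {}
    for doid, omim in omim_pairs:
        table.setdefault(doid, {"OMIM": "", "MESH": ""})["OMIM"] = omim
    for doid, mesh in mesh_pairs:
        table.setdefault(doid, {"OMIM": "", "MESH": ""})["MESH"] = mesh
    # Sort by DOID so the output order is deterministic.
    return [(doid, cols["OMIM"], cols["MESH"]) for doid, cols in sorted(table.items())]


merged = merge_equivalences(read_pairs(do_omim), read_pairs(do_mesh))
for row in merged:
    print("\t".join(row))
```

A DOID that appears in only one of the two files keeps an empty field in the other column, so no equivalence is lost during the merge.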


In [3]:
mappingfile = "datasets/cancer_example/disease/do_omim_equiv.tsv"
outputfile = "datasets/cancer_example/disease/mapped_diseases.tsv"
create_mapping_table(mapping_file=mappingfile, 
                     mindex1=0, 
                     mindex2=1, 
                     output_file=outputfile, 
                     output_index1=0, 
                     output_index2=1, 
                     output_title1="DOID", 
                     output_title2="OMIM")
mappingfile = "datasets/cancer_example/disease/do_mesh_equiv.tsv"
create_mapping_table(mapping_file=mappingfile, 
                     mindex1=0, 
                     mindex2=1, 
                     output_file=outputfile, 
                     output_index1=0, 
                     output_index2=2, 
                     output_title1="DOID", 
                     output_title2="MESH")

outputdir = "datasets/cancer_example/disease"
inputfile = "datasets/cancer_example/disease/do_parsed.tsv"
create_mapped_mode_table(mode_name="disease", 
                         input_file=inputfile, 
                         dataset_name="DO", 
                         db_id=0,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=1, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)
inputfile = "datasets/cancer_example/disease/omim_parsed.tsv"
create_mapped_mode_table(mode_name="disease", 
                         input_file=inputfile, 
                         dataset_name="OMIM", 
                         db_id=1,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=2, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)
inputfile = "datasets/cancer_example/disease/ctd_omim_parsed.tsv"
create_mapped_mode_table(mode_name="disease", 
                         input_file=inputfile, 
                         dataset_name="CTD_OMIM", 
                         db_id=2,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=2, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)
inputfile = "datasets/cancer_example/disease/ctd_mesh_parsed.tsv"
create_mapped_mode_table(mode_name="disease", 
                         input_file=inputfile, 
                         dataset_name="CTD_MESH", 
                         db_id=3,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=3, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)

## Creating Function Mode Tables

Information about gene or protein function is obtained from the [Gene Ontology](http://geneontology.org/) database. Because we are only using one database for function information at this time, there is no need to construct a mapping table for the Function mode.

We provide the parsed function information in the following file:
- `datasets/cancer_example/function/go_nodes.tsv`
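
With a single source database, building a mode table reduces to giving each distinct node its own identifier. The sketch below illustrates one plausible scheme, consecutive integers starting from a configurable offset. This is an assumption about the general approach rather than a description of `create_mambo_mode_table`; the helper `assign_ids` is hypothetical, and the node names are placeholders.

```python
from itertools import count


def assign_ids(node_ids, start=0):
    """Assign consecutive integer IDs to unique node identifiers, in first-seen order."""
    counter = count(start)
    assigned = {}
    for node in node_ids:
        if node not in assigned:
            assigned[node] = next(counter)
    return assigned


# Placeholder node names; a repeated node keeps the ID it was first given.
nodes = ["GO:a", "GO:b", "GO:a", "GO:c"]
ids = assign_ids(nodes)
print(ids)
```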

In [4]:
inputfile = "datasets/cancer_example/function/go_nodes.tsv"
outputdir = "datasets/cancer_example/function"
create_mambo_mode_table(input_file=inputfile, 
                       db_id=0, 
                       mode_name="function", 
                       dataset_name="GO", 
                       full_mode_file=None, 
                       output_dir=outputdir, 
                       db_node_file=None,
                       mambo_id_counter_start=0, 
                       node_index=0)

## Creating Gene Mode Tables

We use two knowledge databases for information about genes: [HUGO](https://www.genenames.org/) and [GeneMANIA](http://genemania.org/). Both HUGO and GeneMANIA use ENSEMBL gene IDs, so we do not require a mapping table for the Gene mode.

We provide the parsed gene information in the following two files:
- `datasets/cancer_example/gene/genemania_parsed.tsv`
- `datasets/cancer_example/gene/hugo_parsed.tsv`

In [5]:
outputdir = "datasets/cancer_example/gene"
inputfile = "datasets/cancer_example/gene/icgc_parsed.tsv"
create_mambo_mode_table(input_file=inputfile, 
                       db_id=0, 
                       mode_name="gene", 
                       dataset_name="ICGC", 
                       full_mode_file=None, 
                       output_dir=outputdir, 
                       db_node_file=None,
                       mambo_id_counter_start=-1, 
                       node_index=0)

## Creating Protein Mode Tables

We use two knowledge databases for information about proteins: [STRING](https://string-db.org/) and [Gene Ontology](http://geneontology.org/). The STRING database uses ENSEMBL protein IDs and Gene Ontology uses UniProt IDs.
 
We provide the parsed protein information in the following two files: 
- `datasets/cancer_example/protein/string_parsed.tsv` 
- `datasets/cancer_example/protein/go_parsed.tsv`

We also provide a mapping file, obtained from UniProt, to be used to construct the protein dictionary:
- `datasets/cancer_example/protein/uniprot_ensembl.tsv`
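
A mapping between two naming schemes rarely covers every identifier, so it is worth knowing which ones have no equivalent. The sketch below translates a list of IDs through a mapping and collects those that cannot be translated. The helper `translate_ids` is hypothetical, and the identifiers are placeholders, not real ENSEMBL or UniProt accessions.

```python
def translate_ids(source_ids, mapping):
    """Split IDs into those with a known equivalent and those without."""
    translated = {}
    unmapped = []
    for source_id in source_ids:
        if source_id in mapping:
            translated[source_id] = mapping[source_id]
        else:
            unmapped.append(source_id)
    return translated, unmapped


# Placeholder identifiers, not real ENSEMBL or UniProt accessions.
ensembl_to_uniprot = {"ENSP_A": "UNIPROT_1", "ENSP_B": "UNIPROT_2"}
translated, unmapped = translate_ids(["ENSP_A", "ENSP_C"], ensembl_to_uniprot)
print(translated)
print(unmapped)
```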

In [6]:
mappingfile = "datasets/cancer_example/protein/uniprot_ensembl.tsv"
outputfile = "datasets/cancer_example/protein/protein_mapping.tsv"
create_mapping_table(mapping_file=mappingfile, 
                     mindex1=0, 
                     mindex2=1, 
                     output_file=outputfile, 
                     output_index1=0, 
                     output_index2=1, 
                     output_title1="Uniprot", 
                     output_title2="ENSEMBL")

outputdir = "datasets/cancer_example/protein"
inputfile = "datasets/cancer_example/protein/string_parsed.tsv"
create_mapped_mode_table(mode_name="protein", 
                         input_file=inputfile, 
                         dataset_name="STRING", 
                         db_id=0,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=1, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)
inputfile = "datasets/cancer_example/protein/go_parsed.tsv"
create_mapped_mode_table(mode_name="protein", 
                         input_file=inputfile, 
                         dataset_name="GO_UNIPROT", 
                         db_id=1,
                         mapping_file=outputfile, 
                         skip=False, 
                         map_index=2, 
                         node_index=0,
                         output_dir=outputdir, 
                         full_mode_file=None, 
                         db_node_file=None)