# 02 Data Representation

One advantage of Mambo is its ability to represent heterogeneous biological networks. To work with these networks, it is important to understand how they are represented: a multimodal network contains two types of data, modes and links.

In this section, we detail what **modes** and **links** are and how they are represented. 

# Nodes vs. Modes

Nodes represent biological entities, like genes, chemicals, or diseases. Each node belongs to a node type, or a set of nodes of the same biological type, referred to as a **mode**. In Mambo, nodes are stored in tables, with one table per mode. 

There are two types of mode tables in Mambo. (1) First, there is a **full mode table**, which maps a unique Mambo node ID to each node and also lists which dataset the node comes from. (2) Second, there are additional **mode tables, one for each dataset**, which map the unique Mambo node ID to the name of the node in the dataset from which it originates.

We provide below an example for these two types of mode tables. The first is the full mode table for genes. The gene mode table is constructed from two databases: [GeneMANIA](http://genemania.org) and [HUGO](https://www.genenames.org). The first column in the table corresponds to the unique Mambo node ID and the second column lists the IDs of the databases where the node is found. 

#### Full Mode Table:

\# Full mode table for Genes

mambo_nid | dataset_id
---------|-----------
5152     | 0,1
20531    | 1
9073     | 0,1
13841    | 0,1
20532    | 1
11823    | 0,1
...      | ...


The following two tables are database-specific mode tables, one for GeneMANIA and one for HUGO. The first column is the unique Mambo node ID and the second column is the original node ID from the database. For example, the Mambo node ID 5152 corresponds to the [ENSEMBL](https://www.ensembl.org/index.html) gene ID [ENSG00000121410](https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000121410;r=19:58345178-58353499), which is found in both GeneMANIA (dataset ID 0) and HUGO (dataset ID 1).

#### Mode Table for GeneMANIA (dataset_id = 0):

\# Mode table for dataset: GeneMANIA


mambo_nid | dataset_nid
---------|-----------
0        | ENSG00000000003
1        | ENSG00000000005
2        | ENSG00000000419
3        | ENSG00000000457
4        | ENSG00000000460
...      | ...


#### Mode Table for HUGO (dataset_id = 1):

\# Mode table for dataset: HUGO


mambo_nid | dataset_nid
---------|-----------
5152     | ENSG00000121410
20531    | ENSG00000268895
9073     | ENSG00000148584
13841    | ENSG00000175899
20532    | ENSG00000245105
11823    | ENSG00000166535
...      | ...
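
To make the relationship between the two kinds of mode tables concrete, the following is a minimal sketch that reads a full mode table and a dataset-specific mode table and looks up a node in both. It assumes the tables are stored as tab-separated text with `#` comment lines followed by a header row; that layout, the helper `read_table()`, and the variable names are illustrative assumptions and not part of Mambo's API. The rows are copied from the tables above.

```python
import csv
import io

# Assumed layout: tab-separated text, '#' comment lines, then a header row.
FULL_MODE_TSV = """\
# Full mode table for Genes
mambo_nid\tdataset_id
5152\t0,1
20531\t1
9073\t0,1
"""

HUGO_MODE_TSV = """\
# Mode table for dataset: HUGO
mambo_nid\tdataset_nid
5152\tENSG00000121410
20531\tENSG00000268895
9073\tENSG00000148584
"""


def read_table(text):
    """Parse a tab-separated table into dicts, skipping '#' comment lines.

    Hypothetical helper for this sketch only.
    """
    lines = [line for line in io.StringIO(text) if not line.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


# Full mode table: Mambo node ID -> set of dataset IDs containing the node.
datasets_by_node = {
    int(row["mambo_nid"]): {int(d) for d in row["dataset_id"].split(",")}
    for row in read_table(FULL_MODE_TSV)
}

# Dataset-specific table: Mambo node ID -> the node's original ID in HUGO.
hugo_id_by_node = {
    int(row["mambo_nid"]): row["dataset_nid"]
    for row in read_table(HUGO_MODE_TSV)
}

print(sorted(datasets_by_node[5152]))  # [0, 1]
print(hugo_id_by_node[5152])           # ENSG00000121410
```

The full mode table answers "which datasets contain this node?", while each dataset-specific table answers "what is this node called in that dataset?".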


Mode tables can be constructed using either `create_mambo_mode_table()`, located in `utils/create_mambo_mode_table.py`, or `create_mapped_mode_table()`, located in `utils/create_mapped_mode_table.py`. The use of these methods is described in more detail in a subsequent notebook, [04 Creating Mode Tables](04 Creating Mode Tables.ipynb).

# Edges vs. Links

Edges represent relationships between nodes. Edges can go between two nodes that belong to the same mode or between nodes from different modes. 

Edges are stored in tables, referred to as link tables. There are two types of link tables. (1) First, there is a **full link table**, which maps a unique Mambo edge ID to each edge, states which dataset the edge comes from, and lists the node IDs of the two endpoints of the edge. (2) Second, there are additional **link tables, one for each database**, which list the unique Mambo edge IDs of the edges in that database and the database each endpoint node originates from.

We provide an example for gene-protein link tables. The following table is the full link table for gene-protein links. The links are obtained from ENSEMBL through [Biomart](http://www.biomart.org/). The first column in the table corresponds to the unique Mambo edge IDs and the second column lists the IDs of the databases where the edges come from. The third and fourth columns list the Mambo node IDs of the source and destination nodes of every edge.

#### Full Link Table:

\# Full crossnet file for genes (human) to protein (human)

mambo_eid | dataset_id | src_mambo_nid | dst_mambo_nid
---------|------------|--------------|-------------
0        | 0          | 15526        | 9488
1        | 0          | 18316        | 9862
2        | 0          | 8049         | 5051
3        | 0          | 6576         | 17061
4        | 0          | 6008         | 5334
5        | 0          | 2884         | 17446
...      | ...        | ...          | ...


The following table is an example of a database-specific link table. The first column is the unique Mambo edge ID. The second and third columns indicate which gene and protein databases, respectively, the two endpoint nodes come from. For example, Mambo edge ID 0 corresponds to an edge between the gene with Mambo node ID 15526 and the protein with Mambo node ID 9488. Gene 15526 comes from the database with ID 1 (in this case, the HUGO database), and protein 9488 comes from the database with ID 0 (in this case, the [STRING](https://string-db.org/) database).


#### Database Specific Link Table:

\# Crossnet table for dataset: ENSEMBL


mambo_eid | src_dataset_id | dst_dataset_id
---------|----------------|---------------
0        | 1              | 0
1        | 1              | 0
2        | 1              | 0
3        | 1              | 0
4        | 1              | 0
5        | 1              | 0
...      | ...            | ...
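
The two link tables share the `mambo_eid` column, so they can be joined on it to recover, for each edge, both its endpoints and the databases those endpoints come from. The following is a minimal sketch of that join using plain Python tuples copied from the tables above; the in-memory representation and the helper `describe_edge()` are assumptions made for illustration and are not Mambo functions.

```python
# Rows copied from the full link table above:
# (mambo_eid, dataset_id, src_mambo_nid, dst_mambo_nid)
full_link_rows = [
    (0, 0, 15526, 9488),
    (1, 0, 18316, 9862),
    (2, 0, 8049, 5051),
]

# Rows copied from the database-specific link table above:
# (mambo_eid, src_dataset_id, dst_dataset_id)
dataset_link_rows = [
    (0, 1, 0),
    (1, 1, 0),
    (2, 1, 0),
]

# Index the database-specific rows by edge ID so the join is a dict lookup.
endpoint_datasets = {
    eid: (src_ds, dst_ds) for eid, src_ds, dst_ds in dataset_link_rows
}


def describe_edge(row):
    """Combine one full-link-table row with its endpoint dataset IDs.

    Hypothetical helper for this sketch only.
    """
    eid, link_dataset, src_nid, dst_nid = row
    src_ds, dst_ds = endpoint_datasets[eid]
    return {
        "mambo_eid": eid,
        "link_dataset_id": link_dataset,
        "src": {"mambo_nid": src_nid, "dataset_id": src_ds},
        "dst": {"mambo_nid": dst_nid, "dataset_id": dst_ds},
    }


edges = [describe_edge(row) for row in full_link_rows]
print(edges[0]["src"])  # {'mambo_nid': 15526, 'dataset_id': 1}
print(edges[0]["dst"])  # {'mambo_nid': 9488, 'dataset_id': 0}
```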


Link tables can be constructed using `create_mambo_crossnet_table()`, located in `utils/create_mambo_crossnet_table.py`. The use of this method is described in more detail in a subsequent notebook, [05 Creating Link Tables](05 Creating Link Tables.ipynb).

# Controlled Vocabularies a.k.a. Mapping Tables

Controlled vocabularies map a given biological entity to different naming schemes. Mambo uses controlled vocabularies to address a pressing challenge in biomedical data: multiple distinct names are used to refer to the same entity. For example, gene ``A1BG`` is referred to as ``ENSG00000121410`` in the [ENSEMBL database](https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000121410;r=19:58345178-58353499), but as ``HGNC:5`` in the [HUGO database](https://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=HGNC:5). That gene also has a protein, ``P04217``, in the [UniProtKB database](http://www.uniprot.org/uniprot/P04217). The challenge is even more apparent for other types of biological entities, such as diseases. For example, ``Dressler's syndrome`` is referred to as ``DOID:10507`` in [Disease Ontology](http://disease-ontology.org/), but as ``UMLS_CUI:C0152107`` in the [UMLS database](https://www.ncbi.nlm.nih.gov/medgen/508890). Furthermore, this syndrome also has an alternative name, ``Postmyocardial infarction syndrome``.

When data is collected from multiple databases, each database often uses its own naming scheme. However, only truly distinct entities should be distinct in the network; the same entity should not appear in the network multiple times. To tackle this issue, Mambo provides mapping capabilities in the form of **mapping dictionaries**. Given a mapping between entities in a given mode, the function `create_mapping_table()` in `utils/create_mapping_table.py` creates a dictionary for them. If you use mapping dictionaries, create mode tables with `create_mapped_mode_table()`, located in `utils/create_mapped_mode_table.py`.

The following table shows a mapping table between two naming schemes for proteins, ENSEMBL protein IDs and UniProtKB IDs. For example, ENSP00000349259 and Q01082 correspond to the same protein, which is represented using Mambo ID 0 in the multimodal network.

mambo_id | ensemble_protein_id | uniprot_id
--------|------------------|-------------
0       | ENSP00000349259  | Q01082 
1       | ENSP00000349708  | O95789
2       | ENSP00000349709  | Q8TD07
3       | ENSP00000350961  | Q96GE9
4       | ENSP00000317379  | O94925
5       | ENSP00000349250  | P40617
...     | ...              | ...
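
As an illustration of how such a mapping table lets different names resolve to one node, the following minimal sketch builds a lookup from every known name to its Mambo ID. It is not the implementation of `create_mapping_table()`; the tuple layout and the names `mapping_rows` and `mambo_id_by_name` are assumptions made for this example. The rows are copied from the table above.

```python
# Rows copied from the mapping table above:
# (mambo_id, ENSEMBL protein ID, UniProtKB ID)
mapping_rows = [
    (0, "ENSP00000349259", "Q01082"),
    (1, "ENSP00000349708", "O95789"),
    (2, "ENSP00000349709", "Q8TD07"),
]

# Map every name, from either naming scheme, to its single Mambo ID.
mambo_id_by_name = {}
for mambo_id, ensembl_id, uniprot_id in mapping_rows:
    for name in (ensembl_id, uniprot_id):
        # setdefault returns the stored ID if the name was already seen.
        existing = mambo_id_by_name.setdefault(name, mambo_id)
        if existing != mambo_id:
            raise ValueError(
                f"{name} maps to both Mambo IDs {existing} and {mambo_id}"
            )

# Both names of the same protein resolve to the same Mambo ID.
print(mambo_id_by_name["ENSP00000349259"])  # 0
print(mambo_id_by_name["Q01082"])           # 0
```

Raising an error on a conflicting name is one possible design choice; it surfaces ambiguous mappings early instead of silently merging or splitting entities.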
