# Building a protein-protein interaction network

In [7]:
import graph_tool_extras as gte

## Introduction

This notebook builds a network of human protein-protein interactions.
The data come from the file below, which is part of the Molecular Interaction Search Tool (MIST) dataset available at https://fgrtools.hms.harvard.edu/MIST.

In [8]:
PATH = 'MIST_interaction_ppi_vs5_0-9606.txt'

## Understanding the data

MIST_interaction_ppi_vs5_0-9606.txt has 18 columns and 1048431 rows, each row representing one protein-protein interaction. The columns are described below, in documentation extracted from https://fgrtools.hms.harvard.edu/MIST/downloads.jsp.
1. MasterNetID: our own unique interaction ID
2. TaxID_A: taxonomy id for interacting partner A, for example, 9606 for human genes
3. GeneA: entrez geneid for interacting partner A
4. TaxID_B: taxonomy id for interacting partner B, for example, 9606 for human genes
5. GeneB: entrez geneid for interacting partner B
6. Rank: confidence of interaction (high, moderate and low)
7. Interaction_type: PPI, genetic, interolog, interolog-genetic
8. Exp_Direct: evidence codes (MI ID) for direct interaction eg. "MI:0018" for 2-Hybrid interaction
9. Exp_Indirect: evidence code for in-direct interaction eg. "MI:0488" for psi-mi
10. TaxID_interolog: this column is only populated for interolog and shows the taxonomy ID of the original data
11. Source_databases: eg. BioGrid
12. Reference: Pubmed_ID or other reference ID (eg reference from FlyBase)
13. Reference_type: eg. "PMID"
14. Count_direct: count of unique evidence code (MI number) as direct interaction
15. Count_indirect: count of unique evidence code (MI number) as in-direct interaction
16. Count_paper: count of unique reference IDs
17. Source_Interolog: this column is only populated for interolog and shows the MasterNetID of the original data
18. Comment

Each protein can be represented by the gene that encodes it. The GeneA and GeneB columns hold the Entrez gene IDs of the genes that encode the two proteins in each interaction. Gene names are not available in the database.
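As a quick way to relate the column list above to positions in a row, the sketch below maps column names to zero-based indices. It assumes the file is tab-separated with quoted fields and that the header names match the documentation; the `SAMPLE` text and its values are hypothetical and only mimic the first six columns.

```python
import csv
import io

# Hypothetical sample mimicking the first six documented columns
# (tab-separated, quoted fields). The values are made up for illustration.
SAMPLE = (
    '"MasterNetID"\t"TaxID_A"\t"GeneA"\t"TaxID_B"\t"GeneB"\t"Rank"\n'
    '"1"\t"9606"\t"1000"\t"9606"\t"2000"\t"high"\n'
)


def column_index(header_line):
    """Map each column name in a header line to its zero-based position."""
    reader = csv.reader(io.StringIO(header_line), delimiter='\t', quotechar='"')
    names = next(reader)
    return {name: i for i, name in enumerate(names)}


header = SAMPLE.splitlines()[0]
index = column_index(header)
print(index['GeneA'], index['GeneB'], index['Rank'])
```

Under these assumptions, GeneA, GeneB and Rank sit at positions 2, 4 and 5, and MasterNetID at position 0.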

## Creating functions to build the network

In [14]:
def get_or_add_vertex(g, gene_id):
    """Return the vertex for gene_id, creating it on first use."""
    u = g.vertex_by_id(gene_id)
    if u is None:
        u = g.add_vertex_by_id(gene_id)
        u['id'] = gene_id
    return u

In [None]:
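def get_or_add_edge(g, gene_a, gene_b, master_net_id):
    """Return the edge between gene_a and gene_b, creating it on first use.

    Only the first MasterNetID seen for a pair of genes is stored.
    """
    e = g.edge_by_ids(gene_a, gene_b)
    if e is None:
        e = g.add_edge_by_ids(gene_a, gene_b)
        e['master_net_id'] = master_net_id
    return e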
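The get-or-add pattern used by both functions can be illustrated without the graph library. The sketch below is a plain-dictionary analogue, not the `graph_tool_extras` implementation: `get_or_add`, `add_interaction`, and the gene and interaction IDs are hypothetical names and values chosen for illustration.

```python
def get_or_add(store, key, factory):
    """Return store[key], creating it with factory() on first use."""
    if key not in store:
        store[key] = factory()
    return store[key]


vertices = {}
edges = {}


def add_interaction(gene_a, gene_b, master_net_id):
    """Register both genes and the undirected edge between them."""
    get_or_add(vertices, gene_a, dict)
    get_or_add(vertices, gene_b, dict)
    # A frozenset makes (a, b) and (b, a) the same key, as in an undirected graph.
    key = frozenset((gene_a, gene_b))
    return get_or_add(edges, key, lambda: {'master_net_id': master_net_id})


add_interaction('1000', '2000', 'x1')
add_interaction('2000', '1000', 'x2')  # same pair reversed: no new edge is created
print(len(vertices), len(edges))
```

As in the functions above, a repeated pair keeps the properties recorded the first time it was seen, so the second call leaves the stored `master_net_id` unchanged.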

## Reading the data and building the network

In [15]:
g = gte.Graph(directed=False)

In [16]:
g.add_ep('rank')
g.add_ep('master_net_id')
g.add_vp('id')

In [17]:
with open(PATH) as file:
    # Skip the header row
    next(file)

    for line in file:
        # Columns are tab-separated
        parts = line.split('\t')

        # Drop the first and last character of each field
        # (assumed to be the quotes wrapping the value)
        parts = [part[1:-1] for part in parts]

        master_net_id = parts[0]
        gene_a = parts[2]
        gene_b = parts[4]
        rank = parts[5]

        # Keep only interactions with high or moderate confidence
        if rank != 'low':
            vertex_ga = get_or_add_vertex(g, gene_a)
            vertex_gb = get_or_add_vertex(g, gene_b)
            e = get_or_add_edge(g, gene_a, gene_b, master_net_id)
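An alternative to splitting and slicing each line by hand is to let the standard `csv` module handle the tabs and quotes. The sketch below is one possible variant, not the method used in this notebook; it assumes quoted, tab-separated fields, and `iter_interactions` and the `SAMPLE` rows are hypothetical.

```python
import csv
import io

# Hypothetical rows mimicking the first six columns of the file.
SAMPLE = (
    '"MasterNetID"\t"TaxID_A"\t"GeneA"\t"TaxID_B"\t"GeneB"\t"Rank"\n'
    '"m1"\t"9606"\t"1000"\t"9606"\t"2000"\t"high"\n'
    '"m2"\t"9606"\t"1000"\t"9606"\t"3000"\t"low"\n'
    '"m3"\t"9606"\t"2000"\t"9606"\t"3000"\t"moderate"\n'
)


def iter_interactions(file, skip_rank='low'):
    """Yield (master_net_id, gene_a, gene_b, rank) for rows whose rank is kept."""
    reader = csv.reader(file, delimiter='\t', quotechar='"')
    next(reader)  # skip the header row
    for row in reader:
        master_net_id, gene_a, gene_b, rank = row[0], row[2], row[4], row[5]
        if rank != skip_rank:
            yield master_net_id, gene_a, gene_b, rank


kept = list(iter_interactions(io.StringIO(SAMPLE)))
print(kept)
```

With the real file, the same generator could be fed an open file handle, and each yielded tuple passed to `get_or_add_vertex` and `get_or_add_edge`.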

In [20]:
g = gte.clean(g)

In [22]:
gte.save(g, 'mist_ppi_human.net.gz')

## Configuring the layout and rendering the network

In [23]:
from graph_tool import draw
import netpixi

In [24]:
layout = draw.sfdp_layout(g)

In [25]:
gte.move(g, layout)

In [26]:
gte.save(g, 'mist_ppi_human_layout.net.gz')

In [28]:
r = netpixi.render('mist_ppi_human_layout.net.gz', infinite=True)

In [29]:
r.vertex_default(
    size=4,
    color=0xffff00,
    bwidth=1,
    bcolor=0x007700,
)

In [21]:
r.edge_default(
    width=0.2,
    color=0x7777ff,
    curve1=1,
    curve2=1,
)