# Building a protein-protein interaction network

In [35]:
import graph_tool_extras as gte

## Introduction

This notebook builds a network of human protein-protein interactions.
The data comes from the file below, which is part of the dataset published by the Molecular Interaction Search Tool (MIST) at https://fgrtools.hms.harvard.edu/MIST.

In [36]:
PATH = 'Datasets/MIST_interaction_ppi_vs5_0-9606.txt'

## Understanding the data

MIST_interaction_ppi_vs5_0-9606.txt has 18 columns and 1048431 rows, where each row represents one protein-protein interaction. The columns are described below, as documented at https://fgrtools.hms.harvard.edu/MIST/downloads.jsp.
1. MasterNetID: our own unique interaction ID
2. TaxID_A: taxonomy id for interacting partner A, for example, 9606 for human genes
3. GeneA: entrez geneid for interacting partner A
4. TaxID_B: taxonomy id for interacting partner B, for example, 9606 for human genes
5. GeneB: entrez geneid for interacting partner B
6. Rank: confidence of interaction (high, moderate and low)
7. Interaction_type: PPI, genetic, interolog, interolog-genetic
8. Exp_Direct: evidence codes (MI ID) for direct interaction eg. "MI:0018" for 2-Hybrid interaction
9. Exp_Indirect: evidence code for in-direct interaction eg. "MI:0488" for psi-mi
10. TaxID_interolog: this column is only populated for interolog and shows the taxonomy ID of the original data
11. Source_databases: eg. BioGrid
12. Reference: Pubmed_ID or other reference ID (eg reference from FlyBase)
13. Reference_type: eg. "PMID"
14. Count_direct: count of unique evidence code (MI number) as direct interaction
15. Count_indirect: count of unique evidence code (MI number) as in-direct interaction
16. Count_paper: count of unique reference IDs
17. Source_Interolog: this column is only populated for interolog and shows the MasterNetID of the original data
18. Comment
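
As a rough illustration of how these columns could be referred to in code, the sketch below lists their names in file order and pairs them with the fields of one row. The `COLUMNS` list and the `row_to_record` helper are hypothetical names introduced here, and the sample row is made up for illustration; it is not taken from the dataset.

```python
# Column names in file order, copied from the MIST documentation above.
COLUMNS = [
    'MasterNetID', 'TaxID_A', 'GeneA', 'TaxID_B', 'GeneB', 'Rank',
    'Interaction_type', 'Exp_Direct', 'Exp_Indirect', 'TaxID_interolog',
    'Source_databases', 'Reference', 'Reference_type', 'Count_direct',
    'Count_indirect', 'Count_paper', 'Source_Interolog', 'Comment',
]


def row_to_record(parts):
    """Pair each field of a split row with its column name (hypothetical helper)."""
    if len(parts) != len(COLUMNS):
        raise ValueError(f'expected {len(COLUMNS)} fields, got {len(parts)}')
    return dict(zip(COLUMNS, parts))


# Made-up row for illustration only; these values are not from the dataset.
sample = ['1', '9606', '111', '9606', '222', 'high', 'PPI', 'MI:0018', '',
          '', 'BioGrid', '12345', 'PMID', '1', '0', '1', '', '']
record = row_to_record(sample)
print(record['GeneA'], record['GeneB'], record['Rank'])
```

Looking fields up by name rather than by position makes later filtering steps easier to read, at the cost of building one dictionary per row.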

## Reading the data and building the network

In [44]:
with open(PATH) as file:

    # Ignoring the first line because it's the header.
    next(file)
    
    # Reading the file line by line, without fully loading it into the memory.
    for line in file:
        
        # Split the line into fields on tabs, dropping the line
        # ending first so it doesn't stay attached to the last field.
        parts = line.rstrip('\n').split('\t')

        # Drop the first and the last character of each field
        # to remove the surrounding quotation marks.
        parts = [part[1:-1] for part in parts]
        
        break
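
One possible way to continue from the parsing loop above is to keep only the gene pair of each row and collect the pairs as undirected edges. The sketch below uses plain Python rather than `graph_tool_extras`, whose API is not shown in this notebook. `collect_edges` is a hypothetical helper, and it rests on several assumptions: the column positions follow the documentation above (GeneA is column 3, GeneB is column 5, Interaction_type is column 7), every field is wrapped in double quotes as the comments in the loop suggest, and only rows of type `PPI` are wanted. The demo rows are made up and shortened to seven columns.

```python
import io


def collect_edges(lines, interaction_type='PPI'):
    """Return a set of undirected (gene, gene) pairs from tab-separated rows."""
    edges = set()
    for line in lines:
        # Drop the line ending, split on tabs, then strip surrounding quotes.
        parts = [part.strip('"') for part in line.rstrip('\r\n').split('\t')]
        gene_a, gene_b, kind = parts[2], parts[4], parts[6]
        if kind != interaction_type:
            continue
        # Sort the pair so that (a, b) and (b, a) count as the same edge.
        edges.add(tuple(sorted((gene_a, gene_b))))
    return edges


# Made-up rows for illustration only; these values are not from the dataset.
demo = io.StringIO(
    '"1"\t"9606"\t"111"\t"9606"\t"222"\t"high"\t"PPI"\n'
    '"2"\t"9606"\t"222"\t"9606"\t"111"\t"low"\t"PPI"\n'
    '"3"\t"9606"\t"111"\t"9606"\t"333"\t"low"\t"genetic"\n'
)
edges = collect_edges(demo)
print(edges)
```

With the real file, the same helper could be called inside `with open(PATH) as file:` after skipping the header with `next(file)`. Because it consumes the file one line at a time, the full table never has to be held in memory; only the set of edges is.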