# 10 Supplementary: Filtering a Giga-Scale Multimodal Biological Network



Data is available for download from http://string-db.org/cgi/download.pl. Place the data files in the `datasets/protein_example/string` directory of this repository.

**Note:** The most recent version of the STRING database available at the time of writing is ``STRING.v10``. Version 10 is used throughout this notebook.

## Filtering Raw Data from STRING Database

This notebook shows how to construct a SNAP table (``snap.TTable``) from a CSV data file and how to filter the table to select the rows and columns of interest. This functionality is important for the construction of cross-species protein-protein interaction (PPI) networks, where we need to filter STRING edge data based on the experimental evidence supporting the edges.

Import relevant Python packages.

In [2]:
import os
import snap

Specify where the table we would like to filter is located.

In [3]:
filename = "datasets/protein_example/string/protein.links.detailed.v10.txt"

Specify the columns we plan to filter on, and the desired location for the resulting (filtered) table.

In [4]:
# Column names must match the names declared in the table schema below.
columns = [
'neighborhood',
'fusion',
'cooccurence',
'coexpression',
'experimental',
'database',
'textmining',
]

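output_dir = 'output'
binary_output = 'result.bin'
binary_output_path = os.path.join(output_dir, binary_output)

# Create the output directory if it does not exist yet.
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)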

Set up the table schema and load the table into memory.

Specify the column type for each column in the table. Types can be ``snap.atStr``, ``snap.atInt``, and ``snap.atFlt``.

In [None]:
context = snap.TTableContext()
schema = snap.Schema()
schema.Add(snap.TStrTAttrPr("protein1", snap.atStr))
schema.Add(snap.TStrTAttrPr("protein2", snap.atStr))
schema.Add(snap.TStrTAttrPr("neighborhood", snap.atInt))
schema.Add(snap.TStrTAttrPr("fusion", snap.atInt))
schema.Add(snap.TStrTAttrPr("cooccurence", snap.atInt))
schema.Add(snap.TStrTAttrPr("coexpression", snap.atInt))
schema.Add(snap.TStrTAttrPr("experimental", snap.atInt))
schema.Add(snap.TStrTAttrPr("database", snap.atInt))
schema.Add(snap.TStrTAttrPr("textmining", snap.atInt))
schema.Add(snap.TStrTAttrPr("combined_score", snap.atInt))
full_protein_table = snap.TTable.LoadSS(schema, filename, context, " ", snap.TBool(True))

print("Table loaded")
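
Because the schema must list the columns in the same order as the file, it can help to check the header line before loading a multi-gigabyte table. The following is a minimal sketch in plain Python, not part of SNAP; the helper `check_header` and the sample row (with made-up protein IDs and scores) are assumptions for illustration only.

```python
import io

# Column names in the order the schema above declares them.
EXPECTED_COLUMNS = [
    "protein1", "protein2", "neighborhood", "fusion", "cooccurence",
    "coexpression", "experimental", "database", "textmining", "combined_score",
]

def check_header(handle, expected=EXPECTED_COLUMNS, sep=" "):
    """Return True if the first line of `handle` lists exactly the expected columns."""
    header = handle.readline().rstrip("\n").split(sep)
    return header == expected

# Tiny in-memory stand-in for the real file (hypothetical IDs and scores).
sample = io.StringIO(
    u"protein1 protein2 neighborhood fusion cooccurence coexpression "
    u"experimental database textmining combined_score\n"
    u"1.A 1.B 0 0 0 0 120 0 300 350\n"
)
header_ok = check_header(sample)
print(header_ok)
```

For the real data, the same helper could be called on `open(filename)` instead of the in-memory sample.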

Filter the data.

Create a separate table for each column and only include rows with non-zero scores in that column.

It is also possible to select on string constants with ``SelectAtomicStrConst()``, on float constants with ``SelectAtomicFltConst()``, or to keep only the first `N` rows with ``SelectFirstNRows()``. General predicates are available through ``Select()``.

In [None]:
for column in columns:
    # Select all rows that have a non-zero confidence score for this type of relation (i.e., column).
    # Reuse the shared context so that string columns resolve against the same string pool.
    filtered_table = snap.TTable.New(full_protein_table.GetSchema(), context)
    full_protein_table.SelectAtomicIntConst(column, 0, snap.NEQ, filtered_table)

    # Can save in binary format and load it back.
    # Note: the same binary file is overwritten on every iteration.
    filtered_table.SaveBin(binary_output_path)
    filtered_table = snap.TTable.Load(snap.TFIn(binary_output_path), context)

    # Or can save in text format, one file per column
    tsv_output_path = os.path.join(output_dir, column + ".tsv")
    filtered_table.SaveSS(tsv_output_path)

    print("Saved table for column: %s" % column)
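
To make the effect of the per-column filter concrete, the loop above can be mimicked on a handful of rows in plain Python. This is only an illustrative sketch, not SNAP code: the helper `filter_nonzero` and the rows (with placeholder protein names and scores) are made up for the example.

```python
def filter_nonzero(rows, column):
    """Keep only the rows whose integer score in `column` is non-zero."""
    return [row for row in rows if int(row[column]) != 0]

# Hypothetical rows; real rows come from the STRING file.
rows = [
    {"protein1": "p1", "protein2": "p2", "experimental": 120, "textmining": 0},
    {"protein1": "p1", "protein2": "p3", "experimental": 0, "textmining": 300},
    {"protein1": "p2", "protein2": "p3", "experimental": 0, "textmining": 0},
]

# One filtered list per evidence column, mirroring one output table per column.
filtered = {col: filter_nonzero(rows, col) for col in ("experimental", "textmining")}
for col in sorted(filtered):
    print("%s: %d rows" % (col, len(filtered[col])))
```

An edge with zero scores in every evidence column, like the last row here, appears in none of the per-column outputs.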