# Analyzing the Pathway Commons 2 (PC2) database SIF file
## CS446/546 class session 1

### Goal: count the number of different types of biological interactions in PC2
### Approach: retrieve compressed tab-delimited "edge-list" file and tabulate "interaction" column

### Information you will need:

- The URL is: http://www.pathwaycommons.org/archives/PC2/v9/PathwayCommons9.All.hgnc.sif.gz
- You'll be using the Python modules `gzip`, `timeit`, `pandas`, `urllib.request`, `collections` and `operator`

### What is the ".sif" file format?

SIF stands for Simple Interaction File. The format is like this:
```
A1BG    controls-expression-of  A2M
A1BG    interacts-with  ABCC6
A1BG    interacts-with  ACE2
A1BG    interacts-with  ADAM10
A1BG    interacts-with  ADAM17
A1BG    interacts-with  ADAM9
...
```
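As a minimal sketch of how one SIF record breaks down, the snippet below splits two sample lines into their three fields. It assumes the fields are separated by a single tab character, as in the tab-delimited edge-list file described above; the helper name `parse_sif_line` is made up for this illustration.

```python
# Two sample records, written with explicit tab separators.
sif_text = (
    "A1BG\tcontrols-expression-of\tA2M\n"
    "A1BG\tinteracts-with\tABCC6\n"
)

def parse_sif_line(line):
    """Split one SIF record into (source, interaction_type, target)."""
    source, interaction_type, target = line.rstrip("\n").split("\t")
    return source, interaction_type, target

# parse_sif_line is a hypothetical helper, not part of any library
edges = [parse_sif_line(line) for line in sif_text.splitlines()]
for edge in edges:
    print(edge)
```

Each record is an edge: column 1 and column 3 name the two molecules, and column 2 names the interaction type.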

### Other stuff you should do:
- Print the first six lines of the uncompressed data file
- Use a timer to time how long your program takes
- Count how many rows there are in the data file 
- Estimate the number of proteins in the database; we'll define them operationally as strings in column 1 or column 3, for which the content of column 2 is one of these interactions: 'interacts-with', 'in-complex-with', 'neighbor-of'
- Count the total number of unique pairs of interacting molecules (ignoring interaction type)
- Count the number of rows for each type of interaction in the database
- Pythonistas: do it using Pandas and without using Pandas

### Step-by-step instructions for Python3:

- Start the timer
- Open a file object representing a stream of the remote, compressed data file, using `urlopen`
- Open a file object representing a stream of the uncompressed data file, using `gzip.GzipFile`
- Read one line at a time, until the end of the file
- Split line on "\t" and pull out the tuple of species1, interaction_type, species2 from the line of text
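Before running against the full download, the same steps can be tried offline. The sketch below is one way to do that, assuming a tiny gzip-compressed sample held in memory in place of the remote file. The rows and the protein names `P1` to `P5` are made up for illustration and are not PC2 data.

```python
import gzip
import io
from collections import Counter

# Made-up toy rows (NOT from PC2), in SIF column order.
toy_rows = [
    ("P1", "controls-expression-of", "P2"),
    ("P1", "interacts-with", "P3"),
    ("P3", "in-complex-with", "P1"),
    ("P4", "neighbor-of", "P5"),
]
ppi_types = {"interacts-with", "in-complex-with", "neighbor-of"}

# Build a gzip-compressed byte stream that stands in for the remote file.
raw = "".join("\t".join(row) + "\n" for row in toy_rows).encode("utf-8")
compressed = io.BytesIO(gzip.compress(raw))

toy_type_counts = Counter()
toy_proteins = set()
toy_pairs = set()

with gzip.GzipFile(fileobj=compressed, mode="r") as fd:
    for line in fd:  # each line is a bytes object
        a, itype, b = line.decode("utf-8").rstrip("\n").split("\t")
        toy_type_counts[itype] += 1
        if itype in ppi_types:
            toy_proteins.update((a, b))
            # store the pair in sorted order so (P1, P3) and (P3, P1) match
            toy_pairs.add((min(a, b), max(a, b)))

print(len(toy_proteins), len(toy_pairs))
```

Storing each pair as a sorted tuple, rather than a joined string, avoids any ambiguity if a name itself contains the separator character.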

In [1]:
from collections import defaultdict
from urllib.request import urlopen
import gzip
import timeit

baseURL = "http://www.pathwaycommons.org/archives/PC2/v9/"
filename = "PathwayCommons9.All.hgnc.sif.gz"
outFilePath = "pc.sif"
# interaction types that we treat as protein-protein interactions
interaction_types_ppi = {"interacts-with", "in-complex-with", "neighbor-of"}

start_time = timeit.default_timer()

# stream the remote file and decompress it on the fly
zfd = urlopen(baseURL + filename)
fd = gzip.GzipFile(fileobj=zfd, mode="r")

intctr = 0                     # rows whose type is in interaction_types_ppi
linectr = 0                    # rows read so far
interactions = set()           # unique unordered protein pairs
proteins = set()               # unique protein names seen in those rows
intnamectr = defaultdict(int)  # row count per interaction type

for line in fd:
    # print the first six raw (bytes) lines
    if linectr < 6:
        print(line)
    linectr += 1

    prot1, interaction_type, prot2 = line.decode("utf-8").rstrip("\n").split("\t")
    intnamectr[interaction_type] += 1
    if interaction_type in interaction_types_ppi:
        intctr += 1
        proteins.update((prot1, prot2))
        # order each pair alphabetically so that A-B and B-A count once
        interactions.add(min(prot1, prot2) + "-" + max(prot1, prot2))

elapsed = timeit.default_timer() - start_time

b'A1BG\tcontrols-expression-of\tA2M\n'
b'A1BG\tinteracts-with\tABCC6\n'
b'A1BG\tinteracts-with\tACE2\n'
b'A1BG\tinteracts-with\tADAM10\n'
b'A1BG\tinteracts-with\tADAM17\n'
b'A1BG\tinteracts-with\tADAM9\n'


How long did your program take to run?

In [2]:
print("{0:.2f}".format(elapsed) + " sec")

5.78 sec


How many protein-protein interactions are there in the data file?

In [3]:
print(intctr)

508480


How many unique protein names are there in the data file?

In [4]:
len(proteins)

17531

How many unique pairs of proteins (regardless of interaction type name) are there that interact?

In [5]:
len(interactions)

475553

How many interactions are there of each type, in PC2?

In [6]:
from operator import itemgetter
sorted(intnamectr.items(), key=itemgetter(1), reverse=True)

[('chemical-affects', 492765),
 ('interacts-with', 325616),
 ('in-complex-with', 182864),
 ('controls-state-change-of', 182450),
 ('catalysis-precedes', 149013),
 ('controls-expression-of', 123232),
 ('consumption-controlled-by', 22830),
 ('controls-production-of', 21494),
 ('controls-phosphorylation-of', 17029),
 ('used-to-produce', 14486),
 ('controls-transport-of', 7574),
 ('reacts-with', 3927),
 ('controls-transport-of-chemical', 3322)]
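As an alternative to `defaultdict` plus `sorted` with `itemgetter`, the `collections` module also provides `Counter`, whose `most_common` method returns the counts already sorted in descending order. A minimal sketch on a made-up list of interaction types (not PC2 counts):

```python
from collections import Counter

# Made-up sample of interaction-type labels, one per row.
toy_types = [
    "interacts-with",
    "chemical-affects",
    "interacts-with",
    "in-complex-with",
    "interacts-with",
    "chemical-affects",
]

toy_counts = Counter(toy_types)
ranked = toy_counts.most_common()  # list of (type, count), largest first
print(ranked)
```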

# Let's do it again, using Pandas:

Read from the uncompressed data stream and parse it into a data frame, using `pandas.read_csv`. The file has no header row, so the column names are supplied through `names`.

In [7]:
import pandas
zfd = urlopen(baseURL + filename)
fd = gzip.GzipFile(fileobj=zfd, mode="r")
df = pandas.read_csv(fd, sep="\t", names=["species1","interaction_type","species2"])
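`pandas.read_csv` can also decompress gzip data itself when given `compression="gzip"`, which would make the explicit `gzip.GzipFile` wrapper optional. The sketch below shows this on a small in-memory sample rather than the remote file. The variable names `buf` and `df_sample` are made up for the illustration.

```python
import gzip
import io

import pandas

# First two records of the file, with explicit tab separators.
sample = (
    b"A1BG\tcontrols-expression-of\tA2M\n"
    b"A1BG\tinteracts-with\tABCC6\n"
)
buf = io.BytesIO(gzip.compress(sample))  # stands in for the compressed download

# compression must be given explicitly: it cannot be inferred from a buffer
df_sample = pandas.read_csv(
    buf,
    sep="\t",
    names=["species1", "interaction_type", "species2"],
    compression="gzip",
)
print(df_sample)
```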

Use the `head` method on the data frame to print out the first rows (`head` returns five rows by default; `df.head(6)` would show six)

In [8]:
print(df.head())

  species1        interaction_type species2
0     A1BG  controls-expression-of      A2M
1     A1BG          interacts-with    ABCC6
2     A1BG          interacts-with     ACE2
3     A1BG          interacts-with   ADAM10
4     A1BG          interacts-with   ADAM17


Print the unique types of interactions in the data frame, using the `unique` method:

In [9]:
df.interaction_type.unique()

array(['controls-expression-of', 'interacts-with',
       'controls-phosphorylation-of', 'controls-state-change-of',
       'in-complex-with', 'controls-production-of', 'catalysis-precedes',
       'controls-transport-of', 'controls-transport-of-chemical',
       'chemical-affects', 'consumption-controlled-by', 'reacts-with',
       'used-to-produce'], dtype=object)

Build a boolean mask (using the `isin` method) that selects only the protein-protein interaction rows, then count the selected rows

In [10]:
ppirows = df.interaction_type.isin(interaction_types_ppi)
ppirows.sum()

508480

Make a list of all proteins that occur in a protein-protein interaction, and count the unique protein names by putting them in a `set` and calling `len` on the set

In [11]:
newlist = df["species1"][ppirows].tolist() + df["species2"][ppirows].tolist()
len(set(newlist))

17531

Count unique protein-protein interaction pairs (specific type of interaction irrelevant), again using `set` and `len`

In [12]:
len(set(df["species1"][ppirows] + "-" + df["species2"][ppirows]))

475553
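Note that the pandas expression above joins the two names in column order, whereas the plain-Python loop ordered each pair with `min` and `max` first. The two counts agree on this file (475553 in both cases), but the column-order key would count `A-B` and `B-A` separately if both occurred. The sketch below shows the difference on a made-up toy data frame (not PC2 data) and one possible order-independent key.

```python
import pandas

# Made-up toy edges (NOT from PC2); rows 0 and 1 are the same pair reversed.
toy = pandas.DataFrame({
    "species1": ["A", "B", "A", "C"],
    "interaction_type": ["interacts-with", "interacts-with",
                         "in-complex-with", "neighbor-of"],
    "species2": ["B", "A", "B", "D"],
})

# Column-order key: "A-B" and "B-A" are treated as different pairs.
naive_pairs = set(toy["species1"] + "-" + toy["species2"])

# Order-independent key: sort the two names within each pair.
unordered_pairs = {
    tuple(sorted(pair)) for pair in zip(toy["species1"], toy["species2"])
}

print(len(naive_pairs), len(unordered_pairs))
```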

Count each type of interaction in the database, by selecting the `interaction_type` column and using `value_counts`

In [13]:
df["interaction_type"].value_counts()

chemical-affects                  492765
interacts-with                    325616
in-complex-with                   182864
controls-state-change-of          182450
catalysis-precedes                149013
controls-expression-of            123232
consumption-controlled-by          22830
controls-production-of             21494
controls-phosphorylation-of        17029
used-to-produce                    14486
controls-transport-of               7574
reacts-with                         3927
controls-transport-of-chemical      3322
Name: interaction_type, dtype: int64