# Preprocessing of IRefIndex protein-protein interaction network
This notebook preprocesses the raw [IRefIndex](http://irefindex.org/wiki/index.php?title=iRefIndex) protein-protein interaction (PPI) network downloaded from [here](http://irefindex.org/download/irefindex/data/archive/release_15.0/psi_mitab/MITAB2.6/9606.mitab.22012018.txt.zip). I used release `v15.0` for the analysis.

Preprocessing involved:

* Selecting only binary interactions (and no self-interactions)
* Selecting only human interactions
* Mapping the interactor IDs (e.g. UniProt accessions) to HUGO gene symbols
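
The filtering steps listed above can be sketched on a toy table. This is only an illustration: the four rows are made up, the `'Y'` and `'C'` edge types stand in for any value other than `'X'`, and the real MITAB file has many more columns.

```python
import pandas as pd

# Toy stand-in for the MITAB table (made-up rows, only the columns used here).
toy = pd.DataFrame({
    '#uidA': ['uniprotkb:ID_A', 'uniprotkb:ID_B', 'uniprotkb:ID_C', 'uniprotkb:ID_D'],
    'uidB': ['uniprotkb:ID_B', 'uniprotkb:ID_B', 'uniprotkb:ID_E', 'uniprotkb:ID_F'],
    'edgetype': ['X', 'Y', 'C', 'X'],
    'taxa': ['taxid:9606(Homo sapiens)'] * 3 + ['taxid:10090(Mus musculus)'],
})

# Keep rows that are both binary ('X') and annotated as human.
is_binary = toy['edgetype'] == 'X'
is_human = toy['taxa'] == 'taxid:9606(Homo sapiens)'
filtered = toy[is_binary & is_human].copy()

# Strip the database prefix ('uniprotkb:') from both interactor columns.
for col in ['#uidA', 'uidB']:
    filtered[col] = filtered[col].str.split(':').str[1]

print(filtered[['#uidA', 'uidB']])
```

Only the first toy row survives both filters; the mapping to gene symbols is a separate step because it needs an external lookup.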

In [1]:
import pandas as pd
import mygene

In [2]:
IREF_interactions = pd.read_csv('../data/networks/IREF_9606.mitab.22012018.txt.zip',
                                sep='\t', compression='zip')

In [3]:
# edgetype 'X' marks binary, non-self interactions
IREF_binary_interactions = IREF_interactions[IREF_interactions.edgetype == 'X']
# keep only human interactions; copy so that later column assignments do not act on a slice
IREF_humanbinary = IREF_binary_interactions[IREF_binary_interactions.taxa == 'taxid:9606(Homo sapiens)'].copy()
IREF_humanbinary['#uidA'].head()

0    uniprotkb:A0A024RB96
1        uniprotkb:P40616
2        uniprotkb:Q9NVH1
3        uniprotkb:P29459
4        uniprotkb:Q5HYI7
Name: #uidA, dtype: object

In [4]:
def get_gene_symbols_for_IREF(list_of_ids):
    # query mygene.info for the gene symbol of each interactor ID
    mg = mygene.MyGeneInfo()
    res = mg.querymany(list_of_ids,
                       scopes='refseq,symbol,entrezgene,reporter,uniprot',
                       fields='symbol',
                       species='human', returnall=True
                      )

    def get_query_and_symbol(d):
        # hits without a symbol (e.g. IDs that were not found) get None and are dropped below
        return [d['query'], d.get('symbol')]

    node_names = [get_query_and_symbol(d) for d in res['out']]
    # collect the (queried ID, symbol) pairs in a DataFrame indexed by the queried ID
    node_names = pd.DataFrame(node_names, columns=['query_id', 'Symbol']).set_index('query_id')
    node_names.dropna(axis=0, inplace=True)
    return node_names

# Series.append was removed in pandas 2.0; pd.concat gives the same result
all_ids = pd.concat([IREF_humanbinary['#uidA'], IREF_humanbinary.uidB]).unique()
all_ids = [i.split(':')[1] for i in all_ids] # strip the leading database prefix (e.g. 'uniprotkb:')
iref_id_mapping = get_gene_symbols_for_IREF(all_ids)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-27000...done.
querying 27001-28000...done.
querying 28001-29000...done.
querying 29001-30000...done.
querying 30001-31000...done.
querying 31001-32000...done.
querying 32001-33000...done.
querying 33001-34000...done.
querying 34001-35000...done.
[output truncated]
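
A query can return more than one hit, and a repeated ID in the mapping index would multiply rows in a later join. Below is a minimal sketch of building the mapping table with one symbol per queried ID. It runs on a hand-written list of dicts shaped like the entries of `res['out']` (keys `query`, `symbol`, `notfound`); the IDs, symbols, the helper name `hits_to_mapping` and the keep-the-first-hit policy are all assumptions for illustration, not what this notebook does.

```python
import pandas as pd

def hits_to_mapping(hits):
    """Build a query ID -> symbol table from mygene-style hit dicts (hypothetical helper)."""
    rows = [(h['query'], h.get('symbol')) for h in hits]
    mapping = pd.DataFrame(rows, columns=['query_id', 'Symbol']).dropna()
    # Assumption: keep only the first hit per queried ID so the index stays unique.
    mapping = mapping.drop_duplicates(subset='query_id', keep='first')
    return mapping.set_index('query_id')

# Made-up hits: one clean match, one ID with two hits, one ID that was not found.
mock_hits = [
    {'query': 'ID_A', 'symbol': 'GENE_A'},
    {'query': 'ID_B', 'symbol': 'GENE_B'},
    {'query': 'ID_B', 'symbol': 'GENE_B_ALT'},
    {'query': 'ID_X', 'notfound': True},
]

mapping = hits_to_mapping(mock_hits)
print(mapping)
```

Whether the first hit is the right one to keep depends on the data; inspecting `res['dup']` (returned when `returnall=True`) is a reasonable first check.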

In [5]:
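IREF_humanbinary = IREF_humanbinary.copy() # explicit copy avoids the SettingWithCopyWarning
# strip the database prefix (e.g. 'uniprotkb:') from both interactor columns
IREF_humanbinary['#uidA'] = IREF_humanbinary['#uidA'].str.split(':').str[1]
IREF_humanbinary['uidB'] = IREF_humanbinary['uidB'].str.split(':').str[1]
# inner joins keep only interactions where both partners could be mapped to a symbol
p1_incl = IREF_humanbinary.join(iref_id_mapping, on='#uidA', how='inner', rsuffix='_p1')
both_incl = p1_incl.join(iref_id_mapping, on='uidB', how='inner', rsuffix='_p2')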
both_incl.head()



Unnamed: 0,#uidA,uidB,altA,altB,aliasA,aliasB,method,author,pmids,taxa,...,crogidb,crigid,icrogida,icrogidb,icrigid,imex_id,edgetype,numParticipants,Symbol,Symbol_p2
0,A0A024RB96,A0A024RA90,entrezgene/locuslink:196403|refseq:NP_00127317...,entrezgene/locuslink:51619|refseq:NP_057067|un...,hgnc:DTX3|uniprotkb:A0A024RB96_HUMAN|uniprotkb...,hgnc:UBE2D4|uniprotkb:A0A024RA90_HUMAN|uniprot...,"psi-mi:""MI:0018""(two hybrid)",Markson G (2009),pubmed:19549727,taxid:9606(Homo sapiens),...,xc+Dyvh3gHK14B0RxgVqoToxbTI9606,+4O62gRrUCSHLPbxlq5d/Xm4zU0,3006542,4831700,617977,-,X,2,DTX3,UBE2D4
606,A0A024RB96,A0A024RA90,entrezgene/locuslink:196403|refseq:NP_00127317...,entrezgene/locuslink:51619|refseq:NP_057067|un...,hgnc:DTX3|uniprotkb:A0A024RB96_HUMAN|uniprotkb...,hgnc:UBE2D4|uniprotkb:A0A024RA90_HUMAN|uniprot...,"psi-mi:""MI:0397""(two hybrid array)",markson-2009-2,pubmed:19549727,taxid:9606(Homo sapiens),...,xc+Dyvh3gHK14B0RxgVqoToxbTI9606,+4O62gRrUCSHLPbxlq5d/Xm4zU0,3006542,4831700,617977,-,X,2,DTX3,UBE2D4
642337,O75360,A0A024RA90,entrezgene/locuslink:5626|uniprotkb:O75360|rog...,entrezgene/locuslink:51619|refseq:NP_057067|un...,hgnc:PROP1|uniprotkb:PROP1_HUMAN|crogid:SzGxUv...,hgnc:UBE2D4|uniprotkb:A0A024RA90_HUMAN|uniprot...,"psi-mi:""MI:1112""(two hybrid prey pooling appro...",huri-2017-1,-,taxid:9606(Homo sapiens),...,xc+Dyvh3gHK14B0RxgVqoToxbTI9606,57YM6UrS02SV7ivsxQ3fuKCzBIE,14007438,4831700,1774651,-,X,2,PROP1,UBE2D4
642343,O75360,A0A024RA90,entrezgene/locuslink:5626|uniprotkb:O75360|rog...,entrezgene/locuslink:51619|refseq:NP_057067|un...,hgnc:PROP1|uniprotkb:PROP1_HUMAN|crogid:SzGxUv...,hgnc:UBE2D4|uniprotkb:A0A024RA90_HUMAN|uniprot...,"psi-mi:""MI:1112""(two hybrid prey pooling appro...",huri-2017-1,-,taxid:9606(Homo sapiens),...,xc+Dyvh3gHK14B0RxgVqoToxbTI9606,57YM6UrS02SV7ivsxQ3fuKCzBIE,14007438,4831700,1774651,-,X,2,PROP1,UBE2D4
642411,O75360,A0A024RA90,entrezgene/locuslink:5626|uniprotkb:O75360|rog...,entrezgene/locuslink:51619|refseq:NP_057067|un...,hgnc:PROP1|uniprotkb:PROP1_HUMAN|crogid:SzGxUv...,hgnc:UBE2D4|uniprotkb:A0A024RA90_HUMAN|uniprot...,"psi-mi:""MI:0397""(two hybrid array)",huri-2017-2,-,taxid:9606(Homo sapiens),...,xc+Dyvh3gHK14B0RxgVqoToxbTI9606,57YM6UrS02SV7ivsxQ3fuKCzBIE,14007438,4831700,1774651,-,X,2,PROP1,UBE2D4
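
To make the effect of the two inner joins easier to see, here is a small sketch on made-up data (the IDs and symbols are placeholders). An interaction is kept only if both partners appear in the mapping index, and the second join needs `rsuffix` because a `Symbol` column already exists by then.

```python
import pandas as pd

# Made-up interactions; 'ID_X' has no entry in the mapping below.
interactions = pd.DataFrame({
    '#uidA': ['ID_A', 'ID_A', 'ID_C'],
    'uidB': ['ID_B', 'ID_X', 'ID_B'],
})
mapping = pd.DataFrame({'Symbol': ['GENE_A', 'GENE_B', 'GENE_C']},
                       index=['ID_A', 'ID_B', 'ID_C'])

# First join adds the symbol of partner A, second join the symbol of partner B.
p1 = interactions.join(mapping, on='#uidA', how='inner')
both = p1.join(mapping, on='uidB', how='inner', rsuffix='_p2')

print(both[['Symbol', 'Symbol_p2']])
```

The `ID_A`/`ID_X` row is dropped by the second join, so the number of rows lost here is a quick measure of how many interactions involve an unmapped partner.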


In [6]:
IREF_ppi_final = both_incl[['Symbol', 'Symbol_p2']].copy() # copy so the new column is not set on a slice
IREF_ppi_final['confidence'] = 1 # uniform edge weight for all interactions
IREF_ppi_final.columns = ['partner1', 'partner2', 'confidence']
IREF_ppi_final.head()



| | partner1 | partner2 | confidence |
|---|---|---|---|
| 0 | DTX3 | UBE2D4 | 1 |
| 606 | DTX3 | UBE2D4 | 1 |
| 642337 | PROP1 | UBE2D4 | 1 |
| 642343 | PROP1 | UBE2D4 | 1 |
| 642411 | PROP1 | UBE2D4 | 1 |


In [7]:
num_duplicated_edges = IREF_ppi_final.duplicated(subset=['partner1', 'partner2']).sum()
# reassign instead of dropping in place, which warns when the frame is a slice of another one
IREF_ppi_final = IREF_ppi_final.drop_duplicates(subset=['partner1', 'partner2'])
print("Duplicated Edges: {} -> New #Edges: {}".format(num_duplicated_edges,
                                                       IREF_ppi_final.shape[0]))

Duplicated Edges: 457567 -> New #Edges: 383640


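
The deduplication above treats `(A, B)` and `(B, A)` as different edges, and different IDs that map to the same symbol can leave self-loops behind. If the network is meant to be undirected, one option is to sort each pair before dropping duplicates. The sketch below shows this on a toy edge list; it is a possible extension, not a step this notebook performs, and the self-loop row is made up.

```python
import pandas as pd

# Toy edge list: a reversed duplicate, an exact duplicate and a made-up self-loop.
edges = pd.DataFrame({
    'partner1': ['DTX3', 'UBE2D4', 'PROP1', 'PROP1', 'DTX3'],
    'partner2': ['UBE2D4', 'DTX3', 'UBE2D4', 'UBE2D4', 'DTX3'],
})

# Drop self-loops, i.e. edges whose two partners carry the same symbol.
no_loops = edges[edges['partner1'] != edges['partner2']]

# Order each pair alphabetically so (A, B) and (B, A) become identical rows.
pairs = [tuple(sorted(p)) for p in zip(no_loops['partner1'], no_loops['partner2'])]
undirected = (pd.DataFrame(pairs, columns=['partner1', 'partner2'])
              .drop_duplicates()
              .reset_index(drop=True))

print(undirected)
```

Two edges remain in the toy example: DTX3/UBE2D4 and PROP1/UBE2D4.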


In [8]:
# NOTE: the output is gzip-compressed even though the file name ends in .tsv
IREF_ppi_final.to_csv('../data/networks/IREF_symbols_20190730.tsv',
                      sep='\t', compression='gzip')
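
Because the file is written with `compression='gzip'`, it has to be read back the same way. The sketch below does a round trip in a temporary directory. The file name `edges_example.tsv.gz` and the two rows are placeholders; with a `.gz` suffix pandas could also infer the compression, but it is passed explicitly here to mirror the cell above.

```python
import os
import tempfile

import pandas as pd

edges = pd.DataFrame({
    'partner1': ['DTX3', 'PROP1'],
    'partner2': ['UBE2D4', 'UBE2D4'],
    'confidence': [1, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'edges_example.tsv.gz')  # placeholder file name
    edges.to_csv(path, sep='\t', compression='gzip')
    # index_col=0 restores the row index that to_csv wrote as the first column
    restored = pd.read_csv(path, sep='\t', compression='gzip', index_col=0)

print(restored)
```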