PPI Networks with missing edges

This repository contains the code for reproducing the article "Sequence-based protein-protein interaction predictions did not outperform protein associations from STRING at functional annotation".

Summary

Download the data
Install dependencies
Execute script pipeline.sh

1. Data

All data used in these experiments for Saccharomyces cerevisiae, Escherichia coli, Arabidopsis thaliana and Solanum lycopersicum are in the public domain and can be downloaded from the corresponding websites, as detailed below. Each file needs to be saved in the appropriate directory. In the paths below, $speciesName is one of "yeast", "ecoli", "arabidopsis" or "tomato".

PPI data from BIOGRID

We used version 3.5.181 https://downloads.thebiogrid.org/BioGRID/Release-Archive/BIOGRID-3.5.181/ .
Save at data/$speciesName/interactions/ppi-biogrid.txt.

PPI data from STRING

We used version 11.0 https://string-db.org/cgi/download.pl .
Save at data/$speciesName/interactions/ppi-string.txt.

Protein sequences

Protein sequences were downloaded from SwissProt version 2019_11:
ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2019_11/knowledgebase/ .
Save at data/$speciesName/sequences/sequences.fasta.

For tomato, sequences from the Sol Genomics Network (https://solgenomics.net/) were also used; they are available at data/tomato/sequences/sequences_tomatoIDs.fasta.

GO annotations

We used the Gene Ontology release of January 1st, 2020, which is provided at data/go/go.obo.

GO annotations of proteins were downloaded from the Gene Ontology website http://current.geneontology.org/products/pages/downloads.html , accessed on 20/02/2020 for yeast and arabidopsis, 13/03/2020 for tomato and 29/03/2020 for E. coli.
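
Before starting a long run, it can help to confirm that the downloaded files sit where the paths above say they should. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` is hypothetical, and it assumes that every species needs all three per-species files, which may not hold (for example, if a species has no BioGRID data).

```python
from pathlib import Path

SPECIES = ["yeast", "ecoli", "arabidopsis", "tomato"]

# Per-species paths as listed in this README, relative to data/$speciesName/.
# Assumption: every species needs all three files.
REQUIRED_FILES = [
    "interactions/ppi-biogrid.txt",
    "interactions/ppi-string.txt",
    "sequences/sequences.fasta",
]


def missing_inputs(data_root="data", species=SPECIES):
    """Return the expected input files that are not present under data_root."""
    root = Path(data_root)
    missing = []
    for name in species:
        for rel in REQUIRED_FILES:
            path = root / name / rel
            if not path.is_file():
                missing.append(path)
    # The ontology file is shared by all species.
    go_file = root / "go" / "go.obo"
    if not go_file.is_file():
        missing.append(go_file)
    return missing


if __name__ == "__main__":
    for path in missing_inputs():
        print("missing:", path)
```

Run from the repository root, the script prints one line per absent file and nothing when all expected files are in place.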

2. Dependencies

For most analyses, we used Python 3.6 and the following packages:

numpy 1.16.2
scipy 1.2.1
scikit-learn 0.20.3
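
To compare an existing environment against the versions listed above, a check along these lines may be useful. It is only a sketch and not part of the repository: the helper names are hypothetical, and it relies on each package exposing a `__version__` attribute. Other versions may well work; the check only reports differences.

```python
import importlib

# Versions reported in this README. Keys are import names
# (scikit-learn is imported as "sklearn").
EXPECTED = {"numpy": "1.16.2", "scipy": "1.2.1", "sklearn": "0.20.3"}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def version_mismatches(expected=EXPECTED):
    """Map each module whose version differs to a (expected, found) pair."""
    mismatches = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in version_mismatches().items():
        print(f"{name}: expected {wanted}, found {found}")
```

A package that is not installed is reported with `None` as the found version.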

For running PIPR, clone https://github.com/muhaochen/seq_ppi . The Keras package is required; we used version 2.2.4.

For running node2vec, clone https://github.com/aditya-grover/node2vec . This requires Python 2.7 with networkx 2.2 and numpy 1.14.3.

3. Execution

After completing all previous steps, execute the following commands:

./pipeline.sh yeast
./pipeline.sh ecoli
./pipeline.sh arabidopsis
./pipeline.sh tomato
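
The four pipeline invocations can also be driven from Python, for instance to stop at the first failing species. This is a minimal sketch under the assumption that it is run from the repository root, where pipeline.sh lives; `pipeline_command` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess

SPECIES = ["yeast", "ecoli", "arabidopsis", "tomato"]


def pipeline_command(species, script="./pipeline.sh"):
    """Build the argument list for one pipeline run."""
    if species not in SPECIES:
        raise ValueError(f"unknown species: {species}")
    return [script, species]


def run_all(species=SPECIES):
    """Run the pipeline for each species in turn, stopping at the first failure."""
    for name in species:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(pipeline_command(name), check=True)
```

Calling `run_all()` runs the species sequentially; `run_all(["tomato"])` restricts the run to a single species.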

Note that several steps (running BLAST, training PIPR, tuning the hyperparameters of node2vec, or evaluating all possible combinations of STRING networks) can take several hours to days unless run massively in parallel. All the experiments described in the article were conducted on a compute cluster with multiple CPUs and GPUs. Duration varies considerably between species. Tomato is the smallest, so it can be run in a few hours. Repeating all experiments for arabidopsis without parallelization will take more than one week.

Run drawFinalFigures.py to obtain Figure 2 (the values are hard-coded, so no experiments need to be run first).
Use analyzePerformance.py to measure the effect of individual STRING data sources on the overall performance.
Use plotPerformance.py to plot the performance and coverage of STRING data sources and combinations thereof.
Use comparePerProtein.py to plot performance as a function of node degree (as in Figure 3). The possible combinations are listed inside the script.
