Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Interolog queries in sql
Biology researchers are interested in patterns of protein interactions. These protein interaction networks are fundamental to many cellular processes, including DNA replication, transcription, and signaling. Unfortunately, our knowledge of protein/protein interactions is incomplete. Studying interactions is difficult and time-consuming. There are dozens of techniques, with different tradeoffs in terms of cost, applicability, and accuracy
One way to enhance our knowledge of protein interaction networks is to generalize knowledge across species. Given information about protein interactions in species A, can we use this knowledge to make inferences about species B? The answer turns out to be yes. We leverage the concept of a biological ortholog, which are proteins in different species that descended from a common ancestor. A useful property of orthologs is that they behave in similar ways in both species -- this is called functional conservation, which allows annotation transfer. Suppose we have a pair of ortholog pairs from two species A and B. Suppose the proteins from species A demonstrate an interaction. Because of functional conservation between orthologs, we expect the proteins in species B to demonstrate a similar interaction; thus we can transfer the demonstrated interaction from the proteins in species A to the proteins in species B.
This basic intuition is captured by the concept of interologs. An interolog is a pair of orthology pairs that preserve a protein/protein interaction. An interolog can be visualized as a square. The nodes represent proteins; the horizontal edges represent protein/protein interactions; and the vertical edges represent orthology relationships. The figure below depicts an interolog relationship between proteins in yeast and worm.
Analyzing Interologs with SQL
A running theme in our work is exploring the use of relational databases for analyzing scientific data. Our goal here is to perform basic analysis of interologs using SQL. We leverage the SQLShare service developed at the University of Washington. SQLShare is a cloud-backed SQL database with a web front-end. It lowers the barrier to entry by automatically creating and populating database tables based on raw tabular data.
Our first step is to find and upload the appropriate data. We leverage the uniprot database to obtain information about proteins. For each protein, we care about its unique ID, its species, and its orthology group. The information in uniprot is nested, so we wrote a simple Python script (18 lines) to flatten the data into a format suitable for bulk loading into SQLServer. The analysis here uses the SwissProt entries within uniprot; we use the embedded OMA orthology group information.
We also leverage the database of interacting proteins (or DIP). Each entry in DIP represent an interaction between a pair of proteins. We model these as undirected edges in a protein graph.
After ingesting this data, we have two tables. These are viewable as public datasets on the SQLShare site:
- Protein: information about proteins, species, and orthology groups.
- AllInteraction: pairs of proteins that share an interaction.
Our first step is to perform some basic sanity checks of the data. Recall that an ortholog is defined as pairs of proteins in different species that are descendent from a common ancestor. It follows that their should be at most a single member of an orthology group per species. We can verify this property using a simple aggregation query in SQL. In this case, we focus on mice proteins (whose scientific name is Mus musculus).
SELECT oma_group, c=count(*) FROM [email@example.com].[proteins.csv] WHERE species LIKE 'Mus musculus%' GROUP BY oma_group;
SQL makes it easy to express other forms of analyses. For example, this query tells us that the average orthology group size is 7.
SELECT AVG(GroupCount.count) FROM ( SELECT oma_group, count=count(*) FROM [firstname.lastname@example.org].[proteins.csv] GROUP BY oma_group) GroupCount;
A more advanced query involves searching for missing or misclassified edges in the interolog graph. One query of interest is to identify missing protein interactions. This is, we want to find cases where a pair of proteins interact in species A, but the orthologous proteins in species B do not interact. This SQL query returns all such interactions between mice and humans:
SELECT m1.uniprot_accession, m2.uniprot_accession FROM [email@example.com].[HumanInteractions] hi, [firstname.lastname@example.org].[proteins.csv] m1, [email@example.com].[proteins.csv] m2 WHERE hi.group1=m1.oma_group AND hi.group2=m2.oma_group AND m1.species LIKE 'Mus musculus%' AND m2.species=m1.species AND NOT EXISTS ( SELECT * FROM MouseInteractions mi WHERE mi.id1=m1.uniprot_accession AND mi.id2=m2.uniprot_accession);
A related query is to find missing orthologs. For each pair of interacting proteins in species A, we want to find instances where we only have a single ortholog in species B. This implies the existence of a missing orthologous protein in species B. See the query and results.
SELECT m1.uniprot_accession, hi.group2 FROM [firstname.lastname@example.org].[HumanInteractions] hi, [email@example.com].[proteins.csv] m1 WHERE hi.group1=m1.oma_group AND m1.species LIKE 'Mus musculus%' AND NOT EXISTS ( SELECT * FROM [firstname.lastname@example.org].[MouseInteractions] mi WHERE mi.id1=m1.uniprot_accession AND mi.group2=hi.group2);
Many thanks go to our collaborators at Berkeley: Kimmen Sjölander and Cyrus Afrasiabi.