thiesgehrmann/ggMatch

ggMatch

Greedy Gene Matching tool.

ggMatch iteratively finds reciprocal best BLAST/DIAMOND hits across a large number of genomes. This iterative approach removes the need for the dreaded all-vs-all BLAST between all genomes. The figure below shows an example gene matching process for a single gene against 298 fungal genomes obtained from the JGI. The black node represents the original query sequence, and edges between nodes represent high-quality reciprocal best BLAST hits. Each node color represents a different iteration. In the first iteration, we discover the yellow genes; in later iterations, we discover further genes based on the newly discovered ones. A traditional reciprocal best hit search would have revealed only the yellow nodes in this network.

Example gene graph created by ggMatch
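The iterative idea can be illustrated with a small, self-contained Python sketch. This is not ggMatch's implementation (the method description is still pending): `best_hit` is a hypothetical lookup table standing in for BLAST/DIAMOND searches, and the gene and genome names are invented for illustration.

```python
from typing import Dict, List, Tuple


def iterative_rbh(
    seed: str,
    genome_of: Dict[str, str],
    best_hit: Dict[Tuple[str, str], str],
    genomes: List[str],
) -> Dict[str, int]:
    """Return {gene: iteration in which it was discovered}, seed = 0."""
    found = {seed: 0}
    frontier = [seed]
    iteration = 0
    while frontier:
        iteration += 1
        new_genes = []
        for gene in frontier:
            home = genome_of[gene]
            for genome in genomes:
                if genome == home:
                    continue
                # Forward search: best hit of `gene` in the other genome.
                hit = best_hit.get((gene, genome))
                if hit is None or hit in found:
                    continue
                # Reciprocal search: the hit must point back to `gene`.
                if best_hit.get((hit, home)) == gene:
                    found[hit] = iteration
                    new_genes.append(hit)
        # Only genes discovered in this round are searched from next.
        frontier = new_genes
    return found


# Hypothetical toy data: a1 and c1 are not reciprocal best hits of
# each other, but both are reciprocal best hits of b1.
genomes = ["A", "B", "C"]
genome_of = {"a1": "A", "b1": "B", "c1": "C"}
best_hit = {
    ("a1", "B"): "b1", ("b1", "A"): "a1",  # reciprocal
    ("a1", "C"): "c2", ("c2", "A"): "a9",  # not reciprocal
    ("b1", "C"): "c1", ("c1", "B"): "b1",  # reciprocal, found via b1
}

print(iterative_rbh("a1", genome_of, best_hit, genomes))
# {'a1': 0, 'b1': 1, 'c1': 2}
```

In this toy case a single reciprocal best hit search from a1 would stop at b1. The second iteration reaches c1 through b1, which mirrors the non-yellow nodes in the figure.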

Description of method

COMING SOON

Usage

Dependencies

  • Snakemake
  • Conda
  • Interproscan (optional)

Getting started

Simple example

You can download and run the example dataset with the following commands:

  git clone https://github.com/thiesgehrmann/ggMatch.git
  cd ggMatch
  ./ggMatch examples/basic/config.json

Yeast Gene Order Browser example

Also included is an example based on the Yeast Gene Order Browser:

  git clone https://github.com/thiesgehrmann/ggMatch.git
  cd ggMatch/examples/ygob
  ./prepare_ygob_example.sh
  cd ../..
  ./ggMatch examples/ygob/config.json

Output

The tool outputs three files:

  • outdir/run/compare/cmpTable.tsv : A matrix of similarity scores to the original query for each species (-1 if absent)

This score is the number of positive-scoring matches in the alignment divided by the length of the query.

  • outdir/run/graph/nodes.tsv : A file describing the genes in the discovery network (can be loaded with Gephi or Cytoscape)

Format:

Id query label iteration
node1 query3 "QuerySet:182893" 0
node2 query3 "93_Amamu1:182893" 1
node3 query3 "92_Aspsy1:40389" 1
node7 query3 "QUERY:query3" -1
node4 query2 "QuerySet:201797" 0
node5 query2 "93_Amamu1:201797" 1
node6 query2 "QUERY:query2" -1
node8 query1 "QUERY:query1" -1
node9 query1 "92_Aspsy1:61310" 0

  • outdir/run/graph/edges.tsv : A file describing the discovery order in the discovery network

Format:

source target validated
node1 node2 1
node1 node3 1
node7 node1 1
node4 node5 1
node6 node4 1
node8 node9 1
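As a small worked example of the similarity score reported in cmpTable.tsv (positive-scoring matches divided by query length), the following sketch computes it for made-up numbers. The function name is hypothetical, and how ggMatch obtains the positive count from the alignment is not specified here.

```python
ABSENT = -1  # value reported in cmpTable.tsv when a species has no match


def similarity_score(positives: int, query_length: int) -> float:
    """Positive-scoring alignment positions divided by the query length."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return positives / query_length


# Made-up numbers: 150 positive-scoring positions, 200-residue query.
print(similarity_score(150, 200))  # 0.75
```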

Loading the edges.tsv and nodes.tsv files into Cytoscape produces the following network:

Discovery graph created by ggMatch for the example dataset
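The two files can also be inspected without a graph viewer. The sketch below assumes they are tab-separated with the header rows shown in the format listings above; the helper names are hypothetical, and the sample rows are a subset of the example listings.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def discovery_children(nodes, edges):
    """Map each node label to the labels of the nodes discovered from it."""
    label = {row["Id"]: row["label"] for row in nodes}
    children = defaultdict(list)
    for row in edges:
        children[label[row["source"]]].append(label[row["target"]])
    return dict(children)


# In-memory stand-ins for nodes.tsv and edges.tsv (assumed tab-separated).
nodes_text = (
    "Id\tquery\tlabel\titeration\n"
    'node1\tquery3\t"QuerySet:182893"\t0\n'
    'node2\tquery3\t"93_Amamu1:182893"\t1\n'
    'node7\tquery3\t"QUERY:query3"\t-1\n'
)
edges_text = (
    "source\ttarget\tvalidated\n"
    "node1\tnode2\t1\n"
    "node7\tnode1\t1\n"
)

nodes = read_tsv(io.StringIO(nodes_text))
edges = read_tsv(io.StringIO(edges_text))
for parent, kids in discovery_children(nodes, edges).items():
    print(parent, "->", ", ".join(kids))
```

For real output, replace the `io.StringIO` objects with `open(path, newline="")` on the two files.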

Running your own problem

Please look at the example JSON file and the default parameters in pipeline_components/defaults.json. The general format is:

{
  "genomes" : {  },
  "queries" : {  },
  "outdir" : "./output"
}

You can modify a number of parameters; their defaults are listed in pipeline_components/defaults.json.

You can validate your configuration file with ggMatch -v config.json
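Before running the validator, a quick local check can catch JSON syntax errors and missing top-level keys. The sketch below is an illustration only and does not reproduce what `ggMatch -v` checks. Treating "genomes", "queries" and "outdir" as required is an assumption based on the general format above, and `missing_keys` is a hypothetical helper.

```python
import json

# Assumed to be required, based on the general format shown above.
EXPECTED_KEYS = ("genomes", "queries", "outdir")


def missing_keys(config_text: str) -> list:
    """Parse the config and list expected top-level keys that are absent.

    json.loads raises json.JSONDecodeError on invalid JSON, such as a
    trailing comma after the last entry.
    """
    config = json.loads(config_text)
    return [key for key in EXPECTED_KEYS if key not in config]


example = '{"genomes": {}, "queries": {}, "outdir": "./output"}'
print(missing_keys(example))  # []
```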

Adding genomes

A genome is a set of proteins. For each genome, provide a multi-sequence FASTA file of protein sequences, and add the location of that file to the JSON file, indexed by a genome ID, under the "genomes" heading. For example:

"genomes": {
  "genome1" : { "prots" : "proteins_genome1.fasta" },
  "genome2" : { "prots" : "proteins_genome2.fasta" }
}
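With many genomes, the "genomes" block can be generated rather than typed by hand. The sketch below assumes one FASTA file per genome and uses the file name without its extension as the genome ID; that naming rule and the file paths are hypothetical choices, not a ggMatch requirement.

```python
import json
from pathlib import Path


def genomes_block(fasta_paths):
    """Build the "genomes" mapping: genome ID -> {"prots": path}.

    The genome ID is taken from the file name stem (an assumption),
    so two files with the same stem would overwrite each other.
    """
    return {Path(p).stem: {"prots": str(p)} for p in fasta_paths}


# Hypothetical file locations.
paths = ["data/genome1.fasta", "data/genome2.fasta"]
print(json.dumps({"genomes": genomes_block(paths)}, indent=2))
```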

Defining a query

A query takes the form of a multi-sequence FASTA file. A simple query is a single seed sequence, but if you already know of other orthologs, you can provide multiple seed sequences in the same query file.

If the genomes from which these queries originate are also in your genome set, you can prefix the sequence description in the FASTA file with "genomeID:". This links the query sequence to that genome. If the query is then identified as a match against any other genome, the reciprocal BLAST is performed against the linked genome rather than against the set of query sequences.

For example:

>genome1:gene1
MPDDVWSGSSTCSLSSDGMSVRKDMKPEFHRAWPRCTAKAMDLEINEKMPHNETTEVAGVTKIKAVEAVG
GKTGKYIMYAGLAMVMVIYELDNSTVGTYRNFASSDFHQLGKLATLNTAASIITAIFKPPIAKLSDVLGR
GEAYVVTLTFYILSYILC
>prot2
MVAHNFSPRDAQFLTYTNGVSQALMGMGTGLLMYRYRTYKWIGVAGAVIRLVGYGVMVRLRTNESSIAEL
FIVQLVQGIGSGIIETIIIVAAQISVPHAELAQVTSLVMLGTFLGNGIGSAVAGAIYTNQLRDRLEIHLG
PGAAEGQLATLYNSITDRLPEWGTAERTAVNQALGDGHNLVQVTPDSSRSDSLDIEKPKARCF
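Adding the "genomeID:" prefix by hand is error-prone for larger query files. The following sketch rewrites the header lines of a FASTA text in memory; `link_queries_to_genome` is a hypothetical helper and the sequences are shortened for illustration.

```python
def link_queries_to_genome(fasta_text: str, genome_id: str) -> str:
    """Prefix every FASTA header with "<genome_id>:" unless already present."""
    prefix = genome_id + ":"
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and not line[1:].startswith(prefix):
            line = ">" + prefix + line[1:]
        out.append(line)
    return "\n".join(out) + "\n"


# Shortened, made-up sequences.
query = ">gene1\nMPDDVWSG\n>genome1:gene2\nMVAHNFSP\n"
print(link_queries_to_genome(query, "genome1"), end="")
# >genome1:gene1
# MPDDVWSG
# >genome1:gene2
# MVAHNFSP
```

Only sequences that really originate from the named genome should be prefixed; unlinked seeds, such as prot2 above, keep their plain header.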

These queries need to be added to the configuration JSON file:

  "queries" : {
    "query1" : "query_prots.fasta"
  }
