
Annotation, analysis and illustration of gene clusters


stogqy/prettyClusters

 
 


prettyClusters

A set of tools to analyze and make non-hideous, publication-friendly diagrams of genomic neighborhoods in microbial genomes.

Important note

This is very much a work in progress, and I'm a biochemist doing terrible things to code. There will be bugs. I'll do what I can to address them; if you've come up with a fix, I'm happy to try to incorporate it!

Why?

I spend a lot of time working with bacterial gene clusters. I wanted some sort of tool that could:

  • Export diagrams of gene clusters as vector files for figures.
  • Export diagrams with multiple gene clusters in the same relative scale.
  • Import gene metadata from the JGI's IMG database, which has many genomes, metagenomes, and so on that are absent from NCBI's NR database (and UniProt), or that are present but poorly annotated in databases like NCBI/WGS.
  • Handle hypothetical and predicted proteins helpfully (i.e. by identifying new groups of hypothetical proteins that frequently appear in the genomic neighborhoods of interest).
  • Integrate into workflows using sequence similarity networks generated via the EFI-EST toolset.
  • Interrogate the similarity of genomic neighborhoods without relying on the sequence similarity of the genes of interest. I did not want to assume that sequence similarity and gene cluster similarity necessarily track each other, since that's not always a sound assumption.

I haven't encountered anything that quite handles all of those things in one go, so...

The prettyClusters toolset

The Wiki entries contain a more detailed description of the use of specific functions.

The core toolset

Accessory components

Using prettyClusters

  • A basic installation guide is the best starting point.
  • I've got a simple walkthrough for a standard run using the prettyClusters toolset.
  • This is very much a work in progress, and doing this is using prettyClusters in difficult mode, but I've got a rough workflow for this as well.

prettyClusters output

Illustrating the output of some of the components (or, in the case of the cluster diagrams themselves, just under 10% of the output):

[Figure: genome neighborhood diagram output]

Notably, sequence similarity and genome neighborhood similarity are not always tightly coupled. The analyses in prettyClusters make it possible to investigate a protein family along both axes.

Development

Planned additions

  • A vignette!
  • Import from UniProt and GFF/GFF3-formatted files. As with gbToIMG, this will likely be a separate function that can replace generateNeighbors, and it will likely suffer from the same data heterogeneity (and lack of gene family annotations) as that tool. Supplementation with incorpIprScan will likely still be advisable.
  • Automatic generation of genome neighborhood diagrams for specific clusters (or for random representatives of a cluster) in prettyClusterDiagrams, since the diagrams get unwieldy with very large datasets...
  • Single scale bar in prettyClusterDiagrams. Probably will try generating a final fake "gene cluster" with single-nt "genes" every kb or something?
  • Generation of HMMs for hypothetical protein families identified in analyzeNeighbors.
  • User-supplied HMMs for annotation of predefined custom protein (sub)families as a standalone subfunction.
  • Options to let the user specify distance and clustering methods in prepNeighbors and analyzeNeighbors.
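
The single scale bar idea in the list above could be prototyped roughly as follows. This is only a sketch of one possible approach, not the prettyClusters implementation: the helper `make_scale_cluster()`, the toy `clusters` table, and the column names (`molecule`, `gene`, `start`, `end`) are hypothetical, chosen to resemble the kind of long-format table that gggenes plots.

```r
# Hypothetical helper: build a fake "gene cluster" whose 1-nt "genes"
# mark every `tick_bp` nucleotides, so it can be appended to the real
# clusters and drawn as a shared scale bar.
make_scale_cluster <- function(span_bp, tick_bp = 1000, label = "scale") {
  ticks <- seq(0, span_bp, by = tick_bp)
  data.frame(
    molecule = label,                       # row label for the fake cluster
    gene     = paste0(ticks / 1000, " kb"), # tick labels in kb
    start    = ticks,
    end      = ticks + 1,                   # single-nucleotide "gene"
    stringsAsFactors = FALSE
  )
}

# Made-up example data: two clusters on a shared coordinate scale.
clusters <- data.frame(
  molecule = c("clusterA", "clusterA", "clusterB"),
  gene     = c("geneX", "geneY", "geneX"),
  start    = c(0, 1500, 200),
  end      = c(1200, 4300, 2600),
  stringsAsFactors = FALSE
)

# Ticks span the widest cluster; the fake cluster is appended as extra rows.
scale_bar  <- make_scale_cluster(span_bp = max(clusters$end))
with_scale <- rbind(clusters, scale_bar)
```

Because the scale bar is just more rows in the same table, it would be drawn by whatever code draws the real clusters and would automatically share their x-axis; the cost is that the plotting code has to style that one row differently.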

Known issues

  • Auto-annotation in prettyClusterDiagrams may miss genes if their initial family categorization (or ORF-finding) was poor! This is doubly the case for GenBank-derived files (use of incorpIprScan can improve annotation, but not ORF-finding).
  • Generation of hypothetical protein and genome neighborhood clusters is approximate and sensitive to user-supplied cutoffs, to distance/clustering methods, to overrepresentation of closely related gene clusters, and to the conservation of genes outside of the gene cluster limits. There are limited ways around these problems, and they come with their own compromises. (Overrepresentation at least can be dealt with using representative sequences chosen via EFI-EST (repnodes) or CD-HIT.)
  • Forward- and reverse-facing genes are on the same vertical level in prettyClusterDiagrams. I personally find it visually clearer to have forward genes above the line and reverse genes below, but I will probably need to do a lot more digging into gggenes and ggplot2 to figure out if/how I can make it happen.
  • Distance between genes of interest and their neighbors is not taken into account. I have done a few initial analyses using weighted distances rather than binary present/absent values; it is not clear they've added much more info than the binary value-based analyses, and they're more complicated to run. Could revisit?
  • Use of non-WSL mafft and blast installs on Windows isn't built-in at the moment; it's a low priority, given how easy WSL is to get up and running.
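
To make the binary versus weighted comparison mentioned above more concrete, here is a rough base-R sketch. It is not the code used by prepNeighbors or analyzeNeighbors: the toy data, the family names, and the 1/(1 + offset) weighting are all assumptions made for illustration. `dist()`, `hclust()` and `cutree()` are standard functions from R's stats package.

```r
# Toy neighborhoods: rows = genes of interest, columns = protein families.
# Values are each family's offset (in genes) from the gene of interest;
# NA means the family is absent from that neighborhood. (Made-up data.)
offsets <- rbind(
  n1 = c(famA = 1,  famB = 2,  famC = NA, famD = NA),
  n2 = c(famA = 1,  famB = 5,  famC = NA, famD = NA),
  n3 = c(famA = NA, famB = NA, famC = 1,  famD = 2),
  n4 = c(famA = NA, famB = 4,  famC = 2,  famD = 1)
)

# Binary present/absent matrix; dist(method = "binary") is Jaccard distance.
present  <- ifelse(is.na(offsets), 0, 1)
d_binary <- dist(present, method = "binary")

# Assumed weighting: closer neighbors count more, absent families count 0.
weighted   <- ifelse(is.na(offsets), 0, 1 / (1 + offsets))
d_weighted <- dist(weighted, method = "euclidean")

# The linkage method and the number of groups are the user-tunable parts.
groups_binary   <- cutree(hclust(d_binary,   method = "average"), k = 2)
groups_weighted <- cutree(hclust(d_weighted, method = "average"), k = 2)
```

In this toy case both versions split the neighborhoods into the same two groups (n1 with n2, n3 with n4). That says nothing about real datasets, but it shows how the binary and weighted analyses differ only in the matrix handed to `dist()`, and where a cutoff or method choice enters.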


Languages

  • R 100.0%