prettyClusters

A set of tools analyze and make non-hideous publication-friendly diagrams of genomic neighborhoods in microbial genomes.

Important note

This is very much a work in progress, and I'm a biochemist doing terrible things to code. There will be bugs. I'll do what I can to address them; if you've come up with a fix, I'm happy to try to incorporate it!

Why?

I spend a lot of time working with bacterial gene clusters. I wanted some sort of tool that could:

Export diagrams of gene clusters as vector files for figures.
Export diagrams with multiple gene clusters in the same relative scale.
Import gene metadata from the JGI's IMG database, which has many genomes, metagenomes, and so on that are absent from NCBI's NR database (and UniProt), or that are present but poorly annotated in databases like NCBI/WGS.
Handle hypothetical and predicted proteins helpfully (i.e. by identifying new groups of hypothetical proteins that frequently appear in the genomic neighborhoods of interest).
Integrate into workflows using sequence similarity networks generated via the EFI-EST toolset.
Interrogate similarity of genomic neighborhoods without relying on the sequence similarity of genes of interest - I did not want to have to make the assumption that sequence similarity and gene cluster similarity necessarily track, since that's not always a sound assumption. I haven't encountered anything that quite handles all of those things in one go, so...

The `prettyClusters` toolset

The Wiki entries contain a more detailed description of the use of specific functions.

The core toolset

Accessory components

Using `prettyClusters`

A basic installation guide is the best starting point.
I've got a simple walkthrough for a standard run using the prettyClusters toolset.
This is very much a work in progress, and doing this is using prettyClusters in difficult mode, but I've got a rough workflow for this as well.

`prettyClusters` output

Illustrating the output of some of the components (or, in the case of the cluster diagrams themselves, just under 10% of the output): Notably, sequence similarity and genome neighborhood similarity are not always tightly coupled. The analyses in prettyClusters make it possible to investigate a protein family along both axes.

Development

Planned additions

A vignette!
Import from UniProt and GFF/GFF-3 formatted files. As with gbToIMG this will likely be a separate function that can replace generateNeighbors, and it will likely suffer from the same data heterogeneity (and lack of gene family annotations) that that tool does. Supplementation with incorpIprScan is likely still going to be advisable.
Automatic generation of genome neighborhood diagrams for specific clusters (or for random representatives of a cluster) in prettyClusterDiagrams, since the diagrams get unwieldy with very large datasets...
Single scale bar in prettyClusterDiagrams. Probably will try generating a final fake "gene cluster" with single-nt "genes" every kb or something?
Generation of HMMs for hypothetical protein families identified in analyzeNeighbors
User-supplied HMMs for annotation of predefined custom protein (sub)families as a standalone subfunction.
Options to let the user specify distance and clustering methods in prepNeighbors and analyzeNeighbors

Known issues

Auto-annotation in prettyClusterDiagrams may miss genes if their initial family categorization (or ORF-finding) was poor! This is doubly the case for GenBank-derived files (use of incorpIprScan can improve annotation, but not ORF-finding).
Generation of hypothetical protein and genome neighborhood clusters is approximate and sensitive to user-supplied cutoffs, to distance/clustering methods, to overrepresentation of closely related gene clusters, and to the conservation of genes outside of the gene cluster limits. There are limited ways around these problems, and they come with their own compromises. (Overrepresentation at least can be dealt with using representative sequences chosen via EFI-EST (repnodes) or CD-HIT.)
Forward- and reverse-facing genes are on the same vertical level in prettyClusterDiagrams. I personally find it visually clearer to have forward genes above the line and reverse genes below, but will need to probably do a bunch more digging into gggenes and ggplot2 to figure out if/how I can make it happen.
Distance between genes of interest and their neighbors is not taken into account. I have done a few initial analyses using weighted distances rather than binary present/absent values; it is not clear they've added much more info than the binary value-based analyses, and they're more complicated to run. Could revisit?
Use of non-WSL mafft and blast installs on Windows isn't built-in at the moment; it's a low priority, given how easy WSL is to get up and running.

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
R		R
data		data
man		man
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.md		README.md
prettyClusters.Rproj		prettyClusters.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

R

R

data

data

man

man

vignettes

vignettes

.Rbuildignore

.Rbuildignore

.gitignore

.gitignore

DESCRIPTION

DESCRIPTION

NAMESPACE

NAMESPACE

README.md

README.md

prettyClusters.Rproj

prettyClusters.Rproj

Repository files navigation

prettyClusters

Important note

Why?

The `prettyClusters` toolset

The core toolset

Accessory components

Using `prettyClusters`

`prettyClusters` output

Development

Planned additions

Known issues

About

Releases

Packages

Languages

stogqy/prettyClusters

Folders and files

Latest commit

History

Repository files navigation

prettyClusters

Important note

Why?

The prettyClusters toolset

The core toolset

Accessory components

Using prettyClusters

prettyClusters output

Development

Planned additions

Known issues

About

Resources

Stars

Watchers

Forks

Languages

The `prettyClusters` toolset

Using `prettyClusters`

`prettyClusters` output