sourmash is a tool for biological sequence analysis and comparisons.
betterplot is a sourmash plugin that provides improved plotting/viz
and cluster examination for sourmash-based sketch comparisons. It
includes better similarity matrix plotting, MDS plots, and
clustermaps, as well as support for coloring samples based on
categories. It also includes support for sparse comparison output
formats produced by the fast multithreaded manysearch and pairwise
functions in the branchwater plugin for sourmash.
sourmash compare and sourmash plot produce basic distance matrix plots
that are useful for comparing and visualizing the relationships between
dozens to hundreds of genomes. This is one of the most popular use cases
for sourmash!
However, the visualization can be improved a lot beyond the basic viz
that sourmash plot produces. There are many only slightly more
complicated use cases for comparing, clustering, and visualizing many
genomes, and this plugin exists to explore some of them!
General goals:
- provide a variety of plotting and exploration commands that can be used with sourmash tools;
- provide both command-line functionality and functions that can be imported and used in Jupyter notebooks;
- (maybe) explore other backends than matplotlib;
and who knows what else??
As of v0.4, the betterplot plugin provides:
- improved similarity matrix visualization, along with cluster extraction;
- multidimensional scaling (MDS) plots;
- t-distributed Stochastic Neighbor Embedding (t-SNE) plots;
- non-square matrix visualization for the output of manysearch;
- an upset plot to visualize intersections between sketches;
- a utility function to convert pairwise output into a similarity matrix;
- a utility function to convert cluster output into color categories.
pip install sourmash_plugin_betterplot
See the examples below for some example command lines and output,
and use command-line help (-h/--help) to see available options.
The labels-to CSV file taken by most (all?) of the comparison matrix
plotting functions (e.g. plot2, plot3, mds) is the same format
output by sourmash compare ... --labels-to <file>
and loaded by sourmash plot --labels-from <file>. The format is
hopefully obvious, but there are a few things to mention:
- the sort_order column specifies the order of the columns with respect to the samples in the distance matrix. This is there to support arbitrary re-arranging and processing of the CSV file.
- the label column is the name that will be displayed on the plot, as well as for the default "categories" CSV matching (see below). You can edit this by hand (spreadsheet, text editor) or programmatically.
- as a side note, the labels.txt file output by sourmash compare is entirely ignored ;).
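As a rough illustration of editing the label column programmatically, here is a minimal Python sketch using only the standard library. It assumes just the two columns described above (sort_order and label) with integer sort_order values; the sample names are invented, the helper functions are hypothetical, and a real labels-to file may carry further columns, which this sketch passes through untouched.

```python
import csv
import io

# Hypothetical labels-to content: only the sort_order and label columns come
# from the description above; the values are made up for illustration.
LABELS_CSV = """sort_order,label
1,sample_B
0,sample_A
2,sample_C
"""

def load_labels(text):
    """Return rows ordered by sort_order, i.e. in matrix row/column order."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda row: int(row["sort_order"]))
    return rows

def relabel(rows, renames):
    """Return copies of the rows, replacing labels that appear in renames."""
    return [dict(row, label=renames.get(row["label"], row["label"])) for row in rows]

def dump_labels(rows):
    """Serialize rows back to CSV text, keeping every original column."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

rows = load_labels(LABELS_CSV)
new_rows = relabel(rows, {"sample_A": "E. coli K-12"})
print(dump_labels(new_rows))
```

Because sort_order is left unchanged, the renamed rows still line up with the same rows and columns of the matrix.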
One of the nice features of the betterplot functions is the ability to
provide categories that color the plots. This is critical for some
plots (for example, the mds and mds2 plots don't make much sense
without colors!) and nice for other plots, like plot3 and
clustermap1, where you can color columns/rows by category.
To make use of this feature, you need to provide a "categories" CSV
file (typically -C/--categories-csv). This file is reasonably flexible
in format; it must contain at least two columns, one named category,
but can contain more as long as category is provided.
The simplest possible categories CSV format is shown in
10sketches-categories.csv, and
it contains two columns, label and category. When this file is
loaded, label is matched to the name of each point/row/column, and
that point is then assigned that category.
Additional flexibility is provided by the column matching.
Some restrictions of / observations on the current implementation:
- if a categories CSV is provided, every point must have an associated category. It should be possible to have MORE many points and categories - checkme, @CTB!
- there is currently no way to specify a specific color for a category; they get assigned at random.
- it is entirely OK to edit the labels file (see above) and just add a category column. This won't be picked up by the code automatically - you'll need to specify the same file via -C - but it works fine!
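The label-to-category matching and the "every point must have a category" restriction described above can be sketched as follows. This is an illustration only, not the plugin's actual code: the function names are hypothetical, the sample names and categories are invented, and the input imitates a labels file with a hand-added category column.

```python
import csv
import io

# Hypothetical labels file with a hand-added category column; the names and
# categories are invented for illustration.
LABELS_WITH_CATEGORY = """sort_order,label,category
0,sample_A,gut
1,sample_B,soil
2,sample_C,gut
"""

def load_categories(text):
    """Map each label to its category; any extra columns are ignored."""
    return {row["label"]: row["category"] for row in csv.DictReader(io.StringIO(text))}

def assign_categories(labels, label_to_category):
    """Return one category per label, failing if any label has no category."""
    missing = [label for label in labels if label not in label_to_category]
    if missing:
        raise ValueError(f"no category for: {', '.join(missing)}")
    return [label_to_category[label] for label in labels]

label_to_category = load_categories(LABELS_WITH_CATEGORY)
point_categories = assign_categories(["sample_A", "sample_B", "sample_C"], label_to_category)
print(point_categories)
```

Raising an error on a missing label mirrors the restriction in the list above; how the plugin itself reports that situation is not covered here.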
The command lines below are executable in the examples/ subdirectory
of the repository after installing the plugin.
Compare 3 sketches with sourmash compare, and cluster.
These commands:
sourmash compare sketches/{2,47,63}.sig.zip -o 3sketches.cmp \
--labels-to 3sketches.cmp.labels_to.csv
sourmash scripts plot2 3sketches.cmp 3sketches.cmp.labels_to.csv \
-o plot2.3sketches.cmp.png
produce this plot:
Compare 3 sketches with sourmash compare, cluster, and show a cut point.
These commands:
sourmash compare sketches/{2,47,63}.sig.zip -o 3sketches.cmp \
--labels-to 3sketches.cmp.labels_to.csv
sourmash scripts plot2 3sketches.cmp 3sketches.cmp.labels_to.csv \
-o plot2.cut.3sketches.cmp.png \
--cut-point=1.2
produce this plot:
Compare 10 sketches with sourmash compare, cluster, and use a cut
point to extract multiple clusters. Use --dendrogram-only to plot
just the dendrogram.
These commands:
sourmash compare sketches/{2,47,48,49,51,52,53,59,60,63}.sig.zip \
-o 10sketches.cmp \
--labels-to 10sketches.cmp.labels_to.csv
sourmash scripts plot2 10sketches.cmp 10sketches.cmp.labels_to.csv \
-o plot2.cut.dendro.10sketches.cmp.png \
--cut-point=1.35 --cluster-out --dendrogram-only
produce this plot:
as well as a set of 6 clusters written to 10sketches.cmp.*.csv.
Use MDS to display a comparison generated by sourmash compare.
These commands:
sourmash compare sketches/{2,47,48,49,51,52,53,59,60,63}.sig.zip \
-o 10sketches.cmp \
--labels-to 10sketches.cmp.labels_to.csv
sourmash scripts mds 10sketches.cmp 10sketches.cmp.labels_to.csv \
-o mds.10sketches.cmp.png \
-C sketches/10sketches-categories.csv
By default this command generates a metric MDS plot. You can generate
a non-metric (NMDS) plot with --nmds.
Use MDS to display a sparse comparison created using the
branchwater plugin's pairwise command. The output of pairwise
is distinct from the sourmash compare output: pairwise
produces a sparse CSV file that contains just the matches above
threshold, while sourmash compare produces a dense numpy matrix.
These commands:
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60,63}.sig.zip \
-o 10sketches.sig.zip
sourmash scripts pairwise 10sketches.sig.zip -o 10sketches.pairwise.csv
sourmash scripts mds2 10sketches.pairwise.csv \
-o mds2.10sketches.cmp.png \
-C sketches/10sketches-categories.csv
By default this command generates a metric MDS plot. You can generate
a non-metric (NMDS) plot with --nmds.
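To make the sparse-versus-dense distinction concrete, the sketch below expands a small sparse comparison into a dense, symmetric similarity matrix using plain Python lists. The column names (query_name, match_name, jaccard) are assumptions for this example, so check the header of your own pairwise CSV; the function is hypothetical and is not how the plugin is implemented. Pairs missing from the sparse file are assumed to have similarity 0, and each sample is given similarity 1 with itself.

```python
import csv
import io

# Hypothetical sparse pairwise output. The column names are assumptions for
# this sketch, and the values are made up.
PAIRWISE_CSV = """query_name,match_name,jaccard
A,B,0.5
B,C,0.25
"""

def sparse_to_dense(text, value_column="jaccard"):
    """Expand sparse (query, match, value) rows into a symmetric dense matrix."""
    rows = list(csv.DictReader(io.StringIO(text)))
    names = sorted({r["query_name"] for r in rows} | {r["match_name"] for r in rows})
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    # Start with 1.0 on the diagonal and 0.0 for every unreported pair.
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in rows:
        i, j = index[r["query_name"]], index[r["match_name"]]
        # Jaccard similarity is symmetric, so fill both halves.
        matrix[i][j] = matrix[j][i] = float(r[value_column])
    return names, matrix

names, matrix = sparse_to_dense(PAIRWISE_CSV)
for name, row in zip(names, matrix):
    print(name, row)
```

The symmetric fill is only appropriate for symmetric measures such as Jaccard; a directional measure such as containment would need each direction stored separately.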
The sourmash scripts cluster command from the branchwater plugin
will cluster pairwise output; cluster_to_categories converts these
clusters into a categories CSV that can be used to color points and
columns/rows.
These commands:
# generate pairwise comparison
sourmash scripts pairwise sketches/64sketches.sig.zip -o 64sketches.pairwise.csv \
--write-all
# generate clusters
sourmash scripts cluster 64sketches.pairwise.csv \
-o 64sketches.pairwise.clusters.csv \
--similarity jaccard -t 0
# convert to categories CSV
sourmash scripts cluster_to_categories 64sketches.pairwise.csv \
64sketches.pairwise.clusters.csv -o 64sketches.pairwise.clusters.cats.csv
produce 64sketches.pairwise.clusters.cats.csv, which categorizes the
input samples based on their cluster membership.
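The conversion from clusters to categories can be pictured with the following minimal sketch. The cluster file layout used here (a cluster column plus a nodes column holding ';'-separated sample names) is an assumption for illustration, as are the function names; the real cluster_to_categories command also takes the pairwise CSV as input, which this sketch does not use.

```python
import csv
import io

# Hypothetical cluster output: one row per cluster, member names joined by
# ';'. The column names and the delimiter are assumptions for this sketch.
CLUSTERS_CSV = """cluster,nodes
Component_1,sample_A;sample_C
Component_2,sample_B
"""

def clusters_to_categories(text):
    """Return (label, category) pairs, one per clustered sample."""
    pairs = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in row["nodes"].split(";"):
            pairs.append((name, row["cluster"]))
    return pairs

def write_categories(pairs):
    """Serialize the pairs as a two-column categories CSV (label, category)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["label", "category"])
    writer.writerows(pairs)
    return out.getvalue()

categories_csv = write_categories(clusters_to_categories(CLUSTERS_CSV))
print(categories_csv)
```

The result has the two-column label/category shape described for the simplest categories CSV, so each sample is colored by the cluster it belongs to.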
t-distributed stochastic neighbor embedding (t-SNE) is another method
for visualizing high-dimensional data in two dimensions. The tsne
command displays a comparison generated by sourmash compare.
These commands:
sourmash compare sketches/64sketches.sig.zip -o 64sketches.cmp \
--labels-to 64sketches.cmp.labels_to.csv
sourmash scripts tsne 64sketches.cmp 64sketches.cmp.labels_to.csv \
-C 64sketches.pairwise.clusters.cats.csv -o tsne.64sketches.cmp.png
produce this plot:
(The 64sketches.pairwise.clusters.cats.csv file is generated by the
cluster_to_categories command above.)
These commands:
sourmash scripts pairwise sketches/64sketches.sig.zip -o 64sketches.pairwise.csv \
--write-all
sourmash scripts tsne2 64sketches.pairwise.csv \
-C 64sketches.pairwise.clusters.cats.csv -o tsne2.64sketches.cmp.png
produce this plot:
(The 64sketches.pairwise.clusters.cats.csv file is generated by the
cluster_to_categories command above.)
Convert the sparse comparison CSV (created using the
branchwater plugin's pairwise command) into a sourmash compare-style
similarity matrix.
These commands:
# build pairwise
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60,63}.sig.zip \
-o 10sketches.sig.zip
sourmash scripts pairwise 10sketches.sig.zip -o 10sketches.pairwise.csv
# convert pairwise
sourmash scripts pairwise_to_matrix 10sketches.pairwise.csv \
-o 10sketches.pairwise.cmp \
--labels-to 10sketches.pairwise.cmp.labels_to.csv
# plot!
sourmash scripts plot2 10sketches.pairwise.cmp \
10sketches.pairwise.cmp.labels_to.csv \
-o plot2.pairwise.10sketches.cmp.png
produce this plot:
Plot a sourmash compare similarity matrix using the seaborn
clustermap, which offers some nice visualization options.
These commands:
sourmash compare sketches/{2,47,48,49,51,52,53,59,60,63}.sig.zip \
-o 10sketches.cmp \
--labels-to 10sketches.cmp.labels_to.csv
sourmash scripts plot3 10sketches.cmp 10sketches.cmp.labels_to.csv \
-o plot3.10sketches.cmp.png -C sketches/10sketches-categories.csv
produce this plot:
Plot the sparse comparison CSV (created using the
branchwater plugin's manysearch command) using seaborn's clustermap.
Supports separate category coloring on rows and columns.
These commands:
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60,63}.sig.zip \
-o 10sketches.sig.zip
sourmash scripts manysearch 10sketches.sig.zip \
sketches/shew21.sig.zip -o 10sketches.manysearch.csv
sourmash scripts clustermap1 10sketches.manysearch.csv \
-o clustermap1.10sketches.png \
-u containment -R sketches/10sketches-categories.csv
produce:
Plot an UpSetPlot of the intersections between sketches.
This command:
sourmash scripts upset 10sketches.sig.zip -o 10sketches.upset.png
produces:
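As a rough picture of what an upset plot summarizes, the sketch below counts the hashes that belong to exactly each combination of sketches. It assumes the usual upset convention of exclusive intersections, which may not match what the plugin computes; the sketches are toy sets of made-up integers and the function name is hypothetical.

```python
from itertools import combinations

# Toy stand-ins for sketches: each name maps to a made-up set of hash values.
SKETCHES = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 5, 6},
}

def exclusive_intersections(sketches):
    """Count hashes found in every sketch of a combination and in no other."""
    counts = {}
    names = sorted(sketches)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Hashes shared by all sketches in this combination...
            shared = set.intersection(*(sketches[n] for n in combo))
            # ...minus anything that also occurs in a sketch outside it.
            others = [sketches[n] for n in names if n not in combo]
            exclusive = shared.difference(*others)
            if exclusive:
                counts[combo] = len(exclusive)
    return counts

counts = exclusive_intersections(SKETCHES)
for combo, size in sorted(counts.items(), key=lambda item: -item[1]):
    print(" & ".join(combo), size)
```

Because the intersections are exclusive, the counts add up to the total number of distinct hashes across all sketches.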
Plot a Venn diagram of the intersections between two or three sketches.
This command:
sourmash scripts venn sketches/{2,47,63}.sig.zip \
-o 3sketches.venn.png --ident
produces:
We suggest filing issues in the main sourmash issue tracker as that receives more attention!
betterplot is developed at
https://github.com/sourmash-bio/sourmash_plugin_betterplot.
See environment.yml for the dependencies needed to develop betterplot.
Run:
make examples
to run the examples.
For now, the examples serve as the tests; eventually we will add unit tests.
Bump version number in pyproject.toml and push.
Make a new release on github.
Then pull, and:
python -m build
followed by twine upload dist/...
.
CTB June 2024