Skip to content

07. Comparison

Krista Ternus edited this page Nov 12, 2020 · 6 revisions

Comparison

Table of Contents

Workflow Overview

In this workflow, sourmash version 2.1.0 performs pairwise-comparisons of the k-mer content across multiple samples without the use of a reference database. Because many variables can impact k-mer content estimates, such as varying depths of sequencing in different samples, it is recommended that this workflow only be used to compare the k-mer content of the same sample after using different trimming and assembly parameters. This allows users to evaluate how their sample's overall k-mer content was impacted by read filtering and assembly. Logistically, this means only one sample name should be present in the config file when executing rules within the comparison workflow.

The comparison workflow first generates MinHash signatures on the filtered reads (i.e., the Read Filtering Workflow output) or the assembled contigs (i.e., the Assembly Workflow output) with sourmash compute. The sourmash compare functionality then compares signatures from multiple datasets to each other and outputs their pairwise comparisons as Jaccard indexes in a final *.csv file. Ideally, this workflow is designed to compare MinHash signatures of different quality trimmed values and assemblers for the same sample and output results in csv files and heatmap visualizations.

Required Files

This workflow has been tested to run offline following the execution of the Read Filtering and Assembly workflows. If you have not already, you will need to perform the Offline Setup for the comparison workflow (note: make sure you have activated you metscale environment):

[user@localhost ~]$ conda activate metscale 

(metscale)[user@localhost ~]$ cd metscale/workflows

(metscale)[user@localhost ~]$ python download_offline_files.py --workflow comparison

Singularity Images

In the metscale/container_images/ directory, you should see the following Singularity images that were created when running the comparison or all flag during the Offline Setup:

File Name File Size
sourmash_2.1.0--py27he1b5a44_0.sif 271 MB

If you are missing this image, you should re-run the setup command, as per instructions in the Offline Setup.

Input Files

The comparison workflow can use filtered reads (outputs from the Read Filtering Workflow) and/or assembled contigs (outputs from the Assembly Workflow) as its inputs. These input files should be located in the metscale/workflows/data directory:

File Name File Size
SRR606249_subset10_1_reads_trim2_1.fq.gz 365 MB
SRR606249_subset10_1_reads_trim2_2.fq.gz 359 MB
SRR606249_subset10_1_reads_trim30_1.fq.gz 313 MB
SRR606249_subset10_1_reads_trim30_2.fq.gz 300 MB
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa 153 MB
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa 142 MB
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa 127 MB
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa 115 MB

If these files look good to go, then you may proceed to run the rest of the workflow offline for the example dataset.

Workflow Execution

Workflows are executed according to the sample names and workflow parameters, as specified in the config file. For more information about config files, see the Getting Started wiki page.

After the config file is ready, be sure to specify the Singularity bind path from the metscale/workflows directory before running the comparison workflow.

cd metscale/workflows
export SINGULARITY_BINDPATH="data:/tmp"  

Execution of the comparison workflow can then be performed with the following command:

snakemake --use-singularity {rules} {other options}

The following rules are available for execution in the comparison workflow (yellow star indicates the terminal rule):

The comparison rules and their parameters are listed under "workflows" in the metscale/workflows/config/default_workflowconfig.settings config file.

Rule Description
comparison_reads_workflow Sourmash computes the MinHash signatures of filtered reads, and then performs pairwise comparisons to generate Jaccard indexes
comparison_assembly_workflow Sourmash computes the MinHash signatures of assembled contigs, and then performs pairwise comparisons to generate Jaccard indexes
comparison_reads_assembly_workflow Sourmash computes the MinHash signatures of filtered reads and assembled contigs, and then performs pairwise comparisons to generate Jaccard indexes
comparison_output_heatmap_plots_reads_workflow Generates a heatmap plot to visualize the csv output from the compared reads
comparison_output_heatmap_plots_assembly_workflow Generates a heatmap plot to visualize the csv output from the compared assemblies
comparison_output_heatmap_plots_reads_assembly_workflow Generates a heatmap plot to visualize the csv output from the compared reads and assemblies
comparison_output_heatmap_plots_all_workflow Generates a heatmap plot to visualize all csv outputs. When run alone, this rule essentially performs all the former rules to generate all comparisons and data outputs.

For this workflow, both the filtered reads and the assembled contigs can be compared with the following command:

snakemake --use-singularity comparison_reads_assembly_workflow

The following command will compare only filtered reads:

snakemake --use-singularity comparison_reads_workflow

The following command will compare only assembled contigs:

snakemake --use-singularity comparison_assembly_workflow

The following command will view the read comparisons as a heatmap visualization:

snakemake --use-singularity comparison_output_heatmap_plots_reads_workflow

The following command will view the assembly comparisons as a heatmap visualization:

snakemake --use-singularity comparison_output_heatmap_plots_assembly_workflow

The following command will view the combined read and assembly comparison as a heatmap visualization:

snakemake --use-singularity comparison_output_heatmap_plots_reads_assembly_workflow

In the event that you wish to execute all the available options for the comparison workflow, you may do so by simply executing the following rule:

snakemake --use-singularity comparison_output_heatmap_plots_all_workflow

Additional options for snakemake can be found in the snakemake documentation.

To specify your own parameters for this or any of the workflows prior to execution, see Workflow Architecture for more information.

Output

After successful execution of the comparison workflow, you will find all of your outputs in the metscale/workflows/data/ directory. You should expect to see the following files for each sample and quality threshold value:

Tool Output File Name Description
Sourmash {sample}_1_reads_trim{qual}_ scaled{scaled_value}.k{value_of_k}.sig MinHash signatures of filtered reads generated by sourmash compute
Sourmash {sample}_1_reads_trim{qual}.{assembler}_ scaled{scaled_value}.k{value_of_k}.sig MinHash signatures of assembled contigs generated by sourmash compute
Sourmash {sample}_1_reads_trim{qual}and{qual} _read_comparison.k{value_of_k}.csv Jaccard indexes generated by sourmash compare from the MinHash signatures of filtered reads
Sourmash {sample}_1_reads_trim{qual}and{qual} _assembly_comparison.k{value_of_k}.csv Jaccard indexes generated by sourmash compare from the MinHash signatures of assemblies
Sourmash {sample}_1_reads_and{sample}_1_reads_ read_assembly_comparison.k{value_of_k}.csv Jaccard indexes generated by sourmash compare from the MinHash signatures of filtered reads and assemblies
Custom R script {sample}_1_reads_trim{qual}and{qual}_ read_comparison.{value_of_k}.png Heatmap visualization of the Jaccard indexes from comparisons of filtered reads
Custom R script {sample}_1_reads_trim{qual}and{qual}_ assembly_comparison.{value_of_k}.png Heatmap visualization of the Jaccard indexes from comparisons of assembled contigs
Custom R script {sample}_1_readsand{sample}_reads_ read_assembly_comparison.{value_of_k}.png Heatmap visualization of the Jaccard indexes from comparisons of filtered reads and assembled contigs

MinHash signature files are not meant for human interpretation, but these are critical to the comparison workflow, as well as to sourmash gather in the Taxonomic Classification workflow.

The first line of the output *.csv file with the Jaccard indexes describes the order of sample names in the comparison matrix. A sample compared to itself will have a Jaccard index of 1, and a sample with no similarity to another will have a Jaccard index of 0.

Additional Information

Command Line Equivalents

To better understand how the workflows are operating, it may be helpful to see commands that could be used to generate equivalent outputs with the individual tools. Note that the file names in the below examples may not be exact replicates of the file naming conventions in the current workflows, but the commands are equivalent.

Note that MinHash signatures for different k-mer lengths (i.e., values of k) are merged in the comparison workflow to produce a single signature file, but there is one command for each value of k in the examples below. The information is all the same, but more individual signature files are produced by the below example commands (i.e., one signature file for each value of k) than are generated by the snakemake workflow.

The MinHash signatures of reads filtered with a quality score threshold of 2 are computed with the following commands (one command for each value of k - 21, 31, and 51):

sourmash compute --merge {sample}_reads_trim2_1.fq.gz {sample}_reads_trim2_2.fq.gz --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute --merge SRR606249_subset10_reads_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2_scaled10000.k21.sig
sourmash compute --merge SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2_scaled10000.k31.sig
sourmash compute --merge SRR606249_subset10_trim2_1.fq.gz SRR606249_subset10_trim2_2.fq.gz --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2_scaled10000.k51.sig

The MinHash signatures of reads filtered with a quality score threshold of 30 are computed with the following commands (one command for each value of k - 21, 31, and 51):

sourmash compute --merge {sample}_reads_trim30_1.fq.gz {sample}_reads_trim30_2.fq.gz --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30_scaled10000.k21.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30_scaled10000.k31.sig
sourmash compute --merge SRR606249_subset10_trim30_1.fq.gz SRR606249_subset10_trim30_2.fq.gz --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30_scaled10000.k51.sig

After multiple MinHash signatures have been generated from the filtered reads, they are compared with the following commands (one command for each value of k - 21, 31, and 51):

sourmash compare {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}_trim2and30_read_comparison.{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k21.sig SRR606249_subset10_trim30_scaled10000.k21.sig --csv SRR606249_subset10_trim2and30_read_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k31.sig SRR606249_subset10_trim30_scaled10000.k31.sig --csv SRR606249_subset10_trim2and30_read_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k51.sig SRR606249_subset10_trim30_scaled10000.k51.sig --csv SRR606249_subset10_trim2and30_read_comparison.k51.csv

The comparison of assembled contigs from both MEGAHIT and metaSPAdes is also performed in a multiple step process, first by running sourmash compute on both sets of assembled contigs to generate signatures, then by running sourmash compare to determine the similarity between these signature files. This is equivalent to running the commands below.

The MinHash signatures of metaSPAdes and MEGAHIT contigs assembled from reads filtered with a quality score threshold of 2 are computed with the following commands (one command for each value of k - 21, 31, and 51):

sourmash compute {sample}_trim2.{assembler}/contigs.fasta --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim2.{assembler}_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim2.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim2.megahit_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim2.megahit_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim2_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim2.megahit_scaled10000.k51.sig

The MinHash signatures of metaSPAdes and MEGAHIT contigs assembled from reads filtered with a quality score threshold of 30 are computed with the following commands (one command for each value of k - 21, 31, and 51):

sourmash compute {sample}_trim30.{assembler}/contigs.fasta --track-abundance --scaled {scaled_value} -k {value_of_k} -o {sample}_trim30.{assembler}_scaled{scaled_value}.k{value_of_k}.sig
sourmash compute SRR606249_subset10_trim30_metaspades/contigs.fasta --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim30.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim30.metaspades/contigs.fasta --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 21 -o SRR606249_subset10_trim30.megahit_scaled10000.k21.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 31 -o SRR606249_subset10_trim30.megahit_scaled10000.k31.sig
sourmash compute SRR606249_subset10_trim30_megahit/final.contigs.fa --track-abundance --scaled 10000 -k 51 -o SRR606249_subset10_trim30.megahit_scaled10000.k51.sig

After multiple MinHash signatures have been generated from the assembled contigs, they are compared with the following commands (one command for each value of k - 21, 31, and 51):

sourmash compare {sample}_trim2.{assembler}_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30.{assembler}_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}_trim2and30_assembly_comparison.k{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig SRR606249_subset10_trim2.megahit_scaled10000.k21.sig SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig SRR606249_subset10_trim30.megahit_scaled10000.k21.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig SRR606249_subset10_trim2.megahit_scaled10000.k31.sig SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig SRR606249_subset10_trim30.megahit_scaled10000.k31.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig SRR606249_subset10_trim2.megahit_scaled10000.k51.sig SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig SRR606249_subset10_trim30.megahit_scaled10000.k51.sig --csv SRR606249_subset10_trim2and30_assembly_comparison.k51.csv

The comparison of both reads and assembled contigs is performed by first running all the sourmash compute commands listed above to generate signatures for all of the trimmed reads and assembled contigs. Then sourmash compare is run with all of these signature files, which is equivalent to running the following command:

sourmash compare {sample}_trim2_scaled{scaled_value}.k{value_of_k}.sig {sample}_trim30_scaled{scaled_value}.k{value_of_k}.sig --csv {sample}and{sample}_read_assembly_comparison.k{value_of_k}.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.k21.sig SRR606249_subset10_trim30_scaled10000.k21.sig SRR606249_subset10_trim2.metaspades_scaled10000.k21.sig SRR606249_subset10_trim2.megahit_scaled10000.k21.sig SRR606249_subset10_trim30.metaspades_scaled10000.k21.sig SRR606249_subset10_trim30.megahit_scaled10000.k21.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k21.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.31.sig SRR606249_subset10_trim30_scaled10000.k31.sig SRR606249_subset10_trim2.metaspades_scaled10000.k31.sig SRR606249_subset10_trim2.megahit_scaled10000.k31.sig SRR606249_subset10_trim30.metaspades_scaled10000.k31.sig SRR606249_subset10_trim30.megahit_scaled10000.k31.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k31.csv
sourmash compare SRR606249_subset10_trim2_scaled10000.51.sig SRR606249_subset10_trim30_scaled10000.k51.sig SRR606249_subset10_trim2.metaspades_scaled10000.k51.sig SRR606249_subset10_trim2.megahit_scaled10000.k51.sig SRR606249_subset10_trim30.metaspades_scaled10000.k51.sig SRR606249_subset10_trim30.megahit_scaled10000.k51.sig --csv SRR606249_subset10andSRR606249_subset10_read_assembly_comparison.k51.csv

Expected Output Files for the Example Dataset

Below is a more detailed description of the output files expected in the metscale/workflows/data/ directory after the comparison workflow has been successfully run.

Using the filtered reads generated by the Read Filtering Workflow and/or assembled contigs generated by the Assembly Workflow:

File Name File Size
SRR606249_subset10_1_reads_trim2_1.fq.gz 365 MB
SRR606249_subset10_1_reads_trim2_2.fq.gz 359 MB
SRR606249_subset10_1_reads_trim30_1.fq.gz 313 MB
SRR606249_subset10_1_reads_trim30_2.fq.gz 300 MB
SRR606249_subset10_1_reads_trim2.metaspades.contigs.fa 153 MB
SRR606249_subset10_1_reads_trim30.metaspades.contigs.fa 142 MB
SRR606249_subset10_1_reads_trim2.megahit.contigs.fa 127 MB
SRR606249_subset10_1_reads_trim30.megahit.contigs.fa 115 MB

The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different filtered read datasets with the comparison_workflow_reads rule:

File Name File Size
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig 1 MB
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig 903 KB
SRR606249_subset10_1_reads_trim2and30_read_comparison.k21.csv 0.1 KB
SRR606249_subset10_1_reads_trim2and30_read_comparison.k31.csv 0.1 KB
SRR606249_subset10_1_reads_trim2and30_read_comparison.k51.csv 0.1 KB

The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different assembled contig datasets with the comparison_workflow_assembly rule:

File Name File Size
SRR606249_subset10_1_reads_trim2.metaspades_scaled10000.k21_31_51.sig 702 KB
SRR606249_subset10_1_reads_trim30.metaspades_scaled10000.k21_31_51.sig 646 KB
SRR606249_subset10_1_reads_trim2.megahit_scaled10000.k21_31_51.sig 605 KB
SRR606249_subset10_1_reads_trim30.megahit_scaled10000.k21_31_51.sig 547 KB
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k21.csv 0.4 KB
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k31.csv 0.4 KB
SRR606249_subset10_1_reads_trim2and30_assembly_comparison.k51.csv 0.4 KB

The following files are produced by sourmash after computing the signatures and comparing the k-mer composition of different filtered read and assembled contig datasets with the comparison_workflow_reads_assembly rule:

File Name File Size
SRR606249_subset10_1_reads_trim2_scaled10000.k21_31_51.sig 1.1 MB
SRR606249_subset10_1_reads_trim30_scaled10000.k21_31_51.sig 883 KB
SRR606249_subset10_1_reads_trim2.metaspades_scaled10000.k21_31_51.sig 693 KB
SRR606249_subset10_1_reads_trim30.metaspades_scaled10000.k21_31_51.sig 639 KB
SRR606249_subset10_1_reads_trim2.megahit_scaled10000.k21_31_51.sig 593 KB
SRR606249_subset10_1_reads_trim30.megahit_scaled10000.k21_31_51.sig 536 KB
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads_ read_assembly_comparison.k21.csv 0.9 KB
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads read_assembly_comparison.k31.csv 0.9 KB
SRR606249_subset10_1_readsandSRR606249_subset10_1_reads_ read_assembly_comparison.k51.csv 0.9 KB

The tables below summarize the Jaccard indexes in the final *.csv files after analyzing the SRR606249_subset10_1_reads dataset.

k21 results reads_trim2 reads_trim30 megahit_trim2 metaspades_trim2 megahit_trim30 metaspades_trim30
reads_trim2 1.00 0.91 0.61 0.72 0.56 0.67
reads_trim30 0.91 1.00 0.71 0.82 0.66 0.79
megahit_trim2 0.61 0.71 1.00 0.82 0.89 0.82
metaspades_trim2 0.72 0.82 0.82 1.00 0.75 0.91
megahit_trim30 0.56 0.66 0.89 0.75 1.00 0.81
metaspades_trim30 0.67 0.79 0.82 0.91 0.81 1.00
k31 results reads_trim2 reads_trim30 megahit_trim2 metaspades_trim2 megahit_trim30 metaspades_trim30
reads_trim2 1.00 0.90 0.60 0.70 0.55 0.65
reads_trim30 0.90 1.00 0.70 0.81 0.66 0.78
megahit_trim2 0.60 0.70 1.00 0.81 0.88 0.81
metaspades_trim2 0.70 0.81 0.81 1.00 0.75 0.90
megahit_trim30 0.55 0.66 0.88 0.75 1.00 0.80
metaspades_trim30 0.65 0.78 0.81 0.90 0.80 1.00
k51 results reads_trim2 reads_trim30 megahit_trim2 metaspades_trim2 megahit_trim30 metaspades_trim30
reads_trim2 1.00 0.87 0.61 0.69 0.56 0.64
reads_trim30 0.87 1.00 0.70 0.77 0.67 0.76
megahit_trim2 0.61 0.70 1.00 0.81 0.87 0.81
metaspades_trim2 0.69 0.77 0.81 1.00 0.74 0.90
megahit_trim30 0.56 0.67 0.87 0.74 1.00 0.80
metaspades_trim30 0.64 0.76 0.81 0.90 0.80 1.00

Overall, this example showed the same trends in similarity across different k-mer lengths, regardless of trimming quality threshold or assembly status. Additional observations from this example include the following:

  • Datasets showed slightly more similarity to each other when evaluated with smaller k-mer lengths (i.e., k = 21) than larger k-mer lengths (i.e., k = 51)
  • Raw reads trimmed at different quality thresholds were slightly more similar to each other than the assembled contigs
  • Assembled contigs shared slightly more of their k-mer content if they were created by the same assembler