The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. It is useful, for example, for early detection of viral variants of concern (e.g. Alpha, Delta, Omicron) in environmental samples, and has been designed to scan for multiple SARS-CoV-2 variants in wastewater samples, as analyzed jointly by ETH Zurich, EPFL and Eawag. Learn more about this project on its Dashboard, amplicon cooccurrences measured with cojac are visualized on the heatmaps available on per-station or per-variant subpages displayed on CoV-Spectrum.
The analysis requires the whole amplicon to be covered by sequencing read pairs. It currently works at the level of aligned reads, but we plan to be able to adjust confidence scores based on local (window) haplotypes (as generated, e.g., by ShoRAH, doi:10.1186/1471-2105-12-119).
Here are the available command-line tools:
command | purpose |
---|---|
cooc-mutbamscan |
scan an alignment BAM/CRAM/SAM file for mutation co-occurrences and output a JSON or YAML file |
cooc-colourmut |
display a JSON or YAML file as a coloured output on the terminal |
cooc-pubmut |
render a JSON or YAML file to a table as in the publication |
cooc-tabmut |
export a JSON or YAML file as a CSV/TSV table for downstream analysis (e.g.: RStudio) |
cooc-curate |
an (experimental) tool to assist evaluating the quality of variant definitions by looking at mutations' or cooccurences' frequencies from CoV-Spectrum |
phe2cojac |
a tool to generate new variant definition YAMLs for cojac using YMLs available at PHE Genomic's Standardised Variant Definitions |
Use option -h
/ --help
to see available command-line options:
$ cooc-mutbamscan --help
usage: cooc-mutbamscan [-h] [-s TSV | -a BAM/CRAM [BAM/CRAM ...]] [-/ [BATCHNAME]] [-p PATH] [-r REFID] [-m DIR] [-b BED] [-# COOC] [-Q YAML | -A YAML] [-j JSON] [-y YAML] [-t TSV] [-d]
scan amplicon (covered by long read pairs) for mutation cooccurrence
options:
-h, --help show this help message and exit
-s TSV, --samples TSV
V-pipe samples list tsv
-a BAM/CRAM [BAM/CRAM ...], --alignments BAM/CRAM [BAM/CRAM ...]
alignment files
-/ [BATCHNAME], --batchname [BATCHNAME]
concatenate samplename/batchname from samples tsv
-p PATH, --prefix PATH
V-pipe work directory prefix for where to look at align files when using TSV samples list
-r REFID, --reference REFID
reference to look for in alignment files
-m DIR, --vocdir DIR directory containing the yamls defining the variant of concerns
-b BED, --bedfile BED
bedfile defining the amplicons, with format: ref\tstart\tstop\tamp_num\tpool\tstrand
-# COOC, --cooc COOC minimum number of cooccurences to search for
-Q YAML, --in-amp YAML, --amplicons YAML
use the supplied YAML file to query amplicons instead of building it from BED + voc's DIR
-A YAML, --out-amp YAML, --out-amplicons YAML
output amplicon query in a YAML file
-j JSON, --json JSON output results to as JSON file
-y YAML, --yaml YAML output results to as yaml file
-t TSV, --tsv TSV output results to as (raw) tsv file
-d, --dump dump the python object to the terminal
@listfile can be used to pass a long list of parameters (e.g.: a large number of BAMs) in a file instead of command line
$ cooc-colourmut --help
usage: cooc-colourmut [-h] -a YAML (-j JSON | -y YAML)
print coloured pretty table on terminal
options:
-h, --help show this help message and exit
-a YAML, --amplicons YAML
list of query amplicons, from mutbamscan
-j JSON, --json JSON results generated by mutbamscan
-y YAML, --yaml YAML results generated by mutbamscan
see cooc-pubmut for a CSV file that can be imported into an article
$ cooc-pubmut --help
usage: cooc-pubmut [-h] [-m DIR] [-a YAML] (-j JSON | -y YAML) [-o CSV] [-e | -x] [-/ [BATCHNAME]] [-q]
make a pretty table
options:
-h, --help show this help message and exit
-m DIR, --vocdir DIR directory containing the yamls defining the variant of concerns
-a YAML, --amplicons YAML
list of query amplicons, from mutbamscan
-j JSON, --json JSON results generated by mutbamscan
-y YAML, --yaml YAML results generated by mutbamscan
-o CSV, --output CSV name of (pretty) csv file to save the table into
-e, --escape use escape characters for newlines
-x, --excel use a semi-colon ';' instead of a comma ',' in the comma-separated-files as required by Microsoft Excel
-/ [BATCHNAME], --batchname [BATCHNAME]
split samplename/batchname (as in samples tsv)
-q, --quiet Run quietly: do not print the table
you need to open the CSV in a spreadsheet that understands linebreaks
$ cooc-tabmut --help
usage: cooc-tabmut [-h] (-j JSON | -y YAML) [-/ [BATCHNAME]] [-o CSV] [-l] [-x] [-m] [-q]
make a table suitable for further processing: RStudio, etc.
options:
-h, --help show this help message and exit
-j JSON, --json JSON results generated by mutbamscan
-y YAML, --yaml YAML results generated by mutbamscan
-/ [BATCHNAME], --batchname [BATCHNAME]
split samplename/batchname (as in samples tsv)
-o CSV, --output CSV name of (pretty) csv file to save the table into
-l, --lines Line-oriented table alternative
-x, --excel use a semi-colon ';' instead of a comma ',' in the comma-separated-files as required by Microsoft Excel
-m, --multiindex Use multi-level indexing (amplicons and counts categories)
-q, --quiet Run quietly: do not print the table
$ cooc-curate --help
usage: cooc-curate [-h] [-a YAML] [-m] [-H HIGH] [-l LOW] [--collapse] [--no-collapse] [-c] [-C] VOC [VOC ...]
helps determining specific mutations and cooccurences by querying CoV-Spectrum
positional arguments:
VOC per VOC description YAML file(s)
options:
-h, --help show this help message and exit
-a YAML, --amp YAML, --amplicons YAML
use the YAML file generated by mutbamscan to query amplicons instead of mutations
-m, --mut, --mutations
always do mutations (even if amplicons YAML provided)
-H HIGH, --high HIGH Fraction above which a mutation must be found among seeked lineages
-l LOW, --low LOW Fraction under which a mutation must be found among other lineages
--collapse combine counts of all sublineages together and consider a signle value that corresponds to a lineages family (e.g.: count all B.1.612.2* together). This is especially useful for assessing signature of old
variants that have branched out by now.
--no-collapse consider each sublineage separately (e.g.: count separately BA.1, BA.1.1, BA.2, BA.3, etc. instead of counting B.1.529*)
-c, --ansi, --colour use coloured output
-C, --no-ansi, --no-colour
use text
This tool queries LAPIS, see https://lapis.cov-spectrum.org/swagger/ and https://lapis.cov-spectrum.org/
$ phe2cojac --help
usage: phe2cojac [-h] [-s SHRT] [-y [OUT_YAML]] IN_YAML
convert phe-genomics to cojac's dedicated variant YAML format
positional arguments:
IN_YAML phe-genomics variant YAML input file
options:
-h, --help show this help message and exit
-s SHRT, --shortname SHRT
shortname to use (otherwise auto-build one based on phe-genomic's unique id)
-y [OUT_YAML], --yaml [OUT_YAML]
write cojac variant to a YAML file instead of printing (if empty, build filename from shortname)
Analysis needs to be performed on SARS-CoV-2 samples sequenced using a tiled multiplexed PCRs protocol for which you need a BED (Browser Extensible Data) file describing the amplified regions, and sequenced with read settings that covers the totality of an amplicon.
We provide BED files for the following examples:
These protocols produces ~400bp long amplicons, and thus needs to be sequenced with, e.g., paired end sequencing with read length 250.
Select the desired bedfile using the -b
/ --bedfile
option.
Note:
- this analysis method cannot work on read length much shorter than the amplicons (e.g.: it will not give reliable results for a read-length of 50).
- to use different protocols (e.g. Nimagen), you need to provide a BED file describing the amplicons. Its columns "start" and "stop" are mandatory
Analysis will use variants description YAML that list mutation to be searched.
We provide several examples in the directory voc/
.
Select a directory containing a collection of virus definitions YAMLs using the -m
/ --vocdir
option.
Note:
- you can create new YAML files if you need to look for new variants of concern.
- e.g. it is possible to automatically generate YAMLs for cojac from PHE Genomic's Standardised Variant Definitions:
# fetch the repository of standardised variant definitions
git clone https://github.com/phe-genomics/variant_definitions.git
# generate a YAML for omicron subvariant BA.2 using the corresponding standardised variant definitions
phe2cojac --shortname 'om2' --yaml voc/omicron_ba2_mutations.yaml variant_definitions/variant_yaml/imagines-viewable.yml
# now have a look at the frequencies of mutations using CoV-Spectrum
cooc-curate voc/omicron_ba2_mutations.yaml
# adjust the content of the YAML files to your needs
There are currently two modes to collect the data about co-occurring mutations in reads: analysing stand-alone BAM/CRAM/SAM alignment files, or analysing the output of a cohort analysed with V-pipe (doi:10.1093/bioinformatics/btab015).
Provide a list of BAM files using the -a
/ --alignment
option. Run:
cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -a sam1.bam sam2.bam -j cooc-test.json
Note: you can also use the
-y
/--yaml
option to write to a YAML file instead of a JSON.
You can learn how to analyse fastq.gz files with V-pipe with this tutorial:
Run:
cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -t work/samples.tsv -p work/samples/ -j cooc-test.json
By default cooc-mutbamscan
will look for cooccurrences of at least 2 mutations on the same amplicon. You can change that number using option -#
/--cooc
:
- you can increase it to e.g.: 3 if the variants you study requires more stringent identification
- you can set it to 1, to also count isolated occurrences - in this case
cooc-mutbamscan
will also double as a generic (non coorcurrence-aware) variant caller, so you can get all counts with a single tool.
Using the -A
/ --out-amp
/ --out-amplicons
option, it is possible to store the exact request that was used to analyze samples.
You can then re-use the exact same request using the -Q
/ --in-amp
/ --amplicons
option, or pass it to a visualisation tool.
# store the request in a YAML file
cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -A amplicons.v3.yaml
# adjust the content of amplicons.v3.yaml
# now have a look at the frequencies of mutation cooccurences using CoV-Spectrum
cooc-curate -a amplicons.v3.yaml voc/omicron_ba2_mutations.yaml voc/omicron_ba1_mutations.yaml voc/delta_mutations.yaml
# reuse the amplicon
cooc-mutbamscan -Q amplicons.v3.yaml -a sam1.bam sam2.bam -j cooc-test.json
The default -d
/ --dump
option of cooc-mutbamscan
is not a very user-friendly experience to display the data. You can instead pass a JSON or YAML file to the display script. Run:
cooc-colourmut -a amplicons.v3.yaml -j cooc-test.json
Notes:
- passing the
-a
/--amplicons
parameter is currenlty mandatory, see section Store the amplicon query above
And now, let’s go beyond our terminal and produce a table that can be included in a publication (see bibliography below for concrete example). Run:
cooc-pubmut -m voc/ -a amplicons.v3.yaml -j cooc-test.json -o cooc-output.tsv
Note:
- if provided options
-m
/--vocdir
and-a
/--amplicons
can help generate human-friendly headers (Amplicon 88, 26277-26635) in the table instead of short names (88_om
)- you can also output to comma-separated table (
-o cooc-output.csv
)- Microsoft Excel requires using option
-x
/--excel
(using semi-colon instead of comma in comma-separated-value files). Some versions can also open TSV (but not the Office 365 web app).
You need to open the table with a spread-sheet that can understand line breaks, such as LibreOffice Calc, Google Docs Spreadsheet or, using special options (see above), Microsoft Excel.
72_al | 78_al | 92_al | 93_al | 76_be | 77_d614g | |
---|---|---|---|---|---|---|
sam1.bam | 158 / 809 19.53% |
2 / 452 0.44% |
89 / 400 22.25% |
344 / 758 45.38% |
0 / 1090 0.00% |
371 / 371 100.00% |
sam2.bam | 0 / 1121 0.00% |
0 / 255 0.00% |
58 / 432 13.43% |
142 / 958 14.82% |
0 / 1005 0.00% |
1615 / 1615 100.00% |
It is also possible to use the software pandoc to further convert the CSV to other formats. Run:
cooc-pubmut -j cooc-test.json -o cooc-output.csv
pandoc cooc-output.csv -o cooc-output.pdf
pandoc cooc-output.csv -o cooc-output.html
pandoc cooc-output.csv -o cooc-output.md
If you want to further analyse the data (e.g.: with RStudio), it's also possible to export the data into a more machine-readable CSV/TSV table. Run:
./cooc-pubmut -j cooc-test.json -o cooc-export.csv
You can try importing the resulting CSV in you favourite tool.
A72_al.count | A72_al.mut_all | A72_al.mut_oneless | A72_al.frac | A72_al.cooc | A78_al.count | A78_al.mut_all | A78_al.mut_oneless | A78_al.frac | A78_al.cooc | ... | |
---|---|---|---|---|---|---|---|---|---|---|---|
sam1.bam | 809 | 158 | 234 | 0.195303 | 2 | 452 | 2 | 7 | 0.004425 | 2 | ... |
sam2.bam | 1121 | 0 | 0 | 0.000000 | 2 | 255 | 0 | 52 | 0.000000 | 2 | ... |
The columns are tagged as following:
- count: total count of amplicons carrying the sites of interest
- mut_all: amplicons carrying mutations on all site of interest (e.g.: variant mutations observed on all sites)
- mut_oneless: amplicons where one mutation is missing (e.g.: only 2 out of 3 sites carried the variant mutation, 1 sites carries wild-type)
- frac: fraction (mut_all/count) or empty if no counts
- cooc: number of considered site (e.g.: 2 sites of interests) or empty if no counts
If your tool supports multi-level indexing, use the -m
/--multiindex
option. The resulting table will be bilevel indexed:
the first level is the amplicon, the second is the category.
A72_al | A78_al | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
count | mut_all | mut_oneless | frac | cooc | count | mut_all | mut_oneless | frac | cooc | |
sam1.bam | 809 | 158 | 234 | 0.195303 | 2 | 452 | 2 | 7 | 0.004425 | 2 |
sam2.bam | 1121 | 0 | 0 | 0.0 | 2 | 255 | 0 | 52 | 0.0 | 2 |
Another different table orientation is provided by -l
/--lines
:
sample | amplicon | frac | cooc | count | mut_all | mut_oneless | al | be | d614g |
---|---|---|---|---|---|---|---|---|---|
sam1.bam | 72 | 0.195303 | 2 | 809 | 158 | 234 | 1 | ||
sam1.bam | 78 | 0.004425 | 2 | 452 | 2 | 7 | 1 | ||
sam1.bam | 92 | 0.222500 | 3 | 400 | 89 | 3 | 1 | ||
sam1.bam | 93 | 0.453826 | 2 | 758 | 344 | 140 | 1 | ||
sam1.bam | 76 | 0.000000 | 2 | 1090 | 0 | 377 | 1 | ||
sam1.bam | 77 | 1.000000 | 1 | 371 | 371 | 0 | 1 | ||
sam2.bam | 72 | 0.000000 | 2 | 1121 | 0 | 0 | 1 | ||
sam2.bam | 78 | 0.000000 | 2 | 255 | 0 | 52 | 1 | ||
sam2.bam | 92 | 0.134259 | 3 | 432 | 58 | 3 | 1 | ||
sam2.bam | 93 | 0.148225 | 2 | 958 | 142 | 80 | 1 | ||
sam2.bam | 76 | 0.000000 | 2 | 1005 | 0 | 0 | 1 | ||
sam2.bam | 77 | 1.000000 | 1 | 1615 | 1615 | 0 | 1 |
We recommend using bioconda software repositories for easy installation. You can find instruction to setup your bioconda environment at the following address:
In those instructions, please follow carefully the section 2. Set up channels.
If you use V-pipe’s quick_install.sh
, it will set up an environment that you can activate, e.g.:
bash quick_install.sh -b sars-cov2 -p testing -w work
. ./testing/miniconda3/bin/activate
cojac and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.
You can install cojac in its own environment and activate it:
conda create -n cojac cojac
conda activate cojac
# test it
cooc-mutbamscan --help
And to update it to the latest version, run:
# activate the environment if not already active:
conda activate cojac
conda update cojac
Or you can add it to the current environment (e.g.: in environment base):
conda install cojac
If you want to install the software yourself, you can see the list of dependencies in conda_cojac_env.yaml
.
We recommend using conda to install them:
conda env create -f conda_cojac_env.yaml
conda activate cojac
# now run from the cojac directory
./cooc-mutbamscan --help
cojac itself doesn't have a specific installer (yet) but you can copy its executables in your PATH (so you can call them without specifying their location), e.g.: into the conda environment:
# activate the environment if not already active:
conda activate cojac
cp cooc-* ${CONDA_PREFIX}/bin/
cooc-mutbamscan --help
You can remove the conda environment if you don't need it any more:
# exit the cojac environment first:
conda deactivate
conda env remove -n cojac
The subdirectory notebooks/
contains Jupyter and Rstudio notebooks used in the publication.
-
bioconda package -
further jupyter and rstudio code from the publication -
Move hard-coded amplicons to BED input file -
Move hard-coded mutations to YAML configuration - Refactor code into proper Python package
Long term goal:
- Integration as part of V-pipe
- Integration with ShoRAH amplicons
If you use this software in your research, please cite:
-
Katharina Jahn, David Dreifuss, Ivan Topolsky, Anina Kull, Pravin Ganesanandamoorthy, Xavier Fernandez-Cassi, Carola Bänziger, Elyse Stachler, Lara Fuhrmann, Kim Philipp Jablonski, Chaoran Chen, Catharine Aquino, Tanja Stadler, Christoph Ort, Tamar Kohn, Timothy R. Julian, Niko Beerenwinkel
"Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples."
medRxiv 2021.01.08.21249379; doi:10.1101/2021.01.08.21249379
If you experience problems running the software:
- We encourage to use the issue tracker on GitHub
- For further enquiries, you can also contact the V-pipe Dev Team
- You can contact the publication’s corresponding author