Skip to content

cbg-ethz/cojac

Repository files navigation

COJAC - CoOccurrence adJusted Analysis and Calling

Bioconda package Docker container

Description

The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. It is useful, for example, for early detection of viral variants of concern (e.g. Alpha, Delta, Omicron) in environmental samples, and has been designed to scan for multiple SARS-CoV-2 variants in wastewater samples, as analyzed jointly by ETH Zurich, EPFL and Eawag. Learn more about this project on its Dashboard, amplicon cooccurrences measured with cojac are visualized on the heatmaps available on per-station or per-variant subpages displayed on CoV-Spectrum.

The analysis requires the whole amplicon to be covered by sequencing read pairs. It currently works at the level of aligned reads, but we plan to be able to adjust confidence scores based on local (window) haplotypes (as generated, e.g., by ShoRAH, doi:10.1186/1471-2105-12-119).

Usage

Here are the available command-line tools:

command purpose
cooc-mutbamscan scan an alignment BAM/CRAM/SAM file for mutation co-occurrences and output a JSON or YAML file
cooc-colourmut display a JSON or YAML file as a coloured output on the terminal
cooc-pubmut render a JSON or YAML file to a table as in the publication
cooc-tabmut export a JSON or YAML file as a CSV/TSV table for downstream analysis (e.g.: RStudio)
cooc-curate an (experimental) tool to assist evaluating the quality of variant definitions by looking at mutations' or cooccurences' frequencies from CoV-Spectrum
phe2cojac a tool to generate new variant definition YAMLs for cojac using YMLs available at PHE Genomic's Standardised Variant Definitions

Use option -h / --help to see available command-line options:

$ cooc-mutbamscan --help
usage: cooc-mutbamscan [-h] [-s TSV | -a BAM/CRAM [BAM/CRAM ...]] [-/ [BATCHNAME]] [-p PATH] [-r REFID] [-m DIR] [-b BED] [-# COOC] [-Q YAML | -A YAML] [-j JSON] [-y YAML] [-t TSV] [-d]

scan amplicon (covered by long read pairs) for mutation cooccurrence

options:
  -h, --help            show this help message and exit
  -s TSV, --samples TSV
                        V-pipe samples list tsv
  -a BAM/CRAM [BAM/CRAM ...], --alignments BAM/CRAM [BAM/CRAM ...]
                        alignment files
  -/ [BATCHNAME], --batchname [BATCHNAME]
                        concatenate samplename/batchname from samples tsv
  -p PATH, --prefix PATH
                        V-pipe work directory prefix for where to look at align files when using TSV samples list
  -r REFID, --reference REFID
                        reference to look for in alignment files
  -m DIR, --vocdir DIR  directory containing the yamls defining the variant of concerns
  -b BED, --bedfile BED
                        bedfile defining the amplicons, with format: ref\tstart\tstop\tamp_num\tpool\tstrand
  -# COOC, --cooc COOC  minimum number of cooccurences to search for
  -Q YAML, --in-amp YAML, --amplicons YAML
                        use the supplied YAML file to query amplicons instead of building it from BED + voc's DIR
  -A YAML, --out-amp YAML, --out-amplicons YAML
                        output amplicon query in a YAML file
  -j JSON, --json JSON  output results to as JSON file
  -y YAML, --yaml YAML  output results to as yaml file
  -t TSV, --tsv TSV     output results to as (raw) tsv file
  -d, --dump            dump the python object to the terminal

@listfile can be used to pass a long list of parameters (e.g.: a large number of BAMs) in a file instead of command line
$ cooc-colourmut --help
usage: cooc-colourmut [-h] -a YAML (-j JSON | -y YAML)

print coloured pretty table on terminal

options:
  -h, --help            show this help message and exit
  -a YAML, --amplicons YAML
                        list of query amplicons, from mutbamscan
  -j JSON, --json JSON  results generated by mutbamscan
  -y YAML, --yaml YAML  results generated by mutbamscan

see cooc-pubmut for a CSV file that can be imported into an article
$ cooc-pubmut --help
usage: cooc-pubmut [-h] [-m DIR] [-a YAML] (-j JSON | -y YAML) [-o CSV] [-e | -x] [-/ [BATCHNAME]] [-q]

make a pretty table

options:
  -h, --help            show this help message and exit
  -m DIR, --vocdir DIR  directory containing the yamls defining the variant of concerns
  -a YAML, --amplicons YAML
                        list of query amplicons, from mutbamscan
  -j JSON, --json JSON  results generated by mutbamscan
  -y YAML, --yaml YAML  results generated by mutbamscan
  -o CSV, --output CSV  name of (pretty) csv file to save the table into
  -e, --escape          use escape characters for newlines
  -x, --excel           use a semi-colon ';' instead of a comma ',' in the comma-separated-files as required by Microsoft Excel
  -/ [BATCHNAME], --batchname [BATCHNAME]
                        split samplename/batchname (as in samples tsv)
  -q, --quiet           Run quietly: do not print the table

you need to open the CSV in a spreadsheet that understands linebreaks
$ cooc-tabmut --help
usage: cooc-tabmut [-h] (-j JSON | -y YAML) [-/ [BATCHNAME]] [-o CSV] [-l] [-x] [-m] [-q]

make a table suitable for further processing: RStudio, etc.

options:
  -h, --help            show this help message and exit
  -j JSON, --json JSON  results generated by mutbamscan
  -y YAML, --yaml YAML  results generated by mutbamscan
  -/ [BATCHNAME], --batchname [BATCHNAME]
                        split samplename/batchname (as in samples tsv)
  -o CSV, --output CSV  name of (pretty) csv file to save the table into
  -l, --lines           Line-oriented table alternative
  -x, --excel           use a semi-colon ';' instead of a comma ',' in the comma-separated-files as required by Microsoft Excel
  -m, --multiindex      Use multi-level indexing (amplicons and counts categories)
  -q, --quiet           Run quietly: do not print the table
$ cooc-curate --help
usage: cooc-curate [-h] [-a YAML] [-m] [-H HIGH] [-l LOW] [--collapse] [--no-collapse] [-c] [-C] VOC [VOC ...]

helps determining specific mutations and cooccurences by querying CoV-Spectrum

positional arguments:
  VOC                   per VOC description YAML file(s)

options:
  -h, --help            show this help message and exit
  -a YAML, --amp YAML, --amplicons YAML
                        use the YAML file generated by mutbamscan to query amplicons instead of mutations
  -m, --mut, --mutations
                        always do mutations (even if amplicons YAML provided)
  -H HIGH, --high HIGH  Fraction above which a mutation must be found among seeked lineages
  -l LOW, --low LOW     Fraction under which a mutation must be found among other lineages
  --collapse            combine counts of all sublineages together and consider a signle value that corresponds to a lineages family (e.g.: count all B.1.612.2* together). This is especially useful for assessing signature of old
                        variants that have branched out by now.
  --no-collapse         consider each sublineage separately (e.g.: count separately BA.1, BA.1.1, BA.2, BA.3, etc. instead of counting B.1.529*)
  -c, --ansi, --colour  use coloured output
  -C, --no-ansi, --no-colour
                        use text

This tool queries LAPIS, see https://lapis.cov-spectrum.org/swagger/ and https://lapis.cov-spectrum.org/
$ phe2cojac  --help
usage: phe2cojac [-h] [-s SHRT] [-y [OUT_YAML]] IN_YAML

convert phe-genomics to cojac's dedicated variant YAML format

positional arguments:
  IN_YAML               phe-genomics variant YAML input file

options:
  -h, --help            show this help message and exit
  -s SHRT, --shortname SHRT
                        shortname to use (otherwise auto-build one based on phe-genomic's unique id)
  -y [OUT_YAML], --yaml [OUT_YAML]
                        write cojac variant to a YAML file instead of printing (if empty, build filename from shortname)

Howto

Input data requirements

Analysis needs to be performed on SARS-CoV-2 samples sequenced using a tiled multiplexed PCRs protocol for which you need a BED (Browser Extensible Data) file describing the amplified regions, and sequenced with read settings that covers the totality of an amplicon.

We provide BED files for the following examples:

These protocols produces ~400bp long amplicons, and thus needs to be sequenced with, e.g., paired end sequencing with read length 250.

Select the desired bedfile using the -b / --bedfile option.

Note:

  • this analysis method cannot work on read length much shorter than the amplicons (e.g.: it will not give reliable results for a read-length of 50).
  • to use different protocols (e.g. Nimagen), you need to provide a BED file describing the amplicons. Its columns "start" and "stop" are mandatory

Analysis will use variants description YAML that list mutation to be searched.

We provide several examples in the directory voc/.

Select a directory containing a collection of virus definitions YAMLs using the -m / --vocdir option.

Note:

# fetch the repository of standardised variant definitions
git clone https://github.com/phe-genomics/variant_definitions.git
# generate a YAML for omicron subvariant BA.2 using the corresponding standardised variant definitions
phe2cojac --shortname 'om2' --yaml voc/omicron_ba2_mutations.yaml variant_definitions/variant_yaml/imagines-viewable.yml
# now have a look at the frequencies of mutations using CoV-Spectrum
cooc-curate voc/omicron_ba2_mutations.yaml
# adjust the content of the YAML files to your needs

Collect the co-occurrence data

There are currently two modes to collect the data about co-occurring mutations in reads: analysing stand-alone BAM/CRAM/SAM alignment files, or analysing the output of a cohort analysed with V-pipe (doi:10.1093/bioinformatics/btab015).

Standalone files

Provide a list of BAM files using the -a / --alignment option. Run:

cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -a sam1.bam sam2.bam -j cooc-test.json

Note: you can also use the -y / --yaml option to write to a YAML file instead of a JSON.

Analyzing a cohort with V-pipe

You can learn how to analyse fastq.gz files with V-pipe with this tutorial:

Run:

cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -t work/samples.tsv -p work/samples/ -j cooc-test.json

Number of cooccurences

By default cooc-mutbamscan will look for cooccurrences of at least 2 mutations on the same amplicon. You can change that number using option -#/--cooc:

  • you can increase it to e.g.: 3 if the variants you study requires more stringent identification
  • you can set it to 1, to also count isolated occurrences - in this case cooc-mutbamscan will also double as a generic (non coorcurrence-aware) variant caller, so you can get all counts with a single tool.

Store the amplicon query

Using the -A / --out-amp / --out-amplicons option, it is possible to store the exact request that was used to analyze samples. You can then re-use the exact same request using the -Q / --in-amp / --amplicons option, or pass it to a visualisation tool.

# store the request in a YAML file
cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -A amplicons.v3.yaml
# adjust the content of amplicons.v3.yaml

# now have a look at the frequencies of mutation cooccurences using CoV-Spectrum
cooc-curate -a amplicons.v3.yaml voc/omicron_ba2_mutations.yaml voc/omicron_ba1_mutations.yaml voc/delta_mutations.yaml
# reuse the amplicon
cooc-mutbamscan -Q amplicons.v3.yaml -a sam1.bam sam2.bam -j cooc-test.json

Display data on terminal

The default -d / --dump option of cooc-mutbamscan is not a very user-friendly experience to display the data. You can instead pass a JSON or YAML file to the display script. Run:

cooc-colourmut -a amplicons.v3.yaml -j cooc-test.json

terminal screen shot

Notes:

Render table for publication

And now, let’s go beyond our terminal and produce a table that can be included in a publication (see bibliography below for concrete example). Run:

cooc-pubmut -m voc/ -a amplicons.v3.yaml -j cooc-test.json -o cooc-output.tsv

Note:

  • if provided options -m / --vocdir and -a / --amplicons can help generate human-friendly headers (Amplicon 88, 26277-26635) in the table instead of short names (88_om)
  • you can also output to comma-separated table (-o cooc-output.csv)
  • Microsoft Excel requires using option -x/--excel (using semi-colon instead of comma in comma-separated-value files). Some versions can also open TSV (but not the Office 365 web app).

You need to open the table with a spread-sheet that can understand line breaks, such as LibreOffice Calc, Google Docs Spreadsheet or, using special options (see above), Microsoft Excel.

72_al 78_al 92_al 93_al 76_be 77_d614g
sam1.bam 158 / 809
19.53%
2 / 452
0.44%
89 / 400
22.25%
344 / 758
45.38%
0 / 1090
0.00%
371 / 371
100.00%
sam2.bam 0 / 1121
0.00%
0 / 255
0.00%
58 / 432
13.43%
142 / 958
14.82%
0 / 1005
0.00%
1615 / 1615
100.00%

It is also possible to use the software pandoc to further convert the CSV to other formats. Run:

cooc-pubmut -j cooc-test.json -o cooc-output.csv
pandoc cooc-output.csv -o cooc-output.pdf
pandoc cooc-output.csv -o cooc-output.html
pandoc cooc-output.csv -o cooc-output.md

Export table for downstream analysis

If you want to further analyse the data (e.g.: with RStudio), it's also possible to export the data into a more machine-readable CSV/TSV table. Run:

./cooc-pubmut -j cooc-test.json -o cooc-export.csv

You can try importing the resulting CSV in you favourite tool.

A72_al.count A72_al.mut_all A72_al.mut_oneless A72_al.frac A72_al.cooc A78_al.count A78_al.mut_all A78_al.mut_oneless A78_al.frac A78_al.cooc ...
sam1.bam 809 158 234 0.195303 2 452 2 7 0.004425 2 ...
sam2.bam 1121 0 0 0.000000 2 255 0 52 0.000000 2 ...

The columns are tagged as following:

  • count: total count of amplicons carrying the sites of interest
  • mut_all: amplicons carrying mutations on all site of interest (e.g.: variant mutations observed on all sites)
  • mut_oneless: amplicons where one mutation is missing (e.g.: only 2 out of 3 sites carried the variant mutation, 1 sites carries wild-type)
  • frac: fraction (mut_all/count) or empty if no counts
  • cooc: number of considered site (e.g.: 2 sites of interests) or empty if no counts

If your tool supports multi-level indexing, use the -m/--multiindex option. The resulting table will be bilevel indexed: the first level is the amplicon, the second is the category.

A72_alA78_al
countmut_allmut_onelessfraccooccountmut_allmut_onelessfraccooc
sam1.bam8091582340.1953032452270.0044252
sam2.bam1121000.022550520.02

Another different table orientation is provided by -l/--lines:

sample amplicon frac cooc count mut_all mut_oneless al be d614g
sam1.bam 72 0.195303 2 809 158 234 1
sam1.bam 78 0.004425 2 452 2 7 1
sam1.bam 92 0.222500 3 400 89 3 1
sam1.bam 93 0.453826 2 758 344 140 1
sam1.bam 76 0.000000 2 1090 0 377 1
sam1.bam 77 1.000000 1 371 371 0 1
sam2.bam 72 0.000000 2 1121 0 0 1
sam2.bam 78 0.000000 2 255 0 52 1
sam2.bam 92 0.134259 3 432 58 3 1
sam2.bam 93 0.148225 2 958 142 80 1
sam2.bam 76 0.000000 2 1005 0 0 1
sam2.bam 77 1.000000 1 1615 1615 0 1

Installation

We recommend using bioconda software repositories for easy installation. You can find instruction to setup your bioconda environment at the following address:

In those instructions, please follow carefully the section 2. Set up channels.

If you use V-pipe’s quick_install.sh, it will set up an environment that you can activate, e.g.:

bash quick_install.sh -b sars-cov2 -p testing -w work
. ./testing/miniconda3/bin/activate

Prebuilt package

cojac and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.

You can install cojac in its own environment and activate it:

conda create -n cojac cojac
conda activate cojac
# test it
cooc-mutbamscan --help

And to update it to the latest version, run:

# activate the environment if not already active:
conda activate cojac
conda update cojac

Or you can add it to the current environment (e.g.: in environment base):

conda install cojac

Dependencies

If you want to install the software yourself, you can see the list of dependencies in conda_cojac_env.yaml.

We recommend using conda to install them:

conda env create -f conda_cojac_env.yaml
conda activate cojac
# now run from the cojac directory
./cooc-mutbamscan --help

cojac itself doesn't have a specific installer (yet) but you can copy its executables in your PATH (so you can call them without specifying their location), e.g.: into the conda environment:

# activate the environment if not already active:
conda activate cojac
cp cooc-* ${CONDA_PREFIX}/bin/
cooc-mutbamscan --help

Remove conda environment

You can remove the conda environment if you don't need it any more:

# exit the cojac environment first:
conda deactivate
conda env remove -n cojac

Additional notebooks

The subdirectory notebooks/ contains Jupyter and Rstudio notebooks used in the publication.

Upcoming features

  • bioconda package
  • further jupyter and rstudio code from the publication
  • Move hard-coded amplicons to BED input file
  • Move hard-coded mutations to YAML configuration
  • Refactor code into proper Python package

Long term goal:

  • Integration as part of V-pipe
  • Integration with ShoRAH amplicons

Contributions

Package developers:

Additional notebooks:

Corresponding author:

Citation

If you use this software in your research, please cite:

  • Katharina Jahn, David Dreifuss, Ivan Topolsky, Anina Kull, Pravin Ganesanandamoorthy, Xavier Fernandez-Cassi, Carola Bänziger, Elyse Stachler, Lara Fuhrmann, Kim Philipp Jablonski, Chaoran Chen, Catharine Aquino, Tanja Stadler, Christoph Ort, Tamar Kohn, Timothy R. Julian, Niko Beerenwinkel

    "Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples."

    medRxiv 2021.01.08.21249379; doi:10.1101/2021.01.08.21249379

Contacts

If you experience problems running the software: