Name	Name	Last commit message	Last commit date
Latest commit History 69 Commits
images	images
notebooks	notebooks
voc	voc
.gitignore	.gitignore
LICENSE.md	LICENSE.md
README.md	README.md
SARS-CoV-2.insert.V4.txt	SARS-CoV-2.insert.V4.txt
conda_cojac_env.yaml	conda_cojac_env.yaml
cooc-colourmut	cooc-colourmut
cooc-curate	cooc-curate
cooc-mutbamscan	cooc-mutbamscan
cooc-pubmut	cooc-pubmut
cooc-tabmut	cooc-tabmut
nCoV-2019.insert.V3.bed	nCoV-2019.insert.V3.bed
phe2cojac	phe2cojac

COJAC - CoOccurrence adJusted Analysis and Calling

Description

The cojac package comprises a set of command-line tools to analyse co-occurrence of mutations on amplicons. It is useful, for example, for early detection of viral variants of concern (e.g. Alpha, Delta, Omicron) in environmental samples, and has been designed to scan for multiple SARS-CoV-2 variants in wastewater samples, as analyzed jointly by ETH Zurich, EPFL and Eawag. Learn more about this project on its Dashboard, amplicon cooccurrences measured with cojac are visualized on the heatmaps available on per-station or per-variant subpages displayed on CoV-Spectrum.

The analysis requires the whole amplicon to be covered by sequencing read pairs. It currently works at the level of aligned reads, but we plan to be able to adjust confidence scores based on local (window) haplotypes (as generated, e.g., by ShoRAH, doi:10.1186/1471-2105-12-119).

Usage

Here are the available command-line tools:

command	purpose
`cooc-mutbamscan`	scan an alignment BAM/CRAM/SAM file for mutation co-occurrences and output a JSON or YAML file
`cooc-colourmut`	display a JSON or YAML file as a coloured output on the terminal
`cooc-pubmut`	render a JSON or YAML file to a table as in the publication
`cooc-tabmut`	export a JSON or YAML file as a CSV/TSV table for downstream analysis (e.g.: RStudio)
`cooc-curate`	an (experimental) tool to assist evaluating the quality of variant definitions by looking at mutations' or cooccurences' frequencies from CoV-Spectrum
`phe2cojac`	a tool to generate new variant definition YAMLs for cojac using YMLs available at PHE Genomic's Standardised Variant Definitions

Use option -h / --help to see available command-line options:

$ cooc-mutbamscan --help
usage: cooc-mutbamscan [-h] [-s TSV | -a BAM/CRAM [BAM/CRAM ...]] [-/ [BATCHNAME]] [-p PATH] [-r REFID] [-m DIR] [-b BED] [-# COOC] [-Q YAML | -A YAML] [-j JSON] [-y YAML] [-t TSV] [-d]

scan amplicon (covered by long read pairs) for mutation cooccurrence

options:
  -h, --help            show this help message and exit
  -s TSV, --samples TSV
                        V-pipe samples list tsv
  -a BAM/CRAM [BAM/CRAM ...], --alignments BAM/CRAM [BAM/CRAM ...]
                        alignment files
  -/ [BATCHNAME], --batchname [BATCHNAME]
                        concatenate samplename/batchname from samples tsv
  -p PATH, --prefix PATH
                        V-pipe work directory prefix for where to look at align files when using TSV samples list
  -r REFID, --reference REFID
                        reference to look for in alignment files
  -m DIR, --vocdir DIR  directory containing the yamls defining the variant of concerns
  -b BED, --bedfile BED
                        bedfile defining the amplicons, with format: ref\tstart\tstop\tamp_num\tpool\tstrand
  -# COOC, --cooc COOC  minimum number of cooccurences to search for
  -Q YAML, --in-amp YAML, --amplicons YAML
                        use the supplied YAML file to query amplicons instead of building it from BED + voc's DIR
  -A YAML, --out-amp YAML, --out-amplicons YAML
                        output amplicon query in a YAML file
  -j JSON, --json JSON  output results to as JSON file
  -y YAML, --yaml YAML  output results to as yaml file
  -t TSV, --tsv TSV     output results to as (raw) tsv file
  -d, --dump            dump the python object to the terminal

@listfile can be used to pass a long list of parameters (e.g.: a large number of BAMs) in a file instead of command line

$ cooc-colourmut --help
usage: cooc-colourmut [-h] -a YAML (-j JSON | -y YAML)

print coloured pretty table on terminal

options:
  -h, --help            show this help message and exit
  -a YAML, --amplicons YAML
                        list of query amplicons, from mutbamscan
  -j JSON, --json JSON  results generated by mutbamscan
  -y YAML, --yaml YAML  results generated by mutbamscan

see cooc-pubmut for a CSV file that can be imported into an article

$ cooc-pubmut --help
usage: cooc-pubmut [-h] [-m DIR] [-a YAML] (-j JSON | -y YAML) [-o CSV] [-e | -x] [-/ [BATCHNAME]] [-q]

make a pretty table

options:
  -h, --help            show this help message and exit
  -m DIR, --vocdir DIR  directory containing the yamls defining the variant of concerns
  -a YAML, --amplicons YAML
                        list of query amplicons, from mutbamscan
  -j JSON, --json JSON  results generated by mutbamscan
  -y YAML, --yaml YAML  results generated by mutbamscan
  -o CSV, --output CSV  name of (pretty) csv file to save the table into
  -e, --escape          use escape characters for newlines
  -x, --excel           use a semi-colon ';' instead of a comma ',' in the comma-separated-files as required by Microsoft Excel
  -/ [BATCHNAME], --batchname [BATCHNAME]
                        split samplename/batchname (as in samples tsv)
  -q, --quiet           Run quietly: do not print the table

you need to open the CSV in a spreadsheet that understands linebreaks

$ cooc-tabmut --help
usage: cooc-tabmut [-h] (-j JSON | -y YAML) [-/ [BATCHNAME]] [-o CSV] [-l] [-x] [-m] [-q]

make a table suitable for further processing: RStudio, etc.

options:
  -h, --help            show this help message and exit
  -j JSON, --json JSON  results generated by mutbamscan
  -y YAML, --yaml YAML  results generated by mutbamscan
  -/ [BATCHNAME], --batchname [BATCHNAME]
                        split samplename/batchname (as in samples tsv)
  -o CSV, --output CSV  name of (pretty) csv file to save the table into
  -l, --lines           Line-oriented table alternative
  -x, --excel           use a semi-colon ';' instead of a comma ',' in the comma-separated-files as required by Microsoft Excel
  -m, --multiindex      Use multi-level indexing (amplicons and counts categories)
  -q, --quiet           Run quietly: do not print the table

$ cooc-curate --help
usage: cooc-curate [-h] [-a YAML] [-m] [-H HIGH] [-l LOW] [--collapse] [--no-collapse] [-c] [-C] VOC [VOC ...]

helps determining specific mutations and cooccurences by querying CoV-Spectrum

positional arguments:
  VOC                   per VOC description YAML file(s)

options:
  -h, --help            show this help message and exit
  -a YAML, --amp YAML, --amplicons YAML
                        use the YAML file generated by mutbamscan to query amplicons instead of mutations
  -m, --mut, --mutations
                        always do mutations (even if amplicons YAML provided)
  -H HIGH, --high HIGH  Fraction above which a mutation must be found among seeked lineages
  -l LOW, --low LOW     Fraction under which a mutation must be found among other lineages
  --collapse            combine counts of all sublineages together and consider a signle value that corresponds to a lineages family (e.g.: count all B.1.612.2* together). This is especially useful for assessing signature of old
                        variants that have branched out by now.
  --no-collapse         consider each sublineage separately (e.g.: count separately BA.1, BA.1.1, BA.2, BA.3, etc. instead of counting B.1.529*)
  -c, --ansi, --colour  use coloured output
  -C, --no-ansi, --no-colour
                        use text

This tool queries LAPIS, see https://lapis.cov-spectrum.org/swagger/ and https://lapis.cov-spectrum.org/

$ phe2cojac  --help
usage: phe2cojac [-h] [-s SHRT] [-y [OUT_YAML]] IN_YAML

convert phe-genomics to cojac's dedicated variant YAML format

positional arguments:
  IN_YAML               phe-genomics variant YAML input file

options:
  -h, --help            show this help message and exit
  -s SHRT, --shortname SHRT
                        shortname to use (otherwise auto-build one based on phe-genomic's unique id)
  -y [OUT_YAML], --yaml [OUT_YAML]
                        write cojac variant to a YAML file instead of printing (if empty, build filename from shortname)

Howto

Input data requirements

Analysis needs to be performed on SARS-CoV-2 samples sequenced using a tiled multiplexed PCRs protocol for which you need a BED (Browser Extensible Data) file describing the amplified regions, and sequenced with read settings that covers the totality of an amplicon.

We provide BED files for the following examples:

nCoV-2019.insert.V3.bed for ARTIC V3
SARS-CoV-2.insert.V4.txt for ARTIC V4

These protocols produces ~400bp long amplicons, and thus needs to be sequenced with, e.g., paired end sequencing with read length 250.

Select the desired bedfile using the -b / --bedfile option.

Note:

this analysis method cannot work on read length much shorter than the amplicons (e.g.: it will not give reliable results for a read-length of 50).

to use different protocols (e.g. Nimagen), you need to provide a BED file describing the amplicons. Its columns "start" and "stop" are mandatory

Analysis will use variants description YAML that list mutation to be searched.

We provide several examples in the directory voc/.

Select a directory containing a collection of virus definitions YAMLs using the -m / --vocdir option.

Note:

you can create new YAML files if you need to look for new variants of concern.

e.g. it is possible to automatically generate YAMLs for cojac from PHE Genomic's Standardised Variant Definitions:

# fetch the repository of standardised variant definitions
git clone https://github.com/phe-genomics/variant_definitions.git
# generate a YAML for omicron subvariant BA.2 using the corresponding standardised variant definitions
phe2cojac --shortname 'om2' --yaml voc/omicron_ba2_mutations.yaml variant_definitions/variant_yaml/imagines-viewable.yml
# now have a look at the frequencies of mutations using CoV-Spectrum
cooc-curate voc/omicron_ba2_mutations.yaml
# adjust the content of the YAML files to your needs

Collect the co-occurrence data

There are currently two modes to collect the data about co-occurring mutations in reads: analysing stand-alone BAM/CRAM/SAM alignment files, or analysing the output of a cohort analysed with V-pipe (doi:10.1093/bioinformatics/btab015).

Standalone files

Provide a list of BAM files using the -a / --alignment option. Run:

cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -a sam1.bam sam2.bam -j cooc-test.json

Note: you can also use the -y / --yaml option to write to a YAML file instead of a JSON.

Analyzing a cohort with V-pipe

You can learn how to analyse fastq.gz files with V-pipe with this tutorial:

https://cbg-ethz.github.io/V-pipe/tutorial/sars-cov2/ (video tutorial)

Run:

cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -t work/samples.tsv -p work/samples/ -j cooc-test.json

Number of cooccurences

By default cooc-mutbamscan will look for cooccurrences of at least 2 mutations on the same amplicon. You can change that number using option -#/--cooc:

you can increase it to e.g.: 3 if the variants you study requires more stringent identification
you can set it to 1, to also count isolated occurrences - in this case cooc-mutbamscan will also double as a generic (non coorcurrence-aware) variant caller, so you can get all counts with a single tool.

Store the amplicon query

Using the -A / --out-amp / --out-amplicons option, it is possible to store the exact request that was used to analyze samples. You can then re-use the exact same request using the -Q / --in-amp / --amplicons option, or pass it to a visualisation tool.

# store the request in a YAML file
cooc-mutbamscan -b nCoV-2019.insert.V3.bed -m voc/ -A amplicons.v3.yaml
# adjust the content of amplicons.v3.yaml

# now have a look at the frequencies of mutation cooccurences using CoV-Spectrum
cooc-curate -a amplicons.v3.yaml voc/omicron_ba2_mutations.yaml voc/omicron_ba1_mutations.yaml voc/delta_mutations.yaml
# reuse the amplicon
cooc-mutbamscan -Q amplicons.v3.yaml -a sam1.bam sam2.bam -j cooc-test.json

Display data on terminal

The default -d / --dump option of cooc-mutbamscan is not a very user-friendly experience to display the data. You can instead pass a JSON or YAML file to the display script. Run:

cooc-colourmut -a amplicons.v3.yaml -j cooc-test.json

Notes:

passing the -a / --amplicons parameter is currenlty mandatory, see section Store the amplicon query above

Render table for publication

And now, let’s go beyond our terminal and produce a table that can be included in a publication (see bibliography below for concrete example). Run:

cooc-pubmut -m voc/ -a amplicons.v3.yaml -j cooc-test.json -o cooc-output.tsv

Note:

if provided options -m / --vocdir and -a / --amplicons can help generate human-friendly headers (Amplicon 88, 26277-26635) in the table instead of short names (88_om)

you can also output to comma-separated table (-o cooc-output.csv)

Microsoft Excel requires using option -x/--excel (using semi-colon instead of comma in comma-separated-value files). Some versions can also open TSV (but not the Office 365 web app).

You need to open the table with a spread-sheet that can understand line breaks, such as LibreOffice Calc, Google Docs Spreadsheet or, using special options (see above), Microsoft Excel.

	72_al	78_al	92_al	93_al	76_be	77_d614g
sam1.bam	158 / 809 19.53%	2 / 452 0.44%	89 / 400 22.25%	344 / 758 45.38%	0 / 1090 0.00%	371 / 371 100.00%
sam2.bam	0 / 1121 0.00%	0 / 255 0.00%	58 / 432 13.43%	142 / 958 14.82%	0 / 1005 0.00%	1615 / 1615 100.00%

It is also possible to use the software pandoc to further convert the CSV to other formats. Run:

cooc-pubmut -j cooc-test.json -o cooc-output.csv
pandoc cooc-output.csv -o cooc-output.pdf
pandoc cooc-output.csv -o cooc-output.html
pandoc cooc-output.csv -o cooc-output.md

Export table for downstream analysis

If you want to further analyse the data (e.g.: with RStudio), it's also possible to export the data into a more machine-readable CSV/TSV table. Run:

./cooc-pubmut -j cooc-test.json -o cooc-export.csv

You can try importing the resulting CSV in you favourite tool.

	A72_al.count	A72_al.mut_all	A72_al.mut_oneless	A72_al.frac	A72_al.cooc	A78_al.count	A78_al.mut_all	A78_al.mut_oneless	A78_al.frac	A78_al.cooc	...
sam1.bam	809	158	234	0.195303	2	452	2	7	0.004425	2	...
sam2.bam	1121	0	0	0.000000	2	255	0	52	0.000000	2	...

The columns are tagged as following:

count: total count of amplicons carrying the sites of interest
mut_all: amplicons carrying mutations on all site of interest (e.g.: variant mutations observed on all sites)
mut_oneless: amplicons where one mutation is missing (e.g.: only 2 out of 3 sites carried the variant mutation, 1 sites carries wild-type)
frac: fraction (mut_all/count) or empty if no counts
cooc: number of considered site (e.g.: 2 sites of interests) or empty if no counts

If your tool supports multi-level indexing, use the -m/--multiindex option. The resulting table will be bilevel indexed: the first level is the amplicon, the second is the category.

	A72_al					A78_al
	count	mut_all	mut_oneless	frac	cooc	count	mut_all	mut_oneless	frac	cooc
sam1.bam	809	158	234	0.195303	2	452	2	7	0.004425	2
sam2.bam	1121	0	0	0.0	2	255	0	52	0.0	2

Another different table orientation is provided by -l/--lines:

sample	amplicon	frac	cooc	count	mut_all	mut_oneless	al	be	d614g
sam1.bam	72	0.195303	2	809	158	234	1
sam1.bam	78	0.004425	2	452	2	7	1
sam1.bam	92	0.222500	3	400	89	3	1
sam1.bam	93	0.453826	2	758	344	140	1
sam1.bam	76	0.000000	2	1090	0	377		1
sam1.bam	77	1.000000	1	371	371	0			1
sam2.bam	72	0.000000	2	1121	0	0	1
sam2.bam	78	0.000000	2	255	0	52	1
sam2.bam	92	0.134259	3	432	58	3	1
sam2.bam	93	0.148225	2	958	142	80	1
sam2.bam	76	0.000000	2	1005	0	0		1
sam2.bam	77	1.000000	1	1615	1615	0			1

Installation

We recommend using bioconda software repositories for easy installation. You can find instruction to setup your bioconda environment at the following address:

https://bioconda.github.io/user/install.html

In those instructions, please follow carefully the section 2. Set up channels.

If you use V-pipe’s quick_install.sh, it will set up an environment that you can activate, e.g.:

bash quick_install.sh -b sars-cov2 -p testing -w work
. ./testing/miniconda3/bin/activate

Prebuilt package

cojac and its dependencies are all available in the bioconda repository. We strongly advise you to install this pre-built package for a hassle-free experience.

You can install cojac in its own environment and activate it:

conda create -n cojac cojac
conda activate cojac
# test it
cooc-mutbamscan --help

And to update it to the latest version, run:

# activate the environment if not already active:
conda activate cojac
conda update cojac

Or you can add it to the current environment (e.g.: in environment base):

conda install cojac

Dependencies

If you want to install the software yourself, you can see the list of dependencies in conda_cojac_env.yaml.

We recommend using conda to install them:

conda env create -f conda_cojac_env.yaml
conda activate cojac
# now run from the cojac directory
./cooc-mutbamscan --help

cojac itself doesn't have a specific installer (yet) but you can copy its executables in your PATH (so you can call them without specifying their location), e.g.: into the conda environment:

# activate the environment if not already active:
conda activate cojac
cp cooc-* ${CONDA_PREFIX}/bin/
cooc-mutbamscan --help

Remove conda environment

You can remove the conda environment if you don't need it any more:

# exit the cojac environment first:
conda deactivate
conda env remove -n cojac

Additional notebooks

The subdirectory notebooks/ contains Jupyter and Rstudio notebooks used in the publication.

Upcoming features

~~bioconda package~~
~~further jupyter and rstudio code from the publication~~
~~Move hard-coded amplicons to BED input file~~
~~Move hard-coded mutations to YAML configuration~~
Refactor code into proper Python package

Long term goal:

Integration as part of V-pipe
Integration with ShoRAH amplicons

Contributions

Package developers:

David Dreifuss ,
Ivan Topolsky ,

Additional notebooks:

Corresponding author:

Niko Beerenwinkel

Citation

If you use this software in your research, please cite:

Katharina Jahn, David Dreifuss, Ivan Topolsky, Anina Kull, Pravin Ganesanandamoorthy, Xavier Fernandez-Cassi, Carola Bänziger, Elyse Stachler, Lara Fuhrmann, Kim Philipp Jablonski, Chaoran Chen, Catharine Aquino, Tanja Stadler, Christoph Ort, Tamar Kohn, Timothy R. Julian, Niko Beerenwinkel

"Detection of SARS-CoV-2 variants in Switzerland by genomic analysis of wastewater samples."

medRxiv 2021.01.08.21249379; doi:10.1101/2021.01.08.21249379

Contacts

If you experience problems running the software:

We encourage to use the issue tracker on GitHub
For further enquiries, you can also contact the V-pipe Dev Team
You can contact the publication’s corresponding author

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

COJAC - CoOccurrence adJusted Analysis and Calling

Description

Usage

Howto

Input data requirements

Collect the co-occurrence data

Standalone files

Analyzing a cohort with V-pipe

Number of cooccurences

Store the amplicon query

Display data on terminal

Render table for publication

Export table for downstream analysis

Installation

Prebuilt package

Dependencies

Remove conda environment

Additional notebooks

Upcoming features

Contributions

Package developers:

Additional notebooks:

Corresponding author:

Citation

Contacts

About

Releases 4

Packages

Contributors 7

Languages

License

cbg-ethz/cojac

Folders and files

Latest commit

History

Repository files navigation

COJAC - CoOccurrence adJusted Analysis and Calling

Description

Usage

Howto

Input data requirements

Collect the co-occurrence data

Standalone files

Analyzing a cohort with V-pipe

Number of cooccurences

Store the amplicon query

Display data on terminal

Render table for publication

Export table for downstream analysis

Installation

Prebuilt package

Dependencies

Remove conda environment

Additional notebooks

Upcoming features

Contributions

Package developers:

Additional notebooks:

Corresponding author:

Citation

Contacts

About

Resources

License

Stars

Watchers

Forks

Releases 4

Packages 0

Contributors 7

Languages

Packages