Application Examples & Commands

Rauf Salamzade edited this page Aug 30, 2023 · 3 revisions

Application Examples

1. Dereplication to select a manageable number of genomes for a single taxonomic group:

The primary reason we developed skDER was to select representative genomes for building databases of commonly studied bacterial genera, where a lot of redundancy exists in public databases (e.g. ~35k E. coli genomes in GTDB R214). These databases support our other software package, zol.

2. Dereplication to select reference genomes for metagenomic alignment/analysis:

A more common use of dereplication is to select representative genomes as references for metagenomic read alignment. This avoids partitioning reads across multiple similar genomes/MAGs, which can dilute signal or make it harder to track a species across multiple microbiomes.

The most common tool for this purpose is dRep by Olm et al. 2017. It uses a greedy approach that first groups somewhat similar genomes into primary clusters using Mash (very fast) and then uses other programs (e.g. FastANI, gANI, etc.) to calculate ANI more accurately between the genomes in each primary cluster, yielding a more granular secondary clustering. The authors also helpfully include other dependencies, such as CheckM, to estimate completeness and contamination for each genome.

We think skDER can similarly be used for this application, but without accounting for contamination (since we don't include CheckM as a dependency). To account for completeness, however, users can set an adjustable parameter for the difference in alignment fraction (AF) calculated for pairs of genomes that are X% ANI similar to one another. If the AF difference exceeds this parameter (default: 10%; e.g. 90% AF for one genome, 75% AF for the other), we automatically treat the genome with the higher AF value as redundant (e.g. the genome with the 90% AF). This approach can be severely affected by contaminated MAGs, so it may be worth filtering out such MAGs in advance, for example with CheckM.
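The alignment fraction rule described above can be illustrated with a minimal Python sketch. This is not skDER's actual code: the function name `redundant_genome` and its parameter names are hypothetical, the pair is assumed to have already passed the ANI cutoff, and returning `None` when the difference is within the threshold is an assumption, since that case is not covered here.

```python
from typing import Optional


def redundant_genome(af_a: float, af_b: float, max_af_diff: float = 10.0) -> Optional[str]:
    """Return "a" or "b" for the genome flagged as redundant, or None.

    af_a, af_b: alignment fraction (in %) of genome A and genome B for one
    pairwise comparison. Hypothetical helper for illustration only; it
    assumes the pair already meets the ANI cutoff.
    """
    if abs(af_a - af_b) <= max_af_diff:
        # Difference does not exceed the threshold: this rule makes no call.
        return None
    # The genome with the higher AF is mostly covered by the other genome,
    # so it is the one treated as redundant.
    return "a" if af_a > af_b else "b"


# Example from the text: 90% AF vs. 75% AF with the default 10% threshold.
print(redundant_genome(90.0, 75.0))
```

With the values from the text (90% vs. 75%), the 15-point difference exceeds the default threshold, so the genome with 90% AF is the one flagged.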

Example Usage Commands:

1. Input is a user-provided genome set in FASTA format:

skder -g Ecoli_genome_1.fna Ecoli_genome_2.fna Ecoli_genome_3.fna -o skDER_Results/ -c 10

2. Input is a genus/species ID from GTDB R214:

skder -t "Cutibacterium avidum" -o skDER_Results/ -c 10