Snakemake pipeline for characterizing type III CRISPR-Cas loci and related CorAs. Generates phylogenetic trees with annotations. Chi et al. 2023 (DOI: https://doi.org/10.1038/s41586-023-06620-5)
- You need to have Snakemake, Conda and HMMER installed. The pipeline installs its other dependencies automatically.
- Runs on Unix environments. Tested only on Ubuntu.
- Data: you need a local database of genomes. You can download the genomes with the NCBI Datasets command-line tool, using the command:
datasets download genome taxon 2 --annotated --assembly-level complete --include genome,protein,gff3
- May not work on Python 3.12 (tested with 3.9)
- Clone the repository
- HMM profiles and related protein alignments are provided in the msa_050523 folder. Run hmmpress (part of HMMER) on the HMM databases:
hmmpress effectors_050523.hmm
hmmpress all_cas10s.hmm
Then modify the paths (anything starting with "/media/volume/") to point to the pressed databases. Also modify the hmm_msa_folder variable in the script to point to the directory with the alignments.
- Point genomes_folder to the root directory of your downloaded genomes.
- Run using the following command, adjusting core count to your needs:
snakemake --snakefile cas10_prober.smk --use-conda --cores 40 --config protein_clustering="False" getGenomesBy="local" genome_mode="all" cas10_anchor="True" --rerun-triggers mtime
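Before launching the run, a quick preflight check can catch the most common setup problems described above. The following is only a sketch: the helper names missing_tools and leftover_paths are hypothetical and not part of the pipeline, and it assumes that the hard-coded "/media/volume/" paths live in cas10_prober.smk and that you run it from the repository root.

```shell
#!/usr/bin/env bash
# Preflight sketch; the helper names below are hypothetical, not part of the pipeline.

# Print each command from the argument list that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print the numbered lines of a file that still contain the placeholder path prefix.
leftover_paths() {
    grep -n "/media/volume/" "$1" || true
}

# Tools this README lists as prerequisites (hmmpress ships with HMMER).
missing=$(missing_tools snakemake conda hmmpress)
if [ -n "$missing" ]; then
    # Unquoted on purpose: prints one line per missing tool.
    printf 'Not found on PATH: %s\n' $missing
fi

# Show the Python version, since the pipeline was tested with 3.9.
if command -v python3 >/dev/null 2>&1; then
    python3 --version
fi

# Assumption: the paths to edit are in cas10_prober.smk in the current directory.
if [ -f cas10_prober.smk ]; then
    leftover_paths cas10_prober.smk
fi
```

If the last step prints any lines, those paths still point to the original author's storage and need to be edited before the run.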
If you have trouble, please raise an issue on GitHub!