author:nwaglechner@gmail.com
git clone https://github.com/waglecn/mabs.git
Conda and snakemake
Miniconda available from: https://docs.conda.io/en/latest/miniconda.html
Python 3.8.3 Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-X86_64.sh
conda env create --name mabs --file environment.yaml
conda activate mabs
- note the version of python installed in the the mabs environment is not necessarily the same as the default miniconda python version
- asking for ete3 in the default environment will required python 3.6 (200921)
- GATK3 jar file
- available from https://console.cloud.google.com/storage/browser/gatk-software/package-archive/gatk
- used '''GenomeAnalysisTK-3.8-1-0-gf15c1c3ef.tar.bz2'''
- see config.yaml
- adapters for trimming - see config.yaml
- look for adapter files bundled with trimmomatic, ie.
locate TruSeq3-PE.fa
- Kraken database ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz
snakemake --configfile config.yaml --cores 8 --use-conda --conda-prefix /path/to/.snakemake/conda
Use config.default.yaml as a template for other config files.
200915
- strange bug causing infinite loop in snakemake downloading refseq genomes. I think this is because of the dynamic() output/input in rules. Checking this out, seeing if the bug happens if I run entire pipeline from scratch.
200917
- noticed a bug in running shovill, increased expected memory usage. Shovill version 0.9.0 running from an older miniconda. Removed miniconda, started from scratch, and pinned Shovill 1.1.0 in shovill.yaml
- after fixing, rerunning seems to work with example data, then works after deleting the mashtree and refseq_download directories.
210302
- on vs masking before gubbins vs after see nickjcroucher/gubbins#275
-
[ ]download internal project data - deferred
- configurable data-dir - 200914
-
download external project data
- refseq genomes - done 200904
- genomes from Bryant et al, SRA
- need to know what these are
-
download reference assemblies - 200908
- first used all contig assemblies, changed to 'complete' keyword
-
reading in samples somehow, obviously this depends on how/where they are downloaded (see previous TODO item) and the data that is already downloaded
- need a dummy rule that requires these as input in order to define wildcards
-
basic Snakefile - 200905
-
build workflow part 1
- index reference assemblies - deferred 200914
- moved to resources/alignment_references
- pre-trim QC - done 200908
- trim - done 200909
- specify adapter files, add variable to config
- post-trim QC done 200909
- kraken check - done 200910
- download kraken db automatically - deferred, added to Required files
- genome assembly on raw reads - 200914
- Erm(41) identification on assembly - 200912
- kraken2 on assembly - 200912
- mashtree assembly - 200913
- map everything to ATCC 19977 for basic coverage - 200914
- index reference assemblies - deferred 200914
-
build workflow part 2 on available assemblies
- tree-guided MRCA - 200915
- MRCA-guided MLST - 200913
- MRCA-guided reference mapping - 200921
- Optional: Mark duplicates with picard
- read filtering - see Martin et al 2018 and Lee et al 2020
- filter soft clips - 200922
- optional GATK realignment, but see for why it was removed in 2015 for gatk4 https://github.com/broadinstitute/gatk-docs/blob/master/blog-2012-to-2019/2016-06-21-Changing_workflows_around_calling_SNPs_and_indels.md?id=7847
- added 200923, optional 200924
- intially added gatk4, got errors and followed the rabbit-hole
- to follow Martin et al, added conda env with gatk3.8, since the resulting bam can be used with any downstream variant caller
- annotate regions of interest
- remove PP/PPE regions (BED file)
- identify PP/PPE - 200927
- zero coverage of reference
- remove phage, tnp, IS
- merge ROI BED files
- remove PP/PPE regions (BED file)
- MRCA-guided variant calling with bcftools - 200922
- bcftools mpileup - 200923
- called variants - 200923
- variant filtering
- basic Martin et al - 200925
- density filter - see https://github.com/c2-d2/within-host-diversity/blob/master/fastq_to_vcf_pipeline.py#L122 line
- variant annotation with SNPEff
- SNP-tree construction
- SNP extraction - custom? merge vcf as per Robyn 201006
- - merge SNPs - 201013
- concatenate cSNPSs (exclude hSNPs) 201016
- snp-sites ? snippy?
- - vcfmerge 201014
- add trimming parameters to config file - 200921
- sub-species type assemblies are hard-coded in scripts/tree_MRCA.py, it would be useful for these to be configurable but adds layers of complexity to snakefile
- Added GATK info to REQUIREMENTS, and config.yaml
- Tune variant filtering
- TODO big question here - use stats from part 1 to make new sample_sheet with QC pass samples? No
- make list to prune from SNP alignment - not needed 201012
- need separate list of in-complete genomes, as MRCA-guided MLST didn't work as expected, tree has wrong structure (samples from pt 29 should be mmas) - Fixed 201006, need to convert gbff files before mashtree can read
- start density filter
- merge completed results without recalculating shovill assemblies for old samples - 201010
- merge 0-coverage bed files and PE_PPE bed files 201013
- filter merged bed from vcf
- compress vcf with bcftools
- complete density filter - 20-11-23
- incorporate https://github.com/phac-nml/mab_mabscessus 211021
- merging script
- copy results_folder1 and results_folder2 into results_merge folder
- remove the gubbins folder
- remove the SNP_phylo folder
- remove the files in MRCA_ref_folder, but keep the individual reference sub-folders
- remove the mashtree folder
run snakemake with the following targets, in this order:
- mashtree/assembly_mashtree.complete.tree
- stage1
touch ./MRCA_ref_mapping//tempRGSC.merged..sorted.bam.bai touch ./MRCA_ref_mapping//.intervals touch ./MRCA_ref_mapping//.RG_SC_RA.bam touch ./MRCA_ref_mapping//.RG_SC_RA.mpileup touch ./MRCA_ref_mapping//.RG_SC_RA_filter.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA_filter.failed.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA_filter.AD_failed.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA_filter.hvar.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA.0cov.bed touch ./MRCA_ref_mapping//.RG_SC_RA_filter.hvar_DF.bed
- stage2
- stage3 to generate the merged output (gubbins, SNP phylo, merged beds, etc)