Skip to content

waglecn/mabs

Repository files navigation

mabs

author:nwaglechner@gmail.com

Basic Setup

git clone https://github.com/waglecn/mabs.git

Conda and snakemake

Miniconda available from: https://docs.conda.io/en/latest/miniconda.html

Python 3.8.3 Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  
bash Miniconda3-latest-Linux-X86_64.sh
conda env create --name mabs --file environment.yaml
conda activate mabs
- note the version of python installed in the the mabs environment is not necessarily the same as the default miniconda python version
- asking for ete3 in the default environment will required python 3.6 (200921)

Required files:

locate TruSeq3-PE.fa
  • Kraken database ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz

How to run

snakemake --configfile config.yaml --cores 8 --use-conda --conda-prefix /path/to/.snakemake/conda

Use config.default.yaml as a template for other config files.

Notes

200915

  • strange bug causing infinite loop in snakemake downloading refseq genomes. I think this is because of the dynamic() output/input in rules. Checking this out, seeing if the bug happens if I run entire pipeline from scratch.

200917

  • noticed a bug in running shovill, increased expected memory usage. Shovill version 0.9.0 running from an older miniconda. Removed miniconda, started from scratch, and pinned Shovill 1.1.0 in shovill.yaml
  • after fixing, rerunning seems to work with example data, then works after deleting the mashtree and refseq_download directories.

210302

TODO 200902

  • [ ]download internal project data - deferred

    • configurable data-dir - 200914
  • download external project data

    • refseq genomes - done 200904
    • genomes from Bryant et al, SRA
      • need to know what these are
  • download reference assemblies - 200908

    • first used all contig assemblies, changed to 'complete' keyword
  • reading in samples somehow, obviously this depends on how/where they are downloaded (see previous TODO item) and the data that is already downloaded

    • need a dummy rule that requires these as input in order to define wildcards
  • basic Snakefile - 200905

  • build workflow part 1

    • index reference assemblies - deferred 200914
      • moved to resources/alignment_references
    • pre-trim QC - done 200908
    • trim - done 200909
      • specify adapter files, add variable to config
    • post-trim QC done 200909
    • kraken check - done 200910
      • download kraken db automatically - deferred, added to Required files
    • genome assembly on raw reads - 200914
      • Erm(41) identification on assembly - 200912
    • kraken2 on assembly - 200912
    • mashtree assembly - 200913
    • map everything to ATCC 19977 for basic coverage - 200914
  • build workflow part 2 on available assemblies

    • tree-guided MRCA - 200915
    • MRCA-guided MLST - 200913
    • MRCA-guided reference mapping - 200921
    • Optional: Mark duplicates with picard
    • read filtering - see Martin et al 2018 and Lee et al 2020
    • annotate regions of interest
      • remove PP/PPE regions (BED file)
        • identify PP/PPE - 200927
      • zero coverage of reference
      • remove phage, tnp, IS
      • merge ROI BED files
    • MRCA-guided variant calling with bcftools - 200922
    • variant annotation with SNPEff
    • SNP-tree construction
      • SNP extraction - custom? merge vcf as per Robyn 201006
      • - merge SNPs - 201013
      • concatenate cSNPSs (exclude hSNPs) 201016
        • snp-sites ? snippy?
      • - vcfmerge 201014

TODO 200911

  • add trimming parameters to config file - 200921

TODO 200914

  • sub-species type assemblies are hard-coded in scripts/tree_MRCA.py, it would be useful for these to be configurable but adds layers of complexity to snakefile

TODO 200920

  • Added GATK info to REQUIREMENTS, and config.yaml

TODO 200926

  • Tune variant filtering
  • TODO big question here - use stats from part 1 to make new sample_sheet with QC pass samples? No
    • make list to prune from SNP alignment - not needed 201012
  • need separate list of in-complete genomes, as MRCA-guided MLST didn't work as expected, tree has wrong structure (samples from pt 29 should be mmas) - Fixed 201006, need to convert gbff files before mashtree can read

TODO 201010

  • start density filter
  • merge completed results without recalculating shovill assemblies for old samples - 201010
  • merge 0-coverage bed files and PE_PPE bed files 201013
  • filter merged bed from vcf
    • compress vcf with bcftools

TODO 201013

  • complete density filter - 20-11-23

TODO 201015

210323

  • merging script
  • copy results_folder1 and results_folder2 into results_merge folder
  • remove the gubbins folder
  • remove the SNP_phylo folder
  • remove the files in MRCA_ref_folder, but keep the individual reference sub-folders
  • remove the mashtree folder

run snakemake with the following targets, in this order:

  • mashtree/assembly_mashtree.complete.tree
  • stage1

touch ./MRCA_ref_mapping//tempRGSC.merged..sorted.bam.bai touch ./MRCA_ref_mapping//.intervals touch ./MRCA_ref_mapping//.RG_SC_RA.bam touch ./MRCA_ref_mapping//.RG_SC_RA.mpileup touch ./MRCA_ref_mapping//.RG_SC_RA_filter.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA_filter.failed.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA_filter.AD_failed.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA_filter.hvar.vcf.gz touch ./MRCA_ref_mapping//.RG_SC_RA.0cov.bed touch ./MRCA_ref_mapping//.RG_SC_RA_filter.hvar_DF.bed

  • stage2
  • stage3 to generate the merged output (gubbins, SNP phylo, merged beds, etc)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published