provide a demonstration snakemake integration for spacegraphcats #288

ctb · 2020-08-01T14:10:20Z

@taylorreiter is doing cool stuff with running lots of spacegraphcats jobs inside of snakemake using --outdir etc. It would be nice to provide an example in the sgc documentation.

taylorreiter · 2020-09-22T14:51:58Z

Not a good example yet, but here's a start in case it's helpful for anyone.

A Snakefile that assumes raw paired-end reads exist in inputs/raw with names ending in _R1.fq.gz and _R2.fq.gz, and that there is an inputs/samples.tsv file with a column named library_name that specifies the prefix of the raw files.
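To make that layout concrete, here is a small sketch of how a sample sheet maps to library names and raw read paths. The library names and the extra notes column are made up for illustration, and the stdlib csv module stands in for pandas so the snippet runs without extra dependencies.

```python
import csv
import io

# Hypothetical sample sheet; only the library_name column matters here.
SAMPLES_TSV = (
    "library_name\tnotes\n"
    "libA\tfirst run\n"
    "libB\tsecond run\n"
    "libA\tre-sequenced\n"
)


def library_names(handle):
    """Return the unique library_name values in first-seen order."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = {}
    for row in reader:
        seen.setdefault(row["library_name"], None)
    return list(seen)


def raw_read_paths(library):
    """Paths of the paired-end files expected for one library."""
    return [
        f"inputs/raw/{library}_R1.fq.gz",
        f"inputs/raw/{library}_R2.fq.gz",
    ]


libraries = library_names(io.StringIO(SAMPLES_TSV))
for lib in libraries:
    print(lib, raw_read_paths(lib))
```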

import pandas as pd

m = pd.read_csv("inputs/samples.tsv", sep = "\t", header = 0)
LIBRARIES = m['library_name'].unique().tolist()

# Hard-coding this list can be avoided by using checkpoints,
# but checkpoints make the DAG take forever to solve.
QUERY_GENOMES = ["ERS235530_10.fna", "ERS235531_43.fna", "ERS235603_16.fna"]

rule all:
    input:
        expand("outputs/sgc_genome_queries/{library}_k31_r1_search_oh0/{query_genome}.gz.cdbg_ids.reads.fa.gz", library = LIBRARIES, query_genome = QUERY_GENOMES)

# interleave the paired reads, then k-mer abundance trim them with khmer
# (-M 60e9 allows up to ~60 GB of memory; lower it on smaller machines)
rule kmer_trim_reads:
    input:
        'inputs/raw/{library}_R1.fq.gz',
        'inputs/raw/{library}_R2.fq.gz'
    output: "outputs/abundtrim/{library}.abundtrim.fq.gz"
    conda: 'envs/spacegraphcats.yml'
    shell:'''
    interleave-reads.py {input} | trim-low-abund.py --gzip -C 3 -Z 18 -M 60e9 -V - -o {output}
    '''


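# unpack the query genomes
rule untar_gather_match_genomes:
    input: "inputs/query_genomes.tar.gz"
    output: expand("inputs/query_genomes/{query_genome}.gz", query_genome = QUERY_GENOMES)
    # Extract next to the declared outputs. This assumes the archive holds the
    # *.gz files at its top level; adjust outdir if it has a leading directory.
    params: outdir = "inputs/query_genomes"
    shell:'''
    mkdir -p {params.outdir}
    tar xf {input} -C {params.outdir}
    '''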

# The query genome is not passed on the command line; the conf file is
# expected to list it. It is declared as an input so that it gets unpacked first.
rule spacegraphcats:
    input:
        query = "inputs/query_genomes/{query_genome}.gz",
        conf = "inputs/sgc_conf/{library}_r1_conf.yml",
        reads = "outputs/abundtrim/{library}.abundtrim.fq.gz"
    output:
        "outputs/sgc_genome_queries/{library}_k31_r1_search_oh0/{query_genome}.gz.cdbg_ids.reads.fa.gz",
        "outputs/sgc_genome_queries/{library}_k31_r1_search_oh0/{query_genome}.gz.contigs.sig"
    params: outdir = "outputs/sgc_genome_queries"
    conda: "envs/spacegraphcats.yml"
    shell:'''
    spacegraphcats {input.conf} extract_contigs extract_reads --nolock --outdir={params.outdir}
    '''
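One thing the Snakefile above takes for granted is that inputs/sgc_conf/{library}_r1_conf.yml already exists for every library. Below is a rough sketch of generating those files. The key names (catlas_base, input_sequences, ksize, radius, search) follow the example configs shipped with spacegraphcats as far as I recall, so treat them as an assumption and check them against the sgc docs. The helper names render_conf and write_confs are made up for this example.

```python
from pathlib import Path

# Assumed config layout, modelled on the spacegraphcats example configs.
# Verify the key names against the sgc documentation before relying on this.
CONF_TEMPLATE = """\
catlas_base: {library}
input_sequences:
- outputs/abundtrim/{library}.abundtrim.fq.gz
ksize: 31
radius: 1
search:
{search_lines}
"""


def render_conf(library, query_genomes):
    """Build the YAML text for one library (hypothetical helper)."""
    search_lines = "\n".join(
        f"- inputs/query_genomes/{genome}.gz" for genome in query_genomes
    )
    return CONF_TEMPLATE.format(library=library, search_lines=search_lines)


def write_confs(libraries, query_genomes, conf_dir="inputs/sgc_conf"):
    """Write one {library}_r1_conf.yml per library and return the paths."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for library in libraries:
        path = conf_dir / f"{library}_r1_conf.yml"
        path.write_text(render_conf(library, query_genomes))
        written.append(path)
    return written


# Show what a generated conf would look like for a made-up library name.
print(render_conf("libA", ["ERS235530_10.fna"]))
```

With catlas_base set to the library name, ksize 31 and radius 1, the search directory name should come out as {library}_k31_r1_search_oh0, which is what the rule's output paths expect.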

envs/spacegraphcats.yml:

channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - bcalm=2.2.1
    - snakemake-minimal=5.8.1
    - cython=0.29.14
    - screed=1.0.1
    - sourmash=3.1.0
    - khmer=3.0.0a3
    - pandas=0.25.3
    - numpy=1.17.3
    - pytest=5.3.2
    - sortedcontainers=2.1.0
    - mypy=0.750
    - pip
    - pip:
      - https://github.com/dib-lab/pybbhash/archive/spacegraphcats.zip
      - git+https://github.com/spacegraphcats/spacegraphcats@master
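Going back to the hard-coded QUERY_GENOMES list in the Snakefile: if inputs/query_genomes.tar.gz is already on disk before Snakemake starts, one alternative to both hard-coding and checkpoints might be to read the member names from the tarball when the Snakefile is parsed. This is only a sketch of that idea, not something I've run in this workflow, and query_genomes_from_tarball is a made-up helper name.

```python
import tarfile
from pathlib import PurePosixPath


def query_genomes_from_tarball(tar_path, suffix=".gz"):
    """Return sorted basenames, minus suffix, of regular files in the tarball."""
    names = []
    # Mode "r" lets tarfile detect gzip compression transparently.
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            name = PurePosixPath(member.name).name
            if member.isfile() and name.endswith(suffix):
                names.append(name[: -len(suffix)])
    return sorted(names)
```

In the Snakefile this would replace the literal list, e.g. QUERY_GENOMES = query_genomes_from_tarball("inputs/query_genomes.tar.gz"). The catch is that the tarball has to exist at parse time, so it can't itself be produced by another rule.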

ctb · 2020-09-23T13:41:08Z

thanks!

ctb mentioned this issue Nov 8, 2020: how does snakemake locking work, anyway? #342 (Open)

ctb mentioned this issue Nov 20, 2020: explore using shadow to run sgc in /tmp #364 (Closed)

taylorreiter mentioned this issue Apr 30, 2021: [MRG] add documentation and mkdocs site for spacegraphcats #392 (Merged, 27 tasks)
