provide a demonstration snakemake integration for spacegraphcats #288

ctb opened this issue Aug 1, 2020 · 2 comments

ctb commented Aug 1, 2020

@taylorreiter is doing cool stuff running lots of spacegraphcats jobs inside of snakemake using --outdir etc. It would be nice to provide an example in the sgc documentation.

@taylorreiter

Not a good example yet, but here's a start in case it's helpful for anyone.

A snakefile that assumes raw paired-end reads exist in inputs/raw with the endings _R1.fq.gz and _R2.fq.gz, and that there is an inputs/samples.tsv file with a column named library_name that specifies the prefix of the raw files.

import pandas as pd

m = pd.read_csv("inputs/samples.tsv", sep = "\t", header = 0)
LIBRARIES = m['library_name'].unique().tolist()

# Specifying this variable can be avoided by using checkpoints,
# but this makes the DAG take forever to solve. 
QUERY_GENOMES = ["ERS235530_10.fna", "ERS235531_43.fna", "ERS235603_16.fna"]

rule all:
    input: 
        expand("outputs/sgc_genome_queries/{library}_k31_r1_search_oh0/{query_genome}.gz.cdbg_ids.reads.fa.gz", library = LIBRARIES, query_genome = QUERY_GENOMES)

rule kmer_trim_reads:
    input: 
        'inputs/raw/{library}_R1.fq.gz',
        'inputs/raw/{library}_R2.fq.gz'
    output: "outputs/abundtrim/{library}.abundtrim.fq.gz"
    conda: 'envs/spacegraphcats.yml'
    shell:'''
    interleave-reads.py {input} | trim-low-abund.py --gzip -C 3 -Z 18 -M 60e9 -V - -o {output}
    '''


# extract the query genomes from a local tarball
rule untar_gather_match_genomes:
    output:  expand("inputs/query_genomes/{query_genome}.gz", query_genome = QUERY_GENOMES)
    input:"inputs/query_genomes.tar.gz"
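    # FIXME: files extracted here land under outputs/, but this rule declares its outputs under inputs/query_genomes/
    params: outdir = "outputs/"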
    shell:'''
    mkdir -p {params.outdir}
    tar xf {input} -C {params.outdir}
    '''

rule spacegraphcats:
    input: 
        query = "inputs/query_genomes/{query_genome}.gz", 
        conf = "inputs/sgc_conf/{library}_r1_conf.yml",
        reads = "outputs/abundtrim/{library}.abundtrim.fq.gz"
    output:
        "outputs/sgc_genome_queries/{library}_k31_r1_search_oh0/{query_genome}.gz.cdbg_ids.reads.fa.gz",
        "outputs/sgc_genome_queries/{library}_k31_r1_search_oh0/{query_genome}.gz.contigs.sig"
    params: outdir = "outputs/sgc_genome_queries"
    conda: "envs/spacegraphcats.yml"
    shell:'''
    spacegraphcats {input.conf} extract_contigs extract_reads --nolock --outdir={params.outdir}  
    '''
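
The spacegraphcats rule above expects one config file per library at inputs/sgc_conf/{library}_r1_conf.yml, which the Snakefile does not create. Below is a minimal sketch of how those files could be generated. The helper names (sgc_conf_text, write_confs) are hypothetical, and the YAML keys (catlas_base, input_sequences, ksize, radius, search) are an assumption based on the example configs shipped with spacegraphcats, so check them against the sgc docs for your version. The values are chosen so that the output directory matches the {library}_k31_r1_search_oh0 pattern used in the rules.

```python
from pathlib import Path


def sgc_conf_text(library, query_genomes, ksize=31, radius=1):
    """Return YAML text for one library's config (key names are assumed)."""
    lines = [
        f"catlas_base: '{library}'",
        "input_sequences:",
        f"- outputs/abundtrim/{library}.abundtrim.fq.gz",
        f"ksize: {ksize}",
        f"radius: {radius}",
        "search:",
    ]
    # One search entry per query genome, matching the rule's input.query paths.
    lines += [f"- inputs/query_genomes/{genome}.gz" for genome in query_genomes]
    return "\n".join(lines) + "\n"


def write_confs(libraries, query_genomes, conf_dir="inputs/sgc_conf"):
    """Write {library}_r1_conf.yml for each library and return the paths."""
    out_dir = Path(conf_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for library in libraries:
        path = out_dir / f"{library}_r1_conf.yml"
        path.write_text(sgc_conf_text(library, query_genomes))
        paths.append(path)
    return paths
```

Calling write_confs(LIBRARIES, QUERY_GENOMES) once before running snakemake would produce the conf inputs the rule asks for. Quoting catlas_base keeps purely numeric library names from being read as integers by a YAML parser.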

envs/spacegraphcats.yml:

channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - bcalm=2.2.1
    - snakemake-minimal=5.8.1
    - cython=0.29.14
    - screed=1.0.1
    - sourmash=3.1.0
    - khmer=3.0.0a3
    - pandas=0.25.3
    - numpy=1.17.3
    - pytest=5.3.2
    - sortedcontainers=2.1.0
    - mypy=0.750
    - pip
    - pip:
      - https://github.com/dib-lab/pybbhash/archive/spacegraphcats.zip
      - git+https://github.com/spacegraphcats/spacegraphcats@master
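
Since the workflow depends on several files that it does not create itself (the raw reads, inputs/samples.tsv, and the per-library conf files), a quick pre-flight check can save a failed DAG build. This is only a sketch: the function names are hypothetical, and it uses the standard-library csv module instead of pandas so it runs without the conda environment.

```python
import csv
import io
from pathlib import Path


def read_libraries(tsv_text):
    """Unique library_name values, in order of first appearance."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = []
    for row in reader:
        name = row["library_name"]
        if name not in seen:
            seen.append(name)
    return seen


def required_inputs(library):
    """Files the Snakefile expects to already exist for one library."""
    return [
        f"inputs/raw/{library}_R1.fq.gz",
        f"inputs/raw/{library}_R2.fq.gz",
        f"inputs/sgc_conf/{library}_r1_conf.yml",
    ]


def missing_inputs(libraries, root="."):
    """Return the required input paths that are not present under root."""
    root = Path(root)
    return [
        path
        for library in libraries
        for path in required_inputs(library)
        if not (root / path).exists()
    ]
```

For example, missing_inputs(read_libraries(Path("inputs/samples.tsv").read_text())) lists anything that still needs to be put in place. Once it returns an empty list, the workflow can be started with snakemake's --use-conda flag so that the conda: directives in the rules take effect.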


ctb commented Sep 23, 2020

thanks!
