In [None]:
#imports

This notebook sets up a Snakemake workflow to go from an SRR run ID --> BAM file on **Biowulf**.

# Set up the environment

Log in to an interactive session
>sinteractive --cpus-per-task=6 --mem=6g

Load snakemake
>module load snakemake

# Set up the `Snakefile`

## Using Wrappers

There are a lot of Snakemake wrappers for common tools [here](https://snakemake-wrappers.readthedocs.io/en/stable/)

On each wrapper's page, the top half contains the Snakemake **rules** to copy-paste into the `Snakefile`. The `wrapper` directive points to the code that the rule will run, which is listed below the rules.


## Set up the `cluster` file

For every rule, you can choose the partition/time/resources, etc. E.g.,
```
__default__:
    partition: quick
    time: 10
    extra: "--gres=lscratch:10"
hisat2:
    partition: norm
```
In this example, the `hisat2` rule would run on the `norm` partition with a walltime of 10 min and 10 GB of lscratch. The latter two come from the `__default__` section, since they are not defined for the `hisat2` rule.
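
The fallback can be sketched in plain Python. This is only an illustration of the lookup described above, not Snakemake's actual implementation; `resolve_cluster_settings` is a hypothetical helper, and the dictionary simply mirrors the YAML example.

```python
# Illustrative only: the cluster file above written as a plain dictionary.
cluster_config = {
    "__default__": {
        "partition": "quick",
        "time": 10,
        "extra": "--gres=lscratch:10",
    },
    "hisat2": {
        "partition": "norm",
    },
}


def resolve_cluster_settings(rule_name, cluster_config):
    """Start from __default__, then overlay the rule's own keys (hypothetical helper)."""
    settings = dict(cluster_config.get("__default__", {}))
    settings.update(cluster_config.get(rule_name, {}))
    return settings


hisat2_settings = resolve_cluster_settings("hisat2", cluster_config)
print(hisat2_settings)
```

A rule with no section of its own would get the `__default__` values unchanged.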

## Set up the `config` file

1. Options for running the Snakefile (from the [NIH class on using snakemake](https://github.com/NIH-HPC/snakemake-class/tree/master/exercise05)). Modify these as appropriate in the file `myprofile/config.yaml`. Its `cluster` template takes its values from the `cluster` file.
 
```
1. -k, --keep-going: By default, snakemake will quit if a job fails (after waiting for running jobs to finish). -k will make snakemake continue with independent jobs.
2. -w, --latency-wait, --output-wait: The amount of time snakemake will wait for output files to appear after a job has finished. This defaults to a low 5s. On the shared file systems latency may be higher. Raising it to 120s is a bit excessive, but it doesn't really hurt too much.
3. --local-cores: The number of CPUs available for local rules
4. --max-jobs-per-second: Max numbers of jobs to submit per second. Please be kind to the batch scheduler.
5. --cluster: The template string used to submit each (non local) job.
6. --jobs: The number of jobs to run concurrently.
7. --cluster-config: The cluster config file
```
Example:

```
# this is myprofile/config.yaml

max-jobs-per-second: 1
latency-wait: 120
keep-going: true
cluster: 'sbatch -c {cluster.threads} --mem={cluster.mem} --partition={cluster.partition} --time={cluster.time} {cluster.extra}'
```
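
To see what the `cluster` template expands to, the sketch below fills it in with Python's `str.format`. It approximates the substitution and is not Snakemake's own code. The `threads` and `mem` values are assumptions made for illustration: the `cluster` file example above does not define them, so they would need to be added there for this template to work.

```python
from types import SimpleNamespace

# Template copied from the config.yaml example.
template = (
    "sbatch -c {cluster.threads} --mem={cluster.mem} "
    "--partition={cluster.partition} --time={cluster.time} {cluster.extra}"
)

# Resolved settings for one rule. threads and mem are hypothetical values.
cluster = SimpleNamespace(
    threads=4,
    mem="8g",
    partition="norm",
    time=10,
    extra="--gres=lscratch:10",
)

command = template.format(cluster=cluster)
print(command)
```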

2. Add the input files and the rule parameters to the `config` file.

Example:
```
#this is config.yaml
{ 
	"samples" : {
		"A" : "data/samples/A.fastq",
		"B" : "data/samples/B.fastq"
	},
	"samtools_view" : {
		"flag" : "0x5"
	}
}
```
This turns into a dictionary that can be accessed in the Snakefile code:
```
input:
    lambda wildcards: config["samples"][wildcards.sample]
params:
    flag = config["samtools_view"]["flag"]
shell:
    "samtools view -c -f {params.flag} {input} > {output}" 
```
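
The lookups above can be tried outside Snakemake. This sketch parses the same JSON with the standard library and uses a `SimpleNamespace` as a stand-in for Snakemake's `wildcards` object (an assumption for illustration; the real object is supplied by Snakemake when it runs a rule).

```python
import json
from types import SimpleNamespace

config_text = """
{
    "samples": {
        "A": "data/samples/A.fastq",
        "B": "data/samples/B.fastq"
    },
    "samtools_view": {
        "flag": "0x5"
    }
}
"""
config = json.loads(config_text)


def get_input(wildcards):
    """Same lookup as the lambda in the rule above."""
    return config["samples"][wildcards.sample]


wildcards = SimpleNamespace(sample="A")  # stand-in for Snakemake's wildcards
input_path = get_input(wildcards)
flag = config["samtools_view"]["flag"]

print(input_path)
print(flag)
```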

# Create the Snakemake rules file

1. The first rule lists the expected target file. Snakemake will run all rules required to generate `counts/A.counts`
```
rule all:
    input:
        "counts/A.counts"
```
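
If every sample in the `config` file should get a counts file, the target list could be built rather than typed out. The sketch below does this with a plain list comprehension; the `counts/{sample}.counts` naming pattern is extrapolated from the single target above and is an assumption.

```python
# Sample names as they appear under "samples" in the config example.
samples = {
    "A": "data/samples/A.fastq",
    "B": "data/samples/B.fastq",
}

# Assumed naming pattern, extrapolated from "counts/A.counts".
targets = [f"counts/{sample}.counts" for sample in samples]
print(targets)
```

Inside a Snakefile, a list like this can be given as the `input` of `rule all`.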

## Run snakemake

Make a script with the command line options:

```
#! /bin/bash
# this file is snakemake.sh

module load snakemake || exit 1

snakemake --profile ./myprofile --jobs 100 --cluster-config=cluster.yaml
```

Finally, submit the master job. The arguments here (2 CPUs, 8g memory, 48 h walltime) apply only to the master job; the subjobs get their resources from the `cluster` template in the profile.
>sbatch --cpus-per-task=2 --mem=8g --time=48:00:00 snakemake.sh
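
Option order matters when submitting: `sbatch` reads its own options only up to the script name, and anything after the script is passed to the script as an argument. The sketch below builds the command as a list so the script always comes last. `build_sbatch_command` is a hypothetical helper, and the command is only printed, not run.

```python
import shlex


def build_sbatch_command(script, **options):
    """Return an sbatch command with all options placed before the script."""
    command = ["sbatch"]
    for name, value in options.items():
        # Convert keyword names to flags, e.g. cpus_per_task -> --cpus-per-task
        command.append(f"--{name.replace('_', '-')}={value}")
    command.append(script)
    return command


command = build_sbatch_command(
    "snakemake.sh", cpus_per_task=2, mem="8g", time="48:00:00"
)
print(shlex.join(command))
```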

Job display (load graphviz first; it provides `dot`)
>module load graphviz

For the workflow:
>snakemake --rulegraph | dot -Tpng > rulegraph.png

For every job:
>snakemake --dag | dot -Tpng > dag.png



In [1]:
from pathlib import Path

In [26]:
def is_paired_end(directory, accession, extension):
    """
    Accepts a directory and tests if it contains
    accession_1.extension AND accession_2.extension

    e.g., is_paired_end("path/to/dir", "SRR001", ".fasta")
    returns True if
    path/to/dir/SRR001_1.fasta AND path/to/dir/SRR001_2.fasta
    exist.
    Otherwise, False
    """
    p = Path(directory)
    assert p.is_dir(), f"{directory} is not a directory"

    paired_end1 = p.joinpath(accession + "_1" + extension)
    paired_end2 = p.joinpath(accession + "_2" + extension)

    # Paired-end only if both mate files exist
    return paired_end1.is_file() and paired_end2.is_file()

l1 = ['CYGL01000011.1']
for accession in l1:
    if is_paired_end('../../projects/human_virome_project/01annotation/contigs/', accession, '.fna'):