# Pipelines in BIU
There are several pipelines defined in BIU:
 * VEP: Variant Effect Prediction
 * RBHMap: Reciprocal Best BLAST Hit mapping between two FASTA files

In [1]:
import biu
biu.config.settings.setWhere('/')

## Common parameters for manipulating the pipeline
When creating a pipeline instance, there are a few useful parameters:
 * `rewriteHashedInputFiles = True|False`: Whether hashed input files should be rewritten.
 * `autorun = True|False`: Whether the pipeline should be executed at initialization.
 * `**snakemakeOptions`: Parameters to pass to the Snakemake engine (see http://snakemake.readthedocs.io/en/latest/api_reference/snakemake.html), for example:
  * `drmaa=" -V"`: Run with DRMAA, submitting jobs with e.g. qsub. Default: None. Requires the drmaa package.
  * `cores=10`: Run with 10 cores. Default: 1.
  * `use_conda=True`: Use conda to install dependencies. Default: True.
  

## Making your own pipeline

To make your own pipeline, you need to define two things: a class that wraps your pipeline, and a Snakefile that defines the steps.

### Making the Snakefile

A Snakefile takes a configuration file as input. You define most of its contents, but a few additional parameters are also provided to help with the execution of the pipeline:
    
 * `config["common_dir"]`: A common directory for all instances of this pipeline type (e.g. for downloading files that are always constant across all versions of the pipeline.)
 * `config["tmp_dir"]`: A temporary directory for temporary files exclusively for this pipeline instance
 * `config["outdir"]`: An output directory exclusively for this pipeline instance.
 * `config["hash"]`: A hash that is unique to this instance.

Your snakefile MUST define an `output` rule, which will be run.
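
As a rough illustration of how these keys are typically used, the sketch below builds file paths from a hand-written stand-in for `config`. The keys are the ones listed above; the values and the file names `reference.fa` and `result.fa` are invented for this example, and in a real run the dictionary is supplied to the Snakefile rather than written by hand.

```python
# Hypothetical stand-in for the config available inside the Snakefile.
# Only the keys come from the list above; all values are made up.
config = {
    "common_dir": "/data/pipeline_runs/common/myPipeline",
    "tmp_dir": "/data/pipeline_runs/temporary_input/myPipeline",
    "outdir": "/data/pipeline_runs/myPipeline/abc123",
    "hash": "abc123",
}

# Files that are identical for every instance of this pipeline type
# (e.g. a downloaded reference) belong in common_dir.
shared_reference = "%s/reference.fa" % config["common_dir"]

# Results that belong to this instance only belong in outdir.
instance_result = "%s/result.fa" % config["outdir"]

print(shared_reference)
print(instance_result)
```

Keeping shared downloads in `common_dir` means they do not have to be fetched again for each new instance, while `outdir` keeps the results of different instances apart.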

Let's define an example snakefile:

In [10]:
snakefile = """
###############################################################################

biu.__version__()

localrules: download, combine, output
rule download:
  output:
    file = '%s/downloaded_file.fa' % config['outdir']
  shell: '''
    curl 'http://molb7621.github.io/workshop/_downloads/sample.fa' > '{output.file}'
  '''
  
rule combine:
  input:
    file   = rules.download.output.file,
    myfile = config['input_file']
  output:
    file = '%s/combined.fa' % config['outdir']
  shell: '''
    cat '{input.file}' <(echo '') '{input.myfile}' > '{output.file}'
  '''
  
rule output:
  input:
    file = rules.combine.output.file
  output:
    file = '%s/%s' % ( config['outdir'], config['output_file_name'])
  shell: '''
    cp '{input.file}' '{output.file}'
  '''
"""

snakefilePath = 'example_files/example.Snakefile'

with open(snakefilePath, 'w') as ofd:
    ofd.write(snakefile)

### Making the class

The class needs to define some default parameters of this pipeline (such as the name of the output file), and handle the input files and output.

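Because the hash is generated from the configuration file, we need a standard way to ensure that the hash doesn't change when the same input is supplied again. The `_generateInputFileName` method used below serves this purpose: it derives the input file name from the file's contents, so identical input always maps to the same file name.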
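
To see why a content-derived file name keeps the configuration stable, consider the following minimal sketch. It is not BIU's implementation: `content_based_filename` is a hypothetical helper, the choice of MD5 and the `digest.` prefix are assumptions, and it hashes every sequence in full rather than sampling.

```python
import hashlib
import os
import tempfile

def content_based_filename(directory, sequences):
    """Hypothetical helper: derive a file name from the sequence content.

    Returns (filename, exists), where `exists` says whether a file with
    that name is already present in `directory`.
    """
    digest = hashlib.md5()
    for seq in sequences:
        digest.update(seq.encode("utf-8"))
        digest.update(b"\n")  # separator, so ["AB", "C"] differs from ["A", "BC"]
    filename = os.path.join(directory, "digest.%s" % digest.hexdigest())
    return filename, os.path.exists(filename)

tmp_dir = tempfile.mkdtemp()

# First call: the name is computed, but nothing has been written yet.
first, existed = content_based_filename(tmp_dir, ["ACGT", "TTGA"])
with open(first, "w") as ofd:
    ofd.write(">a\nACGT\n>b\nTTGA\n")

# Same content again: same name, and the file is now found on disk.
second, exists_now = content_based_filename(tmp_dir, ["ACGT", "TTGA"])

# Different content: a different name.
other, _ = content_based_filename(tmp_dir, ["ACGT", "TTGC"])

print(first == second, existed, exists_now, other != first)
```

Because the path stored in the configuration depends only on the content, passing the same sequences twice produces the same configuration, and therefore the same pipeline hash.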

In [3]:
class myPipeline(biu.structures.Pipeline):
    
    # Default parameters
    __defaultConfig = {
        "output_file_name" : "outfile.fasta"
    }
    
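    def __init__(self, fasta, config=None, **kwargs):
        # Merge user-supplied settings over the defaults (config=None avoids a mutable default argument)
        biu.structures.Pipeline.__init__(self, snakefilePath, {**self.__defaultConfig, **(config or {})}, **kwargs)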
        
        # Now we need to define the instance specific options
        smConfig = {}
        smConfig['input_file'] = self.__writeTemporaryFile(fasta)
        
        self.setConfig(smConfig)
        if self.autorun:
            self.run(["output"])
        #fi
        
    #edef
    
    def __writeTemporaryFile(self, fasta):
        
        # _generateInputFileName takes a list of strings as input. By sampling, it quickly makes a hash of the file.
        filename, exists = self._generateInputFileName([ fasta[s].seq for s in fasta ])
        if not(exists):
            fasta.write(filename)
        #fi
        
        return filename
    #edef
    
    def getOutput(self):
        if not(self.success): # Check if the pipeline ran successfully
            return None
        #fi
        return biu.formats.Fasta('%s/%s' % (self.config["outdir"], self.config['output_file_name']))
    #edef
#edef
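
The way `__init__` combines the default parameters with the user-supplied `config` relies on plain dictionary unpacking, where later entries override earlier ones. The standalone sketch below shows that behaviour in isolation; `merged_config` and the value `renamed.fa` are illustrative names only, not part of BIU.

```python
# Defaults, mirroring the class attribute above.
default_config = {"output_file_name": "outfile.fasta"}

def merged_config(user_config=None):
    # Later entries win, so user values override the defaults.
    return {**default_config, **(user_config or {})}

print(merged_config())
# {'output_file_name': 'outfile.fasta'}

print(merged_config({"output_file_name": "renamed.fa"}))
# {'output_file_name': 'renamed.fa'}
```

Since the merged dictionary is what the pipeline receives as its configuration, changing `output_file_name` would be expected to yield a different pipeline hash and thus a separate output directory.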

## Running the pipeline

In [11]:
mp = myPipeline(biu.formats.Fasta('example_files/example.fasta'))

D: Fasta input source is file
Building DAG of jobs...


BIU (Bio Utilities) python module
{"where": "/", "neo4j_install_dir": "/exports/molepi/tgehrmann/src/neo4j", "debug_messages": true, "debug_stream": "stderr", "pipelines_base": "/exports/molepi/tgehrmann/pipeline_runs", "pipelines_common_name": "common", "pipelines_temporary_indir_name": "temporary_input", "pipelines_conda_prefix_name": "conda"}
 Current config hash: 3dd4f4f802236b4fe8bb44ae9106efea


Nothing to be done.
Complete log: /home/tgehrmann/repos/BIU/docs/.snakemake/log/2018-06-29T143403.002081.snakemake.log


In [5]:
mp.config

{'common_dir': '/exports/molepi/tgehrmann/pipeline_runs/common/myPipeline',
 'tmp_dir': '/exports/molepi/tgehrmann/pipeline_runs/temporary_input/myPipeline',
 'biu_settings': {'where': '/',
  'neo4j_install_dir': '/exports/molepi/tgehrmann/src/neo4j',
  'debug_messages': True,
  'debug_stream': 'stderr',
  'pipelines_base': '/exports/molepi/tgehrmann/pipeline_runs',
  'pipelines_common_name': 'common',
  'pipelines_temporary_indir_name': 'temporary_input',
  'pipelines_conda_prefix_name': 'conda'},
 'biu_location': '/home/tgehrmann/repos/BIU/docs',
 'outdir': '/exports/molepi/tgehrmann/pipeline_runs/myPipeline/0ce6d6dda4a09b139997a44c62e8a958',
 'output_file_name': 'outfile.fasta',
 'hash': '0ce6d6dda4a09b139997a44c62e8a958',
 'input_file': '/exports/molepi/tgehrmann/pipeline_runs/temporary_input/myPipeline/digest.0484e81d06666d4ca8d63b3a12bc77cc'}

In [6]:
newFasta = mp.getOutput()

print(newFasta)

for seq in newFasta:
    print('>%s\n%s' % (seq, newFasta[seq].seq))

Fasta object
 Where: /exports/molepi/tgehrmann/pipeline_runs/myPipeline/0ce6d6dda4a09b139997a44c62e8a958/outfile.fasta
 Entries: 6
 Primary type: dna

>HSGLTH1
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACCTAAGACGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGTCCCGGGATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGACCCCTGGCCGGTCCGCGCAGGCGCAGCGGGGTCGCAGGGCGCGGCGGGTTCCAGCGCGGGGATGGCGCTGTCCGCGGAGGACCGGGCGCTGGTGCGCGCCCTGTGGAAGAAGCTGGGCAGCAACGTCGGCGTCTACACGACAGAGGCCCTGGAAAGGTGCGGCAGGCTGGGCGCCCCCGCCCCCAGGGGCCCTCCCTCCCCAAGCCCCCCGGACGCGCCTCACCCACGTTCCTCTCGCAGGACCTTCCTGGCTTTCCCCGCCACGAAGACCTACTTCTCCCACCTGGACCTGAGCCCCGGCTCCTCACAAGTCAGAGCCCACGGCCAGAAGGTGGCGGACGCGCTGAGCCTCGCCGTGGAGCGCCTGGACGACCTACCCCACGCGCTGTCCGCGCTGAGCCACCTGCACGCGTGCCAGCTGCGAGTGGACCCGGCCAGCTTCCAGGTGAGCGGCTGCCGTGCTGGGCCCCTGTCCCCGGGAGGGCCCCGGCGGGGTGGGTGCGGGGGGCGTGCGGGGCGGGTGCAGGCGAGTGAGCCTTGAGCGCTCGCCGCAGCTCCTGGGCCACTGCCTGCTGGTAACCCTCGCCCGGCACTACCCCGGAGACTTCAGCCCCGCGCTGC

D: Fasta input source is file


## Using BIU in the pipeline
You can make use of your running BIU instance in the pipeline.
