# Snakemake Workshop 

Trevor Barnes <br>
ΔE+ Research Lab, SFU

## Overview
This presentation is intended to give an overview of the workflow management tool, [Snakemake](https://snakemake.readthedocs.io/en/stable/). 

## Audience 
Examples in this presentation are targeted at energy system modellers who are comfortable working with Python. This workshop will use the [OSeMOSYS](http://www.osemosys.org/) Energy Modelling tool in the examples. With that said, there is no background knowledge needed in energy system modelling to follow this example.

## Motivation 
"The Snakemake workflow management system is a tool to create reproducible and scalable data analyses." ([ref](https://snakemake.readthedocs.io/en/stable/)). Especially in the field of open-source energy modelling, ensuring data is reproducible is often a requirement for publication. Introducing a tool to manage your data pipelines ensures your work is consistently reproducible and allows others to easily reproduce your results. 


## Similar tools
Snakemake is a Python-based workflow management tool that originated in the field of bioinformatics. Many alternatives exist, such as [nextflow](https://www.nextflow.io/), [airflow](https://github.com/apache/airflow), [galaxy](https://usegalaxy.org/), or [mage](https://www.mage.ai/), often with specific use cases or audiences. 

## References
Throughout this workshop, I will reference information from the following sources:
- [Snakemake homepage](https://snakemake.github.io/)
- [Snakemake paper](https://f1000research.com/articles/10-33)
- [Snakemake docs](https://snakemake.readthedocs.io/en/stable/index.html)
- [Reproducible Data Analytic Workflows](https://lachlandeer.github.io/snakemake-econ-r-tutorial/)

# Background information 

## Common Terms 
- **Workflow**: A full data processing pipeline (a series of linked actions)
- **Rule**: Rules decompose the workflow into small steps (for example, running a single script) 
- **Directed Acyclic Graph (DAG)**: Defines how rules link together
- **Wildcard**: Method to generalize a rule
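To make the link between rules and the DAG concrete, here is a minimal sketch in plain Python (not Snakemake itself). The rule names and file paths are hypothetical. Each rule is connected to whichever rule produces its input files, and a topological sort then gives one valid execution order. Snakemake's real scheduler does much more, so treat this only as an illustration of the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: each maps a rule name to (input files, output files).
RULES = {
    "demand": (["resources/demand.csv"], ["results/demand.csv"]),
    "variable_cost": (["results/demand.csv"], ["results/variable_cost.csv"]),
    "plot": (["results/variable_cost.csv"], ["results/plot.png"]),
}


def build_dag(rules):
    """Map each rule to the set of rules that produce its inputs."""
    producers = {out: name for name, (_, outs) in rules.items() for out in outs}
    return {
        name: {producers[f] for f in inputs if f in producers}
        for name, (inputs, _) in rules.items()
    }


def execution_order(rules):
    """Return one valid order in which the rules can run."""
    return list(TopologicalSorter(build_dag(rules)).static_order())


print(execution_order(RULES))
```

Files that no rule produces (such as `resources/demand.csv` here) are treated as existing source data, which is why `demand` has no upstream rule.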

## Rule Structure 

### Basics
Rules must follow a strict formatting guide. At minimum, a rule must have an "action" associated with it. In this context, an "action" will be a command invoked by a line of code. 

```python
rule hello_world:
    shell:
        'echo "Hello World!"'
```

### Inputs and Outputs 

More often, a rule will also have one or more input and output files. 

```python 
rule name: 
    input: 
        "input/file/path.csv"
    output:
        "output/file/path.csv"
    shell:
        "python script.py {input} {output}"
```

### Actions

Actions can be more than just shell commands! They can also run Python scripts or execute Python code directly. For example, all three of these rules are interchangeable (assuming `hello_world.py` only prints "hello world").

```python
rule hello_world:
    shell:
        "python hello_world.py"
```
<br>
```python 
rule name: 
    run:
        print("hello world")
```
<br>
```python 
rule name: 
    script:
        "scripts/hello_world.py"
```

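Note that when using the `script:` directive, the directory structure must be set up correctly, because the script path is interpreted relative to the file that contains the rule. Moreover, Snakemake makes the rule's properties (such as the input and output files) available inside the Python script through a `snakemake` object.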
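As a rough sketch of what a script run through `script:` might look like, the hypothetical `scripts/hello_world.py` below writes a greeting to the rule's first output file. It assumes the rule declares one output file, which the examples above do not; the function name `write_greeting` is also made up for illustration. The `snakemake` object only exists when Snakemake runs the file, so the sketch guards against its absence.

```python
from pathlib import Path


def write_greeting(output_path, greeting="hello world"):
    """Write a greeting to the output file and return the text written."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(greeting + "\n")
    return greeting


# Snakemake injects an object named `snakemake` when this file is run through
# the `script:` directive. The check keeps the file usable outside Snakemake.
if "snakemake" in globals():
    write_greeting(snakemake.output[0])
```

Keeping the logic in a normal function makes the script easy to test on its own, while the last two lines are the only Snakemake-specific part.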

### Other Commands 

Rules can have many attributes associated with them (see all [here](https://snakemake.readthedocs.io/en/stable/snakefiles/writing_snakefiles.html#grammar)). For example, the following rule is valid.  

```python 
rule create_capex:
    message:
        "Creating Capital Costs Data"
    input:
        demand="SpecifiedAnnualDemand.csv",
        emissions="AnnualEmissionLimit.csv"
    output:
        capex="CapitalCosts.csv"
    params:
        scale_factor=2,
    threads: 4
    log:
        "logs/create_capex.log"
    conda:
        "envs/osemosys.yaml"
    script:
        "scripts/capex.py"
```

### Target Rule

The **Target Rule** is the rule that Snakemake works towards. It lists the final files you want the workflow to produce as its `input` files. This rule is usually called `all` and should not have any `output` files. More information [here](https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-7-adding-a-target-rule).

```python 
rule all:
    input:
        "results/plot.png"
```

## Running a Workflow

Snakemake workflows are run from the command line. They can be run on the cloud or on HPC infrastructure; however, we will only cover local execution in this workshop. Full command line options can be found [here](https://snakemake.readthedocs.io/en/stable/executing/cli.html#command-line-interface).

To run a workflow simply type the following into the command line (assuming you have snakemake installed in your conda environment) 

```bash
snakemake --cores 4
```

This will run the workflow with (up to) 4 cores. You must always specify the number of cores to run the workflow with. A shorthand to use all available cores is the command 

```bash 
snakemake -c
```

## Directory Structure 

See [here](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html) for full information, but in general you need to structure your project as shown.  

```bash
├── .gitignore
├── README.md
├── LICENSE.md
├── workflow
│   ├── rules
│   │   ├── module1.smk
│   │   └── module2.smk
│   ├── envs
│   │   ├── tool1.yaml
│   │   └── tool2.yaml
│   ├── scripts
│   │   ├── script1.py
│   │   └── script2.R
│   ├── notebooks
│   │   ├── notebook1.py.ipynb
│   │   └── notebook2.r.ipynb
│   ├── report
│   │   ├── plot1.rst
│   │   └── plot2.rst
│   └── Snakefile
├── config
│   ├── config.yaml
│   └── some-sheet.tsv
├── results
└── resources
```

# Workflow Overview

## Model 

Suppose we have a three-technology OSeMOSYS model. It consists of a hydro power plant, a solar power plant, and a natural gas power plant that all contribute to a single electricity demand. The model is shown below. 

![Model](workshop/images/model.png)


## Workflow

The workflow for your model consists of four steps: 
1. Run a series of Python scripts to generate scenario data for annual demand, capital costs, emission penalties, and variable fuel costs
2. Convert all CSV data into a GNU MathProg datafile using [otoole](https://otoole.readthedocs.io/en/latest/)
3. Solve the model through GLPK
4. Visualize the total annual installed capacity

Within this workflow, all other parameter data is held constant. Each scenario you intend to run depends on user-defined values that affect the parameters described in step one (e.g. an increased electrification scenario, a cheap solar scenario, etc.). A graphical overview of the workflow is given below. Note that the variable cost script must be run after the annual demand script. 

<img src="workshop/images/workflow.png" width="600" height="600">

# Classical Workflow

## Steps to Manually Run the Workflow 

Note that all scripts mentioned can be found in the `workflow/scripts/` folder

### 1. Create a Scenario Folder 
```bash 
mkdir scenarios 
```

### 2. Copy the reference data 
```bash 
cp -r resources/data scenarios/data
```

### 3. Run the capital costs script
In this case we will implement a 1.5 scaling factor on solar panels

```bash 
python workflow/scripts/capital_costs.py scenarios/data/CapitalCosts.csv scenarios/data/CapitalCosts.csv "SPV" 1.5
```

### 4. Run the emission penalty script
In this case we will create a linear increase from 0 to 100 $/T

```bash 
python workflow/scripts/emission_penalty.py scenarios/data/EmissionsPenalty.csv scenarios/data/EmissionsPenalty.csv 0 100
```

### 5. Run the demand script
In this case we will scale the demand by 2

```bash 
python workflow/scripts/demand.py scenarios/data/SpecifiedAnnualDemand.csv scenarios/data/SpecifiedAnnualDemand.csv 2
```

### 6. Run the variable costs script
Note that this script **must** be run after the demand script 

```bash 
python workflow/scripts/variable_costs.py scenarios/data/SpecifiedAnnualDemand.csv scenarios/data/VariableCosts.csv scenarios/data/VariableCosts.csv
```

### 7. Use otoole to create a datafile 
otoole will simply convert CSV data into a GNU MathProg datafile.  

```bash 
otoole convert csv datafile scenarios/data scenarios/data.txt resources/config.yaml
```

### 8. Create a folder to hold result data

```bash 
mkdir scenarios/results
```

### 9. Run GLPK to build the model 
OSeMOSYS requires the current working directory to have a `results`
folder when solving through GLPK, so we first change directories to 
the scenario folder 

```bash 
cd scenarios
glpsol -m ../resources/osemosys.txt -d data.txt
cd ..
```

### 10. Run capacity plotting script 

```bash 
python workflow/scripts/plot_capacity.py scenarios/results/TotalCapacityAnnual.csv scenarios/AnnualCapacity.png
```

### 11. View Results 

The `AnnualCapacity.png` file can be found in the `scenarios` folder.

![Results](workshop/images/example-result.png)

## Issues with Classical Workflow 
In many cases we will run each of these scenarios one by one, manually changing data in the scripts.
- Running many scenarios becomes a nightmare 
- Easy to make data mistakes 
- Hard to replicate results 

# Snakemake Workflow 

## Creating the workflow 

The solution to this workflow can be found in `workshop/solutions/snakemake-1`

### 1. Create a snakefile

Let's automate this process with a Snakemake workflow! Snakemake will look for a file (called a `snakefile`) located either in the root directory or the `workflow` directory. This file has already been created at `workflow/snakefile`. The `snakefile` will hold all the logic in the workflow. 

Note, while the file does not have to be called `snakefile`, Snakemake will automatically look for a file called `snakefile` unless specified otherwise

### 2. Import libraries

In the `snakefile`, we import libraries as we normally would in Python. 

```python
import os
import shutil
from snakemake.utils import min_version
min_version("6.0")
```

### 3. Create a "target" rule
In the `snakefile`, create a target rule. This rule holds the final outcome of our workflow, in this case `AnnualCapacity.png`. 

```python
rule all:
    input:
        "results/scenario/AnnualCapacity.png"
```

### 4. Create the rules to execute the scripts 

```python
rule capital_cost:
    input:
        "resources/data/CapitalCosts.csv"
    output:
        "results/scenario/data/CapitalCosts.csv"
    params:
        technology = "SPV",
        scaling_factor = 1.5
    shell: 
        "python workflow/scripts/capital_costs.py {input} {output} {params.technology} {params.scaling_factor}"
```

```python
rule emission_penalty:
    input:
        "resources/data/EmissionsPenalty.csv"
    output:
        "results/scenario/data/EmissionsPenalty.csv"
    params:
        start = 0,
        end = 100
    shell: 
        "python workflow/scripts/emission_penalty.py {input} {output} {params.start} {params.end}"
```

```python
rule demand:
    input:
        "resources/data/AnnualDemand.csv"
    output:
        "results/scenario/data/AnnualDemand.csv"
    params:
        scaling_factor = 2,
    shell: 
        "python workflow/scripts/demand.py {input} {output} {params.scaling_factor}"
```

```python
rule variable_cost:
    input:
        var_cost = "resources/data/VariableCosts.csv",
        demand = "results/scenario/data/AnnualDemand.csv"
    output:
        "results/scenario/data/VariableCosts.csv"
    shell: 
        "python workflow/scripts/variable_costs.py {input.demand} {input.var_cost} {output}"
```

Note that the `variable_cost` rule must take in the output from the `demand` rule. 


### 5. Copy reference data rule

In general, Snakemake rule outputs should be unique, meaning that the same file shouldn't be created through multiple rules (unless other conditions are imposed). Therefore, we need to be careful not to create a rule that also outputs the files `AnnualDemand.csv`, `CapitalCosts.csv`, `EmissionsPenalty.csv`, and `VariableCosts.csv`. Note that outputs do not necessarily have to be unique, but you will need to ensure you manage `AmbiguousRuleException` errors (more info [here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules)). 

We first create constants of the files we need: 

```python
OSEMOSYS_CSVS = os.listdir("resources/data")
CSVS_TO_CREATE = [
    "AnnualDemand.csv",
    "EmissionsPenalty.csv",
    "VariableCosts.csv",
    "CapitalCosts.csv"
]
CSVS_TO_COPY = [f for f in OSEMOSYS_CSVS if f not in CSVS_TO_CREATE]
```

Then we create a rule to copy over the files in the `CSVS_TO_COPY` list. Note here that we are directly executing Python code in the rule through the `run:` directive. 

```python 
rule copy_csv_files:
    input:
        expand("resources/data/{csv}", csv=OSEMOSYS_CSVS)
    output:
        expand("results/scenario/data/{csv}", csv=CSVS_TO_COPY)
    params:
        folder = "results/scenario/data"  # plain string; directory() is only meant for outputs
    run:
        for path in input:
            _, f = os.path.split(path) # f will be in form of "file.csv"
            if f in CSVS_TO_CREATE:
                continue
            shutil.copy(path, os.path.join(params.folder, f))
```

### 6. Create the datafile  
```python 
rule otoole:
    input:
        expand("results/scenario/data/{csv}", csv=OSEMOSYS_CSVS)
    output:
        "results/scenario/data.txt"
    params:
        csv_dir = "results/scenario/data",
        config="resources/config.yaml"
    shell:
        "otoole convert csv datafile {params.csv_dir} {output} {params.config}"
```

### 7. Solve the model rule

When solving OSeMOSYS models with GLPK, a `results` directory must, by default, exist in the working directory. The shell code therefore changes the working directory to the scenario folder before invoking `glpsol`.

```python 
rule solve:
    input: 
        "results/scenario/data.txt"
    output:
        "results/scenario/results/TotalCapacityAnnual.csv"
    params:
        model="resources/osemosys.txt"
    shell:
        """
        FILE="{input}" &&
        f="$(basename -- $FILE)" &&
        cd results/scenario &&
        glpsol -m ../../{params.model} -d $f
        """
```

### 8. Plot results rule
```python
rule plot:
    input:
        "results/scenario/results/TotalCapacityAnnual.csv"
    output:
        "results/scenario/AnnualCapacity.png"
    shell:
        "python workflow/scripts/plot_capacity.py {input} {output} 'Scenario'"
```

### 9. Create missing directories rule

Throughout this workflow, it is good practice to ensure that directories exist before running commands. We can do this using the `directory()` flag in Snakemake, which marks a rule output as a directory rather than a file. 

```python
rule create_scenario_dir:
    output:
        directory("results/scenario/data")
    shell:
        "mkdir -p {output}"
```
<br>
```python
rule results_dir:
    input:
        "results/scenario"
    output:
        directory("results/scenario/results")
    shell:
        "mkdir {output}"
```

## Visualize the workflow 
Before executing the workflow, let's visualize it! This will create the DAG and save it as `dag.pdf` in the root directory.

```bash
snakemake --dag all | dot -Tpdf > dag.pdf
```

![dag-1](workshop/images/snakemake-1.png)

## Execute the workflow 
We can now FINALLY execute the workflow! Let's run it using all available cores. 

```bash 
snakemake -c
```

At the start of the run, Snakemake will print out all the rules that it will run. 

```bash
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
all                     1              1              1
capital_cost            1              1              1
copy_csv_files          1              1              1
demand                  1              1              1
emission_penalty        1              1              1
otoole                  1              1              1
plot                    1              1              1
solve                   1              1              1
variable_cost           1              1              1
total                   9              1              1
```

After running, we can check the directory structure for our results. 

![folder-0.png](workshop/images/folder-structure-0.png)

## Issues 
We are still hardcoding variable values. This is a reproducible workflow, but not a flexible workflow. Let's fix that with wildcards! 

# Wildcards

The solution to this workflow can be found in `workshop/solutions/snakemake-2`

Usually, it is useful to generalize a rule to be applicable to a number of e.g. datasets. For this purpose, wildcards can be used. Automatically resolved multiple named wildcards are a key feature and strength of Snakemake in comparison to other systems. (See more [here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards)). 

Wildcards are identified by enclosing them in curly brackets. For example `{scenario}` would be interpreted as a wildcard to `Snakemake`. Note that the target rule can **not** include wildcards. We need to explicitly tell Snakemake what files we want to create in the target rule. 
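The sketch below gives a rough feel for how a requested file path can be matched against an output pattern to recover wildcard values. It is plain Python, not Snakemake's actual implementation: the helper names are hypothetical, each wildcard is assumed to appear only once in the pattern, and every wildcard is matched with the regular expression `.+`, which is Snakemake's default when no constraint is given.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'results/{scenario}/plot.png' into a regex with named groups."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Even indices are literal text, odd indices are wildcard names.
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(f"^{regex}$")


def match_wildcards(pattern, path):
    """Return the wildcard values for `path`, or None if it does not match."""
    match = pattern_to_regex(pattern).match(path)
    return match.groupdict() if match else None


print(match_wildcards("results/{scenario}/AnnualCapacity.png",
                      "results/Trevor/AnnualCapacity.png"))
```

This is also why the target rule needs concrete file names: without a requested path there is nothing to match the pattern against, so the wildcard has no value.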

## Replace `scenario` as a wildcard

### 1. Change scenario in all non-target rules

In general, the following rule shows how to add a wildcard.

```python
rule capital_cost:
    input:
        "resources/data/CapitalCost.csv"
    output:
        "results/{scenario}/data/CapitalCost.csv"
    params:
        technology = "SPV",
        scaling_factor = 1.5
    shell: 
        "python workflow/scripts/capital_costs.py {input} {output} {params.technology} {params.scaling_factor}"
```

If your rule includes another function (such as `expand()`), enclose the wildcard in additional curly braces.

```python
rule otoole:
    input:
        expand("results/{{scenario}}/data/{csv}", csv=OSEMOSYS_CSVS)
    output:
        "results/{scenario}/data.txt"
    params:
        csv_dir = "results/{scenario}/data",
        config="resources/config.yaml"
    shell:
        "otoole convert csv datafile {params.csv_dir} {output} {params.config}"
```
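The effect of the doubled braces in the `otoole` rule above can be mimicked in plain Python. The helper `expand_sketch` below is a hypothetical stand-in for Snakemake's `expand()`, built on `str.format` and `itertools.product`; it is only meant to show that `{csv}` is filled in immediately while `{{scenario}}` survives as the wildcard `{scenario}`.

```python
from itertools import product


def expand_sketch(pattern, **values):
    """Fill `pattern` with every combination of the keyword values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[name] for name in names))
    ]


paths = expand_sketch(
    "results/{{scenario}}/data/{csv}",
    csv=["CapitalCost.csv", "EmissionsPenalty.csv"],
)
print(paths)
```

Each resulting path still contains `{scenario}`, which Snakemake resolves later when it matches the rule against a requested file.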

If your wildcard needs to be accessed in a shell command (or input function), explicitly identify it with `{wildcards.scenario}`.

```python
rule plot:
    input:
        "results/{scenario}/results/TotalCapacityAnnual.csv"
    output:
        "results/{scenario}/AnnualCapacity.png"
    shell:
        "python workflow/scripts/plot_capacity.py {input} {output} {wildcards.scenario}"
```

### 2. Update the Target Rule

Note that we now need to explicitly tell Snakemake which files it should generate. 

```python
SCENARIOS = ["Kamaria", "Teddy", "Pierre", "Narges", "Yalda", "Trevor", "Elias"]

rule all:
    input:
        expand("results/{scenario}/AnnualCapacity.png", scenario=SCENARIOS)
```

## Visualize the workflow 

```bash
snakemake --dag all | dot -Tpdf > dag.pdf
```

How do you expect it to change? 

![dag-2](workshop/images/snakemake-2.png)

## Run the workflow 

```bash 
snakemake -c
```

We can see that 57 steps are needed to complete the workflow 

```bash
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
all                     1              1              1
capital_cost            7              1              1
copy_csv_files          7              1              1
demand                  7              1              1
emission_penalty        7              1              1
otoole                  7              1              1
plot                    7              1              1
solve                   7              1              1
variable_cost           7              1              1
total                  57              1              1
```

Let's look at how our folder structure changed! 

![folder-1.png](workshop/images/folder-structure-1.png)

![folder-2.png](workshop/images/folder-structure-2.png)

## Issues 

We are hardcoding scenario values into our `snakefile`. This is bad practice. Let's create a configuration file! 

# Configuration File

The solution to this workflow can be found in `workshop/solutions/snakemake-3` and `workshop/solutions/config-1`

## Configuration Setup 
First, set up a `config/config.yaml` file

```yaml
scenarios:
  scenario_one:
    capex:
      tech: "SPV"
      scale: 1.5
    emission_penalty:
      start: 0
      end: 100
    demand:
      scale: 2
  scenario_two:
    capex:
      tech: "HYD"
      scale: 5
    emission_penalty:
      start: 0
      end: 10
    demand:
      scale: 3
  scenario_three:
    ...
    ...
```

## Import the configuration file  

In the `snakefile`, add the following line

```python
configfile: "config/config.yaml"
```

Snakemake will automatically parse this file into a dictionary that is available through the variable `config`.

## Update the workflow 

First, update the scenario names in the target rule.

```python 
configfile: "config/config.yaml"
SCENARIOS = [x for x in config["scenarios"]]

rule all:
    input:
        expand("results/{scenario}/AnnualCapacity.png", scenario=SCENARIOS)
```

Next, update the `params` in the rules that call the Python scripts. Note that wildcard values are not yet known when the `input`, `output`, and `params` sections are parsed, so you cannot use them directly in expressions there. Therefore, we use input functions (here written as `lambda` expressions), which Snakemake calls with the matched wildcards. Exploring input functions is outside the scope of this workshop, but more information can be found [here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#input-functions).

```python
rule capital_cost:
    input:
        "resources/data/CapitalCost.csv"
    output:
        "results/{scenario}/data/CapitalCost.csv"
    params:
        technology = lambda wildcards: config["scenarios"][wildcards.scenario]["capex"]["tech"],
        scaling_factor = lambda wildcards: config["scenarios"][wildcards.scenario]["capex"]["scale"],
    shell: 
        "python workflow/scripts/capital_costs.py {input} {output} {params.technology} {params.scaling_factor}"
```

```python 
rule emission_penalty:
    input:
        "resources/data/EmissionsPenalty.csv"
    output:
        "results/{scenario}/data/EmissionsPenalty.csv"
    params:
        start = lambda wildcards: config["scenarios"][wildcards.scenario]["emission_penalty"]["start"],
        end = lambda wildcards: config["scenarios"][wildcards.scenario]["emission_penalty"]["end"],
    shell: 
        "python workflow/scripts/emission_penalty.py {input} {output} {params.start} {params.end}"
```

```python 
rule demand:
    input:
        "resources/data/SpecifiedAnnualDemand.csv"
    output:
        "results/{scenario}/data/SpecifiedAnnualDemand.csv"
    params:
        scaling_factor = lambda wildcards: config["scenarios"][wildcards.scenario]["demand"]["scale"],
    shell: 
        "python workflow/scripts/demand.py {input} {output} {params.scaling_factor}"
```
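The `lambda` expressions in these rules are ordinary Python functions that receive the matched wildcards and return a value. The sketch below imitates that lookup outside of Snakemake: the `config` dictionary is a hand-written stand-in for the parsed YAML file, `SimpleNamespace` fakes the `wildcards` object, and the helper `scenario_param` is a hypothetical way to avoid repeating the same long lookup in every rule.

```python
from types import SimpleNamespace

# Stand-in for the dictionary Snakemake builds from config/config.yaml.
config = {
    "scenarios": {
        "scenario_one": {"capex": {"tech": "SPV", "scale": 1.5}},
        "scenario_two": {"capex": {"tech": "HYD", "scale": 5}},
    }
}


def scenario_param(section, key):
    """Return a function that looks up a config value for the current scenario."""
    return lambda wildcards: config["scenarios"][wildcards.scenario][section][key]


# Snakemake would supply `wildcards`; here it is faked for illustration.
wildcards = SimpleNamespace(scenario="scenario_two")
print(scenario_param("capex", "tech")(wildcards))
```

In a rule, such a helper could be used as `technology = scenario_param("capex", "tech")`, which behaves like the inline `lambda` shown above.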

## Execute the Workflow 

![dag-2](workshop/images/snakemake-3.png)

```bash
snakemake -c
```
<br>
```bash
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
all                     1              1              1
capital_cost            3              1              1
copy_csv_files          3              1              1
demand                  3              1              1
emission_penalty        3              1              1
otoole                  3              1              1
plot                    3              1              1
solve                   3              1              1
variable_cost           3              1              1
total                  25              1              1

Select jobs to execute...
```

## Issues

We have now created a generalized, flexible, and reproducible workflow! <br>

However, what happens if we want to iterate over LOTS of different parameter values? We could implement logic at the start of the `snakefile` to generate many different permutations of values. OR we can introduce more wildcards! 


# More Wildcards! 

The solution to this workflow can be found in `workshop/solutions/snakemake-4` and `workshop/solutions/config-2`

## File path structure

Let's generalize all our parameter values and use them to build our scenario names. Our scenarios will follow the structure below. Note that we no longer have `scenario` as a wildcard.

```bash
d{scale_factor}/capex_{tech}{scale}/ep{start}/ep{end}/
```

So for example, results for the following scenario will be in the folder:
- Scale demand by 2
- Scale solar capital costs by 0.5
- Starting emission penalty of 0
- Ending emission penalty of 100

<br>
```bash
results/d2/capex_SPV0.5/ep0/ep100/AnnualCapacity.png
```
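As a small illustration of this naming scheme, the hypothetical helpers below build a scenario folder from parameter values and parse the values back out of a path. The regular expression is an assumption made for this sketch: it restricts the technology to capital letters and the numbers to digits and dots so that the adjacent `{tech}` and `{capex_scale}` parts can be told apart.

```python
import re


def scenario_dir(d_scale, tech, capex_scale, ep_start, ep_end):
    """Build the results folder for one combination of parameter values."""
    return f"results/d{d_scale}/capex_{tech}{capex_scale}/ep{ep_start}/ep{ep_end}"


# Assumed formats: tech is upper-case letters, all other values are numeric.
SCENARIO_RE = re.compile(
    r"results/d(?P<d_scale>[\d.]+)/capex_(?P<tech>[A-Z]+)(?P<capex_scale>[\d.]+)"
    r"/ep(?P<ep_start>[\d.]+)/ep(?P<ep_end>[\d.]+)"
)


def parse_scenario_dir(path):
    """Recover the parameter values (as strings) from a results path."""
    match = SCENARIO_RE.match(path)
    return match.groupdict() if match else None


path = scenario_dir(2, "SPV", 0.5, 0, 100) + "/AnnualCapacity.png"
print(path)
print(parse_scenario_dir(path))
```

Note that the parsed values come back as strings, which is also how wildcard values arrive in Snakemake rules.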

## Configuration File

We will also need to update our configuration file. We remove scenario names and replace them with ranges for our parameters to iterate over. 

```yaml
scenarios:
  capex:
    techs: ["SPV", "HYD", "GAS"]
    scale: [0.25, 4]
  emission_penalty:
    start: [0, 25]
    end: [25, 100]
  demand:
    start: 0.5
    end: 4.5
    step: 1
```

## Target rule update

```python
import numpy as np

configfile: "config/config.yaml"

TECHS = config["scenarios"]["capex"]["techs"]
CAPEX_SCALES = config["scenarios"]["capex"]["scale"]
EP_STARTS = config["scenarios"]["emission_penalty"]["start"]
EP_ENDS = config["scenarios"]["emission_penalty"]["end"]
D_SCALES = list(np.arange(
    config["scenarios"]["demand"]["start"], 
    config["scenarios"]["demand"]["end"], 
    config["scenarios"]["demand"]["step"])
)

rule all:
    input:
        expand("results/d{d_scale}/capex_{tech}{capex_scale}/ep{ep_start}/ep{ep_end}/AnnualCapacity.png", 
            d_scale=D_SCALES,
            tech=TECHS,
            capex_scale=CAPEX_SCALES,
            ep_start=EP_STARTS,
            ep_end=EP_ENDS,
        )
```


## Rule updates

Repeat this structure for all rules!

```python
rule capital_cost:
    input:
        "resources/data/CapitalCost.csv"
    output:
        "results/d{d_scale}/capex_{tech}{capex_scale}/ep{ep_start}/ep{ep_end}/data/CapitalCost.csv"
    shell: 
        "python workflow/scripts/capital_costs.py {input} {output} {wildcards.tech} {wildcards.capex_scale}"
```

## Run the workflow 

```bash
snakemake -c
```

```bash
Job stats:
job                 count    min threads    max threads
----------------  -------  -------------  -------------
all                     1              1              1
capital_cost           72              1              1
copy_csv_files         72              1              1
demand                 72              1              1
emission_penalty       72              1              1
otoole                 72              1              1
plot                   72              1              1
solve                  72              1              1
variable_cost          72              1              1
total                 577              1              1

Select jobs to execute...
```

In the rule printouts, you can also see which wildcard values are being substituted into the rule. 

```bash
[Fri Mar 31 17:46:20 2023]
rule capital_cost:
    input: resources/data/CapitalCost.csv
    output: results/d2.5/capex_SPV2/ep50/ep100/data/CapitalCost.csv
    jobid: 1526
    reason: Missing output files: results/d2.5/capex_SPV2/ep50/ep100/data/CapitalCost.csv
    wildcards: d_scale=2.5, tech=SPV, capex_scale=2, ep_start=50, ep_end=100
    resources: tmpdir=/tmp
```

## Issues 

While this works, do we really need to run all of the Python scripts for each new scenario? No! The scenarios share data and are just different permutations of it! Let's expand our workflow to make it more efficient! 

# Optimize the workflow

The solution to this workflow can be found in `workshop/solutions/snakemake-5` and `workshop/solutions/config-2`

## Update the rules 

### Add a temporary directory

Note that we could also use Snakemake's built-in `temp()` flag (see [here](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#protected-and-temporary-files) for more info).

```python
rule create_temp_dir:
    output:
        directory("results/temp")
    shell:
        "mkdir -p {output}"
```

### Update Rules

```python
rule demand:
    input:
        "resources/data/SpecifiedAnnualDemand.csv"
    output:
        "results/temp/demand_{d_scale}.csv"
    shell: 
        "python workflow/scripts/demand.py {input} {output} {wildcards.d_scale}"
```
<br>
```python
rule copy_demand_data:
    input:
        "results/temp/demand_{d_scale}.csv"
    output:
        "results/d{d_scale}/capex_{tech}{capex_scale}/ep{ep_start}/ep{ep_end}/data/SpecifiedAnnualDemand.csv"
    shell:
        "cp {input} {output}"
```

The same process applies to Emissions Penalty, Variable Costs, and Capital Costs.

## Performance improvements

By slightly changing the order, we reduced the number of times Python scripts had to run from 72 * 4 = 288 to just 17. That's a reduction of 94%!


```bash
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                   count    min threads    max threads
------------------  -------  -------------  -------------
all                       1              1              1
capital_cost              6              1              1
copy_capex_data          72              1              1
copy_csv_files           72              1              1
copy_demand_data         72              1              1
copy_emission_data       72              1              1
copy_var_cost_data       72              1              1
demand                    4              1              1
emission_penalty          3              1              1
otoole                   72              1              1
plot                     72              1              1
solve                    72              1              1
variable_cost             4              1              1
total                   594              1              1
```
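The 94% figure can be checked with a few lines of arithmetic. The counts below are copied from the job tables in this workshop: before the change, each of the four script rules ran 72 times; afterwards they run 6, 4, 3, and 4 times.

```python
SCRIPT_RULES = ["capital_cost", "demand", "emission_penalty", "variable_cost"]

# Script runs before the optimization: every script rule ran once per scenario.
before = {rule: 72 for rule in SCRIPT_RULES}

# Script runs after the optimization, taken from the job table above.
after = {"capital_cost": 6, "demand": 4, "emission_penalty": 3, "variable_cost": 4}

runs_before = sum(before.values())
runs_after = sum(after.values())
reduction = 1 - runs_after / runs_before

print(runs_before, runs_after, f"{reduction:.0%}")
```

The total job count in the table is higher than before because of the extra copy rules, but copying a file is far cheaper than re-running a script.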

# Conclusions


## Benefits 
Snakemake gives you the ability to easily create flexible, scalable, and reproducible workflows without leaving Python. 

## Other Functionality 
- Snakemake scales easily to cloud and HPC infrastructure
- Lots of other functionality (input functions, docker, flags, rule order, temporary files, modularity, Jupyter integration, wrappers, remote files, logging, wildcard constraints, etc.)
- Create scenarios from csv files rather than yaml files 
- Powerful for uncertainty analysis, sensitivity analysis, and scenario analysis 
- Very easy way to make your work accessible and reproducible (for you and others!)

## Helpful Rules 

```python
rule clean:
    shell:
        "rm -rf results/*"
```
<br>
```python
rule make_dag:
    shell:
        "snakemake --dag all | dot -Tpdf > dag.pdf"
```

## Discussion!