# Comparative BDSKY-serial Workflow


This workflow notebook compares BDSKY-Serial BEAST 2 analyses performed on different sets of sequences.

<details>
    <summary>Click Here to See a Description of Parameters</summary>
        <pre>
            <code>

Running an Instance of this Workflow
-------------------------------------------
overall_save_dir: str
    Path to where you are saving all the runs of this workflow.

specific_run_save_dir: str, optional
    Sub-directory of overall_save_dir in which to save the outputs from this workflow.
    If None, 'None' or an empty string, a timestamp of format 'YYYY-MM-DD_hour-min-sec' is used instead.

max_threads: int, default None
    The maximum number of threads to use when calling GNU Parallel in phases 2i and 4. If None and BEAST_pype is running
    in a SLURM job, the SLURM environment variable `SLURM_CPUS_PER_TASK` is used. If None and BEAST_pype is NOT running in
    a SLURM job, the number of cores available minus 1 is used (`multiprocessing.cpu_count() - 1`).

kernel_name: str, default 'beast_pype'
    Name of the Jupyter Python kernel to use when running the workflow. This is also the name of the conda environment
    to use in phases 2ii and 4 (as these Jupyter notebooks use the `bash` kernel).

General Inputs
----------------
template_xml_path: str
    Path to template BEAST 2 xml.

fasta_path: str
    Path to fasta file containing sequences to be placed into template xml.

metadata_path: str
    Path to CSV or TSV file containing metadata for sequences in fasta_path.

sample_id_field: str
    Name of field in metadata_db containing sequence IDs.

collection_date_field: str
    Name of field in metadata_db containing collection dates of sequences. Should be formatted YYYY-MM-DD.

Defining XML Sets (Partitioning Data)
--------------------------------------------
xml_set_definitions : dict {str: str}
    The definitions for the xml_sets you wish to use.
    Keys:   The name used for the xml_set. It will be used to name directories, so certain characters should be
            avoided, see https://www.mtu.edu/umc/services/websites/writing/characters-avoid/.
    Values: Used with pandas `DataFrame.query` to separate out your data, see:
                * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
                * https://sparkbyexamples.com/pandas/pandas-dataframe-query-examples/
                * https://www.slingacademy.com/article/pandas-working-with-the-dataframe-query-method-5-examples/

data_filter: str
    Optional; can be an empty string, None or 'None'. Additional filter applied to metadata_db when selecting
    sequences and metadata to be used in the pipeline. Must be a valid pandas `DataFrame.query` expression, see:
        * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
        * https://www.slingacademy.com/article/pandas-working-with-the-dataframe-query-method-5-examples/

Initial Tree Building & Downsampling
------------------
use_initial_tree:  bool, default True
    If False, an initial tree will not be generated, skipping Phases 2i and 2ii. In that case BEAST 2 generates its own
    initial tree in phase 4.

initial_tree_type: str (either 'Distance' or 'Temporal') or None, default 'Temporal'
    Initial tree type to use.
    If 'Distance', the IQ-TREE tree from Phase-2i-IQTree.ipynb is used for the
    initial tree and phase 2ii is skipped.
    If 'Temporal', the TreeTime tree from Phase-2ii-TreeTime-and-Down-Sampling.ipynb
    is used for the initial tree.

down_sample_to: int
    If the number of sequences in a fasta file is above this, the number of sequences is cut to this number via downsampling.

BDSky Options
------------------
origin_start_addition: float, optional
    Suggested infection period of the pathogen. **Should be in years.** This value plus the initial MLE tree height is used as the starting value of the origin.

origin_upper_addition: float/int, optional
    This value plus the initial MLE tree height is used as the upper bound of the origin prior.

origin_prior: dict {'lower': float, 'upper': float, 'start': float}, optional
    Details of the origin prior assumed to be uniformly distributed.

rt_dims: int, optional
    Number of Rt dimensions (time periods over which Rt is estimated).

rt_partitions: dict of strings {'unit': 'days, weeks or years', 'every': int/float, 'end': date str YYYY-MM-DD}, optional
    Instructions for setting the Rt change dates, going backwards from the youngest date provided in that xml_set's metadata until rt_partitions["end"] is reached.
    If rt_partitions["end"] is not given, the oldest date provided in that xml_set's metadata is used as the end point instead.
    rt_partitions["end"] should be a datetime object or a string of format 'YYYY-MM-DD'.
    If given, rt_dims must be None.

sampling_prop_dims: int, optional
    Number of sampling proportion dimensions (time periods over which the sampling proportion is estimated).

sampling_prop_partitions: dict of strings {'unit': 'days, weeks or years', 'every': int/float, 'end': date str YYYY-MM-DD}, optional
    Instructions for setting the sampling proportion change dates, going backwards from the youngest date provided in that xml_set's metadata until sampling_prop_partitions["end"] is reached.
    If sampling_prop_partitions["end"] is not given, the oldest date provided in that xml_set's metadata is used as the end point instead.
    sampling_prop_partitions["end"] should be a datetime object or a string of format 'YYYY-MM-DD'.
    If given, sampling_prop_dims must be None.

zero_sampling_before_first_sample: bool, default False
    If true fix the sampling proportion to 0 for the period before the first sample.


MCMC Tree/Logfile names Chain-lengths & Save Steps
------------------
log_file_basename: str, optional
    If provided .tree, .log and .state files from running BEAST 2 will have this name prefixed by 'run-with-seed-{seed}-'.

chain_length: int
    Length of each MCMC chain (number of MCMC steps) for BEAST runs.

trace_log_every: int
    How often (in MCMC steps) to write to the trace log file during BEAST runs.

tree_log_every: int
    How often (in MCMC steps) to write to the tree file during BEAST runs.

screen_log_every: int
    How often to output to screen during BEAST runs.

store_state_every: int 
    How often to store MCMC state during BEAST runs.


Running BEAST 2
--------------------
number_of_beast_runs: int
    Number of chains to use (repeated runs to do) when running BEAST.

seeds: list of ints
    Seeds to use when running BEAST.

beast_options_without_a_value: list of strs
    Options not requiring a value to pass to BEAST 2.
    For instance, to use a GPU when running BEAST 2 this would be `['-beagle_GPU']`.
    See https://www.beast2.org/2021/03/31/command-line-options.html.

beast_options_needing_a_value: dict
    Options requiring a value to pass to BEAST 2.
    For instance to use 3 threads when running BEAST 2 this would be: `{'-threads': 3}`.
    See https://www.beast2.org/2021/03/31/command-line-options.html.

sbatch_options_without_a_value: list of strs
    Options not requiring a value to pass to sbatch.
    See https://slurm.schedmd.com/sbatch.html.

sbatch_options_needing_a_value: dict
    Options requiring a value to pass to sbatch.
    See https://slurm.schedmd.com/sbatch.html.

  </code>
</pre>
</details>
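
The backwards stepping described for `rt_partitions` and `sampling_prop_partitions` can be sketched as below. This is only an illustration under stated assumptions, not beast_pype's implementation: `partition_change_dates` and `DAYS_PER_UNIT` are hypothetical names, a year is treated as 365 days, and the end point itself is assumed not to count as a change date.

```python
from datetime import date, datetime, timedelta

# Assumed day counts per unit; treating a year as 365 days is a simplification.
DAYS_PER_UNIT = {'days': 1, 'weeks': 7, 'years': 365}


def partition_change_dates(youngest, oldest, partitions):
    """Hypothetical helper: list change dates stepping back from `youngest`.

    `partitions` follows the documented layout:
    {'unit': 'days' | 'weeks' | 'years', 'every': number, 'end': optional date}.
    """
    step = timedelta(days=DAYS_PER_UNIT[partitions['unit']] * partitions['every'])
    # Fall back to the oldest sample date when no end point is given.
    end = partitions.get('end', oldest)
    if isinstance(end, str):
        end = date.fromisoformat(end)
    elif isinstance(end, datetime):
        end = end.date()
    change_dates = []
    current = youngest - step
    # Assumption: stop once the end point is reached; the end point is excluded.
    while current > end:
        change_dates.append(current)
        current -= step
    return change_dates


change_dates = partition_change_dates(
    youngest=date(2021, 3, 1),
    oldest=date(2021, 1, 1),
    partitions={'unit': 'weeks', 'every': 2, 'end': '2021-01-25'},
)
print(change_dates)
```

With the toy dates above, the change dates fall every two weeks before 1 March 2021 until 25 January 2021 is passed.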


In [None]:
'''
Parameters
-------------
'''
# Running an Instance of this Workflow
overall_save_dir = None
specific_run_save_dir = None
max_threads = None
kernel_name = 'beast_pype'

# General Inputs
template_xml_path = None
fasta_path = None
metadata_path = None
sample_id_field = 'strain'
collection_date_field = 'date'

# Defining XML Sets (Partitioning Data)
xml_set_definitions = None
data_filter = None

# Initial Tree Building & Downsampling
use_initial_tree = True
initial_tree_type = 'Temporal'
root_strain_names = None
remove_root = False
down_sample_to = None


# BDSky Options
origin_start_addition = None
origin_upper_addition = None
origin_prior = None
rt_dims = None
rt_partitions = None
sampling_prop_dims = None
sampling_prop_partitions = None
zero_sampling_before_first_sample = False


# MCMC Tree/Logfile names Chain-lengths & Save Steps
log_file_basename = None
chain_length = None
trace_log_every = None
tree_log_every = None
screen_log_every = None
store_state_every = None

# Running BEAST 2
number_of_beast_runs = None
seeds = None
beast_options_without_a_value = None
beast_options_needing_a_value = None
sbatch_options_without_a_value = None
sbatch_options_needing_a_value = None

# Choosing a specific report template
report_template = None
xml_set_label = None
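
When `max_threads` is left as `None`, the documented fallback could look roughly like the sketch below. `resolve_max_threads` is a hypothetical name, detecting a SLURM job through the presence of `SLURM_CPUS_PER_TASK` is an assumption, and so is the floor of one thread on single-core machines.

```python
import multiprocessing
import os


def resolve_max_threads(max_threads=None, environ=None):
    """Hypothetical helper mirroring the documented `max_threads` fallback."""
    environ = os.environ if environ is None else environ
    if max_threads is not None:
        return int(max_threads)
    # Assumption: a SLURM job is recognised by this environment variable being set.
    slurm_cpus = environ.get('SLURM_CPUS_PER_TASK')
    if slurm_cpus is not None:
        return int(slurm_cpus)
    # Outside SLURM: leave one core free (never going below 1 is an assumption).
    return max(1, multiprocessing.cpu_count() - 1)


print(resolve_max_threads(None, {'SLURM_CPUS_PER_TASK': '8'}))
```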

# Setup
## Create Dictionary of Parameters

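This needs to be done before importing packages, because `%who_ls` lists every interactive variable defined so far, including imported modules and functions.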
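
Why the ordering matters can be seen in a plain-Python analogue. The sketch below is not how IPython implements `%who_ls`; `snapshot_user_variables` is a hypothetical stand-in that keeps every public name in a namespace, which is roughly what `%who_ls` followed by a lookup yields.

```python
import json


def snapshot_user_variables(namespace):
    """Hypothetical stand-in for `%who_ls` plus a lookup: keep every public name."""
    return {name: value for name, value in namespace.items() if not name.startswith('_')}


# A toy namespace as it might look right after the parameters cell has run.
namespace = {'fasta_path': 'sequences.fasta', 'rt_dims': 3}
before_imports = snapshot_user_variables(namespace)

# Simulate the import cell running: imported names join the same namespace.
namespace['json'] = json
after_imports = snapshot_user_variables(namespace)

print(sorted(before_imports))
print(sorted(after_imports))
```

Only the snapshot taken before the imports contains parameters alone; the later one also picks up `json`, which would then be passed along as if it were a parameter.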

In [None]:
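# `%who_ls` returns the names of all interactive variables defined so far (the parameters above).
parameters = %who_ls
# Map each parameter name to its current value.
parameters = {var: eval(var) for var in parameters}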

Import packages, etc.

In [None]:
import yaml
from beast_pype.nb_utils import execute_notebook
from time import perf_counter
import pandas as pd
import importlib.resources as importlib_resources
from beast_pype.workflow_params import BDSKYSerialComparativeWorkflowParams
from beast_pype.diagnostics import gen_beast_diagnostic_nb

## Check, Setup and Record parameters

In [None]:
parameters = BDSKYSerialComparativeWorkflowParams(**parameters)

### Creating a record for runtimes

This list of dictionaries will be converted to a pandas DataFrame and saved as a CSV at the end of this notebook.

In [None]:
runtime_records = []
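
The timing pattern used in the phases below (start a `perf_counter`, run a phase, append a record) could also be wrapped in a context manager. The sketch below is one possible way to do that, not part of beast_pype; `record_runtime` is a hypothetical name and the record keys simply copy those used in this notebook.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def record_runtime(records, phase, sample=None, chain=None):
    """Hypothetical helper: time the enclosed block and append one record."""
    start = perf_counter()
    try:
        yield
    finally:
        # Appended even if the block raises, so failed phases are still timed.
        records.append({
            'Phase': phase,
            'Sample': sample,
            'Chain': chain,
            'Runtime': perf_counter() - start,
        })


example_records = []
with record_runtime(example_records, 'Example-Phase'):
    sum(range(1000))  # Stand-in for a call such as execute_notebook(...)
print(example_records)
```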

### Set path to workflow modules

In [None]:
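# `files()` returns the package directory as a path-like object. `path()` returns a
# context manager, which does not format as a path in the f-strings used below.
workflow_modules = importlib_resources.files('beast_pype') / 'workflow_modules'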

## Phase 1: Data Gathering

### Placing Phase 1 Parameters in a Dictionary

In [None]:
phase_1_start = perf_counter()
phase_1_params = parameters.retrieve_phase_1_params()

#### Running Phase 1.

In [None]:
#papermill_description=Phase-1-Metadata-and-Sequence-Separation.ipynb
phase_1i_log = execute_notebook(input_path=f'{workflow_modules}/Phase-1-Metadata-and-Sequence-Separation.ipynb',
                                output_path=parameters.save_dir + '/Phase-1-Metadata-and-Sequence-Separation.ipynb',
                                parameters=phase_1_params,
                                progress_bar=True,
                                nest_asyncio=True,
                                kernel_name=kernel_name
                                )
runtime_records.append({
    'Phase': 'Phase-1-Metadata-and-Sequence-Separation.ipynb',
    'Sample': None,
    'Chain': None,
    'Runtime': perf_counter() - phase_1_start
})

## Phase 2: Data Pre-Processing
### Phase 2i: Building an IQ-TREE Tree
#### Placing Phase 2i Parameters in a Dictionary

In [None]:
if use_initial_tree:
    phase_2i_start = perf_counter()
    phase_2i_params = parameters.retrieve_phase_2i_params()

#### Running Phase 2i.

In [None]:
#papermill_description=Phase-2i-IQTree-Building
if use_initial_tree:
    phase_2i_log = execute_notebook(input_path=f'{workflow_modules}/Phase-2i-IQTree-Building.ipynb',
                                      output_path=parameters.save_dir + '/Phase-2i-IQTree-Building.ipynb',
                                      parameters=phase_2i_params,
                                      progress_bar=True,
                                      nest_asyncio=True
                                     )
    for xml_set_directory in parameters.xml_set_directories.values(): # This loop could and should be in parallel
        phase_2i_IQTree_Correction_log = execute_notebook(input_path=f'{workflow_modules}/Phase-2i-IQTree-Correction.ipynb',
                                                         # Save into the xml_set's own directory so loop iterations do not overwrite each other.
                                                         output_path=f'{xml_set_directory}/Phase-2i-IQTree-Correction.ipynb',
                                                         parameters={
                                                             'fasta_path': f'{xml_set_directory}/sequences.fasta',
                                                             'tree_path': f'{xml_set_directory}/initial_trees/iqtree.nwk'
                                                         },
                                                         progress_bar=True,
                                                         nest_asyncio=True,
                                                         kernel_name=kernel_name
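                                                         )
    # Record the Phase 2i runtime (the timer is started in the previous cell).
    runtime_records.append({
        'Phase': 'Phase-2i-IQTree-Building.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_2i_start
    })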

### Phase 2ii: TreeTime & Down Sampling

#### Starting the Phase 2ii Timer

In [None]:
if use_initial_tree:
    phase_2ii_start = perf_counter()


#### Running Phase 2ii.

In [None]:
#papermill_description=Phase-2ii-TreeTime-and-Down-Sampling
if use_initial_tree and initial_tree_type == 'Temporal':
    for xml_set_directory in parameters.xml_set_directories.values(): # This loop could and should be in parallel
        phase_2ii_params = parameters.retrieve_phase_2ii_params(xml_set_directory)
        phase_2ii_log = execute_notebook(input_path=f'{workflow_modules}/Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
                                            output_path=f'{xml_set_directory}/Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
                                            parameters=phase_2ii_params,
                                            progress_bar=True,
                                            nest_asyncio=True,
                                            kernel_name=kernel_name
                                            )
    runtime_records.append({
        'Phase': 'Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_2ii_start
    })

## Phase 3: Generating BEAST XMLs

### Running Phase 3

In [None]:
#papermill_description=Phase-3-Generating-XMLs
phase_3_start = perf_counter()
for xml_set, xml_set_directory in parameters.xml_set_directories.items(): # This loop could and should be in parallel
    phase_3_params = parameters.retrieve_phase_3_params(xml_set, xml_set_directory)
    phase_3_log = execute_notebook(input_path=f'{workflow_modules}/Phase-3-Gen-BDSKY-Serial-xml.ipynb',
                                      output_path=f'{xml_set_directory}/Phase-3-Gen-BDSKY-Serial-xml.ipynb',
                                      parameters=phase_3_params,
                                      progress_bar=True,
                                      nest_asyncio=True,
                                      kernel_name=kernel_name
                                      )
runtime_records.append({
    'Phase': 'Phase-3-Gen-BDSKY-Serial-xml.ipynb',
    'Sample': None,
    'Chain': None,
    'Runtime': perf_counter() - phase_3_start
})
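
The loops over `parameters.xml_set_directories` are flagged in the code comments as candidates for parallel execution. A minimal sketch of one way to do that with a thread pool follows. `run_for_each_xml_set` and `fake_phase` are hypothetical names, and whether `execute_notebook` is safe to call from several threads at once is not confirmed here; a process pool may be needed instead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_for_each_xml_set(task, xml_set_directories, max_workers=None):
    """Hypothetical helper: call task(xml_set, directory) for every xml_set concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            xml_set: executor.submit(task, xml_set, directory)
            for xml_set, directory in xml_set_directories.items()
        }
        # .result() re-raises any exception from a worker, so failures are not silent.
        return {xml_set: future.result() for xml_set, future in futures.items()}


def fake_phase(xml_set, directory):
    # Stand-in for execute_notebook(...); returns the output path it would write.
    return f'{directory}/Phase-3-Gen-BDSKY-Serial-xml.ipynb'


logs = run_for_each_xml_set(
    fake_phase, {'set_a': 'run/set_a', 'set_b': 'run/set_b'}, max_workers=2
)
print(logs)
```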

## Phase 4: Running BEAST
### Placing Phase 4 Parameters in a Dictionary

In [None]:
phase_4_start = perf_counter()
phase_4_params = parameters.retrieve_phase_4_params()

### Running Phase 4.

In [None]:
#papermill_description=Phase-4-Running-BEAST
if 'sbatch_arg_string' in phase_4_params:
    phase_4_log = execute_notebook(input_path=f'{workflow_modules}/Phase-4-SBATCH-Running-BEAST.ipynb',
                                      output_path=parameters.save_dir + '/Phase-4-SBATCH-Running-BEAST.ipynb',
                                      parameters=phase_4_params,
                                      progress_bar=True,
                                      nest_asyncio=True)
else:
    phase_4_log = execute_notebook(input_path=f'{workflow_modules}/Phase-4-GNU-Parallel-Running-BEAST.ipynb',
                                      output_path=parameters.save_dir + '/Phase-4-GNU-Parallel-Running-BEAST.ipynb',
                                      parameters=phase_4_params,
                                      progress_bar=True,
                                      nest_asyncio=True)
runtime_records.append({
    'Phase': 'Phase-4',
    'Sample': None,
    'Chain': None,
    'Runtime': perf_counter() - phase_4_start
})
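
The check for `'sbatch_arg_string'` in `phase_4_params` suggests that the list-style and dict-style options are flattened into a single argument string before Phase 4 runs. How beast_pype does this is not shown in this notebook; the sketch below is one plausible approach, with `build_arg_string` as a hypothetical name.

```python
import shlex


def build_arg_string(options_without_a_value=None, options_needing_a_value=None):
    """Hypothetical helper: flatten flag and key/value options into one shell-safe string."""
    args = list(options_without_a_value or [])
    for option, value in (options_needing_a_value or {}).items():
        args.extend([option, str(value)])
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


print(build_arg_string(['-beagle_GPU'], {'-threads': 3}))
```

For the BEAST 2 examples given in the parameter descriptions, this prints `-beagle_GPU -threads 3`.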

## Phase 5: Diagnosing Outputs and Generating a Report

Currently, this phase has to be run manually. However, the code below parameterizes a copy of the Phase 5 notebook so that it is ready to run, and prints its location.

In [None]:
# The `with` block closes the file automatically, so no explicit close() is needed.
with open(parameters.save_dir + '/pipeline_run_info.yml', 'r') as file:
    pipeline_run_info = yaml.safe_load(file)


In [None]:
phase_5_params = parameters.retrieve_phase_5_params()
gen_beast_diagnostic_nb(parameters.save_dir, **phase_5_params)
print(f'Phase 5 notebook is ready for manual use at: \n{parameters.save_dir}/Phase-5-Diagnosing-XML-sets-and-Generate-Report.ipynb')

## Recording Runtimes

Converting to pandas DataFrame and saving as CSV.

In [None]:
runtime_df = pd.DataFrame.from_records(runtime_records)
runtime_df.to_csv(parameters.save_dir + "/runtimes.csv", index=False)