# Workflow Generic Comparative Notebook


This workflow notebook is for comparing BEAST runs where the settings in the XML are the same but the sequences are different.

<details>
    <summary>Click Here to See a Description of Parameters</summary>
        <pre>
            <code>

Running an Instance of this Workflow
-------------------------------------------
overall_save_dir:  str
    Path to where you are saving all the runs of this workflow.
    A folder with the name you specify here is created in the root folder, and all the files
    produced when running the workflow are saved in it.

specific_run_save_dir:  str, (optional, can be missing from yml)
    Sub-directory of overall_save_dir in which to save all the files from this instance of the workflow.
    If not given, a timestamp of format 'YYYY-MM-DD_hour-min-sec' is used instead.

max_threads:  int, (optional, can be missing from yml)
    The maximum number of threads to use when calling GNU parallel in phases 2i and 4. If not given and BEAST_pype is running
    in a SLURM job, the SLURM environment variable `SLURM_CPUS_PER_TASK` is used. If not given and BEAST_pype is NOT running in
    a SLURM job, the number of cores available minus 1 is used (`multiprocessing.cpu_count() - 1`).

kernel_name: str, default 'beast_pype'
    Name of the Jupyter Python kernel to use when running the workflow. This is also the name of the conda environment to use in
    phases 2i and 4 (as these Jupyter notebooks use the `bash` kernel).

General Inputs
----------------
template_xml_path:  str, (required)
    Path to template BEAST 2 xml.

fasta_path:  str, (required)
    Path to fasta file containing sequences to be placed into template xml.

metadata_path:  str, (required)
    Path to CSV or TSV file containing metadata for the sequences in fasta_path.

sample_id_field:  str, (required)
    Name of the field in the metadata file containing sequence IDs.

collection_date_field:  str, (required)
    Name of the field in the metadata file containing the collection dates of sequences. Dates should be formatted YYYY-MM-DD.

Defining XML Sets (Partitioning/Segregating Data)
--------------------------------------------
xml_set_definitions:  dict {str: str}, (required)
    The definitions for the xml_sets you wish to use.
        Keys: str
            The name used for the xml_set. It will be used to name directories, so certain characters should be
            avoided, see https://www.mtu.edu/umc/services/websites/writing/characters-avoid/.
        Values: str
            Will be used with pandas `DataFrame.query` to separate out your data.
            Must conform to pandas `DataFrame.query` format, see:
            * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
            * https://sparkbyexamples.com/pandas/pandas-dataframe-query-examples/
            * https://www.slingacademy.com/article/pandas-working-with-the-dataframe-query-method-5-examples/

data_filter:  str, (optional, can be commented out)
    May be an empty string or null (None in Python).
    Additional filter applied to the metadata when selecting
    sequences and metadata to be used in the pipeline.
    Must conform to pandas `DataFrame.query` format, see:
        * https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
        * https://sparkbyexamples.com/pandas/pandas-dataframe-query-examples/
        * https://www.slingacademy.com/article/pandas-working-with-the-dataframe-query-method-5-examples/

Initial Tree Building & Downsampling
------------------
use_initial_tree:  bool, default True
    If False, an initial tree will not be generated and Phases 2i and 2ii are skipped. As such, in phase 4 BEAST 2 generates its own
    initial tree.

initial_tree_type: str (either 'Distance' or 'Temporal') or None, default 'Temporal'
    Initial tree type to use.
    If 'Distance', the IQtree tree from Phase-2i-IQTree.ipynb is used for the
    initial tree and phase 2ii is skipped.
    If 'Temporal', the TreeTime tree from Phase-2ii-TreeTime-and-Down-Sampling.ipynb
    is used for the initial tree.

down_sample_to: int
    If the number of sequences in a fasta file is above this value, the number of sequences is cut to this number via downsampling.
    If not given, downsampling will not occur. When given, downsampling only
    occurs for an xml_set if it has more sequences than the value given.
    If downsampling occurs, the following are saved in '{overall_save_dir}/{specific_run_save_dir}/{xml_set}' and used in generating
    a BEAST 2 xml in phase 3:
    *   initial_trees/down_sampled_time.nwk:  A downsampled temporal tree.
    *   down_sampled_sequences.fasta:  Fasta file containing the downsampled sequences.
    *   down_sampled_metadata.csv:  The downsampled metadata.


MCMC Tree/Logfile names Chain-lengths & Save Steps
------------------
log_file_basename:  str, (required)
    The .tree, .log and .state files from running BEAST 2 will have this name prefixed by 'run-with-seed-{seed}-',
    {seed} being the seed of the chain.
    Must not contain whitespace.

chain_length:  int, (optional)
    Length of the MCMC chain (number of MCMC steps) to use for BEAST runs (e.g. 50000000).
    If not given, the value in template_xml_path will be used.

trace_log_every:  int, (optional)
    How often (in MCMC steps) to write to the trace log file during BEAST runs (e.g. 5000).
    If not given, the value in template_xml_path will be used.

tree_log_every:  int, (optional)
    How often (in MCMC steps) to write to the tree file during BEAST runs (e.g. 5000).
    If not given, the value in template_xml_path will be used.

screen_log_every:  int, (optional)
    How often (in MCMC steps) to output to screen during BEAST runs (e.g. 5000).
    If not given, the value in template_xml_path will be used.

store_state_every:  int, (optional)
    How often (in MCMC steps) to store the MCMC state during BEAST runs (e.g. 5000).
    If not given, the value in template_xml_path will be used.


Running BEAST 2
--------------------
number_of_beast_runs: int
    Number of chains to use (number of parallel runs to do) when running BEAST (e.g. 9).

seeds: list of ints
    Seeds to use when running BEAST.
    If given, length of list should be the same as the number of chains (so each run has a designated seed).

beast_options_without_a_value: list of strs
    Options not requiring a value to pass to BEAST 2.
    For instance, to use a GPU when running BEAST 2 this would be `['-beagle_GPU']`.
    See https://www.beast2.org/2021/03/31/command-line-options.html.

beast_options_needing_a_value: dict
    Options requiring a value to pass to BEAST 2.
    For instance to use 3 threads when running BEAST 2 this would be: `{'-threads': 3}`.
    See https://www.beast2.org/2021/03/31/command-line-options.html.

sbatch_options_without_a_value: list of strs
    Options not requiring a value to pass to sbatch.
    See https://slurm.schedmd.com/sbatch.html.

sbatch_options_needing_a_value: dict
    Options requiring a value to pass to sbatch.
    See https://slurm.schedmd.com/sbatch.html.


  </code>
</pre>
</details>
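
The values of `xml_set_definitions` and the `data_filter` string are pandas `DataFrame.query` expressions. The sketch below shows how such expressions might partition a metadata table. The columns `lineage` and `country`, the set names and the helper `split_into_xml_sets` are hypothetical, and this is not the code that Phase 1 runs.

```python
import pandas as pd

# Toy metadata table. 'strain' and 'date' mirror the default sample_id_field and
# collection_date_field; 'lineage' and 'country' are made-up columns.
metadata = pd.DataFrame({
    'strain': ['s1', 's2', 's3', 's4'],
    'date': ['2021-01-05', '2021-02-11', '2021-03-02', '2021-03-20'],
    'lineage': ['A', 'B', 'A', 'B'],
    'country': ['Canada', 'Canada', 'France', 'Canada'],
})

xml_set_definitions = {
    'lineage_A': "lineage == 'A'",
    'lineage_B': "lineage == 'B'",
}
data_filter = "country == 'Canada'"


def split_into_xml_sets(metadata, xml_set_definitions, data_filter=None):
    """Return {xml_set name: list of sample IDs} (hypothetical helper)."""
    if data_filter:  # an empty string and None both mean "no extra filter"
        metadata = metadata.query(data_filter)
    return {
        name: metadata.query(query)['strain'].tolist()
        for name, query in xml_set_definitions.items()
    }


xml_sets = split_into_xml_sets(metadata, xml_set_definitions, data_filter)
print(xml_sets)
```

Applying the filter first and the set definitions second keeps each definition short, since conditions shared by every set only need to be written once in `data_filter`.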

In [None]:
'''
Parameters
-------------
'''
# Running an Instance of this Workflow
overall_save_dir = None
specific_run_save_dir = None
max_threads = None
kernel_name = 'beast_pype'

# General Inputs
template_xml_path = None
fasta_path = None
metadata_path = None
sample_id_field = 'strain'
collection_date_field = 'date'

# Defining XML Sets (Partitioning Data)
xml_set_definitions = None
data_filter = None

# Initial Tree Building & Downsampling
use_initial_tree = True
initial_tree_type = 'Temporal'
root_strain_names = None
remove_root = False
down_sample_to = None

# MCMC Tree/Logfile names Chain-lengths & Save Steps
log_file_basename = None
chain_length = None
trace_log_every = None
tree_log_every = None
screen_log_every = None
store_state_every = None

# Running BEAST 2
number_of_beast_runs = None
seeds = None
beast_options_without_a_value = None
beast_options_needing_a_value = None
sbatch_options_without_a_value = None
sbatch_options_needing_a_value = None

# Choosing a specific report template
report_template = None
xml_set_label = None
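
Several parameters left as `None` fall back to the defaults given in the parameter descriptions. A minimal sketch of two of those fallbacks follows. The helper names `default_run_dir_name` and `default_max_threads` are hypothetical (they are not beast_pype functions), and the zero-padded 24-hour timestamp is an assumption about the 'YYYY-MM-DD_hour-min-sec' format.

```python
import multiprocessing
import os
from datetime import datetime


def default_run_dir_name(now=None):
    """Timestamp of the form 'YYYY-MM-DD_hour-min-sec' (hypothetical helper)."""
    now = now or datetime.now()
    return now.strftime('%Y-%m-%d_%H-%M-%S')


def default_max_threads(max_threads=None, environ=os.environ):
    """Resolve max_threads in the order described for the parameter."""
    if max_threads is not None:
        return max_threads
    if 'SLURM_CPUS_PER_TASK' in environ:  # set by SLURM for jobs requesting --cpus-per-task
        return int(environ['SLURM_CPUS_PER_TASK'])
    # Not in a SLURM job: leave one core free. The floor of 1 is an
    # assumption of this sketch, for single-core machines.
    return max(multiprocessing.cpu_count() - 1, 1)


print(default_run_dir_name(datetime(2024, 5, 17, 9, 30, 0)))
print(default_max_threads(environ={'SLURM_CPUS_PER_TASK': '8'}))
```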

# Setup
## Create a Dictionary of Parameters

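This needs to be done before importing packages, because `%who_ls` lists every variable defined in the session so far and imported names would otherwise be collected as parameters.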

In [None]:
parameters = %who_ls
parameters = {var: eval(var) for var in parameters}
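
`%who_ls` is an IPython magic, so the cell above only works inside a notebook or IPython session. A rough plain-Python analogue, using a hypothetical helper `collect_parameters` on an explicit namespace dictionary, illustrates why the order of the cells matters:

```python
import types


def collect_parameters(namespace):
    """Return the user-facing names in a namespace (hypothetical helper).

    Names starting with an underscore are skipped. Everything else is kept,
    including modules, which is the point of the demonstration.
    """
    return {name: value for name, value in namespace.items()
            if not name.startswith('_')}


# Namespace as it might look straight after the parameters cell.
namespace = {'overall_save_dir': 'runs', 'chain_length': None}
before_imports = collect_parameters(namespace)

# Importing afterwards adds names that are not workflow parameters.
namespace['pd'] = types.ModuleType('pandas')  # stand-in for `import pandas as pd`
after_imports = collect_parameters(namespace)

print(sorted(before_imports))  # only workflow parameters
print(sorted(after_imports))   # now also contains 'pd'
```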

Import packages, etc.

In [None]:
from beast_pype.nb_utils import execute_notebook
from time import perf_counter
import pandas as pd
import importlib.resources as importlib_resources
from beast_pype.workflow_params import GenericComparativeWorkflowParams
from beast_pype.diagnostics import gen_beast_diagnostic_nb

## Check, Setup and Record parameters

In [None]:
parameters = GenericComparativeWorkflowParams(**parameters)
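
The checks performed by `GenericComparativeWorkflowParams` are not shown in this notebook. As an illustration only, some of the constraints stated in the parameter descriptions could be verified along the following lines. `check_parameters` is a hypothetical function, not part of beast_pype.

```python
def check_parameters(number_of_beast_runs, seeds=None,
                     log_file_basename=None, initial_tree_type='Temporal'):
    """Raise ValueError if a documented constraint is violated (sketch only)."""
    if seeds is not None and len(seeds) != number_of_beast_runs:
        raise ValueError(
            f'{len(seeds)} seeds given for {number_of_beast_runs} BEAST runs; '
            'each run needs its own seed.')
    if log_file_basename is not None and any(c.isspace() for c in log_file_basename):
        raise ValueError('log_file_basename must not contain whitespace.')
    if initial_tree_type not in ('Distance', 'Temporal', None):
        raise ValueError("initial_tree_type must be 'Distance', 'Temporal' or None.")


# A consistent set of values passes silently.
check_parameters(number_of_beast_runs=3, seeds=[1, 2, 3], log_file_basename='run')

# A mismatch between seeds and runs is reported.
try:
    check_parameters(number_of_beast_runs=3, seeds=[1, 2])
except ValueError as error:
    print(error)
```

Failing early in this way is cheaper than discovering a bad value after the tree-building phases have already run.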

### Creating a record for runtimes

This list of record dictionaries will be turned into a pandas DataFrame and saved as a CSV at the end of this notebook.

In [None]:
runtime_records = []
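
Each phase in this notebook appends a dictionary with the keys `Phase`, `Sample`, `Chain` and `Runtime`. The same bookkeeping could be wrapped in a context manager. `record_runtime` is a hypothetical helper shown only as a sketch and is not used by this notebook.

```python
from contextlib import contextmanager
from time import perf_counter, sleep

runtime_records = []


@contextmanager
def record_runtime(records, phase, sample=None, chain=None):
    """Time the enclosed block and append a record with the notebook's keys."""
    start = perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed phases are still timed.
        records.append({
            'Phase': phase,
            'Sample': sample,
            'Chain': chain,
            'Runtime': perf_counter() - start,
        })


with record_runtime(runtime_records, 'Example-Phase'):
    sleep(0.01)  # stand-in for executing a phase notebook

print(runtime_records)
```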

### Set path to workflow modules

In [None]:
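# Locate the workflow_modules directory shipped inside the installed beast_pype package.
# `files` returns a path-like object that can be used directly in the f-strings below.
workflow_modules = importlib_resources.files('beast_pype') / 'workflow_modules'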

## Phase 1: Data Gathering

### Placing Phase 1 Parameters in a Dictionary

In [None]:
phase_1_start = perf_counter()
phase_1_params = parameters.retrieve_phase_1_params()

#### Running Phase 1.

In [None]:
#papermill_description=Phase-1-Metadata-and-Sequence-Separation.ipynb
phase_1_log = execute_notebook(input_path=f'{workflow_modules}/Phase-1-Metadata-and-Sequence-Separation.ipynb',
                               output_path=parameters.save_dir + '/Phase-1-Metadata-and-Sequence-Separation.ipynb',
                               parameters=phase_1_params,
                               progress_bar=True,
                               nest_asyncio=True,
                               kernel_name=kernel_name
                               )
runtime_records.append({
    'Phase': 'Phase-1-Metadata-and-Sequence-Separation.ipynb',
    'Sample': None,
    'Chain': None,
    'Runtime': perf_counter() - phase_1_start
})

## Phase 2: Data Pre-Processing
### Phase 2i: Building an IQ-TREE Tree
#### Placing Phase 2i Parameters in a Dictionary

In [None]:
if use_initial_tree:
    phase_2i_start = perf_counter()
    phase_2i_params = parameters.retrieve_phase_2i_params()

#### Running Phase 2i.

In [None]:
#papermill_description=Phase-2i-IQTree-Building
if use_initial_tree:
    phase_2i_log = execute_notebook(input_path=f'{workflow_modules}/Phase-2i-IQTree-Building.ipynb',
                                      output_path=parameters.save_dir + '/Phase-2i-IQTree-Building.ipynb',
                                      parameters=phase_2i_params,
                                      progress_bar=True,
                                      nest_asyncio=True
                                     )
    for sub_dir in parameters.xml_set_directories.values(): # This loop could and should be in parallel
        phase_2i_IQTree_Correction_log = execute_notebook(input_path=f'{workflow_modules}/Phase-2i-IQTree-Correction.ipynb',
                                                         output_path=f'{sub_dir}/Phase-2i-IQTree-Correction.ipynb',  # one executed copy per xml_set
                                                         parameters={
                                                             'fasta_path': f'{sub_dir}/sequences.fasta',
                                                             'tree_path': f'{sub_dir}/initial_trees/iqtree.nwk'
                                                         },
                                                         progress_bar=True,
                                                         nest_asyncio=True,
                                                         kernel_name=kernel_name
                                                         )


### Record runtime

In [None]:
if use_initial_tree:
    runtime_records.append({
        'Phase': 'Phase-2i-IQTree-Building.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_2i_start
    })

### Phase 2ii: TreeTime & Down Sampling

#### Starting the Phase 2ii Timer

In [None]:
if use_initial_tree and initial_tree_type == 'Temporal':
    phase_2ii_start = perf_counter()


#### Running Phase 2ii.

In [None]:
if use_initial_tree and initial_tree_type == 'Temporal':  # Phase 2ii is skipped for 'Distance' trees
    #papermill_description=Phase-2ii-TreeTime-and-Down-Sampling
    for sub_dir in parameters.xml_set_directories.values(): # This loop could and should be in parallel
        phase_2ii_params = parameters.retrieve_phase_2ii_params(sub_dir)
        phase_2ii_log = execute_notebook(input_path=f'{workflow_modules}/Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
                                            output_path=f'{sub_dir}/Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
                                            parameters=phase_2ii_params,
                                            progress_bar=True,
                                            nest_asyncio=True,
                                            kernel_name=kernel_name
                                            )
    runtime_records.append({
        'Phase': 'Phase-2ii-TreeTime-and-Down-Sampling.ipynb',
        'Sample': None,
        'Chain': None,
        'Runtime': perf_counter() - phase_2ii_start
    })

## Phase 3: Generating BEAST XMLs

### Running Phase 3

In [None]:
#papermill_description=Phase-3-Generating-XMLs
phase_3_start = perf_counter()
for sub_dir in parameters.xml_set_directories.values(): # This loop could and should be in parallel
    phase_3_params = {
        'save_dir': sub_dir,
        **parameters.retrieve_params([
            'template_xml_path',
            'use_initial_tree',
            'collection_date_field',
            'sample_id_field',
            'log_file_basename',
            'chain_length',
            'trace_log_every',
            'tree_log_every',
            'screen_log_every',
            'store_state_every'])
    }
    phase_3_log = execute_notebook(input_path=f'{workflow_modules}/Phase-3-Gen-Generic-xml.ipynb',
                                      output_path=f'{sub_dir}/Phase-3-Gen-Generic-xml.ipynb',
                                      parameters=phase_3_params,
                                      progress_bar=True,
                                      nest_asyncio=True,
                                      kernel_name=kernel_name
                                      )
runtime_records.append({
    'Phase': 'Phase-3-Gen-Generic-xml.ipynb',
    'Sample': None,
    'Chain': None,
    'Runtime': perf_counter() - phase_3_start
})
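
The comments on the per-xml_set loops note that they could run in parallel. One possible shape for that, using the standard library's `concurrent.futures`, is sketched below. `run_phase_for_sub_dir` is a hypothetical stand-in for the `execute_notebook` call, and whether `execute_notebook` is safe to call from several threads at once is not established here, so this is an outline rather than a drop-in replacement.

```python
from concurrent.futures import ThreadPoolExecutor


def run_phase_for_sub_dir(sub_dir):
    """Hypothetical stand-in for executing a phase notebook in sub_dir."""
    return f'{sub_dir}/Phase-3-Gen-Generic-xml.ipynb'


# Stand-in for parameters.xml_set_directories (xml_set name -> directory).
xml_set_directories = {'set_a': 'runs/set_a', 'set_b': 'runs/set_b'}
max_threads = 2

with ThreadPoolExecutor(max_workers=max_threads) as executor:
    # executor.map preserves input order, so outputs line up with the xml_sets.
    outputs = list(executor.map(run_phase_for_sub_dir, xml_set_directories.values()))

print(outputs)
```

If the notebooks turn out not to be thread-safe, `ProcessPoolExecutor` offers the same `map` interface with separate processes.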

## Phase 4: Running BEAST
### Placing Phase 4 Parameters in a Dictionary

In [None]:
phase_4_start = perf_counter()
phase_4_params = parameters.retrieve_phase_4_params()

### Running Phase 4.

In [None]:
#papermill_description=Phase-4-Running-BEAST
if 'sbatch_arg_string' in phase_4_params:
    phase_4_log = execute_notebook(input_path=f'{workflow_modules}/Phase-4-SBATCH-Running-BEAST.ipynb',
                                      output_path=parameters.save_dir + '/Phase-4-SBATCH-Running-BEAST.ipynb',
                                      parameters=phase_4_params,
                                      progress_bar=True,
                                      nest_asyncio=True)
else:
    phase_4_log = execute_notebook(input_path=f'{workflow_modules}/Phase-4-GNU-Parallel-Running-BEAST.ipynb',
                                      output_path=parameters.save_dir + '/Phase-4-GNU-Parallel-Running-BEAST.ipynb',
                                      parameters=phase_4_params,
                                      progress_bar=True,
                                      nest_asyncio=True)
runtime_records.append({
    'Phase': 'Phase-4',
    'Sample': None,
    'Chain': None,
    'Runtime': perf_counter() - phase_4_start
})
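
The branch above depends on whether `phase_4_params` contains `sbatch_arg_string`. How beast_pype builds that string is not shown in this notebook. The sketch below is one plausible way to combine an `..._options_without_a_value` list with an `..._options_needing_a_value` dictionary, using a hypothetical helper `build_arg_string`.

```python
import shlex


def build_arg_string(options_without_a_value=None, options_needing_a_value=None):
    """Join flag-only options and option/value pairs into one string (sketch)."""
    parts = list(options_without_a_value or [])
    for option, value in (options_needing_a_value or {}).items():
        # shlex.quote protects values containing spaces or shell metacharacters.
        parts.extend([option, shlex.quote(str(value))])
    return ' '.join(parts)


# Example values taken from the parameter descriptions at the top of this notebook.
beast_arg_string = build_arg_string(['-beagle_GPU'], {'-threads': 3})
print(beast_arg_string)
```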

## Phase 5: Diagnosing Outputs and Generating a Report

Currently, this phase has to be performed manually. However, the code cell below parameterizes a copy of the notebook so that it is ready to run, and prints its location.

In [None]:
phase_5_params = parameters.retrieve_phase_5_params()
gen_beast_diagnostic_nb(parameters.save_dir, **phase_5_params)
print(f'Phase 5 notebook is ready for manual use at: \n{parameters.save_dir}/Phase-5-Diagnosing-XML-sets-and-Generate-Report.ipynb')

## Recording Runtimes

Converting to pandas DataFrame and saving as CSV.

In [None]:
runtime_df = pd.DataFrame.from_records(runtime_records)
runtime_df.to_csv(parameters.save_dir + "/runtimes.csv", index=False)