# Workflow for COVID Strain Surveillance

This notebook is effectively a wrapper for the BDSKY-Serial-Comparative.ipynb workflow.

<details>
    <summary>Click Here to See a Description of Parameters</summary>
        <pre>
            <code>

Running an Instance of this Workflow
-------------------------------------------
overall_save_dir: str
    Path to where you are saving all the runs of this workflow.

specific_run_save_dir: str, optional
    Sub-directory of overall_save_dir in which to save the outputs from this workflow.
    If None, 'None' or an empty string, a timestamp of format 'YYYY-MM-DD_hour-min-sec' is used instead.

max_threads: int, default None
    The maximum number of threads to use when calling GNU Parallel in phases 2i and 4. If None and BEAST_pype is running
    in a SLURM job, the SLURM environment variable `SLURM_CPUS_PER_TASK` is used. If None and BEAST_pype is NOT running in
    a SLURM job, the number of cores available minus 1 is used (`multiprocessing.cpu_count() - 1`).

kernel_name: str, default 'beast_pype'
    Name of the Jupyter Python kernel to use when running the workflow. This is also the name of the conda environment to use in
    phases 2ii and 4 (as these Jupyter notebooks use the `bash` kernel).

General Inputs
----------------
template_xml_path: str
    Path to template BEAST 2 xml.

fasta_path: str
    Path to fasta file containing sequences to be placed into template xml.

metadata_path: str
    Path to CSV or TSV file containing metadata for sequences in fasta_path.

sample_id_field: str
    Name of field in metadata_db containing sequence IDs.

collection_date_field: str
    Name of field in metadata_db containing collection dates of sequences. Should be formatted YYYY-MM-DD.

Defining XML Sets (Partitioning Data for various strains of COVID)
-----------------------------------------------------------------------------
lineage_field: str
    Field in metadata listing the lineage name of a sequence.

dr_strain: str
    Name of the dominant resident (DR) lineage.

voi_strains: list of strs
    Names of the Variant Of Interest (VOI) lineages.

sub_lineages_mapping: dict {str: list of strs}
    Dictionary defining the sub-lineages of dr_strain and voi_strains.

data_filter: str
    Optional; can be an empty string, None or 'None'. Additional filter applied to metadata_db when selecting
    sequences and metadata to be used in the pipeline. Must conform to the pandas query syntax, see the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) and a further [example](https://www.slingacademy.com/article/pandas-working-with-the-dataframe-query-method-5-examples/).

Initial Tree Building & Downsampling
------------------
use_initial_tree:  bool, default True
    If False, an initial tree will not be generated, skipping Phases 2i and 2ii. As such, in phase 4 BEAST 2 generates its own
    initial tree.

initial_tree_type: str (either 'Distance' or 'Temporal') or None, default 'Temporal'
    Initial tree type to use.
    If 'Distance', the IQ-TREE tree from Phase-2i-IQTree.ipynb is used for the
    initial tree and phase 2ii is skipped.
    If 'Temporal', the TreeTime tree from Phase-2ii-TreeTime-and-Down-Sampling.ipynb
    is used for the initial tree.

down_sample_to: int
    If the number of sequences in a fasta file is above this value, it is cut to this number via downsampling.

BDSky Options
------------------
origin_start_addition: float, optional
    Suggested infection period of pathogen. **Should be in years.** This + initial MLE tree height is used as starting value of origin.

origin_upper_addition: float/int, optional
    This + initial MLE tree height is used as upper value of origin prior.

origin_prior: dict {'lower': float, 'upper': float, 'start': float}, optional
    Details of the origin prior assumed to be uniformly distributed.

rt_partition_freq: dict of strings {'unit': 'days, weeks or years', 'every': int/float}
    Instructions for setting rt_partition dates, going backwards from the youngest date in all the metadata (all xml_sets/strains).
    For VOI strains the endpoint is the youngest date out of the oldest dates for each of the VOI strains' metadata.
    For the DR strain the last partition goes backwards one more 'unit' multiplied by 'every' from the VOI strains.
    Alternatively, you can set the last partition for the DR strain by including 'dr_extra_offset' with its own 'unit' and 'amount'.


sampling_prop_partition_freq: dict of strings {'unit': 'days, weeks or years', 'every': int/float}
    Instructions for setting sampling_prop_partition dates, going backwards from the youngest date in all the metadata (all xml_sets/strains).
    For VOI strains the endpoint is the youngest date out of the oldest dates for each of the VOI strains' metadata.
    For the DR strain the last partition goes backwards one more 'unit' multiplied by 'every' from the VOI strains.
    Alternatively, you can set the last partition for the DR strain by including 'dr_extra_offset' with its own 'unit' and 'amount'.


MCMC Tree/Logfile names Chain-lengths & Save Steps
------------------
log_file_basename: str, optional
    If provided .tree, .log and .state files from running BEAST 2 will have this name prefixed by 'run-with-seed-{seed}-'.

chain_length: int
    Length of the MCMC chain (number of steps) for each BEAST run.

trace_log_every: int
    How often (in MCMC steps) to write to the trace log during BEAST runs.

tree_log_every: int
    How often (in MCMC steps) to write to the tree file during BEAST runs.

screen_log_every: int
    How often to output to screen during BEAST runs.

store_state_every: int
    How often to store MCMC state during BEAST runs.


Running BEAST 2
--------------------
number_of_beast_runs: int
    Number of chains to use (repeated runs to do) when running BEAST.

seeds: list of ints
    Seeds to use when running BEAST.

beast_options_without_a_value: list of strs
    Options not requiring a value to pass to BEAST 2.
    For instance, to use a GPU when running BEAST 2 this would be `['-beagle_GPU']`.
    See https://www.beast2.org/2021/03/31/command-line-options.html.

beast_options_needing_a_value: dict
    Options requiring a value to pass to BEAST 2.
    For instance to use 3 threads when running BEAST 2 this would be: `{'-threads': 3}`.
    See https://www.beast2.org/2021/03/31/command-line-options.html.

sbatch_options_without_a_value: list of strs
    Options not requiring a value to pass to sbatch.
    See https://slurm.schedmd.com/sbatch.html.

sbatch_options_needing_a_value: dict
    Options requiring a value to pass to sbatch.
    See https://slurm.schedmd.com/sbatch.html.

  </code>
</pre>
</details>
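
The `max_threads` fallback described above can be sketched in a few lines. This is only an illustration of the documented rule, not beast_pype's own code: the helper name `resolve_max_threads` is hypothetical, and treating the presence of `SLURM_CPUS_PER_TASK` as the sign of a SLURM job is an assumption.

```python
import multiprocessing
import os


def resolve_max_threads(max_threads=None, environ=None):
    """Hypothetical helper mirroring the documented max_threads fallback."""
    environ = os.environ if environ is None else environ
    if max_threads is not None:
        return int(max_threads)
    slurm_cpus = environ.get('SLURM_CPUS_PER_TASK')
    if slurm_cpus is not None:
        # Assumed to mean "running inside a SLURM job": use the CPUs allocated to the task.
        return int(slurm_cpus)
    # Outside SLURM: leave one core free. The floor of 1 is an added assumption.
    return max(1, multiprocessing.cpu_count() - 1)
```

Passing `environ` explicitly makes the rule easy to check without being inside a SLURM job, e.g. `resolve_max_threads(None, {'SLURM_CPUS_PER_TASK': '8'})`.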

In [None]:
'''
Parameters
-------------
'''
# Running an Instance of this Workflow
overall_save_dir = None
specific_run_save_dir = None
max_threads = None
kernel_name = 'beast_pype'

# General Inputs
template_xml_path = None
fasta_path = None
metadata_path = None
sample_id_field = 'strain'
collection_date_field = 'date'

# Defining XML Sets (Partitioning Data for various strains of COVID)
lineage_field = None
dr_strain = None
voi_strains = None
sub_lineages_mapping = None
data_filter = None

# Initial Tree Building & Downsampling
use_initial_tree = True
initial_tree_type = 'Temporal'
root_strain_names = None
remove_root = False
down_sample_to = None

# BDSky Options
origin_start_addition = None
origin_upper_addition = None
origin_prior = None
rt_partition_freq = None
sampling_prop_partition_freq = None
zero_sampling_before_first_sample = False

# MCMC Tree/Logfile names Chain-lengths & Save Steps
log_file_basename = None
chain_length = None
trace_log_every = None
tree_log_every = None
screen_log_every = None
store_state_every = None

# Running BEAST 2
number_of_beast_runs = None
seeds = None
beast_options_without_a_value = None
beast_options_needing_a_value = None
sbatch_options_without_a_value = None
sbatch_options_needing_a_value = None

# Choosing a specific report template
xml_set_label = 'COVID Strain'
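
To make the expected shapes of the strain-related parameters concrete, here is a sketch of values one might supply. All values are illustrative assumptions: the column names `pango_lineage` and `country`, the lineage lists and the frequencies are invented for the example, and the nested form of `dr_extra_offset` is one reading of the parameter description rather than a confirmed format.

```python
# Illustrative values only; none of these come from a real run.
lineage_field = 'pango_lineage'          # hypothetical metadata column
dr_strain = 'BA.2'
voi_strains = ['BA.4', 'BA.5']
# Each list includes the parent lineage itself, because only names in the
# list are matched when the xml_set_definitions are built.
sub_lineages_mapping = {
    'BA.2': ['BA.2', 'BA.2.12.1'],
    'BA.4': ['BA.4', 'BA.4.1'],
    'BA.5': ['BA.5', 'BA.5.1', 'BA.5.2'],
}
data_filter = "country == 'Canada'"      # pandas DataFrame.query syntax; column is hypothetical
rt_partition_freq = {'unit': 'weeks', 'every': 2}
sampling_prop_partition_freq = {
    'unit': 'weeks',
    'every': 4,
    'dr_extra_offset': {'unit': 'weeks', 'amount': 2},  # assumed nesting
}

# Later cells index sub_lineages_mapping by dr_strain and every VOI strain,
# so each of those names needs its own entry.
missing = [s for s in [dr_strain, *voi_strains] if s not in sub_lineages_mapping]
```

An empty `missing` list means every strain has a sub-lineage entry.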

# Setup
## Create Dictionary of Parameters to pass on to BDSKY-Serial-Comparative.ipynb.

This needs to:
 * be done before importing packages.
 * exclude parameters not used in BDSKY-Serial-Comparative.ipynb, otherwise errors will be thrown.
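
The split performed in the next cell can be pictured with a plain dictionary standing in for the notebook namespace, since `%who_ls` only works inside IPython. The `namespace` dictionary, its values and the shortened `parent_workflow_parameters` list are assumptions made for this sketch.

```python
# Stand-in for the notebook namespace; in the notebook, `%who_ls` supplies the names.
namespace = {
    'fasta_path': 'sequences.fasta',      # hypothetical value
    'chain_length': 10_000_000,           # hypothetical value
    'lineage_field': 'pango_lineage',     # hypothetical value
    'dr_strain': 'BA.2',                  # hypothetical value
}
parent_workflow_parameters = ['lineage_field', 'dr_strain']  # shortened for illustration

# Parameters shared with the wrapped workflow are passed through directly...
parameters = {name: value for name, value in namespace.items()
              if name not in parent_workflow_parameters}
# ...while wrapper-only parameters are bundled under a single key.
parameters['parent_workflow_parameters'] = {name: namespace[name]
                                            for name in parent_workflow_parameters}
```

Doing this before any import keeps modules and helper functions out of the captured names.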

In [None]:
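# %who_ls lists every user-defined variable; before any imports these are exactly the parameters above.
parameters = %who_ls
# Parameters specific to this wrapper; they are bundled separately rather than passed as top-level parameters.
parent_workflow_parameters = ['lineage_field', 'dr_strain', 'voi_strains', 'sub_lineages_mapping', 'rt_partition_freq', 'sampling_prop_partition_freq']
parameters = {var: eval(var) for var in parameters if var not in parent_workflow_parameters}
parameters['parent_workflow_parameters'] = {var: eval(var) for var in parent_workflow_parameters}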

Import packages, etc.

In [None]:
from collections import Counter
import pandas as pd
from beast_pype.COVID import gen_partition_dates
from beast_pype.nb_utils import execute_notebook
import importlib.resources as importlib_resources

Define save_dir.

In [None]:
save_dir = f"{overall_save_dir}/{specific_run_save_dir}"
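
The parameter description says a timestamp of format 'YYYY-MM-DD_hour-min-sec' replaces `specific_run_save_dir` when it is None, 'None' or an empty string. Where that substitution happens is not shown in this notebook, so the following is only a sketch of the documented rule; the helper name `resolve_run_dir` and its `now` argument are hypothetical.

```python
from datetime import datetime


def resolve_run_dir(overall_save_dir, specific_run_save_dir=None, now=None):
    """Hypothetical helper: fall back to a timestamp when no sub-directory is given."""
    if specific_run_save_dir in (None, 'None', ''):
        now = datetime.now() if now is None else now
        specific_run_save_dir = now.strftime('%Y-%m-%d_%H-%M-%S')
    return f"{overall_save_dir}/{specific_run_save_dir}"
```

For example, `resolve_run_dir('runs', None, datetime(2024, 1, 2, 3, 4, 5))` gives `'runs/2024-01-02_03-04-05'`, while an explicit name is used unchanged.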

# Error Checks

## Check that the values in sub_lineages_mapping are unique

In [None]:
seen = set()
for lineage, sub_lineages in sub_lineages_mapping.items():
    # Sub-lineages repeated within this lineage's own list.
    counts_over_one = [sub_lineage for sub_lineage, freq in Counter(sub_lineages).items() if freq > 1]
    if counts_over_one:
        raise ValueError(f"Sub lineage(s) {', '.join(counts_over_one)} appear more than once in the sub_lineages_mapping for {lineage}. \n" +
                         'All sublineages should only be listed once across the entire sub_lineages_mapping.')
    # Sub-lineages already listed under an earlier lineage.
    for sub_lineage in sub_lineages:
        if sub_lineage in seen:
            sub_lineage_in = [potential_lineage_found_in
                              for potential_lineage_found_in, sub_lineages_to_check in sub_lineages_mapping.items()
                              if sub_lineage in sub_lineages_to_check]
            raise ValueError(f"The sublineage {sub_lineage} is listed in the sub_lineages_mapping for {', '.join(sub_lineage_in)}. \n" +
                             'All sublineages should only be listed once across the entire sub_lineages_mapping.')
    seen.update(sub_lineages)
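
The same uniqueness rule can also be expressed as a function that reports every clash at once instead of raising on the first one. This is an alternative sketch, not part of the workflow; `find_duplicate_sub_lineages` and the example mappings are hypothetical.

```python
def find_duplicate_sub_lineages(sub_lineages_mapping):
    """Return {sub_lineage: [lineages listing it]} for each sub-lineage listed more than once."""
    listed_in = {}
    for lineage, sub_lineages in sub_lineages_mapping.items():
        for sub_lineage in sub_lineages:
            # A repeat inside one list adds the same lineage twice, so it is caught too.
            listed_in.setdefault(sub_lineage, []).append(lineage)
    return {sub: owners for sub, owners in listed_in.items() if len(owners) > 1}


ok = {'BA.2': ['BA.2', 'BA.2.12.1'], 'BA.5': ['BA.5', 'BA.5.1']}
clash = {'BA.2': ['BA.2', 'BA.5.1'], 'BA.5': ['BA.5', 'BA.5.1']}
```

Here `find_duplicate_sub_lineages(ok)` is empty, while `find_duplicate_sub_lineages(clash)` maps `'BA.5.1'` to both lineages that list it.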

# Create xml_set_definitions.

In [None]:
xml_set_definitions = {f'DR_{dr_strain}': f"`{lineage_field}` in ('" + "', '".join(sub_lineages_mapping[dr_strain]) + "')",
                    **{f"VOI_{voi_strain}":  f"`{lineage_field}` in ('" + "', '".join(sub_lineages_mapping[voi_strain]) + "')" for voi_strain in voi_strains}
                       }
parameters['xml_set_definitions'] = xml_set_definitions
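
For concreteness, this is what the generated definitions look like with small made-up inputs. The helper `lineage_query` is hypothetical and simply reproduces the string layout used in the cell above; the column name and lineage lists are assumptions.

```python
def lineage_query(lineage_field, sub_lineages):
    """Build a pandas-query style membership expression for one set of sub-lineages."""
    return f"`{lineage_field}` in ('" + "', '".join(sub_lineages) + "')"


lineage_field = 'pango_lineage'   # hypothetical metadata column
dr_strain = 'BA.2'
voi_strains = ['BA.5']
sub_lineages_mapping = {'BA.2': ['BA.2', 'BA.2.12.1'], 'BA.5': ['BA.5', 'BA.5.1']}

xml_set_definitions = {
    f'DR_{dr_strain}': lineage_query(lineage_field, sub_lineages_mapping[dr_strain]),
    **{f'VOI_{voi}': lineage_query(lineage_field, sub_lineages_mapping[voi]) for voi in voi_strains},
}
print(xml_set_definitions['DR_BA.2'])  # `pango_lineage` in ('BA.2', 'BA.2.12.1')
```

The backticks around the field name let the expression work with column names that are not valid Python identifiers.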

# Create rt_partitions & sampling_prop_partitions

In [None]:
if metadata_path.endswith('.tsv'):
    sep = '\t'
elif metadata_path.endswith('.csv'):
    sep = ','
else:
    raise ValueError(f"Unrecognized file format: {metadata_path}, only CSV and TSV files are supported.")
metadata = pd.read_csv(metadata_path, sep=sep, parse_dates=[collection_date_field])
selected_metadata = metadata[metadata[lineage_field].isin(seen)]
date_of_youngest_tip = selected_metadata[collection_date_field].max()
voi_oldest_tip_dates = {
    voi_strain: metadata[metadata[lineage_field].isin(sub_lineages_mapping[voi_strain])][collection_date_field].min()
    for voi_strain in voi_strains
    }
voi_youngest_oldest_tip = max(voi_oldest_tip_dates.values())

if rt_partition_freq is not None:
    parameters['rt_partitions'] = gen_partition_dates(rt_partition_freq,
                                                      date_of_youngest_tip,
                                                      voi_youngest_oldest_tip,
                                                      voi_strains,
                                                      dr_strain)
if sampling_prop_partition_freq is not None:
    parameters['sampling_prop_partitions'] = gen_partition_dates(sampling_prop_partition_freq,
                                                                date_of_youngest_tip,
                                                                voi_youngest_oldest_tip,
                                                                voi_strains,
                                                                dr_strain)
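
The internals of `gen_partition_dates` are not shown in this notebook. The sketch below only illustrates the rule given in the parameter description (step backwards from the youngest tip until the VOI endpoint, then one further step for the DR strain). The function `sketch_partition_dates`, its return format, the inclusive endpoint and the restriction to 'days' and 'weeks' are all assumptions, and `dr_extra_offset` is not handled.

```python
from datetime import date, timedelta


def sketch_partition_dates(freq, youngest_tip, voi_youngest_oldest_tip):
    """Hypothetical illustration: step back from youngest_tip to the VOI endpoint."""
    if freq['unit'] not in ('days', 'weeks'):
        raise ValueError("This sketch only handles 'days' and 'weeks'.")
    step = timedelta(**{freq['unit']: freq['every']})
    voi_dates = []
    current = youngest_tip - step
    while current >= voi_youngest_oldest_tip:
        voi_dates.append(current)
        current -= step
    # The DR strain gets one extra boundary, one step further back.
    dr_dates = voi_dates + [current]
    return {'voi': voi_dates, 'dr': dr_dates}


# "Youngest of the oldest": the latest first-sample date among the VOI strains.
voi_oldest = {'BA.4': date(2024, 1, 20), 'BA.5': date(2024, 2, 1)}  # made-up dates
voi_youngest_oldest_tip = max(voi_oldest.values())
result = sketch_partition_dates({'unit': 'weeks', 'every': 2},
                                date(2024, 3, 1), voi_youngest_oldest_tip)
```

With these made-up dates the VOI boundaries fall on 2024-02-16 and 2024-02-02, and the DR strain gains one more at 2024-01-19.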

# Run the BDSKY-Serial-Comparative workflow.

In [None]:
# files() returns the package directory as a path-like object; path() would return a context manager instead.
workflows_path = importlib_resources.files('beast_pype') / 'workflows'
parameters['report_template'] = 'COVID-Strain-Surveillance'
BDSKY_Serial_Comparative_log = execute_notebook(input_path=f'{workflows_path}/BDSKY-Serial-Comparative.ipynb',
                                                output_path=save_dir + '/BDSKY-Serial-Comparative.ipynb',
                                                parameters=parameters,
                                                progress_bar=True,
                                                nest_asyncio=True,
                                                kernel_name=kernel_name
                                                )