# Run WES-QC pipeline

This Jupyter notebook runs all steps of the WES-QC pipeline.
It can run either in SSH mode (on a local machine with remote access to the cluster) or directly on the cluster head node.

For a detailed explanation of each step, see `docs/wes-qc-hail.md`.




## How to run the notebook

### Run with jupyter notebook on the local machine

1. First, prepare a Python environment with a Jupyter server. No additional dependencies are required.
A quick way to do this is to install `uv` with `curl -LsSf https://astral.sh/uv/install.sh | sh` (see [uv](https://astral.sh/uv/) for more details).
Then run `uv add jupyterlab`; the `jupyter` executable will be installed in `.venv/bin/jupyter`.

2. Then, set up an SSH connection to the cluster. For example, if the cluster IP address is `172.27.1.1`, add the following to your `~/.ssh/config` file:
```
Host wes
    HostName 172.27.1.1
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
```
Here `id_rsa` is the private key that was configured for cluster access when the cluster was created.

3. Then, in the Python environment, run the following command to start the notebook:
```
jupyter notebook scripts/run-wes-qc-pipeline-all-steps.ipynb
```
Alternatively, if you are using VSCode, you can open the notebook directly in VSCode and set the Python interpreter to the one from the environment where Jupyter is installed.

4. Follow the instructions in the notebook to run the steps.
Note that the variable `jupyter_notebook_on_cluster` needs to be set to `False` in the notebook.

**Warning:** if you run the notebook in SSH mode, your computer runs a local Jupyter server to execute commands and track progress.
If your computer loses its connection to the cluster head node (because of network issues or, for example, when the computer goes to sleep), data processing terminates.


### Run with jupyter notebook on the cluster

1. First, prepare a Python environment with Jupyter installed, the same way as above.
See the above [Run with jupyter notebook on the local machine](#run-with-jupyter-notebook-on-the-local-machine) section for more details.

2. Then, start a jupyter notebook server on the cluster:
```
jupyter notebook --no-browser --port=8889 scripts/run-wes-qc-pipeline-all-steps.ipynb
```
**Note:** for clusters created by the Sanger `osdataproc` utility, you cannot use the default port 8888 because it is already used by the built-in notebook server on the master node.
The built-in notebook server is not suitable because its Spark instance claims all the resources, so the pipeline cannot run.

If you are using VSCode, you can open the notebook directly in VSCode and set the Python interpreter to the system Python.

3. Follow the instructions in the notebook to run the pipeline.
Note that the variable `jupyter_notebook_on_cluster` needs to be set to `True` in the notebook.


## Prepare to run

### Define several service functions

In [None]:
# Set this to `True` if you want to execute this jupyter notebook on the cluster head node.
jupyter_notebook_on_cluster = False

In [None]:
# Set up logging
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

In [None]:
# Helper that runs a shell command either via SSH (when the notebook runs locally)
# or directly (when the notebook runs on the cluster head node)
def run_cmd(cmd):
    if jupyter_notebook_on_cluster:
        !{cmd}
    else:
        !ssh -o StrictHostKeyChecking=no wes "{cmd}"
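
For readers who prefer to see the dispatch logic without IPython's `!` shell escape, the following is a minimal sketch of the same idea in plain Python. The helper name `build_argv` and the use of `bash -c` for the on-cluster case are assumptions made for illustration; the notebook itself relies on the `!` syntax shown above.

```python
import shlex


def build_argv(cmd: str, on_cluster: bool, host: str = "wes") -> list:
    """Return the argument vector that would execute `cmd` in the chosen mode.

    `host` defaults to the `wes` alias from the ~/.ssh/config example.
    """
    if on_cluster:
        # On the head node: hand the command line to a local shell (assumed: bash).
        return ["bash", "-c", cmd]
    # SSH mode: the remote shell receives the whole command as one string.
    return ["ssh", "-o", "StrictHostKeyChecking=no", host, cmd]


print(shlex.join(build_argv("hostname", on_cluster=False)))
# prints: ssh -o StrictHostKeyChecking=no wes hostname
```

A list built this way could be passed to `subprocess.run`, which avoids a second layer of shell quoting on the local side.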

In [None]:
# Function that runs a series of scripts
def run_step_scripts(step_folder, scripts):
    for script, prefix in scripts.items():
        print("=" * 120 + "\n")
        logger.info(f"Running {script}")
        cmd = f"cd {path_to_wes} && ./scripts/hlrun_local --prefix={prefix} {step_folder}/{script}"
        run_cmd(cmd)

### Specify the path to the wes-qc directory

The whole WES-QC repository should be located on the cluster head node.
Make sure that `config/inputs.yaml` is a symlink to the correct dataset config.
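
One way to verify the symlink is a small check like the sketch below, run on the cluster head node from the repository root. The function name `describe_config` is hypothetical and not part of the WES-QC code.

```python
from pathlib import Path


def describe_config(config_path):
    """Report whether the config path is a symlink and where it points."""
    path = Path(config_path)
    if not path.exists():
        # exists() follows symlinks, so a dangling link also ends up here.
        return f"{path}: missing or broken symlink"
    if path.is_symlink():
        return f"{path} -> {path.resolve()}"
    return f"{path}: regular file, not a symlink"


print(describe_config("config/inputs.yaml"))
```

If the reported target is not the dataset config you expect, recreate the symlink before running any step.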

In [None]:
path_to_wes = "/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm"

### Check Hail is working on the cluster

In [None]:
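python = "/home/ubuntu/venv/bin/python"
# In SSH mode run_cmd wraps the whole command in double quotes,
# so the inner double quotes must be backslash-escaped.
cmd = (
    f'cd {path_to_wes}; {python} -c \\"import hail as hl; hl.init()\\" '
    if not jupyter_notebook_on_cluster
    else f'cd {path_to_wes}; {python} -c "import hail as hl; hl.init()" '
)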
run_cmd(cmd)
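
The two branches above differ only in quoting. As an alternative sketch (not what the notebook does), Python's `shlex.quote` can produce a safely quoted argument without manual escaping. The helper name `wrap_for_ssh` is hypothetical, and `wes` is the host alias from the SSH config example.

```python
import shlex


def wrap_for_ssh(cmd, host="wes"):
    """Build a local shell command line that runs `cmd` on `host` via SSH."""
    # shlex.quote turns cmd into a single shell word, whatever quotes it contains.
    return f"ssh -o StrictHostKeyChecking=no {host} {shlex.quote(cmd)}"


inner = 'python -c "import hail as hl; hl.init()"'
print(wrap_for_ssh(inner))

# Splitting the line the way a POSIX shell would recovers the original command.
assert shlex.split(wrap_for_ssh(inner))[-1] == inner
```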

## Processing your data

### How to configure run steps
To configure a step, you only need to define a dictionary describing it.
Each key is the command to run (with arguments, if required), and the corresponding value is the name of the log file.
This notebook automatically collects all logs from the running steps.
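
To make the mapping concrete, the sketch below expands a step dictionary into the shell commands that `run_step_scripts` would build, using the same command template. The helper name `build_step_commands` and the path `/path/to/wes-qc` are placeholders for illustration only.

```python
def build_step_commands(path_to_wes, step_folder, scripts):
    """Return one shell command per (script, log prefix) entry, in dictionary order."""
    return [
        f"cd {path_to_wes} && ./scripts/hlrun_local --prefix={prefix} {step_folder}/{script}"
        for script, prefix in scripts.items()
    ]


example_step = {"1-import_1kg.py --all": "0-1-import-1kg"}
for command in build_step_commands("/path/to/wes-qc", "0-resource_preparation", example_step):
    print(command)
```

Because Python dictionaries preserve insertion order, the scripts run in the order in which they are listed.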


### Step 0 - Resource Preparation

In [None]:
step0 = {"1-import_1kg.py --all": "0-1-import-1kg", "2-generate-truthset-ht.py": "0-2-generate-truthset-ht"}
step0_folder = "0-resource_preparation"
run_step_scripts(step0_folder, step0)

### Step 1 - Import Data

In [None]:
step1 = {
    "1-import_gatk_vcfs_to_hail.py": "1-1-import_gatk_vcfs_to_hail",
    "2-import_annotations.py": "1-2-import_annotations",
    "3-validate-gtcheck.py": "1-3-validate-gtcheck",
    "4-mutation-spectra_preqc.py": "1-4-mutation-spectra_preqc",
}
step1_folder = "1-import_data"
run_step_scripts(step1_folder, step1)

### Step 2 - Sample QC

The Sample QC step is largely automated and usually produces acceptable results even with the default parameters.
However, we strongly advise you to consult the documentation, review the metric plots, and tune the filtering thresholds if necessary.

In [None]:
step2 = {
    "1-hard_filters_sex_annotation.py": "2-1-hard-filters-sex-annotation",
    "2-prune_related_samples.py": "2-2-prune-related-samples",
    "3-population_pca_prediction.py --all": "2-3-population-pca-prediction",
    "4-find_population_outliers.py": "2-4-find-population-outliers",
    "5-filter_fail_sample_qc.py": "2-5-filter-fail-sample-qc",
}
step2_folder = "2-sample_qc"
run_step_scripts(step2_folder, step2)

**After this step, you must review the plots generated by step 2.4 and check the total number of retained and filtered samples.**

If necessary, tune the sample filtering parameters in the config file. Please refer to the manual for details.

### Step 3 - Variant QC 

This is the first part of the Variant QC step, up to and including training of the random forest (RF) model.
Don't forget to set the RF model ID you want.

In [None]:
model_id = "test-run-model"

step3_1 = {
    "1-split_and_family_annotate.py --all": "3-1-split_and_family_annotate",
    "2-create_rf_ht.py": "3-2-create-rf-ht",
    f"3-train_rf.py --manual-model-id {model_id}": "3-3-train-rf",
}
step3_folder = "3-variant_qc"
run_step_scripts(step3_folder, step3_1)

At this point, we need to set the RF model ID (`rf_model_id`) in the config file.

In [None]:
yaml_file = "config/inputs.yaml"
cmd = f"sed --follow-symlinks -i 's/rf_model_id:.*/rf_model_id: {model_id}/' {path_to_wes}/{yaml_file}"
run_cmd(cmd)
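
To preview what the `sed` expression does before touching the real config, the same substitution can be reproduced in Python. This is only an illustrative sketch: the function name `set_rf_model_id` and the sample YAML text are made up, and the notebook itself edits the file with `sed` on the cluster.

```python
import re


def set_rf_model_id(yaml_text, model_id):
    """Replace everything after 'rf_model_id:' up to the end of that line."""
    # A callable replacement keeps model_id literal (no backslash escapes are interpreted).
    return re.sub(r"rf_model_id:.*", lambda match: f"rf_model_id: {model_id}", yaml_text)


# Hypothetical config fragment, used only to demonstrate the substitution.
sample = "  rf_model_id: old-model\n  other_key: value\n"
print(set_rf_model_id(sample, "test-run-model"))
```

Note that the pattern matches any line containing `rf_model_id:`, so every such line in the file is rewritten.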

Now we can run the second part of the Variant QC. Please refer to the docs (`docs/wes-qc-hail.md`) for details.

After finishing the variant QC process, you need to review the results and choose hard filters for the genotype filtering.


In [None]:
step3_2 = {
    "4-apply_rf.py": "3-4-apply-rf",
    "5-annotate_ht_after_rf.py": "3-5-annotate-ht-after-rf",
    "6-rank_and_bin.py": "3-6-rank-and-bin",
    "7-plot_rf_output.py": "3-7-plot-rf-output",
}
step3_folder = "3-variant_qc"
run_step_scripts(step3_folder, step3_2)

**At this point you must review the resulting plots to choose the correct bins for the first step of GenotypeQC.**
Please refer to the main manual for instructions.

### Step 4 - Genotype QC

To perform genotype QC, you need to determine the best combination of hard filters:
one that keeps as many "good" variants as possible
while removing the "bad" variants and genotypes.

The first script of the genotype QC helps you to analyze different combinations of hard filters
and choose optimal values.

Please refer to the docs for details.

In [None]:
step4_1 = {
    "1-compare_hard_filter_combinations.py --all": "4-1-compare-hard-filter-combinations",
}
step4_folder = "4-genotype_qc"
run_step_scripts(step4_folder, step4_1)

**At this step, you MUST review and analyze the results to choose correct values for the hard filter combinations.**
The values for the public datasets are not suitable for your data.
Please refer to the docs for details: `docs/wes-qc-hail.md`.

After choosing the hard filters, you can run the last part of the data processing.

In [None]:
step4_2 = {
    "2-apply_range_of_hard_filters.py": "4-2-apply-range-of-hard-filters",
    "3a-export_vcfs_range_of_hard_filters.py": "4-3a-export-vcfs-range-of-hard-filters",
    "3b-export_vcfs_stringent_filters.py": "4-3b-export-vcfs-stringent-filters",
    "5-mutation-spectra_afterqc.py": "4-5-mutation-spectra_afterqc",
}
step4_folder = "4-genotype_qc"
run_step_scripts(step4_folder, step4_2)