# Run WES-QC pipeline

This Jupyter notebook runs all steps of the WES-QC pipeline.
It can be run either in SSH mode (on a local machine with remote access to the cluster) or directly on the cluster head node.

For details on each step, see the documentation in `docs/wes-qc-hail.md`.




### Run with jupyter notebook on the local machine

1. First, prepare a Python environment with a Jupyter server. No additional dependencies are needed.
A quick way to do this is to install `uv` with `curl -LsSf https://astral.sh/uv/install.sh | sh` (see [uv](https://astral.sh/uv/) for more details).
Then run `uv add jupyterlab`; the `jupyter` executable will be installed in `.venv/bin/jupyter`.

2. Then, set up an SSH connection to the cluster. For example, if the cluster IP address is `172.27.1.1`, add the following to your `~/.ssh/config` file:
```
Host wes
    HostName 172.27.1.1
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
```
Here, `id_rsa` is the private key used for accessing the cluster, as configured when the cluster was created.

3. Then, in the python environment, run the following command to start the notebook:
```
jupyter notebook scripts/run-wes-qc-pipeline-all-steps.ipynb
```
Alternatively, if you are using VSCode, you can open the notebook directly in VSCode and set the Python interpreter to the one that has Jupyter installed.

4. Follow the instructions in the notebook to run the steps.
Note that the variable `jupyter_notebook_on_cluster` needs to be set to `False` in the notebook.

**Warning:** in SSH mode, your computer runs a local Jupyter server that executes the commands and tracks their progress.
If your computer loses its connection to the cluster head node (because of a network issue or, for example, when the computer goes to sleep), the data processing terminates.


### Run with jupyter notebook on the cluster

1. First, prepare a Python environment with Jupyter installed, as described above.
See the [Run with jupyter notebook on the local machine](#run-with-jupyter-notebook-on-the-local-machine) section for details.

2. Then, start a jupyter notebook server on the cluster:
```
jupyter notebook --no-browser --port=8889 scripts/run-wes-qc-pipeline-all-steps.ipynb
```
Note for clusters created by the Sange[…]: you cannot use the default port 8888 because it is already used by the built-in notebook server on the master node.
The built-in notebook server on the master node is not suitable for running this notebook, because its Spark session claims all the resources and the pipeline will not be able to run.

If you are using VSCode, you can open the notebook directly in VSCode and set the python interpreter to the system python.

3. Follow the instructions in the notebook to run the pipeline.
Note that the variable `jupyter_notebook_on_cluster` needs to be set to `True` in the notebook.


In [2]:
# Set this to `True` if you want to execute this jupyter notebook on the cluster head node.
jupyter_notebook_on_cluster = False

In [3]:
# Set up logging
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

In [4]:
# Small helper that runs a command either via SSH (for a local run)
# or directly (when the notebook runs on the cluster)
def run_cmd(cmd):
    if jupyter_notebook_on_cluster:
        !{cmd}
    else:
        !ssh -o StrictHostKeyChecking=no wes "{cmd}"
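
The `!` shell escape used above only works inside IPython or Jupyter. A minimal sketch of the same dispatch in plain Python is shown below. It assumes the `wes` host alias from `~/.ssh/config`; the names `build_argv` and `run_cmd_plain` are hypothetical and not part of the WES-QC repository.

```python
import subprocess


def build_argv(cmd, on_cluster, host="wes"):
    """Return the argv list that runs `cmd` directly or on `host` via SSH."""
    if on_cluster:
        # Run through a local shell, similar to the `!{cmd}` form.
        return ["sh", "-c", cmd]
    # ssh hands the command string to the shell on the remote host.
    return ["ssh", "-o", "StrictHostKeyChecking=no", host, cmd]


def run_cmd_plain(cmd, on_cluster, host="wes"):
    """Run the command and return its exit code (hypothetical helper)."""
    return subprocess.run(build_argv(cmd, on_cluster, host)).returncode
```

Passing the command as a single argv element avoids one layer of manual quoting on the local side; the remote shell still parses the string, so commands with nested quotes need the same care as before.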

In [5]:
# Function that runs a series of scripts
def run_step_scripts(step_folder, scripts):
    for script, prefix in scripts.items():
        print("=" * 120 + "\n")
        logger.info(f"Running {script}")
        cmd = f"cd {path_to_wes} && ./scripts/hlrun_local --prefix={prefix} {step_folder}/{script}"
        run_cmd(cmd)
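
To preview what `run_step_scripts` would execute without launching anything, the command strings can be built separately. The sketch below mirrors the f-string used above; the helper name `build_step_commands` and the example path are hypothetical.

```python
def build_step_commands(path_to_wes, step_folder, scripts):
    """Return the shell commands for a step, in dictionary order."""
    return [
        f"cd {path_to_wes} && ./scripts/hlrun_local --prefix={prefix} {step_folder}/{script}"
        for script, prefix in scripts.items()
    ]


# Hypothetical repository path, for illustration only.
preview = build_step_commands(
    "/path/to/wes-qc",
    "1-import_data",
    {"2-import_annotations.py": "1-2-import_annotations"},
)
for line in preview:
    print(line)
```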

### Specify the path to the wes-qc directory

The whole WES-QC repository should be located on the cluster head node.
Make sure that `config/inputs.yaml` is a symlink to the correct dataset config.

In [7]:
path_to_wes = "/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm"
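
One way to check where the config symlink points is sketched below. It only works where the repository is visible on the local filesystem (for example, when the notebook runs on the cluster); the helper name `describe_config` is hypothetical.

```python
import os


def describe_config(path_to_wes, rel_path="config/inputs.yaml"):
    """Return (is_symlink, resolved_path) for the pipeline config file."""
    cfg = os.path.join(path_to_wes, rel_path)
    return os.path.islink(cfg), os.path.realpath(cfg)
```

In SSH mode, a comparable check could be done remotely, for example with `run_cmd(f"readlink -f {path_to_wes}/config/inputs.yaml")`.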

### Check Hail is working on the cluster

In [8]:
python = "/home/ubuntu/venv/bin/python"
cmd = (
    f'cd {path_to_wes}; {python} -c \\"import hail as hl; hl.init()\\" '
    if not jupyter_notebook_on_cluster
    else f'cd {path_to_wes}; {python} -c "import hail as hl; hl.init()" '
)
run_cmd(cmd)

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1353-0.2.133-4c60fddb171a.log
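
The two variants of `cmd` above differ only in quote escaping: in SSH mode the whole command is wrapped in double quotes, so the inner quotes must be escaped. The sketch below illustrates this with `shlex`, which approximates POSIX shell word splitting; it is an illustration of the quoting, not of what the pipeline itself does, and `local_view` is a hypothetical helper.

```python
import shlex


def local_view(cmd, host="wes"):
    """Split the SSH command line roughly as the local shell would."""
    return shlex.split(f'ssh {host} "{cmd}"')


python = "/home/ubuntu/venv/bin/python"
escaped = f'{python} -c \\"import hail as hl; hl.init()\\"'
unescaped = f'{python} -c "import hail as hl; hl.init()"'

# With escaping, the remote command arrives as one intact argument.
print(local_view(escaped))
# Without escaping, the inner quotes end the outer string early
# and the command falls apart into several words.
print(local_view(unescaped))
```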


### How to configure run steps
To configure run steps, you only need to set up a dictionary describing the step.
Each dictionary key is the command to run (with arguments, if required), and the value is the name of the log file.
This notebook automatically collects all logs from the running steps.
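
As an illustration of this layout, the sketch below defines a step dictionary and splits each key into the script name and its arguments. The script names and log prefixes are made up for the example, and `describe_step` is a hypothetical helper. Python dictionaries preserve insertion order, so the scripts run in the order they are written.

```python
import shlex

# Hypothetical step definition: names and prefixes are illustrative only.
example_step = {
    "1-first_script.py --all": "9-1-first-script",
    "2-second_script.py": "9-2-second-script",
}


def describe_step(scripts):
    """Return (script, arguments, log_prefix) tuples in execution order."""
    rows = []
    for command, prefix in scripts.items():
        script, *args = shlex.split(command)
        rows.append((script, args, prefix))
    return rows


for row in describe_step(example_step):
    print(row)
```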


### Step 0 - Resource Preparation

In [8]:
step0 = {"1-import_1kg.py --all": "0-1-import-1kg", "2-generate-truthset-ht.py": "0-2-generate-truthset-ht"}
step0_folder = "0-resource_preparation"
run_step_scripts(step0_folder, step0)

2025-02-25 13:43:45,640 - INFO - Running 1-import_1kg.py --all



Running the job with spark-submit
spark-submit 0-resource_preparation/1-import_1kg.py --all
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/0-resource_preparation
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/0-resource_preparation/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250225-1343-0.2.133-4c60fddb171a.log
Loading VCFs from file:///lustre/scratch126/teams/hgi/gz3/public-dataset/resources/mini_1000G
2025-02-25 13:44:04.948 Hail: INFO: Reading table without type imputation1) / 1]
  Loading field 'Sample name' as type str (not specified)
  Loading field 'Se

2025-02-25 13:45:45,605 - INFO - Running 2-generate-truthset-ht.py



Running the job with spark-submit
spark-submit 0-resource_preparation/2-generate-truthset-ht.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/0-resource_preparation
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/0-resource_preparation/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
^C


### Step 1 - Import Data

In [7]:
step1 = {
    "1-import_gatk_vcfs_to_hail.py": "1-1-import_gatk_vcfs_to_hail",
    "2-import_annotations.py": "1-2-import_annotations",
    "3-validate-gtcheck.py": "1-3-validate-gtcheck",
    "4-mutation-spectra_preqc.py": "1-4-mutation-spectra_preqc",
}
step1_folder = "1-import_data"
run_step_scripts(step1_folder, step1)

2025-02-26 10:15:51,341 - INFO - Running 1-import_gatk_vcfs_to_hail.py



Running the job with spark-submit
spark-submit 1-import_data/1-import_gatk_vcfs_to_hail.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1015-0.2.133-4c60fddb171a.log
info: Found 1 VCFs in /lustre/scratch126/teams/hgi/gz3/public-dataset/test_data_from_1000_genomes
info: Loading VCFs WITHOUT header
Saving as hail mt to file:///lustre/scratch126/teams/hgi/gz3/public-dataset/matrixtables/gatk_unprocessed.mt
2025-02-26 10:16:06.436 Hail: 

2025-02-26 10:16:45,081 - INFO - Running 2-import_annotations.py



Running the job with spark-submit
spark-submit 1-import_data/2-import_annotations.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1016-0.2.133-4c60fddb171a.log
=== Running verifyBamID validation 
2025-02-26 10:17:04.199 Hail: INFO: Reading table without type imputation1) / 1]
  Loading field '#SEQ_ID' as type str (user-supplied)
  Loading field 'RG' as type str (user-supplied)
  Loading field 'CHIP_ID' as type str (user-supplied)
 

2025-02-26 10:17:32,456 - INFO - Running 3-validate-gtcheck.py



Running the job with spark-submit
spark-submit 1-import_data/3-validate-gtcheck.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1017-0.2.133-4c60fddb171a.log

=== WES data samples validation ===
Total 111 IDs in WES data data
111 unique IDs in WES data
All IDs in WES data are unique

=== MicroARRAY data samples validation ===
Total 111 IDs in MicroARRAY data data
111 unique IDs in MicroARRAY data
All IDs in MicroARRAY data are 

2025-02-26 10:18:02,106 - INFO - Running 4-mutation-spectra_preqc.py



Running the job with spark-submit
spark-submit 1-import_data/4-mutation-spectra_preqc.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/1-import_data/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1018-0.2.133-4c60fddb171a.log
2025-02-26 10:18:16.091 Hail: WARN: entries(): Resulting entries table is sorted by '(row_key, col_key)'.
    To preserve row-major matrix table order, first unkey columns with 'key_cols_by()'
2025-02-26 10:18:25.413 Hail: INFO: Ordering unsorted dataset with network shuffl

### Step 2 - Sample QC

The SampleQC step is largely automated and mostly produces acceptable results even with the default set of parameters.
However, we strongly advise you to refer to the documentation, review the metrics plots and tune filtering thresholds if necessary.

In [8]:
step2 = {
    "1-hard_filters_sex_annotation.py": "2-1-hard-filters-sex-annotation",
    "2-prune_related_samples.py": "2-2-prune-related-samples",
    "3-population_pca_prediction.py --all": "2-3-population-pca-prediction",
    "4-find_population_outliers.py": "2-4-find-population-outliers",
    "5-filter_fail_sample_qc.py": "2-5-filter-fail-sample-qc",
}
step2_folder = "2-sample_qc"
run_step_scripts(step2_folder, step2)

2025-02-26 10:21:41,777 - INFO - Running 1-hard_filters_sex_annotation.py



Running the job with spark-submit
spark-submit 2-sample_qc/1-hard_filters_sex_annotation.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1021-0.2.133-4c60fddb171a.log
Reading input matrix
=== Applying hard filters before sex prediction ===
===Imputing sex ===
2025-02-26 10:21:57.764 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2025

2025-02-26 10:23:28,054 - INFO - Running 2-prune_related_samples.py



Running the job with spark-submit
spark-submit 2-sample_qc/2-prune_related_samples.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1023-0.2.133-4c60fddb171a.log
=== Filtering to autosomes
=== Splitting multiallelic sites
=== Performing LD pruning
2025-02-26 10:23:43.086 Hail: INFO: ld_prune: running local pruning stage with max queue size of 860371 variants
2025-02-26 10:23:57.983 Hail: INFO: wrote table with 106794 rows in 3 partitions

2025-02-26 10:27:01,150 - INFO - Running 3-population_pca_prediction.py --all



Running the job with spark-submit
spark-submit 2-sample_qc/3-population_pca_prediction.py --all
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1027-0.2.133-4c60fddb171a.log
2025-02-26 10:27:22.848 Hail: INFO: Reading table without type imputation1) / 1]
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type int32 (user-supplied)
  Loading field 'f3' as type str (use

2025-02-26 10:28:50,822 - INFO - Running 4-find_population_outliers.py



Running the job with spark-submit
spark-submit 2-sample_qc/4-find_population_outliers.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/2-sample_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250226-1028-0.2.133-4c60fddb171a.log
2025-02-26 10:29:22.174 Hail: INFO: wrote matrix table with 566135 rows and 111 columns in 3 partitions to file:///lustre/scratch126/teams/hgi/gz3/public-dataset/matrixtables/gatk_unprocessed_with_pop.mt
Filtering input MT by depth: DP=20, genotype quality: GQ=20, VAF: VAF=0.25

2025-02-26 10:31:53,490 - INFO - Running 5-filter_fail_sample_qc.py



Running the job with spark-submit
spark-submit 2-sample_qc/5-filter_fail_sample_qc.py
Read from remote host 172.27.25.130: Connection reset by peer
client_loop: send disconnect: Broken pipe
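
The run above was cut off when the SSH connection dropped. One possible way to continue is to re-run the step starting from the interrupted script. The sketch below selects the remaining part of a step dictionary; `scripts_from` is a hypothetical helper, and whether a given script can safely be re-run after an interruption is an assumption to verify against the documentation.

```python
def scripts_from(scripts, first_script):
    """Return the part of the step dictionary starting at `first_script`."""
    keys = list(scripts)
    if first_script not in keys:
        raise KeyError(f"{first_script!r} is not a key of the step dictionary")
    start = keys.index(first_script)
    # Dictionaries preserve insertion order, so the original order is kept.
    return {key: scripts[key] for key in keys[start:]}
```

For example, `run_step_scripts(step2_folder, scripts_from(step2, "5-filter_fail_sample_qc.py"))` would run only the last Sample QC script.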


### Step 3 - Variant QC 
The first part of the VariantQC step - before training the random forest model.
Don't forget to set the RF model ID you want.

In [12]:
model_id = "test-run-model"

step3_1 = {
    "1-split_and_family_annotate.py --all": "3-1-split_and_family_annotate",
    "2-create_rf_ht.py": "3-2-create-rf-ht",
    f"3-train_rf.py --manual-model-id {model_id}": "3-3-train-rf",
}
step3_folder = "3-variant_qc"
run_step_scripts(step3_folder, step3_1)

2025-02-27 11:02:38,944 - INFO - Running 1-split_and_family_annotate.py --all



Running the job with spark-submit
spark-submit 3-variant_qc/1-split_and_family_annotate.py --all
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1102-0.2.133-4c60fddb171a.log
2025-02-27 11:03:09.602 Hail: INFO: wrote matrix table with 566135 rows and 104 columns in 3 partitions to file:///lustre/scratch126/teams/hgi/gz3/public-dataset/matrixtables/mt_varqc.mt
Annotating entries with allele balance
Annotating variants with mean allele balan

2025-02-27 11:22:10,271 - INFO - Running 2-create_rf_ht.py



Running the job with spark-submit
spark-submit 3-variant_qc/2-create_rf_ht.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1122-0.2.133-4c60fddb171a.log
pedfile:  /lustre/scratch126/teams/hgi/gz3/public-dataset/metadata/test_1000g_v2.triosonly.ped
--- Annotation with trio stats ---
INFO (gnomad.variant_qc.random_forest 176): Computing feature medians for imputation of missing numeric values
INFO (gnomad.variant_qc.random_forest 196): V

2025-02-27 11:23:08,594 - INFO - Running 3-train_rf.py --manual-model-id test-run-model



Running the job with spark-submit
spark-submit 3-variant_qc/3-train_rf.py --manual-model-id test-run-model
The gnomAD fucntion train_rf_model() in some cases
could work incorrectly in the parallel SPARK environment.

If the run of the function will fail with some weird message
(no space left on device, wrong imports, etc),
try running model training on the master node only:

PYTHONPATH=$(pwd):$PYTHONPATH PYSPARK_DRIVER_PYTHON=/home/ubuntu/venv/bin/python spark-submit --master local[*]  3-variant_qc/3-train_rf.py

info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /

Set the RF model ID in the config file:

In [10]:
yaml_file = "config/inputs.yaml"
cmd = f"sed --follow-symlinks -i 's/rf_model_id:.*/rf_model_id: {model_id}/' {path_to_wes}/{yaml_file}"
run_cmd(cmd)
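
The `sed` expression replaces everything after `rf_model_id:` on each matching line. A rough Python equivalent operating on a string is sketched below, which can help to check the substitution before touching the real file. The helper name `set_rf_model_id` and the sample YAML layout are assumptions.

```python
import re


def set_rf_model_id(yaml_text, model_id):
    """Replace the value on every `rf_model_id:` line of the given text."""
    # A function replacement keeps any backslashes in model_id literal.
    return re.sub(r"rf_model_id:.*", lambda _match: f"rf_model_id: {model_id}", yaml_text)


# Hypothetical config fragment, for illustration only.
sample = "general:\n  rf_model_id: old-model\n  other_key: 1\n"
print(set_rf_model_id(sample, "test-run-model"))
```

Unlike `sed -i --follow-symlinks`, this sketch does not modify any file; it only returns the updated text.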

Now we can run the second part of the VariantQC. Please refer to the docs (`docs/wes-qc-hail.md`) for details.

After the variant QC process finishes, you need to review the results and choose hard filters for the genotype filtering.


In [14]:
step3_2 = {
    "4-apply_rf.py": "3-4-apply-rf",
    "5-annotate_ht_after_rf.py": "3-5-annotate-ht-after-rf",
    "6-rank_and_bin.py": "3-6-rank-and-bin",
    "7-plot_rf_output.py": "3-7-plot-rf-output",
}
step3_folder = "3-variant_qc"
run_step_scripts(step3_folder, step3_2)

2025-02-27 12:28:19,897 - INFO - Running 4-apply_rf.py



Running the job with spark-submit
spark-submit 3-variant_qc/4-apply_rf.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1228-0.2.133-4c60fddb171a.log
INFO (gnomad.variant_qc.random_forest 428): Loading model from file:///lustre/scratch126/teams/hgi/gz3/public-dataset/variant_qc_random_forest/test-run-model/rf.model
INFO (gnomad.variant_qc.random_forest 353): Applying RF model.      (0 + 1) / 1]
2025-02-27 12:29:45.037 Hail: INFO: Coerced

2025-02-27 12:30:02,726 - INFO - Running 5-annotate_ht_after_rf.py



Running the job with spark-submit
spark-submit 3-variant_qc/5-annotate_ht_after_rf.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1230-0.2.133-4c60fddb171a.log
2025-02-27 12:30:21.531 Hail: INFO: Reading table without type imputation1) / 1]
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type str (user-supplied)
  Loading field 'f3' as type str (user-supplied

2025-02-27 12:38:14,721 - INFO - Running 6-rank_and_bin.py



Running the job with spark-submit
spark-submit 3-variant_qc/6-rank_and_bin.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1238-0.2.133-4c60fddb171a.log
=== Assigning ranks ===
2025-02-27 12:38:35.019 Hail: INFO: Ordering unsorted dataset with network shuffle
2025-02-27 12:38:41.539 Hail: INFO: wrote table with 579395 rows in 3 partitions to file:///lustre/scratch126/teams/hgi/gz3/public-dataset/tmp/persist_TablegtbECTK1VE
2025-02-27 1

2025-02-27 12:39:42,536 - INFO - Running 7-plot_rf_output.py



Running the job with spark-submit
spark-submit 3-variant_qc/7-plot_rf_output.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/3-variant_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1239-0.2.133-4c60fddb171a.log
2025-02-27 12:40:02.778 Hail: INFO: Ordering unsorted dataset with network shuffle
  df = df.groupby(["model", "bin"]).agg(
  df = df.groupby(["model", "bin"]).agg(
  df = df.groupby(["model", "bin"]).agg(
  y_fun=lambda x: x[0],
  df[f"{col}_cumul"] = df.groupby("model")[col].aggr

### Step 4 - Genotype QC

To perform genotype QC, you need to determine the combination of hard filters
that retains as many "good" variants as possible
while removing the "bad" variants and genotypes.

The first script of the genotype QC helps you to analyze different combinations of hard filters
and choose optimal values.

Please refer to the docs for details.
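
To give an idea of what "combinations of hard filters" means, the sketch below enumerates a small grid of thresholds. The filter names `DP`, `GQ` and `VAF` appear in the Sample QC log output earlier in this notebook; using them here, and the threshold values themselves, are assumptions made for illustration and not the pipeline's actual settings.

```python
from itertools import product

# Hypothetical threshold grids, not the pipeline's values.
dp_values = [5, 10]
gq_values = [10, 20]
vaf_values = [0.2, 0.25]

# Every combination of one value per filter.
combinations = [
    {"DP": dp, "GQ": gq, "VAF": vaf}
    for dp, gq, vaf in product(dp_values, gq_values, vaf_values)
]

print(len(combinations))  # 2 * 2 * 2 = 8
```

The number of combinations grows multiplicatively with each additional threshold value, which is one reason to keep the grids small.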

In [15]:
step4_1 = {
    "1-compare_hard_filter_combinations.py --all": "4-1-compare-hard-filter-combinations",
}
step4_folder = "4-genotype_qc"
run_step_scripts(step4_folder, step4_1)

2025-02-27 12:42:33,787 - INFO - Running 1-compare_hard_filter_combinations.py --all



Running the job with spark-submit
spark-submit 4-genotype_qc/1-compare_hard_filter_combinations.py --all
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1242-0.2.133-4c60fddb171a.log
=== Preparing control GIAB sample ===
INFO:root:Preparing GiaB HailTable
2025-02-27 12:42:52.751 Hail: INFO: Reading table without type imputation1) / 1]
  Loading field 'f0' as type str (not specified)
  Loading field 'f1' as type int32 (user-supplied)
 

**At this step, you MUST review and analyze the results to choose correct values for the hard filter combinations.**
The values for the public datasets are not suitable for your data.
Please refer to the docs for details: `docs/wes-qc-hail.md`.

After choosing the hard filters, you can run the last part of the data processing:

In [10]:
step4_2 = {
    "2-apply_range_of_hard_filters.py": "4-2-apply-range-of-hard-filters",
    "3a-export_vcfs_range_of_hard_filters.py": "4-3a-export-vcfs-range-of-hard-filters",
    "3b-export_vcfs_stingent_filters.py": "4-3b-export-vcfs-stingent-filters",
    "5-mutation-spectra_afterqc.py": "4-5-mutation-spectra_afterqc",
}
step4_folder = "4-genotype_qc"
run_step_scripts(step4_folder, step4_2)

2025-02-27 13:55:11,461 - INFO - Running 2-apply_range_of_hard_filters.py



Running the job with spark-submit
spark-submit 4-genotype_qc/2-apply_range_of_hard_filters.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1355-0.2.133-4c60fddb171a.log
2025-02-27 13:55:30.657 Hail: INFO: Reading table without type imputation1) / 1]
  Loading field 'f0' as type str (user-supplied)
  Loading field 'f1' as type int32 (user-supplied)
  Loading field 'f2' as type str (user-supplied)
  Loading field 'f3' as type str (use

2025-02-27 14:01:20,490 - INFO - Running 3a-export_vcfs_range_of_hard_filters.py



Running the job with spark-submit
spark-submit 4-genotype_qc/3a-export_vcfs_range_of_hard_filters.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1401-0.2.133-4c60fddb171a.log
Exporting chr1
2025-02-27 14:01:34.553 Hail: WARN: export_vcf: ignored the following fields:
    'chromosome' (global)
    'meanHetAB' (row)
2025-02-27 14:02:00.169 Hail: INFO: merging 4 files totalling 46.9M... + 1) / 3]
2025-02-27 14:02:01.373 Hail: INFO: w

2025-02-27 14:06:21,605 - INFO - Running 3b-export_vcfs_stingent_filters.py



Running the job with spark-submit
spark-submit 4-genotype_qc/3b-export_vcfs_stingent_filters.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1406-0.2.133-4c60fddb171a.log
Exporting chr1
2025-02-27 14:06:34.882 Hail: WARN: export_vcf: ignored the following fields:
    'chromosome' (global)
    'meanHetAB' (row)
    'stringent_AN' (row)
    'stringent_AC' (row)
    'stringent_AC_Hom' (row)
    'stringent_AC_Het' (row)
    'medium_

2025-02-27 14:10:52,244 - INFO - Running 5-mutation-spectra_afterqc.py



Running the job with spark-submit
spark-submit 4-genotype_qc/5-mutation-spectra_afterqc.py
info: script_dir  /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc
Loading config '/lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/4-genotype_qc/../config/inputs.yaml', default
=== Cleaning up temporary folder /lustre/scratch126/teams/hgi/gz3/public-dataset/tmp
=== Initializing Hail ===
Running on Apache Spark version 3.5.3
SparkUI available at http://spark-master:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /lustre/scratch126/teams/hgi/gz3/wes_qc_pycharm/hail-20250227-1410-0.2.133-4c60fddb171a.log
2025-02-27 14:11:05.863 Hail: WARN: entries(): Resulting entries table is sorted by '(row_key, col_key)'.
    To preserve row-major matrix table order, first unkey columns with 'key_cols_by()'
2025-02-27 14:11:16.323 Hail: INFO: Ordering unsorted dataset with network shuf