# Parabricks Hands-On Workshop

## DNA Methylation Analysis by Bisulfite Sequencing

In DNA methylation analysis by bisulfite sequencing, unmethylated cytosines (C) are converted to uracil and read as thymine (T) after sequencing. Only methylated cytosines (mC) are still read as cytosine (C). Here we perform the alignment step, which is the most time-consuming step in the analysis.
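
To make the conversion rule concrete, here is a minimal sketch that simulates what a bisulfite-treated read looks like. The function name `bisulfite_convert` and the example sequence are hypothetical and only illustrate the idea; they are not part of Parabricks or `bwameth.py`.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment plus sequencing of one strand.

    An unmethylated C is read as T; a C at a methylated position stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


original = "ACGTCCGA"
# Hypothetical example: only the C at index 1 is methylated.
converted = bisulfite_convert(original, methylated_positions={1})
print(original)
print(converted)
```

Comparing the two printed strings shows why a standard aligner struggles with such reads: every unmethylated C looks like a C-to-T mismatch against the unconverted reference.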

#### Download the Reference Genome

We will download the dataset used in the germline workflow tutorial. It contains a human reference genome sequence and the index files we need.
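
Because the archive is large, it may be worth checking free disk space first. The sketch below uses the sizes quoted in the download cell (9.3 GB for the tar file plus 14 GB extracted); the helper name `has_enough_space` is hypothetical.

```python
import shutil

# Sizes quoted in the download cell: 9.3 GB archive plus 14 GB once extracted.
REQUIRED_GB = 9.3 + 14


def has_enough_space(free_bytes, required_gb=REQUIRED_GB):
    """Return True if the free space covers the archive and its extracted contents."""
    return free_bytes >= required_gb * 1e9


free_bytes = shutil.disk_usage(".").free
print(f"Free: {free_bytes / 1e9:.1f} GB, needed: about {REQUIRED_GB:.1f} GB")
print("OK to download" if has_enough_space(free_bytes) else "Not enough free space")
```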

In [None]:
# The tar file is 9.3 GB and needs an additional 14 GB once extracted
!mkdir -p sample_data
%cd sample_data
!wget -O parabricks_sample.tar.gz "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
!tar xvf parabricks_sample.tar.gz
!mv parabricks_sample/* .
%cd ..

In [6]:
!ls sample_data/Ref

Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.amb
Homo_sapiens_assembly38.fasta.ann
Homo_sapiens_assembly38.fasta.bwt
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.pac
Homo_sapiens_assembly38.fasta.sa
Homo_sapiens_assembly38.known_indels.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz.tbi


In [11]:
!mkdir -p outputdir

#### Generate Reference Genome Index

In bisulfite sequencing, the reference genome needs to be converted in silico, turning 'C' to 'T' (forward strand) and 'G' to 'A' (reverse strand), so that bisulfite-converted reads can be aligned to it. Here we download `bwameth.py` to do this.
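
The two conversions can be sketched in a few lines of Python. This is only an illustration of the idea on a toy sequence, not the code `bwameth.py` actually runs; the function names `c2t` and `g2a` are chosen here for the example.

```python
def c2t(seq):
    """Forward-strand conversion: every C becomes T."""
    return seq.upper().replace("C", "T")


def g2a(seq):
    """Reverse-strand conversion, written in forward coordinates: every G becomes A."""
    return seq.upper().replace("G", "A")


reference = "ACGTCCGA"  # toy sequence, not real genome data
forward_converted = c2t(reference)
reverse_converted = g2a(reference)
print(forward_converted)
print(reverse_converted)
```

Indexing both converted copies is what lets a read from either bisulfite-treated strand find an exact or near-exact match.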

In [48]:
!wget https://raw.githubusercontent.com/brentp/bwa-meth/v0.2.7/bwameth.py

--2024-10-11 08:44:16--  https://raw.githubusercontent.com/brentp/bwa-meth/v0.2.7/bwameth.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19560 (19K) [text/plain]
Saving to: ‘bwameth.py’


2024-10-11 08:44:16 (13.4 MB/s) - ‘bwameth.py’ saved [19560/19560]



Install the tools that `bwameth.py` depends on.

In [2]:
!pip install toolshed
!apt install -y bwa

Collecting toolshed
  Downloading toolshed-0.4.6.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: toolshed
  Building wheel for toolshed (setup.py) ... done
  Created wheel for toolshed: filename=toolshed-0.4.6-py3-none-any.whl size=9196 sha256=43eb47948f2af33b645a8b3a0bab1e540d3b52d5183555a3e9a12aef92470ab6
  Stored in directory: /root/.cache/pip/wheels/ee/f1/0f/4f83f90d39e7c7aed3aac15e04bf1847beaf1d2affb896fd8e
Successfully built toolshed
Installing collected packages: toolshed
Successfully installed toolshed-0.4.6
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  bwa
0 upgraded, 1 newly installed, 0 to remove and 12 not upgraded.
Need to get 195 kB of archives.
After this operation, 466 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 bwa amd64 0.7.17-6 [195 kB]
Fetch

Use `bwameth.py` to generate the converted reference genome and its index for bisulfite mapping. This step runs on the CPU and took 2.5 hours.

In [4]:
!python3 bwameth.py index sample_data/Ref/Homo_sapiens_assembly38.fasta

already converted: c2t in sample_data/Ref/Homo_sapiens_assembly38.fasta to sample_data/Ref/Homo_sapiens_assembly38.fasta.bwameth.c2t
indexing with bwa-mem: sample_data/Ref/Homo_sapiens_assembly38.fasta.bwameth.c2t
[bwa_index] Pack FASTA... 72.55 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=12869387668, availableWord=917537264
[BWTIncConstructFromPacked] 10 iterations done. 99999988 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999988 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999988 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999988 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999988 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999988 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999988 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999988 characters proces
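
Since indexing takes hours, it can help to check whether the index already exists before re-running the cell. The sketch below assumes the converted reference is written next to the FASTA with the `.bwameth.c2t` suffix (as the log above shows) and that `bwa index` adds the same five suffixes seen in `sample_data/Ref`. The helper name `missing_index_files` is hypothetical.

```python
from pathlib import Path

# Assumption: the suffixes that classic `bwa index` writes next to its input.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(reference_fasta):
    """Return the expected bwameth index files that are not yet on disk."""
    converted = str(reference_fasta) + ".bwameth.c2t"
    expected = [Path(converted)] + [Path(converted + s) for s in BWA_INDEX_SUFFIXES]
    return [p for p in expected if not p.exists()]


missing = missing_index_files("sample_data/Ref/Homo_sapiens_assembly38.fasta")
if missing:
    print("Index incomplete; missing:", *missing, sep="\n  ")
else:
    print("Index files found; the indexing step can be skipped.")
```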

#### Download Sample Data

In [22]:
# Download whole-genome bisulfite sequencing (WGBS) data from the ENCODE project
%cd sample_data/Data
!wget https://www.encodeproject.org/files/ENCFF567DAI/@@download/ENCFF567DAI.fastq.gz
%cd ../..

/tutorial/sample_data
--2024-10-11 05:46:17--  https://www.encodeproject.org/files/ENCFF567DAI/@@download/ENCFF567DAI.fastq.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2015/07/21/94ea1db4-87c0-4737-a372-9a274415408d/ENCFF567DAI.fastq.gz?response-content-disposition=attachment%3B%20filename%3DENCFF567DAI.fastq.gz&AWSAccessKeyId=ASIATGZNGCNXVROVIGBF&Signature=5cdTGLDIOFTtGWBNlJyX%2BIrgZLQ%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEC0aCXVzLXdlc3QtMiJHMEUCIHKQn0LGFIdPrllZOrrYdU%2BFJek1u7ZcwFtGZkqyh16xAiEAqxmfX9rrK2e%2B1ArUhW5%2BwiJCbNv3bIsnjzXROTmzcukqvAUIhv%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgwyMjA3NDg3MTQ4NjMiDCF4QxZd%2FRBvblapxyqQBXm5RHJt5Dh4kKz0P6Ik1LiD%2F8MEGl8QO%2F%2Bv65%2FzvjEczvdenhApthGar5oObVR9vZSxde6RAi84z1mNGAL0r8HSzcpspQkWacTjO5Kz%2FOWbUAmRF4zlM

In [7]:
# Download a reduced representation bisulfite sequencing (RRBS) sample (K562) from ENCODE
%cd sample_data/Data
!wget https://www.encodeproject.org/files/ENCFF000MHC/@@download/ENCFF000MHC.fastq.gz
%cd ../..

--2024-10-18 01:58:53--  https://www.encodeproject.org/files/ENCFF000MHC/@@download/ENCFF000MHC.fastq.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2011/08/31/c9e108e1-b249-4373-b443-9ebd6ce8030e/ENCFF000MHC.fastq.gz?response-content-disposition=attachment%3B%20filename%3DENCFF000MHC.fastq.gz&AWSAccessKeyId=ASIATGZNGCNXZATMDXPK&Signature=uM6%2B5Zcc9AgfhoOcy9IHxrKQuRA%3D&x-amz-security-token=IQoJb3JpZ2luX2VjENL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIQCSneceynGxfehrOCewQ7cFdX8bYRioZJl6IVp134ZiiwIgVMLYYCNXI4u0u8ikPmr%2FPsmzq4RAW57m2JlYK2CSMJcqswUIOxAAGgwyMjA3NDg3MTQ4NjMiDPXiwdYuOtuV4qeyDSqQBVnOmL8LWwBTI%2FLiZYbBUmCqQwcG4LHu5%2F%2B0JQ5W7QvCrhX%2BbICsY8e2BpFia9axzJXLsX7zaJdIsuPnnLXRcUzOlcfvZArpCEGIC9SAc62EnxyoFfMNbbku7j59%2B7%2FaZm5wzXWl4ISQkEq

#### GPU Monitoring

In [7]:
!nvidia-smi

Fri Oct 11 03:09:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB-N         On  | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0              44W / 160W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# To monitor GPU usage live, run the command below in a terminal:
# watch -n 0.5 nvidia-smi

#### Run Alignment: fq2bam_meth

I am using a V100 GPU with 16 GB of GPU memory, so I added the `--low-memory` option to reduce the GPU memory used by this job.
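
As a rough sketch, the decision to add `--low-memory` could be derived from the GPU memory that `nvidia-smi` reports. The 16 GiB cutoff below is an assumption, chosen only so that the V100 used here qualifies; the function names are hypothetical, and the Parabricks documentation is the authority on actual memory requirements.

```python
import subprocess

# Assumed cutoff: treat GPUs with at most 16 GiB as needing --low-memory.
LOW_MEMORY_THRESHOLD_MIB = 16 * 1024


def parse_memory_mib(nvidia_smi_csv):
    """Parse per-GPU total memory (MiB) from nvidia-smi CSV output, one value per line."""
    return [int(line.strip()) for line in nvidia_smi_csv.splitlines() if line.strip()]


def needs_low_memory(memory_mib, threshold_mib=LOW_MEMORY_THRESHOLD_MIB):
    """Return True if any GPU has memory at or below the threshold."""
    return any(m <= threshold_mib for m in memory_mib)


def query_gpu_memory():
    """Ask nvidia-smi for the total memory of each GPU (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_mib(result.stdout)


# Value taken from the nvidia-smi table above (16384 MiB on the V100).
print(needs_low_memory([16384]))
# On a machine with a GPU: needs_low_memory(query_gpu_memory())
```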

In [32]:
!pbrun fq2bam_meth \
      --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-se-fq sample_data/Data/ENCFF567DAI.fastq.gz \
      --out-bam outputdir/fq2bam_meth_output.bam \
      --logfile fq2bam_meth.log \
      --num-gpus 1 \
      --bwa-nstreams 1 \
      --memory-limit 16 \
      --low-memory

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation



[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /tutorial/sample_data/Data/ENCFF567DAI.fastq.gz
[Parabricks Options Mesg]: @RG\tID:C6EGJANXX.4\tLB:lib1\tPL:bar\tSM:sample\tPU:C6EGJANXX.4
[Parabricks Options Mesg]: Using --low-memory reduces the number of reads sent to GPU per batch in fq2bam_meth.
[Parabricks Options Mesg]: Using --low-memory sets the number of streams in bwa mem to 1.
[PB Info 2024-Oct-19 14:42:41] ------------------------------------------------------------------------------
[PB Info 2024-Oct-19 14:42:41] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Oct-19 14:42:41] ||                              Version 4.3.2-1                             ||
[PB Info 2024-Oct-19 14:42:41] ||                      GPU-PBBWA mem, Sor

In [None]:
!ls -lh outputdir