# Parabricks Hands-On Workshop

## Germline Workflow

#### Download the Sample Data

We will download the reference human genome, its index files, a known variants file, and 
a 30x whole-genome sequencing sample for germline analysis.
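
Before starting the download, it may help to confirm that the working directory has enough free disk space. The following is a minimal sketch that uses the sizes quoted in the download cell (9.3 GB for the archive plus about 14 GB once extracted); the function name `has_enough_space` is an illustrative assumption, not part of Parabricks.

```python
import shutil

# Sizes quoted in the download cell: 9.3 GB archive plus about 14 GB extracted.
ARCHIVE_GB = 9.3
EXTRACTED_GB = 14.0


def has_enough_space(path=".", required_gb=ARCHIVE_GB + EXTRACTED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    print("Enough free space:", has_enough_space())
```

The archive can be deleted after extraction to recover the 9.3 GB it occupies.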

In [None]:
# The tar file is 9.3 GB; extracting it requires an additional 14 GB of disk space
!mkdir -p sample_data
%cd sample_data
!wget -O parabricks_sample.tar.gz "https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"
!tar xvf parabricks_sample.tar.gz
!mv parabricks_sample/* .
%cd ..

In [5]:
!ls sample_data/Data

markdup_input.bam  sample_1.fq.gz  sample_2.fq.gz  single_ended.bam


In [6]:
!ls sample_data/Ref

Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.amb
Homo_sapiens_assembly38.fasta.ann
Homo_sapiens_assembly38.fasta.bwt
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.pac
Homo_sapiens_assembly38.fasta.sa
Homo_sapiens_assembly38.known_indels.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz.tbi


In [11]:
!mkdir -p outputdir

#### GPU Monitoring

In [7]:
!nvidia-smi

Fri Oct 11 03:09:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB-N         On  | 00000000:0A:00.0 Off |                    0 |
| N/A   35C    P0              44W / 160W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# To watch GPU utilization while a job is running, run this command in a separate terminal:
# watch -n 0.5 nvidia-smi
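
If you would rather poll the GPU from inside the notebook than from a terminal, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below assumes every queried field is numeric (some GPUs report `[N/A]` for certain fields, which this parser does not handle); the helper names are illustrative.

```python
import subprocess

# Fields to request from nvidia-smi, in the order they are parsed below.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line such as '35, 1024, 16384' into a dict of integers."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def query_gpus():
    """Call nvidia-smi once and return one dict per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpus()` in a loop with `time.sleep(0.5)` gives roughly the same view as the `watch` command.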

#### Run Germline Workflow: GATK (BWA-MEM + HaplotypeCaller)

This workflow runs the GATK germline pipeline in a single command: read alignment with BWA-MEM followed by variant calling with HaplotypeCaller. Besides the VCF file, it can also write the intermediate BAM file and the BQSR report when the corresponding output options are given. In this workshop we are limited to a single V100 GPU with 16 GB of memory, so the `--low-memory` option is used.

In [27]:
!pbrun germline -h

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation


usage: pbrun germline <options>
Help: pbrun germline -h

Run Germline pipeline to convert FASTQ to VCF.

options:
  -h, --help            show this help message and exit

Input Output file options:
  Options for Input and Output files for this tool.

  --ref REF             Path to the reference file. (default: None)
  --in-fq [IN_FQ ...]   Path to the pair-ended FASTQ files followed by
                        optional read groups with quotes (Example:
                        "@RG\tID:foo\tLB:lib1\tPL:bar\tSM:sample\tPU:foo").
                        The files must be in fastq or fastq.gz format. All
                        sets of inputs should have a read group; otherwise,
                        none should have a read group, and it will be
                        automatically added by the pipeline. This option can
                        be repeated multiple times. Example 1: --in-fq
              

In [13]:
!pbrun germline \
    --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-fq sample_data/Data/sample_1.fq.gz sample_data/Data/sample_2.fq.gz \
    --knownSites sample_data/Ref/Homo_sapiens_assembly38.known_indels.vcf.gz \
    --out-bam outputdir/germline_out.bam \
    --out-variants outputdir/variants.vcf \
    --out-recal-file outputdir/recal.table \
    --low-memory   

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation


[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /tutorial/sample_data/Data/sample_1.fq.gz and
/tutorial/sample_data/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1


[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Set --bwa-options="-K #" to produce compatible pair-ended results with previous versions of
fq2bam or BWA MEM.
[Parabricks Options Mesg]: Read group created for /tutorial/sample_data/Data/sample_1.fq.gz and
/tutorial/sample_data/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[Parabricks Options Mesg]: Using --low-memory sets the number of streams in bwa mem to 1.
[PB Info 2024-Oct-11 03:26:21] ----------------------------------------------------------------------------
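
The log above shows the read group that Parabricks generated automatically. As a small illustration of how that string is structured, the sketch below splits it into its tags; the function `parse_read_group` is a hypothetical helper, and it assumes the string uses the literal `\t` separators printed in the log.

```python
def parse_read_group(rg):
    """Split an @RG header string into a {tag: value} dict."""
    # The log prints a literal backslash followed by 't'; a real SAM header uses tabs.
    fields = rg.replace("\\t", "\t").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not a read group line")
    return dict(field.split(":", 1) for field in fields[1:])


rg = parse_read_group(r"@RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1")
print(rg["ID"], rg["SM"])  # HK3TJBCX2.1 sample
```

The `SM` tag is the sample name that ends up as the sample column in the output VCF, so it is worth passing an explicit read group to `--in-fq` when real sample names matter.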

In [28]:
!ls outputdir

germline_out.bam       recal.table	  variants_dv.g.vcf.idx
germline_out.bam.bai   variants.vcf	  variants_dv.vcf
germline_out_chrs.txt  variants_dv.g.vcf


#### Run Germline Workflow: BWA-MEM + DeepVariant

The germline workflow is also available with DeepVariant as the variant caller, using the command below.

In [23]:
!pbrun deepvariant_germline \
    --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-fq sample_data/Data/sample_1.fq.gz sample_data/Data/sample_2.fq.gz \
    --out-bam outputdir/dvgermline_out.bam \
    --out-variants outputdir/variants_dv.vcf \
    --low-memory    

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation


[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /results/sample_data/Data/sample_1.fq.gz and
/results/sample_data/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1


[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Set --bwa-options="-K #" to produce compatible pair-ended results with previous versions of
fq2bam or BWA MEM.
[Parabricks Options Mesg]: Read group created for /results/sample_data/Data/sample_1.fq.gz and
/results/sample_data/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[Parabricks Options Mesg]: Using --low-memory sets the number of streams in bwa mem to 1.
[PB Info 2024-Oct-09 08:32:48] ------------------------------------------------------------------------------
[

### Alignment: FASTQ to BAM

The alignment and variant calling steps of the germline workflow can also be run separately with the commands below. As with `--low-memory` above, `--bwa-nstreams` is set to `1` because we are limited to a single V100 GPU. With more GPU memory, you can run more streams (e.g., 16) to shorten the run time.

In [13]:
!pbrun fq2bam \
      --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-fq sample_data/Data/sample_1.fq.gz sample_data/Data/sample_2.fq.gz \
      --out-bam outputdir/fq2bam_output.bam \
      --num-gpus 1 \
      --bwa-nstreams 1 

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation



[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Set --bwa-options="-K #" to produce compatible pair-ended results with previous versions of
fq2bam or BWA MEM.
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /results/sample_data/Data/sample_1.fq.gz and
/results/sample_data/Data/sample_2.fq.gz
[Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1
[PB Info 2024-Oct-09 07:19:33] ------------------------------------------------------------------------------
[PB Info 2024-Oct-09 07:19:33] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Oct-09 07:19:33] ||                              Version 4.3.2-1                             ||
[PB Info 2024-Oct-09 07:19:33] ||                      GPU-PBBWA mem, Sorting Phase-I              

### Variant Calling: BAM to VCF

#### GATK HaplotypeCaller

In [11]:
!pbrun haplotypecaller -h

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

usage: pbrun haplotypecaller <options>
Help: pbrun haplotypecaller -h

Run HaplotypeCaller to convert BAM/CRAM to VCF.

optional arguments:
  -h, --help            show this help message and exit

Input Output file options:
  Options for Input and Output files for this tool:

  --ref REF             Path to the reference file. (default: None)
  --in-bam IN_BAM       Path to the input BAM/CRAM file for variant calling.
                        The argument may also be a local folder containing
                        several bams; each will be processed by 1 GPU in batch
                        mode. (default: None)
  --in-recal-file IN_RECAL_FILE
                        Path to the input BQSR report. (default: None)
  --interval-file INTERVAL_FILE
                        Path to an interval file in one of these formats:
                        Picard-style (.interval_list or .picard), GATK-style
         

- vcf

In [16]:
!pbrun haplotypecaller \
      --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-bam outputdir/fq2bam_output.bam \
      --out-variants outputdir/variants_gatk.vcf \
      --num-gpus 1 \
      --htvc-low-memory

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

/usr/local/parabricks/binaries/bin/htvc /results/sample_data/Ref/Homo_sapiens_assembly38.fasta /results/outputdir/fq2bam_output.bam 1 -o /results/outputdir/variants_gatk.vcf -nt 5 --low-memory
[PB Info 2024-Oct-09 07:40:25] ------------------------------------------------------------------------------
[PB Info 2024-Oct-09 07:40:25] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Oct-09 07:40:25] ||                              Version 4.3.2-1                             ||
[PB Info 2024-Oct-09 07:40:25] ||                         GPU-GATK4 HaplotypeCaller                        ||
[PB Info 2024-Oct-09 07:40:25] ------------------------------------------------------------------------------
[PB Info 2024-Oct-09 07:40:48] 0 /results/outputdir/fq2bam_output.bam/results/outputdir/variants_gatk.vcf
[PB Info 2024-Oct-09 07:40:48] ProgressMeter -	Current-Locus	Elapsed

- gvcf

For population studies, you may want to generate GVCF (`.g.vcf`) files to be merged for joint genotyping in later steps.

In [20]:
!pbrun haplotypecaller \
      --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-bam outputdir/fq2bam_output.bam \
      --gvcf \
      --out-variants outputdir/variants_gatk.g.vcf \
      --num-gpus 1 \
      --htvc-low-memory

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

/usr/local/parabricks/binaries/bin/htvc /results/sample_data/Ref/Homo_sapiens_assembly38.fasta /results/outputdir/fq2bam_output.bam 1 -o /results/outputdir/variants_gatk.g.vcf -nt 5 -g --low-memory
[PB Info 2024-Oct-09 08:25:04] ------------------------------------------------------------------------------
[PB Info 2024-Oct-09 08:25:04] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Oct-09 08:25:04] ||                              Version 4.3.2-1                             ||
[PB Info 2024-Oct-09 08:25:04] ||                         GPU-GATK4 HaplotypeCaller                        ||
[PB Info 2024-Oct-09 08:25:04] ------------------------------------------------------------------------------
[PB Info 2024-Oct-09 08:25:27] 0 /results/outputdir/fq2bam_output.bam/results/outputdir/variants_gatk.g.vcf
[PB Info 2024-Oct-09 08:25:27] ProgressMeter -	Current-Locus	

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. (https://gatk.broadinstitute.org/hc/en-us/articles/360035531812-GVCF-Genomic-Variant-Call-Format)
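
To make the difference concrete, the sketch below counts reference blocks and variant records in GVCF text. It assumes the GATK-style convention in which a reference block lists only the symbolic `<NON_REF>` allele in the ALT column; the two data records are made-up examples, not lines from the workshop output.

```python
def classify_gvcf_record(line):
    """Label one GVCF data line as 'ref_block' or 'variant'."""
    cols = line.rstrip("\n").split("\t")
    alt = cols[4]
    # In GATK-style GVCFs a reference block lists only the symbolic <NON_REF> allele.
    return "ref_block" if alt == "<NON_REF>" else "variant"


def summarize_gvcf(lines):
    """Count reference blocks and variant records, skipping header lines."""
    counts = {"ref_block": 0, "variant": 0}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[classify_gvcf_record(line)] += 1
    return counts


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr1\t10001\t.\tT\t<NON_REF>\t.\t.\tEND=10050\tGT:DP:GQ:MIN_DP:PL\t0/0:30:90:28:0,90,1350",
    "chr1\t10051\t.\tA\tG,<NON_REF>\t500.0\t.\tDP=30\tGT:AD:DP:GQ:PL\t0/1:15,15,0:30:99:528,0,540,573,585,1158",
]
print(summarize_gvcf(example))  # {'ref_block': 1, 'variant': 1}
```

To apply it to a real file, pass an open file handle, for example `summarize_gvcf(open("outputdir/variants_gatk.g.vcf"))`. The reference blocks are what make a GVCF much larger than the corresponding VCF.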

In [15]:
!ls -lh outputdir/

total 6.9G
drwxr-xr-x. 2 root root  225 Nov 29 07:45 .
drwxr-xr-x. 5 root root  216 Nov 29 07:46 ..
-rw-r--r--. 1 root root 4.5G Nov 29 07:05 fq2bam_output.bam
-rw-r--r--. 1 root root 6.6M Nov 29 07:05 fq2bam_output.bam.bai
-rw-r--r--. 1 root root  86K Nov 29 07:05 fq2bam_output_chrs.txt
-rw-r--r--. 1 root root 1.6G Nov 29 07:45 variants_gatk.gvcf
-rw-r--r--. 1 root root 389K Nov 29 07:45 variants_gatk.gvcf.idx
-rw-r--r--. 1 root root  23M Nov 29 07:27 variants_gatk.vcf


#### DeepVariant

In [16]:
!pbrun deepvariant -h

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

usage: pbrun deepvariant <options>
Help: pbrun deepvariant -h

Run DeepVariant to convert BAM/CRAM to VCF.

optional arguments:
  -h, --help            show this help message and exit

Input Output file options:
  Options for Input and Output files for this tool:

  --ref REF             Path to the reference file. (default: None)
  --in-bam IN_BAM       Path to the input BAM/CRAM file for variant calling.
                        (default: None)
  --interval-file INTERVAL_FILE
                        Path to a BED file (.bed) for selective access. This
                        option can be used multiple times. (default: None)
  --out-variants OUT_VARIANTS
                        Path of the vcf/g.vcf/g.vcf.gz file after variant
                        calling. (default: None)
  --pb-model-file PB_MODEL_FILE
                        Path to a non-default parabricks model file for
                        de

- vcf

In [18]:
!pbrun deepvariant \
    --ref /workdir/sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-bam /workdir/outputdir/fq2bam_output.bam \
    --out-variants /workdir/outputdir/variants_dv.vcf \
    --num-streams-per-gpu 2 \
    --run-partition \
    --gpu-num-per-partition 1 \
    --num-gpus 1

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

Detected 1 CUDA Capable device(s), considering 1 device(s)
  CUDA Driver Version / Runtime Version          12.2 / 12.0
Using model for CUDA Capability Major/Minor version number:    89
CUDA_VISIBLE_DEVICES=0 /usr/local/parabricks/binaries//bin/deepvariant /workdir/sample_data/Ref/Homo_sapiens_assembly38.fasta /workdir/outputdir/fq2bam_output.bam 1 2 -o 0.vcf  -L "chr1" -L "chr2" -L "chr3" -L "chr4" -L "chr5" -L "chr6" -L "chr7" -L "chrX" -L "chr8" -L "chr9" -L "chr11" -L "chr10" -L "chr12" -L "chr13" -L "chr14" -L "chr15" -L "chr16" -L "chr17" -L "chr18" -L "chr20" -L "chr19" -L "chrY" -L "chr22" -L "chr21" -L "chrM" -n 6 --model /usr/local/parabricks/binaries//model/80+/shortread/deepvariant.eng --channel_insert_size --pileup_image_width 221 --max_reads_per_partition 1500 --partition_size 1000 --vsc_min_count_snps 2 --vsc_min_count_indels 2 --vsc_min_fraction_snps 0.12 --min_mapping_quality 5 --min_bas

DeepVariant from Parabricks can use multiple streams on a GPU. The number of streams that can be used depends on the available resources. The default is two streams, which can be increased up to a maximum of six for better performance. The optimal number has to be found by experimenting on your own system. (https://docs.nvidia.com/clara/parabricks/4.1.0/bestperformance.html#best-performance-for-deepvariant)
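
One way to organize that experiment is to generate one command per stream count and time each run. The sketch below only builds the commands, using flags that already appear in this notebook; the helper names and the per-run output file names are assumptions for illustration.

```python
import shlex


def deepvariant_cmd(ref, bam, out_vcf, streams, gpus=1):
    """Build a `pbrun deepvariant` argument list using only flags shown in this notebook."""
    return [
        "pbrun", "deepvariant",
        "--ref", ref,
        "--in-bam", bam,
        "--out-variants", out_vcf,
        "--num-streams-per-gpu", str(streams),
        "--num-gpus", str(gpus),
    ]


def sweep(ref, bam, outdir, stream_counts=(1, 2, 4, 6)):
    """Return one command per stream count, each writing to its own VCF."""
    return [
        deepvariant_cmd(ref, bam, f"{outdir}/variants_dv_s{n}.vcf", n)
        for n in stream_counts
    ]


commands = sweep(
    "sample_data/Ref/Homo_sapiens_assembly38.fasta",
    "outputdir/fq2bam_output.bam",
    "outputdir",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run` and wrapped with `time.perf_counter()` to compare run times. On a 16 GB GPU the higher stream counts may run out of memory, so expect some runs to fail.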

In [21]:
!pbrun deepvariant \
    --ref /workdir/sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-bam /workdir/outputdir/fq2bam_output.bam \
    --out-variants /workdir/outputdir/variants_dv.vcf \
    --num-streams-per-gpu 4 \
    --run-partition \
    --gpu-num-per-partition 1 \
    --num-gpus 1

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

Detected 1 CUDA Capable device(s), considering 1 device(s)
  CUDA Driver Version / Runtime Version          12.2 / 12.0
Using model for CUDA Capability Major/Minor version number:    89
CUDA_VISIBLE_DEVICES=0 /usr/local/parabricks/binaries//bin/deepvariant /workdir/sample_data/Ref/Homo_sapiens_assembly38.fasta /workdir/outputdir/fq2bam_output.bam 1 4 -o 0.vcf  -L "chr1" -L "chr2" -L "chr3" -L "chr4" -L "chr5" -L "chr6" -L "chr7" -L "chrX" -L "chr8" -L "chr9" -L "chr11" -L "chr10" -L "chr12" -L "chr13" -L "chr14" -L "chr15" -L "chr16" -L "chr17" -L "chr18" -L "chr20" -L "chr19" -L "chrY" -L "chr22" -L "chr21" -L "chrM" -n 6 --model /usr/local/parabricks/binaries//model/80+/shortread/deepvariant.eng --channel_insert_size --pileup_image_width 221 --max_reads_per_partition 1500 --partition_size 1000 --vsc_min_count_snps 2 --vsc_min_count_indels 2 --vsc_min_fraction_snps 0.12 --min_mapping_quality 5 --min_bas

Using the `--run-partition`, `--proposed-variants`, and `--gvcf` options at the same time will lead to a substantial slowdown.

- gvcf

In [15]:
!pbrun deepvariant \
    --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
    --in-bam outputdir/germline_out.bam \
    --out-variants outputdir/variants_dv.g.vcf \
    --num-streams-per-gpu 1 \
    --gvcf \
    --num-gpus 1

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation

Detected 1 CUDA Capable device(s), considering 1 device(s)
  CUDA Driver Version / Runtime Version          12.3 / 12.3
Using model for CUDA Capability Major/Minor version number:    70
/usr/local/parabricks/binaries/bin/deepvariant /tutorial/sample_data/Ref/Homo_sapiens_assembly38.fasta /tutorial/outputdir/germline_out.bam 1 1 -o /tutorial/outputdir/variants_dv.g.vcf -n 6 --model /usr/local/parabricks/binaries/model/70/shortread/deepvariant.eng -g --channel_insert_size --pileup_image_width 221 --max_reads_per_partition 1500 --partition_size 1000 --vsc_min_count_snps 2 --vsc_min_count_indels 2 --vsc_min_fraction_snps 0.12 --min_mapping_quality 5 --min_base_quality 10 --alt_aligned_pileup none --variant_caller VERY_SENSITIVE_CALLER --dbg_min_base_quality 15 --ws_min_windows_distance 80 --aux_fields_to_keep HP --p_error 0.001 --max_ins_size 10
[PB Info 2024-Oct-11 03:53:55] --------------------------------

## Methylation Workflow

In DNA methylation analysis with bisulfite sequencing, unmethylated cytosines (C) are converted to uracil and are read as thymine (T) after amplification. Methylated cytosines (mC) are protected from conversion and are still read as cytosine (C). Here we perform the alignment step, which is the most time-consuming step in the analysis.
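
As a toy illustration of the chemistry described above, the sketch below applies bisulfite conversion to a short sequence. The function `bisulfite_convert` and the example sequence are assumptions for illustration; it models only the forward strand and ignores incomplete conversion.

```python
def bisulfite_convert(seq, methylated=()):
    """Return `seq` as it would be read after bisulfite treatment.

    `methylated` holds 0-based positions of methylated cytosines, which stay 'C';
    every other 'C' is reported as 'T'.
    """
    protected = set(methylated)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


print(bisulfite_convert("ACGTCCGA", methylated=[1]))  # ACGTTTGA
```

Comparing the converted read with the original sequence shows which cytosines were methylated: positions that still read `C` were protected.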

#### Download Sample Data

In [22]:
# Download whole-genome bisulfite sequencing (WGBS) data from the ENCODE project
%cd sample_data/Data
!wget https://www.encodeproject.org/files/ENCFF567DAI/@@download/ENCFF567DAI.fastq.gz
%cd ../..

/tutorial/sample_data
--2024-10-11 05:46:17--  https://www.encodeproject.org/files/ENCFF567DAI/@@download/ENCFF567DAI.fastq.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2015/07/21/94ea1db4-87c0-4737-a372-9a274415408d/ENCFF567DAI.fastq.gz?response-content-disposition=attachment%3B%20filename%3DENCFF567DAI.fastq.gz&AWSAccessKeyId=ASIATGZNGCNXVROVIGBF&Signature=5cdTGLDIOFTtGWBNlJyX%2BIrgZLQ%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEC0aCXVzLXdlc3QtMiJHMEUCIHKQn0LGFIdPrllZOrrYdU%2BFJek1u7ZcwFtGZkqyh16xAiEAqxmfX9rrK2e%2B1ArUhW5%2BwiJCbNv3bIsnjzXROTmzcukqvAUIhv%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgwyMjA3NDg3MTQ4NjMiDCF4QxZd%2FRBvblapxyqQBXm5RHJt5Dh4kKz0P6Ik1LiD%2F8MEGl8QO%2F%2Bv65%2FzvjEczvdenhApthGar5oObVR9vZSxde6RAi84z1mNGAL0r8HSzcpspQkWacTjO5Kz%2FOWbUAmRF4zlM

In [7]:
# Download a reduced representation bisulfite sequencing (RRBS) sample (K562) from ENCODE
%cd sample_data/Data
!wget https://www.encodeproject.org/files/ENCFF000MHC/@@download/ENCFF000MHC.fastq.gz
%cd ../..

--2024-10-18 01:58:53--  https://www.encodeproject.org/files/ENCFF000MHC/@@download/ENCFF000MHC.fastq.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2011/08/31/c9e108e1-b249-4373-b443-9ebd6ce8030e/ENCFF000MHC.fastq.gz?response-content-disposition=attachment%3B%20filename%3DENCFF000MHC.fastq.gz&AWSAccessKeyId=ASIATGZNGCNXZATMDXPK&Signature=uM6%2B5Zcc9AgfhoOcy9IHxrKQuRA%3D&x-amz-security-token=IQoJb3JpZ2luX2VjENL%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIQCSneceynGxfehrOCewQ7cFdX8bYRioZJl6IVp134ZiiwIgVMLYYCNXI4u0u8ikPmr%2FPsmzq4RAW57m2JlYK2CSMJcqswUIOxAAGgwyMjA3NDg3MTQ4NjMiDPXiwdYuOtuV4qeyDSqQBVnOmL8LWwBTI%2FLiZYbBUmCqQwcG4LHu5%2F%2B0JQ5W7QvCrhX%2BbICsY8e2BpFia9axzJXLsX7zaJdIsuPnnLXRcUzOlcfvZArpCEGIC9SAc62EnxyoFfMNbbku7j59%2B7%2FaZm5wzXWl4ISQkEq

#### Generate Reference Genome Index

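For bisulfite sequencing, the reference genome needs to be converted in silico before indexing: 'C' is turned into 'T', and 'G' into 'A' (the equivalent conversion seen from the reverse strand). Here we download `bwameth.py`, which performs this conversion and builds the BWA index.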
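
The idea behind the converted reference can be sketched in a few lines. This is a conceptual illustration only and does not reproduce how `bwameth.py` lays out its index files; the function names, the example reference, and the example read are hypothetical.

```python
def c2t(seq):
    """Forward-strand conversion: every C becomes T."""
    return seq.upper().replace("C", "T")


def g2a(seq):
    """G-to-A conversion: how C-to-T on the reverse strand looks in forward coordinates."""
    return seq.upper().replace("G", "A")


ref = "ACGTCCGA"
print(c2t(ref))  # ATGTTTGA
print(g2a(ref))  # ACATCCAA

# A hypothetical bisulfite read from this locus in which one methylated C survived.
read = "ACGTTTGA"
# After the same C-to-T conversion, the read matches the converted reference exactly.
print(c2t(read) == c2t(ref))  # True
```

Because both the reads and the reference are reduced to the same three-letter alphabet, methylation status no longer causes mismatches during alignment; it is recovered afterwards by comparing the original read bases with the original reference.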

In [48]:
!wget https://raw.githubusercontent.com/brentp/bwa-meth/v0.2.7/bwameth.py

--2024-10-11 08:44:16--  https://raw.githubusercontent.com/brentp/bwa-meth/v0.2.7/bwameth.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19560 (19K) [text/plain]
Saving to: ‘bwameth.py’


2024-10-11 08:44:16 (13.4 MB/s) - ‘bwameth.py’ saved [19560/19560]



Install the tools that `bwameth.py` depends on:

In [2]:
!pip install toolshed
!apt install -y bwa

Collecting toolshed
  Downloading toolshed-0.4.6.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: toolshed
  Building wheel for toolshed (setup.py) ... done
  Created wheel for toolshed: filename=toolshed-0.4.6-py3-none-any.whl size=9196 sha256=43eb47948f2af33b645a8b3a0bab1e540d3b52d5183555a3e9a12aef92470ab6
  Stored in directory: /root/.cache/pip/wheels/ee/f1/0f/4f83f90d39e7c7aed3aac15e04bf1847beaf1d2affb896fd8e
Successfully built toolshed
Installing collected packages: toolshed
Successfully installed toolshed-0.4.6
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  bwa
0 upgraded, 1 newly installed, 0 to remove and 12 not upgraded.
Need to get 195 kB of archives.
After this operation, 466 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 bwa amd64 0.7.17-6 [195 kB]
Fetch

In [4]:
!python3 bwameth.py index sample_data/Ref/Homo_sapiens_assembly38.fasta

already converted: c2t in sample_data/Ref/Homo_sapiens_assembly38.fasta to sample_data/Ref/Homo_sapiens_assembly38.fasta.bwameth.c2t
indexing with bwa-mem: sample_data/Ref/Homo_sapiens_assembly38.fasta.bwameth.c2t
[bwa_index] Pack FASTA... 72.55 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=12869387668, availableWord=917537264
[BWTIncConstructFromPacked] 10 iterations done. 99999988 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999988 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999988 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999988 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999988 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999988 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999988 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999988 characters proces

#### Run Alignment: fq2bam_meth

In [32]:
!pbrun fq2bam_meth \
      --ref sample_data/Ref/Homo_sapiens_assembly38.fasta \
      --in-se-fq sample_data/Data/ENCFF567DAI.fastq.gz \
      --out-bam outputdir/fq2bam_meth_output.bam \
      --logfile fq2bam_meth.log \
      --num-gpus 1 \
      --bwa-nstreams 1 \
      --memory-limit 16 \
      --low-memory

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation



[Parabricks Options Mesg]: Checking argument compatibility
[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /tutorial/sample_data/Data/ENCFF567DAI.fastq.gz
[Parabricks Options Mesg]: @RG\tID:C6EGJANXX.4\tLB:lib1\tPL:bar\tSM:sample\tPU:C6EGJANXX.4
[Parabricks Options Mesg]: Using --low-memory reduces the number of reads sent to GPU per batch in fq2bam_meth.
[Parabricks Options Mesg]: Using --low-memory sets the number of streams in bwa mem to 1.
[PB Info 2024-Oct-19 14:42:41] ------------------------------------------------------------------------------
[PB Info 2024-Oct-19 14:42:41] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2024-Oct-19 14:42:41] ||                              Version 4.3.2-1                             ||
[PB Info 2024-Oct-19 14:42:41] ||                      GPU-PBBWA mem, Sor

In [None]:
!ls -lh outputdir