# Parabricks Hands-On Workshop

## Tutorial 2: Bulk RNA Sequencing Analysis

Bulk RNA-seq is the most common method for gene expression analysis. In this exercise, we will demonstrate how to use the Parabricks `rna_fq2bam` tool to perform the alignment step of an RNA-seq analysis, which is the most time-consuming step. The output can then be used for normalization, quantification, or differential gene expression analysis.

## Prepare Example Data

We will use a sample dataset from the ENCODE Project. The sample is the NCI-H460 cell line, derived from a human large cell lung carcinoma. It was sequenced on an Illumina HiSeq 2000 in paired-end mode with a 2x100 bp read length. The compressed files are 5.44 GB and 5.66 GB for read 1 and read 2, respectively.

We also need a reference genome library, which can be obtained from GENCODE.

To save time for this workshop, I have transferred these files to TWCC Cloud Object Storage. We will download the data from there.

In [1]:
!mkdir -p rna_data
%cd rna_data
!wget http://cos.twcc.ai/pbworkshop/rna_sample/ENCFF114TXS.fastq.gz
!wget http://cos.twcc.ai/pbworkshop/rna_sample/ENCFF667GGC.fastq.gz
%cd ..

/home/yingja1227/rna_data
URL transformed to HTTPS due to an HSTS policy
--2025-09-03 12:21:39--  https://cos.twcc.ai/pbworkshop/rna_sample/ENCFF114TXS.fastq.gz
Resolving cos.twcc.ai (cos.twcc.ai)... 203.145.219.21
Connecting to cos.twcc.ai (cos.twcc.ai)|203.145.219.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5842623047 (5.4G) [binary/octet-stream]
Saving to: ‘ENCFF114TXS.fastq.gz’


2025-09-03 12:22:40 (91.5 MB/s) - ‘ENCFF114TXS.fastq.gz’ saved [5842623047/5842623047]

URL transformed to HTTPS due to an HSTS policy
--2025-09-03 12:22:41--  https://cos.twcc.ai/pbworkshop/rna_sample/ENCFF667GGC.fastq.gz
Resolving cos.twcc.ai (cos.twcc.ai)... 203.145.219.21
Connecting to cos.twcc.ai (cos.twcc.ai)|203.145.219.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6081103917 (5.7G) [binary/octet-stream]
Saving to: ‘ENCFF667GGC.fastq.gz’


2025-09-03 12:23:44 (91.3 MB/s) - ‘ENCFF667GGC.fastq.gz’ saved [6081103917/6081103917]

/home/yingja1227
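
Because each file is several gigabytes, an interrupted transfer is easy to miss. The following is a minimal sketch, not part of the original workshop, that compares the size of each file on disk with the byte count `wget` reported in the log above. The helper name `check_download` is hypothetical, and the paths assume the `rna_data` directory created in the previous cell.

```python
import os

# Byte counts taken from the "Length:" lines of the wget log above.
EXPECTED_SIZES = {
    "rna_data/ENCFF114TXS.fastq.gz": 5842623047,
    "rna_data/ENCFF667GGC.fastq.gz": 6081103917,
}


def check_download(path, expected_bytes):
    """Return 'missing', 'incomplete', or 'ok' for a downloaded file."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) != expected_bytes:
        return "incomplete"
    return "ok"


for path, size in EXPECTED_SIZES.items():
    print(path, check_download(path, size))
```

A size match does not prove the content is intact, but it catches truncated downloads cheaply. Once the files have been uncompressed, the `.gz` paths will be reported as missing.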


#### Prepare reference genome library index file

1. Download reference genome GRCh38 fasta file
2. Download GENCODE gene annotation GTF file
3. Use STAR (running on CPU) to generate the index file
   - Download and install STAR
   - Add STAR to the `PATH` environment variable
   - Run STAR to generate the genome index file

Step 3 runs on the CPU and took 25 min on 32 cores. To save time, we will download the pre-built result (genomelib.tar.gz) from storage and uncompress it.

##### Skip
The following three cells should be skipped. They are only here to show you how to generate the genome library if you need to do it yourself.

In [16]:
# Download STAR to make the genome index file for rna_fq2bam
!wget https://github.com/alexdobin/STAR/archive/2.7.2a.tar.gz
!tar -xzf 2.7.2a.tar.gz
%cd STAR-2.7.2a/source
!make STAR

--2025-08-28 16:50:26--  https://github.com/alexdobin/STAR/archive/2.7.2a.tar.gz
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/alexdobin/STAR/tar.gz/refs/tags/2.7.2a [following]
--2025-08-28 16:50:27--  https://codeload.github.com/alexdobin/STAR/tar.gz/refs/tags/2.7.2a
Resolving codeload.github.com (codeload.github.com)... 20.27.177.114
Connecting to codeload.github.com (codeload.github.com)|20.27.177.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘2.7.2a.tar.gz’

2.7.2a.tar.gz           [      <=>           ]   7.76M  6.49MB/s    in 1.2s    

2025-08-28 16:50:28 (6.49 MB/s) - ‘2.7.2a.tar.gz’ saved [8142240]

/home/yingja1227/STAR-2.7.2a/source
/bin/bash: make: command not found


NameError: name 'os' is not defined

In [23]:
import os

# Resolve the STAR binary directory (relative to the current working directory)
# and append it to PATH so that `STAR` can be called directly.
STAR_path = os.path.abspath("STAR-2.7.2a/bin/Linux_x86_64")
os.environ["PATH"] += os.pathsep + STAR_path
!echo $PATH

/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/yingja1227/.local/bin:/home/yingja1227/.local/bin:/home/yingja1227/bin::/home/yingja1227/STAR-2.7.2a/source:/home/yingja1227/STAR-2.7.2a/bin/Linux_x86_64
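
Appending a directory to `PATH` does not guarantee that the executable is actually there; the build output above ended with `make: command not found`. As a small sketch (the helper name `find_tool` is hypothetical), `shutil.which` can confirm whether `STAR` is resolvable before the long index-generation step is started.

```python
import shutil


def find_tool(name):
    """Return the absolute path of an executable found on PATH, or None."""
    return shutil.which(name)


star_path = find_tool("STAR")
if star_path:
    print("STAR found at", star_path)
else:
    print("STAR not found on PATH; check that the build step succeeded")
```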


In [2]:
%cd Ref
#!wget http://cos.twcc.ai/pbworkshop/Ref/Homo_sapiens_assembly38.fasta
!wget http://cos.twcc.ai/pbworkshop/Ref/gencode.v48.primary_assembly.annotation.gtf.gz
!gunzip gencode.v48.primary_assembly.annotation.gtf.gz
%cd ..

/home/yingja1227/Ref
URL transformed to HTTPS due to an HSTS policy
--2025-09-03 12:23:44--  https://cos.twcc.ai/pbworkshop/Ref/gencode.v48.primary_assembly.annotation.gtf.gz
Resolving cos.twcc.ai (cos.twcc.ai)... 203.145.219.21
Connecting to cos.twcc.ai (cos.twcc.ai)|203.145.219.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63662692 (61M) [application/gzip]
Saving to: ‘gencode.v48.primary_assembly.annotation.gtf.gz’


2025-09-03 12:23:45 (60.7 MB/s) - ‘gencode.v48.primary_assembly.annotation.gtf.gz’ saved [63662692/63662692]

gzip: gencode.v48.primary_assembly.annotation.gtf already exists; do you wish to overwrite (y or n)? ^C
/home/yingja1227


In [26]:
# Prepare the reference genome index file using STAR
# Took 25 min on a 32-core CPU
!STAR --runMode genomeGenerate \
     --genomeDir Ref/genomelib \
     --genomeFastaFiles Ref/Homo_sapiens_assembly38.fasta \
     --sjdbGTFfile Ref/gencode.v48.primary_assembly.annotation.gtf \
     --sjdbOverhang 100 \
     --runThreadN 32

Aug 28 18:35:15 ..... started STAR run
Aug 28 18:35:15 ... starting to generate Genome files
Aug 28 18:36:10 ... starting to sort Suffix Array. This may take a long time...
Aug 28 18:36:28 ... sorting Suffix Array chunks and saving them to disk...
Aug 28 19:12:57 ... loading chunks from disk, packing SA...
Aug 28 19:14:03 ... finished generating suffix array
Aug 28 19:14:03 ... generating Suffix Array index
Aug 28 19:17:08 ... completed Suffix Array index
Aug 28 19:17:08 ..... processing annotations GTF

Fatal INPUT FILE error, no exon lines in the GTF file: germline_data/Ref/gencode.v48.basic.annotation.gtf.gz
Solution: check the formatting of the GTF file, it must contain some lines with exon in the 3rd column.
          Make sure the GTF file is unzipped.
          If exons are marked with a different word, use --sjdbGTFfeatureExon .

Aug 28 19:17:09 ...... FATAL ERROR, exiting
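
The fatal error above is only raised after the suffix array has been built, roughly 40 minutes into the run, and the path in the message ends in `.gz`, which matches STAR's hint to make sure the GTF file is unzipped. A pre-check along the following lines could catch this before STAR is launched. This is a sketch under stated assumptions: `count_exon_lines` is a hypothetical helper, and the path is the one passed to `--sjdbGTFfile` in the cell above.

```python
import os


def count_exon_lines(gtf_path, limit=1000):
    """Count up to `limit` GTF records whose third column is 'exon'.

    Raises ValueError if the file is still gzip-compressed.
    """
    with open(gtf_path, "rb") as fh:
        if fh.read(2) == b"\x1f\x8b":  # gzip magic bytes
            raise ValueError(f"{gtf_path} is gzip-compressed; unzip it first")
    exons = 0
    with open(gtf_path, "r") as fh:
        for line in fh:
            if line.startswith("#"):  # skip header/comment lines
                continue
            fields = line.split("\t")
            if len(fields) >= 3 and fields[2] == "exon":
                exons += 1
                if exons >= limit:  # a few records are enough for a sanity check
                    break
    return exons


gtf = "Ref/gencode.v48.primary_assembly.annotation.gtf"
if os.path.exists(gtf):
    print(count_exon_lines(gtf), "exon records found (count is capped)")
else:
    print(gtf, "not found")
```

A result of zero would mean STAR is likely to stop with the same "no exon lines" error.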


##### Download genome library
Instead of running the steps above to generate the genome library, we will download the one that I pre-built from TWCC Cloud Object Storage (COS) and uncompress it.

In [4]:
%cd Ref
!wget http://cos.twcc.ai/pbworkshop/genomelib.tar.gz
!tar -xzvf genomelib.tar.gz
%cd ..

/home/yingja1227/Ref
URL transformed to HTTPS due to an HSTS policy
--2025-09-03 12:33:12--  https://cos.twcc.ai/pbworkshop/genomelib.tar.gz
Resolving cos.twcc.ai (cos.twcc.ai)... 203.145.219.21
Connecting to cos.twcc.ai (cos.twcc.ai)|203.145.219.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27915410991 (26G) [binary/octet-stream]
Saving to: ‘genomelib.tar.gz.1’


2025-09-03 12:38:46 (79.6 MB/s) - ‘genomelib.tar.gz.1’ saved [27915410991/27915410991]

genomelib/
genomelib/transcriptInfo.tab
genomelib/chrName.txt
genomelib/geneInfo.tab
genomelib/SA
genomelib/exonInfo.tab
genomelib/chrNameLength.txt
genomelib/Genome
genomelib/sjdbList.fromGTF.out.tab
genomelib/chrLength.txt
genomelib/chrStart.txt
genomelib/sjdbList.out.tab
genomelib/genomeParameters.txt
genomelib/sjdbInfo.txt
genomelib/SAindex
genomelib/exonGeTrInfo.tab
/home/yingja1227


In [3]:
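# Alternative location for the pre-built genome library index.
# Note: this URL returned 403 Forbidden in the run recorded below;
# use the download in the previous cell instead.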
%cd Ref
!wget http://cos.twcc.ai/refgenome/genomelib.tar.gz
!tar -xvzf genomelib.tar.gz
!ls
%cd ..

/home/yingja1227/Ref
URL transformed to HTTPS due to an HSTS policy
--2025-09-03 12:33:11--  https://cos.twcc.ai/refgenome/genomelib.tar.gz
Resolving cos.twcc.ai (cos.twcc.ai)... 203.145.219.21
Connecting to cos.twcc.ai (cos.twcc.ai)|203.145.219.21|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2025-09-03 12:33:11 ERROR 403: Forbidden.

genomelib/
genomelib/transcriptInfo.tab
^C
Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta
Homo_sapiens_assembly38.fasta.amb
Homo_sapiens_assembly38.fasta.ann
Homo_sapiens_assembly38.fasta.bwt
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.pac
Homo_sapiens_assembly38.fasta.sa
Homo_sapiens_assembly38.known_indels.vcf.gz
Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
ctat_genome_lib_build_dir
gencode.v48.primary_assembly.annotation.gtf
gencode.v48.primary_assembly.annotation.gtf.gz
genomelib
genomelib.tar.gz
/home/yingja1227


Create a directory for the output files.

In [5]:
!mkdir -p rna_output


## Run Parabricks for RNA Read Alignment
Running RNA alignment requires the following input files:
- Sample `fastq` files, which must be uncompressed first
- Reference genome library generated by STAR
- Reference genome `.fasta` file

You must also specify the output file path.
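
The download step leaves the reads as `.fastq.gz`, while the command in the next cell reads `.fastq` paths. One way to bridge that gap is sketched below using only the Python standard library; the helper name `gunzip_keep` is hypothetical, and running `gunzip` from the shell would work equally well.

```python
import gzip
import os
import shutil


def gunzip_keep(gz_path):
    """Decompress `gz_path` next to itself, keep the original, return the new path."""
    if not gz_path.endswith(".gz"):
        raise ValueError(f"expected a .gz file, got {gz_path}")
    out_path = gz_path[:-3]
    if not os.path.exists(out_path):  # skip work if already decompressed
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # streams in chunks, so memory use stays low
    return out_path


for name in ("ENCFF114TXS.fastq.gz", "ENCFF667GGC.fastq.gz"):
    gz = os.path.join("rna_data", name)
    if os.path.exists(gz):
        print("ready:", gunzip_keep(gz))
    else:
        print("not found:", gz)
```

Keeping the compressed originals costs extra disk space; delete them afterwards if space is tight.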

In [6]:
!pbrun rna_fq2bam \
    --in-fq rna_data/ENCFF114TXS.fastq rna_data/ENCFF667GGC.fastq \
    --genome-lib-dir Ref/genomelib \
    --output-dir rna_output \
    --ref Ref/Homo_sapiens_assembly38.fasta \
    --out-bam rna_output/rna_out.bam \
    --low-memory \
    --num-gpus 1

Please visit https://docs.nvidia.com/clara/#parabricks for detailed documentation



[Parabricks Options Mesg]: Automatically generating ID prefix
[Parabricks Options Mesg]: Read group created for /home/yingja1227/rna_data/ENCFF114TXS.fastq and
/home/yingja1227/rna_data/ENCFF667GGC.fastq
[Parabricks Options Mesg]: @RG\tID:C5KEEACXX.2\tLB:lib1\tPL:bar\tSM:sample\tPU:C5KEEACXX.2
[PB Info 2025-Sep-03 12:51:01] ------------------------------------------------------------------------------
[PB Info 2025-Sep-03 12:51:01] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2025-Sep-03 12:51:01] ||                              Version 4.4.0-1                             ||
[PB Info 2025-Sep-03 12:51:01] ||                                   star                                   ||
[PB Info 2025-Sep-03 12:51:01] ------------------------------------------------------------------------------
[PB Info 2025-Sep-03 12:51:01]  ..... started STAR run
[PB Info 2025-S

In [7]:
!ls rna_output

Chimeric.out.junction  Log.progress.out  rna_fq2bam.bam       rna_out.bam
Log.final.out	       SJ.out.tab	 rna_fq2bam.bam.bai   rna_out.bam.bai
Log.out		       fusion_output	 rna_fq2bam_chrs.txt  rna_out_chrs.txt
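
The listing above includes `Log.final.out`. Assuming Parabricks writes this file in STAR's usual `key | value` summary layout (its contents are not shown in this notebook, so treat that as an assumption), a short parser can pull out alignment statistics such as the uniquely mapped read percentage. The function name `parse_star_log` is hypothetical.

```python
import os


def parse_star_log(lines):
    """Parse 'key | value' lines from a STAR-style Log.final.out into a dict."""
    stats = {}
    for line in lines:
        if "|" not in line:  # section headers and blank lines carry no value
            continue
        key, value = line.split("|", 1)
        key, value = key.strip(), value.strip()
        if key and value:
            stats[key] = value
    return stats


log_path = "rna_output/Log.final.out"
if os.path.exists(log_path):
    with open(log_path) as fh:
        stats = parse_star_log(fh)
    # Key name as used in STAR's summary; adjust if the file differs.
    print("Uniquely mapped reads %:", stats.get("Uniquely mapped reads %", "n/a"))
else:
    print(log_path, "not found")
```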


#### Next Steps: Downstream Analysis
Now we are ready to run downstream analysis on the `.bam` file. Here are some possible next steps:
- Quantification: Count reads per gene or transcript with `featureCounts` or `htseq-count`.
- Normalization: TPM, FPKM/RPKM
- Differentially expressed genes (DEG): Compare perturbation or disease vs. normal with `DESeq2` or `edgeR`.

You can start with the `.bam` file to perform your own downstream analysis.
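
To make the normalization step concrete, here is a small worked sketch of the standard FPKM and TPM formulas. The gene names, counts, and lengths are made-up toy values, not results from this dataset; in practice the counts would come from a quantification tool such as `featureCounts`.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rates.values())
    return {g: rate / scale * 1e6 for g, rate in rates.items()}


# Hypothetical toy data: fragment counts and gene lengths in base pairs.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

for gene in counts:
    print(gene, round(fpkm(counts, lengths_bp)[gene]), round(tpm(counts, lengths_bp)[gene]))
```

In this toy case geneA and geneB receive the same value under both measures because geneB has three times the reads but is also three times as long. TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.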