# Bioinformatics of the Central Dogma - iPython Notebook

# Objective: Perform bioinformatic analysis of DNA sequencing data using command-line tools

## Day 1: Exome Sequencing Analysis

We will use exome sequencing data from a family trio (father, mother, and proband) to identify potentially disease-associated mutations.

## Step 1: Set up the environment

### Install the necessary bioinformatics tools using mamba (a conda-compatible package manager)

In [1]:
!mamba install -c bioconda fastqc multiqc bwa samtools bcftools freebayes=1.3.6 -y


Looking for: ['fastqc', 'multiqc', 'bwa', 'samtools', 'bcftools', 'freebayes=1.3.6']

## Step 2: Load the raw sequencing data

In [3]:
!mkdir -p working
%cd working

/home/timp/Code/bcmb_bootcamp/day5/assignments/working


In [4]:
# Download raw FASTQ files from Zenodo

!wget https://zenodo.org/record/3243160/files/father_R1.fq.gz
!wget https://zenodo.org/record/3243160/files/father_R2.fq.gz
!wget https://zenodo.org/record/3243160/files/mother_R1.fq.gz
!wget https://zenodo.org/record/3243160/files/mother_R2.fq.gz
!wget https://zenodo.org/record/3243160/files/proband_R1.fq.gz
!wget https://zenodo.org/record/3243160/files/proband_R2.fq.gz


--2024-08-29 21:40:18--  https://zenodo.org/record/3243160/files/father_R1.fq.gz
Resolving zenodo.org (zenodo.org)... 188.184.98.238, 188.184.103.159, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.98.238|:443... connected.
HTTP request sent, awaiting response... 301 MOVED PERMANENTLY
Location: /records/3243160/files/father_R1.fq.gz [following]
--2024-08-29 21:40:21--  https://zenodo.org/records/3243160/files/father_R1.fq.gz
Reusing existing connection to zenodo.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 150814720 (144M) [application/octet-stream]
Saving to: ‘father_R1.fq.gz’


2024-08-29 21:40:34 (11.0 MB/s) - ‘father_R1.fq.gz’ saved [150814720/150814720]

--2024-08-29 21:40:35--  https://zenodo.org/record/3243160/files/father_R2.fq.gz
Resolving zenodo.org (zenodo.org)... 188.184.103.159, 188.184.98.238, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.103.159|:443... connected.
HTTP request sent, awaiting response... 301 MOVED 
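
Before running any tools, it can help to look at what a FASTQ file contains. The following is a minimal sketch, assuming the standard four-line FASTQ record layout (header, sequence, `+` separator, quality string). The function names `read_fastq` and `count_reads` are made up for illustration, and the two demo records are synthetic, not taken from the downloaded files.

```python
import gzip
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a text-mode FASTQ handle.

    Assumes each record spans exactly four lines.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def count_reads(path):
    """Count records in a gzip-compressed FASTQ file such as father_R1.fq.gz."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in read_fastq(handle))


# Synthetic two-record example so the sketch runs without the real data.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n!!!!\n")
for name, sequence, quality in read_fastq(demo):
    print(name, sequence, quality)
```

With the downloads in place, `count_reads("father_R1.fq.gz")` and `count_reads("father_R2.fq.gz")` should return the same number, since the R1 and R2 files hold the two mates of each read pair.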

In [None]:
## Steps to download the reference genome and index it -- THIS WAS ALREADY RUN FOR YOU
# The data is large and the indexing takes a while, so we have already done this for you

!wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
!gunzip hg19.fa.gz
!bwa index hg19.fa


## Step 3: Quality Control

In [5]:
# Run FastQC to evaluate the quality of the raw data
!fastqc father_R1.fq.gz father_R2.fq.gz mother_R1.fq.gz mother_R2.fq.gz proband_R1.fq.gz proband_R2.fq.gz -o .

application/gzip
application/gzip
Started analysis of father_R1.fq.gz
application/gzip
application/gzip
application/gzip
application/gzip
Approx 5% complete for father_R1.fq.gz
Approx 10% complete for father_R1.fq.gz
Approx 15% complete for father_R1.fq.gz
Approx 20% complete for father_R1.fq.gz
Approx 25% complete for father_R1.fq.gz
Approx 30% complete for father_R1.fq.gz
Approx 35% complete for father_R1.fq.gz
Approx 40% complete for father_R1.fq.gz
Approx 45% complete for father_R1.fq.gz
Approx 50% complete for father_R1.fq.gz
Approx 55% complete for father_R1.fq.gz
Approx 60% complete for father_R1.fq.gz
Approx 65% complete for father_R1.fq.gz
Approx 70% complete for father_R1.fq.gz
Approx 75% complete for father_R1.fq.gz
Approx 80% complete for father_R1.fq.gz
Approx 85% complete for father_R1.fq.gz
Approx 90% complete for father_R1.fq.gz
Approx 95% complete for father_R1.fq.gz
Analysis complete for father_R1.fq.gz
Started analysis of father_R2.fq.gz
Approx 5% complete for father
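
The per-base quality plots in a FastQC report are built from the quality string of each read. As a rough sketch of how those characters translate into numbers, assuming the Phred+33 encoding used by current Illumina data (the helper names below are hypothetical):

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def mean_error_probability(quality_string, offset=33):
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality_string, offset)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)


print(phred_scores("I!5"))  # 'I' -> 40, '!' -> 0, '5' -> 20
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000.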

In [6]:
# Aggregate FastQC results using MultiQC
!multiqc .


/// MultiQC 🔍 v1.24.1

       file_search | Search path: /home/timp/Code/bcmb_bootcamp/day5/assignments/working
         searching | 100% 18/18
            fastqc | Found 6 reports
     write_results | Data        : multiqc_data
     write_results | Report      : multiqc_report.html
           multiqc | MultiQC complete


## Step 4: Alignment

In [None]:
# Map the sequencing data to the human reference genome (hg19) using BWA-MEM

In [None]:
!bwa mem -R '@RG\tID:000\tSM:father' /home/kwoyshn1/hg19_reference_genome/hg19.fa father_R1.fq.gz father_R2.fq.gz > father.sam

In [None]:
!bwa mem -R '@RG\tID:001\tSM:mother' /home/kwoyshn1/hg19_reference_genome/hg19.fa mother_R1.fq.gz mother_R2.fq.gz > mother.sam

In [None]:
!bwa mem -R '@RG\tID:002\tSM:proband' /home/kwoyshn1/hg19_reference_genome/hg19.fa proband_R1.fq.gz proband_R2.fq.gz > proband.sam
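
The three alignment cells differ only in the sample name and the read-group ID. The sketch below shows one way the same command lines could be generated in Python. The helper `bwa_mem_command` is hypothetical, and it only builds the command strings; it does not run `bwa`.

```python
import shlex

# Reference path as used in the cells above.
REFERENCE = "/home/kwoyshn1/hg19_reference_genome/hg19.fa"
SAMPLES = ["father", "mother", "proband"]


def bwa_mem_command(sample, index, reference=REFERENCE):
    """Build the bwa mem command line for one sample."""
    # The raw string keeps the literal backslash-t that bwa expects in -R.
    read_group = rf"@RG\tID:{index:03d}\tSM:{sample}"
    args = [
        "bwa", "mem", "-R", read_group, reference,
        f"{sample}_R1.fq.gz", f"{sample}_R2.fq.gz",
    ]
    return " ".join(shlex.quote(arg) for arg in args) + f" > {sample}.sam"


for index, sample in enumerate(SAMPLES):
    print(bwa_mem_command(sample, index))
```

The `SM` field of the read group matters later: it is the sample name that ends up labelling the per-sample columns of the VCF.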

In [None]:
# Convert SAM to BAM and sort BAM files

In [None]:
!samtools view -Sb father.sam | samtools sort -o father.bam

In [None]:
!samtools view -Sb mother.sam | samtools sort -o mother.bam

In [None]:
!samtools view -Sb proband.sam | samtools sort -o proband.bam

## Step 5: Filter Alignments

In [None]:
# Filter BAM files to retain only properly paired reads and remove duplicates

In [18]:
!samtools view -b -f 2 father.bam > father.filtered.bam

In [19]:
!samtools view -b -f 2 mother.bam > mother.filtered.bam

In [20]:
!samtools view -b -f 2 proband.bam > proband.filtered.bam

In [None]:
!samtools rmdup father.filtered.bam father.filtered.rmdup.bam

In [None]:
!samtools rmdup mother.filtered.bam mother.filtered.rmdup.bam

In [None]:
!samtools rmdup proband.filtered.bam proband.filtered.rmdup.bam
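
The `-f 2` option keeps only alignments whose FLAG field has bit `0x2` ("properly paired") set. The following sketch decodes a FLAG value using the bit definitions from the SAM specification. The function names are illustrative and are not part of samtools.

```python
# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def passes_required(flag, required=0x2):
    """Mimic `samtools view -f`: every bit in `required` must be set."""
    return (flag & required) == required


print(decode_flag(99))      # a typical first-in-pair, properly paired read
print(passes_required(99))  # kept by -f 2
print(passes_required(73))  # mate unmapped, not a proper pair: dropped
```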

## Step 6: Variant Calling

In [None]:
# Use FreeBayes to call variants

In [27]:
!freebayes -f /home/kwoyshn1/hg19_reference_genome/hg19.fa -b father.filtered.rmdup.bam mother.filtered.rmdup.bam proband.filtered.rmdup.bam > variants.vcf
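
Because all three BAM files are passed to a single FreeBayes run, `variants.vcf` carries one genotype column per sample, named after the `SM` tags set during alignment. As a minimal sketch of how those columns can be read, assuming a tab-separated VCF with a `GT` entry in the FORMAT column (the function name and the example record are made up for illustration, and the sample column order in the real file should be taken from its header):

```python
def parse_genotypes(header_line, record_line):
    """Map each sample name to its GT string for one VCF record."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")
    return {
        sample: column.split(":")[gt_index]
        for sample, column in zip(samples, fields[9:])
    }


# Synthetic header and record, not taken from variants.vcf.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tproband"
record = "chr1\t100\t.\tA\tG\t50\t.\t.\tGT:DP\t0/0:10\t0/1:12\t0/1:9"
print(parse_genotypes(header, record))
```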

## Step 7: Post-processing

In [None]:
# Normalize VCF with bcftools

In [28]:
!bcftools norm -f /home/kwoyshn1/hg19_reference_genome/hg19.fa -m -any variants.vcf -o normalized_variants.vcf

Lines   total/split/joined/realigned/skipped:	36260/1758/0/3816/0
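
In the summary line above, "split" counts multiallelic records broken into one record per alternate allele (`-m -any`), and "realigned" counts records whose representation changed during normalization. The sketch below illustrates the idea on bare alleles only. It is a simplification and not the bcftools implementation: it trims bases shared by REF and ALT but does not left-align indels against the reference or rewrite INFO and genotype fields. The function names are hypothetical.

```python
def trim_alleles(pos, ref, alt):
    """Remove bases shared by REF and ALT, keeping at least one base in each."""
    # Trim shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, shifting the position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_multiallelic(pos, ref, alts):
    """Split one multiallelic site into trimmed biallelic (pos, ref, alt) records."""
    return [trim_alleles(pos, ref, alt) for alt in alts]


# Synthetic site with two alternate alleles: a deletion and a substitution.
print(split_multiallelic(100, "CAGT", ["CT", "CAGG"]))
# [(100, 'CAG', 'C'), (103, 'T', 'G')]
```

After splitting and trimming, the same variant has the same representation regardless of which other alleles were called at the site, which is what makes the normalized file easier to annotate and compare.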


## Step 8: Download the VCF file and interpret it in OpenCRAVAT
