# Bioinformatics of the Central Dogma - IPython Notebook

# Objective: Perform bioinformatic analysis of DNA sequencing data using command-line tools

## Day 1: Exome Sequencing Analysis

We will use exome sequencing data from a family trio (father, mother, and proband) to identify potentially disease-associated mutations.

## Step 1: Set up the environment

### Install the necessary bioinformatics tools from the Bioconda channel using mamba

In [None]:
!mamba install -c bioconda fastqc multiqc bwa samtools bcftools freebayes=1.3.6 -y
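
Before moving on, it can help to confirm that each tool is actually on the PATH. The following is a minimal sketch using only the Python standard library. The tool list mirrors the install command above, and `missing_tools` is a hypothetical helper name, not part of any of these packages.

```python
import shutil

# Executables expected from the install command above.
TOOLS = ["fastqc", "multiqc", "bwa", "samtools", "bcftools", "freebayes"]


def missing_tools(tools=TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Missing tools:", missing_tools() or "none")
```

An empty result suggests the install succeeded; any name listed would need to be installed again before the later steps can run.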

## Step 2: Load the raw sequencing data

In [None]:
# Download raw FASTQ files from Zenodo

In [None]:
!wget https://zenodo.org/record/3243160/files/father_R1.fq.gz

In [None]:
!wget https://zenodo.org/record/3243160/files/father_R2.fq.gz

In [None]:
!wget https://zenodo.org/record/3243160/files/mother_R1.fq.gz

In [None]:
!wget https://zenodo.org/record/3243160/files/mother_R2.fq.gz

In [None]:
!wget https://zenodo.org/record/3243160/files/proband_R1.fq.gz

In [None]:
!wget https://zenodo.org/record/3243160/files/proband_R2.fq.gz
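
The six downloads above follow one naming pattern: sample name, then read direction. As a sketch, the same URLs can be generated rather than typed out. `fastq_urls` is a hypothetical helper; the base URL and file names are taken from the cells above.

```python
BASE_URL = "https://zenodo.org/record/3243160/files"
SAMPLES = ["father", "mother", "proband"]
READS = ["R1", "R2"]  # forward and reverse reads of each pair


def fastq_urls(base_url=BASE_URL, samples=SAMPLES, reads=READS):
    """Build the download URL for every sample/read combination."""
    return [f"{base_url}/{sample}_{read}.fq.gz"
            for sample in samples
            for read in reads]


for url in fastq_urls():
    print(url)
```

Each printed URL could then be passed to `wget` in its own cell, as above.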

In [None]:
## Steps to download the reference genome and index it -- THIS WAS ALREADY RUN FOR YOU
# The data is large and the indexing takes a while, so we have already done this for you

# !wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# !gunzip hg19.fa.gz
# !bwa index hg19.fa


## Step 3: Quality Control

In [2]:
# Run FastQC to evaluate the quality of the raw data

In [None]:
!fastqc father_R1.fq.gz father_R2.fq.gz mother_R1.fq.gz mother_R2.fq.gz proband_R1.fq.gz proband_R2.fq.gz -o .
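
FastQC's per-base quality plots summarize the Phred scores stored in the fourth line of each FASTQ record. The sketch below shows how one quality string decodes, assuming the Phred+33 encoding (an assumption; the encoding of these particular files is not stated here). The example string is made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred):
    """Convert a Phred score Q into a base-call error probability, 10^(-Q/10)."""
    return 10 ** (-phred / 10)


example = "II5!"  # hypothetical quality string, not from the real data
scores = phred_scores(example)
print(scores)  # [40, 40, 20, 0]
print([error_probability(q) for q in scores])
```

A score of 20 corresponds to roughly a 1 in 100 chance that the base call is wrong, and a score of 40 to roughly 1 in 10,000.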

In [None]:
# Aggregate FastQC results using MultiQC

In [None]:
!multiqc .

## Step 4: Alignment

In [None]:
# Map the sequencing data to the human reference genome (hg19) using BWA-MEM

In [None]:
!bwa mem -R '@RG\tID:000\tSM:father' /home/kwoyshn1/hg19_reference_genome/hg19.fa father_R1.fq.gz father_R2.fq.gz > father.sam

In [None]:
!bwa mem -R '@RG\tID:001\tSM:mother' /home/kwoyshn1/hg19_reference_genome/hg19.fa mother_R1.fq.gz mother_R2.fq.gz > mother.sam

In [None]:
!bwa mem -R '@RG\tID:002\tSM:proband' /home/kwoyshn1/hg19_reference_genome/hg19.fa proband_R1.fq.gz proband_R2.fq.gz > proband.sam
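
The three alignment commands differ only in the sample name and the read-group ID passed to `-R`. The read group's `SM` field labels every alignment with its sample, which matters later when all three BAM files are called together. As a sketch, the commands can be built from one template. `bwa_mem_command` is a hypothetical helper, and the reference path is copied from the cells above.

```python
REFERENCE = "/home/kwoyshn1/hg19_reference_genome/hg19.fa"
SAMPLES = ["father", "mother", "proband"]


def bwa_mem_command(sample, index, reference=REFERENCE):
    """Return the bwa mem command line for one sample as a string."""
    # The literal backslash-t is kept so bwa itself expands it to a tab.
    read_group = f"@RG\\tID:{index:03d}\\tSM:{sample}"
    return (
        f"bwa mem -R '{read_group}' {reference} "
        f"{sample}_R1.fq.gz {sample}_R2.fq.gz > {sample}.sam"
    )


for index, sample in enumerate(SAMPLES):
    print(bwa_mem_command(sample, index))
```

The printed strings match the three commands above and can be checked before running anything.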

In [None]:
# Convert SAM to BAM and sort BAM files

In [None]:
!samtools view -Sb father.sam | samtools sort -o father.bam

In [None]:
!samtools view -Sb mother.sam | samtools sort -o mother.bam

In [None]:
!samtools view -Sb proband.sam | samtools sort -o proband.bam

## Step 5: Filter Alignments

In [None]:
# Filter BAM files to retain only properly paired reads and remove duplicates

In [18]:
!samtools view -b -f 2 father.bam > father.filtered.bam

In [19]:
!samtools view -b -f 2 mother.bam > mother.filtered.bam

In [20]:
!samtools view -b -f 2 proband.bam > proband.filtered.bam
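
The `-f 2` option keeps only alignments whose SAM FLAG has bit `0x2` set, which the SAM format defines as "each segment properly aligned according to the aligner". The sketch below reproduces that bit test on a few example FLAG values. `passes_required_flags` is a hypothetical helper written for illustration.

```python
# Selected SAM FLAG bits, as defined in the SAM format specification.
FLAG_PAIRED = 0x1        # read is part of a pair
FLAG_PROPER_PAIR = 0x2   # pair aligned as the aligner expects
FLAG_UNMAPPED = 0x4      # read itself is unmapped


def passes_required_flags(flag, required=FLAG_PROPER_PAIR):
    """Mimic the -f filter: keep a read only if all required bits are set."""
    return (flag & required) == required


for flag in (99, 83, 65, 4):
    print(flag, passes_required_flags(flag))
```

FLAG values 99 and 83 are typical for properly paired reads and pass the filter; 65 (paired but not properly paired) and 4 (unmapped) do not.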

In [None]:
!samtools rmdup father.filtered.bam father.filtered.rmdup.bam

In [None]:
!samtools rmdup mother.filtered.bam mother.filtered.rmdup.bam

In [None]:
!samtools rmdup proband.filtered.bam proband.filtered.rmdup.bam

## Step 6: Variant Calling

In [None]:
# Use FreeBayes to call variants

In [27]:
!freebayes -f /home/kwoyshn1/hg19_reference_genome/hg19.fa -b father.filtered.rmdup.bam mother.filtered.rmdup.bam proband.filtered.rmdup.bam > variants.vcf

## Step 7: Post-processing

In [None]:
# Normalize VCF with bcftools

In [28]:
!bcftools norm -f /home/kwoyshn1/hg19_reference_genome/hg19.fa -m -any variants.vcf -o normalized_variants.vcf

Lines   total/split/joined/realigned/skipped:	36260/1758/0/3816/0
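
The summary line printed by `bcftools norm` pairs five field names with five counts. A small sketch for turning it into a dictionary is shown below. `parse_norm_summary` is a hypothetical helper, and the input string is the output line above.

```python
def parse_norm_summary(line):
    """Parse a 'Lines total/split/...: n/n/...' summary into a dict of ints."""
    label, _, counts = line.partition(":")
    names = label.split()[-1].split("/")      # field names after "Lines"
    values = [int(value) for value in counts.strip().split("/")]
    return dict(zip(names, values))


summary = parse_norm_summary(
    "Lines   total/split/joined/realigned/skipped:\t36260/1758/0/3816/0"
)
print(summary)
```

Here 1758 of the 36260 records were split (consistent with the `-m -any` option splitting multiallelic sites) and 3816 were realigned.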


## Step 8: Download the VCF file and interpret it in OpenCRAVAT
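
When interpreting a trio, one common first pass is to look for variants carried by the proband but absent from both parents. The sketch below applies that rule to VCF-style genotype strings. It is a simplified illustration only: the function names are hypothetical, it ignores read depth and genotype quality, and it is not how OpenCRAVAT itself works.

```python
def _alleles(genotype):
    """Split a VCF genotype such as '0/1' or '0|1' into its allele codes."""
    return genotype.replace("|", "/").split("/")


def has_alt(genotype):
    """True if at least one called allele is non-reference."""
    return any(allele not in ("0", ".") for allele in _alleles(genotype))


def is_hom_ref(genotype):
    """True only if every allele is called as the reference ('0')."""
    return all(allele == "0" for allele in _alleles(genotype))


def is_de_novo_candidate(father_gt, mother_gt, proband_gt):
    """Proband carries an alternate allele; both parents are called hom-ref."""
    return has_alt(proband_gt) and is_hom_ref(father_gt) and is_hom_ref(mother_gt)


print(is_de_novo_candidate("0/0", "0/0", "0/1"))
print(is_de_novo_candidate("0/1", "0/0", "0/1"))
```

A missing parental genotype (`./.`) is deliberately treated as not homozygous reference, so such sites are not reported as candidates.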
