# Day 5 Morning - Genomics Application

### Objective: Perform bioinformatic analysis of DNA sequencing data using command-line tools

Exome sequencing is a wet-lab technique that enriches for the exons, the protein-coding portion of the genome (collectively, the exome). In humans the exome is only about 30 Mbp, roughly 1% of the total genome size (~3 Gbp). The advantage is the cost and time saved by sequencing a smaller portion of the genome.

For this exercise, we are going to analyze the exomes of a family in which the male child is affected by osteopetrosis (https://ghr.nlm.nih.gov/condition/osteopetrosis). The parents, though consanguineous (related), do not suffer from this disorder, which suggests it is inherited as a recessive genotype. There are several genetically inherited types of osteopetrosis, so we need to identify which type this is and which gene variants are causing it.

When a child has a rare, autosomal recessive disease, sequencing the family's DNA can show where the mutations are in their genomes. We will use sequencing data from this family to identify potentially disease-associated mutations.

(stealing idea from https://training.galaxyproject.org/training-material/topics/variant-analysis/tutorials/exome-seq/tutorial.html#post-processing-freebayes-calls)

## Step 1: Set up the environment

First - we are going to use mamba to install bioinformatic tools.  Lots of bioinformatic tools are available via mamba/conda, not just python packages.  This makes things really easy to install. 

In [None]:
!mamba install -c bioconda fastqc multiqc bwa samtools bcftools freebayes=1.3.6 -y

## Step 2: Load the raw sequencing data

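We are going to set up a working directory for our data files. Using IPython "magic" we can run shell commands in the notebook. This is a great way to interact with the command-line tools we will be using.

There are two main pieces of syntax. The first is the ! prefix, which runs the rest of the line as a shell command. The second is the % prefix, which runs an IPython magic command such as %cd. We use %cd rather than !cd because each ! command runs in its own subshell, so a directory change made with ! does not carry over to later cells. We will use both in this notebook.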
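As a rough sketch of what these two prefixes do, here are approximate plain-Python equivalents using only the standard library. The helper name `run_shell` and the scratch directory are made up for illustration; IPython's own implementation differs in detail.

```python
import os
import subprocess
import tempfile

def run_shell(command):
    """Roughly what `!command` does: run it in a subshell and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

# `!echo hello` in a notebook corresponds to something like this:
print(run_shell("echo hello"))

# `%cd some_dir` changes the directory of the notebook process itself,
# which is what os.chdir does.
start_dir = os.getcwd()
scratch_dir = tempfile.mkdtemp()  # hypothetical scratch directory for the demo
os.chdir(scratch_dir)
print(os.getcwd())
os.chdir(start_dir)  # go back so the demo has no lasting effect
```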

In [None]:
#create a working directory, then change to it for the rest of the cells
!mkdir -p ~/working 
%cd ~/working 

In [None]:
# Download raw FASTQ files from Zenodo

!wget https://zenodo.org/record/3243160/files/father_R1.fq.gz
!wget https://zenodo.org/record/3243160/files/father_R2.fq.gz
!wget https://zenodo.org/record/3243160/files/mother_R1.fq.gz
!wget https://zenodo.org/record/3243160/files/mother_R2.fq.gz
!wget https://zenodo.org/record/3243160/files/proband_R1.fq.gz
!wget https://zenodo.org/record/3243160/files/proband_R2.fq.gz
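The six downloads follow one pattern (three samples, two read files each), so the list of URLs can also be built programmatically. This is only a sketch; the variable names are illustrative and the base URL is the same Zenodo record used above.

```python
from itertools import product

BASE_URL = "https://zenodo.org/record/3243160/files"  # same record as the wget lines
samples = ["father", "mother", "proband"]
reads = ["R1", "R2"]

# One file per (sample, read) combination, e.g. father_R1.fq.gz
fastq_files = [f"{sample}_{read}.fq.gz" for sample, read in product(samples, reads)]
fastq_urls = [f"{BASE_URL}/{name}" for name in fastq_files]

for url in fastq_urls:
    print(url)
```

In a notebook cell, each URL could then be passed to the shell with `!wget {url}` inside the loop.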


The next cell shows the steps to download the reference genome and index it. We are going to use the human reference genome hg19. This is an older version, but frankly I know it works for this data because I've used this exercise before. We then have to tell the sequence aligner to "index" the reference: this builds a set of files from the full reference sequence that allows the aligner to quickly place the reads correctly.

The data is large and the indexing takes a while, so we have already done this for you.
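To give a feel for why an index makes read placement fast, here is a toy sketch. It is not what BWA does internally (BWA builds a Burrows-Wheeler/FM index); this example swaps in a much simpler hash table of k-mers, and the sequence and function names are made up. The idea is the same, though: preprocess the reference once so that each read can be looked up instead of scanned for.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the 0-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index

def find_read(read, reference, index, k):
    """Return 0-based positions where the read matches the reference exactly."""
    hits = []
    # Look up the read's first k-mer, then verify the rest of the read.
    for start in index.get(read[:k], []):
        if reference[start:start + len(read)] == read:
            hits.append(start)
    return hits

toy_reference = "ACGTTGCAACGTAGGC"  # made-up sequence, not real genome data
toy_index = build_kmer_index(toy_reference, k=4)
print(find_read("ACGTAG", toy_reference, toy_index, k=4))
```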

In [None]:
#!wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
#!gunzip hg19.fa.gz
#!bwa index hg19.fa


## Step 3: Quality Control

Using a tool called FastQC, we are going to check the quality of the raw FASTQ files.

A FASTQ file is a text file in which each read looks like this:

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

\+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

- The first line is a “@” followed by an identifier (“SEQ_ID”).
- The second line is the DNA sequence (ACGT).
- The third line is a “+”, optionally followed by the same identifier.
- The fourth line is the quality of the DNA sequence: how sure the instrument is that each base is right.

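The quality score is assigned by the sequencing instrument and should be used as a guide. The “PHRED” quality score is $Q = -10 \log_{10}(P)$, where $P$ is the probability that the base call is wrong, so Q30 means a 1 in 1000 chance of error. Each score is encoded as a single character using an ASCII scheme: the score is the character's ASCII value minus 33.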
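The encoding can be checked with a few lines of Python. This sketch decodes the quality line from the example read above; the function names are illustrative, and the offset of 33 is the one described in the text.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]

def error_probability(q):
    """Invert Q = -10 * log10(P) to get the probability that the base is wrong."""
    return 10 ** (-q / 10)

quality_line = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(quality_line)

# '!' has ASCII value 33, so the first base has Q = 0 (no confidence at all).
print(scores[:4])
print(error_probability(30))
```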


In [None]:
# Run FastQC to evaluate the quality of the raw data
!fastqc father_R1.fq.gz father_R2.fq.gz mother_R1.fq.gz mother_R2.fq.gz proband_R1.fq.gz proband_R2.fq.gz -o .

In [None]:
# Aggregate FastQC results using MultiQC
!multiqc .

## Step 4: Alignment

Next we are going to align the FASTQ files - using the index we already generated for the hg19 human reference.

This will generate a Sequence Alignment/Map (SAM) file, which we then convert to BAM.
The specification for this format is here: (https://samtools.github.io/hts-specs/SAMv1.pdf)
SAM is a TAB-delimited text file with 11 “mandatory” columns and some optional ones. BAM (Binary Alignment/Map) holds the same information as SAM, but in a compressed binary form instead of raw text, which takes up less hard drive space.
Samtools can be used to manipulate these files, as can pysam (Python) and packages in other languages. The best simple tool for looking at SAM/BAM files is IGV (http://software.broadinstitute.org/software/igv/).
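As a small illustration of the layout, the sketch below splits one alignment line into the 11 mandatory columns named in the SAM specification. The example record is made up for illustration and is not taken from our data; in practice pysam or samtools would do this parsing.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])  # these columns are integers
    record["TAGS"] = columns[11:]       # anything after column 11 is optional
    return record

# Hypothetical record: a 10-base read aligned to chr8 at position 1000.
example_line = "read1\t99\tchr8\t1000\t60\t10M\t=\t1200\t210\tGATTTGGGGT\tIIIIIIIIII\tRG:Z:000"
record = parse_sam_line(example_line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])
```

Header lines in a SAM file start with “@” (for example the @RG read group line we pass to bwa mem) and would need to be skipped before calling a parser like this.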


In [None]:
!bwa mem -R '@RG\tID:000\tSM:father' /home/kwoyshn1/hg19_reference_genome/hg19.fa father_R1.fq.gz father_R2.fq.gz > father.sam

In [None]:
!bwa mem -R '@RG\tID:001\tSM:mother' /home/kwoyshn1/hg19_reference_genome/hg19.fa mother_R1.fq.gz mother_R2.fq.gz > mother.sam

In [None]:
!bwa mem -R '@RG\tID:002\tSM:proband' /home/kwoyshn1/hg19_reference_genome/hg19.fa proband_R1.fq.gz proband_R2.fq.gz > proband.sam

In [None]:
# Convert SAM to BAM and sort by coordinate.
# samtools sort writes BAM here (the output format follows the -o file extension).

samples = ['father', 'mother', 'proband']

# {mysamp} is filled in by IPython before the shell command runs,
# so this loop is equivalent to running samtools sort once per sample.
for mysamp in samples:
    !samtools sort {mysamp}.sam -o {mysamp}.bam

## Step 5: Filter Alignments

We will filter out any reads that are not mapped as part of a proper pair, and then remove duplicates. This is a common step in bioinformatics analysis: we want to make sure we are only looking at high-quality data. The -f 2 option used below keeps only reads whose "properly paired" flag is set, which in practice also excludes unmapped reads.

Because PCR amplification is part of preparing the sequencing library, several reads can derive from the same original DNA fragment. These PCR duplicates can lead to false positive variant calls, so we remove them from the BAM files with samtools rmdup.
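The FLAG column of a SAM/BAM record is a bit field, and the value 2 is the "proper pair" bit. The sketch below decodes a flag using the bit values from the SAM specification; the function names are illustrative and the short labels are informal.

```python
# Bit values as defined in the SAM specification; labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """List the names of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

def passes_f2(flag):
    """Mimic the test behind `samtools view -f 2`: is the proper-pair bit set?"""
    return bool(flag & 0x2)

print(decode_flag(99))   # a typical flag for read 1 of a well-aligned pair
print(passes_f2(99), passes_f2(77))
```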

In [None]:
!samtools view -b -f 2 father.bam > father.filtered.bam
!samtools view -b -f 2 mother.bam > mother.filtered.bam
!samtools view -b -f 2 proband.bam > proband.filtered.bam

In [None]:
!samtools rmdup father.filtered.bam father.filtered.rmdup.bam
!samtools rmdup mother.filtered.bam mother.filtered.rmdup.bam
!samtools rmdup proband.filtered.bam proband.filtered.rmdup.bam

## Step 6: Variant Calling

Next we have to look for differences in our samples (father, mother, proband) as compared to the reference genome.  

![image.png](attachment:image.png) (credit ekg.github.io)

Even though Illumina is quite accurate, it’s not perfect.  Variant callers like freebayes try to figure out what’s a “real” variant from what’s an artifact/sequencing noise.
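To illustrate the idea of separating real variants from noise, here is a deliberately simplified sketch. It is not freebayes's actual model (freebayes is a haplotype-based Bayesian caller with many more inputs); it only assumes a fixed per-base error rate and asks which diploid genotype best explains the number of reads that show the alternate allele. All names and the 1% error rate are assumptions for illustration.

```python
from math import comb

def genotype_likelihoods(alt_reads, total_reads, error_rate=0.01):
    """Binomial likelihood of the observed alt-read count under each genotype."""
    likelihoods = {}
    for genotype, alt_copies in (("0/0", 0), ("0/1", 1), ("1/1", 2)):
        alt_fraction = alt_copies / 2
        # Chance that a single read shows the alt allele, allowing for errors.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        likelihoods[genotype] = (comb(total_reads, alt_reads)
                                 * p_alt ** alt_reads
                                 * (1 - p_alt) ** (total_reads - alt_reads))
    return likelihoods

def most_likely_genotype(alt_reads, total_reads, error_rate=0.01):
    likelihoods = genotype_likelihoods(alt_reads, total_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)

print(most_likely_genotype(1, 30))   # one odd read out of 30
print(most_likely_genotype(10, 20))  # half the reads carry the alt allele
print(most_likely_genotype(20, 20))  # every read carries the alt allele
```

Under these assumptions a single mismatching read out of 30 is best explained as sequencing noise, while a roughly 50/50 split points to a heterozygous site.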

(Note this step takes about 10 minutes or so)

In [None]:
!freebayes -f /home/kwoyshn1/hg19_reference_genome/hg19.fa -b father.filtered.rmdup.bam mother.filtered.rmdup.bam proband.filtered.rmdup.bam > variants.vcf

## Step 7: Post-processing

Now we are working with a Variant Call Format (VCF) file.

The specification for this format is here: (https://samtools.github.io/hts-specs/VCFv4.2.pdf)
It’s another tab-delimited file, with meta-information header lines that start with “##”, one column header line that starts with “#CHROM”, then columns describing each variant’s position and alleles, then one column per sample with the genotype and the number of reads and/or other information supporting the variant.
Bcftools can be used to manipulate these files, as can pysam (Python) and packages in other languages.
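As a sketch of that layout, the code below parses one data line using the fixed column order from the VCF specification. The header and record are made up for illustration (they are not from variants.vcf), and the function name is hypothetical; for real work, bcftools or pysam is the safer choice.

```python
def parse_vcf_line(line, sample_names):
    """Split one VCF data line into fixed columns and per-sample FORMAT fields."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = columns[:9]
    format_keys = fmt.split(":")  # e.g. ["GT", "DP"]
    samples = {}
    for name, sample_column in zip(sample_names, columns[9:]):
        samples[name] = dict(zip(format_keys, sample_column.split(":")))
    return {"CHROM": chrom, "POS": int(pos), "REF": ref,
            "ALT": alt.split(","), "QUAL": qual, "samples": samples}

# Hypothetical column header line and record with three samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tproband\tmother\tfather"
sample_names = header.split("\t")[9:]
example = "chr8\t12345\t.\tG\tA\t50\t.\tDP=60\tGT:DP\t1/1:22\t0/1:18\t0/1:20"

variant = parse_vcf_line(example, sample_names)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
print(variant["samples"]["proband"]["GT"])
```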

The bcftools norm command normalizes Variant Call Format (VCF) files. It is part of the bcftools software package, which is commonly used in genomics analysis.

When working with genetic data, it is important to have standardized and consistent representations of genetic variants. VCF files contain information about genetic variations, such as single nucleotide polymorphisms (SNPs) or insertions/deletions (indels). However, the same variant can often be written in more than one way, so VCF files can contain inconsistent or redundant entries.

The bcftools norm command helps to address these issues. Here are the main tasks it can perform:

Left-aligning variants: An indel inside a repetitive stretch can be placed at several positions that all produce the same sequence. Given the reference with -f, bcftools norm shifts each indel to the leftmost possible position, so the same variant always gets the same coordinates.

Normalizing indels: Indels can also carry extra flanking bases in REF and ALT. bcftools norm trims these to the shortest representation, making variants easier to compare and analyze.

Splitting multiallelic sites: A record that lists several ALT alleles can be split into one record per allele with -m. This is what -m -any does in the command below. Note that this is not the same as decomposing complex variants (such as multi-nucleotide substitutions) into simpler components; that is a separate option (--atomize in recent bcftools versions) and is not used here.

Removing redundant variants: In some cases, VCF files may contain duplicate records for the same variant. bcftools norm can remove them with the -d option, which is also not used here.

By performing these normalization steps, bcftools norm helps to ensure that VCF files are consistent, standardized, and easier to work with in downstream analysis pipelines.
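Left-alignment is easier to see on a toy example. The sketch below is not bcftools code; it is a simplified, independent implementation of the usual trim-and-shift procedure for a single biallelic variant, using a made-up reference sequence and 1-based positions as in VCF.

```python
def normalize_variant(reference, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    `pos` is the 1-based position of the first base of `ref` in `reference`.
    Assumes the variant can be normalized without running off the start.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared last base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the reference")
            pos -= 1
            left_base = reference[pos - 1]
            ref, alt = left_base + ref, left_base + alt
            changed = True
    # Finally trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

toy_reference = "TGCACACAG"  # made-up sequence with a CA repeat
# Deleting one "CA" unit, written at the right-hand end of the repeat:
print(normalize_variant(toy_reference, 6, "ACA", "A"))
```

The deletion of one CA unit can be written at several places in the repeat; after normalization it is reported once, anchored at the G just before the repeat.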

In [None]:
!bcftools norm -Oz -f /home/kwoyshn1/hg19_reference_genome/hg19.fa -m -any variants.vcf -o normalized_variants.vcf.gz

Next, we are going to process the normalized VCF file to identify the gene that is likely causing the disease. We will use bcftools to keep only variants that fit an autosomal recessive pattern: heterozygous in both parents and homozygous for the alternate allele in the proband.
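The inheritance pattern itself is a simple rule on the three genotypes. Here is a sketch of that rule in Python, using made-up genotype dictionaries keyed by sample name; the function name is hypothetical. Unlike an exact string match on "0/1", this version also accepts "1/0" and phased genotypes such as "0|1".

```python
def is_recessive_candidate(genotypes):
    """True if the proband is homozygous alt and both parents are heterozygous."""
    def alleles(gt):
        # Treat phased ("|") and unphased ("/") genotypes alike, ignoring order.
        return sorted(gt.replace("|", "/").split("/"))

    return (alleles(genotypes["proband"]) == ["1", "1"]
            and alleles(genotypes["mother"]) == ["0", "1"]
            and alleles(genotypes["father"]) == ["0", "1"])

# Hypothetical genotype calls at two sites.
site_a = {"proband": "1/1", "mother": "0/1", "father": "0/1"}
site_b = {"proband": "0/1", "mother": "0/1", "father": "0/0"}
print(is_recessive_candidate(site_a), is_recessive_candidate(site_b))
```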

First we'll index

In [None]:
!bcftools index normalized_variants.vcf.gz

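Now we'll filter based on the samples. The sample order in this VCF is proband (0), mother (1), father (2); you can confirm it with bcftools query -l normalized_variants.vcf.gz, which lists the sample names in column order.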

In [None]:
!bcftools view -i 'FMT/GT[0]="1/1" && FMT/GT[1]="0/1" && FMT/GT[2]="0/1"' normalized_variants.vcf.gz -o filtered_recessive.vcf


## Step 8: Download VCF file, interpret in OpenCRAVAT
