<a href="https://colab.research.google.com/github/ternithinator/test2/blob/main/NGS_collab_snp_identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Variant calling is the process of identifying genetic variants, such as single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels), from next-generation sequencing (NGS) data. It involves comparing sequencing reads from an individual to a reference genome to detect differences such as single nucleotide variants (SNVs). In this tutorial, we will focus specifically on detecting SNPs and indels.

In this step:

a) Call variants (single nucleotide polymorphisms and indels) using tools like BCFtools or Strelka

b) Produce a VCF file (Variant Call Format) with detailed information about each variant

c) Optionally apply variant filtering to remove false positives or low-confidence variants

Why it's important: This is the core step of population genetics: identifying the genetic differences across samples.


For this tutorial we are going to use Strelka, a tool utilised for germline and somatic variant calling.

1: Germline calling: uses a haplotype-based model to accurately detect inherited variants.

(A haplotype is a set of DNA variants inherited together on the same chromosome copy.)

2: Somatic calling: identifies genetic mutations that arise in somatic (non-germline) cells. These mutations are not inherited from parents and are not passed on to offspring.

Workflow execution: Strelka2 is run in two steps. Configuration specifies the input data, and execution launches the workflow with run parameters such as the number of parallel jobs.


The first step is to download the Miniconda installer for Linux using `wget` and create an environment in which Strelka and Samtools are installed.

In [None]:
%%bash
# Download and install Miniconda
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p /usr/local/miniconda

# Add conda to PATH and initialize shell
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh

# Accept Terms of Service for required conda channels
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create environment and install Strelka + Samtools
conda create -y -n strelka_samtools -c bioconda strelka samtools

### **Explanation**

#### 1) Download Miniconda installer

```bash
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
```

* Downloads the latest Miniconda installer for Linux.
* `-q` → quiet mode.
* Saves file as `miniconda.sh`.

#### 2) Install Miniconda silently

```bash
bash miniconda.sh -b -p /usr/local/miniconda
```
* Runs installer in **batch mode** (`-b`) → no user interaction needed.
* Installs conda into `/usr/local/miniconda`.


#### 3) Add conda to PATH + initialize shell

```bash
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
```

* Makes the `conda` command visible to your shell.
* Loads the conda environment activation function.

#### 4) Create a new environment with Strelka + Samtools

```bash
conda create -y -n strelka_samtools -c bioconda strelka samtools
```

* `-y` → auto-confirm installation.
* `-n strelka_samtools` → environment will be named **strelka_samtools**.
* `-c bioconda` → install packages from **Bioconda**, which hosts bioinformatics tools.
* Installs both:

  * **Strelka** → variant caller for germline & somatic variants
  * **Samtools** → essential tool for BAM/SAM handling
---

Next, list the packages installed in the `strelka_samtools` environment to confirm that the installation worked.

In [None]:
%%bash
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh

conda list -n strelka_samtools

We run `which configureStrelkaGermlineWorkflow.py` only to check that Strelka is installed correctly and available on our PATH.

In [None]:
%%bash
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka_samtools

which configureStrelkaGermlineWorkflow.py
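
The same kind of check can cover several commands at once. The sketch below is one possible way to do it. The helper name `check_tools` is hypothetical (it is not part of conda or Strelka), and it relies only on the shell's `command -v`. It assumes the `strelka_samtools` environment has already been activated in the same cell.

```bash
# Hypothetical helper: report every command that is not available on the PATH.
# Returns 0 if all commands are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (run after `conda activate strelka_samtools`):
# check_tools samtools configureStrelkaGermlineWorkflow.py
```

A non-zero return value here would suggest that the environment was not activated or that the installation step did not complete.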

After this, we will create a new directory, `input_files`, and download the reference genome FASTA (`.fna`) file into it from the Zenodo repository.

In [None]:
%%bash
rm -rf input_files

# Create the directory if it doesn't exist
mkdir -p input_files

# Download the reference genome FASTA (.fna) file
wget -P input_files https://zenodo.org/records/14258052/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna

# List downloaded contents
ls -F input_files/
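
A failed or interrupted download can leave behind an empty file or an error page instead of a genome. As a rough sanity check, the sketch below tests that a file is non-empty and begins with `>`, the first character of a FASTA header line. The helper name `looks_like_fasta` is hypothetical, and the check does not validate the rest of the file.

```bash
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and its first character is '>' (the start of a FASTA header line).
looks_like_fasta() {
    [ -s "$1" ] && [ "$(head -c 1 "$1")" = ">" ]
}

# Example call for the reference downloaded above:
# looks_like_fasta input_files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna \
#     && echo "FASTA looks OK"
```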

Next, we will generate a **FASTA index file** for the reference genome using Samtools, so that Strelka can quickly find any position in the genome without scanning the entire file.

In [None]:
%%bash
# Activate conda environment that contains samtools
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka_samtools

# Move into input_files directory
cd input_files

# Index the reference FASTA
samtools faidx GCA_021130815.1_PanTigT.MC.v3_genomic.fna

# List the files
ls -F


### **Explanation**

`samtools faidx` is a command that creates a FASTA index (`.fai` file), which stores the length and byte offset of each sequence (chromosome or scaffold) so that any genomic region can be read directly without scanning the whole file.
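
To illustrate how the index makes this direct access possible, the sketch below works through the arithmetic on a made-up toy FASTA. Each `.fai` line has five tab-separated columns: sequence name, sequence length, byte offset of the first base, bases per line, and bytes per line (including the newline). The helper name `fai_byte_offset` is hypothetical and the index values are written by hand for the toy file rather than produced by Samtools.

```bash
# Toy FASTA (made-up sequence) written to a temporary directory.
workdir=$(mktemp -d)
printf '>chr1\nACGTACGT\nACGT\n' > "$workdir/toy.fa"

# The .fai line that describes this file
# (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH):
#   chr1  12  6  8  9

# Hypothetical helper: byte offset in the FASTA file of a 1-based position.
# Arguments: OFFSET LINEBASES LINEWIDTH POSITION
fai_byte_offset() {
    offset=$1; linebases=$2; linewidth=$3
    pos0=$(( $4 - 1 ))
    echo $(( offset + (pos0 / linebases) * linewidth + pos0 % linebases ))
}

# Fetch base 10 of chr1 by jumping straight to its byte.
byte=$(fai_byte_offset 6 8 9 10)
base=$(tail -c +"$(( byte + 1 ))" "$workdir/toy.fa" | head -c 1)
echo "chr1:10 is at byte $byte -> $base"
```

For the toy file this prints `chr1:10 is at byte 16 -> C`. The same calculation lets a tool read any region of a large genome without parsing the sequence before it.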

---



For the variant calling step, we will download the BAM files (aligned, sorted, and duplicate-marked reads generated during the mapping step) together with their index (`.bai`) files from the Zenodo repository, so that they are ready for downstream variant identification.

In [None]:
%%bash

# Download all BAM and BAI files into input_files/
wget -q "https://zenodo.org/records/17895817/files/BEN_CI16_aligned_reads_deduplicated.bam?download=1"     -O input_files/BEN_CI16_aligned_reads_sorted_deduplicated.bam
wget -q "https://zenodo.org/records/17895817/files/BEN_CI16_aligned_reads_deduplicated.bam.bai?download=1" -O input_files/BEN_CI16_aligned_reads_sorted_deduplicated.bam.bai

wget -q "https://zenodo.org/records/17895817/files/BEN_NW10_aligned_reads_deduplicated.bam?download=1"     -O input_files/BEN_NW10_aligned_reads_sorted_deduplicated.bam
wget -q "https://zenodo.org/records/17895817/files/BEN_NW10_aligned_reads_deduplicated.bam.bai?download=1" -O input_files/BEN_NW10_aligned_reads_sorted_deduplicated.bam.bai

wget -q "https://zenodo.org/records/17895817/files/BEN_SI18_aligned_reads_deduplicated.bam?download=1"     -O input_files/BEN_SI18_aligned_reads_sorted_deduplicated.bam
wget -q "https://zenodo.org/records/17895817/files/BEN_SI18_aligned_reads_deduplicated.bam.bai?download=1" -O input_files/BEN_SI18_aligned_reads_sorted_deduplicated.bam.bai

# List downloaded files
ls -lh input_files
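
Because `wget -q` prints nothing when a download fails, it can help to confirm that every BAM file has its index next to it before configuring Strelka. The sketch below is a minimal check under that assumption. The helper name `check_bam_indexes` is hypothetical, and it only tests that a non-empty `<name>.bam.bai` file exists, not that the index is valid.

```bash
# Hypothetical helper: list BAM files in a directory that lack a .bai index.
# Returns non-zero if any index is missing or empty.
check_bam_indexes() {
    status=0
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue          # directory contains no BAM files
        if [ ! -s "$bam.bai" ]; then
            echo "missing index: $bam"
            status=1
        fi
    done
    return "$status"
}

# Example call:
# check_bam_indexes input_files && echo "All BAM files are indexed"
```

If an index is reported missing, it can be regenerated with `samtools index <file.bam>`.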


---
Once we have the required BAM files, we can begin variant calling. In this analysis we use the Strelka germline workflow, which is set up with the `configureStrelkaGermlineWorkflow.py` script.

The germline configuration created by this script contains:

1: Information about the input BAM files (aligned, sorted, and duplicate-marked reads).

2: The reference genome path, which Strelka uses to compare the aligned reads against the reference sequence and identify variant sites.

3: Workflow parameters, such as filtering settings, runtime options, and rules for how Strelka processes sequencing data.

4: Module definitions, specifying the order and structure of the processing steps used to detect SNVs and small indels.

5: Paths to output directories, where Strelka will write variant call results, logs, and intermediate files.

In short, this germline configuration is used by Strelka to build and execute the variant calling workflow. It ensures that the pipeline runs consistently with the correct inputs, reference genome, and settings needed for accurate germline variant detection.

In [None]:
%%bash
# Activate environment
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka_samtools

# Create the run directory manually (recommended for clarity)
mkdir -p strelka_run

# Configure Strelka workflow
configureStrelkaGermlineWorkflow.py \
    --bam input_files/BEN_CI16_aligned_reads_sorted_deduplicated.bam \
    --bam input_files/BEN_NW10_aligned_reads_sorted_deduplicated.bam \
    --bam input_files/BEN_SI18_aligned_reads_sorted_deduplicated.bam \
    --referenceFasta input_files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna \
    --runDir strelka_run

### **Explanation**

#### 1) Create a run directory

```bash
mkdir -p strelka_run
```


Strelka needs a clean directory to store: workflow scripts, configuration files, logs and final outputs.


#### 2) Configure the Strelka Germline Workflow

```bash
configureStrelkaGermlineWorkflow.py \
    --bam input_files/BEN_CI16_aligned_reads_sorted_deduplicated.bam \
    --bam input_files/BEN_NW10_aligned_reads_sorted_deduplicated.bam \
    --bam input_files/BEN_SI18_aligned_reads_sorted_deduplicated.bam \
    --referenceFasta input_files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna \
    --runDir strelka_run
```

`configureStrelkaGermlineWorkflow.py` is the Strelka script that **creates the workflow** in the run directory, including the `runWorkflow.py` script that is used to launch variant calling in the next step.


```bash
--bam input_files/BEN_CI16_aligned_reads_sorted_deduplicated.bam
--bam input_files/BEN_NW10_aligned_reads_sorted_deduplicated.bam
--bam input_files/BEN_SI18_aligned_reads_sorted_deduplicated.bam
```

Provides the BAM files to analyse, with one `--bam` option per sample.


```bash
--referenceFasta input_files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna
```

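Tells Strelka **which genome assembly** to call variants against. This must be the same assembly that was used to align the reads, and it must have the `.fai` index created earlier.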


```bash
--runDir strelka_run
```
Instructs Strelka to create the full workflow inside the folder `strelka_run`.
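
With more than a few samples, typing one `--bam` option per file becomes error-prone. As a sketch of one alternative, the hypothetical helper `bam_args` below prints a `--bam <path>` pair for every BAM file in a directory, which can then be passed to the configuration script. It assumes the file paths contain no spaces, because the example call relies on shell word splitting.

```bash
# Hypothetical helper: print one "--bam <path>" pair per BAM file in a directory.
bam_args() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue          # directory contains no BAM files
        printf '%s %s\n' --bam "$bam"
    done
}

# Example use (paths must not contain spaces):
# configureStrelkaGermlineWorkflow.py \
#     $(bam_args input_files) \
#     --referenceFasta input_files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna \
#     --runDir strelka_run
```

Running `bam_args input_files` on its own first is a quick way to preview which files would be included.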

---

In the final step, we run the workflow that was created inside the `strelka_run/` directory, which executes all steps of the germline variant-calling pipeline. It automatically processes all the BAM files, performs variant calling on each sample against the reference genome, and then writes the final VCF output files, containing SNVs and indels, under `strelka_run/results/variants/`.



In [None]:
%%bash
# Activate environment
export PATH="/usr/local/miniconda/bin:$PATH"
source /usr/local/miniconda/etc/profile.d/conda.sh
conda activate strelka_samtools

strelka_run/runWorkflow.py -m local -j 4

### **Explanation**
* `-m local`: runs the workflow on the local machine.

* `-j 4`: runs up to 4 jobs in parallel to speed up variant calling.
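
The value passed to `-j` does not have to be hard-coded. The sketch below, offered as one possible approach, reads the number of available CPU cores and falls back to 1 if it cannot be determined. The variable name `n_jobs` is an arbitrary choice.

```bash
# Number of available CPU cores; falls back to 1 if neither command works.
n_jobs=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "Available cores: $n_jobs"

# The workflow could then be launched with that value:
# strelka_run/runWorkflow.py -m local -j "$n_jobs"
```

On a shared or hosted machine the number of cores may be smaller than 4, in which case requesting more jobs than cores is unlikely to make the run faster.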






After running the Strelka germline variant calling workflow, the pipeline produces a set of output files, most importantly the **VCF (Variant Call Format) files**. These files contain the list of single nucleotide variants (SNVs) and small insertions/deletions (indels) identified from the input BAM files.
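
As a first look at such a file, the sketch below lists the sample names and counts the variant records in a compressed VCF using only `gzip`, `awk`, and `grep`. It runs on a small made-up VCF so that it is self-contained. The records and sample names are illustrative and are not real Strelka output. To inspect real results, point `vcf` at the file of interest, which for the germline workflow is expected to be `strelka_run/results/variants/variants.vcf.gz`.

```bash
# Toy gzipped VCF with made-up records (not real Strelka output).
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tBEN_CI16\tBEN_NW10\n'
    printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
    printf 'chr1\t250\t.\tCT\tC\t12\tLowGQX\t.\tGT\t0/1\t0/0\n'
} | gzip > "$workdir/toy.vcf.gz"

vcf="$workdir/toy.vcf.gz"

# Sample names are columns 10 onwards of the #CHROM header line.
samples=$(gzip -dc "$vcf" | awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i }')
echo "$samples"

# Variant records are the lines that do not start with '#'.
n_records=$(gzip -dc "$vcf" | grep -vc '^#')
echo "records: $n_records"
```

Each record line holds the chromosome, position, reference and alternative alleles, a quality score, the FILTER status, and then one genotype column per sample.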

Once the VCF files are generated, they represent the final variant calls produced by Strelka, and they can now be used for downstream analyses. Typical post-processing steps may include:

* **Quality assessment** of the variants by inspecting Strelka’s FILTER and quality score annotations.
* **Merging or comparing VCFs** (if multiple samples were processed) to evaluate shared or unique variants.
* **Functional annotation** using external tools (e.g., VEP, SnpEff) to predict gene impacts, consequences, or biological relevance.
* **Filtering based on depth, genotype quality, or allele frequency**, depending on the research goals.
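
The simplest of these steps, keeping only variants that passed all filters, can be sketched with `awk` alone. The example below uses a made-up, uncompressed toy VCF. It keeps header lines and records whose FILTER column is `PASS`, then counts how many of those are single-base substitutions. The file names are arbitrary, and treating every other record as "other" is a simplification: indels and multi-allelic sites both land in that group.

```bash
# Toy VCF with made-up records (not real Strelka output).
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
    printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n'
    printf 'chr1\t250\t.\tCT\tC\t12\tLowGQX\t.\tGT\t0/1\n'
    printf 'chr2\t40\t.\tG\tGA\t60\tPASS\t.\tGT\t1/1\n'
} > "$workdir/toy.vcf"

# Keep header lines plus records whose FILTER column (column 7) is PASS.
awk -F '\t' '/^#/ || $7 == "PASS"' "$workdir/toy.vcf" > "$workdir/toy.pass.vcf"

# Among PASS records, count SNVs (REF and ALT are both one base) and the rest.
summary=$(awk -F '\t' '!/^#/ {
    if (length($4) == 1 && length($5) == 1) snv++; else other++
} END { printf "SNVs=%d other=%d\n", snv, other }' "$workdir/toy.pass.vcf")
echo "$summary"
```

For the toy data this prints `SNVs=1 other=1`. If BCFtools is installed, `bcftools view -f PASS` performs the same selection on compressed files.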

In summary, the generation of VCF files marks the completion of the variant calling workflow, providing a structured dataset of all detected variants. These files form the foundation for interpretation, annotation, and further biological or population-level analyses.


## Tutorial Questions

1) What basic information does each row in a VCF file represent (e.g. position, reference base, alternative base)?

2) Where in the VCF can you see whether a variant is homozygous or heterozygous?

3) What does PASS mean in the FILTER column of a Strelka VCF?

4) Strelka's germline workflow reports SNVs and indels in the same VCF file. How can you tell from the REF and ALT columns whether a record is an SNV or an indel?

5) Which column in the VCF tells you how confident the variant call is, and what happens to this value when the evidence for a variant is weak?