<a href="https://colab.research.google.com/github/ternithinator/test2/blob/main/Genomics_Workshop_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<video src="https://raw.githubusercontent.com/PoODL-CES/PoODL_NGS_workshop.io/main/Nature%20of%20Science.mp4" controls autoplay loop width="400" align="right"></video>

## Genomics Learning Workshop: NGS Analysis Pipeline

Easy-to-follow workflow for processing Illumina whole-genome resequencing reads for population genomics. Steps include trimming, mapping, sorting, variant calling, variant filtering, PCA, and ADMIXTURE analysis. For more details, see the bottom of this notebook and visit the GitHub repository.
Repository link: [Genomics Learning Workshop](https://github.com/PoODL-CES/Genomics_learning_workshop)

This workshop introduces:
- Handling raw FASTQ files
- Performing quality control and trimming
- Mapping reads to a reference genome
- Calling and filtering variants
- Running PCA and ADMIXTURE for population analysis


In [None]:
# Miniconda installation and environment setup for Colab NGS Workshop

# Download and install Miniconda (skip if already installed)
!wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -b -p /usr/local/miniconda

import sys, os
sys.path.append('/usr/local/miniconda/lib/python3.8/site-packages')
os.environ['PATH'] = "/usr/local/miniconda/bin:" + os.environ['PATH']

# Explicitly clear potentially problematic environment variables from Python's os.environ
if 'CONDA_PREFIX' in os.environ:
    del os.environ['CONDA_PREFIX']
if 'CONDA_ENVS_PATH' in os.environ:
    del os.environ['CONDA_ENVS_PATH']

# Accept ToS for main and R conda channels
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
!conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r

# Create new conda environment for your workshop if it doesn't exist
!if ! conda info --envs | grep -w 'workshop_ngs'; then \
    conda create -y -n workshop_ngs python=3.7; \
else \
    echo "Environment 'workshop_ngs' already exists, skipping creation."; \
fi

# Install necessary bioinformatics tools into the environment
!conda install -y -n workshop_ngs -c bioconda -c conda-forge trim-galore samtools fastqc bwa

# Verify tool versions inside activated environment in one shell session
# (bwa has no --version flag; its version is part of the usage text it prints to stderr)
!bash -c "source /usr/local/miniconda/bin/activate workshop_ngs && \
          trim_galore --version && samtools --version && fastqc --version && \
          bwa 2>&1 | grep Version"

Each `!` command in Colab runs in its own shell, so an environment activated in one command is not active in the next. To make sure a command uses the tools from `workshop_ngs`, wrap it in a single `bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && your_command_here"` call, which activates the environment and runs the command in the same shell session. Let's try listing the environment contents and checking `CONDA_PREFIX` within that activated session:

In [None]:
# \$ keeps the outer shell from expanding CONDA_PREFIX before the environment is activated
!bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && \
          echo '--- Environment activated ---' && \
          echo 'CONDA_PREFIX in activated env: '\$CONDA_PREFIX && \
          echo '--- Listing contents of activated env ---' && \
          conda list"
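
The same wrapping pattern can be built in plain Python, which is handy when a command has to be assembled from variables. The sketch below is only an illustration: the helper name `env_command` is hypothetical and not part of the workshop code, and it assumes the Miniconda install path used in this notebook.

```python
import shlex

# Path to conda's shell hook, as installed earlier in this notebook
CONDA_SH = "/usr/local/miniconda/etc/profile.d/conda.sh"

def env_command(command, env="workshop_ngs"):
    """Return an argv list that runs `command` inside the given conda environment.

    `command` is passed to bash as-is, so it may contain pipes or redirections.
    """
    script = (
        f"source {shlex.quote(CONDA_SH)} && "
        f"conda activate {shlex.quote(env)} && "
        f"{command}"
    )
    return ["bash", "-c", script]

# Show the command that would be executed
print(env_command("samtools --version"))
```

The returned list can be handed to `subprocess.run(env_command("samtools --version"), check=True)`, which raises an error if the wrapped command fails.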

We’ll be using real WGS resequencing data for this workshop. The dataset is a small subset for demonstration purposes. We will download the data from Zenodo (https://zenodo.org/records/14258052). Step 1 is to make a parent directory so that all the FASTQ data is stored in one place.


In [None]:
!mkdir -p fastq_files
!ls -F

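Now, let's step into the `fastq_files` directory and print its path. Because each `!` command runs in its own shell, this `cd` does not carry over to later cells, which is why the following commands refer to `fastq_files/` by path.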

In [None]:
!cd fastq_files/ && pwd

Let's organize the download process by first ensuring the `fastq_files` directory exists, then navigating into it to download all the FASTQ files directly to their intended location.

In [None]:
# Define the list of FASTQ file links
fastq_links = [
    "https://zenodo.org/records/14258052/files/BEN_CI16_sub_1.fq.gz",
    "https://zenodo.org/records/14258052/files/BEN_CI16_sub_2.fq.gz",
    "https://zenodo.org/records/14258052/files/BEN_NW10_sub_1.fq.gz",
    "https://zenodo.org/records/14258052/files/BEN_NW10_sub_2.fq.gz",
    "https://zenodo.org/records/14258052/files/BEN_SI18_sub_1.fq.gz",
    "https://zenodo.org/records/14258052/files/BEN_SI18_sub_2.fq.gz",
]

# Create the directory if it doesn't exist
!mkdir -p fastq_files

# Clean up any existing fastq files and duplicates before downloading
!rm -f fastq_files/*.fq.gz*

# Download each file from inside the fastq_files directory
for link in fastq_links:
    # -nc (no-clobber) skips a file that already exists (it does not check completeness);
    # -P . saves into the current directory, which is fastq_files after the cd
    !bash -c "cd fastq_files && wget -nc -P . -q {link}"

# List the contents of the fastq_files directory to confirm all downloads
!ls -F fastq_files/

The listing above shows the contents of the `fastq_files` directory, so you can confirm that all six files have been downloaded.
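
A directory listing only shows that the files exist. As a rough extra check, the sketch below opens each gzipped file and inspects its first FASTQ record. The function name `first_fastq_record` is an illustrative assumption, and the check only looks at the start of each file, so it will not detect a download that was cut off later on.

```python
import glob
import gzip

def first_fastq_record(path):
    """Read the first four lines of a gzipped FASTQ file and check their basic structure."""
    with gzip.open(path, "rt") as handle:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
    header, sequence, separator, quality = lines
    if not header.startswith("@"):
        raise ValueError(f"{path}: header line does not start with '@'")
    if not separator.startswith("+"):
        raise ValueError(f"{path}: third line does not start with '+'")
    if len(sequence) != len(quality):
        raise ValueError(f"{path}: sequence and quality lengths differ")
    return header, sequence, quality

# Check every downloaded file; the loop does nothing if the directory is empty
for path in sorted(glob.glob("fastq_files/*.fq.gz")):
    try:
        header, sequence, _ = first_fastq_record(path)
        print(f"OK   {path}: first read has {len(sequence)} bases")
    except (ValueError, OSError, EOFError) as error:
        print(f"FAIL {path}: {error}")
```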

Once all the FASTQ files are downloaded, the next step is to run FastQC on the raw reads to assess their quality before any downstream analysis.

***Why it's important:*** Poor-quality reads can lead to false variant calls or poor mapping, so quality control is crucial.

To do so, first, let's create a directory to store the FastQC reports.

In [None]:
!mkdir -p fastqc_results
!ls -F

Now, we will run FastQC on all the FASTQ files within the `workshop_ngs` environment and save the reports to the `fastqc_results` directory.

In [None]:
%%time
!bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && \
          fastqc fastq_files/*.fq.gz -o fastqc_results"

# List the generated FastQC reports
!ls -F fastqc_results/

The .html files generated by FastQC are standard web pages, which means you can visualize them by opening them in any web browser!

Here's how you can access them from Colab:
On the left-hand side of your Colab interface, there's usually a file explorer icon (a folder). Click on it to open the file browser and navigate to the `fastqc_results` directory, where all the `.html` files are listed. Right-click on any `.html` file and select 'Download' to save it to your local machine, then open it with your preferred web browser.
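
For a quick overview without opening each report, the PASS/WARN/FAIL status of every FastQC module can also be read from the `summary.txt` file stored inside each `_fastqc.zip` archive. The following is a minimal sketch under that assumption; the function name `read_fastqc_summary` is hypothetical.

```python
import glob
import zipfile

def read_fastqc_summary(zip_path):
    """Return {module name: status} from the summary.txt inside a FastQC zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # The summary sits in a folder named after the report, e.g. X_fastqc/summary.txt
        summary_name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(summary_name).decode()
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is: status <TAB> module name <TAB> input file name
        status, module = line.split("\t")[:2]
        results[module] = status
    return results

# Print only the modules that did not pass; does nothing if no reports exist yet
for zip_path in sorted(glob.glob("fastqc_results/*_fastqc.zip")):
    flagged = {m: s for m, s in read_fastqc_summary(zip_path).items() if s != "PASS"}
    print(zip_path, flagged if flagged else "all modules PASS")
```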




After quality assessment, the next step is to trim adapters and low-quality bases using a tool such as Trim Galore. This step produces clean FASTQ files that are ready for mapping.

Step 1: We'll create a directory to store the trimmed FASTQ files.

In [None]:
!mkdir -p trimmed_fastq_files
!ls -F

Now, we will activate the `workshop_ngs` conda environment and run `Trim Galore!` on all the FASTQ files. We'll specify that they are paired-end reads and direct the output to the `trimmed_fastq_files` directory.

Trim Galore automatically detects and removes common adapter sequences and performs quality trimming. The `--paired` option is essential for correctly processing paired-end reads.

In [None]:
%%time
# Run Trim Galore! within the activated conda environment
# We iterate through the samples to process paired-end reads correctly

import os

# Get a list of fastq files programmatically
all_files = os.listdir('fastq_files/')
fastq_files_list = [os.path.join('fastq_files', f) for f in all_files if f.endswith('.fq.gz')]

# Extract unique sample prefixes (e.g., BEN_CI16, BEN_NW10, BEN_SI18) from the filenames
sample_prefixes = sorted(list(set([f.split('/')[-1].split('_sub_')[0] for f in fastq_files_list])))

print(f"Detected FASTQ files: {fastq_files_list}")
print(f"Detected sample prefixes: {sample_prefixes}")

for prefix in sample_prefixes:
    read1 = f"fastq_files/{prefix}_sub_1.fq.gz"
    read2 = f"fastq_files/{prefix}_sub_2.fq.gz"
    output_dir = "trimmed_fastq_files"

    # Use bash -c to activate conda env and run trim_galore
    # --paired: for paired-end reads
    # -o: specify output directory
    # --fastqc is not passed, so no FastQC html/zip files are generated for the trimmed reads
    !bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && \
              trim_galore --paired {read1} {read2} -o {output_dir}"

    # Remove unwanted output files for the current prefix
    !rm -f {output_dir}/{prefix}_sub_*.fq.gz_trimming_report.txt
    !rm -f {output_dir}/{prefix}_sub_*_fastqc.html
    !rm -f {output_dir}/{prefix}_sub_*_fastqc.zip

# List the contents of the trimmed_fastq_files directory to confirm only _val files remain
!ls -F trimmed_fastq_files/
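
The loop above assumes that every sample prefix has both a `_sub_1` and a `_sub_2` file. The sketch below shows one way to check that assumption before trimming. The helper `pair_fastq_files` is hypothetical and only understands the raw file naming used here (`PREFIX_sub_1.fq.gz` / `PREFIX_sub_2.fq.gz`), not the `_val_` names of the trimmed output.

```python
import os
from collections import defaultdict

def pair_fastq_files(filenames, tag="_sub_"):
    """Group raw FASTQ names by sample prefix and report samples missing a mate.

    Returns (complete, incomplete): a dict {prefix: (read1, read2)} and a sorted
    list of prefixes for which only one mate was found.
    """
    mates = defaultdict(dict)
    for name in filenames:
        base = os.path.basename(name)
        if not base.endswith(".fq.gz") or tag not in base:
            continue
        prefix, rest = base.split(tag, 1)
        mate = rest.split(".")[0]  # "1" or "2" for raw files
        mates[prefix][mate] = name
    complete = {
        prefix: (found["1"], found["2"])
        for prefix, found in mates.items()
        if set(found) >= {"1", "2"}
    }
    incomplete = sorted(prefix for prefix in mates if prefix not in complete)
    return complete, incomplete

# Example with made-up file names: sample B has no second mate
complete, incomplete = pair_fastq_files(
    ["fastq_files/A_sub_1.fq.gz", "fastq_files/A_sub_2.fq.gz", "fastq_files/B_sub_1.fq.gz"]
)
print("complete pairs:", complete)
print("missing a mate:", incomplete)
```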

### Downloading reference files for mapping

To prepare for the read mapping step, we need to download the reference genome and its associated index files. These files will be stored in a new directory called `reference`.

In [None]:
# Create the reference directory
!mkdir -p reference

# List of reference file URLs to download
reference_links = [
    "https://zenodo.org/records/14258052/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna",
    "https://zenodo.org/records/14258082/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.amb",
    "https://zenodo.org/records/14258082/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.ann",
    "https://zenodo.org/records/14258082/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.bwt",
    "https://zenodo.org/records/14258082/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.pac",
    "https://zenodo.org/records/14258082/files/GCA_021130815.1_PanTigT.MC.v3_genomic.fna.sa"
]

# Remove any existing files in the reference directory to ensure a clean re-download
!rm -f reference/*

# Download each file into the reference directory, explicitly naming the output file
import os
for link in reference_links:
    filename = os.path.basename(link)
    !bash -c "wget -q {link} -O reference/{filename}"

# Verify the downloaded files
!ls -F reference/
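
Since `wget -q` stays silent on errors and `-O` can leave an empty file behind when a download fails, it may be worth checking that the FASTA and all five BWA index files are present and non-empty before mapping. A minimal sketch, with the hypothetical helper name `missing_index_files`:

```python
import os

# File extensions that make up a BWA index, matching the files downloaded above
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")

def missing_index_files(fasta_path):
    """Return the FASTA/index paths that are absent or empty (empty list = all good)."""
    candidates = [fasta_path] + [fasta_path + suffix for suffix in BWA_INDEX_SUFFIXES]
    return [
        path
        for path in candidates
        if not os.path.isfile(path) or os.path.getsize(path) == 0
    ]

problems = missing_index_files("reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna")
print("Missing or empty files:", problems if problems else "none")
```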

### Mapping reads to the reference genome

Here we:

- Use an aligner like BWA or Bowtie2 to map the cleaned reads to a reference genome
- Convert output from SAM to BAM format (compressed and binary)
- Sort the BAM files by genomic coordinates (using `samtools sort`)
- Index the BAM files so tools can access them efficiently

*Why it's important:* Proper alignment is the foundation for all downstream analysis. Sorting/indexing ensures quick and efficient variant detection.



First, create a new directory named `mapped_reads` to store the output alignment files. `bwa mem` needs a BWA index of the reference genome at `reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna`; because the index files (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`) were downloaded into the same folder above, running `bwa index` again is not required here. We then map the trimmed paired-end reads from each sample with `bwa mem`, convert the output to sorted BAM files, and index them using `samtools`.

In [None]:
!mkdir -p mapped_reads
!ls -F

In [None]:
!ls reference/


In [None]:
!bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && bwa"



In [None]:
%%time

# Path to reference FASTA (the BWA index files are in the same folder)
REFERENCE_GENOME = "reference/GCA_021130815.1_PanTigT.MC.v3_genomic.fna"

# sample_prefixes was defined in the trimming cell above
for prefix in sample_prefixes:

    read1 = f"trimmed_fastq_files/{prefix}_sub_1_val_1.fq.gz"
    read2 = f"trimmed_fastq_files/{prefix}_sub_2_val_2.fq.gz"
    output_bam = f"mapped_reads/{prefix}.bam"

    print(f"\n--- Mapping {prefix} ---")

    # BWA mem → BAM → sorted BAM, streamed through pipes so no intermediate SAM file is written
    !bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && \
              bwa mem {REFERENCE_GENOME} {read1} {read2} | \
              samtools view -bS - | \
              samtools sort -o {output_bam} -"

    # Index BAM
    print(f"Indexing BAM for {prefix}...")
    !bash -c "source /usr/local/miniconda/etc/profile.d/conda.sh && conda activate workshop_ngs && samtools index {output_bam}"

# List the sorted BAM files and their indexes
!ls -F mapped_reads/
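
Once the BAM files are sorted and indexed, one simple sanity check is the fraction of reads that mapped. The sketch below parses the text printed by `samtools flagstat`; it is an illustration rather than part of the original workflow, the function name `mapped_fraction` is hypothetical, and the example numbers are made up. Note that the counts reported by flagstat can include secondary and supplementary alignments.

```python
import re

def mapped_fraction(flagstat_text):
    """Return mapped / total from the text output of `samtools flagstat`."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        # Lines look like "1000 + 0 in total (...)" and "950 + 0 mapped (95.00% : N/A)"
        match = re.match(r"(\d+) \+ (\d+) (in total|mapped) ", line)
        if match:
            count = int(match.group(1))  # QC-passed reads
            if match.group(3) == "in total":
                total = count
            else:
                mapped = count
    if not total or mapped is None:
        raise ValueError("could not find 'in total' and 'mapped' lines")
    return mapped / total

# Made-up flagstat output, for illustration only
example = """1000 + 0 in total (QC-passed reads + QC-failed reads)
980 + 0 primary
950 + 0 mapped (95.00% : N/A)
930 + 0 primary mapped (94.90% : N/A)
"""
print(f"Mapped fraction: {mapped_fraction(example):.2%}")
```

To get the real text for a sample, run `samtools flagstat mapped_reads/<prefix>.bam` inside the `workshop_ngs` environment and pass its output to the function.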