unshortcode/Bioinformatics-Learning-Journey

🧬 Automated Targeted Sequencing Analysis Pipeline

📌 Project Overview

This project implements a reproducible Clinical Targeted Sequencing Pipeline for identifying antibiotic resistance variants in E. coli. Unlike whole-genome approaches, this pipeline integrates Bedtools to filter specific genomic regions of interest, mimicking real-world gene panel assays used in precision medicine.

The workflow automates the journey from raw FASTQ data to annotated clinical insights, narrowing roughly 34,000 genome-wide variants down to 48 high-confidence variants inside the target panel.

🛠️ Tech Stack & Tools

  • Workflow Management: Snakemake
  • QC & Trimming: FastQC, Trimmomatic
  • Alignment: BWA-MEM, Samtools
  • Target Enrichment: Bedtools (In-silico panel filtering)
  • Variant Calling: Bcftools (mpileup/call)
  • Annotation: snpEff (Functional prediction)
  • Visualization: R (vcfR, tidyverse (ggplot2, dplyr, tidyr), gridExtra)

🧬 Data Source

To ensure reproducibility, this pipeline uses publicly available sequencing data from the European Nucleotide Archive (ENA).

  • Sample: Escherichia coli B strain.
  • Accession Number: SRR2584863 (Illumina MiSeq paired-end sequencing).
  • Reference Genome: E. coli K-12 MG1655 (Accession: NC_000913.3).

🔄 Pipeline Workflow

graph TD;
    A[Raw FASTQ] -->|FastQC & Trimmomatic| B[Clean Reads];
    B -->|BWA MEM| C[Aligned BAM];
    C -->|Bedtools Intersect| D[Targeted BAM];
    D -->|Bcftools Call| E[Raw VCF];
    E -->|Filter QUAL>20| F[High-Quality VCF];
    F -->|snpEff| G[Annotated Clinical VCF];
    G -->|Custom R Script| H[Clinical Report PNG];
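The `Filter QUAL>20` step in the graph can be illustrated in plain Python. This is a minimal sketch of the filtering logic only; the pipeline itself presumably delegates this to Bcftools (for example with an expression such as `bcftools view -i 'QUAL>20'`), which is an assumption about the Snakefile. The function name and the example records are hypothetical.

```python
def filter_vcf_by_qual(vcf_lines, min_qual=20.0):
    """Yield header lines unchanged, plus records whose QUAL exceeds min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
        # A QUAL of "." means "missing"; such records are dropped here.
        if qual != "." and float(qual) > min_qual:
            yield line


# Hypothetical records for illustration only (not results from this pipeline).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_000913.3\t100\t.\tA\tG\t12.5\t.\tDP=8",
    "NC_000913.3\t200\t.\tT\tC\t225.0\t.\tDP=70",
    "NC_000913.3\t300\t.\tG\tA\t.\t.\tDP=3",
]

kept = list(filter_vcf_by_qual(example_vcf))
records = [line for line in kept if not line.startswith("#")]
print(len(records))  # 1
```

Only the record at position 200 survives: the first has QUAL below the threshold and the third has no QUAL value at all.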

📊 Key Results & Analysis

1. Targeted Filtering Efficiency

By applying the targets.bed filter (simulating a resistance gene panel), the noise was drastically reduced, focusing only on clinically relevant regions.

| | Whole Genome | Targeted Panel |
|---|---|---|
| Total Variants | ~34,044 | 48 |
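The idea behind the panel filter can be sketched in a few lines of Python. This is not how the pipeline does it (the workflow applies Bedtools to the aligned BAM before variant calling); it only shows the coordinate logic, and in particular that BED intervals are 0-based and half-open while VCF positions are 1-based. The region and the positions below are hypothetical, not the contents of `targets.bed`.

```python
def load_bed(bed_lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping headers."""
    regions = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_targets(chrom, pos, regions):
    """Return True if a 1-based position falls inside any BED region.

    BED is 0-based half-open, so 1-based position p is covered
    when start < p <= end.
    """
    return any(c == chrom and start < pos <= end for c, start, end in regions)


# Hypothetical panel region and variant positions, for illustration only.
regions = load_bed(["NC_000913.3\t34000\t36000"])
positions = [34944, 34000, 36000, 36001, 500000]
flags = [in_targets("NC_000913.3", p, regions) for p in positions]
print(flags)  # [True, False, True, False, False]
```

The two boundary cases (34000 excluded, 36000 included) are the ones most often gotten wrong when mixing BED and VCF coordinates.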

2. Quality Control (ChromoQC)

The genomic landscape plot below demonstrates robust coverage across the target regions.

  • Mean Depth: ~75X (High confidence for variant detection).
  • Mapping Quality: Consistently at 60 (Unique mapping).
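A mean depth figure like the one above can be derived from per-position depth values. The sketch below assumes input in the three-column, tab-separated layout that `samtools depth` prints (chromosome, position, depth); whether the pipeline computes its ~75X figure this way is an assumption, and the example rows are made up.

```python
def mean_depth(depth_lines):
    """Average the depth column of `samtools depth`-style lines."""
    total = 0
    count = 0
    for line in depth_lines:
        if not line.strip():
            continue
        _chrom, _pos, depth = line.split("\t")[:3]
        total += int(depth)
        count += 1
    return total / count if count else 0.0


# Hypothetical depth rows for illustration only.
example_depth = [
    "NC_000913.3\t34940\t70",
    "NC_000913.3\t34941\t75",
    "NC_000913.3\t34942\t80",
]

print(mean_depth(example_depth))  # 75.0
```

Note that the mean only reflects positions present in the input. If zero-coverage positions are omitted from the depth output, the average will be biased upward.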

Genome-wide QC Plot

3. Biological Annotation (Clinical Insights)

Using snpEff and custom R visualization, we identified key functional changes:

  • Variant Impact: 4 variants identified as MODERATE impact (Missense), potentially affecting protein function.

  • The caiE gene carries a mutation, making it a potential candidate for phenotypic changes.

    • Nucleotide 34944 on the chromosome
    • T to C substitution (position 428 of the gene)
    • Amino acid change p.Asn143Ser: at amino acid position 143, Asparagine is replaced by Serine
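As a rough illustration of how such annotations can be read programmatically, the sketch below splits a snpEff `ANN` entry into named fields and checks that coding position 428 falls in codon 143. The field order follows the standard snpEff `ANN` layout; the example INFO string is hand-built from the values reported above with unknown fields left empty, so it is not actual pipeline output. The names `parse_ann` and `codon_number` are hypothetical helpers.

```python
# Field order of a snpEff ANN entry (pipe-separated).
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p", "cdna_pos", "cds_pos", "aa_pos",
    "distance", "errors",
]


def parse_ann(info):
    """Return one dict per annotation found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [dict(zip(ANN_FIELDS, ann.split("|")))
                    for ann in entry[4:].split(",")]
    return []


def codon_number(cds_pos):
    """Map a 1-based coding-sequence position to its 1-based codon number."""
    return (cds_pos - 1) // 3 + 1


# Illustrative INFO value; fields not stated in this README are left empty.
info = ("DP=70;ANN=C|missense_variant|MODERATE|caiE||transcript||"
        "protein_coding|||p.Asn143Ser|||||")

annotations = parse_ann(info)
moderate = [a for a in annotations if a["impact"] == "MODERATE"]
print(moderate[0]["gene_name"], moderate[0]["hgvs_p"])  # caiE p.Asn143Ser
print(codon_number(428))  # 143
```

The codon arithmetic confirms that the reported nucleotide position (428) and amino acid position (143) are consistent with each other.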

snpEff Summary

🚀 How to Run

Prerequisites

  • Conda
  • Linux environment

Installation

1. Clone the repository:

git clone https://github.com/YOUR_USERNAME/Bioinformatics-Learning-Journey.git
cd Bioinformatics-Learning-Journey

2. Create the environment:

conda create -n bio_pipeline -c bioconda -c conda-forge snakemake bwa samtools bcftools fastqc trimmomatic snpeff bedtools r-vcfr r-ggplot2 r-dplyr r-tidyr r-gridextra
conda activate bio_pipeline

3. Run the pipeline:

snakemake -c1
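Before launching the workflow, it can help to confirm that the activated environment actually exposes the tools. The following is a minimal sketch using only the Python standard library; the executable names in `REQUIRED_TOOLS` are assumptions about what the conda packages install (for example `snpEff` and `Rscript`) and may need adjusting.

```python
import shutil

# Assumed executable names for the packages installed above.
REQUIRED_TOOLS = [
    "snakemake", "bwa", "samtools", "bcftools", "fastqc",
    "trimmomatic", "snpEff", "bedtools", "Rscript",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools found.")
```

If everything is found, a dry run (`snakemake -n -c1`) is a low-cost way to preview the planned jobs before executing them.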

📂 Project Structure

.
├── Snakefile                   # Main workflow definition
├── images/                     # Plots used in the README
├── logs/                       # Execution logs (Snakemake & Tools)
├── data/                       # Raw sequencing data (Input - Git ignored)
│   ├── reads_1.fastq.gz        # Forward reads
│   └── reads_2.fastq.gz        # Reverse reads
├── scripts/
│   ├── plot_vcf.R              # QC visualization script
│   └── snpeff_summary_plot.R   # Clinical summary visualization script
├── refs/
│   ├── ecoli_ref.fasta         # Reference genome
│   ├── adapters.fa             # Adapter sequences used for trimming
│   └── targets.bed             # Target panel regions
├── results/                    # Output files (mapped/ and trimmed/ are Git ignored)
│   ├── mapped/                 # Alignment files
│   ├── qc/                     # FastQC quality reports
│   ├── trimmed/                # Cleaned and processed reads
│   ├── variants/               # VCFs (Raw, Filtered, Annotated)
│   └── plots/                  # Final PDF reports
├── CHANGELOG.md                # Development Log (Step-by-step notes)
└── README.md                   
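Because `data/` is Git ignored, a fresh clone will not contain the reads. The sketch below checks that the input files named in the tree above are in place; the helper name `missing_inputs` is hypothetical, and it assumes the Snakefile expects exactly these paths relative to the repository root.

```python
from pathlib import Path

# Input paths taken from the project tree above.
EXPECTED_INPUTS = [
    "data/reads_1.fastq.gz",
    "data/reads_2.fastq.gz",
    "refs/ecoli_ref.fasta",
    "refs/adapters.fa",
    "refs/targets.bed",
]


def missing_inputs(root="."):
    """Return the expected input files that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_INPUTS if not (root / rel).is_file()]


for rel in missing_inputs("."):
    print("Missing input:", rel)
```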

Author: Nguyễn Trường Long
