🧬 Automated Targeted Sequencing Analysis Pipeline

📌 Project Overview

This project implements a reproducible Clinical Targeted Sequencing Pipeline for identifying antibiotic resistance variants in E. coli. Unlike whole-genome approaches, this pipeline integrates Bedtools to filter specific genomic regions of interest, mimicking real-world gene panel assays used in precision medicine.

The workflow automates the journey from raw FASTQ data to annotated clinical insights, reducing 34,000+ raw variants down to 48 high-confidence functional variants.

🛠️ Tech Stack & Tools

Workflow Management: Snakemake
QC & Trimming: FastQC, Trimmomatic
Alignment: BWA-MEM, Samtools
Target Enrichment: Bedtools (In-silico panel filtering)
Variant Calling: Bcftools (mpileup/call)
Annotation: snpEff (Functional prediction)
Visualization: R (vcfR, tidyverse (ggplot2, dplyr, tidyr), gridExtra)

🧬 Data Source

To ensure reproducibility, this pipeline uses publicly available sequencing data from the European Nucleotide Archive (ENA).

* Sample: Escherichia coli B strain.
* Accession Number: SRR2584863 (Illumina MiSeq paired-end sequencing).
* Reference Genome: E. coli K-12 MG1655 (Accession: NC_000913.3).

🔄 Pipeline Workflow

graph TD;
    A[Raw FASTQ] -->|FastQC & Trimmomatic| B[Clean Reads];
    B -->|BWA MEM| C[Aligned BAM];
    C -->|Bedtools Intersect| D[Targeted BAM];
    D -->|Bcftools Call| E[Raw VCF];
    E -->|Filter QUAL>20| F[High-Quality VCF];
    F -->|snpEff| G[Annotated Clinical VCF];
    G -->|Custom R Script| H[Clinical Report PNG];
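
The QUAL>20 step in the graph can be illustrated in a few lines of Python. This is only a sketch of the filtering logic, not the pipeline's own rule (which presumably relies on Bcftools); the function name `filter_vcf_by_qual` and the two example records are hypothetical.

```python
def filter_vcf_by_qual(lines, min_qual=20.0):
    """Yield VCF header lines unchanged, plus records with QUAL > min_qual."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the 6th column of a VCF record
        if qual != "." and float(qual) > min_qual:
            yield line


# Illustrative records only; positions and qualities are made up.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "NC_000913.3\t100\t.\tA\tG\t55.0\t.\tDP=70\n",
    "NC_000913.3\t200\t.\tC\tT\t12.3\t.\tDP=8\n",
]
kept = list(filter_vcf_by_qual(example))
```

Records with a missing QUAL (`.`) are dropped here; whether that matches the pipeline's behaviour is an assumption.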

📊 Key Results & Analysis

1. Targeted Filtering Efficiency

Applying the targets.bed filter (simulating a resistance gene panel) restricts the analysis to clinically relevant regions, cutting the call set from roughly 34,000 genome-wide variants to 48 within the panel.

Metric	Whole Genome	Targeted Panel
Total Variants	~34,044	48
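
The coordinate test behind a BED-based panel filter can be sketched as follows. In the pipeline this filtering is done on alignments with Bedtools before variant calling; the Python below only shows the same interval logic applied to a single position. The helper names `load_bed` and `in_targets`, and the example interval, are hypothetical and not taken from targets.bed.

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    targets = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        targets.setdefault(chrom, []).append((int(start), int(end)))
    return targets


def in_targets(chrom, pos, targets):
    """Return True if a 1-based position (as in VCF POS) lies in any target."""
    # BED start is 0-based and end is exclusive, so 1-based pos must satisfy
    # start < pos <= end.
    return any(start < pos <= end for start, end in targets.get(chrom, []))


# Hypothetical panel region, for illustration only.
targets = load_bed(["NC_000913.3\t34000\t36000\texample_region\n"])
inside = in_targets("NC_000913.3", 34944, targets)
outside = in_targets("NC_000913.3", 36001, targets)
```

The off-by-one between BED (0-based, half-open) and VCF (1-based) coordinates is the detail most worth checking when writing a panel file by hand.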

2. Quality Control (ChromoQC)

The genomic landscape plot below demonstrates robust coverage across the target regions.

Mean Depth: ~75X (High confidence for variant detection).
Mapping Quality: Consistently at 60 (Unique mapping).
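
A mean depth figure like the one above can be derived from per-base depth output. The sketch below assumes three tab-separated columns (chromosome, position, depth), as produced by `samtools depth` for a single BAM; how the ~75X value was actually computed in this project is not stated, so treat `mean_depth` as a hypothetical helper.

```python
def mean_depth(depth_lines):
    """Average the depth column over all listed positions."""
    total = 0
    count = 0
    for line in depth_lines:
        if not line.strip():
            continue
        _chrom, _pos, depth = line.rstrip("\n").split("\t")[:3]
        total += int(depth)
        count += 1
    return total / count if count else 0.0


# Made-up depth values for two adjacent positions.
example_depth = mean_depth([
    "NC_000913.3\t34943\t70\n",
    "NC_000913.3\t34944\t80\n",
])
```

Note that `samtools depth` skips zero-coverage positions unless run with `-a`, which would inflate a mean computed this way.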

3. Biological Annotation (Clinical Insights)

Using snpEff and custom R visualization, we identified key functional changes:

Variant Impact: 4 variants identified as MODERATE impact (Missense), potentially affecting protein function.
The caiE gene carries a mutation, making it a potential candidate for phenotypic changes.
- Nucleotide position 34944 on the chromosome
- T-to-C substitution (position 428 of the gene)
- Amino acid change p.Asn143Ser: asparagine to serine at amino acid position 143
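
The relationship between gene position 428 and amino acid 143 is plain codon arithmetic, shown below together with a small parser for the `p.Asn143Ser` notation. Both `codon_of` and `parse_protein_change` are hypothetical helpers written for this illustration; they handle only simple three-letter missense notation.

```python
import re


def codon_of(cds_pos):
    """Map a 1-based coding position to (codon number, position in codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


def parse_protein_change(hgvs_p):
    """Split a simple missense string such as 'p.Asn143Ser' into its parts."""
    match = re.fullmatch(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", hgvs_p)
    if match is None:
        raise ValueError(f"unsupported protein change: {hgvs_p!r}")
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa


codon, offset = codon_of(428)  # (143, 2)
ref_aa, aa_pos, alt_aa = parse_protein_change("p.Asn143Ser")
```

Position 428 falls on the second base of codon 143, which agrees with the residue number in the snpEff annotation.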

🚀 How to Run

Prerequisites

Conda
Linux environment

Installation

1. Clone the repository:

git clone https://github.com/YOUR_USERNAME/Bioinformatics-Learning-Journey.git
cd Bioinformatics-Learning-Journey

2. Create the environment:

conda create -n bio_pipeline -c bioconda -c conda-forge snakemake bwa samtools bcftools fastqc trimmomatic snpeff bedtools r-vcfr r-ggplot2 r-dplyr r-tidyr r-gridextra
conda activate bio_pipeline

3. Run the pipeline:

snakemake -c1
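
If a run fails early, a quick check that the required executables are on PATH can help. The sketch below is not part of the repository; the executable names in `REQUIRED_TOOLS` are assumptions based on the conda packages listed above and may differ on your system.

```python
import shutil

# Assumed executable names for the packages installed in step 2.
REQUIRED_TOOLS = [
    "snakemake", "bwa", "samtools", "bcftools", "fastqc",
    "trimmomatic", "snpEff", "bedtools", "Rscript",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools found.")
```

Run it inside the activated bio_pipeline environment so that PATH reflects the conda installation.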

📂 Project Structure

.
├── Snakefile                   # Main workflow definition
├── images/                     # Figures used in the README
├── logs/                       # Execution logs (Snakemake & Tools)
├── data/                       # Raw sequencing data (Input - Git ignored)
│   ├── reads_1.fastq.gz        # Forward reads
│   └── reads_2.fastq.gz        # Reverse reads
├── scripts/
│   ├── plot_vcf.R              # QC visualization script
│   └── snpeff_summary_plot.R   # Clinical summary visualization script
├── refs/
│   ├── ecoli_ref.fasta         # Reference genome
│   ├── adapters.fa             # Adapter sequences used for trimming
│   └── targets.bed             # Target panel regions
├── results/                    # Output files (Git ignored mapped/ and trimmed/)
│   ├── mapped/                 # Alignment files
│   ├── qc/                     # FastQC quality reports
│   ├── trimmed/                # Cleaned and processed reads
│   ├── variants/               # VCFs (Raw, Filtered, Annotated)
│   └── plots/                  # Final PDF reports
├── CHANGELOG.md                # Development Log (Step-by-step notes)
└── README.md

Author: Nguyễn Trường Long
