This project implements a reproducible Clinical Targeted Sequencing Pipeline for identifying antibiotic resistance variants in E. coli. Unlike whole-genome approaches, this pipeline integrates Bedtools to filter specific genomic regions of interest, mimicking real-world gene panel assays used in precision medicine.
The workflow automates the path from raw FASTQ data to annotated clinical insights, narrowing 34,000+ raw variants to 48 high-confidence variants within the target panel.
- Workflow Management: Snakemake
- QC & Trimming: FastQC, Trimmomatic
- Alignment: BWA-MEM, Samtools
- Target Enrichment: Bedtools (In-silico panel filtering)
- Variant Calling: Bcftools (mpileup/call)
- Annotation: snpEff (Functional prediction)
- Visualization: R (vcfR, tidyverse (ggplot2, dplyr, tidyr), gridExtra)
To ensure reproducibility, this pipeline uses publicly available sequencing data from the European Nucleotide Archive (ENA).

- Sample: Escherichia coli B strain
- Accession Number: SRR2584863 (Illumina MiSeq paired-end sequencing)
- Reference Genome: E. coli K-12 MG1655 (Accession: NC_000913.3)
graph TD;
A[Raw FASTQ] -->|FastQC & Trimmomatic| B[Clean Reads];
B -->|BWA MEM| C[Aligned BAM];
C -->|Bedtools Intersect| D[Targeted BAM];
D -->|Bcftools Call| E[Raw VCF];
E -->|Filter QUAL>20| F[High-Quality VCF];
F -->|snpEff| G[Annotated Clinical VCF];
G -->|Custom R Script| H[Clinical Report PNG];
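The `Filter QUAL>20` step in the graph can be illustrated in plain Python. This is a minimal sketch, not the pipeline's actual rule: the function name `filter_by_qual` and the example records are hypothetical, and the only assumption is the standard VCF layout in which QUAL is the sixth tab-separated column.

```python
def filter_by_qual(vcf_lines, min_qual=20.0):
    """Keep header lines and records whose QUAL is strictly above min_qual."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header and meta lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
        if qual != "." and float(qual) > min_qual:
            kept.append(line)
    return kept


# Made-up records for illustration only, not pipeline output.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_000913.3\t100\t.\tT\tC\t225.0\t.\tDP=75",
    "NC_000913.3\t200\t.\tG\tA\t12.5\t.\tDP=4",
    "NC_000913.3\t300\t.\tA\tG\t.\t.\tDP=9",
]

for record in filter_by_qual(example_vcf):
    print(record)
```

Records with a missing QUAL (`.`) are dropped here; whether that matches the behavior wanted for a clinical panel is a design choice left to the reader.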
Applying the targets.bed filter (simulating a resistance gene panel) removed calls outside the panel, cutting the call set from ~34,044 whole-genome variants to 48 variants in the clinically relevant target regions.
| Metric | Whole Genome | Targeted Panel |
|---|---|---|
| Total Variants | ~34,044 | 48 |
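The interval logic behind panel filtering can be sketched in a few lines of Python. The pipeline itself uses Bedtools on the alignments; this sketch applies the same idea to variant positions purely as an illustration. The function names and the example region are hypothetical. It assumes the usual conventions that BED intervals are 0-based and half-open while VCF positions are 1-based.

```python
def parse_bed(lines):
    """Read BED lines into {chrom: [(start, end), ...]} using 0-based half-open intervals."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_panel(chrom, pos, regions):
    """Return True if a 1-based position falls inside any panel interval."""
    zero_based = pos - 1  # convert VCF-style coordinate to BED-style
    return any(start <= zero_based < end for start, end in regions.get(chrom, []))


# Hypothetical panel region, chosen only for illustration.
panel = parse_bed(["NC_000913.3\t34000\t36000"])

# Hypothetical variant positions.
variants = [("NC_000913.3", 34944), ("NC_000913.3", 1000000)]
on_target = [v for v in variants if in_panel(v[0], v[1], panel)]
print(on_target)
```

The off-by-one conversion matters at region edges: with the interval above, position 34001 is the first base inside the panel and position 36000 is the last.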
The genomic landscape plot below demonstrates robust coverage across the target regions.
- Mean Depth: ~75X (High confidence for variant detection).
- Mapping Quality: Consistently at 60 (Unique mapping).
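A mean depth figure like the one above can be computed from a per-base depth table. This is a minimal sketch, assuming three tab-separated columns (chromosome, position, depth) as written by `samtools depth`; the function name `mean_depth` and the sample values are hypothetical. Because `samtools depth` omits zero-coverage positions unless `-a` is given, the optional `target_length` argument divides by the panel size rather than by the number of reported positions.

```python
def mean_depth(depth_lines, target_length=None):
    """Average the depth column; divide by target_length if given, else by reported positions."""
    total = 0
    reported = 0
    for line in depth_lines:
        if not line.strip():
            continue
        _chrom, _pos, depth = line.rstrip("\n").split("\t")[:3]
        total += int(depth)
        reported += 1
    denominator = target_length if target_length is not None else reported
    if denominator == 0:
        raise ValueError("no positions to average over")
    return total / denominator


# Made-up depth rows for illustration only.
rows = [
    "NC_000913.3\t101\t70",
    "NC_000913.3\t102\t80",
    "NC_000913.3\t103\t75",
]

print(mean_depth(rows))                   # averages over the 3 reported positions
print(mean_depth(rows, target_length=4))  # treats a 4th, unreported base as depth 0
```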
Using snpEff and custom R visualization, we identified key functional changes:
- Variant Impact: 4 variants identified as MODERATE impact (Missense), potentially affecting protein function.
- The caiE gene carries a mutation, making it a potential candidate for phenotypic changes.
  - Genomic position: nucleotide 34944 on the chromosome
  - Nucleotide change: T to C (position 428 of the gene)
  - Amino acid change: p.Asn143Ser (asparagine to serine at amino acid position 143)
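The relationship between gene position 428 and amino acid 143 is simple codon arithmetic, worked through below as a sanity check. The helper name `codon_of` is hypothetical. The second half of the sketch relies on an assumption not stated in the source: that caiE lies on the reverse strand, so the genomic T to C change reads as A to G on the coding strand.

```python
# Standard genetic code entries for the two amino acids involved.
ASN_CODONS = {"AAT", "AAC"}
SER_CODONS = {"TCT", "TCC", "TCA", "TCG", "AGT", "AGC"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def codon_of(cds_pos):
    """Map a 1-based coding-sequence position to (codon number, position in codon)."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


codon_number, pos_in_codon = codon_of(428)
print(codon_number, pos_in_codon)  # 143 2

# ASSUMPTION (not from the source): the gene is on the reverse strand,
# so the genomic alleles are complemented to get the coding-strand change.
coding_ref, coding_alt = COMPLEMENT["T"], COMPLEMENT["C"]

for codon in sorted(ASN_CODONS):
    bases = list(codon)
    bases[pos_in_codon - 1] = coding_alt  # apply the change at the affected codon position
    mutated = "".join(bases)
    print(codon, "->", mutated, mutated in SER_CODONS)
```

Under that assumption, both asparagine codons become serine codons when their second base changes from A to G, which is consistent with the reported p.Asn143Ser.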
- Conda
- Linux environment
git clone https://github.com/YOUR_USERNAME/Bioinformatics-Learning-Journey.git
cd Bioinformatics-Learning-Journey
conda create -n bio_pipeline -c bioconda -c conda-forge snakemake bwa samtools bcftools fastqc trimmomatic snpeff bedtools r-vcfr r-ggplot2 r-dplyr r-tidyr r-gridextra
conda activate bio_pipeline
snakemake -c1
.
├── Snakefile # Main workflow definition
├── images/ # Plots used in the README
├── logs/ # Execution logs (Snakemake & Tools)
├── data/ # Raw sequencing data (Input - Git ignored)
│ ├── reads_1.fastq.gz # Forward reads
│ └── reads_2.fastq.gz # Reverse reads
├── scripts/
│ ├── plot_vcf.R # QC visualization script
│ └── snpeff_summary_plot.R # Clinical summary visualization script
├── refs/
│ ├── ecoli_ref.fasta # Reference genome
│ ├── adapters.fa # Adapter sequences used for trimming
│ └── targets.bed # Target panel regions
├── results/ # Output files (mapped/ and trimmed/ are Git ignored)
│ ├── mapped/ # Alignment files
│ ├── qc/ # FastQC quality reports
│ ├── trimmed/ # Cleaned and processed reads
│ ├── variants/ # VCFs (Raw, Filtered, Annotated)
│ └── plots/ # Final PDF reports
├── CHANGELOG.md # Development Log (Step-by-step notes)
└── README.md
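Since `data/` is Git ignored, a fresh clone will not contain the reads. The following is a minimal sketch of a pre-flight check that the inputs named in the layout above are in place before running Snakemake. The script and the function name `missing_inputs` are hypothetical and not part of the repository; the paths are taken from the tree.

```python
from pathlib import Path

# Input paths as listed in the project layout.
REQUIRED_INPUTS = [
    "Snakefile",
    "data/reads_1.fastq.gz",
    "data/reads_2.fastq.gz",
    "refs/ecoli_ref.fasta",
    "refs/adapters.fa",
    "refs/targets.bed",
]


def missing_inputs(root="."):
    """Return the required paths that do not exist as files under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected inputs are present.")
```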
Author: Nguyễn Trường Long

