surasree-c/chip-seq

ChIP-seq Pipeline: p53 & RECQ1 Genome-Wide Binding in Cancer Cells

A reproducible, end-to-end ChIP-seq analysis pipeline for profiling genome-wide binding patterns of p53 and RECQ1 in human cancer cell lines (e.g. MCF-7, U2OS, HeLa), aligned against hg38.

Built with Snakemake · Conda · hg38 · MACS2 · HOMER · deepTools


Pipeline Overview

Raw FASTQ (own) / GEO download
        │
        ▼
  [1] Quality Control        FastQC + MultiQC
        │
        ▼
  [2] Trimming               Trim Galore
        │
        ▼
  [3] Alignment              Bowtie2 → hg38
        │
        ▼
  [4] Post-alignment QC      SAMtools flagstat, bamCompare (deepTools)
        │
        ▼
  [5] Peak Calling           MACS2 (narrow for p53, broad optional)
        │
        ├──────────────────────────────────────┐
        ▼                                      ▼
  [6a] Motif Enrichment     [6b] Genomic Annotation
        HOMER findMotifsGenome                 HOMER annotatePeaks
        │                                      │
        ▼                                      ▼
  [7] Signal Tracks          deepTools bamCoverage → bigWig (IGV-ready)
        │
        ▼
  [8] Differential Binding   DiffBind (p53 vs RECQ1 co-occupancy)
        │
        ▼
  [9] Summary Report         MultiQC + custom Python plots

Directory Structure

chipseq_pipeline/
├── README.md
├── Snakefile                  # Master workflow
├── config/
│   ├── config.yaml            # All parameters (samples, paths, genome)
│   └── samples.tsv            # Sample sheet (name, SRR/path, antibody, input)
├── scripts/
│   ├── geo_download.sh        # Download SRA → FASTQ from GEO
│   ├── run_pipeline.sh        # One-command launcher
│   ├── diffbind_analysis.R    # DiffBind differential binding script
│   └── plot_summary.py        # Python visualisation of peak stats
├── envs/
│   ├── chipseq.yaml           # Conda environment (all tools)
│   └── r_diffbind.yaml        # Conda environment (R + DiffBind)
├── notebooks/
│   └── results_exploration.ipynb  # Jupyter notebook for results
└── .github/
    └── workflows/
        └── lint.yml           # GitHub Actions: validate config on push

Quick Start

1. Clone and set up environment

git clone https://github.com/YOUR_USERNAME/chipseq_pipeline.git
cd chipseq_pipeline

# Create conda environments
conda env create -f envs/chipseq.yaml
conda env create -f envs/r_diffbind.yaml
conda activate chipseq

2. Download reference genome (hg38)

# Reference genome + Bowtie2 index (run once)
mkdir -p reference/hg38
wget -P reference/hg38/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip reference/hg38/hg38.fa.gz

# Build Bowtie2 index
bowtie2-build --threads 8 reference/hg38/hg38.fa reference/hg38/hg38

# Download blacklist regions (ENCODE)
wget -P reference/ https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
gunzip reference/hg38-blacklist.v2.bed.gz

3. Configure your samples

Edit config/samples.tsv — either provide SRR accession IDs (GEO) or local FASTQ paths.

See the Configuration section below for the expected columns.

4. Download GEO data (if needed)

bash scripts/geo_download.sh

5. Run the full pipeline

bash scripts/run_pipeline.sh

Or run Snakemake directly with custom threads:

snakemake --cores 16 --use-conda --conda-prefix .snakemake/conda

Dry run (checks the workflow without executing it):

snakemake --dry-run --cores 1

Configuration

config/config.yaml

Key parameters to set before running:

Parameter      Description
genome         Path to hg38 Bowtie2 index prefix
blacklist      Path to ENCODE hg38 blacklist BED
genome_size    MACS2 effective genome size (hs for human)
macs2_qvalue   Peak calling q-value cutoff (default: 0.05)
outdir         Output directory (default: results/)
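As a rough sketch of how these parameters could be checked before a run, the snippet below validates a configuration that has already been loaded into a plain dictionary (for example by a YAML parser). The helper name `check_config` and its error messages are hypothetical; only the parameter names and the two defaults come from the table above.

```python
# Hypothetical helper: the pipeline's own validation logic is not shown in this README.
REQUIRED = ("genome", "blacklist", "genome_size")
DEFAULTS = {"macs2_qvalue": 0.05, "outdir": "results/"}


def check_config(cfg):
    """Return a copy of cfg with defaults filled in; raise on missing or invalid values."""
    missing = [key for key in REQUIRED if not cfg.get(key)]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    merged = {**DEFAULTS, **cfg}
    q = float(merged["macs2_qvalue"])
    if not 0 < q <= 1:
        raise ValueError("macs2_qvalue must be in (0, 1]")
    merged["macs2_qvalue"] = q
    return merged


example = check_config({
    "genome": "reference/hg38/hg38",
    "blacklist": "reference/hg38-blacklist.v2.bed",
    "genome_size": "hs",
})
print(example["macs2_qvalue"], example["outdir"])
```

Failing early on a missing index prefix or blacklist path is cheaper than discovering it partway through alignment.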

config/samples.tsv

sample_name      antibody  geo_accession  fastq_path  input_name
p53_MCF7_rep1    p53       SRR000001      -           input_MCF7_rep1
p53_MCF7_rep2    p53       SRR000002      -           input_MCF7_rep2
RECQ1_MCF7_rep1  RECQ1     SRR000003      -           input_MCF7_rep1
RECQ1_MCF7_rep2  RECQ1     SRR000004      -           input_MCF7_rep2
input_MCF7_rep1  input     SRR000005      -           -
input_MCF7_rep2  input     SRR000006      -           -
  • Use geo_accession for GEO/SRA data (leave fastq_path as -)
  • Use fastq_path for local FASTQ files (leave geo_accession as -)
  • input_name links each ChIP sample to its matched IgG/input control
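The three rules above can be sketched as a small sanity check on the sample sheet. This is an illustration only: it assumes the file is tab-separated with the header shown above, and the function name `validate_samples` is hypothetical rather than part of the pipeline.

```python
import csv
import io

# Two rows in the same layout as config/samples.tsv (placeholder accessions).
SHEET = (
    "sample_name\tantibody\tgeo_accession\tfastq_path\tinput_name\n"
    "p53_MCF7_rep1\tp53\tSRR000001\t-\tinput_MCF7_rep1\n"
    "input_MCF7_rep1\tinput\tSRR000005\t-\t-\n"
)


def validate_samples(handle):
    """Return a list of problems found in a sample sheet (empty list means OK)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    names = {row["sample_name"] for row in rows}
    problems = []
    for row in rows:
        has_geo = row["geo_accession"] != "-"
        has_path = row["fastq_path"] != "-"
        if has_geo == has_path:
            # Both set or both empty: the data source is ambiguous.
            problems.append(f"{row['sample_name']}: set exactly one of geo_accession / fastq_path")
        if row["antibody"] != "input" and row["input_name"] not in names:
            problems.append(f"{row['sample_name']}: unknown control {row['input_name']}")
    return problems


print(validate_samples(io.StringIO(SHEET)))
```

For a real sheet, pass an open file handle for config/samples.tsv instead of the in-memory example.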

Outputs

All outputs are written to results/:

results/
├── fastqc/              Raw + trimmed QC reports
├── trimmed/             Trimmed FASTQ files
├── aligned/             Sorted, deduplicated BAM files
├── peaks/               MACS2 narrowPeak files per sample
├── bigwig/              Normalised bigWig tracks (IGV-ready)
├── homer/
│   ├── motifs/          Motif enrichment results per factor
│   └── annotation/      Peak annotation (TSS, intron, intergenic...)
├── diffbind/            DiffBind co-occupancy and differential analysis
├── plots/               Summary figures (peak counts, FRiP, heatmaps)
└── multiqc_report.html  Aggregated QC report

Key outputs explained

Peak files (results/peaks/*.narrowPeak): BED-like format with peak coordinates, summit position, fold enrichment, and -log10(q-value). Load directly into IGV.
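To make the column layout concrete, here is a minimal sketch that reads narrowPeak records and keeps peaks at q <= 0.05. It relies on the standard 10-column narrowPeak layout (column 7 is fold enrichment, column 9 is -log10 q-value, column 10 is the summit offset from the peak start). The `Peak` type, the function name, and the example record are made up for illustration.

```python
import math
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    fold_enrichment: float
    neg_log10_q: float
    summit: int  # absolute coordinate: start + offset from column 10


def parse_narrowpeak(lines):
    """Yield Peak records from tab-separated narrowPeak lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue
        f = line.rstrip("\n").split("\t")
        start = int(f[1])
        yield Peak(f[0], start, int(f[2]), f[3], float(f[6]), float(f[8]), start + int(f[9]))


# Made-up example record, not real data.
demo = ["chr1\t1000\t1400\tpeak_1\t250\t.\t12.5\t30.1\t27.4\t180\n"]
threshold = -math.log10(0.05)  # q <= 0.05 on the -log10 scale
peaks = [p for p in parse_narrowpeak(demo) if p.neg_log10_q >= threshold]
print(peaks[0].summit, peaks[0].fold_enrichment)
```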

bigWig tracks (results/bigwig/*.bw): Input-normalised signal tracks (bamCompare, log2 ratio ChIP/Input). Load in IGV alongside peak files for visual inspection.

Motif results (results/homer/motifs/): Known and de novo motif enrichment relative to shuffled background. As a positive control, check that p53 peaks are enriched for the p53 response element, which consists of two RRRCWWGYYY half-sites separated by a 0-13 bp spacer.
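For a quick spot check outside HOMER, the RRRCWWGYYY consensus mentioned above can be turned into a regular expression (R = A/G, W = A/T, Y = C/T). This sketch is not how HOMER scores motifs, since HOMER uses position weight matrices, and the function names and test sequence are made up. Because this consensus is its own reverse complement, scanning one strand is sufficient.

```python
import re

# IUPAC codes needed for the consensus; R = purine, Y = pyrimidine, W = A or T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]", "W": "[AT]"}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus)


MOTIF = re.compile(iupac_to_regex("RRRCWWGYYY"))


def find_sites(seq):
    """Return 0-based start positions of non-overlapping consensus matches."""
    return [m.start() for m in MOTIF.finditer(seq.upper())]


# Made-up sequence with two adjacent matches (no spacer between them).
demo = "TTGAACATGTCCAGACATGCCTGTT"
print(find_sites(demo))
```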

Peak annotation (results/homer/annotation/): Each peak assigned to nearest gene feature (promoter ±1 kb, 5'UTR, exon, intron, intergenic). Enables Gene Ontology analysis.

DiffBind output (results/diffbind/): Identifies genomic regions co-occupied by both p53 and RECQ1, or differentially bound. Includes correlation heatmaps and MA plots.
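As a simplified picture of what "co-occupied" means, the sketch below reports p53 peaks that overlap at least one RECQ1 peak by coordinate. This is plain interval overlap, not DiffBind's method: DiffBind works from read counts over consensus peaks and applies statistical testing. The function name and all coordinates are made up.

```python
def overlapping(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap at least one peak in peaks_b.

    Each peak is (chrom, start, end) in half-open BED coordinates.
    """
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    shared = []
    for chrom, start, end in peaks_a:
        candidates = by_chrom.get(chrom, [])
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            shared.append((chrom, start, end))
    return shared


# Made-up coordinates for illustration only.
p53 = [("chr1", 100, 300), ("chr1", 900, 1100), ("chr2", 50, 150)]
recq1 = [("chr1", 250, 400), ("chr2", 150, 220)]
print(overlapping(p53, recq1))
```

Note that the chr2 peaks only touch at position 150, so under half-open coordinates they do not count as overlapping.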


Recommended GEO Datasets

These are publicly available ChIP-seq datasets suitable for this pipeline:

Factor          Cell line   GEO Accession                  Reference
p53             MCF-7       GSE86222                       ENCODE
p53             U2OS        GSE31462                       Riley et al.
RECQ1           HEK293      Search GEO: "RECQ1 ChIP-seq"   -
Input control   MCF-7       Paired with above              -

Tip: Use GEO DataSets and search "p53 ChIP-seq MCF-7" to find matched input controls. Always use the input/IgG from the same experiment.


Quality Control Checkpoints

The pipeline will warn if samples fail these thresholds:

Metric             Acceptable range
Alignment rate     > 80%
Duplication rate   < 30%
FRiP score         > 0.01 (ideally > 0.05)
Number of peaks    > 500

FRiP (Fraction of Reads in Peaks) is the key ChIP quality metric. Results are summarised in the MultiQC report.
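The thresholds above could be applied along these lines. FRiP is computed as reads falling inside called peaks divided by total mapped reads. The function name, argument names, and example numbers are hypothetical; how the pipeline itself issues its warnings is not shown in this README.

```python
# Thresholds are copied from the table above; function and argument names are hypothetical.
def qc_warnings(mapped_reads, reads_in_peaks, alignment_rate, duplication_rate, n_peaks):
    """Return a list of warnings for metrics outside the acceptable range."""
    frip = reads_in_peaks / mapped_reads if mapped_reads else 0.0
    checks = [
        (alignment_rate > 0.80, f"alignment rate {alignment_rate:.0%} is not above 80%"),
        (duplication_rate < 0.30, f"duplication rate {duplication_rate:.0%} is not below 30%"),
        (frip > 0.01, f"FRiP {frip:.4f} is not above 0.01"),
        (n_peaks > 500, f"only {n_peaks} peaks called (expected more than 500)"),
    ]
    return [message for passed, message in checks if not passed]


# Made-up numbers for illustration only.
for warning in qc_warnings(20_000_000, 150_000, 0.93, 0.35, 1200):
    print(warning)
```

In this example the duplication rate and FRiP checks fail, while alignment rate and peak count pass.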


Dependencies

All managed via Conda (see envs/). Key tools:

  • FastQC / MultiQC — QC
  • Trim Galore — adapter trimming
  • Bowtie2 — alignment to hg38
  • SAMtools — BAM processing
  • Picard MarkDuplicates — deduplication
  • MACS2 — peak calling
  • HOMER — motif enrichment + annotation
  • deepTools — bigWig generation + QC plots
  • DiffBind (R) — differential binding
  • Python (Pandas, NumPy, Matplotlib, Seaborn) — summary plots

Citation

If you use this pipeline, please cite the underlying tools:

  • Bowtie2: Langmead & Salzberg, Nat Methods 2012
  • MACS2: Zhang et al., Genome Biol 2008
  • HOMER: Heinz et al., Mol Cell 2010
  • deepTools: Ramírez et al., Nucleic Acids Res 2016
  • DiffBind: Ross-Innes et al., Nature 2012

Author

Surasree Chakraborty (Pal) — PhD, IIT Kharagpur
linkedin.com/in/surasree-pal
