A reproducible, end-to-end ChIP-seq analysis pipeline for profiling genome-wide binding patterns of p53 and RECQ1 in mammalian cancer cell lines (e.g. MCF-7, U2OS, HeLa).
Built with Snakemake · Conda · hg38 · MACS2 · HOMER · deepTools
Raw FASTQ (own) / GEO download
│
▼
[1] Quality Control FastQC + MultiQC
│
▼
[2] Trimming Trim Galore
│
▼
[3] Alignment Bowtie2 → hg38
│
▼
[4] Post-alignment QC SAMtools flagstat, bamCompare (deepTools)
│
▼
[5] Peak Calling MACS2 (narrow for p53, broad optional)
│
├──────────────────────────────────────┐
▼                                      ▼
[6a] Motif Enrichment                  [6b] Genomic Annotation
HOMER findMotifsGenome                 HOMER annotatePeaks
│                                      │
▼                                      ▼
[7] Signal Tracks deepTools bamCoverage → bigWig (IGV-ready)
│
▼
[8] Differential Binding DiffBind (p53 vs RECQ1 co-occupancy)
│
▼
[9] Summary Report MultiQC + custom Python plots
chipseq_pipeline/
├── README.md
├── Snakefile # Master workflow
├── config/
│ ├── config.yaml # All parameters (samples, paths, genome)
│ └── samples.tsv # Sample sheet (name, SRR/path, antibody, input)
├── scripts/
│ ├── geo_download.sh # Download SRA → FASTQ from GEO
│ ├── run_pipeline.sh # One-command launcher
│ ├── diffbind_analysis.R # DiffBind differential binding script
│ └── plot_summary.py # Python visualisation of peak stats
├── envs/
│ ├── chipseq.yaml # Conda environment (all tools)
│ └── r_diffbind.yaml # Conda environment (R + DiffBind)
├── notebooks/
│ └── results_exploration.ipynb # Jupyter notebook for results
└── .github/
└── workflows/
└── lint.yml # GitHub Actions: validate config on push
git clone https://github.com/YOUR_USERNAME/chipseq_pipeline.git
cd chipseq_pipeline
# Create conda environments
conda env create -f envs/chipseq.yaml
conda env create -f envs/r_diffbind.yaml
conda activate chipseq
# Reference genome + Bowtie2 index (run once)
mkdir -p reference/hg38
wget -P reference/hg38/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip reference/hg38/hg38.fa.gz
# Build Bowtie2 index
bowtie2-build reference/hg38/hg38.fa reference/hg38/hg38 --threads 8
# Download blacklist regions (ENCODE)
wget -P reference/ https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
gunzip reference/hg38-blacklist.v2.bed.gz
Edit config/samples.tsv: either provide SRR accession IDs (GEO) or local FASTQ paths.
See Configuration section below.
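# Download SRA → FASTQ from GEO (only needed for samples with SRR accessions)
bash scripts/geo_download.sh
# Launch the full pipeline
bash scripts/run_pipeline.sh
Or run Snakemake directly with custom threads:
snakemake --cores 16 --use-conda --conda-prefix .snakemake/conda
# Dry run (list the planned jobs without executing them)
snakemake --dry-run --cores 1
Key parameters to set before running: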
| Parameter | Description |
|---|---|
| genome | Path to hg38 Bowtie2 index prefix |
| blacklist | Path to ENCODE hg38 blacklist BED |
| genome_size | MACS2 effective genome size (hs for human) |
| macs2_qvalue | Peak calling q-value cutoff (default: 0.05) |
| outdir | Output directory (default: results/) |
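As a rough illustration of how these parameters fit together, the sketch below fills in the documented defaults and flags missing keys. The helper name `check_config`, the choice of `genome` and `blacklist` as required keys, and `hs` as a fallback for `genome_size` are assumptions made for this example, not part of the pipeline's code. The plain dict stands in for whatever config/config.yaml is parsed into.

```python
# Hypothetical helper for illustration only; not shipped with the pipeline.
DEFAULTS = {
    "genome_size": "hs",     # assumed fallback: MACS2 shortcut for human
    "macs2_qvalue": 0.05,    # documented default
    "outdir": "results/",    # documented default
}
REQUIRED = ("genome", "blacklist")  # assumed to have no sensible default


def check_config(config):
    """Return a copy of config with defaults filled in; raise on bad input."""
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")
    merged = {**DEFAULTS, **config}
    qvalue = float(merged["macs2_qvalue"])
    if not 0 < qvalue < 1:
        raise ValueError(f"macs2_qvalue must be between 0 and 1, got {qvalue}")
    merged["macs2_qvalue"] = qvalue
    return merged


example = check_config({
    "genome": "reference/hg38/hg38",
    "blacklist": "reference/hg38-blacklist.v2.bed",
})
print(example["macs2_qvalue"], example["outdir"])
```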
sample_name antibody geo_accession fastq_path input_name
p53_MCF7_rep1 p53 SRR000001 - input_MCF7_rep1
p53_MCF7_rep2 p53 SRR000002 - input_MCF7_rep2
RECQ1_MCF7_rep1 RECQ1 SRR000003 - input_MCF7_rep1
RECQ1_MCF7_rep2 RECQ1 SRR000004 - input_MCF7_rep2
input_MCF7_rep1 input SRR000005 - -
input_MCF7_rep2 input SRR000006 - -
- Use geo_accession for GEO/SRA data (leave fastq_path as -)
- Use fastq_path for local FASTQ files (leave geo_accession as -)
- input_name links each ChIP sample to its matched IgG/input control
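The rules above can be checked before launching a run. The following is a minimal sketch, assuming the sheet is tab-separated with the five columns shown and uses `-` as the empty placeholder. The function name `validate_samples` is hypothetical, and the "exactly one source per sample" check is one reading of the rules rather than documented pipeline behaviour.

```python
import csv
import io

# Two rows in the layout of config/samples.tsv (tab-separated is an assumption).
SHEET = (
    "sample_name\tantibody\tgeo_accession\tfastq_path\tinput_name\n"
    "p53_MCF7_rep1\tp53\tSRR000001\t-\tinput_MCF7_rep1\n"
    "input_MCF7_rep1\tinput\tSRR000005\t-\t-\n"
)


def validate_samples(handle):
    """Return a list of human-readable problems found in a sample sheet."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    names = {row["sample_name"] for row in rows}
    problems = []
    for row in rows:
        has_geo = row["geo_accession"] != "-"
        has_fastq = row["fastq_path"] != "-"
        if has_geo == has_fastq:  # both set, or neither set
            problems.append(
                f"{row['sample_name']}: set exactly one of geo_accession or fastq_path"
            )
        control = row["input_name"]
        if control != "-" and control not in names:
            problems.append(f"{row['sample_name']}: unknown input_name {control!r}")
    return problems


print(validate_samples(io.StringIO(SHEET)))  # prints []
```

For a real sheet, pass an open file handle instead of the `io.StringIO` wrapper.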
All outputs are written to results/:
results/
├── fastqc/ Raw + trimmed QC reports
├── trimmed/ Trimmed FASTQ files
├── aligned/ Sorted, deduplicated BAM files
├── peaks/ MACS2 narrowPeak files per sample
├── bigwig/ Normalised bigWig tracks (IGV-ready)
├── homer/
│ ├── motifs/ Motif enrichment results per factor
│ └── annotation/ Peak annotation (TSS, intron, intergenic...)
├── diffbind/ DiffBind co-occupancy and differential analysis
├── plots/ Summary figures (peak counts, FRiP, heatmaps)
└── multiqc_report.html Aggregated QC report
Peak files (results/peaks/*.narrowPeak): BED-like format with peak coordinates, summit position, fold enrichment, and -log10(q-value). Load directly into IGV.
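For downstream scripting, the fields can be pulled out of a narrowPeak record as in the sketch below. It assumes the standard 10-column layout written by MACS2, where column 7 is fold enrichment, column 9 is -log10(q-value), and column 10 is the summit offset from the peak start. The example record and the names `Peak` and `parse_narrowpeak_line` are invented for illustration.

```python
import math
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    fold_enrichment: float
    neg_log10_q: float
    summit: int  # absolute coordinate: start + offset from column 10


def parse_narrowpeak_line(line):
    """Parse one tab-separated, 10-column narrowPeak record."""
    fields = line.rstrip("\n").split("\t")
    start = int(fields[1])
    return Peak(
        chrom=fields[0],
        start=start,
        end=int(fields[2]),
        name=fields[3],
        fold_enrichment=float(fields[6]),
        neg_log10_q=float(fields[8]),
        summit=start + int(fields[9]),
    )


# Invented record, not real data.
record = "chr1\t1000\t1400\tp53_peak_1\t250\t.\t12.5\t30.1\t27.4\t180"
peak = parse_narrowpeak_line(record)
passes = peak.neg_log10_q >= -math.log10(0.05)  # compare on the -log10 scale
print(peak.summit, passes)  # prints 1180 True
```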
bigWig tracks (results/bigwig/*.bw): Input-normalised signal tracks (bamCompare, log2 ratio ChIP/Input). Load in IGV alongside peak files for visual inspection.
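To make the log2 ratio concrete, here is a simplified per-bin calculation. It is a sketch of the general idea only: it scales both libraries to the smaller one and adds a pseudocount of 1 to avoid dividing by zero, and it does not reproduce bamCompare's binning, filtering, or exact scaling options. The function name `log2_ratio` and all counts are made up for the example.

```python
import math


def log2_ratio(chip_counts, input_counts, chip_total, input_total, pseudocount=1.0):
    """Per-bin log2(ChIP/Input) after scaling both libraries to the smaller one."""
    smaller = min(chip_total, input_total)
    chip_scale = smaller / chip_total
    input_scale = smaller / input_total
    return [
        math.log2((c * chip_scale + pseudocount) / (i * input_scale + pseudocount))
        for c, i in zip(chip_counts, input_counts)
    ]


# Three bins; the ChIP library is twice as deep as the input library.
ratios = log2_ratio([30, 6, 0], [7, 3, 1], chip_total=2_000_000, input_total=1_000_000)
print(ratios)  # prints [1.0, 0.0, -1.0]
```

Positive values mark bins enriched in the ChIP sample, values near zero mark background, and negative values mark bins where the input has more signal.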
Motif results (results/homer/motifs/): Known and de novo motif enrichment relative to background sequences (by default, HOMER selects GC-matched genomic regions). Check for the p53 response element half-site (RRRCWWGYYY) in p53 peaks as a positive control.
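As a quick sanity check outside HOMER, the RRRCWWGYYY consensus can be expanded into a regular expression and scanned against a sequence. This is a minimal sketch, not a substitute for HOMER's scoring; the helper names and the test sequence are invented for illustration. Because the consensus is its own reverse complement, scanning one strand is enough.

```python
import re

# IUPAC codes used by the consensus: R = A/G, Y = C/T, W = A/T.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in motif)


HALF_SITE = iupac_to_regex("RRRCWWGYYY")


def find_half_sites(sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead keeps each match zero-width so overlapping sites are reported.
    pattern = re.compile(f"(?=({HALF_SITE}))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Invented sequence with two half-sites separated by a 2 bp spacer.
print(find_half_sites("TTAGACATGCCTAAGGGCTTGTTCTT"))  # prints [2, 14]
```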
Peak annotation (results/homer/annotation/): Each peak assigned to nearest gene feature (promoter ±1 kb, 5'UTR, exon, intron, intergenic). Enables Gene Ontology analysis.
DiffBind output (results/diffbind/): Identifies genomic regions co-occupied by both p53 and RECQ1, or differentially bound. Includes correlation heatmaps and MA plots.
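At its simplest, co-occupancy means a p53 peak and a RECQ1 peak share at least one base. The sketch below shows only that interval-overlap idea on invented coordinates. DiffBind itself builds consensus peak sets, counts reads, and applies statistical testing, none of which is reproduced here. The function name `overlapping_peaks` is hypothetical.

```python
from collections import defaultdict


def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples in half-open BED coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    shared = []
    for chrom, start, end in peaks_a:
        candidates = by_chrom.get(chrom, ())
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            shared.append((chrom, start, end))
    return shared


# Invented coordinates, not real peaks.
p53 = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 150)]
recq1 = [("chr1", 250, 400), ("chr2", 150, 300)]
print(overlapping_peaks(p53, recq1))  # prints [('chr1', 100, 300)]
```

The chr2 peaks touch at position 150 but do not overlap, because BED end coordinates are exclusive.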
These are publicly available ChIP-seq datasets suitable for this pipeline:
| Factor | Cell line | GEO Accession | Reference |
|---|---|---|---|
| p53 | MCF-7 | GSE86222 | ENCODE |
| p53 | U2OS | GSE31462 | Riley et al. |
| RECQ1 | HEK293 | Search GEO: "RECQ1 ChIP-seq" | — |
| Input control | MCF-7 | Paired with above | — |
Tip: Use GEO DataSets and search "p53 ChIP-seq MCF-7" to find matched input controls. Always use the input/IgG from the same experiment.
The pipeline will warn if samples fail these thresholds:
| Metric | Acceptable range |
|---|---|
| Alignment rate | > 80% |
| Duplication rate | < 30% |
| FRiP score | > 0.01 (ideally > 0.05) |
| Number of peaks | > 500 |
FRiP (Fraction of Reads in Peaks) is the key ChIP quality metric. Results are summarised in the MultiQC report.
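The thresholds in the table translate into a few comparisons. The sketch below computes FRiP as reads in peaks divided by total mapped reads and lists which metrics fall outside the acceptable range. The function names, the metric keys, and the example numbers are assumptions for illustration; the pipeline's own warning logic may differ.

```python
def frip(reads_in_peaks, total_mapped_reads):
    """Fraction of Reads in Peaks."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peaks / total_mapped_reads


def qc_warnings(metrics):
    """Return a warning for each metric outside the acceptable range in the table."""
    warnings = []
    if not metrics["alignment_rate"] > 0.80:
        warnings.append(f"alignment rate {metrics['alignment_rate']:.0%} is not above 80%")
    if not metrics["duplication_rate"] < 0.30:
        warnings.append(f"duplication rate {metrics['duplication_rate']:.0%} is not below 30%")
    if not metrics["frip"] > 0.01:
        warnings.append(f"FRiP {metrics['frip']:.3f} is not above 0.01")
    if not metrics["n_peaks"] > 500:
        warnings.append(f"only {metrics['n_peaks']} peaks called (expected more than 500)")
    return warnings


# Invented values for one sample.
sample = {
    "alignment_rate": 0.92,
    "duplication_rate": 0.35,
    "frip": frip(45_000, 1_500_000),
    "n_peaks": 4200,
}
for message in qc_warnings(sample):
    print("WARNING:", message)
```

With these numbers only the duplication rate triggers a warning; the FRiP of 0.03 passes the minimum but sits below the 0.05 ideal.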
All dependencies are managed via Conda (see envs/). Key tools:
- FastQC / MultiQC — QC
- Trim Galore — adapter trimming
- Bowtie2 — alignment to hg38
- SAMtools — BAM processing
- Picard MarkDuplicates — deduplication
- MACS2 — peak calling
- HOMER — motif enrichment + annotation
- deepTools — bigWig generation + QC plots
- DiffBind (R) — differential binding
- Python (Pandas, NumPy, Matplotlib, Seaborn) — summary plots
If you use this pipeline, please cite the underlying tools:
- Bowtie2: Langmead & Salzberg, Nat Methods 2012
- MACS2: Zhang et al., Genome Biol 2008
- HOMER: Heinz et al., Mol Cell 2010
- deepTools: Ramírez et al., Nucleic Acids Res 2016
- DiffBind: Ross-Innes et al., Nature 2012
Surasree Chakraborty (Pal) — PhD, IIT Kharagpur
linkedin.com/in/surasree-pal