ChIP-seq Pipeline: p53 & RECQ1 Genome-Wide Binding in Cancer Cells

A reproducible, end-to-end ChIP-seq analysis pipeline for profiling genome-wide binding patterns of p53 and RECQ1 in mammalian cancer cell lines (e.g. MCF-7, U2OS, HeLa).

Built with Snakemake · Conda · hg38 · MACS2 · HOMER · deepTools

Pipeline Overview

Raw FASTQ (own) / GEO download
        │
        ▼
  [1] Quality Control        FastQC + MultiQC
        │
        ▼
  [2] Trimming               Trim Galore
        │
        ▼
  [3] Alignment              Bowtie2 → hg38
        │
        ▼
  [4] Post-alignment QC      SAMtools flagstat, bamCompare (deepTools)
        │
        ▼
  [5] Peak Calling           MACS2 (narrow for p53, broad optional)
        │
        ├──────────────────────────────────────┐
        ▼                                      ▼
  [6a] Motif Enrichment                  [6b] Genomic Annotation
        HOMER findMotifsGenome                 HOMER annotatePeaks
        │                                      │
        ▼                                      ▼
  [7] Signal Tracks          deepTools bamCoverage → bigWig (IGV-ready)
        │
        ▼
  [8] Differential Binding   DiffBind (p53 vs RECQ1 co-occupancy)
        │
        ▼
  [9] Summary Report         MultiQC + custom Python plots

Directory Structure

chipseq_pipeline/
├── README.md
├── Snakefile                  # Master workflow
├── config/
│   ├── config.yaml            # All parameters (samples, paths, genome)
│   └── samples.tsv            # Sample sheet (name, SRR/path, antibody, input)
├── scripts/
│   ├── geo_download.sh        # Download SRA → FASTQ from GEO
│   ├── run_pipeline.sh        # One-command launcher
│   ├── diffbind_analysis.R    # DiffBind differential binding script
│   └── plot_summary.py        # Python visualisation of peak stats
├── envs/
│   ├── chipseq.yaml           # Conda environment (all tools)
│   └── r_diffbind.yaml        # Conda environment (R + DiffBind)
├── notebooks/
│   └── results_exploration.ipynb  # Jupyter notebook for results
└── .github/
    └── workflows/
        └── lint.yml           # GitHub Actions: validate config on push

Quick Start

1. Clone and set up environment

git clone https://github.com/YOUR_USERNAME/chipseq_pipeline.git
cd chipseq_pipeline

# Create conda environments
conda env create -f envs/chipseq.yaml
conda env create -f envs/r_diffbind.yaml
conda activate chipseq

2. Download reference genome (hg38)

# Reference genome + Bowtie2 index (run once)
mkdir -p reference/hg38
wget -P reference/hg38/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip reference/hg38/hg38.fa.gz

# Build Bowtie2 index
bowtie2-build --threads 8 reference/hg38/hg38.fa reference/hg38/hg38

# Download blacklist regions (ENCODE)
wget -P reference/ https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
gunzip reference/hg38-blacklist.v2.bed.gz

3. Configure your samples

Edit config/samples.tsv — either provide SRR accession IDs (GEO) or local FASTQ paths.

See Configuration section below.

4. Download GEO data (if needed)

bash scripts/geo_download.sh

5. Run the full pipeline

bash scripts/run_pipeline.sh

Or run Snakemake directly with a custom number of cores:

snakemake --cores 16 --use-conda --conda-prefix .snakemake/conda

Dry run (check workflow without executing)

snakemake --dry-run --cores 1

Configuration

`config/config.yaml`

Key parameters to set before running:

Parameter	Description
`genome`	Path to hg38 Bowtie2 index prefix
`blacklist`	Path to ENCODE hg38 blacklist BED
`genome_size`	MACS2 effective genome size (`hs` for human)
`macs2_qvalue`	Peak calling q-value cutoff (default: 0.05)
`outdir`	Output directory (default: `results/`)

`config/samples.tsv`

sample_name    antibody    geo_accession    fastq_path    input_name
p53_MCF7_rep1  p53         SRR000001        -             input_MCF7_rep1
p53_MCF7_rep2  p53         SRR000002        -             input_MCF7_rep2
RECQ1_MCF7_rep1 RECQ1      SRR000003        -             input_MCF7_rep1
RECQ1_MCF7_rep2 RECQ1      SRR000004        -             input_MCF7_rep2
input_MCF7_rep1 input      SRR000005        -             -
input_MCF7_rep2 input      SRR000006        -             -

Use geo_accession for GEO/SRA data (leave fastq_path as -)
Use fastq_path for local FASTQ files (leave geo_accession as -)
input_name links each ChIP sample to its matched IgG/input control
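
The rules above can be checked before launching the workflow. The following is a minimal sketch, not part of the pipeline itself: the function names `load_samples` and `validate_samples` are hypothetical, and it assumes that fields contain no spaces and that `-` marks an unused field, as in the example sheet.

```python
import io

REQUIRED_COLUMNS = ["sample_name", "antibody", "geo_accession", "fastq_path", "input_name"]


def load_samples(handle):
    """Parse a tab- or space-delimited sample sheet into a list of dicts.

    Assumption: no field contains whitespace (paths with spaces would break).
    """
    rows, header = [], None
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if header is None:
            header = fields
            missing = [c for c in REQUIRED_COLUMNS if c not in header]
            if missing:
                raise ValueError(f"missing columns: {missing}")
            continue
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}: {line!r}")
        rows.append(dict(zip(header, fields)))
    return rows


def validate_samples(rows):
    """Return a list of problems; an empty list means the sheet is consistent."""
    problems = []
    names = {r["sample_name"] for r in rows}
    for r in rows:
        has_geo = r["geo_accession"] != "-"
        has_fastq = r["fastq_path"] != "-"
        if has_geo == has_fastq:
            problems.append(f"{r['sample_name']}: set exactly one of geo_accession or fastq_path")
        # ChIP samples must point at a control that exists in the sheet.
        if r["antibody"] != "input" and r["input_name"] not in names:
            problems.append(f"{r['sample_name']}: unknown input_name {r['input_name']!r}")
    return problems


example = io.StringIO(
    "sample_name\tantibody\tgeo_accession\tfastq_path\tinput_name\n"
    "p53_MCF7_rep1\tp53\tSRR000001\t-\tinput_MCF7_rep1\n"
    "input_MCF7_rep1\tinput\tSRR000005\t-\t-\n"
)
print(validate_samples(load_samples(example)))
```

Returning a list of problems rather than raising on the first one lets all sheet errors be reported in a single pass.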

Outputs

All outputs are written to results/:

results/
├── fastqc/              Raw + trimmed QC reports
├── trimmed/             Trimmed FASTQ files
├── aligned/             Sorted, deduplicated BAM files
├── peaks/               MACS2 narrowPeak files per sample
├── bigwig/              Normalised bigWig tracks (IGV-ready)
├── homer/
│   ├── motifs/          Motif enrichment results per factor
│   └── annotation/      Peak annotation (TSS, intron, intergenic...)
├── diffbind/            DiffBind co-occupancy and differential analysis
├── plots/               Summary figures (peak counts, FRiP, heatmaps)
└── multiqc_report.html  Aggregated QC report

Key outputs explained

Peak files (results/peaks/*.narrowPeak): BED-like format with peak coordinates, summit position, fold enrichment, and -log10(q-value). Load directly into IGV.
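
For downstream scripting, a narrowPeak line can be read into named fields. This is a sketch under the assumption that the files follow the standard 10-column MACS2 narrowPeak layout (column 10 being the summit offset from the peak start); the names `NarrowPeak`, `parse_narrowpeak_line` and `passes_qvalue` are illustrative and not part of the pipeline.

```python
import math
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int               # 0-based start (BED convention)
    end: int                 # exclusive end
    name: str
    score: int
    strand: str
    fold_enrichment: float   # column 7
    neg_log10_pvalue: float  # column 8
    neg_log10_qvalue: float  # column 9
    summit_offset: int       # column 10, relative to start

    @property
    def summit(self):
        """Absolute 0-based summit coordinate."""
        return self.start + self.summit_offset


def parse_narrowpeak_line(line):
    """Convert one tab-separated narrowPeak line into a NarrowPeak."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 10:
        raise ValueError(f"expected 10 columns, got {len(f)}")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))


def passes_qvalue(peak, q_cutoff=0.05):
    """True if the peak's q-value is at or below the cutoff.

    The file stores -log10(q), so the cutoff is converted to the same scale.
    """
    return peak.neg_log10_qvalue >= -math.log10(q_cutoff)


peak = parse_narrowpeak_line("chr1\t100\t300\tpeak_1\t250\t.\t8.5\t12.3\t9.8\t75\n")
print(peak.chrom, peak.summit, passes_qvalue(peak))
```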

bigWig tracks (results/bigwig/*.bw): Input-normalised signal tracks (bamCompare, log2 ratio ChIP/Input). Load in IGV alongside peak files for visual inspection.
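
The arithmetic behind a log2 ChIP/Input track can be illustrated per bin. This sketch is not bamCompare's implementation: it assumes simple read-depth scaling of the deeper library down to the shallower one and a pseudocount of 1 to avoid division by zero, and the function name `log2_ratio_track` is hypothetical.

```python
import math


def log2_ratio_track(chip_counts, input_counts, chip_total, input_total, pseudocount=1.0):
    """Per-bin log2(ChIP/Input) after scaling both libraries to the smaller depth."""
    if len(chip_counts) != len(input_counts):
        raise ValueError("ChIP and input must have the same number of bins")
    if chip_total <= 0 or input_total <= 0:
        raise ValueError("library sizes must be positive")
    smaller = min(chip_total, input_total)
    chip_scale = smaller / chip_total
    input_scale = smaller / input_total
    return [
        math.log2((c * chip_scale + pseudocount) / (i * input_scale + pseudocount))
        for c, i in zip(chip_counts, input_counts)
    ]


# Two bins: one enriched in ChIP, one empty in both samples.
print(log2_ratio_track([10, 0], [5, 0], chip_total=100, input_total=100))
```

Bins that are empty in both samples come out as 0 rather than undefined, which is the practical reason for the pseudocount.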

Motif results (results/homer/motifs/): Known and de novo motif enrichment relative to shuffled background. As a positive control, check that the p53 response element half-site (RRRCWWGYYY) is enriched in p53 peaks.
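
As a quick sanity check independent of HOMER, the RRRCWWGYYY consensus can be turned into a regular expression and scanned against peak sequences. This is a sketch only: `iupac_to_regex` and `find_sites` are hypothetical helper names, exact matching is far stricter than HOMER's position weight matrices, and only the forward strand is scanned (which is sufficient here because RRRCWWGYYY is its own reverse complement).

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_sites(sequence, motif="RRRCWWGYYY"):
    """Return (0-based position, matched sequence) for every exact match.

    A lookahead is used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


print(find_sites("TTGGGCATGTCCTT"))
```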

Peak annotation (results/homer/annotation/): Each peak is assigned to its nearest gene feature (promoter ±1 kb, 5'UTR, exon, intron, intergenic). The annotated gene lists can then be used for Gene Ontology analysis.

DiffBind output (results/diffbind/): Identifies genomic regions co-occupied by both p53 and RECQ1, or differentially bound. Includes correlation heatmaps and MA plots.
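
The idea of co-occupancy can be illustrated with a plain interval overlap between two peak sets. This is not how DiffBind works (DiffBind builds a consensus peak set and tests read counts statistically); it is a simplified sketch assuming BED-style half-open intervals, with the hypothetical function name `overlapping_peaks`.

```python
from collections import defaultdict


def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates,
    so intervals that merely touch (end == start) do not overlap.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    shared = []
    for chrom, start, end in peaks_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            if b_start >= end:
                break  # sorted by start: no later interval can overlap
            if b_end > start:
                shared.append((chrom, start, end))
                break
    return shared


p53_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
recq1_peaks = [("chr1", 150, 250), ("chr2", 200, 300)]
print(overlapping_peaks(p53_peaks, recq1_peaks))
```

The inner loop is a linear scan per chromosome, which is fine for illustration; for full peak sets an interval tree or a tool such as bedtools would be the usual choice.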

Recommended GEO Datasets

These are publicly available ChIP-seq datasets suitable for this pipeline:

Factor	Cell line	GEO Accession	Reference
p53	MCF-7	GSE86222	ENCODE
p53	U2OS	GSE31462	Riley et al.
RECQ1	HEK293	Search GEO: "RECQ1 ChIP-seq"	—
Input control	MCF-7	Paired with above	—

Tip: Use GEO DataSets and search "p53 ChIP-seq MCF-7" to find matched input controls. Always use the input/IgG from the same experiment.

Quality Control Checkpoints

The pipeline will warn if samples fail these thresholds:

Metric	Acceptable range
Alignment rate	> 80%
Duplication rate	< 30%
FRiP score	> 0.01 (ideally > 0.05)
Number of peaks	> 500

FRiP (Fraction of Reads in Peaks) is the key ChIP quality metric. Results are summarised in the MultiQC report.
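
The thresholds in the table can be expressed as a small check. This sketch only mirrors the table above; how the pipeline itself computes and reports these values is not shown here, and the names `frip`, `qc_warnings` and the metric keys are hypothetical.

```python
def frip(reads_in_peaks, total_mapped_reads):
    """Fraction of Reads in Peaks: mapped reads inside peaks / all mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peaks / total_mapped_reads


def qc_warnings(metrics):
    """Compare one sample's metrics (rates as fractions 0-1) against the thresholds."""
    warnings = []
    if metrics["alignment_rate"] <= 0.80:
        warnings.append("alignment rate is not above 80%")
    if metrics["duplication_rate"] >= 0.30:
        warnings.append("duplication rate is not below 30%")
    if metrics["frip"] <= 0.01:
        warnings.append("FRiP is not above 0.01")
    elif metrics["frip"] <= 0.05:
        warnings.append("FRiP is acceptable but below the ideal 0.05")
    if metrics["n_peaks"] <= 500:
        warnings.append("fewer than the expected 500+ peaks")
    return warnings


sample = {
    "alignment_rate": 0.93,
    "duplication_rate": 0.12,
    "frip": frip(reads_in_peaks=600_000, total_mapped_reads=20_000_000),
    "n_peaks": 4200,
}
for message in qc_warnings(sample):
    print("WARNING:", message)
```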

Dependencies

All managed via Conda (see envs/). Key tools:

FastQC / MultiQC — QC
Trim Galore — adapter trimming
Bowtie2 — alignment to hg38
SAMtools — BAM processing
Picard MarkDuplicates — deduplication
MACS2 — peak calling
HOMER — motif enrichment + annotation
deepTools — bigWig generation + QC plots
DiffBind (R) — differential binding
Python (Pandas, NumPy, Matplotlib, Seaborn) — summary plots

Citation

If you use this pipeline, please cite the underlying tools:

Bowtie2: Langmead & Salzberg, Nat Methods 2012
MACS2: Zhang et al., Genome Biol 2008
HOMER: Heinz et al., Mol Cell 2010
deepTools: Ramírez et al., Nucleic Acids Res 2016
DiffBind: Ross-Innes et al., Nature 2012

Author

Surasree Chakraborty (Pal) — PhD, IIT Kharagpur
linkedin.com/in/surasree-pal
