# ChIP-seq Peak Calling Pipeline - Student Exercise

## Learning Objectives
By completing this notebook, you will:
- Understand how to call peaks from ChIP-seq data
- Learn to compare IP samples against control/input
- Calculate FRiP score (signal-to-noise metric)
- Filter artifacts using ENCODE blacklist
- Interpret peak calling results

## Pipeline Overview
```
IP BAM + Control BAM → Peak Calling → Calculate FRiP → Filter Blacklist → Final Peaks
     (Inputs)            (MACS2)       (Quality)      (Clean)         (Output)
```

## What is Peak Calling?
**Peak calling** identifies genomic regions where your protein of interest binds to DNA by comparing:
- **IP (Immunoprecipitation) sample**: Enriched for protein-bound DNA
- **Control/Input sample**: Background DNA (no enrichment)

Peaks = regions where IP signal significantly exceeds control signal.
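
To make the IP-versus-control comparison concrete, here is a minimal sketch of a depth-normalized fold enrichment for a single genomic window. The function name, the pseudocount, and the example read counts are illustrative assumptions; this is not MACS2's actual model, which is statistical rather than a plain ratio.

```python
def fold_enrichment(ip_count, control_count, ip_total, control_total, pseudocount=1.0):
    """Depth-normalized IP/control ratio for one genomic window (illustrative only)."""
    # Scale the control count to the IP library size so both depths are comparable.
    scale = ip_total / control_total
    expected = control_count * scale
    # The pseudocount (an assumption of this sketch) avoids division by zero.
    return (ip_count + pseudocount) / (expected + pseudocount)


# Hypothetical window: 120 IP reads vs 10 control reads,
# from libraries of 20M (IP) and 10M (control) reads.
ratio = fold_enrichment(120, 10, 20_000_000, 10_000_000)
print(f"Fold enrichment: {ratio:.2f}x")
```

A ratio well above 1 suggests enrichment in this window, but a ratio alone does not tell you whether the difference is statistically significant.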

## Why Use a Control Sample?
- Corrects for sequencing bias
- Accounts for open chromatin regions
- Reduces false positive peaks
- Improves specificity

## Key Metrics
- **FRiP (Fraction of Reads in Peaks)**: Measures signal-to-noise ratio
  - Good ChIP: FRiP > 0.01 (1%)
  - Excellent ChIP: FRiP > 0.05 (5%)
  
## Required Tools
- **MACS2** (2.2.7.1): Peak calling algorithm
- **Samtools** (1.7): BAM file operations
- **Bedtools** (2.29.2): Genomic interval operations

---

**Prerequisites**: You must have already processed both your IP and control samples using the ChIP-seq processing pipeline.

**Instructions**: Follow the cells below and complete sections marked with `# TODO`

## Step 1: Import Python Libraries

In [None]:
import os
import subprocess
from pathlib import Path
from datetime import datetime

print("✓ Libraries imported successfully!")

## Step 2: Define Tool Containers

In [None]:
# Container images
macs2_container = "quay.io/biocontainers/macs2:2.2.7.1--py39hbf8eff0_5"
samtools_container = "quay.io/biocontainers/samtools:1.7--2"
bedtools_container = "quay.io/biocontainers/bedtools:2.29.2--hc088bd4_0"

print("✓ Container images defined")

## Step 3: Set Pipeline Parameters

**📝 TODO: Update these paths to your processed BAM files!**

**Important**: Use the `*_final.bam` files from your processed IP and control samples.

In [None]:
# TODO: Update these paths to your processed BAM files
ip_bam = "/path/to/IP_sample_final.bam"           # IP/treatment BAM file
control_bam = "/path/to/control_sample_final.bam" # Control/input BAM file
basename = "my_chip_experiment"                    # Experiment name
output_dir = "/path/to/output_directory"           # Where to save results

# Reference data
# TODO: Update these paths to your blacklist file and accessible directory
BLACKLIST = "/path/to/GRCh38_unified_blacklist.bed"  # ENCODE blacklist regions
BIND_DIR = "/path/to/accessible_directory/"          # Directory accessible to Singularity

# Verify files exist
if os.path.exists(ip_bam):
    print(f"✓ IP BAM found: {ip_bam}")
else:
    print(f"✗ IP BAM not found: {ip_bam}")
    
if os.path.exists(control_bam):
    print(f"✓ Control BAM found: {control_bam}")
else:
    print(f"✗ Control BAM not found: {control_bam}")

print(f"\nExperiment: {basename}")
print("⚠ Remember to update the file paths above before running!")

## Step 4: Create Output Directories

In [None]:
# Create output directory structure
OUTPUT_DIR = os.path.join(os.path.abspath(output_dir), f"{basename}_results")
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Create subdirectories
qc_dir = os.path.join(OUTPUT_DIR, f"{basename}_qc")
peaks_dir = os.path.join(OUTPUT_DIR, f"{basename}_peaks")

for d in [qc_dir, peaks_dir]:
    Path(d).mkdir(exist_ok=True)

# Initialize log file
log_file = os.path.join(OUTPUT_DIR, f"{basename}_pipeline.log")

print(f"✓ Output directory: {OUTPUT_DIR}")
print(f"✓ QC directory: {qc_dir}")
print(f"✓ Peaks directory: {peaks_dir}")

## Step 5: Call Peaks with MACS2

**What does this do?** MACS2 compares IP vs control to identify enriched regions (peaks).

**MACS2 Algorithm:**
1. Scans the genome for regions with significant enrichment
2. Compares IP signal to control background
3. Calculates p-values and q-values (FDR)
4. Reports peak locations, summits, and confidence scores
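
Steps 2 and 3 can be illustrated with a simplified sketch: an upper-tail Poisson p-value for the reads observed in a window given an expected background rate, followed by a Benjamini-Hochberg adjustment to obtain q-values. The function names and example numbers are assumptions for illustration. MACS2's real implementation differs in detail (for example, it estimates a local background rate and works in log space), so treat this as a teaching aid only.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping q-values monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical windows: (observed IP reads, expected background reads)
windows = [(30, 5.0), (8, 5.0), (12, 5.0)]
pvals = [poisson_sf(obs, lam) for obs, lam in windows]
for (obs, lam), p, q in zip(windows, pvals, benjamini_hochberg(pvals)):
    print(f"observed={obs} expected={lam} p={p:.3g} q={q:.3g}")
```

Subtracting from 1 loses precision for very small p-values, which is one reason production tools compute these tails in log space.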

**📝 TODO: Run MACS2 peak calling.**

In [None]:
print("=" * 70)
print("STEP 1: Peak Calling with MACS2")
print("=" * 70)

print("\nCalling peaks (this may take several minutes)...")
print(f"IP sample: {os.path.basename(ip_bam)}")
print(f"Control sample: {os.path.basename(control_bam)}")

# TODO: Complete the MACS2 command
# Hint: MACS2 needs both IP and control samples for peak calling
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{???}",  # TODO: Which container should you use for macs2?
    "macs2", "callpeak",
    # TODO: Add the following parameters:
    # - Treatment/IP file (hint: -t with ip_bam)
    # - Control/input file (hint: -c with control_bam)
    # - Keep duplicates setting (hint: --keep-dup all, already filtered)
    # - Genome size (hint: -g hs for human)
    # - Output name prefix (hint: -n with basename)
    # - Output directory (hint: --outdir with peaks_dir)
]

# capture_output=True is needed so that result.stderr (MACS2's log) can be parsed below
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("\n✓ Peak calling completed!")
    
    # Parse MACS2 output
    peak_file = os.path.join(peaks_dir, f"{basename}_peaks.narrowPeak")
    summit_file = os.path.join(peaks_dir, f"{basename}_summits.bed")
    
    # Count peaks
    with open(peak_file, 'r') as f:
        num_peaks = sum(1 for line in f)
    
    print(f"\n📊 Peak Calling Results:")
    print(f"  Total peaks called: {num_peaks:,}")
    
    if num_peaks > 10000:
        print(f"  ✓ Good number of peaks (>10,000)")
    elif num_peaks > 1000:
        print(f"  ⚠ Moderate number of peaks (1,000-10,000)")
    else:
        print(f"  ✗ Low number of peaks (<1,000) - check data quality")
    
    print(f"\n  Output files:")
    print(f"    - {basename}_peaks.narrowPeak    ← Peak coordinates")
    print(f"    - {basename}_summits.bed         ← Peak summits (highest point)")
    print(f"    - {basename}_peaks.xls           ← Detailed peak info")
    
    # Show preview of peaks
    print(f"\n  Preview of top 5 peaks:")
    print(f"  {'Chr':<8} {'Start':<12} {'End':<12} {'Name':<20} {'Score':<8} {'Enrichment':<10}")
    print(f"  {'-'*80}")
    
    with open(peak_file, 'r') as f:
        for i, line in enumerate(f):
            if i < 5:
                fields = line.strip().split('\t')
                chrom, start, end, name = fields[0], fields[1], fields[2], fields[3]
                score, enrichment = fields[4], fields[6]
                print(f"  {chrom:<8} {start:<12} {end:<12} {name:<20} {score:<8} {enrichment:<10}")
            else:
                break
    
    # Show MACS2 summary from stderr
    print(f"\n  MACS2 Summary:")
    for line in result.stderr.split('\n'):
        if 'tags after filtering' in line or 'total peaks' in line or 'Fragment length' in line:
            print(f"    {line.strip()}")
            
else:
    print(f"✗ Error: {result.stderr}")

print("\n" + "=" * 70)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 70)
print("1. How many peaks were identified in your experiment?")
print("2. Is this peak count reasonable for your protein?")
print("   - Transcription factors: typically 5,000-50,000 peaks")
print("   - Histone marks: can have >100,000 peaks")
print("3. What do the different MACS2 output files contain?")
print("4. What does the fold enrichment value represent?")
print("5. Why is it important to use a control/input sample?")
print("6. What would happen if you didn't use a control sample?")
print("7. Look at the top peaks - are they on multiple chromosomes?")
print("=" * 70)

## Step 6: Calculate FRiP Score

**What is FRiP?** Fraction of Reads in Peaks - measures how much of your sequencing signal falls within identified peaks.

**Why is FRiP important?**
- High FRiP = Good enrichment, successful ChIP
- Low FRiP = Poor enrichment, may indicate technical issues

**Quality Thresholds:**
- **Excellent**: FRiP > 5% (0.05)
- **Good**: FRiP > 1% (0.01)
- **Poor**: FRiP < 1% (failed experiment)
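
As a quick worked example of the definition and thresholds above, the sketch below computes FRiP from two read counts and maps it onto the three quality tiers. The function names and the example counts are hypothetical; the cutoffs are the ones listed in this notebook.

```python
def frip_score(reads_in_peaks, total_reads):
    """Fraction of Reads in Peaks; returns 0.0 when there are no reads."""
    if total_reads <= 0:
        return 0.0
    return reads_in_peaks / total_reads


def classify_frip(frip):
    """Map a FRiP value onto the quality tiers used in this notebook."""
    if frip > 0.05:
        return "excellent"
    if frip > 0.01:
        return "good"
    return "poor"


# Hypothetical sample: 600,000 of 20,000,000 reads fall inside peaks.
frip = frip_score(600_000, 20_000_000)
print(f"FRiP = {frip:.4f} ({frip * 100:.2f}%) -> {classify_frip(frip)}")
```

With these made-up counts, 3% of reads fall in peaks, which lands in the "good" tier.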

**📝 TODO: Calculate FRiP score.**

In [None]:
print("=" * 70)
print("STEP 2: Calculate FRiP Score")
print("=" * 70)

peak_file = os.path.join(peaks_dir, f"{basename}_peaks.narrowPeak")

print("\nCalculating Fraction of Reads in Peaks (FRiP)...")

# Step 1: Count total reads in IP sample
print("[1/3] Counting total reads in IP sample...")
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",
    "samtools", "view", "-c", ip_bam
]
# capture_output=True is needed so that result.stdout holds the read count
result = subprocess.run(cmd, capture_output=True, text=True)
total_reads = int(result.stdout.strip())
print(f"  Total reads: {total_reads:,}")

# Step 2: Count reads overlapping peaks
print("[2/3] Counting reads in peaks...")

# Sort and merge peaks
cmd1 = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{bedtools_container}",
    "bedtools", "sort", "-i", peak_file
]

cmd2 = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{bedtools_container}",
    "bedtools", "merge", "-i", "stdin"
]

cmd3 = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{bedtools_container}",
    "bedtools", "intersect",
    "-u", "-nonamecheck",
    "-a", ip_bam,
    "-b", "stdin",
    "-ubam"
]

cmd4 = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",
    "samtools", "view", "-c"
]

# Chain the commands together
p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
p2 = subprocess.Popen(cmd2, stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()
p3 = subprocess.Popen(cmd3, stdin=p2.stdout, stdout=subprocess.PIPE)
p2.stdout.close()
p4 = subprocess.Popen(cmd4, stdin=p3.stdout, stdout=subprocess.PIPE, text=True)
p3.stdout.close()

output, _ = p4.communicate()
reads_in_peaks = int(output.strip())
print(f"  Reads in peaks: {reads_in_peaks:,}")

# Step 3: Calculate FRiP
print("[3/3] Calculating FRiP score...")
FRiP = reads_in_peaks / total_reads if total_reads > 0 else 0

print(f"\n📊 FRiP Score Results:")
print(f"  Total reads: {total_reads:,}")
print(f"  Reads in peaks: {reads_in_peaks:,}")
print(f"  FRiP score: {FRiP:.6f} ({FRiP*100:.2f}%)")

# Quality assessment
print(f"\n  Quality Assessment:")
if FRiP > 0.05:
    print(f"    ✓ Excellent enrichment (>5%)")
    print(f"    Your ChIP worked very well!")
elif FRiP > 0.01:
    print(f"    ✓ Good enrichment (1-5%)")
    print(f"    Acceptable ChIP-seq quality")
elif FRiP > 0.005:
    print(f"    ⚠ Moderate enrichment (0.5-1%)")
    print(f"    Marginal quality - consider biological validation")
else:
    print(f"    ✗ Poor enrichment (<0.5%)")
    print(f"    Failed experiment - check antibody, protocol, or sample")

# Save FRiP to file
qc_file = os.path.join(qc_dir, f"{basename}_qc.txt")
with open(qc_file, 'w') as f:
    f.write(f"FRiP:\n")
    f.write(f"{FRiP}\n")
    f.write(f"\nTotal_reads\tReads_in_peaks\tFRiP_percentage\n")
    f.write(f"{total_reads}\t{reads_in_peaks}\t{FRiP*100:.4f}\n")

print(f"\n✓ FRiP score saved to: {qc_file}")

print("\n" + "=" * 70)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 70)
print("1. What is your FRiP score (as a percentage)?")
print("2. Is your FRiP score acceptable for ChIP-seq?")
print("   - >5%: Excellent")
print("   - 1-5%: Good")
print("   - <1%: Poor (failed experiment)")
print("3. What does FRiP measure?")
print("4. What does a low FRiP score indicate?")
print("   - Poor antibody quality?")
print("   - Low ChIP efficiency?")
print("   - Technical issues during library prep?")
print("5. How does ChIP-seq FRiP compare to ATAC-seq FRiP?")
print("6. If your FRiP is low, what could you do to improve it?")
print("7. What percentage of your sequencing reads fall in peaks?")
print("=" * 70)

## Step 7: Remove ENCODE Blacklist Regions

**What are blacklist regions?** Genomic regions that produce artifactual signals:
- Repetitive DNA
- Satellite regions
- Mitochondrial sequences
- Known problematic loci

**Why remove them?** These regions show false positive peaks regardless of the experiment.

**ENCODE Blacklist:** Curated list of artifact-prone regions for quality control.
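
The filtering logic can be sketched in plain Python: a peak is discarded if it overlaps any blacklist interval on the same chromosome, even by a single base. This mirrors the whole-feature removal that `bedtools subtract -A` performs in the cell below, but the function names and toy coordinates here are hypothetical, and the linear scan is for illustration rather than efficiency.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def drop_blacklisted(peaks, blacklist):
    """Keep only peaks (chrom, start, end) that touch no blacklist interval."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in peaks:
        hits = by_chrom.get(chrom, [])
        if not any(overlaps(start, end, b_start, b_end) for b_start, b_end in hits):
            kept.append((chrom, start, end))
    return kept


# Toy coordinates, not real blacklist regions.
toy_peaks = [("chr1", 100, 300), ("chr1", 900, 1100), ("chr2", 100, 300)]
toy_blacklist = [("chr1", 250, 400)]
print(drop_blacklisted(toy_peaks, toy_blacklist))
```

Only the first toy peak is dropped: it shares bases 250-299 with the blacklist interval, while the chr2 peak with the same coordinates is unaffected because it lies on a different chromosome.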

**📝 TODO: Filter peaks using the blacklist.**

In [None]:
print("=" * 70)
print("STEP 3: Remove ENCODE Blacklist Regions")
print("=" * 70)

peak_file = os.path.join(peaks_dir, f"{basename}_peaks.narrowPeak")
summit_file = os.path.join(peaks_dir, f"{basename}_summits.bed")
filtered_peaks = os.path.join(peaks_dir, f"{basename}_peaks_blacklisted_filtered.narrowPeak")
filtered_summits = os.path.join(peaks_dir, f"{basename}_summits_blacklisted_filtered.bed")

print("\nFiltering peaks that overlap ENCODE blacklist regions...")

# Filter peaks
print("[1/2] Filtering peak file...")
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{bedtools_container}",
    "bedtools", "subtract",
    "-A",              # Remove entire feature if any overlap
    "-a", peak_file,   # Input peaks
    "-b", BLACKLIST    # Blacklist regions
]

# capture_output=True is needed so that result.stdout holds the bedtools output
result = subprocess.run(cmd, capture_output=True, text=True)

# Sort and save filtered peaks
sorted_peaks = subprocess.run(
    ["sort", "-k1,1", "-k2,2n"],
    input=result.stdout,
    capture_output=True,
    text=True
)

with open(filtered_peaks, 'w') as f:
    f.write(sorted_peaks.stdout)

print("✓ Filtered peaks saved")

# Filter summits
print("[2/2] Filtering summit file...")
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{bedtools_container}",
    "bedtools", "subtract",
    "-A",
    "-a", summit_file,
    "-b", BLACKLIST
]

# capture_output=True is needed so that result.stdout holds the bedtools output
result = subprocess.run(cmd, capture_output=True, text=True)

# Sort and save filtered summits
sorted_summits = subprocess.run(
    ["sort", "-k1,1", "-k2,2n"],
    input=result.stdout,
    capture_output=True,
    text=True
)

with open(filtered_summits, 'w') as f:
    f.write(sorted_summits.stdout)

print("✓ Filtered summits saved")

# Count filtered peaks
with open(peak_file, 'r') as f:
    original_peaks = sum(1 for line in f)

with open(filtered_peaks, 'r') as f:
    filtered_peak_count = sum(1 for line in f)

removed_peaks = original_peaks - filtered_peak_count
# Guard against division by zero when no peaks were called
removed_fraction = removed_peaks / original_peaks if original_peaks > 0 else 0.0

print(f"\n📊 Blacklist Filtering Results:")
print(f"  Original peaks: {original_peaks:,}")
print(f"  Filtered peaks: {filtered_peak_count:,}")
print(f"  Removed peaks: {removed_peaks:,} ({removed_fraction*100:.1f}%)")

if removed_fraction < 0.05:
    print(f"  ✓ Good: <5% of peaks removed (clean data)")
elif removed_fraction < 0.10:
    print(f"  ⚠ Moderate: 5-10% of peaks removed")
else:
    print(f"  ⚠ Warning: >10% of peaks removed (check data quality)")

print(f"\n  Final output files:")
print(f"    {filtered_peaks}")
print(f"    {filtered_summits}")

print("\n💡 Use these filtered files for downstream analysis!")

print("\n" + "=" * 70)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 70)
print("1. How many peaks were removed by blacklist filtering?")
print("2. What percentage of your peaks overlapped blacklisted regions?")
print("3. What types of genomic regions are in the ENCODE blacklist?")
print("   - Repetitive DNA?")
print("   - Satellite sequences?")
print("   - Problematic assembly regions?")
print("4. Why is blacklist filtering important for ChIP-seq?")
print("5. Would you expect different proteins to have different blacklist overlap rates?")
print("6. What would happen if you didn't filter blacklisted regions?")
print("7. Should blacklist filtering be done before or after peak calling?")
print("=" * 70)

## Step 8: Analyze Peak Characteristics

Let's examine the properties of your called peaks to better understand the results.
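
Before reading the file by column index, it can help to see the ten narrowPeak columns with names attached. The sketch below parses one tab-separated line into a named record. The class, function, and example line are hypothetical and shown only for orientation; the column order follows the narrowPeak layout described in the file format guide at the end of this notebook.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    signal_value: float  # fold enrichment (column 7)
    p_value: float       # -log10(p-value)
    q_value: float       # -log10(q-value)
    summit_offset: int   # summit position relative to start

    @property
    def length(self):
        return self.end - self.start

    @property
    def summit(self):
        return self.start + self.summit_offset


def parse_narrowpeak_line(line):
    """Parse one tab-separated narrowPeak line into a NarrowPeak record."""
    f = line.rstrip("\n").split("\t")
    return NarrowPeak(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5],
                      float(f[6]), float(f[7]), float(f[8]), int(f[9]))


# Made-up example line, not real data.
example = "chr1\t1000\t1250\tmy_chip_experiment_peak_1\t250\t.\t8.5\t30.1\t25.0\t120\n"
peak = parse_narrowpeak_line(example)
print(peak.chrom, peak.length, peak.summit, peak.signal_value)
```

Named fields make it harder to mix up `fields[4]` (score) with `fields[6]` (fold enrichment), which is an easy mistake when working with raw column indices.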

In [None]:
print("=" * 70)
print("STEP 4: Analyze Peak Characteristics")
print("=" * 70)

filtered_peaks = os.path.join(peaks_dir, f"{basename}_peaks_blacklisted_filtered.narrowPeak")

print("\nAnalyzing peak properties...")

# Read peak data
peak_lengths = []
peak_scores = []
peak_enrichments = []
peak_chromosomes = {}

with open(filtered_peaks, 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        chrom = fields[0]
        start = int(fields[1])
        end = int(fields[2])
        score = float(fields[4])
        enrichment = float(fields[6])
        
        length = end - start
        peak_lengths.append(length)
        peak_scores.append(score)
        peak_enrichments.append(enrichment)
        
        if chrom not in peak_chromosomes:
            peak_chromosomes[chrom] = 0
        peak_chromosomes[chrom] += 1

# Calculate statistics
import statistics

print(f"\n📊 Peak Length Statistics:")
print(f"  Mean length: {statistics.mean(peak_lengths):.1f} bp")
print(f"  Median length: {statistics.median(peak_lengths):.1f} bp")
print(f"  Min length: {min(peak_lengths)} bp")
print(f"  Max length: {max(peak_lengths)} bp")

print(f"\n📊 Peak Score Statistics:")
print(f"  Mean score: {statistics.mean(peak_scores):.2f}")
print(f"  Median score: {statistics.median(peak_scores):.2f}")
print(f"  Max score: {max(peak_scores):.2f}")

print(f"\n📊 Peak Enrichment Statistics:")
print(f"  Mean fold enrichment: {statistics.mean(peak_enrichments):.2f}x")
print(f"  Median fold enrichment: {statistics.median(peak_enrichments):.2f}x")
print(f"  Max fold enrichment: {max(peak_enrichments):.2f}x")

print(f"\n📊 Peak Distribution by Chromosome:")
# Sort chromosomes
sorted_chroms = sorted(peak_chromosomes.items(), 
                       key=lambda x: (x[0].replace('chr', '').replace('X', '23').replace('Y', '24').replace('M', '25').zfill(2)))

print(f"  {'Chromosome':<12} {'Peak Count':<12} {'Percentage':<12}")
print(f"  {'-'*40}")
total_peaks = sum(peak_chromosomes.values())
for chrom, count in sorted_chroms[:10]:  # Show the first 10 in chromosome order
    percentage = (count / total_peaks) * 100
    print(f"  {chrom:<12} {count:<12,} {percentage:>6.2f}%")

if len(sorted_chroms) > 10:
    print(f"  ... and {len(sorted_chroms) - 10} more chromosomes")

print(f"\n💡 Interpretation Tips:")
print(f"  - Transcription factors typically have narrow peaks (100-300 bp)")
print(f"  - Histone marks may have broader peaks (500-5000 bp)")
print(f"  - High fold enrichment (>10x) indicates strong binding")
print(f"  - Peaks should be distributed across chromosomes")

print("\n" + "=" * 70)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 70)
print("1. What is the mean and median peak length?")
print("2. Based on peak length, is your protein a transcription factor or histone mark?")
print("3. What is the mean fold enrichment over control?")
print("4. Is the fold enrichment strong (>10x)?")
print("5. Are peaks distributed across all chromosomes or concentrated on a few?")
print("6. Which chromosome has the most peaks?")
print("7. Do you see any unexpected patterns in the chromosome distribution?")
print("8. What is the longest peak? What might this represent?")
print("9. How would you use these statistics to assess data quality?")
print("=" * 70)

## Step 9: Pipeline Summary and Results

**Congratulations!** 🎉 You've successfully called peaks from your ChIP-seq data!

Let's review all results and discuss next steps.

In [None]:
print("=" * 70)
print("CHIP-SEQ PEAK CALLING COMPLETE!")
print("=" * 70)

print(f"\n✓ Experiment: {basename}")
print(f"✓ Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n" + "=" * 70)
print("SUMMARY OF RESULTS")
print("=" * 70)

# Peak counts
filtered_peaks = os.path.join(peaks_dir, f"{basename}_peaks_blacklisted_filtered.narrowPeak")
with open(filtered_peaks, 'r') as f:
    final_peak_count = sum(1 for line in f)

print(f"\n1. Peak Calling:")
print(f"   Final peak count: {final_peak_count:,}")

# FRiP score
qc_file = os.path.join(qc_dir, f"{basename}_qc.txt")
if os.path.exists(qc_file):
    with open(qc_file, 'r') as f:
        lines = f.readlines()
        if len(lines) > 1:
            frip_value = float(lines[1].strip())
            print(f"\n2. FRiP Score: {frip_value:.6f} ({frip_value*100:.2f}%)")
            if frip_value > 0.05:
                print(f"   ✓ Excellent quality")
            elif frip_value > 0.01:
                print(f"   ✓ Good quality")
            else:
                print(f"   ⚠ Low quality")

# Peak characteristics
with open(filtered_peaks, 'r') as f:
    peak_lengths = []
    peak_enrichments = []
    for line in f:
        fields = line.strip().split('\t')
        peak_lengths.append(int(fields[2]) - int(fields[1]))
        peak_enrichments.append(float(fields[6]))

if peak_lengths:
    print(f"\n3. Peak Characteristics:")
    print(f"   Mean peak length: {statistics.mean(peak_lengths):.1f} bp")
    print(f"   Mean fold enrichment: {statistics.mean(peak_enrichments):.2f}x")

print("\n" + "=" * 70)
print("OUTPUT FILES")
print("=" * 70)

print(f"\n📁 Results Directory: {OUTPUT_DIR}")
print(f"\n  Key Files:")
print(f"  ├─ {basename}_peaks/")
print(f"  │  ├─ {basename}_peaks_blacklisted_filtered.narrowPeak  ← **MAIN OUTPUT**")
print(f"  │  ├─ {basename}_summits_blacklisted_filtered.bed")
print(f"  │  ├─ {basename}_peaks.xls                              ← Detailed peak info")
print(f"  │  └─ {basename}_model.r                                ← Peak model")
print(f"  └─ {basename}_qc/")
print(f"     └─ {basename}_qc.txt                                 ← FRiP score")

print("\n" + "=" * 70)
print("FILE FORMAT GUIDE")
print("=" * 70)

print("\n📄 narrowPeak format (main output):")
print("  Columns:")
print("    1. Chromosome")
print("    2. Start position")
print("    3. End position")
print("    4. Peak name")
print("    5. Score (integer, 0-1000)")
print("    6. Strand")
print("    7. Fold enrichment (signal vs. control)")
print("    8. -log10(p-value)")
print("    9. -log10(q-value) - FDR corrected")
print("   10. Summit position relative to start")

print("\n" + "=" * 70)
print("NEXT STEPS - DOWNSTREAM ANALYSIS")
print("=" * 70)

print("\n1. 📍 Peak Annotation")
print("   - Assign peaks to nearest genes")
print("   - Determine peak locations (promoter, intron, intergenic)")
print("   - Tools: ChIPseeker (R), HOMER annotatePeaks")

print("\n2. 🧬 Motif Analysis")
print("   - Find enriched DNA binding motifs in peaks")
print("   - Identify transcription factor binding sites")
print("   - Tools: HOMER findMotifsGenome, MEME-ChIP")

print("\n3. 📊 Visualization")
print("   - Create heatmaps of signal around peaks")
print("   - Generate genome browser tracks")
print("   - Plot peak distribution")
print("   - Tools: deepTools, IGV, UCSC Genome Browser")

print("\n4. 🔍 Gene Ontology Enrichment")
print("   - Test if peak-associated genes are enriched for functions")
print("   - Tools: GREAT, Enrichr, DAVID")

print("\n5. 🧪 Integration with Other Data")
print("   - Compare with RNA-seq (do TF peaks correlate with gene expression?)")
print("   - Overlap with ATAC-seq (are binding sites in open chromatin?)")
print("   - Compare across conditions (differential binding)")

print("\n6. ✅ Validation")
print("   - Select top peaks for experimental validation")
print("   - ChIP-qPCR at specific loci")
print("   - Luciferase reporter assays")

print("\n" + "=" * 70)
print("QUALITY CHECKLIST")
print("=" * 70)

checklist = []
checklist.append(("✓" if final_peak_count > 1000 else "✗", f"Peak count > 1,000: {final_peak_count:,}"))

if os.path.exists(qc_file):
    with open(qc_file, 'r') as f:
        lines = f.readlines()
        if len(lines) > 1:
            frip_value = float(lines[1].strip())
            checklist.append(("✓" if frip_value > 0.01 else "✗", f"FRiP > 1%: {frip_value*100:.2f}%"))

if peak_lengths:
    avg_enrichment = statistics.mean(peak_enrichments)
    checklist.append(("✓" if avg_enrichment > 2 else "⚠", f"Mean enrichment > 2x: {avg_enrichment:.2f}x"))

print()
for status, check in checklist:
    print(f"  {status} {check}")

print("\n" + "=" * 70)
print("📝 REVIEW QUESTIONS")
print("=" * 70)

print("\n1. How many peaks were identified in your experiment?")
print("2. What is your FRiP score? Is it acceptable?")
print("3. What is the average fold enrichment of your peaks?")
print("4. What percentage of peaks were in blacklist regions?")
print("5. Are your peaks narrow (TF) or broad (histone)?")
print("6. Which chromosome has the most peaks? Why might this be?")
print("7. What would you do next to validate these results?")
print("8. How would you compare this to another condition?")

print("\n" + "=" * 70)
print("Excellent work! Your peaks are ready for biological interpretation! 🎉")
print("=" * 70)