# ChIP-seq Processing Pipeline - Student Exercise

## Learning Objectives
By completing this notebook, you will:
- Understand the ChIP-seq data analysis workflow
- Learn how to process single-end sequencing reads
- Identify protein-DNA binding sites
- Assess data quality using standard metrics

## Pipeline Overview
```
Raw FASTQ → Trimming → Alignment → Mark Duplicates → Filter → Coverage Track
   (Input)  (Trim Galore) (Bowtie2)    (Picard)      (QC)    (BigWig)
```

## What is ChIP-seq?
**ChIP-seq** (Chromatin Immunoprecipitation sequencing) identifies genome-wide DNA binding sites for transcription factors and other proteins.

**Key Concepts:**
- **IP (Immunoprecipitation) sample**: DNA fragments bound to your protein of interest
- **Input/Control sample**: Background DNA (no immunoprecipitation)
- **Peak calling**: Identifies regions enriched in IP vs. control
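
To make "enriched in IP vs. control" concrete, here is a minimal sketch. It is not how any particular peak caller works: it only scales the control to the IP library size and reports a per-bin ratio. The function name `fold_enrichment`, the pseudocount, and the example counts are illustrative assumptions.

```python
def fold_enrichment(ip_counts, control_counts, pseudocount=1.0):
    """Per-bin IP/control ratio after scaling control to the IP library size.

    ip_counts and control_counts are equal-length lists of read counts per
    genomic bin. The pseudocount (an assumption here) avoids division by zero.
    """
    if len(ip_counts) != len(control_counts):
        raise ValueError("IP and control must cover the same bins")
    ip_total = sum(ip_counts)
    control_total = sum(control_counts)
    # Scale control so both libraries have the same total depth
    scale = ip_total / control_total if control_total else 1.0
    return [
        (ip + pseudocount) / (ctrl * scale + pseudocount)
        for ip, ctrl in zip(ip_counts, control_counts)
    ]


# Hypothetical counts for five bins; the third bin looks enriched in the IP
ip = [4, 5, 60, 6, 5]
control = [5, 4, 6, 5, 5]
for i, fe in enumerate(fold_enrichment(ip, control)):
    print(f"bin {i}: fold enrichment = {fe:.2f}")
```

Real peak callers add a statistical model on top of this kind of comparison, which is why both samples are needed.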

## Pipeline Steps
This notebook covers **sample processing** (both IP and control samples):
1. Quality control and trimming
2. Read alignment
3. Duplicate marking
4. Quality filtering
5. Coverage track generation

**Note**: Peak calling is done separately using both IP and control samples together.

## Required Tools
- **Trim Galore** (0.6.10): Adapter trimming and QC
- **Bowtie2** (2.4.1): Read aligner
- **Samtools** (1.7): BAM file processing
- **Picard** (2.23.4): Duplicate marking
- **deeptools** (3.5.5): Coverage visualization

---

**Instructions**: Follow the cells below and complete sections marked with `# TODO`

## Step 1: Import Python Libraries

In [None]:
import os
import subprocess
from pathlib import Path
from datetime import datetime

print("✓ Libraries imported successfully!")

## Step 2: Define Bioinformatics Tool Containers

In [None]:
# Container images for each tool
trimgalore_container = "quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_1"
fastqc_container = "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"
bowtie2_container = "quay.io/biocontainers/bowtie2:2.4.1--py38h1c8e9b9_3"
samtools_container = "quay.io/biocontainers/samtools:1.7--2"
picard_container = "quay.io/biocontainers/picard:2.23.4--0"
deeptools_container = "quay.io/biocontainers/deeptools:3.5.5--pyhdfd78af_0"

print("✓ Container images defined")
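
Every later step wraps its tool in the same `singularity exec` prefix. As a sketch, a small helper could build that prefix once. `build_container_cmd` is a hypothetical name that is not part of the exercise, and the image and path in the example are placeholders.

```python
def build_container_cmd(container, tool_args, bind_dir):
    """Return the argument list for running a tool inside a Singularity container.

    Mirrors the prefix used throughout this notebook: a cleaned environment (-e),
    no home directory mount (--no-home), and one bind-mounted directory.
    """
    return [
        "singularity", "exec", "-e", "--no-home",
        "--bind", f"{bind_dir}:{bind_dir}",
        f"docker://{container}",
        *tool_args,
    ]


# Example with placeholder values; printing the command is a cheap sanity check
cmd = build_container_cmd(
    "quay.io/biocontainers/samtools:1.7--2",
    ["samtools", "--version"],
    "/path/to/accessible_directory/",
)
print(" ".join(cmd))
```

Printing a command before running it is a quick way to spot a wrong container or a missing argument.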

## Step 3: Set Your Pipeline Parameters

**📝 TODO: Update these paths with your actual data!**

**Important**: Run this pipeline twice - once for your IP sample and once for your control/input sample.

In [None]:
# TODO: Update these paths to match your data location
fastq1 = "/path/to/your/sample_R1.fq.gz"      # Single-end reads
basename = "my_chip_sample"                    # Sample name (e.g., "IP_sample" or "control")
output_dir = "/path/to/output_directory"       # Where to save results

# Reference genome files (ask your instructor for these paths)
genome_index = "/path/to/bowtie2_index"        # Bowtie2 genome index

# Analysis settings
threads = 2                                     # Number of CPU threads
# TODO: Update this to your accessible directory for Singularity
BIND_DIR = "/path/to/accessible_directory/"    # Directory accessible to Singularity

print(f"Sample name: {basename}")
print(f"Input FASTQ: {fastq1}")
print(f"Threads: {threads}")
print("⚠ Remember to update the file paths above before running!")
print("\n💡 Tip: Process both IP and control samples using this notebook!")

## Step 4: Create Output Directories

In [None]:
# Create output directory structure
OUTPUT_DIR = os.path.join(os.path.abspath(output_dir), f"{basename}_results")
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Create subdirectories
qc_dir = os.path.join(OUTPUT_DIR, f"{basename}_qc")
picard_dir = os.path.join(OUTPUT_DIR, f"{basename}_picard")

for d in [qc_dir, picard_dir]:
    Path(d).mkdir(exist_ok=True)

# Initialize log file
log_file = os.path.join(OUTPUT_DIR, f"{basename}_pipeline.log")

print(f"✓ Output directory: {OUTPUT_DIR}")
print(f"✓ QC directory: {qc_dir}")
print(f"✓ Picard directory: {picard_dir}")

## Step 5: Trim Adapters and Run Quality Control

**What does this do?** Removes adapter sequences and low-quality bases, then runs FastQC on trimmed reads.

**📝 TODO: Run Trim Galore with integrated FastQC.**

In [None]:
print("=" * 60)
print("STEP 1: Trimming and Quality Control")
print("=" * 60)

print("\nRunning Trim Galore (with FastQC)...")

# TODO: Complete the Trim Galore command
# Hint: You need to specify the correct container and complete the command parameters
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{???}",  # TODO: Which container should you use for trim_galore?
    "trim_galore",
    # TODO: Add the following parameters:
    # - Number of threads/cores to use
    # - Output format (gzip)
    # - Base name for output files
    # - FastQC arguments with output directory
    # - Output directory
    # - Input FASTQ file
]

# Capture output so the trimming summary and any error text can be inspected
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("✓ Trimming and QC completed!")
    fastq1_trimmed = os.path.join(OUTPUT_DIR, f"{basename}_trimmed.fq.gz")
    print(f"  Trimmed reads: {fastq1_trimmed}")
    print(f"  QC reports: {qc_dir}/")

    # Show trimming stats; the report may arrive on stdout or stderr, so check both
    trim_log = result.stdout + result.stderr
    if "Total reads processed:" in trim_log:
        for line in trim_log.split('\n'):
            lowered = line.lower()  # the report capitalizes "Reads with adapters"
            if 'reads processed' in lowered or 'reads with adapters' in lowered or 'quality-trimmed' in lowered:
                print(f"  {line.strip()}")
else:
    print(f"✗ Error: {result.stderr}")

print("\n" + "=" * 60)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 60)
print("1. What percentage of your reads contained adapters?")
print("2. Why is it important to remove adapters before alignment?")
print("3. What is the average read length after trimming?")
print("4. Check the FastQC report - what is the per-base sequence quality?")
print("5. Are there any overrepresented sequences? What could they be?")
print("\n💡 Tip: Open the FastQC HTML report in a browser to visualize quality metrics")
print("=" * 60)

## Step 6: Align Reads with Bowtie2

**What does this do?** Maps single-end reads to the reference genome.

**📝 TODO: Complete the Bowtie2 alignment command.**

In [None]:
print("=" * 60)
print("STEP 2: Alignment with Bowtie2")
print("=" * 60)

fastq1_trimmed = os.path.join(OUTPUT_DIR, f"{basename}_trimmed.fq.gz")
sam_file = os.path.join(OUTPUT_DIR, f"{basename}.sam")

print("\nAligning reads to genome...")

# TODO: Complete the Bowtie2 command for single-end reads
# Hint: Check which container has bowtie2 installed
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{???}",  # TODO: Which container should you use for bowtie2?
    "bowtie2",
    # TODO: Add the following parameters:
    # - Alignment sensitivity mode (hint: --very-sensitive)
    # - Number of threads
    # - Index basename (variable: genome_index)
    # - Input reads (variable: fastq1_trimmed, use -U for single-end)
]

print("Running Bowtie2 (this may take several minutes)...")
with open(sam_file, 'w') as outfile:
    result = subprocess.run(cmd, stdout=outfile, stderr=subprocess.PIPE, text=True)

if result.returncode == 0:
    print("✓ Alignment completed!")
    print(f"  SAM file: {sam_file}")

    # Parse alignment stats (summary lines Bowtie2 prints for single-end reads)
    print("\n  Alignment Statistics:")
    summary_keys = ('reads; of these:', 'aligned 0 times', 'aligned exactly',
                    'aligned >1 times', 'overall alignment rate')
    for line in result.stderr.split('\n'):
        if any(key in line for key in summary_keys):
            print(f"  {line.strip()}")
else:
    print(f"✗ Error: {result.stderr}")

print("\n" + "=" * 60)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 60)
print("1. What is your overall alignment rate? Is it acceptable (typically >70%)?")
print("2. If the alignment rate is low (<70%), what could be the reasons?")
print("   - Wrong reference genome?")
print("   - High adapter contamination?")
print("   - Poor quality reads?")
print("   - Sample contamination?")
print("3. How many reads aligned exactly once vs. multiple times?")
print("4. Why might some reads align to multiple locations?")
print("5. Compare IP vs Control alignment rates - are they similar?")
print("=" * 60)
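
To record alignment numbers rather than read them off the screen, the Bowtie2 summary can be parsed into a dictionary. The sketch below assumes the summary layout Bowtie2 prints for unpaired reads. `parse_bowtie2_summary` is a hypothetical helper, and the example text uses made-up numbers.

```python
import re


def parse_bowtie2_summary(stderr_text):
    """Extract read counts and the overall alignment rate from a Bowtie2 summary."""
    patterns = {
        "total_reads": r"(\d+) reads; of these:",
        "unaligned": r"(\d+) \([\d.]+%\) aligned 0 times",
        "unique": r"(\d+) \([\d.]+%\) aligned exactly 1 time",
        "multi": r"(\d+) \([\d.]+%\) aligned >1 times",
    }
    stats = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, stderr_text)
        if match:
            stats[key] = int(match.group(1))
    rate = re.search(r"([\d.]+)% overall alignment rate", stderr_text)
    if rate:
        stats["overall_rate"] = float(rate.group(1))
    return stats


# Illustrative summary text with invented counts
example_summary = """\
1000000 reads; of these:
  1000000 (100.00%) were unpaired; of these:
    80000 (8.00%) aligned 0 times
    720000 (72.00%) aligned exactly 1 time
    200000 (20.00%) aligned >1 times
92.00% overall alignment rate
"""
stats = parse_bowtie2_summary(example_summary)
print(stats)
```

In this notebook the same function could be applied to `result.stderr` from the alignment cell.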

## Step 7: Convert SAM to BAM and Sort

**What does this do?** Converts to binary format (BAM) and sorts by genomic coordinates.

**📝 TODO: Complete the conversion and sorting steps.**

In [None]:
print("=" * 60)
print("STEP 3: Convert SAM to BAM and Sort")
print("=" * 60)

sam_file = os.path.join(OUTPUT_DIR, f"{basename}.sam")
sorted_bam = os.path.join(OUTPUT_DIR, f"{basename}_sorted.bam")

print("\n[1/2] Converting SAM to BAM and sorting...")

# TODO: Complete the samtools sort command
# Hint: Which container has samtools installed?
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{???}",  # TODO: Which container should you use for samtools?
    "samtools", "sort",
    # TODO: Add the following parameters:
    # - Number of threads (hint: use -@)
    # - Output file (hint: use -o with sorted_bam variable)
    # - Input file (variable: sam_file)
]

# Capture output so error text is available if the command fails
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("✓ Sorting completed!")
    print(f"  Sorted BAM: {sorted_bam}")

    # Get file size
    bam_size = os.path.getsize(sorted_bam) / (1024**3)  # Convert to GB
    print(f"  File size: {bam_size:.2f} GB")
else:
    print(f"✗ Error: {result.stderr}")

print("\n[2/2] Fixing BAM header (removing Bowtie2 version info)...")
# This step is needed for compatibility with Picard
header_file = os.path.join(OUTPUT_DIR, f"{basename}_header.txt")
header_filtered = os.path.join(OUTPUT_DIR, f"{basename}_header_filtered.txt")

# Extract header
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",
    "samtools", "view", "-H", sorted_bam
]
result = subprocess.run(cmd, capture_output=True, text=True)  # capture stdout so the header can be written to a file
with open(header_file, 'w') as f:
    f.write(result.stdout)

# Filter @PG lines
with open(header_file, 'r') as f:
    with open(header_filtered, 'w') as out:
        for line in f:
            if not line.startswith('@PG'):
                out.write(line)

# Reheader
reheaded_bam = os.path.join(OUTPUT_DIR, f"{basename}_sorted_reheaded.bam")
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",
    "samtools", "reheader",
    "-P", header_filtered,
    sorted_bam
]
with open(reheaded_bam, 'w') as outfile:
    subprocess.run(cmd, stdout=outfile, capture_output=False)

# Replace original
os.replace(reheaded_bam, sorted_bam)
os.remove(header_file)
os.remove(header_filtered)
os.remove(sam_file)

print("✓ Header fixed and SAM file removed!")

print("\n" + "=" * 60)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 60)
print("1. What is the file size of your sorted BAM file?")
print("2. Why is BAM format better than SAM for storage?")
print("3. Why is coordinate sorting necessary for downstream analysis?")
print("4. What is the purpose of removing the SAM file after conversion?")
print("=" * 60)

## Step 8: Mark Duplicates with Picard

**What does this do?** Flags reads that map to the same genomic position and strand as likely PCR duplicates.

**Why mark duplicates in ChIP-seq?** PCR duplicates can artificially inflate signal and should be marked (and often removed) to avoid false positive peaks.
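
The idea behind position-based duplicate marking for single-end reads can be sketched in a few lines: reads sharing chromosome, 5' position, and strand are grouped, the first is kept, and the rest are flagged. This is a conceptual illustration only. Picard's actual implementation is more involved (for instance, it uses base qualities to decide which read to keep), and the tuple layout and read names below are assumptions.

```python
def mark_duplicates(reads):
    """Flag all but the first read seen at each (chrom, 5' position, strand).

    `reads` is a list of (name, chrom, five_prime_pos, strand) tuples.
    Returns a list of (name, is_duplicate) pairs in the input order.
    """
    seen = set()
    marked = []
    for name, chrom, pos, strand in reads:
        key = (chrom, pos, strand)
        marked.append((name, key in seen))
        seen.add(key)
    return marked


# Hypothetical reads: r2 repeats r1's position and strand; r3 is on the other strand
reads = [
    ("r1", "chr1", 1000, "+"),
    ("r2", "chr1", 1000, "+"),
    ("r3", "chr1", 1000, "-"),
    ("r4", "chr2", 500, "+"),
]
marked = mark_duplicates(reads)
dup_rate = sum(is_dup for _, is_dup in marked) / len(marked)
print(marked)
print(f"Duplication rate: {dup_rate:.0%}")
```

Marking rather than deleting keeps every read in the file, so the decision to drop duplicates can be made later at the filtering step.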

**📝 TODO: Run Picard MarkDuplicates.**

In [None]:
print("=" * 60)
print("STEP 4: Mark duplicate reads")
print("=" * 60)

sorted_bam = os.path.join(OUTPUT_DIR, f"{basename}_sorted.bam")
mkdup_bam = os.path.join(OUTPUT_DIR, f"{basename}_sorted_mkdup.bam")
dup_metrics = os.path.join(picard_dir, f"{basename}_dup.txt")

print("\nMarking duplicates with Picard...")

# TODO: Complete the Picard MarkDuplicates command
# Hint: Picard has its own container separate from samtools
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{???}",  # TODO: Which container should you use for picard?
    "picard", "MarkDuplicates",
    # TODO: Add the following parameters:
    # - Input BAM file (hint: -I with sorted_bam variable)
    # - Output BAM file (hint: -O with mkdup_bam variable)
    # - Metrics file (hint: -M with dup_metrics variable)
    # - REMOVE_DUPLICATES setting (should be "false" to mark but not remove)
    # - ASSUME_SORT_ORDER setting (should be "coordinate")
]

# Capture output so error text is available if the command fails
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("✓ Duplicate marking completed!")
    print(f"  Marked BAM: {mkdup_bam}")
    print(f"  Metrics file: {dup_metrics}")

    # Parse duplicate metrics
    print("\n  📊 Duplicate Statistics:")
    with open(dup_metrics, 'r') as f:
        for line in f:
            if line.startswith("LIBRARY"):
                header = line.strip().split('\t')
                data = next(f).strip().split('\t')
                # Find indices
                for i, col in enumerate(header):
                    if col == "UNPAIRED_READS_EXAMINED":
                        unpaired_reads = int(data[i])
                    elif col == "UNPAIRED_READ_DUPLICATES":
                        unpaired_dups = int(data[i])
                    elif col == "PERCENT_DUPLICATION":
                        dup_rate = float(data[i])
                
                print(f"    Total reads examined: {unpaired_reads:,}")
                print(f"    Duplicate reads: {unpaired_dups:,}")
                print(f"    Duplication rate: {dup_rate*100:.2f}%")
                
                if dup_rate < 0.2:
                    print(f"    ✓ Good library complexity (<20% duplicates)")
                elif dup_rate < 0.5:
                    print(f"    ⚠ Moderate duplication (20-50%)")
                else:
                    print(f"    ✗ High duplication (>50%) - may indicate low complexity")
                break
else:
    print(f"✗ Error: {result.stderr}")

print("\n" + "=" * 60)
print("📝 QUESTIONS TO ANSWER:")
print("=" * 60)
print("1. What is your duplication rate? Record the exact percentage.")
print("2. What does the duplication rate tell you about library quality?")
print("   - <20%: Excellent library complexity")
print("   - 20-50%: Acceptable, but could be improved")
print("   - >50%: Poor library complexity (potential issues)")
print("3. What causes PCR duplicates in ChIP-seq experiments?")
print("4. Why do we mark duplicates instead of removing them immediately?")
print("5. How might high duplication affect peak calling results?")
print("6. Compare duplication rates between IP and control samples - are they similar?")
print("7. What could cause unusually high duplication rates?")
print("   - Low input DNA amount?")
print("   - Too many PCR cycles?")
print("   - Poor library complexity?")
print("=" * 60)

## Step 9: Filter Low Quality Reads and Remove Duplicates

**What does this do?** Removes:
- Duplicate reads (FLAG 1024)
- Low mapping quality reads (MAPQ < 20)

**MAPQ threshold**: MAPQ is Phred-scaled, so 20 corresponds to an estimated 1% chance that the read is mapped to the wrong position (about 99% confidence in the placement).
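
Both filters reduce to simple arithmetic on two SAM fields. The sketch below converts a Phred-scaled MAPQ to an error probability and tests the duplicate bit (1024, i.e. 0x400) in the FLAG field. The function names are illustrative assumptions, and the check only mimics the logic of the filter; it does not read BAM files.

```python
DUPLICATE_FLAG = 1024  # SAM flag bit 0x400: PCR or optical duplicate


def mapq_to_error_probability(mapq):
    """Phred-scaled MAPQ -> estimated probability the mapping position is wrong."""
    return 10 ** (-mapq / 10)


def passes_filter(flag, mapq, min_mapq=20):
    """Keep a read only if it is not a marked duplicate and MAPQ >= min_mapq."""
    return (flag & DUPLICATE_FLAG) == 0 and mapq >= min_mapq


for q in (10, 20, 30):
    print(f"MAPQ {q}: error probability ~ {mapq_to_error_probability(q):.3f}")

# Hypothetical reads: clean, marked duplicate, and low mapping quality
print(passes_filter(flag=0, mapq=42))
print(passes_filter(flag=1024, mapq=42))
print(passes_filter(flag=16, mapq=5))
```

Because FLAG is a bit field, a read with FLAG 1040 (1024 + 16, a reverse-strand duplicate) is also excluded by the duplicate test.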

**📝 TODO: Apply quality filters.**

In [None]:
print("=" * 60)
print("STEP 5: Filter low quality reads and remove duplicates")
print("=" * 60)

mkdup_bam = os.path.join(OUTPUT_DIR, f"{basename}_sorted_mkdup.bam")
filtered_bam = os.path.join(OUTPUT_DIR, f"{basename}_sorted_mkdup_filtered.bam")
final_bam = os.path.join(OUTPUT_DIR, f"{basename}_final.bam")

print("\nFiltering reads with MAPQ < 20 and removing duplicates...")

# TODO: Complete the filtering command
# Hint: -F 1024 removes duplicates, -q 20 keeps only high-quality alignments
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",
    "samtools", "view",
    "-@", str(threads),
    # TODO: Add the following parameters:
    # - Filter flag to remove duplicates (hint: -F 1024)
    # - Minimum mapping quality (hint: -q 20)
    # - Output format as BAM (hint: -b)
    # - Output file (hint: -o with filtered_bam variable)
    # - Input file (variable: mkdup_bam)
]

# Capture output so error text is available if the command fails
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("✓ Filtering completed!")
    
    # Sort the filtered BAM
    print("Sorting filtered BAM...")
    cmd = [
        "singularity", "exec", "-e", "--no-home",
        "--bind", f"{BIND_DIR}:{BIND_DIR}",
        f"docker://{samtools_container}",
        "samtools", "sort",
        "-@", str(threads),
        "-o", final_bam,
        filtered_bam
    ]
    subprocess.run(cmd, capture_output=False)
    
    print("✓ Final BAM created!")
    print(f"  Final BAM: {final_bam}")
    
    # Generate flagstats
    flagstats_file = os.path.join(qc_dir, f"{basename}_flagstats.txt")
    cmd = [
        "singularity", "exec", "-e", "--no-home",
        "--bind", f"{BIND_DIR}:{BIND_DIR}",
        f"docker://{samtools_container}",
        "samtools", "flagstat",
        "-@", str(threads),
        final_bam
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)  # capture stdout for the stats file

    with open(flagstats_file, 'w') as f:
        f.write(result.stdout)

    print(f"\n  📊 Final Read Statistics:")
    for line in result.stdout.split('\n'):
        if 'mapped (' in line or 'properly paired' in line or 'singletons' in line:
            print(f"    {line.strip()}")
    
    print(f"\n  Full stats saved to: {flagstats_file}")
    
    # Clean up intermediate files
    os.remove(mkdup_bam)
    os.remove(filtered_bam)
    print("\n✓ Intermediate files cleaned up")

    print("\n" + "=" * 60)
    print("📝 QUESTIONS TO ANSWER:")
    print("=" * 60)
    print("1. How many reads remained after filtering?")
    print("2. What percentage of reads were removed during filtering?")
    print("3. Why do we use MAPQ >= 20 as the quality threshold?")
    print("   (Hint: What does MAPQ=20 mean in terms of mapping confidence?)")
    print("4. Why is it important to remove low-quality alignments in ChIP-seq?")
    print("5. What is the difference between marked duplicates and removed duplicates?")
    print("6. Calculate: Starting reads → After trimming → After alignment → After filtering")
    print("   What percentage of raw reads made it to the final BAM file?")
    print("7. Is your final read count sufficient for peak calling (typically need >5-10M)?")
    print("=" * 60)
    
else:
    print(f"✗ Error: {result.stderr}")
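
For the read-retention question, a small table of counts per stage is enough. The sketch below is one way to compute it. `read_retention` is a hypothetical helper and the counts are invented, so substitute the numbers from your own Trim Galore, Bowtie2, and flagstat output.

```python
def read_retention(stage_counts):
    """Report each stage's read count as a percentage of the first stage."""
    raw = stage_counts[0][1]
    return [(stage, count, 100.0 * count / raw) for stage, count in stage_counts]


# Hypothetical counts; replace with the values from your own logs
stages = [
    ("Raw reads", 20_000_000),
    ("After trimming", 19_600_000),
    ("Aligned", 18_000_000),
    ("After filtering", 13_000_000),
]
for stage, count, pct in read_retention(stages):
    print(f"{stage:<16} {count:>12,} {pct:6.1f}%")
```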

## Step 10: Create BigWig Coverage Track

**What does this do?** Generates a normalized genome browser track for visualization.

**Normalization**: RPKM (Reads Per Kilobase per Million mapped reads) makes samples comparable.
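
As a worked example of the formula, the sketch below computes RPKM for a single bin: the read count divided by the bin length in kilobases and by the number of mapped reads in millions. The function name and the numbers are illustrative assumptions.

```python
def rpkm(reads_in_bin, bin_size_bp, total_mapped_reads):
    """Reads per kilobase of bin length per million mapped reads.

    Equivalent to reads / ((bin_size_bp / 1e3) * (total_mapped_reads / 1e6)).
    """
    return reads_in_bin * 1e9 / (bin_size_bp * total_mapped_reads)


# Same bin and same raw count at two sequencing depths (hypothetical numbers)
shallow = rpkm(reads_in_bin=10, bin_size_bp=10, total_mapped_reads=10_000_000)
deep = rpkm(reads_in_bin=10, bin_size_bp=10, total_mapped_reads=20_000_000)
print(f"10M mapped reads: {shallow:.1f}")
print(f"20M mapped reads: {deep:.1f}")
```

The same raw count gives a lower value in the deeper library, which is what lets tracks from samples sequenced to different depths be compared.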

**📝 TODO: Generate the coverage track.**

In [None]:
print("=" * 60)
print("STEP 6: Generate BigWig coverage track")
print("=" * 60)

final_bam = os.path.join(OUTPUT_DIR, f"{basename}_final.bam")
bigwig_file = os.path.join(OUTPUT_DIR, f"{basename}_final_RPKM_Norm_bs10.bw")

print("\n[1/2] Indexing BAM file...")
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",
    "samtools", "index",
    "-@", str(threads),
    final_bam
]
subprocess.run(cmd, capture_output=False)
print("✓ Indexing done")

print("\n[2/2] Creating normalized BigWig track...")

# TODO: Complete the bamCoverage command
# Hint: bamCoverage is part of deeptools
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{???}",  # TODO: Which container should you use for bamCoverage?
    "bamCoverage",
    # TODO: Add the following parameters:
    # - Input BAM file (hint: -b with final_bam variable)
    # - Output BigWig file (hint: -o with bigwig_file variable)
    # - Bin size (hint: -bs 10 for 10bp bins)
    # - Normalization method (hint: --normalizeUsing RPKM)
    # - Number of processors (hint: -p with threads variable)
]

# Capture output so error text is available if the command fails
result = subprocess.run(cmd, capture_output=True, text=True)

if result.returncode == 0:
    print("✓ BigWig file created!")
    print(f"  File: {bigwig_file}")
    
    # Get file size
    bw_size = os.path.getsize(bigwig_file) / (1024**2)  # Convert to MB
    print(f"  File size: {bw_size:.2f} MB")
    
    print(f"\n  💡 Visualization:")
    print(f"     - Load into IGV (Integrative Genomics Viewer)")
    print(f"     - Load into UCSC Genome Browser")
    print(f"     - Compare IP vs Control tracks to see enrichment")
    
    print("\n" + "=" * 60)
    print("📝 QUESTIONS TO ANSWER:")
    print("=" * 60)
    print("1. What is the file size of your BigWig file?")
    print("2. What does RPKM normalization do and why is it important?")
    print("3. What does the bin size (10bp) represent?")
    print("4. Why do we create BigWig files instead of using BAM for visualization?")
    print("5. After loading into IGV:")
    print("   - Do you see any regions with high signal?")
    print("   - Is the signal evenly distributed or localized?")
    print("   - How does the IP sample compare to the control?")
    print("6. What would you expect to see at true binding sites?")
    print("   (Hint: IP signal should be higher than control)")
    print("=" * 60)

else:
    print(f"✗ Error: {result.stderr}")

## Step 11: Pipeline Summary

**Congratulations!** 🎉 You've successfully processed your ChIP-seq sample!

Let's review the results and next steps.

In [None]:
print("=" * 70)
print("CHIP-SEQ SAMPLE PROCESSING COMPLETE!")
print("=" * 70)

print(f"\n✓ Sample processed: {basename}")
print(f"✓ Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n" + "=" * 70)
print("QUALITY METRICS SUMMARY")
print("=" * 70)

# Read flagstats
flagstats_file = os.path.join(qc_dir, f"{basename}_flagstats.txt")
if os.path.exists(flagstats_file):
    print("\n1. Alignment Statistics:")
    with open(flagstats_file, 'r') as f:
        content = f.read()
        print(f"   {content}")

# Read duplicate metrics
dup_metrics = os.path.join(picard_dir, f"{basename}_dup.txt")
if os.path.exists(dup_metrics):
    print("\n2. Duplication Rate:")
    with open(dup_metrics, 'r') as f:
        for line in f:
            if line.startswith("LIBRARY"):
                header = line.strip().split('\t')
                data = next(f).strip().split('\t')
                for i, col in enumerate(header):
                    if col == "PERCENT_DUPLICATION":
                        dup_rate = float(data[i])
                        print(f"   {dup_rate*100:.2f}%")
                        if dup_rate < 0.2:
                            print(f"   ✓ Good library quality")
                        elif dup_rate < 0.5:
                            print(f"   ⚠ Moderate duplication")
                        else:
                            print(f"   ✗ High duplication")
                break

print("\n" + "=" * 70)
print("OUTPUT FILES")
print("=" * 70)

print(f"\n📁 Results Directory: {OUTPUT_DIR}")
print(f"\n  Key Files:")
print(f"  ├─ {basename}_final.bam              ← Final aligned reads (use for peak calling)")
print(f"  ├─ {basename}_final_RPKM_Norm_bs10.bw ← Coverage track (load in IGV)")
print(f"  ├─ {basename}_qc/")
print(f"  │  ├─ {basename}_flagstats.txt        ← Alignment statistics")
print(f"  │  └─ FastQC reports                  ← Quality control reports")
print(f"  └─ {basename}_picard/")
print(f"     └─ {basename}_dup.txt              ← Duplicate metrics")

print("\n" + "=" * 70)
print("NEXT STEPS")
print("=" * 70)

print("\n⚠️  IMPORTANT: Process BOTH samples!")
print("   1. Run this notebook for your IP sample")
print("   2. Run this notebook again for your control/input sample")

print("\n📊 After processing both samples:")
print("   Use a separate peak calling script/notebook with:")
print(f"   - IP BAM file: {basename}_final.bam")
print("   - Control BAM file: [control_name]_final.bam")

print("\n🔍 Peak Calling Tools:")
print("   - MACS2 (most common, good for transcription factors)")
print("   - MACS3 (updated version)")
print("   - HOMER (good for histone marks)")
print("   - SICER (good for broad domains)")

print("\n📈 Downstream Analysis:")
print("   1. Peak annotation (assign peaks to genes)")
print("   2. Motif analysis (find DNA binding motifs)")
print("   3. Differential binding analysis (compare conditions)")
print("   4. Gene ontology enrichment")
print("   5. Integration with RNA-seq data")

print("\n💡 Visualization Tips:")
print("   - Load BigWig files for both IP and control into IGV")
print("   - Look for regions where IP signal > control signal")
print("   - These regions indicate protein binding sites")

print("\n📝 Review Questions:")
print("   1. What is your alignment rate? Is it acceptable?")
print("   2. What is your duplication rate? What does it mean?")
print("   3. How many reads remained after filtering?")
print("   4. What is the difference between IP and control samples?")
print("   5. Why do we need both IP and control for peak calling?")

print("\n" + "=" * 70)
print("Great job! Your ChIP-seq sample is ready for peak calling! 🎉")
print("=" * 70)