# RNA-seq Processing Pipeline - Student Exercise

## Learning Objectives
By completing this notebook, you will:
- Understand the basic steps of RNA-seq data analysis
- Learn how to run bioinformatics tools using Singularity containers
- Process paired-end RNA sequencing reads from raw data to gene counts

## Pipeline Overview
```
Raw FASTQ files → Quality Control → Trimming → Alignment → Sorting → Count Matrix
     (Input)         (FastQC)     (Trim Galore)   (STAR)   (Samtools)  (HTSeq)
```

## Required Tools
All tools are provided as Docker/Singularity containers:
- **FastQC** (0.12.1): Quality control of sequencing data
- **Trim Galore** (0.6.10): Adapter and quality trimming
- **STAR** (2.5.2b): RNA-seq aligner
- **Samtools** (1.7): BAM file manipulation
- **HTSeq** (2.0.5): Read counting

---

**Instructions**: Follow the cells below and complete the sections marked with `# TODO`

## Step 1: Import Python Libraries

We need to import libraries to run commands and manage files.

In [None]:
import os
import shutil          # used in Step 5 to copy the FASTQ files
import subprocess
from pathlib import Path
from datetime import datetime

print("✓ Libraries imported successfully!")

## Step 2: Define Bioinformatics Tool Containers

Each tool is packaged in a container for easy deployment.

In [None]:
# Container images - these are pre-configured environments with bioinformatics tools
fastqc_container = "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"
trimgalore_container = "quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_1"
star_container = "quay.io/biocontainers/star:2.5.2b--0"
samtools_container = "quay.io/biocontainers/samtools:1.7--2"
htseq_container = "quay.io/biocontainers/htseq:2.0.5--py39h91a4a08_2"

print("✓ Container images defined")

Container images defined:
STAR: quay.io/biocontainers/star:2.5.2b--0
HTSeq: quay.io/biocontainers/htseq:2.0.5--py39h91a4a08_2
FastQC: quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0
Samtools: quay.io/biocontainers/samtools:1.7--2
Trim Galore: quay.io/biocontainers/trim-galore:0.6.10--hdfd78af_1


## Step 3: Set Your Pipeline Parameters

**📝 TODO: Update these paths with your actual data!**

In [None]:
# TODO: Update these paths to match your data location
fastq1 = "/path/to/your/sample_R1.fq.gz"      # Forward reads
fastq2 = "/path/to/your/sample_R2.fq.gz"      # Reverse reads
basename = "my_sample"                         # Sample name
output_dir = "/path/to/output_directory"       # Where to save results

# Reference genome files (ask your instructor for these paths)
genome_index = "/path/to/STAR_index"           # STAR genome index
genome_gtf = "/path/to/genes.gtf"              # Gene annotation file

# Analysis settings
threads = 2                                     # Number of CPU threads to use
# TODO: Update this to your accessible directory for Singularity
BIND_DIR = "/path/to/accessible_directory/"    # Directory accessible to Singularity

print(f"Sample name: {basename}")
print(f"Threads: {threads}")
print("⚠ Remember to update the file paths above before running!")
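
Because the paths above are placeholders, it can help to check them before launching any tool. This is a minimal sketch under the assumption that every input must already exist on disk. The helper name `find_missing_paths` is hypothetical.

```python
from pathlib import Path


def find_missing_paths(named_paths):
    """Return the names whose paths do not exist on disk."""
    return [name for name, path in named_paths.items() if not Path(path).exists()]


# Placeholder paths copied from the cell above; replace them with your own.
inputs = {
    "fastq1": "/path/to/your/sample_R1.fq.gz",
    "fastq2": "/path/to/your/sample_R2.fq.gz",
    "genome_index": "/path/to/STAR_index",
    "genome_gtf": "/path/to/genes.gtf",
}

missing = find_missing_paths(inputs)
if missing:
    print("Update these paths before continuing:", ", ".join(missing))
else:
    print("All input paths exist.")
```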

## Step 4: Create Output Directories

This cell creates folders to organize your results.

In [None]:
# Create output directory structure
OUTPUT_DIR = os.path.join(os.path.abspath(output_dir), f"{basename}_results")
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Create QC subdirectory
qc_dir = os.path.join(OUTPUT_DIR, f"{basename}_qc")
Path(qc_dir).mkdir(exist_ok=True)

# Initialize log file
log_file = os.path.join(OUTPUT_DIR, f"{basename}_pipeline.log")

print(f"✓ Output directory: {OUTPUT_DIR}")
print(f"✓ QC directory: {qc_dir}")
print(f"✓ Log file: {log_file}")
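
The cell above defines `log_file`, but none of the later cells write to it. One possible way to use it is sketched below: a small function that appends timestamped lines. The name `log_message` and the demo file location are assumptions made for illustration.

```python
import os
import tempfile
from datetime import datetime


def log_message(log_path, message):
    """Append one timestamped line to the log file and return that line."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    line = f"[{stamp}] {message}"
    with open(log_path, "a") as handle:
        handle.write(line + "\n")
    return line


# Demo against a temporary file so the sketch runs on its own.
demo_log = os.path.join(tempfile.mkdtemp(), "demo_pipeline.log")
print(log_message(demo_log, "Output directories created"))
```

In the notebook itself the call would look like `log_message(log_file, "FastQC started")`, placed before and after each tool runs.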

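## Step 5: Quality Control with FastQC

**What does this do?** FastQC analyzes the quality of your sequencing reads and generates a report for each FASTQ file. The first cell below copies the input FASTQ files into the results directory; the FastQC command follows.

**📝 TODO: Complete the command below by filling in the missing parts.**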

In [None]:
# Copy FASTQ files
fastq1_copy = os.path.join(OUTPUT_DIR, f"{basename}_R1.fq.gz")
fastq2_copy = os.path.join(OUTPUT_DIR, f"{basename}_R2.fq.gz")

print("Copying FASTQ files...")
shutil.copy(fastq1, fastq1_copy)
shutil.copy(fastq2, fastq2_copy)

# Update fastq paths to use the copies
fastq1 = fastq1_copy
fastq2 = fastq2_copy

print(f"Copied: {fastq1}")
print(f"Copied: {fastq2}")

### Run FastQC

In [None]:
print("=" * 50)
print("STEP 1: Quality Control with FastQC")
print("=" * 50)

# TODO: Complete this command by replacing the ??? marks
# Hint: Look at the container name defined in Step 2
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{fastqc_container}",  # TODO: Replace ??? with the correct container variable
    "fastqc",
    "-t", str(threads),
    "-o", qc_dir,
    fastq1, fastq2
]

print("Running FastQC...")
result = subprocess.run(cmd, capture_output=False, text=True)

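if result.returncode == 0:
    print("✓ FastQC completed!")
    print(f"  Reports saved to: {qc_dir}")
else:
    # Output is streamed to the notebook, not captured, so result.stderr is None
    print(f"✗ FastQC failed with exit code {result.returncode}; see the messages above")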
    
# Question: What metrics does FastQC check? (Look at the HTML reports generated)
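
Alongside each HTML report, FastQC by default writes a `.zip` archive containing a `summary.txt` file with one tab-separated line per module: status, module name, file name. The sketch below reads that file to list the checks and their PASS/WARN/FAIL status. The function names are hypothetical, and the demo text is illustrative, not real output.

```python
import zipfile


def parse_fastqc_summary(text):
    """Map each FastQC module name to its PASS/WARN/FAIL status."""
    statuses = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            status, module = parts[0], parts[1]
            statuses[module] = status
    return statuses


def read_fastqc_zip(zip_path):
    """Read and parse summary.txt from a FastQC result archive."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        return parse_fastqc_summary(archive.read(name).decode())


# Illustrative summary text, not real results.
demo = (
    "PASS\tBasic Statistics\tsample_R1.fq.gz\n"
    "WARN\tPer base sequence content\tsample_R1.fq.gz\n"
    "FAIL\tAdapter Content\tsample_R1.fq.gz\n"
)
for module, status in parse_fastqc_summary(demo).items():
    print(f"{status:5} {module}")
```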

## Step 6: Trim Adapters and Low-Quality Bases

**What does this do?** Trim Galore removes adapter sequences and poor-quality bases from reads.

**📝 TODO: Run this cell and observe the trimming statistics.**

In [None]:
print("=" * 50)
print("STEP 2: Trimming with Trim Galore")
print("=" * 50)

cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{trimgalore_container}",
    "trim_galore",
    "--paired",                    # We have paired-end reads
    "-j", str(threads),
    "--basename", basename,
    "--gzip",                      # Keep output compressed
    "-o", OUTPUT_DIR,
    fastq1, fastq2
]

print("Running Trim Galore...")
result = subprocess.run(cmd, capture_output=False, text=True)

if result.returncode == 0:
    print("✓ Trimming completed!")
    # Update paths to trimmed files
    fastq1_trimmed = os.path.join(OUTPUT_DIR, f"{basename}_val_1.fq.gz")
    fastq2_trimmed = os.path.join(OUTPUT_DIR, f"{basename}_val_2.fq.gz")
    print(f"  Trimmed reads: {fastq1_trimmed}")
    print(f"                 {fastq2_trimmed}")
else:
    # Output is streamed to the notebook, not captured, so result.stderr is None
    print(f"✗ Trim Galore failed with exit code {result.returncode}; see the messages above")
    
# Question: How many bases were trimmed on average?
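
One way to approach the question above is to compare read counts and total bases before and after trimming. The sketch below assumes standard four-line FASTQ records with no line wrapping. The helper name `fastq_stats` is hypothetical, and the demo file is a tiny made-up example.

```python
import gzip
import os
import tempfile


def fastq_stats(path):
    """Return (number of reads, total bases) for a gzipped FASTQ file."""
    reads = 0
    bases = 0
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:          # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases


# Tiny made-up FASTQ file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n")

reads, bases = fastq_stats(demo_path)
print(f"reads={reads} bases={bases} mean length={bases / reads:.1f}")
```

Running `fastq_stats` on the copied input and on the corresponding `_val_1.fq.gz` file gives the mean read length before and after trimming; the difference approximates the bases removed per read.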

## Step 7: Align Reads to Genome with STAR

**What does this do?** STAR aligns RNA-seq reads to a reference genome.

**📝 TODO: Complete the command and answer the questions below.**

In [None]:
print("=" * 50)
print("STEP 3: Alignment with STAR")
print("=" * 50)

# Use the trimmed reads
fastq1_trimmed = os.path.join(OUTPUT_DIR, f"{basename}_val_1.fq.gz")
fastq2_trimmed = os.path.join(OUTPUT_DIR, f"{basename}_val_2.fq.gz")

# TODO: Fill in the missing container name
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{star_container}",  # TODO: Replace ??? with correct container
    "STAR",
    "--runMode", "alignReads",
    "--genomeDir", genome_index,
    "--runThreadN", str(threads),
    "--readFilesIn", fastq1_trimmed, fastq2_trimmed,
    "--readFilesCommand", "zcat",  # Because files are gzipped
    "--outFileNamePrefix", f"{OUTPUT_DIR}/{basename}.",
    "--outSAMtype", "BAM", "Unsorted"
]

print("Running STAR alignment (this may take several minutes)...")
result = subprocess.run(cmd, capture_output=False, text=True)

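if result.returncode == 0:
    print("✓ Alignment completed!")
    bam_file = os.path.join(OUTPUT_DIR, f"{basename}.Aligned.out.bam")
    print(f"  BAM file: {bam_file}")
else:
    # Output is streamed to the notebook, not captured, so result.stderr is None
    print(f"✗ STAR failed with exit code {result.returncode}; see the messages above")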
    
# Questions:
# 1. What percentage of reads mapped uniquely?
# 2. Check the {basename}.Log.final.out file for alignment statistics
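
STAR's `Log.final.out` lists each statistic as a label and a value separated by a `|` character. The sketch below turns that text into a dictionary so the mapping rate can be read programmatically. The function name `parse_star_log` is hypothetical, and the demo values are invented for illustration.

```python
def parse_star_log(text):
    """Parse 'label | value' lines from STAR's Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" in line:                       # section headings have no '|'
            key, value = line.split("|", 1)
            stats[key.strip()] = value.strip()
    return stats


# Invented excerpt in the layout STAR uses; not real results.
demo_log_text = (
    "                          Number of input reads |\t1000\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t900\n"
    "                        Uniquely mapped reads % |\t90.00%\n"
)

stats = parse_star_log(demo_log_text)
print("Uniquely mapped:", stats["Uniquely mapped reads %"])
```

For a real run, the text would come from reading the `{basename}.Log.final.out` file in the results directory.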

## Step 8: Sort BAM File

**What does this do?** Sorts aligned reads by genomic coordinates (required for downstream analysis).

**📝 TODO: Complete the Samtools command.**

In [None]:
print("=" * 50)
print("STEP 4: Sorting BAM file")
print("=" * 50)

bam_file = os.path.join(OUTPUT_DIR, f"{basename}.Aligned.out.bam")
sorted_bam = os.path.join(OUTPUT_DIR, f"{basename}.Aligned.sortedByCoord.out.bam")

# TODO: Replace ??? with the correct container variable
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{samtools_container}",  # TODO: Which container should we use?
    "samtools", "sort",
    "-@", str(threads),
    "-o", sorted_bam,
    bam_file
]

print("Sorting BAM file...")
result = subprocess.run(cmd, capture_output=False, text=True)

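if result.returncode == 0:
    print("✓ Sorting completed!")
    print(f"  Sorted BAM: {sorted_bam}")
else:
    # Output is streamed to the notebook, not captured, so result.stderr is None
    print(f"✗ Samtools failed with exit code {result.returncode}; see the messages above")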
    
# Question: Why is sorting necessary for the next step?
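
A BAM header records its sort order in the `SO` field of the `@HD` line, and `samtools sort` sets it to `coordinate`. The sketch below checks that field in header text, which could be obtained with `samtools view -H`. The function name `sort_order` is hypothetical, and the demo header is made up.

```python
def sort_order(header_text):
    """Return the SO value from the @HD line of a SAM/BAM header, or None."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Made-up header text for illustration.
demo_header = "@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
print(sort_order(demo_header))
```

This matters in the next step because `htseq-count` is told with `-r pos` that the input is position-sorted, so the file must actually be in that order.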

## Step 9: Count Reads per Gene with HTSeq

**What does this do?** Counts how many reads mapped to each gene (creates the count matrix).

**📝 TODO: Complete this final step to generate gene counts.**

In [None]:
print("=" * 50)
print("STEP 5: Extract gene counts with HTSeq")
print("=" * 50)

sorted_bam = os.path.join(OUTPUT_DIR, f"{basename}.Aligned.sortedByCoord.out.bam")
counts_file = os.path.join(OUTPUT_DIR, f"{basename}_counts.csv")

# TODO: Complete the command
cmd = [
    "singularity", "exec", "-e", "--no-home",
    "--bind", f"{BIND_DIR}:{BIND_DIR}",
    f"docker://{htseq_container}",  # TODO: Which container?
    "htseq-count",
    "-f", "bam",
    "-r", "pos",
    "-s", "reverse",              # Strandedness: check your library prep
    "-c", counts_file,
    sorted_bam,
    genome_gtf
]

print("Counting reads per gene...")
result = subprocess.run(cmd, capture_output=False, text=True)

if result.returncode == 0:
    print("✓ Counting completed!")
    print(f"  Count matrix: {counts_file}")
    print("\n📊 Preview of count file:")
    # Show first few lines
    with open(counts_file, 'r') as f:
        for i, line in enumerate(f):
            if i < 10:
                print(f"  {line.strip()}")
            else:
                break
else:
    # Output is streamed to the notebook, not captured, so result.stderr is None
    print(f"✗ HTSeq failed with exit code {result.returncode}; see the messages above")
    
# Questions:
# 1. How many genes were detected in your sample?
# 2. What do the special counts (__no_feature, __ambiguous, etc.) mean?
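
To help with the questions above, the sketch below separates gene rows from the special counters, whose names start with `__`, and counts genes with at least one read. It assumes each row holds a feature ID followed by an integer count, separated by a tab or comma; rows whose last field is not an integer, such as a header, are skipped. The function name `summarize_counts` and the demo rows are hypothetical.

```python
import re


def summarize_counts(lines):
    """Return (number of genes with count > 0, dict of special counters)."""
    genes = {}
    special = {}
    for line in lines:
        fields = re.split(r"[\t,]", line.strip())
        if len(fields) < 2:
            continue
        name, value = fields[0], fields[-1]
        try:
            count = int(value)
        except ValueError:
            continue                      # skip header or malformed rows
        if name.startswith("__"):
            special[name] = count         # e.g. __no_feature, __ambiguous
        else:
            genes[name] = count
    detected = sum(1 for count in genes.values() if count > 0)
    return detected, special


# Made-up rows for illustration; not real results.
demo_rows = ["GENE_A\t12", "GENE_B\t0", "__no_feature\t5", "__ambiguous\t1"]
detected, special = summarize_counts(demo_rows)
print(f"Genes detected: {detected}")
print(f"Special counters: {special}")
```

With a real file, pass the open file handle for the counts file in place of `demo_rows`.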

## Step 10: Summary and Next Steps

**Congratulations!** 🎉 You've completed the RNA-seq pipeline!

### What you've learned:
1. ✓ Quality control of sequencing data
2. ✓ Read trimming and filtering
3. ✓ Genome alignment
4. ✓ BAM file manipulation
5. ✓ Gene expression quantification

### Your output files:
- **QC Reports**: `{basename}_qc/` folder
- **Aligned reads**: `{basename}.Aligned.sortedByCoord.out.bam`
- **Gene counts**: `{basename}_counts.csv` ← Use this for differential expression analysis!

### Next steps:
- Import the count matrix into R/Python for statistical analysis
- Perform differential expression analysis with DESeq2 or edgeR
- Visualize results with plots (volcano plots, heatmaps, etc.)

In [None]:
print("=" * 50)
print("PIPELINE COMPLETE!")
print("=" * 50)

print(f"\n✓ Sample processed: {basename}")
print(f"✓ Results directory: {OUTPUT_DIR}")
print(f"✓ Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n📁 Output files:")
print(f"  - QC reports: {qc_dir}/")
print(f"  - Sorted BAM: {basename}.Aligned.sortedByCoord.out.bam")
print(f"  - Gene counts: {basename}_counts.csv")

print("\n🔍 Review Questions:")
print("  1. What was the quality of your raw sequencing data?")
print("  2. What percentage of reads aligned to the genome?")
print("  3. How many genes have non-zero counts?")
print("  4. What would you do differently if alignment rate was low?")

print("\n💡 Tip: You can now use the count matrix for differential expression analysis!")