# RNA-Seq Pipeline: Step 3 - Alignment (HISAT2)

This notebook performs the alignment step. We map our cleaned, paired-end reads (from `02_Trimmed_Data`) to the reference genome (from `06_Genome_Index`).

**Workflow:**
1.  **Build Index:** First, create a `HISAT2` index from our reference genome (`genome.fna.gz`). This is a one-time step that makes alignment fast.
2.  **Run Alignment:** Loop through all 6 samples and run `hisat2` to align the paired-end reads.
3.  **Convert & Sort:** Convert the output (`.sam`) to the compressed `.bam` format (which saves space) and sort it.
4.  **Index BAM:** Create an index (`.bai`) for the sorted BAM files. This lets visualization and other downstream tools access specific regions quickly.

**Tools:** `hisat2` and `samtools`. Note that `samtools` is a standalone package; it is not part of `subread`, so it must be installed separately if it is missing.
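
Because every later cell shells out to these tools, it can help to confirm they are on the `PATH` before starting. The following is a minimal sketch, not part of the original pipeline: the helper name `find_missing_tools` is hypothetical, and the tool list is assumed from the commands this notebook runs (`hisat2-build` normally ships with `hisat2`, and `multiqc` is used in the QC section).

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed list of executables called by this notebook.
required_tools = ["hisat2", "hisat2-build", "samtools", "multiqc"]

missing = find_missing_tools(required_tools)
if missing:
    print(f"Missing from PATH: {missing}")
else:
    print("All required tools were found on PATH.")
```

If anything is reported missing, install it in the active environment before running the alignment cells.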

In [None]:
import os

# --- Define Core Paths ---

# 1. Input: Cleaned FASTQ files
trimmed_dir = "02_Trimmed_Data"

# 2. Input: Reference Genome files
genome_dir = "06_Genome_Index"
genome_fasta = os.path.join(genome_dir, "genome.fna.gz")

# 3. Output: HISAT2 Index
# We will store the index in its own folder
index_dir = "07_HISAT2_Index"
os.makedirs(index_dir, exist_ok=True)
# Define a prefix name for the index files
index_prefix = os.path.join(index_dir, "klebsiella_hs11286_index")

# 4. Output: Aligned BAM files
alignment_dir = "03_Aligned_BAMs"
os.makedirs(alignment_dir, exist_ok=True)

# --- Get Sample Names (like we did in Notebook 02) ---
input_files = sorted([f for f in os.listdir(trimmed_dir) if f.endswith(".trimmed_R1.fastq.gz")])
sample_names = [f.split(".trimmed_R1.fastq.gz")[0] for f in input_files]

print(f"Genome FASTA: {genome_fasta}")
print(f"Index Prefix: {index_prefix}")
print(f"Alignment Output Dir: {alignment_dir}")
print(f"Found {len(sample_names)} samples to align: {sample_names}")

In [None]:
%%bash -s "$genome_fasta" "$index_prefix"
# $1 = genome_fasta (from the paths cell above)
# $2 = index_prefix (from the paths cell above)

GENOME_FASTA=$1
INDEX_PREFIX=$2

echo "--- 1. Building HISAT2 Index ---"
echo "Input FASTA: $GENOME_FASTA"
echo "Output Prefix: $INDEX_PREFIX"

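# -p 8 : Use 8 threads
# Note: if hisat2-build cannot read the gzipped FASTA, decompress it first
# (e.g. gunzip -k) and pass the plain .fna file instead.
hisat2-build -p 8 "$GENOME_FASTA" "$INDEX_PREFIX"

echo "--- HISAT2 Index build complete. ---"
echo "Index files are in: $(dirname "$INDEX_PREFIX")"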

### 2. Run Alignment & BAM Processing Loop

Now that the genome index is built (step 1), we will loop through all 6 samples.

This code cell performs the main alignment workflow:
1.  **Alignment (`hisat2`):** Align the trimmed `R1` and `R2` files to the `klebsiella_hs11286_index`. This outputs a very large text file (`.sam`).
2.  **Conversion & Sorting (`samtools sort`):** Read the text `.sam` file, sort it by genomic coordinates, and write the compressed binary equivalent (`.bam`) in a single step, so no separate `samtools view` call is needed. Coordinate sorting is **required** for indexing.
3.  **Indexing (`samtools index`):** Create an index file (`.bai`) for the sorted BAM, allowing tools to access it quickly.
4.  **Cleanup (`rm`):** Delete the intermediate `.sam` file to save disk space.
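
The per-sample steps can also be expressed as plain argument lists and run with `subprocess`, which stops at the first failing command so the `.sam` file is never deleted after a failed sort. This is a minimal sketch rather than the notebook's own method: the helper names `build_sample_commands` and `run_sample` and the sample name `sampleA` are hypothetical, while the paths and flags mirror the ones used in this notebook.

```python
import subprocess


def build_sample_commands(sample, index_prefix, trimmed_dir, alignment_dir, threads=8):
    """Return the ordered list of commands (as argv lists) for one sample."""
    r1 = f"{trimmed_dir}/{sample}.trimmed_R1.fastq.gz"
    r2 = f"{trimmed_dir}/{sample}.trimmed_R2.fastq.gz"
    sam = f"{alignment_dir}/{sample}.sam"
    bam = f"{alignment_dir}/{sample}.sorted.bam"
    stats = f"{alignment_dir}/{sample}.stats.txt"
    return [
        # Align the paired-end reads and write SAM plus a summary file.
        ["hisat2", "-p", str(threads), "-x", index_prefix,
         "-1", r1, "-2", r2, "--summary-file", stats, "-S", sam],
        # Sort by coordinate and write BAM directly from the SAM input.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Index the sorted BAM.
        ["samtools", "index", bam],
        # Remove the intermediate SAM only after everything above succeeded.
        ["rm", sam],
    ]


def run_sample(commands):
    """Run the commands in order; check=True raises on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Dry run: print the commands for a hypothetical sample without executing them.
commands = build_sample_commands(
    "sampleA",
    "07_HISAT2_Index/klebsiella_hs11286_index",
    "02_Trimmed_Data",
    "03_Aligned_BAMs",
)
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_sample(commands)` would execute the steps. Printing them first is a cheap way to check the paths before committing to a long alignment run.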

In [None]:
# Cell 3: Run the Alignment Loop 

print("\n--- Starting HISAT2 Alignment Loop ---")

for sample in sample_names:
    print(f"Processing sample: {sample} ...")
    
    # 1. Define inputs
    in_r1 = f"{trimmed_dir}/{sample}.trimmed_R1.fastq.gz"
    in_r2 = f"{trimmed_dir}/{sample}.trimmed_R2.fastq.gz"
    
    # 2. Define outputs
    sam_output = f"{alignment_dir}/{sample}.sam"
    bam_output_sorted = f"{alignment_dir}/{sample}.sorted.bam"
    stats_output = f"{alignment_dir}/{sample}.stats.txt"

    # 3. Run hisat2 alignment
    !hisat2 -p 8 \
        -x $index_prefix \
        -1 $in_r1 \
        -2 $in_r2 \
        --summary-file $stats_output \
        -S $sam_output

    print(f"  ... Finished alignment (SAM) for {sample}.")

    # 4. Convert and sort in one command
    # samtools sort can read SAM, sort it by coordinate, and write BAM directly.
    !samtools sort -@ 8 -o $bam_output_sorted $sam_output

    print(f"  ... Finished sorting and converting to BAM for {sample}.")

    # 5. Index the new sorted BAM file
    !samtools index $bam_output_sorted

    print(f"  ... Finished indexing BAM for {sample}.")

    # 6. (Important) Clean up the large SAM file
    !rm $sam_output

    print("  ... Cleaned up intermediate SAM file.")
    print(f"  ... Final output: {bam_output_sorted}")
    print("--------------------------------------")

print("--- HISAT2 Alignment complete for all samples. ---")

### 3. Verify Alignment Statistics (MultiQC)

The alignment loop is complete. `hisat2` generated a summary statistics file (`.stats.txt`) for each sample.

We will now run `MultiQC` on the `03_Aligned_BAMs` directory to parse these stats files and create a single report. This report will show us the "Overall Alignment Rate" for all samples, which is the most critical metric to verify success.
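
For a quick check without opening the report, the overall rate can also be read straight from a summary file. This is a minimal sketch that assumes the default `hisat2` summary format, whose final line reads like `NN.NN% overall alignment rate`. The helper name `parse_overall_alignment_rate` is hypothetical, and the example line is made up for illustration, not a result from this dataset.

```python
import re


def parse_overall_alignment_rate(summary_text):
    """Return the overall alignment rate as a float percentage, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)% overall alignment rate", summary_text)
    return float(match.group(1)) if match else None


# Made-up summary line for illustration only (not a real result).
example_summary = "97.25% overall alignment rate\n"
print(parse_overall_alignment_rate(example_summary))
```

In the notebook, the same function could be applied to the contents of each `{sample}.stats.txt` file in `03_Aligned_BAMs`.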

In [None]:
# Define the output directory for this new QC report.
# 'os' is re-imported so this cell does not rely on the import in the first cell.
# 'alignment_dir' must still be defined, so re-run the paths cell after a kernel restart.
import os

alignment_qc_dir = "00_Data_QC/03_alignment_qc"
os.makedirs(alignment_qc_dir, exist_ok=True)

print(f"Alignment stats (input): {alignment_dir}")
print(f"QC Report (output): {alignment_qc_dir}")

In [None]:
%%bash -s "$alignment_dir" "$alignment_qc_dir"
# $1 = alignment_dir (where the .stats.txt files are)
# $2 = alignment_qc_dir (where the report will go)

ALIGN_DIR=$1
QC_OUT_DIR=$2

echo "--- 3. Running MultiQC on Alignment Stats ---"
echo "Scanning directory: $ALIGN_DIR"
echo "Outputting report to: $QC_OUT_DIR"

# -o : output directory
# ALIGN_DIR : target directory to scan
multiqc -o "$QC_OUT_DIR" "$ALIGN_DIR"

echo "--- Alignment MultiQC complete. ---"
echo "Check the 'multiqc_report.html' file in $QC_OUT_DIR"