# RNA-Seq Pipeline: Step 4 - Read Counting (featureCounts)

This notebook performs read counting, also called quantification.

**Goal:**
We need to count how many of our aligned reads (from the `.bam` files) actually overlap with known genes (from the `.gff` annotation file).

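The main output will be a single tab-separated file: a **"counts matrix"**. After a few annotation columns (`Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`), this is a simple table where:
* Rows are **Genes** (e.g., `gene-KPN_00001`)
* Columns are our **Samples**, named after the BAM path passed to `featureCounts` (e.g., `03_Aligned_BAMs/SRR34134104.sorted.bam`)
* Values are the **number of reads** assigned to that gene in that sample.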

This counts matrix is the input for the statistical analysis (Step 5).
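
As a rough illustration of that table, the sketch below parses featureCounts-style output using only the standard library. It assumes the usual layout: a leading `#` comment line, then a header with six annotation columns (`Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`) before the per-sample counts. The function name `read_counts_matrix` and the two-gene example table are made up for illustration.

```python
import csv
import io


def read_counts_matrix(handle):
    """Parse featureCounts-style output into (samples, {gene_id: {sample: count}})."""
    # Skip '#' comment lines (featureCounts records the program call there).
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    # Assumption: columns 0-5 are Geneid, Chr, Start, End, Strand, Length.
    samples = header[6:]
    counts = {}
    for row in rows:
        if not row:
            continue
        counts[row[0]] = {s: int(v) for s, v in zip(samples, row[6:])}
    return samples, counts


# Hypothetical two-gene, two-sample table (not real data).
example = (
    "# comment line written by featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tA.sorted.bam\tB.sorted.bam\n"
    "gene-X1\tchr\t1\t100\t+\t100\t12\t7\n"
    "gene-X2\tchr\t200\t500\t-\t301\t0\t31\n"
)
samples, counts = read_counts_matrix(io.StringIO(example))
print(samples)
print(counts["gene-X2"])
```

On the real output you would pass `open(counts_matrix_file)` instead of the in-memory example.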

**Tool:** `featureCounts` (from the `subread` package)

In [None]:
import os

# --- Define Core Paths ---

# 1. Input: Aligned/Sorted BAM files (from Notebook 03)
alignment_dir = "03_Aligned_BAMs"

# 2. Input: Reference Genome Annotation (from Notebook 01)
genome_dir = "06_Genome_Index"
# We need the GFF (General Feature Format) file, which contains the gene coordinates
annotation_file = os.path.join(genome_dir, "genes.gff")

# 3. Output: The counts data
counts_dir = "04_Counts"
os.makedirs(counts_dir, exist_ok=True)
# This is the final matrix file we will create
counts_matrix_file = os.path.join(counts_dir, "gene_counts_matrix.txt")

print(f"Annotation File (Input): {annotation_file}")
print(f"BAM Directory (Input): {alignment_dir}")
print(f"Counts Matrix (Output): {counts_matrix_file}")
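
Before counting, it can help to confirm that the annotation actually contains `gene` features carrying an `ID` attribute, since those are the feature type and attribute the `featureCounts` call below relies on. The following is a minimal sketch, assuming a GFF3-style file with nine tab-separated columns and `key=value` attributes; the helper name `gene_ids_from_gff` and the example lines are hypothetical.

```python
def gene_ids_from_gff(lines, feature_type="gene", attribute="ID"):
    """Collect `attribute` values from rows whose third column equals `feature_type`."""
    ids = []
    for line in lines:
        # Skip header/directive lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Rows with fewer than nine columns (e.g. embedded sequence) are ignored.
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        # Column 9 holds semicolon-separated key=value pairs in GFF3.
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if attribute in attrs:
            ids.append(attrs[attribute])
    return ids


# Made-up GFF3 lines for illustration only.
example_gff = [
    "##gff-version 3\n",
    "chr1\tRefSeq\tgene\t1\t900\t.\t+\t.\tID=gene-X1;Name=abc\n",
    "chr1\tRefSeq\tCDS\t1\t900\t.\t+\t0\tID=cds-X1;Parent=gene-X1\n",
]
print(gene_ids_from_gff(example_gff))
```

In the notebook, passing `open(annotation_file)` in place of `example_gff` and checking that the returned list is non-empty would catch a mismatched feature type or attribute name early.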

In [None]:
import os
import glob  # Finds files by wildcard pattern

print("--- 1. Running featureCounts ---")
print(f"Annotation: {annotation_file}")
print(f"Output File: {counts_matrix_file}")

# --- Find the sorted BAM files with glob ---
bam_pattern = os.path.join(alignment_dir, "*.sorted.bam")
bam_files_list = sorted(glob.glob(bam_pattern))

if not bam_files_list:
    print(f"Error: No '.sorted.bam' files found in {alignment_dir}")
else:
    # Convert the Python list into a space-separated string
    bam_files_string = " ".join(bam_files_list)
    
    print("\nFound BAM Files to process:")
    for f in bam_files_list:
        print(f" - {f}")
    
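    # --- Build the command; '-s 2' because the dUTP protocol is reverse-stranded ---
    # Flags: -T threads, -p paired-end input, -t feature type (GFF column 3),
    #        -g attribute used as the gene identifier, -s strandedness.
    # Note: newer Subread releases (2.0.2 onward) also need '--countReadPairs'
    #       alongside '-p' to count fragments rather than individual reads.
    command = f"""
    featureCounts \
        -T 8 \
        -p \
        -t gene \
        -g ID \
        -s 2 \
        -a {annotation_file} \
        -o {counts_matrix_file} \
        {bam_files_string}
    """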
    
    print("\nExecuting command (with -s 2 for dUTP protocol)...")
    !{command}
    
    print("\n--- featureCounts complete. ---")
    print(f"A summary was printed above and saved to {counts_matrix_file}.summary; the full matrix is in {counts_matrix_file}")
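
Alongside the matrix, featureCounts writes a `<output>.summary` file with one row per assignment status and one column per BAM file. The sketch below is one way to turn that into a per-sample assignment rate, which is a quick check that the strandedness setting was sensible. It assumes the tab-separated layout with a `Status` header column and an `Assigned` row; the function name `assignment_rates` and the numbers in the example are invented for illustration.

```python
import csv
import io


def assignment_rates(handle):
    """Return {sample: assigned / total} from a featureCounts-style summary table."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    samples = rows[0][1:]  # first column is the 'Status' label
    totals = [0] * len(samples)
    assigned = [0] * len(samples)
    for row in rows[1:]:
        for i, value in enumerate(row[1:]):
            totals[i] += int(value)
            if row[0] == "Assigned":
                assigned[i] = int(value)
    return {
        s: (a / t if t else 0.0) for s, a, t in zip(samples, assigned, totals)
    }


# Invented numbers, not real results.
example_summary = (
    "Status\tA.sorted.bam\tB.sorted.bam\n"
    "Assigned\t80\t50\n"
    "Unassigned_NoFeatures\t15\t30\n"
    "Unassigned_Ambiguity\t5\t20\n"
)
rates = assignment_rates(io.StringIO(example_summary))
for sample, rate in rates.items():
    print(f"{sample}: {rate:.1%} assigned")
```

With the real run, `open(counts_matrix_file + ".summary")` would replace the in-memory example. A very low assigned fraction across all samples is often a hint that the `-s` value does not match the library protocol.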