# RNA-Seq Pipeline: Step 4 - Read Counting (featureCounts)

This notebook performs "read counting" or "quantification".

**Goal:**
We need to count how many of our aligned reads (from the `.bam` files) actually overlap with known genes (from the `.gff` annotation file).

The output will be a single file: a **"counts matrix"**. This is a simple table where:
* Rows are **Genes** (e.g., `gene-KPHS_00010`)
* Columns are our **Samples** (e.g., `03_Aligned_BAMs/SRR34134104.sorted.bam`; featureCounts labels each column with the BAM path given on the command line)
* Values are the **number of reads** that mapped to that gene in that sample.

This counts matrix is the final input for the statistical analysis (Step 5).

**Tool:** `featureCounts` (from the `subread` package)
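
To make "counting" concrete, here is a toy sketch of assigning reads to genes by coordinate overlap. It is an illustration only and not how featureCounts is implemented: the `Gene` and `Read` tuples, the coordinates, and the `count_reads` helper are all hypothetical, reads are treated as single-end, and a read that overlaps more than one gene is left unassigned (featureCounts also skips such ambiguous reads by default).

```python
from collections import Counter
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


class Read(NamedTuple):
    start: int
    end: int
    strand: str


def overlaps(read: Read, gene: Gene) -> bool:
    """True if the read and the gene share at least one base."""
    return read.start <= gene.end and gene.start <= read.end


def count_reads(reads, genes, reverse_stranded=True):
    """Count reads per gene; ambiguous or unmatched reads are skipped."""
    counts = Counter({g.gene_id: 0 for g in genes})
    for read in reads:
        hits = []
        for gene in genes:
            if not overlaps(read, gene):
                continue
            same_strand = read.strand == gene.strand
            # Reverse-stranded: keep reads on the opposite strand to the gene.
            # Forward-stranded: keep reads on the same strand as the gene.
            if same_strand != reverse_stranded:
                hits.append(gene.gene_id)
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


# Hypothetical genes and reads, invented for this example.
genes = [Gene("geneA", 100, 200, "-"), Gene("geneB", 300, 400, "+")]
reads = [
    Read(150, 180, "+"),  # inside geneA, opposite strand
    Read(190, 220, "+"),  # partially overlaps geneA, opposite strand
    Read(310, 350, "+"),  # inside geneB, but on the same strand
    Read(500, 530, "-"),  # overlaps no gene
]
toy_counts = count_reads(reads, genes)
print(toy_counts)
```

With `reverse_stranded=True` only reads on the strand opposite to the gene are counted, which is the idea behind the `-s 2` setting used later in this notebook.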

In [1]:
import os

# --- Define Core Paths ---

# 1. Input: Aligned/Sorted BAM files (from Notebook 03)
alignment_dir = "03_Aligned_BAMs"

# 2. Input: Reference Genome Annotation (from Notebook 01/03)
genome_dir = "06_Genome_Index"
# The annotation downloaded in Notebook 03 is gzip-compressed ("genome.gff.gz"),
# so the path must include the .gz extension.
annotation_file = os.path.join(genome_dir, "genome.gff.gz")

# 3. Output: The counts data
counts_dir = "04_Counts"
os.makedirs(counts_dir, exist_ok=True)
# This is the final matrix file we will create
counts_matrix_file = os.path.join(counts_dir, "gene_counts_matrix.txt")

print(f"Annotation File (Input): {annotation_file}")
print(f"BAM Directory (Input): {alignment_dir}")
print(f"Counts Matrix (Output): {counts_matrix_file}")

Annotation File (Input): 06_Genome_Index/genome.gff.gz
BAM Directory (Input): 03_Aligned_BAMs
Counts Matrix (Output): 04_Counts/gene_counts_matrix.txt
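
Before launching the counting step, it can help to confirm that the inputs actually exist. The sketch below is one possible check; `check_inputs` is a hypothetical helper written for this example, not part of the pipeline.

```python
import glob
import os


def check_inputs(annotation_path, bam_dir, pattern="*.sorted.bam"):
    """Return a list of problems found; an empty list means the inputs look fine."""
    problems = []
    if not os.path.isfile(annotation_path):
        problems.append(f"Annotation file not found: {annotation_path}")
    if not os.path.isdir(bam_dir):
        problems.append(f"BAM directory not found: {bam_dir}")
    elif not glob.glob(os.path.join(bam_dir, pattern)):
        problems.append(f"No files matching {pattern} in {bam_dir}")
    return problems


# Paths copied from the cell above; adjust them if your layout differs.
for problem in check_inputs("06_Genome_Index/genome.gff.gz", "03_Aligned_BAMs"):
    print("WARNING:", problem)
```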


In [2]:
import os
import glob # This library is good at finding files

print("--- 1. Running featureCounts ---")
print(f"Annotation: {annotation_file}")
print(f"Output File: {counts_matrix_file}")

# Find all sorted BAM files with glob (sorted so the column order is reproducible)
bam_pattern = os.path.join(alignment_dir, "*.sorted.bam")
bam_files_list = sorted(glob.glob(bam_pattern))

if not bam_files_list:
    print(f"Error: No '.sorted.bam' files found in {alignment_dir}")
else:
    # Convert the Python list into a space-separated string
    bam_files_string = " ".join(bam_files_list)
    
    print("\nFound BAM Files to process:")
    for f in bam_files_list:
        print(f" - {f}")
    
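    # featureCounts options used below:
    #   -T 8     use 8 threads
    #   -p       input is paired-end
    #   -t gene  count over 'gene' features in the GFF
    #   -g ID    group features by their 'ID' attribute
    #   -s 2     reverse-stranded library (dUTP protocol)
    # NOTE: in Subread 2.0.2 and later, -p only declares the input as paired-end;
    # add --countReadPairs if fragments (read pairs) should be counted instead of reads.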
    command = f"""
    featureCounts \
        -T 8 \
        -p \
        -t gene \
        -g ID \
        -s 2 \
        -a {annotation_file} \
        -o {counts_matrix_file} \
        {bam_files_string}
    """
    
    print("\nExecuting command (with -s 2 for dUTP protocol)...")
    !{command}
    
    print("\n--- featureCounts complete. ---")
    print(f"A summary was printed above, and the full matrix is in {counts_matrix_file}")

--- 1. Running featureCounts ---
Annotation: 06_Genome_Index/genome.gff.gz
Output File: 04_Counts/gene_counts_matrix.txt

Found BAM Files to process:
 - 03_Aligned_BAMs/SRR34134104.sorted.bam
 - 03_Aligned_BAMs/SRR34134105.sorted.bam
 - 03_Aligned_BAMs/SRR34134106.sorted.bam
 - 03_Aligned_BAMs/SRR34134107.sorted.bam
 - 03_Aligned_BAMs/SRR34134108.sorted.bam
 - 03_Aligned_BAMs/SRR34134109.sorted.bam

Executing command (with -s 2 for dUTP protocol)...

        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v2.1.1

||                                                                            ||
||             Input files : 6 BAM files
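
Alongside the counts file, featureCounts writes a tab-separated summary named `<output>.summary` (here that would be `04_Counts/gene_counts_matrix.txt.summary`), with a `Status` column followed by one column per BAM file. A minimal sketch for turning that table into a per-sample assignment rate is shown below; `assignment_rates` is a hypothetical helper and the demo numbers are invented.

```python
import csv
import io


def assignment_rates(summary_handle):
    """Return {sample: assigned / total} from a featureCounts .summary table."""
    rows = list(csv.reader(summary_handle, delimiter="\t"))
    samples = rows[0][1:]
    totals = [0] * len(samples)
    assigned = [0] * len(samples)
    for row in rows[1:]:
        if not row:
            continue  # tolerate blank lines
        status, values = row[0], row[1:]
        for i, value in enumerate(values):
            totals[i] += int(value)
            if status == "Assigned":
                assigned[i] = int(value)
    return {s: (a / t if t else 0.0) for s, a, t in zip(samples, assigned, totals)}


# Invented numbers, laid out like a .summary file.
demo = io.StringIO(
    "Status\tsampleA.bam\tsampleB.bam\n"
    "Assigned\t800\t450\n"
    "Unassigned_NoFeatures\t150\t500\n"
    "Unassigned_Ambiguity\t50\t50\n"
)
rates = assignment_rates(demo)
print(rates)
```

On the real file this could be called as `assignment_rates(open(counts_matrix_file + ".summary"))`. An unexpectedly low rate is often a hint that the strandedness setting (`-s`) does not match the library protocol.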

### Verify the Output "Counts Matrix"

**Goal:** Inspect the final `gene_counts_matrix.txt` file.

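**Why:** We must verify that the file has the tabular layout we expect before handing it to `DESeq2` (in R). We will use `!head` to look at the first 10 lines. Note that `DESeq2` needs only the gene IDs and the counts, so the annotation columns described below have to be dropped when the file is loaded in Step 5.

**Expected Format:**
* The first line is a comment (starting with `#`) recording the program version and the command that was run.
* Next comes a "Header" line: `Geneid`, five annotation columns (`Chr`, `Start`, `End`, `Strand`, `Length`), then our 6 sample names.
* Each following row is a gene ID, its five annotation values, and 6 "count" numbers.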
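
The layout described above can be reduced to a plain gene-by-sample table with a few lines of standard-library Python. This is a sketch under assumptions: `load_counts` is a hypothetical helper, the demo rows are invented, and the `.sorted.bam` suffix is stripped only to shorten the sample names.

```python
import csv
import io
import os

ANNOTATION_COLUMNS = 5  # Chr, Start, End, Strand, Length follow Geneid


def load_counts(handle, suffix=".sorted.bam"):
    """Parse featureCounts output into (sample_names, {gene_id: [counts]})."""
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    first_sample = 1 + ANNOTATION_COLUMNS
    samples = []
    for path in header[first_sample:]:
        name = os.path.basename(path)
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        samples.append(name)
    counts = {row[0]: [int(v) for v in row[first_sample:]] for row in reader if row}
    return samples, counts


# Invented rows, laid out like the featureCounts output.
demo = io.StringIO(
    "# Program:featureCounts; Command:...\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tbams/s1.sorted.bam\tbams/s2.sorted.bam\n"
    "geneA\tchr1\t1\t90\t+\t90\t12\t7\n"
    "geneB\tchr1\t100\t400\t-\t301\t0\t33\n"
)
samples, counts = load_counts(demo)
print(samples)
print(counts)
```

On the real file, `load_counts(open(counts_matrix_file))` would be the equivalent call.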

In [3]:
# --- Verification ---
# We use the 'counts_matrix_file' variable defined in Cell 1
    
print(f"--- Verification: Checking the first 10 lines of the final matrix: ---")
print(f"File: {counts_matrix_file}\n")
    
# !head -n 10: Show the first 10 lines
!head -n 10 {counts_matrix_file}

--- Verification: Checking the first 10 lines of the final matrix: ---
File: 04_Counts/gene_counts_matrix.txt

# Program:featureCounts v2.1.1; Command:"featureCounts" "-T" "8" "-p" "-t" "gene" "-g" "ID" "-s" "2" "-a" "06_Genome_Index/genome.gff.gz" "-o" "04_Counts/gene_counts_matrix.txt" "03_Aligned_BAMs/SRR34134104.sorted.bam" "03_Aligned_BAMs/SRR34134105.sorted.bam" "03_Aligned_BAMs/SRR34134106.sorted.bam" "03_Aligned_BAMs/SRR34134107.sorted.bam" "03_Aligned_BAMs/SRR34134108.sorted.bam" "03_Aligned_BAMs/SRR34134109.sorted.bam" 
Geneid	Chr	Start	End	Strand	Length	03_Aligned_BAMs/SRR34134104.sorted.bam	03_Aligned_BAMs/SRR34134105.sorted.bam	03_Aligned_BAMs/SRR34134106.sorted.bam	03_Aligned_BAMs/SRR34134107.sorted.bam	03_Aligned_BAMs/SRR34134108.sorted.bam	03_Aligned_BAMs/SRR34134109.sorted.bam
gene-KPHS_00010	CP003200.1	382	822	-	441	4481	2603	3144	2181	1933	2101
gene-KPHS_00020	CP003200.1	922	1380	-	459	2926	1908	1856	906	1021	1021
gene-KPHS_00030	CP003200.1	1532	2524	+	993	28409	7842
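
Because `head` only shows the first few lines, a fuller check could confirm that every row has as many tab-separated fields as the header. The sketch below does that; `find_malformed_rows` is a hypothetical helper and the demo content is invented.

```python
import io


def find_malformed_rows(handle):
    """Return (line_number, n_fields) for rows whose width differs from the header."""
    expected = None
    bad = []
    for line_number, line in enumerate(handle, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip the comment line and blank lines
        n_fields = len(line.rstrip("\n").split("\t"))
        if expected is None:
            expected = n_fields  # the first non-comment line is the header
        elif n_fields != expected:
            bad.append((line_number, n_fields))
    return bad


# Invented content: the last row is missing one count.
demo = io.StringIO(
    "# comment line\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\ts1\ts2\n"
    "geneA\tchr1\t1\t90\t+\t90\t12\t7\n"
    "geneB\tchr1\t100\t400\t-\t301\t5\n"
)
bad_rows = find_malformed_rows(demo)
print(bad_rows)
```

An empty list from `find_malformed_rows(open(counts_matrix_file))` would indicate that every gene row matches the header width.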

## Conclusion & Handoff

**Status:** Success.

**Analysis:**
1.  **Recipe Success:** The `featureCounts` command (Cell 2) ran successfully on all 6 BAM files using the correct annotation file (`genome.gff.gz`).
2.  **Verification (Cell 3):** The verification step confirms that our final output, `gene_counts_matrix.txt`, has the expected layout (Rows=Genes, Columns=annotation fields followed by Samples).

**Final Product:**
We have successfully produced the "Counts Matrix" (`gene_counts_matrix.txt`) in the `04_Counts/` directory. This is the final data input required for the statistical analysis.

**Next Step (The Handoff):**
We are now ready to proceed to the final phase as defined in the `README.md`:
**`05_DGE_Analysis.ipynb`** (where we will use R/DESeq2 to find the differentially expressed genes).