# Project 2: M. tuberculosis Genome Assembly
## 05 - Comparative Analysis: Hypothesis-Driven Mutation Discovery

* **Author:** Youssef Mimoune
* **Date:** 26-Oct-2025

### Objective
This notebook investigates the genetic cause of para-aminosalicylic acid (PAS) resistance in sample `DRR749572`, using `DRR749571` as the control.

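Our initial "unbiased" approach, running `diff` on the two `.faa` files, failed because the proteins appear in a different order in each file: SPAdes produced a different set of contigs for each sample, so neither the gene order nor the locus tag numbering lines up between the two assemblies.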
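
As a minimal sketch of why record order defeats a line-based `diff`, and of what an order-independent check could look like, the Python below compares two protein sets by sequence content instead of by position. The FASTA strings, record IDs, and function names (`read_fasta`, `unique_sequences`) are hypothetical toy values, not data from this project.

```python
def read_fasta(text):
    """Parse FASTA text into a dict of {record_id: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first word after '>'
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def unique_sequences(control, resistant):
    """Return IDs whose exact sequence is absent from the other sample."""
    control_seqs = set(control.values())
    resistant_seqs = set(resistant.values())
    only_control = sorted(i for i, s in control.items() if s not in resistant_seqs)
    only_resistant = sorted(i for i, s in resistant.items() if s not in control_seqs)
    return only_control, only_resistant


# Hypothetical toy data: same two proteins, different order, one substitution
control_faa = """>CTRL_00001 protein A
MKTAYIAK
>CTRL_00002 protein B
MSLLNQ
"""
resistant_faa = """>RES_00001 protein B
MSLLNQ
>RES_00002 protein A
MKTAYIAR
"""

only_control, only_resistant = unique_sequences(
    read_fasta(control_faa), read_fasta(resistant_faa)
)
print("Only in control:", only_control)
print("Only in resistant:", only_resistant)
```

Exact matching of this kind only shows that some sequence differs. It does not pair up corresponding genes between the two samples, which is why the candidates here are located by name or function instead.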

We will now pivot to a **hypothesis-driven approach**, testing the genes most commonly reported in the scientific literature to cause PAS resistance.

### Candidate Genes (Hypotheses)
1.  **`folC`**: Dihydrofolate synthase
2.  **`thyA`**: Thymidylate synthase
3.  **`ribD`**: Riboflavin biosynthesis protein

In [None]:
print("--- 1. Creating directory for differential analysis ---")
!mkdir -p ../analysis/06_diff_results

print("Directory created.")
!ls -l ../analysis/

---
### Hypothesis 1: `folC` Mutation Search

We first search the GFF files for the locus tag of `folC`. The gene name `folC` was not found in the `.faa` files, so we search for the product keyword `folate` instead.
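
A plain `grep` is case-sensitive and matches the keyword anywhere on the line. As a hedged alternative, the sketch below parses the GFF attribute column and reports the locus tag, gene name, and product for matching CDS features. The function name `find_locus_tags`, the `DEMO_` locus tags, and the toy GFF lines are hypothetical; only the general 9-column GFF3 layout and the `##FASTA` marker that Prokka appends are assumed.

```python
def find_locus_tags(gff_text, query):
    """Return (locus_tag, gene, product) for CDS rows matching query (case-insensitive)."""
    hits = []
    needle = query.lower()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # contig sequences follow this marker; no more annotations
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9 or columns[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by ';'
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        gene = attributes.get("gene", "")
        product = attributes.get("product", "")
        if needle in gene.lower() or needle in product.lower():
            hits.append((attributes.get("locus_tag", ""), gene, product))
    return hits


# Hypothetical toy GFF (tab-separated), not real annotation output
toy_gff = "\n".join([
    "##gff-version 3",
    "\t".join(["contig_1", "Prodigal", "CDS", "100", "900", ".", "+", "0",
               "ID=DEMO_00001;gene=thyA;locus_tag=DEMO_00001;product=Thymidylate synthase"]),
    "\t".join(["contig_1", "Prodigal", "CDS", "1200", "2600", ".", "-", "0",
               "ID=DEMO_00002;gene=folC;locus_tag=DEMO_00002;product=Dihydrofolate synthase"]),
    "##FASTA",
    ">contig_1",
    "ATGC",
])

for locus_tag, gene, product in find_locus_tags(toy_gff, "folate"):
    print(locus_tag, gene, product)
```

To use it on a real file, the text could be read with `open(path).read()` from the same `.gff` paths used in the cells below.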

In [None]:
print("--- 1. Testing Hypothesis 1: Checking gene 'folC' ---")
print("--- 'folC' name not found in .faa. Searching in GFF for 'folate' instead... ---")

print("\n--- Control Sample (DRR749571) ---")
# Search the .gff file for the function "folate"
!grep "folate" ../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.gff

print("\n--- Resistant Sample (DRR749572) ---")
!grep "folate" ../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.gff

In [None]:
print("--- 1. Testing Hypothesis 1: Extracting 'folC' using its Locus Tag ---")
print("\n--- Control Sample (DRR749571) 'folC' Protein Sequence: ---")

# Search for the locus tag found in the GFF; -A 10 also prints the 10 lines after the header
!grep -A 10 "IBGOOOGP_02999" ../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.faa

In [None]:
print("\n--- Resistant Sample (DRR749572) 'folC' Protein Sequence: ---")

# We use the specific Locus Tag for *this* sample's annotation
!grep -A 10 "JCLEOGND_02856" ../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.faa

---
### Hypothesis 2: `thyA` Mutation Search

The `folC` protein sequences were identical in both samples. We now proceed to the second candidate, `thyA`, and search the GFF files for its locus tag.
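
One caveat for the extraction cells in this notebook: `grep -A 10` prints a fixed number of lines after the header, so a long protein can be cut off and a short one is followed by lines from the next record. A minimal sketch of extracting exactly one complete record by its ID is shown below. The function name `extract_record` and the `DEMO_` records are hypothetical toy values.

```python
def extract_record(fasta_text, locus_tag):
    """Return the full sequence of the record whose ID equals locus_tag."""
    found = False
    sequence_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if found:
                break  # the next record starts here, so the target is complete
            fields = line[1:].split()
            found = bool(fields) and fields[0] == locus_tag
        elif found:
            sequence_lines.append(line.strip())
    if not found:
        raise KeyError(f"{locus_tag} not found in FASTA text")
    return "".join(sequence_lines)


# Hypothetical toy FASTA with records of different lengths
toy_faa = """>DEMO_00001 short protein
MKT
>DEMO_00002 longer protein
MSLLNQAAGT
HHKLV
>DEMO_00003 another protein
MPPQ
"""

print(extract_record(toy_faa, "DEMO_00002"))
```

With both samples loaded this way, the comparison reduces to `control_seq == resistant_seq`, independent of how many lines each record spans.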

In [None]:
print("--- 2. Hypothesis 1 not supported (folC identical). Testing Hypothesis 2: Checking gene 'thyA' ---")
print("--- Searching GFF for 'thyA' Locus Tags... ---")

print("\n--- Control Sample (DRR749571) ---")
!grep "thyA" ../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.gff

print("\n--- Resistant Sample (DRR749572) ---")
!grep "thyA" ../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.gff

In [None]:
print("--- 2. Extracting 'thyA' (Hypothesis 2) using its Locus Tag ---")
print("\n--- Control Sample (DRR749571) 'thyA' Protein Sequence: ---")

# We are searching for the specific Locus Tag for thyA
!grep -A 10 "IBGOOOGP_02304" ../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.faa

In [None]:
print("\n--- Resistant Sample (DRR749572) 'thyA' Protein Sequence: ---")

!grep -A 10 "JCLEOGND_02130" ../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.faa

---
### Hypothesis 3: `ribD` Mutation Search

The `thyA` protein sequences were also identical. We now proceed to the third candidate, `ribD`, and search the GFF files for its locus tag.
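
A result like "identical" can also be checked programmatically instead of by eye. The sketch below lists amino acid substitutions between two equal-length protein sequences in the usual `K8R` style (reference residue, 1-based position, new residue). The function name `list_substitutions` and the sequences are hypothetical; sequences of different lengths would need a proper alignment first, which this sketch does not attempt.

```python
def list_substitutions(reference, query):
    """List substitutions between two equal-length sequences, e.g. 'K8R'."""
    if len(reference) != len(query):
        raise ValueError("Sequences differ in length; align them before comparing")
    return [
        f"{ref}{position}{alt}"
        for position, (ref, alt) in enumerate(zip(reference, query), start=1)
        if ref != alt
    ]


# Hypothetical toy sequences
control_seq = "MKTAYIAK"
resistant_seq = "MKTAYIAR"

print(list_substitutions(control_seq, resistant_seq))
print(list_substitutions(control_seq, control_seq))
```

An empty list corresponds to the "identical" outcome reported for the candidates in this notebook.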

In [None]:
print("--- 3. Hypotheses 1 and 2 not supported. Testing Hypothesis 3: Checking gene 'ribD' ---")
print("--- Searching GFF for 'ribD' Locus Tags... ---")

print("\n--- Control Sample (DRR749571) ---")
!grep "ribD" ../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.gff

print("\n--- Resistant Sample (DRR749572) ---")
!grep "ribD" ../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.gff

In [None]:
print("--- 3. Extracting 'ribD' (Hypothesis 3) using its Locus Tag ---")
print("\n--- Control Sample (DRR749571) 'ribD' Protein Sequence: ---")

# We are searching for the specific Locus Tag for ribD
!grep -A 10 "IBGOOOGP_00101" ../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.faa

In [None]:
print("\n--- Resistant Sample (DRR749572) 'ribD' Protein Sequence: ---")

!grep -A 10 "JCLEOGND_00159" ../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.faa

---
### Investigation Conclusion

All three major candidate genes (`folC`, `thyA`, and `ribD`) were found to be **identical** at the protein level between the control and resistant strains.
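
Protein-level identity is a coarser test than nucleotide-level identity, because different codons can encode the same amino acid. The toy example below illustrates this with a deliberately tiny codon table. The DNA strings and the `translate` helper are hypothetical; for real data, the nucleotide sequences could be taken from Prokka's `.ffn` files, assuming they were kept alongside the `.faa` files.

```python
# Tiny subset of the standard genetic code, enough for this illustration only
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A",
    "GCC": "A",
    "AAA": "K",
    "AAG": "K",
    "TAA": "*",  # stop codon
}


def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical sequences differing by one synonymous change (GCT -> GCC)
control_dna = "ATGGCTAAATAA"
resistant_dna = "ATGGCCAAATAA"

print(control_dna == resistant_dna)
print(translate(control_dna) == translate(resistant_dna))
```

The two DNA strings differ while their translations match, so a protein comparison alone would not reveal this kind of change.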

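This is a notable result: the resistance of `DRR749572` is not explained by amino acid changes in these three proteins. It points to a **non-canonical (novel or less common)** resistance mechanism, or to changes that a protein-level comparison cannot detect, such as mutations in promoters or other non-coding regions.

The simple hypothesis-driven approach has therefore not identified a cause. We will now proceed to a full, unbiased pangenome analysis using a tool (`Roary`) that can compare all ~4000 genes regardless of their order in the assembly.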
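
As a rough sketch of that next step, the helper below only assembles a Roary command line from the two Prokka GFF files used in this notebook and does not run it. The `-f` (output directory) and `-p` (threads) options are standard Roary flags; the output directory `../analysis/07_roary_output`, the thread count, and the function name `build_roary_command` are assumptions made for illustration.

```python
import shlex


def build_roary_command(gff_files, output_dir, threads=4):
    """Build the Roary argument list for the given GFF files."""
    if len(gff_files) < 2:
        raise ValueError("A pangenome comparison needs at least two GFF files")
    return ["roary", "-f", output_dir, "-p", str(threads), *gff_files]


gff_files = [
    "../analysis/05_prokka_annotation/DRR749571_annotation/DRR749571_control.gff",
    "../analysis/05_prokka_annotation/DRR749572_annotation/DRR749572_resistant.gff",
]

# Hypothetical output location, chosen to follow the numbered directory scheme
command = build_roary_command(gff_files, "../analysis/07_roary_output")
print(shlex.join(command))
```

The printed line can be pasted into a notebook cell after `!`, or the list can be passed to `subprocess.run` once Roary is installed.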