# Project 2: M. tuberculosis Genome Assembly
## 02 - Genome Assembly with SPAdes

* **Author:** Youssef
* **Date:** 25-Oct-2025
* **Sample IDs:** `DRR749571` (Control), `DRR749572` (Resistant)

### Objective
This notebook performs *de novo* genome assembly from the trimmed reads produced in the previous step (01_QC). We use `SPAdes`, an assembler that builds De Bruijn graphs at several k-mer sizes and combines them into one set of contigs.

Each sample is assembled *separately* so that the two assemblies can be compared later.
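
To make the De Bruijn idea concrete, here is a minimal toy sketch, not SPAdes' actual implementation. The function name `de_bruijn_edges` and the two short reads are hypothetical and chosen only for illustration. Each k-mer contributes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and changing k changes the shape of the graph, which is why an assembler may try several values.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical toy reads, not taken from the TB samples.
toy_reads = ["ATGGCGT", "GGCGTGC"]

for k in (3, 5):
    graph = de_bruijn_edges(toy_reads, k)
    n_edges = sum(len(targets) for targets in graph.values())
    print(f"k={k}: {len(graph)} nodes with outgoing edges, {n_edges} distinct edges")
```

Small k values tend to produce more branching nodes (repeats collapse together), while large k values need longer exact overlaps between reads; combining several k values is a way to balance the two.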

### Tool
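* `SPAdes` (v4.0.0): A *de novo* genome assembler designed for small genomes such as bacterial isolates.
* **Key Parameters:**
    * `--careful`: Tries to reduce the number of mismatches and short indels in the final contigs.
    * `-k 21,33,55,77,99,127`: The k-mer sizes to assemble with. SPAdes requires odd values below 128, listed in ascending order.
    * `-t 4 -m 20`: Use 4 threads and set the memory limit to 20 GB.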
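
Because a SPAdes run is long, it can be worth checking the parameters before launching it. The sketch below is one way to do that in plain Python. The helper names `validate_kmers` and `build_spades_command` are hypothetical, and the checks encode only the constraints on `-k` (odd, below 128, ascending); they do not cover everything SPAdes itself validates.

```python
def validate_kmers(kmers):
    """Return True if every k is odd, between 1 and 127, and the list is strictly ascending."""
    odd_and_small = all(k % 2 == 1 and 0 < k < 128 for k in kmers)
    ascending = all(a < b for a, b in zip(kmers, kmers[1:]))
    return odd_and_small and ascending


def build_spades_command(reads_1, reads_2, out_dir, kmers, threads=4, memory_gb=20):
    """Assemble the argument list used in this notebook (hypothetical helper)."""
    if not validate_kmers(kmers):
        raise ValueError(f"Invalid k-mer list: {kmers}")
    return [
        "spades.py",
        "-1", reads_1,
        "-2", reads_2,
        "-o", out_dir,
        "-k", ",".join(str(k) for k in kmers),
        "--careful",
        "-t", str(threads),
        "-m", str(memory_gb),
    ]


command = build_spades_command(
    "../analysis/02_fastp_trimmed/DRR749571.trimmed_1.fastq.gz",
    "../analysis/02_fastp_trimmed/DRR749571.trimmed_2.fastq.gz",
    "../analysis/03_spades_assembly/DRR749571_assembly",
    [21, 33, 55, 77, 99, 127],
)
print(" ".join(command))
```

Building the command as a list keeps the two samples' runs identical except for the sample ID, which avoids copy-and-paste drift between cells.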

In [None]:
print("--- 1. Creating directories for SPAdes assembly ---")
# We create one output directory for each sample's assembly
!mkdir -p ../analysis/03_spades_assembly/DRR749571_assembly
!mkdir -p ../analysis/03_spades_assembly/DRR749572_assembly

print("Directories created:")
!ls -lR ../analysis/

In [None]:
print("--- 2. Starting SPAdes Assembly for DRR749571 (Control) ---")
# This will take a LONG time (potentially 30-60 minutes or more).
# We are running this inside 'screen' so it's safe if we disconnect.

!spades.py \
  -1 ../analysis/02_fastp_trimmed/DRR749571.trimmed_1.fastq.gz \
  -2 ../analysis/02_fastp_trimmed/DRR749571.trimmed_2.fastq.gz \
  -o ../analysis/03_spades_assembly/DRR749571_assembly \
  -k 21,33,55,77,99,127 \
  --careful \
  -t 4 \
  -m 20

print("--- SPAdes Assembly for DRR749571 COMPLETE ---")
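
Note that a `print` placed after a `!` shell command runs whether or not the command succeeded. A minimal sketch of one alternative is shown below: it uses `subprocess.run` and reports success only when the exit code is 0. The helper name `run_and_report` is hypothetical, and the demonstration uses a harmless Python one-liner in place of the real SPAdes command.

```python
import subprocess
import sys


def run_and_report(command, label):
    """Run a command and print COMPLETE only if it exits with status 0."""
    result = subprocess.run(command)
    if result.returncode == 0:
        print(f"--- {label} COMPLETE ---")
    else:
        print(f"--- {label} FAILED (exit code {result.returncode}) ---")
    return result.returncode


# Stand-in command so the sketch runs anywhere; for the real run, pass the
# spades.py argument list for the sample instead.
run_and_report([sys.executable, "-c", "print('pretend assembly')"], "Demo step")
```

With this approach the child process output may not appear in the notebook cell; SPAdes also writes its log to `spades.log` inside the output directory, which can be inspected afterwards.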

In [1]:
print("--- 3. Starting SPAdes Assembly for DRR749572 (Resistant) ---")
# This will also take a LONG time.

!spades.py \
  -1 ../analysis/02_fastp_trimmed/DRR749572.trimmed_1.fastq.gz \
  -2 ../analysis/02_fastp_trimmed/DRR749572.trimmed_2.fastq.gz \
  -o ../analysis/03_spades_assembly/DRR749572_assembly \
  -k 21,33,55,77,99,127 \
  --careful \
  -t 4 \
  -m 20

print("--- SPAdes Assembly for DRR749572 COMPLETE ---")

--- 3. Starting SPAdes Assembly for DRR749572 (Resistant) ---

Command line: /home/refm_youssef/mambaforge-pypy3/envs/assembly_env/bin/spades.py	-1	/home/refm_youssef/tb_genome_assembly/analysis/02_fastp_trimmed/DRR749572.trimmed_1.fastq.gz	-2	/home/refm_youssef/tb_genome_assembly/analysis/02_fastp_trimmed/DRR749572.trimmed_2.fastq.gz	-o	/home/refm_youssef/tb_genome_assembly/analysis/03_spades_assembly/DRR749572_assembly	-k	21,33,55,77,99,127	--careful	-t	4	-m	20	

System information:
  SPAdes version: 4.0.0
  Python version: 3.11.9
  OS: Linux-6.14.0-1017-gcp-x86_64-with-glibc2.39

Output dir: /home/refm_youssef/tb_genome_assembly/analysis/03_spades_assembly/DRR749572_assembly
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Standard mode
  For multi-cell/isolate data we recommend to use '--isolate' option; for single-cell MDA data use '--sc'; for metagenomic data use '--meta'; for RNA-Seq use '--rna'.
  Reads:
    Library number: 1, li

In [None]:
print("\n--- 4. Final Verification of Assembly Results ---")
print("Checking contents of the assembly directory:")
!ls -lR ../analysis/03_spades_assembly/

print("\n--- Looking for the final FASTA files ---")
# The '*' wildcard matches each sample's assembly sub-directory
!ls -lh ../analysis/03_spades_assembly/*/contigs.fasta
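
Listing the files only confirms that they exist. As a quick sanity check before the later comparison, the sketch below summarises each `contigs.fasta` with the contig count, total length, longest contig, and N50. The helper names `contig_lengths` and `n50` are hypothetical, the parser assumes an uncompressed plain-text FASTA file, and dedicated tools would give a fuller report.

```python
from pathlib import Path


def contig_lengths(fasta_path):
    """Return the sequence length of every record in a plain-text FASTA file."""
    lengths = []
    current = 0
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if it had any sequence.
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line)
    if current:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


for sample in ("DRR749571", "DRR749572"):
    fasta = Path("../analysis/03_spades_assembly") / f"{sample}_assembly" / "contigs.fasta"
    if fasta.exists():
        lengths = contig_lengths(fasta)
        print(f"{sample}: {len(lengths)} contigs, {sum(lengths)} bp total, "
              f"longest {max(lengths, default=0)} bp, N50 {n50(lengths)} bp")
    else:
        print(f"{sample}: contigs.fasta not found")
```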