
# Girardia dorotocephala HMW Nanopore Genome Assembly

# Pipeline - Complete Workflow

## Phase 1: Data Preparation & Quality Control

### Step 1: Combine Flow Cell Data

Concatenate FASTQ files from both flow cells into a single file. Check basic statistics like total read count
and file size to confirm successful combination.
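
The combination and sanity check can be sketched in a few lines of Python. This is a minimal sketch, assuming uncompressed FASTQ files with standard four-line records that each end in a newline; the function names are illustrative and not part of any existing tool.

```python
import shutil
from pathlib import Path


def combine_fastq(inputs, output):
    """Concatenate uncompressed FASTQ files byte-for-byte into `output`.

    Assumes every input file ends with a newline; otherwise the last
    record of one file would run into the first record of the next.
    """
    output = Path(output)
    with output.open("wb") as out_handle:
        for path in inputs:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return output


def count_reads(fastq_path):
    """Count reads, assuming standard four-line FASTQ records."""
    with open(fastq_path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{fastq_path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

The combined read count should equal the sum of the per-flow-cell counts, and the combined file size should equal the sum of the input sizes.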

### Step 2: Initial Quality Assessment

Run NanoPlot on the combined dataset to assess read length distribution, quality scores, and overall data
characteristics. Optionally check for adapter contamination using Porechop.

### Step 3: Read Filtering and Subsampling

Use Filtlong to filter reads based on quality and length, targeting 100-150 Gb (gigabases) of high-quality
sequence (approximately 50-100x coverage). Filter parameters: minimum length 1000 bp, minimum mean
quality 8, keep best 20-25% of reads.
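
The relationship between data volume and coverage is simple arithmetic, sketched below. The genome size of 1.5 Gb used in the example is an assumption for illustration only (the expected range for planarians given later in this document is 0.5-2 Gb), and the function names are hypothetical.

```python
def target_bases(genome_size_bp, coverage):
    """Number of bases needed to reach `coverage`x over the genome."""
    return genome_size_bp * coverage


def achieved_coverage(total_bases, genome_size_bp):
    """Coverage obtained from `total_bases` of sequence."""
    return total_bases / genome_size_bp


# Assumed genome size for illustration only: 1.5 Gb.
ASSUMED_GENOME_SIZE = 1_500_000_000

print(target_bases(ASSUMED_GENOME_SIZE, 75))                    # 112500000000
print(achieved_coverage(120_000_000_000, ASSUMED_GENOME_SIZE))  # 80.0
```

If the genome size estimate changes, the target data volume should be recomputed before filtering.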

### Step 4: Post-filtering Quality Control

Run NanoPlot again on filtered reads to confirm improved quality metrics and appropriate data volume
for assembly.

## Phase 2: Primary Genome Assembly

### Step 5: Flye Assembly (Primary Method)

Run the Flye genome assembler on the filtered nanopore reads. Use 48 CPU cores and 200 GB RAM, and allow
up to 3 days of runtime. Flye handles raw nanopore reads well and excels at repeat resolution.
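
As a rough illustration, the Flye invocation could be assembled as follows. This is a sketch under stated assumptions: the input and output names are hypothetical, `--nano-raw` is chosen because the step describes raw nanopore reads, and the memory and runtime limits are set through the job scheduler rather than through Flye itself.

```python
import shlex


def flye_command(reads, out_dir, threads=48):
    """Build a Flye command line for raw nanopore reads.

    Only the command is built here; pass the list to
    subprocess.run(cmd, check=True) to actually launch the assembly.
    """
    return [
        "flye",
        "--nano-raw", str(reads),
        "--out-dir", str(out_dir),
        "--threads", str(threads),
    ]


# Hypothetical file and directory names.
cmd = flye_command("filtered.fastq.gz", "flye_out")
print(shlex.join(cmd))
```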

### Step 6: Alternative Assembly - Canu (Backup)

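If Flye results are suboptimal, run a Canu assembly as a backup; if compute resources allow, it can be
started in parallel with Flye to save time. Canu performs read correction and trimming before assembly
and often performs well on complex genomes. It requires more memory (300 GB) and a longer runtime than
Flye.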

### Step 7: Initial Assembly Assessment

Run BUSCO analysis on both assemblies using eukaryota_odb10 lineage to assess gene completeness.
Calculate basic assembly statistics with QUAST. Compare assemblies and select the best performer.
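
One possible way to compare the two assemblies is to parse the one-line BUSCO summary and rank by completeness. The sketch below assumes the summary uses the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` notation; the scores in the example are invented for illustration and are not results from this project.

```python
import re

# Assumed one-line BUSCO summary notation.
BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Extract BUSCO percentages and the lineage size from summary text."""
    match = BUSCO_PATTERN.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["total"] = int(scores["total"])
    return scores


def pick_assembly(summaries):
    """Return the name with the highest completeness; ties go to lower duplication."""
    return max(
        summaries,
        key=lambda name: (summaries[name]["complete"], -summaries[name]["duplicated"]),
    )


# Invented example values, for illustration only.
example = {
    "flye": parse_busco_summary("C:91.0%[S:85.5%,D:5.5%],F:3.9%,M:5.1%,n:255"),
    "canu": parse_busco_summary("C:89.4%[S:80.0%,D:9.4%],F:4.7%,M:5.9%,n:255"),
}
print(pick_assembly(example))  # flye
```

BUSCO completeness alone does not settle the choice; contiguity and total length from QUAST should be weighed alongside it.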


## Phase 3: Assembly Polishing

### Step 8: First Round Medaka Polishing

Use Medaka for nanopore-specific error correction on the selected assembly. This corrects systematic
nanopore sequencing errors and improves consensus accuracy.

### Step 9: Second Round Racon Polishing

Run Racon polishing using original filtered reads mapped to the Medaka-polished assembly. This
provides additional consensus improvement through multiple alignment iterations.

### Step 10: Post-polishing Quality Assessment

Run BUSCO again on polished assembly to ensure polishing improved rather than degraded gene
completeness. Compare metrics with initial assembly.

## Phase 4: Assembly Refinement

### Step 11: Haplotig Purging

Use purge_dups to identify and remove duplicate sequences/haplotigs that may cause inflated assembly
size. Planarian genomes can have complex ploidy requiring careful duplicate removal.

### Step 12: Gap Filling (Optional)

If significant gaps remain, consider using TGS-GapCloser or similar tools to close gaps using long reads.

### Step 13: Scaffolding (Optional)

If assembly is highly fragmented, use LINKS or SSPACE-LongRead for scaffolding with long reads.

## Phase 5: Final Quality Control & Validation

### Step 14: Comprehensive Quality Assessment

Run final BUSCO analysis with multiple lineages (eukaryota, metazoa if available). Generate detailed
assembly statistics with QUAST including N50, L50, total length, and contig count.
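
For reference, the contiguity statistics named above can be computed directly from contig lengths. This is a minimal sketch for understanding the definitions, not a replacement for QUAST; the function names are illustrative.

```python
def contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                lengths.append(0)
            elif lengths:
                lengths[-1] += len(line.strip())
    return lengths


def assembly_stats(lengths):
    """Compute contig count, total length, N50, and L50.

    N50 is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half the total length.
    L50 is the number of contigs needed to reach that point.
    """
    if not lengths:
        raise ValueError("no contigs supplied")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for index, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            return {
                "contigs": len(ordered),
                "total_length": total,
                "n50": length,
                "l50": index,
            }


print(assembly_stats([100, 80, 50, 30, 20, 20]))
# {'contigs': 6, 'total_length': 300, 'n50': 80, 'l50': 2}
```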

### Step 15: Completeness Validation

Assess assembly completeness using additional metrics like LAI (LTR Assembly Index) if applicable, and
compare against expected genome size for planarians.

### Step 16: Contamination Screening

Screen for potential contamination using tools like BlobTools or Kraken2 to ensure assembly represents
target organism.

## Phase 6: Functional Annotation (Optional)

### Step 17: Repeat Annotation

Build a de novo repeat library with RepeatModeler and mask repetitive elements with RepeatMasker to
prepare for gene annotation.

### Step 18: Gene Prediction

Perform gene prediction using AUGUSTUS, GeneMark, or BRAKER2 for functional genome annotation.

## Phase 7: Data Management & Documentation

### Step 19: Final Assembly Preparation

Generate final assembly FASTA with standardized contig naming. Create assembly report with statistics,
methods, and quality metrics.
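
Standardized contig naming could be done along the following lines. This is a sketch with a hypothetical naming scheme (a prefix plus a zero-padded counter); the prefix and function name are assumptions, and the returned mapping should be kept with the assembly report so the original names remain traceable.

```python
def rename_contigs(in_fasta, out_fasta, prefix="contig"):
    """Rewrite FASTA headers as <prefix>_000001, <prefix>_000002, ...

    Returns a list of (old_name, new_name) pairs in file order.
    Sequence lines are copied unchanged.
    """
    mapping = []
    with open(in_fasta) as src, open(out_fasta, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                old_name = fields[0] if fields else ""
                new_name = f"{prefix}_{len(mapping) + 1:06d}"
                mapping.append((old_name, new_name))
                dst.write(f">{new_name}\n")
            else:
                dst.write(line)
    return mapping
```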

### Step 20: Data Archival

Organize all intermediate files, scripts, and results. Prepare data for potential public deposition (NCBI,
ENA) following community standards.

## Timeline Estimates

```
Phase 1: 1-2 days
Phase 2: 3-5 days
Phase 3: 1-2 days
Phase 4: 1-2 days
Phase 5: 1 day
Total: 1-2 weeks
```

## Critical Decision Points

1. **Assembly selection:** Choose between Flye and Canu based on BUSCO scores and assembly metrics
2. **Polishing iterations:** Continue polishing rounds until BUSCO completeness stops improving
3. **Purging stringency:** Balance between removing true duplicates and losing legitimate repetitive
    elements
4. **Quality thresholds:** Aim for >90% complete BUSCOs and appropriate assembly size for planarian
    genomes


## Success Metrics

```
BUSCO completeness >90%
Assembly size appropriate for planarian genomes (0.5-2 Gb)
N50 >100 kb (ideally >500 kb)
Minimal contamination
Low duplication after purging
```
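
The numeric criteria above can be checked mechanically once the final statistics are available. This is a minimal sketch with a hypothetical function name; contamination and duplication are left out because the list gives no numeric thresholds for them.

```python
def check_success(busco_complete_pct, assembly_size_bp, n50_bp):
    """Evaluate the numeric success metrics; returns one boolean per criterion."""
    return {
        "busco_complete_over_90pct": busco_complete_pct > 90.0,
        "size_within_0.5_to_2_gb": 500_000_000 <= assembly_size_bp <= 2_000_000_000,
        "n50_over_100kb": n50_bp > 100_000,
        "n50_over_500kb_ideal": n50_bp > 500_000,
    }


# Invented example values, for illustration only.
report = check_success(busco_complete_pct=92.5, assembly_size_bp=1_200_000_000, n50_bp=350_000)
for criterion, passed in report.items():
    print(f"{criterion}: {'pass' if passed else 'fail'}")
```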


