### Step 1: Combine Flow Cell Data

Concatenate FASTQ files from both flow cells into a single file. Check basic statistics like total read count
and file size to confirm successful combination.
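
As a rough sketch of the combine-and-check step, the Python below concatenates gzipped FASTQ files byte for byte (a sequence of gzip members is itself a valid gzip stream) and then reports the read count and file size. The function names and the file names in the usage note are hypothetical, and the read count assumes standard four-line FASTQ records.

```python
import gzip
import os
import shutil


def combine_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files without recompressing them."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def fastq_stats(path):
    """Return (read_count, size_in_bytes), assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4, os.path.getsize(path)
```

For example, `combine_fastq_gz(["flowcell_A.fastq.gz", "flowcell_B.fastq.gz"], "combined.fastq.gz")` followed by `fastq_stats("combined.fastq.gz")` (placeholder file names). The combined read count should equal the sum of the per-flow-cell counts.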

In [None]:
!cd reads && sh sample_reads.sh

Writes 1 million subsampled reads to `/data/groups/wheelenj/sequencing/20250916_M009242/planaria_test_subset.fastq.gz`

### Step 2: Initial Quality Assessment

Run NanoPlot on the combined dataset to assess read length distribution, quality scores, and overall data
characteristics. Optionally check for adapter contamination using Porechop.

In [None]:
!mkdir -p fastq_planaria/qc
!NanoPlot --fastq /data/groups/wheelenj/sequencing/20250916_M009242/planaria_test_subset.fastq.gz -o fastq_planaria/qc

In [None]:
!multiqc fastq_planaria/.

### Step 3: Read Filtering and Subsampling

Use Filtlong to filter reads by quality and length, targeting 100-150 GB of high-quality data
(approximately 50-100x coverage). Planned filter parameters: minimum length 1000 bp, minimum mean quality 8,
keep best 20-25% of reads. Note that the command below, run on the test subset, uses `--keep_percent 80`
rather than 20-25%.

In [1]:
!mkdir trimmed_fastq
!filtlong --min_length 1000 --keep_percent 80 --min_mean_q 8 /data/groups/wheelenj/sequencing/20250916_M009242/planaria_test_subset.fastq.gz | gzip > trimmed_fastq/trimmed_reads.fastq.gz

mkdir: cannot create directory ‘trimmed_fastq’: File exists

Scoring long reads
  1,000,000 reads (6,122,479,215 bp)

### Step 4: Post-Filtering Quality Control

Run NanoPlot on the filtered reads to assess read length distribution, quality scores, and overall data
characteristics, and compare them with the pre-filtering report from Step 2. Optionally check for adapter
contamination using Porechop.

In [2]:
!NanoPlot --fastq trimmed_fastq/trimmed_reads.fastq.gz -o trimmed_fastq/qc --plots kde hex

This command requires Kaleido v1.0.0 or greater.
Install it using `pip install 'kaleido>=1.0.0'` or `pip install 'plotly[kaleido]'`.

In [3]:
!multiqc fastq_planaria/. trimmed_fastq/.


/// MultiQC 🎃 v1.31

     version_check | MultiQC Version v1.32 now available!
       file_search | Search path: /data/users/willetse0745/Planaria-Genome-Project/fastq_planaria
       file_search | Search path: /data/users/willetse0745/Planaria-Genome-Project/trimmed_fastq
         searching | 100% 34/34
          nanostat | Found 1 reports
     write_results | Existing reports found, adding suffix to filenames. Use '--force' to overwrite.
     write_results | Data        : multiqc_data_1_2
     write_results | Report      : multiqc_report_2.html
           multiqc | MultiQC complete


### Step 5: Flye Assembly

Run the Flye genome assembler on the filtered nanopore reads. The planned resources are 48 CPU cores,
200 GB RAM, and up to 3 days of runtime; the test command below requests 64 threads. Flye handles raw
nanopore reads well and excels at repeat resolution.

In [7]:
!mkdir test_assembly
!mkdir test_assembly/flye_output

mkdir: cannot create directory ‘test_assembly’: File exists
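
The `File exists` message is harmless on a rerun, but it can be avoided by creating directories idempotently. A minimal sketch using only the Python standard library, similar in effect to `mkdir -p`; the helper name `ensure_dir` is hypothetical:

```python
from pathlib import Path


def ensure_dir(path):
    """Create a directory and any missing parents; do nothing if it exists."""
    target = Path(path)
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling `ensure_dir("test_assembly/flye_output")` repeatedly is safe and creates both levels in one step.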


In [9]:
!flye --nano-raw trimmed_fastq/trimmed_reads.fastq.gz \
     --out-dir /data/groups/wheelenj/sequencing/20250916_M009242/test_assembly/flye_output/ \
     --threads 64 \
     --genome-size 1g \
     --iterations 2

[2025-10-30 21:53:54] INFO: Starting Flye 2.9.6-b1802
[2025-10-30 21:53:54] INFO: >>>STAGE: configure
[2025-10-30 21:53:54] INFO: Configuring run
[2025-10-30 21:55:05] INFO: Total read length: 4897987557
[2025-10-30 21:55:05] INFO: Input genome size: 1000000000
[2025-10-30 21:55:05] INFO: Estimated coverage: 4
[2025-10-30 21:55:05] INFO: Reads N50/N90: 16392 / 6390
[2025-10-30 21:55:05] INFO: Minimum overlap set to 6000
[2025-10-30 21:55:05] INFO: >>>STAGE: assembly
[2025-10-30 21:55:05] INFO: Assembling disjointigs
[2025-10-30 21:55:05] INFO: Reading sequences
[2025-10-30 21:56:08] INFO: Counting k-mers:
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2025-10-30 21:58:08] INFO: Filling index table (1/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2025-10-30 21:58:56] INFO: Filling index table (2/2)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2025-10-30 22:00:27] INFO: Extending reads
[2025-10-30 22:15:11] INFO: Overlap-based coverage: 2
[2025-10-30 22:15:11] INFO: Median overlap div
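
The coverage reported in the log above can be checked by hand from numbers already shown in this notebook: the base count that Filtlong scored, the total read length that Flye reports, and the `--genome-size 1g` setting. The following is only a sketch of that arithmetic; the function and constant names are hypothetical.

```python
def coverage(total_bases, genome_size):
    """Mean depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def retained_fraction(bases_after, bases_before):
    """Fraction of bases that survived filtering."""
    return bases_after / bases_before


INPUT_BASES = 6_122_479_215     # Filtlong "Scoring long reads" line
FILTERED_BASES = 4_897_987_557  # Flye "Total read length"
GENOME_SIZE = 1_000_000_000     # Flye "--genome-size 1g"

print(f"retained: {retained_fraction(FILTERED_BASES, INPUT_BASES):.1%}")
print(f"coverage: {coverage(FILTERED_BASES, GENOME_SIZE):.1f}x")
```

This gives roughly 80% of bases retained, in line with `--keep_percent 80`, and about 4.9x coverage. Flye's log shows 4, which looks like the same value reported as an integer. At this depth the subset is far below the 50-100x target, so the test assembly is expected to be fragmented.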

### Step 7: Initial Assembly Assessment

Run BUSCO analysis on both assemblies using metazoa_odb10 lineage to assess gene completeness.
Calculate basic assembly statistics with QUAST.

In [1]:
!busco -i /data/groups/wheelenj/sequencing/20250916_M009242/test_assembly/flye_output/assembly.fasta \
      -l metazoa_odb10 \
      -o busco_test \
      -m genome \
      --cpu 64 \
      --out_path /data/groups/wheelenj/sequencing/20250916_M009242/

2025-11-12 11:29:08 INFO:	***** Start a BUSCO v6.0.0 analysis, current time: 11/12/2025 11:29:08 *****
2025-11-12 11:29:08 INFO:	Configuring BUSCO with local environment
2025-11-12 11:29:08 INFO:	Running genome mode
2025-11-12 11:29:08 INFO:	Downloading information on latest versions of BUSCO data...
2025-11-12 11:29:10 INFO:	Input file is /data/groups/wheelenj/sequencing/20250916_M009242/test_assembly/flye_output/assembly.fasta
2025-11-12 11:29:10 INFO:	Downloading file 'https://busco-data.ezlab.org/v5/data/lineages/metazoa_odb10.2024-01-08.tar.gz'
2025-11-12 11:29:15 INFO:	Decompressing file '/data/users/willetse0745/Planaria-Genome-Project/busco_downloads/lineages/metazoa_odb10.tar.gz'
2025-11-12 11:29:22 INFO:	Running BUSCO using lineage dataset metazoa_odb10 (eukaryota, 2024-01-08)
2025-11-12 11:29:22 INFO:	Running 1 job(s) on bbtools, starting at 11/12/2025 11:29:22
2025-11-12 11:29:24 INFO:	Note: NumExpr detected 64 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limi

## Subsample BUSCO Results

- Complete genes: 28.3% -- 270 total, of which 91 contain internal stop codons
- Single copy: 25.7% -- 245
- Duplicated: 2.6% -- 25
- Fragmented genes: 8.0% -- 76
- Missing genes: 63.7% -- 608
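
The percentages above follow directly from the counts. As a quick consistency check, the sketch below recomputes them, taking the total number of BUSCOs searched as the sum of the single-copy, duplicated, fragmented, and missing counts (954 here). The function name is hypothetical.

```python
def busco_percentages(counts):
    """Return (total, percentages) from per-category BUSCO counts."""
    total = sum(counts.values())
    pct = {name: round(100 * n / total, 1) for name, n in counts.items()}
    # "Complete" is the sum of single-copy and duplicated BUSCOs.
    complete = counts["single_copy"] + counts["duplicated"]
    pct["complete"] = round(100 * complete / total, 1)
    return total, pct


counts = {"single_copy": 245, "duplicated": 25, "fragmented": 76, "missing": 608}
total, pct = busco_percentages(counts)
print(total, pct)
```

The recomputed values match the list: 28.3% complete, 25.7% single copy, 2.6% duplicated, 8.0% fragmented, and 63.7% missing.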

In [1]:
!multiqc fastq_planaria/. trimmed_fastq/qc/. \
  /data/groups/wheelenj/sequencing/20250916_M009242/busco_test/short_summary.specific.metazoa_odb10.busco_test.txt \
  --dirs \
  --dirs-depth 1 \
  --force \
  --verbose

[2025-11-12 12:18:24] multiqc.core.tmp_dir                               [DEBUG  ]  Using new temporary directory: /local/scratch/job_334086/tmpkd48xyif
[2025-11-12 12:18:24] root                                               [DEBUG  ]  Logging to file: /local/scratch/job_334086/tmpkd48xyif/multiqc.log

/// MultiQC 🔍 v1.31

[2025-11-12 12:18:25] multiqc.core.update_config                         [DEBUG  ]  This is MultiQC v1.31
[2025-11-12 12:18:25] multiqc.core.update_config                         [DEBUG  ]  Running Python 3.12.12 | packaged by conda-forge | (main, Oct 13 2025, 14:34:15) [GCC 14.3.0]
[2025-11-12 12:18:25] multiqc.core.update_config                         [INFO   ]  Prepending directory to sample names
[2025-11-12 12:1