# Generate Bash and SLURM Job Scripts for all datasets

This notebook generates the .sh scripts (bash and SLURM submission scripts), directories, and data (commented out to avoid accidental overwriting) for resource-monitored and performance-evaluated runs. Unlike the preprint version, this version explicitly splits runs by tool (exhaustive vs. expected to finish within 1M CPU seconds) and by distance value (<=5 vs. <=3).  
Specifically, the runs are organised by dataset type:
1. **Real data subsamples** (fractions) - smaller fractions use max_distance=5, larger use max_distance=3
2. **Full real data** (fraction_1) - max_distance=3, excluding compute-intensive tools
3. **Semi-synthetic/baseline** (real spacers in simulated contigs) - max_distance=3, excluding sassy and indelfree_bruteforce
4. **Simulated data** - Mostly smaller runs with max_distance=5, some larger runs with max_distance=3. Also one simulated dataset with a high insertion rate.

**IMPORTANT**: Data generation commands are commented out to avoid overwriting existing data. We only regenerate scripts, and pass `exist_ok=True` when creating the directory structure, so existing directories and the data inside them are left untouched.
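
As a minimal sketch of why re-running the directory setup is safe, the snippet below creates the per-dataset layout twice in a temporary directory and checks that a file written in between survives. The helper name `ensure_layout` and the demo paths are illustrative assumptions; only the subdirectory names and the `exist_ok=True` pattern come from this notebook.

```python
import tempfile
from pathlib import Path

# Subdirectory names used for each dataset directory in this notebook
SUBDIRS = ["slurm_logs", "results", "bash_scripts", "job_scripts",
           "subsampled_data", "raw_outputs"]


def ensure_layout(dataset_dir: Path) -> list[Path]:
    """Create the per-dataset subdirectories; existing ones are left untouched."""
    paths = []
    for name in SUBDIRS:
        path = dataset_dir / name
        # exist_ok=True makes the call a no-op when the directory already exists
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "fraction_demo"  # hypothetical dataset directory
    ensure_layout(root)
    marker = root / "subsampled_data" / "subsampled_contigs.fa"
    marker.write_text(">demo\nACGT\n")
    ensure_layout(root)  # second call must not delete or overwrite anything
    survived = marker.read_text() == ">demo\nACGT\n"

print(f"Existing data survived a second setup call: {survived}")
```
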

In [1]:
# %load_ext autoreload
# %autoreload 2
import os # noqa: F401
import sys # noqa: F401
from pathlib import Path
import polars as pl
import logging

# Setup logging
logger = logging.Logger(__name__, logging.INFO)
handler = logging.StreamHandler()
formatter = logging.Formatter('%(filename)s:%(lineno)s - %(asctime)s - %(levelname)s - %(message)s - %(name)s ')
handler.setFormatter(formatter)
logger.addHandler(handler)

pl.Config(tbl_rows=50)

# Import bench utilities
from bench.utils.pyseff import generate_scripts_for_dataset
from bench.utils.pyseff import print_dataset_slurm_summary as print_summary

# Base paths
BASE_DIR = Path('/clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench')
RESULTS_DIR = BASE_DIR / 'results'
IMGVR4_DIR = BASE_DIR / 'imgvr4_data'

# Common file paths
SPACERS_FILE = str(IMGVR4_DIR / 'spacers' / 'iphop_filtered_spacers.fna')
IMGVR4_CONTIGS_FILE = str(IMGVR4_DIR / 'contigs' / 'filtered_contigs.fasta')
contigs_stats = pl.read_parquet(IMGVR4_DIR / 'contigs' / 'filtered_contig_stats.parquet')
spacers_stats = pl.read_parquet(IMGVR4_DIR / 'spacers' / 'spacers_stats.parquet')
print(f"Base directory: {BASE_DIR}")
print(f"Spacers file: {SPACERS_FILE}")
print(f"IMG/VR4 contigs: {IMGVR4_CONTIGS_FILE}")

Base directory: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench
Spacers file: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/spacers/iphop_filtered_spacers.fna
IMG/VR4 contigs: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/contigs/filtered_contigs.fasta


## 1. Real Data Subsamples (Fractions)

Generate scripts for real data subsamples. Smaller fractions (0.0005-0.01) use max_distance=5 for the Hamming vs. edit distance analysis.  
Larger fractions (0.05-0.1) use max_distance=3 to reduce compute time and/or storage usage.

**Data generation is commented out - data already exists.**
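
The split described above can be summarised as a small lookup. This is only a sketch: the function name `settings_for_fraction` and the `small_cutoff` parameter are hypothetical, and the 0.01 cutoff is read off the fraction lists used in this notebook rather than taken from `bench`.

```python
# Tools skipped for the larger datasets in this notebook
HEAVY_TOOLS = ("sassy", "indelfree_bruteforce")


def settings_for_fraction(frac: float, small_cutoff: float = 0.01) -> dict:
    """Return max_distance and skip_tools for a given subsample fraction."""
    if not 0 < frac <= 1:
        raise ValueError(f"fraction must be in (0, 1], got {frac}")
    if frac <= small_cutoff:
        # Small subsamples: all tools, wider distance for Hamming vs. edit analysis
        return {"max_distance": 5, "skip_tools": []}
    # Larger subsamples and the full set: tighter distance, heavy tools skipped
    return {"max_distance": 3, "skip_tools": list(HEAVY_TOOLS)}


for frac in (0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 1):
    print(frac, settings_for_fraction(frac))
```
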

In [None]:
# COMMENT OUT if DATA ALREADY EXISTS

from bench.commands.subsample import subsample_dataset
smaller_fractions = [0.0005, 0.001, 0.005, 0.01]
larger_fractions = [0.05, 0.1]
full_set_fraction = [1]
all_fractions = smaller_fractions + larger_fractions + full_set_fraction

# First create all needed directories
for frac in all_fractions:
    frac_dir = RESULTS_DIR / 'real_data' / 'subsamples' / f'fraction_{frac}'
    # Create all needed subdirectories (existing ones are left untouched)
    for subdir in ['slurm_logs', 'results', 'bash_scripts', 'job_scripts', 'subsampled_data', 'raw_outputs']:
        (frac_dir / subdir).mkdir(parents=True, exist_ok=True)

# Subsample every fraction except the full set (fraction_1 is generated separately)
for frac in smaller_fractions + larger_fractions:
    results = subsample_dataset(
        contigs_file=IMGVR4_CONTIGS_FILE,
        metadata_file=str(IMGVR4_DIR / 'contigs' / 'filtered_contig_stats.parquet'),
        output_dir=str(RESULTS_DIR / 'real_data' / 'subsamples' / f'fraction_{frac}' / 'subsampled_data'),
        reduce_factor=frac,
        hq_fraction=1,
        taxonomic_rank='class',
        logger=logger,
        extract_method="iter",
    )

subsample.py:297 - 2026-01-22 16:42:38,305 - INFO - Starting intelligent subsampling... - __main__ 
subsample.py:302 - 2026-01-22 16:42:38,306 - INFO - Using fraction reduction: 0.0005 - __main__ 
subsample.py:310 - 2026-01-22 16:42:38,307 - INFO - Reading metadata from /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/contigs/filtered_contig_stats.parquet - __main__ 
subsample.py:321 - 2026-01-22 16:42:38,362 - INFO - Loaded metadata with 5115894 entries - __main__ 
subsample.py:336 - 2026-01-22 16:42:38,588 - INFO - Using metadata for 5115894 contigs - __main__ 
subsample.py:342 - 2026-01-22 16:42:38,602 - INFO - Selected 421431 high-quality contigs - __main__ 
subsample.py:350 - 2026-01-22 16:42:38,603 - INFO - Target number of contigs: 210 - __main__ 
subsample.py:356 - 2026-01-22 16:42:38,794 - INFO - Length range: 1001 - 711471 bp - __main__ 
subsample.py:357 - 2026-01-22 16:42:38,795 - INFO - GC content range: 17.740 - 78.410 - __main__ 
subsample.p

Found 10 unique taxa at rank class


subsample.py:364 - 2026-01-22 16:42:39,091 - INFO - Selected 279 contigs total - __main__ 
subsample.py:368 - 2026-01-22 16:42:39,092 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/subsampled_data/subsampled_contigs.fa - __main__ 


Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬───────┐
│ class              ┆ count │
│ ---                ┆ ---   │
│ str                ┆ u64   │
╞════════════════════╪═══════╡
│ Tectiliviricetes   ┆ 18    │
│ Tokiviricetes      ┆ 14    │
│ Leviviricetes      ┆ 16    │
│ Faserviricetes     ┆ 41    │
│ …                  ┆ …     │
│ Vidaverviricetes   ┆ 21    │
│ Caudoviricetes     ┆ 99    │
│ Malgrandaviricetes ┆ 22    │
│ Huolimaviricetes   ┆ 13    │
└────────────────────┴───────┘


subsample.py:375 - 2026-01-22 16:43:18,806 - INFO - Written 279 subsampled contigs - __main__ 
subsample.py:381 - 2026-01-22 16:43:18,810 - INFO - Writing subsampled metadata to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:384 - 2026-01-22 16:43:18,898 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:385 - 2026-01-22 16:43:18,898 - INFO - Results saved to: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/subsampled_data - __main__ 
subsample.py:386 - 2026-01-22 16:43:18,899 - INFO - Original contigs: 5115894 - __main__ 
subsample.py:387 - 2026-01-22 16:43:18,899 - INFO - Subsampled contigs: 279 - __main__ 
subsample.py:388 - 2026-01-22 16:43:18,899 - INFO - Reduction factor: 18336.538 - __main__ 
subsample.py:392 - 2026-01-22 16:43:18,900 - INFO - Results saved to: /clust

Found 10 unique taxa at rank class


subsample.py:364 - 2026-01-22 16:43:19,848 - INFO - Selected 421 contigs total - __main__ 
subsample.py:368 - 2026-01-22 16:43:19,849 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.001/subsampled_data/subsampled_contigs.fa - __main__ 


Added 45 additional contigs to reach target
Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬───────┐
│ class              ┆ count │
│ ---                ┆ ---   │
│ str                ┆ u64   │
╞════════════════════╪═══════╡
│ Vidaverviricetes   ┆ 40    │
│ Huolimaviricetes   ┆ 18    │
│ Leviviricetes      ┆ 37    │
│ Malgrandaviricetes ┆ 24    │
│ …                  ┆ …     │
│ Caudoviricetes     ┆ 136   │
│ Duplopiviricetes   ┆ 34    │
│ Faserviricetes     ┆ 41    │
│ Arfiviricetes      ┆ 32    │
└────────────────────┴───────┘


subsample.py:375 - 2026-01-22 16:43:59,212 - INFO - Written 421 subsampled contigs - __main__ 
subsample.py:381 - 2026-01-22 16:43:59,216 - INFO - Writing subsampled metadata to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.001/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:384 - 2026-01-22 16:43:59,257 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:385 - 2026-01-22 16:43:59,258 - INFO - Results saved to: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.001/subsampled_data - __main__ 
subsample.py:386 - 2026-01-22 16:43:59,258 - INFO - Original contigs: 5115894 - __main__ 
subsample.py:387 - 2026-01-22 16:43:59,258 - INFO - Subsampled contigs: 421 - __main__ 
subsample.py:388 - 2026-01-22 16:43:59,259 - INFO - Reduction factor: 12151.767 - __main__ 
subsample.py:392 - 2026-01-22 16:43:59,259 - INFO - Results saved to: /cluster

Found 10 unique taxa at rank class


subsample.py:364 - 2026-01-22 16:44:00,243 - INFO - Selected 2107 contigs total - __main__ 
subsample.py:368 - 2026-01-22 16:44:00,243 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.005/subsampled_data/subsampled_contigs.fa - __main__ 


Added 862 additional contigs to reach target
Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬───────┐
│ class              ┆ count │
│ ---                ┆ ---   │
│ str                ┆ u64   │
╞════════════════════╪═══════╡
│ Malgrandaviricetes ┆ 211   │
│ Tokiviricetes      ┆ 41    │
│ Leviviricetes      ┆ 269   │
│ Arfiviricetes      ┆ 149   │
│ …                  ┆ …     │
│ Huolimaviricetes   ┆ 18    │
│ Faserviricetes     ┆ 168   │
│ Tectiliviricetes   ┆ 115   │
│ Caudoviricetes     ┆ 921   │
└────────────────────┴───────┘


subsample.py:375 - 2026-01-22 16:44:39,840 - INFO - Written 2107 subsampled contigs - __main__ 
subsample.py:381 - 2026-01-22 16:44:39,846 - INFO - Writing subsampled metadata to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.005/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:384 - 2026-01-22 16:44:39,897 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:385 - 2026-01-22 16:44:39,897 - INFO - Results saved to: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.005/subsampled_data - __main__ 
subsample.py:386 - 2026-01-22 16:44:39,898 - INFO - Original contigs: 5115894 - __main__ 
subsample.py:387 - 2026-01-22 16:44:39,898 - INFO - Subsampled contigs: 2107 - __main__ 
subsample.py:388 - 2026-01-22 16:44:39,898 - INFO - Reduction factor: 2428.047 - __main__ 
subsample.py:392 - 2026-01-22 16:44:39,899 - INFO - Results saved to: /cluste

Found 10 unique taxa at rank class


subsample.py:364 - 2026-01-22 16:44:40,866 - INFO - Selected 4214 contigs total - __main__ 
subsample.py:368 - 2026-01-22 16:44:40,867 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.01/subsampled_data/subsampled_contigs.fa - __main__ 


Added 2019 additional contigs to reach target
Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬───────┐
│ class              ┆ count │
│ ---                ┆ ---   │
│ str                ┆ u64   │
╞════════════════════╪═══════╡
│ Leviviricetes      ┆ 497   │
│ Duplopiviricetes   ┆ 263   │
│ Malgrandaviricetes ┆ 459   │
│ Huolimaviricetes   ┆ 18    │
│ …                  ┆ …     │
│ Caudoviricetes     ┆ 2141  │
│ Tokiviricetes      ┆ 41    │
│ Vidaverviricetes   ┆ 73    │
│ Tectiliviricetes   ┆ 190   │
└────────────────────┴───────┘


subsample.py:375 - 2026-01-22 16:45:20,609 - INFO - Written 4214 subsampled contigs - __main__ 
subsample.py:381 - 2026-01-22 16:45:20,615 - INFO - Writing subsampled metadata to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.01/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:384 - 2026-01-22 16:45:20,649 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:385 - 2026-01-22 16:45:20,649 - INFO - Results saved to: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.01/subsampled_data - __main__ 
subsample.py:386 - 2026-01-22 16:45:20,650 - INFO - Original contigs: 5115894 - __main__ 
subsample.py:387 - 2026-01-22 16:45:20,651 - INFO - Subsampled contigs: 4214 - __main__ 
subsample.py:388 - 2026-01-22 16:45:20,651 - INFO - Reduction factor: 1214.023 - __main__ 
subsample.py:392 - 2026-01-22 16:45:20,651 - INFO - Results saved to: /clusterf

Found 10 unique taxa at rank class


subsample.py:364 - 2026-01-22 16:45:21,644 - INFO - Selected 21071 contigs total - __main__ 
subsample.py:368 - 2026-01-22 16:45:21,645 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.05/subsampled_data/subsampled_contigs.fa - __main__ 


Added 12831 additional contigs to reach target
Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬───────┐
│ class              ┆ count │
│ ---                ┆ ---   │
│ str                ┆ u64   │
╞════════════════════╪═══════╡
│ Leviviricetes      ┆ 2571  │
│ Caudoviricetes     ┆ 13193 │
│ Faserviricetes     ┆ 958   │
│ Malgrandaviricetes ┆ 2210  │
│ …                  ┆ …     │
│ Arfiviricetes      ┆ 1070  │
│ Tokiviricetes      ┆ 41    │
│ Huolimaviricetes   ┆ 18    │
│ Duplopiviricetes   ┆ 683   │
└────────────────────┴───────┘


subsample.py:375 - 2026-01-22 16:46:02,266 - INFO - Written 21071 subsampled contigs - __main__ 
subsample.py:381 - 2026-01-22 16:46:02,280 - INFO - Writing subsampled metadata to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.05/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:384 - 2026-01-22 16:46:02,343 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:385 - 2026-01-22 16:46:02,344 - INFO - Results saved to: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.05/subsampled_data - __main__ 
subsample.py:386 - 2026-01-22 16:46:02,344 - INFO - Original contigs: 5115894 - __main__ 
subsample.py:387 - 2026-01-22 16:46:02,344 - INFO - Subsampled contigs: 21071 - __main__ 
subsample.py:388 - 2026-01-22 16:46:02,345 - INFO - Reduction factor: 242.793 - __main__ 
subsample.py:392 - 2026-01-22 16:46:02,345 - INFO - Results saved to: /cluster

Found 10 unique taxa at rank class


subsample.py:364 - 2026-01-22 16:46:03,376 - INFO - Selected 42143 contigs total - __main__ 
subsample.py:368 - 2026-01-22 16:46:03,377 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.1/subsampled_data/subsampled_contigs.fa - __main__ 


Added 27633 additional contigs to reach target
Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬───────┐
│ class              ┆ count │
│ ---                ┆ ---   │
│ str                ┆ u64   │
╞════════════════════╪═══════╡
│ Caudoviricetes     ┆ 28183 │
│ Faserviricetes     ┆ 1750  │
│ Huolimaviricetes   ┆ 18    │
│ Tectiliviricetes   ┆ 254   │
│ …                  ┆ …     │
│ Malgrandaviricetes ┆ 4330  │
│ Tokiviricetes      ┆ 41    │
│ Arfiviricetes      ┆ 1623  │
│ Leviviricetes      ┆ 5140  │
└────────────────────┴───────┘


subsample.py:375 - 2026-01-22 16:46:45,362 - INFO - Written 42143 subsampled contigs - __main__ 
subsample.py:381 - 2026-01-22 16:46:45,386 - INFO - Writing subsampled metadata to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.1/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:384 - 2026-01-22 16:46:45,425 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:385 - 2026-01-22 16:46:45,425 - INFO - Results saved to: /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.1/subsampled_data - __main__ 
subsample.py:386 - 2026-01-22 16:46:45,426 - INFO - Original contigs: 5115894 - __main__ 
subsample.py:387 - 2026-01-22 16:46:45,426 - INFO - Subsampled contigs: 42143 - __main__ 
subsample.py:388 - 2026-01-22 16:46:45,426 - INFO - Reduction factor: 121.394 - __main__ 
subsample.py:392 - 2026-01-22 16:46:45,427 - INFO - Results saved to: /clusterfs
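
The numbers in the log above can be reproduced with two lines of arithmetic. This is a sketch inferred from the logged values only (it is not taken from `subsample.py`): the target appears to be the number of high-quality contigs times the fraction, truncated to an integer, and the reported reduction factor appears to be the original contig count divided by the number of contigs written. The helper names are hypothetical.

```python
N_TOTAL = 5_115_894  # "Original contigs" in the log
N_HQ = 421_431       # "Selected ... high-quality contigs" in the log


def target_contigs(frac: float) -> int:
    """Target number of contigs for a fraction (assumed: truncated product)."""
    return int(N_HQ * frac)


def reduction_factor(n_subsampled: int) -> float:
    """Original contig count divided by the number of contigs written."""
    return round(N_TOTAL / n_subsampled, 3)


# (fraction, contigs written according to the log)
for frac, written in [(0.0005, 279), (0.001, 421), (0.005, 2107),
                      (0.01, 4214), (0.05, 21071), (0.1, 42143)]:
    print(frac, target_contigs(frac), written, reduction_factor(written))
```

Note that at fraction 0.0005 the log reports 279 contigs written against a target of 210, so the written count can exceed the target for the smallest subsample.
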

In [13]:
%load_ext autoreload
%autoreload 2
# Generate scripts for real data subsamples
subsample_results = {}

# Smaller fractions: max_distance=5 (for hamming vs edit distance comparison)
for frac in smaller_fractions:
    dataset_dir = RESULTS_DIR / 'real_data' / 'subsamples' / f'fraction_{frac}'
    contigs_file = dataset_dir / 'subsampled_data' / 'subsampled_contigs.fa'
    
    if not contigs_file.exists():
        logger.warning(f"Contigs file not found for fraction {frac}: {contigs_file}")
        continue
    
    result = generate_scripts_for_dataset(
        dataset_dir=dataset_dir,
        contigs_file=str(contigs_file),
        spacers_file=SPACERS_FILE,
        max_distance=5,  # For edit vs hamming distance analysis
        threads=38,
        slurm_threads=38,
        hyperfine=False
    )
    subsample_results[f'fraction_{frac}'] = result

# Larger fractions: max_distance=3 (to reduce compute time and/or storage usage)
larger_fractions = [0.05, 0.1]
for frac in larger_fractions:
    dataset_dir = RESULTS_DIR / 'real_data' / 'subsamples' / f'fraction_{frac}'
    contigs_file = dataset_dir / 'subsampled_data' / 'subsampled_contigs.fa'
    
    if not contigs_file.exists():
        logger.warning(f"Contigs file not found for fraction {frac}: {contigs_file}")
        continue
    
    result = generate_scripts_for_dataset(
        dataset_dir=dataset_dir,
        contigs_file=str(contigs_file),
        spacers_file=SPACERS_FILE,
        max_distance=3,  # Reduced for compute efficiency
        threads=38,
        slurm_threads=38,
        hyperfine=False,
        skip_tools=['indelfree_bruteforce', 'sassy'],  # Too compute-intensive for the larger subsamples
    )
    subsample_results[f'fraction_{frac}'] = result

print_summary(subsample_results, "REAL DATA SUBSAMPLES - Script Generation Summary")

REAL DATA SUBSAMPLES - Script Generation Summary
fraction_0.0005:
  max_distance: 5
  Bash scripts: 12
  SLURM jobs: 12
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
fraction_0.001:
  max_distance: 5
  Bash scripts: 12
  SLURM jobs: 12
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
fraction_0.005:
  max_distance: 5
  Bash scripts: 12
  SLURM jobs: 12
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
fraction_0.01:
  max_distance: 5
  Bash scripts: 12
  SLURM jobs: 12
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
fraction_0.05:
  max_distance: 3
  Bash scripts: 10
  SLURM jobs: 10
  Tools: blastn, bowtie1, bowtie2, indelfree_indexed, lexicmap...
fraction_0.1:
  max_distance: 3
  Bash scripts: 10
  SLURM jobs: 10
  Tools: blastn, bowtie1, bowtie2, indelfree_indexed, lexicmap...


## 2. Full Real Data (a.k.a "fraction_1")

Generate scripts for the full IMG/VR4 dataset with max_distance=3. Compute-intensive tools (sassy, indelfree_bruteforce) are excluded because of the large dataset size.  
Note that the "1" in fraction_1 means 100% of the HQ contigs (as many as possible while still representing most prok taxa).


**Data generation is commented out - data already exists.**

In [None]:
# COMMENT OUT if DATA ALREADY EXISTS
from bench.commands.subsample import subsample_dataset

results = subsample_dataset(
    contigs_file=str(IMGVR4_DIR / 'contigs' / 'filtered_contigs.fasta'),
    metadata_file=str(IMGVR4_DIR / 'contigs' / 'filtered_contig_stats.parquet'), 
    output_dir=str(RESULTS_DIR / 'real_data' / 'subsamples' / 'fraction_1' / "subsampled_data" ),
    reduce_factor=1.0,  # Full dataset including only (or mostly) high-quality viral contigs
    hq_fraction=1,
    taxonomic_rank='class',
    logger=logger,
    extract_method="paraseq_filt"
)

subsample.py:307 - 2026-01-22 16:54:00,336 - INFO - Starting intelligent subsampling... - __main__ 
subsample.py:312 - 2026-01-22 16:54:00,337 - INFO - Using fraction reduction: 1.0 - __main__ 
subsample.py:320 - 2026-01-22 16:54:00,338 - INFO - Reading metadata from /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/contigs/filtered_contig_stats.parquet - __main__ 
subsample.py:331 - 2026-01-22 16:54:00,386 - INFO - Loaded metadata with 5115894 entries - __main__ 
subsample.py:346 - 2026-01-22 16:54:00,600 - INFO - Using metadata for 5115894 contigs - __main__ 
subsample.py:352 - 2026-01-22 16:54:00,617 - INFO - Selected 421431 high-quality contigs - __main__ 
subsample.py:360 - 2026-01-22 16:54:00,617 - INFO - Target number of contigs: 421431 - __main__ 
subsample.py:366 - 2026-01-22 16:54:00,808 - INFO - Length range: 1001 - 711471 bp - __main__ 
subsample.py:367 - 2026-01-22 16:54:00,809 - INFO - GC content range: 17.740 - 78.410 - __main__ 
subsample.p

Found 10 unique taxa at rank class
Added 342895 additional contigs to reach target


subsample.py:374 - 2026-01-22 16:54:01,690 - INFO - Selected 421431 contigs total - __main__ 
subsample.py:378 - 2026-01-22 16:54:01,690 - INFO - Writing selected contigs to /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/subsampled_contigs.fa - __main__ 


Sampled sequences per taxa:
shape: (10, 2)
┌────────────────────┬────────┐
│ class              ┆ count  │
│ ---                ┆ ---    │
│ str                ┆ u64    │
╞════════════════════╪════════╡
│ Duplopiviricetes   ┆ 731    │
│ Faserviricetes     ┆ 8476   │
│ Malgrandaviricetes ┆ 14890  │
│ Tokiviricetes      ┆ 41     │
│ …                  ┆ …      │
│ Arfiviricetes      ┆ 1705   │
│ Caudoviricetes     ┆ 359791 │
│ Huolimaviricetes   ┆ 18     │
│ Tectiliviricetes   ┆ 254    │
└────────────────────┴────────┘
Using paraseq_filt for sequence extraction... should probably upload the code for paraset_filt to somewhere


Using 10 threads
Loaded 421431 headers
Processed 100000 records
Processed 200000 records
Processed 300000 records
Processed 400000 records
Processed 500000 records
Processed 600000 records
Processed 700000 records
Processed 800000 records
Processed 900000 records
Processed 1000000 records
Processed 1100000 records
Processed 1200000 records
Processed 1300000 records
Processed 1400000 records
Processed 1500000 records
Processed 1600000 records
Processed 1700000 records
Processed 1800000 records
Processed 1900000 records
Processed 2000000 records
Processed 2100000 records
Processed 2200000 records
Processed 2300000 records
Processed 2400000 records
Processed 2500000 records
Processed 2600000 records
Processed 2700000 records
Processed 2800000 records
Processed 2900000 records
Processed 3000000 records
Processed 3100000 records
Processed 3200000 records
Processed 3300000 records
Processed 3400000 records
Processed 3500000 records
Processed 3600000 records
Processed 3700000 records
Processe

In [14]:
# Generate scripts for full real data
full_data_results = {}

dataset_dir = RESULTS_DIR / 'real_data' / 'subsamples' / 'fraction_1'
contigs_file = dataset_dir / 'subsampled_data' / 'subsampled_contigs.fa'

# Fall back to the original IMG/VR4 contigs file if the subsampled file is missing
if not contigs_file.exists():
    logger.info("Using original IMG/VR4 contigs file")
    contigs_file = IMGVR4_CONTIGS_FILE
else:
    contigs_file = str(contigs_file)

result = generate_scripts_for_dataset(
    dataset_dir=dataset_dir,
    contigs_file=contigs_file,
    spacers_file=SPACERS_FILE,
    max_distance=3,  # Lower to make computation feasible
    threads=38,
    skip_tools=['sassy', 'indelfree_bruteforce'],  # Too compute-intensive for full dataset
    slurm_threads=38,  # Same thread count as the subsample runs
    slurm_opts={'mem': '250G', 't': '72:00:00'},  # More memory and a longer time limit for the large dataset
    hyperfine=False
)
full_data_results['fraction_1'] = result

print_summary(full_data_results, "FULL REAL DATA - Script Generation Summary")

FULL REAL DATA - Script Generation Summary
fraction_1:
  max_distance: 3
  Bash scripts: 10
  SLURM jobs: 10
  Tools: blastn, bowtie1, bowtie2, indelfree_indexed, lexicmap...


#### Job submissions
Run each submission cell only once: re-running it submits every job again.
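
One way to make accidental re-runs harmless is to leave a marker file next to each script once it has been submitted. The sketch below is not part of `bench`; the names `submit_once` and `sbatch`, and the `.submitted` marker suffix, are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def sbatch(script: Path) -> str:
    """Submit one script with sbatch and return its stdout."""
    done = subprocess.run(["sbatch", str(script)], check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def submit_once(scripts: Iterable[Path],
                submit: Callable[[Path], str] = sbatch) -> list[Path]:
    """Submit each script unless a '<name>.submitted' marker sits next to it."""
    submitted = []
    for script in sorted(scripts):
        marker = script.with_suffix(".submitted")
        if marker.exists():
            continue  # already submitted by an earlier run of this cell
        # Record the scheduler's reply so the job ID can be looked up later
        marker.write_text(submit(script) + "\n")
        submitted.append(script)
    return submitted


# Example (not executed here): submit the jobs of one dataset directory
# submit_once(Path("results/real_data/subsamples/fraction_0.05/job_scripts").glob("*.sh"))
```

Because the markers do not end in `.sh`, a later `*.sh` glob over `job_scripts` still returns only the job scripts. Deleting a marker re-enables submission for that tool.
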

In [15]:
import subprocess as sp
import glob
fractions = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]  # fraction_1 is submitted in the next cell
for frac in fractions:
    frac_dir = RESULTS_DIR / 'real_data' / 'subsamples' / f'fraction_{frac}'
    for script in sorted(glob.glob(f"{frac_dir}/job_scripts/*.sh")):
        print(f"Submitting {script}")
        # Pass the arguments as a list so no shell is needed
        sp.run(["sbatch", script])

Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/job_scripts/blastn.sh
Submitted batch job 20668596
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/job_scripts/bowtie1.sh
Submitted batch job 20668597
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/job_scripts/bowtie2.sh
Submitted batch job 20668598
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/job_scripts/indelfree_bruteforce.sh
Submitted batch job 20668599
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_0.0005/job_scripts/indelfree_indexed.sh
Submitted batch job 20668600
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/s

In [16]:
import subprocess as sp
import glob
fractions = [1]  # Full dataset only
for frac in fractions:
    frac_dir = RESULTS_DIR / 'real_data' / 'subsamples' / f'fraction_{frac}'
    for script in sorted(glob.glob(f"{frac_dir}/job_scripts/*.sh")):
        print(f"Submitting {script}")
        # Pass the arguments as a list so no shell is needed
        sp.run(["sbatch", script])

Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/job_scripts/blastn.sh
Submitted batch job 20668664
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/job_scripts/bowtie1.sh
Submitted batch job 20668665
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/job_scripts/bowtie2.sh
Submitted batch job 20668666
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/job_scripts/indelfree_indexed.sh
Submitted batch job 20668667
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/job_scripts/lexicmap.sh
Submitted batch job 20668668
Submitting /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/results/real_data/subsamples/fraction_1/job_scripts/mini

## 3. Semi-Synthetic (Baseline) - Real Spacers in Simulated Contigs

Generate scripts for the baseline dataset, which pairs the real spacers with simulated contigs (with sequence features matching the real contigs, and as many contigs as in the `fraction_1` set, i.e. 421431). No spacers are inserted into the contigs (`--spacer-insertions 0 0`), so any reported match arises by chance.  
Uses max_distance=3 and excludes compute-intensive tools, as this dataset is used for estimating such "spurious" matches.
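
For intuition about how many chance matches to expect, a back-of-envelope estimate can be sketched as follows. It is not the benchmark's method: it assumes independent, uniformly distributed bases (the simulated contigs here use a GC content of 46%, so this is only approximate), counts substitutions only (Hamming distance), and ignores edge effects at contig ends. The function names are hypothetical.

```python
from math import comb


def p_match(length: int, max_mismatches: int) -> float:
    """Probability that a random window matches a spacer with at most
    max_mismatches substitutions, assuming uniform i.i.d. bases."""
    # Windows with exactly k mismatches: choose k positions, 3 wrong bases each
    favourable = sum(comb(length, k) * 3**k for k in range(max_mismatches + 1))
    return favourable / 4**length


def expected_chance_hits(length: int, max_mismatches: int,
                         target_bases: int, strands: int = 2) -> float:
    """Expected chance hits of one spacer, treating every target base on each
    strand as a possible window start."""
    return strands * target_bases * p_match(length, max_mismatches)


for d in range(4):
    print(f"30 nt spacer, <= {d} mismatches: p = {p_match(30, d):.3e}")
```

The probability grows quickly with the allowed distance, which is one reason a no-insertion baseline is useful for calibrating what a given distance threshold lets through.
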

**Data generation is commented out - data already exists.**

In [12]:
%%bash
# DATA GENERATION - COMMENT OUT if DATA ALREADY EXISTS
# from bench.commands.simulate import run_simulate
#
# Earlier, smaller variant (kept for reference): real spacers, 20000 simulated contigs, 1-5 insertions, 10000-150000 contig length, 0-5 mismatches
# pixi run "spacer_bencher simulate --contig-distribution normal --spacers  imgvr4_data/spacers/iphop_filtered_spacers.fna  --contig-gc-content 46 --spacer-insertions 1 5 --num-contigs 20000  --contig-length 10000 150000 --mismatch-range 0 5 --output-dir ./results/simulated/ns_1000000_nc_20000_real/ --threads 8"

spacer_bencher simulate \
  --num-contigs 421431 \
  --num-spacers 3826979 \
  --contig-length 2000 100000 \
  --contig-distribution normal \
  --contig-gc-content 46 \
  --spacer-length 25 50 \
  --mismatch-range 0 0 \
  --spacer-insertions 0 0 \
  --reverse-complement 0.5 \
  --threads 16 \
  --output-dir ../results/simulated/ns_3826979_nc_421431_real_baseline \
  --spacers ../imgvr4_data/spacers/iphop_filtered_spacers.fna 
#   --verify

[01/22/26 18:26:28] INFO     Starting simulation with 3826979 spacers and 421431 contigs
[01/22/26 18:26:28] INFO     Starting simulation: 421431 contigs, 3826979 spacers
Using generated simulation ID: ed580215
Using all 3826979 spacers from file
Writing contigs to FASTA file...
Writing spacers to FASTA file...
Writing planned ground truth to TSV file...
[01/22/26 18:27:29] INFO     Generated 421431 contigs and 3826979 spacers
[01/22/26 18:27:29] INFO     Ground truth contains 0 spacer insertions
[01/22/26 18:27:30] INFO     Simulation completed successfully

In [None]:
# Generate scripts for semi-synthetic/baseline datasets
baseline_results = {}

# The main baseline dataset: real spacers, simulated contigs
baseline_datasets = [
    'ns_3826979_nc_421431_real_baseline',
]

for dataset_name in baseline_datasets:
    dataset_dir = RESULTS_DIR / 'simulated' / dataset_name
    
    # Check if dataset exists
    contigs_file = dataset_dir / 'simulated_data' / 'simulated_contigs.fa'
    spacers_file = dataset_dir / 'simulated_data' / 'simulated_spacers.fa'
    
    if not contigs_file.exists():
        logger.warning(f"Dataset not found: {dataset_dir}")
        continue
    
    result = generate_scripts_for_dataset(
        dataset_dir=dataset_dir,
        contigs_file=str(contigs_file),
        spacers_file=str(spacers_file),
        max_distance=3,  # For spurious match estimation
        threads=38,
        skip_tools=['sassy', 'indelfree_bruteforce'],  # Too compute-intensive
        slurm_threads=38,
        slurm_opts={'mem': '250G', 't': '72:00:00'},
        hyperfine=False
    )
    baseline_results[dataset_name] = result

print_summary(baseline_results, "SEMI-SYNTHETIC/BASELINE - Script Generation Summary")

SEMI-SYNTHETIC/BASELINE - Script Generation Summary
ns_3826979_nc_421431_real_baseline:
  max_distance: 3
  Bash scripts: 10
  SLURM jobs: 10
  Tools: blastn, bowtie1, bowtie2, indelfree_indexed, lexicmap...


## 4. Simulated Data

Generate scripts for fully simulated datasets. Smaller datasets use max_distance=5 for comprehensive analysis, while larger datasets may use max_distance=3.

**Data generation is commented out - data already exists.**
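
The simulated dataset directories follow an `ns_<spacers>_nc_<contigs>[_<tag>]` naming pattern, as can be seen by comparing the output directories with the `--num-spacers` and `--num-contigs` values in the commands. A small parser for that pattern might look like the sketch below; `parse_dataset_name` is a hypothetical helper, not part of `bench`.

```python
import re

# ns_<num_spacers>_nc_<num_contigs> with an optional free-text tag
NAME_RE = re.compile(r"^ns_(?P<ns>\d+)_nc_(?P<nc>\d+)(?:_(?P<tag>.+))?$")


def parse_dataset_name(name: str) -> dict:
    """Split a simulated dataset directory name into its components."""
    match = NAME_RE.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    return {
        "num_spacers": int(match["ns"]),
        "num_contigs": int(match["nc"]),
        "tag": match["tag"],  # None when the name has no suffix
    }


for name in ("ns_100_nc_50000",
             "ns_500_nc_5000_HIGH_INSERTION_RATE",
             "ns_3826979_nc_421431_real_baseline"):
    print(name, parse_dataset_name(name))
```
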

In [None]:
%%bash
# 100 spacers, 50000 contigs, 1-3 insertions, 25-45 spacer length, 2500-850000 contig length, 0-5 mismatches
pixi run "spacer_bencher simulate \
  --num-contigs 50000 \
  --num-spacers 100 \
  --contig-length 2500 850000 \
  --contig-distribution normal \
  --contig-gc-content 46 \
  --spacer-gc-content 49 \
  --spacer-length 25 45 \
  --mismatch-range 0 5 \
  --spacer-insertions 1 3 \
  --reverse-complement 0.5 \
  --threads 16 \
  --output-dir ../results/simulated/ns_100_nc_50000 
"

In [None]:
%%bash
# 50_000 spacers, 5_000 contigs, 1-5 insertions, 25-40 spacer length, 10000-150000 contig length, 0-5 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 50000 --gc-content 49 --spacer-insertions 1 5 --num-contigs 5000 --spacer-length 25 40  --contig-length 10000 150000 --mismatch-range 0 5 --output-dir ../results/simulated/ns_50000_nc_5000/ --threads 8"

# 75_000 spacers, 5_000 contigs, 1-5 insertions, 25-40 spacer length, 10000-150000 contig length, 0-5 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 75000 --gc-content 49 --spacer-insertions 1 5 --num-contigs 5000 --spacer-length 25 40  --contig-length 10000 150000 --mismatch-range 0 5 --output-dir ../results/simulated/ns_75000_nc_5000/ --threads 8"

# 500 spacers, 5_000 contigs, HIGH INSERTIONS! 100-2500 insertions, 25-40 spacer length, 10000-150000 contig length, 0-5 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 500 --gc-content 49 --spacer-insertions 100 2500  --num-contigs 5000 --spacer-length 25 40  --contig-length 10000 150000 --mismatch-range 0 5 --output-dir ../results/simulated/ns_500_nc_5000_HIGH_INSERTION_RATE/ --threads 8"

# 75_000 spacers, 10_000 contigs, 1-5 insertions, 25-40 spacer length, 10000-150000 contig length, 0-5 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 75000 --gc-content 49 --spacer-insertions 1 5 --num-contigs 10000 --spacer-length 25 40  --contig-length 10000 150000 --mismatch-range 0 5 --output-dir ../results/simulated/ns_75000_nc_10000/ --threads 8"

# 100_000 spacers, 10_000 contigs, 1-5 insertions, 25-40 spacer length, 10000-150000 contig length, 0-5 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 100000 --gc-content 49 --spacer-insertions 1 5 --num-contigs 10000 --spacer-length 25 40  --contig-length 10000 150000 --mismatch-range 0 5 --output-dir ../results/simulated/ns_100000_nc_10000/ --threads 8"

# 100_000 spacers, 20_000 contigs, 1-5 insertions, 25-40 spacer length, 10000-150000 contig length, 0-3 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 100000 --gc-content 49 --spacer-insertions 1 5 --num-contigs 20000 --spacer-length 25 40  --contig-length 10000 150000 --mismatch-range 0 3 --output-dir ../results/simulated/ns_100000_nc_20000/ --threads 8"

# 500_000 spacers, 100_000 contigs, 1-5 insertions, 25-40 spacer length, 10000-550000 contig length, 0-3 mismatches
pixi run "spacer_bencher simulate --contig-distribution normal  --spacer-distribution normal --num-spacers 500000 --gc-content 49 --spacer-insertions 1 5 --num-contigs 100000 --spacer-length 25 40  --contig-length 10000 550000 --mismatch-range 0 3 --output-dir ../results/simulated/ns_500000_nc_100000/ --threads 8"

[01/22/26 18:31:37] INFO     Starting simulation with 50000 spacers and 5000 contigs
[01/22/26 18:31:37] INFO     Starting simulation: 5000 contigs, 50000 spacers
Using generated simulation ID: d1184075
Generating contigs...
Generating spacers...
Predetermining spacer insertion parameters...
Total planned insertion length: 4795343 bp
Average insertions per spacer: 3.0
Total contig size: 402606570 bp
Expected contig utilization: 1.2%
Thread 0: 6250 spacers (599419 bp planned) / 625 contigs (50385036 bp)
Thread 1: 6250 spacers (599418 bp planned) / 625 contigs (50370212 bp)
Thread 2: 6250 spacers (599418 bp planned) / 625 contigs (50352937 bp)
Thread 3: 6250 spacers (599418 bp planned) / 625 contigs (50335864 bp)
Thread 4: 6250 spacers (599418 bp planned) / 625 contigs (50315897 bp)
Thread 5: 6

Successfully inserted 300084 spacer instances
Writing contigs to FASTA file...
Writing spacers to FASTA file...
Writing planned ground truth to TSV file...



Aborted!


[01/22/26 18:36:24] INFO     Starting simulation with 1000000 spacers and 100000 contigs
[01/22/26 18:36:24] INFO     Starting simulation: 100000 contigs, 1000000 spacers
Using generated simulation ID: c1a7860d
Generating contigs...


In [None]:
# Generate scripts for simulated datasets
simulated_results = {}
# Reference list of simulated datasets (not used by the loops below)
all_simulated_datasets = [
    'ns_100000_nc_10000',
    'ns_50000_nc_5000',
    'ns_75000_nc_10000',
    'ns_100000_nc_20000',
    'ns_500000_nc_100000',
    'ns_500_nc_5000_HIGH_INSERTION_RATE',
    'ns_75000_nc_5000',
]

# Smaller simulated datasets: max_distance=5 (comprehensive analysis)
smaller_simulated = [
    'ns_50000_nc_5000',
    'ns_75000_nc_5000',
    'ns_100_nc_50000',
    'ns_500_nc_5000_HIGH_INSERTION_RATE',
]

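# Size groupings kept for reference only. The loops below iterate over
# smaller_simulated and larger_simulated, so ns_75000_nc_10000 gets no
# scripts from this cell.
medium_simulated = [
    'ns_75000_nc_10000',
]
large_simulated = [
    'ns_100000_nc_10000',
    'ns_500000_nc_100000',
    'ns_100000_nc_20000',
]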

for dataset_name in smaller_simulated:
    dataset_dir = RESULTS_DIR / 'simulated' / dataset_name
    
    contigs_file = dataset_dir / 'simulated_data' / 'simulated_contigs.fa'
    spacers_file = dataset_dir / 'simulated_data' / 'simulated_spacers.fa'
    
    if not contigs_file.exists():
        logger.warning(f"Dataset not found: {dataset_dir}")
        continue
    
    result = generate_scripts_for_dataset(
        dataset_dir=dataset_dir,
        contigs_file=str(contigs_file),
        spacers_file=str(spacers_file),
        max_distance=5,  # Full range for comprehensive analysis
        threads=38,
        slurm_threads=38,
        hyperfine=False
    )
    simulated_results[dataset_name] = result

# Larger simulated datasets: max_distance=3 or 5 depending on size
larger_simulated = {
    'ns_100000_nc_10000': 5,  # Still manageable with max_distance=5
    'ns_100000_nc_20000': 5,  # Still manageable with max_distance=5
    'ns_500000_nc_100000': 3  # Larger dataset, use max_distance=3
}

for dataset_name, max_dist in larger_simulated.items():
    dataset_dir = RESULTS_DIR / 'simulated' / dataset_name
    contigs_file = dataset_dir / 'simulated_data' / 'simulated_contigs.fa'
    spacers_file = dataset_dir / 'simulated_data' / 'simulated_spacers.fa'
    
    if not contigs_file.exists():
        logger.warning(f"Dataset not found: {dataset_dir}")
        continue
    
    # Skip the compute-intensive tools for datasets run at max_distance=3
    skip_tools = ['sassy', 'indelfree_bruteforce'] if max_dist == 3 else None
    
    result = generate_scripts_for_dataset(
        dataset_dir=dataset_dir,
        contigs_file=str(contigs_file),
        spacers_file=str(spacers_file),
        max_distance=max_dist,
        threads=38,
        skip_tools=skip_tools,
        slurm_threads=38,
        slurm_opts={'mem': '250G', 't': '72:00:00'},
        hyperfine=False
    )
    simulated_results[dataset_name] = result

print_summary(simulated_results, "SIMULATED DATA - Script Generation Summary")

SIMULATED DATA - Script Generation Summary
ns_50000_nc_5000:
  max_distance: 5
  Bash scripts: 11
  SLURM jobs: 11
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
ns_75000_nc_5000:
  max_distance: 5
  Bash scripts: 11
  SLURM jobs: 11
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
ns_500_nc_5000_HIGH_INSERTION_RATE:
  max_distance: 5
  Bash scripts: 11
  SLURM jobs: 11
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
ns_100000_nc_10000:
  max_distance: 5
  Bash scripts: 11
  SLURM jobs: 11
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
ns_100000_nc_20000:
  max_distance: 5
  Bash scripts: 11
  SLURM jobs: 11
  Tools: blastn, bowtie1, bowtie2, indelfree_bruteforce, indelfree_indexed...
ns_500000_nc_100000:
  max_distance: 3
  Bash scripts: 9
  SLURM jobs: 9
  Tools: blastn, bowtie1, bowtie2, indelfree_indexed, minimap2...


## Job Submissions
Proceed with caution: submit each job only once. Unlike the hyperfine runs, these runs do not clean up before or after each run.
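
One way to guard against accidental resubmission is to leave a marker file next to each script once it has been handed to `sbatch`. This is only a sketch: `submit_once` and the `.submitted` marker convention are hypothetical and are not something `generate_scripts_for_dataset` creates or checks.

```python
import subprocess
from pathlib import Path


def submit_once(job_dir, dry_run=True):
    """Submit each *.sh script in job_dir at most once.

    A '<script>.submitted' marker file records a previous submission.
    With dry_run=True nothing is submitted and no marker is written;
    the function only reports which scripts would be submitted.
    """
    selected = []
    for script in sorted(Path(job_dir).glob("*.sh")):
        marker = script.with_name(script.name + ".submitted")
        if marker.exists():
            continue  # already submitted earlier
        if not dry_run:
            subprocess.run(["sbatch", str(script)], check=True)
            marker.touch()
        selected.append(script.name)
    return selected
```

Calling it first with `dry_run=True` lists what would be submitted; rerunning with `dry_run=False` skips any script that already has a marker.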

In [None]:
%%bash
# Submit the blastn job for every real-data subsample (fraction_0.*)
# for script in results/real_data/subsamples/fraction_0.*/job_scripts/blastn.sh; do
#     sbatch "$script"
# done

# Submit all jobs for the simulated datasets whose directory name ends in 0
# (excludes the baseline and high insertion rate datasets, handled below)
# for script in results/simulated/ns_*0/job_scripts/*.sh; do
#     sbatch "$script"
# done

# Baseline (semi-synthetic) dataset
# for script in results/simulated/*_real_baseline/job_scripts/*.sh; do
#     sbatch "$script"
# done

# High insertion rate dataset
# for script in results/simulated/ns_500_nc_5000_HIGH_INSERTION_RATE/job_scripts/*.sh; do
#     sbatch "$script"
# done