## Create multiple subsamples of the contigs for statistical analysis
Using the `spacer_bencher subsample` command, we can create metadata-informed subsamples of the contigs. The goal is to keep the subsamples balanced across taxonomic groups and sequence features (namely length and GC content), while reducing the number of contigs to a manageable size for which all or most of the tools should finish in a reasonable time.
The subsample command requires a contigs file and a metadata table with at least these columns: contig name, length, taxonomy string, and GC content. The current table does not have the GC content, so we'll add that column first from the output of the `seqkit fx2tab` command.
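
As a rough illustration of that step, the sketch below attaches a GC column to metadata records from tab-separated `seqkit fx2tab`-style output. The sample text, the record contents, and the helper names `read_gc` and `add_gc` are hypothetical; the exact column layout of `seqkit fx2tab --name --length --gc` can differ between seqkit versions, so the sketch assumes only that the name is the first field and GC is the last.

```python
import csv
import io

# Hypothetical fx2tab-style output: name, length, GC% (tab-separated).
fx2tab_text = "contig_A\t5594\t38.36\ncontig_B\t1200\t51.20\n"

# Hypothetical metadata records that lack a GC column.
metadata = [
    {"seqid": "contig_A", "length": 5594, "class": "Faserviricetes"},
    {"seqid": "contig_B", "length": 1200, "class": "Caudoviricetes"},
    {"seqid": "contig_C", "length": 800, "class": "Caudoviricetes"},
]


def read_gc(handle):
    """Map contig name -> GC%, reading the first and last field of each row."""
    gc_by_name = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and an optional header line
        gc_by_name[row[0]] = float(row[-1])
    return gc_by_name


def add_gc(records, gc_by_name):
    """Left-join GC onto the records; contigs without a value get None."""
    return [{**rec, "gc": gc_by_name.get(rec["seqid"])} for rec in records]


enriched = add_gc(metadata, read_gc(io.StringIO(fx2tab_text)))
for rec in enriched:
    print(rec["seqid"], rec["gc"])
```

Contigs missing from the GC table are kept with `None` rather than dropped, so the gap stays visible before subsampling.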

In [2]:
import os
os.chdir('/clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/')
import polars as pl
import logging
logger = logging.Logger(__name__, logging.INFO)
handler = logging.StreamHandler()
# Set a format that makes log messages clickable in JupyterLab (file:line) # TODO: add this to the functions file main logging logic.
formatter = logging.Formatter('%(filename)s:%(lineno)s - %(asctime)s - %(levelname)s - %(message)s - %(name)s ')
handler.setFormatter(formatter)
logger.addHandler(handler)

pl.Config(tbl_rows=50)

from bench import *
from bench.utils.functions import *
# Import the functionalized subsampling functions
from bench.commands.subsample import subsample_dataset

# Set up parameters for subsampling
contigs_file = 'contigs/IMGVR4_SEQUENCES.fna'
spacers_file = 'spacers/All_CRISPR_spacers_nr_clean.fna'
metadata_file = 'contigs/contig_stats.parquet'
output_dir = 'subsampled_data'
contig_stats = pl.read_parquet(metadata_file)
contig_stats.head(1)
# contig_stats["seqid"].count() #5457198

seqid,length,gc,topology,uvig,hq,…,phylum,class,order,family,genus,species
str,i64,f64,str,str,bool,…,str,str,str,str,str,str
"IMGVR_UViG_2504643025_000001|2…",5594,38.36,"Provirus","IMGVR_UViG_2504643025_000001",True,…,"Hofneiviricota","Faserviricetes","Tubulavirales",,,


Next, we create subsamples of varying sizes.
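
To make the balancing idea concrete, here is a minimal sketch of stratified sampling by taxonomic class with a random top-up. It is not the implementation behind `subsample_dataset`; the function `stratified_sample`, the equal-share quota, and the toy records are assumptions chosen only to illustrate how small classes can be preserved while still reaching a target size.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, target, seed=0):
    """Pick up to an equal quota per group, then top up at random."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    # Equal share per group (assumption); small groups contribute all they have.
    quota = max(1, target // len(groups))
    picked = []
    for taxon in sorted(groups):
        members = groups[taxon]
        picked.extend(rng.sample(members, min(quota, len(members))))

    # Top up from the contigs not chosen yet if the quotas fell short.
    if len(picked) < target:
        chosen = {id(rec) for rec in picked}
        rest = [rec for rec in records if id(rec) not in chosen]
        picked.extend(rng.sample(rest, min(target - len(picked), len(rest))))
    return picked


# Toy metadata: one dominant class and two rare ones.
records = (
    [{"seqid": f"c{i}", "class": "Caudoviricetes"} for i in range(50)]
    + [{"seqid": f"f{i}", "class": "Faserviricetes"} for i in range(10)]
    + [{"seqid": f"m{i}", "class": "Mouviricetes"} for i in range(3)]
)
sample = stratified_sample(records, key="class", target=12)
print(len(sample), sorted({rec["class"] for rec in sample}))
```

With this toy input the rare class is kept in full, which a plain random draw of 12 out of 63 contigs would not guarantee.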

In [None]:
%%time
# Note: the log below also contains a 0.2 run, which is commented out here.
fractions = [0.001, 0.005, 0.01, 0.03, 0.05, 0.07, 0.1]  # , 0.2]
for fraction in fractions:
    results = subsample_dataset(
        contigs_file='contigs/IMGVR4_SEQUENCES.fna',
        metadata_file='contigs/contig_stats.parquet',
        output_dir=f'subsamples/fraction_{fraction}',
        reduce_factor=fraction,
        hq_fraction=0.8,         # 80% of high-quality
        taxonomic_rank='class',  # trying to keep the taxonomic diversity at this level
        # gc_bins=10,
        # length_bins=10,
        logger=logger,
        extract_method="pyfastx",
    )

subsample.py:296 - 2025-10-07 18:09:05,686 - INFO - Starting intelligent subsampling... - __main__ 
subsample.py:301 - 2025-10-07 18:09:05,687 - INFO - Using fraction reduction: 0.001 - __main__ 
subsample.py:310 - 2025-10-07 18:09:05,689 - INFO - Reading metadata from contigs/contig_stats.parquet - __main__ 
subsample.py:321 - 2025-10-07 18:09:05,785 - INFO - Loaded metadata with 5457198 entries - __main__ 
subsample.py:334 - 2025-10-07 18:09:06,130 - INFO - Using metadata for 5457198 contigs - __main__ 
subsample.py:340 - 2025-10-07 18:09:06,217 - INFO - Selected 381580 high-quality contigs - __main__ 
subsample.py:348 - 2025-10-07 18:09:06,218 - INFO - Target number of contigs: 381 - __main__ 
subsample.py:354 - 2025-10-07 18:09:06,423 - INFO - Length range: 165 - 2473870 bp - __main__ 
subsample.py:355 - 2025-10-07 18:09:06,424 - INFO - GC content range: 11.820 - 78.770 - __main__ 
subsample.py:358 - 2025-10-07 18:09:06,424 - INFO - Performing stratified sampling... - __main__ 


Found 39 unique taxa at rank class
Selected 3 contigs from Mouviricetes
Selected 10 contigs from Pokkesviricetes
Selected 99 contigs from Caudoviricetes
Selected 23 contigs from Miaviricetes
Selected 23 contigs from Maveriviricetes
Selected 27 contigs from Arfiviricetes
Selected 27 contigs from Duplopiviricetes
Selected 25 contigs from Stelpaviricetes
Selected 22 contigs from Magsaviricetes
Selected 12 contigs from Tokiviricetes
Selected 20 contigs from Leviviricetes
Selected 29 contigs from Alsuviricetes
Selected 19 contigs from Repensiviricetes
Selected 14 contigs from Flasuviricetes
Selected 40 contigs from Faserviricetes
Selected 26 contigs from Megaviricetes
Selected 25 contigs from Tolucaviricetes
Selected 30 contigs from Chrymotiviricetes
Selected 18 contigs from Monjiviricetes
Selected 38 contigs from Pisoniviricetes
Selected 5 contigs from Laserviricetes
Selected 5 contigs from Huolimaviricetes


subsample.py:360 - 2025-10-07 18:09:07,056 - INFO - Selected 808 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:09:07,057 - INFO - Writing selected contigs to subsamples/fraction_0.001/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 27 contigs from Malgrandaviricetes
Selected 6 contigs from Insthoviricetes
Selected 16 contigs from Papovaviricetes
Selected 24 contigs from Herviviricetes
Selected 13 contigs from Naldaviricetes
Selected 16 contigs from Resentoviricetes
Selected 35 contigs from Tectiliviricetes
Selected 7 contigs from Polintoviricetes
Selected 22 contigs from Revtraviricetes
Selected 18 contigs from Amabiliviricetes
Selected 9 contigs from Chunqiuviricetes
Selected 21 contigs from Ellioviricetes
Selected 12 contigs from Quintoviricetes
Selected 17 contigs from Howeltoviricetes
Selected 3 contigs from Yunchangviricetes
Selected 9 contigs from Vidaverviricetes
Selected 13 contigs from Milneviricetes


subsample.py:366 - 2025-10-07 18:09:23,707 - INFO - Written subsamples/fraction_0.001/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:09:23,715 - INFO - Writing subsampled metadata to subsamples/fraction_0.001/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:09:23,722 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:09:23,723 - INFO - Results saved to: subsamples/fraction_0.001 - __main__ 
subsample.py:377 - 2025-10-07 18:09:23,723 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:09:23,723 - INFO - Subsampled contigs: 808 - __main__ 
subsample.py:379 - 2025-10-07 18:09:23,724 - INFO - Reduction factor: 6753.958 - __main__ 
subsample.py:381 - 2025-10-07 18:09:23,724 - INFO - Results saved to: subsamples/fraction_0.001 - __main__ 
subsample.py:382 - 2025-10-07 18:09:23,724 - INFO - Original contigs: 5457198 - __main__ 
subsample

Found 39 unique taxa at rank class
Selected 42 contigs from Revtraviricetes
Selected 29 contigs from Alsuviricetes
Selected 7 contigs from Polintoviricetes
Selected 39 contigs from Herviviricetes
Selected 38 contigs from Repensiviricetes
Selected 39 contigs from Leviviricetes
Selected 39 contigs from Faserviricetes
Selected 1 contigs from Mouviricetes
Selected 34 contigs from Monjiviricetes
Selected 19 contigs from Laserviricetes
Selected 98 contigs from Caudoviricetes
Selected 29 contigs from Naldaviricetes
Selected 34 contigs from Howeltoviricetes
Selected 24 contigs from Pokkesviricetes
Selected 30 contigs from Chrymotiviricetes
Selected 42 contigs from Papovaviricetes
Selected 4 contigs from Yunchangviricetes
Selected 40 contigs from Insthoviricetes
Selected 24 contigs from Tokiviricetes
Selected 27 contigs from Malgrandaviricetes
Selected 44 contigs from Miaviricetes
Selected 12 contigs from Huolimaviricetes
Selected 46 contigs from Resentoviricetes
Selected 28 contigs from Duplop

subsample.py:360 - 2025-10-07 18:09:25,350 - INFO - Selected 1907 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:09:25,350 - INFO - Writing selected contigs to subsamples/fraction_0.005/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 28 contigs from Arfiviricetes
Selected 26 contigs from Tolucaviricetes
Selected 33 contigs from Tectiliviricetes
Selected 35 contigs from Vidaverviricetes
Selected 42 contigs from Milneviricetes
Selected 33 contigs from Flasuviricetes
Selected 37 contigs from Pisoniviricetes
Added 643 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:09:46,276 - INFO - Written subsamples/fraction_0.005/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:09:46,284 - INFO - Writing subsampled metadata to subsamples/fraction_0.005/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:09:46,292 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:09:46,292 - INFO - Results saved to: subsamples/fraction_0.005 - __main__ 
subsample.py:377 - 2025-10-07 18:09:46,292 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:09:46,293 - INFO - Subsampled contigs: 1907 - __main__ 
subsample.py:379 - 2025-10-07 18:09:46,293 - INFO - Reduction factor: 2861.666 - __main__ 
subsample.py:381 - 2025-10-07 18:09:46,293 - INFO - Results saved to: subsamples/fraction_0.005 - __main__ 
subsample.py:382 - 2025-10-07 18:09:46,294 - INFO - Original contigs: 5457198 - __main__ 
subsampl

Found 39 unique taxa at rank class
Selected 55 contigs from Naldaviricetes
Selected 85 contigs from Howeltoviricetes
Selected 73 contigs from Papovaviricetes
Selected 72 contigs from Faserviricetes
Selected 76 contigs from Revtraviricetes
Selected 76 contigs from Maveriviricetes
Selected 16 contigs from Chunqiuviricetes
Selected 74 contigs from Stelpaviricetes
Selected 77 contigs from Repensiviricetes
Selected 64 contigs from Insthoviricetes
Selected 71 contigs from Pisoniviricetes
Selected 29 contigs from Tokiviricetes
Selected 4 contigs from Mouviricetes
Selected 52 contigs from Milneviricetes
Selected 99 contigs from Caudoviricetes
Selected 87 contigs from Miaviricetes
Selected 60 contigs from Tectiliviricetes
Selected 10 contigs from Polintoviricetes
Selected 85 contigs from Chrymotiviricetes
Selected 2 contigs from Yunchangviricetes
Selected 51 contigs from Flasuviricetes
Selected 76 contigs from Duplopiviricetes
Selected 13 contigs from Huolimaviricetes
Selected 77 contigs from M

subsample.py:360 - 2025-10-07 18:09:47,959 - INFO - Selected 3815 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:09:47,960 - INFO - Writing selected contigs to subsamples/fraction_0.01/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 66 contigs from Tolucaviricetes
Selected 77 contigs from Arfiviricetes
Selected 23 contigs from Laserviricetes
Selected 85 contigs from Alsuviricetes
Selected 76 contigs from Quintoviricetes
Added 1437 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:10:16,891 - INFO - Written subsamples/fraction_0.01/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:10:16,899 - INFO - Writing subsampled metadata to subsamples/fraction_0.01/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:10:16,909 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:10:16,910 - INFO - Results saved to: subsamples/fraction_0.01 - __main__ 
subsample.py:377 - 2025-10-07 18:10:16,910 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:10:16,910 - INFO - Subsampled contigs: 3815 - __main__ 
subsample.py:379 - 2025-10-07 18:10:16,910 - INFO - Reduction factor: 1430.458 - __main__ 
subsample.py:381 - 2025-10-07 18:10:16,911 - INFO - Results saved to: subsamples/fraction_0.01 - __main__ 
subsample.py:382 - 2025-10-07 18:10:16,911 - INFO - Original contigs: 5457198 - __main__ 
subsample.py

Found 39 unique taxa at rank class
Selected 224 contigs from Chrymotiviricetes
Selected 197 contigs from Resentoviricetes
Selected 7 contigs from Polintoviricetes
Selected 231 contigs from Ellioviricetes
Selected 173 contigs from Amabiliviricetes
Selected 4 contigs from Mouviricetes
Selected 173 contigs from Megaviricetes
Selected 206 contigs from Tolucaviricetes
Selected 172 contigs from Tectiliviricetes
Selected 34 contigs from Tokiviricetes
Selected 119 contigs from Insthoviricetes
Selected 202 contigs from Maveriviricetes
Selected 232 contigs from Stelpaviricetes
Selected 58 contigs from Vidaverviricetes
Selected 93 contigs from Herviviricetes
Selected 128 contigs from Quintoviricetes
Selected 219 contigs from Arfiviricetes
Selected 229 contigs from Magsaviricetes
Selected 9 contigs from Chunqiuviricetes
Selected 4 contigs from Yunchangviricetes
Selected 197 contigs from Repensiviricetes
Selected 289 contigs from Howeltoviricetes
Selected 168 contigs from Faserviricetes
Selected 19

subsample.py:360 - 2025-10-07 18:10:18,566 - INFO - Selected 11447 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:10:18,567 - INFO - Writing selected contigs to subsamples/fraction_0.03/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 197 contigs from Caudoviricetes
Selected 17 contigs from Huolimaviricetes
Selected 51 contigs from Pokkesviricetes
Selected 119 contigs from Flasuviricetes
Added 5600 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:11:19,776 - INFO - Written subsamples/fraction_0.03/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:11:19,790 - INFO - Writing subsampled metadata to subsamples/fraction_0.03/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:11:19,804 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:11:19,804 - INFO - Results saved to: subsamples/fraction_0.03 - __main__ 
subsample.py:377 - 2025-10-07 18:11:19,805 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:11:19,805 - INFO - Subsampled contigs: 11447 - __main__ 
subsample.py:379 - 2025-10-07 18:11:19,805 - INFO - Reduction factor: 476.736 - __main__ 
subsample.py:381 - 2025-10-07 18:11:19,806 - INFO - Results saved to: subsamples/fraction_0.03 - __main__ 
subsample.py:382 - 2025-10-07 18:11:19,806 - INFO - Original contigs: 5457198 - __main__ 
subsample.py

Found 39 unique taxa at rank class
Selected 353 contigs from Tolucaviricetes
Selected 22 contigs from Laserviricetes
Selected 15 contigs from Huolimaviricetes
Selected 11 contigs from Chunqiuviricetes
Selected 364 contigs from Malgrandaviricetes
Selected 222 contigs from Megaviricetes
Selected 380 contigs from Magsaviricetes
Selected 258 contigs from Monjiviricetes
Selected 471 contigs from Howeltoviricetes
Selected 143 contigs from Flasuviricetes
Selected 275 contigs from Faserviricetes
Selected 346 contigs from Chrymotiviricetes
Selected 397 contigs from Pisoniviricetes
Selected 50 contigs from Pokkesviricetes
Selected 4 contigs from Yunchangviricetes
Selected 295 contigs from Repensiviricetes
Selected 106 contigs from Herviviricetes
Selected 305 contigs from Resentoviricetes
Selected 393 contigs from Caudoviricetes
Selected 288 contigs from Duplopiviricetes
Selected 348 contigs from Arfiviricetes
Selected 437 contigs from Alsuviricetes
Selected 29 contigs from Tokiviricetes
Selected

subsample.py:360 - 2025-10-07 18:11:21,464 - INFO - Selected 19079 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:11:21,478 - INFO - Writing selected contigs to subsamples/fraction_0.05/subsampled_data/subsampled_contigs.fa - __main__ 


Added 10362 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:12:54,044 - INFO - Written subsamples/fraction_0.05/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:12:54,060 - INFO - Writing subsampled metadata to subsamples/fraction_0.05/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:12:54,100 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:12:54,101 - INFO - Results saved to: subsamples/fraction_0.05 - __main__ 
subsample.py:377 - 2025-10-07 18:12:54,101 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:12:54,101 - INFO - Subsampled contigs: 19079 - __main__ 
subsample.py:379 - 2025-10-07 18:12:54,102 - INFO - Reduction factor: 286.032 - __main__ 
subsample.py:381 - 2025-10-07 18:12:54,102 - INFO - Results saved to: subsamples/fraction_0.05 - __main__ 
subsample.py:382 - 2025-10-07 18:12:54,102 - INFO - Original contigs: 5457198 - __main__ 
subsample.py

Found 39 unique taxa at rank class
Selected 139 contigs from Quintoviricetes
Selected 4 contigs from Yunchangviricetes
Selected 460 contigs from Ellioviricetes
Selected 78 contigs from Naldaviricetes
Selected 4 contigs from Mouviricetes
Selected 12 contigs from Chunqiuviricetes
Selected 582 contigs from Alsuviricetes
Selected 535 contigs from Miaviricetes
Selected 588 contigs from Caudoviricetes
Selected 352 contigs from Duplopiviricetes
Selected 554 contigs from Pisoniviricetes
Selected 17 contigs from Huolimaviricetes
Selected 248 contigs from Tectiliviricetes
Selected 349 contigs from Amabiliviricetes
Selected 25 contigs from Laserviricetes
Selected 664 contigs from Howeltoviricetes
Selected 402 contigs from Resentoviricetes
Selected 377 contigs from Maveriviricetes
Selected 179 contigs from Revtraviricetes
Selected 525 contigs from Leviviricetes
Selected 11 contigs from Polintoviricetes
Selected 318 contigs from Monjiviricetes
Selected 493 contigs from Magsaviricetes
Selected 60 co

subsample.py:360 - 2025-10-07 18:12:55,752 - INFO - Selected 26710 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:12:55,753 - INFO - Writing selected contigs to subsamples/fraction_0.07/subsampled_data/subsampled_contigs.fa - __main__ 


Added 15524 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:15:04,892 - INFO - Written subsamples/fraction_0.07/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:15:04,927 - INFO - Writing subsampled metadata to subsamples/fraction_0.07/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:15:04,948 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:15:04,949 - INFO - Results saved to: subsamples/fraction_0.07 - __main__ 
subsample.py:377 - 2025-10-07 18:15:04,949 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:15:04,950 - INFO - Subsampled contigs: 26710 - __main__ 
subsample.py:379 - 2025-10-07 18:15:04,950 - INFO - Reduction factor: 204.313 - __main__ 
subsample.py:381 - 2025-10-07 18:15:04,950 - INFO - Results saved to: subsamples/fraction_0.07 - __main__ 
subsample.py:382 - 2025-10-07 18:15:04,951 - INFO - Original contigs: 5457198 - __main__ 
subsample.py

Found 39 unique taxa at rank class
Selected 22 contigs from Laserviricetes
Selected 686 contigs from Pisoniviricetes
Selected 871 contigs from Caudoviricetes
Selected 479 contigs from Repensiviricetes
Selected 401 contigs from Papovaviricetes
Selected 57 contigs from Pokkesviricetes
Selected 639 contigs from Stelpaviricetes
Selected 589 contigs from Tolucaviricetes
Selected 372 contigs from Monjiviricetes
Selected 692 contigs from Leviviricetes
Selected 684 contigs from Magsaviricetes
Selected 499 contigs from Faserviricetes
Selected 81 contigs from Milneviricetes
Selected 3 contigs from Mouviricetes
Selected 590 contigs from Ellioviricetes
Selected 15 contigs from Huolimaviricetes
Selected 745 contigs from Miaviricetes
Selected 460 contigs from Maveriviricetes
Selected 112 contigs from Herviviricetes
Selected 517 contigs from Resentoviricetes
Selected 922 contigs from Howeltoviricetes
Selected 3 contigs from Yunchangviricetes
Selected 285 contigs from Tectiliviricetes
Selected 262 con

subsample.py:360 - 2025-10-07 18:15:06,669 - INFO - Selected 38158 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:15:06,670 - INFO - Writing selected contigs to subsamples/fraction_0.1/subsampled_data/subsampled_contigs.fa - __main__ 


Added 23966 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:17:57,821 - INFO - Written subsamples/fraction_0.1/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:17:57,861 - INFO - Writing subsampled metadata to subsamples/fraction_0.1/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:17:57,900 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:17:57,901 - INFO - Results saved to: subsamples/fraction_0.1 - __main__ 
subsample.py:377 - 2025-10-07 18:17:57,901 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:17:57,901 - INFO - Subsampled contigs: 38158 - __main__ 
subsample.py:379 - 2025-10-07 18:17:57,902 - INFO - Reduction factor: 143.016 - __main__ 
subsample.py:381 - 2025-10-07 18:17:57,902 - INFO - Results saved to: subsamples/fraction_0.1 - __main__ 
subsample.py:382 - 2025-10-07 18:17:57,902 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:383

Found 39 unique taxa at rank class
Selected 92 contigs from Naldaviricetes
Selected 306 contigs from Tectiliviricetes
Selected 190 contigs from Revtraviricetes
Selected 618 contigs from Maveriviricetes
Selected 1656 contigs from Howeltoviricetes
Selected 23 contigs from Laserviricetes
Selected 1344 contigs from Alsuviricetes
Selected 12 contigs from Chunqiuviricetes
Selected 108 contigs from Herviviricetes
Selected 3 contigs from Mouviricetes
Selected 751 contigs from Resentoviricetes
Selected 1073 contigs from Malgrandaviricetes
Selected 1120 contigs from Tolucaviricetes
Selected 1028 contigs from Stelpaviricetes
Selected 36 contigs from Tokiviricetes
Selected 4 contigs from Yunchangviricetes
Selected 55 contigs from Pokkesviricetes
Selected 420 contigs from Papovaviricetes
Selected 925 contigs from Ellioviricetes
Selected 133 contigs from Quintoviricetes
Selected 615 contigs from Duplopiviricetes
Selected 58 contigs from Vidaverviricetes
Selected 976 contigs from Arfiviricetes
Select

subsample.py:360 - 2025-10-07 18:17:59,710 - INFO - Selected 76316 contigs total - __main__ 
subsample.py:364 - 2025-10-07 18:17:59,711 - INFO - Writing selected contigs to subsamples/fraction_0.2/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 1328 contigs from Leviviricetes
Selected 1083 contigs from Magsaviricetes
Added 53471 additional contigs to reach target


subsample.py:366 - 2025-10-07 18:23:51,650 - INFO - Written subsamples/fraction_0.2/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:372 - 2025-10-07 18:23:51,749 - INFO - Writing subsampled metadata to subsamples/fraction_0.2/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:375 - 2025-10-07 18:23:51,825 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:376 - 2025-10-07 18:23:51,826 - INFO - Results saved to: subsamples/fraction_0.2 - __main__ 
subsample.py:377 - 2025-10-07 18:23:51,826 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:378 - 2025-10-07 18:23:51,826 - INFO - Subsampled contigs: 76316 - __main__ 
subsample.py:379 - 2025-10-07 18:23:51,827 - INFO - Reduction factor: 71.508 - __main__ 
subsample.py:381 - 2025-10-07 18:23:51,827 - INFO - Results saved to: subsamples/fraction_0.2 - __main__ 
subsample.py:382 - 2025-10-07 18:23:51,827 - INFO - Original contigs: 5457198 - __main__ 
subsample.py:383 

CPU times: user 2min 1s, sys: 52 s, total: 2min 53s
Wall time: 14min 46s
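
As a quick sanity check on the numbers in the log above, the sketch below recomputes the target sizes and reduction factors. It assumes the target is the floor of (high-quality contigs × fraction) and that the reduction factor is original contigs divided by subsampled contigs; both are inferred from the logged values, not read from the `subsample.py` source.

```python
n_total = 5457198  # "Original contigs" in the log
n_hq = 381580      # "Selected 381580 high-quality contigs"

fractions = [0.001, 0.005, 0.01, 0.03, 0.05, 0.07, 0.1, 0.2]

# "Selected N contigs total" per fraction, copied from the log.
logged = {
    0.001: 808, 0.005: 1907, 0.01: 3815, 0.03: 11447,
    0.05: 19079, 0.07: 26710, 0.1: 38158, 0.2: 76316,
}

# Assumed rule: target = floor(n_hq * fraction).
targets = {f: int(n_hq * f) for f in fractions}

for f in fractions:
    factor = round(n_total / logged[f], 3)
    print(f"fraction={f}: target={targets[f]}, logged={logged[f]}, reduction={factor}")
```

Under these assumptions the computed targets match the logged totals for every fraction except 0.001, where 808 contigs were selected against a target of 381; the log does not state why.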


Next, we call the `spacer_bencher` commands on the smallest subsample to get a sense of the runtime.


First, generating the scripts:

In [1]:
%%time
!spacer_bencher generate_scripts \
    --input_dir /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001 \
    --contigs /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001/subsampled_data/subsampled_contigs.fa \
    --spacers /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/spacers/All_CRISPR_spacers_nr_clean.fna \
    --warmups 1 \
    --max_runs 1 \
    --threads 8 
# !pwd

Generating tool execution scripts...
Initializing tools...
Initialized 12 tools
Disabling user-specified tools: vsearch
Generating script for blastn...
Generating script for bowtie1...
Generating script for bowtie2...
Generating script for indelfree...
Generating script for lexicmap...
Generating script for minimap2...
Generating script for minimap2_mod...
Generating script for mmseqs2...
Generating script for mummer4...
Generating script for sassy...
Generating script for strobealign...
Generating script for x_mapper...
Tool scripts generated successfully in /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001/bash_scripts/
CPU times: user 9.98 ms, sys: 4 ms, total: 14 ms
Wall time: 812 ms
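
To repeat this step for the other subsamples, the same command could be assembled in a loop. The sketch below only builds the argument lists and prints them; the helper `build_generate_scripts_cmd` and the relative root `imgvr4_data` are hypothetical, while the flags are the ones used in the cell above.

```python
from pathlib import Path

BASE = Path("imgvr4_data")  # hypothetical relative root of the data directory


def build_generate_scripts_cmd(fraction, base=BASE, threads=8, warmups=1, max_runs=1):
    """Return the argument list for one `spacer_bencher generate_scripts` call."""
    sample_dir = base / "subsamples" / f"fraction_{fraction}"
    return [
        "spacer_bencher", "generate_scripts",
        "--input_dir", str(sample_dir),
        "--contigs", str(sample_dir / "subsampled_data" / "subsampled_contigs.fa"),
        "--spacers", str(base / "spacers" / "All_CRISPR_spacers_nr_clean.fna"),
        "--warmups", str(warmups),
        "--max_runs", str(max_runs),
        "--threads", str(threads),
    ]


commands = [build_generate_scripts_cmd(f) for f in (0.001, 0.005, 0.01)]
for cmd in commands:
    print(" ".join(cmd))
    # To execute: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the commands as lists avoids shell quoting issues if they are later passed to `subprocess.run`.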


Then, running the scripts locally (here only `sassy`, selected with `-rt`):

In [2]:
%%time
!pixi run spacer_bencher run_tools \
  -i /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001 \
  --spacers /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/spacers/All_CRISPR_spacers_nr_clean.fna \
  --contigs /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001/subsampled_data/subsampled_contigs.fa \
  --threads 8 \
  --max_runs 1 \
  --warmups 0 \
  -rt "sassy"
  # !pwd

[2K[32m⠁[0m activating environment                                                                 Executing tools...
Initializing tools...
Initialized 12 tools
Disabling user-specified tools: vsearch
Running user-specified tools: sassy
Running tools...
Running sassy.sh in /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001/bash_scripts/sassy.sh
[2K[1mBenchmark [0m[1m1[0m: sassy search --alphabet iupac --pattern-fasta /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/spacers/All_CRISPR_spacers_nr_clean.fna -k 5 --output /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001/raw_outputs//sassy.tsv --threads 8  /clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/subsamples/fraction_0.001/subsampled_data/subsampled_contigs.fa
[2K ⠼ Performing warmup runs         ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ETA 00:00:00 [?25hTraceb

In [None]:
# %%time
# ! spacer_bencher run imgvr4_data/subsamples/fraction_0.001/contigs.fna -o imgvr4_data/subsamples/fraction_0.001/spacer_bencher_output --threads 16 --log-level INFO