## Create multiple subsamples of the contigs for statistical analysis
Using the `spacer_bencher subsample` command, we can create metadata-informed subsamples of the contigs. The goal is to keep the subsamples balanced across taxonomic groups and sequence features (namely length and GC content), while reducing the number of contigs to a size at which all or most of the tools finish in a reasonable time.
The subsample command requires a contigs file and a metadata table with at least the following columns: contig name, `length`, taxonomy string, and GC content. The current table does not have the GC content, so we'll add that column first from the output of the `seqkit fx2tab` command.
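
The GC-merge step is not shown in this notebook, so here is a minimal sketch of one way to do it with the standard library only. It assumes the `seqkit fx2tab` output is tab-separated with the sequence ID in the first field and the GC percentage in the last field (for example from `seqkit fx2tab --name --only-id --gc`). The function name `add_gc_column`, the example contig IDs, and the values are hypothetical and only illustrate the join.

```python
import csv
import io


def add_gc_column(metadata_rows, fx2tab_text, id_column="seqid"):
    """Return copies of the metadata rows with a 'gc' value looked up by contig ID."""
    reader = csv.reader(io.StringIO(fx2tab_text), delimiter="\t")
    # Assumption: first field is the sequence ID, last field is the GC percentage.
    # Taking the last field also tolerates empty placeholder columns in between.
    gc_by_id = {fields[0]: float(fields[-1]) for fields in reader if fields}
    # Left join: contigs absent from the seqkit output get gc=None.
    return [{**row, "gc": gc_by_id.get(row[id_column])} for row in metadata_rows]


# Made-up example rows; the real table has one row per contig.
metadata = [
    {"seqid": "contig_a", "length": 5594, "class": "Faserviricetes"},
    {"seqid": "contig_b", "length": 41230, "class": "Caudoviricetes"},
    {"seqid": "contig_c", "length": 900, "class": "Leviviricetes"},
]
fx2tab_text = "contig_a\t38.36\ncontig_b\t52.10\n"

with_gc = add_gc_column(metadata, fx2tab_text)
for row in with_gc:
    print(row["seqid"], row["gc"])
```

With a polars table, the same left join can be expressed with `DataFrame.join(..., on="seqid", how="left")`; rows with a missing GC value should be checked before subsampling, since they cannot be assigned to a GC bin.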

In [1]:
import os
os.chdir('/clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/')
import polars as pl
import logging
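# Use the logging registry instead of instantiating Logger directly
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate messages if the root logger has handlers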
handler = logging.StreamHandler()
# Set a format that makes log messages clickable in JupyterLab (file:line)
formatter = logging.Formatter('%(filename)s:%(lineno)s - %(asctime)s - %(levelname)s - %(message)s - %(name)s ')
handler.setFormatter(formatter)
logger.addHandler(handler)

pl.Config(tbl_rows=50)

from bench import *
from bench.utils.functions import *
# Import the functionalized subsampling functions
from bench.commands.subsample import subsample_dataset

# Set up parameters for subsampling
contigs_file = 'contigs/IMGVR4_SEQUENCES.fna'
spacers_file = 'spacers/All_CRISPR_spacers_nr_clean.fna'
metadata_file = 'contigs/contig_stats.parquet'
output_dir = 'subsampled_data'


In [2]:
contig_stats = pl.read_parquet(metadata_file)
contig_stats.head(1)

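| seqid | length | gc | topology | uvig | hq | … | phylum | class | order | family | genus | species |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | i64 | f64 | str | str | bool | … | str | str | str | str | str | str |
| "IMGVR_UViG_2504643025_000001\|2…" | 5594 | 38.36 | "Provirus" | "IMGVR_UViG_2504643025_000001" | True | … | "Hofneiviricota" | "Faserviricetes" | "Tubulavirales" | null | null | null |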


In [3]:
%%time
results = subsample_dataset(
    contigs_file='contigs/IMGVR4_SEQUENCES.fna',
    metadata_file='contigs/contig_stats.parquet', 
    output_dir='subsampled_data',
    reduce_factor=0.01,  # Fraction to keep (1%), not an absolute number; logged as "fraction reduction"
    hq_fraction=0.8,    # 80% of high-quality
    taxonomic_rank='class', # Class level diversity
    gc_bins=15,
    length_bins=15,
    logger=logger,
    extract_method="iter"    
)

subsample.py:296 - 2025-10-07 15:53:15,928 - INFO - Starting intelligent subsampling... - __main__ 
subsample.py:301 - 2025-10-07 15:53:15,929 - INFO - Using fraction reduction: 0.01 - __main__ 
subsample.py:310 - 2025-10-07 15:53:15,931 - INFO - Reading metadata from contigs/contig_stats.parquet - __main__ 
subsample.py:321 - 2025-10-07 15:53:16,197 - INFO - Loaded metadata with 5457198 entries - __main__ 
subsample.py:334 - 2025-10-07 15:53:16,510 - INFO - Using metadata for 5457198 contigs - __main__ 
subsample.py:340 - 2025-10-07 15:53:16,527 - INFO - Selected 381580 high-quality contigs - __main__ 
subsample.py:348 - 2025-10-07 15:53:16,528 - INFO - Target number of contigs: 3815 - __main__ 
subsample.py:354 - 2025-10-07 15:53:16,735 - INFO - Length range: 165 - 2473870 bp - __main__ 
subsample.py:355 - 2025-10-07 15:53:16,736 - INFO - GC content range: 11.820 - 78.770 - __main__ 
subsample.py:358 - 2025-10-07 15:53:16,736 - INFO - Performing stratified sampling... - __main__ 


Found 34 unique taxa at rank class
Selected 75 contigs from Repensiviricetes
Selected 81 contigs from Amabiliviricetes
Selected 64 contigs from Arfiviricetes
Selected 69 contigs from Pisoniviricetes
Selected 59 contigs from Tolucaviricetes
Selected 16 contigs from Flasuviricetes
Selected 22 contigs from Revtraviricetes
Selected 4 contigs from Huolimaviricetes
Selected 63 contigs from Papovaviricetes
Selected 72 contigs from Tectiliviricetes
Selected 66 contigs from Faserviricetes
Selected 92 contigs from Duplopiviricetes
Selected 59 contigs from Malgrandaviricetes
Selected 89 contigs from Leviviricetes
Selected 13 contigs from Tokiviricetes
Selected 85 contigs from Megaviricetes
Selected 96 contigs from Monjiviricetes
Selected 66 contigs from Ellioviricetes
Selected 212 contigs from Caudoviricetes
Selected 59 contigs from Miaviricetes
Selected 81 contigs from Chrymotiviricetes
Selected 98 contigs from Alsuviricetes
Selected 107 contigs from Magsaviricetes
Selected 42 contigs from Vidav

subsample.py:360 - 2025-10-07 15:53:17,920 - INFO - Selected 3815 contigs total - __main__ 
subsample.py:364 - 2025-10-07 15:53:17,921 - INFO - Writing selected contigs to subsampled_data/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 81 contigs from Stelpaviricetes
Selected 3 contigs from Pokkesviricetes
Added 1753 additional contigs to reach target
Written 100 contigs...
Written 200 contigs...
Written 300 contigs...
Written 400 contigs...
Written 500 contigs...
Written 600 contigs...
Written 700 contigs...
Written 800 contigs...
Written 900 contigs...
Written 1000 contigs...
Written 1100 contigs...
Written 1200 contigs...
Written 1300 contigs...
Written 1400 contigs...
Written 1500 contigs...
Written 1600 contigs...
Written 1700 contigs...
Written 1800 contigs...
Written 1900 contigs...
Written 2000 contigs...
Written 2100 contigs...
Written 2200 contigs...
Written 2300 contigs...
Written 2400 contigs...
Written 2500 contigs...
Written 2600 contigs...
Written 2700 contigs...
Written 2800 contigs...
Written 2900 contigs...
Written 3000 contigs...
Written 3100 contigs...
Written 3200 contigs...
Written 3300 contigs...
Written 3400 contigs...
Written 3500 contigs...
Written 3600 contigs...
Written 3700 conti

subsample.py:366 - 2025-10-07 15:58:37,364 - INFO - Written 3815 subsampled contigs - __main__ 
subsample.py:378 - 2025-10-07 15:58:37,376 - INFO - Writing subsampled metadata to subsampled_data/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:390 - 2025-10-07 15:58:37,410 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:391 - 2025-10-07 15:58:37,410 - INFO - Results saved to: subsampled_data - __main__ 
subsample.py:392 - 2025-10-07 15:58:37,411 - INFO - Summary report: subsampled_data/subsampling_report.txt - __main__ 


CPU times: user 1min 54s, sys: 55.9 s, total: 2min 50s
Wall time: 5min 21s
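
The log above shows per-class selections followed by "Added 1753 additional contigs to reach target", which suggests stratified sampling with a random top-up. The following is a minimal sketch of that idea, not the actual `subsample_dataset` implementation: the function names, the equal-width binning, and the equal per-stratum quota are all assumptions made for illustration, and the demo records are synthetic.

```python
import random
from collections import defaultdict


def bin_index(value, low, high, n_bins):
    """Map a value in [low, high] to an equal-width bin index in [0, n_bins - 1]."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


def stratified_sample(records, target, n_bins=15, seed=0):
    """Sample `target` records, spreading picks over taxon x length-bin x GC-bin strata."""
    rng = random.Random(seed)
    lengths = [r["length"] for r in records]
    gcs = [r["gc"] for r in records]
    len_lo, len_hi = min(lengths), max(lengths)
    gc_lo, gc_hi = min(gcs), max(gcs)

    strata = defaultdict(list)
    for r in records:
        key = (
            r["taxon"],
            bin_index(r["length"], len_lo, len_hi, n_bins),
            bin_index(r["gc"], gc_lo, gc_hi, n_bins),
        )
        strata[key].append(r)

    # Assumed quota rule: the same number of picks per stratum, at least one.
    quota = max(1, target // len(strata))
    selected = []
    for key in sorted(strata):
        members = strata[key]
        selected.extend(rng.sample(members, min(quota, len(members))))

    if len(selected) > target:
        # More strata than the target allows: keep a random subset.
        rng.shuffle(selected)
        selected = selected[:target]
    elif len(selected) < target:
        # Top-up step: fill the remainder at random from unselected records.
        chosen = {r["seqid"] for r in selected}
        rest = [r for r in records if r["seqid"] not in chosen]
        selected.extend(rng.sample(rest, min(target - len(selected), len(rest))))
    return selected


# Target size consistent with the log: 1% of the 381580 high-quality contigs.
hq_contigs = 381_580
logged_target = int(hq_contigs * 0.01)

# Synthetic, deliberately unbalanced demo data (three hypothetical classes).
demo_rng = random.Random(42)
records = []
for taxon, count in [("ClassA", 250), ("ClassB", 45), ("ClassC", 5)]:
    for i in range(count):
        records.append({
            "seqid": f"{taxon}_{i}",
            "taxon": taxon,
            "length": demo_rng.randint(165, 200_000),
            "gc": demo_rng.uniform(12.0, 78.0),
        })

sample = stratified_sample(records, target=30, n_bins=2, seed=1)
print(len(sample), sorted({r["taxon"] for r in sample}))
```

Because every stratum contributes at least one record before the top-up, rare classes stay represented even when a single class dominates the input. Contig lengths here span several orders of magnitude (165 to 2473870 bp), so binning on a log scale may give more even length strata than the equal-width bins used in this sketch.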


In [4]:
%%time
results = subsample_dataset(
    contigs_file='contigs/IMGVR4_SEQUENCES.fna',
    metadata_file='contigs/contig_stats.parquet', 
    output_dir='subsampled_data',
    reduce_factor=0.01,  # Fraction to keep (1%), not an absolute number; logged as "fraction reduction"
    hq_fraction=0.8,    # 80% of high-quality
    taxonomic_rank='class', # Class level diversity
    gc_bins=15,
    length_bins=15,
    logger=logger,
    extract_method="pyfastx"    
)

subsample.py:296 - 2025-10-07 15:58:37,571 - INFO - Starting intelligent subsampling... - __main__ 
subsample.py:301 - 2025-10-07 15:58:37,571 - INFO - Using fraction reduction: 0.01 - __main__ 
subsample.py:310 - 2025-10-07 15:58:37,573 - INFO - Reading metadata from contigs/contig_stats.parquet - __main__ 
subsample.py:321 - 2025-10-07 15:58:37,665 - INFO - Loaded metadata with 5457198 entries - __main__ 
subsample.py:334 - 2025-10-07 15:58:37,998 - INFO - Using metadata for 5457198 contigs - __main__ 
subsample.py:340 - 2025-10-07 15:58:38,007 - INFO - Selected 381580 high-quality contigs - __main__ 
subsample.py:348 - 2025-10-07 15:58:38,008 - INFO - Target number of contigs: 3815 - __main__ 
subsample.py:354 - 2025-10-07 15:58:38,214 - INFO - Length range: 165 - 2473870 bp - __main__ 
subsample.py:355 - 2025-10-07 15:58:38,214 - INFO - GC content range: 11.820 - 78.770 - __main__ 
subsample.py:358 - 2025-10-07 15:58:38,215 - INFO - Performing stratified sampling... - __main__ 


Found 34 unique taxa at rank class
Selected 81 contigs from Chrymotiviricetes
Selected 58 contigs from Milneviricetes
Selected 3 contigs from Naldaviricetes
Selected 92 contigs from Duplopiviricetes
Selected 75 contigs from Repensiviricetes
Selected 4 contigs from Huolimaviricetes
Selected 69 contigs from Pisoniviricetes
Selected 5 contigs from Herviviricetes
Selected 96 contigs from Monjiviricetes
Selected 16 contigs from Flasuviricetes
Selected 20 contigs from Quintoviricetes
Selected 98 contigs from Alsuviricetes
Selected 89 contigs from Leviviricetes
Selected 66 contigs from Ellioviricetes
Selected 59 contigs from Malgrandaviricetes
Selected 59 contigs from Tolucaviricetes
Selected 85 contigs from Megaviricetes
Selected 81 contigs from Amabiliviricetes
Selected 72 contigs from Tectiliviricetes
Selected 11 contigs from Polintoviricetes
Selected 11 contigs from Laserviricetes
Selected 81 contigs from Stelpaviricetes
Selected 13 contigs from Tokiviricetes
Selected 64 contigs from Arfi

subsample.py:360 - 2025-10-07 15:58:39,276 - INFO - Selected 3815 contigs total - __main__ 
subsample.py:364 - 2025-10-07 15:58:39,277 - INFO - Writing selected contigs to subsampled_data/subsampled_data/subsampled_contigs.fa - __main__ 


Selected 95 contigs from Howeltoviricetes
Selected 22 contigs from Revtraviricetes
Selected 107 contigs from Magsaviricetes
Added 1753 additional contigs to reach target


subsample.py:366 - 2025-10-07 15:59:21,708 - INFO - Written subsampled_data/subsampled_data/subsampled_contigs.fa subsampled contigs - __main__ 
subsample.py:378 - 2025-10-07 15:59:21,716 - INFO - Writing subsampled metadata to subsampled_data/subsampled_data/subsampled_metadata.tsv - __main__ 
subsample.py:390 - 2025-10-07 15:59:21,757 - INFO - Subsampling completed successfully! - __main__ 
subsample.py:391 - 2025-10-07 15:59:21,758 - INFO - Results saved to: subsampled_data - __main__ 
subsample.py:392 - 2025-10-07 15:59:21,758 - INFO - Summary report: subsampled_data/subsampling_report.txt - __main__ 


CPU times: user 15.3 s, sys: 4.92 s, total: 20.2 s
Wall time: 44.3 s
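
Comparing the two runs, the selection step is identical and only the extraction differs: `extract_method="iter"` took 5min 21s of wall time, while `extract_method="pyfastx"` took 44.3 s, roughly 7 times faster. A plausible explanation is that a streaming extractor has to read through the multi-million-record FASTA file to find the 3815 selected contigs, whereas an indexed reader such as pyfastx can seek directly to each record. The sketch below shows what a streaming extraction could look like; it is an assumption about the general technique, not the code behind the `"iter"` option, and the function names and example sequences are hypothetical.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_scan(handle, wanted_ids):
    """Collect the wanted sequences in a single pass over the stream."""
    remaining = set(wanted_ids)
    found = {}
    for header, seq in iter_fasta(handle):
        # The sequence ID is the header text up to the first whitespace.
        seqid = header.split(maxsplit=1)[0] if header else ""
        if seqid in remaining:
            found[seqid] = seq
            remaining.discard(seqid)
            if not remaining:
                break  # stop early once every wanted ID has been seen
    return found


fasta_text = ">c1 first\nACGT\nAC\n>c2\nGGGG\n>c3 third\nTTTT\n"
hits = extract_by_scan(io.StringIO(fasta_text), {"c1", "c3"})
print(hits)
```

The scan costs time proportional to the size of the whole file regardless of how few contigs are requested, which is why an index pays off when the subsample is a small fraction of the input and the extraction is repeated for several subsamples.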
