## Downloading and filtering real data  

For a "real" data test case, we'll test the different aligners/search software on the virus sequences from IMG/VR4.  
In IMG/VR4 ([DOI: 10.1093/nar/gkac1037](https://doi.org/10.1093/nar/gkac1037)), the host assignment [workflow](https://github.com/jgi-microbiome-data-science/crispr-host-prediction) identifies spacer-protospacer pairs via `blastn -w 8 --dust no ...  --max_target_seqs 1000`, which is roughly equivalent to using `blastn -task blastn-short`. The results are then filtered and passed to the `assign_host.py` script, which uses the spacers' lineage information.  
Note that we do not evaluate any filtering applied after the pairing step, or the host assignment itself. For that reason, we also do not benchmark dedicated host assignment tools such as SpacePHARER, WIsH, or iPHoP (even though they may be useful or may use spacer-protospacer pairs).  

In our benchmark, we'll use a more up-to-date and curated spacer set than the one used in IMG/VR4: the CRISPR spacer set from iPHoP [https://bitbucket.org/srouxjgi/iphop/src/main/#markdown-header-host-databases-and-versions](https://bitbucket.org/srouxjgi/iphop/src/main/#markdown-header-host-databases-and-versions). Specifically, we'll use the June 2025 upload.



### Fetch IMG/VR4
The full IMG/VR4 data can be downloaded from the [JGI portal](https://genome.jgi.doe.gov/portal/pages/dynamicOrganismDownload.jsf?organism=IMG_VR). This requires logging in and selecting the files we want:  
- IMGVR_all_nucleotides-high_confidence.fna
- IMGVR_all_Sequence_information-high_confidence.tsv
<!-- 
For convenience, we'll download the FASTA file from an unstable FTP hosting of it on NERSC:
- [IMGVR4_SEQUENCES.fna.zst](https://portal.nersc.gov/genomad/__data__/IMGVR_DATA/IMGVR4_SEQUENCES.fna.zst) -->


In [86]:
# %%bash
# mkdir imgvr4_data
# cd imgvr4_data
# mkdir spacers contigs

# cd contigs
# # download the contigs (this may take a while)
# wget https://portal.nersc.gov/genomad/__data__/IMGVR_DATA/IMGVR4_SEQUENCES.fna.zst --quiet -O IMGVR4_SEQUENCES.fna.zst
# # decompress the file
# zstd -d IMGVR4_SEQUENCES.fna.zst
# # get the name, length and GC% of the contigs (this may also take a while)
# seqkit fx2tab -nl --gc IMGVR4_SEQUENCES.fna  > IMGVR4_SEQUENCES_name_length.tab
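
The `seqkit fx2tab -nl --gc` step above produces one row per contig with its name, length and GC%. For readers without seqkit, a minimal pure-Python sketch of the same table might look like the following. The function names (`read_fasta`, `name_length_gc`) and the demo sequences are made up for illustration, and rounding GC% to two decimals is an assumption about the output format, so values may differ slightly from seqkit (for example on sequences with ambiguous bases).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (full header, sequence) pairs from a FASTA text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def name_length_gc(handle: TextIO) -> Iterator[Tuple[str, int, float]]:
    """Yield (name, length, GC%) per record; GC% counts only G and C."""
    for name, seq in read_fasta(handle):
        length = len(seq)
        gc_count = sum(seq.upper().count(base) for base in "GC")
        gc_pct = round(100 * gc_count / length, 2) if length else 0.0
        yield name, length, gc_pct


# Tiny in-memory demo (hypothetical sequences, not IMG/VR data).
demo = io.StringIO(">contig_a some description\nACGT\nGGCC\n>contig_b\nATAT\n")
rows = list(name_length_gc(demo))
for name, length, gc_pct in rows:
    print(f"{name}\t{length}\t{gc_pct}")
```

For the full IMG/VR4 FASTA, seqkit remains the practical choice; the sketch only documents what the three columns mean.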

Note, we'll use the IMG_VR_2022-09-20_7.1 version of IMG/VR (v4.1, with bugfix in UViG table and protein fasta), and specifically the high-confidence genomes only (`IMGVR_all_nucleotides-high_confidence.fna`).  
To get the info table, you'll need to log in to the JGI portal and download the `IMGVR_all_Sequence_information-high_confidence.tsv` file. The data is free and unrestricted, but still requires registration or confirming the data use agreement. 


### Fetch the spacer set
We are only interested in one (or two) files from the iphop database:
- `All_CRISPR_spacers_nr_clean.fna`
- `Host_Genomes.tsv`  

**Note:** see the spacer_inspection.ipynb notebook for further details and analyses of the spacer set.


In [87]:
# %%bash
# cd imgvr4_data/spacers
# wget https://portal.nersc.gov/cfs/m342/iphop/db/extra/All_CRISPR_spacers_nr_clean.fna
# wget https://portal.nersc.gov/cfs/m342/iphop/db/extra/Host_Genomes.tsv
# # create a table with the name and length of the spacers
# seqkit fx2tab -nl All_CRISPR_spacers_nr_clean.fna  > All_CRISPR_spacers_nr_clean_name_length.tab

Next, some data wrangling to get the contig info and the spacer info in the same format.

In [88]:
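import os
# Site-specific working directory; point this at your own copy of imgvr4_data/.
os.chdir('/clusterfs/jgi/scratch/science/metagen/neri/code/blits/spacer_bench/imgvr4_data/')
import polars as pl
pl.Config(tbl_rows=50)  # print up to 50 rows per table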

from bench import *
from bench.utils.functions import *

contigs_file = 'contigs/IMGVR4_SEQUENCES.fna'

contig_stats = pl.read_csv('contigs/IMGVR4_SEQUENCES_name_length.tab', separator='\t',has_header=False,new_columns=['seqid','length',"GC"])
contig_info = pl.read_csv('contigs/IMGVR_all_Sequence_information-high_confidence.tsv', separator='\t',infer_schema_length=10000, null_values=['N/A'])
contig_info = contig_info.rename({"Coordinates ('whole' if the UViG is the entire contig)":"coordinates"})
# 'whole' means the UViG spans the entire contig; store it as null so it is skipped when building seqid.
contig_info = contig_info.with_columns(
    pl.when(pl.col('coordinates').eq('whole')).then(None).otherwise(pl.col('coordinates')).alias('coordinates')
)
# Rebuild the FASTA header: UVIG|Taxon_oid|Scaffold_oid[|coordinates], leaving out null fields.
contig_info = contig_info.with_columns(
    pl.concat_str([pl.col('UVIG'),pl.col('Taxon_oid'),pl.col('Scaffold_oid'),pl.col('coordinates')],ignore_nulls=True,separator='|').alias('seqid')
)
print(f"contig_stats.schema: {contig_stats.schema}")
print(f"contig_stats.shape: {contig_stats.shape}")

print(f"contig_info.schema: {contig_info.schema}")
print(f"contig_info.shape: {contig_info.shape}")
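# Keep only contigs present in both the FASTA-derived stats and the info table.
contig_stats = contig_stats.join(contig_info, on='seqid', how='inner')

# Sanity check: keep rows whose measured length matches the reported one, then drop the duplicate column.
contig_stats = contig_stats.filter(pl.col("length") == pl.col("Length")).drop("Length")
# Drop GVMAG (giant virus MAG) entries.
contig_stats = contig_stats.filter(pl.col("Topology") != "GVMAG")

# Normalize column names: lowercase, with spaces replaced by underscores.
contig_stats = contig_stats.rename(lambda x: x.lower().replace(" ","_"))

# Drop contigs whose lineage string contains five consecutive empty ranks.
contig_stats = contig_stats.filter(~pl.col("taxonomic_classification").str.contains(";;;;;"))
# Flag high-quality or reference genomes (a missing quality label counts as False).
contig_stats = contig_stats.with_columns(
    pl.when(pl.col("miuvig_quality").str.contains("High-quality|Reference")).then(True).otherwise(False).alias("hq")
)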
# Split the semicolon-delimited lineage string into one column per rank.
rank_prefixes = {
    "realm": "r", "kingdom": "k", "phylum": "p", "class": "c",
    "order": "o", "family": "f", "genus": "g", "species": "s",
}
contig_stats = contig_stats.with_columns([
    pl.col("taxonomic_classification")
    .str.extract(rf"{prefix}__([^;]*)", 1).alias(rank)
    for rank, prefix in rank_prefixes.items()
])
contig_stats = contig_stats.drop(["taxonomic_classification","miuvig_quality"])

contig_stats.schema: Schema([('seqid', String), ('length', Int64), ('GC', Float64)])
contig_stats.shape: (15881302, 3)
contig_info.schema: Schema([('UVIG', String), ('Taxon_oid', Int64), ('Scaffold_oid', String), ('coordinates', String), ('Ecosystem classification', String), ('vOTU', String), ('Length', Int64), ('Topology', String), ('geNomad score', Float64), ('Confidence', String), ('Estimated completeness', String), ('Estimated contamination', Float64), ('MIUViG quality', String), ('Gene content (total genes;cds;tRNA;geNomad marker)', String), ('Taxonomic classification', String), ('Taxonomic classification method', String), ('Host taxonomy prediction', String), ('Host prediction method', String), ('Sequence origin (doi)', String), ('seqid', String)])
contig_info.shape: (5576197, 20)
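
To make the `seqid` construction in the cell above concrete, here is a small dependency-free sketch of the same joining rule: the four fields are joined with `|`, and a `whole` (or missing) coordinate is left out. The helper name `build_seqid` is hypothetical, and the field values in the demo are inferred by splitting the sample `seqid`s printed later in this notebook, so treat them as illustrative.

```python
from typing import Optional, Union


def build_seqid(
    uvig: Optional[str],
    taxon_oid: Optional[Union[int, str]],
    scaffold_oid: Optional[str],
    coordinates: Optional[str],
) -> str:
    """Join the non-missing fields with '|', treating 'whole' as missing."""
    if coordinates == "whole":
        coordinates = None
    parts = [uvig, taxon_oid, scaffold_oid, coordinates]
    return "|".join(str(part) for part in parts if part is not None)


# A UViG covering the entire contig: no coordinate suffix.
whole_contig_id = build_seqid(
    "IMGVR_UViG_3300017956_002023", 3300017956, "Ga0181580_10017582", "whole"
)
# A UViG that is a region of a longer contig: coordinates are appended.
region_id = build_seqid(
    "IMGVR_UViG_3300027719_000271", 3300027719, "Ga0209467_1000224", "56315-89830"
)
print(whole_contig_id)
print(region_id)
```

The reconstructed string has to match the FASTA header exactly, since the inner join on `seqid` silently drops anything that does not match.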


## Removing most eukaryotic virus families

In [None]:
ICTV_Euk_virus_order = ['Reovirales', 'Sobelivirales', 'Patatavirales', 'Picornavirales', 'Chitovirales', 'Tolivirales', 'Ortervirales', 'Rowavirales', 'Asfuvirales', 'Ourlivirales', 'Goujianvirales', 'Nodamuvirales', 'Tymovirales', 'Herpesvirales', 'Piccovirales', 'Amarillovirales', 'Bunyavirales', 'Rohanvirales', 'Baphyvirales', 'Stellavirales', 'Algavirales', 'Cirlivirales', 'Martellivirales', 'Recrevirales', 'Pimascovirales', 'Mononegavirales', 'Zurhausenvirales', 'Mulpavirales', 'Yadokarivirales', 'Blubervirales', 'Polivirales', 'Serpentovirales', 'Ghabrivirales', 'Lefavirales', 'Cryppavirales', 'Nidovirales', 'Articulavirales', 'Wolframvirales', 'Sepolyvirales', 'Imitervirales', 'Hepelivirales', 'Muvirales', 'Jingchuvirales']
ICTV_Euk_virus_family = ["Abyssoviridae", "Adamaviridae", "Adenoviridae", "Aliusviridae", "Alloherpesviridae", "Allomimiviridae", "Alphaflexiviridae", "Alphaormycoviridae", "Alphasatellitidae", "Alphatetraviridae", "Alternaviridae", "Alvernaviridae", "Amalgaviridae", "Amesuviridae", "Amnoonviridae", "Anelloviridae", "Anicreviridae", "Arenaviridae", "Arteriviridae", "Artiviridae", "Artoviridae", "Ascoviridae", "Asfarviridae", "Aspiviridae", "Astroviridae", "Avsunviroidae", "Bacilladnaviridae", "Baculoviridae", "Barnaviridae", "Belpaoviridae", "Benyviridae", "Betaflexiviridae", "Betaormycoviridae", "Bidnaviridae", "Birnaviridae", "Bornaviridae", "Botourmiaviridae", "Botybirnaviridae", "Bromoviridae", "Burtonviroviridae", "Carmotetraviridae", "Caulimoviridae", "Chrysoviridae", "Chuviridae", "Closteroviridae", "Circoviridae","Coronaviridae", "Cremegaviridae", "Crepuscuviridae", "Cruliviridae", "Curvulaviridae", "Deltaflexiviridae", "Deltanormycoviridae", "Dicistroviridae", "Discoviridae", "Dishuiviroviridae", "Draupnirviridae", "Dumbiviridae", "Endolinaviridae", "Endornaviridae", "Eupolintoviridae", "Euroniviridae", "Filamentoviridae", "Filoviridae", "Fimoviridae", "Flaviviridae", "Fusagraviridae", "Fusariviridae", "Gammaflexiviridae", "Gammaormycoviridae", "Gandrviridae",  "Geplanaviridae", "Giardiaviridae", "Gresnaviridae", "Gulliviroviridae", "Hadakaviridae", "Hantaviridae", "Hepadnaviridae", "Hepeviridae", "Hydriviridae", "Hypoviridae", "Hytrosaviridae", "Iflaviridae", "Inseviridae", "Iridoviridae", "Kanorauviridae", "Kirkoviridae", "Kitaviridae", "Kolmioviridae", "Konkoviridae", "Lebotiviridae", "Leishbuviridae", "Lispiviridae", "Mahapunaviridae", "Malacoherpesviridae", "Mamonoviridae", "Marnaviridae", "Marseilleviridae", "Matonaviridae", "Maviroviridae", "Mayoviridae", "Medioniviridae", "Megabirnaviridae", "Megatotiviridae", "Mesomimiviridae", "Mesoniviridae", "Metaviridae", "Metaxyviridae", "Mimiviridae", "Mitoviridae", "Monocitiviridae", "Mononiviridae", 
"Mycoalphaviridae", "Mymonaviridae", "Mypoviridae", "Myriaviridae", "Nairoviridae", "Nanghoshaviridae", "Nanhypoviridae", "Nanoviridae", "Narnaviridae", "Naryaviridae", "Natareviridae", "Nenyaviridae", "Nimaviridae", "Nodaviridae", "Noraviridae", "Nudiviridae", "Nyamiviridae", "Olifoviridae", "Omnilimnoviroviridae", "Oomyviridae", "Ootiviridae", "Orpheoviridae", "Orthoherpesviridae", "Orthomyxoviridae", "Orthototiviridae", "Ouroboviridae", "Pamosaviridae", "Papillomaviridae", "Paramyxoviridae", "Parvoviridae", "Pecoviridae", "Peribunyaviridae", "Permutotetraviridae", "Phasmaviridae", "Phenuiviridae", "Phlegiviridae", "Phycodnaviridae", "Phypoliviridae", "Picornaviridae", "Pistolviridae", "Pithoviridae", "Pneumoviridae", "Polycipiviridae", "Polydnaviriformidae", "Polymycoviridae", "Polyomaviridae", "Pospiviroidae", "Potyviridae", "Poxviridae", "Pseudototiviridae", "Pseudoviridae", "Qinviridae", "Quadriviridae", "Quambiviridae", "Redondoviridae", "Retroviridae", "Rhabdoviridae", "Roniviridae", "Ruviroviridae", "Sarthroviridae", "Schizomimiviridae", "Secoviridae", "Sedoreoviridae", "Sinhaliviridae", "Solemoviridae", "Solinviviridae", "Spiciviridae", "Spinareoviridae", "Splipalmiviridae", "Sputniviroviridae", "Sunviridae", "Tobaniviridae", "Togaviridae", "Tolecusatellitidae", "Tombusviridae", "Tomosaviridae", "Tonesaviridae", "Tosoviridae", "Tospoviridae", "Trimbiviridae", "Tulasviridae", "Tymoviridae", "Unambiviridae", "Vilyaviridae", "Virgaviridae", "Wupedeviridae", "Xinmoviridae", "Yadokariviridae", "Yadonushiviridae", "Yaraviridae", "Yueviridae"]
ICTV_Euk_virus_classes = ['Alsuviricetes', 'Amabiliviricetes', 'Chrymotiviricetes', 'Chunqiuviricetes','Ellioviricetes', 'Flasuviricetes', 'Herviviricetes', 'Howeltoviricetes', 'Insthoviricetes', 'Laserviricetes', 'Magsaviricetes', 'Maveriviricetes', 'Megaviricetes', 'Miaviricetes', 'Milneviricetes', 'Monjiviricetes', 'Mouviricetes', 'Naldaviricetes', 'Papovaviricetes', 'Pisoniviricetes', 'Pokkesviricetes', 'Polintoviricetes', 'Quintoviricetes', 'Repensiviricetes', 'Resentoviricetes', 'Stelpaviricetes', 'Tolucaviricetes', 'Yunchangviricetes']
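# Make sure these classes are never excluded at the class level
# (none of them is in the list above, so this is a no-op safeguard).
ICTV_Euk_virus_classes = set(ICTV_Euk_virus_classes).difference(["Tectiliviricetes","Arfiviricetes","Revtraviricetes"])

print(contig_stats.height)
# nulls_equal=True makes is_in return False (not null) for unclassified ranks,
# so contigs without a family/order/class assignment are kept after negation.
contig_stats = contig_stats.filter(~pl.col("family").is_in(set(ICTV_Euk_virus_family),nulls_equal=True))
contig_stats = contig_stats.filter(~pl.col("order").is_in(set(ICTV_Euk_virus_order),nulls_equal=True))
contig_stats = contig_stats.filter(~pl.col("class").is_in(set(ICTV_Euk_virus_classes),nulls_equal=True))

print(contig_stats.height)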

5457198
5115930
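
The rank extraction and the exclusion filter can be hard to follow in expression form, so here is a plain-Python sketch of the same logic on two lineage strings. The lineage strings are illustrative examples written in the `r__...;k__...` style parsed above (the exact IMG/VR formatting is an assumption), the exclusion sets are tiny subsets of the lists in the cell above, and the helper names `parse_lineage` and `is_kept` are hypothetical.

```python
import re
from typing import Dict, Optional, Set

RANK_PREFIXES = {
    "realm": "r", "kingdom": "k", "phylum": "p", "class": "c",
    "order": "o", "family": "f", "genus": "g", "species": "s",
}


def parse_lineage(lineage: str) -> Dict[str, Optional[str]]:
    """Map each rank to its name, or None when the rank is absent."""
    ranks = {}
    for rank, prefix in RANK_PREFIXES.items():
        match = re.search(rf"{prefix}__([^;]*)", lineage)
        ranks[rank] = match.group(1) if match else None
    return ranks


def is_kept(ranks: Dict[str, Optional[str]], families: Set[str],
            orders: Set[str], classes: Set[str]) -> bool:
    """Keep a contig unless one of its assigned ranks is excluded.

    An unassigned rank (None) is never a member of an exclusion set,
    so unclassified contigs are kept.
    """
    return (
        ranks["family"] not in families
        and ranks["order"] not in orders
        and ranks["class"] not in classes
    )


# Small subsets of the exclusion lists, for illustration only.
excluded_families = {"Coronaviridae", "Poxviridae"}
excluded_orders = {"Nidovirales", "Herpesvirales"}
excluded_classes = {"Pisoniviricetes", "Megaviricetes"}

phage = parse_lineage(
    "r__Duplodnaviria;k__Heunggongvirae;p__Uroviricota;c__Caudoviricetes;;;;"
)
euk_virus = parse_lineage(
    "r__Riboviria;k__Orthornavirae;p__Pisuviricota;c__Pisoniviricetes;"
    "o__Nidovirales;f__Coronaviridae;;"
)
phage_kept = is_kept(phage, excluded_families, excluded_orders, excluded_classes)
euk_kept = is_kept(euk_virus, excluded_families, excluded_orders, excluded_classes)
print(phage["class"], phage["order"], phage_kept)
print(euk_virus["family"], euk_kept)
```

The point of the sketch is the handling of missing ranks: most contigs here have no order or family assignment, and they must survive the filter rather than be dropped along with the excluded taxa.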


In [91]:
contig_stats = contig_stats.select(["seqid","realm","kingdom","order","class","family","genus","species","length","gc","hq"])

for i in ["realm","kingdom","order","class","family","genus","species","hq"]:
    print(f"{i}: {contig_stats[i].value_counts(sort=True)}")
for i in ["length","gc"]:
    print(f"{i}: {contig_stats[i].describe()}")

print(f"new schema: {contig_stats.schema}")
contig_stats.sample(3)

realm: shape: (5, 2)
┌───────────────┬─────────┐
│ realm         ┆ count   │
│ ---           ┆ ---     │
│ str           ┆ u64     │
╞═══════════════╪═════════╡
│ Duplodnaviria ┆ 4986008 │
│ Riboviria     ┆ 93153   │
│ Monodnaviria  ┆ 35852   │
│ Varidnaviria  ┆ 672     │
│ Adnaviria     ┆ 245     │
└───────────────┴─────────┘
kingdom: shape: (8, 2)
┌────────────────┬─────────┐
│ kingdom        ┆ count   │
│ ---            ┆ ---     │
│ str            ┆ u64     │
╞════════════════╪═════════╡
│ Heunggongvirae ┆ 4986008 │
│ Orthornavirae  ┆ 93153   │
│ Sangervirae    ┆ 19063   │
│ Loebvirae      ┆ 14198   │
│ Shotokuvirae   ┆ 2564    │
│ Bamfordvirae   ┆ 672     │
│ Zilligvirae    ┆ 245     │
│ Trapavirae     ┆ 27      │
└────────────────┴─────────┘
order: shape: (18, 2)
┌───────────────────┬─────────┐
│ order             ┆ count   │
│ ---               ┆ ---     │
│ str               ┆ u64     │
╞═══════════════════╪═════════╡
│ null              ┆ 4923798 │
│ Crassvirales      ┆ 80669 

seqid,realm,kingdom,order,class,family,genus,species,length,gc,hq
str,str,str,str,str,str,str,str,i64,f64,bool
"""IMGVR_UViG_3300017956_002023|3300017956|Ga0181580_10017582""","""Duplodnaviria""","""Heunggongvirae""",,"""Caudoviricetes""",,,,5634,42.76,False
"""IMGVR_UViG_3300027719_000271|3300027719|Ga0209467_1000224|56315-89830""","""Duplodnaviria""","""Heunggongvirae""",,"""Caudoviricetes""",,,,33516,54.51,False
"""IMGVR_UViG_3300047388_000595|3300047388|Ga0497965_0013996""","""Duplodnaviria""","""Heunggongvirae""",,"""Caudoviricetes""",,,,5592,57.92,False


## Removing very short contigs

In [92]:
print(contig_stats.height)
contig_stats = contig_stats.filter(pl.col("length") > 1000)
print(contig_stats.height)

5115930
5115894


## Export (stats and FASTA)

In [93]:
contig_stats.write_parquet("./contigs/filtered_contig_stats.parquet")
contig_stats.select(pl.col("seqid")).write_csv("./contigs/filtered_contig.lst",include_header=False)

!paraseq_filt --headers ./contigs/filtered_contig.lst --input contigs/IMGVR4_SEQUENCES.fna --output contigs/filtered_contigs.fasta --threads 10

Using 10 threads
Loaded 5115894 headers
Processed 100000 records
[... one progress line per 100000 records, shown up to 3700000; remaining output truncated ...]
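
The export step above writes the surviving `seqid`s to a list and then subsets the FASTA with `paraseq_filt`. As a rough, single-threaded illustration of what such a header-based filter does, here is a stdlib-only sketch. It is not `paraseq_filt`'s implementation; matching on the full header line is an assumption, and `filter_fasta` plus the demo records are made up for the example.

```python
import io
from typing import Set, TextIO


def filter_fasta(fasta: TextIO, wanted: Set[str], out: TextIO) -> int:
    """Copy records whose full header is in `wanted`; return how many were written."""
    keep = False
    written = 0
    for line in fasta:
        if line.startswith(">"):
            # Decide once per record, then copy its sequence lines as-is.
            keep = line[1:].rstrip("\r\n") in wanted
            written += keep
        if keep:
            out.write(line)
    return written


# Demo with hypothetical headers shaped like the pipe-delimited seqids.
wanted_ids = {"a|1|x", "c|3|z"}
source = io.StringIO(">a|1|x\nACGT\nAC\n>b|2|y\nTTTT\n>c|3|z\nGG\n")
filtered = io.StringIO()
n_written = filter_fasta(source, wanted_ids, filtered)
print(n_written)
print(filtered.getvalue(), end="")
```

Streaming line by line keeps memory use bounded by the header set rather than by the FASTA itself.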

Finally, we'll index the contigs file for fast random access later.
This may take a while, but it is only done once and saves time later.

In [94]:
%%bash
pyfastx index contigs/filtered_contigs.fasta
ls -lsh ../imgvr4_data/contigs/filtered_contigs.fasta*

 74G -rw-rw-r-- 1 uneri grp-org-sc-metagen  74G Jan 22 16:22 ../imgvr4_data/contigs/filtered_contigs.fasta
767M -rw-r--r-- 1 uneri grp-org-sc-metagen 767M Jan 22 16:15 ../imgvr4_data/contigs/filtered_contigs.fasta.fxi
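
To illustrate why an index makes random access fast, here is a toy byte-offset index in plain Python: one pass records where each record starts, and later lookups seek straight to that position instead of rescanning the file. This is only a conceptual sketch; it does not reproduce the `.fxi` format or pyfastx's API, and the function names and demo file are hypothetical.

```python
import os
import tempfile
from typing import Dict


def build_offset_index(path: str) -> Dict[str, int]:
    """Scan the FASTA once and record the byte offset of every header line."""
    index = {}
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                index[line[1:].rstrip(b"\r\n").decode()] = offset
            offset += len(line)
    return index


def fetch_sequence(path: str, index: Dict[str, int], name: str) -> str:
    """Seek to a record's offset and read its sequence lines only."""
    chunks = []
    with open(path, "rb") as handle:
        handle.seek(index[name])
        handle.readline()  # skip the header line itself
        for line in handle:
            if line.startswith(b">"):
                break
            chunks.append(line.strip())
    return b"".join(chunks).decode()


# Demo on a small temporary file (hypothetical records).
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">c1\nACGT\nAC\n>c2\nGGCC\nTT\n>c3\nA\n")
    demo_path = tmp.name

demo_index = build_offset_index(demo_path)
demo_seq = fetch_sequence(demo_path, demo_index, "c2")
os.remove(demo_path)
print(demo_index)
print(demo_seq)
```

The one-off scan is the slow part, which is why the index is built once here and reused in the later notebooks.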


(Next steps are in the spacer inspection notebook)