### Storage details for spruce dataset

This is a summary of the spruce notebook, which derives storage statistics from the output of the ICF and Zarr `inspect` commands, saved in the corresponding CSV files.

#### VCFs

The VCF data for the spruce dataset is split across 165 chunks and uses a total of 7.33 TiB of storage.

#### Intermediate columnar format

We converted the full VCF to ICF using vcf2zarr's `explode` command.
The job was run on one 128-core node using 120 threads, with a run time of 18h36min.

The storage used by each field is shown below. Total size is 6.77 TiB.

In [4]:
import os
import pandas as pd
import humanfriendly
import numpy as np
from pathlib import Path

#WORKDIR=Path(os.environ["WORKDIR"])
WORKDIR = Path("../")

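def parse_size(values):
    """Convert human-readable sizes (e.g. "6.02 TiB") to byte counts."""
    # Use an explicit 64-bit dtype: the byte counts exceed the int32 range.
    size = np.zeros(values.shape, dtype=np.int64)
    for j, val in enumerate(values):
        size[j] = humanfriendly.parse_size(val)
    return size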


fn = WORKDIR / "spruce_example" / "spruce_icf_inspect.csv"
df_icf = pd.read_csv(fn)
df_icf["compressed_bytes"] = parse_size(df_icf.compressed.values)
df_icf.sort_values("compressed_bytes", ascending=False)

Unnamed: 0,name,type,chunks,size,compressed,max_n,min_val,max_val,compressed_bytes
21,FORMAT/PL,Integer,836546,51.06 TiB,6.02 TiB,10,0.0,400.0,6619059999211
22,FORMAT/GT,Integer,363245,22.16 TiB,590.94 GiB,3,-1.0,3.0,634516993474
25,INFO/DP4,Integer,8202,502.27 GiB,27.36 GiB,4,0.0,320000.0,29377576304
13,INFO/RPBZ,Float,6410,390.59 GiB,16.19 GiB,1,-16000.0,18000.0,17383880130
17,INFO/SCBZ,Float,6410,390.59 GiB,15.85 GiB,1,-14000.0,14000.0,17018807910
14,INFO/MQBZ,Float,6410,390.59 GiB,15.61 GiB,1,-4300.0,22000.0,16761109872
15,INFO/BQBZ,Float,6410,390.56 GiB,15.57 GiB,1,-8500.0,4200.0,16718160199
19,INFO/SGB,Float,6411,390.65 GiB,14.55 GiB,1,-280000.0,98000.0,15622943539
12,INFO/VDB,Float,6409,390.47 GiB,14.5 GiB,1,0.0,1.0,15569256448
16,INFO/MQSBZ,Float,6408,390.29 GiB,13.72 GiB,1,-44000.0,30000.0,14731737825


In [4]:
humanfriendly.format_size(df_icf.compressed_bytes.sum(), binary=True)

'6.77 TiB'
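
The byte counts above are parsed back from size strings that are rounded to two decimal places, so they are approximate (for a value given in TiB, to within about 0.005 TiB). The following is a minimal standard-library sketch of that conversion; `parse_binary_size` and `_UNITS` are hypothetical names used for illustration and are not the `humanfriendly` implementation. The sketch agrees with the `compressed_bytes` value shown for `FORMAT/PL`.

```python
import re

# Binary (IEC) unit multipliers; "byte"/"bytes" cover sizes below 1 KiB.
_UNITS = {
    "byte": 1, "bytes": 1,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40, "PiB": 2**50,
}


def parse_binary_size(text):
    """Convert a string such as '6.02 TiB' to a whole number of bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse size: {text!r}")
    number, unit = match.groups()
    # Truncate to an integer number of bytes.
    return int(float(number) * _UNITS[unit])


print(parse_binary_size("6.02 TiB"))
```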

How many chunk files are there in total?

In [5]:
df_icf.chunks.sum()

1340804

#### VCF Zarr

We converted the ICF data to Zarr using vcf2zarr's `encode` command. The job was run on a 128-core node using 120 threads, with a run time of 16h30min.

It uses a total of 6.68 TiB of storage over 14980847 files and directories.

This corresponds to a compression ratio of about 1.1 relative to the VCF.

In [36]:
7.33 / 6.68

1.097305389221557
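
The count of files and directories quoted above is not computed in this notebook. One way to obtain such a count, sketched here as an assumption rather than the method actually used, is to walk the directory tree of the Zarr store. `count_entries` is a hypothetical helper, and the path in the usage comment is illustrative only.

```python
import os


def count_entries(root):
    """Return (n_files, n_dirs, total_bytes) for the tree below root.

    The root directory itself is not included in n_dirs.
    """
    n_files = n_dirs = total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return n_files, n_dirs, total_bytes


# Usage (hypothetical path):
# n_files, n_dirs, total_bytes = count_entries("path/to/spruce.zarr")
# print(n_files + n_dirs, total_bytes / 2**40)
```

On a store with millions of chunk files a walk like this can take a long time, so the result is worth caching.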

In [5]:
fn = WORKDIR / "spruce_example" / "spruce_zarr_inspect.csv"
df_zarr = pd.read_csv(fn)
df_zarr["stored_bytes"] = parse_size(df_zarr.stored.values)
df_zarr.sort_values("stored_bytes", ascending=False, inplace=True)
df_zarr

Unnamed: 0,name,dtype,stored,size,ratio,nchunks,chunk_size,avg_chunk_stored,shape,chunk_shape,compressor,filters,stored_bytes
0,/call_PL,int16,6.04 TiB,72.42 TiB,12.0,374518,202.75 MiB,16.92 MiB,"(3745170452, 1063, 10)","(10000, 1063, 10)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,6641050231767
1,/call_genotype,int8,381.28 GiB,7.24 TiB,19.0,374518,20.28 MiB,1.04 MiB,"(3745170452, 1063, 2)","(10000, 1063, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,409396282654
2,/call_genotype_mask,bool,103.84 GiB,7.24 TiB,71.0,374518,20.28 MiB,290.72 KiB,"(3745170452, 1063, 2)","(10000, 1063, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,111497351004
3,/variant_DP4,int32,26.24 GiB,55.81 GiB,2.1,374518,156.25 KiB,73.47 KiB,"(3745170452, 4)","(10000, 4)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,28174985461
4,/variant_MQSBZ,float32,13.08 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.63 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,14044543057
5,/variant_RPBZ,float32,13.04 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.51 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,14001593384
6,/variant_MQBZ,float32,12.97 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.32 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13926431457
7,/variant_SCBZ,float32,12.94 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.22 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13894219202
9,/variant_quality,float32,12.84 GiB,13.95 GiB,1.1,374518,39.06 KiB,35.95 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13786845020
8,/variant_VDB,float32,12.84 GiB,13.95 GiB,1.1,374518,39.06 KiB,35.95 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13786845020
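
The `size`, `chunk_size` and `nchunks` columns in this table follow directly from each array's `shape`, `chunk_shape` and `dtype`. The sketch below reproduces the `/call_PL` row as a check; `array_layout` and `ITEMSIZE` are hypothetical names introduced for illustration, and the chunk size is the nominal one (the final chunk along the variants dimension is only partly filled).

```python
import math

# Bytes per element for the dtypes that appear in the inspect table.
ITEMSIZE = {"bool": 1, "int8": 1, "int16": 2, "int32": 4, "float32": 4}


def array_layout(shape, chunk_shape, dtype):
    """Return (uncompressed bytes, nominal bytes per chunk, number of chunks)."""
    itemsize = ITEMSIZE[dtype]
    size = math.prod(shape) * itemsize
    chunk_size = math.prod(chunk_shape) * itemsize
    # Integer ceiling division along each dimension.
    nchunks = math.prod(-(-s // c) for s, c in zip(shape, chunk_shape))
    return size, chunk_size, nchunks


size, chunk_size, nchunks = array_layout(
    (3745170452, 1063, 10), (10000, 1063, 10), "int16"
)
print(f"{size / 2**40:.2f} TiB, {chunk_size / 2**20:.2f} MiB, {nchunks} chunks")
```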


In [15]:
total = df_zarr.stored_bytes.sum()

humanfriendly.format_size(total, binary=True)

'6.68 TiB'

In [16]:
df_zarr.nchunks.sum()

10861038

How much of the overall storage is consumed by the top 4 fields?

In [17]:
df_zarr.head(4).stored_bytes.sum() / total

0.9796690169358103

In [18]:
df_display_table = pd.DataFrame({
    "Field":df_zarr.name,
    "type": df_zarr.dtype,
    "storage":df_zarr.stored,
    "compress": df_zarr.ratio,
    "percentage": df_zarr.stored_bytes / total})
df_display_table.sort_values("percentage", ascending=False, inplace=True)
df_display_table["percentage"] = df_display_table["percentage"].map('{:.2%}'.format)
df_display_table["compress"] = df_display_table["compress"].map('{:.1f}'.format)
df_display_table

Unnamed: 0,Field,type,storage,compress,percentage
0,/call_PL,int16,6.04 TiB,12.0,90.49%
1,/call_genotype,int8,381.28 GiB,19.0,5.58%
2,/call_genotype_mask,bool,103.84 GiB,71.0,1.52%
3,/variant_DP4,int32,26.24 GiB,2.1,0.38%
4,/variant_MQSBZ,float32,13.08 GiB,1.1,0.19%
5,/variant_RPBZ,float32,13.04 GiB,1.1,0.19%
6,/variant_MQBZ,float32,12.97 GiB,1.1,0.19%
7,/variant_SCBZ,float32,12.94 GiB,1.1,0.19%
8,/variant_VDB,float32,12.84 GiB,1.1,0.19%
9,/variant_quality,float32,12.84 GiB,1.1,0.19%


Output the (rough) table data for the manuscript:


In [19]:
print(df_display_table.to_latex(index=False))

\begin{tabular}{lllll}
\toprule
Field & type & storage & compress & percentage \\
\midrule
/call_PL & int16 & 6.04 TiB & 12.0 & 90.49% \\
/call_genotype & int8 & 381.28 GiB & 19.0 & 5.58% \\
/call_genotype_mask & bool & 103.84 GiB & 71.0 & 1.52% \\
/variant_DP4 & int32 & 26.24 GiB & 2.1 & 0.38% \\
/variant_MQSBZ & float32 & 13.08 GiB & 1.1 & 0.19% \\
/variant_RPBZ & float32 & 13.04 GiB & 1.1 & 0.19% \\
/variant_MQBZ & float32 & 12.97 GiB & 1.1 & 0.19% \\
/variant_SCBZ & float32 & 12.94 GiB & 1.1 & 0.19% \\
/variant_VDB & float32 & 12.84 GiB & 1.1 & 0.19% \\
/variant_quality & float32 & 12.84 GiB & 1.1 & 0.19% \\
/variant_BQBZ & float32 & 12.64 GiB & 1.1 & 0.18% \\
/variant_SGB & float32 & 12.12 GiB & 1.2 & 0.18% \\
/variant_position & int32 & 9.65 GiB & 1.4 & 0.14% \\
/variant_AC & int16 & 7.42 GiB & 2.8 & 0.11% \\
/variant_DP & int32 & 6.71 GiB & 2.1 & 0.10% \\
/variant_allele & object & 5.37 GiB & 21.0 & 0.08% \\
/variant_AN & int16 & 3.05 GiB & 2.3 & 0.04% \\
/call_genotype_phased &

#### VCF Zarr with schema and local alleles


We used the ICF data to generate a schema with vcf2zarr's `mkschema` command and the `--local-alleles` option, which reduces storage requirements. We then converted the ICF data to Zarr using vcf2zarr's `encode` command, passing the schema as input (option `--schema`). The job was run on a 128-core node using 120 threads, with a run time of 14h10min.

It uses a total of 2.3 TiB of storage over 16104404 files and directories.

This corresponds to a compression ratio of about 3.2 relative to the VCF.

In [37]:
7.33 / 2.3

3.186956521739131

In [6]:
fn = WORKDIR / "spruce_example" / "spruce_zarr_json_inspect.csv"
df_zarr = pd.read_csv(fn)
df_zarr["stored_bytes"] = parse_size(df_zarr.stored.values)
df_zarr.sort_values("stored_bytes", ascending=False, inplace=True)
df_zarr

Unnamed: 0,name,dtype,stored,size,ratio,nchunks,chunk_size,avg_chunk_stored,shape,chunk_shape,compressor,filters,stored_bytes
0,/call_LPL,int16,1.26 TiB,21.72 TiB,17.0,374518,60.83 MiB,3.53 MiB,"(3745170452, 1063, 3)","(10000, 1063, 3)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,1385384650997
1,/call_LA,int8,415.39 GiB,7.24 TiB,18.0,374518,20.28 MiB,1.14 MiB,"(3745170452, 1063, 2)","(10000, 1063, 2)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,446021616271
2,/call_genotype,int8,381.28 GiB,7.24 TiB,19.0,374518,20.28 MiB,1.04 MiB,"(3745170452, 1063, 2)","(10000, 1063, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,409396282654
3,/call_genotype_mask,bool,103.84 GiB,7.24 TiB,71.0,374518,20.28 MiB,290.72 KiB,"(3745170452, 1063, 2)","(10000, 1063, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,111497351004
4,/variant_DP4,int32,26.24 GiB,55.81 GiB,2.1,374518,156.25 KiB,73.47 KiB,"(3745170452, 4)","(10000, 4)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,28174985461
5,/variant_MQSBZ,float32,13.08 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.63 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,14044543057
6,/variant_RPBZ,float32,13.04 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.51 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,14001593384
7,/variant_MQBZ,float32,12.97 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.32 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13926431457
8,/variant_SCBZ,float32,12.94 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.22 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13894219202
9,/variant_VDB,float32,12.84 GiB,13.95 GiB,1.1,374518,39.06 KiB,35.95 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13786845020
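
The last dimension of `/call_LPL` is 3, compared with 10 for `/call_PL` in the previous encoding. A likely explanation, stated here as an assumption, is the number of possible unordered genotypes: for diploid samples there are n(n+1)/2 genotypes over n alleles, giving 10 for the 4 alleles implied by the maximum `FORMAT/GT` value of 3, and 3 for the 2 local alleles implied by the last dimension of `/call_LA`. The sketch below checks this arithmetic; `num_genotypes` is a hypothetical helper name.

```python
from math import comb


def num_genotypes(n_alleles, ploidy=2):
    """Number of unordered genotypes (multisets of size ploidy over the alleles)."""
    return comb(n_alleles + ploidy - 1, ploidy)


print(num_genotypes(4))  # genotype likelihoods per call with 4 alleles
print(num_genotypes(2))  # genotype likelihoods per call with 2 local alleles
```

Under this assumption the uncompressed likelihood array shrinks by a factor of 10/3, consistent with the `size` column falling from 72.42 TiB to 21.72 TiB.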


In [7]:
total = df_zarr.stored_bytes.sum()

humanfriendly.format_size(total, binary=True)

'2.3 TiB'
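
The drop from 6.68 TiB to 2.3 TiB can be accounted for using the `stored_bytes` values from the two inspect tables. The sketch below assumes that the only change between the encodings is that `/call_PL` is replaced by `/call_LPL` plus `/call_LA`, as the otherwise identical rows in the two tables suggest. The byte counts are copied from the tables and are themselves parsed from rounded size strings.

```python
TIB = 2**40

# stored_bytes values copied from the two inspect tables.
call_PL = 6641050231767   # full-alleles encoding
call_LPL = 1385384650997  # local-alleles encoding
call_LA = 446021616271    # local-alleles encoding

saved = call_PL - (call_LPL + call_LA)
print(f"saved: {saved / TIB:.2f} TiB")
```

The result is within rounding error of the difference between the two reported totals.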

In [8]:
df_zarr.nchunks.sum()

11235556

How much of the overall storage is consumed by the top 4 fields?

In [33]:
df_zarr.head(4).stored_bytes.sum() / total

0.9298764586486353

In [10]:

df_display_table = pd.DataFrame({
    "Field":df_zarr.name,
    "type": df_zarr.dtype,
    "storage":df_zarr.stored,
    "compress": df_zarr.ratio,
    "percentage": df_zarr.stored_bytes / total})
threshold = 0.01 / 100  # 0.01%
df_display_table = df_display_table[df_display_table.percentage >= threshold].copy()
df_display_table.sort_values("percentage", ascending=False, inplace=True)
df_display_table["percentage"] = df_display_table["percentage"].map('{:.2%}'.format)
df_display_table["compress"] = df_display_table["compress"].map('{:.1f}'.format)
df_display_table

Unnamed: 0,Field,type,storage,compress,percentage
0,/call_LPL,int16,1.26 TiB,17.0,54.76%
1,/call_LA,int8,415.39 GiB,18.0,17.63%
2,/call_genotype,int8,381.28 GiB,19.0,16.18%
3,/call_genotype_mask,bool,103.84 GiB,71.0,4.41%
4,/variant_DP4,int32,26.24 GiB,2.1,1.11%
5,/variant_MQSBZ,float32,13.08 GiB,1.1,0.56%
6,/variant_RPBZ,float32,13.04 GiB,1.1,0.55%
7,/variant_MQBZ,float32,12.97 GiB,1.1,0.55%
8,/variant_SCBZ,float32,12.94 GiB,1.1,0.55%
10,/variant_quality,float32,12.84 GiB,1.1,0.55%


Output the (rough) table data for the manuscript:


In [11]:
print(df_display_table.to_latex(index=False, escape=True).replace("object", "str"))

\begin{tabular}{lllll}
\toprule
Field & type & storage & compress & percentage \\
\midrule
/call\_LPL & int16 & 1.26 TiB & 17.0 & 54.76\% \\
/call\_LA & int8 & 415.39 GiB & 18.0 & 17.63\% \\
/call\_genotype & int8 & 381.28 GiB & 19.0 & 16.18\% \\
/call\_genotype\_mask & bool & 103.84 GiB & 71.0 & 4.41\% \\
/variant\_DP4 & int32 & 26.24 GiB & 2.1 & 1.11\% \\
/variant\_MQSBZ & float32 & 13.08 GiB & 1.1 & 0.56\% \\
/variant\_RPBZ & float32 & 13.04 GiB & 1.1 & 0.55\% \\
/variant\_MQBZ & float32 & 12.97 GiB & 1.1 & 0.55\% \\
/variant\_SCBZ & float32 & 12.94 GiB & 1.1 & 0.55\% \\
/variant\_quality & float32 & 12.84 GiB & 1.1 & 0.55\% \\
/variant\_VDB & float32 & 12.84 GiB & 1.1 & 0.55\% \\
/variant\_BQBZ & float32 & 12.64 GiB & 1.1 & 0.54\% \\
/variant\_SGB & float32 & 12.12 GiB & 1.2 & 0.51\% \\
/variant\_position & int32 & 9.65 GiB & 1.4 & 0.41\% \\
/variant\_AC & int16 & 7.42 GiB & 2.8 & 0.31\% \\
/variant\_DP & int32 & 6.71 GiB & 2.1 & 0.28\% \\
/variant\_allele & str & 5.37 GiB & 21.0 &