### Storage details for spruce dataset

This is a summary of the spruce notebook, using the output of the ICF and Zarr "inspect" commands stored in the corresponding CSV files.

#### VCFs

The VCF data for the spruce dataset is split across 165 chunks, using a total of 7.33 TiB of storage. 

#### Intermediate columnar format

We converted the full VCF to ICF using vcf2zarr's explode command.
The job was run on one 128-core node, with a run time of 13h45min.

The storage used by each field is shown below. Total size is 6.77 TiB.

In [2]:
import os
import pandas as pd
import humanfriendly
import numpy as np
from pathlib import Path

WORKDIR=Path(os.environ["WORKDIR"])

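def parse_size(values):
    # Convert human-readable sizes (e.g. "6.02 TiB") to integer byte counts.
    # Use an explicit 64-bit dtype so multi-TiB values cannot overflow.
    size = np.zeros(values.shape, dtype=np.int64)
    for j, val in enumerate(values):
        size[j] = humanfriendly.parse_size(val)
    return size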

fn = WORKDIR / "spruce_example" / "spruce_icf_inspect.csv"
df_icf = pd.read_csv(fn)
df_icf["compressed_bytes"] = parse_size(df_icf.compressed.values)
df_icf.sort_values("compressed_bytes", ascending=False)

Unnamed: 0,name,type,chunks,size,compressed,max_n,min_val,max_val,compressed_bytes
20,FORMAT/PL,Integer,836866,51.06 TiB,6.02 TiB,10,0.0,400.0,6619059999211
21,FORMAT/GT,Integer,363571,22.16 TiB,590.94 GiB,3,-1.0,3.0,634516993474
24,INFO/DP4,Integer,8543,502.27 GiB,27.36 GiB,4,0.0,320000.0,29377576304
12,INFO/RPBZ,Float,6753,390.59 GiB,16.19 GiB,1,-16000.0,18000.0,17383880130
16,INFO/SCBZ,Float,6753,390.59 GiB,15.85 GiB,1,-14000.0,14000.0,17018807910
13,INFO/MQBZ,Float,6753,390.59 GiB,15.61 GiB,1,-4300.0,22000.0,16761109872
14,INFO/BQBZ,Float,6753,390.56 GiB,15.57 GiB,1,-8500.0,4200.0,16718160199
18,INFO/SGB,Float,6753,390.65 GiB,14.55 GiB,1,-280000.0,98000.0,15622943539
11,INFO/VDB,Float,6751,390.47 GiB,14.5 GiB,1,0.0,1.0,15569256448
15,INFO/MQSBZ,Float,6751,390.29 GiB,13.72 GiB,1,-44000.0,30000.0,14731737825


In [3]:
humanfriendly.format_size(df_icf.compressed_bytes.sum(), binary=True)

'6.77 TiB'
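
The `compressed_bytes` values come from `humanfriendly.parse_size`, which treats IEC suffixes such as GiB and TiB as powers of 1024. The arithmetic can be sketched without the dependency as below. `parse_binary_size` and `BINARY_UNITS` are hypothetical names, and the sketch handles only the binary suffixes that appear in the inspect output, not the full range of inputs the library accepts.

```python
# Hypothetical helper: a stdlib-only sketch of binary (IEC) size parsing.
BINARY_UNITS = {"B": 0, "KiB": 1, "MiB": 2, "GiB": 3, "TiB": 4, "PiB": 5}


def parse_binary_size(text):
    """Convert a string such as '6.02 TiB' to a whole number of bytes."""
    number, unit = text.split()
    # Each step up the IEC scale multiplies by 1024; truncate to an integer.
    return int(float(number) * 1024 ** BINARY_UNITS[unit])


print(parse_binary_size("6.02 TiB"))    # FORMAT/PL row
print(parse_binary_size("590.94 GiB"))  # FORMAT/GT row
```

The two printed values, 6619059999211 and 634516993474, agree with the `compressed_bytes` column for the first two rows of the table above.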

How many chunk files does the ICF contain?

In [4]:
df_icf.chunks.sum()

1342569

#### VCF Zarr

We converted the ICF data to Zarr using vcf2zarr's encode command. The job was run on a 128-core node, with a run time of 18h51min.

The resulting Zarr store uses a total of 6.6 TiB of storage across 17,227,933 files and directories.
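
The notebook does not record how the file and directory count was obtained. One way to produce a comparable figure is to walk the store with `os.walk`. The sketch below is an assumption about the method, not the command actually used, and `count_entries` is a hypothetical helper. It counts the root directory itself as one directory.

```python
import os


def count_entries(root):
    """Return (n_files, n_dirs) under root, counting root as a directory."""
    n_files = 0
    n_dirs = 1  # the root directory itself
    for _dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_files, n_dirs
```

Calling `sum(count_entries(path))` on the Zarr directory gives a combined total of files and directories. On a store with millions of chunk files the walk itself can take a long time.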

This makes the Zarr store about 1.1 times smaller than the VCF (7.33 TiB versus 6.6 TiB).

In [11]:
7.33 / 6.6

1.1106060606060606

In [5]:
fn = WORKDIR / "spruce_example" / "spruce_zarr_inspect.csv"
df_zarr = pd.read_csv(fn)
df_zarr["stored_bytes"] = parse_size(df_zarr.stored.values)
df_zarr.sort_values("stored_bytes", ascending=False, inplace=True)
df_zarr

Unnamed: 0,name,dtype,stored,size,ratio,nchunks,chunk_size,avg_chunk_stored,shape,chunk_shape,compressor,filters,stored_bytes
0,/call_PL,int16,6.12 TiB,72.42 TiB,12.0,749036,101.38 MiB,8.57 MiB,"(3745170452, 1063, 10)","(10000, 1000, 10)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,6729011161989
1,/call_genotype,int8,282.45 GiB,7.24 TiB,26.0,749036,10.14 MiB,395.4 KiB,"(3745170452, 1063, 2)","(10000, 1000, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,303278378188
2,/call_genotype_mask,bool,42.15 GiB,7.24 TiB,180.0,749036,10.14 MiB,59.01 KiB,"(3745170452, 1063, 2)","(10000, 1000, 2)","Blosc(cname='zstd', clevel=7, shuffle=BITSHUFF...",,45258217881
3,/variant_DP4,int32,26.24 GiB,55.81 GiB,2.1,374518,156.25 KiB,73.47 KiB,"(3745170452, 4)","(10000, 4)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,28174985461
4,/variant_MQSBZ,float32,13.08 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.63 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,14044543057
5,/variant_RPBZ,float32,13.04 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.51 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,14001593384
6,/variant_MQBZ,float32,12.97 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.32 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13926431457
7,/variant_SCBZ,float32,12.94 GiB,13.95 GiB,1.1,374518,39.06 KiB,36.22 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13894219202
8,/variant_VDB,float32,12.84 GiB,13.95 GiB,1.1,374518,39.06 KiB,35.95 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13786845020
9,/variant_quality,float32,12.84 GiB,13.95 GiB,1.1,374518,39.06 KiB,35.95 KiB,"(3745170452,)","(10000,)","Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFL...",,13786845020


In [6]:
total = df_zarr.stored_bytes.sum()

humanfriendly.format_size(total, binary=True)

'6.6 TiB'

In [7]:
df_zarr.nchunks.sum()

11984581
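
The per-array `nchunks` values in the table are consistent with a regular chunk grid, in which each dimension contributes the ceiling of its length divided by its chunk length. The sketch below reproduces that arithmetic from the `shape` and `chunk_shape` columns. `n_chunks` and `uncompressed_bytes` are hypothetical helper names, not part of vcf2zarr or Zarr.

```python
import math


def n_chunks(shape, chunk_shape):
    """Chunks in a regular grid: product of per-dimension ceilings."""
    # -(-s // c) is integer ceiling division, avoiding float rounding.
    return math.prod(-(-s // c) for s, c in zip(shape, chunk_shape))


def uncompressed_bytes(shape, itemsize):
    """Uncompressed array size: number of elements times bytes per element."""
    return math.prod(shape) * itemsize


call_pl_shape = (3745170452, 1063, 10)
print(n_chunks(call_pl_shape, (10000, 1000, 10)))       # /call_PL
print(n_chunks((3745170452,), (10000,)))                 # 1-D variant fields
print(uncompressed_bytes(call_pl_shape, 2) / 1024 ** 4)  # int16, in TiB
```

For `/call_PL` this gives 374518 × 2 × 1 = 749036 chunks, matching the table, and an uncompressed size of about 72.42 TiB. The second chunk along the sample dimension holds only 63 of the 1063 samples, which is consistent with the reported `chunk_size` being an average that is well below the size of a full chunk.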

How much of the overall storage is consumed by the top 4 fields?

In [8]:
df_zarr.head(4).stored_bytes.sum() / total

0.979406720944372

In [9]:
df_display_table = pd.DataFrame({
    "Field":df_zarr.name,
    "type": df_zarr.dtype,
    "storage":df_zarr.stored,
    "compress": df_zarr.ratio,
    "percentage": df_zarr.stored_bytes / total})
df_display_table.sort_values("percentage", ascending=False, inplace=True)
df_display_table["percentage"] = df_display_table["percentage"].map('{:.2%}'.format)
df_display_table["compress"] = df_display_table["compress"].map('{:.1f}'.format)
df_display_table

Unnamed: 0,Field,type,storage,compress,percentage
0,/call_PL,int16,6.12 TiB,12.0,92.75%
1,/call_genotype,int8,282.45 GiB,26.0,4.18%
2,/call_genotype_mask,bool,42.15 GiB,180.0,0.62%
3,/variant_DP4,int32,26.24 GiB,2.1,0.39%
4,/variant_MQSBZ,float32,13.08 GiB,1.1,0.19%
5,/variant_RPBZ,float32,13.04 GiB,1.1,0.19%
6,/variant_MQBZ,float32,12.97 GiB,1.1,0.19%
7,/variant_SCBZ,float32,12.94 GiB,1.1,0.19%
8,/variant_VDB,float32,12.84 GiB,1.1,0.19%
9,/variant_quality,float32,12.84 GiB,1.1,0.19%


Output the (rough) table data for the manuscript:


In [10]:
print(df_display_table.to_latex(index=False))

\begin{tabular}{lllll}
\toprule
Field & type & storage & compress & percentage \\
\midrule
/call_PL & int16 & 6.12 TiB & 12.0 & 92.75% \\
/call_genotype & int8 & 282.45 GiB & 26.0 & 4.18% \\
/call_genotype_mask & bool & 42.15 GiB & 180.0 & 0.62% \\
/variant_DP4 & int32 & 26.24 GiB & 2.1 & 0.39% \\
/variant_MQSBZ & float32 & 13.08 GiB & 1.1 & 0.19% \\
/variant_RPBZ & float32 & 13.04 GiB & 1.1 & 0.19% \\
/variant_MQBZ & float32 & 12.97 GiB & 1.1 & 0.19% \\
/variant_SCBZ & float32 & 12.94 GiB & 1.1 & 0.19% \\
/variant_VDB & float32 & 12.84 GiB & 1.1 & 0.19% \\
/variant_quality & float32 & 12.84 GiB & 1.1 & 0.19% \\
/variant_BQBZ & float32 & 12.64 GiB & 1.1 & 0.19% \\
/variant_SGB & float32 & 12.12 GiB & 1.2 & 0.18% \\
/variant_position & int32 & 9.65 GiB & 1.4 & 0.14% \\
/variant_AC & int16 & 7.42 GiB & 2.8 & 0.11% \\
/variant_DP & int32 & 6.71 GiB & 2.1 & 0.10% \\
/variant_allele & object & 5.37 GiB & 21.0 & 0.08% \\
/variant_AN & int16 & 3.05 GiB & 2.3 & 0.05% \\
/call_genotype_phased &