# SAIGE Prototype

This notebook demonstrates a prototype of [SAIGE](https://saigegit.github.io/SAIGE-doc/) that supports reading from VCF Zarr stores.

The code for the prototype is available at https://github.com/Will-Tyler/SAIGE

To see the differences between the prototype and the upstream code, view the full diff [here](https://github.com/saigegit/SAIGE/compare/main...Will-Tyler:SAIGE:main). The main non-boilerplate change is the new [VCZ.cpp](https://github.com/Will-Tyler/SAIGE/blob/2f0ad0d5cf612136487c22ea60bf53271b5bfe0a/src/VCZ.cpp) file, which implements SAIGE's variant file access interface for VCF Zarr stores.

Run the `setup.sh` script before using this notebook to install SAIGE and create a Conda environment for SAIGE.

## IMPORTANT!

TensorStore does not currently support string data, so the prototype could not exactly replicate the string columns (`MarkerID`, `Allele1`, `Allele2`) written by the other backends. All numerical output was reproduced exactly.

In [3]:
!pip install tstrait



### Simulate phenotypes

In [4]:
import tskit
import tstrait

ts = tskit.load('../scaling/data/chr21_10_5.ts')
model = tstrait.trait_model(distribution='normal', mean=0, var=1)
sim_result = tstrait.sim_phenotype(ts=ts, model=model, h2=0.3)

In [5]:
sim_result.trait

Unnamed: 0,position,site_id,effect_size,causal_allele,allele_freq,trait_id
0,24145141,691929,-0.315704,C,1.5e-05,0


In [6]:
sim_result.phenotype

Unnamed: 0,trait_id,individual_id,genetic_value,environmental_noise,phenotype
0,0,0,0.0,-0.000566,-0.000566
1,0,1,0.0,-0.003547,-0.003547
2,0,2,0.0,0.001797,0.001797
3,0,3,0.0,-0.003055,-0.003055
4,0,4,0.0,0.001222,0.001222
...,...,...,...,...,...
286713,0,286713,0.0,0.000440,0.000440
286714,0,286714,0.0,-0.000471,-0.000471
286715,0,286715,0.0,0.000386,0.000386
286716,0,286716,0.0,-0.004003,-0.004003


In [7]:
phenotype = sim_result.phenotype
phenotype['sample_id'] = 'tsk_' + phenotype['individual_id'].astype(str)
phenotype

Unnamed: 0,trait_id,individual_id,genetic_value,environmental_noise,phenotype,sample_id
0,0,0,0.0,-0.000566,-0.000566,tsk_0
1,0,1,0.0,-0.003547,-0.003547,tsk_1
2,0,2,0.0,0.001797,0.001797,tsk_2
3,0,3,0.0,-0.003055,-0.003055,tsk_3
4,0,4,0.0,0.001222,0.001222,tsk_4
...,...,...,...,...,...,...
286713,0,286713,0.0,0.000440,0.000440,tsk_286713
286714,0,286714,0.0,-0.000471,-0.000471,tsk_286714
286715,0,286715,0.0,0.000386,0.000386,tsk_286715
286716,0,286716,0.0,-0.004003,-0.004003,tsk_286716


In [8]:
# Save phenotype data to disk in the format that SAIGE expects.
phenotype[['sample_id', 'phenotype']].to_csv('chr21_10_5.phenotypes.txt', sep='\t', index=False)

### SAIGE workflow - initial setup steps

In [9]:
!plink2 --vcf ../scaling/data/chr21_10_4.vcf.gz --make-bed --out ./chr21_10_4 --max-alleles 2

PLINK v2.00a3 SSE4.2 (18 Feb 2022)             www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ./chr21_10_4.log.
Options in effect:
  --make-bed
  --max-alleles 2
  --out ./chr21_10_4
  --vcf ../scaling/data/chr21_10_4.vcf.gz

Start time: Wed Jan 22 12:20:57 2025
31922 MiB RAM detected; reserving 15961 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 863998 variants scanned.
--vcf: ./chr21_10_4-temporary.pgen + ./chr21_10_4-temporary.pvar.zst +
./chr21_10_4-temporary.psam written.
10000 samples (0 females, 0 males, 10000 ambiguous; 10000 founders) loaded from
./chr21_10_4-temporary.psam.
856315 out of 863998 variants loaded from ./chr21_10_4-temporary.pvar.zst.
Note: No phenotype data present.
856315 variants remaining after main filters.
Writing ./chr21_10_4.fam ... done.
Writing ./chr21_10_4.bim ... done.
Writing ./chr21_10_4.bed ... done.
End time: Wed Jan 22 12:21:26 2025
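The phenotypes above are simulated from `chr21_10_5.ts`, while the genotypes come from `chr21_10_4.vcf.gz`, so it may be worth checking how many sample IDs the two files share before fitting the null model. The following is a minimal sketch, not part of the original notebook: the helper names are hypothetical, the toy inputs are invented for illustration, and it assumes the standard PLINK `.fam` layout in which the second whitespace-separated column is the individual ID.

```python
def fam_sample_ids(fam_text):
    """Return the individual IDs (second column) from PLINK .fam text."""
    return [line.split()[1] for line in fam_text.splitlines() if line.strip()]


def phenotype_sample_ids(pheno_text, id_column="sample_id"):
    """Return the IDs from a tab-separated phenotype file with a header row."""
    lines = [line for line in pheno_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    index = header.index(id_column)
    return [line.split("\t")[index] for line in lines[1:]]


def overlap_report(fam_ids, pheno_ids):
    """Count samples shared by both files and samples unique to each."""
    fam, pheno = set(fam_ids), set(pheno_ids)
    return {
        "shared": len(fam & pheno),
        "only_genotypes": len(fam - pheno),
        "only_phenotypes": len(pheno - fam),
    }


# Toy inputs (illustrative only); in the notebook these would be the contents
# of ./chr21_10_4.fam and ./chr21_10_5.phenotypes.txt.
fam_text = "tsk_0 tsk_0 0 0 0 -9\ntsk_1 tsk_1 0 0 0 -9\n"
pheno_text = (
    "sample_id\tphenotype\n"
    "tsk_0\t-0.000566\n"
    "tsk_1\t-0.003547\n"
    "tsk_2\t0.001797\n"
)

report = overlap_report(fam_sample_ids(fam_text), phenotype_sample_ids(pheno_text))
print(report)
```

A non-zero `only_genotypes` count would indicate genotyped samples with no phenotype, which is usually the case to look into first.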


In [12]:
%%bash
#export PATH="/opt/miniconda3/bin:$PATH"
conda run -n saige Rscript SAIGE/extdata/step1_fitNULLGLMM.R     \
        --plinkFile=./chr21_10_4  \
        --useSparseGRMtoFitNULL=FALSE    \
        --phenoFile=./chr21_10_5.phenotypes.txt \
        --phenoCol=phenotype \
        --sampleIDColinphenoFile=sample_id \
        --invNormalize=TRUE     \
        --traitType=quantitative        \
        --outputPrefix=./chr21_10_4.model \
        --nThreads=24   \
        --IsOverwriteVarianceRatioFile=TRUE

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Linux Mint 21.3

Matrix products: default
BLAS/LAPACK: /home/benj/miniconda3/envs/saige/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] optparse_1.7.3 SAIGE_1.3.6   

loaded via a namespace (and not attached):
[1] compiler_4.3.1     Matrix_1.6-1.1     Rcpp_1.0.11        getopt_1.20.4     
[5] grid_4.3.1         data.table_1.14.8  RcppParallel_5.1.7 lattice_0.21

206 th marker in geno  1 
MAC:  1756 
G0 1 0 0 0 1 0 1 0 1 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
169 th marker in geno  1 
MAC:  106 
G0 0 0 0 0 0 0 0 0 0 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
152 th marker in geno  1 
MAC:  232 
G0 0 0 0 0 0 0 0 0 0 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
134 th marker in geno  1 
MAC:  8014 
G0 1 1 1 0 1 1 1 0 2 1 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
174 th marker in geno  1 
MAC:  8712 
G0 1 1 0 0 2 1 1 2 2 1 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
92 th marker in geno  1 
MAC:  95 
G0 0 0 0 0 0 0 0 0 0 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
42 th marker in geno  1 
MAC:  48 
G0 0 0 0 0 0 0 0 0 0 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
228 th marker in geno  1 
MAC:  80 
G0 0 0 0 0 0 0 0 0 0 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
238 th marker in geno  1 
MAC:  24 
G0 0 0 0 0 0 0 0 0 0 0 
CHR  1 
iter from getPCG1ofSigmaAndVector 1
6 th marker in geno  1 
MAC:  843 
G0 0 0 0 0 1 1 0 0 1 0 

Loading required package: optparse






# Benchmark

Here we run and time [step 2](https://saigegit.github.io/SAIGE-doc/docs/single_step2.html) of the single-variant association test on the same data stored as BCF, VCF, Savvy, and Zarr.
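The four step 2 invocations in this section share every option except the input flags and the output file name. As a sketch of that structure (not code from the original notebook; `step2_command`, `INPUT_FLAGS`, and `COMMON_FLAGS` are hypothetical names), the command lines could be assembled in Python as follows. The flag values are copied from the cells in this section, and `--vczFile` is the option added by the prototype.

```python
# Options shared by all four benchmark runs.
COMMON_FLAGS = [
    "--vcfField=GT",
    "--chrom=1",
    "--minMAF=0",
    "--minMAC=20",
    "--GMMATmodelFile=./chr21_10_4.model.rda",
    "--varianceRatioFile=./chr21_10_4.model.varianceRatio.txt",
    "--is_Firth_beta=TRUE",
    "--pCutoffforFirth=0.05",
    "--is_output_moreDetails=TRUE",
    "--LOCO=FALSE",
]

# Input flags that differ per storage format.
INPUT_FLAGS = {
    "bcf": [
        "--vcfFile=../scaling/data/chr21_10_4.bcf",
        "--vcfFileIndex=../scaling/data/chr21_10_4.bcf.csi",
    ],
    "vcf": [
        "--vcfFile=../scaling/data/chr21_10_4.vcf.gz",
        "--vcfFileIndex=../scaling/data/chr21_10_4.vcf.gz.csi",
    ],
    "sav": ["--savFile=../scaling/data/chr21_10_4.sav"],
    "vcz": [
        "--vczFile=/home/benj/projects/vcf-zarr-publication/scaling/data/chr21_10_4.zarr"
    ],
}


def step2_command(backend):
    """Build the argument list for one step 2 run (hypothetical helper)."""
    return [
        "conda", "run", "-n", "saige",
        "Rscript", "SAIGE/extdata/step2_SPAtests.R",
        *INPUT_FLAGS[backend],
        f"--SAIGEOutputFile=./chr21_10_4.{backend}_results.txt",
        *COMMON_FLAGS,
    ]


for backend in INPUT_FLAGS:
    print(" ".join(step2_command(backend)))
```

Each list could be passed to `subprocess.run` and timed, although the cells below use the shell's `time` builtin instead.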

## BCF

In [10]:
%%bash

#export PATH="/opt/miniconda3/bin:$PATH"
time conda run -n saige Rscript SAIGE/extdata/step2_SPAtests.R        \
        --vcfFile=../scaling/data/chr21_10_4.bcf \
        --vcfFileIndex=../scaling/data/chr21_10_4.bcf.csi \
        --vcfField=GT   \
        --SAIGEOutputFile=./chr21_10_4.bcf_results.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./chr21_10_4.model.rda \
        --varianceRatioFile=./chr21_10_4.model.varianceRatio.txt  \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=FALSE

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Linux Mint 21.3

Matrix products: default
BLAS/LAPACK: /home/benj/miniconda3/envs/saige/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8   optparse_1.7.3      RhpcBLASctl_0.23-42
[4] SAIGE_1.3.6        

loaded via a namespace (and not attached):
[1] compiler_4.3.1     Matrix_1.6-1.1     Rcpp_1.0.11        getopt_1.20.4     
[5] grid_4.3.1     

Completed 10000/10000 markers in the chunk.
3284 markers were tested.
write to output
   user  system elapsed 
 44.258   0.715  44.969 
isVcfEnd  FALSE 
(2025-01-22 12:23:00.320184) ---- Analyzing Chunk 22 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were tested.
write to output
   user  system elapsed 
 46.222   0.731  46.949 
isVcfEnd  FALSE 
(2025-01-22 12:23:02.299075) ---- Analyzing Chunk 23 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2956 markers were tested.
write to output
   user  system elapsed 
 48.252   0.739  48.987 
isVcfEnd  FALSE 
(2025-01-22 12:23:04.336925) ---- Analyzing Chunk 24 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3108 markers were tested.
write to output
   user  system elapsed 
 50.282   0.747  51.025 
isVcfEnd  FALSE 
(2025-01-22 12:23:06.374873) ---- Analyzing Chunk 25 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were te

3137 markers were tested.
write to output
   user  system elapsed 
113.944   1.019 115.092 
isVcfEnd  FALSE 
(2025-01-22 12:24:10.45559) ---- Analyzing Chunk 57 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3069 markers were tested.
write to output
   user  system elapsed 
115.936   1.027 117.092 
isVcfEnd  FALSE 
(2025-01-22 12:24:12.454882) ---- Analyzing Chunk 58 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2937 markers were tested.
write to output
   user  system elapsed 
117.946   1.027 119.111 
isVcfEnd  FALSE 
(2025-01-22 12:24:14.460236) ---- Analyzing Chunk 59 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3055 markers were tested.
write to output
   user  system elapsed 
119.888   1.035 121.060 
isVcfEnd  FALSE 
(2025-01-22 12:24:16.418801) ---- Analyzing Chunk 60 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2871 markers were tested.
write to output
   user  system elapsed

Loading required package: RhpcBLASctl













































real	3m1.068s
user	2m56.944s
sys	0m2.941s





## VCF

In [11]:
%%bash

#export PATH="/opt/miniconda3/bin:$PATH"
time conda run -n saige Rscript SAIGE/extdata/step2_SPAtests.R        \
        --vcfFile=../scaling/data/chr21_10_4.vcf.gz \
        --vcfFileIndex=../scaling/data/chr21_10_4.vcf.gz.csi \
        --vcfField=GT   \
        --SAIGEOutputFile=./chr21_10_4.vcf_results.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./chr21_10_4.model.rda \
        --varianceRatioFile=./chr21_10_4.model.varianceRatio.txt  \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=FALSE

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Linux Mint 21.3

Matrix products: default
BLAS/LAPACK: /home/benj/miniconda3/envs/saige/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8   optparse_1.7.3      RhpcBLASctl_0.23-42
[4] SAIGE_1.3.6        

loaded via a namespace (and not attached):
[1] compiler_4.3.1     Matrix_1.6-1.1     Rcpp_1.0.11        getopt_1.20.4     
[5] grid_4.3.1     

Completed 10000/10000 markers in the chunk.
3284 markers were tested.
write to output
   user  system elapsed 
 93.576   0.834  94.147 
isVcfEnd  FALSE 
(2025-01-22 12:26:49.022225) ---- Analyzing Chunk 22 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were tested.
write to output
   user  system elapsed 
 97.931   0.841  98.571 
isVcfEnd  FALSE 
(2025-01-22 12:26:53.448421) ---- Analyzing Chunk 23 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2956 markers were tested.
write to output
   user  system elapsed 
102.278   0.845 102.923 
isVcfEnd  FALSE 
(2025-01-22 12:26:57.794807) ---- Analyzing Chunk 24 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3108 markers were tested.
write to output
   user  system elapsed 
106.689   0.853 107.343 
isVcfEnd  FALSE 
(2025-01-22 12:27:02.216908) ---- Analyzing Chunk 25 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were te

3137 markers were tested.
write to output
   user  system elapsed 
249.098   1.353 251.530 
isVcfEnd  FALSE 
(2025-01-22 12:29:26.400491) ---- Analyzing Chunk 57 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3069 markers were tested.
write to output
   user  system elapsed 
253.512   1.357 255.962 
isVcfEnd  FALSE 
(2025-01-22 12:29:30.832493) ---- Analyzing Chunk 58 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2937 markers were tested.
write to output
   user  system elapsed 
257.793   1.365 260.256 
isVcfEnd  FALSE 
(2025-01-22 12:29:35.137292) ---- Analyzing Chunk 59 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3055 markers were tested.
write to output
   user  system elapsed 
262.320   1.365 264.789 
isVcfEnd  FALSE 
(2025-01-22 12:29:39.668761) ---- Analyzing Chunk 60 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2871 markers were tested.
write to output
   user  system elapse

Loading required package: RhpcBLASctl








































real	6m30.335s
user	6m25.193s
sys	0m2.842s


## Savvy

In [12]:
%%bash

#export PATH="/opt/miniconda3/bin:$PATH"
time conda run -n saige Rscript SAIGE/extdata/step2_SPAtests.R        \
        --savFile=../scaling/data/chr21_10_4.sav \
        --vcfField=GT   \
        --SAIGEOutputFile=./chr21_10_4.sav_results.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./chr21_10_4.model.rda \
        --varianceRatioFile=./chr21_10_4.model.varianceRatio.txt  \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=FALSE

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Linux Mint 21.3

Matrix products: default
BLAS/LAPACK: /home/benj/miniconda3/envs/saige/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8   optparse_1.7.3      RhpcBLASctl_0.23-42
[4] SAIGE_1.3.6        

loaded via a namespace (and not attached):
[1] compiler_4.3.1     Matrix_1.6-1.1     Rcpp_1.0.11        getopt_1.20.4     
[5] grid_4.3.1     

write to output
   user  system elapsed 
 10.448   0.649  10.923 
isVcfEnd  FALSE 
(2025-01-22 12:31:57.005929) ---- Analyzing Chunk 22 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were tested.
write to output
   user  system elapsed 
 10.815   0.652  11.298 
isVcfEnd  FALSE 
(2025-01-22 12:31:57.377812) ---- Analyzing Chunk 23 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2956 markers were tested.
write to output
   user  system elapsed 
 11.188   0.653  11.673 
isVcfEnd  FALSE 
(2025-01-22 12:31:57.754324) ---- Analyzing Chunk 24 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3108 markers were tested.
write to output
   user  system elapsed 
 11.603   0.653  12.092 
isVcfEnd  FALSE 
(2025-01-22 12:31:58.172643) ---- Analyzing Chunk 25 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were tested.
write to output
   user  system elapsed 
 12.006   0.657  12.505

   user  system elapsed 
 23.673   0.819  24.459 
isVcfEnd  FALSE 
(2025-01-22 12:32:10.540637) ---- Analyzing Chunk 57 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3069 markers were tested.
write to output
   user  system elapsed 
 24.052   0.819  24.838 
isVcfEnd  FALSE 
(2025-01-22 12:32:10.917721) ---- Analyzing Chunk 58 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2937 markers were tested.
write to output
   user  system elapsed 
 24.414   0.824  25.204 
isVcfEnd  FALSE 
(2025-01-22 12:32:11.285261) ---- Analyzing Chunk 59 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3055 markers were tested.
write to output
   user  system elapsed 
 24.776   0.824  25.567 
isVcfEnd  FALSE 
(2025-01-22 12:32:11.64692) ---- Analyzing Chunk 60 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2871 markers were tested.
write to output
   user  system elapsed 
 25.122   0.828  25.918 
isVcfEnd  FALSE

Loading required package: RhpcBLASctl

































real	0m39.660s
user	0m37.498s
sys	0m2.183s





## Zarr

In [20]:
%%bash

#export PATH="/opt/miniconda3/bin:$PATH"
time conda run -n saige Rscript SAIGE/extdata/step2_SPAtests.R        \
        --vczFile=/home/benj/projects/vcf-zarr-publication/scaling/data/chr21_10_4.zarr \
        --vcfField=GT   \
        --SAIGEOutputFile=./chr21_10_4.vcz_results.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./chr21_10_4.model.rda \
        --varianceRatioFile=./chr21_10_4.model.varianceRatio.txt  \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=FALSE

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Linux Mint 21.3

Matrix products: default
BLAS/LAPACK: /home/benj/miniconda3/envs/saige/lib/libopenblasp-r0.3.21.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8   optparse_1.7.3      RhpcBLASctl_0.23-42
[4] SAIGE_1.3.6        

loaded via a namespace (and not attached):
[1] compiler_4.3.1     Matrix_1.6-1.1     Rcpp_1.0.11        getopt_1.20.4     
[5] grid_4.3.1     

write to output
   user  system elapsed 
 18.268   3.410  14.544 
isVczEnd  FALSE 
(2025-01-22 12:45:26.743736) ---- Analyzing Chunk 23 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2956 markers were tested.
write to output
   user  system elapsed 
 19.029   3.559  15.138 
isVczEnd  FALSE 
(2025-01-22 12:45:27.337872) ---- Analyzing Chunk 24 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3108 markers were tested.
write to output
   user  system elapsed 
 19.790   3.670  15.754 
isVczEnd  FALSE 
(2025-01-22 12:45:27.94713) ---- Analyzing Chunk 25 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3073 markers were tested.
write to output
   user  system elapsed 
 20.505   3.780  16.321 
isVczEnd  FALSE 
(2025-01-22 12:45:28.519953) ---- Analyzing Chunk 26 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2642 markers were tested.
write to output
   user  system elapsed 
 21.291   3.900  16.904 

   user  system elapsed 
 45.297   7.867  35.525 
isVczEnd  FALSE 
(2025-01-22 12:45:47.7248) ---- Analyzing Chunk 58 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2937 markers were tested.
write to output
   user  system elapsed 
 46.114   7.982  36.115 
isVczEnd  FALSE 
(2025-01-22 12:45:48.318459) ---- Analyzing Chunk 59 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
3055 markers were tested.
write to output
   user  system elapsed 
 46.947   8.073  36.721 
isVczEnd  FALSE 
(2025-01-22 12:45:48.921075) ---- Analyzing Chunk 60 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2871 markers were tested.
write to output
   user  system elapsed 
 47.714   8.163  37.301 
isVczEnd  FALSE 
(2025-01-22 12:45:49.501416) ---- Analyzing Chunk 61 :  chrom InitialChunk ---- 
Completed 10000/10000 markers in the chunk.
2979 markers were tested.
write to output
   user  system elapsed 
 48.477   8.295  37.900 
isVczEnd  FALSE 

Loading required package: RhpcBLASctl







real	0m55.402s
user	1m10.391s
sys	0m13.081s


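## Final timings

| Format | real | user | sys |
|--------|------|------|-----|
| BCF | 3m1.068s | 2m56.944s | 0m2.941s |
| VCF | 6m30.335s | 6m25.193s | 0m2.842s |
| Savvy | 0m39.660s | 0m37.498s | 0m2.183s |
| Zarr | 0m55.957s | 1m10.375s | 0m12.111s |

Note that the Zarr figures differ slightly from the cell output shown above (real 0m55.402s, user 1m10.391s, sys 0m13.081s), which suggests they were recorded from a separate run of that cell.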
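To compare these timings numerically, the `real` values can be converted to seconds and expressed relative to the Zarr run. This is a small sketch rather than part of the original analysis; `parse_time` is a hypothetical helper that assumes the `NmS.SSSs` format printed by bash's `time`, and the values are the wall-clock times listed above.

```python
import re


def parse_time(text):
    """Convert a bash `time` value such as '3m1.068s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ('real') times from the summary above.
REAL_TIMES = {
    "BCF": "3m1.068s",
    "VCF": "6m30.335s",
    "Savvy": "0m39.660s",
    "Zarr": "0m55.957s",
}

seconds = {name: parse_time(value) for name, value in REAL_TIMES.items()}
zarr_seconds = seconds["Zarr"]

# Print formats from fastest to slowest, with the ratio to the Zarr run.
for name, value in sorted(seconds.items(), key=lambda item: item[1]):
    print(f"{name:6s} {value:8.3f} s  {value / zarr_seconds:5.2f}x Zarr")
```

These are single runs on one machine, so the ratios are indicative only.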

## Compare results

In [15]:
import pandas as pd

vcz_results = pd.read_csv('chr21_10_4.vcz_results.txt', sep='\t')
bcf_results = pd.read_csv('chr21_10_4.bcf_results.txt', sep='\t')
sav_results = pd.read_csv('chr21_10_4.sav_results.txt', sep='\t')
vcf_results = pd.read_csv('chr21_10_4.vcf_results.txt', sep='\t')

In [16]:
vcz_results.shape == bcf_results.shape == sav_results.shape == vcf_results.shape

True

In [17]:
# Combine the three checks so that the displayed value covers all backends,
# not just the last expression in the cell.
all(
    all(vcz_results.columns == other.columns)
    for other in (bcf_results, vcf_results, sav_results)
)

True

In [18]:
# Each column name is printed three times: the Zarr results compared
# against BCF, then VCF, then Savvy.
for column in vcz_results.columns:
    print(column, all(vcz_results[column] == bcf_results[column]))
    print(column, all(vcz_results[column] == vcf_results[column]))
    print(column, all(vcz_results[column] == sav_results[column]))

CHR True
CHR True
CHR True
POS True
POS True
POS True
MarkerID False
MarkerID False
MarkerID False
Allele1 False
Allele1 False
Allele1 False
Allele2 False
Allele2 False
Allele2 False
AC_Allele2 True
AC_Allele2 True
AC_Allele2 True
AF_Allele2 True
AF_Allele2 True
AF_Allele2 True
MissingRate True
MissingRate True
MissingRate True
BETA True
BETA True
BETA True
SE True
SE True
SE True
Tstat True
Tstat True
Tstat True
var True
var True
var True
p.value True
p.value True
p.value True
N True
N True
N True
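The output above shows that only the string columns (`MarkerID`, `Allele1`, `Allele2`) differ, which is consistent with the TensorStore limitation noted at the top of the notebook. A more compact way to summarise such a comparison is to list the differing columns for each backend. The sketch below uses only the standard library and invented toy data, and `read_tsv` and `differing_columns` are hypothetical helpers. Unlike the pandas comparison, it compares raw text fields, so two numerically equal values written differently (for example `0.5` and `0.50`) would count as different.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def differing_columns(reference, other):
    """Return the columns whose values differ in at least one row."""
    if len(reference) != len(other):
        raise ValueError("tables have different numbers of rows")
    columns = list(reference[0].keys()) if reference else []
    return [
        column
        for column in columns
        if any(a[column] != b[column] for a, b in zip(reference, other))
    ]


# Toy results (illustrative values only, not taken from the benchmark).
bcf_text = "CHR\tPOS\tMarkerID\tBETA\n1\t100\trs_a\t0.5\n1\t200\trs_b\t-0.25\n"
vcz_text = "CHR\tPOS\tMarkerID\tBETA\n1\t100\t.\t0.5\n1\t200\t.\t-0.25\n"

mismatches = differing_columns(read_tsv(vcz_text), read_tsv(bcf_text))
print(mismatches)
```

Applied to the real result files, an empty list for a backend would mean its output matches the Zarr output in every column.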


## Cleanup

For convenience, a `cleanup.sh` script is provided in the same folder as this notebook.