# SAIGE Prototype

This notebook demonstrates a prototype of [SAIGE](https://saigegit.github.io/SAIGE-doc/) that supports reading from VCF Zarr stores.

Before using this notebook, run the `setup.sh` script to install SAIGE and create its Conda environment.

In [1]:
!pip install tstrait

Collecting tstrait
  Using cached tstrait-0.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting numpy>=1.20.3 (from tstrait)
  Downloading numpy-2.2.1-cp310-cp310-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting numba>=0.57.0 (from tstrait)
  Using cached numba-0.60.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (2.7 kB)
Collecting pandas>=1.0 (from tstrait)
  Using cached pandas-2.2.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (89 kB)
Collecting tskit>=0.5.5 (from tstrait)
  Downloading tskit-0.6.0-cp310-cp310-macosx_10_9_universal2.whl.metadata (2.0 kB)
Collecting llvmlite<0.44,>=0.43.0dev0 (from numba>=0.57.0->tstrait)
  Using cached llvmlite-0.43.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (4.8 kB)
Collecting numpy>=1.20.3 (from tstrait)
  Downloading numpy-2.0.2-cp310-cp310-macosx_14_0_arm64.whl.metadata (60 kB)
Collecting tzdata>=2022.7 (from pandas>=1.0->tstrait)
  Using cached tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting svgwrite>=1.1.10 (from tskit>=0.5.5->tst

### Simulate phenotypes

In [4]:
import tskit
import tstrait

ts = tskit.load('../scaling/data/chr21_10_5.ts')
model = tstrait.trait_model(distribution='normal', mean=0, var=1)
sim_result = tstrait.sim_phenotype(ts=ts, model=model, h2=0.3)

In [5]:
sim_result.trait

| | position | site_id | effect_size | causal_allele | allele_freq | trait_id |
|---|---|---|---|---|---|---|
| 0 | 23605820 | 654107 | 0.372912 | A | 3.5e-05 | 0 |


In [6]:
sim_result.phenotype

| | trait_id | individual_id | genetic_value | environmental_noise | phenotype |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | -0.006036 | -0.006036 |
| 1 | 0 | 1 | 0.0 | 0.000652 | 0.000652 |
| 2 | 0 | 2 | 0.0 | 0.001182 | 0.001182 |
| 3 | 0 | 3 | 0.0 | -0.001402 | -0.001402 |
| 4 | 0 | 4 | 0.0 | -0.001009 | -0.001009 |
| ... | ... | ... | ... | ... | ... |
| 286713 | 0 | 286713 | 0.0 | 0.000769 | 0.000769 |
| 286714 | 0 | 286714 | 0.0 | -0.000580 | -0.000580 |
| 286715 | 0 | 286715 | 0.0 | 0.005226 | 0.005226 |
| 286716 | 0 | 286716 | 0.0 | -0.003341 | -0.003341 |


In [7]:
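# Work on a copy so that sim_result.phenotype itself is left unmodified.
phenotype = sim_result.phenotype.copy()
# Build sample IDs in the 'tsk_<individual_id>' form, the default naming tskit uses when writing VCF.
phenotype['sample_id'] = 'tsk_' + phenotype['individual_id'].astype(str)
phenotype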

| | trait_id | individual_id | genetic_value | environmental_noise | phenotype | sample_id |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | -0.006036 | -0.006036 | tsk_0 |
| 1 | 0 | 1 | 0.0 | 0.000652 | 0.000652 | tsk_1 |
| 2 | 0 | 2 | 0.0 | 0.001182 | 0.001182 | tsk_2 |
| 3 | 0 | 3 | 0.0 | -0.001402 | -0.001402 | tsk_3 |
| 4 | 0 | 4 | 0.0 | -0.001009 | -0.001009 | tsk_4 |
| ... | ... | ... | ... | ... | ... | ... |
| 286713 | 0 | 286713 | 0.0 | 0.000769 | 0.000769 | tsk_286713 |
| 286714 | 0 | 286714 | 0.0 | -0.000580 | -0.000580 | tsk_286714 |
| 286715 | 0 | 286715 | 0.0 | 0.005226 | 0.005226 | tsk_286715 |
| 286716 | 0 | 286716 | 0.0 | -0.003341 | -0.003341 | tsk_286716 |


In [8]:
# Save phenotype data to disk in the format that SAIGE expects.
phenotype[['sample_id', 'phenotype']].to_csv('chr21_10_5.phenotypes.txt', sep='\t', index=False)
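
As a quick sanity check, a file written this way can be read back to confirm that it has the two tab-separated columns later passed to SAIGE through `--phenoCol` and `--sampleIDColinphenoFile`. The following is only an illustrative sketch: it uses a tiny made-up DataFrame (`toy`) and an in-memory buffer rather than the real `chr21_10_5.phenotypes.txt`.

```python
import io

import pandas as pd

# Toy stand-in for the phenotype DataFrame (values are made up for illustration).
toy = pd.DataFrame({
    "individual_id": [0, 1, 2],
    "phenotype": [-0.006036, 0.000652, 0.001182],
})
toy["sample_id"] = "tsk_" + toy["individual_id"].astype(str)

# Write to an in-memory buffer instead of disk, with the same to_csv arguments.
buffer = io.StringIO()
toy[["sample_id", "phenotype"]].to_csv(buffer, sep="\t", index=False)

# Read it back and check the header and the sample ID format.
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t")
header_ok = list(round_trip.columns) == ["sample_id", "phenotype"]
ids_ok = bool(round_trip["sample_id"].str.fullmatch(r"tsk_\d+").all())
print(header_ok, ids_ok)
```

To run the same check against the real file, replace the buffer with the path passed to `to_csv`.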

### SAIGE workflow

In [10]:
!plink2 --vcf ../scaling/data/chr21_10_4.vcf.gz --make-bed --out ./chr21_10_4 --max-alleles 2

PLINK v2.00a5.12 M1 (25 Jun 2024)              www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ./chr21_10_4.log.
Options in effect:
  --make-bed
  --max-alleles 2
  --out ./chr21_10_4
  --vcf ../scaling/data/chr21_10_4.vcf.gz

Start time: Sat Jan 11 20:12:48 2025
8192 MiB RAM detected; reserving 4096 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 863998 variants scanned.
--vcf: ./chr21_10_4-temporary.pgen + ./chr21_10_4-temporary.pvar.zst +
./chr21_10_4-temporary.psam written.
10000 samples (0 females, 0 males, 10000 ambiguous; 10000 founders) loaded from
./chr21_10_4-temporary.psam.
856315 out of 863998 variants loaded from ./chr21_10_4-temporary.pvar.zst.
Note: No phenotype data present.
856315 variants remaining after main filters.
Writing ./chr21_10_4.fam ... done.
Writing ./chr21_10_4.bim ... done.
Writing ./chr21_10_4.bed ... done.
End time: Sat Jan 11 20:13:26 2025
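
The phenotypes were simulated from `chr21_10_5.ts`, while the genotypes come from `chr21_10_4.vcf.gz`, so it can be worth checking how many sample IDs the two files share before fitting the null model. This is a minimal sketch, assuming the `.fam` file has the standard six whitespace-separated PLINK columns with the sample ID in the second column. The inline data and the column names given to `read_csv` are made up for illustration.

```python
import io

import pandas as pd

# Made-up stand-ins for chr21_10_4.fam and chr21_10_5.phenotypes.txt.
fam_text = "0 tsk_0 0 0 0 -9\n0 tsk_1 0 0 0 -9\n"
pheno_text = (
    "sample_id\tphenotype\n"
    "tsk_0\t-0.006036\n"
    "tsk_1\t0.000652\n"
    "tsk_2\t0.001182\n"
)

# A .fam file has no header row; the column names here are illustrative labels.
fam = pd.read_csv(
    io.StringIO(fam_text),
    sep=r"\s+",
    header=None,
    names=["fid", "iid", "father", "mother", "sex", "pheno"],
    dtype=str,
)
pheno = pd.read_csv(io.StringIO(pheno_text), sep="\t")

# Samples present in both files, and genotyped samples that lack a phenotype.
shared = set(fam["iid"]) & set(pheno["sample_id"])
missing_pheno = set(fam["iid"]) - set(pheno["sample_id"])
print(f"{len(shared)} shared, {len(missing_pheno)} genotyped without a phenotype")
```

With the real files, replace the `io.StringIO(...)` arguments with `'chr21_10_4.fam'` and `'chr21_10_5.phenotypes.txt'`.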


In [18]:
%%bash

export PATH="/opt/miniconda3/bin:$PATH"
conda run -n RSAIGE Rscript SAIGE/extdata/step1_fitNULLGLMM.R     \
        --plinkFile=./chr21_10_4  \
        --useSparseGRMtoFitNULL=FALSE    \
        --phenoFile=./chr21_10_5.phenotypes.txt \
        --phenoCol=phenotype \
        --sampleIDColinphenoFile=sample_id \
        --invNormalize=TRUE     \
        --traitType=quantitative        \
        --outputPrefix=./chr21_10_4.model \
        --nThreads=24   \
        --IsOverwriteVarianceRatioFile=TRUE

Loading required package: optparse
package ‘optparse’ was built under R version 4.3.3 



R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /opt/miniconda3/envs/RSAIGE/lib/libblis.4.0.0.dylib 
LAPACK: /opt/miniconda3/envs/RSAIGE/lib/liblapack.3.9.0.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: UTC
tzcode source: system (macOS)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] optparse_1.7.5 SAIGE_1.3.6   

loaded via a namespace (and not attached):
[1] compiler_4.3.0     Matrix_1.6-5       Rcpp_1.0.13        getopt_1.20.4     
[5] grid_4.3.0         data.table_1.15.2  RcppParallel_5.1.9 lattice_0.22-6    
$plinkFile
[1] "./chr21_10_4"

$bedFile
[1] ""

$bimFile
[1] ""

$famFile
[1] ""

$phenoFile
[1] "./chr21_10_5.phenotypes.txt"

$phenoCol
[1] "phenotype"

$traitType
[1] "quantitative"

$invNormalize
[1] TRUE

$covarColList
[1] ""

$qCovarColList
[1] ""


Here we run [step 2](https://saigegit.github.io/SAIGE-doc/docs/single_step2.html) of the single-variant association test, reading genotypes from the BCF file.

In [34]:
%%bash

export PATH="/opt/miniconda3/bin:$PATH"
conda run -n RSAIGE Rscript SAIGE/extdata/step2_SPAtests.R        \
        --vcfFile=../scaling/data/chr21_10_4.bcf \
        --vcfFileIndex=../scaling/data/chr21_10_4.bcf.csi \
        --vcfField=GT   \
        --SAIGEOutputFile=./chr21_10_4.bcf_results.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./chr21_10_4.model.rda \
        --varianceRatioFile=./chr21_10_4.model.varianceRatio.txt  \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=FALSE

R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /opt/miniconda3/envs/RSAIGE/lib/libblis.4.0.0.dylib 
LAPACK: /opt/miniconda3/envs/RSAIGE/lib/liblapack.3.9.0.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: UTC
tzcode source: system (macOS)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.15.2   optparse_1.7.5      RhpcBLASctl_0.23-42
[4] SAIGE_1.3.6        

loaded via a namespace (and not attached):
[1] compiler_4.3.0     Matrix_1.6-5       Rcpp_1.0.13        getopt_1.20.4     
[5] grid_4.3.0         RcppParallel_5.1.9 lattice_0.22-6    
$vcfFile
[1] "../scaling/data/chr21_10_4.bcf"

$vcfFileIndex
[1] "../scaling/data/chr21_10_4.bcf.csi"

$vcfField
[1] "GT"

$vczFile
[1] ""

$savFile
[1] ""

$savFileIndex
[1] ""

$bgenFile
[1] ""

$bgenFileIndex
[1] ""


Loading required package: RhpcBLASctl
package ‘RhpcBLASctl’ was built under R version 4.3.3 
package ‘optparse’ was built under R version 4.3.3 
package ‘data.table’ was built under R version 4.3.1 



Now we run [step 2](https://saigegit.github.io/SAIGE-doc/docs/single_step2.html) of the single-variant association test using the VCF Zarr data.

In [27]:
%%bash

export PATH="/opt/miniconda3/bin:$PATH"
conda run -n RSAIGE Rscript SAIGE/extdata/step2_SPAtests.R        \
        --vczFile=/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.zarr \
        --vcfField=GT   \
        --SAIGEOutputFile=./chr21_10_4.vcz_results.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./chr21_10_4.model.rda \
        --varianceRatioFile=./chr21_10_4.model.varianceRatio.txt  \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=FALSE

R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /opt/miniconda3/envs/RSAIGE/lib/libblis.4.0.0.dylib 
LAPACK: /opt/miniconda3/envs/RSAIGE/lib/liblapack.3.9.0.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: UTC
tzcode source: system (macOS)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.15.2   optparse_1.7.5      RhpcBLASctl_0.23-42
[4] SAIGE_1.3.6        

loaded via a namespace (and not attached):
[1] compiler_4.3.0     Matrix_1.6-5       Rcpp_1.0.13        getopt_1.20.4     
[5] grid_4.3.0         RcppParallel_5.1.9 lattice_0.22-6    
$vcfFile
[1] ""

$vcfFileIndex
[1] ""

$vcfField
[1] "GT"

$vczFile
[1] "/Users/willtyler/Desktop/vcf-zarr-publication/scaling/data/chr21_10_4.zarr"

$savFile
[1] ""

$savFileIndex
[1] ""

$bgenFile
[1] ""

$bgenFileInd

Loading required package: RhpcBLASctl
package ‘RhpcBLASctl’ was built under R version 4.3.3 
package ‘optparse’ was built under R version 4.3.3 
package ‘data.table’ was built under R version 4.3.1 






### Compare results

In [28]:
import pandas as pd

vcz_results = pd.read_csv('chr21_10_4.vcz_results.txt', sep='\t')
bcf_results = pd.read_csv('chr21_10_4.bcf_results.txt', sep='\t')

In [31]:
vcz_results.shape == bcf_results.shape

True

In [30]:
all(vcz_results.columns == bcf_results.columns)

True

In [33]:
for column in vcz_results.columns:
    print(column, all(vcz_results[column] == bcf_results[column]))

CHR True
POS True
MarkerID False
Allele1 False
Allele2 False
AC_Allele2 True
AF_Allele2 True
MissingRate True
BETA True
SE True
Tstat True
var True
p.value True
N True


### Cleanup

For convenience, a `cleanup.sh` script is provided in the same folder as this notebook.