# SAIGE analysis for UK Biobank data: binary phenotypes

## Aim

To perform genetic association analysis for binary traits (asthma and diabetes) using SAIGE and UK Biobank imputed data for ~500K individuals.

## Method and workflow overview

1. Download data from UKB: phenotype files, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`)
2. Create a conda environment for installing SAIGE on Yale's HRC cluster, following the instructions at https://github.com/weizhouUMICH/SAIGE
3. Run SAIGE analysis to obtain summary statistics for association analysis

## Input data

1. Genotype file for constructing the GRM (genetic relationship matrix), formatted as a plink binary file set (`.bed`, `.bim`, `.fam`)
2. Phenotype file (also contains the non-genetic covariates). The format is space- or tab-delimited with a header (one column for sample IDs and one column for each phenotype)
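
A quick sanity check of the phenotype file before launching a long job can save a failed run. The sketch below is only an illustration: the column names `IID`, `ASTHMA`, `AGE` and `SEX` are the ones passed to SAIGE later in this notebook, while the 0/1 coding with `NA` for missing values and the helper name `check_phenotype_table` are assumptions.

```python
import io

def check_phenotype_table(handle, id_col="IID", pheno_col="ASTHMA",
                          covar_cols=("AGE", "SEX")):
    """Return a list of problems found in a space- or tab-delimited table."""
    # str.split() with no argument handles both spaces and tabs.
    rows = [line.split() for line in handle if line.strip()]
    if not rows:
        return ["file is empty"]
    header, records = rows[0], rows[1:]
    missing = [c for c in (id_col, pheno_col, *covar_cols) if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]
    id_idx, pheno_idx = header.index(id_col), header.index(pheno_col)
    problems, seen_ids = [], set()
    # Line numbers count non-blank lines, with the header as line 1.
    for line_no, fields in enumerate(records, start=2):
        if len(fields) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(fields)}")
            continue
        if fields[id_idx] in seen_ids:
            problems.append(f"line {line_no}: duplicate sample ID {fields[id_idx]}")
        seen_ids.add(fields[id_idx])
        # Assumption: binary traits are coded 0/1, with NA for missing.
        if fields[pheno_idx] not in {"0", "1", "NA"}:
            problems.append(
                f"line {line_no}: phenotype value {fields[pheno_idx]!r} is not 0, 1 or NA")
    return problems

# Toy in-memory table; a real run would pass open(path_to_phenotype_file).
example = io.StringIO("FID IID ASTHMA AGE SEX\n1001 1001 0 55 1\n1002 1002 1 61 0\n")
print(check_phenotype_table(example))
```

An empty list means none of the checked problems were found; it does not guarantee that SAIGE will accept the file.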

## Output files

**From step 1**

1. Model file: `${_output}.rda`

2. Association result file for the subset of randomly selected markers: `${_output}.results.txt`

3. Variance ratio file: `${_output}.varianceRatio.txt`

**From step 2**

1. A file with association results for each chromosome (Note: effect estimates are given with respect to Allele 2)
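
Because effects are reported for Allele 2, comparing against a study that reports the other allele requires flipping the sign of the effect and taking the complement of the allele frequency. The following is a minimal sketch of that conversion; the keys `BETA` and `AF_Allele2` are assumed column names and `flip_to_allele1` is a hypothetical helper, so adjust them to the header of your actual result files.

```python
import math

def flip_to_allele1(record):
    """Re-express one association record relative to Allele 1.

    `record` is a dict with a log-odds effect ("BETA") and the frequency
    of Allele 2 ("AF_Allele2"); both key names are assumptions.
    """
    flipped = dict(record)
    flipped["BETA"] = -record["BETA"]                 # log-odds changes sign
    flipped["AF_Allele1"] = 1.0 - record["AF_Allele2"]
    flipped["OR"] = math.exp(flipped["BETA"])         # odds ratio for Allele 1
    del flipped["AF_Allele2"]
    return flipped

print(flip_to_allele1({"SNPID": "rs_example", "BETA": 0.5, "AF_Allele2": 0.25}))
```

The standard error and p-value are unchanged by the flip, which is why the sketch leaves any other fields untouched.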

## Global parameter setting

In [None]:
[global]
import os
# Genotype file in plink binary format
parameter: genoFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed')
# Phenotype file for binary trait 1 (asthma)
parameter: phenoFile = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720')
# Phenotype file for binary trait 2 (diabetes)
#parameter: phenoFile2 = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_030720')
# Path to bgen files
parameter: bgenFile = paths([f'/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{x+1}_v3.bgen' for x in range(22)])
# Path to sample file
parameter: sampleFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample')
# Minimum MAF to be used
parameter: minMAF = 0.001
# Minimum info score to be used
parameter: minInfo = 0.8
# Minimum allele count to be used
parameter: minMAC = 4
# the output directory for generated files
parameter: cwd = path('~/results/pleiotropy/2020-04_saige/asthma/')
# Specific number of threads to use
parameter: numThreads = 20
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

for item in bgenFile:
    fail_if(not os.path.isfile(f"{item}.bgi"), msg = f"Index file doesn't exist: ``{item}.bgi``")
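
The per-chromosome list and the index check above can be tried outside SoS with plain Python. This is a sketch under assumptions: the helper names are hypothetical, and the file name pattern `ukb_imp_chr{N}_v3.bgen` is copied from the `bgenFile` parameter.

```python
import os
import tempfile

def chromosome_bgen_paths(directory, chromosomes=range(1, 23)):
    """Build one .bgen path per autosome (chr1 to chr22 by default)."""
    return [os.path.join(directory, f"ukb_imp_chr{c}_v3.bgen") for c in chromosomes]

def missing_index_files(bgen_paths):
    """Return the expected .bgi index paths that do not exist on disk."""
    return [f"{p}.bgi" for p in bgen_paths if not os.path.isfile(f"{p}.bgi")]

# Toy demonstration in a temporary directory: only chr1 gets an index file.
with tempfile.TemporaryDirectory() as tmp:
    paths = chromosome_bgen_paths(tmp, chromosomes=[1, 2])
    open(paths[0] + ".bgi", "w").close()
    print([os.path.basename(p) for p in missing_index_files(paths)])
```

Collecting every missing index in one list, rather than stopping at the first, makes it easier to fix all of them before resubmitting.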

## Running SAIGE

On Yale's Farnam cluster,

```
sos run pleiotropy_UKB/SAIGE_binary_UKB.ipynb -c pleiotropy_UKB/farnam.yml -q farnam -J 40

```

On your local machine, for a test run:

```
sos dryrun pleiotropy_UKB/SAIGE_binary_UKB.ipynb
```

To try an individual step (here `saige:1`) without submitting to a job queue:

```
sos run pleiotropy_UKB/SAIGE_binary_UKB.ipynb saige:1 -q none
```

When using a submission script based on `sos-submit-template.sh` (submitted with `sbatch saige_asthma.sh`), the command is:

```
sos run pleiotropy_UKB/SAIGE_binary_UKB.ipynb -c pleiotropy_UKB/farnam.yml -q farnam -J 40 \
-s build &> sos-submission-saige-asthma.log
```

### Step 1: fitting the null model

In [None]:
[saige_1]
input: genoFile, phenoFile
output: f'{cwd}/{genoFile:bn}.{phenoFile:bn}.SAIGE.rda', f'{cwd}/{genoFile:bn}.{phenoFile:bn}.SAIGE.varianceRatio.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', template_name='conda', env_name='RSAIGE'
    Rscript ~/software/bin/step1_fitNULLGLMM.R \
        --plinkFile=${_input[0]:n} \
        --phenoFile=${_input[1]} \
        --phenoCol=ASTHMA \
        --covarColList=AGE,SEX \
        --sampleIDColinphenoFile=IID \
        --traitType=binary \
        --outputPrefix=${_output[0]:n} \
        --nThreads=${numThreads} \
        --LOCO=TRUE
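
The `--outputPrefix` value is the first declared output with its extension removed (the `:n` format in SoS), so the files written from that prefix need to match the names in the `output:` statement. A minimal sketch of that naming logic in plain Python, assuming the suffixes listed in the "Output files" section; `step1_outputs` is a hypothetical helper:

```python
import os

def step1_outputs(model_file):
    """Derive the step 1 prefix and expected file names from the .rda path."""
    prefix, _ext = os.path.splitext(model_file)   # mirrors the SoS ':n' format
    return {
        "prefix": prefix,
        "model": f"{prefix}.rda",
        "variance_ratio": f"{prefix}.varianceRatio.txt",
    }

names = step1_outputs("results/geno.pheno.SAIGE.rda")
print(names["prefix"])
print(names["variance_ratio"])
```

If the second entry of the `output:` statement does not equal the derived `variance_ratio` name, the step can finish successfully and still be reported as failed because the declared output is missing.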

### Step 2: perform single variant association test

In [None]:
[saige_2]
input: for_each='bgenFile'
output: f'{cwd}/{_bgenFile:bn}.{phenoFile:b}.SAIGE.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', template_name='conda', env_name='RSAIGE'
    Rscript ~/software/bin/step2_SPAtests.R \
        --bgenFile=${_bgenFile} \
        --bgenFileIndex=${_bgenFile}.bgi \
        --minMAF=${minMAF} \
        --minMAC=${minMAC} \
        --minInfo=${minInfo} \
        --sampleFile=${sampleFile} \
        --GMMATmodelFile=${_input[0]} \
        --varianceRatioFile=${_input[1]} \
        --SAIGEOutputFile=${_output} \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE \
        --IsOutputNinCaseCtrl=TRUE \
        --IsDropMissingDosages=TRUE \
        --IsOutputHetHomCountsinCaseCtrl=TRUE
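
To get a feel for what the `minMAF`, `minMAC` and `minInfo` thresholds exclude, the sketch below applies the same three cut-offs to a single variant. It is an illustration only: the function name is hypothetical, the minor allele count is approximated as 2 x N x MAF, and SAIGE's internal handling (for example, strict versus non-strict comparisons) may differ.

```python
def passes_thresholds(allele2_freq, n_samples, info,
                      min_maf=0.001, min_mac=4, min_info=0.8):
    """Return True if a variant would survive the three filters.

    Defaults mirror the minMAF, minMAC and minInfo parameters of this notebook.
    """
    maf = min(allele2_freq, 1.0 - allele2_freq)   # minor allele frequency
    mac = 2 * n_samples * maf                     # approximate minor allele count
    return maf >= min_maf and mac >= min_mac and info >= min_info

print(passes_thresholds(0.01, 1000, 0.9))       # common enough, well imputed
print(passes_thresholds(0.0005, 500000, 0.95))  # fails the MAF cut-off
print(passes_thresholds(0.2, 1000, 0.5))        # fails the info cut-off
```

With roughly 500K individuals, a MAF of 0.001 already corresponds to about 1,000 minor alleles, so in this data set the MAC threshold only matters when an analysis is restricted to a much smaller subset.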

### Results

Results for the single-variant association analyses are written to one file per chromosome, each with the suffix `.SAIGE.txt`.
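
A first look at a result file often means pulling out the variants below a significance threshold. The sketch below does this with the standard library only. It rests on assumptions: the file is taken to be space-delimited with a `p.value` column, the threshold `5e-8` is the conventional genome-wide cut-off rather than anything set in this notebook, and `significant_hits` is a hypothetical helper.

```python
import csv
import io

def significant_hits(handle, p_col="p.value", threshold=5e-8):
    """Return the rows of a space-delimited result table with p < threshold."""
    hits = []
    for row in csv.DictReader(handle, delimiter=" "):
        try:
            p_value = float(row[p_col])
        except (TypeError, ValueError):
            continue                      # skip rows with a missing p-value
        if p_value < threshold:
            hits.append(row)
    return hits

# Toy table standing in for one per-chromosome result file.
toy = io.StringIO(
    "SNPID BETA p.value\n"
    "snp_a 0.12 3e-9\n"
    "snp_b -0.02 0.4\n"
    "snp_c 0.30 NA\n"
)
print([row["SNPID"] for row in significant_hits(toy)])
```

For the real files, the same function can be called once per chromosome with `open(path)` and the returned lists concatenated.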