# SAIGE analysis for UK Biobank data

For binary and quantitative phenotypes

## Aim

To perform genetic association analysis for binary traits (asthma and diabetes) using the SAIGE software and UK Biobank imputed data from ~500K individuals.

## Method and workflow overview

1. Download data from UKB: phenotype files, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`)
2. Run SAIGE to obtain summary statistics for the association analysis

## Input data

1. Genotype file for constructing the GRM (genetic relationship matrix), formatted as a PLINK binary file set (`.bed`, `.bim`, `.fam`)
2. Phenotype file, which also contains the non-genetic covariates. It is space- or tab-delimited with a header row, one column for sample IDs and one column for each phenotype
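The phenotype file layout described above can be sanity-checked before launching the workflow. The following is a minimal sketch, not part of the workflow itself: the function name `read_phenotypes` and the column names `IID` and `asthma` are assumptions made for illustration and should be replaced with the actual header names in your phenotype file.

```python
import io

def read_phenotypes(handle, id_col="IID", pheno_col="asthma"):
    """Parse a space- or tab-delimited phenotype table with a header row.

    Returns a dict mapping sample ID to the phenotype value (as a string).
    The default column names are hypothetical placeholders.
    """
    header = handle.readline().split()  # split() handles spaces and tabs
    for col in (id_col, pheno_col):
        if col not in header:
            raise ValueError(f"missing column in header: {col}")
    id_idx, pheno_idx = header.index(id_col), header.index(pheno_col)
    values = {}
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields, got {len(fields)}")
        values[fields[id_idx]] = fields[pheno_idx]
    return values

# Small in-memory example standing in for a real phenotype file
example = io.StringIO("FID IID age sex asthma\n1001 1001 54 1 0\n1002 1002 61 0 1\n")
phenotypes = read_phenotypes(example)
print(phenotypes)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.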

## Output files

## Global parameter setting

In [None]:
[global]
import os
# Genotype file in plink binary format
parameter: genoFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed')
# Phenotype file for binary trait 1 (asthma)
parameter: phenoFile = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols_forbolt030720')
# Phenotype file for binary trait 2 (diabetes)
#parameter: phenoFile2 = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/diabetes_casesbyICD10andselfreport_controlswithoutautoiummune_030720')
# Path to bgen files
parameter: bgenFile = paths([f'/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{x+1}_v3.bgen' for x in range(22)])
# Path to sample file
parameter: sampleFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample')
# Minimum MAF to be used
parameter: minMAF = 0.001
# Minimum info score to be used
parameter: minInfo = 0.8
# Minimum allele count to be used
parameter: minMAC = 4
# the output directory for generated files
parameter: cwd = path('~/scratch60/2020-04_saige/asthma')
# Specific number of threads to use
parameter: numThreads = 20
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

for item in bgenFile:
    fail_if(not os.path.isfile(f"{item}.bgi"), msg = f"Index file doesn't exist: ``{item}.bgi``")
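The index check above stops at the first missing `.bgi` file. As a hedged illustration in plain Python (outside SoS), the sketch below builds the 22 per-chromosome paths and collects every missing index file in one pass. The helper names `bgen_paths` and `missing_index_files` are hypothetical, and the template omits the directory prefix for brevity.

```python
import os

def bgen_paths(template, chromosomes=range(1, 23)):
    """Expand a path template with a {chrom} placeholder for chromosomes 1-22."""
    return [template.format(chrom=c) for c in chromosomes]

def missing_index_files(bgen_files):
    """Return the expected .bgi paths that do not exist on disk."""
    return [f"{p}.bgi" for p in bgen_files if not os.path.isfile(f"{p}.bgi")]

paths = bgen_paths("ukb_imp_chr{chrom}_v3.bgen")
missing = missing_index_files(paths)
if missing:
    print(f"{len(missing)} index file(s) missing, e.g. {missing[0]}")
```

Listing all missing files at once can save repeated resubmissions when several indexes still need to be generated.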

## Running SAIGE

On Yale's Farnam cluster:

```
sos run pleiotropy_UKB/SAIGE_binary_UKB.ipynb -c pleiotropy_UKB/farnam.yml -q farnam -J 40
```

On your local machine, for a dry run:

```
sos dryrun pleiotropy_UKB/SAIGE_binary_UKB.ipynb
```

To run a single step (here `saige_1`) without submitting to a job queue:

```
sos run pleiotropy_UKB/SAIGE_binary_UKB.ipynb saige_1 -q none
```

If submitting through a batch script based on `sos-submit-template.sh` (for example `sbatch saige_asthma.sh`), the script runs:

```
sos run ~/project/pleiotropy_UKB/workflow/SAIGE.ipynb -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 40 \
-s build &> sos-submission-saige-asthma-051920.log
```

### Results

Results for the single-variant association analyses can be found in files with the suffix `.SAIGE.bgen.txt`.
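
As a rough sketch of how such a results file might be inspected, the snippet below keeps only the variants below a p-value threshold. It assumes a whitespace-delimited table with a header row and a p-value column named `p.value`. The other column names in the example, the helper name `significant_hits`, and the threshold of 5e-8 (the conventional genome-wide significance level, not a setting taken from this workflow) are illustrative assumptions. Check the header of your own output files before relying on them.

```python
import io

def significant_hits(handle, p_col="p.value", threshold=5e-8):
    """Return rows (as dicts) whose p-value is below the threshold.

    Rows with a wrong field count or a non-numeric p-value are skipped.
    """
    header = handle.readline().split()
    p_idx = header.index(p_col)
    hits = []
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # malformed or blank line
        try:
            p_value = float(fields[p_idx])
        except ValueError:
            continue  # e.g. "NA"
        if p_value < threshold:
            hits.append(dict(zip(header, fields)))
    return hits

# In-memory stand-in for a `.SAIGE.bgen.txt` file (made-up rows)
example = io.StringIO(
    "CHR POS SNPID p.value\n"
    "1 1000 rs1 3.2e-09\n"
    "1 2000 rs2 0.41\n"
    "2 3000 rs3 NA\n"
)
hits = significant_hits(example)
print([h["SNPID"] for h in hits])
```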