# BOLT-LMM analyses for UK Biobank data

For binary and quantitative traits.

## Aim

To perform genetic association analysis for quantitative traits (waist-to-hip ratio, waist circumference, BMI, blood lipids) using BOLT-LMM software and UK Biobank imputed data from ~500K individuals.

## Method and workflow overview

1. Download data from UKB: phenotype files, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`)
2. Install BOLT-LMM on Yale's Farnam cluster. To install from source code, follow the instructions here: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-70002.2
3. Run BOLT-LMM analysis to obtain summary statistics for association analysis

## Input data

1. Genotype files for constructing the GRM (genetic relationship matrix), formatted as PLINK binary files (`.bed`/`.bim`/`.fam`) and specified with `--bfile=prefix`
2. Reference genetic maps, specified with `--geneticMapFile=tables/genetic_map_hg##.txt.gz`
3. Imputed genotype dosages in `.bgen` format, specified with `--bgenFile` and `--sampleFile`
4. Phenotype file (whitespace-delimited, with column headers; the first two columns must be FID and IID), specified with `--phenoFile`. The phenotype to be analyzed is selected with `--phenoCol`
5. Covariates file (same format as the phenotype file), specified with `--covarFile`. Use `--covarCol` for qualitative (categorical) covariates and `--qCovarCol` for quantitative ones. Use `--covarMaxLevels` to set the maximum number of categories allowed for a qualitative covariate. To specify an array of quantitative covariates, use `--qCovarCol=PC{1:20}`
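
The header requirement in items 4 and 5 can be checked before a job is submitted. The following is a minimal sketch, not part of the workflow: the helper name `check_bolt_header` and the toy column names (`INT_BMI`, `AGE`, `SEX`) are hypothetical and only illustrate the expected layout.

```python
import io


def check_bolt_header(handle, required_cols=()):
    """Return the header columns of a phenotype/covariate file.

    Raises ValueError if the first two columns are not FID and IID,
    or if any column listed in `required_cols` is absent.
    """
    header = handle.readline().split()  # whitespace-delimited header line
    if header[:2] != ["FID", "IID"]:
        raise ValueError(f"first two columns must be FID and IID, got {header[:2]}")
    missing = [col for col in required_cols if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return header


# Hypothetical toy file held in memory, standing in for a real phenotype file
example = io.StringIO("FID IID INT_BMI AGE SEX\n1 1 0.12 54 1\n")
cols = check_bolt_header(example, required_cols=["INT_BMI", "AGE", "SEX"])
print(cols)
```

For a file on disk, the same function can be called with `open(path)` in place of the in-memory example.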

Note: reference genome used **GRCh37/hg19**

## Output data

For a complete description of the bolt command-line options, see: http://manpages.ubuntu.com/manpages/eoan/en/man1/bolt.1.html
```
--statsFile arg
  output file for assoc stats at PLINK genotypes
  
--statsFileBgenSnps arg
  output file for assoc stats at BGEN-format genotypes
```
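
Once a run has finished, the gzipped statistics file can be scanned without loading it fully into memory. The sketch below assumes the file is tab-delimited with a header row and contains a p-value column named `P_BOLT_LMM_INF`; both are assumptions to verify against the actual output. The helper name `significant_snps`, the toy file, and the 5e-8 threshold are illustrative choices.

```python
import csv
import gzip
import os
import tempfile


def significant_snps(stats_path, p_col="P_BOLT_LMM_INF", threshold=5e-8):
    """Yield rows (as dicts) whose p-value in `p_col` is below `threshold`."""
    with gzip.open(stats_path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row[p_col]) < threshold:
                yield row


# Build a tiny, made-up stats file so the example runs on its own
toy_path = os.path.join(tempfile.mkdtemp(), "toy.snp_stats.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write("SNP\tCHR\tBP\tP_BOLT_LMM_INF\n")
    handle.write("rs1\t1\t1000\t3.2E-09\n")
    handle.write("rs2\t1\t2000\t0.41\n")

hits = [row["SNP"] for row in significant_snps(toy_path)]
print(hits)
```

Passing a different `p_col` lets the same helper filter on another p-value column if the output contains one.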

## Global parameter setting

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path
# Path to sample file
parameter: sampleFile = path
# Genotype files in PLINK binary format, used for computing the GRM
parameter: bfile = path
# Path to bgen files 
parameter: bgenFile = paths
# Phenotype file for quantitative trait (BMI)
parameter: phenoFile = path
# Phenotype to be analyzed (specify the column)
parameter: phenoCol = str
# Covariate file path
parameter: covarFile = path
# Qualitative covariates to be used in the analysis
parameter: covarCol = list
# Maximum categories of covariates allowed 
parameter: covarMaxLevels = int
# Quantitative covariates to be used in the analysis
parameter: qCovarCol = list
# Path to LD score file for the European population
parameter: LDscoresFile = path
# Path to genetic map file used to interpolate genetic map coordinates from SNP physical (base pair) positions
parameter: geneticMapFile = path
# Specific number of threads to use
parameter: numThreads = int
# Minimum MAF to be used
parameter: bgenMinMAF = float
# Minimum info score to be used
parameter: bgenMinINFO = float
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

## Running BOLT-LMM

On Yale Farnam cluster,

* INT-BMI with age and sex as covariates

```
PHEN_DIR=~/project/phenotypes_UKB/
sos run ~/project/pleiotropy_UKB/worfkflow/BoltLMM.ipynb bolt -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720 -s build &> sos-submission-INT-BMI.log
```
* INT-WAIST with age, sex and BMI as covariates

```
PHEN_DIR=~/project/phenotypes_UKB/
sos run ~/project/pleiotropy_UKB/worfkflow/BoltLMM.ipynb bolt -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-WAIST_withagesex_042020 -s build &> sos-submission-INT-WAIST.log
```
* INT-WHR with age, sex and BMI as covariates

```
PHEN_DIR=~/project/phenotypes_UKB/
sos run ~/project/pleiotropy_UKB/worfkflow/BoltLMM.ipynb bolt -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-WHR_withagesex_042020 -s build &> sos-submission-INT-WHR.log
```
On a local computer with, for example, 8 threads:

```
sos run BoltLMM.ipynb bolt -q none -j 8
```

### Notes about how the analyses were done

1. Run INT-BMI with covariates AGE and SEX
2. Run INT-WHR with covariates AGE and SEX, repeated on 05-07-2020 with BMI as covariate
3. Run INT-WAIST with covariates AGE, SEX and BMI
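
The phenotype names above carry an `INT-` prefix. Assuming this stands for a rank-based inverse normal transformation (an assumption; the phenotype preparation is not described here), the transformation can be sketched as follows. The Blom offset `c = 3/8` and the function name are illustrative choices, not necessarily those used to build the phenotype files.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8):
    """Rank-based inverse normal transform; tied values share their average rank."""
    n = len(values)
    order = sorted(range(n), key=lambda idx: values[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of values tied with position i
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    # Map each rank to a quantile in (0, 1), then to a standard normal score
    return [standard_normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]


# Made-up BMI values for illustration only
z_scores = inverse_normal_transform([31.2, 22.5, 27.8])
print(z_scores)
```

The transformed values keep the ordering of the raw measurements but follow an approximately standard normal distribution.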

## Workflow implementation details

**A note for developers**: it is important to have input and output for each step. Input files and output files are best derived from one another.


BOLT-LMM computes statistics for testing association between phenotypes and genotypes using a linear mixed model. The options used in this workflow are:


```
--bfile = accepts genotype files in PLINK binary format (.fam, .bed, .bim)
--geneticMapFile = Oxford-format file for interpolating genetic distances: tables/genetic_map_hg##.txt.gz
--phenoFile = phenotype file (header required; FID and IID must be first two columns)
--phenoCol = phenotype column header
--covarFile = covariate file (header required; FID and IID must be first two columns)
--covarCol = categorical covariate column(s); for >1, use multiple --covarCol and/or {i:j} expansion
--qCovarCol = quantitative covariate column(s); for >1, use multiple --qCovarCol and/or {i:j} expansion
--lmm = compute assoc stats under the inf model and with Bayesian non-inf prior (VB approx), if power gain expected
--modelSnps = file(s) listing SNPs to use in model (i.e., GRM) (default: use all non-excluded SNPs)
--LDscoresFile = LD Scores for calibration of Bayesian assoc stats: tables/LDSCORE.1000G_EUR.tab.g
--numThreads = number of computational threads
--statsFile = output file for assoc stats at PLINK genotypes
--bgenFile = file(s) containing Oxford BGEN-format genotypes to test for association
--sampleFile = file containing Oxford sample file corresponding to BGEN file(s)
--bgenMinMAF = MAF threshold on Oxford BGEN-format genotypes; lower-MAF SNPs will be ignored
--bgenMinINFO = INFO threshold on Oxford BGEN-format genotypes; lower-INFO SNPs will be ignored
--statsFileBgenSnps = output file for assoc stats at BGEN-format genotypes
```
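
The `{i:j}` shorthand and the repeated covariate flags can be illustrated with a short Python sketch. BOLT-LMM performs the expansion itself, so the helper `expand_range` only shows what a pattern such as `PC{1:20}` stands for; both function names and the example covariate names are hypothetical.

```python
import re


def expand_range(name):
    """Expand one 'PREFIX{i:j}' pattern into PREFIXi ... PREFIXj (inclusive)."""
    match = re.fullmatch(r"(.*)\{(\d+):(\d+)\}(.*)", name)
    if match is None:
        return [name]  # no range pattern, return the name unchanged
    prefix, start, stop, suffix = match.groups()
    return [f"{prefix}{k}{suffix}" for k in range(int(start), int(stop) + 1)]


def covariate_flags(covar_cols, qcovar_cols):
    """Build one --covarCol / --qCovarCol flag per covariate name."""
    flags = [f"--covarCol={col}" for col in covar_cols]
    flags += [f"--qCovarCol={col}" for col in qcovar_cols]
    return flags


print(expand_range("PC{1:3}"))
print(covariate_flags(["SEX"], ["AGE", "PC{1:20}"]))
```

Note that categorical covariates go to `--covarCol` and quantitative ones to `--qCovarCol`; the two lists must not be mixed up when assembling the command.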

In [2]:
[bolt]
depends: executable("bolt")
input: bgenFile, group_by = 1
output: f'{cwd}/{_input:bn}.{phenoFile:b}.boltlmm.snp_stats.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bolt \
    --bfile=${bfile} \
    --phenoFile=${phenoFile} \
    --phenoCol=${phenoCol} \
    --covarFile=${covarFile} \
    ${' '.join(['--covarCol=%s ' % x for x in covarCol if x is not None])} \
    --covarMaxLevels=${covarMaxLevels} \
    ${' '.join(['--qCovarCol=%s ' % x for x in qCovarCol if x is not None])} \
    --LDscoresFile=${LDscoresFile} \
    --geneticMapFile=${geneticMapFile} \
    --lmm \
    --statsFile=${_output:nn}.stats.gz \
    --numThreads=${numThreads} \
    --bgenFile=${_input} \
    --bgenMinMAF=${bgenMinMAF} \
    --bgenMinINFO=${bgenMinINFO} \
    --sampleFile=${sampleFile} \
    --statsFileBgenSnps=${_output} \
    --verboseStats
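
The output names in the step above are derived from the input BGEN file and the phenotype file. The Python sketch below mirrors that derivation with `pathlib`, under the assumption that the SoS format specifiers behave as follows: `:bn` gives the base name without its last extension, `:b` the base name, and `:nn` the path with its last two extensions removed. The function names and example paths are hypothetical.

```python
from pathlib import Path


def snp_stats_path(cwd, bgen_file, pheno_file):
    """Mimic f'{cwd}/{_input:bn}.{phenoFile:b}.boltlmm.snp_stats.gz'."""
    bgen_stem = Path(bgen_file).stem    # base name without the .bgen extension
    pheno_base = Path(pheno_file).name  # base name, any extension kept
    return Path(cwd) / f"{bgen_stem}.{pheno_base}.boltlmm.snp_stats.gz"


def plink_stats_path(snp_stats):
    """Mimic '${_output:nn}.stats.gz': drop '.snp_stats.gz', append '.stats.gz'."""
    trimmed = Path(snp_stats).with_suffix("").with_suffix("")
    return trimmed.as_posix() + ".stats.gz"


# Hypothetical input names, for illustration only
bgen_out = snp_stats_path("output", "data/example_chr1.bgen", "pheno/example_INT-BMI")
plink_out = plink_stats_path(bgen_out)
print(bgen_out.as_posix())
print(plink_out)
```

Because both names share the same prefix, the PLINK-genotype statistics and the BGEN statistics for one chromosome and phenotype end up side by side in `cwd`.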

## Results

In [None]:
%cd ~/scratch60/2020-04_bolt