# BOLT-LMM analyses for UK Biobank data: quantitative traits

## Aim

To perform genetic association analysis for quantitative traits (waist-to-hip ratio, waist circumference, BMI, blood lipids) using BOLT-LMM and UK Biobank imputed data from ~500K individuals.

## Method and workflow overview

1. Download data from UKB: phenotype files, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`)
2. Install BOLT-LMM on Yale's Farnam cluster. To install from source code, follow the instructions here: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-70002.2
3. Run BOLT-LMM analysis to obtain summary statistics for association analysis

## Input data

1. Genotype file for constructing the GRM (genetic relationship matrix), formatted as a PLINK binary file set: `--bfile=prefix`
2. Reference genetic maps: `--geneticMapFile=tables/genetic_map_hg##.txt.gz`
3. Imputed genotype dosages in `.bgen` format, specified with `--bgenFile` and `--sampleFile`
4. Phenotype file (whitespace-delimited file with column headers; the first two columns must be FID and IID), specified with `--phenoFile`. The phenotype to be analyzed is selected with `--phenoCol`
5. Covariates file (same format as the phenotype file), specified with `--covarFile`. Use `--covarCol` for qualitative (categorical) covariates and `--qCovarCol` for quantitative ones. Use `--covarMaxLevels` to set the maximum number of categories allowed for a qualitative covariate. To specify an array of quantitative covariates, use the shorthand `--qCovarCol=PC{1:20}`

Note: the reference genome used is **GRCh37/hg19**.
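
The file format requirements listed above can be checked before submitting a long-running job. The following is a minimal sketch, not part of the workflow itself: the helper names `expand_shorthand` and `missing_columns` are hypothetical, and the header line in the example is a toy value rather than a real UKB file. It assumes the `{start:end}` shorthand expands to an inclusive range, so `PC{1:20}` stands for `PC1` through `PC20`.

```python
import re


def expand_shorthand(spec):
    """Expand an array shorthand such as 'PC{1:3}' into ['PC1', 'PC2', 'PC3'].

    Names without a {start:end} range are returned unchanged in a list.
    """
    match = re.fullmatch(r"(.*)\{(\d+):(\d+)\}(.*)", spec)
    if match is None:
        return [spec]
    prefix, start, end, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(start), int(end) + 1)]


def missing_columns(header_line, required):
    """Return the required column names that are absent from a header line.

    Raises ValueError if the first two columns are not FID and IID.
    """
    columns = header_line.split()  # whitespace-delimited header
    if columns[:2] != ["FID", "IID"]:
        raise ValueError("the first two columns must be FID and IID")
    wanted = [name for spec in required for name in expand_shorthand(spec)]
    return [name for name in wanted if name not in columns]


# Toy header (hypothetical, for illustration only)
header = "FID IID SEX AGE rankNorm_BMI PC1 PC2"
print(missing_columns(header, ["SEX", "AGE", "PC{1:3}"]))  # PC3 is not in the header
```

In practice the header line would be read from the first line of the file passed to `--phenoFile` or `--covarFile`.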

## Global parameter setting

In [None]:
[global]
# the output directory for generated files
parameter: cwd = path('~/results/pleiotropy/2020-04_bolt/INT-BMI/')
# Path to sample file
parameter: sampleFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample')
# Genotype files in PLINK binary format, used for computing the GRM
parameter: bfile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv')
# Path to bgen files 
parameter: bgenFile = paths([f'/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{x+1}_v3.bgen' for x in range(22)])
# Phenotype file for quantitative trait (BMI)
parameter: phenoFile = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/') 
# Phenotype to be analyzed (specify the column)
parameter: phenoCol = 'rankNorm_BMI'
# Covariate file path
parameter: covarFile = path('/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_withagesex_033120')
# Qualitative covariates to be used in the analysis
parameter: covarCol1 = 'SEX'
# Maximum categories of covariates allowed 
parameter: covarMaxLevels = 10
# Quantitative covariates to be used in the analysis
parameter: qCovarCol2 = 'AGE'
#parameter: qCovarCol3 = 'PC{1:10}' # if we are going to use PC as covariates uncomment
# Path to LDscore file for european population
parameter: LDscoresFile = path('~/software/BOLT-LMM_v2.3.4/tables/LDSCORE.1000G_EUR.tab.gz')
# Path to genetic map file used to interpolate genetic map coordinates from SNP physical (base pair) positions
parameter: geneticMapFile = path('~/software/BOLT-LMM_v2.3.4/tables/genetic_map_hg19_withX.txt.gz')
# Specific number of threads to use
parameter: numThreads = 20
# Minimum MAF to be used
parameter: bgenMinMAF = 0.001
# Minimum INFO score to be used
parameter: bgenMinINFO = 0.8
# For cluster jobs, number of commands to run per job
parameter: job_size = 1

## Running BOLT-LMM

On the Yale Farnam cluster,

```
PHEN_DIR=/home/dc2325/phenotypes_UKB/
sos run pleiotropy_UKB/BOLT-LMM_quantitative_UKB.ipynb bolt -c pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720
```

```
PHEN_DIR=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/
sos run pleiotropy_UKB/BOLT-LMM_quantitative_UKB.ipynb bolt -c pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_BMI_nopreg_adjagesex_residuals_andstandardized_022720
```

```
PHEN_DIR=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/
sos run pleiotropy_UKB/BOLT-LMM_quantitative_UKB.ipynb bolt -c pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_waistcircumference_adjbmiagesex_nopreg_residuals_022720
```

```
PHEN_DIR=/home/dc2325/phenotypes_UKB/
sos run pleiotropy_UKB/BOLT-LMM_quantitative_UKB.ipynb bolt -c pleiotropy_UKB/farnam.yml -q farnam -J 40 \
   --phenoFile $PHEN_DIR/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-WHR_withagesex_042020
```

On a local computer with, for example, 8 threads,

```
sos run BOLT-LMM_quantitative_UKB.ipynb bolt -q none -j 8
```

## Workflow implementation details

**A note for developers**: it is important to have input and output for each step. Input files and output files are best derived from one another.
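
To illustrate how output names can be derived from input names, here is a minimal plain-Python sketch of the naming scheme used in the `[bolt]` step. It mimics the SoS format specifiers `:bn` (base name without extension), `:b` (base name) and `:nn` (strip two extensions) with `pathlib`. The function name `derive_outputs` and the example paths are hypothetical and only serve the illustration.

```python
from pathlib import PurePosixPath


def derive_outputs(cwd, bgen_file, pheno_file):
    """Build the per-chromosome output names from the input file names."""
    bgen_stem = PurePosixPath(bgen_file).stem    # like SoS {_input:bn}
    pheno_base = PurePosixPath(pheno_file).name  # like SoS {phenoFile:b}
    stats = PurePosixPath(cwd) / f"{bgen_stem}.{pheno_base}.BoltLMM.stats.gz"
    # like SoS {_output:nn}: drop '.gz', then '.stats'
    prefix = stats.with_suffix("").with_suffix("")
    snp_stats = f"{prefix}.snp_stats.gz"
    return str(stats), snp_stats


# Hypothetical paths, for illustration only
stats, snp_stats = derive_outputs(
    "results/INT-BMI",
    "genotypes/ukb_imp_chr1_v3.bgen",
    "phenotypes/INT-BMI_pheno",
)
print(stats)
print(snp_stats)
```

Because each output name embeds both the chromosome file and the phenotype file, runs for different traits written to the same directory do not overwrite each other.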

In [1]:
[bolt]
depends: executable("bolt")
input: bgenFile, group_by = 1
output: f'{cwd}/{_input:bn}.{phenoFile:b}.BoltLMM.stats.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
  bolt \
    --bfile=${bfile} \
    --phenoFile=${phenoFile} \
    --phenoCol=${phenoCol} \
    --covarFile=${covarFile} \
    --covarCol=${covarCol1} \
    --covarMaxLevels=${covarMaxLevels} \
    --qCovarCol=${qCovarCol2} \
    --LDscoresFile=${LDscoresFile} \
    --geneticMapFile=${geneticMapFile} \
    --lmm \
    --numThreads=${numThreads} \
    --statsFile=${_output} \
    --bgenFile=${_input} \
    --bgenMinMAF=${bgenMinMAF} \
    --bgenMinINFO=${bgenMinINFO} \
    --sampleFile=${sampleFile} \
    --statsFileBgenSnps=${_output:nn}.snp_stats.gz \
    --verboseStats

## Results

In [None]:
%cd ~/results/pleiotropy/2020-04_bolt/INT-BMI
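
Once the per-chromosome summary statistics exist, a first look at the results could be a filter for strongly associated variants. The sketch below is an assumption-laden illustration rather than part of the workflow: it assumes the `.stats.gz` files are tab-delimited with a header containing `SNP`, `CHR`, `BP` and a p-value column such as `P_BOLT_LMM_INF`, and it uses the conventional genome-wide threshold of 5e-8 as a default. The function name `significant_hits` is hypothetical, and the column name should be adjusted to match the actual file header.

```python
import csv
import gzip


def significant_hits(stats_path, p_col="P_BOLT_LMM_INF", threshold=5e-8):
    """Return (SNP, CHR, BP, p) tuples for rows with p below the threshold.

    Assumes a gzip-compressed, tab-delimited file with a header row.
    """
    hits = []
    with gzip.open(stats_path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            p_value = float(row[p_col])
            if p_value < threshold:
                hits.append((row["SNP"], row["CHR"], row["BP"], p_value))
    return hits
```

The function would be called once per chromosome file, for example inside a loop over the files matching `*.BoltLMM.stats.gz` in the results directory.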