# BOLT-LMM analyses for UK Biobank data

For binary and quantitative traits, works with BOLT-LMM version 2.3.4.

## Aim

This pipeline was initially developed to perform genetic association analysis using BOLT-LMM and UK Biobank imputed data for ~500K individuals, although it can be used to analyze other phenotypes.

## Method and workflow overview

1. Download data from UKB: phenotypes, genotype files (`.fam`, `.bed`, `.bim`) and imputed genotypes (`.bgen`, `.bgi`, `.sample`).
2. Install BOLT-LMM. To install from source code, follow the instructions here: https://data.broadinstitute.org/alkesgroup/BOLT-LMM/#x1-70002.2
    - Supporting files such as LD score file and genetic map file can be found [in the installation bundle](https://data.broadinstitute.org/alkesgroup/BOLT-LMM/downloads/BOLT-LMM_v2.3.4.tar.gz).
3. Run BOLT-LMM analysis to obtain summary statistics for association analysis.

## Input data

1. Genotype file for constructing the GRM (genetic relationship matrix), formatted as a PLINK binary file set (`.bed`/`.bim`/`.fam`)
    - `--bfile=prefix`
2. Reference genetic maps 
    - `--geneticMapFile=tables/genetic_map_hg##.txt.gz`
3. Imputed genotype dosages in `.bgen` format 
    - `--bgenFile` and `--sampleFile`
4. Phenotype file: a whitespace-delimited file with column headers, whose first two columns must be FID and IID. Specify the file with `--phenoFile` and the phenotype to be analyzed with `--phenoCol`.
5. Covariates file (same format as the phenotype file), specified with `--covarFile`. Use `--covarCol` for qualitative covariates and `--qCovarCol` for quantitative ones. If `--covarFile` is not specified, the phenotype file is used as the covariate file. Use `--covarMaxLevels` to set the maximum number of categories allowed for a qualitative covariate. To specify an array of quantitative covariates, use e.g. `--qCovarCol PC{1:20}`.

Note: the reference genome used is **GRCh37/hg19**.
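
Before launching a long run, the file layout described above (whitespace-delimited, header line, `FID` and `IID` first) can be checked with a few lines of Python. This is a minimal sketch and not part of the pipeline; the function name `check_header` and the in-memory example are hypothetical.

```python
import io

def check_header(handle, required_cols):
    """Check the header of a BOLT-LMM style phenotype/covariate file.

    Returns the list of column names; raises ValueError on problems.
    """
    header = handle.readline().split()  # any run of whitespace separates columns
    if header[:2] != ["FID", "IID"]:
        raise ValueError("first two columns must be FID and IID")
    missing = [c for c in required_cols if c not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return header

# Hypothetical in-memory example; in practice pass an opened phenotype file
example = io.StringIO("FID IID BMI SEX AGE\n1 1 24.3 1 55\n")
print(check_header(example, ["BMI", "SEX", "AGE"]))
```

The columns passed as `required_cols` would be the values given to `--phenoCol`, `--covarCol` and `--qCovarCol`.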

## Output data

For a complete description of the `bolt` command-line options, see: http://manpages.ubuntu.com/manpages/eoan/en/man1/bolt.1.html
```
--statsFile arg
  output file for assoc stats at PLINK genotypes
  
--statsFileBgenSnps arg
  output file for assoc stats at BGEN-format genotypes
```

## Command interface

In [17]:
sos run BoltLMM.ipynb -h

usage: sos run BoltLMM.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  bolt

Global Workflow Options:
  --cwd VAL (as path, required)
                        the output directory for generated files
  --sampleFile VAL (as path, required)
                        Path to sample file
  --bfile VAL (as path, required)
                        Genotype files in plink binary this is used for
                        computing the GRM
  --bgenFile  paths

                        Path to bgen files
  --phenoFile VAL (as path, required)
                        Phenotype file for quantitative trait (BMI)
  --phenoCol VAL (as str, required)
                        Phenotype to be analyzed (s

## Global parameter setting

In [1]:
[global]
# the output directory for generated files
parameter: cwd = path
# Path to sample file
parameter: sampleFile = path
# Genotype files in plink binary this is used for computing the GRM
parameter: bfile = path
# Path to bgen files 
parameter: bgenFile = paths
# Phenotype file for quantitative trait (BMI)
parameter: phenoFile = path
# Phenotype to be analyzed (specify the column)
parameter: phenoCol = str
# Covariate file path. Will use phenoFile if empty
parameter: covarFile = path('.')
# Qualitative covariates to be used in the analysis
parameter: covarCol = []
# Maximum categories of covariates allowed 
parameter: covarMaxLevels = int
# Quantitative covariates to be used in the analysis
parameter: qCovarCol = list
# Path to LDscore file for reference population
parameter: LDscoresFile = path
# Path to genetic map file used to interpolate genetic map coordinates from SNP physical (base pair) positions
parameter: geneticMapFile = path
# Specific number of threads to use
parameter: numThreads = int
# Minimum MAF to be used
parameter: bgenMinMAF = float
# Minimum info score to be used
parameter: bgenMinINFO = float
# LMM option: lmm, lmmInfOnly, and lmmForceNonInf
parameter: lmm_option = 'lmm'
# For cluster jobs, number commands to run per job
parameter: job_size = 1

if not covarFile.is_file():
    covarFile = phenoFile

## Running BOLT-LMM

```
JOB_OPT='-j 2'
```

On a minimal working example (MWE) dataset (about 1min to complete the analysis),

```
sos run BoltLMM.ipynb bolt \
    --cwd output \
    --bfile data/genotypes \
    --sampleFile data/imputed_genotypes.sample \
    --bgenFile data/imputed_genotypes_chr*.bgen \
    --phenoFile data/phenotypes.txt \
    --LDscoresFile BOLT-LMM_v2.3.4/tables/LDSCORE.1000G_EUR.tab.gz \
    --geneticMapFile BOLT-LMM_v2.3.4/tables/genetic_map_hg19_withX.txt.gz \
    --phenoCol BMI \
    --covarCol SEX \
    --covarMaxLevels 10 \
    --qCovarCol AGE \
    --numThreads 5 \
    --bgenMinMAF 0.001 \
    --bgenMinINFO 0.1 \
    --lmm-option none \
    $JOB_OPT
```

Please note that the command above is only meant to demonstrate the usage of the pipeline. Output is written to a folder called `output`. We set `--lmm-option` to `none` so that no LMM is run on this minimal data set. In practice you will want to use one of the LMM options in BOLT-LMM; if `--lmm-option` is not specified, the pipeline defaults to the `--lmm` switch of `bolt`.
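
The mapping from `--lmm-option` to the switch passed to `bolt` can be illustrated with a short Python sketch that mirrors the conditional expression used in the `bolt_1` step. The helper name `lmm_flag` is hypothetical and only serves the illustration.

```python
# Values that the workflow turns into a BOLT-LMM switch
VALID_LMM_OPTIONS = ("lmm", "lmmInfOnly", "lmmForceNonInf")

def lmm_flag(lmm_option):
    """Return the BOLT-LMM switch for an --lmm-option value, or '' if none applies."""
    return "--" + lmm_option if lmm_option in VALID_LMM_OPTIONS else ""

for value in ("lmm", "lmmInfOnly", "lmmForceNonInf", "none"):
    print(repr(value), "->", repr(lmm_flag(value)))
```

Note that, under this logic, any value outside the three recognized options (including a misspelled one) results in no LMM switch being passed, exactly like `none`.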

### Run workflow on a cluster

The shell variable `JOB_OPT` was set to `-j 2`. That is, run 2 jobs in parallel on a local computer (each using 5 threads due to `--numThreads 5`).

On a cluster, we use a job template and configure `JOB_OPT` as follows:

```
JOB_OPT="-c farnam.yml -q farnam -J 40"
```

Here we use task queue `farnam` configured in file `farnam.yml`. We allow for at most 40 jobs in the cluster job queue.

## Workflow implementation details

**A note for developers**: it is important to have input and output for each step. Input files and output files are best derived from one another.


BOLT-LMM computes statistics for testing association between phenotypes and genotypes using a linear mixed model. The options used in this workflow are:


```
--bfile = accepts genotype files in PLINK binary format (.fam, .bed, .bim)
--geneticMapFile = Oxford-format file for interpolating genetic distances: tables/genetic_map_hg##.txt.gz
--phenoFile = phenotype file (header required; FID and IID must be first two columns)
--phenoCol = phenotype column header
--covarFile = covariate file (header required; FID and IID must be first two columns)
--covarCol = categorical covariate column(s); for >1, use multiple --covarCol and/or {i:j} expansion
--qCovarCol = quantitative covariate column(s); for >1, use multiple --qCovarCol and/or {i:j} expansion
--lmm = compute assoc stats under the inf model and with Bayesian non-inf prior (VB approx), if power gain expected
--modelSnps = file(s) listing SNPs to use in model (i.e., GRM) (default: use all non-excluded SNPs)
--LDscoresFile = LD Scores for calibration of Bayesian assoc stats: tables/LDSCORE.1000G_EUR.tab.gz
--numThreads = number of computational threads
--statsFile = output file for assoc stats at PLINK genotypes
--bgenFile = file(s) containing Oxford BGEN-format genotypes to test for association
--sampleFile = file containing Oxford sample file corresponding to BGEN file(s)
--bgenMinMAF = MAF threshold on Oxford BGEN-format genotypes; lower-MAF SNPs will be ignored
--bgenMinINFO = INFO threshold on Oxford BGEN-format genotypes; lower-INFO SNPs will be ignored
--statsFileBgenSnps = output file for assoc stats at BGEN-format genotypes
```
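
The `{i:j}` expansion mentioned for `--covarCol` and `--qCovarCol` can be pictured with the following Python sketch. It assumes the range is inclusive at both ends (so `PC{1:20}` stands for `PC1` through `PC20`); the function `expand_range` is a hypothetical illustration and not BOLT-LMM code.

```python
import re

def expand_range(spec):
    """Expand a single 'NAME{i:j}' pattern into NAMEi ... NAMEj (inclusive)."""
    match = re.fullmatch(r"(.*)\{(\d+):(\d+)\}(.*)", spec)
    if match is None:
        return [spec]  # nothing to expand, keep the name as given
    prefix, start, stop, suffix = match.groups()
    return [f"{prefix}{k}{suffix}" for k in range(int(start), int(stop) + 1)]

cols = expand_range("PC{1:20}")
print(len(cols), cols[0], cols[-1])
```

Each expanded name must exist as a column header in the covariate file.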

It is important to know that, for BGEN v1.2 files, BOLT-LMM v2.3.4 only accepts the 8-bit encoding, as stated in the BOLT-LMM documentation:

*WARNING: The BGEN format comprises a few sub-formats; we have only implemented support for the versions (and specific data layouts) used in the UK Biobank N=150K and N=500K releases. In particular, for BGEN v1.2, BOLT-LMM currently only supports the 8-bit encoding used for the UK Biobank N=500K data. (Starting with BOLT-LMM v2.3.3, missing values in BGEN v1.2 data are now allowed.)*
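
Whether a file uses the BGEN v1.1 or v1.2 layout can be read from its header. The sketch below follows the header layout of the published BGEN specification (little-endian fields: offset, header length, number of variants, number of samples, magic bytes, free data area, flags). It is a hedged illustration, not part of the pipeline: the function name `read_bgen_header` is hypothetical and the bytes parsed here are synthetic. It does not verify the 8-bit encoding, because the bit depth is stored per variant inside the genotype data blocks.

```python
import io
import struct

def read_bgen_header(handle):
    """Parse the fixed part of a BGEN header from a binary file handle."""
    offset, header_len, n_variants, n_samples = struct.unpack("<4I", handle.read(16))
    magic = handle.read(4)
    if magic not in (b"bgen", b"\x00\x00\x00\x00"):
        raise ValueError("not a BGEN file")
    handle.read(header_len - 20)  # skip the free data area
    (flags,) = struct.unpack("<I", handle.read(4))
    return {
        "n_variants": n_variants,
        "n_samples": n_samples,
        "compression": flags & 0b11,      # 0 = none, 1 = zlib, 2 = zstd
        "layout": (flags >> 2) & 0b1111,  # 1 = BGEN v1.1, 2 = BGEN v1.2
        "has_sample_ids": bool(flags >> 31),
    }

# Synthetic 20-byte header: 3 variants, 2 samples, zlib compression, layout 2
fake = struct.pack("<4I", 20, 20, 3, 2) + b"bgen" + struct.pack("<I", (2 << 2) | 1)
print(read_bgen_header(io.BytesIO(fake)))
```

For a real file, open it in binary mode (`open(path, "rb")`) and pass the handle to the function.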

In [2]:
# Run BOLT analysis
[bolt_1]
depends: executable("bolt")
input: bgenFile, group_by = 1
output: f'{cwd}/{_input:bn}.{phenoFile:bn}_{phenoCol}.boltlmm.snp_stats.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '60G', cores = numThreads, tags = f'{step_name}'
bash: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    bolt \
    --bfile=${bfile} \
    --phenoFile=${phenoFile} \
    --phenoCol=${phenoCol} \
    --covarFile=${covarFile} \
    ${' '.join(['--covarCol=%s ' % x for x in covarCol if x is not None])} \
    --covarMaxLevels=${covarMaxLevels} \
    ${' '.join(['--qCovarCol=%s ' % x for x in qCovarCol if x is not None])} \
    --LDscoresFile=${LDscoresFile} \
    --geneticMapFile=${geneticMapFile} \
    ${('--' + lmm_option) if lmm_option in ['lmm', 'lmmInfOnly', 'lmmForceNonInf'] else ''} \
    --statsFile=${_output:nn}.ref_stats.gz \
    --numThreads=${numThreads} \
    --bgenFile=${_input} \
    --bgenMinMAF=${bgenMinMAF} \
    --bgenMinINFO=${bgenMinINFO} \
    --sampleFile=${sampleFile} \
    --statsFileBgenSnps=${_output} \
    --verboseStats

bash: expand = "${ }", active = (_index != 0)
    # remove redundant reference summary stats file
    rm -f ${_output:nn}.ref_stats.gz

bash: expand = "${ }", active = (_index == 0)
    # rename reference summary stats file
    mv ${_output:nn}.ref_stats.gz ${cwd}/${phenoFile:bn}_${phenoCol}.boltlmm.ref_stats.gz
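
The file names produced by this step follow from the SoS path format specifiers (`b` for base name, `n` for dropping one extension). The Python sketch below reproduces the naming with `pathlib` for a single BGEN input; the helper `step_outputs` is hypothetical, and the input paths are the ones from the MWE command, with `chr21` chosen as an example chromosome.

```python
from pathlib import Path

def step_outputs(cwd, bgen_file, pheno_file, pheno_col):
    """Mimic the file names produced by the bolt_1 step for one BGEN input."""
    bgen_stem = Path(bgen_file).stem    # like {_input:bn}
    pheno_stem = Path(pheno_file).stem  # like {phenoFile:bn}
    prefix = f"{cwd}/{bgen_stem}.{pheno_stem}_{pheno_col}.boltlmm"
    return {
        "snp_stats": prefix + ".snp_stats.gz",   # passed to --statsFileBgenSnps
        "ref_stats": prefix + ".ref_stats.gz",   # passed to --statsFile
        "stderr": prefix + ".snp_stats.stderr",  # {_output:n}.stderr
        "stdout": prefix + ".snp_stats.stdout",  # {_output:n}.stdout
    }

names = step_outputs("output", "data/imputed_genotypes_chr21.bgen",
                     "data/phenotypes.txt", "BMI")
for key, value in names.items():
    print(key, value)
```

Because `--statsFile` results are identical for every BGEN input, only the copy from the first input is kept and renamed; the others are removed.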

In [None]:
# Merge results and log files
[bolt_2]
input: group_by = 'all'
output: f'{cwd}/{phenoFile:bn}_{phenoCol}.boltlmm.snp_stats.gz', 
        f'{cwd}/{phenoFile:bn}_{phenoCol}.boltlmm.snp_counts.txt'
task: trunk_workers = 1, trunk_size = 1, walltime = '30m', mem = '6G', cores = 1, tags = f'{step_name}'
python: expand=True
    import gzip
    # Concatenate the per-chromosome results, keeping the header line of the first file only
    with gzip.open('{_output[0]}', 'wt') as outfile:
        for i, filename in enumerate([{_input:r,}]):
            with gzip.open(filename, 'rt') as f:
                for line in f:
                    # header lines start with the column name 'SNP'
                    if i == 0 or not line.startswith('SNP'):
                        outfile.write(line)

bash: expand="$( )"
    # count result SNPs
    for f in $(_input); do echo "$f: `zcat $f | wc -l`"; done > $(_output[1])
    # merge stderr and stdout files
    for f in $(_input); do 
        for ext in stderr stdout; do
            echo "$f $ext:"
            cat ${f%.gz}.$ext
            rm -f ${f%.gz}.$ext 
        done
    done > $(_output[0]:nn).log
    # remove files before merger
    rm -f $(_input)
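
The merging logic can also be tried outside of SoS. The following standalone Python sketch is a variant of the idea rather than the workflow code: instead of dropping lines that start with `SNP`, it drops the first line of every file after the first. The function name `merge_gz` and the toy data are hypothetical.

```python
import gzip
import os
import tempfile

def merge_gz(inputs, output):
    """Concatenate gzipped text tables, keeping the header of the first file only.

    Returns the number of data rows written.
    """
    n_rows = 0
    with gzip.open(output, "wt") as out:
        for i, name in enumerate(inputs):
            with gzip.open(name, "rt") as f:
                header = f.readline()
                if i == 0:
                    out.write(header)
                for line in f:
                    out.write(line)
                    n_rows += 1
    return n_rows

# Toy example with two made-up per-chromosome files
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for chrom, snp in (("21", "rs1"), ("22", "rs2")):
        part = os.path.join(tmp, f"chr{chrom}.gz")
        with gzip.open(part, "wt") as f:
            f.write("SNP\tCHR\n" + f"{snp}\t{chrom}\n")
        parts.append(part)
    merged = os.path.join(tmp, "merged.gz")
    print(merge_gz(parts, merged))
    with gzip.open(merged, "rt") as f:
        print(f.read())
```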

## Result

In [16]:
ls output/

phenotypes_BMI.boltlmm.log           phenotypes_BMI.boltlmm.snp_counts.txt
phenotypes_BMI.boltlmm.ref_stats.gz  phenotypes_BMI.boltlmm.snp_stats.gz


In [12]:
%preview output/phenotypes_BMI.boltlmm.snp_stats.gz

SNP	CHR	BP	GENPOS	ALLELE1	ALLELE0	A1FREQ	INFO	CHISQ_LINREG	P_LINREG
rs79945276	21	48096251	0.646473	T	G	0.0640784	0.96222	4.23679	4.0E-02
rs12481825	21	48096617	0.646484	A	C	0.0153529	0.977965	0.712511	4.0E-01
rs61504104	21	48096920	0.646493	C	T	0.0887647	0.975897	0.0796273	7.8E-01
rs55777714	21	48097101	0.646499	T	C	0.169882	0.959507	0.128473	7.2E-01


In [13]:
%preview output/phenotypes_BMI.boltlmm.ref_stats.gz

SNP	CHR	BP	GENPOS	ALLELE1	ALLELE0	A1FREQ	F_MISS	CHISQ_LINREG	P_LINREG
rs3131962	1	756604	0.00490722	A	G	0.165	0	0.0284453	8.7E-01
rs12562034	1	768448	0.00495714	A	G	0.07	0	1.03484	3.1E-01
rs4040617	1	779322	0.00500708	G	A	0.155	0	0.133342	7.1E-01
rs79373928	1	801536	0.0058722	G	T	0.02	0	0.0409388	8.4E-01
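
In these tables, `P_LINREG` is the p-value of `CHISQ_LINREG`, a chi-square statistic with 1 degree of freedom. As a hedged illustration using only the Python standard library, the upper-tail probability can be computed as erfc(sqrt(x/2)); the function name `chisq1_pvalue` is hypothetical and the input values are copied from the `snp_stats` preview above.

```python
import math

def chisq1_pvalue(chisq):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chisq / 2.0))

# CHISQ_LINREG values copied from the snp_stats preview
for snp, chisq in [("rs79945276", 4.23679), ("rs12481825", 0.712511)]:
    print(snp, f"{chisq1_pvalue(chisq):.1E}")
```

The printed values should agree with the `P_LINREG` column up to rounding.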


In [14]:
%preview output/phenotypes_BMI.boltlmm.snp_counts.txt

output/imputed_genotypes_chr21.phenotypes_BMI.boltlmm.snp_stats.gz: 25
output/imputed_genotypes_chr22.phenotypes_BMI.boltlmm.snp_stats.gz: 22

In [15]:
%preview output/phenotypes_BMI.boltlmm.log

output/imputed_genotypes_chr21.phenotypes_BMI.boltlmm.snp_stats.gz stderr:
NOTE: Using all-1s vector (constant term) in addition to specified covariates
NOTE: Using all-1s vector (constant term) in addition to specified covariates
NOTE: Using all-1s vector (constant term) in addition to specified covariates
NOTE: Using all-1s vector (constant term) in addition to specified covariates