# GCTA-COJO: multi-SNP-based conditional & joint association analysis using GWAS summary data

## Aim

This pipeline analyzes the UKBB GWAS results to identify secondary association signals for different traits, using both conditional and joint association analyses.

Normally, the top SNP in a region is reported to represent the association at that locus. This relies on two assumptions:
1. The top SNP captures the maximum amount of variation in the region because of its LD with an unknown causal variant.
2. Neighboring SNPs show association because they are correlated with the top SNP.

These assumptions may not hold:
1. Even if there is a single causal variant, the top SNP may not capture the overall variation at the locus.
2. There can be multiple causal variants at a locus, in which case a single SNP won't capture all the LD between the unknown causal variants and the genotyped/imputed SNPs at the locus.

Conditional analysis addresses this: perform association analysis conditioning on the primary associated SNP at the locus to test whether any other SNPs are significantly associated.

## Method

GCTA-COJO uses a genome-wide stepwise selection procedure to select SNPs based on their conditional p-values, then estimates the joint effects of all selected SNPs once the model has been optimized.
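
As a rough illustration of why joint effects can differ from marginal (single-SNP) effects, the sketch below handles the simplest case of two SNPs. It assumes standardized genotypes and phenotype, so that the joint effects solve `R b_joint = b_marginal` with `R` the LD correlation matrix. The function name `joint_effects_two_snps` and all numbers are made up for illustration; this is not GCTA's implementation.

```python
def joint_effects_two_snps(b1, b2, r):
    """Joint effects of two SNPs from their marginal effects and LD correlation r.

    Assumes standardized genotypes and phenotype, so that
    [[1, r], [r, 1]] @ b_joint = b_marginal.
    """
    if abs(r) >= 1:
        raise ValueError("SNPs are perfectly correlated; joint effects are not identifiable")
    denom = 1.0 - r * r
    return (b1 - r * b2) / denom, (b2 - r * b1) / denom


# Hypothetical example: SNP 2 is associated only through its LD (r = 0.5) with SNP 1
j1, j2 = joint_effects_two_snps(b1=0.30, b2=0.15, r=0.5)
print(round(j1, 6), round(j2, 6))
```

In this made-up example the marginal effect of SNP 2 is exactly what LD with SNP 1 would produce, so its joint effect drops to zero while SNP 1 keeps its full effect.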

## Input files

Summary statistics format
```
SNP A1 A2 freq b se p N
rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850 
rs1002 C G 0.0306 0.0034 0.0115 0.7659 129799 
rs1003 A C 0.5128 0.0045 0.0038 0.2319 129830
```

Columns are:
* SNP: SNP identifier
* A1: the effect allele
* A2: the other allele
* freq: frequency of the effect allele A1
* b: effect size
* se: standard error
* p: p-value
* N: sample size
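
A quick sanity check of a summary statistics file before running GCTA might look like the sketch below. The function name `read_cojo_sumstats` and the specific range checks are assumptions made for this illustration; GCTA performs its own validation.

```python
import io

REQUIRED = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]


def read_cojo_sumstats(handle):
    """Parse whitespace-delimited summary statistics and run basic sanity checks."""
    header = handle.readline().split()
    if header != REQUIRED:
        raise ValueError(f"expected columns {REQUIRED}, got {header}")
    rows = []
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(REQUIRED):
            raise ValueError(f"line {line_no}: expected {len(REQUIRED)} fields")
        row = dict(zip(REQUIRED, fields))
        for key in ("freq", "b", "se", "p", "N"):
            row[key] = float(row[key])
        # Illustrative range checks only
        if not 0 <= row["freq"] <= 1 or not 0 <= row["p"] <= 1 or row["se"] <= 0:
            raise ValueError(f"line {line_no}: value out of range")
        rows.append(row)
    return rows


example = io.StringIO(
    "SNP A1 A2 freq b se p N\n"
    "rs1001 A G 0.8493 0.0024 0.0055 0.6653 129850\n"
    "rs1002 C G 0.0306 0.0034 0.0115 0.7659 129799\n"
    "rs1003 A C 0.5128 0.0045 0.0038 0.2319 129830\n"
)
rows = read_cojo_sumstats(example)
print(len(rows), rows[0]["SNP"])
```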

## Software options

For more info refer to the [documentation](https://cnsgenomics.com/software/gcta/#COJO)

--cojo-slct : stepwise model selection procedure to select independently associated SNPs

--cojo-top-SNPs 10 : Perform a stepwise model selection procedure to select a fixed number of independently associated SNPs without a p-value threshold

--cojo-joint: Fit all the included SNPs to estimate their joint effects without model selection

--cojo-cond SNP_file : Perform association analysis of the included SNPs conditional on the given list of SNPs

--cojo-p 5e-8: Threshold p-value to declare a genome-wide significant hit

--cojo-wind 10000: Specify a distance d (in kb). SNPs more than d kb away from each other are assumed to be in complete linkage equilibrium

--cojo-collinear 0.9: During the model selection procedure, the program checks the collinearity between the SNPs that have already been selected and the SNP to be tested. The tested SNP will not be selected if its multiple regression R² on the selected SNPs is greater than the cutoff value
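
To make the collinearity check concrete, the sketch below computes the multiple regression R² of a test SNP on two already-selected SNPs from their pairwise LD correlations, using the standard two-predictor formula. The function names and correlation values are hypothetical, and GCTA's internal computation handles any number of selected SNPs.

```python
def multiple_r2_two_selected(r1, r2, r12):
    """R^2 from regressing a test SNP on two selected SNPs.

    r1, r2: LD correlations between the test SNP and each selected SNP.
    r12:    LD correlation between the two selected SNPs.
    """
    if abs(r12) >= 1:
        raise ValueError("selected SNPs are perfectly correlated")
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)


def passes_collinearity_check(r1, r2, r12, cutoff=0.9):
    """True if the test SNP may still be selected (R^2 not greater than the cutoff)."""
    return multiple_r2_two_selected(r1, r2, r12) <= cutoff


# Hypothetical case: neither pairwise r^2 (0.49) exceeds 0.9, but the multiple R^2 does
r2_high = multiple_r2_two_selected(0.7, 0.7, 0.0)
print(round(r2_high, 4), passes_collinearity_check(0.7, 0.7, 0.0))
print(passes_collinearity_check(0.3, 0.2, 0.1))
```

The first case shows why a multiple R² is used instead of pairwise LD: a SNP can be nearly a linear combination of the selected SNPs without being in strong LD with any single one of them.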

--diff-freq 0.2: Check the difference in allele frequency of each SNP between the GWAS summary data and the LD reference sample. SNPs with allele frequency differences greater than the specified threshold are excluded from the analysis
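
The frequency filter can be pictured with the sketch below. It assumes both frequencies refer to the same allele (A1) and simply skips SNPs absent from the reference; the helper name `filter_by_freq_diff` and the reference frequencies are invented for illustration.

```python
def filter_by_freq_diff(gwas_freq, ref_freq, threshold=0.2):
    """Split SNPs into kept/excluded by absolute allele frequency difference."""
    kept, excluded = [], []
    for snp, f_gwas in gwas_freq.items():
        if snp not in ref_freq:
            continue  # assumption of this sketch: SNPs missing from the reference are skipped
        if abs(f_gwas - ref_freq[snp]) > threshold:
            excluded.append(snp)
        else:
            kept.append(snp)
    return kept, excluded


gwas_freq = {"rs1001": 0.8493, "rs1002": 0.0306, "rs1003": 0.5128}
ref_freq = {"rs1001": 0.84, "rs1002": 0.30, "rs1003": 0.51}  # hypothetical reference values
kept, excluded = filter_by_freq_diff(gwas_freq, ref_freq)
print(kept, excluded)
```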

--cojo-gc : If this option is specified, p-values will be adjusted by the genomic control method. By default, the genomic inflation factor will be calculated from the summary-level statistics of all the SNPs unless you specify a value
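
The genomic control adjustment can be sketched as follows: the inflation factor is taken as the median association chi-square statistic divided by the median of a chi-square distribution with 1 degree of freedom (about 0.4549), and every statistic is divided by that factor before p-values are recomputed. The function name and input values are hypothetical, and details of GCTA's own implementation may differ.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def genomic_control(chi2_stats, lambda_gc=None):
    """Return (lambda_gc, adjusted p-values) for 1-d.f. chi-square statistics.

    If lambda_gc is None, it is estimated from the supplied statistics.
    """
    if lambda_gc is None:
        lambda_gc = median(chi2_stats) / CHI2_1DF_MEDIAN
    adjusted = [x / lambda_gc for x in chi2_stats]
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_values = [math.erfc(math.sqrt(x / 2)) for x in adjusted]
    return lambda_gc, p_values


# Hypothetical statistics, e.g. computed as (b / se) ** 2 for each SNP
chi2 = [0.5, 0.9098728, 3.0]
lam, p_adj = genomic_control(chi2)
print(round(lam, 4), [round(p, 4) for p in p_adj])
```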



## Illustration with minimal working examples

```
sos run ~/project/UKBB_GWAS_dev/workflow/GCTA-COJO gcta_slct \
    --cwd output \
    --bfile MWE/genotypes.bed \
    --sumstatsFile MWE/phenotypes_BMI.fastGWA.snp_stats.gz \
    --formatFile MWE/gcat-cojo_template \
    --numThreads 5 \
    --job-size 1 \
    --container_lmm lmm.sif
```

## Global parameter setting

[global]
# the output directory for generated files
parameter: cwd = path
# Path to summary stats file
parameter: sumstatsFile = path
# Genotype file in PLINK binary format; GCTA-COJO uses it as the LD reference sample
parameter: bfile = path
# Summary statistics format file path used for unifying input column names. Will not unify names if empty
parameter: formatFile = path('.')
# Specific number of threads to use
parameter: numThreads = 2
# Minimum MAF to be used
parameter: maf = 0.001
# Chromosome to be analyzed
parameter: chrom = 0
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Set to True if the p column of sumstatsFile holds -log10(p) instead of raw p-values
parameter: reverse_log_p = True
# The container with the lmm software. Can be either a dockerhub image or a singularity `sif` file.
parameter: container_lmm = 'statisticalgenetics/lmm:3.0'

[top_SNPs]
input: sumstatsFile
output: topSNPs = f'{cwd}/{sumstatsFile:bnn}.top_SNPs.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
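  library(dplyr)
  # Assumes the input has columns named SNP and p (raw p-values)
  sumstat = read.table(${_input:r}, sep = '\t', header=T)
  # Keep the 10 SNPs with the smallest p-values
  sumstat_top <- sumstat %>%
      top_n(-10, p) %>%
      select("SNP")
  # Write to the declared output path; do not append, so re-runs do not duplicate SNPs
  write.table(sumstat_top, ${_output:r}, quote = FALSE, row.names = FALSE, col.names = FALSE )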

# Create sumstats file
[gcta_slct_1, gcta_cond_1, gcta_joint_1]
input: sumstatsFile
output: f'{cwd}/{sumstatsFile:bnn}.gcta_cojo.snp_stats'
depends: formatFile
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: container=container_lmm, expand = "${ }",  stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    import pandas as pd

    # Read the summary statistics; compression is inferred from the file extension
    sumstats = pd.read_csv(${_input:r}, compression='infer', header=0, sep=r'\s+', quotechar='"')
    # Unify column names only when a format file is provided
    if ${formatFile.is_file()}:
        import yaml
        with open(${formatFile:r}, 'r') as f:
            config = yaml.safe_load(f)
        try:
            sumstats = sumstats.loc[:, list(config.values())]
        except KeyError:
            raise ValueError(f'According to ${formatFile}, input summary statistics should have the following columns: {list(config.values())}.')
        sumstats.columns = list(config.keys())
    # Convert -log10(p) back to raw p-values
    if ${reverse_log_p}:
        sumstats['p'] = sumstats['p'].apply(lambda row: 10**-row)
    # Always write the output so downstream steps find the expected file
    sumstats.to_csv(${_output:r}, sep='\t', header=True, index=False)
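
To illustrate what the format file is expected to contain, the sketch below mimics the renaming on a single row without pandas. As in the step above, keys are the unified COJO column names and values are the column names in the input file. The mapping shown (fastGWA-style names such as `AF1`, `BETA`, `SE`, `P`) and the helper `unify_row` are assumptions for this example, not the contents of the actual template.

```python
# Hypothetical content of the format file after YAML parsing:
# keys are the unified names, values are the column names in the input file
config = {"SNP": "SNP", "A1": "A1", "A2": "A2", "freq": "AF1",
          "b": "BETA", "se": "SE", "p": "P", "N": "N"}


def unify_row(row, config, reverse_log_p=False):
    """Select and rename the columns of one row according to config."""
    missing = [col for col in config.values() if col not in row]
    if missing:
        raise ValueError(f"missing input columns: {missing}")
    unified = {new: row[old] for new, old in config.items()}
    if reverse_log_p:
        # The input column holds -log10(p); convert back to a raw p-value
        unified["p"] = 10 ** -float(unified["p"])
    return unified


# Made-up input row in which P is stored as -log10(p)
row = {"CHR": "1", "SNP": "rs1001", "POS": "1000", "A1": "A", "A2": "G",
       "N": "129850", "AF1": "0.8493", "BETA": "0.0024", "SE": "0.0055", "P": "8.0"}
unified = unify_row(row, config, reverse_log_p=True)
print(list(unified), unified["p"])
```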

# Perform a stepwise model selection procedure to select independently associated SNPs. Results will be saved in a *.jma file with additional file *.jma.ldr showing the LD correlations between the SNPs.
# If cojo_top_SNPs > 0, the --cojo-top-SNPs option is also passed to select a fixed number of independently associated SNPs without a p-value threshold.
[gcta_slct_2]
# If you want a fixed number of SNPs to be selected
parameter: cojo_top_SNPs = 0
input: bfile, group_by=1
output: f'{cwd}/{sumstatsFile:bnn}.cojo_slct.jma.cojo'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    gcta64 \
    --bfile ${_input:n} \
    --autosome \
    --maf ${maf} \
    --cojo-file ${cwd}/${sumstatsFile:bnn}.gcta_cojo.snp_stats \
    --cojo-slct \
    --cojo-p 5e-8 \
    --cojo-wind 10000 \
    --cojo-collinear 0.9 \
    --cojo-gc \
    ${('--chr %s' % chrom) if chrom > 0 else ''} \
    ${('--cojo-top-SNPs %s' % cojo_top_SNPs) if cojo_top_SNPs > 0 else ''} \
    --out ${_output:nn}

# Perform association analysis of the included SNPs conditional on the given list of SNPs without model selection. 
# Results will be saved in a *.cma. The conditional SNP effects (i.e. bC) will be labelled as "NA" if the multivariate correlation between the SNP in question and all the covariate SNPs is > 0.9
[gcta_cond_2]
# File listing the SNPs on which to condition the analysis
parameter: snp_list = path
input: bfile, group_by = 1
output: f'{cwd}/{sumstatsFile:bnn}.cojo_cond.cma.cojo'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    gcta64 \
    --bfile ${_input:n} \
    --autosome \
    --cojo-file ${cwd}/${sumstatsFile:bnn}.gcta_cojo.snp_stats \
    --cojo-cond ${snp_list} \
    --maf ${maf} \
    --thread-num ${numThreads} \
    --out ${_output:nn}

# Estimate the joint effects of a subset of SNPs (given in the file passed as snp_list) without model selection
[gcta_joint_2]
# File listing the SNPs whose joint effects are to be estimated
parameter: snp_list = path
input: bfile, group_by = 1
output:  f'{cwd}/{sumstatsFile:bnn}.cojo_joint.jma.cojo'
task: trunk_workers = 1, trunk_size = job_size, walltime = '48h', mem = '48G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container_lmm, expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    gcta64 \
    --bfile ${_input:n} \
    --autosome \
    --extract ${snp_list} \
    --cojo-file ${cwd}/${sumstatsFile:bnn}.gcta_cojo.snp_stats \
    --cojo-joint \
    --out ${_output:nn}