# Perform LD-clumping in PLINK v1.9

In this procedure, only the most significant SNP (i.e. the one with the lowest p-value) in each LD block is identified and used for further analysis. This reduces the correlation between the remaining SNPs while retaining those with the strongest statistical evidence. The output is a subset of approximately independent SNPs in the dataset.
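
The greedy idea behind clumping can be sketched in a few lines of Python. This is a simplified illustration for a single chromosome, not PLINK's actual implementation: the function name `clump`, the toy SNP IDs, positions, p-values and pairwise r2 values are all made up for the example, and the thresholds simply mirror the settings chosen in this notebook.

```python
from typing import Dict, FrozenSet, List, Tuple

# Each SNP is (id, base-pair position, p-value); all values below are made up.
Snp = Tuple[str, int, float]


def clump(snps: List[Snp], r2: Dict[FrozenSet[str], float],
          p1: float = 5e-8, p2: float = 1.0,
          r2_min: float = 0.3, kb: int = 2000) -> Dict[str, List[str]]:
    """Greedy clumping on one chromosome: returns {index SNP: [clumped SNPs]}."""
    assigned = set()
    clumps: Dict[str, List[str]] = {}
    # Visit SNPs from most to least significant
    for snp_id, pos, p in sorted(snps, key=lambda s: s[2]):
        if snp_id in assigned or p > p1:
            continue  # already clumped, or not significant enough to be an index SNP
        assigned.add(snp_id)
        members = []
        for other_id, other_pos, other_p in snps:
            if other_id in assigned:
                continue
            close = abs(other_pos - pos) < kb * 1000
            linked = r2.get(frozenset((snp_id, other_id)), 0.0) > r2_min
            if close and linked and other_p <= p2:
                members.append(other_id)
                assigned.add(other_id)
        clumps[snp_id] = members
    return clumps


example_snps = [
    ("rs1", 1_000_000, 1e-12),
    ("rs2", 1_050_000, 1e-9),
    ("rs3", 1_100_000, 0.2),
    ("rs4", 5_000_000, 3e-8),
]
# Pairs that are not listed are treated as r2 = 0
example_r2 = {frozenset(("rs1", "rs2")): 0.8, frozenset(("rs1", "rs3")): 0.1}

result = clump(example_snps, example_r2)
print(result)  # {'rs1': ['rs2'], 'rs4': []}
```

In this toy example `rs2` is absorbed into the clump led by `rs1`, `rs4` becomes its own index SNP because it lies outside the window, and `rs3` is neither significant enough to lead a clump nor in LD with any index SNP.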

**Settings for clumping analysis:**

1. Which reference dataset to use? Options are 1000G_CEU, hapmap_CEU_r23a_filtered, UK10K, and the HRC reference panel.
    
    * FIXME: SNPs that are not in the reference panel won't be output as index SNPs
    * FIX: use the bgen files with the genotype info, from which we will extract 1000 individuals
    
    
2. What significance threshold should we use for the index variants (p1)?
    
    p=5e-08
    
3. What significance threshold to use for the SNPs to be clumped (p2)?
   
   p=1 (this will include all the SNPs)
   
4. What LD r2 to use? 
   
   r2=0.3 or even lower to capture bigger LD blocks
   
5. What window size in kb to use? (This depends on the average extent of LD in the human genome for the CEU population.)
   
   I initially chose 1 Mb (1000 kb), but after discussion with the team we settled on 2 Mb (2000 kb).

Below are the relevant PLINK options. The four numeric thresholds are shown with PLINK's default values:

```
--clump-p1 0.0001: significance threshold for index SNPs
--clump-p2 0.01: secondary significance threshold for clumped SNPs
--clump-r2 0.50: LD threshold for clumping
--clump-kb 250: physical distance threshold for clumping
--clump-field P_BOLT_LMM: name of the field holding the p-value (PLINK's default is P)
--clump-verbose: add a more detailed report of the SNPs in each clump
--clump-best: select the single best proxy
```
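
To make the mapping between the chosen settings and the PLINK flags explicit, here is a minimal Python sketch that assembles the clumping call as an argument list. The helper `build_clump_command` and the three file names are hypothetical and only serve the illustration; the flags and the values (p1=5e-08, p2=1, r2=0.3, 2000 kb) are the ones discussed above.

```python
import shlex
from typing import List


def build_clump_command(bfile: str, sum_stats: str, out: str,
                        field: str = "P_BOLT_LMM", p1: float = 5e-8,
                        p2: float = 1, r2: float = 0.3, kb: int = 2000) -> List[str]:
    """Assemble a PLINK 1.9 clumping call as an argument list."""
    return [
        "plink",
        "--bfile", bfile,          # prefix of the .bed/.bim/.fam reference fileset
        "--clump", sum_stats,      # summary statistics file
        "--clump-field", field,    # column holding the p-value
        "--clump-p1", str(p1),
        "--clump-p2", str(p2),
        "--clump-r2", str(r2),
        "--clump-kb", str(kb),
        "--out", out,
    ]


# Hypothetical file names, for illustration only
cmd = build_clump_command("ukb_1000samples", "bmi_sumstats.gz", "bmi_clumps")
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)
```

The list form can be handed to `subprocess.run(cmd, check=True)` once PLINK is on the `PATH`; the joined string is only meant for logging or for pasting into a shell.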

## Running this workflow

On the Yale Farnam cluster:

```
sos run ~/project/pleiotropy_UKB/workflow/LDclumping.ipynb clumping -c ~/project/pleiotropy_UKB/farnam.yml -q farnam -J 40 -s build &> sos-submission-LDclumping-date.log
```

On a local computer with, for example, 8 threads, running the `filter_samples` workflow:

```
sos run ~/project/pleiotropy_UKB/workflow/LDclumping.ipynb filter_samples -q none -j 8
```

In [None]:
[global]
# Working directory: change accordingly
parameter: cwd = path('~/scratch60/plink-clumping')
# Genotype file in plink binary format
parameter: genoFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed')
# Path to bgen files
parameter: bgenFile = paths([f'/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{x+1}_v3.bgen' for x in range(22)])
# Path to sample file
parameter: sampleFile = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample')
# Path to summary stats file
parameter: sum_stats= path('/home/dc2325/project/results/pleiotropy/2020-04_bolt/INT-BMI/ukb_imp_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.BoltLMM.snp_stats.all_chr.gz')
# Path to samples of unrelated individuals
parameter: unrelated_samples = path('/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620')
# Clumping parameters
parameter: clump_field = 'P_BOLT_LMM'
parameter: clump_p1 = 5e-08
parameter: clump_p2 = 1
parameter: clump_r2 = 0.3
parameter: clump_kb = 2000
parameter: clump_annotate = 'BP'
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Specific number of threads to use
parameter: numThreads = 20

## Select 1000 random and unrelated samples from UKB

The sample list should be a text file containing a whitespace-separated list of sample identifiers.
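
As an illustration of the sampling step, the following Python sketch draws exactly 1000 distinct IDs and joins them into a whitespace-separated list. It assumes that the first line of the input is a header and that the first whitespace-delimited column holds the sample ID; the helper `draw_samples`, the fixed seed and the toy input are assumptions made for the example, not part of the workflow.

```python
import random
from typing import List


def draw_samples(lines: List[str], n: int = 1000, seed: int = 1) -> List[str]:
    """Return n distinct IDs from the first column, skipping the header line."""
    ids = [line.split()[0] for line in lines[1:] if line.strip()]
    # random.Random(seed) keeps the draw reproducible; sample() never repeats an ID
    return random.Random(seed).sample(ids, n)


# Toy input standing in for the phenotype file: a header plus 5000 made-up IDs
toy_lines = ["FID IID"] + [f"{i} {i}" for i in range(1, 5001)]
chosen = draw_samples(toy_lines)
sample_list = " ".join(chosen)  # whitespace-separated list of identifiers
print(len(chosen), len(set(chosen)))  # 1000 1000
```

Sampling a fixed number of IDs, rather than keeping each ID with a fixed probability, guarantees that the reference set has the intended size.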

In [None]:
# Create a white-space delimited file with a list of 1000 unrelated samples
[filter_samples]
input: unrelated_samples
output: f'{cwd}/{_input:bn}.1000unrelatedindiv' , f'{cwd}/{_input:bn}.1000unrelatedindiv.sample'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '8G', tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # Drop the header line, draw exactly 1000 IDs at random, and write them as one space-separated line
    awk 'NR>1 && !/^$/ {print $1}' ${_input} | shuf -n 1000 | tr '\n' ' ' > ${_output[0]}
    # Write the same IDs one per line under an "ID" header
    awk 'BEGIN {print "ID"} {for (i = 1; i <= NF; i++) print $i}' ${_output[0]} > ${_output[1]}
    

## Filter the BGEN files 

This step uses the samples selected in the previous step.

In [None]:
# Select the 1000 samples from the BGEN files
# Remember to load QCTOOL/2.0-foss-2016b-rc7-CentOS6.8
# Script called qctool-ldclump.sh
[clumping_1]
input: bgenFile, group_by=1, group_with = dict(info=[(sampleFile, output_from('filter_samples')[0])] * len(bgenFile))
output: f'{cwd}/{_input:bn}.filtered.bgen'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h',  mem = '60G', cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # -os writes the .sample file for the retained individuals next to the filtered BGEN
    qctool \
    -g ${_input} \
    -s ${_input.info[0]} \
    -og ${_output} \
    -os ${_output:n}.sample \
    -incl-samples ${_input.info[1]}

## Make the binary files with the selected samples

In [None]:
# This step is done with PLINKv2.0
# Remember to load PLINK/2_x86_64_20180428
[clumping_2]
output: f'{cwd}/{_input:bn}.1000samples.bed'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '20G', tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    # --sample takes the .sample file written by qctool in clumping_1, not the BGEN itself
    plink2 \
    --bgen ${_input} \
    --sample ${_input:n}.sample \
    --make-bed \
    --out ${_output:bn}
    

In [None]:
# Perform LD-clumping in PLINKv1.9
# Remember to load PLINK/1.90-beta5.3
[clumping_3]
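# PLINK binary fileset (.bed) used as the LD reference; bfile is not defined in [global],
# so it is declared here as a required parameter (pass it with --bfile)
parameter: bfile = path
input: bfile, sum_stats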
output: f'{cwd}/{_input[1]:bn}.clumped'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '20G', tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", workdir = cwd, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
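    # With expand = "${ }", variables must be written as ${...} to be interpolated;
    # --bfile takes the fileset prefix, hence the :n modifier to drop the .bed extension
    plink \
    --bfile ${_input[0]:n} \
    --clump ${_input[1]} \
    --clump-field ${clump_field} \
    --clump-p1 ${clump_p1} \
    --clump-p2 ${clump_p2} \
    --clump-r2 ${clump_r2} \
    --clump-kb ${clump_kb} \
    --clump-verbose \
    --clump-annotate ${clump_annotate} \
    --out ${_output:bn}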