## Data analysis
Manhattan plot reference: https://www.r-graph-gallery.com/101_Manhattan_plot.html

## Running sos notebook

On the Yale Farnam cluster,


On a local computer with, for example, 8 threads:

```
sos run BOLT-LMM_R_analysis_UKB.ipynb -q none -j 8
```


In [91]:
[global]
# Working directory
parameter: cwd = path('~/results/pleiotropy/2020-04_bolt/INT-BMI/')
# Pattern matching the individual (per-chromosome) summary stats files
parameter: input_pattern = '*.snp_stats.gz'
import glob
input_file = sorted(glob.glob(input_pattern))
fail_if(len(input_file) == 0, msg = f"Input pattern ``{input_pattern}`` failed to match any files.")
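# Output merged summary stats file
parameter: summary_stats_file = 'INT-BMI.txt'
# Number of jobs batched per task; step_2 refers to it as `trunk_size = job_size`
parameter: job_size = 1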

## Merging summary stats for all chromosomes

In [92]:
[step_1]
input: input_file
output: f'{cwd}/{summary_stats_file}.gz'
python: expand=True
    import gzip
    # Concatenate the per-chromosome files, keeping the header line of the first file only
    with gzip.open('{_output}', 'wt') as outfile:
        for i, filename in enumerate([{_input:r,}]):
            with gzip.open(filename, 'rt') as f:
                for line in f:
                    # Header lines start with 'SNP'; skip them in all but the first file
                    if i > 0 and line.startswith('SNP'):
                        continue
                    outfile.write(line)
    # Report the number of lines in the merged file
    with gzip.open('{_output}', 'rt') as f:
        print(sum(1 for _ in f))

## Creating the Manhattan and Q-Q plots

In [89]:
[step_2]
input: output_from('step_1')
output: f'{cwd}/{_input:bnn}.manhattan.png', f'{cwd}/{_input:bnn}.qqplot.png'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '60G', tags = f'{step_name}_{_output:bn}'
R: expand=True
    # Install qqman on first use, then load it
    if (!require('qqman')) install.packages('qqman'); library('qqman')
    INTBMI_data <- read.table(gzfile('{_input}'), header=TRUE)
    # Manhattan plot: qqman takes the suggestive and genome-wide lines as -log10(p) heights, and chrlabs as a character vector
    png('{_output[0]}', width = 6, height = 4, units='in', res=300)
    bmi_manhattan <- manhattan(INTBMI_data, chr='CHR', bp='BP', snp='SNP', p='P_BOLT_LMM', main = "INT-BMI Manhattan Plot", ylim = c(0, 250), cex = 0.6,
    cex.axis = 0.9, col = c("blue4", "orange3"), suggestiveline = -log10(1e-5), genomewideline = -log10(5e-8), chrlabs = as.character(1:22))
    dev.off()
    # Q-Q plot of the BOLT-LMM p-values
    png('{_output[1]}', width = 6, height = 4, units='in', res=300)
    qq_plot <- qq(INTBMI_data$P_BOLT_LMM, main = "INT-BMI Q-Q plot of BOLT-LMM p-values", xlim = c(0, 8), ylim = c(0, 300), pch = 18, col = "blue4", cex = 1.5, las = 1)
    dev.off()

Use awk to find the minimum and maximum values in a specific column (here column 3)

In [None]:
#awk 'NR==1 {Min=$3; Max=$3} NR>=2 {if (Min>$3) Min=$3; if (Max<$3) Max=$3} END {printf "The Min is %d, Max is %d\n", Min, Max}' ukb_mfi_chr1_v3.txt
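
The same min/max scan can be sketched in Python. This is a minimal illustration, not part of the original workflow: the function name `column_min_max` and the toy rows are made up here, and the column is assumed to be whitespace-separated and numeric.

```python
import math

def column_min_max(lines, column=3, sep=None):
    """Return (min, max) of a 1-based column over an iterable of text lines."""
    lo, hi = math.inf, -math.inf
    for line in lines:
        fields = line.split(sep)
        if len(fields) < column:
            continue  # skip blank or short lines
        try:
            value = float(fields[column - 1])
        except ValueError:
            continue  # skip non-numeric values such as a header
        lo, hi = min(lo, value), max(hi, value)
    return lo, hi

# Toy rows, made up for illustration only
rows = ["1:100_A_G rs1 100 A G", "1:250_C_T rs2 250 C T", "1:50_G_A rs3 50 G A"]
print(column_min_max(rows))
```

To scan a real file, pass an open file handle instead of the toy list, for example `column_min_max(open('ukb_mfi_chr1_v3.txt'))`.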

Show the five lowest p-values among the genome-wide significant SNPs (p <= 5e-8)

In [None]:
#gzcat Test_INT-BMI.txt.gz | awk -F "\t" '{ if ($16 <= 5e-8) { print } }' | sort -t $'\t' -g -k16,16 | head -n 5
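
A rough Python equivalent is sketched below, selecting the p-value column by name rather than by position. The helper name `lowest_pvalues` and the toy rows are hypothetical; the sketch assumes a tab-separated file with a header containing `SNP` and `P_BOLT_LMM`.

```python
import csv
import heapq

def lowest_pvalues(lines, p_column="P_BOLT_LMM", threshold=5e-8, n=5):
    """Return up to n (p-value, SNP) pairs with the smallest p-values at or below threshold."""
    reader = csv.DictReader(lines, delimiter="\t")
    hits = (
        (float(row[p_column]), row["SNP"])
        for row in reader
        if float(row[p_column]) <= threshold
    )
    return heapq.nsmallest(n, hits)

# Toy summary statistics, made up for illustration only
toy = [
    "SNP\tCHR\tBP\tP_BOLT_LMM",
    "rs1\t1\t100\t3e-9",
    "rs2\t1\t200\t0.2",
    "rs3\t2\t300\t1e-12",
    "rs4\t2\t400\t4.9E-08",
]
for p, snp in lowest_pvalues(toy):
    print(snp, p)
```

For a gzipped file, `gzip.open(path, 'rt')` can be passed in place of the toy list. `heapq.nsmallest` keeps only `n` rows in memory, so the whole file never needs to be loaded.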

Find the variants in dataset 1 that also appear in dataset 2 (matched on the first column)

In [None]:
awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' dataset2 dataset1 > dataset1_subset_overlapping_dataset2
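
The awk one-liner loads the first column of dataset 2 into memory and then prints every row of dataset 1 whose first column was seen. A minimal Python sketch of the same logic follows; `overlapping_rows` and the toy rows are illustrative names and data, not part of the original analysis.

```python
def overlapping_rows(dataset1_lines, dataset2_lines):
    """Yield rows of dataset 1 whose first field also occurs in dataset 2."""
    # Collect the identifiers (first column) of dataset 2
    ids_in_dataset2 = {line.split()[0] for line in dataset2_lines if line.strip()}
    # Stream dataset 1 and keep only rows with a matching identifier
    for line in dataset1_lines:
        fields = line.split()
        if fields and fields[0] in ids_in_dataset2:
            yield line

# Toy inputs, made up for illustration only
dataset1 = ["rs1 0.01", "rs2 0.50", "rs3 0.20"]
dataset2 = ["rs3 A G", "rs1 C T", "rs9 G A"]
print(list(overlapping_rows(dataset1, dataset2)))
```

Only the identifiers of dataset 2 are held in memory, so it is cheaper to pass the smaller file as `dataset2_lines` when either order of roles is acceptable.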

## PLINK v1.9 LD-clumping
This section performs p-value-informed LD clumping in PLINK.
In this procedure, only the most significant SNP (i.e. the one with the lowest p-value) in each LD block is kept as the index SNP and used for further analysis. This reduces the correlation among the remaining SNPs while retaining the SNPs with the strongest statistical evidence.
The goal is to identify a subset of independent SNPs in the dataset.

**Questions:**

1. Which reference dataset to use? Options are 1000G_CEU, hapmap_CEU_r23a_filtered, UK10K, and the HRC reference panel.
FIXME: SNPs that are not in the reference panel won't be output as index SNPs.
2. What significance threshold should we use for the index variants (p1)? p=5e-08
3. What significance threshold to use for the SNPs to be clumped (p2)? p=1 (this will include all the SNPs)
4. What LD r2 to use? r2=0.5
5. What window size in kb to use (to do: look up the average extent of LD in the human genome for the CEU population)? I decided to use 1 Mb (1000 kb).

Below are PLINK's default values for the clumping thresholds, together with the other clumping options used here:

```
--clump-p1 0.0001: significance threshold for index SNPs
--clump-p2 0.01: secondary significance threshold for clumped SNPs
--clump-r2 0.50: LD threshold for clumping
--clump-kb 250: physical distance threshold for clumping
--clump-field P_BOLT_LMM: name of the field holding the p-value (PLINK's default is P)
--clump-verbose: add a more detailed report of the SNPs in each clump
--clump-best: select the single best proxy
```
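
To make the roles of the four thresholds concrete, here is a toy sketch of greedy p-value-informed clumping in Python. It is a simplified illustration and not PLINK's implementation: the function `clump`, the input layout, and the toy SNPs and LD values are all assumptions made for this example, pairwise r2 values are supplied by hand rather than computed from genotypes, and the behaviour of `--clump-allow-overlap` is not modelled.

```python
def clump(snps, r2, p1=1e-4, p2=0.01, r2_min=0.5, kb=250):
    """Greedy p-value-informed clumping on toy data.

    snps: dict of SNP id -> (chromosome, position in bp, p-value)
    r2:   dict of frozenset({id_a, id_b}) -> LD r2 (missing pairs count as 0)
    Returns a dict of index SNP -> list of SNPs clumped with it.
    """
    clumps, assigned = {}, set()
    # Visit SNPs from the most to the least significant
    for index in sorted(snps, key=lambda s: snps[s][2]):
        chrom, pos, p = snps[index]
        if p > p1:
            break  # all remaining SNPs are less significant than p1
        if index in assigned:
            continue  # already absorbed into an earlier clump
        assigned.add(index)
        members = []
        for other, (o_chrom, o_pos, o_p) in snps.items():
            if other in assigned or o_chrom != chrom or o_p > p2:
                continue
            close = abs(o_pos - pos) <= kb * 1000
            linked = r2.get(frozenset((index, other)), 0.0) >= r2_min
            if close and linked:
                members.append(other)
                assigned.add(other)
        clumps[index] = members
    return clumps

# Toy SNPs and LD values, made up for illustration only
snps = {
    "rsA": (1, 1000, 1e-20),
    "rsB": (1, 5000, 1e-10),
    "rsC": (1, 900000, 1e-9),
    "rsD": (2, 1000, 0.5),
}
r2 = {frozenset(("rsA", "rsB")): 0.8, frozenset(("rsA", "rsC")): 0.1}
print(clump(snps, r2, p1=5e-8, p2=1, r2_min=0.5, kb=1000))
```

With the thresholds chosen above (p1=5e-08, p2=1, r2=0.5, 1000 kb), `rsB` is absorbed into the clump of `rsA`, while `rsC` lies within the window but is only weakly linked and so becomes its own index SNP.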

In [None]:
plink \
--noweb \
--bfile ~/pleiotropy_UKB/data/1000G_20101123_v3_GIANT_chr1_23_minimacnamesifnotRS_CEU_MAF0.01 \
--clump ~/pleiotropy_UKB/data/Test_INT-BMI.txt.gz \
--clump-field P_BOLT_LMM \
--clump-p1 5e-08 \
--clump-p2 1 \
--clump-r2 0.5 \
--clump-kb 1000 \
--clump-allow-overlap \
--clump-best \
--out Test_INT-BMI.txt

In [None]:
plink \
--noweb \
--bfile /SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv \
--clump /home/dc2325/results/pleiotropy/2020-04_bolt/INT-BMI/Test_INT-BMI.txt.gz \
--clump-field P_BOLT_LMM \
--clump-p1 5e-08 \
--clump-p2 1 \
--clump-r2 0.5 \
--clump-kb 1000 \
--clump-allow-overlap \
--clump-best \
--out Test_INT-BMI.txt

The `.clumped` output file contains the index SNP of each clump in column 3. Extract this column to create a vector of SNPs to highlight in the Manhattan plot:

In [None]:
awk 'NR > 1 && NF { print $3 }' \
Test_INT-BMI.txt.clumped \
> dataset1_subset_overlapping_dataset2_clump1.SNPs
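
The same extraction can be sketched in Python, locating the `SNP` column by its header name so the header and blank lines are never mistaken for SNP ids. The helper `index_snps` and the toy report are hypothetical, and the sketch assumes the standard (non-verbose) `.clumped` layout.

```python
def index_snps(clumped_lines):
    """Return the SNP column of a PLINK .clumped report, skipping header and blank lines."""
    snps = []
    snp_col = None
    for line in clumped_lines:
        fields = line.split()
        if not fields:
            continue  # .clumped reports contain blank lines
        if snp_col is None:
            snp_col = fields.index("SNP")  # first non-blank line is the header
            continue
        snps.append(fields[snp_col])
    return snps

# Toy report with made-up values, for illustration only
toy_report = [
    " CHR    F      SNP     BP        P    TOTAL   NSIG    S05    S01   S001  S0001    SP2",
    "   1    1    rs123   1000    1e-20        1      0      0      0      0      1 rs456(1)",
    "   2    1    rs789   5000    3e-09        0      0      0      0      0      0 NONE",
    "",
]
print(index_snps(toy_report))
```

The resulting list can be written one id per line and read into R as the `highlight` vector for the Manhattan plot.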

Filtering `.bgen` files using qctool2

```
qctool -g ukb_imp_chr1_v3.bgen -og subset_ukb_imp_chr1_v3.bgen -s ukb32285_imputedindiv.sample -incl-range 1:10177-20000 -incl-samples samples.txt
```

The per-chromosome files can also be combined in R: create a list of the files (in this case the summary statistics from the association analysis), read each one, and bind the files for chr{1:22} into a single dataset. The drawback is that R runs out of memory when doing this.

```
#install.packages("readr")
#install.packages("plyr")
#install.packages("ggplot2")
#library(readr)
#library(plyr)
#library(ggplot2)
#file_list = Sys.glob("*.snp_stats.bgen.gz") #one way of creating lists
#file_list = list.files(path=mydir, pattern="*.snp_stats.bgen.gz", full.names=TRUE) #another way of creating lists
#data.list = lapply(file_list, function(x){read.table(file = x,header = TRUE, sep = "\t")})
#data.merged = do.call("rbind", data.list)
```