## Data analysis


Pipeline to process the output files from the association analyses and to analyze the resulting summary statistics.

## Running sos notebook

On the Yale Farnam cluster,


On a local computer with, for example, 8 threads:

```
sos run ~/project/pleiotropy_UKB/postprocessing.ipynb -q none -j 8
```

## Input files

### BOLT-LMM

The input files for this step are those created by the `BoltLMM.ipynb` notebook. For example, for INT-BMI:

* ukb_imp_chr{1:22}_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.boltlmm.snp_stats.gz (these are the association results for the bgen genotypes)
* ukb_imp_chr{1:22}_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.boltlmm.stats.gz (note that the 22 files generated in this step all contain the same information: the association results for the hard-called PLINK genotypes)
* ukb_imp_chr{1:22}_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.boltlmm.snp_stats.stderr
* ukb_imp_chr{1:22}_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.boltlmm.snp_stats.stdout

### FastGWA

The input files for this step are those created by the `fastGWA.ipynb` notebook:

* bgenFile_chr{1:22}.TraitName.fastGWA (these are the association results for the bgen genotypes)
* bgenFile_chr{1:22}.TraitName.log

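Before merging, it can help to confirm that all 22 per-chromosome files are present. The following is a minimal sketch, not part of the pipeline: the function name `missing_chromosome_files` and the `{chrom}` placeholder in the template are illustrative assumptions, and the file name simply mirrors the fastGWA pattern listed above.

```python
from pathlib import Path


def missing_chromosome_files(directory, template, chromosomes=range(1, 23)):
    """Return the expected per-chromosome file names that are absent from `directory`.

    `template` is a file name containing a `{chrom}` placeholder (hypothetical convention).
    """
    directory = Path(directory)
    expected = [template.format(chrom=c) for c in chromosomes]
    return [name for name in expected if not (directory / name).exists()]


# Hypothetical usage: the directory and trait name are placeholders
# missing = missing_chromosome_files('.', 'bgenFile_chr{chrom}.TraitName.fastGWA')
# print(f'{len(missing)} file(s) missing')
```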

In [91]:
[global]
# Working directory: change accordingly
parameter: cwd = path('~/scratch60/')
# Association files patterns
parameter: pattern1 = ['fastGWA', 'boltlmm.snp_stats.gz','boltlmm.stats.gz']
# Patterns for the log, stderr and stdout files
parameter: pattern2 = ['fastGWA.log', 'boltlmm.snp_stats.stderr', 'boltlmm.snp_stats.stdout']
# Association files for imputed data
import glob
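# Assumption: the association files are located in `cwd` and end with the first pattern
snp_stats_file = sorted(glob.glob(f'{cwd}/*{pattern1[0]}'))
fail_if(len(snp_stats_file) == 0, msg = f"Input pattern ``*{pattern1[0]}`` failed to match any files in ``{cwd}``.")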
# Output name merged summary stats file for bgen files
parameter: bgen_sumstats = 'ukb_imp_v3.Asthma_casesbyICD10codesANDselfreport_controlsbyselfreportandicd10_noautoimmuneincontrols.fastGWA.snp_stats.all_chr'
# Input plink genotypes summary stats - this only applies to Bolt-LMM
parameter: stats_file = path('~/scratch60/')
# Output plink genotypes summary stats - this only applies for Bolt-LMM
parameter: plink_sumstats = path('~/scratch60/')
# Title of the manhattan plot
parameter: manplot= 'Asthma Manhattan Plot fastGWA'
# Title of the q-q plot
parameter: qq_plot= 'Asthma Q-Q plot fastGWA'
# P-value from summary stats
parameter: pval='P'

### Step 1: Merge the summary statistics generated during the association analysis of the imputed variants

In [92]:
# Merge all the bgen summary statistics
[merge_imputed]
input: snp_stats_file
output: f'{cwd}/{bgen_sumstats}.gz'
python: expand=True
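    import gzip
    # Copy the first file in full (header included), then append the
    # remaining files without their header line
    with gzip.open({_output:r}, 'wt') as outfile:
        with gzip.open({_input[0]:r}, 'rt') as f:
            for line in f:
                outfile.write(line)
        for file in [{_input:r,}][1:]:
            with gzip.open(file, 'rt') as f:
                # Skip the repeated header of each per-chromosome file
                next(f, None)
                for line in f:
                    outfile.write(line)
    # Report the number of lines in the merged file (header included)
    with gzip.open({_output:r}, 'rt') as f:
        print(sum(1 for _ in f))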

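Outside of SoS, the merge logic of this step can be sketched as a plain Python function. This is an illustration under stated assumptions rather than the notebook's own code: the name `merge_gzipped_tables` is hypothetical, and every input is assumed to be a gzipped text table whose first line is a header identical across files.

```python
import gzip


def merge_gzipped_tables(input_paths, output_path):
    """Concatenate gzipped text tables, keeping only the first file's header.

    Returns the number of lines written, header included.
    """
    n_lines = 0
    with gzip.open(output_path, 'wt') as out:
        for index, path in enumerate(input_paths):
            with gzip.open(path, 'rt') as handle:
                if index > 0:
                    next(handle, None)  # skip the repeated header line
                for line in handle:
                    out.write(line)
                    n_lines += 1
    return n_lines
```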
### Step 4: Creating the Manhattan and Q-Q plots

The simplest way to create a Manhattan plot is to use the `qqman` package in R: https://www.r-graph-gallery.com/101_Manhattan_plot.html

In [89]:
[plot_1]
input: output_from('merge_imputed')
output: f'{cwd}/{_input:bnn}.manhattan.png', f'{cwd}/{_input:bnn}.qqplot.png'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '60G', tags = f'{step_name}_{_output[0]:bn}'
R: expand=True
    if (!require('qqman')) install.packages('qqman'); library('qqman')
    data <- read.table(gzfile('{_input}'), header=TRUE)
    # Create the Manhattan plot; the p-value column name comes from the `pval` parameter
    png('{_output[0]}', width = 6, height = 4, units = 'in', res = 300)
    # The suggestive and genome-wide lines are given as -log10(p) heights (the qqman defaults)
    manhattan(data, chr = 'CHR', bp = 'BP', snp = 'SNP', p = '{pval}', main = '{manplot}', ylim = c(0, 250), cex = 0.6,
              cex.axis = 0.9, col = c("blue4", "orange3"), suggestiveline = -log10(1e-5), genomewideline = -log10(5e-8),
              chrlabs = as.character(1:22))
    dev.off()
    # Create the Q-Q plot from the same p-value column
    png('{_output[1]}', width = 6, height = 4, units = 'in', res = 300)
    qq(data[['{pval}']], main = '{qq_plot}', xlim = c(0, 8), ylim = c(0, 300), pch = 18, col = "blue4", cex = 1.5, las = 1)
    dev.off()

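To make explicit what a Q-Q plot displays, the sketch below computes the (expected, observed) pairs of -log10 p-values. It assumes plotting positions of (rank - 0.5) / n for the expected quantiles, which is one common convention and is not necessarily identical to what `qqman` uses for every sample size; the function name `qq_points` is illustrative.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, most significant first."""
    # Observed values, largest -log10(p) (smallest p-value) first
    observed = sorted((-math.log10(p) for p in pvalues if 0 < p <= 1), reverse=True)
    n = len(observed)
    # Expected values under the null: uniform quantiles at (rank - 0.5) / n
    expected = [-math.log10((rank - 0.5) / n) for rank in range(1, n + 1)]
    return list(zip(expected, observed))
```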
### Step 5: More flexible Manhattan plots with ggplot2

In [None]:
# For more flexibility when creating manhattan plots and adding highlighted SNPs
[plot_2]
input: output_from('merge_imputed')
output: f'{cwd}/{_input:bnn}.manhattan.png'
task: trunk_workers = 1, trunk_size = job_size, walltime = '12h', mem = '60G', tags = f'{step_name}_{_output:bn}'
R: expand="${ }"
    if (!require('tidyverse')) install.packages('tidyverse'); library('tidyverse')
    if (!require('ggrepel')) install.packages('ggrepel'); library('ggrepel')
    # Load the data
    data <- read.table(gzfile('${_input}'), header=TRUE)
    # Keep the variants with P < 0.05 and sort them by chromosome number
    # https://danielroelfs.com/blog/how-i-create-manhattan-plots-using-ggplot/

    sig.dat <- data %>%
      filter(.data[['${pval}']] < 0.05) %>%
      arrange(CHR)

  # Add highlight and annotation information
    #mutate( is_highlight=ifelse(SNP %in% index_snps, "yes", "no")) %>%
    #mutate( is_annotate=ifelse(-log10(P_BOLT_LMM)>6, "yes", "no")) 

  # Check the list of chromosomes (make sure the sex chr are at the end of the list)

    unique(sig.dat$CHR)

    # Get the cumulative base pair position for each variant

    nCHR <- length(unique(sig.dat$CHR))
    sig.dat$BPcum <- NA
    s <- 0
    nbp <- c()
    for (i in unique(sig.dat$CHR)){
      nbp[i] <- max(sig.dat[sig.dat$CHR == i,]$BP)
      sig.dat[sig.dat$CHR == i,"BPcum"] <- sig.dat[sig.dat$CHR == i,"BP"] + s
      s <- s + nbp[i]
    }

    # Calculate the mid point for each chromosome for plotting the x-axis
    # Calculate the y-lim 

    axis.set <- sig.dat %>% 
      group_by(CHR) %>% 
      summarize(center = (max(BPcum) + min(BPcum)) / 2)
    ylim <- abs(floor(log10(min(sig.dat$'${pval}')))) + 2 
    sig <- 5e-8

    # Now time to draw the manhattan plot without filtering the most significant signals

    manhplot <- ggplot(sig.dat, aes(x = BPcum, y = -log10(.data[['${pval}']]),
                                 color = as.factor(CHR), size = -log10(.data[['${pval}']]))) +
      geom_point(alpha = 0.75) +
      geom_hline(yintercept = -log10(sig), color = "red1", linetype = "dashed") + 
      scale_x_continuous(label = axis.set$CHR, breaks = axis.set$center) +
      scale_y_continuous(expand = c(0,0), limits = c(0, ylim)) +
      scale_color_manual(values = rep(c("#276FBF", "#183059"), nCHR)) +
      scale_size_continuous(range = c(0.5,3)) +
      # Add highlighted points
      # geom_point(data=subset(sig.dat, is_highlight=="yes"), color="orange", alpha=0.75) +
      labs(x = "Chromosome", 
           y = "-log10(p)",
           title = "${manplot}") + 
      theme_classic() +
      theme( 
        legend.position = "none",
        panel.border = element_blank(),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5)
      )

    # To save a plot created with ggplot2 inside a script, call print() on it explicitly

    png(filename = '${_output}', width = 6, height = 4, units = 'in', res = 300)
    print(manhplot)
    dev.off()

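The cumulative base-pair position used for the x-axis in the step above can be sketched in Python. This mirrors the R loop: each chromosome is shifted by the sum of the maximum positions of the chromosomes before it. The function name `cumulative_positions` and the tuple-based input are illustrative assumptions.

```python
def cumulative_positions(variants):
    """Map (chromosome, base-pair position) tuples to cumulative x-axis positions.

    Each chromosome is offset by the sum of the maximum positions observed
    on all lower-numbered chromosomes.
    """
    # Largest observed position per chromosome
    max_bp = {}
    for chrom, bp in variants:
        max_bp[chrom] = max(bp, max_bp.get(chrom, 0))

    # Running offset, accumulated in chromosome order
    offsets = {}
    running = 0
    for chrom in sorted(max_bp):
        offsets[chrom] = running
        running += max_bp[chrom]

    return [bp + offsets[chrom] for chrom, bp in variants]


print(cumulative_positions([(1, 100), (1, 250), (2, 50), (3, 10)]))
```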

Use awk to find the minimum and maximum values in a specific column (here, column 3):

In [None]:
#awk '(NR==1){Min=$3;Max=$3};(NR>=2){if(Min>$3) Min=$3;if(Max<$3) Max=$3} END {printf "The Min is %d, Max is %d\n",Min,Max}' ukb_mfi_chr1_v3.txt

Show the five lowest genome-wide significant p-values (column 16), sorted numerically on that column:

In [None]:
#gzcat Test_INT-BMI.txt.gz | awk -F "\t" '{ if ($16 <= 5e-8) { print } }' | sort -t $'\t' -g -k16,16 | head -n 5

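The same lookup can be sketched in Python, selecting the p-value column by name rather than by position. This is an illustration only: the function name `lowest_pvalues` is hypothetical, and the file is assumed to be a gzipped, tab-delimited table with a header line.

```python
import gzip
import heapq


def lowest_pvalues(path, pval_column, threshold=5e-8, n=5, sep='\t'):
    """Return the `n` rows with the smallest p-values at or below `threshold`."""
    hits = []
    with gzip.open(path, 'rt') as handle:
        header = handle.readline().rstrip('\n').split(sep)
        col = header.index(pval_column)
        for line in handle:
            fields = line.rstrip('\n').split(sep)
            try:
                p = float(fields[col])
            except (ValueError, IndexError):
                continue  # skip missing or malformed p-values
            if p <= threshold:
                hits.append((p, fields))
    # Keep only the n smallest p-values, in ascending order
    return [fields for p, fields in heapq.nsmallest(n, hits, key=lambda hit: hit[0])]
```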
Find the variants in dataset 1 that are also present in dataset 2 (matched on the first column):

In [None]:
#awk 'NR==FNR {FILE1[$1]=$0; next} ($1 in FILE1) {print $0}' dataset2 dataset1 > dataset1_subset_overlapping_dataset2

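The awk one-liner above stores the first field of every line of `dataset2` and then prints the lines of `dataset1` whose first field was seen. A minimal Python equivalent, with the hypothetical name `overlapping_rows` and whitespace-delimited input assumed, could look like this:

```python
def overlapping_rows(reference_lines, target_lines):
    """Return the lines of `target_lines` whose first field also occurs
    as the first field of some line in `reference_lines`."""
    keys = {line.split()[0] for line in reference_lines if line.strip()}
    return [line for line in target_lines
            if line.strip() and line.split()[0] in keys]


dataset2 = ['rs1 A G', 'rs3 C T']
dataset1 = ['rs1 0.01', 'rs2 0.20', 'rs3 0.05']
print(overlapping_rows(dataset2, dataset1))
```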
The `.clumped` output file contains the index SNP of each LD block in column 3. Extract this column to create the vector of SNPs to highlight in the Manhattan plot:

```
awk '{ print $3 }' \
Test_INT-BMI.txt.clumped \
> dataset1_subset_overlapping_dataset2_clump1.SNPs
```
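The extraction can also be sketched in Python, which makes it easy to drop the header and any blank lines before using the result as a highlight list. The function name `read_index_snps` is illustrative, and the sketch assumes a whitespace-delimited file whose first line is a header.

```python
def read_index_snps(lines, column=3):
    """Return the values of a 1-based, whitespace-delimited column,
    skipping the header line and lines with too few fields."""
    snps = []
    for index, line in enumerate(lines):
        fields = line.split()
        if index == 0 or len(fields) < column:
            continue  # header, blank, or truncated line
        snps.append(fields[column - 1])
    return snps


# Hypothetical usage with the file name from the command above
# with open('Test_INT-BMI.txt.clumped') as handle:
#     index_snps = read_index_snps(handle)
```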

### Output some summary statistics from the phenotypic data
* Number of cases
* Number of controls

In [None]:
[phenotype_summary]
input: phenoFile
output: None
bash: expand="${ }"
    # Number of controls (phenotype coded 0 in column 5)
    awk '$5 == 0' ${_input} | wc -l
    # Number of cases (phenotype coded 1 in column 5)
    awk '$5 == 1' ${_input} | wc -l
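
The case and control counts can also be obtained in a single pass. The sketch below assumes, as in the awk commands above, that the phenotype is in column 5 and coded 0 for controls and 1 for cases; the function name `count_phenotype` and the `has_header` switch are illustrative.

```python
from collections import Counter


def count_phenotype(lines, column=5, has_header=True):
    """Count the values found in a 1-based, whitespace-delimited column."""
    counts = Counter()
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # skip the header line
        fields = line.split()
        if len(fields) >= column:
            counts[fields[column - 1]] += 1
    return counts


example = [
    'FID IID age sex asthma',
    '1 1 50 1 0',
    '2 2 61 2 1',
    '3 3 47 1 0',
]
counts = count_phenotype(example)
print('controls:', counts['0'], 'cases:', counts['1'])
```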