# Explanation of the main script

Just as with the previous tutorials, this tutorial can be completed simply by copy-and-pasting all commands from this 'main script' into the Unix terminal.

For a theoretical background on these methods we refer to the accompanying article entitled "A tutorial on conducting Genome-Wide-Association Studies: Quality control and statistical analysis" (https://www.ncbi.nlm.nih.gov/pubmed/29484742).

In order to run this script you need the following files from the previous tutorial:
- `covar_mds.txt` and
- `HapMap_3_r3_13` (bfile, i.e., `HapMap_3_r3_13.bed`, `HapMap_3_r3_13.bim`, `HapMap_3_r3_13.fam`).

## Setup

> **Author's note:** below is some setup code I need to make the notebook shine a bit more. Some imports and styling for the plots later on.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from qmplot import manhattanplot, qqplot

plt.style.use('ggplot')

## Association analyses

For the association analyses we use the files generated in the previous tutorial (population stratification), named: `HapMap_3_r3_13` (with `.bed`, `.bim`, and `.fam` extensions) and `covar_mds.txt`.

Copy the bfile HapMap_3_r3_13 from the previous tutorial to the current directory.

In [None]:
!cp ../2_Population_stratification/HapMap_3_r3_13.* .

Copy `covar_mds.txt` from the previous tutorial to the current directory.

In [None]:
!cp ../2_Population_stratification/covar_mds.txt .

The analyses below are for binary traits.

### assoc

> **Author's note:** Info about the columns of the `.assoc` file can be found here https://www.cog-genomics.org/plink/1.9/formats#assoc

In [None]:
!plink --bfile HapMap_3_r3_13 --assoc --out assoc_results

Note, the `--assoc` option does not allow correction for covariates such as principal components (PCs) or MDS components, which makes it less suited for association analyses.
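
For case/control data, `--assoc` reports a 1-df chi-square test on allele counts (the `CHISQ` and `P` columns). The cell below is a minimal sketch of that kind of test on a single 2x2 table, using only the standard library. The function and argument names are hypothetical, and the code illustrates the statistic rather than reproducing PLINK's implementation.

```python
import math


def allelic_chisq(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """1-df chi-square test on a 2x2 table of allele counts.

    Rows are cases/controls, columns are alleles A1/A2
    (hypothetical argument names, chosen for this sketch).
    Returns (chi-square statistic, p-value).
    """
    table = [[case_a1, case_a2], [ctrl_a1, ctrl_a2]]
    total = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    chisq = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chisq += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chisq / 2))
    return chisq, p_value


# Made-up allele counts, for illustration only.
chisq, p = allelic_chisq(10, 20, 20, 10)
print(f"CHISQ={chisq:.3f}, P={p:.4f}")
```

Because the test only sees allele counts, there is no place to enter covariates, which is the limitation noted above.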

### logistic

We will be using 10 principal components as covariates in this logistic analysis. We use the MDS components calculated from the previous tutorial: `covar_mds.txt`.

> **Author's note:** Info about the columns of the `.assoc.logistic` file can be found here https://www.cog-genomics.org/plink/1.9/formats#assoc_linear

In [None]:
!plink --bfile HapMap_3_r3_13 --covar covar_mds.txt --logistic hide-covar --out logistic_results

Note, we use the `hide-covar` modifier of `--logistic` to only show the additive results of the SNPs in the output file.

Remove lines containing NA values, as these might cause problems when generating plots in later steps.

In [None]:
!awk '!/NA/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic
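
A roughly comparable clean-up can be sketched in pandas. This is an illustration under assumptions, not part of the original workflow: the helper name `drop_missing_p` and the tiny example table are made up, and unlike the `awk` filter (which drops every line containing the text `NA` anywhere) it only drops rows whose `P` value is missing.

```python
import io

import pandas as pd


def drop_missing_p(results, p_col="P"):
    """Return only the rows whose p-value column is not missing."""
    return results.dropna(subset=[p_col]).reset_index(drop=True)


# Tiny made-up table in the whitespace-delimited layout of a PLINK result file.
example = io.StringIO(
    "CHR SNP BP A1 TEST NMISS OR STAT P\n"
    "1 rs1 100 A ADD 100 1.2 1.5 0.13\n"
    "1 rs2 200 G ADD 100 NA NA NA\n"
)

# pandas parses the literal string "NA" as a missing value by default.
results = pd.read_csv(example, sep=r"\s+")
clean = drop_missing_p(results)
print(clean)
```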

The results obtained from these GWAS analyses will be visualized in the last step. This will also show if the data set contains any genome-wide significant SNPs.

Note, in case of a quantitative outcome measure the option `--logistic` should be replaced by `--linear`. The use of the `--assoc` option is also possible for quantitative outcome measures (as mentioned previously, this option does not allow the use of covariates).

## Multiple testing

There are various ways to deal with multiple testing besides the conventional genome-wide significance threshold of 5.0E-8. Below we present a couple.

### adjust

In [None]:
!plink --bfile HapMap_3_r3_13 --assoc --adjust --out adjusted_assoc_results

The output file (`adjusted_assoc_results.assoc.adjusted`) gives a Bonferroni corrected _p_-value, along with FDR and others.
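
To make these two corrections concrete, here is a minimal NumPy sketch of a Bonferroni adjustment and a Benjamini-Hochberg FDR adjustment. The function names are hypothetical and the code follows the textbook definitions; it is not taken from PLINK, so small numerical differences from the adjusted file are possible.

```python
import numpy as np


def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # p_(i) * n / i for the i-th smallest p-value
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


# Made-up p-values, for illustration only.
pvals = [0.01, 0.02, 0.03, 0.04]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; the FDR adjustment controls the expected proportion of false discoveries instead.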

### Permutation

This is a computationally intensive step. Further pros and cons of this method, which can be used for association and dealing with multiple testing, are described in our article corresponding to this tutorial (https://www.ncbi.nlm.nih.gov/pubmed/29484742).

To reduce computational time, we only perform this test on a subset of the SNPs from chromosome 22.

The `EMP2` column provides the _p_-value corrected for multiple testing.
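
The idea behind a max(T) permutation test can be sketched in a few lines of NumPy. Everything here is an assumption made for illustration: the function name, the use of a squared correlation as the test statistic, and the (R + 1) / (N + 1) convention for empirical _p_-values. It is not PLINK's implementation, and it assumes every SNP varies across samples.

```python
import numpy as np


def maxt_permutation(genotypes, phenotype, n_perm=1000, seed=0):
    """Pointwise and max(T)-corrected empirical p-values.

    genotypes: (n_samples, n_snps) array of allele counts.
    phenotype: (n_samples,) array, e.g. 0/1 case status.
    Assumes no SNP is monomorphic (otherwise the statistic divides by zero).
    """
    rng = np.random.default_rng(seed)
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    g_centered = g - g.mean(axis=0)
    g_ss = (g_centered ** 2).sum(axis=0)

    def stat(y_vec):
        # Squared Pearson correlation between each SNP and the phenotype.
        y_centered = y_vec - y_vec.mean()
        num = (g_centered * y_centered[:, None]).sum(axis=0) ** 2
        return num / (g_ss * (y_centered ** 2).sum())

    observed = stat(y)
    exceed_pointwise = np.zeros(g.shape[1])
    exceed_max = np.zeros(g.shape[1])
    for _ in range(n_perm):
        permuted = stat(rng.permutation(y))
        # Pointwise: compare each SNP with its own permuted statistic.
        exceed_pointwise += permuted >= observed
        # Corrected: compare each SNP with the largest permuted statistic.
        exceed_max += permuted.max() >= observed

    emp1 = (exceed_pointwise + 1) / (n_perm + 1)
    emp2 = (exceed_max + 1) / (n_perm + 1)
    return emp1, emp2


# Made-up data: SNP 0 tracks the phenotype, the others are random.
rng = np.random.default_rng(1)
phenotype = np.repeat([0, 1], 20)
genotypes = rng.integers(0, 3, size=(40, 5))
genotypes[:, 0] = 2 * phenotype
emp1, emp2 = maxt_permutation(genotypes, phenotype, n_perm=200)
print(emp1)
print(emp2)
```

Comparing each observed statistic with the maximum over all SNPs in a permutation is what accounts for multiple testing, so the corrected value is never smaller than the pointwise one.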

Generate subset of SNPs.

In [None]:
!awk '{ if ($1 == 22 && $4 >= 21595000 && $4 <= 21605000) print $2 }' HapMap_3_r3_13.bim > subset_snp_chr_22.txt
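
The same kind of selection can be sketched in pandas, which may be easier to adapt. The helper `select_snps`, the column labels, and the three example rows are hypothetical; a `.bim` file has no header, so the names are assigned on reading. The optional `chrom` argument restricts the window to one chromosome, and leaving it as `None` filters on position only.

```python
import io

import pandas as pd

# Labels chosen for this sketch; .bim files have six unnamed columns.
BIM_COLUMNS = ["CHR", "SNP", "CM", "BP", "A1", "A2"]


def select_snps(bim, start, end, chrom=None):
    """Return SNP IDs with start <= BP <= end, optionally on one chromosome."""
    in_window = bim["BP"].between(start, end)  # inclusive on both ends
    if chrom is not None:
        in_window &= bim["CHR"].astype(str) == str(chrom)
    return bim.loc[in_window, "SNP"].tolist()


# Made-up rows in .bim layout, for illustration only.
example = io.StringIO(
    "22\trs_a\t0\t21594000\tA\tG\n"
    "22\trs_b\t0\t21600000\tC\tT\n"
    "1\trs_c\t0\t21601000\tA\tC\n"
)
bim = pd.read_csv(example, sep=r"\s+", header=None, names=BIM_COLUMNS)
print(select_snps(bim, 21595000, 21605000, chrom=22))
```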

Filter your bfile based on the subset of SNPs generated in the step above.

In [None]:
!plink --bfile HapMap_3_r3_13 --extract subset_snp_chr_22.txt --make-bed --out HapMap_subset_for_perm

Perform 1,000,000 permutations.

In [None]:
!plink --bfile HapMap_subset_for_perm --assoc --mperm 1000000 --out subset_1M_perm_result

Order your data from lowest to highest corrected _p_-value (`EMP2`, the fourth column).

In [None]:
!sort -gk 4 subset_1M_perm_result.assoc.mperm > sorted_subset.txt

Check ordered permutation results.

In [None]:
!head sorted_subset.txt
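
If you prefer to stay in Python, the ordering step can be sketched with pandas as well. The helper name and the two example rows are made up; the column names follow the header of the `.assoc.mperm` file.

```python
import io

import pandas as pd


def sort_by_emp2(mperm):
    """Order permutation results from lowest to highest corrected p-value."""
    return mperm.sort_values("EMP2", kind="stable").reset_index(drop=True)


# Made-up rows in the layout of a .assoc.mperm file, for illustration only.
example = io.StringIO(
    "CHR SNP EMP1 EMP2\n"
    "22 rs_a 0.40 0.95\n"
    "22 rs_b 0.001 0.01\n"
)
mperm = pd.read_csv(example, sep=r"\s+")
print(sort_by_emp2(mperm).head())
```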

## Generate Manhattan and QQ plots

In [None]:
results_log = pd.read_csv('logistic_results.assoc_2.logistic', sep=r'\s+')  # raw string avoids an invalid-escape warning

fig, ax = plt.subplots(1, 1, figsize=(15, 5))
manhattanplot(results_log, chrom='CHR', pos='BP', snp='SNP', ax=ax)
ax.set_title('Manhattan plot: logistic')
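
Besides inspecting the plot, you can list any genome-wide significant SNPs directly. The cell below is a minimal sketch: the helper name `significant_snps` and the example table are hypothetical, and the threshold is the conventional 5.0E-8 mentioned earlier.

```python
import pandas as pd

GENOME_WIDE_THRESHOLD = 5e-8


def significant_snps(results, threshold=GENOME_WIDE_THRESHOLD, p_col="P"):
    """Return rows with a p-value below the threshold, most significant first."""
    hits = results[results[p_col] < threshold]
    return hits.sort_values(p_col).reset_index(drop=True)


# Made-up results, for illustration only.
example = pd.DataFrame(
    {
        "CHR": [1, 2, 3],
        "SNP": ["rs_a", "rs_b", "rs_c"],
        "BP": [1000, 2000, 3000],
        "P": [0.2, 3e-9, 4e-8],
    }
)
hits = significant_snps(example)
print(hits)
```

With the real data you would pass the DataFrame read from the logistic results instead of `example`.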

In [None]:
results_as = pd.read_csv('assoc_results.assoc', sep=r'\s+')  # raw string avoids an invalid-escape warning

fig, ax = plt.subplots(1, 1, figsize=(15, 5))
qqplot(results_as['P'], ax=ax)
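
To see what a QQ plot compares, the sketch below computes the two axes by hand: the observed -log10 _p_-values against the values expected if all _p_-values were uniformly distributed. The function name and the `(i - 0.5) / n` plotting positions are assumptions for this illustration (a common convention) and need not match what `qmplot` does internally.

```python
import numpy as np


def qq_points(pvals):
    """Expected and observed -log10(p), both ordered from most significant."""
    p = np.asarray(pvals, dtype=float)
    # Drop missing and non-positive values, then sort ascending.
    p = np.sort(p[np.isfinite(p) & (p > 0)])
    n = p.size
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p)
    return expected, observed


# Simulated null p-values, for illustration only.
rng = np.random.default_rng(0)
expected, observed = qq_points(rng.uniform(size=1000))
print(expected[:3], observed[:3])
```

Points that follow the diagonal indicate that the test statistics behave as expected under the null; a systematic departure along the whole range can point to inflation, for example from population stratification.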

---

## Congratulations

You have successfully conducted a GWAS analysis!

If you are also interested in learning how to conduct a polygenic risk score (PRS) analysis, please see our fourth tutorial.

The tutorial explaining PRS is independent from the previous tutorials.