# Processing of GWAS summary statistics
## Previous attempt: GWAS Curation
Analyzing just three traits that have demonstrated differences in either onset or incidence between males and females has not revealed any obvious differences in heritability explained by marginal and sex-specific mQTLs. As such, I've decided to select traits from high-level categories in the [EFO ontology](https://www.ebi.ac.uk/gwas/docs/ontology) across GWAS in the EBI GWAS catalogue. I am focusing primarily on neurobehavioral traits, citing the following study, which found that the majority of traits do not show sex-dependent genetic effects ([Stringer et al 2017](https://doi-org.ezproxy.library.ubc.ca/10.1038/s41598-017-09249-3)).

In [9]:
import pandas as pd
%load_ext rpy2.ipython
%R library(tidyverse)

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


0,1,2,3,4,5,6
'forcats','stringr','dplyr',...,'datasets','methods','base'


In [10]:
# Loading data from the GWAS catalogue
gwas_meta = pd.read_csv("../data/gwas_catalog_all.csv")
gwas_meta.head()

Unnamed: 0,DISEASE/TRAIT,STUDY ACCESSION,DATE ADDED TO CATALOG,PUBMED ID,FIRST AUTHOR.x,DATE.x,JOURNAL,LINK,STUDY,INITIAL SAMPLE SIZE,...,STAGE,NUMBER OF INDIVDUALS,BROAD ANCESTRAL CATEGORY,COUNTRY OF ORIGIN,COUNTRY OF RECRUITMENT,ADDITONAL ANCESTRY DESCRIPTION,EFO term,EFO URI,Parent term,Parent URI
0,&beta;2-Glycoprotein I (&beta;2-GPI) plasma le...,GCST001800,2013-03-19,23279374,Athanasiadis G,2013-01-02,J Thromb Haemost,www.ncbi.nlm.nih.gov/pubmed/23279374,Genetic determinants of plasma β₂-glycoprotein...,306 European ancestry individuals,...,initial,306.0,European,NR,Spain,,glycoprotein measurement,http://www.ebi.ac.uk/efo/EFO_0004555,Other measurement,http://www.ebi.ac.uk/efo/EFO_0001444
1,"1,5-anhydroglucitol levels",GCST004643,2017-09-05,28588231,Li M,2017-06-06,Sci Rep,www.ncbi.nlm.nih.gov/pubmed/28588231,"Genome-wide association study of 1,5-anhydrogl...","7,550 European ancestry individuals",...,initial,7550.0,European,NR,U.S.,European American,"1,5 anhydroglucitol measurement",http://www.ebi.ac.uk/efo/EFO_0008009,Other measurement,http://www.ebi.ac.uk/efo/EFO_0001444
2,"1,5-anhydroglucitol levels",GCST004643,2017-09-05,28588231,Li M,2017-06-06,Sci Rep,www.ncbi.nlm.nih.gov/pubmed/28588231,"Genome-wide association study of 1,5-anhydrogl...","7,550 European ancestry individuals",...,replication,2030.0,African American or Afro-Caribbean,NR,U.S.,African American,"1,5 anhydroglucitol measurement",http://www.ebi.ac.uk/efo/EFO_0008009,Other measurement,http://www.ebi.ac.uk/efo/EFO_0001444
3,"1,5-anhydroglucitol levels",GCST004643,2017-09-05,28588231,Li M,2017-06-06,Sci Rep,www.ncbi.nlm.nih.gov/pubmed/28588231,"Genome-wide association study of 1,5-anhydrogl...","7,550 European ancestry individuals",...,replication,8790.0,European,NR,"Germany, U.K.",,"1,5 anhydroglucitol measurement",http://www.ebi.ac.uk/efo/EFO_0008009,Other measurement,http://www.ebi.ac.uk/efo/EFO_0001444
4,17-hydroxyprogesterone (17-OHP) levels,GCST008879,2019-10-21,31169883,Pott J,2019-06-06,J Clin Endocrinol Metab,www.ncbi.nlm.nih.gov/pubmed/31169883,Genetic association study of eight steroid hor...,"1,358 European ancestry men, 712 European ance...",...,initial,2070.0,European,NR,Germany,,17-hydroxyprogesterone measurement,http://www.ebi.ac.uk/efo/EFO_0010220,Other measurement,http://www.ebi.ac.uk/efo/EFO_0001444
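
To make the trait selection concrete, here is a minimal sketch of filtering the catalogue by its `Parent term` column and keeping one row per study, since the preview shows the same accession repeated for initial and replication stages. The toy rows, the accession strings, and the category label `Neurological disorder` are assumptions for illustration; only `Other measurement` appears in the preview, so the labels actually present should be checked with `gwas_meta["Parent term"].unique()` first.

```python
import pandas as pd

# Toy stand-in for gwas_meta: column names are copied from the preview,
# row values are invented for illustration only.
toy_meta = pd.DataFrame(
    {
        "DISEASE/TRAIT": ["trait A", "trait A", "trait B", "trait C"],
        "STUDY ACCESSION": ["GCST_A", "GCST_A", "GCST_B", "GCST_C"],
        "STAGE": ["initial", "replication", "initial", "initial"],
        "Parent term": [
            "Neurological disorder",  # assumed label, not shown in the preview
            "Neurological disorder",
            "Other measurement",
            "Neurological disorder",
        ],
    }
)


def select_studies(meta: pd.DataFrame, parent_terms: list) -> pd.DataFrame:
    """Return one row per study accession whose parent term is in parent_terms."""
    keep = meta["Parent term"].isin(parent_terms)
    return (
        meta.loc[keep]
        .drop_duplicates(subset="STUDY ACCESSION")  # collapse initial/replication rows
        .reset_index(drop=True)
    )


selected = select_studies(toy_meta, ["Neurological disorder"])
print(selected[["DISEASE/TRAIT", "STUDY ACCESSION"]])
```

On the real table the same call would be `select_studies(gwas_meta, [...])`, with the list holding whichever parent terms cover the neurobehavioral traits of interest.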


## Uniform formatting with `munge_sumstats.py`

In [21]:
%%bash
# Print the header and first data row of each summary statistics file
for f in ../../tmp_GWAS/pgc_sumstats/*; do
    echo "$f"
    zcat "$f" | head -n 2
done

../../tmp_GWAS/pgc_sumstats/adhd_jul2017.gz
CHR	SNP	BP	A1	A2	INFO	OR	SE	P
1	rs202152658	751343	A	T	0.884	1.03118	0.0221	0.1654
../../tmp_GWAS/pgc_sumstats/anxiety.meta.full.cc.tbl.gz
SNPID	CHR	BP	Allele1	Allele2	Freq1	Effect	StdErr	P.value	TotalN
rs1000033	1	226580387	t	g	0.8266	-0.0574	0.0348	0.09867	17310
../../tmp_GWAS/pgc_sumstats/anxiety.meta.full.fs.tbl.gz
SNPID	CHR	BP	Allele1	Allele2	Freq1	Effect	StdErr	P.value	TotalN
rs1000033	1	226580387	t	g	0.824	-0.0057	0.0058	0.3288	18186
../../tmp_GWAS/pgc_sumstats/AUDIT_UKB_2018_AJP.txt.gz
chr rsid a_0 a_1 info beta_T se_T p_T beta_C se_C p_C beta_P se_P p_P N
1 1:10000179_AAAAAAAC_A AAAAAAAC A 0.985768 -0.0023263 0.0087656 0.790715040934701 0.0024046 0.0080078 0.763958909282397 -0.013933 0.0076296 0.0678265926015363 121568
../../tmp_GWAS/pgc_sumstats/Cannabis_ICC_23andmetop_UKB_het.txt.gz
CHR	SNP	BP	A1	A2	FRQ	BETA	SE	Z	P	Direction	HetISq	HetDf	HetPVa	Nca	Nco	Neff
3	rs2875907	85518580	a	g	0.3524	0.0712	0.0086	8.27907	9.381e-17	+++	0	2	0.5

With the exception of the vcf.tsv formatted files, it appears that the formatting of these summary statistics is fairly uniform. Let's see if we can just run `munge_sumstats.py` on all of these files:
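
One way to check how uniform the headers really are is to read the first non-comment line of each file and compare column sets. The sketch below is an illustration only: the two demo files and their contents are hypothetical stand-ins written to a temporary directory, not the real summary statistics.

```python
import gzip
import os
import tempfile


def read_header(path: str) -> list:
    """Return the column names from the first non-comment line of a gzipped file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                # str.split() with no argument handles both tabs and spaces
                return line.split()
    return []


# Hypothetical demo files standing in for the real summary statistics
tmp_dir = tempfile.mkdtemp()
examples = {
    "a.gz": "CHR\tSNP\tBP\tA1\tA2\tINFO\tOR\tSE\tP\n1\trs1\t1\tA\tT\t0.9\t1.0\t0.02\t0.2\n",
    "b.gz": "##comment\nSNPID CHR BP Allele1 Allele2 Freq1 Effect StdErr P.value TotalN\n",
}
headers = {}
for name, text in examples.items():
    path = os.path.join(tmp_dir, name)
    with gzip.open(path, "wt") as handle:
        handle.write(text)
    headers[name] = read_header(path)

# Columns shared by every file, then the columns unique to each file
shared = set.intersection(*(set(cols) for cols in headers.values()))
print("shared:", sorted(shared))
for name, cols in headers.items():
    print(name, sorted(set(cols) - shared))
```

Columns that fall outside the shared set are the ones that would need explicit flags or renaming before munging.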

In [38]:
%%bash
source /home/wcasazza/miniconda3/bin/activate
conda activate ldsc
cd /scratch/st-dennisjk-1/wcasazza/tmp_GWAS/
echo "" > missed_files.txt
for f in pgc_sumstats/*.gz; do
    out_f=${f%.*}  # strip the .gz suffix
    # Skip files that already have formatted output
    if [[ ! -f "pgc_formatted_sumstats/${out_f##*/}.sumstats.gz" ]]; then
        cmd="/arc/project/st-dennisjk-1/software/ldsc/munge_sumstats.py"
        cmd+=" --sumstats ${f}"
        cmd+=" --out pgc_formatted_sumstats/${out_f##*/}"
        # Signed statistic column; Z takes precedence over BETA and LogOR.
        # Tracking it here (rather than using break) lets sample-size columns
        # that appear after Z in the header still be picked up.
        signed=""
        # Map recognized header columns onto munge_sumstats.py flags
        for arg in $(zgrep -v "^#" "${f}" | head -n 1 | tr -d '\r\n'); do
            case ${arg} in
                "TotalN"|"Neff")
                    cmd+=" --N-col ${arg}"
                ;;
                "Nca"|"NCAS")
                    cmd+=" --N-cas-col ${arg}"
                ;;
                "Nco")
                    cmd+=" --N-con-col ${arg}"
                ;;
                "Z")
                    signed="Z"
                ;;
                "BETA"|"LogOR")
                    if [[ -z "${signed}" ]]; then
                        signed="${arg}"
                    fi
                ;;
                *)
                    echo "${arg} ignored."
                ;;
            esac
        done
        if [[ -n "${signed}" ]]; then
            cmd+=" --signed-sumstats ${signed},0"
        fi
        eval "$cmd"
    fi
done

Process is interrupted.
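
Since the run was interrupted, it can help to inspect the generated commands without executing anything. The sketch below expresses the header-to-flag mapping as a pure Python function. It makes two assumptions that are mine rather than confirmed behavior of the cell above: a `Z` column, when present, takes precedence over `BETA` or `LogOR`, and sample-size columns are collected wherever they appear in the header. The file paths are hypothetical; the flag names are the `munge_sumstats.py` options already used in the loop.

```python
def build_munge_args(sumstats_path, out_prefix, header):
    """Translate recognized header columns into munge_sumstats.py arguments."""
    args = ["munge_sumstats.py", "--sumstats", sumstats_path, "--out", out_prefix]
    # Sample-size columns and the flag each one maps to
    flag_for = {
        "TotalN": "--N-col",
        "Neff": "--N-col",
        "Nca": "--N-cas-col",
        "NCAS": "--N-cas-col",
        "Nco": "--N-con-col",
    }
    for col in header:
        if col in flag_for:
            args += [flag_for[col], col]
    # Assumed precedence for the signed statistic: Z, then BETA, then LogOR
    for col in ("Z", "BETA", "LogOR"):
        if col in header:
            args += ["--signed-sumstats", f"{col},0"]
            break
    return args


# Header taken from the Cannabis file preview; paths are hypothetical
header = (
    "CHR SNP BP A1 A2 FRQ BETA SE Z P Direction "
    "HetISq HetDf HetPVa Nca Nco Neff"
).split()
args = build_munge_args(
    "pgc_sumstats/example.gz", "pgc_formatted_sumstats/example", header
)
print(" ".join(args))
```

Printing the argument list for each file first makes it easier to spot headers that produce no signed statistic or no sample-size flag before committing to a long run.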
