## Postgap GWAS Catalog Input

This notebook generates postgap-compatible summary statistics input from a GWAS Catalog export.

In [13]:
import pandas as pd
import numpy as np

## Load GWAS Catalog Download

The export used here was downloaded from https://www.ebi.ac.uk/gwas/studies/GCST007400.

In [6]:
# Note that "OR or BETA" is OR for this study
df = pd.read_csv("/Users/eczech/Downloads/gwas-association-downloaded_2020-01-06-accessionId_GCST007400.tsv", sep="\t")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 38 columns):
DATE ADDED TO CATALOG         71 non-null object
PUBMEDID                      71 non-null int64
FIRST AUTHOR                  71 non-null object
DATE                          71 non-null object
JOURNAL                       71 non-null object
LINK                          71 non-null object
STUDY                         71 non-null object
DISEASE/TRAIT                 71 non-null object
INITIAL SAMPLE SIZE           71 non-null object
REPLICATION SAMPLE SIZE       0 non-null float64
REGION                        64 non-null object
CHR_ID                        64 non-null float64
CHR_POS                       64 non-null float64
REPORTED GENE(S)              71 non-null object
MAPPED_GENE                   64 non-null object
UPSTREAM_GENE_ID              19 non-null object
DOWNSTREAM_GENE_ID            19 non-null object
SNP_GENE_IDS                  45 non-null object
UPSTREAM_GENE_

In [15]:
# The postgap code for loading CSVs is poorly documented. The position-based Bayesian
# fine-mapping calculations require the following fields:
#   chromosome, position, rsID, effect_allele, non_effect_allele, beta, se, pvalue
# See: https://github.com/Ensembl/postgap/blob/master/lib/postgap/FinemapIntegration.py#L126
# The main project docs mention only the following fields as required, per
# https://github.com/Ensembl/postgap/blob/master/lib/postgap/GWAS.py#L844:
#   variant_id, p-value, beta
# The transformation below tries to satisfy both sets of requirements:
dfe = df.assign(
    effect_allele='', 
    non_effect_allele='', 
    se=np.nan,
    beta=lambda d: np.log(d['OR or BETA'])  # beta = ln(OR), i.e. the log odds ratio
    ).rename(columns={
        'CHR_ID': 'chromosome', 
        'CHR_POS': 'position', 
        'SNPS': 'variant_id',
        'OR or BETA': 'odds_ratio',
        'P-VALUE': 'p-value'
    })[[
        'chromosome', 'position', 'variant_id', 'effect_allele', 'non_effect_allele',
        'beta', 'se', 'p-value'
    ]]
dfe

| | chromosome | position | variant_id | effect_allele | non_effect_allele | beta | se | p-value |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 119503609.0 | rs1131265 | | | 0.210721 | NaN | 1.000000e-09 |
| 1 | NaN | NaN | rs115268109 | | | 0.270027 | NaN | 2.000000e-07 |
| 2 | 16.0 | 68573583.0 | rs1170436 | | | 0.157004 | NaN | 9.000000e-07 |
| 3 | 9.0 | 99637981.0 | rs10819689 | | | 0.139262 | NaN | 9.000000e-07 |
| 4 | 10.0 | 6072893.0 | rs10905718 | | | 0.113329 | NaN | 4.000000e-06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 66 | 8.0 | 11484361.0 | rs2736336 | | | 0.292670 | NaN | 6.000000e-32 |
| 67 | 19.0 | 1112944.0 | rs7249065 | | | 0.262364 | NaN | 9.000000e-06 |
| 68 | 3.0 | 58332737.0 | rs180977001 | | | 0.239017 | NaN | 2.000000e-08 |
| 69 | 11.0 | 128454974.0 | rs12575600 | | | 0.215111 | NaN | 6.000000e-10 |
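
In the output above, `chromosome` and `position` are floats (e.g. `3.0`) because some rows have no mapped location, and at least one variant (`rs115268109`) has neither. Whether postgap accepts float-formatted coordinates or rows without a position is not verified here. The sketch below shows one possible cleanup, assuming that dropping unmapped variants is acceptable. The `toy` and `located` names are hypothetical, and the values are copied from the first rows shown above.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the structure above (values copied from the first rows shown)
toy = pd.DataFrame({
    'chromosome': [3.0, np.nan, 16.0],
    'position': [119503609.0, np.nan, 68573583.0],
    'variant_id': ['rs1131265', 'rs115268109', 'rs1170436'],
})

# Keep only variants with a mapped location, then cast the float columns to integers
# so the exported file contains "3" rather than "3.0".
located = toy.dropna(subset=['chromosome', 'position']).astype(
    {'chromosome': 'int64', 'position': 'int64'}
)
print(located.to_csv(sep='\t', index=False))
```

The same two calls could be applied to `dfe` before export. This relies on `CHR_ID` having been parsed as numeric, as it was for this study.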


In [17]:
dfe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71 entries, 0 to 70
Data columns (total 8 columns):
chromosome           64 non-null float64
position             64 non-null float64
variant_id           71 non-null object
effect_allele        71 non-null object
non_effect_allele    71 non-null object
beta                 71 non-null float64
se                   0 non-null float64
p-value              71 non-null float64
dtypes: float64(5), object(3)
memory usage: 4.6+ KB
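
The `se` column is entirely null because the transformation above sets it to NaN. One common workaround is to back out an approximate standard error from the effect size and the two-sided p-value under a normal (Wald) approximation: z = -Φ⁻¹(p/2) and se = |beta| / z. This is an assumption, not something the postgap docs prescribe, and the helper name `approx_se` is hypothetical. The p-values shown above appear to be rounded to one significant figure, so the result is only a rough estimate.

```python
import math
from statistics import NormalDist

def approx_se(beta, pvalue):
    """Approximate a standard error from beta and a two-sided p-value.

    Assumes a Wald test, i.e. z = beta / se is standard normal under the null.
    Returns NaN when either input is missing or the p-value is outside (0, 1).
    """
    if math.isnan(beta) or math.isnan(pvalue) or not (0.0 < pvalue < 1.0):
        return math.nan
    # Use the lower tail (p/2) directly; 1 - p/2 would round to 1.0 for tiny p-values.
    z = -NormalDist().inv_cdf(pvalue / 2.0)
    return abs(beta) / z

# Example with the first row shown above (beta=0.210721, p=1e-09)
print(approx_se(0.210721, 1e-09))
```

Applied to the frame above, this could look like `dfe['se'] = [approx_se(b, p) for b, p in zip(dfe['beta'], dfe['p-value'])]`.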


In [16]:
# Export --summary_stats arg for postgap
dfe.to_csv('/tmp/postgap_GCST007400.tsv', sep='\t', index=False)