# Pleiotropy project for Asthma, Adiposity and type 2 diabetes in the jazf1 region

# Aims
The aim of this project is to examine pleiotropic relationships between three phenotypes: asthma, type 2 diabetes (T2D), and waist circumference, using summary statistics (regenie files from UKBB).

1. First, we use GSMR [Zhu et al. 2018 Nat. Commun](https://www.nature.com/articles/s41467-017-02317-2) to test for putative causal associations for the asthma_t2d, waist_t2d and waist_asthma pairs in the ***jazf1*** region. We also check the results with the more recent methods MR-Corr2 and CAUSE, as described by [Qiao et al. 2021 Bioinformatics](https://pubmed.ncbi.nlm.nih.gov/34499127/) and [Morrison et al. 2020 Nat. Genet](https://pubmed.ncbi.nlm.nih.gov/32451458/)
2. Compare with ivariate fine mapping results from Jiayi Zhou


## Data files and documents
* The location of the phenotype and genotype data is described [here](https://github.com/statgenetics/UKBB_GWAS_dev/blob/master/analysis/pleiotropy/data_description.ipynb)
* The phenotype and regenie sumstats files were also copied to my cluster account

    **Pheno**
    
    > **asthma**:/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/pheno_asthma_ind_PC.txt
    > **t2d**:/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/pheno_asthma_ind_PC.txt
    > **waist**:/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/pheno_WC_ind_PC.txt
    
    **Sumstats**
    > **asthma**:/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_PC10_step2_imp.regenie_ASTHMA.regenie
    > **t2d**:/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/T2D_PC10_step2_imp.regenie_T2D.regenie
    > **waist**:/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/WC_PC10_step2_imp.regenie_WAISTcirc_invranknorm.regenie

## Step 1: Input file preparation for GSMR
### 1.1 Import sumstats

In [2]:
import pandas as pd

In [3]:
# read the regenie results for the imputed data and keep only variants within the jazf1 region (chr7:27868573-28273990)
# within this region the sumstats files contain 2067 variants for asthma and t2d, and 2068 variants for waist

# asthma data
asthma_regenie = pd.read_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_PC10_step2_imp.regenie_ASTHMA.regenie", sep=" ")
asthma_regenie = asthma_regenie[(asthma_regenie["CHROM"] == 7) & (asthma_regenie["GENPOS"] >= 27868573) & (asthma_regenie["GENPOS"] <= 28273990)][["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "N", "BETA", "SE", "LOG10P"]]

# t2d data
t2d_regenie = pd.read_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/T2D_PC10_step2_imp.regenie_T2D.regenie", sep=" ")
t2d_regenie = t2d_regenie[(t2d_regenie["CHROM"] == 7) & (t2d_regenie["GENPOS"] >= 27868573) & (t2d_regenie["GENPOS"] <= 28273990)][["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "N", "BETA", "SE", "LOG10P"]]

# waist circumference data
waist_regenie = pd.read_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/WC_PC10_step2_imp.regenie_WAISTcirc_invranknorm.regenie", sep=" ")
waist_regenie = waist_regenie[(waist_regenie["CHROM"] == 7) & (waist_regenie["GENPOS"] >= 27868573) & (waist_regenie["GENPOS"] <= 28273990)][["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "N", "BETA", "SE", "LOG10P"]]
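
The three blocks above repeat the same read-and-subset step. A minimal sketch of how it could be written once is shown here, assuming space-delimited regenie output with the columns used above. The helper `load_region`, the constants, and the toy input are hypothetical names introduced only for illustration.

```python
import io
import pandas as pd

# Region boundaries used throughout this notebook (chr7, jazf1 region).
JAZF1_CHROM, JAZF1_START, JAZF1_END = 7, 27868573, 28273990
KEEP_COLS = ["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1",
             "A1FREQ", "N", "BETA", "SE", "LOG10P"]

def load_region(path_or_buffer, chrom=JAZF1_CHROM, start=JAZF1_START, end=JAZF1_END):
    """Read a space-delimited regenie file and keep variants with start <= GENPOS <= end."""
    df = pd.read_csv(path_or_buffer, sep=" ")
    # Series.between is inclusive on both ends by default.
    in_region = (df["CHROM"] == chrom) & df["GENPOS"].between(start, end)
    # .copy() returns an independent frame, so adding columns later does not
    # trigger pandas' SettingWithCopyWarning.
    return df.loc[in_region, KEEP_COLS].copy()

# Toy input (made-up rows except the second, which is copied from the asthma output).
toy = io.StringIO(
    "CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ N BETA SE LOG10P\n"
    "7 27868000 snp_outside A G 0.5 100 0.0 0.1 0.0\n"
    "7 27869098 rs545409685 C T 0.997340 339345 0.116773 0.071129 0.997184\n"
    "8 27869098 snp_other_chrom C T 0.5 100 0.0 0.1 0.0\n"
)
toy_region = load_region(toy)
print(toy_region)
```

With a helper like this, each trait would need a single call with its file path instead of two long lines.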

In [4]:
asthma_regenie

Unnamed: 0,CHROM,GENPOS,ID,ALLELE0,ALLELE1,A1FREQ,N,BETA,SE,LOG10P
6069486,7,27869098,rs545409685,C,T,0.997340,339345,0.116773,0.071129,0.997184
6069487,7,27869261,7:27869261_CAGTA_C,C,CAGTA,0.998498,339345,0.036830,0.098431,0.149797
6069488,7,27869377,rs73075348,G,A,0.943372,339345,-0.000204,0.015226,0.004665
6069489,7,27869782,rs6948467,A,G,0.607234,339345,-0.002206,0.007223,0.119179
6069490,7,27869794,rs73075354,G,C,0.883803,339345,0.012311,0.011001,0.579898
...,...,...,...,...,...,...,...,...,...,...
6071548,7,28273623,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.842749,339345,0.000601,0.009892,0.021555
6071549,7,28273697,rs188426589,A,T,0.987408,339345,-0.004086,0.031982,0.046563
6071550,7,28273719,rs6944995,G,T,0.146693,339345,0.006399,0.009963,0.283409
6071551,7,28273829,rs192297723,C,A,0.988365,339345,0.013026,0.035082,0.148496


In [5]:
t2d_regenie

Unnamed: 0,CHROM,GENPOS,ID,ALLELE0,ALLELE1,A1FREQ,N,BETA,SE,LOG10P
6070158,7,27869098,rs545409685,C,T,0.997337,336074,0.038626,0.100036,0.155268
6070159,7,27869261,7:27869261_CAGTA_C,C,CAGTA,0.998492,336074,0.108386,0.140165,0.357178
6070160,7,27869377,rs73075348,G,A,0.943307,336074,0.016225,0.021522,0.345893
6070161,7,27869782,rs6948467,A,G,0.607261,336074,-0.020544,0.010241,1.348310
6070162,7,27869794,rs73075354,G,C,0.883898,336074,0.038589,0.015628,1.868420
...,...,...,...,...,...,...,...,...,...,...
6072220,7,28273623,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.842757,336074,-0.007129,0.014019,0.213893
6072221,7,28273697,rs188426589,A,T,0.987404,336074,-0.013464,0.045338,0.115494
6072222,7,28273719,rs6944995,G,T,0.146537,336074,0.013798,0.014138,0.482690
6072223,7,28273829,rs192297723,C,A,0.988329,336074,-0.051344,0.049307,0.526178


In [6]:
waist_regenie

Unnamed: 0,CHROM,GENPOS,ID,ALLELE0,ALLELE1,A1FREQ,N,BETA,SE,LOG10P
6069968,7,27869098,rs545409685,C,T,0.997340,365499,0.000830,0.009980,0.029799
6069969,7,27869261,7:27869261_CAGTA_C,C,CAGTA,0.998506,365499,-0.018772,0.013917,0.751086
6069970,7,27869377,rs73075348,G,A,0.943390,365499,-0.002467,0.002139,0.604290
6069971,7,27869782,rs6948467,A,G,0.607228,365499,-0.002577,0.001015,1.953540
6069972,7,27869794,rs73075354,G,C,0.883827,365499,0.006933,0.001547,5.128300
...,...,...,...,...,...,...,...,...,...,...
6072031,7,28273623,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.842736,365499,-0.001481,0.001391,0.541901
6072032,7,28273697,rs188426589,A,T,0.987452,365499,-0.008732,0.004520,1.272780
6072033,7,28273719,rs6944995,G,T,0.146811,365499,0.003185,0.001400,1.640000
6072034,7,28273829,rs192297723,C,A,0.988319,365499,-0.002544,0.004923,0.218021


In [7]:
# gsmr sumdata uses SNP, a1, a2, a1_freq, bzx, bzx_se, bzx_pval, bzx_n, bzy, bzy_se, bzy_pval, bzy_n as columns

# the regenie output has no A0FREQ or PVAL column, so both are calculated here
# note: regenie reports A1FREQ and BETA for ALLELE1, whereas gsmr treats a1 as the effect allele

# A0FREQ: frequency of ALLELE0 (column-wise arithmetic, same values as a row-wise apply)
asthma_regenie["A0FREQ"] = 1 - asthma_regenie["A1FREQ"]
t2d_regenie["A0FREQ"] = 1 - t2d_regenie["A1FREQ"]
waist_regenie["A0FREQ"] = 1 - waist_regenie["A1FREQ"]

# PVAL: regenie reports LOG10P = -log10(p)
asthma_regenie["PVAL"] = 10 ** (-asthma_regenie["LOG10P"])
t2d_regenie["PVAL"] = 10 ** (-t2d_regenie["LOG10P"])
waist_regenie["PVAL"] = 10 ** (-waist_regenie["LOG10P"])

# also renaming all the columns for merging later on
asthma_regenie = asthma_regenie.rename(columns={"ID":"SNP", "ALLELE0":"a1", "ALLELE1":"a2", "A0FREQ":"asthma_a1_freq", "N":"asthma_n", "BETA":"asthma_beta", "SE":"asthma_se", "PVAL":"asthma_pval"})
t2d_regenie = t2d_regenie.rename(columns={"ID":"SNP", "ALLELE0":"a1", "ALLELE1":"a2", "A0FREQ":"t2d_a1_freq", "N":"t2d_n", "BETA":"t2d_beta", "SE":"t2d_se", "PVAL":"t2d_pval"})
waist_regenie = waist_regenie.rename(columns={"ID":"SNP", "ALLELE0":"a1", "ALLELE1":"a2", "A0FREQ":"waist_a1_freq", "N":"waist_n", "BETA":"waist_beta", "SE":"waist_se", "PVAL":"waist_pval"})

# keeping only relevant columns
asthma_regenie = asthma_regenie[["SNP", "a1", "a2", "asthma_a1_freq", "asthma_beta", "asthma_se", "asthma_pval", "asthma_n"]]
t2d_regenie = t2d_regenie[["SNP", "a1", "a2", "t2d_a1_freq", "t2d_beta", "t2d_se", "t2d_pval", "t2d_n"]]
waist_regenie = waist_regenie[["SNP", "a1", "a2", "waist_a1_freq", "waist_beta", "waist_se", "waist_pval", "waist_n"]]
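
As a quick sanity check on the LOG10P conversion, the recovered p-value can be compared with a two-sided normal-approximation p-value computed from BETA and SE. The sketch below uses only the first asthma row shown earlier. The function names are hypothetical, and close agreement is an assumption rather than a guarantee, because regenie can use other tests (for example Firth or saddlepoint corrections) for binary traits.

```python
import math

def pval_from_log10p(log10p):
    """regenie reports LOG10P = -log10(p); invert it to get p."""
    return 10 ** (-log10p)

def wald_pval(beta, se):
    """Two-sided normal-approximation p-value for the statistic beta / se."""
    z = abs(beta / se)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# First asthma row shown above: rs545409685 (BETA, SE, LOG10P copied from the output).
p_reported = pval_from_log10p(0.997184)
p_wald = wald_pval(0.116773, 0.071129)
print(p_reported, p_wald)
```

A large gap between the two values for many variants would suggest a column mix-up rather than a difference in test statistics.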

In [8]:
asthma_regenie

Unnamed: 0,SNP,a1,a2,asthma_a1_freq,asthma_beta,asthma_se,asthma_pval,asthma_n
6069486,rs545409685,C,T,0.002660,0.116773,0.071129,0.100651,339345
6069487,7:27869261_CAGTA_C,C,CAGTA,0.001502,0.036830,0.098431,0.708277,339345
6069488,rs73075348,G,A,0.056628,-0.000204,0.015226,0.989316,339345
6069489,rs6948467,A,G,0.392766,-0.002206,0.007223,0.760013,339345
6069490,rs73075354,G,C,0.116197,0.012311,0.011001,0.263089,339345
...,...,...,...,...,...,...,...,...
6071548,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.157251,0.000601,0.009892,0.951580,339345
6071549,rs188426589,A,T,0.012592,-0.004086,0.031982,0.898333,339345
6071550,rs6944995,G,T,0.853307,0.006399,0.009963,0.520704,339345
6071551,rs192297723,C,A,0.011635,0.013026,0.035082,0.710402,339345


In [9]:
t2d_regenie

Unnamed: 0,SNP,a1,a2,t2d_a1_freq,t2d_beta,t2d_se,t2d_pval,t2d_n
6070158,rs545409685,C,T,0.002663,0.038626,0.100036,0.699410,336074
6070159,7:27869261_CAGTA_C,C,CAGTA,0.001508,0.108386,0.140165,0.439362,336074
6070160,rs73075348,G,A,0.056693,0.016225,0.021522,0.450928,336074
6070161,rs6948467,A,G,0.392739,-0.020544,0.010241,0.044843,336074
6070162,rs73075354,G,C,0.116102,0.038589,0.015628,0.013539,336074
...,...,...,...,...,...,...,...,...
6072220,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.157243,-0.007129,0.014019,0.611093,336074
6072221,rs188426589,A,T,0.012596,-0.013464,0.045338,0.766489,336074
6072222,rs6944995,G,T,0.853463,0.013798,0.014138,0.329086,336074
6072223,rs192297723,C,A,0.011671,-0.051344,0.049307,0.297730,336074


In [10]:
waist_regenie

Unnamed: 0,SNP,a1,a2,waist_a1_freq,waist_beta,waist_se,waist_pval,waist_n
6069968,rs545409685,C,T,0.002660,0.000830,0.009980,0.933687,365499
6069969,7:27869261_CAGTA_C,C,CAGTA,0.001494,-0.018772,0.013917,0.177384,365499
6069970,rs73075348,G,A,0.056610,-0.002467,0.002139,0.248720,365499
6069971,rs6948467,A,G,0.392772,-0.002577,0.001015,0.011129,365499
6069972,rs73075354,G,C,0.116173,0.006933,0.001547,0.000007,365499
...,...,...,...,...,...,...,...,...
6072031,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.157264,-0.001481,0.001391,0.287144,365499
6072032,rs188426589,A,T,0.012548,-0.008732,0.004520,0.053361,365499
6072033,rs6944995,G,T,0.853189,0.003185,0.001400,0.022909,365499
6072034,rs192297723,C,A,0.011681,-0.002544,0.004923,0.605312,365499


## The goal is to assess pleiotropic relationships between asthma-T2D, waist-T2D and waist-asthma


In [11]:
## Prepare asthma_t2d sumdata for GSMR
asthma_v_t2d = pd.merge(asthma_regenie, t2d_regenie, how="inner", on=["SNP", "a1", "a2"]).drop(["t2d_a1_freq"], axis=1)
asthma_v_t2d = asthma_v_t2d.rename(columns={"asthma_a1_freq":"a1_freq", "asthma_beta":"bzx", "asthma_se":"bzx_se", "asthma_pval":"bzx_pval", "asthma_n":"bzx_n", "t2d_beta":"bzy", "t2d_se":"bzy_se", "t2d_pval":"bzy_pval", "t2d_n":"bzy_n"})
asthma_v_t2d



Unnamed: 0,SNP,a1,a2,a1_freq,bzx,bzx_se,bzx_pval,bzx_n,bzy,bzy_se,bzy_pval,bzy_n
0,rs545409685,C,T,0.002660,0.116773,0.071129,0.100651,339345,0.038626,0.100036,0.699410,336074
1,7:27869261_CAGTA_C,C,CAGTA,0.001502,0.036830,0.098431,0.708277,339345,0.108386,0.140165,0.439362,336074
2,rs73075348,G,A,0.056628,-0.000204,0.015226,0.989316,339345,0.016225,0.021522,0.450928,336074
3,rs6948467,A,G,0.392766,-0.002206,0.007223,0.760013,339345,-0.020544,0.010241,0.044843,336074
4,rs73075354,G,C,0.116197,0.012311,0.011001,0.263089,339345,0.038589,0.015628,0.013539,336074
...,...,...,...,...,...,...,...,...,...,...,...,...
2060,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.157251,0.000601,0.009892,0.951580,339345,-0.007129,0.014019,0.611093,336074
2061,rs188426589,A,T,0.012592,-0.004086,0.031982,0.898333,339345,-0.013464,0.045338,0.766489,336074
2062,rs6944995,G,T,0.853307,0.006399,0.009963,0.520704,339345,0.013798,0.014138,0.329086,336074
2063,rs192297723,C,A,0.011635,0.013026,0.035082,0.710402,339345,-0.051344,0.049307,0.297730,336074
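
To make the roles of `bzx` and `bzy` concrete, the sketch below computes a single-SNP ratio estimate with a first-order delta-method standard error. It illustrates the arithmetic only and is not GSMR's estimator, which combines multiple instruments, accounts for LD between them, and filters pleiotropic outliers with HEIDI. The function name `wald_ratio` is hypothetical; the input values are the asthma and T2D estimates for rs10245867, which appears among the genome-wide significant variants for both traits later in this notebook.

```python
import math

def wald_ratio(bzx, bzx_se, bzy, bzy_se):
    """Single-SNP ratio estimate b_xy = bzy / bzx.

    The standard error is a first-order delta-method approximation that
    ignores any covariance between bzx and bzy (an assumption of this sketch).
    """
    bxy = bzy / bzx
    se = abs(bxy) * math.sqrt((bzx_se / bzx) ** 2 + (bzy_se / bzy) ** 2)
    return bxy, se

# rs10245867: asthma as exposure (bzx), T2D as outcome (bzy).
bxy, se = wald_ratio(bzx=-0.0470728, bzx_se=0.00744926,
                     bzy=-0.0634964, bzy_se=0.0105348)
print(f"b_xy = {bxy:.3f}, se = {se:.3f}")
```

Because both effects refer to the same allele, the ratio is unchanged if the two signs are flipped together.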


In [12]:
## Prepare waist_t2d sumdata for GSMR
waist_v_t2d = pd.merge(waist_regenie, t2d_regenie, how="inner", on=["SNP", "a1", "a2"]).drop(["t2d_a1_freq"], axis=1)
waist_v_t2d = waist_v_t2d.rename(columns={"waist_a1_freq":"a1_freq", "waist_beta":"bzx", "waist_se":"bzx_se", "waist_pval":"bzx_pval", "waist_n":"bzx_n", "t2d_beta":"bzy", "t2d_se":"bzy_se", "t2d_pval":"bzy_pval", "t2d_n":"bzy_n"})
waist_v_t2d


Unnamed: 0,SNP,a1,a2,a1_freq,bzx,bzx_se,bzx_pval,bzx_n,bzy,bzy_se,bzy_pval,bzy_n
0,rs545409685,C,T,0.002660,0.000830,0.009980,0.933687,365499,0.038626,0.100036,0.699410,336074
1,7:27869261_CAGTA_C,C,CAGTA,0.001494,-0.018772,0.013917,0.177384,365499,0.108386,0.140165,0.439362,336074
2,rs73075348,G,A,0.056610,-0.002467,0.002139,0.248720,365499,0.016225,0.021522,0.450928,336074
3,rs6948467,A,G,0.392772,-0.002577,0.001015,0.011129,365499,-0.020544,0.010241,0.044843,336074
4,rs73075354,G,C,0.116173,0.006933,0.001547,0.000007,365499,0.038589,0.015628,0.013539,336074
...,...,...,...,...,...,...,...,...,...,...,...,...
2058,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.157264,-0.001481,0.001391,0.287144,365499,-0.007129,0.014019,0.611093,336074
2059,rs188426589,A,T,0.012548,-0.008732,0.004520,0.053361,365499,-0.013464,0.045338,0.766489,336074
2060,rs6944995,G,T,0.853189,0.003185,0.001400,0.022909,365499,0.013798,0.014138,0.329086,336074
2061,rs192297723,C,A,0.011681,-0.002544,0.004923,0.605312,365499,-0.051344,0.049307,0.297730,336074


In [13]:
# Prepare waist_asthma sumdata for GSMR
waist_v_asthma = pd.merge(waist_regenie, asthma_regenie, how="inner", on=["SNP", "a1", "a2"]).drop(["asthma_a1_freq"], axis=1)
waist_v_asthma = waist_v_asthma.rename(columns={"waist_a1_freq":"a1_freq", "waist_beta":"bzx", "waist_se":"bzx_se", "waist_pval":"bzx_pval", "waist_n":"bzx_n", "asthma_beta":"bzy", "asthma_se":"bzy_se", "asthma_pval":"bzy_pval", "asthma_n":"bzy_n"})
waist_v_asthma

Unnamed: 0,SNP,a1,a2,a1_freq,bzx,bzx_se,bzx_pval,bzx_n,bzy,bzy_se,bzy_pval,bzy_n
0,rs545409685,C,T,0.002660,0.000830,0.009980,0.933687,365499,0.116773,0.071129,0.100651,339345
1,7:27869261_CAGTA_C,C,CAGTA,0.001494,-0.018772,0.013917,0.177384,365499,0.036830,0.098431,0.708277,339345
2,rs73075348,G,A,0.056610,-0.002467,0.002139,0.248720,365499,-0.000204,0.015226,0.989316,339345
3,rs6948467,A,G,0.392772,-0.002577,0.001015,0.011129,365499,-0.002206,0.007223,0.760013,339345
4,rs73075354,G,C,0.116173,0.006933,0.001547,0.000007,365499,0.012311,0.011001,0.263089,339345
...,...,...,...,...,...,...,...,...,...,...,...,...
2059,7:28273623_TTTCCTTCCTTCC_T,T,TTTCCTTCCTTCC,0.157264,-0.001481,0.001391,0.287144,365499,0.000601,0.009892,0.951580,339345
2060,rs188426589,A,T,0.012548,-0.008732,0.004520,0.053361,365499,-0.004086,0.031982,0.898333,339345
2061,rs6944995,G,T,0.853189,0.003185,0.001400,0.022909,365499,0.006399,0.009963,0.520704,339345
2062,rs192297723,C,A,0.011681,-0.002544,0.004923,0.605312,365499,0.013026,0.035082,0.710402,339345
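
The merged tables have slightly fewer rows than the per-trait inputs, so the inner merges drop some variants. One way to see which ones is an outer merge with pandas' `indicator=True` option. The sketch below uses made-up toy frames (`left`, `right`) rather than the real data.

```python
import pandas as pd

# Toy tables: rs1 matches, rs2 has swapped alleles, rs3 and rs4 are unique to one side.
left = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3"],
    "a1": ["A", "C", "G"],
    "a2": ["G", "T", "A"],
    "x_beta": [0.1, 0.2, 0.3],
})
right = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs4"],
    "a1": ["A", "T", "C"],
    "a2": ["G", "C", "T"],
    "y_beta": [0.5, 0.6, 0.7],
})

# how="outer" keeps every row; the added _merge column records where each row came from.
audit = pd.merge(left, right, how="outer", on=["SNP", "a1", "a2"], indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["SNP", "a1", "a2", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones an inner merge on `SNP`, `a1` and `a2` would discard, including variants whose alleles are listed in the opposite order.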


In [14]:
# Save GSMR data
asthma_v_t2d.to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_v_t2d_gsmr_data", sep=" ", index=False)
waist_v_t2d.to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_v_t2d_gsmr_data", sep=" ", index=False)
waist_v_asthma.to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_v_asthma_gsmr_data", sep=" ", index=False)

In [15]:
# Save the genetic variants and effect alleles to estimate LD correlation matrix
asthma_v_t2d[["SNP", "a1"]].to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_v_t2d_snps.allele", sep=" ", header=False, index=False)
waist_v_t2d[["SNP", "a1"]].to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_v_t2d_snps.allele", sep=" ", header=False, index=False)
waist_v_asthma[["SNP", "a1"]].to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_v_asthma_snps.allele", sep=" ", header=False, index=False)
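
Before passing the saved tables to GSMR, a few basic checks can catch column or range problems early. This is a minimal sketch: the function `check_gsmr_table` is hypothetical, the column list is the one given in the comment earlier in this notebook, and the checks are illustrative rather than a complete validation.

```python
import pandas as pd

GSMR_COLS = ["SNP", "a1", "a2", "a1_freq", "bzx", "bzx_se", "bzx_pval", "bzx_n",
             "bzy", "bzy_se", "bzy_pval", "bzy_n"]

def check_gsmr_table(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    missing = [c for c in GSMR_COLS if c not in df.columns]
    if missing:
        # Without all columns the remaining checks cannot run.
        problems.append(f"missing columns: {missing}")
        return problems
    if df["SNP"].duplicated().any():
        problems.append("duplicated SNP identifiers")
    for col in ("a1_freq", "bzx_pval", "bzy_pval"):
        if not df[col].between(0, 1).all():
            problems.append(f"{col} outside [0, 1]")
    for col in ("bzx_se", "bzy_se"):
        if not (df[col] > 0).all():
            problems.append(f"{col} has non-positive values")
    return problems

# First row of the asthma_v_t2d table shown above.
toy = pd.DataFrame(
    [["rs545409685", "C", "T", 0.002660, 0.116773, 0.071129, 0.100651, 339345,
      0.038626, 0.100036, 0.699410, 336074]],
    columns=GSMR_COLS,
)
print(check_gsmr_table(toy))
```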

### 1.2 Check sumstats against the bgen file and bfile to determine the intersection of variants

* Within the jazf1 region (chr7:27868573-28273990), the sumstats files contain 2067 variants for asthma and t2d, and 2068 variants for waist
    * 1952 of the asthma and t2d variants and 1953 of the waist variants are also present in the bgen file

In [18]:
# asthma data
asthma_sumstats = pd.read_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_PC10_step2_imp.regenie_ASTHMA.regenie", sep=" ")
asthma_sumstats = asthma_sumstats[(asthma_sumstats["CHROM"] == 7) & (asthma_sumstats["GENPOS"] >= 27868573) & (asthma_sumstats["GENPOS"] <= 28273990)][["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "N", "BETA", "SE", "LOG10P"]]

# t2d data
t2d_sumstats = pd.read_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/T2D_PC10_step2_imp.regenie_T2D.regenie", sep=" ")
t2d_sumstats = t2d_sumstats[(t2d_sumstats["CHROM"] == 7) & (t2d_sumstats["GENPOS"] >= 27868573) & (t2d_sumstats["GENPOS"] <= 28273990)][["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "N", "BETA", "SE", "LOG10P"]]

# waist circumference data
waist_sumstats = pd.read_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/WC_PC10_step2_imp.regenie_WAISTcirc_invranknorm.regenie", sep=" ")
waist_sumstats = waist_sumstats[(waist_sumstats["CHROM"] == 7) & (waist_sumstats["GENPOS"] >= 27868573) & (waist_sumstats["GENPOS"] <= 28273990)][["CHROM", "GENPOS", "ID", "ALLELE0", "ALLELE1", "A1FREQ", "N", "BETA", "SE", "LOG10P"]]

# the regenie output has no A0FREQ or PVAL column, so both are calculated here

# A0FREQ: frequency of ALLELE0 (column-wise arithmetic, same values as a row-wise apply)
asthma_sumstats["A0FREQ"] = 1 - asthma_sumstats["A1FREQ"]
t2d_sumstats["A0FREQ"] = 1 - t2d_sumstats["A1FREQ"]
waist_sumstats["A0FREQ"] = 1 - waist_sumstats["A1FREQ"]

# PVAL: regenie reports LOG10P = -log10(p)
asthma_sumstats["PVAL"] = 10 ** (-asthma_sumstats["LOG10P"])
t2d_sumstats["PVAL"] = 10 ** (-t2d_sumstats["LOG10P"])
waist_sumstats["PVAL"] = 10 ** (-waist_sumstats["LOG10P"])

# keeping only relevant columns
asthma_sumstats = asthma_sumstats[["CHROM","GENPOS","ID", "ALLELE0", "ALLELE1", "A0FREQ", "BETA", "SE", "PVAL", "N"]]
t2d_sumstats = t2d_sumstats[["CHROM","GENPOS","ID", "ALLELE0", "ALLELE1", "A0FREQ", "BETA", "SE", "PVAL", "N"]]
waist_sumstats = waist_sumstats[["CHROM","GENPOS","ID", "ALLELE0", "ALLELE1", "A0FREQ", "BETA", "SE", "PVAL", "N"]]



In [20]:
# renaming columns in Sumstat format
asthma_sumstats = asthma_sumstats.rename(columns={"CHROM":"CHR", "GENPOS":"POS", "ID":"SNP", "ALLELE0":"A1", "ALLELE1":"A2", "A0FREQ":"A1FREQ", "BETA":"beta", "SE":"se", "PVAL":"p"})
t2d_sumstats = t2d_sumstats.rename(columns={"CHROM":"CHR", "GENPOS":"POS", "ID":"SNP", "ALLELE0":"A1", "ALLELE1":"A2", "A0FREQ":"A1FREQ", "BETA":"beta", "SE":"se", "PVAL":"p"})
waist_sumstats = waist_sumstats.rename(columns={"CHROM":"CHR", "GENPOS":"POS", "ID":"SNP", "ALLELE0":"A1", "ALLELE1":"A2", "A0FREQ":"A1FREQ", "BETA":"beta", "SE":"se", "PVAL":"p"})

In [21]:
# Save sumstats in jazf1 region for future use

asthma_sumstats.to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_sumstats_jazf1", sep="\t", index=False)
t2d_sumstats.to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/t2d_sumstats_jazf1", sep="\t", index=False)
waist_sumstats.to_csv("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_sumstats_jazf1", sep="\t", index=False)

In [30]:
# Build sets of variant IDs in chr:pos_Ref.allele_Alt.allele format to check the intersection of variants with the bgen file.

def chromsnp(row):
    return f"{row['CHR']}:{row['POS']}_{row['A2']}_{row['A1']}"
asthma_sumstats_chromsnp = asthma_sumstats.apply(chromsnp, axis=1)
asthma_sumstats_chromsnp = set(asthma_sumstats_chromsnp.to_list())
t2d_sumstats_chromsnp = t2d_sumstats.apply(chromsnp, axis=1)
t2d_sumstats_chromsnp = set(t2d_sumstats_chromsnp.to_list())
waist_sumstats_chromsnp = waist_sumstats.apply(chromsnp, axis=1)
waist_sumstats_chromsnp = set(waist_sumstats_chromsnp.to_list())

In [31]:
# import the variant list of the imputed (bgen) data for chr 7

bgenfile = pd.read_csv("/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_mfi_chr7_v3.txt", sep="\t", header=None)


In [32]:
bgenfile

Unnamed: 0,0,1,2,3,4,5,6,7
0,7:14808_T_C,rs555283805,14808,T,C,6.696880e-05,C,0.364399
1,7:15064_T_C,rs576737504,15064,T,C,9.846300e-04,C,0.652527
2,7:16454_C_T,rs544026442,16454,C,T,7.442310e-07,T,0.011086
3,7:16692_G_C,rs370739206,16692,G,C,0.000000e+00,G,
4,7:16712_T_G,rs373250171,16712,T,G,2.717570e-04,G,0.207113
...,...,...,...,...,...,...,...,...
5405519,7:159128544_A_C,rs183389554,159128544,A,C,7.966900e-05,C,0.308943
5405520,7:159128550_C_G,rs145893243,159128550,C,G,2.656680e-02,G,0.939808
5405521,7:159128554_C_T,rs77350961,159128554,C,T,2.120010e-04,T,0.254989
5405522,7:159128560_T_C,rs542634737,159128560,T,C,5.464860e-03,C,0.532634


In [33]:
# create the variant id set 
bgen_chromsnp = set(bgenfile[0].to_list())

In [34]:
# check the intersection between the sumstats data and bgenfile
len(asthma_sumstats_chromsnp.intersection(bgen_chromsnp))

1952

In [35]:
len(t2d_sumstats_chromsnp.intersection(bgen_chromsnp))

1952

In [36]:
len(waist_sumstats_chromsnp.intersection(bgen_chromsnp))

1953
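
Roughly 115 variants per trait do not match a bgen ID. One possible reason, stated here as an assumption and not something established by the outputs above, is that the two alleles appear in the opposite order in the ID string. The sketch below classifies each variant as matching as given, matching with swapped alleles, or missing. The function `classify_variant` and the toy IDs are hypothetical.

```python
from collections import Counter

def classify_variant(chrom, pos, allele_a, allele_b, reference_ids):
    """Return 'same', 'swapped' or 'missing' for one variant, relative to a
    set of chr:pos_allele_allele identifiers."""
    as_given = f"{chrom}:{pos}_{allele_a}_{allele_b}"
    swapped = f"{chrom}:{pos}_{allele_b}_{allele_a}"
    if as_given in reference_ids:
        return "same"
    if swapped in reference_ids:
        return "swapped"
    return "missing"

# Toy reference set standing in for the bgen variant IDs.
reference_ids = {"7:100_A_G", "7:200_C_T"}
toy_variants = [(7, 100, "A", "G"), (7, 200, "T", "C"), (7, 300, "G", "A")]

status = [classify_variant(*v, reference_ids) for v in toy_variants]
print(Counter(status))
```

If a run on the real data showed many `swapped` entries, the allele order of the marker string would be the first thing to revisit.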

In [39]:
# checking the list of significant SNPs
import numpy as np
import pandas as pd

In [40]:

asthma_sumstats_chromsnp = asthma_sumstats.apply(chromsnp, axis=1)
t2d_sumstats_chromsnp = t2d_sumstats.apply(chromsnp, axis=1)
waist_sumstats_chromsnp = waist_sumstats.apply(chromsnp, axis=1)

In [41]:
asthma_sumstats_chromsnp

6069486                7:27869098_T_C
6069487            7:27869261_CAGTA_C
6069488                7:27869377_A_G
6069489                7:27869782_G_A
6069490                7:27869794_C_G
                      ...            
6071548    7:28273623_TTTCCTTCCTTCC_T
6071549                7:28273697_T_A
6071550                7:28273719_T_G
6071551                7:28273829_A_C
6071552                7:28273986_T_C
Length: 2067, dtype: object

In [2]:
library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [1]:
# read in the jazf1-region sumstats saved earlier
asthma_sumstats <- read.table("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_sumstats_jazf1", header=TRUE, sep="\t")
t2d_sumstats <- read.table("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/t2d_sumstats_jazf1", header=TRUE, sep="\t")
waist_sumstats <- read.table("/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_sumstats_jazf1", header=TRUE, sep="\t")

In [3]:
head(asthma_sumstats)

Unnamed: 0_level_0,CHR,POS,SNP,A1,A2,A1FREQ,beta,se,p,N
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,7,27869098,rs545409685,C,T,0.00266,0.116773,0.0711289,0.1006505,339345
2,7,27869261,7:27869261_CAGTA_C,C,CAGTA,0.001502,0.0368299,0.0984305,0.7082768,339345
3,7,27869377,rs73075348,G,A,0.056628,-0.000203885,0.0152262,0.9893163,339345
4,7,27869782,rs6948467,A,G,0.392766,-0.00220631,0.00722281,0.760013,339345
5,7,27869794,rs73075354,G,C,0.116197,0.0123111,0.0110007,0.2630886,339345
6,7,27869921,rs35410592,A,C,0.009441,-0.0158857,0.0369552,0.6672952,339345


In [15]:
asthma_sumstats_p508<-filter(asthma_sumstats, p<=5.0e-08)
t2d_sumstats_p508<-filter(t2d_sumstats, p<=5.0e-08)
waist_sumstats_p508<-filter(waist_sumstats, p<=5.0e-08)

In [34]:
dim(asthma_sumstats_p508)

In [35]:
dim(t2d_sumstats_p508)

In [36]:
dim(waist_sumstats_p508)
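
For readers following along in Python, the same genome-wide significance filter can be sketched with pandas. The rows below are copied from the asthma outputs in this notebook; the variable names are hypothetical.

```python
import pandas as pd

GENOME_WIDE_P = 5.0e-08  # same threshold as the R filter above

toy = pd.DataFrame({
    "SNP": ["rs545409685", "rs10245867", "rs11771411"],
    "p": [0.1006505, 2.836416e-10, 4.262162e-08],
})

# Boolean mask keeps rows at or below the threshold, like dplyr::filter(p <= 5.0e-08).
significant = toy[toy["p"] <= GENOME_WIDE_P]
print(significant.shape)  # (rows, columns), the pandas counterpart of dim()
```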

In [14]:
head(asthma_sumstats_p508)

Unnamed: 0_level_0,CHR,POS,SNP,A1,A2,A1FREQ,beta,se,p,N
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,7,28142186,rs10245867,T,G,0.329166,-0.0470728,0.00744926,2.836416e-10,339345
2,7,28148940,rs2158624,C,T,0.115972,-0.0618923,0.0107978,1.186834e-08,339345
3,7,28149255,rs57585717,A,G,0.117971,-0.0655676,0.0107062,1.128132e-09,339345
4,7,28149761,rs11771411,C,T,0.649166,0.0402658,0.00733811,4.262162e-08,339345
5,7,28149808,rs11767776,T,G,0.651326,0.0404958,0.00734758,3.719461e-08,339345
6,7,28150689,rs10276070,C,T,0.651202,0.0403434,0.00734705,4.172247e-08,339345


In [16]:
head(t2d_sumstats_p508)

Unnamed: 0_level_0,CHR,POS,SNP,A1,A2,A1FREQ,beta,se,p,N
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,7,28142088,rs10274928,G,A,0.52218,0.0733662,0.00997015,1.936422e-13,336074
2,7,28142186,rs10245867,T,G,0.329071,-0.0634964,0.0105348,1.866767e-09,336074
3,7,28162674,rs12531540,T,C,0.495571,0.074887,0.0099791,6.258929e-14,336074
4,7,28172732,rs702814,T,C,0.506075,0.0865061,0.00997351,4.375221e-18,336074
5,7,28173522,rs702815,T,C,0.713377,0.0610795,0.0109193,2.493618e-08,336074
6,7,28174085,7:28174085_CT_C,C,CT,0.624398,0.0631097,0.0106752,3.655106e-09,336074


In [17]:
head(waist_sumstats_p508)

Unnamed: 0_level_0,CHR,POS,SNP,A1,A2,A1FREQ,beta,se,p,N
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,7,27953981,rs10275982,A,C,0.337556,-0.00658053,0.00105114,3.840785e-10,365499
2,7,27961286,rs35135622,A,T,0.340949,-0.00647235,0.00104788,6.548472e-10,365499
3,7,27962293,rs11411348,GT,G,0.339355,-0.00632878,0.00105437,1.943972e-09,365499
4,7,27970153,rs10239787,T,C,0.327382,-0.00617877,0.0010561,4.899706e-09,365499
5,7,27972593,rs4722745,T,C,0.328133,-0.00623948,0.00105547,3.388832e-09,365499
6,7,27974243,rs28507096,T,C,0.328294,-0.00622558,0.00105533,3.653003e-09,365499


In [39]:
# Save list of significant snps
write.table(asthma_sumstats_p508,"/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_sumstats_p508.txt", col.names=TRUE,row.names=FALSE, sep="\t",quote=FALSE)
write.table(t2d_sumstats_p508,"/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/t2d_sumstats_p508.txt", col.names=TRUE,row.names=FALSE, sep="\t",quote=FALSE)
write.table(waist_sumstats_p508,"/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_sumstats_p508.txt", col.names=TRUE,row.names=FALSE, sep="\t",quote=FALSE)

In [51]:
# create a marker name chr:pos_a2_a1 consistent with the bgen file to subset variants not found in the bgen file.

asthma_sumstats_marker <- asthma_sumstats %>% mutate(MARKER = paste0(CHR,":",POS,"_",A2,"_",A1))
t2d_sumstats_marker <- t2d_sumstats %>% mutate(MARKER = paste0(CHR,":",POS,"_",A2,"_",A1))
waist_sumstats_marker <- waist_sumstats %>% mutate(MARKER = paste0(CHR,":",POS,"_",A2,"_",A1))
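
The subsetting step described in the comment above can be sketched in Python with `Series.isin`. The toy table and the set `bgen_ids` are made-up stand-ins for the real sumstats and bgen variant IDs.

```python
import pandas as pd

# Toy sumstats with the column names used in the saved jazf1 files.
sumstats = pd.DataFrame({
    "CHR": [7, 7, 7],
    "POS": [100, 200, 300],
    "A1": ["G", "C", "A"],
    "A2": ["A", "T", "G"],
})

# Same chr:pos_A2_A1 pattern as the MARKER column built above.
sumstats["MARKER"] = (
    sumstats["CHR"].astype(str) + ":" + sumstats["POS"].astype(str)
    + "_" + sumstats["A2"] + "_" + sumstats["A1"]
)

bgen_ids = {"7:100_A_G", "7:300_G_A"}  # hypothetical bgen variant IDs

in_bgen = sumstats["MARKER"].isin(bgen_ids)
missing = sumstats[~in_bgen]   # variants absent from the bgen list
present = sumstats[in_bgen]    # variants that can be used downstream
print(missing)
```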

In [55]:
write.table(asthma_sumstats_marker,"/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/asthma_sumstats_marker.txt", col.names=TRUE,row.names=FALSE, sep="\t",quote=FALSE)
write.table(t2d_sumstats_marker,"/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/t2d_sumstats_marker.txt", col.names=TRUE,row.names=FALSE, sep="\t",quote=FALSE)
write.table(waist_sumstats_marker,"/mnt/mfs/statgen/bst2126/pleiotropy/JAZF1_sum/waist_sumstats_marker.txt", col.names=TRUE,row.names=FALSE, sep="\t",quote=FALSE)

In [52]:
head(asthma_sumstats_marker)

Unnamed: 0_level_0,CHR,POS,SNP,A1,A2,A1FREQ,beta,se,p,N,MARKER
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<chr>
1,7,27869098,rs545409685,C,T,0.00266,0.116773,0.0711289,0.1006505,339345,7:27869098_T_C
2,7,27869261,7:27869261_CAGTA_C,C,CAGTA,0.001502,0.0368299,0.0984305,0.7082768,339345,7:27869261_CAGTA_C
3,7,27869377,rs73075348,G,A,0.056628,-0.000203885,0.0152262,0.9893163,339345,7:27869377_A_G
4,7,27869782,rs6948467,A,G,0.392766,-0.00220631,0.00722281,0.760013,339345,7:27869782_G_A
5,7,27869794,rs73075354,G,C,0.116197,0.0123111,0.0110007,0.2630886,339345,7:27869794_C_G
6,7,27869921,rs35410592,A,C,0.009441,-0.0158857,0.0369552,0.6672952,339345,7:27869921_C_A


In [None]:
# import the bgen variant list (R uses header=FALSE, not Python's header=None)
bgenfile <- read.table("/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_mfi_chr7_v3.txt", sep="\t", header=FALSE)
cols<-c("

In [None]:
t2d_sumstats_varID <- t2d_sumstats %>% mutate(MARKER = paste0(CHR,":",POS,"_",A2,"_",A1))

In [None]:
waist_sumstats_varID <- waist_sumstats %>% mutate(MARKER = paste0(CHR,":",POS,"_",A2,"_",A1))