# Analysis of pleiotopy between Alzheimer's Disease and ARHI traits 

First of all a review of the available references is made regarding AD. Particularly for GWAS using AD and meta-analysis from different studies. 

1. Kunkle et al https://www.nature.com/articles/s41588-019-0358-2
2. Jansen et al https://pubmed.ncbi.nlm.nih.gov/30617256/
3. Wightman et al https://www.nature.com/articles/s41588-021-00921-z
4. [family history based] https://pubmed.ncbi.nlm.nih.gov/29777097/
5. Schwartzentruber et al https://pubmed.ncbi.nlm.nih.gov/33589840/ 
6. Bellenguez et al https://www.nature.com/articles/s41588-022-01024-z

**Using AD proxy case definition**

[Schwartzentruber et al 2021](https://www.nature.com/articles/s41588-020-00776-w)

**Using clinically diagnosed LOAD**

[Kunkle et al 2019](https://www.nature.com/articles/s41588-019-0358-2)

**Using clinically diagnosed/proxy AD cases**

[Belenguez et al 2022](https://www.nature.com/articles/s41588-022-01024-z)

- two stage GWAS
- 111,326 clinically diagnosed/‘proxy’ AD cases and 677,663 controls
- The European Alzheimer & Dementia Biobank  + UK Biobank
- GWAS meta-analysis results can be downloaded here https://www.ebi.ac.uk/gwas/studies/GCST90027158

The results from this meta-analysis have been downloaded to our cluster and are located here:

`/mnt/vast/hpc/csg/data_public/GWAS_sumstats/GCST90027158_buildGRCh38.tsv.gz`

## GWAS for ARHI used in the pleiotropy analysis

Parameters used for REGENIE

```
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
minMAC=4
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
```

The GWAS results are stored here (Note: these results correspond to white European individuals:

- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_f3393_500K
- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_f2247_500K
- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_f2257_500K
- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_combined_500K

# Step 1. Liftover of the meta-analysis sumtats of AD to match our GWAS results in hg19

In [None]:
module load Singularity
USER_PATH=~/working
export PATH=$HOME/miniconda3/bin:$PATH
sos run $USER_PATH/bioworkflows/GWAS/liftover.ipynb \
--cwd /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI \
--input_file /mnt/vast/hpc/csg/data_public/GWAS_sumstats/GCST90027158_buildGRCh38.tsv.gz \
--output_file GCST90027158_buildGRCh37.tsv \
--fr 'hg38' \
--to 'hg19' \
--yml_file $USER_PATH/UKBB_GWAS_dev/data/liftover.yml

# Step 2. Association analysis of the imputed data

Do the association analysis for the Hearing impairment traits with the white european individuals present in the 500K samples

In [2]:
# Common variables Columbia's cluster
UKBB_PATH=$HOME/UKBiobank
UKBB_yale=$HOME/UKBiobank_Yale_transfer
USER_PATH=$HOME/project
container_lmm=$HOME/containers/lmm.sif
container_marp=$HOME/containers/marp.sif
container_annovar=$HOME/containers/gatk4-annovar.sif
hearing_pheno_path=$UKBB_PATH/phenotype_files/hearing_impairment
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
formatFile_fastgwa=$USER_PATH/UKBB_GWAS_dev/data/fastGWA_template.yml
formatFile_bolt=$USER_PATH/UKBB_GWAS_dev/data/boltlmm_template.yml
formatFile_saige=$USER_PATH/UKBB_GWAS_dev/data/saige_template.yml
formatFile_regenie=$USER_PATH/UKBB_GWAS_dev/data/regenie_template.yml
# Workflows
lmm_sos=$USER_PATH/bioworkflows/GWAS/LMM.ipynb
anno_sos=$USER_PATH/bioworkflows/variant-annotation/annovar.ipynb
clumping_sos=$USER_PATH/bioworkflows/GWAS/LD_Clumping.ipynb
extract_sos=$USER_PATH/bioworkflows/GWAS/Region_Extraction.ipynb
snptogene_sos=$USER_PATH/UKBB_GWAS_dev/workflow/snptogene.ipynb

# LMM directories for imputed data
lmm_imp_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data
lmm_imp_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_imputed_data
lmm_imp_dir_saige=$UKBB_PATH/results/SAIGE_results/results_imputed_data
lmm_imp_dir_regenie=$UKBB_PATH/results/REGENIE_results/results_imputed_data

# LMM directories for exome data
lmm_exome_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_exome_data
lmm_exome_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_exome_data
lmm_exome_dir_saige=$UKBB_PATH/results/SAIGE_results/results_exome_data
lmm_exome_dir_regenie=$UKBB_PATH/results/REGENIE_results/results_exome_data

## LMM variables 
## Specific to Bolt_LMM
LDscoresFile=$UKBB_PATH/LDSCORE.1000G_EUR.tab.gz
geneticMapFile=$UKBB_PATH/genetic_map_hg19_withX.txt.gz
covarMaxLevels=10
numThreads=20
bgenMinMAF=0.001
bgenMinINFO=0.8
lmm_job_size=1
ylim=0

### Specific to FastGWA (depeding if you run from Yale or Columbia)
####Yale's cluster
grmFile=$UKBB_PATH/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp
####Columbia's cluster
grmFile=$UKBB_yale/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp

### Specific to SAIGE
bgenMinMAC=4
trait_type=binary
loco=TRUE
sampleCol=IID

### Specific to REGENIE
bsize=1000
lowmem=$HOME/scratch60/
lowmem_dir=$HOME/scratch60/predictions
trait=bt
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
reverse_log_p=True




## New PCA
### Exclude the samples wiht code 1415 but without the sub-category code

In [None]:
import pandas as pd
sample500k = pd.read_csv("/home/gl2776/UKBiobank/phenotype_files/HI_UKBB/ukb47922_white_460649ind.keep_id",header=None,sep=" ")
sample500k

In [None]:
exclusion = pd.read_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/with_1415_without_subcat.500k.sample_id.txt",header=0,sep="\t")
exclusion = exclusion["IID"].to_list()
len(exclusion)

In [None]:
sample500k[~sample500k[0].isin(exclusion)].shape

In [None]:
sample500k[~sample500k[0].isin(exclusion)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/ukb47922_white_456561ind_exclu_1415.keep_id",header=False,index=False,sep=" ")

In [None]:
## Keep_id for PCA
# f3393
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
# f2247
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
# f2257
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
# f2247_f2257
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)

In [None]:
# phenopca
import pandas as pd
# f3393
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Hearing_aid_f3393_expandedwhite_15601cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k.phenopca", sep='\t', index=False)
# f2247
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Hearing_difficulty_f2247_expandedwhite_110453cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k.phenopca", sep='\t', index=False)
# f2247
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Hearing_noise_f2257_expandedwhite_161443cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k.phenopca", sep='\t', index=False)
# f2247
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Combined_f2247_f2257_expandedwhite_93258cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k.phenopca", sep='\t', index=False)

### Calculate the first 2 PC's for each of these phenotypes to run the association analysis with the imputed data
#### f.3393
##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f3393
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f3393_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f3393_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

#GWAS QC variables set all of this variables to 0 to avoid doing more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f3393
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f3393_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

#### f.2247

##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f2247_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f2247_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

#GWAS QC variables set all of this variables to 0 to avoid doing more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2.

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f2247_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

#### f.2257

##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2257
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

#GWAS QC variables set all of this variables to 0 to avoid doing more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2.

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2257
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f2257_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

#### Combined f.2247 & f.2257

##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247_f2257
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f2247_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f2247_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

#GWAS QC variables set all of this variables to 0 to avoid doing more filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2.

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247_f2257
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f2247_f2257_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run  $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

### Combine the PCs in the phenotype files

Phenotype files named "070722_*_500k_PC1_PC2" are with original PCs calculated by original sample size with code 1415. "071222_*_500k_PC1_PC2" will be the new phenotype files with the newly calculated PCs

In [None]:
# f3393
import pandas as pd
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f3393/071122_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f3393', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")

In [None]:
# f2247
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247/071122_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f2247', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")
# f2257
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2257/071122_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f2257', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")
# f2247_f2257
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247_f2257/071122_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f2247_f2257', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")

## Run association analysis with imputed data for each phenotype
### 500k

#### f.3393

In [68]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f3393_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions
ref_first=True

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --ref_first $ref_first
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/home/dmc2245/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393/f3393_500K_impdata_regenie_2022-07-26.sbatch[0m
INFO: Workflow csg (ID=w2e19c78363e8a433) is executed successfully with 1 completed step.



#### f.2247

In [69]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f2247_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions
ref_first=True

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --ref_first $ref_first
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/home/dmc2245/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247/f2247_500K_impdata_regenie_2022-07-26.sbatch[0m
INFO: Workflow csg (ID=w416eb35379405fde) is executed successfully with 1 completed step.



#### f.2257

In [70]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2257
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f2257_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions
ref_first=True

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --ref_first $ref_first
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/home/dmc2245/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2257/f2257_500K_impdata_regenie_2022-07-26.sbatch[0m
INFO: Workflow csg (ID=w3ba10e625e2c3b71) is executed successfully with 1 completed step.



#### Combined f.2247 & f.2257

In [71]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247_f2257
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f2247_f2257_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions
ref_first=True

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --ref_first $ref_first
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/home/dmc2245/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247_f2257/f2247_f2257_500K_impdata_regenie_2022-07-26.sbatch[0m
INFO: Workflow csg (ID=w0f1374ff2f10b381) is executed successfully with 1 completed step.



# Step 3. LD clumping between different traits

## Format file preparation so that all summary stats match 

In [2]:
import pandas as pd
data = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37.tsv.gz', compression='gzip', sep="\t", header=0)

  msg['msg_id'] = self._parent_header['header']['msg_id']
  from collections import Sized, KeysView, Sequence


In [9]:
data.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,1:661906:G:C,0.2938,0.3171,0.3542


In [4]:
data.rename(columns={'A0':'REF', 'A1':'ALT', 'STAT':'BETA'}, inplace=True)

In [5]:
data.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,chr1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,chr1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,chr1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,chr1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,chr1:661906:G:C,0.2938,0.3171,0.3542


In [6]:
data['SNP'] = data['SNP'].str.replace('chr', '')

In [7]:
data.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,1:661906:G:C,0.2938,0.3171,0.3542


In [30]:
data.to_csv("/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz", sep='\t', index=False, header=True, compression="gzip")

### Re-format the SNP in the 500K GWAS summary stats

In [16]:

f3393 = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz', dtype=str, compression='gzip',sep="\t", header=0)

In [17]:
f3393.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,13259,A,G,rs562993331,-0.444965,0.405872,0.272939253875768
1,1,17569,A,C,rs535086049,1.19541,1.86985,0.5226211725664186
2,1,17641,A,G,rs578081284,0.435933,0.233139,0.0615063563378631
3,1,30741,A,C,rs558169846,1.28043,1.24525,0.303832341109345
4,1,52144,A,T,rs190291950,-0.117852,0.337249,0.7267499323234313


In [18]:
f3393['var'] = f3393['CHR'] + ':' + f3393['POS'] + ':' + f3393['REF'] + ':' + f3393['ALT']

In [19]:
f3393.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P,var
0,1,13259,A,G,rs562993331,-0.444965,0.405872,0.272939253875768,1:13259:A:G
1,1,17569,A,C,rs535086049,1.19541,1.86985,0.5226211725664186,1:17569:A:C
2,1,17641,A,G,rs578081284,0.435933,0.233139,0.0615063563378631,1:17641:A:G
3,1,30741,A,C,rs558169846,1.28043,1.24525,0.303832341109345,1:30741:A:C
4,1,52144,A,T,rs190291950,-0.117852,0.337249,0.7267499323234313,1:52144:A:T


In [16]:
import os
rootdir = '/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/'

for subdir, dirs, files in os.walk(rootdir):
    for filename in files:
        if filename.endswith("071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.gz"):
            print(os.path.join(subdir, filename))
            df = pd.read_csv(os.path.join(subdir, filename), dtype=str, compression='gzip',sep="\t", header=0)
            df.rename(columns={'SNP':'varid'}, inplace=True)
            df['SNP'] = df['CHR'] + ':' + df['POS'] + ':' + df['REF'] + ':' + df['ALT']
            df.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/' + filename, sep='\t', index=False, header=True, compression="gzip")
            print(df.head())

/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.gz
  CHR    POS REF ALT        varid        BETA        SE                     P  \
0   1  13259   G   A  rs562993331    0.387653  0.169082  0.023122777490193567   
1   1  17569   C   A  rs535086049   -0.331285  0.764723    0.6648627475668728   
2   1  17641   G   A  rs578081284  -0.0451544  0.103088    0.6613738204754334   
3   1  30741   C   A  rs558169846    -1.76303  0.991047  0.024469776350843825   
4   1  57222   T   C  rs576081345   0.0501817  0.114946    0.6624254356036648   

           SNP  
0  1:13259:G:A  
1  1:17569:C:A  
2  1:17641:G:A  
3  1:30741:C:A  
4  1:57222:T:C  


# Variables for LD clumping 

In [3]:
clumping_sos=~/project/bioworkflows/GWAS/LD_Clumping.ipynb
## GenoFile is not necessary to run LD clumping but it is needed by the global parameters. It won't be use in this module but you have to provide path
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
reference_genotype_prefix=~/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.nondup
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=varid
numThreads=20
clump_job_size=1
container=~/containers/lmm.sif




## Solution to empty clumped files

The empty clumped files were showing up because the varid in the reference file is rsid whereas in the summary stats files it is based on chr:pos:ref:alt
Therefore, I have decided to re-name all of the variants in the reference file with rsid for the more conventional notation CHR:POS:REF:ALT which can be acomplished using plink2

In [16]:
module load PLINK/2.0
plink2 --bfile ~/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.2000.ref_geno \
       --make-bed \
       --out ~/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.2000.ref_geno\
       --new-id-max-allele-len 662 \
       --set-all-var-ids @:#:\$r:\$a
       

PLINK v2.00a2.3LM 64-bit Intel (24 Jan 2020)   www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.2000.ref_geno.log.
Options in effect:
  --bfile /home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.2000.ref_geno
  --make-bed
  --new-id-max-allele-len 662
  --out /home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.2000.ref_geno
  --set-all-var-ids @:#:$r:$a

Start time: Thu Jul 28 13:46:07 2022
257481 MiB RAM detected; reserving 128740 MiB for main workspace.
Allocated 72416 MiB successfully, after larger attempt(s) failed.
Using up to 64 threads (change this with --threads).
2000 samples (1096 females, 904 males; 2000 founders) loaded from
/home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.2000.ref_geno.fam.
92457702 variants loaded

There was still an error because some of the variants after renaming them were duplicated having the same ref/alt alleles. Therefore, I proceded to remove all the duplicated variants

In [31]:
module load PLINK/2.0
plink2 --bfile ~/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.2000.ref_geno \
        --rm-dup  'force-first' 'list' --make-bed \
        --out ~/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.nondup.2000.ref_geno

PLINK v1.90b4.9 64-bit (13 Oct 2017)           www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --bfile /home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.2000.ref_geno
  --list-duplicate-vars require-same-ref
  --memory 60000

Segmentation fault (core dumped)



## AD_meta and f3393

In [19]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f3393_ADmeta
clumping_sbatch=$clumping_dir/f3393_ADmeta-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080722_f3393_ADmeta/f3393_ADmeta-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=w08dba5e8d7034f9d) is executed successfully with 1 completed step.



### Check overlap between AD sumstats and f3393 sumstats

In [1]:
import pandas as pd
file = "/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz"
AD = pd.read_csv(file,compression="gzip",header=0,sep="\t")

  msg['msg_id'] = self._parent_header['header']['msg_id']


In [2]:
AD.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,1:661906:G:C,0.2938,0.3171,0.3542


In [11]:
#Number of hits that are genome-wide significant
AD_sig=AD[AD['P'] < 5e-08]

In [3]:
file='/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz'
f3393 = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [4]:
f3393.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,A,G,rs562993331,-0.444965,0.405872,0.272939,1:13259:A:G
1,1,17569,A,C,rs535086049,1.19541,1.86985,0.522621,1:17569:A:C
2,1,17641,A,G,rs578081284,0.435933,0.233139,0.061506,1:17641:A:G
3,1,30741,A,C,rs558169846,1.28043,1.24525,0.303832,1:30741:A:C
4,1,52144,A,T,rs190291950,-0.117852,0.337249,0.72675,1:52144:A:T


In [12]:
f3393_sig=f3393[f3393['P'] < 5e-08]

In [13]:
AD_sig[AD_sig['SNP'].isin(f3393_sig['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
8200715,6,32654502,G,A,6:32654502:G:A,0.0798,0.0123,1.004e-10
8201118,6,32665728,G,A,6:32665728:G:A,0.0766,0.0121,2.682e-10
8201135,6,32666173,G,A,6:32666173:G:A,0.0795,0.0123,1.218e-10
8201140,6,32666295,G,A,6:32666295:G:A,0.0793,0.0123,1.323e-10
8201173,6,32666875,G,A,6:32666875:G:A,0.076,0.0122,3.967e-10
8201174,6,32666899,G,A,6:32666899:G:A,0.0761,0.0122,3.775e-10
8201177,6,32666968,C,T,6:32666968:C:T,0.076,0.0122,4.034e-10
8201179,6,32667006,C,CATG,6:32667006:C:CATG,0.079,0.0123,1.423e-10
8201192,6,32667343,T,A,6:32667343:T:A,0.0727,0.0121,2.167e-09
8201204,6,32667895,G,C,6:32667895:G:C,0.0778,0.0122,2.088e-10


In [14]:
f3393_sig[f3393_sig['SNP'].isin(AD_sig['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
8515913,6,32654502,G,A,rs9275183,-0.105956,0.018669,1.915226e-08,6:32654502:G:A
8516436,6,32665728,G,A,rs9275312,-0.102268,0.018389,3.606783e-08,6:32665728:G:A
8516462,6,32666173,G,A,rs9275318,-0.106932,0.018668,1.418208e-08,6:32666173:G:A
8516467,6,32666295,G,A,rs9275319,-0.106943,0.018673,1.427414e-08,6:32666295:G:A
8516498,6,32666875,G,A,rs9275330,-0.104214,0.018446,2.200645e-08,6:32666875:G:A
8516499,6,32666899,G,A,rs9275331,-0.104226,0.018446,2.191341e-08,6:32666899:G:A
8516502,6,32666968,C,T,rs9275333,-0.104285,0.018445,2.151642e-08,6:32666968:C:T
8516504,6,32667006,C,CATG,rs760571019,-0.10424,0.018448,2.189626e-08,6:32667006:C:CATG
8516516,6,32667343,T,A,rs9275338,-0.108311,0.018419,5.866648e-09,6:32667343:T:A
8516550,6,32667895,G,C,rs9275358,-0.104878,0.018604,2.36249e-08,6:32667895:G:C


In [5]:
AD[AD['SNP'].isin(f3393['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
16,1,693731,G,A,1:693731:G:A,-0.0243,0.0161,0.13180
87,1,730087,C,T,1:730087:C:T,-0.0078,0.0225,0.72810
93,1,731718,C,T,1:731718:C:T,-0.0295,0.0155,0.05685
94,1,732032,C,A,1:732032:C:A,-0.0343,0.0162,0.03397
114,1,734349,C,T,1:734349:C:T,-0.0308,0.0155,0.04682
...,...,...,...,...,...,...,...,...
21101031,22,51219387,C,T,22:51219387:C:T,0.0004,0.0168,0.97900
21101049,22,51221731,C,T,22:51221731:C:T,0.0021,0.0168,0.90140
21101081,22,51229805,C,T,22:51229805:C:T,-0.0001,0.0169,0.99430
21101103,22,51236013,AT,A,22:51236013:AT:A,0.0078,0.0110,0.47460


In [6]:
f3393[f3393['SNP'].isin(AD['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
32,1,693731,G,A,rs12238997,-0.017106,0.020979,0.414868,1:693731:G:A
53,1,730087,C,T,rs148120343,0.008566,0.029290,0.769938,1:730087:C:T
54,1,731718,C,T,rs58276399,-0.015556,0.019901,0.434411,1:731718:C:T
56,1,732032,C,A,rs61770163,-0.025332,0.021278,0.233832,1:732032:C:A
57,1,734349,C,T,rs141242758,-0.015371,0.019910,0.440108,1:734349:C:T
...,...,...,...,...,...,...,...,...,...
21735882,22,51219387,C,T,rs9616832,-0.040643,0.024311,0.094563,22:51219387:C:T
21735893,22,51221731,C,T,rs115055839,-0.043280,0.024333,0.075297,22:51221731:C:T
21735917,22,51229805,C,T,rs9616985,-0.043153,0.024427,0.077291,22:51229805:C:T
21735922,22,51236013,AT,A,rs200507571,0.002709,0.016160,0.866861,22:51236013:AT:A


In [7]:
len(f3393['SNP'])

21735926

In [8]:
len(AD['SNP'])

21101114

## AD_meta and f2247

In [21]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2247_ADmeta
clumping_sbatch=$clumping_dir/f2247_ADmeta-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz  /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2247_ADmeta/f2247_ADmeta-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=we08732f2bd56b244) is executed successfully with 1 completed step.



### Check overalp between AD sumstats and f2247 sumstats

In [1]:
import pandas as pd
file = "/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz"
AD = pd.read_csv(file,compression="gzip",header=0,sep="\t")

  msg['msg_id'] = self._parent_header['header']['msg_id']


In [2]:
AD.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,1:661906:G:C,0.2938,0.3171,0.3542


In [11]:
#Number of hits that are genome-wide significant
AD_sig=AD[AD['P'] < 5e-08]

In [18]:
file='/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.gz'
f2247 = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [19]:
f2247.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,A,G,rs562993331,-0.387653,0.169082,0.023123,1:13259:A:G
1,1,17569,A,C,rs535086049,0.331285,0.764723,0.664863,1:17569:A:C
2,1,17641,A,G,rs578081284,0.045154,0.103088,0.661374,1:17641:A:G
3,1,30741,A,C,rs558169846,1.76303,0.991047,0.02447,1:30741:A:C
4,1,57222,C,T,rs576081345,-0.050182,0.114946,0.662425,1:57222:C:T


In [20]:
f2247_sig=f2247[f2247['P'] < 5e-08]

In [21]:
AD_sig[AD_sig['SNP'].isin(f2247_sig['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
8199044,6,32577380,G,A,6:32577380:G:A,0.0729,0.0103,1.829e-12
8199110,6,32578590,G,A,6:32578590:G:A,0.0736,0.0104,1.362e-12
8199369,6,32584693,G,C,6:32584693:G:C,0.0747,0.0105,9.003e-13
8200715,6,32654502,G,A,6:32654502:G:A,0.0798,0.0123,1.004e-10
8201118,6,32665728,G,A,6:32665728:G:A,0.0766,0.0121,2.682e-10
8201135,6,32666173,G,A,6:32666173:G:A,0.0795,0.0123,1.218e-10
8201140,6,32666295,G,A,6:32666295:G:A,0.0793,0.0123,1.323e-10
8201173,6,32666875,G,A,6:32666875:G:A,0.076,0.0122,3.967e-10
8201174,6,32666899,G,A,6:32666899:G:A,0.0761,0.0122,3.775e-10
8201177,6,32666968,C,T,6:32666968:C:T,0.076,0.0122,4.034e-10


In [22]:
f2247_sig[f2247_sig['SNP'].isin(AD_sig['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
8555319,6,32577380,G,A,rs660895,-0.038348,0.006534,4.579204e-09,6:32577380:G:A
8555404,6,32578590,G,A,rs3997868,-0.037369,0.006557,1.251065e-08,6:32578590:G:A
8555800,6,32584693,G,C,rs510205,-0.038854,0.006549,3.10313e-09,6:32584693:G:C
8561048,6,32654502,G,A,rs9275183,-0.046044,0.008126,1.546002e-08,6:32654502:G:A
8561570,6,32665728,G,A,rs9275312,-0.044248,0.007994,3.2867e-08,6:32665728:G:A
8561596,6,32666173,G,A,rs9275318,-0.046374,0.008125,1.216942e-08,6:32666173:G:A
8561601,6,32666295,G,A,rs9275319,-0.046472,0.008127,1.144748e-08,6:32666295:G:A
8561632,6,32666875,G,A,rs9275330,-0.045466,0.008023,1.541452e-08,6:32666875:G:A
8561633,6,32666899,G,A,rs9275331,-0.04553,0.008023,1.469332e-08,6:32666899:G:A
8561636,6,32666968,C,T,rs9275333,-0.045572,0.008023,1.425706e-08,6:32666968:C:T


## AD_meta and f2257

In [22]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2257_ADmeta
clumping_sbatch=$clumping_dir/f2257_ADmeta-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz  /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2257_ADmeta/f2257_ADmeta-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=w1932b2f70073779a) is executed successfully with 1 completed step.



### Check overalp between AD sumstats and f2257 sumstats

In [1]:
import pandas as pd
file = "/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz"
AD = pd.read_csv(file,compression="gzip",header=0,sep="\t")

  msg['msg_id'] = self._parent_header['header']['msg_id']


In [2]:
AD.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,1:661906:G:C,0.2938,0.3171,0.3542


In [11]:
#Number of hits that are genome-wide significant
AD_sig=AD[AD['P'] < 5e-08]

In [25]:
file='/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.gz'
f2257 = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [28]:
f2257.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,A,G,rs562993331,-0.29128,0.157364,0.06417,1:13259:A:G
1,1,17569,A,C,rs535086049,0.301553,0.68814,0.661231,1:17569:A:C
2,1,17641,A,G,rs578081284,0.010873,0.090079,0.903924,1:17641:A:G
3,1,30741,A,C,rs558169846,0.665389,0.529693,0.209051,1:30741:A:C
4,1,57222,C,T,rs576081345,0.052396,0.102875,0.610529,1:57222:C:T


In [31]:
f2257_sig=f2257[f2257['P'] < 5e-08]

In [34]:
AD_res=AD_sig[AD_sig['SNP'].isin(f2257_sig['SNP'])]
AD_res.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_gwashits_in_f2257imp.tsv' , sep='\t', index=False, header=True)

In [35]:
f2257_res=f2257_sig[f2257_sig['SNP'].isin(AD_sig['SNP'])]
f2257_res.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/f2257imp_gwashits_in_AD.tsv' , sep='\t', index=False, header=True)

## AD_meta and f2247_f2257

In [23]:
# # Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_combined_ADmeta
clumping_sbatch=$clumping_dir/combined_ADmeta-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_combined_ADmeta/combined_ADmeta-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=w807d8b080b4cc717) is executed successfully with 1 completed step.



### Check overlap between ADmeta sumstats and combined sumstats

In [1]:
import pandas as pd
file = "/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz"
AD = pd.read_csv(file,compression="gzip",header=0,sep="\t")

  msg['msg_id'] = self._parent_header['header']['msg_id']


In [2]:
AD.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,529825,C,T,1:529825:C:T,-0.0152,0.3236,0.9626
1,1,531142,C,CTG,1:531142:C:CTG,-0.2494,0.2247,0.267
2,1,566327,G,A,1:566327:G:A,0.0975,0.432,0.8214
3,1,581537,G,A,1:581537:G:A,-0.3028,0.318,0.3411
4,1,661906,G,C,1:661906:G:C,0.2938,0.3171,0.3542


In [11]:
#Number of hits that are genome-wide significant
AD_sig=AD[AD['P'] < 5e-08]

In [37]:
file='/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.gz'
comb = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [39]:
comb.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,A,G,rs562993331,-0.439782,0.176722,0.014219,1:13259:A:G
1,1,17569,A,C,rs535086049,0.082381,0.804463,0.918436,1:17569:A:C
2,1,17641,A,G,rs578081284,0.073457,0.109471,0.502211,1:17641:A:G
3,1,30741,A,C,rs558169846,1.58704,0.987036,0.045833,1:30741:A:C
4,1,57222,C,T,rs576081345,-0.030952,0.122759,0.800936,1:57222:C:T


In [40]:
comb_sig=comb[comb['P'] < 5e-08]

In [41]:
AD_sig[AD_sig['SNP'].isin(comb_sig['SNP'])]
#AD_res.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_gwashits_in_f2257imp.tsv' , sep='\t', index=False, header=True)

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
8199044,6,32577380,G,A,6:32577380:G:A,0.0729,0.0103,1.829e-12
8199369,6,32584693,G,C,6:32584693:G:C,0.0747,0.0105,9.003e-13
18705003,17,43691377,C,T,17:43691377:C:T,0.0591,0.0101,4.438e-09
18705640,17,43781105,G,A,17:43781105:G:A,0.0534,0.0098,4.805e-08
18705643,17,43781426,C,T,17:43781426:C:T,0.0534,0.0098,4.623e-08
18705734,17,43789640,C,A,17:43789640:C:A,0.0534,0.0098,4.714e-08
18705774,17,43792358,G,A,17:43792358:G:A,0.0534,0.0098,4.648e-08
18705778,17,43792586,C,A,17:43792586:C:A,0.0533,0.0098,4.792e-08
18705781,17,43792895,G,T,17:43792895:G:T,0.0533,0.0098,4.829e-08
18705782,17,43792896,C,T,17:43792896:C:T,0.0533,0.0098,4.749e-08


In [42]:
comb_sig[comb_sig['SNP'].isin(AD_sig['SNP'])]
#f2257_res.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/f2257imp_gwashits_in_AD.tsv' , sep='\t', index=False, header=True)

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
8543520,6,32577380,G,A,rs660895,-0.038974,0.006932,1.974104e-08,6:32577380:G:A
8544002,6,32584693,G,C,rs510205,-0.039528,0.006947,1.335611e-08,6:32584693:G:C
19452908,17,43691377,C,T,rs382362,-0.040797,0.007445,4.452665e-08,17:43691377:C:T
19453605,17,43781105,G,A,rs968028,-0.037328,0.006795,4.097698e-08,17:43781105:G:A
19453608,17,43781426,C,T,rs62056879,-0.037083,0.006793,4.988845e-08,17:43781426:C:T
19453696,17,43789640,C,A,rs62056930,-0.037202,0.00679,4.442833e-08,17:43789640:C:A
19453736,17,43792358,G,A,rs56160448,-0.037528,0.006793,3.439537e-08,17:43792358:G:A
19453741,17,43792586,C,A,rs1880749,-0.037438,0.006791,3.682392e-08,17:43792586:C:A
19453742,17,43792895,G,T,rs1568951,-0.037377,0.006791,3.877484e-08,17:43792895:G:T
19453743,17,43792896,C,T,rs1568950,-0.037377,0.006791,3.877484e-08,17:43792896:C:T


# Bivariate LD_clumping analysis

**Important Note:** please take into account that the number of clumps in the log file can (and possibly will) be different (higher) that the number of regions in the `*.clumped_regions` file because there will be some variants that are not clumped with any other and thus do not create a region. 

## AD_proxy data create format compatible with LD_clumping pipeline

The file was provided by Andy DeWan from Yale cluster `/gpfs/gibbs/pi/dewan/data/UKBiobank/results/REGENIE_results/results_imputed_data/AD` 

In [None]:
file = pd.read_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt')

In [38]:
from pathlib import Path
formatFile= Path('~/project/UKBB_GWAS_dev/data/regenie_template.yml')
import gzip
import pandas as pd
file = pd.read_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt', header=0,delim_whitespace=True, quotechar='"' )
if formatFile.is_file():
    output = '~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all' + '_original_columns.snp_stats.gz'
else:
    output = '~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt' + '.snp_stats.gz'

In [18]:
file.head()

Unnamed: 0,CHROM,GENPOS,ID,ALLELE0,ALLELE1,A1FREQ,INFO,N,TEST,BETA,SE,CHISQ,LOG10P,EXTRA,P
0,1,13259,rs562993331,G,A,0.000268,0.845414,460649,ADD,0.158113,0.203616,0.602991,0.359083,,0.437438
1,1,17569,rs535086049,C,A,1.2e-05,0.843954,460649,ADD,1.40702,0.719964,3.5791,1.23276,,0.058511
2,1,17641,rs578081284,G,A,0.000799,0.85363,460649,ADD,0.078532,0.119229,0.433839,0.292335,,0.510111
3,1,30741,rs558169846,C,A,2.2e-05,0.903759,460649,ADD,-1.31449,0.771762,2.90101,1.05294,,0.088524
4,1,57222,rs576081345,T,C,0.000622,0.837846,460649,ADD,0.291624,0.126625,5.00898,1.59832,,0.025216


In [19]:
file.to_csv(output, compression='gzip', sep='\t', header = True, index = False)

In [37]:
 # unify output format
import pandas as pd
from pathlib import Path
formatFile = Path('/home/dmc2245/project/UKBB_GWAS_dev/data/regenie_template.yml')
reverse_log_p = True
if formatFile.is_file() or reverse_log_p:
    sumstats = pd.read_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt', header=0,delim_whitespace=True, quotechar='"') 
    if formatFile.is_file():
        import yaml
        config = yaml.safe_load(open(formatFile, 'r'))
    try:
        sumstats = sumstats.loc[:,list(config.values())]
    except:
        raise ValueError(f'According to formatFile, input summary statistics should have the following columns: {list(config.values())}.')
    sumstats.columns = list(config.keys())
    if reverse_log_p:
            sumstats['P'] = sumstats['P'].apply(lambda row: 10**-row)
    sumstats.to_csv("~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.reheader.snp_stats.gz", compression='gzip', sep='\t', header = True, index = False)

In [38]:
sumstats.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,13259,G,A,rs562993331,0.158113,0.203616,0.437438
1,1,17569,C,A,rs535086049,1.40702,0.719964,0.058511
2,1,17641,G,A,rs578081284,0.078532,0.119229,0.510111
3,1,30741,C,A,rs558169846,-1.31449,0.771762,0.088524
4,1,57222,T,C,rs576081345,0.291624,0.126625,0.025216


### AD_proxy and f3393

In [45]:
import pandas as pd
data = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.reheader.snp_stats.gz', compression='gzip', sep="\t", header=0, dtype=str)

In [46]:
data.head()

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P
0,1,13259,G,A,rs562993331,0.158113,0.203616,0.4374384963021345
1,1,17569,C,A,rs535086049,1.40702,0.719964,0.0585113340398525
2,1,17641,G,A,rs578081284,0.0785319,0.119229,0.5101113656100561
3,1,30741,C,A,rs558169846,-1.31449,0.771762,0.0885237901518778
4,1,57222,T,C,rs576081345,0.291624,0.126625,0.025216220879824


In [47]:
data.rename(columns={'SNP':'varid'}, inplace=True)
data['SNP'] = data['CHR'] + ':' + data['POS'] + ':' + data['REF'] + ':' + data['ALT']
data.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.final.snp_stats.gz', sep='\t', index=False, header=True, compression="gzip")

In [48]:
data.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,G,A,rs562993331,0.158113,0.203616,0.4374384963021345,1:13259:G:A
1,1,17569,C,A,rs535086049,1.40702,0.719964,0.0585113340398525,1:17569:C:A
2,1,17641,G,A,rs578081284,0.0785319,0.119229,0.5101113656100561,1:17641:G:A
3,1,30741,C,A,rs558169846,-1.31449,0.771762,0.0885237901518778,1:30741:C:A
4,1,57222,T,C,rs576081345,0.291624,0.126625,0.025216220879824,1:57222:T:C


In [4]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f3393_ADproxy
clumping_sbatch=$clumping_dir/f3393_ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f3393_ADproxy/f3393_ADproxy-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=wf767271d8665d92f) is executed successfully with 1 completed step.



### Check overlap between AD proxy sumstats and f3393 sumstats

In [51]:
import pandas as pd
file = "/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.final.snp_stats.gz"
ADproxy = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [52]:
ADproxy.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,G,A,rs562993331,0.158113,0.203616,0.437438,1:13259:G:A
1,1,17569,C,A,rs535086049,1.40702,0.719964,0.058511,1:17569:C:A
2,1,17641,G,A,rs578081284,0.078532,0.119229,0.510111,1:17641:G:A
3,1,30741,C,A,rs558169846,-1.31449,0.771762,0.088524,1:30741:C:A
4,1,57222,T,C,rs576081345,0.291624,0.126625,0.025216,1:57222:T:C


In [61]:
#Number of hits that are genome-wide significant
ADproxy[ADproxy['P'] < 5e-08]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
1458344,1,207684192,T,G,rs12037841,-0.046865,0.008055,6.716299e-09,1:207684192:T:G
1458360,1,207685965,A,C,rs4562624,-0.047314,0.008047,4.653932e-09,1:207685965:A:C
1458405,1,207692049,A,G,rs6656401,-0.048239,0.008065,2.519997e-09,1:207692049:A:G
1458559,1,207747296,A,G,rs1752684,-0.045510,0.008029,1.613095e-08,1:207747296:A:G
1458600,1,207750568,T,C,rs679515,-0.048780,0.008080,1.791884e-09,1:207750568:T:C
...,...,...,...,...,...,...,...,...,...
21104267,8,27466912,G,GA,rs11449170,0.044635,0.006392,2.747894e-12,8:27466912:G:GA
21104276,8,27467686,C,T,rs9331896,0.044401,0.006381,3.305978e-12,8:27467686:C:T
21104279,8,27467821,C,G,rs2070926,0.045557,0.006390,9.600636e-13,8:27467821:C:G
21104286,8,27468503,C,A,rs867230,0.045308,0.006406,1.451443e-12,8:27468503:C:A


In [56]:
file='/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz'
f3393 = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [57]:
f3393.head()

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
0,1,13259,A,G,rs562993331,-0.444965,0.405872,0.272939,1:13259:A:G
1,1,17569,A,C,rs535086049,1.19541,1.86985,0.522621,1:17569:A:C
2,1,17641,A,G,rs578081284,0.435933,0.233139,0.061506,1:17641:A:G
3,1,30741,A,C,rs558169846,1.28043,1.24525,0.303832,1:30741:A:C
4,1,52144,A,T,rs190291950,-0.117852,0.337249,0.72675,1:52144:A:T


In [62]:
f3393[f3393['P'] < 5e-08]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
2181439,2,54687254,T,C,rs13026575,-0.091392,0.016453,3.519978e-08,2:54687254:T:C
2181496,2,54693754,G,T,rs10169422,-0.089591,0.015662,1.338474e-08,2:54693754:G:T
2181535,2,54697387,A,AT,2:54697387_AT_A,-0.081630,0.014796,4.111308e-08,2:54697387:A:AT
2181560,2,54699021,C,T,rs9677271,-0.093448,0.016656,2.598723e-08,2:54699021:C:T
2181562,2,54699067,A,AACTTC,rs746705223,-0.093255,0.016662,2.809959e-08,2:54699067:A:AACTTC
...,...,...,...,...,...,...,...,...,...
21733696,22,50956117,A,G,rs200126237,-0.909770,0.141901,3.488751e-09,22:50956117:A:G
21734140,22,50993633,T,C,22:50993633_C_T,-1.425390,0.175632,2.767579e-13,22:50993633:T:C
21734565,22,51032411,G,A,rs776344563,-1.469330,0.183468,5.858683e-13,22:51032411:G:A
21735509,22,51146132,C,CAAATA,22:51146132_CAAATA_C,-1.262120,0.186638,8.077370e-10,22:51146132:C:CAAATA


In [65]:
ADproxy_sig[ADproxy_sig['varid'].isin(f3393_sig['varid'])]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
18189710,6,32665728,A,G,rs9275312,-0.053307,0.009735,3.834864e-08,6:32665728:A:G
18189768,6,32666802,C,T,rs9275327,-0.053083,0.009776,4.980696e-08,6:32666802:C:T
18189770,6,32666822,C,T,rs9275328,-0.053097,0.009776,4.938784e-08,6:32666822:C:T
18189814,6,32667638,G,A,rs9275348,-0.053767,0.00984,4.097227e-08,6:32667638:G:A
18189825,6,32667895,C,G,rs9275358,-0.054177,0.00986,3.447625e-08,6:32667895:C:G
18189831,6,32667957,C,T,rs9275362,-0.054326,0.009988,4.72683e-08,6:32667957:C:T
18189841,6,32668125,G,A,rs9275365,-0.053562,0.009758,3.556805e-08,6:32668125:G:A


In [66]:
# Issues with REF/ALT alleles. GWAS was run again with --ref_first to account for these differences. 
f3393_sig[f3393_sig['varid'].isin(ADproxy_sig['varid'])]

Unnamed: 0,CHR,POS,REF,ALT,varid,BETA,SE,P,SNP
8516436,6,32665728,G,A,rs9275312,-0.102268,0.018389,3.606783e-08,6:32665728:G:A
8516493,6,32666802,T,C,rs9275327,-0.104226,0.018451,2.210192e-08,6:32666802:T:C
8516495,6,32666822,T,C,rs9275328,-0.104121,0.018451,2.28197e-08,6:32666822:T:C
8516539,6,32667638,A,G,rs9275348,-0.102989,0.018589,4.056113e-08,6:32667638:A:G
8516550,6,32667895,G,C,rs9275358,-0.104878,0.018604,2.36249e-08,6:32667895:G:C
8516555,6,32667957,T,C,rs9275362,-0.103782,0.018842,4.881463e-08,6:32667957:T:C
8516565,6,32668125,A,G,rs9275365,-0.103314,0.018425,2.79113e-08,6:32668125:A:G


## AD_proxy and f2247

In [5]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2247_ADproxy
clumping_sbatch=$clumping_dir/f2247_ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz  /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2247_ADproxy/f2247_ADproxy-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=wb761843e0224ccac) is executed successfully with 1 completed step.



## AD proxy and f2257

In [6]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2257_ADproxy
clumping_sbatch=$clumping_dir/f2257_ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2257_ADproxy/f2257_ADproxy-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=w5fad6235b33f0392) is executed successfully with 1 completed step.



## AD proxy and f2247_f2257

In [7]:
# # Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_combined_ADproxy
clumping_sbatch=$clumping_dir/combined_ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz  /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_combined_ADproxy/combined_ADproxy-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=w79db46045a15b733) is executed successfully with 1 completed step.



## AD proxy and ADmeta 

In [18]:
# # Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_ADmeta_ADproxy
clumping_sbatch=$clumping_dir/ADmeta_ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz  /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_ADmeta_ADproxy/ADmeta_ADproxy-ldclumping_2022-08-10.sbatch[0m
INFO: Workflow csg (ID=wccf54b3709973dd2) is executed successfully with 1 completed step.



### Use the replicate option

This means that only clumps containing clumped SNPs with p2-significant results in more than one result file are shown.

In [4]:
## Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/081822_ADmeta_ADproxy_replicate
clumping_sbatch=$clumping_dir/ADmeta_ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz  /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz`
replicate=True
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=5e-06
clump_r2=0.04
clump_kb=2000
clump_annotate=BETA
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate
    --replicate $replicate
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/081822_ADmeta_ADproxy_replicate/ADmeta_ADproxy-ldclumping_2022-08-18.sbatch[0m
INFO: Workflow csg (ID=wc52702c093137b46) is executed successfully with 1 completed step.



### Find overlap between AD_meta and AD_proxy

In [None]:
import pandas as pd
file = "/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz"
AD_meta = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [3]:
AD_meta

Unnamed: 0,CHR,POS,REF,ALT,BETA,SE,P,SNP_original,SNP
0,1,529825,C,T,-0.0152,0.3236,0.9626,1:529825:C:T,1:529825:C:T
1,1,531142,C,CTG,-0.2494,0.2247,0.2670,1:531142:C:CTG,1:531142:C:CTG
2,1,566327,G,A,0.0975,0.4320,0.8214,1:566327:G:A,1:566327:G:A
3,1,581537,G,A,-0.3028,0.3180,0.3411,1:581537:G:A,1:581537:G:A
4,1,661906,G,C,0.2938,0.3171,0.3542,1:661906:G:C,1:661906:G:C
...,...,...,...,...,...,...,...,...,...
21101109,22,51237496,C,T,0.1502,0.2931,0.6083,22:51237496:C:T,22:51237496:C:T
21101110,22,51237535,C,A,-0.0655,0.2898,0.8213,22:51237535:C:A,22:51237535:C:A
21101111,22,51238249,C,A,0.0098,0.0228,0.6673,22:51238249:C:A,22:51238249:C:A
21101112,22,51238318,T,A,0.0476,0.0813,0.5584,22:51238318:T:A,22:51238318:T:A


In [8]:
#Number of hits that are genome-wide significant
AD_meta_sig=AD_meta[AD_meta['P'] < 5e-06]

In [5]:
file='/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz'
AD_proxy = pd.read_csv(file,compression="gzip",header=0,sep="\t")

In [6]:
AD_proxy

Unnamed: 0,CHR,POS,REF,ALT,MAF,INFO,BETA,SE,P,varid,SNP_original,SNP
0,1,13259,G,A,0.000268,0.845414,0.158113,0.203616,0.437438,rs562993331,1:13259:G:A,1:13259:G:A
1,1,17569,C,A,0.000012,0.843954,1.407020,0.719964,0.058511,rs535086049,1:17569:C:A,1:17569:C:A
2,1,17641,G,A,0.000799,0.853630,0.078532,0.119229,0.510111,rs578081284,1:17641:G:A,1:17641:G:A
3,1,30741,C,A,0.000022,0.903759,-1.314490,0.771762,0.088524,rs558169846,1:30741:C:A,1:30741:C:A
4,1,57222,T,C,0.000622,0.837846,0.291624,0.126625,0.025216,rs576081345,1:57222:T:C,1:57222:T:C
...,...,...,...,...,...,...,...,...,...,...,...,...
23095994,9,141101939,C,T,0.022299,1.000000,-0.008858,0.021224,0.676424,9:141101939_C_T,9:141101939:C:T,9:141101939:C:T
23095995,9,141102535,CT,C,0.000484,0.816239,0.266958,0.153428,0.081865,9:141102535_CT_C,9:141102535:CT:C,9:141102535:CT:C
23095996,9,141102859,C,T,0.011236,0.809996,-0.013092,0.032992,0.691506,9:141102859_C_T,9:141102859:C:T,9:141102859:C:T
23095997,9,141104957,G,A,0.010978,0.846640,0.023038,0.032529,0.478803,9:141104957_G_A,9:141104957:G:A,9:141104957:G:A


In [9]:
AD_proxy_sig=AD_proxy[AD_proxy['P'] < 5e-06]

In [10]:
AD_meta_sig[AD_meta_sig['SNP'].isin(AD_proxy_sig['SNP'])]

Unnamed: 0,CHR,POS,REF,ALT,BETA,SE,P,SNP_original,SNP
991475,1,161097241,T,A,0.0451,0.0090,5.831000e-07,1:161097241:T:A,1:161097241:T:A
991490,1,161100336,T,A,0.0446,0.0090,7.636000e-07,1:161100336:T:A,1:161100336:T:A
991505,1,161103445,G,T,-0.0491,0.0088,2.677000e-08,1:161103445:G:T,1:161103445:T:G
991511,1,161104124,G,A,0.0447,0.0090,7.208000e-07,1:161104124:G:A,1:161104124:G:A
991522,1,161106354,G,T,-0.0494,0.0088,2.242000e-08,1:161106354:G:T,1:161106354:T:G
...,...,...,...,...,...,...,...,...,...
19936176,19,45806553,C,T,0.3074,0.0437,2.034000e-12,19:45806553:C:T,19:45806553:C:T
19936327,19,45825418,C,T,0.3338,0.0521,1.538000e-10,19:45825418:C:T,19:45825418:C:T
19936371,19,45830947,G,A,-0.0451,0.0089,4.086000e-07,19:45830947:G:A,19:45830947:G:A
19936431,19,45838141,C,T,0.6209,0.0764,4.215000e-16,19:45838141:C:T,19:45838141:C:T


In [11]:
AD_proxy_sig[AD_proxy_sig['SNP'].isin(AD_meta_sig['SNP'])]
#f2257_res.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/f2257imp_gwashits_in_AD.tsv' , sep='\t', index=False, header=True)

Unnamed: 0,CHR,POS,REF,ALT,MAF,INFO,BETA,SE,P,varid,SNP_original,SNP
1065054,1,161097241,T,A,0.275104,0.999587,0.032051,0.006947,4.102135e-06,rs12727614,1:161097241:T:A,1:161097241:T:A
1065073,1,161100336,T,A,0.275114,0.999638,0.032019,0.006946,4.190058e-06,rs11583045,1:161100336:T:A,1:161100336:T:A
1065090,1,161103445,T,G,0.295974,0.999809,0.032169,0.006802,2.336307e-06,rs10797093,1:161103445:T:G,1:161103445:T:G
1065094,1,161104124,G,A,0.275063,0.999811,0.031940,0.006946,4.421911e-06,rs4540676,1:161104124:G:A,1:161104124:G:A
1065110,1,161106354,T,G,0.295966,0.999906,0.032252,0.006802,2.198214e-06,rs11265557,1:161106354:T:G,1:161106354:T:G
...,...,...,...,...,...,...,...,...,...,...,...,...
21104385,8,27481984,C,T,0.689115,0.987385,0.034706,0.006813,3.361940e-07,rs507341,8:27481984:C:T,8:27481984:T:C
21104388,8,27482354,T,G,0.310179,0.987466,-0.034708,0.006816,3.401496e-07,rs495150,8:27482354:T:G,8:27482354:T:G
21104425,8,27486916,G,A,0.310241,0.987670,-0.034207,0.006814,4.962837e-07,rs576748,8:27486916:G:A,8:27486916:G:A
21104471,8,27494366,T,C,0.041300,0.983765,-0.077456,0.016219,1.470415e-06,rs138529507,8:27494366:T:C,8:27494366:T:C


# Try univariate clumping to see which regions appear and lower p-value to 5e-06

## AD_proxy clump

Run first round of univariate clump to identify variants with flipped alleles

In [21]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_ADproxy
clumping_sbatch=$clumping_dir/ADproxy-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.final.snp_stats.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/ADproxy/ADproxy-ldclumping_2022-08-04.sbatch[0m
INFO: Workflow csg (ID=w117c1f46c49790a7) is executed successfully with 1 completed step.



### Create Allele flip file

In [22]:
# To get the variant ID of the alleles that need to be flipped
cd ~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_ADproxy
cat AD_step2_BT_all.final.snp_stats.clumped | grep "not found in dataset" | awk '{print $1}' > ADproxy_allele_flip




In [55]:
import pandas as pd
stat = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.gz', dtype=str, compression='gzip',sep="\t", header=0)
stat.rename(columns={'CHROM':'CHR', 'GENPOS':'POS', 'ID':'varid', 'ALLELE0':'REF', 'ALLELE1':'ALT', 'A1FREQ':'MAF','P':'P'}, inplace=True)
#stat[['P']] = stat[['P']].apply(pd.to_numeric)
#stat['P'] = stat['P'].apply(lambda row: 10**-row)
stat['SNP_original'] = stat["CHR"].str.cat(others=[stat["POS"],stat['REF'],stat['ALT']], sep=':')
stat["CHR_POS"] = stat["CHR"].str.cat(others=[stat["POS"]], sep=':')
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_ADproxy/ADproxy_allele_flip', dtype=str , header=None , names=['SNP'])
changelist = flip_list['SNP'].tolist()
stat['SNP']=stat.loc[:, 'SNP_original']

In [56]:
def snp_flip(row):
    row["SNP"] = str(row["CHR"])+":"+str(row["POS"])+":"+row["ALT"]+":"+row["REF"]
    return row

In [57]:
stat[stat['SNP'].isin(changelist)] = stat[stat['SNP'].isin(changelist)].apply(snp_flip,axis=1)

In [58]:
stat[stat['SNP_original'].isin(changelist)]

Unnamed: 0,CHR,POS,varid,REF,ALT,MAF,INFO,N,TEST,BETA,SE,CHISQ,LOG10P,EXTRA,P,SNP_original,CHR_POS,SNP
1458289,1,207679307,rs4844600,A,G,0.811141,0.993394,460649,ADD,-0.0389473,0.00792551,24.0149,6.01957,,9.55938603503361e-07,1:207679307:A:G,1:207679307,1:207679307:G:A
1458344,1,207684192,rs12037841,T,G,0.82115,0.998536,460649,ADD,-0.0468646,0.00805513,33.6153,8.17287,,6.7162986587847305e-09,1:207684192:T:G,1:207684192,1:207684192:G:T
1458359,1,207685786,rs4266886,T,C,0.811316,0.99526,460649,ADD,-0.0400964,0.007919,25.4903,6.35201,,4.44621029573694e-07,1:207685786:T:C,1:207685786,1:207685786:C:T
1458360,1,207685965,rs4562624,A,C,0.821226,1.0,460649,ADD,-0.0473144,0.00804713,34.329,8.33218,,4.6539316421925305e-09,1:207685965:A:C,1:207685965,1:207685965:C:A
1458405,1,207692049,rs6656401,A,G,0.82207,0.999179,460649,ADD,-0.0482389,0.00806466,35.5235,8.5986,,2.5199968621092403e-09,1:207692049:A:G,1:207692049,1:207692049:G:A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22032102,8,139366534,rs11166817,C,T,0.644352,0.998075,460649,ADD,-0.0305679,0.00650746,22.0191,5.56872,,2.69947929013705e-06,8:139366534:C:T,8:139366534,8:139366534:T:C
22032107,8,139367027,rs17722613,G,A,0.644394,0.998056,460649,ADD,-0.0305161,0.00650787,21.9418,5.55123,,2.81041205724176e-06,8:139367027:G:A,8:139367027,8:139367027:A:G
22032108,8,139367262,rs11782012,C,T,0.643547,0.998113,460649,ADD,-0.0301768,0.00650446,21.4799,5.44666,,3.57552649044493e-06,8:139367262:C:T,8:139367262,8:139367262:T:C
22032207,8,139381546,rs340708,G,T,0.644328,0.999248,460649,ADD,-0.0303928,0.00650423,21.7895,5.51675,,3.04263600373233e-06,8:139381546:G:T,8:139381546,8:139381546:T:G


In [61]:
header = ["CHR", "POS", "REF", "ALT", "MAF", "INFO", "BETA", "SE", "P", "varid", "SNP_original", "SNP"]
stat.to_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz', index=False, compression='gzip',sep="\t", columns = header)

In [29]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_ADproxy
clumping_sbatch=$clumping_dir/ADproxy-ldclumping_flip_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/AD_step2_BT_all.txt.snp_stats.ldclump.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_ADproxy/ADproxy-ldclumping_flip_2022-08-04.sbatch[0m
INFO: Workflow csg (ID=wdd06a4f2948a0bc9) is executed successfully with 1 completed step.



### After applying the allele_flip code there are still 15 top variants not present in the reference_file

These variants can be found at the tail of this file 
`~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_ADproxy/*.regenie.snp_stats.ldclump.clumped`

## f3393 clump

In [90]:
# Set the bash variables 
# memory 80GB
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f3393
clumping_sbatch=$clumping_dir/f3393-ldclumping$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=varid
numThreads=1
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f3393/f3393-ldclumping2022-08-05.sbatch[0m
INFO: Workflow csg (ID=w23848caebd857ce7) is executed successfully with 1 completed step.



After replacing the reference file for the new one where all the variants are names by CHR:POS:REF:ALT and the duplicated ones are removed we get the following results


This new problem arises where some of the alleles are still flipped in the summary stats/reference sequence

```
941 top variant IDs missing; see the end of the .clumped file.
--clump: 135 clumps formed from 6544 top variants
```

In the reference.bim file `ukb39554_c1_22_v3.imputed.renamedvar.nondup.2000.ref_geno.bim`: 

Example:
```
6:32585755:A(major):T(minor)
6	6:32585755:A:T	0	32585755	T	A
```

In the summary stats file
```
6:32585755:T(ref):A(alt)
6	32585755	T	A	rs3129762	0.0818472	0.0180549	4.992637104333426e-06	6:32585755:T:A
```

In dbsnp

```
Europeans T=0.14 (reference) and A=0.86 (alternative)
```



I can think of one solution to solve this temporarly 

1. Create an extra column in the sumstats of the f3393 and flip all the variants having this issue to the format found in the bim file (major:minor) keeping the original SNP column renamed to SPN_ref_alt and the other one would be SNP (major/minor)

In [5]:
import pandas as pd
f3393 = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats_original_columns.gz', dtype=str, compression='gzip',sep="\t", header=0)

In [6]:
f3393.head()

Unnamed: 0,CHROM,GENPOS,ID,ALLELE0,ALLELE1,A1FREQ,INFO,N,TEST,BETA,SE,CHISQ,LOG10P,EXTRA
0,1,13259,rs562993331,G,A,0.000240613,0.844357,251433,ADD,0.444965,0.405872,1.20192,0.563934,
1,1,17569,rs535086049,C,A,1.34679e-05,0.894182,251433,ADD,-1.19541,1.86985,0.408719,0.281813,
2,1,17641,rs578081284,G,A,0.000794684,0.852345,251433,ADD,-0.435933,0.233139,3.4963,1.21108,
3,1,30741,rs558169846,C,A,2.55789e-05,0.910708,251433,ADD,-1.28043,1.24525,1.0573,0.517366,
4,1,52144,rs190291950,T,A,0.000485679,0.802923,251433,ADD,0.117852,0.337249,0.122117,0.138615,


In [13]:
f3393.rename(columns={'CHROM':'CHR', 'GENPOS':'POS', 'ID':'varid', 'ALLELE0':'REF', 'ALLELE1':'ALT', 'A1FREQ':'MAF','LOG10P':'P'}, inplace=True)

In [14]:
f3393.head()

Unnamed: 0,CHR,POS,varid,REF,ALT,MAF,INFO,N,TEST,BETA,SE,CHISQ,P,EXTRA,SNP_original
0,1,13259,rs562993331,G,A,0.000240613,0.844357,251433,ADD,0.444965,0.405872,1.20192,0.272939,,1:13259:G:A
1,1,17569,rs535086049,C,A,1.34679e-05,0.894182,251433,ADD,-1.19541,1.86985,0.408719,0.522621,,1:17569:C:A
2,1,17641,rs578081284,G,A,0.000794684,0.852345,251433,ADD,-0.435933,0.233139,3.4963,0.061506,,1:17641:G:A
3,1,30741,rs558169846,C,A,2.55789e-05,0.910708,251433,ADD,-1.28043,1.24525,1.0573,0.303832,,1:30741:C:A
4,1,52144,rs190291950,T,A,0.000485679,0.802923,251433,ADD,0.117852,0.337249,0.122117,0.72675,,1:52144:T:A


In [10]:
f3393[['P']] = f3393[['P']].apply(pd.to_numeric)
f3393['P'] = f3393['P'].apply(lambda row: 10**-row)

In [11]:
f3393['SNP_original'] = f3393['CHR'] + ':' + f3393['POS'] + ':' + f3393['REF'] + ':' + f3393['ALT']

In [12]:
f3393.head()

Unnamed: 0,CHR,POS,varid,REF,ALT,A1FREQ,INFO,N,TEST,BETA,SE,CHISQ,P,EXTRA,SNP_original
0,1,13259,rs562993331,G,A,0.000240613,0.844357,251433,ADD,0.444965,0.405872,1.20192,0.272939,,1:13259:G:A
1,1,17569,rs535086049,C,A,1.34679e-05,0.894182,251433,ADD,-1.19541,1.86985,0.408719,0.522621,,1:17569:C:A
2,1,17641,rs578081284,G,A,0.000794684,0.852345,251433,ADD,-0.435933,0.233139,3.4963,0.061506,,1:17641:G:A
3,1,30741,rs558169846,C,A,2.55789e-05,0.910708,251433,ADD,-1.28043,1.24525,1.0573,0.303832,,1:30741:C:A
4,1,52144,rs190291950,T,A,0.000485679,0.802923,251433,ADD,0.117852,0.337249,0.122117,0.72675,,1:52144:T:A


In [21]:
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/f3393/f3393_allele_flip_list', dtype=str , header=0 , names=['SNP'])

In [22]:
flip_list.head()

Unnamed: 0,SNP
0,5:73090165:A:AGTT
1,5:73090533:A:G
2,5:73090261:T:C
3,5:73073615:T:C
4,5:73090589:G:A


In [28]:
flip_var = flip_list['SNP'].tolist()

In [25]:
f3393['SNP']=f3393.loc[:, 'SNP_original']
f3393.head()

Unnamed: 0,CHR,POS,varid,REF,ALT,MAF,INFO,N,TEST,BETA,SE,CHISQ,P,EXTRA,SNP_original,SNP
0,1,13259,rs562993331,G,A,0.000240613,0.844357,251433,ADD,0.444965,0.405872,1.20192,0.272939,,1:13259:G:A,1:13259:G:A
1,1,17569,rs535086049,C,A,1.34679e-05,0.894182,251433,ADD,-1.19541,1.86985,0.408719,0.522621,,1:17569:C:A,1:17569:C:A
2,1,17641,rs578081284,G,A,0.000794684,0.852345,251433,ADD,-0.435933,0.233139,3.4963,0.061506,,1:17641:G:A,1:17641:G:A
3,1,30741,rs558169846,C,A,2.55789e-05,0.910708,251433,ADD,-1.28043,1.24525,1.0573,0.303832,,1:30741:C:A,1:30741:C:A
4,1,52144,rs190291950,T,A,0.000485679,0.802923,251433,ADD,0.117852,0.337249,0.122117,0.72675,,1:52144:T:A,1:52144:T:A


In [91]:
# To get the variant ID of the alleles that need to be flipped
cd ~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f3393
cat 071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.clumped | grep "not found in dataset" | awk '{print $1}' > f3393_allele_flip




In [92]:
import pandas as pd
stat = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats_original_columns.gz', dtype=str, compression='gzip',sep="\t", header=0)
stat.rename(columns={'CHROM':'CHR', 'GENPOS':'POS', 'ID':'varid', 'ALLELE0':'REF', 'ALLELE1':'ALT', 'A1FREQ':'MAF','LOG10P':'P'}, inplace=True)
stat[['P']] = stat[['P']].apply(pd.to_numeric)
stat['P'] = stat['P'].apply(lambda row: 10**-row)
stat['SNP_original'] = stat["CHR"].str.cat(others=[stat["POS"],stat['REF'],stat['ALT']], sep=':')
stat["CHR_POS"] = stat["CHR"].str.cat(others=[stat["POS"]], sep=':')
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f3393/f3393_allele_flip', dtype=str , header=None , names=['SNP'])
changelist = flip_list['SNP'].tolist()
stat['SNP']=stat.loc[:, 'SNP_original']

In [93]:
def snp_flip(row):
    row["SNP"] = str(row["CHR"])+":"+str(row["POS"])+":"+row["ALT"]+":"+row["REF"]
    return row

In [94]:
stat[stat['SNP'].isin(changelist)] = stat[stat['SNP'].isin(changelist)].apply(snp_flip,axis=1)

In [95]:
stat

Unnamed: 0,CHR,POS,varid,REF,ALT,MAF,INFO,N,TEST,BETA,SE,CHISQ,P,EXTRA,SNP_original,CHR_POS,SNP
0,1,13259,rs562993331,G,A,0.000240613,0.844357,251433,ADD,0.444965,0.405872,1.20192,2.729393e-01,,1:13259:G:A,1:13259,1:13259:G:A
1,1,17569,rs535086049,C,A,1.34679e-05,0.894182,251433,ADD,-1.19541,1.86985,0.408719,5.226212e-01,,1:17569:C:A,1:17569,1:17569:C:A
2,1,17641,rs578081284,G,A,0.000794684,0.852345,251433,ADD,-0.435933,0.233139,3.4963,6.150636e-02,,1:17641:G:A,1:17641,1:17641:G:A
3,1,30741,rs558169846,C,A,2.55789e-05,0.910708,251433,ADD,-1.28043,1.24525,1.0573,3.038323e-01,,1:30741:C:A,1:30741,1:30741:C:A
4,1,52144,rs190291950,T,A,0.000485679,0.802923,251433,ADD,0.117852,0.337249,0.122117,7.267499e-01,,1:52144:T:A,1:52144,1:52144:T:A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21735888,22,51232581,rs5771020,T,C,0.297602,0.801896,251433,ADD,0.00162598,0.0153923,0.011159,9.158710e-01,,22:51232581:T:C,22:51232581,22:51232581:T:C
21735889,22,51236013,rs200507571,A,AT,0.252029,0.805492,251433,ADD,-0.00270919,0.01616,0.0281057,8.668607e-01,,22:51236013:A:AT,22:51236013,22:51236013:A:AT
21735890,22,51237063,rs3896457,T,C,0.298901,0.857807,251433,ADD,-0.00859535,0.0148427,0.33535,5.625252e-01,,22:51237063:T:C,22:51237063,22:51237063:T:C
21735891,22,51237215,rs536109858,C,T,0.000413208,0.914193,251433,ADD,1.34564,0.209808,33.7213,6.360189e-09,,22:51237215:C:T,22:51237215,22:51237215:C:T


In [96]:
header = ["CHR", "POS", "REF", "ALT", "MAF", "INFO", "BETA", "SE", "P", "varid", "SNP_original", "SNP"]
stat.to_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.ldclump.gz', index=False, compression='gzip',sep="\t", columns = header)

### After applying this code there are still 22 top variants not present in the reference file 

```
5:73090534:T:C not found in dataset
2:54959092:G:C not found in dataset
6:32582603:A:G not found in dataset
6:32673475:C:T not found in dataset
6:32633668:G:A not found in dataset
17:43853922:A:G not found in dataset
17:44138377:GA:G not found in dataset
17:44138377:GAA:GA not found in dataset
17:44129949:G:C not found in dataset
17:44063723:C:A not found in dataset
17:43756458:A:G not found in dataset
17:43950976:T:G not found in dataset
17:44293020:A:G not found in dataset
14:81846115:C:T not found in dataset
3:181975568:CTT:C not found in dataset
17:44355634:G:C not found in dataset
12:102780879:A:G not found in dataset
17:44173502:ATTC:A not found in dataset
5:73066154:A:T not found in dataset
11:118514022:G:C not found in dataset
17:44355634:A:C not found in dataset
10:30315633:T:C not found in dataset
6:84127089:TTTG:T not found in dataset
```

In [97]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f3393
clumping_sbatch=$clumping_dir/f3393-ldclumping_flip$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.ldclump.gz
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=varid
numThreads=1
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f3393/f3393-ldclumping_flip2022-08-05.sbatch[0m
INFO: Workflow csg (ID=wb50975dbec8641f8) is executed successfully with 1 completed step.



## f2247 clump

Asign 80GB memory for the job to run 

In [8]:
module load Singularity/3.5.3
sos dryrun /home/dmc2245/project/bioworkflows/GWAS/LD_Clumping.ipynb \
    default \
    --cwd /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247\
    --genoFile /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr1_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr2_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr3_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr4_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr5_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr6_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr7_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr8_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr9_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr10_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr11_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr12_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr13_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr14_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr15_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr16_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr17_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr18_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr19_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr20_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr21_v3.bgen /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr22_v3.bgen\
    --sampleFile /home/dmc2245/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample\
    --reference_genotype_prefix /home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.nondup  \
    --sumstatsFiles /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.gz \
    --ld_sample_size 2000 \
    --clump_field P\
    --clump_p1 5e-06 \
    --clump_p2 1 \
    --clump_r2 0.04 \
    --clump_kb 2000 \
    --clump_annotate varid \
    --numThreads 20 \
    --job_size 1\
    --container /home/dmc2245/containers/lmm.sif

INFO: Checking [32mdefault[0m: Perform LD-clumping in PLINKv1.9
HINT: singularity exec  /home/dmc2245/containers/lmm.sif /bin/bash /mnt/mfs/hgrcgrid/homes/dmc2245/project/UKBB_GWAS_dev/analysis/pleio_AD_ARHI/tmpbxu796k5/singularity_run_18882.sh
plink \
--bfile /home/dmc2245/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed.renamedvar.nondup.2000.ref_geno \
--clump /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.gz \
--clump-field P \
--clump-p1 5e-06 \
--clump-p2 1.0 \
--clump-r2 0.04 \
--clump-kb 2000 \
--clump-verbose \
--clump-annotate varid \
--clump-allow-overlap \
--out /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats \
--threads 20 \
&& touch /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumpi

In [72]:
import pandas as pd
stat = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats_original_columns.gz', dtype=str, compression='gzip',sep="\t", header=0)
stat.rename(columns={'CHROM':'CHR', 'GENPOS':'POS', 'ID':'varid', 'ALLELE0':'REF', 'ALLELE1':'ALT', 'A1FREQ':'MAF','LOG10P':'P'}, inplace=True)
stat[['P']] = stat[['P']].apply(pd.to_numeric)
stat['P'] = stat['P'].apply(lambda row: 10**-row)
stat['SNP_original'] = stat["CHR"].str.cat(others=[stat["POS"],stat['REF'],stat['ALT']], sep=':')
stat["CHR_POS"] = stat["CHR"].str.cat(others=[stat["POS"]], sep=':')
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247/f2247_allele_flip', dtype=str , header=None , names=['SNP'])
changelist = flip_list['SNP'].tolist()
stat['SNP']=stat.loc[:, 'SNP_original']

In [73]:
def snp_flip(row):
    row["SNP"] = str(row["CHR"])+":"+str(row["POS"])+":"+row["ALT"]+":"+row["REF"]
    return row

In [74]:
stat[stat['SNP'].isin(changelist)] = stat[stat['SNP'].isin(changelist)].apply(snp_flip,axis=1)

In [75]:
stat[stat['SNP_original'].isin(changelist)]

Unnamed: 0,CHR,POS,varid,REF,ALT,MAF,INFO,N,TEST,BETA,SE,CHISQ,P,EXTRA,SNP_original,CHR_POS,SNP
47322,1,6493666,rs569450960,C,CAA,0.0609333,0.96973,344693,ADD,-0.0526317,0.0114712,21.1616,4.221337e-06,,1:6493666:C:CAA,1:6493666,1:6493666:CAA:C
336176,1,46026397,rs373579463,CT,C,0.566553,0.969944,344693,ADD,-0.0326015,0.00549209,35.2231,2.940086e-09,,1:46026397:CT:C,1:46026397,1:46026397:C:CT
336267,1,46036960,rs514595,T,C,0.840746,0.997817,344693,ADD,-0.0385353,0.00730693,27.7363,1.390273e-07,,1:46036960:T:C,1:46036960,1:46036960:C:T
336273,1,46037357,rs518216,C,T,0.83866,0.997918,344693,ADD,-0.0374882,0.00727,26.5194,2.609036e-07,,1:46037357:C:T,1:46037357,1:46037357:T:C
336275,1,46037394,rs659437,T,C,0.838692,0.997913,344693,ADD,-0.0375429,0.00727049,26.5931,2.511424e-07,,1:46037394:T:C,1:46037394,1:46037394:C:T
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21850668,22,51043278,rs3219201,G,A,0.678689,0.991739,344693,ADD,-0.0277026,0.00575655,23.1349,1.510254e-06,,22:51043278:G:A,22:51043278,22:51043278:A:G
21850695,22,51046096,rs470115,A,G,0.819177,0.970074,344693,ADD,-0.0323748,0.00706527,20.9521,4.709231e-06,,22:51046096:A:G,22:51046096,22:51046096:G:A
21850707,22,51047218,rs5770938,G,A,0.7617,0.978794,344693,ADD,-0.0353692,0.00634002,31.0625,2.498561e-08,,22:51047218:G:A,22:51047218,22:51047218:A:G
21850720,22,51048059,rs965618,A,G,0.729428,0.995698,344693,ADD,-0.0323653,0.00603233,28.7416,8.270847e-08,,22:51048059:A:G,22:51048059,22:51048059:G:A


In [76]:
header = ["CHR", "POS", "REF", "ALT", "MAF", "INFO", "BETA", "SE", "P", "varid", "SNP_original", "SNP"]
stat.to_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.gz', index=False, compression='gzip',sep="\t", columns = header)

In [77]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247
clumping_sbatch=$clumping_dir/f2247-ldclumping_flip_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=varid
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247/f2247-ldclumping_flip_2022-08-05.sbatch[0m
INFO: Workflow csg (ID=w8d2ba18b02a6165c) is executed successfully with 1 completed step.



### After applying the allele_flip code there are still 91 top variants not present in the reference_file

These variants can be found at the tail of this file `/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.clumped`

## f2257 clump

In [17]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2257
clumping_sbatch=$clumping_dir/f2257-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=varid
numThreads=1
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2257/f2257-ldclumping_2022-08-03.sbatch[0m
INFO: Workflow csg (ID=w27f2a3ef13e0301f) is executed successfully with 1 completed step.



In [78]:
import pandas as pd
stat = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2257/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats_original_columns.gz', dtype=str, compression='gzip',sep="\t", header=0)
stat.rename(columns={'CHROM':'CHR', 'GENPOS':'POS', 'ID':'varid', 'ALLELE0':'REF', 'ALLELE1':'ALT', 'A1FREQ':'MAF','LOG10P':'P'}, inplace=True)
stat[['P']] = stat[['P']].apply(pd.to_numeric)
stat['P'] = stat['P'].apply(lambda row: 10**-row)
stat['SNP_original'] = stat["CHR"].str.cat(others=[stat["POS"],stat['REF'],stat['ALT']], sep=':')
stat["CHR_POS"] = stat["CHR"].str.cat(others=[stat["POS"]], sep=':')
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2257/f2257_allele_flip', dtype=str , header=None , names=['SNP'])
changelist = flip_list['SNP'].tolist()
stat['SNP']=stat.loc[:, 'SNP_original']

In [79]:
def snp_flip(row):
    row["SNP"] = str(row["CHR"])+":"+str(row["POS"])+":"+row["ALT"]+":"+row["REF"]
    return row

In [80]:
stat[stat['SNP'].isin(changelist)] = stat[stat['SNP'].isin(changelist)].apply(snp_flip,axis=1)

In [81]:
stat[stat['SNP_original'].isin(changelist)]

Unnamed: 0,CHR,POS,varid,REF,ALT,MAF,INFO,N,TEST,BETA,SE,CHISQ,P,EXTRA,SNP_original,CHR_POS,SNP
58269,1,7965280,rs6694557,T,C,0.864359,0.964317,395732,ADD,-0.0343144,0.00696062,24.2706,8.370858e-07,,1:7965280:T:C,1:7965280,1:7965280:C:T
209878,1,28975150,rs35910435,C,T,0.0518141,1.0,395732,ADD,0.0507732,0.0105006,23.3218,1.370377e-06,,1:28975150:C:T,1:28975150,1:28975150:T:C
226419,1,31259857,1:31259857_CT_C,CT,C,0.612566,0.811564,395732,ADD,-0.0249873,0.00534486,21.8505,2.947406e-06,,1:31259857:CT:C,1:31259857,1:31259857:C:CT
336745,1,46026397,rs373579463,CT,C,0.566571,0.970091,395732,ADD,-0.0257179,0.00480954,28.5894,8.947259e-08,,1:46026397:CT:C,1:46026397,1:46026397:C:CT
336872,1,46041100,rs1084086,T,G,0.703508,1.0,395732,ADD,-0.0280157,0.00513516,29.7464,4.924136e-08,,1:46041100:T:G,1:46041100,1:46041100:G:T
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21795261,22,38474852,rs1004764,G,A,0.620458,0.997395,395732,ADD,-0.0234469,0.00483889,23.4727,1.267010e-06,,22:38474852:G:A,22:38474852,22:38474852:A:G
21801263,22,39286208,rs132502,A,C,0.233611,1.0,395732,ADD,-0.026474,0.00556285,22.6694,1.924110e-06,,22:39286208:A:C,22:39286208,22:39286208:C:A
21906441,22,51043278,rs3219201,G,A,0.678556,0.991675,395732,ADD,-0.0236357,0.00503915,21.9901,2.740627e-06,,22:51043278:G:A,22:51043278,22:51043278:A:G
21906480,22,51047218,rs5770938,G,A,0.76171,0.978624,395732,ADD,-0.0274931,0.00555585,24.4688,7.552313e-07,,22:51047218:G:A,22:51047218,22:51047218:A:G


In [82]:
header = ["CHR", "POS", "REF", "ALT", "MAF", "INFO", "BETA", "SE", "P", "varid", "SNP_original", "SNP"]
stat.to_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.gz', index=False, compression='gzip',sep="\t", columns = header)

### After applying the allele_flip code there are still 76 top variants not present in the reference_file

These variants can be found at the tail of this file `~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2257/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.clumped`


In [84]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2257
clumping_sbatch=$clumping_dir/f2257-ldclumping_flip_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=varid
numThreads=1
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m (index=0) is [32mignored[0m due to saved signature
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2257/f2257-ldclumping_flip_2022-08-05.sbatch[0m
INFO: Workflow csg (ID=w16814b7e66737f2c) is ignored with 1 ignored step.



## Combined clump

In [11]:
# # Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247_f2257
clumping_sbatch=$clumping_dir/combined-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247_f2257/combined-ldclumping_2022-08-04.sbatch[0m
INFO: Workflow csg (ID=wa5d11405fe8fa890) is executed successfully with 1 completed step.



In [None]:
# To get the variant ID of the alleles that need to be flipped
cd ~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247_f2257
cat 071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.clumped | grep "not found in dataset" | awk '{print $1}' > f2247_f2257_allele_flip

In [85]:
import pandas as pd
stat = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247_f2257/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats_original_columns.gz', dtype=str, compression='gzip',sep="\t", header=0)
stat.rename(columns={'CHROM':'CHR', 'GENPOS':'POS', 'ID':'varid', 'ALLELE0':'REF', 'ALLELE1':'ALT', 'A1FREQ':'MAF','LOG10P':'P'}, inplace=True)
stat[['P']] = stat[['P']].apply(pd.to_numeric)
stat['P'] = stat['P'].apply(lambda row: 10**-row)
stat['SNP_original'] = stat["CHR"].str.cat(others=[stat["POS"],stat['REF'],stat['ALT']], sep=':')
stat["CHR_POS"] = stat["CHR"].str.cat(others=[stat["POS"]], sep=':')
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247_f2257/f2247_f2257_allele_flip', dtype=str , header=None , names=['SNP'])
changelist = flip_list['SNP'].tolist()
stat['SNP']=stat.loc[:, 'SNP_original']

In [86]:
def snp_flip(row):
    row["SNP"] = str(row["CHR"])+":"+str(row["POS"])+":"+row["ALT"]+":"+row["REF"]
    return row

In [87]:
stat[stat['SNP'].isin(changelist)] = stat[stat['SNP'].isin(changelist)].apply(snp_flip,axis=1)

In [88]:
header = ["CHR", "POS", "REF", "ALT", "MAF", "INFO", "BETA", "SE", "P", "varid", "SNP_original", "SNP"]
stat.to_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.ldclump.gz', index=False, compression='gzip',sep="\t", columns = header)

In [89]:
## Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247_f2257
clumping_sbatch=$clumping_dir/combined-ldclumping_flip_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.ldclump.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_f2247_f2257/combined-ldclumping_flip_2022-08-05.sbatch[0m
INFO: Workflow csg (ID=w9838dfc6f95d510f) is executed successfully with 1 completed step.



### After applying the allele_flip code there are still 76 top variants not present in the reference_file

These variants can be found at the tail of this file `~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/

## AD_meta clump

In [8]:
## Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_AD_meta
clumping_sbatch=$clumping_dir/AD_meta-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_AD_meta/AD_meta-ldclumping_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=w5c3d9f868a13359b) is executed successfully with 1 completed step.



In [10]:
# To get the variant ID of the alleles that need to be flipped
cd ~/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_AD_meta
cat GCST90027158_buildGRCh37_reheader.tsv.clumped| grep "not found in dataset" | awk '{print $1}' > ADmeta_allele_flip




In [12]:
import pandas as pd
stat = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz', dtype=str, compression='gzip',sep="\t", header=0)
#stat[['P']] = stat[['P']].apply(pd.to_numeric)
#stat['P'] = stat['P'].apply(lambda row: 10**-row)
stat['SNP_original'] = stat["CHR"].str.cat(others=[stat["POS"],stat['REF'],stat['ALT']], sep=':')
stat["CHR_POS"] = stat["CHR"].str.cat(others=[stat["POS"]], sep=':')
flip_list= pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_AD_meta/ADmeta_allele_flip', dtype=str , header=None , names=['SNP'])
changelist = flip_list['SNP'].tolist()
stat['SNP']=stat.loc[:, 'SNP_original']

In [13]:
def snp_flip(row):
    row["SNP"] = str(row["CHR"])+":"+str(row["POS"])+":"+row["ALT"]+":"+row["REF"]
    return row

In [14]:
stat[stat['SNP'].isin(changelist)] = stat[stat['SNP'].isin(changelist)].apply(snp_flip,axis=1)

In [15]:
stat[stat['SNP_original'].isin(changelist)]

Unnamed: 0,CHR,POS,REF,ALT,SNP,BETA,SE,P,SNP_original,CHR_POS
7474,0,0,G,A,0:0:A:G,0.0052,0.0173,0.7631,0:0:G:A,0:0
7476,0,0,G,A,0:0:A:G,-0.0071,0.0331,0.8292,0:0:G:A,0:0
7477,0,0,G,A,0:0:A:G,-0.0015,0.0141,0.916,0:0:G:A,0:0
7478,0,0,G,A,0:0:A:G,0.0024,0.0151,0.8725,0:0:G:A,0:0
7479,0,0,C,T,0:0:T:C,-0.3168,0.3158,0.3157,0:0:C:T,0:0
...,...,...,...,...,...,...,...,...,...,...
21093486,0,0,C,T,0:0:T:C,0.2512,0.4173,0.5472,0:0:C:T,0:0
21093487,0,0,G,A,0:0:A:G,0.0186,0.0108,0.08605,0:0:G:A,0:0
21093488,0,0,C,T,0:0:T:C,-0.0327,0.0142,0.02135,0:0:C:T,0:0
21093489,0,0,C,A,0:0:A:C,-0.1226,0.2286,0.5916,0:0:C:A,0:0


In [16]:
header = ["CHR", "POS", "REF", "ALT", "BETA", "SE", "P", "SNP_original", "SNP"]
stat.to_csv('~/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz', index=False, compression='gzip',sep="\t", columns = header)

In [17]:
## Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_AD_meta
clumping_sbatch=$clumping_dir/AD_meta-ldclumping_flip_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump.gz
ld_sample_size=2000
clump_field=P
clump_p1=5e-06
clump_p2=1
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container $container
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running [32mcsg[0m: Configuration for Columbia csg partition cluster
INFO: [32mcsg[0m is [32mcompleted[0m.
INFO: [32mcsg[0m output:   [32m/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/072822_AD_meta/AD_meta-ldclumping_flip_2022-08-08.sbatch[0m
INFO: Workflow csg (ID=wa96242296e322948) is executed successfully with 1 completed step.



# Find intersection between LD clumping regions

In [10]:
import numpy as np
import pandas as pd
from functools import reduce
import glob
filenames = glob.glob("/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822*/*.clumped_region")

In [12]:
dfs = [pd.read_csv(filename, dtype=str , sep=' ', header=None) for filename in filenames]

In [13]:
dfs

[      0         1         2
 0     1   6315125   6631106
 1     1  12169337  16161606
 2     1  20269718  22239026
 3     1  42273297  43365565
 4     1  45167429  47068339
 ..   ..       ...       ...
 560  22  37977713  38626556
 561  22  46608847  49887600
 562  22  48993774  51146132
 563  22  50436376  51229805
 564  22  51032153  51173041
 
 [565 rows x 3 columns],
       0         1         2
 0     1   6315125   6631106
 1     1  12197464  16161606
 2     1  45167429  47068339
 3     1  80995695  80995695
 4     1  82120001  82781655
 ..   ..       ...       ...
 450  22  48993774  51146132
 451  22  49049741  50616646
 452  22  50436376  51229805
 453  22  51032153  51173041
 454  22  51237215  51237215
 
 [455 rows x 3 columns],
       0         1         2
 0     1   6315125   6631106
 1     1  12169337  16161606
 2     1  20269718  22239026
 3     1  42273297  43365565
 4     1  45100431  46895641
 ..   ..       ...       ...
 562  22  37976422  38673790
 563  22  37977713

In [14]:
reduce(lambda left, right: pd.merge(left, right, how="inner", on=[0,1,2]), dfs)

Unnamed: 0,0,1,2
0,1,207302548,208001215
1,2,127773524,127898467
2,3,120807217,121829279
3,4,10864183,11050683
4,4,17421637,18282225
5,5,72819206,73187503
6,6,38980511,42216621
7,6,47312531,47875319
8,7,138432792,138552312
9,10,11424595,11835147


In [5]:

f3393_ADmeta=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f3393_ADmeta/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump_071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.ldclump.clumped_region', dtype=str , sep=' ', header=None )
f2247_ADmeta=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2247_ADmeta/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump_071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.clumped_region', dtype=str , sep=' ', header=None )
f2257_ADmeta=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2257_ADmeta/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump_071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.clumped_region', dtype=str ,sep=' ', header=None )
combined_ADmeta=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_combined_ADmeta/GCST90027158_buildGRCh37_reheader.snp_stats.ldclump_071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.ldclump.clumped_region', dtype=str , sep=' ', header=None)

f3393_ADproxy=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f3393_ADproxy/AD_step2_BT_all.txt.snp_stats.ldclump_071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.ldclump.clumped_region', dtype=str ,  sep=' ', header=None)
f2247_ADproxy=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2247_ADproxy/AD_step2_BT_all.txt.snp_stats.ldclump_071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.ldclump.clumped_region', dtype=str , sep=' ', header=None)
f2257_ADproxy=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_f2257_ADproxy/AD_step2_BT_all.txt.snp_stats.ldclump_071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2_f2257.regenie.snp_stats.ldclump.clumped_region', dtype=str ,  sep=' ', header=None)
combined_ADproxy=pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/080822_combined_ADproxy/AD_step2_BT_all.txt.snp_stats.ldclump_071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f2247_f2257.regenie.snp_stats.ldclump.clumped_region', dtype=str , sep=' ', header=None)

In [10]:
f3393_ADmeta

Unnamed: 0,0,1,2
0,1,20269718,22239026
1,1,42273297,43365565
2,1,50649929,52830530
3,1,62814858,63491421
4,1,63381854,67224373
...,...,...,...
422,21,38925356,42309248
423,21,41743586,45396035
424,21,43747236,44021573
425,22,37978750,38626278


In [6]:
combined_ADproxy

Unnamed: 0,0,1,2
0,1,6315125,6631106
1,1,12197464,16161606
2,1,45167429,47068339
3,1,80995695,80995695
4,1,82120001,82781655
...,...,...,...
450,22,48993774,51146132
451,22,49049741,50616646
452,22,50436376,51229805
453,22,51032153,51173041


In [7]:
reduce(lambda left, right: pd.merge(left, right, how="inner", on=[0,1,2]), [f3393_ADmeta,f2247_ADmeta,f2257_ADmeta,combined_ADmeta])

Unnamed: 0,0,1,2
0,1,20269718,22239026
1,1,42273297,43365565
2,1,50649929,52830530
3,1,62814858,63491421
4,1,63381854,67224373
...,...,...,...
329,21,27344189,27765689
330,21,28095992,28209494
331,21,33951180,37891455
332,21,37948100,38914616


In [8]:
reduce(lambda left, right: pd.merge(left, right, how="inner", on=[0,1,2]), [f3393_ADproxy,f2247_ADproxy,f2257_ADproxy,combined_ADproxy])

Unnamed: 0,0,1,2
0,1,80995695,80995695
1,1,105458976,107422312
2,1,158531011,158531011
3,1,160834616,161383559
4,1,169267498,172944529
...,...,...,...
176,20,776747,894121
177,20,43358848,44214164
178,20,59195437,62721760
179,22,49049741,50616646


In [9]:
reduce(lambda left, right: pd.merge(left, right, how="inner", on=[0,1,2]), [f3393_ADmeta,f2247_ADmeta,f2257_ADmeta,combined_ADmeta,f3393_ADproxy,f2247_ADproxy,f2257_ADproxy,combined_ADproxy])

Unnamed: 0,0,1,2
0,1,207302548,208001215
1,2,127773524,127898467
2,3,120807217,121829279
3,4,10864183,11050683
4,4,17421637,18282225
5,5,72819206,73187503
6,6,38980511,42216621
7,6,47312531,47875319
8,7,138432792,138552312
9,10,11424595,11835147


In [None]:
#To find the name of the file where the region was located
grep -RH "1 42273297 43365565" 080822_*/*.clumped_region

In [None]:
sort -m 080822_*/*.clumped_region | uniq -d > shared_regions.txt
grep -Fx -f shared_regions.txt 080822_*/*.clumped_region

# Find the shared significant variants between AD_meta and AD_proxy

# Step 4: run region extraction to get some of the regions to test