# Analysis of pleiotropy between Alzheimer's Disease and ARHI traits

First, the available references on AD are reviewed, in particular AD GWAS and meta-analyses that combine different studies.

1. Kunkle et al https://www.nature.com/articles/s41588-019-0358-2
2. Jansen et al https://pubmed.ncbi.nlm.nih.gov/30617256/
3. Wightman et al https://www.nature.com/articles/s41588-021-00921-z
4. [family history based] https://pubmed.ncbi.nlm.nih.gov/29777097/
5. Schwartzentruber et al https://pubmed.ncbi.nlm.nih.gov/33589840/ 
6. Bellenguez et al https://www.nature.com/articles/s41588-022-01024-z

**Using AD proxy case definition**

[Schwartzentruber et al 2021](https://www.nature.com/articles/s41588-020-00776-w)

**Using clinically diagnosed LOAD**

[Kunkle et al 2019](https://www.nature.com/articles/s41588-019-0358-2)

**Using clinically diagnosed/proxy AD cases**

[Bellenguez et al 2022](https://www.nature.com/articles/s41588-022-01024-z)

- two stage GWAS
- 111,326 clinically diagnosed/‘proxy’ AD cases and 677,663 controls
- The European Alzheimer & Dementia Biobank  + UK Biobank
- GWAS meta-analysis results can be downloaded here https://www.ebi.ac.uk/gwas/studies/GCST90027158

The results from this meta-analysis have been downloaded to our cluster and are located here:

`/mnt/vast/hpc/csg/data_public/GWAS_sumstats/GCST90027158_buildGRCh38.tsv.gz`
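
A file of this kind can be inspected in chunks with pandas instead of being loaded whole. The sketch below is only an illustration under stated assumptions: it builds a small toy gzip-compressed TSV, and the column names (`chromosome`, `base_pair_location`, `p_value`) are assumed for the example and should be checked against the header of the real download.

```python
import gzip
import os
import tempfile

import pandas as pd

# Toy stand-in for the real file; column names are assumed, not taken from the download.
toy_rows = (
    "chromosome\tbase_pair_location\tp_value\n"
    "1\t1000\t0.5\n"
    "1\t2000\t1e-9\n"
    "19\t3000\t3e-12\n"
)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_sumstats.tsv.gz")
with gzip.open(toy_path, "wt") as handle:
    handle.write(toy_rows)


def count_significant(path, p_col="p_value", threshold=5e-8, chunksize=2):
    """Stream the file in chunks and count rows below the p-value threshold."""
    n_total, n_sig = 0, 0
    for chunk in pd.read_csv(path, sep="\t", compression="gzip", chunksize=chunksize):
        n_total += len(chunk)
        n_sig += int((chunk[p_col] < threshold).sum())
    return n_total, n_sig


n_total, n_sig = count_significant(toy_path)
print(n_total, n_sig)
```

For the real file a much larger `chunksize` would be more practical; the value of 2 is only there so that the toy file is split into more than one chunk.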

## GWAS for ARHI used in the pleiotropy analysis

Parameters used for REGENIE

```
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
minMAC=4
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
```

The GWAS results are stored here (note: these results correspond to white European individuals):

- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_f3393_500K
- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_f2247_500K
- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_f2257_500K
- /mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/2021_10_07_combined_500K

# Step 1. Liftover of the AD meta-analysis sumstats to hg19 to match our GWAS results

In [None]:
module load Singularity
USER_PATH=~/working
export PATH=$HOME/miniconda3/bin:$PATH
sos run $USER_PATH/bioworkflows/GWAS/liftover.ipynb \
--cwd /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI \
--input_file /mnt/vast/hpc/csg/data_public/GWAS_sumstats/GCST90027158_buildGRCh38.tsv.gz \
--output_file GCST90027158_buildGRCh37.tsv \
--fr 'hg38' \
--to 'hg19' \
--yml_file $USER_PATH/UKBB_GWAS_dev/data/liftover.yml
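
Once both sets of summary statistics are in the same build, variants can be matched by chromosome and position. This is a minimal sketch on toy tables, not the workflow's own matching logic; the column names (`chrom`, `pos`, `p_ad`, `p_arhi`) are hypothetical, and a real comparison would probably also need to check that the alleles agree.

```python
import pandas as pd

# Toy tables; column names are assumptions, not the real headers.
ad = pd.DataFrame({"chrom": [1, 1, 19], "pos": [1000, 2000, 3000], "p_ad": [0.5, 1e-9, 3e-12]})
arhi = pd.DataFrame({"chrom": [1, 19, 22], "pos": [2000, 3000, 4000], "p_arhi": [0.2, 1e-8, 0.9]})

# Keep only variants present in both tables at the same coordinate.
shared = ad.merge(arhi, on=["chrom", "pos"], how="inner")
print(len(shared))
```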

# Step 2. Association analysis of the imputed data

Run the association analysis for the hearing impairment traits in the white European individuals present in the 500K samples.

In [39]:
# Common variables Columbia's cluster
UKBB_PATH=$HOME/UKBiobank
UKBB_yale=$HOME/UKBiobank_Yale_transfer
USER_PATH=$HOME/project
container_lmm=$HOME/containers/lmm.sif
container_marp=$HOME/containers/marp.sif
container_annovar=$HOME/containers/gatk4-annovar.sif
hearing_pheno_path=$UKBB_PATH/phenotype_files/hearing_impairment
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
formatFile_fastgwa=$USER_PATH/UKBB_GWAS_dev/data/fastGWA_template.yml
formatFile_bolt=$USER_PATH/UKBB_GWAS_dev/data/boltlmm_template.yml
formatFile_saige=$USER_PATH/UKBB_GWAS_dev/data/saige_template.yml
formatFile_regenie=$USER_PATH/UKBB_GWAS_dev/data/regenie_template.yml
# Workflows
lmm_sos=$USER_PATH/bioworkflows/GWAS/LMM.ipynb
anno_sos=$USER_PATH/bioworkflows/variant-annotation/annovar.ipynb
clumping_sos=$USER_PATH/bioworkflows/GWAS/LD_Clumping.ipynb
extract_sos=$USER_PATH/bioworkflows/GWAS/Region_Extraction.ipynb
snptogene_sos=$USER_PATH/UKBB_GWAS_dev/workflow/snptogene.ipynb

# LMM directories for imputed data
lmm_imp_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data
lmm_imp_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_imputed_data
lmm_imp_dir_saige=$UKBB_PATH/results/SAIGE_results/results_imputed_data
lmm_imp_dir_regenie=$UKBB_PATH/results/REGENIE_results/results_imputed_data

# LMM directories for exome data
lmm_exome_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_exome_data
lmm_exome_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_exome_data
lmm_exome_dir_saige=$UKBB_PATH/results/SAIGE_results/results_exome_data
lmm_exome_dir_regenie=$UKBB_PATH/results/REGENIE_results/results_exome_data

## LMM variables 
## Specific to Bolt_LMM
LDscoresFile=$UKBB_PATH/LDSCORE.1000G_EUR.tab.gz
geneticMapFile=$UKBB_PATH/genetic_map_hg19_withX.txt.gz
covarMaxLevels=10
numThreads=20
bgenMinMAF=0.001
bgenMinINFO=0.8
lmm_job_size=1
ylim=0

### Specific to FastGWA (depending on whether you run from Yale or Columbia)
#### Yale's cluster
grmFile=$UKBB_PATH/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp
#### Columbia's cluster
grmFile=$UKBB_yale/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp

### Specific to SAIGE
bgenMinMAC=4
trait_type=binary
loco=TRUE
sampleCol=IID

### Specific to REGENIE
bsize=1000
lowmem=$HOME/scratch60/
lowmem_dir=$HOME/scratch60/predictions
trait=bt
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
reverse_log_p=True




## New PCA
### Exclude the samples with code 1415 but without the sub-category code

In [None]:
import pandas as pd
sample500k = pd.read_csv("/home/gl2776/UKBiobank/phenotype_files/HI_UKBB/ukb47922_white_460649ind.keep_id",header=None,sep=" ")
sample500k

In [None]:
exclusion = pd.read_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/with_1415_without_subcat.500k.sample_id.txt",header=0,sep="\t")
exclusion = exclusion["IID"].to_list()
len(exclusion)

In [None]:
sample500k[~sample500k[0].isin(exclusion)].shape

In [None]:
sample500k[~sample500k[0].isin(exclusion)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/ukb47922_white_456561ind_exclu_1415.keep_id",header=False,index=False,sep=" ")
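
The exclusion above can be sanity-checked by comparing counts before and after filtering (the file names suggest 460649 individuals before and 456561 after). The sketch below uses toy IDs rather than the real files and shows that only IDs actually present in the keep list are removed.

```python
import pandas as pd

# Toy stand-ins: a keep list read without a header and an exclusion table with an IID column.
keep = pd.DataFrame({0: [101, 102, 103, 104, 105]})
exclusion_ids = pd.DataFrame({"IID": [102, 105, 999]})["IID"].to_list()

mask = keep[0].isin(exclusion_ids)
filtered = keep[~mask]

# IDs in the exclusion list that are absent from the keep list (here 999) remove nothing.
n_removed = int(mask.sum())
assert len(filtered) == len(keep) - n_removed
print(len(keep), n_removed, len(filtered))
```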

In [None]:
## Keep_id for PCA
# f3393
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
# f2247
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
# f2257
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
# f2247_f2257
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2"
phe = pd.read_csv(file,header=0,sep="\t")
phe[["FID", "IID"]].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2.keep_id", sep='\t', index=False, header=False)
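
The four blocks above repeat the same steps, so they could also be written as a small helper applied in a loop. This is a sketch under assumptions: `write_keep_id` is a hypothetical helper name, and the toy files only mimic the `FID`/`IID` layout of the real phenotype files.

```python
import os
import tempfile

import pandas as pd


def write_keep_id(pheno_path):
    """Write the FID and IID columns of a phenotype table to '<pheno_path>.keep_id'."""
    phe = pd.read_csv(pheno_path, header=0, sep="\t")
    out_path = pheno_path + ".keep_id"
    phe[["FID", "IID"]].to_csv(out_path, sep="\t", index=False, header=False)
    return out_path


# Toy phenotype files standing in for the four real ones (file names are hypothetical).
workdir = tempfile.mkdtemp()
toy_files = []
for trait in ["f3393", "f2247", "f2257", "f2247_f2257"]:
    path = os.path.join(workdir, f"toy_{trait}_PC1_PC2")
    toy = pd.DataFrame({"FID": [1, 2], "IID": [1, 2], trait: [0, 1]})
    toy.to_csv(path, sep="\t", index=False)
    toy_files.append(path)

keep_files = [write_keep_id(path) for path in toy_files]
print(len(keep_files))
```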

In [None]:
# phenopca
import pandas as pd
# f3393
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Hearing_aid_f3393_expandedwhite_15601cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k.phenopca", sep='\t', index=False)
# f2247
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Hearing_difficulty_f2247_expandedwhite_110453cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k.phenopca", sep='\t', index=False)
# f2257
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Hearing_noise_f2257_expandedwhite_161443cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k.phenopca", sep='\t', index=False)
# f2247_f2257
file = "/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2"
list1415 = pd.read_csv(file,header=0,sep="\t")["IID"].to_list()
file = "/home/gl2776/UKBiobank/results/092821_PCA_results_500K/100521_UKBB_Combined_f2247_f2257_expandedwhite_93258cases_237318ctrl_500k.phenopca"
phe = pd.read_csv(file,header=0,sep="\t")
phe[phe["IID"].isin(list1415)].to_csv("/home/gl2776/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k.phenopca", sep='\t', index=False)
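
Because the `.phenopca` files are subset by `IID`, it may be worth confirming that every individual in the filtered phenotype file is found in the older table. The sketch below uses toy tables whose columns are assumed to mirror the real ones.

```python
import pandas as pd

# Toy stand-ins for the filtered phenotype table and the older, larger .phenopca table.
phe = pd.DataFrame({"FID": [1, 2, 4], "IID": [1, 2, 4]})
phenopca = pd.DataFrame({"FID": [1, 2, 3, 4], "IID": [1, 2, 3, 4], "ethnicity": ["white"] * 4})

subset = phenopca[phenopca["IID"].isin(phe["IID"])]

# IDs in the phenotype table that the .phenopca table does not contain.
missing = sorted(set(phe["IID"]) - set(subset["IID"]))
print(len(subset), missing)
```

An empty `missing` list means the subset covers the whole phenotype file.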

### Calculate the first 2 PC's for each of these phenotypes to run the association analysis with the imputed data
#### f.3393
##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f3393
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f3393_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f3393_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

# GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f3393
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f3393_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"
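
To illustrate what the first two PCs represent, the sketch below computes them from a toy genotype matrix with a plain singular value decomposition in NumPy. This is not the flashpca algorithm used by the workflow, only a small example under assumed inputs (random 0/1/2 allele counts).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: 50 samples x 200 variants with allele counts 0/1/2.
genotypes = rng.integers(0, 3, size=(50, 200)).astype(float)

# Standardize each variant (column) to mean 0 and unit variance, dropping constant columns.
centered = genotypes - genotypes.mean(axis=0)
std = centered.std(axis=0)
standardized = centered[:, std > 0] / std[std > 0]

# Sample-level principal components from the singular value decomposition.
u, s, _ = np.linalg.svd(standardized, full_matrices=False)
k = 2
pcs = u[:, :k] * s[:k]  # columns play the role of PC1 and PC2
print(pcs.shape)
```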

#### f.2247

##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f2247_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f2247_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

# GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2.

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f2247_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

#### f.2257

##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2257
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

# GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2.

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2257
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f2257_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

#### Combined f.2247 & f.2257

##### Step 1

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247_f2257
#gwas_sbatch=$USER_PATH/UKBB_GWAS_dev/output/qc1_f2247_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
gwas_sbatch=$cwd/qc1_f2247_f2257_qcarray_$(date +"%Y-%m-%d").sbatch
## Use qc'ed genotype array
genoFile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
keep_samples=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2.keep_id
#Keep variants after LD pruning
keep_variants=$UKBB_PATH/results/092821_PCA_results_500K/092821_ldprun_unrelated/cache/*092821_ldprun_unrelated.filtered.prune.in

# GWAS QC variables: set all of these variables to 0 to avoid further filtering
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
numThreads=1
mem='30G'

gwasqc_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/GWAS_QC.ipynb

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
    --mem $mem
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

##### Step 2.

In [None]:
## Columbia's cluster
cwd=$UKBB_PATH/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247_f2257
#This is the bfile obtained in step 1
genoFile=$cwd/*.bed
# Format FID, IID, pop
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071122_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k.phenopca
label_col=ethnicity
pop_col=ethnicity
pca_sbatch=$cwd/flashpca_f2247_f2257_pc_$(date +"%Y-%m-%d").sbatch
k=2
maha_k=2
min_axis=""
max_axis=""

pca_sos=$USER_PATH/xqtl-pipeline/code/data_preprocessing/genotype/PCA.ipynb

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container $container_lmm
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

### Combine the PCs in the phenotype files

Phenotype files named "070722_*_500k_PC1_PC2" carry the original PCs, which were calculated on the original sample set that still included the individuals with code 1415. The "071222_*_500k_PC1_PC2" files are the new phenotype files with the newly calculated PCs.

In [None]:
# f3393
import pandas as pd
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f3393/071122_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f3393', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")

In [None]:
# f2247
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247/071122_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f2247', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")
# f2257
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2257/071122_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f2257', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")
# f2247_f2257
phe = pd.read_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/070722_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2",header=0,sep="\t")
pc = pd.read_csv("~/UKBiobank/results/092821_PCA_results_500K/071122_PCA_500k_exclu_1415/f2247_f2257/071122_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k.pca.txt",header=0,sep="\t")
phe = phe[['FID', 'IID', 'sex', 'f2247_f2257', 'age']]
phe = phe.merge(pc[["IID","PC1","PC2"]],how="left",left_on="IID",right_on="IID")
phe.to_csv("~/UKBiobank/phenotype_files/hearing_impairment/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2",index=False,sep="\t")
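
Since the PCs are attached with a left merge, any individual missing from the PCA output would silently receive empty PC values. The sketch below, on toy tables with assumed values, shows one way to detect this and to guard against duplicated `IID`s with the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for a phenotype table and a PCA output table.
phe = pd.DataFrame({"FID": [1, 2, 3], "IID": [1, 2, 3], "sex": [0, 1, 0],
                    "f3393": [0, 1, 0], "age": [55, 61, 48]})
pc = pd.DataFrame({"IID": [1, 2], "PC1": [0.01, -0.02], "PC2": [0.03, 0.00]})

# validate raises an error if IID is duplicated on either side.
merged = phe.merge(pc[["IID", "PC1", "PC2"]], how="left", on="IID", validate="one_to_one")

# A left merge keeps every phenotype row, so samples absent from the PCA show up as NaN.
n_missing_pcs = int(merged[["PC1", "PC2"]].isna().any(axis=1).sum())
print(len(merged), n_missing_pcs)
```

A non-zero count of missing PCs would be worth resolving before running the association analysis, because the PCs are used as covariates.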

## Run association analysis with imputed data for each phenotype
### 500k

#### f.3393

In [None]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f3393_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### f.2247

In [None]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f2247_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### f.2257

In [None]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2257
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f2257_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Hearing_noise_f2257_expandedwhite_159033cases_236699ctrl_500k_PC1_PC2
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### Combined f.2247 & f.2257

In [None]:
cwd=~/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247_f2257
lmm_dir_regenie=$cwd
lmm_sbatch_regenie=$cwd/f2247_f2257_500K_impdata_regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2
covarFile=$hearing_pheno_path/fulldb_500K/pleiotropy_AD_ARHI/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
bgenMinINFO=0.8
bgenMinMAF=0.001
minMAC=4
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
# Use the original genotype-array bed files for REGENIE step 1
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed data hg19 to run the association analysis
genoFile=`echo ~/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=SNP
lowmem_dir=$cwd/predictions

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --label_annotate $label_annotate
    --no-annotate
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"
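
The `lmm_args` template references several variables (for example `$bsize`, `$trait`, `$formatFile_regenie` and `$ylim`) that are not assigned in the cell itself and are presumably set earlier in the notebook. Bash expands an unset variable to an empty string without complaint, so a missing assignment would silently produce a malformed argument list. The sketch below is one way to list the names a template uses but that have not been defined. The helper name `unset_variables` and the toy template are illustrative assumptions, not part of the workflow.

```python
import re

def unset_variables(template, defined):
    """Return names used as $name or ${name} in template that are absent from defined."""
    names = re.findall(r"\$\{?([A-Za-z_][A-Za-z0-9_]*)\}?", template)
    missing = []
    for name in names:
        if name not in defined and name not in missing:
            missing.append(name)
    return missing

# Toy template with a few of the flags used above (values are placeholders).
args_template = """regenie
    --cwd $lmm_dir_regenie
    --bsize $bsize
    --trait $trait
    --minMAC $minMAC
"""
defined = {"lmm_dir_regenie": "/some/dir", "minMAC": "4"}
missing = unset_variables(args_template, defined)
print(missing)
```

In a real session, `defined` could be built from `os.environ`, provided the variables were exported.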

# Step 3. LD clumping between different traits

## Prepare the summary statistics files so that all column headers match

In [4]:
import pandas as pd
data = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37.tsv.gz', compression='gzip', sep="\t", header=0)

In [5]:
data.head()

|   | CHR | POS | A0 | A1 | SNP | STAT | SE | P |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 529825 | C | T | chr1:529825:C:T | -0.0152 | 0.3236 | 0.9626 |
| 1 | 1 | 531142 | C | CTG | chr1:531142:C:CTG | -0.2494 | 0.2247 | 0.267 |
| 2 | 1 | 566327 | G | A | chr1:566327:G:A | 0.0975 | 0.432 | 0.8214 |
| 3 | 1 | 581537 | G | A | chr1:581537:G:A | -0.3028 | 0.318 | 0.3411 |
| 4 | 1 | 661906 | G | C | chr1:661906:G:C | 0.2938 | 0.3171 | 0.3542 |


In [6]:
data.rename(columns={'A0':'REF', 'A1':'ALT', 'STAT':'BETA'}, inplace=True)

In [7]:
data.head()

|   | CHR | POS | REF | ALT | SNP | BETA | SE | P |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 529825 | C | T | chr1:529825:C:T | -0.0152 | 0.3236 | 0.9626 |
| 1 | 1 | 531142 | C | CTG | chr1:531142:C:CTG | -0.2494 | 0.2247 | 0.267 |
| 2 | 1 | 566327 | G | A | chr1:566327:G:A | 0.0975 | 0.432 | 0.8214 |
| 3 | 1 | 581537 | G | A | chr1:581537:G:A | -0.3028 | 0.318 | 0.3411 |
| 4 | 1 | 661906 | G | C | chr1:661906:G:C | 0.2938 | 0.3171 | 0.3542 |


In [8]:
data.to_csv("/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz", sep='\t', index=False, header=True, compression="gzip")
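
Since the aim of this step is to give every summary-statistics file the same header, a quick check of the column names can catch a file that was missed before clumping. The sketch below is illustrative only: the helper name `missing_columns` and the required column list are assumptions taken from the headers shown in this notebook, not something the clumping workflow is known to enforce.

```python
import pandas as pd

# Columns shared by the reformatted files shown in this notebook
# (assumed to be the set the downstream steps expect).
REQUIRED = ["CHR", "POS", "REF", "ALT", "SNP", "BETA", "SE", "P"]

def missing_columns(df, required=REQUIRED):
    """Return the required column names that are absent from df."""
    return [c for c in required if c not in df.columns]

# Toy one-row frame using the original AD header (A0 / A1 / STAT).
toy = pd.DataFrame({"CHR": [1], "POS": [529825], "A0": ["C"], "A1": ["T"],
                    "SNP": ["chr1:529825:C:T"], "STAT": [-0.0152],
                    "SE": [0.3236], "P": [0.9626]})
before = missing_columns(toy)
after = missing_columns(
    toy.rename(columns={"A0": "REF", "A1": "ALT", "STAT": "BETA"})
)
print(before, after)
```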

### Reformat the SNP IDs in the 500K GWAS summary stats

In [16]:

f3393 = pd.read_csv('/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f3393/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz', dtype=str, compression='gzip',sep="\t", header=0)

In [17]:
f3393.head()

|   | CHR | POS | REF | ALT | SNP | BETA | SE | P |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13259 | A | G | rs562993331 | -0.444965 | 0.405872 | 0.272939253875768 |
| 1 | 1 | 17569 | A | C | rs535086049 | 1.19541 | 1.86985 | 0.5226211725664186 |
| 2 | 1 | 17641 | A | G | rs578081284 | 0.435933 | 0.233139 | 0.0615063563378631 |
| 3 | 1 | 30741 | A | C | rs558169846 | 1.28043 | 1.24525 | 0.303832341109345 |
| 4 | 1 | 52144 | A | T | rs190291950 | -0.117852 | 0.337249 | 0.7267499323234313 |


In [18]:
f3393['var'] = f3393['CHR'] + ':' + f3393['POS'] + ':' + f3393['REF'] + ':' + f3393['ALT']

In [19]:
f3393.head()

|   | CHR | POS | REF | ALT | SNP | BETA | SE | P | var |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13259 | A | G | rs562993331 | -0.444965 | 0.405872 | 0.272939253875768 | 1:13259:A:G |
| 1 | 1 | 17569 | A | C | rs535086049 | 1.19541 | 1.86985 | 0.5226211725664186 | 1:17569:A:C |
| 2 | 1 | 17641 | A | G | rs578081284 | 0.435933 | 0.233139 | 0.0615063563378631 | 1:17641:A:G |
| 3 | 1 | 30741 | A | C | rs558169846 | 1.28043 | 1.24525 | 0.303832341109345 | 1:30741:A:C |
| 4 | 1 | 52144 | A | T | rs190291950 | -0.117852 | 0.337249 | 0.7267499323234313 | 1:52144:A:T |


In [23]:
import os
directory = '/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/'

# os.listdir is not recursive: it only sees entries directly inside `directory`.
# The summary stats live in per-trait subdirectories (e.g. f2247/), so the next
# cell walks the tree with os.walk instead.
for filename in os.listdir(directory):
    if filename.endswith(".regenie.snp_stats.gz"):
        # Join with the directory so the file is found from any working directory
        df = pd.read_csv(os.path.join(directory, filename), dtype=str, compression='gzip', sep="\t", header=0)
        print(df.head())

In [35]:
import os
rootdir = '/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/'

for subdir, dirs, files in os.walk(rootdir):
    for filename in files:
        if filename.endswith(".regenie.snp_stats.gz"):
            print(os.path.join(subdir, filename))
            df = pd.read_csv(os.path.join(subdir, filename), dtype=str, compression='gzip',sep="\t", header=0)
            df.rename(columns={'SNP':'varid'}, inplace=True)
            df['SNP'] = df['CHR'] + ':' + df['POS'] + ':' + df['REF'] + ':' + df['ALT']
            df.to_csv('/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/' + filename, sep='\t', index=False, header=True, compression="gzip")
            print(df.head())

/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247/071222_UKBB_Hearing_difficulty_f2247_expandedwhite_107994cases_236699ctrl_500k_PC1_PC2_f2247.regenie.snp_stats.gz
  CHR    POS REF ALT        varid        BETA        SE                     P  \
0   1  13259   A   G  rs562993331   -0.387653  0.169082  0.023122777490193567   
1   1  17569   A   C  rs535086049    0.331285  0.764723    0.6648627475668728   
2   1  17641   A   G  rs578081284   0.0451544  0.103088    0.6613738204754334   
3   1  30741   A   C  rs558169846     1.76303  0.991047  0.024469776350843825   
4   1  57222   C   T  rs576081345  -0.0501817  0.114946    0.6624254356036648   

           SNP  
0  1:13259:A:G  
1  1:17569:A:C  
2  1:17641:A:G  
3  1:30741:A:C  
4  1:57222:C:T  
/mnt/vast/hpc/csg/UKBiobank/results/REGENIE_results/results_imputed_data/071222_500k_without_1415/f2247_f2257/071222_UKBB_Combined_f2247_f2257_expandedwhite_91092cases_236699ctrl_500k_PC1_PC2_f
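
The AD file's `SNP` column, as shown in the earlier `data.head()` output, carries a `chr` prefix (`chr1:529825:C:T`), whereas the IDs built here do not (`1:13259:A:G`). Whether the clumping workflow matches variants on this column is not stated in this notebook. If it does, the two ID styles would need to agree. The sketch below shows one way to normalize them. The helper names and the toy rows are illustrative assumptions, and the UKBB positions in the toy frame are made up so that one variant overlaps.

```python
import pandas as pd

def strip_chr_prefix(ids):
    """Remove a leading 'chr' from each variant ID in a Series."""
    return ids.astype(str).str.replace(r"^chr", "", regex=True)

def make_variant_id(df):
    """Build CHR:POS:REF:ALT IDs, dropping any leading 'chr' from CHR."""
    chrom = strip_chr_prefix(df["CHR"])
    return chrom + ":" + df["POS"].astype(str) + ":" + df["REF"] + ":" + df["ALT"]

# Toy frames (not real results): one AD-style frame, one UKBB-style frame.
ad = pd.DataFrame({"SNP": ["chr1:529825:C:T", "chr1:566327:G:A"]})
ukb = pd.DataFrame({"CHR": ["1", "1"], "POS": ["529825", "13259"],
                    "REF": ["C", "A"], "ALT": ["T", "G"]})

ad["SNP"] = strip_chr_prefix(ad["SNP"])
ukb["SNP"] = make_variant_id(ukb)

# Variants present in both files after normalization.
shared = sorted(set(ad["SNP"]) & set(ukb["SNP"]))
print(shared)
```

Matching on `CHR:POS:REF:ALT` also assumes both files list the alleles in the same order, which should be checked separately.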

## Variables for LD clumping 

In [53]:
clumping_sos=~/project/bioworkflows/GWAS/LD_Clumping.ipynb
# genoFile is not used by the LD clumping module itself, but the workflow's global parameters require it, so a path must still be provided
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
reference_genotype_prefix=~/UKBiobank/results/LD_clumping/ref_files/ukb39554_c1_22_v3.imputed
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=0.01
clump_r2=0.04
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1
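
The clumping variables above are read here with the usual PLINK-style meaning, which is an assumption since the `LD_Clumping.ipynb` workflow itself is not shown. Under that reading, variants with `P <= clump_p1` can serve as index variants, and each index variant absorbs neighbours within `clump_kb` kilobases that have `P <= clump_p2` and `r2 >= clump_r2` with it. The sketch below illustrates that greedy procedure on toy data. The function name, the input layout and the toy LD values are hypothetical, and this is not the workflow's implementation.

```python
def greedy_clump(variants, r2, p1=5e-8, p2=0.01, r2_min=0.04, kb=2000):
    """Greedy clumping on one chromosome.

    variants: dict mapping variant id -> (position in bp, p-value)
    r2:       dict mapping frozenset({id_a, id_b}) -> LD r2 (missing pairs = 0)
    Returns a dict mapping each index variant to its clumped members.
    """
    assigned = set()
    clumps = {}
    # Visit variants from most to least significant.
    for idx in sorted(variants, key=lambda v: variants[v][1]):
        pos_i, p_i = variants[idx]
        if p_i > p1:
            break  # sorted by p, so no later variant can be an index variant
        if idx in assigned:
            continue  # already absorbed by a stronger index variant
        assigned.add(idx)
        members = []
        for other, (pos_o, p_o) in variants.items():
            if other in assigned or p_o > p2:
                continue
            if abs(pos_o - pos_i) > kb * 1000:
                continue
            if r2.get(frozenset((idx, other)), 0.0) >= r2_min:
                members.append(other)
                assigned.add(other)
        clumps[idx] = members
    return clumps

# Toy input: v2 is in LD with v1; v3 is not significant enough; v4 is far away.
toy = {"v1": (1_000_000, 1e-12), "v2": (1_200_000, 1e-9),
       "v3": (1_500_000, 0.5), "v4": (9_000_000, 2e-8)}
toy_r2 = {frozenset(("v1", "v2")): 0.6, frozenset(("v1", "v3")): 0.3}
result = greedy_clump(toy, toy_r2)
print(result)
```

With `clump_r2=0.04` and `clump_kb=2000`, the settings are strict: even weakly correlated variants within 2 Mb are folded into the same clump.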




## AD and f3393

In [54]:
# Set the bash variables 
clumping_dir=/mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping
clumping_sbatch=$clumping_dir/AD_f3393-ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=`echo /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/GCST90027158_buildGRCh37_reheader.tsv.gz /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/071222_UKBB_Hearing_aid_f3393_expandedwhite_14734cases_236699ctrl_500k_PC1_PC2_f3393.regenie.snp_stats.gz`

clumping_args="""default 
    --cwd $clumping_dir
    --genoFile $genoFile
    --sampleFile $sampleFile
    --reference_genotype_prefix $reference_genotype_prefix  
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /mnt/vast/hpc/csg/UKBiobank/results/pleiotropy_AD_ARHI/LD_clumping/AD_f3393-ldclumping_2022-07-21.sbatch
INFO: Workflow csg (ID=wf686120a6a136c95) is executed successfully with 1 completed step.
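
Judging from the log, this step appears only to write the sbatch script, so a mistyped entry in `sumstatsFiles` would surface only once the job runs. A minimal sketch of a pre-submission check is shown below. The helper name `missing_paths` and the temporary file names are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile

def missing_paths(paths):
    """Return the entries of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Toy demonstration in a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "trait_a.tsv.gz")
    absent = os.path.join(tmp, "trait_b.tsv.gz")
    open(present, "w").close()  # create an empty placeholder file
    not_found = [os.path.basename(p) for p in missing_paths([present, absent])]
print(not_found)
```

In practice the list passed to `missing_paths` would be the space-separated paths held in `sumstatsFiles`.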

