# Analysis of hearing impairment phenotypes

This notebook uses `Get_Job_Script.ipynb` to automatically generate the sbatch scripts that are run on Yale's cluster. The goal is to apply [various LMM workflows](https://github.com/statgenetics/UKBB_GWAS_dev/tree/master/workflow) to run association analyses for different hearing impairment traits, perform LD clumping, and extract the associated regions.

The phenotypes analyzed are:

1. Hearing aid f.3393
2. Hearing difficulty f.2247
3. Hearing difficulty with background noise f.2257
4. Combined phenotype f.2247 & f.2257

## File paths on Yale cluster

- Genotype files in PLINK format:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv`
- Genotype files in bgen format:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb39554_imputeddataset/`
- Summary stats for imputed variants BOLT-LMM:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/BOLTLMM_results/results_imputed_data`
- Summary stats for imputed variants FastGWA:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/FastGWA_results/results_imputed_data`
- Phenotype files:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/hearing_impairment`
- Relationship file:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620`

## Create symlinks to necessary folders in your home dir

```
ln -s /mnt/mfs/statgen/archive/UKBiobank_Yale_transfer ~/
ln -s /mnt/mfs/statgen/UKBiobank ~/
ln -s /mnt/mfs/statgen/containers ~/
```
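Before running any cell it can help to confirm that the three symlinks resolve. The following is a minimal sketch, not part of the original workflow: it builds a throwaway home directory with `mktemp` (so `demo_home` and `demo_target` are placeholders) and reports which of the expected links exist and point at something.

```bash
# Placeholder directories used only for this illustration
demo_home=$(mktemp -d)
demo_target=$(mktemp -d)

# Create just one of the three links so both outcomes are visible
ln -s "$demo_target" "$demo_home/UKBiobank"

# -L tests that the name is a symlink, -e that its target exists
report=$(for name in UKBiobank_Yale_transfer UKBiobank containers; do
    if [ -L "$demo_home/$name" ] && [ -e "$demo_home/$name" ]; then
        echo "$name: ok"
    else
        echo "$name: missing or broken"
    fi
done)
echo "$report"
```

To check a real setup, replace `$demo_home` with `$HOME` and drop the `ln -s` line.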

## Fork and clone bioworkflows and UKBB_GWAS_dev repos to a folder called project

```
mkdir project
cd project
git clone https://github.com/statgenetics/UKBB_GWAS_dev.git
git clone https://github.com/cumc/bioworkflows.git
```



## 08/31/20 analysis

On the cluster, open this notebook using the JupyterLab server you set up via the ssh channel, then run the following cells.

## Bash variables for workflow configuration

### Yale's cluster

Run this cell when working on Yale's cluster

In [1]:
# Common variables Yale cluster
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
USER_PATH=$HOME/project
container_lmm=$UKBB_PATH/lmm.sif
container_marp=$UKBB_PATH/marp.sif
container_annovar=$UKBB_PATH/annovar.sif
hearing_pheno_path=$UKBB_PATH/phenotype_files/hearing_impairment
tpl_file=$USER_PATH/UKBB_GWAS_dev/farnam.yml
formatFile_fastgwa=$USER_PATH/UKBB_GWAS_dev/data/fastGWA_template.yml
formatFile_bolt=$USER_PATH/UKBB_GWAS_dev/data/boltlmm_template.yml
formatFile_saige=$USER_PATH/UKBB_GWAS_dev/data/saige_template.yml
formatFile_regenie=$USER_PATH/UKBB_GWAS_dev/data/regenie_template.yml
###bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
unrelated_samples=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620

# Cleaned Imputed data BGEN input
genoFile=`echo $UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

# Non-QC'ed Exome data PLINK input (as downloaded from the UKBB)
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
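The `genoFile` assignments above rely on bash brace expansion: `{1..22}` is expanded into one path per autosome, and `echo` joins them into a single space-separated string. A minimal sketch of what the variable ends up holding, using a placeholder root `/tmp/demo` instead of the real `$UKBB_PATH` (no files need to exist for the expansion to work):

```bash
# Placeholder root for illustration; the notebook uses $UKBB_PATH
demo_root=/tmp/demo

# Brace expansion yields 22 words, one per chromosome
genoFile=$(echo $demo_root/ukb_imp_chr{1..22}_v3.bgen)

# Count the space-separated entries and pull out the first one
n_files=$(echo $genoFile | wc -w | tr -d ' ')
first_file=${genoFile%% *}
echo "$n_files files, first: $first_file"
```

Because both the BGEN and the exome cells assign to the same `genoFile` name, whichever assignment runs last is the one the later cells see.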






### Columbia's cluster

Run this cell if running your jobs on Columbia's cluster

In [8]:
# Common variables Columbia's cluster
UKBB_PATH=$HOME/UKBiobank
UKBB_yale=$HOME/UKBiobank_Yale_transfer
USER_PATH=$HOME/project
container_lmm=$HOME/containers/lmm.sif
container_marp=$HOME/containers/marp.sif
container_annovar=$HOME/containers/gatk4-annovar.sif
hearing_pheno_path=$UKBB_PATH/phenotype_files/hearing_impairment
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
formatFile_fastgwa=$USER_PATH/UKBB_GWAS_dev/data/fastGWA_template.yml
formatFile_bolt=$USER_PATH/UKBB_GWAS_dev/data/boltlmm_template.yml
formatFile_saige=$USER_PATH/UKBB_GWAS_dev/data/saige_template.yml
formatFile_regenie=$USER_PATH/UKBB_GWAS_dev/data/regenie_template.yml
##bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
##unrelated_samples=$UKBB_yale/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620

# Cleaned Imputed data BGEN input
##genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
##sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

# Non-QC'ed Exome data PLINK input (as downloaded from the UKBB)
##genoFile=`echo $UKBB_yale/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`




## Shared variables for workflows and results

In [9]:
# Workflows
lmm_sos=$USER_PATH/bioworkflows/GWAS/LMM.ipynb
anno_sos=$USER_PATH/bioworkflows/variant-annotation/annovar.ipynb
clumping_sos=$USER_PATH/bioworkflows/GWAS/LD_Clumping.ipynb
extract_sos=$USER_PATH/bioworkflows/GWAS/Region_Extraction.ipynb
snptogene_sos=$USER_PATH/UKBB_GWAS_dev/workflow/snptogene.ipynb

# LMM directories for imputed data
lmm_imp_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data
lmm_imp_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_imputed_data
lmm_imp_dir_saige=$UKBB_PATH/results/SAIGE_results/results_imputed_data
lmm_imp_dir_regenie=$UKBB_PATH/results/REGENIE_results/results_imputed_data

# LMM directories for exome data
lmm_exome_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_exome_data
lmm_exome_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_exome_data
lmm_exome_dir_saige=$UKBB_PATH/results/SAIGE_results/results_exome_data
lmm_exome_dir_regenie=$UKBB_PATH/results/REGENIE_results/results_exome_data




## Specification of LMM variables

In [10]:
## LMM variables 
## Specific to Bolt_LMM
LDscoresFile=$UKBB_PATH/LDSCORE.1000G_EUR.tab.gz
geneticMapFile=$UKBB_PATH/genetic_map_hg19_withX.txt.gz
covarMaxLevels=10
numThreads=20
bgenMinMAF=0.001
bgenMinINFO=0.8
lmm_job_size=1
ylim=0

### Specific to FastGWA: the two grmFile assignments below are alternatives, so keep only the one for the cluster you are running on (the last assignment wins)
#### Yale's cluster
grmFile=$UKBB_PATH/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp
#### Columbia's cluster
grmFile=$UKBB_yale/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp

### Specific to SAIGE
bgenMinMAC=4
trait_type=binary
loco=TRUE
sampleCol=IID

### Specific to REGENIE
bsize=1000
lowmem=$HOME/scratch60/
lowmem_dir=$HOME/scratch60/predictions
trait=bt
minMAC=4
maf_filter=0.01
geno_filter=0.01
hwe_filter=0
mind_filter=0.1
reverse_log_p=True
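In the cell above `grmFile` is assigned once per cluster, so running the cell unchanged always leaves the Columbia path in place. One possible way to choose the root automatically is sketched below. It assumes that `UKBB_yale` is only defined on Columbia's cluster, as in the configuration cells; `pick_grm_root` is a hypothetical helper and the `/demo/...` paths are placeholders.

```bash
# Hypothetical helper: use the Yale-transfer root when it is set
# (Columbia's cluster), otherwise fall back to the default UKBB root.
pick_grm_root() {
    # $1: Yale-transfer root (may be empty)  $2: default UKBB root
    if [ -n "$1" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

grm_rel=results/FastGWA_results/results_imputed_data/demo.grm.sp

# On Yale's cluster UKBB_yale is unset, so the first argument is empty
grm_on_yale=$(pick_grm_root "" /demo/UKBiobank)/$grm_rel
# On Columbia's cluster the Yale-transfer root is used
grm_on_columbia=$(pick_grm_root /demo/UKBiobank_Yale_transfer /demo/UKBiobank)/$grm_rel

echo "$grm_on_yale"
echo "$grm_on_columbia"
```

In the notebook this would be called as `pick_grm_root "$UKBB_yale" "$UKBB_PATH"`.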




## Specification of LD clumping variables

In [4]:
# LD clumping directories
clumping_dir=$UKBB_PATH/results/LD_clumping

## LD clumping variables
# For sumstatsFiles, if there is more than one, provide each path
####Yale's cluster for bgen data
##bfile_ref=$UKBB_PATH/results/LD_clumping/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.1210.ref_geno.bed
####Columbia's cluster for bgen data
##bfile_ref=$UKBB_yale/results/LD_clumping/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.1210.ref_geno.bed
# Changes depending upon which traits are analyzed
ld_sample_size=1210
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1
clumpFile= 
clumregionFile=
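The `clump_*` variables above share their names with PLINK 1.9's `--clump-*` options. The sketch below shows how they could map onto a PLINK call. It is an assumption about the mapping, not a copy of what `LD_Clumping.ipynb` runs; `demo_ref`, `demo.sumstats` and `demo_clumped` are placeholder names, and the command is only printed, not executed.

```bash
# Values copied from the cell above
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP

# Assemble the command as a string; placeholder file names throughout
clump_cmd="plink --bfile demo_ref --clump demo.sumstats \
--clump-field $clump_field --clump-p1 $clump_p1 --clump-p2 $clump_p2 \
--clump-r2 $clump_r2 --clump-kb $clump_kb --clump-annotate $clump_annotate \
--out demo_clumped"
echo "$clump_cmd"
```

With these settings index variants must reach `5e-08`, and any variant within 2000 kb and with r2 above 0.2 to an index variant is assigned to its clump regardless of its own p-value (`clump_p2=1`).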




## Specification of Region extraction variables

In [5]:
# Region extraction directories
extract_dir=$UKBB_PATH/results/region_extraction

## Region extraction variables
#region_file=$UKBB_PATH/results/LD_clumping/
#geno_path=$UKBB_PATH/results/UKBB_bgenfilepath.txt
#sumstats_path=$UKBB_PATH/results/FastGWA_results/results_imputed_data/
#extract_job_size=10




## Creation of the GRM for the imputed data from the genotype array that contains the full sample

In [8]:
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
phenoFile=$hearing_pheno_path/120120_UKBB_Hearing_aid_f3393_expandedwhite
phenoCol=hearing_aid_cat
grm_sbatch=../output/grm_fastgwa_$(date +"%Y-%m-%d").sbatch

grm_args="""gcta
    --cwd $lmm_dir_fastgwa
    --bfile $bfile
    --genoFile $genoFile
    --phenoFile $phenoFile
    --phenoCol $phenoCol
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $grm_sbatch \
    --args "$grm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   ../output/grm_fastgwa_2021-08-12.sbatch
INFO: Workflow csg (ID=w573976bb95aff6a6) is executed successfully with 1 completed step.



## Subsetting the UKBB database for the individuals with exomes

In [None]:
#!/bin/sh
#$ -l h_rt=36:00:00
#$ -l h_vmem=100G
#$ -N subset_ukb47922
#$ -o /home/dmc2245/project/UKBB_GWAS_dev/output/subset_ukb47922_$JOB_ID.out
#$ -e /home/dmc2245/project/UKBB_GWAS_dev/output/subset_ukb47922_$JOB_ID.err
#$ -j y
#$ -q csg.q
#$ -S /bin/bash
export PATH=$HOME/miniconda3/bin:$PATH
module load Singularity/3.5.3
module load R/4.0
Rscript subset_ukb47922.R

In [None]:
setwd('/mnt/mfs/statgen/UKBiobank/phenotype_files/HI_UKBB')
source("/mnt/mfs/statgen/UKBiobank/data/ukbb_databases/ukb47922_updatedAug2021/ukb47922.r")
print('Finished loading database')
print(paste('The database has', nrow(bd), 'rows'))
df.geno <- read.table("/mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/072721_run/plink/ukb23156_c1.merged.filtered.fam", header=FALSE, stringsAsFactors = FALSE)
names(df.geno) <-c("FID","IID","ignore1", "ignore2", "ignore3", "ignore4")
print(paste('There are', nrow(df.geno), 'exomed individuals'))
head(bd[,1, drop=FALSE])
names(bd)[1] <- "IID"
head(bd[,1, drop=FALSE])
df.gen.phen <-merge(df.geno, bd, by="IID", all=FALSE)
print('Subsetting of the database completed')
print(paste('There are', nrow(df.gen.phen), 'individuals in the subsetted database'))
write.csv(df.gen.phen,'082321_UKBB_exomes.csv', row.names = FALSE)
print('Finished writing the csv file')
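The R cell above keeps only the database rows whose `IID` also appears in the exome `.fam` file, which is an inner join. The same idea on toy data, written as a shell sketch with `awk`: the file names and values are made up, and unlike `merge()` the output keeps only the phenotype columns.

```bash
workdir=$(mktemp -d)

# Toy .fam file: FID, IID and four columns that are ignored
printf '1 1 0 0 1 -9\n3 3 0 0 2 -9\n' > "$workdir/toy.fam"
# Toy phenotype table with a header; the first column is the individual ID
printf 'IID,sex,age\n1,0,55\n2,1,61\n3,1,47\n' > "$workdir/toy_pheno.csv"

# First file: remember every IID (column 2 of the .fam file).
# Second file: print the header and each row whose ID was remembered.
awk -F'[ ,]' 'NR == FNR { keep[$2]; next } FNR == 1 || ($1 in keep)' \
    "$workdir/toy.fam" "$workdir/toy_pheno.csv" > "$workdir/toy_exomes.csv"

# Subtract one for the header line
n_kept=$(( $(wc -l < "$workdir/toy_exomes.csv") - 1 ))
echo "There are $n_kept individuals in the subsetted database"
```

Individual 2 has no exome record in the toy `.fam` file, so it is dropped from the output.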

# 1. Hearing aid user f.3393

## FastGWA job only white British

In [None]:
lmm_dir_fastgwa=$lmm_imp_dir_fastgwa/f3393_hearing_aid
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f3393_imp-fastgwa.sbatch
phenoFile=$hearing_pheno_path/200828_UKBB_Hearing_aid_f3393
covarFile=$hearing_pheno_path/200828_UKBB_Hearing_aid_f3393
phenoCol=hearing_aid_cat
covarCol=sex
qCovarCol=age_final_aid

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"
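Every analysis cell in this notebook ends with the same `sos run ... Get_Job_Script.ipynb` call, and only the cluster target, workflow, output script and argument string change. A sketch of a small wrapper that captures the pattern follows. `make_job_script`, `job_script_nb` and the `DRY_RUN` switch are hypothetical names introduced here, and the demo values are placeholders; with `DRY_RUN=1` the command is printed instead of executed.

```bash
make_job_script() {
    # $1: cluster target (farnam or csg)  $2: workflow notebook
    # $3: output sbatch script            $4: workflow arguments
    runner=
    if [ "${DRY_RUN:-0}" = 1 ]; then runner=echo; fi
    $runner sos run "$job_script_nb" "$1" \
        --template-file "$tpl_file" \
        --workflow-file "$2" \
        --to-script "$3" \
        --args "$4"
}

# Placeholder values; in the notebook tpl_file is set by the configuration cells
job_script_nb=demo/Get_Job_Script.ipynb
tpl_file=demo.yml
DRY_RUN=1
dry_run_cmd=$(make_job_script farnam LMM.ipynb demo.sbatch "fastGWA --cwd demo_dir")
echo "$dry_run_cmd"
```

A cell would then reduce to setting the per-analysis variables and calling, for example, `make_job_script farnam "$lmm_sos" "$lmm_sbatch_fastgwa" "$lmm_args"`.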

## FastGWA all whites

In [2]:
lmm_dir_fastgwa=$lmm_imp_dir_fastgwa/f3393_hearing_aid_expandedwhite
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f3393_expandedwhite_imp-fastgwa.sbatch
phenoFile=$hearing_pheno_path/120120_UKBB_Hearing_aid_f3393_expandedwhite
covarFile=$hearing_pheno_path/120120_UKBB_Hearing_aid_f3393_expandedwhite
phenoCol=hearing_aid_cat
covarCol=sex
qCovarCol=age_final_aid

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-12-02_f3393_expandedwhite_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=852e6b72805bc1ab) is executed successfully with 1 completed step.


## FastGWA exome data

In [2]:
lmm_dir_fastgwa=$lmm_exome_dir_fastgwa/f3393_hearing_aid_exomes
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f3393_exomes-fastgwa.sbatch
phenoFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_aid_f3393_128254ind_exomes
covarFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_aid_f3393_128254ind_exomes
phenoCol=hearing_aid_cat
covarCol=sex
qCovarCol=age_final_aid

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --genoFile $genoFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-01-11_f3393_exomes-fastgwa.sbatch
INFO: Workflow farnam (ID=3f949028f44f4152) is executed successfully with 1 completed step.



## Regenie exome data

In [4]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes_bfile
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exome_bfile-regenie.sbatch
phenoFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_aid_f3393_128254ind_exomes
covarFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_aid_f3393_128254ind_exomes
phenoCol=hearing_aid_cat
covarCol=sex
qCovarCol=age_final_aid
#Use original bed files from the UKBB exome data
#bfile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_prefix $lowmem_prefix
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-12_f3393_hearing_aid_exome_bfile-regenie.sbatch
INFO: Workflow farnam (ID=w03430bc454ddd1e8) is executed successfully with 1 completed step.



## 08-16-21 Regenie exome data QC'ed 200K exomes

In [18]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_f3393_hearing_aid
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200Kexomes-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200Kexomes-regenie_2021-08-17.sbatch
INFO: Workflow csg (ID=w74ab0b008f3c94f0) is executed successfully with 1 completed step.



## 08-17-21 Annotation of results from regenie qc 200k exomes

In [12]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_f3393_hearing_aid/annotation_p5e-08
sumstatsFile=$lmm_exome_dir_regenie/081621_f3393_hearing_aid/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl_f3393.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=/mnt/mfs/statgen/UKBiobank/results/ukb23155_200Kexomes_annovar/exome_bim_merge/ukb23155_chr1_chr22.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --rsid $rsid \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_annovar_2021-08-17.sbatch
INFO: Workflow csg (ID=w3f546290203624d2) is executed successfully with 1 completed step.
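The `p_filter=5e-08` setting restricts annotation to genome-wide significant variants. The sketch below illustrates that kind of filter on toy summary statistics. The two-column `SNP`/`P` layout and the strict `<` comparison are assumptions made for the example; they are not taken from the regenie output or from `snptogene.ipynb`.

```bash
p_filter=5e-08

# Toy tab-separated summary statistics with made-up values
sumstats=$(mktemp)
printf 'SNP\tP\nrs1\t3e-09\nrs2\t0.04\nrs3\t2e-07\nrs4\t1e-12\n' > "$sumstats"

# Skip the header; adding 0 forces awk to compare the fields as numbers
hits=$(awk -F'\t' -v p="$p_filter" 'NR > 1 && ($2 + 0) < (p + 0) { print $1 }' "$sumstats")
n_hits=$(echo "$hits" | wc -l | tr -d ' ')
echo "$n_hits variants pass the filter:"
echo "$hits"
```

Only `rs1` and `rs4` fall below the threshold in this toy table.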



## 09-07-21 Regenie exome data 50K vs 150K (QC'ed 200K exomes, genotype file un-QC'ed)

#### 50K

In [8]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_f3393_hearing_aid_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_24496ind_50K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_24496ind_50K.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_50K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=wf83f02b9889c4e0d) is executed successfully with 1 completed step.



#### 150K

In [7]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_f3393_hearing_aid_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_79891ind_150K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_79891ind_150K.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_150K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=w8025af7b7d9f68cd) is executed successfully with 1 completed step.
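The 50K and 150K cells differ only in the output directory, the sbatch name and the phenotype file. A sketch of deriving those from the subset label in a loop is shown below. The sample counts are copied from the file names used above, while `demo_results`, `demo_pheno` and the `demo_*` variable names are placeholders chosen so the real notebook variables are not overwritten.

```bash
for subset in 50K 150K; do
    # Sample counts as they appear in the phenotype file names above
    case $subset in
        50K)  n_ind=24496 ;;
        150K) n_ind=79891 ;;
    esac
    demo_lmm_dir=demo_results/090721_f3393_hearing_aid_$subset
    demo_pheno_file=demo_pheno/080421_UKBB_Hearing_aid_f3393_expandedwhite_${n_ind}ind_$subset.tsv
    echo "$subset: $demo_lmm_dir $demo_pheno_file"
done
```

Inside the loop body the `lmm_args` string and the `sos run` call would follow unchanged, once per subset.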



## 09-08-21 Analysis with exome QC data and QC genotype array with new database ukb47922

In [9]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files variant and sample missingness < 10%
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## After doing the annotation you can create the annotated Manhattan plot
anno_file=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/091321_annotation/*.formatted.csv
label_annotate=Gene
#known_vars="chr5:272741:A:G chr5:272748:G:C chr5:272755:A:G"
#new_vars="chr5:73776529:T:C chr5:73780632:G:A chr5:73780649:GT:G chr5:73780686:C:A chr5:73794436:T:C chr5:73795301:T:A chr5:73795403:C:T chr6:75362956:T:C chr6:75841299:A:G"

# Pass --annotate to add the labels to the plot; otherwise pass --no-annotate

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --anno_file $anno_file
    --label_annotate $label_annotate
    --annotate
    --new_vars $new_vars
    --known_vars $known_vars
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args" 

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200K-regenie_2021-10-19.sbatch
INFO: Workflow csg (ID=we66172e5d1891c29) is executed successfully with 1 completed step.
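The cell above sets all four filters to 0 because the genotype array `bfile` was already filtered with geno=0.01, mind=0.1, maf=0.01 and hwe=5e-08. Those names match PLINK's standard filter options, so a QC step with these thresholds could look like the sketch below. This is an assumption for illustration only: the QC workflow itself is not part of this notebook, `demo_genotypes` is a placeholder, and the command is printed rather than run.

```bash
# Thresholds quoted in the comment of the cell above
geno=0.01   # maximum per-variant missingness
mind=0.1    # maximum per-sample missingness
maf=0.01    # minimum minor allele frequency
hwe=5e-08   # Hardy-Weinberg exact test p-value threshold

# Placeholder input and output prefixes
qc_cmd="plink --bfile demo_genotypes --geno $geno --mind $mind --maf $maf --hwe $hwe --make-bed --out demo_genotypes.filtered"
echo "$qc_cmd"
```

Passing 0 for the workflow's `maf_filter`, `geno_filter`, `hwe_filter` and `mind_filter` then avoids filtering the already cleaned files a second time.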



### 09-08-21 Annotation 

In [11]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/091321_annotation
sumstatsFile=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid  \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_annovar_2021-09-13.sbatch
INFO: Workflow csg (ID=w5c0cf51b9571f756) is executed successfully with 1 completed step.



#### 09-14-21 50K f3393

In [15]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_24189ind_50K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_24189ind_50K.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files variant and sample missingness < 10%
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## After doing the annotation you can create the annotated Manhattan plot
anno_file=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/091321_annotation/*.formatted.csv
label_annotate=Gene
snp_list=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.top_snps.tsv
# Pass --annotate to add the labels to the plot; otherwise pass --no-annotate

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --anno_file $anno_file
    --annotate
    --top_snps
    --snp_list $snp_list
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_50K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=w07ce7346b8a8bccd) is executed successfully with 1 completed step.



#### 09-14-21 150K f3393

In [5]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_78848ind_150K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_78848ind_150K.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## After doing the annotation you can create the annotated Manhattan plot
anno_file=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_150K/091721_annotation/*.formatted.csv
label_annotate=Gene
# --annotate adds the label to the plot; use --no-annotate to omit it

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --annotate
    --anno_file $anno_file
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_150K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=w48f3e009e944f901) is executed successfully with 1 completed step.
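
The `genoFile` value in the cell above is built with brace expansion over chromosomes 1 to 22, so a single missing per-chromosome file is easy to overlook. Below is a minimal sketch of a pre-flight check, assuming the files follow a `<prefix><chr><suffix>` naming pattern. The helper name `missing_chrom_files` and the demo directory are hypothetical and not part of the workflow.

```shell
# Print every expected per-chromosome file that is missing.
# $1 = path prefix, $2 = path suffix (hypothetical helper, not part of the workflow)
missing_chrom_files() {
    prefix=$1
    suffix=$2
    for chr in $(seq 1 22); do
        f="${prefix}${chr}${suffix}"
        [ -e "$f" ] || echo "$f"
    done
}

# Demo on a throwaway directory: create files for chr1..22 except chr7
demo_dir=$(mktemp -d)
for chr in $(seq 1 22); do
    [ "$chr" -eq 7 ] || : > "$demo_dir/ukb23156_c${chr}.merged.filtered.bed"
done
missing=$(missing_chrom_files "$demo_dir/ukb23156_c" ".merged.filtered.bed")
echo "missing: $missing"
```

On the cluster the same call would take the real prefix and suffix used in `genoFile`; empty output means all 22 files are present.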



#### 09-17-21 Annotation 150K f3393

In [26]:
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_150K/091721_annotation
sumstatsFile=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_150K/090321_UKBB_Hearing_aid_f3393_expandedwhite_78848ind_150K_f3393.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_150K_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid  \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_150K_annovar_2021-09-17.sbatch
INFO: Workflow csg (ID=w0e3c8599bad52968) is executed successfully with 1 completed step.
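
The `p_filter=5e-08` value above is the conventional genome-wide significance threshold. As a rough illustration of what a filter at that threshold does (this is not the workflow's own code), the sketch below keeps only rows whose p-value falls below the cutoff. It assumes a whitespace-delimited file with a header and a column named `P`. Both the column name and the helper `filter_by_p` are assumptions.

```shell
# Keep the header and the rows whose P column is below a threshold.
# $1 = summary-statistics file (whitespace-delimited, with header)
# $2 = threshold; the column name "P" is an assumption
filter_by_p() {
    awk -v thr="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "P") col = i; print; next }
        col && ($col + 0) < (thr + 0)
    ' "$1"
}

# Toy summary statistics: two hits below 5e-08 and one non-significant row
demo=$(mktemp)
printf 'CHR POS SNP P\n' > "$demo"
printf '5 100 snpA 3e-12\n' >> "$demo"
printf '5 200 snpB 0.02\n' >> "$demo"
printf '6 300 snpC 4e-08\n' >> "$demo"
hits=$(filter_by_p "$demo" 5e-08 | awk 'NR > 1 { print $3 }' | tr '\n' ' ')
echo "$hits"
```

For a gzipped file such as `*.snp_stats.gz`, the input would first need to be decompressed (for example with `zcat` into a temporary file) before being passed to the helper.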



#### Hudson 150K vs 200K

In [4]:
hudson_sos=~/project/bioworkflows/GWAS/Hudson_Plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots/exome_data/$(date +"%Y%m%d")_f3393_150kvs200k
hudson_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_150k_200k_hudson_$(date +"%Y-%m-%d").sbatch
sumstats_1=$UKBB_PATH/results/REGENIE_results/results_exome_data/090921_f3393_hearing_aid_200K/*.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/091421_f3393_hearing_aid_150K/*.snp_stats.gz
toptitle="H-aid mega-analysis"
bottomtitle="H-aid discovery"
highlight_p_top=0.0
highlight_p_bottom=0.0
pval_filter=5e-08
highlight_snp=
job_size=1 
phenocol1='H-aid 200K'
phenocol2='H-aid 150K'
container_lmm=~/containers/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --highlight_snp $highlight_snp
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --container_lmm $container_lmm
"""
sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_150k_200k_hudson_2021-10-11.sbatch
INFO: Workflow csg (ID=wdb69365c9bfa457c) is executed successfully with 1 completed step.



## 09-21-21 Conditional analysis GCTA-COJO

In [6]:
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_150K/092121_gcta_cond
sumstatsFile=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.snp_stats_original_columns.gz
snp_list=~/test/gcta_cojo/snp.list 
formatFile=~/project/UKBB_GWAS_dev/data/gcta-cojo_template.yml
numThreads=5
bfile=$UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c6.merged.filtered.bed
job_size=1
maf=0.001
gcta_cojo_sos=~/project/UKBB_GWAS_dev/workflow/GCTA-COJO.ipynb
gcta_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_gcta_cond_$(date +"%Y-%m-%d").sbatch

gcta_args="""gcta_cond \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --snp_list $snp_list\
    --bfile $bfile \
    --job_size $job_size \
    --maf $maf \
    --formatFile $formatFile\
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_cojo_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_gcta_cond_2021-11-02.sbatch
INFO: Workflow csg (ID=w78d7d81c743271c4) is executed successfully with 1 completed step.



## 12-22-21 Estimate joint effects of a subset of SNPs

In [5]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/122221_gcta_joint
sumstatsFile=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.snp_stats_original_columns.gz
snp_list=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/122221_gcta_joint/snp_list.txt
formatFile=~/project/UKBB_GWAS_dev/data/gcta-cojo_template.yml
numThreads=5
bfile=$UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c5.merged.filtered.bed
job_size=1
maf=0.001
chrom=5
gcta_cojo_sos=~/project/UKBB_GWAS_dev/workflow/GCTA-COJO.ipynb
gcta_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_gcta_joint_$(date +"%Y-%m-%d").sbatch

gcta_args="""gcta_joint \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --snp_list $snp_list\
    --bfile $bfile \
    --job_size $job_size \
    --maf $maf \
    --chrom $chrom \
    --formatFile $formatFile\
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_cojo_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_gcta_joint_2021-12-22.sbatch
INFO: Workflow csg (ID=w6b9965f1a6986291) is ignored with 1 ignored step.



## 11-02-21 Stepwise model selection GCTA-COJO 

In [9]:
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f3393_hearing_aid_150K/110221_gcta_stepwise
sumstatsFile=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2_f3393.regenie.snp_stats_original_columns.gz
formatFile=~/project/UKBB_GWAS_dev/data/gcta-cojo_template.yml
numThreads=5
bfile=$UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c5.merged.filtered.bed
job_size=1
maf=0.001
gcta_cojo_sos=~/project/UKBB_GWAS_dev/workflow/GCTA-COJO.ipynb
gcta_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_gcta_slct_$(date +"%Y-%m-%d").sbatch

gcta_args="""gcta_slct \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --snp_list $snp_list\
    --bfile $bfile \
    --job_size $job_size \
    --maf $maf \
    --formatFile $formatFile\
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $gcta_cojo_sos \
    --to-script $gcta_sbatch \
    --args "$gcta_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_gcta_slct_2021-11-02.sbatch
INFO: Workflow csg (ID=w282ca2f59169194d) is executed successfully with 1 completed step.



## 01-05-22 Regenie imputed data with 200K individuals that have both exome and imputed data 

In [11]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to do it here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_imp_dir_regenie/010522_f3393_hearing_aid_200K_imputed
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_imputed-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed bgen files
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
bgenMinINFO=0.3
bgenMinMAC=4
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
label_annotate=Gene
# --annotate adds the label to the plot; use --no-annotate to omit it

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --no-annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args" 

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_imputed-regenie_2022-01-06.sbatch
INFO: Workflow csg (ID=w4c734452d4e35dca) is executed successfully with 1 completed step.



## Regenie imputed data: Expanded white control NA

This analysis is done with the correct number of cases and controls (those NA for f.3393)
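
Before launching the run, it can help to confirm the case and control counts in the phenotype column. The sketch below tallies the values of one named column. It assumes a whitespace-delimited phenotype file with a header row; the helper name `tally_column`, the toy file, and the 1/0/NA coding shown are illustrative assumptions rather than the actual contents of the phenotype file.

```shell
# Tally the values of one named column in a whitespace-delimited file with a header.
# $1 = phenotype file, $2 = column name (hypothetical helper)
tally_column() {
    awk -v name="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col { count[$col]++ }
        END { for (v in count) print v, count[v] }
    ' "$1" | sort
}

# Toy phenotype file; the coding (1, 0, NA) is assumed for illustration only
pheno=$(mktemp)
printf 'FID IID sex f3393_ctrl_na\n' > "$pheno"
printf '1 1 0 1\n2 2 1 0\n3 3 1 0\n4 4 0 NA\n' >> "$pheno"
tally=$(tally_column "$pheno" f3393_ctrl_na)
echo "$tally"
```

Each output line is a value followed by the number of rows carrying it, which can be compared against the sample counts encoded in the phenotype file name.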

In [12]:
lmm_dir_regenie=$lmm_imp_dir_regenie/f3393_hearing_aid_impdata_newpheno
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_impdata-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_final_aid
genoFile=`echo $UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

ERROR: Failed to locate /home/dmc2245/project/bioworkflows/admin/Get_Job_Script.ipynb.sos
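
The error above indicates that `Get_Job_Script.ipynb` was not found under `~/project/bioworkflows/admin` when this cell was run; other cells in this notebook call it from `~/project/UKBB_GWAS_dev/admin`. A minimal sketch of one way to guard against this is to pick the first location that actually exists before calling `sos run`. The helper name `first_existing` and the temporary directory layout are hypothetical.

```shell
# Return the first candidate path that exists as a regular file.
first_existing() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Demo in a throwaway directory: only the second location holds the notebook
root=$(mktemp -d)
mkdir -p "$root/bioworkflows/admin" "$root/UKBB_GWAS_dev/admin"
: > "$root/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb"
job_script=$(first_existing \
    "$root/bioworkflows/admin/Get_Job_Script.ipynb" \
    "$root/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb")
echo "$job_script"
```

With the real paths, the resulting `$job_script` would be passed to `sos run` in place of the hard-coded notebook path.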



## Regenie: single variant association analysis with exome data on replication set 50K exomes

### f.3393 & Pure controls

In [2]:
# First run using only pure controls for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes50K_pure_ctrl
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exomes50K_pure_ctrl-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_final_aid
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-19_f3393_hearing_aid_exomes50K_pure_ctrl-regenie.sbatch
INFO: Workflow farnam (ID=wfb6905d41d076b6e) is executed successfully with 1 completed step.



### f.3393 & Controls NA for f.3393

In [4]:
# Run using controls NA for f3393
lmm_dir_regenie=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes50K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exomes50K_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_final_aid
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam (index=0) is ignored due to saved signature
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-19_f3393_hearing_aid_exomes50K_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=w06d19fc2ff71aba3) is ignored with 1 ignored step.



## Regenie in exome data (original Plink files UKBB unqc'ed) using modified phenotype file with controls_na for f.3393

### f.3393 & Controls NA for f.3393

In [2]:
# First run using controls na for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes200K_noqc_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/062421_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_104402ind
covarFile=$hearing_pheno_path/062421_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_104402ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_final_aid
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
hwe_filter=5e-08

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-24_f3393_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=w574ef8382b581afa) is executed successfully with 1 completed step.



## Regenie in exome data after VCF-QC 200K exomes

### f.3393 & Controls NA for f.3393

In [9]:
# Run using all controls for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes200K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_exomes200K_ctrl_na-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_final_aid
genoFile=`echo /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c{1..22}.merged.filtered.bed`
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_exomes200K_ctrl_na-regenie_2021-05-18.sbatch
INFO: Workflow csg (ID=w1682bd842f840e70) is executed successfully with 1 completed step.



## Regenie Burden with 50K exomes

In [None]:
lmm_dir_regenie=$lmm_exome_dir_regenie/burden/f3393_hearing_aid_exomes50K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exomes50K_ctrl_na-regenie-burden.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_228760ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_aid
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed
anno_file=$lmm_exome_dir_regenie/burden/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.anno_file
set_list=$lmm_exome_dir_regenie/burden/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.set_list_file
mask_file=$lmm_exome_dir_regenie/burden/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.mask_file
keep_gene=
build_mask=max
aaf_bins=0.005,0.01

lmm_args="""regenie_burden
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --trait $trait
    --anno_file $anno_file
    --set_list $set_list
    --mask_file $mask_file
    --keep_gene $keep_gene
    --aaf_bins $aaf_bins
    --build_mask $build_mask
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

## Create annotation file with annovar for 200K exomes 

In [3]:
annovar_dir=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/
annovar_sos=$USER_PATH/bioworkflows/variant-annotation/annovar.ipynb
annovar_sbatch=$USER_PATH/UKBB_GWAS_dev/output/ukb23155_200k_exome_annotation_$(date +"%Y-%m-%d").sbatch
bfiles=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/exome_bim_merge/ukb23155_chr1_chr22.bim
walltime="60h"
mem="30G"

annovar_args="""annovar
    --cwd $annovar_dir 
    --bim_name $bfiles 
    --humandb /mnt/mfs/statgen/isabelle/REF/humandb  
    --xref_path /mnt/mfs/statgen/isabelle/REF/humandb 
    --job_size 1 
    --build 'hg38' 
    --name_prefix ukb23155_chr1_chr22 
    --walltime $walltime
    --mem $mem
    --container_annovar /mnt/mfs/statgen/containers/gatk4-annovar.sif
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $annovar_sos \
    --to-script $annovar_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/ukb23155_200kexome_annotation_2021-08-12.sbatch
INFO: Workflow csg (ID=wd678cbee99943d68) is executed successfully with 1 completed step.



## Create the anno_file, set_list_file and mask_files necessary for burden test

In [6]:
burden_dir=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/burden_files
anno_sbatch_burden=$USER_PATH/UKBB_GWAS_dev/output/ukb23155_200Kexomes_burdenfiles_$(date +"%Y-%m-%d").sbatch
## Annotated exome file for 50K exomes UKBB
#annotated_file_hg38=$UKBB_PATH/results/ukb32285_exomespb_annovar/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.csv
## Annotated exome file for 200K exomes UKBB
annotated_file_hg38=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/ukb23155_chr1_chr22_exomedata.hg38.hg38_multianno.csv.gz
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/exome_bim_merge/ukb23155_chr1_chr22.bim
job_size=1
name_prefix='ukb23155_chr1_chr22_burden_files'
container_annovar=$HOME/containers/gatk4-annovar.sif

anno_args="""burden_files
    --cwd $burden_dir
    --annotated_file $annotated_file_hg38
    --bim_name $bim_name
    --name_prefix $name_prefix
    --job_size $job_size
    --container_annovar $container_annovar
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $anno_sos \
    --to-script $anno_sbatch_burden\
    --args "$anno_args"


INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/gl2776/working/UKBB_GWAS_dev/output/ukb23155_200Kexomes_burdenfiles_2021-08-11.sbatch
INFO: Workflow csg (ID=w8ca34b002594c307) is executed successfully with 1 completed step.



## Regenie Burden with 200K exomes

This run uses the new phenotype file, with the pure-control and case definitions made by Fabiha.

In [None]:
lmm_dir_regenie=$lmm_exome_dir_regenie/burden/f3393_hearing_aid_200K_exomes
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_200k_exomes-regenie-burden.sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl
phenoCol=f3393
covarCol=sex
qCovarCol=age
# This run uses the un-QC'ed PLINK files while we wait for the QC'ed ones
genoFile=
#Use the original bed files for the genotype array for the expanded white on regenie step1
##Yale's cluster
#bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed
## Columbia's cluster
bfile=$UKBB_PATH/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed
anno_file=$lmm_exome_dir_regenie/burden/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.anno_file
set_list=$lmm_exome_dir_regenie/burden/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.set_list_file
mask_file=$lmm_exome_dir_regenie/burden/ukb32285_exomespb_chr1_22.hg38.hg38_multianno.mask_file
keep_gene=
build_mask=max
aaf_bins=0.005,0.01

lmm_args="""regenie_burden
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --trait $trait
    --anno_file $anno_file
    --set_list $set_list
    --mask_file $mask_file
    --keep_gene $keep_gene
    --aaf_bins $aaf_bins
    --build_mask $build_mask
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

## Bolt-LMM job

In [2]:
lmm_dir_bolt=$lmm_imp_dir_bolt/f3393_hearing_aid
lmm_sbatch_bolt=../output/$(date +"%Y-%m-%d")_f3393_hearing_aid_imp-bolt.sbatch
phenoFile=$hearing_pheno_path/200828_UKBB_Hearing_aid_f3393
covarFile=$hearing_pheno_path/200828_UKBB_Hearing_aid_f3393
phenoCol=hearing_aid_cat
covarCol=sex
qCovarCol=age_final_aid

lmm_args="""boltlmm
    --cwd $lmm_dir_bolt 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_bolt 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp    
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_bolt \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-10-20_f3393_hearing_aid_imp-bolt.sbatch
INFO: Workflow farnam (ID=c482417dfcba9f23) is executed successfully with 1 completed step.
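
Each `lmm_args` string is assembled from many shell variables, and a variable that is unset or misspelled expands silently to an empty string. The sketch below shows one way to check for that before generating the sbatch script. `check_vars` is a hypothetical helper and is not part of the workflows; the variable values are toy placeholders.

```shell
# Hypothetical helper, not part of the SoS workflows:
# report every named variable that is unset or empty.
check_vars() {
    missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "WARNING: \$$name is unset or empty" >&2
            missing=$((missing + 1))
        fi
    done
    # The exit status is the number of missing variables
    return "$missing"
}

# Toy values for illustration only
bfile=/tmp/example.bed
phenoCol=hearing_aid_cat
unset genoFile

n_missing=0
check_vars bfile phenoCol genoFile || n_missing=$?
echo "variables missing: $n_missing"
```

Running the check on the names used in an argument string right before `sos run` would flag a variable that was never defined in the session.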



## LD clumping job

### Imputed data

In [None]:
clumping_dir=$clumping_dir/f3393_hearing_aid
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f3393_hearing_aid_ldclumping.sbatch
sumstatsFiles=$lmm_imp_dir_fastgwa/f3393_hearing_aid/200828_UKBB_Hearing_aid_f3393_hearing_aid_cat.fastGWA.snp_stats.gz

clumping_args="""default 
    --cwd $clumping_dir 
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

### Exome data

FIXME: 
1. Option: create the reference bed from the exome data. However, this bfile is used to calculate the LD between SNPs, so it is better to use a reference file created from the genotype array. One drawback is that variants present in the exome but absent from the genotype array won't be selected as index SNPs.
2. Option: use the bfile_ref created from the imputed data.

In [1]:
cwd=$clumping_dir/f3393_hearing_aid_exome
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exome_ldclumping.sbatch
#clumping_sbatch=../output/$(date +"%Y-%m-%d")_refbedfile2_exome_ldclumping.sbatch
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
sumstatsFiles=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.snp_stats.gz
sampleFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bfile_ref=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631_chr1_22_exomedata.1200.ref_geno.bed
ld_sample_size=1200
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# First select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $cwd
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-07_f3393_hearing_aid_exome_ldclumping.sbatch
INFO: Workflow farnam (ID=w64d7831c0c82bb7e) is executed successfully with 1 completed step.
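
As a quick sanity check before submitting a clumping job, it can help to count how many variants would qualify as index candidates. The sketch below assumes a tab-separated summary statistics file whose header contains the column named by `clump_field`, and assumes index candidates are the variants with a p-value at or below `clump_p1`. The toy file and its column layout are made up for illustration and do not reproduce the real `snp_stats.gz` format.

```shell
# Toy summary statistics (hypothetical columns), gzip-compressed like the real files
tmpdir=$(mktemp -d)
sumstats=$tmpdir/toy.snp_stats.gz
printf 'SNP\tCHR\tPOS\tP\nrs1\t1\t1000\t3e-09\nrs2\t1\t2000\t0.02\nrs3\t2\t500\t4.9e-08\nrs4\t2\t900\t6e-08\n' \
    | gzip > "$sumstats"

clump_field=P
clump_p1=5e-08

# Find the p-value column by its header name, then count rows at or below the threshold
n_index=$(gzip -dc "$sumstats" | awk -F'\t' -v field="$clump_field" -v p1="$clump_p1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == field) col = i; next }
    ($col + 0) <= (p1 + 0) { n++ }
    END { print n + 0 }')
echo "index candidates: $n_index"
```

A count of zero would mean the clumping job has nothing to clump at the chosen `clump_p1`.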



## 09-22-21 LD clumping for the hearing aid data 

### Step 1. Get a random set of samples

In [39]:
cwd=$clumping_dir
clumping_sbatch=$USER_PATH/UKBB_GWAS_dev/output/filter_samples_exome_ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/*snp_stats.gz
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Individuals from the subset of white individuals with exome data 
sampleFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam
unrelated_samples=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# First select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""filter_samples
    --cwd $cwd
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/filter_samples_exome_ldclumping_2021-09-22.sbatch
INFO: Workflow csg (ID=w9fc715ba2f8e5890) is executed successfully with 1 completed step.
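
The idea behind this step can be sketched in plain shell: draw `ld_sample_size` individuals at random from the unrelated `.fam` file and keep their FID and IID. This is only an illustration under the assumption that the `filter_samples` workflow does something equivalent; its actual implementation may differ. The toy `.fam` file and the output file name are hypothetical, and `shuf` is assumed to be available (GNU coreutils).

```shell
tmpdir=$(mktemp -d)
unrelated_fam=$tmpdir/toy_unrelated.fam

# Toy .fam file with 50 individuals: FID IID PAT MAT SEX PHENO
seq 1 50 | awk '{print $1, $1, 0, 0, 0, -9}' > "$unrelated_fam"

ld_sample_size=10

# Draw rows at random without replacement and keep only FID and IID
shuf -n "$ld_sample_size" "$unrelated_fam" | awk '{print $1, $2}' > "$tmpdir/ld_samples.txt"

n_selected=$(wc -l < "$tmpdir/ld_samples.txt" | tr -d ' ')
n_unique=$(sort -u "$tmpdir/ld_samples.txt" | wc -l | tr -d ' ')
echo "selected $n_selected samples ($n_unique unique)"
```

A two-column FID/IID file of this kind is the format PLINK accepts for `--keep`.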



### Step 2. Create the reference bfile from the plink QC'ed files

In [5]:
cwd=$clumping_dir
clumping_sbatch=$USER_PATH/UKBB_GWAS_dev/output/refbedfile_exome_ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/*snp_stats.gz
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Individuals from the subset of white individuals with exome data 
sampleFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam
unrelated_samples=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
name_bref='ukb23156_c1_22.merged.filtered'
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# First select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""reference
    --cwd $cwd
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --name_ref $name_bref
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/refbedfile_exome_ldclumping_2021-09-23.sbatch
INFO: Workflow csg (ID=w9cf8dfc0819763b2) is executed successfully with 1 completed step.



### Step 3. Do the LD clumping

In [8]:
cwd=$clumping_dir/092321_f3393
clumping_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_exome_ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/*snp_stats.gz
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Individuals from the subset of white individuals with exome data 
sampleFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam
unrelated_samples=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
bfile_ref=$UKBB_PATH/results/LD_clumping/ukb23156_c1_22.merged.filtered.2000.ref_geno.bed
name_bref='ukb23156_c1_22.merged.filtered'
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# First select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $cwd
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --name_bref $name_bref
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/exome_ldclumping_2021-09-23.sbatch
INFO: Workflow csg (ID=we4fd4870caab48b8) is executed successfully with 1 completed step.
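
In the cells above, `sumstatsFiles` holds a glob pattern (`*snp_stats.gz`). The shell does not expand globs in variable assignments, and the pattern stays unexpanded inside the double-quoted argument string, so it is presumably resolved later when the job runs. The sketch below shows how to check beforehand how many files a stored pattern matches. The toy directory and file name are hypothetical.

```shell
tmpdir=$(mktemp -d)
: > "$tmpdir/toy_trait.regenie.snp_stats.gz"

# The pattern is stored literally; no expansion happens in the assignment
sumstatsFiles="$tmpdir/*snp_stats.gz"

n_matches=0
# Left unquoted on purpose so that the shell expands the pattern here
for f in $sumstatsFiles; do
    if [ -e "$f" ]; then
        n_matches=$((n_matches + 1))
        echo "matched: $f"
    fi
done
echo "files matched: $n_matches"
```

Zero matches usually point to a wrong results directory or an association job that has not finished yet.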



## Region extraction

### Imputed data

In [4]:
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_Hearing_aid_f3393
extract_dir=$UKBB_PATH/results/region_extraction/f3393_hearing_aid
extract_sos=~/project/UKBB_GWAS_dev/workflow/Region_Extraction_4.ipynb
extract_sbatch=../output/$(date +"%Y-%m-%d")_f3393_hearing_aid_imp-region.sbatch
region_file=$UKBB_PATH/results/LD_clumping/f3393_hearing_aid/*.clumped_region
geno_path=$UKBB_PATH/results/UKBB_bgenfilepath.txt
sumstats_path=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f3393_hearing_aid/*.snp_stats.gz

extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $phenoFile
    --geno-path $geno_path
    --bgen-sample-path $sampleFile
    --sumstats-path $sumstats_path
    --format-config-path $formatFile_fastgwa
    --unrelated-samples $unrelated_samples
    --job-size $extract_job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-10-28_f3393_hearing_aid_imp-region.sbatch
INFO: Workflow farnam (ID=55af58288aa5cc61) is executed successfully with 1 completed step.



### Exome data

In [15]:
### Create the bedfilepath file
cd $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020
for file in ukb23155_c{1..22}_b0_v1.bed;
    do echo `pwd`/$file;
done | awk '{print NR " " $0}' > UKBB_exome_plinkfilepath.txt

In [15]:
tpl_file=/home/dc2325/project/UKBB_GWAS_dev/farnam.yml
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/phenotypes_exome_data/010421_UKBB_Hearing_aid_f3393_128254ind_exomes
extract_dir=$UKBB_PATH/results/region_extraction/f3393_hearing_aid_exomes
# Original region extraction pipeline
extract_sos=~/project/bioworkflows/admin/Region_Extraction.ipynb 
#extract_sos=~/project/UKBB_GWAS_dev/workflow/Region_Extraction_4.ipynb
extract_sbatch=/home/dc2325/project/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exome-regionextrac.sbatch
region_file=$UKBB_PATH/results/LD_clumping/f3393_hearing_aid_exome/*.clumped_region
geno_path=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBB_exome_plinkfilepath.txt
sumstats_path=$UKBB_PATH/results/REGENIE_results/results_exome_data/f3393_hearing_aid_exomes/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.snp_stats.gz
unrelated_samples=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
extract_job_size=10
sampleFile=
formatFile_regenie=
#container_lmm=$UKBB_PATH/lmm.sif

extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $phenoFile
    --geno-path $geno_path
    --sumstats-path $sumstats_path
    --format-config-path $formatFile_regenie
    --unrelated-samples $unrelated_samples
    --job-size $extract_job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

ERROR: Failed to locate /home/dmc2245/project/bioworkflows/admin/Get_Job_Script.ipynb.sos



In [13]:
### Create the bedfilepath file
cd $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/
for file in ukb23156_c{1..22}.merged.filtered.bed;
    do echo `pwd`/$file;
done | awk '{print NR " " $0}' > $clumping_dir/092321_UKBB_qc_exome_geno_path.txt
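
The geno path file written by the cell above pairs each chromosome number with the path to its PLINK file, one pair per line. The sketch below builds a small file in the same two-column layout from toy files and checks that every listed path exists. The file names and the three-chromosome range are placeholders. It writes the chromosome number explicitly instead of relying on the line number.

```shell
tmpdir=$(mktemp -d)

# Toy per-chromosome files standing in for the real .bed files
for chrom in 1 2 3; do
    : > "$tmpdir/toy_c${chrom}.bed"
done

# One line per chromosome: "<chromosome> <path to bed file>"
geno_path=$tmpdir/toy_geno_path.txt
for chrom in 1 2 3; do
    echo "$chrom $tmpdir/toy_c${chrom}.bed"
done > "$geno_path"

# Check that every listed file exists before using the list downstream
n_missing=0
while read -r chrom path; do
    [ -f "$path" ] || { echo "chr$chrom: missing $path" >&2; n_missing=$((n_missing + 1)); }
done < "$geno_path"

cat "$geno_path"
echo "missing files: $n_missing"
```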




## 09-23-21 Extraction of regions f.3393 200K

In [4]:
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_aid_f3393_expandedwhite_6436cases_96601ctrl_PC1_2.tsv
extract_dir=$UKBB_PATH/results/region_extraction/092321_f3393_hearing_aid_200K
extract_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_regionextrac_$(date +"%Y-%m-%d").sbatch
region_file=$clumping_dir/092321_f3393/*.clumped_region
geno_path=$clumping_dir/092321_UKBB_qc_exome_geno_path.txt
sumstats_path=$lmm_exome_dir_regenie/090921_f3393_hearing_aid_200K/*.snp_stats.gz
unrelated_samples=$clumping_dir/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
extract_job_size=5
formatFile_regenie=
## No format file is needed in this case (the regenie output is already formatted for region extraction)
extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $phenoFile
    --geno-path $geno_path
    --sumstats-path $sumstats_path
    --format-config-path $formatFile_regenie
    --unrelated-samples $unrelated_samples
    --job-size $extract_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f3393_hearing_aid_200K_regionextrac_2021-09-24.sbatch
INFO: Workflow csg (ID=w2063cc588ca17e32) is executed successfully with 1 completed step.



## Fine mapping

In [None]:
finemap_dir=$UKBB_PATH/results/fine_mapping/f3393_hearing_aid
finemap_sos=~/project/UKBB_GWAS_dev/workflow/SuSiE_test.ipynb
finemap_sbatch=../output/$(date +"%Y-%m-%d")_f3393_hearing_aid_imp-finemap.sbatch
region_dir=$UKBB_PATH/results/region_extraction/f3393_hearing_aid
region_file=$UKBB_PATH/results/LD_clumping/f3393_hearing_aid/200828_UKBB_Hearing_aid_f3393_hearing_aid_cat.fastGWA.snp_stats.clumped_region
sumstats_path=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f3393_hearing_aid/*.snp_stats.gz
N=230411
container_lmm=/home/dc2325/scratch60/lmm_v_1_4.sif

finemap_args="""default
    --cwd $finemap_dir
    --region_dir $region_dir
    --region_file $region_file
    --sumstats_path $sumstats_path
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $finemap_sos \
    --to-script $finemap_sbatch \
    --args "$finemap_args"

## Post-GWAS annotation: SNP-to-gene

### Annotate exome data (old pheno)

In [3]:
lmm_dir=$UKBB_PATH/results/REGENIE_results/results_exome_data/f3393_hearing_aid_exomes
postgwa_sbatch=../output/$(date +"%Y-%m-%d")_f3393_postgwa.sbatch
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f3393_hearing_aid_exomes/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.snp_stats.gz
tpl_file=../farnam.yml
postgwa_sos=~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb
job_size=1
hg=38
postgwa_args="""default
    --cwd $lmm_dir
    --sumstatsFile $sumstatsFile
    --hg $hg
    --job_size $job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $postgwa_sos \
    --to-script $postgwa_sbatch \
    --args "$postgwa_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-01-28_f3393_postgwa.sbatch
INFO: Workflow farnam (ID=w6b10aab5fa809918) is executed successfully with 1 completed step.


### Merge bim files from 200K exome QC'ed data

In [5]:
cwd=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge
bimfiles=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bim`
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
build='hg38'
name_prefix='091321'

sos run ~/project/bioworkflows/variant-annotation/annovar.ipynb bim_from_plink \
    --cwd $cwd \
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --build $build \
    --name_prefix $name_prefix \
    --job_size $job_size \
    --container_annovar $container_annovar

INFO: Running bim_from_plink: Merge all the *.bim files into a single file. Needs to be run once per type of data (e.g. genotype, exome)
INFO: bim_from_plink is completed.
INFO: bim_from_plink output:   /home/dmc2245/UKBiobank/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
INFO: Workflow bim_from_plink (ID=w14e9db79c78ecf77) is executed successfully with 1 completed step.
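
Conceptually, merging the per-chromosome `.bim` files amounts to concatenating them in chromosome order. The sketch below illustrates that on toy data and adds a duplicate-ID check. It assumes the standard six-column PLINK `.bim` layout and is not the `bim_from_plink` implementation itself. All file names are hypothetical.

```shell
tmpdir=$(mktemp -d)

# Toy .bim files: chromosome, variant ID, cM, position, allele 1, allele 2
printf '1\t1:100:A:G\t0\t100\tG\tA\n1\t1:200:C:T\t0\t200\tT\tC\n' > "$tmpdir/toy_c1.bim"
printf '2\t2:150:G:A\t0\t150\tA\tG\n' > "$tmpdir/toy_c2.bim"

# Concatenate in chromosome order into a single .bim file
merged=$tmpdir/toy_chr1_chr2.bim
for chrom in 1 2; do
    cat "$tmpdir/toy_c${chrom}.bim"
done > "$merged"

n_variants=$(wc -l < "$merged" | tr -d ' ')
# Variant IDs that occur more than once would be ambiguous during annotation
n_dup=$(cut -f2 "$merged" | sort | uniq -d | wc -l | tr -d ' ')
echo "merged variants: $n_variants, duplicated IDs: $n_dup"
```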



### Annotate QC'ed exome data (hg38)

In [5]:
annovar_dir=$UKBB_PATH/results/$(date +"%Y_%m_%d")_hg38_exome
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
annovar_sbatch=~/project/UKBB_GWAS_dev/output/annovar_chr1_22_exome_hg38_$(date +"%Y-%m-%d").sbatch
annovar_sos=~/project/bioworkflows/variant-annotation/annovar.ipynb
job_size=1
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
container_annovar=~/containers/gatk4-annovar.sif
name_prefix=ukb23155_chr1_chr22
walltime='60h'
# Use the bim_merge workflow first and then the annovar workflow
annovar_args="""annovar
    --cwd $annovar_dir \
    --bim_name $bim_name \
    --humandb $humandb \
    --xref_path $xref_path \
    --job_size $job_size \
    --name_prefix $name_prefix \
    --walltime $walltime \
    --container_annovar $container_annovar
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $annovar_sos \
    --to-script $annovar_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/annovar_chr1_22_exome_hg38_2021-10-12.sbatch
INFO: Workflow csg (ID=wb3e6601cf57ec990) is executed successfully with 1 completed step.



### Annotate exome data (old pheno) with bfile

In [13]:
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
ukbb=$UKBB_PATH
USER_PATH=/home/dc2325/project
cwd=/home/dc2325/scratch60/output/bfile_annovar
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f3393_hearing_aid_exomes_bfile/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.snp_stats.gz
hg=38
job_size=1
container_annovar=/gpfs/gibbs/pi/dewan/data/UKBiobank/annovar.sif
bimfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=/home/dc2325/project/results/exome_bim_merge/ukb23155_chr1_chr22.bim
humandb=/gpfs/ysm/datasets/db/annovar/humandb

sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb annovar \
    --cwd $cwd \
    --sumstatsFile $sumstatsFile \
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb \
    --ukbb $ukbb \
    --container_annovar $container_annovar \
    -s build

INFO: Running annovar_1: Get the list of significantly associated SNPs
INFO: Step annovar_1 (index=0) is ignored with signature constructed
INFO: annovar_1 output:   /home/dc2325/scratch60/output/bfile_annovar/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.snp_annotate
INFO: Running annovar_2: Get chr, start, end, ref_allele, alt_allele format
INFO: Step annovar_2 (index=0) is ignored with signature constructed
INFO: annovar_2 output:   /home/dc2325/scratch60/output/bfile_annovar/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.avinput
INFO: Running annovar_3: Annotate variants file using ANNOVAR
INFO: annovar_3 is completed.
INFO: annovar_3 output:   /home/dc2325/scratch60/output/bfile_annovar/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.hg38_multianno.csv
INFO: Workflow annovar (I
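
The `annovar_2` step in the log above produces the five-column ANNOVAR input layout (chr, start, end, ref_allele, alt_allele). The sketch below shows that reshaping for single-nucleotide variants in a toy `.bim` file. It assumes that column 6 of the `.bim` holds the reference allele and column 5 the alternate allele, which depends on how the PLINK files were generated. Indels are skipped for simplicity. This is not the workflow's own code.

```shell
tmpdir=$(mktemp -d)

# Toy .bim file: chromosome, variant ID, cM, position, allele 1, allele 2
printf '1\t1:100:A:G\t0\t100\tG\tA\n1\t1:200:CT:C\t0\t200\tC\tCT\n2\t2:150:G:A\t0\t150\tA\tG\n' \
    > "$tmpdir/toy.bim"

# Keep SNVs only; for an SNV, start and end are both the variant position.
# Assumption: column 6 is the reference allele, column 5 the alternate allele.
awk 'BEGIN { OFS = "\t" }
     length($5) == 1 && length($6) == 1 { print $1, $4, $4, $6, $5 }' \
    "$tmpdir/toy.bim" > "$tmpdir/toy.avinput"

n_snvs=$(wc -l < "$tmpdir/toy.avinput" | tr -d ' ')
first_line=$(head -n 1 "$tmpdir/toy.avinput")
cat "$tmpdir/toy.avinput"
```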

### Annotate exome data (new phenotype) with bfile

In [None]:
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
ukbb=$UKBB_PATH
USER_PATH=/home/dc2325/project
cwd=/home/dc2325/scratch60/output/200k_new_pheno_annovar
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_hearing_difficulty_exomes200K_noqc_ctrl_na/062421_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_144952ind_f2247_ctrl_na.regenie.snp_stats.gz
hg=38
job_size=1
container_annovar=$UKBB_PATH/annovar.sif
bimfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=/home/dc2325/project/results/exome_bim_merge/ukb23155_chr1_chr22.bim
humandb=/gpfs/ysm/datasets/db/annovar/humandb

sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb annovar \
    --cwd $cwd \
    --sumstatsFile $sumstatsFile \
    --bim_name $bim_name \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb \
    --ukbb $ukbb \
    --bimfiles $bimfiles \
    --container_annovar $container_annovar

## Annotation using ANNOVAR of 200K exomes, 50K exomes and BGEN imputed variants

### 200K exomes

In [3]:
annovar_dir=$UKBB_PATH/results/annovar_exome
bedfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_chr1_chr22_exomedata.bim
annovar_sbatch=../output/$(date +"%Y-%m-%d")_annovar_chr1_22_exomes_postgwa.sbatch
tpl_file=../farnam.yml
annovar_sos=~/project/UKBB_GWAS_dev/workflow/QC_Exome_UKBB.ipynb
job_size=1
humandb=/gpfs/ysm/datasets/db/annovar/humandb
ukbb=/gpfs/gibbs/pi/dewan/data/UKBiobank
container_annovar=$UKBB_PATH/annovar.sif 
name_prefix=ukb23155_chr1_chr22

# Use the bim_merge workflow first and then the annovar workflow
annovar_args="""annovar
    --cwd $annovar_dir \
    --bedfiles $bedfiles\
    --bimfiles $bimfiles \
    --bim_name $bim_name \
    --humandb $humandb \
    --ukbb $ukbb \
    --job_size $job_size \
    --name_prefix $name_prefix \
    --container_annovar $container_annovar
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $annovar_sos \
    --to-script $annovar_sbatch \
    --args "$annovar_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-07_annovar_chr1_22_exomes_postgwa.sbatch
INFO: Workflow farnam (ID=w696e79ce57c148dd) is executed successfully with 1 completed step.



### 50K exomes

In [3]:
# First run using only pure controls for f3393 
annovar_dir=$UKBB_PATH/results/annovar_exome
bfiles=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
bim_name=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bim
annovar_sbatch=~/project/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_annovar_chr1_22_50Kexomes_postgwa.sbatch
annovar_sos=~/project/bioworkflows/variant-annotation/annovar.ipynb
job_size=1
humandb=/gpfs/ysm/datasets/db/annovar/humandb
ukbb=/gpfs/gibbs/pi/dewan/data/UKBiobank
container_annovar=$UKBB_PATH/annovar.sif 
name_prefix=ukb23155_chr1_chr22_50Kexomes

# Use the bim_merge workflow first and then the annovar workflow
annovar_args="""annovar
    --cwd $annovar_dir \
    --bim_name $bim_name \
    --humandb $humandb \
    --ukbb $ukbb \
    --job_size $job_size \
    --name_prefix $name_prefix \
    --container_annovar $container_annovar
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $annovar_sos \
    --to-script $annovar_sbatch \
    --args "$annovar_args"


INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-30_annovar_chr1_22_50Kexomes_postgwa.sbatch
INFO: Workflow farnam (ID=w29025279f82f6eb9) is executed successfully with 1 completed step.



### BGEN imputed data

# 2. Hearing difficulty/problems f.2247

## FastGWA job white British

In [None]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_hearing_difficulty
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2247_imp-fastgwa.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_Hearing_difficulty_f2247
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_Hearing_difficulty_f2247
phenoCol=hearing_diff_new
covarCol=sex
qCovarCol=age_final_diff

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

## FastGWA all white

In [3]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_hearing_difficulty_expandedwhite
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2247_expanded_white_imp-fastgwa.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/120120_UKBB_Hearing_difficulty_f2247_expandedwhite
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/120120_UKBB_Hearing_difficulty_f2247_expandedwhite
phenoCol=hearing_diff_new
covarCol=sex
qCovarCol=age_final_diff

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-12-02_f2247_expanded_white_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=9da5aed9605f3f9f) is executed successfully with 1 completed step.


## FastGWA exome data

In [2]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_exome_data/f2247_hearing_difficulty_exomes
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2247_exomes-fastgwa.sbatch
bfile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
sampleFile=
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/phenotypes_exome_data/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/phenotypes_exome_data/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes
formatFile_fastgwa=~/project/UKBB_GWAS_dev/data/fastGWA_template.yml
phenoCol="hearing_diff_new"
covarCol=sex
qCovarCol=age_final_diff
bgenMinMAF=0.001

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --genoFile $genoFile 
    --phenoFile $phenoFile
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-01-12_f2247_exomes-fastgwa.sbatch
INFO: Workflow farnam (ID=372a90dd555a65fb) is executed successfully with 1 completed step.



## Regenie exome data

In [5]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_hearing_difficulty_exomes_bfile
lmm_sbatch_regenie=../output/$(date +"%Y-%m-%d")_f2247_hearing_difficulty_exome_bfile-regenie.sbatch
phenoFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes
covarFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes
phenoCol=hearing_diff_new
covarCol=sex
qCovarCol=age_final_diff
#Use original bed files from the UKBB exome data
#bfile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_prefix $lowmem_prefix
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-12_f2247_hearing_difficulty_exome_bfile-regenie.sbatch
INFO: Workflow farnam (ID=w6ce6b23ca9b23cce) is executed successfully with 1 completed step.
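
Several values used in `lmm_args` (for example `$genoFile`, `$sampleFile` and `$bsize`) are inherited from earlier cells. If one of them is empty or unset, its flag is simply followed by the next flag in the string. The sketch below shows one way to report such variables before building the argument string. `check_vars` is a hypothetical helper, not part of either workflow repository, and whether an empty value is acceptable depends on the workflow (`sampleFile` is deliberately left empty in the FastGWA cell).

```shell
#!/usr/bin/env bash
# check_vars is a hypothetical helper: it reports variables that are unset or
# empty and returns non-zero if it finds any.
check_vars() {
    local missing=0 name
    for name in "$@"; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name.
        if [ -z "${!name:-}" ]; then
            echo "empty or unset: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Illustrative values only.
phenoCol=hearing_diff_new
covarCol=sex
sampleFile=

if check_vars phenoCol covarCol; then
    echo "phenoCol and covarCol are set"
fi
if ! check_vars sampleFile; then
    echo "sampleFile is empty; confirm this is intended before building lmm_args"
fi
```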



## 08-16-21 Regenie exome qc'ed data

In [7]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_f2247_hearing_difficulty
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_200Kexomes-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_200Kexomes-regenie_2021-08-17.sbatch
INFO: Workflow csg (ID=w2417b87a07888b94) is ignored with 1 ignored step.
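
The `lmm_args="""..."""` idiom used in these cells is plain bash: `"""` is an empty string `""` followed by an opening `"`, so the result is an ordinary multi-line double-quoted string whose variables are expanded at assignment time. A small sketch with illustrative values (`demo_args` is a hypothetical name):

```shell
#!/usr/bin/env bash
# Illustrative values only.
phenoCol=f2247
qCovarCol="age PC1 PC2"

# """ is "" followed by ", so this is a normal double-quoted string;
# $phenoCol and $qCovarCol are substituted right here.
demo_args="""regenie
    --phenoCol $phenoCol
    --qCovarCol $qCovarCol
"""

echo "$demo_args"

# Changing a variable afterwards does not alter the string that was already built.
phenoCol=f3393
matches=$(echo "$demo_args" | grep -c -- "--phenoCol f2247")
echo "lines still mentioning f2247: $matches"
```

In practice this means any variable overridden for a run has to be set before the `lmm_args` assignment in the same cell.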



## 08-17-21 Annotation of results from regenie qc 200k exomes

In [17]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_f2247_hearing_difficulty/annotation_p5e-08
sumstatsFile=$lmm_exome_dir_regenie/081621_f2247_hearing_difficulty/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl_f2247.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=/mnt/mfs/statgen/UKBiobank/results/ukb23155_200Kexomes_annovar/exome_bim_merge/ukb23155_chr1_chr22.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --rsid $rsid \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_annovar_2021-08-18.sbatch
INFO: Workflow csg (ID=wa4059ac9dcf33d2a) is ignored with 1 ignored step.



## 09-07-21 Regenie exome data 50K vs 150K (QC'ed 200K exomes, genotype file un-QC'ed)


### 50K

In [9]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_f2247_hearing_difficulty_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_35147ind_50K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_35147ind_50K.tsv
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_50K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=w9737ee7a97ef3454) is executed successfully with 1 completed step.



### 150K 

In [10]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_f2247_hearing_difficulty_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_109172ind_150K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_109172ind_150K.tsv
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_150K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=w224638864838fe9b) is executed successfully with 1 completed step.



## 09-08-21 Analysis with exome QC data and QC genotype array with new database

In [18]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_200K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl_PC1_2.tsv
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## After running the annotation you can create the annotated Manhattan plot
anno_file=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/091321_annotation/*.formatted.csv
label_annotate=Gene

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --anno_file $anno_file
    --label_annotate $label_annotate
    --annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg (index=0) is ignored due to saved signature
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_200K-regenie_2021-09-16.sbatch
INFO: Workflow csg (ID=wa53d871b44b89ca5) is ignored with 1 ignored step.
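
`anno_file` in the cell above is assigned a path containing `*`. Bash does not expand wildcards on the right-hand side of an assignment, so the variable holds the literal pattern and the match is only resolved later, when the path is used unquoted. The sketch below shows one way to confirm that such a pattern resolves to exactly one file. It assumes a hypothetical temporary directory and file name rather than the real annotation folder.

```shell
#!/usr/bin/env bash
# demo_dir and example.formatted.csv are hypothetical stand-ins for the
# annotation output folder and its single formatted file.
demo_dir=$(mktemp -d)
touch "$demo_dir/example.formatted.csv"

# Assignment keeps the wildcard as a literal pattern.
anno_file=$demo_dir/*.formatted.csv
echo "stored pattern: $anno_file"

# Expand the pattern into an array; nullglob makes a non-match yield zero entries.
shopt -s nullglob
matches=( $anno_file )
shopt -u nullglob

if [ "${#matches[@]}" -eq 1 ]; then
    echo "resolved to: ${matches[0]}"
else
    echo "expected exactly one match, found ${#matches[@]}" >&2
fi

rm -rf "$demo_dir"
```

A pattern that matches several files would hand more than one path to `--anno_file`, so checking the count first can save a failed cluster job.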



### 09-08-21 Annotation

In [22]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/091321_annotation
sumstatsFile=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl_PC1_2_f2247.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid  \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_annovar_2021-09-14.sbatch
INFO: Workflow csg (ID=w5996caf461860ba3) is executed successfully with 1 completed step.



### 09-14-21 50K f.2247

In [16]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f2247_hearing_difficulty_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_34596ind_50K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_34596ind_50K.tsv
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
anno_file=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/091321_annotation/*.formatted.csv
label_annotate=Gene
snp_list=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/*.top_snps.tsv

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --anno_file $anno_file
    --annotate
    --top_snps
    --snp_list $snp_list
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_50K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=wabf8c8163e2d2b9c) is executed successfully with 1 completed step.
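
The sbatch file names embed `$(date +"%Y-%m-%d")`, which is evaluated when the cell runs. That is why the script generated under the 09-14-21 heading carries the stamp 2021-09-20. If a consistent stamp across several cells is preferred, one option is to evaluate the date once and reuse it. A minimal sketch, where `run_date` and `demo_out` are hypothetical names:

```shell
#!/usr/bin/env bash
# run_date and demo_out are hypothetical names used only for this sketch.
demo_out=/tmp/demo_output

# Evaluate the date once so every script built in this session shares one stamp.
run_date=$(date +"%Y-%m-%d")

lmm_sbatch_50K=$demo_out/f2247_hearing_difficulty_50K-regenie_${run_date}.sbatch
lmm_sbatch_150K=$demo_out/f2247_hearing_difficulty_150K-regenie_${run_date}.sbatch

echo "$lmm_sbatch_50K"
echo "$lmm_sbatch_150K"
```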



### 09-14-21 150K f.2247

In [13]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f2247_hearing_difficulty_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_107507ind_150K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_107507ind_150K.tsv
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
label_annotate='Gene'
anno_file=$lmm_exome_dir_regenie/091421_f2247_hearing_difficulty_150K/091721_annotation/*.formatted.csv

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --annotate
    --anno_file $anno_file
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_150K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=w9884a347cc8c5773) is executed successfully with 1 completed step.



## 09-17-21 annotation f.2247 150K

In [6]:
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f2247_hearing_difficulty_150K/091721_annotation
sumstatsFile=$lmm_exome_dir_regenie/091421_f2247_hearing_difficulty_150K/*.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_150K_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid  \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_150K_annovar_2021-09-20.sbatch
INFO: Workflow csg (ID=w2a078a4fafcb3e06) is executed successfully with 1 completed step.



### Hudson plot

In [12]:
### Hudson plot 
hudson_sos=~/project/bioworkflows/GWAS/Hudson_Plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots/exome_data/$(date +"%Y%m%d")_f2247_150kvs200k
hudson_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2247_150k_200k_hudson_$(date +"%Y-%m-%d").sbatch
sumstats_1=$UKBB_PATH/results/REGENIE_results/results_exome_data/090921_f2247_hearing_difficulty_200K/*.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/091421_f2247_hearing_difficulty_150K/*.snp_stats.gz
toptitle="H-diff mega-analysis"
bottomtitle="H-diff discovery"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
highlight_snp=
job_size=1 
phenocol1='H-diff 200K'
phenocol2='H-diff 150K'
container_lmm=~/containers/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --highlight_snp $highlight_snp
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --container_lmm $container_lmm
"""
sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_150k_200k_hudson_2021-10-12.sbatch
INFO: Workflow csg (ID=wce6d667563bc6de1) is executed successfully with 1 completed step.
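
Values such as `toptitle="H-diff mega-analysis"` contain a space, and interpolating them into `hudson_args` drops the quotes around them. The sketch below only illustrates how a shell would word-split such a string if it is later expanded unquoted. How the generated sbatch script and the Hudson workflow actually treat the value is not shown in this notebook, so `demo_args` and the splitting step are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Values copied from the cell above; demo_args is a hypothetical name.
toptitle="H-diff mega-analysis"
job_size=1

# Interpolation inside the outer double quotes keeps the text but not the inner quoting.
demo_args="--toptitle $toptitle --job_size $job_size"

# Word-split the string the way an unquoted shell expansion would.
read -r -a words <<< "$demo_args"
echo "word count: ${#words[@]}"
printf '  [%s]\n' "${words[@]}"
```

The title arrives as two separate words here, so it is worth inspecting the generated sbatch script to confirm the titles come through as intended.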



## 01-05-22 Regenie with imputed data for 200K that have both exome and imputed

In [12]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_imp_dir_regenie/010522_f2247_hearing_diff_200K_imputed
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_diff_200K_imputed-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl_PC1_2.tsv
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed bgen files
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
bgenMinINFO=0.3
bgenMinMAC=4
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

# Pass --annotate to add labels to the plot; otherwise pass --no-annotate

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --no-annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args" 

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_diff_200K_imputed-regenie_2022-01-06.sbatch
INFO: Workflow csg (ID=w10ba44d3a705245e) is executed successfully with 1 completed step.



## Regenie imputed data: expanded white control NA

This run uses the new phenotype information together with the imputed data.

In [5]:
lmm_dir_regenie=$lmm_imp_dir_regenie/f2247_hearing_difficulty_impdata_newpheno
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_hearing_difficulty_impdata-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_316411ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_316411ind
phenoCol=f2247_ctrl_na
covarCol=sex
qCovarCol=age_final_diff_new
genoFile=`echo $UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-13_f2247_hearing_difficulty_impdata-regenie.sbatch
INFO: Workflow farnam (ID=w80da2f9338cdcf29) is executed successfully with 1 completed step.



## Regenie: 50K exomes replication set

### f.2247 & pure controls

In [2]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_hearing_difficulty_exomes50K_pure_ctrl
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_hearing_difficulty_exomes50K_pure_ctrl-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_pure_ctrl_184909ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_pure_ctrl_184909ind
phenoCol=f2247_ctrl_pure
covarCol=sex
qCovarCol=age_final_diff_new
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-21_f2247_hearing_difficulty_exomes50K_pure_ctrl-regenie.sbatch
INFO: Workflow farnam (ID=wcaa8ffe2883ba03a) is executed successfully with 1 completed step.



In [None]:
# First run using controls na for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f3393_hearing_aid_exomes200K_noqc_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f3393_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/062421_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_104402ind
covarFile=$hearing_pheno_path/062421_UKBB_Hearing_aid_f3393_expandedwhite_z974included_ctrl_na_104402ind
phenoCol=f3393_ctrl_na
covarCol=sex
qCovarCol=age_final_aid
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
hwe_filter=5e-08

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

### f.2247 Controls with NA for f.3393

In [6]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_hearing_difficulty_exomes50K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_hearing_difficulty_exomes50K_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_316411ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_316411ind
phenoCol=f2247_ctrl_na
covarCol=sex
qCovarCol=age_final_diff_new
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-19_f2247_hearing_difficulty_exomes50K_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=wcb0151f3731055ae) is executed successfully with 1 completed step.
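
Before submitting a generated job script, it can help to confirm that the file exists and is not empty. The sketch below is an assumption about how one might do this and is not part of the original workflow: `check_sbatch` is a hypothetical helper, and it only prints the `sbatch` command rather than running it.

```bash
# Hypothetical helper: confirm a generated job script exists and is non-empty,
# then print (not run) the Slurm submission command.
check_sbatch() {
    script=$1
    if [ -s "$script" ]; then
        echo "sbatch $script"
    else
        echo "missing or empty: $script" >&2
        return 1
    fi
}

# Illustration with a throwaway file instead of a real job script
demo_script=$(mktemp)
echo '#!/bin/bash' > "$demo_script"
check_sbatch "$demo_script"
```

In practice the argument would be the path reported on the `output:` line, for example `$lmm_sbatch_regenie`.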



## Regenie in exome data (original Plink files UKBB unqc'ed) using modified phenotype file with controls_na for f.3393

### f.2247 Controls with NA for f.3393

In [3]:
# First run using controls na for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_hearing_difficulty_exomes200K_noqc_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/062421_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_144952ind
covarFile=$hearing_pheno_path/062421_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_144952ind
phenoCol=f2247_ctrl_na
covarCol=sex
qCovarCol=age_final_diff_new
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
hwe_filter=5e-08

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-24_f2247_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=w82514504865b0a4f) is executed successfully with 1 completed step.
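
The `genoFile` variable in this cell is built with bash brace expansion, `{1..22}`, which yields one space-separated path per autosome. As a sketch for shells that lack brace expansion, the same list can be assembled with a loop. `exome_dir` is a hypothetical placeholder here, not the real data location.

```bash
# Hypothetical placeholder directory, for illustration only
exome_dir=/path/to/exome_data

# Build a space-separated list of per-chromosome .bed paths (chr1..chr22)
genoFile=""
chrom=1
while [ "$chrom" -le 22 ]; do
    genoFile="${genoFile:+$genoFile }${exome_dir}/ukb23155_c${chrom}_b0_v1.bed"
    chrom=$((chrom + 1))
done

# Count the entries and pick out the first and last for a quick sanity check
set -- $genoFile
n_files=$#
first=${genoFile%% *}
last=${genoFile##* }
echo "$n_files files, from $first to $last"
```

Counting the entries this way is a cheap check that no chromosome was dropped before the list is handed to `--genoFile`.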



## Regenie in exome data after VCF-QC 200K exomes

### f.2247 Controls with NA for f.3393

In [10]:
# Run using all controls for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_hearing_difficulty_exomes200K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_exomes200K_ctrl_na-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_316411ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_difficulty_f2247_expandedwhite_z974included_ctrl_na_316411ind
phenoCol=f2247_ctrl_na
covarCol=sex
qCovarCol=age_final_diff_new
genoFile=`echo /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c{1..22}.merged.filtered.bed`
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_exomes200K_ctrl_na-regenie_2021-05-18.sbatch
INFO: Workflow csg (ID=wdf2119aff8d7a187) is executed successfully with 1 completed step.



## LD clumping job

### Imputed data

In [None]:
clumping_dir=$UKBB_PATH/results/LD_clumping/f2247_hearing_difficulty
clumping_sos=~/project/bioworkflows/admin/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f2247_hearing_difficulty_ldclumping.sbatch
sumstatsFiles=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_hearing_difficulty/200828_UKBB_Hearing_difficulty_f2247_hearing_diff_new.fastGWA.snp_stats.gz

clumping_args="""default 
    --cwd $clumping_dir 
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --bgenFile $bgenFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

### Exome data

In [4]:
tpl_file=../farnam.yml
clumping_dir=$UKBB_PATH/results/LD_clumping/f2247_hearing_difficulty_exome
clumping_sos=~/project/bioworkflows/admin/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f2247_hearing_diff_exome_ldclumping.sbatch
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
sumstatsFiles=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_hearing_difficulty_exomes/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes_hearing_diff_new.regenie.snp_stats.gz
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
sampleFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
unrelated_samples=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
bfile_ref=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631_chr1_22_exomedata.1200.ref_geno.bed
container_lmm=$UKBB_PATH/lmm.sif
ld_sample_size=1200
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

# Select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $clumping_dir
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-07_f2247_hearing_diff_exome_ldclumping.sbatch
INFO: Workflow farnam (ID=w923c2b21e51a953e) is executed successfully with 1 completed step.
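
The `clump_*` variables share their names with PLINK 1.9's `--clump-*` options. Assuming the clumping workflow forwards them to PLINK more or less directly (an assumption; `LD_Clumping.ipynb` has the actual call), the settings in this cell would correspond to a command like the one printed below. The file names `ref_geno`, `sumstats.txt` and `clumped_out` are placeholders.

```bash
clump_field=P      # name of the p-value column in the summary statistics
clump_p1=5e-08     # index variants must reach genome-wide significance
clump_p2=1         # any variant may join a clump regardless of its p-value
clump_r2=0.2       # minimum LD (r^2) with the index variant
clump_kb=2000      # maximum distance from the index variant, in kb
clump_annotate=BP  # extra column to carry into the clump report

# Build and print the command only; nothing is executed here.
plink_cmd="plink --bfile ref_geno --clump sumstats.txt \
--clump-field $clump_field --clump-p1 $clump_p1 --clump-p2 $clump_p2 \
--clump-r2 $clump_r2 --clump-kb $clump_kb --clump-annotate $clump_annotate \
--out clumped_out"
echo "$plink_cmd"
```

With `clump_p2=1`, clump membership is decided by LD and distance alone, which gives wide clumps suited to defining regions for later extraction.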



## 09-22-21 LD clumping

In [9]:
cwd=$clumping_dir/092321_f2247
clumping_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2247_exome_ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/*snp_stats.gz
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Individuals from the subset of white individuals with exome data 
sampleFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam
unrelated_samples=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
bfile_ref=$UKBB_PATH/results/LD_clumping/ukb23156_c1_22.merged.filtered.2000.ref_geno.bed
name_bref='ukb23156_c1_22.merged.filtered'
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# Select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $cwd
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --name_bref $name_bref
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_exome_ldclumping_2021-09-23.sbatch
INFO: Workflow csg (ID=wa91d96f2886a85d4) is executed successfully with 1 completed step.



## Region extraction

## 09-23-21 Extraction of regions f.2247 200K

In [7]:
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_difficulty_f2247_expandedwhite_45502cases_96601ctrl_PC1_2.tsv
extract_dir=$UKBB_PATH/results/region_extraction/092321_f2247_hearing_difficulty_200K
extract_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2247_hearing_difficulty_200K_regionextrac_$(date +"%Y-%m-%d").sbatch
region_file=$clumping_dir/092321_f2247/*.clumped_region
geno_path=$clumping_dir/092321_UKBB_qc_exome_geno_path.txt
sumstats_path=$lmm_exome_dir_regenie/090921_f2247_hearing_difficulty_200K/*.snp_stats.gz
unrelated_samples=$clumping_dir/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
extract_job_size=5
formatFile_regenie=
## No format file is needed in this case (the regenie output is already formatted for region extraction)
extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $phenoFile
    --geno-path $geno_path
    --sumstats-path $sumstats_path
    --unrelated-samples $unrelated_samples
    --job-size $extract_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_hearing_difficulty_200K_regionextrac_2021-09-24.sbatch
INFO: Workflow csg (ID=wdc4cd2781d9cfc03) is executed successfully with 1 completed step.
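
`region_file` and `sumstats_path` in this cell contain `*` wildcards. The shell does not expand a wildcard at assignment time, and the pattern stays literal inside a double-quoted string such as `$extract_args`, so it is passed through as text. Whether it is resolved later depends on the generated script, which is not shown here. A small demonstration with throwaway files (all names hypothetical):

```bash
# Throwaway directory with two files that match the pattern
demo_dir=$(mktemp -d)
touch "$demo_dir/a.snp_stats.gz" "$demo_dir/b.snp_stats.gz"

pattern=$demo_dir/*.snp_stats.gz      # stored as a literal pattern

quoted="--sumstats-path $pattern"     # still literal inside double quotes
echo "$quoted"

set -- $pattern                       # unquoted use: expanded to matching files
n_matches=$#
echo "$n_matches files match"
```

If exactly one file is expected, checking the match count like this before building the argument string catches an empty or ambiguous match early.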



## Post-GWAS annotation

In [8]:
lmm_dir=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_hearing_difficulty_exomes
postgwa_sbatch=../output/$(date +"%Y-%m-%d")_f2247_postgwa.sbatch
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_hearing_difficulty_exomes/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes_hearing_diff_new.regenie.snp_stats.gz
tpl_file=../farnam.yml
postgwa_sos=~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb
job_size=1
hg=38
postgwa_args="""default
    --cwd $lmm_dir
    --sumstatsFile $sumstatsFile
    --hg $hg
    --job_size $job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $postgwa_sos \
    --to-script $postgwa_sbatch \
    --args "$postgwa_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-01-28_f2247_postgwa.sbatch
INFO: Workflow farnam (ID=w50d0cee396725518) is executed successfully with 1 completed step.


In [None]:
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
cwd=/home/dc2325/scratch60/output/bfile_annovar
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_hearing_difficulty_exomes_bfile/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes_hearing_diff_new.regenie.snp_stats.gz
hg=38
job_size=1
container_annovar=$UKBB_PATH/annovar.sif
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=/home/dc2325/scratch60/output/ukb23155_chr1_chr22.bim
humandb=/gpfs/ysm/datasets/db/annovar/humandb

sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb annovar \
    --cwd $cwd \
    --sumstatsFile $sumstatsFile \
    --bim_name $bim_name \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb \
    --ukbb $UKBB_PATH \
    --container_annovar $container_annovar \
    -s build

# 3. Hearing difficulty with background noise f.2257

## FastGWA job white British

In [None]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2257_hearing_background_noise
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2257_imp-fastgwa.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_Hearing_background_noise_f2257
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_Hearing_background_noise_f2257
phenoCol=hearing_noise_cat
covarCol=sex
qCovarCol=age_final_noise

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

## FastGWA job all white

In [6]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2257_hearing_background_noise_expandedwhite
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2257_expandedwhite_imp-fastgwa.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/120120_UKBB_Hearing_background_noise_f2257_expandedwhite
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/120120_UKBB_Hearing_background_noise_f2257_expandedwhite
phenoCol=hearing_noise_cat
covarCol=sex
qCovarCol=age_final_noise

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb dewan \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running dewan: Configuration for Yale `pi_dewan` partition cluster
INFO: dewan is completed.
INFO: dewan output:   ../output/2020-12-02_f2257_expandedwhite_imp-fastgwa.sbatch
INFO: Workflow dewan (ID=5e423268c0ec2f3b) is executed successfully with 1 completed step.


## Regenie exome data

In [7]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2257_hearing_noise_exomes_bfile
lmm_sbatch_regenie=../output/$(date +"%Y-%m-%d")_f2257_hearing_noise_exome_bfile-regenie.sbatch
phenoFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_background_noise_f2257_175531ind_exomes
covarFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_Hearing_background_noise_f2257_175531ind_exomes
phenoCol=hearing_noise_cat
covarCol=sex
qCovarCol=age_final_noise
#Use original bed files from the UKBB exome data
#bfile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_prefix $lowmem_prefix
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-12_f2257_hearing_noise_exome_bfile-regenie.sbatch
INFO: Workflow farnam (ID=wd8efa7feebdc6a0e) is executed successfully with 1 completed step.



## 08-16-21 Regenie qc'ed exome data 

In [5]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_f2257_hearing_noise
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_200Kexomes-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_200Kexomes-regenie_2021-08-17.sbatch
INFO: Workflow csg (ID=w06bacff111c41666) is executed successfully with 1 completed step.
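
In this cell `qCovarCol="age PC1 PC2"` holds three column names in one variable. When it is interpolated into `lmm_args`, the names appear as three whitespace-separated tokens after `--qCovarCol`. A minimal sketch with a shortened, hypothetical argument string:

```bash
covarCol=sex
qCovarCol="age PC1 PC2"

# Shortened stand-in for the full lmm_args string
demo_args="regenie
    --covarCol $covarCol
    --qCovarCol $qCovarCol
"

set -- $demo_args     # split on whitespace to count the tokens
n_tokens=$#
echo "$n_tokens tokens: $*"
```

The quotes around the assignment matter: without them the shell would treat `PC1` as a command to run with `qCovarCol=age` set in its environment.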



## 08-17-21 Annotation of results from regenie qc 200k exomes

In [23]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_f2257_hearing_noise_fabiha/annotation_p5e-08
sumstatsFile=$lmm_exome_dir_regenie/081621_f2257_hearing_noise_fabiha/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl_f2257.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=/mnt/mfs/statgen/UKBiobank/results/ukb23155_200Kexomes_annovar/exome_bim_merge/ukb23155_chr1_chr22.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --rsid $rsid \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_annovar_2021-08-18.sbatch
INFO: Workflow csg (ID=wcd10d359c4da1090) is executed successfully with 1 completed step.



## 09-07-21 Regenie exome data 50K vs 150K (QC'ed 200K exomes, genotype file unqc'ed)

### 50K

In [11]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_f2257_hearing_noise_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_39344ind_50K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_39344ind_50K.tsv
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_50K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=we3b92604833f76bd) is executed successfully with 1 completed step.



### 150K

In [12]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_f2257_hearing_noise_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_125393ind_150K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_125393ind_150K.tsv
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_150K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=wa18227600b180ef0) is executed successfully with 1 completed step.



## 09-08-21 Analysis with exome QC data and QC genotype array with new database

In [19]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_200K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl_PC1_2.tsv
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
anno_file=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/091321_annotation/*.formatted.csv
label_annotate=Gene

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --anno_file $anno_file
    --label_annotate $label_annotate
    --annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_200K-regenie_2021-09-16.sbatch
INFO: Workflow csg (ID=wcddb1159730eda0e) is executed successfully with 1 completed step.



#### 09-08-21 Annotation

In [21]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/091321_annotation
sumstatsFile=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl_PC1_2_f2257.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_annovar_2021-09-14.sbatch
INFO: Workflow csg (ID=wa3ca2ac7fee56782) is executed successfully with 1 completed step.



### 09-14-21 50K f.2257

In [17]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f2257_hearing_noise_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_noise_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_38723ind_50K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_38723ind_50K.tsv
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
anno_file=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/091321_annotation/*.formatted.csv
label_annotate=Gene
snp_list=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/*.top_snps.tsv

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --anno_file $anno_file
    --annotate
    --top_snps
    --snp_list $snp_list
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_noise_50K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=w5cfface26c5d32c7) is executed successfully with 1 completed step.
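
The `genoFile` assignment in the cell above relies on bash brace expansion: `{1..22}` is expanded before `echo` runs, so the variable holds one string with 22 space-separated paths. The following is a minimal sketch of that behavior, assuming bash; the `prefix` path is a hypothetical stand-in for the real exome directory, and `$( )` is used in place of the equivalent backticks.

```bash
# Hypothetical prefix standing in for the real PLINK exome directory.
prefix=/tmp/example_plink

# Brace expansion produces 22 words, which echo joins with single spaces.
genoFile=$(echo $prefix/ukb23156_c{1..22}.merged.filtered.bed)

# Count the paths by letting the shell split the string on whitespace.
n_files=$(printf '%s\n' $genoFile | wc -l | tr -d ' ')
echo "genoFile lists $n_files files"
```

Because the value is a single whitespace-separated string, any path containing a space would be split incorrectly; the paths used in this notebook contain none.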



### 09-14-21 150K f.2257

In [12]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f2257_hearing_noise_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_noise_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_123538ind_150K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_123538ind_150K.tsv
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
label_annotate='Gene'
anno_file=$lmm_exome_dir_regenie/091421_f2257_hearing_noise_150K/091721_annotation/*.formatted.csv

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --annotate
    --anno_file $anno_file
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_noise_150K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=w766f6cd59097c1a8) is executed successfully with 1 completed step.



#### 09-17-21 Annotation f.2257 150K

In [7]:
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_f2257_hearing_noise_150K/091721_annotation
sumstatsFile=$lmm_exome_dir_regenie/091421_f2257_hearing_noise_150K/*.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_150K_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid  \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_150K_annovar_2021-09-20.sbatch
INFO: Workflow csg (ID=w1be089a277599563) is executed successfully with 1 completed step.
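
The `sumstatsFile` value in the cell above is a glob pattern (`*.regenie.snp_stats.gz`). Bash does not expand globs in a variable assignment, so the pattern is stored literally and only resolves where the variable is later expanded unquoted. At that point it matches however many files exist. The following is a minimal sketch for checking that such a pattern resolves to exactly one file; the helper name `count_matches` and the temporary directory are hypothetical and are not part of the workflow.

```bash
# Hypothetical helper: count how many existing files a glob pattern matches.
count_matches() {
    local pattern=$1 n=0 f
    # $pattern is left unquoted on purpose so the shell expands the glob.
    for f in $pattern; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Demonstration in a throwaway directory (stand-in for the results folder).
demo_dir=$(mktemp -d)
touch "$demo_dir/example.regenie.snp_stats.gz"

# The assignment keeps the pattern literal; expansion happens inside the helper.
sumstatsFile=$demo_dir/*.regenie.snp_stats.gz
n_sumstats=$(count_matches "$sumstatsFile")
if [ "$n_sumstats" -ne 1 ]; then
    echo "Expected one summary statistics file, found $n_sumstats" >&2
fi
```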



### Hudson plot

In [13]:
### Hudson plot 
hudson_sos=~/project/bioworkflows/GWAS/Hudson_Plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots/exome_data/$(date +"%Y%m%d")_f2257_150kvs200k
hudson_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2257_150k_200k_hudson_$(date +"%Y-%m-%d").sbatch
sumstats_1=$UKBB_PATH/results/REGENIE_results/results_exome_data/090921_f2257_hearing_noise_200K/*.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/091421_f2257_hearing_noise_150K/*.snp_stats.gz
toptitle="H-noise mega-analysis"
bottomtitle="H-noise discovery"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
highlight_snp=
job_size=1 
phenocol1='H-noise 200K'
phenocol2='H-noise 150K'
container_lmm=~/containers/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --highlight_snp $highlight_snp
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --container_lmm $container_lmm
"""
sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_150k_200k_hudson_2021-10-12.sbatch
INFO: Workflow csg (ID=wf69faab7dc0d1fc6) is executed successfully with 1 completed step.
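
In the Hudson cell above, `toptitle`, `bottomtitle`, `phenocol1`, and `phenocol2` contain spaces. Once they are interpolated into `hudson_args`, the quotes from the assignments are gone, so any later unquoted expansion splits each title into separate words. How `Get_Job_Script.ipynb` and the Hudson workflow handle this is not shown here. The sketch below only illustrates the shell behavior, assuming bash, with a hypothetical `count_args` helper.

```bash
# Hypothetical helper: report how many arguments it received.
count_args() { echo "$#"; }

toptitle="H-noise mega-analysis"

# Interpolating into a plain string loses the quoting around the title.
hudson_string="--toptitle $toptitle"
n_unquoted=$(count_args $hudson_string)

# A bash array keeps the title together as a single argument.
hudson_array=(--toptitle "$toptitle")
n_array=$(count_args "${hudson_array[@]}")

echo "plain string: $n_unquoted arguments; array: $n_array arguments"
```

If the generated script shows the titles split across arguments, avoiding spaces in the title values is one possible workaround.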



## 01-05-22 Regenie imputed data for 200K individuals with both exome and imputed

In [13]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/010522_f2257_hearing_noise_200K_imputed
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_200K_imp-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl_PC1_2.tsv
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed bgen files
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
bgenMinINFO=0.3
bgenMinMAC=4
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

# Pass --annotate to add labels to the plot; otherwise pass --no-annotate

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --no-annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args" 

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_200K_imp-regenie_2022-01-06.sbatch
INFO: Workflow csg (ID=w1558079f7409ebeb) is executed successfully with 1 completed step.
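
The INFO output above reports where the job script was written. The following is a minimal sketch for confirming that the file exists and is non-empty before submitting it. The `script_ready` helper and the temporary file are hypothetical; `sbatch` is the standard Slurm submission command and is left commented out.

```bash
# Hypothetical helper: succeed only if the job script exists and is non-empty.
script_ready() {
    [ -s "$1" ]
}

# Stand-in for $lmm_sbatch_regenie; use the path from the INFO output instead.
demo_script=$(mktemp)
printf '#!/bin/bash\necho placeholder job\n' > "$demo_script"

if script_ready "$demo_script"; then
    status=ready
    # sbatch "$demo_script"   # submit on a Slurm cluster
else
    status=missing
fi
echo "job script status: $status"
```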



## Regenie imputed data: expanded white control NA

In [6]:
lmm_dir_regenie=$lmm_imp_dir_regenie/f2257_hearing_noise_impdata_newpheno
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2257_hearing_noise_impdata-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_363603ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_363603ind
phenoCol=f2257_ctrl_na
covarCol=sex
qCovarCol=age_final_noise
genoFile=`echo $UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-13_f2257_hearing_noise_impdata-regenie.sbatch
INFO: Workflow farnam (ID=w072e7fe313d2ff85) is executed successfully with 1 completed step.



## Regenie: 50K exomes replication set

### f.2257 & Pure controls

In [3]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2257_hearing_background_noise_exomes50K_pure_ctrl
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2257_hearing_background_noise_exomes50K_pure_ctrl-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_pure_ctrl_232101ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_pure_ctrl_232101ind
phenoCol=f2257_ctrl_pure
covarCol=sex
qCovarCol=age_final_noise
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-21_f2257_hearing_background_noise_exomes50K_pure_ctrl-regenie.sbatch
INFO: Workflow farnam (ID=w33d438bc153fd1f4) is executed successfully with 1 completed step.



### f.2257 & controls NA for f.3393

In [4]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2257_hearing_background_noise_exomes50K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2257_hearing_background_noise_exomes50K_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_363603ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_363603ind
phenoCol=f2257_ctrl_na
covarCol=sex
qCovarCol=age_final_noise
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-21_f2257_hearing_background_noise_exomes50K_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=w0012483d59f4307c) is executed successfully with 1 completed step.



## Regenie in exome data (original Plink files UKBB unqc'ed) using modified phenotype file with controls_na for f.3393

### f.2257 Controls with NA for f.3393

In [4]:
# First run using controls na for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f2257_hearing_difficulty_exomes200K_noqc_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2257_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/062421_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_166199ind
covarFile=$hearing_pheno_path/062421_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_166199ind
phenoCol=f2257_ctrl_na
covarCol=sex
qCovarCol=age_final_noise
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
hwe_filter=5e-08

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-24_f2257_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=wdb0e3955c5caa7bd) is executed successfully with 1 completed step.



## Regenie in exome data after VCF-QC 200K exomes

### f.2257 & controls NA for f.3393

In [11]:
# Run using all controls for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f2257_hearing_background_noise_exomes200K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_background_noise_exomes200K_ctrl_na-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_363603ind
covarFile=$hearing_pheno_path/041521_UKBB_Hearing_background_noise_f2257_expandedwhite_z974included_ctrl_na_363603ind
phenoCol=f2257_ctrl_na
covarCol=sex
qCovarCol=age_final_noise
genoFile=`echo /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c{1..22}.merged.filtered.bed`
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_background_noise_exomes200K_ctrl_na-regenie_2021-05-18.sbatch
INFO: Workflow csg (ID=w5cbf62c4eb06bba8) is executed successfully with 1 completed step.



## LD clumping job

### Imputed data

In [None]:
clumping_dir=$UKBB_PATH/results/LD_clumping/f2257_hearing_background_noise
clumping_sos=~/project/bioworkflows/admin/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f2257_hearing_background_noise_ldclumping.sbatch
sumstatsFiles=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2257_hearing_background_noise/200828_UKBB_Hearing_background_noise_f2257_hearing_noise_cat.fastGWA.snp_stats.gz

clumping_args="""default 
    --cwd $clumping_dir 
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --bgenFile $bgenFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

### Exome data

In [5]:
tpl_file=../farnam.yml
clumping_dir=$UKBB_PATH/results/LD_clumping/f2257_hearing_background_noise_exome
clumping_sos=~/project/bioworkflows/admin/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f2257_hearing_background_noise_exome_ldclumping.sbatch
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
sumstatsFiles=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2257_hearing_noise_exomes/010421_UKBB_Hearing_background_noise_f2257_175531ind_exomes_hearing_noise_cat.regenie.snp_stats.gz
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
sampleFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
unrelated_samples=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
bfile_ref=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631_chr1_22_exomedata.1200.ref_geno.bed
container_lmm=$UKBB_PATH/lmm.sif
ld_sample_size=1200
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

# Select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $clumping_dir
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-07_f2257_hearing_diff_exome_ldclumping.sbatch
INFO: Workflow farnam (ID=w7c7b3789d732c7e0) is executed successfully with 1 completed step.
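
The `clump_*` variables in the cell above mirror the names of PLINK 1.9's clumping options. Assuming the `default` workflow in `LD_Clumping.ipynb` forwards them to PLINK unchanged, which this notebook does not show, the resulting call might look roughly like the sketch below. The two file paths are hypothetical stand-ins, and the command is only printed, not run.

```bash
# Values copied from the cell above.
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP

# Hypothetical stand-ins for $bfile_ref and $sumstatsFiles.
bfile_ref=/tmp/example.1200.ref_geno.bed
sumstats=/tmp/example_sumstats.txt

# PLINK's --bfile takes the file prefix, so strip the .bed suffix.
plink_cmd=(plink
    --bfile "${bfile_ref%.bed}"
    --clump "$sumstats"
    --clump-field "$clump_field"
    --clump-p1 "$clump_p1"
    --clump-p2 "$clump_p2"
    --clump-r2 "$clump_r2"
    --clump-kb "$clump_kb"
    --clump-annotate "$clump_annotate")

# Print the assembled command instead of running it.
echo "${plink_cmd[*]}"
```

With `clump_p2=1`, every variant within 2000 kb of an index variant and with r2 of at least 0.2 is eligible to join its clump, regardless of its own p-value.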



## 09-22-21 LD clumping f.2257

In [10]:
cwd=$clumping_dir/092321_f2257
clumping_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2257_exome_ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/*snp_stats.gz
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Individuals from the subset of white individuals with exome data 
sampleFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam
unrelated_samples=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
bfile_ref=$UKBB_PATH/results/LD_clumping/ukb23156_c1_22.merged.filtered.2000.ref_geno.bed
name_bref='ukb23156_c1_22.merged.filtered'
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# Select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $cwd
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --name_bref $name_bref
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_exome_ldclumping_2021-09-23.sbatch
INFO: Workflow csg (ID=wbce7f04bc6387cc6) is executed successfully with 1 completed step.



## Region extraction

## 09-23-21 Extraction of regions f.2257 200K

In [6]:
phenoFile=$hearing_pheno_path/090321_UKBB_Hearing_noise_f2257_expandedwhite_65660cases_96601ctrl_PC1_2.tsv
extract_dir=$UKBB_PATH/results/region_extraction/092321_f2257_hearing_noise_200K
extract_sbatch=$USER_PATH/UKBB_GWAS_dev/output/f2257_hearing_noise_200K_regionextrac_$(date +"%Y-%m-%d").sbatch
region_file=$clumping_dir/092321_f2257/*.clumped_region
geno_path=$clumping_dir/092321_UKBB_qc_exome_geno_path.txt
sumstats_path=$lmm_exome_dir_regenie/090921_f2257_hearing_noise_200K/*.snp_stats.gz
unrelated_samples=$clumping_dir/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
extract_job_size=5
formatFile_regenie=
## No need to use format file in this case (regenie is already formatted for region extraction)
extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $phenoFile
    --geno-path $geno_path
    --sumstats-path $sumstats_path
    --unrelated-samples $unrelated_samples
    --job-size $extract_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2257_hearing_noise_200K_regionextrac_2021-09-24.sbatch
INFO: Workflow csg (ID=wc47b731b568fa3b5) is executed successfully with 1 completed step.



## Post-GWAS annotation

In [9]:
lmm_dir=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2257_hearing_noise_exomes
postgwa_sbatch=../output/$(date +"%Y-%m-%d")_f2257_postgwa.sbatch
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2257_hearing_noise_exomes/010421_UKBB_Hearing_background_noise_f2257_175531ind_exomes_hearing_noise_cat.regenie.snp_stats.gz
tpl_file=../farnam.yml
postgwa_sos=~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb
job_size=1
hg=38
postgwa_args="""default
    --cwd $lmm_dir
    --sumstatsFile $sumstatsFile
    --hg $hg
    --job_size $job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $postgwa_sos \
    --to-script $postgwa_sbatch \
    --args "$postgwa_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-01-28_f2257_postgwa.sbatch
INFO: Workflow farnam (ID=w9856930ebcb27a9d) is executed successfully with 1 completed step.


In [None]:
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
cwd=/home/dc2325/scratch60/output/bfile_annovar
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2257_hearing_noise_exomes_bfile/010421_UKBB_Hearing_background_noise_f2257_175531ind_exomes_hearing_noise_cat.regenie.snp_stats.gz
hg=38
job_size=1
container_annovar=$UKBB_PATH/annovar.sif
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=/home/dc2325/scratch60/output/ukb23155_chr1_chr22.bim
humandb=/gpfs/ysm/datasets/db/annovar/humandb

sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb annovar \
    --cwd $cwd \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --ukbb $UKBB_PATH \
    --container_annovar $container_annovar\
    -s build

# 4. Combined phenotype f.2247 & f.2257

## FastGWA job white British

In [None]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_f2257_combined
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2247_f2257_imp-fastgwa.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_f2247_f2257
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_f2247_f2257
phenoCol=f2247_f2257
covarCol=sex
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"
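
This cell, like most cells below, relies on variables defined in earlier cells (`$bfile`, `$sampleFile`, `$tpl_file`, `$lmm_sos` and others), and an unset variable silently expands to an empty string inside `lmm_args`. A minimal guard, assuming it is run before the argument string is built; `require_vars` is a hypothetical helper and the values in the demo are placeholders:

```shell
# Hypothetical helper: report which of the named shell variables are unset or empty.
require_vars() {
    unset_names=""
    for name in "$@"; do
        # Indirect lookup of the variable called $name (empty if unset).
        eval "value=\${$name-}"
        [ -n "$value" ] || unset_names="$unset_names $name"
    done
    if [ -n "$unset_names" ]; then
        echo "unset or empty:$unset_names" >&2
        return 1
    fi
}

# Demo with placeholder values: sampleFile is deliberately left unset.
bfile=/path/to/genotypes.bed
phenoCol=f2247_f2257
unset sampleFile

if require_vars bfile phenoCol sampleFile; then
    check_status=ok
else
    check_status=failed
fi
echo "check: $check_status (missing:$unset_names)"
```

Calling `require_vars bfile sampleFile bgenFile phenoFile tpl_file lmm_sos` at the top of a cell would stop early instead of generating an sbatch script with empty arguments.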

## FastGWA job all white

In [7]:
lmm_dir_fastgwa=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_f2257_combined_expandedwhite
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_f2247_f2257_expandedwhite_imp-fastgwa.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/120120_UKBB_f2247_f2257_expandedwhite
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/120120_UKBB_f2247_f2257_expandedwhite
phenoCol=f2247_f2257
covarCol=sex
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb dewan \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running dewan: Configuration for Yale `pi_dewan` partition cluster
INFO: dewan is completed.
INFO: dewan output:   ../output/2020-12-02_f2247_f2257_expandedwhite_imp-fastgwa.sbatch
INFO: Workflow dewan (ID=cca88c31ebd37730) is executed successfully with 1 completed step.


## Bolt-LMM job

In [3]:
lmm_dir_bolt=$UKBB_PATH/results/BOLTLMM_results/results_imputed_data/f2247_f2257_combined
lmm_sbatch_bolt=../output/$(date +"%Y-%m-%d")_f2247_f2257_imp-bolt.sbatch
phenoFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_f2247_f2257
covarFile=$UKBB_PATH/phenotype_files/hearing_impairment/200828_UKBB_f2247_f2257
phenoCol=f2247_f2257
covarCol=sex
qCovarCol=age
lmm_option='lmmForceNonInf'

lmm_args="""boltlmm
    --cwd $lmm_dir_bolt 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_bolt 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --lmm_option $lmm_option
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp    
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_bolt \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-10-28_f2247_f2257_imp-bolt.sbatch
INFO: Workflow farnam (ID=6f1d2712738ba187) is executed successfully with 1 completed step.



## Regenie exome data

In [8]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_f2257_combined_exomes_bfile
lmm_sbatch_regenie=../output/$(date +"%Y-%m-%d")_f2247_f2257_combined_exome_bfile-regenie.sbatch
phenoFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_f2247_f2257_136862ind_exomes
covarFile=$hearing_pheno_path/phenotypes_exome_data/010421_UKBB_f2247_f2257_136862ind_exomes
phenoCol=f2247_f2257
covarCol=sex
qCovarCol=age_combined
#Use original bed files from the UKBB exome data
#bfile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/exome_files_snpsonly/ukb23155.filtered.merged.bed
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_prefix $lowmem_prefix
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-12_f2247_f2257_combined_exome_bfile-regenie.sbatch
INFO: Workflow farnam (ID=wbd137a88958a3c38) is executed successfully with 1 completed step.



## 08-16-21 Regenie qc'ed exomes

In [6]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_combined_f2247_f2257
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_200Kexomes-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_200Kexomes-regenie_2021-08-17.sbatch
INFO: Workflow csg (ID=w3ba62ef19da32b8a) is executed successfully with 1 completed step.
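
From this cell on, `qCovarCol` holds three covariate names in one quoted string. Once interpolated into `lmm_args` the quotes are gone, so `--qCovarCol` is followed by three separate words. A small sketch of that expansion, assuming the argument string is eventually split on whitespace the way a shell would split it (how `Get_Job_Script.ipynb` handles it internally is not shown here); `demo_args` is a shortened, hypothetical argument string:

```shell
covarCol=sex
qCovarCol="age PC1 PC2"

# Shortened stand-in for lmm_args.
demo_args="regenie
    --covarCol $covarCol
    --qCovarCol $qCovarCol
"

# Unquoted expansion: the shell splits on spaces and newlines.
set -- $demo_args
n_tokens=$#
printf '%s\n' "$@"

# Drop "regenie --covarCol sex --qCovarCol"; what remains are the covariate names.
shift 4
qcovar_values="$*"
echo "values after --qCovarCol: $qcovar_values"
```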



## 08-17-21 Annotation of results from regenie QC'ed 200K exomes

In [19]:
lmm_dir_regenie=$lmm_exome_dir_regenie/081621_combined_f2247_f2257/annotation_p5e-08
sumstatsFile=$lmm_exome_dir_regenie/081621_combined_f2247_f2257/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl_f2247_f2257.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=/mnt/mfs/statgen/UKBiobank/results/ukb23155_200Kexomes_annovar/exome_bim_merge/ukb23155_chr1_chr22.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/combined_f2247_f2257_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --rsid $rsid \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_f2247_f2257_annovar_2021-08-18.sbatch
INFO: Workflow csg (ID=w17fc4ce1d08993f5) is executed successfully with 1 completed step.



## 09-07-21 Regenie exome data 50K vs 150K (QC'ed 200K exomes, un-QC'ed genotype file)

### 50K

In [13]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_combined_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_33399ind_50K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_33399ind_50K.tsv
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_50K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=wde12520d041b012d) is executed successfully with 1 completed step.



### 150K

In [15]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090721_combined_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_103732ind_150K.tsv
covarFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_103732ind_150K.tsv
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_150K-regenie_2021-09-07.sbatch
INFO: Workflow csg (ID=w3a99e1402aa8c2db) is executed successfully with 1 completed step.



## 09-08-21 Analysis with exome QC data and QC genotype array with new database

In [20]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_f2247_f2257_200K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl_PC1_2.tsv
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
anno_file=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/091321_annotation/*.formatted.csv
label_annotate=Gene

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --anno_file $anno_file
    --label_annotate $label_annotate
    --annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_f2247_f2257_200K-regenie_2021-09-16.sbatch
INFO: Workflow csg (ID=wda9f043545708e27) is executed successfully with 1 completed step.
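
In the cell above, `anno_file` is assigned a glob pattern (`*.formatted.csv`). Because the pattern is only ever used inside double quotes here, it is passed on literally and is presumably expanded when the generated script runs. A minimal sketch for checking that such a pattern matches exactly one file, assuming paths without whitespace; `resolve_single` and the demo files are hypothetical:

```shell
# Hypothetical helper: expand a glob pattern and insist on exactly one existing match.
resolve_single() {
    # $1 is left unquoted on purpose so the shell performs pathname expansion.
    set -- $1
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match, got: $*" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Demo on throwaway files: one annotation file, two summary-statistics files.
demo_dir=$(mktemp -d)
: > "$demo_dir/pheno.formatted.csv"
: > "$demo_dir/chr1.snp_stats.gz"
: > "$demo_dir/chr2.snp_stats.gz"

# Exactly one match: the concrete path is printed.
anno_resolved=$(resolve_single "$demo_dir/*.formatted.csv")

# Two matches: rejected.
if resolve_single "$demo_dir/*.snp_stats.gz" >/dev/null 2>&1; then
    two_match_status=ok
else
    two_match_status=rejected
fi

# No match: the pattern stays literal and does not exist, so it is rejected too.
if resolve_single "$demo_dir/*.top_snps.tsv" >/dev/null 2>&1; then
    no_match_status=ok
else
    no_match_status=rejected
fi

expected="$demo_dir/pheno.formatted.csv"
echo "$anno_resolved"
rm -rf "$demo_dir"
```

The same check could be applied to the `sumstatsFile` and `snp_list` patterns used in later cells, which also rely on a wildcard matching a single file.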



#### 09-08-21 Annotation

In [23]:
lmm_dir_regenie=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/091321_annotation
sumstatsFile=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl_PC1_2_f2247_f2257.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/combined_f2247_f2257_200K_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_f2247_f2257_200K_annovar_2021-09-14.sbatch
INFO: Workflow csg (ID=wb959eb558500de75) is executed successfully with 1 completed step.



### 09-14-21 50K

In [18]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_combined_50K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_50K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_32878ind_50K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_32878ind_50K.tsv
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
anno_file=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/091321_annotation/*.formatted.csv
label_annotate=Gene
snp_list=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/*.top_snps.tsv

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --anno_file $anno_file
    --annotate
    --top_snps
    --snp_list $snp_list
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_50K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=w6c103036e030190a) is executed successfully with 1 completed step.



### 09-14-21 combined 150K

In [11]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_combined_150K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_150K-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_102133ind_150K.tsv
covarFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_102133ind_150K.tsv
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the QC'ed exome files (variant and sample missingness < 10%)
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
label_annotate='Gene'
anno_file=$lmm_exome_dir_regenie/091421_combined_150K/091721_annotation/*.formatted.csv

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --label_annotate $label_annotate
    --annotate
    --anno_file $anno_file
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_150K-regenie_2021-09-20.sbatch
INFO: Workflow csg (ID=wf381374fc8974a49) is executed successfully with 1 completed step.



#### 09-17-21 Annotation combined 150K

In [8]:
lmm_dir_regenie=$lmm_exome_dir_regenie/091421_combined_150K/091721_annotation
sumstatsFile=$lmm_exome_dir_regenie/091421_combined_150K/*.regenie.snp_stats.gz
hg=38
job_size=1
bim_name=$UKBB_PATH/results/ukb23155_200Kexomes_annovar/091321_exome_bim_merge/ukb23155_chr1_chr22_091321.bim
humandb=/mnt/mfs/statgen/isabelle/REF/humandb
xref_path=/mnt/mfs/statgen/isabelle/REF/humandb
p_filter=5e-08
anno_sbatch=$USER_PATH/UKBB_GWAS_dev/output/combined_150K_annovar_$(date +"%Y-%m-%d").sbatch


annovar_args="""annovar \
    --cwd $lmm_dir_regenie \
    --sumstatsFile $sumstatsFile\
    --bim_name $bim_name \
    --bimfiles $bimfiles \
    --p_filter $p_filter \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb\
    --no-rsid  \
    --xref_path $xref_path \
    --container_annovar $container_annovar \
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $snptogene_sos \
    --to-script $anno_sbatch \
    --args "$annovar_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_150K_annovar_2021-09-20.sbatch
INFO: Workflow csg (ID=w2425b37d537f2fce) is executed successfully with 1 completed step.



### Hudson plot

In [14]:
### Hudson plot 
hudson_sos=~/project/bioworkflows/GWAS/Hudson_Plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots/exome_data/$(date +"%Y%m%d")_combined_150kvs200k
hudson_sbatch=$USER_PATH/UKBB_GWAS_dev/output/combined_150k_200k_hudson_$(date +"%Y-%m-%d").sbatch
sumstats_1=$UKBB_PATH/results/REGENIE_results/results_exome_data/090921_combined_f2247_f2257_200K/*.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/091421_combined_150K/*.snp_stats.gz
toptitle="H-mix mega-analysis"
bottomtitle="H-mix discovery"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
highlight_snp=
job_size=1 
phenocol1='H-mix 200K'
phenocol2='H-mix 150K'
container_lmm=~/containers/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --highlight_snp $highlight_snp
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --container_lmm $container_lmm
"""
sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_150k_200k_hudson_2021-10-12.sbatch
INFO: Workflow csg (ID=w201b441855c756fc) is executed successfully with 1 completed step.
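
`highlight_snp` is empty in the cell above, so the argument string contains `--highlight_snp` with no value after it. Whether the workflow tolerates that is not shown here. A hedged alternative, assuming the option can simply be omitted when there is nothing to highlight, is to append it only when the variable is non-empty; `demo_args` is a shortened, hypothetical argument string:

```shell
highlight_snp=          # empty, as in the cell above
pval_filter=5e-08

# Shortened stand-in for hudson_args with the mandatory options only.
demo_args="hudson
    --pval_filter $pval_filter"

# Append the optional flag only when a value is available.
if [ -n "$highlight_snp" ]; then
    demo_args="$demo_args
    --highlight_snp $highlight_snp"
fi

# Report whether the flag ended up in the argument string.
case "$demo_args" in
    *--highlight_snp*) has_flag=yes ;;
    *)                 has_flag=no ;;
esac
echo "highlight flag present: $has_flag"
```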



## 09-20-21 Conditional analysis GCTA-COJO

## 01-05-22 Regenie for imputed data of 200K individuals with both exome and imputed data

In [14]:
## All filters are set to 0 because this version of the bfile has already been QC'ed, so there is no need to filter again here
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
lmm_dir_regenie=$lmm_exome_dir_regenie/010522_combined_f2247_f2257_200K_imputed
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/combined_f2247_f2257_200K_imp-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl_PC1_2.tsv
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
#Use the original bed files that passed QC using Megan's parameters geno=0.01, mind=0.1, maf=0.01, hwe=5e-08
bfile=$UKBB_PATH/genotype_files_processed/090221_sample_variant_qc_final_callrate90/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.bed
# Use the imputed bgen files
genoFile=`echo $UKBB_yale/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
bgenMinINFO=0.3
bgenMinMAC=4
sampleFile=$UKBB_yale/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

# Pass --annotate to add the label to the plot; otherwise pass --no-annotate

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --no-annotate
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg\
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args" 

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_f2247_f2257_200K_imp-regenie_2022-01-06.sbatch
INFO: Workflow csg (ID=w2a3e661beb982d9b) is executed successfully with 1 completed step.



## Regenie imputed data: expanded white controls NA

In [7]:
lmm_dir_regenie=$lmm_imp_dir_regenie/f2247_f2257_combined_impdata_newpheno
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_f2257_combined_impdata-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_299916ind
covarFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_299916ind
phenoCol=f2247_f2257_ctrl_na
covarCol=sex
qCovarCol=age_combined
genoFile=`echo $UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$UKBB_PATH/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-13_f2247_f2257_combined_impdata-regenie.sbatch
INFO: Workflow farnam (ID=w5bc39c94a28aa7c9) is executed successfully with 1 completed step.



## Regenie: 50K replication set

### combined phenotype & pure controls

In [5]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_f2257_exomes50K_pure_ctrl
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_f2257_exomes50K_pure_ctrl-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_pure_ctrl_168414ind
covarFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_pure_ctrl_168414ind
phenoCol=f2247_f2257_ctrl_pure
covarCol=sex
qCovarCol=age_combined
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
#Use the original bed files for the genotype array for the expanded white on regenie step1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-21_f2247_f2257_exomes50K_pure_ctrl-regenie.sbatch
INFO: Workflow farnam (ID=w5e147b05ebd7fc54) is executed successfully with 1 completed step.
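
The sbatch file names are stamped with the day the cell is run through `$(date +"%Y-%m-%d")`, which is why the logged output path carries a date such as `2021-04-21`. A small sketch of the naming pattern, using a hypothetical output directory and job label:

```shell
# Hypothetical output directory and job label
out_dir=/tmp/output
label=f2247_f2257_demo

# Command substitution runs `date` once, when the variable is assigned
stamp=$(date +"%Y-%m-%d")
sbatch_file=$out_dir/${stamp}_${label}-regenie.sbatch
echo "$sbatch_file"
```

Re-running a cell on a later day therefore writes a new script instead of overwriting the earlier one.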



### combined phenotype & controls NA for f.3393

In [6]:
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_f2257_exomes50K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_f2257_exomes50K_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_299916ind
covarFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_299916ind
phenoCol=f2247_f2257_ctrl_na
covarCol=sex
qCovarCol=age_combined
genoFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/ukb32285_exomespb_chr1_22.bed
# Use the original genotype-array bed files for the expanded white sample in regenie step 1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-04-21_f2247_f2257_exomes50K_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=w1a01bde43afe534a) is executed successfully with 1 completed step.



## Regenie in exome data (original Plink files UKBB unqc'ed) using modified phenotype file with controls_na for f.3393

### Combined phenotype Controls with NA for f.3393

In [5]:
# First run using controls na for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_f2257_exomes200K_noqc_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_f2247_f2257_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
phenoFile=$hearing_pheno_path/062421_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_137245ind
covarFile=$hearing_pheno_path/062421_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_137245ind
phenoCol=f2247_f2257_ctrl_na
covarCol=sex
qCovarCol=age_combined
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
# Use the original genotype-array bed files in regenie step 1
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
hwe_filter=5e-08

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-24_f2247_f2257_hearing_aid_exomes200K_noqc_ctrl_na-regenie.sbatch
INFO: Workflow farnam (ID=w8700ff76175d3c64) is executed successfully with 1 completed step.
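
The `genoFile` assignment relies on bash brace expansion (`c{1..22}`) to name one bed file per autosome. Brace expansion produces all 22 names whether or not the files exist, so a quick check before generating the job script can save a failed run. The sketch below builds the same kind of list with an explicit loop and counts missing files; the directory and the `demo_c*_b0_v1.bed` names are hypothetical stand-ins.

```shell
# Hypothetical stand-in directory holding only 3 of the 22 per-chromosome files
demo_dir=$(mktemp -d)
for chr in 1 2 3; do
    : > "$demo_dir/demo_c${chr}_b0_v1.bed"
done

# Build the space-separated list for chromosomes 1-22 (what {1..22} expands to in bash)
geno_list=""
missing=0
chr=1
while [ "$chr" -le 22 ]; do
    f=$demo_dir/demo_c${chr}_b0_v1.bed
    geno_list="$geno_list $f"
    # Count files that are named in the list but absent on disk
    [ -e "$f" ] || missing=$((missing + 1))
    chr=$((chr + 1))
done
geno_list=${geno_list# }   # drop the leading space

set -- $geno_list
n_files=$#
echo "listed: $n_files, missing on disk: $missing"
rm -rf "$demo_dir"
```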



## Regenie in exome data after VCF-QC 200K exomes

### combined phenotype & controls Na for f.3393

In [12]:
# Run using all controls for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/f2247_f2257_exomes200K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/f2247_f2257_exomes200K_ctrl_na-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_299916ind
covarFile=$hearing_pheno_path/041521_UKBB_f2247_f2257_expandedwhite_z974included_ctrl_na_299916ind
phenoCol=f2247_f2257_ctrl_na
covarCol=sex
qCovarCol=age_combined
genoFile=`echo /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/ukb23156_c{1..22}.merged.filtered.bed`
# Use the original genotype-array bed files for the expanded white sample in regenie step 1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/f2247_f2257_exomes200K_ctrl_na-regenie_2021-05-18.sbatch
INFO: Workflow csg (ID=w6bd7047f416af43a) is executed successfully with 1 completed step.



## LD clumping job

### Imputed data

In [None]:
clumping_dir=$UKBB_PATH/results/LD_clumping/f2247_f2257_combined
clumping_sos=~/project/bioworkflows/admin/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f2247_f2257_combined_ldclumping.sbatch
sumstatsFiles=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_f2257_combined/200828_UKBB_f2247_f2257_f2247_f2257.fastGWA.snp_stats.gz

clumping_args="""default 
    --cwd $clumping_dir 
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --bgenFile $bgenFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

### Exome data

In [6]:
tpl_file=../farnam.yml
clumping_dir=$UKBB_PATH/results/LD_clumping/f2247_f2257_combined_exome
clumping_sos=~/project/bioworkflows/admin/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_f2247_f2257_combined_exome_ldclumping.sbatch
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
sumstatsFiles=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes/010421_UKBB_f2247_f2257_136862ind_exomes_f2247_f2257.regenie.snp_stats.gz
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
sampleFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
unrelated_samples=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
bfile_ref=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631_chr1_22_exomedata.1200.ref_geno.bed
container_lmm=$UKBB_PATH/lmm.sif
ld_sample_size=1200
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1

# Select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $clumping_dir
    --bfile $bfile
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-07_f2247_f2257_combined_exome_ldclumping.sbatch
INFO: Workflow farnam (ID=wf81e3f1aaede6aab) is executed successfully with 1 completed step.
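
The `clump_*` variables share their names with the PLINK 1.9 `--clump` options. The dry run below shows how the values from this cell would look if they were forwarded to those options. That `LD_Clumping.ipynb` calls PLINK in exactly this form is an assumption, and `demo_ref_geno` and `demo.snp_stats` are hypothetical file names. Nothing is executed; the command is only assembled as text.

```shell
# Values copied from the cell above
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP

# Hypothetical file names, for illustration only
ref_prefix=demo_ref_geno
sumstats=demo.snp_stats

# Dry run: assemble the command as text instead of executing PLINK
plink_cmd="plink --bfile $ref_prefix --clump $sumstats \
--clump-field $clump_field --clump-p1 $clump_p1 --clump-p2 $clump_p2 \
--clump-r2 $clump_r2 --clump-kb $clump_kb --clump-annotate $clump_annotate"
echo "$plink_cmd"
```

Under PLINK's definitions, these settings take index variants with p at or below 5e-08, allow any variant (p up to 1) to join a clump, and group variants with r² of at least 0.2 within 2000 kb of the index variant.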



## 09-22-21 LD clumping combined

In [11]:
cwd=$clumping_dir/092321_combined
clumping_sbatch=$USER_PATH/UKBB_GWAS_dev/output/combined_exome_ldclumping_$(date +"%Y-%m-%d").sbatch
sumstatsFiles=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/*snp_stats.gz
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
## Individuals from the subset of white individuals with exome data 
sampleFile=$UKBB_PATH/results/083021_PCA_results/europeans/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.fam
unrelated_samples=$UKBB_PATH/results/083021_PCA_results/090221_ldprun_unrelated/cache/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
bfile_ref=$UKBB_PATH/results/LD_clumping/ukb23156_c1_22.merged.filtered.2000.ref_geno.bed
name_bref='ukb23156_c1_22.merged.filtered'
ld_sample_size=2000
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20

# Select samples with the filter_samples workflow and create the reference file with the reference workflow,
# then use the default workflow to run the LD clumping
clumping_args="""default
    --cwd $cwd
    --bfile_ref $bfile_ref 
    --genoFile $genoFile
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --name_bref $name_bref
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_exome_ldclumping_2021-09-23.sbatch
INFO: Workflow csg (ID=wf891996597342d3f) is executed successfully with 1 completed step.
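
`sumstatsFiles` is assigned a path ending in `*snp_stats.gz`. The shell does not expand a glob in an assignment, so the variable stores the pattern itself and the matching files are only resolved wherever the value is later used unquoted. A short sketch with a hypothetical temporary directory and two toy files:

```shell
# Hypothetical directory with two toy summary-statistics files
demo_dir=$(mktemp -d)
: > "$demo_dir/a.snp_stats.gz"
: > "$demo_dir/b.snp_stats.gz"

# Assignment does not expand the glob: the variable holds the pattern itself
pattern=$demo_dir/*snp_stats.gz
literal="$pattern"

# Unquoted use expands the pattern to the matching files
set -- $pattern
n_matches=$#

echo "stored:  $literal"
echo "matches: $n_matches"
rm -rf "$demo_dir"
```

Inside the double-quoted args string the pattern is therefore passed through unexpanded; if nothing matches at the point of use, bash by default leaves the pattern as a literal word.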



### Post-GWAS annotation SNP-to-gene

In [10]:
lmm_dir=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes
postgwa_sbatch=../output/$(date +"%Y-%m-%d")_f2247_f2257_postgwa.sbatch
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes/010421_UKBB_f2247_f2257_136862ind_exomes_f2247_f2257.regenie.snp_stats.gz
tpl_file=../farnam.yml
postgwa_sos=~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb
job_size=1
hg=38

postgwa_args="""default
    --cwd $lmm_dir
    --sumstatsFile $sumstatsFile
    --hg $hg
    --job_size $job_size
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $postgwa_sos \
    --to-script $postgwa_sbatch \
    --args "$postgwa_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-01-28_f2247_f2257_postgwa.sbatch
INFO: Workflow farnam (ID=w311a9815d7e57ba3) is executed successfully with 1 completed step.


### Post-GWAS annotation ANNOVAR

In [None]:
lmm_dir=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes
postgwa_sbatch=../output/$(date +"%Y-%m-%d")_f2247_f2257_postgwa.sbatch
postgwa_sos=~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb
tpl_file=../farnam.yml
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes/010421_UKBB_f2247_f2257_136862ind_exomes_f2247_f2257.regenie.snp_stats.gz
hg=38
job_size=1
container_annovar=/home/dc2325/scratch60/annovar.sif
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=/home/dc2325/scratch60/output/ukb23155_chr1_chr22.bim
humandb=/gpfs/ysm/datasets/db/annovar/humandb

annovar_args="""annovar
    --cwd $lmm_dir
    --hg $hg
    --bimfiles $bimfiles
    --bim_name $bim_name
    --sumstatsFile $sumstatsFile
    --humandb $humandb
    --job_size $job_size
    --container_annovar $container_annovar
"""

# Pass the ANNOVAR arguments defined above (not the SNP-to-gene $postgwa_args)
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $postgwa_sos \
    --to-script $postgwa_sbatch \
    --args "$annovar_args"

In [None]:
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
cwd=/home/dc2325/scratch60/output/bfile_annovar
sumstatsFile=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes_bfile/010421_UKBB_f2247_f2257_136862ind_exomes_f2247_f2257.regenie.snp_stats.gz
hg=38
job_size=1
container_annovar=$UKBB_PATH/annovar.sif
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bim`
bim_name=/home/dc2325/scratch60/output/ukb23155_chr1_chr22.bim
humandb=/gpfs/ysm/datasets/db/annovar/humandb

sos run ~/project/UKBB_GWAS_dev/workflow/snptogene.ipynb annovar \
    --cwd $cwd \
    --sumstatsFile $sumstatsFile \
    --bim_name $bim_name \
    --hg $hg \
    --job_size $job_size \
    --humandb $humandb \
    --ukbb $UKBB_PATH \
    --container_annovar $container_annovar \
    -s build

## Region extraction

## 09-23-21 Extraction of regions combined 200K

In [8]:
phenoFile=$hearing_pheno_path/090321_UKBB_Combined_f2247_f2257_expandedwhite_38410cases_96601ctrl_PC1_2.tsv
extract_dir=$UKBB_PATH/results/region_extraction/092321_combined_f2247_f2257_200K
extract_sbatch=$USER_PATH/UKBB_GWAS_dev/output/combined_f2247_f2257_200K_regionextrac_$(date +"%Y-%m-%d").sbatch
region_file=$clumping_dir/092321_combined/*.clumped_region
geno_path=$clumping_dir/092321_UKBB_qc_exome_geno_path.txt
sumstats_path=$lmm_exome_dir_regenie/090921_combined_f2247_f2257_200K/*.snp_stats.gz
unrelated_samples=$clumping_dir/UKB_genotypedatadownloaded083019.090221_sample_variant_qc_final_callrate90.filtered.extracted.europeans.filtered.090221_ldprun_unrelated.filtered.fam
extract_job_size=5
formatFile_regenie=
## No format file is needed here (the regenie output is already formatted for region extraction)
extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $phenoFile
    --geno-path $geno_path
    --sumstats-path $sumstats_path
    --unrelated-samples $unrelated_samples
    --job-size $extract_job_size
    --container_lmm $container_lmm
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/combined_f2247_f2257_200K_regionextrac_2021-09-24.sbatch
INFO: Workflow csg (ID=we5936457d979d7f1) is executed successfully with 1 completed step.



# 5. Hudson plot

## f2247_f2257 combined vs Hdiff Wells paper

In [2]:
tpl_file=../farnam.yml
hudson_sos=~/project/bioworkflows/admin/Hudson_plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots/hearing_impairment/
hudson_sbatch=../output/$(date +"%Y-%m-%d")_f2247_f2257_vs_hdiff_hudson.sbatch
sumstats_1=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_f2257_combined/200828_UKBB_f2247_f2257_f2247_f2257.fastGWA.snp_stats.gz
sumstats_2=/home/dc2325/project/HI_UKBB/2019_Wells_sumstats/HD_EA_gwas_sumstats.txt.gz
toptitle="f2247_f2257_combined"
bottomtitle="Hdiff_Wells_GWAS"
highlight_p_top=0.0
highlight_p_bottom=0.0
pval_filter=5e-08
highlight_snp=/home/dc2325/project/HI_UKBB/2019_Wells_sumstats/hdiff_snps_wells
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --highlight_snp $highlight_snp
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-10-26_f2247_f2257_vs_hdiff_hudson.sbatch
INFO: Workflow farnam (ID=f34e7c33d379875d) is executed successfully with 1 completed step.
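
Before plotting, it can help to preview how many variants in a summary-statistics file fall below the genome-wide threshold `5e-08` used here for `pval_filter`. The sketch below does this with `awk` on a toy file; the column layout (`CHR POS SNP P`) and the variant names are hypothetical, and it is a user-side check rather than a description of what `Hudson_plot.ipynb` does internally.

```shell
# Toy summary statistics; the column names and values are hypothetical
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
CHR POS SNP P
1 1000 rs_demo1 3e-09
1 2000 rs_demo2 0.04
2 3000 rs_demo3 5e-08
2 4000 rs_demo4 1.2e-12
EOF

pval_filter=5e-08

# Keep the header plus rows whose P column (field 4) is strictly below the threshold
kept=$(awk -v thr="$pval_filter" 'NR == 1 || ($4 + 0) < (thr + 0)' "$demo_file")
n_kept=$(printf '%s\n' "$kept" | awk 'NR > 1' | wc -l | tr -d ' ')

printf '%s\n' "$kept"
echo "variants kept: $n_kept"
rm -f "$demo_file"
```

For a real gzipped file the same `awk` filter can read from `gzip -dc file.gz`, with the field number adjusted to wherever the p-value column sits.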



## f3393 vs Haid Wells paper

In [2]:
tpl_file=../farnam.yml
hudson_sos=~/project/bioworkflows/admin/Hudson_plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots/hearing_impairment/
hudson_sbatch=../output/$(date +"%Y-%m-%d")_f3393_vs_Haid_hudson.sbatch
sumstats_1=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f3393_hearing_aid/200828_UKBB_Hearing_aid_f3393_hearing_aid_cat.fastGWA.snp_stats.gz
sumstats_2=/home/dc2325/project/HI_UKBB/2019_Wells_sumstats/HAID_EA_gwas_sumstats.txt.gz
toptitle="f_3393_hearing_aid"
bottomtitle="Haid_Wells_GWAS"
highlight_p_top=0.0
highlight_p_bottom=0.0
pval_filter=5e-08
highlight_snp=/home/dc2325/project/HI_UKBB/2019_Wells_sumstats/haid_snps_wells
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --highlight_snp $highlight_snp
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-10-22_f3393_vs_Haid_hudson.sbatch
INFO: Workflow farnam (ID=f48b56e3e091ef16) is executed successfully with 1 completed step.



## f.2247 exome and imputed data

In [7]:
tpl_file=../farnam.yml
hudson_sos=~/project/bioworkflows/admin/Hudson_plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots
hudson_sbatch=../output/$(date +"%Y-%m-%d")_f2247_imp_exome_hudson.sbatch
sumstats_1=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_hearing_difficulty/200828_UKBB_Hearing_difficulty_f2247_hearing_diff_new.fastGWA.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_hearing_difficulty_exomes/010421_UKBB_Hearing_difficulty_f2247_171970ind_exomes_hearing_diff_new.regenie.snp_stats.gz
toptitle="f2247_imputed"
bottomtitle="f2247_exome"
phenocol1="f2247_imputed"
phenocol2="f2247_exome"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-02_f2247_imp_exome_hudson.sbatch
INFO: Workflow farnam (ID=w2d252da39eb028af) is executed successfully with 1 completed step.
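
Shell variable names are case-sensitive, so the names defined in a cell (`phenocol1`, `phenocol2`) and the names referenced inside `hudson_args` must match exactly; a reference to an unset name expands silently to an empty string. A minimal sketch of the difference, using the `${var-default}` form to make an unset variable visible:

```shell
phenocol1="f2247_imputed"     # lower-case name, as defined in the Hudson plot cells

# ${var-default} substitutes the default only when var is unset
wrong=${phenoCol1-UNSET}      # different variable: capital C
right=${phenocol1-UNSET}

echo "phenoCol1 -> $wrong"
echo "phenocol1 -> $right"
```

Running a cell with `set -u` enabled is another way to turn such silent empty expansions into an error.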



## f.2257 exome and imputed data

In [8]:
tpl_file=../farnam.yml
hudson_sos=~/project/bioworkflows/admin/Hudson_plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots
hudson_sbatch=../output/$(date +"%Y-%m-%d")_f2257_imp_exome_hudson.sbatch
sumstats_1=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2257_hearing_background_noise/200828_UKBB_Hearing_background_noise_f2257_hearing_noise_cat.fastGWA.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2257_hearing_noise_exomes/010421_UKBB_Hearing_background_noise_f2257_175531ind_exomes_hearing_noise_cat.regenie.snp_stats.gz
toptitle="f2257_imputed"
bottomtitle="f2257_exome"
phenocol1="f2257_imputed"
phenocol2="f2257_exome"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-02_f2257_imp_exome_hudson.sbatch
INFO: Workflow farnam (ID=w0c5959e07610ebcd) is executed successfully with 1 completed step.



## Combined phenotype

In [8]:
tpl_file=../farnam.yml
hudson_sos=~/project/bioworkflows/admin/Hudson_plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots
hudson_sbatch=../output/$(date +"%Y-%m-%d")_f2247_f2257_imp_exome_hudson.sbatch
sumstats_1=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f2247_f2257_combined/200828_UKBB_f2247_f2257_f2247_f2257.fastGWA.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/f2247_f2257_combined_exomes/010421_UKBB_f2247_f2257_136862ind_exomes_f2247_f2257.regenie.snp_stats.gz
toptitle="Combined_f2247_f2257_imputed"
bottomtitle="Combined_f2247_f2257_exome"
phenocol1="f2247_f2257_imputed"
phenocol2="f2247_f2257_exome"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-03_f2247_f2257_imp_exome_hudson.sbatch
INFO: Workflow farnam (ID=w093ec0ac945b39c0) is executed successfully with 1 completed step.



## f3393 exome and imputed data

In [3]:
tpl_file=../farnam.yml
hudson_sos=~/project/bioworkflows/admin/Hudson_plot.ipynb
hudson_dir=$UKBB_PATH/results/hudson_plots
hudson_sbatch=../output/$(date +"%Y-%m-%d")_f3393_imp_exome_hudson.sbatch
sumstats_1=$UKBB_PATH/results/FastGWA_results/results_imputed_data/f3393_hearing_aid/200828_UKBB_Hearing_aid_f3393_hearing_aid_cat.fastGWA.snp_stats.gz
sumstats_2=$UKBB_PATH/results/REGENIE_results/results_exome_data/f3393_hearing_aid_exomes/010421_UKBB_Hearing_aid_f3393_128254ind_exomes_hearing_aid_cat.regenie.snp_stats.gz
toptitle="f3393_imputed"
bottomtitle="f3393_exome"
phenocol1="f3393_imputed"
phenocol2="f3393_exome"
highlight_p_top=5e-08
highlight_p_bottom=5e-08
pval_filter=5e-08
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

hudson_args="""hudson
    --cwd $hudson_dir
    --sumstats_1 $sumstats_1
    --sumstats_2 $sumstats_2
    --toptitle $toptitle
    --bottomtitle $bottomtitle
    --phenocol1 $phenocol1
    --phenocol2 $phenocol2
    --job_size $job_size
    --highlight_p_top $highlight_p_top
    --highlight_p_bottom $highlight_p_bottom
    --pval_filter $pval_filter
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $hudson_sos \
    --to-script $hudson_sbatch \
    --args "$hudson_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2021-02-03_f3393_imp_exome_hudson.sbatch
INFO: Workflow farnam (ID=w7da36fe5bdb68b67) is executed successfully with 1 completed step.



# 6. Fine mapping

## f.3393 hearing aid

In [1]:
tpl_file=../farnam.yml
finemap_sos=~/project/UKBB_GWAS_dev/SuSiE_RSS.ipynb
finemap_dir=$UKBB_PATH/results/fine_mapping/f3393_hearing_aid
finemap_sbatch=../output/$(date +"%Y-%m-%d")_f3393_hearing_aid_susie.sbatch
sumstatFile=$UKBB_PATH/results/region_extraction/f3393_hearing_aid/10_126783170_126813028/200828_UKBB_Hearing_aid_f3393_hearing_aid_cat.fastGWA.snp_stats_10_126783170_126813028.sumstats.gz
ldFile=$UKBB_PATH/results/region_extraction/f3393_hearing_aid/10_126783170_126813028/200828_UKBB_Hearing_aid_f3393_hearing_aid_cat.fastGWA.snp_stats_10_126783170_126813028.sample_ld.gz
N=230411
job_size=1
container_lmm=$UKBB_PATH/lmm.sif

finemap_args="""default
    --cwd $finemap_dir
    --sumstatFile $sumstatFile
    --ldFile $ldFile
    --N $N
    --job_size $job_size
    --container_lmm $container_lmm
"""
sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $finemap_sos \
    --to-script $finemap_sbatch \
    --args "$finemap_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-10-15_f3393_hearing_aid_susie.sbatch
INFO: Workflow farnam (ID=fdf18b3eb0fe563d) is executed successfully with 1 completed step.
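
The region folder in the `sumstatFile` and `ldFile` paths, `10_126783170_126813028`, appears to encode chromosome, start and end separated by underscores (an assumption based on the name alone). Under that assumption, the pieces can be split out and the region width computed as follows:

```shell
region=10_126783170_126813028   # region folder name from the paths above

# Split on underscores into chromosome, start and end
old_ifs=$IFS
IFS=_
set -- $region
IFS=$old_ifs
chr=$1
start=$2
end=$3

# Width of the region in base pairs (end minus start)
span=$((end - start))
echo "chr${chr}:${start}-${end} spans ${span} bp"
```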



# 7. Mendelian-like phenotype with UKBB 200K exome data (plink_geno_mind)

In [3]:
# Run using all controls for f3393 
lmm_dir_regenie=$lmm_exome_dir_regenie/mendelian_like_exomes200K_ctrl_na
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/mendelian_like_exomes200K_ctrl_na-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/full_mendilian-like_pheno_file.tsv
covarFile=$hearing_pheno_path/full_mendilian-like_pheno_file.tsv
phenoCol=mendilian-like
covarCol=sex
#qCovarCol=age_final_aid
genoFile=`echo /mnt/mfs/statgen/UKBiobank/data/exome_files/project_VCF/plink_files/plink_geno_mind/ukb23156_c{1..22}.merged.filtered.bed`
# Use the original genotype-array bed files for the expanded white sample in regenie step 1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/mendelian_like_exomes200K_ctrl_na-regenie_2021-07-30.sbatch
INFO: Workflow csg (ID=w31f7be3d104e7443) is executed successfully with 1 completed step.



## Mendelian-like phenotype with UKBB 200K QC'ed exome data 

In [5]:
lmm_dir_regenie=$lmm_exome_dir_regenie/mendelian_like_qc_exomes200K
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/mendelian_like_qc_exomes200K_ctrl_na-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl
phenoCol=mendelian
covarCol=sex
# We don't have the age of onset from the icd10 variable yet (08-10-21)
qCovarCol="PC1 PC2"
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`
# Use the original genotype-array bed files for the expanded white sample in regenie step 1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

INFO: Running csg: Configuration for Columbia csg partition cluster
INFO: csg is completed.
INFO: csg output:   /home/dmc2245/project/UKBB_GWAS_dev/output/mendelian_like_qc_exomes200K_ctrl_na-regenie_2021-08-10.sbatch
INFO: Workflow csg (ID=w37431a767d09bf5f) is executed successfully with 1 completed step.
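
The `genoFile` variable in the cell above is built with bash brace expansion (`c{1..22}`), which produces all 22 file names whether or not the files exist on disk. A sketch of a pre-flight check follows; `missing_files` is a hypothetical helper name, not something provided by the workflows.

```bash
# Hypothetical helper (an assumption, not part of the workflow):
# print each path that does not exist to stderr and the count of
# missing paths to stdout.
missing_files() {
    local f n=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage, assuming $genoFile holds the space-separated list built above.
# It is left unquoted on purpose so that it splits into one argument per file:
# missing_files $genoFile
```

A count of `0` means every per-chromosome file named in the list is present.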



### Mendelian-like phenotype with fastGWA

In [None]:
lmm_dir_fastgwa=$lmm_exome_dir_fastgwa/mendelian_like_qc_exomes200K
lmm_sbatch_fastgwa=../output/mendelian_like_qc_exomes200K$(date +"%Y-%m-%d")-fastgwa.sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl
phenoCol=mendelian
covarCol=sex
qCovarCol="PC1 PC2"
genoFile=`echo $UKBB_PATH/data/exome_files/project_VCF/072721_run/plink/ukb23156_c{1..22}.merged.filtered.bed`

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --genoFile $genoFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile  
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

## Regenie imputed data: Expanded white control NA (08/10/21 analysis)

#### Analysis for f2247 & f2257 (080421)


In [None]:
lmm_dir_regenie=$lmm_imp_dir_regenie/081021_Combined_f2247_f2257
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/Combined_f2247_f2257-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Combined_f2247_f2257_expandedwhite_39049cases_98082ctrl
phenoCol=f2247_f2257
covarCol=sex
qCovarCol="age PC1 PC2"
genoFile=`echo $HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### Analysis for Hearing_aid_f3393 (080421)

In [None]:
lmm_dir_regenie=$lmm_imp_dir_regenie/081021_Hearing_aid_f3393
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/Hearing_aid_f3393-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl
phenoCol=f3393
covarCol=sex
qCovarCol="age PC1 PC2"
genoFile=`echo $HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### Hearing_difficulty_f2247

In [None]:
lmm_dir_regenie=$lmm_imp_dir_regenie/081021_Hearing_difficulty_f2247
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/Hearing_difficulty_f2247-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_difficulty_f2247_expandedwhite_46237cases_98082ctrl
phenoCol=f2247
covarCol=sex
qCovarCol="age PC1 PC2"
genoFile=`echo $HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### Hearing_noise_f2257

In [None]:
lmm_dir_regenie=$lmm_imp_dir_regenie/081021_Hearing_noise_f2257
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/Hearing_noise_f2257-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Hearing_noise_f2257_expandedwhite_66656cases_98082ctrl
phenoCol=f2257
covarCol=sex
qCovarCol="age PC1 PC2"
genoFile=`echo $HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"

#### Mendelian

In [None]:
lmm_dir_regenie=$lmm_imp_dir_regenie/081021_Mendelian
lmm_sbatch_regenie=$USER_PATH/UKBB_GWAS_dev/output/Mendelian-regenie_$(date +"%Y-%m-%d").sbatch
phenoFile=$hearing_pheno_path/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl
covarFile=$hearing_pheno_path/080421_UKBB_Mendelian_expandedwhite_1520cases_98082ctrl
phenoCol=mendelian
covarCol=sex
qCovarCol="age PC1 PC2"
genoFile=`echo $HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=$HOME/UKBiobank_Yale_transfer/ukb39554_imputeddataset/ukb32285_imputedindiv.sample

#Use the original bed files for the genotype array on regenie step1
bfile=$UKBB_PATH/data/genotype_files/UKB_expandedwhite_qcgenotypefiles/UKB_expandedwhiteonly_phenotypeindepqc_410905indiv_528206snps_102720.bed

lmm_args="""regenie
    --cwd $lmm_dir_regenie 
    --bfile $bfile 
    --genoFile $genoFile
    --sampleFile $sampleFile
    --phenoFile $phenoFile 
    --formatFile $formatFile_regenie 
    --phenoCol $phenoCol
    --covarCol $covarCol  
    --qCovarCol $qCovarCol
    --bsize $bsize
    --lowmem_dir $lowmem_dir
    --trait $trait 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --minMAC $minMAC
    --job_size $lmm_job_size
    --ylim $ylim
    --reverse_log_p $reverse_log_p
    --numThreads $numThreads
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run $USER_PATH/UKBB_GWAS_dev/admin/Get_Job_Script.ipynb csg \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_regenie \
    --args "$lmm_args"
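
The imputed-data cells in this section differ only in the output directory, the job script name, the phenotype file and the phenotype column. As a rough sketch of how those per-trait settings could be factored out, the helper below sets them from three arguments. `set_trait_vars` is a hypothetical name, and the function assumes `$lmm_imp_dir_regenie`, `$USER_PATH` and `$hearing_pheno_path` are already defined by earlier cells.

```bash
# Hypothetical helper (an assumption, not part of the workflow): set the
# variables that change from trait to trait in the cells above.
# Arguments: <label> <phenotype file base name> <phenotype column>
set_trait_vars() {
    local label="$1" pheno_base="$2" col="$3"
    lmm_dir_regenie="$lmm_imp_dir_regenie/081021_$label"
    lmm_sbatch_regenie="$USER_PATH/UKBB_GWAS_dev/output/${label}-regenie_$(date +"%Y-%m-%d").sbatch"
    phenoFile="$hearing_pheno_path/$pheno_base"
    covarFile="$phenoFile"   # the cells above use the same file for covariates
    phenoCol="$col"
}

# Example for the hearing-aid trait:
# set_trait_vars Hearing_aid_f3393 \
#     080421_UKBB_Hearing_aid_f3393_expandedwhite_6305cases_98082ctrl f3393
```

After calling it, `lmm_args` still has to be rebuilt before the `sos run` command, because the string captures the variable values at the time it is assigned.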