# Analysis of SRT phenotypes

This notebook uses `Get_Job_Script.ipynb` to automatically generate the sbatch scripts that run on Yale's cluster. The goal is to apply [various LMM workflows](https://github.com/statgenetics/UKBB_GWAS_dev/tree/master/workflow) to perform association analysis on different hearing impairment traits, run LD clumping, and extract the associated regions.

The phenotypes analyzed are:

1. Left ear f.20019
2. Right ear f.20021
3. Best ear (a new variable holding the minimum SRT value of f.20019 and f.20021)
4. Worst ear (a new variable holding the maximum SRT value of f.20019 and f.20021)
5. Dichotomous trait:
   - Controls are those in the bottom 25% for the left ear (< -7.5) and the right ear (< -7.5)
   - Cases are those in the top 25% for the left ear (> -5) and the right ear (> -5)
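
The derived traits above can be sketched in a few lines of shell and awk. This is only an illustration of the definitions, not the script that produced the phenotype files: the function name `derive_srt_traits`, the headerless three-column input (`IID left right`), and the 1/0/NA coding for cases, controls, and everyone else are all assumptions.

```shell
# Hypothetical helper (not part of the workflow): reads "IID left right"
# lines on stdin and prints "IID best worst cc".
# cc coding is an assumption: 1 = case, 0 = control, NA = neither.
derive_srt_traits() {
    awk '
    {
        # Rows with a non-numeric SRT value get NA for every derived trait.
        if ($2 !~ /^-?[0-9]+(\.[0-9]+)?$/ || $3 !~ /^-?[0-9]+(\.[0-9]+)?$/) {
            print $1, "NA", "NA", "NA"
            next
        }
        l = $2 + 0; r = $3 + 0
        best  = (l < r) ? l : r    # best ear: minimum SRT of the two ears
        worst = (l > r) ? l : r    # worst ear: maximum SRT of the two ears
        cc = "NA"
        if (l < -7.5 && r < -7.5) cc = 0      # control: both ears below -7.5
        else if (l > -5 && r > -5) cc = 1     # case: both ears above -5
        print $1, best, worst, cc
    }'
}
```

For example, `printf '1 -8 -9\n' | derive_srt_traits` classifies individual 1 as a control, because both ears fall below -7.5.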

## Phenotype file names

1. 200904_UKBB_SRT_cat_cc
2. 200904_UKBB_SRT_int_best_cc
3. 200904_UKBB_SRT_int_left_cc
4. 200904_UKBB_SRT_int_right_cc
5. 200904_UKBB_SRT_int_worst_cc

## File paths on Yale cluster

- Genotype files in PLINK format:
`/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv`
- Genotype files in bgen format:
`/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/`
- Summary stats for imputed variants (BOLT-LMM):
`/SAY/dbgapstg/scratch/UKBiobank/results/BOLTLMM_results/results_imputed_data`
- Summary stats for imputed variants (FastGWA):
`/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data`
- Phenotype files:
`/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment`
- Relationship file:
`/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620`

## Bash variables for workflow configuration

In [2]:
# Common variables
tpl_file=../farnam.yml
bfile=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated082020removedwithdrawnindiv.bed
sampleFile=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
bgenFile=`echo /SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
unrelated_samples=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
formatFile_fastgwa=~/project/UKBB_GWAS_DEV/data/fastGWA_template.yml
formatFile_bolt=~/project/UKBB_GWAS_DEV/data/boltlmm_template.yml
formatFile_saige=~/project/UKBB_GWAS_DEV/data/saige_template.yml
formatFile_regenie=~/project/UKBB_GWAS_DEV/data/regenie_template.yml
# LMM directories
lmm_dir_fastgwa=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/
lmm_dir_bolt=/SAY/dbgapstg/scratch/UKBiobank/results/BOLTLMM_results/results_imputed_data/
lmm_dir_saige=/SAY/dbgapstg/scratch/UKBiobank/results/SAIGE_results/results_imputed_data/
lmm_dir_regenie=/SAY/dbgapstg/scratch/UKBiobank/results/REGENIE_results/results_imputed_data/
lmm_sos=~/project/bioworkflows/GWAS/LMM.ipynb
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_imp-fastgwa.sbatch
lmm_sbatch_bolt=../output/$(date +"%Y-%m-%d")_imp-bolt.sbatch
lmm_sbatch_saige=../output/$(date +"%Y-%m-%d")_imp-saige.sbatch
lmm_sbatch_regenie=../output/$(date +"%Y-%m-%d")_imp-regenie.sbatch
## LMM variables 
LDscoresFile=~/software/BOLT-LMM_v2.3.4/tables/LDSCORE.1000G_EUR.tab.gz
geneticMapFile=~/software/BOLT-LMM_v2.3.4/tables/genetic_map_hg19_withX.txt.gz
covarMaxLevels=10
numThreads=20
bgenMinMAF=0.001
bgenMinINFO=0.8
lmm_job_size=1
ylim=0
container_lmm=/SAY/dbgapstg/scratch/UKBiobank/lmm.sif
container_marp=/home/dc2325/scratch60/marp.sif
### Specific to FastGWA
grmFile=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.grm.sp
### Specific to SAIGE
bgenMinMAC=4
trait_type=binary
loco=TRUE
sampleCol=IID
### Specific to REGENIE
bsize=1000
lowmem=/SAY/dbgapstg/scratch/UKBiobank/results/REGENIE_results/results_imputed_data/
trait=bt
minMAC=5
reverse_log_p=True
# LD clumping directories
clumping_dir=/SAY/dbgapstg/scratch/UKBiobank/results/LD_clumping/
clumping_sos=~/project/bioworkflows/GWAS/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_imp_ldclumping.sbatch
## LD clumping variables
# For sumstatsFiles, if there is more than one file, provide each path
bfile_ref=/SAY/dbgapstg/scratch/UKBiobank/results/LD_clumping/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.1210.ref_geno.bed
# Changes depending upon which traits are analyzed
# In this case tinnitus only
sumstatsFiles=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/tinnitus/*.snp_stats.gz
ld_sample_size=1210
clump_field=P
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1
clumpFile= 
clumregionFile=
# Region extraction directories
extract_dir=/SAY/dbgapstg/scratch/UKBiobank/results/region_extraction/tinnitus
extract_sos=~/project/bioworkflows/GWAS/Region_Extraction.ipynb
extract_sbatch=../output/$(date +"%Y-%m-%d")_tinnitus_imp-region.sbatch
## Region extraction variables
region_file=/SAY/dbgapstg/scratch/UKBiobank/results/LD_clumping/tinnitus/*.clumped_region
geno_path=/SAY/dbgapstg/scratch/UKBiobank/results/UKBB_bgenfilepath.txt
sumstats_path=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/tinnitus/*.snp_stats.gz
extract_job_size=10
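
Because every later cell relies on the paths configured above, it can help to confirm that they exist before generating any sbatch script. The sketch below is an optional sanity check; `check_inputs` is a hypothetical helper name and is not part of the workflows.

```shell
# Hypothetical helper: report every path that does not exist and
# return non-zero if at least one is missing.
check_inputs() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage with the variables defined above ($bgenFile is left
# unquoted on purpose so that it splits into one path per chromosome):
#   check_inputs "$bfile" "$sampleFile" $bgenFile "$grmFile" "$container_lmm"
```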




## Left ear f.20019

In [3]:
lmm_dir_fastgwa=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/f20019_srt_int_left
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_srt_int_left_imp-fastgwa.sbatch
phenoFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_left_cc
covarFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_left_cc
phenoCol="srt_int_left"
covarCol="sex volume_mean noise_imp music_imp"
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-09-08_srt_int_left_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=26f0b36bd41b77cc) is executed successfully with 1 completed step.



## Right ear f.20021

In [4]:
lmm_dir_fastgwa=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/f20021_srt_int_right
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_srt_int_right_imp-fastgwa.sbatch
phenoFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_right_cc
covarFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_right_cc
phenoCol="srt_int_right"
covarCol="sex volume_mean noise_imp music_imp"
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-09-08_srt_int_right_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=822f898f8ab1642b) is executed successfully with 1 completed step.



## Best ear

In [5]:
lmm_dir_fastgwa=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/srt_int_best
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_srt_int_best_imp-fastgwa.sbatch
phenoFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_best_cc
covarFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_best_cc
phenoCol="srt_int_best"
covarCol="sex volume_mean noise_imp music_imp"
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-09-08_srt_int_best_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=cf9fdd8854c9411d) is executed successfully with 1 completed step.



## Worst ear

In [6]:
lmm_dir_fastgwa=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/srt_int_worst
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_srt_int_worst_imp-fastgwa.sbatch
phenoFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_worst_cc
covarFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_int_worst_cc
phenoCol="srt_int_worst"
covarCol="sex volume_mean noise_imp music_imp"
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-09-08_srt_int_worst_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=4d61a16a2bd5a010) is executed successfully with 1 completed step.



## Dichotomous trait

In [9]:
lmm_dir_fastgwa=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data/srt_cat
lmm_sbatch_fastgwa=../output/$(date +"%Y-%m-%d")_srt_cat_imp-fastgwa.sbatch
phenoFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_cat_cc
covarFile=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment/200904_UKBB_SRT_cat_cc
phenoCol="srt_cat"
covarCol="sex volume_mean noise_imp music_imp"
qCovarCol=age

lmm_args="""fastGWA
    --cwd $lmm_dir_fastgwa 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile_fastgwa 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
    --grmFile $grmFile
    --ylim $ylim
    --container_lmm $container_lmm
    --container_marp $container_marp
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch_fastgwa \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-09-08_srt_cat_imp-fastgwa.sbatch
INFO: Workflow farnam (ID=03372e77a724218c) is executed successfully with 1 completed step.
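
The five trait cells above differ only in the results subdirectory, the sbatch file name, the phenotype file, and the phenotype column. A minimal sketch of how those trait-specific settings could be derived from two inputs is shown below. `trait_settings` is a hypothetical helper, and the `subdir:suffix` pairs simply restate the values used in the cells; the rest of each cell (the `lmm_args` block and the `sos run` call) would stay unchanged.

```shell
# Hypothetical helper: print the trait-specific settings used in the cells above.
# $1 = results subdirectory, $2 = trait suffix (for example int_left or cat)
trait_settings() {
    local subdir=$1 suffix=$2
    local pheno_root=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/hearing_impairment
    local result_root=/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data
    echo "lmm_dir_fastgwa=$result_root/$subdir"
    echo "lmm_sbatch_fastgwa=../output/$(date +%Y-%m-%d)_srt_${suffix}_imp-fastgwa.sbatch"
    echo "phenoFile=$pheno_root/200904_UKBB_SRT_${suffix}_cc"
    echo "phenoCol=srt_${suffix}"
}

# Print the settings for all five traits; each pair is "subdir:suffix".
for pair in f20019_srt_int_left:int_left f20021_srt_int_right:int_right \
            srt_int_best:int_best srt_int_worst:int_worst srt_cat:cat; do
    trait_settings "${pair%%:*}" "${pair#*:}"
    echo
done
```

In every cell `covarFile` points to the same file as `phenoFile`, so it needs no separate entry.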

