## Analysis of BMI

This notebook applies [various LMM workflows](https://dianacornejo.github.io/pleiotropy_UKB/workflow) to perform association analysis for BMI.

## File paths on Yale cluster

- Genotype files in PLINK format:
`/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv`
- Genotype files in bgen format:
`/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/`
- Summary stats for imputed variants (BOLT-LMM):
`/SAY/dbgapstg/scratch/UKBiobank/results/BOLTLMM_results/results_imputed_data`
- Summary stats for imputed variants (FastGWA):
`/SAY/dbgapstg/scratch/UKBiobank/results/FastGWA_results/results_imputed_data`
- Phenotype files:
`/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis`
- Relationship file:
`/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620`
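
Every workflow below reads from these locations, so it can help to confirm that they are visible from the current node before generating job scripts. The following is a minimal sketch, not part of the original notebook: `report_missing` is a hypothetical helper, and the demo uses a temporary directory as a stand-in so that it runs anywhere. On the cluster you would pass the paths listed above instead.

```shell
# Hypothetical helper (not part of the workflows): print each argument
# that does not exist on disk, one path per line.
report_missing() {
    for path in "$@"; do
        [ -e "$path" ] || echo "$path"
    done
}

# Demo with stand-in files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/present.txt"
missing_paths=$(report_missing "$demo_dir/present.txt" "$demo_dir/absent.txt")
echo "missing: $missing_paths"
rm -rf "$demo_dir"
```

Note that a PLINK prefix such as the first entry is not itself a file; check the `.bed`, `.bim`, and `.fam` files that share that prefix.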

## 07/01/20 analysis

On the cluster, open this notebook in the JupyterLab server you set up via the SSH channel, then run the following cells:

### Bash variables for workflow configurations

In [7]:
# Common variables
tpl_file=../farnam.yml
bfile=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed
sampleFile=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
bgenFile=`echo /SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
unrelated_samples=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
# LMM directories
lmm_dir=/SAY/dbgapstg/scratch/UKBiobank/results/BOLTLMM_results/results_imputed_data/INT-BMI
lmm_sos=../workflow/LMM.ipynb
lmm_sbatch=../output/$(date "+%Y-%m-%d")_INT-BMI-bolt.sbatch
phenoFile=~/project/phenotypes_UKB/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720
## LMM variables 
formatFile=~/project/UKBB_GWAS_DEV/data/boltlmm_template.yml
covarFile=~/project/phenotypes_UKB/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720
LDscoresFile=~/software/BOLT-LMM_v2.3.4/tables/LDSCORE.1000G_EUR.tab.gz
geneticMapFile=~/software/BOLT-LMM_v2.3.4/tables/genetic_map_hg19_withX.txt.gz
phenoCol=INT-BMI
covarCol=SEX
covarMaxLevels=10
qCovarCol=AGE
numThreads=20
bgenMinMAF=0.001
bgenMinINFO=0.8
lmm_job_size=1
# LD clumping directories
clumping_dir=/SAY/dbgapstg/scratch/UKBiobank/results/LD_clumping/INT-BMI
clumping_sos=../workflow/LD_Clumping.ipynb
clumping_sbatch=../output/$(date +"%Y-%m-%d")_INT-BMI_ldclumping.sbatch
bfile_ref=
## LD clumping variables
sumstatsFiles=/SAY/dbgapstg/scratch/UKBiobank/results/BOLTLMM_results/results_imputed_data/INT-BMI/ukb_imp_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.BoltLMM.snp_stats.all_chr.gz
ld_sample_size=1210
clump_field=P_BOLT_LMM
clump_p1=5e-08
clump_p2=1
clump_r2=0.2
clump_kb=2000
clump_annotate=BP
numThreads=20
clump_job_size=1
# Region extraction directories
tpl_file=../farnam.yml
extract_dir=/SAY/dbgapstg/scratch/UKBiobank/results/region_extraction/INT-BMI
extract_sos=../workflow/Region_Extraction.ipynb
extract_sbatch=../output/$(date +"%Y-%m-%d")_INT-BMI-region.sbatch
##
region_file=~/scratch60/plink-clumping/chr7_region/INT-BMI_region.txt
pheno_path=~/project/phenotypes_UKB/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720
geno_path=/SAY/dbgapstg/scratch/UKBiobank/results/UKBB_bgenfilepath.txt
bgen_sample_file=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
sumstats_path=/home/dc2325/project/results/pleiotropy/2020-04_bolt/INT-BMI/ukb_imp_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.BoltLMM.snp_stats.all_chr.gz
unrelated_samples=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
extract_job_size=10
format_config_path=~/project/UKBB_GWAS_DEV/data/boltlmm_template.yml
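
The `bgenFile` assignment above relies on bash brace expansion: `chr{1..22}` is expanded by the shell into 22 space-separated paths before `echo` runs. Unlike a glob, the expansion does not check that the files exist. A small sketch of that behaviour, using an illustrative `/tmp/demo_imputed` prefix (an assumption, not a real data location):

```shell
# Illustrative prefix only; it does not need to exist for brace expansion to work.
demo_prefix=/tmp/demo_imputed/ukb_imp_chr

# Same pattern as the bgenFile assignment: one path per autosome, space-separated.
bgen_demo=$(echo ${demo_prefix}{1..22}_v3.bgen)

# Split the string on whitespace to count and inspect the entries.
set -- $bgen_demo
n_bgen=$#
first_bgen=$1
last_bgen=${bgen_demo##* }

echo "paths: $n_bgen"
echo "first: $first_bgen"
echo "last:  $last_bgen"
```

Brace expansion is a bash feature; a plain POSIX `sh` would pass the braces through literally.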

### BoltLMM job

In [8]:
lmm_args="""boltlmm
    --cwd $lmm_dir 
    --bfile $bfile 
    --sampleFile $sampleFile
    --bgenFile $bgenFile 
    --phenoFile $phenoFile 
    --formatFile $formatFile 
    --covarFile $covarFile 
    --LDscoresFile $LDscoresFile 
    --geneticMapFile $geneticMapFile 
    --phenoCol $phenoCol 
    --covarCol $covarCol 
    --covarMaxLevels $covarMaxLevels 
    --qCovarCol $qCovarCol 
    --numThreads $numThreads 
    --bgenMinMAF $bgenMinMAF 
    --bgenMinINFO $bgenMinINFO 
    --job_size $lmm_job_size
"""

sos run ../workflow/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $lmm_sos \
    --to-script $lmm_sbatch \
    --args "$lmm_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam (index=0) is ignored due to saved signature
INFO: farnam output:   ../output/2020-07-14_INT-BMI-bolt.sbatch
INFO: Workflow farnam (ID=604bd89311794464) is ignored with 1 ignored step.
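
The `"""..."""` wrapper around `lmm_args` looks like a Python triple-quoted string, but bash reads it as an empty string, a double-quoted string, and another empty string joined together. Variables are therefore expanded when the assignment runs, and quoting `"$lmm_args"` passes the whole block to `--args` as a single argument. A minimal sketch with stand-in values (`demo_dir`, `demo_args`, and `count_args` are hypothetical names):

```shell
# Stand-in value; on the cluster this would come from the configuration cell.
demo_dir=/tmp/demo

# Triple-quoted form, as used in the notebook.
demo_args="""boltlmm
    --cwd $demo_dir
    --numThreads 20
"""

# Ordinary double-quoted form; bash treats the two identically.
plain_args="boltlmm
    --cwd $demo_dir
    --numThreads 20
"

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

n_quoted=$(count_args "$demo_args")    # quoted: one argument, newlines preserved
n_unquoted=$(count_args $demo_args)    # unquoted: split on whitespace into words

echo "quoted: $n_quoted, unquoted: $n_unquoted"
```

One consequence is that the configuration cell has to run before the cell that builds the argument string; otherwise the unset variables expand to nothing.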


### LD clumping job

In [9]:
clumping_args="""default 
    --cwd $clumping_dir 
    --bfile $bfile 
    --bgenFile $bgenFile
    --bfile_ref $bfile_ref
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

sos run ../workflow/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-07-14_INT-BMI_ldclumping.sbatch
INFO: Workflow farnam (ID=c64fee76decf1496) is executed successfully with 1 completed step.
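
The `clump_*` variables share their names with the clumping options of PLINK 1.9. The internals of `LD_Clumping.ipynb` are not shown in this notebook, so the following is only a sketch of the kind of command these values would correspond to if the workflow calls `plink --clump`. The file names `reference_panel`, `sumstats.txt`, and `clumped` are placeholders.

```shell
# Values copied from the configuration cell.
clump_field=P_BOLT_LMM   # column of the summary statistics holding the p-value
clump_p1=5e-08           # significance threshold for index variants
clump_p2=1               # secondary threshold for variants joining a clump
clump_r2=0.2             # minimum LD (r^2) with the index variant
clump_kb=2000            # maximum distance from the index variant, in kb
clump_annotate=BP        # extra column carried into the clumping report

# Placeholder file names, for illustration only.
ref_prefix=reference_panel
sumstats_file=sumstats.txt
out_prefix=clumped

# Assemble the command as a string and print it rather than running it.
plink_cmd="plink --bfile $ref_prefix --clump $sumstats_file"
plink_cmd="$plink_cmd --clump-field $clump_field"
plink_cmd="$plink_cmd --clump-p1 $clump_p1 --clump-p2 $clump_p2"
plink_cmd="$plink_cmd --clump-r2 $clump_r2 --clump-kb $clump_kb"
plink_cmd="$plink_cmd --clump-annotate $clump_annotate --out $out_prefix"
echo "$plink_cmd"
```

With `clump_p2=1`, any variant within 2000 kb and with r² of at least 0.2 can join a clump regardless of its own p-value, which gives wide regions around each genome-wide significant index variant.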


#### LD clumping for two phenotypes

In [2]:
# Set the bash variables 
tpl_file=../farnam.yml
clumping_dir=~/scratch60/plink-clumping
clumping_sos=../workflow/LD_Clumping.ipynb
clumping_sbatch=../output/062420-INT-BMI_asthma-ldclumping.sbatch
##
bfile=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv.bed
bgenFile=`echo /SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb_imp_chr{1..22}_v3.bgen`
sampleFile=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
sumstatsFiles=`echo /home/dc2325/scratch60/plink-clumping/*.sumstats.gz`
unrelated_samples=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
clump_field=P

clumping_args="""default 
    --cwd $clumping_dir 
    --bfile $bfile 
    --bgenFile $bgenFile 
    --sampleFile $sampleFile 
    --sumstatsFiles $sumstatsFiles 
    --unrelated_samples $unrelated_samples 
    --ld_sample_size $ld_sample_size 
    --clump_field $clump_field
    --clump_p1 $clump_p1 
    --clump_p2 $clump_p2 
    --clump_r2 $clump_r2 
    --clump_kb $clump_kb 
    --clump_annotate $clump_annotate 
    --numThreads $numThreads 
    --job_size $clump_job_size
"""

# Run the LD clumping workflow for INT-BMI and asthma

sos run ../workflow/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $clumping_sos \
    --to-script $clumping_sbatch \
    --args "$clumping_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam (index=0) is ignored due to saved signature
INFO: farnam output:   ../output/062420-sos-INT-BMI_asthma-ldclumping.sbatch
INFO: Workflow farnam (ID=a83111c4b422b5b6) is ignored with 1 ignored step.
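
In this cell `sumstatsFiles` is built from a glob, so its content depends on what is in the directory when the cell runs. If nothing matches, bash leaves the pattern as a literal string and passes it along. A hedged sketch of a pre-check, where `count_sumstats` is a hypothetical helper and the file names in the demo are made up:

```shell
# Hypothetical helper: count files ending in .sumstats.gz in a directory.
count_sumstats() {
    n=0
    for f in "$1"/*.sumstats.gz; do
        # When nothing matches, the loop sees the literal pattern; skip it.
        [ -e "$f" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

# Demo: one directory with two made-up files, one empty directory.
full_dir=$(mktemp -d)
empty_dir=$(mktemp -d)
touch "$full_dir/asthma.sumstats.gz" "$full_dir/INT_BMI.sumstats.gz"

n_full=$(count_sumstats "$full_dir")
n_empty=$(count_sumstats "$empty_dir")
echo "matches: $n_full (populated), $n_empty (empty)"

rm -rf "$full_dir" "$empty_dir"
```

For a two-phenotype run, a count other than 2 would be a reason to inspect the directory before generating the job script.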


### Region extract job

If you think it is too messy to define everything upfront, you can also do it right here:

In [4]:
extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $pheno_path
    --geno-path $geno_path
    --bgen-sample-path $bgen_sample_file
    --sumstats-path $sumstats_path
    --format-config-path $format_config_path
    --unrelated-samples $unrelated_samples
"""

sos run ../workflow/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"

INFO: Running farnam: 
INFO: farnam (index=0) is ignored due to saved signature
INFO: farnam output:   ../output/062320-sos-INT-BMI-region.sbatch
INFO: Workflow farnam (ID=4b23d355e0bc61e9) is ignored with 1 ignored step.
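
Because the argument string is expanded when it is assigned, a variable that was never set in the session (for example `format_config_path`, if its defining cell has not been run) silently becomes an empty string, leaving an option with no value. A minimal sketch of a guard, assuming bash; `unset_vars` is a hypothetical helper and the values in the demo are stand-ins:

```shell
# Hypothetical helper: print the names of variables that are unset or empty.
unset_vars() {
    local name
    for name in "$@"; do
        # ${!name-} is bash indirect expansion: the value of the variable named by $name.
        if [ -z "${!name-}" ]; then
            echo "$name"
        fi
    done
}

# Demo with stand-in values; format_config_path is deliberately left unset.
unset format_config_path
extract_dir=/tmp/demo_extract
region_file=/tmp/demo_region.txt

missing_vars=$(unset_vars extract_dir region_file format_config_path)
echo "unset or empty: $missing_vars"
```

Running such a check just before building `extract_args` makes the dependency on earlier cells explicit.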


#### Region extraction for the BMI and asthma LD clumps

In [None]:
tpl_file=../farnam.yml
extract_dir=~/scratch60/region_extract
extract_sos=../workflow/Region_Extraction.ipynb
extract_sbatch=../output/070120-INT-BMI-region.sbatch
format_config_path=~/project/UKBB_GWAS_DEV/data/boltlmm_template.yml

##
region_file=~/scratch60/plink-clumping/asthma.sumstats_INT_BMI.sumstats.clumped_region
pheno_path=/SAY/dbgapstg/scratch/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/normalized_phenotypes/UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720
geno_path=~/scratch60/plink-clumping/chr7_region/bgenfilepath.txt
bgen_sample_file=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/ukb32285_imputedindiv.sample
sumstats_path=/SAY/dbgapstg/scratch/UKBiobank/results/BOLTLMM_results/results_imputed_data/INT-BMI/ukb_imp_v3.UKB_caucasians_BMIwaisthip_AsthmaAndT2D_INT-BMI_withagesex_041720.BoltLMM.snp_stats.all_chr.gz
unrelated_samples=/SAY/dbgapstg/scratch/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620
extract_job_size=10


extract_args="""default
    --cwd $extract_dir
    --region-file $region_file
    --pheno-path $pheno_path
    --geno-path $geno_path
    --bgen-sample-path $bgen_sample_file
    --sumstats-path $sumstats_path
    --format-config-path $format_config_path
    --unrelated-samples $unrelated_samples
"""

sos run ../workflow/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $extract_sos \
    --to-script $extract_sbatch \
    --args "$extract_args"
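
The cells in this notebook stop at writing the `.sbatch` files. Submission is not shown here; assuming the generated scripts are ordinary Slurm batch scripts, they would be handed to `sbatch`. The sketch below wraps that in a hypothetical `submit_job` helper that defaults to a dry run, so it can be tried without a scheduler:

```shell
# Hypothetical helper: submit a generated script, or just print the command.
# Set DRY_RUN=0 to actually call sbatch (requires Slurm on the host).
submit_job() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "no such script: $script" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sbatch $script"    # dry run: only print the command
    else
        sbatch "$script"         # real submission
    fi
}

# Demo with an empty stand-in file instead of a real job script.
demo_script=$(mktemp)
dry_run_output=$(submit_job "$demo_script")
echo "$dry_run_output"
rm -f "$demo_script"
```

On the cluster, the argument would be one of the script variables defined earlier, such as `$extract_sbatch`.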