# PCA analysis scripts

This notebook uses `Get_Job_Script.ipynb` to generate the sbatch scripts that run on Yale's cluster. The overall goal is to apply [various LMM workflows](https://github.com/statgenetics/UKBB_GWAS_dev/tree/master/workflow) to association analysis of lipid traits (cholesterol, HDL, LDL, triglycerides), followed by clumping and extraction of the associated regions.

## File paths on Yale cluster
- Genotype files (exome data):
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020`
- Genotype files in PLINK format:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv`
- Genotype files in bgen format:
`SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/`
- Summary stats for imputed variants (BOLT-LMM):
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/BOLTLMM_results/results_imputed_data`
- Summary stats for imputed variants (FastGWA):
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/FastGWA_results/results_imputed_data`
- Phenotype files:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_Caucasiansubset_cholesterolfields_adjbymedstatus_062420_foranalysis`
- Relationship file:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620`
- Other traits to be analyzed:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_CAUC_lipidsforanalysis_apolipoproteinAandB,Hba1c_continuousandcategorical,egfrbyCKDEPI,serumcreatinine,UACR_inverseranknorm_110320`
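
Every later cell depends on these locations, so it can help to confirm that they are visible from the current node before generating any job script. The sketch below is one possible way to do that. `check_paths` is a hypothetical helper, not part of the repository, and entries in the list that are file prefixes rather than complete paths would need their directory part or a full file name passed instead.

```shell
# Report which of the given paths exist; returns non-zero if any is missing.
# check_paths is a hypothetical helper, not part of UKBB_GWAS_dev.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "ok       $p"
        else
            echo "MISSING  $p"
            missing=1
        fi
    done
    return "$missing"
}

# Example (directories taken from the list above):
# check_paths \
#     /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020 \
#     /gpfs/gibbs/pi/dewan/data/UKBiobank/results/BOLTLMM_results/results_imputed_data
```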

In [1]:
# Common variables
tpl_file=../farnam.yml
pca_dir=/gpfs/gibbs/pi/dewan/data/UKBiobank/results/pca_exomes
famFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
database=/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020/ukb42495.tab
# Container
container_lmm=/gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
# Pipeline
pca_sos=~/project/UKBB_GWAS_dev/PCA.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_pca_white.sbatch
numThreads=1
job_size=1
#PCA variables
k=10
maxiter=0
topk=10
sigma=6
window=50
shift=5
r2=0.5
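
The `bedfiles` and `bimfiles` assignments rely on bash brace expansion: `{1..22}` expands to one path per autosome, and `echo` joins them into a single space-separated string. Brace expansion does not check that the files exist, and plain `sh` does not support it. As a sketch of the same idea with an explicit loop, the hypothetical helper `per_chrom_files` below builds the list from a prefix and suffix supplied by the caller.

```shell
# Build a space-separated list "<prefix><chr><suffix>" for chromosomes 1..22.
# per_chrom_files is a hypothetical helper for illustration only.
per_chrom_files() {
    prefix=$1
    suffix=$2
    list=""
    for chr in $(seq 1 22); do
        # ${list:+ } adds a separating space only once the list is non-empty
        list="$list${list:+ }$prefix$chr$suffix"
    done
    echo "$list"
}

# Example with the exome file naming used in this notebook:
exome_dir=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020
bedfiles=$(per_chrom_files "$exome_dir/ukb23155_c" "_b0_v1.bed")
bimfiles=$(per_chrom_files "$exome_dir/UKBexomeOQFE_chr" ".bim")

# Both lists should hold the same number of entries, in the same chromosome order.
echo "$bedfiles" | wc -w
echo "$bimfiles" | wc -w
```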




## PCA jobs

In [2]:
pca_args="""filter
    --cwd $pca_dir 
    --bedfiles $bedfiles
    --bimfiles $bimfiles
    --famFile $famFile
    --database $database
    --k $k
    --maxiter $maxiter
    --topk $topk
    --sigma $sigma
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   ../output/2020-12-02_pca_white.sbatch
INFO: Workflow farnam (ID=1ee956dba5381c80) is executed successfully with 1 completed step.
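
Judging by the log, this step only writes the job script; submitting it appears to be a separate action. A minimal sketch of that follow-up, assuming the standard Slurm `sbatch` and `squeue` commands are available on the login node. `submit_if_present` and the `DRY_RUN` switch are hypothetical names introduced here for illustration.

```shell
# Submit a generated job script if it exists and is non-empty.
# Set DRY_RUN=1 to print the command instead of calling sbatch.
# submit_if_present is a hypothetical wrapper, not part of the workflow.
submit_if_present() {
    script=$1
    if [ ! -s "$script" ]; then
        echo "no job script at $script" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sbatch $script"
    else
        sbatch "$script"
    fi
}

# Example: preview the submission of today's script, then inspect the queue.
# DRY_RUN=1 submit_if_present ../output/$(date +"%Y-%m-%d")_pca_white.sbatch
# squeue -u "$USER"
```

Because the script name contains the current date, a script generated on one day will not be found by a command that recomputes the name on another day; passing the path printed in the log avoids that.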



In [None]:
tpl_file=../farnam.yml
pca_dir=/gpfs/gibbs/pi/dewan/data/UKBiobank/results/pca_exomes
famFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
database=/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020/ukb42495.tab
# Container
container_lmm=/gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
# Pipeline
pca_sos=~/project/UKBB_GWAS_dev/PCA.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_pca_white.sbatch
numThreads=1
job_size=1
#PCA variables
k=10
maxiter=5
topk=10
sigma=6
window=50
shift=5
r2=0.5
stand="binom2"
maf_filter=0.01
geno_filter=0.01
mind_filter=0.02

sos run ~/project/UKBB_GWAS_dev/PCA.ipynb smartpca \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --database $database \
    --k $k \
    --stand $stand \
    --maxiter $maxiter \
    --topk $topk \
    --sigma $sigma \
    --window $window \
    --shift $shift \
    --r2 $r2 \
    --maf_filter $maf_filter \
    --geno_filter $geno_filter \
    --mind_filter $mind_filter \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    -s build

In [None]:
    smartpca.perl \
    -i example.geno \
    -a example.snp \
    -b example.ind \
    -k 2 \
    -o example.pca \
    -p example.plot \
    -e example.eval \
    -l example.log \
    -m 5 \
    -t 2 \
    -s 6.0
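
For orientation, the options in this example are EIGENSOFT's `smartpca.perl` flags: `-k` is the number of principal components to output, `-m` the maximum number of outlier-removal iterations, `-t` the number of components along which outliers are removed, and `-s` the number of standard deviations beyond which a sample counts as an outlier. The sketch below assumes that the notebook variables `k`, `maxiter`, `topk` and `sigma` are meant to feed those flags, which the notebook does not state explicitly. `smartpca_cmd` is a hypothetical helper that only prints the command, and its fallback values are illustrative.

```shell
# Print a smartpca.perl command line for a given file prefix.
# Assumption: k, maxiter, topk and sigma map to -k, -m, -t and -s.
smartpca_cmd() {
    prefix=$1
    echo "smartpca.perl" \
        "-i $prefix.geno -a $prefix.snp -b $prefix.ind" \
        "-k ${k:-10} -o $prefix.pca -p $prefix.plot" \
        "-e $prefix.eval -l $prefix.log" \
        "-m ${maxiter:-5} -t ${topk:-10} -s ${sigma:-6}"
}

# Reproduce the example call above from variables.
k=2 maxiter=5 topk=2 sigma=6.0
smartpca_cmd example
```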

In [None]:
par.PACKEDPED.EIGENSTRAT
genotypename:    ukb23155_s200631.filtered.merged.bed
snpname:         ukb23155_s200631.filtered.merged.bim
indivname:       ukb23155_s200631.filtered.merged.fam
outputformat:    EIGENSTRAT
genotypeoutname: ukb23155_s200631.filtered.merged.eigenstratgeno
snpoutname:      ukb23155_s200631.filtered.merged.snp
indivoutname:    ukb23155_s200631.filtered.merged.ind
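
The parameter file above is the input to EIGENSOFT's `convertf`, which is invoked as `convertf -p <parfile>`. A minimal sketch that writes such a file for an arbitrary PLINK prefix follows. `write_convertf_par` is a hypothetical helper, and the input is assumed to be a `.bed`/`.bim`/`.fam` trio sharing that prefix.

```shell
# Write a convertf parameter file that converts <prefix>.bed/.bim/.fam
# to EIGENSTRAT format. write_convertf_par is a hypothetical helper.
write_convertf_par() {
    prefix=$1
    parfile=$2
    cat > "$parfile" <<EOF
genotypename:    $prefix.bed
snpname:         $prefix.bim
indivname:       $prefix.fam
outputformat:    EIGENSTRAT
genotypeoutname: $prefix.eigenstratgeno
snpoutname:      $prefix.snp
indivoutname:    $prefix.ind
EOF
}

# Example:
# write_convertf_par ukb23155_s200631.filtered.merged par.PACKEDPED.EIGENSTRAT
# convertf -p par.PACKEDPED.EIGENSTRAT
```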

In [None]:
#!/bin/bash
#SBATCH --partition general
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 60G
#SBATCH --time 5-0:00:00
#SBATCH --job-name ../output/2020-12-01_pca_white
#SBATCH --output ../output/2020-12-01_pca_white-%J.out
#SBATCH --error ../output/2020-12-01_pca_white-%J.log
module load EIGENSOFT/7.2.1-foss-2018b
smartpca.perl \
    -i ukb23155_s200631.filtered.merged.bed \
    -a ukb23155_s200631.filtered.merged.pedsnp \
    -b ukb23155_s200631.filtered.merged.pedind \
    -o ukb23155_s200631.filtered.merged.pca \
    -p ukb23155_s200631.filtered.merged.plot \
    -e ukb23155_s200631.filtered.eval \
    -l ukb23155_s200631.filtered.merged.log

## Running the PLINK missingness pipeline

In [8]:
tpl_file=../farnam.yml
pca_dir=/gpfs/gibbs/pi/dewan/data/UKBiobank/results/pca_exomes
famFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo /gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
# Container
container_lmm=/gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
container_marp=/gpfs/gibbs/pi/dewan/data/UKBiobank/marp.sif
# Pipeline
plink_sos=~/project/UKBB_GWAS_dev/plink_missing.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_plink_miss.sbatch
numThreads=1
job_size=1


sos dryrun ~/project/UKBB_GWAS_dev/plink_missing.ipynb qc:3 \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    --container_marp $container_marp \
    -s build

INFO: Checking qc_3: Generate analysis report: HTML file, and optionally PPTX file
HINT: singularity exec  /gpfs/gibbs/pi/dewan/data/UKBiobank/marp.sif /bin/sh /gpfs/ysm/project/dewan/dc2325/UKBB_GWAS_dev/analysis/tmpbpe3iw9z/singularity_run_9198.sh
node /opt/marp/.cli/marp-cli.js /home/dc2325/scratch60/pca/ukb23155.merged.md -o /home/dc2325/scratch60/pca/ukb23155.merged.html \
    --title 'Sample and variant missingness UKBB ukb23155.merged' \
    --allow-local-files
node /opt/marp/.cli/marp-cli.js /home/dc2325/scratch60/pca/ukb23155.merged.md -o /home/dc2325/scratch60/pca/ukb23155.merged.pptx \
    --title 'Sample and variant missingness UKBB ukb23155.merged' \
    --allow-local-files 


INFO: qc_3 is completed.
INFO: qc_3 output:   /home/dc2325/scratch60/pca/ukb23155.merged.html
INFO: Workflow qc (ID=ee77ce711d745522) is tested successfully with 1 completed step.
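
The report summarizes per-sample and per-variant missingness. Assuming the workflow relies on PLINK's `--missing` option (the notebook does not show the underlying command), the per-sample results would be in a `.imiss` file whose sixth column, `F_MISS`, is the fraction of missing genotype calls. The sketch below lists samples above a threshold. `high_missing_samples` is a hypothetical helper, the file names in the example are illustrative, and the 0.02 default simply mirrors `mind_filter` from the PCA cell.

```shell
# Print FID, IID and F_MISS for samples whose missing rate exceeds a threshold.
# Expects PLINK 1.9 .imiss columns: FID IID MISS_PHENO N_MISS N_GENO F_MISS
# high_missing_samples is a hypothetical helper for illustration.
high_missing_samples() {
    imiss=$1
    threshold=${2:-0.02}
    # NR > 1 skips the header; "+ 0" forces a numeric comparison
    awk -v t="$threshold" 'NR > 1 && $6 + 0 > t + 0 { print $1, $2, $6 }' "$imiss"
}

# Example (illustrative file names):
# plink --bfile ukb23155.merged --missing --out ukb23155.merged
# high_missing_samples ukb23155.merged.imiss 0.02
```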


## Extracting individuals for a particular SNP with PLINK

In [12]:
tpl_file=../farnam.yml
pca_dir=/home/dc2325/scratch60/plink_extract
famFile=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c12_b0_v1.bed
bimfiles=/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr12.bim
snp_list=/home/dc2325/scratch60/plink_extract/snp.txt
# Container
container_lmm=/gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif
container_marp=/gpfs/gibbs/pi/dewan/data/UKBiobank/marp.sif
# Pipeline
plink_sos=~/project/UKBB_GWAS_dev/plink_extract.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_plink_miss.sbatch
numThreads=1
job_size=1


sos run ~/project/UKBB_GWAS_dev/plink_extract.ipynb  \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --snp_list $snp_list \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    --container_marp $container_marp

INFO: Running default: select individuals and filter specific snps
INFO: default is completed.
INFO: default output:   /home/dc2325/scratch60/plink_extract/ukb23155_c12_b0_v1.extract.raw
INFO: Workflow default (ID=337b795abb011101) is executed successfully with 1 completed step.
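
The `.raw` suffix suggests PLINK's additive recoding (`--recode A`), in which the first six columns are `FID IID PAT MAT SEX PHENOTYPE` and each following column holds the count (0, 1, 2 or `NA`) of the counted allele for one variant. Under that assumption, the sketch below lists the individuals who carry at least one copy at a given variant column. `carriers` is a hypothetical helper and not part of `plink_extract.ipynb`.

```shell
# Print IID and allele count for individuals with a non-zero count in a
# given column of a PLINK .raw file (column 7 is the first variant).
# carriers is a hypothetical helper for illustration.
carriers() {
    raw=$1
    col=${2:-7}
    # Skip the header, ignore missing genotypes, keep counts above zero
    awk -v c="$col" 'NR > 1 && $c != "NA" && $c + 0 > 0 { print $2, $c }' "$raw"
}

# Example:
# carriers /home/dc2325/scratch60/plink_extract/ukb23155_c12_b0_v1.extract.raw 7
```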
