# Scripts for PCA analyses

This notebook uses `Get_Job_Script.ipynb` to automatically generate the sbatch scripts to run on Yale's or Columbia's cluster.

The generated scripts run:

1. PCA analysis
2. Detect missingness in plink files
3. Extract SNPs/Individuals using Plink
4. Run regenie burden MWE

## File paths on Yale cluster
- Genotype files exome data:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/ukb28374_exomedata/exome_data_OCT2020`
- Genotype files in PLINK format:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/UKB_Caucasians_phenotypeindepqc120319_updated020720removedwithdrawnindiv`
- Genotype files in bgen format:
`SAY/dbgapstg/scratch/UKBiobank/genotype_files/ukb39554_imputeddataset/`
- Summary stats for imputed variants BOLT-LMM:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/BOLTLMM_results/results_imputed_data`
- Summary stats for imputed variants FastGWA:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/FastGWA_results/results_imputed_data`
- Phenotype files:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_Caucasiansubset_cholesterolfields_adjbymedstatus_062420_foranalysis`
- Relationship file:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/genotype_files/pleiotropy_geneticfiles/unrelated_n307259/UKB_unrelatedcauc_phenotypes_asthmat2dbmiwaisthip_agesex_waisthipratio_040620`
- Other traits to be analyzed:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/phenotype_files/pleiotropy_R01/phenotypesforanalysis/UKB_CAUC_lipidsforanalysis_apolipoproteinAandB,Hba1c_continuousandcategorical,egfrbyCKDEPI,serumcreatinine,UACR_inverseranknorm_110320`
- PCA results for expanded white:
`/gpfs/gibbs/pi/dewan/data/UKBiobank/results/070921_pca_genotype_array`

## Yale's variables

In [None]:
# Common variables Yale's cluster
UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/farnam.yml
# Working directory for PCA made from exome data
pca_dir=$UKBB_PATH/results/pca_exomes
#Working directory for PCA made from genotype array data
cwd=$UKBB_PATH/results/070921_pca_genotype_array
#Use the original bed files for the genotype array for kinship calculation
bfile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
# Container lmm 
container_lmm=$UKBB_PATH/lmm.sif
# Use a subset of the exomed markers
#genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#phenoFile=~/scratch60/pca/cache/ukb23155_s200631.non_white_white_outliers_11971ind.pheno
#database=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020/ukb42495.tab
#ethnia_prefix='non_white_white_outliers_11971ind'

## Columbia's variables

In [None]:
# This is the path to the data transferred from Yale
UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
UKBB_PATH=/mnt/mfs/statgen/UKBiobank
USER_PATH=$HOME/project
OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
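
Rather than commenting and uncommenting the Yale and Columbia blocks by hand, the cluster-specific variables could be selected with a single switch. The sketch below is only an illustration: the function name `set_cluster_vars` and the labels `yale` and `columbia` are hypothetical, and the paths are copied from the two cells above.

```shell
# Hypothetical helper: set the cluster-specific variables from one switch.
# Paths are copied from the "Yale's variables" and "Columbia's variables" cells.
set_cluster_vars() {
    USER_PATH=$HOME/project
    OUT_PATH=$USER_PATH/UKBB_GWAS_dev/output
    case "$1" in
        yale)
            UKBB_PATH=/gpfs/gibbs/pi/dewan/data/UKBiobank
            tpl_file=$USER_PATH/bioworkflows/admin/farnam.yml
            ;;
        columbia)
            UKBB_yale=/mnt/mfs/statgen/archive/UKBiobank_Yale_transfer
            UKBB_PATH=/mnt/mfs/statgen/UKBiobank
            tpl_file=$USER_PATH/bioworkflows/admin/csg.yml
            ;;
        *)
            echo "unknown cluster: $1" >&2
            return 1
            ;;
    esac
}

set_cluster_vars columbia
echo "$tpl_file"
```

The per-step variables (`cwd`, `genoFile`, `keep_samples`, and so on) would still need their own Yale and Columbia values; this only covers the common ones.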

## General variables

In [1]:
# Pipeline
pca_sos=$USER_PATH/bioworkflows/GWAS/PCA.ipynb
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
numThreads=1
job_size=1
#PCA variables change according to your analyses
k=10
maxiter=0
topk=10
sigma=6
window=50
shift=10
r2=0.1




In [None]:
# Name of bash script
#pca_sbatch=../output/$(date +"%Y-%m-%d")_pca_non_white.sbatch
#pca_sbatch=../output/$(date +"%Y-%m-%d")_flashpca_non_white_whiteoutliers.sbatch

## PCA jobs

### Full sample exome data UKBB

## 1. Do QC_1 on the genotype file (genotype array) that includes all samples

In [4]:
# Yale's cluster vars
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21
# Original bfile containing all of the samples Yale's cluster
#genoFile=$UKBB_PATH/genotype_files/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep the samples of white individuals only
#keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind

#Columbia's cluster
cwd=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21
# Original bfile containing all of the samples Columbia's cluster
genoFile=$UKBB_yale/pleiotropy_geneticfiles/UKB_originalgenotypefilesdownloaded083019/UKB_genotypedatadownloaded083019.bed
#To keep the samples of white individuals only
keep_samples=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind

maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
mind_filter=0.1
mem='30G'

gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
gwasqc_sbatch=$USER_PATH/UKBB_GWAS_dev/output/$(date +"%Y-%m-%d")_gwasqc1_originalbed.sbatch

gwasqc1_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb dewan \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwasqc_sbatch \
    --args "$gwasqc1_args"

INFO: Running dewan: Configuration for Yale `pi_dewan` partition cluster
INFO: dewan is completed.
INFO: dewan output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-14_gwasqc1_originalbed.sbatch
INFO: Workflow dewan (ID=w138e6a004ca3f3fd) is executed successfully with 1 completed step.
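
The same `sos run ... Get_Job_Script.ipynb` call is repeated for every step in this notebook, with only the queue name, workflow file, output script and arguments changing. A minimal sketch of a wrapper is shown below. The function name `make_job_script` and the `DRY_RUN` variable are hypothetical; the `--template-file`, `--workflow-file`, `--to-script` and `--args` options are the ones already used in the cell above.

```shell
# Hypothetical wrapper around the repeated Get_Job_Script.ipynb call.
# With DRY_RUN=1 it only prints the command instead of running sos.
make_job_script() {
    queue=$1      # e.g. dewan or farnam
    workflow=$2   # e.g. $gwasqc_sos
    script=$3     # e.g. $gwasqc_sbatch
    args=$4       # e.g. "$gwasqc1_args"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf 'sos run %s %s --template-file %s --workflow-file %s --to-script %s --args "%s"\n' \
            "$HOME/project/bioworkflows/admin/Get_Job_Script.ipynb" \
            "$queue" "$tpl_file" "$workflow" "$script" "$args"
    else
        sos run "$HOME/project/bioworkflows/admin/Get_Job_Script.ipynb" "$queue" \
            --template-file "$tpl_file" \
            --workflow-file "$workflow" \
            --to-script "$script" \
            --args "$args"
    fi
}

# Example: preview the step 1 command without generating anything
# DRY_RUN=1 make_job_script dewan "$gwasqc_sos" "$gwasqc_sbatch" "$gwasqc1_args"
```

The function reads `tpl_file` from the environment, so the cluster variables still have to be set first.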



## 2. Run king:

Estimate the relatedness among the exomed individuals.

In this case, use the subset of white individuals first (file `030821_ukb42495_exomed_white_189010ind`).

In [2]:
##Yale's variables
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
##Use the qc'ed version of the genotype data
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed

##Columbia's variables
cwd=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed

king_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_king_extendedwhite.sbatch
kinship=0.0625
gwasqc_sos=$USER_PATH/bioworkflows/GWAS/GWAS_QC.ipynb
numThreads=20
mem='30G'
walltime='24h'

king_args="""king
    --cwd $cwd
    --genoFile $genoFile
    --kinship $kinship
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --walltime $walltime
    --no-maximize-unrelated
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $king_sbatch \
    --args "$king_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-21_flashpca_king_extendedwhite.sbatch
INFO: Workflow farnam (ID=wb21b557b74272734) is executed successfully with 1 completed step.
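
The `kinship=0.0625` threshold used above equals 1/16, which is the expected kinship coefficient for third-degree relatives such as first cousins (the expected value halves with each degree: 1/4, 1/8, 1/16). The small sketch below just reproduces that arithmetic; the helper name `expected_kinship` is hypothetical, and how the `king` workflow step applies the threshold is not shown in this notebook.

```shell
# Hypothetical helper: expected kinship coefficient for relatives of a given
# degree (1 = parent/offspring or full siblings, 2 = e.g. half siblings,
# 3 = e.g. first cousins), using 1 / 2^(degree + 1).
expected_kinship() {
    awk -v d="$1" 'BEGIN { printf "%g\n", 1 / (2 ^ (d + 1)) }'
}

for degree in 1 2 3; do
    echo "degree $degree: $(expected_kinship "$degree")"
done
```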



## 2.1 Merge all of the exome bed files

This step is only necessary when working with the exome data. It uses the UKBB exome data before any QC.

In [4]:
## Yale's cluster
#pca_dir=$UKBB_PATH/results/pca_exomes/merged_exomes
#genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`

## Columbia's cluster
pca_dir=$UKBB_yale/results/pca_exomes/merged_exomes
genoFile=`echo $UKBB_yale/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`

gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_merged_exomes.sbatch
numThreads=20
mem='60G'
merged_prefix='ukb23155_all_merged'

gwasqc_args="""merge_plink
    --cwd $pca_dir
    --genoFile $genoFile
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-23_merged_exomes.sbatch
INFO: Workflow farnam (ID=w01c904d3a0a199dc) is executed successfully with 1 completed step.
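
Merging 22 per-chromosome filesets only works if every `.bed` file has its `.bim` and `.fam` companions, so it can be worth checking that before submitting the job. The sketch below is one possible pre-flight check; the function name `check_plink_filesets` is hypothetical and it is not part of `GWAS_QC.ipynb`.

```shell
# Hypothetical pre-flight check: every .bed file passed in must have its
# .bim and .fam companions next to it. Prints missing files; returns 1 if any.
check_plink_filesets() {
    status=0
    for bed in "$@"; do
        base=${bed%.bed}
        for ext in bed bim fam; do
            if [ ! -f "$base.$ext" ]; then
                echo "missing: $base.$ext"
                status=1
            fi
        done
    done
    return $status
}

# Usage with the notebook's variable (left unquoted on purpose so that the
# space-separated list of .bed paths is split into separate arguments):
# check_plink_filesets $genoFile
```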



## 3. QC the exome data for PCA calculations 

This time I'll use the merged exomes with no QC (directly downloaded from UKBB), because the mind filter is hard to apply to individual chromosomes.
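
To illustrate why a per-chromosome mind filter is awkward: a sample can exceed the missingness threshold on one small chromosome while being well below it across all chromosomes. The sketch below sums per-sample counts over several plink 1.9 `--missing` sample reports (`.imiss`, columns `FID IID MISS_PHENO N_MISS N_GENO F_MISS`) and lists only the samples that exceed the threshold overall. The function name `samples_over_mind` and the idea of producing per-chromosome `.imiss` files are assumptions for illustration; the workflow here avoids the issue by merging first.

```shell
# Hypothetical helper: combine per-chromosome plink `--missing` sample reports
# and print the samples whose overall missing rate exceeds the threshold.
# Usage: samples_over_mind THRESHOLD file1.imiss [file2.imiss ...]
samples_over_mind() {
    threshold=$1
    shift
    awk -v t="$threshold" '
        FNR == 1 { next }    # skip the header line of each file
        { miss[$1 " " $2] += $4; geno[$1 " " $2] += $5 }
        END {
            for (id in miss)
                if (geno[id] > 0 && miss[id] / geno[id] > t)
                    print id, miss[id] / geno[id]
        }
    ' "$@" | sort
}
```

Each output line holds the FID, the IID and the combined missing rate for one sample above the threshold.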

In [2]:
## Yale's cluster
#pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_29_21_merged_exomes
## Use non qc'ed exomes files but merged since the beginning
#genoFile=$UKBB_PATH/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
#To keep the samples of white individuals only
keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
remove_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

## Columbia's cluster
pca_dir=$UKBB_yale/results/pca_exomes/white_expanded_06_29_21_merged_exomes
genoFile=$UKBB_yale/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
keep_samples=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
remove_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

#GWAS QC variables
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
#In this case I do want to remove individuals with more than 10% missing data
mind_filter=0.1
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated_qc_merged_exomes.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_unrelated_noqc_merged_exomes'

gwasqc_args="""qc
    --cwd $pca_dir
    --genoFile $genoFile
    --keep_samples $keep_samples
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-29_flashpca_white_unrelated_qc_merged_exomes.sbatch
INFO: Workflow farnam (ID=w3a3fa030c00ef69a) is executed successfully with 1 completed step.



## 3.1 QC the genotype data we want to use for the PCA calculation on unrelated individuals

Ideal: In the case of the UKBB exome data, we would like to use the genotypes after pVCF-QC for every chromosome.

Trial run: Use the exome data without QC filters, as released by the UKBB.

In [2]:
pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21
#pca_dir=~/scratch60/pca/white_expanded_06_14_21
## Use non qc'ed exomes files
genoFile=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
#To keep the samples of white individuals only
keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind
remove_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#GWAS QC variables
maf_filter=0.01
geno_filter=0.01
hwe_filter=5e-08
## Do not use mind filter this time since it will remove samples based on each chromosome
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated_qc.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_unrelated_merged'

gwasqc_args="""qc
    --cwd $pca_dir
    --genoFile $genoFile
    --keep_samples $keep_samples
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-23_flashpca_white_unrelated_qc.sbatch
INFO: Workflow farnam (ID=w23511b402aa82ba0) is executed successfully with 1 completed step.



## 3.2 Remove related individuals and do LD pruning for genotype array and further PCA calculation

After the meeting on 06/30/21, the group decided that we should use the PCs calculated from the genotype array, so this new analysis reflects that decision.

In [6]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
## Use the qc version of the genotype array with the already filtered 189010 white individuals
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
#To remove the related individuals identified by KING
#remove_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
## Use the qc version of the genotype array with the already filtered 189010 white individuals
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
#To remove the related individuals identified by KING
remove_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id

#GWAS QC variables: set all the filters to 0 so there's no more filtering of the already filtered data
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
#LD pruning variables
window=50
shift=10
r2=0.1
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_gwas_qc_white_expanded_unrelated_genoarray.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_qc_unrelated_genoarray'

gwasqc_args="""qc
    --cwd $cwd
    --genoFile $genoFile
    --remove_samples $remove_samples
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --window $window
    --shift $shift
    --r2 $r2
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-30_gwas_qc_white_expanded_unrelated_genoarray.sbatch
INFO: Workflow farnam (ID=wf132c3ecc6bb7885) is executed successfully with 1 completed step.
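
For orientation, the QC and pruning variables used in these steps have direct counterparts among plink 1.9 options (`--maf`, `--geno`, `--hwe`, `--mind`, `--indep-pairwise`). The sketch below shows one way the values could be turned into flags. It assumes that a value of 0 means "skip this filter", which is how this step intends it; plink's own `--geno` and `--mind` thresholds are upper bounds on the missing rate, so passing 0 literally would be very strict. The helper names are hypothetical and `GWAS_QC.ipynb` may build its command differently.

```shell
# Hypothetical helpers: translate the notebook's QC variables into plink 1.9
# flags, skipping any filter whose value is 0 (assumed to mean "disabled").
is_zero() {
    awk -v x="$1" 'BEGIN { exit !(x + 0 == 0) }'
}

plink_qc_flags() {
    maf=$1; geno=$2; hwe=$3; mind=$4
    flags=""
    is_zero "$maf"  || flags="$flags --maf $maf"
    is_zero "$geno" || flags="$flags --geno $geno"
    is_zero "$hwe"  || flags="$flags --hwe $hwe"
    is_zero "$mind" || flags="$flags --mind $mind"
    echo "${flags# }"
}

plink_prune_flags() {
    # window size (in variants), step size, and r^2 threshold
    echo "--indep-pairwise $1 $2 $3"
}

plink_qc_flags 0.01 0.01 5e-08 0.1
plink_prune_flags 50 10 0.1
```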



### Do the merge_plink independently since it's not working in the nested workflow

In [7]:
## Yale's cluster
#genoFile=`echo $UKBB_PATH/results/pca_exomes/white_expanded_06_14_21/cache/ukb23155_c{1..22}_b0_v1.white_expanded_06_14_21.filtered.prune.bed`
#pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21_merged_exomes

## Columbia's cluster
genoFile=`echo $UKBB_yale/results/pca_exomes/white_expanded_06_14_21/cache/ukb23155_c{1..22}_b0_v1.white_expanded_06_14_21.filtered.prune.bed`
pca_dir=$UKBB_yale/results/pca_exomes/white_expanded_06_14_21_merged_exomes

gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_merged_unrelated_pruned.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_unrelated_pruned_merged'

gwasqc_args="""merge_plink
    --cwd $pca_dir
    --genoFile $genoFile
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
    --merged_prefix $merged_prefix
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-23merged_unrelated_pruned.sbatch
INFO: Workflow farnam (ID=w008b80bc19edb218) is executed successfully with 1 completed step.
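
If the merge ever has to be done by hand outside the workflow, plink 1.9 can merge several filesets through `--merge-list`, which accepts a text file with one fileset prefix per line. The sketch below only builds that list from `.bed` paths; the function name `write_merge_list` and the file names in the comments are hypothetical, and the `merge_plink` step in `GWAS_QC.ipynb` may do this differently.

```shell
# Hypothetical helper: print one fileset prefix per line (the path without
# the .bed extension), a format plink 1.9 accepts for --merge-list.
write_merge_list() {
    for bed in "$@"; do
        echo "${bed%.bed}"
    done
}

# Example (not run here): the first fileset goes to --bfile, the rest to the list
# write_merge_list chr2.bed chr3.bed > merge_list.txt
# plink --bfile chr1 --merge-list merge_list.txt --make-bed --out merged
```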



## 4. Get bed file for related individuals for exome data

Extracting the related individuals from the unmerged (per-chromosome) exome data is problematic, which is why the data were merged first; the related individuals can then be extracted from the merged file.

In [None]:
##Yale's cluster
#pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_related_06_29_21
## Use non qc'ed exomes files
#genoFile=$UKBB_PATH/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=~/scratch60/pca/white_expanded_06_29_21_merged_exomes

##Columbia's cluster
pca_dir=$UKBB_yale/results/pca_exomes/white_expanded_related_06_29_21
## Use non qc'ed exomes files
genoFile=$UKBB_yale/results/pca_exomes/merged_exomes/ukb23155_all_merged.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
keep_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
keep_variants=~/scratch60/pca/white_expanded_06_29_21_merged_exomes

#GWAS QC variables
maf_filter=0.0
geno_filter=0.0
hwe_filter=0.0
mind_filter=0.0
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_qc_merged_exomes.sbatch
numThreads=20
mem='30G'

gwasqc_args="""qc:1
    --cwd $pca_dir
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

## 4.1 Get bed file for related individuals genotype data

In [12]:
##Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
#keep_samples=$UKBB_PATH/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
#keep_variants=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray
## Use qc'ed genotype array
genoFile=$UKBB_yale/results/070921_pca_genotype_array/plinkqc_05_28_21/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.bed
## phenoFile=$UKBB_yale/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno
keep_samples=$UKBB_yale/results/070921_pca_genotype_array/king_05_28_21/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.king_05_28_21.related_id
#Keep the same variants as above
keep_variants=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.in

#GWAS QC variables
maf_filter=0
geno_filter=0
hwe_filter=0
mind_filter=0
gwas_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_qc_genoarray.sbatch
numThreads=20
mem='30G'
merged_prefix='ukb23155_qc_related_genoarray'

gwasqc_args="""qc:1
    --cwd $cwd
    --genoFile $genoFile
    --keep_samples $keep_samples
    --keep_variants $keep_variants
    --maf_filter $maf_filter
    --geno_filter $geno_filter
    --hwe_filter $hwe_filter
    --mind_filter $mind_filter
    --merged_prefix $merged_prefix
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
    --mem $mem
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $gwasqc_sos \
    --to-script $gwas_sbatch \
    --args "$gwasqc_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-30_flashpca_white_related_qc_genoarray.sbatch
INFO: Workflow farnam (ID=w3038c5cdfd98262a) is executed successfully with 1 completed step.
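
Because this step depends on the `.related_id` list passed as `keep_samples`, a quick format check before submitting can save a failed job. plink's `--keep` and `--remove` expect the family ID and individual ID in the first two columns, so the sketch below flags any line with a single column. The function name `validate_sample_list` is hypothetical, and it assumes the workflow hands the file to plink unchanged.

```shell
# Hypothetical check: a sample list for plink --keep/--remove needs at least
# two whitespace-separated columns (FID and IID) on every non-empty line.
validate_sample_list() {
    awk 'NF > 0 && NF < 2 { printf "line %d has %d column(s)\n", NR, NF; bad = 1 }
         END { exit bad + 0 }' "$1"
}

# Usage with the notebook's variable:
# validate_sample_list "$keep_samples" && echo "sample list looks fine"
```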



## 5. Run PCA analysis for unrelated expanded white individuals with merged exome data

In [2]:
pca_dir=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21
#This is the bfile originated after filtering unrelated individuals
genoFile=$UKBB_PATH/results/pca_exomes/white_expanded_06_14_21_merged_exomes/ukb23155_unrelated_pruned_merged.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
label_col=pop
pop_col=pop
pops=extended_white
k=10
maha_k=5
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated.sbatch
min_axis=
max_axis=

pca_args="""flashpca
    --cwd $pca_dir
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-24_flashpca_white_unrelated.sbatch
INFO: Workflow farnam (ID=w3542743200dac65e) is executed successfully with 1 completed step.



## 5.1 Run PCA analysis for unrelated expanded white individuals with genotype array

In [13]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray
#This is the bfile originated after filtering unrelated individuals and pruning
genoFile=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_06_30_21_genoarray.filtered.prune.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_unrelated_genoarray.sbatch
k=10
maha_k=5
min_axis=
max_axis=

pca_args="""flashpca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-06-30_flashpca_white_unrelated_genoarray.sbatch
INFO: Workflow farnam (ID=w68e7dd36254f1a17) is executed successfully with 1 completed step.



## 6. Project related individuals back onto the genotype array PCA

In [6]:
## Yale's cluster
#cwd=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
#pca_model=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds

## Columbia's cluster
cwd=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
genoFile=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
pca_model=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds

pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_genoarray_projected.sbatch
label_col=pop
pop_col=pop
pops=extended_white
k=10
maha_k=5
prob=0.997
pval=0.05
min_axis=""
max_axis=""
## Set --homogeneous TRUE to treat all the populations as one
homogeneous=TRUE
## For the plot you need to use the *.projected.rds and not the *.projected.mahalanobis.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray_projected.pca.projected.outliers


pca_args="""project_samples
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --pca_model $pca_model
    --k $k
    --maha_k $maha_k
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --prob $prob
    --pval $pval
    --homogeneous $homogeneous
    --min_axis $min_axis
    --max_axis $max_axis
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-09_flashpca_white_related_genoarray_projected.sbatch
INFO: Workflow farnam (ID=w7e5b5c1fbf037041) is executed successfully with 1 completed step.
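
The `maha_k` and `prob` settings above relate to outlier detection with the Mahalanobis distance on the leading PCs. The block below is the textbook definition, written with `maha_k = 5` and `prob = 0.997` as in this cell. It is only a sketch of the usual rule under a multivariate normal assumption; exactly how `PCA.ipynb` estimates the mean and covariance, and how it combines `prob` with `pval`, is not shown in this notebook, so treat the decision rule as an assumption.

```latex
% x: vector of the first maha_k = 5 PC scores of one projected sample
% \mu, \Sigma: mean vector and covariance matrix of the reference PCs (assumed)
D^2(x) = (x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)

% Under multivariate normality, D^2 approximately follows a chi-square
% distribution with maha_k degrees of freedom, so a sample would be
% flagged as an outlier when
D^2(x) > \chi^2_{5,\,0.997}
% i.e. when it exceeds the 0.997 quantile of \chi^2 with 5 degrees of freedom.
```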



## Plot the projected individuals and highlight outliers

In [5]:
##Yale's cluster
#pca_dir=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
#genoFile=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop and replace ethnicity for pop
#phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
#pca_model=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds
#plot_data=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.rds
#outlier_file=$UKBB_PATH/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.outliers

## Columbia's cluster
pca_dir=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected
#This is the bfile originated after filtering related individuals
genoFile=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_related_06_30_21_genoarray/cache/UKB_genotypedatadownloaded083019.plinkqc_05_28_21.filtered.white_expanded_related_06_30_21_genoarray.filtered.extracted.bed
## I had to modify the original file to add a super_pop column and replace the ethnicity column with pop
phenoFile=$UKBB_yale/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030821_ukb42495_exomed_white_189010ind.pheno.new
pca_model=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_06_30_21_genoarray/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_06_30_21_genoarray.extended_white.pca.rds
plot_data=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.rds
outlier_file=$UKBB_yale/results/070921_pca_genotype_array/white_expanded_07_09_21_genoarray_projected/030821_ukb42495_exomed_white_189010ind.pheno.white_expanded_07_09_21_genoarray_projected.pca.projected.outliers

label_col=pop
pop_col=pop
pops=extended_white
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_white_related_genoarray_plot.sbatch

pca_args="""plot_pca
    --cwd $cwd
    --genoFile $genoFile
    --phenoFile $phenoFile
    --label_col $label_col
    --pop_col $pop_col
    --pops $pops
    --plot_data $plot_data
    --outlier_file $outlier_file
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/admin/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-07-09_flashpca_white_related_genoarray_plot.sbatch
INFO: Workflow farnam (ID=w77ea737f258b746f) is executed successfully with 1 completed step.
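
The `pca_args` string in the cell above is built by plain shell interpolation, so a variable that is unset (for example `$cwd`, if it was not defined in an earlier cell) silently produces an option with no value. A minimal sketch of a guard that could be run before composing the arguments follows. `require_vars` is a hypothetical helper written for illustration; it is not part of `Get_Job_Script.ipynb` or SoS.

```shell
# Hypothetical helper: report every named shell variable that is unset or
# empty, and return a non-zero status if any was found.
# The body runs in a subshell so its loop variables do not leak.
require_vars() (
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name-}"
        if [ -z "$value" ]; then
            echo "missing variable: $name" >&2
            missing=1
        fi
    done
    exit "$missing"
)
```

For the cell above, a call such as `require_vars cwd genoFile phenoFile plot_data outlier_file` would list any name that is still empty before `sos run` is invoked.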



## Old run for white population (using old phenotype file)

In [19]:
pca_dir=~/scratch60/pca/white_030121_repeat
phenoFile=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030121_ukb42495_exomed_white_189228ind.pheno
keep_samples=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb43978_OCT2020/dc2325_phenotypes/030121_ukb42495_exomed_white_189228ind
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_filter_white.sbatch
trait_name=ethnicity
numThreads=20

pca_args="""filter
    --cwd $pca_dir
    --bfile $bfile
    --genoFile $genoFile
    --phenoFile $phenoFile
    --keep_samples $keep_samples
    --k $k
    --window $window
    --shift $shift
    --r2 $r2
    --trait_name $trait_name
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-03-01_flashpca_filter_white.sbatch
INFO: Workflow farnam (ID=wdddf2ddca776f9f1) is executed successfully with 1 completed step.



### 1. African ancestry

In [5]:
pca_dir=~/scratch60/pca/african_ancestry
ethnia_prefix='african_3690ind'
phenoFile=~/scratch60/pca/african_ancestry/cache/ukb23155_s200631.african_3690ind.pheno
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_african.sbatch
trait_name=ethnicity

pca_args="""flashpca
    --cwd $pca_dir
    --bfile $bfile
    --genoFile $genoFile
    --phenoFile $phenoFile
    --k $k
    --maxiter $maxiter
    --topk $topk
    --sigma $sigma
    --window $window
    --shift $shift
    --r2 $r2
    --trait_name $trait_name
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-02-22_flashpca_african.sbatch
INFO: Workflow farnam (ID=wd99c56feacb4bc44) is executed successfully with 1 completed step.



### 2. Asian ancestry

In [7]:
pca_dir=~/scratch60/pca/asian_ancestry
ethnia_prefix='asian_4618ind'
phenoFile=~/scratch60/pca/asian_ancestry/cache/ukb23155_s200631.asian_4618ind.pheno
pca_sbatch=$OUT_PATH/$(date +"%Y-%m-%d")_flashpca_asian.sbatch
trait_name=ethnicity

pca_args="""flashpca
    --cwd $pca_dir 
    --genoFile $genoFile
    --famFile $famFile
    --database $database
    --ethnia_prefix $ethnia_prefix
    --select_ethnia $select_ethnia
    --phenoFile $phenoFile
    --k $k
    --maxiter $maxiter
    --topk $topk
    --sigma $sigma
    --window $window
    --shift $shift
    --r2 $r2
    --trait_name $trait_name
    --numThreads $numThreads 
    --job_size $job_size
    --container_lmm $container_lmm
"""

sos run ~/project/bioworkflows/GWAS/Get_Job_Script.ipynb farnam \
    --template-file $tpl_file \
    --workflow-file $pca_sos \
    --to-script $pca_sbatch \
    --args "$pca_args"

INFO: Running farnam: Configuration for Yale `farnam` cluster
INFO: farnam is completed.
INFO: farnam output:   /home/dc2325/project/UKBB_GWAS_dev/output/2021-02-22_flashpca_asian.sbatch
INFO: Workflow farnam (ID=wd32c86713853144c) is executed successfully with 1 completed step.



In [15]:
bfile=$UKBB_PATH/MWE/genotypes21_22.bed
genoFile=$UKBB_PATH/MWE/burden/ukb23155_c2*_b0_v1.plink.exome.filtered.bed
keep_samples=$UKBB_PATH/MWE/burden/unrelated_ind_burden.txt
phenoFile=$UKBB_PATH/MWE/burden/phenotype_burden.txt
kinship=0.05
maf_filter=0.01
geno_filter=0.1
mind_filter=0.2 
hwe_filter=5e-08 
numThreads=2
k=10
trait_name='ASTHMA'
sos run ~/project/bioworkflows/GWAS/PCA.ipynb flashpca:1 \
    --cwd $pca_dir \
    --bfile $bfile \
    --genoFile $genoFile \
    --keep_samples $keep_samples \
    --kinship $kinship \
    --phenoFile $phenoFile \
    --window $window \
    --shift $shift \
    --r2 $r2 \
    --maf_filter $maf_filter \
    --geno_filter $geno_filter \
    --mind_filter $mind_filter \
    --hwe_filter $hwe_filter \
    --k $k \
    --trait_name $trait_name \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm

INFO: Running flashpca_1: Run PCA analysis using flashpca
ERROR: flashpca_1 (id=40e0bfbf8b836d3a) returns an error.
ERROR: [flashpca_1]: [0]: 
Failed to execute Rscript /home/dc2325/.sos/af137bdf2659ed53/flashpca_1_0_fe1199ce.R
exitcode=1, workdir=/gpfs/ysm/project/dewan/dc2325/UKBB_GWAS_dev/analysis/cluster_scripts, stdout=/home/dc2325/scratch60/pca/phenotype_burden.filtered.merged.prune.stdout, stderr=/home/dc2325/scratch60/pca/phenotype_burden.filtered.merged.prune.stderr
---------------------------------------------------------------------------
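
The error above names the stdout and stderr files written by the failing step, and those files are the first place to look for the underlying R error. A minimal sketch for viewing the end of such a log follows. `show_log_tail` is a hypothetical helper, and the default of 20 lines is an arbitrary choice.

```shell
# Hypothetical helper: print the last lines of a log file, or complain
# if the file does not exist or is empty.
show_log_tail() {
    log_file=$1
    n_lines=${2:-20}   # number of trailing lines to show (assumed default)
    if [ ! -s "$log_file" ]; then
        echo "log file is missing or empty: $log_file" >&2
        return 1
    fi
    tail -n "$n_lines" "$log_file"
}
```

With the paths reported above, this could be called as `show_log_tail ~/scratch60/pca/phenotype_burden.filtered.merged.prune.stderr`.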



In [None]:
tpl_file=../farnam.yml
pca_dir=$UKBB_PATH/results/pca_exomes
famFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
database=$UKBB_PATH/phenotype_files/pleiotropy_R01/ukb42495_updatedJune2020/ukb42495.tab
# Container
container_lmm=$UKBB_PATH/lmm.sif
# Pipeline
pca_sos=~/project/UKBB_GWAS_dev/PCA.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_pca_white.sbatch
numThreads=1
job_size=1
#PCA variables
k=10
maxiter=5
topk=10
sigma=6
window=50
shift=5
r2=0.5
stand="binom2"
maf_filter=0.01
geno_filter=0.01
mind_filter=0.02

sos run ~/project/UKBB_GWAS_dev/PCA.ipynb smartpca \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --database $database \
    --k $k \
    --stand $stand \
    --maxiter $maxiter \
    --topk $topk \
    --sigma $sigma \
    --window $window \
    --shift $shift \
    --r2 $r2 \
    --maf_filter $maf_filter \
    --geno_filter $geno_filter \
    --mind_filter $mind_filter \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    -s build

In [None]:
    smartpca.perl \
    -i example.geno \
    -a example.snp \
    -b example.ind \
    -k 2 \
    -o example.pca \
    -p example.plot \
    -e example.eval \
    -l example.log \
    -m 5 \
    -t 2 \
    -s 6.0

In [None]:
par.PACKEDPED.EIGENSTRAT
genotypename:    ukb23155_s200631.filtered.merged.bed
snpname:         ukb23155_s200631.filtered.merged.bim
indivname:       ukb23155_s200631.filtered.merged.fam
outputformat:    EIGENSTRAT
genotypeoutname: ukb23155_s200631.filtered.merged.eigenstratgeno
snpoutname:      ukb23155_s200631.filtered.merged.snp
indivoutname:    ukb23155_s200631.filtered.merged.ind
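
The parameter file above repeats the same file prefix on six of its seven lines, so it can be generated from the prefix rather than edited by hand. The sketch below assumes exactly the keys shown above; `write_convertf_par` is a hypothetical helper name. A file of this form is what EIGENSOFT's `convertf` reads through its `-p` option.

```shell
# Hypothetical helper: write a PACKEDPED -> EIGENSTRAT parameter file
# for a given PLINK file prefix.
write_convertf_par() {
    prefix=$1     # e.g. ukb23155_s200631.filtered.merged
    par_file=$2   # e.g. par.PACKEDPED.EIGENSTRAT
    cat > "$par_file" <<EOF
genotypename:    ${prefix}.bed
snpname:         ${prefix}.bim
indivname:       ${prefix}.fam
outputformat:    EIGENSTRAT
genotypeoutname: ${prefix}.eigenstratgeno
snpoutname:      ${prefix}.snp
indivoutname:    ${prefix}.ind
EOF
}
```

After `write_convertf_par ukb23155_s200631.filtered.merged par.PACKEDPED.EIGENSTRAT`, the conversion itself would be run as `convertf -p par.PACKEDPED.EIGENSTRAT`.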

In [None]:
#!/bin/bash
#SBATCH --partition general
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 60G
#SBATCH --time 5-0:00:00
#SBATCH --job-name ../output/2020-12-01_pca_white
#SBATCH --output ../output/2020-12-01_pca_white-%J.out
#SBATCH --error ../output/2020-12-01_pca_white-%J.log
module load EIGENSOFT/7.2.1-foss-2018b
smartpca.perl \
    -i ukb23155_s200631.filtered.merged.bed \
    -a ukb23155_s200631.filtered.merged.pedsnp \
    -b ukb23155_s200631.filtered.merged.pedind \
    -o ukb23155_s200631.filtered.merged.pca \
    -p ukb23155_s200631.filtered.merged.plot \
    -e ukb23155_s200631.filtered.eval \
    -l ukb23155_s200631.filtered.merged.log

# Running the PLINK missingness pipeline

In [36]:
tpl_file=../farnam.yml
pca_dir=$UKBB_PATH/results/pca_exomes
famFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c{1..22}_b0_v1.bed`
bimfiles=`echo $UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr{1..22}.bim`
# Container
container_lmm=$UKBB_PATH/lmm.sif
container_marp=$UKBB_PATH/marp.sif
# Pipeline
plink_sos=~/project/UKBB_GWAS_dev/workflow/plink_missing.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_plink_miss.sbatch
numThreads=1
job_size=1


sos run ~/project/UKBB_GWAS_dev/workflow/plink_missing.ipynb missing \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    --container_marp $container_marp \
    -s build

INFO: Running missing_1: Genotype and sample missingness for exome files
INFO: Step missing_1 (index=0) is ignored with signature constructed
INFO: Step missing_1 (index=1) is ignored with signature constructed
INFO: Step missing_1 (index=2) is ignored with signature constructed
INFO: Step missing_1 (index=3) is ignored with signature constructed
INFO: Step missing_1 (index=4) is ignored with signature constructed
INFO: Step missing_1 (index=5) is ignored with signature constructed
INFO: Step missing_1 (index=6) is ignored with signature constructed
INFO: Step missing_1 (index=7) is ignored with signature constructed
INFO: Step missing_1 (index=8) is ignored with signature constructed
INFO: Step missing_1 (index=9) is ignored with signature constructed
INFO: Step missing_1 (index=10) is

## Extracting individuals for a particular SNP using PLINK

In [37]:
tpl_file=../farnam.yml
pca_dir=/home/dc2325/scratch60/plink_extract
famFile=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_s200631.fam
bedfiles=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/ukb23155_c12_b0_v1.bed
bimfiles=$UKBB_PATH/genotype_files/ukb28374_exomedata/exome_data_OCT2020/UKBexomeOQFE_chr12.bim
snp_list=/home/dc2325/scratch60/plink_extract/snp.txt
# Container
container_lmm=$UKBB_PATH/lmm.sif
container_marp=$UKBB_PATH/marp.sif
# Pipeline
plink_sos=~/project/UKBB_GWAS_dev/plink_extract.ipynb
# Name of bash script
pca_sbatch=../output/$(date +"%Y-%m-%d")_plink_miss.sbatch
numThreads=1
job_size=1


sos run ~/project/UKBB_GWAS_dev/plink_extract.ipynb  \
    --cwd $pca_dir \
    --bedfiles $bedfiles \
    --bimfiles $bimfiles \
    --famFile $famFile \
    --snp_list $snp_list \
    --numThreads $numThreads \
    --job_size $job_size \
    --container_lmm $container_lmm \
    --container_marp $container_marp

ERROR: Failed to locate /home/dc2325/project/UKBB_GWAS_dev/plink_extract.ipynb.sos

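SoS reports `Failed to locate` when the path passed to `sos run` does not point to an existing workflow file. A minimal sketch for searching a project directory for a notebook by name follows. `locate_workflow` is a hypothetical helper, and the choice of search root is an assumption about where the notebook might live.

```shell
# Hypothetical helper: list every file with the given name under a
# directory tree, one path per line (prints nothing if none is found).
locate_workflow() {
    search_root=$1
    notebook_name=$2
    find "$search_root" -type f -name "$notebook_name" 2>/dev/null
}
```

For the cell above, `locate_workflow ~/project/UKBB_GWAS_dev plink_extract.ipynb` would show whether the notebook exists somewhere else in the repository, so that `plink_sos` and the `sos run` path can be updated accordingly.
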


## Regenie burden

In [23]:
genoFile=`echo $UKBB_PATH/MWE/burden/ukb23155_c{21..22}_b0_v1.plink.exome.filtered.bed`
sos dryrun ~/project/bioworkflows/GWAS/LMM.ipynb regenie_burden \
    --cwd output \
    --bfile genotypes21_22.bed \
    --genoFile $genoFile \
    --sampleFile \
    --phenoFile burden/phenotype_burden.txt \
    --phenoCol ASTHMA \
    --covarCol SEX \
    --qCovarCol AGE \
    --numThreads 8 \
    --bsize 10 \
    --anno_file burden/annotation_file.txt \
    --set_list burden/set_list_file.txt \
    --mask_file burden/mask_file.txt \
    --keep_gene burden/keep_file.txt \
    --aaf_bins 0.05 \
    --trait bt \
    --build_mask max \
    --container_lmm $UKBB_PATH/lmm.sif

INFO: Checking regenie_burden: Run regenie for burden tests
HINT: singularity exec  /gpfs/gibbs/pi/dewan/data/UKBiobank/lmm.sif /bin/bash /gpfs/ysm/project/dewan/dc2325/UKBB_GWAS_dev/analysis/tmph24c8d72/singularity_run_193934.sh
set -e
regenie \
  --step 2 \
  --bed /gpfs/gibbs/pi/dewan/data/UKBiobank/MWE/burden/ukb23155_c21_b0_v1.plink.exome.filtered \
  --phenoFile output/phenotype_burden.regenie_phenotype \
  --covarFile output/phenotype_burden.regenie_covar \
  --phenoColList ASTHMA \
  --bt
  --firth --approx \
  --pred output/phenotype_burden_ASTHMA.regenie_pred.list \
  --anno-file burden/annotation_file.txt \
  --set-list burden/set_list_file.txt \
  --extract-sets burden/keep_file.txt\
  --mask-def burden/mask_file.txt \
  --aaf-bins 0.05 \
  --write-mask \
  --build-mask \
  --bsize 10 \
  --check-burden-files \
  --gz \
  --out  output/cache/ukb23155_c21_b0_v1.plink.exome.filtered.burden


INFO: regenie_burden (index=0) is completed.
HINT: singula