# SID Genetics Study Step 5: Developing Genotype

## Objective
The purpose of this notebook is to pull genetic data (primarily dosages) for the genetic analyses in the statin-induced diabetes (SID) study. It also contains code for time-to-event genome-wide association studies (GWAS) using Regenie 4.1. Each GWAS is run twice, once in statin users and once in non-users, so that effect sizes are estimated separately in each treatment group. Heterogeneity between the two groups is then tested with Cochran's Q in the next notebook.
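
As a rough illustration of the heterogeneity test mentioned above, the following is a minimal sketch of Cochran's Q for one variant with two effect estimates (statin users and non-users). The function name and the example numbers are hypothetical and are not taken from the next notebook; the sketch assumes each GWAS reports an effect size and its standard error.

```python
import math

def cochran_q_two_groups(beta_users, se_users, beta_nonusers, se_nonusers):
    """Cochran's Q and its p-value for two effect estimates (1 degree of freedom)."""
    # Inverse-variance weights for each group
    w_users = 1.0 / se_users ** 2
    w_nonusers = 1.0 / se_nonusers ** 2
    # Fixed-effect (inverse-variance weighted) pooled estimate
    pooled = (w_users * beta_users + w_nonusers * beta_nonusers) / (w_users + w_nonusers)
    # Q is the weighted sum of squared deviations from the pooled estimate
    q = (w_users * (beta_users - pooled) ** 2
         + w_nonusers * (beta_nonusers - pooled) ** 2)
    # Chi-square survival function with 1 df: P(X > q) = erfc(sqrt(q / 2))
    p_value = math.erfc(math.sqrt(q / 2.0))
    return q, p_value

# Illustrative values only (not study results)
q_stat, p_het = cochran_q_two_groups(0.2, 0.1, 0.0, 0.1)
print(q_stat, p_het)
```

With only two groups, Q reduces to the squared difference in effect sizes divided by the sum of the squared standard errors.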

Unlike previous notebooks in this study, this notebook is primarily run using Python and bash scripting.

Before running this notebook, run the "1. dsub set up and ReadMe.ipynb" notebook from the "How to use dsub in the Researcher Workbench" featured workspace on AoU. It sets up the helper that this notebook uses to submit dsub jobs.

In [None]:
# Import necessary packages
import sys
import os 
import numpy as np
import pandas as pd
from datetime import datetime

In [None]:
# Get workspace bucket name
my_bucket = os.getenv('WORKSPACE_BUCKET')
my_bucket

In [None]:
# Get username and save it to the environment
USER_NAME = os.getenv('OWNER_EMAIL').split('@')[0].replace('.','-')

# Save this Python variable as an environment variable so that it's easier to use within %%bash cells.
%env USER_NAME={USER_NAME}

In [None]:
# Get the locations of genomic datasets
genomic_location = os.getenv("CDR_STORAGE_PATH")
%env genomic_location = {genomic_location}

# Get the location of short read snps
wgs_plink_path = f'{genomic_location}/wgs/short_read/snpindel'
%env wgs_plink_path = {wgs_plink_path}

acaf_plink_path = f'{wgs_plink_path}/acaf_threshold'
%env acaf_plink_path = {acaf_plink_path}

%env my_bucket = {my_bucket}

# Candidate Gene Study

**Objective**: The purpose of this section is to pull statin on-target candidate variants from AoU's ACAF Threshold callset using PLINK 2.0.
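
The submission cell later in this section groups the candidate variant IDs by chromosome in bash before passing them to PLINK. The same grouping can be sketched in Python as follows. The helper name is hypothetical, and the sketch assumes IDs of the form `chr<chromosome>:<position>:<ref>:<alt>`, as in the list used below.

```python
from collections import defaultdict

def group_variants_by_chromosome(variant_ids):
    """Map a chromosome label (e.g. '5') to a comma-separated string of variant IDs."""
    grouped = defaultdict(list)
    for variant_id in variant_ids:
        # Take the text before the first colon, e.g. 'chr12'
        chrom = variant_id.split(":", 1)[0]
        # Drop the 'chr' prefix when present so keys match the chromosome numbers
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        grouped[chrom].append(variant_id)
    return {chrom: ",".join(ids) for chrom, ids in grouped.items()}

snps = [
    "chr5:75352671:G:T", "chr5:75360714:T:C", "chr6:160589086:A:G",
    "chr12:21178615:T:C", "chr19:44908822:C:T", "chr19:44908684:T:C",
]
for chrom, snp_list in group_variants_by_chromosome(snps).items():
    print(chrom, snp_list)
```

Splitting on the first colon keeps chromosome 1 from matching IDs on chromosome 12 or 19, which is the same reason the bash pattern includes the trailing colon.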

In [None]:
%%writefile candidateSNP_SID.sh
# Script to pull candidate variants from the ACAF Threshold callset
set -o errexit
set -o nounset

if [ -n "${SNP_LIST}" ]; then
    plink2 \
      --bed "${input_file1}" \
      --bim "${input_file2}" \
      --fam "${input_file3}" \
      --snps "${SNP_LIST}" \
      --keep "${ids}" \
      --export A \
      --out "${out_path}/sid_targets_chr${CHROMO}"
else
    echo "No SNPs found for chromosome ${CHROMO}, skipping plink2 command."
fi

In [None]:
%%bash
# Copy script to bucket
gsutil cp candidateSNP_SID.sh "${my_bucket}/data/"

In [None]:
%%bash --out candidate_study_sid
# Submit job using dsub
source ~/aou_dsub.bash

# Define the script path and the candidate variants
BASH_SCRIPT="${my_bucket}/data/candidateSNP_SID.sh"

chromosomes=(5 6 12 19)
snps=("chr5:75352671:G:T" "chr5:75360714:T:C" "chr6:160589086:A:G" "chr12:21178615:T:C" "chr19:44908822:C:T" "chr19:44908684:T:C")

# Loop through each chromosome number
for chromo in "${chromosomes[@]}"; do
  # Filter SNPs for the current chromosome
  filtered_snps=()
  for snp in "${snps[@]}"; do
    if [[ "$snp" == chr${chromo}:* ]]; then
      filtered_snps+=("$snp")
    fi
  done

  # Convert filtered SNPs array to a comma-separated string
  snp_list=$(IFS=,; echo "${filtered_snps[*]}")

  # Run dsub command
  aou_dsub \
    --image us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:2.1.19 \
    --disk-size 1096 \
    --boot-disk-size 200 \
    --logging "${my_bucket}/data/logging" \
    --input input_file1="${wgs_plink_path}/acaf_threshold/plink_bed/chr${chromo}.bed" \
    --input input_file2="${wgs_plink_path}/acaf_threshold/plink_bed/chr${chromo}.bim" \
    --input input_file3="${wgs_plink_path}/acaf_threshold/plink_bed/chr${chromo}.fam" \
    --input ids="${my_bucket}/sid_pheno_files/genomic/itt_ids_v2.txt" \
    --env SNP_LIST="${snp_list}" \
    --env CHROMO="${chromo}" \
    --output-recursive out_path="${my_bucket}/sid_geno_files/candidate/" \
    --script "${BASH_SCRIPT}"
done

In [None]:
%%bash
# View job status summary

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --names "candidatesnp-sid" \
    --users "${USER_NAME}" \
    --status '*' | head -n 6

In [None]:
%%bash
# View detailed job summaries

dstat \
    --provider google-cls-v2 \
    --project "${GOOGLE_PROJECT}" \
    --location us-central1 \
    --names "candidatesnp-sid" \
    --users "${USER_NAME}" \
    --status '*' \
    --full

# Microarray Data

**Objective**: The purpose of this section is to run quality control (QC) with PLINK 2.0 and a GWAS with Regenie 4.1 using AoU's microarray dataset.
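
To make the variant-level thresholds in the filter script below concrete (`--maf`, `--mac 100`, `--geno 0.1`), here is a simplified sketch of how they apply to a single biallelic variant. It is an illustration under stated assumptions (diploid genotype counts, hypothetical function name), not PLINK's implementation, and it does not cover the sample-level `--mind` filter.

```python
def passes_variant_qc(n_hom_ref, n_het, n_hom_alt, n_missing,
                      maf_min=0.25, mac_min=100, geno_max=0.1):
    """Return True if a biallelic variant passes the MAF, MAC, and missingness thresholds."""
    n_called = n_hom_ref + n_het + n_hom_alt
    n_total = n_called + n_missing
    if n_called == 0:
        return False
    # Count alternate alleles among called diploid genotypes
    alt_count = n_het + 2 * n_hom_alt
    allele_total = 2 * n_called
    # The minor allele is whichever allele is less common
    minor_count = min(alt_count, allele_total - alt_count)
    maf = minor_count / allele_total
    # Fraction of samples with a missing call at this variant
    missing_rate = n_missing / n_total
    return maf >= maf_min and minor_count >= mac_min and missing_rate <= geno_max

# Illustrative counts only
print(passes_variant_qc(500, 400, 100, 0))
print(passes_variant_qc(900, 90, 10, 0, maf_min=0.01))
```

The default `maf_min` mirrors the `MAF=0.25` value passed in the submission cell of this section; a different value can be supplied when comparing thresholds.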

## Filter

In [None]:
%%writefile ~/filter_snps_allchr.sh
# Script to run QC on microarray data for our cohort
# The minor allele frequency (MAF) threshold can be changed, since variants with very low MAF may prevent the GWAS from converging

set -o pipefail 
set -o errexit

plink2 \
--bed "${input_bed}" \
--bim "${input_bim}" \
--fam "${input_fam}" \
--keep "${ids}" \
--maf "${MAF}" --mac 100 --geno 0.1 \
--mind 0.1 \
--write-snplist --write-samples --no-id-header \
--out "${OUTPUT_PATH}/qc_pass_maf${MAF_name}_${group}"

In [None]:
# Copy script to bucket
!gsutil cp /home/jupyter/filter_snps_allchr.sh {my_bucket}/data/dsub/

In [None]:
%%bash --out test_ID

source ~/aou_dsub.bash # This file was created via notebook 01_dsub_setup.ipynb.

BASH_SCRIPT="${my_bucket}/data/dsub/filter_snps_allchr.sh"

# Variable to hold which subset must be filtered
group_name=ldl30

aou_dsub \
      --image us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:2.1.19 \
      --disk-size 1024 \
      --boot-disk-size 1000 \
      --logging "${my_bucket}/data/logging" \
      --input input_bed="${genomic_location}/microarray/plink/arrays.bed" \
      --input input_bim="${genomic_location}/microarray/plink/arrays.bim" \
      --input input_fam="${genomic_location}/microarray/plink/arrays.fam" \
      --input ids="${my_bucket}/sid_pheno_files/genomic/${group_name}_statin_ids_v2.txt" \
      --env MAF=0.25 \
      --env MAF_name=25 \
      --env group=${group_name} \
      --output-recursive OUTPUT_PATH="${my_bucket}/sid_geno_files/snps_pass/array/${group_name}/" \
      --script "${BASH_SCRIPT}"  

In [None]:
%%bash
# View job status summary
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "filter-snps-allchr" \
    --users ${USER_NAME} \
    --status '*' | head -n 3

In [None]:
%%bash
# View detailed job summaries
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "filter-snps-allchr" \
    --users ${USER_NAME} \
    --status '*' \
    --full

In [None]:
%%bash
# Check that QC files were created
gsutil -u ${GOOGLE_PROJECT} ls "${my_bucket}/sid_geno_files/snps_pass/array/"

## GWAS
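The Regenie script below reads a `time` column and a `status` column from the phenotype file. Before submitting a job, a quick sanity check of that file can catch formatting problems early. The following is a minimal sketch with a hypothetical helper; it assumes a tab-separated file with `FID` and `IID` header columns, strictly positive follow-up times, and event status coded as 0 or 1. These assumptions should be checked against the actual phenotype files.

```python
import csv
import io

# Assumed header columns for the time-to-event phenotype file
REQUIRED_COLUMNS = ("FID", "IID", "time", "status")

def check_t2e_phenotypes(handle):
    """Return a list of problems found in a tab-separated phenotype table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the header
    for line_no, row in enumerate(reader, start=2):
        try:
            time_ok = float(row["time"]) > 0
        except (TypeError, ValueError):
            time_ok = False
        if not time_ok:
            problems.append(f"line {line_no}: time must be a positive number")
        if row["status"] not in ("0", "1"):
            problems.append(f"line {line_no}: status must be 0 or 1")
    return problems

# Small in-memory example with made-up rows
example = "FID\tIID\ttime\tstatus\n1\t1\t3.5\t1\n2\t2\t-1\t0\n"
print(check_t2e_phenotypes(io.StringIO(example)))
```

An open file object can be passed in place of the in-memory example.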

In [None]:
%%writefile ~/sid_gwas_regenie_array.sh
# Script to run time-to-event GWAS with Regenie 4.1

set -o pipefail 
set -o errexit

regenie \
    --step 1 \
    --bed "${bed_file}/arrays" \
    --extract "${keep_snps}" \
    --keep "${ids}" \
    --phenoFile "${pheno_file}" \
    --phenoColList time \
    --eventColList status \
    --covarFile "${cov_file}" \
    --covarColList low_hdl,high_tg,high_bmi,pd_status,smoking_status,htn_status,gd_status,index_age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,male \
    --t2e \
    --bsize 1000 \
    --verbose \
    --force-step1 \
    --out "${OUTPUT_PATH}/${prefix}"_step1_array \
    --threads 16

# Regenie step 2
regenie \
    --step 2 \
    --bed "${bed_file}/arrays" \
    --extract "${keep_snps}" \
    --keep "${ids}" \
    --phenoFile "${pheno_file}" \
    --phenoColList time \
    --eventColList status \
    --covarFile "${cov_file}" \
    --covarColList low_hdl,high_tg,high_bmi,pd_status,smoking_status,htn_status,gd_status,index_age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,male \
    --pred "${OUTPUT_PATH}/${prefix}"_step1_array_pred.list \
    --t2e \
    --firth --approx \
    --bsize 400 \
    --verbose \
    --threads 16 \
    --out "${OUTPUT_PATH}/${prefix}"_step2_array

In [None]:
# Copy script to personal bucket
!gsutil cp /home/jupyter/sid_gwas_regenie_array.sh {my_bucket}/data/dsub/

In [None]:
%%bash --out LINE_COUNT_JOB_ID
# Submit job to dsub

# Get a shorter username to leave more characters for the job name.
DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

# For AoU RWB projects network name is "network".
AOU_NETWORK=network
AOU_SUBNETWORK=subnetwork

MACHINE_TYPE="n2-standard-4"

# Change for your bucket, path in output of cell directly above:
BASH_SCRIPT="${my_bucket}/data/dsub/sid_gwas_regenie_array.sh"

# Choose which MAFs to run the GWAS on
mafs=(25)

for MAF_name in "${mafs[@]}"; do

# Choose which subset and treatment group to run GWAS on
subset_name=ldl30
group_name=nu

dsub \
    --provider google-cls-v2 \
    --user-project "${GOOGLE_PROJECT}" \
    --project "${GOOGLE_PROJECT}" \
    --image shinshinbooboo210/regenie_gsutil:v4.1 \
    --network "${AOU_NETWORK}" \
    --subnetwork "${AOU_SUBNETWORK}" \
    --service-account "$(gcloud config get-value account)" \
    --user "${DSUB_USER_NAME}" \
    --regions us-central1 \
    --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
    "$@" \
    --preemptible \
    --disk-size 3000 \
    --boot-disk-size 300 \
    --machine-type ${MACHINE_TYPE} \
    --name "${JOB_NAME}" \
    --script "${BASH_SCRIPT}" \
    --env GOOGLE_PROJECT=${GOOGLE_PROJECT} \
    --input-recursive bed_file="${genomic_location}/microarray/plink/" \
    --input keep_snps="${my_bucket}/sid_geno_files/snps_pass/array/${subset_name}/qc_pass_maf${MAF_name}_${subset_name}.snplist" \
    --input ids="${my_bucket}/sid_pheno_files/genomic/${subset_name}_ids_v2.txt" \
    --input pheno_file="${my_bucket}/sid_pheno_files/genomic/${subset_name}_${group_name}_pheno_df.tsv" \
    --input cov_file="${my_bucket}/sid_pheno_files/genomic/${subset_name}_${group_name}_covs_df.tsv" \
    --env prefix=SID_GWAS_array_${group_name}_${subset_name}_MAF${MAF_name} \
    --output-recursive OUTPUT_PATH="${my_bucket}/sid_geno_files/arrays/${subset_name}/"
    
done

In [None]:
%%bash
# View job status summary
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "sid-gwas-regenie-array" \
    --users ${USER_NAME} \
    --status '*' | head -n 7

In [None]:
%%bash
# View detailed job summary
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "sid-gwas-regenie-array" \
    --users ${USER_NAME} \
    --status '*' \
    --full

In [None]:
%%bash
# Check that files were created properly
gsutil -u ${GOOGLE_PROJECT} ls "${my_bucket}/sid_geno_files/arrays/"

# Sequence Data

**Objective**: The purpose of this section is to run quality control (QC) with PLINK 2.0 and a GWAS with Regenie 4.1 using sequencing data from AoU's ACAF Threshold callset.

## Filter 

In [None]:
%%writefile ~/filter_snps.sh
# Script to run QC on ACAF Threshold callset

set -o pipefail 
set -o errexit

plink2 \
    --bed "${input_bed}" \
    --bim "${input_bim}" \
    --fam "${input_fam}" \
    --keep "${ids}" \
    --mac 100 --geno 0.1 \
    --mind 0.1 \
    --write-snplist \
    --out "${out_path}/snps_pass_chr${CHROMO}"

In [None]:
# Copy script to bucket
!gsutil cp /home/jupyter/filter_snps.sh {my_bucket}/data/dsub/

In [None]:
%%bash --out test_ID
# Submit job to dsub

source ~/aou_dsub.bash # This file was created via notebook 01_dsub_setup.ipynb.

BASH_SCRIPT="${my_bucket}/data/dsub/filter_snps.sh"

LOWER=1
UPPER=23
for ((chromo=$LOWER;chromo<$UPPER;chromo+=1))
do

# Choose which subset to filter
subset_name=itt

    aou_dsub \
      --image us.gcr.io/broad-dsp-gcr-public/terra-jupyter-aou:2.1.19 \
      --disk-size 1024 \
      --boot-disk-size 1000 \
      --logging "${my_bucket}/data/logging" \
      --input input_bed="${acaf_plink_path}/plink_bed/chr${chromo}.bed" \
      --input input_bim="${acaf_plink_path}/plink_bed/chr${chromo}.bim" \
      --input input_fam="${acaf_plink_path}/plink_bed/chr${chromo}.fam" \
      --input ids="${my_bucket}/sid_pheno_files/genomic/${subset_name}_statin_ids_v2.txt" \
      --env CHROMO=${chromo} \
      --output-recursive out_path="${my_bucket}/sid_geno_files/snps_pass/sequence/${subset_name}/" \
      --script "${BASH_SCRIPT}"  
  
done

In [None]:
%%bash
# View job status summary
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "filter-snps" \
    --users ${USER_NAME} \
    --status '*'  | head -n 24

In [None]:
%%bash
# View more detailed job summaries
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "filter-snps" \
    --users ${USER_NAME} \
    --status '*' \
    --full

In [None]:
%%bash
# Check that files were created
gsutil -u ${GOOGLE_PROJECT} ls "${my_bucket}/sid_geno_files/snps_pass/sequence/"

## GWAS
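The submission loop in this section runs one job per chromosome and stops before its upper bound, so `LOWER=1` and `UPPER=23` cover chromosomes 1 through 22. Python's `range` has the same exclusive upper bound, which makes the behavior easy to check. The helper and the `plink_bed` base path below are hypothetical placeholders for illustration only.

```python
def chromosome_inputs(lower, upper, base_path):
    """Return (chromosome, .bed path) pairs for lower <= chromosome < upper."""
    return [(chromo, f"{base_path}/chr{chromo}.bed") for chromo in range(lower, upper)]

# All autosomes: chromosomes 1 through 22
all_autosomes = chromosome_inputs(1, 23, "plink_bed")
print(len(all_autosomes), all_autosomes[0], all_autosomes[-1])

# A single chromosome: set the upper bound to the next chromosome number
print(chromosome_inputs(7, 8, "plink_bed"))
```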

In [None]:
%%writefile ~/sid_gwas_regenie.sh
# Script to run time-to-event GWAS on ACAF Threshold data

set -o pipefail
set -o errexit

regenie \
    --step 1 \
    --bed "${array_path}/arrays" \
    --extract "${keep_snps}" \
    --keep "${ids}" \
    --phenoFile "${pheno_file}" \
    --phenoColList time \
    --eventColList status \
    --covarFile "${cov_file}" \
    --covarColList low_hdl,high_tg,high_bmi,pd_status,smoking_status,htn_status,gd_status,index_age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,pop_black,pop_lat,pop_more,pop_asian,pop_aian,pop_mena,male \
    --t2e \
    --bsize 1000 \
    --verbose \
    --force-step1 \
    --out "${OUTPUT_PATH}/${prefix}"_step1_2 \
    --threads 16

regenie \
    --step 2 \
    --bed "${genos}/chr${chrom}" \
    --extract "${keep_snps2}" \
    --keep "${ids}" \
    --phenoFile "${pheno_file}" \
    --phenoColList time \
    --eventColList status \
    --covarFile "${cov_file}" \
    --covarColList low_hdl,high_tg,high_bmi,pd_status,smoking_status,htn_status,gd_status,index_age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,pop_black,pop_lat,pop_more,pop_asian,pop_aian,pop_mena,male \
    --pred "${OUTPUT_PATH}/${prefix}"_step1_2_pred.list \
    --t2e \
    --firth --approx \
    --bsize 400 \
    --verbose \
    --threads 16 \
    --out "${OUTPUT_PATH}/${prefix}"_step2_chr"${chrom}"

In [None]:
# Copy script to bucket
!gsutil cp /home/jupyter/sid_gwas_regenie.sh {my_bucket}/data/dsub/

In [None]:
%%bash --out LINE_COUNT_JOB_ID
# Submit job to dsub

source ~/aou_dsub.bash

# Get a shorter username to leave more characters for the job name.
DSUB_USER_NAME="$(echo "${OWNER_EMAIL}" | cut -d@ -f1)"

# For AoU RWB projects network name is "network".
AOU_NETWORK=network
AOU_SUBNETWORK=subnetwork

MACHINE_TYPE="n2-standard-4"

# Change for your bucket, path in output of cell directly above:
BASH_SCRIPT="${my_bucket}/data/dsub/sid_gwas_regenie.sh"


# The loop below uses an exclusive upper bound (chromo < UPPER), so the last value is not included
# To run the regression across all chromosomes, set LOWER to 1 and UPPER to 23
# To run across one chromosome, set LOWER to the chromosome of interest and UPPER to the following chromosome number

LOWER=1
UPPER=23
for ((chromo=$LOWER;chromo<$UPPER;chromo+=1))
do

# Choose which MAF, subset, and treatment group to run GWAS on
MAF_name=
subset_name=
group_name=

mnt_path="/mnt/data/input/gs/fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/plink_bed"

aou_dsub \
    --provider google-cls-v2 \
    --user-project "${GOOGLE_PROJECT}" \
    --project "${GOOGLE_PROJECT}" \
    --image shinshinbooboo210/regenie_gsutil:v4.1 \
    --network "${AOU_NETWORK}" \
    --subnetwork "${AOU_SUBNETWORK}" \
    --service-account "$(gcloud config get-value account)" \
    --user "${DSUB_USER_NAME}" \
    --regions us-central1 \
    --logging "${WORKSPACE_BUCKET}/dsub/logs/{job-name}/{user-id}/$(date +'%Y%m%d/%H%M%S')/{job-id}-{task-id}-{task-attempt}.log" \
    "$@" \
    --preemptible \
    --disk-size 3000 \
    --boot-disk-size 200 \
    --machine-type ${MACHINE_TYPE} \
    --name "${JOB_NAME}" \
    --script "${BASH_SCRIPT}" \
    --env GOOGLE_PROJECT=${GOOGLE_PROJECT} \
    --input-recursive array_path="${genomic_location}/microarray/plink/" \
    --input keep_snps="${my_bucket}/data/plink_result/qc_pass_maf${MAF_name}_${subset_name}.snplist" \
    --input ids="${my_bucket}/sid_pheno_files/${subset_name}_ids_v2.txt" \
    --input pheno_file="${my_bucket}/sid_pheno_files/${subset_name}_${group_name}_pheno_df.tsv" \
    --input cov_file="${my_bucket}/sid_pheno_files/${subset_name}_${group_name}_covs_df.tsv" \
    --input bed_file="gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/plink_bed/chr${chromo}.bed" \
    --input bim_file="gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/plink_bed/chr${chromo}.bim" \
    --input fam_file="gs://fc-aou-datasets-controlled/v8/wgs/short_read/snpindel/acaf_threshold/plink_bed/chr${chromo}.fam" \
    --input keep_snps2="${my_bucket}/data/plink_result/snps_pass_chr${chromo}.snplist" \
    --env prefix=SID_GWAS_regenie_${group_name}_${subset_name} \
    --env genos="${mnt_path}" \
    --env chrom=${chromo} \
    --output-recursive OUTPUT_PATH="${my_bucket}/sid_geno_files/${group_name}_${subset_name}/"
done

In [None]:
%%bash
# View job status summary
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "sid-gwas-regenie" \
    --users ${USER_NAME} \
    --status '*' | head -n 24

In [None]:
%%bash
# View detailed job summary
dstat \
    --provider google-cls-v2 \
    --project ${GOOGLE_PROJECT} \
    --location us-central1 \
    --names "sid-gwas-regenie" \
    --users ${USER_NAME} \
    --status '*' \
    --full