# Preparing data for haplotype analysis

Data preparation consists of the following basic steps:
- Subset variants by ACAF threshold
- Subset to the region of interest (500 kb on either side of the SPTSSB lead SNP)
- Phase the data
- Filter variants by PD GWAS p-value (p < 0.05 or p < 0.03)

In [None]:
## Useful packages and variables

import os                                       # interact with the environment
import sys                                      # interact with stderr
from firecloud import api as fapi               # interact with the workspace
from io import StringIO                         # work with file contents
import pandas as pd                             # work with tabular data
import numpy as np
import urllib.parse                             # build URLs to Google Cloud Console
from IPython.display import display, HTML       # display links to Google Cloud Console

# Enable IPython to display matplotlib graphs.
import matplotlib.pyplot as plt
%matplotlib inline
from cycler import cycler

# Get workspace attributes, so the notebook can access workspace data
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json().get('workspace',{}).get('attributes',{})

In [None]:
## Create folder
#!mkdir haplotype_analysis
%cd haplotype_analysis/
!ls -l
#!mkdir inputed_genotypes
#!mv inputed_genotypes/ imputed_genotypes/

In [None]:
## Copy imputed genotype for chromosome of interest
#!gsutil -mu {BILLING_PROJECT_ID} ls gs://gp2tier2/release10/imputed_genotypes/EUR/
# plink2 --pfile needs the .pgen, .pvar and .psam files, so copy all three
!gsutil -mu {BILLING_PROJECT_ID} cp gs://gp2tier2/release10/imputed_genotypes/EUR/chr3_EUR_release10.p* imputed_genotypes/
!ls -l imputed_genotypes/

In [None]:
## Install plink (only need to run once)
%cd ../../
!wget https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_latest.zip
!unzip plink2_linux_x86_64_latest.zip
!chmod +x plink2

## ACAF filtering
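ACAF = Allele Count/Allele Frequency. In All of Us, the ACAF threshold callset keeps variants that are common (allele frequency > 1% or allele count > 100) in at least one ancestry subpopulation. Here we apply a comparable frequency filter within a single, homogeneous ancestry group (EUR), which helps avoid population stratification artifacts.

This step matters most when dealing with very large WGS files from All of Us.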
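
As a rough illustration of the frequency rule applied in this notebook (keep variants whose ALT allele frequency lies strictly between 0.01 and 0.99), the sketch below runs it on a tiny, made-up table laid out like a PLINK 2 `.afreq` file. The variant IDs and frequencies are invented for the example, and biallelic variants are assumed (a single value in `ALT_FREQS`).

```python
import csv
import io


def acaf_like_filter(afreq_text, min_af=0.01):
    """Return IDs of variants with min_af < ALT frequency < 1 - min_af.

    Assumes a tab-delimited table with ID and ALT_FREQS columns and one
    ALT allele per variant (multiallelic rows would need extra handling).
    """
    reader = csv.DictReader(io.StringIO(afreq_text), delimiter="\t")
    keep = []
    for row in reader:
        alt_freq = float(row["ALT_FREQS"])
        if min_af < alt_freq < 1 - min_af:
            keep.append(row["ID"])
    return keep


# Toy data: IDs and frequencies are hypothetical, not taken from GP2.
TOY_AFREQ = (
    "#CHROM\tID\tREF\tALT\tALT_FREQS\tOBS_CT\n"
    "3\tvar_a\tA\tG\t0.004\t2000\n"
    "3\tvar_b\tC\tT\t0.25\t2000\n"
    "3\tvar_c\tG\tA\t0.995\t2000\n"
    "3\tvar_d\tT\tC\t0.01\t2000\n"
)

kept_ids = acaf_like_filter(TOY_AFREQ)
print(kept_ids)
```

The comparisons are strict, so a variant sitting exactly on the 0.01 boundary is dropped, matching the `awk` condition used in the notebook.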

In [None]:
## Assuming .afreq are not available - Need to generate first
#%cd haplotype_analysis/imputed_genotypes/
!ls -l
#!head chr3_EUR_release10.psam

# Calculate allele frequency within EUR group -> .afreq file
!../../plink2 \
  --pfile chr3_EUR_release10 \
  --freq \
  --out chr3_eur_freq

In [None]:
## Generate list of SNPs
#!head chr3_eur_freq.afreq
# Column 5 of the .afreq file is ALT_FREQS; NR > 1 skips the header line explicitly
!awk 'NR > 1 && $5 > 0.01 && $5 < 0.99' chr3_eur_freq.afreq | cut -f2 > acaf_filtered_snps.txt
!head acaf_filtered_snps.txt

In [None]:
## Filter based on ACAF threshold
!../../plink2 \
 --pfile chr3_EUR_release10 \
 --extract acaf_filtered_snps.txt \
 --make-pgen \
 --out chr3_acaf_filtered

#!head chr3_eur_freq.afreq
!head chr3_acaf_filtered.pvar
#!gsutil -u {BILLING_PROJECT_ID} cat gs://gp2tier2/release10/README_release10_01072025.txt | head -n 20

## Subsetting by locus of interest (SPTSSB)
- hg38 (assembly used here): chr3:160859842-161859842
- hg19: chr3:160577630-161577630
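
The hg38 interval above spans exactly 1 Mb, which matches a 500 kb flank on each side of its midpoint, position 161359842. The sketch below shows how such a window and the matching `plink2` arguments can be derived. Treating that midpoint as the lead SNP position is an assumption made for the example, and the helper names are hypothetical.

```python
def window_around(pos, flank=500_000):
    """Return (start, end) of a window of +/- flank bp around pos (1-based)."""
    start = max(1, pos - flank)  # never go below the first base
    return start, pos + flank


def plink2_region_args(chrom, pos, flank=500_000):
    """Build the plink2 region flags used in this notebook as a list of strings."""
    start, end = window_around(pos, flank)
    return ["--chr", str(chrom), "--from-bp", str(start), "--to-bp", str(end)]


# Assumption: midpoint of the hg38 interval quoted in the text.
LEAD_POS_HG38 = 161_359_842

print(window_around(LEAD_POS_HG38))
print(" ".join(plink2_region_args(3, LEAD_POS_HG38)))
```

Keeping the flank size in one place makes it easier to reuse the same window in the coloc analysis and in the SHAPEIT `--region` argument.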

In [None]:
## Subset to region of interest (SPTSSB) - Same coordinates used in coloc analysis
!../../plink2 \
  --pfile chr3_acaf_filtered \
  --chr 3 \
  --from-bp 160859842 \
  --to-bp 161859842 \
  --make-pgen \
  --out chr3_acaf_filtered_sptssb

## Phasing data
In theory, you could generate .ped and .map files now with the command below, but it is recommended to phase the data first.

plink2 --pfile chr3_acaf_filtered_sptssb --export ped --out sptssb_haplo

Phasing is the process of determining which alleles at multiple SNPs are inherited together on the same chromosome copy 
(i.e., haplotype resolution). In other words, it assigns each allele to one of the two parental chromosomes; without parental genotypes it cannot tell which copy is maternal and which is paternal.

For that, we will use SHAPEIT, a well-established phasing tool, together with a genetic map and, optionally, a reference panel such as 1000G or TOPMed.
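
To make the ambiguity that phasing resolves concrete, the toy sketch below enumerates every haplotype pair that is compatible with a set of unphased genotypes. It is an illustration only and not how SHAPEIT works internally; SHAPEIT picks among such configurations statistically using linkage information. The function name is hypothetical.

```python
from itertools import product


def possible_phasings(genotypes):
    """List all distinct haplotype pairs compatible with unphased genotypes.

    genotypes: list of (allele1, allele2) tuples, one per SNP, in genomic order.
    Returns a sorted list of (haplotype_a, haplotype_b) tuples.
    """
    configs = set()
    for flips in product([False, True], repeat=len(genotypes)):
        hap1 = "".join(b if flip else a for (a, b), flip in zip(genotypes, flips))
        hap2 = "".join(a if flip else b for (a, b), flip in zip(genotypes, flips))
        # The two chromosome copies are unordered, so store each pair sorted.
        configs.add(tuple(sorted((hap1, hap2))))
    return sorted(configs)


# Two heterozygous SNPs: A/G at the first site, C/T at the second.
two_hets = [("A", "G"), ("C", "T")]
print(possible_phasings(two_hets))
```

With n heterozygous sites there are 2^(n-1) compatible configurations, which is why a statistical method is needed once a region contains more than a handful of variants.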

In [None]:
## Convert plink files to vcf
!../../plink2 \
  --pfile chr3_acaf_filtered_sptssb \
  --recode vcf bgz \
  --out chr3_acaf_filtered_sptssb

In [None]:
## Download genetic map (only need to run once)
!wget https://github.com/odelaneau/shapeit4/raw/master/maps/genetic_maps.b38.tar.gz

In [None]:
## Download reference panel (only once), for improved accuracy. 1000G files are in hg19 (need to liftOver later)
!wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr3.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz  # 1000G
!wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr3.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi  # 1000G
#!wget ftp://ngs.sanger.ac.uk/production/hg38/1000G/1000G.GRCh38.autosomes.genotypes.20170504/ALL.chr3.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz

!ls -l

In [None]:
## Install SHAPEIT4 or 5 (in that order - only once)
# Easier to do in the terminal.

# In case wanting to use a more recent version of GATK
#!wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip # gatk
#!unzip gatk-4.5.0.0.zip

#!git clone https://github.com/odelaneau/shapeit4.git
#!git clone https://github.com/odelaneau/shapeit5.git
!wget https://github.com/odelaneau/shapeit5/releases/download/v5.1.1/phase_common_static

!ls -l

In [None]:
!chmod +x ./phase_common_static
!./phase_common_static

In [None]:
## Rename chromosomes in vcfs, so they match the fasta reference genome
# Create "chr_rename.txt"
# with open("chr_rename.txt", "w") as f:
#     # Autosomes
#     for i in range(1, 23):
#         f.write(f"{i}\tchr{i}\n")
#     # Sex chromosomes + MT
#     f.write("X\tchrX\n")
#     f.write("Y\tchrY\n")
#     f.write("MT\tchrM\n")

# Run renaming itself
!bcftools annotate \
  --rename-chrs chr_rename.txt \
  -o ALL.chr3.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_renamed.vcf.gz \
  -O z \
  ALL.chr3.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

!bcftools index ALL.chr3.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_renamed.vcf.gz

In [None]:
## LiftOver 1000G reference panel, originally in hg19
# Download required files (only once)
#!wget https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz
#!wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa # used by GATK
#!wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai # used by GATK
#!wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.dict # used by GATK

#!ls -l

# Run the liftOver itself
# CONTINUE TROUBLESHOOTING FROM HERE
# Since providing a reference panel is not mandatory for SHAPEIT, we will skip it for now.

!gatk --java-options "-Xmx8g" LiftoverVcf \
  -I ALL.chr3.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes_renamed.vcf.gz \
  -O chr3_lifted_to_hg38.vcf.gz \
  -CHAIN hg19ToHg38.over.chain.gz \
  -REJECT chr3_rejected_variants.vcf.gz \
  -R GRCh38_full_analysis_set_plus_decoy_hla.fa

In [None]:
# Index the bgzipped VCF (this step was missed after the conversion above)
!tabix haplotype_analysis/imputed_genotypes/chr3_acaf_filtered_sptssb.vcf.gz

In [None]:
# Create AC/AN fields (required by SHAPEIT); -Oz writes bgzipped output so tabix can index it
!bcftools +fill-tags haplotype_analysis/imputed_genotypes/chr3_acaf_filtered_sptssb.vcf.gz -Oz -o haplotype_analysis/imputed_genotypes/chr3_acaf_filtered_sptssb_filled.vcf.gz -- -t AC,AN

!tabix haplotype_analysis/imputed_genotypes/chr3_acaf_filtered_sptssb_filled.vcf.gz

In [None]:
# Unzip genetic map
!tar -xzvf genetic_maps.b38.tar.gz

In [None]:
## Run SHAPEIT5 itself
!./phase_common_static \
  --input haplotype_analysis/imputed_genotypes/chr3_acaf_filtered_sptssb_filled.vcf.gz \
  --map chr3.b38.gmap.gz \
  --region 3:160859842-161859842 \
  --output sptssb_phased.bcf \
  --log sptssb_shapeit5.log

# OPTIONAL
#  --reference chr3_lifted_to_hg38.vcf.gz \

!ls -l

# Verify the output
#!bcftools view sptssb_phased.bcf | head -n 20

## Filtering GWAS variants
We first apply a "soft" filter, keeping all PD GWAS variants with p < 0.05.
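
The sketch below shows the filtering logic on a small, made-up table with the same three columns prepared in this section (`CHROM`, `POS`, `GWAS_P`). It keeps variants under a p-value threshold and writes them as tab-separated chromosome/position pairs, the layout used for a targets file with `bcftools view -T`. All positions and p-values are invented for illustration, and the function names are hypothetical.

```python
import csv
import io


def significant_positions(gwas_text, p_threshold=0.05, start=None, end=None):
    """Return (chrom, pos) tuples with GWAS_P below p_threshold.

    start/end optionally restrict the result to a genomic window (inclusive).
    """
    reader = csv.DictReader(io.StringIO(gwas_text), delimiter="\t")
    hits = []
    for row in reader:
        pos = int(row["POS"])
        p_value = float(row["GWAS_P"])  # also parses scientific notation
        if p_value >= p_threshold:
            continue
        if start is not None and pos < start:
            continue
        if end is not None and pos > end:
            continue
        hits.append((row["CHROM"], pos))
    return hits


def to_targets_text(hits):
    """Format hits as tab-separated CHROM/POS lines, one variant per line."""
    return "".join(f"{chrom}\t{pos}\n" for chrom, pos in hits)


# Toy data: hypothetical positions and p-values.
TOY_GWAS = (
    "CHROM\tPOS\tGWAS_P\n"
    "3\t160900000\t0.20\n"
    "3\t161000000\t0.04\n"
    "3\t161300000\t3.2e-08\n"
    "3\t170000000\t0.001\n"
)

hits = significant_positions(TOY_GWAS, 0.05, start=160859842, end=161859842)
print(to_targets_text(hits), end="")
```

Lowering `p_threshold` (for example to 0.03) tightens the filter without changing anything else in the workflow.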

In [None]:
## Preparing PD GWAS file (case/control)
# File below was uploaded manually. Can be found in the following link: https://api.kpndataregistry.org/api/d/7j5797

!gsutil -mu {BILLING_PROJECT_ID} cat gs://fc-0e367d5c-f21f-4d17-9f20-16910ff5c3aa/uploads/GP2_GWAS_2025/GP2_ALL_EUR_CLINICAL_ONLY_HG38_12162024.txt.gz | zcat | cut -f 1,2,8 | grep -P '^3' > GP2_GWAS_case_ctrl_chr3.txt
!sed -i '1iCHROM\tPOS\tGWAS_P' GP2_GWAS_case_ctrl_chr3.txt

# Replace multiple spaces with tabs
!awk '{$1=$1}1' OFS='\t' GP2_GWAS_case_ctrl_chr3.txt > GP2_GWAS_case_ctrl_chr3.tsv

!head GP2_GWAS_case_ctrl_chr3.tsv

In [None]:
# Compress and index annotation
!bgzip -c GP2_GWAS_case_ctrl_chr3.tsv > GP2_GWAS_case_ctrl_chr3.tsv.gz
# -S 1 skips the one-line header while indexing (-H would only print header lines)
!tabix -s 1 -b 2 -e 2 -S 1 GP2_GWAS_case_ctrl_chr3.tsv.gz

In [None]:
## Option 1 (failed) - Annotate BCF files with GWAS p-values, then filter
# # Annotate
# !bcftools annotate \
#   -a GP2_GWAS_case_ctrl_chr3.tsv.gz \
#   -c CHROM,POS,GWAS_P \
#   -h header.txt \
#   -O b -o sptssb_phased_annotated.bcf \
#   sptssb_phased.bcf

# # Filter annotated VCF for p<0.05
# !bcftools view \
#   -i 'INFO/GWAS_P<0.05' \
#   -O b -o sptssb_phased_filtered.bcf \
#   sptssb_phased_annotated.bcf

# # Check if output is correct
# !bcftools view -H sptssb_phased_filtered.bcf | more

In [None]:
## Option 2 (worked) - Filter using regions file
# Create regions file - only significant PD GWAS variants
#!awk 'NR>1 && $3 < 0.05 { print $1"\t"$2 }' GP2_GWAS_case_ctrl_chr3.tsv > significant_variants.tsv

#!gsutil -mu {BILLING_PROJECT_ID} cp gs://fc-0e367d5c-f21f-4d17-9f20-16910ff5c3aa/uploads/GP2_GWAS_2025/GP2_GWAS_SPTSSB_vars_5x10-5_+-5kb.txt ./
#!gsutil -mu {BILLING_PROJECT_ID} cp gs://fc-0e367d5c-f21f-4d17-9f20-16910ff5c3aa/uploads/GP2_GWAS_2025/GP2_GWAS_SPTSSB_vars_5x10-8_+-5kb.txt ./
!gsutil -mu {BILLING_PROJECT_ID} cp gs://fc-0e367d5c-f21f-4d17-9f20-16910ff5c3aa/uploads/GP2_GWAS_2025/GP2_GWAS_SPTSSB_vars_Jeff_prioritized.txt ./

# Run filtering itself
#!bcftools view -T significant_variants.tsv sptssb_phased.bcf -O b -o sptssb_phased_filtered.bcf
!bcftools view -T GP2_GWAS_SPTSSB_vars_Jeff_prioritized.txt sptssb_phased.bcf -O b -o sptssb_phased_filtered_4.bcf

# Check if output is correct - non-filtered: 3138 variants; filtered: 1447 variants; filtered 2/3/4: 33/24/8
!bcftools view -H sptssb_phased.bcf | wc -l
!bcftools view -H sptssb_phased_filtered_2.bcf | wc -l
!bcftools view -H sptssb_phased_filtered_3.bcf | wc -l
!bcftools view -H sptssb_phased_filtered_4.bcf | wc -l

## Converting files to .ped and .map
These are the file formats required by Mary Makarious's haplotype analysis pipeline.
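
For orientation, a standard PLINK `.ped` line holds six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two alleles per variant. The minimal parser below is a sketch for inspecting such lines; the record type and function name are hypothetical, and the sample line is made up.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    fid: str
    iid: str
    pat: str
    mat: str
    sex: str      # 1 = male, 2 = female, 0 = unknown
    pheno: str    # 1 = control, 2 = case, 0 or -9 = missing
    genotypes: List[Tuple[str, str]]


def parse_ped_line(line: str) -> PedRecord:
    """Split one whitespace-delimited .ped line into its fields."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns plus allele pairs")
    alleles = fields[6:]
    # Pair up consecutive alleles: (a1, a2) for each variant in .map order.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)


record = parse_ped_line("FAM1 ID001 0 0 1 2 A G C C")
print(record.iid, record.sex, record.pheno, record.genotypes)
```

The variant order in the genotype list follows the rows of the accompanying `.map` file, so the two files must always be kept together.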

In [None]:
## Convert .bcf to .vcf
!bcftools view -O v -o sptssb_phased_filtered_4.vcf sptssb_phased_filtered_4.bcf

In [None]:
## Run plink to generate files
!./plink2 --vcf sptssb_phased_filtered_4.vcf --export ped --out sptssb_phased_filtered_4

## (Final Step) Rescue phenotype information
In the process of data prep/phasing, phenotype information (sex and case/control status) is often lost.

This information is critical for our haplotype analysis, since we want to compare haplotype frequencies between PD cases and healthy controls.
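
The rescue step amounts to a lookup by sample ID. The sketch below mirrors that logic in plain Python on made-up data, assuming a `.psam` laid out as `#IID`, `SEX`, `PHENO1` (the column order the shell commands in this section rely on). Sample IDs and the fallback codes for samples missing from the map are assumptions for the example.

```python
def load_pheno_map(psam_text):
    """Map IID -> (sex, phenotype) from a .psam with columns #IID SEX PHENO1."""
    mapping = {}
    for line in psam_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        iid, sex, pheno = line.split()[:3]
        mapping[iid] = (sex, pheno)
    return mapping


def restore_phenotypes(ped_lines, mapping, missing_sex="0", missing_pheno="-9"):
    """Overwrite .ped columns 5 (sex) and 6 (phenotype) using the IID in column 2."""
    restored = []
    for line in ped_lines:
        fields = line.split()
        sex, pheno = mapping.get(fields[1], (missing_sex, missing_pheno))
        fields[4], fields[5] = sex, pheno
        restored.append(" ".join(fields))
    return restored


# Toy data: hypothetical sample IDs.
TOY_PSAM = "#IID\tSEX\tPHENO1\nID001\t1\t2\nID002\t2\t1\n"
TOY_PED = [
    "ID001 ID001 0 0 0 -9 A G",
    "ID002 ID002 0 0 0 -9 G G",
    "ID003 ID003 0 0 0 -9 A A",  # not present in the .psam
]

pheno_map = load_pheno_map(TOY_PSAM)
for ped_line in restore_phenotypes(TOY_PED, pheno_map):
    print(ped_line)
```

Using explicit fallback codes keeps the column count intact for samples that cannot be matched, which an empty value would not.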

In [None]:
## List sample IDs
#!gsutil -mu {BILLING_PROJECT_ID} ls -l gs://gp2tier2/release10/imputed_genotypes/EUR/

#!cut -d' ' -f2 yourfile.fam > samples.txt  # IID
!head haplotype_analysis/imputed_genotypes/chr3_EUR_release10.psam

In [None]:
## Prepare phenotype file (.tsv)
# IID     SEX   PD_status
# ID001   1     1
# ID002   2     2
# PLINK convention for phenotype: 1 = control, 2 = case (confirm against the release README)

!cut -f 1,2,3 haplotype_analysis/imputed_genotypes/chr3_EUR_release10.psam | sed 's/ /\t/g' > chr3_pheno_map.tsv
!head chr3_pheno_map.tsv

In [None]:
## Update .ped
#!head sptssb_phased_filtered_4.ped # code -9 means missing phenotype
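# Fill sex (column 5) and phenotype (column 6) in one pass; samples absent from the map get 0 (unknown sex) and -9 (missing phenotype)
!awk 'NR==FNR {sex[$1]=$2; pheno[$1]=$3; next} {$5=($2 in sex) ? sex[$2] : 0; $6=($2 in pheno) ? pheno[$2] : -9; print}' chr3_pheno_map.tsv sptssb_phased_filtered_4.ped > test.ped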

In [None]:
## Verify new .ped
!head test.ped
!echo xxxxxx
!cut -d' ' -f5 test.ped | sort | uniq -c # sex
!echo xxxxxx
!cut -d' ' -f6 test.ped | sort | uniq -c # case/control
!echo xxxxxx
!cut -f2 haplotype_analysis/imputed_genotypes/chr3_EUR_release10.psam | sort | uniq -c # sex
!echo xxxxxx
!cut -f3 haplotype_analysis/imputed_genotypes/chr3_EUR_release10.psam | sort | uniq -c # case/control

In [None]:
## Rename .peds
!mv sptssb_phased_filtered_4.ped sptssb_phased_filtered_4_noPheno.ped
!mv test.ped sptssb_phased_filtered_4.ped

# Test - Haplotype analysis with PLINK
This approach is more computationally efficient for working with ~30 variants.
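
As a conceptual sketch of what a haplotype frequency comparison computes, the example below counts phased haplotypes separately in cases and controls and converts the counts to frequencies. It is not PLINK's implementation; the sample IDs, haplotypes and group labels are made up, and the function name is hypothetical.

```python
from collections import Counter


def haplotype_frequencies(haplotypes, status):
    """Compute haplotype frequencies per group.

    haplotypes: dict IID -> (haplotype_1, haplotype_2), the two phased copies.
    status:     dict IID -> group label (e.g. "case" or "control").
    Returns a dict group -> {haplotype: frequency}.
    """
    counts = {}
    for iid, pair in haplotypes.items():
        group = status.get(iid)
        if group is None:
            continue  # skip samples without a phenotype
        counts.setdefault(group, Counter()).update(pair)

    freqs = {}
    for group, counter in counts.items():
        total = sum(counter.values())  # two chromosomes per sample
        freqs[group] = {hap: n / total for hap, n in counter.items()}
    return freqs


# Toy data: hypothetical samples with two-SNP haplotypes.
toy_haplotypes = {
    "S1": ("AC", "GT"),
    "S2": ("AC", "AC"),
    "S3": ("GT", "GT"),
    "S4": ("AC", "GT"),
}
toy_status = {"S1": "case", "S2": "case", "S3": "control", "S4": "control"}

toy_freqs = haplotype_frequencies(toy_haplotypes, toy_status)
print(toy_freqs)
```

A formal comparison would add a statistical test on these counts; the sketch stops at the frequencies themselves.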

In [None]:
## Installing PLINK 1.9
#!wget https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20250731.zip
!unzip plink_linux_x86_64_20250731.zip

In [None]:
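## Running PLINK haplotype analysis - Only works with PLINK 1!
# --bfile expects a .bed/.bim/.fam prefix, so a BCF is loaded with --bcf instead.
# Note: the --hap* flags originate from PLINK 1.07; check that the installed PLINK 1.9 build still accepts them.
!./plink --bcf sptssb_phased_filtered_3.bcf --hap --hap-freq --out sptssb_haplo_freqs_plink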