Assumes the raw dbGaP data is available at <code>/labs/tassimes/rodrigoguarischi/projects/sea/data_preparation_to_imputation</code>

In [1]:
import os

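# Change working directory. A "!cd" runs in a temporary subshell and does not
# persist to later commands, so change the directory from Python instead
os.chdir("/labs/tassimes/rodrigoguarischi/projects/sea/data_preparation_to_imputation")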

# Create symbolic links to the PED/MAP files
!ln -s ./86679/NHLBI/SEA_Herrington/phs000349v1/p1/genotype/phg000121v1/genotype-calls-matrixfmt/SEA_Phase2.map .
!ln -s ./86679/NHLBI/SEA_Herrington/phs000349v1/p1/genotype/phg000121v1/genotype-calls-matrixfmt/SEA_Phase2.ped .
!ls

apply_grs
apply_pgs_and_plot_odds_ratios.ipynb
aux_files
data_preparation_to_imputation
data_preparation_to_imputation.ipynb
data_preparation_to_imputation_results_download.sh
exports
heart_disease_probabilities.pdf
imputed_data_qc.ipynb
imputed_genotypes
pgs-calc_plink2_score_comparison.ipynb
processed_files
raw_files
SEA_Phase2.map
SEA_Phase2.ped
SEA_Phase2.sex_recoded.map
SEA_Phase2.sex_recoded.ped
table_assembly
TOBEDELETED
tools
troubleshoot_imputation_maf.ipynb


The <code>Perlegen</code> array design was based on human assembly <code>hg18</code>. To proceed with data preparation, we need to lift the raw files over to <code>hg19</code> (Michigan Imputation Server, HRC reference panel) and <code>hg38</code> (TOPMed Imputation Server and reference panel).

Before the liftover, we need to preprocess the data to avoid losing variants on the X and Y chromosomes.

In [None]:
# The X and Y chromosomes are coded as 23 and 24 in the original MAP file. Lifting over with these codes causes
# SNPs on these 2 chromosomes to be dropped from the analysis. Recode them to X and Y to avoid losing this data
!perl -pe "s/^23/X/" SEA_Phase2.map | perl -pe "s/^24/Y/" > SEA_Phase2.sex_recoded.map

# PED file remains the same. Therefore, it's ok to use a symbolic link to reference it
!ln -s SEA_Phase2.ped SEA_Phase2.sex_recoded.ped

# List files
!ls
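
The recoding step above can also be sketched in plain Python. This is a minimal illustration rather than the notebook's actual method: the function name `recode_sex_chromosomes` and the example rows (including their variant IDs and positions) are made up for demonstration. Unlike a bare `^23` pattern, it only rewrites the code when it is the complete first field of a MAP line.

```python
import re

# Numeric sex-chromosome codes used in the original MAP file and their letter codes
SEX_CODES = {"23": "X", "24": "Y"}

# Match 23 or 24 only when it is the whole first field (followed by whitespace)
_FIRST_FIELD = re.compile(r"^(23|24)(?=\s)")


def recode_sex_chromosomes(map_lines):
    """Return MAP lines with chromosome codes 23/24 replaced by X/Y."""
    return [
        _FIRST_FIELD.sub(lambda m: SEX_CODES[m.group(1)], line)
        for line in map_lines
    ]


# Hypothetical MAP rows: chromosome, variant ID, genetic distance, position
example = [
    "1\trs0001\t0\t1000\n",
    "23\trs0002\t0\t2000\n",
    "24\trs0003\t0\t3000\n",
    "2\trs2300\t0\t2300\n",
]
recoded = recode_sex_chromosomes(example)
print("".join(recoded), end="")
```

Only the second and third rows change; autosomal rows and variant IDs that happen to contain "23" are left untouched.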

The MAP/PED files will be lifted over with <code>liftOverPlink</code> (https://github.com/sritchie73/liftOverPlink).

In [None]:
# Assumes the liftOverPlink.py script is installed under ../tools/liftOverPlink and that its dependencies
# (the liftOver executable from UCSC and the chain files) are present

# Hg19
!../tools/liftOverPlink/liftOverPlink.py \
    -m SEA_Phase2.sex_recoded.map \
    -p SEA_Phase2.sex_recoded.ped \
    -e ../tools/liftOver \
    -c ../aux_files/hg18ToHg19.over.chain.gz \
    -o SEA_Phase2.sex_recoded.hg19_liftover

# Hg38
!../tools/liftOverPlink/liftOverPlink.py \
    -m SEA_Phase2.sex_recoded.map \
    -p SEA_Phase2.sex_recoded.ped \
    -e ../tools/liftOver \
    -c ../aux_files/hg18ToHg38.over.chain.gz \
    -o SEA_Phase2.sex_recoded.hg38_liftover

# List files
!ls -lha
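
Since the recoding step exists to keep X and Y variants through the liftover, it may be worth comparing per-chromosome variant counts before and after. The sketch below is one possible check, not part of the original workflow: the helper names and the example rows are hypothetical. It reports the net change per chromosome, so a variant that moves to another chromosome during liftover counts as a loss on the old one.

```python
from collections import Counter


def count_by_chromosome(map_lines):
    """Count variants per chromosome code (first field of each MAP line)."""
    counts = Counter()
    for line in map_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def net_loss_per_chromosome(before_lines, after_lines):
    """Return {chromosome: variants before minus variants after liftover}."""
    before = count_by_chromosome(before_lines)
    after = count_by_chromosome(after_lines)
    return {chrom: before[chrom] - after.get(chrom, 0) for chrom in before}


# Hypothetical MAP rows before and after liftover
before = ["1 rsA 0 100", "1 rsB 0 200", "X rsC 0 300", "Y rsD 0 400"]
after = ["1 rsA 0 150", "X rsC 0 350"]
losses = net_loss_per_chromosome(before, after)
print(losses)
```

With real data, the two arguments could be open file handles for `SEA_Phase2.sex_recoded.map` and the lifted `.map` file, since iterating over a file yields its lines.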

Perform the data preparation steps described in the Michigan Imputation Server documentation: https://imputationserver.readthedocs.io/en/latest/prepare-your-data/

In [None]:
# Load plink module at SCG
# !module load plink tabix

# Python variable; IPython expands {input_file_basename} inside "!" commands.
# (A "!var=..." assignment runs in its own subshell and is lost immediately.)
input_file_basename = 'SEA_Phase2.sex_recoded.hg19_liftover'

# Convert PED/MAP to binary BED/BIM/FAM
!module load plink; plink --file {input_file_basename} --make-bed --out {input_file_basename}

# # Create a frequency file
# !plink --freq --bfile {input_file_basename} --out {input_file_basename}

# # Execute HRC-1000G-check-bim.pl script to prepare data for imputation
# !perl ../tools/HRC-1000G-check-bim.pl \
#   -b {input_file_basename}.bim \
#   -f {input_file_basename}.frq \
#   -r ../aux_files/HRC.r1-1.GRCh37.wgs.mac5.sites.tab \
#   -h
# !sh Run-plink.sh

# # Compress VCF files with bgzip
# !for vcf_file in $(ls *.vcf); do bgzip ${vcf_file}; done