This document outlines how to perform Chromosome Overlap in an HPC environment (LSF). The goal is to obtain a list of closed haplotype patterns from (filtered meta-)chromosomes from a population consisting of cases and controls.
Your input haplotype file haplotype_estimates.cases.${NAME}.txt should be of the following form:
sid | rs0000001_A | rs0000002_G | rs0000003_C | ... |
---|---|---|---|---|
0001 | 0 | 0 | 0 | ... |
0001 | 0 | 1 | 0 | ... |
0002 | 1 | 0 | 0 | ... |
0002 | 1 | 0 | 0 | ... |
Each row is chromosome from subject sid in the cases population; another file should exist for the controls. Alternate alleles of SNP rsid_Alt are indicated with a 1 and reference alleles with a 0. Such files are prepared from the scripts in scripts-haplotype_extraction.
Find all unique chromosomes in haplotype_estimates.cases.${NAME}.txt using
awk 'NR==1{char=length(NF-1)} NR>1{printf "1\t"; for (i=2;i<=NF;i++) {printf "%s%0"char"d_%s",(i>2?",":""),i-1,$i}; printf "\n"}' haplotype_estimates.cases.${NAME}.txt | awk 'BEGIN{OFS="\t"} {seen[$2]+=$1} END{for (i in seen) {print seen[i],i} }' > Pattern_combined.Iteration000.${NAME}_${PATTERN}.txt
where ${NAME} is a location label indicating the chromosome and position of the haplotypes, and ${PATTERN} is a suffix indicating the type of overlap (e.g., 2,j indicates pairwise overlap of and 2,3,j triple-wise). These names will be used for combining the results of different jobs. The output Pattern_combined.Iteration000.${NAME}_${PATTERN}.txt will be a two-column table of counts of meta-chromosomes:
2 | 01_1,02_0,04_0,... |
1 | 01_1,02_1,03_0,... |
... | ... |
Each meta-chromosome is a list of SNPs (indexed by their column numbers in your haplotype file) and their corresponding alleles (0 or 1).
Perform the initial overlap with at least 50 overlaps per job by running:
bsub -P ChromosomeOverlap -J ChromosomeOverlap_iteration_sub_parallel.v3.Iteration000 -oo ChromosomeOverlap_iteration_sub_parallel.v3.Iteration000.out -eo ChromosomeOverlap_iteration_sub_parallel.v3.Iteration000.err -R "rusage[mem=512]" "sh ChromosomeOverlap_iteration_sub_parallel.v3.sh \"${NAME}\" 2 \"${Pattern}\" 50 0"
The overlaps are partioned to nodes using index2combo2.R, which assigns a unique multiindex i_1,i_2,...,i_sigma corresponding to the Ith overlap of $SIGMA chromosomes in a σ-dimensional triangular array:
RScript index2combo2.R I=starting_index n=total_number_of_chromosomes sigma=$SIGMA
The value 50 indicates that the total number of overlaps per job should be 50, which value doubles until the total number of jobs required to complete all pairwise overlaps is below a threshold MAX_JOBS; 2 indicates that the overlaps should consist of two chromosomes each; and 0 indicates that no overlaps have been completed yet (corresponding to Iteration000). The output of this step is a file Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt, containing counts of meta-chromosomes and their corresponding SNP-alleles, and another file Closed_patterns_stats.${NAME}_${PATTERN}.txt, which keeps track of all the closed patterns found so far and the iteration at which they were discovered. The program automatically halts after completing iteration 1 to allow the user to apply p-value-based filtering.
Get the segregation of each pattern discovered at iteration 1 in Closed_patterns_stats.${NAME}_${PATTERN}.txt between cases and controls using:
bsub -P ChromosomeOverlap -J ChromosomeOverlap_haplotype_count_sub.v5.Iteration001 -oo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001.out -eo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001.err -R "rusage[mem=256]" "sh ChromosomeOverlap_haplotype_count_sub.v5.sh haplotype_estimates.cases.{NAME}.txt,haplotype_estimates.controls.${NAME}.txt Closed_patterns_stats.${NAME}_${PATTERN}.txt 50 \"1\" \"Iteration001.${NAME}\""
where again 50 is the initial number of overlaps assigned to each job and 1 instructs the program to only take patterns in Closed_patterns_stats.${NAME}_${PATTERN}.txt discovered at iteration 1. The counts are then fed to ChromosomeOverlap_fisher_exact.R for computation of the p-values. The result is a file fisher_exact.Iteration001.${NAME}.patterns_0000001-total_number_of_patterns.txt with OR in the fourth field and p-value in the fifth. Filtering can then be applied to update the Pattern_combined file using:
awk 'NR==FNR{seen[$1]; next} ($2 in seen){print $0}' <(awk '(NR>1 && $5+0<1e-9 && $4+0>1){print $1}' fisher_exact.Iteration001.${NAME}.patterns_0000001-total_number_of_patterns.txt) Pattern_combined.Iteration001.chr11.69231642-69431642_2,j.txt > Pattern_combined.Iteration001.${NAME}_${PATTERN}.tmp
mv Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt Pattern_combined_old.Iteration001.${NAME}_${PATTERN}.txt
mv Pattern_combined.Iteration001.${NAME}_${PATTERN}.tmp Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt
which selects risk-increasing patterns with p-values less than 1e-9. The unfiltered list is saved as Pattern_combined_old.Iteration001.${NAME}_${PATTERN}.txt. To try a different threshold, first run:
mv Pattern_combined_old.${NAME}_${PATTERN}.txt Pattern_combined.${NAME}_${PATTERN}.txt:
Once the filtered patterns are selected, run:
bsub -P ChromosomeOverlap -J ChromosomeOverlap_iteration_sub_parallel.v3.Iteration001.${NAME} -oo ChromosomeOverlap_iteration_sub_parallel.v3.${NAME}.out -eo ChromosomeOverlap_iteration_sub_parallel.v3.Iteration001.${NAME}.err -R "rusage[mem=512]" "sh ChromosomeOverlap_iteration_sub_parallel.v3.sh \"${NAME}\" 2 \"${PATTERN}\" 50 1"
to initiate the overlaps starting from iteration 1. All new patterns discovered at each iteration will be added to Closed_patterns_stats.${NAME}_${PATTERN}.txt. The process ceases when no new patterns are discovered, typically after five iterations have been completed. The patterns can be assessed for association with disease using Fisher's exact test:
bsub -P ChromosomeOverlap -J ChromosomeOverlap_haplotype_count_sub.v5.Iteration001-005 -oo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001-005.out -eo ChromosomeOverlap_haplotype_count_sub.v5.Iteration001=005.err -R "rusage[mem=256]" "sh ChromosomeOverlap_haplotype_count_sub.v5.sh haplotype_estimates.cases.{NAME}.txt,haplotype_estimates.controls.${NAME}.txt Closed_patterns_stats.${NAME}_${PATTERN}.txt 50 \"\" \"Iteration001-005.${NAME}\""
This code finds the counts of all patterns in Closed_patterns_stats.${NAME}_${PATTERN}.txt, starting from iteration 0. But if only new patterns are desired, one can create a subset file
awk 'BEGIN{OFS="\t"} NR==FNR{seen[$2]; next} ($3 in seen || $1>1){print $0}' Pattern_combined.Iteration001.${NAME}_${PATTERN}.txt Closed_patterns_stats.${NAME}_${PATTERN}.txt > Closed_patterns_stats.Iteration001-005.${NAME}_${PATTERN}.txt
and run ChromosomeOverlap_haplotype_count_sub.v5.sh on it.