forked from vibansal/HapCUT2
Commit
Merge branch 'master' into realignment.
Conflicts: hairs-src/extracthairs.c utilities/README.md utilities/calculate_haplotype_statistics.py
Showing 10 changed files with 512 additions and 292 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
Run HapCUT2 on Duitama et al fosmid data | ||
====== | ||
|
||
This directory contains a bash script for running HapCUT2 on the fosmid data from | ||
the 2012 Duitama et al. study [1]. | ||
|
||
Steps in the script: | ||
1. Download the 1000 Genomes pilot VCF (release 2010_07, hg18) for the NA12878 trio [2]. | ||
2. Filter the VCF to NA12878 heterozygous sites only and split it by chromosome. | ||
3. Check that variant indices and genomic coordinates match between the filtered VCFs and the Duitama phased haplotype data. | ||
4. Download the fosmid fragment files and remove the first line (matrix dimensions), which is incompatible with HapCUT2. | ||
5. Run HapCUT2 on each fragment file to produce HapCUT2 haplotype files. | ||
6. Use the calculate_haplotype_statistics.py script to calculate haplotype accuracy, using the 1000 Genomes trio-phased VCF as ground truth. | ||
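
Step 4 amounts to copying each fragment file without its first line. The following is a minimal Python sketch of that operation, equivalent in spirit to `tail -n +2`. The helper name `drop_first_line` and the file contents are hypothetical placeholders, not real Duitama et al. data.

```python
import os
import tempfile


def drop_first_line(src_path, dst_path):
    """Copy src_path to dst_path without its first line (like `tail -n +2`)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        next(src, None)  # skip the dimensions line; tolerate an empty file
        for line in src:
            dst.write(line)


# Demonstration on a throwaway file with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "example.matrix.SORTED")
    formatted = os.path.join(tmp, "example.formatted")
    with open(original, "w") as f:
        f.write("placeholder dimensions line\nfragment row 1\nfragment row 2\n")
    drop_first_line(original, formatted)
    with open(formatted) as f:
        result = f.read()

print(result)
```

Only the first line is dropped, so the order of the remaining fragment rows is unchanged.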
|
||
More detailed explanation of steps 1-3: | ||
HapCUT2 requires a VCF file along with the fragment data, and the variant indices | ||
in the fragment file must match the order of variants in that VCF. The filtered 1000 Genomes | ||
VCF originally used to generate the fosmid fragments is not provided, so one has to be reconstructed. | ||
Ground-truth haplotypes are also needed for comparison, and the trio-phased 1000 Genomes | ||
variants serve that purpose as well. The bash script therefore downloads the NA12878 trio VCF | ||
from the 1000 Genomes pilot release, filters it with create_NA12878_hg18_vcfs.py to keep | ||
only NA12878 heterozygous sites, and then checks that the genomic coordinates and variant | ||
indices match the RefHap-based phased haplotype files provided by Duitama et al. | ||
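
The link between heterozygous-site filtering and variant indices can be illustrated with a short Python sketch. It is not the repository's script. The helper names `is_het` and `het_positions` are hypothetical, the example records are made up, and the sketch assumes a tab-separated VCF whose tenth column (index 9) holds the NA12878 genotype.

```python
def is_het(sample_field):
    """Return True if a VCF sample field starts with a heterozygous biallelic genotype."""
    return sample_field[:3] in ("0/1", "1/0", "0|1", "1|0")


def het_positions(vcf_lines, sample_col=9):
    """Collect (chrom, pos) for heterozygous records, in file order.

    Assumption: the 1-based rank of a record in the returned list is the
    variant index that the fragment file uses for that site.
    """
    positions = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        el = line.rstrip("\n").split("\t")
        if is_het(el[sample_col]):
            positions.append((el[0], int(el[1])))
    return positions


# Made-up records for illustration only.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\n",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\n",
    "chr1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t1|0\n",
]
kept = het_positions(example)
print(kept)
```

Because homozygous records are dropped before indexing, any difference in filtering shifts every later index. This is why the coordinates are compared against the Duitama et al. files before HapCUT2 is run.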
|
||
## Steps to run | ||
1. On a Linux machine (tested on Ubuntu), clone the repository and build HapCUT2 using the Makefile and the instructions in the main GitHub README. | ||
2. Change to this directory (HapCUT2/reproduce_hapcut2_paper/run_hapcut2_fosmid). | ||
3. Run run_hapcut2_fosmid_data.sh using bash: | ||
|
||
``` | ||
bash run_hapcut2_fosmid_data.sh | ||
``` | ||
or | ||
``` | ||
chmod +x run_hapcut2_fosmid_data.sh | ||
./run_hapcut2_fosmid_data.sh | ||
``` | ||
|
||
The HapCUT2 haplotypes will be written to `data/hapcut2_haplotypes`, and the error rates relative to the 1000 Genomes trio haplotypes will be printed to the console. | ||
|
||
References: | ||
|
||
[1] Duitama, J., McEwen, G.K., Huebsch, T., Palczewski, S., Schulz, S., Verstrepen, K., Suk, E.K. and Hoehe, M.R., 2011. Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques. Nucleic acids research, 40(5), pp.2041-2053. | ||
|
||
[2] 1000 Genomes Project Consortium, 2010. A map of human genome variation from population-scale sequencing. Nature, 467(7319), p.1061. |
35 changes: 35 additions & 0 deletions
reproduce_hapcut2_paper/run_hapcut2_fosmid/create_NA12878_hg18_vcfs.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
|
||
header = '' | ||
prev_chrom = None | ||
chroms = ['chr{}'.format(x) for x in range(1,23)] + ['chrX'] | ||
output_filenames = ['data/NA12878_hg18_VCFs/{}.vcf'.format(c) for c in chroms] | ||
output_files = {chrom : open(f,'w') for (chrom,f) in zip(chroms,output_filenames)} | ||
with open("data/temp/NA12878_trio_genotypes_original.vcf",'r') as infile: | ||
    for line in infile: | ||
        # header lines | ||
        if line[0] == '#': | ||
            if line[:6] == '#CHROM': # need to remove parent sample labels | ||
                el = line.strip().split('\t') | ||
                line = '\t'.join(el[:9] + [el[11]]) | ||
            header += line # add line to header so we can print it to separate chrom files | ||
            continue | ||
|
||
        el = line.strip().split('\t') | ||
        assert len(el) == 12 # VCF line with 3 individuals has 12 elements | ||
        el = el[:9] + [el[11]] # remove parents | ||
|
||
        el[0] = 'chr' + el[0] # add chr label | ||
        chrom = el[0] | ||
        if chrom != prev_chrom: # we're on to a new chromosome | ||
            print(header, file=output_files[chrom]) # print header | ||
|
||
        new_line = '\t'.join(el) | ||
        # heterozygous variants for NA12878 only | ||
        if el[9][:3] in ['0/1','1/0','0|1','1|0']: | ||
            print(new_line, file=output_files[chrom]) | ||
|
||
        prev_chrom = chrom | ||
|
||
# close new VCFs | ||
for f in output_files.values(): | ||
    f.close() |
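
The script above selects the NA12878 genotype by the fixed column index 11. A possible alternative is to look the column up by sample name from the `#CHROM` header line. The sketch below is an illustration only: the helper `sample_column` is hypothetical, and the sample order in the example header is assumed for demonstration rather than taken from the downloaded VCF.

```python
def sample_column(chrom_header_line, sample_name):
    """Return the 0-based column index of sample_name in a '#CHROM' header line.

    Raises ValueError if the sample is not present.
    """
    fields = chrom_header_line.rstrip("\n").split("\t")
    return fields.index(sample_name)


# Assumed sample order, for illustration only.
header_line = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tNA12891\tNA12892\tNA12878\n"
)
col = sample_column(header_line, "NA12878")
print(col)
```

Looking the column up by name would make the filter fail loudly if the sample order ever differed, instead of silently filtering on a parent's genotype.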
63 changes: 63 additions & 0 deletions
reproduce_hapcut2_paper/run_hapcut2_fosmid/run_hapcut2_fosmid_data.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
#!/bin/bash | ||
|
||
HAPCUT2=../../build/HAPCUT2 | ||
HAP_STATISTICS=../../utilities/calculate_haplotype_statistics.py | ||
|
||
|
||
mkdir -p data/temp | ||
# download the thousand genomes VCFs | ||
echo "DOWNLOADING 1000 GENOMES VCFS FOR NA12878 TRIO" | ||
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz \ | ||
    -O data/temp/NA12878_trio_genotypes_original.vcf.gz | ||
# unzip the thousand genomes VCFs | ||
gunzip -c data/temp/NA12878_trio_genotypes_original.vcf.gz > data/temp/NA12878_trio_genotypes_original.vcf | ||
|
||
echo "FILTERING 1000 GENOMES VCFS FOR JUST NA12878 HETEROZYGOUS SITES" | ||
# process the thousand genomes VCFs to be just NA12878 heterozygous sites | ||
mkdir -p data/NA12878_hg18_VCFs | ||
python3 create_NA12878_hg18_vcfs.py | ||
|
||
echo "CHECKING THAT NA12878 VARIANT INDICES/COORDINATES MATCH DUITAMA PHASED DATA" | ||
# download the Duitama et al 'phased matrix'. It has genomic coordinates for the variants | ||
# we basically want to make sure that our processed NA12878 VCF | ||
# has matching variant indices and genomic coordinates | ||
# to Duitama's own Refhap-based phased data. | ||
wget http://www.molgen.mpg.de/~genetic-variation/SIH/Data/haplotypes.tar.gz \ | ||
    -O data/temp/duitama_haplotypes.tar.gz | ||
mkdir -p data/temp/duitama_haplotypes | ||
tar -xzf data/temp/duitama_haplotypes.tar.gz -C data/temp/duitama_haplotypes | ||
python3 sanity_check_variant_indices.py | ||
|
||
echo "DOWNLOADING AND PROCESSING DUITAMA ET AL FRAGMENT FILES" | ||
# download and unzip the Duitama et al phasing matrices (fragment files) | ||
wget http://www.molgen.mpg.de/~genetic-variation/SIH/Data/phasing_matrices.tar.gz \ | ||
    -O data/temp/phasing_matrices.tar.gz | ||
mkdir -p data/NA12878_fosmid_data_original | ||
mkdir -p data/NA12878_fosmid_data_formatted | ||
tar -xzf data/temp/phasing_matrices.tar.gz -C data/NA12878_fosmid_data_original | ||
|
||
# remove first line of fragment files | ||
# Duitama et al phasing matrices have the first line as matrix dimensions | ||
# this does not work with HapCUT2 | ||
for i in {1..22} X | ||
do | ||
    tail -n +2 data/NA12878_fosmid_data_original/chr${i}.matrix.SORTED \ | ||
        > data/NA12878_fosmid_data_formatted/chr${i}.matrix.SORTED | ||
done | ||
|
||
echo "RUNNING HAPCUT2 ON EACH CHROMOSOME OF DUITAMA ET AL DATA" | ||
mkdir -p data/hapcut2_haplotypes | ||
# run HapCUT2 on each chromosome | ||
for i in {1..22} X; do | ||
    $HAPCUT2 --fragments data/NA12878_fosmid_data_formatted/chr${i}.matrix.SORTED \ | ||
        --vcf data/NA12878_hg18_VCFs/chr${i}.vcf \ | ||
        --out data/hapcut2_haplotypes/chr${i}.hap | ||
done | ||
|
||
echo "COMPARING ASSEMBLED HAPLOTYPE TO 1000G PHASE DATA FOR ACCURACY" | ||
python3 $HAP_STATISTICS -v1 data/NA12878_hg18_VCFs/chr{1..22}.vcf \ | ||
    data/NA12878_hg18_VCFs/chrX.vcf \ | ||
    -h1 data/hapcut2_haplotypes/chr{1..22}.hap \ | ||
    data/hapcut2_haplotypes/chrX.hap \ | ||
    -v2 data/NA12878_hg18_VCFs/chr{1..22}.vcf \ | ||
    data/NA12878_hg18_VCFs/chrX.vcf |
27 changes: 27 additions & 0 deletions
reproduce_hapcut2_paper/run_hapcut2_fosmid/sanity_check_variant_indices.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
|
||
chroms = ['chr{}'.format(x) for x in range(1,23)] + ['chrX'] | ||
|
||
for c in chroms: | ||
    duitama_chrom_pos = [] | ||
    with open("data/temp/duitama_haplotypes/{}.real_refhap.phase".format(c),'r') as duitama_file: | ||
        for line in duitama_file: | ||
            el = line.strip().split() | ||
            pos = int(el[0]) | ||
            duitama_chrom_pos.append((c, pos)) | ||
|
||
    NA12878_chrom_pos = [] | ||
    with open("data/NA12878_hg18_VCFs/{}.vcf".format(c),'r') as NA12878_vcf: | ||
        for line in NA12878_vcf: | ||
            # header lines | ||
            if line[0] == '#': | ||
                continue | ||
|
||
            el = line.strip().split('\t') | ||
            chrom = el[0] # already carries the chr label | ||
            pos = int(el[1]) | ||
|
||
            NA12878_chrom_pos.append((chrom,pos)) | ||
|
||
    print("checking that indices in Duitama haplotype with {} elements match indices in NA12878 haplotype with {} elements...".format(len(duitama_chrom_pos), len(NA12878_chrom_pos))) | ||
    assert duitama_chrom_pos == NA12878_chrom_pos | ||
    print("...PASSED") |
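
The bare `assert` above reports only that the two lists differ. If a mismatch ever needs debugging, a helper along the following lines could locate the first differing entry. This is a hedged sketch: `first_mismatch` and `entry` are hypothetical names, and the coordinates are made up.

```python
def first_mismatch(expected, observed):
    """Return the 0-based index of the first differing entry, or None if equal."""
    for i, (e, o) in enumerate(zip(expected, observed)):
        if e != o:
            return i
    if len(expected) != len(observed):
        # one list is a strict prefix of the other
        return min(len(expected), len(observed))
    return None


def entry(items, i):
    """Return items[i], or a marker string when i is past the end of the list."""
    return items[i] if i < len(items) else "<missing>"


# Made-up coordinates for illustration only.
duitama = [("chr1", 100), ("chr1", 300), ("chr1", 450)]
na12878 = [("chr1", 100), ("chr1", 310), ("chr1", 450)]

idx = first_mismatch(duitama, na12878)
if idx is None:
    print("...PASSED")
else:
    # report a 1-based variant index, as used for variant indices in fragment files
    print("first mismatch at variant index {}: {} vs {}".format(
        idx + 1, entry(duitama, idx), entry(na12878, idx)))
```

Reporting the index and both coordinates shows whether the lists diverge at a single site or are shifted from some point onward.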