1000 Genomes CEU Trio Analysis

jingquanlim edited this page May 10, 2015 · 22 revisions

Data Availability

The BAM files used in this analysis are available from:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120117_ceu_trio_b37_decoy/

A copy of the calls used in our Bioinformatics paper is on our FTP site. The comparison table in the paper was produced with RetroSeq v1.32.

Reference Files

Reference genome: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

The Alu and L1 BED files are derived directly from RepeatMasker. A copy of the BED files used in the analysis is here: ftp://ftp-mouse.sanger.ac.uk/other/tk2/RetroSeq/hg19/

We also used Alu and L1 sequence files to increase the sensitivity of the discovery stage. These were derived directly from Repbase. A copy of the files used is here: ftp://ftp-mouse.sanger.ac.uk/other/tk2/RetroSeq/hg19/hg19_probes.tgz

Discovery Phase

The command lines used for the discovery stages were:

retroseq.pl -discover -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam -output CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align

retroseq.pl -discover -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam -output CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align

retroseq.pl -discover -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam -output CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam.candidates.tab -refTEs ref_types.tab -eref probes.tab -align
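The three discovery commands above differ only in the sample ID, so they can be generated from a single template. The following is a minimal dry-run sketch: `discover_cmd` is a hypothetical helper name, and it only prints the command lines rather than running RetroSeq.

```shell
#!/bin/sh
# Print the discovery command for one sample (dry run: nothing is executed).
# discover_cmd is a hypothetical helper; the options are those quoted above.
discover_cmd() {
    bam="CEUTrio.HiSeq.WGS.b37_decoy.$1.clean.dedup.recal.bam"
    printf 'retroseq.pl -discover -bam %s -output %s.candidates.tab -refTEs ref_types.tab -eref probes.tab -align\n' "$bam" "$bam"
}

# One command line per member of the CEU trio.
for sample in NA12878 NA12891 NA12892; do
    discover_cmd "$sample"
done
```

To run the commands rather than print them, the output could be piped to `sh`, assuming `retroseq.pl` is on the `PATH` and the input files are in the working directory.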

The -refTEs input file should be tab-delimited, with one line per element type in the format <TE_name><tab><path to BED file>:

Alu    /home/me/data/Alu.bed
L1    /home/me/data/L1.bed

The -eref input file should be tab-delimited, with one line per element type in the format <TE_name><tab><path to FASTA file>:

Alu    /home/me/data/Alu.fasta
L1    /home/me/data/L1.fasta
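Because both mapping files are sensitive to the column separator, it can help to generate them with explicit tab characters and check them before starting a long run. This is a sketch only: `write_te_map` and `check_te_map` are hypothetical helper names, and the `/home/me/data/` paths are the placeholders from the examples above.

```shell
#!/bin/sh
# Write a two-column, tab-separated mapping file: TE name, then file path.
# write_te_map is a hypothetical helper; paths are placeholders from the text.
write_te_map() {
    out=$1; ext=$2
    : > "$out"                                   # create or truncate the file
    for te in Alu L1; do
        printf '%s\t%s\n' "$te" "/home/me/data/$te.$ext" >> "$out"
    done
}

# Succeed only if every line has exactly two non-empty tab-separated fields.
check_te_map() {
    awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { bad = 1 } END { exit bad + 0 }' "$1"
}

write_te_map ref_types.tab bed      # file passed to -refTEs
write_te_map probes.tab fasta       # file passed to -eref
check_te_map ref_types.tab && check_te_map probes.tab && echo "mapping files look well-formed"
```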

Calling Phase

The command lines used for the calling phase were:

retroseq.pl -call -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam -input CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.candidates.tab -ref hs37d5.fa -output CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.vcf -filter ref_types.tab -reads 10 -depth 400

retroseq.pl -call -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam -input CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam.candidates.tab -ref hs37d5.fa -output CEUTrio.HiSeq.WGS.b37_decoy.NA12891.clean.dedup.recal.bam.vcf -filter ref_types.tab -reads 10 -depth 400

retroseq.pl -call -bam CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam -input CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam.candidates.tab -ref hs37d5.fa -output CEUTrio.HiSeq.WGS.b37_decoy.NA12892.clean.dedup.recal.bam.vcf -filter ref_types.tab -reads 10 -depth 400

Final Call Filtering

The final calls were filtered in two ways to produce the final callsets.

First, we removed calls that lie very close to repeat elements annotated in the reference. This was done with the bedtools 'window' command (plus 'intersect' for L1):

Alu

bedtools window -b Alu.bed -a CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.Alu.vcf -v -w 100 > NA12878.ref-filtered.Alu.vcf

L1

bedtools intersect -b Alu.bed -a CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.L1.rm_ref.vcf -v > NA12878.ref_filtered_alu.L1.vcf

bedtools window -b L1_HS.bed -a NA12878.ref_filtered_alu.L1.vcf -v -w 200 > NA12878.ref_filtered_alu.ref_filtered_L1.L1.vcf
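To make the effect of the `window -v -w` steps concrete, the sketch below approximates them in awk: a call is kept only if no reference element lies within the given number of base pairs on the same chromosome. It is an illustration under stated assumptions, not a substitute for bedtools. `window_filter` is a hypothetical name, each call is treated as a single base at POS, the BED file is assumed to be non-empty, 0-based and half-open, and small enough to hold in memory.

```shell
#!/bin/sh
# Illustrative approximation of `bedtools window -v -w W`: print VCF records
# that have no BED interval within W bp on the same chromosome.
# Assumptions: each call is a single base at POS; the BED file is non-empty,
# 0-based and half-open. Use bedtools itself for real data.
window_filter() {
    bed=$1; vcf=$2; w=$3
    awk -F'\t' -v w="$w" '
        NR == FNR {                          # first file: load BED intervals
            n[$1]++
            s[$1, n[$1]] = $2; e[$1, n[$1]] = $3
            next
        }
        /^#/ { print; next }                 # keep VCF header lines
        {
            lo = $2 - 1 - w; hi = $2 + w     # call interval widened by w on both sides
            for (i = 1; i <= n[$1]; i++)
                if (s[$1, i] < hi && e[$1, i] > lo) next   # near a reference element: drop
            print
        }
    ' "$bed" "$vcf"
}

# Example call (file names as in the Alu step above):
# window_filter Alu.bed CEUTrio.HiSeq.WGS.b37_decoy.NA12878.clean.dedup.recal.bam.Alu.vcf 100
```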

Finally, we selected calls from the VCF file that matched any one of the following INFO tag combinations:

FL=6 & GQ>=28

FL=7 & GQ>=20

FL=8 & GQ>=20
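The selection above can be sketched as an awk filter. This is a hedged example rather than the script used for the paper: `filter_calls` is a hypothetical name, and because the exact location of the tags in the VCF is an assumption here, the sketch looks for FL and GQ first as KEY=VALUE entries in the INFO column and then as per-sample values named in the FORMAT column.

```shell
#!/bin/sh
# Hypothetical helper: keep VCF records whose FL/GQ values meet the thresholds above.
# Assumption: FL and GQ appear either as KEY=VALUE entries in INFO (column 8)
# or as fields named in FORMAT (column 9) with values in the first sample (column 10).
filter_calls() {
    awk -F'\t' '
        /^#/ { print; next }                     # pass header lines through unchanged
        {
            fl = ""; gq = ""
            # Look for FL= and GQ= in the INFO column.
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^FL=/) fl = substr(info[i], 4)
                if (info[i] ~ /^GQ=/) gq = substr(info[i], 4)
            }
            # Otherwise look them up by name in FORMAT / first sample column.
            m = split($9, keys, ":"); split($10, vals, ":")
            for (i = 1; i <= m; i++) {
                if (keys[i] == "FL" && fl == "") fl = vals[i]
                if (keys[i] == "GQ" && gq == "") gq = vals[i]
            }
            if (fl == "" || gq == "") next       # skip records lacking either tag
            fl += 0; gq += 0
            if ((fl == 6 && gq >= 28) || ((fl == 7 || fl == 8) && gq >= 20)) print
        }
    ' "$@"
}

# Example call (the output file name is a placeholder):
# filter_calls NA12878.ref-filtered.Alu.vcf > NA12878.final.Alu.vcf
```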