Issue with converting VCF to genlight #25

Closed
conkline opened this issue Sep 5, 2018 · 6 comments

Comments

@conkline

conkline commented Sep 5, 2018

Hi Thierry,

Thanks for all your work with radiator! I'm getting an error when trying to convert a VCF file (v4.2, biallelic, produced by FreeBayes) to genlight. This is the command used:

genomic_converter(data="~/Downloads/TotalRawSNPsHISEQ.biallelic.vcf.recode.vcf", output='genlight', vcf.metadata = TRUE, strata="~/Downloads/strata.tsv")

And the output/error message:

#######################################################################
##################### radiator::genomic_converter #####################
#######################################################################
Function arguments and values:
Working directory: /Users/Emily
Input file: ~/Downloads/TotalRawSNPsHISEQ.biallelic.vcf.recode.vcf
Strata: ~/Downloads/strata.tsv
Population levels: no
Population labels: no
Output format(s): tidy, genlight
Filename prefix: no
Filters: 
Blacklist of individuals: no
Blacklist of genotypes: no
Whitelist of markers: no
monomorphic.out: TRUE
snp.ld: no
common.markers: TRUE
max.marker: no
pop.select: no
maf.thresholds: no

Imputations options:
imputation.method: no

parallel.core: 3

#######################################################################

Importing data


Reading VCF...
Large vcf file may take several minutes...

conversion timing: 128 sec

radiator is working on the file ...
VCF is biallelic
Updating markers metadata and stats
Error in cbind_all(x) : Argument 3 must be length 378, not 3
In addition: Warning message:
In mclapply(seq_len(njobs), mc.preschedule = FALSE, mc.cores = njobs,  :
  3 function calls resulted in an error

And here is my sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] bindrcpp_0.2.2  radiator_0.0.16 adegenet_2.1.1  ade4_1.7-13    

loaded via a namespace (and not attached):
 [1] nlme_3.1-137           bitops_1.0-6          
 [3] gmodels_2.18.1         GenomeInfoDb_1.16.0   
 [5] tools_3.5.1            R6_2.2.2              
 [7] vegan_2.5-2            spData_0.2.9.0        
 [9] lazyeval_0.2.1         BiocGenerics_0.26.0   
[11] mgcv_1.8-24            colorspace_1.3-2      
[13] permute_0.9-4          sp_1.3-1              
[15] tidyselect_0.2.4       compiler_3.5.1        
[17] expm_0.999-2           scales_0.5.0          
[19] readr_1.1.1            stringr_1.3.1         
[21] digest_0.6.15          XVector_0.20.0        
[23] pkgconfig_2.0.1        htmltools_0.3.6       
[25] fst_0.8.8              rlang_0.2.2           
[27] shiny_1.1.0            bindr_0.1.1           
[29] gtools_3.8.1           spdep_0.7-7           
[31] dplyr_0.7.6            RCurl_1.95-4.11       
[33] magrittr_1.5           GenomeInfoDbData_1.1.0
[35] Matrix_1.2-14          Rcpp_0.12.18          
[37] munsell_0.4.3          S4Vectors_0.18.3      
[39] ape_5.1                stringi_1.2.4         
[41] yaml_2.1.18            MASS_7.3-50           
[43] zlibbioc_1.26.0        plyr_1.8.4            
[45] grid_3.5.1             parallel_3.5.1        
[47] gdata_2.18.0           listenv_0.7.0         
[49] promises_1.0.1         deldir_0.1-15         
[51] lattice_0.20-35        Biostrings_2.48.0     
[53] splines_3.5.1          hms_0.4.2             
[55] pillar_1.2.1           igraph_1.2.1          
[57] GenomicRanges_1.32.6   boot_1.3-20           
[59] seqinr_3.4-5           reshape2_1.4.3        
[61] codetools_0.2-15       gdsfmt_1.16.0         
[63] stats4_3.5.1           LearnBayes_2.15.1     
[65] glue_1.3.0             data.table_1.11.4     
[67] httpuv_1.4.5           gtable_0.2.0          
[69] purrr_0.2.5            tidyr_0.8.1           
[71] SeqArray_1.21.4        future_1.9.0          
[73] amap_0.8-16            assertthat_0.2.0      
[75] ggplot2_3.0.0          mime_0.5              
[77] xtable_1.8-2           coda_0.19-1           
[79] later_0.7.3            tibble_1.4.2          
[81] pbmcapply_1.2.5        IRanges_2.14.11       
[83] cluster_2.0.7-1        globals_0.12.1

Any input/help would be much appreciated!
Emily

@thierrygosselin
Owner

Thanks for reporting the bug, Emily.
Would you mind sending the VCF and strata files by email? It will be faster for me to reproduce the error and fix it.
Best,
thierrygosselin@icloud.com

@thierrygosselin
Owner

Sorry for the long delay, Emily. The problem is outside radiator, and I'm currently working to resolve it.

It's related to how freeBayes/dDocent generate the VCF (not ideal; full details below).

In the VCF header, the FORMAT fields are declared as follows:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">

So anyone reading the header would expect 10 values for each individual: GT, GQ, GL, DP, AD, RO, QR, AO, QA and MIN_DP. However, the individuals only have 8 of those: GT, DP, AD, RO, QR, AO, QA, GL (GQ and MIN_DP are missing).

The FORMAT column itself is okay and lists the 8 expected fields: GT:DP:AD:RO:QR:AO:QA:GL.

Some software/packages rely on what's described in the header to parse the VCF file. That's not the only problem, though: removing those header lines from the VCF doesn't solve the error, and the genotypes (the GT field in the VCF) are still not output...
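The header/FORMAT-column comparison described above can be sketched in a few lines of Python. This is only an illustration, not what radiator does internally: the function names `declared_ids` and `format_mismatch` are made up for this example, and the long `Description` strings from the header are abbreviated.

```python
import re

# FORMAT declarations from the header quoted above.
# Assumption: descriptions are shortened here; only the IDs matter.
HEADER = """\
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Number of observation for each allele">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">
"""


def declared_ids(header_text, kind="FORMAT"):
    """Return the IDs declared in '##FORMAT=<ID=...' (or '##INFO=...') lines, in order."""
    pattern = re.compile(r"^##" + re.escape(kind) + r"=<ID=([^,>]+)", re.MULTILINE)
    return pattern.findall(header_text)


def format_mismatch(header_text, format_column):
    """Compare IDs declared in the header with the fields of one FORMAT column."""
    declared = declared_ids(header_text, "FORMAT")
    used = format_column.split(":")
    return {
        "declared_not_used": [i for i in declared if i not in used],
        "used_not_declared": [i for i in used if i not in declared],
    }


# FORMAT column as reported in this thread.
print(format_mismatch(HEADER, "GT:DP:AD:RO:QR:AO:QA:GL"))
```

With the header and FORMAT column from this file, the two IDs reported as declared but never used are GQ and MIN_DP.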

more later...

@thierrygosselin
Owner

The INFO fields have the same problem:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=PQR,Number=1,Type=Float,Description="Reference allele quality sum in phred for partial observations">
##INFO=<ID=PQA,Number=A,Type=Float,Description="Alternate allele quality sum in phred for partial observations">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
##INFO=<ID=SRP,Number=1,Type=Float,Description="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=SAP,Number=A,Type=Float,Description="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
##INFO=<ID=ABP,Number=A,Type=Float,Description="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RUN,Number=A,Type=Integer,Description="Run length: the number of consecutive repeats of the alternate allele in the reference genome">
##INFO=<ID=RPP,Number=A,Type=Float,Description="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RPPR,Number=1,Type=Float,Description="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=RPL,Number=A,Type=Float,Description="Reads Placed Left: number of reads supporting the alternate balanced to the left (5') of the alternate allele">
##INFO=<ID=RPR,Number=A,Type=Float,Description="Reads Placed Right: number of reads supporting the alternate balanced to the right (3') of the alternate allele">
##INFO=<ID=EPP,Number=A,Type=Float,Description="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=EPPR,Number=1,Type=Float,Description="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality">
##INFO=<ID=DPRA,Number=A,Type=Float,Description="Alternate allele depth ratio.  Ratio between depth in samples with each called alternate allele and those without.">
##INFO=<ID=ODDS,Number=1,Type=Float,Description="The log odds ratio of the best genotype combination to the second-best.">
##INFO=<ID=GTI,Number=1,Type=Integer,Description="Number of genotyping iterations required to reach convergence or bailout.">
##INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
##INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
##INFO=<ID=NUMALT,Number=1,Type=Integer,Description="Number of unique non-reference alleles in called genotypes at this position.">
##INFO=<ID=MEANALT,Number=A,Type=Float,Description="Mean number of unique non-reference allele observations per sample with the corresponding alternate alleles.">
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">
##INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
##INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
##INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
##INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
##INFO=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum depth in gVCF output block.">
##INFO=<ID=END,Number=1,Type=Integer,Description="Last position (inclusive) in gVCF output record.">
##INFO=<ID=technology.Illumina,Number=A,Type=Float,Description="Fraction of observations supporting the alternate observed in reads from Illumina">

However, only the following are found for each SNP (44 described and expected in the header, 42 found in the data; MIN_DP and END are absent):

AB, ABP, AC, AF, AN, AO, CIGAR, DP, DPB, DPRA, EPP, EPPR, GTI, LEN, MEANALT, MQM, MQMR, NS, NUMALT, ODDS, PAIRED, PAIREDR, PAO, PQA, PQR, PRO, QA, QR, RO, RPL, RPP, RPPR, RPR, RUN, SAF, SAP, SAR, SRF, SRP, SRR, TYPE, technology.Illumina
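As a rough sketch, the same bookkeeping for the INFO fields can be done by collecting the keys actually present in the records and diffing them against the header. The two ID lists below are copied from this thread; the helper `info_keys` and the short INFO string in the last lines are hypothetical, added only to show how keys would be pulled out of a record.

```python
# IDs declared in the ##INFO header lines quoted above (44 entries).
DECLARED = [
    "NS", "DP", "DPB", "AC", "AN", "AF", "RO", "AO", "PRO", "PAO", "QR", "QA",
    "PQR", "PQA", "SRF", "SRR", "SAF", "SAR", "SRP", "SAP", "AB", "ABP", "RUN",
    "RPP", "RPPR", "RPL", "RPR", "EPP", "EPPR", "DPRA", "ODDS", "GTI", "TYPE",
    "CIGAR", "NUMALT", "MEANALT", "LEN", "MQM", "MQMR", "PAIRED", "PAIREDR",
    "MIN_DP", "END", "technology.Illumina",
]

# IDs reported as found in the data (42 entries).
FOUND = (
    "AB, ABP, AC, AF, AN, AO, CIGAR, DP, DPB, DPRA, EPP, EPPR, GTI, LEN, "
    "MEANALT, MQM, MQMR, NS, NUMALT, ODDS, PAIRED, PAIREDR, PAO, PQA, PQR, "
    "PRO, QA, QR, RO, RPL, RPP, RPPR, RPR, RUN, SAF, SAP, SAR, SRF, SRP, "
    "SRR, TYPE, technology.Illumina"
).split(", ")


def info_keys(info_column):
    """Return the keys of one INFO column ('KEY=value;KEY2=value;FLAG')."""
    return [entry.partition("=")[0] for entry in info_column.split(";") if entry]


def missing_from_data(declared, found):
    """IDs declared in the header that never show up in the records."""
    seen = set(found)
    return [i for i in declared if i not in seen]


print(len(DECLARED), len(FOUND))             # 44 42
print(missing_from_data(DECLARED, FOUND))    # ['MIN_DP', 'END']

# Hypothetical INFO string, only to illustrate key extraction.
print(info_keys("NS=1;DP=3;TYPE=snp"))
```

In practice `FOUND` would be built by applying `info_keys` to the INFO column of every record and taking the union of the results.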

@thierrygosselin
Owner

correction: genotypes are parsed correctly...

@thierrygosselin
Owner

weird stuff in your vcf...

e.g. line 83, column 17 (sample ID: BRSPHaw_024)
GT, DP, AD, RO, QR, AO, QA, GL = ./.:1:1,0,0:1:22:0,0:0,0:0,-0.30103,-2.19784,-0.30103,-2.19784,-2.19784

or

./.
1
1,0,0
1
22
0,0
0,0
0,-0.30103,-2.19784,-0.30103,-2.19784,-2.19784

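The genotype is missing or was erased, and the remaining fields show 3 values for allele depth (AD) and 6 values for GL. For a biallelic diploid marker, AD (Number=R) should have 2 values and GL (Number=G) should have 3; 3 and 6 are what a marker with two alternate alleles would carry. Your VCF is multi-allelic (haplotypes), but that marker is supposed to be biallelic (REF/ALT: C/T)...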
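A check of this kind can be sketched as follows, assuming a diploid sample and the `Number=` codes declared in the FORMAT header (A = one value per ALT allele, R = one per allele including REF, G = one per possible genotype). The function names are invented for this example and this is not radiator's actual code.

```python
from math import comb

# Number= codes taken from the ##FORMAT header lines in this thread.
FORMAT_NUMBER = {
    "GT": "1", "DP": "1", "AD": "R", "RO": "1",
    "QR": "1", "AO": "A", "QA": "A", "GL": "G",
}


def expected_counts(n_alt, ploidy=2):
    """Expected number of values for the A, R and G codes."""
    n_alleles = n_alt + 1
    return {
        "A": n_alt,
        "R": n_alleles,
        # Unordered genotypes: combinations with repetition of the alleles.
        "G": comb(n_alleles + ploidy - 1, ploidy),
    }


def check_sample(format_column, sample, n_alt, numbers=FORMAT_NUMBER):
    """Return (field, expected, observed) for every field with the wrong count."""
    expected = expected_counts(n_alt)
    problems = []
    for key, value in zip(format_column.split(":"), sample.split(":")):
        if value == ".":  # a lone '.' marks a missing field, skip it
            continue
        code = numbers[key]
        want = expected[code] if code in expected else int(code)
        got = len(value.split(","))
        if got != want:
            problems.append((key, want, got))
    return problems


# Sample BRSPHaw_024 from the record discussed above.
SAMPLE = "./.:1:1,0,0:1:22:0,0:0,0:0,-0.30103,-2.19784,-0.30103,-2.19784,-2.19784"
FORMAT = "GT:DP:AD:RO:QR:AO:QA:GL"

print(check_sample(FORMAT, SAMPLE, n_alt=1))  # marker treated as biallelic (C/T)
print(check_sample(FORMAT, SAMPLE, n_alt=2))  # marker treated as having two ALT alleles
```

Treated as biallelic, the sample fails on AD and GL, and also on AO and QA (2 values each where 1 is expected); treated as a marker with two ALT alleles, every count matches.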

This reminds me of the genepop format where, over time, people started generating variants of the file format that became increasingly difficult to parse correctly!

@thierrygosselin
Owner

should work now
