Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Converting VCF to Genepop #177

Closed
alexkrohn opened this issue Mar 31, 2023 · 12 comments
Closed

Error Converting VCF to Genepop #177

alexkrohn opened this issue Mar 31, 2023 · 12 comments

Comments

@alexkrohn
Copy link

alexkrohn commented Mar 31, 2023

I'm encountering a problem that I don't fully understand when trying to convert a VCF to Genepop format (to use in NeEstimator).

Here is the code that I'm running:

library(radiator)
library(tidyverse)

# Import inds from vcf
inds <- extract_individuals_vcf("~/myvcf.vcf")

# Import population names for each ind
pop.names <- read.table("~/pop-file.csv", sep = ",", header=TRUE)

# Create strata df with inds and pop
strata.df <- inds %>%
  mutate(STRATA = pop.names$pop[match(inds$INDIVIDUALS, pop.names$ind)])

# Read vcf with strata file to assign inds to pop
vcf.radiator <- read_vcf("~/myvcf.vcf",
                         strata = strata.df)

# Write genepop file with popnames
write_genepop(vcf.radiator,
filename = "~/genepop.txt")

The pop-file.csv contains two columns: ind with a list of individual names as they appear in the VCF, and pop with their population assignments.

The strata do seem to be properly imported to the VCF. read_vcf() outputs:

################################################################################
############################## radiator::read_vcf ##############################
################################################################################
Execution date@time: 20230331@1006
Folder created: read_vcf_20230331@1006
Function call and arguments stored in: radiator_read_vcf_args_20230331@1006.tsv
File written: random.seed (134852)                                  
✔ Reading VCF [727ms]
Analyzing VCF
VCF source: ipyrad_v.0.9.81
Data is bi-allelic
Synchronizing data and strata...
    Number of strata: 4
    Number of individuals: 104
Reads assembly: reference-assisted

Cleaning VCF
Filters parameters file generated: filters_parameters_20230331@1006.tsv
                                                                    
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2 / 0 / 9
                                                                    
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2118 / 0 / 3948

Preparing output files...
File written: whitelist.markers.tsv                                 
File written: blacklist.markers.tsv                                 
File written: strata.filtered.tsv                                   

VCF statistics per individuals and markers
Generating statistics
✔ SNPs per locus [17ms]
✔ SNP position on the read [32ms]
✔ Missing genotypes [84ms]
✔ Heterozygosity [141ms]
ℹ MAC

Caution: More than 2 alleles detected in the dataset
File written: markers_number_alleles_problem.tsv                    
✔ MAC [78ms]
✔ Coverage ... [378ms]
✔ Generating figures [1.4s]
✔ Writing files [16ms]                                              
################################### SUMMARY ####################################

VCF summary
Missing data: 
    markers: 0.19
    individuals: 0.19


Coverage info:
    individuals mean total coverage: 97246
    individuals mean genotype coverage: 20
    markers mean coverage: 20

VCF info:
Number of chromosome/contig/scaffold: 2127
Number of locus: 1
Number of markers: 4875
Number of strata: 4
Number of individuals: 104

Number of ind/strata:
MS = 44
APP = 24
Other = 28
SUW = 8

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20230331@1006.gds

Computation time, overall: 4 sec
############################## completed read_vcf ##############################

However, write_genepop() gives an error saying that the column STRATA doesn't exist. Any idea why this might be? Here's the full write_genepop() error:

LOCUS field empty... adding unique id instead
Calibrating REF/ALT alleles...
✔ Reading data [200ms]
✔ Recoding genotypes... [244ms]
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `STRATA` doesn't exist.
Run `rlang::last_trace()` to see where the error occurred.
✖ Preparing data [46ms]
@thierrygosselin
Copy link
Owner

Dear Alex,

Big reptile fan here... Several at tortoises, turtles and lizards at home.

Try running genomic_converter:

test <- radiator::genomic_converter(data = "myvcf.vcf", strata = strata.df, output = "genepop", filename = "genepop.txt")

You can always send me in private the files to speed up the debugging

Best
Thierry

@alexkrohn
Copy link
Author

Hi Thierry!

Well I'm a big fan of Montreal, and radiator, so the admiration is mutual!

I tried running genomic_converter, but got an error during the reading step:

...
Genotypes formats generated with 4875 SNPs: 
    GT_BIN (the dosage of ALT allele: 0, 1, 2 NA): TRUE
    GT_VCF (the genotype coding VCFs: 0/0, 0/1, 1/1, ./.): TRUE
    GT_VCF_NUC (the genotype coding in VCFs, but with nucleotides: A/C, ./.): TRUE
    GT (the genotype coding 'a la genepop': 001002, 001001, 000000): TRUE
LOCUS field empty... adding unique id instead
DP column: cleaning and renaming to READ_DEPTH
CATG columns: cleaning and renaming C_DEPTH, A_DEPTH, T_DEPTH, G_DEPTH
Error in `dplyr::mutate()`:
ℹ In argument: `ALLELE_REF_DEPTH = dplyr::case_when(...)`.
Caused by error in `dplyr::case_when()`:
! Failed to evaluate the left-hand side of formula 1.
Caused by error:
! object 'REF' not found

Interestingly, read_vcf works fine here. Is there a place to specify that this is a VCF generated from a de novo assembly with ipyrad?

As an aside, I'm working to migrate my scripts away from PGDSpider, so I definitely appreciate the genomic_converter script that you've created.

@thierrygosselin
Copy link
Owner

I usually deal with stacks, ipyrad vcf specificity automatically
Do you mind sharing via email the vcf and pop map
I'm not able to reproduce with my ipyrad vcf test files.

@thierrygosselin
Copy link
Owner

With latest push (v.1.2.8) problem should be fixed

@thierrygosselin
Copy link
Owner

Depending on what you intend to do, several route to get the genepop file and all should work:

data <- radiator::read_vcf(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv")
radiator::write_genepop(data = data)

or this

data.tidy <- radiator::tidy_vcf(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv")
radiator::write_genepop(data = data.tidy)

or this:

test <- radiator::genomic_converter(data = "104inds075maxmissing.recode.vcf", strata = "strata-df.tsv", output = "genepop")

@thierrygosselin
Copy link
Owner

But since you want to work with NeEstimator and I assume your data is filtered, I did a couple checks.
Assuming your stratification was ok, when you read the VCF file you should have this:

################################################################################
############################## radiator::read_vcf ##############################
################################################################################
Execution date@time: 20230403@1705
Folder created: read_vcf_20230403@1705
Function call and arguments stored in: radiator_read_vcf_args_20230403@1705.tsv
File written: random.seed (175401)                                  
✔ Reading VCF [4.3s]
Analyzing VCF
VCF source: ipyrad_v.0.9.81
Data is bi-allelic
Synchronizing data and strata...                                                                         
    Number of strata: 4
    Number of individuals: 104
Reads assembly: reference-assisted

Cleaning VCF
Filters parameters file generated: filters_parameters_20230403@1705.tsv
                                                                    
Filter monomorphic markers
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2 / 0 / 9
                                                                    
Filter common markers:
Number of individuals / strata / chrom / locus / SNP:
    Blacklisted: 0 / 0 / 2118 / 0 / 3948

Preparing output files...
File written: whitelist.markers.tsv                                 
File written: blacklist.markers.tsv                                 
File written: strata.filtered.tsv                                   

VCF statistics per individuals and markers
Generating statistics
✔ SNPs per locus [41ms]
✔ SNP position on the read [80ms]
✔ Missing genotypes [310ms]
✔ Heterozygosity [432ms]
ℹ MAC

Caution: More than 2 alleles detected in the dataset
File written: markers_number_alleles_problem.tsv                    
✔ MAC [296ms]
✔ Coverage ... [1.3s]
✔ Generating figures [3.1s]
✔ Writing files [34ms]                                              
################################### SUMMARY ####################################

VCF summary
Missing data: 
    markers: 0.19
    individuals: 0.19


Coverage info:
    individuals mean total coverage: 97246
    individuals mean genotype coverage: 20
    markers mean coverage: 20

VCF info:
Number of chromosome/contig/scaffold: 2127
Number of locus: 1
Number of markers: 4875
Number of strata: 4
Number of individuals: 104

Number of ind/strata:
MS = 44
APP = 24
Other = 28
SUW = 8

Number of duplicate id: 0
radiator Genomic Data Structure (GDS) file: radiator_20230403@1705.gds

Computation time, overall: 18 sec
############################## completed read_vcf ##############################

You see that 9 SNPs were removed because it was considered monomorphic.
This usually happens when you have low coverage and remove samples.
The program you used didn't check reference/alternate alleles and stuff like that

radiator also automatically removed the markers not in common between your population/strata, it's quite a few 45% so I would check the filters you did before generating that VCF.

You can check the folder: 03_filter_common_markers look for the upsetR plot (png file) you will see that it's your SUW pop driving down the SNP number.

@thierrygosselin
Copy link
Owner

2 more checks I did...because of the red flags I got from looking at the read_vcf folder

  1. duplicate samples
dup <- radiator::detect_duplicate_genomes(data = data)

the first one looks for duplicates. Technical ones forgotten or real ones generated by lab problems.
Answer no during the function run. You don't want to blacklist samples at the first run. You want to check the figures it will generate.

When you look at the Manhattan plot. Usually samples < 0.25 are considered duplicate or I look really hard into the pair of samples. They are not that much, usually.

And depending on projects, the pair of samples around the 0.5 mark are usually considered close kin... and if you run colony or Kinference it will highlight the relationship deeper.

If you have dots that are grey, it means that the samples pair are from different strata/pop. Usually not good if it's < or around 0.25. I would use the file individuals.pairwise.dist.tsv open in excel and look at the samples...

Conclusion:

  • very curious what's the species
  • you seem to have a problem with duplicates or it's a voluntary one (= technical replicates)
  • I wouldn't do a run NeEstimator with this dataset...

@thierrygosselin
Copy link
Owner

  1. mixed samples...
mix <- radiator::detect_mixed_genomes(data = data)
Inspect plots and tables in folder created...
    Do you want to exclude individuals based on heterozygosity ? (y/n):

Answer no to this.

Have a look at the boxplot and Manhattan figures. You will understand why the previous analysis generated false-closekin ...

From the Manhattan plot I would say that you have a missing data problem:

  • it's generating an individual heterozygosity problem
  • probably that this is because you have varying DNA quality (bad samples)
  • or that several different people contributed to the DNA extraction or other wet-lab steps...
  • SUW are all outliers in terms of missing data
  • APP is very different as well and in this strata you have an outlier maybe even a different specie
  • the missingness and heterozygosity problems are driving the duplicates ... and false close kin
  • so this will also generate all kind of bias in genetic analysis
  • be careful
  • you can write by email if you want more explanations...

@alexkrohn
Copy link
Author

Thanks for your detailed explanation, Thierry. I really appreciate it.

This dataset is from Alligator Snapping Turtles, and the three strata represent different species. Even the among-population divergence is quite strong in these species (FST ~ 0.4), but the among-species divergence is higher than I've ever seen (FST ~ 0.8, despite only being separated by ~50 km).

It's not surprising that there are lots of missing sites among the species. I came to this conclusion separately, but it is great to know how easy it can be done in radiator.

I was running this dataset through NeEstimator just out of curiosity to see if any of these species might meet the proposed IUCN cutoff for low genetic diversity (Ne << 50). I knew it wasn't the best way to calculate Ne, but I appreciate your feedback nonetheless. Thanks for helping me get radiator working with this vcf.

@thierrygosselin
Copy link
Owner

Different species make sense. MATE094 and MATE263 are probably not in the good strata. But even that, the duplicate analysis shows problematic close kin relationship. If it's really different species, I strongly suggest filtering separately before running NeEstimator or LDNe.

Good Luck with your analysis

@alexkrohn
Copy link
Author

alexkrohn commented Apr 7, 2023 via email

@alexkrohn
Copy link
Author

Hey @thierrygosselin. Still having the same problem with genomic_converter() runing v.1.2.8. I'm using a different VCF output from ipyrad this time, but the same species. Here's my sessionInfo()

R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin22.3.0 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.2.3/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] radiator_1.2.8      googlesheets4_1.1.0 hierfstat_0.5-11    adegenet_2.1.10     ade4_1.7-22         vcfR_1.14.0        
 [7] lubridate_1.9.2     forcats_1.0.0       stringr_1.5.0       dplyr_1.1.2         purrr_1.0.1         readr_2.1.4        
[13] tidyr_1.3.0         tibble_3.2.1        ggplot2_3.4.2       tidyverse_2.0.0     doParallel_1.0.17   iterators_1.0.14   
[19] foreach_1.5.2      

loaded via a namespace (and not attached):
  [1] googledrive_2.1.0      colorspace_2.1-0       seqinr_4.2-30          ellipsis_0.3.2         rprojroot_2.0.3       
  [6] XVector_0.38.0         GenomicRanges_1.50.2   fs_1.6.1               rstudioapi_0.14        farver_2.1.1          
 [11] remotes_2.4.2          ggrepel_0.9.3          bit64_4.0.5            fansi_1.0.4            codetools_0.2-19      
 [16] splines_4.2.3          cachem_1.0.7           memuse_4.2-3           pkgload_1.3.2          jsonlite_1.8.4        
 [21] cluster_2.1.4          shiny_1.7.4            compiler_4.2.3         httr_1.4.5             Matrix_1.5-4          
 [26] fastmap_1.1.1          gargle_1.3.0           cli_3.6.1              later_1.3.0            prettyunits_1.1.1     
 [31] htmltools_0.5.5        tools_4.2.3            igraph_1.4.2           gtable_0.3.3           glue_1.6.2            
 [36] GenomeInfoDbData_1.2.9 reshape2_1.4.4         rappdirs_0.3.3         Rcpp_1.0.10            cellranger_1.1.0      
 [41] vctrs_0.6.2            Biostrings_2.66.0      ape_5.7-1              nlme_3.1-162           pinfsc50_1.2.0        
 [46] ps_1.7.4               miniUI_0.1.1.1         timechange_0.2.0       mime_0.12              lifecycle_1.0.3       
 [51] devtools_2.4.5         MASS_7.3-58.3          zlibbioc_1.44.0        scales_1.2.1           vroom_1.6.1           
 [56] ragg_1.2.5             hms_1.1.3              promises_1.2.0.1       gdsfmt_1.34.1          rematch2_2.1.2        
 [61] curl_5.0.0             memoise_2.0.1          gridExtra_2.3          UpSetR_1.4.0           stringi_1.7.12        
 [66] paletteer_1.5.0        desc_1.4.2             S4Vectors_0.36.2       permute_0.9-7          BiocGenerics_0.44.0   
 [71] pkgbuild_1.4.0         GenomeInfoDb_1.34.9    rlang_1.1.1            pkgconfig_2.0.3        systemfonts_1.0.4     
 [76] bitops_1.0-7           matrixStats_0.63.0     SeqArray_1.38.0        lattice_0.21-8         htmlwidgets_1.6.2     
 [81] labeling_0.4.2         processx_3.8.0         bit_4.0.5              tidyselect_1.2.0       plyr_1.8.8            
 [86] magrittr_2.0.3         R6_2.5.1               profvis_0.3.7          IRanges_2.32.0         generics_0.1.3        
 [91] pillar_1.9.0           withr_2.5.0            mgcv_1.8-42            RCurl_1.98-1.12        crayon_1.5.2          
 [96] utf8_1.2.3             urlchecker_1.0.1       tzdb_0.3.0             usethis_2.1.6          grid_4.2.3            
[101] data.table_1.14.8      callr_3.7.3            vegan_2.6-4            digest_0.6.31          xtable_1.8-4          
[106] httpuv_1.6.9           textshaping_0.3.6      openssl_2.0.6          stats4_4.2.3           munsell_0.5.0         
[111] viridisLite_0.4.2      sessioninfo_1.2.2      askpass_1.1           

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants