A few issues with genomic_converter() function #10

Closed
IdoBar opened this issue Oct 2, 2017 · 5 comments

IdoBar commented Oct 2, 2017

Hi Thierry,

Working with the package, mainly to clean, import and convert SNP data to different formats, I've been using the genomic_converter() function and ran into a few issues with its behaviour:

  1. When exporting to SNPRelate format, it ignores the provided output filename and creates a date-signature based one (see related pull request).
  2. Using the vcf.metadata=TRUE argument with a VCF file results in an error (object DP not found).
  3. Confusing inconsistency in the argument rules: blacklist.id accepts either a file or a data.frame object, while blacklist.genotype only accepts the name of a file containing a data.frame. I know this is in the function documentation, but the inconsistency confused me for a while until I double-checked the fine details. I suggest making both arguments work with R objects, which makes much more sense than relying on files.
  4. snp.ld lets you choose the first, last or a random SNP, but to me it makes sense to also allow choosing a SNP that is neither first nor last, because SNPs at the ends of the tag are often supported by fewer reads and are less usable in validation (if flanking primers are to be designed).

That's it for now, thanks, Ido
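
The suggestion in point 3 (one argument that takes either an R object or a file) can be sketched in base R. This is only an illustration: `as_blacklist()` and the `INDIVIDUALS` column are hypothetical names, not part of the package, and a tab-delimited file with a header is assumed.

```r
# Hypothetical helper, not part of the package: accept a blacklist either as a
# data.frame already in memory or as the path to a tab-delimited file.
as_blacklist <- function(x) {
  if (is.data.frame(x)) {
    return(x)
  }
  if (is.character(x) && length(x) == 1L && file.exists(x)) {
    return(utils::read.delim(x, stringsAsFactors = FALSE))
  }
  stop("blacklist must be a data.frame or the path to an existing file")
}

# Usage: the same blacklist supplied as an object and as a file.
bl <- data.frame(INDIVIDUALS = c("ind_01", "ind_07"), stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".tsv")
write.table(bl, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

from_object <- as_blacklist(bl)
from_file <- as_blacklist(tmp)
```

Normalising the input once, at the top of the function, means the filtering code downstream only ever sees a data.frame.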

thierrygosselin (Owner) commented

Awesome, thanks Ido!

  1. The SNPRelate issue is fixed. However, the output option will be removed today with the next release, see my comment on this.

  2. vcf.metadata=TRUE: I saw this problem yesterday, and I think it originates from my last fix, which tried to overcome the problem with metadata provided by stacks. The GL is still present in the VCF header but not in the FORMAT field of each genotype. The packages I'm using, vcfR and pegas, are confused by this. I thought I had fixed the issue with the remaining arguments, but I will have to do further tests.

  3. blacklist.id and blacklist.genotype: good catch! I am incrementally adding the ability to pass either an object or a file, and was about to test it with blacklist.genotype. I've finished the test with whitelist.markers (it's not in the doc, but it works with an object...). This will be in the next release today.

  4. snp.ld: I'll implement something. Having more than 2-3 SNPs on a 100 bp read is not really a good sign, but it will be a nice addition to test. What should the default behaviour be if you only have 2 SNPs? Use the first or the last one? (I would opt for the first.) And if more than 3 SNPs are present, use one at random.

thanks
Thierry

thierrygosselin (Owner) commented

Hi Ido,

  1. SNPRelate : done
  2. vcf.metadata = TRUE: fixed
  3. blacklist.id, blacklist.genotype and whitelist.markers : all behaving the same now.
  4. snp.ld: a new option is implemented and called middle. Details in the function doc.
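
For illustration only, here is one way a per-locus SNP choice with a "middle" strategy could look in base R. `pick_snp()` is a hypothetical helper, not the package's implementation, and two behaviours are assumptions: "middle" draws one interior SNP at random, and it falls back to the first SNP when a locus has fewer than 3 SNPs. The function doc remains the reference for what the real option does.

```r
# Hypothetical helper, not the package's implementation: pick one SNP position
# per locus according to a strategy.
pick_snp <- function(pos, method = c("first", "last", "middle", "random")) {
  method <- match.arg(method)
  pos <- sort(pos)
  n <- length(pos)
  switch(method,
    first = pos[1],
    last = pos[n],
    # Assumption: with fewer than 3 SNPs there is no interior SNP, so fall
    # back to the first; otherwise draw one interior SNP at random.
    middle = if (n < 3) pos[1] else pos[1 + sample.int(n - 2, 1)],
    random = pos[sample.int(n, 1)]
  )
}

# Usage: locus 1 has four SNPs, locus 2 has only two.
snps <- data.frame(LOCUS = c(1, 1, 1, 1, 2, 2), POS = c(5, 30, 61, 92, 12, 80))
kept <- tapply(snps$POS, snps$LOCUS, pick_snp, method = "middle")
```

Indexing with `1 + sample.int(n - 2, 1)` avoids the `sample()` pitfall where a length-one vector such as `2:2` is treated as the range `1:2`.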

Cheers
Thierry

IdoBar commented Oct 3, 2017

The new version (0.0.6) now completely fails to import data while applying filters, with the following error:

Error in UseMethod("filter_") : 
   no applicable method for 'filter_' applied to an object of class "character"

I suspect it has something to do with the changes made to accommodate blacklist.genotype as a data.frame, but I haven't looked into that yet.
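
This class of error comes from S3 dispatch: a generic is called on an object whose class has no method. The toy reproduction below uses a made-up generic, `keep_rows()`, as a stand-in for the real verb, and assumes (as suspected above, not confirmed) that a file name reached the filtering step instead of a data.frame.

```r
# Toy generic standing in for a filtering verb; keep_rows() is hypothetical.
keep_rows <- function(x, keep) UseMethod("keep_rows")
keep_rows.data.frame <- function(x, keep) x[keep, , drop = FALSE]

# A file name is a character vector, and no character method is defined.
blacklist_path <- "blacklist.genotype.tsv"

msg <- tryCatch(
  keep_rows(blacklist_path, TRUE),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The captured message has the same shape as the one reported, with `keep_rows` in place of `filter_`.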

Reverting back to the older version for now.
Cheers, Ido

thierrygosselin (Owner) commented

Currently checking this along with another problem I've detected.

This mainly affects VCF files.
To get unique markers, I combine CHROM__LOCUS__POS into a MARKERS column.
The separator is 2 underscores (it's the only one I've found that doesn't interfere with other packages). This is what radiator and grur currently use to export whitelists.
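
As a small base R illustration of the MARKERS construction described above (the data values are made up, and this is a sketch rather than the package's code):

```r
# Made-up example rows; columns kept as character to mirror VCF text fields.
vcf <- data.frame(
  CHROM = c("un", "un", "chr1"),
  LOCUS = c("12", "12", "7"),
  POS = c("45", "83", "19"),
  stringsAsFactors = FALSE
)

# Build a unique marker ID with a double underscore as separator.
vcf$MARKERS <- paste(vcf$CHROM, vcf$LOCUS, vcf$POS, sep = "__")

# Split it back into its three components.
parts <- do.call(rbind, strsplit(vcf$MARKERS, "__", fixed = TRUE))
colnames(parts) <- c("CHROM", "LOCUS", "POS")
```

Because single underscores are common in chromosome and scaffold names, splitting on `__` keeps the round trip unambiguous as long as no component itself contains two consecutive underscores.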

Since stacks version 1.44, the position of the SNP on the haplotype/read is included in the ID column of the VCF file. The ID column is now no longer unique and no longer corresponds to LOCUS, so it requires parsing to get back the LOCUS info, which is really a pain. This should have been included in the POS column (the problem was raised on the Google group).
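
The parsing step could look like the sketch below. The ID layout is an assumption made for illustration, namely `<locus>_<column>` with the SNP column after the last underscore; check it against the actual VCF before relying on it.

```r
# Assumed ID layout (illustrative values): "<locus>_<column>", e.g. "12_45".
id <- c("12_45", "12_83", "7_19")

# Drop the last underscore and everything after it to recover LOCUS.
locus <- sub("_[^_]*$", "", id)

# Keep only what follows the last underscore: the SNP column on the read.
col <- sub("^.*_", "", id)
```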

The whitelists and blacklists were intended to be used in R with the packages and a tidy dataset; before this stacks update, they could also be used with a stacks VCF file.

I suspect the problem is related to the whitelist, blacklist and blacklist.genotype locus info being used to filter the VCF file rather than the tidy dataset. I'll have to check this...

Otherwise, with a tidy dataset it works as intended.
Thierry

thierrygosselin (Owner) commented

Works with the latest commit.
Re-open the issue if you have any problems.

Best
Thierry
