Feature request: tip dates functionality #27

Closed
richelbilderbeek opened this issue Oct 15, 2018 · 7 comments

Comments

@richelbilderbeek
Member

From @ksw9 at this Issue:

Great resource! Do you have plans for a function to import tip dates?
Thank you!

@richelbilderbeek
Member Author

If I have a user with a use case: definitely! Could I use you for this?

If yes, I'll take a look at how to implement this on Friday, October 19th 2018.

If it is easy, I may add it that day. If it is too hard, I will give priority to getting babette accepted by rOpenSci and then put on CRAN.

@richelbilderbeek
Member Author

No use case, so worked on getting babette accepted by rOpenSci.

If someone volunteers for a use case, let me know.

@richelbilderbeek
Member Author

Peter Durr has volunteered to help. 🎉

@richelbilderbeek
Member Author

Email from Peter Durr and example files:

[...] I appreciate that you probably only wanted some example files.
But when I started looking at the problem, I realized that it was quite complex, due to the challenges of getting the date file working.

Anyway, attached are three files which will give you - I trust - a good example on which to base the tip dating function within your Beautier library:

  1. a fasta alignment file of 58 sequences of an important virus that causes epidemics in chickens ("Newcastle disease virus"): G_VII_pre2003_msa.fasta
  2. a tab-separated list of the fasta headers in the alignment file plus the year of isolation of the virus: G_VII_pre2003_dates_4.txt
  3. an XML file generated from BEAUti using the above two files - to check the data was OK:

The challenge I found was that creating the date file for BEAUti is very crude.

For the dates to be uploaded and the height built - the number of decimal years before the most recent common sample - the user must upload a file with two columns/fields:

  1. the name field, which must be exactly identical to the sequence header in the fasta file.
  2. the date field must follow the name field with a TAB
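
As a rough illustration of the "height" described above, here is a minimal Python sketch. It is not BEAUti's or beautier's code; the function name `tip_heights` and the sample names are hypothetical, and the dates are assumed to be already converted to decimal years.

```python
def tip_heights(tip_years):
    """Map each tip name to its height: decimal years before the most recent sample.

    tip_years: dict of sequence name -> sampling date as a decimal year.
    """
    if not tip_years:
        raise ValueError("no tip dates given")
    most_recent = max(tip_years.values())
    return {name: most_recent - year for name, year in tip_years.items()}


# Hypothetical sample names and dates, for illustration only.
heights = tip_heights({"seq_a": 1998.0, "seq_b": 2002.5, "seq_c": 2000.0})
for name, height in heights.items():
    # Name and value separated by a TAB, as the upload file requires.
    print(f"{name}\t{height}")
```

The most recently sampled tip gets height 0, and every older tip gets a positive height.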

The very restrictive nature of the permissible upload file means that it often fails - with no error message explaining why it failed! This is especially a problem with the requirement for tab-separation, as BEAUti does not accept a simple TSV export from Excel. Instead I needed to run it through various steps to get it to work - thus the file has a "4" in its name!

In practice, because uploading a separate date file is so hard, all of the tutorials on producing a time-tree in BEAST I have seen use the tip-dating tool which extracts the date from the fasta header.

This does have the advantage that the fasta sequence file and the date file will always be in the correct order, which is a potential problem if the two files are uploaded separately. However, this then puts the effort back into producing a complex header - with all the risk of introducing errors when manipulating the concatenation.

I am also guessing that implementing this complex interface using R functions will be a lot of work for you, as well as needing a complex R function with lots of arguments.

So thinking it through, I would like to recommend the following for babette/beautier:

The input date file:

  1. The preferred (maybe only) way beautier accepts date input will be by a separate file upload - as this avoids the need to implement a tip-date parsing tool
  2. The date format to be restricted to dd/M/yyyy, M/yyyy or yyyy
  3. The date upload file must contain two comma-separated columns: the sequence ID and the date.
  4. The sequence ID must be contained in the fasta file header, but the sequence ID (in the date file) does not have to equate to the header
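
The proposed file format could be read along these lines. This is only a Python sketch under stated assumptions, not beautier's implementation: `read_tip_dates` and `detect_format` are hypothetical names, "M" is assumed to be a numeric month, and the IDs in the example are made up.

```python
import csv
import io
import re

# Assumed patterns for the three proposed formats ('M' taken as a numeric month).
DATE_PATTERNS = {
    "dd/M/yyyy": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "M/yyyy": re.compile(r"^\d{1,2}/\d{4}$"),
    "yyyy": re.compile(r"^\d{4}$"),
}


def detect_format(date_text):
    """Return the name of the matching date format, or None if none matches."""
    for name, pattern in DATE_PATTERNS.items():
        if pattern.match(date_text):
            return name
    return None


def read_tip_dates(text):
    """Read 'sequence ID,date' rows into a list of (id, date, format) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        if len(record) != 2:
            raise ValueError(f"expected 2 columns, got {len(record)}: {record}")
        seq_id, date_text = (field.strip() for field in record)
        fmt = detect_format(date_text)
        if fmt is None:
            raise ValueError(f"unrecognised date for {seq_id}: {date_text!r}")
        rows.append((seq_id, date_text, fmt))
    return rows


# Hypothetical IDs and years, for illustration only.
rows = read_tip_dates("ID_0001,1997\nID_0002,2001\n")
```

Only the shape of each date is checked here; a value such as 45/13/2001 would still pass and would need a separate calendar check.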

This I think will make the preparation of the date file very easy, but more importantly it would allow for some validation at import/parsing:

  1. the number of dates in the date file corresponds to the number of sequences in the fasta file. Validation by counting the number of records in each file. Error message example: "number of sequences: 58; number of tip dates: 61"
  2. all of the dates follow one (and only one) of the three allowable formats. Validation: each of the entries in the second column is checked against the three formats to confirm a permissible format has been entered. Error message example: "Two date formats detected - only one date format is allowed"
  3. each ID in the date file can be matched with the corresponding fasta sequence header. Validation: the date ID is used to query the fasta header and to confirm it is present within it. Error message example: "The following date IDs could not be matched to a sequence: ........."
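
The three checks above could be sketched as follows. Again this is an illustrative Python sketch, not what beautier does: `validate_tip_dates` is a hypothetical name, the date patterns assume a numeric month, and the headers and IDs in the example are invented.

```python
import re

# Assumed patterns for the three proposed formats ('M' taken as a numeric month).
DATE_PATTERNS = {
    "dd/M/yyyy": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
    "M/yyyy": re.compile(r"^\d{1,2}/\d{4}$"),
    "yyyy": re.compile(r"^\d{4}$"),
}


def validate_tip_dates(fasta_headers, tip_dates):
    """Return a list of error messages; an empty list means all checks passed.

    fasta_headers: list of header strings (without the leading '>').
    tip_dates: list of (sequence ID, date text) pairs from the date file.
    """
    errors = []

    # Check 1: one date per sequence.
    if len(fasta_headers) != len(tip_dates):
        errors.append(
            f"number of sequences: {len(fasta_headers)}; "
            f"number of tip dates: {len(tip_dates)}"
        )

    # Check 2: every date uses one, and only one, of the allowed formats.
    formats, unrecognised = set(), []
    for _, date_text in tip_dates:
        fmt = next((n for n, p in DATE_PATTERNS.items() if p.match(date_text)), None)
        if fmt is None:
            unrecognised.append(date_text)
        else:
            formats.add(fmt)
    if unrecognised:
        errors.append("Unrecognised dates: " + ", ".join(unrecognised))
    if len(formats) > 1:
        errors.append(
            f"{len(formats)} date formats detected - only one date format is allowed"
        )

    # Check 3: every date ID occurs inside some fasta header.
    unmatched = [
        seq_id for seq_id, _ in tip_dates
        if not any(seq_id in header for header in fasta_headers)
    ]
    if unmatched:
        errors.append(
            "The following date IDs could not be matched to a sequence: "
            + ", ".join(unmatched)
        )
    return errors


# Hypothetical headers and dates, for illustration only.
headers = ["ID_0001|chicken|1997", "ID_0002|chicken|2001"]
dates = [("ID_0001", "1997"), ("ID_0002", "3/2001"), ("ID_0009", "2002")]
problems = validate_tip_dates(headers, dates)
```

Because the ID only has to be contained in the header, a short ID can match more than one header (for example "ID_1" inside "ID_10"); a stricter check would require exactly one match per ID.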

To make the above practical, I have attached as the fourth file a CSV date file exported from Excel containing just the Genbank accession ID and the year (date).

[...]

richelbilderbeek changed the title from "Feature request: Tip dates functionality" to "Feature request: tip dates functionality" on Nov 11, 2018
@richelbilderbeek
Member Author

richelbilderbeek commented Nov 11, 2018

This is very helpful!

I will add an argument called 'tip_dates' that requires a data frame. Let the parsing be done by the caller 🌈

[edit: will follow Peter's idea to use a filename instead]

richelbilderbeek pushed a commit to ropensci/beautier that referenced this issue Nov 11, 2018
@richelbilderbeek
Member Author

Got halfway; will finish on the 16th (p = 25%), 23rd (p = 50%) or 30th (p = 99%) of November.

@richelbilderbeek
Member Author

Done. Not tested to the bone, but I was able to reproduce the file supplied by Peter.
