All elements need id(?), strategy for generating ids #14

Closed
cboettig opened this issue Aug 12, 2013 · 11 comments
Comments

@cboettig
Member

Need to generate ids for nodes such as <otus> and <trees>, etc. Should we use uuid for this?

@rvosa
Contributor

rvosa commented Aug 13, 2013

Do we expect people to alter the DOM tree a lot (i.e. are there risks of clashes if we use a simpler scheme)? Otherwise maybe tag name + a counter is more concise?

@rvosa rvosa closed this as completed Aug 13, 2013
@rvosa rvosa reopened this Aug 13, 2013
@cboettig
Member Author

@rvosa Yeah, I'm not sure -- still trying to wrap my head around this one. Most users will probably only use the top-level API for writing an ape::phylo tree or list of trees (ape::multiPhylo) to NeXML, in which case we can number them as we go. But the S4-class-based interface we have so far also allows users to just coerce ape::phylo trees into the S4 RNeXML::tree class, which can then be inserted into NeXML later. Perhaps we don't want users doing that, but this means they could modularly build up the DOM and we then have to watch out for collisions.

Or in more concrete terms, I have this setAs("phylo", "tree" ...) subroutine for mapping phylo objects to the S4 object that mimics the schema. Since the phylo object doesn't have an ID, I either have to generate one at this time, or otherwise add the id when adding the tree to an existing or new nexml/trees object. Does that make sense?

In other news, the validator complains that UUIDs aren't valid id attributes:

... is not a valid value of the atomic type 'xs:ID'

@hlapp
Contributor

hlapp commented Aug 15, 2013

On Aug 14, 2013, at 10:59 PM, Carl Boettiger wrote:

In other news, the validator complains that UUIDs aren't valid id attributes:

That sounds like a validator bug.

@rvosa
Contributor

rvosa commented Aug 15, 2013

What do the UUIDs look like? The schema specifies that the type of @id is
xs:ID, which is a non-colonized name (NCName), so instance documents must
conform to the production rules of NCNames (probably most importantly:
start with a letter or an underscore). If they don't, I don't see how the
validator is at fault here.

Dr. Rutger A. Vos
Bioinformaticist
Naturalis Biodiversity Center
Visiting address: Office A109, Einsteinweg 2, 2333 CC, Leiden, the
Netherlands
Mailing address: Postbus 9517, 2300 RA, Leiden, the Netherlands
http://rutgervos.blogspot.com

@hlapp
Contributor

hlapp commented Aug 15, 2013

On Aug 15, 2013, at 5:44 AM, Rutger Vos wrote:

What do the UUIDs look like? [...] instance documents must conform to the production rules of NCNames (probably most importantly: start with a letter or an underscore). If they don't, I don't see how the
validator is at fault here.

UUIDs can start with a digit.

@cboettig: I suggest that if you choose UUIDs, you put them in the form of a urn:uuid: scheme. See http://www.ietf.org/rfc/rfc4122.txt

@rvosa
Contributor

rvosa commented Aug 16, 2013

IDs need to be non-colonized names, i.e. strings without colons. If I
understand your suggestion correctly, the UUIDs would contain colons, which
would be a no-no.

@hlapp
Contributor

hlapp commented Aug 16, 2013

On Aug 16, 2013, at 6:27 AM, Rutger Vos wrote:

IDs need to be non-colonized names, i.e. strings without colons. If I
understand your suggestion correctly, the UUIDs would contain colons, which
would be a no-no.

So HTTP URIs can't be IDs?

@rvosa
Contributor

rvosa commented Aug 19, 2013

Not normally. However, IDs can become part of HTTP URIs when transforming documents to RDF as they are then made globally unique by prefixing them with either the location of the document or the value of xml:base of the nearest ancestor node that contains this attribute. (Note that I didn't just make this up or anything.)

I see where you're going with this line of questioning. If we want HTTP URIs as IDs (good id(ea)), use xml:base.
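A rough sketch of what that prefixing could look like, assuming fragment-style resolution (base, then `#`, then the local id). The helper name `global_uri` and the example.org base are made up for illustration and are not part of NeXML or RNeXML:

```r
# Hypothetical helper: combine a document base URI with a local id.
# Assumes fragment-style resolution: <base>#<id>.
global_uri <- function(base, id) {
  # Drop any fragment already present on the base before appending the id.
  paste0(sub("#.*$", "", base), "#", id)
}

global_uri("http://example.org/trees.xml", "t1")
# "http://example.org/trees.xml#t1"
```

The local id itself stays a plain NCName (no colons), and only the resolved form is a full HTTP URI.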

@cboettig
Member Author

I was just using the uuid package, which generates uuids that look like:

> UUIDgenerate()
[1] "f7af80aa-dfb2-4134-aa82-db1c0e9e7980"

No colons, so I'm not sure why the validator (accessed with the R wrapper to libxml2) is unhappy.

Regardless, not sure uuids were a good idea for this purpose anyhow. The current workflow doesn't give the user the same flexibility over the DOM directly, so we probably don't have to worry about a user creating two S4 "tree" objects and then sticking them in the same nexml with duplicated IDs.

Instead, there is a method for phylo->nexml that creates the ids for otus as t1, t2..., nodes as n1, n2..., edges as e1, e2... etc (done). A separate method for multiPhylo -> nexml will allow the user to add multiple trees while avoiding id conflicts (not written yet). With a sensible top-level API I think we should be fine using these simple ids(?)

@cboettig
Member Author

cboettig commented Sep 6, 2013

Okay, I think we're happy with our only locally unique ids for the moment. (Though still unsure what was wrong with the uuid above according to the validator...). Anyway, closing this issue.

@cboettig cboettig closed this as completed Sep 6, 2013
@cboettig
Member Author

It appears that strings starting with a number were not valid ids (and uuids often start with numbers).
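For anyone wanting to check a candidate id before writing it out, here is a minimal sketch of the rule. It is an ASCII-only approximation of the NCName production (the real one also admits many non-ASCII characters), and `is_ncname_ascii` is a made-up name, not an RNeXML function:

```r
# Simplified, ASCII-only approximation of the NCName rule behind xs:ID:
# the first character is a letter or underscore; the rest may be letters,
# digits, '.', '-' or '_'. Colons are never allowed.
is_ncname_ascii <- function(x) {
  grepl("^[A-Za-z_][A-Za-z0-9._-]*$", x)
}

is_ncname_ascii("f7af80aa-dfb2-4134-aa82-db1c0e9e7980")          # TRUE: starts with a letter
is_ncname_ascii("7af80aa0-dfb2-4134-aa82-db1c0e9e7980")          # FALSE: starts with a digit
is_ncname_ascii("urn:uuid:f7af80aa-dfb2-4134-aa82-db1c0e9e7980") # FALSE: contains colons
```

Note that the uuid quoted earlier in this thread happens to start with `f`, so it passes this check; it is the uuids that begin with a digit that fail.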

To address this, all functions that assign ids use the internal method nexml_id(), which creates locally unique ids from a given character prefix; e.g. edges use nexml_id("e") to get ids like e1, e2, etc, using an internal counter. The counter for each prefix starts at 1 and increases each time an id with that prefix is requested in that R session, unless reset with reset_id_counter(). This local counter scheme is used by default.
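To illustrate the idea (this is not the RNeXML source), a per-prefix counter could be sketched roughly as below. `next_id`, `reset_counters` and `id_counters` are hypothetical stand-ins, and the real nexml_id() and reset_id_counter() may well be implemented differently:

```r
# Hypothetical sketch: per-prefix counters kept in an environment so that
# they persist across calls within one R session.
id_counters <- new.env()

next_id <- function(prefix) {
  # Look only in id_counters itself, not in enclosing environments.
  n <- if (exists(prefix, envir = id_counters, inherits = FALSE)) {
    get(prefix, envir = id_counters) + 1L
  } else {
    1L
  }
  assign(prefix, n, envir = id_counters)
  paste0(prefix, n)
}

# Clear all counters so numbering starts again at 1.
reset_counters <- function() {
  rm(list = ls(id_counters), envir = id_counters)
}

next_id("e")  # "e1"
next_id("e")  # "e2"
next_id("n")  # "n1"
```

Because every id starts with its letter prefix, ids built this way always satisfy the "must not start with a digit" rule.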

The command options(uuid=TRUE) will make RNeXML use uuids for all id attributes instead. To avoid the validation error, these are prepended with uuid-. This option can be issued per session or put in the user's .Rprofile as persistent configuration. options(uuid=FALSE) sets the behavior back to the local identifiers.
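A rough sketch of how the switch could behave, assuming the uuid package's UUIDgenerate() (the same function used earlier in this thread). `make_id` and its `uuid_fun` argument are hypothetical names for illustration and are not the actual RNeXML internals:

```r
# Hypothetical sketch of choosing an id scheme from the "uuid" option.
# uuid_fun defaults to uuid::UUIDgenerate and is only evaluated when
# the uuid scheme is actually selected.
make_id <- function(prefix, counter, uuid_fun = uuid::UUIDgenerate) {
  if (isTRUE(getOption("uuid", FALSE))) {
    # The "uuid-" prefix guarantees the id starts with a letter.
    paste0("uuid-", uuid_fun())
  } else {
    paste0(prefix, counter)
  }
}

options(uuid = FALSE)
make_id("e", 1)  # "e1"
```

With options(uuid = TRUE) set, the same call would instead return a string of the form uuid- followed by a freshly generated uuid.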

test_global_ids.R provides a unit test that we generate valid nexml when using the global (uuid) id scheme.


3 participants