Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinctness of different kinds of refget digest IDs #329

Closed
jmarshall opened this issue Aug 15, 2018 · 6 comments
Closed

Distinctness of different kinds of refget digest IDs #329

jmarshall opened this issue Aug 15, 2018 · 6 comments

Comments

@jmarshall
Copy link
Member

Sequences are accessed via

GET /sequence/<id>

where <id> is described as

A string specifying the sequence to be returned. The identifier shall be a checksum derived from the sequence using one of the supported checksum algorithms, or an alias for the sequence supported by the server.

However there doesn't appear to be a way for the client to say which checksum algorithm or what kind of alias they are intending to be using.

The currently-defined checksum algorithms are md5 (ids are 32 hex digits) and trunc512 (ids are 48 hex digits). So fortunately it is trivial for the server to determine which of md5 or trunc512 is being used by looking at the <id>'s length. The format of “aliases” is not described however, and it appears a maliciously-constructed alias could collide with an extant checksum <id>.

  1. Is the intention that any more checksum algorithms defined in the future will also have easily distinguished <id> values?

  2. Would it be a good idea to prevent aliases from colliding with valid checksum <id> values?
    For example, servers might only consider an <id> to potentially be an alias if it has 30 or fewer characters or if it contains non-hexdigit characters.

@andrewyatz
Copy link
Contributor

Good questions.

Is the intention that any more checksum algorithms defined in the future will also have easily distinguished values?

That was the idea. I would only want to expand the supported checksum range as/when we've found there to be an issue with an existing supported checksum i.e. if trunc512 and md5 resulted in a collision we would expand the digest size used by trunc512.

Would it be a good idea to prevent aliases from colliding with valid checksum values?
For example, servers might only consider an to potentially be an alias if it has 30 or fewer characters or if it contains non-hexdigit characters.

I'm not totally sure. I viewed the aliases section as a bit where an API can say "I believe this is a known alias for this ID". Nothing more. Those known aliases could be other checksums e.g. if UniParc implemented this they could provide their crc64 checksums as an alias. Part of me feels that this is a buyer beware situation.

What might be a good solution is to extend the API to, as you said, specify the type of checksum being used. That way aliases can collide but they would never resolve.

@yfarjoun
Copy link
Contributor

yfarjoun commented Aug 15, 2018 via email

@andrewyatz
Copy link
Contributor

Thanks for the comment on this.

Firstly versioning of refget is handled through the use of the VND mime type and supported versions are declared via the /service-info endpoint. Using VNDs is similar to how htsget is setup. If an implementor wished to URL version their implementation they would be free to do so but as a prefix to all URLs not inline as you have suggested here.

Secondly refget is not built to support sequence retrieval using an alias. Imagine the following URL /sequence/alias/chr1 and how impossible this is to resolve without additional metadata. Refget is trying to resolve this situation by using checksums so supporting alias lookup feels like it's going against refget's ethos.

Finally adding in specific endpoints for each checksum doubles the number of endpoints we support out of the box (/sequence/{checksum}/{id} and /sequence/{checksum}/{id}/metadata x {checksum}) could require duplication of documentation and means non-standard checksum extensions would also require this endpoint duplication.

Personally I think a parameter such as /sequence/{id}?algorithm=md5 would be a more scalable solution.

@andrewyatz
Copy link
Contributor

We just spoke on our 3-weekly call and discussed this issue. The conclusion was for the first iteration of refget the two supported algorithms can be identified by length alone. We will expand the spec when algorithm detection it becomes a problem.

The intention is that future checksums would be solved by either expanding the amount we truncate the sha-512 digest by, therefore still use length to differentiate. If a new method was employed in the specification, such as the VMC identifier, then we'd look to other solutions to handle this. This could include such as the ?algorithm=XXXXX parameter, prefixing checksums with a namespace or ensuring that length is still the predominant way to handle this.

We're also going to leave the alias issue to the implementation. In practice they will very rarely exceed the 30 character limit John suggested or be composed of hex digits but it feels like a problem for an implementation to resolve not us.

It was also mentioned we should add an additional section to the rationale about this decision.

andrewyatz added a commit that referenced this issue Sep 12, 2018
See issue #329.

The specification currently states there are two checksum identifiers available of different length. Length should be used to detect which one has been given. If we need to adopt other checksums then the spec will evolve to include other methods to identify the checksum used.
@jmarshall
Copy link
Member Author

(:+1: from me to what you've merged re checksum algorithms and the favoured ?algorithm= future direction, but IMHO the spec should say something more about aliases.)

IMHO the alias issue should be considered as a security issue.

It seems unfortunate if malicious users were able to poison a server by “uploading” a sequence with an alias that collided with the checksum of some other (perhaps well-known) sequence. (Via whatever separate “upload” facility — put files in the right place or whatever — the server has for getting data into its database, the details of which would be outwith refget's scope.)

@andrewyatz
Copy link
Contributor

I see your point. How about the following changes:

  • Change the section on aliases to "or another checksum scheme supported by the server"
  • Iterate that all checksum identifiers should be unique and not clash with the md5 and trunc512 algorithms defined here and another checksum identifier scheme as used on a single server instance
  • Other checksum schemes should be documented by the implementation and must be based on a digest of the underlying sequence

That hopefully puts clear water between aliases e.g. chr1 and alternative methods of generating the checksum identifiers. We never intended to query the server by alias.

As for the poisoned sequence problem ... yeah that's not a refget concern but is a general concern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants