Distinctness of different kinds of refget digest IDs #329

jmarshall · 2018-08-15T11:01:10Z

Sequences are accessed via

GET /sequence/<id>

where <id> is described as

A string specifying the sequence to be returned. The identifier shall be a checksum derived from the sequence using one of the supported checksum algorithms, or an alias for the sequence supported by the server.

However there doesn't appear to be a way for the client to say which checksum algorithm or what kind of alias they are intending to be using.

The currently-defined checksum algorithms are md5 (ids are 32 hex digits) and trunc512 (ids are 48 hex digits). So fortunately it is trivial for the server to determine which of md5 or trunc512 is being used by looking at the <id>'s length. The format of “aliases” is not described however, and it appears a maliciously-constructed alias could collide with an extant checksum <id>.

Is the intention that any more checksum algorithms defined in the future will also have easily distinguished <id> values?
Would it be a good idea to prevent aliases from colliding with valid checksum <id> values?
For example, servers might only consider an <id> to potentially be an alias if it has 30 or fewer characters or if it contains non-hexdigit characters.

The text was updated successfully, but these errors were encountered:

andrewyatz · 2018-08-15T15:01:53Z

Good questions.

Is the intention that any more checksum algorithms defined in the future will also have easily distinguished values?

That was the idea. I would only want to expand the supported checksum range as/when we've found there to be an issue with an existing supported checksum i.e. if trunc512 and md5 resulted in a collision we would expand the digest size used by trunc512.

Would it be a good idea to prevent aliases from colliding with valid checksum values?
For example, servers might only consider an to potentially be an alias if it has 30 or fewer characters or if it contains non-hexdigit characters.

I'm not totally sure. I viewed the aliases section as a bit where an API can say "I believe this is a known alias for this ID". Nothing more. Those known aliases could be other checksums e.g. if UniParc implemented this they could provide their crc64 checksums as an alias. Part of me feels that this is a buyer beware situation.

What might be a good solution is to extend the API to, as you said, specify the type of checksum being used. That way aliases can collide but they would never resolve.

yfarjoun · 2018-08-15T18:43:29Z

I think it would be bet to have different access points for the different retrieval types. While we are at it, I would add a version...so that you'll have GET /sequence/v1/md5/<md5-id> GET /sequence/v1/trunc512/<trunc512-id> GET /sequence/v1/alias/<alias-id> This will make it easy to change a version if we so choose, and add or remove capabilities.

…

On Wed, Aug 15, 2018 at 11:03 AM Andrew Yates ***@***.***> wrote: Good questions. Is the intention that any more checksum algorithms defined in the future will also have easily distinguished values? That was the idea. I would only want to expand the supported checksum range as/when we've found there to be an issue with an existing supported checksum i.e. if trunc512 and md5 resulted in a collision we would expand the digest size used by trunc512. Would it be a good idea to prevent aliases from colliding with valid checksum values? For example, servers might only consider an to potentially be an alias if it has 30 or fewer characters or if it contains non-hexdigit characters. I'm not totally sure. I viewed the aliases section as a bit where an API can say "I believe this is a known alias for this ID". Nothing more. Those known aliases could be other checksums e.g. if UniParc implemented this they could provide their crc64 checksums as an alias. Part of me feels that this is a buyer beware situation. What might be a good solution is to extend the API to, as you said, specify the type of checksum being used. That way aliases can collide but they would never resolve. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#329 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ACnk0qTQa-sqcbLAnuFhup7bV21RKuclks5uRDg8gaJpZM4V97X5> .

andrewyatz · 2018-08-16T09:13:52Z

Thanks for the comment on this.

Firstly versioning of refget is handled through the use of the VND mime type and supported versions are declared via the /service-info endpoint. Using VNDs is similar to how htsget is setup. If an implementor wished to URL version their implementation they would be free to do so but as a prefix to all URLs not inline as you have suggested here.

Secondly refget is not built to support sequence retrieval using an alias. Imagine the following URL /sequence/alias/chr1 and how impossible this is to resolve without additional metadata. Refget is trying to resolve this situation by using checksums so supporting alias lookup feels like it's going against refget's ethos.

Finally adding in specific endpoints for each checksum doubles the number of endpoints we support out of the box (/sequence/{checksum}/{id} and /sequence/{checksum}/{id}/metadata x {checksum}) could require duplication of documentation and means non-standard checksum extensions would also require this endpoint duplication.

Personally I think a parameter such as /sequence/{id}?algorithm=md5 would be a more scalable solution.

andrewyatz · 2018-08-29T11:27:28Z

We just spoke on our 3-weekly call and discussed this issue. The conclusion was for the first iteration of refget the two supported algorithms can be identified by length alone. We will expand the spec when algorithm detection it becomes a problem.

The intention is that future checksums would be solved by either expanding the amount we truncate the sha-512 digest by, therefore still use length to differentiate. If a new method was employed in the specification, such as the VMC identifier, then we'd look to other solutions to handle this. This could include such as the ?algorithm=XXXXX parameter, prefixing checksums with a namespace or ensuring that length is still the predominant way to handle this.

We're also going to leave the alias issue to the implementation. In practice they will very rarely exceed the 30 character limit John suggested or be composed of hex digits but it feels like a problem for an implementation to resolve not us.

It was also mentioned we should add an additional section to the rationale about this decision.

See issue #329. The specification currently states there are two checksum identifiers available of different length. Length should be used to detect which one has been given. If we need to adopt other checksums then the spec will evolve to include other methods to identify the checksum used.

jmarshall · 2018-09-12T13:47:45Z

(:+1: from me to what you've merged re checksum algorithms and the favoured ?algorithm= future direction, but IMHO the spec should say something more about aliases.)

IMHO the alias issue should be considered as a security issue.

It seems unfortunate if malicious users were able to poison a server by “uploading” a sequence with an alias that collided with the checksum of some other (perhaps well-known) sequence. (Via whatever separate “upload” facility — put files in the right place or whatever — the server has for getting data into its database, the details of which would be outwith refget's scope.)

andrewyatz · 2018-09-12T16:31:54Z

I see your point. How about the following changes:

Change the section on aliases to "or another checksum scheme supported by the server"
Iterate that all checksum identifiers should be unique and not clash with the md5 and trunc512 algorithms defined here and another checksum identifier scheme as used on a single server instance
Other checksum schemes should be documented by the implementation and must be based on a digest of the underlying sequence

That hopefully puts clear water between aliases e.g. chr1 and alternative methods of generating the checksum identifiers. We never intended to query the server by alias.

As for the poisoned sequence problem ... yeah that's not a refget concern but is a general concern

jmarshall added the refget label Aug 15, 2018

andrewyatz closed this as completed Aug 29, 2018

andrewyatz added the wontfix label Aug 29, 2018

andrewyatz mentioned this issue Aug 29, 2018

Clarification of checksum ID identification #334

Merged

andrewyatz reopened this Sep 12, 2018

andrewyatz mentioned this issue Sep 12, 2018

Clarify alias usage #337

Merged

lairdm closed this as completed Sep 20, 2018

nsheff mentioned this issue Nov 12, 2020

Will the API offer an alias to digest conversion endpoint? ga4gh/refget#4

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distinctness of different kinds of refget digest IDs #329

Distinctness of different kinds of refget digest IDs #329

jmarshall commented Aug 15, 2018

andrewyatz commented Aug 15, 2018

yfarjoun commented Aug 15, 2018 via email

andrewyatz commented Aug 16, 2018

andrewyatz commented Aug 29, 2018

jmarshall commented Sep 12, 2018

andrewyatz commented Sep 12, 2018

Distinctness of different kinds of refget digest IDs #329

Distinctness of different kinds of refget digest IDs #329

Comments

jmarshall commented Aug 15, 2018

andrewyatz commented Aug 15, 2018

yfarjoun commented Aug 15, 2018 via email

andrewyatz commented Aug 16, 2018

andrewyatz commented Aug 29, 2018

jmarshall commented Sep 12, 2018

andrewyatz commented Sep 12, 2018