Distinctness of different kinds of refget digest IDs #329
Comments
Good questions.
That was the idea. I would only want to expand the supported checksum range as/when we've found there to be an issue with an existing supported checksum i.e. if trunc512 and md5 resulted in a collision we would expand the digest size used by trunc512.
I'm not totally sure. I viewed the aliases section as a bit where an API can say "I believe this is a known alias for this ID". Nothing more. Those known aliases could be other checksums e.g. if UniParc implemented this they could provide their crc64 checksums as an alias. Part of me feels that this is a buyer beware situation. What might be a good solution is to extend the API to, as you said, specify the type of checksum being used. That way aliases can collide but they would never resolve. |
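A minimal sketch of the "specify the type of checksum" idea, assuming a hypothetical in-memory store keyed by a (kind, identifier) pair. The names `build_store` and `resolve` and the kind labels are illustrative only and are not part of refget. It shows how an alias whose text collides with a checksum would still never resolve as that checksum.

```python
import hashlib
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]  # (kind, identifier), e.g. ("md5", "<32 hex digits>")


def build_store(sequence: str, aliases: Iterable[Key]) -> Dict[Key, str]:
    """Index one sequence under its md5 and under any typed aliases (hypothetical layout)."""
    md5_id = hashlib.md5(sequence.encode("ascii")).hexdigest()
    store: Dict[Key, str] = {("md5", md5_id): sequence}
    for kind, value in aliases:
        store[(kind, value)] = sequence
    return store


def resolve(store: Dict[Key, str], kind: str, identifier: str) -> Optional[str]:
    """Look up strictly by (kind, identifier); the same text under another kind does not match."""
    return store.get((kind, identifier))
```

With this layout, an alias that happens to equal the md5 of some other sequence is only reachable under its own kind, so it cannot shadow the real checksum lookup.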
I think it would be best to have different access points for the different
retrieval types. While we are at it, I would add a version...so that you'll
have
GET /sequence/v1/md5/<md5-id>
GET /sequence/v1/trunc512/<trunc512-id>
GET /sequence/v1/alias/<alias-id>
This will make it easy to change a version if we so choose, and add or
remove capabilities.
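To make the proposed layout concrete, here is a rough sketch of how a server might split such a path into its parts. The function name `parse_sequence_path` and the set of accepted kinds are assumptions for illustration; this is the layout proposed in this comment, not something the refget spec defines.

```python
from typing import Optional, Tuple

# Retrieval kinds taken from the three example URLs in this comment.
KINDS = {"md5", "trunc512", "alias"}


def parse_sequence_path(path: str) -> Optional[Tuple[str, str, str]]:
    """Split '/sequence/<version>/<kind>/<id>' into (version, kind, id), or None if malformed."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "sequence":
        return None
    _, version, kind, identifier = parts
    if kind not in KINDS or not identifier:
        return None
    return version, kind, identifier
```

Because the kind is part of the path, the server never has to guess what an identifier means from its shape.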
…On Wed, Aug 15, 2018 at 11:03 AM Andrew Yates ***@***.***> wrote:
Good questions.
Is the intention that any more checksum algorithms defined in the future
will also have easily distinguished values?
That was the idea. I would only want to expand the supported checksum
range as/when we've found there to be an issue with an existing supported
checksum i.e. if trunc512 and md5 resulted in a collision we would expand
the digest size used by trunc512.
Would it be a good idea to prevent aliases from colliding with valid
checksum values?
For example, servers might only consider an <id> to potentially be an alias if
it has 30 or fewer characters or if it contains non-hexdigit characters.
I'm not totally sure. I viewed the aliases section as a bit where an API
can say "I believe this is a known alias for this ID". Nothing more. Those
known aliases could be other checksums e.g. if UniParc implemented this
they could provide their crc64 checksums as an alias. Part of me feels that
this is a buyer beware situation.
What might be a good solution is to extend the API to, as you said,
specify the type of checksum being used. That way aliases can collide but
they would never resolve.
|
Thanks for the comment on this. Firstly versioning of refget is handled through the use of the VND mime type and supported versions are declared via the Secondly refget is not built to support sequence retrieval using an alias. Imagine the following URL Finally adding in specific endpoints for each checksum doubles the number of endpoints we support out of the box ( Personally I think a parameter such as |
We just spoke on our 3-weekly call and discussed this issue. The conclusion was that, for the first iteration of refget, the two supported algorithms can be identified by length alone. We will expand the spec when algorithm detection becomes a problem. The intention is that future checksums would be handled by changing the length to which we truncate the sha-512 digest, so length could still be used to differentiate them. If a new method was employed in the specification, such as the VMC identifier, then we'd look to other solutions to handle this. This could include such as the
We're also going to leave the alias issue to the implementation. In practice aliases will very rarely exceed the 30 character limit John suggested or be composed only of hex digits, but it feels like a problem for an implementation to resolve, not us. It was also mentioned that we should add an additional section to the rationale about this decision. |
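As a rough illustration of the length-only rule agreed here, the sketch below maps an identifier to an algorithm name using the 32 and 48 hex-digit lengths quoted in this issue. The function name `detect_algorithm` is hypothetical, and rejecting non-hex input is an extra assumption rather than something the spec is known to require.

```python
import string
from typing import Optional

_HEX_DIGITS = set(string.hexdigits)

# Identifier lengths as stated in this issue: md5 is 32 hex digits, trunc512 is 48.
_LENGTH_TO_ALGORITHM = {32: "md5", 48: "trunc512"}


def detect_algorithm(identifier: str) -> Optional[str]:
    """Return 'md5' or 'trunc512' based on length alone, or None if neither fits."""
    if not identifier or any(ch not in _HEX_DIGITS for ch in identifier):
        return None  # assumption: checksum identifiers are pure hex
    return _LENGTH_TO_ALGORITHM.get(len(identifier))
```

If a later checksum used a different truncation length, it would only need a new entry in the length table, which is why length remains a workable discriminator for now.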
See issue #329. The specification currently states there are two checksum identifiers available of different length. Length should be used to detect which one has been given. If we need to adopt other checksums then the spec will evolve to include other methods to identify the checksum used.
(:+1: from me to what you've merged re checksum algorithms and the favoured IMHO the alias issue should be considered as a security issue. It seems unfortunate if malicious users were able to poison a server by “uploading” a sequence with an alias that collided with the checksum of some other (perhaps well-known) sequence. (Via whatever separate “upload” facility — put files in the right place or whatever — the server has for getting data into its database, the details of which would be outwith refget's scope.) |
I see your point. How about the following changes:
That hopefully puts clear water between aliases e.g. chr1 and alternative methods of generating the checksum identifiers. We never intended to query the server by alias. As for the poisoned sequence problem ... yeah that's not a refget concern but is a general concern |
Sequences are accessed via
where <id> is described as
However there doesn't appear to be a way for the client to say which checksum algorithm or what kind of alias they are intending to be using.
The currently-defined checksum algorithms are md5 (ids are 32 hex digits) and trunc512 (ids are 48 hex digits). So fortunately it is trivial for the server to determine which of md5 or trunc512 is being used by looking at the <id>'s length. The format of “aliases” is not described however, and it appears a maliciously-constructed alias could collide with an extant checksum <id>.
Is the intention that any more checksum algorithms defined in the future will also have easily distinguished <id> values?
Would it be a good idea to prevent aliases from colliding with valid checksum <id> values?
For example, servers might only consider an <id> to potentially be an alias if it has 30 or fewer characters or if it contains non-hexdigit characters.
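The alias rule suggested in the last question could be sketched roughly as follows. The function name `may_be_alias` is hypothetical, and treating the empty string as not an alias is an added assumption.

```python
import string

_HEX_DIGITS = set(string.hexdigits)


def may_be_alias(identifier: str) -> bool:
    """True if the id has 30 or fewer characters or contains a non-hexdigit character."""
    if not identifier:
        return False  # assumption: an empty id is never a valid alias
    is_short = len(identifier) <= 30
    has_non_hex = any(ch not in _HEX_DIGITS for ch in identifier)
    return is_short or has_non_hex
```

Under this rule a 32 or 48 hex-digit string is never treated as an alias, so an alias cannot collide with an md5 or trunc512 identifier.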