Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss the right way to "bucket" assets #5

Closed
jechols opened this issue Apr 10, 2015 · 30 comments
Closed

Discuss the right way to "bucket" assets #5

jechols opened this issue Apr 10, 2015 · 30 comments

Comments

@jechols
Copy link
Contributor

jechols commented Apr 10, 2015

This is something I hadn't looked closely enough at before pushing the PR previously. If you override id generation, Fedora will properly be handed the desired path - for instance, be/19/d0/b2/2514nz446. If you do not override id generation and let Fedora do its UUID thing, Fedora chooses the path - for instance, 4a/48/49/6a/4a48496a-d56b-44a0-9836-e3381dfe13bd.

With UUIDs, the bucketing system is able to use the first 8 characters because they are never predictable. With NOIDs, there is a predictability, which is why I pushed up #3 to begin with.

The problem here can originate if you do not specify IDs prior to saving in Fedora for every single object you ever create. If there is any situation where an object is created in Fedora without both a predetermined PID and the overridden path translation code, things will be essentially unfindable by the app.

For most users this probably isn't an issue, but it's the sort of problem that could hit unexpectedly and won't make a lot of sense if it does.

I think the best long-term solution would be to let Fedora bucket by UUID and then have our pids be simply a piece of data associated with the object, rather than the pids being the definitive ID. I have no idea how to make this happen, as I (obviously) am not all that familiar with the Hydra stack, especially Fedora 4 / Hydra 9.

Barring that, it might be good to have the treeifier just return the id as-is unless it's a valid noid. Then anything not minted as a noid could just use the built-in fedora treeification. This is sort of weird and magicky, but it might be a good fix for the vast majority of use cases (e.g., creating a dummy asset in a test where the minting logic is deliberately not wanted).

@jechols
Copy link
Contributor Author

jechols commented Apr 10, 2015

Preliminary testing of the latter option (punt on non-noids in the treeifier) suggests this may be easy to fix -- if that approach is acceptable. I'll submit a PR, but bear in mind all code from @jechols should be considered SUSPECT from this point forward.

@jechols
Copy link
Contributor Author

jechols commented Apr 14, 2015

My first idea is garbage - can't bucket by UUID unless storing some kind of PID-to-UUID lookup, because otherwise how do we tell Fedora how to go from a pid to a file location? My second idea was good in theory, but I'm not sure there's a good way to verify a noid is good without configuring a global template or something. The treeifier could check the current template, but that only works if all objects are always minted with the same template, an assumption that may not be true.

Another approach could be to check for UUID-like strings and just skip those, but that feels even more hacky to me, and still won't necessarily be the fix for all situations.

@jechols
Copy link
Contributor Author

jechols commented May 22, 2015

So the way we solved the issue for OregonDigital was in fact by reversing the NOID and using that as the identifier. Yes, the official and final ID for an asset is a reversed NOID :( From there, we just let the default treeifier do its thing. It feels wrong, but it ensures that buckets work well no matter the pid length, and it's backwards-compatible with any possible identifier that has the same general bucketing strategy.

I hesitate to recommend that as a general strategy for obvious reasons. I'd love to hear a better solution, especially since we still haven't gone live with OD 2 :)

@hackartisan
Copy link

@jechols, have you had any more recent thoughts or discussions on this topic? I'm not sure how concerned I should be that an object might be saved without yet having an ID.

Would it not make sense to reverse NOIDS at time of minting instead?

@jechols
Copy link
Contributor Author

jechols commented Jul 14, 2015

I haven't considered other options. It makes me sad, but the bucketing / treeifying situation being what it is, it's really best to have an algorithm that's either fairly random (like Fedora's UUID) or else predictable in a way that lets you bucket safely. And reversing the noid is the best option I've managed to come up with.

To answer your question, basically that's what we did: https://github.com/OregonDigital/oregondigital_2/blob/2327bb1aac1895a46300897455970ce0b1963673/app/models/generic_asset.rb#L25-L27

When I say we reversed the NOID, I mean we reversed it through and through. Fedora never sees a normal NOID, which ensures we aren't doing any crazy translation anywhere. We use the old treeifier and everything, because we can now rest assured that the noid Fedora sees is safe to treeify as-is.

@mjgiarlo
Copy link
Member

Making sure @cbeer sees this thread. As the original author of the noid library, perhaps he has some wisdom to impart.

@hackartisan
Copy link

@jechols would you be so kind as to spell out the "obvious reasons" that using the reversed NOID as your ID is a bad idea not something you are comfortable recommending? I like it somewhat better than the strategy of changing the treeifier, because it preserves the easy-to-observe, direct relationship between the ID and the fedora uri.

@jcoyne
Copy link
Member

jcoyne commented Jul 20, 2015

@HackMasterA then the treeifier doesn't work for objects that do not receive a noid (and get a system generated UUID instead). The previous version of the treeifier use the same patter as the treeifier built into Fedora. So it didn't matter if you used the treeifier in this gem on the UUIDs.

@jechols
Copy link
Contributor Author

jechols commented Jul 20, 2015

@HackMasterA - that's actually a great question. Maybe the reasons aren't so obvious.

First, anybody who has NOIDs already minted, and wants to keep those NOIDs around, will have the sudden possibility of collisions. It's arguably a low possibility, but it's there.

Second, I don't really understand the concept of NOIDs. It's a bigger topic than fits here, but were it me I'd just use whatever Fedora generates or else my own simple auto-incrementing field. Given this lack of understanding, I suspect there are pretty specific reasons for the decisions around how to mint a NOID in a variety of situations. So I guess I assumed it was bad to saying we don't care about the NOID enough to even worry about things like the position of digits. It just feels like being incompatible with the rest of the NOID-using world is a bad thing. Maybe it isn't.

@jcoyne - I'm not sure what you mean. If you reverse the noid on mint, and prior to storing it in activefedora, the treeifier can keep on working as-is. Fedora never sees the original id, just the reversed one. In our setup, this works fine, using the same treeifier as Fedora. The real problem is that NOIDs by default (at least those produced by the current Ruby library) don't bucket very well, and I thought I'd be clever and "fix" that issue :-/

@hackartisan
Copy link

@jcoyne @jechols Does the treeifier defined in this library run on every ID, regardless of whether it was minted locally or remotely (in fedora)?

@jcoyne
Copy link
Member

jcoyne commented Jul 21, 2015

@HackMasterA Basically yes. If you install ActiveFedora::Noid, then this code underpins ActiveFedora::Base.id_to_uri and ActiveFedora::Base.uri_to_id

@hackartisan
Copy link

@jcoyne Good to know. So in the case that you change the treeifier before ever creating any data in your production system, would you run into trouble with UUIDs due to conflict with fedora's treeifier? Or would you be okay because the one you defined would always be used?

@jcoyne
Copy link
Member

jcoyne commented Jul 21, 2015

@HackMasterA as @jechols has suggested, if you reverse the order of the noid before you write it to Fedora, then you should never have a problem. However, I'm not sure you still have a NOID then (the existing NOID software won't validate). But like @jechols I don't see much value in a NOID other than it being short and easy to write on a notecard or read over the phone.

@hackartisan
Copy link

@jcoyne Yes I'm with you on that part. Seeking more insight on whether the treeifier in this library could potentially conflict with the treeifier in fedora (as you appeared to suggest yesterday), or whether it would replace fedora's treeifier.

@jcoyne
Copy link
Member

jcoyne commented Jul 21, 2015

@HackMasterA There is a known issue with conflicts when you run active_fedora-noid 1.0. But 0.3.0 is okay. This ticket was basically to "fix 1.0"

@hackartisan
Copy link

Thanks; I have a better understanding of the issue now. I definitely see both NOIDs and UUIDs in my sufia-based app. e.g., generic file objects have NOIDS and Hydra::AccessControls::Permission objects have UUIDs. So I think the problem scenario @jechols describes in the first comment is likely to be an issue with out-of-the-box sufia applications.

It sounds like changing the way NOIDs are minted is the best way to solve the bucketing problem. Reversing them seems to be a perfectly good strategy to me. It's simple, easy, and it works. The only problem I can see is that if you wanted to use the checkdigit somehow you would have to reverse it again to do so. I don't know if hydra uses the checkdigit anywhere; I do know that sufia configures the NOID to have one. It seems unlikely to be an issue, since not all objects are guaranteed to have a NOID as their identifier.

@mjgiarlo
Copy link
Member

I am working on a fix for this upstream. Stay tuned. ruby-microservices/noid#13

@mjgiarlo
Copy link
Member

mjgiarlo commented Aug 3, 2015

This is fixed in Noid 0.7.2. I can confirm this works in ActiveFedora::Noid:

$ bundle console
Resolving dependencies...
2.2.1 :001 > require 'active_fedora'
 => true 
2.2.1 :002 > service = ActiveFedora::Noid::Service.new
 => #<ActiveFedora::Noid::Service:0x00000002006750 @minter=#<ActiveFedora::Noid::SynchronizedMinter:0x00000002006700 @template=".reeddeeddk", @statefile="/tmp/minter-state">> 
2.2.1 :003 > service.mint
 => "7p88cg76p" 
2.2.1 :004 > service.mint
 => "3197xm04j" 
2.2.1 :005 > service.mint
 => "db78tc00b" 
2.2.1 :006 > service.mint
 => "4f16c314t" 
2.2.1 :007 > service.mint
 => "f4752g72m" 

Bundle update to noid 0.7.2 and test to see if we can close this issue? Ping: @jechols @HackMasterA @jcoyne @terrellt

@hackartisan
Copy link

@mjgiarlo works great!!

[2] pry(main)> f.assign_id
=> "41mg74tms"
[3] pry(main)> f.assign_id
=> "257d27bn7"
[4] pry(main)> 
[5] pry(main)> f.assign_id
=> "311n79kdc"
[6] pry(main)> f.assign_id
=> "977s75ngc"
[7] pry(main)> f.assign_id
=> "752f75xrd"
[8] pry(main)> f.assign_id
=> "62f7627x7"
[9] pry(main)> f.assign_id
=> "33dz0134f"
[10] pry(main)> f.assign_id
=> "66vx02680"

@jechols
Copy link
Contributor Author

jechols commented Aug 4, 2015

@mjgiarlo ++

I'll get 0.7.2 into OregonDigital 2 shortly and let you know, but looking at you results I think we're good!

@jechols
Copy link
Contributor Author

jechols commented Aug 4, 2015

[4] pry(main)> GenericAsset.create.id
=> "h001f336x"
[5] pry(main)> GenericAsset.create.id
=> "t448x52sz"
[6] pry(main)> GenericAsset.create.id
=> "m21zw692n"
[7] pry(main)> GenericAsset.create.id
=> "541zw692n"
[8] pry(main)> GenericAsset.create.id
=> "121dc544g"
[9] pry(main)> GenericAsset.create.id
=> "2040c37vb"
[10] pry(main)> GenericAsset.create.id
=> "v88vs36hh"
[11] pry(main)> GenericAsset.create.id
=> "840wp37h5"

👍

@jechols
Copy link
Contributor Author

jechols commented Aug 4, 2015

Odd that the 3rd and 4th are so close. Maybe some serious load testing would be warranted, but it looks like it's at least in the realm of "good enough".

@jechols
Copy link
Contributor Author

jechols commented Aug 4, 2015

Oops, I still had things reversed in my tests above. But I think the results are more or less the same. Bucketing is far better than before in a test of about 500 assets.

@mjgiarlo
Copy link
Member

mjgiarlo commented Aug 4, 2015

@jechols The default is to have 293 buckets of identifiers, and on each call to #mint one is chosen at random, so while it's not impossible to see two like identifiers come up sequentially I suspect you won't see it happen very often.

@mjgiarlo
Copy link
Member

mjgiarlo commented Aug 4, 2015

Related issue for @jechols and others who are interested: given 1) that the introduction of treeifier customization can cause backwards-incompatibility (read: if you change your treeifier, you lose the ability to find objects deposited before the treeifier change) and 2) that the treeifier was introduced as a workaround to noid's broken internal bucketing, what would you think of me removing that from the codebase entirely? We may be in a YAGNI situation now that noid bucketing is fixed.

@jechols
Copy link
Contributor Author

jechols commented Aug 4, 2015

@mjgiarlo: what determines the 293 buckets? I guess I should check your changes directly, but one of my original concerns was that a non-default NOID could result in thousands of assets in a single bucket. Is it choosing bucketing strategy with the assumption that the identifier will only be a set length?

As to the treefier stuff, the MD5 hash, and other "fixes" I pushed, they're garbage and can be dropped IMO. They were indeed a workaround and they really mess up anything in Fedora that isn't using the default treeifier.

@mjgiarlo
Copy link
Member

mjgiarlo commented Aug 4, 2015

@jechols 293 is a default value which comes from the Perl noid library, from which this library was ported. Here's a relevant excerpt:

    my $maxcounters = 293;      # prime, a little more than 29*10
    #
    # Using a prime under the theory (unverified) that it may help even
    # out distribution across the more significant digits of generated
    # identifiers.  In this way, for example, a method for mapping an
    # identifier to a pathname (eg, fk9tmb35x -> fk/9t/mb/35/x/, which
    # could be a directory holding all files related to the named
    # object), would result in a reasonably balanced filesystem tree
    # -- no subdirectories too unevenly loaded.  That's the hope anyway.

This value may be overridden when a minter is initialized by passing in the max_counters option, which you can see here:

https://github.com/microservices/noid/blob/master/lib/noid/minter.rb#L11

That value is used to initialize the buckets here:

https://github.com/microservices/noid/blob/master/lib/noid/minter.rb#L78

So it may indeed be overridden unless you've got a minter with existing state, because that would break your minter state (since buckets are part of the saved state).

Thanks for the note about the treeifier stuff, and for your patience in resolving this. I'll pluck those out and look to release a new version of AF::Noid soon.

@jechols
Copy link
Contributor Author

jechols commented Aug 4, 2015

Okay, good to know. I'm not sure why 293 was chosen other than maybe the assumption was you'd always have ed or de in any two-digit grouping. It doesn't seem to actually affect buckets long-term, but it will prioritize those first 293 until they're exhausted as far as I can tell. That's not terribly likely to be a problem, but it's probably worth documenting anyway.

No worries on the wait - I probably made things far worse with my previous "fixes" :-/

@mjgiarlo
Copy link
Member

mjgiarlo commented Aug 4, 2015

Thanks, @jechols.

@mjgiarlo
Copy link
Member

mjgiarlo commented Aug 4, 2015

Closing this issue and opening another to remove the treeifier stuff. Thanks again, all! 🎆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants