Use Case: Research Dataset #11

Closed
escowles opened this issue Oct 8, 2014 · 76 comments

Comments

@escowles
Contributor

escowles commented Oct 8, 2014

A research dataset containing a set of files organized into top-level categories of preparatory materials, raw data files, statistics, and visualization images, with multiple files in each category. The visualization images are further organized in a hierarchy by type, and then by X/Y/Z axis.

  • The dataset as a whole has descriptive metadata describing the research project, the research team, etc.
  • The top-level categories and the hierarchy of visualization images have titles.
  • Each individual file has a title and technical metadata.
@jcoyne
Member

jcoyne commented Oct 9, 2014

Is "Organized by Hierarchy" an implementation detail? Can you provide the motivation for organizing the files into this structure? Is this structure a standard for all research datasets? Can you elaborate on "X/Y/Z axis"? An example project structure would be enlightening.

@escowles
Contributor Author

escowles commented Oct 9, 2014

Organized by hierarchy is an implementation detail, but one that needs support from the model. I don't think these conventions are common enough to warrant modeling -- I'm happy to have a generic Object/Component/File classes and use descriptive metadata to describe the hierarchy.

Here's an example component hierarchy:

  • Parameters
    • Parameters file 1
    • Parameters file 2
  • Raw data
    • Raw data file 1
    • Raw data file 2
  • Statistics
    • Statistics file 1
    • Statistics file 2
  • Visualizations
    • Visualizations of X-axis
      • X-axis file 1
      • X-axis file 2
    • Visualizations of Y-axis
      • Y-axis file 1
      • Y-axis file 2
    • Visualizations of Z-axis
      • Z-axis file 1
      • Z-axis file 2

Each of the lowest-level components would have a file attached.
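
A minimal sketch of the hierarchy above in plain Ruby, assuming a single generic `Component` class whose titles carry the meaning. `Component`, `leaf`, `category`, `build_dataset`, and the `.dat` filenames are hypothetical names invented for illustration; none of them come from Hydra::Works or ActiveFedora.

```ruby
# Hypothetical sketch: Component is NOT a Hydra::Works or ActiveFedora class.
class Component
  attr_reader :title, :children, :files

  def initialize(title, children: [], files: [])
    @title = title
    @children = children
    @files = files
  end

  # Indented outline of this component and everything below it.
  def outline(depth = 0)
    ["#{'  ' * depth}#{title}"] + children.flat_map { |c| c.outline(depth + 1) }
  end
end

# A lowest-level component: a title plus one attached file (filename invented).
def leaf(title)
  Component.new(title, files: ["#{title}.dat"])
end

# A category holding two numbered leaf components.
def category(title, stem)
  Component.new(title, children: [leaf("#{stem} file 1"), leaf("#{stem} file 2")])
end

def build_dataset
  axes = %w[X Y Z].map { |a| category("Visualizations of #{a}-axis", "#{a}-axis") }
  Component.new("Research dataset", children: [
    category("Parameters", "Parameters"),
    category("Raw data", "Raw data"),
    category("Statistics", "Statistics"),
    Component.new("Visualizations", children: axes)
  ])
end

puts build_dataset.outline
```

Nothing in the sketch knows about axes or statistics; the structure is expressed only through nesting and titles.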

@awead
Contributor

awead commented Oct 9, 2014

@escowles are you saying "X/Y/Z" in the sense that you have a 3D hierarchy with 1:N relationships to each so as to encompass all possible implementation needs? Did that question even make sense?

@escowles
Contributor Author

escowles commented Oct 9, 2014

@awead Yes, the dataset in question has spatial data and visualizations of it in 3 axes -- we model that in the hierarchy above. But I want to note that I'm not suggesting that we create Ruby classes to model spatial data. We use generic Component classes with titles like "Visualizations of X-axis", "Visualizations of Y-axis", etc. to label the containers of files.

@awead
Contributor

awead commented Oct 9, 2014

@escowles So if I follow, each axis is an instance of the abstract Component class? I'm using the term "instance" vaguely here.

@escowles
Contributor Author

escowles commented Oct 9, 2014

@awead Yes, each axis would be an instance of the Component class, and contain one or more files that were visualizations in that axis. So "Visualizations of X-axis" would be a Component containing the Components "X-axis file 1" and "X-axis file 2". The Component "X-axis file 1" would contain Files for the source image (e.g., high-res TIFF), a thumbnail JPEG, etc.

@azaroth42

+1 to this use case, and +1 to the overall model (give or take a 0.1 detail here and there) 👍

@mcritchlow

@awead if it's helpful, here's an example of what @escowles has described in our current system http://library.ucsd.edu/dc/object/bb2322141x

@jeremyf
Contributor

jeremyf commented Oct 9, 2014

> Each of the lowest-level components would have a file attached.

Can we create a sub-component off of the lowest-level component that has a file? In other words, are we saying that we can have files attached at any level of the graph? Or are we stating that we have a Node and Leaf construct that once a Leaf you can't become a Node?

@escowles
Contributor Author

escowles commented Oct 9, 2014

@jeremyf I think the model should support nodes with both child nodes and files. At UCSD we don't typically do that, but it seems like a good idea to support it. If we were modelling a filesystem, for example, that would allow both subdirectories and files.
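
To make the filesystem analogy concrete, here is a rough sketch in plain Ruby of a node that can hold child nodes and files at the same time. `Node`, `all_files`, `sample_tree`, and the filenames are hypothetical and not part of any Hydra gem.

```ruby
# Hypothetical sketch: any node may carry files, child nodes, both, or neither.
class Node
  attr_reader :title, :children, :files

  def initialize(title, children: [], files: [])
    @title = title
    @children = children
    @files = files
  end

  # Files attached directly here, followed by files attached anywhere below.
  def all_files
    files + children.flat_map(&:all_files)
  end
end

def sample_tree
  Node.new("dataset", files: ["README.txt"], children: [
    Node.new("raw", files: ["run1.csv", "run2.csv"]),
    Node.new("empty")
  ])
end

p sample_tree.all_files
```

An implementation that wants a strict node/leaf split can simply never attach files to a node that has children; the model itself does not need to forbid it.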

@azaroth42

+1 to any level being able to have associated bitstreams

@jeremyf
Contributor

jeremyf commented Oct 9, 2014

👍 @escowles excellent, it's different than what we've been working from, but that is an implementation detail (it's easier to hide the ability to add a bitstream to a Curate::Work than it would be to add that functionality)

@jeremyf
Contributor

jeremyf commented Oct 9, 2014

Can someone help translate @escowles's issue into a pull request? Someone with an ICLA?

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

I'm seeing some enthusiasm about @escowles 's model. And he said:

> I'm happy to have a generic Object/Component/File classes and use descriptive metadata to describe the hierarchy

Does this mean we're good with the Work (holds descMD) -> GenericFile (holds bitstream, and optionally holds its own file-specific descMD) model, since I believe @escowles has said that maps well to his model? If so, that seems like it'd bring together Sufia, Worthwhile, Curate, and UCSD, and possibly a bunch more of us without introducing a bunch of new concerns and concepts.

@escowles
Contributor Author

escowles commented Oct 9, 2014

I think this is existing functionality, but want to confirm: Sufia & Worthwhile GenericFiles can link to each other, right? That's the one thing we'd need to encode a hierarchy with a flat set of GenericFiles.

@jpstroop
Contributor

jpstroop commented Oct 9, 2014

Sorry, I'm catching up, but what does

> Work (holds descMD) -> GenericFile

mean?

> I'm happy to have a generic Object/Component/File classes and use descriptive metadata to describe the hierarchy

Are you saying that a Work couldn't contain a Work? If I'm correct, I'd be 😞 if after all this we find ourselves back at requiring METS/MODS/FOXML/EAD/whatever (including an RDF rendition thereof) to impose order rather than letting the model itself reflect relationships that are idiomatic to the constituent parts (streams, constituent models) that make up the object.

Other/alternative orders or hierarchies, sure, to me that's what descriptive metadata is for, but if there are relationships that are integral to the parts that comprise the whole, I think they should be reflected in the model itself.

Apologies if I'm misreading.

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

@escowles I'm not sure we've built out the capability to make those links -- unless this is what the recently excised Worthwhile LinkedResources do -- but IMO that is "a small ask" and a reasonable addition to the functionality we already have if that is the cost of making our current model UCSD-compliant!

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

@jpstroop I'm not saying that it's inconceivable that a Work contain another Work -- I'm just not sure how many of our IR-like use cases require this functionality currently. As far as the first phase of Hydra::Works goes, I'm in favor of restricting the scope to IR-like use cases and solving for the 80%. So if we have those use cases, and they seem like commonly needed use cases, let's include Works containing Works in the initial model. If not, I might suggest we defer to the next phase, once we've got a common model for the Sufia/Worthwhile/Curate apps out there.

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

(Alternatively, I think it may also be OK to allow this (Works containing Works) in the model we develop if we also provide some guidance on how implementations like ScholarSphere might avoid/ignore/hide/disallow this complexity.)

@azaroth42

If the intent is only to solve the very basic single list of files associated in an unordered set, please let's rename it far far away from Work or any of the other terms that imply there's a data model behind it.

I suggest Hydra::BasicGroupOfFiles

@jpstroop
Contributor

jpstroop commented Oct 9, 2014

> So if we have those use cases, and they seem like commonly needed use cases, let's include Works containing Works in the initial model. If not, I might suggest we defer to the next phase, once we've got a common model for the Sufia/Worthwhile/Curate apps out there.

Well...I have a PR in for one use case (or four, depending on how you look at it), all of which we've made a dog's dinner of w/ METS (valid XML != good modeling; I could show you but it would burn your eyes. 🔥 😎).

> ... I think it may also be OK to allow this (Works containing Works) in the model we develop if we also provide some guidance on how implementations like ScholarSphere might avoid/ignore/hide/disallow this complexity.

Absolutely! @jcoyne said the same thing here.

Maybe there's a Hydra::IRWork model that extends Hydra::Work to include validations (or whatever the best approach is) to keep it from ever including a Work.
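
One possible shape for that idea, sketched in plain Ruby rather than with ActiveFedora validations. The `Sketch` module and its `Work`, `IRWork`, `GenericFile`, `add_member`, and `validate_member!` names are hypothetical stand-ins, not the real Hydra::Work API.

```ruby
# Hypothetical sketch: plain Ruby classes, not the real Hydra::Work.
module Sketch
  class GenericFile
    attr_reader :name

    def initialize(name)
      @name = name
    end
  end

  class Work
    attr_reader :title, :members

    def initialize(title)
      @title = title
      @members = []
    end

    # Check the member first, then append it; returns self for chaining.
    def add_member(member)
      validate_member!(member)
      @members << member
      self
    end

    private

    # The base class accepts any member, including another Work.
    def validate_member!(_member); end
  end

  # IR-style work: same API, but it refuses to contain another Work.
  class IRWork < Work
    private

    def validate_member!(member)
      raise ArgumentError, "an IRWork cannot contain a Work" if member.is_a?(Work)
    end
  end
end

ir = Sketch::IRWork.new("Thesis").add_member(Sketch::GenericFile.new("thesis.pdf"))
puts ir.members.map(&:name).inspect
```

The general model stays permissive, and the restriction lives only in the subclass that wants it.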

@escowles
Contributor Author

escowles commented Oct 9, 2014

I don't think DigitalObjectSlashWorkSlashWhatever -> GenericFile -> bitstream is just a single unordered set of files.

  • We can use multiple GenericFiles to group files
  • We can link between GenericFiles to encode a hierarchy
  • We can attach properties to GenericFile to express order

IMHO, this is not just simpler than having infinite recursion of GenericFiles/Components, it's also more flexible since it can express relationships other than containment.
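
A rough sketch of the three points above, assuming each GenericFile is a flat record with a link to its parent and a sort position. `FlatFile`, `parent_id`, `position`, `children_of`, and `outline` are hypothetical names chosen for illustration, not Sufia or Worthwhile properties.

```ruby
# Hypothetical sketch: a flat list of records; hierarchy and order are data.
FlatFile = Struct.new(:id, :title, :parent_id, :position, keyword_init: true)

FILES = [
  FlatFile.new(id: "viz",   title: "Visualizations",           parent_id: nil,     position: 4),
  FlatFile.new(id: "viz-x", title: "Visualizations of X-axis", parent_id: "viz",   position: 1),
  FlatFile.new(id: "x2",    title: "X-axis file 2",            parent_id: "viz-x", position: 2),
  FlatFile.new(id: "x1",    title: "X-axis file 1",            parent_id: "viz-x", position: 1)
].freeze

# Records linked to the given parent, sorted by their position property.
def children_of(files, parent_id)
  files.select { |f| f.parent_id == parent_id }.sort_by(&:position)
end

# Rebuild the indented outline purely from the links.
def outline(files, parent_id = nil, depth = 0)
  children_of(files, parent_id).flat_map do |f|
    ["#{'  ' * depth}#{f.title}"] + outline(files, f.id, depth + 1)
  end
end

puts outline(FILES)
```

Because the link is just a property, a second link type (say, a hypothetical `derived_from_id`) could be added alongside `parent_id` without changing the storage shape.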

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

@azaroth42 I was thinking the intent, based on what I was hearing at the Sufia Futures discussion on Friday, was to come up with a model that can underlie Hydrus-based, Sufia-based, Worthwhile-based, and Curate-based apps. Whether we call it a Work or a BasicGroupOfFiles or an IRWork, how good of a fit do you judge this for Hydrus's needs?

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

What @escowles said was more articulate and more succinct than what I was saying.

@jpstroop
Contributor

jpstroop commented Oct 9, 2014

Would the DigitalObjectSlashWorkSlashWhatever -> GenericFile -> bitstream approach mean you couldn't use the AF API to manage those 'more flexible' relationships?

@azaroth42

I would need to defer to other Stanford folk on the appropriateness for Hydrus. Once there's a proposal, I'm happy to take it back and discuss with them :)

@mjgiarlo
Member

mjgiarlo commented Oct 9, 2014

@azaroth42 Fair enough!

@jpstroop I would think those relationships would be manageable via AF but I defer to folks whose heads are in the code more frequently than mine. @escowles @jcoyne etc.

@awead
Contributor

awead commented Oct 11, 2014

@escowles, 👍 to ordered list ontology. I'm also assuming that these aspects would be baked into the model but easy to ignore if you weren't worried about order. Also, might sort fields be implementation-specific?

@mjgiarlo yeah, this shouldn't be hard to map, although we'll mint a bunch more pids to create the additional "works" for each existing GenericFile.

@mjgiarlo
Member

@awead If we decide to make use of the Batch objects we already have in the system such that every Batch of GenericFiles is a Work -- not saying we should, but it's one migration decision we could make -- we may also need to create Components to hold descriptive metadata about Files. (Since in the @escowles model, a File object cannot hold descMD.) Still pretty easy to map.

@escowles
Contributor Author

@awead I'm not sure about the mechanics, but I've heard use cases in #17 and #18 for both Sets (unordered, non-duplicated) and Arrays (ordered, duplication allowed). So maybe the default collection is one of those, and there's a subclass that overrides the members to use the other. So if you don't care about ordering, you'd just need to use the right class and then the members would be an unordered set.
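
As a rough illustration of that subclass idea, assuming the default is the unordered Set and the subclass swaps in an Array. `UnorderedCollection`, `OrderedCollection`, and `new_members` are hypothetical names; only `Set` and `Array` are real Ruby classes.

```ruby
require "set"

# Hypothetical sketch: the default collection keeps members in a Set.
class UnorderedCollection
  def initialize
    @members = new_members
  end

  def add(member)
    @members << member
    self
  end

  def members
    @members.to_a
  end

  private

  # Subclasses override this to choose a different container.
  def new_members
    Set.new
  end
end

# Same API, but members live in an Array: ordered, duplicates allowed.
class OrderedCollection < UnorderedCollection
  private

  def new_members
    []
  end
end

p UnorderedCollection.new.add("a").add("a").members
p OrderedCollection.new.add("b").add("a").add("b").members
```

Callers pick the class that matches their needs and otherwise use the same `add` and `members` calls.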

@jpstroop
Contributor

@jcoyne

> @jpstroop No, a work may have many Collections.

I completely agree with this, however, we often have administrative documents that we stash with collections, and I'd like to be able to know which collection those go with--like a canonical collection that a Wortem has_one of. Maybe this suggests that we need a Projgroup (@escowles could probably think of a better name 😄) model that can hold those. Is this what Stanford folks (and maybe others) call an APO?

This may be out of scope for this discussion and something we'd just need to refine locally, but I thought I'd mention it.

@azaroth42

We'd like to move away from the current conflation between APO (as a permissions holding thing) and Collection (as a structural thing).

@scherztc
Contributor

+1 on @awead 's assertions. I am still interested in this notion of a LinkedResource and whether it is descriptive metadata or an object inside of a Work?

@escowles
Contributor Author

Updated diagram per agreement on nomenclature in #8:

[Diagram: coll-work-comp-file]

@jpstroop
Contributor

@escowles GenericWork -> GenericComponent and GenericComponent -> GenericComponent are 0:m, no?

@mjgiarlo
Member

I interpreted 1:m to mean that GenericComponents have one and only one GenericWork, and GenericWorks have many (zero, one, or more) GenericComponents. @jpstroop @escowles

I was thinking of Rails has_many here: you include a has_many assertion in your model but that doesn't mean your object doesn't validate if you don't have one, right?

@escowles
Contributor Author

@jpstroop Yes, what @mjgiarlo said. If I had a CLA signed, I'd write up some prose about the entities and their relationships for the README...

@jpstroop
Contributor

Me too, I think:
A GenericWork has 0..n GenericComponents
A GenericComponent has 0..n GenericComponents
A GenericComponent belongs to 1 GenericWork OR GenericComponent

@mjgiarlo
Member

Yes, that's how I read @escowles's graphic. The @jpstroop text will come in handy when we write this up. Thx, y'all.

@mjgiarlo
Member

Also:
A GenericWork has 0..n GenericFiles
A GenericComponent has 0..n GenericFiles
A GenericFile belongs to 1 GenericWork OR GenericComponent.
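
The cardinalities listed so far could be sketched like this in plain Ruby. The class names follow the thread, but the `Parented` module, `attach_to`, and the constructor signatures are hypothetical; this is not ActiveFedora association code.

```ruby
# Hypothetical sketch of the 0..n / exactly-one-parent rules in plain Ruby.
module Parented
  attr_reader :parent

  # Record the single parent and register self in the parent's collection.
  def attach_to(parent, collection)
    unless parent.is_a?(GenericWork) || parent.is_a?(GenericComponent)
      raise ArgumentError, "parent must be a GenericWork or GenericComponent"
    end
    @parent = parent
    parent.public_send(collection) << self
  end
end

class GenericWork
  attr_reader :components, :files

  def initialize
    @components = [] # 0..n GenericComponents
    @files = []      # 0..n GenericFiles
  end
end

class GenericComponent
  include Parented
  attr_reader :components, :files

  def initialize(parent)
    @components = [] # 0..n nested GenericComponents
    @files = []      # 0..n GenericFiles
    attach_to(parent, :components)
  end
end

class GenericFile
  include Parented

  def initialize(parent)
    attach_to(parent, :files)
  end
end

work = GenericWork.new
axis = GenericComponent.new(GenericComponent.new(work))
GenericFile.new(axis)
puts work.components.first.components.first.files.size
```

An empty `GenericWork` is valid here, which matches the has_many reading above: having the association does not require having members.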

@jpstroop
Contributor

Yup.

@azaroth42

For the use case of a collection having a Thumbnail, which may not be a derivative of any particular page but instead a filmstrip of multiple pages ... would that require GenericCollection to have at least one GenericFile?

@jpstroop
Contributor

My instinct is to leave that up to extending GenericCollection...things start to look too similar otherwise, and assuming the image is somewhere else in your GenericCollection graph, you might store a pointer instead anyway, right?

@mjgiarlo
Member

👍 to @jpstroop

@mjgiarlo
Member

Could the descriptive metadata of the collection include an assertion that covers this? E.g., maybe I create a new GenericComponent with a GenericFile (xyz123) that is the filmstrip derivative, and then assert collection_id :hasRepresentativeImage xyz123. (That was very rough and simplistic but I think you catch my 💨 )

@azaroth42

Then Collections would need to contain Components as well as Works? The filmstrip isn't a Work /within/ the collection, it's a derivative created from the member Works. So long as it's not prevented, then fine, it can be a NotSoGenericCollection, but just throwing it out there.

@jpstroop jpstroop mentioned this issue Oct 14, 2014
@mjgiarlo
Member

I'm glad you threw it out there, @azaroth42.

So, for @escowles et al., the diagram does not connect GenericCollection with a GenericFile. Should we interpret that as Hydra::Works asserting that a GenericCollection cannot contain a GenericFile, or that Hydra::Works remains silent on everything about GenericCollections except that GenericWorks may have a many-to-many relationship with them? Or something else?

@jpstroop
Contributor

Without wanting to get too specific about impl, it seems like having a GenericFile is a concern that could be mixed in. GCollection wouldn't do it, but RobCollection < GCollection might, and then at least the code for the concern would be reusable/behave in an expected way.
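
Very roughly, and using a plain Ruby module instead of an ActiveSupport::Concern, that could look like the sketch below. `HasRepresentativeFile`, `representative_file`, and the filename are hypothetical names invented for this example.

```ruby
# Hypothetical sketch: a reusable mix-in for "this thing has a representative file".
module HasRepresentativeFile
  attr_accessor :representative_file

  def representative?
    !representative_file.nil?
  end
end

# The generic collection knows nothing about files.
class GCollection
  attr_reader :title

  def initialize(title)
    @title = title
  end
end

# A local subclass opts in by including the mix-in.
class RobCollection < GCollection
  include HasRepresentativeFile
end

rob = RobCollection.new("Manuscripts")
rob.representative_file = "filmstrip.jpg"
puts rob.representative?
puts GCollection.new("Plain").respond_to?(:representative_file)
```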

@mjgiarlo
Member

🚧 it! (Sorry, didn't find a shovel emoji. ;) )

@escowles
Contributor Author

I definitely agree with @jpstroop that you could extend GenericCollection to add whatever links to GenericFiles or whatever you wanted. But having a preview image seems like a common-enough use case that we should at least try to come up with a standard way of doing it.

In our discussions at UCSD, we had planned on making a subproperty of dc:relation called something like ucsd:thumbnail that would link to the thumbnail image URL (which could be repository URL, or could be an image on a generic webserver, depending on the collection).
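
A toy illustration of why a subproperty is convenient, in plain Ruby with no RDF library: a query for dc:relation also finds statements made with the more specific predicate. The `Statement` struct, the `SUBPROPERTY_OF` table, the `coll:1` identifier, and the example.org URLs are all hypothetical; `ucsd:thumbnail` is the tentative name from the paragraph above.

```ruby
# Hypothetical sketch: statements as plain structs, predicates as strings.
Statement = Struct.new(:subject, :predicate, :object)

# Each predicate maps to its parent property, if it has one.
SUBPROPERTY_OF = { "ucsd:thumbnail" => "dc:relation" }.freeze

STATEMENTS = [
  Statement.new("coll:1", "ucsd:thumbnail", "http://example.org/thumbs/coll1.jpg"),
  Statement.new("coll:1", "dc:relation", "http://example.org/finding-aid")
].freeze

# True if predicate is the wanted property or a (transitive) subproperty of it.
def matches?(predicate, wanted)
  while predicate
    return true if predicate == wanted
    predicate = SUBPROPERTY_OF[predicate]
  end
  false
end

def objects_for(statements, subject, wanted)
  statements
    .select { |s| s.subject == subject && matches?(s.predicate, wanted) }
    .map(&:object)
end

p objects_for(STATEMENTS, "coll:1", "ucsd:thumbnail")
p objects_for(STATEMENTS, "coll:1", "dc:relation")
```

Clients that only understand dc:relation still see the thumbnail link, while local code can ask for the specific predicate.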

Either way, this seems related to @scherztc's RelatedResource to me -- basically a typed link to a URL. We had a similar data structure in our old data model, but decided to simplify that to about 10 predicates, since we found that all of our related resources boiled down to those.

@scherztc
Contributor

+1 on these relationships:

A GenericWork has 0..n GenericFiles
A GenericComponent has 0..n GenericFiles
A GenericFile belongs to 1 GenericWork OR GenericComponent.

This would cover the RelatedResource option for a GenericWork.

Here is our code that we used in Curate to cover the preview image on a GenericCollection:

samvera-deprecated/curate@aa2af62

@escowles
Contributor Author

Getting back to @jpstroop's comment about special collection-type objects where a Work can be related to only one of them: We have this at UCSD too. In our old data model, we had a special class for this, and used it for the top-level browse, driving access control groups, etc.

Our plan was to get rid of them in our new data model and just use GenericCollection for representing them. Maybe we could create a GenericCollection subclass called AdminCollection where each Work belongs_to one AdminCollection? Does anybody else have this kind of relationship?
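
Something like the following plain Ruby sketch, assuming a Work can be in many ordinary collections but at most one AdminCollection. The `Work` class here, the `add` method, and the `admin_collection` accessor are hypothetical; this is not ActiveFedora's belongs_to.

```ruby
# Hypothetical sketch: many ordinary collections, at most one AdminCollection.
class Work
  attr_reader :title, :collections
  attr_accessor :admin_collection

  def initialize(title)
    @title = title
    @collections = []       # many-to-many with ordinary collections
    @admin_collection = nil # at most one AdminCollection
  end
end

class GenericCollection
  attr_reader :title, :works

  def initialize(title)
    @title = title
    @works = []
  end

  def add(work)
    @works << work unless @works.include?(work)
    work.collections << self unless work.collections.include?(self)
    self
  end
end

class AdminCollection < GenericCollection
  # Refuse works that already belong to a different AdminCollection.
  def add(work)
    current = work.admin_collection
    if current && !current.equal?(self)
      raise ArgumentError, "#{work.title} already belongs to #{current.title}"
    end
    work.admin_collection = self
    @works << work unless @works.include?(work)
    self
  end
end

dataset = Work.new("Research dataset")
GenericCollection.new("Oceanography").add(dataset)
AdminCollection.new("Library holdings").add(dataset)
puts dataset.admin_collection.title
```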

@mjgiarlo
Member

Your AdminCollection idea sounds pretty similar to the notion of Administrative Sets described here:

https://wiki.duraspace.org/display/hydra/Collections%2C+Admin+Sets%2C+Display+Sets

Also to the notion of an APO which is written about here:

https://wiki.duraspace.org/pages/viewpage.action?pageId=64325483

@flyingzumwalt
Member

Did the lessons & ideas from this thread make it into some other documentation or specs? Can we close this ticket?

@escowles
Contributor Author

escowles commented May 4, 2015

Yes, I think the info in this thread informed the discussions in Portland and the subsequent documentation, so this ticket can be closed.

@escowles escowles closed this as completed May 4, 2015
10 participants