052 concepts pipeline #83
On reading so far:
- The adapter/transformer part seems fine, and pretty much in line with what we've done previously. There are a few subtleties around short-circuit logic but nothing serious.
- I'm much more concerned about the changes to the ID minter. The existing ID minter is deliberately very simple because it's such a critical application, and if we get it wrong we'll have some very big headaches. This RFC is proposing to make it substantially more complicated. It's also changing the way the id/identifiers field works.
As currently written, I can't tell why these changes are necessary. I'm not saying they're wrong, but I want to see more justification and rigour. e.g. what if two concepts are initially distinct, then later we discover that they're sameAs? How do we update the ID minter database/handle redirects?
I'd also like to see some analysis of other approaches.
e.g. I'd assumed we'd do something much closer to the existing works pipeline:
- Get a transformed concept from the source
- Create a canonical ID for each transformed concept
- Build a matcher-like graph of equivalent concepts across different sources
- Combine them into a single merged concept; redirect the other canonical IDs to the merged concept
This would allow us to reuse the existing ID minter and matcher applications, which are well-tested and understood ideas.
I'm not saying that's definitely the right approach – you and Jamie may have discussed it and realised why it's not a good idea – but we should record that discussion/analysis in this RFC.
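The four steps listed above could be sketched roughly as follows. This is a minimal illustration in Python, not the existing ID minter or matcher: `TransformedConcept`, `mint_canonical_id`, `EquivalenceGraph` and `merge_concepts` are all hypothetical names, the hash-based ID is only a stand-in for whatever the real minter does, and the rule "lowest canonical ID wins" is an assumption made so the result is deterministic.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformedConcept:
    """Step 1: a concept as transformed from one source (hypothetical shape)."""
    source: str      # e.g. "lcsh", "mesh", "wikidata"
    source_id: str   # identifier within that source
    label: str


def mint_canonical_id(concept: TransformedConcept) -> str:
    """Step 2: stand-in for the ID minter. Assumption: one canonical ID per
    (source, source_id) pair; hashing is used here only to keep the sketch
    repeatable."""
    key = f"{concept.source}/{concept.source_id}".encode()
    return hashlib.sha256(key).hexdigest()[:8]


class EquivalenceGraph:
    """Step 3: matcher-like graph of equivalent canonical IDs (union-find)."""

    def __init__(self):
        self.parent = {}

    def add(self, cid):
        self.parent.setdefault(cid, cid)

    def find(self, cid):
        # Walk up to the representative, shortening the path as we go.
        while self.parent[cid] != cid:
            self.parent[cid] = self.parent[self.parent[cid]]
            cid = self.parent[cid]
        return cid

    def link(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Assumption: the lowest ID becomes the merged concept's ID.
            keep, redirect = sorted([root_a, root_b])
            self.parent[redirect] = keep


def merge_concepts(concepts, same_as_pairs):
    """Step 4: group equivalent concepts and compute redirects.

    Returns (merged, redirects): merged maps a surviving canonical ID to the
    concepts combined under it; redirects maps every other canonical ID to
    the surviving one."""
    ids = {concept: mint_canonical_id(concept) for concept in concepts}
    graph = EquivalenceGraph()
    for cid in ids.values():
        graph.add(cid)
    for a, b in same_as_pairs:
        graph.link(ids[a], ids[b])

    merged, redirects = {}, {}
    for concept, cid in ids.items():
        root = graph.find(cid)
        merged.setdefault(root, []).append(concept)
        if cid != root:
            redirects[cid] = root
    return merged, redirects
```

In this sketch, discovering later that two concepts are sameAs only means calling `merge_concepts` again with one more pair: the canonical IDs already minted stay the same and the change shows up as a new entry in `redirects`. Whether that property holds for the real applications is exactly the kind of analysis the RFC should record.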
How many concepts are there in LCSH/MeSH/Wikidata? How big are the batches we're talking about?
There are currently 504,913 entries in LCSH, 456,068 active terms in MeSH and 11,364,141 entries in LC Names. Wikidata currently stands at 98,631,652, and there is Wikipedia on top of that. We don't use anywhere near all of any of those, and for Wikidata and Wikipedia in particular we probably want to find a way to subset to the things we are interested in. I think @harrisonpim has already thought about that a bit.
There is a whole other aspect to this that we need to work out how to incorporate, which relates to the identified/unidentified concepts conversation that we're having over at #82. For unidentified subjects, people and (all) genres, we won't be able to source the data from LCSH/LC Names/MeSH etc. It only exists [to us] in the catalogue data. So we also need a way to push things into the concepts pipeline from the catalogue pipeline, so that unidentified subjects/people and genres are also included. That then implies we also need to work out how to identify those things and stitch those identified concepts back onto works. One for our in-person discussion I think...
Relevant example on wellcomecollection/platform#5563 of concepts that will need to be merged, even in existing data (i.e. before we even look at Wikidata).
Do some of the other files (e.g. step function diagrams) need to be removed?
Draft pull request for visibility.
This starts to describe the Concepts Pipeline. At this stage it only covers fetching LCSH.
https://github.com/wellcomecollection/docs/blob/052-concepts-pipeline/rfcs/052-concepts-pipeline/README.md