052 concepts pipeline #83

Merged
merged 24 commits into from
Jul 7, 2022

paul-butcher commented Jun 17, 2022

Draft pull request for visibility.

This starts to describe the Concepts Pipeline. At this stage, it only covers fetching LCSH.

https://github.com/wellcomecollection/docs/blob/052-concepts-pipeline/rfcs/052-concepts-pipeline/README.md

alexwlchan left a comment

On reading so far:

  • The adapter/transformer part seems fine, and pretty much in line with what we've done previously. There are a few subtleties around short-circuit logic but nothing serious.

  • I'm much more concerned about the changes to the ID minter. The existing ID minter is deliberately very simple because it's such a critical application, and if we get it wrong we'll have some very big headaches. This RFC is proposing to make it substantially more complicated. It's also changing the way the id/identifiers field works.

    As currently written, I can't tell why these changes are necessary. I'm not saying they're wrong, but I want to see more justification and rigour. e.g. what if two concepts are initially distinct, then later we discover that they're sameAs? How do we update the ID minter database/handle redirects?

    I'd also like to see some analysis of other approaches.

    e.g. I'd assumed we'd do something much closer to the existing works pipeline

    1. Get a transformed concept from the source
    2. Create a canonical ID for each transformed concept
    3. Build a matcher-like graph of equivalent concepts across different sources
    4. Combine them into a single merged concept; redirect the other canonical IDs to the merged concept

    This would allow us to reuse the existing ID minter and matcher applications, which are well-tested and understood ideas.

    I'm not saying that's definitely the right approach – you and Jamie may have discussed it and realised why it's not a good idea – but we should record that discussion/analysis in this RFC.
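The four steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the existing ID minter or matcher: `SourceConcept`, `IdMinter`, `merge_concepts`, the `c0001`-style IDs and the placeholder source identifiers are all hypothetical names invented for this sketch. It also shows one possible answer to the sameAs question: minted IDs never change, and discovering a new equivalence only changes the redirect table.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class SourceConcept:
    """A transformed concept from one source (step 1). Hypothetical shape."""
    source: str      # e.g. "lcsh" or "mesh"
    source_id: str   # identifier within that source (placeholders below)
    label: str


class IdMinter:
    """Toy minter: one stable canonical ID per (source, source_id) pair (step 2)."""

    def __init__(self):
        self._ids = {}
        self._counter = count(1)

    def mint(self, concept):
        key = (concept.source, concept.source_id)
        if key not in self._ids:
            self._ids[key] = f"c{next(self._counter):04d}"
        return self._ids[key]


def merge_concepts(canonical_ids, same_as_pairs):
    """Group IDs linked by sameAs pairs (step 3) and return a redirect
    table {canonical_id: merged_id} (step 4).

    The merged ID for a group is its smallest member, which is an
    arbitrary but deterministic choice made for this sketch.
    """
    parent = {cid: cid for cid in canonical_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            smaller, larger = sorted((root_a, root_b))
            parent[larger] = smaller  # the smaller ID stays the group root

    return {cid: find(cid) for cid in canonical_ids}


minter = IdMinter()
concepts = [
    SourceConcept("lcsh", "placeholder-1", "Example concept"),
    SourceConcept("mesh", "placeholder-2", "Example concept"),
    SourceConcept("lcsh", "placeholder-3", "Another concept"),
]
ids = [minter.mint(c) for c in concepts]

# Initially only the first two concepts are known to be equivalent.
redirects = merge_concepts(ids, [(ids[0], ids[1])])

# Later we discover the third is sameAs the second: the minted IDs are
# untouched, and only the redirect table is recomputed.
redirects_later = merge_concepts(ids, [(ids[0], ids[1]), (ids[1], ids[2])])
```

In this sketch the redirect table is rebuilt from the full set of sameAs pairs each time, so it says nothing about how redirects would be persisted or how a merged concept would later be split again.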

@alexwlchan

How many concepts are there in LCSH/MeSH/Wikidata? How big batches are we talking?

jtweed commented Jun 21, 2022

> How many concepts are there in LCSH/MeSH/Wikidata? How big batches are we talking?

There are currently 504,913 entries in LCSH, 456,068 active terms in MeSH and 11,364,141 entries in LC Names.

Wikidata currently has 98,631,652 entries, and of course there is Wikipedia on top of that.

We don't use anywhere near all of any of those. For Wikidata and Wikipedia in particular, we probably want to find a way to subset to the things we are interested in, and I think @harrisonpim has already thought about that a bit.
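To put the batch-size question in rough numbers, here is a small back-of-the-envelope sketch using the counts quoted above. The batch size of 10,000 is purely an assumption for illustration, as are the names `ENTRY_COUNTS`, `BATCH_SIZE` and `batches_needed`; nothing in this thread fixes a batch size.

```python
# Entry counts as quoted in this comment (June 2022).
ENTRY_COUNTS = {
    "LCSH": 504_913,
    "MeSH": 456_068,
    "LC Names": 11_364_141,
    "Wikidata": 98_631_652,
}

BATCH_SIZE = 10_000  # hypothetical; not a figure from the RFC


def batches_needed(total, batch_size):
    """Number of batches needed to cover `total` entries (ceiling division)."""
    return (total + batch_size - 1) // batch_size


for source, total in ENTRY_COUNTS.items():
    print(f"{source}: {total:,} entries, {batches_needed(total, BATCH_SIZE):,} batches")

# Total across the Library of Congress and MeSH sources, excluding Wikidata.
authority_total = sum(n for name, n in ENTRY_COUNTS.items() if name != "Wikidata")
print(f"LCSH + MeSH + LC Names: {authority_total:,} entries")
```

These are upper bounds on a full load of each source; a subset of Wikidata would be far smaller.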

jtweed commented Jun 23, 2022

There is a whole other aspect to this that we need to work out how to incorporate, which relates to the identified/unidentified concepts conversation that we're having over at #82.

For unidentified subjects, people and (all) genres, we won't be able to source the data from LCSH/LC Names/MeSH etc. It only exists [to us] in the catalogue data.

So we also need a way to push things into the concepts pipeline from the catalogue pipeline, so that unidentified subjects/people and genres are also included.

That then implies we also need to work out how to identify those things and stitch those identified concepts back onto works.

One for our in-person discussion, I think...

jtweed commented Jun 24, 2022

Relevant example on wellcomecollection/platform#5563 of concepts that will need merging, even in existing data (i.e. before we even look at Wikidata).

@harrisonpim

[Image: Untitled-2021-02-08-1406]

diagram from this morning's whiteboarding session

@alexwlchan

Here are the diagrams from Tuesday afternoon.

The concepts pipeline:

[Image: Untitled-2022-06-29-1040]

Identifying concepts and redirects:

[Image: Untitled-2022-06-29-1040-2]

A possible ID minter architecture:

[Image: Untitled-2022-06-29-1040-3]

@alexwlchan

And a couple of diagrams from Wednesday morning:

[Image: Screenshot 2022-06-29 at 15 44 36]

[Image: Screenshot 2022-06-29 at 15 52 33]

[Image: Screenshot 2022-06-29 at 15 53 44]

@paul-butcher paul-butcher marked this pull request as ready for review July 7, 2022 09:08
@jamieparkinson

Do some of the other files (e.g. step function diagrams) need to be removed?

@paul-butcher paul-butcher merged commit 43e8766 into main Jul 7, 2022