052 concepts pipeline #83

paul-butcher · 2022-06-17T08:58:39Z

Draft pull request for visibility.

This starts to describe the Concepts Pipeline. At this stage it only covers fetching LCSH.

https://github.com/wellcomecollection/docs/blob/052-concepts-pipeline/rfcs/052-concepts-pipeline/README.md

alexwlchan

On reading so far:

The adapter/transformer part seems fine, and pretty much in line with what we've done previously. There are a few subtleties around short-circuit logic but nothing serious.
I'm much more concerned about the changes to the ID minter. The existing ID minter is deliberately very simple because it's such a critical application, and if we get it wrong we'll have some very big headaches. This RFC is proposing to make it substantially more complicated. It's also changing the way the id/identifiers field works.

As currently written, I can't tell why these changes are necessary. I'm not saying they're wrong, but I want to see more justification and rigour. e.g. what if two concepts are initially distinct, then later we discover that they're sameAs? How do we update the ID minter database/handle redirects?

I'd also like to see some analysis of other approaches.

e.g. I'd assumed we'd do something much closer to the existing works pipeline:
1. Get a transformed concept from the source
2. Create a canonical ID for each transformed concept
3. Build a matcher-like graph of equivalent concepts across different sources
4. Combine them into a single merged concept; redirect the other canonical IDs to the merged concept
This would allow us to reuse the existing ID minter and matcher applications, which are well-tested and understood ideas.
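To make the four steps above concrete, here is a minimal Python sketch of that flow. It is an illustration under stated assumptions, not the existing ID minter or matcher: `mint_canonical_id`, `merge_concepts`, the hash-based IDs and the "smallest ID wins" rule are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def mint_canonical_id(source_id: str) -> str:
    # Hypothetical stand-in for the ID minter: derives a short, stable ID
    # from the source identifier. The real application works differently.
    return hashlib.sha256(source_id.encode()).hexdigest()[:8]


def find(parent: dict, x: str) -> str:
    # Union-find lookup with path compression.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_concepts(source_ids, same_as):
    # Step 2: every transformed concept gets its own canonical ID.
    canonical = {s: mint_canonical_id(s) for s in source_ids}

    # Step 3: build a matcher-like graph of equivalent concepts.
    parent = {s: s for s in source_ids}
    for a, b in same_as:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = defaultdict(list)
    for s in source_ids:
        groups[find(parent, s)].append(s)

    # Step 4: one merged concept per group; the other canonical IDs redirect.
    merged, redirects = {}, {}
    for members in groups.values():
        target = min(canonical[m] for m in members)  # assumed tie-break rule
        merged[target] = sorted(members)
        for m in members:
            if canonical[m] != target:
                redirects[canonical[m]] = target
    return merged, redirects
```

In this sketch, discovering later that two previously distinct concepts are sameAs just means re-running `merge_concepts` with one extra pair in `same_as`: both canonical IDs already exist, and one of them becomes a redirect to the other.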

I'm not saying that's definitely the right approach – you and Jamie may have discussed it and realised why it's not a good idea – but we should record that discussion/analysis in this RFC.

rfcs/052-concepts-pipeline/README.md

alexwlchan · 2022-06-21T09:47:32Z

How many concepts are there in LCSH/MeSH/Wikidata? How big batches are we talking?

jtweed · 2022-06-21T18:32:53Z

> How many concepts are there in LCSH/MeSH/Wikidata? How big batches are we talking?

There are currently 504,913 entries in LCSH, 456,068 active terms in MeSH and 11,364,141 entries in LC Names.

Wikidata currently has 98,631,652 entries, and of course there is Wikipedia on top of that.

Of course, we don't use anywhere near all of any of those. For Wikidata and Wikipedia in particular, we probably want to find a way to subset to the things we're interested in; I think @harrisonpim has already thought about that a bit.
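For a rough sense of scale, the counts above can be turned into batch counts. This is only back-of-envelope arithmetic: the batch size of 10,000 is an arbitrary assumption for illustration, not something the RFC specifies, and `batches_needed` is a hypothetical helper.

```python
# Entry counts quoted in this comment (June 2022).
SOURCE_COUNTS = {
    "LCSH": 504_913,
    "MeSH": 456_068,
    "LC Names": 11_364_141,
    "Wikidata": 98_631_652,
}


def batches_needed(count: int, batch_size: int) -> int:
    # Ceiling division using integers only, so large counts stay exact.
    return -(-count // batch_size)


BATCH_SIZE = 10_000  # assumed value, for illustration only

for source, count in SOURCE_COUNTS.items():
    print(f"{source}: {count:,} entries, {batches_needed(count, BATCH_SIZE):,} batches")
```

Any subsetting of Wikidata would shrink its number considerably, so the full-dump figure is an upper bound rather than an expected load.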

jtweed · 2022-06-23T16:56:31Z

There is a whole other aspect to this that we need to work out how to incorporate, which relates to the identified/unidentified concepts conversation that we're having over at #82.

For unidentified subjects, people and (all) genres, we won't be able to source the data from LCSH/LC Names/MeSH etc. It only exists [to us] in the catalogue data.

So we also need a way to push things into the concepts pipeline from the catalogue pipeline, so that unidentified subjects/people and genres are also included.

That then implies we also need to work out how to identify those things and stitch those identified concepts back onto works.

One for our in-person discussion, I think...

jtweed · 2022-06-24T11:35:30Z

Relevant example on wellcomecollection/platform#5563 of concepts that will need to be merged, even in existing data (i.e. before we even look at Wikidata).

harrisonpim · 2022-06-28T11:23:40Z

diagram from this morning's whiteboarding session

alexwlchan · 2022-06-29T09:53:50Z

Here are the diagrams from Tuesday afternoon.

The concepts pipeline:

Identifying concepts and redirects:

A possible ID minter architecture:

alexwlchan · 2022-06-29T14:53:57Z

And a couple of diagrams from Wednesday morning:

jamieparkinson · 2022-07-07T10:26:31Z

Do some of the other files (eg step function diagrams) need to be removed?

paul-butcher added 14 commits June 15, 2022 11:52

start adding concepts pipeline rfc

c356cc1

some thoughts about id minting

3089184

some thoughts about id minting

39bcbd2

some thoughts about id minting

c49279e

start solidifying ideas

d1593ac

more solidification

bd19db9

more solidification

33f4f1b

more solidification

4dde1d8

add note about frequency

e642ff5

more solidification, pretty much there now

2a25ccb

add cloudwatch/step functions note

72b6552

Add some Step Functions detail

3e77517

a bit of tidying and detail expansion

9f94a8c

Add note about lambdas

8c9d3c1

alexwlchan reviewed Jun 21, 2022

jtweed mentioned this pull request Jun 24, 2022

Fix issue with duplicate subject labels in aggregations wellcomecollection/platform#5563

Closed

paul-butcher added 5 commits July 6, 2022 11:00

broad thrust of new understanding of the pipeline(s)

f4614ad

clarify what should be stored

ed421c6

Add Catalogue pipeline stage

ac5c9f9

add diagram

914491e

add note about new stage

541e11a

paul-butcher marked this pull request as ready for review July 7, 2022 09:08

fix footnotes

6da106f

paul-butcher added 2 commits July 7, 2022 10:13

add catalogue stage to the implementation plan

1cb26cd

separate two lists

ab37841

paul-butcher and others added 2 commits July 7, 2022 11:49

remove dead diagrams

6930b53

Add excalidraw file for pipeline diagram

649303a

paul-butcher merged commit 43e8766 into main Jul 7, 2022
