RFC 053: Filtering by contributor, genre, and subject #82

alexwlchan · 2022-06-16T08:45:13Z

(Leaving 051 because I saw Jamie take it yesterday, 052 for the concepts pipeline)

Rendered version: https://github.com/wellcomecollection/docs/blob/052-filtering-by-subject/rfcs/052-filtering-by-subject/README.md

alexwlchan · 2022-06-16T08:54:15Z

rfcs/053-filtering-by-subject/README.md

+5.  Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.
+
+6.  The catalogues are the source of truth for subject identifiers.
+    We can find equivalent identifiers, but we can't pick them from scratch.
+
+    e.g. If a Sierra record has a subject tagged with an LCSH identifier, we can find the Wikidata subject with that identifier.
+    If a Sierra record has a subject with no identifier, we can't choose an identifier, even if we could find a Wikidata subject with a matching label.


These requirements might seem oddly specific; I added them to rule out some of the wilder ideas I had (e.g. "shadow identifiers" for unidentified concepts, but then it's impossible for anyone to work out how to filter!).

They did lead me towards the suggestion in "An idea".

jamieparkinson · 2022-06-16T08:59:21Z

rfcs/053-filtering-by-subject/README.md

+
+Questions:
+
+*   Are the requirements as stated correct?


A general response to the problem stated: it seems to me that the shortcomings of each approach (ID vs label) are kind of related to each other. That is to say - I think the existence of unidentifiable concepts in some sense implies an inability to disambiguate concepts by label.

In reality it's slightly more complex than that - as you say, with family names it's realistic to end up with multiple ambiguous and identified concepts. Still, I'm not sure there's really a problem per se...

> On per-concept pages, there's a sample of matching images/works.

Concept pages are for identified concepts, so this is fine with the ID search.

> On per-concept pages, there's a link to a filtered search for the given concept, directly below the sample results.

I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

> On work pages, the list of subjects/contributors/genres to link to filtered searches for each subject.

I wonder if an approach here would be to link to a label search for an unidentified concept, and an ID search for an identified concept. Perhaps there could be UI signifiers that we were doing a text search rather than searching for some kind of canonical entity.

> In the works API, there are filters and aggregations for subject/contributor/genre.

Think this is implied by the above 3 requirements.

> Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

As with 3 but agreed it isn't necessarily obvious.

> The catalogues are the source of truth for subject identifiers.

For sure.

Appendix

I think I might have talked myself into (via responses to 3 and 5) some kind of concept filter that accepts IDs and labels? Is that an unacceptable deviation? It would behave much like the existing works search when searching for an ID.

> I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.

Yeah, I can make that clearer. It's the "there are 5 images above, click here to see all the images" button.

I wanted to state 4 explicitly for the "API-as-product" requirement.

How do dual filters work in the front-end? Does the front-end combine aggregations for identified/unidentified subjects and present them as one list, and then different ticky boxes apply different filters?

The idea of a filter that does both is interesting, and makes me think of another question: do we need a way to find specifically unidentified subjects? e.g. if I'm on a work page with an unidentified subject "Mental health", is it okay to link to a search that includes the LCSH-identified subject "Mental health" as well as unidentified subjects?

I suppose it would be better named a "concepts query" rather than a "concepts filter" in this case. I'm going off it already tbh - I'm not sure how the aggregations would work, although I think that would be the expected behaviour in this case.

Re search for unidentified concept labels returning identified concepts, I think that's fine.
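A rough sketch of how such a combined "concepts query" might behave, assuming each work carries a list of subjects with a `label` and an optional `id`. The function name and the sample IDs are invented for illustration; this is not the existing works search implementation.

```python
def matches_concept_query(work: dict, query: str) -> bool:
    """True if any subject on the work has this ID or (case-insensitively) this label."""
    q = query.strip().lower()
    for subject in work.get("subjects", []):
        subject_id = (subject.get("id") or "").lower()
        if subject_id == q or subject["label"].strip().lower() == q:
            return True
    return False


# Hypothetical works; "lcsh-1" and "lcsh-2" are made-up identifiers.
works = [
    {"title": "A", "subjects": [{"id": "lcsh-1", "label": "Mental health"}]},
    {"title": "B", "subjects": [{"label": "Mental health"}]},
    {"title": "C", "subjects": [{"id": "lcsh-2", "label": "Botany"}]},
]

by_label = [w["title"] for w in works if matches_concept_query(w, "Mental health")]
by_id = [w["title"] for w in works if matches_concept_query(w, "lcsh-1")]
print(by_label, by_id)
```

With this behaviour a label query returns both identified and unidentified matches, while an ID query returns only works tagged with that identifier.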

jtweed · 2022-06-17T13:39:07Z

rfcs/053-filtering-by-subject/README.md

+5.  Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.
+
+6.  The catalogues are the source of truth for subject identifiers.
+    We can find equivalent identifiers, but we can't pick them from scratch.
+
+    e.g. If a Sierra record has a subject tagged with an LCSH identifier, we can find the Wikidata subject with that identifier.
+    If a Sierra record has a subject with no identifier, we can't choose an identifier, even if we could find a Wikidata subject with a matching label.


I agree that if something is unidentified, we shouldn't make up an id. It should remain unidentified and that may mean that we have to treat it differently, eg the difference between identified and unidentified series on works pages. But it is being true to the data. We are more likely to get into a pickle if we don't go with the grain of what we have.

jtweed · 2022-06-17T13:41:10Z

rfcs/053-filtering-by-subject/README.md

+*   We can't rely on filtering by label, because there may be concepts with similar/identical labels but which refer to different things.
+    e.g. two members of the same family.
+
+*   We can't rely on filtering by ID, because not all subjects/genres/contributors are identified.


We can filter by ID when we know it is an identified subject. I think I'm with Jamie and the direction he's heading below: concept pages are for identified things. If they aren't identified, they can't have a page.

jtweed · 2022-06-17T13:41:26Z

rfcs/053-filtering-by-subject/README.md

+*   We can't rely on filtering by ID, because not all subjects/genres/contributors are identified.
+
+From an API perspective, we could easily support both, but it's more complicated in the front-end.
+How do we choose which filter mechanism to use?


By whether the subject is identified or not.
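That rule is small enough to sketch. The example below assumes a subject is a dict with a `label` and an optional `id`; the base URL and the query parameter names `subjects` and `subjects.label` are placeholders, not names settled by this RFC.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names, for illustration only.
WORKS_SEARCH = "https://example.org/works"


def filter_url(subject: dict) -> str:
    """Link to an ID filter for identified subjects, a label filter otherwise."""
    if subject.get("id"):
        params = {"subjects": subject["id"]}
    else:
        params = {"subjects.label": subject["label"]}
    return f"{WORKS_SEARCH}?{urlencode(params)}"


identified = {"id": "abc123", "label": "Mental health"}
unidentified = {"label": "Mental health"}

print(filter_url(identified))
print(filter_url(unidentified))
```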

jtweed · 2022-06-17T13:45:25Z

rfcs/053-filtering-by-subject/README.md

+
+Questions:
+
+*   Are the requirements as stated correct?


I agree with the general gist of what Jamie is saying above. Concept pages are for identified concepts; if it's not identified we don't give it a page. We tend towards more identified concepts over time (and it is the vast majority already, genres aside). For genres, that means no concept pages but a continuation of what we have today.

The caveat to the above is the potential mixing of user flows. If all genres do one thing and all subjects do another, that's probably ok. If it's less deterministic than that we may have a problem.

I would like @GarethOrmerod to be involved in this conversation, as I think we probably do need to treat linking to concept pages and back to searches for labels differently when we link to them from works pages, to make sure it's clear to users where they're going to go in each case.

jtweed · 2022-06-17T13:46:17Z

rfcs/053-filtering-by-subject/README.md

+*   Are the requirements as stated correct?
+    Are there any missing or unnecessary requirements?
+
+*   How many identified/unidentified concepts are there in the catalogue?


@harrisonpim may also want to weigh in here.

I went and had a look in a recent snapshot.

| | unique values | total occurrences |
|---|---|---|
| unidentified subjects; label only | 120,301 | 556,331 |
| identified subjects; canonical ID | 104,481 | 1,540,795 |
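To put the snapshot counts above in proportion, here is a quick calculation of the identified share. It assumes the two rows are disjoint and together cover every subject in the snapshot, which the comment does not state explicitly.

```python
# Counts copied from the snapshot figures above.
counts = {
    "unidentified": {"unique": 120_301, "occurrences": 556_331},
    "identified": {"unique": 104_481, "occurrences": 1_540_795},
}


def identified_share(column: str) -> float:
    """Fraction of the given column that belongs to identified subjects."""
    total = sum(row[column] for row in counts.values())
    return counts["identified"][column] / total


print(f"unique values: {identified_share('unique'):.1%}")
print(f"occurrences:   {identified_share('occurrences'):.1%}")
```

On these figures, identified subjects are a little under half of the distinct values but nearly three quarters of the occurrences on works.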

alexwlchan · 2022-06-20T07:37:25Z

Reading your comments, I don't think I did a very good job of explaining this.

I agree this is one @GarethOrmerod should weigh in on – from an API perspective, supporting both ID and label for filters/aggregations isn't difficult. I think the tricky part is how those filters are presented in the front-end, because I'm not sure how we do it consistently.

This may be a non-issue if the vast majority of subjects are identified; I don't know if that's the case.

I've updated the RFC with examples of two UI flows where I see potential snarls; hopefully that makes it a bit clearer.

paul-butcher · 2022-06-20T13:31:54Z

rfcs/053-filtering-by-subject/README.md

+
+    Q: How do we distinguish between a filter for the label "mental health" and the identified concept?
+
+    Q: How do we distinguish between two identified concepts with the same label?


Regarding Concept pages, we've mentioned things like date of birth and a brief description being used as a disambiguation hint. Perhaps such a hint could be short enough to be parenthesised, or we could ensure that our Concept labels are unique by that mechanism: e.g. Nightingale (Bird) vs Nightingale (Statistician).

Wikipedia URLs are by name, rather than opaque id, and this is how they have solved the problem.

If we are linking via ID, then Concept labels can be mutable.

I think concepts' preferred labels have to be mutable; the initial motivation for all of this work was a desire to update outdated language, and that's never going to stop being a problem

Yes. What I mean by that is that we have an advantage over Wikipedia in that no one elsewhere will be linking to a page like wellcomecollection.org/Florence_Nightingale and getting cross because we change it to wellcomecollection.org/Nightingale,_Florence.

Their solution is good enough for them even with their disadvantages (The Georgia Problem); it would (IMO) work even better for us.
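The parenthesised hint suggested earlier in this thread could be applied only where labels actually collide. A minimal sketch, assuming each concept has an `id`, a `label`, and an optional short `hint`; the field names and sample IDs are hypothetical.

```python
from collections import Counter


def display_labels(concepts: list[dict]) -> dict[str, str]:
    """Map concept ID to a display label, adding a hint only when labels clash."""
    label_counts = Counter(c["label"] for c in concepts)
    result = {}
    for c in concepts:
        if label_counts[c["label"]] > 1 and c.get("hint"):
            result[c["id"]] = f'{c["label"]} ({c["hint"]})'
        else:
            result[c["id"]] = c["label"]
    return result


concepts = [
    {"id": "c1", "label": "Nightingale", "hint": "Bird"},
    {"id": "c2", "label": "Nightingale", "hint": "Statistician"},
    {"id": "c3", "label": "Botany", "hint": "Science"},
]

labels = display_labels(concepts)
print(labels)
```

Because links would go via ID, the display label can change later without breaking anything that points at the concept.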

paul-butcher · 2022-06-20T13:34:16Z

rfcs/053-filtering-by-subject/README.md

+    They click the dropdown to see a list of available subject filters, and pick one.
+
+    Q: Is the list of available subjects based on ID or label?
+    Is it a mixture of both?


We might want to do some analysis on the degree to which this is a problem. If there are only a handful of unidentified subjects, then we could endeavour to get them identified and work only on IDs.

If, OTOH, there are a substantial number of distinct unidentified subjects, then we may have to mix them.

Another bit of analysis that would be useful:

How many concept labels do we actually have that fit the problematic criteria:

*   apply to more than one Concept
*   are not already disambiguated by IDs in the catalogue
*   are common enough in the catalogue that it could not be resolved by a limited and reasonable amount of cataloguing effort.
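Only part of this can be automated: whether a label really covers more than one Concept needs human judgement, but how often an unidentified label occurs can be counted. A sketch of that count, assuming subjects are dicts with a `label` and an optional `id`; the threshold and sample data are made up.

```python
from collections import Counter


def frequent_unidentified_labels(subjects: list[dict], min_occurrences: int):
    """Unidentified labels that occur at least min_occurrences times, most common first."""
    counts = Counter(s["label"] for s in subjects if not s.get("id"))
    return [(label, n) for label, n in counts.most_common() if n >= min_occurrences]


# Hypothetical subject occurrences taken from works.
sample = [
    {"label": "Nightingale"},
    {"label": "Nightingale"},
    {"label": "Nightingale", "id": "id-1"},
    {"label": "Fallaize"},
]

frequent = frequent_unidentified_labels(sample, min_occurrences=2)
print(frequent)
```

Labels near the top of such a list are the ones a limited cataloguing effort could not easily clear.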

"John Nash" might be a good example to investigate. There are at least three John Nashes mentioned in the first page of results here: https://wellcomecollection.org/works?query=John+nash

paul-butcher · 2022-06-20T13:58:35Z

rfcs/053-filtering-by-subject/README.md

+    Q: Is the list of available subjects based on ID or label?
+    Is it a mixture of both?
+
+    Q: How do we distinguish between a filter for the label "mental health" and the identified concept?


I suspect (as long as everything is identified correctly) that what someone would be asking for when filtering by the label is more like a contains query than an equals query. i.e. they don't remember exactly what the label is, but it was something like X.

If we are in a situation where we have to mix label and id in a UI, because there are too many unidentifiable labels, then that can only be seamlessly concealed from the user as long as there are no clashes (e.g. there are no label: "Mental Health" records that do not refer to the identified concept "Mental Health")
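Whether such clashes exist can be checked directly. A small sketch, assuming the same subject shape as elsewhere in this thread (a `label` plus an optional `id`) and treating labels as equal when they differ only by case; that matching rule is an assumption.

```python
def label_clashes(subjects: list[dict]) -> set[str]:
    """Labels that appear both with an identifier and without one."""
    identified = {s["label"].casefold() for s in subjects if s.get("id")}
    unidentified = {s["label"].casefold() for s in subjects if not s.get("id")}
    return identified & unidentified


# Hypothetical data; "x1" is a made-up identifier.
sample = [
    {"label": "Mental health", "id": "x1"},
    {"label": "Mental Health"},
    {"label": "Botany"},
]

clashes = label_clashes(sample)
print(clashes)
```

An empty result would mean label and ID filters could share one list without ambiguity.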

harrisonpim

In response to general discussion above, I'm going to go against the grain and say that I think we should be minting IDs for unidentified concepts, just like identified ones.

I think the proposed inconsistency in user experience between identified and unidentified concepts sounds Bad, and I can't really see a proper justification for it. We're introducing a lot of tricky technical work for ourselves through that distinction, and I can't see who it's helping. I feel like the same amount of work could be better spent on the careful minting/merging/redirecting etc of concept IDs, while providing a more consistent experience for users.

Even if the majority of subjects are identified, we still have a lot of unidentified things (subjects, people, genres) and I don't think users should have to worry about whether we've managed to enrich the thing that they've clicked on - they'll expect the experience to be the same. Again, coming up with ways of hinting to users that some concepts are identified and some aren't through design just sounds like extra work for us and our users, and I can't see who it helps.

It's also been said elsewhere that we shouldn't introduce new data to what's in the catalogue, and I agree - I just don't think we're introducing anything new by IDing a string which already exists.
If anything, I think consistency provides a more honest reflection of what's in the catalogue.
In that sense, a bare concepts page also feels like a better nudge for folks in collections info to identify those weird, dangling concepts, compared to a page of search results. It seems easier to me to imagine enriching/disambiguating an existing, sparse concept page than the leap from filtered search results to full concepts page.

paul-butcher · 2022-06-21T08:42:14Z

> In that sense, a bare concepts page also feels like a better nudge for folks in collections info to identify those weird, dangling concepts, compared to a page of search results.

Also, finding something attached to the "wrong" Concept would be a better nudge to get it attached to the "right" one.

> I feel like the same amount of work could be better spent on the careful minting/merging/redirecting etc of concept IDs, while providing a more consistent experience for users.

My main concern with the "mint everything" approach is that we might end up fragmenting Concepts due to subtle differences in labels (perhaps reflecting a change in practice over time). If that is just down to a typo in a few records (e.g. Falliaze vs Fallaize), then that's a clear error that's simple enough to sort out. OTOH, if half the records are "Florence Nightingale" and the other half are "Nightingale, Florence", and it's left unresolved for too long, users might start collecting bookmarks to the "wrong" one, and we'll need to do some redirecting when it is resolved. As you say, focusing effort on merging/redirecting from the start might be a better approach, because it's probably inevitable.
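Both kinds of fragmentation could be surfaced as merge candidates for a person to review. The sketch below is a heuristic only, not something the pipeline does: it flips "Surname, Forename" labels and compares the results with `difflib.SequenceMatcher`; the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(label: str) -> str:
    """Turn 'Surname, Forename' into 'forename surname', lowercased."""
    if "," in label:
        surname, _, rest = label.partition(",")
        label = f"{rest.strip()} {surname.strip()}"
    return " ".join(label.lower().split())


def merge_candidates(labels: list[str], threshold: float = 0.8):
    """Pairs of labels whose normalised forms are at least `threshold` similar."""
    pairs = []
    for a, b in combinations(labels, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


labels = ["Florence Nightingale", "Nightingale, Florence", "Falliaze", "Fallaize", "Botany"]
candidates = merge_candidates(labels)
print(candidates)
```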

alexwlchan added 2 commits June 16, 2022 09:44

RFC 052: Filtering by subjects (983bea6)

-> 053 (1ae07ec)

alexwlchan changed the title ~~RFC 052: Filtering by contributor, genre, and subject~~ RFC 053: Filtering by contributor, genre, and subject Jun 16, 2022

Add an idea (86f8437)

alexwlchan commented Jun 16, 2022

jamieparkinson reviewed Jun 16, 2022

alexwlchan self-assigned this Jun 17, 2022

jtweed self-requested a review June 17, 2022 13:13

jtweed reviewed Jun 17, 2022

clarify rfc based on comments (9294c0c)

paul-butcher reviewed Jun 20, 2022

harrisonpim reviewed Jun 20, 2022

jtweed mentioned this pull request Jun 23, 2022: 052 concepts pipeline #83 (Merged)
