RFC 053: Filtering by contributor, genre, and subject #82
@@ -0,0 +1,85 @@
# RFC 053: Filtering by contributor, genre, and subject

For per-subject and per-person pages, we want to filter for images and works that match a given subject/person.
For example:

* Works by Florence Nightingale
* Works about Charles Darwin
* Images about mental health

This turns out to be non-trivial, so this RFC describes how we'll get there.

Strictly speaking we only need subjects and contributors for this work, but genres are so similar we should treat them in the same way.

## Requirements

1. On per-concept pages, there's a sample of matching images/works.

2. On per-concept pages, there's a link to a filtered search for the identified concept, directly below the sample results.

3. On work pages, each entry in the list of subjects/contributors/genres links to:

    - a concept page if the subject/contributor is identified (new behaviour)
    - a filtered search by label if the subject/contributor/genre is unidentified (existing behaviour)

4. In the works API, there are filters and aggregations for subject/contributor/genre that support (3).

5. Given a single work in the works API, there should be an obvious way to construct a filter URL for works with the same subjects/contributors/genres as this work.

6. The catalogues are the source of truth for subject identifiers.
   We can find equivalent identifiers, but we can't pick them from scratch.

   e.g. If a Sierra record has a subject tagged with an LCSH identifier, we can find the Wikidata subject with that identifier.
   If a Sierra record has a subject with no identifier, we can't choose an identifier, even if we could find a Wikidata subject with a matching label.
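
The linking rule in requirement 3 could be sketched roughly as follows. This is an illustration only: the `Concept` type, the `/concepts/<id>` and `/works` paths, and the `<kind>.label` query parameter are assumptions made for the example, not names defined by this RFC.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class Concept:
    """A subject/contributor/genre as it appears on a work (hypothetical shape)."""
    label: str
    id: Optional[str] = None  # None means the catalogue supplied no identifier


def link_for(concept: Concept, kind: str) -> str:
    """Pick the link target for one entry on a work page.

    `kind` is e.g. "subjects", "contributors" or "genres".
    All paths and parameter names here are placeholders.
    """
    if concept.id is not None:
        # Identified: link to the concept page (new behaviour).
        return f"/concepts/{concept.id}"
    # Unidentified: fall back to a search filtered by label (existing behaviour).
    return "/works?" + urlencode({f"{kind}.label": concept.label})
```

Note that the function never invents an identifier for an unidentified concept, which keeps it in line with requirement 6.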

## Current behaviour

We have filters and aggregations for *label*, not ID.

## Considerations for future behaviour

If we add filtering/aggregations for subjects by ID, we already know how they'll be named: `subjects` and `source.subjects` for works and images, respectively.
This is consistent with our existing API design.
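
As a rough illustration of the naming above, and of requirement 5, a client could build ID-filter URLs like this. The base URL, the single-ID query form, and the shape of the `work` dictionary are assumptions for the sketch; only the parameter names `subjects` and `source.subjects` come from the text.

```python
from urllib.parse import urlencode

# Placeholder base URL, not the real API endpoint.
API_BASE = "https://api.example.org/catalogue/v2"

# Filter names proposed in this RFC for filtering subjects by ID.
ID_FILTER_PARAM = {"works": "subjects", "images": "source.subjects"}


def subject_filter_url(resource: str, subject_id: str) -> str:
    """Build a URL that filters `resource` ("works" or "images") by subject ID."""
    query = urlencode({ID_FILTER_PARAM[resource]: subject_id})
    return f"{API_BASE}/{resource}?{query}"


def same_subject_urls(work: dict) -> list:
    """For one work, return a filter URL per identified subject.

    Assumes each subject is a dict with "label" and an optional "id".
    Unidentified subjects are skipped because there is no ID to filter on.
    """
    return [
        subject_filter_url("works", subject["id"])
        for subject in work.get("subjects", [])
        if subject.get("id")
    ]
```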

But do we add filtering/aggregations for subjects by ID?
How do we handle this in the front-end?

Consider the following flows in the front-end:

1. A user lands on the concept page for "mental health".

   This includes a list of works with that identified concept.
   When they click to see the full list of works, they should see filtered search results.

   This filtered search must use ID filtering, because there may be concepts with similar/identical labels but which refer to different things.
   e.g. two members of the same family.

   Q: How do we distinguish this in the UI from a label search for "mental health"?

   Q: Should a user be able to discover this filter through the search UI?
   If they remove the filter, can they re-add it without going via the concepts page?

2. A user is on a search page.
   They want to filter by subject.
   They click the dropdown to see a list of available subject filters, and pick one.

   Q: Is the list of available subjects based on ID or label?
   Is it a mixture of both?

Review comment: We might want to do some analysis on the degree to which this is a problem. If there are only a handful of unidentified subjects, then we could endeavour to get them identified and work only on IDs. If, OTOH, there are a substantial number of distinct unidentified subjects, then we may have to mix them.

Review comment: Another bit of analysis that would be useful: How many concept labels do we actually have that fit the problematic criteria:
"John Nash" might be a good example to investigate. There are at least three John Nashes mentioned in the first page of results here: https://wellcomecollection.org/works?query=John+nash

   Q: How do we distinguish between a filter for the label "mental health" and the identified concept?

Review comment: I suspect (as long as everything is identified correctly) that what someone would be asking for when filtering by the label is more like a
If we are in a situation where we have to mix label and ID in a UI, because there are too many unidentifiable labels, then that can only be seamlessly concealed from the user as long as there are no clashes (e.g. there are no label: "Mental Health" records that do not refer to the identified concept "Mental Health").

   Q: How do we distinguish between two identified concepts with the same label?

Review comment: Regarding Concept pages, we've mentioned things like Date of Birth and a brief description being used as a disambiguation hint. Perhaps such a thing could be short enough to be parenthesised, or we could also ensure that our Concept labels are unique by that mechanism: e.g. Nightingale (Bird) vs Nightingale (Statistician). Wikipedia URLs are by name, rather than opaque ID, and this is how they have solved the problem. If we are linking via ID, then Concept labels can be mutable.

Review comment: I think concepts' preferred labels have to be mutable; the initial motivation for all of this work was a desire to update outdated language, and that's never going to stop being a problem.

Review comment: Yes. What I mean by that is that we have an advantage over Wikipedia in that no one elsewhere will be linking to a page like wellcomecollection.org/Florence_Nightingale and getting cross because we change it to wellcomecollection.org/Nightingale,_Florence. Their solution is good enough for them even with their disadvantages (The Georgia Problem); it would (IMO) work even better for us.
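
The analysis and the parenthesised-hint idea from the comments above could be prototyped along these lines. This is a sketch under assumptions: concepts are plain dictionaries with `id`, `label` and an optional `hint`, and the identifiers used in the usage note are invented for illustration.

```python
from collections import defaultdict


def ambiguous_labels(concepts):
    """Return {label: sorted list of IDs} for labels shared by more than one ID."""
    ids_by_label = defaultdict(set)
    for concept in concepts:
        ids_by_label[concept["label"]].add(concept["id"])
    return {
        label: sorted(ids)
        for label, ids in ids_by_label.items()
        if len(ids) > 1
    }


def display_label(concept, ambiguous):
    """Append a parenthesised hint only when the bare label is ambiguous."""
    if concept["label"] in ambiguous and concept.get("hint"):
        return f'{concept["label"]} ({concept["hint"]})'
    return concept["label"]
```

For example, given two concepts labelled "Nightingale" with hypothetical IDs `c1` (hint "Bird") and `c2` (hint "Statistician"), `ambiguous_labels` reports the clash and `display_label` renders them as "Nightingale (Bird)" and "Nightingale (Statistician)". Counting the entries in the returned dictionary gives a first estimate of how many labels are actually problematic.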

Questions:

* Are the requirements as stated correct?

Review comment: A general response to the problem stated: it seems to me that the shortcomings of each approach (ID vs label) are kind of related to each other. That is to say - I think the existence of unidentifiable concepts in some sense implies an inability to disambiguate concepts by label. In reality it's slightly more complex than that - as you say, with family names it's realistic to end up with multiple ambiguous and identified concepts. Still, I'm not sure there's really a problem per se...

1. Concept pages are for identified concepts, this is fine with the ID search
2. I'm not sure what this requirement means - is a "search for the given concept" a works/image search? Again, I feel that it's fine for this only to be for identified concepts.
3. I wonder if an approach here would be to link to a label search for an unidentified concept, and an ID search for an identified concept. Perhaps there could be UI signifiers that we were doing a text search rather than searching for some kind of canonical entity.
4. Think this is implied by the above 3 requirements.
5. As with 3 but agreed it isn't necessarily obvious.
6. For sure.

Appendix

I think I might have talked myself into (via responses to 3 and 5) some kind of concept filter that accepts IDs and labels? Is that an unacceptable deviation? It would behave much like the existing works search when searching for an ID.

Review comment: Yeah, I can make that clearer. It's the "there are 5 images above, click here to see all the images" button. I wanted to state 4 explicitly for the "API-as-product" requirement.

How do dual filters work in the front-end? Does the front-end combine aggregations for identified/unidentified subjects and present them as one list, and then different ticky boxes apply different filters?

The idea of a filter that does both is interesting, and makes me think of another question: do we need a way to find specifically unlabelled subjects? e.g. if I'm on a work page with an unidentified subject "Mental health", is it okay to link to a search that includes the LCSH-identified subject "Mental health" as well as unidentified subjects?

Review comment: I suppose it would be better named a "concepts query" rather than a "concepts filter" in this case. I'm going off it already tbh - I'm not sure how the aggregations would work, although I think that would be the expected behaviour in this case. Re search for unidentified concept labels returning identified concepts, I think that's fine.

Review comment: I agree with the general gist of what Jamie is saying above. Concept pages are for identified concepts; if it's not identified we don't give it a page. We tend towards more identified concepts over time (and it is the vast majority already, genres aside). For genres, that means no concept pages but a continuation of what we have today. The caveat to the above is the potential mixing of user flows. If all genres do one thing and all subjects do another, that's probably ok. If it's less deterministic than that we may have a problem.
I would like @GarethOrmerod to be involved in this conversation, as I think we probably do need to treat linking to concept pages and back to searches for labels differently when we link to them from works pages, to make sure it's clear to users where they're going to go in each case.

  Are there any missing or unnecessary requirements?

* How many identified/unidentified concepts are there in the catalogue?

Review comment: @harrisonpim may also want to weigh in here.

Review comment: I went and had a look in a recent snapshot.

* How do we want to approach this filtering?
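
One possible shape for the "dual filters" idea raised in the discussion is for the front-end to merge ID-based and label-based aggregation buckets into a single list, where each option remembers which filter it applies. The sketch below is only an assumption about how that might look: the bucket shapes are invented, `subjects` is the ID filter name proposed in this RFC, and `subjects.label` stands in for the existing label filter.

```python
def merge_subject_options(id_buckets, label_buckets):
    """Combine aggregation buckets into one list of filter options.

    id_buckets:    [{"id": ..., "label": ..., "count": ...}] for identified subjects
    label_buckets: [{"label": ..., "count": ...}] for unidentified subjects only
    Each option records the query parameter and value it would apply.
    """
    options = [
        {"text": b["label"], "param": "subjects", "value": b["id"], "count": b["count"]}
        for b in id_buckets
    ]
    options += [
        {"text": b["label"], "param": "subjects.label", "value": b["label"], "count": b["count"]}
        for b in label_buckets
    ]
    # Most frequent first; ties broken alphabetically so the order is stable.
    return sorted(options, key=lambda o: (-o["count"], o["text"]))
```

This only hides the ID/label split from the user if no label bucket clashes with an identified concept's label, which is the caveat raised in the review comments.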

## See also

* [RFC 008](../008-api-filtering): API Filtering
* [RFC 037](../037-api-faceting-principles): API faceting principles & expectations

Review comment: These requirements might seem oddly specific; I added them as a way to negate some of the more wild ideas I had. (e.g. have "shadow identifiers" for unidentified concepts, but then it's impossible for anyone to work out how to filter!)
They did lead me towards the suggestion in "An idea".

Review comment: I agree that if something is unidentified, we shouldn't make up an ID. It should remain unidentified and that may mean that we have to treat it differently, e.g. the difference between identified and unidentified series on works pages. But it is being true to the data. We are more likely to get into a pickle if we don't go with the grain of what we have.